Troubleshooting server crashes

Fraser Campbell fraser-Txk5XLRqZ6CsTnJN9+BGXg at public.gmane.org
Sat Oct 4 13:41:19 UTC 2003


On Friday 03 October 2003 15:18, Ilya Palagin wrote:

> > When a server crashes and absolutely nothing interesting is in the logs
> > what does a person do?  I generally suspect hardware problems but when a
> > server
>
> Turn on the monitor to find out if there is a kernel panic. What
> actually happens when it crashes?

Nothing, it crashed ;-)  Seriously, no life, could not wake up the video, fans 
whirring but no other obvious signs that the computer is even on.

> > has been rock solid historically I don't put a lot of faith in that and
> > in any case it's just a guess.
>
> How old is it? Maybe it's time to clean the contacts on the SIMMs, run
> memtest86, replace the power supply (electrolytic capacitors dry out in
> 2-3 years), make sure the fans are good, run badblocks?

What I realized after sending the first email is that, although this server
has historically been very stable (yes, it is 2-3 years old), about 2 weeks
ago it took on some extra services (many more websites, and it went from 2
databases to 44).

Although the server still doesn't break a sweat, there is significantly more
processing going on.  I'm leaning towards bad RAM: the server is almost
certainly using more RAM now, so bad bits that were previously unused might
be getting tickled.

> Crashes of stable Linux distributions don't happen on a regular basis, so
> there is no need for a troubleshooting tips website :-). Seriously - if one
> starts to experience problems without any recent software/config changes
> in the Linux system, the hardware must be checked. I've listed the weakest
> parts above.

You are correct.  I've rarely had Linux server crashes, and 99% of the time
swapping out hardware has fixed them, or increasing resources in the case of
OOM-type problems.

Still there are so many possible Linux error messages, some common and not 
resulting in a crash, some more serious ... sometimes it's hard to find 
definitive answers.  For example:

    hda: timeout waiting for DMA

Wouldn't it be nice to have a knowledgebase somewhere telling you that this
error is usually nothing to worry about unless accompanied by other errors,
that this error is a sign that you're using unsupported DMA modes, that this
error means that your motherboard needs to be replaced, and so on?
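To make the idea concrete, here is a minimal sketch of such a lookup,
assuming a hand-maintained list of message patterns.  The names
KNOWLEDGE_BASE and explain are hypothetical, and the note text is a
placeholder rather than a real diagnosis; only the message pattern comes
from the example above.

```python
import re

# Hypothetical knowledge base: (compiled pattern, note) pairs.
# The note is a placeholder, NOT a verified diagnosis; a real entry
# would have to be written by someone who knows the driver in question.
KNOWLEDGE_BASE = [
    (
        re.compile(r"timeout waiting for DMA"),
        "PLACEHOLDER: vetted explanation and severity go here",
    ),
]


def explain(log_line):
    """Return the notes of every knowledge-base pattern found in log_line."""
    return [note for pattern, note in KNOWLEDGE_BASE if pattern.search(log_line)]


if __name__ == "__main__":
    for note in explain("hda: timeout waiting for DMA"):
        print(note)
```

Matching on the message text rather than the whole line means the same
entry applies whichever device name (hda, hdb, ...) prefixes it.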

I just tried searching Red Hat's knowledgebase for "timeout waiting for DMA"
and the results are a joke.  There were 7 hits in total: the first was
"nsupdate not working" and the second was "sendmail hangs at boot", where
"dma" matched because of senDMAil.
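The sendmail hit is what a case-insensitive substring search would produce.
I don't know how that search engine is actually implemented, so the
following is only a sketch of the difference between substring matching and
whole-word matching; both function names are made up for illustration.

```python
import re


def substring_match(query, text):
    """Case-insensitive substring test: 'dma' is found inside 'sendmail'."""
    return query.lower() in text.lower()


def whole_word_match(query, text):
    """Case-insensitive test that only matches the query as whole words."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


if __name__ == "__main__":
    title = "sendmail hangs at boot"
    print(substring_match("dma", title))   # matches inside "sendmail"
    print(whole_word_match("dma", title))  # no standalone word "dma"
```

With word boundaries the query still finds "hda: timeout waiting for DMA",
but no longer drags in every article that mentions sendmail.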

-- 
Fraser Campbell <fraser-Txk5XLRqZ6CsTnJN9+BGXg at public.gmane.org>                 http://www.wehave.net/
Halton Hills, Ontario, Canada                             Debian GNU/Linux

--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




