Where's the culprit?

Mon Sep 5 16:09:36 UTC 2011

On Sun, Sep 04, 2011 at 10:19:49PM -0400, Peter King wrote:
> On Sun, Sep 04, 2011 at 09:57:07PM -0400, Lennart Sorensen wrote:
> 
> > > Sep  4 15:44:00 theseus kernel: Pid: 16344, comm: sed Not tainted 3.0.3-gentoo #6 System Manufacturer System Name/A7V8X-X
> > > Sep  4 15:44:00 theseus kernel: EIP: 0060:[<c10920fb>] EFLAGS: 00210286 CPU: 0
> > > Sep  4 15:44:00 theseus kernel: EIP is at __destroy_inode+0x29/0x62
> > > Sep  4 15:44:00 theseus kernel: EAX: ffffdfff EBX: db04bb88 ECX: 00000003 EDX: ffffdffe
> > > Sep  4 15:44:00 theseus kernel: ESI: db04bb88 EDI: 00000000 EBP: db04bb88 ESP: ef6e7f44
> > > Sep  4 15:44:00 theseus kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > > Sep  4 15:44:00 theseus kernel: Process sed (pid: 16344, ti=ef6e6000 task=f5465be0 task.ti=ef6e6000)

Oh there it is.  sed is the process.  How strange.

Now of course it is entirely possible the kernel was doing some filesystem
writes in the background flushing out the cache and sed just happened
to be the active process when things went wrong.  It might not be involved.

> > > Sep  4 15:44:00 theseus kernel: Stack:
> > > Sep  4 15:44:00 theseus kernel: db04bb88 c109214a db3dfb14 c108f389 db3dfb14 00000000 db04bb88 c10911a2
> > > Sep  4 15:44:00 theseus kernel: f4e4e6c0 00000010 db04bb88 c10831e8 00000020 f4e4e6c8 db3dfb14 f5416620
> > > Sep  4 15:44:00 theseus kernel: f4e4e6c0 00000000 f542d660 ef6e6000 c108091d f542d660 00000000 00000000
> > > Sep  4 15:44:00 theseus kernel: Call Trace:
> > > Sep  4 15:44:00 theseus kernel: [<c109214a>] ? destroy_inode+0x16/0x37
> > > Sep  4 15:44:00 theseus kernel: [<c108f389>] ? d_kill+0x93/0xa5
> > > Sep  4 15:44:00 theseus kernel: [<c10911a2>] ? dput+0xf7/0x100
> > > Sep  4 15:44:00 theseus kernel: [<c10831e8>] ? fput+0x198/0x1b0
> > > Sep  4 15:44:00 theseus kernel: [<c108091d>] ? filp_close+0x54/0x5a
> > > Sep  4 15:44:00 theseus kernel: [<c108097b>] ? sys_close+0x58/0x85
> > > Sep  4 15:44:00 theseus kernel: [<c13bb490>] ? sysenter_do_call+0x12/0x26
> > 
> > So this was the interesting bit.  Something called close on a file,
> > which then caused an inode to be removed, and then the kernel blew up.
> > 
> > Would have been nice if it had said what pid 16344 was.  I thought it
> > usually did.  Maybe the kernel is missing some config for that.
> 
> I take it, then, that it isn't sed (which is mentioned on the top line). It may well be that some
> debugging configuration is absent

Sure looks that way.

Usually 'BUG' outputs that I have seen include line numbers and such
which makes it much easier to track down.  Of course that is a kernel
config option so it appears to be disabled on this kernel.

CONFIG_DEBUG_BUGVERBOSE=y is very useful.

> > > Sep  4 17:42:35 theseus kernel: BUG: unable to handle kernel NULL pointer dereference at   (null)
> > > Sep  4 17:42:35 theseus kernel: IP: [<f564c0c0>] 0xf564c0bf
> > > Sep  4 17:42:35 theseus kernel: *pde = 00000000 
> > > Sep  4 17:42:35 theseus kernel: Oops: 0002 [#2] SMP 
> > > Sep  4 17:42:35 theseus kernel: Modules linked in: e1000
> > > Sep  4 17:42:35 theseus kernel: 
> > > Sep  4 17:42:35 theseus kernel: Pid: 17180, comm: emerge Tainted: G      D     3.0.3-gentoo #6 System Manufacturer System Name/A7V8X-X
> > > Sep  4 17:42:35 theseus kernel: EIP: 0060:[<f564c0c0>] EFLAGS: 00210256 CPU: 0
> > > Sep  4 17:42:35 theseus kernel: EIP is at 0xf564c0c0
> > > Sep  4 17:42:35 theseus kernel: EAX: 00000000 EBX: 0000013e ECX: f564c390 EDX: 00200286
> > 
> > This one strangely doesn't provide any useful info.
> 
> Again, I guess it isn't emerge, although I can attest that emerge was running at the
> time.
> 
> > > ...and then silence. The machine has crashed and required a reboot at this
> > > point. Looks to me like there is a problem with a "kernel paging request" in
> > > SMP #1, leading to the first oops at 15:44. Then, amazingly, it chugs along
> > > more or less well until 17:42, when there is a "kernel NULL pointer dereference"
> > > in SMP #2, which brings the whole thing down.
> > > 
> > > I can't get a sense from that whether it's the modules linked in (namely e1000),
> > > or syslog-ng, or something else. Anyone?
> > 
> > SMP simply means symetric multi procesing, which pretty much all systems
> > are these days.
> 
> I know about SMP, but I don't know why these are referenced to apparently different
> sources, since the processor is single-core. Perhaps because it's running make -j2.
> 
> Well, now I'm less certain than ever where the problem lies. There were a few notes
> on kernel oopses and e1000, but mostly back in 2.4 days (although one in 2.6.26 IIRC).
> Puzzling.

This one does not look like a network related one.  It very much looks
filesystem related, although it could be a virtual filesystem one (like
/proc or /sys).

-- 
Len Sorensen
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists