Where's the culprit?

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Mon Sep 5 01:57:07 UTC 2011


On Sun, Sep 04, 2011 at 09:20:11PM -0400, Peter King wrote:
> The logs aren't entirely useless, although there is no hint of a problem
> in dmesg. Here is the offending stretch in /var/log/messages. Everything
> before the first entry is entirely mundane, and then the trouble seems to
> start... (parts anonymized)
> 
> Sep  4 15:10:50 theseus cron[14260]: (root) MAIL (mailed 94 bytes of output but got status 0x004e
> Sep  4 15:10:50 theseus )
> Sep  4 15:14:07 theseus sshd[15230]: SSH: Server;Ltype: Version;Remote: XXXXXXXXXXXXX;Protocol: 2.0;Client: OpenSSH_5.8p1-hpn13v10
> Sep  4 15:14:07 theseus sshd[15230]: Accepted publickey for XXXXXX from XXXXXXXXXXX port XXX ssh2
> Sep  4 15:14:07 theseus sshd[15230]: pam_unix(sshd:session): session opened for user XXXXXXXX by (uid=0)
> Sep  4 15:14:14 theseus sudo:    pking : TTY=pts/0 ; PWD=/home/XXXXXXXX ; USER=root ; COMMAND=/usr/bin/emerge -avuNDt world
> Sep  4 15:14:14 theseus sudo: pam_unix(sudo:session): session opened for user root by XXXXXXX (uid=1000)
> Sep  4 15:20:01 theseus cron[31327]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 15:29:39 theseus -- MARK --
> Sep  4 15:30:01 theseus cron[20414]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 15:40:01 theseus cron[13297]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 15:44:00 theseus kernel: BUG: unable to handle kernel paging request at ffffdfff
> Sep  4 15:44:00 theseus kernel: IP: [<c10920fb>] __destroy_inode+0x29/0x62
> Sep  4 15:44:00 theseus kernel: *pde = 015e3067 *pte = 00000000 
> Sep  4 15:44:00 theseus kernel: Oops: 0002 [#1] SMP 
> Sep  4 15:44:00 theseus kernel: Modules linked in: e1000
> Sep  4 15:44:00 theseus kernel: 
> Sep  4 15:44:00 theseus kernel: Pid: 16344, comm: sed Not tainted 3.0.3-gentoo #6 System Manufacturer System Name/A7V8X-X
> Sep  4 15:44:00 theseus kernel: EIP: 0060:[<c10920fb>] EFLAGS: 00210286 CPU: 0
> Sep  4 15:44:00 theseus kernel: EIP is at __destroy_inode+0x29/0x62
> Sep  4 15:44:00 theseus kernel: EAX: ffffdfff EBX: db04bb88 ECX: 00000003 EDX: ffffdffe
> Sep  4 15:44:00 theseus kernel: ESI: db04bb88 EDI: 00000000 EBP: db04bb88 ESP: ef6e7f44
> Sep  4 15:44:00 theseus kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Sep  4 15:44:00 theseus kernel: Process sed (pid: 16344, ti=ef6e6000 task=f5465be0 task.ti=ef6e6000)
> Sep  4 15:44:00 theseus kernel: Stack:
> Sep  4 15:44:00 theseus kernel: db04bb88 c109214a db3dfb14 c108f389 db3dfb14 00000000 db04bb88 c10911a2
> Sep  4 15:44:00 theseus kernel: f4e4e6c0 00000010 db04bb88 c10831e8 00000020 f4e4e6c8 db3dfb14 f5416620
> Sep  4 15:44:00 theseus kernel: f4e4e6c0 00000000 f542d660 ef6e6000 c108091d f542d660 00000000 00000000
> Sep  4 15:44:00 theseus kernel: Call Trace:
> Sep  4 15:44:00 theseus kernel: [<c109214a>] ? destroy_inode+0x16/0x37
> Sep  4 15:44:00 theseus kernel: [<c108f389>] ? d_kill+0x93/0xa5
> Sep  4 15:44:00 theseus kernel: [<c10911a2>] ? dput+0xf7/0x100
> Sep  4 15:44:00 theseus kernel: [<c10831e8>] ? fput+0x198/0x1b0
> Sep  4 15:44:00 theseus kernel: [<c108091d>] ? filp_close+0x54/0x5a
> Sep  4 15:44:00 theseus kernel: [<c108097b>] ? sys_close+0x58/0x85
> Sep  4 15:44:00 theseus kernel: [<c13bb490>] ? sysenter_do_call+0x12/0x26

So this was the interesting bit.  Something called close on a file,
which then caused an inode to be removed, and then the kernel blew up.

Would have been nice if it had said what pid 16344 was.  I thought it
usually did.  Maybe the kernel is missing some config for that.

Strange how that Bug line above doesn't include a file and line number.
Bug messages are supposed to do so.  Perhaps this kernel compile is
missing some of the rather useful debug features.

> Sep  4 15:44:00 theseus kernel: Code: ff ff 53 89 c3 e8 a6 b0 00 00 85 c0 74 04 0f 0b eb fe 89 d8 e8 eb 37 01 00 8b 83 44 01 00 00 8d 50 ff 83 fa fd 77 13 85 c0 74 0f <3e> ff 08 0f 94 c2 84 d2 74 05 e8 9a d3 fe ff 8b 83 48 01 00 00 
> Sep  4 15:44:00 theseus kernel: EIP: [<c10920fb>] __destroy_inode+0x29/0x62 SS:ESP 0068:ef6e7f44
> Sep  4 15:44:00 theseus kernel: CR2: 00000000ffffdfff
> Sep  4 15:44:00 theseus kernel: ---[ end trace 7c661efb7a158895 ]---
> Sep  4 15:50:01 theseus cron[25131]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 15:59:01 theseus cron[25143]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
> Sep  4 16:00:01 theseus cron[25145]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 16:00:04 theseus sudo:    pking : TTY=pts/0 ; PWD=/home/XXXXXXXX ; USER=root ; COMMAND=/usr/sbin/dispatch-conf
> Sep  4 16:00:04 theseus sudo: pam_unix(sudo:session): session opened for user root by pking(uid=1000)
> Sep  4 16:00:18 theseus sudo:    pking : TTY=pts/0 ; PWD=/home/XXXXXXXX ; USER=root ; COMMAND=/usr/bin/revdep-rebuild
> Sep  4 16:00:18 theseus sudo: pam_unix(sudo:session): session opened for user root by pking(uid=1000)
> Sep  4 16:01:20 theseus sudo:    pking : TTY=pts/0 ; PWD=/home/XXXXXXXX ; USER=root ; COMMAND=/usr/bin/revdep-rebuild -i
> Sep  4 16:01:20 theseus sudo: pam_unix(sudo:session): session opened for user root by XXXXXXXX (uid=1000)
> Sep  4 16:10:01 theseus cron[32228]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 16:20:01 theseus cron[9252]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 16:29:39 theseus -- MARK --
> Sep  4 16:30:01 theseus cron[25723]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 16:40:01 theseus cron[23464]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 16:50:01 theseus cron[24496]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 16:59:01 theseus cron[25750]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
> Sep  4 17:00:01 theseus cron[25864]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 17:10:01 theseus cron[27020]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 17:20:01 theseus cron[28024]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 17:29:39 theseus -- MARK --
> Sep  4 17:30:01 theseus cron[29532]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 17:40:01 theseus cron[18520]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
> Sep  4 17:42:35 theseus kernel: BUG: unable to handle kernel NULL pointer dereference at   (null)
> Sep  4 17:42:35 theseus kernel: IP: [<f564c0c0>] 0xf564c0bf
> Sep  4 17:42:35 theseus kernel: *pde = 00000000 
> Sep  4 17:42:35 theseus kernel: Oops: 0002 [#2] SMP 
> Sep  4 17:42:35 theseus kernel: Modules linked in: e1000
> Sep  4 17:42:35 theseus kernel: 
> Sep  4 17:42:35 theseus kernel: Pid: 17180, comm: emerge Tainted: G      D     3.0.3-gentoo #6 System Manufacturer System Name/A7V8X-X
> Sep  4 17:42:35 theseus kernel: EIP: 0060:[<f564c0c0>] EFLAGS: 00210256 CPU: 0
> Sep  4 17:42:35 theseus kernel: EIP is at 0xf564c0c0
> Sep  4 17:42:35 theseus kernel: EAX: 00000000 EBX: 0000013e ECX: f564c390 EDX: 00200286

This one strangely doesn't provide any useful info.

> ...and then silence. The machine has crashed and required a reboot at this
> point. Looks to me like there is a problem with a "kernel paging request" in
> SMP #1, leading to the first oops at 15:44. Then, amazingly, it chugs along
> more or less well until 17:42, when there is a "kernel NULL pointer dereference"
> in SMP #2, which brings the whole thing down.
> 
> I can't get a sense from that whether it's the modules linked in (namely e1000),
> or syslog-ng, or something else. Anyone?

SMP simply means symetric multi procesing, which pretty much all systems
are these days.

-- 
LEn Sorensen
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists





More information about the Legacy mailing list