Scheduling lockup crash?
Peter King
peter.king-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Sun Jul 15 07:55:49 UTC 2012
I'm trying to diagnose a computer problem at a distance. Periodically,
under heavy cpu usage (which also raises its core temperature), it seems
to lock up and exhibit other distressing symptoms; and I find in my
logfiles the following message repeated many many times:
Jul 12 16:35:23 machine kernel: [339200.200303] INFO: rcu_sched self-detected stall on CPU { 2} (t=60001 jiffies)
Jul 12 16:35:23 machine kernel: [339200.200310] Pid: 27468, comm: sh Not tainted 3.4.4-gentoo #1
Jul 12 16:35:23 machine kernel: [339200.200311] Call Trace:
Jul 12 16:35:23 machine kernel: [339200.200313] <IRQ> [<ffffffff810b99a8>] __rcu_pending+0xab/0x39c
Jul 12 16:35:23 machine kernel: [339200.200323] [<ffffffff810b9f5c>] rcu_check_callbacks+0x69/0xa7
Jul 12 16:35:23 machine kernel: [339200.200327] [<ffffffff8106ab21>] update_process_times+0x3c/0x73
Jul 12 16:35:23 machine kernel: [339200.200331] [<ffffffff810944f9>] tick_sched_timer+0x6d/0x96
Jul 12 16:35:23 machine kernel: [339200.200335] [<ffffffff8107b608>] __run_hrtimer+0xb4/0x13d
Jul 12 16:35:23 machine kernel: [339200.200337] [<ffffffff8109448c>] ? tick_nohz_handler+0xd6/0xd6
Jul 12 16:35:23 machine kernel: [339200.200340] [<ffffffff8107bdb4>] hrtimer_interrupt+0xcf/0x192
Jul 12 16:35:23 machine kernel: [339200.200344] [<ffffffff8104b8b6>] smp_apic_timer_interrupt+0x72/0x85
Jul 12 16:35:23 machine kernel: [339200.200348] [<ffffffff815f07c7>] apic_timer_interrupt+0x67/0x70
Jul 12 16:35:23 machine kernel: [339200.200349] <EOI> [<ffffffff812b5f31>] ? rb_insert_color+0x61/0xe1
Jul 12 16:35:23 machine kernel: [339200.200354] [<ffffffff812b5f88>] ? rb_insert_color+0xb8/0xe1
Jul 12 16:35:23 machine kernel: [339200.200358] [<ffffffff810fba3f>] __vma_link_rb+0x2b/0x2d
Jul 12 16:35:23 machine kernel: [339200.200361] [<ffffffff8105c7af>] dup_mm+0x2e0/0x440
Jul 12 16:35:23 machine kernel: [339200.200363] [<ffffffff8105d2c7>] copy_process+0x987/0x1224
Jul 12 16:35:23 machine kernel: [339200.200365] [<ffffffff8105dc73>] do_fork+0xeb/0x25a
Jul 12 16:35:23 machine kernel: [339200.200368] [<ffffffff8106c6cd>] ? __set_task_blocked+0x61/0x68
Jul 12 16:35:23 machine kernel: [339200.200371] [<ffffffff810817f3>] ? need_resched+0x1e/0x28
Jul 12 16:35:23 machine kernel: [339200.200373] [<ffffffff81081806>] ? should_resched+0x9/0x29
Jul 12 16:35:23 machine kernel: [339200.200376] [<ffffffff8103a66e>] sys_clone+0x23/0x25
Jul 12 16:35:23 machine kernel: [339200.200378] [<ffffffff815f0033>] stub_clone+0x13/0x20
Jul 12 16:35:23 machine kernel: [339200.200380] [<ffffffff815efd62>] ? system_call_fastpath+0x16/0x1b
Jul 12 16:38:23 machine kernel: [339379.902578] INFO: rcu_sched self-detected stall on CPU { 2} (t=240004 jiffies)
Jul 12 16:38:23 machine kernel: [339379.902581] Pid: 27468, comm: sh Not tainted 3.4.4-gentoo #1
Jul 12 16:38:23 machine kernel: [339379.902585] Call Trace:
Jul 12 16:38:23 machine kernel: [339379.902586] <IRQ> [<ffffffff810b99a8>] __rcu_pending+0xab/0x39c
Jul 12 16:38:23 machine kernel: [339379.902590] [<ffffffff810b9f5c>] rcu_check_callbacks+0x69/0xa7
Jul 12 16:38:23 machine kernel: [339379.902592] [<ffffffff8106ab21>] update_process_times+0x3c/0x73
.
.
.
This goes on until, after forty or more iterations at three-minute intervals
(why?), I get what looks like a segfault. Then the machine has to be rebooted by
hand, at which point it is fine -- until it goes under heavy cpu load again.
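
On the three-minute spacing, the numbers in the log can at least be checked against each other. The two stall headers quoted above report t=60001 and t=240004 jiffies, a difference of 180003 jiffies, while the kernel timestamps are about 179.7 seconds apart. That points to roughly 1000 ticks per second, so the first report would come about 60 seconds into the stall and the next one about 180 seconds later. The sketch below is only that arithmetic; the tick rate is inferred from these two lines, not read from the kernel configuration, and the names parse_stalls and estimate_hz are made up for illustration.

```python
import re

# Two stall headers copied verbatim from the log excerpt above.
LOG = """\
Jul 12 16:35:23 machine kernel: [339200.200303] INFO: rcu_sched self-detected stall on CPU { 2} (t=60001 jiffies)
Jul 12 16:38:23 machine kernel: [339379.902578] INFO: rcu_sched self-detected stall on CPU { 2} (t=240004 jiffies)
"""

# Matches the kernel uptime stamp, the CPU number and the jiffies count.
STALL = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] INFO: rcu_sched self-detected stall "
    r"on CPU \{\s*(?P<cpu>\d+)\}\s*\(t=(?P<jiffies>\d+) jiffies\)"
)


def parse_stalls(text):
    """Return (uptime_seconds, cpu, jiffies) for each stall header line."""
    return [
        (float(m["uptime"]), int(m["cpu"]), int(m["jiffies"]))
        for m in STALL.finditer(text)
    ]


def estimate_hz(stalls):
    """Estimate ticks per second from the first and last stall reports.

    This is an inference from the log alone; the real value is the
    kernel's configured HZ, which the log does not state.
    """
    (t0, _, j0), (t1, _, j1) = stalls[0], stalls[-1]
    return (j1 - j0) / (t1 - t0)


stalls = parse_stalls(LOG)
hz = estimate_hz(stalls)
print(f"stall reports found: {len(stalls)}")
print(f"jiffies between reports: {stalls[-1][2] - stalls[0][2]}")
print(f"estimated ticks per second: {hz:.0f}")
```

Pointing LOG at the full logfile instead of the two quoted lines would show whether every later report keeps the same spacing.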
It looks like the relevant information is found in "rcu_sched self-detected
stall" on CPU 2 (it's an AMD Phenom quad-core). Can someone explain to me just
how that translates into operational English? Is it the kernel's way of saying,
time to buy a new computer? or, perhaps, congratulations on finding a compiler
bug? or something else?
I have run memtester at a distance, and in the second loop I did get some
failures -- but this might be attributable to high temperature. Or maybe
not. When I get physically close to the machine again I'll try swapping out
the memory. But this error doesn't look like it's caused by the memory...
--
Peter King peter.king-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Department of Philosophy
170 St. George Street #521
The University of Toronto (416)-978-4951 ofc
Toronto, ON M5R 2M8
CANADA
http://individual.utoronto.ca/pking/
=========================================================================
GPG keyID 0x7587EC42 (2B14 A355 46BC 2A16 D0BC 36F5 1FE6 D32A 7587 EC42)
gpg --keyserver pgp.mit.edu --recv-keys 7587EC42