Scheduling lockup crash?

Peter King peter.king-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Sun Jul 15 07:55:49 UTC 2012


I'm trying to diagnose a computer problem at a distance. Periodically,
under heavy cpu usage (which also raises its core temperature), it seems
to lock up and exhibit other distressing symptoms, and I find in my
logfiles the following message repeated many, many times:

Jul 12 16:35:23 machine kernel: [339200.200303] INFO: rcu_sched self-detected stall on CPU { 2}  (t=60001 jiffies)
Jul 12 16:35:23 machine kernel: [339200.200310] Pid: 27468, comm: sh Not tainted 3.4.4-gentoo #1
Jul 12 16:35:23 machine kernel: [339200.200311] Call Trace:
Jul 12 16:35:23 machine kernel: [339200.200313]  <IRQ>  [<ffffffff810b99a8>] __rcu_pending+0xab/0x39c
Jul 12 16:35:23 machine kernel: [339200.200323]  [<ffffffff810b9f5c>] rcu_check_callbacks+0x69/0xa7
Jul 12 16:35:23 machine kernel: [339200.200327]  [<ffffffff8106ab21>] update_process_times+0x3c/0x73
Jul 12 16:35:23 machine kernel: [339200.200331]  [<ffffffff810944f9>] tick_sched_timer+0x6d/0x96
Jul 12 16:35:23 machine kernel: [339200.200335]  [<ffffffff8107b608>] __run_hrtimer+0xb4/0x13d
Jul 12 16:35:23 machine kernel: [339200.200337]  [<ffffffff8109448c>] ? tick_nohz_handler+0xd6/0xd6
Jul 12 16:35:23 machine kernel: [339200.200340]  [<ffffffff8107bdb4>] hrtimer_interrupt+0xcf/0x192
Jul 12 16:35:23 machine kernel: [339200.200344]  [<ffffffff8104b8b6>] smp_apic_timer_interrupt+0x72/0x85
Jul 12 16:35:23 machine kernel: [339200.200348]  [<ffffffff815f07c7>] apic_timer_interrupt+0x67/0x70
Jul 12 16:35:23 machine kernel: [339200.200349]  <EOI>  [<ffffffff812b5f31>] ? rb_insert_color+0x61/0xe1
Jul 12 16:35:23 machine kernel: [339200.200354]  [<ffffffff812b5f88>] ? rb_insert_color+0xb8/0xe1
Jul 12 16:35:23 machine kernel: [339200.200358]  [<ffffffff810fba3f>] __vma_link_rb+0x2b/0x2d
Jul 12 16:35:23 machine kernel: [339200.200361]  [<ffffffff8105c7af>] dup_mm+0x2e0/0x440
Jul 12 16:35:23 machine kernel: [339200.200363]  [<ffffffff8105d2c7>] copy_process+0x987/0x1224
Jul 12 16:35:23 machine kernel: [339200.200365]  [<ffffffff8105dc73>] do_fork+0xeb/0x25a
Jul 12 16:35:23 machine kernel: [339200.200368]  [<ffffffff8106c6cd>] ? __set_task_blocked+0x61/0x68
Jul 12 16:35:23 machine kernel: [339200.200371]  [<ffffffff810817f3>] ? need_resched+0x1e/0x28
Jul 12 16:35:23 machine kernel: [339200.200373]  [<ffffffff81081806>] ? should_resched+0x9/0x29
Jul 12 16:35:23 machine kernel: [339200.200376]  [<ffffffff8103a66e>] sys_clone+0x23/0x25
Jul 12 16:35:23 machine kernel: [339200.200378]  [<ffffffff815f0033>] stub_clone+0x13/0x20
Jul 12 16:35:23 machine kernel: [339200.200380]  [<ffffffff815efd62>] ? system_call_fastpath+0x16/0x1b
Jul 12 16:38:23 machine kernel: [339379.902578] INFO: rcu_sched self-detected stall on CPU { 2}  (t=240004 jiffies)
Jul 12 16:38:23 machine kernel: [339379.902581] Pid: 27468, comm: sh Not tainted 3.4.4-gentoo #1
Jul 12 16:38:23 machine kernel: [339379.902585] Call Trace:
Jul 12 16:38:23 machine kernel: [339379.902586]  <IRQ>  [<ffffffff810b99a8>] __rcu_pending+0xab/0x39c
Jul 12 16:38:23 machine kernel: [339379.902590]  [<ffffffff810b9f5c>] rcu_check_callbacks+0x69/0xa7
Jul 12 16:38:23 machine kernel: [339379.902592]  [<ffffffff8106ab21>] update_process_times+0x3c/0x73
 .
 .
 .

This continues until, after at least forty or so iterations at three-minute
intervals (why?), I get what looks like a segfault. Then the machine has to
be rebooted by hand, at which point it is fine -- until it goes under heavy
cpu load again.
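
To check the spacing between reports, the stall lines can be pulled out of the
log and compared. The following is only a rough sketch: the regular expression
is written against the two header lines quoted above, and the names SAMPLE,
parse_stalls and stall_intervals are made up for illustration. For the two
reports shown, it gives a gap of about 179.7 seconds and 180003 jiffies, which
would be consistent with a tick rate of roughly 1000 per second (an assumption;
the log itself does not state the kernel's HZ setting).

```python
import re

# Pattern written against the stall header lines quoted in this message;
# other kernel versions may word the message differently.
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: rcu_sched self-detected stall "
    r"on CPU \{\s*(?P<cpu>\d+)\}\s+\(t=(?P<jiffies>\d+) jiffies\)"
)

# The two stall headers copied from the log excerpt above.
SAMPLE = [
    "Jul 12 16:35:23 machine kernel: [339200.200303] INFO: rcu_sched "
    "self-detected stall on CPU { 2}  (t=60001 jiffies)",
    "Jul 12 16:38:23 machine kernel: [339379.902578] INFO: rcu_sched "
    "self-detected stall on CPU { 2}  (t=240004 jiffies)",
]


def parse_stalls(lines):
    """Return (kernel_timestamp_seconds, cpu, jiffies) for each stall header."""
    stalls = []
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            stalls.append((float(m["ts"]), int(m["cpu"]), int(m["jiffies"])))
    return stalls


def stall_intervals(stalls):
    """Return (delta_seconds, delta_jiffies) between consecutive reports."""
    return [
        (later[0] - earlier[0], later[2] - earlier[2])
        for earlier, later in zip(stalls, stalls[1:])
    ]


if __name__ == "__main__":
    for seconds, jiffies in stall_intervals(parse_stalls(SAMPLE)):
        print(f"{seconds:.3f} s, {jiffies} jiffies between reports")
```

Run against the full logfile instead of SAMPLE (for example by passing an open
file object to parse_stalls), the same functions would also show how many
reports there were and whether every gap is the same length.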

It looks like the relevant information is found in "rcu_sched self-detected
stall" on CPU 2 (it's an AMD Phenom quad-core). Can someone explain to me just
how that translates into operational English? Is it the kernel's way of saying,
time to buy a new computer? Or, perhaps, congratulations on finding a compiler
bug? Or something else?

I have run memtester at a distance, and in the second loop I did get some 
failures -- but this might be attributable to high temperature. Or maybe
not. When I get physically close to the machine again I'll try swapping out
the memory. But this error doesn't look like it's caused by the memory...

-- 
Peter King			 	peter.king-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Department of Philosophy
170 St. George Street #521
The University of Toronto		    (416)-978-4951 ofc
Toronto, ON  M5R 2M8
       CANADA

http://individual.utoronto.ca/pking/

=========================================================================
GPG keyID 0x7587EC42 (2B14 A355 46BC 2A16 D0BC  36F5 1FE6 D32A 7587 EC42)
gpg --keyserver pgp.mit.edu --recv-keys 7587EC42