Scheduling lockup crash?

Mon Jul 16 12:51:40 UTC 2012

This appears to be a logic failure in core 2 of the CPU: the stall is
reported on CPU 2, and the frames after the <EOI> (End Of Interrupt) marker
show it was in "rb_insert_color" when the timer interrupt fired. Most
probably heat related, as you indicate the problem just started happening.
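
As a rough aid to reading the quoted log, here is a minimal Python sketch
that pulls the uptime stamp, CPU number and "t=" jiffies count out of the
two stall reports and estimates the tick rate from them. The names
parse_stalls and ticks_per_second are made up for this illustration, and
the sample text is just the two header lines copied from the quoted
message. The result suggests roughly 1000 ticks per second, so the first
report would come after about 60 seconds and the next about 180 seconds
later, which would match the three-minute spacing Peter asks about. That
reading is an inference from the numbers, not something confirmed against
the kernel configuration.

```python
import re

# Matches one "rcu_sched self-detected stall" header once mail wrapping
# has been undone. Group names are chosen for this sketch only.
STALL_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: rcu_sched\s+self-detected stall "
    r"on CPU \{\s*(?P<cpu>\d+)\}\s+\(t=(?P<jiffies>\d+) jiffies\)"
)

# The two stall headers as they appear in the quoted mail.
SAMPLE = """\
> Jul 12 16:35:23 machine kernel: [339200.200303] INFO: rcu_sched
> self-detected stall on CPU { 2}  (t=60001 jiffies)
> Jul 12 16:38:23 machine kernel: [339379.902578] INFO: rcu_sched
> self-detected stall on CPU { 2}  (t=240004 jiffies)
"""


def parse_stalls(log_text):
    """Return (uptime_seconds, cpu, jiffies) for each stall header found."""
    # Drop the "> " quote prefix and rejoin lines the mail client wrapped.
    unquoted = " ".join(
        line.lstrip("> ").strip() for line in log_text.splitlines()
    )
    return [
        (float(m["uptime"]), int(m["cpu"]), int(m["jiffies"]))
        for m in STALL_RE.finditer(unquoted)
    ]


def ticks_per_second(first, second):
    """Estimate the tick rate from two stall reports of the same stall."""
    (t0, _, j0), (t1, _, j1) = first, second
    return (j1 - j0) / (t1 - t0)


if __name__ == "__main__":
    stalls = parse_stalls(SAMPLE)
    for uptime, cpu, jiffies in stalls:
        print(f"uptime={uptime:.3f}s cpu={cpu} stalled_for={jiffies} jiffies")
    print(f"estimated ticks/s: {ticks_per_second(stalls[0], stalls[1]):.0f}")
```

Both reports name the same Pid (27468), so this looks like one long stall
being re-reported rather than a series of separate ones.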

In the short term, when you have physical access, you could try
under-clocking the CPU to see if the problem persists. You could also try
beefing up the CPU cooling, or just applying a fresh layer of thermal
compound between the existing CPU and heat sink.
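
Before and after any cooling change it may help to record temperatures
under load, which can be done remotely. Below is a minimal Python sketch
that reads whatever temperature sensors the kernel exposes through the
hwmon sysfs interface, which reports values in millidegrees Celsius. The
function name read_temps_celsius is made up for this example, and the
assumption that this machine has a loaded sensor driver exposing files at
these paths is exactly that, an assumption. Some kernels place the files
one level down under "device/", so both layouts are tried.

```python
import glob
import os


def read_temps_celsius(hwmon_root="/sys/class/hwmon"):
    """Return {sensor_file_path: degrees_celsius} for readable sensors.

    hwmon_root is a parameter so the function can be pointed at another
    directory; the default is the usual sysfs location.
    """
    patterns = [
        os.path.join(hwmon_root, "hwmon*", "temp*_input"),
        os.path.join(hwmon_root, "hwmon*", "device", "temp*_input"),
    ]
    temps = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as sensor_file:
                    millidegrees = int(sensor_file.read().strip())
            except (OSError, ValueError):
                # Sensor missing, unreadable, or returned junk: skip it.
                continue
            temps[path] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    readings = read_temps_celsius()
    if not readings:
        print("no hwmon temperature sensors found")
    for path, celsius in readings.items():
        print(f"{path}: {celsius:.1f} C")
```

Run from cron or a shell loop while the machine is under load, this gives a
temperature history to compare against the times of the stall messages.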

In my mind, heat problems in integrated components are either environmental
or entropic (wear from age). My thought is that this is how a microscopic
manufacturing flaw would expose itself before exhibiting more serious End Of
Life indications. Without knowing the unit's history I can't say much more,
other than that it might be time to replace the CPU.

Hope this helps
Russell

On Sun, Jul 15, 2012 at 3:55 AM, Peter King <peter.king-H217xnMUJC0sA/PxXw9srA at public.gmane.org> wrote:

> I'm trying to diagnose a computer problem at a distance. Periodically,
> under heavy cpu usage (which also raises its core temperature), it seems
> to lock up and exhibit other distressing symptoms; and I find in my
> logfiles the following message repeated many many times:
>
> Jul 12 16:35:23 machine kernel: [339200.200303] INFO: rcu_sched
> self-detected stall on CPU { 2}  (t=60001 jiffies)
> Jul 12 16:35:23 machine kernel: [339200.200310] Pid: 27468, comm: sh Not
> tainted 3.4.4-gentoo #1
> Jul 12 16:35:23 machine kernel: [339200.200311] Call Trace:
> Jul 12 16:35:23 machine kernel: [339200.200313]  <IRQ>
>  [<ffffffff810b99a8>] __rcu_pending+0xab/0x39c
> Jul 12 16:35:23 machine kernel: [339200.200323]  [<ffffffff810b9f5c>]
> rcu_check_callbacks+0x69/0xa7
> Jul 12 16:35:23 machine kernel: [339200.200327]  [<ffffffff8106ab21>]
> update_process_times+0x3c/0x73
> Jul 12 16:35:23 machine kernel: [339200.200331]  [<ffffffff810944f9>]
> tick_sched_timer+0x6d/0x96
> Jul 12 16:35:23 machine kernel: [339200.200335]  [<ffffffff8107b608>]
> __run_hrtimer+0xb4/0x13d
> Jul 12 16:35:23 machine kernel: [339200.200337]  [<ffffffff8109448c>] ?
> tick_nohz_handler+0xd6/0xd6
> Jul 12 16:35:23 machine kernel: [339200.200340]  [<ffffffff8107bdb4>]
> hrtimer_interrupt+0xcf/0x192
> Jul 12 16:35:23 machine kernel: [339200.200344]  [<ffffffff8104b8b6>]
> smp_apic_timer_interrupt+0x72/0x85
> Jul 12 16:35:23 machine kernel: [339200.200348]  [<ffffffff815f07c7>]
> apic_timer_interrupt+0x67/0x70
> Jul 12 16:35:23 machine kernel: [339200.200349]  <EOI>
>  [<ffffffff812b5f31>] ? rb_insert_color+0x61/0xe1
> Jul 12 16:35:23 machine kernel: [339200.200354]  [<ffffffff812b5f88>] ?
> rb_insert_color+0xb8/0xe1
> Jul 12 16:35:23 machine kernel: [339200.200358]  [<ffffffff810fba3f>]
> __vma_link_rb+0x2b/0x2d
> Jul 12 16:35:23 machine kernel: [339200.200361]  [<ffffffff8105c7af>]
> dup_mm+0x2e0/0x440
> Jul 12 16:35:23 machine kernel: [339200.200363]  [<ffffffff8105d2c7>]
> copy_process+0x987/0x1224
> Jul 12 16:35:23 machine kernel: [339200.200365]  [<ffffffff8105dc73>]
> do_fork+0xeb/0x25a
> Jul 12 16:35:23 machine kernel: [339200.200368]  [<ffffffff8106c6cd>] ?
> __set_task_blocked+0x61/0x68
> Jul 12 16:35:23 machine kernel: [339200.200371]  [<ffffffff810817f3>] ?
> need_resched+0x1e/0x28
> Jul 12 16:35:23 machine kernel: [339200.200373]  [<ffffffff81081806>] ?
> should_resched+0x9/0x29
> Jul 12 16:35:23 machine kernel: [339200.200376]  [<ffffffff8103a66e>]
> sys_clone+0x23/0x25
> Jul 12 16:35:23 machine kernel: [339200.200378]  [<ffffffff815f0033>]
> stub_clone+0x13/0x20
> Jul 12 16:35:23 machine kernel: [339200.200380]  [<ffffffff815efd62>] ?
> system_call_fastpath+0x16/0x1b
> Jul 12 16:38:23 machine kernel: [339379.902578] INFO: rcu_sched
> self-detected stall on CPU { 2}  (t=240004 jiffies)
> Jul 12 16:38:23 machine kernel: [339379.902581] Pid: 27468, comm: sh Not
> tainted 3.4.4-gentoo #1
> Jul 12 16:38:23 machine kernel: [339379.902585] Call Trace:
> Jul 12 16:38:23 machine kernel: [339379.902586]  <IRQ>
>  [<ffffffff810b99a8>] __rcu_pending+0xab/0x39c
> Jul 12 16:38:23 machine kernel: [339379.902590]  [<ffffffff810b9f5c>]
> rcu_check_callbacks+0x69/0xa7
> Jul 12 16:38:23 machine kernel: [339379.902592]  [<ffffffff8106ab21>]
> update_process_times+0x3c/0x73
>  .
>  .
>  .
>
> Until, after at least forty or so iterations at three-minute intervals
> (why?), I get what looks like a segfault. Then the machine has to be
> rebooted by hand, at which point it is fine -- until it goes under heavy
> cpu load again.
>
> It looks like the relevant information is found in "rcu_sched self-detected
> stall" on CPU 2 (it's an AMD Phenom quad-core). Can someone explain to me
> just how that translates into operational English? Is it the kernel's way
> of saying, time to buy a new computer? or, perhaps, congratulations on
> finding a compiler bug? or something else?
>
> I have run memtester at a distance, and in the second loop I did get some
> failures -- but this might be attributable to high temperature. Or maybe
> not. When I get physically close to the machine again I'll try swapping out
> the memory. But this error doesn't look like it's caused by the memory...
>
> --
> Peter King                              peter.king-H217xnMUJC0sA/PxXw9srA at public.gmane.org
> Department of Philosophy
> 170 St. George Street #521
> The University of Toronto                   (416)-978-4951 ofc
> Toronto, ON  M5R 2M8
>        CANADA
>
> http://individual.utoronto.ca/pking/
>
> =========================================================================
> GPG keyID 0x7587EC42 (2B14 A355 46BC 2A16 D0BC  36F5 1FE6 D32A 7587 EC42)
> gpg --keyserver pgp.mit.edu --recv-keys 7587EC42
>