[GTALUG] Crashes

Giles Orr gilesorr at gmail.com
Tue Jan 31 09:07:35 EST 2017


My primary machine is crashing with increasing frequency.  The
most common error I'm seeing in the log looks like this:

Jan 29 18:29:39 toshi7 kernel: nouveau 0000:01:00.0: DRM: suspending
kernel object tree...
Jan 29 18:30:00 toshi7 kernel: NMI watchdog: BUG: soft lockup - CPU#3
stuck for 23s! [kscreenlocker_g:19647]
Jan 29 18:30:00 toshi7 kernel: Modules linked in: fuse uas usb_storage
rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_nat
nf_conntrack ...

I realize that I'm probably not giving enough information, but pasting
large chunks of log files would be just as counterproductive in its
own way.  I've seen this one A LOT, and sometimes after it appears the
machine runs for hours (but not days) before crashing.  So ... is
kscreenlocker likely to be the problem here?  When I searched for "BUG
soft lockup CPU stuck for" on Google, the top result had exactly the
same number of seconds, and said that replacing the power supply fixed
the problem.  That's a step I'd probably be willing to take, but
this isn't a desktop, it's a laptop.  So I'd want to be very sure, as
the power supply is unique to this machine (if it's available at all)
and probably quite expensive.

The processor:

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (4594 bogomips)
current speed: 1274MHz, 4 cores, 8 threads

While it's not a current gen processor, this is still a good machine
and I'd rather fix it than toss it.

Got an immediate crash this morning, and to my surprise the error was
very different:

Jan 31 07:56:35 toshi7 kernel: ------------[ cut here ]------------
Jan 31 07:56:35 toshi7 kernel: kernel BUG at lib/radix-tree.c:769!
Jan 31 07:56:35 toshi7 kernel: invalid opcode: 0000 [#1] SMP
Jan 31 07:56:35 toshi7 kernel: Modules linked in: uas usb_storage
rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat
nf_conntrack_ipv6 ...

Finally, I'm also getting this periodically:

Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature above threshold,
cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: CPU6: Core temperature above threshold,
cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: CPU7: Package temperature above
threshold, cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: CPU4: Package temperature above
threshold, cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: CPU1: Package temperature above
threshold, cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: CPU5: Package temperature above
threshold, cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: CPU3: Package temperature above
threshold, cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature above
threshold, cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: CPU6: Package temperature above
threshold, cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: mce: [Hardware Error]: Machine check
events logged
Jan 28 08:49:52 toshi7 kernel: CPU2: Package temperature above
threshold, cpu clock throttled (total events = 1)
Jan 28 08:49:52 toshi7 kernel: mce: [Hardware Error]: Machine check
events logged
Jan 28 08:49:52 toshi7 kernel: CPU6: Core temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU4: Package temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU5: Package temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU1: Package temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU3: Package temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU7: Package temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU2: Package temperature/speed normal
Jan 28 08:49:52 toshi7 kernel: CPU6: Package temperature/speed normal

This suggests that it's overheating, throttling, and recovering pretty
much instantaneously.  My guess is that it's probably not a problem,
but I thought I should check.
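
To get a rough idea of how often each kind of message shows up, a small
tally script along these lines might help.  This is only a sketch: the
category names and the function name tally_events are my own invention,
the patterns are copied from the messages quoted above, and it assumes
one message per line (unwrapped, unlike the excerpts in this mail).

```python
import re
from collections import Counter

# Patterns copied from the kernel messages quoted in this post.
# The category names on the left are made up for this sketch.
PATTERNS = {
    "soft_lockup": re.compile(r"BUG: soft lockup"),
    "kernel_bug": re.compile(r"kernel BUG at"),
    "throttled": re.compile(r"temperature above threshold"),
    "recovered": re.compile(r"temperature/speed normal"),
    "mce": re.compile(r"mce: \[Hardware Error\]"),
}


def tally_events(lines):
    """Count how many log lines fall into each category.

    Assumes each log message sits on a single line.
    """
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing it an open log file (or sys.stdin with kernel messages piped
in) would give per-category counts, which should at least show whether
the throttling messages and the lockups turn up around the same times.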

How should I proceed from here?  The options I can see:
- the processor is going funny, replace it
- junk the laptop, it's toast
- debug further (how?)
- replace the power supply
- uninstall kscreenlocker and see what happens

-- 
Giles
http://www.gilesorr.com/
gilesorr at gmail.com

