[GTALUG] Crashes

Alvin Starr alvin at netvel.net
Tue Jan 31 10:03:31 EST 2017


On 01/31/2017 09:07 AM, Giles Orr via talk wrote:
> My primary machine is crashing with increasing frequency.  The
> commonest error I'm seeing in the log looks like this:
>
> Jan 29 18:29:39 toshi7 kernel: nouveau 0000:01:00.0: DRM: suspending
> kernel object tree...
> Jan 29 18:30:00 toshi7 kernel: NMI watchdog: BUG: soft lockup - CPU#3
> stuck for 23s! [kscreenlocker_g:19647]
> Jan 29 18:30:00 toshi7 kernel: Modules linked in: fuse uas usb_storage
> rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
> nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_nat
> nf_conntrack ...
>
> I realize that I'm probably not giving enough information, but pasting
> large chunks of log files would be just as counterproductive in its
> own way.  I've seen this one A LOT - and sometimes I get it and the
> machine goes hours (but not days) before crashing.  So ... is
> kscreenlocker likely to be the problem here?  When I searched for "BUG
> soft lockup CPU stuck for" on Google, the top result had exactly the
> same number of seconds, and said that replacing the power supply fixed
> the problem.  Which is a step I'd probably be willing to take, but
> this isn't a desktop, it's a laptop.  So I'd want to be very sure as
> the power supply is unique to this machine (if it's available at all)
> and probably quite expensive.
>
> The processor:
>
> Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (4594 bogomips)
> current speed: 1274MHz, 4 cores, 8 threads
>
> While it's not a current gen processor, this is still a good machine
> and I'd rather fix it than toss it.
>
> Got an immediate crash this morning, and to my surprise the error was
> very different:
>
> Jan 31 07:56:35 toshi7 kernel: ------------[ cut here ]------------
> Jan 31 07:56:35 toshi7 kernel: kernel BUG at lib/radix-tree.c:769!
> Jan 31 07:56:35 toshi7 kernel: invalid opcode: 0000 [#1] SMP
> Jan 31 07:56:35 toshi7 kernel: Modules linked in: uas usb_storage
> rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
> nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat
> nf_conntrack_ipv6 ...
>
> Finally, I'm also getting this periodically:
>
> Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature above threshold,
> cpu clock throttled (total events = 1)
> Jan 28 08:49:52 toshi7 kernel: CPU6: Core temperature above threshold,
> cpu clock throttled (total events = 1)
[snip]
> Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU2: Package temperature/speed normal
> Jan 28 08:49:52 toshi7 kernel: CPU6: Package temperature/speed normal
>
> This suggests that it's overheating, throttling, and recovering pretty
> much instantaneously: my thought is that it's probably not a problem,
> but I thought I should check.
>
> How should I proceed from here:
> - the processor is going funny, replace it
> - junk the laptop, it's toast
> - debug further (how?)
> - replace the power supply
> - uninstall kscreenlocker and see what happens
>

If the CPU is going over temp then it could start acting unpredictably.
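
To get a feel for how often the throttling actually happens, the
kernel log can be tallied per CPU. This is only a rough sketch: it
assumes each message sits on one line in the format quoted above
("kernel: CPUn: Core temperature above threshold"), and the function
name count_throttle_events is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   "... kernel: CPU2: Core temperature above threshold, cpu clock throttled ..."
# "Package temperature above threshold" lines are counted as well.
THROTTLE_RE = re.compile(
    r"kernel: CPU(\d+): (?:Core|Package) temperature above threshold"
)


def count_throttle_events(lines):
    """Return a Counter mapping CPU number -> number of throttle messages."""
    counts = Counter()
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feeding it the lines of a saved kernel log (for example the output of
"journalctl -k" written to a file) gives a count per CPU. A handful of
events under heavy load is less worrying than a steady stream at idle.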

If you have lm_sensors installed, it would be worthwhile checking the
CPU temperature during normal operation.
I would also check the fans: most fans out there are "inexpensive"
and will start to seize up over time, slowing down until things start
getting hot.
Another thing that has bitten me in the past was pushing a computer
with a side vent up against a wall, which kept the still-good fans
from doing much of anything.
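
If lm_sensors is not handy, the same readings are normally exposed by
the kernel under /sys/class/hwmon, with the tempN_input files holding
millidegrees Celsius. A minimal sketch, assuming that layout; the
function name read_hwmon_temps is hypothetical, and which chips and
labels show up depends on the drivers loaded on your machine.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return a list of (chip, label, degrees_celsius) tuples.

    Assumes the usual hwmon sysfs layout: hwmonN/name holds the chip
    name and hwmonN/tempM_input holds the reading in millidegrees C.
    """
    readings = []
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            try:
                millideg = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read or parsed
            # An optional tempM_label file names the sensor (e.g. a core).
            label_file = temp_file.with_name(
                temp_file.name.replace("_input", "_label")
            )
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = temp_file.name
            readings.append((chip, label, millideg / 1000.0))
    return readings
```

Running it once at idle and once under load, and comparing the
numbers, should show whether the cooling is keeping up.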

Another thing that will cause random problems is bad memory, so if
cooling is not the issue, try running a memory test (unless you have
ECC and no errors are being logged).
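
On machines that do have ECC and an EDAC driver loaded, the logged
error counts can be read from sysfs. A sketch under those assumptions:
it expects the usual /sys/devices/system/edac/mc/mcN/ce_count and
ue_count files, and the function name read_edac_counts is made up
here. Most laptops have no ECC, in which case it simply returns
nothing and a memory test is the way to go.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected_errors, uncorrected_errors)}.

    An empty dict means no EDAC memory controllers were found, which
    is the expected result on hardware without ECC memory.
    """
    counts = {}
    for mc_dir in sorted(Path(base).glob("mc[0-9]*")):
        try:
            ce = int((mc_dir / "ce_count").read_text().strip())
            ue = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # controller directory without readable counters
        counts[mc_dir.name] = (ce, ue)
    return counts
```

A corrected-error count that keeps climbing points at a failing
module even when nothing has crashed yet.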


 

-- 
Alvin Starr                   ||   voice: (905)513-7688
Netvel Inc.                   ||   Cell:  (416)806-0133
alvin at netvel.net              ||



