[GTALUG] Crashes

Giles Orr gilesorr at gmail.com
Tue Jan 31 10:39:49 EST 2017


On 31 January 2017 at 10:37, Dhaval Giani <dhaval.giani at gmail.com> wrote:
>
> On Tue, Jan 31, 2017 at 10:28 AM Giles Orr via talk <talk at gtalug.org> wrote:
>>
>> On 31 January 2017 at 10:03, Alvin Starr via talk <talk at gtalug.org> wrote:
>> > On 01/31/2017 09:07 AM, Giles Orr via talk wrote:
>> >> My primary machine is crashing with increasing frequency.  The
>> >> commonest error I'm seeing in the log looks like this:
>> >>
>> >> Jan 29 18:29:39 toshi7 kernel: nouveau 0000:01:00.0: DRM: suspending
>> >> kernel object tree...
>> >> Jan 29 18:30:00 toshi7 kernel: NMI watchdog: BUG: soft lockup - CPU#3
>> >> stuck for 23s! [kscreenlocker_g:19647]
>> >> Jan 29 18:30:00 toshi7 kernel: Modules linked in: fuse uas usb_storage
>> >> rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
>> >> nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_nat
>> >> nf_conntrack ...
>> >>
>> >> I realize that I'm probably not giving enough information, but pasting
>> >> large chunks of log files would be just as counterproductive in its
>> >> own way.  I've seen this one A LOT - and sometimes I get it and the
>> >> machine goes hours (but not days) before crashing.  So ... is
>> >> kscreenlocker likely to be the problem here?  When I searched for "BUG
>> >> soft lockup CPU stuck for" on Google, the top result had exactly the
>> >> same number of seconds, and said that replacing the power supply fixed
>> >> the problem.  Which is a step I'd probably be willing to take, but
>> >> this isn't a desktop, it's a laptop.  So I'd want to be very sure as
>> >> the power supply is unique to this machine (if it's available at all)
>> >> and probably quite expensive.
>> >>
>> >> The processor:
>> >>
>> >> Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (4594 bogomips)
>> >> current speed: 1274MHz, 4 cores, 8 threads
>> >>
>> >> While it's not a current gen processor, this is still a good machine
>> >> and I'd rather fix it than toss it.
>> >>
>> >> Got an immediate crash this morning, and to my surprise the error was
>> >> very different:
>> >>
>> >> Jan 31 07:56:35 toshi7 kernel: ------------[ cut here ]------------
>> >> Jan 31 07:56:35 toshi7 kernel: kernel BUG at lib/radix-tree.c:769!
>> >> Jan 31 07:56:35 toshi7 kernel: invalid opcode: 0000 [#1] SMP
>> >> Jan 31 07:56:35 toshi7 kernel: Modules linked in: uas usb_storage
>> >> rfcomm ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
>> >> nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat
>> >> nf_conntrack_ipv6 ...
>> >>
>> >> Finally, I'm also getting this periodically:
>> >>
>> >> Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature above threshold,
>> >> cpu clock throttled (total events = 1)
>> >> Jan 28 08:49:52 toshi7 kernel: CPU6: Core temperature above threshold,
>> >> cpu clock throttled (total events = 1)
>> > [snip]
>> >> Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature/speed normal
>> >> Jan 28 08:49:52 toshi7 kernel: CPU2: Package temperature/speed normal
>> >> Jan 28 08:49:52 toshi7 kernel: CPU6: Package temperature/speed normal
>> >>
>> >> This suggests that it's overheating, throttling, and recovering pretty
>> >> much instantaneously: my thought is that it's probably not a problem,
>> >> but I thought I should check.
>> >>
>> >> How should I proceed from here:
>> >> - the processor is going funny, replace it
>> >> - junk the laptop, it's toast
>> >> - debug further (how?)
>> >> - replace the power supply
>> >> - uninstall kscreenlocker and see what happens
>> >>
>> >
>> > If the CPU is going over temp then it could start acting unpredictably.
>> >
>> > If you have lm_sensors installed then it would be worthwhile checking
>> > the temp of the CPU during normal operation.
>> > I would also check the fans, because most fans out there are
>> > "inexpensive" and will start to seize up over time, slowing down until
>> > things start getting hot.
>> > Another thing that has bitten me in the past was pushing a computer with
>> > a side vent up against a wall, which kept the still-good fans from
>> > moving almost any air.
>> >
>> > Another thing that will cause random problems is memory, so if the
>> > cooling is not the issue then try running a memory test (unless you
>> > have ECC and there are no errors being logged).
>>
>> I should add that I ran memtest86(+?) for a couple hours a month ago,
>> and it came up error-free.  And I ran the smartctl long test on the
>> hard drive quite recently, again without error.  I should run the
>> memory test again (and possibly even the HD one), but it makes me
>> think that these aren't the problem.  I think the fans are functioning
>> okay, but that's worth looking at and I'll get lm_sensors installed
>> again.
>
> A good starting point would be knowing what you are running. Also, update
> to the latest packages for your distro, as it might already be fixed.

Fair point ... Fedora 24 or 25 (sorry, not at home - can't tell
you for sure which) KDE spin.  I do keep it up-to-date: all packages
should be current as of approximately the last three days.
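
To get a rough sense of how often each of the errors quoted above shows
up, a small script along these lines could tally them. This is only a
sketch: the labels and the function name tally are made up for
illustration, and the sample lines are the ones quoted in this thread,
rejoined onto single lines.

```python
import re
from collections import Counter

# Patterns copied from the kernel messages quoted in this thread.
# The labels on the left are illustrative names, not kernel terminology.
PATTERNS = {
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#\d+"),
    "kernel_bug": re.compile(r"kernel BUG at \S+!"),
    "thermal_throttle": re.compile(r"CPU\d+: Core temperature above threshold"),
}


def tally(lines):
    """Count how many log lines match each pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Sample lines taken from the log excerpts above.
sample = [
    "Jan 29 18:30:00 toshi7 kernel: NMI watchdog: BUG: soft lockup - CPU#3 "
    "stuck for 23s! [kscreenlocker_g:19647]",
    "Jan 31 07:56:35 toshi7 kernel: kernel BUG at lib/radix-tree.c:769!",
    "Jan 28 08:49:52 toshi7 kernel: CPU2: Core temperature above threshold, "
    "cpu clock throttled (total events = 1)",
    "Jan 28 08:49:52 toshi7 kernel: CPU0: Package temperature/speed normal",
]

for label, count in sorted(tally(sample).items()):
    print(f"{label}: {count}")
```

On the real machine the same function could be fed the lines of saved
kernel log output (for example from "journalctl -k") instead of the
sample list. The counts only show which error dominates; they say
nothing about the cause.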

-- 
Giles
http://www.gilesorr.com/
gilesorr at gmail.com

