war story: another disk problem

D. Hugh Redelmeier hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Sat Dec 21 20:09:07 UTC 2013


I previously reported disk problems with this machine.

| From: D. Hugh Redelmeier <hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org>
| Date: Mon, 26 Aug 2013 16:18:41 -0400 (EDT)
| Subject: [TLUG]: war story: another disk problem

| I soon figured out the real cause: disk problems.  These were made
| visible by the dmesg command.

| What happened?
| 
| - not software: pretty up-to-date Ubuntu 12.04 on the disk didn't even
|   get to booting once the system was busted.
| 
| - not heat: I left the machine off while I slept.  The problem was
|   still there when I booted the live CD in the morning.
| 
| - possibly the disk drive had an intermittent fault
| 
| - perhaps a marginal power supply.  Symptoms of power supply problems
|   can be subtle and intermittent.
| 
| - perhaps a crappy motherboard (Gigabyte GA-P43T-ES3G).  Another
|   datapoint on the motherboard:  a few months ago the on-board
|   ethernet interface became unreliable and then reliably broken.
| 
| - my best guess: some kind of SATA cable or connector failure.
|   Perhaps corrosion that was cleared up by plugging and unplugging.
| 
| -->> your theory goes here <<--

I had more disk problems recently.

One disk drive kept disappearing.  It is a 3TB Seagate disk that I
bought last Boxing Day (perhaps the warranty is about to expire -- I
don't remember its duration).

dmesg showed lots of errors.  Sample:
    [203401.179374] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x2400000 action 0x0
    [203401.179384] ata2.00: BMDMA2 stat 0x650001
    [203401.179391] ata2: SError: { Handshk UnrecFIS }
    [203401.179404] ata2.00: failed command: WRITE DMA EXT
    [203401.179416] ata2.00: cmd 35/00:e0:80:9e:05/00:02:24:01:00/e0 tag 0 dma 376832 out
    [203401.179419]          res 51/04:c1:c7:9a:49/00:01:24:01:00/e0 Emask 0x1 (device error)
    [203401.179424] ata2.00: status: { DRDY ERR }
    [203401.179428] ata2.00: error: { ABRT }
    [203401.216400] ata2.00: configured for UDMA/33
    [203401.216434] ata2: EH complete
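
The "cmd" and "res" lines pack the ATA taskfile registers into hex
bytes.  Here is a minimal sketch of decoding them.  It assumes the
field order that I believe the Linux libata error handler prints
(command-or-status/feature-or-error:nsect:lbal:lbam:lbah, then the five
high-order bytes, then device) and 512-byte logical sectors.  The names
decode_taskfile and bit_names are made up for illustration.

```python
import re

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

# Bit names as they appear in the "status:" and "error:" log lines.
STATUS_BITS = [(0x80, "BUSY"), (0x40, "DRDY"), (0x20, "DF"),
               (0x08, "DRQ"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]

# Twelve two-digit hex bytes separated by "/" or ":", after "cmd " or "res ".
TASKFILE = re.compile(r"\b(cmd|res) ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})\b")


def bit_names(value, table):
    """Return the names of the bits in table that are set in value."""
    return [name for mask, name in table if value & mask]


def decode_taskfile(line):
    """Decode one 'cmd' or 'res' log line into a dict; None if no match."""
    m = TASKFILE.search(line)
    if m is None:
        return None
    kind = m.group(1)
    fields = [int(x, 16) for x in re.split(r"[/:]", m.group(2))]
    (first, second, nsect, lbal, lbam, lbah,
     _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = fields
    result = {
        "kind": kind,
        # 48-bit LBA: the high-order ("hob") bytes are the upper three bytes.
        "lba": (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
               | (lbah << 16) | (lbam << 8) | lbal,
        "sectors": (hob_nsect << 8) | nsect,
    }
    if kind == "cmd":
        result["command"] = first       # 0x35 is WRITE DMA EXT
    else:
        result["status"] = bit_names(first, STATUS_BITS)
        result["error"] = bit_names(second, ERROR_BITS)
    return result


if __name__ == "__main__":
    sample = [
        "ata2.00: cmd 35/00:e0:80:9e:05/00:02:24:01:00/e0 tag 0 dma 376832 out",
        "         res 51/04:c1:c7:9a:49/00:01:24:01:00/e0 Emask 0x1 (device error)",
    ]
    for line in sample:
        info = decode_taskfile(line)
        print(info)
        if info["kind"] == "cmd":
            print("bytes requested:", info["sectors"] * SECTOR_BYTES)
```

For the cmd line above this gives 0x02e0 = 736 sectors, and 736 * 512 =
376832, which agrees with the "dma 376832 out" in the same log line.
The res line decodes to status DRDY ERR and error ABRT, matching what
the kernel printed on the following two lines.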


On rebooting (perhaps after a cooling-off period) the disk seemed fine.
SMART testing showed nothing remarkable.

The disk is SATA but the computer is too old to have SATA ports, so
this disk and another are run by an after-market PCI SATA host adapter
card.  I swapped the connections between these two disks.

The errors came back.  The same disk had problems.  So it isn't the
particular SATA port that is the problem.

I moved the disk drive within the case (Antec Sonata I), giving it
more space to improve cooling.  This had no effect.  So I assumed that
disk heating was not the problem.

I found that there was a Seagate firmware update for this drive.  Not
easy to find: Google finds lots of folks asking but no answers.
I found the update through:
  https://apps1.seagate.com/downloads/request.html
It was listed in a funny way: not under "Firmware" (which said that
there was no update for this drive) but under "Certificate".  It took
my firmware from CC24 to CC29.

The firmware update comes in two forms: a Windows program that creates
a bootable USB stick and a .iso image to create a bootable CD.  I
tried dding the .iso image onto a USB stick but my computer would not
boot from it (didn't take the time to figure out why).  I burnt the CD
and used it.  This bootable system is a little Linux distro.  The
process was crude but it worked.

(When I try to log into the Seagate forum to post a pointer to the
firmware update for all those asking, the forum demands I change my
password.  When I attempt to, it says my account is disabled and that
if I think that that is a mistake, contact Account Support.  The
account support page will not accept my message: it claims that the
"description field" is empty (it is not).  Every time you submit the
form, and it perceives an error, it clears the form!  Grrr.)

In any case, the firmware update didn't solve the disk problem (still
failing after some hours).

I looked for another disk controller to swap in but didn't have one in
my collection.

I found that the case has an air filter on the front, behind the
plastic.  It was loaded with dust bunnies (~10 years of service).  And
the case fan on the back was seized.  I removed the bunnies and the
fan.

The system still failed after some hours.

So I bought a new case fan.  Not knowing much about them, I just asked
Canada Computers to recommend one.
  <http://www.canadacomputers.com/product_info.php?cPath=8_130&item_id=048461>
  Corsair Air Series SP120 Quiet Edition 120 mm x 25 mm, 3 pin $19.99 + tax.
I suspect that a $10 one would have been good enough but I was tired
of experimental science at this point.

So far the machine is working.  But it hasn't been a day yet.

PS: I'm not totally happy with the Antec Sonata case.  It was supposed
to be high-end (not like the cheap cases I'd used up to then) and
quiet.  The power supply went a few years ago and the case fan went
who knows when.  And it isn't as quiet as the consumer HP computers
that I've purchased.  Perhaps this is a cascade of failures mediated
by heat.
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists