war story: another disk problem

D. Hugh Redelmeier hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Mon Aug 26 20:18:41 UTC 2013


Last night I was reading slashdot on a MythTV server (it was handy at the 
time).  I followed a link from slashdot to an unknown site, and my 
browser started acting odd: unresponsive, dimming.  Of course I thought: 
malware attacking Firefox.

I soon figured out the real cause: disk problems.  These were made
visible by the dmesg command.

I tried to shut the machine down for a refreshing reboot, but the
shutdown didn't work.  A long press on the power button did.

On the way back up, grub complained and stopped.  So rebooting didn't
work.

I booted a live Ubuntu 12.04.2 DVD.  The first hard drive was
misbehaving, but not completely: the kernel still read the partition table.

Recognizing the drive:
  [    3.886635] ata1.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133
  [    3.886639] ata1.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
  [    3.890606] ata1.00: configured for UDMA/133
  [    3.890699] scsi 0:0:0:0: Direct-Access     ATA      WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
  [    3.890775] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
  [    3.890799] sd 0:0:0:0: Attached scsi generic sg0 type 0
  [    3.890815] sd 0:0:0:0: [sda] Write Protect is off
  [    3.890817] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
  [    3.890832] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
  [    3.956719]  sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
  [    3.957070] sd 0:0:0:0: [sda] Attached SCSI disk
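
As a sanity check on the size the kernel reports, the arithmetic can be
redone by hand.  This is only a sketch; the names SECTORS, SECTOR_BYTES and
capacity() are mine, and the two numbers are copied from the log above.

```python
# Values copied from the dmesg excerpt; names are illustrative only.
SECTORS = 3907029168       # "3907029168 sectors" / "512-byte logical blocks"
SECTOR_BYTES = 512         # logical block size reported by the sd driver


def capacity(sectors, sector_bytes=SECTOR_BYTES):
    """Return (bytes, decimal terabytes, binary tebibytes)."""
    total = sectors * sector_bytes
    return total, total / 10**12, total / 2**40


total, tb, tib = capacity(SECTORS)
print(f"{total} bytes = {tb:.4f} TB = {tib:.4f} TiB")
```

The TiB value comes out a little above 1.819, so the "1.81 TiB" in the log
looks truncated rather than rounded.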

Here's the first error, about 2 seconds later:
  [    5.998775] ata1.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
  [    5.998778] ata1.00: irq_stat 0x08000000, interface fatal error
  [    5.998780] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
  [    5.998783] ata1.00: failed command: READ FPDMA QUEUED
  [    5.998786] ata1.00: cmd 60/08:00:90:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
  [    5.998786]          res 40/00:04:90:03:00/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
  [    5.998788] ata1.00: status: { DRDY }
  [    5.998791] ata1: hard resetting link
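
The names inside the braces on the SError line are just the set bits of the
SErr value.  Here is a small sketch of that decoding.  The bit table is my
recollection of the SATA SError register layout and of the short names libata
prints, so treat any name that does not appear in the log as an assumption.
The four that do appear match the logged values.

```python
# Bit position -> short name.  Assumed from memory of the SATA SError
# register and libata's labels; only bits 8, 11, 19 and 21 are confirmed
# by the log in this message.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """List the names of the bits set in an SErr value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items())
            if value & (1 << bit)]


print(decode_serror(0x280900))
print(decode_serror(0x280100))
```

Both values include BadCRC and 10B8B, which are errors on the link between
the controller and the drive.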

Here's where the disk comes back after the reset (and gets another error):
  [    6.497059] ata1.00: configured for UDMA/133
  [    6.497070] sd 0:0:0:0: [sda]  
  [    6.497071] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  [    6.497072] sd 0:0:0:0: [sda]  
  [    6.497073] Sense Key : Aborted Command [current] [descriptor]
  [    6.497074] Descriptor sense data with sense descriptors (in hex):
  [    6.497075]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
  [    6.497080]         00 00 03 90 
  [    6.497082] sd 0:0:0:0: [sda]  
  [    6.497083] Add. Sense: No additional sense information
  [    6.497084] sd 0:0:0:0: [sda] CDB: 
  [    6.497085] Read(10): 28 00 00 00 03 90 00 00 08 00
  [    6.497089] end_request: I/O error, dev sda, sector 912
  [    6.497092] Buffer I/O error on device sda, logical block 114
  [    6.497102] ata1: EH complete
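
The CDB, the sector number and the logical block number in that excerpt all
describe the same read.  A minimal sketch of how they relate follows.
parse_read10() is a hypothetical helper, and the 4096-byte buffer block size
is an assumption inferred from 912 / 114 = 8, not something the log states.

```python
SECTOR_BYTES = 512          # logical sector size reported by sd
BUFFER_BLOCK_BYTES = 4096   # assumed; inferred from sector 912 -> block 114


def parse_read10(cdb_hex):
    """Return (lba, block_count) from a SCSI READ(10) CDB given as hex."""
    cdb = bytes.fromhex(cdb_hex)            # spaces between bytes are skipped
    if len(cdb) != 10 or cdb[0] != 0x28:    # 0x28 is the READ(10) opcode
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


lba, blocks = parse_read10("28 00 00 00 03 90 00 00 08 00")
nbytes = blocks * SECTOR_BYTES
buffer_block = lba * SECTOR_BYTES // BUFFER_BLOCK_BYTES
print(lba, blocks, nbytes, buffer_block)
```

So the failing request is a read of 8 sectors (4096 bytes) starting at
sector 912, which is buffer block 114.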

A moment later, a bunch of errors get logged.  Here's the first bit:
  [    6.511086] ata1.00: exception Emask 0x10 SAct 0x1 SErr 0x280100 action 0x6 frozen
  [    6.511091] ata1.00: irq_stat 0x08000000, interface fatal error
  [    6.511094] ata1: SError: { UnrecovData 10B8B BadCRC }
  [    6.511098] ata1.00: failed command: READ FPDMA QUEUED
  [    6.511105] ata1.00: cmd 60/08:00:90:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
  [    6.511105]          res 40/00:04:90:03:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
  [    6.511108] ata1.00: status: { DRDY }
  [    6.511113] ata1: hard resetting link
  [    7.000015] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  [    7.008673] ata1.00: configured for UDMA/133
  [    7.008689] sd 0:0:0:0: [sda]  
  [    7.008691] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  [    7.008692] sd 0:0:0:0: [sda]  
  [    7.008693] Sense Key : Aborted Command [current] [descriptor]
  [    7.008695] Descriptor sense data with sense descriptors (in hex):
  [    7.008697]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
  [    7.008702]         00 00 03 90 
  [    7.008705] sd 0:0:0:0: [sda]  
  [    7.008707] Add. Sense: No additional sense information
  [    7.008708] sd 0:0:0:0: [sda] CDB: 
  [    7.008709] Read(10): 28 00 00 00 03 90 00 00 08 00
  [    7.008715] end_request: I/O error, dev sda, sector 912
  [    7.008717] Buffer I/O error on device sda, logical block 114
  [    7.008725] ata1: EH complete
  [    7.088674] ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x280100 action 0x6 frozen
  [    7.088679] ata1.00: irq_stat 0x08000000, interface fatal error
  [    7.088683] ata1: SError: { UnrecovData 10B8B BadCRC }
  [    7.088687] ata1.00: failed command: READ FPDMA QUEUED
  [    7.088694] ata1.00: cmd 60/08:00:f8:06:e2/00:00:04:00:00/40 tag 0 ncq 4096 in
  [    7.088694]          res 40/00:1c:f8:07:53/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
  [    7.088697] ata1.00: status: { DRDY }
  [    7.088700] ata1.00: failed command: READ FPDMA QUEUED
  [    7.088706] ata1.00: cmd 60/02:08:fe:0f:53/00:00:07:00:00/40 tag 1 ncq 1024 in
  [    7.088706]          res 40/00:1c:f8:07:53/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
  [    7.088709] ata1.00: status: { DRDY }
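
The "cmd" lines can be picked apart the same way.  The field order below
(command/feature:nsect:lbal:lbam:lbah/ then the five "hob" bytes, then the
device byte) is my reading of the libata message format, and for an NCQ
read I am assuming the sector count sits in the feature field and the tag
in the top five bits of nsect.  The decoded values agree with the "tag",
"ncq <bytes>" and "sector 912" figures in the log, which is some comfort.
decode_ncq_cmd() is a hypothetical helper, not anything from the kernel.

```python
import re

HEX5 = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
CMD_RE = re.compile(r"cmd ([0-9a-f]{2})/" + HEX5 + "/" + HEX5 + r"/([0-9a-f]{2})")


def decode_ncq_cmd(line):
    """Decode command, LBA, sector count and tag from a libata 'cmd' line."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no cmd field found")
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob = [int(x, 16) for x in m.group(3).split(":")]
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = hob
    lba = (lbal | lbam << 8 | lbah << 16
           | hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40)
    # A count of 0 has a special meaning in NCQ; that case is ignored here.
    count = feature | hob_feature << 8
    return {"command": command, "lba": lba, "count": count, "tag": nsect >> 3}


first = decode_ncq_cmd("cmd 60/08:00:90:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in")
later = decode_ncq_cmd("cmd 60/02:08:fe:0f:53/00:00:07:00:00/40 tag 1 ncq 1024 in")
print(first)
print(later)
```

The first command is the same 8-sector read at sector 912 as before.  The
later ones are at quite different addresses, so the failures are not
confined to one spot on the disk.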


I powered off, wiggled the disk cables, and rebooted.  The result was no
different.

I powered off, moved the SATA cable to another SATA socket on the
motherboard.  The live CD booted, saw the disk behaving and proceeded
to fsck (all without my asking -- scary).

I powered off, moved the SATA cable back to the original SATA socket
on the motherboard.  The live CD booted, saw the disk behaving.

So: I declared victory, rebooted from the formerly bad disk, and left
the field of battle.  I hope that the problem doesn't reappear.


What happened?

- not software: the fairly up-to-date Ubuntu 12.04 on the disk didn't even
  get as far as booting once the trouble started.

- not heat: I left the machine off while I slept.  The problem was
  still there when I booted the live CD in the morning.

- possibly the disk drive had an intermittent fault

- perhaps a marginal power supply.  Symptoms of power supply problems
  can be subtle and intermittent.

- perhaps a crappy motherboard (Gigabyte GA-P43T-ES3G).  Another
  datapoint on the motherboard:  a few months ago the on-board
  ethernet interface became unreliable and then reliably broken.

- my best guess: some kind of SATA cable or connector failure.
  Perhaps corrosion that was cleared up by plugging and unplugging.

-->> your theory goes here <<--
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




