war story: another disk problem
D. Hugh Redelmeier
hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Mon Aug 26 20:18:41 UTC 2013
Last night I was reading slashdot on a MythTV server (it was handy at the
time). I followed a link from slashdot to an unknown site, and my
browser started acting odd: unresponsive, dimming. Of course I thought:
malware attacking Firefox.
I soon figured out the real cause: disk problems. These were made
visible by the dmesg command.
I tried to shut the machine down for a refreshing reboot, but the
shutdown didn't work. A long press on the power button did.
On the way back up, grub complained and stopped. So rebooting didn't
work.
I booted a live Ubuntu 12.04.2 DVD. The first hard drive was
misbehaving, but not completely: the kernel could still read the
partition table.
Recognizing the drive:
[ 3.886635] ata1.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133
[ 3.886639] ata1.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[ 3.890606] ata1.00: configured for UDMA/133
[ 3.890699] scsi 0:0:0:0: Direct-Access ATA WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
[ 3.890775] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
[ 3.890799] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 3.890815] sd 0:0:0:0: [sda] Write Protect is off
[ 3.890817] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 3.890832] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 3.956719] sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
[ 3.957070] sd 0:0:0:0: [sda] Attached SCSI disk
Here's the first error, about 2 seconds later:
[ 5.998775] ata1.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
[ 5.998778] ata1.00: irq_stat 0x08000000, interface fatal error
[ 5.998780] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
[ 5.998783] ata1.00: failed command: READ FPDMA QUEUED
[ 5.998786] ata1.00: cmd 60/08:00:90:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
[ 5.998786] res 40/00:04:90:03:00/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
[ 5.998788] ata1.00: status: { DRDY }
[ 5.998791] ata1: hard resetting link
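For anyone who wants to check what the kernel printed, here is a minimal
sketch that decodes the SError value from the log above into the names
shown inside the braces. The bit table is my own reading of the SATA
SError register layout, not something taken from this machine, so treat
it as an assumption. The function name decode_serror is made up for
illustration.

```python
# Bit positions are an assumption based on the SATA SError register layout;
# the names are the abbreviations the kernel prints in its SError line.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of the SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


# SErr value copied from the first exception in the log.
print(decode_serror(0x280900))
# → ['UnrecovData', 'HostInt', '10B8B', 'BadCRC']
```

The decoded names match the kernel's own SError line. The 10B8B and BadCRC
bits describe errors seen on the link itself, not in data read from the
platters.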
Here's where the disk wakes (and gets another error):
[ 6.497059] ata1.00: configured for UDMA/133
[ 6.497070] sd 0:0:0:0: [sda]
[ 6.497071] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 6.497072] sd 0:0:0:0: [sda]
[ 6.497073] Sense Key : Aborted Command [current] [descriptor]
[ 6.497074] Descriptor sense data with sense descriptors (in hex):
[ 6.497075] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 6.497080] 00 00 03 90
[ 6.497082] sd 0:0:0:0: [sda]
[ 6.497083] Add. Sense: No additional sense information
[ 6.497084] sd 0:0:0:0: [sda] CDB:
[ 6.497085] Read(10): 28 00 00 00 03 90 00 00 08 00
[ 6.497089] end_request: I/O error, dev sda, sector 912
[ 6.497092] Buffer I/O error on device sda, logical block 114
[ 6.497102] ata1: EH complete
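The numbers in that excerpt are consistent with one another, which a short
sketch can confirm. It unpacks the Read(10) CDB bytes from the log and
relates sector 912 to logical block 114. The 512-byte sector size comes
from the drive-recognition lines; the 4096-byte buffer block size is my
assumption, chosen because it matches both "ncq 4096" and the ratio
912/114. The function name decode_read10 is made up for illustration.

```python
SECTOR_BYTES = 512          # from "512-byte logical blocks" in the log
BUFFER_BLOCK_BYTES = 4096   # assumption: buffer-cache block size


def decode_read10(cdb_hex):
    """Split a 10-byte READ(10) CDB into (starting LBA, transfer length)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    length = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: length in sectors
    return lba, length


# CDB bytes copied from the log.
lba, length = decode_read10("28 00 00 00 03 90 00 00 08 00")
buffer_block = lba * SECTOR_BYTES // BUFFER_BLOCK_BYTES

print(lba, length, buffer_block)
# → 912 8 114
```

So the failed request was a single 4 KiB read starting at sector 912, very
near the start of the disk.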
A moment later, a bunch of errors get logged. Here's the first bit:
[ 6.511086] ata1.00: exception Emask 0x10 SAct 0x1 SErr 0x280100 action 0x6 frozen
[ 6.511091] ata1.00: irq_stat 0x08000000, interface fatal error
[ 6.511094] ata1: SError: { UnrecovData 10B8B BadCRC }
[ 6.511098] ata1.00: failed command: READ FPDMA QUEUED
[ 6.511105] ata1.00: cmd 60/08:00:90:03:00/00:00:00:00:00/40 tag 0 ncq 4096 in
[ 6.511105] res 40/00:04:90:03:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[ 6.511108] ata1.00: status: { DRDY }
[ 6.511113] ata1: hard resetting link
[ 7.000015] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 7.008673] ata1.00: configured for UDMA/133
[ 7.008689] sd 0:0:0:0: [sda]
[ 7.008691] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 7.008692] sd 0:0:0:0: [sda]
[ 7.008693] Sense Key : Aborted Command [current] [descriptor]
[ 7.008695] Descriptor sense data with sense descriptors (in hex):
[ 7.008697] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 7.008702] 00 00 03 90
[ 7.008705] sd 0:0:0:0: [sda]
[ 7.008707] Add. Sense: No additional sense information
[ 7.008708] sd 0:0:0:0: [sda] CDB:
[ 7.008709] Read(10): 28 00 00 00 03 90 00 00 08 00
[ 7.008715] end_request: I/O error, dev sda, sector 912
[ 7.008717] Buffer I/O error on device sda, logical block 114
[ 7.008725] ata1: EH complete
[ 7.088674] ata1.00: exception Emask 0x10 SAct 0x3f SErr 0x280100 action 0x6 frozen
[ 7.088679] ata1.00: irq_stat 0x08000000, interface fatal error
[ 7.088683] ata1: SError: { UnrecovData 10B8B BadCRC }
[ 7.088687] ata1.00: failed command: READ FPDMA QUEUED
[ 7.088694] ata1.00: cmd 60/08:00:f8:06:e2/00:00:04:00:00/40 tag 0 ncq 4096 in
[ 7.088694] res 40/00:1c:f8:07:53/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
[ 7.088697] ata1.00: status: { DRDY }
[ 7.088700] ata1.00: failed command: READ FPDMA QUEUED
[ 7.088706] ata1.00: cmd 60/02:08:fe:0f:53/00:00:07:00:00/40 tag 1 ncq 1024 in
[ 7.088706] res 40/00:1c:f8:07:53/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
[ 7.088709] ata1.00: status: { DRDY }
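The "cmd" lines can be unpacked the same way. This sketch assumes the
field order is command/feature:nsect:lbal:lbam:lbah/ followed by the five
high-order ("hob") bytes and the device byte, and that for READ FPDMA
QUEUED the feature bytes hold the sector count while the tag sits in the
upper five bits of nsect. That is my understanding of how libata prints
these, not something stated in the log. The function name decode_ncq_cmd
is made up for illustration.

```python
def decode_ncq_cmd(fields):
    """Decode a libata 'cmd' string for an NCQ read (assumed field order)."""
    parts = fields.split("/")
    if len(parts) != 4:
        raise ValueError("expected four '/'-separated groups")
    command = int(parts[0], 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in parts[1].split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in parts[2].split(":"))
    # 48-bit LBA: the low three bytes, then the three high-order bytes.
    lba = (lbal | lbam << 8 | lbah << 16
           | hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40)
    sectors = hob_feature << 8 | feature  # NCQ keeps the count in feature
    return {
        "command": command,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,           # 512-byte sectors, per the log
        "tag": nsect >> 3,                # NCQ tag lives in nsect bits 7:3
    }


# cmd strings copied from the log.
for cmd in ("60/08:00:90:03:00/00:00:00:00:00/40",
            "60/08:00:f8:06:e2/00:00:04:00:00/40",
            "60/02:08:fe:0f:53/00:00:07:00:00/40"):
    print(decode_ncq_cmd(cmd))
```

Under those assumptions the byte counts and tags come out as 4096/tag 0
and 1024/tag 1, agreeing with the "tag N ncq NNNN" text the kernel printed
beside each command. The later failures are at widely separated LBAs,
which suggests this is not one bad spot on the platter.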
I powered off, wiggled the disk cables, and rebooted. The result wasn't different.
I powered off, moved the SATA cable to another SATA socket on the
motherboard. The live CD booted, saw the disk behaving and proceeded
to fsck (all without my asking -- scary).
I powered off, moved the SATA cable back to the original SATA socket
on the motherboard. The live CD booted, saw the disk behaving.
So: I declared victory, rebooted from the formerly bad disk, and left
the field of battle. I hope that the problem doesn't reappear.
What happened?
- not software: the fairly up-to-date Ubuntu 12.04 on the disk couldn't
even start booting once the fault appeared, and the live DVD hit the
same errors.
- not heat: I left the machine off while I slept. The problem was
still there when I booted the live CD in the morning.
- possibly the disk drive had an intermittent fault
- perhaps a marginal power supply. Symptoms of power supply problems
can be subtle and intermittent.
- perhaps a crappy motherboard (Gigabyte GA-P43T-ES3G). Another
datapoint on the motherboard: a few months ago the on-board
ethernet interface became unreliable and then reliably broken.
- my best guess: some kind of SATA cable or connector failure.
Perhaps corrosion that was cleared up by plugging and unplugging.
-->> your theory goes here <<--
--
The Toronto Linux Users Group. Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists