confused about Current_Pending_Sector (fwd)
D. Hugh Redelmeier
hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Sat Apr 3 16:56:13 UTC 2010
I'm having problems with a disk drive.
I posted this to the smartmontools list. Someone on the TLUG list might
find this interesting too.
---------- Forwarded message ----------
Date: Sat, 3 Apr 2010 02:47:02 -0400 (EDT)
From: D. Hugh Redelmeier <hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org>
To: smartmontools-support-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f at public.gmane.org
Subject: confused about Current_Pending_Sector
Summary: smartctl shows a large and growing number of
Current_Pending_Sectors, yet neither badblocks nor dd finds a problem.
What could this mean?
On my Fedora Core 11 x86_64 Linux system, Palimpsest told me that my
second disk drive (WD5000AAKS-0) wasn't healthy. It had a raw value
of 296 for Current_Pending_Sector.
I think that means there are 296 sectors that are known to read back
incorrectly but that cannot be remapped behind the scenes, because
doing so would cause silent data loss.
Normally no partitions from that drive were mounted. A swap partition
on it should never have been used, since I have lots of RAM and a quite
large swap partition on another drive with a less negative priority.
When I mounted (read-only) one of the partitions, things seemed wonky.
I got error messages in the kernel log:
Mar 30 15:00:11 redsquare kernel: ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 30 15:00:11 redsquare kernel: ata1.01: cmd c8/00:20:e8:b3:9d/00:00:00:00:00/f5 tag 0 dma 16384 in
Mar 30 15:00:11 redsquare kernel: res 51/01:00:e8:b3:9d/00:00:00:00:00/f5 Emask 0x1 (device error)
Mar 30 15:00:11 redsquare kernel: ata1.01: status: { DRDY ERR }
Mar 30 15:00:11 redsquare kernel: ata1.00: configured for UDMA/100
Mar 30 15:00:11 redsquare kernel: ata1.01: configured for UDMA/133
Mar 30 15:00:11 redsquare kernel: ata1: EH complete
<repeated a bunch of times>
and a bunch more bad stuff.
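The "cmd" line in these messages is a dump of the taskfile for the failing
request, so the LBA can be recovered from it. A minimal Python sketch,
assuming the usual libata field order (command/feature:nsect:lbal:lbam:lbah,
then five high-order bytes, then the device register) and a 28-bit command
such as READ DMA (0xc8). The function name and the regular expression are
illustrative and not part of any tool:

```python
import re

# Assumed layout: cmd CC/FF:NS:LL:LM:LH/hob bytes (five)/DD
TASKFILE_RE = re.compile(
    r"cmd ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)

def decode_lba28_taskfile(line):
    """Extract command, sector count and 28-bit LBA from a libata cmd line."""
    m = TASKFILE_RE.search(line)
    if m is None:
        raise ValueError("no taskfile found in line")
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (
        int(b, 16) for b in m.group(2).split(":")
    )
    device = int(m.group(4), 16)
    # For LBA28 commands the low nibble of the device register holds
    # LBA bits 27..24; the high-order (hob) bytes are ignored here.
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {"command": command, "sectors": nsect, "lba": lba}
```

Applied to the cmd line above, and provided the field-order assumption holds,
this gives command 0xc8, 32 sectors (consistent with the logged 16384-byte
transfer) and LBA 0x059db3e8 = 94221288.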
I then rebooted and did a long test on the drive and got a raw count of
754. Yikes!
The output of "smartctl -a" on the failing drive lists several errors:
ATA Error Count: 14 (device log contains only the most recent five errors)
All look like this:
Error 14 occurred at disk power-on lifetime: 11091 hours (462 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
01 51 00 70 16 71 e2 Error: AMNF at LBA = 0x02711670 = 40965744
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
c8 00 06 70 16 71 02 00      00:00:57.247  READ DMA
27 00 00 00 00 00 00 00      00:00:57.221  READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00      00:00:57.218  IDENTIFY DEVICE
ef 03 46 00 00 00 00 00      00:00:57.215  SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00      00:00:57.209  READ NATIVE MAX ADDRESS EXT
These errors were unchanged by the long test (there had been 14 before
and 14 after). So, as far as I can tell, the long test doesn't tell
me where the bad blocks are.
Of the five errors in the SMART log, only two LBAs were mentioned:
40965743 and 40965744. Those LBAs seem to be near the very end of
the first partition (40965749). Too near to be pure coincidence.
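The LBA that smartctl prints can be cross-checked against the register row
in the error log. A small Python sketch, assuming 28-bit addressing (the low
nibble of DH carries LBA bits 27..24, then CH, CL and SN follow in descending
order); the helper name is made up for illustration:

```python
PARTITION_END = 40965749  # figure quoted above for the end of partition 1

def lba28_from_smart_row(row):
    """Decode an 'ER ST SC SN CL CH DH' row of hex bytes into an LBA."""
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in row.split()[:7])
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

lba = lba28_from_smart_row("01 51 00 70 16 71 e2")
print(hex(lba), lba, PARTITION_END - lba)
```

For the row shown in Error 14 this reproduces 40965744, which is 5 sectors
short of the quoted partition end.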
I rebooted and sometime later ran badblocks, read-only, on the drive
(not the partition) and it found no bad blocks! dmesg showed no
logged I/O errors on the drive. Not even for the two LBAs mentioned.
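Individual suspect LBAs can also be read one at a time rather than scanning
the whole drive. A hedged Python sketch, assuming 512-byte logical sectors
and a raw device path such as /dev/sdX (a placeholder, not the real name of
the drive). It uses ordinary buffered reads, so a sector that is already in
the page cache could be served without touching the disk:

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors

def probe_sectors(path, lbas, sector_size=SECTOR_SIZE):
    """Read each LBA once; return {lba: None on success, else error text}."""
    results = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        for lba in lbas:
            try:
                data = os.pread(fd, sector_size, lba * sector_size)
                # A short read means the offset is at or past the end.
                results[lba] = (
                    None if len(data) == sector_size else "short read"
                )
            except OSError as exc:
                # A medium error normally surfaces here as EIO.
                results[lba] = exc.strerror
    finally:
        os.close(fd)
    return results
```

Run as root, a call such as probe_sectors("/dev/sdX", [40965743, 40965744])
would target just the two LBAs from the SMART error log.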
I added a new drive to the system and did a dd of the whole bad disk
to the new, larger drive. Again, no errors. A very superficial check
of the copy (fsck -f) finds no problems.
A couple of smartctl -a messages are unclear to me:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 119) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (12600) seconds.
What might have caused the data collection activity to suspend?
Something I did (what?) or something Linux did? I let Palimpsest run
the long test and didn't do anything with the drive until it said that
the test was finished.
Why does it not tell me more about the read element failure (like, say, an LBA)?
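The two numbers in parentheses, 0x84 and 119, look like packed status bytes.
A minimal Python sketch of how they might be unpacked, assuming the ATA SMART
layout: for the offline status, bit 7 is the auto-offline flag and the low
seven bits are the state; for the self-test status, the high nibble is the
result and the low nibble is the remaining fraction in tens of percent. The
tables are deliberately partial and the function names are illustrative:

```python
# Partial tables: only codes seen in this report plus a few common ones.
OFFLINE_STATE = {
    0x00: "never started",
    0x02: "completed without error",
    0x04: "suspended by an interrupting command from host",
}
SELF_TEST_RESULT = {
    0x0: "completed without error, or no self-test has been run",
    0x7: "read element of the test failed",
    0xF: "self-test in progress",
}

def decode_offline_status(byte):
    """Split the offline data collection status byte."""
    state = byte & 0x7F
    return {
        "auto_offline_enabled": bool(byte & 0x80),
        "state": OFFLINE_STATE.get(state, "code 0x%02x" % state),
    }

def decode_self_test_status(byte):
    """Split the self-test execution status byte."""
    result = byte >> 4
    return {
        "result": SELF_TEST_RESULT.get(result, "code 0x%x" % result),
        "percent_remaining": (byte & 0x0F) * 10,
    }

print(decode_offline_status(0x84))
print(decode_self_test_status(119))
```

Under these assumptions 119 is 0x77: result code 7 (read element failed)
with the low nibble suggesting roughly 70% of the test still remaining when
it stopped.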
PS: the documentation
<http://smartmontools.sourceforge.net/badblockhowto.html> was useful.
Thanks. I recommend that it mention badblocks(8), which is a little less
ad hoc than the dd loop.
--
The Toronto Linux Users Group. Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists