confused about Current_Pending_Sector (fwd)

Sat Apr 3 16:56:13 UTC 2010

I'm having problems with a disk drive.

I posted this to the smartmontools list.  Someone on the TLUG list might 
find this interesting too.

---------- Forwarded message ----------
Date: Sat, 3 Apr 2010 02:47:02 -0400 (EDT)
From: D. Hugh Redelmeier <hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org>
To: smartmontools-support-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f at public.gmane.org
Subject: confused about Current_Pending_Sector

Summary: smartctl shows a large and growing number of
Current_Pending_Sectors, yet neither badblocks nor dd finds any problem.
What could this mean?

On my Fedora Core 11 x86_64 Linux system, Palimpsest told me that my
second disk drive (WD5000AAKS-0) wasn't healthy.  It had a raw value
of 296 for Current_Pending_Sector.

I think that that means that there are 296 sectors known to read back
incorrectly but that cannot be mapped out behind the scenes because
that would cause silent data loss.

Normally no partitions from that drive were mounted.  A swap partition
should never have been used since I have lots of RAM and a quite large
swap partition on another drive with a less negative priority.

When I mounted (read-only) one of the partitions, things seemed wonky.
I got error messages in the kernel log:
    Mar 30 15:00:11 redsquare kernel: ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Mar 30 15:00:11 redsquare kernel: ata1.01: cmd c8/00:20:e8:b3:9d/00:00:00:00:00/f5 tag 0 dma 16384 in
    Mar 30 15:00:11 redsquare kernel:         res 51/01:00:e8:b3:9d/00:00:00:00:00/f5 Emask 0x1 (device error)
    Mar 30 15:00:11 redsquare kernel: ata1.01: status: { DRDY ERR }
    Mar 30 15:00:11 redsquare kernel: ata1.00: configured for UDMA/100
    Mar 30 15:00:11 redsquare kernel: ata1.01: configured for UDMA/133
    Mar 30 15:00:11 redsquare kernel: ata1: EH complete
<repeated a bunch of times>
and a bunch more bad stuff.
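
For anyone wanting to turn the "cmd" line above into an LBA, here is a
minimal sketch.  It assumes libata prints the taskfile as
command/feature:nsect:lbal:lbam:lbah/<five HOB bytes>/device and that the
command is a 28-bit one (0xc8, READ DMA, is).  The function name is made
up for illustration.

```python
def decode_libata_taskfile(tf):
    """Decode a 28-bit taskfile string as printed in the kernel log.

    Assumed layout (not taken from the post itself):
      command/feature:nsect:lbal:lbam:lbah/hob bytes/device
    """
    command, low, _hob, device = tf.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    dev = int(device, 16)
    # LBA28: low nibble of the device register holds bits 27..24.
    lba = ((dev & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    # For 28-bit reads a sector count of 0 means 256 sectors.
    sectors = nsect if nsect != 0 else 256
    return {"command": int(command, 16), "lba": lba, "sectors": sectors}


info = decode_libata_taskfile("c8/00:20:e8:b3:9d/00:00:00:00:00/f5")
print(info["lba"], info["sectors"] * 512)
```

Under those assumptions the sector count of 0x20 gives 32 * 512 = 16384
bytes, which agrees with the "dma 16384 in" on the same log line.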

I then rebooted and did a long test on the drive and got a raw count of 
754. Yikes!

The output of "smartctl -a" on the failing drive lists several errors:
	ATA Error Count: 14 (device log contains only the most recent five errors)
All look like this:
    Error 14 occurred at disk power-on lifetime: 11091 hours (462 days + 3 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      01 51 00 70 16 71 e2  Error: AMNF at LBA = 0x02711670 = 40965744

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      c8 00 06 70 16 71 02 00      00:00:57.247  READ DMA
      27 00 00 00 00 00 00 00      00:00:57.221  READ NATIVE MAX ADDRESS EXT
      ec 00 00 00 00 00 00 00      00:00:57.218  IDENTIFY DEVICE
      ef 03 46 00 00 00 00 00      00:00:57.215  SET FEATURES [Set transfer mode]
      27 00 00 00 00 00 00 00      00:00:57.209  READ NATIVE MAX ADDRESS EXT

These errors were unchanged by the long test (there had been 14 before
and 14 after).  So, as far as I can tell, the long test doesn't tell
me where the bad blocks are.

Of the five errors in the SMART log, only two LBAs were mentioned:
40965743 and 40965744.  Those LBAs seem to be near the very end of
the first partition (40965749).  Too near to be pure coincidence.
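
The arithmetic behind "too near", as a sketch.  It uses only numbers
quoted in this message, plus the assumption that the SC register of the
failing READ DMA (06) is a sector count starting at the reported LBA:

```python
PARTITION_END = 40965749            # last LBA of the first partition
ERROR_LBAS = (40965743, 40965744)   # LBAs named in the SMART error log

# How many sectors short of the partition end each error is.
distances = {lba: PARTITION_END - lba for lba in ERROR_LBAS}

# Failing command: READ DMA, start LBA 0x02711670, sector count 0x06.
read_start, read_count = 0x02711670, 0x06
read_last = read_start + read_count - 1

print(distances)
print(read_last == PARTITION_END)
```

On that reading, the failing read for error 14 covers the last six
sectors of the partition and ends exactly on its final LBA.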

I rebooted and sometime later ran badblocks, read-only, on the drive
(not the partition) and it found no bad blocks!  dmesg showed no
logged I/O errors on the drive.  Not even for the two LBAs mentioned.
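
A targeted re-read of just the suspect LBAs can be sketched like this.
It is not what badblocks does internally, only an illustration; it
assumes 512-byte logical sectors, and the device path in the usage note
is hypothetical.

```python
import os

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def unreadable_sectors(path, first_lba, count):
    """Read `count` sectors one at a time; return the LBAs that fail.

    A sector is reported if the read raises an I/O error or comes back
    short (for example, past the end of the device or file).
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for lba in range(first_lba, first_lba + count):
            try:
                data = os.pread(fd, SECTOR_SIZE, lba * SECTOR_SIZE)
            except OSError:
                bad.append(lba)
                continue
            if len(data) != SECTOR_SIZE:
                bad.append(lba)
    finally:
        os.close(fd)
    return bad
```

Usage would be something like unreadable_sectors("/dev/sdb", 40965743, 7),
run as root.  These reads go through the page cache, so a sector that was
read successfully earlier may be served from memory rather than from the
platter.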

I added a new drive to the system and did a dd of the whole bad disk
to the new, larger drive.  Again, no errors.  A very superficial check
of the copy (fsck -f) finds no problems.

A couple of smartctl -a messages are unclear to me:
    Offline data collection status:  (0x84) Offline data collection activity
					    was suspended by an interrupting command from host.
					    Auto Offline Data Collection: Enabled.
    Self-test execution status:      ( 119) The previous self-test completed having
					    the read element of the test failed.
    Total time to complete Offline 
    data collection:                 (12600) seconds.

What might have caused the data collection activity to suspend?
Something I did (what?) or something Linux did?  I let Palimpsest run
the long test and didn't do anything with the drive until it said that
the test was finished.

Why does it not tell me more about the read element failure (like, say,
an LBA)?
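
The two status bytes quoted above can be pulled apart a little further.
This is a sketch under my understanding of the ATA SMART data layout,
which is an assumption here: the self-test byte carries a status code in
its high nibble and "percent remaining" (in tens) in its low nibble, and
bit 7 of the offline byte is the auto-offline flag.

```python
def decode_self_test_status(byte):
    """Split the self-test execution status byte into its two nibbles."""
    return {
        "status_code": byte >> 4,
        "percent_remaining": (byte & 0x0F) * 10,
    }


def decode_offline_status(byte):
    """Split the offline data collection status byte."""
    return {
        "auto_offline_enabled": bool(byte & 0x80),
        "status_code": byte & 0x7F,
    }


print(decode_self_test_status(119))
print(decode_offline_status(0x84))
print(12600 / 3600)  # offline collection time in hours
```

If that layout is right, 119 = 0x77 is status code 7 with 70% of the test
still to run when it stopped, and 0x84 is status code 4 with auto-offline
enabled.  The 12600 seconds works out to 3.5 hours.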

PS: the documentation
<http://smartmontools.sourceforge.net/badblockhowto.html> was useful.
Thanks.  I recommend that it mention badblocks(8), which is a little
less ad hoc than the dd loop.