bad blocks on SATA disk: another war story and a request for advice

Tue Feb 19 17:09:24 UTC 2013

On Sun, Feb 17, 2013 at 04:42:52PM -0500, D. Hugh Redelmeier wrote:
> I'm going to lay out this story in the hope that it helps others in a
> similar situation, and in the hope that others might recommend
> improvements.
> 
> The 2.5" hard drive in one of my "nettop" computers has developed read 
> errors.
> 
> I discovered this when fsck threw up its hands during a reboot
> (planned: after updates).  I just shut the machine down.
> 
> I rebooted a different partition (the sick one was Fedora; I rebooted
> to Ubuntu).  It sure is handy to be able to boot a system that
> doesn't itself have bad sectors.  If you don't have a healthy system
> on your hard disk, consider booting an emergency system off CD or USB.
> 
> palimpsest is the real name of what Ubuntu menus call "Disk Utility".
> It seems to have no manpage and does not respond to a --help flag.
> Grrr.  Among other things, it is a GUI for the S.M.A.R.T.
> capabilities of disk drives.
> 
> palimpsest said "Current Pending Sector Count" is 64 after a long
> selftest.  That means that there are 64 sectors that cannot be read.
> 
> The "Reallocation Count" is 8.  That means that there were 8 sectors
> that the disk firmware has judged to be bad or going bad and has
> "remapped" to spares.  This is invisible to the computer: the new
> sectors appear to be at the original address.
> 
> If I wrote something to one of those pending sectors, the firmware
> would remap it too.  The Pending Sector Count would go down
> by one, and the Reallocation Count would go up by one.  That's where
> the odd euphemistic term "Pending" comes from: they are awaiting a
> write so that they can be remapped.
> 
> To be honest, I think I prefer the CLI tool smartctl(8) to
> Palimpsest(?).
> 
> For a complete report:
> 	sudo smartctl -x /dev/sda
> 
> To start a long test:
> 	sudo smartctl -t long /dev/sda
> The test runs a long time but the command returns immediately and the
> system continues to operate, and the drive remains available to it.
> The command will tell you a guesstimate of when the test will complete.
> To find out the results, ask for another report when the test is
> complete (the report will say if the test isn't yet complete).
> 
> Summary:
> 
> - Current Pending Sector Count is the number of sectors that cannot be
>   read by the drive and haven't subsequently been written to.  They are
>   bad news and the problem is visible to software.
> 
> - Reallocation Count is old bad news.  Everything is fine now.  But,
>   of course, it is a sign that the disk might not have a long and
>   healthy life ahead of it.
> 
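
For anyone who wants to watch just these two counters from a script, here
is a minimal sketch.  It assumes the usual attribute table printed by
"smartctl -A", where the attribute name is the second field and the raw
value is the tenth, and it assumes the drive reports the attributes under
the names Current_Pending_Sector and Reallocated_Sector_Ct.  The helper
name smart_raw is made up for illustration.

```shell
#!/bin/sh
# Print the raw value of one SMART attribute, reading `smartctl -A`
# output on stdin.  Assumption: attribute name is field 2 and the raw
# value is field 10, as in the usual ATA attribute table.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Hypothetical usage on a real drive (needs root), not run here:
#   sudo smartctl -A /dev/sda | smart_raw Current_Pending_Sector
#   sudo smartctl -A /dev/sda | smart_raw Reallocated_Sector_Ct
```

If the attribute is missing from the table, the helper prints nothing, so
an empty result should be treated as "unknown" rather than zero.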
> 
> Now, what to do about this?
> 
> badblocks(8) will look for bad blocks.  It can optionally use
> nondestructive writes in the test (i.e. read a block and write the
> same contents back).
> 
> The output of badblocks can be fed into e2fsck (the file system check
> command for ext2, ext3, and ext4 filesystems).  But: badblocks needs
> to know the blocksize used by the filesystem.  Unfortunately, the
> badblocks output does not record the blocksize used, so e2fsck just
> assumes that it matches.  On my system, the defaults were wrong
> (badblocks assumes 1k, my filesystem actually used 4k).
> 
> How can you discover the extN blocksize?
> 	tune2fs -l /dev/sda5
> This reports a lot of info, including the blocksize.
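
A rough sketch of gluing the two together, so that badblocks is told the
filesystem's blocksize explicitly.  It assumes tune2fs prints a line
starting with "Block size:", that the filesystem is unmounted, and that
you run it as root.  The function names are made up for illustration, and
the second one is only defined here, not run.

```shell
#!/bin/sh
# Extract the block size from `tune2fs -l` output read on stdin.
# Assumption: the relevant line looks like "Block size:       4096".
fs_block_size() {
    awk -F: '/^Block size/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Hypothetical helper: read-only scan with the filesystem's own block
# size, then hand the resulting list to e2fsck.  $1 is the partition,
# $2 is the file that will hold the bad block list.
scan_with_fs_block_size() {
    dev=$1
    list=$2
    bs=$(tune2fs -l "$dev" | fs_block_size) || return 1
    [ -n "$bs" ] || return 1
    badblocks -b "$bs" -o "$list" "$dev" && e2fsck -l "$list" "$dev"
}

# Example call, not run here:
#   scan_with_fs_block_size /dev/sda5 /tmp/sda5.bad
```

Keeping the list in a file means you can look at the damage before
e2fsck acts on it.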
> 
> Safer is to use the -c parameter to e2fsck and have it run
> badblocks the right way itself.  The downside is that you don't get to
> see the damage before acting on it.
> 
> In my case, badblocks found four 4KiB bad blocks in the Fedora
> partition.  That explains at most 32 512B bad blocks so there must be
> more lurking somewhere.  Yikes.
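
The arithmetic behind that, spelled out as a small sketch.  The numbers
are the ones from the text above (four 4KiB blocks, 64 pending sectors);
the function name is made up for illustration.

```shell
#!/bin/sh
# Number of logical sectors covered by N filesystem blocks.
# Defaults: 4096-byte blocks, 512-byte logical sectors.
sectors_for_blocks() {
    blocks=$1
    block_size=${2:-4096}
    sector_size=${3:-512}
    echo $(( blocks * block_size / sector_size ))
}

explained=$(sectors_for_blocks 4)   # four 4KiB blocks found by badblocks
pending=64                          # Current Pending Sector Count from SMART
echo "explained=$explained unexplained_at_least=$(( pending - explained ))"
# prints: explained=32 unexplained_at_least=32
```

The 32 is an upper bound on what the Fedora partition explains, since a
bad 4KiB block may contain fewer than eight unreadable sectors.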
> 
> Getting e2fsck to handle the bad blocks is good because you can find
> out what the damage is in terms of files and inodes.  What is
> unfortunate is that it doesn't allow the disk firmware to get rid of
> the bad sectors: they are still there and they will still cause
> problems if you read them.  But the filesystem is adjusted so that
> they no longer are or will be used.  dd'ing the whole raw disk will
> still hit them though.
> 
> One unfortunate thing: it could be that only a part of a 4KiB block is
> lost (i.e. fewer than eight of the 512B sectors on the drive that hold
> one 4KiB block).  e2fsck does not seem to try to recover and use this
> partial info.  It should at least try to recover something if the
> block contained inodes.
> 
> Hmmm.  When I ask for SMART data, without running another test, it
> seems to be down to 8 Pending (from 64) and up to 17 Reallocated (from
> 8 then 9).  So maybe e2fsck does write to the bad blocks.  And it
> appears that not all the errors are in the Fedora partition.
> 
> Is it likely that what has gone wrong is a one-time thing or is it the
> start of a trend?
> 
> - the bad sectors are not all in one place (three ranges in the Fedora
>   partition, and at least one elsewhere).  So it isn't a simple
>   physical bad spot.
> 
> - the Reallocated number went up from 8 to 9 for a reason I don't
>   understand.
> 
> This makes me think that I should just discard the drive because it
> will be more trouble than it is worth.  But being a hacker, I will
> play with it a bit more.
> 
> The disk is a Hitachi 5K750 HTS547575A9E384 750G 2.5" drive.  I bought
> it as an external drive because that was cheaper than a bare drive.  I
> cracked it open and installed it in my nettop.  Consequently there is
> no warranty on it as far as I know.
> 
> Reading a review of the drive,
> 	All models in this product range employ AF and emulate
> 	512-byte sectors.
> That means that the actual sector size is 4KiB (probably).  So the
> filesystem's 4KiB block size is appropriate.
> 
> So why would any of the SMART sector counts be other than a multiple
> of 8?  How could there be 9 or 17 reallocated sectors when
> reallocation must be done 8 at a time?

The firmware might be smart enough to do partial remapping, waiting
for the remaining logical sectors in the larger physical sector to be
written before finishing the remapping.
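
To see which logical sectors share a physical sector with a given bad
one, something like the following sketch would do.  It assumes eight
512-byte logical sectors per 4KiB physical sector and that logical
sector 0 starts a physical sector (no alignment offset), neither of
which I have verified for this drive.  The function name is made up.

```shell
#!/bin/sh
# Print the first and last logical sector (LBA) that share a physical
# sector with the given LBA.
# Assumptions: 8 logical sectors per physical sector by default, and
# LBA 0 is aligned to the start of a physical sector.
phys_group() {
    lba=$1
    per_phys=${2:-8}
    first=$(( lba / per_phys * per_phys ))
    echo "$first $(( first + per_phys - 1 ))"
}

phys_group 17    # prints: 16 23
```

If the firmware really does remap piecemeal, writing to every LBA in
that range should be what completes the remap.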

> Hitachi took over IBM's disk business.  Recently HGST, the disk
> business, was sold to Western Digital.  So I guess the ERC will
> disappear.  Right now, smartctl reports:
> 
> SCT capabilities:              (0x003d) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
> 
> I think that SCT Error Recovery Control supported means that you
> should be able to ask the drive to limit how long it will attempt to
> read bad sectors.
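
With smartctl that would look roughly like the sketch below.  The
"-l scterc" option takes the read and write limits in units of 100 ms.
The 7 second value and the helper name scterc_arg are only for
illustration, and on many drives the setting may not survive a power
cycle.

```shell
#!/bin/sh
# Build the smartctl log argument for SCT Error Recovery Control from
# timeouts given in seconds (smartctl wants units of 100 ms).
# If only one value is given, it is used for both read and write.
scterc_arg() {
    read_s=$1
    write_s=${2:-$1}
    echo "scterc,$(( read_s * 10 )),$(( write_s * 10 ))"
}

# Hypothetical usage (needs root and a drive that supports SCT ERC),
# not run here:
#   sudo smartctl -l scterc /dev/sda              # show current limits
#   sudo smartctl -l "$(scterc_arg 7)" /dev/sda   # limit both to 7 s
```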

-- 
Len Sorensen
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists