bad blocks on SATA disk: another war story and a request for advice
Lennart Sorensen
lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Tue Feb 19 17:09:24 UTC 2013
On Sun, Feb 17, 2013 at 04:42:52PM -0500, D. Hugh Redelmeier wrote:
> I'm going to lay out this story in the hope that it helps others in a
> similar situation, and in the hope that others might recommend
> improvements.
>
> The 2.5" hard drive in one of my "nettop" computers has developed read
> errors.
>
> I discovered this when fsck threw up its hands during a reboot
> (planned: after updates). I just shut the machine down.
>
> I rebooted a different partition (the sick one was Fedora; I rebooted
> to Ubuntu). It sure is handy to be able to boot a system that
> doesn't itself have bad sectors. If you don't have a healthy system
> on your hard disk, consider booting an emergency system off CD or USB.
>
> palimpsest is the real name of what Ubuntu menus call "Disk Utility".
> It seems to have no manpage and does not respond to a --help flag.
> Grrr. Among other things, it is a GUI for the S.M.A.R.T.
> capabilities of disk drives.
>
> Palimpsest said "Current Pending Sector Count" is 64 after a long
> selftest. That means that there are 64 sectors that cannot be read.
>
> The "Reallocation Count" is 8. That means that there were 8 sectors
> that the disk firmware has judged to be bad or going bad and has
> "remapped" to spare. This is invisible to the computer: the new
> sectors appear to be at the original address.
>
> If I wrote something to one of those pending sectors, the firmware
> would remap it too. The Pending Sector Count would go down
> by one, and the Reallocation Count would go up by one. That's where
> the odd euphemistic term "Pending" comes from: they are awaiting a
> write so that they can be remapped.
>
> To be honest, I think I prefer the CLI tool smartctl(8) to
> Palimpsest(?).
>
> For a complete report:
> sudo smartctl -x /dev/sda
>
> To start a long test:
> sudo smartctl -t long /dev/sda
> The test runs a long time but the command returns immediately and the
> system continues to operate, and the drive remains available to it.
> The command will give you a guesstimate of when the test will complete.
> To find out the results, ask for another report when the test is
> complete (the report will say if the test isn't yet complete).
>
> Summary:
>
> - Current Pending Sector Count is the number of sectors that cannot be
> read by the drive and haven't subsequently been written to. They are
> bad news and the problem is visible to software.
>
> - Reallocation Count is old bad news. Everything is fine now. But,
> of course, it is a sign that the disk might not have a long and
> healthy life ahead of it.
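
A minimal sketch of pulling those two counters out of a SMART attribute
table, assuming the column layout that "smartctl -A" prints and the
attribute names Current_Pending_Sector and Reallocated_Sector_Ct (names
can differ between drives). The sample text is illustrative, not output
from the drive in this thread.

```python
def parse_smart_attributes(report):
    """Map attribute name -> raw value from `smartctl -A` style text."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit test.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


# Illustrative sample only (assumed layout, made-up rows).
SMART_SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       64
"""

attrs = parse_smart_attributes(SMART_SAMPLE)
print("unreadable, awaiting a write:", attrs["Current_Pending_Sector"])
print("already remapped to spares:", attrs["Reallocated_Sector_Ct"])
```

Watching how these two numbers move over a few days is probably more
telling than any single reading.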
>
>
> Now, what to do about this?
>
> badblocks(8) will look for bad blocks. It can optionally use
> nondestructive writes in the test (i.e. read a block and write the
> same contents back).
>
> The output of badblocks can be fed into e2fsck (the file system check
> command for ext2, ext3, and ext4 filesystems). But: badblocks needs
> to know the blocksize used by the filesystem. Unfortunately, the
> badblocks output does not declare the blocksize used, so e2fsck just
> assumes that it is right. On my system, the defaults were wrong
> (badblocks assumes 1k, my filesystem actually used 4k).
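
If a list was already produced in 1k units, it can be rescaled rather
than rerun. A rough sketch of the arithmetic; rescale_blocks and the
block numbers are made up for illustration. Passing the filesystem
blocksize to badblocks with -b in the first place avoids the problem.

```python
def rescale_blocks(blocks, from_size=1024, to_size=4096):
    """Convert block numbers counted in from_size units to to_size units."""
    if to_size % from_size != 0:
        raise ValueError("to_size must be a multiple of from_size")
    ratio = to_size // from_size
    # Several small blocks can fall inside one large block, so deduplicate.
    return sorted({b // ratio for b in blocks})


# Hypothetical 1 KiB block numbers, not taken from the drive in this thread.
print(rescale_blocks([4000, 4001, 4002, 4003, 8191]))
```

Note that one bad 1k block marks the whole enclosing 4k block as bad.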
>
> How can you discover the extN blocksize?
> tune2fs -l /dev/sda5
> This reports a lot of info, including the blocksize.
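
For scripting, the blocksize can be picked out of that report. A small
sketch, assuming the report contains a "Block size:" line as tune2fs -l
prints for extN filesystems; the sample text below is illustrative.

```python
def parse_block_size(tune2fs_output):
    """Return the filesystem block size in bytes, or None if not found."""
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Block size":
            return int(value.strip())
    return None


# Illustrative excerpt only, not the report from the disk in this thread.
TUNE2FS_SAMPLE = """\
Filesystem volume name:   <none>
Block count:              2621440
Block size:               4096
Fragment size:            4096
"""

print(parse_block_size(TUNE2FS_SAMPLE))
```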
>
> Safer is to use the -c parameter to e2fsck and have it run
> badblocks the right way itself. The downside is that you don't get to
> see the damage before acting on it.
>
> In my case, badblocks found four 4KiB bad blocks in the Fedora
> partition. That explains at most 32 512B bad blocks so there must be
> more lurking somewhere. Yikes.
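
The arithmetic behind that estimate, as a short sketch. It assumes
SMART counts 512-byte logical sectors and that every sector of each bad
block is unreadable, which is the worst case.

```python
LOGICAL_SECTOR = 512   # bytes, the sector size the drive presents
FS_BLOCK = 4096        # bytes, the filesystem blocksize from the post


def sectors_covered(bad_fs_blocks, fs_block=FS_BLOCK, sector=LOGICAL_SECTOR):
    """Upper bound on logical sectors inside the given number of bad blocks."""
    return bad_fs_blocks * (fs_block // sector)


explained = sectors_covered(4)   # four bad 4KiB blocks found by badblocks
pending = 64                     # Current Pending Sector Count from SMART
print(explained, pending - explained)
```

So at least 32 pending sectors have to lie outside those four blocks.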
>
> Getting e2fsck to handle the bad blocks is good because you can find
> out what the damage is in terms of files and inodes. What is
> unfortunate is that it doesn't allow the disk firmware to get rid of
> the bad sectors: they are still there and they will still cause
> problems if you read them. But the filesystem is adjusted so that
> they no longer are or will be used. dd'ing the whole raw disk will
> still hit them though.
>
> One unfortunate thing: it could be that only a part of a 4KiB block is
> lost (i.e. fewer than eight of the 512B sectors on the drive that hold
> one 4KiB block). e2fsck does not seem to try to recover and use this
> partial info. It should at least try to recover something if this
> contained inodes.
>
> Hmmm. When I ask for SMART data, without running another test, it
> seems to be down to 8 Pending (from 64) and up to 17 Reallocated (from
> 8 then 9). So maybe e2fsck does write to the bad blocks. And it
> appears that not all the errors are in the Fedora partition.
>
> Is it likely that what has gone wrong is a one-time thing or is it the
> start of a trend?
>
> - the bad sectors are not all in one place (three ranges in the Fedora
> partition, and at least one elsewhere). So it isn't a simple
> physical bad spot.
>
> - the Reallocated number went up from 8 to 9 for a reason I don't
> understand.
>
> This makes me think that I should just discard the drive because it
> will be more trouble than it is worth. But being a hacker, I will
> play with it a bit more.
>
> The disk is a Hitachi 5K750 HTS547575A9E384 750G 2.5" drive. I bought
> it as an external drive because that was cheaper than a bare drive. I
> cracked it open and installed it in my nettop. Consequently there is
> no warranty on it as far as I know.
>
> Reading a review of the drive,
> All models in this product range employ AF and emulate
> 512-byte sectors.
> That means that the actual sector size is 4KiB (probably). So the
> filesystem's 4KiB block size is appropriate.
>
> So why would any of the SMART sector counts be other than a multiple
> of 8? How could there be 9 or 17 reallocated sectors when
> reallocation must be done 8 at a time?

The firmware might be smart enough to do partial remapping, waiting
for the remaining logical sectors in the larger physical sector to be
written before finishing the remapping.

> Hitachi took over IBM's disk business. Recently HGST, the disk
> business, was sold to Western Digital. So I guess the ERC will
> disappear. Right now, smartctl reports:
>
> SCT capabilities: (0x003d) SCT Status supported.
>                            SCT Error Recovery Control supported.
>                            SCT Feature Control supported.
>                            SCT Data Table supported.
>
> I think that SCT Error Recovery Control supported means that you
> should be able to ask the drive to limit how long it will attempt to
> read bad sectors.
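
On drives that support it, smartctl can set that limit with its
"-l scterc,READ,WRITE" option, where both values are in tenths of a
second. A sketch that only builds the command; scterc_command is a
made-up helper and the 7 second limits are just an example, not a
recommendation for this drive.

```python
def scterc_command(device, read_seconds, write_seconds):
    """Build a smartctl command that sets SCT ERC read/write time limits."""
    # smartctl expects the limits in tenths of a second (deciseconds).
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# Example values only; run the resulting command as root.
print(" ".join(scterc_command("/dev/sda", 7, 7)))
```

Running "smartctl -l scterc /dev/sda" with no values should report the
current setting. On many drives the setting does not survive a power
cycle, so it may need to be reapplied at boot.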
--
Len Sorensen
--
The Toronto Linux Users Group. Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists