[GTALUG] New Desktop PC -- debian Linux - Proposed 2 TB HDD Partitioning;

D. Hugh Redelmeier hugh at mimosa.com
Wed Apr 18 17:12:39 EDT 2018



| From: Lennart Sorensen via talk <talk at gtalug.org>
| 
| > I guess that would mean that scattering unused space on an SSD between 
| > the partions, means the controller probably sees it as being used. I 
| > left chunks allocated at the ends of the drives as recommended. I was 
| > just wondering if my stripes would increase that wear level 
| > capability, as well as providing for emergency recovery space(s).
| 
| Trying to guess how a drive does its wear leveling is impossible.
| Even if you are buying SSDs directly from the manufacturer and have a
| relationship with them, they usually won't tell you how it works.

There are some things that are pretty well-known at the moment.  Things 
may change in the future.

| Usually the drive has some extra space by design that it can use as a
| pool for writes, and then the old blocks are erased and put into the pool.
| If you use trim, you can add currently unused space in the filesystem to
| that free pool too.  Some drives will occasionally move data that never
| changes from blocks that have very few writes to blocks that are more worn
| in the hopes that it will then be able to use those better blocks for more
| frequently changing data, but simpler drives may not do such housekeeping.

I don't understand what you are saying here.

What you want to avoid is "write amplification".  Every write to the 
device will cause at least one flash write.  But some cause a lot more, 
and you want to reduce that effect.  Since individual writes are lumpy,
what we care about is the average write amplification.

Write amplification gets really bad when a drive is too full.  And it
isn't too bad when there's a fair bit of free space (for normal
workloads).  Counter-intuitively, the graph of this is sort of a
hockey stick.  The transition from good to bad is fairly sharp.
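The hockey stick can be reproduced with a toy simulation.  This is a
sketch, not how any real drive's firmware works: it assumes uniform
random overwrites and the simplest greedy garbage collector (victim =
the sealed block with the fewest valid pages); the block and page
counts are made up.

```python
import random

def simulate_wa(num_blocks=64, pages_per_block=32, spare_frac=0.10,
                host_writes=10000, seed=0):
    """Toy flash translation layer.  Returns average write amplification
    = flash page programs / host page writes, in steady state."""
    rng = random.Random(seed)
    total = num_blocks * pages_per_block
    logical = int(total * (1.0 - spare_frac))   # OS-visible pages

    valid = [0] * num_blocks      # live-page count per erase block
    loc = {}                      # logical page -> erase block holding it
    free = list(range(num_blocks))
    active, fill = free.pop(), 0  # block currently accepting writes
    flash = 0                     # flash page programs performed

    def collect():
        # Victim: sealed block with fewest valid pages.  Buffer its live
        # data, erase it, then rewrite that data (the extra writes).
        victim = min((b for b in range(num_blocks)
                      if b != active and b not in free),
                     key=lambda b: valid[b])
        movers = [lp for lp, b in loc.items() if b == victim]
        for lp in movers:
            del loc[lp]
        valid[victim] = 0
        free.append(victim)       # victim erased, back in the pool
        for lp in movers:
            write(lp)             # relocation writes count toward WA

    def write(lp):
        nonlocal active, fill, flash
        if fill == pages_per_block:          # active block is full
            while not free:
                collect()
            active, fill = free.pop(), 0
        if lp in loc:                        # old copy goes stale
            valid[loc[lp]] -= 1
        loc[lp] = active
        valid[active] += 1
        fill += 1
        flash += 1

    for lp in range(logical):                # fill the drive once
        write(lp)
    flash = 0                                # measure steady state only
    for _ in range(host_writes):
        write(rng.randrange(logical))
    return flash / host_writes

for s in (0.05, 0.10, 0.20, 0.30):
    print(f"spare {s:.0%}: write amplification ~ "
          f"{simulate_wa(spare_frac=s):.2f}")
```

Running it shows amplification falling quickly as spare space grows,
with the steep part of the curve at the nearly-full end.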

All drives have more than the OS-visible space (overprovisioning).  Cheap 
drives have less than "enterprise" drives.

Leaving some of your disk unused, in a way that the drive firmware
knows about, adds to the overprovisioning.  I think it's a reasonable idea.
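The arithmetic is simple.  The numbers below are assumptions for
illustration only: a "2 TB" drive advertises decimal terabytes of user
space, and the raw NAND inside is often a power-of-two amount (real
drives vary, and vendors rarely say).

```python
# Assumed figures, for illustration -- not from any particular drive.
user = 2 * 1000**4          # advertised capacity: 2 TB (decimal)
nand = 2 * 1024**4          # assumed raw flash: 2 TiB (binary)

inherent_op = nand / user - 1
print(f"inherent overprovisioning: {inherent_op:.1%}")   # roughly 10%

# Leave 10% of the user space unpartitioned (never written, or trimmed),
# so the firmware can treat it as extra spare area.
effective_op = nand / (0.90 * user) - 1
print(f"with 10% left unpartitioned: {effective_op:.1%}")
```

The point is that a modest slice of unpartitioned space roughly doubles
the spare area on a cheap drive, which is where the hockey-stick curve
says it matters most.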

None of this has any effect on reads.  Well, if the drive is busy
doing garbage collection or write-amplified writes, I guess reading
would be slowed down.

Erase blocks, the unit of erasing in the raw hardware, are quite a
lot bigger than filesystem blocks.  (If they were not, there would be
no reason for write amplification.)

When the drive firmware gets a write request, it needs to have an
empty (erased) block to write it to.  If it doesn't have one, it must
find one, using garbage collection.  That generates more writes behind
the scenes.

Remember: when you rewrite a block, the flash block cannot be
rewritten in situ: the raw program operation can only change bits from
1 to 0, never back, so the block first needs to be erased (set back to
all 1s).  And you cannot erase a block without erasing a lot of
adjacent blocks: a whole erase block.
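You can see why in-place rewrite fails with a little bit arithmetic:
overwriting without an erase behaves like ANDing the new data into
whatever is already in the cells (a simplified model of NAND
programming).

```python
def program(cell, data):
    """Model of programming without an erase: bits can only be
    cleared (1 -> 0), never set, so the result is cell AND data."""
    return cell & data

erased = 0b1111                    # erase sets every bit to 1
first = program(erased, 0b1010)    # first write lands correctly
second = program(first, 0b0110)    # overwrite WITHOUT erasing first

print(bin(first))                  # 0b1010, as intended
print(bin(second))                 # 0b10: corrupted, not 0b0110
```

That is why the firmware always writes new data to an already-erased
page and marks the old copy stale, rather than rewriting in place.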

| There really isn't any way to know, unless they choose to advertise it.
| Of course it is likely a drive with a much higher promised number of
| write cycles likely is doing smarter housekeeping to keep block wear as
| even as possible.

More things that I think we know:

I think that the housekeeping is fairly well understood.  But they
keep adding tricks.

Some SSDs have RAM buffers.  Short-lived blocks might never hit the
flash memory.  Probably only expensive / enterprise drives these days.

Some SSDs use a portion of flash in pseudo-SLC mode for buffering.  I
don't know exactly how it is used but one could imagine it is like the
RAM buffer would be.

SLC flash stores one bit per flash cell.  MLC stores several bits per 
cell, but in common usage it is 2 bits per cell.  TLC stores three bits 
per cell.
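The density difference is plain arithmetic (the cell count here is
just a made-up round number):

```python
cells = 1_000_000                       # assumed: one million flash cells
for name, bits_per_cell in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    print(f"{name}: {cells * bits_per_cell:>9,} bits from {cells:,} cells")
```

Each extra bit per cell multiplies capacity for the same silicon, at
the cost of squeezing more voltage levels into one cell.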

SLC is less dense (obviously), so it is more expensive, and it seems
to be impossible to get these days, but it had great speed and
reliability advantages.

Older drives are MLC, newer ones are TLC.

What I wonder about:

How stable is flash?  There are hints that it needs to be refreshed once 
in a while (months?  years?).  Is this done automatically?

Cheap SSDs can be corrupted in a power failure.  More expensive ones
have a bit of power reserve (supercapacitor?) to put your data to bed
before powering down.  How can you know which kind you are buying?  Why is 
this not a scandal?

| I am not currently convinced that keeping unallocated space is worth it.
| Sure you make the free pool a bit larger, but you still end up writing
| the same amount of blocks and you make the usable size smaller.  Having a
| larger free pool might help for systems that do a lot of writes since
| you are more likely to be able to have a free block to do a write,
| while the drive hasn't had time to erase the old blocks.  On the other
| hand if you are doing enough writing that it could be a problem, maybe
| an SSD is the wrong type of drive to be using.

The hole in that logic is that erase blocks are a lot larger than
filesystem blocks.  So you potentially end up with a bunch of erase
blocks containing some free but not-yet-erased fs blocks.  All the
still-live fs blocks must be copied to a not-yet-filled erase block
before these swiss-cheese erase blocks can be erased and added to the
free pool.
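To put a number on that copying cost (the sizes here are illustrative
assumptions, not any particular drive's geometry):

```python
pages_per_block = 128     # assumed: fs-sized pages per erase block
live = 96                 # live fs blocks still sitting in the victim

copies = live                          # writes just to rescue live data
reclaimed = pages_per_block - live     # erased pages actually gained
amplification = (copies + reclaimed) / reclaimed
print(f"{copies} copy writes to reclaim {reclaimed} pages: "
      f"future data landing there costs {amplification:.1f}x writes")
```

The fuller the swiss-cheese block, the worse the ratio gets, which is
why extra spare space (letting GC find emptier victims) pays off even
though the total amount of host data written is unchanged.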

