[GTALUG] long war story: growing the ESP (/boot/efi)

Thu Jul 15 12:57:23 EDT 2021

On Thu 2021-07-15 @ 10:37:03 AM, D. Hugh Redelmeier via talk wrote:
> NAND flash
> 
> - allows updates in pages, but only to previously erased pages

NAND flashes are divided into pages, pages are organized into blocks, and a
bunch of blocks go together to form planes. Modern devices also have sub-pages.
Erasing usually occurs on whole blocks, but reads and writes happen a page (or
sub-page) at a time.
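
To make the hierarchy concrete, here is a minimal sketch that splits a linear
byte offset into plane/block/page/column. The geometry numbers (2048-byte
pages, 64 pages per block, 1024 blocks per plane) are made up for
illustration, and the sketch assumes planes are laid out one after another;
some real parts interleave blocks across planes instead, so check the
datasheet.

```python
from typing import NamedTuple


class Geometry(NamedTuple):
    page_size: int         # data bytes per page (OOB not counted)
    pages_per_block: int
    blocks_per_plane: int


class NandAddress(NamedTuple):
    plane: int
    block: int    # block index within the plane
    page: int     # page index within the block
    column: int   # byte offset within the page


def erase_block_size(geo: Geometry) -> int:
    """Data bytes covered by one erase operation."""
    return geo.page_size * geo.pages_per_block


def decompose(offset: int, geo: Geometry) -> NandAddress:
    """Split a linear byte offset, assuming planes are laid out contiguously."""
    page_no, column = divmod(offset, geo.page_size)
    block_no, page = divmod(page_no, geo.pages_per_block)
    plane, block = divmod(block_no, geo.blocks_per_plane)
    return NandAddress(plane, block, page, column)


# Hypothetical geometry, not taken from any particular datasheet.
GEO = Geometry(page_size=2048, pages_per_block=64, blocks_per_plane=1024)
print(erase_block_size(GEO))
print(decompose(131072 + 3 * 2048 + 5, GEO))
```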

> - erases can only be done in large blocks.  The size of these blocks is
>   generally secret but I think that they are in the range of 128KiB to
>   1MiB.

The size of the erase block is *always* stated in each device's datasheet.
How else would we be able to configure the SoC's NAND controller? The
controller needs to know the size of the erase block in order to work with
the NAND correctly.
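
Besides the datasheet, on a running Linux system the geometry the driver
ended up with is visible through sysfs, under /sys/class/mtd/mtdN/. A small
sketch that reads it follows; the function name and the choice of mtd0 are
just for illustration, and the sysfs_root parameter only exists so the
function can be pointed at a test directory.

```python
from pathlib import Path


def read_mtd_geometry(mtd: str = "mtd0",
                      sysfs_root: str = "/sys/class/mtd") -> dict:
    """Read erase block, page, and OOB sizes (in bytes) for one MTD device."""
    base = Path(sysfs_root) / mtd
    geometry = {}
    for attr in ("erasesize", "writesize", "oobsize"):
        # Each attribute is a one-line file holding a decimal number.
        geometry[attr] = int((base / attr).read_text().strip())
    return geometry
```

On a board with raw NAND, read_mtd_geometry("mtd0") should return something
like an erasesize of 131072 and a writesize of 2048, but the actual values
depend entirely on the part.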

> - are so alien to deal with that they have a complicated controller
>   between them and your system.  The controller emulates a disk drive.
>   This does wear levelling.  There is some work at moving this into
>   the OS for performance reasons, but it is not common and only for
>   data-centres.

You're talking about "managed flash" devices, such as eMMC. True, these do
have a built-in "invisible" controller between you and the flash that does
wear leveling etc. These devices generally look like hard drives to the host.

However, many embedded systems have "raw flash" NAND devices on them. You're
welcome to interact with them directly if you wish, but you risk wearing them
out prematurely due to the lack of wear leveling etc. In this case the
smartest way to work with them is via the kernel's mtd subsystem, which
provides a software-based abstraction between your filesystem and the raw
device. It performs all the same duties as the aforementioned controller used
with managed devices (and often has several additional features). If nothing
else, with the software mtd layer you at least know, and are in control of,
what's going on.
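
As a rough idea of what "wear leveling" means, here is a toy allocator that
always hands out the free block with the fewest erases. This is only an
illustration of the principle; it is not how the kernel or any managed
device actually implements it, and the class and method names are invented.

```python
class ToyWearLeveler:
    """Hand out the least-erased free block; count an erase on release."""

    def __init__(self, num_blocks: int) -> None:
        self.erase_counts = [0] * num_blocks
        self.free = set(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise RuntimeError("no free blocks")
        # Lowest erase count wins; ties are broken by block number.
        block = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(block)
        return block

    def release(self, block: int) -> None:
        # A block must be erased before it can be programmed again.
        self.erase_counts[block] += 1
        self.free.add(block)


wl = ToyWearLeveler(3)
for _ in range(300):
    wl.release(wl.allocate())
print(wl.erase_counts)  # the 300 erases are spread evenly: [100, 100, 100]
```

Without this, rewriting the same logical location 300 times would put all
300 erases on one physical block.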

Ideally you should use an mtd-aware filesystem on top of the kernel's mtd
subsystem. Currently the best filesystems to use on top of mtd with raw
NAND flash are UBI/UBIFS and F2FS. JFFS2 used to be a popular choice, but is
considered legacy at this point.

> - density is everything in NAND flash systems.  The original ones used
>   a single bit per cell (SLC) but now 3 bits per cell is common.  To store 3
>   bits per cell, 8 charge levels must be distinguished.  This makes
>   them much slower and less reliable (shorter-lived).
> 
> - each cell can only be erased a small number of times (perhaps 1000)
>   but the wear-levelling usually prevents this being a big problem.
>   It is not part of published specifications.

Although SLC was the original (and therefore "old") technology, SLC devices
are still readily available and very actively used. For example:

https://www.digikey.ca/en/products/filter/memory/774?s=N4IgTCBcDaIM4BsDGACAdgQzQExAXQF8g

SLC NAND flash is fast and usually supports a minimum of 100,000 erase
cycles, but it is much more expensive. The first page/block of SLC is usually
guaranteed not to wear out (never? its rated erase count is an order of
magnitude or two higher than that of the rest of the flash). So based on your
requirements you can go either expensive but fast and long-lasting, or cheap,
slow, and shorter-lived. All types of flash are still actively produced.
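
To put the endurance numbers in perspective, here is a back-of-the-envelope
lifetime estimate. It assumes perfect wear leveling, and the 1 GiB capacity
and 10 GiB/day write load are made-up figures; the write_amplification
parameter is likewise an assumption you would have to measure for your own
workload.

```python
def estimated_lifetime_years(capacity_bytes: int,
                             erase_cycles: int,
                             bytes_written_per_day: float,
                             write_amplification: float = 1.0) -> float:
    """Years until every block hits its rated erase count (ideal leveling)."""
    total_writable = capacity_bytes * erase_cycles / write_amplification
    return total_writable / bytes_written_per_day / 365


GIB = 2 ** 30
# Same hypothetical device and workload, different rated endurance.
for cycles in (100_000, 1_000):
    years = estimated_lifetime_years(1 * GIB, cycles, 10 * GIB)
    print(f"{cycles:>7} cycles: {years:.2f} years")
```

With these numbers, 100,000 cycles works out to roughly 27 years and 1,000
cycles to roughly 100 days.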

The minimum number of expected erase cycles for any given device is *always*
specified in the datasheet along with the expected data retention length. Note
that the endurance of the device is almost always specified along with a given
expected level of ECC. So the datasheet will say, for example:

	endurance: typical 60k cycles (with 8-bit ECC per (512+32) bytes)

In this case 512 is the sub-page size in bytes and 32 is the number of OOB
(out-of-band) bytes that go with it.
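
Scaling that "(512+32)" figure up to a whole page is simple arithmetic. The
sketch below does it for a hypothetical 4096-byte page; the function name
and the page size are assumptions, only the 512/32/8-bit numbers come from
the example line above.

```python
def ecc_layout(page_size: int, step_size: int = 512,
               oob_per_step: int = 32, ecc_bits_per_step: int = 8) -> dict:
    """Per-page totals for a part specified as N-bit ECC per (step+oob) bytes."""
    if page_size % step_size:
        raise ValueError("page size must be a multiple of the ECC step size")
    steps = page_size // step_size
    return {
        "ecc_steps": steps,
        "oob_bytes": steps * oob_per_step,
        "raw_page_bytes": page_size + steps * oob_per_step,
        # Best case only: errors must not exceed ecc_bits_per_step in any step.
        "correctable_bits_best_case": steps * ecc_bits_per_step,
    }


print(ecc_layout(4096))
```

Note that the ECC strength is per step: nine bad bits inside a single
512-byte step are uncorrectable even if the rest of the page is clean.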

While single bit per cell designs were called SLC (single-level cell), two
bits per cell designs were called "multi-level cell" (MLC). When three bits
per cell technology came along it was called TLC (triple-level cell). So the
nomenclature goes:

	single → multi → triple

:-)
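
The names are misleading in another way too: what grows is the number of
bits, while the number of charge levels the cell must distinguish doubles
with each added bit. A tiny sketch of that relationship:

```python
# Bits stored per cell for each naming scheme mentioned above.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3}


def charge_levels(bits_per_cell: int) -> int:
    """Distinct charge levels needed to encode the given number of bits."""
    return 2 ** bits_per_cell


for name, bits in CELL_TYPES.items():
    print(f"{name}: {bits} bit(s) per cell, {charge_levels(bits)} levels")
```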

> - it is unreasonable to run programs directly out of NAND flash.  They
>   can, however, be paged in, just as one would do from a HDD.
>   This wrecks real-time performance, so routers would not do this.
>   They copy the contents of flash to RAM as part of booting.

True. It's simply not possible to run code from a device that needs to be
accessed a page (or sub-page) at a time. A CPU expects byte (or at least
word) access to the instruction stream. Therefore, if a raw NAND contains an
executable, it must be copied to some other byte-accessible medium (e.g.
static RAM, dynamic RAM, etc.) before it can be consumed by the CPU.
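
The copy step can be sketched as follows. ToyNand is an invented stand-in
that only offers whole-page reads, and the 2048-byte page size is
hypothetical; a real boot loader would also have to deal with ECC and bad
blocks, which this ignores.

```python
PAGE_SIZE = 2048  # hypothetical page size in bytes


class ToyNand:
    """Stand-in for a raw NAND: the only access method is a whole-page read."""

    def __init__(self, contents: bytes, page_size: int = PAGE_SIZE) -> None:
        self.page_size = page_size
        # Pad to a whole number of pages; erased NAND reads back as 0xFF.
        padding = (-len(contents)) % page_size
        self._data = contents + b"\xff" * padding

    def read_page(self, page_no: int) -> bytes:
        start = page_no * self.page_size
        if page_no < 0 or start >= len(self._data):
            raise IndexError("page out of range")
        return self._data[start:start + self.page_size]


def load_image(nand: ToyNand, image_len: int) -> bytearray:
    """Copy an image page by page into a byte-addressable buffer ("RAM")."""
    ram = bytearray()
    page_no = 0
    while len(ram) < image_len:
        ram += nand.read_page(page_no)
        page_no += 1
    del ram[image_len:]  # drop the padding from the last page
    return ram


image = bytes(range(256)) * 20          # 5120 bytes, i.e. 2.5 pages
ram = load_image(ToyNand(image), len(image))
print(len(ram), ram[3000] == image[3000])
```

Once the image is in the buffer, any single byte can be fetched directly,
which is what the CPU needs for its instruction stream.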