[GTALUG] war story: power failure makes CentOS system unbootable

D. Hugh Redelmeier hugh at mimosa.com
Thu Jul 6 13:12:11 EDT 2017


tl;dr / spoiler: change fstab so that /boot/efi is not automatically
mounted.  See the recommendation at the end of this message.

My gateway computer is a little PC running CentOS.
It does not come back after a power failure.
The reason (as best I can tell) is interesting and I think that I have a 
fix.

My system is UEFI.  It boots from a UEFI system partition, known as
/boot/efi to Linux.  If this gets corrupted, it won't boot.  It is a
VFAT partition.

On my gateway (a Zotac Zbox RI321, a cute little box with two ethernet
ports), the UEFI firmware apparently won't boot if the dirty bit is
set in the system partition.

CentOS normally runs with /boot/efi mounted.  So when the system isn't 
shut down cleanly, the dirty bit will be on.

Consequence: the system is not going to boot after a power failure.

Odd observation 1: if I put a live Fedora 25 USB stick in the system
and try to boot in this otherwise-unbootable state, CentOS boots from
the main disk.  So this really looks like a firmware bug/quirk.

Odd observation 2: fsck doesn't seem to automatically fix the system
partition.  Once CentOS 7 booted, I unmounted /boot/efi and
ran fsck on it.

    $ sudo fsck /dev/sda1
    fsck from util-linux 2.23.2
    fsck.fat 3.0.20 (12 Jun 2013)
    0x25: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
    1) Remove dirty bit
    2) No action
    ? 1
    Leaving filesystem unchanged.
    /dev/sda1: 16 files, 2420/51145 clusters

Googling led me to <https://www.centos.org/forums/viewtopic.php?t=50917>.
In particular, this advice seemed quite good:

==== recommendation ====

Two other changes I recommend for UEFI systems, to each OS's /etc/fstab.

- For the /boot/efi mountpoint, add the mount options 
  x-systemd.automount,noauto

- Change fs_passno (last column) to a 1 or 2; the canonical fstab 
  instructions suggest 2, but systemd treats 1 and 2 as the same.

The first change means the EFI System partition will not be automatically
mounted read-write at boot time.  It's a bad idea that this is the default,
because it puts the ESP at risk, especially if there are crashes: FAT has
no journal and will therefore always be marked dirty in such a case.  No
other UEFI OS mounts the ESP by default.  Second, if anything tries to
access /boot/efi (read or write), systemd will automatically mount it, and
because of the fs_passno of 1 or 2, it will fsck it first, which fixes and
clears the dirty bit in case it's set.
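
As a rough illustration (not taken from the forum post), the two changes
might turn an /boot/efi entry from the first line below into the second.
The UUID and the umask/shortname options are placeholders of the kind an
installer typically writes; keep whatever your own entry already has, add
only the two mount options, and change the last column.

    # before (assumed installer default; UUID is a placeholder)
    UUID=XXXX-XXXX  /boot/efi  vfat  umask=0077,shortname=winnt  0 0
    # after
    UUID=XXXX-XXXX  /boot/efi  vfat  umask=0077,shortname=winnt,x-systemd.automount,noauto  0 2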

Right now, without these changes, it only takes the right number of badly
timed crashes to corrupt the EFI System partition.
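
For anyone with several machines to fix, the same two edits can be scripted.
The following is only a minimal sketch in Python, assuming a conventional
whitespace-separated fstab; the function name harden_esp_entry is
hypothetical, and it returns the modified text rather than touching any
file.

```python
def harden_esp_entry(fstab_text, mountpoint="/boot/efi"):
    """Return fstab_text with the two recommended changes applied.

    Hypothetical helper: adds x-systemd.automount,noauto to the options of
    the given mount point and changes an fs_passno of 0 to 2.
    """
    wanted = ["x-systemd.automount", "noauto"]
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave comments, blank lines, and other mount points untouched.
        if (not fields or fields[0].startswith("#")
                or len(fields) < 4 or fields[1] != mountpoint):
            out.append(line)
            continue
        # fstab allows the last two columns to be omitted; both default to 0.
        fields += ["0"] * (6 - len(fields))
        options = fields[3].split(",")
        options += [opt for opt in wanted if opt not in options]
        fields[3] = ",".join(options)
        # Only enable fsck if it was disabled; an existing 1 or 2 is kept.
        if fields[5] == "0":
            fields[5] = "2"
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"
```

One cautious way to use it is to pass in the contents of /etc/fstab, write
the result to a scratch file, and diff that against the original before
copying anything into place.  Running the function a second time on its own
output leaves the text unchanged.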

