Overlay rootfs

From regional-training

introduction

All files accessible in a Unix system are arranged in one big tree, the file hierarchy, rooted at /. These files can be spread out over several devices.

The mount command serves to attach the filesystem found on some device to the big file tree. Conversely, the umount command will detach it again.

The filesystem is used to control how data is stored on the device or provided in a virtual way by network or other services.

Filesystem entities are unified through inode structures, which contain the following metadata for each entity: [1]

  • inode-number
  • type:
    • Regular File
    • Directory
    • Block Special File
    • Character Special File
    • Symbolic Link (soft link)
    • Shadow (used for ACL)
    • FIFO (named pipe)
    • Attribute directory
    • Socket
  • flags
  • Generation
  • Version
  • filesystem entry access
    • mode
    • Uid
    • Gid
    • Project - used to group related files and for applying project based quota limits
    • Size - number of bytes of the data block it links to
    • File ACL
  • Size
  • Links - number of directory entries referencing
  • Blockcount - sum of allocated inode blocks
  • Fragment:
    • Address
    • Number
    • Size
  • ctime: Change time - date/time of the last change in status, such as file permissions, ownership, location, file type etc
  • atime: Access time - date/time the directory entry was last read
  • mtime: Modification time - date/time the directory entry was last written
  • crtime: Creation time - the birth date/time of the directory entry e.g. creation time of a file
  • Inode checksum

Examples:

$ stat dev
  File: dev
  Size: 4220      	Blocks: 0          IO Block: 4096   directory
Device: 0,5	Inode: 1           Links: 23
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-09-21 09:33:16.063621989 +1000
Modify: 2025-09-21 09:33:35.422094883 +1000
Change: 2025-09-21 09:33:35.422094883 +1000
 Birth: 2025-09-21 09:33:07.068000000 +1000

[note 1]

$ stat /
  File: /
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 254,2	Inode: 2           Links: 24
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-09-20 06:59:38.857441662 +1000
Modify: 2025-09-01 20:19:21.045193653 +1000
Change: 2025-09-01 20:19:21.045193653 +1000
 Birth: 2023-07-31 13:31:50.000000000 +1000
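The same fields that stat prints can be read programmatically. The following is a minimal sketch in Python, assuming a Unix-like system; the helper name describe_inode and the sample file example.txt are made up for illustration.

```python
import os
import stat
import tempfile


def describe_inode(path):
    """Return selected inode fields for path, without following symlinks."""
    st = os.lstat(path)
    return {
        "inode": st.st_ino,
        # First character of the mode string: 'd' directory, '-' regular file,
        # 'l' symbolic link, 'b'/'c' block/character special, 'p' FIFO, 's' socket.
        "type": stat.filemode(st.st_mode)[0],
        "mode": oct(stat.S_IMODE(st.st_mode)),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "links": st.st_nlink,
        "size": st.st_size,
        "blocks": st.st_blocks,  # reported in 512-byte units on Linux
        "device": (os.major(st.st_dev), os.minor(st.st_dev)),
    }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as fh:
        fh.write("hello\n")
    info = describe_inode(path)
    for key, value in info.items():
        print(f"{key:>7}: {value}")
```

The inode number and device pair printed here correspond to the "Inode:" and "Device:" fields in the stat output above.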

Alongside the inodes, directory entries contain the mapping between a directory entity (its name) and its inode.

When a filesystem is created a fixed number of inodes are set aside but are not allocated until they are needed.

Each inode contains a count of the number of directory entries linked to it. The fsck command verifies the link count of each inode by examining the entire directory structure, starting from the root directory, and calculating an actual link count for each inode.
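The link-count check can be imitated from user space. This is only a sketch of the idea, not how fsck is implemented: it walks a directory tree, counts the names that point at each inode, and compares that with the link count stored in the inode. The function name link_count_mismatches and the sample tree are hypothetical.

```python
import os
import tempfile
from collections import Counter


def link_count_mismatches(root):
    """Map (device, inode) to (stored, counted) where the two values differ."""
    seen = Counter()  # (device, inode) -> directory entries found under root
    stored = {}       # (device, inode) -> link count recorded in the inode
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            seen[key] += 1
            stored[key] = st.st_nlink
    return {k: (stored[k], seen[k]) for k in seen if stored[k] != seen[k]}


with tempfile.TemporaryDirectory() as tmp:
    tree = os.path.join(tmp, "tree")
    os.makedirs(os.path.join(tree, "sub"))
    data = os.path.join(tree, "data")
    open(data, "w").close()
    os.link(data, os.path.join(tree, "sub", "alias"))

    # Scanning the whole tree sees both names, so the counts agree.
    whole_tree = link_count_mismatches(tree)
    # Scanning only part of it sees one of the two names.
    partial = link_count_mismatches(os.path.join(tree, "sub"))
    print(whole_tree, partial)
```

The partial scan reports a mismatch, which illustrates why the check has to start from the root directory and examine the entire structure.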

Each inode contains a list, or pointers to lists (indirect blocks), of all the blocks claimed by the inode.

Each inode contains a count of the number of data blocks that it references, which should equal the sum of the allocated blocks and the indirect data blocks.

Each inode contains a 64-bit size field, which shows the number of characters (data bytes) in the file associated with the inode.
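The size field and the block count are independent values. A small sketch, assuming a filesystem that supports sparse files (the file name sparse.img is arbitrary): the size is set to 1 MiB without writing any data, so the logical size and the allocated space can differ.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.img")
    with open(path, "wb") as fh:
        fh.truncate(1024 * 1024)  # set the size field without writing data

    st = os.stat(path)
    logical_size = st.st_size            # bytes, from the inode size field
    allocated_bytes = st.st_blocks * 512  # st_blocks is in 512-byte units on Linux
    print("size:", logical_size, "allocated:", allocated_bytes)
```

How much space is actually allocated depends on the underlying filesystem, so no particular value is shown for it here.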

ext2, ext3 and ext4 filesystem inodes can be explored and managed with debugfs [2] [3]:

apt install e2fsprogs

In an ext4 filesystem, a directory more or less maps an arbitrary byte string (usually ASCII, up to 255 bytes) to an inode number on the filesystem. There can be many directory entries across the filesystem that reference the same inode number; these are known as hard links. Because an inode number is only meaningful within its own filesystem, hard links cannot reference files on other filesystems. A directory entry is found by reading the data block(s) associated with a directory file and searching them for the desired name.
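The name-to-inode mapping is visible from user space on any Unix filesystem, not only ext4. A minimal sketch, with made-up file names, that lists one directory as a name -> inode table and shows two hard-linked names sharing an inode number:

```python
import os
import tempfile


def directory_map(path):
    """Return {name: inode number} for the entries of one directory."""
    with os.scandir(path) as entries:
        return {entry.name: entry.inode() for entry in entries}


with tempfile.TemporaryDirectory() as tmp:
    for name in ("report.txt", "other.txt"):
        open(os.path.join(tmp, name), "w").close()
    # A hard link is just a second directory entry for the same inode.
    os.link(os.path.join(tmp, "report.txt"), os.path.join(tmp, "copy.txt"))

    names = directory_map(tmp)
    for name, inode in sorted(names.items()):
        print(inode, name)
```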

There are two forms of directory entry for ext4:

  • hashed entries, or
  • linear

A linear directory is a series of data blocks, each of which contains a linear array of directory entries. The end of each per-block array is signified by reaching the end of the block; the last entry in the block has a record length that takes it all the way to the end of the block. The end of the entire directory is of course signified by reaching the end of the file. Unused directory entries are signified by inode = 0.
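The record layout can be illustrated by packing and re-reading one directory block in memory. This sketch assumes the struct ext4_dir_entry_2 layout (32-bit inode, 16-bit record length, 8-bit name length, 8-bit file type, then the name, all little-endian), a 1024-byte block and no checksum tail entry; the helper names and the sample entries are invented for the example.

```python
import struct

BLOCK_SIZE = 1024                 # assumption; 4096 is also common
HEADER = struct.Struct("<IHBB")   # inode, rec_len, name_len, file_type


def pack_block(entries, block_size=BLOCK_SIZE):
    """Pack (inode, name, file_type) tuples into one directory block."""
    block = bytearray()
    for index, (inode, name, file_type) in enumerate(entries):
        rec_len = (HEADER.size + len(name) + 3) & ~3  # round up to 4 bytes
        if index == len(entries) - 1:
            rec_len = block_size - len(block)         # last entry fills the block
        block += HEADER.pack(inode, rec_len, len(name), file_type)
        block += name.ljust(rec_len - HEADER.size, b"\0")
    return bytes(block)


def parse_block(block):
    """Return [(inode, name)] for the entries in use (inode != 0)."""
    found, offset = [], 0
    while offset < len(block):
        inode, rec_len, name_len, _ftype = HEADER.unpack_from(block, offset)
        if rec_len < HEADER.size:
            raise ValueError("corrupt record length")
        if inode != 0:
            start = offset + HEADER.size
            found.append((inode, block[start:start + name_len].decode()))
        offset += rec_len  # rec_len, not name_len, leads to the next entry
    return found


sample = pack_block([
    (2, b".", 2), (2, b"..", 2),   # file type 2 = directory
    (0, b"deleted", 1),            # inode 0 marks an unused entry
    (12, b"notes.txt", 1),         # file type 1 = regular file
])
entries = parse_block(sample)
print(len(sample), entries)
```

Walking by record length rather than name length is what lets the last entry absorb the free space at the end of the block.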

A hash tree directory was added in ext3 to provide a faster (but peculiar) balanced tree keyed off a hash of the directory entry name; it is in use when the EXT4_INDEX_FL (0x1000) flag is set in the inode.

The root of the tree is stored in the first data block of the directory, and by custom the '.' and '..' entries must appear at the beginning of that first block. The rest of the hash tree root node contains metadata about the tree and finally a hash->block map that permits finding nodes lower in the tree. Below the root the tree has two kinds of structures: interior nodes, which contain a zeroed-out struct ext4_dir_entry_2 followed by a minor hash->block map, and leaf nodes, which contain a linear array of struct ext4_dir_entry_2 entries (which presumably all hash to the same value, i.e. they are overflow chains). If a leaf overflows, the chain continues in the next leaf-node block.

To traverse the directory as an htree, the code calculates the hash of the desired filename and uses it to find the corresponding block number. If the tree is flat, the block is a linear array of directory entries that is searched; otherwise the minor hash of the filename is used against the block to find the corresponding next block, which will be a linear array of directory entries.
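The flat (single-level) case of this lookup can be modelled in a few lines. This is a simplified sketch, not the ext4 on-disk format: zlib.crc32 stands in for the real ext4 hash functions, leaves are plain Python lists, and the names build_index, lookup_name and the sample file names are hypothetical.

```python
import bisect
import zlib

NAMES = ["alpha", "bravo", "charlie", "delta", "echo",
         "foxtrot", "golf", "hotel", "india", "juliet"]


def name_hash(name):
    """Stand-in hash for illustration only; ext4 uses its own hash functions."""
    return zlib.crc32(name.encode())


def build_index(names, leaf_capacity=3):
    """Sort entries by hash and cut them into leaf blocks.

    Returns (map_hashes, leaves); map_hashes[i] is the lowest hash in leaves[i].
    Inode numbers are invented, starting at 12. names must not be empty.
    """
    entries = sorted((name_hash(n), n, ino)
                     for ino, n in enumerate(names, start=12))
    leaves = [entries[i:i + leaf_capacity]
              for i in range(0, len(entries), leaf_capacity)]
    map_hashes = [leaf[0][0] for leaf in leaves]
    map_hashes[0] = 0  # the first map entry covers every hash from zero up
    return map_hashes, leaves


def lookup_name(name, map_hashes, leaves):
    """Find the leaf via the hash->block map, then scan it linearly."""
    target = name_hash(name)
    block = max(bisect.bisect_left(map_hashes, target) - 1, 0)
    # Keep going into following leaves while they may still hold this hash,
    # which models an overflow chain spilling into the next leaf block.
    while block < len(leaves) and map_hashes[block] <= target:
        for _hash, entry_name, inode in leaves[block]:
            if entry_name == name:
                return inode
        block += 1
    return None


hashes, leaves = build_index(NAMES)
print(lookup_name("golf", hashes, leaves), lookup_name("zulu", hashes, leaves))
```

Only one or two small leaves are scanned per lookup instead of the whole directory, which is the point of the index.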

debugfs

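The kernel's debugfs, not to be confused with the debugfs tool for ext2/ext3/ext4 described above, is a synthetic, RAM-based file system that kernel code uses to expose debugging information to user space; it is usually mounted at /sys/kernel/debug. [4]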

overlayfs

OverlayFS [5] [6] is a Linux union filesystem that merges multiple directory trees (layers) into a single, unified view. It's commonly used to layer a read-write "upper" directory on top of one or more read-only "lower" directories. Changes are written to the upper layer, preserving the integrity of the lower, read-only base layers. This is essential for Docker containers and other applications like LiveCDs, where isolated, read-write environments are built on top of static or read-only base images.

Overlayfs allows one, usually read-write, directory tree to be overlaid onto another, read-only directory tree. All modifications go to the upper, writable layer. This type of mechanism is most often used for live CDs but there is a wide variety of other uses.

The implementation differs from other "union filesystem" implementations in that after a file is opened all operations go directly to the underlying, lower or upper, filesystems. This simplifies the implementation and allows native performance in these cases. Overlayfs has been in the Linux kernel since 3.18.

To mount an overlay use the following mount options:

# mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work /merged

Note

  • The working directory (workdir) needs to be an empty directory on the same filesystem as the upper directory.
  • The lower directory can be read-only or could be an overlay itself.
  • The upper directory is normally writable.
  • The workdir is used to prepare files as they are switched between the layers.
  • The lower directory can actually be a list of directories separated by ":"; all changes in the merged directory are still reflected in upper.
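The workdir requirements in the notes above can be checked before mounting. A rough sketch, assuming that comparing st_dev is a good enough test for "same filesystem" (it can be misleading with bind mounts or subvolumes); the function name overlay_dir_problems is made up.

```python
import os
import tempfile


def overlay_dir_problems(upperdir, workdir):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if os.listdir(workdir):
        problems.append("workdir is not empty")
    # Approximation: equal device numbers are taken to mean "same filesystem".
    if os.stat(upperdir).st_dev != os.stat(workdir).st_dev:
        problems.append("workdir and upperdir are on different filesystems")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    upper = os.path.join(tmp, "upper")
    work = os.path.join(tmp, "work")
    os.mkdir(upper)
    os.mkdir(work)
    fresh = overlay_dir_problems(upper, work)

    open(os.path.join(work, "leftover"), "w").close()
    dirty = overlay_dir_problems(upper, work)
    print(fresh, dirty)
```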

Example:

# mount -t overlay overlay -o lowerdir=/lower1:/lower2:/lower3,upperdir=/upper,workdir=/work /merged

Lower directories are listed from top to bottom: the leftmost is the highest and the rightmost is the lowest. The upper directory therefore sits on top of the first directory in the left-to-right list of lower directories, NOT on top of the last directory in the list, as the order might seem to suggest. The above example will have the order:

/upper
/lower1 
/lower2
/lower3
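The stacking order can be derived mechanically from the option string. The sketch below is a model only: it ignores whiteouts, opaque directories and escaped separators in paths, and the contents table is hypothetical sample data rather than anything read from disk.

```python
def layer_order(options):
    """Return the layers named in an overlay option string, topmost first."""
    fields = dict(item.split("=", 1)
                  for item in options.split(",") if "=" in item)
    layers = []
    if "upperdir" in fields:
        layers.append(fields["upperdir"])
    layers.extend(fields["lowerdir"].split(":"))  # leftmost lower is highest
    return layers


def resolve(name, layers, contents):
    """Return the topmost layer that provides name, or None."""
    for layer in layers:
        if name in contents.get(layer, set()):
            return layer
    return None


options = "lowerdir=/lower1:/lower2:/lower3,upperdir=/upper,workdir=/work"
layers = layer_order(options)

# Hypothetical directory contents, used only to exercise resolve().
contents = {
    "/upper": {"notes"},
    "/lower1": {"motd"},
    "/lower3": {"motd", "base.conf"},
}
print(layers)
print(resolve("motd", layers, contents), resolve("base.conf", layers, contents))
```

A name present in several layers is served from the highest one, so motd comes from /lower1 even though /lower3 also has it.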

To add an overlayfs entry to /etc/fstab use the following format:

 # vi /etc/fstab
 overlay /merged overlay noauto,x-systemd.automount,lowerdir=/lower,upperdir=/upper,workdir=/work 0 0

The noauto and x-systemd.automount mount options are necessary to prevent systemd from hanging on boot because it failed to mount the overlay. The overlay is now mounted whenever it is first accessed and requests are buffered until it is ready.

Read-only overlay

Sometimes, it is only desired to create a read-only view of the combination of two or more directories. In that case, it can be created in an easier manner, as the directories upper and work are not required:

# mount -t overlay overlay -o lowerdir=/lower1:/lower2 /merged

When upperdir is not specified, the overlay is automatically mounted as read-only.
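The two forms of the mount command differ only in their option string. As a sketch, a small helper (the name overlay_mount_command is invented) can assemble the argument list for either form; it only builds the command and does not run it, since mounting requires root.

```python
def overlay_mount_command(target, lowers, upper=None, work=None):
    """Build the argv for an overlay mount; read-only when upper is None."""
    if not lowers:
        raise ValueError("at least one lower directory is required")
    if (upper is None) != (work is None):
        raise ValueError("upperdir and workdir must be given together")
    options = ["lowerdir=" + ":".join(lowers)]
    if upper is not None:
        options += ["upperdir=" + upper, "workdir=" + work]
    return ["mount", "-t", "overlay", "overlay",
            "-o", ",".join(options), target]


read_only = overlay_mount_command("/merged", ["/lower1", "/lower2"])
read_write = overlay_mount_command("/merged", ["/lower"],
                                   upper="/upper", work="/work")
print(" ".join(read_only))
print(" ".join(read_write))
```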

unionfs

UnionFS works by overlaying multiple existing directories, called "branches", into a single, unified view at a single mount point. It uses a Copy-on-Write (CoW), or "copy-up", mechanism, where read-only lower branches remain unchanged while modifications are copied to a designated writable upper branch. This creates a transparent, read-write file system without altering the original source files. For example, with unionfs-fuse:

unionfs-fuse -o cow,max_files=32768 \
            -o allow_other,use_ino,suid,dev,nonempty \
            /u/host/etc=RW:/u/group/etc=RO:/u/common/etc=RO \
            /u/union/etc

Branching and Layering
UnionFS takes several separate file systems (branches) and lays them on top of each other in a specific order, typically with read-only branches at the bottom and one or more read-write branches at the top.
Unified View
The union mount point presents a single, coherent directory structure that merges all the files and directories from these layered branches.
Priority Rules
When multiple branches contain a file with the same name, UnionFS gives priority to the file in the higher-priority (often the leftmost) branch.
Copy-on-Write (CoW)
  • Reading: When you read a file, UnionFS checks the higher-priority branches first. If the file is found, it's presented to you directly from that branch.
  • Writing: If you attempt to modify a file located in a read-only branch, UnionFS doesn't change the original file. Instead, it copies that file from the read-only branch to a writable upper branch and then applies your changes to the newly copied file.

Modifications in the Upper Layer: All new files and modifications are stored in the writable upper layer. The original read-only branches remain intact, which is crucial for maintaining the integrity of base images in environments like Docker.
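The copy-up step can be imitated with ordinary directories. This is a user-space model of the idea, not how UnionFS or OverlayFS are implemented; the helper names and the motd sample file are hypothetical.

```python
import os
import shutil
import tempfile


def find_layer(name, upper, lowers):
    """Return the highest-priority branch that contains name, or None."""
    for layer in [upper, *lowers]:
        if os.path.exists(os.path.join(layer, name)):
            return layer
    return None


def append_text(name, text, upper, lowers):
    """Append to name in the merged view, copying it up first if needed."""
    target = os.path.join(upper, name)
    source = find_layer(name, upper, lowers)
    if source is not None and source != upper:
        shutil.copy2(os.path.join(source, name), target)  # the copy-up
    with open(target, "a") as fh:
        fh.write(text)


def read_text(name, upper, lowers):
    """Read name from the highest-priority branch that has it."""
    layer = find_layer(name, upper, lowers)
    if layer is None:
        raise FileNotFoundError(name)
    with open(os.path.join(layer, name)) as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    upper, lower = os.path.join(tmp, "upper"), os.path.join(tmp, "lower")
    os.mkdir(upper)
    os.mkdir(lower)
    with open(os.path.join(lower, "motd"), "w") as fh:
        fh.write("base\n")

    append_text("motd", "local\n", upper, [lower])
    merged = read_text("motd", upper, [lower])
    with open(os.path.join(lower, "motd")) as fh:
        untouched = fh.read()
    print(repr(merged), repr(untouched))
```

The merged view shows the modified copy from the upper branch while the file in the lower branch keeps its original content.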

Whiteouts
To handle file deletion in a union file system, a special "whiteout" entry is created in a writable layer. When a lookup operation encounters a whiteout entry, it signals that the file or directory should be considered as deleted, even if it still exists in a lower-layer read-only branch.
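Whiteout handling can be sketched with dictionaries standing in for branches. The WHITEOUT marker below is a stand-in for whatever on-disk representation a given union filesystem uses, and the function names and the passwd sample entry are made up.

```python
WHITEOUT = object()  # stand-in marker for a whiteout entry


def union_lookup(name, layers):
    """layers: list of dicts, topmost first. Return content, or None if
    the name is absent or hidden by a whiteout."""
    for layer in layers:
        if name in layer:
            entry = layer[name]
            return None if entry is WHITEOUT else entry
    return None


def union_delete(name, layers):
    """Delete name from the merged view without touching lower layers."""
    upper, lowers = layers[0], layers[1:]
    upper.pop(name, None)
    # Only leave a whiteout if a lower layer would otherwise show through.
    if any(name in lower for lower in lowers):
        upper[name] = WHITEOUT


upper = {}
lower = {"passwd": "root:x:0:0"}
layers = [upper, lower]

before = union_lookup("passwd", layers)
union_delete("passwd", layers)
after = union_lookup("passwd", layers)
print(before, after, "passwd" in lower)
```

The lookup stops at the whiteout in the upper layer, so the entry that still exists in the lower layer is never reached.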
In essence

UnionFS creates the illusion of a single, writable file system by layering multiple directories, with a Copy-on-Write mechanism handling any writes to the lower, read-only layers by copying and modifying files in a separate upper layer.

problems

bibliography

size

s (sectors) - 512 byte sectors
K (kilobytes) - 1,024 bytes
M (megabytes) - 1,048,576 bytes
G (gigabytes) - 1,073,741,824 bytes
T (terabytes) - 1,099,511,627,776 bytes
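The suffixes above translate directly into multipliers. A minimal sketch of a converter, where the function name parse_size is made up and a missing suffix is rejected rather than guessed at:

```python
UNITS = {
    "s": 512,        # sectors
    "K": 1024,       # kilobytes
    "M": 1024 ** 2,  # megabytes
    "G": 1024 ** 3,  # gigabytes
    "T": 1024 ** 4,  # terabytes
}


def parse_size(text):
    """Convert a string such as '8s' or '4K' into a number of bytes."""
    text = text.strip()
    if not text or text[-1] not in UNITS:
        raise ValueError(f"missing or unknown size suffix: {text!r}")
    return int(text[:-1]) * UNITS[text[-1]]


for example in ("8s", "4K", "1G"):
    print(example, "=", parse_size(example), "bytes")
```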

notes

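  1. Note that /dev reports inode 1 and / reports inode 2, but the two live on different filesystems (devices 0,5 and 254,2 in the output above). Inode numbers are only unique within one filesystem, so they do not show which was created first; ext4, for example, always uses inode 2 for its root directory.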

references
