Files, Inodes, and Metadata
Topic 07

Filesystem, Foundations

A file on Linux is not the name you type. The name is a directory entry; the file itself is an inode — a fixed-size record on disk that holds every attribute of the file except its name and its contents. Size, owner UID and group GID, the permission bits, the three timestamps, the link count, and the pointers to the data blocks all live in the inode. The directory entry is just a string mapped to an inode number, and the kernel resolves the rest from there.

That separation between name and inode is the single fact that explains hard links, why rm does not always free space, why renaming a 50 GB file inside one filesystem is instant, and why a filesystem can run out of files while gigabytes of disk sit empty. Once you see directory entries as pointers to inodes rather than as the files themselves, the rest of the filesystem stops surprising you.

The Inode

Every file, directory, symlink, device node, socket, and named pipe has an inode. Filesystems differ in how they provision inodes: ext4 allocates its inode table up front, with a fixed count decided at mkfs time, while XFS allocates inodes dynamically as needed. Each inode carries a number, unique within that filesystem, and that number is what a directory entry actually references. Run stat on any path and you are reading the inode's fields directly; the only thing that comes from the directory entry is the name you typed.

# stat shows the inode contents: number, size, blocks, mode, owner, times
stat /etc/hosts
  File: /etc/hosts
  Size: 221       Blocks: 8          IO Block: 4096   regular file
Device: 259,2   Inode: 1442        Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Modify: 2026-05-12 09:41:08
 Birth: 2026-05-12 09:41:08

What the inode does not store is the filename. There is no field for it, and a single inode can be reached by many names at once. The kernel finds an inode by walking directory entries to its number, then reads the inode to learn where the data blocks are and whether the caller is allowed to touch them.
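
As a small illustration of the name living outside the inode, the sketch below renames a file and checks that its inode number is unchanged. It assumes GNU coreutils stat (for the -c format option) and a scratch directory from mktemp; the file names are made up for the example.

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)
echo "hello" > "$workdir/old-name.txt"

# %i prints the inode number (GNU stat format string).
inode_before=$(stat -c %i "$workdir/old-name.txt")

# A rename inside one filesystem only rewrites the directory entry.
mv "$workdir/old-name.txt" "$workdir/new-name.txt"
inode_after=$(stat -c %i "$workdir/new-name.txt")

if [ "$inode_before" = "$inode_after" ]; then
    echo "same inode under a new name"
fi
rm -rf "$workdir"
```

Because only the directory entry changes, the cost of the rename does not depend on the size of the file.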

Directory Entries and Hard Links

A directory is itself an inode whose data is a table of (name, inode number) pairs. A hard link is nothing more than a second directory entry pointing at the same inode. Both names are equal — neither is the "original" — and the inode's link count tracks how many names refer to it. The kernel only frees the inode and its data blocks when that count reaches zero and no process still holds the file open.

# two names, one inode: note the identical inode number and Links: 2
echo "payload" > a.txt
ln a.txt b.txt
ls -li a.txt b.txt
1442 -rw-r--r-- 2 root root 8 May 30 11:02 a.txt
1442 -rw-r--r-- 2 root root 8 May 30 11:02 b.txt
rm a.txt   # link count drops to 1; data survives under b.txt

This is why rm is more honestly named unlink: it removes a directory entry and decrements the link count, not the bytes. It also explains the classic disk-space surprise: delete a 10 GB log that a running process still has open, and df shows no space recovered. The directory entry is gone and the link count has dropped to zero, but the open file descriptor still holds a reference to the inode, so the inode and its blocks stay allocated until the process closes the file or exits.
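
The deleted-but-open case can be reproduced in a few lines. This is a minimal sketch, assuming a POSIX shell and a scratch directory from mktemp; descriptor 3 and the file name held.log are arbitrary choices for the example.

```shell
workdir=$(mktemp -d)
echo "payload" > "$workdir/held.log"

exec 3< "$workdir/held.log"    # open descriptor 3 on the file
rm "$workdir/held.log"         # remove the only name for the inode

# The name is gone from the directory...
if [ -e "$workdir/held.log" ]; then name_state=present; else name_state=gone; fi

# ...but the data is still readable through the open descriptor.
survived=$(cat <&3)

exec 3<&-                      # closing the last descriptor lets the kernel free the inode
rm -rf "$workdir"
echo "name: $name_state, data read after rm: $survived"
```

The same mechanism is what keeps a rotated log growing on disk until the service that writes it reopens its files.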

Hard links carry one hard constraint: they cannot cross filesystem boundaries, because an inode number is only meaningful within its own filesystem. You also cannot hard-link a directory — that restriction exists to keep the directory tree acyclic.

Symbolic Links

A symbolic link is a different mechanism entirely. It is its own inode whose data is a text string — a path to another file. The kernel follows that string at lookup time, which means a symlink can point across filesystems, point at a directory, and point at a target that does not exist yet or has since been deleted (a dangling link). Where a hard link is indistinguishable from the file, a symlink is a signpost that can rot.

Property                    Hard link               Symbolic link
What it stores              Same inode number       A path string
Crosses filesystems         No                      Yes
Can target a directory      No                      Yes
Survives target deletion    Yes (it is the file)    No (becomes dangling)
Affects target link count   Yes (+1)                No

On Debian and Ubuntu this matters daily: /etc/alternatives is a tree of symlinks selecting which binary editor or java resolves to, and systemd enables a unit by creating a symlink from a .wants directory to the unit file under /usr/lib/systemd/system. Use ln -s target linkname to create one and readlink -f to resolve it all the way to the real path.
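
A short sketch of the symlink life cycle, assuming GNU readlink and a scratch directory; target.txt and link.txt are hypothetical names. It shows the stored string, the resolved path, and how to detect a dangling link once the target is removed.

```shell
workdir=$(mktemp -d)
echo "data" > "$workdir/target.txt"

# The link stores the literal string "target.txt", resolved relative to the link's directory.
ln -s target.txt "$workdir/link.txt"
stored=$(readlink "$workdir/link.txt")        # the raw string held by the symlink
resolved=$(readlink -f "$workdir/link.txt")   # the canonical absolute path

rm "$workdir/target.txt"

# -L tests the link itself; -e follows it, so it fails once the target is gone.
if [ -L "$workdir/link.txt" ] && [ ! -e "$workdir/link.txt" ]; then
    link_state=dangling
else
    link_state=ok
fi
rm -rf "$workdir"
echo "stored=$stored state=$link_state"
```

The combination of -L and a failed -e is a convenient test for dangling links in cleanup scripts.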

Permission Bits and Ownership

The inode stores a 16-bit mode: the file type, the special bits (setuid, setgid, sticky), and the nine permission bits for owner, group, and other. Those nine bits are what chmod 0644 sets and what the first column of ls -l renders. Ownership is two numeric IDs in the inode — UID and GID — and the names ls prints are looked up from /etc/passwd and /etc/group at display time, not stored on disk.

The special bits are easy to overlook and expensive to get wrong. Setuid (4000) on an executable makes it run as the file's owner rather than the caller — that is how /usr/bin/passwd runs as root for an unprivileged user, and it is the first thing an attacker hunts for. Setgid (2000) on a directory forces new files to inherit the directory's group, which is the standard way to run a shared project directory. The sticky bit (1000) on /tmp stops users from deleting each other's files in a world-writable directory.
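
To make the mode bits concrete, the following sketch sets them on throwaway files and reads them back. It assumes GNU stat; the names report.txt and scratch are invented for the example, and the sticky directory merely imitates /tmp.

```shell
workdir=$(mktemp -d)
touch "$workdir/report.txt"
mkdir "$workdir/scratch"

chmod 0644 "$workdir/report.txt"   # owner read/write, group and other read-only
chmod 1777 "$workdir/scratch"      # world-writable plus the sticky bit, like /tmp

# %a is the octal mode; %A is the symbolic form that ls -l prints.
file_octal=$(stat -c %a "$workdir/report.txt")
file_symbolic=$(stat -c %A "$workdir/report.txt")
dir_octal=$(stat -c %a "$workdir/scratch")
dir_symbolic=$(stat -c %A "$workdir/scratch")

# %u and %g are the numeric IDs stored in the inode; names are resolved only for display.
owner_ids=$(stat -c '%u:%g' "$workdir/report.txt")
rm -rf "$workdir"
echo "$file_octal $file_symbolic | $dir_octal $dir_symbolic | $owner_ids"
```

The trailing t in the directory's symbolic mode is the sticky bit; a setgid directory would show s in the group execute position instead.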

Timestamps and Extended Attributes

The inode keeps three timestamps, and the difference between them traps people. atime is last access, mtime is last content modification, and ctime is last inode change: a permission change, an ownership change, or a link-count change bumps ctime but leaves mtime alone, while a content write updates both. There is no creation time in the classic inode; ext4 and XFS added a fourth, crtime (birth), which stat shows as "Birth" when the filesystem records it. Modern mounts default to relatime, which updates atime only when the previous atime is not newer than mtime or ctime, or is more than 24 hours old, so do not trust atime as a precise read log unless the filesystem is mounted strictatime.
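
The mtime/ctime distinction is easy to verify. This is a minimal sketch assuming GNU stat, where %Y and %Z print mtime and ctime as seconds since the epoch; the two-second sleep is only there to make the change visible at one-second resolution.

```shell
f=$(mktemp)
echo "v1" > "$f"

mtime_before=$(stat -c %Y "$f")   # last content modification
ctime_before=$(stat -c %Z "$f")   # last inode change

sleep 2
chmod 0640 "$f"                   # metadata-only change, contents untouched

mtime_after=$(stat -c %Y "$f")
ctime_after=$(stat -c %Z "$f")
rm -f "$f"

echo "mtime: $mtime_before -> $mtime_after"
echo "ctime: $ctime_before -> $ctime_after"
```

Only the ctime pair should differ, which is why backup tools that compare mtime alone can miss permission changes.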

Beyond the fixed fields, files carry extended attributes: arbitrary name/value pairs grouped into namespaces. The security.* namespace holds security labels such as SELinux contexts (security.selinux) and file capabilities (security.capability); system.posix_acl_access holds POSIX ACLs that extend permissions past the single owner/group/other model. View xattrs with getfattr -d (add -m - to include namespaces other than user.*) and ACLs with getfacl; on Debian and Ubuntu the userland tools come from the attr and acl packages. A plain cp drops xattrs and ACLs; you need cp -a or cp --preserve=all to carry them.
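
The difference between cp and cp -a can be seen without any xattr tooling by using mtime as a stand-in for the metadata that a plain copy discards. The sketch assumes GNU cp, touch, and stat; the file names are hypothetical, and xattrs and ACLs are not exercised here because the attr and acl packages may not be installed.

```shell
workdir=$(mktemp -d)
echo "config" > "$workdir/src.conf"
touch -d '2020-01-01 00:00:00' "$workdir/src.conf"   # give the source an old mtime

cp    "$workdir/src.conf" "$workdir/plain.conf"      # new inode with fresh timestamps
cp -a "$workdir/src.conf" "$workdir/archive.conf"    # archive mode carries the metadata over

src_mtime=$(stat -c %Y "$workdir/src.conf")
plain_mtime=$(stat -c %Y "$workdir/plain.conf")
archive_mtime=$(stat -c %Y "$workdir/archive.conf")
rm -rf "$workdir"

echo "source=$src_mtime plain=$plain_mtime archive=$archive_mtime"
```

The archive copy keeps the source's mtime while the plain copy is stamped with the time of the copy.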

Common Mistakes
  • Deleting a large log that a service still has open and expecting df to recover the space: the inode stays allocated until the process closes the descriptor, so restart the service, or empty the file in place with truncate -s 0 or : > file instead of deleting it.
  • Running out of inodes while disk space is free — an ext4 filesystem full of tiny files (mail spools, session caches) exhausts the fixed inode table, and df -i, not df -h, is the only command that shows it; the only fix is to delete files or reformat with a higher inode count.
  • Treating a symlink like a hard link and moving the target — the link goes dangling and silently resolves to nothing, whereas a hard link would have survived because it is the file.
  • Trying to hard-link across filesystems or onto a separate mount and getting Invalid cross-device link — inode numbers are filesystem-local, so the link cannot exist; you need a symlink instead.
  • Copying files with plain cp for a backup and losing ownership, ACLs, xattrs, and timestamps: only cp -a (or rsync -aHAX) preserves the full inode metadata, and the loss is invisible until a permission or security label is wrong in production.
  • Reading atime as a reliable "last opened" audit trail when the filesystem is mounted relatime: atime advances only when it is not newer than mtime or ctime, or is more than a day old, so repeated reads of an unchanged file leave it untouched for up to 24 hours.
  • Leaving an unexpected setuid-root binary in place: any setuid-root file you cannot account for is a potential privilege-escalation path, and find / -perm -4000 is the audit that should turn up only the handful you expect.
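
The setuid audit mentioned in the last item can be rehearsed safely on a scratch directory before pointing it at /. This is a sketch assuming GNU find; plain-tool and suid-tool are empty placeholder files, not real binaries.

```shell
workdir=$(mktemp -d)
touch "$workdir/plain-tool" "$workdir/suid-tool"
chmod 0755 "$workdir/plain-tool"
chmod 4755 "$workdir/suid-tool"     # setuid bit on a harmless empty file

# -perm -4000 matches files with at least the setuid bit, -perm -2000 the setgid bit.
# The escaped parentheses group the two tests so -type f applies to both.
flagged=$(find "$workdir" -xdev -type f \( -perm -4000 -o -perm -2000 \))

chmod u-s,g-s "$workdir/suid-tool"  # strip both special bits
after_strip=$(find "$workdir" -xdev -type f \( -perm -4000 -o -perm -2000 \))
rm -rf "$workdir"

echo "flagged before: $flagged"
echo "flagged after:  ${after_strip:-none}"
```

On a real system, the output of the first find is worth saving as a baseline so later runs can be compared against it.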

Best Practices
  • Check df -i alongside df -h in monitoring — inode exhaustion produces the same "No space left on device" error as a full disk but needs a different fix.
  • Reach for stat before guessing — it shows the link count, all three timestamps, the inode number, and the device in one view, so you know whether a file has other hard links before deleting it.
  • Reclaim space from a deleted-but-open file with truncate -s 0 /proc/<pid>/fd/<n> or restart the holding service, rather than hunting a directory entry that no longer exists.
  • Use cp -a or rsync -aHAX whenever the copy must keep ownership, permissions, ACLs, xattrs, and timestamps — plain cp is for throwaway copies only.
  • Audit setuid and setgid binaries with find / -xdev -type f \( -perm -4000 -o -perm -2000 \) on a schedule, and strip the bits with chmod u-s,g-s on anything you did not put there.
  • Set a shared group directory with the setgid bit (chmod 2775) so new files inherit the group automatically instead of relying on every user's umask.
  • Resolve symlinks with readlink -e or realpath -e in scripts before acting on a path; both fail when any component is missing, so a relocated or dangling target fails loudly instead of writing to the wrong place (readlink -f tolerates a missing final component).
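
The df -i monitoring practice above could be sketched as a small filter over df output. This is an illustration under stated assumptions, not a finished monitor: inode_alerts is a hypothetical helper name, the 90 percent threshold is arbitrary, and mount points containing spaces are not handled.

```shell
# inode_alerts LIMIT: reads `df -Pi` output on stdin and prints every
# mount point whose inode usage is at or above LIMIT percent.
inode_alerts() {
    awk -v limit="$1" '
        NR > 1 && $5 ~ /^[0-9]+%$/ {     # skip the header and filesystems that report "-"
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6 " inode usage " $5
        }'
}

# -P keeps each filesystem on one line; -i reports inodes instead of blocks.
df -Pi | inode_alerts 90
```

Keeping the parsing in a function that reads stdin means the threshold logic can be checked against canned df output without filling a real filesystem.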

Comparable tools
  • Windows NTFS: Master File Table records play the inode role; junctions and symbolic links via mklink, hard links via fsutil hardlink
  • macOS APFS/HFS+: Unix inode model with crtime, plus Finder aliases as a higher-level symlink-like reference
  • BSD UFS/ZFS: the same inode and link-count semantics; ZFS uses NFSv4-style ACLs in place of POSIX ACLs

Knowledge Check

You rm a 10 GB log file, but df -h shows no space recovered. What is happening?

  • A running process still holds the file open, so the inode is still referenced even though its link count is zero, and its data blocks stay allocated until the process closes the descriptor or exits
  • The filesystem batches deletions in the page cache and only frees the released block ranges asynchronously, once the next periodic background sync flushes the pending metadata changes out to the disk
  • df reads stale figures; the space is already free and a remount would show it
  • The file had a hard link elsewhere on the same filesystem keeping the data alive

Why can a hard link not cross from one mounted filesystem to another, while a symlink can?

  • A hard link references an inode number, which is only unique within its own filesystem; a symlink stores a path string the kernel resolves at lookup time
  • Hard links require both names to share the same owning user and group, and a separately mounted filesystem is initialized with a different default ownership map
  • The kernel forbids hard links across filesystems only as a security policy that can be disabled at mount time
  • Symlinks are stored in the inode table while hard links are stored in the directory data, which is mount-local

A filesystem reports "No space left on device" but df -h shows 40% free. What should you check?

  • df -i — the inode table is likely exhausted by a large number of tiny files, which is a separate limit from raw disk capacity
  • The mount's quota settings, since per-user quotas override the global free-space figure
  • The reserved-blocks percentage, because ext4 hides 5% from non-root and that triggers the same error
  • The journal size, since a full ext4 journal blocks every write with ENOSPC until a background commit flushes it and reclaims the reserved log area

You change a file's permissions with chmod. Which timestamp moves, and which does not?

  • ctime moves because the inode changed; mtime does not, because the file's contents were untouched
  • mtime moves because chmod rewrites the inode, while ctime tracks only content edits
  • Both mtime and ctime move, since any inode write updates the modification time too
  • Only atime moves, because chmod first reads the inode to load the current mode bits, and that read updates the access time

Why does a plain cp followed by a restore sometimes break a service that worked before?

  • cp without -a or --preserve=all drops ownership, ACLs, xattrs, and the security labels stored in them (such as SELinux contexts), so the restored file has the wrong metadata
  • cp rewrites the file's inode number, breaking any hard links the service depended on
  • cp always dereferences symlinks, so each link in the tree is silently replaced by a full copy of its target file, doubling the on-disk footprint the service expected
  • cp truncates extended attributes longer than 255 bytes, corrupting the security label
