Topic 55

Software RAID

Software RAID combines two or more block devices into a single logical array that survives disk failure, increases throughput, or both, using the kernel's md (multiple device) driver instead of a dedicated hardware controller. You create and manage arrays with the mdadm tool, which Debian and Ubuntu ship as the mdadm package; the array appears as a block device such as /dev/md0, and you partition, format, or hand it to LVM exactly like a physical disk.

Because the parity and mirroring work happens in the kernel on your CPU, software RAID has no proprietary on-disk format and no controller that can fail or need a matching replacement. The operational consequence is concrete: a failed disk in a RAID 1 or RAID 5 array keeps serving reads and writes in degraded mode, but you remain responsible for noticing the failure and swapping the disk. Unless you have set up monitoring and a hot spare, nothing alerts you or starts a rebuild on its own, and a second failure before the rebuild finishes can lose the whole array.

RAID Levels and Their Trade-offs

The md driver implements the standard levels plus a Linux-specific RAID 10. RAID 0 stripes data across all members for capacity and speed with zero redundancy — one disk lost means the array is gone. RAID 1 mirrors every block to all members, so an N-disk mirror tolerates N−1 failures but gives you only one disk of usable capacity. RAID 5 stripes data with one parity block per stripe and tolerates a single failure; RAID 6 carries two parity blocks and tolerates two.

Level	Min disks	Usable capacity	Failures tolerated	Typical use
RAID 0	2	N disks	0	Scratch, caches, rebuildable data
RAID 1	2	1 disk	N−1	Boot and root volumes
RAID 5	3	N−1 disks	1	Capacity with some redundancy
RAID 6	4	N−2 disks	2	Large arrays, long rebuilds
RAID 10	4	N/2 disks	1 per mirror set	Databases, mixed read/write
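
To make the capacity column concrete, here is a minimal shell sketch that turns a level and a member count into usable capacity, measured in whole disks. The function name usable_disks is hypothetical, and the sketch assumes equal-size members and, for RAID 10, an even member count with two copies of each block. It does not enforce the minimum disk counts from the table.

```shell
# usable_disks LEVEL N
# Print how many disks' worth of capacity an N-member array provides.
# Assumes all members are the same size (hypothetical helper, not part of mdadm).
usable_disks() {
    level=$1
    n=$2
    case $level in
        0)  echo "$n" ;;            # striping only: every disk holds data
        1)  echo 1 ;;               # every member is a full copy
        5)  echo $((n - 1)) ;;      # one disk's worth of parity
        6)  echo $((n - 2)) ;;      # two disks' worth of parity
        10) echo $((n / 2)) ;;      # two copies of each block, even N assumed
        *)  echo "unknown level: $level" >&2; return 1 ;;
    esac
}
```

For example, usable_disks 6 8 prints 6: an eight-disk RAID 6 array stores six disks of data.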

RAID 5 carries a hidden cost called the write penalty: every small write must read the old data and old parity, recompute, and write both back, turning one logical write into four I/O operations. On large arrays of multi-terabyte disks, RAID 6 is preferred over RAID 5 because rebuilds run for many hours, and the probability of a second disk failure or an unrecoverable read error during that window is no longer negligible.
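
The arithmetic behind the write penalty can be sketched in shell. Only the RAID 5 figure of four I/Os comes from the text above. The factors used for RAID 6 (six) and RAID 10 (two) are commonly quoted rules of thumb and are an assumption here, as are the function name write_iops and the per-disk IOPS you pass in. The result is a rough ceiling for small random writes, not a benchmark.

```shell
# write_iops LEVEL N PER_DISK_IOPS
# Rough ceiling on small random-write IOPS for an N-member array.
# Penalty factors other than RAID 5's are rule-of-thumb assumptions.
write_iops() {
    level=$1
    n=$2
    per_disk=$3
    case $level in
        0)  penalty=1 ;;   # one device write per logical write
        10) penalty=2 ;;   # each write lands on both copies
        5)  penalty=4 ;;   # read data, read parity, write data, write parity
        6)  penalty=6 ;;   # same cycle with two parity blocks
        *)  echo "unknown level: $level" >&2; return 1 ;;
    esac
    echo $(( n * per_disk / penalty ))
}
```

With four disks that each sustain a hypothetical 200 random IOPS, write_iops 5 4 200 prints 200 while write_iops 10 4 200 prints 400, which is the gap the RAID 10 recommendation for write-heavy workloads rests on.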

Creating and Inspecting an Array

You assemble an array from whole disks or partitions with mdadm --create. Using GPT partitions of type Linux RAID rather than raw disks makes the members self-describing and avoids confusion when a disk is moved between machines. After creation the kernel begins an initial sync, and the array is usable — though slower — while that sync runs.

# create a 3-disk RAID 5 array from partitions
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1

# watch the initial sync and overall state
cat /proc/mdstat
mdadm --detail /dev/md0

The file /proc/mdstat is the fastest health check: it lists each array, its level, its members, and a status line such as [UU_] where each U is an up member and an underscore is a missing one. mdadm --detail adds the array UUID, the sync percentage, and the per-device roles you need before replacing hardware.
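
The status-line convention described above is easy to check from a script. The following is a minimal sketch, assuming the usual /proc/mdstat layout in which each array starts with a line such as "md0 : active ..." and a later line carries the [UU_] flags. The function name degraded_arrays is hypothetical. It reads mdstat-format text on standard input and prints only the arrays that have a missing member.

```shell
# degraded_arrays
# Read /proc/mdstat-format text on stdin; print "name [flags]" for every
# array whose status flags contain an underscore (a missing member).
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md0 : active raid5 ..."
        /^md[^ ]* :/ { name = $1 }
        # Status flags look like [UU_]; counters such as [3/2] do not match
        match($0, /\[[U_]+\]/) {
            flags = substr($0, RSTART, RLENGTH)
            if (index(flags, "_") > 0) print name, flags
        }
    '
}
```

On a live system you would run degraded_arrays < /proc/mdstat; empty output means no array reports a missing member.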

Persistent Assembly and the Boot Path

An array that works after --create is not guaranteed to reassemble under the same name after a reboot unless its definition is recorded. Write the array's definition, which identifies it by UUID, into /etc/mdadm/mdadm.conf (the path is /etc/mdadm.conf on Red Hat and Fedora), then refresh the initramfs so the array can be assembled early enough to mount the root filesystem.

# append the running array's definition to the config
mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf

# Debian/Ubuntu: rebuild the initramfs so md0 assembles at boot
update-initramfs -u
# Red Hat/Fedora equivalent: dracut -f

Reference the array in /etc/fstab by its filesystem UUID from blkid, never as /dev/md0, because the kernel may enumerate an unconfigured array as /dev/md127 after a reboot. If you forget to update the initramfs on a root-on-RAID system, the machine drops to an initramfs shell because the root device cannot be assembled.
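
One way to catch the kernel-name mistake before a reboot does is to scan fstab for it. This is a small sketch rather than a complete linter: the function name fstab_md_names is hypothetical, and it flags only uncommented entries whose first field is exactly /dev/md followed by digits.

```shell
# fstab_md_names
# Read fstab-format text on stdin; print "device mountpoint" for every
# uncommented entry that mounts an array by kernel name such as /dev/md0.
fstab_md_names() {
    awk '$1 ~ /^\/dev\/md[0-9]+$/ { print $1, $2 }'
}
```

Run it as fstab_md_names < /etc/fstab. For each entry it reports, blkid -s UUID -o value on that device prints the filesystem UUID to use in a UUID= entry instead.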

Failure Handling and Rebuilds

When a disk fails, the array continues in degraded mode and the md driver marks the member faulty. You replace it by failing and removing the bad device, partitioning the new disk identically, and adding it back; the rebuild starts automatically. Configure mdadm --monitor with a MAILADDR line in the config so a failure pages you instead of waiting to be discovered.

# fail, remove, then add the replacement disk
mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
mdadm /dev/md0 --add /dev/sde1

# cap rebuild speed (value is KiB/s per device) so it does not starve production I/O
echo 50000 > /proc/sys/dev/raid/speed_limit_max
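
Because the alerting described above depends on a MAILADDR line actually being present, a quick check is worth scripting. The sketch below uses a hypothetical function name, has_mailaddr. It looks only for the full keyword on an uncommented line and does not validate the address or confirm that mail delivery works.

```shell
# has_mailaddr FILE
# Succeed (exit 0) if FILE contains an uncommented MAILADDR line with a value.
has_mailaddr() {
    grep -Eqi '^[[:space:]]*MAILADDR[[:space:]]+[^[:space:]]+' "$1"
}
```

A typical use is has_mailaddr /etc/mdadm/mdadm.conf || echo "no MAILADDR set". Once the line is in place, running mdadm --monitor --scan --oneshot --test as root sends one test alert per array, which confirms the mail path end to end.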

Add a hot spare with --add-spare so a rebuild begins the instant a disk is marked faulty, before anyone reads the alert. Schedule a periodic check scrub — Debian and Ubuntu ship a monthly cron job for this — to read every block and detect silent unrecoverable read errors while the array is still redundant, rather than discovering them mid-rebuild when there is no longer any parity to reconstruct from.
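
The scrub described above is driven through the md sysfs files sync_action and mismatch_cnt. The following sketch reads both for one array. The function name md_scrub_report is hypothetical, and the optional second argument, an alternate sysfs root, exists only so the function can be exercised without a real array.

```shell
# md_scrub_report ARRAY [SYSFS_ROOT]
# Print the current sync action and mismatch count for one md array.
# SYSFS_ROOT defaults to /sys/block; override it only for testing.
md_scrub_report() {
    md_dir="${2:-/sys/block}/$1/md"
    if [ ! -r "$md_dir/sync_action" ]; then
        echo "no md sysfs entry for $1" >&2
        return 1
    fi
    printf '%s action=%s mismatch_cnt=%s\n' "$1" \
        "$(cat "$md_dir/sync_action")" \
        "$(cat "$md_dir/mismatch_cnt")"
}
```

To start a scrub by hand, write check into /sys/block/md0/md/sync_action as root, then run md_scrub_report md0 once the action returns to idle. A nonzero mismatch count after a check is worth investigating while the array is still redundant.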

RAID Is Availability, Not Backup

RAID protects against disk hardware failure, and nothing else. It happily mirrors or stripes an rm -rf, a corrupting application bug, a ransomware encryption pass, or a bad write to all members at once. A controller-free mirror also does nothing for fire, theft, or a filesystem that corrupts its own metadata.

Keep independent, versioned, off-host backups regardless of RAID level, and test restores. Treat RAID as an availability mechanism that buys time to replace a disk without downtime, not as a recovery mechanism for data you deleted or corrupted. Parity levels also carry a write-hole risk: if power is lost mid-stripe, data and parity can disagree, which is why a battery-backed cache or a journaled array matters for RAID 5 and 6.

RAID 5 vs RAID 6 vs RAID 10

RAID 5 — single parity, survives one failure, gives N−1 disks of capacity. The cheapest redundant level, but it carries a four-I/O write penalty and a rebuild that re-reads every surviving disk. Reasonable only for small arrays of modest disks.

RAID 6 — double parity, survives two failures, gives N−2 disks of capacity. Choose it for any large array of multi-terabyte disks, where the multi-hour rebuild window makes a second failure or an unrecoverable read error likely.

RAID 10 — mirrored stripes, 50% usable capacity, no parity math. Choose it for databases and write-heavy workloads where low latency and a fast rebuild (copying one mirror member, not recomputing parity across the whole array) matter more than capacity efficiency.

Common Mistakes

Creating the array but never running mdadm --detail --scan into /etc/mdadm/mdadm.conf, so after a reboot the array reappears as /dev/md127 and any /dev/md0 entry in fstab fails to mount.
Forgetting update-initramfs -u on a root-on-RAID system, which leaves the rebuilt config out of the initramfs and drops the machine to an emergency shell on the next boot.
Never configuring mdadm --monitor or a MAILADDR, so a disk fails silently and the array runs degraded for weeks until a second failure destroys it.
Building large RAID 5 arrays from multi-terabyte disks, where a multi-hour rebuild plus a single unrecoverable read error on a surviving disk takes the whole array down.
Skipping the periodic check scrub, so latent bad sectors stay hidden until they are needed during a rebuild and the reconstruction fails.
Reusing disks that still carry an old md superblock without running mdadm --zero-superblock, causing the kernel to auto-assemble a stale array over your new one.
Building an array from mismatched disk sizes, which wastes the excess on every larger member because the array sizes itself to the smallest device.

Best Practices

Record every array with mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf and then run update-initramfs -u immediately after creating it.
Mount arrays by filesystem UUID from blkid in /etc/fstab, never by the /dev/md0 kernel name, which is not stable across reboots.
Enable email alerts by setting MAILADDR in mdadm.conf and running mdadm --monitor so a faulty disk reaches you the same day.
Choose RAID 6 or RAID 10 over RAID 5 for any array of large disks so it survives a failure during the long rebuild window.
Assign at least one hot spare with mdadm --add-spare so rebuilds begin automatically the moment a member is marked faulty.
Leave the monthly check scrub cron job enabled and review its results, so latent bad sectors surface while the array is still redundant.
Keep independent off-host backups regardless of RAID level, and test restores — RAID survives a dead disk, not a bad delete.

Comparable tools

Windows: Storage Spaces, the software-pooling and resiliency layer that plays the role mdadm does on Linux
ZFS: raidz/raidz2/raidz3, where redundancy is integrated with the filesystem and volume manager rather than a separate md layer
Hardware RAID: a dedicated controller card with battery-backed cache and a proprietary on-disk format you must match to recover

Knowledge Check

You need a root volume that keeps booting after one disk dies, with no parity write penalty and the simplest possible recovery. Which level fits best?

RAID 1, which mirrors every block so a survivor can serve the system unchanged
RAID 0, which stripes blocks across both disks for speed but loses everything the moment one disk fails
RAID 5, which spreads parity for capacity but adds a write penalty and a slow rebuild
RAID 6, which needs at least four disks and is aimed at large capacity arrays

After creating /dev/md0 with mdadm --create, the array vanishes and comes back as /dev/md127 after a reboot. What was skipped?

Writing the array definition into /etc/mdadm/mdadm.conf and rebuilding the initramfs
Running the initial resync to completion, which the kernel kicks off automatically on create anyway
Formatting the array with a filesystem and writing its label before the first reboot
Adding a dedicated hot spare to the array with mdadm --add-spare

Why is RAID 6 preferred over RAID 5 for large arrays of multi-terabyte disks?

Its second parity block lets the array survive an unrecoverable read error during the long single-disk rebuild
It carries no write penalty at all, unlike RAID 5 which has to read, recompute, and rewrite parity on every single stripe update
It rebuilds a failed disk faster than RAID 5 because it has to read fewer surviving members
It needs only two disks for double parity, making it cheaper to deploy than RAID 5

A teammate argues the RAID 1 mirror makes nightly backups unnecessary. What is the flaw?

RAID replicates deletions, corruption, and ransomware to every member; it only protects against disk hardware failure
RAID 1 cannot actually survive a single disk failure, so off-host backups are still required
Mirrors silently drop a fraction of writes, so the two copies drift apart over time without warning
The mirror turns read-only the moment it is degraded, so it refuses to store any new data until the failed member has been replaced and resynced
