Topic 50

etcd Operation and Backup


etcd is the database that holds all of a cluster's state — every object, every status. The cluster is, in a real sense, whatever etcd says it is. Operating it well and backing it up reliably is the single most important durability task for a self-managed cluster.

Lose etcd without a backup and the cluster's desired state is gone. Restore from a tested backup and you can rebuild from disaster. The gap between those two outcomes is preparation you do before you need it.

What etcd Stores and Why It Matters

Every Kubernetes object lives in etcd: Deployments, Services, ConfigMaps, Secrets, and their status. The API server is the only component that talks to it. Because it is the sole source of truth, etcd's health is the cluster's health — and its loss is the cluster's loss. Treat it as the critical database it is, not as an implementation detail.

Quorum and Performance

[Figure: etcd quorum, a majority must agree. An etcd cluster of 5 members (Raft); quorum: 3 of 5 members must agree to commit a write.]

Lose two members and three remain: quorum holds and writes continue. Lose three and the remaining two cannot form a majority, so the cluster stops committing writes (and serving linearizable reads) until quorum is restored. Always run an odd number.

etcd uses the Raft consensus protocol and needs a quorum, a majority of members, to accept writes. That is why you run an odd number: three members tolerate one failure, five tolerate two. An even count adds cost without improving fault tolerance: four members need three for quorum, so they still tolerate only one failure, the same as three members. etcd is also acutely sensitive to disk latency, because each write must be persisted to the write-ahead log before it is acknowledged; on slow storage, write latency rises and the whole control plane becomes sluggish and flaky. Give it fast, dedicated disks.

etcd members	Quorum	Failures tolerated
1	1	0 (no HA)
3	2	1
5	3	2
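
The numbers in the table follow from simple majority arithmetic. As a minimal sketch (the function names `quorum` and `tolerated` are illustrative, not part of any etcd tooling), the same values can be computed for any member count:

```shell
#!/usr/bin/env bash
# Quorum arithmetic for an etcd cluster of n members (illustrative helpers).

# Smallest majority of n members: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority still remains: n - quorum(n).
tolerated() { echo $(( $1 - $1 / 2 - 1 )); }

# Print the values for 1..6 members, including the even counts the table omits.
for n in 1 2 3 4 5 6; do
  printf '%s members: quorum %s, tolerates %s\n' "$n" "$(quorum "$n")" "$(tolerated "$n")"
done
```

Running it shows that four members tolerate one failure, exactly like three, and six tolerate two, exactly like five.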

Backup with Snapshots

The backup mechanism is a point-in-time snapshot: etcdctl snapshot save writes a consistent copy of the entire keyspace to a file. Snapshots should be automated on a schedule and shipped off the cluster — a backup on the same failed disk is no backup. Crucially, take a snapshot before any risky operation, especially a control-plane upgrade. On managed clusters the provider handles etcd backup, which is one of the larger operational burdens that managed Kubernetes removes.

Snapshot and restore

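# back up: write a point-in-time snapshot of the keyspace to snapshot.db
# (ETCDCTL_API=3 is only required on etcd older than 3.4; v3 is the default since then)
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=ca.crt --cert=server.crt --key=server.key

# restore (rebuilds a new data dir from the snapshot; etcd must then be pointed at it)
# on etcd 3.5+ this subcommand is deprecated in favor of: etcdutl snapshot restore
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd-restored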

Restore, Defrag, and Encryption

A backup you have never restored is a hope, not a plan. Rehearse the restore, because the procedure (stop the API server and etcd, restore the snapshot into a fresh data dir, point etcd at it, restart) is unforgiving under real pressure. Over time etcd also accumulates old revisions: compaction discards that history, and defragmentation returns the freed space to the filesystem. And remember that Secrets sit in etcd: without encryption at rest (Topic 35), an etcd snapshot is a plaintext copy of every Secret, so secure the backups accordingly.
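
The compaction and defragmentation step can be sketched as follows. This is a minimal outline, not a definitive runbook: the helper names are hypothetical, and it assumes the endpoint and TLS settings are supplied through etcdctl's environment variables (`ETCDCTL_ENDPOINTS`, `ETCDCTL_CACERT`, `ETCDCTL_CERT`, `ETCDCTL_KEY`). Note that on a Kubernetes cluster the API server normally requests compaction periodically on its own, so in practice defragmentation is usually the step left to the operator.

```shell
#!/usr/bin/env bash
# Sketch of routine etcd maintenance: compact old revisions, then defragment.
# Connection flags are assumed to come from ETCDCTL_* environment variables.

# Read `etcdctl endpoint status --write-out=json` on stdin and print the
# current revision number of the first endpoint listed.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

compact_and_defrag() {
  local rev
  rev="$(etcdctl endpoint status --write-out=json | extract_revision)"
  [ -n "$rev" ] || return 1
  # Discard history older than the current revision.
  etcdctl compact "$rev" || return 1
  # Rewrite the database file so the freed space is returned to the filesystem.
  etcdctl defrag
}

# Example (run against one member at a time): compact_and_defrag
```

Defragmentation blocks the member while it runs, so it is safer to work through the members one by one rather than all at once.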

etcd snapshot vs Velero

etcd snapshot: a copy of the control plane's database, and with it the entire cluster state (every API object). It does not include the data inside PersistentVolumes. For disaster recovery of the cluster itself.

Velero: backs up Kubernetes objects and PersistentVolume data, selectively and per namespace. For application and data backup and migration, not control-plane recovery.
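
For contrast with the etcdctl commands earlier, a namespace-scoped Velero backup might look like the sketch below. The wrapper functions and the `shop` and `shop-nightly` names are hypothetical; the sketch assumes Velero is already installed in the cluster with a backup storage location configured.

```shell
#!/usr/bin/env bash
# Sketch of application-level backup with the Velero CLI.
# Namespace and backup names are hypothetical.

# Back up the objects in one namespace. Volume data is included only if
# Velero has been set up with a volume snapshot or file-system backup method.
backup_namespace() {
  local namespace="$1" backup_name="$2"
  velero backup create "$backup_name" --include-namespaces "$namespace"
}

# Recreate the backed-up objects from a named backup.
restore_backup() {
  local backup_name="$1"
  velero restore create --from-backup "$backup_name"
}

# Example: backup_namespace shop shop-nightly
#          restore_backup shop-nightly
```

The scope is the visible difference: Velero selects by namespace and works through the API server, while an etcd snapshot is all-or-nothing and bypasses it.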

Common Mistakes

Never testing a restore, so the backup is unproven when disaster strikes.
Running an even number of etcd members, gaining cost but no extra fault tolerance.
Putting etcd on slow disks and turning the whole control plane sluggish.
Skipping a snapshot before a control-plane upgrade.
Storing etcd snapshots unencrypted, exposing every Secret in the cluster.

Best Practices

Run an odd number of etcd members (three or five) on fast, dedicated disks.
Automate snapshots on a schedule and ship them off-cluster, encrypted.
Always snapshot before risky operations, especially control-plane upgrades.
Rehearse the restore procedure so it is muscle memory, not improvisation.
Encrypt Secrets at rest so etcd backups don't leak credentials; use Velero for app/PV backup.
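
A restore rehearsal can start small. The sketch below restores a snapshot into a throwaway directory and checks that a database file was produced, without touching the live cluster. It assumes etcd 3.5 or later, where `etcdutl` is available, and the function name `rehearse_restore` is hypothetical. It proves only that the snapshot file is restorable; a full rehearsal would also start etcd and an API server against the restored data.

```shell
#!/usr/bin/env bash
# Sketch of a scratch restore: proves a snapshot file can be restored,
# without touching the live data directory.

rehearse_restore() {
  local snapshot="$1" scratch
  scratch="$(mktemp -d)" || return 1
  # Restore into a new subdirectory; the target must not exist beforehand.
  etcdutl snapshot restore "$snapshot" --data-dir "$scratch/etcd" || return 1
  # A restored data dir is expected to contain member/snap/db.
  if [ -f "$scratch/etcd/member/snap/db" ]; then
    echo "restore OK: $scratch/etcd"
  else
    echo "restore produced no database file" >&2
    return 1
  fi
}

# Example: rehearse_restore snapshot.db
```

Running this after every scheduled snapshot turns an unproven backup into one that has at least been opened and rebuilt once.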

Related

Cluster architecture: etcd's place in the control plane (Topic 03)
Secrets management: why encryption at rest protects etcd backups (Topic 35)
Velero / managed backup: application and PV backup, or provider-managed etcd

Knowledge Check

Why do you run an odd number of etcd members?

Quorum needs a majority; odd counts maximize fault tolerance for the member count (3 tolerate 1, 5 tolerate 2)
Odd member counts consume noticeably less disk per node than even counts do
The Raft consensus protocol rejects even member counts outright and refuses to form a cluster with that many nodes
An odd number of members lowers read latency across the cluster

What is the etcd backup mechanism?

A point-in-time snapshot (etcdctl snapshot save) of the entire keyspace, shipped off-cluster
File-level copying of each member's live data directory on disk
Exporting every API object to YAML with kubectl get --all-namespaces and archiving the dump off-cluster
A Velero volume backup of the control-plane host

Why must etcd snapshots be stored encrypted?

Secrets live in etcd; without encryption at rest, a snapshot is a plaintext copy of every secret
An unencrypted snapshot is otherwise far too large to ship off-cluster
etcd validates the snapshot on restore and refuses to load any file that was not encrypted at rest beforehand
Encryption makes the restore step measurably faster
