etcd Operation and Backup
etcd is the database that holds all of a cluster's state — every object, every status. The cluster is, in a real sense, whatever etcd says it is. Operating it well and backing it up reliably is the single most important durability task for a self-managed cluster.
Lose etcd without a backup and the cluster's desired state is gone. Restore from a tested backup and you can rebuild from disaster. The gap between those two outcomes is preparation you do before you need it.
What etcd Stores and Why It Matters
Every Kubernetes object lives in etcd: Deployments, Services, ConfigMaps, Secrets, and their status. The API server is the only component that talks to it. Because it is the sole source of truth, etcd's health is the cluster's health — and its loss is the cluster's loss. Treat it as the critical database it is, not as an implementation detail.
Quorum and Performance
etcd uses the Raft consensus protocol and needs a quorum, a majority of members, to accept writes. That is why you run an odd number: three members tolerate one failure, five tolerate two. An even count adds cost without improving fault tolerance: four members need a quorum of three and still tolerate only one failure. etcd is also acutely sensitive to disk latency; on slow storage, write latency rises and the whole control plane becomes sluggish and flaky. Give it fast, dedicated disks.
| etcd members | Quorum | Failures tolerated |
|---|---|---|
| 1 | 1 | 0 (no HA) |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
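The table follows from simple integer arithmetic: quorum is the smallest majority, and the tolerated failures are whatever is left over. A minimal sketch, where the function names `quorum` and `tolerated` are illustrative and not part of any etcd tooling:

```shell
# quorum N: smallest majority of N members (integer division).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated N: members that can fail while a majority still remains.
tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the figures for one to five members, including the even counts.
for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerated=$(tolerated "$n")"
done
```

The even rows make the trade-off visible: four members tolerate the same single failure as three, while needing a larger quorum.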
Backup with Snapshots
The backup mechanism is a point-in-time snapshot: etcdctl snapshot save writes a consistent copy of the entire keyspace to a file. Snapshots should be automated on a schedule and shipped off the cluster — a backup on the same failed disk is no backup. Crucially, take a snapshot before any risky operation, especially a control-plane upgrade. On managed clusters the provider handles etcd backup, which is one of the larger operational burdens that managed Kubernetes removes.
    # back up
    ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=ca.crt --cert=server.crt --key=server.key

    # restore (rebuilds the data dir from the snapshot)
    etcdctl snapshot restore snapshot.db \
      --data-dir=/var/lib/etcd-restored
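The advice to automate snapshots on a schedule could be wrapped in a small script along these lines. This is a sketch under stated assumptions, not a hardened tool: the directory `/var/backups/etcd`, the seven-day retention window, and the function names are all hypothetical, and the certificate paths simply mirror the backup command above. Shipping the file off-cluster is left as a comment because the destination depends on your environment.

```shell
# Hypothetical defaults; override via the environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
ENDPOINT="${ENDPOINT:-https://127.0.0.1:2379}"
RETAIN_DAYS="${RETAIN_DAYS:-7}"

# snapshot_name: UTC-timestamped file name so snapshots sort chronologically.
snapshot_name() {
  echo "etcd-snapshot-$(date -u +%Y%m%dT%H%M%SZ).db"
}

# prune_old: delete local snapshots last modified more than RETAIN_DAYS days ago.
prune_old() {
  find "$BACKUP_DIR" -name 'etcd-snapshot-*.db' -type f \
    -mtime "+$RETAIN_DAYS" -delete
}

# backup_etcd: save a snapshot, check it is non-empty, prune, print its path.
backup_etcd() {
  mkdir -p "$BACKUP_DIR" || return 1
  file="$BACKUP_DIR/$(snapshot_name)"
  ETCDCTL_API=3 etcdctl snapshot save "$file" \
    --endpoints="$ENDPOINT" \
    --cacert=ca.crt --cert=server.crt --key=server.key || return 1
  [ -s "$file" ] || return 1
  # Ship "$file" off the cluster here (encrypted); tooling is site-specific.
  prune_old
  echo "$file"
}
```

Calling `backup_etcd` from a cron job or systemd timer on a control-plane node would provide the schedule. Pruning only after a successful save means a failing backup job never deletes the last good copies.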
Restore, Defrag, and Encryption
A backup you have never restored is a hope, not a plan — rehearse the restore, because the procedure (stop the API server, restore the data dir, restart) is unforgiving under real pressure. Over time etcd needs compaction and defragmentation to reclaim space from old revisions. And remember that Secrets sit in etcd: without encryption at rest (Topic 35), an etcd snapshot is a plaintext copy of every secret — secure the backups accordingly.
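Both the restore rehearsal and the compaction/defragmentation routine can be scripted. The sketch below makes several assumptions: the function names are hypothetical, connection settings are expected to come from the `ETCDCTL_ENDPOINTS`, `ETCDCTL_CACERT`, `ETCDCTL_CERT`, and `ETCDCTL_KEY` environment variables, and the restore uses the same `etcdctl snapshot restore` form as the command above (newer etcd releases move that subcommand to `etcdutl`). The rehearsal only proves that the snapshot restores into a data directory; it does not start a cluster from it.

```shell
# rehearse_restore SNAPSHOT: restore into a throwaway directory and confirm
# that a member data directory was produced. Prints the restored path.
rehearse_restore() {
  snap="$1"
  scratch="$(mktemp -d)" || return 1
  target="$scratch/etcd-restored"
  etcdctl snapshot restore "$snap" --data-dir="$target" >&2 || return 1
  [ -d "$target/member" ] || return 1
  echo "$target"
}

# extract_revision: read `endpoint status` JSON on stdin, print the revision.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# compact_and_defrag: discard history up to the current revision, then
# defragment the member to return the freed space to the filesystem.
compact_and_defrag() {
  rev="$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | extract_revision)"
  [ -n "$rev" ] || return 1
  ETCDCTL_API=3 etcdctl compact "$rev" || return 1
  ETCDCTL_API=3 etcdctl defrag
}
```

Defragmentation blocks the member while it runs, so it is usually done one member at a time rather than across the whole cluster at once.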
etcd Snapshot vs. Velero
etcd snapshot: captures the entire cluster state, every API object and its status, as held in the control plane's database. Use it for disaster recovery of the cluster itself.
Velero: backs up Kubernetes objects and PersistentVolume data, selectively and per-namespace. Use it for application/data backup and migration, not control-plane recovery.
Common Mistakes
- Never testing a restore, so the backup is unproven when disaster strikes.
- Running an even number of etcd members, gaining cost but no extra fault tolerance.
- Putting etcd on slow disks and turning the whole control plane sluggish.
- Skipping a snapshot before a control-plane upgrade.
- Storing etcd snapshots unencrypted, exposing every Secret in the cluster.
Key Takeaways
- Run an odd number of etcd members (three or five) on fast, dedicated disks.
- Automate snapshots on a schedule and ship them off-cluster, encrypted.
- Always snapshot before risky operations, especially control-plane upgrades.
- Rehearse the restore procedure so it is muscle memory, not improvisation.
- Encrypt Secrets at rest so etcd backups don't leak credentials; use Velero for app/PV backup.
Knowledge Check
Why do you run an odd number of etcd members?
- Quorum needs a majority; odd counts maximize fault tolerance for the member count (3 tolerate 1, 5 tolerate 2)
- Odd member counts consume noticeably less disk per node than even counts do
- The Raft consensus protocol rejects even member counts outright and refuses to form a cluster with that many nodes
- An odd number of members lowers read latency across the cluster
What is the etcd backup mechanism?
- A point-in-time snapshot (etcdctl snapshot save) of the entire keyspace, shipped off-cluster
- File-level copying of each member's live data directory on disk
- Exporting every API object to YAML with kubectl get --all-namespaces and archiving the dump off-cluster
- A Velero volume backup of the control-plane host
Why must etcd snapshots be stored encrypted?
- Secrets live in etcd; without encryption at rest, a snapshot is a plaintext copy of every secret
- An unencrypted snapshot is otherwise far too large to ship off-cluster
- etcd validates the snapshot on restore and refuses to load any file that was not encrypted at rest beforehand
- Encryption makes the restore step measurably faster