Topic 48

mTLS and Operating TLS

mTLS

Ordinary TLS authenticates one end: the server proves its identity with a certificate, and the client stays anonymous, typically proving itself afterward with a password or token at the application layer. Mutual TLS makes both ends present certificates during the handshake, so the client's identity is established cryptographically before any request is processed — the client proves who it is with a private key, not a shared secret.

mTLS is the backbone of zero-trust service-to-service authentication and one of the main reasons teams adopt a service mesh. It is also where TLS stops being a one-time setup and becomes a certificate-lifecycle problem: thousands of workloads each need a certificate, and each certificate expires. ACME and Let's Encrypt automated that lifecycle for the public web, and the same discipline (automate or die) applies inside the mesh.

One-way TLS versus mTLS — who proves identity

One-way TLS: server authenticates

Only the server presents a certificate. The client stays anonymous in the handshake and proves itself afterward with a password or token at the app layer.

mTLS: both ends authenticate

Both sides present certificates. The server sends a CertificateRequest; the client answers with its own cert and a signature — identity is cryptographic before any request runs.

Mutual Authentication

In an mTLS handshake the server sends a CertificateRequest, and the client responds with its own certificate and a signature proving it holds the matching private key. The server validates the client's certificate against a CA it trusts, exactly as the client validated the server's. Neither side talks to an unauthenticated peer.
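As a rough illustration of the handshake above, the sketch below builds the two TLS contexts with Python's standard ssl module. The function names and the optional file-path parameters are assumptions made for this example, not part of any particular mesh or framework; the paths are optional only so the policy settings can be inspected without real certificate files.

```python
import ssl


def make_server_context(cert_file=None, key_file=None, client_ca_file=None):
    """Server side of mTLS: present a certificate and require one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Server contexts default to CERT_NONE (one-way TLS). CERT_REQUIRED makes the
    # server request a client certificate and fail the handshake if none is valid.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # The CA that client certificates must chain to.
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx


def make_client_context(cert_file=None, key_file=None, server_ca_file=None):
    """Client side of mTLS: verify the server and present a client certificate."""
    # PROTOCOL_TLS_CLIENT enables CERT_REQUIRED and hostname checking by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        # The private key stays local; only the certificate and a signature are sent.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if server_ca_file:
        ctx.load_verify_locations(cafile=server_ca_file)
    return ctx
```

The only line that separates this server from a one-way TLS server is the verify_mode assignment; everything else is ordinary TLS setup.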

This fits anywhere you need strong identity between machines rather than humans: service-to-service calls inside a mesh, partner APIs where an API key is too weak, and any zero-trust network where "inside the perimeter" is no longer a sufficient credential. The payoff over a bearer token is that the credential is a private key that never crosses the wire — there is no secret to steal in transit and replay.

Identity at Scale

Handing thousands of services long-lived client certificates does not scale and is dangerous: if a certificate's private key leaks, the identity stays abusable for the certificate's whole life. The pattern that works is short-lived workload certificates (minted on startup, valid for hours, rotated automatically), so a stolen identity expires within hours and revocation is rarely needed.

SPIFFE standardizes this. Each workload gets a SPIFFE ID like spiffe://cluster/ns/payments/sa/api encoded into a short-lived SVID (SPIFFE Verifiable Identity Document), an X.509 certificate or JWT issued by the platform. A service mesh (Istio, Linkerd) or SPIRE issues and rotates these transparently, so application code never handles a certificate — it just gets an authenticated peer identity from the sidecar.
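To make the shape of a SPIFFE ID concrete, here is a minimal sketch that splits one into its trust domain and workload path. The function name is hypothetical and the validation is deliberately simplified; a real deployment would normally rely on a SPIFFE library or the mesh itself rather than hand-rolled parsing.

```python
from urllib.parse import urlsplit


def parse_spiffe_id(spiffe_id):
    """Return (trust_domain, path) for a SPIFFE ID, or raise ValueError.

    Simplified checks only: scheme, non-empty trust domain and path,
    and no port, userinfo, query, or fragment.
    """
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe":
        raise ValueError("not a SPIFFE ID: scheme must be spiffe")
    if not parts.netloc or "@" in parts.netloc or ":" in parts.netloc:
        raise ValueError("trust domain must be a bare host name")
    if not parts.path or parts.path == "/":
        raise ValueError("workload path is missing")
    if parts.query or parts.fragment:
        raise ValueError("query and fragment are not allowed")
    return parts.netloc, parts.path


trust_domain, path = parse_spiffe_id("spiffe://cluster/ns/payments/sa/api")
# trust_domain is "cluster"; path is "/ns/payments/sa/api"
```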

ACME and Automated Issuance

ACME (Automatic Certificate Management Environment) is the protocol Let's Encrypt uses to issue certificates without a human. A client proves control of a domain by answering a challenge — serving a token at a well-known HTTP path, or publishing a DNS TXT record — and the CA issues the certificate automatically. This turned certificate provisioning from a manual purchase into an API call, which is what made 90-day certificates practical at internet scale.
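The two challenge types can be sketched as the values a client has to publish. This follows the layout described in RFC 8555 as a hedged illustration; the function names are assumptions for this example, and in practice an ACME client such as certbot computes all of this for you. The token comes from the CA and the thumbprint is derived from the ACME account key.

```python
import base64
import hashlib


def b64url(data):
    """Base64url without padding, the encoding ACME uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def key_authorization(token, account_key_thumbprint):
    """Bind the CA's token to the account key: token, a dot, then the thumbprint."""
    return f"{token}.{account_key_thumbprint}"


def http01_url_path(token):
    """HTTP-01: the key authorization is served as the body at this path."""
    return f"/.well-known/acme-challenge/{token}"


def dns01_record(domain, key_auth):
    """DNS-01: TXT record name and value proving control of the domain."""
    digest = hashlib.sha256(key_auth.encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)


# Placeholder values; real tokens and thumbprints come from the CA and account key.
key_auth = key_authorization("example-token", "example-thumbprint")
record_name, record_value = dns01_record("api.example.com", key_auth)
```

HTTP-01 needs the CA to reach the web server, while DNS-01 needs write access to the zone, which is why DNS-01 is the usual choice for hosts that are not publicly reachable.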

The model is renew-or-die. A 90-day certificate with no automated renewal is just a deferred outage; the discipline is to renew well before expiry — typically at one-third of the lifetime remaining — so a few failed renewals still leave a margin to fix the problem before anything breaks.

# issue and auto-renew via ACME; certbot installs a renewal timer
certbot certonly --nginx -d api.example.com -d www.example.com
# Certificate is saved at: /etc/letsencrypt/live/api.example.com/
# renews automatically at ~30 days remaining (1/3 of 90)
certbot renew --dry-run   # verify the renewal path works before it matters
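The one-third rule is simple arithmetic on the certificate's validity window. The sketch below is an illustration only; the function names and the divisor parameter are assumptions for this example, not certbot internals.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, divisor=3):
    """Moment at which 1/divisor of the certificate lifetime remains."""
    lifetime = not_after - not_before
    return not_after - lifetime / divisor


def should_renew(now, not_before, not_after, divisor=3):
    """True once the renewal threshold has been reached."""
    return now >= renewal_time(not_before, not_after, divisor)


issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
# For a 90-day certificate the threshold falls on day 60, leaving 30 days of slack.
threshold = renewal_time(issued, expires)
```

The same function applies unchanged to a workload certificate valid for a few hours; only the scale of the slack changes.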

Rotation and Expiry Operations

Every certificate expires, so the only safe operating model is automation — a renewal timer, a mesh control plane, or an ACME client running unattended, never a calendar reminder for a human. Manual rotation works until the one time it does not: the engineer who owned the runbook left, the reminder was muted, and the certificate lapsed at 3am. The longer the certificate's life, the rarer the rotation and the more certain everyone is to have forgotten how it works.

Monitor expiry independently of the renewal mechanism, because the renewer can silently fail. Alert on days-to-expiry from outside the system that renews — a separate probe hitting the live endpoint — so a broken cron job or a failed challenge surfaces while there is still slack to fix it, not when clients start rejecting the handshake. mTLS authenticates identity; it does not authorize an action, so keep authorization as a separate decision layered on top.
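An external probe of this kind can be sketched with Python's standard library. The function names here are hypothetical, and the probe assumes an endpoint that does not itself demand a client certificate; for an mTLS-only endpoint the probe's context would also need its own certificate loaded.

```python
import socket
import ssl
import time


def days_remaining(not_after, now=None):
    """Days until expiry, given the notAfter string from a certificate."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def probe_expiry(host, port=443, timeout=5.0):
    """Connect to the live endpoint and report days left on its certificate.

    An already expired or untrusted certificate raises an ssl.SSLError during
    the handshake, which a monitor should treat as an alert in its own right.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_remaining(cert["notAfter"])
```

Running something like probe_expiry("api.example.com") from a scheduler outside the renewing system, and alerting when the result drops below the renewal threshold, gives the independent signal described above.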

mTLS vs API Keys vs Bearer Tokens

mTLS proves identity with a private key that never leaves the client, validated cryptographically each handshake — the strongest of the three, with rotation handled by short-lived certificates. API keys are a single shared secret sent on every request; if intercepted or logged, they are replayable until manually rotated, and rotation means coordinating every caller.

Bearer tokens (OAuth, JWT) are credentials anyone holding them can use — bearer means exactly that — so a leaked token works until it expires, which is why they are kept short-lived. Choose mTLS for service-to-service identity, tokens for delegated user authorization, and treat a long-lived API key as the weakest option to migrate off.

Common Mistakes

Running mTLS with manual certificate rotation. With thousands of workload certificates, manual rotation guarantees one is missed; that certificate expires and the service it fronts goes hard-down with no graceful degradation.
Shipping long-lived client certificates you cannot rotate. A multi-year client certificate whose private key leaks stays abusable for years, and if it is baked into a device image there is no practical way to pull it back.
Having no independent monitoring of certificate expiry. When the only signal is the renewer itself, a silently broken renewal job is invisible until the handshake starts failing for every client at once.
Treating mTLS as authorization. It proves who the client is (authentication), not what it may do; without a separate authorization layer, any valid certificate holder can call any endpoint.
Skipping the ACME dry-run before relying on renewal. A challenge setup that passed initial issuance can still fail at renewal, for example after a web server or DNS change, so the certificate quietly stops renewing and lapses at the end of its 90-day life.

Best Practices

Issue short-lived workload certificates (hours, not years) through a mesh or SPIRE, so a leaked identity expires almost immediately and you never depend on revocation.
Automate every renewal with ACME, certbot, or a mesh control plane, and renew at one-third lifetime remaining so a few failed attempts still leave margin to recover.
Probe certificate expiry from outside the renewing system and alert on days remaining, so a broken renewer surfaces before clients reject the handshake.
Run certbot renew --dry-run after any config change to confirm the challenge path still works, rather than discovering it failed at the next real renewal.
Layer authorization on top of mTLS identity — map the certificate's SPIFFE ID or subject to a policy — so authentication alone never grants access to every endpoint.
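The last practice can be sketched as a lookup from authenticated identity to allowed actions. The policy table, the second SPIFFE ID, the endpoints, and the function name are all hypothetical; real meshes express this in their own policy language rather than in application code.

```python
# Hypothetical policy: which (method, path) pairs each workload identity may call.
POLICY = {
    "spiffe://cluster/ns/payments/sa/api": {("GET", "/charges"), ("POST", "/charges")},
    "spiffe://cluster/ns/reports/sa/batch": {("GET", "/charges")},
}


def is_authorized(peer_id, method, path, policy=POLICY):
    """Authorization decision made after mTLS has authenticated peer_id.

    Default deny: an identity with a valid certificate but no policy entry
    is allowed nothing.
    """
    allowed = policy.get(peer_id, frozenset())
    return (method.upper(), path) in allowed
```

The default-deny lookup is the design point: a valid certificate gets a caller as far as the policy check and no further.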

Comparable concepts: Service mesh mTLS (Topic 73), SSH certificate auth, OAuth 2.0 (bearer tokens)

Knowledge Check

What does mTLS add over one-way TLS plus an API key for service-to-service calls?

The client proves identity with a private key that never crosses the wire
It encrypts the traffic more strongly than one-way TLS does on its own
It replaces the need for any authorization policy on the receiving service
It removes certificate expiry, so nothing on either end needs rotation

Why is automation mandatory rather than optional when operating mTLS at scale?

Every certificate expires, and manual rotation across thousands of workloads eventually misses one
Automated issuance uses stronger keys than a human-issued certificate would
Manually issued certificates are rejected by the TLS handshake outright, so only an automated pipeline produces certs a client will accept
Automation is what enforces authorization rules between the services

In ACME-based renewal, why renew at roughly one-third of the certificate's lifetime remaining rather than the day before expiry?

It leaves slack so a few failed renewals can still be caught and fixed before expiry
Renewing earlier extends the validity of the existing certificate further into the future
It reduces the fee the CA charges by spreading renewals out over time
It lowers the handshake latency for clients connecting near expiry
