Web SaaS Platform
The first case study is the most common one: a customer-facing SaaS — a multi-service web application serving many tenants over the internet, with the usual demands of availability, predictable releases, and elastic capacity. It assembles the stateless-web pattern, ingress, autoscaling, managed data, and GitOps into one architecture.
Nothing here is new; the value is in seeing the earlier pieces fit together, along with the decisions and trade-offs that shape a real system rather than a diagram.
Requirements and Traffic Shape
The workload is HTTP, bursty (daytime peaks, marketing spikes), latency-sensitive, and multi-tenant. It must deploy frequently without downtime, scale with traffic, isolate tenants reasonably, and survive a zone failure. These requirements, not technology preference, drive every choice that follows — the architecture is a response to them.
The Architecture
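The shape is the stateless-web pattern at scale. An Ingress/Gateway terminates TLS and routes to one Service per application component (frontend, API, a few backend services), each backed by a Deployment and scaled by a HorizontalPodAutoscaler (HPA) on request rate. Request rate is not a built-in resource metric like CPU or memory, so the HPA needs it exposed through the custom or external metrics API. The primary database is a managed service, not in-cluster, with a managed cache alongside; the cluster stays stateless. Replicas are spread across zones with topology spread constraints so that losing a zone removes only a fraction of capacity, and PodDisruptionBudgets (PDBs) keep enough replicas running during voluntary disruptions such as node drains and upgrades.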
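To make "scaled by an HPA on request rate" concrete, here is a minimal sketch of the replica calculation the Kubernetes documentation describes for the HPA: desired replicas are the current replicas multiplied by the ratio of observed to target metric, rounded up. The function name, the replica bounds, and the requests-per-second figures are assumptions for illustration; the 10% tolerance mirrors the documented default, and the real controller also applies stabilization windows and readiness handling that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_rps_per_pod, target_rps_per_pod,
                     min_replicas=3, max_replicas=30, tolerance=0.1):
    """Approximate the HPA scaling rule for a per-pod request-rate target.

    All parameter names and default bounds are illustrative assumptions.
    """
    ratio = current_rps_per_pod / target_rps_per_pod
    # Within the tolerance band the HPA leaves the replica count alone,
    # which avoids flapping on small metric fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured min/max bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # A marketing spike pushes each of 6 pods to 150 rps against a 100 rps target.
    print(desired_replicas(6, 150, 100))
    # Overnight traffic drops to 20 rps per pod; the floor of 3 replicas holds.
    print(desired_replicas(6, 20, 100))
```

Keeping the minimum at three or more replicas in this sketch reflects the zone requirement: scaling down should never leave a zone without a replica to serve from.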
Tenancy, Config, and Release
Tenancy here is soft and application-level: one shared deployment serves all tenants, with isolation enforced in the data layer, and namespaces separate environments rather than tenants. Configuration and per-environment values come from ConfigMaps and Secrets, with secret values sourced from an external store. Releases run through GitOps with canary rollouts: a new version takes a slice of traffic, its metrics are watched, and it ramps up or rolls back automatically. Readiness probes keep traffic away from pods that are not yet able to serve, which is what makes each rollout zero-downtime.
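The ramp-or-roll-back decision in a canary rollout can be sketched as a small function. This is an illustration of the idea only, not the behaviour of any particular rollout controller: the traffic steps, the metric names, and the thresholds are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical traffic steps, as a percentage of requests sent to the canary.
STEPS = [5, 25, 50, 100]


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, e.g. 0.002
    p99_latency_ms: float   # 99th percentile latency in milliseconds


def next_weight(current_weight, canary, baseline,
                max_error_delta=0.01, max_latency_ratio=1.2):
    """Return the next canary traffic weight, or 0 to roll back.

    Thresholds are assumptions: the canary may exceed the baseline error
    rate by at most max_error_delta and its p99 latency by at most
    max_latency_ratio times the baseline.
    """
    healthy = (
        canary.error_rate - baseline.error_rate <= max_error_delta
        and canary.p99_latency_ms <= baseline.p99_latency_ms * max_latency_ratio
    )
    if not healthy:
        return 0  # automated rollback: all traffic returns to the stable version
    for step in STEPS:
        if step > current_weight:
            return step  # ramp to the next larger slice of traffic
    return 100  # already fully promoted


if __name__ == "__main__":
    baseline = Metrics(error_rate=0.001, p99_latency_ms=200)
    print(next_weight(5, Metrics(0.002, 210), baseline))   # healthy canary ramps
    print(next_weight(25, Metrics(0.050, 210), baseline))  # error spike rolls back
```

Comparing the canary against the live baseline, rather than against fixed absolute limits, keeps the check meaningful when overall traffic conditions change during a rollout.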
Alternatives Considered
Two roads were not taken. Running the database in-cluster (a StatefulSet or operator) was rejected — the operational cost of self-running the primary datastore outweighed the marginal control, and a managed service is the safer default. Skipping Kubernetes entirely for a serverless-container platform (Cloud Run / ECS) was viable and simpler for a smaller system, but the team valued portability, the ecosystem, and consistent tooling across many services. The honest note: at low scale, the serverless-container option would have been less to operate.
The trade-off in summary:
Kubernetes: portable, rich ecosystem, and consistent across many services, at the cost of operating a cluster.
Serverless containers (Cloud Run/ECS): less to operate and often simpler at small scale, at the cost of portability and ecosystem depth.
Common Pitfalls
- Running the primary database in-cluster by default instead of using a managed service.
- Deploying without readiness probes or PDBs, so releases and zone events cause downtime.
- Spreading replicas across nodes but not zones, leaving a zone failure fatal.
- Per-service LoadBalancers instead of one Gateway/Ingress, multiplying cost and complexity.
- Over-engineering hard tenancy when application-level isolation met the requirement.
Key Takeaways
- Keep the cluster stateless; put the primary datastore and cache on managed services.
- Front services with one Gateway/Ingress; scale on request rate with HPAs.
- Spread across zones with topology spread + PDBs for zone-failure survival and safe rollouts.
- Release through GitOps with canary and automated rollback; rely on readiness probes.
- Right-size tenancy to the requirement; application-level isolation is often enough.
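The difference between spreading across nodes and spreading across zones can be checked with simple arithmetic. The sketch below is a hypothetical helper, not a Kubernetes API: it takes a replica count per zone, reports the skew (the gap between the fullest and emptiest zone, which is what a topology spread constraint bounds), and tests whether enough replicas remain after the worst single-zone loss. Zone names and the required capacity are made-up values.

```python
def max_skew(replicas_per_zone):
    """Difference between the most and least populated zones."""
    counts = list(replicas_per_zone.values())
    return max(counts) - min(counts)


def survives_zone_loss(replicas_per_zone, required):
    """True if losing the most populated zone still leaves `required` replicas."""
    total = sum(replicas_per_zone.values())
    worst_case_remaining = total - max(replicas_per_zone.values())
    return worst_case_remaining >= required


if __name__ == "__main__":
    # Six replicas on different nodes, but every node sits in the same zone.
    packed = {"zone-a": 6, "zone-b": 0, "zone-c": 0}
    # The same six replicas spread evenly across three zones.
    spread = {"zone-a": 2, "zone-b": 2, "zone-c": 2}

    for name, layout in (("packed", packed), ("spread", spread)):
        print(name, "skew:", max_skew(layout),
              "survives with 4 required:", survives_zone_loss(layout, 4))
```

The same calculation suggests a sizing rule: with three zones and an even spread, roughly one and a half times the required replicas are needed so that two thirds of them cover the load after a zone is lost.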
Knowledge Check
Why keep the primary database outside the cluster in this SaaS design?
- The operational cost of self-running the datastore outweighs the marginal control; managed is safer
- Kubernetes simply cannot run a stateful relational database inside the cluster at all, even with StatefulSets and operators
- StatefulSets cannot attach any durable persistent storage to a replica through a PersistentVolumeClaim
- A managed database service is always strictly cheaper per gigabyte than self-running one
What makes the SaaS releases zero-downtime?
- Readiness probes plus canary rollouts via GitOps with automated rollback
- Using the Recreate deployment strategy for every release
- Running exactly one replica per service so there is nothing to drain
- Disabling the HorizontalPodAutoscaler during every deploy
When might skipping Kubernetes for a serverless-container platform have been better?
- At small scale, where it is simpler to operate, trading away portability and ecosystem
- Never — running a full Kubernetes cluster is always the simpler option at any scale whatsoever
- Only for heavy stateful workloads that need durable persistent volumes attached to every replica
- Only when the whole system happens to consist of exactly one single service