Serverless SaaS Application

Case Study · SaaS · Architecture

A composite B2B SaaS for legal teams: multi-tenant, per-seat pricing, eight engineers, three years on a pure-serverless AWS stack. Tenant isolation is existential, cost must scale with usage (trial tenants near-zero, heavy tenants priced accordingly), and SOC 2 plus GDPR plus optional EU data residency apply from year one.

The team picked serverless for its cost and operational shape — eight engineers cannot run an EKS cluster, a Kafka cluster, and a multi-region database operation on top of building the product. Three years in, the choice held up.

The Stack and Tenant Isolation

Pure serverless with one exception (managed OpenSearch): CloudFront + WAF, API Gateway HTTP API with a Cognito JWT authorizer, Lambda handlers, DynamoDB single-table, S3 for documents, Aurora Serverless v2 for billing and queryable audit, EventBridge + SQS, Step Functions, and Bedrock. The whole stack is defined in CDK.

Figure: Serverless SaaS layers, with tenant-scoped data
  • Edge + API: CloudFront + WAF; API Gateway (Cognito JWT → tenant_id)
  • Compute + Workflow + AI: Lambda (tenant-aware context); Step Functions (Express + Standard); Bedrock (AI drafts + Guardrails)
  • Data (pooled, tenant-scoped): DynamoDB (tenant# key prefix); S3 (per-tenant prefixes); Aurora Serverless v2 (billing + audit, RLS)

Tenant isolation is an explicit enforcement point, not just a stored tenant_id. The JWT authorizer extracts the trusted claim into Lambda context, where the client cannot override it, and every storage layer enforces it: DynamoDB partition-key prefix, S3 presigned-URL scoping, a required OpenSearch query filter, and Aurora row-level security. Pooled search is the highest-risk leak surface, so its tenant filter is enforced structurally through a shared query builder and verified in CI.
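As a rough illustration of that enforcement point, the sketch below reads the tenant from the verified JWT claims of an API Gateway HTTP API event (payload format 2.0) and wraps every search in a mandatory tenant filter. The claim name `custom:tenant_id`, the indexed field `tenant_id`, and the helper names are assumptions made for the example, not details from the case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantContext:
    """Trusted tenant identity for the current request."""
    tenant_id: str


def tenant_from_event(event: dict) -> TenantContext:
    """Read the tenant from verified JWT claims, never from the body or query string."""
    claims = (
        event.get("requestContext", {})
        .get("authorizer", {})
        .get("jwt", {})
        .get("claims", {})
    )
    tenant_id = claims.get("custom:tenant_id")  # assumed claim name
    if not tenant_id:
        raise PermissionError("request has no trusted tenant claim")
    return TenantContext(tenant_id=str(tenant_id))


def build_search(ctx: TenantContext, user_query: dict) -> dict:
    """Wrap a caller-supplied query so the tenant filter cannot be omitted."""
    return {
        "query": {
            "bool": {
                "must": [user_query],
                # Assumes documents are indexed with a keyword field "tenant_id".
                "filter": [{"term": {"tenant_id": ctx.tenant_id}}],
            }
        }
    }
```

Because `build_search` requires a `TenantContext`, a handler cannot construct a search body without first passing through `tenant_from_event`. A CI test can then assert that every query sent to the search client carries the filter clause.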

Workflows, AI, and Auditing

The tenant workflow engine uses both Step Functions types: Express for short automated flows (high volume, at-least-once acceptable because actions are idempotent) and Standard for long-running human-approval flows (callback tokens that wait hours or days). The AI draft-suggestion feature calls Bedrock with prompt redaction, Guardrails as defense-in-depth, tenant opt-in, invocation logging, and mandatory human review.
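The callback-token pattern behind the human-approval flows can be sketched as an Amazon States Language definition built in Python. This is a minimal example under assumptions: the state names, the use of SQS as the hand-off to approvers, and the message shape are hypothetical. The `.waitForTaskToken` resource suffix and the `$$.Task.Token` context path are the standard Step Functions callback mechanism.

```python
import json


def approval_state_machine(queue_url: str, timeout_seconds: int) -> dict:
    """Build a minimal Standard-workflow definition that pauses for a human decision."""
    return {
        "Comment": "Human approval via callback token (Standard workflow)",
        "StartAt": "RequestApproval",
        "States": {
            "RequestApproval": {
                "Type": "Task",
                # The suffix makes the state wait until the token is returned.
                "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
                "Parameters": {
                    "QueueUrl": queue_url,
                    "MessageBody": {
                        "taskToken.$": "$$.Task.Token",  # token the approver sends back
                        "request.$": "$",                # original workflow input
                    },
                },
                "TimeoutSeconds": timeout_seconds,  # fail the wait if nobody responds
                "Next": "Approved",
            },
            "Approved": {"Type": "Succeed"},
        },
    }


if __name__ == "__main__":
    # Hypothetical queue URL, three-day approval window.
    print(json.dumps(approval_state_machine("https://sqs.example/approvals", 259200), indent=2))
```

The execution resumes when the approver's service calls `SendTaskSuccess` (or `SendTaskFailure`) with the token. Express workflows do not support the callback pattern and are limited to five minutes, which is why this flow needs the Standard type.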

Auditable history needs two stores: Aurora holds the queryable last-90-days audit; an immutable S3 Object Lock bucket in a separate audit account holds the regulator-grade evidence. CloudTrail captures AWS API events; a separate application audit stream captures tenant-user actions CloudTrail never sees.
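To make the split between the two stores concrete, here is a small sketch that prepares the arguments for an Object Lock write and decides whether an event still belongs in the queryable window. The 90-day window comes from the text. The event fields, the key layout, the COMPLIANCE mode, and the retention period are assumptions for illustration. The returned dictionary uses parameter names accepted by boto3's `put_object`.

```python
import base64
import hashlib
import json
from datetime import datetime, timedelta, timezone

QUERYABLE_WINDOW = timedelta(days=90)  # the Aurora copy covers the last 90 days


def evidence_put_args(bucket: str, event: dict, retention_days: int, now: datetime) -> dict:
    """Build put_object arguments for an immutable evidence record."""
    body = json.dumps(event, sort_keys=True).encode("utf-8")
    return {
        "Bucket": bucket,
        # Assumed key layout: one object per audit event, grouped by tenant.
        "Key": f"{event['tenant_id']}/{event['event_id']}.json",
        "Body": body,
        # Object Lock uploads need an integrity header such as Content-MD5.
        "ContentMD5": base64.b64encode(hashlib.md5(body).digest()).decode("ascii"),
        "ObjectLockMode": "COMPLIANCE",  # assumed mode
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


def in_queryable_window(event_time: datetime, now: datetime) -> bool:
    """True if the event should still be present in the queryable store."""
    return now - event_time <= QUERYABLE_WINDOW


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = {"tenant_id": "t-1", "event_id": "e-9", "action": "document.viewed"}
    print(evidence_put_args("audit-evidence-bucket", sample, 365, now)["Key"])
```

Every event goes to both stores. Only the queryable copy is trimmed as it ages out of the window, while the evidence object cannot be deleted or overwritten before its retain-until date.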

Cost Shape and EU Residency

Usage-based billing is why the team chose serverless: a trial tenant has a marginal cost of about $0.20/month, a small paying tenant $5–15, and an enterprise tenant in the low hundreds of dollars, all without capacity planning. Per-tenant KMS customer managed keys are an enterprise-tier option, not the default, because of their policy, grant, and rotation overhead.

EU data residency is honest only as a per-Region split — a separate eu-west-1 deployment with EU-only DynamoDB tables (not Global Tables across Regions), EU S3, EU Aurora, EU OpenSearch, EU Bedrock, EU logs and backups, and a separate EU audit account. Config rules forbid any cross-Region replication of EU data.
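The kind of check such a rule performs can be sketched as a pure function over a DynamoDB `describe_table` response. The `Table.Replicas[].RegionName` shape is what DynamoDB returns for tables with replicas. The allow-list and function name are hypothetical, and a real Config rule would wrap this logic in its own evaluation handler.

```python
# Only eu-west-1 is named in the case study; extend the set if more EU Regions are used.
EU_REGIONS = frozenset({"eu-west-1"})


def non_eu_regions(table_description: dict, home_region: str) -> list[str]:
    """Return every Region holding this table's data that is outside the allow-list."""
    replicas = table_description.get("Table", {}).get("Replicas", [])
    regions = {home_region} | {replica["RegionName"] for replica in replicas}
    return sorted(regions - EU_REGIONS)


if __name__ == "__main__":
    # A table replicated to a US Region violates the residency rule.
    violating = {"Table": {"TableName": "app", "Replicas": [{"RegionName": "us-east-1"}]}}
    print(non_eu_regions(violating, "eu-west-1"))
```

An empty result means the table is compliant. Any other result lists the Regions that would need to be removed before the residency claim holds.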

Lambda vs Fargate; Pooled vs Siloed tenancy

Lambda (chosen): it scales to zero, which fits a spiky, variable, multi-tenant workload better than Fargate, where even a small always-on task carries an idle cost.

Pooled tenancy (default) — cheaper per tenant and scales better; safe only with disciplined tenant-aware data access at every layer.

Siloed tenancy — a paid enterprise option (per-tenant tables, buckets, or even AWS accounts) for tenants who require it.
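For the pooled model, tenant-aware data access can be as simple as never building a storage key by hand. The sketch below assumes a `tenant#<id>#<entity>#<id>` partition-key layout and a `tenants/<id>/documents/` S3 prefix. Both layouts and all function names are illustrative rather than taken from the case study.

```python
def tenant_pk(tenant_id: str, entity: str, entity_id: str) -> str:
    """Build a DynamoDB partition key that always starts with the tenant prefix."""
    if not tenant_id or "#" in tenant_id:
        raise ValueError("invalid tenant id")
    return f"tenant#{tenant_id}#{entity}#{entity_id}"


def owns_key(tenant_id: str, pk: str) -> bool:
    """Check that a partition key belongs to the tenant before reading or writing it."""
    # The trailing '#' stops tenant "t-1" from matching keys of tenant "t-10".
    return pk.startswith(f"tenant#{tenant_id}#")


def tenant_object_key(tenant_id: str, document_id: str) -> str:
    """Build an S3 object key under the tenant's own prefix."""
    for part in (tenant_id, document_id):
        if not part or "/" in part or ".." in part:
            raise ValueError("invalid path component")
    return f"tenants/{tenant_id}/documents/{document_id}"


if __name__ == "__main__":
    print(tenant_pk("t-1", "doc", "42"))
    print(tenant_object_key("t-1", "42.pdf"))
```

A presigned URL would then be generated only for keys returned by `tenant_object_key`, using the tenant from the trusted request context, so one tenant cannot request a URL for another tenant's prefix.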

Common Mistakes
  • Treating a stored tenant_id as isolation instead of enforcing the trusted claim from the JWT authorizer through every data-access path.
  • Letting a search query be built without the active tenant. Pooled search is the highest tenant-leak risk, so the tenant filter must be structural, not a convention.
  • Conflating the queryable audit store (Aurora) with the immutable evidence store (S3 Object Lock in a separate account).
  • Making per-tenant KMS keys the default instead of an enterprise-tier option, taking on policy, grant, and rotation overhead at scale.
  • Claiming EU residency while using DynamoDB Global Tables or cross-Region replication that moves EU data out of the EU.
  • Adopting single-table DynamoDB on day one before the access patterns are understood.
Best Practices
  • Enforce tenant isolation at an explicit point (JWT authorizer → Lambda context) plus tenant-aware access at every storage layer.
  • Make the OpenSearch tenant filter structural via a shared query builder, verified by CI tests.
  • Use Step Functions Express for short automated workflows and Standard for long-running human-approval ones.
  • Treat Bedrock Guardrails as defense-in-depth alongside prompt redaction, opt-in, logging, and human review.
  • Keep a queryable audit store and a separate immutable evidence store; do not conflate them.
  • Implement EU residency as a separate per-Region deployment, with Config rules forbidding cross-Region replication of EU data.
Comparable services
  • GCP: Cloud Run + Firestore + Cloud Storage + Workflows + Vertex AI
  • Azure: Azure Functions + Cosmos DB + Blob + Logic Apps + Azure OpenAI

Knowledge Check

How is multi-tenant isolation actually achieved in this SaaS?

  • A JWT authorizer extracts the trusted claim into Lambda context, plus tenant-aware access at every storage layer
  • By storing a tenant_id column on every row and trusting the client application to faithfully send the correct one on each request it makes to the API
  • By provisioning every tenant its own fully separate AWS account
  • By encrypting all tenants' data under one shared KMS key

Why does the team use both Step Functions Express and Standard?

  • Express for short, high-volume automated flows; Standard for long-running human-approval flows that wait hours or days
  • Express runs all the live production traffic while Standard is reserved purely for staging and integration testing, so the split is drawn along environment lines rather than workflow shape
  • Standard costs less per execution once flow volume gets high
  • Express handles human-approval waits while Standard cannot pause

How does the team meet EU data-residency requirements honestly?

  • A separate eu-west-1 deployment with EU-only resources and no cross-Region replication of EU data
  • DynamoDB Global Tables continuously replicating EU items into a US Region for extra durability and faster reads for the operations team based there
  • Serving EU users from a us-east-1 origin fronted by CloudFront caching
  • Encrypting EU data at rest with a US-based KMS customer key

Why keep two separate audit stores?

  • Aurora is the queryable copy for support and dashboards; S3 Object Lock in a separate account is the immutable regulator-grade evidence
  • To keep two byte-for-byte identical copies of exactly the same audit data purely for redundant backup, so that if one store is lost the other can be restored from it unchanged
  • Because CloudTrail is unable to deliver its events to an S3 bucket
  • To spread the audit-storage cost evenly across two accounts
