Serverless SaaS Application
A composite B2B SaaS for legal teams: multi-tenant, per-seat pricing, eight engineers, three years on a pure-serverless AWS stack. Tenant isolation is existential, cost must scale with usage (trial tenants near-zero, heavy tenants priced accordingly), and SOC 2 plus GDPR plus optional EU data residency apply from year one.
The team picked serverless for its cost and operational shape — eight engineers cannot run an EKS cluster, a Kafka cluster, and a multi-region database operation on top of building the product. Three years in, the choice held up.
The Stack and Tenant Isolation
Pure serverless with one exception (managed OpenSearch): CloudFront + WAF, API Gateway HTTP API with a Cognito JWT authorizer, Lambda handlers, DynamoDB single-table, S3 for documents, Aurora Serverless v2 for billing and queryable audit, EventBridge + SQS, Step Functions, and Bedrock. The whole stack is defined in CDK.
Tenant isolation is an explicit enforcement point, not just a stored tenant_id. The JWT authorizer verifies the token and passes the trusted tenant claim to the Lambda handler in the request context, where the client cannot override it, and every storage layer enforces it: DynamoDB partition-key prefix, S3 presigned-URL scoping, OpenSearch required query filter, Aurora row-level security. Pooled search is the highest-risk leak surface, so its tenant filter is enforced structurally through a shared query builder and verified in CI.
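The enforcement path described above can be sketched in a few lines. This is a minimal illustration, not the team's actual code: the claim name `custom:tenant_id`, the `TENANT#...` key layout, the `tenant_id` index field, and all function names are assumptions made for the example. The event shape follows the API Gateway HTTP API payload format 2.0, where a JWT authorizer places verified claims under `requestContext.authorizer.jwt.claims`.

```python
from typing import Any

# Assumption: the tenant ID travels in a custom Cognito claim with this name.
TENANT_CLAIM = "custom:tenant_id"


def tenant_from_event(event: dict[str, Any]) -> str:
    """Return the tenant ID from claims the JWT authorizer already verified.

    Headers, query strings, and the body are client-controlled and are
    deliberately never consulted.
    """
    try:
        claims = event["requestContext"]["authorizer"]["jwt"]["claims"]
        tenant_id = claims[TENANT_CLAIM]
    except (KeyError, TypeError) as exc:
        raise PermissionError("request has no verified tenant claim") from exc
    if not tenant_id:
        raise PermissionError("verified tenant claim is empty")
    return tenant_id


def tenant_pk(tenant_id: str, entity: str, entity_id: str) -> str:
    """Build a DynamoDB partition key that always starts with the tenant prefix.

    The key layout is hypothetical; the point is that no key can be built
    without a tenant.
    """
    return f"TENANT#{tenant_id}#{entity}#{entity_id}"


def build_search_query(tenant_id: str, user_query: dict[str, Any]) -> dict[str, Any]:
    """Wrap any user query in a bool query with a mandatory tenant filter.

    Handlers call this builder instead of assembling OpenSearch request
    bodies themselves, so a query without the active tenant cannot be built.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required for every search")
    return {
        "query": {
            "bool": {
                "must": [user_query],
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        }
    }
```

A CI test can then assert that every query produced by the builder carries the tenant filter, and that a request without a verified claim is rejected before any storage call is made.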
Workflows, AI, and Auditing
The tenant workflow engine uses both Step Functions types: Express for short automated flows (high volume, at-least-once acceptable because actions are idempotent) and Standard for long-running human-approval flows (callback tokens that wait hours or days). The AI draft-suggestion feature calls Bedrock with prompt redaction, Guardrails as defense-in-depth, tenant opt-in, invocation logging, and mandatory human review.
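To make the Express/Standard split concrete, here is a hedged sketch of a human-approval step expressed in Amazon States Language, built as a Python dictionary. The state names, the queue, the seven-day timeout, and the helper functions are illustrative assumptions. The `.waitForTaskToken` resource suffix and the `$$.Task.Token` context path are the standard callback pattern, which Standard workflows support and Express workflows do not.

```python
import json
from typing import Any


def workflow_type(waits_for_human: bool) -> str:
    """Pick the state machine type from the workflow's shape.

    Callback waits need Standard; short automated, idempotent flows
    can run as Express.
    """
    return "STANDARD" if waits_for_human else "EXPRESS"


def approval_state_machine(queue_url: str) -> dict[str, Any]:
    """Return an ASL definition that pauses until a reviewer responds.

    Assumption: approval requests are delivered through an SQS queue and a
    separate worker notifies the reviewer.
    """
    return {
        "Comment": "Human approval via callback token (illustrative)",
        "StartAt": "RequestApproval",
        "States": {
            "RequestApproval": {
                "Type": "Task",
                # The execution pauses here until the task token is returned.
                "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
                "Parameters": {
                    "QueueUrl": queue_url,
                    "MessageBody": {
                        "taskToken.$": "$$.Task.Token",
                        "request.$": "$",
                    },
                },
                # Hypothetical limit: give up after seven days without an answer.
                "TimeoutSeconds": 7 * 24 * 3600,
                "Next": "ApplyDecision",
            },
            "ApplyDecision": {"Type": "Pass", "End": True},
        },
    }


if __name__ == "__main__":
    definition = approval_state_machine("https://sqs.eu-west-1.amazonaws.com/111122223333/approvals")
    print(workflow_type(waits_for_human=True))
    print(json.dumps(definition, indent=2))
```

When the reviewer decides, the application resumes the execution by calling the Step Functions `SendTaskSuccess` (or `SendTaskFailure`) API with the stored task token.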
Auditable history needs two stores: Aurora holds the queryable audit trail for the last 90 days; an immutable S3 Object Lock bucket in a separate audit account holds the regulator-grade evidence. CloudTrail captures AWS API events; a separate application audit stream captures tenant-user actions CloudTrail never sees.
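As an illustration of the immutable side of that split, the sketch below builds the arguments for an S3 `PutObject` call that writes one application audit event under Object Lock. The bucket name, key layout, seven-year retention, and function name are assumptions for the example; `ObjectLockMode`, `ObjectLockRetainUntilDate`, and `ChecksumAlgorithm` are real `put_object` parameters in boto3, and the bucket must have Object Lock enabled.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone
from typing import Any

# Assumption: a seven-year retention period; the real value is a compliance decision.
RETENTION_DAYS = 7 * 365


def evidence_put_params(
    bucket: str, tenant_id: str, event: dict[str, Any], now: datetime
) -> dict[str, Any]:
    """Build put_object keyword arguments for one immutable audit record."""
    # Canonical JSON so the same event always yields the same bytes and key.
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()
    return {
        "Bucket": bucket,
        "Key": f"audit/{tenant_id}/{now:%Y/%m/%d}/{digest}.json",
        "Body": body,
        # COMPLIANCE mode: the retention cannot be shortened or removed,
        # even by the root user of the audit account.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=RETENTION_DAYS),
        # Ask S3 to verify upload integrity with a SHA-256 checksum.
        "ChecksumAlgorithm": "SHA256",
    }


if __name__ == "__main__":
    params = evidence_put_params(
        "example-audit-evidence",  # hypothetical bucket in the audit account
        "t-123",
        {"action": "document.viewed", "user": "u-42"},
        datetime.now(timezone.utc),
    )
    print(params["Key"], params["ObjectLockRetainUntilDate"].isoformat())
    # With credentials for the audit account: boto3.client("s3").put_object(**params)
```

The same event would also be written to Aurora for support queries and dashboards; only the S3 copy carries the retention guarantee.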
Cost Shape and EU Residency
Serverless billing is why the team chose it: a trial tenant costs about $0.20/month at the margin, a small paying tenant $5–15, and an enterprise tenant in the low hundreds of dollars, all without capacity planning. Per-tenant KMS Customer Managed Keys are an enterprise-tier option, not the default, because of their policy, grant, and rotation overhead.
EU data residency is honest only as a per-Region split — a separate eu-west-1 deployment with EU-only DynamoDB tables (not Global Tables across Regions), EU S3, EU Aurora, EU OpenSearch, EU Bedrock, EU logs and backups, and a separate EU audit account. Config rules forbid any cross-Region replication of EU data.
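The residency rule can be expressed as a simple check over a resource inventory. The sketch below is a hypothetical pre-deployment or CI check, not an AWS Config rule itself: the inventory format, the field names, and the allowed-Region set are assumptions for illustration.

```python
from typing import Any

# Assumption: the EU deployment lives only in eu-west-1, as in the text above.
EU_REGIONS = frozenset({"eu-west-1"})


def residency_violations(resources: list[dict[str, Any]]) -> list[str]:
    """Return one message per EU resource that stores or replicates data outside the EU.

    Each resource is a dict with a name, its home Region, and an optional
    list of replica Regions (hypothetical inventory format).
    """
    violations = []
    for resource in resources:
        regions = [resource["region"], *resource.get("replica_regions", [])]
        outside = sorted(r for r in regions if r not in EU_REGIONS)
        if outside:
            violations.append(
                f"{resource['name']}: data in non-EU Region(s) {', '.join(outside)}"
            )
    return violations


if __name__ == "__main__":
    inventory = [
        {"name": "eu-documents-table", "region": "eu-west-1"},
        {"name": "eu-documents-bucket", "region": "eu-west-1", "replica_regions": ["us-east-1"]},
    ]
    for message in residency_violations(inventory):
        print(message)
```

Run against the example inventory, the check flags the bucket that replicates to us-east-1 and passes the table that stays in eu-west-1.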
Lambda (chosen): scale-to-zero fits a spiky, variable, multi-tenant workload better than Fargate, which carries a small idle cost even when no requests arrive.
Pooled tenancy (default): cheaper per tenant and scales better; safe only with disciplined tenant-aware data access at every layer.
Siloed tenancy: a paid enterprise option (per-tenant tables, buckets, or even AWS accounts) for tenants who require it.
- Treating a stored tenant_id as isolation instead of enforcing the trusted claim from the JWT authorizer through every data-access path.
- Letting a search query be built without the active tenant — pooled search is the highest tenant-leak risk and must be structural, not convention.
- Conflating the queryable audit store (Aurora) with the immutable evidence store (S3 Object Lock in a separate account).
- Making per-tenant KMS keys the default instead of an enterprise-tier option, taking on policy, grant, and rotation overhead at scale.
- Claiming EU residency while using DynamoDB Global Tables or cross-Region replication that moves EU data out of the EU.
- Adopting single-table DynamoDB on day one before the access patterns are understood.
- Enforce tenant isolation at an explicit point (JWT authorizer → Lambda context) plus tenant-aware access at every storage layer.
- Make the OpenSearch tenant filter structural via a shared query builder, verified by CI tests.
- Use Step Functions Express for short automated workflows and Standard for long-running human-approval ones.
- Treat Bedrock Guardrails as defense-in-depth alongside prompt redaction, opt-in, logging, and human review.
- Keep a queryable audit store and a separate immutable evidence store; do not conflate them.
- Implement EU residency as a separate per-Region deployment, with Config rules forbidding cross-Region replication of EU data.
Knowledge Check
How is multi-tenant isolation actually achieved in this SaaS?
- A JWT authorizer extracts the trusted claim into Lambda context, plus tenant-aware access at every storage layer
- By storing a tenant_id column on every row and trusting the client application to faithfully send the correct one on each request it makes to the API
- By provisioning every tenant its own fully separate AWS account
- By encrypting all tenants' data under one shared KMS key
Why does the team use both Step Functions Express and Standard?
- Express for short, high-volume automated flows; Standard for long-running human-approval flows that wait hours or days
- Express runs all the live production traffic while Standard is reserved purely for staging and integration testing, so the split is drawn along environment lines rather than workflow shape
- Standard costs less per execution once flow volume gets high
- Express handles human-approval waits while Standard cannot pause
How does the team meet EU data-residency requirements honestly?
- A separate eu-west-1 deployment with EU-only resources and no cross-Region replication of EU data
- DynamoDB Global Tables continuously replicating EU items into a US Region for extra durability and faster reads for the operations team based there
- Serving EU users from a us-east-1 origin fronted by CloudFront caching
- Encrypting EU data at rest with a US-based KMS customer key
Why keep two separate audit stores?
- Aurora is the queryable copy for support and dashboards; S3 Object Lock in a separate account is the immutable regulator-grade evidence
- To keep two byte-for-byte identical copies of exactly the same audit data purely for redundant backup, so that if one store is lost the other can be restored from it unchanged
- Because CloudTrail is unable to deliver its events to an S3 bucket
- To spread the audit-storage cost evenly across two accounts