E-commerce Platform
A composite case study: a direct-to-consumer online retailer, a few thousand SKUs, global customers concentrated in North America and Europe, low-tens-of-millions revenue growing ~30% a year, six engineers and one ops person. The constraint that shapes everything is operational simplicity — the architecture cannot require running Kubernetes or a custom data pipeline.
The traffic profile is steady with sharp spikes (Black Friday at 20× daily average), checkout availability is targeted at 99.95%, and PCI scope is minimized by never touching raw card data.
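As a rough illustration of what these targets imply, the arithmetic below converts the 99.95% availability target into a downtime budget and applies the 20× spike multiplier to a baseline request rate. The baseline of 50 requests per second is a purely hypothetical figure chosen for illustration; the case study does not state one.

```python
# Downtime budget implied by an availability target, plus peak load for a
# traffic spike. BASELINE_RPS is a hypothetical placeholder, not source data.

def downtime_budget_minutes(availability: float, period_days: float) -> float:
    """Minutes of allowed unavailability over the given period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)

CHECKOUT_AVAILABILITY = 0.9995  # checkout target from the case study
SPIKE_MULTIPLIER = 20           # Black Friday relative to the daily average
BASELINE_RPS = 50.0             # assumption, for illustration only

monthly_budget = downtime_budget_minutes(CHECKOUT_AVAILABILITY, 30)
yearly_budget = downtime_budget_minutes(CHECKOUT_AVAILABILITY, 365)
peak_rps = BASELINE_RPS * SPIKE_MULTIPLIER

print(f"30-day budget: {monthly_budget:.1f} min")
print(f"365-day budget: {yearly_budget / 60:.2f} h")
print(f"Peak at 20x: {peak_rps:.0f} rps")
```

Under these assumptions the target leaves about 21.6 minutes of checkout downtime per 30 days, or roughly 4.4 hours per year.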
The Architecture
The stack is serverless-leaning with managed services everywhere a clear default exists. Edge: CloudFront with WAF fronts everything — /static/* served directly from S3, /api/* through to an ALB. Application: ECS Fargate runs stateless services (storefront, order API, admin) behind the ALB across two AZs, with Lambda for async and scheduled work.
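The edge routing described above can be modeled as an ordered list of path patterns where the first match wins. This is only a sketch of the matching logic, not CloudFront configuration: the origin labels are invented for illustration, and sending all other paths to the ALB is an assumption, since the case study names only the /static/* and /api/* behaviors.

```python
from fnmatch import fnmatchcase

# Ordered (path pattern, origin) pairs; the first matching pattern wins.
# Origin names are illustrative labels, not real resource identifiers.
BEHAVIORS = [
    ("/static/*", "s3-static-assets"),
    ("/api/*", "alb-application"),
]

# Assumption: any other path (for example storefront pages) also goes to the ALB.
DEFAULT_ORIGIN = "alb-application"


def route(path: str) -> str:
    """Return the origin label that would serve the given request path."""
    for pattern, origin in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin
    return DEFAULT_ORIGIN


for example in ("/static/css/site.css", "/api/orders/42", "/products/blue-mug"):
    print(example, "->", route(example))
```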
Data: Aurora PostgreSQL Multi-AZ is the relational source of truth (orders, customers, products); DynamoDB holds high-volume per-user state (carts, sessions) with TTL; ElastiCache fronts read-heavy product queries; OpenSearch powers search and recommendations. Async: EventBridge as the bus, SQS per consumer, Lambda consumers, SES for email.
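To show what "per-user state with TTL" can look like, here is a minimal sketch of a cart item carrying an expiry timestamp. DynamoDB TTL reads a Number attribute holding epoch seconds; the key layout (`pk`, `sk`), the attribute name `expires_at`, and the seven-day lifetime are all hypothetical choices, not details from the case study.

```python
import time
from typing import Dict, Optional

CART_TTL_SECONDS = 7 * 24 * 3600  # hypothetical seven-day cart lifetime


def build_cart_item(user_id: str, sku_quantities: Dict[str, int],
                    now: Optional[float] = None) -> dict:
    """Build a cart item whose expires_at attribute is in epoch seconds."""
    current = int(time.time() if now is None else now)
    return {
        "pk": f"USER#{user_id}",       # hypothetical partition key layout
        "sk": "CART",                  # hypothetical sort key
        "items": dict(sku_quantities),
        "expires_at": current + CART_TTL_SECONDS,
    }


def is_live(item: dict, now: Optional[float] = None) -> bool:
    """Treat an item as expired once its TTL timestamp has passed."""
    current = int(time.time() if now is None else now)
    return item["expires_at"] > current


cart = build_cart_item("u-123", {"SKU-1": 2}, now=1_700_000_000)
print(cart["expires_at"])
print(is_live(cart, now=1_700_000_000 + CART_TTL_SECONDS + 1))
```

The read-side check matters because TTL deletion runs in the background and is not immediate, so an expired item can still be returned by a read for some time after its timestamp passes.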
Event Flow and the Right Store per Pattern
Checkout writes the order to Aurora in one transaction, then publishes order.placed to EventBridge, which fans out to per-consumer SQS queues — fulfillment, email, analytics, recommendations — each with its own retries and dead-letter queue. A failure in fulfillment does not block email. Payments stay synchronous: payment success is required before order.placed is published, avoiding distributed-transaction ambiguity.
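The flow above can be sketched with an in-memory stand-in for the bus and its queues. This is a simplified model, not AWS SDK code: `Bus`, `ConsumerQueue`, `checkout`, and the retry limit of three receives are hypothetical names and values, and the charge-then-save ordering inside `checkout` is an assumption (the case study only requires payment success before publishing).

```python
from collections import deque
from typing import Callable, Deque, List

MAX_RECEIVES = 3  # hypothetical redrive threshold before dead-lettering


class ConsumerQueue:
    """In-memory stand-in for one consumer's queue plus its dead-letter queue."""

    def __init__(self, name: str, handler: Callable[[dict], None]) -> None:
        self.name = name
        self.handler = handler
        self.pending: Deque[dict] = deque()
        self.processed: List[dict] = []
        self.dead_letters: List[dict] = []

    def drain(self) -> None:
        while self.pending:
            message = self.pending.popleft()
            message["receives"] += 1
            try:
                self.handler(message["event"])
                self.processed.append(message["event"])
            except Exception:
                if message["receives"] >= MAX_RECEIVES:
                    self.dead_letters.append(message["event"])
                else:
                    self.pending.append(message)  # leave it for a retry


class Bus:
    """Fan-out: every published event is copied to every subscribed queue."""

    def __init__(self) -> None:
        self.queues: List[ConsumerQueue] = []

    def subscribe(self, queue: ConsumerQueue) -> None:
        self.queues.append(queue)

    def publish(self, event: dict) -> None:
        for queue in self.queues:
            queue.pending.append({"event": dict(event), "receives": 0})


def checkout(order_id: str, charge: Callable[[str], bool],
             save_order: Callable[[str], None], bus: Bus) -> bool:
    if not charge(order_id):   # synchronous payment; no event on failure
        return False
    save_order(order_id)       # stands in for the single database transaction
    bus.publish({"type": "order.placed", "order_id": order_id})
    return True


def failing_fulfillment(event: dict) -> None:
    raise RuntimeError("warehouse API unavailable")


sent_emails: List[dict] = []
orders: List[str] = []
bus = Bus()
fulfillment = ConsumerQueue("fulfillment", failing_fulfillment)
email = ConsumerQueue("email", sent_emails.append)
bus.subscribe(fulfillment)
bus.subscribe(email)

placed = checkout("order-1", lambda _id: True, orders.append, bus)
declined = checkout("order-2", lambda _id: False, orders.append, bus)
for queue in bus.queues:
    queue.drain()

print(placed, declined, len(sent_emails), len(fulfillment.dead_letters))
```

In this run the fulfillment handler fails on every attempt and its message lands in that queue's dead-letter list, while the email consumer still processes its own copy. The declined payment never produces an event at all.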
The team matched each store to its access pattern: relational truth in Aurora, per-user state in DynamoDB, hot reads in ElastiCache, search in OpenSearch. They deliberately kept the relational product catalog out of single-table DynamoDB because the model is relational and admin-edited.
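One common way to serve hot reads from a cache in front of a relational store is the cache-aside pattern, sketched below. The case study does not say which caching strategy the team uses, so treat this as an illustration: a dictionary stands in for ElastiCache, the `load` callable stands in for an Aurora query, and `ProductCache`, the 60-second TTL, and the invalidate-on-edit step are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class ProductCache:
    """Cache-aside: check the cache first, load from the source on a miss."""

    def __init__(self, load: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, sku: str) -> dict:
        now = self._clock()
        entry = self._entries.get(sku)
        if entry is not None and entry[0] > now:
            return entry[1]                 # cache hit, still fresh
        product = self._load(sku)           # miss or expired: read the source
        self._entries[sku] = (now + self._ttl, product)
        return product

    def invalidate(self, sku: str) -> None:
        """Drop an entry, for example after an admin edits the product."""
        self._entries.pop(sku, None)


db_reads: List[str] = []


def load_product(sku: str) -> dict:
    db_reads.append(sku)                    # record each simulated database read
    return {"sku": sku, "name": "Blue Mug"}


fake_now = [0.0]                            # controllable clock for the demo
cache = ProductCache(load_product, ttl_seconds=60, clock=lambda: fake_now[0])

cache.get("SKU-1")      # miss: loads from the source
cache.get("SKU-1")      # hit: served from the cache
fake_now[0] = 61.0
cache.get("SKU-1")      # expired: loads again
reads_after_expiry = len(db_reads)

cache.invalidate("SKU-1")
cache.get("SKU-1")      # invalidated: loads again
reads_after_invalidate = len(db_reads)
print(reads_after_expiry, reads_after_invalidate)
```

Because the catalog is admin-edited, an explicit invalidation on write keeps shoppers from seeing stale product data for the full TTL.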
Single-Region Now, Multi-Region Later
The platform runs single-Region (us-east-1) with Multi-AZ HA, serving EU customers through CloudFront edge caching — a deliberate fit for a six-engineer team. A second Region is on the roadmap, triggered by a residency contract, EU latency breaching the SLO, or a Region-outage cost case.
The evolution would add Aurora Global Database, DynamoDB Global Tables for customer data (sessions stay regional), S3 Cross-Region Replication, Route 53 latency routing, and per-Region SES and OpenSearch — roughly 1.6× the single-Region cost.
Key Decisions
ECS Fargate (the right default here): no nodes to manage and a low operational surface for six engineers, and it handles the workload comfortably.
EKS (considered and rejected): adds cluster upgrades, node management, and networking complexity without proportional benefit at this scale.
Single-Region + Multi-AZ (the right baseline): multi-Region is a triggered evolution, not a day-one default.
Anti-Patterns
- Putting the relational product catalog into single-table DynamoDB when the model is relational and admin-edited — Aurora plus ElastiCache plus OpenSearch fits better.
- Adopting EKS for 'Kubernetes-native' credibility when ECS Fargate handles the workload with far less operational overhead.
- Going multi-Region on day one for a six-engineer team instead of single-Region Multi-AZ with CloudFront edge caching.
- Making payments asynchronous through the event bus, creating 'did we charge the customer or not' ambiguity.
- Hand-writing CloudFormation from the start instead of CDK, leaving large templates hard to refactor two years in.
- Sticking with the AWS-native observability stack past ~20 services instead of adopting an APM when cross-service incidents grow.
Key Takeaways
- Use the practical default stack: CloudFront + WAF + ALB + ECS Fargate + Aurora + DynamoDB + ElastiCache + OpenSearch + EventBridge + SQS + Lambda + SES.
- Match each data store to its access pattern rather than forcing one database to do everything.
- Use EventBridge fan-out with a per-consumer SQS queue and dead-letter queue for resilient async work.
- Keep payments synchronous so an order only enters the bus after a confirmed charge.
- Run single-Region with Multi-AZ until a specific business condition triggers multi-Region.
- Favor operational simplicity over perfect efficiency for a small team.
Knowledge Check
Why did the team keep the product catalog in Aurora rather than single-table DynamoDB?
- The product model is relational and admin-edited; Aurora plus ElastiCache plus OpenSearch fit the access pattern better
- DynamoDB has no underlying mechanism to durably persist the catalog's product item records, leaving a relational engine like Aurora as the only store capable of holding them
- Aurora's per-hour pricing always beats DynamoDB on-demand for any catalog
- DynamoDB offers no encryption-at-rest for sensitive product fields
Why are payments kept synchronous while order fulfillment is event-driven?
- Payment success is required before publishing order.placed, avoiding distributed-transaction ambiguity over whether the customer was charged
- Synchronous request/response always returns to the user faster than async
- EventBridge is technically unable to route payment-related event payloads, so the charge step has to stay a synchronous in-process call simply because no event bus can carry it asynchronously
- Warehouse fulfillment must finish before the order row is ever written
What is the right baseline for a six-engineer e-commerce team, per this case study?
- Single-Region with Multi-AZ HA, serving distant users via CloudFront edge caching
- Active-active across two Regions from the very first deploy
- A single Availability Zone to keep the monthly bill lowest
- A self-managed on-premises data center running the full stack, with nightly backups shipped to the cloud for disaster recovery only
Why did the team choose ECS Fargate over EKS?
- Fargate handles the workload with much lower operational overhead; EKS adds cluster and node management without proportional benefit here
- EKS is unable to schedule and run containerized service workloads
- Fargate is the single platform on AWS capable of running stateless services, since EKS, ECS on EC2, and plain Lambda are all incapable of hosting this kind of stateless container workload
- EKS pods cannot receive traffic from an Application Load Balancer