Service 66

E-commerce Platform

Case Study · Retail · Architecture

A composite case study: a direct-to-consumer online retailer, a few thousand SKUs, global customers concentrated in North America and Europe, low-tens-of-millions revenue growing ~30% a year, six engineers and one ops person. The constraint that shapes everything is operational simplicity — the architecture cannot require running Kubernetes or a custom data pipeline.

Traffic is steady with sharp spikes (Black Friday runs at 20× the daily average), the checkout availability target is 99.95%, and PCI scope is minimized by never touching raw card data.
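To make the two numbers above concrete, here is a small arithmetic sketch. The 99.95% target and the 20× multiplier come from the case study; the 50 requests-per-second baseline is a made-up figure used only for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: int) -> float:
    """Minutes of allowed unavailability over `days` at the given target."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


def peak_load(daily_average_rps: float, spike_multiplier: float = 20.0) -> float:
    """Load to plan for when a spike hits `spike_multiplier` times the average."""
    return daily_average_rps * spike_multiplier


if __name__ == "__main__":
    # 99.95% over a 30-day month and over a 365-day year
    print(round(downtime_budget_minutes(0.9995, 30), 1))   # 21.6
    print(round(downtime_budget_minutes(0.9995, 365), 1))  # 262.8
    # Hypothetical 50 rps daily average (an assumption, not from the case study)
    print(peak_load(50.0))                                 # 1000.0
```

A 99.95% target leaves about 21.6 minutes of checkout downtime in a 30-day month, or roughly 4.4 hours a year, which is the budget every design choice in the rest of the study has to fit inside.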

The Architecture

The stack is serverless-leaning with managed services everywhere a clear default exists. Edge: CloudFront with WAF fronts everything — /static/* served directly from S3, /api/* through to an ALB. Application: ECS Fargate runs stateless services (storefront, order API, admin) behind the ALB across two AZs, with Lambda for async and scheduled work.
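The edge routing can be pictured as an ordered list of path patterns with a fallback. The sketch below models that lookup in plain Python; it is not CloudFront configuration. The origin names are hypothetical, and sending unmatched paths to the ALB is an assumption, since the case study only specifies the /static/* and /api/* behaviors.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, origin) pairs; the first match wins.
# Origin names are hypothetical labels for this sketch.
BEHAVIORS = [
    ("/static/*", "s3-static-origin"),
    ("/api/*", "alb-origin"),
]

# Assumption: anything else (for example storefront pages) also goes to the ALB.
DEFAULT_ORIGIN = "alb-origin"


def origin_for(path: str) -> str:
    """Return the origin that should serve `path`."""
    for pattern, origin in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin
    return DEFAULT_ORIGIN
```

For example, `origin_for("/static/css/site.css")` resolves to the S3 origin, while `origin_for("/api/orders/42")` resolves to the ALB.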

Figure: E-commerce, four-layer, serverless-leaning architecture

Edge: CloudFront + WAF (/static → S3 · /api → ALB)
Application: ALB · ECS Fargate (storefront, order, admin) · Lambda (async + cron)
Data: Aurora (orders, customers) · DynamoDB (carts, sessions) · ElastiCache (hot reads) · S3 (images)
Async: EventBridge · SQS (per-consumer queues) · Lambda (consumers)

Data: Aurora PostgreSQL Multi-AZ is the relational source of truth (orders, customers, products); DynamoDB holds high-volume per-user state (carts, sessions) with TTL; ElastiCache fronts read-heavy product queries; OpenSearch powers search and recommendations. Async: EventBridge as the bus, SQS per consumer, Lambda consumers, SES for email.
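As an illustration of per-user state with TTL, the sketch below builds a cart item carrying an expiry timestamp. DynamoDB TTL expects a Number attribute holding Unix epoch seconds; the attribute names (`pk`, `expires_at`), the key format, and the 14-day lifetime are assumptions made for this example, not details from the case study.

```python
import time
from typing import Any, Dict, List, Optional

# Assumption: carts expire 14 days after the last update.
CART_TTL_SECONDS = 14 * 24 * 3600


def build_cart_item(
    user_id: str, lines: List[Dict[str, Any]], now: Optional[int] = None
) -> Dict[str, Any]:
    """Build a cart item whose `expires_at` is epoch seconds, as TTL requires."""
    now = int(time.time()) if now is None else now
    return {
        "pk": f"CART#{user_id}",  # hypothetical key format
        "lines": lines,
        "updated_at": now,
        "expires_at": now + CART_TTL_SECONDS,
    }


def is_live(item: Dict[str, Any], now: Optional[int] = None) -> bool:
    """TTL deletion happens in the background, so readers filter expired items."""
    now = int(time.time()) if now is None else now
    return item["expires_at"] > now
```

Because TTL deletes are asynchronous and can lag the expiry time, a read-side check such as `is_live` keeps stale carts from resurfacing in the meantime.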

Event Flow and the Right Store per Pattern

Checkout writes the order to Aurora in one transaction, then publishes order.placed to EventBridge, which fans out to per-consumer SQS queues — fulfillment, email, analytics, recommendations — each with its own retries and dead-letter queue. A failure in fulfillment does not block email. Payments stay synchronous: payment success is required before order.placed is published, avoiding distributed-transaction ambiguity.
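The ordering described above can be sketched as follows. The `charge` and `save_order` callables, the event source name, and the bus name are hypothetical; `events_client` is assumed to expose `put_events(Entries=[...])` in the shape the boto3 EventBridge client uses. Charging before the Aurora write is also an assumption: the case study only states that both the payment and the order write happen before `order.placed` is published.

```python
import json
from typing import Any, Callable, Dict


class PaymentDeclined(Exception):
    """Raised when the synchronous charge does not succeed."""


def build_order_placed_entry(
    order: Dict[str, Any], bus_name: str = "orders-bus"
) -> Dict[str, str]:
    """Shape one EventBridge entry; the source and bus names are hypothetical."""
    return {
        "Source": "shop.checkout",
        "DetailType": "order.placed",
        "Detail": json.dumps(order),
        "EventBusName": bus_name,
    }


def checkout(
    order: Dict[str, Any],
    charge: Callable[[Dict[str, Any]], bool],
    save_order: Callable[[Dict[str, Any]], None],
    events_client: Any,
) -> Dict[str, Any]:
    # 1. Synchronous payment: nothing downstream runs without a confirmed charge.
    if not charge(order):
        raise PaymentDeclined(order["order_id"])
    # 2. Persist the order (stands in for the single Aurora transaction).
    save_order(order)
    # 3. Only now publish the event that fans out to the consumer queues.
    response = events_client.put_events(Entries=[build_order_placed_entry(order)])
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError("order.placed was not accepted by the bus")
    return response
```

A declined charge raises before any write or publish, so no consumer ever sees an order that was not paid for. How to recover when the publish fails after the order is already committed is outside this sketch.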

The team matched each store to its access pattern: relational truth in Aurora, per-user state in DynamoDB, hot reads in ElastiCache, search in OpenSearch. They deliberately kept the product catalog out of a single-table DynamoDB design: the catalog model is relational and edited by admins, which suits Aurora, with ElastiCache and OpenSearch serving hot reads and search.
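One common way to put a cache in front of read-heavy product queries is the cache-aside pattern, sketched below. The case study does not say which caching strategy the team uses, so treat the pattern, the 60-second TTL, and every name here as assumptions. `InMemoryCache` is a stand-in for a Redis-style store, and `load_from_db` stands in for the Aurora query.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class InMemoryCache:
    """Tiny TTL cache standing in for a Redis-style store in this sketch."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # drop the expired entry lazily
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def get_product(
    product_id: str,
    cache: InMemoryCache,
    load_from_db: Callable[[str], Any],
    ttl_seconds: float = 60.0,
) -> Any:
    """Cache-aside read: try the cache, fall back to the database, then fill."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    product = load_from_db(product_id)
    cache.set(key, product, ttl_seconds)
    return product


def invalidate_product(product_id: str, cache: InMemoryCache) -> None:
    """Call after an admin edit so the next read reloads from the database."""
    cache.delete(f"product:{product_id}")
```

Since the catalog is admin-edited, an explicit invalidation on write keeps edits visible sooner than waiting for the TTL to lapse.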

Single-Region Now, Multi-Region Later

The platform runs single-Region (us-east-1) with Multi-AZ HA, serving EU customers through CloudFront edge caching — a deliberate fit for a six-engineer team. A second Region is on the roadmap, triggered by a residency contract, EU latency breaching the SLO, or a Region-outage cost case.
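The three triggers can be written down as an explicit check so the decision is revisited against data instead of instinct. This is only a sketch: the field names are hypothetical, and using a p95 latency figure is an assumption, since the case study says only that EU latency breaches the SLO.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RegionSignals:
    residency_contract: bool   # a contract requires data to stay in-Region
    eu_p95_latency_ms: float   # assumption: the SLO is measured at p95
    eu_latency_slo_ms: float
    outage_cost_case: bool     # the cost of a Region outage justifies the spend


def multi_region_triggers(signals: RegionSignals) -> List[str]:
    """Return the triggers that have fired; an empty list means stay single-Region."""
    fired: List[str] = []
    if signals.residency_contract:
        fired.append("residency contract")
    if signals.eu_p95_latency_ms > signals.eu_latency_slo_ms:
        fired.append("EU latency over SLO")
    if signals.outage_cost_case:
        fired.append("Region-outage cost case")
    return fired
```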

The evolution would add Aurora Global Database, DynamoDB Global Tables for customer data (sessions stay regional), S3 Cross-Region Replication, Route 53 latency routing, and per-Region SES and OpenSearch — roughly 1.6× the single-Region cost.

ECS Fargate vs EKS for a small team

ECS Fargate: the right default here. There are no nodes to manage, the operational surface is low for six engineers, and it handles the workload fine.

EKS: considered and rejected. It adds cluster upgrades, node management, and networking complexity without proportional benefit at this scale.

Single-Region + Multi-AZ: the right baseline. Multi-Region is a triggered evolution, not a day-one default.

Common Mistakes

Putting the relational product catalog into single-table DynamoDB when the model is relational and admin-edited — Aurora plus ElastiCache plus OpenSearch fits better.
Adopting EKS for 'Kubernetes-native' credibility when ECS Fargate handles the workload with far less operational overhead.
Going multi-Region on day one for a six-engineer team instead of single-Region Multi-AZ with CloudFront edge caching.
Making payments asynchronous through the event bus, creating 'did we charge the customer or not' ambiguity.
Hand-writing CloudFormation from the start instead of CDK, leaving large templates hard to refactor two years in.
Sticking with the AWS-native observability stack past ~20 services instead of adopting an APM when cross-service incidents grow.

Best Practices

Use the practical default stack: CloudFront + WAF + ALB + ECS Fargate + Aurora + DynamoDB + ElastiCache + OpenSearch + EventBridge + SQS + Lambda + SES.
Match each data store to its access pattern rather than forcing one database to do everything.
Use EventBridge fan-out with a per-consumer SQS queue and dead-letter queue for resilient async work.
Keep payments synchronous so an order only enters the bus after a confirmed charge.
Run single-Region with Multi-AZ until a specific business condition triggers multi-Region.
Favor operational simplicity over perfect efficiency for a small team.
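The per-consumer queue practice above pairs naturally with a Lambda consumer that reports failures per message, so one bad message is retried (and eventually dead-lettered) without replaying the whole batch. The sketch assumes the queue's event source mapping has `ReportBatchItemFailures` enabled and that the EventBridge rule delivers the default event envelope as the SQS message body. `process` is a hypothetical callable supplied by each consumer.

```python
import json
from typing import Any, Callable, Dict, List


def make_handler(
    process: Callable[[Dict[str, Any]], None],
) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Build a Lambda handler for an SQS batch of EventBridge events."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        failures: List[Dict[str, str]] = []
        for record in event.get("Records", []):
            try:
                envelope = json.loads(record["body"])  # EventBridge envelope
                process(envelope["detail"])            # the order payload
            except Exception:
                # Report only this message; SQS redelivers it and moves it to
                # the dead-letter queue once the receive limit is exceeded.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler
```

Each consumer (fulfillment, email, analytics, recommendations) would wrap its own `process` function this way, which keeps a failure in one of them from affecting the others.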

Comparable services

GCP: Cloud Run/GKE + Cloud SQL + Memorystore + Cloud CDN + Pub/Sub
Azure: AKS/Container Apps + Azure SQL + Cache for Redis + Front Door + Service Bus

Knowledge Check

Why did the team keep the product catalog in Aurora rather than single-table DynamoDB?

The product model is relational and admin-edited; Aurora plus ElastiCache plus OpenSearch fit the access pattern better
DynamoDB has no underlying mechanism to durably persist the catalog's product item records, leaving a relational engine like Aurora as the only store capable of holding them
Aurora's per-hour pricing always beats DynamoDB on-demand for any catalog
DynamoDB offers no encryption-at-rest for sensitive product fields

Why are payments kept synchronous while order fulfillment is event-driven?

Payment success is required before publishing order.placed, avoiding distributed-transaction ambiguity over whether the customer was charged
Synchronous request/response always returns to the user faster than async
EventBridge is technically unable to route payment-related event payloads, so the charge step has to stay a synchronous in-process call simply because no event bus can carry it asynchronously
Warehouse fulfillment must finish before the order row is ever written

What is the right baseline for a six-engineer e-commerce team, per this case study?

Single-Region with Multi-AZ HA, serving distant users via CloudFront edge caching
Active-active across two Regions from the very first deploy
A single Availability Zone to keep the monthly bill lowest
A self-managed on-premises data center running the full stack, with nightly backups shipped to the cloud for disaster recovery only

Why did the team choose ECS Fargate over EKS?

Fargate handles the workload with much lower operational overhead; EKS adds cluster and node management without proportional benefit here
EKS is unable to schedule and run containerized service workloads
Fargate is the single platform on AWS capable of running stateless services, since EKS, ECS on EC2, and plain Lambda are all incapable of hosting this kind of stateless container workload
EKS pods cannot receive traffic from an Application Load Balancer
