E-commerce at Scale
Service 68

ScaleEdge

Consider a storefront facing spiky, sometimes extreme traffic — a launch, a sale, a viral moment — where the catalog must load instantly worldwide and the checkout must never drop an order even as load spikes tenfold in minutes. The requirements are global low latency, elastic compute, a fast catalog, and order processing that survives the surge.

The architecture's theme is absorbing spikes without losing orders. Front Door and caching serve the catalog from the edge; autoscaling compute handles the variable middle; Azure Cache for Redis fronts the hot catalog and cart data; and crucially, checkout is decoupled through Service Bus so a traffic spike queues work rather than dropping it.

Edge and Caching

Front Door serves static catalog content — images, product pages, scripts — from the edge with aggressive caching, so the vast majority of read traffic never reaches the origin. This is the single biggest lever: a sale's traffic is dominated by browsing, and edge caching turns that flood into cache hits instead of origin load.
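One way to make the split between cacheable browsing traffic and uncacheable per-user traffic concrete is for the origin to set Cache-Control per path. The following is a minimal sketch, not a Front Door configuration: the path prefixes and TTL values are assumptions chosen for illustration, and how long the edge actually keeps an object also depends on the caching rules configured on the Front Door route.

```python
# Hypothetical path prefixes and TTLs, chosen only for illustration.
PER_USER_PREFIXES = ("/cart", "/checkout", "/account")
STATIC_PREFIXES = ("/images/", "/scripts/", "/static/")


def cache_control_for(path: str) -> str:
    """Return a Cache-Control header value for a request path."""
    if path.startswith(PER_USER_PREFIXES):
        # Per-user or transactional responses must not be stored by shared caches.
        return "private, no-store"
    if path.startswith(STATIC_PREFIXES):
        # Images and scripts change rarely, so cache them for a day.
        return "public, max-age=86400"
    if path.startswith("/products/"):
        # Product pages can be slightly stale; five minutes absorbs most browsing.
        return "public, max-age=300"
    # Anything unclassified is not cached until someone decides it is safe.
    return "no-store"
```

The ordering matters: the per-user check runs first so that a path can never be cached by accident because it also matches a more permissive rule.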

Elastic Compute

The application tier runs on autoscaling compute (Container Apps or App Service scaling out on load), sized to scale fast enough for the spike, not for steady state. Schedule-based pre-scaling ahead of a known sale beats waiting for a metric to trip after customers are already queuing. The catalog and cart lean on Azure Cache for Redis so hot reads are served from memory, typically in about a millisecond or less, rather than from the database.
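The Redis read path described here is usually implemented as the cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache with a time-to-live. The sketch below uses an in-memory TtlCache class as a stand-in for a Redis client so that it runs without a server; TtlCache, get_product, the key format, and the 60-second TTL are all assumptions, not part of the source design.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """In-memory stand-in for a Redis client (get, and set with an expiry)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries behave like a miss.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)


def get_product(
    product_id: str,
    cache: TtlCache,
    load_from_db: Callable[[str], str],
    ttl_seconds: float = 60.0,
) -> str:
    """Cache-aside read: serve from cache, fall back to the database on a miss."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(product_id)  # only reached on a miss
    cache.set(key, value, ttl_seconds)
    return value
```

With this shape, a hot product hits the database once per TTL window no matter how many times it is read, which is what keeps a browsing flood off the data tier.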

Order Processing

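Checkout is the part that must not fail, so it is decoupled: the web tier accepts an order and places a message on Service Bus, and worker processes settle it asynchronously. This queue-based load leveling means a tenfold spike lengthens the queue rather than overwhelming the order system, so orders are accepted fast and processed reliably even under peak. Service Bus delivery is at-least-once by default, so workers should settle orders idempotently (keyed on the order ID, optionally backed by Service Bus duplicate detection) to get an effectively-once outcome.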
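A queue can redeliver a message, for example when a worker crashes before completing it, so the worker side of this pattern needs to tolerate duplicates. The sketch below shows the shape of the flow with Python's standard queue.Queue standing in for Service Bus and a dictionary standing in for the orders database; OrderMessage, accept_order, and drain are hypothetical names, and a real worker would use the Service Bus SDK and a durable store.

```python
import queue
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class OrderMessage:
    order_id: str      # doubles as the idempotency key
    total_cents: int


def accept_order(order_queue: "queue.Queue[OrderMessage]", order: OrderMessage) -> str:
    """Web tier: enqueue the order and return immediately."""
    order_queue.put(order)
    return "accepted"


def drain(order_queue: "queue.Queue[OrderMessage]", settled: Dict[str, int]) -> int:
    """Worker: settle each order once, even if its message arrives twice.

    Returns the number of orders newly settled in this pass.
    """
    newly_settled = 0
    while True:
        try:
            message = order_queue.get_nowait()
        except queue.Empty:
            return newly_settled
        if message.order_id not in settled:
            # First time this order is seen: record it.
            settled[message.order_id] = message.total_cents
            newly_settled += 1
        # Duplicates fall through and are acknowledged without side effects.
        order_queue.task_done()
```

The web tier's work is constant per request regardless of how busy the workers are, and a redelivered message changes nothing because the order ID has already been recorded.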

Data and Resilience

The catalog suits a read-scaled relational or document store; the order and inventory data needs consistency for stock counts. Zone redundancy is the availability baseline, and the design degrades gracefully — if the recommendation service is slow, the catalog and checkout still work, because the critical path is isolated from the nice-to-have.
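Graceful degradation of this kind is often implemented as a strict timeout with a fallback around the non-essential call. The following is a minimal sketch using only the standard library; recommendations_or_default, the 200 ms budget, and the empty-list fallback are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, List

# Small shared pool so a slow dependency cannot tie up unbounded threads.
_executor = ThreadPoolExecutor(max_workers=4)


def recommendations_or_default(
    fetch: Callable[[], List[str]],
    timeout_seconds: float = 0.2,
) -> List[str]:
    """Return recommendations, or an empty list if the call is slow or fails."""
    future = _executor.submit(fetch)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only stops a call that has not started yet; an in-flight
        # call keeps running in the background and its result is discarded.
        future.cancel()
        return []
    except Exception:
        # Any failure in the nice-to-have service degrades to "no recommendations".
        return []
```

The product page renders with an empty recommendations panel instead of waiting, so the catalog and checkout latency is bounded by the timeout rather than by the slowest dependency.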

Synchronous checkout vs queue-based load leveling

Synchronous checkout — The web request processes the order inline. Simple, but a spike overwhelms the order system and drops orders under load.

Queue-based load leveling (Service Bus) — The web tier queues the order and workers process it asynchronously. A spike lengthens the queue instead of failing — the right pattern at scale.
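The difference between the two approaches can be seen in a toy simulation. The model below is a deliberate simplification: time advances in ticks, the order system settles at most a fixed number of orders per tick, the synchronous design fails anything beyond that capacity, and the queued design holds the excess as backlog. The function name and the numbers are illustrative assumptions.

```python
from typing import List, Tuple


def simulate(arrivals: List[int], capacity: int) -> Tuple[int, int, int]:
    """Compare both designs under the same load.

    Returns (orders dropped if synchronous,
             peak backlog if queued,
             backlog left at the end if queued).
    """
    dropped = 0
    backlog = 0
    peak_backlog = 0
    for arriving in arrivals:
        # Synchronous: anything beyond capacity in this tick fails.
        dropped += max(0, arriving - capacity)
        # Queued: excess waits, and workers drain up to capacity per tick.
        backlog = max(0, backlog + arriving - capacity)
        peak_backlog = max(peak_backlog, backlog)
    return dropped, peak_backlog, backlog


# Baseline of 100 orders per tick with a tenfold spike for two ticks.
print(simulate([100, 1000, 1000, 100, 100, 100, 100], capacity=400))
# → (1200, 1200, 0)
```

Under this model the synchronous design loses 1,200 orders, while the queued design peaks at a backlog of 1,200 and drains it completely four ticks after the spike ends. The cost of queueing is latency for those orders, not loss.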

Common Mistakes
  • Serving the catalog from the origin with no edge caching, so a browsing flood becomes origin load and the site buckles.
  • Processing checkout synchronously, so a traffic spike overwhelms the order system and drops orders.
  • Sizing autoscale for steady state, so it cannot scale fast enough for a sudden spike.
  • Hitting the database for hot catalog and cart reads instead of fronting them with Redis.
  • Coupling the critical checkout path to non-essential services, so a slow recommendation engine takes down checkout.
  • Waiting for a CPU metric to trip instead of pre-scaling ahead of a known sale.
Best Practices
  • Cache the catalog aggressively at the edge with Front Door so browsing traffic becomes cache hits.
  • Decouple checkout through Service Bus so spikes queue work rather than dropping orders.
  • Autoscale the app tier sized for the spike, and pre-scale ahead of known sales.
  • Front hot catalog and cart data with Azure Cache for Redis.
  • Isolate the critical checkout path from non-essential services so it degrades gracefully.
  • Keep order and inventory data consistent while read-scaling the catalog.
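The pre-scaling practice above comes down to two small calculations: how many replicas the expected peak needs, and when to start scaling so they are warm before the sale opens. The sketch below is a back-of-the-envelope helper; the function names, the 30 percent headroom, and the minimum of two replicas are assumptions, and the per-replica throughput should come from a load test rather than a guess.

```python
import math
from datetime import datetime, timedelta


def replicas_for(
    expected_rps: int,
    rps_per_replica: int,
    headroom_percent: int = 30,
    min_replicas: int = 2,
) -> int:
    """Replicas needed to serve the expected peak with spare headroom."""
    needed = math.ceil(
        expected_rps * (100 + headroom_percent) / (100 * rps_per_replica)
    )
    return max(min_replicas, needed)


def prescale_start(sale_start: datetime, warmup: timedelta) -> datetime:
    """When to raise the replica floor so instances are warm at sale start."""
    return sale_start - warmup
```

For example, an expected peak of 5,000 requests per second at 250 requests per second per replica gives replicas_for(5000, 250) == 26, and that floor would be applied at prescale_start(sale_start, warmup) rather than left to a metric-based rule.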
Comparable services
  • AWS: CloudFront + ECS + ElastiCache + SQS
  • GCP: Cloud CDN + Cloud Run + Memorystore + Pub/Sub

Knowledge Check

What is the single biggest lever for surviving a browsing-traffic spike on a storefront?

  • Aggressive edge caching of the catalog with Front Door, so most reads never reach the origin
  • Provisioning a much larger database instance so the origin can absorb every browsing read directly
  • Processing every checkout synchronously inline for lower latency
  • Disabling the Front Door WAF to shave request latency

Why decouple checkout through Service Bus?

  • Queue-based load leveling lets a spike lengthen the queue instead of overwhelming and dropping orders
  • It makes the checkout request synchronous and faster to complete inline within the web request thread
  • It removes the need for a consistent orders database
  • It caches the product catalog at the Front Door edge

How should the critical checkout path relate to non-essential services like recommendations?

  • It should be isolated so a slow or failing recommendation service does not take down checkout
  • It should share the same request thread so the two scale together
  • It should depend on the recommendation service before an order can be completed and confirmed
  • They should both run inside the same synchronous request
