
Fintech Application

Case Study · Finance · Architecture

A composite payment-processing platform for SMBs: accepts payments across channels, keeps an accurate double-entry ledger, exposes merchant APIs, and produces compliance reports. 25 engineers, ~30,000 merchants, a few million transactions a day.

Every decision is shaped by handling money: ACID consistency on the ledger, immutable audit logging, PCI DSS Level 1, SOC 2 Type II, 99.99% API availability, and a recovery-point objective measured in seconds even through a full-Region outage — which together rule out a single-Region ledger.

Transactional Core and the Ledger

The architecture splits a synchronous transactional core (where consistency matters) from an asynchronous reporting layer (where eventual consistency is fine). The ledger is a double-entry, append-only table in Aurora PostgreSQL — refunds and reversals are new rows, never modifications. Access goes through stored procedures that enforce the consistency invariants and run SERIALIZABLE isolation.
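The append-only, double-entry rule can be illustrated with a small sketch. It uses in-memory SQLite from the Python standard library as a stand-in for Aurora PostgreSQL, and triggers in place of the stored procedures; the table name, account names, and the sign convention (debits positive, credits negative, amounts in minor units) are assumptions for illustration, not the platform's actual schema. It does not attempt to show SERIALIZABLE isolation.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ledger_entries (
    entry_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    txn_id       TEXT    NOT NULL,
    account      TEXT    NOT NULL,
    amount_minor INTEGER NOT NULL  -- signed minor units: debit > 0, credit < 0
);
-- Append-only: reject any UPDATE or DELETE on posted rows.
CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger_entries
BEGIN SELECT RAISE(ABORT, 'ledger is append-only'); END;
CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger_entries
BEGIN SELECT RAISE(ABORT, 'ledger is append-only'); END;
"""


def open_ledger():
    """Create an empty in-memory ledger (stand-in for the real database)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def post_transaction(conn, txn_id, legs):
    """Append all legs of one transaction atomically; legs must sum to zero."""
    if sum(amount for _, amount in legs) != 0:
        raise ValueError("unbalanced transaction: debits must equal credits")
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO ledger_entries (txn_id, account, amount_minor) "
            "VALUES (?, ?, ?)",
            [(txn_id, account, amount) for account, amount in legs],
        )


def balance(conn, account):
    """Sum every entry ever posted to an account."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_minor), 0) FROM ledger_entries "
        "WHERE account = ?",
        (account,),
    ).fetchone()
    return row[0]


conn = open_ledger()
# A payment, then its refund posted as new rows rather than an edit.
post_transaction(conn, "pay-1", [("customer_funds", 5000), ("merchant_payable", -5000)])
post_transaction(conn, "refund-1", [("merchant_payable", 5000), ("customer_funds", -5000)])
```

After the refund, both accounts net to zero while all four rows remain in the table, which is the property an auditor relies on.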

The Aurora cluster runs Multi-AZ, with Aurora Global Database replicating to a second Region, on I/O-Optimized storage (the ledger is write-heavy). Encryption uses a customer managed multi-Region KMS key so the standby Region can decrypt the same data; point-in-time recovery (PITR) plus snapshots retained for seven years meet the retention requirement.

Workflow Orchestration and Audit

Payment operations touch the ledger, the payment rail, sanctions screening, the merchant webhook, and analytics — orchestrated as Sagas in Step Functions Standard (exactly-once, multi-second runs, and the execution history is part of the audit record). A failed authorization triggers a compensating ledger reversal so the books match reality.
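The compensation idea can be sketched in plain Python. This is not Step Functions code: it only simulates the control flow a Saga state machine would encode, and the step names, the ctx dictionary, and the rail_declines flag are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    compensate: Callable[[dict], None]


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            done.append(step)
        except Exception as exc:
            ctx["log"].append(f"{step.name} failed: {exc}")
            for prior in reversed(done):
                prior.compensate(ctx)
            return False
    return True


def post_ledger(ctx):
    ctx["log"].append("ledger: pending entry posted")


def reverse_ledger(ctx):
    # Compensation is a new reversal entry, never an edit of the original row.
    ctx["log"].append("ledger: reversal entry posted")


def authorize(ctx):
    if ctx["rail_declines"]:
        raise RuntimeError("authorization declined")
    ctx["log"].append("rail: authorized")


def void_authorization(ctx):
    ctx["log"].append("rail: authorization voided")


PAYMENT_SAGA = [
    Step("post_ledger", post_ledger, reverse_ledger),
    Step("authorize", authorize, void_authorization),
]

# Simulate a declined authorization after the ledger entry was posted.
ctx = {"rail_declines": True, "log": []}
ok = run_saga(PAYMENT_SAGA, ctx)
```

Only steps that completed are compensated, so the failed authorization itself is not voided; the ledger step, which did complete, gets its reversal.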

Audit is the harder-than-it-looks part: an organization-wide CloudTrail to an S3 Object Lock bucket in a separate audit account (seven-year compliance-mode retention), plus a separate application-level audit stream for business events. Config, Macie, GuardDuty, Access Analyzer, Identity Center, and SCPs round it out.
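As a rough sketch of the seven-year compliance-mode retention, the default Object Lock rule could be set with boto3 along these lines. The bucket name is a placeholder, and the sketch assumes the bucket was created with Object Lock enabled and that credentials for the audit account are in place; it is not the team's actual deployment code.

```python
def object_lock_configuration(years: int = 7) -> dict:
    """Build a default-retention rule in compliance mode."""
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": years}},
    }


def apply_object_lock(bucket: str, years: int = 7) -> None:
    """Apply the rule to a bucket (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    s3 = boto3.client("s3")
    s3.put_object_lock_configuration(
        Bucket=bucket,
        ObjectLockConfiguration=object_lock_configuration(years),
    )


config = object_lock_configuration()
# apply_object_lock("example-audit-trail-bucket")  # hypothetical bucket name
```

In compliance mode, no user, including the account root, can shorten the retention period or delete a locked object version before it expires, which is why the mode suits an audit trail.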

Disaster Recovery and Cost Posture

The 99.99% availability and 30-minute RTO targets drive a warm standby in a second Region: Aurora Global Database, Lambda and Step Functions deployed identically via CDK, Route 53 health-checked failover, DynamoDB Global Tables for small operational tables, and S3 Cross-Region Replication (CRR). Aurora Global Database promotes in under a minute, and its cross-Region replica trails the primary by about a second, so an unplanned failover can lose that last second of writes. A Region-failover drill runs twice a year and has caught three real problems.
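The targets translate into concrete numbers with a little arithmetic. The sketch below assumes 3,000,000 transactions a day as a stand-in for "a few million" and an evenly spread load; real traffic peaks would put more writes at risk than this average suggests.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600
SECONDS_PER_DAY = 86_400


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def writes_at_risk(transactions_per_day: int, replica_lag_seconds: float) -> float:
    """Average number of transactions inside the replication-lag window."""
    return transactions_per_day / SECONDS_PER_DAY * replica_lag_seconds


budget = downtime_budget_minutes(0.9999)   # about 52.6 minutes a year
at_risk = writes_at_risk(3_000_000, 1.0)   # about 35 transactions (assumed volume)
rto_share = 30 / budget                    # one 30-minute failover vs. the budget
```

Under these assumptions, a single failover that takes the full 30-minute RTO consumes more than half of the yearly downtime budget, and a one-second replica lag corresponds to a few dozen transactions that would need reconciliation.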

The bill looks different from a cost-optimized startup's: Aurora Multi-AZ plus Global Database is roughly half of it, plus Shield Advanced, paid WAF rule groups, CloudTrail data events, Macie, and GuardDuty. Compliance and reliability come first; cost-cutting that risks either is off the table.

Aurora vs DynamoDB for the ledger

Aurora PostgreSQL: chosen because the ledger needs joins for reporting, cross-row consistency for double-entry invariants, and SQL queries.

DynamoDB: the right call for many fintech workloads, but the ledger's relational shape and consistency needs fit Aurora better.

Aurora DSQL: evaluated for active-active multi-Region, but too new for the team's regulators; revisit later.

Common Mistakes

Putting a double-entry ledger in DynamoDB when it needs joins, cross-row consistency, and SQL — Aurora is the right fit here.
Running the ledger single-Region when the targets are a seconds-level RPO and 99.99% availability through a full-Region outage.
Keeping audit logs mutable or in the same account as the data instead of S3 Object Lock in a separate audit account.
Using Step Functions Express for ledger workflows where exactly-once and a durable audit-grade execution history are required.
Building multi-Region DR and never drilling failover — the drills are what make it actually work.
Aggressively cost-cutting in ways that weaken the compliance or reliability posture.

Best Practices

Keep the ledger in Aurora PostgreSQL, append-only, behind stored procedures enforcing invariants at SERIALIZABLE isolation.
Orchestrate payment Sagas with Step Functions Standard, with compensating actions on failure.
Run an organization-wide CloudTrail to an S3 Object Lock bucket in a separate audit account, plus an application audit stream.
Use Aurora Global Database plus Route 53 health-checked failover for cross-Region DR with sub-minute database promotion, and drill it regularly.
Use a Customer Managed multi-Region KMS key so the standby Region decrypts the same data.
Accept compliance-driven cost lines (Shield Advanced, paid WAF, CloudTrail data events) as the cost of the regulatory posture.

Comparable services

GCP: Spanner/Cloud SQL + Workflows + Cloud Audit Logs + Cloud Armor
Azure: Azure SQL + Logic Apps + immutable Blob + Azure WAF/DDoS

Knowledge Check

Why is the ledger built on Aurora PostgreSQL rather than DynamoDB?

It needs joins for reporting, cross-row consistency for double-entry invariants, and SQL — a relational fit
DynamoDB is fundamentally incapable of storing monetary or financial records, so any regulated double-entry ledger has to run on a relational engine like Aurora by default
Aurora is the lowest-cost database for a multi-Region ledger
DynamoDB offers no at-rest encryption for ledger entries

How is the immutable audit trail implemented?

Organization-wide CloudTrail to an S3 Object Lock (Compliance Mode) bucket in a separate audit account
A regular updatable audit table living inside the production ledger database, written by the same services and editable by the same operators it is meant to hold accountable
CloudWatch Logs with a one-day retention window on the events
Plain log files written to the local disk of each instance

Why does the team use Step Functions Standard (not Express) for payment workflows?

Exactly-once execution matters for the ledger, runs can take several seconds, and the execution history is part of the audit record
Express bills more per execution than Standard at the ledger's volume
Standard is the only Step Functions workflow type able to invoke a Lambda function, since Express state machines cannot integrate with Lambda and so cannot run any of the payment steps
Express has no retry mechanism for a failed payment step

What makes the multi-Region DR actually work?

Regular Region-failover drills — they caught three real problems that would have caused outages
Enabling Aurora Global Database once and trusting it to just work
Holding the entire standby Region completely powered off and unprovisioned until a disaster strikes, then cold-starting every component only at failover time
A single Availability Zone backed by frequent automated snapshots
