Fintech Application
A composite payment-processing platform for SMBs: accepts payments across channels, keeps an accurate double-entry ledger, exposes merchant APIs, and produces compliance reports. 25 engineers, ~30,000 merchants, a few million transactions a day.
Every decision is shaped by handling money: ACID consistency on the ledger, immutable audit logging, PCI DSS Level 1, SOC 2 Type II, 99.99% API availability, and a recovery-point objective measured in seconds even through a full-Region outage — which together rule out a single-Region ledger.
Transactional Core and the Ledger
The architecture splits a synchronous transactional core (where consistency matters) from an asynchronous reporting layer (where eventual consistency is fine). The ledger is a double-entry, append-only table in Aurora PostgreSQL: refunds and reversals are new rows, never modifications. All access goes through stored procedures that enforce the consistency invariants and run at SERIALIZABLE isolation.
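The append-only, double-entry rule can be illustrated with a small in-memory sketch. This is not the team's stored-procedure code; the `Ledger` and `Entry` classes, the account names, and the sign convention (positive for debit, negative for credit) are assumptions made for illustration. It shows the two invariants the paragraph describes: every transaction must balance to zero, and a reversal appends mirrored rows instead of editing existing ones.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Tuple


@dataclass(frozen=True)
class Entry:
    """One immutable ledger row: a signed amount posted to an account."""
    txn_id: str
    account: str
    amount: Decimal  # assumed convention: positive = debit, negative = credit


class Ledger:
    """Hypothetical in-memory stand-in for the append-only ledger table."""

    def __init__(self) -> None:
        self._rows: List[Entry] = []

    def post(self, txn_id: str, legs: List[Tuple[str, Decimal]]) -> None:
        """Append all legs of a transaction, or reject the whole transaction."""
        if sum(amount for _, amount in legs) != Decimal("0"):
            raise ValueError("unbalanced transaction: debits must equal credits")
        self._rows.extend(Entry(txn_id, acct, amt) for acct, amt in legs)

    def reverse(self, txn_id: str, reversal_id: str) -> None:
        """Undo a transaction by appending mirrored rows; nothing is edited."""
        original = [r for r in self._rows if r.txn_id == txn_id]
        if not original:
            raise KeyError(txn_id)
        self.post(reversal_id, [(r.account, -r.amount) for r in original])

    def balance(self, account: str) -> Decimal:
        """Net balance of one account, computed from every row ever posted."""
        return sum(
            (r.amount for r in self._rows if r.account == account), Decimal("0")
        )

    def row_count(self) -> int:
        return len(self._rows)
```

Posting a 25.00 payment and then reversing it leaves four rows in the ledger and a zero balance on both accounts, so the history of what happened survives even though the net effect is nil.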
Aurora runs as a Multi-AZ cluster with Aurora Global Database replicating to a second Region, on I/O-Optimized storage (the ledger is write-heavy). Encryption uses a customer managed multi-Region KMS key so the standby Region can decrypt the same data; point-in-time recovery (PITR) plus snapshots retained for seven years meet the retention requirement.
Workflow Orchestration and Audit
Payment operations touch the ledger, the payment rail, sanctions screening, the merchant webhook, and analytics — orchestrated as Sagas in Step Functions Standard (exactly-once, multi-second runs, and the execution history is part of the audit record). A failed authorization triggers a compensating ledger reversal so the books match reality.
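The Saga pattern itself can be sketched in a few lines of plain Python. This is not Step Functions code (there, the same idea is typically expressed with `Catch` transitions that route to compensation states); the `run_saga` helper and the step names in the usage note are hypothetical. It shows the core mechanic: run steps in order, and when one fails, run the compensations of the already completed steps in reverse order while recording a history.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); all three are illustrative placeholders
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns an event history, loosely analogous to an execution history.
    """
    history: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            history.append(f"failed:{name}:{exc}")
            # Undo only what actually completed, most recent first.
            for prev_name, _, compensate in reversed(done):
                compensate()
                history.append(f"compensated:{prev_name}")
            return history
        history.append(f"done:{name}")
        done.append(step)
    return history
```

For example, with a `ledger_hold` step whose compensation posts a reversal and an `authorize` step that raises, the hold is recorded, the authorization failure is recorded, and the reversal runs, so the ledger ends up matching the fact that no payment happened. Later steps such as the merchant webhook never run.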
Audit is the harder-than-it-looks part: an organization-wide CloudTrail to an S3 Object Lock bucket in a separate audit account (seven-year compliance-mode retention), plus a separate application-level audit stream for business events. Config, Macie, GuardDuty, Access Analyzer, Identity Center, and SCPs round it out.
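As a rough illustration of the seven-year compliance-mode retention, the sketch below builds the default-retention rule in the shape that the S3 `PutObjectLockConfiguration` API accepts. The helper name `object_lock_configuration` and the bucket name in the comment are hypothetical, and the snippet only constructs the configuration; applying it would need boto3, credentials for the audit account, and a bucket with Object Lock enabled.

```python
from typing import Any, Dict


def object_lock_configuration(retention_years: int) -> Dict[str, Any]:
    """Build a default-retention rule in the shape S3 Object Lock expects."""
    if retention_years <= 0:
        raise ValueError("retention must be a positive number of years")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {
                # Compliance mode: retention cannot be shortened or removed
                # by any user until the retention period has passed.
                "Mode": "COMPLIANCE",
                "Years": retention_years,
            }
        },
    }


AUDIT_BUCKET_LOCK = object_lock_configuration(7)

# With boto3 installed, this would be applied roughly as follows:
#   boto3.client("s3").put_object_lock_configuration(
#       Bucket="example-audit-bucket",  # hypothetical bucket name
#       ObjectLockConfiguration=AUDIT_BUCKET_LOCK,
#   )
```

Compliance mode is the relevant choice here because governance mode can be bypassed by principals holding a specific permission, which would weaken the "operators cannot edit their own audit trail" property.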
Disaster Recovery and Cost Posture
The 99.99% availability and 30-minute RTO targets drive a warm standby in a second Region: Aurora Global Database (sub-minute promotion; its cross-Region replica trails the primary by about a second, so an unplanned failover can lose that last second of writes), Lambda and Step Functions deployed identically via CDK, Route 53 health-checked failover, DynamoDB Global Tables for small operational tables, and S3 Cross-Region Replication (CRR). A Region-failover drill runs twice a year and has caught three real problems.
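The arithmetic behind these targets is worth making explicit. The sketch below assumes availability is measured over a 365-day year and that a failover consumes its full RTO; the 3,000,000 transactions per day in the usage note is an assumed stand-in for "a few million", not a figure from the platform.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def failovers_within_budget(availability: float, rto_minutes: float) -> int:
    """How many outages lasting a full RTO fit in one year's budget."""
    return int(downtime_budget_minutes(availability) // rto_minutes)


def writes_at_risk(transactions_per_day: float, replica_lag_seconds: float) -> float:
    """Average number of transactions inside the replication-lag window."""
    return transactions_per_day / SECONDS_PER_DAY * replica_lag_seconds
```

At 99.99%, `downtime_budget_minutes(0.9999)` is about 52.6 minutes a year, so `failovers_within_budget(0.9999, 30)` is 1: a single failover that uses the whole 30-minute RTO spends more than half the annual budget. With an assumed 3,000,000 transactions a day and one second of replica lag, `writes_at_risk(3_000_000, 1.0)` is roughly 35 transactions exposed in an unplanned failover.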
The bill looks different from a cost-optimized startup's: Aurora Multi-AZ plus Global Database is roughly half of it, and Shield Advanced, paid WAF rule groups, CloudTrail data events, Macie, and GuardDuty add further lines. Compliance and reliability come first; cost-cutting that risks either is off the table.
Database Options Considered
Aurora PostgreSQL: chosen because the ledger needs joins for reporting, cross-row consistency for double-entry invariants, and SQL queries.
DynamoDB: the right call for many fintech workloads, but the ledger's relational shape and consistency needs fit Aurora better.
Aurora DSQL: evaluated for active-active multi-Region but too new for the team's regulators; revisit later.
Anti-Patterns
- Putting a double-entry ledger in DynamoDB when it needs joins, cross-row consistency, and SQL — Aurora is the right fit here.
- Running the ledger single-Region when the targets are a seconds-level RPO and 99.99% availability through a full-Region outage.
- Keeping audit logs mutable or in the same account as the data instead of S3 Object Lock in a separate audit account.
- Using Step Functions Express for ledger workflows where exactly-once and a durable audit-grade execution history are required.
- Building multi-Region DR and never drilling failover — the drills are what make it actually work.
- Aggressively cost-cutting in ways that weaken the compliance or reliability posture.
Best Practices
- Keep the ledger in Aurora PostgreSQL, append-only, behind stored procedures enforcing invariants at SERIALIZABLE isolation.
- Orchestrate payment Sagas with Step Functions Standard, with compensating actions on failure.
- Run an organization-wide CloudTrail to an S3 Object Lock bucket in a separate audit account, plus an application audit stream.
- Use Aurora Global Database (sub-minute promotion) plus Route 53 health-checked failover for DR, and drill it regularly.
- Use a customer managed multi-Region KMS key so the standby Region can decrypt the same data.
- Accept compliance-driven cost lines (Shield Advanced, paid WAF, CloudTrail data events) as the cost of the regulatory posture.
Knowledge Check
Why is the ledger built on Aurora PostgreSQL rather than DynamoDB?
- It needs joins for reporting, cross-row consistency for double-entry invariants, and SQL — a relational fit
- DynamoDB is fundamentally incapable of storing monetary or financial records, so any regulated double-entry ledger has to run on a relational engine like Aurora by default
- Aurora is the lowest-cost database for a multi-Region ledger
- DynamoDB offers no at-rest encryption for ledger entries
How is the immutable audit trail implemented?
- Organization-wide CloudTrail to an S3 Object Lock (Compliance Mode) bucket in a separate audit account
- A regular updatable audit table living inside the production ledger database, written by the same services and editable by the same operators it is meant to hold accountable
- CloudWatch Logs with a one-day retention window on the events
- Plain log files written to the local disk of each instance
Why does the team use Step Functions Standard (not Express) for payment workflows?
- Exactly-once execution matters for the ledger, runs can take several seconds, and the execution history is part of the audit record
- Express bills more per execution than Standard at the ledger's volume
- Standard is the only Step Functions workflow type able to invoke a Lambda function, since Express state machines cannot integrate with Lambda and so cannot run any of the payment steps
- Express has no retry mechanism for a failed payment step
What makes the multi-Region DR actually work?
- Regular Region-failover drills — they caught three real problems that would have caused outages
- Enabling Aurora Global Database once and trusting it to just work
- Holding the entire standby Region completely powered off and unprovisioned until a disaster strikes, then cold-starting every component only at failover time
- A single Availability Zone backed by frequent automated snapshots