Fintech Application

Case Study, Finance, Architecture

A composite payment-processing platform for SMBs: accepts payments across channels, keeps an accurate double-entry ledger, exposes merchant APIs, and produces compliance reports. 25 engineers, ~30,000 merchants, a few million transactions a day.

Every decision is shaped by handling money: ACID consistency on the ledger, immutable audit logging, PCI DSS Level 1, SOC 2 Type II, 99.99% API availability, and a recovery-point objective measured in seconds even through a full-Region outage — which together rule out a single-Region ledger.

Transactional Core and the Ledger

The architecture splits a synchronous transactional core (where consistency matters) from an asynchronous reporting layer (where eventual consistency is fine). The ledger is a double-entry, append-only table in Aurora PostgreSQL: refunds and reversals are new rows, never modifications. All access goes through stored procedures that enforce the consistency invariants and run at SERIALIZABLE isolation.
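A minimal sketch of the append-only, double-entry idea follows. It uses Python's built-in sqlite3 as a local stand-in for Aurora PostgreSQL, so it does not model stored procedures or SERIALIZABLE isolation. The table layout, account names, and function names are assumptions made for illustration, not the platform's actual schema.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger whose rows can be inserted but never changed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ledger_entries (
            entry_id     INTEGER PRIMARY KEY,
            txn_id       TEXT    NOT NULL,
            account      TEXT    NOT NULL,
            amount_cents INTEGER NOT NULL CHECK (amount_cents <> 0)
        );
        -- Triggers reject any attempt to modify or remove history.
        CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger_entries
        BEGIN SELECT RAISE(ABORT, 'ledger is append-only'); END;
        CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger_entries
        BEGIN SELECT RAISE(ABORT, 'ledger is append-only'); END;
    """)
    return conn


def post_transaction(conn, txn_id, legs):
    """Append one balanced transaction; legs is a list of (account, amount_cents).

    Positive amounts are debits and negative amounts are credits, so a valid
    double-entry transaction sums to zero.
    """
    if len(legs) < 2 or sum(amount for _, amount in legs) != 0:
        raise ValueError(f"transaction {txn_id} does not balance")
    with conn:  # commits every leg together, or rolls all of them back
        conn.executemany(
            "INSERT INTO ledger_entries (txn_id, account, amount_cents) "
            "VALUES (?, ?, ?)",
            [(txn_id, account, amount) for account, amount in legs],
        )


def reverse_transaction(conn, txn_id, reversal_id):
    """Record a refund or reversal as new rows that negate the original legs."""
    rows = conn.execute(
        "SELECT account, amount_cents FROM ledger_entries "
        "WHERE txn_id = ? ORDER BY entry_id",
        (txn_id,),
    ).fetchall()
    if not rows:
        raise ValueError(f"unknown transaction {txn_id}")
    post_transaction(conn, reversal_id, [(acct, -amt) for acct, amt in rows])


def balance(conn, account):
    """Sum every entry ever posted to the account."""
    return conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM ledger_entries "
        "WHERE account = ?",
        (account,),
    ).fetchone()[0]


conn = open_ledger()
post_transaction(conn, "pay-1",
                 [("settlement_cash", 2500), ("merchant_payable", -2500)])
reverse_transaction(conn, "pay-1", "rev-1")  # adds two rows, edits nothing
```

After the reversal the table holds four rows and both account balances are back to zero, while the original payment remains visible in the history.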

The Aurora cluster runs Multi-AZ and replicates to a second Region through Aurora Global Database, on I/O-Optimized storage because the ledger is write-heavy. Encryption uses a customer managed multi-Region KMS key so the standby Region can decrypt the same data. Point-in-time recovery (PITR) plus snapshots retained for seven years meet the retention requirement.

Workflow Orchestration and Audit

Payment operations touch the ledger, the payment rail, sanctions screening, the merchant webhook, and analytics. They are orchestrated as Sagas in Step Functions Standard workflows, chosen for exactly-once workflow execution, runs that take several seconds, and an execution history that forms part of the audit record. A failed authorization triggers a compensating ledger reversal so the books match reality.
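The Saga pattern described above can be sketched in plain Python. This is not Step Functions code: the step names, their order, and the in-memory `ctx` dictionary are hypothetical, and the only point is to show completed steps being compensated in reverse order when a later step fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]
    compensate: Optional[Callable[[dict], None]] = None


def run_saga(steps, ctx):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    history = []
    completed = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as exc:
            history.append(f"{step.name}:failed:{exc}")
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate(ctx)
                    history.append(f"{done.name}:compensated")
            return history
        completed.append(step)
        history.append(f"{step.name}:succeeded")
    return history


# Hypothetical steps standing in for the real ledger, rail, and webhook calls.
def post_ledger_hold(ctx):
    ctx["ledger"].append(("hold", ctx["amount_cents"]))


def reverse_ledger_hold(ctx):
    # The compensation appends a reversal row; it never edits the hold.
    ctx["ledger"].append(("reversal", -ctx["amount_cents"]))


def authorize_on_rail(ctx):
    if ctx["amount_cents"] > ctx["card_limit_cents"]:
        raise RuntimeError("authorization declined")


def notify_merchant(ctx):
    ctx["webhooks"].append("payment.succeeded")


PAYMENT_SAGA = [
    SagaStep("post_ledger_hold", post_ledger_hold, reverse_ledger_hold),
    SagaStep("authorize_on_rail", authorize_on_rail),
    SagaStep("notify_merchant", notify_merchant),
]

ctx = {"amount_cents": 5000, "card_limit_cents": 1000,
       "ledger": [], "webhooks": []}
history = run_saga(PAYMENT_SAGA, ctx)
```

In this run the authorization is declined, so the ledger ends with a hold and a matching reversal, no webhook is sent, and `history` records each outcome in order, which is the role the execution history plays in the audit record.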

Audit is the harder-than-it-looks part: an organization-wide CloudTrail trail delivers to an S3 bucket with Object Lock in a separate audit account (compliance mode, seven-year retention), and a separate application-level audit stream records business events. AWS Config, Macie, GuardDuty, IAM Access Analyzer, IAM Identity Center, and service control policies (SCPs) round it out.
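As a rough illustration of the seven-year compliance-mode retention, the sketch below builds the default-retention rule in the shape that boto3's `put_object_lock_configuration` accepts. The function names and the bucket name passed by the caller are assumptions, and the real setup would more likely live in infrastructure code than in an ad hoc script.

```python
def object_lock_configuration(retention_years: int = 7) -> dict:
    """Build a default-retention rule for an S3 Object Lock bucket."""
    if retention_years < 1:
        raise ValueError("retention must be at least one year")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {
                # COMPLIANCE mode: no user, including root, can shorten
                # the retention or delete a locked object version.
                "Mode": "COMPLIANCE",
                "Years": retention_years,
            }
        },
    }


def apply_object_lock(bucket_name: str, retention_years: int = 7) -> None:
    """Apply the rule to a bucket in the audit account (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    s3 = boto3.client("s3")
    s3.put_object_lock_configuration(
        Bucket=bucket_name,
        ObjectLockConfiguration=object_lock_configuration(retention_years),
    )
```

Object Lock requires versioning on the bucket, so `apply_object_lock` only succeeds against a bucket that was set up for it.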

Disaster Recovery and Cost Posture

The 99.99% availability and 30-minute RTO targets drive a warm standby in a second Region. Aurora Global Database promotes the secondary in under a minute; its cross-Region replica trails the primary by about a second, so an unplanned failover can lose that last second of writes. Lambda and Step Functions are deployed identically in both Regions via CDK, Route 53 provides health-checked failover, DynamoDB Global Tables hold the small operational tables, and S3 Cross-Region Replication (CRR) copies objects. A Region-failover drill runs twice a year and has caught three real problems.
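A short calculation shows how tight these targets are. It assumes availability is measured over a calendar year and takes 3 million transactions a day as a stand-in for "a few million"; both figures are assumptions, and the write rate is a daily average, so peak-hour exposure would be higher.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes, ignoring leap years
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability)


def writes_at_risk(transactions_per_day: float, replica_lag_seconds: float) -> float:
    """Average number of transactions inside the replication-lag window."""
    return transactions_per_day / SECONDS_PER_DAY * replica_lag_seconds


budget = downtime_budget_minutes(0.9999)          # yearly allowance at 99.99%
left_after_failover = budget - 30                 # after one 30-minute failover
at_risk = writes_at_risk(3_000_000, 1.0)          # assumed volume, ~1 s of lag

print(f"budget={budget:.2f} min, left={left_after_failover:.2f} min, "
      f"at risk={at_risk:.1f} txns")
```

Under these assumptions the yearly budget is roughly 52.6 minutes, a single failover that uses the full 30-minute RTO leaves about 22.6 minutes for everything else, and about 35 transactions sit in the one-second lag window at average load.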

The bill looks different from a cost-optimized startup's: Aurora Multi-AZ plus Global Database accounts for roughly half of it, and Shield Advanced, paid WAF rule groups, CloudTrail data events, Macie, and GuardDuty add further lines. Compliance and reliability come first; cost-cutting that risks either is off the table.

Aurora vs DynamoDB for the ledger

Aurora PostgreSQL: chosen because the ledger needs joins for reporting, cross-row consistency for double-entry invariants, and SQL queries.

DynamoDB: the right call for many fintech workloads, but the ledger's relational shape and consistency needs fit Aurora better.

Aurora DSQL: evaluated for active-active multi-Region, but judged too new to satisfy the team's regulators; to be revisited later.
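To make the "joins and cross-row consistency" argument concrete, here is a small sketch using Python's built-in sqlite3 as a local stand-in for a relational engine. The two-table schema, account names, and amounts are invented for illustration. One query joins entries to accounts for a trial balance, and another checks the double-entry invariant across all rows of each transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE entries (
        entry_id     INTEGER PRIMARY KEY,
        txn_id       TEXT    NOT NULL,
        account_id   INTEGER NOT NULL REFERENCES accounts(account_id),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO accounts VALUES
        (1, 'merchant_payable'), (2, 'settlement_cash'), (3, 'fee_revenue');
    INSERT INTO entries (txn_id, account_id, amount_cents) VALUES
        ('pay-1', 2, 10000), ('pay-1', 1, -9700), ('pay-1', 3, -300),
        ('pay-2', 2,  5000), ('pay-2', 1, -4850), ('pay-2', 3, -150);
""")


def trial_balance(conn):
    """Reporting query: per-account totals, joined to the account names."""
    return conn.execute("""
        SELECT a.name, SUM(e.amount_cents)
        FROM accounts AS a
        JOIN entries AS e ON e.account_id = a.account_id
        GROUP BY a.account_id, a.name
        ORDER BY a.name
    """).fetchall()


def unbalanced_transactions(conn):
    """Cross-row invariant: every transaction's legs must sum to zero."""
    rows = conn.execute("""
        SELECT txn_id FROM entries
        GROUP BY txn_id
        HAVING SUM(amount_cents) <> 0
    """).fetchall()
    return [txn_id for (txn_id,) in rows]


report = trial_balance(conn)
violations = unbalanced_transactions(conn)
```

Both questions are single SQL statements here; in a key-value store they would typically need application-side aggregation or carefully designed access patterns.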

Common Mistakes
  • Putting a double-entry ledger in DynamoDB when it needs joins, cross-row consistency, and SQL — Aurora is the right fit here.
  • Running the ledger single-Region when the targets are a seconds-level RPO and 99.99% availability through a full-Region outage.
  • Keeping audit logs mutable or in the same account as the data instead of S3 Object Lock in a separate audit account.
  • Using Step Functions Express for ledger workflows where exactly-once and a durable audit-grade execution history are required.
  • Building multi-Region DR and never drilling failover — the drills are what make it actually work.
  • Aggressively cost-cutting in ways that weaken the compliance or reliability posture.

Best Practices
  • Keep the ledger in Aurora PostgreSQL, append-only, behind stored procedures enforcing invariants at SERIALIZABLE isolation.
  • Orchestrate payment Sagas with Step Functions Standard, with compensating actions on failure.
  • Run an organization-wide CloudTrail to an S3 Object Lock bucket in a separate audit account, plus an application audit stream.
  • Use Aurora Global Database plus Route 53 health-checked failover for sub-minute database promotion within the 30-minute RTO, and drill it regularly.
  • Use a customer managed multi-Region KMS key so the standby Region can decrypt the same data.
  • Accept compliance-driven cost lines (Shield Advanced, paid WAF, CloudTrail data events) as the cost of the regulatory posture.

Comparable services
  • GCP: Spanner/Cloud SQL + Workflows + Cloud Audit Logs + Cloud Armor
  • Azure: Azure SQL + Logic Apps + immutable Blob + Azure WAF/DDoS

Knowledge Check

Why is the ledger built on Aurora PostgreSQL rather than DynamoDB?

  • It needs joins for reporting, cross-row consistency for double-entry invariants, and SQL — a relational fit
  • DynamoDB is fundamentally incapable of storing monetary or financial records, so any regulated double-entry ledger has to run on a relational engine like Aurora by default
  • Aurora is the lowest-cost database for a multi-Region ledger
  • DynamoDB offers no at-rest encryption for ledger entries

How is the immutable audit trail implemented?

  • Organization-wide CloudTrail to an S3 Object Lock (Compliance Mode) bucket in a separate audit account
  • A regular updatable audit table living inside the production ledger database, written by the same services and editable by the same operators it is meant to hold accountable
  • CloudWatch Logs with a one-day retention window on the events
  • Plain log files written to the local disk of each instance

Why does the team use Step Functions Standard (not Express) for payment workflows?

  • Exactly-once execution matters for the ledger, runs can take several seconds, and the execution history is part of the audit record
  • Express bills more per execution than Standard at the ledger's volume
  • Standard is the only Step Functions workflow type able to invoke a Lambda function, since Express state machines cannot integrate with Lambda and so cannot run any of the payment steps
  • Express has no retry mechanism for a failed payment step

What makes the multi-Region DR actually work?

  • Regular Region-failover drills — they caught three real problems that would have caused outages
  • Enabling Aurora Global Database once and trusting it to just work
  • Holding the entire standby Region completely powered off and unprovisioned until a disaster strikes, then cold-starting every component only at failover time
  • A single Availability Zone backed by frequent automated snapshots
