Data Analytics Pipeline
A composite connected-devices company: telemetry from a few hundred thousand IoT devices, ~50,000 events/sec peak (bursting to 100k), serving three audiences — internal analysts, customer-facing real-time dashboards, and ML training. Twelve engineers across data, platform, and product analytics.
The three audiences pull in different directions: real-time for dashboards, cost-efficient batch for internal analytics, and large historical scans for ML — with cost-per-event as a first-class metric and strict multi-tenant isolation.
Ingestion and the Data Lake
Devices connect through IoT Core (per-device X.509 auth). IoT Rules fan to two paths: Data Firehose buffers and lands raw events in S3 as Parquet, and Kinesis Data Streams (50 shards) carries the real-time path. S3 is the canonical store — every query engine is an interchangeable head, so storage is paid for exactly once.
The lake has a raw zone (partitioned, tiered Standard → IA → Glacier by age) and a curated zone in Apache Iceberg (schema evolution, time travel, efficient MERGE). The Glue Data Catalog is metadata only — shared table and schema definitions, not ETL. Lake Formation enforces row- and column-level tenant isolation on the lake tables.
Batch and Real-Time Paths
Batch: Glue ETL jobs (5-minute to nightly) read the raw zone, validate, deduplicate, and write curated Iceberg tables; EMR Serverless handles the heaviest jobs. Real-time: Amazon Managed Service for Apache Flink consumes Kinesis with stateful, checkpointed processing, writing per-device aggregates to Timestream for the customer dashboard and enriched events to the curated Iceberg zone.
Flink — not Lambda — is the stream processor at 50k events/sec: it handles checkpointing, exactly-once with compatible sources and transactional sinks, and stateful windowed aggregation as first-class features. Lambda fits lower-throughput streams or simple per-record enrichment.
Query Surfaces and Cost
Five query patterns over four engines: Athena for ad-hoc analyst queries, Redshift Serverless for BI dashboards (loaded nightly), Timestream for LiveAnalytics for the customer real-time view, and SageMaker reading curated Iceberg tables for ML — no data copy.
Cost-per-event drives design: aggregate in Flink before writing to expensive stores (Timestream costs scale with dashboard queries, not raw events), use Parquet and partitioning to cut Athena scan bytes, and tier S3 by access pattern. The team runs a monthly cost-per-event review.
Managed Service for Apache Flink — the right default at sustained 50k events/sec — checkpointing, exactly-once, and stateful windowed aggregation built in.
Lambda on Kinesis — fine for lower-volume streams or simple per-record enrichment; needs careful tuning and struggles with stateful windows at scale.
S3 as canonical store — pay for storage once; Athena, Redshift, Timestream, and SageMaker are interchangeable query heads.
- Using Lambda-on-Kinesis for stateful windowed aggregation at 50k events/sec, where Flink is the safer default.
- Writing raw events into Timestream instead of aggregating in Flink first, so cost scales with raw volume not dashboard queries.
- Starting the curated zone in plain Parquet instead of Iceberg, then rewriting custom schema-evolution and MERGE code later.
- Retrofitting per-customer cost attribution and partitioning months after launch instead of designing for it on day one.
- Running unbounded
SELECT *Athena queries over a year of data instead of partitioning and using Parquet. - Keeping all data hot in S3 Standard instead of tiering to IA and Glacier by access pattern.
- Make S3 the canonical store and treat each query engine as an interchangeable head.
- Use Apache Iceberg in the curated zone for schema evolution, time travel, and MERGE.
- Use Managed Service for Apache Flink plus Timestream for customer-facing real-time at IoT scale.
- Enforce multi-tenant isolation at every layer — IoT Core auth, Lake Formation row filters, Timestream dimension filters.
- Aggregate before ingesting into expensive stores, and partition and compress in S3.
- Track cost-per-event monthly and tier storage by access pattern.
Knowledge Check
Why is Apache Flink (Managed Service for Apache Flink) chosen over Lambda for the stream processor?
- At sustained 50k events/sec it handles checkpointing, exactly-once, and stateful windowed aggregation as first-class features
- Lambda has no event-source mapping or polling mechanism for reading records off a Kinesis Data Stream shard, so it cannot participate in the real-time path at all
- Flink always costs less per million events than Lambda at any volume
- Lambda is blocked from writing processed output objects into an S3 bucket
What is the role of S3 in this pipeline?
- The canonical store, paid for once, with Athena, Redshift, Timestream, and SageMaker as interchangeable query heads
- A transient scratch buffer that each query engine fills on demand and that is purged automatically the moment an Athena query finishes scanning it, holding nothing between runs
- The low-latency database powering the customer real-time dashboard reads
- A cold archive used only for nightly backups of the curated tables
How does the team keep Timestream cost under control at IoT scale?
- Aggregating events in Flink before writing, so cost scales with dashboard query volume rather than raw event volume
- Writing every raw event straight into the memory store unaggregated and then capping cost by simply running far fewer dashboard queries against it each day
- Turning off the in-memory store so recent data falls through to magnetic
- Reserving Timestream purely as a backup target for the S3 curated zone
What does the Glue Data Catalog do in this architecture?
- Holds shared table and schema metadata for the query engines — it does not run ETL
- Executes the scheduled batch ETL jobs itself, validating, deduplicating, and transforming raw events into the curated Iceberg tables on its own compute
- Physically stores the raw event records as Parquet files on its own disk
- Serves the customer-facing dashboard with sub-second metric lookups
You got correct