Service 68

Data Analytics Pipeline

Case Study · Analytics · Architecture

A composite connected-devices company: telemetry from a few hundred thousand IoT devices, ~50,000 events/sec peak (bursting to 100k), serving three audiences — internal analysts, customer-facing real-time dashboards, and ML training. Twelve engineers across data, platform, and product analytics.

The three audiences pull in different directions: real-time for dashboards, cost-efficient batch for internal analytics, and large historical scans for ML — with cost-per-event as a first-class metric and strict multi-tenant isolation.

Ingestion and the Data Lake

Devices connect through IoT Core (per-device X.509 auth). IoT Rules fan to two paths: Data Firehose buffers and lands raw events in S3 as Parquet, and Kinesis Data Streams (50 shards) carries the real-time path. S3 is the canonical store — every query engine is an interchangeable head, so storage is paid for exactly once.
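The 50-shard figure lines up with the provisioned-mode write limit of Kinesis Data Streams, which is 1,000 records per second or 1 MB per second per shard, whichever is hit first. The sketch below is a rough sizing check under stated assumptions: the 500-byte average event size is hypothetical, and `shards_needed` is an illustrative helper, not part of any AWS SDK.

```python
import math

# Provisioned-mode write limits per shard for Kinesis Data Streams.
RECORDS_PER_SHARD_PER_SEC = 1_000
# 1 MB/s, treated here as 10**6 bytes, which errs on the conservative side.
BYTES_PER_SHARD_PER_SEC = 1_000_000


def shards_needed(events_per_sec: int, avg_event_bytes: int) -> int:
    """Return the shard count needed to absorb a given write rate."""
    by_records = math.ceil(events_per_sec / RECORDS_PER_SHARD_PER_SEC)
    by_bytes = math.ceil(events_per_sec * avg_event_bytes / BYTES_PER_SHARD_PER_SEC)
    return max(by_records, by_bytes, 1)


AVG_EVENT_BYTES = 500  # assumption: the case study does not state an event size

peak = shards_needed(50_000, AVG_EVENT_BYTES)    # record limit dominates: 50
burst = shards_needed(100_000, AVG_EVENT_BYTES)  # record limit dominates: 100
```

With one record per event, 50 shards cover the 50k events/sec peak but not a 100k burst. Absorbing the burst would take more shards, on-demand capacity mode, or batching several events into each record on the producer side; the case study does not say which approach the team uses.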

The lake has a raw zone (partitioned, tiered Standard → IA → Glacier by age) and a curated zone in Apache Iceberg (schema evolution, time travel, efficient MERGE). The Glue Data Catalog is metadata only — shared table and schema definitions, not ETL. Lake Formation enforces row- and column-level tenant isolation on the lake tables.
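As a sketch of the raw-zone tiering, the function below builds a lifecycle configuration in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`. The `raw/` prefix, the rule ID, and the 30- and 90-day thresholds are assumptions for illustration; the case study only says data is tiered by age.

```python
def raw_zone_lifecycle(prefix: str = "raw/",
                       ia_after_days: int = 30,
                       glacier_after_days: int = 90) -> dict:
    """Build an S3 lifecycle configuration: Standard -> Standard-IA -> Glacier."""
    # S3 only allows a transition to Standard-IA once an object is 30 days old.
    if ia_after_days < 30:
        raise ValueError("Standard-IA transition requires at least 30 days")
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "raw-zone-tiering",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = raw_zone_lifecycle()
# Applying it would look like this (bucket name is a placeholder):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-telemetry-lake", LifecycleConfiguration=config)
```

Scoping the rule to the raw prefix keeps the curated Iceberg tables out of the archive tiers, where query engines could not read them directly.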

Batch and Real-Time Paths

Batch: Glue ETL jobs (5-minute to nightly) read the raw zone, validate, deduplicate, and write curated Iceberg tables; EMR Serverless handles the heaviest jobs. Real-time: Amazon Managed Service for Apache Flink consumes Kinesis with stateful, checkpointed processing, writing per-device aggregates to Timestream for the customer dashboard and enriched events to the curated Iceberg zone.
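The validate-and-deduplicate step of the batch path can be illustrated in plain Python. This is a minimal sketch of the logic only: a real Glue job would express it in Spark over the raw zone, and the field names (`event_id`, `tenant_id`, `device_id`, `ts`, `value`) are a hypothetical schema, not one given in the case study.

```python
from typing import Iterable, Iterator

# Hypothetical minimum schema for a telemetry event.
REQUIRED_FIELDS = ("event_id", "tenant_id", "device_id", "ts", "value")


def validate_and_dedupe(events: Iterable[dict]) -> Iterator[dict]:
    """Drop events with missing fields, then keep the first copy of each event."""
    seen = set()
    for event in events:
        # Validation: every required field must be present and non-null.
        if any(event.get(field) is None for field in REQUIRED_FIELDS):
            continue
        # Deduplication: event IDs are only assumed unique within a tenant.
        key = (event["tenant_id"], event["event_id"])
        if key in seen:
            continue
        seen.add(key)
        yield event


raw = [
    {"event_id": "e1", "tenant_id": "t1", "device_id": "d1", "ts": 0, "value": 1.0},
    {"event_id": "e1", "tenant_id": "t1", "device_id": "d1", "ts": 0, "value": 1.0},
    {"event_id": "e2", "tenant_id": "t1", "device_id": "d1", "ts": 5},  # no value
]
curated = list(validate_and_dedupe(raw))  # keeps only the first e1 record
```

Duplicates are expected on this path because device retries and at-least-once delivery upstream can land the same event more than once.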

Flink — not Lambda — is the stream processor at 50k events/sec: it handles checkpointing, exactly-once with compatible sources and transactional sinks, and stateful windowed aggregation as first-class features. Lambda fits lower-throughput streams or simple per-record enrichment.
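To make "stateful windowed aggregation" concrete, the sketch below computes per-device tumbling-window aggregates in plain Python. It is not Flink API code; it only shows the keyed state that the stream processor has to maintain. The 60-second window and the `device_id`, `ts`, and `value` field names are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumption: the case study does not state a window size


def tumbling_window_aggregates(events, window_seconds=WINDOW_SECONDS):
    """Aggregate events into fixed, non-overlapping windows per device."""
    # Keyed state: one accumulator per (device, window start).
    state = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for event in events:
        # Align the event timestamp (epoch seconds) to its window start.
        window_start = int(event["ts"]) // window_seconds * window_seconds
        acc = state[(event["device_id"], window_start)]
        acc["count"] += 1
        acc["sum"] += event["value"]
        acc["max"] = max(acc["max"], event["value"])
    return [
        {
            "device_id": device_id,
            "window_start": window_start,
            "count": acc["count"],
            "avg": acc["sum"] / acc["count"],
            "max": acc["max"],
        }
        for (device_id, window_start), acc in sorted(state.items())
    ]


rows = tumbling_window_aggregates([
    {"device_id": "d1", "ts": 0, "value": 1.0},
    {"device_id": "d1", "ts": 30, "value": 3.0},
    {"device_id": "d1", "ts": 61, "value": 5.0},
])
# rows[0] covers window 0 (count 2, avg 2.0); rows[1] covers window 60 (count 1).
```

The `state` dictionary is the part that is hard to run at scale: it must survive restarts, be partitioned by key across workers, and be closed out correctly when events arrive late or out of order. Flink supplies that through checkpointed keyed state and event-time windows, whereas a Lambda consumer would have to rebuild it with an external store.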

Query Surfaces and Cost

Four query patterns, one engine each: Athena for ad-hoc analyst queries, Redshift Serverless for BI dashboards (loaded nightly), Timestream for LiveAnalytics for the customer real-time view, and SageMaker reading the curated Iceberg tables directly for ML, with no data copy.

Cost-per-event drives design: aggregate in Flink before writing to expensive stores (so that Timestream cost scales with dashboard query volume rather than raw event volume), use Parquet and partitioning to cut the bytes Athena scans, and tier S3 by access pattern. The team runs a monthly cost-per-event review.
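The arithmetic behind "aggregate before writing" and the monthly review can be sketched as follows. Only the 50,000 events/sec peak comes from the case study; the 300,000-device fleet size, the 60-second window, and the cost and volume figures are hypothetical placeholders.

```python
def aggregated_writes_per_sec(device_count: int, window_seconds: int) -> float:
    """Write rate when each device emits one aggregate row per window."""
    return device_count / window_seconds


def cost_per_million_events(monthly_cost_usd: float, events_in_month: int) -> float:
    """Normalise a monthly bill to dollars per million ingested events."""
    return monthly_cost_usd * 1_000_000 / events_in_month


RAW_EVENTS_PER_SEC = 50_000  # peak rate from the case study
DEVICE_COUNT = 300_000       # assumption: "a few hundred thousand" devices
WINDOW_SECONDS = 60          # assumption: one aggregate per device per minute

writes = aggregated_writes_per_sec(DEVICE_COUNT, WINDOW_SECONDS)  # 5000.0
reduction = RAW_EVENTS_PER_SEC / writes                           # 10.0

# Hypothetical month: a 12,000 USD pipeline bill over 60 billion events.
unit_cost = cost_per_million_events(12_000, 60_000_000_000)
```

Under these assumptions the real-time store receives a tenth of the raw write volume, and the review has a single number per month to track. Computing the same ratio per engine or per tenant would show where the spend concentrates.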

Flink vs Lambda for the stream processor

Managed Service for Apache Flink: the right default at sustained 50k events/sec, with checkpointing, exactly-once, and stateful windowed aggregation built in.

Lambda on Kinesis: fine for lower-volume streams or simple per-record enrichment; needs careful tuning and struggles with stateful windows at scale.

S3 as canonical store: pay for storage once; Athena, Redshift, Timestream, and SageMaker are interchangeable query heads.

Common Mistakes

Using Lambda-on-Kinesis for stateful windowed aggregation at 50k events/sec, where Flink is the safer default.
Writing raw events into Timestream instead of aggregating in Flink first, so cost scales with raw volume not dashboard queries.
Starting the curated zone in plain Parquet instead of Iceberg, then rewriting custom schema-evolution and MERGE code later.
Retrofitting per-customer cost attribution and partitioning months after launch instead of designing for it on day one.
Running unbounded SELECT * Athena queries over a year of data instead of partitioning and using Parquet.
Keeping all data hot in S3 Standard instead of tiering to IA and Glacier by access pattern.
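As an illustration of the Athena item above, the sketch below builds a query that names its columns and bounds the scan by partition. The table `curated.device_events`, its columns, and the `dt` date partition are hypothetical; the point is the shape of the query, not the schema.

```python
from datetime import date


def device_daily_query(tenant_id: str, start: date, end: date) -> str:
    """Build a partition-pruned Athena query for one tenant and date range."""
    # Tenant IDs are assumed to be alphanumeric; rejecting anything else keeps
    # the value safe to inline. Parameterised queries are the better option.
    if not tenant_id.isalnum():
        raise ValueError("tenant_id must be alphanumeric")
    if start > end:
        raise ValueError("start must not be after end")
    return (
        "SELECT device_id, dt, avg(value) AS avg_value\n"
        "FROM curated.device_events\n"
        f"WHERE tenant_id = '{tenant_id}'\n"
        f"  AND dt BETWEEN DATE '{start.isoformat()}' AND DATE '{end.isoformat()}'\n"
        "GROUP BY device_id, dt"
    )


sql = device_daily_query("acme", date(2024, 1, 1), date(2024, 1, 7))
```

Selecting three columns lets Parquet skip the rest of each file, and the `dt` predicate limits the scan to seven days of partitions instead of a full year.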

Best Practices

Make S3 the canonical store and treat each query engine as an interchangeable head.
Use Apache Iceberg in the curated zone for schema evolution, time travel, and MERGE.
Use Managed Service for Apache Flink plus Timestream for customer-facing real-time at IoT scale.
Enforce multi-tenant isolation at every layer — IoT Core auth, Lake Formation row filters, Timestream dimension filters.
Aggregate before ingesting into expensive stores, and partition and compress in S3.
Track cost-per-event monthly and tier storage by access pattern.
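For the Lake Formation layer of tenant isolation, a per-tenant row filter might be described as below. This is a sketch under assumptions: the dictionary follows the `TableData` shape of Lake Formation's `create_data_cells_filter` request as understood here and should be checked against the current API reference, and the `tenant_id` column, the filter naming scheme, and the example identifiers are hypothetical.

```python
def tenant_row_filter(catalog_id: str, database: str, table: str,
                      tenant_id: str) -> dict:
    """Describe a data cells filter exposing only one tenant's rows."""
    # Tenant IDs are assumed alphanumeric so the expression cannot be altered.
    if not tenant_id.isalnum():
        raise ValueError("tenant_id must be alphanumeric")
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": f"tenant-{tenant_id}",  # hypothetical naming scheme
        "RowFilter": {"FilterExpression": f"tenant_id = '{tenant_id}'"},
        "ColumnWildcard": {},           # all columns; narrow this to hide fields
    }


table_data = tenant_row_filter("123456789012", "curated", "device_events", "acme")
# Creating the filter would look like this (not executed here):
# boto3.client("lakeformation").create_data_cells_filter(TableData=table_data)
```

One filter per tenant, granted to that tenant's role, means engines that honour Lake Formation permissions apply the same row restriction without each query having to remember a `WHERE` clause.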

Comparable services

GCP: Pub/Sub + Dataflow + BigQuery + Looker
Azure: Event Hubs + Stream Analytics + Synapse + Power BI

Knowledge Check

Why is Apache Flink (Managed Service for Apache Flink) chosen over Lambda for the stream processor?

At sustained 50k events/sec it handles checkpointing, exactly-once, and stateful windowed aggregation as first-class features
Lambda has no event-source mapping or polling mechanism for reading records off a Kinesis Data Stream shard, so it cannot participate in the real-time path at all
Flink always costs less per million events than Lambda at any volume
Lambda is blocked from writing processed output objects into an S3 bucket

What is the role of S3 in this pipeline?

The canonical store, paid for once, with Athena, Redshift, Timestream, and SageMaker as interchangeable query heads
A transient scratch buffer that each query engine fills on demand and that is purged automatically the moment an Athena query finishes scanning it, holding nothing between runs
The low-latency database powering the customer real-time dashboard reads
A cold archive used only for nightly backups of the curated tables

How does the team keep Timestream cost under control at IoT scale?

Aggregating events in Flink before writing, so cost scales with dashboard query volume rather than raw event volume
Writing every raw event straight into the memory store unaggregated and then capping cost by simply running far fewer dashboard queries against it each day
Turning off the in-memory store so recent data falls through to magnetic
Reserving Timestream purely as a backup target for the S3 curated zone

What does the Glue Data Catalog do in this architecture?

Holds shared table and schema metadata for the query engines — it does not run ETL
Executes the scheduled batch ETL jobs itself, validating, deduplicating, and transforming raw events into the curated Iceberg tables on its own compute
Physically stores the raw event records as Parquet files on its own disk
Serves the customer-facing dashboard with sub-second metric lookups
