Analytics Pipeline
Consider an analytics platform that must ingest high-volume event data — telemetry, clickstreams, IoT — land it durably, transform it, and serve both real-time dashboards and large historical queries. The requirements are high-throughput ingestion, a durable raw store, scalable transformation, and a clean separation of streaming from batch.
The architecture is the standard Azure analytics flow: Event Hubs ingests the stream, Data Lake Storage holds the raw and curated data, Synapse or Databricks transforms and queries it, and Power BI serves the results. The recurring decision is where streaming gives way to batch — real-time for what must be immediate, batch for what can wait and costs far less.
Ingestion
Event Hubs accepts the high-volume stream — millions of events per second — into a replayable log that multiple consumers read independently. Its Capture feature writes the raw stream directly to Data Lake Storage with no code, giving an immutable landing zone for the data while real-time consumers process the same stream in parallel.
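The "replayable log that multiple consumers read independently" idea can be illustrated with a small in-memory model. This is only a conceptual sketch, not the Event Hubs SDK: `ReplayableLog`, `read`, and `rewind` are hypothetical names, and a real event hub keeps events only for its configured retention period and spreads them across partitions.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayableLog:
    """Toy stand-in for a single partition: an append-only list of events."""
    events: list = field(default_factory=list)
    # Each consumer group keeps its own read position (offset) in the log.
    offsets: dict = field(default_factory=dict)

    def append(self, event: str) -> None:
        self.events.append(event)

    def read(self, consumer_group: str, max_events: int = 10) -> list:
        """Return the next batch for this consumer group and advance its offset."""
        start = self.offsets.get(consumer_group, 0)
        batch = self.events[start:start + max_events]
        self.offsets[consumer_group] = start + len(batch)
        return batch

    def rewind(self, consumer_group: str, offset: int = 0) -> None:
        """Replay: move one consumer group back without affecting the others."""
        self.offsets[consumer_group] = offset


log = ReplayableLog()
for i in range(5):
    log.append(f"event-{i}")

dashboard_batch = log.read("dashboard", max_events=2)  # real-time consumer
archive_batch = log.read("archiver", max_events=5)     # archival consumer
log.rewind("dashboard")                                # replay from the start
replayed_batch = log.read("dashboard", max_events=5)
```

Reading does not remove events, so the archival consumer still sees the full stream after the dashboard consumer has read part of it, and rewinding one group leaves the other group's offset untouched.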
Storage and the Lake
Data Lake Storage (Blob Storage with the hierarchical namespace enabled) is the backbone. It holds raw landing data, curated and partitioned datasets, and serving tables, with the curated and serving layers stored as columnar Parquet or Delta (Capture lands the raw stream as Avro by default). Keeping storage separate from compute lets the transformation and query engines scale independently, and lifecycle rules age raw data to cheaper tiers as it cools.
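As a rough illustration of partitioning and lifecycle tiering, the sketch below builds a date-partitioned folder path and picks an access tier from the age of the data. The Hive-style `year=/month=/day=` layout, the zone and dataset names, and both day thresholds are assumptions made for the example. Real lifecycle rules are configured on the storage account rather than written in application code.

```python
from datetime import date

# Hypothetical thresholds; choose values that match how quickly your data cools.
COOL_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 180


def partition_path(zone: str, dataset: str, day: date) -> str:
    """Build a Hive-style, date-partitioned folder path (an assumed layout)."""
    return f"{zone}/{dataset}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"


def target_tier(last_modified: date, today: date) -> str:
    """Pick an access tier from the age of the data, mimicking a lifecycle rule."""
    age_days = (today - last_modified).days
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "Archive"
    if age_days >= COOL_AFTER_DAYS:
        return "Cool"
    return "Hot"


today = date(2024, 7, 1)
path = partition_path("curated", "clickstream", date(2024, 1, 15))
tiers = [
    target_tier(d, today)
    for d in (date(2024, 6, 20), date(2024, 5, 1), date(2023, 12, 1))
]
```

Date-based folders let a query engine skip whole partitions that fall outside a filter, and the same dates give lifecycle rules a natural notion of data age.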
Transformation and Query
Synapse (serverless or dedicated SQL pools, and Spark) or Azure Databricks transforms raw data into curated datasets and answers analytical queries. Serverless SQL and Spark read the lake in place, so the same data serves exploration, scheduled transformation, and reporting without copies. Data Factory (or Synapse pipelines) orchestrates the scheduled movement and transformation. For a green-field analytics estate, evaluate Microsoft Fabric — the next-generation, SaaS-unified platform that succeeds Synapse and folds these pieces together over OneLake.
Streaming vs Batch
The defining choice is per-use-case: a live operations dashboard needs stream processing (Stream Analytics or Spark Structured Streaming) reading Event Hubs directly; a daily revenue report is a batch job over the lake that costs a fraction of the streaming path. Forcing everything to real time is expensive and usually unnecessary — most analytics tolerate batch latency, and the savings are large.
Streaming — Process events as they arrive (Stream Analytics, Spark Structured Streaming). Choose it only for use cases that genuinely need real-time results.
Batch — Process accumulated data on a schedule over the lake. Far cheaper, and sufficient for most analytics that tolerate latency.
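The per-use-case decision can be sketched as a simple routing rule driven by how stale a result is allowed to be. The `UseCase` type, the `choose_path` helper, and the 15-minute cut-off are all hypothetical. In practice the threshold depends on how often batch jobs can run and what the streaming path costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    max_latency_seconds: int  # how stale the result may be before it loses value


# Hypothetical cut-off: anything that can wait at least this long goes to batch.
BATCH_LATENCY_FLOOR_SECONDS = 15 * 60


def choose_path(use_case: UseCase) -> str:
    """Route to streaming only when the latency budget is tighter than batch can meet."""
    if use_case.max_latency_seconds < BATCH_LATENCY_FLOOR_SECONDS:
        return "streaming"
    return "batch"


use_cases = [
    UseCase("live operations dashboard", max_latency_seconds=5),
    UseCase("daily revenue report", max_latency_seconds=24 * 60 * 60),
]
routing = {uc.name: choose_path(uc) for uc in use_cases}
```

Making the latency budget explicit for each use case keeps streaming the exception that must be justified, rather than the default.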
Common Mistakes
- Forcing every use case to real-time streaming when most analytics tolerate batch latency at a fraction of the cost.
- Building a custom archival path when Event Hubs Capture would land the raw stream in the lake automatically.
- Storing data as row-oriented CSV instead of partitioned columnar Parquet, scanning far more bytes than queries need.
- Coupling storage and compute so they cannot scale independently, or copying data between engines instead of reading the lake in place.
- Leaving a dedicated Synapse SQL pool running idle between jobs instead of pausing it or using serverless.
- Never tiering raw landing data, so the lake grows on the Hot tier without bound.
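The cost of row-oriented, unpartitioned storage can be approximated with simple arithmetic. The estimate below assumes equally sized partitions and columns and ignores compression, so it gives only an order-of-magnitude picture. The 1 TB dataset, 365 daily partitions, and 40 columns are illustrative figures, not measurements.

```python
def bytes_scanned_row_oriented(total_bytes: int) -> int:
    """A row-oriented file with no partitioning is read in full for any query."""
    return total_bytes


def bytes_scanned_columnar(total_bytes: int, partitions_total: int, partitions_needed: int,
                           columns_total: int, columns_needed: int) -> float:
    """Rough estimate: partition pruning and column projection each cut the scan
    proportionally (assumes uniform partition and column sizes, no compression)."""
    partition_fraction = partitions_needed / partitions_total
    column_fraction = columns_needed / columns_total
    return total_bytes * partition_fraction * column_fraction


TOTAL = 1_000_000_000_000  # 1 TB of events, an illustrative figure

csv_scan = bytes_scanned_row_oriented(TOTAL)
# One week out of a year of daily partitions, 4 of 40 columns.
parquet_scan = bytes_scanned_columnar(TOTAL, partitions_total=365, partitions_needed=7,
                                      columns_total=40, columns_needed=4)
reduction_factor = csv_scan / parquet_scan
```

Under these assumptions the query over partitioned columnar data reads roughly 1/520 of the bytes, which matters directly when an engine bills per terabyte scanned.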
Best Practices
- Ingest with Event Hubs and use Capture to land the raw stream in Data Lake Storage automatically.
- Keep the lake as the backbone — partitioned Parquet/Delta, storage separate from compute, lifecycle tiering for cold data.
- Transform and query with Synapse or Databricks reading the lake in place; orchestrate with Data Factory.
- Use streaming only where real-time results are genuinely required; use batch for everything else.
- Pause dedicated Synapse pools when idle or use serverless SQL to pay per query.
- Serve curated results to Power BI rather than querying raw data repeatedly.
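The advice to pause dedicated pools when idle comes down to simple arithmetic on running hours. The hourly rate and the three-hour nightly job window below are hypothetical placeholders, not Azure prices. Note also that storage is still billed while compute is paused.

```python
def monthly_pool_cost(hourly_rate: float, hours_running: float) -> float:
    """Dedicated compute is billed for the time it runs, whether or not queries execute."""
    return hourly_rate * hours_running


HOURLY_RATE = 10.0        # hypothetical price per hour, not a real Azure rate
DAYS_IN_MONTH = 30
JOB_HOURS_PER_DAY = 3     # assumed nightly batch window

always_on = monthly_pool_cost(HOURLY_RATE, DAYS_IN_MONTH * 24)
paused_when_idle = monthly_pool_cost(HOURLY_RATE, DAYS_IN_MONTH * JOB_HOURS_PER_DAY)
savings_fraction = 1 - paused_when_idle / always_on
```

With these placeholder numbers, pausing outside the job window removes 87.5% of the compute bill. The exact figure depends entirely on the real rate and schedule.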
Knowledge Check
What is the defining cost-and-design decision in an analytics pipeline?
- Where streaming gives way to batch — real-time only where needed, batch for the rest at a fraction of the cost
- Which VM size and disk tier to run the Power BI dashboard on
- Whether to use a relational SQL database or a NoSQL document store as the front-line buffer for stream ingestion before processing
- How many paired regions to deploy the data lake across
How does Event Hubs Capture help the pipeline?
- It writes the raw incoming stream to Data Lake Storage automatically, giving an immutable landing zone with no custom code
- It transforms the raw events into partitioned columnar Parquet and loads them straight into refreshed Power BI datasets for dashboards
- It runs the scheduled batch transformation jobs across the lake
- It replaces the need for a separate data lake entirely
Why store lake data as partitioned columnar Parquet?
- Queries scan far fewer bytes, cutting both query time and per-terabyte cost
- It is the only file format Event Hubs Capture is able to write to the lake
- It enables real-time streaming over the lake by default
- It removes the need to separate storage from compute