Service 22

Azure Synapse Analytics

AnalyticsWarehouse

Azure Synapse Analytics is an analytics platform for querying data at warehouse scale — dedicated and serverless SQL pools, Apache Spark pools, and data-integration pipelines over a data lake. It is built for scanning and aggregating billions of rows, not for the single-row reads and writes of an operational database.

Using Synapse for transactional workloads is the classic category error: a warehouse is optimized for large scans and columnar aggregation, and it will be slow and expensive for the point lookups an OLTP database does in milliseconds. Microsoft Fabric is the strategic direction for new analytics, but Synapse remains widely deployed and the right lens for understanding Azure's warehouse model.

Dedicated SQL Pools

A dedicated SQL pool (formerly SQL Data Warehouse) is a provisioned, massively parallel data warehouse that distributes data and query processing across compute nodes. You pay for the provisioned size whether or not it is querying, so a pool left running idle burns money — pause it when not in use, and size it to the workload.

Serverless SQL Pools

A serverless SQL pool queries data in the lake in place, billed per terabyte scanned with no infrastructure to provision. It is ideal for ad-hoc exploration and querying raw files without loading them first. The cost model rewards scanning less — partitioning and columnar formats like Parquet cut both the bytes scanned and the bill.

Spark Pools

Apache Spark pools run distributed data engineering and machine learning over the lake — transformations, big-data processing, and notebooks. Spark and SQL pools share the same workspace and data, so a pipeline can land data with Spark and serve it with SQL without copying it between systems.

Pipelines and Integration

Synapse pipelines (the same engine as Azure Data Factory) orchestrate ingestion and transformation — moving data from sources, scheduling jobs, and chaining steps. With Data Lake Storage underneath, Synapse keeps storage and compute separate, so the lake holds the data and the pools scale independently over it.

Dedicated vs serverless SQL pool

Dedicated SQL pool — Provisioned MPP warehouse billed for its size. Choose it for steady, high-concurrency warehouse workloads — and pause it when idle.

Serverless SQL pool — Query files in the lake in place, billed per terabyte scanned. Choose it for ad-hoc exploration and querying raw data without loading.

Common Mistakes

Using Synapse for transactional, single-row workloads — a warehouse is slow and expensive at the point lookups an OLTP database does in milliseconds.
Leaving a dedicated SQL pool running idle, paying for provisioned compute that is not querying anything.
Querying raw, unpartitioned data with serverless pools and being surprised by the bill — cost is per terabyte scanned.
Storing data as row-oriented CSV instead of columnar Parquet, scanning far more bytes than the query needs.
Copying data between Spark and SQL pools instead of sharing it in the lake they both read.
Sizing a dedicated pool for peak and never adjusting it, paying peak rates during quiet periods.

Best Practices

Use Synapse for analytical scans and aggregation; keep transactional workloads on an operational database.
Pause dedicated SQL pools when idle and size them to the workload.
Use serverless SQL pools for ad-hoc querying, and partition and store data as Parquet to cut bytes scanned.
Share data in Data Lake Storage between Spark and SQL pools rather than copying between systems.
Keep storage and compute separate so each scales and bills independently.
Evaluate Microsoft Fabric for new analytics estates, where Microsoft is consolidating the analytics stack.

Comparable servicesAWS RedshiftGCP BigQuery

Knowledge Check

Why is Synapse the wrong choice for a transactional, single-row workload?

It is a warehouse optimized for large columnar scans, and it is slow and costly at the point lookups an OLTP database does in milliseconds
It cannot store relational data at all, so the transactional tables an OLTP application depends on have nowhere to live anywhere inside the workspace
It has no SQL interface for querying tables
It only supports batch loading once per day on a fixed schedule

How is a serverless SQL pool billed, and what reduces the cost?

Per terabyte scanned — partitioning and columnar Parquet cut the bytes scanned and the bill
Per provisioned node-hour regardless of how many queries run, with reserved-instance discounts
A flat monthly fee per workspace, queries included
Per row returned to the client in the result set

What is the cost discipline for a dedicated SQL pool?

Pause it when idle, since you pay for the provisioned size whether or not it is querying
Leave it running around the clock so queries stay warm in cache and incur no additional compute cost
Scan more data per query to amortize the fixed provisioned fee
Store data as row-oriented CSV to reduce the node count

You got correct