Amazon Redshift
Redshift is AWS's data warehouse, built for analytical queries that scan billions of rows and aggregate, group, and join — not the small fast lookups transactional databases handle. It speaks a PostgreSQL-compatible SQL dialect over an entirely different columnar, massively parallel engine.
The positioning is OLAP versus OLTP: a single Redshift query can scan terabytes, while an RDS transaction touches a few rows. Different jobs, different tools.
Columnar Storage and MPP
Two ideas make Redshift fast. Columnar storage writes data one column at a time, so a query reading three of a hundred columns touches only those three on disk, and compression is high because each column holds one repeating type. Massively parallel processing spreads the work across many compute nodes that each scan their slice and return partial results.
SUM(price) still reads every column of every row.SUM(price) reads only the price column; high compression, scanned in parallel across MPP nodes.Together they favor queries that scan a lot of data but read few columns and parallelize well — star-schema analytics, aggregations over billions of rows, dashboards on years of history.
Operating Shapes and RA3
Redshift Serverless bills per RPU-second with no nodes to size and is the recommended starting point. Provisioned RA3 clusters separate compute from a managed S3-backed storage tier (the compute nodes cache hot data on local SSDs), so storage and compute scale independently and resizes are fast — cheaper at near-100% steady utilization.
Loading Data and Spectrum
Data arrives via COPY from S3 (the parallel bulk-load workhorse), Auto-Copy for continuous ingestion, Streaming Ingestion from Kinesis and MSK, zero-ETL integrations that sync from Aurora, RDS, and DynamoDB with no ETL code, and federated query that reads RDS/Aurora tables in place.
Redshift Spectrum queries Parquet/ORC/CSV files directly in S3 via an external schema, billed per TB scanned. The classic pattern keeps hot data in Redshift and years of cold history in S3, reaching back through Spectrum — cheap storage with full query reach.
Performance and Cost
Three knobs dominate speed. The distribution key spreads rows across nodes — KEY co-locates join columns, EVEN balances, ALL replicates small dimension tables; the wrong choice causes data skew where one node bottlenecks every query. The sort key orders rows so filters skip data via zone maps. VACUUM reclaims and re-sorts space after large deletes. Automatic Table Optimization can pick keys from observed query patterns.
Serverless is competitive for intermittent load; provisioned RA3 with Reserved Instances wins near 100% utilization. Start with Serverless and move only after measuring a steady workload.
Redshift — repeated, large-scale analytical queries over structured warehouse data, where second-level latency is acceptable.
RDS / Aurora — OLTP — single-row reads, frequent updates, transactional consistency, millisecond latency.
Athena — occasional, ad-hoc SQL directly over S3 without provisioning a warehouse; per-TB-scanned pricing.
- Using Redshift for OLTP — single-row reads and frequent updates belong in RDS or Aurora, not a warehouse.
- Picking the wrong distribution key, causing data skew where one overloaded node bottlenecks every query.
- Keeping all cold history inside Redshift instead of storing it as Parquet in S3 and querying via Spectrum.
- Expecting millisecond dashboard latency — Redshift query latency is seconds; front small hot data with a relational DB plus cache.
- Manually tuning sort and distribution keys on day one instead of letting Automatic Table Optimization observe the workload.
- Provisioning a large cluster for intermittent load when Serverless would cost less and scale to zero work.
- Start with Redshift Serverless unless you can describe a steady, high-utilization workload in numbers.
- Store cold history as Parquet in S3 and query it through Spectrum.
- Distribute tables on the join key and let Auto WLM and Automatic Table Optimization run before manual tuning.
- Compress everything and use materialized views for recurring dashboard aggregations.
- Use zero-ETL integrations to sync from Aurora/RDS instead of building ETL pipelines.
- Watch for data skew via the console or system views when queries slow down.
Knowledge Check
Why is columnar storage faster for analytical queries?
- A query reading a few columns touches only those columns on disk, and per-column compression is high
- It physically stores each row contiguously on disk so that full-row reads come back almost instantly every time
- It keeps the entire table resident in memory at all times for speed
- It automatically builds a B-tree index on every column in the table
What is the most common performance trap when designing a Redshift table?
- A poor distribution key causing data skew, where one node holds far more data and bottlenecks queries
- Enabling columnar storage on a table that is queried row by row
- Defining too few columns per table for the query patterns you run
- Running ordinary large analytical queries written using only the standard PostgreSQL-compatible query syntax
What does Redshift Spectrum let you do?
- Query data files directly in S3 via an external schema, without loading them into the cluster, billed per TB
- Cache the cluster's query results in memory so repeated reads return in microseconds
- Replicate the entire cluster to another AWS Region for disaster recovery
- Run high-volume OLTP write transactions directly against the cluster with consistent single-millisecond latency
A workload needs millisecond-latency single-row reads with frequent updates. Is Redshift appropriate?
- No — that is OLTP; use RDS or Aurora. Redshift query latency is measured in seconds
- Yes — Redshift is specifically optimized for very fast single-row primary-key point reads
- Yes, as long as you first enable the Concurrency Scaling feature
- Yes, provided that you choose the ALL table distribution style
You got correct