Amazon CloudWatch

Monitoring, Observability, Ops

CloudWatch is AWS's monitoring and observability service. It collects metrics, logs, and traces from AWS services and your applications, stores them, and provides alarms, dashboards, and query tools. Almost every AWS service publishes metrics automatically — EC2 CPU, ALB request count, Lambda duration, RDS connections — with no opt-in.

It is the default observability backbone on AWS and the trigger source for autoscaling and alerting. It can also become one of the larger line items on an AWS bill, so cost discipline matters.

Metrics and Alarms

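A metric is a named time series in a namespace, identified by its dimensions. AWS services publish most metrics at one-minute or five-minute intervals per resource, depending on the service and its monitoring settings. Custom metrics let your application publish its own data points via PutMetricData. CloudWatch keeps up to 15 months of data at progressively coarser granularity: one-minute data points are kept for 15 days, five-minute data points for 63 days, and one-hour data points for 15 months.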
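A minimal sketch of publishing a custom metric, assuming the boto3 SDK. The namespace, metric name, and dimension values are hypothetical; the request shape follows PutMetricData. The builder is kept separate from the network call so the payload can be inspected without AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_request(namespace, name, value, unit, dimensions):
    """Build keyword arguments for CloudWatch PutMetricData.

    `dimensions` is a plain dict. Every distinct combination of dimension
    values becomes its own time series, so keep the set small and bounded.
    """
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                "Dimensions": [
                    {"Name": k, "Value": v} for k, v in sorted(dimensions.items())
                ],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(value),
                "Unit": unit,
            }
        ],
    }


def publish(request):
    """Send the request (assumes boto3 is installed and credentials are configured)."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("cloudwatch").put_metric_data(**request)


# Hypothetical names, for illustration only.
request = build_metric_request(
    namespace="MyApp/Checkout",
    name="OrderLatency",
    value=182.5,
    unit="Milliseconds",
    dimensions={"Service": "checkout", "Environment": "prod"},
)
```

Calling `publish(request)` would send the data point; the dimensions here are deliberately limited to service and environment rather than anything per-user or per-request.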

An alarm watches a metric (or a math expression over several) and fires when it crosses a threshold for a set number of periods — sending an SNS notification, triggering autoscaling, or recovering an instance. Composite alarms combine base alarms with AND/OR/NOT so "the service is broken" can mean several symptoms at once.
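The alarm and composite-alarm idea can be sketched as request payloads for PutMetricAlarm and PutCompositeAlarm. This is an illustration under assumptions: the alarm names, the custom metrics, the thresholds, and the SNS topic ARN are all hypothetical placeholders.

```python
def build_metric_alarm(name, namespace, metric, threshold, periods, actions=()):
    """Keyword arguments for PutMetricAlarm.

    The alarm fires when the one-minute average exceeds `threshold`
    for `periods` consecutive periods.
    """
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": "Average",
        "Period": 60,                  # seconds per evaluation period
        "EvaluationPeriods": periods,  # how many periods are examined
        "DatapointsToAlarm": periods,  # all of them must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": list(actions),
    }


def alarm(name):
    """Rule term that is true while the named alarm is in the ALARM state."""
    return f'ALARM("{name}")'


def build_composite_alarm(name, rule, actions):
    """Keyword arguments for PutCompositeAlarm."""
    return {"AlarmName": name, "AlarmRule": rule, "AlarmActions": list(actions)}


TOPIC = "arn:aws:sns:us-east-1:123456789012:oncall"  # placeholder ARN

# Base alarms carry no actions; only the composite notifies.
errors = build_metric_alarm("checkout-errors-high", "MyApp/Checkout", "ErrorRate", 5.0, 3)
latency = build_metric_alarm("checkout-latency-high", "MyApp/Checkout", "OrderLatency", 800.0, 3)

rule = (
    f'{alarm(errors["AlarmName"])} AND {alarm(latency["AlarmName"])} '
    f'AND NOT {alarm("checkout-deploy-in-progress")}'
)
composite = build_composite_alarm("checkout-broken", rule, [TOPIC])
```

With boto3, these dicts would be passed as `put_metric_alarm(**errors)` and `put_composite_alarm(**composite)` on a `cloudwatch` client. Leaving actions off the base alarms is one way to make the composite the only thing that notifies.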

Logs and Logs Insights

CloudWatch Logs organizes events into log groups and log streams; many AWS services can ship logs there, and some (Lambda, for example) do so by default. Retention is set per log group, and the default of "forever" is expensive at scale, so set it explicitly. Subscription filters stream events in near real time to Lambda, Firehose, Kinesis Data Streams, or OpenSearch.
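A small sketch of auditing retention, assuming input in the shape of the `logGroups` list returned by DescribeLogGroups, where a group with no `retentionInDays` key never expires. The log group names are hypothetical, and the list of accepted retention values is an assumption to confirm against the PutRetentionPolicy reference.

```python
# Retention values (in days) believed to be accepted by PutRetentionPolicy
# at the time of writing; treat this list as an assumption.
ALLOWED_DAYS = (1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
                731, 1096, 1827, 2192, 2557, 2922, 3288, 3653)


def groups_without_retention(log_groups):
    """Return the names of log groups that keep events forever."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def retention_request(log_group_name, days):
    """Keyword arguments for PutRetentionPolicy, rounded up to an accepted value."""
    chosen = next((d for d in ALLOWED_DAYS if d >= days), ALLOWED_DAYS[-1])
    return {"logGroupName": log_group_name, "retentionInDays": chosen}


# Hypothetical sample of a DescribeLogGroups response.
sample = [
    {"logGroupName": "/aws/lambda/orders"},               # never expires
    {"logGroupName": "/ecs/web", "retentionInDays": 30},  # already bounded
]

unbounded = groups_without_retention(sample)
fixes = [retention_request(name, 45) for name in unbounded]
```

Each entry in `fixes` could then be applied with `boto3.client("logs").put_retention_policy(**fix)`; the requested 45 days is rounded up to 60 because 45 is not in the assumed list.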

Logs Insights is a query language that turns log groups into something like a database — filter, parse JSON, compute statistics — the right tool for ad-hoc investigation when you do not yet know what you are looking for.
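As an illustration of the filter, parse, and aggregate pattern, the sketch below builds a StartQuery request. It assumes JSON-formatted logs with hypothetical `level` and `service` fields and a hypothetical log group name.

```python
import time

# Count ERROR events per service in 5-minute buckets.
# `level` and `service` are assumed field names in the JSON log events.
QUERY = """
fields @timestamp, @message
| filter level = "ERROR"
| stats count(*) as errors by bin(5m), service
| sort errors desc
| limit 20
""".strip()


def build_query_request(log_group, query, lookback_minutes, now=None):
    """Keyword arguments for StartQuery covering the last `lookback_minutes`."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - lookback_minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


req = build_query_request("/ecs/web", QUERY, lookback_minutes=60, now=1_700_000_000)
```

With boto3, `start_query(**req)` on a `logs` client returns a `queryId`, and `get_query_results(queryId=...)` is polled until the query finishes.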

Dashboards and Higher-Level Bundles

Dashboards lay out metric, log, and text widgets; the free tier covers 3 dashboards (up to 50 metrics each) plus all automatic per-service dashboards, with additional custom dashboards billed per dashboard per month. Synthetics run scripted canaries that probe your app from outside; RUM collects real-user browser telemetry; Internet Monitor watches public-internet performance to your users.
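Dashboards are defined by a JSON body, so they can be generated and kept in version control. The following is a sketch under assumptions: the resource identifiers, region, and dashboard contents are hypothetical.

```python
import json


def metric_widget(title, metrics, x, y, stat="Average", region="us-east-1"):
    """One metric widget, half the width of the 24-column dashboard grid."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,  # [namespace, metric, dimension name, dimension value]
            "stat": stat,
            "period": 300,
            "region": region,
        },
    }


def text_widget(markdown, x, y):
    """A full-width text widget, e.g. for runbook links."""
    return {
        "type": "text",
        "x": x, "y": y, "width": 24, "height": 2,
        "properties": {"markdown": markdown},
    }


def build_dashboard_body(widgets):
    """Serialize widgets into the JSON string expected as DashboardBody."""
    return json.dumps({"widgets": widgets})


# Hypothetical resource names.
body = build_dashboard_body([
    text_widget("# Orders service", 0, 0),
    metric_widget("ALB requests",
                  [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/orders/abc123"]],
                  0, 2, stat="Sum"),
    metric_widget("Lambda duration",
                  [["AWS/Lambda", "Duration", "FunctionName", "orders-handler"]],
                  12, 2),
])
```

The resulting string would be passed to `put_dashboard(DashboardName="orders", DashboardBody=body)` on a boto3 `cloudwatch` client.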

Container Insights gives curated dashboards for EKS, ECS, and Fargate; Application Signals auto-discovers services, tracks them against the SLOs you define, and builds a service map, giving microservices observability without a full third-party APM.

CloudWatch vs CloudTrail vs X-Ray

CloudWatch — metrics, logs, alarms, and dashboards — the health and performance of resources and applications.

CloudTrail — the audit log of who called which AWS API — not performance metrics.

X-Ray — fine-grained distributed tracing across services, deeper than CloudWatch metrics.

Common Mistakes

Leaving log groups at the default "forever" retention, accumulating large and avoidable storage cost.
Creating high-cardinality custom metrics (one per user or request), which explodes the metrics bill.
Alerting on every base metric instead of using composite alarms for meaningful high-level signals.
Building dashboards during an incident instead of having per-service dashboards ready beforehand.
Treating CloudWatch as a full APM or SIEM — it is good but not as deep as dedicated tools for those.
Skipping Container Insights on container workloads, then lacking cluster and pod visibility when something breaks.
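The cardinality mistake above comes down to simple arithmetic: each distinct combination of dimension values is billed as its own metric. A rough upper-bound estimate, using hypothetical dimension counts:

```python
from math import prod


def series_count(distinct_values_per_dimension):
    """Upper bound on time series: the product of distinct values per dimension."""
    return prod(distinct_values_per_dimension.values())


# Hypothetical counts of distinct values for each dimension.
bounded = series_count({"Service": 12, "Environment": 3})
per_user = series_count({"Service": 12, "Environment": 3, "UserId": 50_000})
```

Here `bounded` is 36 series while `per_user` is 1,800,000. The figure is an upper bound, since it assumes every combination of values actually occurs.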

Best Practices

Set explicit retention on every log group.
Use composite alarms for high-level signals instead of alerting on every metric.
Use Logs Insights for ad-hoc investigation and subscriptions to stream logs to S3 or a SIEM.
Build per-service dashboards before you need them.
Watch for cardinality explosions and unbounded retention — the two biggest CloudWatch cost traps.
Enable Container Insights from day one on container workloads.

Comparable services: GCP: Cloud Monitoring, Cloud Logging; Azure: Azure Monitor

Knowledge Check

Which two patterns most often drive a surprising CloudWatch bill?

High-cardinality custom metrics and unbounded log retention
Configuring too many standard alarms and too few dashboards
Using composite alarms and running Logs Insights queries
Enabling Container Insights and Synthetics canaries

What does a composite alarm do?

Combines base alarms with AND/OR/NOT so an alert fires only on a meaningful combination of symptoms
Watches a single high-resolution metric at one-second granularity and pages an operator on each breach
Stores the underlying log events for up to ten years
Replaces the need for SNS topics and notifications entirely

What is CloudWatch Logs Insights best used for?

Ad-hoc investigation across log groups when you do not yet know what you are looking for
Long-term storage and retention of aggregated metric data
Replacing standard alarms for continuous production alerting
Distributed tracing of individual requests across microservices and their downstream dependencies

What is the default retention for a new CloudWatch log group?

Forever — which is expensive at scale and should be set explicitly
One day, after which events are deleted automatically
Thirty days, matching the standard metric retention window
A retention period must be chosen before any log events are accepted
