Service 07

AWS Batch

Compute, Batch, Containers

AWS Batch is a fully managed service for running batch computing jobs: work that runs unattended, finishes, and produces output, such as video transcoding, genomics, risk modeling, ML training, and end-of-day reports. You give Batch a containerized job and its resource requirements; it provisions EC2 or Fargate capacity, runs the job, retries it on failure, and records whether it succeeded.

Batch is not Lambda. Lambda runs short, event-driven functions; Batch runs longer, heavier jobs — often hours of work — across many machines, scaling to thousands of containers and back to zero.

Key Concepts

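A job definition is a template: container image, vCPU, memory, IAM role, retries. A job is a single unit of work submitted from a job definition. A job queue holds submitted jobs until capacity is available; each queue has a priority that decides which jobs are placed first when several queues share a compute environment. A compute environment is the resource pool that runs the jobs: Fargate, EC2 On-Demand, EC2 Spot, or a mix. A scheduling policy, attached to a queue, shares capacity fairly across teams so that one team cannot monopolize it.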
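To make the job definition concrete, here is a minimal Python sketch that builds the request for boto3's `register_job_definition` call. The job name, image URI, account ID, and role ARN are placeholders invented for illustration, and the helper names `build_job_definition` and `register` are hypothetical. The sketch assumes an EC2-backed job; a Fargate job would need additional fields.

```python
def build_job_definition(name, image, vcpus, memory_mib, job_role_arn, attempts=3):
    """Return keyword arguments for boto3's batch.register_job_definition.

    Batch accepts between 1 and 10 retry attempts per job.
    """
    if not 1 <= attempts <= 10:
        raise ValueError("attempts must be between 1 and 10")
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            "jobRoleArn": job_role_arn,
            # Resource values are passed as strings; memory is in MiB.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
        "retryStrategy": {"attempts": attempts},
    }


# Placeholder values, not real resources.
job_def = build_job_definition(
    name="transcode-video",
    image="123456789012.dkr.ecr.us-east-1.amazonaws.com/transcode:latest",
    vcpus=2,
    memory_mib=4096,
    job_role_arn="arn:aws:iam::123456789012:role/transcode-job-role",
)


def register(job_def_kwargs):
    """Send the definition to AWS. Needs boto3 and credentials; not called here."""
    import boto3

    return boto3.client("batch").register_job_definition(**job_def_kwargs)
```

Building the request as a plain dictionary keeps the template reviewable and testable without touching AWS; calling `register(job_def)` is the only step that needs credentials.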

How a Job Runs

You submit a job to a queue; Batch finds capacity in a compute environment attached to that queue, scaling it up if there is none, starts a container, and records success or failure, retrying up to the configured limit. Batch does not store outputs itself: the job writes them, typically to S3 or DynamoDB.
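The lifecycle above can be observed from the outside by polling the job's status. The following is a rough Python sketch, assuming a boto3 Batch client (or anything with a compatible `describe_jobs` method) is passed in; `wait_for_job` is a hypothetical helper name, and the 30-second default interval is an arbitrary choice.

```python
import time

# A Batch job ends in one of these two statuses.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def wait_for_job(client, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll describe_jobs until the job reaches a terminal status, then return it.

    `client` is expected to behave like boto3.client("batch").
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    while True:
        response = client.describe_jobs(jobs=[job_id])
        status = response["jobs"][0]["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

In practice an event-driven notification is usually preferable to polling for large fleets of jobs; the loop is shown only because it mirrors the submit, run, record sequence directly.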

Array jobs turn one submission into thousands of near-identical jobs — the clean way to process a file list or sweep parameters. Job dependencies let Job B start only after Job A succeeds, so Batch handles ordering instead of your own glue code.
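As an illustration of array jobs and dependencies, the Python sketch below builds a `submit_job` request with `arrayProperties` and an optional `dependsOn` entry, and shows how a child job might pick its input from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable that Batch sets in each child. The helper names, queue name, and file list are hypothetical.

```python
import os


def build_array_submission(job_name, queue, job_definition, size, depends_on_job_id=None):
    """Return keyword arguments for boto3's batch.submit_job for an array job."""
    # Batch accepts array sizes from 2 to 10,000.
    if not 2 <= size <= 10_000:
        raise ValueError("array size must be between 2 and 10,000")
    kwargs = {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": size},
    }
    if depends_on_job_id is not None:
        # The array job starts only after the named job has succeeded.
        kwargs["dependsOn"] = [{"jobId": depends_on_job_id}]
    return kwargs


def pick_input(files, environ=os.environ):
    """Inside a child job: choose the input that matches this child's index."""
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return files[index]
```

A parent script would pass the result to `boto3.client("batch").submit_job(**kwargs)`, while the container's entrypoint would call `pick_input` against a file list it loads itself, for example a manifest stored alongside the inputs.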

Compute Environments and Cost

Choose how jobs run: Fargate for the simplest path within its CPU and memory limits; EC2 On-Demand for any instance type, including very large or many-core; or EC2 Spot, often 70–90% cheaper. Batch is built to use Spot well because most batch jobs can simply be retried after an interruption.
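For the Spot option, a managed compute environment might be described as in the Python sketch below, which builds the request for boto3's `create_compute_environment`. The environment name, subnet and security group IDs, instance role, and the 256 vCPU ceiling are all placeholders; the allocation strategy shown is one reasonable choice, not the only one.

```python
def build_spot_compute_environment(name, subnets, security_group_ids, instance_role, max_vcpus=256):
    """Return keyword arguments for boto3's batch.create_compute_environment."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "SPOT",
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,  # lets the environment scale back to zero when idle
            "maxvCpus": max_vcpus,
            "instanceTypes": ["optimal"],  # let Batch choose instance sizes
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role,
        },
    }


# Placeholder identifiers, not real resources.
spot_env = build_spot_compute_environment(
    name="spot-batch-env",
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
    instance_role="ecsInstanceRole",
)
```

Switching `"type"` to `"EC2"` would describe an On-Demand environment with the same shape, which makes it straightforward to keep one of each and order them behind a single queue.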

There is no charge for Batch itself; you pay only for the underlying EC2 or Fargate capacity and any storage your jobs use. Spot is the single biggest cost lever: for retry-tolerant workloads it can cut compute cost by up to roughly 90%.
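A quick back-of-the-envelope calculation shows how the Spot discount plays out once retried work is counted. The Python sketch below uses entirely made-up inputs: a workload of 10,000 vCPU-hours, an On-Demand price of $0.05 per vCPU-hour, and an assumed 10% of work redone after Spot interruptions. None of these figures are AWS prices.

```python
def compute_cost(vcpu_hours, price_per_vcpu_hour, discount=0.0, rerun_fraction=0.0):
    """Estimate compute cost in dollars.

    discount        fraction off the On-Demand price (0.70 means 70% cheaper)
    rerun_fraction  extra work redone after interruptions (0.10 means 10% more)
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if rerun_fraction < 0.0:
        raise ValueError("rerun_fraction must be non-negative")
    billed_hours = vcpu_hours * (1.0 + rerun_fraction)
    return billed_hours * price_per_vcpu_hour * (1.0 - discount)


# Hypothetical inputs for illustration only.
WORKLOAD_VCPU_HOURS = 10_000
ON_DEMAND_PRICE = 0.05

on_demand = compute_cost(WORKLOAD_VCPU_HOURS, ON_DEMAND_PRICE)
spot_70 = compute_cost(WORKLOAD_VCPU_HOURS, ON_DEMAND_PRICE, discount=0.70, rerun_fraction=0.10)
spot_90 = compute_cost(WORKLOAD_VCPU_HOURS, ON_DEMAND_PRICE, discount=0.90, rerun_fraction=0.10)

print(f"On-Demand: ${on_demand:.2f}")
print(f"Spot at 70% off: ${spot_70:.2f}")
print(f"Spot at 90% off: ${spot_90:.2f}")
```

Under these assumptions the bill falls from about $500 to about $165 at a 70% discount and about $55 at 90%, so even with a tenth of the work repeated, the saving is dominated by the discount rather than by the retries.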

Batch vs Lambda vs Step Functions

AWS Batch — long-running, parallel, throughput-bound jobs measured in minutes to hours, across many containers.

Lambda — many small, short, event-driven tasks under 15 minutes — simpler and cheaper for that pattern.

Step Functions — workflows with complex branching, ordering, or human approval steps between tasks.

Common Mistakes

Running retry-tolerant jobs on On-Demand instead of Spot: Spot is the single biggest cost lever and is often 70–90% cheaper.
Submitting thousands of jobs one at a time instead of using an array job, which does it in one call.
Writing non-idempotent jobs — since Batch retries on failure, two runs must not produce wrong or duplicated results.
Using Batch for real-time or millisecond work — it is built for minutes-to-hours jobs; use Lambda for event-driven tasks.
Hand-coding ordering between dependent jobs instead of declaring job dependencies and letting Batch sequence them.
Running jobs without shipping logs to CloudWatch, making a failed job nearly impossible to debug.

Best Practices

Use Spot capacity wherever jobs tolerate restarts — the largest cost saving available.
Use array jobs for parallelism instead of submitting jobs individually.
Write idempotent jobs so a retry produces the same correct result.
Keep job containers small for faster image pulls and start times.
Declare job dependencies rather than building your own ordering glue.
Send all job logs to CloudWatch for debugging.
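One common way to get idempotency is to make the job check for its finished output before doing any work, and to publish that output atomically. The Python sketch below uses the local filesystem as a stand-in for the real output store so that it runs anywhere; `run_once` is a hypothetical helper, and a real job writing to S3 would apply the same check-then-write idea with S3 calls instead.

```python
import os
import tempfile


def run_once(output_path, compute):
    """Write compute() to output_path unless a finished output already exists.

    Returns True if the work was done, False if it was skipped.
    """
    if os.path.exists(output_path):
        # A previous attempt already finished; a retry has nothing to do.
        return False

    directory = os.path.dirname(output_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(compute())
        # Rename is atomic on the same filesystem, so readers never see
        # a half-written result.
        os.replace(tmp_path, output_path)
    except BaseException:
        # Clean up the partial file so a retry starts from a clean state.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

With this shape, a job that is interrupted midway leaves no partial output behind, and a job that is retried after succeeding does not repeat or duplicate its work.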

Comparable services: GCP (Batch, Dataflow), Azure (Azure Batch)

Knowledge Check

What distinguishes AWS Batch from AWS Lambda?

Batch runs longer, heavier jobs (minutes to hours) across many machines; Lambda runs short, event-driven functions
Batch fires on each incoming event and returns in milliseconds, while Lambda is the engine for jobs that run for hours
Batch compute environments are locked to Fargate alone, and Lambda functions can only execute on dedicated EC2 hosts
They are interchangeable for any workload, so you can swap one for the other without changing how the job runs

What is the single biggest cost lever for AWS Batch workloads?

Running retry-tolerant jobs on EC2 Spot, often 70–90% cheaper, since interrupted jobs are simply retried
Picking the largest instance types available so each job finishes in less wall-clock time and frees the queue faster
Routing every compute environment to Fargate On-Demand for its hands-off convenience and per-second granularity
Turning off CloudWatch log streams for every job so you stop paying to store their output and metrics

Why must AWS Batch jobs be written to be idempotent?

Batch retries failed jobs, so running the same job twice must not produce wrong or duplicated results
Jobs marked idempotent qualify for a reduced compute rate that AWS applies automatically at the end of each month
Non-idempotent jobs are rejected by the array job API and can only be submitted one at a time as single jobs
Batch enforces a policy that accepts only stateless jobs and refuses anything that writes durable state

You need to process 10,000 input files with the same job. What is the right approach?

Submit a single array job, which creates the many near-identical jobs in one call
Write a script that loops 10,000 times and calls SubmitJob once per file as a separate single job
Hand the whole set to one Lambda function configured with the maximum 15-minute timeout to churn through all the files
Create 10,000 separate job queues
