Amazon Bedrock
Amazon Bedrock is AWS's managed service for foundation models — large language models, image generators, and embeddings models from many providers, through one API. AWS handles model hosting, capacity, and provider authentication, so the same API works whether you call a model from Anthropic, Meta, Mistral, Cohere, Amazon, or others.
Launched in 2023, it is the default place on AWS to put a large language model behind a production application: simpler than running inference on SageMaker, faster than juggling each provider's SDK.
Models and Inference Modes
Bedrock offers models from multiple providers (the lineup shifts as models are added and retired — check the console for current models and Region availability). For most workloads the choice is between a frontier general-purpose model and a smaller, faster, cheaper one; embeddings are a separate category.
Three inference modes: on-demand (per input/output token, the default), provisioned throughput (reserved capacity for high steady traffic and consistent latency; required for some custom models), and batch (asynchronous, roughly half the per-token price). Start on-demand and move to provisioned only with measured demand.
Knowledge Bases, Agents, and Guardrails
Knowledge Bases manage retrieval-augmented generation end to end — point Bedrock at a data source, and it chunks, embeds, indexes, and injects relevant context into prompts. Simple but less flexible than rolling your own. Agents add tool use: an agent reasons about a request, calls action groups defined by an API schema, and returns a result.
Guardrails enforce content policy on both inputs and outputs — blocking topics, filtering phrases, redacting PII, and catching prompt-injection attempts. For user-facing applications, guardrails are the right baseline.
Customization
Bedrock adapts models three ways: continued pre-training on unlabeled domain data, fine-tuning on labeled task examples, and model import of an open-weights model trained elsewhere. Fine-tuned and imported models almost always require provisioned throughput. For most teams in 2026, prompting plus retrieval beats fine-tuning — reach for fine-tuning only when you have measured that prompting is not enough.
Bedrock — using foundation models someone else trained, with optional RAG, agents, and fine-tuning.
SageMaker — training and serving your own models from scratch with full control.
Rekognition / Comprehend / Textract — specific ready-made tasks (vision, NLP, OCR) that a foundation model is overkill for.
- Jumping to provisioned throughput before measuring demand, paying for reserved capacity an on-demand workload does not need.
- Shipping a user-facing LLM app with no guardrails on inputs and outputs.
- Reaching for fine-tuning when prompting plus retrieval would have worked, adding cost and provisioned-throughput requirements.
- Not pinning model IDs, so a deprecated foundation model breaks the application without warning.
- Ignoring the token bill — long context windows cost real money; trim prompts to what matters.
- Building a custom RAG stack first when managed Knowledge Bases would cover the use case.
- Start with on-demand inference; move to provisioned throughput only with measured demand.
- Use Knowledge Bases for early RAG and roll your own only when you outgrow it.
- Attach guardrails to user-facing applications before production.
- Pin model IDs in code and plan for model deprecation and migration.
- Stream long responses and trim prompts to control the token bill.
- For multi-tenant SaaS, isolate per-tenant data with Knowledge Base access-control filters.
Knowledge Check
What does Bedrock provide?
- Foundation models from many providers behind one API, with hosting and auth handled
- A managed Kubernetes cluster with GPU node pools you provision and scale for serving ML models
- Optical character recognition and structured form-field extraction
- A workbench to train and tune models from scratch on your own data
Which inference mode should a new Bedrock project start with?
- On-demand — per-token billing with no commitment
- Provisioned throughput with a six-month reserved-capacity commitment
- Batch only — asynchronous jobs at half the per-token price
- Whichever is cheapest at maximum scale
What do Bedrock Guardrails do?
- Apply content policy to inputs and outputs — denied topics, PII redaction, injection detection
- Reserve guaranteed model throughput for steady high-volume traffic, holding latency consistent
- Fine-tune a foundation model on your own labeled examples
- Cache repeated completions to cut your per-token bill
For most teams in 2026, what beats fine-tuning a foundation model?
- Prompting plus retrieval (RAG); fine-tune only when prompting measurably falls short
- Continued pre-training of the base model on the whole of your unlabeled company data
- Always importing an open-weights model trained elsewhere
- Switching providers in turn until one of them works
You got correct