Polly, Transcribe & Translate
Three managed AI APIs that belong together: Polly is text-to-speech, Transcribe is speech-to-text, and Translate is machine translation. Voice in, voice out, and language conversion in between — each a well-defined task with no model to train and no GPU to provision.
Together they cover most "convert speech or language" workloads on AWS, and they chain cleanly into pipelines.
Polly and Transcribe
Polly turns text into audio across 100+ voices and 30+ languages. Default to neural voices (far more natural than standard); long-form and generative voices are premium options for audiobooks and expressive speech. SSML, lexicons, and speech marks give precise control over pronunciation, pauses, and word-level timing.
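As a minimal sketch of the SSML control described above, the helper below joins sentences with an explicit pause and builds the arguments for Polly's `synthesize_speech` call. The function names, the `Joanna` voice, and the 400 ms pause are illustrative assumptions, not recommendations from this lesson; the client is passed in so that any object exposing `synthesize_speech` works.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=400):
    """Join sentences into one SSML document with a fixed pause between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so plain text cannot break the SSML markup.
    body = pause.join(escape(sentence) for sentence in sentences)
    return f"<speak>{body}</speak>"


def polly_request(ssml, voice_id="Joanna"):
    """Keyword arguments for Polly's SynthesizeSpeech operation."""
    return {
        "Engine": "neural",     # neural voices as the default tier
        "VoiceId": voice_id,    # assumed voice; it must support the neural engine
        "OutputFormat": "mp3",
        "TextType": "ssml",     # tells Polly that Text contains SSML, not plain text
        "Text": ssml,
    }


def synthesize(polly_client, sentences, voice_id="Joanna"):
    """Call Polly and return the synthesized audio as bytes."""
    response = polly_client.synthesize_speech(
        **polly_request(build_ssml(sentences), voice_id)
    )
    return response["AudioStream"].read()
```

With credentials and a region configured, `synthesize(boto3.client("polly"), ["First sentence.", "Second sentence."])` should return MP3 bytes that can be written to a file.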
Transcribe turns audio into a timestamped transcript with speaker diarization across 100+ languages. Specialized variants cover medical dictation, call analytics (sentiment, talk-time), and real-time streaming. Custom vocabularies improve accuracy on domain-specific terms with a few minutes of work.
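The sketch below shows how a custom vocabulary and speaker diarization might be wired into a batch transcription job. It only builds the request arguments for Transcribe's `create_vocabulary` and `start_transcription_job` operations; the helper names, job name, bucket URI, and vocabulary name are hypothetical placeholders.

```python
def vocabulary_request(name, phrases, language_code="en-US"):
    """Keyword arguments for Transcribe's CreateVocabulary operation."""
    return {
        "VocabularyName": name,
        "LanguageCode": language_code,
        "Phrases": list(phrases),  # domain terms the default model tends to miss
    }


def transcription_request(job_name, media_uri, vocabulary_name=None,
                          max_speakers=2, language_code="en-US"):
    """Keyword arguments for Transcribe's StartTranscriptionJob operation."""
    settings = {
        "ShowSpeakerLabels": True,         # turn on speaker diarization
        "MaxSpeakerLabels": max_speakers,  # required whenever ShowSpeakerLabels is set
    }
    if vocabulary_name:
        settings["VocabularyName"] = vocabulary_name
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},  # S3 location of the audio file
        "Settings": settings,
    }


def start_job(transcribe_client, job_name, media_uri, vocabulary_name=None):
    """Submit the job; the transcript is produced asynchronously."""
    return transcribe_client.start_transcription_job(
        **transcription_request(job_name, media_uri, vocabulary_name)
    )
```

A vocabulary has to finish processing before a job can reference it, so in practice the `create_vocabulary` call would run ahead of time rather than immediately before `start_job`.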
Translate
Translate is neural machine translation across 75+ languages, with source auto-detection, a real-time API for short text, and a batch API for documents (DOCX, HTML, PPTX). Custom Terminology forces brand and product names to translate the way you want; Active Custom Translation goes further, adapting batch translation output to the style of parallel-text examples you supply.
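A short sketch of the real-time path, assuming a Custom Terminology resource has already been uploaded under a name of your choosing (`brand-terms` in the usage note is a hypothetical name). It relies on the `translate_text` operation with `"auto"` as the source language so the service detects it; the helper names are illustrative.

```python
def translate_request(text, target_language, terminology_names=(),
                      source_language="auto"):
    """Keyword arguments for Translate's TranslateText operation."""
    request = {
        "Text": text,
        "SourceLanguageCode": source_language,  # "auto" asks the service to detect it
        "TargetLanguageCode": target_language,
    }
    if terminology_names:
        # Names of Custom Terminology resources already uploaded to the account.
        request["TerminologyNames"] = list(terminology_names)
    return request


def translate(translate_client, text, target_language, terminology_names=()):
    """Translate short text; return (translated text, source language used)."""
    response = translate_client.translate_text(
        **translate_request(text, target_language, terminology_names)
    )
    return response["TranslatedText"], response["SourceLanguageCode"]
```

For example, `translate(boto3.client("translate"), "Try our new app", "es", ["brand-terms"])` would return the Spanish text together with the detected source language code.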
For most language pairs and routine content, Translate is good enough to use as-is; for nuanced or creative content, a Bedrock foundation model can read more naturally, at higher cost.
When to Use Which
- Polly / Transcribe / Translate: well-defined speech and language conversion as a managed API call.
- Bedrock: nuanced or creative language output that needs a foundation model's quality.
- SageMaker: training a custom speech or language model when off-the-shelf accuracy is not enough.
Common Mistakes
- Defaulting to Polly standard voices when neural voices sound far more natural for the same use case.
- Skipping Transcribe custom vocabularies for domain terms, accepting avoidable transcription errors.
- Choosing the wrong Transcribe streaming protocol — WebSocket for browsers, HTTP/2 for server-side.
- Reaching for Active Custom Translation before trying Custom Terminology for brand and product names.
- Using Translate for highly idiomatic or creative content where a Bedrock model reads more naturally.
- Building each step separately instead of chaining Transcribe → Comprehend → Translate → Polly into a pipeline.
Best Practices
- Default to Polly neural voices; use long-form or generative only when needed.
- Use SSML and lexicons when Polly's default reading is wrong.
- Add Transcribe custom vocabularies for domain terms.
- Use Custom Terminology in Translate for brand and product names before full custom translation.
- Evaluate Bedrock against Translate for nuanced or creative content.
- Chain the services into pipelines — transcribe, analyze, translate, and read back.
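The chaining idea can be sketched as one function that takes a finished Transcribe result and passes its text through Comprehend, Translate, and Polly in turn. This is an illustration under stated assumptions rather than a production pipeline: the function names are hypothetical, the clients are passed in by the caller, the transcription job is assumed to have completed already, and real inputs may exceed per-request size limits and need chunking.

```python
def transcript_text(transcribe_output):
    """Pull the plain transcript out of a parsed Transcribe result document."""
    transcripts = transcribe_output["results"]["transcripts"]
    return " ".join(item["transcript"] for item in transcripts)


def run_pipeline(transcribe_output, comprehend, translate, polly,
                 source_language, target_language, voice_id):
    """Transcript -> sentiment -> translation -> speech, one call per service."""
    text = transcript_text(transcribe_output)

    # Analyze: Comprehend labels the overall sentiment of the transcript.
    sentiment = comprehend.detect_sentiment(
        Text=text, LanguageCode=source_language
    )["Sentiment"]

    # Translate: the transcript text is the next service's input.
    translated = translate.translate_text(
        Text=text,
        SourceLanguageCode=source_language,
        TargetLanguageCode=target_language,
    )["TranslatedText"]

    # Read back: voice_id must be a voice for the target language that
    # supports the neural engine (an assumption the caller has to satisfy).
    audio = polly.synthesize_speech(
        Text=translated, VoiceId=voice_id, Engine="neural", OutputFormat="mp3"
    )["AudioStream"].read()

    return {"sentiment": sentiment, "translated_text": translated, "audio": audio}
```

Passing the clients in keeps each step replaceable, so a stage can be swapped or stubbed out in tests without touching the others.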
Knowledge Check
Which Polly voice family should new work default to?
- Neural voices, which sound far more natural than standard and avoid the long-form/generative premium
- Standard voices — the older, cheaper tier, yet still the most natural-sounding of them all
- Generative voices for everything, expressive or not
- Long-form voices for all content regardless of length
How do you improve Transcribe accuracy on domain-specific terms?
- Add a custom vocabulary (or custom language model) with the domain words
- Switch to Polly to synthesize the terms first
- Re-encode the source recording at a much higher sample rate and bitrate before submitting
- Run the audio through Translate first
When might a Bedrock model beat Amazon Translate?
- For nuanced or creative content where more natural output justifies the cost
- For all routine batch document translation jobs, regardless of volume or content type
- For auto-detecting the source language of the input
- For the cheapest possible per-token translation
What makes these three services easy to combine into a pipeline?
- They are composable managed APIs whose outputs feed cleanly into the next service's input
- They expose a single shared endpoint URL behind which one common request and response format covers all three
- They must all run inside the same Lambda function and share its execution context to chain together at all
- They only work when all three are invoked together as one mandatory bundled call to a combined service