Service 58

Polly, Transcribe & Translate

AI/ML · Speech · API

Three managed AI APIs that belong together: Polly is text-to-speech, Transcribe is speech-to-text, and Translate is machine translation. Voice in, voice out, and language conversion in between — each a well-defined task with no model to train and no GPU to provision.

Together they cover most "convert speech or language" workloads on AWS, and they chain cleanly into pipelines.

Polly and Transcribe

Polly turns text into audio across 100+ voices and 30+ languages. Default to neural voices (far more natural than standard); long-form and generative voices are premium options for audiobooks and expressive speech. SSML, lexicons, and speech marks give precise control over pronunciation, pauses, and word-level timing.
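As a minimal sketch of the SSML control described above, the following builds a small SSML document with fixed pauses and sends it to Polly with the neural engine. The helper names `build_ssml` and `synthesize` are illustrative, the voice `Joanna` is just one example of a neural-capable voice, and the `polly` argument is assumed to be a boto3 Polly client created elsewhere.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms=400):
    """Wrap sentences in SSML, inserting a fixed pause between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so plain text cannot break the SSML markup.
    body = pause.join(escape(s) for s in sentences)
    return f"<speak>{body}</speak>"


def synthesize(polly, ssml, voice_id="Joanna"):
    """Request neural speech for an SSML string and return the MP3 bytes.

    `polly` is assumed to be boto3.client("polly"); it is passed in so the
    function can also be exercised with a stand-in object.
    """
    response = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",
        VoiceId=voice_id,
        Engine="neural",
        OutputFormat="mp3",
    )
    return response["AudioStream"].read()


# Typical use (requires boto3 and AWS credentials):
#   import boto3
#   audio = synthesize(boto3.client("polly"), build_ssml(["Hello.", "Welcome back."]))
```

Passing the client in rather than creating it inside the function keeps the request-building logic easy to check without an AWS account.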

Transcribe turns audio into a timestamped transcript with speaker diarization across 100+ languages. Specialized variants cover medical dictation, call analytics (sentiment, talk-time), and real-time streaming. Custom vocabularies improve accuracy on domain-specific terms with a few minutes of work.
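A rough sketch of how diarization and a custom vocabulary might be requested for a batch job, plus how word-level timestamps can be pulled from the resulting transcript JSON. The function names, job name, bucket URI, and vocabulary name are all hypothetical; the parameter and field names follow the boto3 `start_transcription_job` request and the standard Transcribe output format as far as I know them.

```python
def transcription_job_params(job_name, media_uri, language_code="en-US",
                             max_speakers=2, vocabulary_name=None):
    """Build keyword arguments for start_transcription_job."""
    settings = {"ShowSpeakerLabels": True, "MaxSpeakerLabels": max_speakers}
    if vocabulary_name:
        # The vocabulary is assumed to have been created beforehand.
        settings["VocabularyName"] = vocabulary_name
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
        "Settings": settings,
    }


def word_timings(transcript_json):
    """Extract (word, start_seconds, end_seconds) from a transcript document."""
    timings = []
    for item in transcript_json["results"]["items"]:
        if item["type"] != "pronunciation":
            continue  # punctuation items carry no timestamps
        word = item["alternatives"][0]["content"]
        timings.append((word, float(item["start_time"]), float(item["end_time"])))
    return timings


# Typical use (requires boto3 and AWS credentials):
#   import boto3
#   params = transcription_job_params("demo-job", "s3://example-bucket/call.wav",
#                                     vocabulary_name="product-terms")
#   boto3.client("transcribe").start_transcription_job(**params)
```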

Translate

Translate is neural machine translation across 75+ languages, with source auto-detection, a real-time API for short text, and a batch API for documents (DOCX, HTML, PPTX). Custom Terminology forces brand and product names to translate the way you want; Active Custom Translation adapts output to the style of parallel-data examples you supply, without training a separate custom model.
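To illustrate the real-time API with source auto-detection and Custom Terminology, here is a small sketch. The function name `translate` and the terminology name in the usage comment are hypothetical; `client` is assumed to be a boto3 Translate client, and the terminology is assumed to have been imported already.

```python
def translate(client, text, target_lang, terminology=None):
    """Translate short text, letting the service detect the source language.

    Returns (detected_source_language, translated_text).
    """
    kwargs = {
        "Text": text,
        "SourceLanguageCode": "auto",  # ask the service to detect the source
        "TargetLanguageCode": target_lang,
    }
    if terminology:
        # Name of a previously imported Custom Terminology resource.
        kwargs["TerminologyNames"] = [terminology]
    response = client.translate_text(**kwargs)
    return response["SourceLanguageCode"], response["TranslatedText"]


# Typical use (requires boto3 and AWS credentials):
#   import boto3
#   source, text = translate(boto3.client("translate"), "Hello", "es",
#                            terminology="brand-names")
```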

For most language pairs and content, Translate is good enough; for nuanced or creative content, a Bedrock foundation model can read more naturally — at higher cost.

These APIs vs Bedrock vs SageMaker

Polly / Transcribe / Translate — well-defined speech and language conversion as a managed API call.

Bedrock — nuanced or creative language output that needs a foundation model's quality.

SageMaker — training a custom speech or language model when off-the-shelf accuracy is not enough.

Common Mistakes

Defaulting to Polly standard voices when neural voices sound far more natural for the same use case.
Skipping Transcribe custom vocabularies for domain terms, accepting avoidable transcription errors.
Choosing the wrong Transcribe streaming protocol — WebSocket for browsers, HTTP/2 for server-side.
Reaching for Active Custom Translation before trying Custom Terminology for brand and product names.
Using Translate for highly idiomatic or creative content where a Bedrock model reads more naturally.
Building each step separately instead of chaining Transcribe → Comprehend → Translate → Polly into a pipeline.

Best Practices

Default to Polly neural voices; use long-form or generative only when needed.
Use SSML and lexicons when Polly's default reading is wrong.
Add Transcribe custom vocabularies for domain terms.
Use Custom Terminology in Translate for brand and product names before full custom translation.
Evaluate Bedrock against Translate for nuanced or creative content.
Chain the services into pipelines — transcribe, analyze, translate, and read back.
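The chaining practice above can be sketched as follows for the last two stages, assuming the transcript text has already been retrieved from a finished Transcribe job. The function name `translate_and_speak` is illustrative, both clients are assumed to be boto3 clients created by the caller, and the voice must be one that supports the neural engine in the target language (for example, `Lucia` for Spanish).

```python
def translate_and_speak(text, translate_client, polly_client, target_lang, voice_id):
    """Translate transcript text, then synthesize the translation as MP3.

    Returns (translated_text, audio_bytes).
    """
    # Stage 1: the transcript becomes the Translate input.
    translated = translate_client.translate_text(
        Text=text,
        SourceLanguageCode="auto",
        TargetLanguageCode=target_lang,
    )["TranslatedText"]

    # Stage 2: the translated text becomes the Polly input.
    audio = polly_client.synthesize_speech(
        Text=translated,
        VoiceId=voice_id,
        Engine="neural",
        OutputFormat="mp3",
    )["AudioStream"].read()
    return translated, audio


# Typical use (requires boto3 and AWS credentials):
#   import boto3
#   text, audio = translate_and_speak("Good morning", boto3.client("translate"),
#                                     boto3.client("polly"), "es", "Lucia")
```

An analysis step such as Comprehend would slot in between the transcript and the translation in the same way, taking the text as input and returning metadata alongside it.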

Comparable services: GCP Text-to-Speech, Speech-to-Text, and Translation; Azure AI Speech and Translator.

Knowledge Check

Which Polly voice family should new work default to?

Neural voices — far more natural than standard, no long-form/generative premium
Standard voices — the older, cheaper tier, yet still the most natural-sounding of them all
Generative voices for everything, expressive or not
Long-form voices for all content regardless of length

How do you improve Transcribe accuracy on domain-specific terms?

Add a custom vocabulary (or custom language model) with the domain words
Switch to Polly to synthesize the terms first
Re-encode the source recording at a much higher sample rate and bitrate before submitting
Run the audio through Translate first

When might a Bedrock model beat Amazon Translate?

For nuanced or creative content where more natural output justifies the cost
For all routine batch document translation jobs, regardless of volume or content type
For auto-detecting the source language of the input
For the cheapest possible per-token translation

What makes these three services easy to combine into a pipeline?

They are composable managed APIs whose outputs feed cleanly into the next service's input
They expose a single shared endpoint URL behind which one common request and response format covers all three
They must all run inside the same Lambda function and share its execution context to chain together at all
They only work when all three are invoked together as one mandatory bundled call to a combined service
