Service 57

Azure AI Speech

AI API

Azure AI Speech is a pre-trained API for the voice modality: speech-to-text, text-to-speech, speech translation, and speaker recognition. Send audio or text, get the transcription, synthesized speech, or translation back, with no model to train for the common cases. It is the audio counterpart to the vision and language APIs in this chapter.

Like the others, it sits on the call side of build-versus-call, and like Face, parts of it carry Responsible AI obligations. Custom neural voice, which can clone a specific voice, is gated behind Limited Access approval precisely because the capability is powerful enough to misuse. Standard transcription and synthesis need no such gate.

Speech to Text

Speech-to-text transcribes audio in real time or in batch, across many languages, with options for custom models tuned to domain vocabulary and acoustics. Real-time transcription powers captions and voice commands; batch transcription processes recorded calls and media. Custom speech improves accuracy on jargon and noisy conditions the base model handles poorly.
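As a rough illustration of the real-time path for a short clip, the sketch below builds a request against the Speech service's short-audio REST endpoint and reads the transcript out of the JSON reply. It assumes a 16 kHz PCM WAV clip; the region, key, and the function names `build_stt_request` and `extract_text` are placeholders introduced here, not part of any SDK. Longer or streaming audio would more likely go through the Speech SDK or the batch transcription API, so treat the URL, headers, and response fields as things to confirm against the current REST reference.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_stt_request(region: str, key: str, wav_bytes: bytes,
                      language: str = "en-US") -> Request:
    """Build (but do not send) a short-audio recognition request.

    Assumption: `region` and `key` come from your own Speech resource.
    """
    query = urlencode({"language": language, "format": "simple"})
    url = (
        f"https://{region}.stt.speech.microsoft.com"
        f"/speech/recognition/conversation/cognitiveservices/v1?{query}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        # Assumed input format: 16 kHz PCM audio in a WAV container.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return Request(url, data=wav_bytes, headers=headers, method="POST")


def extract_text(response_body: bytes) -> str:
    """Return the transcript, or an empty string if nothing was recognized."""
    payload = json.loads(response_body)
    if payload.get("RecognitionStatus") != "Success":
        return ""
    return payload.get("DisplayText", "")
```

Sending the request is a single call to `urllib.request.urlopen(request)`; the response body can then be passed to `extract_text`. Keeping the request builder separate from the network call makes the shape of the request easy to inspect and test without a live key.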

Text to Speech

Text-to-speech synthesizes natural neural voices from text, with control over pronunciation, pace, and emphasis through SSML. A large catalog of prebuilt voices covers most needs; custom neural voice builds a unique brand voice — and is the capability gated behind Limited Access and Responsible AI review for its misuse potential.
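To make the SSML controls concrete, here is a minimal sketch that builds an SSML document slowing the pace with `prosody` and stressing one word with `emphasis`. The helper name `build_ssml`, the default voice `en-US-JennyNeural`, and the `-10%` rate are illustrative assumptions; which elements a given voice honors varies, so check the voice's documentation before depending on `emphasis` in particular.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Serialize SSML elements without a prefix (xmlns="..." on the root).
ET.register_namespace("", SSML_NS)


def build_ssml(sentence: str, stressed_word: str,
               voice: str = "en-US-JennyNeural",  # assumed example voice
               rate: str = "-10%",
               lang: str = "en-US") -> str:
    """Return SSML that slows the sentence and emphasizes one word."""
    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0", XML_LANG: lang})
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    prosody = ET.SubElement(voice_el, f"{{{SSML_NS}}}prosody", {"rate": rate})

    # Split around the first occurrence of the word to stress.
    head, found, tail = sentence.partition(stressed_word)
    prosody.text = head
    if found:
        emphasis = ET.SubElement(
            prosody, f"{{{SSML_NS}}}emphasis", {"level": "strong"}
        )
        emphasis.text = stressed_word
        emphasis.tail = tail

    # ElementTree escapes characters such as & and < in the text.
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Your order ships today.", "today"))
```

Building the document with an XML library rather than string concatenation means user-supplied text is escaped automatically, which avoids malformed SSML when the input contains characters like `&`.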

Translation

Speech translation transcribes and translates spoken audio in near real time, enabling live multilingual scenarios — captioned meetings, translated support calls. It combines the speech and translation capabilities so the application does not chain them by hand.

Speaker Recognition

Speaker recognition verifies a claimed identity by voice or identifies a speaker among enrolled voices. As a biometric, it carries the same sensitivity as facial recognition and must be designed with consent and security in mind — a voiceprint is personal data, governed accordingly.

Common Mistakes

Building a custom speech pipeline in Azure Machine Learning for transcription or synthesis that a pre-trained API already does well.
Skipping a custom speech model for heavy domain jargon or noisy audio, then blaming the base model for poor accuracy.
Overlooking the Limited Access requirement on custom neural voice and being blocked at deployment.
Treating voiceprints from speaker recognition as ordinary data rather than sensitive biometric personal data.
Ignoring SSML and shipping flat, robotic synthesis when pronunciation and prosody control were available.
Underestimating per-minute audio costs on a large batch-transcription workload.
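The second mistake above is easier to avoid with a measurement. One common way to quantify transcription accuracy is word error rate (WER): the word-level edit distance between a human reference transcript and the service's output, divided by the number of reference words. The sketch below is a plain implementation of that metric; the function name and the sample sentences are made up for illustration, and the threshold at which a custom model becomes worthwhile is a judgment call for each workload.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a drug name the base model splits into common words.
reference = "administer ten milligrams of metoprolol"
hypothesis = "administer ten milligrams of metro pole all"
print(word_error_rate(reference, hypothesis))  # 0.6
```

Running the same test set through the base model and a custom model and comparing the two WER values gives a concrete basis for the decision, rather than an impression from a handful of bad transcripts.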

Best Practices

Use pre-trained AI Speech for transcription, synthesis, and translation rather than building custom audio models.
Train a custom speech model when domain vocabulary or acoustics hurt base-model accuracy.
Plan for Limited Access approval and Responsible AI review before using custom neural voice.
Treat speaker-recognition voiceprints as sensitive biometric data, with consent and security controls.
Use SSML to control pronunciation, pace, and emphasis for natural synthesis.
Estimate per-minute audio cost before large batch-transcription workloads.
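The last practice is simple arithmetic, sketched below so it can be rerun as volumes change. The per-minute price used here is a hypothetical placeholder, not a published rate; substitute the figure from the current pricing page for your region and tier. `estimate_batch_cost` is a name introduced for this example.

```python
from decimal import Decimal, ROUND_HALF_UP


def estimate_batch_cost(clips: int, avg_minutes: Decimal,
                        price_per_minute: Decimal) -> Decimal:
    """Estimate total transcription cost, rounded to cents."""
    total_minutes = Decimal(clips) * avg_minutes
    cost = total_minutes * price_per_minute
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# 10,000 recorded calls averaging 6 minutes each, at a HYPOTHETICAL
# price of 0.006 currency units per audio minute.
estimate = estimate_batch_cost(10_000, Decimal("6"), Decimal("0.006"))
print(estimate)  # 360.00
```

`Decimal` is used instead of floats so that small per-minute prices multiplied by large volumes do not pick up rounding noise.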

Comparable services

AWS Transcribe / Polly
GCP Speech-to-Text / Text-to-Speech

Knowledge Check

Why is custom neural voice gated behind Limited Access approval?

Cloning a specific voice is powerful enough to misuse, so Microsoft requires Responsible AI review
Training the custom voice model is far too computationally expensive to allow open, general self-service use
The feature is limited to working in only a small handful of supported languages
It depends on a custom-trained vision model that must be approved first

When should you train a custom speech model?

When domain jargon or noisy audio hurts the base model's transcription accuracy
As the default starting point for every transcription task you build
Only to tune text-to-speech output, never to improve speech-to-text recognition
Whenever you need to translate speech into another language

How should speaker-recognition voiceprints be treated?

As sensitive biometric personal data, designed with consent and security controls
As ordinary application data that can be stored and handled with no special privacy controls
As public data that needs no protection once a user is enrolled
As data fully exempt from privacy regulation and consent requirements
