Service 55

Azure AI Vision

AI API

Azure AI Vision is a pre-trained computer-vision API: send an image or video, get back text, objects, descriptions, or faces — no model training required. It is part of Azure AI services (formerly Cognitive Services), billed per transaction, and it adds sight to an application in an afternoon rather than a research project.

Its place in this chapter is the call side of build-versus-call. For standard vision tasks (reading text from an image, detecting common objects, describing a scene), a pre-trained API is usually faster to ship, cheaper, and more accurate than a model you would train yourself. Content moderation now lives in a dedicated service, Azure AI Content Safety, which screens both text and images. Only a genuinely domain-specific recognition need justifies custom training, first with Custom Vision and only then with Azure Machine Learning.

Optical Character Recognition

The Read (OCR) capability extracts printed and handwritten text from images and documents at scale, with layout awareness. It is the most widely used vision feature, covering tasks such as digitizing forms, signs, and screenshots, and it is the foundation on which Document Intelligence builds structured extraction.
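
As a minimal sketch of what a Read call can look like, the Python below builds a request and flattens the recognized lines. The route, the `2023-10-01` API version, the `features=read` parameter, and the `readResult.blocks[].lines[].text` response shape are assumptions based on the Image Analysis 4.0 REST API and should be checked against the current documentation. The endpoint, key, and function names are placeholders.

```python
import json
import urllib.parse
import urllib.request

API_VERSION = "2023-10-01"  # assumed Image Analysis 4.0 REST version


def build_read_request(endpoint: str, key: str, image_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST that asks only for the Read feature."""
    query = urllib.parse.urlencode({"api-version": API_VERSION, "features": "read"})
    url = f"{endpoint.rstrip('/')}/computervision/imageanalysis:analyze?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # resource key (placeholder value)
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_lines(response: dict) -> list[str]:
    """Flatten readResult.blocks[].lines[].text (assumed shape) into strings."""
    blocks = response.get("readResult", {}).get("blocks", [])
    return [line["text"] for block in blocks for line in block.get("lines", [])]


def read_text(endpoint: str, key: str, image_url: str) -> list[str]:
    """Send the request and return recognized lines (needs network and a real key)."""
    request = build_read_request(endpoint, key, image_url)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return extract_lines(json.load(resp))
```

Separating request construction from response parsing keeps both halves testable without a live resource; only `read_text` touches the network.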

Image Analysis

Image analysis tags content, detects and locates objects, generates captions, and reads other visual properties from a single API call. For cataloguing, accessibility (alt text), and content organization, it provides usable results immediately with no training data.
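
To illustrate the "single API call" point, the sketch below assumes a response to a request made with `features=caption,tags,objects` and reduces it to alt text, tags, and located objects. The field names (`captionResult`, `tagsResult.values`, `objectsResult.values`, `boundingBox`) follow the Image Analysis 4.0 REST response as best understood here and are assumptions. `summarize_analysis` and the 0.7 confidence cutoff are hypothetical choices for illustration.

```python
FEATURES = "caption,tags,objects"  # assumed value of the `features` query parameter


def summarize_analysis(response: dict, min_confidence: float = 0.7) -> dict:
    """Keep only results at or above min_confidence from one analysis response."""
    caption = response.get("captionResult") or {}
    tags = response.get("tagsResult", {}).get("values", [])
    objects = response.get("objectsResult", {}).get("values", [])

    confident_caption = caption.get("confidence", 0.0) >= min_confidence
    return {
        # A low-confidence caption is dropped rather than published as alt text.
        "alt_text": caption.get("text") if confident_caption else None,
        "tags": [t["name"] for t in tags if t["confidence"] >= min_confidence],
        # Each object carries a bounding box plus one or more candidate labels;
        # this sketch keeps the first label only.
        "objects": [
            (o["tags"][0]["name"], o["boundingBox"])
            for o in objects
            if o.get("tags") and o["tags"][0]["confidence"] >= min_confidence
        ],
    }


if __name__ == "__main__":
    sample = {
        "captionResult": {"text": "a dog on a beach", "confidence": 0.84},
        "tagsResult": {"values": [{"name": "dog", "confidence": 0.99},
                                  {"name": "kite", "confidence": 0.31}]},
        "objectsResult": {"values": [{
            "boundingBox": {"x": 10, "y": 20, "w": 100, "h": 80},
            "tags": [{"name": "dog", "confidence": 0.9}],
        }]},
    }
    print(summarize_analysis(sample))
```

Filtering by confidence on the client side is a design choice, not a service requirement: for accessibility text in particular, omitting a doubtful caption is often safer than showing a wrong one.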

Face

The Face capability detects faces and facial attributes and supports verification and identification. Because facial recognition is sensitive, Microsoft gates the higher-risk capabilities (identification) behind a Limited Access approval process and Responsible AI commitments — a reminder that some AI features carry obligations beyond the technical.

Custom Vision

When the built-in models do not recognize your specific categories — a particular product defect, a specialized part — Custom Vision trains a classifier or detector on your own labeled images with a small dataset and no ML expertise. It is the bridge between pure pre-trained APIs and full Azure Machine Learning: custom recognition without building a training pipeline.
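
Once a Custom Vision classifier is trained and published, calling it looks much like calling the pre-trained API. The sketch below assumes the v3.0 prediction route and a response of the form `{"predictions": [{"tagName": ..., "probability": ...}]}`. Both should be verified against the prediction URL shown in your own project. The project id, published iteration name, helper names, and the 0.8 threshold are hypothetical.

```python
from typing import Optional


def prediction_url(endpoint: str, project_id: str, published_name: str) -> str:
    """Assumed v3.0 classification route for posting raw image bytes."""
    return (
        f"{endpoint.rstrip('/')}/customvision/v3.0/Prediction/{project_id}"
        f"/classify/iterations/{published_name}/image"
    )


def top_prediction(response: dict, threshold: float = 0.8) -> Optional[str]:
    """Return the most probable tag, or None if nothing clears the threshold."""
    predictions = response.get("predictions", [])
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["probability"])
    # Below the threshold, return None so the caller can route to manual review.
    return best["tagName"] if best["probability"] >= threshold else None


if __name__ == "__main__":
    sample = {"predictions": [
        {"tagName": "scratch", "probability": 0.93},
        {"tagName": "ok", "probability": 0.07},
    ]}
    print(top_prediction(sample))
```

Returning `None` below the threshold reflects the small-dataset setting described above: a classifier trained on few images benefits from an explicit "not sure" path.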

Common Mistakes

Building a custom model in Azure Machine Learning for a standard vision task a pre-trained API already does better and cheaper.
Using Custom Vision for a task the built-in models already cover, adding labeling effort for no gain.
Ignoring the Limited Access and Responsible AI requirements on Face identification, then being blocked at deployment.
Sending full-resolution images when a smaller size meets accuracy needs, inflating upload time and latency for no gain in results.
Treating per-transaction pricing as negligible at scale, then being surprised by the bill on a high-volume pipeline.
Assuming the API understands domain-specific categories it was never trained on instead of using Custom Vision.
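
The oversized-image mistake in the list above can be avoided by computing a smaller target size before upload. The helper below is a minimal sketch of that arithmetic only; `fit_within` and the 1600-pixel default are hypothetical, and the right limit for a given workload has to be found by testing accuracy at several sizes.

```python
def fit_within(width: int, height: int, max_side: int = 1600) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Aspect ratio is preserved and images that already fit are left unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Guard against a dimension rounding down to zero on extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(4000, 3000))  # a 12-megapixel photo
    print(fit_within(800, 600))    # already small enough
```

The resulting dimensions can be passed to whatever imaging library the pipeline already uses to do the actual resampling.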

Best Practices

Use pre-trained AI Vision for standard tasks such as OCR, object detection, and captioning; use Azure AI Content Safety for moderation.
Use Custom Vision only when built-in models cannot recognize your specific categories.
Plan for the Limited Access approval and Responsible AI obligations on Face identification.
Right-size images to balance accuracy against upload size and latency.
Estimate transaction volume and cost before deploying a high-throughput vision pipeline.
Drop to Azure Machine Learning only when neither the pre-trained API nor Custom Vision fits.
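
The volume estimate recommended above is simple arithmetic, sketched below. All numbers are placeholders: the price per 1,000 transactions must come from the current Azure pricing page for your region and tier, and the assumption that each requested feature on an image counts as one transaction should be confirmed there as well.

```python
def estimate_monthly_cost(
    images_per_day: int,
    transactions_per_image: int,
    price_per_1000: float,
    days: int = 30,
) -> tuple[int, float]:
    """Return (monthly transactions, monthly cost) for a steady daily volume."""
    transactions = images_per_day * transactions_per_image * days
    return transactions, transactions / 1000 * price_per_1000


if __name__ == "__main__":
    # Hypothetical pipeline: 50,000 images a day, two transactions per image,
    # at a placeholder price of 1.00 per 1,000 transactions.
    total, cost = estimate_monthly_cost(50_000, 2, 1.00)
    print(f"{total:,} transactions/month, about {cost:,.2f} per month")
```

With these placeholder inputs the pipeline generates 3,000,000 transactions a month, which is the kind of figure worth knowing before launch rather than after the first invoice.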

Comparable services: AWS Rekognition, GCP Vision AI

Knowledge Check

When should you use a pre-trained AI Vision API instead of training a model in Azure Machine Learning?

For standard tasks like OCR, object detection, and captioning, where a ready-made model is usually faster, cheaper, and more accurate
In every case, because the Azure Machine Learning platform is unable to build or train any kind of computer-vision image model at all
Only when the input is video frames rather than individual still images
Only when your labeled training dataset exceeds roughly one million images

What is Custom Vision for?

Training a classifier or detector on your own labeled images when built-in models do not recognize your specific categories
Running the largest multimodal GPT model against your uploaded images
Replacing the built-in Read OCR capability so that all text extraction from your scanned images now runs through Custom Vision instead
Hosting and serving a vision model that was trained somewhere else

Why are some Face capabilities gated behind a Limited Access process?

Facial identification is sensitive, so Microsoft requires approval and Responsible AI commitments
They remain in a closed private preview stage and are simply not yet generally available to any user at all
They require you to supply and train a custom recognition model first
They are only available from a limited set of programming languages
