Service 55

Amazon Rekognition

AI/ML · Vision · API

Amazon Rekognition is AWS's computer-vision API. You send an image or video; it returns detected objects, scenes, faces, text, celebrities, or content-moderation labels as structured output — no model to train, no GPU to provision, no ML expertise required.

Launched in 2016, it is the right starting point when the question is "what is in this image or video" rather than "let's train a custom model for our specific problem."

What It Detects

Core features include label detection (objects and scenes with confidence scores), face detection and analysis (bounding boxes plus attributes), face comparison and search against an indexed collection, content moderation for unsafe content, celebrity recognition, PPE detection for safety monitoring, and text-in-the-wild detection.
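As a rough illustration of label detection, the sketch below wraps the DetectLabels call and returns name and confidence pairs. It assumes a boto3 Rekognition client is passed in; the function name, bucket, key, and the 80.0 threshold are placeholders chosen for the example, not recommendations.

```python
from typing import Any, List, Tuple


def detect_labels(
    client: Any,
    bucket: str,
    key: str,
    min_confidence: float = 80.0,  # illustrative value, tune per task
    max_labels: int = 10,
) -> List[Tuple[str, float]]:
    """Return (label name, confidence) pairs for an image stored in S3."""
    response = client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=max_labels,
        MinConfidence=min_confidence,
    )
    return [(label["Name"], label["Confidence"]) for label in response["Labels"]]


# Typical wiring (needs boto3 and AWS credentials; names are placeholders):
#   import boto3
#   client = boto3.client("rekognition")
#   detect_labels(client, "example-bucket", "photos/dock.jpg")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before any AWS credentials are involved.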

Custom Labels trains a domain-specific classifier on your own labeled images — for a specific product line or defect pattern the built-in labels do not cover.

Image vs Video, and Faces

The image API is synchronous — send one image, get a response in milliseconds to seconds. The video API is asynchronous — start a job on an S3-hosted video, get notified via SNS, and read time-coded results. A streaming-video variant integrates with Kinesis Video Streams.
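The asynchronous video flow described above might look roughly like this for label detection: one call starts the job, and a second reads the paginated, time-coded results once the SNS notification reports completion. The helper names, ARNs, bucket, and key are hypothetical; the client is assumed to be a boto3 Rekognition client.

```python
from typing import Any, Dict, Iterator, Tuple


def start_label_job(
    client: Any, bucket: str, key: str, sns_topic_arn: str, role_arn: str
) -> str:
    """Start an asynchronous label-detection job on an S3-hosted video."""
    response = client.start_label_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        # Rekognition publishes the completion status to this SNS topic.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def iter_video_labels(client: Any, job_id: str) -> Iterator[Tuple[int, str, float]]:
    """Yield (timestamp_ms, label name, confidence) across all result pages."""
    kwargs: Dict[str, Any] = {"JobId": job_id, "SortBy": "TIMESTAMP"}
    while True:
        page = client.get_label_detection(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError("job not finished or failed: " + page["JobStatus"])
        for item in page["Labels"]:
            label = item["Label"]
            yield item["Timestamp"], label["Name"], label["Confidence"]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # request the next page of results
```

In this sketch `iter_video_labels` is meant to be called from whatever handles the SNS message, rather than polled in a loop.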

A face collection stores face vectors (not original images) so SearchFacesByImage can find matches. Those vectors are still PII in many jurisdictions, and face features carry Region-specific restrictions — check the current documentation before building on them.
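A minimal sketch of searching a collection with SearchFacesByImage follows, assuming a boto3 Rekognition client and a collection that already holds indexed faces. The function name, collection ID, and the 95.0 threshold are illustrative assumptions; the right threshold depends on the use case.

```python
from typing import Any, List, Optional, Tuple


def search_face(
    client: Any,
    collection_id: str,
    bucket: str,
    key: str,
    threshold: float = 95.0,  # illustrative bar for face matching
    max_faces: int = 1,
) -> List[Tuple[str, Optional[str], float]]:
    """Return (face id, external image id, similarity) for each match."""
    response = client.search_faces_by_image(
        CollectionId=collection_id,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        FaceMatchThreshold=threshold,
        MaxFaces=max_faces,
    )
    return [
        (
            match["Face"]["FaceId"],
            match["Face"].get("ExternalImageId"),  # only set if supplied at indexing
            match["Similarity"],
        )
        for match in response["FaceMatches"]
    ]
```

The results carry face IDs and similarity scores, not images, which matches the point above that the collection holds vectors rather than the source photographs.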

Rekognition vs Textract vs Bedrock

Rekognition — "what is in this image or video" — objects, faces, moderation, in-the-wild text.

Textract — document understanding — structured text, forms, and tables from documents.

Bedrock multimodal — reasoning about image content in natural language, beyond structured detection.

Common Mistakes

Stitching repeated image-API calls together to analyze a video instead of using the asynchronous video API.
Leaving confidence thresholds at the default for tasks like face matching where a higher bar is needed.
Sending large images as inline request bytes instead of an S3 reference, which inflates every request and runs into the smaller size limit that applies to inline bytes.
Using Rekognition text-detection for document OCR, where Textract is the right tool.
Storing more face data, for longer, than needed — even allowed face features carry privacy and regulatory weight.
Trying to coerce built-in labels for a domain-specific task instead of training Custom Labels.
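The threshold point in the list above can be made concrete with a small post-filter over a DetectModerationLabels response. This is only a sketch: the category names and numeric thresholds below are hypothetical examples, and the response is assumed to have been requested with a low MinConfidence so that the per-category bars can be applied afterwards.

```python
from typing import Any, Dict, List

# Hypothetical per-category thresholds; names and values are illustrative only.
THRESHOLDS: Dict[str, float] = {"Violence": 60.0, "Alcohol": 90.0}
DEFAULT_THRESHOLD = 80.0


def flagged_labels(
    response: Dict[str, Any],
    thresholds: Dict[str, float] = THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> List[str]:
    """Return moderation label names that clear their category's threshold."""
    flagged = []
    for label in response.get("ModerationLabels", []):
        # Top-level labels have an empty ParentName, so fall back to Name.
        category = label.get("ParentName") or label["Name"]
        if label["Confidence"] >= thresholds.get(category, default):
            flagged.append(label["Name"])
    return flagged


# The response would come from something like:
#   client.detect_moderation_labels(Image=..., MinConfidence=50)
```

Keeping the thresholds in one table makes the per-task choice explicit and reviewable instead of buried in call sites.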

Best Practices

Use the image API for synchronous workflows and the video API for stored or streaming video.
Tune confidence thresholds to the task rather than relying on the default.
Use S3 references for large images.
Train Custom Labels for domain-specific detection.
Store the minimum face data for the minimum time and document its purpose.
Use Textract for document OCR and Bedrock for multimodal reasoning.
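One way to apply the S3-reference practice is a small helper that builds the `Image` parameter and refuses oversized inline payloads. The helper name is hypothetical, and the 5 MB inline limit is an assumption to check against the current Rekognition quotas.

```python
from typing import Any, Dict, Optional

# Assumed inline-bytes limit (5 MB); verify against current Rekognition quotas.
MAX_INLINE_BYTES = 5 * 1024 * 1024


def image_param(
    data: Optional[bytes] = None,
    bucket: Optional[str] = None,
    key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the Image argument, preferring an S3 reference when one is given."""
    if bucket and key:
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    if data is None:
        raise ValueError("provide either image bytes or an S3 bucket and key")
    if len(data) > MAX_INLINE_BYTES:
        raise ValueError("image too large to send inline; upload to S3 and pass a reference")
    return {"Bytes": data}
```

The returned dictionary can be passed as the `Image` argument to the image operations, for example `client.detect_labels(Image=image_param(bucket="example-bucket", key="photo.jpg"))`.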

Comparable services: GCP (Vision AI, Video Intelligence); Azure (Azure AI Vision)

Knowledge Check

What is Rekognition best suited for?

Answering "what is in this image or video" — objects, faces, moderation — no model to train
Extracting key-value form fields, signatures, and tables from scanned documents and receipts
Translating extracted text between dozens of languages
Training arbitrary custom vision models from scratch

How do the Rekognition image and video APIs differ?

The image API is synchronous; the video API is an asynchronous S3 job
The image API is the asynchronous one; the video API answers synchronously in seconds
Both APIs are synchronous and return in milliseconds
The video API only runs on edge devices, not in the cloud

What does a Rekognition face collection store?

Face vectors (not original images), still PII under many jurisdictions' laws
The full original source photographs, each one indexed by the person's name and ID
Only a running count of the faces it has seen
Nothing — it processes each face without storing anything

For extracting structured data from a scanned form, which service is the right choice?

Amazon Textract — document understanding with forms and tables
Rekognition text detection run across the full scanned form image
Amazon Comprehend to analyze the form's contents and pull out its labelled fields
Amazon Polly to read each of the form fields aloud as narrated audio
