Service 57

Amazon Textract

AI/ML, Documents, API

Amazon Textract is AWS's document-understanding service. It performs OCR and goes beyond plain text by extracting forms (key-value pairs), tables, signatures, and structured fields from identity documents and receipts — reading a document the way a human does rather than returning a flat list of words.

Launched in 2019, it is the right starting point when the goal is "extract structured data from these documents" rather than "train a model on these documents."

What It Returns

DetectDocumentText is plain OCR — words and lines with bounding boxes. AnalyzeDocument is the richer call, taking feature flags: FORMS returns key-value pairs, TABLES returns rows and columns, SIGNATURES returns signature locations, and LAYOUT returns titles, headings, and paragraphs.
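
A minimal sketch of the two calls, assuming boto3 is the client library and credentials are already configured. The helper names (lines_of_text, analyze_request, run_plain_ocr) are illustrative, not part of any AWS SDK; only the Blocks, BlockType, and Text fields follow the response shape.

```python
from typing import Any, Dict, List


def lines_of_text(response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block, in response order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def analyze_request(image_bytes: bytes, features: List[str]) -> Dict[str, Any]:
    """Build AnalyzeDocument keyword arguments with only the requested flags."""
    allowed = {"FORMS", "TABLES", "SIGNATURES", "LAYOUT"}
    unknown = set(features) - allowed
    if unknown:
        raise ValueError(f"unsupported feature types: {sorted(unknown)}")
    return {"Document": {"Bytes": image_bytes}, "FeatureTypes": list(features)}


def run_plain_ocr(path: str) -> List[str]:
    """Call DetectDocumentText on a local image (requires boto3 and AWS access)."""
    import boto3  # imported lazily so the helpers above work without the SDK

    client = boto3.client("textract")
    with open(path, "rb") as handle:
        response = client.detect_document_text(Document={"Bytes": handle.read()})
    return lines_of_text(response)
```

Passing the output of analyze_request to client.analyze_document(**kwargs) would request only the listed features, which keeps the call as small as the task allows.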

AnalyzeID normalizes identity-document fields (name, date of birth, document number) regardless of layout, and AnalyzeExpense extracts merchant, total, and line items from receipts. Each call is metered separately — pick the smallest that gives you what you need.
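
As an illustration of what the specialized calls return, the sketch below flattens the summary fields of an AnalyzeExpense response into a plain dictionary. The function name expense_summary and the sample values in the usage note are hypothetical; the ExpenseDocuments, SummaryFields, Type, and ValueDetection keys reflect the response shape as best understood here and should be checked against the API reference.

```python
from typing import Any, Dict, List


def expense_summary(response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return one {normalized type: value text} dict per expense document."""
    documents = []
    for doc in response.get("ExpenseDocuments", []):
        fields: Dict[str, str] = {}
        for field in doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {}).get("Text")
            if field_type and value is not None:
                # Keep the first occurrence of each normalized type.
                fields.setdefault(field_type, value)
        documents.append(fields)
    return documents
```

With boto3, the input would come from client.analyze_expense(Document=...). AnalyzeID responses can be walked the same way through IdentityDocuments and IdentityDocumentFields.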

API Shapes and Custom Queries

The synchronous API handles images and single-page PDF or TIFF files; the asynchronous API handles multi-page PDF and TIFF files as an S3-based job that publishes a completion notification to SNS. The standard high-volume pattern is documents landing in S3, an S3 event (delivered through EventBridge or an S3 notification) invoking a Lambda that starts the Textract job, and a second Lambda processing the result when SNS fires.
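
The asynchronous flow can be sketched roughly as follows, assuming the client argument is a boto3 Textract client (boto3.client("textract")). The function names start_job and collect_blocks, and the choice of FORMS and TABLES, are illustrative assumptions; in the pattern described above, collect_blocks would run in the Lambda that the SNS message invokes.

```python
from typing import Any, Dict, List


def start_job(client: Any, bucket: str, key: str,
              topic_arn: str, role_arn: str) -> str:
    """Start an asynchronous analysis job for a document already in S3."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
        NotificationChannel={"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def collect_blocks(client: Any, job_id: str) -> List[Dict[str, Any]]:
    """Page through the results of a finished job using NextToken."""
    blocks: List[Dict[str, Any]] = []
    token = None
    while True:
        kwargs: Dict[str, Any] = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_document_analysis(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"job {job_id} is {page['JobStatus']}")
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
```

Results for long documents arrive in pages, so reading only the first response would silently drop blocks.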

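Queries let you ask natural-language questions of a document ("What is the policy holder's name?") through the QUERIES feature type on AnalyzeDocument, without training a model. This is easier than form-based extraction for free-form documents where the answer moves around. Custom Queries, by contrast, refers to optional adapters trained on your own sample documents to improve query accuracy.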
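
A hedged sketch of asking such questions through the QUERIES feature type: the first helper builds AnalyzeDocument arguments, the second pairs each QUERY block with its QUERY_RESULT. The helper names and the POLICY_HOLDER alias are hypothetical; the block fields used here (Query, Relationships of type ANSWER) follow the response shape as understood here.

```python
from typing import Any, Dict, Optional


def query_request(bucket: str, key: str, questions: Dict[str, str]) -> Dict[str, Any]:
    """Build AnalyzeDocument arguments; questions maps alias -> question text."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in questions.items()
            ]
        },
    }


def query_answers(response: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Map each query alias (or its text) to the answer text, or None."""
    by_id = {block["Id"]: block for block in response.get("Blocks", [])}
    answers: Dict[str, Optional[str]] = {}
    for block in by_id.values():
        if block.get("BlockType") != "QUERY":
            continue
        name = block["Query"].get("Alias") or block["Query"]["Text"]
        answer = None
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for result_id in rel["Ids"]:
                result = by_id.get(result_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    answer = result.get("Text")
                    break
        answers[name] = answer
    return answers
```

A query with no ANSWER relationship maps to None, which is worth handling explicitly rather than treating as an empty string.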

Textract vs Rekognition vs Bedrock

Textract: structured extraction from documents (forms, tables, IDs, receipts).

Rekognition: in-the-wild text in photos (signs, license plates), not document structure.

Bedrock multimodal: reasoning about document content, beyond extracting fields.

Common Mistakes

Using the synchronous API for large multi-page PDFs that exceed its size limits — use the asynchronous API.
Turning on every feature flag when you only need plain text, which raises the per-page bill well above plain text detection for no benefit.
Hand-writing a parser for the block tree instead of using an AWS-published helper or library.
Sending PDFs that already have a text layer through OCR instead of just parsing them.
Pushing low-confidence extracted fields straight into a system of record instead of a human review queue.
Using Textract for in-the-wild photo text, where Rekognition is the right tool.
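
To show why the block tree is tedious to walk by hand, here is a minimal sketch that pairs FORMS keys with their values. It is an illustration only, not a replacement for a published helper: it ignores pages, geometry, and confidence, and the function names are hypothetical. It assumes KEY blocks link to their VALUE block through a VALUE relationship and to their words through CHILD relationships.

```python
from typing import Any, Dict, List


def block_text(block: Dict[str, Any], by_id: Dict[str, Dict[str, Any]]) -> str:
    """Join the WORD children of a block; checkboxes yield their status."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def form_fields(blocks: List[Dict[str, Any]]) -> Dict[str, str]:
    """Return {key text: value text} for every KEY block in the list."""
    by_id = {block["Id"]: block for block in blocks}
    fields: Dict[str, str] = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value = " ".join(
                    block_text(by_id[value_id], by_id) for value_id in rel["Ids"]
                )
        fields[block_text(block, by_id)] = value
    return fields
```

Even this stripped-down version needs two levels of ID lookups per field, which is the bookkeeping a maintained helper takes off your hands.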

Best Practices

Pick the smallest API call that returns what you need — detection-only is cheaper than full analysis.
Use the asynchronous API for multi-page or large PDFs.
Parse the block tree with an AWS-published helper rather than reinventing it.
Use FORMS for consistent layouts and Queries for variable ones.
Use AnalyzeID and AnalyzeExpense for identity documents and receipts.
Route below-threshold fields to human review, not directly into your data store.
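
The last practice can be sketched as a simple split on confidence. The record layout (name, value, confidence) and the default threshold of 90 are assumptions for illustration; Textract reports confidence on a 0 to 100 scale, and a suitable cutoff depends on the document type and the cost of an error.

```python
from typing import Any, Dict, List, Tuple

Field = Dict[str, Any]


def route_fields(fields: List[Field],
                 threshold: float = 90.0) -> Tuple[List[Field], List[Field]]:
    """Split fields into (accepted, needs_review) by confidence."""
    accepted: List[Field] = []
    needs_review: List[Field] = []
    for field in fields:
        if field["confidence"] >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review
```

Only the accepted list would be written to the data store; the other list goes to whatever review queue the team operates.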

Comparable services: Document AI (GCP), Azure AI Document Intelligence (Azure)

Knowledge Check

How does Textract go beyond plain OCR?

It extracts forms, tables, signatures, and fields from IDs and receipts, not just words
It translates the document's text into other languages
It reads in-the-wild text from photos of street signs
It generates a natural-language summary of the document and answers questions about its contents

For a large multi-page PDF, which Textract API shape is required?

The asynchronous API — an S3 job with SNS notification
The synchronous API, which has no page-count or file-size limits
The streaming API that pages results in real time
Custom queries only, with no async job needed

What do Textract queries enable?

Asking natural-language questions of a document without training a model
Translating the whole document into a range of other languages on the fly
Training a custom OCR model on your documents
Detecting in-the-wild text in scene photos

What should happen to fields Textract extracts with low confidence?

Route them to a human review queue, not straight into a system of record
Accept every one of them automatically so the downstream pipeline keeps running at full speed
Discard the entire document and send it back to be re-scanned from scratch
Re-run just the uncertain fields through Rekognition for a second opinion
