Amazon Textract

AI/ML, Documents, API

Amazon Textract is AWS's document-understanding service. It performs OCR and goes beyond plain text by extracting forms (key-value pairs), tables, signatures, and structured fields from identity documents and receipts — reading a document the way a human does rather than returning a flat list of words.

Launched in 2019, it is the right starting point when the goal is "extract structured data from these documents" rather than "train a model on these documents."

What It Returns

DetectDocumentText is plain OCR — words and lines with bounding boxes. AnalyzeDocument is the richer call, taking feature flags: FORMS returns key-value pairs, TABLES returns rows and columns, SIGNATURES returns signature locations, and LAYOUT returns titles, headings, and paragraphs.
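As a rough illustration of the two calls described above, the sketch below requests only the FORMS and TABLES flags and then reads LINE blocks out of the response. It assumes boto3 is installed and AWS credentials are configured; the function names are hypothetical, and the helpers work on any response-shaped dictionary, so they can be tried without calling the service.

```python
from collections import Counter
from typing import Any


def analyze_forms_and_tables(document_bytes: bytes) -> dict[str, Any]:
    """Call AnalyzeDocument with only the feature flags that are needed."""
    # Imported here (an illustrative choice) so the helpers below can be
    # used on saved responses even where boto3 is not installed.
    import boto3

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["FORMS", "TABLES"],  # leave out flags you do not need
    )


def lines_of_text(response: dict[str, Any]) -> list[str]:
    """Return the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def count_block_types(response: dict[str, Any]) -> dict[str, int]:
    """Count blocks per BlockType, a quick way to see what a call returned."""
    return dict(Counter(block["BlockType"] for block in response.get("Blocks", [])))
```

Counting block types on a first response is a cheap way to confirm that a feature flag is actually producing the blocks you are paying for.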

AnalyzeID normalizes identity-document fields (name, date of birth, document number) regardless of layout, and AnalyzeExpense extracts merchant, total, and line items from receipts. Each call is metered separately — pick the smallest that gives you what you need.
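To make the receipt case concrete, here is a minimal sketch that flattens an AnalyzeExpense response into one dictionary per receipt. It assumes the response shape returned by boto3's `analyze_expense` (ExpenseDocuments, SummaryFields, LineItemGroups); the function names and the output layout are illustrative choices, not part of the API.

```python
from typing import Any


def _typed_values(fields: list[dict[str, Any]]) -> dict[str, str]:
    """Map each field's normalized type (e.g. "TOTAL") to its detected text."""
    values: dict[str, str] = {}
    for field in fields:
        field_type = field.get("Type", {}).get("Text")
        text = field.get("ValueDetection", {}).get("Text")
        if field_type and text is not None:
            values.setdefault(field_type, text)  # keep the first occurrence
    return values


def summarize_expense(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return summary fields and line items for every receipt in the response."""
    receipts = []
    for document in response.get("ExpenseDocuments", []):
        line_items = [
            _typed_values(item.get("LineItemExpenseFields", []))
            for group in document.get("LineItemGroups", [])
            for item in group.get("LineItems", [])
        ]
        receipts.append(
            {
                "summary": _typed_values(document.get("SummaryFields", [])),
                "line_items": line_items,
            }
        )
    return receipts
```

In practice the input would come from something like `boto3.client("textract").analyze_expense(Document={"Bytes": data})`.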

API Shapes and Custom Queries

The synchronous API handles images and single-page documents; larger multi-page PDFs go through the asynchronous API, which runs as a job against a file in S3 and publishes an SNS notification when it finishes. The standard high-volume pattern is documents landing in S3, an S3 event (delivered directly or through an EventBridge rule) invoking a Lambda that starts a Textract job, and a second Lambda processing the result when SNS fires.
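The asynchronous flow can be sketched in three small functions: start a job, read the job ID out of the SNS event that a Lambda receives, and page through the results. This assumes a boto3 Textract client is passed in as `client`; the function names, the TABLES-only feature list, and the error handling are illustrative, and the role ARN must allow Textract to publish to the topic.

```python
import json
from typing import Any


def start_analysis(client: Any, bucket: str, key: str,
                   topic_arn: str, role_arn: str) -> str:
    """Start an asynchronous analysis job for a document already in S3."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
        NotificationChannel={"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def job_from_sns_event(event: dict[str, Any]) -> tuple[str, str]:
    """Extract (job_id, status) from the SNS record delivered to a Lambda."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def collect_blocks(client: Any, job_id: str) -> list[dict[str, Any]]:
    """Fetch every page of results for a finished job, following NextToken."""
    blocks: list[dict[str, Any]] = []
    kwargs: dict[str, Any] = {"JobId": job_id}
    while True:
        page = client.get_document_analysis(**kwargs)
        if page.get("JobStatus") != "SUCCEEDED":
            raise RuntimeError(f"job {job_id} is {page.get('JobStatus')}")
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
        kwargs["NextToken"] = token
```

A Lambda subscribed to the topic would call `job_from_sns_event(event)` and, when the status is `SUCCEEDED`, pass the job ID to `collect_blocks`. Results for long documents span several pages, so skipping the `NextToken` loop silently truncates the output.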

Custom queries let you ask natural-language questions of a document ("What is the policy holder's name?") without training a model. Each question is passed to AnalyzeDocument under the QUERIES feature flag, which is easier than form-based extraction for free-form documents where the answer moves around.
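A query request and its answers might be handled along these lines. The sketch assumes the AnalyzeDocument request and response shapes used by boto3 (QUERY blocks linked to QUERY_RESULT blocks through an ANSWER relationship); the function names and the rule of keeping the highest-confidence answer per alias are illustrative choices.

```python
from typing import Any


def build_queries_request(document_bytes: bytes,
                          questions: dict[str, str]) -> dict[str, Any]:
    """Build AnalyzeDocument keyword arguments from an alias -> question map."""
    return {
        "Document": {"Bytes": document_bytes},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in questions.items()
            ]
        },
    }


def answers_by_alias(blocks: list[dict[str, Any]]) -> dict[str, tuple[str, float]]:
    """Return {alias: (answer_text, confidence)} for every answered query."""
    by_id = {block["Id"]: block for block in blocks}
    answers: dict[str, tuple[str, float]] = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        # Fall back to the question text when no alias was supplied.
        alias = block["Query"].get("Alias") or block["Query"]["Text"]
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for result_id in relationship["Ids"]:
                result = by_id.get(result_id)
                if result is None or result["BlockType"] != "QUERY_RESULT":
                    continue
                candidate = (result["Text"], result["Confidence"])
                # Keep only the highest-confidence answer per alias.
                if alias not in answers or candidate[1] > answers[alias][1]:
                    answers[alias] = candidate
    return answers
```

The request dictionary is meant to be unpacked into the client call, for example `client.analyze_document(**build_queries_request(data, {"HOLDER": "What is the policy holder's name?"}))`, and the resulting `Blocks` list passed to `answers_by_alias`.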

Textract vs Rekognition vs Bedrock

Textract — structured extraction from documents — forms, tables, IDs, receipts.

Rekognition — in-the-wild text in photos (signs, license plates), not document structure.

Bedrock multimodal — reasoning about document content, beyond extracting fields.

Common Mistakes
  • Using the synchronous API for large multi-page PDFs that exceed its size limits — use the asynchronous API.
  • Turning on every feature flag when you only need plain text, paying for each extra feature on every page.
  • Hand-writing a parser for the block tree instead of using an AWS-published helper or library.
  • Sending PDFs that already have a text layer through OCR instead of just parsing them.
  • Pushing low-confidence extracted fields straight into a system of record instead of a human review queue.
  • Using Textract for in-the-wild photo text, where Rekognition is the right tool.

Best Practices
  • Pick the smallest API call that returns what you need — detection-only is cheaper than full analysis.
  • Use the asynchronous API for multi-page or large PDFs.
  • Parse the block tree with an AWS-published helper rather than reinventing it.
  • Use FORMS for consistent layouts and Queries for variable ones.
  • Use AnalyzeID and AnalyzeExpense for identity documents and receipts.
  • Route below-threshold fields to human review, not directly into your data store.
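
The human-review practice in the list above can be sketched as a simple split on confidence. The `ExtractedField` type, the `route_fields` name, and the default threshold of 90 are assumptions for illustration; Textract reports confidence on a 0 to 100 scale, and the right cut-off depends on the document type and the cost of an error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    """A single extracted value (hypothetical container, not a Textract type)."""
    name: str
    value: str
    confidence: float  # 0-100, as reported by Textract


def route_fields(
    fields: list[ExtractedField], threshold: float = 90.0
) -> tuple[list[ExtractedField], list[ExtractedField]]:
    """Split fields into (accepted, needs_review) by confidence threshold."""
    accepted: list[ExtractedField] = []
    needs_review: list[ExtractedField] = []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review
```

Only the accepted list would be written to the system of record; the other list goes to whatever review queue the pipeline uses.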

Comparable services: Document AI (GCP), Azure AI Document Intelligence (Azure)

Knowledge Check

How does Textract go beyond plain OCR?

  • It extracts forms, tables, signatures, and fields from IDs and receipts, not just words
  • It translates the document's text into other languages
  • It reads in-the-wild text from photos of street signs
  • It generates a natural-language summary of the document and answers questions about its contents

For a large multi-page PDF, which Textract API shape is required?

  • The asynchronous API — an S3 job with SNS notification
  • The synchronous API, which has no page-count or file-size limits
  • The streaming API that pages results in real time
  • Custom queries only, with no async job needed

What do Textract custom queries enable?

  • Asking natural-language questions of a document without training a model
  • Translating the whole document into a range of other languages on the fly
  • Training a custom OCR model on your documents
  • Detecting in-the-wild text in scene photos

What should happen to fields Textract extracts with low confidence?

  • Route them to a human review queue, not straight into a system of record
  • Accept every one of them automatically so the downstream pipeline keeps running at full speed
  • Discard the entire document and send it back to be re-scanned from scratch
  • Re-run just the uncertain fields through Rekognition for a second opinion
