Intelligent Document Processing on AWS

OCR, Data Extraction, AI Classification, and Document Analysis — Running Entirely Within Your AWS Account

Organizations process thousands of documents daily — invoices, contracts, applications, claims, correspondence, regulatory filings, and forms — most of which arrive as unstructured content: scanned PDFs, images, faxes, and multi-page paper submissions. Converting this unstructured content into classified, metadata-enriched, workflow-ready records is the bottleneck that slows intake, delays decisions, and forces expensive manual data entry.

Intelligent document processing (IDP) automates this conversion. FormKiQ provides a multi-layer IDP pipeline that runs entirely within your own AWS account: foundational OCR text extraction, structured data capture with Amazon Textract, and AI-powered classification, enrichment, and analysis with Amazon Bedrock. Your documents never leave your cloud environment, and every processing action is logged in your audit trail. The output (classified, metadata-enriched documents) feeds directly into FormKiQ's workflows, search indexes, and governance controls.

[Figure: FormKiQ AI-powered document processing workflow with metadata extraction and human review: automation and AI with humans in the loop]

What Is Intelligent Document Processing?

IDP extends traditional OCR (optical character recognition) with machine learning and AI capabilities that understand document structure, extract meaning, and make classification decisions rather than only reading text:

Processing Layer	What It Does	Technology
OCR	Extracts raw text from scanned documents, images, and image-based PDFs	Tesseract (all editions), Amazon Textract (Essentials+)
Structured extraction	Identifies and extracts tables, form fields, key-value pairs, and spatial relationships	Amazon Textract with custom mappings
Document classification	Determines the type of document (invoice, contract, application, correspondence) and routes it accordingly	Amazon Bedrock LLMs (Advanced/Enterprise)
Metadata enrichment	Extracts entities (names, dates, amounts, identifiers) and applies them as structured metadata	Amazon Bedrock LLMs (Advanced/Enterprise)
Content analysis	Analyzes document content for sensitivity, compliance, completeness, or domain-specific criteria	Amazon Bedrock LLMs (Advanced/Enterprise)

The key distinction: OCR tells you what the text says. IDP tells you what the document means.

The FormKiQ IDP Pipeline

FormKiQ's IDP capabilities are layered — each layer builds on the previous, and organizations can enable the layers they need without activating the full pipeline:

Layer 1: OCR — Text Extraction

Tesseract (All Editions)

FormKiQ Core includes OCR processing using Tesseract, an open-source OCR engine that provides reliable text extraction for standard digitization workflows. Tesseract processing runs within your AWS account using Lambda functions, with extracted text stored as part of the document's metadata record and available for full-text indexing and search.

Best for: scanned typed documents, printed forms, and document types where text is clearly legible and consistently formatted.

Amazon Textract (Essentials and above)

From Essentials onward, FormKiQ provides OCR and structured extraction using Amazon Textract — AWS's managed machine learning service for document text and data extraction.

Capability	Tesseract (Core)	Textract (Essentials+)
Raw text extraction	✓	✓
Handwriting recognition	Limited	✓
Table extraction	—	✓ (structured table data)
Form field extraction	—	✓ (key-value pairs)
Layout understanding	—	✓ (spatial relationships preserved)
Multi-column documents	Limited	✓
Low-quality scans	Limited	Higher accuracy
Custom extraction mappings	—	✓ (map fields to FormKiQ metadata)
Processing location	Your AWS account (Lambda)	Your AWS account (Textract API)

Layer 2: Structured Extraction — Tables, Forms, and Key-Value Pairs

Amazon Textract goes beyond raw text to extract structured data that preserves the meaning and relationships within documents:

Table extraction — identifies tables within documents and extracts their content as structured data with rows, columns, and cell values preserved
Form field extraction — identifies labeled fields (key-value pairs) and extracts both the label and the value as structured data
Custom extraction mappings — configurable field-level extraction that maps specific document elements to FormKiQ metadata attributes
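
To make the idea of "rows, columns, and cell values preserved" concrete, here is a minimal sketch that rebuilds a table from blocks shaped like an Amazon Textract AnalyzeDocument response (TABLE, CELL, and WORD blocks linked by CHILD relationships). The sample blocks are hand-written for illustration, the function names are hypothetical, and this is not FormKiQ's internal implementation.

```python
from collections import defaultdict


def cell_text(cell, blocks_by_id):
    """Join the WORD children of a CELL block into one string."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_tables(blocks):
    """Return every TABLE block as a list of rows, each a list of cell strings."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        grid = defaultdict(dict)  # row index -> {column index: cell text}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks_by_id[cell_id]
                if cell["BlockType"] == "CELL":
                    grid[cell["RowIndex"]][cell["ColumnIndex"]] = cell_text(cell, blocks_by_id)
        # RowIndex and ColumnIndex are 1-based; missing cells become empty strings.
        width = max((max(cols) for cols in grid.values()), default=0)
        tables.append([[grid[r].get(c, "") for c in range(1, width + 1)] for r in sorted(grid)])
    return tables


def _cell(cell_id, row, col, word_ids):
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


def _word(word_id, text):
    return {"Id": word_id, "BlockType": "WORD", "Text": text}


# Hand-written sample in the Textract block shape (not real service output).
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2"]),
    _cell("c3", 2, 1, ["w3", "w4"]), _cell("c4", 2, 2, ["w5"]),
    _word("w1", "Item"), _word("w2", "Amount"),
    _word("w3", "Consulting"), _word("w4", "hours"), _word("w5", "1200.00"),
]

tables = extract_tables(sample_blocks)
```

With the sample above, `tables` holds one table with a header row and one line-item row, which is the kind of structure that can then be stored as invoice line items.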

What custom mappings enable:

Document Type	Extracted Fields	Metadata Result
Invoice	Vendor name, invoice number, date, line items, total amount	Structured metadata for AP automation, search, and retention
Contract	Counterparty, effective date, expiry date, contract value	Contract metadata for lifecycle tracking and renewal alerting
Application form	Applicant name, date of birth, address, application type	Application metadata for eligibility routing and case file creation
Insurance claim	Claimant, policy number, date of loss, claim amount	Claims metadata for intake routing and adjudication workflows
Tax form	Taxpayer ID, filing period, reported amounts	Tax document metadata for compliance filing and audit readiness
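
As a rough illustration of how a custom mapping might turn extracted form fields into metadata, the sketch below pairs KEY and VALUE blocks from a Textract-shaped response and then applies a label-to-attribute mapping. The mapping format, the attribute keys (`invoiceNumber`, `vendorName`, `totalAmount`), and the function names are assumptions made for this example; FormKiQ's actual mapping configuration may look different.

```python
def block_text(block, by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words.extend(by_id[i]["Text"] for i in rel["Ids"] if by_id[i]["BlockType"] == "WORD")
    return " ".join(words)


def key_value_pairs(blocks):
    """Pair each KEY block's label text with the text of its linked VALUE block(s)."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET" or "KEY" not in block.get("EntityTypes", []):
            continue
        value_ids = [i for rel in block.get("Relationships", [])
                     if rel["Type"] == "VALUE" for i in rel["Ids"]]
        pairs[block_text(block, by_id)] = " ".join(block_text(by_id[i], by_id) for i in value_ids)
    return pairs


def apply_mapping(pairs, mapping):
    """Map extracted labels to metadata attribute keys; unmapped labels are dropped."""
    normalized = {label.strip().rstrip(":").strip().lower(): value
                  for label, value in pairs.items()}
    return {attr: normalized[label] for label, attr in mapping.items() if label in normalized}


def _kv(block_id, entity, relationships):
    return {"Id": block_id, "BlockType": "KEY_VALUE_SET",
            "EntityTypes": [entity], "Relationships": relationships}


def _word(word_id, text):
    return {"Id": word_id, "BlockType": "WORD", "Text": text}


# Hand-written sample in the Textract block shape (not real service output).
sample_blocks = [
    _kv("k1", "KEY", [{"Type": "VALUE", "Ids": ["v1"]}, {"Type": "CHILD", "Ids": ["w1", "w2"]}]),
    _kv("v1", "VALUE", [{"Type": "CHILD", "Ids": ["w3"]}]),
    _kv("k2", "KEY", [{"Type": "VALUE", "Ids": ["v2"]}, {"Type": "CHILD", "Ids": ["w4"]}]),
    _kv("v2", "VALUE", [{"Type": "CHILD", "Ids": ["w5", "w6"]}]),
    _word("w1", "Invoice"), _word("w2", "Number:"), _word("w3", "INV-1001"),
    _word("w4", "Vendor:"), _word("w5", "Acme"), _word("w6", "Supply"),
]

# Hypothetical mapping: normalized form label -> metadata attribute key.
INVOICE_MAPPING = {"invoice number": "invoiceNumber", "vendor": "vendorName", "total": "totalAmount"}

pairs = key_value_pairs(sample_blocks)
metadata = apply_mapping(pairs, INVOICE_MAPPING)
```

Labels that the mapping lists but the document does not contain (here, `total`) are simply absent from the result, which leaves room for a later review or enrichment step to fill them in.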

Layer 3: AI-Powered Classification and Analysis

Available on Advanced and Enterprise editions, FormKiQ's AI Processing and Analysis module — powered by Amazon Bedrock — adds intelligent classification, enrichment, and analysis capabilities on top of OCR and structured extraction.

All AI processing runs within your AWS account through Amazon Bedrock, using supported large language models such as Anthropic Claude and Amazon Nova. Inference region controls let you specify which AWS regions are used for model processing, so document content stays within your data residency boundaries.

Five AI Processing Capabilities

1. Document Type Classification

Automatically identify and classify the type of each document as it enters FormKiQ — and apply the appropriate classification schema, metadata attributes, and workflow routing based on that determination.

Feature	Description
Automatic sorting	Incoming documents classified by type without manual intervention
Configurable categories	Classification models configured to the specific document types relevant to your program
Confidence scoring (optional)	When enabled, each classification includes a confidence score — low-confidence results routed to a human review queue
Workflow routing	Classified documents automatically routed to the appropriate workflow, queue, or case file

Best for: high-volume intake programs where documents arrive in mixed formats from multiple sources — grants intake, accounts payable, insurance claims, regulatory submissions, mailroom digitization.
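
The interaction between classification, confidence scoring, and workflow routing can be sketched in a few lines. The workflow names, the review queue name, and the 0.8 threshold below are illustrative assumptions, not FormKiQ defaults.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    document_type: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


# Hypothetical routing table: document type -> workflow name.
WORKFLOW_BY_TYPE = {
    "invoice": "accounts-payable",
    "contract": "legal-review",
    "claim": "claims-intake",
}
REVIEW_QUEUE = "human-review"


def route(result, threshold=0.8):
    """Send low-confidence or unrecognized classifications to human review."""
    if result.confidence < threshold:
        return REVIEW_QUEUE
    return WORKFLOW_BY_TYPE.get(result.document_type, REVIEW_QUEUE)


confident = route(Classification("invoice", 0.95))
uncertain = route(Classification("invoice", 0.55))
unknown = route(Classification("newsletter", 0.99))
```

Sending unrecognized types to the review queue as well as low-confidence results is a design choice in this sketch: it keeps documents that fall outside the configured categories from being silently misrouted.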

2. Content Sensitivity Classification

Identify and flag documents containing sensitive content (PII, PHI, financial data, privileged communications, or information about minors) so that appropriate access controls and handling can be applied:

Automated detection — sensitive content identified at the point of ingestion without requiring manual review
Classification tagging — sensitivity level applied as metadata, enabling ABAC-driven access policies
Handling rules — configurable actions triggered by sensitivity classification (restricted access, encryption enforcement, routing to security review)
Regulatory alignment — supports identification of content subject to HIPAA, GDPR, CCPA, FERPA, and other privacy frameworks
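
One way to picture how sensitivity tags drive handling rules and attribute-based access control is the sketch below. The label names, action names, and clearance model are hypothetical and only illustrate the pattern described in the list above.

```python
# Hypothetical handling rules keyed by sensitivity label.
HANDLING_RULES = {
    "PII": ["restrict-access"],
    "PHI": ["restrict-access", "enforce-encryption"],
    "PRIVILEGED": ["restrict-access", "security-review"],
}


def tag_document(labels):
    """Build sensitivity metadata and the de-duplicated handling actions for a document."""
    unique_labels = sorted(set(labels))
    actions = []
    for label in unique_labels:
        for action in HANDLING_RULES.get(label, []):
            if action not in actions:
                actions.append(action)
    return {"sensitivity": unique_labels, "handling": actions}


def can_access(user_clearances, document_metadata):
    """ABAC-style check: the user must hold a clearance for every sensitivity label."""
    return set(document_metadata["sensitivity"]) <= set(user_clearances)


tagged = tag_document(["PHI", "PII"])
```

Because the sensitivity labels live in metadata, the access decision in `can_access` needs only the user's attributes and the document's attributes, which is the essence of an ABAC policy.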

3. Metadata Extraction

Extract key entities from documents and apply them as structured, searchable metadata:

Entity Type	Examples
People	Names, titles, roles, signatories
Organizations	Company names, agency names, departments
Dates	Effective dates, expiry dates, filing dates, deadlines
Financial	Amounts, payment terms, account numbers
Identifiers	Case numbers, policy numbers, contract IDs, permit numbers
Locations	Addresses, jurisdictions, regions

Metadata extraction goes beyond Textract's form-field extraction by understanding context — extracting entities from unstructured narrative text, not just labeled form fields.
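
When a language model extracts entities from narrative text, its reply generally needs validation before it becomes metadata. The sketch below assumes the model was asked to answer with a JSON object; the field names and the rule that dates must be ISO 8601 are assumptions for the example, and the reply string is hand-written rather than real model output.

```python
import json
from datetime import date

# Hypothetical set of metadata attributes expected for a contract.
ALLOWED_FIELDS = {"counterparty", "effectiveDate", "expiryDate", "contractValue"}
DATE_FIELDS = {"effectiveDate", "expiryDate"}


def parse_entities(model_reply):
    """Parse a model's JSON reply and keep only expected, well-formed fields."""
    try:
        data = json.loads(model_reply)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    clean = {}
    for key, value in data.items():
        if key not in ALLOWED_FIELDS or not isinstance(value, str):
            continue
        value = value.strip()
        if not value:
            continue
        if key in DATE_FIELDS:
            try:
                date.fromisoformat(value)  # reject anything that is not YYYY-MM-DD
            except ValueError:
                continue
        clean[key] = value
    return clean


# Hand-written stand-in for a model reply.
reply = json.dumps({
    "counterparty": "Acme Supply",
    "effectiveDate": "2024-03-01",
    "expiryDate": "next spring",
    "notes": "not an expected field",
})
entities = parse_entities(reply)
```

Fields that fail validation are dropped rather than guessed, so they can be surfaced for human review instead of being stored as unreliable metadata.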

4. Document Summarization

Generate concise summaries of document content — capturing key points, decisions, obligations, and context:

Intake acceleration — summaries of lengthy submissions (medical records, investigation reports, legal filings) allow reviewers to triage and prioritize without reading full documents
Case review — case file summaries produced from multiple case documents, providing a consolidated view of case history and status
Correspondence digest — summaries of correspondence chains that capture the key exchanges without requiring full-thread review
Archive discovery — summaries of archived documents that support discovery without retrieving full documents from cold storage

5. Document Analysis

Apply structured analytical judgment to document content — assessing documents against criteria, identifying issues, and producing actionable outputs:

Analysis Type	What It Does	Example Use Case
Contract analysis	Identify obligations, rights, termination conditions, non-standard terms	Flag contracts deviating from standard templates before legal review
Compliance analysis	Assess whether documents meet defined compliance requirements	Evaluate regulatory submissions for completeness and conformity
Application analysis	Evaluate applications against eligibility criteria	Assess grant applications for required elements and eligibility
Policy analysis	Identify conflicts or gaps between policy documents	Flag policy provisions that conflict with regulatory requirements
Risk analysis	Evaluate content for risk indicators and obligation exposure	Surface high-risk clauses in vendor contracts before approval

Analytical outputs are stored as structured metadata — making them searchable and available to workflow routing rules. A document that fails a compliance analysis can be automatically routed to a remediation queue.
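
Because analysis results are stored as metadata, routing can be expressed as simple rules over attribute values. The attribute names (`complianceStatus`, `riskLevel`) and queue names in this sketch are hypothetical.

```python
# Hypothetical rules, evaluated in order: (attribute, expected value, destination queue).
ROUTING_RULES = [
    ("complianceStatus", "fail", "remediation"),
    ("riskLevel", "high", "legal-review"),
]


def route_by_metadata(metadata, rules=ROUTING_RULES, default="standard-processing"):
    """Return the queue of the first rule whose attribute matches, else the default."""
    for attribute, expected, queue in rules:
        if metadata.get(attribute) == expected:
            return queue
    return default


failed = route_by_metadata({"complianceStatus": "fail", "riskLevel": "high"})
risky = route_by_metadata({"complianceStatus": "pass", "riskLevel": "high"})
clean = route_by_metadata({"complianceStatus": "pass", "riskLevel": "low"})
```

Rule order matters here: a document that both fails compliance and carries high risk goes to remediation first, since that rule is listed first.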

Selective Processing and Human Review

FormKiQ's IDP pipeline is designed for selective, incremental deployment — not all-or-nothing:

Per-document-type — enable specific IDP capabilities for specific document types (OCR for all scanned documents, AI classification only for intake submissions, full analysis only for contracts)
Per-workflow — apply IDP processing at specific workflow stages (classification at intake, summarization before review, analysis before approval)
Per-site — in multi-tenant deployments, enable different IDP capabilities for different organizational sites or business units
Confidence-based routing — when confidence scoring is enabled, low-confidence AI outputs can be routed to human review queues for verification before metadata is finalized
Incremental enablement — start with OCR, add Textract extraction, then layer AI capabilities as confidence in the outputs grows
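
The per-document-type and per-site options above can be thought of as a small configuration lookup. The step names, document types, and site identifiers below are invented for illustration and do not reflect FormKiQ's actual configuration schema.

```python
# Hypothetical configuration; all names are illustrative only.
PIPELINE_ORDER = ["ocr", "textract", "classification", "metadata", "summarization", "analysis"]
DEFAULT_STEPS = {"ocr"}
STEPS_BY_DOCUMENT_TYPE = {
    "intake-submission": {"ocr", "classification"},
    "contract": {"ocr", "textract", "classification", "analysis"},
}
STEPS_DISABLED_BY_SITE = {
    "site-b": {"analysis"},
}


def steps_for(document_type, site):
    """Resolve which processing steps run, in pipeline order, for one document."""
    steps = STEPS_BY_DOCUMENT_TYPE.get(document_type, DEFAULT_STEPS)
    steps = steps - STEPS_DISABLED_BY_SITE.get(site, set())
    return [step for step in PIPELINE_ORDER if step in steps]


contract_site_a = steps_for("contract", "site-a")
contract_site_b = steps_for("contract", "site-b")
memo = steps_for("memo", "site-a")
```

Incremental enablement then amounts to adding steps to a document type's set over time, without touching the types that are already configured.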

Data Sovereignty and Processing Security

A critical differentiator for FormKiQ's IDP pipeline: every processing step runs within your AWS account.

Processing Step	Where It Runs	Data Residency Control
Tesseract OCR	AWS Lambda in your account	Your selected AWS region
Textract extraction	Amazon Textract API in your account	Your selected AWS region
Bedrock AI processing	Amazon Bedrock in your account	Configurable inference region controls
Metadata storage	Amazon DynamoDB in your account	Your selected AWS region
Full-text index	Amazon OpenSearch in your account	Your selected AWS region
Audit trail	AWS CloudTrail + FormKiQ audit log in your account	Your selected AWS region

Documents are never sent to a third-party AI service, a vendor-hosted processing environment, or an external OCR API. For organizations subject to HIPAA, GDPR, PIPEDA, or data sovereignty requirements, this architecture avoids the data residency risk that comes with sending documents to third-party IDP services.
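
A data residency boundary of this kind can be checked mechanically. The sketch below compares the region configured for each processing step against an allow-list; the configuration shape and the chosen regions are assumptions for the example, and it makes no claim about which services are available in which regions.

```python
# Hypothetical residency boundary: the only regions where processing may occur.
ALLOWED_REGIONS = {"ca-central-1"}


def regions_out_of_bounds(configured, allowed=ALLOWED_REGIONS):
    """Return the processing steps whose configured region falls outside the boundary."""
    return sorted(step for step, region in configured.items() if region not in allowed)


# Illustrative configuration: processing step -> AWS region.
configured_regions = {
    "textract": "ca-central-1",
    "bedrock-inference": "us-east-1",
    "dynamodb": "ca-central-1",
}
violations = regions_out_of_bounds(configured_regions)
```

Running a check like this at deployment time, before any documents are processed, is one way to catch an inference region that was left at a default outside the intended boundary.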

IDP Use Cases by Industry

Industry	High-Volume Document Types	IDP Capabilities Applied
Financial Services	Invoices, loan applications, account opening forms, compliance filings, trade confirmations	Textract form extraction, AI classification, metadata extraction, compliance analysis
Healthcare	Patient intake forms, insurance claims, lab reports, prescriptions, referral letters	Textract form extraction, sensitivity classification (PHI), metadata extraction, summarization
Government	Permit applications, benefits forms, correspondence, regulatory filings, FOIA requests	AI classification, metadata extraction, completeness analysis, correspondence summarization
Insurance	Claim forms, medical records, police reports, damage assessments, correspondence	AI classification and routing, Textract extraction, metadata enrichment, summarization
Higher Education	Admissions applications, transcripts, financial aid forms, research submissions	Textract form extraction, AI classification, application analysis
Legal	Contracts, discovery documents, correspondence, filings, evidence	Contract analysis, metadata extraction, sensitivity classification, summarization
Manufacturing	Quality inspection reports, supplier certifications, compliance documentation, SOPs	Textract table extraction, compliance analysis, metadata enrichment
Accounts Payable (cross-industry)	Invoices, purchase orders, receipts, delivery confirmations	Textract extraction, three-way matching metadata, AP workflow routing

FormKiQ Editions for Intelligent Document Processing

IDP capabilities are available across FormKiQ editions, with processing depth increasing at each tier:

Capability	Core	Essentials	Advanced	Enterprise
OCR — Tesseract	✓	✓	✓	✓
OCR & IDP — Amazon Textract		✓	✓	✓
Custom Extraction Mappings		✓	✓	✓
Document Type Classification (Bedrock)			✓	✓
Content Sensitivity Classification (Bedrock)			✓	✓
Metadata Extraction (Bedrock)			✓	✓
Document Summarization (Bedrock)			✓	✓
Document Analysis (Bedrock)			✓	✓
Inference Region Controls			✓	✓
Enhanced Full-Text Search (OpenSearch)			✓	✓
Document Gateway Modules			✓	✓
Integration Framework Modules			✓	✓
Multi-Instance & Multi-Region Licensing			✓	✓
Vendor-Managed & Hybrid Deployment				✓
Custom SLAs & Compliance Consulting				✓
Support	Community (Slack & GitHub)	Support Portal (2-business-day SLA)	Private Slack + videoconference + 40 hrs onboarding	Rapid response (8-business-hour SLA) + strategic architecture support
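
For teams that gate features in their own tooling, the IDP rows of the table above reduce to a "minimum edition" lookup. The capability keys in this sketch are hypothetical identifiers; the edition assignments are taken from the table.

```python
# Editions in ascending order of processing depth.
EDITIONS = ["Core", "Essentials", "Advanced", "Enterprise"]

# Lowest edition that includes each capability (keys are illustrative names).
MINIMUM_EDITION = {
    "tesseract_ocr": "Core",
    "textract": "Essentials",
    "custom_extraction_mappings": "Essentials",
    "document_type_classification": "Advanced",
    "content_sensitivity_classification": "Advanced",
    "metadata_extraction": "Advanced",
    "document_summarization": "Advanced",
    "document_analysis": "Advanced",
    "inference_region_controls": "Advanced",
}


def is_available(capability, edition):
    """True if the given edition is at or above the capability's minimum edition."""
    return EDITIONS.index(edition) >= EDITIONS.index(MINIMUM_EDITION[capability])


textract_on_core = is_available("textract", "Core")
analysis_on_enterprise = is_available("document_analysis", "Enterprise")
```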

Deployment Models

Model	Description	Availability
Customer-Managed AWS	Deploys directly into your AWS account via CloudFormation. Full control of infrastructure, networking, encryption keys, and operations.	All editions
Vendor-Managed	FormKiQ manages the AWS infrastructure on your behalf — deployment, updates, and operational support.	Enterprise
Hybrid	You retain control of specific components (encryption keys, network config) while delegating operational management to FormKiQ.	Enterprise

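Every deployment is a dedicated, isolated instance in an AWS account owned by or designated by the customer. FormKiQ does not operate a shared multi-tenant environment across customers; the multi-tenant sites mentioned earlier are organizational sites or business units within a single customer's deployment.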

Getting Started

FormKiQ Core — including Tesseract OCR — can be deployed to your AWS account in fifteen to twenty minutes using a one-click install via AWS CloudFormation. Amazon Textract integration is available from Essentials onward. AI-powered classification, extraction, summarization, and analysis capabilities are available on Advanced and Enterprise.

For organizations evaluating intelligent document processing on AWS, FormKiQ offers a Proof-of-Value program — a three-month deployment in a FormKiQ-managed AWS environment that provides full platform access in a non-production setting.


Frequently Asked Questions

What is intelligent document processing on AWS?

Intelligent document processing (IDP) on AWS refers to automating the extraction, classification, and enrichment of document content using OCR, machine learning, and AI services — all running within your own Amazon Web Services environment. FormKiQ's IDP pipeline combines Tesseract OCR, Amazon Textract structured extraction, and Amazon Bedrock AI analysis to convert unstructured documents into classified, metadata-enriched, workflow-ready records.

What is the difference between OCR and IDP?

OCR (optical character recognition) extracts raw text from scanned documents and images. IDP extends OCR with machine learning and AI to understand document structure (tables, forms, key-value pairs), classify document types, extract meaningful entities (names, dates, amounts), and analyze content for compliance, completeness, or risk. OCR tells you what the text says. IDP tells you what the document means.

What is the difference between Tesseract and Amazon Textract?

Tesseract is an open-source OCR engine included in FormKiQ Core that provides reliable text extraction for standard documents. Amazon Textract is AWS's managed ML service available from Essentials onward — it adds handwriting recognition, table extraction, form field extraction, layout understanding, and higher accuracy across complex document types. Textract also supports custom extraction mappings that map specific document fields directly to FormKiQ metadata attributes.

How does Amazon Bedrock power IDP in FormKiQ?

FormKiQ's AI Processing and Analysis module uses Amazon Bedrock to provide five capabilities beyond OCR and Textract: document type classification, content sensitivity classification, metadata extraction from unstructured text, document summarization, and document analysis. All processing runs within your AWS account using Bedrock's supported models (Anthropic Claude, Amazon Nova, and others). Inference region controls ensure document content stays within your data residency boundaries.

Does document content leave my AWS account during IDP processing?

No. Every step of FormKiQ's IDP pipeline — Tesseract OCR, Textract extraction, Bedrock AI processing, metadata storage, and search indexing — runs within your own AWS account. Documents are never sent to a third-party service, a vendor-hosted environment, or an external API.

Can I apply IDP selectively to specific document types?

Yes. FormKiQ's IDP capabilities can be enabled per-document-type, per-workflow, and per-site in multi-tenant deployments. You can apply OCR to all scanned documents, Textract extraction to invoices and forms, and full AI analysis only to contracts — without processing every document through the entire pipeline.

How does FormKiQ handle low-confidence AI outputs?

When confidence scoring is enabled, low-confidence AI classification and extraction results can be routed to a human review queue, where staff verify and correct the output before metadata is finalized. This keeps a human in the loop, so AI processing augments human judgment rather than replacing it. Confidence scoring is a configurable option, not a requirement for all deployments.

What document formats does FormKiQ's IDP support?

FormKiQ's IDP pipeline processes PDFs (both text-based and image-based), TIFF, JPEG, PNG, and other common image formats. Multi-page documents are handled natively. Born-digital PDFs with embedded text can bypass OCR and proceed directly to AI classification and analysis.
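
The OCR bypass for born-digital PDFs can be sketched as a simple decision on content type and whether embedded text is present. The step names and the fallback for unrecognized content types are assumptions for this example; how embedded text is detected is left out of scope.

```python
def first_step(content_type, has_embedded_text=False):
    """Pick the first processing step for a document based on its content type."""
    if content_type.startswith("image/"):
        return "ocr"  # TIFF, JPEG, PNG, and other image formats need OCR
    if content_type == "application/pdf":
        # Born-digital PDFs already carry text, so OCR can be skipped.
        return "ai-processing" if has_embedded_text else "ocr"
    return "needs-review"  # assumption: anything else is checked manually


scanned_pdf = first_step("application/pdf")
digital_pdf = first_step("application/pdf", has_embedded_text=True)
tiff_scan = first_step("image/tiff")
```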