Intelligent Document Processing on AWS

OCR, data extraction, AI classification, and document analysis running entirely within your AWS account.

OCR, Data Extraction, AI Classification, and Document Analysis — Running Entirely Within Your AWS Account

Organizations process thousands of documents daily — invoices, contracts, applications, claims, correspondence, regulatory filings, and forms — most of which arrive as unstructured content: scanned PDFs, images, faxes, and multi-page paper submissions. Converting this unstructured content into classified, metadata-enriched, workflow-ready records is the bottleneck that slows intake, delays decisions, and forces expensive manual data entry.

Intelligent document processing (IDP) automates this conversion. FormKiQ provides a multi-layer IDP pipeline — from foundational OCR text extraction through structured data capture with Amazon Textract to AI-powered classification, enrichment, and analysis with Amazon Bedrock — all running within your own AWS account. Your documents never leave your cloud environment. Every processing action is logged in your audit trail. And the output — classified, metadata-enriched documents — feeds directly into FormKiQ's workflows, search indexes, and governance controls.

What Is Intelligent Document Processing?

IDP extends traditional OCR (optical character recognition) by adding machine learning and AI capabilities that understand document structure, extract meaning, and make classification decisions — not just read text:

Processing Layer What It Does Technology
OCR Extracts raw text from scanned documents, images, and image-based PDFs Tesseract (all editions), Amazon Textract (Essentials+)
Structured extraction Identifies and extracts tables, form fields, key-value pairs, and spatial relationships Amazon Textract with custom mappings
Document classification Determines the type of document (invoice, contract, application, correspondence) and routes it accordingly Amazon Bedrock LLMs (Advanced/Enterprise)
Metadata enrichment Extracts entities (names, dates, amounts, identifiers) and applies them as structured metadata Amazon Bedrock LLMs (Advanced/Enterprise)
Content analysis Analyzes document content for sensitivity, compliance, completeness, or domain-specific criteria Amazon Bedrock LLMs (Advanced/Enterprise)

The key distinction: OCR tells you what the text says. IDP tells you what the document means.

The FormKiQ IDP Pipeline

FormKiQ's IDP capabilities are layered — each layer builds on the previous, and organizations can enable the layers they need without activating the full pipeline:

Layer 1: OCR — Text Extraction

Tesseract (All Editions)

FormKiQ Core includes OCR processing using Tesseract, an open-source OCR engine that provides reliable text extraction for standard digitization workflows. Tesseract processing runs within your AWS account using Lambda functions, with extracted text stored as part of the document's metadata record and available for full-text indexing and search.

Best for: scanned typed documents, printed forms, and document types where text is clearly legible and consistently formatted.

Amazon Textract (Essentials and above)

From Essentials onward, FormKiQ provides OCR and structured extraction using Amazon Textract — AWS's managed machine learning service for document text and data extraction.

Capability Tesseract (Core) Textract (Essentials+)
Raw text extraction
Handwriting recognition Limited
Table extraction ✓ (structured table data)
Form field extraction ✓ (key-value pairs)
Layout understanding ✓ (spatial relationships preserved)
Multi-column documents Limited
Low-quality scans Limited Higher accuracy
Custom extraction mappings ✓ (map fields to FormKiQ metadata)
Processing location Your AWS account (Lambda) Your AWS account (Textract API)

Layer 2: Structured Extraction — Tables, Forms, and Key-Value Pairs

Amazon Textract goes beyond raw text to extract structured data that preserves the meaning and relationships within documents:

  • Table extraction — identifies tables within documents and extracts their content as structured data with rows, columns, and cell values preserved
  • Form field extraction — identifies labeled fields (key-value pairs) and extracts both the label and the value as structured data
  • Custom extraction mappings — configurable field-level extraction that maps specific document elements to FormKiQ metadata attributes

What custom mappings enable:

Document Type Extracted Fields Metadata Result
Invoice Vendor name, invoice number, date, line items, total amount Structured metadata for AP automation, search, and retention
Contract Counterparty, effective date, expiry date, contract value Contract metadata for lifecycle tracking and renewal alerting
Application form Applicant name, date of birth, address, application type Application metadata for eligibility routing and case file creation
Insurance claim Claimant, policy number, date of loss, claim amount Claims metadata for intake routing and adjudication workflows
Tax form Taxpayer ID, filing period, reported amounts Tax document metadata for compliance filing and audit readiness

Layer 3: AI-Powered Classification and Analysis

Available on Advanced and Enterprise editions, FormKiQ's AI Processing and Analysis module — powered by Amazon Bedrock — adds intelligent classification, enrichment, and analysis capabilities on top of OCR and structured extraction.

All AI processing runs within your AWS account through Amazon Bedrock, using supported large language models including Anthropic Claude, Amazon Nova, and other available models. Inference region controls allow you to specify which AWS regions are used for model processing, ensuring document content stays within your data residency boundaries.

Five AI Processing Capabilities

1. Document Type Classification

Automatically identify and classify the type of each document as it enters FormKiQ — and apply the appropriate classification schema, metadata attributes, and workflow routing based on that determination.

Feature Description
Automatic sorting Incoming documents classified by type without manual intervention
Configurable categories Classification models configured to the specific document types relevant to your program
Confidence scoring (optional) When enabled, each classification includes a confidence score — low-confidence results routed to a human review queue
Workflow routing Classified documents automatically routed to the appropriate workflow, queue, or case file

Best for: high-volume intake programs where documents arrive in mixed formats from multiple sources — grants intake, accounts payable, insurance claims, regulatory submissions, mailroom digitization.

2. Content Sensitivity Classification

Identify and flag documents containing sensitive content — PII, PHI, financial data, privileged communications, or minor-related information — for appropriate access control and handling:

  • Automated detection — sensitive content identified at the point of ingestion without requiring manual review
  • Classification tagging — sensitivity level applied as metadata, enabling ABAC-driven access policies
  • Handling rules — configurable actions triggered by sensitivity classification (restricted access, encryption enforcement, routing to security review)
  • Regulatory alignment — supports identification of content subject to HIPAA, GDPR, CCPA, FERPA, and other privacy frameworks

3. Metadata Extraction

Extract key entities from documents and apply them as structured, searchable metadata:

Entity Type Examples
People Names, titles, roles, signatories
Organizations Company names, agency names, departments
Dates Effective dates, expiry dates, filing dates, deadlines
Financial Amounts, payment terms, account numbers
Identifiers Case numbers, policy numbers, contract IDs, permit numbers
Locations Addresses, jurisdictions, regions

Metadata extraction goes beyond Textract's form-field extraction by understanding context — extracting entities from unstructured narrative text, not just labeled form fields.

4. Document Summarization

Generate concise summaries of document content — capturing key points, decisions, obligations, and context:

  • Intake acceleration — summaries of lengthy submissions (medical records, investigation reports, legal filings) allow reviewers to triage and prioritize without reading full documents
  • Case review — case file summaries produced from multiple case documents, providing a consolidated view of case history and status
  • Correspondence digest — summaries of correspondence chains that capture the key exchanges without requiring full-thread review
  • Archive discovery — summaries of archived documents that support discovery without retrieving full documents from cold storage

5. Document Analysis

Apply structured analytical judgment to document content — assessing documents against criteria, identifying issues, and producing actionable outputs:

Analysis Type What It Does Example Use Case
Contract analysis Identify obligations, rights, termination conditions, non-standard terms Flag contracts deviating from standard templates before legal review
Compliance analysis Assess whether documents meet defined compliance requirements Evaluate regulatory submissions for completeness and conformity
Application analysis Evaluate applications against eligibility criteria Assess grant applications for required elements and eligibility
Policy analysis Identify conflicts or gaps between policy documents Flag policy provisions that conflict with regulatory requirements
Risk analysis Evaluate content for risk indicators and obligation exposure Surface high-risk clauses in vendor contracts before approval

Analytical outputs are stored as structured metadata — making them searchable and available to workflow routing rules. A document that fails a compliance analysis can be automatically routed to a remediation queue.

Selective Processing and Human Review

FormKiQ's IDP pipeline is designed for selective, incremental deployment — not all-or-nothing:

  • Per-document-type — enable specific IDP capabilities for specific document types (OCR for all scanned documents, AI classification only for intake submissions, full analysis only for contracts)
  • Per-workflow — apply IDP processing at specific workflow stages (classification at intake, summarization before review, analysis before approval)
  • Per-site — in multi-tenant deployments, enable different IDP capabilities for different organizational sites or business units
  • Confidence-based routing — when confidence scoring is enabled, low-confidence AI outputs can be routed to human review queues for verification before metadata is finalized
  • Incremental enablement — start with OCR, add Textract extraction, then layer AI capabilities as confidence in the outputs grows

Data Sovereignty and Processing Security

A critical differentiator for FormKiQ's IDP pipeline: every processing step runs within your AWS account.

Processing Step Where It Runs Data Residency Control
Tesseract OCR AWS Lambda in your account Your selected AWS region
Textract extraction Amazon Textract API in your account Your selected AWS region
Bedrock AI processing Amazon Bedrock in your account Configurable inference region controls
Metadata storage Amazon DynamoDB in your account Your selected AWS region
Full-text index Amazon OpenSearch in your account Your selected AWS region
Audit trail AWS CloudTrail + FormKiQ audit log in your account Your selected AWS region

Documents are never sent to a third-party AI service, a vendor-hosted processing environment, or an external OCR API. For organizations subject to HIPAA, GDPR, PIPEDA, or data sovereignty requirements, this architecture eliminates the data residency risk that third-party IDP services create.

IDP Use Cases by Industry

Industry High-Volume Document Types IDP Capabilities Applied
Financial Services Invoices, loan applications, account opening forms, compliance filings, trade confirmations Textract form extraction, AI classification, metadata extraction, compliance analysis
Healthcare Patient intake forms, insurance claims, lab reports, prescriptions, referral letters Textract form extraction, sensitivity classification (PHI), metadata extraction, summarization
Government Permit applications, benefits forms, correspondence, regulatory filings, FOIA requests AI classification, metadata extraction, completeness analysis, correspondence summarization
Insurance Claim forms, medical records, police reports, damage assessments, correspondence AI classification and routing, Textract extraction, metadata enrichment, summarization
Higher Education Admissions applications, transcripts, financial aid forms, research submissions Textract form extraction, AI classification, application analysis
Legal Contracts, discovery documents, correspondence, filings, evidence Contract analysis, metadata extraction, sensitivity classification, summarization
Manufacturing Quality inspection reports, supplier certifications, compliance documentation, SOPs Textract table extraction, compliance analysis, metadata enrichment
Accounts Payable (cross-industry) Invoices, purchase orders, receipts, delivery confirmations Textract extraction, three-way matching metadata, AP workflow routing

FormKiQ Editions for Intelligent Document Processing

IDP capabilities are available across FormKiQ editions, with processing depth increasing at each tier:

Capability Core Essentials Advanced Enterprise
OCR — Tesseract
OCR & IDP — Amazon Textract
Custom Extraction Mappings
Document Type Classification (Bedrock)
Content Sensitivity Classification (Bedrock)
Metadata Extraction (Bedrock)
Document Summarization (Bedrock)
Document Analysis (Bedrock)
Inference Region Controls
Enhanced Full-Text Search (OpenSearch)
Document Gateway Modules
Integration Framework Modules
Multi-Instance & Multi-Region Licensing
Vendor-Managed & Hybrid Deployment
Custom SLAs & Compliance Consulting
Support Community (Slack & GitHub) Support Portal (2-business-day SLA) Private Slack + videoconference + 40 hrs onboarding Rapid response (8-business-hour SLA) + strategic architecture support

Deployment Models

Model Description Availability
Customer-Managed AWS Deploys directly into your AWS account via CloudFormation. Full control of infrastructure, networking, encryption keys, and operations. All editions
Vendor-Managed FormKiQ manages the AWS infrastructure on your behalf — deployment, updates, and operational support. Enterprise
Hybrid You retain control of specific components (encryption keys, network config) while delegating operational management to FormKiQ. Enterprise

Every deployment is a dedicated, isolated instance in an AWS account owned by or designated by the customer. FormKiQ does not operate a shared multi-tenant environment.

Getting Started

FormKiQ Core — including Tesseract OCR — can be deployed to your AWS account in fifteen to twenty minutes using a one-click install via AWS CloudFormation. Amazon Textract integration is available from Essentials onward. AI-powered classification, extraction, summarization, and analysis capabilities are available on Advanced and Enterprise.

For organizations evaluating intelligent document processing on AWS, FormKiQ offers a Proof-of-Value program — a three-month deployment in a FormKiQ-managed AWS environment that provides full platform access in a non-production setting.

Schedule a consultation · Start a Proof-of-Value deployment

Frequently Asked Questions

What is intelligent document processing on AWS?

Intelligent document processing (IDP) on AWS refers to automating the extraction, classification, and enrichment of document content using OCR, machine learning, and AI services — all running within your own Amazon Web Services environment. FormKiQ's IDP pipeline combines Tesseract OCR, Amazon Textract structured extraction, and Amazon Bedrock AI analysis to convert unstructured documents into classified, metadata-enriched, workflow-ready records.

What is the difference between OCR and IDP?

OCR (optical character recognition) extracts raw text from scanned documents and images. IDP extends OCR with machine learning and AI to understand document structure (tables, forms, key-value pairs), classify document types, extract meaningful entities (names, dates, amounts), and analyze content for compliance, completeness, or risk. OCR tells you what the text says. IDP tells you what the document means.

What is the difference between Tesseract and Amazon Textract?

Tesseract is an open-source OCR engine included in FormKiQ Core that provides reliable text extraction for standard documents. Amazon Textract is AWS's managed ML service available from Essentials onward — it adds handwriting recognition, table extraction, form field extraction, layout understanding, and higher accuracy across complex document types. Textract also supports custom extraction mappings that map specific document fields directly to FormKiQ metadata attributes.

How does Amazon Bedrock power IDP in FormKiQ?

FormKiQ's AI Processing and Analysis module uses Amazon Bedrock to provide five capabilities beyond OCR and Textract: document type classification, content sensitivity classification, metadata extraction from unstructured text, document summarization, and document analysis. All processing runs within your AWS account using Bedrock's supported models (Anthropic Claude, Amazon Nova, and others). Inference region controls ensure document content stays within your data residency boundaries.

Does document content leave my AWS account during IDP processing?

No. Every step of FormKiQ's IDP pipeline — Tesseract OCR, Textract extraction, Bedrock AI processing, metadata storage, and search indexing — runs within your own AWS account. Documents are never sent to a third-party service, a vendor-hosted environment, or an external API.

Can I apply IDP selectively to specific document types?

Yes. FormKiQ's IDP capabilities can be enabled per-document-type, per-workflow, and per-site in multi-tenant deployments. You can apply OCR to all scanned documents, Textract extraction to invoices and forms, and full AI analysis only to contracts — without processing every document through the entire pipeline.

How does FormKiQ handle low-confidence AI outputs?

When confidence scoring is enabled, low-confidence AI classification and extraction results can be routed to a human review queue where staff can verify and correct the output before metadata is finalized. This ensures AI processing augments human judgment rather than replacing it unsupervised. Confidence scoring is a configurable option, not a requirement for all deployments.

What document formats does FormKiQ's IDP support?

FormKiQ's IDP pipeline processes PDFs (both text-based and image-based), TIFF, JPEG, PNG, and other common image formats. Multi-page documents are handled natively. Born-digital PDFs with embedded text can bypass OCR and proceed directly to AI classification and analysis.

Start with FormKiQ Core

The open-source foundation — API-first, deployable into your own AWS account, and free to use. Right for architecture validation and early implementation.

Get Started Free

Deploy FormKiQ Essentials or Advanced

Production-ready editions for departments and complex workflows. Start with a Proof-of-Value deployment or go straight to production.

Explore Options

Plan an Enterprise Rollout

For governance-heavy environments with residency, sovereignty, assurance, and multi-jurisdiction requirements. Talk to us about the right deployment model.

Book a Call