OCR, Data Extraction, AI Classification, and Document Analysis — Running Entirely Within Your AWS Account
Organizations process thousands of documents daily — invoices, contracts, applications, claims, correspondence, regulatory filings, and forms — most of which arrive as unstructured content: scanned PDFs, images, faxes, and multi-page paper submissions. Converting this unstructured content into classified, metadata-enriched, workflow-ready records is the bottleneck that slows intake, delays decisions, and forces expensive manual data entry.
Intelligent document processing (IDP) automates this conversion. FormKiQ provides a multi-layer IDP pipeline — from foundational OCR text extraction through structured data capture with Amazon Textract to AI-powered classification, enrichment, and analysis with Amazon Bedrock — all running within your own AWS account. Your documents never leave your cloud environment. Every processing action is logged in your audit trail. And the output — classified, metadata-enriched documents — feeds directly into FormKiQ's workflows, search indexes, and governance controls.
What Is Intelligent Document Processing?
IDP extends traditional OCR (optical character recognition) by adding machine learning and AI capabilities that understand document structure, extract meaning, and make classification decisions — not just read text:
| Processing Layer | What It Does | Technology |
|---|---|---|
| OCR | Extracts raw text from scanned documents, images, and image-based PDFs | Tesseract (all editions), Amazon Textract (Essentials+) |
| Structured extraction | Identifies and extracts tables, form fields, key-value pairs, and spatial relationships | Amazon Textract with custom mappings |
| Document classification | Determines the type of document (invoice, contract, application, correspondence) and routes it accordingly | Amazon Bedrock LLMs (Advanced/Enterprise) |
| Metadata enrichment | Extracts entities (names, dates, amounts, identifiers) and applies them as structured metadata | Amazon Bedrock LLMs (Advanced/Enterprise) |
| Content analysis | Analyzes document content for sensitivity, compliance, completeness, or domain-specific criteria | Amazon Bedrock LLMs (Advanced/Enterprise) |
The key distinction: OCR tells you what the text says. IDP tells you what the document means.
The FormKiQ IDP Pipeline
FormKiQ's IDP capabilities are layered — each layer builds on the previous, and organizations can enable the layers they need without activating the full pipeline:
Layer 1: OCR — Text Extraction
Tesseract (All Editions)
FormKiQ Core includes OCR processing using Tesseract, an open-source OCR engine that provides reliable text extraction for standard digitization workflows. Tesseract processing runs within your AWS account using Lambda functions, with extracted text stored as part of the document's metadata record and available for full-text indexing and search.
Best for: scanned typed documents, printed forms, and document types where text is clearly legible and consistently formatted.
Amazon Textract (Essentials and above)
From Essentials onward, FormKiQ provides OCR and structured extraction using Amazon Textract — AWS's managed machine learning service for document text and data extraction.
| Capability | Tesseract (Core) | Textract (Essentials+) |
|---|---|---|
| Raw text extraction | ✓ | ✓ |
| Handwriting recognition | Limited | ✓ |
| Table extraction | — | ✓ (structured table data) |
| Form field extraction | — | ✓ (key-value pairs) |
| Layout understanding | — | ✓ (spatial relationships preserved) |
| Multi-column documents | Limited | ✓ |
| Low-quality scans | Limited | Higher accuracy |
| Custom extraction mappings | — | ✓ (map fields to FormKiQ metadata) |
| Processing location | Your AWS account (Lambda) | Your AWS account (Textract API) |
Layer 2: Structured Extraction — Tables, Forms, and Key-Value Pairs
Amazon Textract goes beyond raw text to extract structured data that preserves the meaning and relationships within documents:
- Table extraction — identifies tables within documents and extracts their content as structured data with rows, columns, and cell values preserved
- Form field extraction — identifies labeled fields (key-value pairs) and extracts both the label and the value as structured data
- Custom extraction mappings — configurable field-level extraction that maps specific document elements to FormKiQ metadata attributes
What custom mappings enable:
| Document Type | Extracted Fields | Metadata Result |
|---|---|---|
| Invoice | Vendor name, invoice number, date, line items, total amount | Structured metadata for AP automation, search, and retention |
| Contract | Counterparty, effective date, expiry date, contract value | Contract metadata for lifecycle tracking and renewal alerting |
| Application form | Applicant name, date of birth, address, application type | Application metadata for eligibility routing and case file creation |
| Insurance claim | Claimant, policy number, date of loss, claim amount | Claims metadata for intake routing and adjudication workflows |
| Tax form | Taxpayer ID, filing period, reported amounts | Tax document metadata for compliance filing and audit readiness |
Layer 3: AI-Powered Classification and Analysis
Available on Advanced and Enterprise editions, FormKiQ's AI Processing and Analysis module — powered by Amazon Bedrock — adds intelligent classification, enrichment, and analysis capabilities on top of OCR and structured extraction.
All AI processing runs within your AWS account through Amazon Bedrock, using supported large language models including Anthropic Claude, Amazon Nova, and other available models. Inference region controls allow you to specify which AWS regions are used for model processing, ensuring document content stays within your data residency boundaries.
Five AI Processing Capabilities
1. Document Type Classification
Automatically identify and classify the type of each document as it enters FormKiQ — and apply the appropriate classification schema, metadata attributes, and workflow routing based on that determination.
| Feature | Description |
|---|---|
| Automatic sorting | Incoming documents classified by type without manual intervention |
| Configurable categories | Classification models configured to the specific document types relevant to your program |
| Confidence scoring (optional) | When enabled, each classification includes a confidence score — low-confidence results routed to a human review queue |
| Workflow routing | Classified documents automatically routed to the appropriate workflow, queue, or case file |
Best for: high-volume intake programs where documents arrive in mixed formats from multiple sources — grants intake, accounts payable, insurance claims, regulatory submissions, mailroom digitization.
2. Content Sensitivity Classification
Identify and flag documents containing sensitive content — PII, PHI, financial data, privileged communications, or minor-related information — for appropriate access control and handling:
- Automated detection — sensitive content identified at the point of ingestion without requiring manual review
- Classification tagging — sensitivity level applied as metadata, enabling ABAC-driven access policies
- Handling rules — configurable actions triggered by sensitivity classification (restricted access, encryption enforcement, routing to security review)
- Regulatory alignment — supports identification of content subject to HIPAA, GDPR, CCPA, FERPA, and other privacy frameworks
3. Metadata Extraction
Extract key entities from documents and apply them as structured, searchable metadata:
| Entity Type | Examples |
|---|---|
| People | Names, titles, roles, signatories |
| Organizations | Company names, agency names, departments |
| Dates | Effective dates, expiry dates, filing dates, deadlines |
| Financial | Amounts, payment terms, account numbers |
| Identifiers | Case numbers, policy numbers, contract IDs, permit numbers |
| Locations | Addresses, jurisdictions, regions |
Metadata extraction goes beyond Textract's form-field extraction by understanding context — extracting entities from unstructured narrative text, not just labeled form fields.
4. Document Summarization
Generate concise summaries of document content — capturing key points, decisions, obligations, and context:
- Intake acceleration — summaries of lengthy submissions (medical records, investigation reports, legal filings) allow reviewers to triage and prioritize without reading full documents
- Case review — case file summaries produced from multiple case documents, providing a consolidated view of case history and status
- Correspondence digest — summaries of correspondence chains that capture the key exchanges without requiring full-thread review
- Archive discovery — summaries of archived documents that support discovery without retrieving full documents from cold storage
5. Document Analysis
Apply structured analytical judgment to document content — assessing documents against criteria, identifying issues, and producing actionable outputs:
| Analysis Type | What It Does | Example Use Case |
|---|---|---|
| Contract analysis | Identify obligations, rights, termination conditions, non-standard terms | Flag contracts deviating from standard templates before legal review |
| Compliance analysis | Assess whether documents meet defined compliance requirements | Evaluate regulatory submissions for completeness and conformity |
| Application analysis | Evaluate applications against eligibility criteria | Assess grant applications for required elements and eligibility |
| Policy analysis | Identify conflicts or gaps between policy documents | Flag policy provisions that conflict with regulatory requirements |
| Risk analysis | Evaluate content for risk indicators and obligation exposure | Surface high-risk clauses in vendor contracts before approval |
Analytical outputs are stored as structured metadata — making them searchable and available to workflow routing rules. A document that fails a compliance analysis can be automatically routed to a remediation queue.
Selective Processing and Human Review
FormKiQ's IDP pipeline is designed for selective, incremental deployment — not all-or-nothing:
- Per-document-type — enable specific IDP capabilities for specific document types (OCR for all scanned documents, AI classification only for intake submissions, full analysis only for contracts)
- Per-workflow — apply IDP processing at specific workflow stages (classification at intake, summarization before review, analysis before approval)
- Per-site — in multi-tenant deployments, enable different IDP capabilities for different organizational sites or business units
- Confidence-based routing — when confidence scoring is enabled, low-confidence AI outputs can be routed to human review queues for verification before metadata is finalized
- Incremental enablement — start with OCR, add Textract extraction, then layer AI capabilities as confidence in the outputs grows
Data Sovereignty and Processing Security
A critical differentiator for FormKiQ's IDP pipeline: every processing step runs within your AWS account.
| Processing Step | Where It Runs | Data Residency Control |
|---|---|---|
| Tesseract OCR | AWS Lambda in your account | Your selected AWS region |
| Textract extraction | Amazon Textract API in your account | Your selected AWS region |
| Bedrock AI processing | Amazon Bedrock in your account | Configurable inference region controls |
| Metadata storage | Amazon DynamoDB in your account | Your selected AWS region |
| Full-text index | Amazon OpenSearch in your account | Your selected AWS region |
| Audit trail | AWS CloudTrail + FormKiQ audit log in your account | Your selected AWS region |
Documents are never sent to a third-party AI service, a vendor-hosted processing environment, or an external OCR API. For organizations subject to HIPAA, GDPR, PIPEDA, or data sovereignty requirements, this architecture eliminates the data residency risk that third-party IDP services create.
IDP Use Cases by Industry
| Industry | High-Volume Document Types | IDP Capabilities Applied |
|---|---|---|
| Financial Services | Invoices, loan applications, account opening forms, compliance filings, trade confirmations | Textract form extraction, AI classification, metadata extraction, compliance analysis |
| Healthcare | Patient intake forms, insurance claims, lab reports, prescriptions, referral letters | Textract form extraction, sensitivity classification (PHI), metadata extraction, summarization |
| Government | Permit applications, benefits forms, correspondence, regulatory filings, FOIA requests | AI classification, metadata extraction, completeness analysis, correspondence summarization |
| Insurance | Claim forms, medical records, police reports, damage assessments, correspondence | AI classification and routing, Textract extraction, metadata enrichment, summarization |
| Higher Education | Admissions applications, transcripts, financial aid forms, research submissions | Textract form extraction, AI classification, application analysis |
| Legal | Contracts, discovery documents, correspondence, filings, evidence | Contract analysis, metadata extraction, sensitivity classification, summarization |
| Manufacturing | Quality inspection reports, supplier certifications, compliance documentation, SOPs | Textract table extraction, compliance analysis, metadata enrichment |
| Accounts Payable (cross-industry) | Invoices, purchase orders, receipts, delivery confirmations | Textract extraction, three-way matching metadata, AP workflow routing |
FormKiQ Editions for Intelligent Document Processing
IDP capabilities are available across FormKiQ editions, with processing depth increasing at each tier:
| Capability | Core | Essentials | Advanced | Enterprise |
|---|---|---|---|---|
| OCR — Tesseract | ✓ | ✓ | ✓ | ✓ |
| OCR & IDP — Amazon Textract | ✓ | ✓ | ✓ | |
| Custom Extraction Mappings | ✓ | ✓ | ✓ | |
| Document Type Classification (Bedrock) | ✓ | ✓ | ||
| Content Sensitivity Classification (Bedrock) | ✓ | ✓ | ||
| Metadata Extraction (Bedrock) | ✓ | ✓ | ||
| Document Summarization (Bedrock) | ✓ | ✓ | ||
| Document Analysis (Bedrock) | ✓ | ✓ | ||
| Inference Region Controls | ✓ | ✓ | ||
| Enhanced Full-Text Search (OpenSearch) | ✓ | ✓ | ||
| Document Gateway Modules | ✓ | ✓ | ||
| Integration Framework Modules | ✓ | ✓ | ||
| Multi-Instance & Multi-Region Licensing | ✓ | ✓ | ||
| Vendor-Managed & Hybrid Deployment | ✓ | |||
| Custom SLAs & Compliance Consulting | ✓ | |||
| Support | Community (Slack & GitHub) | Support Portal (2-business-day SLA) | Private Slack + videoconference + 40 hrs onboarding | Rapid response (8-business-hour SLA) + strategic architecture support |
Deployment Models
| Model | Description | Availability |
|---|---|---|
| Customer-Managed AWS | Deploys directly into your AWS account via CloudFormation. Full control of infrastructure, networking, encryption keys, and operations. | All editions |
| Vendor-Managed | FormKiQ manages the AWS infrastructure on your behalf — deployment, updates, and operational support. | Enterprise |
| Hybrid | You retain control of specific components (encryption keys, network config) while delegating operational management to FormKiQ. | Enterprise |
Every deployment is a dedicated, isolated instance in an AWS account owned by or designated by the customer. FormKiQ does not operate a shared multi-tenant environment.
Getting Started
FormKiQ Core — including Tesseract OCR — can be deployed to your AWS account in fifteen to twenty minutes using a one-click install via AWS CloudFormation. Amazon Textract integration is available from Essentials onward. AI-powered classification, extraction, summarization, and analysis capabilities are available on Advanced and Enterprise.
For organizations evaluating intelligent document processing on AWS, FormKiQ offers a Proof-of-Value program — a three-month deployment in a FormKiQ-managed AWS environment that provides full platform access in a non-production setting.
Frequently Asked Questions
What is intelligent document processing on AWS?
Intelligent document processing (IDP) on AWS refers to automating the extraction, classification, and enrichment of document content using OCR, machine learning, and AI services — all running within your own Amazon Web Services environment. FormKiQ's IDP pipeline combines Tesseract OCR, Amazon Textract structured extraction, and Amazon Bedrock AI analysis to convert unstructured documents into classified, metadata-enriched, workflow-ready records.
What is the difference between OCR and IDP?
OCR (optical character recognition) extracts raw text from scanned documents and images. IDP extends OCR with machine learning and AI to understand document structure (tables, forms, key-value pairs), classify document types, extract meaningful entities (names, dates, amounts), and analyze content for compliance, completeness, or risk. OCR tells you what the text says. IDP tells you what the document means.
What is the difference between Tesseract and Amazon Textract?
Tesseract is an open-source OCR engine included in FormKiQ Core that provides reliable text extraction for standard documents. Amazon Textract is AWS's managed ML service available from Essentials onward — it adds handwriting recognition, table extraction, form field extraction, layout understanding, and higher accuracy across complex document types. Textract also supports custom extraction mappings that map specific document fields directly to FormKiQ metadata attributes.
How does Amazon Bedrock power IDP in FormKiQ?
FormKiQ's AI Processing and Analysis module uses Amazon Bedrock to provide five capabilities beyond OCR and Textract: document type classification, content sensitivity classification, metadata extraction from unstructured text, document summarization, and document analysis. All processing runs within your AWS account using Bedrock's supported models (Anthropic Claude, Amazon Nova, and others). Inference region controls ensure document content stays within your data residency boundaries.
Does document content leave my AWS account during IDP processing?
No. Every step of FormKiQ's IDP pipeline — Tesseract OCR, Textract extraction, Bedrock AI processing, metadata storage, and search indexing — runs within your own AWS account. Documents are never sent to a third-party service, a vendor-hosted environment, or an external API.
Can I apply IDP selectively to specific document types?
Yes. FormKiQ's IDP capabilities can be enabled per-document-type, per-workflow, and per-site in multi-tenant deployments. You can apply OCR to all scanned documents, Textract extraction to invoices and forms, and full AI analysis only to contracts — without processing every document through the entire pipeline.
How does FormKiQ handle low-confidence AI outputs?
When confidence scoring is enabled, low-confidence AI classification and extraction results can be routed to a human review queue where staff can verify and correct the output before metadata is finalized. This ensures AI processing augments human judgment rather than replacing it unsupervised. Confidence scoring is a configurable option, not a requirement for all deployments.
What document formats does FormKiQ's IDP support?
FormKiQ's IDP pipeline processes PDFs (both text-based and image-based), TIFF, JPEG, PNG, and other common image formats. Multi-page documents are handled natively. Born-digital PDFs with embedded text can bypass OCR and proceed directly to AI classification and analysis.