AI Document Processing on AWS

Classification, Extraction, Summarisation, and Analysis — Powered by Amazon Bedrock and Amazon Textract, Running Entirely Within Your AWS Account

Every organisation has a document processing bottleneck. Documents arrive — invoices, contracts, applications, correspondence, compliance filings, medical records, onboarding packets — and someone has to read them, classify them, extract the relevant data, enter it into a system, route them to the right person, and file them in the right place. This manual processing is slow, expensive, error-prone, and the single largest barrier to scaling document-intensive operations.

AI document processing automates this work. Not by replacing human judgement, but by handling the mechanical steps — identifying what type of document it is, extracting the structured data from it, classifying its sensitivity, summarising its content, and routing it to the right workflow — so that human attention is reserved for decisions that require it.

FormKiQ provides AI document processing as an integrated capability within a governed document management platform — not as a standalone AI service that processes documents in isolation. Documents are classified, extracted, enriched, and analysed within FormKiQ, and the results feed directly into metadata, search indexes, workflows, and governance controls. All AI processing runs within your AWS account through Amazon Bedrock and Amazon Textract. Your documents never leave your cloud environment, and inference region controls ensure processing stays within your data residency boundaries.

FormKiQ AI-powered document processing workflow, metadata extraction, and human review — automation and AI with humans in the loop

Start with a Controlled Business Problem

AI document processing works best when it is tied to a specific operational problem rather than a broad mandate to "use AI on documents." The useful question is: which document-heavy process is slow, expensive, risky, or difficult to scale today?

Classification bottlenecks

Documents arrive from email, portals, APIs, scans, and file uploads. AI identifies document type at ingestion and applies classification metadata automatically.

Data entry burden

Staff manually read forms, invoices, applications, and correspondence. OCR, Textract, and Bedrock extract fields and entities as structured metadata.

Review volume

Reviewers need to triage large document queues quickly. AI-generated summaries help identify priority, relevance, and next action before full review.

Hidden obligations

Contracts and agreements contain deadlines, renewal dates, notice periods, and obligations that are hard to track manually. AI extracts them as actionable metadata.

Sensitivity exposure

Documents containing PII, PHI, privileged material, or confidential business content may arrive without proper labels. AI sensitivity classification can trigger handling rules.

Governance gaps

Large collections often have missing metadata, inconsistent classification, and unclear retention categories. AI can flag gaps for remediation and human review.

The narrower the AI task, the easier it is to test, govern, validate, and improve. "Extract renewal dates and notice periods from supplier agreements" is a stronger starting point than "run AI on contracts."

Match AI Processing to Data Sensitivity

Before applying AI to documents, organizations should understand the sensitivity of the content being processed. The goal is not to avoid AI; it is to match the processing model, review rules, and access controls to the risk level of the data.

Data Category	Examples	AI Processing Approach
Public or low sensitivity	Published policies, public reports, marketing materials	Classification and summarisation can usually proceed with lighter review requirements.
Internal business sensitive	Project documents, internal reports, operational records	Use approved infrastructure, governed outputs, access controls, and audit logging.
Personal information	Employee records, customer files, applicant information	Apply privacy, data residency, retention, and access-control rules to both documents and AI outputs.
Regulated records	PHI, financial records, audit evidence, compliance filings	Use confidence thresholds, human review, auditable processing history, and validated governance configuration.
Legally privileged	Attorney-client communications, litigation strategy, investigation files	Ensure AI summaries, extracted metadata, and knowledge-search answers inherit privilege boundaries.

The AI Document Processing Stack

FormKiQ's AI document processing operates as a layered stack — each layer builds on the previous, and organisations can enable the layers they need:

Layer	Technology	What It Does	Availability
OCR	Tesseract	Extracts raw text from scanned documents, images, and image-based PDFs	All editions
Structured extraction	Amazon Textract	Extracts tables, form fields, key-value pairs, and spatial relationships from documents	Essentials+
AI classification	Amazon Bedrock	Identifies document type and applies classification metadata automatically	Advanced / Enterprise
AI extraction	Amazon Bedrock	Extracts entities (names, dates, amounts, identifiers) from unstructured text and applies them as structured metadata	Advanced / Enterprise
AI sensitivity	Amazon Bedrock	Identifies documents containing PII, PHI, financial data, privileged content, or other sensitive information	Advanced / Enterprise
AI summarisation	Amazon Bedrock	Generates concise summaries of document content for triage, review, and discovery	Advanced / Enterprise
AI analysis	Amazon Bedrock	Analyses document content against criteria, checklists, or requirements — producing structured analytical output	Advanced / Enterprise

The distinction between layers matters. OCR tells you what the text says. Textract tells you what the form fields contain. Bedrock tells you what the document means. Each layer adds understanding, and the output of each layer feeds into FormKiQ's metadata, search, and workflow systems.

OCR and Structured Extraction

The foundation of AI document processing is getting text out of documents that aren't born digital — scanned PDFs, faxes, images, and multi-page paper submissions.

Tesseract OCR (All Editions)

FormKiQ Core includes OCR processing using Tesseract, an open-source OCR engine that provides reliable text extraction for standard digitisation workflows. Tesseract runs within your AWS account using Lambda functions, with extracted text stored as part of the document's metadata and available for full-text search.

Tesseract works well for clearly printed text in standard layouts. It has limitations with handwriting, complex multi-column layouts, low-quality scans, and documents where spatial relationships matter (tables, forms).

Amazon Textract (Essentials and Above)

From Essentials onward, FormKiQ provides structured extraction using Amazon Textract — AWS's managed machine learning service for document text and data extraction. Textract goes beyond raw text to understand document structure:

Textract Capability	What It Extracts	Why It Matters
Text extraction	All text from the document — including handwriting	Handles the document types that Tesseract struggles with
Table extraction	Tables as structured data with rows, columns, and cell values	Financial statements, lab results, comparison tables — extracted as data, not as a wall of text
Form extraction	Labelled fields as key-value pairs	Application forms, intake forms, registration documents — each field extracted with its label and value
Layout understanding	Spatial relationships between text elements — columns, headers, sections	Multi-column documents, letters with headers and footers — text extracted in reading order, not pixel order

Custom Extraction Mappings

Textract's form and table extraction can be combined with custom mappings that route specific extracted fields directly to FormKiQ metadata attributes:

Document Type	Extracted Fields	Metadata Result
Invoice	Vendor name, invoice number, date, line items, total	AP metadata for matching, approval, and archival
Contract	Counterparty, effective date, expiry, value	Contract metadata for lifecycle tracking and renewal
Application	Applicant name, date, type, key responses	Application metadata for eligibility routing and case creation
Insurance claim	Claimant, policy number, date of loss, amount	Claims metadata for intake routing and adjudication
Employee onboarding form	Name, role, start date, department, emergency contact	HR metadata for employee record creation

AI-Powered Classification

Document classification is the gateway to governance. Until a document is classified — identified by type, tagged with the right metadata, assigned to the right category — governance rules can't apply to it. Manual classification is the bottleneck that slows intake, introduces inconsistency, and creates gaps where documents sit unclassified and ungoverned. FormKiQ's AI classification — powered by Amazon Bedrock — automates this step.

How AI Classification Works

Document arrives — uploaded via API, web console, email, SFTP, or Document Gateway
AI analysis triggered — a document action triggers Bedrock analysis based on the document's source, workflow, or document type definition
Type identification — Bedrock analyses the document content and determines its type (invoice, contract, correspondence, application, medical record, policy document, etc.)
Confidence scoring — the classification includes a confidence score indicating how certain the model is about its determination
High-confidence routing — documents classified with high confidence are automatically tagged and routed to the appropriate workflow
Low-confidence review — documents classified with low confidence are routed to a human review queue for verification before metadata is finalised

What Makes This Different from Rule-Based Classification

Rule-based classification (matching on filename, sender, folder, or keyword) works for predictable, consistent document sources. It breaks down when documents arrive from varied sources in varied formats — which is the reality for most intake-heavy organisations. AI classification works on document content, not on superficial attributes, which means it handles the messy, unpredictable documents that rules can't classify.

Classification Approach	Strengths	Weaknesses
Filename / folder rules	Fast, simple, predictable	Breaks when filenames are inconsistent or documents are misnamed
Keyword matching	Catches specific terms	High false positive rate; can't handle synonyms, abbreviations, or context
AI classification (Bedrock)	Understands content and context; handles varied formats and sources	Requires confidence thresholds; low-confidence results need human review

The practical approach is to combine all three: rules handle the predictable cases, AI handles the varied and unpredictable ones, and human review handles the edge cases that neither can resolve confidently.

AI-Powered Extraction and Enrichment

Beyond classification, Bedrock extracts meaningful entities from document content and applies them as structured, searchable metadata. This goes beyond Textract's form-field extraction — Bedrock understands context and can extract entities from unstructured narrative text, not just labelled form fields.

Entity Type	What Gets Extracted	How It's Used in FormKiQ
People	Names, titles, roles, signatories	Metadata for party-based search, access control, and correspondence tracking
Organisations	Company names, agency names, departments	Metadata for counterparty-level views and vendor/client portfolio search
Dates	Effective dates, expiry dates, filing dates, deadlines	Metadata for milestone tracking, renewal alerting, and retention trigger
Financial	Amounts, payment terms, account numbers	Metadata for financial reporting, approval routing, and three-way matching
Identifiers	Case numbers, policy numbers, contract IDs, permit numbers	Metadata for cross-referencing with enterprise systems
Locations	Addresses, jurisdictions, regions	Metadata for jurisdiction-aware governance and data residency

The key insight is that extraction without governance is just data entry automation. In FormKiQ, extracted entities become governed metadata — searchable, access-controlled, and tied to the document's lifecycle. An extracted expiry date becomes a renewal alert. An extracted counterparty becomes a portfolio-level search dimension. An extracted sensitivity indicator becomes an ABAC access restriction.

AI-Powered Sensitivity Classification

One of the highest-value AI capabilities for regulated organisations is automatic identification of documents containing sensitive information. Without sensitivity classification, organisations rely on users to correctly classify documents at the point of upload — which means sensitive documents regularly end up in general-access locations without appropriate controls. FormKiQ's sensitivity classification detects:

Personally identifiable information (PII) — names combined with identifiers, addresses, dates of birth, financial account numbers
Protected health information (PHI) — patient names, medical record numbers, diagnoses, treatment information
Financial data — account numbers, transaction details, credit card numbers, financial statements
Privileged communications — attorney-client communications, legal advice, litigation strategy
Confidential business information — trade secrets, proprietary processes, M&A-related content, board materials
Minor-related information — content relating to children or young people requiring heightened protection

When sensitivity is detected, configurable handling rules apply automatically — restricting access, applying encryption enforcement, routing to compliance review, or flagging for manual classification. The sensitivity classification itself becomes metadata on the document, enabling ABAC policies that restrict access based on the sensitivity level.

AI-Powered Summarisation

Document summarisation reduces the time required to triage, review, and understand lengthy documents. It's particularly valuable in high-volume intake workflows where reviewers process hundreds of documents daily and need to quickly determine relevance, priority, and required action.

Use Case	What Gets Summarised	How It Helps
Intake triage	Incoming applications, submissions, or filings	Reviewers read the summary to determine priority and routing before reading the full document
Case review	Case file documents — investigation reports, medical records, legal filings	Case workers get a consolidated view of case history without reading every document in full
Correspondence digest	Email threads and correspondence chains	Account managers and case workers understand the key exchanges without full-thread review
Contract review	Executed contracts and agreements	Legal reviewers identify key terms, obligations, and risk indicators before detailed review
Archive discovery	Archived documents retrieved from cold storage	Researchers and investigators determine document relevance from the summary before retrieving the full document
Management reporting	Lengthy reports, audit findings, or compliance assessments	Executives receive plain-language summaries of documents that would otherwise require specialist interpretation

Summaries are stored as metadata on the document — searchable and accessible without opening the full document. This is particularly valuable for archived documents in Glacier storage: the summary is available instantly even when the full document requires a retrieval request.

AI-Powered Document Analysis

Document analysis goes beyond extraction and summarisation to apply structured analytical judgement to document content. Analysis evaluates documents against criteria, identifies issues, and produces actionable outputs:

Analysis Type	What It Does	Example Application
Compliance analysis	Assesses whether a document meets defined compliance requirements	Evaluate regulatory submissions for completeness and conformity before acceptance
Contract analysis	Identifies obligations, rights, termination conditions, non-standard terms	Flag contracts deviating from standard templates before legal review
Application analysis	Evaluates applications against eligibility criteria	Assess grant applications for required elements and eligibility before scoring
Consistency analysis	Identifies conflicts or gaps between related documents	Flag policy provisions that conflict with regulatory requirements or other policies
Risk analysis	Evaluates content for risk indicators	Surface high-risk clauses in vendor contracts before approval
Completeness analysis	Checks whether a document set meets defined requirements	Verify that an onboarding document package contains all required items before the employee's start date

Analytical outputs are stored as structured metadata — making them searchable and available to workflow routing. A document that fails a compliance analysis can be automatically routed to a remediation queue. A contract flagged for non-standard terms can be automatically escalated to senior legal review.

Treat AI Outputs as Governed Information

AI outputs should not sit in a separate, informal layer if they affect workflows, decisions, records, obligations, access controls, or retention. In FormKiQ, classifications, extracted values, summaries, risk flags, sensitivity labels, confidence scores, and reviewer corrections become governed document metadata.

AI Output	Governance Question	FormKiQ Pattern
Document classification	Does the classification drive access controls, retention, or workflow routing?	Classification becomes searchable document metadata, editable by authorized users and tracked in the audit trail.
Extracted metadata	Will extracted values trigger alerts, deadlines, payments, approvals, or compliance actions?	Extracted values are stored as structured metadata; low-confidence values can route to human review.
Summaries	Could the summary reveal restricted content?	Summaries inherit document access controls and are stored with the governed document record.
Sensitivity labels	Does the label change who can access the document?	Sensitivity metadata can trigger ABAC restrictions, review queues, or compliance handling rules.
Confidence scores	When should automation pause for human review?	Thresholds can be configured by document type, workflow, site, or use case.

An AI-generated summary may be a convenience view. An AI-extracted renewal date is different: if that date triggers a reminder, task, escalation, or commercial decision, it should be reviewed, governed, and auditable.

Data Sovereignty for AI Processing

A critical differentiator for FormKiQ's AI processing: every step runs within your AWS account. This eliminates the data residency and data sovereignty concerns that prevent many regulated organisations from adopting AI document processing.

Concern	Third-Party AI Services	FormKiQ on Your AWS Account
Where is data processed?	Vendor's infrastructure — often unclear which region or country	Your AWS account, in the region you select, with inference region controls
Who has access to data during processing?	The vendor's systems and potentially their staff	Your AWS services — no vendor access to document content during processing
Is data used for model training?	Varies by vendor — some use customer data for model improvement	Amazon Bedrock does not use customer data for model training
Can you audit processing?	Limited — vendor provides logs if available	CloudTrail records every API call; Bedrock invocation logs in your account
Data residency compliance	Difficult to verify — depends on vendor's architecture	Verifiable — your AWS region configuration determines where processing occurs

For organisations subject to GDPR, HIPAA, PIPEDA, or data sovereignty requirements, this architecture eliminates the need to choose between AI capability and regulatory compliance.

Auditability and Access Control

AI should not become a shortcut around permissions or audit obligations. If a user cannot access a source document, they should not be able to access its AI-generated summary, extracted metadata, or answers derived from it through a knowledge interface.

Access controls apply to outputs

AI summaries, extracted entities, sensitivity labels, and analysis results inherit the document's ABAC model. Sensitive metadata fields can also be independently restricted where needed.

KnowledgeBase respects permissions

Knowledge search should filter answers and source references based on the querying user's document-level permissions, not expose restricted content indirectly.

Processing events are logged

Audit evidence should show which document version was processed, which AI service and model were used, what output was produced, and which workflow action followed.

Reviewer actions are preserved

Human approvals, corrections, rejections, and before-and-after values should be tracked in the same audit model as other document workflow decisions.

Because FormKiQ deploys into your AWS account, the evidence is available in infrastructure you control: CloudTrail for API activity, Bedrock invocation logging where configured, and FormKiQ's document and workflow audit trails for document-level actions.

Selective Processing and Human Oversight

AI document processing in FormKiQ is designed for selective, incremental deployment with human oversight — not as an all-or-nothing automation layer:

Per-document-type — enable specific AI capabilities for specific document types (OCR for all scans, classification for intake submissions, full analysis for contracts only)
Per-workflow — apply AI processing at specific workflow stages (classification at intake, summarisation before review, analysis before approval)
Per-site — in multi-tenant deployments, enable different AI capabilities for different organisational sites or business units
Confidence-based routing — low-confidence AI outputs routed to human review queues for verification before metadata is finalised
Incremental enablement — start with OCR, add Textract extraction, then layer AI capabilities as confidence in the outputs grows

This approach ensures that AI augments human judgement rather than replacing it unsupervised. The organisation controls which document types are processed by AI, at which workflow stages, and with what confidence thresholds — and every AI decision is auditable and reversible.

A Practical Adoption Roadmap

The most durable AI document processing programs start narrow, validate controls, and expand only after the organization trusts the processing path and governance model.

1. Assess

Identify high-value document bottlenecks, document types, data sensitivity, and expected outputs.

2. Define controls

Set requirements for residency, access control, audit logging, review rules, metadata governance, and retention.

3. Pilot narrowly

Use a bounded document set, clear success criteria, human review, and measured accuracy before production use.

4. Productionize

Use approved prompt templates, confidence thresholds, validation rules, exception handling, and audit trails.

5. Expand

Reuse the same control model for additional document types, departments, workflows, and regions.

FormKiQ Editions for AI Document Processing

Capability	Core Foundation	Essentials Operational	Advanced AI + Automation	Enterprise Full platform
Foundation
OCR — Tesseract
Essentials and above
OCR & Structured Extraction — Textract	—
Custom Extraction Mappings	—
Advanced and Enterprise
AI Classification (Bedrock)	—	—
AI Entity Extraction (Bedrock)	—	—
AI Sensitivity Classification (Bedrock)	—	—
AI Summarisation (Bedrock)	—	—
AI Document Analysis (Bedrock)	—	—
Inference Region Controls	—	—
KnowledgeBase	—	—
Document Gateway Modules	—	—
Integration Framework Modules	—	—
Multi-Instance & Multi-Region Licensing	—	—
Enterprise only
Vendor-Managed & Hybrid Deployment	—	—	—
Compliance Consulting	—	—	—
Support
Support	Community	2-business-day SLA	Private Slack + 40 hrs onboarding	8-business-hour SLA + strategic support

Getting Started

FormKiQ Core — including Tesseract OCR — can be deployed to your AWS account in fifteen to twenty minutes. Amazon Textract integration is available from Essentials onward. AI-powered classification, extraction, sensitivity detection, summarisation, and analysis are available on Advanced and Enterprise.

For enterprise buyers evaluating AI governance, deployment options, access-control alignment, auditability, and vendor questions, read the companion buyer guide: AI Document Processing Without Losing Control of Your Data.

Start a Proof-of-Value deployment

Frequently Asked Questions

What is AI document processing?

AI document processing is the use of machine learning and large language models to automate the classification, data extraction, sensitivity detection, summarisation, and analysis of documents. It extends traditional OCR (which extracts text) by adding understanding — identifying what type of document it is, what the important data elements are, whether it contains sensitive information, and what action should be taken.

How is FormKiQ's AI processing different from standalone AI document processing services?

Standalone AI services process documents in isolation — you send a document, get results, and then manually integrate those results into your document management workflow. FormKiQ integrates AI processing directly into the governed document lifecycle. AI outputs become metadata on the document, feed into search indexes, trigger workflow routing, and apply governance controls. The document is classified, enriched, and governed in a single platform rather than processed in one system and stored in another.

Does document content leave my AWS account during AI processing?

No. Every layer of FormKiQ's AI processing stack — Tesseract OCR, Amazon Textract, Amazon Bedrock — runs within your AWS account. Documents are never sent to external services. Inference region controls for Bedrock allow you to specify which AWS regions are used for model processing, ensuring content stays within your data residency boundaries.

What happens when AI classification is uncertain?

Every AI classification includes a confidence score. Documents classified with high confidence are automatically tagged and routed. Documents classified with low confidence are routed to a human review queue for manual verification before metadata is finalised. Confidence thresholds are configurable per document type, allowing organisations to set the boundary between automated and human-reviewed classification based on their risk tolerance.

Are AI outputs governed the same way as the source document?

Yes. AI outputs are stored as metadata on the governed document record. Summaries, extracted values, classifications, confidence scores, and sensitivity labels inherit the document's access controls and are subject to the same audit, retention, legal hold, and disposition model.

Can AI processing be limited to specific document types or workflows?

Yes. FormKiQ supports selective enablement by document type, workflow stage, site, and use case. Organizations can start with OCR or classification on low-risk intake documents, then add extraction, summarization, analysis, or sensitivity classification as confidence and governance maturity increase.

How do auditors review AI document processing decisions?

AI processing history can show which document was processed, which model and processing configuration were used, what output was produced, whether a reviewer accepted or corrected it, and what downstream workflow action followed. This evidence is retained through FormKiQ audit trails and AWS logging in the customer's own account.

Can I start with basic OCR and add AI later?

Yes. FormKiQ's AI capabilities are layered and independently enableable. You can deploy Core with Tesseract OCR, upgrade to Essentials for Textract structured extraction, and add AI classification, extraction, and analysis on Advanced — all within the same deployment. Each layer builds on the previous without requiring migration or reconfiguration.

What large language models does FormKiQ use?

FormKiQ's AI processing runs through Amazon Bedrock, which supports multiple large language models including Anthropic Claude, Amazon Nova, and other available models. The specific model used can be configured based on the processing task, cost considerations, and regional availability. All models are accessed within your AWS account — no external API calls are made.