AI Document Processing on AWS

Classification, extraction, summarisation, and analysis — powered by Amazon Bedrock and Amazon Textract, running entirely within your AWS account.

Classification, Extraction, Summarisation, and Analysis — Powered by Amazon Bedrock and Amazon Textract, Running Entirely Within Your AWS Account

Every organisation has a document processing bottleneck. Documents arrive — invoices, contracts, applications, correspondence, compliance filings, medical records, onboarding packets — and someone has to read them, classify them, extract the relevant data, enter it into a system, route them to the right person, and file them in the right place. This manual processing is slow, expensive, error-prone, and the single largest barrier to scaling document-intensive operations.

AI document processing automates this work. Not by replacing human judgement, but by handling the mechanical steps — identifying what type of document it is, extracting the structured data from it, classifying its sensitivity, summarising its content, and routing it to the right workflow — so that human attention is reserved for decisions that require it.

FormKiQ provides AI document processing as an integrated capability within a governed document management platform — not as a standalone AI service that processes documents in isolation. Documents are classified, extracted, enriched, and analysed within FormKiQ, and the results feed directly into metadata, search indexes, workflows, and governance controls. All AI processing runs within your AWS account through Amazon Bedrock and Amazon Textract. Your documents never leave your cloud environment, and inference region controls ensure processing stays within your data residency boundaries.

The AI Document Processing Stack

FormKiQ's AI document processing operates as a layered stack — each layer builds on the previous, and organisations can enable the layers they need:

Layer Technology What It Does Availability
OCR Tesseract Extracts raw text from scanned documents, images, and image-based PDFs All editions
Structured extraction Amazon Textract Extracts tables, form fields, key-value pairs, and spatial relationships from documents Essentials+
AI classification Amazon Bedrock Identifies document type and applies classification metadata automatically Advanced / Enterprise
AI extraction Amazon Bedrock Extracts entities (names, dates, amounts, identifiers) from unstructured text and applies them as structured metadata Advanced / Enterprise
AI sensitivity Amazon Bedrock Identifies documents containing PII, PHI, financial data, privileged content, or other sensitive information Advanced / Enterprise
AI summarisation Amazon Bedrock Generates concise summaries of document content for triage, review, and discovery Advanced / Enterprise
AI analysis Amazon Bedrock Analyses document content against criteria, checklists, or requirements — producing structured analytical output Advanced / Enterprise

The distinction between layers matters. OCR tells you what the text says. Textract tells you what the form fields contain. Bedrock tells you what the document means. Each layer adds understanding, and the output of each layer feeds into FormKiQ's metadata, search, and workflow systems.

OCR and Structured Extraction

The foundation of AI document processing is getting text out of documents that aren't born digital — scanned PDFs, faxes, images, and multi-page paper submissions.

Tesseract OCR (All Editions)

FormKiQ Core includes OCR processing using Tesseract, an open-source OCR engine that provides reliable text extraction for standard digitisation workflows. Tesseract runs within your AWS account using Lambda functions, with extracted text stored as part of the document's metadata and available for full-text search.

Tesseract works well for clearly printed text in standard layouts. It has limitations with handwriting, complex multi-column layouts, low-quality scans, and documents where spatial relationships matter (tables, forms).

Amazon Textract (Essentials and Above)

From Essentials onward, FormKiQ provides structured extraction using Amazon Textract — AWS's managed machine learning service for document text and data extraction. Textract goes beyond raw text to understand document structure:

Textract Capability What It Extracts Why It Matters
Text extraction All text from the document — including handwriting Handles the document types that Tesseract struggles with
Table extraction Tables as structured data with rows, columns, and cell values Financial statements, lab results, comparison tables — extracted as data, not as a wall of text
Form extraction Labelled fields as key-value pairs Application forms, intake forms, registration documents — each field extracted with its label and value
Layout understanding Spatial relationships between text elements — columns, headers, sections Multi-column documents, letters with headers and footers — text extracted in reading order, not pixel order

Custom Extraction Mappings

Textract's form and table extraction can be combined with custom mappings that route specific extracted fields directly to FormKiQ metadata attributes:

Document Type Extracted Fields Metadata Result
Invoice Vendor name, invoice number, date, line items, total AP metadata for matching, approval, and archival
Contract Counterparty, effective date, expiry, value Contract metadata for lifecycle tracking and renewal
Application Applicant name, date, type, key responses Application metadata for eligibility routing and case creation
Insurance claim Claimant, policy number, date of loss, amount Claims metadata for intake routing and adjudication
Employee onboarding form Name, role, start date, department, emergency contact HR metadata for employee record creation

AI-Powered Classification

Document classification is the gateway to governance. Until a document is classified — identified by type, tagged with the right metadata, assigned to the right category — governance rules can't apply to it. Manual classification is the bottleneck that slows intake, introduces inconsistency, and creates gaps where documents sit unclassified and ungoverned. FormKiQ's AI classification — powered by Amazon Bedrock — automates this step.

How AI Classification Works

  1. Document arrives — uploaded via API, web console, email, SFTP, or Document Gateway
  2. AI analysis triggered — a document action triggers Bedrock analysis based on the document's source, workflow, or document type definition
  3. Type identification — Bedrock analyses the document content and determines its type (invoice, contract, correspondence, application, medical record, policy document, etc.)
  4. Confidence scoring — the classification includes a confidence score indicating how certain the model is about its determination
  5. High-confidence routing — documents classified with high confidence are automatically tagged and routed to the appropriate workflow
  6. Low-confidence review — documents classified with low confidence are routed to a human review queue for verification before metadata is finalised

What Makes This Different from Rule-Based Classification

Rule-based classification (matching on filename, sender, folder, or keyword) works for predictable, consistent document sources. It breaks down when documents arrive from varied sources in varied formats — which is the reality for most intake-heavy organisations. AI classification works on document content, not on superficial attributes, which means it handles the messy, unpredictable documents that rules can't classify.

Classification Approach Strengths Weaknesses
Filename / folder rules Fast, simple, predictable Breaks when filenames are inconsistent or documents are misnamed
Keyword matching Catches specific terms High false positive rate; can't handle synonyms, abbreviations, or context
AI classification (Bedrock) Understands content and context; handles varied formats and sources Requires confidence thresholds; low-confidence results need human review

The practical approach is to combine all three: rules handle the predictable cases, AI handles the varied and unpredictable ones, and human review handles the edge cases that neither can resolve confidently.

AI-Powered Extraction and Enrichment

Beyond classification, Bedrock extracts meaningful entities from document content and applies them as structured, searchable metadata. This goes beyond Textract's form-field extraction — Bedrock understands context and can extract entities from unstructured narrative text, not just labelled form fields.

Entity Type What Gets Extracted How It's Used in FormKiQ
People Names, titles, roles, signatories Metadata for party-based search, access control, and correspondence tracking
Organisations Company names, agency names, departments Metadata for counterparty-level views and vendor/client portfolio search
Dates Effective dates, expiry dates, filing dates, deadlines Metadata for milestone tracking, renewal alerting, and retention trigger
Financial Amounts, payment terms, account numbers Metadata for financial reporting, approval routing, and three-way matching
Identifiers Case numbers, policy numbers, contract IDs, permit numbers Metadata for cross-referencing with enterprise systems
Locations Addresses, jurisdictions, regions Metadata for jurisdiction-aware governance and data residency

The key insight is that extraction without governance is just data entry automation. In FormKiQ, extracted entities become governed metadata — searchable, access-controlled, and tied to the document's lifecycle. An extracted expiry date becomes a renewal alert. An extracted counterparty becomes a portfolio-level search dimension. An extracted sensitivity indicator becomes an ABAC access restriction.

AI-Powered Sensitivity Classification

One of the highest-value AI capabilities for regulated organisations is automatic identification of documents containing sensitive information. Without sensitivity classification, organisations rely on users to correctly classify documents at the point of upload — which means sensitive documents regularly end up in general-access locations without appropriate controls. FormKiQ's sensitivity classification detects:

  • Personally identifiable information (PII) — names combined with identifiers, addresses, dates of birth, financial account numbers
  • Protected health information (PHI) — patient names, medical record numbers, diagnoses, treatment information
  • Financial data — account numbers, transaction details, credit card numbers, financial statements
  • Privileged communications — attorney-client communications, legal advice, litigation strategy
  • Confidential business information — trade secrets, proprietary processes, M&A-related content, board materials
  • Minor-related information — content relating to children or young people requiring heightened protection

When sensitivity is detected, configurable handling rules apply automatically — restricting access, applying encryption enforcement, routing to compliance review, or flagging for manual classification. The sensitivity classification itself becomes metadata on the document, enabling ABAC policies that restrict access based on the sensitivity level.

AI-Powered Summarisation

Document summarisation reduces the time required to triage, review, and understand lengthy documents. It's particularly valuable in high-volume intake workflows where reviewers process hundreds of documents daily and need to quickly determine relevance, priority, and required action.

Use Case What Gets Summarised How It Helps
Intake triage Incoming applications, submissions, or filings Reviewers read the summary to determine priority and routing before reading the full document
Case review Case file documents — investigation reports, medical records, legal filings Case workers get a consolidated view of case history without reading every document in full
Correspondence digest Email threads and correspondence chains Account managers and case workers understand the key exchanges without full-thread review
Contract review Executed contracts and agreements Legal reviewers identify key terms, obligations, and risk indicators before detailed review
Archive discovery Archived documents retrieved from cold storage Researchers and investigators determine document relevance from the summary before retrieving the full document
Management reporting Lengthy reports, audit findings, or compliance assessments Executives receive plain-language summaries of documents that would otherwise require specialist interpretation

Summaries are stored as metadata on the document — searchable and accessible without opening the full document. This is particularly valuable for archived documents in Glacier storage: the summary is available instantly even when the full document requires a retrieval request.

AI-Powered Document Analysis

Document analysis goes beyond extraction and summarisation to apply structured analytical judgement to document content. Analysis evaluates documents against criteria, identifies issues, and produces actionable outputs:

Analysis Type What It Does Example Application
Compliance analysis Assesses whether a document meets defined compliance requirements Evaluate regulatory submissions for completeness and conformity before acceptance
Contract analysis Identifies obligations, rights, termination conditions, non-standard terms Flag contracts deviating from standard templates before legal review
Application analysis Evaluates applications against eligibility criteria Assess grant applications for required elements and eligibility before scoring
Consistency analysis Identifies conflicts or gaps between related documents Flag policy provisions that conflict with regulatory requirements or other policies
Risk analysis Evaluates content for risk indicators Surface high-risk clauses in vendor contracts before approval
Completeness analysis Checks whether a document set meets defined requirements Verify that an onboarding document package contains all required items before the employee's start date

Analytical outputs are stored as structured metadata — making them searchable and available to workflow routing. A document that fails a compliance analysis can be automatically routed to a remediation queue. A contract flagged for non-standard terms can be automatically escalated to senior legal review.

Data Sovereignty for AI Processing

A critical differentiator for FormKiQ's AI processing: every step runs within your AWS account. This eliminates the data residency and data sovereignty concerns that prevent many regulated organisations from adopting AI document processing.

Concern Third-Party AI Services FormKiQ on Your AWS Account
Where is data processed? Vendor's infrastructure — often unclear which region or country Your AWS account, in the region you select, with inference region controls
Who has access to data during processing? The vendor's systems and potentially their staff Your AWS services — no vendor access to document content during processing
Is data used for model training? Varies by vendor — some use customer data for model improvement Amazon Bedrock does not use customer data for model training
Can you audit processing? Limited — vendor provides logs if available CloudTrail records every API call; Bedrock invocation logs in your account
Data residency compliance Difficult to verify — depends on vendor's architecture Verifiable — your AWS region configuration determines where processing occurs

For organisations subject to GDPR, HIPAA, PIPEDA, or data sovereignty requirements, this architecture eliminates the need to choose between AI capability and regulatory compliance.

Selective Processing and Human Oversight

AI document processing in FormKiQ is designed for selective, incremental deployment with human oversight — not as an all-or-nothing automation layer:

  • Per-document-type — enable specific AI capabilities for specific document types (OCR for all scans, classification for intake submissions, full analysis for contracts only)
  • Per-workflow — apply AI processing at specific workflow stages (classification at intake, summarisation before review, analysis before approval)
  • Per-site — in multi-tenant deployments, enable different AI capabilities for different organisational sites or business units
  • Confidence-based routing — low-confidence AI outputs routed to human review queues for verification before metadata is finalised
  • Incremental enablement — start with OCR, add Textract extraction, then layer AI capabilities as confidence in the outputs grows

This approach ensures that AI augments human judgement rather than replacing it unsupervised. The organisation controls which document types are processed by AI, at which workflow stages, and with what confidence thresholds — and every AI decision is auditable and reversible.

FormKiQ Editions for AI Document Processing

Capability Core Essentials Advanced Enterprise
OCR — Tesseract
OCR & Structured Extraction — Textract
Custom Extraction Mappings
AI Classification (Bedrock)
AI Entity Extraction (Bedrock)
AI Sensitivity Classification (Bedrock)
AI Summarisation (Bedrock)
AI Document Analysis (Bedrock)
Inference Region Controls
KnowledgeBase
Document Gateway Modules
Integration Framework Modules
Multi-Instance & Multi-Region Licensing
Vendor-Managed & Hybrid Deployment
Compliance Consulting
SupportCommunity2-business-day SLAPrivate Slack + 40 hrs onboarding8-business-hour SLA + strategic support

Getting Started

FormKiQ Core — including Tesseract OCR — can be deployed to your AWS account in fifteen to twenty minutes. Amazon Textract integration is available from Essentials onward. AI-powered classification, extraction, sensitivity detection, summarisation, and analysis are available on Advanced and Enterprise.

Schedule a consultation · Start a Proof-of-Value deployment

Frequently Asked Questions

What is AI document processing?

AI document processing is the use of machine learning and large language models to automate the classification, data extraction, sensitivity detection, summarisation, and analysis of documents. It extends traditional OCR (which extracts text) by adding understanding — identifying what type of document it is, what the important data elements are, whether it contains sensitive information, and what action should be taken.

How is FormKiQ's AI processing different from standalone AI document processing services?

Standalone AI services process documents in isolation — you send a document, get results, and then manually integrate those results into your document management workflow. FormKiQ integrates AI processing directly into the governed document lifecycle. AI outputs become metadata on the document, feed into search indexes, trigger workflow routing, and apply governance controls. The document is classified, enriched, and governed in a single platform rather than processed in one system and stored in another.

Does document content leave my AWS account during AI processing?

No. Every layer of FormKiQ's AI processing stack — Tesseract OCR, Amazon Textract, Amazon Bedrock — runs within your AWS account. Documents are never sent to external services. Inference region controls for Bedrock allow you to specify which AWS regions are used for model processing, ensuring content stays within your data residency boundaries.

What happens when AI classification is uncertain?

Every AI classification includes a confidence score. Documents classified with high confidence are automatically tagged and routed. Documents classified with low confidence are routed to a human review queue for manual verification before metadata is finalised. Confidence thresholds are configurable per document type, allowing organisations to set the boundary between automated and human-reviewed classification based on their risk tolerance.

Can I start with basic OCR and add AI later?

Yes. FormKiQ's AI capabilities are layered and independently enableable. You can deploy Core with Tesseract OCR, upgrade to Essentials for Textract structured extraction, and add AI classification, extraction, and analysis on Advanced — all within the same deployment. Each layer builds on the previous without requiring migration or reconfiguration.

What large language models does FormKiQ use?

FormKiQ's AI processing runs through Amazon Bedrock, which supports multiple large language models including Anthropic Claude, Amazon Nova, and other available models. The specific model used can be configured based on the processing task, cost considerations, and regional availability. All models are accessed within your AWS account — no external API calls are made.

Start with FormKiQ Core

The open-source foundation — API-first, deployable into your own AWS account, and free to use. Right for architecture validation and early implementation.

Get Started Free

Deploy FormKiQ Essentials or Advanced

Production-ready editions for departments and complex workflows. Start with a Proof-of-Value deployment or go straight to production.

Explore Options

Plan an Enterprise Rollout

For governance-heavy environments with residency, sovereignty, assurance, and multi-jurisdiction requirements. Talk to us about the right deployment model.

Book a Call