Enterprise Ingestion Architecture

The Complete Enterprise Ingestion Pipeline.

Every document, database row, real-time stream, and scanned archive — ingested, validated, classified, and compliance-audited before a single vector is written. Nine layers. Four compliance zones. Zero silent failures.

9 Layers: Hardware to audit trail
4 Compliance Zones: ISO 9001 + SOC 1 Type 2
3 Data Tracks: Batch · Streaming · Multi-modal

Complete Architecture

Nine Layers. No Gaps.

Most ingestion pipelines are built layer by layer until something works. This one was designed top-down from enterprise compliance requirements, then validated against real failure modes at each stage.

Architecture diagram (text summary, top to bottom)

  • 0 · Hardware: Compute · Object Storage · VPC · HSM/KMS · AES-256 at rest · TLS 1.2+ in transit
  • 1 · Connector + credential management: Secrets vault · Rotation · Registry · Access log · SOC 1 CC6.2
  • ★ Data contracts + schema registry: Schema registry · Contract validation · Schema evolution · Data quality SLA
  • 2 · Scheduler + CDC + dead letter queue: Airflow · Debezium · Kafka · Retry · DLQ
  • ★ Backfill vs incremental router: Backfill mode · Incremental / CDC · Progress checkpoint · Rate limiter
  • 3 · Data classification: Sensitivity label · Data category · Retention class · Namespace assignment
  • A · Pre-ingestion compliance gate (SOC 1 CC6): User identity / SSO / MFA · Upload event log · RBAC check · File metadata log (SHA-256) · Virus scan (ClamAV) · Format allowlist · File size gate · PII pre-screen · FAIL → quarantine · PASS
  • Type detection + routing: Batch structured · Batch unstructured · Streaming · Multi-modal
  • Unstructured track: PDFs · Email · Scan · ASR → PDF parser · MIME decode · OCR engine · Transcribe → Layout + table + figure extraction → Language detect + quality filter
  • Structured track: SQL · API · CSV · CDC → Schema · CDC delta · Column map · Deserialize → Type casting + null handling → Row-to-NL serialization
  • ★ Streaming + multi-modal: Kafka · Video · Code AST · DICOM → Format adapter pattern → Windowing + micro-batch → Validate + schema check
  • B · Mid-pipeline quality gate (ISO 9001 §8.7): OCR confidence · Parse errors · Schema drift · PII redaction log
  • Unified normalization: Canonical format · Lineage ID · Sensitivity label · Dedup (SimHash) · Metadata enrichment · Deduplication · Access tagging · Sensitivity label propagated
  • ★ Embedding strategy layer: Chunking strategy · Embedding model · Dimension tradeoff · Multilingual routing · Overlap strategy · Embedding cache · Batch size tuning · MRL / Matryoshka
  • Semantic segmentation: Chunking · Hierarchy preservation · Namespace isolation
  • C · Data lifecycle management (SOC 1 change control · ISO 9001 §7.5): Update document (re-parse + re-embed + version log) · Delete document (chunks purged, tombstone retained) · Retention policy (auto-expire, legal hold override) · Right to erasure (GDPR, confirmed delete log)
  • D · Immutable audit trail (SOC 1 Type 2 operating effectiveness evidence): Ingestion log (who · what · when · size · hash) · Quality gate log (pass / fail + reason per doc) · Change log (actor + justification) · Periodic audit report (ISO 9001 §9.2 · SOC 1 T2 evidence)
  • Processing layer: compliant, embedded, audited documents → DLQ retry loop
  • Legend: Infrastructure · Compliance zone · Unstructured · Structured · Streaming + multi-modal · ★ New layers

Layer Detail

What Each Layer Does — And Why It Exists

Every layer has a defined purpose, failure mode, and governance checkpoint. None are optional in a production enterprise system.

STAGE 01
Hardware Layer
Compute nodes, object storage (S3/Azure Blob/GCS), network fabric with VPC private endpoints, and HSM/KMS for key management. AES-256 encryption at rest. TLS 1.2+ for all connections in transit. The KMS is not optional — SOC 1 auditors will ask where your encryption keys live and who can access them.
EC2 / Azure VM / GCP Compute · S3 / Blob / GCS · AWS KMS / Azure Key Vault / Cloud KMS · VPC · AES-256 · TLS 1.2+
STAGE 02
Connector + Credential Management
Every source system credential — database passwords, API keys, OAuth tokens for SharePoint and Gmail connectors — lives in a secrets vault with scheduled rotation and breach-triggered rotation. A credential access log records which service fetched which credential and when. Hardcoded credentials are an instant SOC 1 finding.
HashiCorp Vault · AWS Secrets Manager · Azure Key Vault · Debezium · Credential Rotation · Access Log · SOC 1 CC6.2
STAGE 03
Data Contracts + Schema Registry
The formal agreement between data producers and the ingestion pipeline. Schemas are defined in Apache Avro, Protobuf, or JSON Schema and held in a schema registry. Contract validation fires before any data enters processing: a breaking schema change routes to the dead letter queue and alerts the producer team. Without this, a producer silently changes a database column and corrupts three weeks of embeddings.
Confluent Schema Registry · AWS Glue · Apache Avro · Protobuf · Backward Compatibility Enforcement · Producer SLA
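The contract check described above can be sketched in a few lines. This is a minimal illustration, not a schema-registry client: the `Contract` and `Router` classes, the field names, and the in-memory dead letter list are all hypothetical stand-ins for a real registry and queue.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Contract:
    """Hypothetical data contract: required field names mapped to Python types."""
    required: dict[str, type]


@dataclass
class Router:
    """Collects accepted records and dead-lettered records with a reason."""
    accepted: list[dict] = field(default_factory=list)
    dead_letter: list[tuple[dict, str]] = field(default_factory=list)


def validate(record: dict, contract: Contract) -> str | None:
    """Return None if the record honours the contract, else a reason string."""
    for name, expected in contract.required.items():
        if name not in record:
            return f"missing required field: {name}"
        if not isinstance(record[name], expected):
            return f"wrong type for {name}: expected {expected.__name__}"
    return None


def ingest(records: list[dict], contract: Contract) -> Router:
    """Validate every record before processing; violations go to the DLQ."""
    router = Router()
    for record in records:
        reason = validate(record, contract)
        if reason is None:
            router.accepted.append(record)
        else:
            # Rejected records are never ingested silently.
            router.dead_letter.append((record, reason))
    return router


contract = Contract(required={"customer_id": int, "tier": str, "churn_risk": float})
result = ingest(
    [
        {"customer_id": 4821, "tier": "gold", "churn_risk": 0.82},
        {"customer_id": 4822, "churn_risk": 0.10},  # producer dropped "tier"
        {"customer_id": "4823", "tier": "silver", "churn_risk": 0.4},  # type changed
    ],
    contract,
)
for record, reason in result.dead_letter:
    print(record["customer_id"], "->", reason)
```

In a real deployment the reason string would travel with the dead-lettered message so the producer alert can show what broke.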
STAGE 04
Scheduler + CDC + Dead Letter Queue
Ingestion runs on a schedule (Airflow, Prefect), not on prayer. Change Data Capture via Debezium reads only changed rows from the database transaction log, so there are no full re-scans and no missed updates. Documents that still fail after N retries route to a dead letter queue with an alert. A progress checkpoint store (Redis/DynamoDB) means a backfill that crashes at document 40 million of 50 million resumes at 40 million, not at zero.
Apache Airflow · Prefect · Debezium · Apache Kafka · Redis Checkpoints · DLQ · Exponential Backoff
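One way to picture the retry-then-dead-letter behaviour is the sketch below. It assumes a retry limit of three and a 0.5-second base delay (neither figure comes from the pipeline itself), and uses a plain list in place of a real DLQ topic. The `sleep` parameter is injectable so the example runs instantly.

```python
import time
from typing import Callable


def process_with_retry(
    doc_id: str,
    handler: Callable[[str], None],
    dead_letter: list,
    max_retries: int = 3,       # assumed value of "N"
    base_delay: float = 0.5,    # assumed base delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call handler(doc_id) up to max_retries times, then dead-letter the document."""
    last_error = ""
    for attempt in range(max_retries):
        try:
            handler(doc_id)
            return True
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = repr(exc)
            if attempt < max_retries - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    dead_letter.append({"doc_id": doc_id, "attempts": max_retries, "error": last_error})
    return False


calls = {"n": 0}


def flaky(doc_id: str) -> None:
    """Fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("source unavailable")


def always_fails(doc_id: str) -> None:
    raise ValueError("unparseable")


dlq: list = []
delays: list = []
ok = process_with_retry("doc-1", flaky, dlq, sleep=delays.append)
failed = process_with_retry("doc-2", always_fails, dlq, sleep=delays.append)
print(ok, failed, dlq)
```

The document that recovers on its third attempt never reaches the DLQ; the one that exhausts its retries lands there with its last error attached, which is what makes the alert actionable.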
STAGE 05
Pre-Ingestion Compliance Gate (Zone A)
Every document passes through identity verification (SSO/MFA), RBAC permission check, virus scan (ClamAV or cloud AV), file format allowlist validation, file size policy enforcement, SHA-256 hash logging, and PII pre-screen before any processing begins. Failures route to quarantine with an immutable alert log. This is the gate SOC 1 auditors review first.
SOC 1 CC6 · SSO / MFA · RBAC · ClamAV · SHA-256 · PII Pre-screen · Quarantine Store · Upload Event Log
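A reduced sketch of the gate, covering only the RBAC check, format allowlist, size policy, and SHA-256 logging. The virus scan, SSO/MFA, and PII pre-screen are deliberately left out because they depend on external services. The allowlist, the 50 MiB limit, and the `ingest:write` role name are hypothetical policy values chosen for illustration.

```python
import hashlib
import os
from dataclasses import dataclass, field

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".csv", ".txt"}  # hypothetical allowlist
MAX_BYTES = 50 * 1024 * 1024  # hypothetical 50 MiB size policy


@dataclass
class GateResult:
    passed: bool
    sha256: str
    reasons: list = field(default_factory=list)


def pre_ingestion_gate(filename, payload, user_roles, required_role="ingest:write"):
    """Run the cheap, local checks; any failure means quarantine."""
    reasons = []
    if required_role not in user_roles:
        reasons.append(f"rbac: missing role {required_role}")
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        reasons.append(f"format not on allowlist: {ext or '<none>'}")
    if len(payload) > MAX_BYTES:
        reasons.append("file exceeds size policy")
    # The hash is computed for every upload, pass or fail, so it can be logged.
    digest = hashlib.sha256(payload).hexdigest()
    return GateResult(passed=not reasons, sha256=digest, reasons=reasons)


ok = pre_ingestion_gate("report.PDF", b"%PDF-1.7", {"ingest:write"})
blocked = pre_ingestion_gate("invoice.docm", b"PK", {"viewer"})
print(ok.passed, blocked.passed, blocked.reasons)
```

Collecting every failure reason, rather than stopping at the first, gives the quarantine log a complete picture of why a file was rejected.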
STAGE 06
Dual-Track + Streaming Processing
Structured data (SQL databases, REST APIs, CSV, JSON/XML) and unstructured data (PDFs, emails, scanned images, audio) require fundamentally different processing pipelines before they can converge. Unstructured: OCR engine, MIME decode, ASR transcription, layout detection, table and figure extraction, language detection. Structured: schema extraction, CDC delta processing, column mapping, type casting, null handling, and row-to-natural-language serialization. A third track handles real-time streaming (Kafka) and multi-modal inputs (video frames, code AST parsing, CAD files, DICOM).
GCP Document AI · AWS Textract · Tesseract · Whisper ASR · Apache Kafka · Debezium CDC · AST Parsing · Multi-modal Adapters
STAGE 07
Mid-Pipeline Quality Gate (Zone B)
After processing, before normalization. OCR confidence threshold check — below 80% triggers quarantine, not silent pass-through. Parse error log creates a nonconforming document record per ISO 9001 §8.7. Schema drift alerts fire when structured data deviates from contract. PII redaction log records what was masked and why. This is where nonconforming inputs are caught before they corrupt your vector index.
ISO 9001 §8.7 · OCR Confidence Gate · Schema Drift Alert · Parse Error Log · PII Redaction Log · Nonconforming Output Control
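The OCR confidence check could look something like this. The 80% threshold is the one named above; averaging per-page confidences is an assumption (a stricter gate might use the minimum page score instead), and the record layout is hypothetical.

```python
from statistics import mean

OCR_CONFIDENCE_THRESHOLD = 0.80  # the 80% threshold named above


def quality_gate(doc_id, page_confidences, threshold=OCR_CONFIDENCE_THRESHOLD):
    """Return a gate-log record; documents below the threshold are quarantined."""
    # Assumption: document confidence is the mean of its per-page scores.
    # A document with no scored pages is treated as confidence 0.0.
    score = mean(page_confidences) if page_confidences else 0.0
    passed = score >= threshold
    return {
        "doc_id": doc_id,
        "ocr_confidence": round(score, 3),
        "decision": "pass" if passed else "quarantine",
        "reason": None if passed else f"mean OCR confidence {score:.2f} < {threshold:.2f}",
    }


good = quality_gate("doc-1", [0.95, 0.91, 0.88])
bad = quality_gate("doc-2", [0.92, 0.41, 0.63])
print(good["decision"], bad["decision"], bad["reason"])
```

Every document produces a log record with a decision and a reason, which is the per-document pass/fail evidence the quality gate log is meant to hold.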
STAGE 08
Unified Normalization + Embedding Strategy
All three tracks converge into a canonical document format with a lineage ID, sensitivity label, deduplication hash (SimHash for near-duplicates, not just exact SHA-256), and access tags. Then: chunking strategy selection (fixed/semantic/late/parent-child), embedding model selection (dense, sparse, hybrid, multilingual routing), dimensionality decision (768 vs 1536 vs 3072), overlap strategy, and embedding cache check to avoid re-embedding unchanged documents. The embedding cache alone can reduce embedding cost by up to 80% on incremental ingestion runs, because only new or changed documents are re-embedded.
SimHash Dedup · Semantic Chunking · Sentence Transformers · text-embedding-3 · Hybrid Dense+Sparse · MRL / Matryoshka · Embedding Cache · 512-token sweet spot
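To show why SimHash catches near-duplicates that an exact SHA-256 comparison misses, here is a minimal 64-bit SimHash. The three-word shingles, the use of SHA-256 as the per-shingle hash, and any distance cut-off you apply are assumptions made for the sketch, not settings taken from this pipeline.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """SimHash fingerprint over lowercase three-word shingles."""
    words = re.findall(r"\w+", text.lower())
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    weights = [0] * bits
    for shingle in shingles:
        # Take the first 8 bytes of SHA-256 as a 64-bit feature hash.
        h = int.from_bytes(hashlib.sha256(shingle.encode("utf-8")).digest()[:8], "big")
        for bit in range(bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # Each fingerprint bit is the majority vote of the shingle hashes.
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


original = "The supplier shall deliver all goods within thirty days of the purchase order date."
revised = "The supplier shall deliver all goods within sixty days of the purchase order date."
print(hamming(simhash(original), simhash(revised)))
```

Documents whose fingerprints differ in only a few bits are candidate near-duplicates; the cut-off is a tuning decision and should be validated against your own corpus.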
STAGE 09
Data Lifecycle + Immutable Audit Trail (Zones C + D)
Update: the document is re-parsed and re-embedded, and the new version is logged. Delete: all chunks are purged from the vector DB and a tombstone is retained in the audit log; the tombstone is never deleted. Retention policy auto-expires documents by retention class, with a legal hold override. Right-to-erasure requests produce a confirmed deletion log containing no PII. The Zone D immutable audit trail (ingestion log, quality gate log, change log, periodic audit report) is written to WORM (write once, read many) storage. A SOC 1 Type 2 report requires evidence that controls operated effectively throughout the audit period, typically 12 months. This is that evidence.
SOC 1 Type 2 · ISO 9001 §7.5 · WORM Storage · Tombstone Records · GDPR Right to Erasure · Retention Policy · S3 Object Lock · Periodic Audit Report
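The delete path can be sketched as a two-step protocol: flag first so retrieval stops at once, purge second, and log both. The `VectorStore` class and the list used as an audit log are in-memory stand-ins; a real system would use the vector DB's metadata filter and WORM storage. The `svc-lifecycle` actor name is hypothetical.

```python
from datetime import datetime, timezone


class VectorStore:
    """Stand-in for a vector DB with metadata filtering."""

    def __init__(self):
        self.chunks = {}  # chunk_id -> {"doc_id", "text", "deleted"}

    def search(self):
        # Retrieval only ever sees chunks that are not flagged as deleted.
        return [c for c in self.chunks.values() if not c["deleted"]]


def soft_delete(store, audit_log, doc_id, actor):
    """Step 1: flag chunks so retrieval stops immediately, and write the tombstone."""
    for chunk in store.chunks.values():
        if chunk["doc_id"] == doc_id:
            chunk["deleted"] = True
    audit_log.append({"event": "tombstone", "doc_id": doc_id, "actor": actor,
                      "at": datetime.now(timezone.utc).isoformat()})


def hard_delete(store, audit_log, doc_id):
    """Step 2 (asynchronous in practice): purge chunks, then log the confirmation."""
    doomed = [cid for cid, c in store.chunks.items() if c["doc_id"] == doc_id]
    for cid in doomed:
        del store.chunks[cid]
    audit_log.append({"event": "deletion_confirmed", "doc_id": doc_id,
                      "chunks_purged": len(doomed),
                      "at": datetime.now(timezone.utc).isoformat()})


store = VectorStore()
store.chunks = {
    "c1": {"doc_id": "doc-A", "text": "first chunk", "deleted": False},
    "c2": {"doc_id": "doc-A", "text": "second chunk", "deleted": False},
    "c3": {"doc_id": "doc-B", "text": "other document", "deleted": False},
}
audit = []
soft_delete(store, audit, "doc-A", actor="svc-lifecycle")
visible_after_soft = [c["doc_id"] for c in store.search()]
hard_delete(store, audit, "doc-A")
print(visible_after_soft, list(store.chunks), [e["event"] for e in audit])
```

Note that the audit entries carry only identifiers and counts, never chunk text, so the tombstone can outlive the document it records.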
Data Track Detail

Structured vs Unstructured — Why They Need Separate Tracks

The two data types have fundamentally different failure modes and require different expertise to handle well. Collapsing them into a single pipeline is the most common architectural mistake in enterprise RAG projects.

Unstructured track

  • PDFs / DOCX: PDF parser → layout detection → table + figure extraction
  • Emails / Chat: MIME decode → thread reconstruction → attachment handling
  • Scanned images: deskew → denoise → OCR engine → confidence scoring
  • Audio / Video: ASR transcription → speaker diarization → timestamp alignment
  • Code repositories: AST parsing, not naive text extraction
  • CAD / DICOM: Format-specific adapter pattern → metadata extraction
Key challenge: The hardest problem in unstructured ingestion is not OCR — it is layout and structure detection. A PDF can be a single-column report, a two-column legal contract, a scanned form, or a slide deck. Detecting layout first determines whether your OCR output is coherent or scrambled.

Structured track

Row → natural language serialization
A database row means nothing to an embedding model. {customer_id: 4821, tier: 'gold', churn_risk: 0.82} must become: "Customer 4821 is a gold-tier account with 82% churn risk." That serialization must be deliberate, not JSON.stringify().
Change Data Capture — delta only
Re-ingesting an entire database on every run is how pipelines break production systems. CDC reads only the transaction log delta — new rows, updated rows, deleted rows — at near-zero source load.
Relational context preservation
A chunk referencing customer_id: 4821 without carrying the customer name, tier, and account status will retrieve poorly. JOIN context must be serialized into the chunk at ingestion time, not reconstructed at query time.
Schema contract enforcement
When a producer removes a required column, the pipeline must reject the data and alert, not silently ingest bad records that corrupt three weeks of downstream embeddings.
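The row-to-natural-language step from the structured track above can be illustrated with a deliberate template. This is a sketch: the sentence wording follows the example given earlier, while the `account` lookup and its `name` and `status` fields are hypothetical illustrations of JOIN context carried into the chunk.

```python
from typing import Optional


def serialize_customer(row: dict, account: Optional[dict] = None) -> str:
    """Render one customer row as a sentence an embedding model can use."""
    risk_pct = round(row["churn_risk"] * 100)
    sentence = (
        f"Customer {row['customer_id']} is a {row['tier']}-tier account "
        f"with {risk_pct}% churn risk."
    )
    if account:
        # JOIN context is written into the chunk at ingestion time.
        sentence += f" The account belongs to {account['name']} and is {account['status']}."
    return sentence


row = {"customer_id": 4821, "tier": "gold", "churn_risk": 0.82}
account = {"name": "Acme Corp", "status": "active"}  # illustrative data only

print(serialize_customer(row))
print(serialize_customer(row, account))
```

A template per table keeps the wording stable across runs, which matters because a change in phrasing changes the embedding even when the underlying row has not changed.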
Compliance Architecture

ISO 9001 + SOC 1 Type 2 — Built Into Every Layer

Compliance is not a feature added before deployment. In this pipeline, every governance control is a structural decision enforced at the infrastructure layer. Here is exactly what each standard requires — and where in the pipeline it is enforced.

ISO 9001 — Process Quality Management
§7.4 Controlled input: Type detection router rejects unapproved formats
§7.5 Documented information: Every document versioned, lineage ID assigned
§7.5.3 Control of records: WORM audit log, retention policy enforced
§8.4 External provider control: Connector registry — approved sources only
§8.7 Nonconforming outputs: Zone B quality gate with mandatory logging
§9.1.1 Monitoring: OCR confidence, parse error rates, schema drift tracked
§9.2 Internal audit: Periodic audit report auto-generated from logs
SOC 1 Type 2 — Financial Reporting Controls
Note: the CC identifiers below (CC6.1 to CC8.1) use the numbering of the AICPA Trust Services Criteria, the framework applied in SOC 2 examinations.
CC6.1 Logical access: RBAC enforced at vector DB filter layer
CC6.2 Credential access: Secrets vault + rotation + access log
CC6.6 Security events: Virus scan, failed auth log, quarantine alerts
CC7.2 System monitoring: Prometheus metrics, latency alerts, cost tracking
CC8.1 Change management: Update/delete events logged with actor + reason
Availability: DLQ + retry policy ensures no silent document loss
Type 2 evidence: 12 months of WORM logs demonstrating control operation
Architectural Note: Task AI Systems designs systems aligned with ISO 9001 process integrity requirements and SOC 1 Type 2 audit control principles. We architect systems to be compliance-ready; formal certification responsibilities remain with your organization's compliance function.
Cloud Cost Estimation

Size Your Ingestion Pipeline

Ingestion pipeline cost is driven by three variables that compound one another: OCR volume, storage, and compute. Monthly infrastructure cost differs across AWS, Azure, and GCP depending on your workload.

🟠
Choose AWS when…
Your existing workloads are on AWS, you need the widest connector ecosystem, or your org already has AWS Enterprise Support and compliance tooling. Cost premium is ~35% but switching cost is often higher.
🔵
Choose Azure when…
Your source data lives in Microsoft 365, SharePoint, or Teams. Native connectors eliminate an entire custom-build layer. Azure Purview and Sentinel are the most mature enterprise compliance tooling of the three clouds.
🟢
Choose GCP when…
Cost is the primary driver, your data is in BigQuery/GCS, or OCR volume is high. Document AI pricing advantage compounds fast above 500K pages/month. Best pure-price option for a greenfield build.
Operational Resilience

Every Failure Mode. Every Recovery Path.

An ingestion pipeline is only as good as its behaviour when something goes wrong. These are the failure modes we design explicit recovery paths for — not the ones we discover after deployment.

OCR Confidence Collapse
OCR confidence drops below threshold on a batch of documents — scanned archives with degraded quality, rotated pages, or handwritten annotations. Without a quality gate, corrupt text enters the vector index silently.
Schema Contract Break
A producer removes a required database column or changes a data type without notice. Without contract enforcement, corrupt records enter normalization and produce semantically broken embeddings.
Backfill Crash at Scale
A backfill job processing 50 million legacy documents crashes at document 40 million. Without checkpointing, the restart processes 40 million documents again — duplicating costs and creating duplicate chunks.
Deleted Document Still Retrieving
A user deletes a document but its chunks remain in the vector index. Retrieval continues returning content from a document that no longer exists — a compliance failure in regulated environments.
PII Leakage Into Embeddings
A document containing customer PII — names, account numbers, health data — is ingested without classification. The PII enters embeddings and becomes retrievable by users without appropriate access clearance.
Virus in Upload
A malicious file is uploaded through the ingestion API. Without pre-upload scanning, the file enters the processing pipeline, where it may execute during parsing. This is particularly relevant for macro-enabled Office files (for example, DOCM).
Recovery path for each failure mode (in the same order as above)
Zone B Quality Gate + Quarantine
Confidence threshold (80%) fires. Document routes to quarantine store. DLQ entry created. Pipeline continues for all other documents. Alert fires with document ID, source, and confidence score. Human review queue populated.
Schema Registry Rejection + Alert
Contract validation layer rejects non-conforming messages. Routed to DLQ. Producer team receives automated alert with schema diff. Historical messages continue on last known good schema. No silent ingestion of corrupt data.
Progress Store + Idempotency Keys
Checkpoint store records last successfully processed document ID after each confirmed index write. Restart reads checkpoint and resumes from that position. SHA-256 idempotency key prevents duplicate chunks from boundary re-processing.
Tombstone + Chunk Purge Protocol
Deletion immediately sets deleted=true metadata on all chunks (stops retrieval). Async hard-delete removes all chunks from vector DB. Tombstone record retained in WORM audit log permanently. Confirmed deletion event written for SOC 1 evidence.
PII Pre-Screen + Redaction Log
Zone A pre-screen flags PII before processing. Zone B redaction log records what was masked and why. Sensitivity label assigned at source propagates through all downstream metadata. RBAC at vector DB filter layer enforces access at query time.
Pre-Ingestion Virus Gate
ClamAV or cloud AV scans every file before any processing begins. Infected files route immediately to quarantine with an immutable alert log. No processing occurs. Format allowlist rejects macro-enabled formats unless explicitly permitted by policy.
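The "Progress Store + Idempotency Keys" recovery path above can be sketched as follows. A dictionary stands in for the Redis/DynamoDB checkpoint store and a small class stands in for the vector index; the `crash_after` parameter exists only to simulate a failure between the index write and the checkpoint write.

```python
import hashlib


class InMemoryIndex:
    """Stand-in for a vector index keyed by an idempotency key."""

    def __init__(self):
        self.chunks = {}

    def upsert(self, key, text):
        self.chunks[key] = text  # same key overwrites, so replays never duplicate


def run_backfill(doc_ids, load, index, checkpoint, crash_after=None):
    """Process doc_ids in order, resuming after checkpoint['last_done'] if set."""
    start = 0
    if checkpoint.get("last_done") is not None:
        start = doc_ids.index(checkpoint["last_done"]) + 1
    for n, doc_id in enumerate(doc_ids[start:], start=1):
        text = load(doc_id)
        # SHA-256 over ID plus content is the idempotency key.
        key = hashlib.sha256(f"{doc_id}:{text}".encode("utf-8")).hexdigest()
        index.upsert(key, text)
        if crash_after is not None and n == crash_after:
            raise RuntimeError("simulated crash before checkpoint write")
        checkpoint["last_done"] = doc_id  # written only after the index write


loaded = []


def load(doc_id):
    loaded.append(doc_id)
    return f"body of {doc_id}"


docs = ["d1", "d2", "d3", "d4", "d5"]
index = InMemoryIndex()
checkpoint = {}

try:
    run_backfill(docs, load, index, checkpoint, crash_after=3)
except RuntimeError:
    pass  # the job dies after indexing d3 but before checkpointing it

run_backfill(docs, load, index, checkpoint)  # restart resumes after d2
print(loaded, len(index.chunks), checkpoint)
```

The boundary document `d3` is processed twice, once before the crash and once after the restart, but the idempotency key makes the second write an overwrite rather than a duplicate.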
Start the Conversation

Ready to Architect Your Ingestion Pipeline?

A single architecture conversation can identify the specific gaps in your current ingestion approach — before they become production incidents.