Enterprise RAG Ingestion Pipeline

Layer Detail

What Each Layer Does — And Why It Exists

Every layer has a defined purpose, failure mode, and governance checkpoint. None are optional in a production enterprise system.

STAGE 01

Hardware Layer

Compute nodes, object storage (S3/Azure Blob/GCS), network fabric with VPC private endpoints, and HSM/KMS for key management. AES-256 encryption at rest. TLS 1.2+ for all connections in transit. The KMS is not optional — SOC 1 auditors will ask where your encryption keys live and who can access them.

EC2 / Azure VM / GCP Compute · S3 / Blob / GCS · AWS KMS / Azure Key Vault / Cloud KMS · VPC · AES-256 · TLS 1.2+

STAGE 02

Connector + Credential Management

Every source system credential — database passwords, API keys, OAuth tokens for SharePoint and Gmail connectors — lives in a secrets vault with scheduled rotation and breach-triggered rotation. A credential access log records which service fetched which credential and when. Hardcoded credentials are an instant SOC 1 finding.

HashiCorp Vault · AWS Secrets Manager · Azure Key Vault · Debezium · Credential Rotation · Access Log · SOC 1 CC6.2

STAGE 03

Data Contracts + Schema Registry

The formal agreement between data producers and the ingestion pipeline. Schemas are registered in Apache Avro, Protobuf, or JSON Schema. Contract validation fires before any data enters processing — a breaking schema change routes to the dead letter queue and alerts the producer team. Without this, a producer silently changes a database column and corrupts three weeks of embeddings.

Confluent Schema Registry · AWS Glue · Apache Avro · Protobuf · Backward Compatibility Enforcement · Producer SLA
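
As a rough illustration of the contract check described above, the sketch below validates records against a contract and routes failures to a dead letter queue. It assumes the contract is a plain dict of field name to Python type; CONTRACT, contract_violations, and route are hypothetical names for illustration, and a production system would delegate this to the schema registry itself.

```python
from typing import Any

# Hypothetical contract: required field name -> expected Python type.
CONTRACT: dict[str, type] = {"customer_id": int, "tier": str, "churn_risk": float}


def contract_violations(record: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return human-readable violations; an empty list means the record conforms."""
    violations = []
    for field, expected in contract.items():
        if field not in record:
            violations.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            violations.append(f"{field}: expected {expected.__name__}, got {actual}")
    return violations


def route(record: dict[str, Any], contract: dict[str, type],
          accepted: list, dead_letter: list) -> None:
    """Send conforming records onward and everything else to the dead letter queue."""
    problems = contract_violations(record, contract)
    if problems:
        # Keep the violations with the record so the producer alert can show a diff.
        dead_letter.append({"record": record, "violations": problems})
    else:
        accepted.append(record)
```

A record missing churn_risk lands in dead_letter with the violation text attached, which is the material a producer alert would carry.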

STAGE 04

Scheduler + CDC + Dead Letter Queue

Ingestion runs on a schedule (Airflow, Prefect), not on prayer. Change Data Capture via Debezium reads only changed rows from the database transaction log: no full re-scans, no missed updates. Documents that still fail after N retries route to a dead letter queue with an alert. A progress checkpoint store (Redis/DynamoDB) means a backfill that crashes at document 40 million of 50 million resumes at 40 million, not zero.

Apache Airflow · Prefect · Debezium · Apache Kafka · Redis Checkpoints · DLQ · Exponential Backoff
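
The retry-then-DLQ behaviour can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual implementation: process_with_retry, the in-memory dead_letter list, and the default retry values are assumptions, and the sleep function is injectable so the backoff schedule can be inspected without waiting.

```python
import time
from typing import Callable


def process_with_retry(doc_id: str, handler: Callable[[str], None], dead_letter: list,
                       max_retries: int = 3, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call handler(doc_id) up to max_retries times, doubling the wait between attempts.

    Returns True on success. After the final failure the document is appended
    to dead_letter with the last error message and False is returned.
    """
    last_error = ""
    for attempt in range(max_retries):
        try:
            handler(doc_id)
            return True
        except Exception as exc:  # a real pipeline would catch narrower error types
            last_error = str(exc)
            if attempt < max_retries - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    dead_letter.append({"doc_id": doc_id, "attempts": max_retries, "error": last_error})
    return False
```

The design point is that a failing document ends up in exactly one place, the dead letter queue, with enough context to alert on, while the caller moves on to the next document.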

STAGE 05

Pre-Ingestion Compliance Gate (Zone A)

Every document passes through identity verification (SSO/MFA), RBAC permission check, virus scan (ClamAV or cloud AV), file format allowlist validation, file size policy enforcement, SHA-256 hash logging, and PII pre-screen before any processing begins. Failures route to quarantine with an immutable alert log. This is the gate SOC 1 auditors review first.

SOC 1 CC6 · SSO / MFA · RBAC · ClamAV · SHA-256 · PII Pre-screen · Quarantine Store · Upload Event Log
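
The deterministic parts of this gate (format allowlist, size policy, SHA-256 hash logging) are simple enough to sketch. The allowlist contents, the 50 MB limit, and the pre_ingestion_check function are hypothetical values chosen for illustration; identity, RBAC, virus scanning, and PII screening are separate services and are not shown.

```python
import hashlib
from pathlib import PurePath

# Hypothetical policy values; a real deployment would load these from configuration.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".csv", ".json"}
MAX_BYTES = 50 * 1024 * 1024


def pre_ingestion_check(filename: str, payload: bytes) -> dict:
    """Run the cheap, deterministic checks and return an upload event-log entry."""
    reasons = []
    extension = PurePath(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        reasons.append(f"format not on allowlist: {extension or '(none)'}")
    if len(payload) > MAX_BYTES:
        reasons.append(f"file exceeds size limit: {len(payload)} bytes")
    return {
        "filename": filename,
        # The hash is logged for every upload, accepted or not.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "decision": "quarantine" if reasons else "accept",
        "reasons": reasons,
    }
```

Every upload produces a log entry with its hash whether or not it is accepted, so the event log can later show what was rejected as well as what was ingested.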

STAGE 06

Dual-Track + Streaming Processing

Structured data (SQL databases, REST APIs, CSV, JSON/XML) and unstructured data (PDFs, emails, scanned images, audio) require fundamentally different processing pipelines before they can converge. Unstructured: OCR engine, MIME decode, ASR transcription, layout detection, table and figure extraction, language detection. Structured: schema extraction, CDC delta processing, column mapping, type casting, null handling, and row-to-natural-language serialization. A third track handles real-time streaming (Kafka) and multi-modal inputs (video frames, code AST parsing, CAD files, DICOM).

GCP Document AI · AWS Textract · Tesseract · Whisper ASR · Apache Kafka · Debezium CDC · AST Parsing · Multi-modal Adapters

STAGE 07

Mid-Pipeline Quality Gate (Zone B)

After processing, before normalization. OCR confidence threshold check — below 80% triggers quarantine, not silent pass-through. Parse error log creates a nonconforming document record per ISO 9001 §8.7. Schema drift alerts fire when structured data deviates from contract. PII redaction log records what was masked and why. This is where nonconforming inputs are caught before they corrupt your vector index.

ISO 9001 §8.7 · OCR Confidence Gate · Schema Drift Alert · Parse Error Log · PII Redaction Log · Nonconforming Output Control
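
A minimal sketch of the OCR confidence gate follows. The 80% threshold comes from the description above; everything else is an assumption for illustration, in particular the OcrResult shape and the choice to gate on the weakest page rather than the document average.

```python
from dataclasses import dataclass

OCR_CONFIDENCE_THRESHOLD = 0.80  # the 80% threshold described above


@dataclass
class OcrResult:
    doc_id: str
    page_confidences: list[float]  # one score per page, each in [0, 1]


def quality_gate(result: OcrResult, threshold: float = OCR_CONFIDENCE_THRESHOLD) -> dict:
    """Return a gate decision; assumption: the weakest page decides the outcome."""
    if not result.page_confidences:
        return {"doc_id": result.doc_id, "decision": "quarantine", "reason": "no OCR output"}
    weakest = min(result.page_confidences)
    if weakest < threshold:
        reason = f"page confidence {weakest:.2f} below {threshold:.2f}"
        return {"doc_id": result.doc_id, "decision": "quarantine", "reason": reason}
    return {"doc_id": result.doc_id, "decision": "pass", "reason": ""}
```

Gating on the minimum means one unreadable page cannot hide behind a good average; a pipeline that prefers fewer quarantines could gate on the mean instead.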

STAGE 08

Unified Normalization + Embedding Strategy

All three tracks converge into a canonical document format with a lineage ID, sensitivity label, deduplication hash (SimHash for near-duplicates, not just exact SHA-256), and access tags. Then: chunking strategy selection (fixed/semantic/late/parent-child), embedding model selection (dense, sparse, hybrid, multilingual routing), dimensionality decision (768 vs 1536 vs 3072), overlap strategy, and an embedding cache check to avoid re-embedding unchanged documents. On incremental ingestion runs where most documents are unchanged, the embedding cache alone can reduce embedding cost by up to 80%.

SimHash Dedup · Semantic Chunking · Sentence Transformers · text-embedding-3 · Hybrid Dense+Sparse · MRL / Matryoshka · Embedding Cache · 512-token sweet spot
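
To make the near-duplicate hash concrete, here is a small SimHash sketch. It assumes whitespace tokenization, equal token weights, and a 64-bit fingerprint; the max_distance default of 3 is an illustrative value, not a recommendation from this pipeline.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over lowercase whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two texts as near-duplicates when their fingerprints are close."""
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

Unlike SHA-256, where one changed character yields an unrelated hash, similar texts tend to produce fingerprints that differ in only a few bits, which is what makes a distance threshold usable for deduplication.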

STAGE 09

Data Lifecycle + Immutable Audit Trail (Zones C + D)

Update: the document is re-parsed and re-embedded, and the new version is logged. Delete: all chunks are purged from the vector DB and a tombstone is retained in the audit log; the tombstone is never deleted. Retention policy auto-expires documents by classification, with a legal hold override. Right-to-erasure requests produce a confirmed deletion log containing no PII. The Zone D immutable audit trail (ingestion log, quality gate log, change log, periodic audit report) is written to WORM (write once, read many) storage. A SOC 1 Type 2 report requires evidence that controls operated continuously across the audit period, typically 6 to 12 months. This is that evidence.

SOC 1 Type 2 · ISO 9001 §7.5 · WORM Storage · Tombstone Records · GDPR Right to Erasure · Retention Policy · S3 Object Lock · Periodic Audit Report

Data Track Detail

Structured vs Unstructured — Why They Need Separate Tracks

The two data types have fundamentally different failure modes and require different expertise to handle well. Collapsing them into a single pipeline is the most common architectural mistake in enterprise RAG projects.

Unstructured track

PDFs / DOCX: PDF parser → layout detection → table + figure extraction
Emails / Chat: MIME decode → thread reconstruction → attachment handling
Scanned images: deskew → denoise → OCR engine → confidence scoring
Audio / Video: ASR transcription → speaker diarization → timestamp alignment
Code repositories: AST parsing, not naive text extraction
CAD / DICOM: Format-specific adapter pattern → metadata extraction

Key challenge: The hardest problem in unstructured ingestion is not OCR — it is layout and structure detection. A PDF can be a single-column report, a two-column legal contract, a scanned form, or a slide deck. Detecting layout first determines whether your OCR output is coherent or scrambled.

Structured track

Row → natural language serialization

A database row means nothing to an embedding model. {customer_id: 4821, tier: 'gold', churn_risk: 0.82} must become: "Customer 4821 is a gold-tier account with 82% churn risk." That serialization must be deliberate, not JSON.stringify().
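
A deliberate serializer for the example row might look like the sketch below. The function name serialize_customer and the sentence template are illustrative assumptions; the point is that field names, units, and formatting are chosen on purpose.

```python
def serialize_customer(row: dict) -> str:
    """Render one customer row as a sentence an embedding model can use."""
    # Express the 0-1 risk score as a whole-number percentage.
    risk_percent = round(row["churn_risk"] * 100)
    return (
        f"Customer {row['customer_id']} is a {row['tier']}-tier account "
        f"with {risk_percent}% churn risk."
    )


row = {"customer_id": 4821, "tier": "gold", "churn_risk": 0.82}
print(serialize_customer(row))
# Customer 4821 is a gold-tier account with 82% churn risk.
```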

Change Data Capture — delta only

Re-ingesting an entire database on every run is how pipelines break production systems. CDC reads only the transaction log delta — new rows, updated rows, deleted rows — at near-zero source load.
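
The delta-only idea can be sketched as a function that applies one change event at a time. The event shape follows the Debezium change event envelope (an op code of c, u, d, or r with before and after row images); the in-memory index and the apply_change_event name are assumptions standing in for the real re-embedding and purge steps.

```python
def apply_change_event(event: dict, index: dict, key_field: str = "id") -> str:
    """Apply one change event to an in-memory index and return the action taken."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update all carry the new row image in "after".
        row = event["after"]
        index[row[key_field]] = row
        return "upsert"
    if op == "d":
        # Deletes carry only the old row image in "before".
        row = event["before"]
        index.pop(row[key_field], None)
        return "delete"
    raise ValueError(f"unsupported op: {op}")
```

Only rows named in the transaction log are touched, so the source database is never re-scanned and deletes are not missed.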

Relational context preservation

A chunk referencing customer_id: 4821 without carrying the customer name, tier, and account status will retrieve poorly. JOIN context must be serialized into the chunk at ingestion time, not reconstructed at query time.

Schema contract enforcement

When a producer removes a required column, the pipeline must reject the data and alert, not silently ingest broken records that corrupt three weeks of downstream embeddings.

Compliance Architecture

ISO 9001 + SOC 1 Type 2 — Built Into Every Layer

Compliance is not a feature added before deployment. In this pipeline, every governance control is a structural decision enforced at the infrastructure layer. Here is exactly what each standard requires — and where in the pipeline it is enforced.

ISO 9001 — Process Quality Management

§7.4 Controlled input: Type detection router rejects unapproved formats

§7.5 Documented information: Every document versioned, lineage ID assigned

§7.5.3 Control of records: WORM audit log, retention policy enforced

§8.4 External provider control: Connector registry — approved sources only

§8.7 Nonconforming outputs: Zone B quality gate with mandatory logging

§9.1.1 Monitoring: OCR confidence, parse error rates, schema drift tracked

§9.2 Internal audit: Periodic audit report auto-generated from logs

SOC 1 Type 2 — Financial Reporting Controls

CC6.1 Logical access: RBAC enforced at vector DB filter layer

CC6.2 Credential access: Secrets vault + rotation + access log

CC6.6 Security events: Virus scan, failed auth log, quarantine alerts

CC7.2 System monitoring: Prometheus metrics, latency alerts, cost tracking

CC8.1 Change management: Update/delete events logged with actor + reason

Availability: DLQ + retry policy ensures no silent document loss

Type 2 evidence: 12 months of WORM logs demonstrating control operation

Architectural Note: Task AI Systems designs systems aligned with ISO 9001 process integrity requirements and SOC 1 Type 2 audit control principles. We architect systems to be compliance-ready; formal certification responsibilities remain with your organization's compliance function.

Cloud Cost Estimation

Size Your Ingestion Pipeline

Ingestion pipeline cost is driven by three variables that compound: OCR volume, storage, and compute. Estimate monthly infrastructure cost across AWS, Azure, and GCP based on your workload.

Documents / month: 500,000
Pages per document (avg): 10
Storage: 10 TB
Audio hours / month: 100
Compute nodes (4 vCPU / 16 GB): 2
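
The arithmetic behind such an estimate is a sum of four volume-times-price terms. The sketch below uses the example workload inputs above; the unit prices are placeholders invented for illustration, not quotes from AWS, Azure, or GCP, and UnitPrices and monthly_cost are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class UnitPrices:
    """Placeholder unit prices in USD; replace with current figures from your provider."""
    ocr_per_1000_pages: float
    storage_per_tb_month: float
    asr_per_audio_hour: float
    node_per_month: float


def monthly_cost(docs_per_month: int, pages_per_doc: int, storage_tb: float,
                 audio_hours: float, nodes: int, prices: UnitPrices) -> float:
    """Sum OCR, storage, audio transcription, and compute cost for one month."""
    pages = docs_per_month * pages_per_doc
    ocr = pages / 1000 * prices.ocr_per_1000_pages
    storage = storage_tb * prices.storage_per_tb_month
    audio = audio_hours * prices.asr_per_audio_hour
    compute = nodes * prices.node_per_month
    return ocr + storage + audio + compute


# Illustrative numbers only, chosen to make the arithmetic easy to follow.
example_prices = UnitPrices(ocr_per_1000_pages=1.0, storage_per_tb_month=20.0,
                            asr_per_audio_hour=1.0, node_per_month=100.0)
estimate = monthly_cost(500_000, 10, 10, 100, 2, example_prices)
print(estimate)  # 5500.0 with the placeholder prices above
```

With these placeholder prices, OCR accounts for 5,000 of the 5,500 total because 500,000 documents at 10 pages each is 5 million pages a month. That is why the OCR unit price deserves the most scrutiny when comparing providers.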


Choose AWS when…

Your existing workloads are on AWS, you need the widest connector ecosystem, or your org already has AWS Enterprise Support and compliance tooling. Cost premium is ~35% but switching cost is often higher.

Choose Azure when…

Your source data lives in Microsoft 365, SharePoint, or Teams. Native connectors eliminate an entire custom-build layer. Microsoft Purview (formerly Azure Purview) and Microsoft Sentinel are the most mature enterprise compliance tooling of the three clouds.

Choose GCP when…

Cost is the primary driver, your data is in BigQuery/GCS, or OCR volume is high. Document AI pricing advantage compounds fast above 500K pages/month. Best pure-price option for a greenfield build.

Operational Resilience

Every Failure Mode. Every Recovery Path.

An ingestion pipeline is only as good as its behaviour when something goes wrong. These are the failure modes we design explicit recovery paths for — not the ones we discover after deployment.

OCR Confidence Collapse

OCR confidence drops below threshold on a batch of documents — scanned archives with degraded quality, rotated pages, or handwritten annotations. Without a quality gate, corrupt text enters the vector index silently.

Schema Contract Break

A producer removes a required database column or changes a data type without notice. Without contract enforcement, corrupt records enter normalization and produce semantically broken embeddings.

Backfill Crash at Scale

A backfill job processing 50 million legacy documents crashes at document 40 million. Without checkpointing, the restart processes 40 million documents again — duplicating costs and creating duplicate chunks.

Deleted Document Still Retrieving

A user deletes a document but its chunks remain in the vector index. Retrieval continues returning content from a document that no longer exists — a compliance failure in regulated environments.

PII Leakage Into Embeddings

A document containing customer PII — names, account numbers, health data — is ingested without classification. The PII enters embeddings and becomes retrievable by users without appropriate access clearance.

Virus in Upload

A malicious file is uploaded through the ingestion API. Without pre-upload scanning, the file enters the processing pipeline, where it may execute during parsing. This is particularly relevant for macro-enabled Office files such as DOCM.

Recovery path for each failure mode

Zone B Quality Gate + Quarantine

Confidence threshold (80%) fires. Document routes to quarantine store. DLQ entry created. Pipeline continues for all other documents. Alert fires with document ID, source, and confidence score. Human review queue populated.

Schema Registry Rejection + Alert

Contract validation layer rejects non-conforming messages. Routed to DLQ. Producer team receives automated alert with schema diff. Historical messages continue on last known good schema. No silent ingestion of corrupt data.

Progress Store + Idempotency Keys

Checkpoint store records last successfully processed document ID after each confirmed index write. Restart reads checkpoint and resumes from that position. SHA-256 idempotency key prevents duplicate chunks from boundary re-processing.
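
The interaction between the checkpoint and the idempotency key can be sketched as follows. The dicts standing in for the checkpoint store and the vector index, the fail_at crash simulation, and the function names are all assumptions made for illustration.

```python
import hashlib
from typing import Optional


def idempotency_key(doc_id: str, chunk_text: str) -> str:
    """Derive a stable SHA-256 key so replaying a chunk overwrites instead of duplicating."""
    return hashlib.sha256(f"{doc_id}\x00{chunk_text}".encode("utf-8")).hexdigest()


def run_backfill(documents: list[tuple[str, str]], index: dict, checkpoint: dict,
                 fail_at: Optional[int] = None) -> None:
    """Process (doc_id, text) pairs in order, resuming from checkpoint['position']."""
    start = checkpoint.get("position", 0)
    for position in range(start, len(documents)):
        if fail_at is not None and position == fail_at:
            raise RuntimeError("simulated crash")
        doc_id, text = documents[position]
        # Keyed write: re-processing the same chunk lands on the same key.
        index[idempotency_key(doc_id, text)] = text
        # Advance the checkpoint only after the index write has succeeded.
        checkpoint["position"] = position + 1
```

If the job crashes between the index write and the checkpoint update, the restart re-processes that one document; the keyed write makes the replay harmless.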

Tombstone + Chunk Purge Protocol

Deletion immediately sets deleted=true metadata on all chunks (stops retrieval). Async hard-delete removes all chunks from vector DB. Tombstone record retained in WORM audit log permanently. Confirmed deletion event written for SOC 1 evidence.
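
A two-phase delete along these lines might be sketched as below. The chunk store is a plain dict and the audit log a list, both stand-ins for the vector DB and WORM storage; soft_delete, retrievable, and hard_delete are hypothetical names.

```python
from datetime import datetime, timezone


def soft_delete(doc_id: str, chunks: dict, audit_log: list, actor: str, reason: str) -> int:
    """Flag every chunk of doc_id as deleted and append a tombstone record."""
    flagged = 0
    for chunk in chunks.values():
        if chunk["doc_id"] == doc_id:
            chunk["deleted"] = True
            flagged += 1
    audit_log.append({"event": "tombstone", "doc_id": doc_id, "actor": actor,
                      "reason": reason, "at": datetime.now(timezone.utc).isoformat()})
    return flagged


def retrievable(chunks: dict) -> list[str]:
    """Return the chunk IDs the retrieval filter would still serve."""
    return [cid for cid, chunk in chunks.items() if not chunk.get("deleted", False)]


def hard_delete(doc_id: str, chunks: dict, audit_log: list) -> int:
    """Physically remove flagged chunks and log a confirmation with no document content."""
    doomed = [cid for cid, chunk in chunks.items()
              if chunk["doc_id"] == doc_id and chunk.get("deleted")]
    for cid in doomed:
        del chunks[cid]
    audit_log.append({"event": "deletion_confirmed", "doc_id": doc_id,
                      "chunks_removed": len(doomed),
                      "at": datetime.now(timezone.utc).isoformat()})
    return len(doomed)
```

The soft delete stops retrieval immediately, so the slower hard delete can run asynchronously without a window in which deleted content is still served.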

PII Pre-Screen + Redaction Log

Zone A pre-screen flags PII before processing. Zone B redaction log records what was masked and why. Sensitivity label assigned at source propagates through all downstream metadata. RBAC at vector DB filter layer enforces access at query time.
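
A redaction step that records what was masked without recording the value itself could look like this sketch. The two regular expressions are deliberately simple, hypothetical patterns for illustration; production systems typically rely on a dedicated PII detection service rather than hand-written regexes.

```python
import re

# Hypothetical patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(doc_id: str, text: str, redaction_log: list) -> str:
    """Mask matches and log the category and count, never the matched value."""
    for category, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{category.upper()}]", text)
        if count:
            redaction_log.append({"doc_id": doc_id, "category": category, "count": count})
    return text


log = []
print(redact("d1", "Contact jane@example.com, SSN 123-45-6789.", log))
# Contact [EMAIL], SSN [US_SSN].
```

Logging the category and count keeps the redaction log auditable while ensuring the log itself never becomes a second copy of the PII.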

Pre-Ingestion Virus Gate

ClamAV or cloud AV scans every file before any processing begins. Infected files route immediately to quarantine with an immutable alert log. No processing occurs. Format allowlist rejects macro-enabled formats unless explicitly permitted by policy.

