Processing & Inference Layer

From Raw Document
To Governed Answer.

The processing layer transforms ingested documents into retrieval-ready vectors. The inference layer transforms retrieved context into governed, cited, compliance-audited responses. Together they form the core of every enterprise RAG system, and they are the layers where most production failures originate.

3 Compliance Zones
Processing · Output · Audit
12 Stages
Enrichment to LLM answer
Prompt Governance
Versioned · tested · locked
Citation Required
No source = no answer
Complete Architecture

Processing + Inference — Every Stage.

Processing is the bridge between raw ingested text and a retrieval-ready vector index. Inference is the bridge between retrieved context and a governed LLM response. Neither layer is a black box — every stage has defined inputs, outputs, failure modes, and compliance checkpoints.

[Figure: Processing and Inference Layer, complete architecture with three compliance zones]

Flow: ingestion pipeline (normalized docs) → processing layer → retrieval layer (vector DB ready) → inference layer → governance & evaluation layer

Processing layer: 1 · Metadata enrichment → 2 · Content classification → Zone A · Processing quality gate → 3 · Chunking strategy → 4 · Embedding generation → 5 · Vector DB write strategy → 6 · Index management

Inference layer: 7 · Prompt architecture → 8 · LLM orchestration → 9 · Model selection + inference → Zone B · Output quality gate → 10 · Response generation → 11 · Post-generation constraints → Zone C · Immutable response audit trail → 12 · Delivery + cost tracking

Three compliance zones, control inventory
  • Zone A · Processing quality gate: ISO 9001 §8.7 · SOC 1 CC7.2. PII detect, content quality, dedup, processing log
  • Zone B · Output quality gate: ISO 9001 §8.7 · SOC 1 CC7.2. Citation verify, hallucination signal, PII filter, output constraints
  • Zone C · Response audit trail: SOC 1 CC6 · ISO 9001 §7.5 · GDPR Art. 5. Full response log, prompt version, model config, WORM store
Stage Detail

What Each Stage Does — And Why It Exists

The processing and inference layer comprises two distinct subsystems that must be understood separately before they can be designed together. Processing is batch-oriented and runs continuously in the background. Inference is real-time and runs on every user query. Their failure modes, latency requirements, and compliance obligations are completely different.

PROCESSING · STAGE 01
Metadata Enrichment
Every document that passes through ingestion enters processing with a lineage ID and a sensitivity label. Metadata enrichment fills in the structured attributes that make retrieval meaningful rather than merely possible. Source tagging records the originating system, document ID, and origin URL — creating the first link in the audit chain that must be traceable all the way to the final LLM response. Version control records the parent-child relationship between document versions: when a policy is updated, v2 knows it superseded v1, enabling the retrieval layer to enforce current-only or historical queries by design. Access tier classification assigns the document to one of four levels (public, internal, confidential, restricted) based on the sensitivity label propagated from ingestion. Effective date enables temporal scoping — a compliance query can be answered with "the policy as it existed on 1 January 2024" rather than just the current version. Department scope assigns the namespace that will govern retrieval isolation. None of this is cosmetic metadata — every attribute is used by the retrieval layer to filter, scope, and audit queries at the infrastructure level.
Source tagging · Version lineage · Access tier · Effective date · Dept namespace · Content category · Author attribution
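As a rough illustration of how these attributes might travel with each document, the sketch below models them as an immutable record and resolves "the policy as it existed on a given date" from the effective dates. The class name `DocMetadata`, its field names, the helper `version_as_of`, and the example URL are all assumptions made for this example, not part of any specific platform.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

ACCESS_TIERS = ("public", "internal", "confidential", "restricted")

@dataclass(frozen=True)
class DocMetadata:
    doc_id: str
    origin_url: str
    version: int
    supersedes: Optional[int]   # previous version number, None for the first version
    access_tier: str
    effective_date: date
    dept_namespace: str

    def __post_init__(self) -> None:
        # Reject tiers outside the four levels described in the text.
        if self.access_tier not in ACCESS_TIERS:
            raise ValueError(f"unknown access tier: {self.access_tier}")

def version_as_of(versions: List[DocMetadata], when: date) -> Optional[DocMetadata]:
    """Return the version in force on `when`: the latest effective_date not after it."""
    in_force = [v for v in versions if v.effective_date <= when]
    return max(in_force, key=lambda v: v.effective_date) if in_force else None

# Hypothetical policy with two versions.
v1 = DocMetadata("pol-7", "https://example.internal/pol-7", 1, None, "internal", date(2023, 6, 1), "hr")
v2 = DocMetadata("pol-7", "https://example.internal/pol-7", 2, 1, "internal", date(2024, 3, 1), "hr")
```

With this data, a query scoped to 1 January 2024 resolves to `v1`, while a query scoped to any date from 1 March 2024 onward resolves to `v2`.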
PROCESSING · STAGE 02
Content Classification
Content classification assigns a semantic type to each document segment — procedural, regulatory, definitional, advisory, or technical. This classification is not cosmetic: it enables the retrieval layer to filter by content type in addition to semantic similarity and keyword match. A compliance officer asking "what does our policy say about data retention?" needs definitional and regulatory content, not procedural steps. A field technician asking "how do I reset the device?" needs procedural content, not regulatory background. Without content classification, the retrieval layer returns everything semantically relevant — which is often too broad. Classification is applied at the chunk level, not just the document level, because a single document can contain clauses of different types. A master services agreement contains definitional sections (what terms mean), regulatory sections (what laws apply), and procedural sections (what happens when either party breaches). Retrieving all of them for a simple definitional query inflates context noise and degrades LLM response precision.
Procedural · Regulatory · Definitional · Advisory · Technical · Chunk-level not document-level · Enables type-filtered retrieval
COMPLIANCE ZONE A
Processing Quality Gate
Before chunks enter the embedding pipeline, four quality controls fire. PII detection classifies any chunk containing personally identifiable information — names, account numbers, health data, financial records — and applies the appropriate sensitivity label. Chunks containing PII are not blocked from processing; they are tagged so that retrieval-layer access controls can enforce who is permitted to receive them. Content quality check enforces a minimum token threshold — chunks below 50 tokens are either merged with adjacent context or discarded with a processing log entry. Sub-threshold chunks are almost always parsing artefacts (page headers, footers, figure captions) that produce low-quality embeddings and dilute retrieval precision. Duplicate check using SimHash (not SHA-256) detects near-duplicate chunks — the same policy clause that appears in two different documents with minor wording differences would otherwise produce two nearly identical embeddings that compete in retrieval, inflating apparent relevance of duplicated content. Processing log records every decision for ISO 9001 §9.1 monitoring and measurement evidence: what was processed, what was flagged, what was discarded, and why.
ISO 9001 §8.7 · SOC 1 CC7.2 · PII detection · Content quality gate · SimHash dedup · Processing decision log
Zone A controls
  • ISO 9001 §8.7 — nonconforming output: sub-quality chunks logged and handled, not silently passed through
  • SOC 1 CC7.2 — system monitoring: processing quality metrics logged for operational review
  • PII classification before embedding — ensures sensitivity labels are set before access controls are applied at retrieval
  • SimHash dedup — prevents near-duplicate chunks inflating retrieval relevance scores
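The content quality check and the near-duplicate check can be sketched together in a few lines. This is a minimal illustration, assuming whitespace-separated words as a stand-in for real tokens, a 64-bit SimHash built from MD5 word hashes, and a Hamming-distance threshold of 3; the function names and the threshold are illustrative choices, not values taken from the text.

```python
import hashlib
from typing import List

MIN_TOKENS = 50  # minimum token threshold from the text above

def simhash(text: str, bits: int = 64) -> int:
    """Locality-sensitive fingerprint: similar texts tend to give nearby bit patterns."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def quality_gate(chunk: str, seen: List[int], max_distance: int = 3) -> str:
    """Return the gate decision; a real pipeline would also write it to the processing log."""
    if len(chunk.split()) < MIN_TOKENS:
        return "merge_or_discard"
    fingerprint = simhash(chunk)
    if any(hamming(fingerprint, other) <= max_distance for other in seen):
        return "near_duplicate"
    seen.append(fingerprint)
    return "accept"

seen_fingerprints: List[int] = []
clause = " ".join(f"word{i}" for i in range(60))  # synthetic 60-token chunk
first = quality_gate(clause, seen_fingerprints)
second = quality_gate(clause, seen_fingerprints)
header = quality_gate("Page 3 of 12", seen_fingerprints)
```

A cryptographic hash such as SHA-256 would only catch exact copies, which is why the text specifies SimHash here: small wording changes flip few fingerprint bits, so the Hamming distance stays low.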
PROCESSING · STAGE 03
Chunking Strategy
Fixed-size character chunking is the single most common cause of enterprise RAG failure. Splitting text at 512-character boundaries without regard for sentence structure, clause boundaries, or paragraph semantics produces fragments that are syntactically correct but semantically broken. The standard enterprise approach uses semantic boundary detection — identifying logical breakpoints based on sentence completion, paragraph endings, section headers, and clause structure. Parent-child chunking is the architecture pattern that resolves the precision-context tradeoff: small chunks (200–300 tokens) are indexed for precise retrieval, but each small chunk carries a reference to its parent section (600–1000 tokens) which is retrieved at inference time for fuller context. This means the retrieval layer finds the right precise chunk, and the inference layer receives enough surrounding context to interpret it correctly. Sliding window overlap (default 20%) prevents context loss at chunk boundaries — the last 20% of one chunk is repeated as the first 20% of the next, ensuring that information spanning a boundary appears in at least one complete chunk. The 512-token sweet spot is empirically derived: below 256 tokens, chunks are too short to contain complete semantic units in most enterprise documents; above 1024 tokens, chunks contain multiple distinct topics that degrade retrieval precision.
Semantic boundary detection · Parent-child architecture · 512-token default · 20% overlap · Hierarchy preservation · Clause-aware splitting
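The overlap mechanics can be shown in isolation. The sketch below is deliberately simplified: it slides over a flat token list, whereas the approach described above would first cut at semantic boundaries and apply the overlap on top of that. The function name and the small demonstration sizes are assumptions for illustration.

```python
from typing import List

def sliding_window(tokens: List[str], size: int = 512, overlap: float = 0.2) -> List[List[str]]:
    """Split tokens into windows of `size`, repeating the last `overlap` share in the next window."""
    if size <= 0 or not 0 <= overlap < 1:
        raise ValueError("size must be positive and overlap in [0, 1)")
    step = max(1, size - round(size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final window already reaches the end of the document
    return chunks

# Small demonstration: 20 tokens, windows of 10, 20% overlap (2 tokens).
tokens = [f"t{i}" for i in range(20)]
chunks = sliding_window(tokens, size=10, overlap=0.2)
```

Here the last two tokens of the first chunk (`t8`, `t9`) reappear as the first two tokens of the second chunk, so a sentence that straddles the boundary is still seen whole in one of them.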
PROCESSING · STAGE 04
Embedding Generation
Embedding model selection is the decision with the highest downstream impact on retrieval quality and the one most frequently made incorrectly. Generic embedding models (trained on web text) systematically misrepresent domain-specific terminology: in a legal corpus, "consideration" means something precise and contractual; in a medical corpus, "discharge" has multiple distinct meanings depending on clinical context. Generic models treat these as their web-frequency meanings. Domain fine-tuning on 5,000–10,000 in-domain query-document pairs produces embeddings that represent regulatory terminology, cross-references, and domain-specific synonyms correctly, with a reported 20–30% precision improvement over generic models in enterprise evaluation datasets. The dimensionality decision is a cost-quality tradeoff: raw vector storage and similarity compute scale linearly with dimension, so 384-dimension models (all-MiniLM-L6-v2, BGE-small) are about 4× cheaper to store and query than 1536-dimension models (text-embedding-3-small) and sufficient for many enterprise Q&A workloads. 3072-dimension models (text-embedding-3-large) double the cost again and are reserved for high-stakes domains where marginal recall improvement has direct monetary or regulatory value. The embedding cache is the most consistently underutilised cost-reduction mechanism in production RAG systems: if a chunk's content has not changed since its last embedding (detectable by SHA-256 content hash comparison), re-embedding on incremental ingestion runs is pure waste. Embedding cache hit rates of 60–80% are typical in enterprise deployments with stable knowledge corpora.
384d / 1536d / 3072d tradeoff · Domain fine-tuning · Dense + sparse (SPLADE) · SHA-256 embedding cache · Up to 80% cost reduction · Multilingual routing · MRL / Matryoshka
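The embedding cache described above amounts to a lookup keyed by content hash. The sketch below assumes an in-memory dictionary and a stand-in `fake_embed` function in place of a real model call; both are hypothetical. A production cache would likely also include the embedding model identifier in the key, so that a model change invalidates old vectors.

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Skip re-embedding when a chunk's SHA-256 content hash has been seen before."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk_text: str) -> List[float]:
        key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(chunk_text)  # the only paid call
        return self._store[key]

calls: List[str] = []

def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model; records each call so the savings are visible."""
    calls.append(text)
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
a = cache.get("clause 4.2")
b = cache.get("clause 4.2")   # unchanged content: served from cache
c = cache.get("clause 4.3")   # changed content: embedded again
```

The hit rate is simply `hits / (hits + misses)`, which is the figure to compare against the 60–80% range quoted above.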
PROCESSING · STAGE 05
Vector DB Write Strategy
Writing embeddings to the vector database is not a simple insert operation in a production system. Upsert strategy with chunk-level idempotency keys (SHA-256 of chunk content + document ID + version) means that re-running the processing pipeline on a partially updated corpus does not create duplicate embeddings — if a chunk already exists with the same content and version, the write is a no-op. Batch write optimisation groups embedding writes to maximise GPU batch efficiency and minimise per-write overhead — individual writes are 10–50× more expensive than batched writes at scale. Namespace assignment at write time is the structural control that enforces retrieval isolation: a chunk written to the HR namespace cannot be retrieved by a query scoped to the finance namespace, regardless of semantic similarity. Index versioning tags the write with the current index version, enabling blue-green deployment of index updates (the new index version serves queries while the old version remains as rollback), A/B testing of embedding model changes, and point-in-time recovery of the index state as it existed before a large corpus update.
Upsert with idempotency key · Batch write optimisation · Namespace assignment at write · Index versioning · Blue-green index deployment · Write throughput monitoring
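A minimal sketch of the idempotent, namespace-scoped write follows. `InMemoryIndex` is a hypothetical stand-in for a vector database client, and the key layout (document ID, version, and content joined with a separator byte before hashing) is one reasonable reading of the description above rather than a prescribed format.

```python
import hashlib
from typing import Dict, List, Tuple

def chunk_key(content: str, doc_id: str, version: str) -> str:
    """Idempotency key: SHA-256 over document ID, version, and chunk content."""
    payload = "\x1f".join((doc_id, version, content))  # unit separator avoids ambiguous joins
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class InMemoryIndex:
    """Toy stand-in for a vector store with namespaces."""

    def __init__(self) -> None:
        self.rows: Dict[Tuple[str, str], List[float]] = {}

    def upsert(self, namespace: str, key: str, vector: List[float]) -> bool:
        """Write the vector; return False when the same key already exists (no-op)."""
        slot = (namespace, key)
        if slot in self.rows:
            return False
        self.rows[slot] = vector
        return True

    def ids_in(self, namespace: str) -> List[str]:
        """Only keys written to this namespace are visible to queries scoped to it."""
        return [key for ns, key in self.rows if ns == namespace]

index = InMemoryIndex()
key = chunk_key("Leave policy text", "doc-1", "v2")
first_write = index.upsert("hr", key, [0.1, 0.2])
rerun_write = index.upsert("hr", key, [0.1, 0.2])  # pipeline re-run: no duplicate
```

Because the version is part of the key, a new document version produces new keys even when a chunk's text is unchanged, which keeps version lineage explicit in the index.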
PROCESSING · STAGE 06
Index Management
The HNSW (Hierarchical Navigable Small World) index that powers ANN retrieval degrades in quality as the corpus changes significantly — chunks are added, updated, and deleted, but the graph structure of the index was optimised for the original corpus distribution. HNSW rebuild triggers fire when cumulative corpus change exceeds a configurable threshold (typically 10–15% of total chunk count) or on a scheduled basis for high-churn corpora. Staleness detection monitors embedding age against document update frequency — if a knowledge domain is updated regularly but re-indexing has not occurred, retrieved chunks may represent stale knowledge that does not reflect current policy. Tiered storage assigns chunks to hot (GPU-resident, <5ms retrieval), warm (SSD-backed, 10–30ms), and cold (object storage, 100–500ms) tiers based on query frequency — recent regulatory documents are hot, archived historical versions are cold. Delete propagation is the most operationally critical index management function: when a document is deleted from the source system or expired by retention policy, every chunk derived from that document must be purged from the vector index, the sparse index, and the graph store simultaneously. Partial deletion creates phantom chunks — orphaned embeddings that remain retrievable after the source document has been deleted, creating both compliance violations and retrieval quality degradation.
HNSW rebuild threshold · Staleness detection · Hot/warm/cold tiering · Delete propagation · Tombstone-driven purge · Phantom chunk prevention
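Two of these controls reduce to small, testable rules. The sketch below assumes a 10% change threshold (the low end of the range given above) and represents each store as a plain dictionary mapping chunk ID to source document ID; both function names are hypothetical. It purges stores one after another, so a real implementation would still need a tombstone or transaction to make the purge effectively simultaneous.

```python
from typing import Dict

def should_rebuild(added: int, updated: int, deleted: int,
                   total_chunks: int, threshold: float = 0.10) -> bool:
    """True when cumulative corpus change exceeds the configured share of the index."""
    if total_chunks <= 0:
        return False
    return (added + updated + deleted) / total_chunks > threshold

def propagate_delete(doc_id: str, stores: Dict[str, Dict[str, str]]) -> Dict[str, int]:
    """Remove every chunk derived from doc_id from every store; return purge counts per store."""
    purged = {}
    for name, chunks in stores.items():
        doomed = [cid for cid, owner in chunks.items() if owner == doc_id]
        for cid in doomed:
            del chunks[cid]
        purged[name] = len(doomed)
    return purged

stores = {
    "vector": {"c1": "docA", "c2": "docB"},
    "sparse": {"c1": "docA", "c2": "docB"},
    "graph": {"c1": "docA"},
}
purge_report = propagate_delete("docA", stores)
```

Comparing the per-store counts in `purge_report` against the number of chunks originally written for the document is one simple way to detect a partial deletion before it becomes a phantom chunk.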
INFERENCE · STAGE 07
Prompt Architecture
Prompt architecture is an engineering discipline, not a craft activity. System prompts that are written once, never tested, and never versioned are the primary cause of production LLM failures in enterprise deployments — not model capability limitations. Every system prompt must be version-controlled (semantic versioning: 1.0.0 → 1.0.1 for wording changes, 1.1.0 for structural changes, 2.0.0 for fundamental redesign), tested against a golden dataset before deployment, and locked to a specific hash in production to prevent accidental or unauthorised modification. Context injection is the structured assembly of retrieved chunks into the prompt: source attribution is embedded at the chunk level (not appended as a list at the end), so the LLM can reference specific sources inline. Citation mandate is a hard constraint in the system prompt: the LLM is instructed that every factual claim must be followed by a source reference, and that any claim for which a source cannot be found must not be made — the refusal protocol takes over instead. Refusal protocol defines the exact language and structure of graceful declination: "I was unable to find reliable source material for this question in the available knowledge base. The query has been logged for review." This is logged, not discarded — the accumulation of declined queries is a corpus gap signal that feeds back into ingestion.
Versioned system prompts · Prompt hash locking · Citation mandate · Graceful refusal protocol · Context injection structure · Token budget enforcement
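Hash locking and inline context injection can be sketched as follows. The prompt wording, the `[chunk:ID]` tag format, the registry layout, and the function names are assumptions chosen for illustration; the point is only that a prompt whose hash no longer matches its registered version refuses to load, and that each chunk carries its source tag inline.

```python
import hashlib
from typing import Dict, List, Tuple

def prompt_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical system prompt carrying the citation mandate and refusal rule.
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "Follow every factual claim with its [chunk:ID]. "
    "If no source supports an answer, decline."
)
REGISTRY: Dict[str, str] = {"1.0.0": prompt_hash(SYSTEM_PROMPT)}  # version -> locked hash

def load_locked_prompt(version: str, text: str, registry: Dict[str, str]) -> str:
    """Refuse to serve a prompt whose content differs from the hash locked for its version."""
    if registry.get(version) != prompt_hash(text):
        raise RuntimeError(f"system prompt {version} does not match its locked hash")
    return text

def inject_context(system_prompt: str, chunks: List[Tuple[str, str]], question: str) -> str:
    """Assemble the prompt with source attribution embedded at the chunk level."""
    context = "\n".join(f"[chunk:{cid}] {text}" for cid, text in chunks)
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = inject_context(
    load_locked_prompt("1.0.0", SYSTEM_PROMPT, REGISTRY),
    [("c1", "Records are kept for seven years.")],
    "How long are records kept?",
)
```

Even a one-character edit to the prompt text changes the hash, so an unreviewed modification fails at load time instead of silently reaching production.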
INFERENCE · STAGE 08
LLM Orchestration
LangChain provides the component chaining framework — retrieval, reranking, prompt assembly, and response generation are connected as a pipeline with defined interfaces between each stage. LangGraph extends this with stateful multi-step reasoning: conditional routing (if the initial retrieval confidence is below threshold, reroute to a broader retrieval strategy before escalating), loops (plan → retrieve → evaluate → retrieve again if insufficient), and human-in-the-loop interrupts (pause the workflow at a defined checkpoint to await human review before proceeding). Agent routing dispatches complex multi-part queries to specialist sub-agents: a query that requires both regulatory lookup and financial calculation is decomposed into a regulatory retrieval agent and a calculation tool agent, whose outputs are synthesised by an orchestrator agent. Tool calling enables the LLM to invoke external capabilities mid-reasoning: a structured database query for exact figures, a calculator for numerical operations, a web search for information beyond the knowledge base. Memory systems maintain session-level short-term memory (what has been asked and answered in this conversation) and cross-session long-term memory (user preferences, recurring query patterns, personalisation signals) — both with appropriate data retention constraints for regulated environments.
LangChain · LangGraph stateful flows · Specialist agent routing · Tool calling (DB · calc · search) · Short-term session memory · Long-term memory with retention policy
INFERENCE · STAGE 09
Model Selection + Inference
Model selection at inference time is a cost-quality routing decision that most production systems get wrong by defaulting to the most capable model for every query. A factual lookup query ("what is the maximum file size for uploaded documents?") needs a small, fast, cheap model; a frontier model such as GPT-4o adds latency and cost without improving the answer. A multi-document synthesis query ("summarise the key obligations under all active vendor contracts and identify any conflicts") requires the most capable model available. Cost routing maintains a complexity classifier that assigns each query to one of three tiers, shown here with indicative latency and cost budgets: simple factual (small model, <50ms, ~$0.001), moderate analytical (medium model, <150ms, ~$0.01), complex synthesis (large model, <500ms, ~$0.10). On-premise deployment with quantized models (Llama 3 8B at 4-bit quantization, Mistral 7B) is mandatory for industries where data sovereignty prohibits sending content to external API endpoints: financial data, clinical records, legal documents. The figures quoted for quantized models are 60–70% of full-precision quality at 15–20% of the compute cost; the actual quality loss depends heavily on the model and quantization method, so it should be measured on the deployment's own evaluation set before the tradeoff is accepted. Fallback models activate when the primary model is unavailable, rate-limited, or returns an error. The fallback chain must be defined, tested, and logged so that a model outage does not produce user-visible failures.
Complexity-based model routing · Cloud API (GPT-4o / Claude / Gemini) · On-premise (Llama 3 / Mistral / Ollama) · 4-bit quantization · Fallback chain · Cost per tier: $0.001 → $0.01 → $0.10
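The routing and fallback logic can be illustrated with a toy classifier. The keyword heuristic below is a placeholder assumption (a production classifier would more likely be a trained model), and the model names and stub functions are hypothetical stand-ins for real clients.

```python
from typing import Callable, List, Tuple

COMPLEX_WORDS = {"summarise", "summarize", "compare", "conflicts", "synthesise", "synthesize"}
MODERATE_WORDS = {"why", "explain", "analyse", "analyze"}
TIER_MODEL = {"simple": "small-model", "moderate": "medium-model", "complex": "large-model"}

def classify(query: str) -> str:
    """Toy complexity classifier: keyword and length heuristics only."""
    words = {w.strip(".,?!\"'()") for w in query.lower().split()}
    if words & COMPLEX_WORDS:
        return "complex"
    if words & MODERATE_WORDS or len(query.split()) > 20:
        return "moderate"
    return "simple"

def call_with_fallback(prompt: str,
                       chain: List[Tuple[str, Callable[[str], str]]]) -> Tuple[str, str]:
    """Try each model in order; return (model_name, answer) from the first that succeeds."""
    errors = []
    for name, model_fn in chain:
        try:
            return name, model_fn(prompt)
        except Exception as exc:  # in production, log the failure and narrow the exception types
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all models failed: " + "; ".join(errors))

def primary_down(prompt: str) -> str:
    """Stub for a primary model that is currently rate-limited."""
    raise TimeoutError("rate limited")

def secondary_up(prompt: str) -> str:
    """Stub for a healthy secondary model."""
    return "ok"

lookup_tier = classify("what is the maximum file size for uploaded documents?")
synthesis_tier = classify(
    "summarise the key obligations under all active vendor contracts and identify any conflicts"
)
served_by, answer = call_with_fallback("q", [("primary", primary_down), ("secondary", secondary_up)])
```

Returning the name of the model that actually answered matters for the audit trail: the logged model identifier should reflect the fallback, not the intended primary.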
COMPLIANCE ZONE B
Output Quality Gate
Before a response is delivered to the user, four output controls fire against the generated text. Citation verification checks that every factual claim in the response is traceable to a specific chunk ID in the context window — this is the primary hallucination detection mechanism, not an NLI model. If a claim appears in the response but cannot be matched to a source chunk, the claim is either flagged for review or the entire response is suppressed and replaced with a graceful declination. NLI-based hallucination signal uses a Natural Language Inference model to score each claim against its cited source chunk for entailment (the claim is supported by the source), contradiction (the claim contradicts the source), or neutrality (the claim cannot be verified from the source). High contradiction scores trigger immediate response suppression; high neutrality scores flag the response for human review. Output PII filter scans the generated response for personally identifiable information that should not have been included — a last-resort check before delivery. Format and length check enforces output constraints defined in the system prompt: structured JSON responses must validate against the defined schema, prose responses must fall within the defined length range, and domain-restricted responses must not contain content outside the permitted scope.
ISO 9001 §8.7 · SOC 1 CC7.2 · Citation-to-chunk traceability · NLI entailment scoring · Output PII filter · Format + schema validation · Response suppression on fail
Zone B controls
  • ISO 9001 §8.7 — nonconforming output: responses failing citation check are suppressed, not delivered
  • SOC 1 CC7.2 — system monitoring: hallucination signal rates logged per query category
  • Citation-to-chunk traceability is the primary hallucination prevention — not an afterthought
  • NLI contradiction scoring catches cases where the LLM paraphrased context incorrectly
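Citation-to-chunk traceability can be approximated with a structural check before any NLI model runs. The sketch below assumes citations appear inline as `[chunk:ID]` before the closing punctuation and treats every sentence as a factual claim, which is a simplification; the tag format and function name are illustrative assumptions.

```python
import re
from typing import Dict, List, Set

CITATION = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")

def verify_citations(response: str, context_ids: Set[str]) -> Dict[str, object]:
    """Flag sentences with no citation and citations pointing outside the context window."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    uncited: List[str] = []
    unknown: List[str] = []
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        if not cited:
            uncited.append(sentence)
        unknown.extend(cid for cid in cited if cid not in context_ids)
    return {"passed": not uncited and not unknown, "uncited": uncited, "unknown_ids": unknown}

context = {"c1", "c2"}
good = verify_citations(
    "Records are kept for seven years [chunk:c1]. Backups are encrypted at rest [chunk:c2].",
    context,
)
missing = verify_citations(
    "Records are kept for seven years [chunk:c1]. Backups are never deleted.",
    context,
)
phantom = verify_citations("Records are kept for ten years [chunk:c9].", context)
```

A response that fails this check would be suppressed or routed for review; one that passes still needs the entailment scoring described above, because a valid chunk ID does not prove the claim matches what the chunk says.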
INFERENCE · STAGE 10
Response Generation + Source Attribution
Response generation produces the final structured output that is delivered to the user. Streaming responses (token-by-token generation) are appropriate for conversational interfaces where perceived latency matters more than total latency. Batch responses are appropriate for background processing workflows where throughput matters more than perceived latency. Structured output enforces JSON schema validation when the application requires machine-readable responses rather than prose — a compliance reporting tool that generates audit summaries must produce valid, schema-conforming JSON that downstream systems can consume without parsing. Confidence score annotation attaches the retrieval confidence score and the citation verification status to the response metadata — not displayed to end users in most interfaces, but available to application developers to drive UI decisions (show a "verify this answer" warning below a confidence threshold, for example). Source attribution is the mechanism that makes the system auditable: every response includes a citation list linking each factual claim to the specific chunk ID, source document, document version, and effective date that supports it. This citation chain is what allows a compliance officer to reconstruct exactly what information the system used to answer a specific query at a specific point in time.
Streaming · Structured JSON output · Confidence annotation · Source attribution chain · Chunk ID → doc → version → date · Prompt cache (prefix caching)
INFERENCE · STAGE 11
Post-Generation Constraints
Post-generation constraints are the last safety layer before response delivery. Toxicity check scans the response for harmful content — particularly important when the LLM processes documents that may contain adversarial content injected into the knowledge base, a technique known as indirect prompt injection. Scope boundary enforcement verifies that the response stays within the domain the system is permitted to answer — a legal compliance assistant should not be answering questions about competitors' products, even if the retrieval layer happened to surface a document containing that information. Human-in-the-loop routing applies to queries designated as high-stakes: medical dosing calculations, legal advice that will be acted upon without further review, financial decisions above a defined threshold. These queries are intercepted before delivery and queued for human review — the AI-generated response is a draft, not a final answer. Escalation queue maintains SLA commitments for human review — if a query sits in the queue for more than the defined SLA period without review, an alert fires and the query is escalated to a supervisor. The SLA must be documented and auditable for SOC 1 Type 2 evidence.
Toxicity check · Indirect prompt injection detection · Scope boundary enforcement · Human-in-the-loop routing · Escalation queue · SLA-bound human review
COMPLIANCE ZONE C
Immutable Response Audit Trail
The response audit trail is the complete chain of evidence from query to answer that a compliance officer, internal auditor, or external regulator can use to reconstruct exactly what the system did, when it did it, with what information, using which model, under which prompt version. Full response log writes the complete record to WORM (Write Once Read Many) immutable storage: the original query text, the retrieved chunk IDs with their confidence scores, the assembled prompt (including system prompt version hash), the model identifier and configuration (temperature, max tokens, top-p), the raw model response, the citation verification result, and the final delivered response. This log cannot be modified after writing; it is the forensic record. Prompt version log records which system prompt version was active at the time of each query. When a prompt is updated, the previous version is archived, not deleted, so that any past query can be replayed with the exact prompt it was answered with. Model configuration log records model ID, quantization level, and inference parameters, because the same model at different temperatures produces systematically different outputs and the log must capture the exact configuration. The WORM audit store supports SOC 1 Type 2 reporting, which requires evidence that controls operated effectively throughout the audit period (12 months in this design); the immutability of the log is what makes that evidence credible to an auditor.
SOC 1 CC6 · ISO 9001 §7.5 · GDPR Art. 5(1)(f) · Full response log · Prompt version archive · Model config log · WORM immutable store · 12-month retention for SOC 1 T2
Zone C controls
  • SOC 1 CC6 — logical access: complete audit chain from query to response in immutable log
  • ISO 9001 §7.5 — documented information: every prompt version archived, never deleted
  • GDPR Art. 5(1)(f) — integrity and confidentiality: WORM storage prevents log tampering
  • SOC 1 Type 2 — 12 months immutable evidence that controls operated continuously
  • Response replay capability: any historical query can be reconstructed with its exact prompt and model config
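WORM immutability is a property of the storage layer and cannot be reproduced in application code. As a loosely related illustration, the sketch below uses a different technique, a hash-chained append-only log, to show how tampering with a stored record can at least be made detectable. The class name and record fields are assumptions based on the log contents listed above.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash preceding the first entry

class AuditLog:
    """Append-only log in which each entry commits to the hash of the previous entry."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    def append(self, record: dict) -> str:
        prev = self._entries[-1]["entry_hash"] if self._entries else GENESIS
        body = json.dumps(record, sort_keys=True)  # canonical form so the hash is stable
        entry_hash = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self._entries.append({"prev_hash": prev, "body": body, "entry_hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256((prev + entry["body"]).encode("utf-8")).hexdigest()
            if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
                return False
            prev = entry["entry_hash"]
        return True

log = AuditLog()
log.append({"query": "How long are records kept?", "chunk_ids": ["c1"],
            "prompt_version": "1.0.0", "model": "small-model", "temperature": 0.0})
log.append({"query": "Who approves exceptions?", "chunk_ids": ["c2"],
            "prompt_version": "1.0.0", "model": "small-model", "temperature": 0.0})
intact = log.verify()
```

A chain like this complements WORM storage but does not replace it: it detects modification after the fact, whereas WORM prevents the write in the first place.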
INFERENCE · STAGE 12
Delivery + Cost Tracking
Cost tracking per query is the operational control that prevents AI infrastructure costs from becoming invisible until the invoice arrives. Token monitoring counts input tokens (context window + prompt) and output tokens (generated response) for every query, attributed to the department, use case, and user tier that initiated it. Cost per query is computed in real time and attributed to the requesting department — this is the mechanism that enables showback (showing departments their AI infrastructure consumption) and chargeback (billing departments for their actual consumption). Time to first token (TTFT) and tokens per second (TPS) are the two latency metrics that matter for perceived performance: TTFT determines how quickly the user sees the first word of the response (target: <500ms), TPS determines how fast the response streams after it starts (target: >30 tokens/second for fluent reading pace). User feedback signal captures explicit thumbs up/down ratings and implicit signals (query rephrasing, session abandonment, correction submissions) that are the earliest indicators of model drift — when a model that was performing well begins returning lower-quality answers, user feedback signals typically degrade weeks before automated quality metrics reflect the change.
Token count (input + output) · Cost per query by dept · TTFT <500ms · TPS >30 t/s · Dept showback / chargeback · User feedback → drift signal
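The metrics in this stage are simple arithmetic over timestamps and token counts. The sketch below assumes hypothetical per-1,000-token prices and a `QueryTrace` record invented for the example; only the formulas for cost, TTFT, and TPS follow from the definitions above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical prices in USD per 1,000 tokens; substitute the real rate card.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

@dataclass
class QueryTrace:
    dept: str
    input_tokens: int
    output_tokens: int
    t_request: float       # seconds, when the query arrived
    t_first_token: float   # seconds, when the first token was streamed
    t_last_token: float    # seconds, when the last token was streamed

def cost_usd(t: QueryTrace) -> float:
    return (t.input_tokens / 1000 * PRICE_PER_1K["input"]
            + t.output_tokens / 1000 * PRICE_PER_1K["output"])

def ttft_ms(t: QueryTrace) -> float:
    """Time to first token in milliseconds."""
    return (t.t_first_token - t.t_request) * 1000.0

def tokens_per_second(t: QueryTrace) -> float:
    """Streaming rate after the first token appears."""
    streaming = t.t_last_token - t.t_first_token
    return t.output_tokens / streaming if streaming > 0 else 0.0

def showback(traces: List[QueryTrace]) -> Dict[str, float]:
    """Total cost per department, the basis for showback or chargeback."""
    totals: Dict[str, float] = defaultdict(float)
    for t in traces:
        totals[t.dept] += cost_usd(t)
    return dict(totals)

trace = QueryTrace("legal", input_tokens=2000, output_tokens=300,
                   t_request=0.0, t_first_token=0.25, t_last_token=10.25)
report = showback([trace, trace])
```

For this trace the TTFT is 250 ms and the streaming rate is 30 tokens per second, so it meets the first target and sits exactly on the boundary of the second.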
Layer Comparison

Processing vs Inference — Different Problems, Different Requirements

These two layers are frequently conflated in architecture discussions. They should not be. Processing is batch infrastructure. Inference is real-time software. Their operational requirements, failure modes, and compliance obligations are fundamentally different.

Processing layer

Batch infrastructure — runs continuously in the background

Processing is concerned with throughput, correctness, and idempotency. It runs whenever documents are added to or updated in the knowledge corpus. A processing failure corrupts the index — it does not directly fail a user query, but it degrades every query that would have used the corrupted data.

  • Optimise for throughput — documents per hour, not milliseconds
  • Must be idempotent — re-running on failure produces identical output
  • Failure mode: corrupt index, phantom chunks, stale embeddings
  • Primary metric: processing lag — time from document update to index ready
  • Compliance focus: data quality, lineage tracking, PII classification
  • Cost driver: embedding model inference, vector write operations
  • Key controls: Zone A quality gate, idempotency keys, delete propagation
Inference layer

Real-time software — runs on every user query

Inference is concerned with latency, accuracy, and compliance. It runs on demand in response to user queries. An inference failure is immediately visible — a user receives a wrong answer, a hallucination, or an error. Every inference decision is auditable.

  • Optimise for latency: p95 time to first token <500ms
  • Must be reproducible for audit: every response can be replayed from its logged inputs, prompt version, and model config
  • Failure mode: hallucination, incorrect citation, wrong model routing
  • Primary metric: TTFT, answer accuracy, citation verification pass rate
  • Compliance focus: prompt versioning, output audit trail, response suppression
  • Cost driver: LLM token consumption, model tier routing decisions
  • Key controls: Zone B output gate, Zone C audit trail, citation mandate
Embedding Strategy

Choosing the Right Embedding Model — The Decision That Matters Most

Embedding model selection has more impact on retrieval quality than any other single architectural decision. A 20–30% precision improvement from domain fine-tuning is not marginal — in a system answering 10,000 queries per day, it is the difference between 2,000 and 2,600 queries answered correctly. Get this decision right before optimising anything else.

Dense · lightweight
all-MiniLM-L6-v2 / BGE-small
384d
Fast, cheap, good for high-volume low-stakes retrieval. Roughly 22M parameters for all-MiniLM-L6-v2 and 33M for BGE-small. Suitable for FAQ systems, internal knowledge bases, and workloads where retrieval speed matters more than marginal precision. Not suitable for legal, clinical, or financial domains without fine-tuning.
Dense · balanced
text-embedding-3-small
1536d
Best cost-quality ratio for English enterprise workloads. Matryoshka Representation Learning — dimensions can be truncated to 256, 512, or 768 without retraining for cost reduction at acceptable quality loss. The default choice for most enterprise RAG deployments.
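Dimension truncation can be sketched as follows. This assumes the model was trained with Matryoshka Representation Learning, so the leading dimensions carry most of the signal; the vector below is a toy value and `truncate_embedding` is a hypothetical helper name.

```python
import math

def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    # Re-normalising keeps cosine similarity and dot product equivalent.
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0, 0.0]          # toy 4-d embedding (illustrative only)
short = truncate_embedding(full, 2)   # keep the 2 leading dimensions
```

Every vector in an index must be truncated to the same length, and queries must be truncated the same way before comparison.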
Dense · maximum quality
text-embedding-3-large
3072d
Maximum quality at 2× the storage and per-vector comparison cost of 1536d, plus a higher per-token embedding price. Use only when marginal recall improvement has direct monetary or regulatory value: medical protocol retrieval, legal clause matching, or financial regulation lookup where a missed relevant document has a measurable cost.
Domain fine-tuned
Base model + in-domain training
+23%
Fine-tuning on 5,000–10,000 in-domain query-document pairs delivers a 20–30% precision improvement on enterprise evaluation datasets. Required for legal, clinical, and financial domains. Training data source: historical user queries + human relevance judgements. Retrain quarterly or on significant corpus change.
Sparse · exact match
SPLADE / BM25
exact
Mandatory complement to dense embeddings. BM25 handles exact regulation codes, product identifiers, and proper nouns that dense models systematically miss. SPLADE adds learned sparse expansion — automatically expanding "HIPAA" to "Protected Health Information" and related terms. Always deploy alongside dense, never as a replacement.
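One common way to combine a dense ranking with a sparse ranking is reciprocal rank fusion. The text does not name a fusion method, so treat this as an assumption; the chunk IDs and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; higher fused score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by chunk ID for determinism.
    return sorted(scores, key=lambda chunk: (-scores[chunk], chunk))

dense_hits = ["c3", "c1", "c7"]    # hypothetical dense retrieval order
sparse_hits = ["c9", "c3", "c1"]   # hypothetical BM25 / SPLADE order
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

A chunk found by both tracks (here `c3` and `c1`) outranks a chunk found by only one, which is the behaviour the dense-plus-sparse pairing is meant to deliver.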
Embedding cache
SHA-256 content hash check
80%
60–80% of chunks in a typical enterprise corpus are unchanged between ingestion runs. Checking the SHA-256 content hash before re-embedding eliminates this redundant work. At $0.02 per 1M tokens (text-embedding-3-small), a full pass over a 10B-token corpus costs $200, so an 80% cache hit rate saves $160 per full re-index run.
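The cache check can be sketched as below. This is a minimal illustration, assuming an in-memory dict as the cache and a fake embedding function; `embed_with_cache`, `fake_embed`, and `savings_usd` are hypothetical names. In production the cache key would normally also include the embedding model version.

```python
import hashlib
from typing import Callable

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_with_cache(chunks: list[str], cache: dict,
                     embed_fn: Callable[[str], list[float]]):
    """Embed only chunks whose content hash is not cached; return vectors and miss count."""
    vectors, misses = [], 0
    for text in chunks:
        key = content_hash(text)
        if key not in cache:
            cache[key] = embed_fn(text)   # the only place the model is called
            misses += 1
        vectors.append(cache[key])
    return vectors, misses

def fake_embed(text: str) -> list[float]:
    return [float(len(text))]             # placeholder for a real model call

def savings_usd(corpus_tokens: int, hit_rate: float, usd_per_million: float) -> float:
    """Embedding spend avoided on one full re-index run."""
    return corpus_tokens / 1_000_000 * usd_per_million * hit_rate

cache: dict = {}
_, first_misses = embed_with_cache(["a", "bb", "ccc", "dddd", "eeeee"], cache, fake_embed)
_, second_misses = embed_with_cache(["a", "bb", "ccc", "dddd", "CHANGED"], cache, fake_embed)
saved = savings_usd(10_000_000_000, 0.80, 0.02)  # 10B tokens, 80% hits, $0.02 per 1M
```

On the second run only the changed chunk reaches the model, so four of five chunks (80%) are served from the cache.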
Inference Deployment

Three Inference Deployment Patterns — Choose Based on Data Sovereignty

The single most important inference architecture decision is not which model to use — it is where inference runs. Data sovereignty requirements, regulatory constraints, and cost profiles all point to the same first question: can this data leave your infrastructure perimeter?

Cloud API inference
Maximum capability, managed infrastructure
Best performance ceiling. GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro. Managed infrastructure — no GPU provisioning, no model serving overhead. Appropriate when data sovereignty does not prohibit external API calls and maximum model capability is required.
  • GPT-4o: 128K context window, best complex reasoning
  • Claude 3.5 Sonnet: best instruction following, long-form synthesis
  • Gemini 1.5 Pro: 1M context window for very long documents
  • Cost: $0.003–$0.015 per 1K input tokens
  • Latency: 200–600ms TTFT typical
  • Risk: data leaves your infrastructure perimeter
On-premise inference
Data sovereignty, fixed cost at scale
Required for HIPAA-covered clinical data, financial data under certain regulatory frameworks, and classified information. Llama 3 8B (4-bit quantized) and Mistral 7B are the standard choices — both run on a single A100 GPU with sufficient throughput for most enterprise query volumes.
  • Llama 3 8B (4-bit): 60–70% of GPT-4 quality, ~15% of cost at scale
  • Mistral 7B: strong instruction following, 32K context
  • Ollama: local model serving, batch inference, GPU management
  • Cost: fixed GPU hardware + electricity (amortised over queries)
  • Latency: 100–300ms TTFT on A100
  • Advantage: data never leaves infrastructure perimeter
Hybrid inference routing
Complexity-based cost routing
The most cost-efficient pattern at scale. A complexity classifier routes simple factual queries to a small local model (Mistral 7B) and complex synthesis queries to a cloud API (GPT-4o). 70–80% of enterprise queries are simple enough for a local model — routing them off-premise wastes cost and adds latency.
  • Simple factual: local Mistral 7B, ~$0.001, <150ms
  • Moderate analytical: cloud GPT-3.5-turbo, ~$0.002, <300ms
  • Complex synthesis: cloud GPT-4o, ~$0.02, <600ms
  • 60–70% cost reduction vs always-cloud at enterprise scale
  • Complexity classifier: fine-tuned BERT, 95%+ routing accuracy
  • Fallback: primary model unavailable → secondary tier automatic
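Tiered routing with automatic fallback can be sketched as follows. The word-count heuristic is only a placeholder for the fine-tuned classifier described above, and the tier labels and fallback order (escalate to a more capable tier first, then step down) are assumptions.

```python
TIERS = ["local-small", "cloud-mid", "cloud-large"]  # hypothetical tier labels

def classify_complexity(query: str) -> int:
    """Placeholder heuristic standing in for a trained complexity classifier."""
    words = len(query.split())
    if words <= 8:
        return 0    # simple factual
    if words <= 20:
        return 1    # moderate analytical
    return 2        # complex synthesis

def route(query: str, available: set[str]) -> str:
    """Pick the classified tier, falling back to the next available one."""
    start = classify_complexity(query)
    # Preferred tier first, then more capable tiers, then less capable ones.
    order = TIERS[start:] + TIERS[:start][::-1]
    for tier in order:
        if tier in available:
            return tier
    raise RuntimeError("no inference tier available")

simple = route("What is the PTO policy?", set(TIERS))
fallback = route("What is the PTO policy?", {"cloud-mid", "cloud-large"})
```

Where data sovereignty rules apply, the fallback order must never escalate a restricted query to a cloud tier; that check is omitted here for brevity.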
Compliance Architecture

Three Zones. Processing to Response.

The processing and inference layer has three compliance zones positioned at the points where data quality, output quality, and audit evidence are most critical. Each zone maps to specific ISO 9001 and SOC 1 Type 2 control requirements.

Zone A · Processing quality gate
ISO 9001 §8.7 · SOC 1 CC7.2
  • PII detected and classified before embedding — sensitivity label set at processing time
  • Content quality threshold enforced — sub-50-token chunks discarded with log entry
  • SimHash near-duplicate detection — prevents retrieval relevance inflation
  • Processing decision log records every quality gate outcome
  • ISO 9001 §9.1 evidence: processing quality metrics available for operational review
  • Failed items logged as nonconforming with reason code — never silently discarded
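The Zone A checks can be sketched as a single gate function. This assumes whitespace-separated words stand in for model tokens and uses a simple 64-bit SimHash; the Hamming cutoff of 3 and the reason codes are illustrative assumptions, and PII classification is omitted.

```python
import hashlib

MIN_TOKENS = 50    # threshold from the list above
MAX_HAMMING = 3    # assumed near-duplicate cutoff; tune per corpus

def simhash(tokens: list[str], bits: int = 64) -> int:
    """Fingerprint where similar token sets yield fingerprints differing in few bits."""
    weights = [0] * bits
    for token in tokens:
        digest = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(bits):
            weights[bit] += 1 if (digest >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def quality_gate(chunks: list[str], log: list[dict]) -> list[str]:
    """Return accepted chunks; every outcome, pass or fail, is appended to the log."""
    accepted, fingerprints = [], []
    for number, text in enumerate(chunks):
        tokens = text.lower().split()
        if len(tokens) < MIN_TOKENS:
            log.append({"chunk": number, "outcome": "nonconforming", "reason": "BELOW_MIN_TOKENS"})
            continue
        fingerprint = simhash(tokens)
        if any(hamming(fingerprint, seen) <= MAX_HAMMING for seen in fingerprints):
            log.append({"chunk": number, "outcome": "nonconforming", "reason": "NEAR_DUPLICATE"})
            continue
        fingerprints.append(fingerprint)
        accepted.append(text)
        log.append({"chunk": number, "outcome": "accepted", "reason": None})
    return accepted

original = " ".join(f"policy{i}" for i in range(60))
other = " ".join(f"procedure{i}" for i in range(60))
gate_log: list[dict] = []
accepted = quality_gate([original, original, "too short", other], gate_log)
```

Rejected chunks never reach the embedding stage, but each one leaves a log entry with a reason code, which is what makes the rejection auditable rather than silent.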
Zone B · Output quality gate
ISO 9001 §8.7 · SOC 1 CC7.2
  • Citation-to-chunk traceability: every claim traced to a source chunk ID
  • NLI entailment scoring: contradiction triggers response suppression
  • Output PII filter: final check before response delivery
  • Format + schema validation: structured outputs validated against defined schema
  • Response suppression on citation failure — graceful declination, not broken response
  • Hallucination signal rate logged per query category for drift detection
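The citation-to-chunk check and the suppression behaviour can be sketched as below. This assumes the response has already been split into claims that each carry an optional `chunk_id`; the claim structure, declination text, and function name are hypothetical, and NLI scoring and PII filtering are not shown.

```python
DECLINATION = "I can't answer that from the approved sources."  # illustrative wording

def output_gate(claims: list[dict], retrieved_ids: set[str]) -> dict:
    """Deliver only if every claim cites a chunk that was actually retrieved."""
    uncited = [claim["text"] for claim in claims
               if claim.get("chunk_id") not in retrieved_ids]
    if not claims or uncited:
        # Graceful declination: the user gets a clear refusal, not a partial answer.
        return {"delivered": False, "response": DECLINATION, "uncited": uncited}
    body = " ".join(f'{claim["text"]} [{claim["chunk_id"]}]' for claim in claims)
    return {"delivered": True, "response": body, "uncited": []}

retrieved = {"chunk-17", "chunk-42"}
grounded = output_gate(
    [{"text": "Retention is 12 months.", "chunk_id": "chunk-17"}], retrieved)
suppressed = output_gate(
    [{"text": "Retention is 12 months.", "chunk_id": "chunk-17"},
     {"text": "Extensions are automatic.", "chunk_id": None}], retrieved)
```

A citation pointing at a chunk ID that was never retrieved is treated the same as a missing citation, since it cannot be traced to the context the model saw.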
Zone C · Immutable response audit
SOC 1 CC6 · ISO 9001 §7.5 · GDPR Art. 5
  • Full response log: query + context + prompt version + model config + response
  • WORM immutable store: logs cannot be modified or deleted after writing
  • Prompt version archive: every historical prompt version retained for replay
  • Model configuration log: model ID + temp + max tokens per query
  • 12-month retention minimum for SOC 1 Type 2 evidence package
  • Response replay: any past query reconstructable with exact original parameters
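The fields listed above can be sketched as a single audit record. The field names and sample values are hypothetical. A frozen dataclass only prevents mutation inside the process; actual immutability has to come from the WORM store the serialised record is written to.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class AuditRecord:
    query: str
    context_chunk_ids: tuple[str, ...]
    prompt_version: str
    prompt_sha256: str
    model_id: str
    temperature: float
    max_tokens: int
    response: str

def serialize(record: AuditRecord) -> bytes:
    """Canonical JSON so the same record always yields the same bytes."""
    return json.dumps(asdict(record), sort_keys=True, separators=(",", ":")).encode("utf-8")

def record_digest(record: AuditRecord) -> str:
    """Digest stored alongside the record to detect later tampering."""
    return hashlib.sha256(serialize(record)).hexdigest()

record = AuditRecord(
    query="What is the retention period?",
    context_chunk_ids=("chunk-17", "chunk-42"),
    prompt_version="v3",
    prompt_sha256=hashlib.sha256(b"hypothetical prompt template v3").hexdigest(),
    model_id="example-model",     # placeholder model identifier
    temperature=0.0,
    max_tokens=512,
    response="Twelve months minimum [chunk-17].",
)
digest = record_digest(record)
```

Replaying a past query then means loading the record, fetching the archived prompt by version, and re-running with the logged model configuration.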
Architectural note: Task AI Systems designs processing and inference systems aligned with ISO 9001 process quality requirements, SOC 1 Type 2 operating effectiveness controls, and GDPR data integrity principles. The three-zone compliance architecture ensures that every document processed and every response generated is traceable, auditable, and recoverable for regulatory review. Formal certification responsibilities remain with your organisation's compliance function.
Operational Resilience

Every Failure Mode. Every Recovery Path.

Processing and inference failures are categorically different. A processing failure corrupts data silently. An inference failure fails users visibly. Both must be designed against explicitly: failures discovered in production are expensive; recoveries designed in advance are not.

Embedding model version drift
The embedding model is updated or replaced mid-corpus. Old chunks are embedded with model v1; new chunks and incoming queries with model v2. The two embedding spaces are incompatible, so similarity scores between a v2 query vector and a v1 chunk vector are meaningless and retrieval returns nonsensical results.
Prompt version mismatch
A prompt update is deployed to production without updating the audit log version tracking. Responses are generated under a new prompt but logged with the old prompt hash. Audit trail is corrupted — historical queries cannot be accurately replayed and regulators cannot verify what system produced what output.
Citation verification bypass
Zone B citation check is disabled or misconfigured under load. Responses with no source attribution are delivered to users. In a clinical or legal context, a response that appears authoritative but contains no cited source is indistinguishable from a well-cited response — until the decision it informed causes harm.
Phantom chunks after delete
A document is deleted from the source system, processed chunks are purged from the primary vector DB, but the sparse BM25 index and graph store are not updated. Exact-term queries continue returning content from the deleted document through the sparse retrieval track. Compliance failure for documents deleted under right-to-erasure requests.
Cost routing miscalibration
The complexity classifier that routes queries to model tiers degrades over time as query patterns shift. Complex synthesis queries are routed to the small local model. Quality appears to degrade gradually — users report worse answers, but no error metrics fire because the system is technically functioning.
Indirect prompt injection
Adversarial content embedded in a document in the knowledge base is retrieved and injected into the LLM context. The injected content manipulates the LLM into ignoring its system prompt constraints, revealing information from other namespaces, or generating content outside its permitted scope.
Recovery path for each failure mode
Full corpus re-embedding on model version change
Index versioning tracks which embedding model version produced each chunk. On model upgrade, a full re-embedding run is triggered with blue-green deployment — new index version serves queries while old remains as rollback until quality is verified.
Atomic prompt deployment with version lock
Prompt updates are deployed atomically: new prompt version activated, old version archived, audit log updated with new hash — all in a single transaction. No window where responses are generated under one prompt and logged under another.
Zone B circuit breaker on citation check failure rate
If citation verification failure rate exceeds 5% in a rolling 5-minute window, Zone B circuit breaker activates — all responses are suppressed and the system falls back to graceful declination until the root cause is identified and resolved.
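The rolling-window breaker can be sketched as below, using the 5% threshold and 5-minute window from the text. The minimum sample count is an added assumption so that a single early failure does not trip the breaker; the class and method names are hypothetical.

```python
from collections import deque

class CitationCircuitBreaker:
    def __init__(self, threshold: float = 0.05, window_seconds: float = 300.0,
                 min_samples: int = 20):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.min_samples = min_samples      # assumption, not from the text
        self.events: deque = deque()        # (timestamp, failed) pairs
        self.open = False

    def record(self, timestamp: float, failed: bool) -> None:
        """Log one citation check result and re-evaluate the failure rate."""
        self.events.append((timestamp, failed))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()           # drop results older than the window
        failures = sum(1 for _, was_failure in self.events if was_failure)
        if len(self.events) >= self.min_samples and failures / len(self.events) > self.threshold:
            self.open = True                # stays open until reset() is called

    def allow_response(self) -> bool:
        return not self.open

    def reset(self) -> None:
        """Called by an operator once the root cause is resolved."""
        self.open = False
        self.events.clear()

breaker = CitationCircuitBreaker()
for second in range(95):
    breaker.record(float(second), failed=False)
for second in range(95, 100):
    breaker.record(float(second), failed=True)
at_threshold = breaker.allow_response()          # 5 of 100 failed: not above 5%
breaker.record(100.0, failed=True)
after_sixth_failure = breaker.allow_response()   # 6 of 101 failed: breaker opens
```

The breaker does not close on its own when the rate recovers, which matches the requirement that responses stay suppressed until the root cause is identified and resolved.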
Cross-index delete propagation with confirmation
Delete operations are not confirmed complete until all three indexes (vector DB, sparse BM25, graph store) return confirmed deletion. If any index fails to confirm, the delete is retried up to N times before alerting. For right-to-erasure, confirmed deletion across all indexes is logged as compliance evidence.
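Confirmed delete propagation can be sketched as follows. Each index is represented by a callable that returns `True` once deletion is confirmed; `FlakyIndex` is a test double, and the retry limit is passed in by the caller because the text leaves N unspecified.

```python
from typing import Callable

def propagate_delete(doc_id: str, indexes: dict[str, Callable[[str], bool]],
                     max_attempts: int, evidence_log: list[dict]) -> bool:
    """Return True only when every index has confirmed deletion."""
    pending = dict(indexes)
    for attempt in range(1, max_attempts + 1):
        for name in list(pending):
            try:
                confirmed = pending[name](doc_id)
            except Exception:
                confirmed = False           # treat errors as unconfirmed and retry
            if confirmed:
                evidence_log.append({"doc_id": doc_id, "index": name,
                                     "attempt": attempt, "status": "deleted"})
                del pending[name]
        if not pending:
            evidence_log.append({"doc_id": doc_id, "status": "complete"})
            return True
    evidence_log.append({"doc_id": doc_id, "status": "alert",
                         "unconfirmed": sorted(pending)})
    return False

class FlakyIndex:
    """Test double that fails a set number of times before confirming."""
    def __init__(self, failures_before_success: int):
        self.remaining = failures_before_success

    def delete(self, doc_id: str) -> bool:
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True

evidence: list[dict] = []
ok = propagate_delete(
    "doc-7",
    {"vector": FlakyIndex(0).delete, "bm25": FlakyIndex(1).delete, "graph": FlakyIndex(0).delete},
    max_attempts=3,   # illustrative value only
    evidence_log=evidence,
)
```

Only indexes that have not yet confirmed are retried, and the final "complete" entry is the record that serves as right-to-erasure evidence.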
Complexity classifier retraining on quality signal
User feedback signals (thumbs down, query rephrasing) tagged with the model tier that served the query feed a weekly classifier retraining run. If a query category shows consistently lower satisfaction when routed to a lower tier, the routing threshold for that category is automatically adjusted upward.
Post-generation scope boundary enforcement
Zone B scope boundary check scans the response for content outside the permitted domain — including content that appears to have been injected via retrieved documents. Responses containing out-of-scope content are suppressed and the triggering chunk is flagged for manual review and potential removal from the corpus.
Architecture Decisions

The Decisions That Determine Production Quality

Every significant architectural choice in the processing and inference layer involves a tradeoff between quality, cost, latency, and compliance. These are the five decisions where the tradeoff analysis matters most.

Processing + inference architecture decisions
Domain fine-tuning vs generic embeddings
Generic models score 20–30% lower on retrieval precision in enterprise domain evaluation. Fine-tuning on 5,000–10,000 in-domain query-document pairs closes this gap. The cost of fine-tuning (a few hundred dollars per training run) is amortised over millions of queries. For legal, clinical, and financial corpora, generic embeddings are not acceptable — the precision loss is too large. For general internal knowledge bases with varied content, generic models with careful chunking often reach acceptable quality without fine-tuning overhead.
Chunk size: 512 tokens default
The 512-token sweet spot is empirically derived across enterprise RAG evaluations. Below 256 tokens, most enterprise documents produce fragments too short to contain complete semantic units — policy clauses, procedure steps, contractual terms. Above 1024 tokens, chunks contain multiple distinct topics that produce averaged embeddings with degraded precision on any single topic. Parent-child chunking resolves the residual tension: 256–512 token child chunks for precise retrieval, 800–1500 token parent sections for full context at inference time.
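Parent-child chunking with an overlapping window can be sketched as below. Whitespace-separated words stand in for model tokens, and the split is purely positional rather than semantic, so this shows the bookkeeping and not boundary detection. The 20% overlap follows the default shown in the architecture overview; the function and ID formats are hypothetical.

```python
def sliding_windows(tokens: list[str], size: int, overlap_ratio: float = 0.2) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap_ratio` of their tokens."""
    step = max(1, size - round(size * overlap_ratio))
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                       # the last window already reaches the end
    return windows

def parent_child_chunks(sections: dict[str, str], child_size: int = 512) -> list[dict]:
    """Emit child chunks for retrieval, each pointing back to its parent section."""
    records = []
    for parent_id, text in sections.items():
        for number, window in enumerate(sliding_windows(text.split(), child_size)):
            records.append({
                "child_id": f"{parent_id}#{number}",
                "parent_id": parent_id,     # used to fetch full context at inference time
                "text": " ".join(window),
            })
    return records

tokens = [str(i) for i in range(20)]
windows = sliding_windows(tokens, size=10)
children = parent_child_chunks({"sec-1": " ".join(tokens)}, child_size=10)
```

Retrieval matches against the small child chunks, and the `parent_id` lets the inference layer pull in the larger parent section for context.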
On-premise vs cloud inference
The decision is not primarily about cost; it is about data sovereignty. For any workload where regulatory constraints prohibit data leaving the infrastructure perimeter (HIPAA-covered clinical data, financial data under certain frameworks, legal documents under attorney-client privilege), on-premise is not a preference but a requirement. Llama 3 8B (4-bit quantized) on a single A100 GPU achieves 60–70% of GPT-4 quality at a fraction of the per-query cost at scale. For workloads without sovereignty constraints, hybrid routing captures the best of both: a local model for the roughly 70% of queries that are simple, and a cloud API for the remaining 30% that are complex.
Citation mandate vs best-effort attribution
Best-effort attribution produces responses where some claims are cited and others are not — the user cannot distinguish grounded claims from hallucinated ones. Citation mandate is the only architecture that makes hallucination detectable: if a claim cannot be cited, it cannot be made. The cost is a higher declination rate — the system will refuse more queries that a best-effort system would answer (perhaps incorrectly). In regulated industries, a higher declination rate is acceptable and preferable to undetected hallucination. In general enterprise deployments, the citation mandate can be tuned to require citation only for high-confidence factual claims, with lower-confidence claims flagged as uncertain.
WORM audit log vs standard database log
A standard database log can be modified — records can be updated, deleted, or overwritten. This makes it useless as compliance evidence: an auditor cannot trust that the log reflects what actually happened rather than what someone wanted the record to show. WORM (Write Once Read Many) storage — S3 Object Lock, Azure Blob immutability, GCS retention policies — creates a log that cannot be modified after writing. The immutability is what makes the log credible to a SOC 1 Type 2 auditor. The additional cost of WORM storage over standard storage is typically less than 10% — trivial relative to the compliance value it provides.
Start the Conversation

Ready to Architect Your Processing & Inference Layer?

A focused architecture conversation can identify the specific gaps in your current system — before an embedding version mismatch corrupts your index or an uncited hallucination reaches a regulated decision.