The processing layer transforms ingested documents into retrieval-ready vectors. The inference layer transforms retrieved context into governed, cited, compliance-audited responses. Together they form the core of every enterprise RAG system, and they are where most production failures originate.
Processing is the bridge between raw ingested text and a retrieval-ready vector index. Inference is the bridge between retrieved context and a governed LLM response. Neither layer is a black box — every stage has defined inputs, outputs, failure modes, and compliance checkpoints.
The processing and inference layer consists of two distinct subsystems that must be understood separately before they can be designed together. Processing is batch-oriented and runs continuously in the background. Inference is real-time and runs on every user query. Their failure modes, latency requirements, and compliance obligations differ accordingly.
These two layers are frequently conflated in architecture discussions. They should not be. Processing is batch infrastructure. Inference is real-time software. Their operational requirements, failure modes, and compliance obligations are fundamentally different.
Processing is concerned with throughput, correctness, and idempotency. It runs whenever documents are added to or updated in the knowledge corpus. A processing failure corrupts the index — it does not directly fail a user query, but it degrades every query that would have used the corrupted data.
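As a rough illustration of what idempotency means for a processing run, the sketch below derives a deterministic key for each chunk so that re-processing the same document writes nothing new. It is a minimal sketch, not a prescribed design: the in-memory `index` dictionary, the `embed` callable, and the names `chunk_id` and `upsert_chunks` are all assumptions made for the example.

```python
import hashlib
from typing import Callable


def chunk_id(doc_id: str, chunk_text: str, embedding_version: str) -> str:
    """Deterministic key: same document, model version, and text give the same ID."""
    payload = "\x1f".join([doc_id, embedding_version, chunk_text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert_chunks(
    index: dict,
    doc_id: str,
    chunks: list[str],
    embedding_version: str,
    embed: Callable[[str], list[float]],
) -> int:
    """Write chunks into `index`, skipping keys that already exist.

    `index` stands in for a vector store (hypothetical); returns the
    number of records actually written.
    """
    written = 0
    for text in chunks:
        key = chunk_id(doc_id, text, embedding_version)
        if key in index:
            continue  # already processed: a re-run is a no-op
        index[key] = {
            "doc_id": doc_id,
            "text": text,
            "vector": embed(text),
            "embedding_version": embedding_version,
        }
        written += 1
    return written
```

Running the same batch twice leaves the index unchanged on the second pass, which is the property that makes retries after a partial failure safe.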
Inference is concerned with latency, accuracy, and compliance. It runs on demand in response to user queries. An inference failure is immediately visible: a user receives a wrong answer, a hallucination, or an error. Every inference decision must therefore be auditable.
Embedding model selection has more impact on retrieval quality than any other single architectural decision. A 20–30% precision improvement from domain fine-tuning is not marginal. In a system answering 10,000 queries per day, a 30% relative gain on a baseline of 2,000 correctly answered queries raises that figure to 2,600. Get this decision right before optimising anything else.
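The arithmetic behind those figures can be checked in a few lines. This sketch assumes the improvement is a relative gain applied to a baseline of 2,000 correctly answered queries per day; the function name `correct_after_gain` is hypothetical.

```python
def correct_after_gain(baseline_correct: int, relative_gain: float) -> int:
    """Queries answered correctly after a relative precision improvement."""
    return round(baseline_correct * (1 + relative_gain))


baseline = 2_000  # correctly answered queries per day (figure from the text)

low = correct_after_gain(baseline, 0.20)   # lower end of the 20-30% range
high = correct_after_gain(baseline, 0.30)  # upper end of the 20-30% range

print(low, high)  # 2400 2600
```

Under that reading, the 2,600 figure corresponds to the upper end of the range, and the lower end works out to 2,400.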
The single most important inference architecture decision is not which model to use — it is where inference runs. Data sovereignty requirements, regulatory constraints, and cost profiles all point to the same first question: can this data leave your infrastructure perimeter?
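One way to make that first question explicit is to route each query according to the classification of the context it retrieved. The sketch below is illustrative only: the classification labels, the `MAY_LEAVE_PERIMETER` policy set, and the `choose_target` function are assumptions, not part of any particular product or standard.

```python
from enum import Enum
from typing import Iterable


class InferenceTarget(Enum):
    IN_PERIMETER = "in_perimeter"   # self-hosted model inside your infrastructure
    EXTERNAL_API = "external_api"   # third-party hosted model


# Hypothetical policy: only these classifications may leave the perimeter.
MAY_LEAVE_PERIMETER = frozenset({"public"})


def choose_target(
    classifications: Iterable[str],
    may_leave: frozenset = MAY_LEAVE_PERIMETER,
) -> InferenceTarget:
    """Pick where inference runs for one query.

    The external target is chosen only if every retrieved chunk carries a
    classification that is allowed to leave. Empty or unknown labels fail
    closed and stay inside the perimeter.
    """
    labels = set(classifications)
    if labels and labels <= may_leave:
        return InferenceTarget.EXTERNAL_API
    return InferenceTarget.IN_PERIMETER
```

Failing closed is a deliberate choice in this sketch: a chunk with a missing or unrecognised label is treated as data that cannot leave.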
The processing and inference layer has three compliance zones positioned at the points where data quality, output quality, and audit evidence are most critical. Each zone maps to specific ISO 9001 and SOC 1 Type 2 control requirements.
Processing and inference failures are categorically different. A processing failure corrupts data silently. An inference failure fails users visibly. Both must be designed against explicitly: failures discovered in production are expensive, while recoveries designed in advance are not.
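A silent processing failure can often be turned into a loud one with a cheap guard at write time. As a minimal sketch, assuming each record carries an `embedding_version` tag and the index declares the version it expects (both hypothetical conventions), a batch can be rejected before it reaches the index:

```python
class EmbeddingVersionMismatch(ValueError):
    """Raised when a batch was embedded with a different model version."""


def validate_batch(index_version: str, batch: list[dict]) -> None:
    """Reject the whole batch if any record's version tag differs from the index.

    Records are plain dicts with hypothetical keys "id" and
    "embedding_version"; a missing tag counts as a mismatch.
    """
    bad_ids = [
        record["id"]
        for record in batch
        if record.get("embedding_version") != index_version
    ]
    if bad_ids:
        raise EmbeddingVersionMismatch(
            f"index expects {index_version!r}; rejected records: {bad_ids}"
        )
```

Calling `validate_batch` before every write means a model upgrade that was not accompanied by a re-index stops the pipeline instead of mixing incompatible vectors.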
Every significant architectural choice in the processing and inference layer involves a tradeoff between quality, cost, latency, and compliance. These are the five decisions where the tradeoff analysis matters most.
A focused architecture conversation can identify the specific gaps in your current system — before an embedding version mismatch corrupts your index or an uncited hallucination reaches a regulated decision.