Enterprise RAG Retrieval & Intelligence Layer

Stage Detail

What Each Stage Does — And Why It Exists

The retrieval pipeline is a decision funnel, not a linear flow. A query fans out to three parallel retrieval tracks, converges at hybrid fusion, narrows through reranking, and routes through a confidence gate before context reaches the LLM. Every compliance control is structural — enforced at the infrastructure layer, not application logic.

STAGE 00

Hardware Layer

Retrieval hardware is fundamentally different from ingestion hardware. Ingestion is optimised for throughput — processing millions of documents over hours. Retrieval is optimised for latency — answering queries in under 200ms. This means dedicated GPU or high-CPU nodes for ANN search, separate Redis cluster nodes for semantic caching, and a distinct inference node for the cross-encoder reranker. The KMS decrypt at query time is the piece most architectures miss: if vector payloads are stored encrypted (required for regulated industries), they must be decrypted in memory at retrieval time, which requires a KMS call on every query. That call adds ~5–15ms and must be budgeted into the p95 latency SLA. Data residency zone assignment also happens here — a query containing patient data may need to be routed to an EU-region endpoint before touching any vector search infrastructure.

GPU / CPU ANN workers · Qdrant cluster · Redis semantic cache · VPC private endpoints · KMS decrypt · TLS 1.2+ all hops · HIPAA / GDPR region lock
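
To make the latency budgeting concrete, here is a minimal sketch of a per-stage budget check. Only the 200 ms target and the 5–15 ms KMS range come from the text above; every other stage figure, and the name remaining_headroom, is an illustrative assumption rather than a measurement.

```python
# Illustrative worst-case latency per stage, in milliseconds.
# Only the KMS figure (upper end of 5-15 ms) and the 200 ms target come from
# the text; all other numbers are placeholder assumptions.
BUDGET_MS = 200

stage_worst_case_ms = {
    "pre_query_gate": 10,
    "kms_decrypt": 15,
    "parallel_retrieval": 20,  # tracks run concurrently, so count the slowest
    "fusion": 5,
    "rerank": 80,
    "context_assembly": 10,
}


def remaining_headroom(stages: dict[str, int], budget_ms: int) -> int:
    """Return the milliseconds left after summing sequential stage costs."""
    return budget_ms - sum(stages.values())


headroom = remaining_headroom(stage_worst_case_ms, BUDGET_MS)
print(f"headroom: {headroom} ms")
```

A negative headroom would mean the SLA cannot be met even before network jitter is considered, which is why the KMS call has to appear in the budget explicitly.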

COMPLIANCE ZONE A

Pre-Query Compliance Gate

Before a query touches any retrieval infrastructure, eight controls must pass in sequence. User identity verification via SSO/session token. RBAC pre-check confirming the user role is permitted to query this namespace at all. Query event log: user identity, timestamp, source IP, and SHA-256 hash of the query are written to an immutable log before retrieval begins, not after. Rate limiting as an availability control. PII pre-screen flagging any query that contains personal data before it enters retrieval. Toxicity filter blocking adversarial prompts designed to extract unauthorised information. Data residency routing directing the query to the correct regional endpoint. Finally, the query SHA-256 fingerprint, which creates a tamper-evident chain from query to retrieved context to LLM response; the SOC 1 auditor traces this chain to verify that what was logged is what was answered.

SOC 1 CC6.1 · CC6.6 · ISO 9001 §7.4 · SSO / MFA · RBAC · Query event log · Rate limit · PII screen · Toxicity filter · Data residency · SHA-256

Zone A controls

SOC 1 CC6.1 — logical access enforced before query reaches vector DB
SOC 1 CC6.6 — security event logging: failed auth attempts, rate limit violations
ISO 9001 §7.4 — controlled input: query validated before entering retrieval system
GDPR Article 25 — data residency routing: privacy by design at infrastructure level
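
The query event log and fingerprint described above could be sketched as follows. The record fields mirror the ones named in the text (user identity, timestamp, source IP, query hash); the function names, the JSON-lines format, and the choice to store only the hash and not the raw query are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def query_fingerprint(query: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def build_query_event(user_id: str, source_ip: str, query: str) -> str:
    """Serialise the pre-retrieval log record as a single JSON line.

    Assumption: only the hash of the query is stored, not the raw text.
    """
    event = {
        "user_id": user_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_ip": source_ip,
        "query_sha256": query_fingerprint(query),
    }
    return json.dumps(event, sort_keys=True)


# The record is built (and would be written) before retrieval begins.
line = build_query_event("u-123", "10.0.0.7", "What is our refund policy?")
```

Because the same fingerprint can be attached to the retrieved context and to the LLM response, an auditor can join the three records on query_sha256.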

STAGE 01

Vector Database Architecture

The vector database is not a single system — it is an architecture of four complementary stores. The primary vector store (Qdrant, Weaviate, Pinecone) handles dense embedding search with native namespace isolation and metadata payload filtering. The sparse index (Elasticsearch, OpenSearch) handles exact-term BM25 retrieval — regulation codes, product identifiers, contract numbers that dense retrieval systematically misses. The graph store (Neo4j) handles entity relationships that neither dense nor sparse retrieval can traverse — "show me all policy documents referenced by this contract, and all clauses in those documents that conflict with our internal guidelines." The semantic cache (Redis/GPTCache) stores recent query-result pairs and returns cached results on near-duplicate queries, eliminating retrieval latency entirely for repeated questions — typically hitting on 20–40% of production queries in enterprise deployments.

Qdrant · Weaviate · Pinecone · Elasticsearch · Neo4j · Redis · GPTCache · HNSW · IVF · PQ compression · index versioning

COMPLIANCE ZONE B

Namespace Isolation + Access Boundary Enforcement

Namespace isolation is the most important structural control in the retrieval layer. It is placed between indexing strategy and query understanding deliberately — access boundaries must be enforced before the query is even formulated for retrieval. The isolation is structural: an HR query physically cannot return financial documents because the query is scoped to the HR namespace at the vector DB filter layer, not through application logic that could be bypassed. Sensitivity filters enforce document classification — a user without "restricted" clearance never receives restricted-classified chunks regardless of query phrasing. The per-query access log records which namespaces were queried, by whom, at what time — this is the retrieval-layer equivalent of Zone D in ingestion, and it feeds the SOC 1 Type 2 evidence package.

SOC 1 CC6.1 · ISO 9001 §8.4 · Tenant namespaces · Cross-namespace blocked structurally · Sensitivity filter · Per-query access log

Zone B controls

SOC 1 CC6.1 — RBAC enforced at vector DB filter layer, not application logic
ISO 9001 §8.4 — external provider control: approved namespaces only
Cross-namespace isolation is structural — no query can span namespaces without explicit permission
Sensitivity classification propagated from ingestion is enforced at retrieval time
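
A minimal in-memory sketch of the idea that the access filter runs before any scoring. In a real deployment this would be a filter clause passed to the vector database; the Chunk and User types, the clearance labels, and visible_chunks are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Clearance levels ordered from least to most sensitive (hypothetical labels).
LEVELS = ["public", "internal", "restricted"]


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    namespace: str
    sensitivity: str


@dataclass(frozen=True)
class User:
    user_id: str
    namespaces: frozenset[str]
    clearance: str


def visible_chunks(user: User, namespace: str, corpus: list[Chunk]) -> list[Chunk]:
    """Return only the chunks the user may see; runs before any scoring."""
    if namespace not in user.namespaces:
        raise PermissionError(f"{user.user_id} may not query {namespace}")
    max_level = LEVELS.index(user.clearance)
    return [
        c for c in corpus
        if c.namespace == namespace and LEVELS.index(c.sensitivity) <= max_level
    ]


corpus = [
    Chunk("h1", "hr", "internal"),
    Chunk("h2", "hr", "restricted"),
    Chunk("f1", "finance", "internal"),
]
user = User("u-1", frozenset({"hr"}), "internal")
visible = visible_chunks(user, "hr", corpus)  # only h1 survives the filter
```

The entitlements come from the server-side User record, never from the request, so query phrasing cannot widen the scope.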

STAGE 03 ★

Query Understanding Layer

Query understanding is where most RAG implementations fail silently. A user asks "what was discussed in the Q3 board meeting about the India expansion?" — without query decomposition, this becomes a single vector search that misses most relevant context. With query understanding: intent classification identifies this as a multi-document analytical query. Query expansion adds synonyms and related terms. HyDE generates a hypothetical answer document and uses its embedding to retrieve semantically relevant chunks rather than searching for the literal question. Query decomposition breaks it into sub-queries: "Q3 board meeting documents" and "India expansion strategy discussion." Step-back reformulation abstracts to "strategic expansion discussions 2024" to catch related context. The result: 3–5× improvement in retrieval recall before a single vector search is executed. This is the layer that separates demo-quality RAG from production-grade RAG.

Intent classification · Query expansion · HyDE · Multi-hop decomposition · Step-back reformulation · Clarification routing

STAGES 4A · 4B · 4C — PARALLEL

Three Parallel Retrieval Tracks

All three tracks fire simultaneously, not sequentially. Run sequentially, total retrieval latency is the sum of the three tracks; run in parallel, it is bounded by the slowest one. The dense track (ANN vector search, ColBERT late interaction) handles semantic similarity: "what documents are conceptually related to this query?" The sparse track (BM25, SPLADE) handles exact-match precision: regulation codes, product identifiers, policy numbers, names. A query for "Section 14.2(b) of the Master Services Agreement" will typically rank poorly on dense retrieval and match precisely on sparse. The metadata filter track enforces temporal and access constraints: current-version documents only, documents owned by the querying department, documents within the user's sensitivity clearance. All three tracks produce candidate pools simultaneously; the fusion stage reconciles them.

Dense: ANN cosine · ColBERT · Sparse: BM25 · SPLADE · Metadata: date / version / dept / sensitivity · All three fire in parallel
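
The parallel fan-out can be sketched with asyncio. The three search functions below are stand-ins that sleep and return fixed chunk IDs; their names, delays, and results are assumptions, not calls into any real vector or search client.

```python
import asyncio


async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.02)  # stand-in for an ANN vector search call
    return ["c1", "c2", "c3"]


async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.005)  # stand-in for a BM25 lookup
    return ["c4", "c1"]


async def metadata_search(query: str) -> list[str]:
    await asyncio.sleep(0.005)  # stand-in for a filtered metadata query
    return ["c1", "c2", "c4"]


async def retrieve_all(query: str) -> dict[str, list[str]]:
    """Start all three tracks together and wait for every result."""
    dense, sparse, meta = await asyncio.gather(
        dense_search(query), sparse_search(query), metadata_search(query)
    )
    return {"dense": dense, "sparse": sparse, "metadata": meta}


pools = asyncio.run(retrieve_all("Section 14.2(b) of the Master Services Agreement"))
```

With gather, wall-clock time tracks the slowest coroutine (about 20 ms in this toy setup) instead of the sum of all three.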

STAGE 05

Hybrid Fusion

Reciprocal Rank Fusion (RRF) merges the three candidate pools into a single ranked list without requiring score calibration between tracks. Dense scores and sparse scores exist on different scales — RRF uses rank position rather than raw score, so a chunk ranked 3rd by dense and 8th by sparse fuses cleanly with a chunk ranked 1st by sparse and 18th by dense. Score normalisation is applied per-track before fusion for any weighted ensemble variants. Deduplication removes chunks that appeared in multiple tracks (same chunk retrieved by both dense and sparse — count it once, weight it higher). The output is a top-20 candidate pool entering the reranker.

RRF merge · Score normalise per track · Dedup across tracks · Top-20 pool → reranker
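
A minimal sketch of Reciprocal Rank Fusion as described above: each chunk's fused score is the sum over tracks of 1 / (k + rank). The constant k = 60 is the value commonly used in the RRF literature and is an assumption here, as is the function name rrf_fuse; the top-20 cut-off comes from the text.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_n: int = 20
) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs using rank positions only."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk found by several tracks accumulates into one entry,
            # so it is counted once but weighted higher.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


dense = ["a", "b", "c"]
sparse = ["c", "a", "d"]
fused = rrf_fuse([dense, sparse])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Because only ranks enter the formula, no calibration between dense and sparse score scales is needed; deduplication falls out of keying the score table by chunk ID.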

STAGE 06

Reranking Layer

Initial retrieval returns semantic similarity — not relevance. A cross-encoder reranking model scores each candidate chunk against the full original query for relevance, not vector proximity. The cross-encoder reads the query and chunk together, producing a precise relevance score that bi-encoder embedding models cannot produce. Cohere Rerank and similar managed rerankers offer API-based reranking without running the model locally. MMR (Maximal Marginal Relevance) diversity scoring eliminates redundant chunks — if three chunks are near-identical policy clauses, only the highest-scoring one enters the context window. The output is a top-5 precision context with a 0–1 confidence score per chunk. This confidence score feeds directly into the compliance gate that follows.

Cross-encoder · Cohere Rerank · Confidence 0–1 per chunk · MMR diversity · Top-5 precision context
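
The MMR diversity step could look like the sketch below. It assumes each candidate already carries a reranker relevance score and an embedding; the trade-off weight lam = 0.7 and the names cosine and mmr_select are illustrative choices, not values from the text.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    candidates: dict[str, tuple[float, list[float]]],
    top_k: int = 5,
    lam: float = 0.7,
) -> list[str]:
    """Greedy MMR. candidates maps chunk_id -> (relevance, embedding)."""
    selected: list[str] = []
    remaining = sorted(candidates)
    while remaining and len(selected) < top_k:

        def mmr_score(cid: str) -> float:
            relevance, emb = candidates[cid]
            # Redundancy is similarity to the closest already-selected chunk.
            redundancy = max(
                (cosine(emb, candidates[s][1]) for s in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = {
    "clause_v1": (0.95, [1.0, 0.0]),
    "clause_v1_dup": (0.94, [1.0, 0.0]),  # near-identical to clause_v1
    "other": (0.80, [0.0, 1.0]),
}
picked = mmr_select(candidates, top_k=2)
```

In this toy input the near-duplicate clause is penalised for its similarity to the first pick, so the less relevant but distinct chunk takes the second slot.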

COMPLIANCE ZONE C

Confidence Gate — Never Fabricate

This is the most important compliance control in the entire retrieval layer. If confidence scores are below threshold, the system must decline to answer — not fabricate a plausible-sounding response. The threshold is configurable per use case: a customer support bot might gate at 0.65, a medical protocol system at 0.85, a legal compliance system at 0.90. Below-threshold queries are not failures — they are the system working correctly. They route to a human review queue with the original query, the reason for declination, and the highest confidence score achieved. Every declination is logged immutably: query hash, timestamp, confidence scores, and routing decision. This log is SOC 1 evidence that the system refused to hallucinate. ISO 9001 §8.7 (nonconforming output control) requires exactly this: a documented process for handling outputs that fail quality standards, not silent pass-through.

ISO 9001 §8.7 · SOC 1 CC7.2 · Configurable threshold · Declination log · Human escalation queue + SLA · PASS → context · FAIL → decline + audit

Zone C controls

ISO 9001 §8.7 — nonconforming output control: documented process, not silent pass-through
SOC 1 CC7.2 — system monitoring: confidence distributions logged, drift alerted
Declination log is immutable — SOC 1 evidence the system refused to fabricate
Human escalation SLA — time-bound handoff, not an unbounded queue
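
A minimal sketch of the gate logic, using the three example thresholds from the text. Gating on the highest per-chunk score is an assumption (the text does not say how chunk scores are aggregated), and the names GateDecision and confidence_gate are hypothetical.

```python
from dataclasses import dataclass

# Example thresholds taken from the text; use-case keys are illustrative.
THRESHOLDS = {
    "customer_support": 0.65,
    "medical_protocol": 0.85,
    "legal_compliance": 0.90,
}


@dataclass(frozen=True)
class GateDecision:
    passed: bool
    top_score: float
    threshold: float
    reason: str


def confidence_gate(use_case: str, scores: list[float]) -> GateDecision:
    """Pass context to the LLM only if the best chunk clears the threshold."""
    threshold = THRESHOLDS[use_case]
    top = max(scores, default=0.0)  # no candidates counts as a decline
    if top >= threshold:
        return GateDecision(True, top, threshold, "pass")
    return GateDecision(False, top, threshold, "below_threshold")


decision = confidence_gate("medical_protocol", [0.80, 0.71])
```

A declined decision carries the top score and the reason, which are the fields the text says accompany the query into the human review queue and the declination log.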

STAGE 07

Context Assembly

Context assembly packs the top-K chunks into the LLM context window with three constraints: token budget (the total context must fit the model's context window), citation mapping (every chunk is tagged with its source document, version, and chunk ID so the LLM can cite it), and small-to-big retrieval (the reranker selected small precise chunks; this stage retrieves their parent sections for fuller context). The parent retrieval strategy is particularly important for legal and policy documents — a 200-token clause needs to be read alongside its surrounding section to be interpretable. The citation map is the mechanism that makes hallucination detection possible downstream: if the LLM makes a claim, it must be traceable to a specific chunk ID in this map.

Top-K select · Token budget aware · Parent retrieval (small-to-big) · Citation map (chunk ID → source doc)
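
The token budget and citation map can be sketched as below. The whitespace token counter is a crude stand-in for the model's tokenizer, parent-section retrieval is omitted, and RankedChunk and assemble_context are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankedChunk:
    chunk_id: str
    source_doc: str
    version: str
    text: str


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real system would use the model tokenizer."""
    return len(text.split())


def assemble_context(
    chunks: list[RankedChunk], token_budget: int
) -> tuple[str, dict[str, dict[str, str]]]:
    """Pack chunks in rank order within the budget and build a citation map.

    The budget counts chunk text only, not the bracketed ID tags.
    """
    parts: list[str] = []
    citations: dict[str, dict[str, str]] = {}
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk.text)
        if used + cost > token_budget:
            continue  # skip what does not fit; a later, smaller chunk may
        parts.append(f"[{chunk.chunk_id}] {chunk.text}")
        citations[chunk.chunk_id] = {
            "source_doc": chunk.source_doc,
            "version": chunk.version,
        }
        used += cost
    return "\n".join(parts), citations


ranked = [
    RankedChunk("k1", "policy.pdf", "v3", "refunds within thirty"),
    RankedChunk("k2", "policy.pdf", "v3", "a much longer clause here"),
    RankedChunk("k3", "faq.md", "v1", "see policy"),
]
context, citation_map = assemble_context(ranked, token_budget=5)
```

Every chunk ID that appears in the context also appears in the citation map, so a downstream check can reject any claim that cites an ID missing from it.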

COMPLIANCE ZONE D

Output Compliance Gate + Immutable Audit Trail

Before assembled context leaves the retrieval layer and enters the LLM, four output controls fire. Output PII filter removes any personally identifiable information from chunks before the LLM processes them — even if a user has clearance to retrieve a document, the LLM should not receive raw PII that it might reproduce in its output. RBAC re-verify confirms at retrieval time that the access permissions established at Zone A still apply — session tokens can expire, permissions can be revoked mid-session. Provenance lock tags every chunk with its exact source document ID, version number, and ingestion timestamp — this chain from ingestion lineage ID through to LLM context is the complete audit trace. The immutable WORM retrieval log records the full query trace: query hash, namespace accessed, chunk IDs retrieved, confidence scores, routing decision, and context sent to LLM. This is the SOC 1 Type 2 evidence package for the retrieval layer — 12 months of logs demonstrating that access controls operated continuously.

SOC 1 CC6 · ISO 9001 §7.5 · GDPR · Output PII filter · RBAC re-verify · Provenance lock · WORM audit log · Query trace · Miss-rate alert · Periodic audit report

Zone D controls

SOC 1 CC6 — logical access: RBAC re-verified at retrieval time, not only at query entry
ISO 9001 §7.5 — documented information: every chunk provenance tracked to ingestion lineage ID
GDPR Article 5(1)(f) — integrity and confidentiality: PII filtered before LLM processing
SOC 1 Type 2 — 12 months WORM logs as operating effectiveness evidence
ISO 9001 §9.2 — internal audit: periodic audit report auto-generated from retrieval logs
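
As a rough illustration of the output PII filter, the sketch below redacts two easily patterned identifiers before a chunk reaches the LLM. The patterns (email addresses and US-style SSNs) and the placeholder tokens are assumptions; a production filter would likely rely on a dedicated PII detection service and not on two regular expressions.

```python
import re

# Deliberately simple patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched identifiers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


clean = redact_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Running the filter after access checks but before context leaves the retrieval layer matches the ordering described above: clearance to read a document does not imply the LLM should see its raw identifiers.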

STAGE 09

Retrieval Observability

Retrieval observability is what separates a deployed system from a monitored one. Prometheus metrics collect query-to-response latency broken down by stage (ANN search time, reranker time, cache hit/miss), confidence score distributions per query category, cost per query (embedding inference, reranker API calls, cache tier cost), and retrieval miss-rate — the rate at which queries fail to retrieve any above-threshold candidates. Miss-rate is the silent failure signal: it tells you that your knowledge corpus has gaps the ingestion pipeline didn't fill. A sustained miss-rate above 5% on a query category signals a corpus gap, not a retrieval failure — and triggers the feedback loop back to ingestion.

Prometheus · Latency by stage · Confidence distributions · Cost per query · Miss-rate · Cache hit rate · Hallucination signal rate
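
The miss-rate signal can be computed per query category as in this sketch. The 5% alert level comes from the text; the event shape, the function names, and the omission of any time window (the text says "sustained") are simplifying assumptions.

```python
from collections import defaultdict

MISS_RATE_ALERT = 0.05  # alert level from the text


def miss_rates(
    events: list[tuple[str, float]], threshold: float
) -> dict[str, float]:
    """events holds (query_category, best_confidence) for each query."""
    totals: dict[str, int] = defaultdict(int)
    misses: dict[str, int] = defaultdict(int)
    for category, best_confidence in events:
        totals[category] += 1
        if best_confidence < threshold:
            misses[category] += 1  # no above-threshold candidate retrieved
    return {category: misses[category] / totals[category] for category in totals}


def categories_to_alert(rates: dict[str, float]) -> list[str]:
    """Categories whose miss-rate exceeds the alert level."""
    return sorted(c for c, rate in rates.items() if rate > MISS_RATE_ALERT)


events = [("hr", 0.90), ("hr", 0.40), ("finance", 0.80), ("finance", 0.95)]
rates = miss_rates(events, threshold=0.65)
alerts = categories_to_alert(rates)
```

In a Prometheus setup the same ratio would typically be derived from two counters per category (total queries and misses), with the alert rule evaluating it over a window.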

STAGE 10 ★

Evaluation + Feedback Loop

The feedback loop is what makes a retrieval system improve over time rather than decay. RAGAS (Retrieval Augmented Generation Assessment) runs offline against a golden dataset, measuring context precision (how much of the retrieved context is actually relevant to the question), context recall (how much of the relevant context was retrieved), and faithfulness (whether LLM claims are grounded in retrieved context). A/B testing compares retrieval strategies: testing whether BM25 hybrid outperforms pure dense on a specific query category, or whether a stricter reranker threshold improves precision at the cost of recall. User signal loop captures explicit feedback (thumbs up/down) and implicit signals (did the user rephrase and ask again?). Index re-tune triggers fire when RAGAS scores drop below baseline or miss-rate rises, initiating a re-indexing run, embedding model evaluation, or chunking strategy review in the ingestion pipeline. This is the feedback arrow that connects retrieval back to ingestion.

RAGAS · Context precision / recall / faithfulness · A/B strategy test · User signals · Index re-tune trigger · Corpus gap detection → ingestion feedback
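
To show what the precision and recall figures measure, here is a simplified set-overlap version computed against a golden dataset entry. This is not how RAGAS computes its metrics (RAGAS uses LLM-based judgments); the function name and the chunk-ID labelling of the golden set are assumptions for illustration.

```python
def context_precision_recall(
    retrieved: list[str], relevant: set[str]
) -> tuple[float, float]:
    """Set-overlap precision and recall over chunk IDs."""
    retrieved_set = set(retrieved)  # ignore duplicate retrievals
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# One golden-dataset entry: chunk IDs a human marked as relevant.
golden_relevant = {"c1", "c3", "c9"}
retrieved_ids = ["c1", "c2", "c3", "c4"]
precision, recall = context_precision_recall(retrieved_ids, golden_relevant)
```

Tracking both numbers per query category against a stored baseline gives the re-tune trigger something concrete to compare.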

Retrieval Track Detail

Three Tracks. One Reason Each Exists.

Dense, sparse, and metadata retrieval are not redundant — they answer fundamentally different questions. Removing any one of them creates systematic blind spots that no amount of tuning the other two will fix.

Dense retrieval

Semantic similarity — what is this query conceptually about?

Dense retrieval uses embedding vectors to find semantically related content regardless of exact wording. A query about "employee termination procedures" retrieves documents about "staff offboarding" and "employment separation" even if those exact words don't appear in the query. The ColBERT variant performs late interaction — comparing query token embeddings against document token embeddings rather than single dense vectors, producing superior precision on long documents with multiple distinct topics.

Best for: conceptual questions, paraphrased queries, cross-domain terminology
Fails on: exact regulation codes, proper nouns, numerical identifiers
Latency: ~5–20ms per query on GPU-accelerated HNSW index
Model matters: generic embeddings can underperform domain fine-tuned models by 23% or more

Sparse retrieval

Exact-match precision — find this specific term exactly as written.

Sparse retrieval is non-negotiable in enterprise RAG. When a compliance officer queries "ISO 27001:2022 Annex A.8.9" or an insurance adjuster queries "Policy Form CG 00 01 04 13," the query is a precise identifier. Dense retrieval will return semantically similar but inexact results. BM25 scores only documents that contain the query's terms, so it surfaces the exact document when one exists. SPLADE (Sparse Lexical and Expansion Model) goes beyond BM25 with learned sparse representations: it can expand "HIPAA" to related terms like "Protected Health Information" and "covered entity" automatically, combining the precision of term matching with some semantic breadth.

Best for: regulation codes, policy numbers, product identifiers, names
Fails on: paraphrased queries, conceptual questions without keywords
Latency: ~1–5ms per query on inverted index
SPLADE adds semantic expansion while preserving exact-match precision

Metadata filter retrieval

Constraint enforcement — only search where you're allowed to search.

Metadata filtering is not a retrieval strategy — it is an access control mechanism that shapes what dense and sparse retrieval can see. Date and version filters ensure queries retrieve current policy documents, not superseded versions. Department and namespace filters enforce that a finance team query cannot accidentally retrieve HR documents even if they are semantically similar. Sensitivity classification filters enforce that a user without "restricted" clearance never receives restricted chunks in their candidate pool. These filters are applied at the vector DB query layer — not as post-processing — so restricted content is never even scored, let alone returned.

Best for: access control, version scoping, temporal constraints
Applied at DB query layer — restricted content never scored
Propagated from ingestion sensitivity labels — no re-classification at query time
Minimal latency overhead: filters are applied as payload filters during HNSW search, not as post-processing

Operational Resilience

Every Retrieval Failure Mode. Every Recovery Path.

Retrieval failures are more dangerous than ingestion failures because they are invisible — the system returns an answer confidently, but the answer is wrong. These are the failure modes we design explicit recovery paths for.

Silent retrieval miss

Relevant documents exist in the corpus but are not retrieved. The system returns an answer based on partially relevant context, or confidently declines when it should have answered. Invisible without instrumentation — teams discover it through user complaints, not monitoring.

Cross-namespace leakage

A query returns chunks from a namespace the user is not authorised to access — HR documents appearing in a finance query, restricted-classified content returned to a standard user. Compliance failure regardless of whether the user noticed.

Stale version retrieval

A superseded policy document is retrieved instead of the current version. The LLM answers based on a policy that was changed six months ago. In regulated industries, acting on stale policy guidance is a compliance violation regardless of how the information was accessed.

Below-threshold fabrication

The system returns an answer when confidence is below threshold rather than declining. In a medical protocol system, this means a clinician receives guidance that is not grounded in retrieved evidence. This is the most dangerous failure mode, and it occurs when confidence gating is not implemented.

Semantic cache poisoning

A cached result from a previous query is returned for a semantically similar but contextually different query. "What is our refund policy for EU customers?" returns a cached result for "What is our refund policy?" — missing the GDPR-specific provisions that apply to the EU context.

Confidence drift without alerting

Average retrieval confidence degrades over weeks as the knowledge corpus becomes stale relative to query patterns — new regulations, product changes, updated procedures. Without confidence distribution monitoring, this drift is invisible until user trust collapses.

Recovery path for each failure mode

Miss-rate monitoring + corpus gap detection

Zone D miss-rate alert fires when queries in a category consistently fail to retrieve above-threshold candidates. Alert triggers corpus gap detection in ingestion — the missing knowledge must be added, not the retrieval tuned.

Structural namespace isolation at Zone B

Cross-namespace isolation is enforced at the vector DB filter layer before scoring — restricted content is never scored, never ranked, never returned. No application logic can be bypassed because the filter is structural.

Version metadata filter at retrieval time

Metadata filter track enforces current-version-only retrieval by default. Historical version retrieval requires explicit scope parameter. Version filter applied at HNSW payload layer — not post-processing — so stale chunks are never scored.

Zone C confidence gate — mandatory declination

Below-threshold queries route to human escalation queue with documented SLA. Declination is logged immutably. The system cannot answer below threshold — the gate is structural. Declination log is SOC 1 evidence the system refused to fabricate.

Cache invalidation on document update

When a document is updated in the ingestion pipeline, all semantic cache entries containing chunks from that document are invalidated. Cache keys include document version — a version increment forces cache miss regardless of query similarity.

Confidence distribution alerting + A/B test

Zone D confidence distribution log feeds a drift detection alert — if average confidence drops more than 10% from baseline over a rolling 7-day window, an alert fires. A/B testing evaluates whether re-indexing or embedding model update restores confidence.
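
The drift rule above can be sketched as a rolling-window check. The 7-day window and 10% drop come from the text; reading "10%" as a relative drop from baseline, using daily averages as input, and the function name are assumptions.

```python
from statistics import mean


def confidence_drift_alert(
    daily_means: list[float],
    baseline: float,
    window: int = 7,
    max_drop: float = 0.10,
) -> bool:
    """True if the mean of the last `window` daily averages has fallen
    more than `max_drop` (relative) below the baseline."""
    if len(daily_means) < window or baseline <= 0:
        return False  # not enough data, or no usable baseline
    recent = mean(daily_means[-window:])
    return (baseline - recent) / baseline > max_drop


drifting = confidence_drift_alert([0.70] * 7, baseline=0.80)  # 12.5% drop
stable = confidence_drift_alert([0.75] * 7, baseline=0.80)    # 6.25% drop
```

Alerting on the windowed mean and not on single days keeps one bad afternoon from paging anyone while still catching a slow decline.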

