Retrieval & Intelligence Layer

Finding the Right Knowledge
Intelligently. Compliantly.

The retrieval layer is where ingested knowledge becomes answerable intelligence. Three parallel retrieval tracks, hybrid fusion, cross-encoder reranking, confidence-gated context assembly — with four compliance zones enforcing access, auditability, and data sovereignty at every step.

4 Compliance Zones
Pre-query through audit
3 Parallel Tracks
Dense · Sparse · Metadata
Confidence Gating
Never fabricate, always escalate
10 Stages
Hardware to inference handoff
Complete Architecture

The Retrieval Pipeline — Every Layer.

Most RAG systems stop at vector similarity search. This architecture adds query understanding, parallel hybrid retrieval, cross-encoder reranking, confidence-gated context assembly, and four compliance zones — because retrieval quality and retrieval accountability are both non-negotiable in enterprise environments.

[Diagram: Retrieval and Intelligence Layer, complete with hardware and four compliance zones. Full retrieval pipeline from hardware layer through vector DB, parallel retrieval tracks, hybrid fusion, reranking, context assembly, and governance. Four compliance zones: pre-query gate, namespace enforcement, confidence gate, and output audit trail.]

Stages shown, in order from the ingestion pipeline to the inference layer (LLM orchestration):
  • 0 · Hardware layer: retrieval-optimised infrastructure
  • A · Pre-query compliance gate (SOC 1 CC6.1 · CC6.6 · ISO 9001 §7.4)
  • 1 · Vector database architecture: primary vector store · sparse index · graph store · semantic cache
  • 2 · Indexing strategy: HNSW · IVF flat · PQ compression · index versioning · tiered cost
  • B · Namespace isolation + access boundary (SOC 1 CC6.1 · ISO 9001 §8.4)
  • 3 · Query understanding ★: intent · expansion · HyDE · decomposition · multi-hop
  • 4a · Dense (semantic) retrieval · 4b · Sparse (keyword) retrieval · 4c · Metadata filter retrieval (all three fire simultaneously)
  • 5 · Hybrid fusion: RRF scoring · score normalise · dedup · top-20 pool into reranker
  • 6 · Reranking: cross-encoder · Cohere Rerank · MMR diversity · confidence 0–1
  • C · Confidence gate (ISO 9001 §8.7 nonconforming output · SOC 1 CC7.2): PASS → context assembly · FAIL → decline + audit log
  • 7 · Context assembly: top-K · window pack · parent retrieval · citation map
  • 8 · Retrieval optimisation: semantic cache · <200ms p95 · fallback strategy · contextual compression
  • D · Output compliance gate + immutable audit trail (SOC 1 CC6 · ISO 9001 §7.5 · GDPR)
  • 9 · Observability: Prometheus · latency by stage · cost per query · hallucination signal · drift
  • 10 · Evaluation + feedback loop ★: RAGAS · A/B test · user signals → index re-tune
Stage Detail

What Each Stage Does — And Why It Exists

The retrieval pipeline is a decision funnel, not a linear flow. A query fans out to three parallel retrieval tracks, converges at hybrid fusion, narrows through reranking, and routes through a confidence gate before context reaches the LLM. Every compliance control is structural — enforced at the infrastructure layer, not application logic.

STAGE 00
Hardware Layer
Retrieval hardware is fundamentally different from ingestion hardware. Ingestion is optimised for throughput — processing millions of documents over hours. Retrieval is optimised for latency — answering queries in under 200ms. This means dedicated GPU or high-CPU nodes for ANN search, separate Redis cluster nodes for semantic caching, and a distinct inference node for the cross-encoder reranker. The KMS decrypt at query time is the piece most architectures miss: if vector payloads are stored encrypted (required for regulated industries), they must be decrypted in memory at retrieval time, which requires a KMS call on every query. That call adds ~5–15ms and must be budgeted into the p95 latency SLA. Data residency zone assignment also happens here — a query containing patient data may need to be routed to an EU-region endpoint before touching any vector search infrastructure.
GPU / CPU ANN workers · Qdrant cluster · Redis semantic cache · VPC private endpoints · KMS decrypt · TLS 1.2+ all hops · HIPAA / GDPR region lock
COMPLIANCE ZONE A
Pre-Query Compliance Gate
Before a query touches any retrieval infrastructure, eight controls must pass in sequence. User identity verification via SSO/session token. RBAC pre-check confirming the user role is permitted to query this namespace at all. Query event log — writing user identity, timestamp, source IP, and SHA-256 hash of the query to an immutable log before retrieval begins, not after. Rate limiting as an availability control. PII pre-screen flagging any query that contains personal data before it enters retrieval. Toxicity filter blocking adversarial prompts designed to extract unauthorised information. Data residency routing directing the query to the correct regional endpoint. The query SHA-256 fingerprint creates a tamper-evident chain from query to retrieved context to LLM response — the SOC 1 auditor traces this chain to verify that what was logged is what was answered.
SOC 1 CC6.1 · CC6.6 · ISO 9001 §7.4 · SSO / MFA · RBAC · Query event log · Rate limit · PII screen · Toxicity filter · Data residency · SHA-256
Zone A controls
  • SOC 1 CC6.1 — logical access enforced before query reaches vector DB
  • SOC 1 CC6.6 — security event logging: failed auth attempts, rate limit violations
  • ISO 9001 §7.4 — controlled input: query validated before entering retrieval system
  • GDPR Article 25 — data residency routing: privacy by design at infrastructure level
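To make the query fingerprint and pre-retrieval event log concrete, here is a minimal Python sketch. The field names, the whitespace normalisation, and the idea of chaining each record to the previous record's hash are assumptions made for illustration; the text above only specifies the logged fields and a SHA-256 query hash.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # hypothetical starting value for the chain


def query_fingerprint(query: str) -> str:
    """SHA-256 hex digest of the whitespace-normalised query text."""
    normalised = " ".join(query.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def make_event(user_id: str, source_ip: str, query: str, prev_hash: str) -> dict:
    """Build the log record that is written before retrieval begins."""
    record = {
        "user": user_id,
        "time": datetime.now(timezone.utc).isoformat(),
        "ip": source_ip,
        "query_hash": query_fingerprint(query),
        "prev_hash": prev_hash,  # assumption: link to the previous record
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_hash"] = hashlib.sha256(payload).hexdigest()
    return record


def verify_chain(records: list, genesis: str = GENESIS_HASH) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = genesis
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "record_hash"}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        if rec["prev_hash"] != prev:
            return False
        if hashlib.sha256(payload).hexdigest() != rec["record_hash"]:
            return False
        prev = rec["record_hash"]
    return True
```

In a real deployment the records would go to write-once storage; the chain only makes tampering detectable, it does not prevent it.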
STAGE 01
Vector Database Architecture
The vector database is not a single system — it is an architecture of four complementary stores. The primary vector store (Qdrant, Weaviate, Pinecone) handles dense embedding search with native namespace isolation and metadata payload filtering. The sparse index (Elasticsearch, OpenSearch) handles exact-term BM25 retrieval — regulation codes, product identifiers, contract numbers that dense retrieval systematically misses. The graph store (Neo4j) handles entity relationships that neither dense nor sparse retrieval can traverse — "show me all policy documents referenced by this contract, and all clauses in those documents that conflict with our internal guidelines." The semantic cache (Redis/GPTCache) stores recent query-result pairs and returns cached results on near-duplicate queries, eliminating retrieval latency entirely for repeated questions — typically hitting on 20–40% of production queries in enterprise deployments.
Qdrant · Weaviate · Pinecone · Elasticsearch · Neo4j · Redis · GPTCache · HNSW · IVF · PQ compression · index versioning
COMPLIANCE ZONE B
Namespace Isolation + Access Boundary Enforcement
Namespace isolation is the most important structural control in the retrieval layer. It is placed between indexing strategy and query understanding deliberately — access boundaries must be enforced before the query is even formulated for retrieval. The isolation is structural: an HR query physically cannot return financial documents because the query is scoped to the HR namespace at the vector DB filter layer, not through application logic that could be bypassed. Sensitivity filters enforce document classification — a user without "restricted" clearance never receives restricted-classified chunks regardless of query phrasing. The per-query access log records which namespaces were queried, by whom, at what time — this is the retrieval-layer equivalent of Zone D in ingestion, and it feeds the SOC 1 Type 2 evidence package.
SOC 1 CC6.1 · ISO 9001 §8.4 · Tenant namespaces · Cross-namespace blocked structurally · Sensitivity filter · Per-query access log
Zone B controls
  • SOC 1 CC6.1 — RBAC enforced at vector DB filter layer, not application logic
  • ISO 9001 §8.4 — external provider control: approved namespaces only
  • Cross-namespace isolation is structural — no query can span namespaces without explicit permission
  • Sensitivity classification propagated from ingestion is enforced at retrieval time
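One way to picture the structural scoping is as a filter that is attached to every search request before it is dispatched. The sketch below builds a filter dictionary shaped like Qdrant's JSON filter conditions (`must`, `key`, `match`); the payload field names `namespace` and `sensitivity` and the three-level ordering are assumptions, and the exact syntax should be checked against the vector store actually in use.

```python
# Assumed ordering of sensitivity labels, lowest clearance first.
SENSITIVITY_ORDER = ["public", "internal", "restricted"]


def build_scope_filter(allowed_namespaces: set, clearance: str) -> dict:
    """Return a filter that limits a search to permitted namespaces and labels."""
    if not allowed_namespaces:
        # No grant means no query: fail closed rather than search everything.
        raise PermissionError("no namespace granted; query must not be dispatched")
    max_level = SENSITIVITY_ORDER.index(clearance)  # ValueError on unknown label
    visible = SENSITIVITY_ORDER[: max_level + 1]
    return {
        "must": [
            {"key": "namespace", "match": {"any": sorted(allowed_namespaces)}},
            {"key": "sensitivity", "match": {"any": visible}},
        ]
    }
```

Because the filter is part of the request itself, chunks outside the scope are excluded before scoring rather than removed afterwards.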
STAGE 03 ★
Query Understanding Layer
Query understanding is where most RAG implementations fail silently. A user asks "what was discussed in the Q3 board meeting about the India expansion?" — without query decomposition, this becomes a single vector search that misses most relevant context. With query understanding: intent classification identifies this as a multi-document analytical query. Query expansion adds synonyms and related terms. HyDE generates a hypothetical answer document and uses its embedding to retrieve semantically relevant chunks rather than searching for the literal question. Query decomposition breaks it into sub-queries: "Q3 board meeting documents" and "India expansion strategy discussion." Step-back reformulation abstracts to "strategic expansion discussions 2024" to catch related context. The result: 3–5× improvement in retrieval recall before a single vector search is executed. This is the layer that separates demo-quality RAG from production-grade RAG.
Intent classification · Query expansion · HyDE · Multi-hop decomposition · Step-back reformulation · Clarification routing
STAGES 4A · 4B · 4C — PARALLEL
Three Parallel Retrieval Tracks
All three tracks fire simultaneously, not sequentially. Sequential retrieval wastes latency proportional to the number of tracks. The dense track (ANN vector search, ColBERT late interaction) handles semantic similarity: "what documents are conceptually related to this query?" The sparse track (BM25, SPLADE) handles exact-match precision: regulation codes, product identifiers, policy numbers, names. A query for "Section 14.2(b) of the Master Services Agreement" will typically rank poorly on dense retrieval, because the identifier carries little semantic signal, and match exactly on sparse. The metadata filter track enforces temporal and access constraints: current-version documents only, documents owned by the querying department, documents within the user's sensitivity clearance. All three tracks produce candidate pools simultaneously; the fusion stage reconciles them.
Dense: ANN cosine · ColBERT · Sparse: BM25 · SPLADE · Metadata: date / version / dept / sensitivity · All three fire in parallel
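The parallel dispatch can be sketched with `asyncio.gather`. The three track functions below are hypothetical stubs that sleep for about 10ms instead of calling a real vector store, keyword index, or metadata filter; only the dispatch pattern is the point.

```python
import asyncio

TRACK_DELAY_S = 0.01  # stand-in for roughly 10 ms of work per track


async def dense_track(query: str) -> list[str]:
    await asyncio.sleep(TRACK_DELAY_S)  # placeholder for an ANN search call
    return ["chunk-12", "chunk-7", "chunk-3"]


async def sparse_track(query: str) -> list[str]:
    await asyncio.sleep(TRACK_DELAY_S)  # placeholder for a BM25 query
    return ["chunk-7", "chunk-44"]


async def metadata_track(query: str) -> list[str]:
    await asyncio.sleep(TRACK_DELAY_S)  # placeholder for a filtered lookup
    return ["chunk-7", "chunk-12"]


async def dispatch_parallel(query: str) -> dict:
    """Start all three tracks at once and wait for the slowest one."""
    dense, sparse, metadata = await asyncio.gather(
        dense_track(query),
        sparse_track(query),
        metadata_track(query),
    )
    return {"dense": dense, "sparse": sparse, "metadata": metadata}


if __name__ == "__main__":
    pools = asyncio.run(dispatch_parallel("Section 14.2(b) of the MSA"))
    print(pools)
```

With awaited calls made one after another the wait would be the sum of the three delays; with `gather` it is approximately the longest single delay.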
STAGE 05
Hybrid Fusion
Reciprocal Rank Fusion (RRF) merges the three candidate pools into a single ranked list without requiring score calibration between tracks. Dense scores and sparse scores exist on different scales — RRF uses rank position rather than raw score, so a chunk ranked 3rd by dense and 8th by sparse fuses cleanly with a chunk ranked 1st by sparse and 18th by dense. Score normalisation is applied per-track before fusion for any weighted ensemble variants. Deduplication removes chunks that appeared in multiple tracks (same chunk retrieved by both dense and sparse — count it once, weight it higher). The output is a top-20 candidate pool entering the reranker.
RRF merge · Score normalise per track · Dedup across tracks · Top-20 pool → reranker
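As a worked illustration of the rank-based merge, the sketch below scores each chunk as the sum of 1 / (k + rank) over the tracks that returned it. The constant k = 60 is a commonly used default and is an assumption here, as are the function and variable names.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list, k: int = 60, top_n: int = 20) -> list:
    """Merge ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Only rank positions are used, so raw scores from different tracks
    never need to be put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk found by several tracks accumulates one term per track,
            # which deduplicates it and lifts it in the fused order.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


dense = ["a", "b", "c"]
sparse = ["c", "a", "d"]
print(rrf_fuse([dense, sparse]))  # ['a', 'c', 'b', 'd']
```

Chunks "a" and "c" appear in both lists, so each outranks "b" and "d", which were each returned by only one track.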
STAGE 06
Reranking Layer
Initial retrieval returns semantic similarity — not relevance. A cross-encoder reranking model scores each candidate chunk against the full original query for relevance, not vector proximity. The cross-encoder reads the query and chunk together, producing a precise relevance score that bi-encoder embedding models cannot produce. Cohere Rerank and similar managed rerankers offer API-based reranking without running the model locally. MMR (Maximal Marginal Relevance) diversity scoring eliminates redundant chunks — if three chunks are near-identical policy clauses, only the highest-scoring one enters the context window. The output is a top-5 precision context with a 0–1 confidence score per chunk. This confidence score feeds directly into the compliance gate that follows.
Cross-encoder · Cohere Rerank · Confidence 0–1 per chunk · MMR diversity · Top-5 precision context
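The diversity step can be sketched as a greedy MMR selection over the reranker's output. The sketch assumes each candidate already has a relevance score in the 0–1 range and an embedding vector; the trade-off weight `lam` and all names are illustrative.

```python
import math


def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(relevance: dict, embeddings: dict, k: int = 5, lam: float = 0.7) -> list:
    """Greedy Maximal Marginal Relevance over chunk IDs.

    Each step picks the chunk maximising
    lam * relevance - (1 - lam) * (max similarity to chunks already picked).
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def mmr_score(cid: str) -> float:
            redundancy = max(
                (cosine(embeddings[cid], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[cid] - (1 - lam) * redundancy

        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to plain relevance order; lowering it penalises near-duplicates more strongly.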
COMPLIANCE ZONE C
Confidence Gate — Never Fabricate
This is the most important compliance control in the entire retrieval layer. If confidence scores are below threshold, the system must decline to answer — not fabricate a plausible-sounding response. The threshold is configurable per use case: a customer support bot might gate at 0.65, a medical protocol system at 0.85, a legal compliance system at 0.90. Below-threshold queries are not failures — they are the system working correctly. They route to a human review queue with the original query, the reason for declination, and the highest confidence score achieved. Every declination is logged immutably: query hash, timestamp, confidence scores, and routing decision. This log is SOC 1 evidence that the system refused to hallucinate. ISO 9001 §8.7 (nonconforming output control) requires exactly this: a documented process for handling outputs that fail quality standards, not silent pass-through.
ISO 9001 §8.7 · SOC 1 CC7.2 · Configurable threshold · Declination log · Human escalation queue + SLA · PASS → context · FAIL → decline + audit
Zone C controls
  • ISO 9001 §8.7 — nonconforming output control: documented process, not silent pass-through
  • SOC 1 CC7.2 — system monitoring: confidence distributions logged, drift alerted
  • Declination log is immutable — SOC 1 evidence the system refused to fabricate
  • Human escalation SLA — time-bound handoff, not an unbounded queue
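A minimal sketch of the gate decision follows. The three thresholds are the examples given in the text above; gating on the highest candidate score is an assumption (the text does not say how per-chunk scores are aggregated), and the names are hypothetical.

```python
from dataclasses import dataclass

# Example thresholds from the text; real values would be calibrated per use case.
THRESHOLDS = {
    "customer_support": 0.65,
    "medical_protocol": 0.85,
    "legal_compliance": 0.90,
}


@dataclass(frozen=True)
class GateDecision:
    passed: bool
    top_score: float
    threshold: float
    reason: str


def confidence_gate(scores: list, use_case: str) -> GateDecision:
    """Pass to context assembly or decline; never answer below threshold."""
    threshold = THRESHOLDS[use_case]
    top = max(scores, default=0.0)  # assumption: gate on the best candidate
    if top >= threshold:
        return GateDecision(True, top, threshold, "pass")
    reason = "below_threshold" if scores else "no_candidates"
    return GateDecision(False, top, threshold, reason)
```

A failed decision carries everything the declination log needs apart from the query hash and timestamp: the top score, the threshold applied, and the reason.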
STAGE 07
Context Assembly
Context assembly packs the top-K chunks into the LLM context window with three constraints: token budget (the total context must fit the model's context window), citation mapping (every chunk is tagged with its source document, version, and chunk ID so the LLM can cite it), and small-to-big retrieval (the reranker selected small precise chunks; this stage retrieves their parent sections for fuller context). The parent retrieval strategy is particularly important for legal and policy documents — a 200-token clause needs to be read alongside its surrounding section to be interpretable. The citation map is the mechanism that makes hallucination detection possible downstream: if the LLM makes a claim, it must be traceable to a specific chunk ID in this map.
Top-K select · Token budget aware · Parent retrieval (small-to-big) · Citation map (chunk ID → source doc)
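The three constraints can be sketched together in one packing loop. This is an illustration under simplifying assumptions: tokens are counted by whitespace splitting rather than with the model's tokenizer, chunks arrive as dictionaries with hypothetical field names, and a chunk whose parent section does not fit falls back to the chunk text alone.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer."""
    return len(text.split())


def assemble_context(ranked_chunks: list, parent_sections: dict, token_budget: int):
    """Pack chunks in reranker order, preferring parent sections (small-to-big)."""
    parts, citation_map, used = [], {}, 0
    included_parents = set()
    for chunk in ranked_chunks:
        cid, pid = chunk["chunk_id"], chunk["parent_id"]
        citation = {
            "source_doc": chunk["source_doc"],
            "version": chunk["version"],
            "parent_id": pid,
        }
        if pid in included_parents:
            # Parent text is already in the context; only record the citation.
            citation_map[cid] = citation
            continue
        options = ((parent_sections.get(pid), True), (chunk["text"], False))
        for text, is_parent in options:
            if text is None:
                continue
            cost = count_tokens(text)
            if used + cost <= token_budget:
                parts.append(f"[{cid}] {text}")
                citation_map[cid] = citation
                used += cost
                if is_parent:
                    included_parents.add(pid)
                break
    return "\n\n".join(parts), citation_map, used
```

Each packed passage is prefixed with its chunk ID so that a claim in the model's answer can be traced back through the citation map.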
COMPLIANCE ZONE D
Output Compliance Gate + Immutable Audit Trail
Before assembled context leaves the retrieval layer and enters the LLM, four output controls fire. Output PII filter removes any personally identifiable information from chunks before the LLM processes them — even if a user has clearance to retrieve a document, the LLM should not receive raw PII that it might reproduce in its output. RBAC re-verify confirms at retrieval time that the access permissions established at Zone A still apply — session tokens can expire, permissions can be revoked mid-session. Provenance lock tags every chunk with its exact source document ID, version number, and ingestion timestamp — this chain from ingestion lineage ID through to LLM context is the complete audit trace. The immutable WORM retrieval log records the full query trace: query hash, namespace accessed, chunk IDs retrieved, confidence scores, routing decision, and context sent to LLM. This is the SOC 1 Type 2 evidence package for the retrieval layer — 12 months of logs demonstrating that access controls operated continuously.
SOC 1 CC6 · ISO 9001 §7.5 · GDPR · Output PII filter · RBAC re-verify · Provenance lock · WORM audit log · Query trace · Miss-rate alert · Periodic audit report
Zone D controls
  • SOC 1 CC6 — logical access: RBAC re-verified at retrieval time, not only at query entry
  • ISO 9001 §7.5 — documented information: every chunk provenance tracked to ingestion lineage ID
  • GDPR Article 5(1)(f) — integrity and confidentiality: PII filtered before LLM processing
  • SOC 1 Type 2 — 12 months WORM logs as operating effectiveness evidence
  • ISO 9001 §9.2 — internal audit: periodic audit report auto-generated from retrieval logs
STAGE 09
Retrieval Observability
Retrieval observability is what separates a deployed system from a monitored one. Prometheus metrics collect query-to-response latency broken down by stage (ANN search time, reranker time, cache hit/miss), confidence score distributions per query category, cost per query (embedding inference, reranker API calls, cache tier cost), and retrieval miss-rate — the rate at which queries fail to retrieve any above-threshold candidates. Miss-rate is the silent failure signal: it tells you that your knowledge corpus has gaps the ingestion pipeline didn't fill. A sustained miss-rate above 5% on a query category signals a corpus gap, not a retrieval failure — and triggers the feedback loop back to ingestion.
Prometheus · Latency by stage · Confidence distributions · Cost per query · Miss-rate · Cache hit rate · Hallucination signal rate
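Miss-rate per query category can be computed directly from the retrieval log. The sketch below assumes each log entry carries a category label and the candidate confidence scores; the 5% limit is the figure from the text, and checking that the rate is sustained over time is left out for brevity.

```python
from collections import defaultdict


def miss_rates(query_log: list, threshold: float) -> dict:
    """Share of queries per category with no candidate at or above threshold."""
    totals, misses = defaultdict(int), defaultdict(int)
    for entry in query_log:
        category = entry["category"]
        totals[category] += 1
        if max(entry["scores"], default=0.0) < threshold:
            misses[category] += 1
    return {cat: misses[cat] / totals[cat] for cat in totals}


def corpus_gap_alerts(rates: dict, limit: float = 0.05) -> list:
    """Categories whose miss-rate exceeds the limit, in alphabetical order."""
    return sorted(cat for cat, rate in rates.items() if rate > limit)
```

An alert from this check points at the corpus rather than at the retriever, so its natural destination is the ingestion backlog.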
STAGE 10 ★
Evaluation + Feedback Loop
The feedback loop is what makes a retrieval system improve over time rather than decay. RAGAS (Retrieval Augmented Generation Assessment) runs offline against a golden dataset, measuring context precision (how much of the retrieved context is actually relevant to the question), context recall (how much of the relevant context was retrieved rather than missed), and faithfulness (whether LLM claims are grounded in retrieved context). A/B testing compares retrieval strategies: testing whether BM25 hybrid outperforms pure dense on a specific query category, or whether a stricter reranker threshold improves precision at the cost of recall. The user signal loop captures explicit feedback (thumbs up/down) and implicit signals (did the user rephrase and ask again?). Index re-tune triggers fire when RAGAS scores drop below baseline or miss-rate rises, initiating a re-indexing run, embedding model evaluation, or chunking strategy review in the ingestion pipeline. This is the feedback arrow that connects retrieval back to ingestion.
RAGAS · Context precision / recall / faithfulness · A/B strategy test · User signals · Index re-tune trigger · Corpus gap detection → ingestion feedback
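For intuition about what the two context metrics capture, here is a simplified set-based version computed against a golden dataset of known-relevant chunk IDs. This is not the RAGAS implementation, which relies on model-based judgments; the function name and inputs are assumptions.

```python
def context_precision_recall(retrieved_ids: list, relevant_ids: list) -> tuple:
    """Set-based precision and recall of retrieved chunks against gold labels."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    # Precision: share of what was retrieved that is actually relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of what is relevant that was actually retrieved.
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


precision, recall = context_precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(precision, round(recall, 3))  # 0.5 0.667
```

Tracking both numbers per query category across A/B arms shows whether a strategy change traded recall for precision or improved both.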
Retrieval Track Detail

Three Tracks. One Reason Each Exists.

Dense, sparse, and metadata retrieval are not redundant — they answer fundamentally different questions. Removing any one of them creates systematic blind spots that no amount of tuning the other two will fix.

Dense retrieval
Semantic similarity — what is this query conceptually about?
Dense retrieval uses embedding vectors to find semantically related content regardless of exact wording. A query about "employee termination procedures" retrieves documents about "staff offboarding" and "employment separation" even though the query never uses those words. The ColBERT variant performs late interaction: it compares query token embeddings against document token embeddings rather than comparing single dense vectors, which improves precision on long documents with multiple distinct topics.
  • Best for: conceptual questions, paraphrased queries, cross-domain terminology
  • Fails on: exact regulation codes, proper nouns, numerical identifiers
  • Latency: ~5–20ms per query on GPU-accelerated HNSW index
  • Model matters: generic embeddings underperform 23%+ vs domain fine-tuned
Sparse retrieval
Exact-match precision — find this specific term exactly as written.
Sparse retrieval is non-negotiable in enterprise RAG. When a compliance officer queries "ISO 27001:2022 Annex A.8.9" or an insurance adjuster queries "Policy Form CG 00 01 04 13," the query is a precise identifier. Dense retrieval will return semantically similar but inexact results. BM25 returns the exact document or nothing. SPLADE (Sparse Lexical and Expansion model) complements BM25 with learned sparse representations: it can expand "HIPAA" to related terms like "Protected Health Information" and "covered entity" automatically, combining sparse exact-term matching with some semantic breadth.
  • Best for: regulation codes, policy numbers, product identifiers, names
  • Fails on: paraphrased queries, conceptual questions without keywords
  • Latency: ~1–5ms per query on inverted index
  • SPLADE adds semantic expansion while preserving exact-match precision
Compliance Architecture

Four Zones. Every Control Mapped.

The retrieval layer has more compliance surface area than any other layer because it is where access decisions are enforced in real time. Every query is a potential access violation, a potential data leakage event, and a potential hallucination. The four zones address each risk category at the structural level.

Zone A · Pre-query gate
SOC 1 CC6.1 · CC6.6 · ISO 9001 §7.4 · GDPR Art. 25
  • User identity verified via SSO before query enters pipeline
  • RBAC check confirms role is permitted to query this namespace
  • Query event logged immutably before retrieval begins
  • Rate limiting enforced as SOC 1 availability control
  • PII pre-screen flags sensitive query content before retrieval
  • Toxicity filter blocks prompt injection and adversarial queries
  • Data residency routing directs query to compliant regional endpoint
  • Query SHA-256 fingerprint creates tamper-evident audit chain
Zone B · Namespace isolation
SOC 1 CC6.1 · ISO 9001 §8.4
  • Tenant namespaces enforced structurally — not application logic
  • Cross-namespace queries blocked by design at vector DB layer
  • Sensitivity filter propagated from ingestion — no re-classification needed
  • Per-query access record written to WORM log
  • Isolation verified at retrieval time independently of Zone A check
  • Namespace scope logged for every retrieval operation
Zone C · Confidence gate
ISO 9001 §8.7 · SOC 1 CC7.2
  • Confidence threshold configurable per use case and risk level
  • Below-threshold queries route to human review — never fabricate
  • Declination log written immutably: query hash, scores, reason
  • Human escalation queue with defined SLA per query category
  • Confidence distribution logged per query type for drift monitoring
  • Threshold calibrated against golden dataset, not set arbitrarily
Zone D · Output gate + audit trail
SOC 1 CC6 · ISO 9001 §7.5 · GDPR Art. 5(1)(f)
  • Output PII filter removes personal data before LLM processing
  • RBAC re-verified at retrieval time — session tokens may have expired
  • Provenance lock: every chunk tagged with source doc + version + ingestion ID
  • WORM retrieval log: chunk IDs, confidence scores, routing decision
  • Miss-rate monitoring triggers corpus gap detection in ingestion
  • Periodic audit report auto-generated as SOC 1 Type 2 evidence
Architectural note: Task AI Systems designs retrieval systems aligned with SOC 1 Type 2 operating effectiveness requirements, ISO 9001 process quality controls, and GDPR data handling principles. The four-zone compliance architecture creates a complete audit chain from query entry to context delivery that satisfies both internal governance requirements and external audit scrutiny. Formal certification responsibilities remain with your organisation's compliance function.
Operational Resilience

Every Retrieval Failure Mode. Every Recovery Path.

Retrieval failures are more dangerous than ingestion failures because they are invisible — the system returns an answer confidently, but the answer is wrong. These are the failure modes we design explicit recovery paths for.

Silent retrieval miss
Relevant documents exist in the corpus but are not retrieved. The system returns an answer based on partially relevant context, or confidently declines when it should have answered. Invisible without instrumentation — teams discover it through user complaints, not monitoring.
Cross-namespace leakage
A query returns chunks from a namespace the user is not authorised to access — HR documents appearing in a finance query, restricted-classified content returned to a standard user. Compliance failure regardless of whether the user noticed.
Stale version retrieval
A superseded policy document is retrieved instead of the current version. The LLM answers based on a policy that was changed six months ago. In regulated industries, acting on stale policy guidance is a compliance violation regardless of how the information was accessed.
Below-threshold fabrication
The system returns an answer when confidence is below threshold rather than declining. In a medical protocol system, this means a clinician receives guidance that is not grounded in retrieved evidence. The most dangerous failure mode — happens when confidence gating is not implemented.
Semantic cache poisoning
A cached result from a previous query is returned for a semantically similar but contextually different query. "What is our refund policy for EU customers?" returns a cached result for "What is our refund policy?" — missing the GDPR-specific provisions that apply to the EU context.
Confidence drift without alerting
Average retrieval confidence degrades over weeks as the knowledge corpus becomes stale relative to query patterns — new regulations, product changes, updated procedures. Without confidence distribution monitoring, this drift is invisible until user trust collapses.
Recovery path for each failure mode
Miss-rate monitoring + corpus gap detection
Zone D miss-rate alert fires when queries in a category consistently fail to retrieve above-threshold candidates. Alert triggers corpus gap detection in ingestion — the missing knowledge must be added, not the retrieval tuned.
Structural namespace isolation at Zone B
Cross-namespace isolation is enforced at the vector DB filter layer before scoring, so restricted content is never scored, never ranked, never returned. Because the filter is structural, there is no application-level check that could be bypassed.
Version metadata filter at retrieval time
Metadata filter track enforces current-version-only retrieval by default. Historical version retrieval requires explicit scope parameter. Version filter applied at HNSW payload layer — not post-processing — so stale chunks are never scored.
Zone C confidence gate — mandatory declination
Below-threshold queries route to human escalation queue with documented SLA. Declination is logged immutably. The system cannot answer below threshold — the gate is structural. Declination log is SOC 1 evidence the system refused to fabricate.
Cache invalidation on document update
When a document is updated in the ingestion pipeline, all semantic cache entries containing chunks from that document are invalidated. Cache keys include document version — a version increment forces cache miss regardless of query similarity.
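A minimal sketch of version-aware caching follows. For brevity it matches queries by normalised text rather than by embedding similarity, which is a simplification of a real semantic cache; the class and method names are hypothetical.

```python
class VersionAwareCache:
    """Cache whose entries remember which document versions they came from."""

    def __init__(self):
        # query key -> (cached result, {doc_id: version used to build it})
        self._entries = {}

    @staticmethod
    def _key(query: str) -> str:
        return " ".join(query.lower().split())

    def put(self, query: str, result, doc_versions: dict) -> None:
        self._entries[self._key(query)] = (result, dict(doc_versions))

    def get(self, query: str, current_versions: dict):
        """Return the cached result, or None if any source version changed."""
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        result, cached_versions = entry
        for doc_id, version in cached_versions.items():
            if current_versions.get(doc_id) != version:
                del self._entries[key]  # stale: force a fresh retrieval
                return None
        return result

    def invalidate_document(self, doc_id: str) -> int:
        """Drop every entry built from the given document; return the count."""
        stale = [k for k, (_, versions) in self._entries.items() if doc_id in versions]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

The version check on read and the explicit invalidation on update overlap on purpose: either one alone is enough to stop a superseded answer from being served.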
Confidence distribution alerting + A/B test
Zone D confidence distribution log feeds a drift detection alert — if average confidence drops more than 10% from baseline over a rolling 7-day window, an alert fires. A/B testing evaluates whether re-indexing or embedding model update restores confidence.
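The drift rule can be written as a small check over daily mean confidence values. Reading "drops more than 10% from baseline" as a relative drop of the 7-day mean is an interpretation, and the function name and inputs are assumptions.

```python
def confidence_drift(daily_means: list, baseline: float,
                     window: int = 7, max_drop: float = 0.10) -> bool:
    """True when the rolling mean has fallen more than max_drop below baseline.

    daily_means: one mean confidence value per day, oldest first.
    baseline: reference mean confidence, must be positive.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if len(daily_means) < window:
        return False  # not enough history to evaluate a full window
    recent = sum(daily_means[-window:]) / window
    relative_drop = (baseline - recent) / baseline
    return relative_drop > max_drop
```

Running the check per query category rather than globally keeps a decline in one area from being averaged away by stable traffic elsewhere.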
Architecture Decisions

Key Decisions & Their Rationale

Every architectural choice in the retrieval layer involves a tradeoff. These are the decisions that matter most — and the reasoning behind each one.

Retrieval architecture decisions
Parallel not sequential tracks
Sequential retrieval adds latency proportional to track count: three sequential tracks at 10ms each means 30ms minimum before fusion. Parallel dispatch fires all three simultaneously, so total retrieval latency is that of the slowest track, not the sum. At p95 this difference is 15–40ms, which under load can decide whether a response stays inside the 200ms budget.
RRF over score fusion
Dense and sparse retrieval scores exist on incompatible scales. A dense cosine similarity of 0.82 and a BM25 score of 14.3 cannot be combined without calibration. RRF uses rank position from each track rather than raw scores, making it robust to score distribution differences without requiring per-track calibration. It consistently outperforms weighted score fusion on out-of-distribution queries.
Confidence threshold calibration
Thresholds are calibrated against 400+ representative queries per use case, not set at 0.7 as a default. A threshold that is too high produces excessive declinations that frustrate users. Too low produces hallucinations. The calibration process evaluates precision-recall tradeoff at each threshold level and selects the point where declination rate is acceptable and precision at that threshold is above 95%.
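The calibration sweep can be sketched as follows. It assumes each calibration query has been labelled with the top confidence score and whether the resulting answer was judged correct, and it returns the lowest candidate threshold that meets the precision target, which keeps the declination rate as low as the target allows. The names and the candidate grid are illustrative.

```python
def calibrate_threshold(samples: list, min_precision: float = 0.95,
                        candidates: list = None):
    """Pick the lowest threshold whose answered queries meet the precision target.

    samples: list of (confidence, is_correct) pairs from labelled queries.
    Returns (threshold, precision, declination_rate) or None if no
    candidate reaches the target.
    """
    if candidates is None:
        candidates = [round(0.50 + 0.05 * i, 2) for i in range(10)]  # 0.50..0.95
    for threshold in sorted(candidates):
        answered = [ok for conf, ok in samples if conf >= threshold]
        if not answered:
            continue  # everything would be declined at this level
        precision = sum(answered) / len(answered)
        if precision >= min_precision:
            declination_rate = 1 - len(answered) / len(samples)
            return threshold, precision, declination_rate
    return None
```

Whether the resulting declination rate is acceptable remains a judgment for the use-case owner; the sweep only reports it.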
RBAC re-verification at Zone D
Zone A verifies access at query entry. Zone D re-verifies at context assembly. The gap between the two can be seconds or minutes in asynchronous retrieval pipelines. During that gap, a user's permissions may have been revoked (terminated employee, role change). Re-verification at Zone D ensures context is assembled only with chunks the user is still authorised to receive at the moment of assembly.
KMS decrypt per query vs cached keys
Caching decrypted vector payloads eliminates per-query KMS latency (~5–15ms) but creates a plaintext window in memory that widens the attack surface. For regulated industries, per-query KMS decryption is the correct choice despite the latency cost — it ensures encrypted storage at rest and minimises plaintext exposure time. The 5–15ms is absorbed into the p95 latency budget at the hardware layer.
Start the Conversation

Ready to Architect Your Retrieval Layer?

A focused architecture conversation can identify the specific retrieval gaps in your current RAG system — before they become accuracy or compliance failures.