Enterprise RAG Governance, Observability & Evaluation

Stage Detail

What Each Stage Does — And Why It Cannot Be Skipped

The governance and evaluation layer is the only layer in the entire RAG architecture that operates across all other layers simultaneously. It reads from ingestion logs, retrieval traces, processing records, and inference audit trails — and it writes improvement signals back to all of them. A system without this layer is deployed, not operated.

GOVERNANCE · STAGE 01

RBAC Enforcement + Access Governance

Role-based access control in enterprise RAG is not a login gate. It is a continuous enforcement mechanism that operates on every query, not just on session establishment. The role matrix defines the three-dimensional access space: which users can query which namespaces at which sensitivity tier. This matrix is not static: employees change roles, join teams, and leave the organisation. The access review process, mandatory quarterly for all AI system access, ensures that permissions are de-provisioned when they are no longer needed. SOC 1 Type 2 auditors specifically look for evidence that access reviews occurred on schedule and that de-provisioning was timely. The session audit log records every query with user identity, timestamp, namespace accessed, sensitivity tier accessed, and query hash, creating a continuous record of who accessed what. Privilege escalation alerting fires when a user's query pattern suggests an attempt to access content outside their authorised scope: unusual namespace queries, abnormally high query volume, or query patterns that match known data exfiltration techniques. The governance event log records every RBAC decision, access review, and escalation event in WORM storage as SOC 1 CC6.1 evidence that logical access controls operated continuously over the audit period.

Role matrix · Quarterly access review · De-provisioning evidence · Session audit log · Privilege escalation alert · Governance event log WORM
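
The per-query check and the session audit record described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the `ROLE_MATRIX` contents, the `AuditRecord` fields, and the `authorize_query` function are all hypothetical names, and a plain Python list stands in for WORM storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib

# Hypothetical role matrix: role -> namespace -> highest sensitivity tier allowed.
ROLE_MATRIX = {
    "support_agent": {"product_docs": 1},
    "clinician": {"product_docs": 1, "clinical_protocols": 3},
}

@dataclass(frozen=True)
class AuditRecord:
    user_id: str
    timestamp: str
    namespace: str
    tier: int
    query_hash: str
    allowed: bool

def authorize_query(user_id, role, namespace, tier, query, audit_log):
    """Check one query against the role matrix and append an audit record."""
    max_tier = ROLE_MATRIX.get(role, {}).get(namespace)
    allowed = max_tier is not None and tier <= max_tier
    # Both allowed and denied decisions are recorded, with a hash of the query text.
    audit_log.append(AuditRecord(
        user_id=user_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        namespace=namespace,
        tier=tier,
        query_hash=hashlib.sha256(query.encode("utf-8")).hexdigest(),
        allowed=allowed,
    ))
    return allowed

audit_log = []  # stand-in for an append-only (WORM) store
authorize_query("u1", "support_agent", "product_docs", 1, "how do I reset a device?", audit_log)
authorize_query("u1", "support_agent", "clinical_protocols", 2, "dosage guidance", audit_log)
```

Because the function is called once per query rather than once per session, a role change takes effect on the next query the user sends.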

GOVERNANCE · STAGE 02

Hallucination Monitoring

Hallucination monitoring is not a one-time check. It is a continuous operational signal that tells you whether the system is answering correctly and whether that correctness is degrading over time. NLI-based scoring applies a Natural Language Inference model to every response, classifying each claim as entailment (the claim is supported by the retrieved context), contradiction (the claim conflicts with the retrieved context), or neutral (the claim cannot be verified from the retrieved context). Contradiction rate, the percentage of claims scored as contradicting their cited source, is the primary hallucination metric. The target is below 2% for most enterprise domains and below 0.5% for clinical and legal systems, where hallucinated claims have direct harm potential. Citation tracing complements NLI scoring: for each claim in the response, the system traces the citation chain back to the specific chunk ID that was cited, verifying that the chunk actually exists in the index and was actually retrieved for this query. Claims with broken citation chains are hallucination signals even if NLI scoring shows entailment, because they indicate the model cited a source it did not actually receive. Drift detection monitors the rolling 7-day average contradiction rate against a calibrated baseline. A 10% relative increase from baseline triggers an alert; this is typically the first signal of model quality degradation, appearing weeks before users notice. Model card maintenance records known model limitations, documented failure modes, and domain-specific performance characteristics, as required for ISO 9001 traceability of the AI system's known capabilities and boundaries.

NLI entailment / contradiction / neutral · Citation chain verification · <2% contradiction rate target · 7-day rolling drift baseline · Model card maintenance
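
The contradiction rate and the 7-day drift check can be illustrated with a short sketch. It assumes the NLI labels have already been produced by some upstream model (not shown), and the function names `contradiction_rate` and `drift_alert` are hypothetical. The 7-day window and the 10% relative increase come from the text above.

```python
from statistics import mean

def contradiction_rate(labels):
    """Share of scored claims labelled 'contradiction'."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "contradiction") / len(labels)

def drift_alert(daily_rates, baseline, window=7, relative_increase=0.10):
    """True when the rolling mean of the last `window` daily rates
    exceeds the baseline by more than `relative_increase` (relative)."""
    if len(daily_rates) < window or baseline <= 0:
        return False  # not enough history, or no usable baseline
    rolling = mean(daily_rates[-window:])
    return rolling > baseline * (1 + relative_increase)

# Example: a baseline of 1.0% and a week averaging 1.2% is a 20% relative increase.
week = [0.012] * 7
alert = drift_alert(week, baseline=0.010)
```

Note that the alert is relative to the calibrated baseline, so a system can trigger it while still sitting under the absolute 2% target.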

GOVERNANCE · STAGE 03

Confidence Scoring + Threshold Governance

Confidence thresholds are not set once and forgotten — they are governance artefacts that must be maintained, calibrated, and documented. The calibration process evaluates the precision-recall tradeoff at each confidence level using a golden dataset of 400+ representative queries per domain, updated quarterly as the knowledge corpus and query patterns evolve. Per-domain threshold differentiation is mandatory: a customer support system that answers questions about product features can operate at 0.65 confidence because a wrong answer is recoverable; a medical protocol system operating at 0.65 would be dangerous because a wrong clinical recommendation is not recoverable. The compliance obligation here is documentation: the threshold for each domain, the calibration methodology, the dataset used, the calibration date, and the person who approved the threshold are all governance artefacts that must be version-controlled and auditable. Declination rate tracking monitors the percentage of queries that fail the confidence threshold and route to human review. A sustained declination rate above 15% in a query category is a corpus gap signal — the system is being asked questions it does not have reliable knowledge to answer, which is an ingestion problem, not a retrieval problem. The confidence distribution log records the full distribution of confidence scores per query category in WORM storage — this is the evidence that the threshold was enforced consistently over the audit period.

Quarterly calibration on golden dataset · Per-domain thresholds (0.65 FAQ → 0.90 legal) · Declination rate >15% → corpus gap · Confidence distribution WORM log
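
A rough sketch of per-domain threshold routing and declination-rate tracking follows. The 0.65 and 0.90 thresholds and the 15% limit are taken from the text; the domain keys, category names, and function names are assumptions made for illustration. The sketch evaluates a single batch of events, so the "sustained" aspect of the rule would need a time window on top of it.

```python
from collections import Counter

# Illustrative thresholds: the values come from the text, the keys are assumed.
THRESHOLDS = {"faq": 0.65, "legal": 0.90}

def route(domain, confidence):
    """Answer when confidence meets the domain threshold, else send to human review."""
    return "answer" if confidence >= THRESHOLDS[domain] else "human_review"

def declination_rates(events):
    """events: iterable of (category, domain, confidence) tuples.
    Returns category -> share of queries routed to human review."""
    total, declined = Counter(), Counter()
    for category, domain, confidence in events:
        total[category] += 1
        if route(domain, confidence) == "human_review":
            declined[category] += 1
    return {category: declined[category] / total[category] for category in total}

def corpus_gap_categories(events, limit=0.15):
    """Categories whose declination rate exceeds the limit."""
    return sorted(c for c, rate in declination_rates(events).items() if rate > limit)
```

Keeping the thresholds in one versioned mapping makes it straightforward to record who approved each value and when it was last calibrated.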

COMPLIANCE ZONE A

Access + Hallucination Governance Gate

Zone A combines access enforcement and hallucination control into a single compliance gate because both represent real-time risks that must be addressed before a response is considered valid. RBAC real-time check verifies that the user's current permissions still apply at the moment of query response — permissions can change mid-session, and a session established before a role change should not continue delivering content that the user is no longer authorised to receive. Hallucination gate intercepts responses where NLI contradiction scoring exceeds the domain threshold — these responses are suppressed before delivery, the user receives a graceful declination, and the suppressed response is logged with its contradiction score for analysis. Anomaly alerting monitors for patterns that suggest misuse: bulk query execution (a user executing hundreds of queries in minutes), privilege spiking (a user querying namespaces they have never accessed before), or query patterns that cluster around sensitive topics. These anomalies are flagged for security review, not automatically blocked — the alert creates an investigation record. Governance event log records every event from this zone in WORM storage: every RBAC decision, every hallucination suppression, every anomaly flag, every access review completion. This log is the primary SOC 1 CC6.1 evidence artefact — it demonstrates that access controls operated in real time across the entire audit period.

SOC 1 CC6.1 · CC6.6 · ISO 9001 §8.7 · RBAC real-time · Hallucination suppression log · Anomaly detection · Governance event log WORM
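
The anomaly alerting described for Zone A, which flags rather than blocks, might look something like the sketch below. The class name `BulkQueryDetector`, the default limits, and the flag names are hypothetical; timestamps are plain seconds so the example stays self-contained.

```python
from collections import defaultdict, deque

class BulkQueryDetector:
    """Flags bulk querying and first-time namespace access. It never blocks a query."""

    def __init__(self, max_queries=100, window_seconds=300):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)        # user -> recent query timestamps
        self._seen_namespaces = defaultdict(set)  # user -> namespaces queried so far

    def observe(self, user_id, namespace, ts):
        """Record one query at time `ts` (seconds) and return a list of anomaly flags."""
        flags = []
        window = self._events[user_id]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_queries:
            flags.append("bulk_query")
        seen = self._seen_namespaces[user_id]
        if seen and namespace not in seen:
            flags.append("new_namespace")
        seen.add(namespace)
        return flags
```

Each non-empty result would be written to the governance event log as an investigation record, leaving the decision to security review.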

Zone A controls

SOC 1 CC6.1 — RBAC enforced on every query, not just session establishment
SOC 1 CC6.6 — security events logged: anomaly alerts, suppressed responses, access violations
ISO 9001 §8.7 — nonconforming output: hallucinated responses suppressed and logged with reason
Governance event log in WORM storage — continuous evidence of real-time enforcement over audit period

GOVERNANCE · STAGE 04

Observability Infrastructure

Observability infrastructure is the difference between a system you operate and a system that operates you. Prometheus collects time-series metrics from every layer: query latency by stage (ingestion lag, retrieval time broken down by dense/sparse/reranker, inference TTFT and TPS), error rates by category (retrieval miss, confidence gate fail, citation verify fail, model error), cache hit rates by tier, cost per query attributed to department and model tier, and hallucination signal rates per query category. Grafana renders three distinct dashboard views, each optimised for a different audience. The operations dashboard shows real-time latency percentiles, error rates, cache performance, and model tier distribution — the metrics engineers need to diagnose performance issues within seconds. The compliance dashboard shows RBAC event frequency, hallucination suppression rate, confidence threshold pass/fail distribution, and zone-by-zone control status — the view a compliance officer needs to verify that controls are operating. The executive dashboard shows quality KPIs (RAGAS scores trend, satisfaction score trend), cost per query by department, and risk indicators (hallucination rate, suppression rate, escalation frequency) — the strategic view for leadership. Custom query tracer records the complete path of every query through the system: which chunks were retrieved, their confidence scores, the reranking decisions, which prompt template was used, the model configuration, and the final response — creating a complete forensic record of every query-response pair. Alerting rules define the thresholds that trigger PagerDuty or equivalent notifications: p95 latency above 800ms, error rate above 2%, hallucination rate above 5%, or confidence distribution shift above 10% from baseline.

Prometheus metrics · Grafana (ops · compliance · executive dashboards) · Custom query tracer · Latency p50/p95/p99 · Error rate · Cache hit rate · Cost per query · Alerting rules
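
To make the alerting thresholds concrete, the sketch below computes a latency percentile and evaluates the four rules listed above against a metrics snapshot. It is plain Python for illustration only, not Prometheus rule syntax; the metric key names and the nearest-rank percentile method are assumptions.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Thresholds from the text; the key names are illustrative.
ALERT_RULES = {
    "p95_latency_ms": 800,
    "error_rate": 0.02,
    "hallucination_rate": 0.05,
    "confidence_shift": 0.10,
}

def firing_alerts(snapshot):
    """Names of rules whose current value is strictly above its threshold."""
    return sorted(name for name, limit in ALERT_RULES.items()
                  if snapshot.get(name, 0) > limit)

latencies_ms = list(range(100, 2100, 100))  # 20 synthetic samples: 100, 200, ..., 2000
snapshot = {
    "p95_latency_ms": percentile(latencies_ms, 95),
    "error_rate": 0.01,
    "hallucination_rate": 0.06,
    "confidence_shift": 0.10,
}
```

In a real deployment these comparisons would normally live in the monitoring stack's own alerting rules, with the notification routed to PagerDuty or an equivalent.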

GOVERNANCE · STAGE 05

Audit Dashboards + Regulator Export

The audit dashboard is the interface between the system's technical operations and the compliance function's regulatory obligations. The compliance dashboard surfaces four categories of evidence in a form that a compliance officer can interpret without engineering expertise: RBAC status (are all access controls active, when was the last access review, are there any expired permissions outstanding?), hallucination control status (what is the current contradiction rate, how does it compare to the baseline, how many responses were suppressed in the last 30 days?), compliance zone status (are all four zones active, when did each last fire, are there any zone failures?), and audit log integrity (is the WORM log accepting writes, when was the last integrity check, is the SHA-256 chain unbroken?). Regulator export is the capability that transforms the system's audit trail into a packaged evidence submission. When a SOC 1 Type 2 audit occurs, the auditor requires evidence that specific controls operated effectively over a 12-month period. The regulator export function produces a structured package: access review records, RBAC enforcement event logs, hallucination suppression logs, confidence threshold calibration records, RAGAS evaluation history, and the SHA-256-verified audit log chain — all filtered to the requested audit period and formatted for submission. This capability exists because assembling this evidence manually under audit pressure is how compliance failures happen.

Ops dashboard · Compliance dashboard · Executive KPI dashboard · Regulator export package · SOC 1 T2 evidence assembly · Audit period filtering

COMPLIANCE ZONE B

Audit Trail Integrity

Zone B ensures that the audit trail itself is trustworthy — that the logs cannot be modified, that the chain of evidence is unbroken, and that the evidence package produced for a regulator accurately reflects what actually happened. Cross-layer log join connects the query event logs from the pre-query gate (Zone A, ingestion compliance zone), the retrieval trace logs (Zones B and D, retrieval compliance), the processing quality gate logs (Zone A, processing), and the inference audit trail (Zone C, inference) into a single queryable record. For any query, an auditor can trace the complete chain: who asked it, what was retrieved, how confident the retrieval was, what prompt was used, what model produced the response, whether the response passed citation verification, and what the user received. WORM log store uses cloud-native immutable storage (S3 Object Lock in compliance mode, Azure Blob immutability policy, GCS bucket retention policy) to ensure that logs cannot be modified, deleted, or overwritten after writing. SHA-256 chain hash creates a cryptographic chain across log entries — each log entry includes the hash of the previous entry, making any tampering with historical records detectable. Periodic audit report generation runs automatically on a monthly basis, producing an ISO 9001 §9.2 internal audit report that documents the operational status of all compliance controls — this is not a manually assembled document, it is generated from the system's own logs and submitted to the compliance function for review and sign-off.

SOC 1 Type 2 · ISO 9001 §9.1 · Cross-layer log join · WORM storage · SHA-256 chain hash · Monthly auto-generated audit report · 12-month retention minimum
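
The SHA-256 chain hash can be sketched in a few lines. This is a minimal illustration under stated assumptions: entries are JSON-serialisable dictionaries, the genesis value is a string of zeros, and the names `append_entry` and `verify_chain` are hypothetical. The immutable storage layer itself is out of scope here.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry

def entry_hash(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_entry(chain, payload):
    """Append a log entry that embeds the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    })

def verify_chain(chain):
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return index
        prev = entry["hash"]
    return None

chain = []
append_entry(chain, {"event": "rbac_decision", "user": "u1", "allowed": True})
append_entry(chain, {"event": "suppression", "user": "u2", "score": 0.4})
append_entry(chain, {"event": "access_review", "reviewer": "r1"})
```

Because each hash depends on the one before it, editing any historical payload invalidates that entry and every entry after it, which is what the integrity check on the compliance dashboard would detect.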

Zone B controls

SOC 1 Type 2 — 12 months of WORM logs as operating effectiveness evidence
ISO 9001 §9.1 — monitoring and measurement: metrics logged continuously, not spot-checked
SHA-256 chain hash — cryptographic tamper detection across audit log entries
Cross-layer log join — complete query-to-response chain reconstructable for any historical query
Monthly audit report auto-generated from logs — ISO 9001 §9.2 internal audit evidence

EVALUATION · STAGE 06

RAGAS Offline Evaluation

RAGAS (Retrieval Augmented Generation Assessment) is the evaluation framework that quantifies RAG system quality across four independent dimensions, each measuring a different aspect of the system's performance. Context precision measures the proportion of retrieved context that was actually used in the response — high precision means the retrieval layer is returning relevant content; low precision means it is returning noise that the inference layer ignores. Formula: (relevant chunks used in response) / (total chunks retrieved). Target: above 0.80 for most enterprise workloads. Context recall measures the proportion of relevant content that was retrieved — high recall means the retrieval layer is not missing important information; low recall means relevant documents exist but are not being found. Formula: (relevant chunks retrieved) / (total relevant chunks in corpus for this query). Target: above 0.75. Faithfulness measures the proportion of claims in the response that are directly grounded in the retrieved context — this is the RAGAS hallucination metric. Formula: (claims entailed by retrieved context) / (total claims in response). Target: above 0.95 for regulated domains. Answer relevance measures whether the response actually addresses the question asked — a response can be perfectly cited and faithful to its sources but still fail to answer the question. Formula: LLM-judged relevance score on 0–1 scale. Target: above 0.85. RAGAS runs offline against a golden evaluation dataset — a curated set of query-answer pairs with human relevance judgements, updated quarterly and version-controlled as a compliance artefact. Results are logged with timestamps, dataset versions, and model configurations, creating a time-series of system quality that is the primary evidence for ISO 9001 §9.1 performance monitoring.

Context precision >0.80 · Context recall >0.75 · Faithfulness >0.95 · Answer relevance >0.85 · Golden dataset quarterly update · Score trend in WORM log
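
The formulas as stated in this section can be written down directly. The sketch below follows the simplified ratios given in the text, which are not necessarily how the RAGAS library computes its metrics internally; chunk IDs, claim labels, and function names are hypothetical, and answer relevance is omitted from the calculation because it relies on an LLM judge.

```python
def context_precision(used_chunks, retrieved_chunks):
    """(relevant chunks used in response) / (total chunks retrieved)."""
    retrieved = set(retrieved_chunks)
    return len(set(used_chunks) & retrieved) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved_chunks, relevant_chunks):
    """(relevant chunks retrieved) / (total relevant chunks for this query)."""
    relevant = set(relevant_chunks)
    return len(set(retrieved_chunks) & relevant) / len(relevant) if relevant else 0.0

def faithfulness(claim_labels):
    """(claims entailed by retrieved context) / (total claims in response)."""
    if not claim_labels:
        return 0.0
    return sum(1 for label in claim_labels if label == "entailment") / len(claim_labels)

# Targets from the text; each score must be strictly above its target.
TARGETS = {
    "context_precision": 0.80,
    "context_recall": 0.75,
    "faithfulness": 0.95,
    "answer_relevance": 0.85,
}

def below_target(scores):
    """Metrics that fail to exceed their target."""
    return sorted(metric for metric, target in TARGETS.items() if scores[metric] <= target)
```

Logging the output of `below_target` alongside the dataset version and model configuration gives each evaluation run a simple pass or fail summary for the score trend.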

EVALUATION · STAGE 07

A/B Testing

A/B testing in a RAG system is not a UX experiment — it is a controlled quality improvement protocol. Every significant change to the system (new embedding model, updated chunking strategy, revised confidence threshold, new prompt version, alternative reranker) should be validated through a controlled A/B test before being promoted to production. The A/B testing framework routes a defined percentage of production queries (typically 10–20%) to the challenger configuration while the remaining queries continue to be served by the current configuration. Both configurations log their full retrieval traces and RAGAS scores, enabling direct statistical comparison. Statistical significance threshold is p<0.05 — the challenger configuration must demonstrate better RAGAS scores at statistical significance before it is eligible for promotion. This prevents premature promotion based on random variation rather than genuine improvement. The A/B decision log records: what was tested, the hypothesis, the traffic split, the duration, the RAGAS scores for control and challenger, the statistical significance result, and the promotion decision with the name of the approving authority. This log is a SOC 1 change management evidence artefact — it documents that every change to the production system was tested, evaluated, and approved before deployment.

Retrieval strategy A/B · Embedding model A/B · Prompt variant A/B · p<0.05 significance required · A/B decision log WORM · Approver name + rationale
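
One way to apply the p<0.05 promotion rule is a two-sample test on per-query scores. The text does not say which test is used, so the sketch below assumes a one-sided large-sample z-test (normal approximation) on the difference in mean scores; the function names are hypothetical, and a different test may suit small samples or skewed score distributions better.

```python
import math
from statistics import NormalDist, mean, variance

def one_sided_p_value(control, challenger):
    """Large-sample z-test for the hypothesis mean(challenger) > mean(control)."""
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(challenger) / len(challenger)
    )
    if standard_error == 0:
        raise ValueError("scores have zero variance; the test is undefined")
    z = (mean(challenger) - mean(control)) / standard_error
    return 1.0 - NormalDist().cdf(z)

def eligible_for_promotion(control, challenger, alpha=0.05):
    """True only when the challenger is better at the chosen significance level."""
    return one_sided_p_value(control, challenger) < alpha

# Synthetic per-query faithfulness scores, 40 queries per arm.
control_scores = [0.80, 0.82, 0.78, 0.81, 0.79] * 8
challenger_scores = [0.86, 0.88, 0.84, 0.87, 0.85] * 8
decision = eligible_for_promotion(control_scores, challenger_scores)
```

The p-value, the test used, and the sample sizes are the kind of detail the A/B decision log would record next to the approver's name.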

EVALUATION · STAGE 08

User Signal Loop

User signals are the earliest available indicator of system quality degradation. Automated RAGAS evaluation runs on a schedule, daily or weekly. User signals are continuous. A user who receives a hallucinated response and marks it thumbs down has provided a quality signal hours or days before the automated evaluation would have detected the degradation. Five signal types feed the loop. Explicit thumbs up/down ratings are the clearest signal but have the lowest capture rate: typically 3–8% of queries receive explicit ratings. Query rephrasing is the most common implicit signal: when a user immediately rephrases a query after receiving a response, it indicates the response was unsatisfactory. Session abandonment occurs when a user receives a response but does not select it, closes the session, and does not return, indicating the system failed to provide useful information. Correction submission is the most valuable signal: when a user manually provides what they consider the correct answer after receiving the system's response, that correction is a labelled training example for fine-tuning. CSAT (Customer Satisfaction Score) trend monitors the rolling average of satisfaction ratings, providing a leading indicator of quality drift that precedes RAGAS score drops by days to weeks. All user signals are attributed to the specific query, retrieved chunks, and prompt configuration that produced the response, enabling targeted investigation of whether dissatisfaction is driven by retrieval quality, response quality, or topic gaps.

Explicit thumbs up/down · Query rephrase (implicit) · Correction submission (labelled training) · Session abandon signal · CSAT rolling trend · Signal attributed to query config
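
The implicit rephrase signal can be approximated with a simple heuristic. The sketch below treats a follow-up query as a rephrase when it arrives shortly after the previous one and is textually similar without being identical. The 60-second gap, the 0.6 similarity floor, and the use of `difflib` string similarity are all assumptions for illustration; an embedding-based similarity would be a reasonable alternative.

```python
from difflib import SequenceMatcher

def is_rephrase(prev_query, next_query, seconds_between,
                max_gap=60, min_similarity=0.6):
    """Heuristic: similar but not identical text, sent soon after the first query."""
    if seconds_between > max_gap:
        return False
    ratio = SequenceMatcher(None, prev_query.lower(), next_query.lower()).ratio()
    # An exact repeat (ratio 1.0) is treated as a retry, not a rephrase.
    return min_similarity <= ratio < 1.0

flagged = is_rephrase(
    "how do I reset my password",
    "how can I reset my account password",
    seconds_between=20,
)
```

When a rephrase is flagged, the signal would be attributed to the first query's retrieved chunks and prompt configuration, since that is the response the user found unsatisfactory.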

COMPLIANCE ZONE C

Evaluation Evidence Log

Zone C ensures that the evaluation process itself is auditable — that the measurements of system quality are traceable, reproducible, and cannot be selectively reported. RAGAS score archive stores every evaluation run in WORM storage with: the timestamp of the run, the version of the golden evaluation dataset used, the RAGAS scores for each dimension, the model and configuration being evaluated, and the hash of the RAGAS evaluation code. This archive enables any historical evaluation to be reproduced: given the same golden dataset version, model configuration, and evaluation code, the same scores should be produced. A/B decision log records the complete audit trail for every change to the production system: the hypothesis, the traffic split, the duration, the statistical results, and critically, the name and role of the person who approved the promotion. This is the SOC 1 CC8.1 change management evidence — it demonstrates that changes to the system were controlled, tested, and approved. Golden dataset version control treats the evaluation corpus as a first-class software artefact: each version is tagged, the changelog between versions is documented, and the rationale for adding or removing evaluation examples is recorded. When RAGAS scores change significantly between evaluations, the audit record should be able to determine whether the change was driven by a system change or an evaluation dataset change. Quality trend report aggregates all evaluation evidence into a periodic report for the compliance function, documenting whether the system's measured quality is meeting defined targets and whether any corrective actions have been taken.

ISO 9001 §9.1.1 · SOC 1 CC7.2 · RAGAS score WORM archive · A/B decision log with approver · Golden dataset version control · Quality trend report

Zone C controls

ISO 9001 §9.1.1 — monitoring and measurement: RAGAS evaluation on defined schedule with archived results
SOC 1 CC7.2 — system monitoring: quality metrics logged continuously, drift detected automatically
A/B decision log — SOC 1 CC8.1 change management: every system change tested and documented
Golden dataset version control — evaluation reproducibility: historical scores can be verified
WORM archive — evaluation evidence cannot be selectively modified or deleted

EVALUATION · STAGE 09

Knowledge Gap Detection

Knowledge gap detection is the feedback mechanism that connects the governance and evaluation layer back to the ingestion layer. A RAG system's quality is fundamentally limited by its corpus — if the knowledge base does not contain the answer to a category of question, no retrieval optimisation or model improvement will fix it. The gap detection system identifies these limitations before they manifest as user frustration. Miss-rate by category monitors the percentage of queries in each topic category that fail the retrieval confidence threshold. A sustained miss-rate above 5% in a category over a 7-day rolling window triggers a knowledge gap alert — indicating that the system is being asked questions about a topic for which the corpus is insufficient. Declination clustering analyses the content of confidence-gated queries to identify topic patterns. When 30 users over a week all ask questions that the system declined to answer, and those questions cluster around "new EU AI Act compliance requirements," that cluster is an actionable corpus gap signal: the knowledge base needs EU AI Act documentation. Corpus coverage analysis compares the distribution of query embeddings against the distribution of indexed chunk embeddings in embedding space — areas of query space that have no nearby index content are structural gaps. Gap prioritisation scores each identified gap by query frequency (how often is the system being asked about this topic?) multiplied by business impact (how important is it to answer questions about this topic?) to produce a ranked list of corpus additions for the content team.

Miss-rate >5% per category → gap alert · Declination topic clustering · Corpus coverage embedding analysis · Gap priority = frequency × impact · → triggers ingestion
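
The miss-rate alert and the frequency-times-impact ranking can be sketched as below. The 5% limit and the scoring rule come from the text; the input shapes, the impact scale, and the function names are assumptions. Counts are taken to be already aggregated over the 7-day window.

```python
def gap_alerts(miss_counts, limit=0.05):
    """miss_counts: category -> (misses, total queries) over the rolling window.
    Returns categories whose miss rate exceeds the limit."""
    return sorted(
        category
        for category, (misses, total) in miss_counts.items()
        if total and misses / total > limit
    )

def prioritise(gaps):
    """Rank gaps by query frequency multiplied by business impact, highest first."""
    return sorted(gaps, key=lambda gap: gap["weekly_queries"] * gap["impact"], reverse=True)

weekly = {"eu_ai_act": (30, 200), "hr_policy": (2, 100)}
ranked = prioritise([
    {"topic": "eu_ai_act", "weekly_queries": 30, "impact": 5},
    {"topic": "travel_policy", "weekly_queries": 100, "impact": 1},
    {"topic": "clinical_update", "weekly_queries": 10, "impact": 20},
])
```

In this example the lowest-volume topic ranks first because its impact weight is high, which is the behaviour the multiplication is meant to produce.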

EVALUATION · STAGE 10

Continuous Improvement Loop

The continuous improvement loop is what transforms a RAG deployment into a RAG product — a system that gets measurably better over time rather than decaying as the world changes around it. The loop has four phases. Trigger: a quality signal fires — a RAGAS score drops below baseline, a knowledge gap is identified, a user correction pattern emerges, or an A/B test demonstrates a challenger improvement. Action: the appropriate response is initiated — for a corpus gap, the ingestion pipeline is triggered to add documents; for a RAGAS score drop, the embedding model or chunking strategy is reviewed and a fine-tuning run initiated; for a prompt quality issue, a new prompt variant enters A/B testing. Verify: the proposed change is deployed in shadow mode — running in parallel with the current production configuration without serving users — and evaluated against the golden dataset and a sample of recent production queries. RAGAS scores for the shadow deployment must meet or exceed the current production baseline before the change is eligible for promotion. Promote: the change is promoted to production through the A/B framework, ensuring that the improvement holds in production traffic before full rollout. Every loop iteration is recorded in the improvement evidence log — including triggers, actions taken, shadow evaluation results, and promotion decisions. This log is the ISO 9001 §10.3 continual improvement evidence, demonstrating that the organisation has an active process for identifying and acting on opportunities for improvement in the AI system.

Trigger → action → shadow deploy → RAGAS verify → A/B promote · Corpus gap → ingest · RAGAS drop → fine-tune · Every loop iteration logged · ISO 9001 §10.3 evidence
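
The Verify phase reduces to a gate: a shadow deployment is eligible for promotion only if every score meets or exceeds the production baseline. A minimal sketch, with hypothetical function and field names and a plain list standing in for the improvement evidence log:

```python
def shadow_gate(baseline, shadow):
    """Return the metrics where the shadow score falls below the production baseline.
    An empty list means the change is eligible to enter A/B promotion."""
    return sorted(
        metric
        for metric, baseline_score in baseline.items()
        # A metric missing from the shadow run counts as a failure.
        if shadow.get(metric, float("-inf")) < baseline_score
    )

def record_iteration(log, trigger, action, baseline, shadow):
    """Append one loop iteration to the evidence log and return the gate decision."""
    failed = shadow_gate(baseline, shadow)
    log.append({
        "trigger": trigger,
        "action": action,
        "baseline": dict(baseline),
        "shadow": dict(shadow),
        "failed_metrics": failed,
        "eligible": not failed,
    })
    return not failed

evidence_log = []
production = {"faithfulness": 0.96, "context_recall": 0.78}
candidate = {"faithfulness": 0.96, "context_recall": 0.80}
eligible = record_iteration(evidence_log, "corpus gap: eu_ai_act", "ingest new documents",
                            production, candidate)
```

Iterations that fail the gate are logged as well, so the record shows rejected changes alongside promoted ones.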

COMPLIANCE ZONE D

Improvement Evidence + Change Management

Zone D closes the compliance loop: every improvement made to the system (every model update, every corpus addition, every prompt change, every threshold recalibration) is documented as a controlled change with before-and-after quality evidence. The change log records the complete change history of the production system: what changed, who made the change, when, with what approval, and for what reason. This is the SOC 1 CC8.1 change management evidence, demonstrating that changes to the system were controlled, not ad hoc. Before/after RAGAS scores are logged with every change: the system's measured quality at the time of the change and at 30 days after the change, enabling the organisation to demonstrate that changes improved the system or at minimum did not degrade it. Rollback capability is architectural, not aspirational: every component of the production system (index version, embedding model, prompt template, confidence thresholds) has a version tag and a documented rollback procedure. If a change causes unexpected quality degradation, rollback to the previous version can be initiated within minutes without data loss. Corrective action records document the organisation's response to identified quality failures. When a RAGAS score drops, a hallucination incident occurs, or a compliance zone fails, the corrective action record documents what happened, what caused it, what was done to fix it, and what was done to prevent recurrence. These records are the ISO 9001 §10.2 corrective action evidence, demonstrating that the organisation treats quality failures as improvement opportunities rather than isolated incidents.

ISO 9001 §10.3 · SOC 1 CC8.1 · Change log with approver · Before/after RAGAS scores · Rollback capability documented · Corrective action records · Continual improvement evidence

Zone D controls

SOC 1 CC8.1 — change management: every system change controlled, tested, approved, and documented
ISO 9001 §10.3 — continual improvement: documented process for identifying and acting on improvement opportunities
Rollback capability — every production component can be reverted within minutes with documented procedure
Corrective action records — ISO 9001 §10.2 evidence: failures treated as improvement opportunities
Before/after quality scores — demonstrates changes improved or maintained system quality
