Governance, Observability & Evaluation

The Layer That Keeps Enterprise AI Trustworthy.

A RAG system that is deployed but not governed is a liability, not an asset. This layer monitors every response for hallucination, enforces every access control, measures every retrieval decision, and runs continuous evaluation loops that detect degradation before users do — with four compliance zones creating an unbroken audit chain for regulators.

4 Compliance Zones
Access · Monitor · Audit · Improve
RAGAS Evaluation
Precision · recall · faithfulness
Hallucination Detection
NLI + citation tracing
Continuous Improvement
Signal → retrain → verify
Complete Architecture

Governance + Evaluation — Every Stage.

Governance is not a dashboard bolted on after deployment. It is an architecture of controls, monitors, and feedback loops built into the system from day one. The governance layer operates across all other layers — it reads from ingestion logs, retrieval traces, and inference records to produce a complete operational picture and trigger improvements automatically.

[Figure: Governance, Observability and Evaluation Layer. Complete architecture with four compliance zones. The layer reads from ingestion, retrieval, processing, and inference; runs governance stages 1–5 and evaluation stages 6–10; and feeds back into ingestion, retrieval, processing, and inference. Legend: governance layer, evaluation layer, compliance zone (A·B·C·D), infrastructure / cross-layer.]

Four compliance zones: control inventory
  • Zone A · Access + hallucination: SOC 1 CC6.1 · ISO 9001 §8.7. RBAC real-time, hallucination gate, anomaly alert, governance event log
  • Zone B · Audit trail: SOC 1 Type 2 · ISO 9001 §9.1. Cross-layer log join, WORM store, SHA-256 chain hash, periodic audit report
  • Zone C · Evaluation evidence: ISO 9001 §9.1.1 · SOC 1 CC7.2. RAGAS archive, A/B decision log, golden dataset lineage, quality trend
  • Zone D · Improvement evidence: ISO 9001 §10.3 · SOC 1 CC8.1. Change log, before/after scores, rollback capability, corrective action
Stage Detail

What Each Stage Does — And Why It Cannot Be Skipped

The governance and evaluation layer is the only layer in the entire RAG architecture that operates across all other layers simultaneously. It reads from ingestion logs, retrieval traces, processing records, and inference audit trails — and it writes improvement signals back to all of them. A system without this layer is deployed, not operated.

GOVERNANCE · STAGE 01
RBAC Enforcement + Access Governance
Role-based access control in enterprise RAG is not a login gate — it is a continuous enforcement mechanism that operates on every query, not just on session establishment. The role matrix defines the three-dimensional access space: which users can query which namespaces at which sensitivity tier. This matrix is not static: employees change roles, join teams, leave the organisation. The access review process — mandatory quarterly for all AI system access — ensures that permissions are de-provisioned when they are no longer needed. SOC 1 Type 2 auditors specifically look for evidence that access reviews occurred on schedule and that de-provisioning was timely. Session audit logs every query with user identity, timestamp, namespace accessed, sensitivity tier accessed, and query hash — creating a continuous record of who accessed what. Privilege escalation alerting fires when a user's query pattern suggests they are attempting to access content outside their authorised scope — unusual namespace queries, abnormally high query volume, or query patterns that match known data exfiltration techniques. The governance event log records every RBAC decision, access review, and escalation event in WORM storage as SOC 1 CC6.1 evidence that logical access controls operated continuously over the audit period.
Role matrix · Quarterly access review · De-provisioning evidence · Session audit log · Privilege escalation alert · Governance event log WORM
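The per-query check and the session audit record described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the `ROLE_MATRIX` contents, the numeric tier scale, and the `authorise_query` and `AuditRecord` names are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical role matrix: role -> namespace -> highest sensitivity tier allowed.
ROLE_MATRIX = {
    "support_agent": {"product_faq": 1},
    "legal_counsel": {"product_faq": 1, "contracts": 3},
}

@dataclass(frozen=True)
class AuditRecord:
    user_id: str
    timestamp: str
    namespace: str
    tier: int
    query_hash: str
    allowed: bool

def authorise_query(user_id, role, namespace, tier, query, now=None):
    """Check the role matrix for this query and return the audit record."""
    max_tier = ROLE_MATRIX.get(role, {}).get(namespace, 0)  # 0 means no access
    allowed = 1 <= tier <= max_tier
    now = now or datetime.now(timezone.utc)
    return AuditRecord(
        user_id=user_id,
        timestamp=now.isoformat(),
        namespace=namespace,
        tier=tier,
        # Only the hash of the query text is stored, as in the session audit log.
        query_hash=hashlib.sha256(query.encode("utf-8")).hexdigest(),
        allowed=allowed,
    )
```

Because the function is called on every query rather than at login, a role change takes effect on the next request, and denied requests produce an audit record as well as allowed ones.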
GOVERNANCE · STAGE 02
Hallucination Monitoring
Hallucination monitoring is not a one-time check. It is a continuous operational signal that tells you whether the system is answering correctly and whether that correctness is degrading over time. NLI-based scoring applies a Natural Language Inference model to every response, classifying each claim as entailment (the claim is supported by the retrieved context), contradiction (the claim conflicts with the retrieved context), or neutral (the claim cannot be verified from the retrieved context). Contradiction rate, the percentage of claims scored as contradicting their cited source, is the primary hallucination metric. The target is below 2% for most enterprise domains and below 0.5% for clinical and legal systems, where hallucinated claims have direct harm potential. Citation tracing complements NLI scoring: for each claim in the response, the system traces the citation chain back to the specific chunk ID that was cited, verifying that the chunk actually exists in the index and was actually retrieved for this query. Claims with broken citation chains are hallucination signals even if NLI scoring shows entailment, because they indicate the model cited a source it did not actually receive. Drift detection monitors the rolling 7-day average contradiction rate against a calibrated baseline. A 10% relative increase from baseline triggers an alert; this is typically the first signal of model quality degradation, appearing weeks before users notice. Model card maintenance records known model limitations, documented failure modes, and domain-specific performance characteristics, as required for ISO 9001 traceability of the AI system's known capabilities and boundaries.
NLI entailment / contradiction / neutral · Citation chain verification · <2% contradiction rate target · 7-day rolling drift baseline · Model card maintenance
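The three signals in this stage reduce to small calculations once the NLI labels exist. The sketch below assumes an upstream NLI model has already labelled each claim and that each claim carries the chunk ID it cites; the function names and the claim dictionary shape are hypothetical.

```python
from collections import Counter

def contradiction_rate(labels):
    """Share of claims labelled 'contradiction' by the NLI model."""
    if not labels:
        return 0.0
    return Counter(labels)["contradiction"] / len(labels)

def broken_citations(claims, retrieved_chunk_ids):
    """Return claims whose cited chunk ID was not retrieved for this query."""
    retrieved = set(retrieved_chunk_ids)
    return [c for c in claims if c["chunk_id"] not in retrieved]

def drift_alert(daily_rates, baseline, relative_increase=0.10, window=7):
    """True when the rolling average exceeds baseline by the relative margin."""
    recent = daily_rates[-window:]
    if len(recent) < window:
        return False  # not enough history for a full window yet
    rolling = sum(recent) / window
    return rolling > baseline * (1 + relative_increase)
```

With a baseline of 1.0%, a 7-day average of 1.2% is a 20% relative increase and fires the alert, while 1.05% stays under the 10% margin.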
GOVERNANCE · STAGE 03
Confidence Scoring + Threshold Governance
Confidence thresholds are not set once and forgotten — they are governance artefacts that must be maintained, calibrated, and documented. The calibration process evaluates the precision-recall tradeoff at each confidence level using a golden dataset of 400+ representative queries per domain, updated quarterly as the knowledge corpus and query patterns evolve. Per-domain threshold differentiation is mandatory: a customer support system that answers questions about product features can operate at 0.65 confidence because a wrong answer is recoverable; a medical protocol system operating at 0.65 would be dangerous because a wrong clinical recommendation is not recoverable. The compliance obligation here is documentation: the threshold for each domain, the calibration methodology, the dataset used, the calibration date, and the person who approved the threshold are all governance artefacts that must be version-controlled and auditable. Declination rate tracking monitors the percentage of queries that fail the confidence threshold and route to human review. A sustained declination rate above 15% in a query category is a corpus gap signal — the system is being asked questions it does not have reliable knowledge to answer, which is an ingestion problem, not a retrieval problem. The confidence distribution log records the full distribution of confidence scores per query category in WORM storage — this is the evidence that the threshold was enforced consistently over the audit period.
Quarterly calibration on golden dataset · Per-domain thresholds (0.65 FAQ → 0.90 legal) · Declination rate >15% → corpus gap · Confidence distribution WORM log
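A minimal sketch of per-domain routing and declination tracking follows. The two thresholds and the 15% limit come from the text; the function names, the `human_review` label, and the input shape are assumptions, and the "sustained" condition is left out for brevity (a real check would evaluate the rate over a time window).

```python
DOMAIN_THRESHOLDS = {"faq": 0.65, "legal": 0.90}  # values quoted in the text
DECLINATION_LIMIT = 0.15

def route(domain, confidence):
    """Answer when confidence meets the domain threshold, else human review."""
    return "answer" if confidence >= DOMAIN_THRESHOLDS[domain] else "human_review"

def corpus_gap_categories(decisions):
    """decisions: iterable of (category, routed) pairs.

    Returns the categories whose declination rate exceeds the limit.
    """
    totals, declined = {}, {}
    for category, routed in decisions:
        totals[category] = totals.get(category, 0) + 1
        if routed == "human_review":
            declined[category] = declined.get(category, 0) + 1
    return sorted(
        c for c, n in totals.items()
        if declined.get(c, 0) / n > DECLINATION_LIMIT
    )
```

The same confidence score of 0.70 is answered in the FAQ domain and declined in the legal domain, which is the point of per-domain differentiation.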
COMPLIANCE ZONE A
Access + Hallucination Governance Gate
Zone A combines access enforcement and hallucination control into a single compliance gate because both represent real-time risks that must be addressed before a response is considered valid. RBAC real-time check verifies that the user's current permissions still apply at the moment of query response — permissions can change mid-session, and a session established before a role change should not continue delivering content that the user is no longer authorised to receive. Hallucination gate intercepts responses where NLI contradiction scoring exceeds the domain threshold — these responses are suppressed before delivery, the user receives a graceful declination, and the suppressed response is logged with its contradiction score for analysis. Anomaly alerting monitors for patterns that suggest misuse: bulk query execution (a user executing hundreds of queries in minutes), privilege spiking (a user querying namespaces they have never accessed before), or query patterns that cluster around sensitive topics. These anomalies are flagged for security review, not automatically blocked — the alert creates an investigation record. Governance event log records every event from this zone in WORM storage: every RBAC decision, every hallucination suppression, every anomaly flag, every access review completion. This log is the primary SOC 1 CC6.1 evidence artefact — it demonstrates that access controls operated in real time across the entire audit period.
SOC 1 CC6.1 · CC6.6 · ISO 9001 §8.7 · RBAC real-time · Hallucination suppression log · Anomaly detection · Governance event log WORM
Zone A controls
  • SOC 1 CC6.1 — RBAC enforced on every query, not just session establishment
  • SOC 1 CC6.6 — security events logged: anomaly alerts, suppressed responses, access violations
  • ISO 9001 §8.7 — nonconforming output: hallucinated responses suppressed and logged with reason
  • Governance event log in WORM storage — continuous evidence of real-time enforcement over audit period
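Two of the anomaly patterns named above, bulk query execution and first-time namespace access, can be sketched with a sliding window per user. The class name, the default limits, and the flag strings are illustrative assumptions; consistent with the text, the monitor only returns flags for review and never blocks the query.

```python
from collections import deque

class AnomalyMonitor:
    """Flags bulk querying and first-time namespace access for one user."""

    def __init__(self, max_queries=100, window_seconds=300):
        self.max_queries = max_queries        # assumed limit, tune per deployment
        self.window_seconds = window_seconds  # assumed window length
        self.timestamps = deque()
        self.seen_namespaces = set()

    def observe(self, ts, namespace):
        """Record one query at time ts (in seconds) and return a list of flags."""
        flags = []
        self.timestamps.append(ts)
        # Drop queries that have fallen out of the sliding window.
        while self.timestamps and ts - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) > self.max_queries:
            flags.append("bulk_query")
        # The very first namespace a user touches is not treated as a spike.
        if self.seen_namespaces and namespace not in self.seen_namespaces:
            flags.append("new_namespace")
        self.seen_namespaces.add(namespace)
        return flags
```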
GOVERNANCE · STAGE 04
Observability Infrastructure
Observability infrastructure is the difference between a system you operate and a system that operates you. Prometheus collects time-series metrics from every layer: query latency by stage (ingestion lag, retrieval time broken down by dense/sparse/reranker, inference TTFT and TPS), error rates by category (retrieval miss, confidence gate fail, citation verify fail, model error), cache hit rates by tier, cost per query attributed to department and model tier, and hallucination signal rates per query category. Grafana renders three distinct dashboard views, each optimised for a different audience. The operations dashboard shows real-time latency percentiles, error rates, cache performance, and model tier distribution — the metrics engineers need to diagnose performance issues within seconds. The compliance dashboard shows RBAC event frequency, hallucination suppression rate, confidence threshold pass/fail distribution, and zone-by-zone control status — the view a compliance officer needs to verify that controls are operating. The executive dashboard shows quality KPIs (RAGAS scores trend, satisfaction score trend), cost per query by department, and risk indicators (hallucination rate, suppression rate, escalation frequency) — the strategic view for leadership. Custom query tracer records the complete path of every query through the system: which chunks were retrieved, their confidence scores, the reranking decisions, which prompt template was used, the model configuration, and the final response — creating a complete forensic record of every query-response pair. Alerting rules define the thresholds that trigger PagerDuty or equivalent notifications: p95 latency above 800ms, error rate above 2%, hallucination rate above 5%, or confidence distribution shift above 10% from baseline.
Prometheus metrics · Grafana (ops · compliance · executive dashboards) · Custom query tracer · Latency p50/p95/p99 · Error rate · Cache hit rate · Cost per query · Alerting rules
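In a deployment like the one described, these thresholds would normally live in the alerting system's own rule configuration. The Python sketch below only illustrates the logic of the four rules quoted above; the function names and alert labels are hypothetical, and the nearest-rank percentile is one of several common definitions.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def firing_alerts(latencies_ms, error_rate, hallucination_rate, confidence_shift):
    """Return the names of the alert rules breached by the current readings."""
    alerts = []
    if percentile(latencies_ms, 95) > 800:
        alerts.append("p95_latency")
    if error_rate > 0.02:
        alerts.append("error_rate")
    if hallucination_rate > 0.05:
        alerts.append("hallucination_rate")
    if abs(confidence_shift) > 0.10:  # relative shift from baseline
        alerts.append("confidence_shift")
    return alerts
```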
GOVERNANCE · STAGE 05
Audit Dashboards + Regulator Export
The audit dashboard is the interface between the system's technical operations and the compliance function's regulatory obligations. The compliance dashboard surfaces four categories of evidence in a form that a compliance officer can interpret without engineering expertise: RBAC status (are all access controls active, when was the last access review, are there any expired permissions outstanding?), hallucination control status (what is the current contradiction rate, how does it compare to the baseline, how many responses were suppressed in the last 30 days?), compliance zone status (are all four zones active, when did each last fire, are there any zone failures?), and audit log integrity (is the WORM log accepting writes, when was the last integrity check, is the SHA-256 chain unbroken?). Regulator export is the capability that transforms the system's audit trail into a packaged evidence submission. When a SOC 1 Type 2 audit occurs, the auditor requires evidence that specific controls operated effectively over a 12-month period. The regulator export function produces a structured package: access review records, RBAC enforcement event logs, hallucination suppression logs, confidence threshold calibration records, RAGAS evaluation history, and the SHA-256-verified audit log chain — all filtered to the requested audit period and formatted for submission. This capability exists because assembling this evidence manually under audit pressure is how compliance failures happen.
Ops dashboard · Compliance dashboard · Executive KPI dashboard · Regulator export package · SOC 1 T2 evidence assembly · Audit period filtering
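The audit-period filtering step of the regulator export can be sketched as a simple grouping function. The record shape (a `type` and a `date` field) and the function name are assumptions for illustration; the actual package format is not specified here.

```python
from datetime import date

def export_package(records, start, end):
    """Group evidence records dated within [start, end] by evidence type."""
    package = {}
    for rec in records:
        if start <= rec["date"] <= end:
            package.setdefault(rec["type"], []).append(rec)
    for items in package.values():
        items.sort(key=lambda r: r["date"])  # chronological order per type
    return package
```

Calling it with a 12-month window returns one chronologically ordered list per evidence type, for example access reviews or hallucination suppression logs, and omits anything outside the requested period.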
COMPLIANCE ZONE B
Audit Trail Integrity
Zone B ensures that the audit trail itself is trustworthy — that the logs cannot be modified, that the chain of evidence is unbroken, and that the evidence package produced for a regulator accurately reflects what actually happened. Cross-layer log join connects the query event logs from the pre-query gate (Zone A, ingestion compliance zone), the retrieval trace logs (Zones B and D, retrieval compliance), the processing quality gate logs (Zone A, processing), and the inference audit trail (Zone C, inference) into a single queryable record. For any query, an auditor can trace the complete chain: who asked it, what was retrieved, how confident the retrieval was, what prompt was used, what model produced the response, whether the response passed citation verification, and what the user received. WORM log store uses cloud-native immutable storage (S3 Object Lock in compliance mode, Azure Blob immutability policy, GCS bucket retention policy) to ensure that logs cannot be modified, deleted, or overwritten after writing. SHA-256 chain hash creates a cryptographic chain across log entries — each log entry includes the hash of the previous entry, making any tampering with historical records detectable. Periodic audit report generation runs automatically on a monthly basis, producing an ISO 9001 §9.2 internal audit report that documents the operational status of all compliance controls — this is not a manually assembled document, it is generated from the system's own logs and submitted to the compliance function for review and sign-off.
SOC 1 Type 2 · ISO 9001 §9.1 · Cross-layer log join · WORM storage · SHA-256 chain hash · Monthly auto-generated audit report · 12-month retention minimum
Zone B controls
  • SOC 1 Type 2 — 12 months of WORM logs as operating effectiveness evidence
  • ISO 9001 §9.1 — monitoring and measurement: metrics logged continuously, not spot-checked
  • SHA-256 chain hash — cryptographic tamper detection across audit log entries
  • Cross-layer log join — complete query-to-response chain reconstructable for any historical query
  • Monthly audit report auto-generated from logs — ISO 9001 §9.2 internal audit evidence
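The SHA-256 chain hash can be illustrated in a few lines: each entry stores the hash of its predecessor, so changing any historical payload invalidates every later hash. The entry layout, the all-zero genesis value, and the function names below are assumptions made for this sketch.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry

def _entry_hash(prev_hash, payload):
    # Canonical JSON (sorted keys) so the same payload always hashes the same.
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_entry(chain, payload):
    """Append a log entry that includes the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": _entry_hash(prev_hash, payload),
    })
    return chain

def verify_chain(chain):
    """Recompute every hash in order; False means the chain was altered."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

The chain detects tampering but does not prevent it, which is why it is paired with immutable storage: the WORM store blocks modification, and the periodic integrity check proves that none occurred.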
EVALUATION · STAGE 06
RAGAS Offline Evaluation
RAGAS (Retrieval Augmented Generation Assessment) is the evaluation framework that quantifies RAG system quality across four independent dimensions, each measuring a different aspect of the system's performance.
  • Context precision measures the proportion of retrieved context that was actually used in the response. High precision means the retrieval layer is returning relevant content; low precision means it is returning noise that the inference layer ignores. Formula: (relevant chunks used in response) / (total chunks retrieved). Target: above 0.80 for most enterprise workloads.
  • Context recall measures the proportion of relevant content that was retrieved. High recall means the retrieval layer is not missing important information; low recall means relevant documents exist but are not being found. Formula: (relevant chunks retrieved) / (total relevant chunks in corpus for this query). Target: above 0.75.
  • Faithfulness measures the proportion of claims in the response that are directly grounded in the retrieved context. This is the RAGAS hallucination metric. Formula: (claims entailed by retrieved context) / (total claims in response). Target: above 0.95 for regulated domains.
  • Answer relevance measures whether the response actually addresses the question asked. A response can be perfectly cited and faithful to its sources but still fail to answer the question. Formula: LLM-judged relevance score on 0–1 scale. Target: above 0.85.
RAGAS runs offline against a golden evaluation dataset: a curated set of query-answer pairs with human relevance judgements, updated quarterly and version-controlled as a compliance artefact. Results are logged with timestamps, dataset versions, and model configurations, creating a time-series of system quality that is the primary evidence for ISO 9001 §9.1 performance monitoring.
Context precision >0.80 · Context recall >0.75 · Faithfulness >0.95 · Answer relevance >0.85 · Golden dataset quarterly update · Score trend in WORM log
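The three ratio formulas above can be written out directly. This sketch implements the formulas as stated in this section, using set arithmetic over chunk IDs and a list of per-claim entailment flags; it is not the `ragas` library, whose own metric definitions may differ in detail. Answer relevance is omitted because it depends on an LLM judge.

```python
def context_precision(retrieved, used):
    """(chunks used in the response) / (total chunks retrieved)."""
    retrieved = set(retrieved)
    return len(retrieved & set(used)) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved, relevant):
    """(relevant chunks retrieved) / (total relevant chunks for this query)."""
    relevant = set(relevant)
    return len(relevant & set(retrieved)) / len(relevant) if relevant else 0.0

def faithfulness(claim_is_entailed):
    """(claims entailed by retrieved context) / (total claims in response)."""
    if not claim_is_entailed:
        return 0.0
    return sum(claim_is_entailed) / len(claim_is_entailed)
```

For example, retrieving four chunks and using three gives a precision of 0.75, which falls below the 0.80 target quoted above.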
EVALUATION · STAGE 07
A/B Testing
A/B testing in a RAG system is not a UX experiment — it is a controlled quality improvement protocol. Every significant change to the system (new embedding model, updated chunking strategy, revised confidence threshold, new prompt version, alternative reranker) should be validated through a controlled A/B test before being promoted to production. The A/B testing framework routes a defined percentage of production queries (typically 10–20%) to the challenger configuration while the remaining queries continue to be served by the current configuration. Both configurations log their full retrieval traces and RAGAS scores, enabling direct statistical comparison. Statistical significance threshold is p<0.05 — the challenger configuration must demonstrate better RAGAS scores at statistical significance before it is eligible for promotion. This prevents premature promotion based on random variation rather than genuine improvement. The A/B decision log records: what was tested, the hypothesis, the traffic split, the duration, the RAGAS scores for control and challenger, the statistical significance result, and the promotion decision with the name of the approving authority. This log is a SOC 1 change management evidence artefact — it documents that every change to the production system was tested, evaluated, and approved before deployment.
Retrieval strategy A/B · Embedding model A/B · Prompt variant A/B · p<0.05 significance required · A/B decision log WORM · Approver name + rationale
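The text requires p<0.05 but does not name a statistical test. One reasonable choice for comparing per-query scores is a one-sided permutation test, sketched below with only the standard library. The function names, the permutation count, and the fixed seed are assumptions for the example.

```python
import random
from statistics import mean

def permutation_p_value(control, challenger, n_permutations=2000, seed=0):
    """One-sided p-value that the challenger's higher mean arose by chance."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    observed = mean(challenger) - mean(control)
    pooled = list(control) + list(challenger)
    n_control = len(control)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

def eligible_for_promotion(control, challenger, alpha=0.05):
    """True when the challenger beats control at the required significance."""
    return permutation_p_value(control, challenger) < alpha
```

A permutation test makes no normality assumption about the score distribution, which suits bounded 0–1 metrics. The p-value and the inputs used to compute it are what the A/B decision log would record alongside the approver's name.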
EVALUATION · STAGE 08
User Signal Loop
User signals are the earliest available indicator of system quality degradation. Automated RAGAS evaluation runs on a schedule, daily or weekly. User signals are continuous. A user who receives a hallucinated response and marks it thumbs down has provided a quality signal hours or days before the automated evaluation would have detected the degradation. Five signal types feed the loop. Explicit thumbs up/down ratings are the clearest signal but have the lowest capture rate: typically 3–8% of queries receive explicit ratings. Query rephrasing is the most common implicit signal: when a user immediately rephrases a query after receiving a response, it indicates the response was unsatisfactory. Session abandonment occurs when a user receives a response but does not select it, closes the session, and does not return, indicating the system failed to provide useful information. Correction submission is the most valuable signal: when a user manually provides what they consider the correct answer after receiving the system's response, that correction is a labelled training example for fine-tuning. CSAT (Customer Satisfaction Score) trend monitors the rolling average of satisfaction ratings, providing a leading indicator of quality drift that precedes RAGAS score drops by days to weeks. All user signals are attributed to the specific query, retrieved chunks, and prompt configuration that produced the response, enabling targeted investigation of whether dissatisfaction is driven by retrieval quality, response quality, or topic gaps.
Explicit thumbs up/down · Query rephrase (implicit) · Correction submission (labelled training) · Session abandon signal · CSAT rolling trend · Signal attributed to query config
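Query rephrasing has to be inferred rather than observed, so it needs a heuristic. The sketch below is one possible approach using string similarity from the standard library; the 60-second gap, the 0.6 similarity cut-off, and the function name are assumptions, and a production system might compare query embeddings instead.

```python
from difflib import SequenceMatcher

def is_rephrase(prev_query, next_query, seconds_between,
                max_gap_seconds=60, min_similarity=0.6):
    """Heuristic: a similar but not identical query sent soon after a response."""
    if seconds_between > max_gap_seconds:
        return False
    a, b = prev_query.lower().strip(), next_query.lower().strip()
    if a == b:
        return False  # identical resubmission is treated as a retry
    return SequenceMatcher(None, a, b).ratio() >= min_similarity
```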
COMPLIANCE ZONE C
Evaluation Evidence Log
Zone C ensures that the evaluation process itself is auditable — that the measurements of system quality are traceable, reproducible, and cannot be selectively reported. RAGAS score archive stores every evaluation run in WORM storage with: the timestamp of the run, the version of the golden evaluation dataset used, the RAGAS scores for each dimension, the model and configuration being evaluated, and the hash of the RAGAS evaluation code. This archive enables any historical evaluation to be reproduced: given the same golden dataset version, model configuration, and evaluation code, the same scores should be produced. A/B decision log records the complete audit trail for every change to the production system: the hypothesis, the traffic split, the duration, the statistical results, and critically, the name and role of the person who approved the promotion. This is the SOC 1 CC8.1 change management evidence — it demonstrates that changes to the system were controlled, tested, and approved. Golden dataset version control treats the evaluation corpus as a first-class software artefact: each version is tagged, the changelog between versions is documented, and the rationale for adding or removing evaluation examples is recorded. When RAGAS scores change significantly between evaluations, the audit record should be able to determine whether the change was driven by a system change or an evaluation dataset change. Quality trend report aggregates all evaluation evidence into a periodic report for the compliance function, documenting whether the system's measured quality is meeting defined targets and whether any corrective actions have been taken.
ISO 9001 §9.1.1 · SOC 1 CC7.2 · RAGAS score WORM archive · A/B decision log with approver · Golden dataset version control · Quality trend report
Zone C controls
  • ISO 9001 §9.1.1 — monitoring and measurement: RAGAS evaluation on defined schedule with archived results
  • SOC 1 CC7.2 — system monitoring: quality metrics logged continuously, drift detected automatically
  • A/B decision log — SOC 1 CC8.1 change management: every system change tested and documented
  • Golden dataset version control — evaluation reproducibility: historical scores can be verified
  • WORM archive — evaluation evidence cannot be selectively modified or deleted
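Reproducibility depends on archiving the inputs that determine a score, not just the score. A minimal sketch of an archive entry follows; the field names and the `run_fingerprint` idea are assumptions layered on the fields listed above.

```python
import hashlib
import json

def evaluation_record(timestamp, dataset_version, model_config, scores, eval_code):
    """Build an archive entry with a fingerprint of everything needed to rerun it."""
    inputs = {
        "dataset_version": dataset_version,
        "model_config": model_config,
        "eval_code_sha256": hashlib.sha256(eval_code.encode("utf-8")).hexdigest(),
    }
    # The fingerprint covers the inputs only, so two runs with the same
    # dataset, configuration, and code share it regardless of run time.
    fingerprint = hashlib.sha256(
        json.dumps(inputs, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {"timestamp": timestamp, **inputs, "scores": scores,
            "run_fingerprint": fingerprint}
```

When two runs share a fingerprint but report different scores, the difference cannot be explained by the dataset or the code, which narrows an investigation considerably.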
EVALUATION · STAGE 09
Knowledge Gap Detection
Knowledge gap detection is the feedback mechanism that connects the governance and evaluation layer back to the ingestion layer. A RAG system's quality is fundamentally limited by its corpus — if the knowledge base does not contain the answer to a category of question, no retrieval optimisation or model improvement will fix it. The gap detection system identifies these limitations before they manifest as user frustration. Miss-rate by category monitors the percentage of queries in each topic category that fail the retrieval confidence threshold. A sustained miss-rate above 5% in a category over a 7-day rolling window triggers a knowledge gap alert — indicating that the system is being asked questions about a topic for which the corpus is insufficient. Declination clustering analyses the content of confidence-gated queries to identify topic patterns. When 30 users over a week all ask questions that the system declined to answer, and those questions cluster around "new EU AI Act compliance requirements," that cluster is an actionable corpus gap signal: the knowledge base needs EU AI Act documentation. Corpus coverage analysis compares the distribution of query embeddings against the distribution of indexed chunk embeddings in embedding space — areas of query space that have no nearby index content are structural gaps. Gap prioritisation scores each identified gap by query frequency (how often is the system being asked about this topic?) multiplied by business impact (how important is it to answer questions about this topic?) to produce a ranked list of corpus additions for the content team.
Miss-rate >5% per category → gap alert · Declination topic clustering · Corpus coverage embedding analysis · Gap priority = frequency × impact · → triggers ingestion
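The miss-rate alert and the prioritisation score can be sketched as below. The text does not define "sustained" precisely, so this example interprets it as the 7-day average exceeding 5%; the 1–5 business impact scale, the field names, and the function names are likewise assumptions.

```python
def gap_alert(daily_miss_rates, limit=0.05, window=7):
    """True when the average miss-rate over the last `window` days exceeds limit."""
    recent = daily_miss_rates[-window:]
    if len(recent) < window:
        return False  # wait for a full window before alerting
    return sum(recent) / window > limit

def prioritise_gaps(gaps):
    """Rank gaps by query frequency multiplied by business impact.

    gaps: list of dicts with 'topic', 'weekly_queries', and 'business_impact'
    (impact on an assumed 1-5 scale).
    """
    scored = [
        {**g, "priority": g["weekly_queries"] * g["business_impact"]}
        for g in gaps
    ]
    return sorted(scored, key=lambda g: g["priority"], reverse=True)
```

A topic asked 30 times a week with impact 5 scores 150 and outranks a topic asked 50 times with impact 1, so the content team works on the high-impact gap first.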
EVALUATION · STAGE 10
Continuous Improvement Loop
The continuous improvement loop is what transforms a RAG deployment into a RAG product — a system that gets measurably better over time rather than decaying as the world changes around it. The loop has four phases. Trigger: a quality signal fires — a RAGAS score drops below baseline, a knowledge gap is identified, a user correction pattern emerges, or an A/B test demonstrates a challenger improvement. Action: the appropriate response is initiated — for a corpus gap, the ingestion pipeline is triggered to add documents; for a RAGAS score drop, the embedding model or chunking strategy is reviewed and a fine-tuning run initiated; for a prompt quality issue, a new prompt variant enters A/B testing. Verify: the proposed change is deployed in shadow mode — running in parallel with the current production configuration without serving users — and evaluated against the golden dataset and a sample of recent production queries. RAGAS scores for the shadow deployment must meet or exceed the current production baseline before the change is eligible for promotion. Promote: the change is promoted to production through the A/B framework, ensuring that the improvement holds in production traffic before full rollout. Every loop iteration is recorded in the improvement evidence log — including triggers, actions taken, shadow evaluation results, and promotion decisions. This log is the ISO 9001 §10.3 continual improvement evidence, demonstrating that the organisation has an active process for identifying and acting on opportunities for improvement in the AI system.
Trigger → action → shadow deploy → RAGAS verify → A/B promote · Corpus gap → ingest · RAGAS drop → fine-tune · Every loop iteration logged · ISO 9001 §10.3 evidence
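The trigger-to-action mapping and the promote gate can be expressed compactly. In this sketch the trigger and action labels are hypothetical names for the cases described above, `manual_triage` is an assumed fallback, and the gate is read strictly: every baseline metric must be met or exceeded.

```python
ACTIONS = {
    "corpus_gap": "trigger_ingestion",
    "ragas_drop": "review_and_fine_tune",
    "prompt_issue": "start_prompt_ab_test",
}

def next_action(trigger):
    """Map a quality trigger to its response; unknown triggers go to a human."""
    return ACTIONS.get(trigger, "manual_triage")

def passes_promote_gate(shadow_scores, baseline_scores):
    """True when the shadow deployment meets or exceeds every baseline metric."""
    return all(
        shadow_scores.get(metric, 0.0) >= value  # a missing metric fails the gate
        for metric, value in baseline_scores.items()
    )
```

Requiring every metric to hold prevents a change that raises faithfulness while lowering recall from being promoted on the strength of one number.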
COMPLIANCE ZONE D
Improvement Evidence + Change Management
Zone D closes the compliance loop: every improvement made to the system — every model update, every corpus addition, every prompt change, every threshold recalibration — is documented as a controlled change with before-and-after quality evidence. Change log records the complete change history of the production system: what changed, who made the change, when, with what approval, and for what reason. This is the SOC 1 CC8.1 change management evidence — demonstrating that changes to the system were controlled, not ad hoc. Before/after RAGAS scores are logged with every change — the system's measured quality at the time of the change and at 30 days after the change, enabling the organisation to demonstrate that changes improved the system or at minimum did not degrade it. Rollback capability is architectural, not aspirational: every component of the production system (index version, embedding model, prompt template, confidence thresholds) has a version tag and a documented rollback procedure. If a change causes unexpected quality degradation, rollback to the previous version can be initiated within minutes without data loss. Corrective action records document the organisation's response to identified quality failures — when a RAGAS score drops, when a hallucination incident occurs, or when a compliance zone fails, the corrective action record documents what happened, what caused it, what was done to fix it, and what was done to prevent recurrence. These records are the ISO 9001 §10 corrective action evidence, demonstrating that the organisation treats quality failures as improvement opportunities rather than isolated incidents.
ISO 9001 §10.3 · SOC 1 CC8.1 · Change log with approver · Before/after RAGAS scores · Rollback capability documented · Corrective action records · Continual improvement evidence
Zone D controls
  • SOC 1 CC8.1 — change management: every system change controlled, tested, approved, and documented
  • ISO 9001 §10.3 — continual improvement: documented process for identifying and acting on improvement opportunities
  • Rollback capability — every production component can be reverted within minutes with documented procedure
  • Corrective action records — ISO 9001 §10.2 evidence: failures treated as improvement opportunities
  • Before/after quality scores — demonstrates changes improved or maintained system quality
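A change-log entry that satisfies the Zone D controls can be modelled as an immutable record. This is a sketch under assumptions: the `ChangeRecord` class, its field names, and the example values are hypothetical, chosen to mirror the items the text lists (what changed, who, when, approval, reason, before/after scores, rollback target).

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    component: str            # e.g. index version, embedding model, prompt template
    from_version: str
    to_version: str
    actor: str
    approver: str
    reason: str
    ragas_before: dict
    ragas_after_30d: Optional[dict] = None   # filled in at the 30-day follow-up
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def rollback_target(self) -> str:
        """The version tag to revert to if the change degrades quality."""
        return self.from_version

    def quality_maintained(self) -> Optional[bool]:
        """None until the 30-day scores exist; True if no dimension regressed."""
        if self.ragas_after_30d is None:
            return None
        return all(
            self.ragas_after_30d[name] >= score
            for name, score in self.ragas_before.items()
        )


change = ChangeRecord(
    component="prompt_template",
    from_version="v11",
    to_version="v12",
    actor="a.engineer",
    approver="g.owner",
    reason="tighten citation mandate wording",
    ragas_before={"faithfulness": 0.95, "answer_relevance": 0.90},
)

# The record is frozen, so the follow-up produces a new record rather than
# mutating the original entry.
followed_up = replace(
    change, ragas_after_30d={"faithfulness": 0.97, "answer_relevance": 0.90}
)
print(followed_up.quality_maintained(), followed_up.rollback_target())
```

Freezing the dataclass only prevents accidental mutation in process; the tamper resistance the text describes still depends on where the serialized record is stored.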
Key Metrics

The Numbers That Define a Governed System

These are the metrics that must be instrumented, baselined, and alerting before a production RAG system can be considered governed. Each metric has a target range, a measurement method, and a consequence if the target is breached.

Quality
RAGAS Faithfulness
>0.95
Proportion of response claims grounded in retrieved context. Below 0.90 in regulated domains triggers immediate investigation. Below 0.85 triggers system review.
Quality
Context Precision
>0.80
Proportion of retrieved chunks actually used in the response. Low precision indicates retrieval noise. Below 0.70 triggers retrieval strategy review.
Quality
Context Recall
>0.75
Proportion of relevant corpus content retrieved per query. Low recall indicates retrieval gaps. Below 0.65 triggers embedding model or chunking review.
Safety
Contradiction Rate
<2%
Percentage of response claims contradicting their cited source. Above 2% triggers hallucination investigation. Above 5% triggers system suspension for review.
Governance
Declination Rate
<15%
Percentage of queries failing confidence threshold. Above 15% per category indicates a corpus gap requiring ingestion action — not a retrieval or inference problem.
Performance
TTFT p95
<500ms
Time to first token at 95th percentile. Above 800ms triggers infrastructure review. Measured end-to-end from query received to first token delivered.
Performance
Retrieval Latency p95
<200ms
ANN search + reranking time at 95th percentile. Above 300ms triggers vector index or reranker investigation. Tracked separately from total query latency.
Cost
Cost per Query
by tier
$0.001 simple · $0.01 moderate · $0.10 complex. Tracked by department and use case. 20% increase from baseline triggers cost routing review.
Satisfaction
CSAT Trend
>4.0/5
Rolling 30-day average satisfaction score. A 10% relative drop from baseline is the leading indicator of quality degradation — typically appears 2–3 weeks before RAGAS scores drop.
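The threshold-and-consequence pairs above can be held as data and evaluated mechanically. The sketch below assumes a hypothetical `RULES` table and metric key names; the limits and consequences are copied from the metric descriptions, with rates expressed as fractions (2% as 0.02).

```python
# (direction, tiers) per metric; tiers are ordered most severe first.
# "min": breach when the value falls below the limit.
# "max": breach when the value rises above the limit.
RULES = {
    "faithfulness": ("min", [
        (0.85, "system review"),
        (0.90, "immediate investigation (regulated domains)"),
    ]),
    "context_precision": ("min", [(0.70, "retrieval strategy review")]),
    "context_recall": ("min", [(0.65, "embedding model or chunking review")]),
    "contradiction_rate": ("max", [
        (0.05, "system suspension for review"),
        (0.02, "hallucination investigation"),
    ]),
    "declination_rate": ("max", [(0.15, "corpus gap: ingestion action")]),
    "ttft_p95_ms": ("max", [(800, "infrastructure review")]),
    "retrieval_latency_p95_ms": ("max", [(300, "vector index or reranker investigation")]),
}


def consequence(metric, value):
    """Return the most severe consequence the value triggers, or None."""
    direction, tiers = RULES[metric]
    for limit, action in tiers:
        breached = value < limit if direction == "min" else value > limit
        if breached:
            return action
    return None


snapshot = {"faithfulness": 0.88, "contradiction_rate": 0.06, "ttft_p95_ms": 450}
for name, value in snapshot.items():
    print(name, value, consequence(name, value))
```

Cost per query and CSAT are left out because their triggers are relative to a baseline rather than fixed limits.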
Evaluation Framework

RAGAS — The Four Dimensions That Matter

RAGAS measures RAG system quality across four independent dimensions. Each dimension measures a distinct aspect of system behaviour. A system can score perfectly on three dimensions and fail on the fourth — and that failure matters for a specific category of user query. All four must be monitored.

Dimension 1
Context Precision
relevant_used / total_retrieved
Measures retrieval signal-to-noise ratio. A score of 0.90 means 90% of retrieved chunks contributed to the response. A score of 0.40 means 60% of retrieved chunks were noise — content that was retrieved but ignored by the LLM. Low context precision inflates inference costs (the model processes unnecessary context), increases latency (larger context windows take longer to process), and degrades response quality (noise dilutes the signal). The fix for low context precision is almost always in the retrieval layer: reranking threshold too low, namespace isolation not tight enough, or chunk size too large causing topically diffuse chunks to be retrieved.
Dimension 2
Context Recall
relevant_retrieved / total_relevant
Measures retrieval completeness. A score of 0.80 means the system retrieved 80% of the relevant information that exists in the corpus for a given query. The remaining 20% was present but not retrieved — a silent failure. Low context recall means the system is giving answers based on partial information, which is particularly dangerous for regulatory queries where the answer requires considering all applicable clauses. The fix for low context recall is almost always in the processing layer: embedding model not domain-adapted, chunking strategy splitting related context across non-overlapping chunks, or corpus coverage gaps that knowledge gap detection should have flagged.
Dimension 3
Faithfulness
grounded_claims / total_claims
Measures the hallucination rate directly. A score of 0.97 means 97% of claims in the response are directly supported by the retrieved context; the remaining 3% are not, and those are hallucinations. This is the most critical RAGAS dimension for regulated industries. Faithfulness is measured per claim, not per query: in a system answering 10,000 queries per day, a score of 0.97 means roughly 300 responses per day contain an unsupported claim if each response makes a single claim, and potentially more when responses make several claims each. At a 0.72 average confidence threshold, the confidence gate catches some of these, but not all, because low confidence and hallucination are not perfectly correlated. The fix for low faithfulness is in the inference layer: citation mandate in the prompt, NLI-based output quality gate, and response suppression when citation verification fails.
Dimension 4
Answer Relevance
LLM relevance score (0–1)
Measures whether the response addresses the question that was asked. A response can be perfectly faithful to its cited sources and still fail to answer the question — this happens when the retrieval layer returns tangentially relevant content and the inference layer synthesises a coherent response from that content that happens not to address the original question. Low answer relevance is often a symptom of query understanding failure: the system did not correctly identify the user's intent, so it retrieved and answered the wrong question. The fix is in the query understanding layer: better intent classification, query decomposition for multi-part questions, and step-back reformulation for questions that require abstracting to a more general topic first.
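The three ratio-based dimensions can be computed directly once the labels exist. The sketch below follows the ratios exactly as stated in this section; it is not the RAGAS library's implementation, which estimates these quantities with LLM judgements and may define them differently. Function names, chunk IDs, and the per-claim support flags are assumptions for illustration. Answer relevance is omitted because it is an LLM-assigned score rather than a ratio.

```python
def ratio(numerator, denominator):
    """Safe division: an empty denominator yields 0.0 rather than an error."""
    return numerator / denominator if denominator else 0.0


def context_precision(retrieved_ids, used_ids):
    """relevant_used / total_retrieved"""
    retrieved = set(retrieved_ids)
    return ratio(len(retrieved & set(used_ids)), len(retrieved))


def context_recall(retrieved_ids, relevant_ids):
    """relevant_retrieved / total_relevant"""
    relevant = set(relevant_ids)
    return ratio(len(relevant & set(retrieved_ids)), len(relevant))


def faithfulness(claim_supported):
    """grounded_claims / total_claims, given one boolean per response claim."""
    return ratio(sum(claim_supported), len(claim_supported))


# Hypothetical single-query example.
retrieved = ["c1", "c2", "c3", "c4", "c5"]   # chunks returned by retrieval
used = ["c1", "c3"]                          # chunks the response actually drew on
relevant = ["c1", "c3", "c7", "c9"]          # golden-dataset relevant chunks
claims = [True, True, True, False]           # per-claim support, e.g. from NLI

print(context_precision(retrieved, used))    # 2 of 5 retrieved chunks were used
print(context_recall(retrieved, relevant))   # 2 of 4 relevant chunks were retrieved
print(faithfulness(claims))                  # 3 of 4 claims are grounded
```

In this example the same query scores low on precision and recall for different reasons: three retrieved chunks were noise, and two relevant chunks were never retrieved.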
Compliance Architecture

Four Zones. Governance to Improvement.

The governance and evaluation layer has four compliance zones that span the complete system lifecycle — from real-time access enforcement through continuous improvement evidence. Together they create an unbroken compliance chain that satisfies the operating effectiveness requirements of SOC 1 Type 2 and the process quality requirements of ISO 9001.

Zone A · Access + hallucination governance
SOC 1 CC6.1 · CC6.6 · ISO 9001 §8.7
  • RBAC enforced on every query in real time — not just at session establishment
  • Hallucination gate: contradiction-scored responses suppressed before delivery
  • Anomaly detection: bulk query, privilege spike, cross-namespace attempt
  • Governance event log in WORM storage — continuous enforcement evidence
  • Quarterly access review with de-provisioning evidence
  • Privilege escalation → security investigation record created automatically
Zone B · Audit trail integrity
SOC 1 Type 2 · ISO 9001 §9.1
  • Cross-layer log join: complete query → chunks → prompt → response chain
  • WORM storage: logs cannot be modified after writing
  • SHA-256 chain hash: cryptographic tamper detection across log entries
  • 12-month retention minimum for SOC 1 Type 2 evidence period
  • Monthly auto-generated audit report for ISO 9001 §9.2 internal audit
  • Any historical query fully reconstructable from archived logs
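The SHA-256 chain hash in the Zone B list works by folding each entry's predecessor hash into its own hash, so altering any archived entry breaks every link after it. A minimal sketch, assuming a hypothetical entry layout (`payload`, `prev_hash`, `hash`) and an all-zero genesis value:

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash, payload):
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain, payload):
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "hash": entry_hash(prev, payload)})


def verify(chain):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        expected = entry_hash(prev, entry["payload"])
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            return index
        prev = entry["hash"]
    return None


log = []
append(log, {"query_id": "q1", "user": "alice", "chunk_ids": ["c1", "c3"]})
append(log, {"query_id": "q2", "user": "bob", "chunk_ids": ["c7"]})
append(log, {"query_id": "q3", "user": "alice", "chunk_ids": ["c2"]})
print(verify(log))
```

Canonical serialization (`sort_keys=True`) matters here: without it, two semantically identical payloads could hash differently and produce false tamper alarms.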
Zone C · Evaluation evidence
ISO 9001 §9.1.1 · SOC 1 CC7.2
  • RAGAS scores archived with timestamp, dataset version, model config
  • A/B decision log: hypothesis, results, approver, promotion rationale
  • Golden dataset version-controlled as compliance artefact
  • Quality trend report: are targets being met? Any corrective actions taken?
  • Evaluation reproducible: same dataset + config → same scores
  • WORM archive: evaluation evidence cannot be selectively modified
Zone D · Improvement evidence
ISO 9001 §10.3 · SOC 1 CC8.1
  • Change log: every system change with actor, timestamp, approval, reason
  • Before/after RAGAS scores logged for every production change
  • Rollback procedure documented for every system component
  • Corrective action records: failure → root cause → fix → prevention
  • Continual improvement evidence: ISO 9001 §10.3 active improvement process
  • SOC 1 CC8.1: changes tested in shadow before production promotion
Architectural note: Task AI Systems designs governance and evaluation systems aligned with SOC 1 Type 2 operating effectiveness requirements, ISO 9001 process quality and continual improvement controls, and GDPR accountability principles. The four-zone compliance architecture creates an unbroken evidence chain from access enforcement through improvement documentation that satisfies both internal governance requirements and external regulatory scrutiny. Formal certification responsibilities remain with your organisation's compliance function.
Operational Resilience

Every Governance Failure Mode. Every Recovery Path.

Governance failures are the most dangerous category because they are often invisible — the system continues to operate, answers continue to be delivered, and the failure is only discovered during an audit or after a regulatory incident.

Silent RAGAS degradation
RAGAS faithfulness drops from 0.97 to 0.88 over six weeks as the model drifts. No automated alert fires. Users notice increasing answer quality issues. By the time it is escalated, six weeks of potentially incorrect answers have been delivered and logged — but the degradation was never formally identified or acted upon.
Orphaned access permissions
An employee changes departments but retains access to the previous department's namespace for eight months because the access review process was not completed on schedule. During a SOC 1 audit, the auditor identifies that the access review was six months overdue. The finding is a control failure regardless of whether the employee accessed any restricted content.
Audit log gap
A misconfiguration in the log forwarder causes 48 hours of query logs to be lost before the WORM store receives them. The gap is discovered during audit preparation. The auditor cannot verify that controls operated during the gap period — the log gap is a SOC 1 finding regardless of whether any control violations actually occurred.
Corpus gap ignored
A new regulation takes effect. Queries about the new regulation have a 40% declination rate. The knowledge gap is correctly identified by the detection system. However, no one owns the action to add the relevant regulatory documents to the corpus. Users continue to receive declinations for three months until the gap is escalated manually.
Unapproved system change
An engineer updates the confidence threshold configuration directly in production without going through the A/B testing and approval process. The change reduces the declination rate but also allows more low-confidence responses through. During audit, the change cannot be attributed to an approved change record — SOC 1 CC8.1 finding.
Golden dataset staleness
The golden evaluation dataset has not been updated in 18 months. New query patterns have emerged that are not represented in the dataset. RAGAS scores appear stable at 0.95 because they are evaluated against an outdated dataset — the system is actually performing poorly on the new query categories, but the evaluation is not measuring them.
Recovery path for each failure mode
RAGAS drift alerting with 7-day rolling baseline
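Prometheus monitors RAGAS scores on a daily evaluation schedule. A 10% relative drop from the 7-day rolling baseline triggers a PagerDuty alert before the degradation is user-visible. Because a slow decline can stay within a rolling baseline that declines with it, the absolute floors from the key metrics (for example, faithfulness below 0.90) are alerted on as well. Each alert includes the specific dimension that dropped (faithfulness, precision, recall, or relevance) and the magnitude, enabling targeted investigation.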
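The rolling-baseline comparison itself is a few lines of arithmetic. In the sketch below, the function name, the data layout, and the example scores are hypothetical. The optional `floors` argument is an assumption that reuses the absolute thresholds from the Key Metrics section, since a gradual decline can hide inside a baseline that is itself declining. Alert delivery is out of scope.

```python
from statistics import mean


def drift_alerts(history, today, window=7, max_relative_drop=0.10, floors=None):
    """Compare today's scores with the rolling baseline and optional floors.

    history: {dimension: [daily scores, oldest first]}
    today:   {dimension: today's score}
    """
    floors = floors or {}
    alerts = []
    for dimension, score in today.items():
        recent = history.get(dimension, [])[-window:]
        if len(recent) == window:
            baseline = mean(recent)
            drop = (baseline - score) / baseline if baseline > 0 else 0.0
            if drop >= max_relative_drop:
                alerts.append((dimension, "relative_drop", round(drop, 3)))
        if dimension in floors and score < floors[dimension]:
            alerts.append((dimension, "below_floor", floors[dimension]))
    return alerts


# Sudden drop: caught by the relative check (and by the floor).
sudden = drift_alerts(
    history={"faithfulness": [0.97] * 7, "context_recall": [0.78] * 7},
    today={"faithfulness": 0.86, "context_recall": 0.77},
    floors={"faithfulness": 0.90},
)

# Slow drift: the baseline sinks with the score, so only the floor fires.
slow = drift_alerts(
    history={"faithfulness": [0.91, 0.905, 0.90, 0.90, 0.895, 0.89, 0.89]},
    today={"faithfulness": 0.88},
    floors={"faithfulness": 0.90},
)
print(sudden)
print(slow)
```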
Automated access review scheduling + escalation
Access review workflow fires automatically on a 90-day schedule for all AI system access. If a review is not completed within 7 days of the due date, an escalation fires to the access owner's manager. If not completed within 14 days, access is automatically suspended pending review. De-provisioning and suspension events are logged as SOC 1 evidence.
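The escalation schedule above is a small state function of the due date. A minimal sketch, assuming hypothetical state names and treating "within 7 days" and "within 14 days" as inclusive grace periods:

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)   # review cadence from the text
ESCALATE_AFTER_DAYS = 7                # escalate to the access owner's manager
SUSPEND_AFTER_DAYS = 14                # suspend access pending review


def next_due(last_completed):
    return last_completed + REVIEW_INTERVAL


def review_state(due, today, completed=False):
    """Return the workflow state for one access review."""
    if completed:
        return "complete"
    overdue_days = (today - due).days
    if overdue_days > SUSPEND_AFTER_DAYS:
        return "suspend_access"
    if overdue_days > ESCALATE_AFTER_DAYS:
        return "escalate_to_manager"
    return "pending"


due = next_due(date(2024, 1, 1))       # 90 days later: 2024-03-31
for offset in (3, 8, 15):
    print(offset, review_state(due, due + timedelta(days=offset)))
```

Each state transition would also be written to the governance event log, since the text treats de-provisioning and suspension events as audit evidence.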
Log pipeline health monitoring + gap alerting
Log forwarder health is monitored by a separate lightweight process that verifies log delivery to the WORM store every 15 minutes. A gap of more than 30 minutes triggers an alert. If logs cannot be verified as delivered, an incident is created and the log pipeline is investigated and repaired before the missing interval grows into a material audit gap.
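Gap detection over delivery confirmations can be sketched as a scan of consecutive timestamps. The function name and the sample timestamps are hypothetical; the 30-minute limit is the one stated above.

```python
from datetime import datetime, timedelta


def find_gaps(delivery_times, max_gap=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive confirmed deliveries
    to the WORM store are more than max_gap apart."""
    ordered = sorted(delivery_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


deliveries = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 0, 15),
    datetime(2024, 1, 1, 0, 30),
    datetime(2024, 1, 1, 1, 15),   # 45 minutes after the previous confirmation
    datetime(2024, 1, 1, 1, 30),
]
gaps = find_gaps(deliveries)
print(gaps)
```

A live monitor would also compare the latest confirmation against the current time, because a forwarder that has stopped entirely produces no later timestamp to close the gap.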
Gap ownership assignment + SLA tracking
Knowledge gap alerts include automatic assignment to the content owner for the affected topic domain. Gap tickets are tracked with a default 30-day SLA for corpus addition. Unresolved gaps past SLA escalate to the governance owner. Gap closure requires verified RAGAS improvement in the affected query category — not just document addition.
Infrastructure-as-code enforced change control
All production configuration — confidence thresholds, prompt templates, model routing rules — is managed as version-controlled Infrastructure as Code. Direct production changes are blocked at the infrastructure layer. All changes must go through the standard pull request, A/B test, and approval workflow. Any attempted direct change creates an audit alert.
Golden dataset refresh on query pattern shift
Query embedding distribution is monitored weekly. When the query distribution shifts significantly from the distribution at the time of the last golden dataset update — measured by embedding space distance — a dataset refresh is triggered. New evaluation examples are sampled from recent production queries, human-labelled, and added to the golden dataset before the next RAGAS run.
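One simple way to measure "embedding space distance" is the cosine distance between the centroid of query embeddings at the last dataset update and the centroid of recent queries. This is an assumed technique for illustration, not necessarily the measure the architecture uses; the 0.15 threshold and the two-dimensional vectors are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def needs_refresh(reference, recent, threshold=0.15):
    """True when recent queries have moved away from the reference distribution."""
    return cosine_distance(centroid(reference), centroid(recent)) > threshold


reference = [[1.0, 0.0], [0.9, 0.1]]      # queries at last golden dataset update
recent_similar = [[1.0, 0.1], [0.9, 0.0]]
recent_shifted = [[0.1, 1.0], [0.0, 0.9]]  # a new topic cluster dominates

print(needs_refresh(reference, recent_similar))
print(needs_refresh(reference, recent_shifted))
```

Centroid distance misses shifts that change the spread of queries without moving the mean, so a production monitor might prefer a distribution-level test; the centroid version is shown because it is easy to reason about.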
Architecture Decisions

The Governance Decisions That Determine Audit Readiness

Enterprise organisations frequently discover these decisions matter only when an auditor asks about them. Making them explicitly at design time is what separates a governance architecture from a governance document.

Governance + evaluation architecture decisions
WORM logs vs standard DB logs
A standard database log can be modified: records can be updated, deleted, or overwritten by a database administrator. This makes it useless as SOC 1 Type 2 evidence, because an auditor cannot trust that the log reflects what actually happened. WORM storage (S3 Object Lock, Azure Blob immutability, GCS retention policies) creates logs that cannot be modified or deleted until their retention period expires. The cost premium over standard storage is typically under 10%. For any system in scope for SOC 1, SOC 2, ISO 27001, or HIPAA, WORM is not optional; it is the mechanism that makes log evidence credible.
Real-time RBAC vs session-based RBAC
Session-based RBAC establishes permissions at login and holds them for the session duration. In a session that lasts 8 hours, a user whose permissions are revoked 30 minutes after login retains access for 7.5 hours. Real-time RBAC verifies permissions on every query — adding 1–5ms per query but ensuring that permission changes take effect immediately. For enterprise RAG systems handling sensitive content, the additional latency is the correct tradeoff. The governance event log must record real-time RBAC checks to constitute SOC 1 evidence — a session establishment log is not sufficient.
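The difference shows up as soon as a permission is revoked mid-session. The sketch below uses a hypothetical in-memory `RoleMatrix` and `handle_query` function; a real deployment would check against the identity provider or policy engine, but the shape is the same: check and log on every query, not once at login.

```python
class RoleMatrix:
    """In-memory stand-in for the user -> namespace role matrix."""

    def __init__(self):
        self._grants = set()

    def grant(self, user, namespace):
        self._grants.add((user, namespace))

    def revoke(self, user, namespace):
        self._grants.discard((user, namespace))

    def allowed(self, user, namespace):
        return (user, namespace) in self._grants


def handle_query(matrix, event_log, user, namespace, query):
    """Check permissions at query time and record the decision either way."""
    allowed = matrix.allowed(user, namespace)
    event_log.append({"user": user, "namespace": namespace, "allowed": allowed})
    if not allowed:
        raise PermissionError(f"{user} has no access to {namespace}")
    return f"retrieval for {query!r} in {namespace}"


matrix = RoleMatrix()
events = []
matrix.grant("alice", "finance")
first = handle_query(matrix, events, "alice", "finance", "Q3 accruals policy")

# Permission revoked while the session is still open.
matrix.revoke("alice", "finance")
try:
    handle_query(matrix, events, "alice", "finance", "Q4 forecast")
    second = "served"
except PermissionError:
    second = "denied"

print(second, events)
```

The denied attempt is logged before the exception is raised, which is what lets the event log serve as evidence that the control operated.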
Automated vs manual RAGAS evaluation
Manual RAGAS evaluation — running against the golden dataset on a monthly cycle — is insufficient for production governance. By the time a monthly evaluation detects quality degradation, weeks of degraded responses have been delivered. Automated daily evaluation with rolling baseline comparison and automatic alerting is the correct architecture. The additional compute cost of daily RAGAS evaluation is negligible relative to the risk of undetected quality degradation. The golden dataset update process remains manual and scheduled — the evaluation itself is automated.
Per-domain vs uniform confidence thresholds
A uniform confidence threshold of 0.72 applied across all use cases is a compromise that serves none of them well. In a customer support system, 0.72 is unnecessarily restrictive: it declines answerable questions and frustrates users. In a clinical protocol system, 0.72 is dangerously permissive: even if the confidence score is well calibrated, it admits responses with up to a 28% chance of being incorrect into a clinical decision context. Per-domain calibration requires more effort to establish and maintain, but it is the only approach that correctly balances precision and availability across diverse enterprise use cases. The calibration records are compliance artefacts: they document the risk tolerance decisions that were made and by whom.
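Per-domain routing is a lookup before the gate. The sketch below uses the example thresholds from the architecture overview (legal 0.90, FAQ 0.65) and falls back to 0.72 for uncalibrated domains; the function names, the domain keys, and the fallback behaviour are assumptions for illustration.

```python
DOMAIN_THRESHOLDS = {"legal": 0.90, "faq": 0.65}
DEFAULT_THRESHOLD = 0.72   # assumed fallback for domains without calibration


def route(domain, confidence):
    """Answer only when confidence meets the domain's calibrated threshold."""
    threshold = DOMAIN_THRESHOLDS.get(domain, DEFAULT_THRESHOLD)
    return "answer" if confidence >= threshold else "decline"


def declination_rate(decisions):
    return decisions.count("decline") / len(decisions) if decisions else 0.0


# The same four confidence scores produce very different outcomes per domain.
scores = [0.95, 0.80, 0.70, 0.60]
legal = [route("legal", s) for s in scores]
faq = [route("faq", s) for s in scores]

print(legal, declination_rate(legal))
print(faq, declination_rate(faq))
```

Tracking the declination rate per domain alongside the threshold keeps the trade-off visible: raising a threshold buys precision at the cost of availability, and the 15% declination target shows when that cost points to a corpus gap.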
Continuous improvement ownership
The most common governance failure mode in enterprise RAG is not technical — it is organisational. The improvement loop generates signals (RAGAS drops, knowledge gaps, A/B test results) but no one owns the action. Governance architecture must assign explicit ownership: a named individual or team responsible for acting on each signal category within a defined SLA. Knowledge gap signals go to the content team. RAGAS faithfulness drops go to the inference team. Context recall drops go to the retrieval team. Without defined ownership and SLA tracking, the signals are generated but not acted upon, and the system decays despite having excellent observability.
Start the Conversation

Ready to Govern Your Enterprise AI System?

A single architecture conversation can identify the specific governance gaps in your current RAG deployment — before they surface in an audit finding or a regulatory incident.