
The Evidence Gap: Why Most AI Compliance Programs Fail at Proof

April 8, 2026 | 9 min read | ReguLume
evidence-gap compliance-documentation eu-ai-act evaluation-plans audit-readiness

The auditor asks a simple question: “Show me your risk management documentation for this system.”

The compliance officer opens a folder. There’s a risk management policy – 14 pages, approved by the board last year. A risk register in Excel with 40 line items. An email thread where the data science team discussed model bias testing. A slide deck from a vendor review meeting.

The auditor looks at the folder and says: “This tells me you have a policy. It doesn’t tell me the policy is implemented. Where’s the evidence that this specific AI system was assessed against Article 9’s requirements? Where’s the record of the risk assessment methodology you applied? Where are the test results?”

Silence.

This is the evidence gap. Not a gap in compliance knowledge – the team knows what Article 9 requires. Not a gap in good intentions – the policy exists, the processes exist, the conversations happened. A gap in proof. The distance between “we do this” and “here’s the documentation that proves we do this, for this specific system, against this specific obligation.”

Most compliance programs stop at obligation mapping. They identify what regulations apply, flag the gaps, and produce a remediation plan. That’s necessary. It’s not sufficient. An auditor doesn’t accept a gap analysis as evidence of compliance. An auditor accepts evidence of compliance as evidence of compliance.


What “Evidence” Actually Means Under the EU AI Act

The EU AI Act is specific about documentation requirements. Not vague “maintain appropriate records” language – specific.

Article 11 requires technical documentation for high-risk AI systems. Annex IV specifies what that documentation must contain: a general description of the system, design specifications, training data details, performance metrics, risk management measures, cybersecurity protections, post-market monitoring plans, and applicable standards. Each section has sub-requirements. Each sub-requirement implies specific evidence artifacts.

Article 26 requires deployers to keep the logs generated by high-risk AI systems for at least six months. Not “logs exist” – logs that are retained, accessible, and sufficient for market surveillance authorities to review.

Article 49 requires registration in the EU database before placing a high-risk system on the market. Registration isn’t a checkbox – it’s a specific filing with specific data fields that constitutes a public record.

These aren’t interpretive requirements. They’re documentary requirements. The regulation tells you what to produce. The question is whether you’ve produced it – and whether what you’ve produced actually addresses the specific obligation it claims to address.


The Three Layers of the Evidence Gap

Layer 1: Existence

Does the evidence artifact exist at all? The risk management documentation for the AI hiring tool – has anyone written it? Not the company-wide risk management policy. The system-specific documentation that addresses Article 9’s requirements for this system, with this data, in this deployment context.

Most organizations have policies. Fewer have system-specific documentation. Fewer still have documentation that maps one-to-one to the obligation it’s supposed to address.

Layer 2: Specificity

The artifact exists. But does it address the specific requirement? A risk management policy that says “all AI systems undergo risk assessment” doesn’t satisfy Article 9(2)(a), which requires identification and analysis of “known and reasonably foreseeable risks” for the specific AI system. The obligation asks for specific risks. The evidence must show specific risks were identified – not that a process for identifying risks exists.

This is where generalized compliance documentation fails. A single policy document applied to all systems doesn’t demonstrate that each system was individually assessed. The auditor needs to see that someone looked at this system and identified its risks.

Layer 3: Currency

The evidence exists and is specific. But is it current? Article 9(2) requires the risk management system to be a continuous, iterative process run “throughout the entire lifecycle” of the AI system. A risk assessment from 18 months ago, before the model was retrained on new data, before the deployment expanded to new jurisdictions, before the regulatory landscape changed – that assessment is stale. The obligation is ongoing. The evidence must be ongoing.

Currency is the layer most programs miss entirely. The initial assessment gets done. The documentation gets filed. Nobody updates it when the system changes. The auditor finds an 18-month-old risk assessment for a system that was retrained six months ago and asks: “Was the risk assessment updated after retraining?” That question has a wrong answer.
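The currency check itself is mechanical once the relevant dates are tracked: an assessment is stale if the system changed after the assessment was last updated. A minimal sketch – the function and its arguments are illustrative, not any product's actual API:

```python
from datetime import date

def assessment_is_current(last_assessed: date, last_system_change: date) -> bool:
    """An assessment is current only if nothing material changed after it.
    Illustrative rule: retraining, new jurisdictions, or regulatory change
    would all count as a system change that resets the clock."""
    return last_assessed >= last_system_change

# The scenario above: assessment is 18 months old, model retrained 6 months ago.
print(assessment_is_current(date(2024, 10, 1), date(2025, 10, 1)))  # False
```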


What Evaluation Plans Do

An evaluation plan bridges the gap between “we mapped the obligations” and “we can prove compliance against each one.”

The plan generates structured evidence requirements – specific questions and document requests – tailored to each obligation that applies to each system. Not a generic questionnaire. A regulation-specific, obligation-specific assessment that tells the consultant exactly what evidence to collect.

Here’s what that looks like.

A client’s AI hiring tool is mapped against 23 Article 9 obligations. The gap analysis identified seven gaps – three critical, two high, two medium. The evaluation plan generates items for each obligation: what evidence is needed, what questions the consultant should answer, and what documents should be uploaded.

For Article 9(2)(a) – the requirement to identify foreseeable risks – the evaluation generates a document upload request for the system-specific risk assessment, plus questionnaire items about methodology: How were risks identified? Who conducted the assessment? When was it last updated? What data sources informed the risk identification?

For Article 9(7) – the requirement to test the system against risk management measures – the evaluation generates a request for test results, test methodology documentation, and pass/fail criteria. Not “do you test?” but “show me the test plan, the test data, and the results.”

Each item maps back to a specific obligation. Each obligation maps back to the regulation text. The consultant can trace any evidence request to the exact article and sub-requirement that demands it.
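That obligation-to-evidence structure can be sketched as plain data. Everything below – the class names, the `obligation_id` format, the request kinds – is an illustrative assumption about the shape of an evaluation plan, not ReguLume's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceRequest:
    kind: str    # "document_upload" or "questionnaire" (hypothetical kinds)
    prompt: str  # what the consultant must upload or answer

@dataclass
class EvaluationItem:
    obligation_id: str        # traces back to the exact article and sub-requirement
    regulation_text_ref: str
    requests: list[EvidenceRequest] = field(default_factory=list)

def build_item_for_art9_2a(system_name: str) -> EvaluationItem:
    """Generate the Article 9(2)(a) evidence requests described above."""
    return EvaluationItem(
        obligation_id="EU-AI-Act:Art9(2)(a)",
        regulation_text_ref="Art 9(2)(a) - known and reasonably foreseeable risks",
        requests=[
            EvidenceRequest("document_upload",
                            f"System-specific risk assessment for {system_name}"),
            EvidenceRequest("questionnaire", "How were risks identified?"),
            EvidenceRequest("questionnaire", "Who conducted the assessment?"),
            EvidenceRequest("questionnaire", "When was it last updated?"),
            EvidenceRequest("questionnaire",
                            "What data sources informed the risk identification?"),
        ],
    )

item = build_item_for_art9_2a("AI hiring tool")
print(item.obligation_id, len(item.requests))  # EU-AI-Act:Art9(2)(a) 5
```

The point of the structure is the `obligation_id`: every evidence request carries its provenance, so nothing in the plan is generic.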


Six Assessment Formats

Not every regulation structures its documentation requirements the same way. An EU AI Act assessment looks different from a Colorado impact assessment, which in turn looks different from a NYC bias audit.

Evaluation plans automatically adapt to the regulation being assessed:

EU AI Act Annex IV – the technical documentation structure mandated by the regulation itself. Eight sections matching Annex IV’s categories: general description, design specifications, training data, performance metrics, risk management, cybersecurity, post-market monitoring, and standards compliance.

Colorado SB 205 Impact Assessment – structured around Colorado’s algorithmic impact assessment requirements. System purpose, demographic impact analysis, known limitations, and human oversight mechanisms.

Texas TRAIGA Impact Assessment – aligned with Texas’s AI governance framework. Training data documentation, accuracy metrics, bias evaluation, and oversight procedures.

NYC Local Law 144 Bias Audit – Phase 1 format matching the bias audit requirements for automated employment decision tools. Selection rate analysis by protected category, audit methodology, and auditor independence documentation.

NIST AI RMF – organized around NIST’s four functions: Govern, Map, Measure, Manage. Each function generates evaluation items that map to the corresponding subcategories.

Generic – for regulations without a mandated documentation structure. Items organized by obligation type – risk management, transparency, documentation, human oversight.

The format isn’t a consultant choice. It’s determined by the regulation. When an auditor asks for Annex IV documentation, the evaluation plan already generated its items in Annex IV structure. The consultant collects evidence into a framework the auditor recognizes.
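Format selection by regulation can be pictured as a simple lookup with a generic fallback. The identifiers below are invented for illustration; only the mapping logic matters:

```python
# Hypothetical regulation -> assessment-format table. The format is determined
# by the regulation, never chosen by the consultant; "generic" is the fallback
# for regulations without a mandated documentation structure.
FORMAT_BY_REGULATION = {
    "eu-ai-act":      "annex-iv",
    "colorado-sb205": "co-impact-assessment",
    "texas-traiga":   "tx-impact-assessment",
    "nyc-ll144":      "bias-audit-phase1",
    "nist-ai-rmf":    "govern-map-measure-manage",
}

def assessment_format(regulation_id: str) -> str:
    return FORMAT_BY_REGULATION.get(regulation_id, "generic")

print(assessment_format("eu-ai-act"))  # annex-iv
print(assessment_format("some-other-regulation"))  # generic
```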


Evidence Validation

Collecting evidence isn’t the same as demonstrating compliance. A consultant can upload every document the plan requests and still have gaps – because the documents don’t actually address the specific requirement, or they’re incomplete, or they’re outdated.

Evidence validation reviews uploaded documents, questionnaire responses, and attestations against the specific obligation they’re supposed to address. Each evaluation item receives a compliance assessment: does the provided evidence demonstrate compliance with this specific requirement?

The validation identifies what’s missing. Not “you need more documentation” – specifically what’s insufficient. “The risk assessment identifies three risks but doesn’t address foreseeable misuse scenarios, which Article 9(2)(b) requires.” “The test results show accuracy metrics but don’t include testing against the risk management measures defined in Article 9(7).”

This specificity matters. A consultant who receives “needs more documentation” has to guess what’s missing. A consultant who receives “Article 9(2)(b) foreseeable misuse analysis is absent from the risk assessment” knows exactly what to request from the client.
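A validation finding, then, is a per-item verdict plus a list of specifically named deficiencies. A minimal sketch, with an invented two-rule check standing in for the real validation:

```python
from dataclasses import dataclass

@dataclass
class ValidationFinding:
    item_id: str
    compliant: bool
    missing: list[str]  # each entry names the exact sub-requirement not met

def validate_risk_assessment(sections_present: set[str]) -> ValidationFinding:
    """Illustrative rule set: check the uploaded risk assessment against
    two Article 9(2) sub-requirements. Real validation covers far more."""
    required = {
        "known_risks":      "Art 9(2)(a) known and foreseeable risks",
        "misuse_scenarios": "Art 9(2)(b) foreseeable misuse analysis",
    }
    missing = [label for key, label in required.items()
               if key not in sections_present]
    return ValidationFinding("EU-AI-Act:Art9(2)",
                             compliant=not missing, missing=missing)

# A risk assessment that identifies risks but skips misuse scenarios:
finding = validate_risk_assessment({"known_risks"})
print(finding.missing)  # ['Art 9(2)(b) foreseeable misuse analysis']
```

The output is the actionable part: not “needs more documentation,” but the named sub-requirement the evidence fails to address.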


The Auditor’s Trail

Every evidence artifact, every questionnaire response, every validation result is logged. The evaluation plan creates an audit trail that runs from regulation text to obligation to evidence to validation.

When the auditor asks “show me your Article 9 compliance,” the consultant doesn’t open a folder of loosely organized documents. She opens an evaluation plan that shows:

  • 23 obligations mapped from Article 9
  • 23 evaluation items, each with specific evidence requirements
  • Uploaded documents matched to the items they address
  • Validation results showing compliance status per item
  • An overall compliance score for the Article 9 assessment
  • A generated report in Annex IV format – the structure the auditor expects

The trail is bidirectional. Start from any evidence artifact and trace backward: which obligation required it? Which regulation text demands it? What gap did it address? Start from any obligation and trace forward: what evidence was collected? Was it validated? What was the finding?

This traceability is what separates “we have a compliance program” from “we can demonstrate compliance.” The first is a claim. The second is a record.
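Bidirectional traceability falls out of storing each obligation-to-evidence link once and indexing it in both directions. A toy illustration with invented artifact names:

```python
# Each link is recorded once: (obligation, evidence artifact).
links = [
    ("EU-AI-Act:Art9(2)(a)", "risk_assessment_v3.pdf"),
    ("EU-AI-Act:Art9(7)",    "test_plan_v2.pdf"),
    ("EU-AI-Act:Art9(7)",    "test_results_2026q1.xlsx"),
]

forward: dict[str, list[str]] = {}   # obligation -> evidence collected for it
backward: dict[str, list[str]] = {}  # artifact -> obligations that required it
for obligation, artifact in links:
    forward.setdefault(obligation, []).append(artifact)
    backward.setdefault(artifact, []).append(obligation)

# Trace forward: what evidence answers Article 9(7)?
print(forward["EU-AI-Act:Art9(7)"])  # ['test_plan_v2.pdf', 'test_results_2026q1.xlsx']
# Trace backward: which obligation demanded this document?
print(backward["risk_assessment_v3.pdf"])  # ['EU-AI-Act:Art9(2)(a)']
```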


Mapping Obligations Is the Easy Part

We’ve mapped 2,964 obligations across 15 regulations. We’ve built cross-regulation mapping that identifies where frameworks overlap. We’ve built a compliance heatmap that shows every gap at a glance.

All of that identifies what compliance requires. None of it proves compliance exists.

The evidence gap is the distance between those two states. Most compliance programs live in that gap – they know what they need to do, they’ve done some of it, but they can’t prove all of it to the satisfaction of an auditor who has no reason to take their word for it.

Evaluation plans close the gap by making the evidence requirements explicit, structured, and traceable. Not “document your risk management.” Instead: “For this system, against this obligation, upload this specific artifact and answer these specific questions.”

The obligation tells you what to do. The evidence proves you did it.

That’s the difference between compliance and defensible compliance.


Evaluation plans are generated from obligation mappings using regulation-specific prompt families. Evidence validation uses AI with findings logged in the audit trail. All evaluation data – items, evidence uploads, validation results, and compliance scores – is tenant-isolated and immutable once validated. Learn how we validate our AI outputs.

Map obligations to your AI systems

ReguLume covers 2,964 obligations across 15 regulations. Map obligations to your AI systems, identify gaps, and generate board-ready reports. Score your compliance posture in hours, not months. Plans start at $149/mo.

Get Started