7 min read

EU AI Act Data Governance: Article 10 Training Data Requirements

How to comply with EU AI Act Article 10 data governance requirements. Training data quality criteria, bias detection and mitigation, dataset documentation, and GDPR intersection.

For high-risk AI systems, the EU AI Act does not just regulate what your system does — it regulates how it was built. Article 10 imposes detailed requirements on the data used to train, validate, and test your AI system. It is often the Annex IV item with the widest gap between what organisations have documented and what the regulation requires.

This guide explains what Article 10 demands, how to build a compliant data governance approach, and how to navigate the complex intersection with GDPR.


What Article 10 Actually Requires

Article 10 applies to high-risk AI systems that use techniques involving training with data. It sets out obligations across four areas:

1. Data Quality Criteria

Training, validation, and test datasets must be subject to data governance and management practices that ensure they are:

  - relevant to the intended purpose of the high-risk AI system
  - sufficiently representative of the persons and settings in which the system will be used
  - to the best extent possible, free of errors and complete in view of the intended purpose
  - of appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on whom the system is intended to be used

There is no prescribed methodology for meeting these criteria. What matters is that you can demonstrate, with evidence, that your dataset meets them. “We used data from a reputable source” is not sufficient — you need documented analysis of dataset characteristics.

2. Examination for Biases

Article 10(2)(f) requires that datasets are “examined for possible biases that could lead to prohibited discrimination.” This is one of the most practically demanding requirements in the Act.

Bias examination must be systematic — not an informal check. It requires:

  - identifying the demographic dimensions relevant to the system's use case, not only the obvious ones
  - measuring how groups and outcomes are distributed across the dataset
  - checking for proxy features that correlate with protected characteristics
  - documenting findings, the mitigations applied, and any residual disparities

Special category data for bias detection: Article 10(5) contains an important provision: providers may process special categories of personal data (including race, health data, and other sensitive categories) for the purpose of detecting and correcting biases — subject to appropriate safeguards including pseudonymisation and access controls. This is a deliberate carve-out to enable thorough bias auditing.

3. Dataset Provenance

You must be able to document where your training data came from. This includes:

  - the original source of each dataset (licensed, internally generated, scraped, or publicly available)
  - the licensing or contractual terms under which the data is used
  - when, how, and by whom the data was collected
  - any filtering or selection applied before the data entered your pipeline

For licensed datasets, this means obtaining and retaining documentation from the data provider. For internally generated data, this means maintaining collection methodology records. For scraped or publicly available data, this means documenting the sources, the scraping methodology, and any filtering applied.

4. Collection and Processing Methodology

Document how data was:

  - collected
  - cleaned and deduplicated
  - labelled or annotated
  - updated, enriched, or aggregated

Annotation quality is particularly important. The EU AI Act treats labelling methodology as a core documentation requirement, not a technical detail. If labels were generated by crowd workers, by a third-party annotation service, or by subject matter experts, document the process and the quality controls applied.


Building a Compliant Data Documentation Package

To satisfy Article 10 and Annex IV Item 3, you need a dataset documentation package that covers the following:

Dataset Card

For each dataset used in training, validation, and testing, produce a dataset card covering:

Basic information:

  - dataset name, version, size, and the time period the data covers

Provenance:

  - source, licensing or contractual terms, and collection methodology

Dataset characteristics:

  - distributions of key variables, demographic coverage, and known gaps or under-represented groups

Processing:

  - cleaning, filtering, augmentation, and annotation steps, including who performed the labelling and what quality controls were applied
Bias Audit Report

A separate bias audit report documenting:

Scope of analysis:

  - which demographic dimensions were examined, and why those are the relevant ones for the system's use case

Methodology:

  - the metrics used (e.g. representation rates, outcome disparity rates) and how proxy features were identified

Findings:

  - the measured disparities, per group and per metric

Mitigation:

  - what was changed (e.g. re-sampling, feature removal, re-labelling) and the measured effect of each change

Residual disparities:

  - disparities remaining after mitigation, with the rationale for accepting them
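To make the methodology concrete, here is a minimal sketch of a disparity-rate calculation over labelled records. The four-fifths ratio at the end is a US rule of thumb, not a threshold set by the EU AI Act; treat it as one illustrative metric among several you might report:

```python
from collections import defaultdict

def selection_rates(records):
    """Per-group favourable-outcome rates from (group, outcome) pairs,
    where outcome is 1 for a favourable decision and 0 otherwise."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        positives[group] += outcome
    return {g: positives[g] / totals[g] for g in totals}

def disparate_impact(rates):
    """Ratio of lowest to highest selection rate (the 'four-fifths' metric)."""
    return min(rates.values()) / max(rates.values())

# Toy example: group A is favoured 80% of the time, group B 60%.
records = [("A", 1)] * 80 + [("A", 0)] * 20 + [("B", 1)] * 60 + [("B", 0)] * 40
rates = selection_rates(records)
print(rates)                    # {'A': 0.8, 'B': 0.6}
print(disparate_impact(rates))  # 0.75
```

Re-running the same calculation after each mitigation step gives you the before/after evidence that Gap 3 below describes as commonly missing.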


The GDPR and EU AI Act Intersection

This is the most legally complex aspect of Article 10 compliance. The EU AI Act and GDPR interact in ways that require careful analysis.

Training Data Containing Personal Data

If your training dataset contains personal data (which most enterprise AI training sets do), GDPR applies in full to that processing. Key questions:

Lawful basis: What is the lawful basis for using personal data to train your AI system? Options include:

  - legitimate interests (GDPR Article 6(1)(f)), supported by a documented balancing test
  - consent (Article 6(1)(a)), where it can be meaningfully obtained and withdrawn
  - performance of a contract (Article 6(1)(b)), where training is genuinely necessary to provide the service

For most enterprise training pipelines the practical candidate is legitimate interests, which requires a documented legitimate interests assessment rather than a bare assertion.

Data minimisation: GDPR requires that only the personal data necessary for the specified purpose is processed. For AI training, this means considering whether you could train an effective model with less personal data, or with pseudonymised or anonymised data.

Purpose limitation: Personal data collected for one purpose (e.g. customer service interactions) cannot simply be repurposed for AI training without a compatible purpose or a new lawful basis.

Retention: How long do you need to retain the training dataset? Once the model is trained, ongoing retention of the full dataset containing personal data requires justification. GDPR’s storage limitation principle requires deletion when personal data is no longer necessary.
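A retention policy is easiest to enforce when deletion deadlines are computed rather than remembered. A trivial sketch; the retention period itself is a policy choice you must justify, not a constant fixed by law:

```python
from datetime import date, timedelta

def deletion_due(training_completed: date, retention_days: int) -> date:
    """Earliest date by which the raw training set containing personal data
    should be deleted under a documented retention policy."""
    return training_completed + timedelta(days=retention_days)

# Hypothetical: training frozen 15 Jan 2025, one-year retention justified
# for audit and re-validation purposes.
print(deletion_due(date(2025, 1, 15), 365))  # 2026-01-15
```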

Data Subject Rights and Training Data

Can a data subject demand that their personal data be removed from your training dataset? This is one of the most contested questions in AI/GDPR compliance.

Under GDPR:

  - the right to erasure (Article 17) clearly applies to personal data stored in your training dataset
  - whether a trained model itself contains personal data, and is therefore subject to erasure, is legally unsettled
  - at a minimum, erasure requests must be honoured against the stored dataset, and the individual's data excluded from future training runs

Machine unlearning — technically removing the influence of a specific data point from a trained model — is an active research area but not yet reliably feasible for large models. Your data retention and processing policies should account for the possibility of erasure requests, ideally by maintaining training dataset records that enable you to identify and remove specific individuals’ data before training, rather than relying on post-training unlearning.
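One way to implement this "remove before training" approach is to key every training record to a stable subject identifier and filter against the erasure list before the dataset is frozen. A minimal sketch; the `subject_id` field is a hypothetical convention, not a standard:

```python
def remove_erased_subjects(records, erased_ids):
    """Drop records belonging to subjects who exercised their erasure right,
    before the training set is frozen. Assumes each record carries a stable
    subject identifier under the (hypothetical) 'subject_id' key."""
    return [r for r in records if r["subject_id"] not in erased_ids]

training_set = [
    {"subject_id": "u-101", "text": "..."},
    {"subject_id": "u-102", "text": "..."},
    {"subject_id": "u-103", "text": "..."},
]
erasure_requests = {"u-102"}

clean_set = remove_erased_subjects(training_set, erasure_requests)
print([r["subject_id"] for r in clean_set])  # ['u-101', 'u-103']
```

The prerequisite is the inventory itself: if records cannot be traced back to individuals, erasure requests cannot be honoured at the dataset level at all.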

Pseudonymisation and Anonymisation

Where possible, pseudonymise or anonymise training data before training. This reduces GDPR exposure and simplifies compliance with data subject rights. However:

  - pseudonymised data is still personal data under GDPR; only genuine anonymisation takes data outside its scope
  - the anonymisation bar is high: if individuals can be re-identified by combining the dataset with other reasonably available information, it is not anonymous
  - aggressive anonymisation can degrade the dataset to the point of undermining the representativeness Article 10 requires

Document your anonymisation approach and why you believe it achieves genuine anonymisation (rather than pseudonymisation) where you rely on this.
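For pseudonymisation, a keyed hash (HMAC) produces stable tokens that support joins and erasure lookups without storing raw identifiers. A sketch using only Python's standard library; note that this is pseudonymisation, not anonymisation, so GDPR continues to apply while the key exists:

```python
import hmac
import hashlib

def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Keyed pseudonymisation: deterministic for a given key, so the same
    person always maps to the same token, but not reversible without the key.
    Under GDPR this remains personal data (pseudonymised, not anonymised)."""
    return hmac.new(secret_key, identifier.encode(), hashlib.sha256).hexdigest()

# Hypothetical key handling: in practice, keep the key in a secrets manager,
# separate from the dataset, with documented access controls.
key = b"example-key-do-not-hardcode"

token_a = pseudonymise("jane.doe@example.com", key)
token_b = pseudonymise("jane.doe@example.com", key)
assert token_a == token_b  # deterministic: supports joins and erasure lookups
print(token_a[:16])        # 64-hex-char token, truncated for display
```

Determinism is the design choice here: it lets you honour an erasure request by recomputing the subject's token and deleting matching records, without keeping a reverse lookup table.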


Common Compliance Gaps in Article 10

Gap 1: No dataset documentation
Training data was assembled ad hoc, without formal provenance records. The data sources, collection methodology, and characteristics are not documented — making Annex IV Item 3 impossible to complete accurately.

Gap 2: Bias audit conducted but not in the right dimensions
A gender bias analysis was performed, but age, nationality, and disability were not examined — even though these are relevant to the system’s use case.

Gap 3: Mitigation applied but not measured
The team removed a proxy feature (e.g. postcode) but did not re-test disparity rates after removal. The bias audit report records the mitigation but not its effectiveness.

Gap 4: No GDPR lawful basis for training data
The system was trained on customer interaction data or employee records without a documented lawful basis for using that data for AI training. The original collection purpose does not extend to model training without analysis.

Gap 5: Training data retained indefinitely
The full training dataset (including personal data) is retained on internal servers without a documented retention period or deletion schedule, violating GDPR’s storage limitation principle.


Practical Steps

  1. Inventory your training data: List every dataset used for training, validation, and testing — including data augmentation sources
  2. Produce dataset cards for each dataset covering provenance, scope, and characteristics
  3. Conduct a formal bias audit covering all relevant demographic dimensions for your use case
  4. Document your GDPR lawful basis for any training data containing personal data
  5. Set a retention policy for training datasets containing personal data
  6. Link your data documentation to Annex IV Item 3 in your technical file

Our free Status Quo Assessment includes readiness questions covering training data documentation and bias audit status. For a complete Annex IV roadmap with worked dataset card and bias audit templates, see our Technical Documentation Roadmap.


Free Status Quo Assessment

12 questions. Instant Annex III classification + readiness score. Free PDF delivered to your inbox.

Take free assessment →

Annex IV Roadmap — €149

15-page personalised report. All 8 Annex IV items with practical examples. 90-day action plan. Instant PDF.

Get your roadmap →