EU AI Act Data Governance: Article 10 Training Data Requirements
How to comply with EU AI Act Article 10 data governance requirements: training data quality criteria, bias detection and mitigation, dataset documentation, and the intersection with GDPR.
For high-risk AI systems, the EU AI Act does not just regulate what your system does — it regulates how it was built. Article 10 imposes detailed requirements on the data used to train, validate, and test your AI system. It is often the Annex IV item with the widest gap between what organisations have documented and what the regulation requires.
This guide explains what Article 10 demands, how to build a compliant data governance approach, and how to navigate the complex intersection with GDPR.
What Article 10 Actually Requires
Article 10 applies to high-risk AI systems that use techniques involving training with data. It sets out obligations across four areas:
1. Data Quality Criteria
Training, validation, and test datasets must be subject to data governance and management practices that ensure they are:
- Relevant: The data must be appropriate for the intended purpose of the system — not merely available or convenient
- Representative: The data must adequately represent the population and context in which the system will operate
- Free from errors: Errors and inaccuracies in training data must be identified and corrected where technically feasible
- Complete: The dataset must be sufficiently complete for the intended purpose — significant gaps in coverage that affect system performance must be documented and addressed
There is no prescribed methodology for meeting these criteria. What matters is that you can demonstrate, with evidence, that your dataset meets them. “We used data from a reputable source” is not sufficient — you need documented analysis of dataset characteristics.
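To make "documented analysis" concrete, here is a minimal sketch that compares a dataset's demographic distribution against a reference distribution for the deployment context. The age bands, reference figures, column name, and 20% tolerance are illustrative assumptions, not values prescribed by the Act.

```python
import pandas as pd

# Illustrative reference distribution for the deployment context
# (hypothetical figures; in practice use census or domain statistics).
REFERENCE_AGE_BANDS = {"18-34": 0.30, "35-54": 0.38, "55+": 0.32}
TOLERANCE = 0.20  # assumed threshold: flag groups deviating more than 20% (relative)

def representativeness_report(df: pd.DataFrame, column: str,
                              reference: dict[str, float]) -> pd.DataFrame:
    """Compare observed group shares in the dataset against a reference
    distribution and flag material deviations."""
    observed = df[column].value_counts(normalize=True)
    rows = []
    for group, expected in reference.items():
        share = float(observed.get(group, 0.0))
        deviation = (share - expected) / expected
        rows.append({"group": group,
                     "expected_share": expected,
                     "observed_share": round(share, 4),
                     "relative_deviation": round(deviation, 4),
                     "flagged": abs(deviation) > TOLERANCE})
    return pd.DataFrame(rows)

# Usage (train_df and the "age_band" column are assumptions for this sketch):
# report = representativeness_report(train_df, "age_band", REFERENCE_AGE_BANDS)
# Store the output alongside the dataset card as evidence of the analysis.
```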
2. Examination for Biases
Article 10(2)(f) requires that datasets be examined for possible biases that are likely to negatively affect fundamental rights or lead to discrimination prohibited under Union law. This is one of the most practically demanding requirements in the Act.
Bias examination must be systematic, not an informal check. It requires the following (a measurement sketch follows this list):
- Identifying demographic groups relevant to your intended use case
- Measuring whether the dataset under-represents or over-represents those groups
- Identifying proxy features that encode protected characteristics even when the protected characteristics themselves are not explicitly present
- Measuring outcome disparities that would result from training on the dataset without mitigation
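One way to operationalise the proxy-feature step is to measure the statistical association between each candidate feature and a protected characteristic. A minimal sketch using Cramér's V; the column names, synthetic data, and 0.3 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(feature: pd.Series, protected: pd.Series) -> float:
    """Association between a candidate proxy feature and a protected
    characteristic (0 = no association, 1 = perfect association)."""
    table = pd.crosstab(feature, protected)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Synthetic stand-in data; real audits would run this over every
# candidate feature against every relevant protected characteristic.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "postcode": rng.choice(["N1", "N2", "S1", "S2"], 5_000),
    "gender": rng.choice(["f", "m"], 5_000),
})
v = cramers_v(df["postcode"], df["gender"])
flag = "potential proxy, record in bias audit" if v > 0.3 else "no strong association"
print(f"postcode: Cramér's V = {v:.2f} ({flag})")
```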
Special category data for bias detection: Article 10(5) contains an important provision. Providers may process special categories of personal data (including racial or ethnic origin, health data, and other sensitive categories) for the purpose of detecting and correcting biases, subject to appropriate safeguards including pseudonymisation and access controls. This is a deliberate carve-out to enable thorough bias auditing.
3. Dataset Provenance
You must be able to document where your training data came from. This includes:
- Origin: The source of the data — scraped web data, licensed datasets, internal data, synthetic data, or combinations
- Geographic scope: Where the data was collected — relevant because data from one geographic context may not be representative of users in another
- Temporal scope: When the data was collected — older data may not reflect current patterns in the deployment context
- Contextual scope: The conditions under which the data was originally generated or collected
For licensed datasets, this means obtaining and retaining documentation from the data provider. For internally generated data, this means maintaining collection methodology records. For scraped or publicly available data, this means documenting the sources, the scraping methodology, and any filtering applied.
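Provenance needs no special tooling; a structured, version-controlled record per dataset is enough. Below is a minimal sketch of such a record as a Python dataclass, where every field name and value is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Per-dataset provenance entry for the technical file (illustrative fields)."""
    dataset_name: str
    version: str
    origin: str              # e.g. "licensed", "internal", "scraped", "synthetic"
    source_description: str  # who supplied the data or how it was generated
    geographic_scope: str    # where the data was collected
    temporal_scope: str      # collection period
    collection_context: str  # conditions under which the data was generated
    licence_or_agreement: str
    filtering_applied: list[str] = field(default_factory=list)

# Hypothetical example entry:
record = ProvenanceRecord(
    dataset_name="customer-support-tickets",
    version="2.1",
    origin="internal",
    source_description="Tickets exported from the EU support desk CRM",
    geographic_scope="EU (DE, FR, NL)",
    temporal_scope="2022-01 to 2024-03",
    collection_context="Written support interactions under privacy notice v4",
    licence_or_agreement="Internal data use policy, ref DP-2024-017",
    filtering_applied=["removed tickets under 20 tokens", "removed duplicate threads"],
)
```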
4. Collection and Processing Methodology
Document how data was:
- Collected: What process was used; whether consent was obtained where required; what population the collection covered
- Selected: What filtering criteria were applied; what data was excluded and why; what risk of selection bias this creates
- Preprocessed: What cleaning, normalisation, tokenisation, or transformation steps were applied; how these steps affect the dataset characteristics
- Labelled/annotated: For supervised learning, the annotation methodology, annotator qualifications, inter-annotator agreement metrics, and dispute resolution procedure
Annotation quality is particularly important. The EU AI Act treats labelling methodology as a core documentation requirement, not a technical detail. If labels were generated by crowd workers, by a third-party annotation service, or by subject matter experts, document the process and the quality controls applied.
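Inter-annotator agreement is cheap to compute and makes a concrete artefact for the documentation package. A minimal sketch using Cohen's kappa via scikit-learn; the two annotators' labels are made-up data for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned independently by two annotators to the same items
# (illustrative data).
annotator_a = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]
annotator_b = ["spam", "ham", "ham", "spam", "ham", "spam", "spam", "ham"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Record the score, the sample size, and the dispute resolution procedure
# (e.g. adjudication by a senior annotator) in the dataset card.
```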
Building a Compliant Data Documentation Package
To satisfy Article 10 and Annex IV Item 3, you need a dataset documentation package that covers the following:
Dataset Card
For each dataset used in training, validation, and testing, produce a dataset card covering the points below (a machine-readable sketch follows the lists):
Basic information:
- Dataset name and version
- Number of samples (training / validation / test split)
- Data types (text, tabular, image, audio, etc.)
- Languages or formats
- Date of collection / last update
Provenance:
- Data source(s) with documentation
- Collection methodology
- Licences or data use agreements
- Geographic and temporal scope
Dataset characteristics:
- Demographic distribution (where applicable and measurable)
- Known limitations and gaps
- Known biases identified during examination
Processing:
- Preprocessing steps applied
- Any data augmentation applied
- Annotation methodology (for labelled data)
- Quality control measures
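A dataset card does not have to be a free-form document; a machine-readable file kept in version control next to the data is easier to audit and keep current. The sketch below mirrors the fields above as JSON, and every value shown is a placeholder.

```python
import json

# Minimal machine-readable dataset card mirroring the fields above.
# All values are placeholders for illustration.
dataset_card = {
    "basic": {
        "name": "customer-support-tickets",
        "version": "2.1",
        "samples": {"train": 80_000, "validation": 10_000, "test": 10_000},
        "data_types": ["text"],
        "languages": ["de", "fr", "nl"],
        "collected": "2022-01 to 2024-03",
    },
    "provenance": {
        "sources": ["internal CRM export"],
        "collection_methodology": "see provenance record DP-2024-017",
        "licences": ["internal data use policy"],
        "geographic_scope": "EU",
    },
    "characteristics": {
        "demographic_distribution": "see representativeness report v3",
        "known_limitations": ["under-represents customers aged 55+"],
        "known_biases": ["postcode associated with nationality; see bias audit"],
    },
    "processing": {
        "preprocessing": ["deduplication", "PII redaction", "lowercasing"],
        "augmentation": [],
        "annotation": "two annotators plus adjudication; kappa recorded",
        "quality_controls": ["10% double-annotation sample"],
    },
}

with open("dataset_card.json", "w") as f:
    json.dump(dataset_card, f, indent=2)
```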
Bias Audit Report
A separate bias audit report documenting the areas below (a worked measurement example follows the lists):
Scope of analysis:
- Which protected characteristics were examined
- Which proxy features were tested
- Which demographic sub-groups were analysed
Methodology:
- What statistical tests were applied
- What fairness metrics were used (demographic parity, equalised odds, individual fairness, etc.)
- What tools were used (Fairlearn, IBM AI Fairness 360, Aequitas, or custom)
Findings:
- Pre-mitigation disparity by demographic group
- Which proxy features were found to encode protected characteristics
- The correlation coefficients or other measures of association
Mitigation:
- What mitigation measures were applied (re-weighting, feature removal, adversarial debiasing, etc.)
- Post-mitigation disparity by demographic group
- Evidence of effectiveness (before/after comparison)
Residual disparities:
- What disparities remain after mitigation
- Justification for accepting residual disparities
- How residual disparities are disclosed in the Instructions for Use
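For the before/after evidence, a fairness library makes the measurement reproducible. Here is a minimal sketch using Fairlearn's demographic parity difference plus per-group accuracy; the synthetic arrays stand in for your model's real test-set predictions before and after mitigation.

```python
import numpy as np
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for a real evaluation set (illustrative only).
n = 1_000
y_true = rng.integers(0, 2, n)
sensitive = rng.choice(["group_a", "group_b"], n)
y_pred_before = rng.integers(0, 2, n)  # predictions pre-mitigation
y_pred_after = rng.integers(0, 2, n)   # predictions post-mitigation

for label, y_pred in [("before", y_pred_before), ("after", y_pred_after)]:
    dpd = demographic_parity_difference(y_true, y_pred,
                                        sensitive_features=sensitive)
    frame = MetricFrame(metrics=accuracy_score, y_true=y_true,
                        y_pred=y_pred, sensitive_features=sensitive)
    print(f"{label}: demographic parity difference = {dpd:.3f}")
    print(frame.by_group)  # per-group accuracy for the audit report
```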
The GDPR and EU AI Act Intersection
This is the most legally complex aspect of Article 10 compliance. The EU AI Act and GDPR interact in ways that require careful analysis.
Training Data Containing Personal Data
If your training dataset contains personal data (which most enterprise AI training sets do), GDPR applies in full to that processing. Key questions:
Lawful basis: What is the lawful basis for using personal data to train your AI system? Options include:
- Legitimate interests (Art. 6(1)(f) GDPR): Requires a legitimate interests assessment (LIA) balancing your interest in training the model against the data subjects’ rights. This is the most commonly used basis but requires documentation.
- Consent: High bar — requires specific, informed, freely given consent for the specific AI training purpose. Often impractical for large datasets.
- Performance of contract: Only applies if training is necessary to deliver a service to the data subjects whose data is used — a narrow basis.
Data minimisation: GDPR requires that only the personal data necessary for the specified purpose is processed. For AI training, this means considering whether you could train an effective model with less personal data, or with pseudonymised or anonymised data.
Purpose limitation: Personal data collected for one purpose (e.g. customer service interactions) cannot simply be repurposed for AI training without a compatible purpose or a new lawful basis.
Retention: How long do you need to retain the training dataset? Once the model is trained, ongoing retention of the full dataset containing personal data requires justification. GDPR’s storage limitation principle requires deletion when personal data is no longer necessary.
Data Subject Rights and Training Data
Can a data subject demand that their personal data be removed from your training dataset? This is one of the most contested questions in AI/GDPR compliance.
Under GDPR:
- Right to erasure (Art. 17): Data subjects can request erasure when personal data is no longer necessary, consent is withdrawn, or processing is unlawful. This applies to training data.
- Right to object (Art. 21): Data subjects can object to processing based on legitimate interests. If successful, you must cease processing — which could mean re-training the model without that data.
Machine unlearning — technically removing the influence of a specific data point from a trained model — is an active research area but not yet reliably feasible for large models. Your data retention and processing policies should account for the possibility of erasure requests, ideally by maintaining training dataset records that enable you to identify and remove specific individuals’ data before training, rather than relying on post-training unlearning.
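A practical pattern that supports this: keep a pseudonymous subject-to-record index alongside the dataset, so erasure and objection requests can be honoured by filtering before the next training run. A minimal sketch with an assumed column name and a hypothetical load_pending_erasures() helper.

```python
import pandas as pd

def apply_erasure_requests(dataset: pd.DataFrame,
                           erasure_ids: set[str],
                           subject_col: str = "subject_id") -> pd.DataFrame:
    """Drop all records belonging to data subjects with pending erasure
    requests. Run before every training job; log counts for the record."""
    before = len(dataset)
    filtered = dataset[~dataset[subject_col].isin(erasure_ids)].copy()
    print(f"Removed {before - len(filtered)} records "
          f"for {len(erasure_ids)} data subjects")
    return filtered

# Usage (load_pending_erasures is a hypothetical helper returning the
# pseudonymous IDs of subjects with open erasure or objection requests):
# train_df = apply_erasure_requests(train_df, load_pending_erasures())
```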
Pseudonymisation and Anonymisation
Where possible, pseudonymise or anonymise training data before training. This reduces GDPR exposure and simplifies compliance with data subject rights. However:
- Pseudonymisation does not eliminate GDPR application — pseudonymised data is still personal data
- True anonymisation is a high bar — data is only truly anonymous if re-identification is not reasonably possible. This is difficult to achieve with rich datasets
Document your anonymisation approach and why you believe it achieves genuine anonymisation (rather than pseudonymisation) where you rely on this.
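A common pseudonymisation technique is keyed hashing of direct identifiers, with the key stored separately from the data. A minimal sketch using HMAC-SHA256; note that the output remains personal data under GDPR, because anyone holding both the key and the data can re-identify.

```python
import hashlib
import hmac

# The key must live apart from the dataset (e.g. in a secrets manager).
# Because the mapping is reversible for a key-holder, this is
# pseudonymisation, not anonymisation.
PSEUDONYMISATION_KEY = b"replace-with-key-from-secrets-manager"

def pseudonymise(identifier: str) -> str:
    """Deterministic pseudonym for a direct identifier (email, customer ID)."""
    return hmac.new(PSEUDONYMISATION_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

print(pseudonymise("jane.doe@example.com"))  # stable pseudonym across datasets
```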
Common Compliance Gaps in Article 10
Gap 1: No dataset documentation
Training data was assembled ad hoc, without formal provenance records. The data sources, collection methodology, and characteristics are not documented, making Annex IV Item 3 impossible to complete accurately.
Gap 2: Bias audit conducted, but not in the right dimensions
A gender bias analysis was performed, but age, nationality, and disability were not examined, even though these are relevant to the system's use case.
Gap 3: Mitigation applied but not measured
The team removed a proxy feature (e.g. postcode) but did not re-test disparity rates after removal. The bias audit report records the mitigation but not its effectiveness.
Gap 4: No GDPR lawful basis for training data
The system was trained on customer interaction data or employee records without a documented lawful basis for using that data for AI training. The original collection purpose does not extend to model training without analysis.
Gap 5: Training data retained indefinitely
The full training dataset (including personal data) is retained on internal servers without a documented retention period or deletion schedule, violating GDPR's storage limitation principle.
Practical Steps
- Inventory your training data: List every dataset used for training, validation, and testing — including data augmentation sources
- Produce dataset cards for each dataset covering provenance, scope, and characteristics
- Conduct a formal bias audit covering all relevant demographic dimensions for your use case
- Document your GDPR lawful basis for any training data containing personal data
- Set a retention policy for training datasets containing personal data
- Link your data documentation to Annex IV Item 3 in your technical file
Our free Status Quo Assessment includes readiness questions covering training data documentation and bias audit status. For a complete Annex IV roadmap with worked dataset card and bias audit templates, see our Technical Documentation Roadmap.