EU AI Act Data Governance: Article 10 Training Data Requirements
How to comply with EU AI Act Article 10 data governance requirements: training data quality criteria, bias detection and mitigation, dataset documentation, and the intersection with GDPR.
For high-risk AI systems, the EU AI Act does not just regulate what your system does — it regulates how it was built. Article 10 imposes detailed requirements on the data used to train, validate, and test your AI system. It is often the Annex IV item with the widest gap between what organisations have documented and what the regulation requires.
This guide explains what Article 10 demands, how to build a compliant data governance approach, and how to navigate the complex intersection with GDPR.
What Article 10 Actually Requires
Article 10 applies to high-risk AI systems that use techniques involving training with data. It sets out obligations across four areas:
1. Data Quality Criteria
Training, validation, and test datasets must be subject to data governance and management practices that ensure they are:
- Relevant: The data must be appropriate for the intended purpose of the system — not merely available or convenient
- Representative: The data must adequately represent the population and context in which the system will operate
- Free from errors: Errors and inaccuracies in training data must be identified and corrected where technically feasible
- Complete: The dataset must be sufficiently complete for the intended purpose — significant gaps in coverage that affect system performance must be documented and addressed
There is no prescribed methodology for meeting these criteria. What matters is that you can demonstrate, with evidence, that your dataset meets them. “We used data from a reputable source” is not sufficient — you need documented analysis of dataset characteristics.
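To make "documented analysis" concrete, here is a minimal sketch that compares a dataset's demographic distribution against a reference distribution for the deployment context. The age bands, reference figures, column name, and 20% tolerance are illustrative assumptions, not values prescribed by the Act.

```python
import pandas as pd

# Illustrative reference distribution for the deployment context
# (hypothetical figures; in practice use census or domain statistics).
REFERENCE_AGE_BANDS = {"18-34": 0.30, "35-54": 0.38, "55+": 0.32}
TOLERANCE = 0.20  # assumed threshold: flag groups deviating more than 20% (relative)

def representativeness_report(df: pd.DataFrame, column: str,
                              reference: dict[str, float]) -> pd.DataFrame:
    """Compare observed group shares in the dataset against a reference
    distribution and flag material deviations."""
    observed = df[column].value_counts(normalize=True)
    rows = []
    for group, expected in reference.items():
        share = float(observed.get(group, 0.0))
        deviation = (share - expected) / expected
        rows.append({"group": group,
                     "expected_share": expected,
                     "observed_share": round(share, 4),
                     "relative_deviation": round(deviation, 4),
                     "flagged": abs(deviation) > TOLERANCE})
    return pd.DataFrame(rows)

# Usage (train_df and the "age_band" column are assumptions for this sketch):
# report = representativeness_report(train_df, "age_band", REFERENCE_AGE_BANDS)
# Store the output alongside the dataset card as evidence of the analysis.
```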
2. Examination for Biases
Article 10(2)(f) requires that datasets be examined for possible biases that are likely to negatively affect fundamental rights or lead to discrimination prohibited under Union law. This is one of the most practically demanding requirements in the Act.
Bias examination must be systematic, not an informal check. It requires the following (a measurement sketch follows this list):
- Identifying demographic groups relevant to your intended use case
- Measuring whether the dataset under-represents or over-represents those groups
- Identifying proxy features that encode protected characteristics even when the protected characteristics themselves are not explicitly present
- Measuring outcome disparities that would result from training on the dataset without mitigation
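One way to operationalise the proxy-feature step is to measure the statistical association between each candidate feature and a protected characteristic. A minimal sketch using Cramér's V; the column names, synthetic data, and 0.3 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(feature: pd.Series, protected: pd.Series) -> float:
    """Association between a candidate proxy feature and a protected
    characteristic (0 = no association, 1 = perfect association)."""
    table = pd.crosstab(feature, protected)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Synthetic stand-in data; real audits would run this over every
# candidate feature against every relevant protected characteristic.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "postcode": rng.choice(["N1", "N2", "S1", "S2"], 5_000),
    "gender": rng.choice(["f", "m"], 5_000),
})
v = cramers_v(df["postcode"], df["gender"])
flag = "potential proxy, record in bias audit" if v > 0.3 else "no strong association"
print(f"postcode: Cramér's V = {v:.2f} ({flag})")
```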
Special category data for bias detection: Article 10(5) contains an important provision. Providers may process special categories of personal data (including racial or ethnic origin, health data, and other sensitive categories) for the purpose of detecting and correcting biases, subject to appropriate safeguards including pseudonymisation and access controls. This is a deliberate carve-out to enable thorough bias auditing.
3. Dataset Provenance
You must be able to document where your training data came from. This includes:
- Origin: The source of the data — scraped web data, licensed datasets, internal data, synthetic data, or combinations
- Geographic scope: Where the data was collected — relevant because data from one geographic context may not be representative of users in another
- Temporal scope: When the data was collected — older data may not reflect current patterns in the deployment context
- Contextual scope: The conditions under which the data was originally generated or collected
For licensed datasets, this means obtaining and retaining documentation from the data provider. For internally generated data, this means maintaining collection methodology records. For scraped or publicly available data, this means documenting the sources, the scraping methodology, and any filtering applied.
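Provenance needs no special tooling; a structured, version-controlled record per dataset is enough. Below is a minimal sketch of such a record as a Python dataclass, where every field name and value is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Per-dataset provenance entry for the technical file (illustrative fields)."""
    dataset_name: str
    version: str
    origin: str              # e.g. "licensed", "internal", "scraped", "synthetic"
    source_description: str  # who supplied the data or how it was generated
    geographic_scope: str    # where the data was collected
    temporal_scope: str      # collection period
    collection_context: str  # conditions under which the data was generated
    licence_or_agreement: str
    filtering_applied: list[str] = field(default_factory=list)

# Hypothetical example entry:
record = ProvenanceRecord(
    dataset_name="customer-support-tickets",
    version="2.1",
    origin="internal",
    source_description="Tickets exported from the EU support desk CRM",
    geographic_scope="EU (DE, FR, NL)",
    temporal_scope="2022-01 to 2024-03",
    collection_context="Written support interactions under privacy notice v4",
    licence_or_agreement="Internal data use policy, ref DP-2024-017",
    filtering_applied=["removed tickets under 20 tokens", "removed duplicate threads"],
)
```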
4. Collection and Processing Methodology
Document how data was:
- Collected: What process was used; whether consent was obtained where required; what population the collection covered
- Selected: What filtering criteria were applied; what data was excluded and why; what risk of selection bias this creates
- Preprocessed: What cleaning, normalisation, tokenisation, or transformation steps were applied; how these steps affect the dataset characteristics
- Labelled/annotated: For supervised learning, the annotation methodology, annotator qualifications, inter-annotator agreement metrics, and dispute resolution procedure
Annotation quality is particularly important. The EU AI Act treats labelling methodology as a core documentation requirement, not a technical detail. If labels were generated by crowd workers, by a third-party annotation service, or by subject matter experts, document the process and the quality controls applied.
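Inter-annotator agreement is cheap to compute and makes a concrete artefact for the documentation package. A minimal sketch using Cohen's kappa via scikit-learn; the two annotators' labels are made-up data for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned independently by two annotators to the same items
# (illustrative data).
annotator_a = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]
annotator_b = ["spam", "ham", "ham", "spam", "ham", "spam", "spam", "ham"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Record the score, the sample size, and the dispute resolution procedure
# (e.g. adjudication by a senior annotator) in the dataset card.
```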
Building a Compliant Data Documentation Package
To satisfy Article 10 and Annex IV Item 3, you need a dataset documentation package that covers the following:
Dataset Card
For each dataset used in training, validation, and testing, produce a dataset card covering the points below (a machine-readable sketch follows the lists):
Basic information:
- Dataset name and version
- Number of samples (training / validation / test split)
- Data types (text, tabular, image, audio, etc.)
- Languages or formats
- Date of collection / last update
Provenance:
- Data source(s) with documentation
- Collection methodology
- Licences or data use agreements
- Geographic and temporal scope
Dataset characteristics:
- Demographic distribution (where applicable and measurable)
- Known limitations and gaps
- Known biases identified during examination
Processing:
- Preprocessing steps applied
- Any data augmentation applied
- Annotation methodology (for labelled data)
- Quality control measures
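A dataset card does not have to be a free-form document; a machine-readable file kept in version control next to the data is easier to audit and keep current. The sketch below mirrors the fields above as JSON, and every value shown is a placeholder.

```python
import json

# Minimal machine-readable dataset card mirroring the fields above.
# All values are placeholders for illustration.
dataset_card = {
    "basic": {
        "name": "customer-support-tickets",
        "version": "2.1",
        "samples": {"train": 80_000, "validation": 10_000, "test": 10_000},
        "data_types": ["text"],
        "languages": ["de", "fr", "nl"],
        "collected": "2022-01 to 2024-03",
    },
    "provenance": {
        "sources": ["internal CRM export"],
        "collection_methodology": "see provenance record DP-2024-017",
        "licences": ["internal data use policy"],
        "geographic_scope": "EU",
    },
    "characteristics": {
        "demographic_distribution": "see representativeness report v3",
        "known_limitations": ["under-represents customers aged 55+"],
        "known_biases": ["postcode associated with nationality; see bias audit"],
    },
    "processing": {
        "preprocessing": ["deduplication", "PII redaction", "lowercasing"],
        "augmentation": [],
        "annotation": "two annotators plus adjudication; kappa recorded",
        "quality_controls": ["10% double-annotation sample"],
    },
}

with open("dataset_card.json", "w") as f:
    json.dump(dataset_card, f, indent=2)
```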
Bias Audit Report
A separate bias audit report documenting the areas below (a worked measurement example follows the lists):
Scope of analysis:
- Which protected characteristics were examined
- Which proxy features were tested
- Which demographic sub-groups were analysed
Methodology:
- What statistical tests were applied
- What fairness metrics were used (demographic parity, equalised odds, individual fairness, etc.)
- What tools were used (Fairlearn, IBM AI Fairness 360, Aequitas, or custom)
Findings:
- Pre-mitigation disparity by demographic group
- Which proxy features were found to encode protected characteristics
- The correlation coefficients or other measures of association
Mitigation:
- What mitigation measures were applied (re-weighting, feature removal, adversarial debiasing, etc.)
- Post-mitigation disparity by demographic group
- Evidence of effectiveness (before/after comparison)
Residual disparities:
- What disparities remain after mitigation
- Justification for accepting residual disparities
- How residual disparities are disclosed in the Instructions for Use
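For the before/after evidence, a fairness library makes the measurement reproducible. Here is a minimal sketch using Fairlearn's demographic parity difference plus per-group accuracy; the synthetic arrays stand in for your model's real test-set predictions before and after mitigation.

```python
import numpy as np
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for a real evaluation set (illustrative only).
n = 1_000
y_true = rng.integers(0, 2, n)
sensitive = rng.choice(["group_a", "group_b"], n)
y_pred_before = rng.integers(0, 2, n)  # predictions pre-mitigation
y_pred_after = rng.integers(0, 2, n)   # predictions post-mitigation

for label, y_pred in [("before", y_pred_before), ("after", y_pred_after)]:
    dpd = demographic_parity_difference(y_true, y_pred,
                                        sensitive_features=sensitive)
    frame = MetricFrame(metrics=accuracy_score, y_true=y_true,
                        y_pred=y_pred, sensitive_features=sensitive)
    print(f"{label}: demographic parity difference = {dpd:.3f}")
    print(frame.by_group)  # per-group accuracy for the audit report
```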
The GDPR and EU AI Act Intersection
This is the most legally complex aspect of Article 10 compliance. The EU AI Act and GDPR interact in ways that require careful analysis.
Training Data Containing Personal Data
If your training dataset contains personal data (which most enterprise AI training sets do), GDPR applies in full to that processing. Key questions:
Lawful basis: What is the lawful basis for using personal data to train your AI system? Options include:
- Legitimate interests (Art. 6(1)(f) GDPR): Requires a legitimate interests assessment (LIA) balancing your interest in training the model against the data subjects’ rights. This is the most commonly used basis but requires documentation.
- Consent: High bar — requires specific, informed, freely given consent for the specific AI training purpose. Often impractical for large datasets.
- Performance of contract: Only applies if training is necessary to deliver a service to the data subjects whose data is used — a narrow basis.
Data minimisation: GDPR requires that only the personal data necessary for the specified purpose is processed. For AI training, this means considering whether you could train an effective model with less personal data, or with pseudonymised or anonymised data.
Purpose limitation: Personal data collected for one purpose (e.g. customer service interactions) cannot simply be repurposed for AI training without a compatible purpose or a new lawful basis.
Retention: How long do you need to retain the training dataset? Once the model is trained, ongoing retention of the full dataset containing personal data requires justification. GDPR’s storage limitation principle requires deletion when personal data is no longer necessary.
Data Subject Rights and Training Data
Can a data subject demand that their personal data be removed from your training dataset? This is one of the most contested questions in AI/GDPR compliance.
Under GDPR:
- Right to erasure (Art. 17): Data subjects can request erasure when personal data is no longer necessary, consent is withdrawn, or processing is unlawful. This applies to training data.
- Right to object (Art. 21): Data subjects can object to processing based on legitimate interests. If successful, you must cease processing — which could mean re-training the model without that data.
Machine unlearning — technically removing the influence of a specific data point from a trained model — is an active research area but not yet reliably feasible for large models. Your data retention and processing policies should account for the possibility of erasure requests, ideally by maintaining training dataset records that enable you to identify and remove specific individuals’ data before training, rather than relying on post-training unlearning.
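A practical pattern that supports this: keep a pseudonymous subject-to-record index alongside the dataset, so erasure and objection requests can be honoured by filtering before the next training run. A minimal sketch with an assumed column name and a hypothetical load_pending_erasures() helper.

```python
import pandas as pd

def apply_erasure_requests(dataset: pd.DataFrame,
                           erasure_ids: set[str],
                           subject_col: str = "subject_id") -> pd.DataFrame:
    """Drop all records belonging to data subjects with pending erasure
    requests. Run before every training job; log counts for the record."""
    before = len(dataset)
    filtered = dataset[~dataset[subject_col].isin(erasure_ids)].copy()
    print(f"Removed {before - len(filtered)} records "
          f"for {len(erasure_ids)} data subjects")
    return filtered

# Usage (load_pending_erasures is a hypothetical helper returning the
# pseudonymous IDs of subjects with open erasure or objection requests):
# train_df = apply_erasure_requests(train_df, load_pending_erasures())
```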
Pseudonymisation and Anonymisation
Where possible, pseudonymise or anonymise training data before training. This reduces GDPR exposure and simplifies compliance with data subject rights. However:
- Pseudonymisation does not eliminate GDPR application — pseudonymised data is still personal data
- True anonymisation is a high bar — data is only truly anonymous if re-identification is not reasonably possible. This is difficult to achieve with rich datasets
Document your anonymisation approach and why you believe it achieves genuine anonymisation (rather than pseudonymisation) where you rely on this.
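A common pseudonymisation technique is keyed hashing of direct identifiers, with the key stored separately from the data. A minimal sketch using HMAC-SHA256; note that the output remains personal data under GDPR, because anyone holding both the key and the data can re-identify.

```python
import hashlib
import hmac

# The key must live apart from the dataset (e.g. in a secrets manager).
# Because the mapping is reversible for a key-holder, this is
# pseudonymisation, not anonymisation.
PSEUDONYMISATION_KEY = b"replace-with-key-from-secrets-manager"

def pseudonymise(identifier: str) -> str:
    """Deterministic pseudonym for a direct identifier (email, customer ID)."""
    return hmac.new(PSEUDONYMISATION_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

print(pseudonymise("jane.doe@example.com"))  # stable pseudonym across datasets
```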
Common Compliance Gaps in Article 10
Gap 1: No dataset documentation
Training data was assembled ad hoc, without formal provenance records. The data sources, collection methodology, and characteristics are not documented, making Annex IV Item 3 impossible to complete accurately.
Gap 2: Bias audit conducted, but not in the right dimensions
A gender bias analysis was performed, but age, nationality, and disability were not examined, even though these are relevant to the system's use case.
Gap 3: Mitigation applied but not measured
The team removed a proxy feature (e.g. postcode) but did not re-test disparity rates after removal. The bias audit report records the mitigation but not its effectiveness.
Gap 4: No GDPR lawful basis for training data
The system was trained on customer interaction data or employee records without a documented lawful basis for using that data for AI training. The original collection purpose does not extend to model training without analysis.
Gap 5: Training data retained indefinitely
The full training dataset (including personal data) is retained on internal servers without a documented retention period or deletion schedule, violating GDPR's storage limitation principle.
Practical Steps
- Inventory your training data: List every dataset used for training, validation, and testing — including data augmentation sources
- Produce dataset cards for each dataset covering provenance, scope, and characteristics
- Conduct a formal bias audit covering all relevant demographic dimensions for your use case
- Document your GDPR lawful basis for any training data containing personal data
- Set a retention policy for training datasets containing personal data
- Link your data documentation to Annex IV Item 3 in your technical file
Our free Status Quo Assessment includes readiness questions covering training data documentation and bias audit status. For a complete Annex IV roadmap with worked dataset card and bias audit templates, see our Technical Documentation Roadmap.