AI Training Sets Harbor Millions of Private Documents, Exposing Deep Privacy Crisis
A groundbreaking study has exposed a massive privacy problem at the heart of modern AI development: a major open-source training dataset likely contains millions of images of passports, credit cards, birth certificates, driver's licenses, and résumés bearing personally identifiable information (PII). Researchers audited just 0.1% of DataComp CommonPool – a 12.8-billion-sample image-text dataset that has become a cornerstone for training generative image models – and found thousands of validated identity documents and sensitive personal files. Extrapolating from that sample, they estimate the full dataset harbors at least 102 million identifiable faces and countless sensitive documents.
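The headline figures rest on scaling up what was found in a small random sample. As a rough illustration of that kind of extrapolation – not the authors' actual methodology, which their paper describes in detail – here is a minimal sketch; the 0.1% sample fraction and 12.8 billion total come from the reporting, while the example count of 3,000 documents is hypothetical:

```python
# Illustrative only: naive scaling of counts found in a random sample
# up to the full dataset. This is NOT the authors' estimation procedure.

TOTAL_SAMPLES = 12_800_000_000   # CommonPool size (image-text pairs)
SAMPLE_FRACTION = 0.001          # the ~0.1% slice the researchers audited
sample_size = TOTAL_SAMPLES * SAMPLE_FRACTION

def extrapolate(found_in_sample: int) -> float:
    """Scale a count observed in the sample to the full dataset,
    assuming the sample is uniformly random."""
    rate = found_in_sample / sample_size
    return rate * TOTAL_SAMPLES

# e.g. a few thousand validated documents in a 0.1% sample implies
# counts on the order of millions across all 12.8 billion pairs:
print(f"{extrapolate(3_000):,.0f}")  # -> 3,000,000 (hypothetical count)
```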
The Scale of Exposure
Led by Rachel Hong, a PhD student at the University of Washington, the research team meticulously validated a subset of CommonPool data, uncovering:
- Identity Documents: Thousands of images of credit cards (with visible numbers), passports, driver's licenses, and birth certificates.
- Résumés & Job Applications: Over 800 validated résumés and cover letters, often linked via LinkedIn to real individuals. These frequently disclosed highly sensitive information including:
  - Disability status
  - Background check results
  - Birth dates and birthplaces of dependents
  - Race/ethnicity
  - Home addresses
  - Contact information for references
- Children's Data: Numerous instances of children's personal information, including birth certificates and health status details.
"Anything you put online can [be] and probably has been scraped," stated co-author William Agnew, an AI ethics postdoctoral fellow at Carnegie Mellon University. The dataset, scraped from the web by Common Crawl between 2014 and 2022, forms the backbone of widely used models. CommonPool is a follow-up to the LAION-5B dataset, used to train models like Stable Diffusion and Midjourney, meaning similar PII contamination is highly probable in those models and countless downstream derivatives. CommonPool has been downloaded over 2 million times.
Why Filtering Failed
Dataset curators were aware of the PII risk and implemented automated face blurring. However, the research revealed critical failures:
- Missed Faces: In their small audit, the researchers validated more than 800 faces that the blurring algorithm failed to catch; they estimate 102 million faces were missed across the entire dataset.
- Ignored PII Strings: The filters did not target recognizable PII character strings such as email addresses or Social Security numbers (see the sketch after this list).
- Caption & Metadata Risk: Blurring images doesn't address sensitive information embedded in accompanying captions or image metadata (names, exact locations).
- Opt-Out Futility: While Hugging Face (the platform hosting CommonPool) offers an opt-out tool, it requires individuals to know their data is included – an almost impossible task given the dataset's scale and opacity.
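On the second point above, the string-level screening the filters lacked can be approximated with simple pattern matching over captions and metadata. The sketch below is purely illustrative – the patterns are assumptions for this example, not CommonPool's actual pipeline:

```python
import re

# Illustrative patterns only; production PII detection relies on trained
# models and validation logic, not just regular expressions.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def flag_pii_strings(text: str) -> dict[str, list[str]]:
    """Return any substrings of `text` matching the (naive) PII patterns."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: matches for name, matches in hits.items() if matches}

# Example: a caption like those paired with images in web-scraped data.
caption = "Contact Jane Doe at jane.doe@example.com or 555-867-5309"
print(flag_pii_strings(caption))
# {'email': ['jane.doe@example.com'], 'phone': ['555-867-5309']}
```

Even this toy version hints at why robust filtering is hard: formats vary widely across countries and document types, and names, addresses, and faces embedded in images cannot be caught by string patterns at all.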
"Filtering is extremely hard to do well," explained Agnew. "They would have had to make very significant advancements in PII detection and removal that they haven’t made public to be able to effectively filter this." Abeba Birhane, a cognitive scientist and tech ethicist at Trinity College Dublin, emphasized this is systemic: "You can assume that any large-scale web-scraped data always contains content that shouldn’t be there."
Legal Gray Areas and the 'Publicly Available' Myth
The research highlights significant legal ambiguities and shortcomings:
- Lack of Meaningful Consent: Much of the data was scraped before generative AI's rise, so individuals could not have consented to training uses that did not yet exist. And deleting an image from its original website does not remove it from already-scraped repositories like CommonPool.
- Patchwork Regulation: GDPR (Europe) and CCPA (California) offer some protections, but the US lacks a comprehensive federal privacy law, and many entities that create datasets fall outside these laws' scope.
- 'Publicly Available' Loophole: Laws like CCPA often exempt "publicly available" information. The AI research community has historically operated under the assumption that web-scraped data is inherently public and fair game. This research fundamentally challenges that notion. "What we found is that ‘publicly available’ includes a lot of stuff that a lot of people might consider private... These are probably not things people want to just be used anywhere, for anything," stated Hong.
- Deletion Dilemma: Even when data is removed from a training set, models already trained on it still encode it, and retraining massive models is often infeasible. "If the organization only deletes data from the training data sets—but does not delete or retrain the already trained model—then the harm will nonetheless be done," noted Tiffany Li, an associate professor of law at the University of San Francisco School of Law.
An Extractive Foundation
Ben Winters, director of AI and privacy at the Consumer Federation of America, called this the "original sin of AI systems built off public data—it’s extractive, misleading, and dangerous." Individuals shared information online under one set of risk assumptions, never anticipating it would be "hoovered up" for AI training. The paper calls for a fundamental reevaluation of indiscriminate web scraping in machine learning: the scale of exposure uncovered suggests current practices are not just ethically dubious but potentially unlawful on a massive scale, demanding urgent policy and technical responses. Marietje Schaake, a tech policy fellow at Stanford, hopes the research "will raise alarm bells and create change."
Source: Based on research published on arXiv and reporting from MIT Technology Review (https://www.technologyreview.com/2025/07/18/1120466/a-major-ai-training-data-set-contains-millions-of-examples-of-personal-data/).