AI Training Sets Harbor Millions of Private Documents, Exposing Deep Privacy Crisis

July 19, 2025 · 5 min read

New research reveals that one of the largest open-source AI image training datasets, DataComp CommonPool, likely contains hundreds of millions of sensitive personal documents, including passports, credit cards, and résumés. Despite attempts at filtering, the study found widespread failures in protecting personally identifiable information, raising urgent questions about the ethics and legality of indiscriminate web scraping for AI development.
