New research reveals that DataComp CommonPool, one of the largest open-source datasets for training AI image models, likely contains hundreds of millions of sensitive personal documents, including passports, credit cards, and résumés. The study found that, despite attempts at filtering, protections for personally identifiable information failed on a wide scale, raising urgent questions about the ethics and legality of indiscriminate web scraping for AI development.