Researchers have unveiled The Common Pile v0.1, an 8TB collection of public domain and openly licensed text for LLM training. By demonstrating that models trained on this ethically sourced data can perform comparably to counterparts trained on unlicensed text, such as Llama 2, the initiative could reshape how AI systems are developed amid growing copyright concerns.