Hugging Face introduces FineTranslations, a massive parallel text dataset with over 1 trillion tokens across 500+ languages, created by translating non-English web content into English using Gemma3 27B models.

Hugging Face has unveiled FineTranslations, a groundbreaking multilingual dataset containing over 1 trillion tokens of parallel text across English and more than 500 languages. This resource addresses a critical gap in machine translation capabilities, particularly for lower-resource languages where translation quality traditionally lags.
The dataset originates from FineWeb2, which aggregates multilingual web content from CommonCrawl snapshots (2013-2024). To ensure balanced representation, Hugging Face kept only language subsets with bible_wiki_ratio < 0.5, which excludes subsets where religious texts or Wikipedia articles make up half or more of the content. Each language includes up to 50 billion tokens, prioritized using FineWeb2-HQ quality classifiers where available.
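The selection rules in the paragraph above can be sketched roughly as follows. This is a minimal illustration, not Hugging Face's pipeline: the 0.5 threshold and the 50-billion-token cap come from the article, while the `LanguageSubset` class, the `quality` and `tokens` document fields, and the greedy selection order are assumptions made for the example.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 50_000_000_000   # per-language cap stated in the article
MAX_BIBLE_WIKI_RATIO = 0.5      # threshold stated in the article


@dataclass
class LanguageSubset:
    # Hypothetical container; only bible_wiki_ratio is named in the article.
    name: str
    bible_wiki_ratio: float


def keep_subset(subset):
    """Keep a language subset only if its ratio is strictly below the threshold."""
    return subset.bible_wiki_ratio < MAX_BIBLE_WIKI_RATIO


def select_documents(docs, budget=TOKEN_BUDGET):
    """Greedy sketch: take the highest-scoring documents that still fit the budget.

    Each doc is a dict with hypothetical "tokens" and "quality" keys.
    """
    chosen, used = [], 0
    for doc in sorted(docs, key=lambda d: d["quality"], reverse=True):
        if used + doc["tokens"] > budget:
            continue  # skip documents that would exceed the token budget
        chosen.append(doc)
        used += doc["tokens"]
    return chosen
```

The real prioritization may differ, for instance by sampling rather than sorting; the sketch only shows how a ratio filter and a token cap interact.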
Technical Implementation
Translation leveraged the datatrove framework for scalable processing:
- Documents split into 512-token chunks with sliding-window context preservation
- Gemma3 27B models handling translation tasks
- Robust safeguards against toxic/spam content via early classification
- Strict formatting constraints and post-processing for structural consistency
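One plausible reading of the chunking step is sketched below. The 512-token chunk size is from the article; the `context_size` parameter, the function name, and the choice to pass the tail of the previous chunk as context are assumptions, since the article does not say how the sliding window is built. Tokens are treated as an arbitrary list, standing in for the output of a real tokenizer.

```python
def chunk_with_context(tokens, chunk_size=512, context_size=64):
    """Split tokens into consecutive chunks, each paired with preceding context.

    Returns a list of (context, chunk) tuples. The context is the last
    `context_size` tokens before the chunk and is meant to be shown to the
    translator but not translated again.
    """
    if chunk_size <= 0 or context_size < 0:
        raise ValueError("chunk_size must be positive and context_size non-negative")
    pairs = []
    for start in range(0, len(tokens), chunk_size):
        context = tokens[max(0, start - context_size):start]
        chunk = tokens[start:start + chunk_size]
        pairs.append((context, chunk))
    return pairs
```

Because the chunks themselves do not overlap, concatenating them reproduces the original document, and only the context windows repeat earlier material.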
Each dataset entry provides:
- Aligned original/translated text chunks
- Language and script identifiers
- Token counts and quality metrics
- Original CommonCrawl source references
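As a rough picture of what one entry might look like in code, the sketch below models the listed properties as a Python dataclass. Every field name is hypothetical, chosen only to mirror the bullet points above; the actual column names should be taken from the dataset card.

```python
from dataclasses import dataclass


@dataclass
class TranslationRecord:
    # All field names are hypothetical stand-ins for the listed properties.
    original_text: str       # source-language chunk
    translated_text: str     # aligned English translation
    language: str            # language identifier, e.g. "fra"
    script: str              # script identifier, e.g. "Latn"
    original_tokens: int     # token count of the source chunk
    translated_tokens: int   # token count of the translation
    quality_score: float     # placeholder for the quality metrics
    source_url: str          # reference back to the CommonCrawl source


def to_parallel_pair(record):
    """Reduce a record to the minimal fields a translation model needs."""
    return {
        "src_lang": f"{record.language}_{record.script}",
        "tgt_lang": "eng_Latn",
        "source": record.original_text,
        "target": record.translated_text,
    }
```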
Dual-Purpose Utility
Beyond its primary translation application, Hugging Face found that the English-translated corpus retains substantial cultural context from the source languages. Models trained exclusively on FineTranslations' English output achieved performance comparable to those trained on the original FineWeb dataset, which positions the corpus as a viable supplement for English-only pretraining.
"This release will bridge the gap and allow communities to better align popular models with their languages," commented Achref Karoui.
Accessibility
Available immediately via:
- Hugging Face Datasets library
- Streamable pipelines for large-scale processing
- Direct consumption through datatrove
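Streaming access through the Datasets library might look like the sketch below. `load_dataset` with `streaming=True` is the library's standard lazy-loading call; the repository ID, the config name, and the `train` split are assumptions that should be checked against the dataset card before use.

```python
from itertools import islice

# Assumed identifiers: verify both against the dataset card.
REPO_ID = "HuggingFaceFW/finetranslations"  # assumption
CONFIG = "fra_Latn"                         # assumption: FineWeb2-style config name


def open_stream(repo_id=REPO_ID, config=CONFIG, split="train"):
    """Open the dataset lazily so nothing is downloaded up front."""
    # Imported here so the helper below works without the datasets package.
    from datasets import load_dataset
    return load_dataset(repo_id, name=config, split=split, streaming=True)


def head(records, n):
    """Return the first n records from any iterable without consuming the rest."""
    return list(islice(records, n))
```

With network access and the `datasets` package installed, `head(open_stream(), 3)` would fetch only the first three rows, which is a cheap way to inspect the real column names before committing to a full processing job.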
FineTranslations is released under the Open Data Commons Attribution License (ODC-By) v1.0, and its use is also subject to CommonCrawl's terms of use. The dataset represents a significant step toward democratizing high-quality multilingual NLP capabilities.
Robert Krzaczyński
Senior Software Engineer
