Hugging Face Releases FineTranslations: A Trillion-Token Multilingual Dataset
#Machine Learning


Hugging Face introduces FineTranslations, a massive parallel text dataset with over 1 trillion tokens across 500+ languages, created by translating non-English web content into English using Gemma3 27B models.

Hugging Face has unveiled FineTranslations, a groundbreaking multilingual dataset containing over 1 trillion tokens of parallel text across English and more than 500 languages. This resource addresses a critical gap in machine translation capabilities, particularly for lower-resource languages where translation quality traditionally lags.

The dataset originates from FineWeb2, which aggregates multilingual web content from CommonCrawl snapshots (2013-2024). To ensure balanced representation, Hugging Face applied strict filters, excluding language subsets where religious texts or Wikipedia articles made up more than half of the content (i.e., keeping only subsets with bible_wiki_ratio < 0.5). Each language contributes up to 50 billion tokens, prioritized using FineWeb2-HQ quality classifiers where available.
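The filtering and per-language token budget described above can be sketched roughly as follows. The field names (`bible_wiki_ratio`, `token_count`) and the assumption that documents arrive pre-sorted by quality score are illustrative, not taken from the actual pipeline:

```python
MAX_TOKENS_PER_LANGUAGE = 50_000_000_000  # 50B-token cap per language


def keep_language(stats: dict) -> bool:
    """Drop language subsets dominated by religious texts or Wikipedia.

    `bible_wiki_ratio` is an assumed field name for the share of such
    content in a language subset.
    """
    return stats["bible_wiki_ratio"] < 0.5


def cap_tokens(docs, max_tokens=MAX_TOKENS_PER_LANGUAGE):
    """Yield quality-ranked documents until the token budget is exhausted."""
    total = 0
    for doc in docs:  # assumed sorted by quality-classifier score, best first
        total += doc["token_count"]
        if total > max_tokens:
            break
        yield doc
```

A subset passing `keep_language` would then be trimmed with `cap_tokens` before translation.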

Technical Implementation

Translation leveraged the datatrove framework for scalable processing:

  • Documents split into 512-token chunks with sliding-window context preservation
  • Gemma3 27B models handling translation tasks
  • Robust safeguards against toxic/spam content via early classification
  • Strict formatting constraints and post-processing for structural consistency
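The chunking step in the pipeline above can be sketched as follows. The overlap size and the exact windowing scheme are assumptions for illustration; the source only states that 512-token chunks carry sliding-window context:

```python
def chunk_with_context(tokens, chunk_size=512, context=64):
    """Split a token sequence into fixed-size chunks, each beginning with
    `context` tokens of overlap from the end of the previous chunk, so the
    translation model sees surrounding text at chunk boundaries."""
    step = chunk_size - context
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Each chunk would then be translated independently, with the overlapping prefix discarded when reassembling the output.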

Each dataset entry provides:

  • Aligned original/translated text chunks
  • Language and script identifiers
  • Token counts and quality metrics
  • Original CommonCrawl source references
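Put together, a single record might look like the sketch below. The exact field names are assumptions based on the fields listed above, not the dataset's actual schema:

```python
# Hypothetical shape of one FineTranslations record (field names assumed).
example_record = {
    "original_text": "...",            # source-language text chunk
    "translated_text": "...",          # aligned English translation
    "language": "swh",                 # language identifier
    "script": "Latn",                  # script identifier
    "token_count": 487,                # token count for the chunk
    "quality_score": 0.92,             # quality metric
    "source_url": "https://...",       # original CommonCrawl source reference
}
```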

Dual-Purpose Utility

Beyond its primary translation application, Hugging Face found that the English-translated corpus retains substantial cultural context from the source languages. Models trained exclusively on FineTranslations' English output achieved performance comparable to those trained on the original FineWeb dataset, positioning it as a viable supplement for English-only pretraining.

"This release will bridge the gap and allow communities to better align popular models with their languages," commented Achref Karoui.

Accessibility

Available immediately via:

  • Hugging Face Datasets library
  • Streamable pipelines for large-scale processing
  • Direct consumption through datatrove

Released under Open Data Commons Attribution (ODC-By) v1.0, FineTranslations remains subject to CommonCrawl's terms. This dataset represents a significant step toward democratizing high-quality multilingual NLP capabilities.


Robert Krzaczyński
Senior Software Engineer
