
Even GenAI uses Wikipedia as a source - Stack Overflow

DevOps Reporter

Wikimedia Deutschland's Wikidata Embedding Project tackles the challenge of making Wikipedia's vast knowledge accessible to AI developers while reducing server load from scrapers.

The rise of AI applications has created a new challenge for knowledge repositories like Wikipedia: how to make their vast stores of information accessible to developers building AI systems while protecting their infrastructure from the strain of constant scraping. Wikimedia Deutschland's Wikidata Embedding Project offers an innovative solution that benefits both the organization and the broader AI community.

The Problem: AI Scraping Overload

As AI applications proliferate, they've become voracious consumers of web data. Wikipedia and its sister projects have seen increasing pressure on their servers from scrapers collecting data for Retrieval-Augmented Generation (RAG) applications and model training. This creates a resource burden for Wikimedia, and the obvious defense of blocking scrapers outright would cut off legitimate access to valuable knowledge.

The Solution: A Vector Database for Wikidata

Philippe Saade, AI Project Lead at Wikimedia Deutschland, spearheaded the creation of a vector database built on top of Wikidata—the structured knowledge graph that powers Wikipedia's infoboxes and underlies much of the Wikimedia ecosystem. The project embeds approximately 30 million Wikidata items (out of 119 million total) that are linked to Wikipedia pages, creating a searchable semantic layer over Wikipedia's knowledge.

Technical Implementation

The transformation from knowledge graph to vector embeddings required several deliberate engineering choices:

  • Text Representation: Since most embedding models work with text rather than graph structures, each Wikidata item was converted into a textual representation
  • Multi-pass Processing: The team processed data dumps multiple times to gather all connected information from the knowledge graph
  • Selective Embedding: Not all data types were embedded—IDs and certain technical properties were excluded as they wouldn't be useful for semantic search
  • Chunking Strategy: Items were broken into chunks of 10-24 tokens, with each chunk containing the item's label, description, and aliases, plus one statement (see the sketch after this list)
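
To make the chunking concrete, here is a minimal Python sketch of what per-statement chunking might look like. The item structure, field names, and example data are illustrative assumptions, not the project's actual schema or pipeline code.

```python
# Illustrative sketch only: field names and the example item are
# hypothetical, not the project's actual schema or pipeline code.

def chunk_item(item: dict) -> list[str]:
    """Split one Wikidata-style item into per-statement text chunks.

    Each chunk repeats the item's label, description, and aliases,
    then adds exactly one statement, as described above.
    """
    header = (
        f"{item['label']}, {item['description']}. "
        f"Also known as: {', '.join(item['aliases'])}."
    )
    return [f"{header} {prop}: {value}." for prop, value in item["statements"]]

# Hypothetical example item:
douglas_adams = {
    "label": "Douglas Adams",
    "description": "English writer and humorist",
    "aliases": ["Douglas Noel Adams"],
    "statements": [
        ("occupation", "novelist"),
        ("notable work", "The Hitchhiker's Guide to the Galaxy"),
    ],
}

for chunk in chunk_item(douglas_adams):
    print(chunk)
```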

The team used Jina AI's V3 embedding model through their API, which provided the necessary scale and performance. They experimented with different embedding sizes, settling on 512 dimensions as a good balance between accuracy and resource efficiency.
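
As a rough illustration, the call below embeds chunks at 512 dimensions through Jina AI's public embeddings API. The request schema reflects Jina's documented API at the time of writing; treat it as a sketch and verify parameter names against their current documentation.

```python
# Hedged sketch: embedding chunks with jina-embeddings-v3 at 512 dimensions
# via Jina AI's public embeddings API. Verify the schema against Jina's docs.
import os
import requests

JINA_API_URL = "https://api.jina.ai/v1/embeddings"

def embed_chunks(chunks: list[str], dims: int = 512) -> list[list[float]]:
    response = requests.post(
        JINA_API_URL,
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={
            "model": "jina-embeddings-v3",
            "task": "retrieval.passage",  # passage-side embeddings for search
            "dimensions": dims,           # truncated from the model's 1024 dims
            "input": chunks,
        },
        timeout=30,
    )
    response.raise_for_status()
    return [row["embedding"] for row in response.json()["data"]]
```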

Making Data Accessible

To further reduce the load on Wikimedia servers, the team published the processed data on Hugging Face using Parquet format. This columnar storage format makes it easy for developers to access pre-processed data without hitting Wikimedia's APIs directly.
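
Consuming those Parquet files might look like the following, using the Hugging Face datasets library. The repository ID is a placeholder, since the article doesn't name the exact dataset; look it up on Hugging Face before running this.

```python
# Sketch of streaming the published Parquet data from Hugging Face.
# The repo ID below is a placeholder, not the project's actual dataset name.
from datasets import load_dataset

ds = load_dataset(
    "some-org/wikidata-embeddings",  # placeholder: substitute the real repo
    split="train",
    streaming=True,  # stream rows instead of downloading the full dump
)

for row in ds.take(3):  # inspect a few records
    print(row)
```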

Beyond the Vector Database: MCP Integration

The project includes an MCP (Model Context Protocol) server that makes Wikidata accessible to AI models in a more user-friendly way. This allows developers to write SPARQL queries—the SQL-like language for querying knowledge graphs—more easily by using the vector database for exploration before generating precise queries.
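
The query below shows the kind of precise SPARQL a developer might write once the vector database has surfaced the right entity and property IDs. It runs against the public Wikidata Query Service rather than the project's MCP server, which remains the simplest way to sanity-check a generated query.

```python
# Run a SPARQL query against the public Wikidata Query Service.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT ?author ?authorLabel WHERE {
  ?author wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P106 wd:Q36180 .   # occupation: writer
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wikidata-embedding-demo/0.1"},  # WDQS etiquette
    timeout=60,
)
resp.raise_for_status()
for binding in resp.json()["results"]["bindings"]:
    print(binding["authorLabel"]["value"])
```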

Balancing Openness and Quality

One challenge the team faced was ensuring the vector database didn't propagate poor-quality information. They implemented rules for what gets embedded (items must have labels, descriptions, and Wikipedia links) while also recognizing that making less-maintained items searchable could encourage community improvement.
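
A literal reading of those rules could be expressed as a simple eligibility check like the one below; the field names are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical eligibility filter mirroring the rules described above:
# an item needs a label, a description, and at least one Wikipedia sitelink.
def is_embeddable(item: dict) -> bool:
    return bool(
        item.get("label")
        and item.get("description")
        and item.get("wikipedia_sitelinks")
    )

items = [
    {"label": "Douglas Adams", "description": "English writer",
     "wikipedia_sitelinks": ["enwiki"]},
    {"label": "", "description": "", "wikipedia_sitelinks": []},
]
print([it["label"] for it in items if is_embeddable(it)])  # -> ['Douglas Adams']
```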

The Future: User Feedback and Evolution

The project is currently in alpha, with the team actively seeking user feedback to understand:

  • What use cases developers are exploring
  • Whether the accuracy meets user needs
  • What additional features would be valuable
  • How frequently the embeddings should be updated

This collaborative approach—building tools that serve both the organization's needs and the developer community's requirements—represents a promising model for how knowledge repositories can adapt to the AI era.

The Wikidata Embedding Project demonstrates that cooperation, rather than resistance, is the most effective strategy for knowledge repositories facing AI-driven demand. By providing structured, accessible data in formats that AI developers can use efficiently, Wikimedia Deutschland is ensuring that Wikipedia's knowledge continues to serve humanity—now including the AI systems that are becoming an integral part of how we access and process information.

Learn more about the Wikidata Embedding Project
