In an era dominated by algorithmically generated search results and proprietary data hoarding, the Curlie project has taken a significant step toward restoring transparency and openness to the web. The community-maintained directory, which traces its lineage back to the legendary Open Directory Project (DMOZ), has made its entire collection of 2.9 million curated website entries available for free download under an open source license.

"We want to enable free, unbiased and transparent access to information. By working together, we are taking a big step towards greater data transparency and data democracy on the World Wide Web," explains Michael Granitzer, project manager of OpenWebSearch.eu, which has already integrated Curlie's editorial descriptions into its open web index.

The Human Advantage in Web Curation

What sets Curlie apart from automated alternatives is its human-curated approach. The directory is maintained by a global community of volunteer editors who specialize in specific categories. This human touch ensures that only high-quality, non-spam websites are included, a crucial advantage over language models, which struggle to assess content quality and trustworthiness.

"This is the advantage we humans have over chat language models: We can assess whether websites are trustworthy," the Curlie team explains. When editors, aided by detection-bots, identify websites that have turned to spam, they are quickly removed from the directory.

Technical Specifications and Accessibility

The entire directory dataset is remarkably compact at just 200 megabytes, thanks to a strictly text-based file format and standard gzip compression. The data is provided as tab-separated values (TSV), making it accessible to a wide range of tools and programming languages.
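
Because the format is plain TSV under gzip, the data can be streamed straight from the compressed file using nothing beyond a standard library. A minimal sketch in Python, where the file name is a placeholder for the actual download:

    import csv
    import gzip

    # Stream the gzipped TSV directly; no need to decompress it to disk.
    # "curlie.tsv.gz" is a placeholder name for the actual download file.
    with gzip.open("curlie.tsv.gz", mode="rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:
            print(row)   # each row is one tab-separated record
            break        # peek at the first record only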

The download includes:
- Category hierarchy and structure
- Website entries with URL, title, and editorial description
- Category metadata with titles, descriptions, and placement in the tree
- Geographic labels for approximately 45,000 categories

Matching between websites and categories is accomplished through ID references, with the full category path included for each entry. This structure allows developers to easily build the hierarchy or filter content by trusted categories.
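
A minimal sketch of that in Python, assuming a hypothetical column order of entry ID, category path, URL, title, and description (check the schema that ships with the download):

    import csv
    import gzip
    from collections import defaultdict

    # Rebuild the category tree from the full category path included with
    # each entry. The column order here is an assumption, not the
    # published schema.
    children = defaultdict(set)

    with gzip.open("curlie-sites.tsv.gz", "rt", encoding="utf-8", newline="") as f:
        for _entry_id, path, _url, _title, _desc in csv.reader(f, delimiter="\t"):
            parts = path.split("/")
            for i in range(1, len(parts)):
                parent = "/".join(parts[:i])   # e.g. "Science" or "Science/Biology"
                children[parent].add(parts[i])

    # Direct subcategories of one trusted branch.
    print(sorted(children["Science"]))

Keying each parent by its full path keeps identically named subcategories in different branches from being merged.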

Strategic Partnerships and Future Directions

To ensure reliable distribution of the dataset, Curlie has partnered with two key institutions:

  1. Leibniz Supercomputing Centre (LRZ): The Munich-based provider of scientific IT services hosts the download on its well-connected supercomputing facilities, ensuring reliable access to the data.

  2. OpenWebSearch.eu: This initiative is building an open web index already containing 1.3 billion website entries. The integration of Curlie's editorial descriptions adds a layer of human curation to this massive dataset.

The project is committed to providing fresh data, with updates pulled from the Curlie database monthly. The current version can be identified by the <LastModified> field in the download metadata.
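
That makes version checks scriptable. A rough sketch, where the metadata file name and element layout are assumptions based only on the documented <LastModified> field:

    import xml.etree.ElementTree as ET

    # "metadata.xml" and the element layout are guesses; adjust them to
    # match the real download metadata.
    root = ET.parse("metadata.xml").getroot()
    print("Dataset last updated:", root.findtext(".//LastModified"))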

Applications for Developers and AI Projects

The availability of this curated directory opens numerous possibilities for developers and researchers:

  • Search Engines: Build niche search engines focused on specific topics with pre-vetted sources
  • AI Training: Provide high-quality, human-curated data for training language models and other AI systems
  • Web Crawlers: Create focused crawlers that only visit trusted websites (see the sketch after this list)
  • Research Tools: Develop academic or market research applications with reliable data sources
  • Directory Services: Create specialized directories for communities or industries
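
For the focused-crawler case, a seed list can be derived from a single trusted branch of the directory. In this sketch, the category prefix, file name, and column order are illustrative assumptions:

    import csv
    import gzip

    # Collect crawler seed URLs from one trusted category subtree.
    # Prefix, file name, and column order are assumptions for illustration.
    TRUSTED_PREFIX = "Health/"

    seeds = []
    with gzip.open("curlie-sites.tsv.gz", "rt", encoding="utf-8", newline="") as f:
        for _entry_id, path, url, _title, _desc in csv.reader(f, delimiter="\t"):
            if path.startswith(TRUSTED_PREFIX):
                seeds.append(url)

    with open("seeds.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(seeds) + "\n")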

Contributing to the Project

Curlie operates on the principles of open source collaboration and welcomes community contributions. Website owners can submit their sites for inclusion at no cost, and individuals passionate about specific topics can apply to become editors. The project also accepts donations to help cover server hosting costs.

The legacy of human-edited web directories may seem antiquated in today's algorithm-driven world, but Curlie's approach offers a valuable counterbalance to the opacity of modern search. By providing developers with access to a massive, curated dataset, the project is not just preserving a piece of internet history; it's actively shaping a more transparent and accessible web future.