How researchers are using GitHub Innovation Graph data to reveal the “digital complexity” of nations - The GitHub Blog

Serverless Reporter

GitHub's public Innovation Graph dataset is enabling new academic research that quantifies national digital capability via open source activity, with a new Research Policy paper finding that software specialization predicts GDP and inequality more accurately than traditional trade and patent metrics.

GitHub's Innovation Graph is a public dataset, released quarterly, that tracks open source developer activity across 163 economies and 150 programming languages. It was built with a core goal of facilitating research on the economic impact of open source software and developer collaboration.

The platform recently released its Q4 2025 data tranche, alongside a major validation of its utility: a paper published in Research Policy that uses Innovation Graph data to quantify the "digital complexity" of nations, a metric that captures software capability gaps traditional economic data misses entirely.


The research team comprises four interdisciplinary scholars:

  • Sándor Juhász, research fellow at Corvinus University of Budapest, focusing on economic geography and innovation networks.
  • Johannes Wachs, associate professor at Corvinus University of Budapest and director of the Center for Collective Learning, who studies computational social science and open source communities.
  • Jermain Kaminski, assistant professor at Maastricht University School of Business and Economics, specializing in causal machine learning and innovation.
  • César A. Hidalgo, professor at Toulouse School of Economics and Corvinus University of Budapest, director of the Center for Collective Learning, and creator of the Observatory of Economic Complexity.

Service Update: GitHub Innovation Graph Q4 2025 Release

The GitHub Innovation Graph is a managed public dataset that aggregates quarterly activity data from GitHub's public repositories, including counts of developers pushing code, segmented by economy (based on IP address geolocation) and programming language. It covers 163 economies and 150 languages from 2020 to present, with new data released every quarter. The Q4 2025 release, announced alongside the Research Policy paper, adds three additional months of global open source activity to the historical record.
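As a rough illustration of how such quarterly counts could be turned into a per-economy, per-language table, here is a minimal Python sketch. The sample rows are made up, and the column names (`economy`, `language`, `year`, `quarter`, `num_pushers`) are placeholders that may not match the published files. Averaging across quarters is one possible aggregation choice, not necessarily the one used in the paper.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the column names are an assumption, not a confirmed schema.
SAMPLE = """economy,language,year,quarter,num_pushers
DE,Python,2023,1,1200
DE,Python,2023,2,1300
DE,Rust,2023,1,150
BR,Python,2023,1,900
BR,JavaScript,2023,1,1100
"""


def developers_by_economy_language(csv_text, year):
    """Average quarterly developer counts per (economy, language) for one year."""
    totals = defaultdict(int)
    quarters = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["year"]) != year:
            continue
        key = (row["economy"], row["language"])
        totals[key] += int(row["num_pushers"])
        quarters[key].add(row["quarter"])
    # Divide by the number of quarters actually observed for each pair.
    return {key: totals[key] / len(quarters[key]) for key in totals}


matrix = developers_by_economy_language(SAMPLE, 2023)
```

Here `matrix[("DE", "Python")]` is the mean of the two quarterly values for that pair.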

GitHub designed the Innovation Graph specifically to address a gap in public data: while open source software underpins most modern digital infrastructure, its economic contribution is rarely captured in national statistics. Traditional metrics like trade flows, patent filings, and academic publications ignore software, which is not physically traded and is often developed in public open source repositories.

Use Cases: Measuring National Digital Complexity

The Research Policy paper represents one of the first large-scale uses of Innovation Graph data to adapt the Economic Complexity Index (ECI), a well-established metric for measuring national economic capability, to software production.

Traditional Economic Complexity Limitations

For 15 years, economists have used ECI to predict national GDP growth, income inequality, and emissions by measuring what physical products a country exports, what patents it files, and what research it publishes. These metrics work because they reflect the accumulation of productive knowledge and capabilities within a country. However, they have a critical blind spot: software.

As Jermain Kaminski notes, "Code doesn’t go through customs. It crosses borders through 'git push', cloud services, and package managers. So all that productive knowledge was essentially invisible, what some colleagues have called the 'digital dark matter' of the economy."

Adapting ECI for Software

To close this gap, the research team applied the ECI framework to Innovation Graph data. The core dataset includes quarterly counts of developers pushing code in each programming language, for each economy, from 2020 to 2023. The team first addressed a key limitation: individual programming languages are not meaningful units of analysis, since most real software projects use bundles of languages. A web application might combine HTML, CSS, and JavaScript; a data science project might pair Python with Jupyter Notebook; a systems programming project might pair C with Assembly.

To group languages into coherent stacks, the team queried the GitHub GraphQL API for all repositories active in 2024, then computed cosine similarity between languages based on co-occurrence within the same repos. They applied hierarchical clustering to group 150 languages into 59 "software bundles" representing common technology stacks.
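The bundling idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's implementation: the repositories are hypothetical, each language is treated as a 0/1 vector over repositories, and the average-linkage rule and the 0.5 stopping threshold are arbitrary choices made for the example.

```python
import math
from itertools import combinations

# Hypothetical repositories, each listed with the languages it contains.
REPOS = [
    {"HTML", "CSS", "JavaScript"},
    {"HTML", "CSS", "JavaScript"},
    {"HTML", "JavaScript"},
    {"Python", "Jupyter Notebook"},
    {"Python", "Jupyter Notebook"},
    {"C", "Assembly"},
    {"C", "Assembly", "Python"},
]


def cosine_similarity(repos):
    """Cosine similarity between languages, each treated as a 0/1 vector over repos."""
    members = {}
    for idx, langs in enumerate(repos):
        for lang in langs:
            members.setdefault(lang, set()).add(idx)
    sim = {}
    for a, b in combinations(sorted(members), 2):
        shared = len(members[a] & members[b])
        sim[frozenset((a, b))] = shared / math.sqrt(len(members[a]) * len(members[b]))
    return sorted(members), sim


def cluster(languages, sim, threshold):
    """Average-linkage agglomerative clustering (assumed linkage rule).

    Merging stops once the most similar pair of clusters falls below `threshold`.
    """
    clusters = [frozenset([lang]) for lang in languages]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(*best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return sorted(sorted(c) for c in clusters)


languages, sim = cosine_similarity(REPOS)
bundles = cluster(languages, sim, threshold=0.5)
```

On this toy input the languages fall into three bundles that mirror the web, data science, and systems examples from the text. Changing the threshold changes the number of bundles, which is the kind of subjective choice any clustering step involves.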

From there, they followed the standard ECI pipeline:

  1. Build a country-by-bundle matrix counting developer activity.
  2. Compute revealed comparative advantage (RCA) for each country-bundle pair, measuring whether a country has a disproportionate share of global developers in that bundle.
  3. Binarize the RCA matrix to mark specializations.
  4. Apply an iterative algorithm to calculate ECI: countries that specialize in many non-ubiquitous bundles score high, while countries that only specialize in common, widely used bundles score low.
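The four steps above can be sketched as follows. This is a minimal illustration, not the paper's code: the counts are invented, the RCA cutoff of 1.0 is the conventional choice rather than one stated here, and the iterative step uses the "method of reflections", one common formulation of ECI, with an assumed fixed number of iterations.

```python
import math

# Step 1: hypothetical economy-by-bundle developer counts.
COUNTS = {
    "A": {"web": 100, "data": 100, "embedded": 100},
    "B": {"web": 200, "data": 100},
    "C": {"web": 300},
}


def rca_matrix(counts):
    """Step 2: RCA = (economy's share in bundle) / (bundle's global share)."""
    economies = sorted(counts)
    bundles = sorted({p for row in counts.values() for p in row})
    row_tot = {c: sum(counts[c].values()) for c in economies}
    col_tot = {p: sum(counts[c].get(p, 0) for c in economies) for p in bundles}
    grand = sum(row_tot.values())
    return {
        c: {p: counts[c].get(p, 0) * grand / (row_tot[c] * col_tot[p]) for p in bundles}
        for c in economies
    }


def binarize(rca, threshold=1.0):
    """Step 3: mark a specialization where RCA reaches the (assumed) cutoff."""
    return {c: {p: int(v >= threshold) for p, v in row.items()} for c, row in rca.items()}


def eci_scores(m, iterations=20):
    """Step 4: method of reflections, then z-score the economy values.

    Assumes every economy and every bundle has at least one specialization,
    and that `iterations` is even so economy scores keep the usual orientation.
    """
    economies = list(m)
    bundles = list(next(iter(m.values())))
    diversity = {c: sum(m[c].values()) for c in economies}
    ubiquity = {p: sum(m[c][p] for c in economies) for p in bundles}
    k_c, k_p = dict(diversity), dict(ubiquity)
    for _ in range(iterations):
        next_c = {c: sum(m[c][p] * k_p[p] for p in bundles) / diversity[c] for c in economies}
        next_p = {p: sum(m[c][p] * k_c[c] for c in economies) / ubiquity[p] for p in bundles}
        k_c, k_p = next_c, next_p
    mean = sum(k_c.values()) / len(k_c)
    std = math.sqrt(sum((v - mean) ** 2 for v in k_c.values()) / len(k_c))
    return {c: (k_c[c] - mean) / std for c in economies}


rca = rca_matrix(COUNTS)
m = binarize(rca)
eci = eci_scores(m)
```

In this toy example economy A, which specializes in the least ubiquitous bundle, scores highest, and economy C, which specializes only in the most common bundle, scores lowest.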

César Hidalgo uses a simple analogy to explain ECI: "Think of countries like kitchens. Some kitchens can cook anything, since they have an abundance of ingredients and tools, from the rarest spices to the best knives. Others are more limited. Maybe they can boil rice and do a few other simple things. Since we cannot look at the kitchens directly, we need to infer their 'complexity' based on the dishes they are able to produce. This is what the economic complexity index or ECI allows you to estimate. We can infer what’s going on in the kitchen by seeing if it is a chicken and rice operation, or a place that can produce sophisticated edible foams and souffles. Originally, these methods were applied to trade data, where the dishes coming out of the kitchen were a country’s exports, but in this paper, we applied that to software. A chicken-and-rice country is a Python and JavaScript country. A Michelin-star country is one that can program certified embedded systems for aerospace and defense."

Key Findings

The team found that software ECI outperforms traditional metrics in predicting GDP per capita and income inequality, even when controlling for trade, patents, and research data. It also captures variation in carbon emissions that traditional data misses.

Another key finding confirms the "principle of relatedness" holds for software: countries do not randomly jump between specializations. They diversify into software bundles that are related to their existing capabilities, just as countries in the physical economy move into products similar to what they already export.
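One way to make relatedness concrete is the proximity and density construction from the broader economic complexity literature. The sketch below assumes that construction. The article does not say which relatedness measure the paper uses, and the specialization matrix is invented for illustration.

```python
from itertools import combinations

# Hypothetical binary specialization matrix (1 = economy specializes in bundle).
M = {
    "c1": {"web": 1, "data": 1, "embedded": 0},
    "c2": {"web": 1, "data": 1, "embedded": 0},
    "c3": {"web": 0, "data": 1, "embedded": 1},
    "c4": {"web": 1, "data": 0, "embedded": 0},
}


def proximity(m):
    """Proximity of two bundles: co-specializations / the larger of their ubiquities."""
    bundles = sorted(next(iter(m.values())))
    ubiquity = {p: sum(row[p] for row in m.values()) for p in bundles}
    phi = {}
    for p, q in combinations(bundles, 2):
        both = sum(row[p] * row[q] for row in m.values())
        denom = max(ubiquity[p], ubiquity[q])
        phi[frozenset((p, q))] = both / denom if denom else 0.0
    return phi


def density(m, phi, economy, bundle):
    """Share of a bundle's total proximity covered by the economy's current specializations."""
    others = [q for q in m[economy] if q != bundle]
    total = sum(phi[frozenset((bundle, q))] for q in others)
    if total == 0:
        return 0.0
    return sum(m[economy][q] * phi[frozenset((bundle, q))] for q in others) / total


phi = proximity(M)
to_data = density(M, phi, "c4", "data")
to_embedded = density(M, phi, "c4", "embedded")
```

Economy `c4`, which currently specializes only in web, has a higher density around the data bundle than around the embedded bundle. Under the principle of relatedness, data would be its more likely next specialization.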

The top 20 economies by software ECI are:

Ranking  Economy          Software ECI
1        Germany          1.739
2        Australia        1.730
3        Canada           1.729
4        Netherlands      1.727
5        France           1.702
6        United States    1.695
7        Poland           1.691
8        United Kingdom   1.687
9        Italy            1.672
10       Sweden           1.620
11       Switzerland      1.620
12       Hong Kong SAR    1.595
13       Norway           1.571
14       Japan            1.552
15       Spain            1.552
16       Russia           1.530
17       Singapore        1.468
18       Taiwan           1.464
19       Belgium          1.448
20       Finland          1.444

Notably, the United States ranks 6th, lower than its typical position in traditional economic complexity rankings, reflecting differences in how software specialization is distributed compared to physical exports.

Real-World Applications

The metric has immediate use cases for multiple stakeholders:

  • Policymakers: Software ECI can guide industrial policy, since software relies on movable human capital. Countries that attract and retain software talent without overregulating data use or shifting innovation risk to small firms can build digital capability faster. The team predicts software ECI will become a standard policy tool within a decade, sitting alongside trade-based metrics.
  • Developers: The software product space representation of capabilities can help developers identify countries where their skillset matches local specializations, improving relocation decisions.
  • Researchers: The dataset provides a foundation for studying the impact of generative AI on software production. The team is already tracking how AI coding assistants affect specialization patterns, with a separate paper in Science on global AI-assisted coding diffusion on GitHub.

Trade-offs: Limitations and Considerations

While the Innovation Graph data enables new research, it comes with clear trade-offs that limit its applicability.

Data Limitations

The biggest constraint is that the Innovation Graph only tracks public GitHub activity, so all proprietary, closed-source enterprise software development is excluded. As a result, the metric likely underestimates the digital complexity of countries with weaker open source cultures, where more development happens in private repositories.

The current 2020–2023 window is sufficient for cross-sectional analysis but too short to test the long-run growth predictions that ECI is designed to make. Economic structures shift over decades, not quarters, so the team recommends expanding the dataset to 20 years of historical data to validate long-term predictive power.

The language-level granularity of the current Innovation Graph is another limitation. The team’s workaround of clustering languages into bundles improves on raw language counts, but ideal data would include project-level details: frameworks, libraries, and use cases (e.g., fintech vs. game engine development) rather than just languages. The team used GitHub Topics as a robustness check but found the data too noisy and incomplete for primary analysis.

Methodological Trade-offs

The ECI method requires binarizing the revealed comparative advantage matrix, which reduces graded specialization patterns to yes/no flags. This can obscure nuance in countries with moderate but not disproportionate activity in a bundle. The hierarchical clustering used to create software bundles also involves subjective choices about similarity thresholds, which can affect final ECI scores.
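A small Python sketch shows how the cutoff can hide borderline cases. The RCA values are hypothetical, and the cutoff of 1.0 and the 0.95 to 1.05 band are assumptions chosen for illustration.

```python
# Hypothetical RCA values, not taken from the paper.
RCA = {
    "X": {"web": 1.40, "data": 0.98, "embedded": 0.10},
    "Y": {"web": 0.60, "data": 1.02, "embedded": 2.50},
}


def specializations(rca, threshold):
    """Bundles each economy counts as specialized in at a given RCA cutoff."""
    return {c: sorted(p for p, v in row.items() if v >= threshold) for c, row in rca.items()}


def flips(rca, low, high):
    """Pairs counted as specializations at the `low` cutoff but not at the `high` one."""
    return sorted((c, p) for c, row in rca.items() for p, v in row.items() if low <= v < high)


at_one = specializations(RCA, 1.0)
borderline = flips(RCA, 0.95, 1.05)
```

Economies X and Y have nearly identical RCA in the data bundle, yet only Y is marked as specialized at the 1.0 cutoff. `borderline` lists the pairs whose status depends on where the cutoff sits.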

Policy Trade-offs

For policymakers, using software ECI to guide talent policy involves trade-offs. While attracting software talent can boost digital complexity, overregulation of data use or poorly designed worker protection laws can drive talent away. The high mobility of software developers means policies that work in one country can be easily undermined by competing jurisdictions with more favorable rules.

Researcher Backgrounds and Advice

The research team represents an interdisciplinary mix of economic geographers, computational social scientists, physicists, and causal machine learning experts. Their career paths highlight the value of cross-field collaboration:

  • Johannes Wachs moved from mathematics to computational social science, focusing on digital data traces to study human behavior.
  • Sándor Juhász trained in traditional economic geography, applying network methods to regional development.
  • Jermain Kaminski combined entrepreneurial projects with academic research, building multimodal machine learning pipelines before focusing on causal inference.
  • César Hidalgo started in physics, applying network tools to economic development to create the field of economic complexity.

All four emphasize the importance of early collaboration, avoiding sunk cost fallacies in research projects, and building tools that compound in value over time. As César Hidalgo advises, "The key is to invest in things that grow or compound. The question you need to ask when working on a project is whether you honestly believe that the project will be more important in a decade from now than today. If the answer is yes, that’s probably a good project."

The team also notes that generative AI tools have become part of their daily workflow, used for debugging data pipelines, drafting boilerplate code, and sanity-checking statistical approaches, though they stress LLMs are most useful when the researcher already has a clear plan.

Further Reading

To learn more about the concepts discussed in the paper, the team recommends:
