AI Code Duplication Study Reveals Surprising Patterns in Vibe-Coded Projects

Analysis of 49 AI-generated projects shows 7.98% average code duplication, with skill libraries showing significantly higher rates than applications.

What happens when you scan 49 projects built primarily with AI coding tools for code duplication? The numbers tell a story that goes beyond what most people expect.

Andrey Kucherenko, maintainer of jscpd - a copy-paste detector with 700K+ weekly npm downloads that recently peaked at 1.16M in April 2026 - decided to find out. In May 2026, he ran jscpd against 49 GitHub projects identified as "vibe-coded" (built primarily with AI coding tools like Claude Code, Cursor, v0, and Bolt).

The headline numbers are striking:

7.24 million lines of code scanned
25,612 duplicate blocks found
Average duplication: 7.98%
45 out of 49 projects had detectable duplication
Only 4 projects were completely clean
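
To make these percentages concrete: a project's duplication rate is duplicated lines divided by total lines. The article does not say whether the 7.98% average is the mean of the per-project rates or a pooled rate over all 7.24 million lines, and the two can differ substantially. A minimal sketch, using made-up project sizes (hypothetical values, not the study's data):

```python
# Hypothetical (total_lines, duplicated_lines) pairs; NOT data from the study.
projects = [(1_000, 400), (100_000, 5_000), (50_000, 0)]


def rate(total: int, duplicated: int) -> float:
    """Duplication rate of one code base, in percent."""
    return 100.0 * duplicated / total if total else 0.0


# Mean of per-project rates: every project counts equally.
mean_of_rates = sum(rate(t, d) for t, d in projects) / len(projects)

# Pooled rate: all lines are summed first, so large projects dominate.
pooled_rate = rate(sum(t for t, _ in projects), sum(d for _, d in projects))

# Share of projects with any detectable duplication ("45 out of 49").
share_with_duplication = 100.0 * 45 / 49

print(f"mean of rates: {mean_of_rates:.2f}%")
print(f"pooled rate:   {pooled_rate:.2f}%")
print(f"projects with duplication: {share_with_duplication:.1f}%")
```

In this toy sample, one small but heavily duplicated project pulls the mean of rates well above the pooled rate, which is worth keeping in mind when comparing averages across samples.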

But the most interesting finding wasn't about AI agents producing duplicated code. It was about something nobody was talking about yet.

The Unexpected Story: Skill Libraries Dominate Duplication

When Kucherenko reviewed the top duplicators, he discovered something unexpected. The most duplicated projects in his sample weren't AI-generated applications. They were skill libraries for AI agents.

The top-ranked project, frontend-slides, isn't a vibe-coded application at all. It's a Claude skill - a markdown-based instruction set that teaches Claude how to generate HTML presentations. The 42% duplication isn't AI-generated code; it's CSS template variations included in the skill so Claude can offer style choices.

The second-ranked project, agent-skills, is a collection of 100+ skills for multiple AI agents: Claude Code, Codex CLI, and Gemini CLI. The 37% duplication is largely the same skill files structured for different agent platforms.

"The 'vibe-coded' tag on GitHub turns out to be much broader than I assumed," Kucherenko explains. "It includes actual applications built primarily with AI agents, skill libraries for AI agents, tools that enable vibe coding, and tutorials and reference implementations. These categories have very different duplication profiles, and conflating them obscures what's actually happening."

Real Applications vs. Skill Libraries

When the sample was narrowed to projects that look like actual AI-generated applications, the picture became clearer:

In these projects, duplication rates fall in the 6-9% range: meaningfully higher than human-written codebases (typically 3-5%), but not catastrophic. The real concern is the rate of accumulation, not the absolute number.

"For a human developer, accumulating 5,000+ duplicate blocks would take years and probably get caught in code review," Kucherenko notes. "AI agents can produce that level of duplication in weeks because they don't remember what they wrote yesterday. Every new feature is a fresh paste."

Skill libraries, however, show dramatically higher duplication rates (30-40%). This duplication has a different character than what's found in apps. It's not maintenance debt accumulating accidentally. It's deliberate copying.

"The AI agent ecosystem is fragmented," Kucherenko explains. "A skill written for Claude Code needs to work with Cursor, Codex CLI, Gemini CLI, Windsurf, and a dozen other agents. Each agent has slightly different conventions, file locations, and metadata requirements. The path of least resistance is to copy the skill folder for each target agent and tweak the metadata."
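
The copy-and-tweak pattern described here can be spotted with a simple check: strip the metadata block from each skill file and compare what remains. The sketch below is an illustration only, not how jscpd works. The folder names, file contents, and the assumption that metadata lives in a leading `---` block are all hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical layout: one skill copied into per-agent folders with tweaked metadata.
files = {
    ".claude/skills/review/SKILL.md":
        "---\nname: review\nagent: claude\n---\nRead the diff.\nList risks.\n",
    ".codex/skills/review/SKILL.md":
        "---\nname: review\nagent: codex\n---\nRead the diff.\nList risks.\n",
    ".gemini/skills/review/SKILL.md":
        "---\nname: review\nagent: gemini\n---\nRead the diff.\nList risks first.\n",
}

# Assumed metadata format: a block delimited by '---' lines at the top of the file.
FRONT_MATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)


def body_fingerprint(text: str) -> str:
    """Hash the file body with the leading metadata block removed."""
    body = FRONT_MATTER.sub("", text, count=1)
    normalized = "\n".join(line.strip() for line in body.splitlines() if line.strip())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_copies(sources: dict[str, str]) -> list[list[str]]:
    """Return groups of paths whose bodies are identical after normalization."""
    groups = defaultdict(list)
    for path, text in sources.items():
        groups[body_fingerprint(text)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


copies = group_copies(files)
print(copies)
```

Here the first two files are grouped as copies because only their metadata differs, while the third is left out because its body was edited. Exact hashing misses such near-copies, which is one reason dedicated clone detectors compare smaller fragments instead of whole files.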


Language Distribution and Tool Evolution

The data reveals interesting patterns about where duplication occurs:

TypeScript dominates the duplication: 44% of all duplicate blocks were in TypeScript files
JavaScript was second at 14%
TSX was third at 9%
Markdown duplication accounted for 6% of all duplicates
JSON configuration duplication was significant at 1,500 clones
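
A breakdown like the one above can be reproduced from a clone report by counting duplicate blocks per format. The sketch below assumes a report shaped as a `duplicates` list whose entries carry a `format` field. That shape is an assumption to check against the report your jscpd version produces, and the sample entries are illustrative, not the study's data.

```python
from collections import Counter

# Assumed report shape: a "duplicates" list whose entries carry a "format" field.
# Sample values are illustrative, NOT taken from the study.
report = {
    "duplicates": [
        {"format": "typescript", "lines": 12},
        {"format": "typescript", "lines": 30},
        {"format": "javascript", "lines": 8},
        {"format": "markdown", "lines": 20},
    ]
}


def share_by_format(report: dict) -> dict[str, float]:
    """Percentage of duplicate blocks per format, largest share first."""
    counts = Counter(entry["format"] for entry in report.get("duplicates", []))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fmt: round(100.0 * n / total, 1) for fmt, n in counts.most_common()}


shares = share_by_format(report)
print(shares)
```

Counting blocks weights a 5-line clone the same as a 500-line one; summing the `lines` field instead would give a size-weighted view of the same data.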

This concentration in the JavaScript/TypeScript ecosystem makes sense given the dominance of these languages in AI-generated frontend code, but it confirms that the duplication issue is concentrated rather than evenly distributed.

As a result of his findings, Kucherenko updated jscpd to include cross-format detection, which finds duplicated content shared between files of different formats. That is a necessity in modern codebases, where the same logic might appear in .vue files, .ts libraries, and README.md documentation.
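
The idea behind cross-format detection can be illustrated with a deliberately simplified, line-based sketch: compare content without regard to file extension, and report any run of lines that shows up in more than one file. This is not jscpd's implementation; the function names, file names, and the `min_lines` parameter are hypothetical.

```python
from collections import defaultdict


def normalize(line: str) -> str:
    """Collapse whitespace so indentation differences do not hide a match."""
    return " ".join(line.split())


def find_shared_blocks(files: dict[str, str], min_lines: int = 3) -> dict:
    """Map each run of min_lines non-empty lines to the files that contain it,
    keeping only runs that appear in more than one file."""
    seen = defaultdict(set)  # window of lines -> paths containing that window
    for path, text in files.items():
        lines = [normalize(line) for line in text.splitlines() if line.strip()]
        for i in range(len(lines) - min_lines + 1):
            seen[tuple(lines[i:i + min_lines])].add(path)
    return {window: sorted(paths) for window, paths in seen.items() if len(paths) > 1}


# Hypothetical inputs: a TypeScript source file and a README that quotes part of it.
files = {
    "src/price.ts": (
        "export function total(items: number[]) {\n"
        "  let sum = 0;\n"
        "  for (const x of items) sum += x;\n"
        "  return sum;\n"
        "}\n"
    ),
    "README.md": (
        "Usage:\n\n"
        "    let sum = 0;\n"
        "    for (const x of items) sum += x;\n"
        "    return sum;\n"
    ),
}

shared = find_shared_blocks(files)
for window, paths in shared.items():
    print(paths, "share", len(window), "lines")
```

A production detector would work on tokens and merge overlapping windows into maximal clones; the sketch only shows why file format need not matter once content is normalized.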

The new version also includes:

An AI Reporter that produces token-efficient output for feeding back into LLMs
An MCP Server that AI agents can query directly during their workflows
An Agent Skill that works across Claude Code, Cursor, Copilot, and Gemini

"The goal is to put detection inside the agent loop, not after it," Kucherenko explains. "Catch the duplication while the code is being written, not in a CI run that arrives an hour later."

What This Means for the AI Coding Ecosystem

The study reveals a fundamental challenge in the AI coding ecosystem: the infrastructure designed to teach AI agents how to write better code is itself heavily duplicated.

"The tools meant to give agents discipline embody the same pattern they're trying to prevent," Kucherenko observes. "The duplication isn't carelessness; it's the natural consequence of a young ecosystem without consolidation primitives."

For maintainers of these libraries, this is a rational architectural choice. Centralizing the content would require building abstractions that don't exist yet in the agent skill standard. So they paste.

But as more skill libraries proliferate, this duplication compounds across the entire agent ecosystem, creating technical debt that will need to be addressed as the ecosystem matures.

Next Steps in Understanding AI Code Duplication

Kucherenko plans several follow-up studies:

A control group of human-written projects of comparable size and language
A focused study on duplication patterns specifically in agent skill libraries
Longitudinal data tracking duplication in specific projects over time

"The broader point is that code quality in the AI era is going to need different tools, different metrics, and different workflows than what we built for human-driven development," Kucherenko concludes. "The duplication problem is one example. There are others."

For developers working with AI coding agents, Kucherenko recommends measuring their own duplication using jscpd and considering whether their duplication patterns are necessary or accidental. For those shipping AI-generated production code, integrating jscpd into CI pipelines - particularly using the new AI Reporter - can help catch duplication issues before they accumulate.
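
A CI check along these lines can be as small as reading the overall percentage from a JSON report and failing the build when it exceeds a budget. The sketch below assumes the report exposes that figure at `statistics.total.percentage`. Treat that path as an assumption to verify against your own report; the 5% budget is an arbitrary example, not a recommendation from the study.

```python
# Assumption: the JSON report exposes the overall duplication percentage at
# statistics.total.percentage. Verify this against the report you generate.
def duplication_gate(report: dict, max_percent: float) -> int:
    """Return a process exit code: 0 if within budget, 1 otherwise."""
    percentage = float(report["statistics"]["total"]["percentage"])
    if percentage > max_percent:
        print(f"FAIL: duplication {percentage:.2f}% exceeds {max_percent:.2f}%")
        return 1
    print(f"OK: duplication {percentage:.2f}% within {max_percent:.2f}%")
    return 0


# Illustrative report using the study's headline average as the sample value.
sample_report = {"statistics": {"total": {"percentage": 7.98}}}
exit_code = duplication_gate(sample_report, max_percent=5.0)
```

In a pipeline, the script would load the report file with `json.load` and pass the return value to `sys.exit`. jscpd also offers a threshold option of its own, which may make a separate script unnecessary.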

The raw data from this study is publicly available at kucherenko.github.io/cpd-vibe-coding-report, and the jscpd tool is open source under the MIT license with support options available at opencollective.com/jscpd.

As AI coding tools continue to evolve and proliferate, understanding and managing code duplication will become increasingly important. This study provides valuable insights into one of the first measurable quality challenges emerging in the AI coding era.
