The Invisible Epidemic of Code Theft

Earlier this month, the maintainer of the open-source project Cheating-Daddy uncovered a brazen case of software plagiarism: a Y Combinator-backed startup had copied the project's GPL-licensed codebase, removed all comments, and relicensed it as proprietary under the name "Glass." This incident isn't isolated. During security assessments, Trail of Bits routinely discovers improperly vendored code, whether copied snippets or entire libraries, that introduces hidden vulnerabilities and legal liabilities.

"Security debt accumulates silently when developers vendor code," explains the Trail of Bits team. "Whether copying an OpenSSL function or a smart contract utility from OpenZeppelin, you inherit every vulnerability in that code. Without tracking its origin, you won't know when new CVEs affect you." The consequences cascade: attribution disappears, licenses are violated, and the copied code stays frozen while the original project evolves.

How Vendetect Unmasks Obfuscated Code

Enter Vendetect, Trail of Bits' new open-source tool that detects vendored code using semantic fingerprinting. It implements the Winnowing algorithm, also used in Stanford's MOSS plagiarism detector, but unlike academic plagiarism detectors it adds version control awareness. Here's how it cuts through obfuscation:

  1. Tokenization: Code is parsed via Pygments' language-specific lexers
  2. k-gram generation: Overlapping sequences of k consecutive tokens create context-rich signatures
  3. Fingerprinting: Hashed k-grams pass through a sliding window, and the minimum hash in each window is kept as part of the fingerprint
  4. Git archaeology: Pinpoints exact source commits when matches are found

# Core detection process
from vendetect.fingerprint import WinnowingComparator

# Generates fingerprint resilient to:
# - Renamed variables
# - Stripped comments
# - Reformatted code
comparator = WinnowingComparator()
fingerprint = comparator.fingerprint('suspicious_file.c')
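
The tokenize, k-gram, and winnowing steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not Vendetect's implementation: a small regex tokenizer for C-like code stands in for the Pygments lexers, identifiers are collapsed to a single placeholder (an assumption about how renamed variables are tolerated), and the names `tokenize`, `kgram_hashes`, `winnow`, and `similarity` as well as the values `k=5` and `window=4` are hypothetical.

```python
import hashlib
import re

# Assumption: a tiny regex tokenizer for C-like code, standing in for a real lexer.
TOKEN_RE = re.compile("|".join([
    r"(?P<comment>//[^\n]*|/\*.*?\*/)",   # line and block comments
    r'(?P<string>"(?:\\.|[^"\\])*")',     # string literals
    r"(?P<number>\d+(?:\.\d+)?)",         # numeric literals
    r"(?P<name>[A-Za-z_]\w*)",            # identifiers and keywords
    r"(?P<op>[^\s\w])",                   # any other single non-space character
]), re.DOTALL)

KEYWORDS = {"int", "char", "void", "return", "if", "else", "for", "while"}


def tokenize(source):
    """Drop comments and whitespace; replace identifiers and literals with placeholders."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            continue
        if kind == "name" and text not in KEYWORDS:
            tokens.append("ID")
        elif kind == "number":
            tokens.append("NUM")
        elif kind == "string":
            tokens.append("STR")
        else:
            tokens.append(text)
    return tokens


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = "\x00".join(tokens[i:i + k])
        digest = hashlib.sha1(gram.encode()).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(hashes, window):
    """Keep the minimum hash of each sliding window (positions are omitted here)."""
    if not hashes:
        return set()
    last_start = max(len(hashes) - window, 0)
    return {min(hashes[i:i + window]) for i in range(last_start + 1)}


def fingerprint(source, k=5, window=4):
    return winnow(kgram_hashes(tokenize(source), k), window)


def similarity(a, b):
    """Jaccard similarity between two fingerprint sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = """
// add two numbers
int add(int a, int b) {
    return a + b;
}
"""
copied = "int sum(int x,int y){return x+y;}"  # renamed, reformatted, comment stripped

print(similarity(fingerprint(original), fingerprint(copied)))  # 1.0
```

Because the two snippets reduce to the same token stream, their fingerprints are identical. Full Winnowing also records where each selected hash occurs so that matching regions can be highlighted; the sketch leaves that out for brevity.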

This approach identifies matches even when code undergoes superficial transformations. But Vendetect's killer feature is version control integration. When it flags copied OpenSSL code, it identifies the exact commit—letting you immediately check if it contains Heartbleed or other patched vulnerabilities.
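
One way to picture the commit lookup is as a search over the upstream file's history for the version whose fingerprint is closest to the vendored copy. The sketch below assumes fingerprints are sets of hashes and that one fingerprint per upstream commit is already available (for example, computed from the output of `git show <commit>:<path>`). How Vendetect actually walks history is not described here, so the function `closest_commit`, the commit IDs, and the fingerprint values are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two fingerprint sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_commit(vendored, history):
    """Return (commit_id, score) for the historical version most similar to `vendored`.

    `history` is an iterable of (commit_id, fingerprint) pairs, oldest first.
    On a tie, the earliest commit wins.
    """
    best_id, best_score = None, 0.0
    for commit_id, fp in history:
        score = jaccard(vendored, fp)
        if score > best_score:
            best_id, best_score = commit_id, score
    return best_id, best_score


# Hypothetical upstream history of one file, as (commit, fingerprint) pairs.
history = [
    ("a1b2c3", {1, 2, 3, 4}),        # version that was copied
    ("d4e5f6", {1, 2, 3, 9}),        # upstream fix changes part of the file
    ("0789ab", {1, 2, 9, 10, 11}),   # later refactor
]

# The vendored copy carries one local modification (hash 5).
vendored = {1, 2, 3, 4, 5}

commit, score = closest_commit(vendored, history)
print(commit, score)
```

Once the closest commit is known, checking whether it predates a given security fix becomes an ordinary question about upstream history.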

Vendetect output comparing Glass (left) to Cheating-Daddy (right). Despite comment removal and reformatting, high similarity scores expose copying.

Real-World Impact: From Smart Contracts to Supply Chains

During audits, Vendetect has exposed critical risks:

  • Smart contracts using vulnerable OpenZeppelin versions
  • Cryptographic libraries copied from pre-disclosure commits
  • Authentication code with hardcoded backdoors lifted from tutorials

In the Cheating-Daddy case, Vendetect took just 10 seconds to reveal extensive copying:

vendetect https://github.com/pickle-com/glass https://github.com/example/cheating-daddy

Beyond catching plagiarism, the tool enables:

  • License compliance: Detect stripped copyright notices
  • Vulnerability tracking: Map vendored code to CVE-affected commits
  • Supply chain mapping: Find untracked dependencies

Integrating Vendetect Into Your Workflow

Installation is straightforward:

pip install vendetect

Scan local or remote repositories with rich diff views or machine-readable output:

# Compare local projects
vendetect /path/to/project /path/to/upstream

# Generate CI/CD reports
vendetect repo1 repo2 --format json > results.json

The modular architecture supports custom comparators for specialized needs like smart contracts or embedded systems:

from pathlib import Path

import vendetect.comparison

class MyComparator(vendetect.comparison.Comparator):
    def fingerprint(self, path: Path) -> "CustomFingerprint":
        # Placeholder: plug in AST-based hashing, ML embeddings, etc.
        raise NotImplementedError
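
To make the "AST-based hashing" idea concrete, here is a minimal sketch of what the body of such a fingerprint method might compute for Python source files. It is a stand-alone function rather than a Comparator subclass, because the base-class interface beyond `fingerprint(path)` is not shown here. The names `ast_fingerprint` and `_RenameIdentifiers` are hypothetical, and the normalization only covers variable, argument, and function names.

```python
import ast
import hashlib


class _RenameIdentifiers(ast.NodeTransformer):
    """Collapse variable, argument, and function names to a single placeholder."""

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return self.generic_visit(node)

    def visit_FunctionDef(self, node):
        node.name = "_"
        return self.generic_visit(node)


def ast_fingerprint(source: str) -> frozenset:
    """Hash every statement subtree of the normalized AST.

    Comments and formatting never reach the AST, and ast.dump() omits
    line numbers by default, so only the code's structure is hashed.
    """
    tree = _RenameIdentifiers().visit(ast.parse(source))
    hashes = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.stmt):
            dumped = ast.dump(node)
            hashes.add(hashlib.sha1(dumped.encode()).hexdigest())
    return frozenset(hashes)


a = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
b = "def s(items):  # sum\n    r = 0\n    for i in items: r += i\n    return r\n"
c = "def f(xs):\n    return len(xs)\n"

print(ast_fingerprint(a) == ast_fingerprint(b))  # True
print(ast_fingerprint(a) == ast_fingerprint(c))  # False
```

Compared with token-level Winnowing, hashing whole statements is stricter about structure: a single inserted statement changes the hash of every enclosing block, so a production comparator would likely combine the two approaches or hash smaller subtrees.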

Turning Visibility Into Action

Code vendoring won't disappear, but Vendetect makes its risks manageable. By exposing the hidden lineage of software components, teams can finally track security debt where it accumulates: in the shadows between copy-pasted code and its forgotten origins. As Trail of Bits notes: "Security debt compounds fastest when you don't know it exists."

Vendetect is available on GitHub and PyPI.