sem: Semantic Version Control CLI Brings Entity-Level Code Diffing to 13 Languages

sem is a semantic version control CLI that provides entity-level diffs, blame, graph, and impact analysis for code across 13 programming languages and structured data formats.

Tracking code changes at the entity level, rather than as lines of text, changes how developers read the history of a codebase. The sem CLI tool, developed by Ataraxy Labs, brings semantic version control to Git repositories, showing exactly which functions, classes, and other code entities were added, modified, or deleted across a project.

Beyond Line-Based Diffing

Traditional diff tools show changes at the character or line level, which can make understanding code evolution difficult. When you see "line 43 changed," you still need to open the file and mentally reconstruct what actually happened. sem changes this by parsing code into its semantic entities and tracking those entities across changes.

For example, instead of seeing generic line changes, sem tells you that "function validateToken was added in src/auth.ts" or that "property production.pool_size changed from 5 to 20 in config/database.yml." This entity-level insight makes code reviews faster and helps developers understand the actual impact of changes without manually scanning through modified files.
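To make the idea concrete, here is a minimal sketch of entity-level diffing. It is not sem's implementation: it uses Python's built-in ast module as a stand-in for tree-sitter, handles only top-level functions and classes in Python source, and the function names extract_entities and entity_diff are made up for this illustration.

```python
import ast


def extract_entities(source: str) -> dict[str, str]:
    """Map each top-level function/class to the source text that defines it."""
    entities = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # imports, assignments, etc. are ignored in this sketch
        entities[f"{kind} {node.name}"] = ast.get_source_segment(source, node)
    return entities


def entity_diff(before: str, after: str) -> list[str]:
    """Report added, deleted, and modified entities between two versions."""
    old, new = extract_entities(before), extract_entities(after)
    report = [f"{name} was added" for name in sorted(new.keys() - old.keys())]
    report += [f"{name} was deleted" for name in sorted(old.keys() - new.keys())]
    report += [
        f"{name} was modified"
        for name in sorted(old.keys() & new.keys())
        if old[name] != new[name]
    ]
    return report


before = "def login(user):\n    return user\n\ndef logout():\n    pass\n"
after = (
    "def login(user):\n    return user.strip()\n\n"
    "def validate_token(token):\n    return bool(token)\n"
)
for line in entity_diff(before, after):
    print(line)
```

The output names entities rather than line numbers, which is the shift in perspective described above. Comparing raw source text, as this sketch does, would still flag a pure reformatting as a modification.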

Multi-Language Support with Tree-Sitter

The tool supports 13 programming languages through tree-sitter parsing, including TypeScript, JavaScript, Python, Go, Rust, Java, C, C++, C#, Ruby, PHP, and Fortran. Each language gets its own entity extraction: functions, classes, methods, interfaces, and other language-specific constructs are identified and tracked.

Beyond programming languages, sem also handles structured data formats like JSON, YAML, TOML, CSV, and Markdown. For these formats, it provides entity-level diffing at the property, section, or row level, making it useful for configuration files and data files that aren't traditional source code.
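Property-level diffing of a configuration file can be sketched in a few lines. The example below assumes the file has already been parsed into nested dictionaries (for instance by a YAML or JSON loader) and flattens it into dotted paths before comparing. The helper names flatten and property_diff are hypothetical and say nothing about how sem does this internally.

```python
def flatten(data: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {'a.b.c': value} dotted paths."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def property_diff(before: dict, after: dict) -> list[str]:
    """Describe added, removed, and changed properties by dotted path."""
    old, new = flatten(before), flatten(after)
    changes = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            changes.append(f"property {path} added with value {new[path]!r}")
        elif path not in new:
            changes.append(f"property {path} removed")
        elif old[path] != new[path]:
            changes.append(
                f"property {path} changed from {old[path]!r} to {new[path]!r}"
            )
    return changes


before = {"production": {"adapter": "postgres", "pool_size": 5}}
after = {"production": {"adapter": "postgres", "pool_size": 20}}
print(property_diff(before, after))
```

Because the comparison works on paths rather than lines, reordering keys or re-indenting the file produces no changes at all.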

Smart Entity Matching

sem matches entities between versions in three phases. When comparing code before and after a change, it first looks for exact ID matches: the same entity appearing in both versions. For entities left unmatched, it uses structural hashing to detect renames and moves by comparing abstract syntax tree structures while ignoring whitespace and comments. Finally, it applies fuzzy matching to entities with high token overlap, catching cases where code was substantially rewritten but still represents the same logical entity.
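The three phases can be sketched as follows. This is an illustration under stated assumptions, not sem's algorithm: the structural hash here is computed over a comment-free token stream rather than a real syntax tree, the fuzzy score is Jaccard overlap of token sets, and the Entity type, the ID scheme, and the 0.8 threshold are all hypothetical.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str    # hypothetical ID scheme, e.g. "a.py::login"
    body: str  # source text of the entity body, without its name


def tokens(body: str) -> list[str]:
    """Drop '#' comments (naively), then split into word/punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", re.sub(r"#.*", "", body))


def structural_hash(body: str) -> str:
    """Hash that ignores whitespace and comments (token-based stand-in)."""
    return hashlib.sha256(" ".join(tokens(body)).encode()).hexdigest()


def overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets."""
    ta, tb = set(tokens(a)), set(tokens(b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def match_entities(before, after, threshold=0.8):
    """Return {before_id: (after_id, phase)} for every matched entity."""
    matches, candidates = {}, {e.id: e for e in after}

    # Phase 1: exact ID match.
    unmatched = []
    for e in before:
        if e.id in candidates:
            matches[e.id] = (e.id, "exact")
            del candidates[e.id]
        else:
            unmatched.append(e)

    # Phase 2: same structure under a different ID (rename or move).
    by_hash = {}
    for c in candidates.values():
        by_hash.setdefault(structural_hash(c.body), []).append(c)
    leftover = []
    for e in unmatched:
        same = by_hash.get(structural_hash(e.body))
        if same:
            target = same.pop(0)
            matches[e.id] = (target.id, "structural")
            del candidates[target.id]
        else:
            leftover.append(e)

    # Phase 3: fuzzy match on token overlap.
    for e in leftover:
        best = max(candidates.values(),
                   key=lambda c: overlap(e.body, c.body), default=None)
        if best is not None and overlap(e.body, best.body) >= threshold:
            matches[e.id] = (best.id, "fuzzy")
            del candidates[best.id]
    return matches


before = [
    Entity("a.py::keep", "return 1"),
    Entity("a.py::old_name", "x = compute(y)  # note\nreturn x"),
    Entity("a.py::parse", "a = read(p)\nb = split(a)\nreturn clean(b)"),
]
after = [
    Entity("a.py::keep", "return 2"),
    Entity("a.py::new_name", "x = compute(y)\nreturn x"),
    Entity("a.py::parse_input", "a = read(p)\nb = split(a)\nreturn clean(b, strict)"),
]
print(match_entities(before, after))
```

Entities in the old version that survive all three phases unmatched would be reported as deleted, and leftover candidates in the new version as added.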

This approach means sem can detect renames and moves automatically, not just additions and deletions. It also distinguishes between cosmetic changes like formatting and actual logic changes through structural hashing, providing more accurate insights into what truly changed in the codebase.
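A small sketch shows how hashing structure instead of text separates cosmetic edits from logic changes. It uses Python's ast module rather than tree-sitter and works only on Python source. The names structure_hash and classify_change are made up for the example.

```python
import ast
import hashlib


def structure_hash(source: str) -> str:
    """Hash the syntax tree; comments, spacing, and line numbers drop out."""
    tree = ast.parse(source)
    # ast.dump omits position attributes by default, and the parser
    # discards comments, so only the code's structure is hashed.
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()


def classify_change(before: str, after: str) -> str:
    """Label a change as 'unchanged', 'cosmetic', or 'logic'."""
    if before == after:
        return "unchanged"
    if structure_hash(before) == structure_hash(after):
        return "cosmetic"
    return "logic"


original = "def area(w, h):\n    return w * h\n"
reformatted = "def area(w,h):\n    # width times height\n    return w*h\n"
rewritten = "def area(w, h):\n    return w + h\n"

print(classify_change(original, reformatted))
print(classify_change(original, rewritten))
```

A line-based diff would flag both edits identically. The structural comparison reports the first as cosmetic and only the second as a logic change.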

Practical CLI Features

The tool works inside existing Git workflows. Running sem diff in any Git repository shows semantic diffs of the working changes, and developers can also target specific commits, commit ranges, or staged changes. The --format json option emits structured output suited to CI pipelines, AI agents, or custom tooling.
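A CI step might consume the JSON output along these lines. The command sem diff --format json comes from the description above, but the article does not show the output schema, so the "changes", "entity", and "change_type" keys below are purely hypothetical placeholders. Check the tool's actual output before relying on any of them.

```python
import json
import subprocess
from collections import Counter


def run_sem_diff() -> str:
    """Run `sem diff --format json` in the current repository."""
    result = subprocess.run(
        ["sem", "diff", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarize_changes(report_json: str) -> Counter:
    """Count changes per type.

    ASSUMPTION: the report is an object with a "changes" list whose items
    have a "change_type" string. This schema is invented for illustration.
    """
    report = json.loads(report_json)
    return Counter(item["change_type"] for item in report.get("changes", []))


# Hand-written sample in the hypothetical schema, so the sketch runs
# without sem installed.
sample = json.dumps({
    "changes": [
        {"entity": "validateToken", "change_type": "added"},
        {"entity": "login", "change_type": "modified"},
        {"entity": "logout", "change_type": "modified"},
    ]
})
print(summarize_changes(sample))
```

In a real pipeline, summarize_changes(run_sem_diff()) could feed a check that, for example, requires extra review when entities are deleted.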

Additional commands include sem graph for visualizing entity dependency graphs, sem impact for analyzing what breaks if specific entities change, and sem blame for entity-level blame information. These features turn sem into a comprehensive code analysis tool beyond just diffing.
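Impact analysis over an entity dependency graph amounts to finding everything that transitively depends on the changed entity. The sketch below assumes the graph is already available as a mapping from each entity to the entities it depends on. The function name impacted_by and the sample entity names are hypothetical, and this is not a description of how sem impact works internally.

```python
from collections import deque


def impacted_by(dependencies: dict[str, set[str]], changed: str) -> set[str]:
    """Return every entity that directly or transitively depends on `changed`.

    `dependencies` maps an entity to the set of entities it depends on.
    """
    # Invert the graph: for each entity, who depends on it?
    dependents: dict[str, set[str]] = {}
    for entity, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(entity)

    # Breadth-first walk over the inverted edges.
    impacted: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != changed and dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


graph = {
    "render_home": {"handle_login"},
    "handle_login": {"validate_token"},
    "validate_token": {"decode_jwt"},
    "format_date": set(),
}
print(sorted(impacted_by(graph, "decode_jwt")))
```

The visited set also keeps the traversal from looping forever when the graph contains cycles, such as mutually recursive functions.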

Installation and Architecture

sem is built in Rust for performance, using tree-sitter for native code parsing (not WebAssembly), git2 for Git operations, and rayon for parallel file processing. To install it, developers can build from source with Cargo or download pre-built binaries from GitHub Releases.

The tool is designed as both a CLI and a library. The sem-core crate can be used as a Rust dependency, enabling integration into other tools. It's already being used by projects like weave (semantic merge driver) and inspect (entity-level code review).

Open Source and Extensible

Licensed under MIT or Apache-2.0, sem is open source and designed with extensibility in mind. The plugin system allows adding support for new languages and formats, making it adaptable to emerging programming languages and data formats.

For development teams working across multiple languages or dealing with complex configuration files, sem offers a more intelligent way to track code changes. By focusing on what actually changed in the code's structure rather than just which lines moved, it provides clearer insights into code evolution and helps teams maintain better understanding of their growing codebases.

sem represents a practical step toward semantic version control, making code changes more meaningful and easier to understand without requiring changes to existing Git workflows.

#semantic-version-control #code-diff #Tree-sitter #Rust #Git