When the Same Filename Has Two Truths

Tech Essays Reporter

Subversion’s Unicode filename problem shows how a tiny difference in codepoints can become a philosophical problem for version control: what is a file, if two identical-looking names are not the same bytes?

Thesis

The Subversion Unicode normalization issue is not merely a bug about accented characters on Mac OS X. It is a case study in how software systems confuse representation with identity, especially when a human-visible name, an operating system path, and a repository key all pretend to be the same thing.

Unicode permits characters with diacritics to be represented in more than one valid form. A character such as an accented letter may appear as a single composed codepoint in NFC, or as a base character plus a combining mark in NFD. To the user, these strings can look identical. To UTF-8, to a hash table, and to a version control repository, they are different byte sequences.
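A minimal sketch of that difference, using Python's standard unicodedata module. The sample character is an assumption chosen for illustration and does not come from the proposal.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed codepoint (NFC)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (NFD)

# The UTF-8 encodings are different byte sequences.
nfc_bytes = composed.encode("utf-8")
nfd_bytes = decomposed.encode("utf-8")
same_bytes = nfc_bytes == nfd_bytes

# After normalizing both to one form, the strings compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(nfc_bytes, nfd_bytes)
print(same_bytes, same_after_nfc)
```

Any structure keyed on the raw bytes, such as a dictionary or a repository index, would treat these two spellings as unrelated entries.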

Subversion’s problem emerges because it treats paths as durable identifiers, while operating systems treat filenames as local filesystem artifacts. Mac OS X has historically accepted multiple Unicode forms but tended to return decomposed forms, while Linux and Windows generally return the same byte sequence they were given. That difference becomes destructive in a cross-platform working copy: the repository may remember one spelling of a path, while the local filesystem hands back another spelling that looks the same on screen.

The central argument of the proposal is therefore conservative but precise: before Subversion 2.0, the client and server should avoid rewriting path identity and instead compare paths with NFC/NFD-aware routines, while preserving the exact encoding received from the other side. In a later compatibility-breaking release, Subversion can move toward a cleaner rule: normalize all input paths to NFC.

Key arguments

The first key argument is that Unicode equivalence is not byte equivalence. Unicode Standard Annex #15 defines normalization forms precisely because the same abstract text can have multiple codepoint representations. NFC prefers composed characters where possible. NFD decomposes characters into base characters and combining marks. Both are legitimate Unicode. Neither is inherently more “correct” for every context.

That distinction matters because Subversion stores paths internally as UTF-8. UTF-8 is an encoding of codepoints, not an encoding of visual intention. If two users create visually identical filenames using different normalization forms, Subversion may see two different paths. The system’s byte-level memory disagrees with the human’s visual perception.

The second key argument is that Mac OS X turns a theoretical Unicode issue into a practical interoperability failure. According to Apple’s older technical guidance on path encodings in VFS, the platform has special behavior around decomposed Unicode filenames. The proposal summarizes this operationally: Mac OS X accepts all forms but may give back NFD, while Linux and Windows tend to give back the input form.

That asymmetry creates a specific failure mode. A Windows user commits a file whose name contains an NFC precomposed character. A Mac user updates the working copy. The repository path is NFC, but the filesystem may return the local filename as NFD. When Subversion compares the path from disk with the path stored in its entries metadata, a byte comparison says they differ. The client can then report the NFC path as missing and the NFD path as unversioned, even though the person looking at the terminal sees the same filename twice.
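The failure mode can be reproduced in miniature. The sketch below is not Subversion's status code: the function names and the list-based model of the entries metadata and directory listing are assumptions made for illustration, and the NFD-returning filesystem is simulated by normalizing the name by hand.

```python
import unicodedata

def nfc(name: str) -> str:
    return unicodedata.normalize("NFC", name)

def status_bytewise(entries, on_disk):
    """Compare names exactly as stored: the behavior described above."""
    missing = sorted(set(entries) - set(on_disk))
    unversioned = sorted(set(on_disk) - set(entries))
    return missing, unversioned

def status_equivalence_aware(entries, on_disk):
    """Compare names after mapping both sides to NFC."""
    entry_keys = {nfc(n) for n in entries}
    disk_keys = {nfc(n) for n in on_disk}
    missing = sorted(n for n in entries if nfc(n) not in disk_keys)
    unversioned = sorted(n for n in on_disk if nfc(n) not in entry_keys)
    return missing, unversioned

# Name committed in NFC, e.g. from a Windows client.
entries = ["r\u00e9sum\u00e9.txt"]
# Simulated listing from a filesystem that hands names back decomposed.
on_disk = [unicodedata.normalize("NFD", n) for n in entries]

naive = status_bytewise(entries, on_disk)
aware = status_equivalence_aware(entries, on_disk)
print(naive)   # one "missing" name and one "unversioned" name
print(aware)   # nothing to report
```

The byte-wise version reports the same visible filename twice, once as missing and once as unversioned, while the equivalence-aware version reports nothing.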

This is more than awkward output. It violates one of version control’s basic promises: that the system can tell the user what changed. When the tool cannot distinguish between “this file is missing” and “this filename came back through a different Unicode normalization form,” it has lost its grip on identity.

The third argument is that server-side history makes purity difficult. If Subversion were new, it could simply declare NFC as the canonical path form and reject or convert everything else. But existing repositories already contain paths in whatever forms clients committed. Some may be Mac-only repositories with NFD names. Some may contain mixed forms. Some server-side structures may locate data through hashes of path bytes.

That means a naive cleanup can break real repositories. Normalizing only Mac client input would help prevent future confusion on that platform, but it could cause a Mac client to send NFC paths to a repository that already knows the same files only under NFD spellings. Normalizing all client input has the same backward compatibility problem, merely with a broader scope. Normalizing paths read from the repository is also dangerous, because repository internals may depend on the original byte sequence, including hashed storage for path-related data such as locks.
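To see why normalizing on read is risky, consider a store that locates records by a digest of the path bytes. The sketch below is a hypothetical model, not Subversion's actual storage layout: the lock_key function, the SHA-256 digest, and the dictionary standing in for on-disk lock data are all assumptions.

```python
import hashlib
import unicodedata

def lock_key(path: str) -> str:
    # Hypothetical key scheme: a digest of the path's UTF-8 bytes.
    return hashlib.sha256(path.encode("utf-8")).hexdigest()

# The path as originally committed, here in NFD.
stored_path = "/trunk/cafe\u0301.txt"
locks = {lock_key(stored_path): "locked by alice"}

# A well-meaning layer normalizes the path to NFC before the lookup.
normalized = unicodedata.normalize("NFC", stored_path)

found_with_original = lock_key(stored_path) in locks
found_with_normalized = lock_key(normalized) in locks
print(found_with_original, found_with_normalized)
```

The normalized path hashes to a different key, so the lookup misses a record that is still present under its original spelling.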

The fourth argument is that comparison and communication are separate acts. A Subversion client needs to compare local filesystem paths with repository paths in an equivalence-aware way. When it talks to the server, however, it must use the exact repository spelling the server expects, and when it touches the local disk it may need the spelling the filesystem returned. This creates two kinds of path truth: wc_path, the encoding observed in the working copy, and repo_path, the encoding preserved from the repository.

That naming convention is small but philosophically rich. It admits that there is no single path object that can safely represent every context. The local filesystem and the repository are different authorities. The client’s job is not to pretend they are identical, but to translate carefully between them.

The support library question

The proposal considers two libraries for Unicode normalization: ICU and utf8proc. ICU is broad, mature, and capable, but its size and scope are larger than what Subversion needs for this problem. utf8proc is narrower, focused on UTF-8 processing, and therefore better aligned with a system that already uses APR for character set conversion and only needs normalization support.

This is a familiar engineering trade-off. A large internationalization library can solve many future problems, but it also increases build complexity, binary size, and maintenance surface. A smaller library can be easier to embed and reason about, provided it does the specific job correctly. Since the issue is not general localization but normalization of UTF-8 paths, the argument for utf8proc is pragmatic.

The proposed normal form is NFC. The reason is partly semantic and partly mechanical. NFC is compact for characters that can be represented in composed form. That matters because normalization can otherwise require larger output buffers. If NFC output never needs more space than the input in the relevant cases, conversion can be cheaper and simpler. More importantly, NFC is widely used as a canonical interchange form, making it a reasonable long-term repository policy.

The proposal explicitly rejects NFKC and NFKD, the compatibility normalization forms. That exclusion is essential. Compatibility normalization can erase distinctions related to formatting or compatibility characters. In a filename system, lossy equivalence is dangerous. A version control system should not silently decide that two distinct names are the same because a compatibility rule made them look interchangeable.
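A short illustration of why compatibility forms are lossy for filenames, again using Python's unicodedata module. The sample names are assumptions picked to show the effect.

```python
import unicodedata

def equal_under(form: str, a: str, b: str) -> bool:
    """True if a and b are identical after normalizing to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

pairs = [
    ("x\u00b2.txt", "x2.txt"),      # SUPERSCRIPT TWO versus the digit 2
    ("\ufb01le.txt", "file.txt"),   # LATIN SMALL LIGATURE FI versus "fi"
]

# Canonical normalization keeps each pair distinct.
canonical = [equal_under("NFC", a, b) for a, b in pairs]
# Compatibility normalization folds each pair into one name.
compat = [equal_under("NFKC", a, b) for a, b in pairs]

print(canonical)
print(compat)
```

Under NFKC, two names that a user deliberately typed differently would collapse into one, which is the kind of silent merge a version control system should avoid.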

Why the short-term answer is comparison, not conversion

The short-term recommendation is to implement NFC/NFD-aware path comparison routines on both client and server, while preserving exact path encodings when communicating across the client-server boundary.

This is less clean than a universal normalized internal representation, but it respects Subversion’s compatibility contract. Older clients and servers exist. Existing repositories exist. Existing working copies exist. A pre-2.0 fix must operate inside that historical reality.

On the client side, this means a filename read from disk can match an entries-file path even when their byte sequences differ by normalization form. Once matched, Subversion should use the repository path when talking to the server, because the repository may only recognize that exact byte sequence. Conversely, when opening the local file, Subversion should use the path form the local filesystem actually returned.

This distinction prevents the system from turning equivalence into mutation. Equivalence-aware comparison says, “these names denote the same human-visible path for this purpose.” Normalization says, “I will rewrite this name into my preferred form.” Before a compatibility boundary, comparison is safer than rewriting.

The implementation consequence follows naturally: hashes used for working copy administrative access and entries lookup should use normalized path encodings, so both local and repository spellings can arrive at the same key. But the stored path values still need to preserve their contextual spelling. Normalization is useful for lookup, not necessarily for replacing the authoritative name.
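One way to picture this is an index keyed by the NFC form whose values keep both contextual spellings. This is a sketch under stated assumptions: Entry, EntryIndex, and their method names are hypothetical and do not correspond to Subversion's working copy API. Only the wc_path and repo_path names come from the proposal.

```python
import unicodedata
from dataclasses import dataclass
from typing import Dict, Optional

def lookup_key(path: str) -> str:
    # Normalization is used only to build the key, never to rewrite a path.
    return unicodedata.normalize("NFC", path)

@dataclass
class Entry:
    repo_path: str                   # exact spelling received from the repository
    wc_path: Optional[str] = None    # exact spelling returned by the filesystem

class EntryIndex:
    def __init__(self) -> None:
        self._by_key: Dict[str, Entry] = {}

    def add_from_repository(self, repo_path: str) -> None:
        self._by_key[lookup_key(repo_path)] = Entry(repo_path=repo_path)

    def observe_on_disk(self, wc_path: str) -> Optional[Entry]:
        """Match a name read from disk and remember its local spelling."""
        entry = self._by_key.get(lookup_key(wc_path))
        if entry is not None:
            entry.wc_path = wc_path
        return entry

index = EntryIndex()
repo_name = "src/na\u00efve.c"                        # NFC in the repository
index.add_from_repository(repo_name)

disk_name = unicodedata.normalize("NFD", repo_name)   # simulated NFD listing
entry = index.observe_on_disk(disk_name)
print(entry)
```

After the match, code that talks to the server would read entry.repo_path and code that opens the file would read entry.wc_path. The sketch deliberately ignores the case where two repository paths share one NFC key, since the later entry would overwrite the earlier one.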

Long-term implications

The long-term Subversion 2.0+ proposal is to normalize all input paths to NFC. That is the cleaner architectural destination because it restores a single internal rule: paths entering the system are converted to the chosen canonical form, and specialized comparison routines become less central.

The benefit is conceptual compression. Once every new path is NFC, two visually identical names cannot enter the repository as distinct byte sequences merely because one client used a composed character and another used a decomposed one. Cross-platform behavior becomes easier to explain. Hashing becomes safer. Working copy metadata becomes less surprising.

The cost is migration. Existing repositories may contain non-NFC paths. Existing tools may depend on byte-for-byte path behavior. Any move to mandatory NFC needs a transition story: repository format changes, validation behavior, client compatibility rules, and possibly administrative tools for detecting and repairing conflicting names.
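A detection pass of the kind such an administrative tool might run can be sketched in a few lines. The function names and the flat list of paths are assumptions for illustration, and no existing Subversion tool is implied.

```python
import unicodedata
from collections import defaultdict

def find_non_nfc(paths):
    """Paths that would change if normalized to NFC."""
    return [p for p in paths if unicodedata.normalize("NFC", p) != p]

def find_nfc_collisions(paths):
    """Groups of distinct paths that become identical under NFC."""
    groups = defaultdict(list)
    for p in paths:
        groups[unicodedata.normalize("NFC", p)].append(p)
    return {key: names for key, names in groups.items() if len(set(names)) > 1}

paths = [
    "docs/caf\u00e9.txt",    # NFC spelling
    "docs/cafe\u0301.txt",   # NFD spelling of the same visible name
    "docs/readme.txt",
]

non_nfc = find_non_nfc(paths)
collisions = find_nfc_collisions(paths)
print(non_nfc)
print(collisions)
```

Paths in the first list could be rewritten during migration. Groups in the second need a human decision, because normalizing them would merge two repository entries into one.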

There is also a deeper implication for distributed software design. A path is not just a string. It is a string interpreted by a filesystem, rendered by a terminal, stored by a repository, compared by a client, and possibly hashed by a server. Each layer has its own idea of sameness. Bugs appear when one layer’s equality relation is smuggled into another layer without being named.

Subversion’s proposed distinction between wc_path and repo_path is therefore more than a variable naming convention. It is a discipline of representation. It forces code to say which authority a path came from and which authority it is about to address.

Counter-perspectives

One counter-perspective says that Subversion should normalize everything immediately, because any other answer preserves complexity. This view is attractive from a design standpoint. Mixed Unicode forms are a source of confusion, and permitting them indefinitely means every comparison site becomes suspect.

The weakness of that position is backward compatibility. A version control system is a memory machine. It cannot treat existing repository paths as disposable implementation details. If a repository has stored an NFD path for years, and a server-side lookup uses the original bytes, rewriting that path on the way in can make the file unreachable or can create apparent duplicates.

Another counter-perspective says the problem is mostly Mac-specific, so the fix should live only on Mac clients. That approach minimizes code touched elsewhere, but it misidentifies the failure. Mac OS X exposes the issue, yet the underlying ambiguity belongs to Unicode and to any repository that accepts arbitrary UTF-8 paths. A Linux server can store both forms. A Windows client can commit either form. The problem is not one operating system’s behavior alone, but the absence of a repository-wide equality policy.

A third counter-perspective favors server-side normalization of repository paths during access. This seems powerful because it centralizes the rule. But repository internals may use path bytes as keys, including hash-based storage. If the lookup path is normalized differently from the stored key, the server can fail to find its own data. Server-side normalization must therefore be introduced only with a repository format and migration plan that understands storage mechanics.

The final counter-perspective is to accept byte-level identity and tell users that visually identical Unicode filenames are distinct. That is internally consistent, but it is hostile to human use. The terminal does not show codepoint sequences. Developers do not reason about combining marks during ordinary file operations. A version control system that exposes invisible representational differences as missing and unversioned files may be technically consistent while still being practically wrong.

Conclusion

The proposal’s strength is that it separates the ethical desire for a clean future from the engineering duty to preserve the past. Short term, Subversion should compare Unicode paths according to canonical equivalence while preserving the exact spellings required by the filesystem and repository. Long term, it should converge on NFC as the normal input form.

This is the kind of bug that reveals a system’s metaphysics. A filename looks like a label, but in software it is also a byte string, a user interface artifact, a database key, a network protocol value, and a historical promise. Unicode normalization does not merely ask Subversion to convert text. It asks Subversion to decide what kind of sameness it believes in.
