When developer Ted Stern set out to build Media HTTP File Server (MHFS), a lightweight alternative to resource-heavy solutions like Plex, he anticipated challenges around media formats and performance. What he didn't expect was that text encoding would become his most persistent battle—a struggle exposing fundamental flaws in how programming languages handle filenames and Unicode. His journey reveals critical lessons for any developer working with cross-platform file systems or internationalization.

Perl's Double-Edged String Abstraction

Perl's approach to strings, where "café" can be stored internally as Latin-1 (63 61 66 E9) or as UTF-8 (63 61 66 C3 A9), creates invisible landmines. As Stern demonstrates, the two representations compare as equal inside Perl but diverge when passed to external systems such as the filesystem:

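#!/usr/bin/env perl
use strict;
use warnings;
use utf8;            # the source file itself is saved as UTF-8
use feature 'say';

my $latin_1 = "caf\xE9";  # stored internally as Latin-1: 63 61 66 E9
my $utf_8   = 'café';     # stored internally as UTF-8:   63 61 66 C3 A9

# Perl sees them as identical:
say $latin_1 eq $utf_8 ? 'match' : 'no match';  # prints "match"

# The filesystem does not: each call hands the OS the internal bytes,
# so the two calls ask for two different directory names.
mkdir($latin_1);  # bytes 63 61 66 E9, rejected where filenames must be valid UTF-8
mkdir($utf_8);    # bytes 63 61 66 C3 A9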

The culprit is Perl's lack of string type enforcement: nothing in the language distinguishes a decoded character string from an encoded byte string. Without explicit decoding and encoding at well-defined points, strings can be silently corrupted when they are concatenated or processed, for example by being encoded twice. Stern adopted Hungarian-style prefixes (b_ for byte strings) as a naming convention, but admits it is a fragile defense.
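
The double-encoding risk can be made concrete with the core Encode module. The following is a minimal sketch, not code from MHFS: the variable names and the sample input are assumptions chosen for illustration, with the b_ prefix borrowed from the naming convention described above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes for "café" as read from a file or socket.
my $b_input = "caf\xC3\xA9";

# Decode once at the boundary: bytes -> characters (dies on invalid UTF-8).
my $text = decode('UTF-8', $b_input, Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Encode once on the way out: characters -> bytes.
my $b_output = encode('UTF-8', $text);

# Encoding a string that already holds bytes corrupts it (double encoding).
my $b_double = encode('UTF-8', $b_output);

printf "chars: %d, bytes: %d, double-encoded bytes: %d\n",
    length($text), length($b_output), length($b_double);
```

Nothing stops the last call from running, which is the point: Perl cannot tell that $b_output is already encoded, so only discipline or naming conventions prevent the mistake.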

Filenames: The Unicode Illusion

Stern's key realization: Filenames are not strings. They're platform-specific byte sequences masquerading as text:
- Windows NTFS: Stores filenames as UTF-16 but allows lone surrogates (invalid Unicode), breaking UTF-8 conversion
- Linux: Permits any byte sequence except / and \0, often—but not always—UTF-8
- macOS: Requires valid UTF-8

This caused MHFS to generate JSON with "corrupted" filenames containing U+FFFD replacement characters whenever the original bytes did not map to valid Unicode. The solution? MHFS now treats filenames as byte strings internally, but this created new challenges for serialization and display.
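
How a replacement character sneaks in can be shown with Encode's two decoding modes. This is a sketch under assumptions: the filename is hypothetical, and the strict check is one possible way to decide when to fall back to raw bytes, not necessarily the check MHFS performs.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical filename as returned by readdir on Linux: Latin-1 bytes, not valid UTF-8.
my $b_name = "caf\xE9.mp3";

# Lenient decoding substitutes U+FFFD for the byte that is not valid UTF-8.
my $lossy = decode('UTF-8', $b_name);

# Strict decoding dies instead, so the caller can keep the raw bytes.
my $strict = eval { decode('UTF-8', $b_name, Encode::FB_CROAK() | Encode::LEAVE_SRC()) };

printf "lossy: %s\n", join ' ', map { sprintf 'U+%04X', ord($_) } split //, $lossy;
print defined $strict ? "valid UTF-8\n" : "not valid UTF-8, keep the bytes\n";
```

Once the lossy form has been stored or transmitted, the original byte is gone and the file can no longer be opened by that name.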

Serialization Wars: Base64, Escapes, and WTF-8

To transmit these byte strings through UTF-8-dependent channels (like JSON), MHFS employs three strategies:
1. Mapping: Convert filenames to "clean" UTF-8 for storage (risks collisions)
2. URI Escaping: Inefficient but compatible with web standards
3. Base64url: Avoids slash-encoding issues in URLs

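# Example: Base64url encoding for Kodi plugin paths
use MIME::Base64 qw(encode_base64url);
my $raw_bytes = "caf\xE9/track01.flac";        # hypothetical filename bytes, not valid UTF-8
my $safe_path = encode_base64url($raw_bytes);  # uses only A-Z, a-z, 0-9, '-' and '_'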
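
The trade-off between URI escaping and Base64url can be sketched side by side. The path below is hypothetical, and the hand-rolled percent-escaping is an assumption made to keep the example free of non-core modules; it is not taken from MHFS.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64url decode_base64url);

# Hypothetical raw path bytes: contains a slash and a byte that is not valid UTF-8.
my $b_path = "music/caf\xE9.flac";

# Percent-escape every byte outside the RFC 3986 unreserved set.
(my $escaped = $b_path) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;

# Base64url output contains no '/', '+' or '=' characters.
my $b64 = encode_base64url($b_path);

# Both encodings are reversible back to the exact original bytes.
(my $unescaped = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
my $decoded = decode_base64url($b64);

print "$escaped\n$b64\n";
```

Percent-escaping stays readable for mostly-ASCII names but triples the size of every escaped byte, while Base64url has a fixed overhead of about one third and never produces a slash that a URL router could misinterpret.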

For display, Stern implemented a "recovery" algorithm that reconstructs valid UTF-8 from malformed sequences. For example, it rejoins a surrogate pair whose halves were each encoded as a separate three-byte sequence, turning \xED\xA0\xBC (high surrogate U+D83C) + \xED\xBE\x84 (low surrogate U+DF84) into U+1F384 (🎄).
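
The rejoining step can be sketched as follows. This is an illustrative reconstruction, not Stern's implementation: the function name rejoin_surrogates is hypothetical, and the sketch handles only well-formed high-plus-low pairs, leaving lone surrogates untouched.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Replace each surrogate pair that was encoded as two 3-byte sequences with
# the 4-byte UTF-8 encoding of the code point the pair represents.
sub rejoin_surrogates {
    my ($b_text) = @_;
    $b_text =~ s{\xED([\xA0-\xAF])([\x80-\xBF])\xED([\xB0-\xBF])([\x80-\xBF])}{
        my $high = 0xD000 | ((ord($1) & 0x3F) << 6) | (ord($2) & 0x3F);
        my $low  = 0xD000 | ((ord($3) & 0x3F) << 6) | (ord($4) & 0x3F);
        encode('UTF-8', chr(0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00)));
    }ge;
    return $b_text;
}

# Input bytes: "tree ", then U+D83C and U+DF84 each encoded as 3 bytes, then ".txt".
my $b_fixed = rejoin_surrogates("tree \xED\xA0\xBC\xED\xBE\x84.txt");
print unpack('H*', $b_fixed), "\n";
```

For the sample input the pair decodes to 0xD83C and 0xDF84, which combine to 0x10000 + (0x3C << 10) + 0x384 = 0x1F384, emitted as the bytes F0 9F 8E 84.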

The uTorrent Outbreak: How Bad Encoding Spreads

A real-world case study emerged from a torrent file containing filenames with malformed surrogates. Investigation revealed:

"uTorrent 2.2.1 failed to correctly convert UTF-16 to UTF-8 [...] When surrogate characters are encoded naïvely, you're left with WTF-8 text. The BitTorrent spec mandates UTF-8, but clients like rTorrent preserved the corruption—infecting downstream systems."

This highlights that encoding errors are contagious: poor handling in one system propagates to others, forcing complex workarounds downstream.
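
The kind of naive conversion described in the quote can be reproduced in a few lines. This sketch is an assumption about the general shape of the bug, not uTorrent's actual code: it encodes each 16-bit UTF-16 code unit as if it were a complete code point, and the function name is hypothetical.

```perl
use strict;
use warnings;

# Encode one 16-bit code unit as if it were a complete code point (the bug).
sub naive_unit_to_utf8 {
    my ($u) = @_;
    return chr($u) if $u < 0x80;
    return pack('C2', 0xC0 | ($u >> 6), 0x80 | ($u & 0x3F)) if $u < 0x800;
    return pack('C3', 0xE0 | ($u >> 12), 0x80 | (($u >> 6) & 0x3F), 0x80 | ($u & 0x3F));
}

# UTF-16 code units for U+1F384: a high surrogate followed by a low surrogate.
my @units   = (0xD83C, 0xDF84);
my $b_naive = join '', map { naive_unit_to_utf8($_) } @units;

print unpack('H*', $b_naive), "\n";  # prints "eda0bcedbe84"
```

A correct converter would first combine the two surrogates into U+1F384 and emit the four bytes F0 9F 8E 84; the six bytes produced here are not valid UTF-8, yet they survive any tool that copies filenames without validating them.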

JSON Decoding: Silent Surrogate Infections

While auditing dependencies, Stern discovered a critical vulnerability in Perl's JSON decoders. Most (including JSON::XS and JSON::PP) silently accept lone surrogates in JSON strings, creating invalid Unicode:

{"file": "\ud800"}  # Lone high surrogate → corrupts Perl string

His patch to Cpanel::JSON::XS added UTF8_DISALLOW_SURROGATE to reject such input—a fix still absent in other popular libraries.
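
Where the decoder itself cannot be changed, one alternative is to validate strings after decoding. The check below is a sketch of that different technique, not Stern's patch: has_surrogate is a hypothetical helper, and the sample values stand in for whatever a permissive decoder might return.

```perl
use strict;
use warnings;

# Return 1 if a decoded character string contains any surrogate code point.
sub has_surrogate {
    my ($text) = @_;
    return $text =~ /[\x{D800}-\x{DFFF}]/ ? 1 : 0;
}

# Hypothetical values as they might come out of a permissive JSON decoder.
my %checks = (
    plain     => has_surrogate("caf\x{E9}"),
    emoji     => has_surrogate("\x{1F384}"),
    lone_high => has_surrogate("a\x{D800}b"),
);

printf "%s: %s\n", $_, $checks{$_} ? 'reject' : 'ok' for sort keys %checks;
```

A properly paired surrogate escape in JSON decodes to a single code point above U+FFFF, so it passes this check; only strings that still contain a code point in the surrogate range are flagged.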

The Unending Battle

Stern's conclusions resonate beyond Perl:
- Filenames ≠ Strings: Treat them as opaque byte sequences with platform rules
- Validate Early: Reject invalid Unicode at system boundaries
- Centralize Encoding: Limit decode/encode operations to interfaces

As he wryly notes: "Though they may look like them, filenames are not necessarily strings." MHFS is still evolving its text handling, but Stern's ordeal underscores that in a world of emojis, torrents, and global file shares, encoding isn't an edge case; it's the battlefield.