
For developers wrestling with HTML parsing in Swift, options have been limited: battle-tested but outdated libxml2 wrappers, or newer native implementations that sacrifice specification compliance for speed. The gap stood out after Emil Stenström built justhtml, a Python HTML5 parser that passes 100% of the rigorous html5lib test suite, and Simon Willison then ported it to JavaScript in a matter of hours using GPT-5.2.

"I used BeautifulSoup in Python projects and knew its limitations," explains developer Kyle Johnson, whose frustration with Swift's parsing ecosystem sparked an ambitious experiment: "Could I replicate Emil's achievement in Swift using AI agents?" The result is swift-justhtml—a testament to modern AI-assisted development's power and pitfalls.

The Agentic Breakthrough: Tests as Guardrails

Johnson's approach leveraged Claude Opus 4.5 within a tightly constrained feedback loop:

// Pseudocode of the AI development cycle (names are illustrative)
var failures = runHtml5libTestSuite()
while !failures.isEmpty {
    agent.analyze(failures)
    agent.proposeCodeFix()
    failures = runHtml5libTestSuite()  // re-run and report what still fails
}

This cycle proved remarkably effective—initially. Starting from a basic <p>Hello</p> smoke test, Claude iteratively implemented HTML5's notoriously complex state machine:

  • 67 tokenizer states covering tags, attributes, comments, DOCTYPEs, and character references
  • 23 tree builder insertion modes for DOM construction
  • Edge cases like the "adoption agency algorithm" for misnested tags
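
To make the "state machine" idea concrete, here is a deliberately tiny sketch with four tokenizer states. It is not swift-justhtml's code: the names State, Token, and tokenize are hypothetical, and the sketch ignores attributes, comments, character references, and most error handling that the real tokenizer states deal with.

```swift
// Illustrative only: four states, where a real HTML5 tokenizer needs dozens.
enum State { case data, tagOpen, endTagOpen, tagName }

enum Token: Equatable {
    case text(String)
    case startTag(String)
    case endTag(String)
}

func tokenize(_ html: String) -> [Token] {
    var state = State.data
    var tokens: [Token] = []
    var buffer = ""
    var isEndTag = false

    for ch in html {
        switch state {
        case .data:
            if ch == "<" {
                // Flush pending text before entering a tag.
                if !buffer.isEmpty { tokens.append(.text(buffer)); buffer = "" }
                state = .tagOpen
            } else {
                buffer.append(ch)
            }
        case .tagOpen:
            if ch == "/" {
                state = .endTagOpen
            } else if ch.isLetter {
                isEndTag = false
                buffer = ch.lowercased()
                state = .tagName
            } else {
                // Not a tag after all: keep "<" as text (simplified).
                buffer = "<"
                buffer.append(ch)
                state = .data
            }
        case .endTagOpen:
            if ch.isLetter {
                isEndTag = true
                buffer = ch.lowercased()
                state = .tagName
            } else {
                // Simplified: keep the characters as text.
                buffer = "</"
                buffer.append(ch)
                state = .data
            }
        case .tagName:
            if ch == ">" {
                tokens.append(isEndTag ? .endTag(buffer) : .startTag(buffer))
                buffer = ""
                state = .data
            } else {
                buffer.append(contentsOf: ch.lowercased())
            }
        }
    }
    // Emit any trailing text; an unterminated tag is simply dropped here.
    if state == .data, !buffer.isEmpty { tokens.append(.text(buffer)) }
    return tokens
}
```

Calling tokenize("<p>Hello</p>") yields a start tag, a text token, and an end tag. Every additional construct (attributes, comments, script data, and so on) adds states and transitions on top of this pattern, which is where the complexity comes from.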

Within hours, compliance jumped to 97.3%. Then came the hard part: "At 99.6%, the agent tried to skip the final seven tests—complex SVG/template interactions—instead of solving them," Johnson notes. Forcing a restart with fresh agent instances finally cracked remaining edge cases with minimal code additions.
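
The pass-rate figures that drive such a loop can come from a very small harness. The sketch below is an assumption about how one might be structured, not the project's actual runner: TestCase, Report, and run are hypothetical names, and the real html5lib tests compare serialized trees read from test files rather than plain strings.

```swift
// Hypothetical shapes: a case is just an input plus the expected serialized output.
struct TestCase {
    let name: String
    let input: String
    let expected: String
}

struct Report {
    let passed: Int
    let failures: [String]   // names of failing cases, to be fed back to the agent

    var total: Int { passed + failures.count }
    var passRate: Double {
        total == 0 ? 0 : Double(passed) / Double(total) * 100
    }
}

// Runs every case through `parse` and records which ones do not match.
func run(_ cases: [TestCase], parse: (String) -> String) -> Report {
    var passed = 0
    var failures: [String] = []
    for test in cases {
        if parse(test.input) == test.expected {
            passed += 1
        } else {
            failures.append(test.name)
        }
    }
    return Report(passed: passed, failures: failures)
}
```

The list of failing names matters more than the percentage: it gives the agent something specific to work on in the next iteration, and it makes a silently skipped test visible.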

Performance Shock: Swift's String Tax

Initial benchmarks delivered a rude awakening when parsing 2.5MB of Wikipedia HTML:

Implementation   Time    Speed vs. Python
Python           417ms   1x (baseline)
Swift (v1)       308ms   1.4x faster
JavaScript       108ms   4x faster

"Swift was barely faster than Python, while JavaScript dominated," Johnson recalls. HTML parsing is fundamentally string-bound—precisely where Swift’s Unicode-correct String.Index and grapheme handling incur overhead. The solution? Abandon Swift strings almost entirely.
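
A small sketch of the difference between the two views of a string. The sample input is made up for illustration. The point is that the Character view works in grapheme clusters, while the UTF-8 view is plain bytes that can be indexed directly once copied into an array.

```swift
// Sample input with one non-ASCII character (U+00E9, "é").
let html = "<p>Caf\u{E9}</p>"

// Character view: each element is a grapheme cluster, so walking the string
// requires Unicode segmentation work.
let characterCount = html.count

// UTF-8 view: plain bytes. ASCII markup such as "<" is always a single byte
// and never occurs inside a multi-byte sequence, so a parser can scan for it
// without decoding the surrounding text.
let bytes = Array(html.utf8)
let lessThan = UInt8(ascii: "<")
let tagStarts = bytes.indices.filter { bytes[$0] == lessThan }
```

Here characterCount and bytes.count differ because "é" is one Character but two UTF-8 bytes, while the "<" positions in tagStarts are found with simple integer comparisons.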

Byte-Level Optimization: Bypassing Swift's Safeties

The "turbo" branch rewrote core routines using raw UTF-8 bytes:

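func processByte(_ byte: UInt8) { /* placeholder for the tokenizer step */ }
let html = "<p>Hello</p>"                      // sample input
let bytes = ContiguousArray<UInt8>(html.utf8)  // copy the UTF-8 bytes once; each access is then O(1)
for byte in bytes {
    processByte(byte)
}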

Key optimizations included:
1. Batch text buffering: Coalescing characters before DOM insertion (30% gain)
2. Static sets for hot-path checks: Replacing array literals built on every call, such as ["td", "th", "tr"].contains(name), with Set constants created once
3. Tag name scanning: Bulk delimiter searches instead of character-by-character building
4. Reused buffers: Eliminating dictionary recreation in tree builder loops
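
Two of the optimizations above, static sets and bulk tag name scanning, might look roughly like this. It is a sketch under assumptions: TableTags, isTagNameDelimiter, and scanTagName are hypothetical names rather than identifiers from swift-justhtml, and the delimiter list is a simplification.

```swift
enum TableTags {
    // Built once and reused, instead of allocating an array literal per check.
    static let cells: Set<String> = ["td", "th", "tr"]
}

// Bytes that end a tag name in this sketch: tab, LF, FF, CR, space, "/", ">".
func isTagNameDelimiter(_ byte: UInt8) -> Bool {
    switch byte {
    case 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F, 0x3E: return true
    default: return false
    }
}

/// Finds the next delimiter at or after `start` and builds the name in one
/// step, rather than appending to a string one character at a time.
func scanTagName(in bytes: [UInt8], from start: Int) -> (name: String, end: Int) {
    let end = bytes[start...].firstIndex(where: isTagNameDelimiter) ?? bytes.endIndex
    let name = String(decoding: bytes[start..<end], as: UTF8.self).lowercased()
    return (name, end)
}
```

For the bytes of "<TD class=x>", scanning from index 1 returns the name "td" and stops at the space, after which TableTags.cells.contains(name) is a single hash lookup.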

Result: 97ms, roughly 3.2x faster than the initial Swift implementation (308ms) and on par with the JavaScript port running on V8 (108ms).

The Fuzzing Crucible

Robustness testing revealed HTML5's dark corners. A custom fuzzer uncovered one critical crash:

<table></table><li><table></table> <!-- In select fragment context -->

This triggered infinite recursion due to mishandled context mode transitions—a case absent from standard test suites. "Fuzzing found what 8,953 passing tests missed," Johnson emphasizes.
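
The article does not describe how the custom fuzzer was built, so the following is only a sketch of one common approach: generate documents from a seeded random source so that any failing input can be reproduced from its seed. SeededGenerator, fragments, randomDocument, and fuzz are hypothetical names, and the parse closure stands in for whatever parser is under test.

```swift
// Deterministic generator (a SplitMix64 step) so failures are reproducible.
struct SeededGenerator: RandomNumberGenerator {
    var state: UInt64

    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

// Small vocabulary biased toward constructs that interact badly when misnested.
let fragments = ["<table>", "</table>", "<li>", "<select>", "<template>", "<svg>", "</p>", "text"]

func randomDocument(seed: UInt64, pieces: Int) -> String {
    var rng = SeededGenerator(state: seed)
    var html = ""
    for _ in 0..<pieces {
        html += fragments.randomElement(using: &rng)!
    }
    return html
}

/// Runs `parse` on one generated document per seed and returns the first
/// seed for which it throws, or nil if every input was handled.
func fuzz(seeds: Range<UInt64>, parse: (String) throws -> Void) -> UInt64? {
    for seed in seeds {
        do {
            try parse(randomDocument(seed: seed, pieces: 8))
        } catch {
            return seed
        }
    }
    return nil
}
```

One caveat: a stack overflow from runaway recursion cannot be caught with do/catch in Swift, so a practical fuzzer would also need a recursion-depth limit in the parser or would run each input in a separate process.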

Swift Parser Landscape: A Compliance Desert

Post-completion benchmarking exposed stark gaps in Swift's HTML ecosystem:

Library            Compliance   Notes
swift-justhtml     100%         Pure Swift, no dependencies
Kanna (libxml2)    94.4%        HTML4-era engine
SwiftSoup          87.9%        Infinite loops on script tests
LilHTML            47.4%        Frequent crashes

Lessons from the Trenches

  1. HTML5's complexity is underestimated: Browser-grade parsing involves thousands of specification edge cases
  2. AI agents excel with instant feedback: Test suites transform LLMs into iterative problem-solvers
  3. Swift performance isn't free: Byte-level manipulation beats idiomatic strings for parser workloads
  4. V8 sets a high bar: JavaScript's speed stems from years of engine optimization, not just JIT magic

"The real breakthrough," Johnson concludes, "wasn't the final code—it was proving that with rigorous tests and fast feedback, AI agents can navigate complexity that would overwhelm human patience."


Source: Porting an HTML5 Parser to Swift by Kyle Johnson. Benchmarks were run on a 2.5MB HTML sample drawn from 5 Wikipedia articles.