The Dependency Frustration That Sparked a Rust Revolution

Every developer knows the itch: existing tools almost fit your workflow, but not quite. For one engineer, years of relying on AsciiDoctor's Ruby-based processor led to mounting frustrations—interpreted language overhead, extension limitations, and deployment friction. The breaking point? A vision for a vim-centric writing toolkit that demanded embeddable, dependency-free document processing. Thus began the quest to build asciidocr—a spec-compliant AsciiDoc parser in Rust.

"I realized if I wanted to share tools without forcing Ruby or Python installs, I needed a compiled solution," the developer explains. After briefly flirting with Go (and recoiling at ubiquitous if err != nil checks), Rust emerged as the ideal candidate: "The type safety, enums, and performance characteristics were perfect for text processing."

Under the Hood: Scanning, Parsing, and Abstract Syntax Graphs

Byte-by-Byte Scanning Nuances

The parser follows a rigorous three-stage pipeline, sketched in code below:
1. Scanning: Converts raw text into tokens
2. Parsing: Structures tokens into an abstract syntax graph (ASG)
3. Rendering: Transforms the ASG into output formats (HTML, DOCX, etc.)
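
To make the division of labor concrete, here is a minimal, hypothetical sketch of the three stages; the type and function names are illustrative rather than asciidocr's actual API:

enum Token { Text(String), ThematicBreak }
struct Asg { blocks: Vec<Token> }

fn scan(source: &str) -> Vec<Token> {
    source
        .lines()
        .map(|line| match line {
            "'''" => Token::ThematicBreak, // AsciiDoc thematic break
            _ => Token::Text(line.to_string()),
        })
        .collect()
}

fn parse(tokens: Vec<Token>) -> Asg {
    Asg { blocks: tokens } // the real parser builds a much richer graph
}

fn render_html(asg: &Asg) -> String {
    asg.blocks
        .iter()
        .map(|t| match t {
            Token::ThematicBreak => "<hr/>".to_string(),
            Token::Text(s) => format!("<p>{}</p>", s),
        })
        .collect()
}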

Scanning posed immediate challenges with Rust's string handling. AsciiDoc source routinely contains multi-byte UTF-8 characters (emojis, typographic ellipses), and slicing a Rust string at a byte offset that falls inside such a character panics at runtime. The solution? Byte-by-byte scanning with boundary checks:

fn peek(&self) -> char {
    // At the end of input, or mid-way through a multi-byte character:
    // return a null sentinel instead of risking an invalid slice.
    if self.is_at_end() || !self.source.is_char_boundary(self.current) {
        return '\0';
    }
    // Reading a single byte is enough here: AsciiDoc's syntax characters
    // are all ASCII, so the lead byte of a multi-byte character simply
    // won't match any of them.
    self.source.as_bytes()[self.current] as char
}

Source: asciidocr scanner implementation

Enum-driven token typing proved invaluable. Each TokenType variant (e.g., ThematicBreak, SourceBlock) enabled exhaustive pattern matching during parsing—a Rust strength.
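
An abbreviated illustration of that pattern; the real TokenType enum carries many more variants than shown here:

enum TokenType {
    ThematicBreak,
    SourceBlock,
    Text,
}

fn handle(token: TokenType) {
    // The match must be exhaustive: add a variant later and the compiler
    // points at every site that forgets to handle it.
    match token {
        TokenType::ThematicBreak => { /* close any open block, emit a break */ }
        TokenType::SourceBlock => { /* switch the scanner into verbatim mode */ }
        TokenType::Text => { /* append to the current block's inlines */ }
    }
}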

Parsing with Context-Aware State Machines

The parser tracks intricate context to handle AsciiDoc's fluid syntax:
- Block continuations (+ operators)
- Nested includes
- Metadata inheritance
- Pending titles

A 500-line Parser struct manages this state, leveraging Rust's memory safety for complex intermediate representations compliant with the AsciiDoc TCK schema.
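
A hedged sketch of the kind of state such a parser carries; the field names here are invented for illustration and are not asciidocr's actual fields:

use std::collections::HashMap;

struct BlockContext {
    delimiter: String, // e.g. "----" for a listing block
}

struct Parser {
    open_blocks: Vec<BlockContext>,               // stack of currently open blocks
    in_block_continuation: bool,                  // a '+' continuation joins this line to the previous block
    include_stack: Vec<String>,                   // nested include:: targets being expanded
    document_attributes: HashMap<String, String>, // metadata inherited by child blocks
    pending_title: Option<String>,                // a title waiting for the block it belongs to
}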

"Rust's enums forced me to model the AST rigorously. The initial verbosity pays dividends in conversion reliability and speed." — Project Developer

Beyond HTML: DOCX, Python Bindings, and the Future

Templating and Novel Outputs

Using Tera templates, asciidocr generates clean HTMLBook output, a semantic HTML variant.
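
Tera is a general-purpose template engine, so the HTMLBook rendering boils down to filling templates with data pulled from the ASG. A minimal sketch of that pattern, with an invented template string and context fields rather than asciidocr's actual templates:

use tera::{Context, Tera};

fn render_chapter(title: &str, body_html: &str) -> tera::Result<String> {
    // Build the data the template will see.
    let mut context = Context::new();
    context.insert("title", title);
    context.insert("body", body_html);
    // Tera::one_off compiles and renders a single template string.
    Tera::one_off(
        "<section data-type=\"chapter\"><h1>{{ title }}</h1>{{ body }}</section>",
        &context,
        false, // body is already HTML, so skip autoescaping
    )
}

But the real ambition is native DOCX support: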

// Work-in-progress DOCX renderer (simplified): walk the graph's blocks
// and emit the corresponding DOCX elements.
fn render_docx(asg: &Asg) -> Result<Vec<u8>> {
    let mut docx = Docx::new();
    for block in &asg.blocks {
        match block {
            Block::Paragraph(p) => docx.add_paragraph(&p.inlines),
            Block::Heading(h) => docx.add_heading(h.level, &h.text),
            _ => {} // remaining block types elided
        }
    }
    docx.build()
}

Python Integration via PyO3

Compiled as a Python extension module with PyO3, asciidocr brings native-code speed to Python workflows:

import asciidocr
html = asciidocr.parse_to_html("Hello _Rustaceans_!") 
# <em>Rustaceans</em> rendered in ~0.01s
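
On the Rust side, exposing such a function with PyO3 looks roughly like the following; this is a sketch assuming a recent PyO3 release, not asciidocr's actual module code:

use pyo3::prelude::*;

#[pyfunction]
fn parse_to_html(source: &str) -> PyResult<String> {
    // The real crate would run the scanner, parser, and HTML renderer here;
    // the body is stubbed to keep the sketch small.
    Ok(format!("<p>{}</p>", source))
}

#[pymodule]
fn asciidocr(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(parse_to_html, m)?)?;
    Ok(())
}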

Why This Matters for the Text Processing Ecosystem

AsciiDoc's complexity surpasses Markdown, making efficient parsing non-trivial. This implementation demonstrates Rust's strengths in text processing:

- Performance: 30x faster than AsciiDoctor in early benchmarks
- Deployability: Single binaries eliminate language runtime dependencies
- Extensibility: Native code enables tight integration with novel toolchains

The project embodies a growing trend: developers rewriting critical text tools in Rust (see Ropey for ropes, xi-editor for text engines). For technical writers and engineers alike, asciidocr hints at a future where document pipelines are as performant and portable as the systems they document.

Source: Writing an AsciiDoc Parser in Rust by project developer Delfan Baum