This Tool Embeds Clean Markdown in PDFs So LLMs Stop Guessing

Startups Reporter

A developer found a way to make the same PDF render perfectly for humans while extracting as clean markdown for machines, using a little-known PDF spec property that's been sitting in the standard since 2001.

Most PDFs are lies. Not intentional ones, but structural ones. They store instructions for drawing glyphs on a page, not the semantic meaning behind them. A heading looks like a heading because it's bigger and bolder, but the file itself has no concept that it is a heading. Text extractors parse draw commands left to right, top to bottom, and reconstruct structure through educated guessing. When humans were the only ones reading PDFs, this didn't matter. Now that LLMs consume them by the millions, it's a real problem.

Sarthak Gaud built a tool called Adaptive PDF to address this. The core idea is deceptively simple: what if the same file could present formatted content to humans and clean markdown to machines, with no separate versions or conversion steps?

The Mechanism

The technique exploits a property in the PDF specification that's been around since version 1.4, released in 2001. It lets you define replacement text for marked content. Renderers ignore it entirely, drawing whatever the content stream specifies. But text extractors that support the property return the replacement text instead of the raw visual text. In testing, both PyMuPDF and Poppler honor it. It was originally designed for edge cases like ligatures, where a single visual glyph such as "ﬁ" should extract as the two separate characters "f" and "i". Nobody adopted it for anything broader.
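To make that concrete, here is a minimal sketch of what such replacement text can look like inside a page's content stream. It assumes the property in question is the `/ActualText` entry on a marked-content sequence (the article does not name it), and the helper names and the font resource `/F1` are made up for illustration. It is not taken from the Adaptive PDF source.

```python
def pdf_hex_text_string(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by a byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def wrap_with_actual_text(draw_ops: str, replacement: str) -> str:
    """Wrap drawing operators in a marked-content sequence (BDC ... EMC)
    whose property list carries the replacement text.

    Assumption: /ActualText is the property the article describes.
    """
    return (
        f"/Span << /ActualText {pdf_hex_text_string(replacement)} >> BDC\n"
        f"{draw_ops}\n"
        "EMC"
    )


# Visual layer: operators that draw the words "Key Metrics" at 18pt.
# The font resource name /F1 is hypothetical.
visual = "BT /F1 18 Tf 72 700 Td (Key Metrics) Tj ET"

# Machine layer: the markdown an extractor should return instead.
fragment = wrap_with_actual_text(visual, "## Key Metrics\n")
print(fragment)
```

A renderer executes only the operators between `BT` and `ET`, so the page looks the same with or without the wrapper. An extractor that honors the property reports the string in the property list instead.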

Adaptive PDF uses this at the document level. Replacement text attached via marked-content sequences tells extractors to return structured markdown instead of raw visual text. One file, two completely different outputs depending on who's reading it.
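The "one file, two outputs" behavior can be imitated with a toy extractor. This is a sketch under heavy simplifications, not how PyMuPDF or Poppler work internally: it assumes the replacement text is stored as `/ActualText` in plain literal strings without escapes or nested parentheses, it uses regular expressions rather than a real content-stream parser, and the stream itself is a hand-written example.

```python
import re

# One marked-content sequence: a property list with /ActualText, then the
# drawing operators up to the closing EMC. Literal strings only, no escapes.
SPAN = re.compile(
    r"/Span\s*<<\s*/ActualText\s*\(([^()\\]*)\)\s*>>\s*BDC(.*?)EMC",
    re.DOTALL,
)
# A literal string shown on the page with the Tj operator.
SHOWN = re.compile(r"\(([^()\\]*)\)\s*Tj")

# Hypothetical content stream: each visual block is wrapped in a span
# that carries its markdown equivalent.
STREAM = """\
/Span << /ActualText (# Quarterly Infrastructure Report) >> BDC
BT /F1 20 Tf 72 720 Td (Quarterly Infrastructure Report) Tj ET
EMC
/Span << /ActualText (## Overview) >> BDC
BT /F1 14 Tf 72 690 Td (Overview) Tj ET
EMC
/Span << /ActualText (Cloud migration completed ahead of schedule.) >> BDC
BT /F2 11 Tf 72 670 Td (Cloud migration completed ahead of sch) Tj ET
BT /F2 11 Tf 72 656 Td (edule.) Tj ET
EMC
"""


def extract_visual(stream: str) -> str:
    """Ignore replacement text and join every drawn string with a space."""
    return " ".join(SHOWN.findall(stream))


def extract_replacement(stream: str) -> str:
    """Honor replacement text: return one line per marked-content span."""
    return "\n".join(match.group(1) for match in SPAN.finditer(stream))


print(extract_visual(STREAM))
print(extract_replacement(STREAM))
```

The first call reproduces the flattened text with its broken line wrap. The second returns the markdown, because it never looks at the drawn strings at all.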

What Extractors Actually See

The difference is stark. From a quarterly infrastructure report, a normal PDF extract reads like this:

Quarterly Infrastructure Report Overview Cloud migration completed ahead of sch edule. Three critical services were moved to the new cluster. Key Metrics Uptime: 99.97% Latency: 42ms avg (down from 68ms)

Broken line wraps, no hierarchy, bullet points indistinguishable from paragraphs. The same PDF processed through Adaptive PDF extracts as:

# Quarterly Infrastructure Report
## Overview
Cloud migration completed ahead of schedule.
## Key Metrics
| Metric | Value |
|---------|------|
| Uptime | 99.97% |

Headings, tables, bullets, all explicit. An LLM doesn't have to guess that "Key Metrics" is a section header or that three lines are a list. It's encoded.
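Once the hierarchy is explicit, a downstream pipeline can read it with a few lines of deterministic code instead of layout heuristics. The sketch below is an illustration, not part of Adaptive PDF: `split_sections` is a hypothetical helper that recognizes only `#`-style headings and would be fooled by a `#` line inside a fenced code block.

```python
import re

# A markdown heading written with one to six leading "#" characters.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown: str):
    """Return a list of (level, title, body_lines), one entry per heading."""
    sections = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((len(match.group(1)), match.group(2).strip(), []))
        elif sections and line.strip():
            # Non-blank lines belong to the most recent heading.
            sections[-1][2].append(line)
    return sections


extracted = """# Quarterly Infrastructure Report
## Overview
Cloud migration completed ahead of schedule.
## Key Metrics
| Metric | Value |
|---------|------|
| Uptime | 99.97% |
"""

sections = split_sections(extracted)
for level, title, body in sections:
    print(level, title, len(body))
```

Run against the flattened extract, the same function finds no sections at all, which is the gap the embedded layer closes.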

Benchmarks

Token counts stay roughly the same across document types. A 417-page textbook went from 193,064 tokens to 195,858 tokens, an increase of less than 2%. A resume went from 650 to 668 tokens, an increase of about 2.8%. A research paper actually shrank, from 8,082 to 7,897 tokens, a drop of about 2.3%. The advantage isn't fewer tokens. It's that the same tokens now carry structure. "Overview" and "## Overview" cost about the same, but one tells the machine what it's looking at.
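The relative changes can be recomputed directly from the token counts quoted above. The snippet is only arithmetic on those figures. It does not rerun any tokenizer, and the dictionary keys are labels chosen here.

```python
def percent_change(before: int, after: int) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# (tokens from the normal PDF, tokens from the Adaptive PDF), as quoted above.
token_counts = {
    "textbook": (193_064, 195_858),
    "resume": (650, 668),
    "research paper": (8_082, 7_897),
}

for name, (before, after) in token_counts.items():
    print(f"{name}: {percent_change(before, after):+.2f}%")
# textbook: +1.45%
# resume: +2.77%
# research paper: -2.29%
```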

When an Adaptive PDF was uploaded to both ChatGPT and Claude, the models returned markdown formatting that matched the embedded layer exactly, including formatting choices no layout heuristic would reproduce identically. That's not conclusive proof, since LLMs can also infer structure on their own, but the exact match is suggestive.

What This Means for PDF Pipelines

Almost every LLM-powered PDF processing pipeline today includes a conversion step. Parse the PDF, extract text, hope the structure comes through. Tools like Docling produce markdown from normal PDFs via layout analysis, but that's inference, not ground truth. Adaptive PDF flips the model. Instead of reconstructing structure after extraction, you embed it from the start. The PDF becomes self-describing.

The file extension stays .pdf. No new format to adopt, no viewer compatibility issues. What comes out simply depends on who, or what, is reading the file.

Gaud is exploring a Google Docs extension to streamline the creation process. The initial implementation is open source at github.com/iminoaru/adaptivepdf. The approach is a good example of solving a modern problem with a specification feature that's been available for over two decades, just waiting for someone to use it at the document level.
