Inside the Epstein PDFs: A Digital Forensics Deep Dive
#Security


Startups Reporter

The PDF Association's technical analysis of the Epstein Files reveals how the DOJ sanitized sensitive documents, uncovering both robust redaction practices and surprising technical inefficiencies.

The recent release of thousands of PDF documents by the US Department of Justice under the "Epstein Files Transparency Act" has sparked intense scrutiny, not just for their content but for their technical construction. The PDF Association's forensic analysis reveals a fascinating case study in how government agencies handle sensitive document releases in the digital age.

The Challenge of PDF Forensics

PDFs present unique challenges for forensic analysis compared to other formats. As binary files requiring specialized knowledge and software, they demand a different approach than simple text documents or spreadsheets. The Epstein PDF collection—totaling nearly 3GB across 4,085 files—exemplifies these complexities.

The analysis began with a fundamental question: are these files technically valid? Using multiple forensic tools, researchers found only one minor defect across the entire collection—109 PDFs had positive FontDescriptor Descent values instead of negative ones. This is a relatively common error typically associated with font substitution that doesn't affect overall file validity.
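
A check like this is easy to approximate. Here is a minimal sketch in Python, assuming a raw byte scan is good enough; real validators parse the object tree, and the regex below is my own illustration, not the Association's tooling:

```python
import re

# A FontDescriptor's /Descent must be negative (or zero); a positive
# value is the defect described above, typically introduced by font
# substitution during document conversion.
DESCENT_RE = re.compile(rb"/Descent\s+(-?\d+(?:\.\d+)?)")

def positive_descents(pdf_bytes: bytes) -> list[float]:
    """Return all /Descent values greater than zero found in the raw bytes."""
    return [float(m.group(1)) for m in DESCENT_RE.finditer(pdf_bytes)
            if float(m.group(1)) > 0]

sample = b"<< /Type /FontDescriptor /Ascent 718 /Descent 207 >>"
print(positive_descents(sample))  # → [207.0]
```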

Version Confusion and Incremental Updates

One of the most intriguing findings involved PDF version inconsistencies. Different tools reported wildly different version numbers for the same files. Tool A reported 209 files as version 1.3, while Tool B claimed 3,817 files were version 1.3. The discrepancy arose because Tool B failed to account for incremental updates—a PDF feature that allows multiple revisions to be stored in a single file.
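
One common source of such disagreement, offered here only as an illustration rather than a diagnosis of the specific tools involved, is that the header version (`%PDF-1.x`) can be overridden by a /Version key in the document catalog of a later revision. A tool that stops at the header will disagree with one that honors the override:

```python
import re

def pdf_versions(pdf_bytes: bytes):
    """Return (header_version, version_overrides).

    Sketch only: a conforming reader takes %PDF-1.x from the header,
    then applies any /Version key found in the catalog of the most
    recent revision, which can silently raise the effective version.
    """
    header = re.match(rb"%PDF-(\d\.\d)", pdf_bytes)
    overrides = re.findall(rb"/Version\s*/(\d\.\d)", pdf_bytes)
    return (header.group(1).decode() if header else None,
            [v.decode() for v in overrides])

sample = b"%PDF-1.3\n... << /Type /Catalog /Version /1.6 >> ..."
print(pdf_versions(sample))  # → ('1.3', ['1.6'])
```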

Incremental updates work by appending changes to the end of the original document, creating a chain of revisions. Conforming PDF software reads the file from the end: the most recent cross-reference section points at the current version of every object, so each revision layers over the last. The Epstein PDFs used this feature extensively, particularly for adding Bates numbering, the unique identifiers assigned to each page.
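
A crude way to see this structure, as a sketch under the assumption that each revision ends with its own `%%EOF` marker, is simply to count those markers:

```python
def count_revisions(pdf_bytes: bytes) -> int:
    """Estimate the number of revisions by counting %%EOF markers.

    Each incremental update appends its own cross-reference section
    and %%EOF, so N markers suggest one original plus N-1 updates.
    This is a heuristic: damaged or unusually built files can fool it.
    """
    return pdf_bytes.count(b"%%EOF")

sample = b"%PDF-1.3\noriginal\n%%EOF\nupdate 1\n%%EOF\nupdate 2\n%%EOF"
print(count_revisions(sample))  # → 3 (one original, two updates)
```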

The Bates Numbering Mystery

The analysis revealed that Bates numbers were added through separate incremental updates. In the first PDF examined (EFTA00000001.pdf), the numbering appeared in the third incremental update, using a hexadecimal string to paint the identifier onto each page. This pattern was consistent across all examined files.
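
PDF hex strings are just pairs of hex digits between angle brackets, so decoding one is trivial. The bytes below are an assumed illustration of how a Bates identifier could be encoded, not a quote from the released files:

```python
# A hex string as it would appear between < > in a content stream,
# here (hypothetically) spelling out a Bates identifier.
hex_string = "454654413030303030303031"
print(bytes.fromhex(hex_string).decode("ascii"))  # → EFTA00000001
```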

Interestingly, the first incremental update added a missing binary marker comment that should have been present from the start. This comment tells software to treat the file as binary rather than text, preventing potential corruption from line ending changes. Its placement after binary data was technically pointless but didn't impact functionality.
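
The marker in question is a comment line of at least four bytes with values of 128 or higher, placed immediately after the `%PDF` header. A hedged sketch of the check, inspecting only the first kilobyte (enough for any well-formed header):

```python
def has_binary_marker(pdf_bytes: bytes) -> bool:
    """Check for the recommended binary-marker comment near the header.

    The convention: a comment line containing at least four bytes with
    values >= 128, right after %PDF-1.x, so transfer tools treat the
    file as binary rather than text.
    """
    for line in pdf_bytes[:1024].split(b"\n")[1:4]:
        if line.startswith(b"%") and sum(b >= 128 for b in line) >= 4:
            return True
    return False

print(has_binary_marker(b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n"))  # → True
```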

Hidden Metadata and Orphaned Objects

Perhaps the most surprising discovery was an "orphaned" document information dictionary hidden inside a compressed object stream. This dictionary contained metadata about the software used (OmniPage CSDK 21.1) and creation/modification dates, but wasn't referenced in the final document version. As such, it was invisible to standard PDF software.

This finding highlights a critical lesson in PDF sanitization: other incremental updates may have marked earlier document information dictionaries or XMP metadata streams as free in the cross-reference table without actually erasing their bytes from the file. The presence of this orphaned metadata demonstrates why extra care is required when sanitizing PDFs.
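
Spotting such orphans is mechanically simple: inflate the object stream and look for Info-dictionary keys. A rough sketch, assuming the stream is FlateDecode-compressed; a real parser would also honor the stream's /N and /First offsets rather than grepping the inflated bytes:

```python
import re
import zlib

def find_info_keys(stream_bytes: bytes) -> list[str]:
    """Inflate a FlateDecode object stream and report any
    Info-dictionary keys hiding inside it."""
    data = zlib.decompress(stream_bytes)
    return re.findall(r"/(Producer|Creator|CreationDate|ModDate)",
                      data.decode("latin-1"))

# Hypothetical payload echoing the orphaned dictionary described above.
payload = zlib.compress(
    b"<< /Producer (OmniPage CSDK 21.1) /CreationDate (D:20251218120000Z) >>")
print(find_info_keys(payload))  # → ['Producer', 'CreationDate']
```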

Image Processing and Redaction Practices

The Epstein PDFs contained no JPEG images, despite appearing to include photographs. The Department of Justice had explicitly converted all lossy JPEG images to low-resolution (96 DPI) FLATE-encoded bitmaps using indexed device-dependent color spaces with 256-color palettes. This conversion served multiple purposes:

  • Eliminated EXIF, IPTC, and XMP metadata that could reveal camera details, GPS locations, or timestamps
  • Reduced file sizes while maintaining readability
  • Made text on photographed objects harder to discern
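
A quick way to verify a claim like this across a whole collection is to tally the image compression filters that appear in the raw bytes. This heuristic scan is my own sketch, not the Association's methodology; a rigorous check would walk each image XObject's dictionary:

```python
import re
from collections import Counter

def image_filters(pdf_bytes: bytes) -> Counter:
    """Tally common image compression filter names in raw PDF bytes.

    A collection sanitized as described above should show /FlateDecode
    and no /DCTDecode (JPEG) at all.
    """
    return Counter(
        m.group(1).decode()
        for m in re.finditer(
            rb"/(DCTDecode|FlateDecode|JPXDecode|CCITTFaxDecode)",
            pdf_bytes))
```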

Black box redactions were correctly burned directly into the image pixel data rather than added as separate PDF rectangle objects floating above the sensitive information. This robust approach prevents the common mistake of leaving the underlying text recoverable through a simple copy-paste operation.
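
The failure mode this avoids is worth seeing concretely. The fabricated content stream below (my own illustration, with an invented name and number) draws text and then paints a black rectangle over it; the text operator survives and is recovered with one regular expression:

```python
import re

# The *wrong* way to redact: a black rectangle (re ... f) is painted
# over the text, but the show-text operator (Tj) and its string remain
# in the content stream and stay extractable.
bad_redaction = b"""
BT /F1 12 Tf 72 700 Td (Jane Doe, 555-0199) Tj ET
0 0 0 rg 70 695 140 16 re f
"""
leaked = re.findall(rb"\((.*?)\)\s*Tj", bad_redaction)
print(leaked)  # → [b'Jane Doe, 555-0199']
```

Burning redactions into the pixels, as the DOJ did, leaves nothing for such an extraction to find.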

Technical Inefficiencies

Despite robust redaction practices, the analysis uncovered several technical inefficiencies:

  • Unnecessary empty content streams and ProcSet references
  • Mixed use of conventional cross-reference tables and compressed object streams
  • Inconsistent PDF version handling
  • Retained PDF comments that could inadvertently disclose information

These inefficiencies resulted in larger files than necessary and could be avoided with better PDF tooling practices.

The Human Element

The analysis also revealed insights into the human processes behind the document preparation. Creation and modification dates for the 3,609 PDFs with timestamps ranged from December 18 to December 19, 2025, suggesting the batch processing took at least 36 hours. The consistent use of the monospaced Courier typeface made it easy to determine how many characters each redaction covers, narrowing the search space for guessing redacted content.
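
PDF timestamps use their own `D:YYYYMMDDHHmmSS` format, so deriving a processing window is a one-liner. The two timestamps below are hypothetical values consistent with the reported range, not actual values from the files:

```python
from datetime import datetime, timedelta

def parse_pdf_date(d: str) -> datetime:
    """Parse the core of a PDF date string such as D:20251218093000.

    Sketch only: ignores the optional timezone suffix (Z, +HH'mm').
    """
    return datetime.strptime(d[2:16], "%Y%m%d%H%M%S")

# Hypothetical first and last timestamps consistent with the range above.
span = parse_pdf_date("D:20251219210000") - parse_pdf_date("D:20251218090000")
print(span >= timedelta(hours=36))  # → True
```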

Conclusion: Robust but Imperfect

The Epstein PDF forensic analysis reveals a government agency that has developed internal processes for sanitizing and redacting sensitive information before public release. The DOJ's approach to image conversion and black box redactions demonstrates an understanding of common PDF vulnerabilities.

However, the technical inefficiencies and orphaned metadata show that even well-intentioned sanitization workflows can have gaps. The presence of PDF comments and unnecessary objects suggests room for improvement in their PDF processing pipeline.

This case study serves as both a success story in preventing information leakage and a cautionary tale about the complexities of digital document forensics. As the PDF Association notes, variations in files and tool assumptions can easily yield false results, making this field both challenging and essential for transparency in the digital age.
