Harmonic's pbcc: A Custom Python Protobuf Compiler for High-Scale AI Infrastructure
#Python

Startups Reporter
4 min read

Harmonic's infrastructure team has open-sourced pbcc, a custom Protocol Buffers implementation designed to replace the standard library in high-performance Python workloads. The compiler generates specialized C++ extensions to handle massive datasets with reduced overhead and a cleaner API.

Harmonic has open-sourced pbcc, a custom Protocol Buffers compiler for Python designed to address performance bottlenecks in their AI infrastructure. The tool replaces the standard Google Protobuf library for handling large-scale data serialization, offering significant improvements in speed, memory efficiency, and developer ergonomics.

The Problem with Standard Protobuf

As Harmonic's data scale grew, their infrastructure team encountered fundamental limitations with Google's standard Python Protobuf implementation. The standard library struggled with very large messages, frequently failing on inputs larger than 2GB. Performance issues were compounded by memory inefficiency and an API that didn't integrate well with modern Python development practices.

The standard library's interface played poorly with static type checkers and forced engineers into awkward workarounds. Repeated fields, for example, are exposed as custom container types rather than native Python lists, so a strict equality check against an ordinary list fails, and developers had to cast the data constantly just to perform basic comparisons.
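As a rough illustration of that friction, the sketch below assumes a message Example with a repeated int32 values field; the example_pb2 and example_pbcc module names are placeholders, and the pbcc half reflects the ergonomics described here rather than a confirmed API.

```python
# Illustration only: `example_pb2` is assumed to be a protoc-generated module
# for a message `Example` with a `repeated int32 values` field.
from example_pb2 import Example

msg = Example(values=[1, 2, 3])
print(type(msg.values))               # a repeated-field container, not a plain list
assert list(msg.values) == [1, 2, 3]  # the explicit cast developers had to sprinkle around

# With a pbcc-generated module the same field would be a native list
# (module name and behavior assumed from the description above):
# from example_pbcc import Example
# msg = Example(values=[1, 2, 3])
# assert msg.values == [1, 2, 3]      # no cast needed
```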

Harmonic needed a solution that was:

  • Fast: Capable of serializing and deserializing large payloads with minimal latency
  • Memory Efficient: Able to handle very large messages without crashing
  • Pythonic: Using native types like lists and dicts with proper IDE type hinting support

Architecture: Ahead-of-Time Compilation

Unlike the standard library's runtime interpretation of message descriptors, pbcc takes an ahead-of-time compilation approach. The system uses a Python script called compile.py that reads message descriptors from modules already compiled by Google's Protobuf compiler. It then generates specific C++ source code for each message type using a highly optimized C++ template.
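A minimal sketch of that descriptor-driven step, assuming a protoc-generated module named example_pb2; the iteration uses the standard google.protobuf descriptor API, and the C++ emission itself is omitted. This is an illustration of the approach, not pbcc's actual compile.py.

```python
# Sketch of the ahead-of-time idea: walk the descriptors exposed by a
# protoc-generated module and collect the information a template-based
# C++ code generator would need for each message type.
import example_pb2  # assumed module produced by Google's protoc

for name, message in example_pb2.DESCRIPTOR.message_types_by_name.items():
    print(f"message {name}")
    for field in message.fields:
        # Field number, type, and label are enough to emit a specialized
        # C++ accessor for this field.
        print(f"  #{field.number} {field.name} (type={field.type}, label={field.label})")
```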

This generated C++ code compiles into an extension module (.so file) that Python can import directly. The resulting module provides message types as classes that behave like Python's built-in dataclasses, but with built-in serialization and deserialization functions.
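A hypothetical usage sketch of such a generated module; the module name and the serialize/deserialize method names are assumptions for illustration, not pbcc's documented API.

```python
# Hypothetical: `example_pbcc` stands in for the compiled extension module
# (.so) that pbcc would generate for a message `SearchRequest`.
import example_pbcc

req = example_pbcc.SearchRequest(query="protobuf", page=1, tags=["fast", "python"])

# Fields behave like dataclass attributes backed by native Python types.
req.tags.append("cpp")
assert req.tags == ["fast", "python", "cpp"]

# Round-trip through the built-in (de)serialization functions
# (method names assumed).
payload = req.serialize()
assert example_pbcc.SearchRequest.deserialize(payload) == req
```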

The architecture encapsulates Protobuf complexity entirely in the extension module, allowing engineers to treat pbcc objects as normal Python objects while getting C++ performance characteristics.

Key Design Decisions

Correctness First: Harmonic prioritized correctness over speed, implementing an extensive test suite that exercises every field type with every modifier (optional, repeated, oneof, and map). The system preserves Protobuf's unknown-field behavior, so re-serializing a message always yields semantically equivalent data, even when the input contained fields the module didn't recognize.
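The unknown-field guarantee amounts to a round-trip property. The sketch below states it using hypothetical old_schema and new_schema modules compiled from two revisions of the same .proto file; module, class, and method names are assumptions for illustration.

```python
# Property sketch: bytes produced by a newer schema revision pass through a
# module built from an older revision without losing the fields that the
# older module doesn't recognize. Names are assumptions for illustration.
new_msg = new_schema.Record(id=7, note="field added in a later .proto revision")
wire = new_msg.serialize()

old_msg = old_schema.Record.deserialize(wire)  # `note` is an unknown field here
rewritten = old_msg.serialize()

# The unknown field survives re-serialization, so a reader that does know the
# newer schema still sees it.
assert new_schema.Record.deserialize(rewritten).note == "field added in a later .proto revision"
```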

Memory Management: The team implemented a PyObjectRef wrapper class that acts like a std::shared_ptr but holds references to Python objects, ensuring no memory is leaked even when processing massive, deeply nested messages. pbcc also uses native 64-bit integer types throughout, eliminating artificial size limits; the team suspects that Google's upb library uses signed 32-bit values for message sizes, which would explain the 2GB limit, though they haven't verified this hypothesis.

Python Integration: To make the tool developer-friendly, pbcc generates .pyi stub files alongside compiled modules, enabling full autocomplete and type checking in IDEs. The compiler maps all Protobuf field types to native Python types: repeated fields become lists, maps become dicts, and optional fields become None. This allows intuitive operations like direct equality comparisons between repeated fields and list literals.
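A hypothetical excerpt of what such a generated stub could look like; the field names and overall shape are assumed for illustration, not taken from pbcc's actual output.

```python
# example_pbcc.pyi -- illustrative stub shape, not pbcc's actual output.
# Repeated fields are typed as list, maps as dict, optional fields as `T | None`.
class SearchRequest:
    query: str
    page: int
    tags: list[str]                # repeated string
    labels: dict[str, str]         # map<string, string>
    deadline_ms: int | None        # optional int64

    def serialize(self) -> bytes: ...
    @staticmethod
    def deserialize(data: bytes) -> "SearchRequest": ...
```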

Performance Optimization: The C++ implementation uses custom StringReader and StringWriter classes for efficient memory management, copying data only when necessary. Even without extensive hand-tuning, pbcc already performs comparably to Google's upb library.

Trade-offs and Philosophy

The decision to build pbcc in-house rather than using existing solutions reflects Harmonic's engineering philosophy. Even though serialization is considered a solved problem by many, the team wasn't satisfied with available options. This mirrors their approach with other infrastructure components, like their REPL service for Lean theorem proving.

The team believes that organizations often underestimate the overhead of integrating external projects that don't fit perfectly. In Harmonic's case, building a custom solution that exactly matched their needs proved more efficient than adapting to the limitations of existing tools.

Availability and Impact

pbcc is now available as open-source software at github.com/harmonic-ai/pbcc. While designed for Harmonic's specific high-performance AI infrastructure needs, the tool addresses common pain points with Python Protobuf implementations that other organizations working with large-scale data may encounter.

The release demonstrates Harmonic's commitment to solving engineering problems in simple, elegant, and performant ways, even when that means challenging conventional approaches and building components from scratch when existing solutions fall short.

This article is part of Harmonic's ongoing series about their infrastructure and research developments. The company recently announced their Series C funding round and continues to advance their Mathematical Superintelligence (MSI) initiative.
