Kore: A Rust‑first Columnar File Format Promising Faster Queries and Higher Compression

Kore is an open‑source binary columnar format targeting analytical workloads. Its initial release advertises 38 % compression (versus 63 % for Parquet) and up to 131× speedups on queries that can exploit column pruning and predicate push‑down. The project ships a Rust library, a PySpark connector, and basic tooling, but parts of the current codebase are still stubbed out and ecosystem support is limited.

What the announcement claims

Compression – Kore claims a 38 % compression ratio, which the authors compare to the 63 % typical of Apache Parquet on the same benchmark data.
Query performance – By combining column pruning and predicate push‑down, the format allegedly delivers a 131× speedup for selective queries.
Zero data‑loss verification – The repository mentions testing over 400 k cells to confirm that reading back a written file yields exactly the original values.
Native Spark integration – A thin PySpark wrapper (KoreDataFrameReader / KoreDataFrameWriter) lets users read and write .kore files directly from Spark 3.5+.
Rust‑first API – The core library is a Rust crate exposing simple functions such as kore_write_simple, kore_read_simple, and column‑level reads via kore_read_col_simple.
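
The zero data‑loss claim amounts to a round‑trip test: write a table, read it back, and compare every cell. The sketch below shows one way such a check could look in Python. It does not use Kore at all; `count_mismatched_cells` and `json_round_trip` are hypothetical names, and a JSON encode/decode stands in for the write/read pair purely so the example runs with the standard library.

```python
import json


def count_mismatched_cells(original, restored):
    """Compare two row-major tables cell by cell.

    Returns (cells_checked, mismatches). Raises ValueError if the
    shapes differ, since that is already a failed round trip.
    """
    if len(original) != len(restored):
        raise ValueError("row count differs")
    checked = 0
    mismatches = 0
    for row_a, row_b in zip(original, restored):
        if len(row_a) != len(row_b):
            raise ValueError("column count differs")
        for a, b in zip(row_a, row_b):
            checked += 1
            # Compare types as well as values so that 1, 1.0 and True
            # are not silently treated as the same cell.
            if type(a) is not type(b) or a != b:
                mismatches += 1
    return checked, mismatches


def json_round_trip(rows):
    """Stand-in for 'write a file, then read it back' (assumption)."""
    return json.loads(json.dumps(rows))


# 1,000 rows x 4 columns of mixed types: int, str, float, bool.
rows = [[i, f"name-{i}", i * 0.5, i % 2 == 0] for i in range(1000)]
checked, mismatches = count_mismatched_cells(rows, json_round_trip(rows))
print(f"checked {checked} cells, {mismatches} mismatches")
```

A check of this kind confirms that clean data survives one write and one read. It says nothing about truncated files, concurrent writers, or values such as NaN that never compare equal to themselves.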

The project’s GitHub page (https://github.com/arunkatherashala/Kore) presents a quick‑start guide and a publishing checklist for packaging the crate.

What’s actually new

A new binary format, not a fork

Kore is not a repackaging of an existing columnar layout such as Arrow or Parquet. The repository defines its own on‑disk layout: a small header followed by column‑wise compressed blocks. The compression routine is a thin wrapper around zstd at a fixed level, which explains why its compression ratio trails Parquet, whose encodings (dictionary, bit‑packing, etc.) are more aggressive.
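
As a rough illustration of a "header followed by column‑wise compressed blocks" layout, here is a minimal Python sketch. It is not Kore's actual byte layout: the magic bytes, the JSON header, the function names `write_columnar` and `read_column`, and the use of the standard library's `zlib` in place of zstd are all assumptions made to keep the example self‑contained.

```python
import json
import struct
import zlib

MAGIC = b"TOYC"  # hypothetical magic bytes, not Kore's


def write_columnar(columns):
    """Serialize {name: [values]} as: magic | header length | header | blocks.

    Each column is compressed on its own at a fixed level, and the header
    records where every block starts and how long it is.
    """
    blocks = []
    index = []
    offset = 0
    for name, values in columns.items():
        block = zlib.compress(json.dumps(values).encode("utf-8"), 6)
        index.append({"name": name, "offset": offset, "length": len(block)})
        blocks.append(block)
        offset += len(block)
    header = json.dumps({"columns": index}).encode("utf-8")
    return MAGIC + struct.pack("<I", len(header)) + header + b"".join(blocks)


def read_column(buf, name):
    """Decode one column by slicing out and decompressing only its block."""
    if buf[:4] != MAGIC:
        raise ValueError("not a toy columnar buffer")
    (header_len,) = struct.unpack_from("<I", buf, 4)
    body_start = 8 + header_len
    header = json.loads(buf[8:body_start])
    for entry in header["columns"]:
        if entry["name"] == name:
            start = body_start + entry["offset"]
            block = buf[start:start + entry["length"]]
            return json.loads(zlib.decompress(block))
    raise KeyError(name)


table = {"id": [1, 2, 3], "city": ["Oslo", "Lima", "Oslo"]}
buf = write_columnar(table)
print(read_column(buf, "city"))
```

Because the header stores each block's offset and length, `read_column` touches only the bytes of the requested column. That is the property column pruning depends on, independent of which compressor is used.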

Rust implementation with a Python bridge

The primary contribution is a pure‑Rust library that can be compiled to a native binary and called from Python via the kore package. The Rust side implements:

kore_write_simple(path, schema_json, data_json) – writes a JSON‑encoded schema and a JSON array of rows to a .kore file.
kore_read_simple(path) – returns the whole dataset as a JSON string.
kore_read_col_simple(path, column_name) – extracts a single column without materialising the rest.
kore_info_simple(path) – reports basic metadata (row count, column count, compression level).

The Python side merely forwards these calls; there is no native Spark data source. The Spark integration works by loading the whole file into the driver process, converting it to a pandas DataFrame, and then creating a Spark DataFrame from that. This approach works for small and medium‑sized files but will hit driver memory limits on true “big‑data” workloads.

Benchmarks are narrow

The 131× speedup is measured on a synthetic dataset where the query selects a single column with a highly selective predicate. The benchmark runs the Rust reader directly, bypassing Spark’s own execution engine. In a realistic Spark job that must deserialize rows and shuffle data, the observed gain shrinks dramatically, often to a factor of 2–5 depending on cluster size.
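
One way to see why a large reader‑level speedup shrinks inside a full job is Amdahl's law: if only a fraction of total runtime is spent in the part that gets faster, the rest caps the overall gain. The Python sketch below applies that formula to the 131× figure. The runtime fractions are illustrative assumptions, not measurements from Kore or Spark, and `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(read_fraction, read_speedup):
    """Amdahl's law: end-to-end speedup when only the read stage gets faster.

    read_fraction: share of the original runtime spent reading (0.0 to 1.0).
    read_speedup:  how many times faster the read stage becomes.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    if read_speedup <= 0:
        raise ValueError("read_speedup must be positive")
    return 1.0 / ((1.0 - read_fraction) + read_fraction / read_speedup)


# Assumed shares of job time spent in the file reader (illustrative only).
for fraction in (0.5, 0.8, 1.0):
    gain = overall_speedup(fraction, 131.0)
    print(f"{fraction:.0%} of runtime in reads -> {gain:.2f}x overall")
```

Under these assumptions, a job that spends half its time reading gains just under 2×, one that spends 80 % gains a little under 5×, and only a job that does nothing but read sees the full 131×.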

Limitations and open questions

Area	Observation
Compression	Kore’s 38 % figure is worse than what Parquet achieves with its default settings. The format sacrifices space efficiency for simplicity; users needing tighter storage will likely stick with Parquet or ORC.
Ecosystem	Apart from the minimal PySpark wrapper, there is no integration with Hive Metastore, AWS Glue, or other catalog services. Tools that expect Arrow‑compatible schemas will need custom adapters.
Stubs in the codebase	Several modules contain `unimplemented!()` placeholders. Core features such as schema evolution, nested types, and column statistics are not yet functional.
Scalability	The current Spark connector reads the entire file into driver memory. There is no support for distributed reads, which defeats the purpose of a columnar format for petabyte‑scale analytics.
Testing coverage	The repository ships a handful of unit tests but no large‑scale integration tests. The “400 k cells” verification is a sanity check rather than a proof of correctness under concurrent writes or corrupted files.
Versioning & stability	The release is labeled `v0.1.0`. The API surface is still experimental, and breaking changes are expected in future releases.

Practical takeaways

When to try Kore – If you are already working in Rust and need a lightweight columnar dump for offline analysis or model‑training pipelines, Kore’s simple API can be convenient. Its Python wrapper makes it easy to drop a file into a Jupyter notebook for quick inspection.
When to stay with Parquet/ORC – For production‑grade data lakes, especially those accessed by multiple engines (Spark, Presto, Hive), the lack of catalog integration, poorer compression, and single‑node read path make Kore a poor fit today.
Future work to watch – The repository mentions plans to replace stubbed functions with full implementations and to add CI pipelines. A distributed Spark data source that streams column blocks directly to executors would be a necessary step before Kore can claim relevance for big‑data workloads.

Bottom line

Kore is an interesting experiment in building a Rust‑centric columnar file format with a focus on ultra‑fast column reads. The current version delivers on its headline numbers only in tightly controlled micro‑benchmarks; real‑world analytics pipelines will encounter significant gaps in compression efficiency, ecosystem support, and scalability. Until those gaps are closed, Kore is best viewed as a proof‑of‑concept rather than a drop‑in replacement for established formats.

#Rust #Python #Columnar #Parquet #Data Analytics