Kore is an open‑source binary columnar format targeting analytical workloads. Its initial release advertises 38 % compression (versus 63 % for Parquet) and up to 131× speedups on queries that can exploit column pruning and predicate push‑down. The project ships a Rust library, a PySpark connector, and basic tooling, but the current codebase contains stubbed implementations and limited ecosystem support.
What the announcement claims
- Compression – Kore claims a 38 % compression ratio, which the authors compare to the 63 % typical of Apache Parquet on the same benchmark data.
- Query performance – By combining column pruning and predicate push‑down, the format allegedly delivers a 131× speedup for selective queries.
- Zero data‑loss verification – The repository mentions testing over 400 k cells to confirm that reading back a written file yields exactly the original values.
- Native Spark integration – A thin PySpark wrapper (KoreDataFrameReader/KoreDataFrameWriter) lets users read and write .kore files directly from Spark 3.5+.
- Rust‑first API – The core library is a Rust crate exposing simple functions such as kore_write_simple, kore_read_simple, and column‑level reads via kore_read_col_simple.
The project’s GitHub page (https://github.com/arunkatherashala/Kore) presents a quick‑start guide and a publishing checklist for packaging the crate.
What’s actually new
A new binary format, not a fork
Kore is not a repackaging of an existing columnar layout like Arrow or Parquet. The repository defines its own on‑disk layout: a small header followed by column‑wise compressed blocks. The compression routine is a thin wrapper around zstd with a fixed level, which likely explains why Kore compresses less well than Parquet, whose encodings (dictionary, bit‑packing, etc.) are more aggressive.
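To make the "header plus column‑wise compressed blocks" idea concrete, here is a minimal sketch of such a layout. It is not Kore's actual format: the magic bytes, the JSON header, and the offset index are all hypothetical, and zlib from the Python standard library stands in for zstd. It only illustrates why a column‑level read can skip every other column's block.

```python
import json
import struct
import zlib

MAGIC = b"TOYC"  # hypothetical magic bytes, not Kore's
LEVEL = 3        # fixed compression level, picked arbitrarily for the sketch


def write_columns(columns: dict) -> bytes:
    """Serialize as: magic | header length | JSON header | compressed blocks."""
    blocks = []
    index = []
    offset = 0
    for name, values in columns.items():
        # Each column is compressed independently (zlib stands in for zstd).
        block = zlib.compress(json.dumps(values).encode("utf-8"), LEVEL)
        index.append({"name": name, "offset": offset, "length": len(block)})
        blocks.append(block)
        offset += len(block)
    header = json.dumps({"columns": index}).encode("utf-8")
    return MAGIC + struct.pack("<I", len(header)) + header + b"".join(blocks)


def read_column(buf: bytes, name: str) -> list:
    """Decompress only the block for `name`; other blocks are never touched."""
    if buf[:4] != MAGIC:
        raise ValueError("not a toy columnar buffer")
    (header_len,) = struct.unpack_from("<I", buf, 4)
    body_start = 8 + header_len
    header = json.loads(buf[8:body_start])
    for entry in header["columns"]:
        if entry["name"] == name:
            start = body_start + entry["offset"]
            block = buf[start:start + entry["length"]]
            return json.loads(zlib.decompress(block))
    raise KeyError(name)


table = {"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]}
encoded = write_columns(table)
# Round trip: reading a column back yields exactly what was written.
assert read_column(encoded, "city") == table["city"]
```

Because each block is compressed on its own with a general‑purpose codec, the sketch also hints at the trade‑off noted above: there is no per‑column encoding step (dictionary, bit‑packing) before compression, which keeps the writer simple at the cost of size.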
Rust implementation with a Python bridge
The primary contribution is a pure‑Rust library that can be compiled to a native binary and called from Python via the kore package. The Rust side implements:
- kore_write_simple(path, schema_json, data_json) – writes a JSON‑encoded schema and a JSON array of rows to a .kore file.
- kore_read_simple(path) – returns the whole dataset as a JSON string.
- kore_read_col_simple(path, column_name) – extracts a single column without materialising the rest.
- kore_info_simple(path) – reports basic metadata (row count, column count, compression level).
The Python side merely forwards these calls; there is no Spark‑native execution engine. The Spark integration works by loading the whole file into the driver process, converting it to a Pandas DataFrame, and then creating a Spark DataFrame from that. This approach is functional for small to medium files but will hit memory limits on true “big‑data” workloads.
Benchmarks are narrow
The 131× speedup is measured on a synthetic dataset where the query selects a single column with a highly selective predicate. The benchmark runs the Rust reader directly, bypassing Spark’s own execution engine. In a realistic Spark job that must deserialize rows and shuffle data, the observed gain shrinks dramatically, often to a factor of 2–5 depending on cluster size.
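One way to see why a 131× reader speedup can shrink to single digits is Amdahl's law: only the read stage gets faster, while deserialization, shuffles, and everything else take as long as before. The sketch below is illustrative arithmetic, not a measurement. The read‑stage fractions are assumptions chosen for the example, and the function name is hypothetical.

```python
def overall_speedup(read_fraction: float, read_speedup: float) -> float:
    """Amdahl's law: speed up one stage, leave the rest of the job unchanged.

    read_fraction: share of the original runtime spent reading the file (0..1).
    read_speedup:  factor by which that read stage alone becomes faster.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    if read_speedup <= 0:
        raise ValueError("read_speedup must be positive")
    return 1.0 / ((1.0 - read_fraction) + read_fraction / read_speedup)


# Assumed read-stage shares; real jobs would have to be profiled.
for fraction in (0.5, 0.8, 0.95):
    gain = overall_speedup(fraction, 131.0)
    print(f"read stage = {fraction:.0%} of runtime -> {gain:.2f}x overall")
```

Under these assumptions, a job that spends half its time reading gains roughly 2× overall, and one that spends 80 % of its time reading gains roughly 4.9×, which is consistent with the 2–5× range quoted above.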
Limitations and open questions
| Area | Observation |
|---|---|
| Compression | Kore's reported 38 % compression is worse than the 63 % the authors cite for Parquet on the same benchmark data. The format sacrifices space efficiency for simplicity; users needing tighter storage will likely stick with Parquet or ORC. |
| Ecosystem | Apart from the minimal PySpark wrapper, there is no integration with Hive Metastore, AWS Glue, or other catalog services. Tools that expect Arrow‑compatible schemas will need custom adapters. |
| Stubs in the codebase | Several modules contain unimplemented!() placeholders. Core features such as schema evolution, nested types, and column statistics are not yet functional. |
| Scalability | The current Spark connector reads the entire file into driver memory. There is no support for distributed reads, which defeats the purpose of a columnar format for petabyte‑scale analytics. |
| Testing coverage | The repository ships a handful of unit tests but no large‑scale integration tests. The “400 k cells” verification is a sanity check rather than a proof of correctness under concurrent writes or corrupted files. |
| Versioning & stability | The release is labeled v0.1.0. The API surface is still experimental, and breaking changes are expected in future releases. |
Practical takeaways
- When to try Kore – If you are already working in Rust and need a lightweight columnar dump for offline analysis or model‑training pipelines, Kore’s simple API can be convenient. Its Python wrapper makes it easy to drop a file into a Jupyter notebook for quick inspection.
- When to stay with Parquet/ORC – For production‑grade data lakes, especially those accessed by multiple engines (Spark, Presto, Hive), the lack of catalog integration, poorer compression, and single‑node read path make Kore a poor fit today.
- Future work to watch – The repository mentions plans to replace stubbed functions with full implementations and to add CI pipelines. A distributed Spark data source that streams column blocks directly to executors would be a necessary step before Kore can claim relevance for big‑data workloads.
Bottom line
Kore is an interesting experiment in building a Rust‑centric columnar file format with a focus on ultra‑fast column reads. The current version delivers on its headline numbers only in tightly controlled micro‑benchmarks; real‑world analytics pipelines will encounter significant gaps in compression efficiency, ecosystem support, and scalability. Until those gaps are closed, Kore is best viewed as a proof‑of‑concept rather than a drop‑in replacement for established formats.
