Processing 2.4TB of Data in 76 Seconds: The Trillion Row Challenge Breakthrough

A team has processed a 2.4TB dataset containing one trillion rows of weather data in just 76 seconds by fanning DuckDB queries out across a 10,000-CPU cloud cluster orchestrated with Burla. The result demonstrates that cloud-native tools can handle massive data processing tasks quickly and cheaply.


The Challenge: From Billion to Trillion Rows

The Trillion Row Challenge (TRC) extends the original Billion Row Challenge (BRC), which tasked developers with computing min, max, and mean temperatures per weather station from a dataset of one billion rows. The TRC scales this up a thousandfold: the same computation, but over one trillion (1,000,000,000,000) temperature readings drawn from 413 unique weather stations.

This isn't just an academic exercise. Processing datasets of this scale represents real-world challenges faced by organizations dealing with massive amounts of sensor data, logs, or scientific information. The ability to quickly extract meaningful insights from such large datasets is increasingly critical in fields from climate science to industrial monitoring.

The Solution: Distributed Processing with Burla

The team chose to tackle this challenge using Burla, a platform designed for distributed computing tasks. Their approach involved splitting the massive dataset into manageable chunks and processing them in parallel across a powerful cloud cluster.

Cluster Configuration

The solution utilized a 125-node cluster with 80 CPUs and 320GB RAM per node, running on Google Cloud's N4-standard-80 machines. This configuration provided a total of 10,000 CPUs working in parallel. The cluster booted in just 1 minute and 47 seconds—a remarkably fast setup time for such a powerful computing environment.

Data Generation and Storage

The team generated 1,000 Parquet files, each containing one billion rows of synthetic weather data. The data was structured with two columns: station name and temperature. These files were written to a shared folder that automatically synchronized with a Google Cloud Storage bucket using GCSFuse.

import pyarrow
import numpy as np
import pandas as pd
from burla import remote_parallel_map

TOTAL_ROWS = 1_000_000_000_000
ROWS_PER_FILE = 1_000_000_000

lookup_df = pd.read_csv("avg_temp_per_station.csv")
file_nums = range(TOTAL_ROWS // ROWS_PER_FILE)

def generate_parquet(file_num: int):
    rng = np.random.default_rng(seed=file_num)

    # array of station names
    station_indices = rng.integers(0, len(lookup_df) - 1, ROWS_PER_FILE)
    station_names = lookup_df.station.to_numpy()[station_indices]

    # array of station temps, aligned by index with `station_names`
    station_temps = rng.normal(0, 10, ROWS_PER_FILE)
    station_temps += lookup_df.mean_temp.to_numpy()[station_indices]
    station_temps = station_temps.round(1)

    # save to parquet in cluster filesystem
    df = pd.DataFrame({"station": station_names, "temp": station_temps})
    df.to_parquet(f"shared/1TRC/{file_num}.parquet", engine="pyarrow")

remote_parallel_map(generate_parquet, file_nums, func_ram=64)

This data generation process completed in 5 minutes and 48 seconds, resulting in a 2.4TB dataset ready for processing.
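
As a quick sanity check on those figures, the file count and total size can be read straight off the GCSFuse mount. A minimal sketch, assuming the same shared/1TRC/ folder used during generation is still mounted:

from pathlib import Path

# Hypothetical check against the GCSFuse-mounted folder written to above.
files = list(Path("shared/1TRC").glob("*.parquet"))
total_bytes = sum(f.stat().st_size for f in files)
print(f"{len(files)} files, {total_bytes / 1e12:.2f} TB")  # expect ~1,000 files, ~2.4 TB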

The Processing Engine: DuckDB in Parallel

With the data prepared, the team turned to DuckDB, an in-process analytical database designed for fast queries on structured data. Their approach was to run a simple query on each Parquet file in parallel, then combine the results.

import duckdb
import pandas as pd
from time import time
from burla import remote_parallel_map

TOTAL_ROWS = 1_000_000_000_000
ROWS_PER_FILE = 1_000_000_000
file_nums = range(TOTAL_ROWS // ROWS_PER_FILE)

def station_stats(file_num: int):
    query = """
    SELECT
        station,
        MIN(temp) AS min_temp,
        AVG(temp) AS mean_temp,
        MAX(temp) AS max_temp
    FROM read_parquet(?)
    GROUP BY station
    ORDER BY station
    """
    con = duckdb.connect(database=":memory:")
    con.execute("PRAGMA threads=8")
    return con.execute(query, (f"shared/1TRC/{file_num}.parquet",)).df()

start = time()
dataframes = remote_parallel_map(station_stats, file_nums, func_cpu=10)
df = pd.concat(dataframes).groupby("station").agg(
    min_value=("min_temp", "min"),
    mean_value=("mean_temp", "mean"),
    max_value=("max_temp", "max"),
)
print(df)
print(f"Done after {start-time()}s!")

Each query returned a pandas DataFrame with the min, mean, and max temperatures per station within that file. These DataFrames were then concatenated and aggregated locally to produce the final results.
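
One caveat on the mean column: the script averages the per-file means, which is exact only if every station appears the same number of times in every file. With a trillion uniformly sampled rows the discrepancy is negligible, but an exact global mean is cheap to get. A sketch, assuming the per-file query were extended with COUNT(*) AS n_rows and that dataframes is the list returned by remote_parallel_map:

import pandas as pd

# Hypothetical refinement: weight each file's mean by its per-station row count
# so the final mean is exact rather than a mean of per-file means.
def exact_station_stats(dataframes: list[pd.DataFrame]) -> pd.DataFrame:
    combined = pd.concat(dataframes)
    combined["temp_sum"] = combined["mean_temp"] * combined["n_rows"]
    agg = combined.groupby("station").agg(
        min_value=("min_temp", "min"),
        temp_sum=("temp_sum", "sum"),
        n_rows=("n_rows", "sum"),
        max_value=("max_temp", "max"),
    )
    agg["mean_value"] = agg["temp_sum"] / agg["n_rows"]
    return agg[["min_value", "mean_value", "max_value"]]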

Results: 76 Seconds for 2.4TB of Data

The processing completed in an impressive 76 seconds, analyzing all 1,000 Parquet files containing one trillion rows of data. The final output provided the minimum, mean, and maximum temperatures for each weather station in the dataset.

              name  min_value  mean_value  max_value
0             Abha      -46.9   18.000151       84.2
1          Abidjan      -34.1   26.000040       89.9
2           Abéché      -30.9   29.400059       91.5
3            Accra      -36.1   26.399808       88.7
4      Addis Ababa      -46.5   15.999770       81.7
..             ...        ...         ...        ...
407       Yinchuan      -54.2    9.000354       75.5
408         Zagreb      -52.6   10.700003       75.1
409  Zanzibar City      -36.0   25.999645       91.1
410         Ürümqi      -53.8    7.399437       69.3
411          İzmir      -40.8   17.900016       80.9

[412 rows x 4 columns]

Cost Analysis: Under $10 for Massive Processing

Perhaps as impressive as the speed is the cost. The team used spot instances—discounted computing resources that can be preempted if needed—which significantly reduced expenses. With 125 nodes running for approximately 3.5 minutes each (including boot time), the total came to about 7.3 node-hours of compute.

At spot pricing of $1.22 per hour per N4-standard-80 machine, the entire processing job cost approximately $8.91. This demonstrates that even massive data processing tasks can be economically feasible with the right approach.
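
The arithmetic behind that estimate is easy to reproduce:

# Back-of-envelope reproduction of the cost figures above.
nodes = 125
minutes_per_node = 3.5        # boot time plus processing, per the article
spot_price_per_hour = 1.22    # N4-standard-80 spot price quoted above

node_hours = nodes * minutes_per_node / 60
cost = node_hours * spot_price_per_hour
print(f"{node_hours:.2f} node-hours -> ~${cost:.2f}")
# prints ~7.29 node-hours -> ~$8.90; rounding up to 7.3 node-hours gives the ~$8.91 quoted above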

Beyond the Numbers: Real-World Implications

While 76 seconds is a remarkable achievement, the team emphasizes that the real value lies in what this represents for real-world data processing. In a typical business environment, a developer could receive this request, write the necessary code, and have results in about five minutes for less than $10—all using an interface accessible to beginners.

"If I were in the office, and you asked me to get you the min/mean/max per station, assuming I'd never heard of this challenge before, I'd have an answer for you around 5 minutes later, and for less than $10. In my opinion this is the real result, and I think it's an impressive one!"

This democratization of massive data processing capabilities could transform how organizations approach big data analytics, making it accessible to more teams without requiring specialized expertise or prohibitively expensive infrastructure.

The Path to Even Greater Speed: Under 5 Seconds?

The team believes their current implementation isn't fully optimized and that significant speed improvements are possible. They've already demonstrated this by achieving 39 seconds when using better compression that reduced the dataset to 1.2TB.
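
The article doesn't name the codec behind that 1.2TB variant, but as one illustration of the lever involved, pandas can pass a stronger Parquet codec and compression level straight through to pyarrow:

import pandas as pd

# Hypothetical example of writing more aggressively compressed Parquet.
# The codec used for the 1.2TB run isn't specified; zstd is one common choice.
df = pd.DataFrame({"station": ["Abha", "Abidjan"], "temp": [18.0, 26.0]})
df.to_parquet(
    "example_zstd.parquet",   # hypothetical output path
    engine="pyarrow",
    compression="zstd",
    compression_level=9,      # higher level: smaller files, slower writes
)

Smaller files mean less data pulled from object storage per worker, the same bottleneck the download optimization below targets.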

Their analysis suggests several potential optimizations:

  1. Faster Data Downloads: N4-standard-80 machines can achieve download speeds of up to 50Gbps from cloud storage in the same region. With optimized parallel connection logic (rather than GCSFuse), they estimate the entire compressed dataset could be loaded into memory in just 1.9 seconds; a rough sketch of that parallel download idea follows this list.

  2. Parallelizing the 1BRC Winning Code: The best solution to the original Billion Row Challenge completes in 1.5 seconds using 8 CPUs and 32GB of RAM. By adapting this approach to work with compressed Parquet files instead of CSV and running it across 1,000 machines in parallel, the team believes a sub-5-second completion time is achievable.

  3. Algorithmic Optimizations: Further optimizations to the query logic and data processing pipeline could reduce overhead and improve parallelization efficiency.
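
The first point is the most concrete. The article doesn't include code for it, but the idea of replacing GCSFuse with many concurrent ranged requests can be sketched with the standard google-cloud-storage client; the bucket and object names below are hypothetical:

from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

def parallel_download(bucket_name: str, blob_name: str, n_connections: int = 8) -> bytes:
    # Fetch one object over several concurrent ranged requests instead of via GCSFuse.
    blob = storage.Client().bucket(bucket_name).get_blob(blob_name)
    chunk = -(-blob.size // n_connections)  # ceiling division
    ranges = [(i * chunk, min((i + 1) * chunk, blob.size) - 1) for i in range(n_connections)]

    def fetch(byte_range):
        start, end = byte_range
        return blob.download_as_bytes(start=start, end=end)  # inclusive byte range

    with ThreadPoolExecutor(max_workers=n_connections) as pool:
        return b"".join(pool.map(fetch, ranges))

data = parallel_download("my-trc-bucket", "1TRC/0.parquet")  # hypothetical names

Spreading a single object across several connections is what lets one VM approach its advertised network throughput, since an individual HTTPS stream typically tops out well below 50Gbps.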

The Future of Big Data Processing

This demonstration represents more than just a technical achievement—it illustrates the evolving landscape of big data processing. As cloud platforms become more sophisticated and tools like DuckDB and Burla mature, the barriers to processing massive datasets continue to lower.

For developers and organizations, this means that insights that once required massive, expensive data warehouses can now be extracted quickly and cost-effectively using modern distributed computing approaches. As the team notes, the ability to process a trillion rows of data in minutes rather than hours or days opens up new possibilities for real-time analytics and data-driven decision making.

The Trillion Row Challenge serves as both a benchmark and a proof of concept, showing what's possible when modern tools and cloud infrastructure are combined to solve big data problems. As the team continues to optimize their approach, we may see even more impressive performance numbers that push the boundaries of what's possible in distributed data processing.