
Async/Await on GPU: VectorWare's Breakthrough in GPU-Native Programming

Tech Essays Reporter

VectorWare has successfully implemented Rust's Future trait and async/await syntax on GPUs, enabling developers to write complex, high-performance applications using familiar Rust abstractions while addressing the challenges of GPU concurrency.

In the rapidly evolving landscape of parallel computing, VectorWare has achieved a significant milestone by successfully implementing Rust's Future trait and async/await syntax directly on GPUs. This breakthrough represents not just a technical accomplishment but a potential paradigm shift in how developers approach GPU programming, bridging the gap between traditional GPU paradigms and modern concurrency models.

The Evolution of GPU Programming

GPU programming has traditionally followed a data-parallel model where developers write a single operation that the GPU executes in parallel across different data segments. This approach works exceptionally well for uniform tasks like graphics rendering, matrix multiplication, and image processing. However, as GPU applications grow more sophisticated, developers have increasingly turned to warp specialization to introduce more complex control flow and dynamic behavior.
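To make the contrast concrete, here is a rough sketch of that traditional model in Rust. The `thread_index` function is a hypothetical stand-in for whatever thread-id intrinsic a given GPU toolchain exposes; every GPU thread would run the same body on its own element.

```rust
// Hypothetical stand-in for a toolchain-specific thread-index intrinsic.
fn thread_index() -> usize {
    0
}

// Traditional data-parallel kernel: the same body runs on every GPU
// thread, distinguished only by its index, one element per thread.
fn saxpy_kernel(a: f32, x: &[f32], y: &mut [f32]) {
    let i = thread_index();
    if i < x.len() && i < y.len() {
        y[i] = a * x[i] + y[i];
    }
}
```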

With warp specialization, different parts of the GPU can run different parts of the program concurrently. For example, one warp might load data from memory while another performs computations, improving utilization of both compute and memory resources. This added expressiveness comes at a cost: developers must manually manage concurrency and synchronization, a task that is notoriously error-prone and difficult to reason about, similar to threading challenges on CPUs.
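In sketch form, warp specialization amounts to branching on a warp's identity and hand-sequencing the roles. The `warp_id` and `barrier_wait` functions below are hypothetical stand-ins for hardware intrinsics, used only to show the shape of the manual coordination involved:

```rust
// Hypothetical stand-ins for warp-identity and barrier intrinsics.
fn warp_id() -> usize { 0 }
fn barrier_wait() {}

// Warp specialization in sketch form: warps branch on their id to take
// on different roles, and the programmer sequences them by hand.
fn specialized_kernel(input: &[f32], staging: &mut [f32], out: &mut [f32]) {
    if warp_id() == 0 {
        // Producer warp: stream data from global memory into staging.
        let n = staging.len().min(input.len());
        staging[..n].copy_from_slice(&input[..n]);
        barrier_wait(); // signal that staged data is ready
    } else {
        barrier_wait(); // consumers must not run ahead of the producer
        for (o, s) in out.iter_mut().zip(staging.iter()) {
            *o = s * 2.0; // consumer warps compute on staged data
        }
    }
}
```

Getting these hand-written synchronization points wrong means races or deadlocks, which is exactly the error-proneness described above.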

Current Approaches and Their Limitations

Several projects have attempted to provide the benefits of warp specialization without the associated complexity. JAX models GPU programs as computation graphs that encode dependencies between operations, allowing its compiler to analyze and optimize execution. Triton expresses computation in terms of blocks that execute independently on the GPU, using a Python-based DSL to define execution patterns. More recently, NVIDIA introduced CUDA Tile, which organizes computation around blocks and introduces "tiles" as first-class units of data to make dependencies explicit.

While these approaches offer valuable abstractions, they share common limitations. They require developers to structure code in new and specific ways, and they compose poorly with existing CPU code and traditional programming models. Most importantly, each represents an entirely new programming paradigm and ecosystem, which significantly hinders adoption: virtually no one writes entire applications with these technologies. Instead, developers use them for specific components while keeping traditional approaches for the rest of the codebase.

Rust's Async/Await as a Unifying Abstraction

VectorWare's approach leverages Rust's Future trait and async/await syntax as a potential solution to these challenges. Futures represent computations that may not be complete yet, without specifying whether they run on threads, cores, blocks, tiles, or warps. This hardware-agnostic nature allows the same async code to execute in different environments.

The Future trait itself is intentionally minimal; its core operation, poll, returns either Ready or Pending. Because the trait defines what a computation produces without prescribing how or where it is driven, the same async code can run in different environments. Like JAX's computation graphs, futures are deferred and composable, allowing developers to construct programs as values before executing them. This lets compilers analyze dependencies ahead of time while preserving the shape of user code.
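For reference, the trait as defined in Rust's standard library is just this:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

// The entire interface: poll either completes with Ready(output) or
// returns Pending, after arranging (via the Context's Waker) to be
// polled again when progress is possible.
pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}
```

Nothing in this signature mentions threads, warps, or blocks, which is what leaves the execution environment open.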

Rust's ownership model also plays a crucial role in making data constraints explicit in the program structure. Futures capture the data they operate on, and that captured state becomes part of the compiler-generated state machine. Ownership, borrowing, Pin, and bounds such as Send and Sync encode how data can be shared and transferred between concurrent units of work. In effect, warp specialization reduces to manually written task state machines, while futures compile to state machines that the Rust compiler generates and manages automatically.
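A small example of what that means in practice: the future returned by an async fn owns its captured data, and standard bounds describe whether it may move between units of execution. This is a minimal illustration, not VectorWare's code:

```rust
// The compiler turns this body into a state machine; `buf` is moved in
// and stored inside that machine, surviving across the await point.
async fn scale_buffer(mut buf: Vec<f32>, scale: f32) -> Vec<f32> {
    std::future::ready(()).await; // an await point in the state machine
    for v in buf.iter_mut() {
        *v *= scale;
    }
    buf // ownership flows out when the future completes
}

// A future is `Send` only if everything it captures and holds across
// awaits is `Send`; that bound is how transferability is made explicit.
fn assert_send<T: Send>(_: T) {}

fn main() {
    assert_send(scale_buffer(vec![1.0, 2.0, 3.0], 2.0));
}
```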

Technical Implementation and Challenges

Implementing async/await on GPU required addressing significant technical challenges. VectorWare had to fix bugs and close gaps across multiple compiler backends and encountered issues in NVIDIA's ptxas tool, which they reported and worked around. Their initial implementation used a simple block_on executor, which takes a single future and drives it to completion by repeatedly polling it on the current thread.
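The shape of such an executor is small enough to sketch in full. The following is a generic minimal block_on, assuming a recent Rust toolchain with `std::task::Waker::noop` and the `pin!` macro; it is not VectorWare's implementation, but it matches the description above:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, Waker};

// Minimal block_on: pin the future on the stack and poll it on the
// current thread until it reports Ready. A no-op waker suffices because
// we poll in a loop instead of sleeping between wakes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let mut cx = Context::from_waker(Waker::noop());
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::hint::spin_loop(); // busy-wait: no interrupts to park on
    }
}

fn main() {
    assert_eq!(block_on(async { 40 + 2 }), 42);
}
```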

Once they had futures working end-to-end, they moved to a more capable executor: Embassy, designed for embedded systems and operating in Rust's #![no_std] environment. This made it a natural fit for GPUs, which lack a traditional operating system. Adapting Embassy to run on the GPU required very few changes, demonstrating the power of reusing existing open-source libraries—a significant advantage over other GPU ecosystems.

VectorWare demonstrated their implementation with concurrent task scheduling on the GPU: three independent async tasks that loop indefinitely and increment counters in shared state. An Asciinema recording of these tasks executing concurrently on the GPU via Embassy's executor demonstrates the viability of the approach, though the recorded performance is not representative given the nature of the test workload.
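The published details don't include source, but the demo's shape can be reconstructed as an ordinary Embassy program. The sketch below uses Embassy's task and executor APIs as they appear in embedded and host settings; how VectorWare wires the executor to the GPU is their own adaptation and is assumed here:

```rust
use core::sync::atomic::{AtomicU64, Ordering};
use embassy_executor::Spawner;

// Shared state: one counter per task.
static COUNTERS: [AtomicU64; 3] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

// Three instances of this task run concurrently, each looping forever,
// bumping its counter and cooperatively yielding back to the executor.
#[embassy_executor::task(pool_size = 3)]
async fn count(id: usize) {
    loop {
        COUNTERS[id].fetch_add(1, Ordering::Relaxed);
        embassy_futures::yield_now().await; // let the other tasks run
    }
}

#[embassy_executor::main]
async fn main(spawner: Spawner) {
    for id in 0..3 {
        spawner.spawn(count(id)).unwrap();
    }
}
```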

Advantages and Implications

This approach offers several compelling advantages. It provides ergonomic concurrency without requiring a new language or ecosystem. It composes with existing CPU code and execution models, addressing a significant limitation of other approaches. It offers fine-grained control when needed, similar to warp specialization, while providing ergonomic defaults for common cases.

The implications of this work extend beyond technical implementation. By leveraging familiar abstractions, this approach could significantly lower the barrier to entry for GPU programming, making it accessible to a broader range of developers. It could enable better code reuse between CPU and GPU components of applications, potentially leading to more maintainable codebases. Additionally, this work could spur the development of GPU-native executors and libraries in Rust, similar to how CPU executors like Tokio emerged in the Rust ecosystem.

Challenges and Limitations

Despite its promise, this approach faces several challenges. Futures are cooperative, meaning if a future doesn't yield, it can starve other work and degrade performance—a problem not unique to GPUs but exacerbated by their architecture. GPUs don't provide interrupts, requiring executors to use polling mechanisms or spin loops, which are less efficient than interrupt-driven execution.
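A hand-rolled yield point makes the cooperative contract visible. This is a standard pattern, not VectorWare-specific: the future returns Pending exactly once, immediately requesting a re-poll, so the executor gets a window to run other tasks:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Yields to the executor once: the first poll wakes itself and returns
// Pending; the second poll completes.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // ask to be polled again promptly
            Poll::Pending
        }
    }
}

// Without the yield inside the loop, this task would never return
// Pending and would starve every other task on the same executor.
async fn hot_loop() {
    loop {
        // ... do a bounded chunk of work ...
        YieldNow { yielded: false }.await;
    }
}
```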

Driving futures and maintaining scheduling state also increases register pressure on GPUs, which can reduce occupancy and impact performance. Additionally, Rust's async model on the GPU still carries the same function coloring problem that exists on the CPU, where functions must be either entirely async or entirely synchronous.

Future Directions

VectorWare is actively exploring several avenues for future development. They're experimenting with GPU-native executors designed specifically around GPU hardware characteristics, potentially leveraging mechanisms such as CUDA Graphs or CUDA Tile for efficient task scheduling. They've also recently enabled std on the GPU, which opens the door to richer runtimes and tighter integration with existing Rust async libraries.

While focused on Rust for now, VectorWare acknowledges that not everyone uses Rust and plans to support multiple programming languages and runtimes in future products. However, they believe Rust is uniquely well-suited to building high-performance, reliable GPU-native applications, which remains their primary focus.

The work by VectorWare represents a significant step forward in GPU programming. By bringing the familiar and powerful async/await model to GPUs, they're not just solving a technical problem—they're potentially reshaping how developers approach parallel computing, making sophisticated GPU programming more accessible, maintainable, and productive.
