
cuda-oxide: Compile Idiomatic Rust Directly to CUDA PTX Without DSLs or Bindings

Tech Essays Reporter

NVIDIA's NVlabs has released cuda-oxide, an experimental Rust-to-CUDA compiler backend that lets developers write safe, idiomatic GPU kernels in pure Rust and compile them directly to PTX, with no domain-specific languages or foreign bindings. Early benchmarks show up to 58% of cuBLAS performance for GEMM workloads on B200 hardware.

The persistent friction between Rust's growing adoption in systems programming and the dominance of CUDA C++ for GPU kernel development has left developers with unsatisfying options: accept the memory-safety risks and syntactic complexity of C++ for kernel code, or adopt Rust-based GPU frameworks that require learning domain-specific languages, managing fragile foreign-function bindings, or giving up core Rust features like generics and closures. NVIDIA's NVlabs has open-sourced cuda-oxide, an experimental custom rustc backend that resolves this tension by compiling standard, idiomatic Rust code directly to CUDA PTX: no DSLs, no external language bindings, and full integration with Cargo and the native Rust toolchain.


The core promise of cuda-oxide is that it treats Rust as a first-class language for CUDA kernel development rather than an afterthought requiring wrappers or translation layers: it hooks directly into the Rust compiler to produce GPU-executable code from the same source files that contain host logic. This approach preserves Rust's safety guarantees for device code, supports nearly all of Rust's core language features, and fits naturally into existing Rust development workflows, while targeting NVIDIA GPU hardware from older architectures through the latest Hopper and Blackwell chips.

The Compilation Pipeline and Tooling

cuda-oxide operates as a custom rustc codegen backend, which means it integrates with the standard Rust compilation process rather than acting as a separate transpiler. The full pipeline flows from standard Rust source code to final PTX: the Rust compiler first generates its native Mid-Level Intermediate Representation (MIR) from the source, which the mir-importer crate translates into Pliron IR, an MLIR-like intermediate representation framework written entirely in Rust. Pliron uses distinct dialects to model Rust MIR, LLVM IR, and NVVM intrinsics, allowing the pipeline to lower code through dialect-mir to dialect-llvm, then export to standard LLVM IR, and finally compile to PTX using the llc tool from LLVM 21 or newer. This Rust-native pipeline means the entire compiler stack is accessible to Rust contributors, and it avoids the fragility of maintaining a separate parser or language subset for kernel code.
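The last stage of that pipeline can be pictured as an ordinary llc invocation. The flags and filenames below are illustrative (the actual target architecture and file naming are chosen by the build), but they show the shape of the step: standard LLVM IR in, PTX out, via the NVPTX backend.

```shell
# Illustrative final lowering step: once the pipeline has exported
# standard LLVM IR (here kernel.ll), llc from LLVM 21+ with the NVPTX
# backend compiles it to PTX. Target (-mcpu) and filenames are examples.
llc -march=nvptx64 -mcpu=sm_90a -O3 kernel.ll -o kernel.ptx
```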

The project includes cargo-oxide, a Cargo subcommand that wraps the entire build process, so developers can build, run, and debug kernels with familiar commands: cargo oxide run vecadd compiles both host and device code, links against the CUDA runtime, and launches the kernel on a connected GPU. The cargo oxide pipeline command exposes each step of the compilation process, showing the MIR, Pliron IR, LLVM IR, and final PTX for any example, which is invaluable for debugging or understanding how Rust code maps to GPU instructions. For low-level debugging, cargo oxide debug integrates with cuda-gdb, allowing developers to step through kernel code as they would any other Rust program.

A major differentiator for cuda-oxide is single-source compilation: host and device code live in the same file, with device functions marked with the #[kernel] attribute. The project's quick start example demonstrates this clearly, defining a generic map function that applies a closure to each element of an input array, with the kernel and host logic (including context creation, memory allocation, and kernel launching) all in one file. The cuda_launch! macro handles passing arguments to the kernel, including closures with captures from the host scope, which the compiler automatically scalarizes and passes as PTX parameters without manual intervention.
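A minimal sketch of this single-source style, based on the elements the project documents (the #[kernel] attribute, the cuda_launch! macro, and the generic map example): the thread-indexing intrinsic and the buffer/context helpers shown here are illustrative placeholders, not the actual cuda-oxide APIs, and the snippet is not meant to compile as-is.

```rust
// Hypothetical single-source sketch; #[kernel] and cuda_launch! are the
// documented entry points, everything else is an illustrative placeholder.
#[kernel]
fn map<T: Copy, F: Fn(T) -> T>(input: &[T], output: &mut [T], f: F) {
    let i = thread_idx_x(); // placeholder for the thread-indexing intrinsic
    if i < input.len() {
        output[i] = f(input[i]);
    }
}

fn main() {
    let factor = 2.0f32;
    // Host side, in the same file: allocate device buffers and launch.
    // (Buffer constructors are placeholders.)
    let input = device_buffer_from(&[1.0f32, 2.0, 3.0]);
    let mut output = device_buffer_zeroed(3);
    // The closure captures `factor` from host scope; per the project's
    // description, the compiler scalarizes it into PTX parameters.
    cuda_launch!(map(&input, &mut output, move |x: f32| x * factor));
}
```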

Language Support and Safety Features

cuda-oxide prioritizes support for idiomatic Rust, rather than forcing developers to learn a new syntax for GPU code. It supports generics with monomorphization, so the map example can work with any Copy type and any closure that implements Fn(T) -> T, with rustc generating concrete kernel variants at compile time. Closures with captures are fully supported: the example passes a move |x: f32| x * factor closure to the kernel, where factor is a host variable, and the compiler handles capturing and passing that value to the GPU automatically. User-defined structs, enums, and pattern matching are all supported on the device side, as are Rust's standard control flow constructs.
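The compile-time mechanism at work here is ordinary Rust monomorphization, which can be shown on the CPU without any GPU involvement. The sketch below is a plain-Rust analogue of the generic map pattern the article describes, not cuda-oxide's actual kernel: rustc generates a concrete variant of the function for each (type, closure) pair, which is the same machinery cuda-oxide reuses to emit one PTX kernel per instantiation.

```rust
// CPU analogue of the generic `map` pattern: generic over any Copy type
// and any Fn(T) -> T closure. rustc monomorphizes a concrete variant per
// instantiation at compile time.
fn map<T: Copy, F: Fn(T) -> T>(input: &[T], f: F) -> Vec<T> {
    input.iter().map(|&x| f(x)).collect()
}

fn main() {
    // Closure capturing a host-scope variable, mirroring the article's
    // `move |x: f32| x * factor` example.
    let factor = 3.0f32;
    let out = map(&[1.0f32, 2.0, 4.0], move |x| x * factor);
    assert_eq!(out, vec![3.0, 6.0, 12.0]);
}
```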

For GPU-specific functionality, the crate ecosystem provides safe abstractions: cuda-device includes intrinsics for thread indexing, warp-level operations, shared memory, scoped atomics, barriers, tensor memory acceleration (TMA), and thread block clusters, all with type-safe APIs that prevent common GPU programming errors like out-of-bounds indexing or incorrect atomic scoping. The cuda-core crate provides RAII wrappers for CUDA contexts, streams, and device buffers, eliminating the need to manually manage raw CUDA handles and reducing memory leaks. For async workflows, cuda-async introduces DeviceOperation and DeviceFuture types that integrate with Rust's async ecosystem, allowing developers to chain GPU operations across streams and await their completion, as shown in the async_mlp example that pipelines GEMM, matrix-vector multiplication, and ReLU activation across concurrent streams.
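The RAII idiom behind cuda-core's wrappers is standard Rust and can be sketched without CUDA at all. In the plain-Rust analogue below, the names are illustrative (this is not the cuda-core API): a wrapper type releases its underlying resource in Drop, so the "device buffer" cannot leak once it goes out of scope; a real wrapper would call the CUDA free routine in the same place.

```rust
// Plain-Rust sketch of the RAII pattern described for cuda-core.
// `DeviceBuffer` and the release flag are illustrative stand-ins,
// not the actual cuda-core types.
use std::sync::atomic::{AtomicBool, Ordering};

// Observes whether the "device allocation" was released.
static RELEASED: AtomicBool = AtomicBool::new(false);

struct DeviceBuffer {
    data: Vec<f32>, // stands in for a raw device allocation handle
}

impl DeviceBuffer {
    fn new(len: usize) -> Self {
        DeviceBuffer { data: vec![0.0; len] }
    }
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        // A real wrapper would free the device memory here.
        RELEASED.store(true, Ordering::SeqCst);
    }
}

fn main() {
    {
        let buf = DeviceBuffer::new(16);
        assert_eq!(buf.data.len(), 16);
    } // buf dropped here; the "device memory" is released automatically
    assert!(RELEASED.load(Ordering::SeqCst));
}
```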

The project's support for advanced CUDA features is notable for an experimental compiler: it includes intrinsics for Hopper and Blackwell architectures, including TMA, tcgen05 MMA instructions, and cluster-level communication via DSMEM. The gemm_sol example demonstrates this with a GEMM implementation that achieves 868 TFLOPS on a B200 GPU, roughly 58% of the performance of NVIDIA's optimized cuBLAS library, using 8 kernels across 4 phases with cta_group::2, CLC, and a 4-stage pipeline. Additional examples include atomic operation tests, FFT integration via cuFFTDx, block-level GEMM via cuBLASDx, and cross-crate kernel definitions, with 46 total examples in the crates/rustc-codegen-cuda/examples/ directory.

Ecosystem Integration

cuda-oxide is one of several active Rust GPU projects, each targeting different use cases: some focus on Vulkan/SPIR-V for graphics, others on implicit offload via LLVM, and others on safe driver bindings. The NVlabs team notes that they are collaborating with maintainers across the broader Rust GPU community to align efforts and advance GPU support in Rust as a whole, rather than building a siloed tool. The project's ecosystem appendix in the work-in-progress cuda-oxide book details how it compares to other projects, noting that its explicit focus on CUDA and NVIDIA hardware makes it a targeted tool for developers already working in the NVIDIA ecosystem.

Implications for Rust and GPU Development

If cuda-oxide matures beyond its current alpha status, it could significantly lower the barrier to entry for Rust developers to write GPU kernels, as it requires no prior knowledge of CUDA C++ or GPU-specific DSLs. By bringing Rust's memory safety guarantees to GPU code, it could reduce the prevalence of subtle, hard-to-debug errors like out-of-bounds memory access, race conditions in shared memory, and incorrect atomic ordering, which are common in C++ CUDA code. Its integration with standard Rust tooling means it fits into existing CI pipelines, testing workflows, and dependency management systems, unlike standalone GPU frameworks that require custom build steps.

For NVIDIA, the project expands the developer ecosystem for its hardware, making Rust a viable alternative to C++ for teams that prioritize safety and Rust's modern language features. The early performance results suggest that the approach does not require sacrificing runtime efficiency for developer ergonomics, which is critical for adoption in high-performance computing and machine learning workloads where GPU performance is paramount. The support for LTOIR generation for Blackwell and newer architectures also ensures that the project can keep pace with new NVIDIA hardware features, unlike older frameworks that may lag behind vendor releases.

More broadly, cuda-oxide demonstrates the flexibility of Rust's compiler architecture: by allowing custom codegen backends, the Rust project enables experiments like this that target entirely new platforms without forking the compiler or maintaining a separate language implementation. This could pave the way for similar efforts targeting other GPU vendors or accelerator architectures, further expanding Rust's reach in systems programming.

Limitations and Current Status

It is critical to note that cuda-oxide is in an early alpha stage, with the team explicitly warning of bugs, incomplete features, and API breakage as development progresses. The project is not ready for production use, and developers should expect to encounter edge cases where Rust features are not yet supported, or where generated PTX has correctness or performance issues. The documentation is a work in progress, with the Sphinx-based cuda-oxide book still under active development, so learning resources are currently limited to example code and source comments.

The dependency requirements are steep, which may limit adoption for some developers. The project requires Rust nightly (pinned to a specific nightly version in rust-toolchain.toml), as it relies on unstable rustc internals to implement the custom codegen backend. CUDA Toolkit 12.x or newer is required, along with LLVM 21 or newer with the NVPTX backend enabled, as older LLVM versions do not support the intrinsics needed for Hopper and Blackwell architectures. The project is currently only tested on Linux (Ubuntu 24.04), with no official support for Windows or macOS, and the host cuda-bindings crate requires Clang and libclang development headers to build, adding another layer of system dependencies.

There are also scope limitations: cuda-oxide targets only NVIDIA CUDA hardware, so it is not a cross-platform GPU solution for developers targeting AMD or Intel GPUs. For teams that need graphics support via Vulkan or SPIR-V, other Rust GPU projects are more appropriate. The licensing model may also be a consideration for some users: the cuda-bindings crate is licensed under the NVIDIA Software License, while all other crates use the Apache License 2.0, which could conflict with projects that have strict licensing requirements for dependencies.

While the 58% cuBLAS performance number is impressive for an experimental project, it is still significantly behind the vendor-optimized library, and it is unclear how performance will scale for more complex, real-world workloads beyond the curated examples. The use of Rust nightly also means that the project is subject to breaking changes in the Rust compiler itself, which could require frequent updates to the codegen backend to keep pace with upstream rustc changes.
