A deep dive into BarraCUDA, an ambitious open-source project that compiles CUDA code directly to AMD GPUs without LLVM dependencies.
In the competitive world of GPU computing, where vendor-specific ecosystems often create walled gardens, an intriguing new project has emerged: BarraCUDA. This open-source compiler, built from the ground up by a single developer, takes CUDA C source code and compiles it directly to AMD GPU machine code—without relying on LLVM or any translation layers. The result is a fascinating technical achievement that could have significant implications for cross-platform GPU development.
The Challenge: Breaking NVIDIA's CUDA Monopoly
CUDA has long been the dominant platform for GPU computing, but it's intrinsically tied to NVIDIA hardware. While AMD offers competitive GPUs, developers have historically faced a choice: rewrite their CUDA code using HIP (AMD's CUDA-like API), or stick with NVIDIA hardware. BarraCUDA presents a third option: compile existing CUDA code directly to AMD GPUs without modification.
The project's creator describes it as "what happens when you look at NVIDIA's walled garden and think 'how hard can it be?'" The answer, as it turns out, is "quite hard, actually." Building a compiler is complex work, but building a compiler that targets a specific GPU architecture without the usual compiler infrastructure tools like LLVM is an even more ambitious undertaking.
Technical Architecture: A Complete Toolchain in 15,000 Lines
BarraCUDA is written entirely in C99, consisting of approximately 15,000 lines of code with zero LLVM dependencies. The codebase is structured as a complete compiler toolchain with the following components:
- Lexer (747 lines): Tokenizes CUDA C source code
- Preprocessor (1,370 lines): Handles #include, #define, macros, and conditionals
- Parser (1,500 lines): Recursive descent parser that builds an AST
- Semantic analyzer (1,725 lines): Type checking, scope resolution, and overload resolution
- BIR (BarraCUDA IR) (3,032 lines): SSA-form intermediate representation
- Memory-to-register promotion (965 lines): Transforms stack allocas to SSA registers
- Instruction selector (1,788 lines): Maps BIR to AMDGPU machine instructions
- Code emitter (1,735 lines): Register allocation, GFX11 encoding, and ELF emission
- CLI driver (317 lines): Command-line interface
The compiler pipeline follows a traditional structure: source code undergoes preprocessing, lexical analysis, parsing, semantic analysis, and then transformation through several intermediate representations before finally generating machine code for AMD's GFX11 architecture (RDNA 3).
The BarraCUDA IR: A Target-Independent Intermediate Representation
A key design decision in BarraCUDA is the creation of a custom intermediate representation called BIR (BarraCUDA IR). This SSA-form IR serves as a bridge between the high-level CUDA semantics and the low-level AMDGPU machine instructions.
The IR is target-independent, which is a forward-looking design choice that enables future expansion to other architectures. The project already mentions Tenstorrent (RISC-V based AI accelerators) and Intel Arc (Xe architecture) as potential targets.
Instruction Selection: The Hardest Part
According to the project documentation, the roughly 1,800 lines of hand-written instruction selection code form one of the most complex parts of the compiler. This is where the BIR gets mapped to AMDGPU machine instructions, a process that requires deep understanding of the AMDGPU ISA.
The creator notes that AMD's ISA documentation presents several challenges:
- SOP1 prefix is 0xBE800000, not what you'd expect from the docs
- SOPC prefix is 0xBF000000
- VOP3 VDST is at bits [7:0], not [15:8] as one might assume
- Null SADDR is 0x7C for global memory, 0xFC for scratch
- RDNA 3 defaults to Wave32, not the Wave64 of the older GCN architectures
These kinds of implementation details make backend compiler development particularly challenging, as they often require reverse-engineering behavior rather than following published specifications.
Current Capabilities and Limitations
BarraCUDA already supports a significant subset of CUDA features:
Core language features:
- Function qualifiers (__global__, __device__, __host__)
- Built-in variables (threadIdx, blockIdx, blockDim, gridDim)
- Structs, enums, typedefs, namespaces
- Pointers, arrays, pointer arithmetic
- All C control flow constructs
- Short-circuit operators and ternary expressions
- Templates (basic instantiation)
CUDA-specific features:
- __shared__ memory (properly allocated from LDS)
- __syncthreads() barrier
- Atomic operations (atomicAdd, atomicSub, etc.)
- Warp intrinsics (__shfl_sync variants)
- Warp votes (__ballot_sync, __any_sync, __all_sync)
- Vector types with component access
- Half precision support (__half, conversion functions)
- __launch_bounds__ with VGPR cap enforcement
- Cooperative groups
However, several features are still pending:
- Compound assignment operators (+=, -=, etc.)
- Bare 'unsigned' type specifier
- const qualifier
- __constant__ memory
- 2D array declarations in shared memory
- Integer literal suffixes
- Parameter reassignment in device functions
- Textures and surfaces
- Dynamic parallelism
- Multiple translation units
- Host code generation
The project maintains a test suite with 14 test files covering 35+ kernels and approximately 27,000 bytes of machine code, demonstrating that the compiler can handle real-world CUDA code patterns.
Performance and Optimization Potential
While BarraCUDA can generate working code, the creator acknowledges that the generated code "isn't winning any benchmarks." The current implementation prioritizes correctness over optimization, with several areas identified for future improvement:
- Instruction scheduling (to hide memory latency)
- Better register allocation (the current linear scan could be replaced with graph coloring)
- Constant folding and dead code elimination
- Loop-invariant code motion
- Occupancy tuning based on register pressure
These optimizations would be implemented in the BIR passes before instruction selection, making them applicable to any target architecture that BarraCUDA might support in the future.
Building Without Dependencies: A Philosophy of Self-Reliance
BarraCUDA is built with a philosophy of minimal external dependencies. The entire project uses only the standard C99 library and can be built with a simple make command. There is no CMake, autoconf, or other complex build system; just a straightforward Makefile.
This approach has benefits and drawbacks. On one hand, it makes the build process incredibly simple and portable. On the other hand, it means the project reimplements functionality that could be provided by specialized libraries like LLVM.
The creator notes that all data structures use pre-allocated fixed-size arrays, with no malloc in hot paths and no recursion in the compiler itself. This trades flexibility for predictability and performance: memory behavior is deterministic and allocation cost is near zero, though fixed capacities put a hard ceiling on the size of input programs.
The Significance: What BarraCUDA Means for GPU Computing
BarraCUDA represents several important developments in the GPU computing ecosystem:
Vendor Neutrality: By providing a path to compile CUDA code for AMD hardware without vendor lock-in, BarraCUDA could help break down the silos that have characterized GPU computing.
Compiler Education: The project serves as an excellent case study in compiler construction, with a clean architecture that could help developers understand how compilers work.
Performance Potential: Direct compilation to AMDGPU machine code, without intermediate translation layers, could potentially offer better performance than HIP-based approaches.
Open-Source Alternative: As an Apache 2.0 licensed project, BarraCUDA provides an open-source alternative to vendor-specific tools.
Future Roadmap
The project outlines a clear roadmap for development:
- Near term: Fixing known gaps like compound assignment operators, bare unsigned types, and other language features
- Medium term: Implementing optimizations to improve generated code quality
- Long term: Expanding to support additional architectures like Tenstorrent and Intel GPUs
The creator also mentions that the IR is designed to be target-independent, which means adding new targets would primarily require implementing new instruction selection and code emission modules.
Conclusion: A Remarkable Technical Achievement
BarraCUDA is more than just a compiler—it's a testament to what a single developer with deep technical knowledge can accomplish. The project demonstrates that it's possible to build a complete CUDA-compatible compiler from scratch, even for complex GPU architectures.
While the project is still in development and doesn't yet support all CUDA features, it already works for many common patterns and provides a solid foundation for future expansion. For developers working with AMD GPUs, BarraCUDA could eventually provide a valuable tool for porting existing CUDA code without modification.
Perhaps most importantly, BarraCUDA challenges the notion that GPU computing must be locked behind vendor-specific ecosystems. By providing a path to compile CUDA code directly to AMD hardware, the project opens up new possibilities for cross-platform GPU development.
For those interested in exploring the project, the GitHub repository contains the complete source code, documentation, and test suite. The project welcomes contributions and feedback from the developer community.

Featured: BarraCUDA's compiler pipeline diagram showing the transformation from CUDA source code to AMD GPU binaries.
