A deep dive into BarraCUDA, an ambitious open-source project that compiles CUDA code directly to AMD GPUs without LLVM dependencies.
In the competitive world of GPU computing, where vendor-specific ecosystems often create walled gardens, an intriguing new project has emerged: BarraCUDA. This open-source compiler, built from the ground up by a single developer, takes CUDA C source code and compiles it directly to AMD GPU machine code—without relying on LLVM or any translation layers. The result is a fascinating technical achievement that could have significant implications for cross-platform GPU development.
The Challenge: Breaking NVIDIA's CUDA Monopoly
CUDA has long been the dominant platform for GPU computing, but it's intrinsically tied to NVIDIA hardware. While AMD offers competitive GPUs, developers have historically faced a choice: rewrite their CUDA code using HIP (AMD's CUDA-like API), or stick with NVIDIA hardware. BarraCUDA presents a third option: compile existing CUDA code directly to AMD GPUs without modification.
The project's creator describes it as "what happens when you look at NVIDIA's walled garden and think 'how hard can it be?'" The answer, as it turns out, is "quite hard, actually." Building a compiler is complex work, but building a compiler that targets a specific GPU architecture without the usual compiler infrastructure tools like LLVM is an even more ambitious undertaking.
Technical Architecture: A Complete Toolchain in 15,000 Lines
BarraCUDA is written entirely in C99, consisting of approximately 15,000 lines of code with zero LLVM dependencies. The codebase is structured as a complete compiler toolchain with the following components:
- Lexer (747 lines): Tokenizes CUDA C source code
- Preprocessor (1,370 lines): Handles #include, #define, macros, and conditionals
- Parser (1,500 lines): Recursive descent parser that builds an AST
- Semantic analyzer (1,725 lines): Type checking, scope resolution, and overload resolution
- BIR (BarraCUDA IR) (3,032 lines): SSA-form intermediate representation
- Memory-to-register promotion (965 lines): Transforms stack allocas to SSA registers
- Instruction selector (1,788 lines): Maps BIR to AMDGPU machine instructions
- Code emitter (1,735 lines): Register allocation, GFX11 encoding, and ELF emission
- CLI driver (317 lines): Command-line interface
The compiler pipeline follows a traditional structure: source code undergoes preprocessing, lexical analysis, parsing, semantic analysis, and then transformation through several intermediate representations before finally generating machine code for AMD's GFX11 architecture (RDNA 3).
The BarraCUDA IR: A Target-Independent Intermediate Representation
A key design decision in BarraCUDA is the creation of a custom intermediate representation called BIR (BarraCUDA IR). This SSA-form IR serves as a bridge between the high-level CUDA semantics and the low-level AMDGPU machine instructions.
The IR is target-independent, which is a forward-looking design choice that enables future expansion to other architectures. The project already mentions Tenstorrent (RISC-V based AI accelerators) and Intel Arc (Xe architecture) as potential targets.
Instruction Selection: The Hardest Part
According to the project documentation, the roughly 1,800 lines of hand-written instruction selection code form one of the most complex parts of the compiler. This is where the BIR gets mapped to AMDGPU machine instructions, a process that requires deep understanding of the AMDGPU ISA.
The creator notes that AMD's ISA documentation presents several challenges:
- SOP1 prefix is 0xBE800000, not what you'd expect from the docs
- SOPC prefix is 0xBF000000
- VOP3 VDST is at bits [7:0], not [15:8] as one might assume
- Null SADDR is 0x7C for global memory, 0xFC for scratch
- RDNA 3 defaults to Wave32, not the Wave64 of the older GCN architectures
These kinds of implementation details make backend compiler development particularly challenging, as they often require reverse-engineering behavior rather than following published specifications.
Current Capabilities and Limitations
BarraCUDA already supports a significant subset of CUDA features:
Core language features:
- Function qualifiers (__global__, __device__, __host__)
- Built-in variables (threadIdx, blockIdx, blockDim, gridDim)
- Structs, enums, typedefs, namespaces
- Pointers, arrays, pointer arithmetic
- All C control flow constructs
- Short-circuit operators and ternary expressions
- Templates (basic instantiation)
CUDA-specific features:
- __shared__ memory (properly allocated from LDS)
- __syncthreads() barrier
- Atomic operations (atomicAdd, atomicSub, etc.)
- Warp intrinsics (__shfl_sync variants)
- Warp votes (__ballot_sync, __any_sync, __all_sync)
- Vector types with component access
- Half precision support (__half, conversion functions)
- __launch_bounds__ with VGPR cap enforcement
- Cooperative groups
However, several features are still pending:
- Compound assignment operators (+=, -=, etc.)
- Bare 'unsigned' type specifier
- const qualifier
- __constant__ memory
- 2D array declarations in shared memory
- Integer literal suffixes
- Parameter reassignment in device functions
- Textures and surfaces
- Dynamic parallelism
- Multiple translation units
- Host code generation
The project maintains a test suite with 14 test files covering 35+ kernels and approximately 27,000 bytes of machine code, demonstrating that the compiler can handle real-world CUDA code patterns.
Performance and Optimization Potential
While BarraCUDA can generate working code, the creator acknowledges that the generated code "isn't winning any benchmarks." The current implementation prioritizes correctness over optimization, with several areas identified for future improvement:
- Instruction scheduling (to hide memory latency)
- Better register allocation (the current linear scan could be replaced with graph coloring)
- Constant folding and dead code elimination
- Loop-invariant code motion
- Occupancy tuning based on register pressure
These optimizations would be implemented in the BIR passes before instruction selection, making them applicable to any target architecture that BarraCUDA might support in the future.
Building Without Dependencies: A Philosophy of Self-Reliance
BarraCUDA is built with a philosophy of minimal external dependencies. The entire project uses only the standard C99 library and can be built with a simple make command. There is no CMake, autoconf, or other complex build system; just a straightforward Makefile.
This approach has benefits and drawbacks. On one hand, it makes the build process incredibly simple and portable. On the other hand, it means the project reimplements functionality that could be provided by specialized libraries like LLVM.
The creator notes that all data structures use pre-allocated fixed-size arrays, with no malloc in hot paths and no recursion in the compiler itself. This trades flexibility for predictability and performance: memory behavior is deterministic and allocation cost is near zero, though fixed capacities put a hard ceiling on the size of input programs.
The Significance: What BarraCUDA Means for GPU Computing
BarraCUDA represents several important developments in the GPU computing ecosystem:
Vendor Neutrality: By providing a path to compile CUDA code for AMD hardware without vendor lock-in, BarraCUDA could help break down the silos that have characterized GPU computing.
Compiler Education: The project serves as an excellent case study in compiler construction, with a clean architecture that could help developers understand how compilers work.
Performance Potential: Direct compilation to AMDGPU machine code, without intermediate translation layers, could potentially offer better performance than HIP-based approaches.
Open-Source Alternative: As an Apache 2.0 licensed project, BarraCUDA provides an open-source alternative to vendor-specific tools.
Future Roadmap
The project outlines a clear roadmap for development:
- Near term: Fixing known gaps like compound assignment operators, bare unsigned types, and other language features
- Medium term: Implementing optimizations to improve generated code quality
- Long term: Expanding to support additional architectures like Tenstorrent and Intel GPUs
The creator also mentions that the IR is designed to be target-independent, which means adding new targets would primarily require implementing new instruction selection and code emission modules.
Conclusion: A Remarkable Technical Achievement
BarraCUDA is more than just a compiler—it's a testament to what a single developer with deep technical knowledge can accomplish. The project demonstrates that it's possible to build a complete CUDA-compatible compiler from scratch, even for complex GPU architectures.
While the project is still in development and doesn't yet support all CUDA features, it already works for many common patterns and provides a solid foundation for future expansion. For developers working with AMD GPUs, BarraCUDA could eventually provide a valuable tool for porting existing CUDA code without modification.
Perhaps most importantly, BarraCUDA challenges the notion that GPU computing must be locked behind vendor-specific ecosystems. By providing a path to compile CUDA code directly to AMD hardware, the project opens up new possibilities for cross-platform GPU development.
For those interested in exploring the project, the GitHub repository contains the complete source code, documentation, and test suite. The project welcomes contributions and feedback from the developer community.

Featured: BarraCUDA's compiler pipeline diagram showing the transformation from CUDA source code to AMD GPU binaries.
