Accelerate brings GPU‑ready array programming to Haskell – what the library actually offers
#Hardware

AI & ML Reporter
4 min read

Accelerate provides an embedded DSL for regular multi‑dimensional arrays in Haskell, compiling array computations on the fly to LLVM‑based CPU or CUDA GPU code. The project ships a core library, several back‑ends, and a growing ecosystem of I/O, FFT, and BLAS extensions, but it remains limited to regular arrays, requires explicit back‑end selection, and still lacks many higher‑level abstractions found in mainstream data‑parallel frameworks.

What the announcement claims

The Accelerate project advertises a high‑performance embedded language for array computations that can be compiled at runtime to run on multicore CPUs or NVIDIA GPUs. The repository lists a core package (accelerate) plus a set of back‑ends (accelerate-llvm-native, accelerate-llvm-ptx) and a collection of I/O and algorithmic extensions (FFT, BLAS, image formats, etc.). The documentation highlights a syntax that mirrors ordinary Haskell list code—e.g. a dot‑product expressed with fold, zipWith, and (*)—while promising automatic off‑loading to the GPU via run functions.
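The dot‑product mentioned in the README can be sketched as follows, run here on the CPU back‑end (the `dotpList` reference version is added only for comparison with ordinary list code; it is not part of the library's example):

```haskell
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.LLVM.Native as CPU

-- Embedded version: fold, zipWith and (*) mirror their list
-- counterparts, but build an AST over the Acc array type.
dotp :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float) -> A.Acc (A.Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

-- Ordinary-list version of the same computation, for comparison.
dotpList :: [Float] -> [Float] -> Float
dotpList xs ys = sum (zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (A.Z A.:. 4) [1,2,3,4] :: A.Vector Float
      ys = A.fromList (A.Z A.:. 4) [5,6,7,8] :: A.Vector Float
  -- run compiles the AST at runtime and executes it on the chosen back-end
  print (CPU.run (dotp (A.use xs) (A.use ys)))
```

Note that `dotp` never touches concrete data: it describes a computation, and `run` compiles and executes that description.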

What is actually new

  • Embedded DSL – Uses higher‑order abstract syntax (HOAS) to build an AST of array operations that is later lowered to LLVM IR. Status: stable, part of the accelerate core.
  • CPU back‑end – accelerate-llvm-native generates multi‑threaded code for any LLVM‑compatible CPU. Status: production‑ready, used in the examples.
  • CUDA back‑end – accelerate-llvm-ptx emits PTX for NVIDIA GPUs (compute capability ≥ 3.0). Status: works for many kernels; requires a CUDA‑capable device.
  • Ecosystem packages – I/O adapters (accelerate-io-*), numeric libraries (accelerate-fft, accelerate-blas), and domain‑specific extensions (e.g. colour-accelerate). Status: individually maintained; some lag behind core releases.
  • Example suite – accelerate-examples ships a Mandelbrot renderer, ray tracer, N‑body simulation, PageRank, and more. Status: demonstrates feasibility, but often tuned for small‑scale benchmarks.

The core novelty lies in the type‑safe separation of description and execution: the Acc type marks an expression as compilable, and the back‑ends ensure that the generated code respects Haskell’s purity guarantees. The library also uses a HOAS‑to‑de Bruijn conversion that simplifies variable handling inside the DSL, a technique described in a separate technical note.
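The HOAS‑to‑de Bruijn conversion can be sketched in a few lines of plain Haskell. The toy AST below is not Accelerate’s actual representation; it only illustrates the technique: binders are Haskell functions on the way in, and numeric indices on the way out.

```haskell
-- HOAS: binding is represented by a Haskell function.
data HOAS
  = HVar Int              -- level placeholder used only during conversion
  | HLit Int
  | HAdd HOAS HOAS
  | HLam (HOAS -> HOAS)
  | HApp HOAS HOAS

-- First-order syntax with de Bruijn indices.
data DB
  = DVar Int
  | DLit Int
  | DAdd DB DB
  | DLam DB
  | DApp DB DB
  deriving (Eq, Show)

-- Convert by tracking the current binding depth: a variable bound at
-- level l, seen at depth d, gets de Bruijn index d - l - 1.
toDB :: Int -> HOAS -> DB
toDB d (HVar l)   = DVar (d - l - 1)
toDB _ (HLit n)   = DLit n
toDB d (HAdd a b) = DAdd (toDB d a) (toDB d b)
toDB d (HLam f)   = DLam (toDB (d + 1) (f (HVar d)))
toDB d (HApp a b) = DApp (toDB d a) (toDB d b)

main :: IO ()
main =
  -- \x -> \y -> x + y  becomes  DLam (DLam (DAdd (DVar 1) (DVar 0)))
  print (toDB 0 (HLam (\x -> HLam (\y -> HAdd x y))))
```

The payoff is that users write bindings with ordinary Haskell lambdas, while the compiler pipeline works on a first‑order tree it can inspect and transform.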

Practical implications

  • Performance – Benchmarks published by the authors show speed‑ups of 5‑10× on GPUs for embarrassingly parallel kernels (e.g., vector addition, matrix multiplication). Real‑world workloads that require irregular memory access or dynamic control flow still suffer from the regular‑array restriction.
  • Portability – Because the same Haskell source can target both CPU and GPU, developers can prototype on a laptop and later switch to a GPU cluster without rewriting code. However, the back‑end is chosen explicitly by importing the corresponding run function (from the Native or PTX module), so automatic device discovery is not yet built in.
  • Tooling – The project integrates with standard Haskell tooling (Cabal, Stack, GHCup). Debugging generated LLVM/PTX code is possible but requires stepping outside the usual Haskell REPL workflow.
  • Ecosystem fit – Accelerate fills a niche between low‑level CUDA bindings (e.g., cuda package) and high‑level data‑frame libraries (e.g., frames). It is particularly attractive for researchers who already use Haskell for algorithmic prototyping and need a way to evaluate performance on GPUs.
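The portability point can be made concrete: the back‑end is selected by the module you import, so the same program targets CPU or GPU by swapping one import. A sketch (the GPU import is commented out, since it requires a CUDA device):

```haskell
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.LLVM.Native as CPU  -- multicore CPU back-end
-- import qualified Data.Array.Accelerate.LLVM.PTX as GPU  -- CUDA back-end, same API

-- The computation itself is back-end agnostic.
step :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float)
step = A.map (* 2)

main :: IO ()
main = do
  let v = A.fromList (A.Z A.:. 3) [1,2,3] :: A.Vector Float
  -- Swap CPU.run for GPU.run to target an NVIDIA device,
  -- with no change to `step` itself.
  print (CPU.run (step (A.use v)))
```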

Limitations and open problems

  1. Regular arrays only – The DSL assumes dense, rectangular data layouts. Irregular structures (sparse matrices, graphs) need to be flattened or handled via the experimental accelerate-io-repa bridge, which adds overhead.
  2. Compilation latency – The “online compilation” step can take several hundred milliseconds for non‑trivial kernels, making Accelerate unsuitable for fine‑grained, per‑iteration JIT scenarios.
  3. Feature gaps – Advanced CUDA features such as shared memory tiling, warp‑level primitives, or dynamic parallelism are not exposed. Users must rely on the library’s optimizer, which may not generate optimal kernels for all patterns.
  4. Maturity of extensions – Packages like accelerate-fft and accelerate-blas wrap external libraries but often lag behind the latest vendor releases, limiting their usefulness for cutting‑edge scientific code.
  5. Community size – Development is driven by a small academic team; issue triage and PR reviews can be slow, and documentation beyond the Haddock API is sparse.
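The compilation‑latency point (2) is commonly mitigated by pre‑compiling with run1, which compiles a function of one array argument once and caches the result, so repeated applications skip the JIT step. A minimal CPU‑back‑end sketch:

```haskell
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.LLVM.Native as CPU

-- The kernel to be compiled.
step :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float)
step = A.map (+ 1)

-- run1 compiles `step` once; the compiled code is cached in `go`,
-- so applying `go` repeatedly does not recompile.
go :: A.Vector Float -> A.Vector Float
go = CPU.run1 step

main :: IO ()
main = do
  let v0 = A.fromList (A.Z A.:. 3) [0,0,0] :: A.Vector Float
  print (iterate go v0 !! 1000)  -- 1000 applications, one compilation
```

This helps iterative workloads (simulation steps, optimisation loops), but does not help when the kernel itself changes every iteration.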

Where to start

  1. Add the core library to your project: list accelerate under build-depends in your .cabal file, or under dependencies in package.yaml when using Stack.
  2. Choose a back‑end: accelerate-llvm-native for CPU, accelerate-llvm-ptx for CUDA.
  3. Write a computation over the Acc array type (Acc marks an array expression as compilable; it is not a monad), e.g. the dot‑product example from the README.
  4. Execute with run from Data.Array.Accelerate.LLVM.Native (CPU) or Data.Array.Accelerate.LLVM.PTX (GPU); run1 is a variant of either back‑end that pre‑compiles a function of one array argument.
  5. Explore the accelerate-examples repository for ready‑made kernels; the mandelbrot example is a good sanity check for GPU output.
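For step 1, the dependency declaration amounts to a short build‑depends stanza. A sketch (package names from this article; version bounds omitted):

```
-- in <project>.cabal: the core library plus one or both back-ends
build-depends:
    base
  , accelerate
  , accelerate-llvm-native   -- CPU
  , accelerate-llvm-ptx      -- CUDA GPU
```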

Outlook

Accelerate demonstrates that Haskell can generate performant GPU code without abandoning its pure functional core. The library’s design—type‑safe DSL, LLVM back‑ends, and a modular extension system—offers a solid foundation for future work, such as adding support for AMD GPUs via ROCm, exposing more low‑level CUDA controls, or integrating with heterogeneous runtimes like SYCL. Until those extensions arrive, users should treat Accelerate as a research‑grade tool: excellent for prototyping regular, data‑parallel algorithms, but not yet a drop‑in replacement for production‑grade GPU libraries.
