How FFmpeg leverages Vulkan Compute to accelerate professional video codecs on consumer GPUs, bypassing hardware limitations and enabling GPU-resident processing without CPU bottlenecks.
Video Encoding and Decoding with Vulkan Compute Shaders in FFmpeg
March 16, 2026 by Lynne, Khronos member, FFmpeg Vulkan maintainer

Video encoding and decoding on the internet is largely a solved problem for everyday users. Most consumer devices now ship with dedicated hardware accelerator chips, to which APIs like the Vulkan® Video extensions provide direct access. Meanwhile, newer codecs are increasingly royalty-free with open specifications — or simply age out of licensing restrictions — making the standards accessible to everyone.
It's easy to forget how demanding 720p H.264 decoding was on CPUs just 18 years ago. That challenge drove intense competition and optimization among software implementations, pushing performance to the limit until hardware decoding finally became commonplace.
In professional workflows, however, performance walls still exist. Editors scrubbing through days of raw camera footage, colorists working with 8K 16-bit masters, VFX artists rendering 32-bit floating-point ACEScg video, and archivists handling extreme-resolution lossless film scans are still performance-bound. Where casual users once tolerated the occasional frame drop, today's professionals are often pushed toward expensive proprietary solutions or liquid-cooled, hundred-core workstations with hundreds of gigabytes of RAM.
This post explores how FFmpeg uses Vulkan Compute to seamlessly accelerate encoding and decoding of even professional-grade video on consumer GPUs — unlocking GPU compute parallelism at scale, without specialized hardware. This approach complements Vulkan Video's fixed-function codec support, extending acceleration to formats and workflows it doesn't cover.
Codecs
Codecs are algorithms that exploit redundancy and patterns in a signal to compress it for storage or transmission. How easy is it to parallelize codec processing on a GPU? Take JPEG, the C. elegans of compression codecs, as an illustrative example.
Encoding an image requires a 2D frequency transform (partially parallelizable, processing rows then columns), DC value prediction (fully serial), quantization to discard perceptually irrelevant information (fully parallel), and finally Huffman coding (extremely serial). The mix of parallel and serial steps turns out to be the central challenge for GPU codec acceleration.
Decoding reverses these steps — but the serial bottlenecks remain just as problematic. This is the fundamental tension: codec pipelines are riddled with serial dependencies, while GPUs are purpose-built to execute thousands of independent, uncorrelated operations simultaneously.
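To make the split concrete, here is a small Python sketch (illustrative only, not FFmpeg code) of why the 2D transform is only "partially parallelizable": a separable DCT decomposes into independent per-row passes followed by independent per-column passes, each of which maps naturally onto GPU invocations, with a barrier between the two. The serial steps (DC prediction, entropy coding) admit no such decomposition.

```python
import math

def dct1d(v):
    # Unscaled DCT-II of a 1-D sequence.
    N = len(v)
    return [sum(v[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def dct2d_separable(block):
    # Pass 1: every row independently (parallel across rows)...
    rows = [dct1d(r) for r in block]
    # ...pass 2: every column independently (parallel across columns).
    h, w = len(rows), len(rows[0])
    cols = [dct1d([rows[y][u] for y in range(h)]) for u in range(w)]
    return [[cols[u][v] for u in range(w)] for v in range(h)]

def dct2d_direct(block):
    # Reference: the textbook double sum over the whole block.
    N, M = len(block), len(block[0])
    return [[sum(block[y][x]
                 * math.cos(math.pi / M * (x + 0.5) * u)
                 * math.cos(math.pi / N * (y + 0.5) * v)
                 for y in range(N) for x in range(M))
             for u in range(M)]
            for v in range(N)]
```

Both routines produce the same coefficients; the separable form simply exposes the parallelism.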
Compromises
The historically obvious approach was hybrid decoding: handle the serial steps (like coefficient decoding) on the CPU, upload intermediate results to the GPU, then let the GPU run the parallel steps where it excels. In practice, this runs into a fundamental problem: GPUs are physically distant from system memory. Even with DMA and high-bandwidth transfers, the round-trip latency often makes hybrid decoding slower than just doing the parallel steps on the CPU — especially given how capable modern SIMD-enabled CPUs have become.
Real-world results with hybrid codec implementations have confirmed this. The dav1d decoder attempted to offload its final filter pass — complex but highly parallelizable — to the GPU, but saw no gain over the CPU, even on mobile. x264 added basic OpenCL™ support, but frame upload latency killed any performance advantage, and the code eventually bitrotted. These failures have left hybrid implementations with a poor reputation in the multimedia community.
The lesson is clear: to be consistently fast, maintainable, and widely adopted, compute-based codec implementations need to be fully GPU-resident — no CPU hand-offs.
Where there's a will...
Most codecs are designed with ASIC hardware in mind — the dedicated video engines found on modern GPUs and exposed through Vulkan Video. But even ASICs aren't infinitely fast: codecs typically compromise and define a minimum unit of parallelizable work, called a slice or block, representing the smallest chunk that can be processed independently.
Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities.
Together, these trends make it genuinely feasible today to implement certain codecs entirely in compute shaders — no CPU involvement required. Compute-based encoders also have an advantage over ASICs that's easy to overlook: they're unconstrained in memory usage and search time. With enough threads to exhaustively scan each block, matching or even surpassing the quality of software encoders is entirely achievable.
Accessibility
FFmpeg is a free and open source collection of libraries and tools to enable working with multimedia streams, regardless of format or codec. Whilst famous for its codec implementations with handwritten assembly optimizations across multiple platforms, FFmpeg also provides easy access to hardware accelerators.
Crucially, hardware acceleration in FFmpeg is built on top of the software codecs. Header parsing, threading, frame and slice scheduling, and error handling all happen in software; only the decoding of the actual video data is offloaded. This combines robust, well-tested code with hardware acceleration.
This design lets us directly translate the frame-level threading that the software implementations already do: multiple frames are dispatched for decoding in parallel to fully saturate the GPU. It also lets users dynamically toggle between software and hardware implementations, with no visible difference whether hardware decoding is implemented via Vulkan Video or via Vulkan Compute shaders.
The widespread usage of FFmpeg in editing software, media players and browsers, combined with the ability to add hardware accelerator support to any software implementation, makes it an ideal vehicle for making compute-based codec implementations widely accessible, rather than shipping them as dedicated standalone libraries.
FFv1
FFv1, the FFmpeg Video codec version 1, has become a staple of the archival community and of applications where lossless compression is required. It's open, royalty-free, and an official IETF standard. The work of implementing codecs in compute shaders in FFmpeg began here.
The FFv1 encoder and decoder are very slow on a CPU, despite the format supporting up to 1024 independent slices. This is partly due to the huge bandwidth needed for high-resolution RGB video, and partly to the somewhat bottlenecked entropy coding design. FFv1 version 3 was designed over 10 years ago, and it was thanks to the archival community's adoption that it gained wide usage. However, these bottlenecks were making encoding and decoding of high-resolution archival film scans prohibitively time consuming.
Thus, with the support of the archival community, the Vulkan FFv1 encoder and decoder were written. They started out as conversions of the software encoder and decoder, but were gradually optimized with more and more GPU-specific techniques.
The biggest challenge when encoding FFv1 is its range coder, which lacks the optimizations that, for example, AV1's range coder has. Each symbol (a pixel difference value) is coded bit by bit, and each bit has its own 8-bit adaptation value, so encoding or decoding a symbol requires randomly looking up 32 contiguous values from a set of thousands (per plane!). We speed this up with a workgroup size of 32: each local invocation performs the lookup and adaptation in parallel, while a single invocation performs the actual encoding or decoding.
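As a rough illustration of the adaptive coding at the heart of this (not FFv1's actual range coder: the state layout, adaptation step, and the use of big integers in place of renormalization are all simplifications), here is a binary coder whose 8-bit state adapts after every bit. In the real shader, a serial loop like the one below runs on a single invocation while the other 31 invocations of the workgroup fetch and adapt contexts in parallel.

```python
# Toy adaptive binary coder (NOT FFv1's actual range coder): `st` is an
# 8-bit state giving P(bit == 0) as st/256, adapted after every bit.
# Big Python integers stand in for the renormalisation a real coder needs.

STATE0 = 128  # assumed initial state: 50/50

def _step(low, rng, st, bit):
    low, rng = low << 8, rng << 8          # grow precision instead of renorm
    s = rng * st // 256                    # size of the "bit = 0" sub-interval
    if bit:
        low, rng, st = low + s, rng - s, max(8, st - 8)   # 1s become likelier
    else:
        rng, st = s, min(248, st + 8)                     # 0s become likelier
    return low, rng, st

def rc_encode(bits):
    low, rng, st = 0, 1 << 16, STATE0
    for b in bits:
        low, rng, st = _step(low, rng, st, b)
    return low                             # any value in [low, low+rng) works

def rc_decode(value, n):
    low, rng, st = 0, 1 << 16, STATE0
    out = []
    for i in range(n):
        nlow, nrng = low << 8, rng << 8
        s = nrng * st // 256
        # Compare against the split point, lifted to the final precision.
        bit = 0 if value < (nlow + s) << (8 * (n - 1 - i)) else 1
        out.append(bit)
        low, rng, st = _step(low, rng, st, bit)
    return out
```

The decoder mirrors the encoder's interval arithmetic exactly, which is why any divergence in the context lookups (the part the 32 invocations accelerate) corrupts everything downstream.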
For RGB, a Reversible Color Transform (RCT) is performed to decorrelate pixel values further. Originally, a separate shader was used for this, encoding to a separate image. However, the bandwidth required to do this for very high resolution images outweighed the advantages. Since only two lines of context are needed to decode or encode, we instead allocate small (width / horizontal_slices) × 2 scratch images and perform the RCT just ahead of encoding each line, with the help of the 32 helper invocations.
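The RCT itself is cheap and exactly reversible. A sketch of the JPEG 2000-style transform used by FFv1, in Python for clarity:

```python
# JPEG 2000-style reversible color transform (the variant FFv1 uses):
# integer-exact, so losslessly coded Y/Cb/Cr reconstructs RGB perfectly.

def rct_forward(r, g, b):
    y = (r + 2 * g + b) >> 2   # floor division; exactly undone by the inverse
    cb = b - g
    cr = r - g
    return y, cb, cr

def rct_inverse(y, cb, cr):
    g = y - ((cb + cr) >> 2)   # recovers g exactly despite the floor above
    r = cr + g
    b = cb + g
    return r, g, b
```

The floor divisions cancel exactly, which is what makes the transform reversible in pure integer arithmetic.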


APV
APV is a new codec designed by Samsung to serve as a royalty-free, open alternative for mezzanine video compression. Recently, it too became an IETF standard. It's gaining traction in the VFX and professional media production communities, and as a camera recording format in smartphones.
Unlike most codecs mentioned in this article, APV was designed for parallelism from the ground up. Similar to JPEG, each frame is subdivided into components, and each component is subdivided into tiles, with each tile featuring multiple blocks. Each block is simply transformed, quantized via a scalar quantizer (simple division), and encoded via variable length codes. There is not even any DC prediction.
To implement it as a compute shader, we handle entropy decoding of each tile in a first shader, then run a second shader that transforms a single block row per invocation.
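As a sketch of the scalar quantizer stage (the step size and rounding below are illustrative, not APV's actual tables), quantization is a per-coefficient division and dequantization a multiplication, which is why this step is embarrassingly parallel:

```python
# Illustrative scalar quantizer: divide at encode time, multiply at decode
# time. Round-to-nearest (ties away from zero) bounds the per-coefficient
# error by half a step. Every coefficient is independent: one GPU invocation
# per coefficient needs no communication at all.

def quantize(coefs, qstep):
    out = []
    for c in coefs:
        mag = (abs(c) + qstep // 2) // qstep   # round |c|/qstep to nearest
        out.append(mag if c >= 0 else -mag)
    return out

def dequantize(levels, qstep):
    return [l * qstep for l in levels]
```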
ProRes
ProRes is the de-facto standard mezzanine codec, used for editing, camera footage, and mastering. It's a relatively simple codec, similar to JPEG and APV, which made it possible to implement a decoder, and due to popular demand, an encoder.
For decoding, we do essentially the same process as with APV. For encoding however, we do proper rate control and estimation by running a shader to find which quantizer makes a block fit within the frame's bit budget.
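The search itself can be sketched as follows. This uses a hypothetical exp-Golomb bit-cost model rather than ProRes's real entropy coder; on the GPU, each candidate quantizer would be evaluated by its own invocation rather than in a serial loop:

```python
# Hypothetical cost model: code each quantized level with a signed exp-Golomb
# code and count the bits. The real ProRes entropy coder differs; this only
# illustrates the "try quantizers, pick the smallest that fits" search that
# the rate-control shader performs.

def golomb_bits(v):
    k = 2 * v if v >= 0 else -2 * v - 1    # zigzag map to a non-negative int
    return 2 * (k + 1).bit_length() - 1    # length of the exp-Golomb code

def block_cost(coefs, qstep):
    # Truncating sign-magnitude quantizer as a stand-in for the real one.
    return sum(golomb_bits((abs(c) // qstep) * (1 if c >= 0 else -1))
               for c in coefs)

def pick_quantizer(coefs, bit_budget, qsteps=range(1, 65)):
    # On the GPU each invocation evaluates one candidate; here we just scan.
    for q in qsteps:                       # smallest quantizer = best quality
        if block_cost(coefs, q) <= bit_budget:
            return q
    return max(qsteps)
```

Evaluating all candidates in parallel and then reducing to the smallest one that fits is a natural fit for a workgroup.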
Unfortunately, unlike the other codecs on this list, the ProRes codecs are neither royalty-free nor openly specified, and the implementations in FFmpeg are unofficial. But given their sheer popularity, such implementations are necessary for interoperability with much of the professional world. The developers dogfood the implementations, and their output is monitored to match that of the official implementations.
ProRes RAW
ProRes RAW features a bitstream that shares little with ProRes, because it was made for compressing RAW (not debayered) lossy sensor data. It uses a DCT performed on each component, and a coefficient coder which predicts DCs across components and efficiently encodes AC values from multiple components in the usual zigzag order. The entropy coding system is not exactly a traditional variable length code, but closer to exponential coding. Slices contain multiple blocks, and each component can be decoded in parallel.
Unlike FFv1, there is no limit on the number of tiles per image, which can mean decoding hundreds of thousands of independent blocks. This is great for parallelism and leads to an efficient implementation. The decoder uses a 2-pass approach: a first shader decodes each tile, and a second shader transforms all blocks within each tile with row/column parallelism (a configuration we refer to as "shredding", since it can fully saturate a GPU's workgroup size limit).
DPX
DPX is not a codec, but rather a container for packed pixels with a header. It's an official SMPTE standard, and rather popular with film scanners. Rather than tightly packing pixels in an optimal layout, it packs them into 32-bit chunks, padding where needed. Or it can simply not pack pixels at all, depending on a header switch. Being an uncompressed format with loose rules, designed decades ago, it's rife with vendors interpreting the specification creatively, in ways that completely break decoding.
Thankfully, there's a text "producer" field left in the header for such implementations to sign their artistry with, which can be used to figure out how to correctly unpack the pixels without seeing alien rainbows. All of this comes down to writing heuristics in shaders. The overhead is never the arithmetic needed to locate a collection of pixels, but actually pulling the data from memory and writing it elsewhere.
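As an illustration of what such unpacking involves, here is one common 10-bit "filled" layout, with three components per 32-bit word and 2 bits of padding (real files vary in padding position and component order, which is exactly what the heuristics have to detect):

```python
# One common DPX 10-bit "filled" layout: three 10-bit components per 32-bit
# word, padding in the low 2 bits. Other files put the padding in the high
# bits or reverse the component order; this sketch assumes one fixed layout.

def pack_10bit(components):
    # components: flat list of 10-bit values, length a multiple of 3
    words = []
    for i in range(0, len(components), 3):
        a, b, c = components[i:i + 3]
        words.append((a << 22) | (b << 12) | (c << 2))
    return words

def unpack_10bit(words):
    out = []
    for w in words:
        out += [(w >> 22) & 0x3FF, (w >> 12) & 0x3FF, (w >> 2) & 0x3FF]
    return out
```

The shifts and masks are trivially cheap; as noted above, memory traffic dominates.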
VC-2
VC-2 is another mezzanine codec. Authored by the BBC, based on its Dirac codec, it is royalty-free, with official SMPTE specifications. Its primary use-case was real-time streaming, particularly fitting high resolution video over a gigabit connection with sub-frame latency.
Unlike APV or ProRes, it is based on wavelet transforms. Each frame is subdivided into power-of-two sized slices. Wavelets are rather interesting as transforms: they decompose a frame into a quarter-resolution image, plus 3 more quarter-resolution images of residuals. Unlike DCTs, they are highly localized, which means they can be performed individually on each slice, yet when assembled they behave as if the entire frame had been transformed. This eliminates the blocking artifacts that all DCT-based codecs suffer from. On the other hand, that localization compromises their frequency decomposition, making them less efficient to encode, and their distortion characteristics are substantially less visually appealing than the blurring of DCTs. This was one of the main reasons they failed to gain traction in post-2000s codecs.
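One of the wavelets VC-2 can use is the integer LeGall (5,3) filter, computed as a lifting transform and therefore exactly reversible. A Python sketch of one decomposition level follows (edge handling is simplified to clamping here; VC-2 specifies its own edge extension):

```python
# One level of the integer LeGall (5,3) wavelet as a lifting transform.
# Lifting steps are always invertible: each one only adds a function of the
# other band, so the inverse subtracts the exact same value.

def legall53_forward(x):
    n = len(x)
    assert n % 2 == 0
    even, odd = x[0::2], x[1::2]
    h = len(even)
    # Predict: odd samples become high-pass residuals.
    d = [odd[i] - ((even[i] + even[min(i + 1, h - 1)]) >> 1) for i in range(h)]
    # Update: even samples become the low-pass (half-resolution) band.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(h)]
    return s, d

def legall53_inverse(s, d):
    h = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(h)]
    odd = [d[i] + ((even[i] + even[min(i + 1, h - 1)]) >> 1) for i in range(h)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out
```

Every output sample depends only on a few neighbours, which is the localization property discussed above, and each predict/update pass is fully parallel across samples.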
The resulting coefficients are encoded via simple interleaved exp-Golomb codes, which, while not parallelizable, can be beautifully simplified in a decoder to remove all bit-parsing and instead operate on whole bytes.
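For reference, plain (non-interleaved) exp-Golomb coding looks like this; VC-2's interleaved variant rearranges the bits but follows the same principle:

```python
# Signed exp-Golomb codes over a string of '0'/'1' bits: a zigzag map makes
# the value non-negative, then k is coded as (prefix zeros) + binary(k + 1).

def eg_encode(values):
    bits = ""
    for v in values:
        k = 2 * v if v >= 0 else -2 * v - 1    # zigzag: signed -> non-negative
        s = bin(k + 1)[2:]                     # binary of k+1, MSB first
        bits += "0" * (len(s) - 1) + s         # prefix zeros, then the value
    return bits

def eg_decode(bits, count):
    out, pos = [], 0
    for _ in range(count):
        z = 0
        while bits[pos] == "0":                # count the leading zeros
            z, pos = z + 1, pos + 1
        k = int(bits[pos:pos + z + 1], 2) - 1  # read z+1 bits, undo the +1
        pos += z + 1
        out.append(k // 2 if k % 2 == 0 else -(k + 1) // 2)  # undo zigzag
    return out
```

The serial dependency is plain here: the position of each codeword depends on the length of every codeword before it.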
JPEG
The codec given as an example at the start turns out to admit a very interesting attack that not only opens the door to parallelization, but also to parallelizing arbitrary data compression standards such as DEFLATE. The idea is that although VLC streams provide no explicit way to parallelize, VLC decoders (and in fact all codes that satisfy the Kraft–McMillan inequality) can spuriously resynchronize: after a surprisingly short delay, a decoder started at the wrong bit offset tends to begin outputting valid data. All that's needed is to run 4 shaders that gradually synchronize the starting points within each JPEG stream.
JPEG has multiple variants too, such as the progressive and lossless profiles, which can be parallelized to a similar extent. DC prediction can be done via a parallel prefix sum, which is among the most common operations performed in compute shaders. DCTs can be done via a shred configuration, as with the other codecs.
Future
With the release of FFmpeg 8.1, we've implemented FFv1 encoding and decoding, ProRes encoding and decoding, ProRes RAW decoding, and DPX unpacking. GPU-based processing is used automatically whenever Vulkan-accelerated decoding is enabled. The VC-2 encoder and decoder, along with the JPEG and APV decoders, are still in progress and need additional work before they can be merged.
Looking further ahead, the only remaining codecs with meaningful GPU acceleration potential are JPEG2000 and PNG — the rest either have limited practical use cases or don't benefit from compute-based acceleration. Unfortunately, JPEG2000 — and by extension JPEG2000HT — is unlike most modern codecs, burdened with the worst features of several combined: a semi-serialized coding system that requires extensive domain knowledge and a bitstream complex enough to give most modern bureaucracies pause. Software decoding of JPEG2000 ranks among the slowest of all widely-used codecs, owing to its ASIC-centric design and under-engineered arithmetic coder. Despite all this, it remains the primary codec used in digital cinema, medicine, and forensics.
PNG acceleration is an open question: its viability as a GPU target will depend on how effectively DEFLATE can be parallelized.
Vulkan Compute
Vulkan is often pigeonholed as a graphics API with added compute — but that framing is outdated. Its compute capabilities have evolved to match, and in some cases exceed, dedicated compute APIs. Modern Vulkan offers pointers, extensive subgroup operations, shared memory aliasing, native bitwise operations, a well-defined memory model, shader specialization, 64-bit addressing, and direct access to GPU matrix units. Together, these features enable programmers to optimize at a lower level than more abstracted APIs.
Even so, the Vulkan Compute API has not yet reached its full potential, as it doesn't expose the full capabilities of SPIR-V™, which as an intermediate representation is remarkably expressive. Support for the broader SPIR-V feature set is actively expanding: untyped pointers and 64-bit addressing are already available, and support for bitwise operations on non-32-bit integer types is on the way.
Competing compute APIs from GPU vendors often bundle hundreds of specialized and specifically optimized algorithm implementations, accessible through more comfortable programming languages — a tempting package. The catch, of course, is vendor lock-in, which can be a serious concern for portable, long-lived software like FFmpeg.
FFmpeg is no stranger to writing its own implementations of popular algorithms to avoid dependencies: hashing functions, sorting algorithms, CRCs, frequency transforms. But are extensive, object-oriented APIs actually necessary? Often, formatting data for a common implementation takes longer and produces less optimal code than simply writing a small implementation of the algorithm, specialized for the use-case at hand. OOP can in many cases be handled by simply templating via a preprocessor. Linking multiple pieces of code can be just an #include. And fragile code that targets a single version of a vendor's API, which in turn depends on a specific old gcc version, can be replaced by a reliable, lasting, self-sufficient shader.
Vulkan is ubiquitous, running on everything from tiny SoCs and tablets to embedded, discrete, and professional server GPUs, and its industry-led governance model creates strong incentives to support new extensions broadly. Conformance is continuously verified by automated testing against a comprehensive test suite. Lastly, Vulkan enjoys a broad ecosystem of debugging, optimization, and profiling tools, and its large global developer community means that almost any GPU quirk or optimization trick you discover has already been found, documented, and fed back into the specification.
Whether using Vulkan Video or Vulkan compute shaders, Vulkan has become a compelling API to access GPU-accelerated video processing.
FFmpeg download: https://ffmpeg.org/download.html
Khronos® and Vulkan® are registered trademarks, and SPIR-V™ is a trademark of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used under license by Khronos. All other product names, trademarks, and/or company names are used solely for identification and belong to their respective owners.
