Unlocking GPU Performance: The Hidden Mechanics of Early-Z Testing
The Hidden Engine of GPU Efficiency: Demystifying Early-Z Testing
For decades, Early-Z testing has been the silent workhorse of real-time graphics, enabling techniques like depth pre-passes that keep forward rendering viable. Yet its nuanced interactions with modern shader features remain widely misunderstood. As GPUs evolve, mastering Early-Z becomes critical for unlocking peak performance in complex rendering pipelines.
## The Logical Pipeline vs. Hardware Reality
Graphics APIs depict a logical pipeline where depth operations occur after pixel shading—a historical artifact from when depth buffers primarily resolved visibility. In reality, drivers analyze shaders and states to determine if depth testing can safely move before pixel shading (Early-Z), culling fragments without executing expensive shaders:
"This ‘sneaky’ optimization works because for opaque geometry, culling before shading produces identical results to the logical pipeline—just faster," explains the analysis. "The magic lies in drivers guaranteeing correctness while exploiting hardware parallelism."
## When Early-Z Thrives… and Stumbles
### The Ideal Scenario
With standard opaque shaders (no discards, depth exports, or UAV writes), Early-Z shines. Front-to-back rendering slashes pixel shader invocations dramatically, as demonstrated by the author's test app:
- Back-to-front draw: 648,000 shader invocations
- Front-to-back draw: 440,640 invocations (32% reduction)
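
The effect of draw order can be reproduced with a toy CPU-side model. The sketch below is not the author's test app: the `render` function, the three full-screen layers, and the 64-pixel target are all assumptions chosen for illustration. It shows that Early-Z leaves the final depth buffer unchanged for opaque geometry while the number of shader invocations depends on submission order.

```python
def render(layer_depths, pixel_count, early_z):
    """Draw full-screen opaque layers; return (depth buffer, shader runs)."""
    depth_buffer = [1.0] * pixel_count            # cleared to the far plane
    invocations = 0
    for layer_depth in layer_depths:              # submission order
        for i in range(pixel_count):
            passes = layer_depth < depth_buffer[i]  # less-than depth test
            if early_z and not passes:
                continue                          # culled before the shader runs
            invocations += 1                      # pixel shader executes
            if passes:
                depth_buffer[i] = layer_depth
    return depth_buffer, invocations


layers = [0.25, 0.5, 0.75]                        # hypothetical layer depths
pixels = 64

late_z = render(sorted(layers), pixels, early_z=False)
back_to_front = render(sorted(layers, reverse=True), pixels, early_z=True)
front_to_back = render(sorted(layers), pixels, early_z=True)

# Same visible result in every case, different shading cost.
print(late_z[0] == back_to_front[0] == front_to_back[0])   # True
print(late_z[1], back_to_front[1], front_to_back[1])       # 192 192 64

# Sanity check on the invocation counts quoted above.
print(f"{1 - 440_640 / 648_000:.0%}")                      # 32%
```

The toy scene saves more than the quoted 32% because every layer covers the whole screen; a real scene with partial overlap saves less.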
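Three shader features disrupt this ideal case:

1. **Discard**: The mere presence of `discard` in the shader is enough to lose full Early-Z, even when it can never execute:

   ```hlsl
   // This unused discard still disables full Early-Z!
   if (false) discard;
   ```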
2. **Depth Export**: Pixel shader overrides (`SV_Depth`, `gl_FragDepth`) force full Late-Z—the GPU can't predict outputs pre-shading. Conservative variants (`SV_DepthGreaterEqual`) offer limited reprieves.
3. **UAV/Storage Writes**: Side effects break Early-Z's "pure function" assumption. Without explicit forcing, drivers default to Late-Z to preserve correctness.
## Taking Control: Forcing Early-Z
HLSL's `[earlydepthstencil]` attribute (GLSL's counterpart is `layout(early_fragment_tests) in;`) overrides the driver's decision. This enables Early-Z with UAV writes, which is crucial for techniques like **Order-Independent Transparency**, but it introduces caveats:
- Depth exports are **ignored**
- Discard **doesn't prevent depth writes**
- Without ROVs, UAV writes race across overlapping fragments
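
The discard caveat is the easiest one to trip over, so here is a minimal single-pixel sketch of it. The `render_pixel` function and the two fragments are hypothetical; the model only assumes what the list above states, namely that under forced Early-Z the depth write happens before the shader runs, so a later `discard` cannot undo it.

```python
def render_pixel(fragments, forced_early_z, clear_depth=1.0):
    """fragments: (depth, color, discards) tuples in submission order."""
    depth, color = clear_depth, None
    for frag_depth, frag_color, discards in fragments:
        if frag_depth >= depth:
            continue                      # fails the less-than depth test
        if forced_early_z:
            depth = frag_depth            # depth written before the shader runs
            if not discards:
                color = frag_color        # discard still suppresses the color
        elif not discards:
            depth, color = frag_depth, frag_color   # late write honors discard
    return color, depth


# A near alpha-tested fragment that discards, then an opaque wall behind it.
fragments = [(0.2, "leaf", True), (0.6, "wall", False)]

print(render_pixel(fragments, forced_early_z=False))  # ('wall', 0.6)
print(render_pixel(fragments, forced_early_z=True))   # (None, 0.2)
```

In the forced case the discarded fragment leaves its depth behind, the wall is culled, and the pixel ends up as a hole.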
## Rasterizer Order Views: The Savior?
ROVs (Direct3D) and fragment shader interlock, FSI (Vulkan/OpenGL), enforce submission-order UAV writes, restoring the expected depth-test behavior when forcing Early-Z: "ROVs guarantee UAV writes only occur for visible fragments and respect draw order, making forced Early-Z viable for advanced techniques, with a parallelism penalty."
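
Why ordering matters can be seen with any read-modify-write that does not commute. The sketch below is a simplified model, not GPU code: `blend_over`, `resolve`, and the fragment values are assumptions, and it captures only the ordering problem (real races on a plain UAV can also tear individual read-modify-write sequences). It enumerates every execution order a plain UAV might see for three overlapping fragments and compares that with the single submission order an ROV enforces.

```python
from itertools import permutations


def blend_over(dst, src):
    """Non-commutative read-modify-write: blend src over dst (one channel)."""
    value, alpha = src
    return value * alpha + dst * (1 - alpha)


def resolve(fragments, order):
    """Apply the fragments to one pixel in the given execution order."""
    pixel = 0.0
    for index in order:
        pixel = blend_over(pixel, fragments[index])
    return pixel


# (value, alpha) pairs for three overlapping fragments, in submission order.
fragments = [(1.0, 0.5), (0.0, 0.5), (0.5, 0.25)]

# Plain UAV: overlapping fragments may execute in any order.
unordered = {round(resolve(fragments, p), 6) for p in permutations(range(3))}

# ROV: execution follows submission order, so there is exactly one result.
ordered = resolve(fragments, range(3))

print(len(unordered) > 1)   # True
print(ordered)              # 0.3125
```

The ordered result is one of several the unordered case can produce, which is why forcing Early-Z without ROVs suits only order-insensitive writes.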
## The Decision Matrix
| Shader Features | Depth Write | Implicit Early-Z? | Forced Early-Z Behavior |
|---|---|---|---|
| None | Off | ✅ Likely | Correct |
| Discard | On | ⚠️ Partial (reduced) | ❌ Depth write ignores discard |
| UAV Writes | Off | ❌ Late-Z | ✅ Writes if visible (unordered) |
| UAV + ROV | On | ❌ Late-Z | ✅ Correct with ROVs |
| Depth Export | Any | ❌ Late-Z | ❌ Export ignored |
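
For tooling such as a material linter, the matrix can be carried around as plain data. The sketch below is a hypothetical encoding: the feature names, the `MATRIX` dictionary, and the `lookup` helper are invented for illustration, and it deliberately returns `None` for combinations the table does not cover rather than guessing at driver behavior.

```python
# (shader features, depth write) -> (implicit behavior, forced Early-Z behavior)
MATRIX = {
    (frozenset(), "off"): ("early-z likely", "correct"),
    (frozenset({"discard"}), "on"): ("partial early-z", "depth write ignores discard"),
    (frozenset({"uav"}), "off"): ("late-z", "writes if visible (unordered)"),
    (frozenset({"uav", "rov"}), "on"): ("late-z", "correct with ROVs"),
}


def lookup(features, depth_write):
    """Return the table row for a shader, or None if the table has no such row."""
    features = frozenset(features)
    if features == {"depth_export"}:          # the table lists "Any" depth write
        return ("late-z", "export ignored")
    return MATRIX.get((features, depth_write))


print(lookup({"discard"}, "on"))       # ('partial early-z', 'depth write ignores discard')
print(lookup({"depth_export"}, "off")) # ('late-z', 'export ignored')
print(lookup({"discard"}, "off"))      # None
```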
## Strategic Insights
- **Prepass Wisely**: Depth-only passes maximize Early-Z efficiency for opaque geometry.
- **Isolate Disruptors**: Batch non-discard opaques first to prime the depth buffer.
- **ROVs > Atomics**: For OIT, prefer ROVs over depth+payload atomics when forcing Early-Z.
- **Mobile Caveat**: Behavior varies; test target hardware aggressively.
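
The batching advice can be expressed as a sort key on the CPU side. This is a minimal sketch under assumed names: the `Draw` record, its `view_depth` field (distance from the camera), and `opaque_submission_order` are hypothetical, and a real engine would usually fold this into a larger sort key alongside state and material IDs.

```python
from dataclasses import dataclass


@dataclass
class Draw:
    name: str
    view_depth: float      # distance from the camera (hypothetical metric)
    uses_discard: bool     # shader contains discard / alpha test


def opaque_submission_order(draws):
    """Non-discard draws first, each group sorted front to back."""
    return sorted(draws, key=lambda d: (d.uses_discard, d.view_depth))


draws = [
    Draw("foliage", 5.0, True),
    Draw("terrain", 40.0, False),
    Draw("character", 3.0, False),
    Draw("fence", 12.0, True),
]

print([d.name for d in opaque_submission_order(draws)])
# ['character', 'terrain', 'foliage', 'fence']
```

Submitting the discard-free draws first means the depth buffer is already populated by the time the alpha-tested draws arrive, so more of their fragments can be rejected before shading.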
As rendering complexity escalates, understanding Early-Z transitions from optimization to necessity. The difference between theory and hardware reality isn't just academic—it's the gap between stutter and silky frames.
Source: To Early-Z or Not to Early-Z by Michał Iwanicki (Principal Engine Architect)