Unlocking GPU Performance: The Hidden Mechanics of Early-Z Testing
#Hardware

LavX Team

Early-Z testing is a foundational GPU optimization that dramatically reduces unnecessary pixel shading by culling occluded fragments early. This deep dive explores its intricate mechanics, surprising limitations with modern shader techniques, and practical strategies for maximizing rendering performance.

The Hidden Engine of GPU Efficiency: Demystifying Early-Z Testing

For decades, Early-Z testing has been the silent workhorse of real-time graphics, enabling techniques like depth pre-passes that keep forward rendering viable. Yet its nuanced interactions with modern shader features remain widely misunderstood. As GPUs evolve, mastering Early-Z becomes critical for unlocking peak performance in complex rendering pipelines.

The Logical Pipeline vs. Hardware Reality

Graphics APIs describe a logical pipeline in which depth testing happens after pixel shading, a holdover from when the depth buffer's main job was resolving visibility. In practice, the driver inspects the shader and pipeline state to decide whether the depth test can safely run before pixel shading (Early-Z), which culls occluded fragments before any expensive shader work is done.

"This ‘sneaky’ optimization works because for opaque geometry, culling before shading produces identical results to the logical pipeline—just faster," explains the analysis. "The magic lies in drivers guaranteeing correctness while exploiting hardware parallelism."

When Early-Z Thrives… and Stumbles

The Ideal Scenario

With standard opaque shaders (no discards, depth exports, or UAV writes), Early-Z shines. Front-to-back rendering slashes pixel shader invocations dramatically, as demonstrated by the author's test app:

  • Back-to-front draw: 648,000 shader invocations
  • Front-to-back draw: 440,640 invocations (32% reduction)
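
The effect of draw order can be reproduced with a toy CPU-side model. The Python sketch below is not the author's test app: the three rectangles, the 16x16 target, and the name count_shaded_pixels are made up for illustration, and the per-pixel loop only approximates what a GPU does in parallel and at coarser granularity.

```python
def count_shaded_pixels(draws, width, height):
    """Toy Early-Z model: depth-test each fragment before 'shading' it."""
    depth = [[1.0] * width for _ in range(height)]  # cleared to the far plane
    shaded = 0
    for x0, y0, x1, y1, z in draws:  # axis-aligned rectangle at constant depth z
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if z < depth[y][x]:   # early depth test (LESS)
                    depth[y][x] = z   # depth write
                    shaded += 1       # the pixel shader would run here
    return shaded


# Hypothetical scene: three overlapping rectangles, nearest first (smaller z is closer).
front_to_back = [(0, 0, 8, 8, 0.2), (4, 4, 12, 12, 0.5), (2, 2, 14, 14, 0.8)]
back_to_front = list(reversed(front_to_back))

ftb = count_shaded_pixels(front_to_back, 16, 16)
btf = count_shaded_pixels(back_to_front, 16, 16)
print("back-to-front:", btf, "front-to-back:", ftb)
```

Drawn back-to-front, every covered fragment passes the depth test and gets shaded; drawn front-to-back, only fragments that end up visible do. The exact ratio depends entirely on how much the geometry overlaps.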

The Disruptors

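  1. Discard/Alpha Test: Forces partial Late-Z when depth writes are enabled, crippling culling efficiency. The decision is made from the compiled shader, so even a discard that never executes at runtime disables full Early-Z:

    // This discard never fires at runtime, yet it still disables full Early-Z.
    // neverTrue: hypothetical runtime flag that is always false (a literal false would be compiled out).
    if (neverTrue) discard;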
    
  2. Depth Export: Writing depth from the pixel shader (SV_Depth, gl_FragDepth) forces full Late-Z because the GPU cannot know the final depth before the shader runs. Conservative variants (e.g. SV_DepthGreaterEqual) offer a limited reprieve.

  3. UAV/Storage Writes: Side effects break Early-Z's "pure function" assumption. Without explicit forcing, drivers default to Late-Z to preserve correctness.
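
The three disruptors above can be condensed into a small decision function. This is only a model of the rules as this article states them, not any driver's actual logic: the function name, parameters, and return strings are invented here, conservative depth output is left out, and the "discard with depth writes off" case is inferred from the wording of item 1.

```python
def implicit_depth_mode(uses_discard=False, exports_depth=False,
                        writes_uav=False, depth_write=True):
    """Depth mode the article's rules predict when Early-Z is not forced."""
    if exports_depth or writes_uav:
        # Final depth is unknown before shading, or the shader has side effects.
        return "late-z"
    if uses_discard and depth_write:
        # The depth write has to wait until the shader decides whether to discard.
        return "partial late-z"
    # Plain opaque shader, or discard with depth writes disabled.
    return "early-z"


for features in ({}, {"uses_discard": True}, {"exports_depth": True},
                 {"writes_uav": True, "depth_write": False}):
    print(features, "->", implicit_depth_mode(**features))
```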

Taking Control: Forcing Early-Z

APIs like D3D offer an [earlydepthstencil] shader attribute to override the driver's decision. This enables Early-Z for shaders that write to UAVs, which is crucial for techniques like Order-Independent Transparency, but it comes with caveats:

  • Depth exports are ignored
  • Discard doesn't prevent depth writes
  • Without ROVs, UAV writes race across overlapping fragments
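
The second caveat is the easiest one to trip over, so here is a single-pixel toy model of it in Python. Everything in it is hypothetical (the shade_pixel function, the fragment tuples, the "cutout" and "wall" names); it only contrasts the order of operations, with depth tested and written before the shader when Early-Z is forced, and after it otherwise.

```python
def shade_pixel(fragments, force_early_z):
    """Process fragments for one pixel; return the final (depth, color)."""
    depth, color = 1.0, None
    for z, name, discards in fragments:
        if force_early_z:
            if not z < depth:
                continue          # culled before the shader runs
            depth = z             # depth is written before shading
            if discards:
                continue          # discard only suppresses the color write
            color = name
        else:
            if discards:
                continue          # Late-Z: a discarded fragment never touches depth
            if not z < depth:
                continue
            depth, color = z, name
    return depth, color


# A near alpha-tested fragment that discards, followed by an opaque wall behind it.
fragments = [(0.3, "cutout", True), (0.6, "wall", False)]
late = shade_pixel(fragments, force_early_z=False)
forced = shade_pixel(fragments, force_early_z=True)
print("late-z:", late, "forced early-z:", forced)
```

With Late-Z the wall shows through the cutout. With forced Early-Z the discarded fragment has already written its depth, so the wall behind it is culled and the pixel is left with a hole.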

Rasterizer Order Views: The Savior?

Rasterizer Order Views (ROVs), and the equivalent Fragment Shader Interlock (FSI) in Vulkan and OpenGL, enforce submission-order UAV writes, restoring expected depth-test behavior when forcing Early-Z:

"ROVs guarantee UAV writes only occur for visible fragments and respect draw order, making forced Early-Z viable for advanced techniques—with a parallelism penalty."

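Why submission order matters can be shown with a non-commutative read-modify-write such as alpha blending. The Python sketch below is a simplification with made-up values and helper names (blend, resolve): it treats unordered UAV access as "the same updates applied in some arbitrary order", whereas a real race can also interleave the read and write of two fragments.

```python
from itertools import permutations


def blend(dst, src, alpha):
    """Single-channel 'over' blend: a read-modify-write on the destination."""
    return src * alpha + dst * (1.0 - alpha)


def resolve(fragments, dst=0.0):
    """Apply each fragment's blend to the destination in the given order."""
    for src, alpha in fragments:
        dst = blend(dst, src, alpha)
    return dst


# Two overlapping fragments in submission order: (source value, alpha).
submitted = [(1.0, 0.5), (0.0, 0.5)]

ordered_result = resolve(submitted)  # what submission-order access would produce
possible_unordered = {resolve(order) for order in permutations(submitted)}
print("ordered:", ordered_result, "possible without ordering:", sorted(possible_unordered))
```

Without ordering, the final value depends on which fragment happens to run last; with ordering it is fixed by the draw order, at the cost of serializing overlapping fragments.
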
The Decision Matrix

Shader Features | Depth Write | Implicit Early-Z?    | Forced Early-Z Behavior
----------------|-------------|----------------------|--------------------------------
None            | Off         | ✅ Likely            | Correct
Discard         | On          | ⚠️ Partial (reduced) | ❌ Depth write ignores discard
UAV Writes      | Off         | ❌ Late-Z            | ✅ Writes if visible (unordered)
UAV + ROV       | On          | ❌ Late-Z            | ✅ Correct with ROVs
Depth Export    | Any         | ❌ Late-Z            | ❌ Export ignored

Strategic Insights

  1. Prepass Wisely: Depth-only passes maximize Early-Z efficiency for opaque geometry.
  2. Isolate Disruptors: Batch non-discard opaques first to prime the depth buffer.
  3. ROVs > Atomics: For OIT, prefer ROVs over depth+payload atomics when forcing Early-Z.
  4. Mobile Caveat: Behavior varies—test target hardware aggressively.
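
On the CPU side, the first two insights mostly come down to how opaque draws are ordered. A minimal sketch in Python, assuming a hypothetical Draw record with a camera distance and a flag for discard-using materials (neither comes from the source):

```python
from dataclasses import dataclass


@dataclass
class Draw:
    name: str
    view_depth: float    # distance from the camera; smaller is closer
    uses_discard: bool   # True for alpha-tested materials


def order_opaque_draws(draws):
    """Non-discard draws first, each group sorted front-to-back."""
    # False sorts before True, so plain opaque draws prime the depth buffer
    # before any discard-using draw is submitted.
    return sorted(draws, key=lambda d: (d.uses_discard, d.view_depth))


draws = [
    Draw("fence", 2.0, True),
    Draw("far_wall", 9.0, False),
    Draw("crate", 3.0, False),
    Draw("foliage", 1.0, True),
]
ordered_names = [d.name for d in order_opaque_draws(draws)]
print(ordered_names)
```

A real engine would usually fold state changes and batching into the sort key as well; this version only captures the Early-Z-related ordering.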

As rendering complexity escalates, understanding Early-Z transitions from optimization to necessity. The difference between theory and hardware reality isn't just academic—it's the gap between stutter and silky frames.

Source: To Early-Z or Not to Early-Z by Michał Iwanicki (Principal Engine Architect)
