Benchmarking Java Runtimes: When Methodology Changes Everything

Backend Reporter

A deep dive into how rigorous benchmarking methodology completely changed conclusions about Java runtime performance, revealing the importance of fair comparisons in distributed systems decisions.

The benchmark that changed the author's mind about Jakarta EE in 2026 isn't really about Jakarta EE at all. It's about what happens when we move from convenient narratives to defensible measurements in distributed systems. The story shows how methodology can flip conclusions about runtime performance, which matters for anyone making architectural decisions in the Java ecosystem.

The Problem with Benchmarking

Benchmarking distributed systems is notoriously difficult. The initial results from this lab showed Embedded GlassFish outperforming Spring Boot and Payara Micro, but that result rested on an incomplete methodology. The author acknowledged that drawing conclusions from it would be "methodologically weak", an important admission in systems engineering, where incomplete data often leads to incorrect decisions.

The core problem is that benchmarks rarely reflect real-world complexity. They frequently miss critical factors like:

  • Proper warmup periods
  • Consistent runtime environments
  • Database cost attribution
  • Memory management under pressure
  • Error rates under load

These factors become increasingly important as systems scale, making initial benchmark results potentially misleading for production decisions.
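The first item in the list above, a warmup period kept separate from measurement, can be sketched in a few lines. This is a minimal illustration, not the lab's harness: the class and method names are hypothetical, and it counts iterations where the benchmark itself used a timed window driven by an external load generator.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal sketch: discard warmup iterations, then time only the measured ones. */
final class WarmupThenMeasure {

    /**
     * Runs the task warmupIterations times without recording anything,
     * then runs it measuredIterations times and returns one latency per run.
     */
    static long[] measureNanos(Runnable task, int warmupIterations, int measuredIterations) {
        for (int i = 0; i < warmupIterations; i++) {
            task.run(); // not timed: gives the JIT, caches and pools a chance to settle
        }
        long[] latencies = new long[measuredIterations];
        for (int i = 0; i < measuredIterations; i++) {
            long start = System.nanoTime();
            task.run();
            latencies[i] = System.nanoTime() - start;
        }
        return latencies;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Iteration counts are arbitrary illustration values.
        long[] samples = measureNanos(calls::incrementAndGet, 1_000, 200);
        System.out.println("task calls: " + calls.get() + ", recorded samples: " + samples.length);
    }
}
```

The point of the split is that the task runs more often than it is recorded, so startup effects never reach the reported numbers.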

The Evolution of a Fair Benchmark

The author's approach evolved through five phases, each adding rigor to the comparison. The account here starts at Phase 2, the first complete benchmark:

Phase 2: The Initial Temptation

The first complete benchmark showed Embedded GlassFish leading in p95 latency and throughput. However, this phase lacked critical controls:

  • No database cost attribution
  • Inconsistent JDK versions
  • No separate warmup period
  • Short measurement windows
  • Inconsistent pool configurations

These omissions meant the results, while interesting, couldn't support definitive conclusions about runtime performance.

Phase 3: Adding Causality

This phase added the instrumentation needed to explain the results, not just report them:

  • Multiple virtual user (VU) levels
  • Database cost attribution with pg_stat_statements
  • RSS measurements
  • GC logging

A key issue emerged: Payara Micro complained about an unsupported JDK in some runs, so the runtimes were not being compared on equal terms until every runtime ran on the same JDK.
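One way to catch this kind of mismatch early is to fail fast when a run starts on an unexpected JDK. The sketch below is an assumption about how such a guard could look, not something taken from the lab; `JdkGuard` and `requireFeatureVersion` are hypothetical names, and the default of 21 simply matches the Temurin 21 line used in Phase 4.

```java
/** Minimal sketch: refuse to start a benchmark run on an unexpected JDK. */
final class JdkGuard {

    /** Throws if the running JVM's feature version differs from the expected one. */
    static void requireFeatureVersion(int expected) {
        int actual = Runtime.version().feature();
        if (actual != expected) {
            throw new IllegalStateException(
                "Benchmark expects JDK " + expected + " but is running on JDK " + actual
                    + " (" + System.getProperty("java.vendor") + ")");
        }
    }

    public static void main(String[] args) {
        // Optional first argument overrides the expected feature version.
        int expected = args.length > 0 ? Integer.parseInt(args[0]) : 21;
        requireFeatureVersion(expected);
        System.out.println("JDK OK: " + Runtime.version());
    }
}
```

A guard like this only checks the feature version. Pinning an exact build, as Phase 4 does, still has to happen in the environment that launches each runtime.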

Phase 4: The Fair Benchmark

The final defensible benchmark included:

  • Temurin 21.0.10 for all runtimes
  • Fixed heap size (-Xms512m -Xmx512m)
  • Separate warmup period
  • 180-second measured window
  • Explicit pool settings
  • pg_stat_statements reset after warmup
  • Three runs per runtime/VU combination
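The controls above amount to an explicit run matrix: every runtime, at every VU level, three times, under identical JVM settings. A minimal sketch of that matrix follows. The class and record names are hypothetical, the heap flags and the 180-second window come from the list above, and 25 and 100 are the VU levels reported in the results. The warmup length is left out because the source does not state it.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of the Phase 4 run matrix; all names here are illustrative. */
final class RunMatrix {

    /** One benchmark run: which runtime, how many virtual users, which repetition. */
    record Run(String runtime, int virtualUsers, int repetition) {}

    static final List<String> JVM_FLAGS = List.of("-Xms512m", "-Xmx512m"); // fixed heap
    static final int MEASURED_SECONDS = 180;                               // measured window
    static final int RUNS_PER_COMBINATION = 3;

    /** Expands runtimes x VU levels x repetitions into an explicit, ordered list of runs. */
    static List<Run> plan(List<String> runtimes, List<Integer> vuLevels) {
        List<Run> runs = new ArrayList<>();
        for (String runtime : runtimes) {
            for (int vus : vuLevels) {
                for (int rep = 1; rep <= RUNS_PER_COMBINATION; rep++) {
                    runs.add(new Run(runtime, vus, rep));
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Run> runs = plan(
            List.of("Spring Boot", "Payara Micro", "Embedded GlassFish"),
            List.of(25, 100));
        runs.forEach(System.out::println);
    }
}
```

Writing the matrix down as data makes it harder for one runtime to end up with a different heap size or fewer repetitions than the others.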

The results at this stage showed a different picture:

| Runtime | VUs | Median p50 | Median p95 | Median p99 | Throughput | Error rate | Check failures | Median RSS |
|---|---|---|---|---|---|---|---|---|
| Spring Boot | 25 | 4.59 ms | 66.92 ms | 110.03 ms | 213.13 req/s | 0.01% | 2 | 517.5 MB |
| Payara Micro | 25 | 33.10 ms | 188.16 ms | 336.77 ms | 156.48 req/s | 0.00% | 0 | 694.3 MB |
| Embedded GlassFish | 25 | 38.03 ms | 198.83 ms | 371.96 ms | 151.26 req/s | 0.00% | 0 | 579.1 MB |
| Spring Boot | 100 | 149.36 ms | 341.69 ms | 473.41 ms | 372.56 req/s | 0.04% | 25 | 543.0 MB |
| Payara Micro | 100 | 204.61 ms | 588.31 ms | 870.53 ms | 284.29 req/s | 0.00% | 0 | 715.7 MB |
| Embedded GlassFish | 100 | 320.12 ms | 540.00 ms | 677.23 ms | 229.28 req/s | 0.01% | 5 | 593.9 MB |

At 25 VUs, Spring Boot clearly led in latency and throughput, with the lowest RSS. At 100 VUs, it still had the best p50, p95, p99 and throughput, but also the most check failures (25). Payara Micro had zero check failures in all tests, making it the "cleanest" Jakarta EE option. Embedded GlassFish remained technically viable, and at 100 VUs its p95 and p99 were lower than Payara Micro's, but it had the lowest throughput at both load levels and did not lead in the final phase.
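The "median p95" style of reporting combines two steps: a percentile within each run, then a median across the three runs. The sketch below shows one common way to compute both, using the nearest-rank percentile definition. That choice is an assumption, since the source does not say how its load tool calculates percentiles, and the sample values in `main` are made up for illustration.

```java
import java.util.Arrays;

/** Minimal sketch: nearest-rank percentiles per run, then the median across runs. */
final class LatencySummary {

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need at least one sample and 0 < p <= 100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** Median of per-run values; for an even count, the mean of the two middle values. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up p95 values (ms) for three runs of one runtime/VU combination.
        double[] p95PerRun = {70.0, 65.0, 90.0};
        System.out.println("median p95 = " + median(p95PerRun) + " ms");
    }
}
```

Taking the median across runs keeps a single noisy run from deciding the comparison, which is the reason for repeating each combination three times.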

Phase 5: External Validation

A smoke test on Railway confirmed all runtimes could be deployed externally and pass basic functionality checks, but this wasn't used for performance comparisons.

The Database Factor

A crucial insight from the benchmark was that the system was database-heavy, with analytical aggregations dominating the latency tail under pressure. The pg_stat_statements output clearly showed that much of the performance difference came from database interactions rather than pure runtime performance.

This is a critical lesson for distributed systems: as systems scale, the database often becomes the bottleneck, not the application server. Optimizing JDBC connection pools, query performance, and database configuration can yield more significant gains than switching runtimes.
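Attributing cost to the database comes down to ranking statements by their share of total execution time. The sketch below assumes rows have already been read from `pg_stat_statements` (the `total_exec_time` column name applies to PostgreSQL 13 and later) and that the view was reset after warmup, as in Phase 4. The class, record and method names are hypothetical, and the rows in `main` are invented for illustration, not taken from the benchmark.

```java
import java.util.Comparator;
import java.util.List;

/** Minimal sketch: rank statements by their share of recorded database time. */
final class StatementCosts {

    /** One row as it might be read from pg_stat_statements. */
    record Row(String query, long calls, double totalExecMs) {}

    /** A statement together with its share of all recorded execution time. */
    record Share(String query, double percentOfDbTime) {}

    /** Query that could produce the rows above. */
    static final String TOP_STATEMENTS_SQL =
        "SELECT query, calls, total_exec_time FROM pg_stat_statements "
            + "ORDER BY total_exec_time DESC LIMIT 10";

    /** Run after warmup so warmup queries do not count toward the measured window. */
    static final String RESET_SQL = "SELECT pg_stat_statements_reset()";

    /** Sorts rows by total execution time, descending, and converts each to a percentage share. */
    static List<Share> rankByShare(List<Row> rows) {
        double total = rows.stream().mapToDouble(Row::totalExecMs).sum();
        if (total <= 0) {
            return List.of();
        }
        return rows.stream()
            .sorted(Comparator.comparingDouble(Row::totalExecMs).reversed())
            .map(r -> new Share(r.query(), 100.0 * r.totalExecMs() / total))
            .toList();
    }

    public static void main(String[] args) {
        // Invented rows: one aggregation and one point lookup.
        List<Row> rows = List.of(
            new Row("SELECT ... GROUP BY ...", 1_200, 90_000.0),
            new Row("SELECT ... WHERE id = $1", 40_000, 10_000.0));
        rankByShare(rows).forEach(System.out::println);
    }
}
```

If a handful of statements account for most of the recorded time, tuning those queries or the connection pool is likely to move the latency tail more than switching runtimes would.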

Trade-offs in Runtime Selection

The author provides a pragmatic decision tree for runtime selection:

Greenfield with Spring Team

Choose Spring Boot

  • Lower adoption friction
  • Strong ecosystem and tooling
  • Better hiring market
  • Better local performance in this benchmark
  • Superior developer experience for teams already familiar with Spring

Organizations with Existing Jakarta EE

Try Payara Micro before migration

  • Zero check failures under pressure
  • Competitive throughput
  • Preserves existing Jakarta EE knowledge
  • Lower migration cost than full rewrite

Jakarta Code Seeking Lightweight Executable

Evaluate Embedded GlassFish

  • More viable than often assumed
  • Lighter than full app server
  • Can be a migration bridge without full rewrite

Implications for Distributed Systems

This benchmark has several important implications for distributed systems architecture:

1. Benchmarking Methodology Matters

The most important conclusion is that "the conclusion changed when the benchmark stopped being convenient and started being defensible." In distributed systems, where performance characteristics change under different load patterns, incomplete benchmarks can lead to expensive architectural mistakes.

2. Context is King

There's no universal "best" runtime. The optimal choice depends on:

  • Team expertise
  • Existing infrastructure
  • Workload characteristics
  • Operational requirements
  • Business constraints

3. Database Interactions Dominate

For many applications, database interactions become the primary performance constraint as load increases. Optimizing the data access layer often yields more significant gains than runtime optimization.

4. Error Rates Under Pressure

At 100 VUs, Spring Boot recorded 25 check failures and Embedded GlassFish recorded 5, while Payara Micro recorded none. This is a critical operational consideration: in production systems, consistent behavior under pressure often matters more than peak throughput.

5. Memory Efficiency vs. Performance

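The results do not show a simple trade-off between memory and speed. Spring Boot had the lowest median RSS at both load levels and the best latency and throughput, but also the highest error rate at 100 VUs (0.04%). Payara Micro used the most memory (roughly 694 to 716 MB) but recorded no check failures. The real trade-off is between footprint and speed on one side and stability under pressure on the other.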

The Pragmatic Approach

The author's approach exemplifies pragmatic systems engineering:

Iterative Improvement

The benchmark evolved through multiple phases, each adding rigor. This iterative approach mirrors how complex systems should be evaluated - gradually increasing test complexity and realism.

Transparency About Limitations

The author openly acknowledges the benchmark's limitations:

  • Single workstation testing
  • DB-heavy workload
  • No long soak tests
  • No Kubernetes or autoscaling testing
  • Familiarity bias with Spring Boot

This transparency is crucial for systems engineering, where overgeneralizing from limited test scenarios can lead to problems in production.

Evidence-Based Decisions

The author emphasizes that "migration decisions should be tested against the real workload, not against intuition or generic benchmarks." This evidence-based approach is essential for distributed systems where intuition often fails under real-world conditions.

Broader Patterns

This benchmark illustrates several broader patterns in distributed systems:

1. The Benchmarking Fallacy

Many benchmarks suffer from the "benchmarking fallacy" - testing synthetic workloads that don't reflect real usage patterns. The author's evolution from Phase 2 to Phase 4 shows how adding controls and realism can completely change conclusions.

2. The Local vs. Production Gap

What performs well on a developer workstation may behave differently in production with network latency, different resource constraints, and real-world load patterns. The Railway smoke test (Phase 5) represents a small step toward bridging this gap.

3. The Ecosystem Effect

Spring Boot's advantage isn't just raw performance - it's the ecosystem, tooling, and community support. In distributed systems, operational advantages often outweigh small performance differences.

4. The Optimization Trap

Teams often optimize the wrong things, focusing on micro-optimizations while ignoring larger architectural issues. This benchmark shows how database interactions dominated performance, not the runtime itself.

Conclusion

The most valuable insight from this benchmark isn't which runtime "won" - it's that fair methodology completely changed the conclusion. In distributed systems, where complexity multiplies at scale, this lesson is particularly important.

For teams making runtime decisions, the key takeaways are:

  1. Benchmark with realistic workloads, not synthetic tests
  2. Include operational factors like error rates under pressure
  3. Consider database interactions, which often dominate performance
  4. Choose runtimes based on team expertise and existing infrastructure
  5. Use iterative benchmarking that gradually increases realism

The author's GitHub repository (enterprise-runtime-lab) contains the complete benchmark methodology and results, allowing others to replicate and extend the work. This transparency exemplifies the best practices in systems engineering - open methodology, evidence-based decisions, and acknowledgment of limitations.

In the end, the benchmark isn't about Spring Boot vs. Jakarta EE. It's about how we make architectural decisions in complex systems - with rigor, transparency, and a clear understanding of what we're actually measuring.
