Benchmarking Java Runtimes: When Methodology Changes Everything

A deep dive into how rigorous benchmarking methodology completely changed conclusions about Java runtime performance, revealing the importance of fair comparisons in distributed systems decisions.

The benchmark that changed the author's mind about Jakarta EE in 2026 isn't really about Jakarta EE at all. It's about what happens when a comparison moves from a convenient narrative to defensible measurement. The story shows how methodology can flip conclusions about runtime performance, which matters for anyone making architectural decisions in the Java ecosystem.

The Problem with Benchmarking

Benchmarking distributed systems is notoriously difficult. The initial results from this lab showed Embedded GlassFish outperforming Spring Boot and Payara Micro, but that conclusion rested on an incomplete methodology. The author acknowledged that stopping there would be "methodologically weak", a crucial admission in systems engineering, where incomplete data often leads to incorrect decisions.

The core problem is that benchmarks rarely reflect real-world complexity. They frequently miss critical factors like:

Proper warmup periods
Consistent runtime environments
Database cost attribution
Memory management under pressure
Error rates under load

These factors become increasingly important as systems scale, making initial benchmark results potentially misleading for production decisions.

The Evolution of a Fair Benchmark

The author's approach evolved through five phases; the four summarized here (Phases 2 through 5) each added rigor to the comparison:

Phase 2: The Initial Temptation

The first complete benchmark showed Embedded GlassFish leading in p95 latency and throughput. However, this phase lacked critical controls:

No database cost attribution
Inconsistent JDK versions
No separate warmup period
Short measurement windows
Inconsistent pool configurations

These omissions meant the results, while interesting, couldn't support definitive conclusions about runtime performance.
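
One way to picture the missing controls is a harness that keeps warmup and measurement as two separate windows. The Java sketch below is an illustration under assumptions, not the author's tooling: `WarmupThenMeasure`, its `run` method, and the injected clock are hypothetical names, and a real load test would drive HTTP traffic from an external tool rather than call a `Runnable` in-process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical harness: names and structure are illustrative, not taken from the author's lab. */
final class WarmupThenMeasure {

    /**
     * Runs the task repeatedly for warmupNanos and discards those timings,
     * then keeps running for measureNanos and records each call's latency.
     * The clock is injected so the two windows can be tested without real waiting.
     */
    static List<Long> run(Runnable task, long warmupNanos, long measureNanos, LongSupplier clock) {
        long warmupEnd = clock.getAsLong() + warmupNanos;
        while (clock.getAsLong() < warmupEnd) {
            task.run(); // warmup: lets JIT compilation and pools settle, nothing is recorded
        }

        List<Long> latencies = new ArrayList<>();
        long measureEnd = clock.getAsLong() + measureNanos;
        while (clock.getAsLong() < measureEnd) {
            long start = clock.getAsLong();
            task.run();
            latencies.add(clock.getAsLong() - start);
        }
        return latencies;
    }

    public static void main(String[] args) {
        // Short windows for a quick demo; a serious run would use far longer ones.
        List<Long> samples = run(() -> Math.sqrt(System.nanoTime()),
                50_000_000L, 100_000_000L, System::nanoTime);
        System.out.println("measured samples: " + samples.size());
    }
}
```

The design point is that warmup samples never reach the result list, so a short or absent warmup cannot quietly inflate the early latencies of whichever runtime starts slowest.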

Phase 3: Adding Causality

This phase added the measurements needed to explain the results rather than just report them:

Multiple virtual user (VU) levels
Database cost attribution with pg_stat_statements
RSS measurements
GC logging

A key issue emerged: Payara Micro complained about an unsupported JDK in some runs, introducing an unfair comparison that needed resolution.
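
A mismatch like this is easier to catch if every run records which JVM it actually used. As a rough sketch (the `JvmFingerprint` record and its methods are hypothetical, not part of the author's lab), each runtime could capture a fingerprint at startup, and a run could be set aside when fingerprints differ:

```java
/** Hypothetical guard; the record and method names are assumptions for illustration. */
record JvmFingerprint(String vendor, String version, long maxHeapBytes) {

    /** Captures the JVM this code is currently running on. */
    static JvmFingerprint current() {
        return new JvmFingerprint(
                System.getProperty("java.vendor"),
                Runtime.version().toString(),
                Runtime.getRuntime().maxMemory());
    }

    /** Two runs are treated as comparable only if JDK build and heap ceiling match. */
    boolean comparableTo(JvmFingerprint other) {
        return vendor.equals(other.vendor)
                && version.equals(other.version)
                && maxHeapBytes == other.maxHeapBytes;
    }

    public static void main(String[] args) {
        System.out.println(current());
    }
}
```

Storing the fingerprint next to each result file would make it possible to reject a comparison after the fact, instead of relying on someone noticing a warning in a startup log.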

Phase 4: The Fair Benchmark

The final defensible benchmark included:

Temurin 21.0.10 for all runtimes
Fixed heap size (-Xms512m -Xmx512m)
Separate warmup period
180-second measured window
Explicit pool settings
pg_stat_statements reset after warmup
Three runs per runtime/VU combination
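
The controls above amount to a run matrix in which only the runtime and the load level vary. A minimal Java sketch of that matrix follows; `RunPlan` and its fields are hypothetical names, the two VU levels are the ones that appear in the results table (the lab may have used others), and the warmup length is left out because the source does not state it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical run plan; class and field names are illustrative, not from the author's repository. */
final class RunPlan {

    record Run(String runtime, int virtualUsers, int repetition,
               List<String> jvmFlags, int measuredSeconds) {}

    static final List<String> RUNTIMES =
            List.of("Spring Boot", "Payara Micro", "Embedded GlassFish");
    static final List<Integer> VU_LEVELS = List.of(25, 100);               // levels shown in the results table
    static final List<String> JVM_FLAGS = List.of("-Xms512m", "-Xmx512m"); // fixed heap for every runtime
    static final int RUNS_PER_COMBINATION = 3;
    static final int MEASURED_SECONDS = 180;

    /** Expands the matrix so every runtime gets identical settings at every load level. */
    static List<Run> build() {
        List<Run> plan = new ArrayList<>();
        for (String runtime : RUNTIMES) {
            for (int vus : VU_LEVELS) {
                for (int rep = 1; rep <= RUNS_PER_COMBINATION; rep++) {
                    plan.add(new Run(runtime, vus, rep, JVM_FLAGS, MEASURED_SECONDS));
                }
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        build().forEach(System.out::println);
    }
}
```

Generating the plan from shared constants, rather than configuring each runtime by hand, is one way to keep heap size and window length from drifting between runtimes.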

The results at this stage showed a different picture:

Runtime	VUs	Median p50	Median p95	Median p99	Throughput	Error Rate	Check Failures	Median RSS
Spring Boot	25	4.59 ms	66.92 ms	110.03 ms	213.13 req/s	0.01%	2	517.5 MB
Payara Micro	25	33.10 ms	188.16 ms	336.77 ms	156.48 req/s	0.00%	0	694.3 MB
Embedded GlassFish	25	38.03 ms	198.83 ms	371.96 ms	151.26 req/s	0.00%	0	579.1 MB
Spring Boot	100	149.36 ms	341.69 ms	473.41 ms	372.56 req/s	0.04%	25	543.0 MB
Payara Micro	100	204.61 ms	588.31 ms	870.53 ms	284.29 req/s	0.00%	0	715.7 MB
Embedded GlassFish	100	320.12 ms	540.00 ms	677.23 ms	229.28 req/s	0.01%	5	593.9 MB

At 25 VUs, Spring Boot clearly led in median latency and throughput with lower RSS. At 100 VUs, Spring Boot maintained better p95/p99 and median throughput, though with some check failures. Payara Micro had zero check failures in all tests, making it the "cleanest" Jakarta EE option. Embedded GlassFish remained technically viable but didn't lead in the final phase.
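
For readers who want to reproduce columns like "Median p95", here is a small Java sketch of the arithmetic. It assumes each column is the median of the per-run percentile across the three runs, and it uses the nearest-rank percentile definition; the load tool behind the original numbers may interpolate differently, so small deviations would be expected. `LatencyStats` is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical helper; the percentile method is an assumption, not the author's tooling. */
final class LatencyStats {

    /** Nearest-rank percentile: the smallest sample with at least pct percent of samples at or below it. */
    static double percentile(double[] samples, double pct) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Median of a small set of values, for example one percentile from each of three runs. */
    static double median(double... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up per-run p95 values in milliseconds, for illustration only.
        double[] p95PerRun = {12.0, 15.0, 11.0};
        System.out.println("median p95 across runs: " + median(p95PerRun));
    }
}
```

Taking the median of three runs rather than the mean keeps a single noisy run from moving the reported figure.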

Phase 5: External Validation

A smoke test on Railway confirmed all runtimes could be deployed externally and pass basic functionality checks, but this wasn't used for performance comparisons.

The Database Factor

A crucial insight from the benchmark was that the system was database-heavy, with analytical aggregations dominating the latency tail under pressure. The pg_stat_statements output clearly showed that much of the performance difference came from database interactions rather than pure runtime performance.
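
Attribution of this kind can be sketched with plain JDBC. The example assumes the `pg_stat_statements` extension is installed and that the server is PostgreSQL 13 or later, where the timing columns are named `total_exec_time` and `mean_exec_time`. The class and method names are hypothetical, and the author's actual queries are not shown in the source.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical helper for database cost attribution; not the author's code. */
final class DbCostAttribution {

    // Clears accumulated statistics so only the measured window is counted.
    static final String RESET_SQL = "SELECT pg_stat_statements_reset()";

    // Ranks statements by total execution time (column names as of PostgreSQL 13+).
    static final String TOP_STATEMENTS_SQL =
            "SELECT query, calls, total_exec_time, mean_exec_time "
          + "FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10";

    /** Call once warmup has finished and before the measured window starts. */
    static void resetAfterWarmup(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(RESET_SQL);
        }
    }

    /** Call after the measured window to see which statements consumed the time. */
    static void printTopStatements(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(TOP_STATEMENTS_SQL)) {
            while (rs.next()) {
                System.out.printf("%10.1f ms total  %8d calls  %s%n",
                        rs.getDouble("total_exec_time"), rs.getLong("calls"), rs.getString("query"));
            }
        }
    }

    public static void main(String[] args) {
        // No database is contacted here; this only prints the statements the helper would run.
        System.out.println(RESET_SQL);
        System.out.println(TOP_STATEMENTS_SQL);
    }
}
```

If the same few aggregation queries top this list for every runtime, the latency tail is largely a database cost that no runtime switch will remove.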

This is a critical lesson for distributed systems: as systems scale, the database often becomes the bottleneck, not the application server. Optimizing JDBC connection pools, query performance, and database configuration can yield more significant gains than switching runtimes.

Trade-offs in Runtime Selection

The author provides a pragmatic decision tree for runtime selection:

Greenfield with Spring Team

Choose Spring Boot

Lower adoption friction
Strong ecosystem and tooling
Better hiring market
Better local performance in this benchmark
Superior developer experience for teams already familiar with Spring

Organizations with Existing Jakarta EE

Try Payara Micro before migration

Zero check failures under pressure
Competitive throughput
Preserves existing Jakarta EE knowledge
Lower migration cost than full rewrite

Jakarta Code Seeking Lightweight Executable

Evaluate Embedded GlassFish

More viable than often assumed
Lighter than full app server
Can be a migration bridge without full rewrite
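
The three branches can be written down as a small lookup, which underlines that the recommendation is a first candidate per situation rather than a ranking. This is only a sketch; the enum and method names are hypothetical.

```java
/** Hypothetical encoding of the decision tree above; names are illustrative. */
final class RuntimeAdvisor {

    enum Situation {
        GREENFIELD_SPRING_TEAM,
        EXISTING_JAKARTA_EE,
        JAKARTA_LIGHTWEIGHT_EXECUTABLE
    }

    /** Returns the runtime to evaluate first, not a final verdict. */
    static String firstCandidate(Situation situation) {
        return switch (situation) {
            case GREENFIELD_SPRING_TEAM -> "Spring Boot";
            case EXISTING_JAKARTA_EE -> "Payara Micro";
            case JAKARTA_LIGHTWEIGHT_EXECUTABLE -> "Embedded GlassFish";
        };
    }

    public static void main(String[] args) {
        for (Situation s : Situation.values()) {
            System.out.println(s + " -> " + firstCandidate(s));
        }
    }
}
```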

Implications for Distributed Systems

This benchmark has several important implications for distributed systems architecture:

1. Benchmarking Methodology Matters

The most important conclusion is that "the conclusion changed when the benchmark stopped being convenient and started being defensible." In distributed systems, where performance characteristics change under different load patterns, incomplete benchmarks can lead to expensive architectural mistakes.

2. Context is King

There's no universal "best" runtime. The optimal choice depends on:

Team expertise
Existing infrastructure
Workload characteristics
Operational requirements
Business constraints

3. Database Interactions Dominate

For many applications, database interactions become the primary performance constraint as load increases. Optimizing the data access layer often yields more significant gains than runtime optimization.

4. Error Rates Under Pressure

Spring Boot recorded check failures at both load levels (2 at 25 VUs and 25 at 100 VUs), and Embedded GlassFish recorded 5 at 100 VUs, while Payara Micro recorded none. This is a critical operational consideration: in production systems, consistent behavior under pressure often matters more than peak throughput.

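5. Memory Efficiency vs. Stability

In this benchmark the trade-off was not between memory and speed but between footprint and stability. Spring Boot had the lowest median RSS at both load levels (517.5 MB at 25 VUs, 543.0 MB at 100 VUs) along with the best latency and throughput, but also the highest error rate under pressure (0.04% at 100 VUs). Payara Micro used the most memory (694.3 MB and 715.7 MB) but recorded no errors or check failures.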

The Pragmatic Approach

The author's approach exemplifies pragmatic systems engineering:

Iterative Improvement

The benchmark evolved through multiple phases, each adding rigor. This iterative approach mirrors how complex systems should be evaluated - gradually increasing test complexity and realism.

Transparency About Limitations

The author openly acknowledges the benchmark's limitations:

Single workstation testing
DB-heavy workload
No long soak tests
No Kubernetes or autoscaling testing
Familiarity bias with Spring Boot

This transparency is crucial for systems engineering, where overgeneralizing from limited test scenarios can lead to problems in production.

Evidence-Based Decisions

The author emphasizes that "migration decisions should be tested against the real workload, not against intuition or generic benchmarks." This evidence-based approach is essential for distributed systems where intuition often fails under real-world conditions.

Broader Patterns

This benchmark illustrates several broader patterns in distributed systems:

1. The Benchmarking Fallacy

Many benchmarks suffer from the "benchmarking fallacy": treating results from synthetic or loosely controlled tests as if they reflected real usage. The progression from Phase 2 to Phase 4 shows how adding controls and realism can completely change conclusions.

2. The Local vs. Production Gap

What performs well on a developer workstation may behave differently in production with network latency, different resource constraints, and real-world load patterns. The Railway smoke test (Phase 5) represents a small step toward bridging this gap.

3. The Ecosystem Effect

Spring Boot's advantage isn't just raw performance - it's the ecosystem, tooling, and community support. In distributed systems, operational advantages often outweigh small performance differences.

4. The Optimization Trap

Teams often optimize the wrong things, focusing on micro-optimizations while ignoring larger architectural issues. In this benchmark, database interactions rather than the runtime itself dominated the latency tail under pressure.

Conclusion

The most valuable insight from this benchmark isn't which runtime "won" - it's that fair methodology completely changed the conclusion. In distributed systems, where complexity multiplies at scale, this lesson is particularly important.

For teams making runtime decisions, the key takeaways are:

Benchmark with realistic workloads, not synthetic tests
Include operational factors like error rates under pressure
Consider database interactions, which often dominate performance
Choose runtimes based on team expertise and existing infrastructure
Use iterative benchmarking that gradually increases realism

The author's GitHub repository (enterprise-runtime-lab) contains the complete benchmark methodology and results, allowing others to replicate and extend the work. This transparency exemplifies the best practices in systems engineering - open methodology, evidence-based decisions, and acknowledgment of limitations.

In the end, the benchmark isn't about Spring Boot vs. Jakarta EE. It's about how we make architectural decisions in complex systems - with rigor, transparency, and a clear understanding of what we're actually measuring.
