A deep dive into how rigorous benchmarking methodology completely changed conclusions about Java runtime performance, revealing the importance of fair comparisons in distributed systems decisions.
The benchmark that made me change my mind about Jakarta EE in 2026 isn't really about Jakarta EE at all. It's about what happens when we move from convenient narratives to defensible measurements in distributed systems. This story reveals how methodology can completely flip conclusions about runtime performance, which matters for anyone making architectural decisions in Java ecosystems.
The Problem with Benchmarking
Benchmarking distributed systems is notoriously difficult. The initial results from this lab showed Embedded GlassFish outperforming Spring Boot and Payara Micro, but that result rested on incomplete methodology. The author acknowledged that drawing a conclusion from it would be "methodologically weak" - a crucial admission in systems engineering, where incomplete data often leads to incorrect decisions.
The core problem is that benchmarks rarely reflect real-world complexity. They frequently miss critical factors like:
- Proper warmup periods
- Consistent runtime environments
- Database cost attribution
- Memory management under pressure
- Error rates under load
These factors become increasingly important as systems scale, making initial benchmark results potentially misleading for production decisions.
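To make the first factor concrete, here is a minimal sketch, not taken from the lab, of how discarding warmup samples can change a reported p95. The `LatencyWindow` class, the `Sample` record, and the synthetic data are hypothetical, and the percentile uses the nearest-rank definition, which may differ from what the author's load-testing tool reports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LatencyWindow {
    /** One observation: seconds since test start, and the request latency in ms. */
    record Sample(double atSeconds, double latencyMs) {}

    /** Keeps only the latencies of samples taken at or after the end of warmup. */
    static double[] measured(List<Sample> samples, double warmupSeconds) {
        return samples.stream()
                .filter(s -> s.atSeconds() >= warmupSeconds)
                .mapToDouble(Sample::latencyMs)
                .toArray();
    }

    /** Nearest-rank percentile: the value at rank ceil(p * n / 100) in sorted order. */
    static double percentile(double[] values, double p) {
        if (values.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical data: 30 slow requests while the JVM is cold, then steady state.
        List<Sample> samples = new ArrayList<>();
        for (int i = 0; i < 30; i++) samples.add(new Sample(i, 400.0));
        for (int i = 30; i < 130; i++) samples.add(new Sample(i, 10.0 + (i - 30)));

        System.out.println("p95, warmup included:  " + percentile(measured(samples, 0), 95));
        System.out.println("p95, warmup discarded: " + percentile(measured(samples, 30), 95));
    }
}
```

With this synthetic data the p95 is 400 ms when the cold-start samples are included and 104 ms once they are discarded, so a benchmark without a separate warmup can mostly be measuring startup behavior.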
The Evolution of a Fair Benchmark
The author's approach evolved through five phases, each adding rigor to the comparison. Phases 2 through 5 are summarized here:
Phase 2: The Initial Temptation
The first complete benchmark showed Embedded GlassFish leading in p95 latency and throughput. However, this phase lacked critical controls:
- No database cost attribution
- Inconsistent JDK versions
- No separate warmup period
- Short measurement windows
- Inconsistent pool configurations
These omissions meant the results, while interesting, couldn't support definitive conclusions about runtime performance.
Phase 3: Adding Causality
This phase added measurements meant to explain the results rather than just report them:
- Multiple virtual user (VU) levels
- Database cost attribution with pg_stat_statements
- RSS measurements
- GC logging
A key issue emerged: Payara Micro complained about an unsupported JDK in some runs, which made the comparison unfair until every runtime was run on the same JDK.
Phase 4: The Fair Benchmark
The final defensible benchmark included:
- Temurin 21.0.10 for all runtimes
- Fixed heap size (-Xms512m -Xmx512m)
- Separate warmup period
- 180-second measured window
- Explicit pool settings
- pg_stat_statements reset after warmup
- Three runs per runtime/VU combination
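Repeating each combination three times only helps if the runs are combined in a way that tolerates one noisy run. The "Median" column headers in the results suggest the median was used, though the source does not spell out the aggregation. The sketch below shows that choice with a hypothetical `RunAggregator` class and made-up per-run values.

```java
import java.util.Arrays;

final class RunAggregator {
    /** Median of the per-run values; for an even count, the mean of the two middle values. */
    static double median(double... runs) {
        if (runs.length == 0) throw new IllegalArgumentException("no runs");
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Arithmetic mean, shown only for comparison with the median. */
    static double mean(double... runs) {
        return Arrays.stream(runs).average().orElseThrow();
    }

    public static void main(String[] args) {
        // Hypothetical p95 latencies (ms) from three runs of one runtime/VU combination.
        double[] p95PerRun = {181.0, 240.5, 188.2};
        System.out.println("median p95 = " + median(p95PerRun) + " ms");
        System.out.println("mean p95   = " + mean(p95PerRun) + " ms");
    }
}
```

With three runs, the median simply ignores the single outlier, whereas the mean is pulled toward it. Three runs is still a small sample, so the median says little about run-to-run variance.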
The results at this stage showed a different picture:
| Runtime | VUs | Median p50 | Median p95 | Median p99 | Throughput | Error Rate | Check Failures | Median RSS |
|---|---|---|---|---|---|---|---|---|
| Spring Boot | 25 | 4.59 ms | 66.92 ms | 110.03 ms | 213.13 req/s | 0.01% | 2 | 517.5 MB |
| Payara Micro | 25 | 33.10 ms | 188.16 ms | 336.77 ms | 156.48 req/s | 0.00% | 0 | 694.3 MB |
| Embedded GlassFish | 25 | 38.03 ms | 198.83 ms | 371.96 ms | 151.26 req/s | 0.00% | 0 | 579.1 MB |
| Spring Boot | 100 | 149.36 ms | 341.69 ms | 473.41 ms | 372.56 req/s | 0.04% | 25 | 543.0 MB |
| Payara Micro | 100 | 204.61 ms | 588.31 ms | 870.53 ms | 284.29 req/s | 0.00% | 0 | 715.7 MB |
| Embedded GlassFish | 100 | 320.12 ms | 540.00 ms | 677.23 ms | 229.28 req/s | 0.01% | 5 | 593.9 MB |
At 25 VUs, Spring Boot led on every latency percentile and on throughput, with the lowest RSS. At 100 VUs it kept the best p50, p95, p99, and throughput, but recorded 25 check failures. Payara Micro recorded zero check failures at both load levels, making it the "cleanest" Jakarta EE option. Embedded GlassFish had lower p95 and p99 than Payara Micro at 100 VUs but also lower throughput, so it remained technically viable without leading in the final phase.
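To put sizes on these gaps, the following sketch computes relative differences from values copied out of the table above. The `RelativeGap` class and its method name are illustrative only and are not part of the lab.

```java
final class RelativeGap {
    /** Percentage by which candidate exceeds baseline; negative when it is lower. */
    static double percentDifference(double candidate, double baseline) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Values from the Phase 4 table: Spring Boot vs. Payara Micro.
        System.out.printf("25 VUs:  Spring Boot throughput is %.1f%% higher%n",
                percentDifference(213.13, 156.48));
        System.out.printf("25 VUs:  Payara Micro p95 is %.1f%% higher%n",
                percentDifference(188.16, 66.92));
        System.out.printf("100 VUs: Spring Boot throughput is %.1f%% higher%n",
                percentDifference(372.56, 284.29));
        System.out.printf("100 VUs: Payara Micro p95 is %.1f%% higher%n",
                percentDifference(588.31, 341.69));
    }
}
```

On these numbers the throughput gap is roughly 36% at 25 VUs and 31% at 100 VUs, while the p95 gap narrows from about 181% to about 72%. These are medians of three runs on one workstation, so the exact percentages should be read as indicative.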
Phase 5: External Validation
A smoke test on Railway confirmed all runtimes could be deployed externally and pass basic functionality checks, but this wasn't used for performance comparisons.
The Database Factor
A crucial insight from the benchmark was that the system was database-heavy, with analytical aggregations dominating the latency tail under pressure. The pg_stat_statements output clearly showed that much of the performance difference came from database interactions rather than pure runtime performance.
This is a critical lesson for distributed systems: as systems scale, the database often becomes the bottleneck, not the application server. Optimizing JDBC connection pools, query performance, and database configuration can yield more significant gains than switching runtimes.
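As a rough illustration of database cost attribution, the sketch below ranks statements by their share of cumulative execution time, the kind of figure that the `calls` and `total_exec_time` columns of `pg_stat_statements` (PostgreSQL 13 and later) make available. The `DbAttribution` class, the `Statement` record, and all rows are hypothetical; the lab's real queries and numbers are not reproduced here, and a real harness would read the rows over JDBC rather than hard-code them.

```java
import java.util.Comparator;
import java.util.List;

final class DbAttribution {
    /** Subset of one pg_stat_statements row: text, call count, cumulative time in ms. */
    record Statement(String query, long calls, double totalExecTimeMs) {}

    /** Share of the summed execution time attributable to one statement, in percent. */
    static double sharePercent(Statement s, List<Statement> all) {
        double total = all.stream().mapToDouble(Statement::totalExecTimeMs).sum();
        return total == 0 ? 0 : s.totalExecTimeMs() / total * 100.0;
    }

    /** Statements ordered by cumulative execution time, most expensive first. */
    static List<Statement> byCost(List<Statement> all) {
        return all.stream()
                .sorted(Comparator.comparingDouble(Statement::totalExecTimeMs).reversed())
                .toList();
    }

    public static void main(String[] args) {
        // Hypothetical rows, as if captured after a reset at the end of warmup.
        List<Statement> rows = List.of(
                new Statement("SELECT ... WHERE id = $1", 30_000, 800.0),
                new Statement("SELECT ... GROUP BY ... (analytical aggregation)", 1_200, 9_000.0),
                new Statement("INSERT INTO ... VALUES (...)", 5_000, 200.0));

        for (Statement s : byCost(rows)) {
            System.out.printf("%5.1f%%  %6d calls  %s%n",
                    sharePercent(s, rows), s.calls(), s.query());
        }
    }
}
```

In this made-up data the aggregation accounts for 90% of execution time despite being the least frequently called statement, which is the pattern the article describes: a few heavy queries can dominate the latency tail regardless of which runtime issues them.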
Trade-offs in Runtime Selection
The author provides a pragmatic decision tree for runtime selection:
Greenfield with Spring Team
Choose Spring Boot
- Lower adoption friction
- Strong ecosystem and tooling
- Better hiring market
- Better local performance in this benchmark
- Superior developer experience for teams already familiar with Spring
Organizations with Existing Jakarta EE
Try Payara Micro before migration
- Zero check failures under pressure
- Higher throughput than Embedded GlassFish at both load levels
- Preserves existing Jakarta EE knowledge
- Lower migration cost than full rewrite
Jakarta Code Seeking Lightweight Executable
Evaluate Embedded GlassFish
- More viable than often assumed
- Lighter than full app server
- Can be a migration bridge without full rewrite
Implications for Distributed Systems
This benchmark has several important implications for distributed systems architecture:
1. Benchmarking Methodology Matters
The most important conclusion is that "the conclusion changed when the benchmark stopped being convenient and started being defensible." In distributed systems, where performance characteristics change under different load patterns, incomplete benchmarks can lead to expensive architectural mistakes.
2. Context is King
There's no universal "best" runtime. The optimal choice depends on:
- Team expertise
- Existing infrastructure
- Workload characteristics
- Operational requirements
- Business constraints
3. Database Interactions Dominate
For many applications, database interactions become the primary performance constraint as load increases. Optimizing the data access layer often yields more significant gains than runtime optimization.
4. Error Rates Under Pressure
Spring Boot showed check failures under high load (25 at 100 VUs), which is a critical operational consideration. In production systems, consistent behavior under pressure often matters more than peak throughput.
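A failure count is easier to judge next to the number of requests behind it. The sketch below estimates that denominator as throughput multiplied by the 180-second measured window. This assumes the check-failure count and the throughput figure describe the same window of a single run, which the source does not confirm; `CheckFailureRate` is a hypothetical helper.

```java
final class CheckFailureRate {
    /** Length of the measured window in Phase 4, in seconds. */
    static final double MEASURED_WINDOW_SECONDS = 180.0;

    /** Approximate request count, assuming throughput held for the whole window. */
    static double approxRequests(double throughputPerSecond) {
        return throughputPerSecond * MEASURED_WINDOW_SECONDS;
    }

    /** Check failures as a percentage of the approximate request count. */
    static double failurePercent(long failures, double throughputPerSecond) {
        return failures / approxRequests(throughputPerSecond) * 100.0;
    }

    public static void main(String[] args) {
        // Failure counts and throughput values copied from the Phase 4 table.
        System.out.printf("Spring Boot, 25 VUs:         %.3f%%%n", failurePercent(2, 213.13));
        System.out.printf("Spring Boot, 100 VUs:        %.3f%%%n", failurePercent(25, 372.56));
        System.out.printf("Embedded GlassFish, 100 VUs: %.3f%%%n", failurePercent(5, 229.28));
    }
}
```

Under that assumption the three rows work out to roughly 0.005%, 0.037%, and 0.012%, which is in line with the 0.01%, 0.04%, and 0.01% in the Error Rate column after rounding. The failures are real but rare, so whether they matter depends on what the failed checks were and on the service's error budget.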
5. Memory Efficiency vs. Performance
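In this benchmark, the trade-off was between memory footprint and stability under pressure rather than between memory and speed. Spring Boot had the lowest RSS at both load levels and the best latency and throughput, but it recorded check failures at 100 VUs. Payara Micro had the highest RSS but recorded no check failures at all.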
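One detail worth keeping in mind when reading the RSS column: every runtime ran with a fixed 512 MB heap, yet all reported RSS values are above 512 MB, because resident memory also includes non-heap areas such as metaspace and the JIT code cache. The sketch below uses the standard `java.lang.management` API to show that split from inside a JVM. It is an illustration only, not how the lab measured RSS, and `JvmMemorySnapshot` is a hypothetical name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

final class JvmMemorySnapshot {
    /** Converts bytes to whole mebibytes (rounded down). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Committed heap and non-heap memory as seen by the JVM itself. */
    static String describe() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        MemoryUsage nonHeap = bean.getNonHeapMemoryUsage();
        return String.format("heap committed=%d MiB, non-heap committed=%d MiB",
                toMiB(heap.getCommitted()), toMiB(nonHeap.getCommitted()));
    }

    public static void main(String[] args) {
        // Run with -Xms512m -Xmx512m to mirror the heap settings used in Phase 4.
        System.out.println(describe());
    }
}
```

These figures still undercount RSS, since thread stacks, direct buffers, and other native allocations are not included in them. That is one reason frameworks that load more classes or start more threads can show a larger footprint at the same heap size.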
The Pragmatic Approach
The author's approach exemplifies pragmatic systems engineering:
Iterative Improvement
The benchmark evolved through multiple phases, each adding rigor. This iterative approach mirrors how complex systems should be evaluated - gradually increasing test complexity and realism.
Transparency About Limitations
The author openly acknowledges the benchmark's limitations:
- Single workstation testing
- DB-heavy workload
- No long soak tests
- No Kubernetes or autoscaling testing
- Familiarity bias with Spring Boot
This transparency is crucial for systems engineering, where overgeneralizing from limited test scenarios can lead to problems in production.
Evidence-Based Decisions
The author emphasizes that "migration decisions should be tested against the real workload, not against intuition or generic benchmarks." This evidence-based approach is essential for distributed systems where intuition often fails under real-world conditions.
Broader Patterns
This benchmark illustrates several broader patterns in distributed systems:
1. The Benchmarking Fallacy
Many benchmarks suffer from the "benchmarking fallacy" - testing synthetic workloads that don't reflect real usage patterns. The benchmark's evolution from Phase 2 to Phase 4 shows how adding controls (a single JDK, a fixed heap, a separate warmup, repeated runs) can completely change conclusions.
2. The Local vs. Production Gap
What performs well on a developer workstation may behave differently in production with network latency, different resource constraints, and real-world load patterns. The Railway smoke test (Phase 5) represents a small step toward bridging this gap.
3. The Ecosystem Effect
Spring Boot's advantage isn't just raw performance - it's the ecosystem, tooling, and community support. In distributed systems, operational advantages often outweigh small performance differences.
4. The Optimization Trap
Teams often optimize the wrong things, focusing on micro-optimizations while ignoring larger architectural issues. This benchmark shows how database interactions dominated performance, not the runtime itself.
Conclusion
The most valuable insight from this benchmark isn't which runtime "won" - it's that fair methodology completely changed the conclusion. In distributed systems, where complexity multiplies at scale, this lesson is particularly important.
For teams making runtime decisions, the key takeaways are:
- Benchmark with realistic workloads, not synthetic tests
- Include operational factors like error rates under pressure
- Consider database interactions, which often dominate performance
- Choose runtimes based on team expertise and existing infrastructure
- Use iterative benchmarking that gradually increases realism
The author's GitHub repository (enterprise-runtime-lab) contains the complete benchmark methodology and results, allowing others to replicate and extend the work. This transparency exemplifies the best practices in systems engineering - open methodology, evidence-based decisions, and acknowledgment of limitations.
In the end, the benchmark isn't about Spring Boot vs. Jakarta EE. It's about how we make architectural decisions in complex systems - with rigor, transparency, and a clear understanding of what we're actually measuring.
