For developers, few phrases induce more dread than "it works after restarting." Sumanto Pal's recent debugging saga—spanning months and culminating in an embarrassingly simple fix—exposes a critical Java anti-pattern that could lurk in any enterprise system: the poison pill of global JVM state mutation.

The Perfect Storm: Spark, Jetty, and Questionable Isolation

Pal's team operated a financial data service where a Tomcat server converted UI filters into Spark SQL. Instead of running Spark locally (due to JAR conflicts and deployment complexities), they offloaded execution to a separate Jetty service—a decision that seemed reasonable until connections began mysteriously failing. The architecture looked deceptively robust:

public class FilterToSparkService {
    public QueryResult executeFilter(FilterRequest filter) {
        String sparkSQL = convertFilterToSQL(filter);
        SparkRequest request = new SparkRequest(sparkSQL);
        return jettyClient.execute(request); // Failure point!
    }
}

Random Tomcat instances would suddenly throw ConnectionExceptions when calling the Jetty service, while others hummed along. Restarts provided temporary relief, masking a systemic flaw. Infrastructure teams found no network issues, and load testing couldn't reproduce the bug. "It must be a production-only problem" became a dangerous mantra.

The Breakthrough: Proxy Settings and a Snowflake Connection

After months, Pal reproduced the failure in a lower environment and traced it to JVM-level HTTP proxy settings mysteriously enabled on failing instances. Audit logs revealed the trigger: every failing instance had recently executed a Snowflake query. The culprit code was hiding in plain sight:

public class SnowflakeQueryService {
    public ResultSet executeQuery(String sql) throws SQLException {
        // System properties are JVM-wide and stay set after this method returns.
        System.setProperty("http.useProxy", "true"); // Global mutation!
        System.setProperty("http.proxyHost", proxyHost);
        System.setProperty("http.proxyPort", proxyPort);
        Connection conn = DriverManager.getConnection(snowflakeUrl, props);
        // ...
    }
}

This Snowflake call set JVM-wide proxy configurations, poisoning all subsequent HTTP traffic—including calls to the internal Jetty service. Since the corporate proxy couldn't route to Jetty, connections failed until the JVM restarted. Pal dubbed this the "poison pill pattern": one request corrupting shared state for all others.
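To make the mechanism concrete, the following minimal sketch asks the JVM's default ProxySelector how a request to an internal host would be routed, before and after a proxy system property is set on another thread. The host names (jetty.internal.example, corp-proxy.example), the port numbers, and the class name are hypothetical. It assumes the JVM starts with no proxy configured and that the HTTP client in use consults the default selector, as HttpURLConnection does; the article does not say which client the Tomcat service used.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class PoisonPillDemo {

    // Ask the JVM-wide default selector how a request to this URI would be routed.
    public static Proxy routeFor(String uri) {
        List<Proxy> proxies = ProxySelector.getDefault().select(URI.create(uri));
        return proxies.get(0);
    }

    // Stand-in for the Snowflake code path: mutates JVM-wide proxy properties.
    public static void simulateSnowflakeCall() {
        System.setProperty("http.proxyHost", "corp-proxy.example"); // hypothetical proxy
        System.setProperty("http.proxyPort", "3128");
    }

    public static void main(String[] args) throws InterruptedException {
        String internal = "http://jetty.internal.example:8080/query"; // hypothetical service

        System.out.println("before: " + routeFor(internal).type());

        // Run the mutation on a different thread, as a separate request would.
        Thread snowflakeRequest = new Thread(PoisonPillDemo::simulateSnowflakeCall);
        snowflakeRequest.start();
        snowflakeRequest.join();

        // The main thread never touched the proxy settings, yet its routing changed.
        System.out.println("after:  " + routeFor(internal).type());
    }
}
```

On a JVM started without proxy settings, this should report a DIRECT route first and an HTTP proxy route second, and the second answer persists for the life of the process unless something clears the properties.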

The Fix: Containing the Contagion

The solution was elegant: stop polluting global state. Instead of System.setProperty(), Pal used Snowflake's driver-specific connection properties:

Properties props = new Properties();
props.put("user", username);
props.put("password", password);
props.put("useProxy", "true");    // Driver-specific
props.put("proxyHost", proxyHost); // No global impact
props.put("proxyPort", proxyPort);
Connection conn = DriverManager.getConnection(snowflakeUrl, props);

This contained proxy settings to the Snowflake connection, eliminating collateral damage. Post-deployment, connection failures vanished entirely.
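The same containment idea can be applied on the calling side, so that an internal client does not depend on nobody else touching global properties. The sketch below is not from Pal's write-up: it assumes Java 11+ and that the Jetty service is called through java.net.http.HttpClient, and the class name, method names, and host names are hypothetical.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;

public class ScopedProxyClients {

    // Client for internal services: always connects directly,
    // regardless of any JVM-wide proxy properties.
    public static HttpClient internalClient() {
        return HttpClient.newBuilder()
                .proxy(HttpClient.Builder.NO_PROXY)
                .build();
    }

    // Client for external services: the proxy is configured on this client only.
    public static HttpClient externalClient(String proxyHost, int proxyPort) {
        // createUnresolved avoids a DNS lookup at construction time.
        InetSocketAddress proxy = InetSocketAddress.createUnresolved(proxyHost, proxyPort);
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(proxy))
                .build();
    }

    // Report how a given client would route a request, without sending it.
    public static Proxy.Type routeFor(HttpClient client, String uri) {
        ProxySelector selector = client.proxy().orElse(ProxySelector.getDefault());
        return selector.select(URI.create(uri)).get(0).type();
    }

    public static void main(String[] args) {
        // Even with a poisoned global property, the internal client stays direct.
        System.setProperty("http.proxyHost", "corp-proxy.example"); // hypothetical proxy

        HttpClient internal = internalClient();
        HttpClient external = externalClient("corp-proxy.example", 3128);

        System.out.println(routeFor(internal, "http://jetty.internal.example:8080/query"));
        System.out.println(routeFor(external, "https://api.external.example/data"));
    }
}
```

Pinning the internal client to a direct route is a defensive measure; it complements, rather than replaces, removing the System.setProperty() calls.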

Lessons for Architects and Developers

  1. Audit JVM Mutations: Any System.setProperty() call made while serving requests is a potential poison pill, because it outlives the request that made it. Scrutinize these calls in code reviews.
  2. Isolate State Changes: Code that configures a driver or client should not alter global JVM settings. Use connection-specific parameters instead.
  3. Monitor Shared State: Track JVM property changes in health checks to detect "infection" early.
  4. Correlate Failures with Actions: Had audit logs tied instance failures to Snowflake calls earlier, the bug would have surfaced sooner.
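The monitoring lesson can be sketched as a small watcher that snapshots proxy-related system properties at startup and reports any drift when a health check runs. This is an illustrative assumption about how such a check might look, not something described in the source; the class name, method names, and the list of watched properties are hypothetical and not exhaustive.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class JvmPropertyWatch {

    // Properties worth watching; illustrative, not exhaustive.
    private static final List<String> WATCHED = List.of(
            "http.useProxy", "http.proxyHost", "http.proxyPort",
            "https.proxyHost", "https.proxyPort", "http.nonProxyHosts");

    private final Map<String, String> baseline;

    // Capture the baseline once, ideally right after application startup.
    public JvmPropertyWatch() {
        this.baseline = snapshot();
    }

    private static Map<String, String> snapshot() {
        Map<String, String> values = new LinkedHashMap<>();
        for (String key : WATCHED) {
            values.put(key, System.getProperty(key)); // null means "not set"
        }
        return values;
    }

    // Maps each changed property to a "before -> after" description.
    public Map<String, String> drift() {
        Map<String, String> changes = new LinkedHashMap<>();
        Map<String, String> current = snapshot();
        for (String key : WATCHED) {
            String before = baseline.get(key);
            String after = current.get(key);
            if (!Objects.equals(before, after)) {
                changes.put(key, before + " -> " + after);
            }
        }
        return changes;
    }

    // Healthy means no watched property has changed since the baseline.
    public boolean isHealthy() {
        return drift().isEmpty();
    }
}
```

A health endpoint could call isHealthy() and log drift() when it returns false, which would have flagged the "infected" instances at the moment the Snowflake code path ran rather than when Jetty calls started failing.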

Pal's ordeal underscores a fundamental truth: shared mutable state is radioactive, and a JVM's system properties are shared by every request the process serves. The fix took minutes, but diagnosing it required methodically eliminating assumptions, a testament to why debugging remains one of our most vital skills. As microservices and complex integrations proliferate, this case is a stark reminder: sometimes the deepest bugs hide in the shallowest code.

Source: Sumanto Pal's debugging saga