Grafana's Turbulent Evolution: When Innovation Outpaces Stability in Observability

For years, Grafana stood as a beacon in the observability landscape—a lightweight, open-source darling that democratized monitoring for developers drowning in data. But as Henrik Gerdes recounts in a revealing blog post, what began as an elegant solution has morphed into a whirlwind of disruptive changes, leaving many users questioning its long-term viability. His experience underscores a growing tension in tech: the race for innovation versus the bedrock need for stability in core infrastructure.

The Promise Fades: From Simple Starts to Kubernetes Chaos

Gerdes' journey with Grafana started well. At a small software firm, he deployed a straightforward Docker Compose stack (Loki for logs, Prometheus for metrics, Grafana for visualization) with minimal overhead. "Loki and Prometheus were the perfect fit back then," he writes, praising the setup's simplicity and low resource footprint. Early lessons, like avoiding excessive labels to prevent disk inode exhaustion, were painful but instructive. Grafana Cloud's free tier even won him over for personal projects, cementing his initial loyalty.
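The label lesson is easy to see with a little arithmetic. Loki treats every unique combination of label values as its own stream, and each stream gets its own chunks; on filesystem-backed storage that can mean a very large number of small files. The sketch below is only an illustration: the label names and value counts are hypothetical, not taken from Gerdes' setup, and the product is an upper bound on stream count rather than a measurement.

```python
from math import prod


def estimate_streams(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on log streams: the product of distinct values per label."""
    return prod(distinct_values_per_label.values())


# Hypothetical label sets, chosen only to show the effect of one unbounded label.
low_cardinality = {"app": 5, "env": 3, "level": 4}
high_cardinality = {**low_cardinality, "request_id": 100_000}

print(estimate_streams(low_cardinality))   # 60
print(estimate_streams(high_cardinality))  # 6000000
```

A single unbounded label such as a request ID multiplies the stream count by its number of distinct values, which is why such values usually belong in the log line rather than in a label.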

But as his career advanced into Kubernetes environments, cracks emerged. Scaling needs led him to Grafana's Mimir for long-term storage, and he replaced Prometheus with Grafana Agent for unified telemetry shipping. This seemed logical, given Grafana's earlier successes. Yet stability soon unraveled:

"Grafana liked to change things. They started to build their own observability platform to steal some of DataDogs customers... Software maintenance shows with age."

The Deprecation Treadmill: Alloy, Kafka, and Shifting Sands

Grafana's aggressive expansion birthed a cycle of reinvention that alienated users. Key tools like Grafana OnCall and Grafana Agent were deprecated within a few years of launch. Helm charts ballooned to "6k lines in default state," while UI overhauls (Angular to React) broke existing dashboards. The pivot to Grafana Alloy, an "all-in-one" successor with an HCL-inspired configuration syntax, introduced its own headaches:

  • A custom configuration language that added cognitive overhead ("Not everything needs their own DSL").
  • Only partial support for de facto Kubernetes standards: PrometheusRules were covered, but AlertmanagerConfig compatibility gaps created operational friction.
  • A buggy initial release that demanded constant tuning.

The culmination arrived with Mimir 3.0, which, by Gerdes' account, mandated Apache Kafka for ingestion. For Gerdes, this was a bridge too far: "None of the above things alone would be a reason to ditch Grafana... But all this together makes me uncomfortable." The move felt emblematic of a broader trend: opaque changes pushing users toward Grafana's proprietary fleet management, burying endpoints and complicating self-hosted deployments.

Why This Matters: The Human Cost of Unrelenting Change

At its core, Gerdes' critique isn't about technical capability. He concedes, "Mimir, Loki and Grafana are technically really good software products." Instead, it's a warning about velocity: "The pace within Grafana is way too fast for many companies... partially driven by career-driven development." For developers and SREs, monitoring is foundational infrastructure—not a product to constantly reconfigure. Every deprecation or architectural shift (like Kafka dependencies) translates to:

  • Wasted engineering hours on migrations and debugging.
  • Increased risk of outages in critical systems.
  • Erosion of trust, as teams dread the next disruptive announcement.

Gerdes now eyes alternatives such as the kube-prometheus stack with Thanos, and hopes OpenTelemetry (OTel) matures into a "stable and boring" standard. His sentiment resonates industry-wide: in observability, reliability isn't a feature; it's the product. As tools like Grafana chase market share, they risk forgetting that for users, the most innovative solution is often the one that quietly, consistently works.

Source: Henrik Gerdes' personal blog