Infracost's Backend Challenge: Scaling Real-Time Cloud Cost Analysis for Thousands of Engineers

Trends Reporter

Infracost, a YC-backed startup, is hiring a senior backend engineer to tackle the complex technical challenges of scaling real-time infrastructure cost analysis. The role requires deep expertise in Node.js, TypeScript, and PostgreSQL to build systems that parse massive Terraform repositories, prevent AI-generated code from creating costly infrastructure mistakes, and surface thousands of issues across enterprise codebases.

Infracost's recent job posting for a Senior Backend Engineer reveals a company grappling with the technical complexities of "shifting FinOps left"—the practice of identifying and fixing cloud cost issues before they reach production. The role, which offers $90K to $170K for remote work, is less about maintaining existing systems and more about solving scaling challenges that emerge when you try to give engineers real-time cost visibility across thousands of repositories.

The Scaling Problem: From GitHub Organizations to Real-Time Analysis

The company has scaled to support customers with "thousands of GitHub organizations and tens of thousands of repositories." This isn't just a database scaling problem—it's an architectural challenge that required overhauling their entire stack. When you're parsing infrastructure-as-code files (like Terraform) at this scale, you're dealing with:

  • Complex dependency graphs: Terraform configurations can reference other modules, variables, and data sources across repositories. Building a real-time cost analysis engine requires understanding these relationships without introducing massive latency.
  • API performance: The system must surface cost impacts in pull requests quickly enough that developers don't context-switch away from their workflow. This means optimizing query patterns and caching strategies across a distributed system.
  • Onboarding friction: As customers grow from one repository to thousands, the onboarding process itself becomes a bottleneck. The team has had to redesign interfaces and infrastructure to make scaling feel seamless rather than painful.
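To make the dependency-graph point concrete, here is a minimal sketch of ordering modules so that each one is evaluated only after the modules it references. The `ModuleGraph` shape, the module names, and the `topoSort` helper are all hypothetical and chosen for illustration; nothing here is taken from Infracost's actual engine.

```typescript
// Hypothetical module graph: maps a module name to the modules it depends on.
// The shape and names are illustrative assumptions, not Infracost's data model.
type ModuleGraph = Record<string, string[]>;

// Returns modules in an order where every dependency precedes its dependents.
// Throws if the graph contains a cycle.
function topoSort(graph: ModuleGraph): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string): void => {
    const s = state.get(name);
    if (s === "done") return;
    if (s === "visiting") throw new Error(`dependency cycle at ${name}`);
    state.set(name, "visiting");
    for (const dep of graph[name] ?? []) visit(dep);
    state.set(name, "done");
    order.push(name);
  };

  for (const name of Object.keys(graph)) visit(name);
  return order;
}

const exampleGraph: ModuleGraph = {
  app: ["network", "database"],
  database: ["network"],
  network: [],
};
const evaluationOrder = topoSort(exampleGraph);
```

With an order like this, a cost engine could evaluate each module once and reuse the result for every dependent, which is one plausible way to keep latency down on large repositories.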

The backend engineer will need to write "complex PostgreSQL queries" and understand query plans intimately. Knowing SQL syntax is not enough here: the harder skill is recognizing when a query that works for 100 repositories will grind to a halt with 10,000. The posting's mention of window functions and CTEs (common table expressions) suggests the team is dealing with analytical queries that aggregate cost data across time periods, teams, or infrastructure types.
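As an illustration of the kind of aggregation a window function expresses, the sketch below computes a per-team running total in memory, roughly what `SUM(cost) OVER (PARTITION BY team ORDER BY month)` would return in PostgreSQL. The `MonthlyCost` fields and the `runningTotalByTeam` helper are assumptions made for the example; the posting does not describe the real schema or queries.

```typescript
// Hypothetical row shape; month is a "YYYY-MM" string so it sorts lexically.
interface MonthlyCost {
  team: string;
  month: string;
  cost: number;
}

interface RunningCost extends MonthlyCost {
  runningTotal: number;
}

// In-memory analogue of SUM(cost) OVER (PARTITION BY team ORDER BY month).
function runningTotalByTeam(rows: MonthlyCost[]): RunningCost[] {
  // Sort by partition key first, then by the ordering column.
  const sorted = [...rows].sort(
    (a, b) => a.team.localeCompare(b.team) || a.month.localeCompare(b.month),
  );
  const totals = new Map<string, number>();
  return sorted.map((row) => {
    const total = (totals.get(row.team) ?? 0) + row.cost;
    totals.set(row.team, total);
    return { ...row, runningTotal: total };
  });
}
```

In production this work would normally stay in the database, where the query planner can use indexes; the in-memory version only shows what the window computes.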

The AI Code Generation Challenge: Safety Nets for Infrastructure

One of the more interesting technical challenges Infracost is tackling involves AI-generated code. The company notes: "With infrastructure changes there's a lower tolerance for AI-generated slop—there's not the same safety nets in terms of testing and the risk is often higher." This reveals a nuanced understanding of how AI code generation tools are being adopted in infrastructure contexts.

Their solution combines "AI-generated changes with our best-in-class static analysis engine to robustly open good-quality PRs to fix the most important issues." This hybrid approach suggests several technical considerations:

  1. Static analysis integration: The backend system must parse both human-written and AI-generated Terraform configurations, applying rules that catch not just syntax errors but cost anti-patterns (like untagged resources or oversized instances).
  2. Quality scoring: Not all AI-generated fixes are equal. The system likely assigns confidence scores or quality metrics to suggested changes before opening PRs.
  3. Risk assessment: Infrastructure changes carry different risks than application code. A buggy API endpoint might cause downtime; a buggy Terraform change could provision expensive resources or destroy critical infrastructure. The backend must incorporate this risk calculus.
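The three considerations above can be sketched as a small rule pass followed by a gate. Everything in this example is hypothetical: the `ParsedResource` shape, the rule names, and the instance-type allow-list (a stand-in for an "oversized instance" check) are assumptions for illustration, not Infracost's rules or schema.

```typescript
// Hypothetical, simplified view of a resource after parsing a Terraform change.
interface ParsedResource {
  address: string; // e.g. "aws_instance.web"
  tags: Record<string, string>;
  instanceType?: string;
  action: "create" | "update" | "delete";
}

interface Finding {
  address: string;
  rule: string;
  blocking: boolean; // blocking findings stop an automated PR
}

// Assumed allow-list, for illustration only.
const ALLOWED_INSTANCE_TYPES = new Set(["t3.micro", "t3.small", "t3.medium"]);

function analyze(resources: ParsedResource[]): Finding[] {
  const findings: Finding[] = [];
  for (const r of resources) {
    // Cost anti-pattern: resources that stay in place should carry tags.
    if (r.action !== "delete" && Object.keys(r.tags).length === 0) {
      findings.push({ address: r.address, rule: "untagged-resource", blocking: false });
    }
    // Cost anti-pattern: instance type outside the approved set.
    if (r.instanceType !== undefined && !ALLOWED_INSTANCE_TYPES.has(r.instanceType)) {
      findings.push({ address: r.address, rule: "unapproved-instance-type", blocking: true });
    }
    // Risk: destroying infrastructure should never be automated silently.
    if (r.action === "delete") {
      findings.push({ address: r.address, rule: "destructive-change", blocking: true });
    }
  }
  return findings;
}

// A suggested change is eligible for a PR only if no blocking finding remains.
function safeToOpenPr(resources: ParsedResource[]): boolean {
  return analyze(resources).every((f) => !f.blocking);
}
```

Separating advisory findings from blocking ones is one way to stay permissive enough to be useful while still refusing the changes most likely to cause damage.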

The engineer will need to understand how to build systems that are both permissive enough to be useful and strict enough to prevent disasters. This requires deep knowledge of both infrastructure-as-code semantics and software engineering best practices.

The Issue Explorer: Surfacing Signal from Noise

The "Issue Explorer" represents another scaling challenge: how to let enterprise customers "filter, group, and chart tens of thousands of issues across their entire codebase." This isn't just a UI problem—it's fundamentally a data architecture problem.

When you have tens of thousands of issues (cost inefficiencies, security misconfigurations, compliance violations) across thousands of repositories, you're dealing with:

  • Data volume: Each issue has metadata (severity, resource type, estimated cost, affected repository) and temporal data (when it was introduced, when it was fixed).
  • Query complexity: Users want to slice and dice this data by team, project, time period, or issue type. This requires a backend that can handle ad-hoc analytical queries without pre-computing every possible aggregation.
  • Performance trade-offs: Do you index everything and slow down writes, or optimize for read patterns and accept some query latency?

The backend engineer will need to design schemas and query patterns that balance these competing concerns. The mention of "charting" suggests they're building systems that can aggregate data for visualization without materializing every possible view.
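A minimal sketch of the "filter, group, and chart" shape follows, done in memory for clarity. The `Issue` fields and the `summarize` helper are hypothetical; in a real system this aggregation would more likely be a `GROUP BY` pushed down to PostgreSQL.

```typescript
// Hypothetical issue record; field names are assumptions for illustration.
interface Issue {
  repo: string;
  team: string;
  severity: "low" | "medium" | "high";
  monthlyCost: number; // estimated monthly cost in USD
}

type Dimension = "repo" | "team" | "severity";

interface GroupSummary {
  count: number;
  monthlyCost: number;
}

// Filters issues by exact match on any provided dimension, then groups by one
// dimension and returns a count and summed cost per group (one bar per group).
function summarize(
  issues: Issue[],
  groupBy: Dimension,
  filter: Partial<Pick<Issue, Dimension>> = {},
): Map<string, GroupSummary> {
  const dims = Object.keys(filter) as Dimension[];
  const out = new Map<string, GroupSummary>();
  for (const issue of issues) {
    const matches = dims.every(
      (d) => filter[d] === undefined || issue[d] === filter[d],
    );
    if (!matches) continue;
    const key = issue[groupBy];
    const group = out.get(key) ?? { count: 0, monthlyCost: 0 };
    group.count += 1;
    group.monthlyCost += issue.monthlyCost;
    out.set(key, group);
  }
  return out;
}
```

Because the grouping dimension is a parameter, one code path serves many chart views, which mirrors the goal of supporting ad-hoc slicing without precomputing every aggregation.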

Technical Stack and Philosophy

The company's technical philosophy is evident in their job requirements:

Node.js and TypeScript: The choice of Node.js suggests they value the JavaScript ecosystem's developer experience and the ability to share code between frontend and backend. TypeScript adds type safety, which is crucial when dealing with complex infrastructure configurations where a type error could mean misinterpreting a resource's cost.
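One way TypeScript can guard against misreading a resource's cost is a discriminated union with an exhaustiveness check, sketched below. The `Resource` variants, field names, and the 730-hours-per-month approximation are assumptions made for this example, not details from the posting.

```typescript
// Hypothetical resource variants; each carries only the fields its pricing needs.
type Resource =
  | { kind: "instance"; hourlyRate: number; count: number }
  | { kind: "storage"; gb: number; ratePerGbMonth: number };

// Assumed approximation of hours in a month, used only for this sketch.
const HOURS_PER_MONTH = 730;

function monthlyCost(r: Resource): number {
  switch (r.kind) {
    case "instance":
      return r.hourlyRate * HOURS_PER_MONTH * r.count;
    case "storage":
      return r.gb * r.ratePerGbMonth;
    default: {
      // If a new variant is added without a case above, this assignment
      // fails to compile, so the omission is caught before runtime.
      const unreachable: never = r;
      throw new Error(`unhandled resource: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The compiler also rejects, for example, reading `gb` on an instance, so a storage formula cannot be applied to compute pricing by accident.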

PostgreSQL: The emphasis on PostgreSQL over NoSQL solutions indicates they're dealing with structured data that benefits from relational queries. Infrastructure configurations have clear relationships (repository → resources → cost impact) that map well to relational models.

GraphQL (preferred): The mention of GraphQL as a preferred skill suggests they're building APIs that need to serve multiple client needs efficiently. For the Issue Explorer, GraphQL's ability to fetch exactly the data needed for a particular view could be crucial for performance.

Tooling over process: The company values "fixing problems with tooling rather than adding process." This philosophy extends to their internal CLI, which streamlines engineering work. For a backend engineer, this means building systems that are self-documenting and easy to debug, rather than relying on extensive documentation or manual processes.

The Engineering Culture: Async-First and Pragmatic

The role requires working in a time zone between GMT-6 and GMT+2, within an async-first culture. This has technical implications:

  • Documentation becomes code: When you can't tap someone on the shoulder, your APIs, error messages, and system diagrams must be self-explanatory.
  • Monitoring and observability: Production issues must be diagnosable without real-time collaboration. This means robust logging, metrics, and tracing.
  • Code quality as communication: Clean, well-structured code becomes the primary way to communicate intent and design decisions.

The company's values—"Let's JEDI" (Just Effing Do It!) and "customer, not customer"—suggest a pragmatic, product-focused engineering culture. The backend engineer will need to balance technical perfection with shipping speed, especially when fixing customer issues.

Broader Patterns in Infrastructure Tooling

Infracost's challenges reflect broader trends in the infrastructure tooling space:

  1. The rise of "FinOps": As cloud costs become a significant line item, tools that integrate cost analysis into developer workflows are gaining traction. The backend challenge is making cost data actionable without overwhelming developers.

  2. AI in infrastructure: While AI code generation is popular, its application to infrastructure-as-code is riskier. Infracost's hybrid approach—AI suggestions validated by static analysis—represents a pragmatic middle ground.

  3. Enterprise scaling: Many infrastructure tools start with individual developers but must scale to enterprise needs. This requires rethinking everything from data architecture to access controls.

  4. Developer experience as differentiator: The emphasis on "shifting left" and integrating into workflows (GitHub, Azure Repos) shows that developer experience is becoming a key battleground for infrastructure tools.

What This Means for Backend Engineers

For a senior backend engineer considering this role, the technical challenges are substantial:

  • Distributed systems thinking: You'll be building systems that must stay reliable and fast as repository counts, issue volumes, and query complexity all grow.
  • Domain complexity: Understanding infrastructure-as-code (Terraform, CloudFormation, etc.) is essential for building accurate analysis tools.
  • Product engineering mindset: The backend isn't just serving data—it's enabling product features like real-time cost insights and automated fixes.

The role is less about maintaining microservices and more about architecting systems that can handle the complexity of modern cloud infrastructure at scale. It's a chance to work on problems that sit at the intersection of software engineering, data engineering, and product development.

Infracost's approach—combining static analysis with AI, building tools rather than processes, and focusing on developer experience—represents a pragmatic response to the growing complexity of cloud infrastructure. For engineers who enjoy solving scaling challenges while directly impacting how teams manage cloud costs, this role offers both technical depth and product impact.

