What 'production-ready' really means for a NestJS backend
#DevOps

Backend Reporter
6 min read

A deep dive into the infrastructure, observability, and operational practices that transform a simple Todo API into a production-ready service, with lessons learned from real-world feedback.

Three weeks ago I shared an early version of this project and got a lot of thoughtful feedback. I've since reworked it - here's the updated version: https://github.com/prod-forge/backend

What this project is about

The idea hasn't changed: Writing backend code is the easy part. Everything around it is what makes systems reliable. This project focuses on:

  • what happens before the code is written
  • what happens before a deploy
  • what happens when things break

What's inside

A simple Todo API, but built like a real production service:

  • CI/CD with rollback
  • forward-only migrations
  • observability (Prometheus + Grafana + Loki)
  • structured logging + correlation IDs
  • Terraform infrastructure (AWS)
  • E2E testing with Testcontainers

What changed after feedback

  • improved structure and documentation
  • clarified migration and release flow
  • refined CI/CD and rollback approach
  • better explanations of key decisions

Important

This is not a boilerplate. The goal is not to copy configs, but to understand how production systems are actually put together.

Feedback

If you've worked on real systems: What would you add or do differently?

The gap between "works on my machine" and production-ready

Most tutorials stop at "your API responds to requests." But anyone who's maintained a service in production knows that's about 10% of the battle. The other 90% is what happens when:

  • Your database migration fails halfway through
  • A bad deploy goes out to production
  • You need to understand why a request failed three days ago
  • Your service needs to scale from 10 to 10,000 requests per second
  • You need to reproduce a bug that only happens in production

This project bridges that gap by showing how to build a simple Todo API with production-grade operational practices.

The architecture: Simple API, complex operations

The Todo API itself is straightforward - create, read, update, delete tasks. But the surrounding infrastructure demonstrates real production concerns:

CI/CD with rollback

Continuous deployment without rollback capability is like driving a car without brakes. The pipeline uses GitHub Actions to:

  • Run tests on every push
  • Build and push Docker images
  • Deploy to staging automatically
  • Require manual approval for production
  • Support instant rollback to previous versions

The rollback mechanism is critical - it's not enough to deploy new versions; you need to be able to quickly revert when something goes wrong.
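The "pick the previous version" step at the heart of a rollback can be sketched in a few lines. This is a hedged sketch, assuming releases map to ECS task-definition revisions (the function name and ARN shapes are illustrative, not the project's actual pipeline code):

```typescript
// Hypothetical rollback-selection step. ECS task-definition ARNs end in
// ":<revision>", so rolling back means re-deploying the second-newest one.
function previousRevision(taskDefArns: string[]): string {
  const rev = (arn: string) => Number(arn.split(":").pop());
  const newestFirst = [...taskDefArns].sort((a, b) => rev(b) - rev(a));
  if (newestFirst.length < 2) {
    throw new Error("no previous revision to roll back to");
  }
  return newestFirst[1];
}

// The pipeline would then re-point the service via the AWS SDK, roughly:
//   await ecs.send(new UpdateServiceCommand({
//     cluster, service, taskDefinition: previousRevision(arns),
//   }));
```

Keeping old task-definition revisions around is what makes the rollback "instant": there is nothing to rebuild, only a pointer to move.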

Forward-only migrations

Database migrations are one of the most common sources of production incidents. The project uses TypeORM with a strict policy:

  • Migrations only move forward; there is no production "down" path
  • No destructive changes that lose data
  • A rollback is a new forward migration that reverses the change
  • Each migration is tested in CI

This approach prevents the common pitfall of "just drop the table and re-run" in production, which works fine until it doesn't.
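The forward-only policy can be illustrated with a minimal migration ledger. This is a sketch with assumed shapes, not the project's actual TypeORM setup: migrations have sortable ids, run in order, and are never un-applied; reversing a change ships as a new forward migration.

```typescript
// Minimal forward-only migration runner (illustrative shapes).
type Migration = { id: string; up: () => void };

function runPending(migrations: Migration[], applied: Set<string>): string[] {
  const ran: string[] = [];
  for (const m of [...migrations].sort((a, b) => a.id.localeCompare(b.id))) {
    if (applied.has(m.id)) continue; // already ran: skip, never re-run
    m.up();
    applied.add(m.id);
    ran.push(m.id);
  }
  return ran;
}

// Reversing a change is itself a new forward migration:
const migrations: Migration[] = [
  { id: "20240101-add-due-date", up: () => { /* ALTER TABLE todo ADD COLUMN due_date */ } },
  { id: "20240115-drop-due-date", up: () => { /* ALTER TABLE todo DROP COLUMN due_date */ } },
];
```

Because the ledger only grows, every environment converges on the same schema history, and "undo" leaves an auditable trail instead of rewriting it.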

Observability stack

Logging and metrics aren't just for debugging - they're your primary interface to production systems. The stack includes:

Prometheus collects metrics from the application and infrastructure. Key metrics include request latency, error rates, and database query performance.
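Request latency is typically recorded as a Prometheus histogram. The sketch below shows what such a histogram exposes; bucket bounds are illustrative, and the real application would use a client library such as prom-client rather than hand-rolling this.

```typescript
// Prometheus histogram buckets are cumulative: each "le" bucket counts
// observations less than or equal to its upper bound.
const BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2.5]; // seconds (illustrative)

function latencyHistogram(latencies: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const le of BUCKETS) {
    counts.set(String(le), latencies.filter((l) => l <= le).length);
  }
  counts.set("+Inf", latencies.length); // total observation count
  return counts;
}
```

Cumulative buckets are what let Grafana compute percentiles (e.g. p99 latency) server-side from scrape data without shipping every raw sample.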

Grafana visualizes these metrics, providing dashboards for:

  • Application health
  • Request throughput and latency
  • Database performance
  • Infrastructure utilization

Loki handles structured logging with correlation IDs. Every request gets a unique ID that flows through all services, making it possible to trace a single operation across multiple components.

Structured logging + correlation IDs

Instead of dumping unstructured text to stdout, the project uses structured JSON logging with Pino, enriched with:

  • Request correlation IDs
  • User IDs (when available)
  • Operation names
  • Timings and durations
  • Error contexts

This makes logs searchable and meaningful rather than just noise.
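The correlation-ID part can be sketched as a small piece of middleware. This is a minimal, hedged version with an Express-style signature (as NestJS middleware commonly uses); the header name and field names are assumptions, not the project's exact code.

```typescript
import { randomUUID } from "node:crypto";

// Illustrative request/response shapes instead of full Express types.
type Req = { headers: Record<string, string | undefined>; correlationId?: string };
type Res = { setHeader: (name: string, value: string) => void };

function correlationId(req: Req, res: Res, next: () => void): void {
  // Reuse an upstream ID if a proxy or caller already set one,
  // otherwise mint a fresh one for this request.
  const id = req.headers["x-correlation-id"] ?? randomUUID();
  req.correlationId = id;
  res.setHeader("x-correlation-id", id); // echo back so clients can report it
  next();
}

// Every structured log line then carries the ID, e.g. with Pino:
//   logger.info({ correlationId: req.correlationId, op: "todo.create" }, "created");
```

Echoing the ID in the response is the small detail that pays off during incidents: a user can quote it from an error page, and you can jump straight to the matching log lines in Loki.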

Terraform infrastructure

Infrastructure as Code isn't optional for production systems. The project uses Terraform to define:

  • AWS ECS Fargate services
  • RDS PostgreSQL database
  • Application Load Balancer
  • CloudWatch log groups and metrics
  • IAM roles and permissions

Everything is version-controlled and can be recreated from scratch.

E2E testing with Testcontainers

Unit tests are necessary but not sufficient. The project includes E2E tests that:

  • Spin up real PostgreSQL containers
  • Test the full HTTP API
  • Verify database migrations work
  • Check that observability data is emitted correctly

Testcontainers makes this reliable by providing consistent test environments.

What changed based on community feedback

The initial version received valuable input from developers who've been through production incidents. Key improvements include:

Better documentation structure

Production systems are complex, and documentation needs to match that complexity. The project now has:

  • Clear setup instructions
  • Architecture decision records (ADRs)
  • Deployment guides
  • Troubleshooting sections

Clarified migration and release flow

Database migrations and deployments are the highest-risk operations. The flow is now explicitly documented:

  1. Migrations run in CI to catch issues early
  2. Staging deployments happen automatically
  3. Production deployments require manual approval
  4. Rollbacks are a single command

Refined CI/CD and rollback approach

The original pipeline was functional but could be more robust. Changes include:

  • Better error handling in deployment scripts
  • More comprehensive test coverage
  • Improved rollback verification
  • Enhanced monitoring during deployments

Better explanations of key decisions

Every significant choice in the project now has a rationale:

  • Why Prometheus over other metrics systems
  • Why structured logging over simple console.log
  • Why Terraform over CloudFormation
  • Why Testcontainers for E2E tests

The reality of production systems

This project demonstrates that production-readiness isn't about using the latest framework or having perfect code coverage. It's about:

  • Reliability: Can the system recover from failures?
  • Observability: Can you understand what's happening?
  • Maintainability: Can new team members understand and modify it?
  • Safety: Can you deploy changes without fear?

A Todo API is a simple problem domain, which makes it perfect for focusing on these operational concerns without getting lost in business logic complexity.

What would you add?

If you've worked on real systems, what would you add or do differently? Some ideas I'm considering:

  • Chaos engineering experiments
  • Blue-green deployments
  • Database connection pooling optimization
  • Circuit breakers for external dependencies
  • Service mesh integration
  • Multi-region deployment
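Of these, the circuit breaker is the easiest to sketch in isolation. A minimal, illustrative version follows; thresholds and names are assumptions, not project code.

```typescript
// After `maxFailures` consecutive failures the circuit opens and calls fail
// fast until `cooldownMs` has elapsed; a success closes it again.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 3, private cooldownMs = 10_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (
      this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.cooldownMs
    ) {
      throw new Error("circuit open"); // fail fast, spare the dependency
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Wrapped around an external HTTP call, this turns a slow, failing dependency into an immediate, predictable error, which is what prevents cascading failures under load.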

What's missing from this production-ready checklist?

Beyond the code: The operational mindset

The most important lesson from this project isn't any specific technology choice - it's the operational mindset. Production systems fail in predictable ways:

  • During deployments: Bad code, failed migrations, configuration issues
  • Under load: Resource exhaustion, cascading failures, slow dependencies
  • During incidents: Missing logs, unclear metrics, slow recovery

Building production-ready systems means designing for these failures upfront rather than reacting to them.

This project provides a concrete example of how to do that for a NestJS backend, but the principles apply regardless of your tech stack.

The goal is understanding, not copying. Take these patterns, adapt them to your context, and build systems that don't just work - they keep working when things go wrong.
