What 'production-ready' really means for a NestJS backend
#DevOps

Backend Reporter
6 min read

A deep dive into the infrastructure, observability, and operational practices that transform a simple Todo API into a production-ready service, with lessons learned from real-world feedback.

Three weeks ago I shared an early version of this project and got a lot of thoughtful feedback. I've since reworked it - here's the updated version: https://github.com/prod-forge/backend

What this project is about

The idea hasn't changed: Writing backend code is the easy part. Everything around it is what makes systems reliable. This project focuses on:

  • what happens before the code is written
  • what happens before a deploy
  • what happens when things break

What's inside

A simple Todo API, but built like a real production service:

  • CI/CD with rollback
  • forward-only migrations
  • observability (Prometheus + Grafana + Loki)
  • structured logging + correlation IDs
  • Terraform infrastructure (AWS)
  • E2E testing with Testcontainers

What changed after feedback

  • improved structure and documentation
  • clarified migration and release flow
  • refined CI/CD and rollback approach
  • better explanations of key decisions

Important

This is not a boilerplate. The goal is not to copy configs, but to understand how production systems are actually put together.

Feedback

If you've worked on real systems: What would you add or do differently?

The gap between "works on my machine" and production-ready

Most tutorials stop at "your API responds to requests." But anyone who's maintained a service in production knows that's about 10% of the battle. The other 90% is what happens when:

  • Your database migration fails halfway through
  • A bad deploy goes out to production
  • You need to understand why a request failed three days ago
  • Your service needs to scale from 10 to 10,000 requests per second
  • You need to reproduce a bug that only happens in production

This project bridges that gap by showing how to build a simple Todo API with production-grade operational practices.

The architecture: Simple API, complex operations

The Todo API itself is straightforward - create, read, update, delete tasks. But the surrounding infrastructure demonstrates real production concerns:

CI/CD with rollback

Continuous deployment without rollback capability is like driving a car without brakes. The pipeline uses GitHub Actions to:

  • Run tests on every push
  • Build and push Docker images
  • Deploy to staging automatically
  • Require manual approval for production
  • Support instant rollback to previous versions

The rollback mechanism is critical - it's not enough to deploy new versions; you need to be able to quickly revert when something goes wrong.
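The "pick the previous version" step at the heart of a rollback can be sketched in a few lines. This is a hedged sketch, assuming releases map to ECS task-definition revisions (the function name and ARN shapes are illustrative, not the project's actual pipeline code):

```typescript
// Hypothetical rollback-selection step. ECS task-definition ARNs end in
// ":<revision>", so rolling back means re-deploying the second-newest one.
function previousRevision(taskDefArns: string[]): string {
  const rev = (arn: string) => Number(arn.split(":").pop());
  const newestFirst = [...taskDefArns].sort((a, b) => rev(b) - rev(a));
  if (newestFirst.length < 2) {
    throw new Error("no previous revision to roll back to");
  }
  return newestFirst[1];
}

// The pipeline would then re-point the service via the AWS SDK, roughly:
//   await ecs.send(new UpdateServiceCommand({
//     cluster, service, taskDefinition: previousRevision(arns),
//   }));
```

Keeping old task-definition revisions around is what makes the rollback "instant": there is nothing to rebuild, only a pointer to move.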

Forward-only migrations

Database migrations are one of the most common sources of production incidents. The project uses TypeORM with a strict policy:

  • Migrations only move forward; there is no production "down" path
  • No destructive changes that lose data
  • A rollback is a new forward migration that reverses the change
  • Each migration is tested in CI

This approach prevents the common pitfall of "just drop the table and re-run" in production, which works fine until it doesn't.
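The forward-only policy can be illustrated with a minimal migration ledger. This is a sketch with assumed shapes, not the project's actual TypeORM setup: migrations have sortable ids, run in order, and are never un-applied; reversing a change ships as a new forward migration.

```typescript
// Minimal forward-only migration runner (illustrative shapes).
type Migration = { id: string; up: () => void };

function runPending(migrations: Migration[], applied: Set<string>): string[] {
  const ran: string[] = [];
  for (const m of [...migrations].sort((a, b) => a.id.localeCompare(b.id))) {
    if (applied.has(m.id)) continue; // already ran: skip, never re-run
    m.up();
    applied.add(m.id);
    ran.push(m.id);
  }
  return ran;
}

// Reversing a change is itself a new forward migration:
const migrations: Migration[] = [
  { id: "20240101-add-due-date", up: () => { /* ALTER TABLE todo ADD COLUMN due_date */ } },
  { id: "20240115-drop-due-date", up: () => { /* ALTER TABLE todo DROP COLUMN due_date */ } },
];
```

Because the ledger only grows, every environment converges on the same schema history, and "undo" leaves an auditable trail instead of rewriting it.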

Observability stack

Logging and metrics aren't just for debugging - they're your primary interface to production systems. The stack includes:

Prometheus collects metrics from the application and infrastructure. Key metrics include request latency, error rates, and database query performance.
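Request latency is typically recorded as a Prometheus histogram. The sketch below shows what such a histogram exposes; bucket bounds are illustrative, and the real application would use a client library such as prom-client rather than hand-rolling this.

```typescript
// Prometheus histogram buckets are cumulative: each "le" bucket counts
// observations less than or equal to its upper bound.
const BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2.5]; // seconds (illustrative)

function latencyHistogram(latencies: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const le of BUCKETS) {
    counts.set(String(le), latencies.filter((l) => l <= le).length);
  }
  counts.set("+Inf", latencies.length); // total observation count
  return counts;
}
```

Cumulative buckets are what let Grafana compute percentiles (e.g. p99 latency) server-side from scrape data without shipping every raw sample.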

Grafana visualizes these metrics, providing dashboards for:

  • Application health
  • Request throughput and latency
  • Database performance
  • Infrastructure utilization

Loki handles structured logging with correlation IDs. Every request gets a unique ID that flows through all services, making it possible to trace a single operation across multiple components.

Structured logging + correlation IDs

Instead of dumping unstructured text to stdout, the project uses structured JSON logging with Pino, enriched with:

  • Request correlation IDs
  • User IDs (when available)
  • Operation names
  • Timings and durations
  • Error contexts

This makes logs searchable and meaningful rather than just noise.
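The correlation-ID part can be sketched as a small piece of middleware. This is a minimal, hedged version with an Express-style signature (as NestJS middleware commonly uses); the header name and field names are assumptions, not the project's exact code.

```typescript
import { randomUUID } from "node:crypto";

// Illustrative request/response shapes instead of full Express types.
type Req = { headers: Record<string, string | undefined>; correlationId?: string };
type Res = { setHeader: (name: string, value: string) => void };

function correlationId(req: Req, res: Res, next: () => void): void {
  // Reuse an upstream ID if a proxy or caller already set one,
  // otherwise mint a fresh one for this request.
  const id = req.headers["x-correlation-id"] ?? randomUUID();
  req.correlationId = id;
  res.setHeader("x-correlation-id", id); // echo back so clients can report it
  next();
}

// Every structured log line then carries the ID, e.g. with Pino:
//   logger.info({ correlationId: req.correlationId, op: "todo.create" }, "created");
```

Echoing the ID in the response is the small detail that pays off during incidents: a user can quote it from an error page, and you can jump straight to the matching log lines in Loki.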

Terraform infrastructure

Infrastructure as Code isn't optional for production systems. The project uses Terraform to define:

  • AWS ECS Fargate services
  • RDS PostgreSQL database
  • Application Load Balancer
  • CloudWatch log groups and metrics
  • IAM roles and permissions

Everything is version-controlled and can be recreated from scratch.

E2E testing with Testcontainers

Unit tests are necessary but not sufficient. The project includes E2E tests that:

  • Spin up real PostgreSQL containers
  • Test the full HTTP API
  • Verify database migrations work
  • Check that observability data is emitted correctly

Testcontainers makes this reliable by providing consistent test environments.

What changed based on community feedback

The initial version received valuable input from developers who've been through production incidents. Key improvements include:

Better documentation structure

Production systems are complex, and documentation needs to match that complexity. The project now has:

  • Clear setup instructions
  • Architecture decision records (ADRs)
  • Deployment guides
  • Troubleshooting sections

Clarified migration and release flow

Database migrations and deployments are the highest-risk operations. The flow is now explicitly documented:

  1. Migrations run in CI to catch issues early
  2. Staging deployments happen automatically
  3. Production deployments require manual approval
  4. Rollbacks are a single command

Refined CI/CD and rollback approach

The original pipeline was functional but could be more robust. Changes include:

  • Better error handling in deployment scripts
  • More comprehensive test coverage
  • Improved rollback verification
  • Enhanced monitoring during deployments

Better explanations of key decisions

Every significant choice in the project now has a rationale:

  • Why Prometheus over other metrics systems
  • Why structured logging over simple console.log
  • Why Terraform over CloudFormation
  • Why Testcontainers for E2E tests

The reality of production systems

This project demonstrates that production-readiness isn't about using the latest framework or having perfect code coverage. It's about:

  • Reliability: Can the system recover from failures?
  • Observability: Can you understand what's happening?
  • Maintainability: Can new team members understand and modify it?
  • Safety: Can you deploy changes without fear?

A Todo API is a simple problem domain, which makes it perfect for focusing on these operational concerns without getting lost in business logic complexity.

What would you add?

If you've worked on real systems, what would you add or do differently? Some ideas I'm considering:

  • Chaos engineering experiments
  • Blue-green deployments
  • Database connection pooling optimization
  • Circuit breakers for external dependencies
  • Service mesh integration
  • Multi-region deployment
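Of these, the circuit breaker is the easiest to sketch in isolation. A minimal, illustrative version follows; thresholds and names are assumptions, not project code.

```typescript
// After `maxFailures` consecutive failures the circuit opens and calls fail
// fast until `cooldownMs` has elapsed; a success closes it again.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 3, private cooldownMs = 10_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (
      this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.cooldownMs
    ) {
      throw new Error("circuit open"); // fail fast, spare the dependency
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Wrapped around an external HTTP call, this turns a slow, failing dependency into an immediate, predictable error, which is what prevents cascading failures under load.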

What's missing from this production-ready checklist?

Beyond the code: The operational mindset

The most important lesson from this project isn't any specific technology choice - it's the operational mindset. Production systems fail in predictable ways:

  • During deployments: Bad code, failed migrations, configuration issues
  • Under load: Resource exhaustion, cascading failures, slow dependencies
  • During incidents: Missing logs, unclear metrics, slow recovery

Building production-ready systems means designing for these failures upfront rather than reacting to them.

This project provides a concrete example of how to do that for a NestJS backend, but the principles apply regardless of your tech stack.

The goal is understanding, not copying. Take these patterns, adapt them to your context, and build systems that don't just work - they keep working when things go wrong.
