Lessons from Running CNPJ Queries in Production for 8 Months


Backend Reporter

A deep dive into the challenges and solutions for implementing reliable CNPJ (Brazilian company registration) queries in production systems, covering API reliability, data consistency, and performance optimization strategies.


Eight months ago, I integrated CNPJ queries into a critical path of our product. This article shares the hard-won lessons that no one tells you before going to production.

Context

Our use case involves a B2B onboarding flow requiring CNPJ validation during initial registration and periodic re-validations. The system processes approximately 150,000 queries monthly, with traffic spiking to 3x normal volume on Mondays. Latency is crucial to our conversion rate—queries taking longer than 2 seconds start to negatively impact user completion rates.
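Validation starts before any provider is involved: the two CNPJ check digits can be verified locally with the standard modulo-11 rule, which filters out typos for free. A minimal sketch (the rejection of repeated-digit sequences like 00000000000000 is a common extra guard, since they pass the checksum but are never valid):

```python
def validate_cnpj(cnpj: str) -> bool:
    """Validate a CNPJ locally using the standard modulo-11 check digits."""
    digits = [int(c) for c in cnpj if c.isdigit()]
    # Must be exactly 14 digits; a single repeated digit passes the
    # checksum but is not a real CNPJ, so reject it explicitly.
    if len(digits) != 14 or len(set(digits)) == 1:
        return False

    def check_digit(nums: list) -> int:
        # Weights cycle 2..9 starting from the rightmost digit.
        weights = [(i % 8) + 2 for i in range(len(nums))]
        total = sum(d * w for d, w in zip(reversed(nums), weights))
        remainder = total % 11
        return 0 if remainder < 2 else 11 - remainder

    return (digits[12] == check_digit(digits[:12])
            and digits[13] == check_digit(digits[:13]))
```

This only proves the number is well-formed; whether the company exists and is active still requires a provider query.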

Initial Assumptions

I began with the obvious approach: hitting a free public API directly. This worked well until it didn't.

Problems Discovered (in order of discovery)

1. Rate Limiting is the Least of Your Problems

While rate limits are a concern, they can be mitigated with queuing and caching strategies. The real issue proved to be provider instability—services becoming unavailable for 20-minute stretches during business hours, often without status pages or prior notifications.

2. "Complete" Data Varies Between Providers

Different providers return different subsets of data:

  • QSA (partners/shareholders) data appears in some but not others
  • Simples Nacional information is inconsistent
  • Secondary CNAE codes are only present in half of the providers

This forced us to compose responses from multiple sources to get complete information.

3. Data Freshness Varies Significantly

The same CNPJ query can return different registration status information depending on when each provider last updated their data from the Receita Federal (Brazilian Revenue Service). This creates silent bugs where company status appears to change without actual updates.

What I Would Implement Differently from Day One

1. Cache with TTL by Data Type

Initially, I treated all data with the same time-to-live (TTL), which caused problems. Company names and addresses change infrequently, while registration status can change unexpectedly. Our current implementation uses:

  • 30-day TTL for "stable" data (name, address)
  • 24-hour TTL for registration status

This differentiation prevents serving stale status information while still caching stable data effectively.
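A minimal in-memory version of the idea, keyed by data type (the TTL values mirror the ones above; the class and field names are illustrative, and a production version would sit on Redis or similar):

```python
import time

# Per-type TTLs, matching the values discussed above.
TTL_SECONDS = {
    "stable": 30 * 24 * 3600,  # name, address
    "status": 24 * 3600,       # registration status
}

class CnpjCache:
    """In-memory cache keyed by (cnpj, data_type) with per-type TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable clock makes expiry testable
        self._store = {}

    def put(self, cnpj: str, data_type: str, value):
        self._store[(cnpj, data_type)] = (value, self._clock())

    def get(self, cnpj: str, data_type: str):
        entry = self._store.get((cnpj, data_type))
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > TTL_SECONDS[data_type]:
            del self._store[(cnpj, data_type)]  # expired: evict, report miss
            return None
        return value
```

Splitting the key by data type is what lets a 25-hour-old registration status expire while the company name cached at the same time stays valid.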

2. Fallback Between Providers, Not Failover

Traditional failover (switching providers when one fails) introduces unacceptable latency. Instead, we implemented a fallback strategy where queries race between two providers, and we accept the first valid response. This approach:

  • Reduced p95 latency significantly
  • Added minimal cost compared to the pain avoided
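The racing pattern is straightforward with asyncio. A sketch under assumed interfaces (each provider is an async callable returning a dict or None; the stub providers and their latencies are invented for illustration):

```python
import asyncio

async def race_providers(cnpj: str, providers: list):
    """Query all providers concurrently; return the first valid response."""
    tasks = [asyncio.create_task(p(cnpj)) for p in providers]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                result = await finished
            except Exception:
                continue  # a failing provider simply loses the race
            if result is not None:  # "valid" check; adapt to your schema
                return result
        raise RuntimeError("no provider returned a valid response")
    finally:
        for t in tasks:
            t.cancel()  # stop the slower provider once we have a winner

# Stub providers simulating different latencies (assumptions, not real APIs).
async def fast_provider(cnpj):
    await asyncio.sleep(0.01)
    return {"cnpj": cnpj, "source": "fast"}

async def slow_provider(cnpj):
    await asyncio.sleep(0.2)
    return {"cnpj": cnpj, "source": "slow"}

result = asyncio.run(
    race_providers("11444777000161", [fast_provider, slow_provider])
)
```

The trade-off is explicit: every query costs two provider calls, but p95 latency tracks the faster provider rather than the failover timeout.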

3. Normalization at the Integration Layer

Each provider uses its own schema and data format. We created a unified internal type with all fields as optional, along with specific adapters for each provider. This approach:

  • Isolated the rest of our codebase from provider-specific changes
  • Made it easy to add or remove providers
  • Simplified testing and validation
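In code, this is a single internal type with optional fields plus one adapter per provider. The raw field names below are hypothetical stand-ins for real provider payloads:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Company:
    """Unified internal shape; every field is optional because no single
    provider returns all of them."""
    cnpj: str
    legal_name: Optional[str] = None
    status: Optional[str] = None
    partners: Optional[list] = None  # QSA

def adapt_provider_a(raw: dict) -> Company:
    # Provider A uses Portuguese field names (hypothetical example).
    return Company(
        cnpj=raw["cnpj"],
        legal_name=raw.get("razao_social"),
        status=raw.get("situacao_cadastral"),
        partners=raw.get("qsa"),
    )

def adapt_provider_b(raw: dict) -> Company:
    # Provider B nests data differently and omits QSA entirely.
    return Company(
        cnpj=raw["taxId"],
        legal_name=raw.get("company", {}).get("name"),
        status=raw.get("status"),
    )
```

Everything downstream of the adapters only ever sees `Company`, so swapping providers becomes a one-file change.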

4. Log Which Provider Responded to Each Query

Without tracking which provider handled each query, debugging becomes nearly impossible. For example, when a company's legal name appeared different from the previous week, we couldn't determine which provider was responsible. Now we log this information with every query response.
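One structured log line per query is enough. A sketch of what we attach (the field names are our own convention, not a standard):

```python
import json
import logging

logger = logging.getLogger("cnpj")

def log_query(cnpj: str, provider: str, latency_ms: float,
              cache_hit: bool) -> str:
    """Emit one structured JSON line per CNPJ query."""
    record = {
        "event": "cnpj_query",
        "cnpj": cnpj,
        "provider": provider,      # which provider actually answered
        "latency_ms": round(latency_ms, 1),
        "cache_hit": cache_hit,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

With this in place, "why did the legal name change last week" becomes a log query instead of a guessing game.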

Costly Mistakes

1. Assuming Invalid CNPJ Returns Uniform 4xx Errors

Different providers handle invalid CNPJs inconsistently. Some return 4xx status codes, others return 200 with empty bodies. This assumption led to handling errors incorrectly in several edge cases.
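The fix is to collapse both behaviors into one internal outcome at the adapter boundary. A sketch with a sentinel for "not found" (the exact classification rules are an assumption; tune them per provider):

```python
from typing import Optional

NOT_FOUND = object()  # sentinel: CNPJ is invalid or unknown

def normalize_response(status_code: int, body: Optional[dict]):
    """Collapse inconsistent provider behaviors into found / not-found / error."""
    if status_code == 200:
        if not body:  # some providers return 200 with an empty body
            return NOT_FOUND
        return body
    if 400 <= status_code < 500:  # others signal invalid CNPJ with 4xx
        return NOT_FOUND
    # 5xx and anything else is a provider fault, not a data answer.
    raise RuntimeError(f"provider error: HTTP {status_code}")
```

The important property is that "CNPJ doesn't exist" and "provider is broken" never reach business logic through the same code path.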

2. Trusting Registration Status Without Timestamp

We initially stored only the registration status without recording when the query was made. This created issues when status appeared to change without actual updates. Now we store both the status and the timestamp of when it was queried.
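The stored record only needs one extra field to make freshness an explicit check rather than an assumption (a minimal sketch; names are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class StatusRecord:
    """Registration status plus the moment we actually queried it."""
    status: str
    queried_at: datetime

    def is_fresh(self, max_age: timedelta) -> bool:
        return datetime.now(timezone.utc) - self.queried_at <= max_age

record = StatusRecord("ATIVA", datetime.now(timezone.utc))
```

Any consumer can now decide whether a status is trustworthy for its own purpose instead of treating every stored value as current.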

3. Lack of Rate Limiting on Our Side

A bug in our retry logic created an infinite loop that exhausted our provider quota in just 15 minutes. Implementing proper rate limiting and retry mechanisms with exponential backoff became essential.
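The core safeguard is a hard cap on attempts plus exponential backoff with jitter. A sketch (the retry count and delay constants are assumptions; the injectable `sleep` exists so the loop is testable):

```python
import random
import time

def call_with_retries(fn, max_retries: int = 4, base: float = 0.5,
                      cap: float = 30.0, sleep=time.sleep):
    """Bounded retries with exponential backoff and full jitter.

    The property our buggy loop lacked: attempts are capped, so a
    persistent failure can never burn through a provider quota.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the real error
            # Full jitter spreads concurrent retries apart in time.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Pairing this with a client-side rate limiter (token bucket or similar) in front of every provider call closes the quota-exhaustion hole from both directions.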

Options Tested and Observations

Public Free APIs

These are excellent for initial development and low-volume testing. However, if your product depends on reliable CNPJ data, you'll eventually need to move to paid solutions.

Paid Providers

These generally work as advertised, but carefully review:

  • Actual SLAs (not just advertised ones)
  • Technical support quality
  • Contract terms and data usage rights

Local Receita Federal Database

This is a viable option if you can handle the operational overhead of monthly ETL processes. It wasn't suitable for our use case due to the maintenance complexity.

If I Could Go Back

I would implement caching and fallback strategies from the very first line of code. Everything else became necessary because of this initial oversight.

For anyone starting now, I recommend:

  1. Choose two different providers
  2. Implement caching from the start
  3. Begin logging what each provider returns

The rest you'll learn through experience, or you can skip this part by using a specialized service like cnpj-api.com, which I eventually built after this journey.

System Design Considerations

This experience highlighted several important distributed systems principles:

Consistency Models

CNPJ data doesn't fit well with traditional consistency models. We implemented eventual consistency with appropriate TTLs per data type, accepting that some information might be slightly stale.

Circuit Breaker Patterns

For the provider instability issues, we implemented circuit breakers that automatically switch to fallback providers when error rates exceed thresholds, preventing cascading failures.
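A simplified breaker illustrating the idea (thresholds and timings are illustrative; a real one would also track a distinct half-open state and per-provider metrics):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe again
    after `reset_after` seconds."""

    def __init__(self, threshold: int = 5, reset_after: float = 60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self._clock = clock  # injectable clock for tests
        self._failures = 0
        self._opened_at = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # circuit closed: traffic flows
        if self._clock() - self._opened_at >= self.reset_after:
            self._opened_at = None  # cool-down elapsed: let a probe through
            self._failures = 0
            return True
        return False  # circuit open: skip this provider

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

One breaker per provider means a 20-minute outage at one provider costs a handful of failed probes instead of a timeout on every query.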

API Composition

Since no single provider offered all needed fields, we built an API composition layer that intelligently merges responses from multiple sources while handling version differences and schema mismatches.
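Once responses are normalized to shared field names (see the adapter layer above), the merge itself can be simple: prefer the primary source and fill gaps from the others. A sketch under that assumption:

```python
def merge_company_data(primary: dict, *others: dict) -> dict:
    """Merge normalized provider payloads, preferring the primary source
    and filling only the fields it left as None or missing."""
    merged = dict(primary)
    for extra in others:
        for key, value in extra.items():
            if merged.get(key) is None and value is not None:
                merged[key] = value
    return merged
```

Declaring one provider as primary keeps conflicts deterministic: when two sources disagree on a populated field, the primary always wins, and the secondary only contributes what the primary lacked (QSA, Simples Nacional, secondary CNAEs).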

Scalability Implications

Our initial approach didn't account for the 3x traffic spikes on Mondays. The final implementation includes auto-scaling provisions and pre-warmed instances to handle these predictable volume increases without latency degradation.

These lessons aren't specific to CNPJ queries—they apply to any external API integration where reliability and data consistency are critical to business operations.
