A deep dive into the challenges and solutions for implementing reliable CNPJ (Brazilian company registration) queries in production systems, covering API reliability, data consistency, and performance optimization strategies.
Lessons from Running CNPJ Queries in Production for 8 Months
Eight months ago, I integrated CNPJ queries into a critical path of our product. This article shares the hard-won lessons that no one tells you before going to production.
Context
Our use case involves a B2B onboarding flow requiring CNPJ validation during initial registration and periodic re-validations. The system processes approximately 150,000 queries monthly, with traffic spiking to 3x normal volume on Mondays. Latency is crucial to our conversion rate: user completion rates start to drop once a query takes longer than 2 seconds.
Initial Assumptions
I began with the obvious approach: hitting a free public API directly. This worked well until it didn't.
Problems Discovered (in order of discovery)
1. Rate Limiting is the Least of Your Problems
While rate limits are a concern, they can be mitigated with queuing and caching strategies. The real issue proved to be provider instability—services becoming unavailable for 20-minute stretches during business hours, often without status pages or prior notifications.
2. "Complete" Data Varies Between Providers
Different providers return different subsets of data:
- QSA (partners/shareholders) data appears in some but not others
- Simples Nacional information is inconsistent
- Secondary CNAE codes are only present in half of the providers
This forced us to compose responses from multiple sources to get complete information.
3. Data Freshness Varies Significantly
The same CNPJ query can return different registration status information depending on when each provider last updated their data from the Receita Federal (Brazilian Revenue Service). This creates silent bugs where company status appears to change without actual updates.
What I Would Implement Differently from Day One
1. Cache with TTL by Data Type
Initially, I treated all data with the same time-to-live (TTL), which caused problems. Company names and addresses change infrequently, while registration status can change unexpectedly. Our current implementation uses:
- 30-day TTL for "stable" data (name, address)
- 24-hour TTL for registration status
This differentiation prevents serving stale status information while still caching stable data effectively.
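A minimal sketch of per-type TTLs, assuming an in-memory store. The class name `TtlCache`, the type labels `"stable"` and `"status"`, and the injectable clock are illustrative choices, not our production code; only the two TTL values come from the list above.

```python
import time
from typing import Any, Callable, Optional

# TTLs from the article: 30 days for stable data, 24 hours for status.
TTL_SECONDS = {
    "stable": 30 * 24 * 3600,  # name, address
    "status": 24 * 3600,       # registration status
}


class TtlCache:
    """In-memory cache keyed by (cnpj, data_type), with one TTL per data type."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries = {}   # (cnpj, data_type) -> (expires_at, value)

    def put(self, cnpj: str, data_type: str, value: Any) -> None:
        expires_at = self._clock() + TTL_SECONDS[data_type]
        self._entries[(cnpj, data_type)] = (expires_at, value)

    def get(self, cnpj: str, data_type: str) -> Optional[Any]:
        entry = self._entries.get((cnpj, data_type))
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller re-queries a provider.
            del self._entries[(cnpj, data_type)]
            return None
        return value
```

With this layout, a status entry written on Monday morning is gone by Tuesday afternoon, while the name and address cached in the same request are still served.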
2. Fallback Between Providers, Not Failover
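Traditional failover (switching to a second provider only after the first one fails or times out) adds the first provider's timeout to the latency of every affected request, which was unacceptable for us. What we call fallback here is different: each query is sent to two providers at the same time, and we accept the first valid response. This approach: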
- Reduced p95 latency significantly
- Sent each uncached query to both providers, an extra cost that was minimal compared to the pain avoided
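The race described above can be sketched with `asyncio`. This is an illustration under assumptions: `race_providers` and the `Provider` callable type are hypothetical names, each provider is modeled as an async function that takes a CNPJ and returns a dict (or `None`), and "valid" is simplified to "non-empty". The 2-second default mirrors the latency budget mentioned in the Context section.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical provider interface: async function from CNPJ to a payload.
Provider = Callable[[str], Awaitable[Optional[dict]]]


async def race_providers(
    cnpj: str, providers: list[Provider], timeout: float = 2.0
) -> Optional[dict]:
    """Query all providers concurrently; return the first non-empty response."""
    pending = {asyncio.ensure_future(p(cnpj)) for p in providers}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                return None  # nobody answered within the budget
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.cancelled() or task.exception() is not None:
                    continue  # this provider failed; keep waiting for the others
                result = task.result()
                if result:  # None or an empty dict counts as "not valid"
                    return result
        return None  # every provider failed or returned nothing
    finally:
        for task in pending:
            task.cancel()  # stop the slower request once we have an answer
```

A failing provider does not end the race; the function only gives up when all providers have failed or the time budget is spent.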
3. Normalization at the Integration Layer
Each provider uses its own schema and data format. We created a unified internal type with all fields as optional, along with specific adapters for each provider. This approach:
- Isolated the rest of our codebase from provider-specific changes
- Made it easy to add or remove providers
- Simplified testing and validation
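As a rough illustration of the unified type and adapters, the sketch below uses a dataclass with optional fields. The two provider schemas (`razao_social`, `tax_id`, the nested `company` object, and so on) are invented for the example and do not describe any real provider; `CompanyRecord`, `adapt_provider_a`, and `adapt_provider_b` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompanyRecord:
    """Unified internal type; every field except the CNPJ is optional."""
    cnpj: str
    legal_name: Optional[str] = None
    registration_status: Optional[str] = None
    partners: Optional[list] = None         # QSA
    secondary_cnaes: Optional[list] = None
    source: Optional[str] = None            # which provider produced the record


def adapt_provider_a(payload: dict) -> CompanyRecord:
    # Assumed schema: flat keys in Portuguese, includes QSA, no secondary CNAEs.
    return CompanyRecord(
        cnpj=payload["cnpj"],
        legal_name=payload.get("razao_social"),
        registration_status=payload.get("situacao_cadastral"),
        partners=payload.get("qsa"),
        source="provider_a",
    )


def adapt_provider_b(payload: dict) -> CompanyRecord:
    # Assumed schema: nested "company" object, secondary CNAEs but no QSA.
    company = payload.get("company", {})
    return CompanyRecord(
        cnpj=payload["tax_id"],
        legal_name=company.get("name"),
        registration_status=company.get("status"),
        secondary_cnaes=company.get("secondary_activities"),
        source="provider_b",
    )
```

The rest of the codebase only ever sees `CompanyRecord`, so adding a provider means writing one more adapter function and nothing else.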
4. Log Which Provider Responded to Each Query
Without tracking which provider handled each query, debugging becomes nearly impossible. For example, when a company's legal name appeared different from the previous week, we couldn't determine which provider was responsible. Now we log this information with every query response.
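One possible shape for that log entry, as a sketch: a JSON line emitted through the standard `logging` module. The field names, the logger name `cnpj.lookup`, and the function `log_lookup` are assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("cnpj.lookup")


def log_lookup(
    cnpj: str,
    provider: str,
    latency_ms: float,
    from_cache: bool,
    queried_at: Optional[datetime] = None,
) -> dict:
    """Emit one structured log line per lookup and return the logged entry."""
    entry = {
        "event": "cnpj_lookup",
        "cnpj": cnpj,
        "provider": provider,  # the key field: who actually answered
        "latency_ms": round(latency_ms, 1),
        "from_cache": from_cache,
        "queried_at": (queried_at or datetime.now(timezone.utc)).isoformat(),
    }
    logger.info(json.dumps(entry))
    return entry
```

With the provider recorded on every line, a question like "who returned this legal name last week?" becomes a log search instead of guesswork.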
Costly Mistakes
1. Assuming Invalid CNPJ Returns Uniform 4xx Errors
Different providers handle invalid CNPJs inconsistently. Some return 4xx status codes, others return 200 with empty bodies. This assumption led to handling errors incorrectly in several edge cases.
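One way to avoid this is to map every provider response to a small set of outcomes before the rest of the code sees it. The sketch below is an assumption about how such a mapping could look: `LookupOutcome` and `classify_response` are hypothetical names, and the choice of which status codes mean "invalid CNPJ" would need to be checked against each provider's actual behavior.

```python
from enum import Enum
from typing import Optional


class LookupOutcome(Enum):
    FOUND = "found"
    NOT_FOUND = "not_found"            # invalid or unknown CNPJ
    PROVIDER_ERROR = "provider_error"  # says nothing about the CNPJ itself


# Assumed set of codes that providers use for an invalid or unknown CNPJ.
NOT_FOUND_CODES = {400, 404, 422}


def classify_response(status_code: int, body: Optional[dict]) -> LookupOutcome:
    """Normalize (status code, parsed body) into a provider-independent outcome."""
    if 200 <= status_code < 300:
        # Some providers answer 200 with an empty body for an invalid CNPJ.
        return LookupOutcome.FOUND if body else LookupOutcome.NOT_FOUND
    if status_code in NOT_FOUND_CODES:
        return LookupOutcome.NOT_FOUND
    # 401, 403, 429, 5xx, and anything else: the provider's problem, not the CNPJ's.
    return LookupOutcome.PROVIDER_ERROR
```

Keeping `PROVIDER_ERROR` separate matters because only that outcome should trigger a retry or a second provider; a `NOT_FOUND` should go straight back to the user.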
2. Trusting Registration Status Without Timestamp
We initially stored only the registration status without recording when the query was made. When two lookups disagreed, we had no way to tell whether the status had really changed or whether one answer simply came from an older provider snapshot. Now we store both the status and the timestamp of when it was queried.
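A small sketch of what storing the status together with its query time could look like. `StatusSnapshot`, `is_stale`, and `newest` are hypothetical names; the 24-hour default is the status TTL mentioned earlier in the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StatusSnapshot:
    cnpj: str
    status: str
    provider: str
    queried_at: datetime  # when we asked the provider, in UTC


def is_stale(
    snapshot: StatusSnapshot, now: datetime, max_age: timedelta = timedelta(hours=24)
) -> bool:
    """True when the snapshot is older than the allowed age."""
    return now - snapshot.queried_at > max_age


def newest(snapshots: list[StatusSnapshot]) -> StatusSnapshot:
    """Pick the most recently queried snapshot when several disagree."""
    return max(snapshots, key=lambda s: s.queried_at)
```

Note that `queried_at` is the time of our query, not the time the provider last synced with the Receita Federal; if a provider exposes its own update date, that would be worth storing as well.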
3. Lack of Rate Limiting on Our Side
A bug in our retry logic created an infinite loop that exhausted our provider quota in just 15 minutes. Implementing proper rate limiting and retry mechanisms with exponential backoff became essential.
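The two safeguards can be sketched as follows: a retry helper with a hard attempt cap and exponential backoff with jitter, plus a token bucket that limits outgoing calls on our side. All names and default values here (`call_with_backoff`, `TokenBucket`, 4 attempts, 0.5-second base delay) are illustrative assumptions, not the values we run in production.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception, at most max_attempts times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # hard cap: a retry loop must never run forever
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries


class TokenBucket:
    """Client-side rate limiter: at most `capacity` calls in a burst,
    refilled at `rate_per_sec` tokens per second."""

    def __init__(
        self, rate_per_sec: float, capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # over budget: queue or reject instead of calling the provider
```

Checking `try_acquire()` before every provider call means that even a buggy retry loop can only burn quota at the configured rate.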
Options Tested and Observations
Public Free APIs
These are excellent for initial development and low-volume testing. However, if your product depends on reliable CNPJ data, you'll eventually need to move to paid solutions.
Paid APIs
These generally work as advertised, but carefully review:
- Actual SLAs (not just advertised ones)
- Technical support quality
- Contract terms and data usage rights
Local Receita Federal Database
This is a viable option if you can handle the operational overhead of monthly ETL processes. It wasn't suitable for our use case due to the maintenance complexity.
If I Could Go Back
I would implement caching and fallback strategies from the very first line of code. Most of the other fixes described here became necessary because these two were missing at the start.
For anyone starting now, I recommend:
- Choose two different providers
- Implement caching from the start
- Begin logging what each provider returns
The rest you'll learn through experience, or you can skip this part by using a specialized service like cnpj-api.com, which I eventually built after this journey.
System Design Considerations
This experience highlighted several important distributed systems principles:
Consistency Models
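CNPJ data doesn't fit well with traditional consistency models, because we don't control the source of truth: each provider serves its own snapshot of Receita Federal data, refreshed on its own schedule. We implemented eventual consistency with appropriate TTLs per data type, accepting that some information might be slightly stale.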
Circuit Breaker Patterns
For the provider instability issues, we implemented circuit breakers that automatically switch to a fallback provider when a provider's error rate exceeds a threshold, preventing cascading failures.
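A simplified circuit breaker sketch follows. It deliberately swaps the error-rate threshold described above for a count of consecutive failures, which is easier to show in a few lines. `CircuitBreaker`, `first_available`, and the default values are assumptions for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures and allows a
    trial request again once `reset_after` seconds have passed."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cool-down has elapsed, then let a trial through.
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart the cool-down


def first_available(order: list[str], breakers: dict[str, CircuitBreaker]) -> Optional[str]:
    """Return the first provider, in preference order, whose circuit allows requests."""
    for name in order:
        if breakers[name].allow_request():
            return name
    return None
```

During a 20-minute outage like the ones described earlier, the breaker for the failing provider stays open apart from one trial request per cool-down period, so users are not kept waiting on timeouts.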
API Composition
Since no single provider offered all needed fields, we built an API composition layer that intelligently merges responses from multiple sources while handling version differences and schema mismatches.
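The merge step can be sketched as a field-by-field fill, assuming the responses have already been normalized to dicts with the same keys plus a `source` key. The function name `merge_records`, the `field_sources` key, and the "first non-empty value wins" rule are assumptions made for this example; a real merge may need per-field rules, for instance preferring the freshest source for registration status.

```python
def merge_records(records: list[dict]) -> dict:
    """Merge normalized provider responses field by field.

    Records are given in priority order: the first non-empty value for a
    field wins, and later records only fill the gaps.
    """
    merged: dict = {}
    field_sources: dict = {}
    for record in records:
        provider = record.get("source", "unknown")
        for key, value in record.items():
            if key == "source" or value in (None, "", [], {}):
                continue  # skip bookkeeping and empty values
            if key not in merged:
                merged[key] = value
                field_sources[key] = provider
    # Keep track of where each field came from, for debugging.
    merged["field_sources"] = field_sources
    return merged
```

For example, if one provider returns the QSA but no secondary CNAE codes and another returns the opposite, the merged record contains both, and `field_sources` records which provider supplied each one.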
Scalability Implications
Our initial approach didn't account for the 3x traffic spikes on Mondays. The final implementation includes auto-scaling provisions and pre-warmed instances to handle these predictable volume increases without latency degradation.
These lessons aren't specific to CNPJ queries. They apply to any external API integration where reliability and data consistency are critical to business operations.
