On February 20, 2026, at 17:48 UTC, Cloudflare experienced a significant service outage when a configuration change caused approximately 1,100 Bring Your Own IP (BYOIP) prefixes to be unintentionally withdrawn from the network. This incident impacted 25% of BYOIP prefixes and resulted in services and applications being unreachable for a subset of Cloudflare's customers for approximately 6 hours and 7 minutes.
What Happened
The outage was triggered by a change Cloudflare made to how its network manages IP addresses onboarded through the BYOIP pipeline. The change caused Cloudflare to unintentionally withdraw customer prefixes, resulting in timeouts and connection failures across Cloudflare deployments that used BYOIP. A subset of the 1.1.1.1 resolver service, specifically the destination one.one.one.one, was also impacted.
Once failures were observed, Cloudflare engineers reverted the change and prefixes stopped being withdrawn. The total duration of the incident was 6 hours and 7 minutes, with most of that time spent restoring prefix configurations to their state prior to the change.
Impact on Customers
During the incident, 1,100 of the roughly 6,500 prefixes on the network were withdrawn from 17:56 to 18:46 UTC, representing about 25% of the 4,306 total BYOIP prefixes. The incident did not impact all BYOIP customers because the configuration change was applied iteratively rather than instantaneously across all customers.
Affected services included:
- Core CDN and Security Services: Traffic was not attracted to Cloudflare, causing connection failures
- Spectrum: Spectrum apps on BYOIP failed to proxy traffic
- Dedicated Egress: Customers using BYOIP with Gateway Dedicated Egress or Dedicated IPs for CDN egress could not send traffic to their destinations
- Magic Transit: End users connecting to applications protected by Magic Transit experienced connection timeouts
Some customers were able to restore their own service by using the Cloudflare dashboard to re-advertise their IP addresses. However, approximately 300 prefixes couldn't be remediated through the dashboard due to a software bug that removed service configurations from edge servers.
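For the customers who could self-remediate, the dashboard action amounts to turning a prefix's advertisement back on. A rough sketch of the equivalent API call is below; the endpoint path and payload reflect Cloudflare's public addressing API as I understand it and should be verified against current documentation, and the account ID, prefix ID, and token are placeholders rather than values from the post-mortem.

```python
# Hedged sketch: re-advertise a BYOIP prefix via Cloudflare's addressing API.
# The endpoint path and payload are an approximation of the public
# "prefix advertisement status" API; verify against current Cloudflare docs.
import requests

ACCOUNT_ID = "your_account_id"   # placeholder
PREFIX_ID = "your_prefix_id"     # placeholder
API_TOKEN = "your_api_token"     # placeholder

url = (
    "https://api.cloudflare.com/client/v4/"
    f"accounts/{ACCOUNT_ID}/addressing/prefixes/{PREFIX_ID}/bgp/status"
)
resp = requests.patch(
    url,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"advertised": True},   # ask Cloudflare to announce the prefix again
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```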
Technical Root Cause
The specific change that broke was an attempt to automate the customer action of removing prefixes from Cloudflare's BYOIP service. This automation was part of Cloudflare's Code Orange: Fail Small initiative, which aims to move manual processes into safe, automated workflows.
The issue stemmed from a bug in an API query. The cleanup sub-task was supposed to query for prefixes pending deletion, but due to how the API interpreted an empty string parameter, it instead retrieved all BYOIP prefixes. The system then interpreted all returned prefixes as queued for deletion and began systematically deleting them along with their dependent objects, including service bindings.
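The post-mortem does not include the actual query code, but the failure pattern is a common one. The minimal Python sketch below (with hypothetical function and field names) illustrates it: an empty-string filter is silently treated as "no filter", so the cleanup task receives every prefix and deletes each one along with its dependent objects.

```python
# Hypothetical sketch of the failure pattern: an empty-string filter is
# treated as "no filter", so the query returns every prefix instead of
# only those pending deletion.

def list_prefixes(prefixes, status_filter=""):
    """Return prefixes matching status_filter; an empty filter matches everything."""
    if not status_filter:          # "" is falsy, so the filter is silently dropped
        return list(prefixes)
    return [p for p in prefixes if p["status"] == status_filter]

def withdraw_and_delete(prefix):
    print(f"withdrawing {prefix['cidr']} and removing its service bindings")

def cleanup_pending_deletions(prefixes):
    # Intended call: list_prefixes(prefixes, status_filter="pending_deletion")
    # The buggy call passed an empty string, so *all* prefixes came back...
    for prefix in list_prefixes(prefixes, status_filter=""):
        withdraw_and_delete(prefix)   # ...and each one was withdrawn and deleted

if __name__ == "__main__":
    demo = [
        {"cidr": "203.0.113.0/24", "status": "active"},
        {"cidr": "198.51.100.0/24", "status": "pending_deletion"},
    ]
    cleanup_pending_deletions(demo)
```

Run as-is, the cleanup withdraws both prefixes rather than only the one pending deletion, mirroring how the task runner treated all returned prefixes as queued for deletion.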
Why the Bug Wasn't Caught
Several factors contributed to the bug making it to production:
- The staging environment's mock data was insufficient to catch this scenario
- While tests existed for this functionality, coverage for this specific scenario was incomplete
- Initial testing and code review focused on the BYOIP self-service API journey and were completed successfully, but didn't cover scenarios where the task-runner service would independently execute changes without explicit input
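As an illustration of the coverage gap, a regression test along the following lines (pytest style, reusing the hypothetical list_prefixes from the sketch in the previous section) fails against the buggy behavior and passes once empty filters are rejected. This is an assumption about what such coverage could look like, not Cloudflare's actual test suite.

```python
# Hypothetical regression tests for the empty-filter case (pytest style).
from addressing_cleanup import list_prefixes  # hypothetical module from the sketch above

PREFIXES = [
    {"cidr": "203.0.113.0/24", "status": "active"},
    {"cidr": "198.51.100.0/24", "status": "pending_deletion"},
]

def test_empty_filter_does_not_match_everything():
    # An empty or missing filter must be rejected or match nothing,
    # never silently widened to "all prefixes".
    result = list_prefixes(PREFIXES, status_filter="")
    assert result != PREFIXES, "empty filter must not return every prefix"

def test_pending_deletion_filter_returns_only_pending():
    result = list_prefixes(PREFIXES, status_filter="pending_deletion")
    assert [p["cidr"] for p in result] == ["198.51.100.0/24"]
```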
Recovery Process
Recovery was complicated because affected BYOIP prefixes were in different states:
- Most customers: Only had prefixes withdrawn and could restore service via dashboard
- Some customers: Had prefixes withdrawn and some bindings removed (partial recovery state)
- Some customers: Had prefixes withdrawn and all service bindings removed (required intensive data recovery)
The final 300 prefixes that couldn't be restored through the dashboard required a global configuration update to reapply service bindings across Cloudflare's edge network.
Timeline of Events
- 2026-02-05 21:53: Broken sub-process merged into code base
- 2026-02-20 17:46: Addressing API release containing the broken sub-process completes
- 2026-02-20 17:56: Impact starts - prefixes begin to be withdrawn
- 2026-02-20 18:13: Cloudflare engaged for failures on one.one.one.one
- 2026-02-20 18:18: Internal incident declared
- 2026-02-20 18:21: Addressing API team paged
- 2026-02-20 18:46: Issue identified - broken sub-process terminated
- 2026-02-20 19:19: Some prefixes mitigated - customers begin self-remediation via dashboard
- 2026-02-20 20:30: Final mitigation process begins - engineers complete release to restore withdrawn prefixes
- 2026-02-20 23:03: Configuration update completed - remaining prefixes restored
Connection to Code Orange: Fail Small
This incident occurred while Cloudflare was implementing changes as part of the Code Orange: Fail Small initiative, which has three main goals:
- Require controlled rollouts for configuration changes propagated to the network
- Change internal "break glass" procedures and remove circular dependencies
- Review, improve, and test failure modes of all systems handling network traffic
The change that caused the outage fell under the first goal - moving risky manual changes to safe, automated configuration updates. While preventative measures weren't fully deployed before the outage, teams were actively working on these systems when the incident occurred.
Remediation and Follow-up Steps
Cloudflare has outlined several improvements to prevent similar incidents:
API Schema Standardization
- Improve API schema to ensure better standardization
- Make it easier for tests and automated systems to validate whether API calls are properly formed (a sketch follows this list)
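As a rough illustration of what such validation could look like (not Cloudflare's schema; field names and allowed values are assumptions), a JSON-Schema check can reject an ambiguous query, such as an empty status filter, before it ever reaches the task runner:

```python
# Minimal sketch of schema validation for cleanup queries. Field names and
# allowed values are hypothetical; the jsonschema library is used for illustration.
from jsonschema import ValidationError, validate

CLEANUP_QUERY_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {
            "type": "string",
            "enum": ["active", "pending_deletion", "withdrawn"],
        },
    },
    "required": ["status"],         # the filter must be present...
    "additionalProperties": False,  # ...and nothing unexpected is allowed
}

def validated_query(params: dict) -> dict:
    """Raise instead of silently widening an ambiguous query."""
    validate(instance=params, schema=CLEANUP_QUERY_SCHEMA)
    return params

validated_query({"status": "pending_deletion"})    # ok
try:
    validated_query({"status": ""})                # empty string: rejected by the enum
except ValidationError as err:
    print(f"rejected malformed query: {err.message}")
```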
Better Separation Between Operational and Configured State
- Redesign rollback mechanism and database configuration
- Introduce layers between customer configuration and production
- Implement database snapshots that can be applied through health-mediated deployments
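A minimal sketch of the health-mediated idea follows, under the assumption that configuration can be captured as a snapshot and applied in batches with a health probe between batches. All names, thresholds, and the probe itself are hypothetical, not Cloudflare's implementation.

```python
# Conceptual sketch: apply a new prefix configuration in small batches,
# checking health after each batch and reverting to the pre-change snapshot
# on regression.
import copy

def health_ok(config) -> bool:
    """Placeholder health probe, e.g. reachability checks on sampled prefixes."""
    return all(entry.get("advertised", False) for entry in config.values())

def rollout(snapshot, desired, batch_size=50):
    """Apply `desired` on top of `snapshot`, batch by batch, with health gates."""
    applied = copy.deepcopy(snapshot)
    changed = [key for key in desired if desired[key] != snapshot.get(key)]
    for start in range(0, len(changed), batch_size):
        for key in changed[start:start + batch_size]:
            applied[key] = copy.deepcopy(desired[key])
        if not health_ok(applied):
            print(f"health regression in batch {start // batch_size}; reverting to snapshot")
            return copy.deepcopy(snapshot)
    return applied

# Example: a withdrawal that breaks the health probe is rolled back.
snapshot = {"203.0.113.0/24": {"advertised": True}}
desired = {"203.0.113.0/24": {"advertised": False}}
print(rollout(snapshot, desired, batch_size=1))
```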
Better Arbitrate Large Withdrawal Actions
- Improve monitoring to detect when changes are happening too fast or too broadly
- Implement circuit breakers to stop out-of-control processes
- Use customer service behavior signals to trigger circuit breakers
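A minimal sketch of such a circuit breaker is below, with hypothetical thresholds (for example, tripping if more than 1% of prefixes are withdrawn within a five-minute window); it is illustrative only.

```python
# Minimal circuit-breaker sketch: block further withdrawals once a process
# withdraws more than a small fraction of prefixes in a short window.
import time

class WithdrawalCircuitBreaker:
    def __init__(self, total_prefixes, max_fraction=0.01, window_seconds=300):
        self.limit = max(1, int(total_prefixes * max_fraction))
        self.window = window_seconds
        self.events = []          # timestamps of recent withdrawals
        self.tripped = False

    def allow_withdrawal(self) -> bool:
        now = time.monotonic()
        self.events = [t for t in self.events if now - t < self.window]
        if self.tripped or len(self.events) >= self.limit:
            self.tripped = True   # halt the runaway process until a human resets it
            return False
        self.events.append(now)
        return True

breaker = WithdrawalCircuitBreaker(total_prefixes=4306)
for i in range(100):
    if not breaker.allow_withdrawal():
        print(f"circuit breaker tripped after {i} withdrawals; halting automation")
        break
```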
Conclusion
Cloudflare has acknowledged the severity of this incident and its impact on customers and the broader internet. The company has committed to these improvements to strengthen stability going forward and to prevent similar problems from recurring. This incident serves as a reminder of the critical nature of internet infrastructure and the importance of robust testing, deployment processes, and rollback mechanisms in maintaining service reliability.
