How Cloudflare Fortified Its Network: Inside the 'Fail Small' Initiative


Over the course of approximately two financial quarters, Cloudflare embarked on an intensive engineering campaign internally known as "Code Orange: Fail Small." This initiative was designed to enhance the resilience, security, and reliability of our infrastructure for every customer. Earlier this month, the Cloudflare team reached a major milestone: the core work behind this project is now complete. While we recognize that resilience is an ongoing commitment—not a one-time achievement—this effort specifically addresses the root causes that led to the global outages on November 18, 2025, and December 5, 2025. The project targeted four key areas: safer configuration changes, reducing the impact of failures, overhauling our emergency access and incident management procedures, and implementing measures to prevent configuration drift and regressions over time. We also improved how we communicate with customers during incidents. Below, we detail what was shipped and what it means for you.

Safer Configuration Changes

For most customers, the most visible change is that internal configuration changes no longer propagate instantly across our network. Instead, they are now rolled out gradually, with real-time health monitoring at each step. This allows our observability tools to detect anomalies and automatically revert problematic changes before they impact your traffic. To achieve this, we identified high-risk configuration pipelines and built new tools to manage them more effectively.
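The staged-rollout loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stage percentages, the error-rate threshold, and the function names are invented for this example and are not Cloudflare's actual implementation.

```python
# Hypothetical sketch of health-mediated progressive rollout.
# Stage sizes and the health threshold are illustrative assumptions.

STAGES = [1, 5, 25, 50, 100]  # percent of nodes receiving the change


def healthy(error_rate: float, threshold: float = 0.01) -> bool:
    """A stage is healthy if the observed error rate stays below threshold."""
    return error_rate < threshold


def roll_out(apply_stage, observe_error_rate, revert) -> bool:
    """Advance through stages, auto-reverting on the first unhealthy reading."""
    for pct in STAGES:
        apply_stage(pct)
        if not healthy(observe_error_rate()):
            revert()  # roll the change back before it reaches more traffic
            return False
    return True
```

The key property is that a bad change is caught while it affects only a small slice of nodes, rather than the whole network at once.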

Source: blog.cloudflare.com

Previously, configuration changes for products running on our network could be deployed across all nodes simultaneously. Now, all relevant teams—including those directly involved in the past outages—have adopted a health-mediated deployment methodology, the same approach we use when releasing software updates. This ensures that every configuration change is tested and monitored before full rollout.

Introducing Snapstone

Central to this shift is a new internal component we call Snapstone. Snapstone bundles configuration changes into packages and enables gradual release with built-in health mediation. Prior to Snapstone, applying this methodology to configuration changes was possible but cumbersome, requiring significant per-team effort and leading to inconsistent application across the network. Snapstone closes this gap by providing a unified framework that makes progressive rollout, real-time health monitoring, and automated rollback the default for all configuration deployments.

What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, it allows teams to dynamically define any unit of configuration that needs health mediation—whether that's a data file (as in the November 18 outage) or a control flag in our global configuration system (as in the December 5 outage). Teams create these configuration units on demand, and Snapstone handles the rest.
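To make the "configuration unit" idea concrete, here is one way such an abstraction might look. Snapstone is an internal system whose API is not public, so every class and field name below is a hypothetical assumption chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical sketch of a Snapstone-style configuration unit: any payload
# a team wants health-mediated (a data file, a control flag, ...).
# All names here are invented; the real internal system is not public.

@dataclass
class ConfigUnit:
    name: str
    payload: bytes
    health_check: Callable[[], bool]  # runs after the unit is applied


@dataclass
class Deployer:
    history: List[ConfigUnit] = field(default_factory=list)

    def deploy(self, unit: ConfigUnit) -> bool:
        """Apply a unit; roll back automatically if its health check fails."""
        self.history.append(unit)
        if not unit.health_check():
            self.history.pop()  # automated rollback to last known-good state
            return False
        return True
```

The point of the design is that teams only define *what* a unit is and *how* to judge its health; progressive release and rollback come for free from the shared framework.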

Reducing the Impact of Failures

In addition to safer changes, we focused on limiting the blast radius when things do go wrong. This involved redesigning certain parts of our network to ensure that a failure in one area does not cascade into a global outage. For instance, we have implemented circuit breakers at key points to isolate failing components automatically. We also introduced graceful degradation mechanisms so that non-essential features can be temporarily disabled while preserving core traffic processing. These changes mean that even if an unforeseen issue occurs, it will affect a smaller subset of users and can be resolved more quickly.
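A circuit breaker of the kind mentioned above can be sketched in a few lines. The thresholds, timings, and fallback behavior here are illustrative assumptions, not Cloudflare's production values.

```python
import time

# Minimal circuit-breaker sketch: after repeated failures the breaker
# "opens" and routes around the failing component, degrading gracefully
# instead of letting the failure cascade. Parameters are assumptions.

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the failing component entirely.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: probe the component again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
```

The graceful-degradation idea is captured in the `fallback`: a non-essential feature is served in a reduced form (or skipped) while core traffic processing continues.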


Revised Break Glass and Incident Management

Our emergency access procedures—often called "break glass"—have been completely revamped. We introduced stricter controls and audit trails to ensure that only authorized personnel can make emergency changes, and that those changes are reversible. Incident management processes were also updated to include clearer roles, faster escalation paths, and mandatory post-incident reviews with actionable improvement tasks.
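The three properties described above (authorization, auditability, reversibility) can be illustrated with a small sketch. The operator roles, record fields, and class name are hypothetical, invented purely to show the shape of such a control.

```python
from datetime import datetime, timezone

# Hypothetical sketch of an audited, reversible break-glass change.
# The authorized roles and audit-record fields are illustrative assumptions.

AUTHORIZED = {"oncall-lead", "incident-commander"}

class BreakGlass:
    def __init__(self):
        self.audit_log = []
        self.state = {}

    def emergency_change(self, operator: str, key: str, value):
        if operator not in AUTHORIZED:
            raise PermissionError(f"{operator} may not make emergency changes")
        # Record enough context to reverse the change later.
        self.audit_log.append({
            "when": datetime.now(timezone.utc),
            "who": operator,
            "key": key,
            "old": self.state.get(key),
            "new": value,
        })
        self.state[key] = value

    def revert_last(self):
        entry = self.audit_log[-1]
        self.state[entry["key"]] = entry["old"]
```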

Preventing Drift and Regressions

To maintain these gains over time, we deployed automated compliance checks that continuously verify configuration settings against our desired state. Any deviation triggers an alert and, in some cases, automatic remediation. This prevents configuration drift and ensures that the safeguards we've put in place remain effective as our network evolves.
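A drift-detection pass of this kind boils down to diffing observed settings against a desired state. The setting names below are invented for illustration; the real desired state would be far larger.

```python
# Hypothetical sketch of automated drift detection and remediation.
# The setting keys and values are illustrative assumptions.

DESIRED = {
    "progressive_rollout": True,
    "auto_rollback": True,
    "max_blast_radius_pct": 5,
}


def detect_drift(observed: dict) -> dict:
    """Map each drifted key to an (observed, desired) pair."""
    return {
        key: (observed.get(key), want)
        for key, want in DESIRED.items()
        if observed.get(key) != want
    }


def remediate(observed: dict) -> dict:
    """Force drifted keys back to the desired state (automatic remediation)."""
    fixed = dict(observed)
    fixed.update(DESIRED)
    return fixed
```

In practice the drift report would feed an alert first, with automatic remediation reserved for settings that are safe to overwrite.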

Strengthened Customer Communication

We also revamped how we communicate during outages. Instead of generic status updates, we now provide more detailed technical explanations, estimated recovery times, and transparent post-mortems—all delivered through a newly designed incident communication channel. Customers can subscribe to real-time alerts and access a centralized dashboard that tracks ongoing incidents.

What This Means for You

Perhaps the most important result of the Code Orange initiative is that you can expect fewer outages and faster recovery when problems do occur. The health-mediated deployment process, led by Snapstone, ensures that risky changes are caught early. The improved isolation mechanisms limit the damage from any single failure. And the enhanced incident management means our teams can respond more effectively, with clearer communication to you.

While this project is complete, we view resilience as a continuous journey. We will keep iterating on these systems and processes to make Cloudflare's network even stronger. For more details on specific technical implementations, please refer to our developer documentation or follow our blog for ongoing updates.
