A change made to how Cloudflare's Web Application Firewall parses requests caused Cloudflare's network to be unavailable for several minutes this morning. This was not an attack; the change was deployed by our team to help mitigate the industry-wide vulnerability disclosed this week in React Server Components. We will share more information as we have it today.
https://www.cloudflarestatus.com/incidents/lfrm31y6sw9q
I’m really curious what their rollout procedure is, because it seems like the bugs behind many of their past outages would have been caught if they had released these configuration changes to 1% of global traffic first.
They don't appear to have a rollout procedure for some of their globally replicated application state. They've had a number of major outages over the past few years, all with the same root cause: "a global config change exposed a bug in our code and everything blew up".
I guess it's an organizational consequence of mitigating attacks in real time, where rollout delays can be risky as well. But if you're going to do that, the code has to be written much more defensively than they're writing it right now.
Yeah, agreed. This is the same discussion point that came up the last time they had an incident.
I really don’t buy this requirement to always deploy state changes 100% globally immediately.
Why can’t they just roll out to 1%, scaling to 100% over 5 minutes (configurable), with automated health checks and pauses? That will go a long way towards reducing the impact of these regressions.
If they really think something is so critical that it needs to go everywhere immediately, then sure, set the rollout to start at 100%.
Point is, design the rollout system to give you that flexibility. Routine/non-critical state changes should go through slower ramping rollouts.
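A rough sketch of the kind of ramp I mean, in Python. The stage fractions, timings, and health-check threshold are made up for illustration, as are the callbacks; none of this is anything Cloudflare actually uses.

    import time

    # Hypothetical ramp: fraction of global traffic on the new config at each stage.
    # A change flagged as critical could start the ramp at 1.0 instead.
    RAMP_STAGES = [0.01, 0.05, 0.25, 1.0]
    STAGE_DURATION_SECONDS = 75  # four stages ~= 5 minutes end to end

    def healthy(error_rate, baseline, tolerance=0.001):
        # Halt the rollout if the error rate regresses beyond the allowed tolerance.
        return error_rate <= baseline + tolerance

    def staged_rollout(apply_to_fraction, get_error_rate, baseline_error_rate):
        # Push the change to an increasing share of traffic, rolling back on regression.
        for fraction in RAMP_STAGES:
            apply_to_fraction(fraction)          # e.g. route 1% of traffic to the new config
            time.sleep(STAGE_DURATION_SECONDS)   # let health metrics accumulate
            if not healthy(get_error_rate(), baseline_error_rate):
                apply_to_fraction(0.0)           # automatic rollback
                raise RuntimeError(f"halted at {fraction:.0%}: error rate regressed")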
For hypothetical conflicting changes (read: the worst case, where unupgraded nodes/services can't interop with upgraded ones), what's the best practice for a partial rollout?
Blue/green and temporarily ossify capacity? Regional?
That's OK, but it doesn't solve issues you only notice on actual prod traffic. While it can be a nice addition for catching issues earlier with minimal user impact, best practice on large-scale systems still requires a staged/progressive prod rollout.
If there is a proper rollout procedure that would've caught this, and they bypass it for routine WAF configuration changes, they might as well not have one.
The update they describe should never be able to bring down all services. I agree with other posters that they must lack a rollout strategy, yet they sent spam emails mocking the reliability of other clouds.
Apparently this has never been how Cloudflare does things. I expressed incredulity about this to one of their employees, but yeah, their attitude seems to be "We never make mistakes, so it's fastest to just deploy every change across the entire system immediately", and as we've seen repeatedly in the recent past, that means it sometimes blows up.
They have blameless post mortems, but maybe "We actually do make mistakes so this practice is not good" wasn't a lesson anybody wanted to hear.
Blameless post mortems should be similar to air accident investigations. I.e. don't blame the people involved (unless they are acting maliciously), but identify and fix the issues to ensure this particular incident is unlikely to recur.
The intent of the postmortems is to learn what the issues are and prevent or mitigate similar issues happening in the future. If you don't make changes as a result of a postmortem then there's no point in conducting them.
The aviation industry regularly requires certifications, check rides, and re-qualifications when humans mess up. I have never seen anything like that in tech.
Sometimes the solution is to not let certain people do certain things which are risky.
Agree 100%. However, using your example, there is no regulatory agency that investigates the issue and demands changes to avoid related future problems. Should the industry move in that direction?
However, one of the things you see (if you read enough of them) in accident investigation reports for regulated industries is a recurring pattern:
1. Accident happens
2. Investigators conclude Accident would not happen if people did X. Recommend regulator requires that people do X, citing previous such recommendations each iteration
3. Regulator declines the recommendation, arguing it's too expensive to do X, or people already do X, or even (hilariously) both
4. Go to 1.
Too often, what happens is that eventually
5. Extremely Famous Accident Happens, e.g. killing loved celebrity Space Cowboy
6. Investigators conclude Accident would not happen if people did X, remind regulator that they have previously recommended requiring X
7. Press finally reads dozens of previous reports and so News Story says: Regulator killed Space Cowboy!
8. Regulator decides actually they always meant to require X after all
As bad as (3) sounds, I'll strongman the argument: it's important to keep the economic cost of any regulation in mind.*
On the one hand, you'd like to prevent the thing the regulation is seeking to prevent.
On the other hand, you'd have costs for the regulation to be implemented (one-time and/or ongoing).
"Is the good worth the costs?" is a question worth asking every time. (Not least because sometimes it lets you downscope/target regulations to get better good ROI)
*Yes, the easy pessimistic take is 'industry fights all regulation on cost grounds', but the fact that the argument is abused doesn't mean it doesn't have some underlying merit
I think conventionally the verb is "to steelman" with the intended contrast being to a strawman, an intentionally weak argument by analogy to how straw isn't strong but steel is. I understood what you meant by "strongman" but I think that "steelman" is better here.
There is indeed a good reason regulators aren't simply obliged to institute all recommendations: that would be a lot of new rules. The only accident report I remember reading with zero recommendations was an MAIB (maritime accidents) report which concluded that a crew member of a fishing boat had died at sea after their vessel capsized because both they and the skipper (who survived) were on heroin. The rationale for not recommending anything was that heroin is already illegal, operating a fishing boat while on heroin is already illegal, and it's also obviously a bad idea, so there's nothing to recommend. "Don't do that".
Cost is rarely very persuasive to me, because it's very difficult to correctly estimate what something will actually cost to change once you've decided it's required, based as you are on the current reality where it is not. Mass production and clever cost reductions resulting from normal commercial pressures tend to drive down costs once we require something, but not before (and often not after we cease to require it either).
It's also difficult to anticipate all the benefits of a good change without trying it. Lobbyists against a regulation will often try hard not to imagine benefits; after all, they're fighting not to be regulated. But once it's in action, it may be obvious to everyone that this was just a better idea, and absurd that it wasn't always the case.
Remember when you were allowed to smoke cigarettes on aeroplanes? That seems crazy, but at the time it was normal and I'm sure carriers insisted that not being allowed to do this would cost them money - and perhaps for a short while it did.
> it's very difficult to correctly estimate what it will actually cost to change something once you decided it's required - based on current reality where it is not. Mass production and clever cost reductions resulting from the normal commercial pressures tend to drive down costs
Difficult, but not impossible.
What is calculable, and does NOT scale down, is the cost of compliance documentation and processes. Changing from 1 form of documentation to 4 forms of documentation has a measurable cost that will be imposed forever.
> It's also difficult to anticipate all benefits from a good change without trying it.
That's not a great argument, because it can be counterbalanced by the equally true opposite: it's difficult to anticipate all downsides to a change without trying it.
> Remember when you were allowed to smoke cigarettes on aeroplanes?
Remember when you could walk up to a gate 5 minutes before a flight, buy a ticket, and fly?
The current TSA security theater has had some benefits, but it's also made using airports far worse as a traveler.
I mean, I'm pretty sure there was a long period where you could walk up 5 minutes before, and fly on a plane where you're not allowed to smoke. It's completely unrelated.
The TSA makes no sense as a safety intervention; it's theatre. It's supposed to look like we're trying hard to solve the problem, not actually be an attempt to solve the problem. And if there was an accident investigation for 9/11, I can't think why: that's not an accident.
As to your specific claim about enforcement, in many cases we don't even know whether paperwork overhead would increase. Rationalization driven by new regulation can actually reduce it instead.
For a non-regulatory example (at least in the sense that no government regulators were involved), consider Let's Encrypt's ACME, which was discussed here recently. ACME complies with the "Ten Blessed Methods". But prior to Let's Encrypt, the most common processes weren't stricter or more robust; they were much worse and much more labour intensive. Some of them were prohibited more or less immediately when the "Ten Blessed Methods" were required, because they were just obviously unacceptable.
The Proof of Control records from ACME are much better than what had been the usual practice before, yet Let's Encrypt is $0 at the point of use, and even if we count the actual cost (borne by donations rather than subscribers), it's much cheaper than the prior commercial operators were, for much more value delivered.
You provided an example of where arguing against regulation was ill-conceived in hindsight. I offered an obvious example of the opposite (everyone against plane hijacking -> regulation -> air travel is made worse for everyone without much improvement for the primary issue).
> Rationalization driven by new regulation can actually reduce [paperwork] instead.
Ha! Anything is possible, I suppose.
I'd point out that the TBM were not ratified by committee (much less a government) and were rammed through by unilateral Mozilla fiat.
> They have blameless post mortems, but maybe "We actually do make mistakes so this practice is not good" wasn't a lesson anybody wanted to hear.
Or they could say, "we want to continue to prioritise speed of security rollouts over stability, and despite our best efforts, we do make mistakes, so sometimes we expect things will blow up".
I guess it depends what you're optimising for... If the rollout speed of security patches is the priority then maybe increased downtime is a price worth paying (in their eyes anyway)... I don't agree with that, but at least it's an honest position to take.
That said, if this was to address the React CVE then it was hardly a speedy patch anyway... You'd think they could have afforded to stagger the rollout over a few hours at least.
It's just poor risk management at this point. Making sure that a configuration change doesn't crash the production service shouldn't take more than a few seconds in a well-engineered system, even if you're not doing a staged rollout.
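Even a pre-deploy validation gate would cover a lot of this. A minimal sketch in Python, treating WAF rules as plain regular expressions purely for illustration (I have no idea what Cloudflare's actual rule format is):

    import re

    def validate_rules(rules):
        # Refuse to push a config whose rules don't even compile; this is the
        # "few seconds" of checking that should happen before any global deploy.
        errors = []
        for i, pattern in enumerate(rules):
            try:
                re.compile(pattern)
            except re.error as exc:
                errors.append(f"rule {i} failed to compile: {exc}")
        if errors:
            raise ValueError("; ".join(errors))

    # Example: the second (made-up) rule has an unbalanced parenthesis and is rejected.
    try:
        validate_rules([r"(?i)union\s+select", r"(bad(regex"])
    except ValueError as exc:
        print(f"rejected before rollout: {exc}")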
Mentioning React Server Components on the status page can be seen as a poor attempt to shift the blame. It would have been better not to specify which CVE they were trying to patch. The issue is their rollout management, not the vendor or the CVE.
True, thanks for sharing. Worth mentioning that it's in the "full-stack" part of the framework. It doesn't impact most React websites, while it impacts most Next.js websites.
Thanks, that's what I acknowledged in the message you just replied to.
I'm not blaming anyone. Mostly outlining who was impacted as it's not really related to the front-end parts of the framework that the initial comment was referring to.