A change made to how Cloudflare's Web Application Firewall parses requests caused Cloudflare's network to be unavailable for several minutes this morning. This was not an attack; the change was deployed by our team to help mitigate the industry-wide vulnerability disclosed this week in React Server Components. We will share more information as we have it today.
https://www.cloudflarestatus.com/incidents/lfrm31y6sw9q
I’m really curious what their rollout procedure is, because it seems like the bugs behind many of their past outages would have been caught if they had released these configuration changes to 1% of global traffic first.
They don't appear to have a rollout procedure for some of their globally replicated application state. They've had a number of major outages over the past few years, all with the same root cause: "a global config change exposed a bug in our code and everything blew up".
I guess it's an organizational consequence of mitigating attacks in real time, where rollout delays can be risky as well. But if you're going to do that, the code has to be written much more defensively than they're writing it right now.
Yeah, agreed. This is the same discussion point that came up the last time they had an incident.
I really don’t buy this requirement to always deploy state changes 100% globally immediately.
Why can’t they just roll out to 1%, scaling to 100% over 5 minutes (configurable), with automated health checks and pauses? That will go a long way towards reducing the impact of these regressions.
If they really think something is so critical that it needs to go everywhere immediately, then sure, set the rollout to start at 100%.
Point is, design the rollout system to give you that flexibility. Routine/non-critical state changes should go through slower ramping rollouts.
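A rough sketch of the kind of ramp I mean, in Python. The stage fractions, timings, and health-check threshold are made up for illustration, as are the callbacks; none of this is anything Cloudflare actually uses.

    import time

    # Hypothetical ramp: fraction of global traffic on the new config at each stage.
    # A change flagged as critical could start the ramp at 1.0 instead.
    RAMP_STAGES = [0.01, 0.05, 0.25, 1.0]
    STAGE_DURATION_SECONDS = 75  # four stages ~= 5 minutes end to end

    def healthy(error_rate, baseline, tolerance=0.001):
        # Halt the rollout if the error rate regresses beyond the allowed tolerance.
        return error_rate <= baseline + tolerance

    def staged_rollout(apply_to_fraction, get_error_rate, baseline_error_rate):
        # Push the change to an increasing share of traffic, rolling back on regression.
        for fraction in RAMP_STAGES:
            apply_to_fraction(fraction)          # e.g. route 1% of traffic to the new config
            time.sleep(STAGE_DURATION_SECONDS)   # let health metrics accumulate
            if not healthy(get_error_rate(), baseline_error_rate):
                apply_to_fraction(0.0)           # automatic rollback
                raise RuntimeError(f"halted at {fraction:.0%}: error rate regressed")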
For hypothetical conflicting changes (read: the worst case, where unupgraded nodes/services can't interop with upgraded ones), what's the best practice for a partial rollout?
Blue/green and temporarily ossify capacity? Regional?
That's OK, but it doesn't solve issues you only notice on actual prod traffic. While it can be a nice addition for catching issues earlier with minimal user impact, best practice on large-scale systems still requires a staged/progressive prod rollout.
If there is a proper rollout procedure that would've caught this, and they bypass it for routine WAF configuration changes, they might as well not have one.
The update they describe should never be able to bring down all services. I agree with other posters that they must lack a rollout strategy, yet they sent spam emails mocking the reliability of other clouds.
Apparently this has never been how Cloudflare does things. I expressed incredulity about this to one of their employees, but yeah, their attitude seems to be "We never make mistakes, so it's fastest to just deploy every change across the entire system immediately", and as we've seen repeatedly in the recent past, that means it sometimes blows up.
They have blameless post mortems, but maybe "We actually do make mistakes so this practice is not good" wasn't a lesson anybody wanted to hear.
Blameless post mortems should be similar to air accident investigations. I.e. don't blame the people involved (unless they are acting maliciously), but identify and fix the issues to ensure this particular incident is unlikely to recur.
The intent of the postmortems is to learn what the issues are and prevent or mitigate similar issues happening in the future. If you don't make changes as a result of a postmortem then there's no point in conducting them.
The aviation industry regularly requires certifications, check rides, and re-qualifications when humans mess up. I have never seen anything like that in tech.
Sometimes the solution is to not let certain people do certain things which are risky.
Agree 100%. However, using your example, there is no regulatory agency that investigates the issue and demands changes to avoid related future problems. Should the industry move in that direction?
However, one of the things you see (if you read enough of them) in accident investigation reports for regulated industries is a recurring pattern:
1. Accident happens
2. Investigators conclude Accident would not happen if people did X. Recommend regulator requires that people do X, citing previous such recommendations each iteration
3. Regulator declines the recommendation, arguing it's too expensive to do X, or people already do X, or even (hilariously) both
4. Go to 1.
Too often, what happens is that eventually
5. Extremely Famous Accident Happens, e.g. killing loved celebrity Space Cowboy
6. Investigators conclude Accident would not happen if people did X, remind regulator that they have previously recommended requiring X
7. Press finally reads dozens of previous reports and so News Story says: Regulator killed Space Cowboy!
8. Regulator decides actually they always meant to require X after all
As bad as (3) sounds, I'll strongman the argument: it's important to keep the economic cost of any regulation in mind.*
On the one hand, you'd like to prevent the thing the regulation is seeking to prevent.
On the other hand, you'd have costs for the regulation to be implemented (one-time and/or ongoing).
"Is the good worth the costs?" is a question worth asking every time. (Not least because sometimes it lets you downscope/target regulations to get better good ROI)
*Yes, the easy pessimistic take is 'industry fights all regulation on cost grounds', but the fact that the argument is abused doesn't mean it doesn't have some underlying merit
I think conventionally the verb is "to steelman" with the intended contrast being to a strawman, an intentionally weak argument by analogy to how straw isn't strong but steel is. I understood what you meant by "strongman" but I think that "steelman" is better here.
There is indeed a good reason regulators aren't simply obliged to institute all recommendations: that would be a lot of new rules. The only accident report I remember reading with zero recommendations was an MAIB (maritime accidents) report which concluded that a crew member of a fishing boat had died at sea after their vessel capsized because both they and the skipper (who survived) were on heroin. The rationale for not recommending anything was that heroin is already illegal, operating a fishing boat while on heroin is already illegal, and it's also obviously a bad idea, so there's nothing to recommend. "Don't do that".
Cost is rarely very persuasive to me, because it's very difficult to correctly estimate what something will actually cost to change once you've decided it's required, based as you are on the current reality where it is not. Mass production and clever cost reductions resulting from normal commercial pressures tend to drive down costs once we require something, but not before (and often not after we cease to require it either).
It's also difficult to anticipate all the benefits of a good change without trying it. Lobbyists against a regulation will often try hard not to imagine benefits; after all, they're fighting not to be regulated. But once it's in action, it may be obvious to everyone that this was just a better idea, and absurd that it wasn't always the case.
Remember when you were allowed to smoke cigarettes on aeroplanes? That seems crazy, but at the time it was normal and I'm sure carriers insisted that not being allowed to do this would cost them money - and perhaps for a short while it did.
> it's very difficult to correctly estimate what it will actually cost to change something once you decided it's required - based on current reality where it is not. Mass production and clever cost reductions resulting from the normal commercial pressures tend to drive down costs
Difficult, but not impossible.
What is calculable, and does NOT scale down, is the cost of compliance documentation and processes. Changing from 1 form of documentation to 4 forms of documentation has a measurable cost that will be imposed forever.
> It's also difficult to anticipate all benefits from a good change without trying it.
That's not a great argument, because it can be counterbalanced by the equally true opposite: it's difficult to anticipate all downsides to a change without trying it.
> Remember when you were allowed to smoke cigarettes on aeroplanes?
Remember when you could walk up to a gate 5 minutes before a flight, buy a ticket, and fly?
The current TSA security theater has had some benefits, but it's also made using airports far worse as a traveler.
I mean, I'm pretty sure there was a long period where you could walk up 5 minutes before, and fly on a plane where you're not allowed to smoke. It's completely unrelated.
The TSA makes no sense as a safety intervention; it's theatre. It's supposed to look like we're trying hard to solve the problem, not actually be an attempt to solve the problem. And if there was an accident investigation for 9/11, I can't think why: that's not an accident.
As to your specific claim about enforcement, in many cases we don't even know whether paperwork overhead would increase. Rationalization driven by new regulation can actually reduce it instead.
For a non-regulatory example (at least in the sense that no government regulators were involved), consider Let's Encrypt's ACME, which was discussed here recently. ACME complies with the "Ten Blessed Methods". But prior to Let's Encrypt, the most common processes weren't stricter or more robust; they were much worse and much more labour intensive. Some of them were prohibited more or less immediately when the "Ten Blessed Methods" were required, because they were just obviously unacceptable.
The Proof of Control records from ACME are much better than what had been the usual practice before, yet Let's Encrypt is $0 at the point of use, and even if we count the actual cost (borne by donations rather than subscribers), it's much cheaper than the prior commercial operators were, for much more value delivered.
You provided an example of where arguing against regulation was ill-conceived in hindsight. I offered an obvious example of the opposite (everyone against plane hijacking -> regulation -> air travel is made worse for everyone without much improvement for the primary issue).
> Rationalization driven by new regulation can actually reduce [paperwork] instead.
Ha! Anything is possible, I suppose.
I'd point out that the TBM were not ratified by committee (much less a government) and were rammed through by unilateral Mozilla fiat.
> They have blameless post mortems, but maybe "We actually do make mistakes so this practice is not good" wasn't a lesson anybody wanted to hear.
Or they could say, "we want to continue to prioritise speed of security rollouts over stability, and despite our best efforts, we do make mistakes, so sometimes we expect things will blow up".
I guess it depends what you're optimising for... If the rollout speed of security patches is the priority then maybe increased downtime is a price worth paying (in their eyes anyway)... I don't agree with that, but at least it's an honest position to take.
That said, if this was to address the React CVE then it was hardly a speedy patch anyway... You'd think they could have afforded to stagger the rollout over a few hours at least.
It's just poor risk management at this point. Making sure that a configuration change doesn't crash the production service shouldn't take more than a few seconds in a well-engineered system, even if you're not doing a staged rollout.
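Even a pre-deploy validation gate would cover a lot of this. A minimal sketch in Python, treating WAF rules as plain regular expressions purely for illustration (I have no idea what Cloudflare's actual rule format is):

    import re

    def validate_rules(rules):
        # Refuse to push a config whose rules don't even compile; this is the
        # "few seconds" of checking that should happen before any global deploy.
        errors = []
        for i, pattern in enumerate(rules):
            try:
                re.compile(pattern)
            except re.error as exc:
                errors.append(f"rule {i} failed to compile: {exc}")
        if errors:
            raise ValueError("; ".join(errors))

    # Example: the second (made-up) rule has an unbalanced parenthesis and is rejected.
    try:
        validate_rules([r"(?i)union\s+select", r"(bad(regex"])
    except ValueError as exc:
        print(f"rejected before rollout: {exc}")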
Mentioning React Server Components on the status page can be seen as a poor attempt to shift the blame. It would have been better not to specify which CVE they were trying to patch. The issue is their rollout management, not the vendor or the CVE.
True, thanks for sharing. Worth mentioning that it's in the "full-stack" part of the framework. It doesn't impact most React websites, while it impacts most Next.js websites.
Thanks, that's what I acknowledged in the message you just replied to.
I'm not blaming anyone. Mostly outlining who was impacted as it's not really related to the front-end parts of the framework that the initial comment was referring to.