Hacker News: JB_Dev's comments

Yeah, agreed. This is the same discussion point that came up the last time they had an incident.

I really don't buy this requirement to always deploy state changes 100% globally, immediately. Why can't they roll out to 1% first, scaling to 100% over 5 minutes (configurable), with automated health checks and pauses? That would go a long way toward reducing the impact of these regressions.

Then, if they really think something is so critical it has to go everywhere immediately, sure, set that rollout to start at 100%.

Point is, design the rollout system to give you that flexibility. Routine/non-critical state changes should go through slower ramping rollouts.
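To make that concrete, here's a minimal sketch of the kind of ramping rollout I mean. All the names, stage fractions, and thresholds here are hypothetical, not anything Cloudflare actually runs:

```python
import time

# Hypothetical staged rollout: ramp a state change from 1% to 100% of the
# fleet, pausing at each stage and rolling back on a health regression.
STAGES = [0.01, 0.05, 0.25, 1.00]  # fraction of fleet receiving the change


def healthy(error_rate: float, baseline: float, tolerance: float = 0.01) -> bool:
    """Health check: error rate must stay near the pre-rollout baseline."""
    return error_rate <= baseline + tolerance


def rollout(apply_to_fraction, get_error_rate, baseline: float,
            stage_seconds: int = 60) -> bool:
    """Ramp through STAGES; on any health regression, roll back to 0%
    and return False. Returns True once 100% is reached healthily."""
    for fraction in STAGES:
        apply_to_fraction(fraction)
        time.sleep(stage_seconds)  # let metrics settle before checking
        if not healthy(get_error_rate(), baseline):
            apply_to_fraction(0.0)  # automated rollback
            return False
    return True
```

A truly critical change would just set `STAGES = [1.00]` and `stage_seconds = 0`, so keeping the flexibility costs nothing on the fast path.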


Can't get hacked when you are down.


Code and config should be treated similarly. If you would use ring-based rollouts, canaries, etc. to change your code safely, then any config that can have the same impact must also go through safe rollout techniques.


You're the nth person on this thread to say that and it doesn't make sense. Events that happen multiple times per second change data that you would call "configuration" in systems like these. This isn't `sendmail.cf`.

If you want to say that systems that light up hundreds of customers, or propagate new reactive bot rules, or notify a routing system that a service has gone down are intrinsically too complicated, that's one thing. By all means: "don't build modern systems! computers are garbage!". I have that sticker on my laptop already.

But like: handling these problems is basically the premise of large-scale cloud services. You can't just define it away.


I'm sorry to belabor this but I'm genuinely not understanding what you're saying in this reply. I haven't operated large scale systems. I'm just an IT generalist and casual coder. I acknowledge I'm too inexperienced to even know what I don't know re: running large systems.

I read the parent poster as broadly suggesting configuration updates should have fitness tests applied and be deployed to minimize the blast radius when an update causes a malfunction. That makes intuitive sense to me. It seems like software should be subject to health checks after configuration updates, even if it's just to stop a deployment before it's widely distributed (let alone rolling-back to last-working configurations, etc).

Am I being thick-headed in thinking defensive strategies like those are a good idea? I'm reading your reply as arguing against those types of strategies. I'm also not understanding what you're suggesting as an alternative.

Again, I'm sorry to belabor this. I've replied once, deleted it, tried writing this a couple more times and given up, and now I'm finally pulling the trigger. It's really eating at me. I feel as though I must be deep down the Dunning-Kruger rabbit hole and really thinking "outside my lane".


The things you do to safeguard the rollout of a configuration file change are not the same as the things you do to reliably propagate changes that might happen many times per second.

What's irritating to me are the claims that there's nothing distinguishing real time control plane state changes and config files. Most of us have an intuition for how they'd do a careful rollout of a config file change. That intuition doesn't hold for control plane state; it's like saying, for instance, that OSPF should have canaries and staged rollouts every time a link state changes.

I'm not saying there aren't things you can do to make real-time control-plane state propagation safer, or that Cloudflare did all those things (I have no idea; I'm not familiar with their system at all, which is another thing irritating me about this thread --- the confident diagnostics and recommendations). I'm saying that people trying to do the "this is just like CrowdStrike" thing are telling on themselves.


Thanks for the reply.

I took the "this sounds like CrowdStrike" tack for two reasons. The write-up characterized this update as an every-five-minutes process, and the update, being a file of rules, felt analogous in format to the CrowdStrike signature database.

I appreciate the OSPF analogy. I recognize there are portions of these large systems that operate more like a routing protocol (with updates being unpredictable in velocity or time of occurrence). The write-up didn't make this seem like one of those. This seemed a lot more like a traditional daemon process receiving regular configuration updates and crashing on a bad configuration file.


It is possible that any number of things people on this thread have called out are, in fact, the right move for the system Cloudflare built (it's hard to know without knowing more about the system, and my intuition for their system is also faulty because I irrationally hate periodic batch systems like these).

Most of what I'm saying is:

(1) Looking at individual point failures and saying "if you'd just fixed that you wouldn't have had an incident" is counterproductive; like Mr. Oogie-Boogie, every big distributed system is made of bugs. In fact, that's true of literally every complex system, which is part of the subtext behind Cook[1].

(2) I think people are much too quick to key in on the word "config" and just assume that it's morally indifferentiable from source code, which is rarely true in large systems like this (might it have been here? I don't know.) So my eyes twitch like Louise Belcher's when people say "config? you should have had a staged rollout process!" Depends on what you're calling "config"!

[1] https://howcomplexsystems.fail/


I just want to point out a few things you may have overlooked. First, the bot config gets updated every 5 minutes, not every few seconds. Second, they already have config checks in other places ("Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input"). They could probably even align everything in CI/CD if they ran the config verifier where the configs are generated. This is of course all hindsight blind guessing, but you make it sound a bit arcane and impossible to do anything about.
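For illustration, a generation-time verifier could be as simple as the sketch below. The limit, the duplicate check, and the function name are all invented here; the real feature file format is surely more involved:

```python
MAX_FEATURES = 200  # hypothetical hard limit enforced by the consuming service


def verify_bot_config(features: list[str]) -> None:
    """Reject a generated config before propagation if the consuming
    service would refuse (or crash on) it. Checks are illustrative."""
    if len(features) > MAX_FEATURES:
        raise ValueError(
            f"config has {len(features)} features, exceeds limit {MAX_FEATURES}")
    if len(features) != len(set(features)):
        raise ValueError("duplicate feature names in generated config")
```

Running the same check at generation time that the consumer applies at load time means a bad file never leaves CI/CD.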


Does their ring-based rollout really, truly have to go 0→100% in a few seconds?

I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.

As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.


The configuration file is updated every five minutes, so clearly they have some past experience where they’ve decided an hour is too long. That said, even a roll out over five minutes can be helpful.


I think defence against a DDoS on your network is the best reason for a quick rollout.


This was not about DDoS defense but about the Bot Management feature, a paid, Enterprise-only feature (not enabled by default) that blocks automated requests regardless of whether an attack is going on.

https://developers.cloudflare.com/bots/get-started/bot-manag...


Bots can also cause a DoS/DDoS. We use the feature to restrict, by user agent, certain AI scraper tools that adversely impact performance (they have a tendency to hammer "export all the data" endpoints much more than regular users do).


So if you didn't enable it your stuff would work?


It would still fail if you were unlucky enough to be on the new proxy (though it's indeed not very clear why, given the feature was not enabled):

> Unrelated to this incident, we were and are currently migrating our customer traffic to a new version of our proxy service, internally known as FL2. Both versions were affected by the issue, although the impact observed was different.

> Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero. Customers that had rules deployed to block bots would have seen large numbers of false positives. Customers who were not using our bot score in their rules did not see any impact.


Maybe, but in that case have some special-casing logic to detect that yes, indeed, we're under a massive DDoS at this very moment, and do a rapid rollout of the thing that will mitigate said DDoS. Otherwise use the slower default?

Of course, this is all so easy to say after the fact…
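That special-casing could be as small as a ramp-time switch; names and numbers here are purely illustrative:

```python
def choose_ramp_seconds(under_active_attack: bool,
                        fast: int = 5,
                        slow: int = 3600) -> int:
    """Pick the rollout ramp for a change: near-instant only while a
    confirmed attack is in progress, otherwise a slow default ramp.
    The attack signal and both durations are illustrative."""
    return fast if under_active_attack else slow
```

The hard part, of course, is producing a trustworthy "we are under attack right now" signal, not the switch itself.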


Isn't CF under a "massive DDoS" 24/7 pretty much by definition? When does malicious traffic rest, and how many targets of same aren't using CF?


It's literally in the blog post as well:

> In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:


You are 100% correct, but really these engineers should go read the guidance; it's pretty clear what is required: https://learn.microsoft.com/en-us/entra/identity-platform/cl...


Nah, this is just Microsoft's quality bar in general. AI will only accelerate the decline.


100% agree. I'm not sure why everyone is clowning on them here. This process is a win. Would people rather have all of this hidden in a private fork instead?

It’s showing the actual capabilities in practice. That’s much better and way more illuminating than what normally happens with sales and marketing hype.


Satya says: "I’d say maybe 20%, 30% of the code that is inside of our repos today and some of our projects are probably all written by software".

Zuckerberg says: "Our bet is sort of that in the next year probably … maybe half the development is going to be done by AI, as opposed to people, and then that will just kind of increase from there".

It's hard to square those statements up with what we're seeing happen on these PRs.


These are AI companies selling AI to executives; there's no need to square the circle. The people they're talking to have no interest in what's happening in a repo. It's about convincing people to buy in early so they can start making money off their massive investments.


Why shouldn’t we judge a company’s capabilities against what their CEOs claim them to be capable of?


Oh, we absolutely should, but I'm saying that the reason the messaging is so discordant when compared with the capabilities is that the messaging isn't aimed at the people who are able to evaluate the capabilities.


You're right. The audience isn't the same. Unfortunately the parent commenters are also right - executives hyping AI are (currently) lying.

It is about as unethical as it gets.

But, our current iteration of capitalism is highly financialized and underinvested in the value of engineering. Stock prices come before truth.


> Satya says: "I’d say maybe 20%, 30% of the code that is inside of our repos today and some of our projects are probably all written by software".

Well, that makes sense to me. Microsoft's software has gotten noticeably worse in the last few years, so much so that I have abandoned it as my daily driver for the first time since the early 2000s.


The fact that Zuck is saying "sort of" and "probably" is a big giveaway it's not going to happen.


I actually hold the opposite position on this. First-world countries already have the funds and economy to pursue exactly what you describe; they just lack the political will. I don't care to subsidise that intentional lack of investment.

I would much rather give to charities focusing on countries that don’t have the economy/ability to fix their basic issues.


Call me pessimistic, but as the sidewalk pattern becomes more common for IoT, I wouldn’t be surprised if a “malfunctioning radio” just results in the device not working properly.


Make it a roundabout with protected pedestrian crossings. That forces drivers to be looking at the conflict point with pedestrians as they manoeuvre the roundabout.


I was very impressed in Denmark, where that roundabout approach worked very well. Every car slowed down & stopped for me at the crosswalks.

It turned out that that was because they installed a cobblestone speed bump in front of every crosswalk. Cars slowed down even if no pedestrians were around, because otherwise they were going to pop a tire. It made walking so much safer than anywhere else I've been.


Those don't fix it in my experience. There's one about a quarter mile from where I'm sitting right now and I avoid it when walking because of how dangerous it is. Yes, they will see you crossing... as they almost hit you. They recently redid it to be a bit safer for driving on (before people were unclear on how many lanes it had and which lanes could turn where) but it doesn't seem to have improved the pedestrian experience much.


In practice I find this does not work well at all… for some reason, roundabouts are where cars feel most justified in running down a pedestrian in a crosswalk. Sometimes I think they're just afraid to slow down because of the cars behind them.


This. Roundabouts with medians. The answer is (almost) always roundabouts.


If you never pulled into the intersection itself, you would never get a chance to turn at many intersections. The only solution is to wait in the intersection.

