Look, not to defend anything Amazon is doing, but this causal chain seems rather pareidolic and under-evidenced. You could spin some kind of crazy narrative about any major outage based on policy changes that happened just before. But this is hardly the first AWS outage, and most of them happened before the recent RTO changes. At best, it needs more evidence.
The article wasn't about the outage happening, it was about the amount of time it took to even discover what the problem was. Seems logical to assume that could be because there aren't many people left who know how all the systems connect.
> Seems logical to assume that could be because there aren't many people left who know how all the systems connect.
It's only logical presupposing a lot of other conditions, each of which is worthy of healthy skepticism. And even then, it's only a hypothesis. You need evidence to go from "this could have contributed to the problem" to "this caused the problem."
What little is given in the article seems to go strongly against this hypothesis. For example, it links to multiple past findings, going back to 2017, that Amazon's notification times need improvement. If something has been a problem for nearly a decade, it's hard to imagine it's the result of any recent personnel changes.
TFA does not establish how many AWS workers have left or been laid off, let alone how many of those were genuinely undesirable losses of highly skilled individuals. Even if we take it on faith that a large number of such individuals were lost, it is another bridge further to claim that no redundancy in that skillset remained, or that the resulting vacancies have gone unfilled since.
No evidence is given that a more experienced team working on the problem would have identified and resolved it faster. The article even states something to the opposite effect:
> AWS is very, very good at infrastructure. You can tell this is a true statement by the fact that a single one of their 38 regions going down (albeit a very important region!) causes this kind of attention, as opposed to it being "just another Monday outage." At AWS's scale, all of their issues are complex; this isn't going to be a simple issue that someone should have caught, just because they've already hit similar issues years ago and ironed out the kinks in their resilience story.
Indeed, the article doesn't even provide evidence that the response was unreasonably slow. There is no comparison to similar outages, either from AWS in the past (before the hypothesized brain drain) or from competitors. Note that the author has no idea what the problem actually was, or what AWS had to do to diagnose the issue.
It's the most plausible, fact-based guess, beating other competing theories.
Understaffing and absences would clearly lead to delayed incident response, but such obvious negligence and breach of contract would presumably have been avoided by a responsible cloud provider keeping adequate people on duty.
An exceptionally challenging problem is unlikely to be enough to cause so much fumbling: regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes, and it should be expeditiously resolvable by standard means (telling clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS ought to have in place.
Absent organizational issues, AWS engineers being formerly competent but currently stupid might be explained by brain damage. "RTO" might have caused collective chronic poisoning, e.g. lead in the drinking water, but I doubt Amazon is that cheap.
> An exceptionally challenging problem is unlikely to be enough to cause so much fumbling: regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes, and it should be expeditiously resolvable by standard means (telling clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS ought to have in place
You seem to be misunderstanding the nature of the issue.
The DNS records for DynamoDB's API disappeared. They resolve to a dynamic set of IPs that changes constantly.
A ton of AWS services that use DynamoDB could no longer reach it. Hardcoding IPs wasn't an option, and clients couldn't do anything on their side either.
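To make the "hardcoding IPs wasn't an option" point concrete, here's a minimal sketch of what a client sees in normal operation. The hostname is the public regional DynamoDB endpoint; the 60-second wait is an arbitrary choice for illustration:

```python
import socket
import time

# Minimal sketch: resolve the regional DynamoDB endpoint twice and compare the
# answer sets. The fleet behind the name rotates constantly, so pinning any
# snapshot of IPs is not a safe fallback.
HOST = "dynamodb.us-east-1.amazonaws.com"

def resolve(host):
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the IP address string.
    return {info[4][0] for info in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)}

first = resolve(HOST)
time.sleep(60)  # wait out the record's short TTL
second = resolve(HOST)

print("first answers: ", sorted(first))
print("second answers:", sorted(second))
print("overlap:       ", sorted(first & second))  # tends to change between lookups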
> a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes, and it should be expeditiously resolvable by standard means (telling clients to switch to a good DNS server and immediately use it to obtain good addresses)
Did you consider that DNS might have been a symptom? If the DynamoDB DNS records are gated on health checks, switching DNS servers will not resolve the issue, and it might make things worse by directing an unusually high volume of traffic at static IPs with no autoscaling or fault recovery behind them.
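To illustrate the "symptom, not cause" point, here's a toy model (not AWS's actual system; all names and addresses are made up) of a health-check-gated record set:

```python
# Toy model with made-up addresses: if the record set is withdrawn at the
# authoritative source because health checks are failing, switching resolvers
# changes nothing, since every resolver ultimately asks that same source.

BACKENDS = {
    "10.0.1.10": False,  # health check failing
    "10.0.1.11": False,
    "10.0.1.12": False,
}

def authoritative_answer(backends):
    """Serve only the addresses whose health checks currently pass."""
    return [ip for ip, healthy in backends.items() if healthy]

def lookup_via(resolver_name, backends):
    # Whichever resolver the client points at, the answer comes from the same
    # (now empty) authoritative record set.
    answer = authoritative_answer(backends)
    return f"{resolver_name}: {answer if answer else 'empty answer'}"

for resolver in ("corporate resolver", "8.8.8.8", "1.1.1.1"):
    print(lookup_via(resolver, BACKENDS))
```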
The article describes evidence for a concrete, straightforward organizational decay pattern that can explain a large part of this miserable failure. What's "self-serving" about such a theory?
My personal "guess" is that failing to retain knowledge and talent is only one of many components of a well-rounded crisis of bad management and bad company culture that has been eroding Amazon on more fronts than AWS reliability.
What's your theory? Conspiracy within Amazon? Formidable hostile hackers? Epic bad luck? Something even more movie-plot-like? Do you care about making sense of events in general?
We watched someone repeatedly shoot themselves in the foot a few months ago. It is indeed a guess that this is the cause of their current foot pain, but it is a rather safe one.
Twice I've had to deal with outages where the root cause took a long time to find because several distinct root causes were interacting in ways that made it difficult or impossible to reproduce the problem in isolation, or even to reason about it until we figured out that multiple unrelated causes were involved. All other outages I've dealt with were the sort where experienced engineers and institutional knowledge were sufficient to quickly find the cause and fix it.
Which is to say: it's entirely possible that the inferences drawn by TFA are just wrong. And it's also possible that TFA is wrong but also right to express concern with how Amazon manages talent.
It's about the time between the announcements about finding the cause, and I find that to be thin evidence. There are far too many alternate explanations. It's not even that I find the idea implausible; I just don't think the article's doomsaying level of confidence is warranted.
Indeed. No disrespect to Justin (great person) or any of the engineers who were sacked, but Corey's post here is basically "here's someone who was sacked, and here are several other layoff stories." AWS is a really big organization, several orders of magnitude bigger than the set of people who were remote or refused to RTO. Organizations like this survive these brain drains.
Internal documents reportedly say that Amazon suffers from 69 percent to 81 percent regretted attrition across all employment levels. In other words, "people quitting who we wish didn't."
I read that as "of 100 people who quit voluntarily, we wish 69-81 of them hadn't". But that number is meaningless without the context of how many people are quitting out of how many are there, not to mention onboarding processes and how fast new hires get up to speed.
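A quick back-of-the-envelope, using made-up overall attrition rates (only the 69-81 percent share comes from the reported figure), shows how much that missing context matters:

```python
# Back-of-the-envelope with hypothetical numbers. The 69-81% figure is the share
# of voluntary departures that were regretted; what that actually costs depends
# entirely on the overall attrition rate, which the reported figure doesn't give.
regretted_share = 0.75  # rough midpoint of the reported 69-81% range
for annual_attrition in (0.05, 0.10, 0.20):  # hypothetical voluntary attrition rates
    regretted_loss = annual_attrition * regretted_share
    print(f"{annual_attrition:.0%} attrition -> {regretted_loss:.1%} of the workforce "
          f"lost per year as regretted departures")
```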
> Organizations like this survive these brain drains.
True, that's the other thing. Even if it's true that brain drain directly caused/exacerbated this event, big companies have a lot of momentum. Money can paper over a terrifying range and magnitude of folly. Amazon won't die quickly.