At a large cloud provider I spent a while in the “safety” organization, which was tasked with developing a better understanding of our incidents, building tooling to protect systems, and so on.
A few problems I faced:
- Culturally, a lack of deeper understanding of or care about “safety” topics. The powers that be are inherently motivated by launching features and increasing sales, so more often than not you could write an excellent incident retro doc and get back people who were laser-focused on the bare minimum of action items.
- Security folks co-opting the safety work, because removing access can be misconstrued as making things safer. While that’s somewhat true, it also makes jobs harder if the access isn’t replaced with adequate tooling. In practice this meant taking away access and replacing everything with “break glass” mechanisms. If your team is breaking glass every day and filing tickets with security, you’re probably failing at both security and safety.
- Related to the last point: a lack of introspection into the means of making the changes that led to the incident. For example: a user SSHed in to run a command and ran the wrong command -> we should eliminate SSH. Rather than asking why SSH was the best (or only) way the user could effect change on the system. Could we build an API for this, with tooling and safeguards, before cutting off SSH? (A rough sketch of what I mean follows this list.)
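To make that last point concrete, here's a minimal sketch of the kind of thin ops API I mean: allow-listed actions instead of arbitrary shell, a required reason, an audit trail, and dry-run by default. The action names, commands, and log path are all made up for illustration; this isn't any real provider's tooling.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: a guarded "run this operational action" tool that
could stand in for raw SSH for routine changes. All names are illustrative."""

import argparse
import json
import subprocess
import sys
import time

# Allow-list of operations, instead of arbitrary shell access.
# Each named action maps to the exact argv it is permitted to run.
ALLOWED_ACTIONS = {
    "restart-web": ["systemctl", "restart", "nginx"],
    "flush-cache": ["redis-cli", "FLUSHDB"],
}

AUDIT_LOG = "opsapi-audit.jsonl"  # assumed location for the audit trail


def run_action(action: str, operator: str, reason: str, dry_run: bool = True) -> int:
    """Run an allow-listed action, writing an audit record first."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action {action!r} is not allow-listed")
    if not reason.strip():
        raise ValueError("a reason (ticket link, incident id, ...) is required")

    argv = ALLOWED_ACTIONS[action]
    record = {
        "ts": time.time(),
        "operator": operator,
        "action": action,
        "argv": argv,
        "reason": reason,
        "dry_run": dry_run,
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

    if dry_run:
        print(f"[dry-run] would run: {' '.join(argv)}")
        return 0
    return subprocess.run(argv, check=False).returncode


if __name__ == "__main__":
    # e.g.: opsapi restart-web --operator alice --reason INC-1234 --apply
    p = argparse.ArgumentParser(description="guarded ops actions (sketch)")
    p.add_argument("action", choices=sorted(ALLOWED_ACTIONS))
    p.add_argument("--operator", required=True)
    p.add_argument("--reason", required=True)
    p.add_argument("--apply", action="store_true", help="actually run; default is dry-run")
    args = p.parse_args()
    sys.exit(run_action(args.action, args.operator, args.reason, dry_run=not args.apply))
```

Once something like this exists and covers the common cases, taking SSH away is a far easier conversation, because people aren't losing their only way to do the job.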
I've applied for a couple jobs like this and was somewhat relieved they didn't call me back.
When you move thinking about reliability or safety outside of the teams that generate the problems, you replace self-reflection with scolding, and you have to either cajole people into making changes for you or jump into code you're not spending enough time with to truly understand. And then if you make a mistake, it becomes evidence that you shouldn't be touching it at all: "See, we told you this would end badly."
Yeah, I think that’s accurate. The org had good intentions and owned some reasonable programs, but it increasingly became a security organization focused on cutting off hands to keep people from touching keyboards, rather than addressing the real systemic risks and patterns of operator behaviour that led to incidents.