As a data point, I've been running stuff at Hetzner for 10 years now, in two datacenters (physical servers). There were brief network outages when they replaced networking equipment, and exactly ONE outage for hardware replacement, scheduled weeks in advance, with a 4-hour window and around 1-2h duration.
It's just a single data point, but for me that's a pretty good record.
It's not that Hetzner is miraculously better at infrastructure; it's that physical servers are way simpler than the extremely complex software and networking systems that AWS provides.
Well, the complexity comes not from Kubernetes per se but from the fact that the problem it wants to solve (a generalized solution for distributed computing) is very hard in itself.
Only if you actually have a system complex enough to require it. A lot of systems that use Kubernetes are not complex enough to require it, but use it anyway. In that case Kubernetes does indeed add unnecessary complexity.
Except that k8s doesn't solve the problem of generalized distributed computing at all. (For that you need distributed fault-tolerant state handling which k8s doesn't do.)
K8s solves only one problem - the problem of organizational structure scaling. For example, when your Ops team and your Dev team have different product deadlines and different budgets. At this point you will need the insanity of k8s.
I am so happy to read that someone views Kubernetes the same way I do. For many years I have been surrounded by people who "kubernetes all the things" and that is absolute madness to me.
Yes, I remember when Kubernetes hit the scene and it was only used by huge companies who needed to spin-up fleets of servers on demand. The idea of using it for small startup infra was absurd.
As another data point, I run a k8s cluster on Hetzner (mainly for my own experience, as I'd rather learn on my pet projects vs production), and haven't had any Hetzner-related issues with it.
So Hetzner is OK for the overly complex as well, if you wish to do so.
I think it sounds quite realistic especially if you’re using something like Talos Linux.
I’m not using k8s personally but the moment I moved from traditional infrastructure (chef server + VMs) to containers (Portainer) my level of effort went down by like 10x.
Yes, and mobile phones existed before smartphones; what's the point? So far, in terms of scalability, nothing beats k8s. And from OpenAI and Google we also see that it works even for high-performance use cases such as LLM training with huge numbers of nodes.
On the other hand, I had the misfortune of having a hardware failure on one of my Hetzner servers. They got a replacement hard drive in fairly quickly, but it still meant complete data loss on that server, so I had to rebuild it from scratch.
This was extra painful because I wasn't using one of the OSes blessed by Hetzner, so it required a remote install. Remote installs require a system that can run their Java web plugin and a connection stable and fast enough not to time out. The only way I have reliably gotten them to work is by having an ancient Linux VM, also running in Hetzner, with the oldest Firefox version I could find that still supported Java in the browser.
My fault for trying to use what they provide in a way that is outside their intended use, and props to them for letting me do it anyway.
That can happen with any server, physical or virtual, at any time, and one should be prepared for it.
I learned a long time ago that servers should be an output of your declarative server management configuration, not something that is the source of any configuration state. In other words, you should have a system where you can recreate all your servers at any time.
In your case, I would indeed consider starting with one of the OS base installs that they provide. Much as I dislike the Linux distribution I'm using now, it is quite popular, so I can treat it as a common denominator that my ansible can start from.
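As a purely illustrative sketch of what I mean by "servers as an output of configuration" (my real setup is ansible; the package and service names below are made-up placeholders), the core of it is a desired-state file plus an idempotent converge step, so any server can be rebuilt from a base install at any time:

    #!/usr/bin/env python3
    """Sketch: converge a fresh Debian-style base install toward a desired state.
    Illustrative only; package and service names are placeholders."""
    import subprocess

    DESIRED_STATE = {
        "packages": ["nginx", "fail2ban"],  # placeholders
        "services": ["nginx"],
    }

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def converge(state):
        # Idempotent: running this twice leaves the server in the same state,
        # so replacing a dead server is just "apply the config again".
        run("apt-get", "update")
        run("apt-get", "install", "-y", *state["packages"])
        for svc in state["services"]:
            run("systemctl", "enable", "--now", svc)

    if __name__ == "__main__":
        converge(DESIRED_STATE)

The point isn't the tooling; it's that nothing on the server is hand-crafted state you would mourn losing.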
For custom setups, they also allow netbooting into a recovery OS from which the disks can be provisioned over an ssh session.
There are likely cases that still require the remote "keyboard", but I wanted to mention that option.
Do you monitor your product closely enough to know that there weren't other brief outages? E.g. something on the scale of unscheduled server restarts or minute-long network outages?
I personally do, through status monitors at larger cloud providers with 30-second resolution, and I've never noticed any downtime. They will sometimes drop ICMP, though, even though the host is alive and kicking.
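A minimal sketch of that kind of check (illustrative only, not my actual monitor; the target address is a placeholder): poll a TCP port every 30 seconds instead of relying on ICMP, so a host that drops ping still shows as up:

    #!/usr/bin/env python3
    """Tiny uptime probe: TCP connect instead of ICMP ping."""
    import socket
    import time
    from datetime import datetime, timezone

    HOST, PORT = "203.0.113.10", 443  # placeholder target
    INTERVAL = 30                     # seconds, matching the resolution above

    def is_up(host, port, timeout=5):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    while True:
        status = "up" if is_up(HOST, PORT) else "DOWN"
        print(f"{datetime.now(timezone.utc).isoformat()} {HOST}:{PORT} {status}", flush=True)
        time.sleep(INTERVAL)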
Actually, why do people block ICMP? I remember in 1997-1998 there were some Cisco ICMP vulnerabilities, and people started blocking ICMP then and mostly never stopped, and I never understood why. ICMP is so valuable for troubleshooting in certain situations.
Security through obscurity, mostly. I don't know who keeps pushing the advice to block ICMP without a valid technical reason; at best, if you tilt your head and squint, you could almost maybe see a (very new) script kiddie being defeated by it.
I've rarely actually seen that advice anywhere, more so 20 years ago than now, but people are clearly still getting it from circles I don't run in.
I do. Routers, switches, and power redundancy are solved problems in datacenter hardware. Network outages rarely occur because of these systems, and if any component goes down, there's usually an automatic failover. The only thing you might notice is TCP connections resetting and reconnecting, which typically lasts just a few seconds.
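To make the "few seconds of resets" concrete, here is a generic reconnect-with-backoff sketch (illustrative only; the endpoint is a placeholder, not any particular service): a client written this way absorbs a short failover instead of surfacing it as an outage:

    #!/usr/bin/env python3
    """Sketch: ride out a few-second failover by reconnecting with backoff."""
    import socket
    import time

    HOST, PORT = "203.0.113.10", 5432  # placeholder endpoint

    def connect_with_retry(host, port, max_wait=30):
        delay = 0.5
        deadline = time.monotonic() + max_wait
        while True:
            try:
                return socket.create_connection((host, port), timeout=5)
            except OSError:
                if time.monotonic() >= deadline:
                    raise
                time.sleep(delay)
                delay = min(delay * 2, 5)  # exponential backoff, capped at 5 s

    conn = connect_with_retry(HOST, PORT)
    print("connected:", conn.getpeername())
    conn.close()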
I have for some time now, at the scale of around 20 hosts in their cloud offering. No restarts or network outages. I do see "migrations" from time to time (a VM migrating to different hardware, I presume), but without any impact on metrics.
To stick to the above point, this wasn't a minute-long outage.
If you care about seconds- or minutes-long outages, you monitor. Running on AWS, Hetzner, OVH, or a Raspberry Pi in a shoe box makes no difference.