
As a data point, I've been running stuff at Hetzner for 10 years now, in two datacenters (physical servers). There were brief network outages when they replaced networking equipment, and exactly ONE outage for hardware replacement, scheduled weeks in advance, with a 4-hour window and around 1-2h duration.

It's just a single data point, but for me that's a pretty good record.

It's not because Hetzner is miraculously better at infrastructure, it's because physical servers are way simpler than the extremely complex software and networking systems that AWS provides.



> physical servers are way simpler than the extremely complex software and networking systems that AWS provides.

Or, rather, it's your fault when the complex software and networking systems you deployed on top of those physical servers go wrong (:


Yes. Which is why I try to keep my software from being overly complex, for example by not succumbing to the Kubernetes craze.


Well, the complexity comes not from Kubernetes per se, but from the fact that the problem it wants to solve (a generalized solution for distributed computing) is very hard in itself.


Only if you actually have a system complex enough to require it. A lot of systems that use Kubernetes are not complex enough to require it, but use it anyway. In that case Kubernetes does indeed add unnecessary complexity.


Except that k8s doesn't solve the problem of generalized distributed computing at all. (For that you need distributed fault-tolerant state handling which k8s doesn't do.)

K8s solves only one problem - the problem of organizational structure scaling. For example, when your Ops team and your Dev team have different product deadlines and different budgets. At this point you will need the insanity of k8s.


I am so happy to read that someone views Kubernetes the same way I do. For many years I have been surrounded by people who "Kubernetes all the things", and that is absolute madness to me.


Yes, I remember when Kubernetes hit the scene and it was only used by huge companies who needed to spin-up fleets of servers on demand. The idea of using it for small startup infra was absurd.


As another data point, I run a k8s cluster on Hetzner (mainly for my own experience, as I'd rather learn on my pet projects vs production), and haven't had any Hetzner related issues with it.

So Hetzner is OK for the overly complex as well, if you wish to do so.


I love my k8s. I've spent about 5 minutes per month on it over the past 8 years and get very reliable infra.


Do you work on k8s professionally outside of the project you’re talking about?

5 mins seems unrealistic unless you’re spending time somewhere else to keep up to speed with version releases, upgrades, etc.


I think it sounds quite realistic especially if you’re using something like Talos Linux.

I’m not using k8s personally but the moment I moved from traditional infrastructure (chef server + VMs) to containers (Portainer) my level of effort went down by like 10x.


I would say that even if you're not using Talos, Argo CD or Flux CD together with Renovate really helps to simplify the recurring maintenance.


You've spent less than 8 hours total on kubernetes?


I agree. Even when Kubernetes is used in large environments, it is still cumbersome, verbose and overly complex.


What are the alternatives?


Right, who needs scalability? Each app should have a hard limit of users and just stop accepting new users when the limit is reached.


Yeah scalability is great! Let’s burn through thousands of dollars an hour and give all our money to Amazon/Google/Microsoft

When those pink slips come in, we’ll just go somewhere else and do the same thing!


You know that “scale” existed long before K8s - or even Borg - was a thing, right? I mean, how do you think Google ran before creating them?


Yes, and mobile phones existed before smartphones; what's the point? So far, in terms of scalability, nothing beats k8s. And from OpenAI and Google we also see that it even works for high-performance use cases such as LLM training with huge numbers of nodes.


If the complex software you deployed and/or configured goes wrong on AWS it's also your fault.


On the other hand, I had the misfortune of having a hardware failure on one of my Hetzner servers. They got a replacement hard drive in fairly quickly, but it was still complete data loss on that server, so I had to rebuild it from scratch.

This was extra painful, because I wasn't using one of the OSes blessed by Hetzner, so it required a remote install. Remote installs require a system that can run their Java web plugin and that has a stable and fast enough connection not to time out. The only way I have reliably gotten them to work is by having an ancient Linux VM that was also running in Hetzner, with the oldest Firefox version I could find that still supported Java in the browser.

My fault for trying to use what they provide in a way that is outside their intended use, and props to them for letting me do it anyway.


That can happen with any server, physical or virtual, at any time, and one should be prepared for it.

I learned a long time ago that servers should be an output of your declarative server management configuration, not something that is the source of any configuration state. In other words, you should have a system where you can recreate all your servers at any time.

In your case, I would indeed consider starting with one of the OS base installs that they provide. Much as I dislike the Linux distribution I'm using now, it is quite popular, so I can treat it as a common denominator that my Ansible can start from.
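To make the "servers are an output of configuration" idea concrete, here's a toy sketch (the package and service names are placeholders, and a Debian-style host with systemd is assumed); in practice a tool like Ansible fills this role, but the principle is the same: the machine's state is derived from the declaration, never the other way around.

    # Toy sketch only: a declarative desired state applied idempotently.
    # The package and service names are placeholders, and a Debian-style
    # host with systemd is assumed; in practice Ansible fills this role.
    import subprocess

    DESIRED_STATE = {
        "packages": ["nginx", "postgresql"],
        "services": ["nginx", "postgresql"],
    }

    def ensure_packages(packages):
        # apt-get is idempotent here: already-installed packages are skipped
        subprocess.run(["apt-get", "install", "-y", *packages], check=True)

    def ensure_services(services):
        for svc in services:
            # enable --now starts the unit and persists it across reboots
            subprocess.run(["systemctl", "enable", "--now", svc], check=True)

    if __name__ == "__main__":
        ensure_packages(DESIRED_STATE["packages"])
        ensure_services(DESIRED_STATE["services"])

Recreating a lost server then means running the same script (or playbook) against a fresh base install, rather than restoring hand-crafted state.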


For custom setups, they also allow netbooting into a recovery OS from which the disks can be provisioned over an SSH session. There are likely cases that require the remote "keyboard", but I wanted to mention it.

Cloud marketing and career incentives seem to have instilled in the average dev the idea that MTBF for hardware is measured in days rather than years.
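To put rough numbers on that mismatch (the figures are illustrative and assume independent failures): a single machine with a 100,000-hour MTBF fails maybe once a decade, but a large fleet of them sees failures around the clock, which is the world the big cloud providers actually live in.

    # Illustrative arithmetic only: the 100,000-hour MTBF is a placeholder,
    # and independent failures are assumed, so the fleet-wide failure rate
    # simply scales with fleet size.
    mtbf_hours = 100_000  # roughly 11 years for one machine
    for fleet_size in (1, 100, 10_000, 1_000_000):
        hours_between_failures = mtbf_hours / fleet_size
        print(f"{fleet_size:>9} machines -> a failure every "
              f"{hours_between_failures:,.2f} hours on average")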


MTBF?


Mean time between failures




Do you monitor your product closely enough to know that there weren't other brief outages? E.g. something on the scale of unscheduled server restarts, and minute-long network outages?


I personally do, through status monitors at larger cloud providers with 30-second resolution, and have never noticed any downtime. They will sometimes drop ICMP, though, even when the host is alive and kicking.
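That ICMP-vs-alive distinction is easy to check yourself. A minimal sketch (the hostname and port are placeholders, and Linux ping flags are assumed) that probes both ICMP and a TCP port at roughly 30-second resolution:

    # Minimal availability probe, sketch only: host and port are placeholders.
    # It shows the difference between "drops ICMP" and "actually down":
    # ping can fail while a TCP connect to the service port still succeeds.
    import socket, subprocess, time

    HOST = "example.com"   # placeholder target
    TCP_PORT = 443         # placeholder service port

    def icmp_alive(host: str) -> bool:
        # -c 1: single echo request, -W 2: wait at most 2 seconds (Linux ping)
        result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode == 0

    def tcp_alive(host: str, port: int) -> bool:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            return False

    while True:
        print(time.strftime("%H:%M:%S"),
              "icmp:", icmp_alive(HOST),
              "tcp:", tcp_alive(HOST, TCP_PORT))
        time.sleep(30)  # ~30-second resolution, as mentioned above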


Surprised they allow ICMP at all


Why does this surprise you?

Actually, why do people block ICMP? I remember in 1997-1998 there were some Cisco ICMP vulnerabilities, and people started blocking ICMP then and mostly never stopped, and I never understood why. ICMP is so valuable for troubleshooting in certain situations.


Security through obscurity, mostly. I don't know who continues to push the advice to block ICMP without a valid technical reason; at best, if you tilt your head and squint, you could almost maybe see a (very new) script kiddie being defeated by it.

I've rarely actually seen that advice anywhere, more so 20 years ago than now, but people are clearly still getting it from circles I don't run in.


I don’t disagree. I am used to highly regulated industries where ping is blocked across the WAN


I do. Routers, switches, and power redundancy are solved problems in datacenter hardware. Network outages rarely occur because of these systems, and if any component goes down, there's usually an automatic failover. The only thing you might notice is TCP connections resetting and reconnecting, which typically lasts just a few seconds.


Of course. It's a production SaaS, after all. But I don't monitor with sub-minute resolution.


I have for some time now, at the scale of around 20 hosts in their cloud offering. No restarts or network outages. I do see "migrations" from time to time (a VM migrating to different hardware, I presume), but without any impact on metrics.


Having run bare-metal servers for a client + plenty of VMs pre-cloud, you'd be surprised how bloody obvious that sort of thing is when it happens.

All sorts of monitoring gets flipped.

And no, there generally aren't brief outages on normal servers unless you caused them yourself.

I did have someone accidentally shut down one of the servers once though.


To stick to the above point, this wasn't a minute-long outage. If you care about seconds- or minutes-long outages, you monitor. Running on AWS, Hetzner, OVH, or a Raspberry Pi in a shoe box makes no difference.


7 years, 20 servers, same here.



