
The clustering story for etcd is pretty lacking in general. The discovery mechanisms are not built for cattle-type infrastructure or public clouds: it is difficult to bootstrap a cluster on a public cloud without knowing ahead of time the network interfaces your nodes will have, and the alternatives require you to already have an etcd cluster or to use SRV records. In my experience etcd makes it hard to use auto scaling groups for healing and rolling updates.
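
For reference, the SRV route looks roughly like this on each member (the domain and hostnames are placeholders, and it assumes the _etcd-server._tcp SRV records already exist in DNS):

  etcd --name infra0 \
    --discovery-srv example.com \
    --initial-advertise-peer-urls http://infra0.example.com:2380 \
    --initial-cluster-token etcd-cluster-1 \
    --initial-cluster-state new \
    --advertise-client-urls http://infra0.example.com:2379 \
    --listen-client-urls http://0.0.0.0:2379 \
    --listen-peer-urls http://0.0.0.0:2380

That still assumes stable, resolvable per-node hostnames, which is exactly what you don't get from a vanilla auto scaling group.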

From my experience Consul seems to have a better clustering story, but I'd be curious why etcd won out over other technologies as the k8s datastore of choice.



> From my experience Consul seems to have a better clustering story, but I'd be curious why etcd won out over other technologies as the k8s datastore of choice.

That'd be some interesting history. That choice had a big impact in making etcd relevant, I think. As far as I know, etcd was chosen before Kubernetes ever went public, pre-2014? So it must have been really bleeding edge at the time. I don't think Consul was even out then; it may simply have been too late to the game. The only other reasonable option was probably ZooKeeper.


I was around at CoreOS before Kubernetes existed. I don't recall exactly when etcd was chosen as the data store, but the Google team valued focus for this very important part of the system.

etcd didn't have an embedded DNS server, etc. Of course, these things can be built on top of etcd easily. Upstream has taken advantage of this by swapping the DNS server used in Kubernetes twice, IIRC.

Contrast this with Consul which contains a DNS server and is now moving into service mesh territory. This isn't a fault of Consul at all, just a desire to be a full solution vs a building block.


My understanding is that Google valued the fact that etcd was willing to support gRPC and Consul wasn't -- i.e., raw performance/latency was the gating factor. etcd was historically far less stable and less well documented than Consul, even though Consul had more functionality. etcd may have caught up in the last couple years, though.


At the time gRPC was not part of etcd - that only arrived in etcd 3.x.

The design of etcd 3.x was heavily influenced by the Kube use case, but the original value of etcd was that:

A) you could actually do a reasonably cheap HA story (vs singleton DBs)

B) the clustering fundamentals were sound (ZooKeeper at the time was not able to do dynamic reconfiguration, although in practice this hasn’t been a big issue)

C) consul came with a lot of baggage that we wanted to do differently - not to knock consul, it just overlapped with alternate design decisions (like a large local agent instead of a set of lightweight agents)

D) etcd was the simplest possible option that also supported efficient watch
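
To make the last point concrete, here's a minimal sketch of what watch looks like against etcd 3.x with the official Go client (the endpoint and key prefix are just illustrative, and error handling is trimmed):

  package main

  import (
      "context"
      "fmt"
      "time"

      clientv3 "go.etcd.io/etcd/client/v3"
  )

  func main() {
      // Connect to a local etcd member (endpoint is a placeholder).
      cli, err := clientv3.New(clientv3.Config{
          Endpoints:   []string{"127.0.0.1:2379"},
          DialTimeout: 5 * time.Second,
      })
      if err != nil {
          panic(err)
      }
      defer cli.Close()

      // One long-lived stream delivers every change under the prefix,
      // so callers like Kubernetes can keep caches in sync without polling.
      for resp := range cli.Watch(context.Background(), "/registry/", clientv3.WithPrefix()) {
          for _, ev := range resp.Events {
              fmt.Printf("%s %s = %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
          }
      }
  }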

While I wasn’t part of the pre-open-sourcing discussions, I agreed with the initial rationale and I don’t regret the choice.

The etcd 2-to-3 migration was more painful than it could have been, but I think most of the challenges were exacerbated by us not pulling the band-aid off early and forcing a 2-to-3 migration for all users right after 1.6.


My impression is that etcd sits at a lower level of data-store abstraction than Consul, which is exactly why it's not as feature-rich but works well as a building block. Consul packs more out of the box if that's what you need.

Both are still much better to operate than ZooKeeper.


There are several ways of bootstrapping etcd. The one I use is the one you mention: since the nodes are brought up with Terraform, always on a brand new VPC, we can calculate what the IP addresses will be in Terraform itself and fill in the initial node list that way. We can destroy an etcd node if need be and recreate it. Granted, it is nowhere near as convenient as an ASG.
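
Concretely, what gets rendered per node is just etcd's static bootstrap configuration, along these lines (names and addresses here are placeholders):

  etcd --name node1 \
    --initial-advertise-peer-urls http://10.0.1.10:2380 \
    --listen-peer-urls http://10.0.1.10:2380 \
    --advertise-client-urls http://10.0.1.10:2379 \
    --listen-client-urls http://10.0.1.10:2379,http://127.0.0.1:2379 \
    --initial-cluster-token my-etcd-cluster \
    --initial-cluster node1=http://10.0.1.10:2380,node2=http://10.0.1.11:2380,node3=http://10.0.1.12:2380 \
    --initial-cluster-state new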

The alternative method, and the method we used before, is to use an existing cluster, as you mention. If cattle self-healing is that important, perhaps you could afford a small cluster used only for bootstrapping? Load will be very low unless you are bootstrapping a node somewhere. There are costs involved in keeping those instances up 24/7, but they may be acceptable in your environment (and the instances can be small). Then the only thing you need is to store the discovery token and inject it with cloud-init or some other mechanism.
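
Roughly, that flow looks like the following (the bootstrap cluster URL and token are placeholders, and it assumes the bootstrap cluster exposes the v2 keys API that etcd's discovery protocol uses):

  # register a discovery token on the bootstrap cluster, sized for 3 members
  curl -X PUT https://bootstrap-etcd.internal:2379/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/_config/size -d value=3

  # each new member points at that URL and finds its peers through it
  etcd --name node1 \
    --initial-advertise-peer-urls http://10.0.1.10:2380 \
    --listen-peer-urls http://10.0.1.10:2380 \
    --advertise-client-urls http://10.0.1.10:2379 \
    --listen-client-urls http://10.0.1.10:2379 \
    --discovery https://bootstrap-etcd.internal:2379/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83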

That said, I just finished a task to automate our ELK clusters. For Elasticsearch I can just point to a load balancer in front of the masters and be done with it. I wish I could do the same for etcd.



