Ilya Volodarsky

Co-founder @ Segment.io

Distributed Systems: Take Responsibility for Failover

More and more systems are advertising high availability but leaving the responsibility for failover in the hands of third-party clients or the end user. That forces the average user to implement their own (janky) failover scheme to keep their app from failing. Systems that advertise availability should be responsible for failover up to and including the client.

Failover preparedness is a necessity, especially in cloud environments like EC2. Almost every week something goes amiss: EBS drives suddenly become unresponsive, or you get that dreaded email saying your hardware is being decommissioned. In an ideal world you wouldn’t have to care about any of that. As long as you take the time to set up your cluster correctly, the distributed system should manage availability and replication for you.

Some system designs lend themselves to smarter and safer clients. Small design decisions can go a long way towards making the system easier to maintain for the end user. We’ve used a variety of distributed systems at Segment.io. Here’s how the ones we’ve used recently rank in failover preparedness and ease of maintainability:

Intermediary: talk through a local coordination process.

MongoDB simplifies failover by putting its own mongos process between your client and the database servers. You run mongos on the same machine as your client. As long as you set up your config servers and replica sets correctly, Mongo will manage everything for you.

Just connect to the local process, and you never have to worry about a single failure in your cluster. If you manage to get a network partition in your loopback adapter, then buy yourself some ice cream and take the rest of the day off!

The intermediary design removes the need for your clients to know which database server to connect to. You always connect to localhost, both in development and in production.
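For instance, with the Python driver the connection code is identical in every environment. Here’s a minimal sketch, assuming mongos is listening locally on its default port (the "app" database and "users" collection are made-up names):

    from pymongo import MongoClient

    # Talk to the local mongos process; it routes each operation to
    # whichever replica set member is currently primary.
    client = MongoClient("mongodb://localhost:27017")
    db = client["app"]

    # A failover behind the scenes doesn't change this code at all.
    db.users.insert_one({"name": "ilya"})
    print(db.users.find_one({"name": "ilya"}))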

As an end user, this design makes it incredibly easy to work with Mongo. No wonder it’s been so widely adopted.

Dynamo Ring: all your nodes are identical.

Cassandra lets clients connect to any of the nodes in its dynamo ring, and any node can route the request to where it needs to go. As long as enough replicas are available to satisfy your desired consistency level, you can always talk to your database. This is almost as good as Mongo’s intermediary design, except that you’ve got to keep an updated list of all the database node addresses somewhere in your app configuration.
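Here’s roughly what that looks like with the DataStax Python driver (a sketch only; the node addresses and keyspace are placeholders): you hand the client a seed list from your configuration and pick the consistency level per query.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # The seed list has to live somewhere in your app configuration;
    # these addresses are placeholders.
    cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    session = cluster.connect("app")

    # Any reachable node can coordinate the request, as long as enough
    # replicas are up to satisfy the consistency level.
    query = SimpleStatement(
        "SELECT * FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    row = session.execute(query, ("ilya",)).one()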

Netflix’s Cassandra client, Astyanax, goes one better. It auto-discovers your Cassandra ring so that it can send each request to the closest replica. You can even enable nifty features like BadHostDetector to avoid high-latency nodes.

Cassandra nails ring availability, but leaves third-party clients to implement their own ring discovery. If Apache provided an Astyanax local endpoint similar to mongos, then Cassandra would assume the responsibility for failover rather than forcing it onto its clients. Instead, you get fragmentation in client quality. Astyanax is great, but the node.js client you’re using doesn’t do ring describe, which means if your seed nodes go down, you’re still out of luck.

Master + Slave: new masters are promoted.

RabbitMQ nodes can cluster together to form a master and slave setup. If a master goes down, the cluster will promote a new master. To achieve high availability, RabbitMQ added mirrored queues in version 2.6, which replicate each message across the master and slave brokers. That means that if the master dies, you can write to and read from the same queue on another broker without losing any messages. That seems great, until you realize that you have to implement your own failover routing.

The problem is that the clients don’t support automatic failover to a slave, which means your app code has to detect the failure and then reconnect to another broker. The Java client doesn’t handle connection pooling across multiple hosts; it leaves that to the user. You can put an HAProxy TCP load balancer in front of your brokers, but that’s a lot to ask.

You’re left writing your own failover layer in front, one that selects an available connection. But, take it from me, that can get really hairy. You consume a database by making short-lived requests to it; message brokers, by contrast, push messages to you. Say you want broker B to take messages from a queue and push them to consumer C. If B has a heart attack, then your code has to:

  1. Establish a connection to a new slave broker.
  2. Traverse all open consumers and re-consume their queues.
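Roughly, with the Python pika client, that reconnect-and-re-consume loop looks like this (a sketch only; the broker hostnames and queue name are placeholders, and a real app has to replay every open consumer, not just one queue):

    import pika

    BROKERS = ["rabbit-a.internal", "rabbit-b.internal"]  # placeholder hosts
    QUEUE = "events"                                       # placeholder queue

    def handle(channel, method, properties, body):
        print("got message:", body)
        channel.basic_ack(delivery_tag=method.delivery_tag)

    def consume_forever():
        # Step 1: walk the broker list until a connection sticks.
        # Step 2: re-declare and re-consume the queue after every reconnect.
        while True:
            for host in BROKERS:
                try:
                    connection = pika.BlockingConnection(
                        pika.ConnectionParameters(host=host))
                    channel = connection.channel()
                    channel.queue_declare(queue=QUEUE, durable=True)
                    channel.basic_consume(queue=QUEUE, on_message_callback=handle)
                    channel.start_consuming()
                except pika.exceptions.AMQPConnectionError:
                    continue  # that broker is down, try the next one

    consume_forever()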

That’s not fun code to write, and shouldn’t be the responsibility of the end user.

An intermediary would also work well here, allowing clients to connect and consume from a local process that handles routing to a currently available node.

Master + Slave: clients pick the master.

Redis uses a master-slave system for replication, but from the client’s perspective you only ever write to the master. If the master goes down and slave writes are enabled, you risk accidentally writing to a slave. If that happens and the master comes back online, you’ll lose all that data on the re-sync.

So what do you do for failover? Your app code needs to either crash and notify the admin, or choose the next master. But as everyone knows, choosing the next master is a difficult task, because it requires building, or leaning on, another coordination system like ZooKeeper.

The good news is that antirez realized this was Redis users’ main concern. The result is Sentinel. Once stable and deployed (and supported by your client), Sentinel will automatically decide on a new master in the event of a failure.
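With redis-py, for example, the client-side half looks roughly like this (the Sentinel address and the "mymaster" name are placeholders): instead of hard-coding the master’s address, you ask Sentinel who the current master is.

    from redis.sentinel import Sentinel

    # Sentinel's default port is 26379; it tracks which node is master.
    sentinel = Sentinel([("localhost", 26379)], socket_timeout=0.5)

    master = sentinel.master_for("mymaster", socket_timeout=0.5)
    replica = sentinel.slave_for("mymaster", socket_timeout=0.5)

    master.set("greeting", "hello")  # writes always go to the elected master
    print(replica.get("greeting"))   # reads can come from a slave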

Looking Forward

All of the systems mentioned above are incredibly good at what they do. After all, we’ve used them all in production on Segment.io at some point.

It’s just a shame that most of them require so much effort from the developer to achieve a highly available setup. Systems should manage their own availability right up to the app code, instead of off-loading that responsibility onto their end users or third-party clients.

By building the coordination process into the system itself, we could save N people from having to implement the same logic over and over again. And with a single implementation, all improvements are immediately shared by everyone, no matter what language or client.

Nowadays, with everyone striving for the best developer experience, we can do better.

Have anything to add? @ me or reply on Hacker News.