Overcoming a Zookeeper Problem on Mesos

This week I dealt with an interesting problem at a client involving Zookeeper failures and Mesos. The previous week, heavy utilization on the cluster had led to certain services misbehaving. Other people on the client’s development team were trying to rectify the situation, and rather than simply getting rid of the abusive service that was causing the trouble, they decided to attack things they thought were problems with the infrastructure. Eventually, somebody suspected that Zookeeper was having issues, so they stopped all of the Zookeeper nodes on the cluster and deleted their data.

For anybody who has not seen what happens when Zookeeper’s data is obliterated on a live cluster: it’s a train wreck. It is a bad idea, and you should never do it. The person who killed Zookeeper said they knew they had made a terrible mistake immediately after they did it. They restored things as best they could, and everything seemed to be limping along this week. Then one day someone observed strange behavior on the Mesos cluster. One of the services was refusing to start, and Spark drivers were not able to start Spark workers. Strange log events were spooling through Mesos. The only difference I noticed was that a leader election had occurred - one of the standby Mesos masters had been promoted to leading master.

Upon inspecting the coordination between the two, it looked like Zookeeper and Mesos were still out of sync: state information was not being handed off to the newly promoted master. I then realized that Mesos had never been restarted since the fateful event the previous week when Zookeeper was killed. The best option seemed to be to stop all of Mesos - the Mesos slaves and the Mesos masters - as well as any Mesos frameworks. Once the cluster was quiet, I had to worry about Docker.
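
For what it’s worth, here is a minimal sketch of how that shutdown could be scripted. The hostnames are placeholders, and it assumes Mesosphere-style upstart job names (mesos-master, mesos-slave, marathon) plus passwordless SSH to each node - adjust for your own setup.

```python
# Hypothetical sketch of quiescing the cluster before a clean restart.
# Hostnames, passwordless SSH, and the Mesosphere-style upstart job names
# ("mesos-master", "mesos-slave", "marathon") are all assumptions.
import subprocess

MASTERS = ["master1", "master2", "master3"]   # placeholder hostnames
SLAVES = ["slave1", "slave2", "slave3"]       # placeholder hostnames
FRAMEWORKS = {"master1": "marathon"}          # host -> framework job name

def stop_job(host, job):
    """Stop an upstart job over SSH; tolerate jobs that are already stopped."""
    subprocess.call(["ssh", host, "sudo", "stop", job])

# Stop frameworks first so they do not try to relaunch tasks,
# then the slaves, then finally the masters.
for host, job in FRAMEWORKS.items():
    stop_job(host, job)
for host in SLAVES:
    stop_job(host, "mesos-slave")
for host in MASTERS:
    stop_job(host, "mesos-master")
```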

We primarily use Docker as the executor for Mesos. That means that when services are started or stopped, the Mesos slave has to communicate with the Docker executor built into Mesos 0.20, which in turn has to communicate with the Docker daemon to actually stop the service. Since Mesos was in a bad state, that never happened.
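
One way to see that disconnect is to ask the slave and the Docker daemon separately what each thinks is running. This is a hypothetical diagnostic sketch: the slave’s state endpoint on port 5051 and the "mesos-" container name prefix used by the Docker containerizer are assumptions based on default settings.

```python
# Hypothetical diagnostic: compare what a Mesos slave believes it is running
# with what the local Docker daemon is actually running. The 5051 slave port
# and the "mesos-" container name prefix are assumed defaults.
import json
import subprocess
import urllib.request

# Tasks the slave believes are running, from its state endpoint.
state = json.load(urllib.request.urlopen("http://localhost:5051/state.json"))
mesos_tasks = [
    task["name"]
    for framework in state.get("frameworks", [])
    for executor in framework.get("executors", [])
    for task in executor.get("tasks", [])
]

# Containers the Docker daemon is actually running for Mesos.
names = subprocess.check_output(
    ["docker", "ps", "--filter", "name=mesos-", "--format", "{{.Names}}"]
).decode().split()

print("slave reports %d tasks; docker reports %d mesos containers"
      % (len(mesos_tasks), len(names)))
```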

Something else I learned from this: if you are using a Docker-based infrastructure, restarting Mesos alone is not enough. To recover from this sort of thing, you also have to manually stop all of the Docker containers that are left behind - essentially orphaned by the shutdown of Mesos, which is no longer managing them. This is an obvious opportunity for a helper script or job that gets initiated by the upstart job used to start and stop the Mesos slave; a sketch of one follows below. In this case I went and stopped all of those Docker containers by hand. Without doing so, we would have had duplicate data-producing jobs running, which Mesos would no longer have visibility of.
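
Here is a minimal sketch of that helper, assuming the "mesos-" name prefix the Docker containerizer gives its containers; something like it could be hooked into the post-stop stanza of the mesos-slave upstart job.

```python
# Sketch of a cleanup helper: stop any running Docker containers that were
# launched by Mesos and then orphaned. The "mesos-" name prefix is an
# assumption based on the Docker containerizer's default naming.
import subprocess

def orphaned_mesos_containers():
    """Return the IDs of running containers whose names match 'mesos-'."""
    out = subprocess.check_output(["docker", "ps", "-q", "--filter", "name=mesos-"])
    return out.decode().split()

def stop_orphans():
    for container_id in orphaned_mesos_containers():
        print("stopping orphaned container %s" % container_id)
        subprocess.call(["docker", "stop", container_id])

if __name__ == "__main__":
    stop_orphans()
```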

Once all of those Docker containers were stopped, I started up mesos-master on every node, followed by mesos-slave on every node. I was able to verify through the Mesos web UI, on port 5050, that the correct number of slaves were running. Finally, I started Marathon back up, which immediately began restoring all the services it had previously been running.
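
Rather than eyeballing the web UI, the same check can be made against the master’s state endpoint, which is the same data the UI reads. A small sketch, with the master hostname and expected slave count as placeholders:

```python
# Sketch: confirm the expected number of slaves have re-registered with the
# leading master. The hostname and expected count are placeholders.
import json
import urllib.request

EXPECTED_SLAVES = 10  # placeholder
state = json.load(urllib.request.urlopen("http://mesos-master:5050/master/state.json"))
registered = len(state.get("slaves", []))
print("%d of %d expected slaves registered" % (registered, EXPECTED_SLAVES))
```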

In summary: I started up each of the Mesos masters, then each of the Mesos slaves, and finally Marathon. That allowed Mesos to properly re-initialize its information under its Zookeeper path (Mesos uses /mesos by default) and rebuild its state. At that point everything began to work properly.
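
As a final sanity check, a short kazoo snippet like the following can confirm that the masters have written their znodes back under /mesos; the Zookeeper connection string is a placeholder.

```python
# Sketch: verify that the Mesos masters have re-registered under their
# default Zookeeper path. The connection string is a placeholder.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
try:
    # Each live master creates an ephemeral child znode under /mesos, so a
    # healthy cluster should show entries here again after the restart.
    children = zk.get_children("/mesos")
    print("znodes under /mesos: %s" % ", ".join(sorted(children)))
finally:
    zk.stop()
```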