This is a design document detailing the integration of Ganeti and Linux HA.
Ganeti doesn’t currently support any self-healing or self-monitoring.
We are now working to improve the situation in this regard:
What is still missing is a way to self-detect “obvious” failures rapidly and to:
Linux-HA provides software that can be used to provide high availability of services through automatic failover of resources. In particular Pacemaker can be used together with Heartbeat or Corosync to make sure a resource is kept active on a self-monitoring cluster.
The Ganeti agents will be slightly special in the HA world. The following will apply:
Note that, for what Ganeti does, OCF agents are needed: simply relying on the LSB scripts will not work for the Ganeti service.
This agent will manage the Ganeti master role. It needs to be configured as a sticky resource (you don’t want to flap the master role around, do you?) that is active on only one node. You can require quorum or fencing to protect your cluster from multiple masters.
The agent will implement a stateless resource that considers itself “started” only on the master node, “stopped” on all master candidates, and in error mode on all other nodes.
Note that if not all your nodes are master candidates this resource might have problems:
Other solutions, such as also reporting the resource as just “stopped” on nodes that are not master candidates, might mean that Pacemaker would choose the “wrong” node to promote to master, which is also a bad idea.
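The status reporting described above for the master role agent can be sketched as a small mapping from node role to monitor exit code. This is only an illustration, not the shipped agent: the NodeRole labels and the master_monitor_status function are hypothetical names, and the choice of OCF_ERR_GENERIC for “error mode” is an assumption, since the design does not say which error code is returned. The numeric values are the standard OCF exit codes.

```python
from enum import Enum

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0       # resource is running
OCF_ERR_GENERIC = 1   # generic error
OCF_NOT_RUNNING = 7   # resource is cleanly stopped


class NodeRole(Enum):
    """Hypothetical role labels for this sketch, not Ganeti constants."""
    MASTER = "master"
    MASTER_CANDIDATE = "master-candidate"
    OTHER = "other"


def master_monitor_status(role: NodeRole) -> int:
    """Map a node's role to the exit code of the agent's monitor action."""
    if role is NodeRole.MASTER:
        # "started" only on the master node
        return OCF_SUCCESS
    if role is NodeRole.MASTER_CANDIDATE:
        # "stopped" on all master candidates
        return OCF_NOT_RUNNING
    # Assumption: "error mode" is reported as a generic OCF error.
    return OCF_ERR_GENERIC
```

Reporting an error rather than “stopped” on nodes that are not master candidates is what keeps the HA layer from treating them as promotion targets, at the cost of the problems noted above.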
This agent will manage the Ganeti node role. It needs to be configured as a cloned resource that is active on all nodes.
In partial mode it will always return success (and thus trigger a failure only upon an HA level or network failure). Full mode, which initially will not be implemented, could also check for the node daemon being unresponsive or other local conditions (TBD).
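The difference between the two modes can be sketched as follows. This is a minimal illustration under stated assumptions: node_monitor_status and the daemon_responds callback are hypothetical names, the mode strings are placeholders, and the full-mode check is reduced to a single probe because the actual local conditions are still TBD in the design.

```python
from typing import Callable

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def node_monitor_status(mode: str, daemon_responds: Callable[[], bool]) -> int:
    """Return the monitor exit code for the node role agent.

    daemon_responds is a hypothetical probe that returns True when the
    node daemon answers; it is only consulted in full mode.
    """
    if mode == "partial":
        # Partial mode never fails locally; failures are detected only
        # by the HA layer itself (node or network loss).
        return OCF_SUCCESS
    if mode == "full":
        # Assumption: an unresponsive node daemon maps to a generic error.
        return OCF_SUCCESS if daemon_responds() else OCF_ERR_GENERIC
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, node_monitor_status("partial", probe) succeeds regardless of what the probe would report, while node_monitor_status("full", probe) fails as soon as the probe returns False.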
When a failure happens the HA notification system will trigger on all other nodes, including the master. The master will then be able to offline the node. Any other work to restore instance availability should then be done by the autorepair system.
The following cluster tags are supported:
Running Ganeti with Pacemaker increases the risk of instability for your Ganeti cluster. Events like:
will trigger potentially dangerous operations such as node offlining or master role failover. Moreover, if the autorepair system is running, these events will also be able to trigger instance failovers or migrations, and disk replacements.
Also note that operations like master-failover or a manual node-modify might interact badly with this setup, depending on the way your HA system is configured (see below).
This is of course an inherent problem with any Linux-HA installation, but it is probably more visible with Ganeti, given that our resources tend to be more heavyweight than many others managed in HA clusters (e.g. an IP address).
This code is heavily experimental, and Linux-HA is a very complex subsystem. We might not be able to help you if you decide to run this code: please make sure you fully understand high availability on your production machines. Ganeti ships this code only as an example, and it might need customization or complex configuration on your side to run properly.
Ganeti does not automate HA configuration for your cluster. You need to do this job by hand. Good luck, don’t get it wrong.