Ganeti Maintenance Daemon
This design document outlines the implementation of a new Ganeti daemon coordinating all maintenance operations on a cluster (rebalancing, activating disks, ERROR_down handling, node repair actions).
Current state and shortcomings
With harep, Ganeti has a basic mechanism for repairs of instances
in a cluster. The
harep tool can fix a broken DRBD status, migrate,
fail over, and reinstall instances. It is intended to be run regularly,
e.g., via a cron job. It will submit appropriate Ganeti jobs to take
action within the range allowed by instance tags and keep track
of them by recording the job ids in appropriate tags.
Apart from harep, Ganeti offers no further support for repair automation.
While useful, this setup can be insufficient in some situations.
Failures in actual hardware, e.g., a physical disk, currently require coordination around Ganeti: the hardware failure is detected on the node, Ganeti needs to be told to evacuate the node, and, once this is done, some other entity needs to coordinate the actual physical repair. Ganeti currently offers no support for automatically preparing everything for a hardware swap.
We propose the addition of a new daemon, called maintd,
that will coordinate cluster balance actions, instance repair actions,
and work for hardware repair needs of individual nodes. The information
about the work to be done will be obtained from a dedicated data collector
via the Ganeti monitoring agent.
Self-diagnose data collector
The monitoring daemon will get one additional dedicated data collector for node health. The collector will call an external command that is supposed to perform any hardware-specific diagnosis for the node it is running on. That command is configurable, but needs to be white-listed ahead of time by the node. For convenience, the empty string will stand for a built-in diagnosis that always reports that everything is OK; this will also be the default value for this collector.
Note that the self-diagnose data collector itself can, and usually will, call separate diagnostic tools for separate subsystems. However, it always has to provide a consolidated description of the overall health state of the node.
The collector script takes no arguments and is supposed to output the string representation of a single JSON object where the individual fields have the following meaning. Note that, if several things are broken on that node, the self-diagnose collector script has to merge them into a single repair action.
The status is a JSON string whose value is one of four states, among them
live-repair and evacuate-failover. It indicates the overall need for
repair and the Ganeti actions to be taken. In order of increasing severity,
the states mean: no action is needed; some action is needed that can be taken
while instances continue to run on that node (live-repair); it is necessary
to evacuate and offline the node; and it is necessary to evacuate and offline
the node without attempting live migrations (evacuate-failover).
If the status is
live-repair, a repair command can be specified.
This command will be executed as a repair action following the
Design for executing commands via RPC, extended, however, to read information
from stdin. The whole diagnose JSON object will be provided on stdin
to those commands.
The object can also carry an opaque JSON value that the repair daemon will just pass through and export. It is intended to contain information about the type of repair that needs to be done after the respective Ganeti action is finished. E.g., it might state which piece of hardware is to be swapped once the node is fully evacuated and offlined.
As two failures are considered different if the output of the script encodes a different JSON object, the collector script should ensure that, as long as the hardware status does not change, its output is stable; otherwise the same failure would be reported as several events.
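The stability requirement can be illustrated with a minimal sketch of how a collector script might serialize its result. The key names "status", "command" and "details", as well as the example command name, are assumptions made for this illustration only; the design fixes the meaning of the fields, not this code.

```python
import json


def render_diagnose(status, command=None, details=None):
    """Serialize a diagnose object deterministically.

    The key names "status", "command" and "details" are hypothetical
    and chosen for this sketch only.
    """
    diagnose = {"status": status}
    if command is not None:
        diagnose["command"] = command
    if details is not None:
        diagnose["details"] = details
    # Sorted keys and fixed separators keep the output byte-identical
    # for as long as the reported hardware state does not change.
    return json.dumps(diagnose, sort_keys=True, separators=(",", ":"))


if __name__ == "__main__":
    # "replace-fan" is a made-up command name for illustration.
    print(render_diagnose("live-repair", command="replace-fan",
                          details={"fan": 2}))
```

Serializing with sorted keys means that two runs observing the same hardware state emit the same text, so they would be treated as the same event.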
Obviously, running arbitrary commands that are part of the configuration
poses a security risk. Note that an underlying design goal of Ganeti is
that, even with RAPI credentials known to an attacker, the attacker still cannot
obtain data from within the instances. As monitoring, however, is configurable
via RAPI, we require the node to white-list the command using a mechanism
similar to the Design for executing commands via RPC; in our case, the white-listing
directory will be
For the repair-commands, as mentioned, we extend the
Design for executing commands via RPC by allowing input on
stdin. All other
restrictions, in particular the white-listing requirement, remain. The
white-listing directory will be
As the repair daemon will take real Ganeti actions based on the diagnosis
reported by the self-diagnose script through the monitoring daemon, we
need to verify the integrity of such reports to avoid denial of service by
fraudulent error reports. Therefore, the monitoring daemon will sign
the result with an HMAC signature using the cluster HMAC key, in the same
way as is done in the
confd wire protocol (see Ganeti 2.1 design).
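As a rough sketch of the idea, a report could be signed and checked as follows. The envelope layout (keys "msg" and "hmac") and the choice of SHA-256 are assumptions made for this example; they are not taken from the confd wire protocol, which defines its own format.

```python
import hashlib
import hmac
import json


def sign_report(key: bytes, report: dict) -> dict:
    """Wrap a report in a signed envelope (hypothetical layout)."""
    payload = json.dumps(report, sort_keys=True)
    digest = hmac.new(key, payload.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return {"msg": payload, "hmac": digest}


def verify_report(key: bytes, envelope: dict) -> dict:
    """Return the report if the signature matches, else raise ValueError."""
    expected = hmac.new(key, envelope["msg"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing.
    if not hmac.compare_digest(expected, envelope["hmac"]):
        raise ValueError("HMAC mismatch: report rejected")
    return json.loads(envelope["msg"])
```

A consumer holding the cluster key would call verify_report before acting on a report, so that a forged report is rejected rather than turned into Ganeti jobs.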
Repair-event life cycle
Once a repair event is detected, a unique identifier is assigned to it. As long as the node-health collector returns the same output (as JSON object), this is still considered the same event. This identifier can be used to cancel an observed event at any time; for this an appropriate command-line and RAPI endpoint will be provided. Cancelling an event tells the repair daemon not to take any actions (despite them being requested) for this event and forget about it, as soon as it is no longer observed.
Corresponding Ganeti actions will be initiated, and the success or failure of
these Ganeti jobs will be monitored. All jobs submitted by the repair daemon
will have the string
gnt:daemon:maintd and the event identifier
in the reason trail, so that filtering as described in Filtering of jobs for the Ganeti job queue is possible.
Once a job fails, no further jobs will be submitted for this event
to avoid further damage; the repair action is considered failed in this case.
Once all requested actions succeeded, or one failed, the node where the
event was observed will be tagged with a tag starting with
maintd:repairready: or maintd:repairfailed:, respectively, where the event identifier is
encoded in the rest of the tag. On the one hand, the tag can be used as an
additional verification of whether a node is ready for a specific repair.
However, the main purpose is to provide a simple and uniform interface
to acknowledge an event. Once a
maintd:repairready tag is removed,
the maintenance daemon will forget about this event as soon as it is no
longer observed by any monitoring daemon. Removing a maintd:repairfailed
tag will make the maintenance daemon unconditionally forget the event;
note that, if the underlying problem is not fixed yet, this provides an
easy way of restarting a repair flow.
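The tagging scheme can be sketched as two small helpers. How exactly the event identifier is encoded in the rest of the tag is not specified here, so this sketch assumes it is appended verbatim; the function names are hypothetical.

```python
READY_PREFIX = "maintd:repairready:"
FAILED_PREFIX = "maintd:repairfailed:"


def make_tag(event_id: str, succeeded: bool) -> str:
    """Build the node tag for a finished repair event.

    Assumption: the identifier is appended to the prefix unchanged.
    """
    prefix = READY_PREFIX if succeeded else FAILED_PREFIX
    return prefix + event_id


def parse_tag(tag: str):
    """Return (outcome, event_id) for a maintd tag, or None otherwise."""
    for prefix, outcome in ((READY_PREFIX, "ready"),
                            (FAILED_PREFIX, "failed")):
        if tag.startswith(prefix):
            return outcome, tag[len(prefix):]
    return None
```

An operator tool could use parse_tag on a node's tags to find out which events are waiting to be acknowledged.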
The new daemon
maintd will be running on the master node only. It will
verify the master status of its node by popular vote in the same way as all the
other master-only daemons. If started on a non-master node, it will exit
immediately with exit code
exitNotmaster, i.e., 11.
External Reporting Protocol
Upon successful start, the daemon will bind to a port (1816 by default, overridable on the command line) on the master network device. There it will serve the current repair state via HTTP. All queries will be HTTP GET requests and all answers will be encoded in JSON format. Initially, the following requests will be supported.
Returns the list of supported protocol versions, initially just
Returns a list of all non-cleared incidents. Each incident is reported as a JSON object with at least the following information.
id: The unique identifier assigned to the event.
node: The UUID of the node on which the event was observed.
original: The very JSON object reported by the self-diagnose data collector.
repair-status: A string describing the progress made on this event so far. It is one of the following.
    noted: The event has been observed, but no action has been taken yet.
    pending: At least one job has been submitted in reaction to the event and none of the submitted jobs has failed so far.
    canceled: The event has been canceled, i.e., ordered to be ignored, but is still observed.
    failed: At least one of the submitted jobs has failed. To avoid further damage, the repair daemon will not take any further action for this event.
    completed: All Ganeti actions associated with this event have been completed successfully, including tagging the node.
jobs: The list of the numbers of Ganeti jobs submitted in response to this event.
tag: A string that is the tag that either has been added to the node, or, if the repair event is not yet finalized, will be added in case of success.
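To make the incident format concrete, the following sketch parses an answer of the shape described above and picks out the incidents whose repair has failed. All values in the sample (identifiers, node names, job numbers, the key inside "original") are made up for illustration, and the helper name is hypothetical.

```python
import json

# Made-up sample answer; only the top-level field names come from the
# list above.
SAMPLE = """
[
  {"id": "event-1", "node": "node-uuid-a",
   "original": {"status": "live-repair"},
   "repair-status": "pending", "jobs": [101, 102],
   "tag": "maintd:repairready:event-1"},
  {"id": "event-2", "node": "node-uuid-b",
   "original": {"status": "evacuate-failover"},
   "repair-status": "failed", "jobs": [103],
   "tag": "maintd:repairfailed:event-2"}
]
"""


def failed_incidents(incidents):
    """Return the ids of incidents whose repair-status is "failed"."""
    return [inc["id"] for inc in incidents
            if inc["repair-status"] == "failed"]


if __name__ == "__main__":
    print(failed_incidents(json.loads(SAMPLE)))
```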
As repairs, especially those involving physically swapping hardware, can take a long time, the repair daemon needs to store its state persistently. As we cannot exclude master-failovers during a repair cycle, it does so by storing it as part of the Ganeti configuration.
This will be done by adding a new top-level entry to the Ganeti configuration. The SSConf will not be changed.
harep and implicit balancing
To have a single point coordinating all repair actions, the new repair daemon
will also have the ability to take over the work currently done by harep.
To allow a smooth transition,
maintd, when carrying out harep's duties,
will add tags in precisely the same way as harep does.
As the new daemon will have to move instances, it will also have the ability
to balance the cluster in a way coordinated with the necessary evacuation
options; dynamic load information can be taken into account.
The questions of whether to do
harep’s work, and of whether to balance the
cluster and, if so, using which strategy (e.g., taking dynamic load information
into account or not, allowing disk moves or not), are configurable via the Ganeti
configuration. The default will be to do neither of those tasks.
harep will continue to exist unchanged as part of the htools.
Mode of operation
The repair daemon will poll the monitoring daemons for the value of the self-diagnose data collector at the same (configurable) rate at which the monitoring daemon runs this collector; if load-based balancing is enabled, it will also collect the load data needed.
Repair events will be exposed on the web status page as soon as observed. The Ganeti jobs doing the actual maintenance will be submitted in rounds. A new round will be started if all jobs of the old round have finished, and there is an unhandled repair event or the cluster is unbalanced enough (provided that autobalancing is enabled).
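The condition for starting a new round can be written down as a small predicate. This is only a sketch of the rule stated above; the function and parameter names are hypothetical.

```python
def should_start_round(jobs_still_running, unhandled_events,
                       unbalanced, autobalance_enabled):
    """Decide whether a new round of jobs may be submitted.

    A round starts only when the previous round has fully finished and
    there is something to do: an unhandled repair event, or a cluster
    that is unbalanced enough while autobalancing is enabled.
    """
    if jobs_still_running:
        return False
    return bool(unhandled_events) or bool(autobalance_enabled and unbalanced)
```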
In each round,
maintd will first determine the most invasive action for
each node; despite the self-diagnose collector summing observations in a single
action recommendation, a new, more invasive recommendation can be issued before
the handling of the first recommendation is finished. For all nodes to be
evacuated, the first evacuation task is scheduled, in a way that these tasks do
not conflict with each other. Then, for all instances on a non-affected node
that need harep-style repair (if enabled), those jobs are scheduled to the
extent that they do not conflict with each other. Then, on the remaining nodes that
are not part of a failed repair event either, the jobs
of the first balancing step are scheduled. All those jobs of a round are
submitted at once. As they do not conflict, they will be able to run in parallel.
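Two steps of a round lend themselves to a short sketch: choosing the most invasive recommendation per node, and selecting jobs that do not conflict. The state names "ok" and "evacuate" below are placeholders assumed for this illustration, and modelling a conflict as "two jobs touch the same node" is likewise an assumption; actual conflicts depend on the locks the Ganeti jobs need.

```python
# Ordered from least to most invasive. "ok" and "evacuate" are
# placeholder names assumed for this sketch.
INVASIVENESS = ["ok", "live-repair", "evacuate", "evacuate-failover"]


def most_invasive_per_node(observations):
    """Map each node to the most invasive status observed for it.

    observations: iterable of (node, status) pairs.
    """
    result = {}
    for node, status in observations:
        current = result.get(node)
        if (current is None or
                INVASIVENESS.index(status) > INVASIVENESS.index(current)):
            result[node] = status
    return result


def pick_non_conflicting(jobs):
    """Greedily select jobs whose node sets are pairwise disjoint.

    jobs: list of (name, nodes_touched) pairs, in priority order.
    """
    busy = set()
    chosen = []
    for name, nodes in jobs:
        if busy.isdisjoint(nodes):
            chosen.append(name)
            busy.update(nodes)
    return chosen
```

Jobs left out by pick_non_conflicting in one round would simply be candidates again in the next round.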