=========================
Ganeti Maintenance Daemon
=========================

.. contents:: :depth: 4

This design document outlines the implementation of a new Ganeti
daemon coordinating all maintenance operations on a cluster
(rebalancing, disk activation, ERROR_down handling, node repair actions).


Current state and shortcomings
==============================

With ``harep``, Ganeti has a basic mechanism for repairs of instances
in a cluster. The ``harep`` tool can fix a broken DRBD status, migrate,
failover, and reinstall instances. It is intended to be run regularly,
e.g., via a cron job. It will submit appropriate Ganeti jobs to take
action within the range allowed by instance tags and keep track
of them by recording the job IDs in appropriate tags.

Besides ``harep``, Ganeti offers no further support for repair automation.
While useful, this setup can be insufficient in some situations.

Failures in actual hardware, e.g., a physical disk, currently require
coordination around Ganeti: the hardware failure is detected on the node,
Ganeti needs to be told to evacuate the node, and, once this is done, some
other entity needs to coordinate the actual physical repair. Currently,
Ganeti offers no support for automatically preparing everything for a
hardware swap.


Proposed changes
================

We propose the addition of a new daemon, called ``maintd``, that will
coordinate cluster balance actions, instance repair actions, and work for
hardware repair needs of individual nodes. The information about the work
to be done will be obtained from a dedicated data collector via the
:doc:`design-monitoring-agent`.

Self-diagnose data collector
----------------------------

The monitoring daemon will get one additional dedicated data collector for
node health. The collector will call an external command that is expected
to perform any hardware-specific diagnosis for the node it is running on.
That command is configurable, but needs to be white-listed ahead of time
by the node.
For convenience, the empty string will stand for a built-in diagnosis that
always reports that everything is OK; this will also be the default value
for this collector.

Note that the self-diagnose data collector itself can, and usually will,
call separate diagnosis tools for separate subsystems. However, it always
has to provide a consolidated description of the overall health state
of the node.

Protocol
~~~~~~~~

The collector script takes no arguments and is supposed to output the string
representation of a single JSON object whose individual fields have the
following meaning. Note that, if several things are broken on that node, the
self-diagnose collector script has to merge them into a single repair action.

status
......

This is a JSON string whose value is one of ``Ok``, ``live-repair``,
``evacuate``, ``evacuate-failover``. This indicates the overall need for
repair and Ganeti actions to be taken. The meanings of these states are,
respectively: no action is needed; some action is needed that can be taken
while instances continue to run on that node; it is necessary to evacuate
and offline the node; and it is necessary to evacuate and offline the node
without attempting live migrations.

command
.......

If the status is ``live-repair``, a repair command can be specified.
This command will be executed as a repair action following the
:doc:`design-restricted-commands`, extended, however, to read information
on ``stdin``. The whole diagnosis JSON object will be provided as ``stdin``
to those commands.

details
.......

An opaque JSON value that the repair daemon will just pass through and
export. It is intended to contain information about the type of repair
that needs to be done after the respective Ganeti action is finished.
E.g., it might contain information about which piece of hardware is to be
swapped once the node is fully evacuated and offlined.

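As an illustration of the protocol above, the following is a minimal sketch
of a collector script in Python. Only the field names and the four status
values come from this design. The function names ``build_report`` and
``render``, the internal ``findings`` format, the contents of ``details``,
and the severity ordering used for merging are assumptions of this sketch.

```python
import json

# Status values defined by the protocol. Ordering them from least to most
# invasive is an assumption of this sketch, used only for merging.
SEVERITY = ["Ok", "live-repair", "evacuate", "evacuate-failover"]


def build_report(findings):
    """Merge per-subsystem findings into one consolidated report.

    `findings` is a list of dicts with a "status" key and optional
    "command" and "details" keys (a hypothetical internal format).
    """
    if not findings:
        return {"status": "Ok"}
    worst = max(findings, key=lambda f: SEVERITY.index(f["status"]))
    report = {"status": worst["status"]}
    # A repair command is only meaningful for live repairs. If several live
    # repairs are needed, a real script would have to provide one command
    # covering all of them; this sketch simply keeps the first.
    if worst["status"] == "live-repair" and "command" in worst:
        report["command"] = worst["command"]
    # Keep the details of every non-Ok finding so nothing is lost in the merge.
    details = [f["details"] for f in findings
               if f["status"] != "Ok" and "details" in f]
    if details:
        report["details"] = details
    return report


def render(report):
    """Serialize deterministically: the same report gives the same string."""
    return json.dumps(report, sort_keys=True)


if __name__ == "__main__":
    findings = [
        {"status": "Ok"},
        {"status": "evacuate", "details": {"component": "disk", "slot": 3}},
    ]
    print(render(build_report(findings)))
```

Serializing with sorted keys is one simple way to keep the output identical
for as long as the underlying findings do not change.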

Two failures are considered different if the output of the script encodes
a different JSON object. The collector script should therefore ensure that
its output is stable as long as the hardware status does not change;
otherwise, several events would be reported for the same failure.

Security considerations
~~~~~~~~~~~~~~~~~~~~~~~

Command execution
.................

Obviously, running arbitrary commands that are part of the configuration
poses a security risk. Note that an underlying design goal of Ganeti is
that even an attacker who knows the RAPI credentials still cannot obtain
data from within the instances. As monitoring, however, is configurable via
RAPI, we require the node to white-list the command using a mechanism
similar to the :doc:`design-restricted-commands`; in our case, the
white-listing directory will be ``/etc/ganeti/node-diagnose-commands``.

For the repair commands, as mentioned, we extend the
:doc:`design-restricted-commands` by allowing input on ``stdin``. All other
restrictions, in particular the white-listing requirement, remain. The
white-listing directory will be ``/etc/ganeti/node-repair-commands``.

Result forging
..............

As the repair daemon will take real Ganeti actions based on the diagnosis
reported by the self-diagnose script through the monitoring daemon, we
need to verify the integrity of such reports to avoid denial of service by
fraudulent error reports. Therefore, the monitoring daemon will sign the
result with an HMAC signature using the cluster HMAC key, in the same way
as it is done in the ``confd`` wire protocol (see :doc:`design-2.1`).

Repair-event life cycle
-----------------------

Once a repair event is detected, a unique identifier is assigned to it.
As long as the node-health collector returns the same output (as a JSON
object), this is still considered the same event.
This identifier can be used to cancel an observed event at any time; for
this, an appropriate command-line and RAPI endpoint will be provided.
Cancelling an event tells the repair daemon not to take any actions for
this event (even though they are requested) and to forget about it as soon
as it is no longer observed.

Corresponding Ganeti actions will be initiated, and success or failure of
these Ganeti jobs will be monitored. All jobs submitted by the repair
daemon will have the string ``gnt:daemon:maintd`` and the event identifier
in the reason trail, so that :doc:`design-optables` is possible.
Once a job fails, no further jobs will be submitted for this event to avoid
further damage; the repair action is considered failed in this case.

Once all requested actions have succeeded, or one has failed, the node
where the event was observed will be tagged with a tag starting with
``maintd:repairready:`` or ``maintd:repairfailed:``, respectively,
where the event identifier is encoded in the rest of the tag. This tag
can be used as an additional verification of whether a node is ready for a
specific repair. Its main purpose, however, is to provide a simple and
uniform interface to acknowledge an event. Once a ``maintd:repairready:``
tag is removed, the maintenance daemon will forget about this event as
soon as it is no longer observed by any monitoring daemon. Removing a
``maintd:repairfailed:`` tag will make the maintenance daemon
unconditionally forget the event; note that, if the underlying problem is
not fixed yet, this provides an easy way of restarting a repair flow.

Repair daemon
-------------

The new daemon ``maintd`` will run on the master node only. It will
verify the master status of its node by popular vote in the same way as
all the other master-only daemons. If started on a non-master node, it
will exit immediately with exit code ``exitNotmaster``, i.e., 11.

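The tagging convention from the repair-event life cycle can be sketched as
follows. The two tag prefixes are taken from this design; the function names
are hypothetical, and appending the event identifier verbatim is an
assumption, as the exact encoding of the identifier in the tag is not
specified here.

```python
READY_PREFIX = "maintd:repairready:"
FAILED_PREFIX = "maintd:repairfailed:"


def repair_tag(event_id, succeeded):
    """Return the node tag for a finished repair event.

    Assumption: the event identifier is appended to the prefix unchanged.
    """
    prefix = READY_PREFIX if succeeded else FAILED_PREFIX
    return prefix + event_id


def parse_repair_tag(tag):
    """Return (event_id, succeeded) for a maintd tag, or None otherwise."""
    for prefix, succeeded in ((READY_PREFIX, True), (FAILED_PREFIX, False)):
        if tag.startswith(prefix):
            return tag[len(prefix):], succeeded
    return None
```

A tool acknowledging events could use ``parse_repair_tag`` to decide which
node tags belong to ``maintd`` and which event each of them refers to.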

External Reporting Protocol
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Upon successful start, the daemon will bind to a port on the master
network device; the port is 1816 by default and can be overridden on the
command line. There it will serve the current repair state via HTTP. All
queries will be HTTP GET requests and all answers will be encoded in JSON
format. Initially, the following requests will be supported.

``/``
.....

Returns the list of supported protocol versions, initially just ``[1]``.

``/1/status``
.............

Returns a list of all non-cleared incidents. Each incident is reported
as a JSON object with at least the following information.

- ``uuid`` The unique identifier assigned to the event.

- ``node`` The UUID of the node on which the event was observed.

- ``original`` The exact JSON object reported by the self-diagnose data
  collector.

- ``repair-status`` A string describing the progress made on this event so
  far. It is one of the following.

  + ``noted`` The event has been observed, but no action has been taken yet.

  + ``pending`` At least one job has been submitted in reaction to the
    event and none of the submitted jobs has failed so far.

  + ``canceled`` The event has been canceled, i.e., ordered to be ignored,
    but is still observed.

  + ``failed`` At least one of the submitted jobs has failed. To avoid
    further damage, the repair daemon will not take any further action for
    this event.

  + ``completed`` All Ganeti actions associated with this event have been
    completed successfully, including tagging the node.

- ``jobs`` The list of the numbers of the Ganeti jobs submitted in response
  to this event.

- ``tag`` A string that is the tag that either has been added to the node,
  or, if the repair event is not yet finalized, will be added in case of
  success.

State
~~~~~

As repairs, especially those involving physically swapping hardware, can
take a long time, the repair daemon needs to store its state persistently.
As we cannot exclude master failovers during a repair cycle, it does so by
storing its state as part of the Ganeti configuration.

This will be done by adding a new top-level entry to the Ganeti
configuration. The SSConf will not be changed.

Superseding ``harep`` and implicit balancing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To have a single point coordinating all repair actions, the new repair
daemon will also have the ability to take over the work currently done by
``harep``. To allow a smooth transition, ``maintd``, when carrying out
``harep``'s duties, will add tags in precisely the same way as ``harep``
does. As the new daemon will have to move instances, it will also have the
ability to balance the cluster in a way coordinated with the necessary
evacuation options; dynamic load information can be taken into account.

Whether to do ``harep``'s work, whether to balance the cluster, and, if
so, which strategy to use (e.g., taking dynamic load information into
account or not, allowing disk moves or not) are configurable via the
Ganeti configuration. The default will be to do neither of those tasks.
``harep`` will continue to exist unchanged as part of the ``htools``.

Mode of operation
~~~~~~~~~~~~~~~~~

The repair daemon will poll the monitoring daemons for the value of the
self-diagnose data collector at the same (configurable) rate at which the
monitoring daemon runs this collector; if load-based balancing is enabled,
it will also collect the load data needed.

Repair events will be exposed on the web status page as soon as they are
observed. The Ganeti jobs doing the actual maintenance will be submitted
in rounds. A new round will be started if all jobs of the old round have
finished, and there is an unhandled repair event or the cluster is
unbalanced enough (provided that autobalancing is enabled).

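The condition for starting a new round can be written out as a small
predicate. This is only a sketch: the function and parameter names are
hypothetical, and the grouping of the "and" and "or" reflects one reading of
the sentence above.

```python
def should_start_round(old_jobs_finished, unhandled_events,
                       autobalance_enabled, cluster_unbalanced):
    """Decide whether a new round of maintenance jobs should be started."""
    # Balancing only counts as pending work if autobalancing is enabled.
    balancing_needed = autobalance_enabled and cluster_unbalanced
    work_pending = bool(unhandled_events) or balancing_needed
    # A new round never overlaps with jobs of the previous round.
    return bool(old_jobs_finished) and work_pending
```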

In each round, ``maintd`` will first determine the most invasive action for
each node; although the self-diagnose collector merges its observations
into a single action recommendation, a new, more invasive recommendation
can be issued before the handling of the first recommendation is finished.
For all nodes to be evacuated, the first evacuation task is scheduled in
such a way that these tasks do not conflict with each other. Then, for all
instances on non-affected nodes that need ``harep``-style repair (if
enabled), those jobs are scheduled to the extent that they do not conflict
with each other. Then, on the remaining nodes that are not part of a failed
repair event either, the jobs of the first balancing step are scheduled.
All jobs of a round are submitted at once. As they do not conflict, they
will be able to run in parallel.
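
The selection of non-conflicting jobs for one round can be illustrated with
a simple greedy pass. This is a sketch under stated assumptions, not the
scheduling algorithm of ``maintd``: the function name ``select_round`` and
the job names are hypothetical, and two jobs are assumed to conflict exactly
when they touch a common node.

```python
def select_round(candidates):
    """Greedily pick jobs whose node sets are pairwise disjoint.

    `candidates` is a list of (job_name, nodes) pairs, already ordered by
    priority: evacuations first, then harep-style repairs, then balancing.
    """
    busy = set()    # nodes already used by a job chosen for this round
    chosen = []
    for name, nodes in candidates:
        nodes = set(nodes)
        if busy.isdisjoint(nodes):
            chosen.append(name)
            busy |= nodes
    return chosen


if __name__ == "__main__":
    jobs = [
        ("evacuate-node1", ["node1", "node2"]),
        ("repair-inst3", ["node2", "node3"]),   # shares node2, so it waits
        ("balance-step1", ["node4", "node5"]),
    ]
    print(select_round(jobs))
```

Jobs skipped in one round, such as the repair job in the usage example,
would be reconsidered in a later round once the earlier jobs have finished.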