Ganeti Node OOB Management Framework
====================================

Objective
---------

Extend Ganeti with Out of Band (:term:`OOB`) Cluster Node Management
Capabilities.

Background
----------

Ganeti currently has no support for Out of Band management of the nodes
in a cluster. It relies on the OS running on the nodes and has therefore
limited possibilities when the OS is not responding. The command
``gnt-node powercycle`` can be issued to attempt a reboot of a node that
crashed but there are no means to power a node off and power it back
on. Supporting this is very handy in the following situations:

  * **Emergency Power Off**: During emergencies, time is critical and
    manual tasks just add latency which can be avoided through
    automation. If a server room overheats, halting the OS on the nodes
    is not enough. The nodes need to be powered off cleanly to prevent
    damage to equipment.
  * **Repairs**: In most cases, repairing a node means that the node has
    to be powered off.
  * **Crashes**: Software bugs may crash a node. Having an OS
    independent way to power-cycle a node helps to recover the node
    without human intervention.

Overview
--------

Ganeti will be extended with OOB capabilities through adding a new
**Cluster Parameter** (``--oob-program``), a new **Node Property**
(``--oob-program``), a new **Node State (powered)** and support in
``gnt-node`` for invoking an **External Helper Command** which executes
the actual OOB command (``gnt-node nodename ...``). The supported
commands are: ``power on``, ``power off``, ``power cycle``,
``power status`` and ``health``.

.. note::
  The new **Node State (powered)** is a **State of Record**
  (:term:`SoR`), not a **State of World** (:term:`SoW`). The maximum
  execution time of the **External Helper Command** will be limited to
  60s to prevent the cluster from getting locked for an undefined amount
  of time.
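
The 60-second limit on the **External Helper Command** could be enforced
roughly as follows. This is a minimal Python sketch, not Ganeti's
implementation: the names ``build_oob_argv`` and ``run_oob_command`` are
hypothetical, and the calling convention (program, command words, node
name) is an assumption, since this document does not specify how the
helper is invoked.

```python
import os
import subprocess

# Commands listed in the Overview section.
SUPPORTED_COMMANDS = (
    "power on", "power off", "power cycle", "power status", "health",
)

# Upper bound on helper run time, per the design (60 s).
OOB_TIMEOUT_SECONDS = 60


def build_oob_argv(oob_program, command, node_name):
    """Return the argument vector for one helper invocation.

    Assumption: the helper is called as <program> <command words> <node>;
    the design text does not fix the calling convention.
    """
    if not os.path.isabs(oob_program):
        raise ValueError("OOB program must be an absolute path: %r" % oob_program)
    if command not in SUPPORTED_COMMANDS:
        raise ValueError("unsupported OOB command: %r" % command)
    return [oob_program] + command.split() + [node_name]


def run_oob_command(oob_program, command, node_name,
                    timeout=OOB_TIMEOUT_SECONDS):
    """Run the helper and return (exit_code, stdout, stderr).

    Raises subprocess.TimeoutExpired if the helper runs longer than
    ``timeout`` seconds; subprocess.run kills the child before re-raising.
    """
    argv = build_oob_argv(oob_program, command, node_name)
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr
```

A caller would invoke, for example,
``run_oob_command("/usr/local/sbin/oob-helper", "power status", "node1.example.com")``
(the helper path is hypothetical) and treat a ``subprocess.TimeoutExpired``
exception as a failed OOB operation, so the cluster is never blocked for
longer than the configured limit.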
Detailed Design
---------------

New ``gnt-cluster`` Parameter
+++++++++++++++++++++++++++++

| Program: ``gnt-cluster``
| Command: ``modify|init``
| Parameters: ``--oob-program``
| Options: ``--oob-program``: executable OOB program (absolute path)

New ``gnt-cluster epo`` Command
+++++++++++++++++++++++++++++++

| Program: ``gnt-cluster``
| Command: ``epo``
| Parameter: ``--on`` ``--force`` ``--groups`` ``--all``
| Options: ``--on``: By default epo turns off, with ``--on`` it tries to get the
|                   cluster back online
|          ``--force``: To force the operation without asking for confirmation
|          ``--groups``: To operate on groups instead of nodes
|          ``--all``: To operate on the whole cluster

This is a convenience command to allow easy emergency power off of a
whole cluster or part of it. It takes care of all steps needed to get
the cluster into a sane state to turn off the nodes.

With ``--on`` it does the reverse and tries to bring the rest of the
cluster back to life.

.. note::
  The master node is not able to shut itself down cleanly. Therefore,
  this command will not do all the work on single node clusters. On
  multi node clusters the command tries to find another master or, if
  that is not possible, prepares everything to the point where the user
  has to shut down the master node manually. This also applies to the
  single node cluster configuration.

New ``gnt-node`` Property
+++++++++++++++++++++++++

| Program: ``gnt-node``
| Command: ``modify|add``
| Parameters: ``--oob-program``
| Options: ``--oob-program``: executable OOB program (absolute path)

.. note::
  If ``--oob-program`` is set to ``!`` then the node has no OOB
  capabilities. Otherwise, the node inherits the value of its node group
  or of the cluster. I.e. the nodes have to opt out from OOB
  capabilities.

Addition to ``gnt-cluster verify``
++++++++++++++++++++++++++++++++++

| Program: ``gnt-cluster``
| Command: ``verify``
| Parameter: None
| Option: None
| Additional Checks:

  1. existence and execution flag of OOB program on all Master
     Candidates if the cluster parameter ``--oob-program`` is set or at
     least one node has the property ``--oob-program`` set. The OOB
     helper is only invoked on the master
  2. check if node state powered matches actual power state of the
     machine for those nodes where ``--oob-program`` is set

New Node State
++++++++++++++

Ganeti supports the following two boolean states related to the nodes:

**drained**
  The cluster still communicates with drained nodes but excludes them
  from allocation operations

**offline**
  if offline, the cluster does not communicate with offline nodes;
  useful for nodes that are not reachable in order to avoid delays

And will extend this list with the following boolean state:

**powered**
  if not powered, the cluster does not communicate with the node; if the
  node property ``--oob-program`` is not set, the state powered is not
  displayed

Additionally modify the meaning of the offline state as follows:

**offline**
  if offline, the cluster does not communicate with offline nodes
  (**with the exception of OOB commands for nodes where**
  ``--oob-program`` **is set**); useful for nodes that are not reachable
  in order to avoid delays

The corresponding command extensions are:

| Program: ``gnt-node``
| Command: ``info``
| Parameter: [ ``nodename`` ... ]
| Option: None

Additional Output (:term:`SoR`, omitted if node property
``--oob-program`` is not set):
powered: ``[True|False]``

| Program: ``gnt-node``
| Command: ``modify``
| Parameter: nodename
| Option: [ ``--powered=yes|no`` ]
| Reasoning: sometimes you will need to sync the :term:`SoR` with the :term:`SoW` manually
| Caveat: ``--powered`` can only be modified if ``--oob-program`` is set for
|         the node in question

New ``gnt-node`` commands: ``power [on|off|cycle|status]``
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

| Program: ``gnt-node``
| Command: ``power [on|off|cycle|status]``
| Parameters: [ ``nodename`` ... ]
| Options: None
| Caveats:

  * If no nodenames are passed to ``power [on|off|cycle]``, the user
    will be prompted with ``"Do you really want to power [on|off|cycle]
    the following nodes: