Extend Ganeti with Out of Band Cluster Node Management Capabilities.
Ganeti currently has no support for out-of-band (OOB) management of the nodes in a cluster. It relies on the OS running on the nodes and therefore has limited options when the OS is not responding. The command gnt-node powercycle can be issued to attempt a reboot of a crashed node, but there is no way to power a node off and power it back on. Supporting this is useful in the following situations:
- Emergency Power Off: During emergencies, time is critical and manual tasks just add latency which can be avoided through automation. If a server room overheats, halting the OS on the nodes is not enough. The nodes need to be powered off cleanly to prevent damage to equipment.
- Repairs: In most cases, repairing a node means that the node has to be powered off.
- Crashes: Software bugs may crash a node. Having an OS independent way to power-cycle a node helps to recover the node without human intervention.
Ganeti will be extended with OOB capabilities by adding a new Cluster Parameter (--oob-program), a new Node Property (--oob-program), a new Node State (powered) and support in gnt-node for invoking an External Helper Command which executes the actual OOB command (gnt-node <command> nodename ...). The supported commands are: power on, power off, power cycle, power status and health.
Note
The new Node State (powered) is a State of Record (SoR), not a State of World (SoW). The maximum execution time of the External Helper Command will be limited to 60s to prevent the cluster from getting locked for an undefined amount of time.
Note
If --oob-program is set to ! then the node has no OOB capabilities. Otherwise, the node inherits the node group value or the cluster-wide value. In other words, nodes have to opt out of OOB capabilities.
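The opt-out and inheritance rule can be sketched as follows. This is only an illustration: the function name, the use of None for "unset", and the lookup order node, then node group, then cluster are assumptions, and the design does not say how a ! set at group or cluster level should be treated, so the sketch only honors it at node level.

```python
from typing import Optional

OPT_OUT = "!"  # per the design: the node has no OOB capabilities


def effective_oob_program(node_value: Optional[str],
                          group_value: Optional[str],
                          cluster_value: Optional[str]) -> Optional[str]:
    """Return the OOB helper path that applies to a node, or None.

    Hypothetical helper; None stands for "not set" at a given level.
    """
    # A node explicitly opts out with "!".
    if node_value == OPT_OUT:
        return None
    # Otherwise the most specific value that is set wins (assumed order).
    for value in (node_value, group_value, cluster_value):
        if value:
            return value
    return None
```

With these assumptions, a node without its own value picks up the group value, and falls back to the cluster value when the group has none.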
- existence and execution flag of the OOB program on all Master Candidates, if the cluster parameter --oob-program is set or at least one node has the property --oob-program set. The OOB helper is only invoked on the master
- check if node state powered matches actual power state of the machine for those nodes where --oob-program is set
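The second check compares the recorded powered state (SoR) with what the OOB helper reports (SoW). A minimal sketch, assuming both are available as dictionaries keyed by node name; the function name and the data layout are hypothetical and not part of Ganeti:

```python
from typing import Dict, List


def find_power_mismatches(recorded: Dict[str, bool],
                          actual: Dict[str, bool]) -> List[str]:
    """Return the names of nodes whose recorded powered state (SoR)
    differs from the power state reported for the machine (SoW).

    Nodes missing from `actual` (for example, nodes that opted out of
    OOB management) are skipped.
    """
    return sorted(
        name for name, powered in recorded.items()
        if name in actual and actual[name] != powered
    )
```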
Ganeti supports the following two boolean states related to the nodes:
And will extend this list with the following boolean state:
Additionally modify the meaning of the offline state as follows:
The corresponding command extensions are:
Additional Output (SoR, omitted if node property --oob-program is not set): powered: [True|False]
- If no nodenames are passed to power [on|off|cycle], the user will be prompted with "Do you really want to power [on|off|cycle] the following nodes: <display list of OOB capable nodes in the cluster>? (y/n)"
- For power-status, nodename is optional. If it is omitted, we list the power-status of all OOB capable nodes in the cluster (SoW)
- User should be warned and needs to confirm with yes if s/he tries to power [off|cycle] a node with running instances.
Exception | Error Message |
---|---|
OOB program return code != 0 | OOB program execution failed ($ERROR_MSG) |
OOB program execution time exceeds 60s | OOB program execution timeout exceeded, OOB program execution aborted |
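The two error cases above could be handled by a wrapper along these lines. It is a sketch under assumptions, not Ganeti's implementation: the OobError class, the function name, and the argument order passed to the helper (command, then node name) are hypothetical. Only the 60s limit and the two error messages come from this design.

```python
import subprocess

OOB_TIMEOUT = 60  # seconds, the limit stated in this design


class OobError(Exception):
    """Hypothetical exception type for failed OOB helper runs."""


def run_oob_helper(program, command, node_name, timeout=OOB_TIMEOUT):
    """Run the external helper and return its stdout as text."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(
            [program, command, node_name],  # argument order is an assumption
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        raise OobError("OOB program execution timeout exceeded, "
                       "OOB program execution aborted")
    if result.returncode != 0:
        # The helper reports its error message on stderr.
        raise OobError("OOB program execution failed (%s)"
                       % result.stderr.strip())
    return result.stdout
```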
State before execution | Command | State after execution | Comment |
---|---|---|---|
powered: False | power off | powered: False | FYI: IPMI will complain if you try to power off a machine that is already powered off |
powered: False | power cycle | powered: False | FYI: IPMI will complain if you try to cycle a machine that is already powered off |
powered: False | power on | powered: True | |
powered: True | power off | powered: False | |
powered: True | power cycle | powered: True | |
powered: True | power on | powered: True | FYI: IPMI will complain if you try to power on a machine that is already powered on |
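The table above reduces to a simple rule: power on always ends in powered: True, power off always ends in powered: False, and power cycle leaves the recorded state unchanged. A small sketch of that rule, with a hypothetical function name:

```python
def powered_after(command: str, powered_before: bool) -> bool:
    """Return the recorded powered state after a successful command,
    following the state table in this design."""
    if command == "power on":
        return True
    if command == "power off":
        return False
    if command == "power cycle":
        # A cycle on a powered-off node leaves it off; on a
        # powered-on node it ends powered on again.
        return powered_before
    raise ValueError("not a power state command: %r" % command)
```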
Note
Example output (represents SoW):
gnt-node oob power-status
Node Power Status
node1.example.com on
node2.example.com off
node3.example.com on
node4.example.com unknown
Note
Example output (represents SoR):
gnt-node info node1.example.com
Node name: node1.example.com
primary ip: 192.168.1.1
secondary ip: 192.168.2.1
master candidate: True
drained: False
offline: False
powered: True
primary for instances:
- inst1.example.com
- inst2.example.com
- inst3.example.com
secondary for instances:
- inst4.example.com
- inst5.example.com
- inst6.example.com
- inst7.example.com
Note
Only nodes which are not opted out from OOB management will report the powered state.
Caveats:
- If no nodename(s) are provided, we will report the health of all nodes in the cluster which have --oob-program set.
- Only nodes which are not opted out from OOB management will report their health. Invoking the command on a node that does not meet this condition will result in an error message “Node does not support OOB commands”.
For error handling see Error Handling
Return code | Meaning |
---|---|
0 | Command succeeded |
1 | Command failed |
others | Unsupported/undefined |
Error messages are passed from the helper program to Ganeti through StdErr (return code == 1). On StdOut, the helper program will send data back to Ganeti (return code == 0). The format of the data is JSON.
Command | Expected output |
---|---|
power-on | None |
power-off | None |
power-cycle | None |
power-status | { "powered": true|false } |
health | [[item, status], [item, status], ...] |
For the health output, the fields are:
Field | Meaning |
---|---|
item | String identifier of the item we are querying the health of, examples: |
status | String; can take one of the following four values: |
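On the Ganeti side, the helper's stdout could be decoded and checked against the expected output per command roughly as follows. This is a sketch: the function name and the choice to return a plain bool or a list of tuples are assumptions, and the item names in the usage example are made up.

```python
import json

POWER_COMMANDS = ("power-on", "power-off", "power-cycle")


def parse_oob_output(command, stdout):
    """Decode helper output for one command.

    Returns None for power-on/off/cycle, a bool for power-status,
    and a list of (item, status) tuples for health.
    """
    if command in POWER_COMMANDS:
        return None  # no output expected
    data = json.loads(stdout)
    if command == "power-status":
        if not (isinstance(data, dict)
                and isinstance(data.get("powered"), bool)):
            raise ValueError("power-status must return a 'powered' boolean")
        return data["powered"]
    if command == "health":
        if not isinstance(data, list):
            raise ValueError("health must return a list of pairs")
        pairs = []
        for entry in data:
            if not (isinstance(entry, list) and len(entry) == 2
                    and all(isinstance(x, str) for x in entry)):
                raise ValueError("bad health entry: %r" % (entry,))
            pairs.append((entry[0], entry[1]))
        return pairs
    raise ValueError("unsupported OOB command: %s" % command)
```

For example, `parse_oob_output("health", '[["disk0", "WARNING"]]')` yields a one-element list of (item, status) tuples, where disk0 is a hypothetical item name.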
Note
The gnt-node power-[on|off] (power state changes) commands will create log entries following current Ganeti logging practices. In addition, health items with status WARNING or CRITICAL will be logged for each run of gnt-node health.
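The health logging rule could look like the sketch below. The logger name, the message format, and the mapping of CRITICAL to the error level and WARNING to the warning level are assumptions; the design only states that items with these two statuses are logged.

```python
import logging


def log_health(node_name, items, logger=None):
    """Log health items whose status is WARNING or CRITICAL.

    `items` is an iterable of (item, status) pairs. Returns the pairs
    that were logged.
    """
    logger = logger or logging.getLogger("oob-health")  # hypothetical name
    logged = []
    for item, status in items:
        if status == "CRITICAL":
            logger.error("%s: %s is %s", node_name, item, status)
        elif status == "WARNING":
            logger.warning("%s: %s is %s", node_name, item, status)
        else:
            continue  # any other status is not logged
        logged.append((item, status))
    return logged
```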