This is a design document detailing the cluster maintenance scheduler, HRoller.
To automate cluster-wide reboots, a new htool called HRoller was added to Ganeti in version 2.7. This tool helps parallelize offline cluster maintenance by calculating groups of nodes in which no two nodes are the primary and secondary of the same DRBD instance; the nodes of such a group can be rebooted at the same time, when all instances are down.
The way this is done is documented in the hroller(1) manpage.
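As a rough illustration of the idea (not of HRoller's actual implementation, which is described in the manpage), the grouping can be viewed as coloring a conflict graph: two nodes conflict if some DRBD instance has one as primary and the other as secondary, and nodes of the same color can be rebooted together. The Python sketch below uses a simple greedy heuristic; the function name `reboot_groups` and the input format are assumptions made for this example.

```python
from collections import defaultdict


def reboot_groups(nodes, instances):
    """Partition nodes into groups that can be rebooted together.

    nodes:     list of node names (hypothetical input format)
    instances: list of (primary, secondary) pairs, one per DRBD instance
    Returns a list of sets of node names.
    """
    # Build the conflict graph: primary and secondary of the same
    # instance must never end up in the same reboot group.
    conflicts = defaultdict(set)
    for primary, secondary in instances:
        if secondary is not None and primary != secondary:
            conflicts[primary].add(secondary)
            conflicts[secondary].add(primary)

    groups = []
    # Greedy coloring: place the most constrained nodes first, each into
    # the first group that holds none of its conflicting nodes.
    for node in sorted(nodes, key=lambda n: (-len(conflicts[n]), n)):
        for group in groups:
            if conflicts[node].isdisjoint(group):
                group.add(node)
                break
        else:
            groups.append({node})
    return groups


if __name__ == "__main__":
    example_nodes = ["n1", "n2", "n3", "n4"]
    example_instances = [("n1", "n2"), ("n2", "n3"), ("n3", "n4")]
    for number, group in enumerate(reboot_groups(example_nodes, example_instances), 1):
        print(number, sorted(group))
```

A greedy coloring does not guarantee the minimum number of groups, but every group it produces is safe in the sense described above.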
We would now like to perform online maintenance on the cluster by rebooting nodes after evacuating their primary instances (rolling reboots).
In order to perform rolling maintenance we need to migrate instances off the nodes before a reboot. How this can be done depends on the instance’s disk template and status:
If an instance was shut down when the maintenance started, it is still taken into account to avoid rebooting its primary and secondary nodes at the same time, but it is not considered as a target for the node evacuation. This avoids needlessly moving its primary around, since it won’t suffer downtime anyway.
Note that a node with non-redundant instances will only ever be considered good for rolling-reboot if these are down (or the checking of status is overridden) and an explicit option to allow it is set.
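The two rules above can be sketched as a pair of small predicates. This is only an illustration under assumed names: the dictionary keys (`name`, `redundant`, `running`) and the parameters `allow_non_redundant` and `ignore_status` are hypothetical and do not correspond to actual HRoller options.

```python
def node_eligible(primary_instances, allow_non_redundant=False, ignore_status=False):
    """Decide whether a node may take part in a rolling reboot.

    primary_instances: list of dicts describing the instances whose primary
    is this node, with the (hypothetical) keys "name", "redundant", "running".
    """
    for inst in primary_instances:
        if inst["redundant"]:
            continue
        # Non-redundant instances need the explicit opt-in...
        if not allow_non_redundant:
            return False
        # ...and must be down, unless the status check is overridden.
        if inst["running"] and not ignore_status:
            return False
    return True


def instances_to_evacuate(primary_instances):
    """Only instances that are running are evacuation targets; instances
    that were already shut down stay where they are."""
    return [inst["name"] for inst in primary_instances if inst["running"]]
```

Instances that are shut down still constrain which nodes may be rebooted together; they are merely skipped when choosing what to migrate.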
Each node must migrate all its instances off to their secondaries; it can then either be rebooted, or its secondaries can be evacuated as well.
Since doing a replace-disks on DRBD currently breaks redundancy, it’s not any safer than temporarily rebooting a node that holds secondaries (citation needed). As such, for now we’ll implement just the “migrate+reboot” mode, and focus on replace-disks later.
In order to do that we can use the following algorithm:
All non-DRBD disk templates that can be migrated have no “secondary” concept. As such instances can be migrated to any node (in the same nodegroup). In order to do the job we can either:
Note that for non-DRBD disks that still use local storage (e.g. RBD and plain) redundancy might break anyway, and nothing except the first algorithm might be safe. This would perhaps be a good reason to consider managing RBD pools better, if those are implemented on top of node storage rather than on dedicated storage machines.
If full evacuation of the nodes to be rebooted is desired, a simple migration is not enough for the DRBD instances. To keep the number of disk operations small, we restrict moves to migrate and replace-secondary. That is, after migrating instances out of the nodes to be rebooted, replacement secondaries are searched for all instances whose secondary is, at that point, on one of the nodes to be rebooted. This is done by a greedy algorithm, which refines the initial reboot partition if necessary.
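The following Python sketch shows one way such a greedy search could look. It is not the algorithm HRoller uses: the capacity model (a single free-disk number per node), the choice of the node with the most free disk, and the refinement rule (drop the blocking node from the group so it is rebooted later) are all assumptions made for illustration, as are the names `plan_full_evacuation`, `instances` and `free_disk`.

```python
def plan_full_evacuation(group, instances, free_disk):
    """Find replacement secondaries for one reboot group.

    group:     set of node names to reboot together
    instances: dict name -> {"primary", "secondary", "disk"}, describing the
               state after primaries were migrated off the group
    free_disk: dict node name -> free disk space (same unit as "disk")
    Returns (group, moves), where group is the possibly reduced reboot group
    and moves maps instance name -> new secondary node.
    """
    group = set(group)
    while group:
        free = dict(free_disk)  # work on a copy for each attempt
        moves = {}
        blocker = None
        for name in sorted(instances):
            inst = instances[name]
            if inst["secondary"] not in group:
                continue
            candidates = [
                node for node in free
                if node not in group
                and node != inst["primary"]
                and free[node] >= inst["disk"]
            ]
            if not candidates:
                blocker = inst["secondary"]
                break
            # Greedy choice: the candidate with the most free disk.
            target = max(candidates, key=lambda node: (free[node], node))
            free[target] -= inst["disk"]
            moves[name] = target
        if blocker is None:
            return group, moves
        # Refine the partition: postpone the node that could not be emptied.
        group.discard(blocker)
    return group, {}
```

For example, with `group={"n1", "n2"}`, an instance `a` (primary `n3`, secondary `n1`) and an instance `b` (primary `n4`, secondary `n2`), the function either returns a new secondary for both instances or, if no node has room for one of them, a smaller group that can still be fully evacuated.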
HRoller should become able to execute rolling maintenances, rather than just calculate them. For this to succeed properly one of the following must happen:
The replace-disks functionality for DRBD nodes should be implemented. Note that once we support a DRBD version that allows multiple secondaries, this can be done safely, without losing replication at any time, by adding a temporary secondary and dropping the previous one only when the sync has finished.
Non-redundant (plain or file) instances should have a way to be moved off as well via plain storage live migration or gnt-instance move (which requires downtime).
If/when RBD pools can be managed inside Ganeti, care can be taken so that the pool is evacuated as well from a node before it’s put into maintenance. This is equivalent to evacuating DRBD secondaries.
Master failovers during the maintenance should be performed by HRoller. This requires RPC/RAPI support for master failover. HRoller should also be modified to better support running on the master itself and continuing on the new master.