Ganeti 2.3 introduces a number of new features that change the cluster internals significantly enough that the htools suite needs to be updated accordingly in order to function correctly.
The addition of node groups has a small impact on the actual algorithms, which will simply operate at node group level instead of cluster level, but it requires the addition of new algorithms for inter-node group operations.
The following two definitions will be used in the following paragraphs:
In all the operations below, it’s assumed that Ganeti can gather the entire super cluster state cheaply.
Balancing will move from cluster-level balancing to group balancing. In order to achieve a reasonable improvement in a super cluster, without needing to keep state of what groups have been already balanced previously, the balancing algorithm will run as follows:
Of course, explicit selection of a group will be allowed.
Besides the regular group balancing, in a super cluster we have more operations.
In a regular cluster, once we run out of resources (offline nodes which can’t be fully evacuated, N+1 failures, etc.) there is nothing we can do unless nodes are added or instances are removed.
In a super cluster however, there might be resources available in another group, so there is the possibility of relocating instances between groups to re-establish N+1 success within each group.
One difficulty in the presence of both super clusters and shared storage is that the move paths of instances are quite complicated: an instance can move inside its local group, and to any other group which has access to the same storage type and storage pool pair. In effect, the super cluster is composed of multiple ‘partitions’, each containing one or more groups, but a node is simultaneously present in multiple partitions, one for each storage type and storage pool it supports. The interactions between the individual partitions are therefore too complex, for non-trivial clusters, to assume we can compute a perfect solution: we might need to move some instances using shared storage pool ‘A’ in order to clear some more memory to accept an instance using local storage, which will in turn clear more VCPUs in a third partition, etc. As such, we’ll limit ourselves to simple relocation steps within a single partition.
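The partition structure described above can be illustrated with a minimal sketch. It assumes each group is described only by the set of (storage type, storage pool) pairs it supports; the function names and the example data are hypothetical and not part of htools.

```python
from collections import defaultdict


def build_partitions(group_storage):
    """Map each (storage_type, storage_pool) pair to the groups supporting it."""
    result = defaultdict(set)
    for group, pairs in group_storage.items():
        for pair in pairs:
            result[pair].add(group)
    return dict(result)


def peers(group, group_storage):
    """Return the other groups sharing at least one storage pair with `group`."""
    mine = group_storage[group]
    return {
        other
        for other, pairs in group_storage.items()
        if other != group and pairs & mine
    }


# Hypothetical super cluster: g2 sits in two partitions, g4 is alone in its own.
example = {
    "g1": {("shared", "A")},
    "g2": {("shared", "A"), ("shared", "B")},
    "g3": {("shared", "B")},
    "g4": {("shared", "C")},
}
```

Here `peers("g4", example)` is empty, so a group like `g4` would be skipped by any inter-group operation, while `g2` can exchange instances with both `g1` and `g3`, depending on which storage pair the instance uses.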
Algorithm:
read super cluster data, and exit if cluster doesn’t allow inter-group moves
filter out any groups that are “alone” in their partition (i.e. no other group sharing at least one storage method)
determine list of healthy versus unhealthy groups:
- a group which contains offline nodes still hosting instances is definitely not healthy
- a group which has nodes failing N+1 is ‘weakly’ unhealthy
if either list is empty, exit (no work to do, or no way to fix problems)
for each unhealthy group:
- compute the instances that are causing the problems: all instances living on offline nodes, all instances living as secondary on N+1 failing nodes, all instances living as primaries on N+1 failing nodes (in this order)
- remove instances, one by one, until the source group is healthy again
- try to run a standard allocation procedure for each instance on all potential groups in its partition
- if all instances were relocated successfully, it means we have a solution for repairing the original group
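The steps above can be sketched as follows. This is a simplified model under stated assumptions, not the htools implementation: groups are reduced to a single free-memory figure, the "standard allocation procedure" is replaced by a first-fit placeholder, and each unhealthy group's `problem_instances` list is assumed to be already ordered and trimmed to the instances whose removal makes the group healthy. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    free_mem: int = 0
    offline_with_instances: bool = False  # offline nodes still hosting instances
    n1_failures: bool = False             # nodes failing N+1 ('weakly' unhealthy)
    # (instance name, memory) pairs, ordered as in the text: instances on
    # offline nodes, then N+1 secondaries, then N+1 primaries.
    problem_instances: list = field(default_factory=list)


def is_healthy(group):
    return not (group.offline_with_instances or group.n1_failures)


def first_fit(mem, candidates, free):
    """Placeholder for the real allocator: first candidate with enough memory."""
    return next((name for name in candidates if free[name] >= mem), None)


def plan_repairs(groups, peers):
    """Return {unhealthy group: list of (instance, target group)} or None if unfixable."""
    healthy = [g for g in groups if is_healthy(g)]
    unhealthy = [g for g in groups if not is_healthy(g)]
    if not healthy or not unhealthy:
        return {}  # no work to do, or no way to fix problems
    free = {g.name: g.free_mem for g in healthy}
    plans = {}
    for src in unhealthy:
        candidates = [n for n in free if n in peers.get(src.name, set())]
        tentative = dict(free)  # only committed if every instance fits
        moves = []
        for inst, mem in src.problem_instances:
            dest = first_fit(mem, candidates, tentative)
            if dest is None:
                moves = None
                break
            tentative[dest] -= mem
            moves.append((inst, dest))
        if moves is not None:
            free = tentative
        plans[src.name] = moves
    return plans
```

Working on a tentative copy of the free capacity means that a group which cannot be fully repaired does not consume capacity that a later group might use.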
In a super cluster which has had many instance reclamations, it is possible that while none of the groups is empty, overall there is enough empty capacity that an entire group could be removed.
The algorithm for “compressing” the super cluster is as follows:
read super cluster data
compute total (memory, disk, cpu), and free (memory, disk, cpu) for the super-cluster
compute per-group used and free (memory, disk, cpu)
select candidate groups for evacuation:
- they must be connected to other groups via a common storage type and pool
- they must have fewer used resources than the global free resources (minus their own free resources)
for each of these groups, try to relocate all its instances to connected peer groups
report the list of groups that could be evacuated, or if instructed so, perform the evacuation of the group with the largest free resources (i.e. in order to reclaim the most capacity)
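The candidate selection step can be sketched as below. This covers only the filtering, not the actual relocation attempt, and rests on assumptions: resources are plain (memory, disk, cpu) tuples, connectivity is given as a precomputed `peers` mapping, and "largest free resources" is interpreted as lexicographic ordering of the free tuple. The function name is hypothetical.

```python
def evacuation_candidates(used, free, peers):
    """Select groups that might be evacuated, largest free resources first.

    used, free: dict mapping group name -> (memory, disk, cpu)
    peers: dict mapping group name -> set of connected group names
    """
    # Global free resources across the whole super cluster.
    total_free = tuple(sum(f[i] for f in free.values()) for i in range(3))
    candidates = []
    for group in used:
        if not peers.get(group):
            continue  # not connected to any other group
        # Free resources elsewhere: global free minus the group's own free.
        others_free = tuple(total_free[i] - free[group][i] for i in range(3))
        if all(used[group][i] < others_free[i] for i in range(3)):
            candidates.append(group)
    # Assumption: compare free resources lexicographically (memory first).
    candidates.sort(key=lambda g: free[g], reverse=True)
    return candidates
```

Passing this filter is a necessary condition only; each candidate still has to go through a real relocation attempt to its connected peer groups before it can be reported as evacuable.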
Assuming a super cluster using shared storage, where instance failover is cheap, it should be possible to do a load-based balancing across groups.
As opposed to the normal balancing, where we want to balance on all node attributes, here we should look only at the load attributes. In other words, we compare the available (total) node capacity with the (total) load generated by instances in a given group, compute such scores for all groups, and look for outliers.
Once a reliable load-weighting method for groups exists, we can apply a modified version of the cluster scoring method to score not imbalances across nodes, but imbalances across groups which result in a super cluster load-related score.
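One possible shape for such a score is sketched below. It assumes, purely for illustration, that each group is summarised by a single (total capacity, total load) pair and that imbalance is measured as the population standard deviation of the per-group load ratios; the actual load-weighting method is left open by the text, and the names here are hypothetical.

```python
from statistics import pstdev


def super_cluster_load_score(groups):
    """Return (score, ratios) for a dict of group name -> (capacity, load).

    The score is the population standard deviation of the load/capacity
    ratios: 0.0 means all groups are equally loaded.
    """
    ratios = {name: load / capacity for name, (capacity, load) in groups.items()}
    return pstdev(list(ratios.values())), ratios


# Hypothetical example: two groups of equal capacity with uneven load.
score, ratios = super_cluster_load_score({"g1": (100, 20), "g2": (100, 80)})
```

A lower score after a candidate inter-group move would indicate that the move improves the load balance across groups.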
It is important to keep the allocation method across groups internal (in the Ganeti/IAllocator combination), instead of delegating it to an external party (e.g. a RAPI client). For this, the IAllocator protocol should be extended to provide proper group support.
For htools, the new algorithm will work as follows:
The rationale for returning the entire group list, and not only the best choice, is that we have the list anyway, and Ganeti might have other criteria (e.g. the best group might be busy/locked down, etc.), so even if a group is the best choice from the point of view of resources, it might not be the best one overall.
While the basic concept in the multi-evac iallocator mode remains unchanged (it’s a simple local group issue), when failing to evacuate and running in a super cluster, we could have resources available elsewhere in the cluster for evacuation.
The algorithm for computing this will be the same as the one for super cluster compression and redistribution, except that the list of instances is fixed to the ones living on the nodes to-be-evacuated.
If the inter-group relocation is successful, the result to Ganeti will not be a local group evacuation target, but instead (for each instance) a pair (remote group, nodes). Ganeti itself will have to decide (based on user input) whether to continue with inter-group evacuation or not.
If Ganeti doesn’t provide complete cluster data, but only that of the local group, the inter-group relocation won’t be attempted.