Synchronising htools to Ganeti 2.3

Ganeti 2.3 introduces a number of new features that change the cluster internals significantly enough that the htools suite needs to be updated accordingly in order to function correctly.

Shared storage support

Currently, the htools algorithms presume a model where all of an instance’s resources are served from within the cluster, more specifically from the nodes comprising the cluster. While this is the usual case for memory and CPU, deployments which use shared storage will invalidate this assumption for storage.

To account for this, we need to move some assumptions from being implicit (and hardcoded) to being explicitly exported from Ganeti.

New instance parameters

It is presumed that Ganeti will export for all instances a new storage_type parameter that will denote either internal storage (e.g. plain or drbd), or external storage.

Furthermore, a new storage_pool parameter will classify, for both internal and external storage, the pool out of which the storage is allocated. For internal storage, this will be either lvm (the pool that provides space to both plain and drbd instances) or file (for file-storage-based instances). For external storage, this will be the respective NAS/SAN/cloud storage that backs the instance. Note that for htools, external storage pools are opaque; we only care that they have an identifier, so that we can distinguish between two different pools.

If these two parameters are not present, the instances will be presumed to be internal/lvm.
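Under these assumptions, the classification reduces to a lookup with defaults. A minimal Python sketch (the parameter names storage_type and storage_pool are the ones proposed above; the dict-based parameter representation is illustrative only, not Ganeti’s actual export format):

```python
def instance_storage(params):
    """Classify an instance's storage as a (type, pool) pair.

    Missing parameters fall back to the internal/lvm default
    described above.
    """
    storage_type = params.get("storage_type", "internal")
    storage_pool = params.get("storage_pool", "lvm")
    return (storage_type, storage_pool)
```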

New node parameters

For each node, it is expected that Ganeti will export which storage types it supports and which pools it has access to. So a classic 2.2 cluster will have all nodes supporting internal/lvm and/or internal/file, whereas a new shared-storage-only 2.3 cluster could have external/my-nas storage.

Whatever mechanism Ganeti will use internally to configure the associations between nodes and storage pools, we assume that two node attributes will be available inside htools: the list of internal and the list of external storage pools.
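With those two attributes, deciding whether a node can access an instance’s storage becomes a simple membership test. A sketch under the same assumptions (the attribute representation is hypothetical):

```python
def node_has_pool(node_pools, storage_type, storage_pool):
    """Check whether a node can access the given storage pool.

    node_pools maps "internal"/"external" to the set of pool names
    the node has access to.
    """
    return storage_pool in node_pools.get(storage_type, set())
```

A classic 2.2-style node would be represented as {"internal": {"lvm", "file"}, "external": set()}, a shared-storage-only node as {"internal": set(), "external": {"my-nas"}}.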

External storage and instances

Currently, for an instance we allow one cheap move type: failover to the current secondary, if it is a healthy node, and four other “expensive” (as in, including data copies) moves that involve changing either the secondary or the primary node or both.

In presence of an external storage type, the following things will change:

  • the disk-based moves will be disallowed; this is already a feature in the algorithm, controlled by a boolean switch, so adapting external storage here will be trivial
  • instead of the current one secondary node, the secondaries will become a list of potential secondaries, based on access to the instance’s storage pool

Except for this, the basic move algorithm remains unchanged.
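The list of potential secondaries can be derived directly from pool access. A sketch (the field names and the dict-based node model are assumptions for illustration):

```python
def potential_secondaries(instance, nodes):
    """Return the nodes that could act as secondary for an instance
    on shared storage: every healthy node, other than the current
    primary, with access to the instance's storage pool.
    """
    return sorted(name for name, node in nodes.items()
                  if name != instance["primary"]
                  and not node["offline"]
                  and instance["pool"] in node["pools"])
```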

External storage and nodes

Two separate areas will have to change for nodes and external storage.

First, when allocating instances (either as part of a move or a new allocation), if the instance is using external storage, then the internal disk metrics should be ignored (for both the primary and secondary cases).

Second, the per-node metrics used in the cluster scoring must take into account that nodes might not have internal storage at all, and handle this as a well-balanced case (score 0).
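The special-casing in the second point could look as follows (the exact metric htools uses differs; this only illustrates treating a storage-less node as perfectly balanced rather than, say, dividing by zero):

```python
def node_disk_metric(free_disk, total_disk):
    """Per-node free-disk metric used in cluster scoring.

    A node without internal storage (total_disk == 0) is treated as
    a well-balanced case and contributes a zero score.
    """
    if total_disk == 0:
        return 0.0
    return 1.0 - float(free_disk) / total_disk
```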

N+1 status

Currently, computing the N+1 status of a node is simple:

  • group the current secondary instances by their primary node, and compute the sum of each instance group memory
  • choose the maximum sum, and check if it’s smaller than the current available memory on this node
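The current two-step computation can be sketched as follows (illustrative data model; instance dicts with primary/secondary/memory fields are an assumption):

```python
from collections import defaultdict

def n1_ok(node_name, free_memory, instances):
    """Current per-node N+1 check: group this node's secondary
    instances by their primary, take the largest per-primary memory
    sum, and compare it against the node's available memory.
    """
    per_primary = defaultdict(int)
    for inst in instances:
        if inst["secondary"] == node_name:
            per_primary[inst["primary"]] += inst["memory"]
    worst = max(per_primary.values(), default=0)
    return worst <= free_memory
```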

In effect, computing the N+1 status is a per-node matter. However, with shared storage, we don’t have secondary nodes, just potential secondaries. Thus computing the N+1 status will be a cluster-level matter, and much more expensive.

A simple version of the N+1 checks would be that for each instance having said node as primary, we have enough memory in the cluster for relocation. This means we would actually need to run allocation checks, and update the cluster status from within allocation on one node, while being careful that we don’t recursively check N+1 status during this relocation, which is too expensive.

However, the shared storage model has some properties that change the rules of the computation. Speaking broadly (and ignoring hard restrictions like tag-based exclusion and CPU limits), the exact location of an instance in the cluster doesn’t matter as long as memory is available. This results in two changes:

  • simply tracking the amount of free memory buckets is enough, cluster-wide
  • moving an instance from one node to another would not change the N+1 status of any node, and only allocation needs to deal with N+1 checks
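The cheap, memory-only check could be sketched like this: given the instances of a (hypothetically) failed node and the free memory of the remaining nodes, a greedy first-fit-decreasing placement decides whether they can all be absorbed. This deliberately ignores exclusion tags, CPU limits and every other restriction, as noted above:

```python
def can_absorb(instance_memories, free_memories):
    """Memory-only N+1-style check under pure shared storage: can
    instances with the given memory sizes be re-placed into the
    remaining nodes' free memory?  Greedy first-fit-decreasing.
    """
    free = list(free_memories)
    for mem in sorted(instance_memories, reverse=True):
        for idx, avail in enumerate(free):
            if avail >= mem:
                free[idx] -= mem
                break
        else:
            return False
    return True
```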

Unfortunately, this very cheap solution fails in case of any other exclusion or prevention factors.

TODO: find a solution for N+1 checks.

Node groups support

The addition of node groups has a small impact on the actual algorithms, which will simply operate at node group level instead of cluster level, but it requires the addition of new algorithms for inter-node group operations.

The following two definitions will be used in the following paragraphs:

local group
The local group refers to a node’s own node group, or when speaking about an instance, the node group of its primary node
regular cluster
A cluster composed of a single node group, or pre-2.3 cluster
super cluster
This term refers to a cluster which comprises multiple node groups, as opposed to a 2.2 and earlier cluster with a single node group

In all the below operations, it’s assumed that Ganeti can gather the entire super cluster state cheaply.

Balancing changes

Balancing will move from cluster-level balancing to group balancing. In order to achieve a reasonable improvement in a super cluster, without needing to keep state of what groups have been already balanced previously, the balancing algorithm will run as follows:

  1. the cluster data is gathered
  2. if this is a regular cluster, as opposed to a super cluster, balancing will proceed normally as previously
  3. otherwise, compute the cluster scores for all groups
  4. choose the group with the worst score and see if we can improve it; if not, choose the next-worst group, and so on
  5. once a group has been identified, run the balancing for it

Of course, explicit selection of a group will be allowed.
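Steps 3-5 above amount to a worst-first search over the per-group scores. A sketch (in htools a higher cluster score means a worse balance; try_improve stands in for running the actual balancing on a group):

```python
def pick_group(group_scores, try_improve):
    """Try groups worst-first (highest score first) until one can
    actually be improved; return it, or None if none can.
    """
    for group, _score in sorted(group_scores.items(),
                                key=lambda item: item[1],
                                reverse=True):
        if try_improve(group):
            return group
    return None
```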

Super cluster operations

Besides the regular group balancing, in a super cluster we have more operations.

Redistribution

In a regular cluster, once we run out of resources (offline nodes which can’t be fully evacuated, N+1 failures, etc.) there is nothing we can do unless nodes are added or instances are removed.

In a super cluster however, there might be resources available in another group, so there is the possibility of relocating instances between groups to re-establish N+1 success within each group.

One difficulty in the presence of both super clusters and shared storage is that the move paths of instances are quite complicated; basically an instance can move inside its local group, and to any other group which has access to the same storage type and storage pool pair. In effect, the super cluster is composed of multiple ‘partitions’, each containing one or more groups, but a node is simultaneously present in multiple partitions, one for each storage type and storage pool it supports. As such, the interactions between the individual partitions are too complex for non-trivial clusters to assume we can compute a perfect solution: we might need to move some instances using shared storage pool ‘A’ in order to clear some more memory to accept an instance using local storage, which will further clear more VCPUs in a third partition, and so on. Therefore, we’ll limit ourselves to simple relocation steps within a single partition.
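The partition structure itself is just an inversion of the per-group storage methods. A sketch (group_methods maps a group name to its set of (storage type, storage pool) pairs; the representation is an assumption):

```python
from collections import defaultdict

def partitions(group_methods):
    """Map every (storage type, storage pool) pair to the set of
    node groups with access to it; each mapping value is one
    'partition'.  A group appearing alone in all its partitions has
    no inter-group move targets.
    """
    parts = defaultdict(set)
    for group, methods in group_methods.items():
        for method in methods:
            parts[method].add(group)
    return dict(parts)
```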

Algorithm:

  1. read super cluster data, and exit if cluster doesn’t allow inter-group moves

  2. filter out any groups that are “alone” in their partition (i.e. no other group sharing at least one storage method)

  3. determine list of healthy versus unhealthy groups:

    1. a group which contains offline nodes still hosting instances is definitely not healthy
    2. a group which has nodes failing N+1 is ‘weakly’ unhealthy
  4. if either list is empty, exit (no work to do, or no way to fix problems)

  5. for each unhealthy group:

    1. compute the instances that are causing the problems: all instances living on offline nodes, all instances living as secondary on N+1 failing nodes, all instances living as primaries on N+1 failing nodes (in this order)
    2. remove instances, one by one, until the source group is healthy again
    3. try to run a standard allocation procedure for each instance on all potential groups in its partition
    4. if all instances were relocated successfully, it means we have a solution for repairing the original group
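The ordering in step 5.1 can be sketched as a priority sort (illustrative data model; for simplicity an instance is taken to live on an offline node when its primary is offline):

```python
def relocation_order(instances, offline_nodes, n1_failing):
    """Order the problem instances: those on offline nodes first,
    then secondaries on N+1-failing nodes, then primaries on
    N+1-failing nodes; all other instances are left alone.
    """
    def priority(inst):
        if inst["primary"] in offline_nodes:
            return 0
        if inst["secondary"] in n1_failing:
            return 1
        if inst["primary"] in n1_failing:
            return 2
        return 3

    problems = [i for i in instances if priority(i) < 3]
    return sorted(problems, key=priority)
```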

Compression

In a super cluster which has had many instance reclamations, it is possible that while none of the groups is empty, overall there is enough empty capacity that an entire group could be removed.

The algorithm for “compressing” the super cluster is as follows:

  1. read super cluster data

  2. compute total (memory, disk, cpu), and free (memory, disk, cpu) for the super-cluster

  3. compute per-group used and free (memory, disk, cpu)

  4. select candidate groups for evacuation:

    1. they must be connected to other groups via a common storage type and pool
    2. they must have fewer used resources than the global free resources (minus their own free resources)
  5. for each of these groups, try to relocate all its instances to connected peer groups

  6. report the list of groups that could be evacuated, or if instructed so, perform the evacuation of the group with the largest free resources (i.e. in order to reclaim the most capacity)
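The candidate selection in step 4 can be sketched as follows (groups maps a group name to per-resource "used" and "free" totals; connected is the set of groups sharing at least one storage method with a peer; both representations are assumptions):

```python
def evacuation_candidates(groups, connected):
    """Groups whose used resources fit into the free resources of
    the rest of the super cluster (i.e. the global free resources
    minus their own contribution).
    """
    resources = ("memory", "disk", "cpu")
    candidates = []
    for name in connected:
        others_free = {r: sum(g["free"][r]
                              for other, g in groups.items()
                              if other != name)
                       for r in resources}
        if all(groups[name]["used"][r] <= others_free[r]
               for r in resources):
            candidates.append(name)
    return sorted(candidates)
```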

Load balancing

Assuming a super cluster using shared storage, where instance failover is cheap, it should be possible to do a load-based balancing across groups.

As opposed to the normal balancing, where we want to balance on all node attributes, here we should look only at the load attributes; in other words, compare the available (total) node capacity with the (total) load generated by instances in a given group, and computing such scores for all groups, trying to see if we have any outliers.

Once a reliable load-weighting method for groups exists, we can apply a modified version of the cluster scoring method to score not imbalances across nodes, but imbalances across groups which result in a super cluster load-related score.
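As a starting point, one simple load-weighting method would compare the per-group ratio of instance load to node capacity, and use the spread of those ratios as the super cluster load score (the choice of population standard deviation here is an illustrative assumption, not the actual htools metric):

```python
import statistics

def group_load_ratio(total_load, total_capacity):
    """Load generated by a group's instances relative to the total
    capacity of its nodes."""
    return total_load / total_capacity

def load_score(groups):
    """Spread of the per-group load ratios over (load, capacity)
    pairs; zero means the groups are perfectly load-balanced,
    larger values mean outliers exist.
    """
    ratios = [group_load_ratio(load, cap) for load, cap in groups]
    return statistics.pstdev(ratios)
```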

Allocation changes

It is important to keep the allocation method across groups internal (in the Ganeti/Iallocator combination), instead of delegating it to an external party (e.g. a RAPI client). For this, the IAllocator protocol should be extended to provide proper group support.

For htools, the new algorithm will work as follows:

  1. read/receive cluster data from Ganeti
  2. filter out any groups that do not support the requested storage method
  3. for remaining groups, try allocation and compute scores after allocation
  4. sort valid allocation solutions accordingly and return the entire list to Ganeti

The rationale for returning the entire group list, and not only the best choice, is that we have the list anyway, and Ganeti might have other criteria (e.g. the best group might be busy/locked down, etc.), so even if a group is the best choice from the point of view of resources, it might not be the overall best one.
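Steps 2-4 of the algorithm can be sketched as follows (try_alloc stands in for the actual per-group allocation attempt and scoring; the dict-based group representation is an assumption):

```python
def allocate_across_groups(groups, storage_method, try_alloc):
    """Drop groups that do not support the requested storage method,
    attempt allocation in the rest, and return every successful
    solution sorted best (lowest post-allocation score) first.

    try_alloc(group) returns the post-allocation cluster score, or
    None when allocation in that group fails.
    """
    solutions = []
    for name, methods in groups.items():
        if storage_method not in methods:
            continue
        score = try_alloc(name)
        if score is not None:
            solutions.append((score, name))
    return [name for _score, name in sorted(solutions)]
```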

Node evacuation changes

While the basic concept in the multi-evac iallocator mode remains unchanged (it’s a simple local group issue), when failing to evacuate and running in a super cluster, we could have resources available elsewhere in the cluster for evacuation.

The algorithm for computing this will be the same as the one for super cluster compression and redistribution, except that the list of instances is fixed to the ones living on the nodes to-be-evacuated.

If the inter-group relocation is successful, the result to Ganeti will not be a local group evacuation target, but instead (for each instance) a pair (remote group, nodes). Ganeti itself will have to decide (based on user input) whether to continue with inter-group evacuation or not.

In case that Ganeti doesn’t provide complete cluster data, just the local group, the inter-group relocation won’t be attempted.