N+1 redundancy for shared storage
- Created: 2015-Apr-13
- Status: Partially Implemented
- Ganeti-Version: 2.15
This document describes how N+1 redundancy is achieved for instances using shared storage.
Current state and shortcomings
For instances with DRBD as disk template, there is only one node where
the instance can be restarted immediately if its primary node fails:
the secondary node. Therefore, htools reserve enough memory on that node
to cope with the failure of a single node.
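As a rough illustration of that reservation, the sketch below computes, for each secondary node, the largest amount of memory that a single failing primary would send its way. The tuple layout and the name `reserved_memory` are assumptions made for this example; they are not the data structures htools actually uses.

```python
from collections import defaultdict


def reserved_memory(instances):
    """Return, per secondary node, the memory to keep free for one node failure.

    `instances` is a list of (primary, secondary, memory_mib) tuples, a
    hypothetical simplified stand-in for the cluster description.
    """
    # peer_mem[secondary][primary]: memory arriving on `secondary` if `primary` fails
    peer_mem = defaultdict(lambda: defaultdict(int))
    for primary, secondary, mem in instances:
        peer_mem[secondary][primary] += mem
    # Only a single node failure is covered, so the worst primary decides.
    return {node: max(by_primary.values()) for node, by_primary in peer_mem.items()}


drbd = [
    ("node1", "node3", 2048),
    ("node1", "node3", 1024),
    ("node2", "node3", 4096),
    ("node3", "node1", 512),
]
reserved = reserved_memory(drbd)
```

Here `node3` is secondary for instances from two different primaries, but it only has to hold the larger of the two groups, because only one node is assumed to fail at a time.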
Instances using shared storage, however, can be restarted on any node,
which implies that memory does not have to be reserved on any particular
node. This, however, motivated the current state where no memory is
reserved at all, so even a large cluster can run out of capacity.
Proposed changes
Definition of N+1 redundancy in the presence of shared storage
A cluster is considered N+1 redundant if, for every node, all DRBD instances can be migrated out and then all shared-storage instances can be relocated to a different node without moving instances on other nodes. This is precisely the operation done after a node breaks. Obviously, simulating failure and evacuation for every single node is an expensive operation.
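The definition can be sketched as a simulation over all nodes. The example below is a minimal sketch that only looks at memory and uses hypothetical, simplified inputs (`nodes`, `drbd`, `shared` and the function name are assumptions, not htools data structures). Shared-storage instances are placed greedily, largest first, on the node with the most free memory; this is a heuristic that can reject a cluster for which a cleverer placement would have succeeded.

```python
def is_n_plus_one(nodes, drbd, shared):
    """Check N+1 redundancy by simulating the failure of every node.

    nodes:  {node_name: free_memory}
    drbd:   [(primary, secondary, memory)]
    shared: [(node, memory)]
    """
    for failed in nodes:
        # Instances on surviving nodes stay put; only free memory matters.
        free = {n: f for n, f in nodes.items() if n != failed}
        # Step 1: DRBD instances can only restart on their secondary node.
        for primary, secondary, mem in drbd:
            if primary == failed:
                free[secondary] -= mem
        if any(f < 0 for f in free.values()):
            return False
        # Step 2: shared-storage instances may go to any surviving node.
        to_move = sorted((m for node, m in shared if node == failed), reverse=True)
        for mem in to_move:
            target = max(free, key=free.get, default=None)
            if target is None or free[target] < mem:
                return False
            free[target] -= mem
    return True


nodes = {"node1": 4096, "node2": 4096, "node3": 2048}
drbd = [("node1", "node2", 1024)]
shared = [("node1", 2048), ("node3", 1024)]
ok = is_n_plus_one(nodes, drbd, shared)
too_full = is_n_plus_one(nodes, drbd, shared + [("node1", 4096)])
```

The outer loop over every node is what makes the check expensive: each iteration is a full evacuation simulation.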
Basic Considerations
For DRBD, keeping N+1 redundancy is affected by moving instances and balancing the cluster. Moreover, taking it into account for balancing can help improve allocation efficiency by considering the total reserved memory. Hence, N+1 redundancy for DRBD is to be taken into account for all choices affecting instance location, including instance allocation and balancing.
Shared-storage instances can move anywhere within the node group. So, in practice, this is mainly a question of capacity planning, especially if most instances have the same size. Nevertheless, the offcuts that arise when instances don’t fill a node entirely must not be ignored.
Modifications to existing tools
- hail will compute and rank possible allocations as usual. However, before returning a choice it will filter out allocations that are not N+1 redundant.
- Normal gnt-cluster verify will not be changed; in particular, it will still check for DRBD N+1 redundancy, but not for shared storage N+1 redundancy. However, hcheck will verify shared storage N+1 redundancy and report if that fails.
- hbal will consider and rank moves as usual. However, before deciding on the next move, it will filter out those moves that lead from a shared storage N+1 redundant configuration into one that isn’t.
- hspace computing the capacity for DRBD instances will be unchanged; in particular, the options --accept-existing and --independent-groups will continue to work. For shared storage instances, however, it will strictly iterate over the same allocation step as hail does.
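The "rank as usual, then filter" pattern shared by hail and hbal can be sketched as follows. This is only an illustration: `choose_allocation`, the score table and the redundancy table are hypothetical, and the sketch assumes that a lower score is better.

```python
def choose_allocation(candidates, score, keeps_n_plus_one):
    """Return the best-ranked candidate that keeps the cluster N+1 redundant.

    Ranking is left untouched; the redundancy check only removes candidates.
    Returns None if no candidate passes the check.
    """
    for candidate in sorted(candidates, key=score):
        if keeps_n_plus_one(candidate):
            return candidate
    return None


# Hypothetical scores and redundancy results for three target nodes.
scores = {"node1": 0.8, "node2": 1.3, "node3": 2.1}
redundant_after = {"node1": False, "node2": True, "node3": True}
best = choose_allocation(scores, scores.get, redundant_after.get)
```

Keeping the check as a filter rather than folding it into the score means the ranking itself stays unchanged; the best-scored candidate `node1` is simply skipped because it would break redundancy.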
Other modifications related to opportunistic locking
To allow parallel instance creation, instance creation jobs can be instructed
to run with just whatever node locks are currently available. In this case, an
allocation has to be chosen from that restricted set of nodes. Currently, this
is achieved by sending hail a cluster description where all other nodes
are marked offline; that works as long as only local properties are considered.
With global properties, however, the capacity of the cluster is materially
underestimated, causing spurious global N+1 failures.
Therefore, we conservatively extend the request format of hail by an
optional parameter restrict-to-nodes. If that parameter is given, only
allocations on those nodes will be considered. This restriction applies in
addition to the ones currently considered (e.g., the node must be online, a
particular group might have been requested). If opportunistic locking is
enabled, calls to the IAllocator will use this extension to signal which
nodes to restrict to, instead of marking other nodes offline.
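The effect of the new parameter on candidate selection might look like the sketch below. The dictionary layout is a simplified assumption and not the full IAllocator request format: the exact placement of `restrict-to-nodes` in the request and the `group` key are hypothetical, as is the function name.

```python
def candidate_nodes(nodes, request):
    """Return the names of nodes on which an allocation may be placed."""
    restrict = request.get("restrict-to-nodes")  # optional; placement is assumed
    group = request.get("group")  # hypothetical key for a requested node group
    result = []
    for name, info in sorted(nodes.items()):
        if info["offline"]:
            continue  # existing restriction: node must be online
        if group is not None and info["group"] != group:
            continue  # existing restriction: requested group
        if restrict is not None and name not in restrict:
            continue  # new restriction: only nodes whose locks are held
        result.append(name)
    return result


nodes = {
    "node1": {"offline": False, "group": "g1"},
    "node2": {"offline": True, "group": "g1"},
    "node3": {"offline": False, "group": "g1"},
    "node4": {"offline": False, "group": "g2"},
}
locked = candidate_nodes(
    nodes, {"group": "g1", "restrict-to-nodes": ["node1", "node2", "node4"]}
)
unrestricted = candidate_nodes(nodes, {"group": "g1"})
```

Because the unlocked nodes stay online in the cluster description, their capacity still counts towards global properties such as shared storage N+1 redundancy; they are merely excluded as allocation targets.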
It should be noted that this change introduces a race: two concurrent allocations might bring the cluster over the global N+1 capacity limit. However, as the reason for opportunistic locking is an urgent need for instances, this seems acceptable; Ganeti generally follows the guideline that current problems are more important than future ones. Also, even with that change, allocation is more careful than the current approach of completely ignoring N+1 redundancy for shared storage.