Currently, there is no consistent management of different variants of storage in Ganeti. One direct consequence is that storage space reporting is currently broken for all storage that is not based on lvm technology. This design looks at the root causes and proposes a way to fix it.
We propose to streamline handling of different storage types and disk templates. Currently, there is no consistent implementation for dis/enabling of disk templates and/or storage types.
Our idea is to introduce a list of enabled disk templates, which can be used by instances in the cluster. Based on this list, we want to provide storage reporting mechanisms for the available disk templates. Since some disk templates share the same underlying storage technology (for example drbd and plain are based on lvm), we map disk templates to storage types and implement storage space reporting for each storage type.
Add a new attribute “enabled_disk_templates” (type: list of strings) to the cluster config which holds disk templates, for example, “drbd”, “file”, or “ext”. This attribute represents the list of disk templates that are enabled cluster-wide for usage by the instances. It will not be possible to create instances with a disk template that is not enabled, as well as it will not be possible to remove a disk template from the list if there are still instances using it.
The list of enabled disk templates can contain any non-empty subset of the currently implemented disk templates: blockdev, diskless, drbd, ext, file, plain, rbd, and sharedfile. See DISK_TEMPLATES in constants.py.
Note that the abovementioned list of enabled disk types is just a “mechanism” parameter that defines which disk templates the cluster can use. Further filtering about what’s allowed can go in the ipolicy, which is not covered in this design doc. Note that it is possible to force an instance to use a disk template that is not allowed by the ipolicy. This is not possible if the template is not enabled by the cluster.
The ipolicy also contains a list of enabled disk templates. Since the cluster- wide enabled disk templates should be a stronger constraint, the list of enabled disk templates in the ipolicy should be a subset of those. In case the user tries to create an inconsistent situation here, gnt-cluster should display this as an error.
We consider the first disk template in the list to be the default template for instance creation and storage reporting. This will remove the need to specify the disk template with -t on instance creation. Note: It would be better to take the default disk template from the node-group-specific ipolicy. However, when using the iallocator, the nodegroup can only be determined from the node which is determined by the iallocator, which in turn needs the disk-template first. To solve this chicken-and-egg-problem we first need to extend ‘gnt-instance add’ to accept a nodegroup in the first place.
Currently, cluster-wide dis/enabling of disk templates is not implemented consistently. lvm based disk templates are enabled by specifying a volume group name on cluster initialization and can only be disabled by explicitly using the option --no-lvm-storage. This will be replaced by adding/removing drbd and plain from the set of enabled disk templates.
The option --no-drbd-storage is also subsumed by dis/enabling the disk template drbd on the cluster.
Up till now, file storage and shared file storage could be dis/enabled at ./configure time. This will also be replaced by adding/removing the respective disk templates from the set of enabled disk templates.
There is currently no possibility to dis/enable the disk templates diskless, blockdev, ext, and rdb. By introducing the set of enabled disk templates, we will require these disk templates to be explicitly enabled in order to be used. The idea is that the administrator of the cluster can tailor the cluster configuration to what is actually needed in the cluster. There is hope that this will lead to cleaner code, better performance and fewer bugs.
When upgrading the configuration from a version that did not have the list of enabled disk templates, we have to decide which disk templates are enabled based on the current configuration of the cluster. We propose the following update logic to be implemented in the online update of the config in the Cluster class in objects.py: - If a volume_group_name is existing, then enable drbd and plain. - If file or sharedfile was enabled at configure time, add the respective disk template to the list of enabled disk templates. - For disk templates diskless, blockdev, ext, and rbd, we inspect the current cluster configuration regarding whether or not there are instances that use one of those disk templates. We will add only those that are currently in use. The order in which the list of enabled disk templates is built up will be determined by a preference order based on when in the history of Ganeti the disk templates were introduced (thus being a heuristic for which are used more than others).
The list of enabled disk templates can be specified on cluster initialization with gnt-cluster init using the optional parameter --enabled-disk-templates. If it is not set, it will be set to a default set of enabled disk templates, which includes the following disk templates: drbd and plain. The list can be shrunk or extended by gnt-cluster modify using the same parameter.
The storage reporting in gnt-node list will be the first user of the newly introduced list of enabled disk templates. Currently, storage reporting works only for lvm-based storage. We want to extend that and report storage for the enabled disk templates. The default of gnt-node list will only report on storage of the default disk template (the first in the list of enabled disk templates). One can explicitly ask for storage reporting on the other enabled disk templates with the -o option.
Some of the currently implemented disk templates share the same base storage technology. Since the storage reporting is based on the underlying technology rather than on the user-facing disk templates, we introduce storage types to represent the underlying technology. There will be a mapping from disk templates to storage types, which will be used by the storage reporting backend to pick the right method for estimating the storage for the different disk templates.
The proposed storage types are blockdev, diskless, ext, file, lvm-pv, lvm-vg, rados.
The mapping from disk templates to storage types will be: drbd and plain to lvm-vg, file and sharedfile to file, and all others to their obvious counterparts.
Note that there is no disk template mapping to lvm-pv, because this storage type is currently only used to enable the user to mark it as (un)allocatable. (See man gnt-node.) It is not possible to create an instance on a storage unit that is of type lvm-pv directly, therefore it is not included in the mapping.
The storage reporting for file and sharedfile storage will report space on the file storage dir, which is currently limited to one directory. In the future, if we’ll have support for more directories, or for per-nodegroup directories this can be changed.
For now, we will implement only the storage reporting for lvm-based and file-based storage, that is disk templates file, sharedfile, lvm, and drbd. For disk template diskless, there is obviously nothing to report about. When implementing storage reporting for file, we can also use it for sharedfile, since it uses the same file system mechanisms to determine the free space. In the future, we can optimize storage reporting for shared storage by not querying all nodes that use a common shared file for the same space information.
In the future, we extend storage reporting for shared storage types like rados and ext. Note that it will not make sense to query each node for storage reporting on a storage unit that is used by several nodes.
We will not implement storage reporting for the blockdev disk template, because block devices are always adopted after being provided by the system administrator, thus coming from outside Ganeti. There is no point in storage reporting for block devices, because Ganeti will never try to allocate storage inside a block device.
The noded RPC call that reports node storage space will be changed to accept a list of <storage_type>,<key> string tuples. For each of them, it will report the free amount of storage space found on storage <key> as known by the requested storage_type. Depending on the storage_type, the key would be a volume group name in case of lvm, a directory name for the file-based storage, and a rados pool name for rados storage.
Masterd will know through the mapping of storage types to storage calculation functions which storage type uses which mechanism for storage calculation and invoke only the needed ones.
Note that for file and sharedfile the node knows which directories are allowed and won’t allow any other directory to be queried for security reasons. The actual path still needs to be passed to distinguish the two, as the type will be the same for both.
These calculations will be implemented in the node storage system (currently lib/storage.py) but querying will still happen through the node info call, to avoid requiring an extra RPC each time.
gnt-node list` can be queried for the different disk templates, if they are enabled. By default, it will just report information about the default disk template. Examples:
> gnt-node list Node DTotal DFree MTotal MNode MFree Pinst Sinst mynode1 3.6T 3.6T 64.0G 1023M 62.2G 1 0 mynode2 3.6T 3.6T 64.0G 1023M 62.0G 2 1 mynode3 3.6T 3.6T 64.0G 1023M 62.3G 0 2 > gnt-node list -o dtotal/drbd,dfree/file Node DTotal (drbd, myvg) DFree (file, mydir) mynode1 3.6T - mynode2 3.6T -
Note that for drbd, we only report the space of the vg and only if it was not renamed to something different than the default volume group name. With this design, there is also no possibility to ask about the meta volume group. We restrict the design here to make the transition to storage pools easier (as it is an interim state only). It is the administrator’s responsibility to ensure that there is enough space for the meta volume group.
When storage pools are implemented, we switch from referencing the disk template to referencing the storage pool name. For that, of course, the pool names need to be unique over all storage types. For drbd, we will use the default ‘drbd’ storage pool and possibly a second lvm-based storage pool for the metavg. It will be possible to rename storage pools (thus also the default lvm storage pool). There will be new functionality to ask about what storage pools are available and of what type. Storage pools will have a storage pool type which is one of the disk templates. There can be more than one storage pool based on the same disk template, therefore we will then start referencing the storage pool name instead of the disk template.
Note: As of version 2.10, gnt-node list only reports storage space information for the default disk template, as supporting more options turned out to be not feasible without storage pools.
Besides in gnt-node list, storage space information is also displayed in gnt-node list-storage. This will also adapt to the extended storage reporting capabilities. The user can specify a storage type using --storage-type. If he requests storage information about a storage type which does not support space reporting, a warning is emitted. If no storage type is specified explicitly, gnt-node list-storage will try to report storage on the storage type of the default disk template. If the default disk template’s storage type does not support space reporting, an error message is emitted.
gnt-cluster info will report which disk templates are enabled, i.e. which ones are supported according to the cluster configuration. Example output:
> gnt-cluster info [...] Cluster parameters: - [...] - enabled disk templates: plain, drbd, sharedfile, rados - [...]
gnt-node list-storage will not be affected by any changes, since this design is restricted only to free storage reporting for non-shared storage types.
The iallocator protocol doesn’t need to change: since we know which disk template an instance has, we’ll pass only the “free” value for that disk template to the iallocator, when asking for an allocation to be made. Note that for DRBD nowadays we ignore the case when vg and metavg are different, and we only consider the main volume group. Fixing this is outside the scope of this design.
Although the iallocator protocol itself does not need change, the invocation of the iallocator needs quite some adaption. So far, it always requested LVM storage information no matter if that was the disk template to be considered for the allocation. For instance allocation, this is the disk template of the instance. TODO: consider other allocator requests.
With this design, we ensure forward-compatibility with respect to storage pools. For now, we’ll report space for all available disk templates that are based on non-shared storage types, in the future, for all available storage pools.
Hbal will not need changes, as it handles it already. We don’t forecast any changes needed to it.
Hspace will by default report by assuming the allocation will happen on the default disk template for the cluster/nodegroup. An option will be added to manually specify a different storage.
Also the design for Partitioned Ganeti deals with reporting free space. Partitioned Ganeti has a different way to report free space for LVM on nodes where the exclusive_storage flag is set. That doesn’t interact directly with this design, as the specifics of how the free space is computed is not in the scope of this design. But the node info call contains the value of the exclusive_storage flag, which is currently only meaningful for the LVM storage type. Additional flags like the exclusive_storage flag for lvm might be useful for other disk templates / storage types as well. We therefore extend the RPC call with <storage_type>,<key> to <storage_type>,<key>,[<param>] to include any disk-template-specific (or storage-type specific) parameters in the RPC call.
The reporting of free spindles, also part of Partitioned Ganeti, is not concerned with this design doc, as those are seen as a separate resource.