GlusterFS Ganeti support

Created: 2013-Jun-24
Status: Implemented
Ganeti-Version: 2.11.0

This document describes the plan for adding GlusterFS support inside Ganeti.

Gluster overview

Gluster is a “brick” “translation” service that can turn a number of LVM logical volumes or disks (so-called “bricks”) into a unified “volume” that can be mounted over the network through FUSE or NFS.

This is a simplified view of what components are at play and how they interconnect as data flows from the actual disks to the instances. The parts in grey are available for Ganeti to use and included for completeness but not targeted for implementation at this stage.

digraph "gluster-ganeti-overview" {
graph [ spline=ortho ]
node [ shape=rect ]

{

  node [ shape=none ]
  _volume [ label=volume ]

  bricks -> translators -> _volume
  _volume -> network [label=transport]
  network -> instances
}

{ rank=same; brick1 [ shape=oval ]
             brick2 [ shape=oval ]
             brick3 [ shape=oval ]
             bricks }
{ rank=same; translators distribute }
{ rank=same; volume [ shape=oval ]
             _volume }
{ rank=same; instances instanceA instanceB instanceC instanceD }
{ rank=same; network FUSE NFS QEMUC QEMUD }

{
  node [ shape=oval ]
  brick1 [ label=brick ]
  brick2 [ label=brick ]
  brick3 [ label=brick ]
}

{
  node [ shape=oval ]
  volume
}

brick1 -> distribute
brick2 -> distribute
brick3 -> distribute -> volume
volume -> FUSE [ label=<TCP<br/><font color="grey">UDP</font>>
                 color="black:grey" ]

NFS [ color=grey fontcolor=grey ]
volume -> NFS [ label="TCP" color=grey fontcolor=grey ]
NFS -> mountpoint [ color=grey fontcolor=grey ]

mountpoint [ shape=oval ]

FUSE -> mountpoint

instanceA [ label=instances ]
instanceB [ label=instances ]

mountpoint -> instanceA
mountpoint -> instanceB

mountpoint [ shape=oval ]

QEMUC [ label=QEMU ]
QEMUD [ label=QEMU ]

{
  instanceC [ label=instances ]
  instanceD [ label=instances ]
}

volume -> QEMUC [ label=<TCP<br/><font color="grey">UDP</font>>
                  color="black:grey" ]
volume -> QEMUD [ label=<TCP<br/><font color="grey">UDP</font>>
                  color="black:grey" ]
QEMUC -> instanceC
QEMUD -> instanceD
}

brick:
The unit of storage in Gluster. Typically a drive or LVM logical volume formatted using, for example, XFS.
distribute:
One of the translators in Gluster, it assigns files to bricks based on the hash of their full path inside the volume.
volume:
A filesystem you can mount on multiple machines; all machines see the same directory tree and files.
FUSE/NFS:
Gluster offers two ways to mount volumes: through FUSE or a custom NFS server that is incompatible with other NFS servers. FUSE is more compatible with other services running on the storage nodes; NFS gives better performance. For now, FUSE is a priority.
QEMU:
QEMU 1.3 has the ability to use Gluster volumes directly in userspace without the need for mounting anything. Ganeti still needs kernelspace access at disk creation and OS install time.
transport:
FUSE and QEMU allow you to connect using TCP and UDP, whereas NFS only supports TCP. Those protocols are called transports in Gluster. For now, TCP is a priority.
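
To make the distribute translator's behaviour concrete, here is a toy model of hash-based placement. It is only an illustration: the hash function and the modulo step are assumptions chosen for brevity and are not Gluster's actual algorithm, and pick_brick is a hypothetical name.

```python
import hashlib


def pick_brick(path_in_volume, bricks):
    """Map a file path to one brick using a stable hash (toy model only)."""
    digest = hashlib.sha1(path_in_volume.encode("utf-8")).digest()
    # Use the first four bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:4], "big") % len(bricks)
    return bricks[index]


bricks = ["brick1", "brick2", "brick3"]
for name in ("/ganeti/a.0", "/ganeti/b.0", "/ganeti/c.0"):
    print(name, "->", pick_brick(name, bricks))
```

The same path always lands on the same brick, so a file can be found without a central index. Directories, by contrast, have to exist on every brick.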

It is the administrator’s duty to set up the bricks, the translators and thus the volume as they see fit. Ganeti will take care of connecting the instances to a given volume.

Note

For security reasons, the administrator must whitelist the Gluster mountpoint in /etc/ganeti/file-storage-paths before Ganeti is allowed to modify the filesystem.

Why not use a sharedfile disk template?

Gluster volumes can be used by Ganeti using the generic shared file disk template. There are a number of reasons why that is probably not a good idea, however:

  • Shared file, being a generic solution, cannot offer userspace access support.
  • Even with userspace support, Ganeti still needs kernelspace access in order to create disks and install OSes on them. Ganeti can manage the mounting for you so that the Gluster servers only have as many connections as necessary.
  • Experiments showed that you can’t trust mount.glusterfs to give useful return codes or error messages. Ganeti can work around its oddities so administrators don’t have to.
  • The shared file folder scheme (../{instance.name}/disk{disk.id}) does not work well with Gluster. The distribute translator distributes files across bricks, but directories need to be replicated on all bricks. As a result, if we have twelve hundred instances, that means twelve hundred folders being replicated on all bricks. This does not scale well.
  • This frees up the shared file disk template to use a different, unsupported replication scheme together with Gluster. (Storage pools are the long term solution for this, however.)

So, while Gluster is essentially a shared file disk template, Ganeti can provide better support for it than the generic template does.

Implementation strategy

Working with GlusterFS in kernel space essentially boils down to:

  1. Ask FUSE to mount the Gluster volume.
  2. Check that the mount succeeded.
  3. Use files stored in the volume as instance disks, just like sharedfile does.
  4. When the instances are spun down, attempt unmounting the volume. If the gluster connection is still required, the mountpoint is allowed to remain.
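
The steps above can be sketched roughly as follows. This is a minimal sketch, not Ganeti's implementation: the function names are hypothetical, the mount table is assumed to be readable from /proc/mounts (Linux-specific), and the run and read_mounts parameters exist only so the logic can be exercised without root privileges.

```python
import subprocess


def read_mount_table(path="/proc/mounts"):
    """Return the kernel's mount table as text (assumes Linux)."""
    with open(path) as handle:
        return handle.read()


def is_mounted(mount_point, mounts_text):
    """Check whether mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Line layout: <source> <target> <fstype> <options> <dump> <pass>
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def ensure_mounted(host, volume, mount_point,
                   run=subprocess.call, read_mounts=read_mount_table):
    """Steps 1 and 2: mount through FUSE, then verify via the mount table."""
    if is_mounted(mount_point, read_mounts()):
        return  # already mounted, do not stack a second mount
    run(["mount", "-t", "glusterfs", "%s:/%s" % (host, volume), mount_point])
    # The exit status is deliberately ignored; the mount table decides.
    if not is_mounted(mount_point, read_mounts()):
        raise RuntimeError("could not mount %s on %s" % (volume, mount_point))


def try_unmount(mount_point, run=subprocess.call):
    """Step 4: attempt one unmount; a busy mountpoint simply stays mounted."""
    return run(["umount", mount_point]) == 0
```

Verifying against the mount table rather than the exit status reflects the observation that mount.glusterfs does not give reliable return codes.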

However, since Gluster does not strictly need to mount the volume when only userspace access is required, it is inappropriate for the Gluster storage class to inherit from FileStorage. The implementation should therefore resort to composition rather than inheritance:

  1. Extract the FileStorage disk-facing logic into a FileDeviceHelper class.
       • In order not to further inflate bdev.py, FileStorage should join its helper functions in filestorage.py (thus reducing their visibility), and Gluster should be added in its own file, gluster.py. Moving the other classes to their own files (as has been done in lib/hypervisor/) is not addressed as part of this design.
  2. Use the FileDeviceHelper class to implement a GlusterStorage class in much the same way.
  3. Add Gluster as a disk template that behaves like SharedFile in every way.
  4. Provide Ganeti knowledge about what a GlusterVolume is and how to mount, unmount and reference them.
       • Before attempting a mount, we should check whether the volume is already mounted. Linux allows mounting partitions multiple times, but then you also have to unmount them as many times as you mounted them to actually free the resources; this also makes the output of commands such as mount less useful.
       • Every time the device could be released (after instance shutdown, OS installation scripts or file creation), a single unmount is attempted. If the device is still busy (e.g. from other instances, jobs or open administrator shells), the failure is ignored.
  5. Modify GlusterStorage and customize the disk template behavior to fit Gluster’s needs.
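
The composition approach might look roughly like the sketch below. Only the class names FileDeviceHelper and GlusterStorage come from this design; the method names, constructor arguments and sparse-file creation are assumptions made for illustration.

```python
import os


class FileDeviceHelper(object):
    """File operations shared by file-backed storage classes (sketch)."""

    def __init__(self, path):
        self.path = path

    def create(self, size_bytes):
        # Create a sparse file of the requested size; fail if it exists.
        fd = os.open(self.path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.ftruncate(fd, size_bytes)
        finally:
            os.close(fd)

    def size(self):
        return os.stat(self.path).st_size

    def remove(self):
        os.remove(self.path)


class GlusterStorage(object):
    """Owns a helper (composition) rather than subclassing FileStorage."""

    def __init__(self, mount_dir, volume_path):
        self.volume_path = volume_path
        full_path = os.path.join(mount_dir, volume_path.lstrip("/"))
        self.helper = FileDeviceHelper(full_path)

    def create(self, size_bytes):
        # A real implementation would mount the volume before this call
        # and attempt a single unmount afterwards.
        self.helper.create(size_bytes)

    def size(self):
        return self.helper.size()

    def remove(self):
        self.helper.remove()
```

Because GlusterStorage only delegates to the helper when it needs kernelspace file access, a userspace-only code path can skip the helper, and the mount, entirely.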

Directory structure

In order to address the shortcomings of the generic shared file handling of instance disk directory structure, Gluster uses a different scheme for determining a disk’s logical id and therefore path on the file system.

The naming scheme is:

/ganeti/{instance.uuid}.{disk.id}

...bringing the actual path on a node’s file system to:

/var/run/ganeti/gluster/ganeti/{instance.uuid}.{disk.id}

This means Ganeti only uses one folder on the Gluster volume (allowing other uses of the Gluster volume in the meantime) and works better with how Gluster distributes storage over its bricks.
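
As an illustration of the naming scheme, the following sketch derives both paths for a disk. The function name and the sample UUID are made up for the example; the default directory is the one given in this document.

```python
import posixpath

DEFAULT_GLUSTER_DIR = "/var/run/ganeti/gluster"


def gluster_disk_paths(instance_uuid, disk_id, gluster_dir=DEFAULT_GLUSTER_DIR):
    """Return (path inside the volume, path on the node) for one disk."""
    in_volume = "/ganeti/%s.%s" % (instance_uuid, disk_id)
    # Join the mount directory with the in-volume path, minus its leading slash.
    on_node = posixpath.join(gluster_dir, in_volume.lstrip("/"))
    return in_volume, on_node


in_volume, on_node = gluster_disk_paths("4e091bdc-e205-4ed7-8a47-0c9130a6619f", 0)
print(in_volume)  # /ganeti/4e091bdc-e205-4ed7-8a47-0c9130a6619f.0
print(on_node)    # /var/run/ganeti/gluster/ganeti/4e091bdc-e205-4ed7-8a47-0c9130a6619f.0
```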

Changes to the storage types system

Ganeti has a number of storage types that abstract over disk templates. This matters mainly in terms of disk space reporting. Gluster support is improved by a rethinking of how disk templates are assigned to storage types in Ganeti.

This is the summary of the changes:

                                                              Does it report storage information to...
Disk template        Current storage type  New storage type   gnt-node list  gnt-node list-storage  iallocator
File                 File                  File               Yes.           Yes.                   Yes.
Shared file          File                  Shared file (new)  No.            Yes.                   No.
Gluster (new)        N/A                   Shared file (new)  No.            Yes.                   No.
RBD (for reference)  RBD                   RBD                No.            No.                    No.

Like RBD, Gluster and Shared File should not report storage information to gnt-node list or to IAllocators. Regrettably, the simplest way to achieve this right now is to claim that storage reporting for the relevant storage type is not implemented. An effort was made to claim that the shared storage type did support disk reporting while refusing to provide any value, but it was not successful (hail does not support this combination).

To do so without breaking the File disk template, a new storage type must be added. Like RBD, it does not claim to support disk reporting. However, we can still make an effort to report stats to gnt-node list-storage.

The rationale is simple. For shared file and Gluster storage, disk space is not a function of any one node. If storage types with disk space reporting are used, hail expects them to give useful numbers for allocation purposes, but a shared storage system means disk balancing is no longer affected by node-instance allocation. Moreover, it would be wasteful to mount a Gluster volume on each node just for running statvfs() if no machine was actually running Gluster VMs.

As a result, Gluster support for gnt-node list-storage is necessarily limited and nodes on which Gluster is available but not in use will report failures. Additionally, running gnt-node list will give an output like this:

Node              DTotal DFree MTotal MNode MFree Pinst Sinst
node1.example.com      ?     ?   744M  273M  477M     0     0
node2.example.com      ?     ?   744M  273M  477M     0     0

This is expected and consistent with behaviour in RBD.

An alternative would have been to report DTotal and DFree as 0 in order to allow hail to ignore the disk information, but this incorrectly populates the gnt-node list DTotal and DFree fields with 0s as well.
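
For reference, the statvfs() measurement mentioned earlier in this section can be sketched as follows. The function name is hypothetical, and whether Ganeti derives its gnt-node list-storage figures in exactly this way is an assumption; the sketch only shows why the volume has to be mounted before any numbers can be reported.

```python
import os


def disk_space(path):
    """Return (total_bytes, free_bytes) for the filesystem holding path."""
    stats = os.statvfs(path)
    total = stats.f_frsize * stats.f_blocks
    free = stats.f_frsize * stats.f_bavail  # space available to non-root users
    return total, free


total, free = disk_space(".")
print("total=%d free=%d" % (total, free))
```

Called on an unmounted mountpoint directory, this would report the figures of the underlying local filesystem instead, which is why an unmounted volume is better reported as a failure.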

New configuration switches

Configurable at the cluster and node group level (gnt-cluster modify, gnt-group modify and other commands that support the -D switch to edit disk parameters):

gluster:host

The IP address or hostname of the Gluster server to connect to. In the default deployment of Gluster, that is any machine that is hosting a brick.

Default: "127.0.0.1"

gluster:port

The port on which the Gluster server is listening.

Default: 24007

gluster:volume

The volume Ganeti should use.

Default: "gv0"
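
To show how these three parameters fit together for userspace access, the sketch below assembles a QEMU-style Gluster URI of the form gluster[+transport]://[server[:port]]/volname/image. The function name is hypothetical, IPv6 literals (which need brackets) are not handled, and how Ganeti actually builds this string is not specified in this document.

```python
def qemu_gluster_uri(host, port, volume, image_path, transport="tcp"):
    """Build a gluster:// URI for QEMU from the gluster:* disk parameters."""
    return "gluster+%s://%s:%d/%s/%s" % (
        transport, host, port, volume, image_path.lstrip("/"))


# With the defaults listed above and an example in-volume path:
print(qemu_gluster_uri("127.0.0.1", 24007, "gv0", "/ganeti/some-uuid.0"))
# gluster+tcp://127.0.0.1:24007/gv0/ganeti/some-uuid.0
```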

Configurable at the cluster level only (gnt-cluster init) and stored in ssconf for all nodes to read (just like shared file):

--gluster-dir

Where the Gluster volume should be mounted.

Default: /var/run/ganeti/gluster

The default values work if all of the Ganeti nodes also host Gluster bricks. This is possible, but not recommended, as it can cause the host to hard-lock due to deadlocks in kernel memory (as can also happen with RBD).

Future work

In no particular order:

  • Support the UDP transport.
  • Support mounting through NFS.
  • Filter gnt-node list so DTotal and DFree are not shown for RBD and shared file disk types, or otherwise report the disk storage values as “-” or some other special value to clearly distinguish it from the result of a communication failure between nodes.
  • Allow configuring the in-volume path Ganeti uses.