Instance move improvements

Created:2014-Apr-11
Status:Partially Implemented
Ganeti-Version:2.12.0

Ganeti provides tools for moving instances within and between clusters. Through special export and import calls, a new instance is created with the disk data of the existing one.

The tools work correctly and reliably, but depending on bandwidth and priority, an instance disk of considerable size requires a long time to transfer. The length of the transfer is inconvenient at best, but the problem becomes only worse if excessive locking causes a move operation to be delayed for a longer period of time, or to block other operations.

The performance of moves is a complex topic, with available bandwidth, compression, and encryption all being candidates for choke points that bog down a transfer. Depending on the environment a move is performed in, tuning these can have significant performance benefits, but Ganeti does not expose many options needed for such tuning. The details of what to expose and what tradeoffs can be made will be presented in this document.

Apart from existing functionality, some beneficial features can be introduced to help with instance moves. Zeroing empty space on instance disks can be useful for drastically improving the qualities of compression, effectively not needing to transfer unused disk space during moves. Compression itself can be improved by using different tools. The encryption used can be weakened or eliminated for certain moves. Using opportunistic locking during instance moves results in greater parallelization. As all of these approaches aim to tackle two different aspects of the problem, they do not exclude each other and will be presented independently.

The performance of Ganeti moves

In the current implementation, there are three possible factors limiting the speed of an instance move. The first is the network bandwidth, which Ganeti can exploit better by using compression. The second is the encryption, which is obligatory, and which can throttle an otherwise fast connection. The third is surprisingly the compression, which can cause the connection to be underutilized.

Example 1: some numbers present during an intra-cluster instance move:

  • Network bandwidth: 105MB/s, courtesy of a gigabit switch
  • Encryption performance: 40MB/s, provided by OpenSSL
  • Compression performance: 22.3MB/s input, 7.1MB/s gzip compressed output

As can be seen in this example, the obligatory encryption results in 62% of available bandwidth being wasted, while using compression further lowers the throughput to 55% of what the encryption would allow. The following sections will talk about these numbers in more detail, and suggest improvements and best practices.

Encryption and Ganeti security

Turning compression and encryption off would allow for an immediate improvement, and while that is possible for compression, there are good reasons why encryption is currently not a feature a user can disable.

While it is impossible to secure instance data if an attacker gains SSH access to a node, the RAPI was designed to never allow user data to be accessed through it in case of being compromised. If moves could be performed unencrypted, this property would be broken. Instance moves can take place in environments which may be hostile, and where unencrypted traffic could be intercepted. As they can be instigated through the RAPI, an attacker could access all data on all instances in a cluster by moving them unencrypted and intercepting the data in flight. This is one of the few situations where the current speed of instance moves could be considered a perk.

The performance of encryption can be increased by either using a less secure form of encryption, including no encryption, or using a faster encryption algorithm. The example listed above utilizes AES-256, one of the few ciphers that Ganeti deems secure enough to use. AES-128, also allowed by Ganeti’s current settings, is weaker but 46% faster. A cipher that is not allowed due to its flaws, such as RC4, could offer a 208% increase in speed. On the other hand, using an OS capable of utilizing the AES_NI chip present on modern hardware can double the performance of AES, making it the best tradeoff between security and performance.

Ganeti cannot and should not detect all the factors listed above, but should rather give its users some leeway in what to choose. A precedent already exists, as intra-cluster DRBD replication is already performed unencrypted, albeit on a separate VLAN. For intra-cluster moves, Ganeti should allow its users to set OpenSSL ciphers at will, while still enforcing high-security settings for moves between clusters.

Thus, two settings will be introduced:

  • a cluster-level setting called --allow-cipher-bypassing, a boolean that cannot be set over RAPI
  • a gnt-instance move setting called --ciphers-to-use, bypassing the default cipher list with given ciphers, filtered to ensure no other OpenSSL options are passed in within

This change will serve to address the issues with moving non-redundant instances within the cluster, while keeping Ganeti security at its current level.

Compression

Support for disk compression during instance moves was partially present before, but cleaned up and unified under the --compress option only as of Ganeti 2.11. The only option offered by Ganeti is gzip with no options passed to it, resulting in a good compression ratio, but bad compression speed.

As compression can affect the speed of instance moves significantly, it is worthwhile to explore alternatives. To test compression tool performance, an 8GB drive filled with data matching the expected usage patterns (taken from a workstation) was compressed by using various tools with various settings. The two top performers were lzop and, surprisingly, gzip. The improvement in the performance of gzip was obtained by explicitly optimizing for speed rather than compression.

  • gzip -6: 22.3MB/s in, 7.1MB/s out
  • gzip -1: 44.1MB/s in, 15.1MB/s out
  • lzop: 71.9MB/s in, 28.1MB/s out

If encryption is the limiting factor, and as in the example, limits the bandwidth to 40MB/s, lzop allows for an effective 79% increase in transfer speed. The fast gzip would also prove to be beneficial, but much less than lzop. It should also be noted that as a rule of thumb, tools with a lower compression ratio had a lesser workload, with lzop straining the CPU much less than any of the competitors.

With the test results present here, it is clear that lzop would be a very worthwhile addition to the compression options present in Ganeti, yet the problem is that it is not available by default on all distributions, as the option’s presence might imply. In general, Ganeti may know how to use several tools, and check for their presence, but should add some way of at least hinting at which tools are available.

Additionally, the user might want to use a tool that Ganeti did not account for. Allowing the tool to be named is also helpful, both for cases when multiple custom tools are to be used, and for distinguishing between various tools in case of e.g. inter-cluster moves.

To this end, the --compression-tools cluster parameter will be added to Ganeti. It contains a list of names of compression tools that can be supplied as the parameter of --compress, and by default it contains all the tools Ganeti knows how to use. The user can change the list as desired, removing entries that are not or should not be available on the cluster, and adding custom tools.

Every custom tool is identified by its name, and Ganeti expects the name to correspond to a script invoking the compression tool. Without arguments, the script compresses input on stdin, outputting it on stdout. With the -d argument, the script does the same, only while decompressing. The -h argument is used to check for the presence of the script, and in this case, only the error code is examined. This syntax matches the gzip syntax well, which should allow most compression tools to be adapted to it easily.

Ganeti will not allow arbitrary parameters to be passed to a compression tool, and will restrict the names to contain only a small but assuredly safe subset of characters - alphanumeric values and dashes and underscores. This minimizes the risk of security issues that could arise from an attacker smuggling a malicious command through RAPI. Common variations, like the speed/compression tradeoff of gzip, will be handled by aliases, e.g. gzip-fast or gzip-slow.

It should also be noted that for some purposes - e.g. the writing of OVF files, gzip is the only allowed means of compression, and an appropriate error message should be displayed if the user attempts to use one of the other provided tools.

Zeroing instance disks

While compression lowers the amount of data sent, further reductions can be achieved by taking advantage of the structure of the disk - namely, sending only used disk sectors.

There is no direct way to achieve this, as it would require that the move-instance tool is aware of the structure of the file system. Mounting the filesystem is not an option, primarily due to security issues. A disk primed to take advantage of a disk driver exploit could cause an attacker to breach instance isolation and gain control of a Ganeti node.

An indirect way for this performance gain to be achieved is the zeroing of any hard disk space not in use. While this primarily means empty space, swap partitions can be zeroed as well.

Sequences of zeroes can be compressed and thus transferred very efficiently, all without the host knowing that these are empty space. This approach can also be dangerous if a sparse disk is zeroed in this way, causing ballooning. As Ganeti does not seem to make special concessions for moving sparse disks, the only difference should be the disk space utilization on the current node.

Zeroing approaches

Zeroing is a feasible approach, but the node cannot perform it as it cannot mount the disk. Only virtualization-based options remain, and of those, using Ganeti’s own virtualization capabilities makes the most sense. There are two ways of doing this - creating a new helper instance, temporary or persistent, or reusing the target instance.

Both approaches have their disadvantages. Creating a new helper instance requires managing its lifecycle, taking special care to make sure no helper instance remains left over due to a failed operation. Even if this were to be taken care of, disks are not yet separate entities in Ganeti, making the temporary transfer of disks between instances hard to implement and even harder to make robust. The reuse can be done by modifying the OS running on the instance to perform the zeroing itself when notified via the new instance communication mechanism, but this approach is neither generic, nor particularly safe. There is no guarantee that the zeroing operation will not interfere with the normal operation of the instance, nor that it will be completed if a user-initiated shutdown occurs.

A better solution can be found by combining the two approaches - re-using the virtualized environment, but with a specifically crafted OS image. With the instance shut down as it should be in preparation for the move, it can be extended with an additional disk with the OS image on it. By prepending the disk and changing some instance parameters, the instance can boot from it. The OS can be configured to perform the zeroing on startup, attempting to mount any partitions with a filesystem present, and creating and deleting a zero-filled file on them. After the zeroing is complete, the OS should shut down, and the master should note the shutdown and restore the instance to its previous state.

Note that the requirements above are very similar to the notion of a helper VM suggested in the OS install document. Some potentially unsafe actions are performed within a virtualized environment, acting on disks that belong or will belong to the instance. The mechanisms used will thus be developed with both approaches in mind.

Implementation

There are two components to this solution - the Ganeti changes needed to boot the OS, and the OS image used for the zeroing. Due to the variety of filesystems and architectures that instances can use, no single ready-to-run disk image can satisfy the needs of all the Ganeti users. Instead, the instance-debootstrap scripts can be used to generate a zeroing-capable OS image. This might not be ideal, as there are lightweight distributions that take up less space and boot up more quickly. Generating those with the right set of drivers for the virtualization platform of choice is not easy. Thus we do not provide a script for this purpose, but the user is free to provide any OS image which performs the necessary steps: zero out all virtualization-provided devices on startup, shutdown immediately. The cluster-wide parameter controlling the image to be used would be called --zeroing-image.

The modifications to Ganeti code needed are minor. The zeroing functionality should be implemented as an extension of the instance export, and exposed as the --zero-free-space option. Prior to beginning the export, the instance configuration is temporarily extended with a new read-only disk of sufficient size to host the zeroing image, and the changes necessary for the image to be used as the boot drive. The temporary nature of the configuration changes requires that they are not propagated to other nodes. While this would normally not be feasible with an instance using a disk template offering multi-node redundancy, experiments with the code have shown that the restriction on diverse disk templates can be bypassed to temporarily allow a plain disk-template disk to host the zeroing image. Given that one of the planned changes in Ganeti is to have instance disks as separate entities, with no restriction on templates, this assumption is useful rather than harmful by asserting the desired behavior. The image is dumped to the disk, and the instance is started up.

Once the instance is started up, the zeroing will proceed until completion, when a self-initiated shutdown will occur. The instance-shutdown detection capabilities of 2.11 should prevent the watcher from restarting the instance once this happens, allowing the host to take it as a sign the zeroing was completed. Either way, the host waits until the instance is shut down, or a timeout has been reached and the instance is forcibly shut down. As the time needed to zero an instance is dependent on the size of the disk of the instance, the user can provide a fixed and a per-size timeout, recommended to be set to twice the maximum write speed of the device hosting the instance.

Better progress monitoring can be implemented with the instance-host communication channel proposed by the OS install design document. The first version will most likely use only the shutdown detection, and will be improved to account for the available communication channel at a later time.

After the shutdown, the temporary disk is destroyed and the instance configuration is reverted to its original state. The very same action is done if any error is encountered during the zeroing process. In the case that the zeroing is interrupted while the zero-filled file is being written, the file may remain on the disk of the instance. The script that performs the zeroing will be made to react to system signals by deleting the zero-filled file, but there is little else that can be done to recover.

When to use zeroing

The question of when it is useful to use zeroing is hard to answer because the effectiveness of the approach depends on many factors. All compression tools compress zeroes to almost nothingness, but compressing them takes time. If the time needed to compress zeroes were equal to zero, the approach would boil down to whether it is faster to zero unused space out, performing writes to disk, or to transfer it compressed. For the example used above, the average compression ratio, and write speeds of current disk drives, the answer would almost unanimously be yes.

With a more realistic setup, where zeroes take time to compress, yet less time than ordinary data, the gains depend on the previously mentioned tradeoff and the free space available. Zeroing will definitely lessen the amount of bandwidth used, but it can lead to the connection being underutilized due to the time spent compressing data. It is up to the user to make these tradeoffs, but zeroing should be seen primarily as a means of further reducing the amount of data sent while increasing disk activity, with possible speed gains that should not be relied upon.

In the future, the VM created for zeroing could also undertake other tasks related to the move, such as compression and encryption, and produce a stream of data rather than just modifying the disk. This would lessen the strain on the resources of the hypervisor, both disk I/O and CPU usage, and allow moves to obey the resource constraints placed on the instance being moved.

Lock reduction

An instance move as executed by the move-instance tool consists of several preparatory RAPI calls, leading up to two long-lasting opcodes: OpCreateInstance and OpBackupExport. While OpBackupExport locks only the instance, the locks of OpCreateInstance require more attention.

When executed, this opcode attempts to lock all nodes on which the instance may be created and obtain shared locks on the groups they belong to. In the case that an IAllocator is used, this means all nodes must be locked. Any operation that requires a node lock to be present can delay the move operation, and there is no shortage of these.

The concept of opportunistic locking has been introduced to remedy exactly this situation, allowing the IAllocator to lock as many nodes as possible. Depending whether the allocation can be made on these nodes, the operation either proceeds as expected, or fails noting that it is temporarily infeasible. The failure case would change the semantics of the move-instance tool, which is expected to fail only if the move is impossible. To yield the benefits of opportunistic locking yet satisfy this constraint, the move-instance tool can be extended with the –opportunistic-tries and –opportunistic-try-delay options. A number of opportunistic instance creations are attempted, with a delay between attempts. The delay is slightly altered every time to avoid timing issues. Should all attempts fail, a normal instance creation is requested, which blocks until all the locks can be acquired.

While it may seem excessive to grab so many node locks, the early release mechanism is used to make the situation less dire, releasing all nodes that were not chosen as candidates for allocation. This is taken to the extreme as all the locks acquired are released prior to the start of the transfer, barring the newly-acquired lock over the new instance. This works because all operations that alter the node in a way which could affect the transfer:

  • are prevented by the instance lock or instance presence, e.g. gnt-node remove, gnt-node evacuate,
  • do not interrupt the transfer, e.g. a PV on the node can be set as unallocatable, and the transfer still proceeds as expected,
  • do not care, e.g. a gnt-node powercycle explicitly ignores all locks.

This invariant should be kept in mind, and perhaps verified through tests.

All in all, there is very little space to reduce the number of locks used, and the only improvement that can be made is introducing opportunistic locking as an option of move-instance.

Introduction of changes

All the changes noted will be implemented in Ganeti 2.12, in the way described in the previous chapters. They will be implemented as separate changes, first the lock reduction, then the instance zeroing, then the compression improvements, and finally the encryption changes.