KVM + SCSI¶
This is a design document detailing the refactoring of device handling in the KVM Hypervisor. More specifically, it will use the latest QEMU device model and modify the hotplug implementation so that both PCI and SCSI devices can be managed.
Current state and shortcomings¶
Ganeti currently supports SCSI virtual devices in the KVM hypervisor by setting the disk_type hvparam to scsi. Ganeti will eventually instruct QEMU to use the deprecated device model (i.e. -drive if=scsi), which will expose the backing store as an emulated SCSI device. This means that currently SCSI pass-through is not supported.
On the other hand, the current hotplug implementation Hotplug uses the latest QEMU device model (via the -device option) and is tailored to paravirtual devices, which leads to buggy behavior: if we hotplug a disk to an instance that is configured with disk_type=scsi hvparam, the disk which will get hot-plugged eventually will be a VirtIO device (i.e., virtio-blk-pci) on the PCI bus.
The current implementation of creating the QEMU command line is error-prone, since an instance might not be able to boot due to PCI slot congestion.
We change the way that the KVM hypervisor handles block devices by introducing latest QEMU device model for SCSI devices as well, so that scsi-cd, scsi-hd, scsi-block, and scsi-generic device drivers are supported too. Additionally we refactor the hotplug implementation in order to support hotplugging of SCSI devices too. Finally, we change the way we keep track of device info inside runtime files, and the way we place each device upon instance startup.
How to identify each device?
Currently KVM does not support arbitrary IDs for devices; supported are only names starting with a letter, with max 32 chars length, and only including the ‘.’, ‘_’, ‘-‘ special chars. Currently we generate an ID with the following format: <device type>-<part of uuid>-pci-<slot>. This assumes that the device will be plugged in a certain slot on the PCI bus. Since we want to support devices on a SCSI bus too and adding the PCI slot to the ID is redundant, we dump the last two parts of the existing ID. Additionally we get rid of the ‘hot’ prefix of device type, and we add the next two parts of the UUID so the chance of collitions is reduced significantly. So, as an example, the device ID of a disk with UUID ‘9e7c85f6-b6e5-4243-b27d-680b78c6d203’ would be now ‘disk-9e7c85f6-b6e5-4243’.
Which buses does the guest eventually see?
By default QEMU starts with a single PCI bus named “pci.0”. In case a SCSI controller is added on this bus, a SCSI bus is created with the corresponding name: “scsi.0”. Any SCSI disks will be attached on this SCSI bus. Currently Ganeti does not explicitly use a SCSI controller via a command line option, but lets QEMU add one automatically if needed. Here, in case we have a SCSI disk, a SCSI controller is explicitly added via the -device option. For the SCSI controller, we do not specify the PCI slot to use, but let QEMU find the first available (see below).
What type of SCSI controller to use?
QEMU uses the lsi controller by default. To make this configurable we add a new hvparam, scsi_controller_type. The available types will be lsi, megasas, and virtio-scsi-pci.
Where to place the devices upon instance startup?
The default QEMU machine type, pc, adds a i440FX-pcihost controller on the root bus that creates a PCI bus with pci.0 alias. By default the first three slots of this bus are occupied: slot 0 for Host bridge, slot 1 for ISA bridge, and slot 2 for VGA controller. Thereafter, the slots depend on the QEMU options passed in the command line.
The main reason that we want to be fully aware of the configuration of a running instance (machine type, PCI and SCSI bus state, devices, etc.) is that in case of migration a QEMU process with the exact same configuration should be created on the target node. The configuration is kept in the runtime file created just before starting the instance. Since hotplug has been introduced, the only thing that can change after starting an instance is the configuration related to NICs and Disks.
Before implementing hotplug, Ganeti did not specify PCI slots explicitly, but let QEMU decide how to place the devices on the corresponding bus. This does not work if we want to have hotplug-able devices and migrate-able VMs. Currently, upon runtime file creation, we try to reserve PCI slots based on the hvparams, the disks, and the NICs of the instance. This has three major shortcomings: first, we have to be aware which options modify the PCI bus which is practically impossible due to the huge amount of QEMU options, second, QEMU may change the default PCI configuration from version to version, and third, we cannot know if the extra options passed by the user via the kvm_extra hvparam modify the PCI bus.
All the above makes the current implementation error prone: an instance might not be able to boot if we explicitly add a NIC/Disk on a specific PCI slot that QEMU has already used for another device while parsing its command line options. Besides that, now, we want to use the SCSI bus as well so the above mechanism is insufficient. Here, we decide to put only disks and NICs on specific slots on the corresponding bus, and let QEMU put everything else automatically. To this end, we decide to let the first 12 PCI slots be managed by QEMU, and we start adding PCI devices (VirtIO block and network devices) from the 13th slot onwards. As far as the SCSI bus is concerned, we decide to put each SCSI disk on a different scsi-id (which corresponds to a different target number in SCSI terminology). The SCSI bus will not have any default reservations.
How to support the theoretical maximum of devices, 16 disks and 8 NICs?
By default, one could add up to 20 devices on the PCI bus; that is the 32 slots of the PCI bus, minus the starting 12 slots that Ganeti allows QEMU to manage on its own. In order to by able to add more PCI devices, we add the new kvm_pci_reservations hvparam to denote how many PCI slots QEMU will handle implicitly. The rest will be available for disk and NICs inserted explicitly by Ganeti. By default the default PCI reservations will be 12 as explained above.
How to keep track of the bus state of a running instance?
To be able to hotplug a device, we need to know which slot is
available on the desired bus. Until now, we were using the
QMP command that returns the state of the PCI buses (i.e., which devices
occupy which slots). Unfortunately, there is no equivalent for the SCSI
buses. We could use the
info qtree HMP command that practically
dumps in plain text the whole device tree. This makes it really hard to
parse. So we decide to generate the bus state of a running instance
through our local runtime files.
What info should be kept in runtime files?
Runtime files are used for instance migration (to run a QEMU process on the target node with the same configuration) and for hotplug actions (to update the configuration of a running instance so that it can be migrated). Until now we were using devices only on the PCI bus, so only each device’s PCI slot should be kept in the runtime file. This is obviously not enough. We decide to replace the pci slot of Disk and NIC configuration objects, with an hvinfo dict. It will contain all necessary info for constructing the appropriate -device QEMU option. Specifically the driver, id, and bus parameters will be present to all kind of devices. PCI devices will have the addr parameter, SCSI devices will have channel, scsi-id, and lun. NICs and Disks will have the extra netdev and drive parameters correspondingly.
How to deal with existing instances?
Only existing instances with paravirtual devices (configured via the disk_type and nic_type hvparam) use the latest QEMU device model. Only these have the pci slot filled. We will use the existing _UpgradeSerializedRuntime() method to migrate the old runtime format with pci slot in Disk and NIC configuration objects to the new one with hvinfo instead. The new hvinfo will contain the old driver (either virtio-blk-pci or virtio-net-pci), the old id (hotdisk-123456-pci-4), the default PCI bus (pci.0), and the old PCI slot (addr=4). This way those devices will still be hotplug-able, and the instance will still be migrate-able. When those instances are rebooted, the hvinfo will be re-generated.
How to support downgrades?
There are two possible ways, both not very pretty. The first one is to use _UpgradeSerializedRuntime() to remove the hvinfo slot. This would require the patching of all Ganeti versions down to 2.10 which is practically imposible. Another way is to ssh to all nodes and remove this slot upon a cluster downgrade. This ugly hack would go away on 2.17 since we support downgrades only to the previous minor version.
Disk objects get one extra slot:
hvinfo. It is
hypervisor-specific and will never reach config.data. In case of the KVM
Hypervisor it will contain all necessary info for constructing the -device
QEMU option. Existing entries in runtime files that had a pci slot
will be upgraded to have the corresponding hvinfo (see above).
The new scsi_controller_type hvparam is added to denote what type of SCSI controller should be added to PCI bus if we have a SCSI disk. Allowed values will be lsi, virtio-scsi-pci, and megasas. We decide to use lsi by default since this is the one that QEMU adds automatically if not specified explicitly by an option.
The current implementation verifies if a hotplug action has succeeded
by scanning the PCI bus and searching for a specific device ID. This
will change, and we will use the
query-block along with the
query-pci QMP command to find block devices that are attached to the
SCSI bus as well.
Up until now, if disk_type hvparam was set to scsi, QEMU would use the deprecated device model and end up using SCSI emulation, e.g.:
Now the equivalent, which will also enable hotplugging, will be to set disk_type to scsi-hd. The QEMU command line will include:
-drive file=/var/run/ganeti/instance-disks/test:0,if=none,format=raw,id=disk-9e7c85f6-b6e5-4243 -device scsi-hd,id=disk-9e7c85f6-b6e5-4243,drive=disk-9e7c85f6-b6e5-4243,bus=scsi.0,channel=0,scsi-id=0,lun=0
The disk_type hvparam will additionally support the scsi-hd, scsi-block, and scsi-generic values. The first one is equivalent to the existing scsi value and will make QEMU emulate a SCSI device, while the last two will add support for SCSI pass-through and will require a real SCSI device on the host.