This design document describes the KVM daemon, which is responsible for determining whether a given KVM instance was shutdown by an administrator or a user.
Current state and shortcomings¶
This design document describes the KVM daemon which addresses the KVM side of the user-initiated shutdown problem introduced in Detection of user-initiated shutdown from inside an instance. We are also interested in keeping this functionality optional. That is, an administrator does not necessarily have to run the KVM daemon if either he is running Xen or even, if he is running KVM, he is not interested in instance shutdown detection. This requirement is important because it means the KVM daemon should be a modular component in the overall Ganeti design, i.e., it should be easy to enable and disable it.
The instance shutdown feature for KVM requires listening on events from
the Qemu Machine Protocol (QMP) Unix socket, which is created together
with a KVM instance. A QMP socket typically looks like
/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp and implements
the QMP protocol. This is a bidirectional protocol that allows Ganeti
to send commands, such as, system powerdown, as well as, receive events,
such as, the powerdown and shutdown events.
Listening in on these events allows Ganeti to determine whether a given
KVM instance was shutdown by an administrator, either through
gnt-instance stop|remove <instance> or
<instance-pid>, or by a user, through
poweroff from inside the
instance. Upon an administrator powerdown, the QMP protocol sends two
events, namely, a powerdown event and a shutdown event, whereas upon a
user shutdown only the shutdown event is sent. This is enough to
distinguish between an administrator and a user shutdown. However,
there is one limitation, which is,
kill -TERM <instance-pid>. Even
though this is an action performed by the administrator, it will be
considered a user shutdown by the approach described in this document.
Several design strategies were considered. Most of these strategies consisted of spawning some process listening on the QMP socket when a KVM instance is created. However, having a listener process per KVM instance is not scalable. Therefore, a different strategy is proposed, namely, having a single process, called the KVM daemon, listening on the QMP sockets of all KVM instances within a node. That also means there is an instance of the KVM daemon on each node.
In order to implement the KVM daemon, two problems need to be addressed, namely, how the KVM daemon knows when to open a connection to a given QMP socket and how the KVM daemon communicates with Ganeti whether a given instance was shutdown by an administrator or a user.
QMP connections management¶
As mentioned before, the QMP sockets reside in the KVM control
directory, which is usually located under
/var/run/ganeti/kvm-hypervisor/ctrl/. When a KVM instance is
created, a new QMP socket for this instance is also created in this
In order to simplify the design of the KVM daemon, instead of having
Ganeti communicate to this daemon through a pipe or socket the creation
of a new KVM instance, and thus a new QMP socket, this daemon will
monitor the KVM control directory using
inotify. As a result, the
daemon is not only able to deal with KVM instances being created and
removed, but also capable of overcoming other problematic situations
concerning the filesystem, such as, the case when the KVM control
directory does not exist because, for example, Ganeti was not yet
started, or the KVM control directory was removed, for example, as a
result of a Ganeti reinstallation.
As mentioned before, the KVM daemon is responsible for opening a connection to the QMP socket of a given instance and listening in on the shutdown and powerdown events, which allow the KVM daemon to determine whether the instance stopped because of an administrator or user shutdown. Once the instance is stopped, the KVM daemon needs to communicate to Ganeti whether the user was responsible for shutting down the instance.
In order to achieve this, the KVM daemon writes an empty file, called
the shutdown file, in the KVM control directory with a name similar to
the QMP socket file but with the extension
.qmp replaced with
.shutdown. The presence of this file indicates that the shutdown
was initiated by a user, whereas the absence of this file indicates that
the shutdown was caused by an administrator. This strategy also handles
crashes and signals, such as,
SIGKILL, to be handled correctly,
given that in these cases the KVM daemon never receives the powerdown
and shutdown events and, therefore, never creates the shutdown file.
KVM daemon launch¶
With the above issues addressed, a question remains as to when the KVM daemon should be started. The KVM daemon is different from other Ganeti daemons, which start together with the Ganeti service, because the KVM daemon is optional, given that it is specific to KVM and should not be run on installations containing only Xen, and, even in a KVM installation, the user might still choose not to enable it. And finally because the KVM daemon is not really necessary until the first KVM instance is started. For these reasons, the KVM daemon is started from within Ganeti when a KVM instance is started. And the job process spawned by the node daemon is responsible for starting the KVM daemon.
Given the current design of Ganeti, in which the node daemon spawns a job process to handle the creation of the instance, when launching the KVM daemon it is necessary to first check whether an instance of this daemon is already running and, if this is not the case, then the KVM daemon can be safely started.
At first, it might seem natural to include the instance shutdown detection for KVM in the node daemon. After all, the node daemon is already responsible for managing instances, for example, starting and stopping an instance. Nevertheless, the node daemon is more complicated than it might seem at first.
The node daemon is composed of the main loop, which runs in the main thread and is responsible for receiving requests and spawning jobs for handling these requests, and the jobs, which are independent processes spawned for executing the actual tasks, such as, creating an instance.
Including instance shutdown detection in the node daemon is not viable because adding it to the main loop would cause KVM specific code to taint the generality of the node daemon. In order to add it to the job processes, it would be possible to spawn either a foreground or a background process. However, these options are also not viable because they would lead to the situation described before where there would be a monitoring process per instance, which is not scalable. Moreover, the foreground process has an additional disadvantage: it would require modifications the node daemon in order not to expect a terminating job, which is the current node daemon design.
There is another design issue to have in mind. We could reconsider the place where to write the data that tell Ganeti whether an instance was shutdown by an administrator or the user. Instead of using the KVM shutdown files presented above, in which the presence of the file indicates a user shutdown and its absence an administrator shutdown, we could store a value in the KVM runtime state file, which is where the relevant KVM state information is. The advantage of this approach is that it would keep the KVM related information in one place, thus making it easier to manage. However, it would lead to a more complex implementation and, in the context of the general transition in Ganeti from Python to Haskell, a simpler implementation is preferred.
Finally, it should be noted that the KVM runtime state file benefits from automatic migration. That is, when an instance is migrated so is the KVM state file. However, the instance shutdown detection for KVM does not require this feature and, in fact, migrating the instance shutdown state would be incorrect.
There are potential race conditions between Ganeti and the KVM daemon,
however, in practice they seem unlikely. For example, the KVM daemon
needs to add and remove watches to the parent directories of the KVM
control directory until this directory is finally created. It is
possible that Ganeti creates this directory and a KVM instance before
the KVM daemon has a chance to add a watch to the KVM control directory,
thus causing this daemon to miss the
inotify creation event for the
There are other problems which arise from the limitations of
inotify. For example, if the KVM daemon is started after the first
Ganeti instance has been created, then the
inotify will not produce
any event for the creation of the QMP socket. This can happen, for
example, if the KVM daemon needs to be restarted or upgraded. As a
result, it might be necessary to have an additional mechanism that runs
at KVM daemon startup or at regular intervals to ensure that the current
KVM internal state is consistent with the actual contents of the KVM
Another race condition occurs when Ganeti shuts down a KVM instance
using force. Ganeti uses
TERM signals to stop KVM instances when
force is specified or ACPI is not enabled. However, as mentioned
TERM signals are interpreted by the KVM daemon as a user
shutdown. As a result, the KVM daemon creates a shutdown file which
then must be removed by Ganeti. The race condition occurs because the
KVM daemon might create the shutdown file after the hypervisor code that
tries to remove this file has already run. In practice, the race
condition seems unlikely because Ganeti stops the KVM instance in a
retry loop, which allows Ganeti to stop the instance and cleanup its
It is possible to determine if a process, in this particular case the
KVM process, was terminated by a
TERM signal, using the proc
connector and socket filters.
The proc connector is a socket connected between a userspace process and
the kernel through the netlink protocol and can be used to receive
notifications of process events, and the socket filters is a mechanism
for subscribing only to events that are relevant. There are several
process events which can be
subscribed to, however, in this case, we are interested only in the exit
event, which carries information about the exit signal.