Removal of the Config Lock Overhead

This is a design document detailing how the adverse effect of the config lock can be removed in an incremental way.

Current state and shortcomings

As a result of the Ganeti daemons refactoring, the configuration is held in a proccess different from the processes carrying out the Ganeti jobs. Therefore, job processes have to contact WConfD in order to change the configuration. Of course, these modifications of the configuration need to be synchronised.

The current form of synchronisation is via ConfigLock. Exclusive possession of this lock guarantees that no one else modifies the configuration. In other words, the current procedure for a job to update the configuration is to

  • acquire the ConfigLock from WConfD,
  • read the configration,
  • write the modified configuration, and
  • release ConfigLock.

The current procedure has some drawbacks. These also affect the overall throughput of jobs in a Ganeti cluster.

  • At each configuration update, the whole configuration is transferred between the job and WConfD.
  • More importantly, however, jobs can only release the ConfigLock after the write; the write, in turn, is only confirmed once the configuration is written on disk. In particular, we can only have one update per configuration write. Also, having the ConfigLock is only confirmed to the job, once the new lock status is written to disk.

Additional overhead is caused by the fact that reads are synchronised over a shared config lock. This used to make sense when the configuration was modifiable in the same process to ensure consistent read. With the new structure, all access to the configuration via WConfD are consistent anyway, and local modifications by other jobs do not happen.

Proposed changes for an incremental improvement

Ideally, jobs would just send patches for the configuration to WConfD that are applied by means of atomically updating the respective IORef. This, however, would require chaning all of Ganeti’s logical units in one big change. Therefore, we propose to keep the ConfigLock and, step by step, reduce its impact till it eventually will be just used internally in the WConfD process.

Unlocked Reads

In a first step, all configuration operations that are synchronised over a shared config lock, and therefore necessarily read-only, will instead use WConfD’s readConfig used to obtain a snapshot of the configuration. This will be done without modifying the locks. It is sound, as reads to a Haskell IORef always yield a consistent value. From that snapshot the required view is computed locally. This saves two lock-configurtion write cycles per read and, additionally, does not block any concurrent modifications.

In a second step, more specialised read functions will be added to WConfD. This will reduce the traffic for reads.

Cached Reads

As jobs synchronize with each other by means of regular locks, the parts of the configuration relevant for a job can only change while a job waits for new locks. So, if a job has a copy of the configuration and not asked for locks afterwards, all read-only access can be done from that copy. While this will not affect the ConfigLock, it saves traffic.

Set-and-release action

As a typical pattern is to change the configuration and afterwards release the ConfigLock. To avoid unnecessary RPC call overhead, WConfD will offer a combined call. To make that call retryable, it will do nothing if the the ConfigLock is not held by the caller; in the return value, it will indicate if the config lock was held when the call was made.

Short-lived ConfigLock

For a lot of operations, the regular locks already ensure that only one job can modify a certain part of the configuration. For example, only jobs with an exclusive lock on an instance will modify that instance. Therefore, it can update that entity atomically, without relying on the configuration lock to achive consistency. WConfD will provide such operations. To avoid interference with non-atomic operations that still take the config lock and write the configuration as a whole, this operation will only be carried out at times the config lock is not taken. To ensure this, the thread handling the request will take the config lock itself (hence no one else has it, if that succeeds) before the change and release afterwards; both operations will be done without triggering a writeout of the lock status.

Note that the thread handling the request has to take the lock in its own name and not in that of the requesting job. A writeout of the lock status can still happen, triggered by other requests. Now, if WConfD gets restarted after the lock acquisition, if that happend in the name of the job, it would own a lock without knowing about it, and hence that lock would never get released.

Approaches considered, but not working

Set-and-release action with asynchronous writes

Approach

As a typical pattern is to change the configuration and afterwards release the ConfigLock. To avoid unnecessary delay in this operation (the next modification of the configuration can already happen while the last change is written out), WConfD will offer a combined command that will

  • set the configuration to the specified value,
  • release the config lock,
  • and only then wait for the configuration write to finish; it will not wait for confirmation of the lock-release write.

If jobs use this combined command instead of the sequential set followed by release, new configuration changes can come in during writeout of the current change; in particular, a writeout can contain more than one change.

Problem

This approach works fine, as long as always either WConfD can do an ordered shutdown or the calling process dies as well. If however, we allow random kill signals to be sent to individual daemons (e.g., by an out-of-memory killer), the following race occurs. A process can ask for a combined write-and-unlock operation; while the configuration is still written out, the write out of the updated lock status already finishes. Now, if WConfD forcefully gets killed in that very moment, a restarted WConfD will read the old configuration but the new lock status. This will make the calling process believe that its call, while it didn’t get an answer, succeeded nevertheless, thus resulting in a wrong configuration state.