Performance tests for QA

This design document describes performance tests to be added to QA in order to measure performance changes over time.

Current state and shortcomings

Currently, only functional QA tests are performed. Those tests verify the correct behaviour of Ganeti in various configurations, but are not designed to continuously monitor the performance of Ganeti.

The current QA tests don’t execute multiple tasks/jobs in parallel. As a consequence, Ganeti’s locking code receives hardly any testing, neither functional nor performance-wise.

On the plus side, Ganeti’s QA code already measures the runtime of individual tests; this design leverages that capability.
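
As an illustration of the kind of measurement this refers to, the following minimal Python sketch times a test callable and reports its runtime. It is not the actual QA code, and the function names are placeholders:

    import functools
    import time

    def TimeTest(fn):
        """Report the wall-clock runtime of a single QA test."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                print("%s took %.2f seconds" % (fn.__name__, elapsed))
        return wrapper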

Proposed changes

The tests to be added in the context of this design document focus on two areas:

  • Job queue performance. How does Ganeti handle a lot of submitted jobs?
  • Parallel job execution performance. How well does Ganeti parallelize jobs?

Jobs are submitted to the job queue in sequential order, but the execution of the jobs runs in parallel. All job submissions must complete within a reasonable timeout.
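
The following Python sketch illustrates this pattern. It assumes the standard --submit option of the gnt-* commands (which submits the job and returns instead of waiting for it; the job ID is assumed to be the only output), and a hypothetical timeout of 30 seconds per submission:

    import subprocess

    SUBMIT_TIMEOUT = 30  # assumed "reasonable" bound per submission, in seconds

    def SubmitJob(args):
        """Submit one job and return its job ID without waiting for it."""
        result = subprocess.run(args + ["--submit"],
                                capture_output=True, text=True, check=True,
                                timeout=SUBMIT_TIMEOUT)
        # With --submit, the command is assumed to print only the job ID.
        return result.stdout.strip()

    def WaitForJobs(job_ids):
        """Block until all previously submitted jobs have finished."""
        for job_id in job_ids:
            subprocess.run(["gnt-job", "watch", job_id], check=True)

    # Jobs are submitted sequentially; Ganeti executes them in parallel.
    jobs = [SubmitJob(["gnt-instance", "reboot", name])
            for name in ("instance1", "instance2", "instance3")]
    WaitForJobs(jobs)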

In order to make it easier to recognize performance-related tests, all tests added in the context of this design are given a description prefixed with “PERFORMANCE: ”.

Job queue performance

Tests targeting the job queue should eliminate external factors (like network/disk performance or hypervisor delays) as much as possible, so they are designed to run in a vcluster QA environment.

The following tests are added to the QA:

  • Submit the maximum number of instance create jobs in parallel. As soon as a creation job succeeds, submit a removal job for this instance.
  • Submit as many instance create jobs as there are nodes in the cluster in parallel (for non-redundant instances). Removal jobs as above.
  • For the maximum number of instances in the cluster, submit modify jobs (modifying hypervisor and backend parameters) in parallel.
  • For the maximum number of instances in the cluster, submit stop, start, reboot and reinstall jobs in parallel.
  • For the maximum number of instances in the cluster, submit multiple list and info jobs in parallel.
  • For the maximum number of instances in the cluster, submit move jobs in parallel. While the move operations are running, get instance information using info jobs. Those jobs are required to return within a reasonably low timeout.
  • For the maximum number of instances in the cluster, submit add-, remove- and list-tags jobs.
  • Submit 200 gnt-debug delay jobs with a delay of 0.1 seconds. To speed up submission, perform multiple job submissions in parallel. Verify that submitting jobs doesn’t slow down significantly during the process, and that querying cluster information over CLI and RAPI still succeeds in a timely fashion while the delay jobs are running or queued (see the sketch after this list).
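
The last test above can be sketched as follows, assuming gnt-debug delay accepts the standard --submit option, and using illustrative thread counts and thresholds that this design does not mandate:

    import subprocess
    import time
    from concurrent.futures import ThreadPoolExecutor

    def TimedSubmission(_):
        """Submit one delay job and return how long the submission took."""
        start = time.monotonic()
        subprocess.run(["gnt-debug", "delay", "--submit", "0.1"],
                       check=True, capture_output=True)
        return time.monotonic() - start

    with ThreadPoolExecutor(max_workers=10) as pool:
        latencies = list(pool.map(TimedSubmission, range(200)))

    # Compare early and late submissions; a large ratio would mean that
    # submission slows down as the queue fills up (threshold is illustrative).
    first = sum(latencies[:20]) / 20.0
    last = sum(latencies[-20:]) / 20.0
    assert last < 4 * first, "job submission slowed down significantly"

    # Cluster queries must still return in a timely fashion.
    subprocess.run(["gnt-cluster", "info"], check=True, timeout=30)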

Parallel job execution performance

Tests targeting the performance of parallel execution of “real” jobs in close-to-production clusters should actually perform all operations, such as creating disks and starting instances. This way, real-world locking and waiting issues can be reproduced. Performing all those operations does require quite some time though, so only a smaller number of instances and parallel jobs can realistically be tested.

The following tests are added to the QA:

  • Submit twice as many instance creation requests as there are nodes in the cluster, using DRBD as the disk template. The job parameters are chosen according to best practice for parallel instance creation, without running the risk of creations failing because of too many parallel creation attempts. As soon as a creation job succeeds, submit a removal job for this instance (see the sketch after this list).
  • Submit twice as many instance creation requests as there are nodes in the cluster, using plain as the disk template. As soon as a creation job succeeds, submit a removal job for this instance. This test can make better use of parallelism because only one node must be locked per instance creation.
  • Create an instance using DRBD. Fail it over, migrate it, change its secondary node, reboot it and reinstall it, while creating an additional instance in parallel to each of those operations.
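
The creation/removal pattern of the first two tests could look roughly like the sketch below. The OS name, disk size and node count are placeholders (in the real test the node count would come from the QA cluster configuration), and node placement is left to an iallocator (hail):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def CreateThenRemove(name):
        """Create one DRBD instance; on success, immediately remove it."""
        create = subprocess.run(
            ["gnt-instance", "add", "-t", "drbd", "-I", "hail",
             "-o", "debootstrap+default", "--disk", "0:size=1G", name],
            capture_output=True)
        if create.returncode == 0:
            # Creation succeeded; submit the removal right away.
            subprocess.run(["gnt-instance", "remove", "-f", name], check=True)

    num_nodes = 3  # placeholder; derived from the cluster in the real test
    names = ["perf-inst%d" % i for i in range(2 * num_nodes)]
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        list(pool.map(CreateThenRemove, names))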

Future work

Based on the results of the tests listed above, additional tests can be added to cover more real-world use cases. Also, based on user requests, specially crafted performance tests modeling those workloads can be added.

Additionally, the correlation between job submission time and job queue size could be measured. For this purpose, a snapshot of the job queue could be taken before each job submission, so that submission times can be related to the number of jobs already in the queue.
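
A rough sketch of such a measurement follows. It pairs the queue length observed just before each submission (here approximated by counting the output of gnt-job list) with the submission latency, so the correlation can be analyzed afterwards; the sample count and the use of delay jobs are illustrative choices:

    import subprocess
    import time

    def QueueLength():
        """Approximate the job queue size by counting listed jobs."""
        result = subprocess.run(["gnt-job", "list", "--no-headers"],
                                capture_output=True, text=True, check=True)
        return len(result.stdout.splitlines())

    samples = []  # pairs of (queue length before submission, latency)
    for _ in range(50):
        length = QueueLength()
        start = time.monotonic()
        subprocess.run(["gnt-debug", "delay", "--submit", "0.1"],
                       check=True, capture_output=True)
        samples.append((length, time.monotonic() - start))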