======================
Memory Over Commitment
======================

.. contents:: :depth: 4

This document describes the proposed changes to support memory
overcommitment in Ganeti.

Background
==========

Memory is a non-preemptable resource, and thus cannot be shared, e.g.,
in a round-robin fashion. Therefore, Ganeti is very careful to make sure
there is always enough physical memory for the memory promised to the
instances. This holds even in an N+1 redundant way: should one node
fail, its instances can be relocated to other nodes while still having
enough physical memory for the memory promised to all instances.

Overview of the current memory model
------------------------------------

To make decisions, ``htools`` query the following parameters from
Ganeti.

- The amount of memory used by each instance. This is the
  state-of-record backend parameter ``maxmem`` for that instance
  (possibly inherited from group-level or cluster-level backend
  parameters). It tells the hypervisor the maximal amount of memory
  that instance may use.

- The state-of-world parameters for the node memory. They are collected
  live and are hypervisor specific. The following parameters are
  collected.

  - ``memory_total``: the total memory size on the node

  - ``memory_free``: the memory available on the node for instances

  - ``memory_dom0``: the memory used by the node itself, if available

  For Xen, the amounts of total and free memory are obtained by parsing
  the output of the Xen ``info`` command (e.g., ``xm info``). The dom0
  memory is obtained by looking in the output of the ``list`` command
  for ``Domain-0``.

  For the ``kvm`` hypervisor, all these parameters are obtained by
  reading ``/proc/meminfo``, where the entries ``MemTotal`` and
  ``Active`` are considered the values for ``memory_total`` and
  ``memory_dom0``, respectively. The value for ``memory_free`` is taken
  as the sum of the entries ``MemFree``, ``Buffers``, and ``Cached``.

Current state and shortcomings
==============================

While the current model of never over committing memory serves well to
provide reliability guarantees to instances, it is not well suited to
situations where the actual use of memory in the instances is spiky.
Consider a scenario where instances only touch a small portion of their
memory most of the time, but occasionally use a large amount of memory.
Then, at any moment, a large fraction of the memory used for the
instances sits around without being actively used. By swapping out
memory that is not actively used, resources can be used more
efficiently.

Proposed changes
================

We propose to support over commitment of memory if desired by the
administrator. Memory will change from being a hard constraint to being
a question of policy. The default will be not to over commit memory.

Extension of the policy by a new parameter
------------------------------------------

The instance policy is extended by a new real-number field
``memory-ratio``. Policies on groups inherit this parameter from the
cluster-wide policy in the same way as all other parameters of the
instance policy.

When a cluster is upgraded from an earlier version not containing
``memory-ratio``, the value ``1.0`` is inserted for this new field in
the cluster-level ``ipolicy``; in this way, the status quo of not over
committing memory is preserved via upgrades. The ``gnt-cluster modify``
and ``gnt-group modify`` commands are extended to allow setting of the
``memory-ratio``.

The ``htools`` text format is extended to also contain this new ipolicy
parameter. It is added as an optional entry at the end of the parameter
list of an ipolicy line, to remain backwards compatible. If the
parameter is missing, the value ``1.0`` is assumed.
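The following minimal sketch illustrates this backwards-compatible
reading of the text format. The function name and field count are
hypothetical and merely stand in for the actual ``htools`` parser; the
point is only that a line written by an older version, lacking the
trailing entry, yields the status-quo value of ``1.0``.

.. code-block:: python

  OLD_IPOLICY_FIELD_COUNT = 12  # hypothetical pre-change field count

  def parse_memory_ratio(fields):
      """Return the memory-ratio from the split fields of an ipolicy
      line, defaulting to 1.0 when the optional trailing entry is
      missing."""
      if len(fields) > OLD_IPOLICY_FIELD_COUNT:
          return float(fields[OLD_IPOLICY_FIELD_COUNT])
      return 1.0

  # An old-format line yields the default; a new-format line yields the
  # explicit ratio.
  assert parse_memory_ratio(["x"] * 12) == 1.0
  assert parse_memory_ratio(["x"] * 12 + ["1.5"]) == 1.5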
Changes to the memory reporting on non ``xen-hvm`` and ``xen-pvm``
------------------------------------------------------------------

For all hypervisors, ``memory_dom0`` corresponds to the amount of
memory used by Ganeti itself and all other non-hypervisor processes
running on the node. The amount of memory currently reported for
``memory_dom0`` on hypervisors other than ``xen-hvm`` and ``xen-pvm``,
however, includes the amount of active memory of the hypervisor
processes. This conflicts with the underlying assumption that
``memory_dom0`` memory is not available for instances.

Therefore, for hypervisors other than ``xen-pvm`` and ``xen-hvm``, we
will use a new state-of-record hypervisor parameter called ``mem_node``
in htools instead of the reported ``memory_dom0``. As a hypervisor
state parameter, it is run-time tunable and inheritable at group and
cluster levels. If this parameter is not present, a default value of
``1024M`` will be used, which is a conservative estimate of the amount
of memory used by Ganeti on a medium-sized cluster. The reason for
using a state-of-record value is to have a stable amount of reserved
memory, irrespective of the current activity of Ganeti. Currently,
hypervisor state parameters are partly implemented but not used by
Ganeti.

Changes to the memory policy
----------------------------

The memory policy will be changed in that we assume that one byte of
physical node memory can hold ``memory-ratio`` bytes of instance
memory, but still only one byte of Ganeti memory. Of course, in
practice this has to be backed by swap space; it is the administrator's
responsibility to ensure that each node has swap of at least
``(memory-ratio - 1.0) * (memory_total - memory_dom0)``. Ganeti will
warn if the amount of swap space is not big enough.

The new memory policy will be as follows; a sketch of the resulting
placement check is shown after the ``htools`` section below.

- The difference between the total memory of a node and its dom0 memory
  will be considered the amount of *available memory*.

- The amount of *used memory* will be (as is now) the sum of the memory
  of all instances and the reserved memory.

- The *relative memory usage* is the ratio of used memory to available
  memory. Note that the relative usage can be bigger than ``1.0``.

- The memory-related constraint for instance placement is that,
  afterwards, the relative memory usage be at most the memory-ratio. If
  the ratio of the memory of the real instances on the node to the
  available memory is bigger than the memory-ratio, this is considered
  a hard violation; otherwise it is considered a soft violation.

- The definition of N+1 redundancy (including
  :doc:`design-shared-storage-redundancy`) is kept literally as is.
  Note, however, that the meaning does change, as the definition
  depends on the notion of allowed moves, which is changed by this
  proposal.

Changes to cluster verify
-------------------------

The only place where the Ganeti core handles memory is when
``gnt-cluster verify`` verifies N+1 redundancy. This code will be
changed to follow the new memory model. Additionally, ``gnt-cluster
verify`` will warn if the sum of available memory and swap space is not
at least as big as the used memory.

Changes to ``htools``
---------------------

The underlying model of the cluster will be changed in accordance with
the suggested change of the memory policy. As all higher-level
``htools`` operations go through only the primitives of adding/moving
an instance if possible, and of inspecting the cluster metrics,
changing the base model will make all ``htools`` compliant with the new
memory model.
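To make the placement constraint above concrete, here is a minimal
sketch under stated assumptions: the function and parameter names
(``memory_violation``, ``required_swap``, and their arguments) are
hypothetical rather than actual Ganeti or ``htools`` identifiers, and
all quantities are assumed to be in a common unit (e.g., MiB).

.. code-block:: python

  def memory_violation(mem_total, mem_dom0, mem_instances, mem_reserved,
                       mem_real_instances, memory_ratio):
      """Classify a node's memory state under the proposed policy.

      Returns "hard" if the real instances alone exceed the ratio,
      "soft" if only the total usage does, and None otherwise.
      """
      available = mem_total - mem_dom0
      used = mem_instances + mem_reserved
      relative_usage = used / available  # may exceed 1.0 when over committing
      if mem_real_instances / available > memory_ratio:
          return "hard"
      if relative_usage > memory_ratio:
          return "soft"
      return None

  def required_swap(mem_total, mem_dom0, memory_ratio):
      """Swap the administrator must provide, per the formula above."""
      return (memory_ratio - 1.0) * (mem_total - mem_dom0)

  # With a memory-ratio of 1.0 the check reduces to the current model:
  # no more instance memory than physically available.
  assert memory_violation(4096, 1024, 2048, 0, 2048, 1.0) is None
  assert memory_violation(4096, 1024, 4096, 0, 4096, 1.0) == "hard"
  assert memory_violation(4096, 1024, 4096, 0, 2048, 1.5) is None
  assert required_swap(4096, 1024, 1.5) == 1536.0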
Balancing
---------

The cluster metric components will not be changed. Note that the
standard deviation of relative memory usage is already one of the
components.

For dynamic (load-based) balancing, the amount of not immediately
discardable memory will serve as an indication of memory activity; as
usual, the measure will be the standard deviation of the relative value
(i.e., of the ratio of non-discardable memory to available memory). The
weighting for this metric component will have to be determined by
experimentation and will depend on the memory ratio; for a memory ratio
of ``1.0`` the weight will be ``0.0``, as memory need not be taken into
account if no over-commitment is in place. For memory ratios bigger
than ``1.0``, the weight will be positive and grow with the ratio.
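As an illustration, here is a minimal sketch of such a metric
component, under stated assumptions: the names are hypothetical, and
the linear weight function merely satisfies the constraints stated
above (``0.0`` at a ratio of ``1.0``, positive and growing beyond it);
the actual weighting remains to be determined by experimentation.

.. code-block:: python

  import statistics

  def memory_activity_component(nodes):
      """Standard deviation of relative non-discardable memory usage.

      ``nodes`` is a list of (non_discardable, available) pairs, one
      per node, in a common unit.
      """
      return statistics.pstdev(nd / avail for nd, avail in nodes)

  def memory_activity_weight(memory_ratio):
      """0.0 without over commitment, growing with the ratio; the
      linear form is a placeholder for the experimentally determined
      weighting."""
      return max(0.0, memory_ratio - 1.0)

  # Usage: the weighted contribution of this component to the cluster
  # metric; it vanishes entirely for a memory-ratio of 1.0.
  nodes = [(1024, 3072), (2048, 3072), (512, 3072)]
  score = memory_activity_weight(1.5) * memory_activity_component(nodes)
  assert memory_activity_weight(1.0) == 0.0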