Contents
This is a design document detailing the implementation of a Ganeti monitoring agent report system, which can be queried by a monitoring system to calculate health information for a Ganeti cluster.
There is currently no monitoring support in Ganeti. While we don’t want to build something like Nagios or Pacemaker as part of Ganeti, it would be useful if such tools could easily extract information from a Ganeti machine in order to take actions (example actions include logging an outage for future reporting or alerting a person or system about it).
Each Ganeti node should export a status page that can be queried by a monitoring system. This status page will be exported on a network port and will be encoded in JSON (simple text) over HTTP.
The choice of JSON is obvious as we already depend on it in Ganeti and thus we don’t need to add extra libraries to use it, as opposed to what would happen for XML or some other markup format.
The report will be available from all nodes, and will cover all node-local resources. This allows more real-time information to be available, at the cost of querying all nodes.
The monitoring agent system will report on the following basic information:
The report of the monitoring agent will be in JSON format, and it will present an array of report objects. Each report object will be produced by a specific data collector. Each report object includes some mandatory fields, to be provided by all the data collectors: name, version, format_version, timestamp, category, kind and data.
Here follows a minimal example of a report:
[
  {
    "name" : "TheCollectorIdentifier",
    "version" : "1.2",
    "format_version" : 1,
    "timestamp" : 1351607182000000000,
    "category" : null,
    "kind" : 0,
    "data" : { "plugin_specific_data" : "go_here" }
  },
  {
    "name" : "AnotherDataCollector",
    "version" : "B",
    "format_version" : 7,
    "timestamp" : 1351609526123854000,
    "category" : "storage",
    "kind" : 1,
    "data" : { "status" : { "code" : 1,
                            "message" : "Error on disk 2"
                          },
               "plugin_specific" : "data",
               "some_late_data" : { "timestamp" : 1351609526123942720,
                                    ...
                                  }
             }
  }
]
These collectors only provide data about some component of the system, without giving any interpretation of its meaning.
The value of the kind field of the report will be 0.
These collectors will provide information about the status of some component of Ganeti, or of a component managed by Ganeti.
The value of their kind field will be 1.
The rationale behind this kind of collectors is that there are some situations where exporting data about the underlying subsystems would expose potential issues. But if Ganeti itself is able (and going) to fix the problem, conflicts might arise between Ganeti and something/somebody else trying to fix the same problem. Also, some external monitoring systems might not be aware of the internals of a particular subsystem (e.g.: DRBD) and might only exploit the high level response of its data collector, alerting an administrator if anything is wrong. Still, completely hiding the underlying data is not a good idea, as they might still be of use in some cases. So status reporting plugins will provide two output modes: one just exporting a high level information about the status, and one also exporting all the data they gathered. The default output mode will be the status-only one. Through a command line parameter (for stand-alone data collectors) or through the HTTP request to the monitoring agent (when collectors are executed as part of it) the verbose output mode providing all the data can be selected.
When exporting just the status, each status reporting collector will provide, in its data section, at least a status field. This field summarizes the status of the component being monitored and consists of two subfields, code and message:
The code subfield assumes a numeric value, encoded in such a way as to allow using a bitset to easily distinguish which states are currently present in the whole cluster: 0 if the component is working as intended, 1 if something is temporarily wrong but is being fixed automatically, 2 if something has gone wrong and external intervention is required, and 4 if it was not possible to determine a proper status. If the bitwise OR of all the status codes is 0, the cluster is completely healthy; for example, one collector reporting 1 and another reporting 2 contribute 1 | 2 = 3 to the cluster-wide bitset, flagging both conditions at once.
The message subfield is a string that better explains the reason for the status. The exact format of the message is data collector dependent.
The field is mandatory, but the content can be an empty string if the code is 0 (working as intended) or 1 (being fixed automatically).
If the status code is 2, the message should specify what has gone wrong. If the status code is 4, the message should explain why it was not possible to determine a proper status.
The data section will also contain all the fields describing the gathered data, according to a collector-specific format.
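As an illustration only, the status-only output of such a collector might therefore have a data section of the following form (mirroring the status object already shown in the example report above; the values are purely illustrative):

  { "status" : { "code" : 1,
                 "message" : "Error on disk 2"
               }
  }

In verbose mode the same data section would additionally carry the collector-specific fields.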
At the moment each node knows which instances are running on it and which instances it is primary for, but not why an instance might not be running. On the other hand, we don't want to distribute full instance "admin" status information to all nodes, because of the performance impact this would have.
As such we propose that:
Monitoring and auditing systems can then use the reason to understand the cause of an instance status, and they can use the timestamp to understand the freshness of their data even in the absence of an atomic cross-node reporting: for example, if they see an instance "up" on a node after seeing it running on a previous one, they can compare these timestamps to understand which report is the freshest, and repoll the "older" node. Of course, if they keep seeing this situation it represents an error (either an instance continuously "flapping" between nodes, or an instance constantly running on more than one node), which should be reported and acted upon.
The instance status collector will run on each node, reporting on the instances the node is primary for. The data section of its report will contain a list of instances, named instances, with at least the following fields for each instance:
Each hypervisor should provide its own instance status data collector, possibly with the addition of more specific fields. The category field of all of them will be instance. The kind field will be 1.
Note that as soon as a node knows it’s not the primary anymore for an instance it will stop reporting status for it: this means the instance will either disappear, if it has been deleted, or appear on another node, if it’s been moved.
The code of the status field of the report of the Instance status data collector will be:
The storage collectors will be a series of data collectors gathering data about the storage of the current node. The collection will be performed at different granularity and abstraction levels, from physical disks to partitions and logical volumes, up to the specific storage types used by Ganeti itself (drbd, rbd, plain, file).
The name of each of these collectors will reflect the storage type it refers to.
The category field of these collectors will be storage.
The kind field will depend on the specific collector.
Each storage collector’s data section will provide collector-specific fields.
The various storage collectors will provide keys (e.g. device names or instance names) to join the data they provide, in order to allow the user to get a better understanding of the system.
This storage data collector will gather information about the status of the disks installed in the system, as listed in the /proc/diskstats file. This means that not only physical hard drives, but also ramdisks and loopback devices will be listed.
Its kind in the report will be 0 (Performance reporting collectors).
Its category field in the report will contain the value storage.
When executed in verbose mode, the data section of the report of this collector will be a list of items, each representing one disk, each providing the following fields:
This data collector will gather information about the attributes of logical volumes present in the system.
Its kind in the report will be 0 (Performance reporting collectors).
Its category field in the report will contain the value storage.
The data section of the report of this collector will be a list of items, each representing one logical volume and providing the following fields:
This data collector will run only on nodes where DRBD is actually present and it will gather information about DRBD devices.
Its kind in the report will be 1 (Status reporting collectors).
Its category field in the report will contain the value storage.
When executed in verbose mode, the data section of the report of this collector will provide the following fields:
Information about the DRBD version number, given by a combination of any (but at least one) of the following fields:
A list of structures, each describing a DRBD device (a minor) and containing the following fields:
The performance indicators. This field will contain the following sub-fields:
(Optional) The status of the synchronization of the disk. This is present only if the disk is being synchronized, and includes the following fields:
Ganeti will report what information it has about its own daemons. This should allow identifying possible problems with the Ganeti system itself: for example memory leaks, crashes and high resource utilization should be evident by analyzing this information.
The kind field will be 1 (Status reporting collectors).
Each daemon will have its own data collector, and each of them will have a category field valued daemon.
When executed in verbose mode, their data section will include at least:
Any other daemon-specific information can be included as well in the data section.
Each hypervisor has a view of system resources that sometimes is different from the one the OS sees (for example in Xen the node OS, running as Dom0, has access to only part of those resources). In this section we'll report all the information we can in a "non hypervisor specific" way. Each hypervisor can then add extra specific information that is not generic enough to be abstracted.
The kind field will be 0 (Performance reporting collectors).
Each of the hypervisor data collectors will be of category hypervisor.
Since Ganeti assumes it’s running on Linux, it’s useful to export some basic information as seen by the host system.
The category field of the report will be null.
The kind field will be 0 (Performance reporting collectors).
The data section will include:
Note that we won’t go into any hardware specific details (e.g. querying a node RAID is outside the scope of this, and can be implemented as a plugin) but we can easily just report the information above, since it’s standard enough across all systems.
This data collector will export CPU load statistics as seen by the host system. Apart from using the data from an external monitoring system we can also use the data to improve instance allocation and/or the Ganeti cluster balance. To compute the CPU load average we will use a number of values collected inside a time window. The collection process will be done by an independent thread (see Mode of Operation).
This report is a subset of the previous report (Node OS resources report) and they might eventually get merged, once reporting for the other fields (memory, filesystem, NICs) gets implemented too.
Specifically:
The category field of the report will be null.
The kind field will be 0 (Performance reporting collectors).
The data section will include:
The CPU load report function will get N values, collected by the CPU load collection function, and calculate the above averages. Please see the section Mode of Operation for more information on how the two functions of the data collector interact.
The queries to the monitoring agent will be HTTP GET requests on port 1815. The answer will be encoded in JSON format and will depend on the specific accessed resource.
If a request is sent to a non-existing resource, a 404 error will be returned by the HTTP server.
The following paragraphs will present the existing resources supported by the current protocol version, that is version 1.
The root resource, /, will return the list of the supported protocol version numbers.
Currently, this will include only version 1.
Not an actual resource per se, /1 is the root of all the resources of protocol version 1.
If requested through GET, the null JSON value will be returned.
The /1/list/collectors resource returns a list of tuples (kind, category, name) showing all the collectors available in the system.
The /1/report/all resource returns a list of the reports of all the data collectors, as a JSON list.
Status reporting collectors will provide their output in non-verbose format. The verbose format can be requested by adding the parameter verbose=1 to the request.
The /1/report/[category]/[collector_name] resource returns the report of the collector [collector_name] that belongs to the specified [category].
The category has to be written in lowercase.
If a collector does not belong to any category, default will have to be used as the value for [category].
Status reporting collectors will provide their output in non-verbose format. The verbose format can be requested by adding the parameter verbose=1 to the request.
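As an illustration (the host name, the collector name drbd and all field values are hypothetical), querying the root resource would return the supported protocol versions, and a single collector report could then be requested by category and name:

  GET http://node1.example.com:1815/
  [ 1 ]

  GET http://node1.example.com:1815/1/report/storage/drbd?verbose=1
  { "name" : "drbd",
    "version" : "1",
    "format_version" : 1,
    "timestamp" : 1351609526123854000,
    "category" : "storage",
    "kind" : 1,
    "data" : { "status" : { "code" : 0, "message" : "" },
               ...
             }
  }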
As for the instance status, Ganeti currently has only partial information about its instance disks: in particular, each node is unaware of the disk-to-instance mapping, which exists only on the master.
For this design doc we plan to fix this by changing all RPCs that create a backend storage or that put an already existing one in use, passing the relevant instance to the node. The node can then export this information to the status reporting tool.
While we haven’t implemented these RPC changes yet, we’ll use Confd to fetch this information in the data collectors.
The monitoring system will be equipped with a plugin system through which specific local information can be exported.
The plugin system is expected to be used by local installations to export any installation specific information that they want to be monitored, about either hardware or software on their systems.
The plugin system will be in the form of either scripts or binaries whose output will be inserted in the report.
Eventually support for other kinds of plugins might be added as well, such as plain text files which will be inserted into the report, or local unix or network sockets from which the information has to be read. This should allow most flexibility for implementing an efficient system, while being able to keep it as simple as possible.
In order to ease testing as well as to make it simple to reuse this subsystem it will be possible to run just the “data collectors” on each node without passing through the agent daemon.
If a data collector is run independently, it should print on stdout its report, according to the format corresponding to a single data collector report object, as described in the previous paragraphs.
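For example, a hypothetical diskstats collector run stand-alone would print to stdout a single report object of the same shape as the entries in the full report shown earlier (all values here are illustrative):

  { "name" : "diskstats",
    "version" : "1.0",
    "format_version" : 1,
    "timestamp" : 1351607182000000000,
    "category" : "storage",
    "kind" : 0,
    "data" : { ... }
  }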
In order to be able to report information fast the monitoring agent daemon will keep an in-memory or on-disk cache of the status, which will be returned when queries are made. The status system will then periodically check resources to make sure the status is up to date.
Different parts of the report will be queried at different speeds. These will depend on:
- how often they vary (or we expect them to vary)
- how fast they are to query
- how important their freshness is
Of course the last parameter is installation specific, and while we'll try to have defaults, it will be configurable. The first two, instead, we can use adaptively to query a certain resource faster or slower depending on their values.
When run as stand-alone binaries, the data collectors will not use any caching system, and will just fetch and return the data immediately.
Since some performance collectors have to operate on a number of values collected at previous times, we need a mechanism, independent of the data collectors, that will trigger the collection of those values and also store them, so that they are available for calculation by the data collectors.
To collect data periodically, a thread will be created by the monitoring agent which will run the collection function of every data collector that provides one. The values returned by the collection function of the data collector will be saved in an appropriate map, associating each value to the corresponding collector, using the collector’s name as the key of the map. This map will be stored in mond’s memory.
The collectors are divided into two categories:
For example: the collection function of the CPU load collector will collect a CPU load value and save it in the map mentioned above. The collection function will be called by the collector thread every t milliseconds. When the report function of the collector is called, it will process the last N values of the map and calculate the corresponding average.
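The following Haskell sketch illustrates this split between the collection thread and the report function. It is only an outline of the mechanism described above, not the actual mond code: the collector name, the functions collectCpuLoad, reportCpuLoad and readCpuLoad, and the one-second interval are all illustrative assumptions.

  -- Illustrative sketch: a shared map of collected samples, a collection
  -- function run periodically by the collector thread, and a report
  -- function averaging the last N values.
  module Main where

  import Control.Concurrent (forkIO, threadDelay)
  import Control.Concurrent.MVar (MVar, newMVar, modifyMVar_, readMVar)
  import Control.Monad (forever)
  import qualified Data.Map as Map

  -- collector name -> collected values, newest first
  type SampleMap = Map.Map String [Double]

  -- Placeholder for reading the actual CPU load (e.g. from /proc/stat).
  readCpuLoad :: IO Double
  readCpuLoad = return 0.42

  -- Collection function: called every t milliseconds by the collector
  -- thread; prepends the newly read value to this collector's list.
  collectCpuLoad :: MVar SampleMap -> IO ()
  collectCpuLoad samples = do
    value <- readCpuLoad
    modifyMVar_ samples $ return . Map.insertWith (++) "cpu-avg-load" [value]

  -- Report function: processes the last n collected values and returns
  -- their average.
  reportCpuLoad :: Int -> MVar SampleMap -> IO Double
  reportCpuLoad n samples = do
    vs <- take n . Map.findWithDefault [] "cpu-avg-load" <$> readMVar samples
    return $ if null vs then 0 else sum vs / fromIntegral (length vs)

  main :: IO ()
  main = do
    samples <- newMVar Map.empty
    _ <- forkIO . forever $ collectCpuLoad samples >> threadDelay 1000000
    threadDelay 3500000  -- let a few samples accumulate
    reportCpuLoad 3 samples >>= print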
The status daemon will be implemented as a standalone Haskell daemon. In the future it should be easy to merge multiple daemons into one with multiple entry points, should we find out it saves resources and doesn’t impact functionality.
The libekg library should be looked at for easily providing metrics in JSON format.
We will implement the agent system in this order:
As a future step it can be useful to "centralize" all this reporting data in a single place. This can, for example, be just the master node, or all the master candidates. We will evaluate doing this after the first node-local version has been developed and tested.
Another possible change is replacing the “read-only” RPCs with queries to the agent system, thus having only one way of collecting information from the nodes from a monitoring system and for Ganeti itself.
One extra feature we may need is a way to query for only sub-parts of the report (e.g. instance status only). This can be done by passing arguments to the HTTP GET, which will be defined when we get to this functionality.
Finally, the autorepair system (see its design document) can be expanded to use the monitoring agent system as a source of information to decide which repairs it can perform.