package documentation

Tool to restart erroneously downed virtual machines.

This program and set of classes implement a watchdog to restart virtual machines in a Ganeti cluster that have crashed or been killed by a node reboot. Run from cron or similar.

Module nodemaint Module doing node maintenance for Ganeti watcher.
Module state Module keeping state for Ganeti watcher.

From __init__.py:

Class Instance Abstraction for a Virtual Machine instance.
Class Node Data container representing cluster node.
Exception NotMasterError Exception raised when this host is not the master.
Function GetLuxiClient Tries to connect to the luxi daemon.
Function IsRapiResponding Connects to RAPI port and does a simple test.
Function IsWconfdResponding Probes an echo RPC to WConfD.
Function Main Main function.
Function ParseOptions Parse the command line options.
Function RunWatcherHooks Run the watcher hooks.
Function ShouldPause Check whether we should pause.
Function StartNodeDaemons Start all the daemons that should be running on all nodes.
Constant BAD_STATES Undocumented
Constant CHILD_PROCESS_DELAY Undocumented
Constant ERROR Undocumented
Constant HELPLESS_STATES Undocumented
Constant INSTANCE_STATUS_LOCK_TIMEOUT Undocumented
Constant MAXTRIES Undocumented
Constant NOTICE Undocumented
Function _ArchiveJobs Archives old jobs.
Function _CheckDisks Check all nodes for restarted ones.
Function _CheckForOfflineNodes Checks whether the given instance has any secondary node in offline status.
Function _CheckInstances Make a pass over the list of instances, restarting downed ones.
Function _CheckMaster Ensures current host is master node.
Function _CleanupInstance Undocumented
Function _GetGroupData Retrieves instances and nodes per node group.
Function _GetPendingVerifyDisks Checks if there are any currently running or pending group verify jobs and if so, returns their id.
Function _GlobalWatcher Main function for global watcher.
Function _GroupWatcher Main function for per-group watcher process.
Function _LoadKnownGroups Returns a list of all node groups known by ssconf.
Function _MergeInstanceStatus Merges all per-group instance status files into a global one.
Function _ReadInstanceStatus Reads an instance status file.
Function _StartGroupChildren Starts a new instance of the watcher for every node group.
Function _UpdateInstanceStatus Writes an instance status file from Instance objects.
Function _VerifyDisks Run a per-group "gnt-cluster verify-disks".
Function _WriteInstanceStatus Writes the per-group instance status file.
MAXTRIES: int = 5

Undocumented

NOTICE: str = 'NOTICE'

Undocumented

ERROR: str = 'ERROR'

Undocumented

CHILD_PROCESS_DELAY: float = 1.0

Undocumented

INSTANCE_STATUS_LOCK_TIMEOUT: float = 10.0

Undocumented

def ShouldPause():

Check whether we should pause.

def StartNodeDaemons():

Start all the daemons that should be running on all nodes.

def RunWatcherHooks():

Run the watcher hooks.

def _CleanupInstance(cl, notepad, inst, locks):

Undocumented

def _CheckInstances(cl, notepad, instances, locks):

Make a pass over the list of instances, restarting downed ones.

def _CheckDisks(cl, notepad, nodes, instances, started):

Check all nodes for restarted ones.

def _CheckForOfflineNodes(nodes, instance):

Checks whether the given instance has any secondary node in offline status.

Parameters
nodes: Undocumented
instance: The instance object
Returns
True if any secondary node is offline, False otherwise
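
The check described above can be sketched as follows. This is only an illustration: it assumes `nodes` is a dict keyed by node name and that the objects expose `offline` and `snodes` attributes. `SketchNode`, `SketchInstance` and `has_offline_secondary` are hypothetical stand-ins, not the module's own definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchNode:
    """Hypothetical stand-in for a cluster node record."""
    name: str
    offline: bool = False


@dataclass
class SketchInstance:
    """Hypothetical stand-in for an instance with secondary nodes."""
    name: str
    snodes: List[str] = field(default_factory=list)


def has_offline_secondary(nodes: Dict[str, SketchNode],
                          instance: SketchInstance) -> bool:
    """Return True if any secondary node of the instance is offline."""
    # Assumption: secondaries that are missing from `nodes` are ignored.
    return any(nodes[name].offline
               for name in instance.snodes if name in nodes)
```

An instance without secondary nodes yields False in this sketch, since there is nothing to be offline.
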
def _GetPendingVerifyDisks(cl, uuid):

Checks if there are any currently running or pending group verify jobs and if so, returns their id.

def _VerifyDisks(cl, uuid, nodes, instances, is_strict):

Run a per-group "gnt-cluster verify-disks".

def IsRapiResponding(hostname):

Connects to RAPI port and does a simple test.

Connects to RAPI port of hostname and does a simple test. At this time, the test is GetVersion.

If RAPI responds with error code "401 Unauthorized", the test is still considered successful, because the aim of this function is to assess whether RAPI is responding, not whether it is accessible.

Parameters
hostname (string): hostname of the node to connect to.
Returns
bool: Whether RAPI is working properly
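
The "401 counts as success" rule can be illustrated with a small sketch. It does not use the real RAPI client: `get_version` is a hypothetical callable standing in for the GetVersion request, and `SketchRapiError` is a hypothetical exception carrying an HTTP status code. Treating any successful reply as "responding" is also an assumption.

```python
class SketchRapiError(Exception):
    """Hypothetical error type carrying the HTTP status code of a reply."""

    def __init__(self, code):
        super().__init__("RAPI request failed with HTTP %s" % code)
        self.code = code


def rapi_is_responding(get_version):
    """Return True if the daemon answered, even if it refused access."""
    try:
        get_version()
    except SketchRapiError as err:
        # 401 means the daemon is up and merely denied the request.
        return err.code == 401
    except OSError:
        # Connection refused, timeout, and similar: nothing is answering.
        return False
    return True
```

Passing the request in as a callable keeps the decision logic separate from the transport, so it can be exercised without a running daemon.
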
def IsWconfdResponding():

Probes an echo RPC to WConfD.

def ParseOptions():

Parse the command line options.

Returns
(options, args) as from OptionParser.parse_args()
def _WriteInstanceStatus(filename, data):

Writes the per-group instance status file.

The entries are sorted.

Parameters
filename (string): Path to instance status file
data (list of tuple; (instance name as string, status as string)): Instance name and status
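
A minimal sketch of writing sorted entries, assuming a one-entry-per-line format of `<instance name> <status>`. The on-disk format is not specified here, so that format, the function name `write_instance_status` and the sample instance names and statuses are all assumptions.

```python
import os
import tempfile


def write_instance_status(filename, data):
    """Write (name, status) pairs to filename, one per line, sorted."""
    # Assumed line format: "<instance name> <status>\n".
    lines = ["%s %s\n" % (name, status) for (name, status) in sorted(data)]
    with open(filename, "w") as handle:
        handle.writelines(lines)


# Usage: entries come out sorted regardless of input order.
demo_path = os.path.join(tempfile.mkdtemp(), "instance-status")
write_instance_status(demo_path, [("web2", "running"), ("db1", "ERROR_down")])
with open(demo_path) as handle:
    demo_content = handle.read()
```
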
def _UpdateInstanceStatus(filename, instances):

Writes an instance status file from Instance objects.

Parameters
filename (string): Path to status file
instances (list of Instance): Undocumented
def _ReadInstanceStatus(filename):

Reads an instance status file.

Parameters
filename (string): Path to status file
Returns
tuple; (None or number, list of lists containing instance name and status): File's mtime and instance status contained in the file; mtime is None if the file can't be read
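
The return shape can be sketched as below, again assuming a `<instance name> <status>` line format. The function name `read_instance_status` is hypothetical, and returning an empty list alongside a `None` mtime is an assumption, since only the mtime behaviour is documented.

```python
import os
import tempfile


def read_instance_status(filename):
    """Return (mtime, [[name, status], ...]); mtime is None if unreadable."""
    try:
        with open(filename) as handle:
            # Take the mtime from the open handle so it matches the content.
            mtime = os.fstat(handle.fileno()).st_mtime
            content = handle.read()
    except OSError:
        return (None, [])
    entries = [line.split(None, 1)
               for line in content.splitlines() if line.strip()]
    return (mtime, entries)


# Usage: a small file with two entries (sample names are made up).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "instance-status")
with open(demo_path, "w") as handle:
    handle.write("db1 ERROR_down\nweb2 running\n")
```
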
def _MergeInstanceStatus(filename, pergroup_filename, groups):

Merges all per-group instance status files into a global one.

Parameters
filename (string): Path to global instance status file
pergroup_filename (string): Path to per-group status files, must contain "%s" to be replaced with group UUID
groups (sequence): UUIDs of known groups
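
The role of the "%s" placeholder and the merge step can be sketched as follows. The conflict rule used here (when an instance appears in several per-group files, the entry from the most recently modified file wins) is an assumption for illustration, as is the `read_status` callable, which is expected to return `(mtime, entries)` with `mtime` set to None for unreadable files.

```python
def merge_instance_status(pergroup_filename, groups, read_status):
    """Merge per-group status entries into one sorted list of pairs."""
    latest = {}  # instance name -> (mtime of source file, status)
    for uuid in groups:
        # The "%s" placeholder is replaced with the group UUID.
        mtime, entries = read_status(pergroup_filename % uuid)
        if mtime is None:
            continue  # unreadable or missing per-group file
        for name, status in entries:
            # Assumed rule: the newest file wins on conflicts.
            if name not in latest or mtime > latest[name][0]:
                latest[name] = (mtime, status)
    return sorted((name, status) for name, (_, status) in latest.items())


# Usage with an in-memory stand-in for the per-group files.
fake_files = {
    "/tmp/status-g1": (10.0, [["i1", "running"], ["i2", "ERROR_down"]]),
    "/tmp/status-g2": (20.0, [["i2", "running"]]),
}
merged = merge_instance_status(
    "/tmp/status-%s", ["g1", "g2", "g3"],
    lambda path: fake_files.get(path, (None, [])))
```

The merged list could then be handed to whatever routine writes the global status file.
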
def GetLuxiClient(try_restart):

Tries to connect to the luxi daemon.

Parameters
try_restart (bool): Whether to attempt to restart the master daemon
def _StartGroupChildren(cl, wait):

Starts a new instance of the watcher for every node group.

def _ArchiveJobs(cl, age):

Archives old jobs.

def _CheckMaster(cl):

Ensures current host is master node.

@UsesRapiClient
def _GlobalWatcher(opts):

Main function for global watcher.

At the end child processes are spawned for every node group.

def _GetGroupData(qcl, uuid):

Retrieves instances and nodes per node group.

def _LoadKnownGroups():

Returns a list of all node groups known by ssconf.

def _GroupWatcher(opts):

Main function for per-group watcher process.

def Main():

Main function.