Package watcher

Tool to restart erroneously downed virtual machines.

This program and set of classes implement a watchdog to restart virtual machines in a Ganeti cluster that have crashed or been killed by a node reboot. Run from cron or similar.

Classes
  NotMasterError
Exception raised when this host is not the master.
  Instance
Abstraction for a Virtual Machine instance.
  Node
Data container representing a cluster node.
Functions
 
ShouldPause()
Check whether we should pause.
 
StartNodeDaemons()
Start all the daemons that should be running on all nodes.
 
RunWatcherHooks()
Run the watcher hooks.
 
_CheckInstances(cl, notepad, instances)
Make a pass over the list of instances, restarting downed ones.
 
_CheckDisks(cl, notepad, nodes, instances, started)
Check all nodes for ones that have restarted.
 
_CheckForOfflineNodes(nodes, instance)
Checks whether the given instance has any secondary node in offline status.
 
_VerifyDisks(cl, uuid, nodes, instances)
Run a per-group "gnt-cluster verify-disks".
bool
IsRapiResponding(hostname)
Connects to RAPI port and does a simple test.
 
ParseOptions()
Parse the command line options.
 
_WriteInstanceStatus(filename, data)
Writes the per-group instance status file.
 
_UpdateInstanceStatus(filename, instances)
Writes an instance status file from Instance objects.
tuple; (None or number, list of lists containing instance name and status)
_ReadInstanceStatus(filename)
Reads an instance status file.
 
_MergeInstanceStatus(filename, pergroup_filename, groups)
Merges all per-group instance status files into a global one.
 
GetLuxiClient(try_restart)
Tries to connect to the master daemon.
 
_StartGroupChildren(cl, wait)
Starts a new instance of the watcher for every node group.
 
_ArchiveJobs(cl, age)
Archives old jobs.
 
_CheckMaster(cl)
Ensures the current host is the master node.
 
_GlobalWatcher(opts)
Main function for global watcher.
 
_GetGroupData(cl, uuid)
Retrieves instances and nodes per node group.
 
_LoadKnownGroups()
Returns a list of all node groups known by ssconf.
 
_GroupWatcher(opts)
Main function for per-group watcher process.
 
Main()
Main function.
Variables
  MAXTRIES = 5
  BAD_STATES = frozenset([constants.INSTST_ERRORDOWN,])
  HELPLESS_STATES = frozenset([constants.INSTST_NODEDOWN, consta...
  NOTICE = "NOTICE"
  ERROR = "ERROR"
  CHILD_PROCESS_DELAY = 1.0
Number of seconds to wait between starting child processes for node groups
  INSTANCE_STATUS_LOCK_TIMEOUT = 10.0
How many seconds to wait for instance status file lock

Imports: os, sys, time, logging, operator, errno, OptionParser, utils, constants, compat, errors, opcodes, cli, luxi, rapi, netutils, qlang, objects, ssconf, ht, ganeti, UsesRapiClient, nodemaint, state


Function Details

_CheckForOfflineNodes(nodes, instance)

Checks whether the given instance has any secondary node in offline status.

Parameters:
  • instance - The instance object
Returns:
True if any of the secondary nodes is offline, False otherwise
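
The check described above can be sketched roughly as follows. This is a minimal illustration, not the watcher's code: FakeNode, FakeInstance, their attributes (name, offline, secondaries) and the function name has_offline_secondary are assumptions made for this example, as is the choice to treat unknown node names as online.

```python
from collections import namedtuple

# Hypothetical stand-ins; the real Node and Instance classes carry more state.
FakeNode = namedtuple("FakeNode", ["name", "offline"])
FakeInstance = namedtuple("FakeInstance", ["name", "secondaries"])


def has_offline_secondary(nodes, instance):
    """Return True if any secondary node of the instance is offline.

    nodes maps node name -> FakeNode. Names missing from the mapping are
    treated as online (an assumption of this sketch).
    """
    for node_name in instance.secondaries:
        node = nodes.get(node_name)
        if node is not None and node.offline:
            return True
    return False


nodes = {
    "node1": FakeNode("node1", offline=False),
    "node2": FakeNode("node2", offline=True),
}
inst_ok = FakeInstance("inst1", secondaries=["node1"])
inst_bad = FakeInstance("inst2", secondaries=["node1", "node2"])
```

With these sample objects, inst_ok has only online secondaries, while inst_bad has one secondary (node2) marked offline.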

IsRapiResponding(hostname)

Connects to RAPI port and does a simple test.

Connects to the RAPI port of hostname and performs a simple test. At this time, the test is a GetVersion call.

Parameters:
  • hostname (string) - hostname of the node to connect to.
Returns: bool
Whether RAPI is working properly
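
A health check of this kind might look like the sketch below. It does not use the Ganeti RAPI client; instead it takes a hypothetical client_factory callable that returns any object with a GetVersion() method. Comparing the result against an expected_version argument, and treating every exception as "not responding", are assumptions of this example.

```python
def is_rapi_responding(hostname, client_factory, expected_version):
    """Return True if the endpoint on hostname answers a version query.

    client_factory is a hypothetical callable returning an object with a
    GetVersion() method; it stands in for a real RAPI client.
    """
    try:
        client = client_factory(hostname)
        version = client.GetVersion()
    except Exception:  # broad on purpose: any failure means "not working"
        return False
    return version == expected_version


class StubClient:
    """Stand-in client used only to exercise the sketch."""

    def __init__(self, hostname):
        self.hostname = hostname

    def GetVersion(self):
        if self.hostname == "down.example.com":
            raise IOError("connection refused")
        return 2
```

Injecting the client keeps the check testable without a running cluster; a caller would pass a factory that builds a real client instead of StubClient.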

ParseOptions()

Parse the command line options.

Returns:
(options, args) as from OptionParser.parse_args()
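
For readers unfamiliar with the (options, args) return convention of OptionParser.parse_args(), here is a small sketch. The option names, defaults, and the rule that positional arguments are rejected are illustrative assumptions, not a listing of the watcher's actual command line.

```python
from optparse import OptionParser


def parse_options(argv):
    """Parse watcher-style options; the option set here is illustrative."""
    parser = OptionParser(usage="%prog [options]")
    parser.add_option("-d", "--debug", dest="debug", action="store_true",
                      default=False, help="Enable debug output")
    parser.add_option("--job-age", dest="job_age", type="int", default=3600,
                      help="Hypothetical age threshold in seconds")
    # parse_args returns a values object plus the leftover positional args.
    options, args = parser.parse_args(argv)
    if args:
        # parser.error prints the usage text and exits with status 2.
        parser.error("No arguments expected")
    return options, args


example_options, example_args = parse_options(["--debug", "--job-age", "60"])
```

Here example_options.debug and example_options.job_age hold the parsed values, and example_args is the (empty) list of leftover positional arguments.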

_WriteInstanceStatus(filename, data)

Writes the per-group instance status file.

The entries are sorted.

Parameters:
  • filename (string) - Path to instance status file
  • data (list of tuple; (instance name as string, status as string)) - Instance name and status
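
As a sketch of what "the entries are sorted" could mean in practice, the example below formats (name, status) pairs into text and writes them out. The on-disk layout (one space-separated pair per line) and the helper names are assumptions of this example; the page itself does not specify the file format.

```python
def format_instance_status(data):
    """Render (name, status) pairs as sorted "name status" lines.

    The one-pair-per-line, space-separated layout is assumed.
    """
    return "".join("%s %s\n" % (name, status) for name, status in sorted(data))


def write_instance_status(filename, data):
    """Write the formatted, sorted entries to filename."""
    with open(filename, "w") as fh:
        fh.write(format_instance_status(data))


sample_text = format_instance_status([("inst2", "running"),
                                      ("inst1", "ERROR_down")])
```

Sorting before writing keeps the file stable between runs, which makes diffs and comparisons of successive status files meaningful.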

_UpdateInstanceStatus(filename, instances)

Writes an instance status file from Instance objects.

Parameters:
  • filename (string) - Path to status file
  • instances (list of Instance)

_ReadInstanceStatus(filename)

Reads an instance status file.

Parameters:
  • filename (string) - Path to status file
Returns: tuple; (None or number, list of lists containing instance name and status)
File's mtime and instance status contained in the file; mtime is None if file can't be read
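
The return contract above (mtime or None, plus a list of [name, status] lists) can be illustrated with the following sketch. It assumes a file layout of one whitespace-separated "name status" pair per line, which this page does not confirm; the function name is likewise hypothetical.

```python
import os
import tempfile


def read_instance_status(filename):
    """Return (mtime, [[name, status], ...]); mtime is None if unreadable."""
    try:
        with open(filename) as fh:
            # Take the mtime from the open handle so it matches the content.
            mtime = os.fstat(fh.fileno()).st_mtime
            content = fh.read()
    except (IOError, OSError):
        return (None, [])
    # Assumed layout: one whitespace-separated "name status" pair per line.
    entries = [line.split() for line in content.splitlines() if line.strip()]
    return (mtime, entries)


# Small demonstration using a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".status", delete=False) as tmp:
    tmp.write("inst1 running\ninst2 ERROR_down\n")
demo_mtime, demo_entries = read_instance_status(tmp.name)
os.unlink(tmp.name)
missing_result = read_instance_status(tmp.name)  # file no longer exists
```

Returning None for the mtime lets a caller tell "no usable file" apart from "file exists but lists no instances".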

_MergeInstanceStatus(filename, pergroup_filename, groups)

Merges all per-group instance status files into a global one.

Parameters:
  • filename (string) - Path to global instance status file
  • pergroup_filename (string) - Path to per-group status files, must contain "%s" to be replaced with group UUID
  • groups (sequence) - UUIDs of known groups
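
One plausible way to merge per-group files is sketched below. It expands the "%s" placeholder with each group UUID, reads each file through a caller-supplied read_status function (hypothetical, returning the same (mtime, entries) tuple as described for _ReadInstanceStatus), and keeps the status from the most recently modified file when an instance appears more than once. That "newest file wins" rule is an assumption of this sketch.

```python
def merge_instance_status(pergroup_filename, groups, read_status):
    """Merge per-group status entries into one sorted list of pairs."""
    newest = {}  # instance name -> (mtime, status)
    for group_uuid in groups:
        # pergroup_filename must contain "%s", replaced by the group UUID.
        mtime, entries = read_status(pergroup_filename % group_uuid)
        if mtime is None:
            continue  # unreadable or missing per-group file
        for name, status in entries:
            if name not in newest or mtime > newest[name][0]:
                newest[name] = (mtime, status)
    return sorted((name, status) for name, (_, status) in newest.items())


# In-memory stand-in for the per-group files.
fake_files = {
    "/tmp/status.g1": (100.0, [["inst1", "running"], ["inst2", "ERROR_down"]]),
    "/tmp/status.g2": (200.0, [["inst2", "running"]]),
}


def fake_read(path):
    return fake_files.get(path, (None, []))


merged = merge_instance_status("/tmp/status.%s", ["g1", "g2", "g3"], fake_read)
```

In this example group g3 has no file and is skipped, and inst2 takes its status from the newer g2 file.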

GetLuxiClient(try_restart)

Tries to connect to the master daemon.

Parameters:
  • try_restart (bool) - Whether to attempt to restart the master daemon

_GlobalWatcher(opts)

Main function for global watcher.

At the end child processes are spawned for every node group.

Decorators:
  • @UsesRapiClient

Variables Details

HELPLESS_STATES

Value:
frozenset([constants.INSTST_NODEDOWN, constants.INSTST_NODEOFFLINE,])
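
To show how frozensets such as BAD_STATES and HELPLESS_STATES lend themselves to membership tests, here is a short sketch. The string values below are placeholders standing in for the ganeti.constants entries, and the classify function with its "restart" / "skip" / "ok" outcomes is a hypothetical reading of the set names, not behaviour documented on this page.

```python
# Placeholder values; the real ones are defined in ganeti.constants.
INSTST_ERRORDOWN = "ERROR_down"
INSTST_NODEDOWN = "ERROR_nodedown"
INSTST_NODEOFFLINE = "ERROR_nodeoffline"
INSTST_RUNNING = "running"

BAD_STATES = frozenset([INSTST_ERRORDOWN])
HELPLESS_STATES = frozenset([INSTST_NODEDOWN, INSTST_NODEOFFLINE])


def classify(status):
    """Map an instance status to a hypothetical watcher action."""
    if status in BAD_STATES:
        return "restart"  # instance is down but its node is reachable
    if status in HELPLESS_STATES:
        return "skip"     # node is down or offline; nothing to do locally
    return "ok"
```

Using frozenset makes the groups immutable and gives constant-time membership checks.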