Package ganeti :: Package watcher

Package watcher

Tool to restart erroneously downed virtual machines.

This program and set of classes implement a watchdog to restart virtual machines in a Ganeti cluster that have crashed or been killed by a node reboot. Run from cron or similar.

Submodules

Classes
  NotMasterError
Exception raised when this host is not the master.
  Instance
Abstraction for a Virtual Machine instance.
  Node
Data container representing cluster node.
Functions

ShouldPause()
Check whether we should pause.

StartNodeDaemons()
Start all the daemons that should be running on all nodes.

RunWatcherHooks()
Run the watcher hooks.

_CleanupInstance(cl, notepad, inst, locks)

_CheckInstances(cl, notepad, instances, locks)
Make a pass over the list of instances, restarting downed ones.

_CheckDisks(cl, notepad, nodes, instances, started)
Check all nodes for restarted ones.

_CheckForOfflineNodes(nodes, instance)
Checks whether the given instance has any secondary node in offline status.

_VerifyDisks(cl, uuid, nodes, instances)
Run a per-group "gnt-cluster verify-disks".

IsRapiResponding(hostname)
Connects to RAPI port and does a simple test.
Returns: bool

IsWconfdResponding()
Probes an echo RPC to WConfD.

ParseOptions()
Parse the command line options.

_WriteInstanceStatus(filename, data)
Writes the per-group instance status file.

_UpdateInstanceStatus(filename, instances)
Writes an instance status file from Instance objects.

_ReadInstanceStatus(filename)
Reads an instance status file.
Returns: tuple; (None or number, list of lists containing instance name and status)

_MergeInstanceStatus(filename, pergroup_filename, groups)
Merges all per-group instance status files into a global one.

GetLuxiClient(try_restart)
Tries to connect to the luxi daemon.

_StartGroupChildren(cl, wait)
Starts a new instance of the watcher for every node group.

_ArchiveJobs(cl, age)
Archives old jobs.

_CheckMaster(cl)
Ensures current host is master node.

_GlobalWatcher(opts)
Main function for global watcher.

_GetGroupData(qcl, uuid)
Retrieves instances and nodes per node group.

_LoadKnownGroups()
Returns a list of all node groups known by ssconf.

_GroupWatcher(opts)
Main function for per-group watcher process.

Main()
Main function.
Variables
  MAXTRIES = 5
  BAD_STATES = compat.UniqueFrozenset([constants.INSTST_ERRORDOW...
  HELPLESS_STATES = compat.UniqueFrozenset([constants.INSTST_NOD...
  NOTICE = "NOTICE"
  ERROR = "ERROR"
  CHILD_PROCESS_DELAY = 1.0
Number of seconds to wait between starting child processes for node groups
  INSTANCE_STATUS_LOCK_TIMEOUT = 10.0
How many seconds to wait for instance status file lock

Imports: os, sys, time, logging, operator, errno, OptionParser, utils, wconfd, constants, compat, errors, opcodes, cli, rpcerr, rapi, netutils, qlang, ssconf, ht, pathutils, ganeti, UsesRapiClient, nodemaint, state


Function Details

_CheckForOfflineNodes(nodes, instance)

Checks whether the given instance has any secondary node in offline status.

Parameters:
  • instance - The instance object
Returns:
True if any secondary node is offline, False otherwise
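
The check described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the `Node` and `Instance` tuples are hypothetical stand-ins for the watcher's classes, and it assumes that `nodes` maps node names to objects with an `offline` flag and that an instance lists its secondary node names in `snodes`.

```python
from collections import namedtuple

# Hypothetical stand-ins for the watcher's Node and Instance classes.
Node = namedtuple("Node", ["name", "offline"])
Instance = namedtuple("Instance", ["name", "snodes"])


def check_for_offline_nodes(nodes, instance):
    """Return True if any secondary node of `instance` is offline.

    Assumes `nodes` maps node name -> Node and `instance.snodes`
    holds the names of the instance's secondary nodes.
    """
    return any(nodes[name].offline for name in instance.snodes)


nodes = {
    "node1": Node("node1", offline=False),
    "node2": Node("node2", offline=True),
}
print(check_for_offline_nodes(nodes, Instance("web1", ["node2"])))  # True
print(check_for_offline_nodes(nodes, Instance("db1", ["node1"])))   # False
```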

IsRapiResponding(hostname)

Connects to RAPI port and does a simple test.

Connects to RAPI port of hostname and does a simple test. At this time, the test is GetVersion.

If RAPI responds with error code "401 Unauthorized", the test is successful, because the aim of this function is to assess whether RAPI is responding, not if it is accessible.

Parameters:
  • hostname (string) - hostname of the node to connect to.
Returns: bool
Whether RAPI is working properly
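
The "401 counts as alive" rule can be illustrated with a small sketch. Everything here is an assumption made for illustration: `ApiError` is a hypothetical exception carrying an HTTP status code, `get_version` stands in for whatever call performs the GetVersion request, and comparing the reply against an expected version is one plausible reading of "working properly".

```python
class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, msg, code=None):
        super().__init__(msg)
        self.code = code


def is_rapi_responding(get_version, expected_version):
    """Return True if the remote API answers the version request.

    `get_version` is a caller-supplied callable (an assumption of this
    sketch) that performs the request and returns the version.
    """
    try:
        version = get_version()
    except ApiError as err:
        # A 401 reply means the daemon answered; it is responding even
        # though the caller is not authorized.
        return err.code == 401
    except OSError:
        # Connection refused, timeout and similar: not responding.
        return False
    return version == expected_version


def unauthorized():
    raise ApiError("Unauthorized", code=401)


print(is_rapi_responding(unauthorized, expected_version=2))  # True
```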

ParseOptions()

Parse the command line options.

Returns:
(options, args) as from OptionParser.parse_args()
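
As a reminder of the shape of that return value, here is a minimal `optparse` sketch. The single `-d/--debug` flag is illustrative only; this page does not list the watcher's real options.

```python
from optparse import OptionParser


def parse_options(argv):
    """Parse `argv` and return (options, args) as OptionParser does."""
    parser = OptionParser(usage="%prog [-d]")
    # Illustrative flag; not taken from the documented program.
    parser.add_option("-d", "--debug", dest="debug", action="store_true",
                      default=False, help="Enable debug output")
    return parser.parse_args(argv)


options, args = parse_options(["--debug", "extra"])
print(options.debug, args)
```

`options` holds the parsed flags as attributes, while `args` holds the leftover positional arguments.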

_WriteInstanceStatus(filename, data)

Writes the per-group instance status file.

The entries are sorted.

Parameters:
  • filename (string) - Path to instance status file
  • data (list of tuple; (instance name as string, status as string)) - Instance name and status
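
A minimal sketch of writing sorted entries, under the assumption that the file holds one "name status" pair per line; the on-disk format is not specified on this page, and the real function may also handle locking and atomic replacement.

```python
import os
import tempfile


def write_instance_status(filename, data):
    """Write (instance name, status) tuples to `filename`, sorted.

    The one-pair-per-line layout is an assumption of this sketch.
    """
    lines = ["%s %s\n" % (name, status) for (name, status) in sorted(data)]
    with open(filename, "w") as fh:
        fh.writelines(lines)


path = os.path.join(tempfile.mkdtemp(), "instance-status")
write_instance_status(path, [("web2", "running"), ("db1", "ERROR_down")])
with open(path) as fh:
    print(fh.read(), end="")
```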

_UpdateInstanceStatus(filename, instances)

Writes an instance status file from Instance objects.

Parameters:
  • filename (string) - Path to status file
  • instances (list of Instance)

_ReadInstanceStatus(filename)

Reads an instance status file.

Parameters:
  • filename (string) - Path to status file
Returns: tuple; (None or number, list of lists containing instance name and status)
File's mtime and instance status contained in the file; mtime is None if file can't be read
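
The return contract can be sketched like this, assuming the same one-pair-per-line file layout. Returning `None` for the status list when the file is unreadable is an assumption of this sketch; the page only states that mtime is None in that case.

```python
import os


def read_instance_status(filename):
    """Return (mtime, [[name, status], ...]) for a status file.

    Returns (None, None) if the file cannot be read (the second None
    is an assumption; only the mtime behaviour is documented).
    """
    try:
        with open(filename) as fh:
            # Take the mtime from the open handle so it matches the content.
            mtime = os.fstat(fh.fileno()).st_mtime
            content = fh.read()
    except OSError:
        return (None, None)
    return (mtime, [line.split(None, 1) for line in content.splitlines()])


print(read_instance_status("/nonexistent/instance-status"))
```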

_MergeInstanceStatus(filename, pergroup_filename, groups)

Merges all per-group instance status files into a global one.

Parameters:
  • filename (string) - Path to global instance status file
  • pergroup_filename (string) - Path to per-group status files, must contain "%s" to be replaced with group UUID
  • groups (sequence) - UUIDs of known groups
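
The merge step might look roughly like the sketch below. It assumes a reader that returns `(mtime, entries)` per file, skips unreadable files, and resolves conflicts by taking the status from the most recently modified file; that tie-breaking rule is an assumption, not something this page documents. `read_status` is injected so the example runs without touching the filesystem.

```python
def merge_instance_status(read_status, pergroup_filename, groups):
    """Merge per-group status data into one sorted list.

    `pergroup_filename` must contain "%s", which is replaced with each
    group UUID. `read_status` is a hypothetical callable returning
    (mtime, entries), with mtime None for unreadable files.
    """
    seen = {}
    for uuid in groups:
        mtime, entries = read_status(pergroup_filename % uuid)
        if mtime is None:
            continue
        for name, status in entries:
            seen.setdefault(name, []).append((mtime, status))
    # For each instance keep the status with the newest mtime.
    return sorted((name, max(updates)[1]) for name, updates in seen.items())


files = {
    "/tmp/status.g1": (100.0, [["inst1", "running"]]),
    "/tmp/status.g2": (200.0, [["inst1", "ERROR_down"], ["inst2", "running"]]),
}
merged = merge_instance_status(lambda path: files.get(path, (None, None)),
                               "/tmp/status.%s", ["g1", "g2", "g3"])
print(merged)
```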

GetLuxiClient(try_restart)

Tries to connect to the luxi daemon.

Parameters:
  • try_restart (bool) - Whether to attempt to restart the master daemon
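
One way to read the `try_restart` flag is as a single retry after a restart attempt. The sketch below shows that pattern only; `connect` and `restart_daemon` are hypothetical callables, and the exception type is a placeholder for whatever error the real client raises.

```python
def get_client(connect, restart_daemon, try_restart):
    """Connect to a daemon, optionally restarting it once on failure.

    `connect` and `restart_daemon` are caller-supplied callables
    (assumptions of this sketch).
    """
    try:
        return connect()
    except ConnectionError:
        if not try_restart:
            raise
    # First attempt failed and a restart is allowed: restart, retry once.
    restart_daemon()
    return connect()


state = {"up": False}


def connect():
    if not state["up"]:
        raise ConnectionError("daemon not running")
    return "client"


def restart_daemon():
    state["up"] = True


print(get_client(connect, restart_daemon, try_restart=True))  # client
```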

_GlobalWatcher(opts)

Main function for global watcher.

At the end child processes are spawned for every node group.

Decorators:
  • @UsesRapiClient
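
The spawning of one child per node group, paced by `CHILD_PROCESS_DELAY`, can be sketched as below. The `spawn` callable is hypothetical and stands in for whatever actually launches the per-group watcher process; `sleep` is injectable so the example runs instantly.

```python
import time

CHILD_PROCESS_DELAY = 1.0  # value listed under Variables on this page


def start_group_children(group_uuids, spawn,
                         delay=CHILD_PROCESS_DELAY, sleep=time.sleep):
    """Start one child per node group, pausing between starts.

    `spawn` is a caller-supplied callable (an assumption of this
    sketch) that launches a child for one group UUID.
    """
    children = []
    for idx, uuid in enumerate(group_uuids):
        if idx > 0:
            # Stagger the children so they do not all start at once.
            sleep(delay)
        children.append(spawn(uuid))
    return children


pauses = []
started = start_group_children(["uuid-a", "uuid-b", "uuid-c"],
                               spawn=lambda uuid: "child-for-" + uuid,
                               sleep=pauses.append)
print(started, pauses)
```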

Variables Details

BAD_STATES

Value:
compat.UniqueFrozenset([constants.INSTST_ERRORDOWN,])

HELPLESS_STATES

Value:
compat.UniqueFrozenset([constants.INSTST_NODEDOWN, constants.INSTST_NODEOFFLINE,])
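
To show how two such state sets and `MAXTRIES` could drive a restart decision, here is a hedged sketch. The status strings are assumed values for the named constants, and the decision logic is an illustration of the idea rather than the watcher's actual behaviour.

```python
MAXTRIES = 5

# Assumed string values for the constants named above.
BAD_STATES = frozenset(["ERROR_down"])
HELPLESS_STATES = frozenset(["ERROR_nodedown", "ERROR_nodeoffline"])


def decide_action(status, restart_attempts):
    """Return a hypothetical action name for an instance status."""
    if status in BAD_STATES:
        # Instance is down on a healthy node: restart unless we gave up.
        if restart_attempts >= MAXTRIES:
            return "give-up"
        return "restart"
    if status in HELPLESS_STATES:
        # The node itself is down or offline; restarting cannot help.
        return "skip"
    return "ok"


print(decide_action("ERROR_down", 0))      # restart
print(decide_action("ERROR_down", 5))      # give-up
print(decide_action("ERROR_nodedown", 0))  # skip
```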