We have encountered situations where a node no longer responded to SSH connection attempts, or where SSH had become unavailable for other reasons. Quite often the node daemon is still available, even when there is little free memory. This is because the node daemon is locked into main memory using mlock(2).
Since the node daemon does not allow the execution of arbitrary commands, quite often the only solution left was either to attempt a powercycle request via said node daemon or to physically reset the node.
The goal of this design is to allow the execution of non-arbitrary (restricted) commands via RPC requests. Since this can be dangerous if the cluster certificate (server.pem) is leaked, some precautions need to be taken:
There shall be no way to list available commands or to retrieve an executable’s contents. The result from a request to execute a specific command will either be its output and exit code, or a generic error message. Only the receiving node’s log files shall contain information as to why executing the command failed.
To slow down dictionary attacks on command names in case an attacker manages to obtain a copy of server.pem, a system-wide, file-based lock is acquired before verifying the command name and its executable. If a command cannot be executed for some reason, the lock is only released after a delay of several seconds, after which the generic error message will be returned to the caller.
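The two precautions above can be illustrated with a minimal Python sketch. It is not the node daemon's actual code: the function name, the error text, the name pattern, the directory and lock path parameters, and the five-second default delay are all assumptions made for illustration. Holding the lock while the command runs is likewise a choice of this sketch, not something the design prescribes.

```python
import fcntl
import logging
import os
import re
import subprocess
import time

# Hypothetical values, chosen for this sketch only.
GENERIC_ERROR = "Executing command failed"
_VALID_NAME = re.compile(r"[-_a-zA-Z0-9]+")


def run_restricted_command(name, cmd_dir, lock_path, failure_delay=5.0):
    """Return (True, exit_code, output) or (False, None, GENERIC_ERROR)."""
    lock_fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # The system-wide lock is taken before the name is even looked at,
        # so concurrent guesses are serialized.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        try:
            if not _VALID_NAME.fullmatch(name):
                raise ValueError("invalid command name")
            path = os.path.join(cmd_dir, name)
            if not (os.path.isfile(path) and os.access(path, os.X_OK)):
                raise ValueError("no such executable")
            proc = subprocess.run([path], stdout=subprocess.PIPE,
                                  stderr=subprocess.STDOUT)
        except (ValueError, OSError) as err:
            # The real reason goes to the local log only.
            logging.error("Restricted command %r failed: %s", name, err)
            # Keep holding the lock while waiting, which throttles
            # dictionary attacks on command names.
            time.sleep(failure_delay)
            return (False, None, GENERIC_ERROR)
        return (True, proc.returncode, proc.stdout.decode(errors="replace"))
    finally:
        # Closing the descriptor releases the lock.
        os.close(lock_fd)
```

A caller cannot tell an unknown name from a malformed one or from an executable that failed to start: all three yield the same generic message, and only after the delay has passed.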
At first, restricted commands will not be made available through the remote API, though that could be done at a later point (with a separate password).
On the command line, a new sub-command will be added to the gnt-node script.