Simple Parallel Exec, version #7 ( #6 < #8 > ... #24)

Created by georg. Last edited by georg, 4 years and one day ago.. Viewed 610 times. #7

Goal

Provide a very easy way to run one program with different settings on a bunch of computers in parallel and collect the results. Simple configuration and wide applicability is the aim. Fault tolerance in respect to network and adminstration errors.

Why not use something like MPI?

Well, first of all the program I want to run are not witten in C.
I could write a MPI client that executes my program with the parameters, but then I can just use a shell script.
I need a more flexibel way to grasp the results. For example some output goes to a file. A shell or perl script seams to be more convinient to do a flexible conversion.

Terminology:

Master: is the Computer where the server runs
Slave: is one of many Computers that do the work
Server: program that coordinates the process
Client: program that runs on slaves
Worker: program that does the computation (any)
Task: a set of parameters/ settings for the Worker

Specification

Cluster

One Master
Many Slaves
primary Linux machines, however the design is platform independent
any TCP/IP network connection (can be non permanent, but should not :-))
SSH (public key authentication) on Slave
SCP sequrity copy (public key authentication) on Slave
HTTP (any port open on Master, prefered #80)

Features

one master computer with server program (HTTP server). SSH and SCP is needed to get the client program to the slaves and start it.
list of slave computers (names or IPs). Every slave acts as an HTTP client.
platform dependent workers possible
command specification: commandline pattern with space holders for variables and input file generation
result specification: standart output and/or files
validation of the results
list with task characterised through parameters.
timeouts and multiply task assigments if necessary (timeout, free resources and so on)
collecting rules: a) plain concat b) blockwise with parameters
simple statistiks: which slave did what and which parameter sets failed.

Error detection/ dealing

error while connection/authentication (ssh, scp)
machine dead (for whatever reason)
programm terminated without success
programm doesn't return within timeout
server breaks or gets stopped

Server

initialisation: for every slave: try to start client (ssh). If it fails: check ssh connection with dummy ssh command. Success -> copy client to slave using scp and try to start it.
on http request for configuration: reply client configuration (i.e. what to do on exit)
on http request for worker executeable: reply with the binary for the right platform
on http request for new task: reply with the next command to execute and all parameters.
on post: validate result and mark set and collect results
no more parameter set to process: exit and display statistics.

Client

gets via command line: Session ID, command name (path and name of the executeable), md5 checksum of the executeable, server name and port
register at the server and fetch configuration
check for the executeable: if not there or the MD5 checksum is wrong: fetch it (for own platform) from the server
fetch a task
run worker
check return code: if failed -> Post failture otherwise take the results and post them.
fetch next task.
die is there is no more task or the server is not responding.
different settings for termination: delete executeable (if fetched), delete the client program?

Protocol

Configuration

Request: GET http://master/config?sessionID=SESSIONID
Fail (due to wrong session id): 403 (Forbidden)
Successful Reply: List of Key = Value pairs.

DeleteWorker=Yes/No
DeleteClient=Yes/No
Ping=#

Ping (HTTP)

Ping interval is given in seconds. 0 for no ping.
Purpose of the ping is that the client realises if the server is stopped or finished or even dead.
Request: GET http://master/ping?sessionID=SESSIONID&ticket=TICKET
Fail due to wrong session id: 403 (Forbidden)
Successful, but ticket expired (task allready done): 205 Reset Content
Successful (keep on it!): 204 (No Content)

Worker

Request: GET http://master/worker?sessionID=SESSIONID&name=NAME&platform=PLATFORM
Fail (due to wrong session id, or wrong name): 403 (Forbidden)
Fail (due to unsupported platform): 415 Unsupported Media Type
Success: binary file

Task

Request: GET http://master/task?sessionID=SESSIONID
Fail (due to wrong session id): 403 (Forbidden)
Fail (because no task left): 503 Service Unavailable
Success:

Ticket=#  (unique number (usually 5 digits))
Timeout=# (in seconds)
CommandLine=commandline
[Input]*
File=filename (or "stdin")
--begin content--
real file content here (binary)
--end content--
[Result]+
Name=resultname
File=filename (or "stdout")

Task completed

Successful: POST http://master/complete?sessionID=SESSIONID&ticket=TICKET

[Result]+
Name=resultname
--begin content--
file content here (binary)
--end content--

Failed: GET http://master/failed?sessionID=SESSIONID&ticket=TICKET
Reply Fail due to wrong session id: 403 (Forbidden)
Reply Otherwise: 200 OK

Implementation Details

Configuation and Files

Cluster file: list of computers, one per line
Parameter file: csv file, cells seperated with |, parameter names in the headline and every following line contains one parameter set. All lines have to have the same amount of cells like the headline!
Specification of worker: Command line: a perl syntax string with parameter Variables for the parameters; Input files: Name of the file and parameter name to write in.
Result specification: A result consists of a list of name and value pairs. Where name specifies the name of the particular output and value decides where the output comes from. For example output="stdout", outputfile="out.txt".
Validation function: function template that gets the parameter set and the result of the worker and returns success or not
Collection function: gets the parameters set and the result of the worker and can do whatever with it (usually print it in a file).

Error detection

Remote shell command (ssh) termination code:

0 => Success: The worker has been executed with success!
otherswise => Failure: Can have the following reasons: connection failed, program not found or terminated without success.

To check a connection and the authentication:

ssh host echo

return 0 (success): Connection is OK and machine lives.
otherwise error