- Master: the computer where the server runs
- Slave: one of many computers that do the work
- Server: the program that coordinates the process
- Client: the program that runs on the slaves
- Worker: the program that does the computation (can be any program)
- Task: a set of parameters/settings for the Worker
- Result: a file with the results of the computation
- SessionID: unique number for client-server communication (should be unique over multiple runs)
- Ticket: computation identification, unique within one run
1 Specification
1.1 Cluster
- One Master
- Many Slaves
- primarily Linux machines, but the design is platform independent
- any TCP/IP network connection (may be non-permanent, but should not be :-))
- SSH (public key authentication) on the Slaves
- SCP secure copy (public key authentication) on the Slaves
- HTTP (any open port on the Master, preferably port 80)
1.1 Features
- one master computer with the server program (an HTTP server). SSH and SCP are needed to get the client program onto the slaves and start it.
- list of slave computers (host names or IPs). Every slave acts as an HTTP client.
- platform-dependent workers possible
- command specification: command-line pattern with placeholders for variables and input file generation
- result specification: standard output and/or files
- validation of the results (on the master)
- list of tasks characterised by parameters
- timeouts and multiple task assignments if necessary (timeout, free resources and so on)
- collecting rules: a) plain concatenation b) blockwise with parameters
- simple statistics: which slave did what and which parameter sets failed
- NFS aware
1.1 Error detection/handling
- error during connection/authentication (ssh, scp)
- slave dead / client killed (don't care, there are other slaves :-))
- server breaks or gets stopped (all clients should terminate soon afterwards)
- worker terminates without success
- worker doesn't return within the timeout
1.1 Server
* The format of the communication is specified in the [Simple Parallel Exec#Protocol] section.
- initialisation: for every slave, try to start the client (ssh). If that fails, check the ssh connection with a dummy ssh command. On success, copy the client to the slave using scp and try to start it again.
- on HTTP request for the configuration: reply with the client configuration (i.e. worker name, MD5 checksum and what to do on exit)
- on HTTP request for the worker executable: reply with the binary for the right platform
- on HTTP request for a new task: reply with the next command to execute and all parameters. __Open question: is there a standard file format for a task?__
- on post: validate the result (validation happens on the server), mark the task as done and collect the results
- no more tasks to process: exit and display statistics. __Open question: statistics during runtime, e.g. via a web page?__
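The initialisation step can be sketched roughly as follows. The client path, the argument layout and the injectable `run` callable are illustrative assumptions, not part of the spec:

```python
import subprocess

def start_client(slave, run=subprocess.call):
    """Try to start the client on a slave via ssh; on failure, verify the
    ssh connection and copy the client over with scp before retrying.
    Paths and argument order are illustrative assumptions, not the spec.
    `run` executes a command list and returns its exit code."""
    start = ["ssh", slave, "nohup ./spe_client SESSIONID http://master:80 &"]
    if run(start) == 0:
        return "started"
    if run(["ssh", slave, "echo"]) != 0:
        return "unreachable"                      # connection/authentication error
    run(["scp", "spe_client", slave + ":"])       # client missing: copy it over
    return "started" if run(start) == 0 else "failed"
```

Injecting `run` keeps the decision logic testable without a real cluster.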
1.1 Client
* The format of the communication is specified in the [Simple Parallel Exec#Protocol] section.
* gets via the command line: session ID, server URL and port
- register at the server and fetch the configuration
- check for the worker: if it is not already on the local filesystem, or its MD5 checksum is wrong, fetch it (for the own platform) from the server
- fetch a task
- run the worker
- check the return code: if it failed, post a failure; otherwise take the results and post them
- fetch the next task
- die if there are no more tasks or the server is not responding
* different settings for termination: delete the executable (if fetched), delete the client program, delete results?

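The client steps above amount to a simple fetch/run/report loop. In this sketch the four callables stand in for the HTTP and process-execution details, which are not specified here:

```python
def client_loop(fetch_task, run_worker, post_result, post_failure):
    """Main loop of the client: fetch tasks until the server reports
    there are none left, run the worker, and report success or failure.
    The four callables are placeholders for the HTTP/exec machinery."""
    while True:
        task = fetch_task()              # None once the server replies 503
        if task is None:
            break                        # no more tasks: die
        returncode, results = run_worker(task)
        if returncode != 0:
            post_failure(task)           # GET /failed
        else:
            post_result(task, results)   # POST /complete
```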
1 Protocol
1.1 Configuration
- Request: GET \http://master/config?sessionID=SESSIONID
- Fail (due to a wrong session ID): 403 (Forbidden)
- Successful reply: list of Key=Value pairs.
{code:none}
DeleteWorker=Yes/No
DeleteClient=Yes/No
Ping=#
{code}
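A minimal client-side sketch for parsing such a reply, assuming surrounding whitespace is insignificant:

```python
def parse_config(body):
    """Parse a Key=Value reply body into a dict; skips blank lines and
    lines without an '=' sign."""
    config = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config
```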
1.1 Ping (HTTP)
* The ping interval is given in seconds; 0 means no ping.
* The purpose of the ping is that the client notices if the server has stopped, finished or even died.
- Request: GET \http://master/ping?sessionID=SESSIONID&ticket=TICKET
- Fail due to a wrong session ID: 403 (Forbidden)
- Successful, but ticket expired (task already done): 205 (Reset Content)
- Successful (keep at it!): 204 (No Content)
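A sketch of how a client might map these ping replies to actions; the action names are hypothetical:

```python
def handle_ping_status(status):
    """Map the ping HTTP status codes from the protocol to client actions.
    204: task still wanted, keep working; 205: ticket expired, abandon the
    current task and fetch a new one; 403: session invalid, terminate."""
    if status == 204:
        return "continue"
    if status == 205:
        return "abandon-task"
    if status == 403:
        return "terminate"
    return "server-error"    # anything else: treat the server as broken
```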
1.1 Worker
- Request: GET \http://master/worker?sessionID=SESSIONID&name=NAME&platform=PLATFORM
- Fail (due to a wrong session ID or a wrong name): 403 (Forbidden)
- Fail (due to an unsupported platform): 415 (Unsupported Media Type)
- Success: binary file
1.1 Task
- Request: GET \http://master/task?sessionID=SESSIONID
- Fail (due to a wrong session ID): 403 (Forbidden)
- Fail (because no task is left): 503 (Service Unavailable)
- Success:
{code:none}
Ticket=# (unique number within the session, usually on the order of the number of tasks)
Timeout=# (in seconds)
CommandLine=commandline
[Input]*
File=filename (or "stdin")
--begin content--
real file content here (binary)
--end content--
[Result]+
Name=resultname
File=filename (or "stdout")
{code}
* the * behind a section means there can be _zero_ or more such sections
* the + behind a section means there can be _one_ or more such sections
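Assuming each section appears in an actual reply as a bare [Input] or [Result] header (the * and + being only cardinality markers), a simplified ASCII-only parser could look like this:

```python
def parse_task(body):
    """Parse a task reply into ticket, command line, inputs and results.
    A simplified sketch: assumes a well-formed, ASCII-only body."""
    lines = body.splitlines()
    task = {"inputs": [], "results": []}
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("Ticket="):
            task["ticket"] = line.split("=", 1)[1]
        elif line.startswith("CommandLine="):
            task["commandline"] = line.split("=", 1)[1]
        elif line == "[Input]":
            fname = lines[i + 1].split("=", 1)[1]
            i += 3                        # skip File= and --begin content--
            content = []
            while lines[i] != "--end content--":
                content.append(lines[i])
                i += 1
            task["inputs"].append((fname, "\n".join(content)))
        elif line == "[Result]":
            name = lines[i + 1].split("=", 1)[1]
            fname = lines[i + 2].split("=", 1)[1]
            task["results"].append((name, fname))
            i += 2
        i += 1
    return task
```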
1.1 Task completed
- Successful: POST \http://master/complete?sessionID=SESSIONID&ticket=TICKET
{code:none}
[Result]+
Name=resultname
Content=<<ENDOFCONTENT
file content here (ASCII)
ENDOFCONTENT
{code}
- Failed: GET \http://master/failed?sessionID=SESSIONID&ticket=TICKET
- Reply on a wrong session ID: 403 (Forbidden)
- Reply otherwise: 200 (OK)
* binary content is not supported
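Building the POST body for /complete from a list of (name, content) pairs could look like this (ASCII content only, and assuming actual messages use a bare [Result] header):

```python
def build_complete_body(results):
    """Build the POST body for /complete from (name, content) pairs,
    using the heredoc-style delimiter from the protocol."""
    parts = []
    for name, content in results:
        parts.append("[Result]")
        parts.append("Name=" + name)
        parts.append("Content=<<ENDOFCONTENT")
        parts.append(content)
        parts.append("ENDOFCONTENT")
    return "\n".join(parts) + "\n"
```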
1 Implementation Details
1.1 Configuration and Files
- Server config: list of slave computers, one host name or IP per line. TODO: describe the remaining settings.
- Parameter file: a CSV file with cells separated by |; the parameter names are in the headline and every following line contains one parameter set. All lines must have the same number of cells as the headline!
- Specification of the worker: command line: a Perl-syntax string with variables for the parameters; input files: the name of the file and the parameter name to write into it.
- Result specification: a result consists of a list of name - value pairs, where the name identifies the particular output and the value decides where the output comes from. For example myoutput="stdout", myfileoutput="out.txt".
- Validation: standard implementations are provided, and a custom implementation can be supplied by the user as a Perl function. A validation function gets the result of the worker and returns success or failure.
- Collection: standard implementations are provided, and a custom implementation can be supplied by the user as a Perl function. A collection function gets the task description (number, parameter set) and the result of the worker and can do whatever it wants with it (usually it writes to a file).
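The parameter file and the command-line pattern described above can be handled roughly as follows; the $name placeholder syntax is an assumption about what "Perl-syntax string with variables" means:

```python
import re

def parse_parameter_file(text):
    """Parse the |-separated parameter file: the first line holds the
    parameter names, each following line one parameter set."""
    lines = [l for l in text.splitlines() if l.strip()]
    names = [c.strip() for c in lines[0].split("|")]
    sets = []
    for line in lines[1:]:
        cells = [c.strip() for c in line.split("|")]
        assert len(cells) == len(names), "row width must match the headline"
        sets.append(dict(zip(names, cells)))
    return sets

def fill_command(template, params):
    """Substitute $name variables in a command-line template (the exact
    placeholder syntax is an assumption)."""
    return re.sub(r"\$(\w+)", lambda m: params[m.group(1)], template)
```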
1.1 NFS awareness
The problem is that if some slaves share files via NFS or another network filesystem, different clients
could overwrite each other's data. Basically there are three points where this occurs:
1. the client is copied
1. the client fetches the worker
1. the worker writes its data to a file.
Solutions:
1. a) start and copy the clients serially (very slow) b) copy just one client at a time, but start them in parallel (fast on NFS, slow otherwise)
1. before fetching the worker, the client creates a .lock file. The other clients check for its existence and wait for the worker.
1. every worker is started in a separate directory, named after the session ID and the ticket number
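Solution 2 can be sketched with an exclusive-create lock file. Note that O_EXCL semantics over NFS are version-dependent, so this is only a sketch of the idea:

```python
import os
import time

def fetch_worker_with_lock(path, fetch, timeout=60):
    """Download the worker once per shared filesystem: the first client
    creates path + '.lock' exclusively, calls fetch(path) to write the
    worker, then removes the lock; the other clients just wait until the
    worker file appears."""
    lock = path + ".lock"
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        deadline = time.time() + timeout
        while not os.path.exists(path):      # another client is fetching
            if time.time() > deadline:
                raise TimeoutError("worker did not appear")
            time.sleep(0.1)
        return
    try:
        fetch(path)
    finally:
        os.close(fd)
        os.remove(lock)
```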
1.1 Error detection
Remote shell command (ssh) termination code:
- 0 => Success: the executed command completed successfully!
- otherwise => Failure: possible reasons: the connection failed, the program was not found, or it terminated without success.
To check the connection and the authentication:
{code:shell}ssh host echo{code}
- return 0 (success): the connection is OK and the machine has a shell. (TODO: check for Windows and Mac machines)