Simple Parallel Exec
- one master computer with the server program (acts as HTTP server). SSH and SCP are needed to get the client program to the slaves and start it.
- list of slave computers (host names or IPs). Every slave acts as an HTTP client.
- platform-dependent workers are possible
- command specification: command-line pattern with placeholders for variables, plus input file generation
- result specification: standard output and/or files
- validation of the results on the master
- list of tasks characterised by parameters
- timeouts and multiple task assignments if necessary (timeout, free resources and so on)
- collecting rules: a) plain concatenation b) blockwise with parameters
- simple online (via Web) and offline statistics: which slave did what and which parameter sets failed
- test mode with different bias to check the configuration
- NFS aware
1.1 Error detection and handling
- errors during connection/authentication (ssh, scp)
- slave dead / client killed (no problem, there are other slaves :-) )
- server crashes or gets stopped (all clients should terminate shortly afterwards)
- worker terminates without success
- worker doesn't return within the timeout
1.1 Server
* The communication format is specified in the [Protocol|Simple Parallel Exec#Protocol] section.
- initialisation: for every slave, try to start the client (ssh). If that fails, check the ssh connection with a dummy ssh command; on success, copy the client to the slave using scp and try to start it again. (A sketch of this step follows this list.)
- on HTTP request for the configuration: reply with the client configuration (i.e. worker name, MD5 checksum and what to do on exit)
- on HTTP request for the worker executable: reply with the binary for the right platform
- on HTTP request for a new task: reply with the next command to execute and all its parameters
- on HTTP request for statistics (a normal web page): reply with the statistics page
- on POST: validate the result, mark the task as completed and collect the results
- no more tasks to process: exit and display the statistics
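The following is a minimal sketch of how the per-slave initialisation step could look. The client name __client.pl__ and its arguments are illustrative assumptions, not the actual file names.
{code:none}
# Sketch of the initialisation step for one slave (names are illustrative).
# ssh -f sends ssh to the background once authentication has succeeded.
sub start_client {
    my ($host, $url, $session) = @_;
    # 1. Try to start the client directly.
    return 1 if system("ssh", "-f", $host, "./client.pl $url $session") == 0;
    # 2. Starting failed: check whether ssh works at all with a dummy command.
    return 0 if system("ssh", $host, "true") != 0;
    # 3. ssh works, so the client is probably missing: copy it and retry.
    system("scp", "client.pl", "$host:") == 0 or return 0;
    return system("ssh", "-f", $host, "./client.pl $url $session") == 0;
}
{code}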
1.1 Client
* The communication format is specified in the [Protocol|Simple Parallel Exec#Protocol] section; a sketch of the client loop follows this list.
* receives session ID, server URL and port via the command line
- register at the server and fetch the configuration
- check for the worker: if it is not already on the local filesystem or its MD5 checksum is wrong, fetch it (for the client's own platform) from the server
- fetch a task
- run the worker
- check the return code: if it failed, post a failure; otherwise take the results and post them
- fetch the next task
- die if there are no more tasks or the server is not responding
* different settings for termination: delete the worker executable (if fetched), delete the client program, delete the results?
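Below is a minimal sketch of this loop written against the Protocol section. It assumes LWP::UserAgent is available, hard-codes the platform, and replaces the real worker invocation with a stub.
{code:none}
#!/usr/bin/perl
# Minimal sketch of the client loop; run_worker is a stub standing in
# for executing the real worker binary.
use strict;
use warnings;
use LWP::UserAgent;

my ($session, $server) = @ARGV;   # e.g. SESSIONID http://master:8080
my $ua = LWP::UserAgent->new;

my $conf = $ua->get("$server/config?sessionid=$session&platform=Linux");
die "cannot fetch configuration\n" unless $conf->is_success;

while (1) {
    my $reply = $ua->get("$server/task?sessionid=$session");
    last unless $reply->is_success;                 # 503: no tasks left -> die
    my ($ticket) = $reply->content =~ /^Ticket=(\d+)/m;
    my ($ok, $results) = run_worker($reply->content);
    if ($ok) {
        $ua->post("$server/completed?sessionid=$session&ticket=$ticket",
                  Content => $results);
    } else {
        $ua->get("$server/failed?sessionid=$session&ticket=$ticket");
    }
}

sub run_worker {            # stub: pretend the worker always succeeds
    my ($task) = @_;
    return (1, "[Result]\nContent=dummy\n");
}
{code}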
1 Protocol {anchor:Protocol}
1.1 Configuration
- Request: GET \http://master/config?sessionid=SESSIONID&platform=PLATFORM
- Fail (due to unsupported platform): 415 (Unsupported Media Type)
- Fail (due to wrong session id): 403 (Forbidden)
- Successful reply: list of Key=Value pairs (a parsing sketch follows this section).
{code:none}
Worker=name of the executable
MD5=md5 checksum of the executable
DeleteWorker=Yes/No
DeleteClient=Yes/No
DeleteResults=Yes/No
Ping=#
{code}
* PLATFORM: one of "Linux, Unix, BSD, WinNT, Win95" (TODO: need a better way than $^O)
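As an illustration, a client could turn such a reply into a hash like this; parse_config is a hypothetical helper, assuming one Key=Value pair per line.
{code:none}
# Hypothetical helper: turn the Key=Value reply body into a hash.
sub parse_config {
    my ($body) = @_;
    my %cfg;
    for my $line (split /\n/, $body) {
        $cfg{$1} = $2 if $line =~ /^(\w+)\s*=\s*(.*)$/;
    }
    return %cfg;    # e.g. $cfg{Worker}, $cfg{MD5}, $cfg{Ping}
}
{code}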
1.1 Ping (HTTP)
* The ping interval is given in seconds; 0 means no ping.
* The purpose of the ping is that the client notices when the server has been stopped, has finished or has died.
- Request: GET \http://master/ping?sessionid=SESSIONID&ticket=TICKET
- Fail due to wrong session id: 403 (Forbidden)
- Successful, but ticket expired (task already done): 205 (Reset Content)
- Successful (keep on it!): 204 (No Content)
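A sketch of one ping round trip as seen from the client, reusing $ua, $server, $session and $ticket from the client sketch above:
{code:none}
# Sketch: one ping round trip as seen from the client.
sub ping_server {
    my ($ua, $server, $session, $ticket) = @_;
    my $r = $ua->get("$server/ping?sessionid=$session&ticket=$ticket");
    exit 1 if $r->is_error;             # server stopped, finished or dead
    return $r->code == 205 ? 'expired'  # task was already done elsewhere
                           : 'ok';      # 204: keep working on it
}
{code}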
1.1 Worker
- Request: GET \http://master/worker?sessionid=SESSIONID
- Fail (due to wrong session id): 403 (Forbidden)
- Fail (due to file not found): 403 (Forbidden)
- Success: binary file
1.1 Task
- Request: GET \http://master/task?sessionid=SESSIONID
- Fail (due to wrong session id): 403 (Forbidden)
- Fail (because no task is left): 503 (Service Unavailable)
- Success:
{code:none}
[Task]
Ticket=# (unique number within session)
CommandLine=commandline
[Input filename]*
Content=single-line file content
or
Content= <<EOT
multi-line file content here (ASCII)
EOT
[Result name]+
File=filename
{code}
* the * behind a section name means there can be _zero_ or more such sections
* the + behind a section name means there can be _one_ or more such sections
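As an illustration, a client might parse such a reply with something like the following; parse_task is a hypothetical helper and assumes the reply uses exactly the section headers, Key=Value lines and <<EOT blocks shown above.
{code:none}
# Hypothetical parser for the task reply format shown above.
sub parse_task {
    my ($body) = @_;
    my @lines = split /\n/, $body;
    my (@sections, $cur);
    while (defined(my $line = shift @lines)) {
        if ($line =~ /^\[(.+)\]$/) {                  # section header
            $cur = { name => $1 };
            push @sections, $cur;
        } elsif ($line =~ /^(\w+)=\s*<<(\w+)\s*$/) {  # multi-line value
            my ($key, $tag) = ($1, $2);
            my @content;
            while (defined(my $l = shift @lines)) {
                last if $l eq $tag;
                push @content, $l;
            }
            $cur->{$key} = join "\n", @content;
        } elsif ($line =~ /^(\w+)=(.*)$/) {           # single-line value
            $cur->{$1} = $2;
        }
    }
    return @sections;   # e.g. ({name => 'Task', Ticket => 4, ...}, ...)
}
{code}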
1.1 Task completed (successfully or with failure)
- __Successful__: POST \http://master/completed?sessionid=SESSIONID&ticket=TICKET
{code:none}
[Result name]+
Content= single-line file content
or
Content= <<EOT
multi-line file content here (ASCII)
EOT
{code}
- __Failed__: GET \http://master/failed?sessionid=SESSIONID&ticket=TICKET
- Reply on wrong session id: 403 (Forbidden)
- Reply when that was the last task (no tasks left): 204 (No Content)
- Reply otherwise: 202 (Accepted)
* binary content is not supported
1.1 Client Starting Status
* This is used for communication between threads within the server.
- GET \http://master/clientstatus?slaveid=SLAVEID&status=STATUS
- STATUS: -1: no ssh; 0: error while starting or copying; 1: started
- Reply on wrong slave id: 403 (Forbidden)
- Reply otherwise: 204 (No Content)
1.1 Client Died Notification
* The client notifies the server that it is about to die.
- GET \http://master/died?session=SESSIONID&normal=yes/no&reason=REASON
- REASON: a string that describes why the client died
- Reply on wrong session id: 403 (Forbidden)
- Reply otherwise: 204 (No Content)
1 Implementation Details
1.1 Configuration and Files
- Server config: See __server.sample.conf__ in the side bar.
- Task file: __tasks.csv__: a CSV file with cells separated by |; the header line contains the parameter names, and every following line contains one parameter set. All lines must have the same number of cells as the header line! Comments may appear before the header line; they start with # and run to the end of the line. (A reading sketch follows the example below.)
{code:none}
#comment
string|counter
"eins"|1
"zwei"|2
{code}
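For illustration, such a file could be read like this; a sketch assuming the format described above, not necessarily how the real server does it:
{code:none}
# Sketch: read tasks.csv (| as separator, # comments before the header).
use strict;
use warnings;

open my $fh, '<', 'tasks.csv' or die "tasks.csv: $!";
my (@header, @tasks);
while (my $line = <$fh>) {
    chomp $line;
    next if $line eq '';
    next if !@header && $line =~ /^#/;   # comments are only allowed up front
    my @cells = split /\|/, $line, -1;
    if (!@header) { @header = @cells; next }
    die "line $. does not match the header cell count\n" if @cells != @header;
    my %params;
    @params{@header} = @cells;           # parameter name -> value
    push @tasks, \%params;               # one parameter set per line
}
close $fh;
{code}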
- Worker config: See __worker.sample.conf__ in the side bar.
- Input specification: The input consists of the command line and one or more files. In the example above, the parameter "counter" is passed as a command-line argument and the parameter "string" is written to the file input.file. This file is used as standard input for the worker. One can specify other files as well, in case the worker reads them.
- Result specification: A result has a name and a filename from which the result values are read. In the example above, one result is called "Result" and comes from the file result.file, which is the standard output of the worker. The second result is called "Output" and is read from output.file. If the worker doesn't write to this file, the result will be empty.
- Validation: standard implementations are provided, and a custom implementation can be supplied by the user as a Perl function. A validation function gets the result of the worker and returns success or failure; see the sketch after this list. (See Validate.pm)
- Collection: standard implementations are provided, and a custom implementation can be supplied by the user as a Perl function. A collection function gets the task description (number, parameter set) and the result of the worker and can do whatever it wants with it (usually it writes to a file). (See Collect.pm)
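A minimal sketch of what a custom validation function might look like; the exact signature is an assumption based on the description above (Validate.pm defines the real interface):
{code:none}
# Hypothetical custom validation function; the signature is assumed.
sub my_validate {
    my ($results) = @_;    # e.g. { Result => "...", Output => "..." }
    return 0 unless defined $results->{Result};
    return $results->{Result} =~ /\S/;   # non-empty result counts as success
}
{code}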
1.1 Server
- The server is implemented in Perl.
- Perl has no reasonable way to use shared memory across multiple threads. Since the program is written using Perl objects, and objects can't be shared, I decided to make a __serial__ implementation first (see the sketch after this list).
- That means only one request can be answered at a time.
- However, I managed to implement the swarming (the initial start of all clients) in parallel, which means that starting is reasonably fast.
- Consequence: less suitable for very small tasks (short computation time) with large input/response data.
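For illustration, the serial request loop might be structured like this; a sketch using HTTP::Daemon, with handler bodies as stubs rather than the real implementation:
{code:none}
#!/usr/bin/perl
# Sketch of the serial server loop: one request at a time, no threads.
use strict;
use warnings;
use HTTP::Daemon;
use HTTP::Response;

my $d = HTTP::Daemon->new(LocalPort => 8080) or die "cannot listen: $!";
print "master listening at ", $d->url, "\n";

while (my $conn = $d->accept) {          # serial: one connection at a time
    while (my $req = $conn->get_request) {
        my $path = $req->uri->path;
        if ($path eq '/ping') {
            $conn->send_response(HTTP::Response->new(204));   # keep working
        } elsif ($path eq '/task') {
            # stub: a real server would hand out the next task here
            $conn->send_response(HTTP::Response->new(503));   # no tasks left
        } else {
            $conn->send_error(403);      # e.g. wrong session id
        }
    }
    $conn->close;
}
{code}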
1.1 NFS awareness
The problem: if some slaves share files via NFS or another network filesystem, different clients may overwrite each other's files. Basically there are three points where this can occur:
1. the client is copied
1. the client fetches the worker
1. the worker writes its data to a file.
Solutions:
1. a) start and copy the clients serially (very slow) b) start the first client and copy the client if necessary; after that, start the remaining clients in parallel (quite fast, current implementation :-))
