Changed lines at line 2
2: Provide a very easy way to run one program with different command-line parameters on a bunch of computers in parallel and collect the results. There should be a simple check that determines whether the machines are still available.
3: 1 Why not use something like MPI?
4: - Well, first of all the programs I want to run are not written in C.
5: - I could write an MPI client that executes my program with the parameters, but then I might as well use a shell script.
6: - I need a more flexible way to gather the results. For example, some output goes to a file. A shell or Perl script seems more convenient for doing a flexible conversion.
7: 1 Specification
8: 1.1 Cluster
9: - Linux machines
10: - NFS
11: - SSH (public key authentication)
12: - permanent network connection
13: 1.1 Features & Configurability
14: - one master computer
15: - list of slave computers (names or IPs)
16: - command-line pattern with placeholders for variables
17: - validation (return value and/or existence of a file)
18: - list of all parameter settings to be processed
19: - simple task assignment: when a computer finishes a task, give it the next one from the list
20: - error detection (see next section)
21: - if an error occurs while executing a task, mark the failing machine as dead and give the task to the next machine
22: - collecting rules: a) stdout processing (plain concat; blockwise with parameters) b) file processing (plain concat; with parameters)
23: 1.1 Error detection
24: The following errors need to be detected:
25: - error during connection/authentication
26: - machine dead (for whatever reason)
27: - connection broken
28: - program terminated without success
29: 1 Implementation Details
30: 1.1 Configuration and Files
31: - Cluster file: list of computers, one per line
32: - Parameter file: CSV file with cells separated by |. The header line holds the parameter names and every following line contains one parameter set. Every line must have the same number of cells as the header line!
33: - Command-line pattern: a Perl-syntax string with variables for the parameters
34: - Validation function: gets the parameter set and the result of the program and returns success or failure
35: - Collection function: gets the parameter set and the result of the program and can do whatever it likes with them (usually print them to a file)
36: 1.1 Error detection
37: Remote shell command (ssh) termination code:
38: - 0 => Success: The program has been executed with success!
39: - otherwise => Failure: Possible reasons: connection failed, program not found, or program terminated without success.
40: To check the connection and authentication run:
41: {code:shell}ssh host echo{code}
42: - return 0 (success): Connection is OK and the machine is alive, so the program was either not found or did not terminate with 0. In either case we assume that this parameter set is somehow bad and skip it.
43: - otherwise mark this machine as dead and reschedule the parameter set.
44: One case not covered by the above procedure is a broken connection: the ssh command then simply never terminates.
45: Solution: asynchronous periodic ping
46: Provide a very easy way to run one program with different settings on a bunch of computers in parallel and collect the results. The aims are simple configuration, wide applicability, and fault tolerance with respect to network and administration errors.
47: 1 Why not use something like MPI?
48: - Well, first of all the programs I want to run are not written in C.
49: - I could write an MPI client that executes my program with the parameters, but then I might as well use a shell script.
50: - I need a more flexible way to gather the results. For example, some output goes to a file. A shell or Perl script seems more convenient for doing a flexible conversion.
51: 1 Specification
52: 1.1 Cluster
53: - Linux machines
54: - SSH (public key authentication)
55: - (SCP, secure copy, optional)
56: - HTTP
57: 1.1 Features
58: - one master computer running the server program (an HTTP server). SSH and SCP are needed to copy the client program to each slave and start it.
59: - list of slave computers (names or IPs). Every slave acts as an HTTP client.
60: - command specification: command-line pattern with placeholders for variables, and input file generation
61: - result specification: standard output and/or files
62: - validation of the results
63: - list of tasks, each characterised by its parameters
64: - timeouts and multiple assignments of the same task if necessary (after a timeout, when resources are free, and so on)
65: - collecting rules: a) plain concat b) blockwise with parameters
66: - simple statistics: which slave did what and which parameter sets failed
67: 1.1 Error detection
68: - error during connection/authentication
69: - machine dead (for whatever reason)
70: - program terminated without success
71: - program doesn't return within the timeout
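The last two cases can be told apart from the exit status alone. The following is only a sketch, assuming GNU coreutils `timeout` is installed on the slaves; the function name `run_task` and the three result words are made up for illustration and are not part of the specification.

```shell
# run_task SECONDS COMMAND [ARGS...]
# Prints one of: ok, timeout, failed.
# Assumption: timeout(1) from GNU coreutils, which exits with status 124
# when the time limit is reached and otherwise passes on the command's status.
run_task() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0)   echo ok ;;
        124) echo timeout ;;
        *)   echo failed ;;
    esac
}
```

For example, `run_task 60 ./sim --alpha 1` would report `timeout` if the (hypothetical) program `./sim` is still running after 60 seconds.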
72: 1.1 Server
73: - initialisation: for every slave, try to start the client (via ssh).
74: If that fails, check the ssh connection with a dummy ssh command. On success, copy the client to the slave using scp and try to start it again.
75: - on HTTP request for the executable: reply with the binary
76: - on HTTP request for a new task: reply with the next command to execute and all its parameters
77: - on POST: validate the result, mark the parameter set, and collect the results
78: - no more parameter sets to process: exit and display the statistics
79: 1.1 Client
80: - gets via command line: session ID, command name (path and name of the executable), MD5 checksum of the executable, server name and port
81: - check for the executable: if it is missing or its MD5 checksum is wrong, fetch it from the server
82: - fetch a task
83: - run the program
84: - check the return code: if it failed, post the failure; otherwise take the results and post them
85: - fetch the next task
86: - die if there are no more tasks or the server is not responding
87: - different settings for termination: delete the executable (if fetched), delete the client program?
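The executable check in the client could look roughly like the sketch below. It assumes `md5sum` (GNU coreutils) is available on the slave; the function name `needs_fetch` is hypothetical, and the actual download from the server is left out.

```shell
# needs_fetch PATH EXPECTED_MD5
# Succeeds (exit status 0) when the executable has to be fetched from the
# server: either the file is missing or its MD5 checksum does not match.
needs_fetch() {
    [ -f "$1" ] || return 0
    # Reading from stdin keeps the file name out of md5sum's output.
    actual=$(md5sum < "$1" | cut -d' ' -f1)
    [ "$actual" != "$2" ]
}
```

The client would then call it as `if needs_fetch "$cmd" "$md5"; then ...; fi` before asking for its first task, where `$cmd` and `$md5` stand for the values passed on the command line.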
88:
89: 1 Implementation Details
90: 1.1 Configuration and Files
91: - Cluster file: list of computers, one per line
92: - Parameter file: CSV file with cells separated by |. The header line holds the parameter names and every following line contains one parameter set. Every line must have the same number of cells as the header line!
93: - Command-line specification: Command line: a Perl-syntax string with variables for the parameters; Input files: name of the file and name of the parameter to write into it.
94: - Result specification: A result consists of a list of name/value pairs, where the name identifies the particular output and the value says where that output comes from. For example output="stdout", outputfile="out.txt".
95: - Validation function: function template that gets the parameter set and the result of the program and returns success or failure
96: - Collection function: gets the parameter set and the result of the program and can do whatever it likes with them (usually print them to a file).
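To make the interplay of parameter file and command-line pattern concrete, here is a small sketch that expands a pattern once per parameter set. It is not the planned implementation: the function name `expand_tasks` is invented, and placeholders are assumed to be written as `${name}` (one of the forms Perl accepts) so that one parameter name cannot be mistaken for the prefix of another.

```shell
# expand_tasks PARAM_FILE PATTERN
# Prints one command line per parameter set. The first line of PARAM_FILE
# names the parameters; PATTERN refers to them as ${name}.
# Fails if a line has a different number of cells than the header line.
expand_tasks() {
    PATTERN=$2 awk -F'|' '
        # Literal (non-regex) replacement of every occurrence of key in s.
        function fill(s, key, val,    out, p) {
            out = ""
            while ((p = index(s, key)) > 0) {
                out = out substr(s, 1, p - 1) val
                s = substr(s, p + length(key))
            }
            return out s
        }
        NR == 1 { ncols = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
        NF != ncols {
            printf("line %d: expected %d cells, got %d\n", NR, ncols, NF) > "/dev/stderr"
            exit 1
        }
        {
            cmd = ENVIRON["PATTERN"]
            for (i = 1; i <= ncols; i++) cmd = fill(cmd, "${" name[i] "}", $i)
            print cmd
        }
    ' "$1"
}
```

With a parameter file containing the lines `alpha|beta`, `1|x` and `2|y`, the call `expand_tasks params.csv './sim --alpha ${alpha} --tag ${beta}'` prints `./sim --alpha 1 --tag x` and `./sim --alpha 2 --tag y`; the file name `params.csv` and the program `./sim` are placeholders.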
97: 1.1 Error detection
98: Remote shell command (ssh) termination code:
99: - 0 => Success: The program has been executed with success!
100: - otherwise => Failure: Possible reasons: connection failed, program not found, or program terminated without success.
101: To check a connection and the authentication:
102: {code:shell}ssh host echo{code}
103: - return 0 (success): Connection is OK and the machine is alive.
104: - otherwise error
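Wrapped in a function, the check might be sketched as follows. The options are an assumption on top of the plain command above: `BatchMode=yes` makes ssh fail instead of waiting at a password prompt when public key authentication is not set up, `ConnectTimeout` bounds the wait for an unreachable host, and `-n` keeps ssh from reading stdin. The name `probe_host` and the 5-second value are illustrative only.

```shell
# probe_host HOST
# Prints "alive" if a trivial remote command succeeds, otherwise "dead".
probe_host() {
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" echo >/dev/null 2>&1; then
        echo alive
    else
        echo dead
    fi
}
```

Note that `ConnectTimeout` only covers connection setup; a connection that breaks later still needs a separate timeout on the task itself.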