8: 1 Goal
9: Provide a very easy way to run one program with different command-line parameters on a group of computers in parallel and collect the results. A simple monitoring check should determine whether the machines are still available.
10: 1 Why not use something like MPI?
11: - Well, first of all, the programs I want to run are not written in C.
12: - I could write an MPI client that executes my program with the parameters, but then I could just as well use a shell script.
13: - I need a more flexible way to collect the results. For example, some output goes to a file. A shell or Perl script seems more convenient for a flexible conversion.
14: 1 Specification
15: 1.1 Cluster
16: - Linux machines
17: - NFS
18: - SSH (public key authentication)
19: 1.1 Features & Configurability
20: - list of computer names or IPs
21: - command-line pattern with placeholders for variables
22: - validation (return value and/or existence of a file)
23: - list of all parameter settings to be processed
24: - simple task assignment: when a computer finishes a task, give it the next one from the list
25: - error detection (see next section)
26: - if an error occurs while executing a task, mark the failing machine as dead and give the task to the next machine
27: - collection rules: a) stdout processing (plain concatenation; blockwise with parameters) b) file processing (plain concatenation; with parameters)
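The assignment and rescheduling rules above can be sketched in bash. This is only a minimal sketch under stated assumptions, not the actual implementation: `dispatch` and `run_on` are hypothetical names, hosts are tried round-robin one after another instead of truly in parallel, and any failure marks the machine as dead, as the feature list states.

```shell
#!/usr/bin/env bash
# Sequential sketch of the assignment rule; a real run would start tasks in parallel.
# run_on is a hypothetical callback: "run_on HOST TASK" returns 0 on success.
dispatch() {
    local -a hosts=() tasks=()
    local host task i=0

    # Arguments before "--" are hosts, arguments after it are tasks.
    while [ $# -gt 0 ] && [ "$1" != "--" ]; do
        hosts+=("$1")
        shift
    done
    if [ "${1:-}" = "--" ]; then shift; fi
    tasks=("$@")

    for task in "${tasks[@]}"; do
        while [ "${#hosts[@]}" -gt 0 ]; do
            i=$(( i % ${#hosts[@]} ))
            host=${hosts[i]}
            if run_on "$host" "$task"; then
                echo "done $task on $host"
                i=$(( i + 1 ))      # next task goes to the next machine
                continue 2
            fi
            # Mark the machine as dead by dropping it from the list,
            # then retry the same task on the machine that follows it.
            echo "dead $host"
            hosts=("${hosts[@]:0:i}" "${hosts[@]:i+1}")
        done
        echo "no machines left for $task" >&2
        return 1
    done
    return 0
}
```

A call such as `dispatch nodeA nodeB -- task1 task2` (host and task names are placeholders) needs a `run_on` function to be defined first, for example a wrapper around ssh.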
28: 1.1 Error detection
29: The following errors need to be detected:
30: - error during connection/authentication
31: - machine dead (for whatever reason)
32: - connection broken
33: - program terminated without success
34: 1 Implementation Details
35: 1.1 Configuration and Files
36: - Cluster file: list of computers, one per line
37: - Parameter file: CSV file with cells separated by |. The header line contains the parameter names and every following line contains one parameter set. Every line must have the same number of cells as the header line!
38: - Command-line pattern: a Perl-syntax string with variables for the parameters
39: - Validation function: receives the parameter set and the result of the program and returns whether the run succeeded
40: - Collection function: receives the parameter set and the result of the program and can do whatever it wants with them (usually print them to a file)
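To illustrate how the parameter file and the command-line pattern fit together, here is a minimal bash sketch. It assumes the pattern uses Perl-style `$name` variables that match the header cells. The function name `expand_commands` is hypothetical. The simple textual replacement also assumes that no parameter name is a prefix of another one, and a row whose last cell is empty is rejected as malformed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand_commands PARAM_FILE PATTERN
# Prints one command line per parameter set; returns 1 on a malformed row.
expand_commands() {
    local param_file=$1 pattern=$2
    local -a names cells
    local line cmd i

    # The header line holds the parameter names, separated by '|'.
    IFS='|' read -r -a names < "$param_file"

    # Every following line is one parameter set.
    while IFS= read -r line || [ -n "$line" ]; do
        IFS='|' read -r -a cells <<< "$line"
        if [ "${#cells[@]}" -ne "${#names[@]}" ]; then
            echo "bad row (expected ${#names[@]} cells): $line" >&2
            return 1
        fi
        cmd=$pattern
        for i in "${!names[@]}"; do
            # Replace every literal $name in the pattern with the cell value.
            cmd=${cmd//"\$${names[i]}"/"${cells[i]}"}
        done
        printf '%s\n' "$cmd"
    done < <(tail -n +2 "$param_file")
}
```

The pattern has to be passed in single quotes so that the calling shell does not expand the variables itself, for example `expand_commands params.csv 'prog --alpha $alpha --beta $beta'` (file and program names are placeholders).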
41: 1.1 Error detection
42: Remote shell command (ssh) exit code:
43: - 0 => Success: The program ran successfully!
44: - otherwise => Failure: Possible reasons are a failed connection, a program that was not found, or a program that terminated without success.
45: To check the connection and authentication run:
46: {code:shell}ssh host echo{code}
47: - returns 0 (success): The connection is OK and the machine is alive, so the program was either not found or did not terminate with 0. In either case we assume that this parameter set is somehow bad and skip it.
48: - otherwise: mark this machine as dead and reschedule the parameter set.
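The decision procedure above could look roughly like this in bash. It is a sketch under assumptions, not the actual implementation: `run_task` and the `SSH_CMD` override are hypothetical names, and the ssh options `BatchMode` and `ConnectTimeout` are added here so that a dead or misconfigured host fails quickly instead of prompting or hanging.

```shell
#!/usr/bin/env bash
# Sketch: run one command remotely and classify the outcome.
# Prints "ok", "skip" (bad parameter set) or "reschedule" (dead machine).
# SSH_CMD is a hypothetical override to try the logic without a real cluster.
run_task() {
    local host=$1 command=$2 outfile=$3
    local -a ssh_cmd
    # BatchMode avoids password prompts; ConnectTimeout bounds the wait.
    read -r -a ssh_cmd <<< "${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

    # ssh exits with the remote command's status, or 255 if ssh itself failed.
    if "${ssh_cmd[@]}" "$host" "$command" < /dev/null > "$outfile" 2> /dev/null; then
        echo "ok"
    elif "${ssh_cmd[@]}" "$host" echo < /dev/null > /dev/null 2>&1; then
        # Connection works, so the program or the parameter set is at fault.
        echo "skip"
    else
        # Even the plain echo failed: treat the machine as dead.
        echo "reschedule"
    fi
}
```

The program's stdout ends up in the file given as third argument, where a collection function can pick it up.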