Changed lines at line 7
7: 1 Specification
8: 1.1 Cluster
9: - Linux machines
10: - SSH (public key authentication)
11: - (SCP secure copy, optional)
12: - HTTP
13: 1.1 Features
14: - one master computer with the server program (an HTTP server). SSH and SCP are needed to get the client program to the slaves and start it.
15: - list of slave computers (names or IPs). Every slave acts as an HTTP client.
16: - command specification: command-line pattern with placeholders for variables and input file generation
17: - result specification: standard output and/or files
18: - validation of the results
19: - list of tasks, each characterised by its parameters.
20: - timeouts and multiple task assignments if necessary (timeout, free resources and so on)
21: - collecting rules: a) plain concat b) blockwise with parameters
22: - simple statistics: which slave did what and which parameter sets failed.
23: 1.1 Error detection
24: - error during connection/authentication
25: - machine dead (for whatever reason)
26: - program terminated without success
27: - program doesn't return within timeout
28: 1.1 Server
29: - initialisation: for every slave, try to start the client (ssh). If that fails, check the SSH connection with a dummy ssh command. If the connection works, copy the client to the slave using scp and try to start it again.
30: - on HTTP request for executable: reply with the binary
31: - on HTTP request for new task: reply with the next command to execute and all parameters.
32: - on POST: validate the result, mark the parameter set as done and collect the results
33: - no more parameter sets to process: exit and display statistics.
34: 1.1 Client
35: - gets via command line: session ID, command name (path and name of the executable), MD5 checksum of the executable, server name and port
36: - check for the executable: if it is missing or the MD5 checksum is wrong, fetch it from the server
37: - fetch a task
38: - run program
39: - check return code: if the program failed, post the failure; otherwise take the results and post them.
40: - fetch next task.
41: - die if there are no more tasks or the server is not responding.
42: - different settings for termination: delete executable (if fetched), delete the client program?
43:
44: 1 Implementation Details
45: 1.1 Configuration and Files
46: - Cluster file: list of computers, one per line
47: - Parameter file: CSV file, cells separated with |, parameter names in the header line, and every following line contains one parameter set. All lines must have the same number of cells as the header line!
48: - Command-line specification: Command line: a Perl-syntax string with variables for the parameters; Input files: name of the file and name of the parameter to write into it.
49: 1 Terminology
50: - Master: the computer on which the server runs
51: - Slave: one of many computers that do the work
52: - Server: the program that coordinates the process
53: - Client: the program that runs on the slaves
54: - Worker: the program that does the computation (any program)
55: - Task: a set of parameters/settings for the Worker
56: 1 Specification
57: 1.1 Cluster
58: - One Master
59: - Many Slaves
60: - primarily Linux machines, however the design is platform independent
61: - any TCP/IP network connection (may be non-permanent, but should not be :-))
62: - SSH (public key authentication) on Slave
63: - SCP secure copy (public key authentication) on Slave
64: - HTTP (any open port on Master, preferably port 80)
65: 1.1 Features
66: - one master computer with the server program (an HTTP server). SSH and SCP are needed to get the client program to the slaves and start it.
67: - list of slave computers (names or IPs). Every slave acts as an HTTP client.
68: - platform-dependent workers are possible
69: - command specification: command-line pattern with placeholders for variables and input file generation
70: - result specification: standard output and/or files
71: - validation of the results
72: - list of tasks, each characterised by its parameters.
73: - timeouts and multiple task assignments if necessary (timeout, free resources and so on)
74: - collecting rules: a) plain concat b) blockwise with parameters
75: - simple statistics: which slave did what and which parameter sets failed.
76: 1.1 Error detection/handling
77: - error during connection/authentication (ssh, scp)
78: - machine dead (for whatever reason)
79: - program terminated without success
80: - program doesn't return within timeout
81: - server breaks or gets stopped
82: 1.1 Server
83: - initialisation: for every slave, try to start the client (ssh). If that fails, check the SSH connection with a dummy ssh command. If the connection works, copy the client to the slave using scp and try to start it again.
84: - on HTTP request for configuration: reply with the client configuration (e.g. what to do on exit)
85: - on HTTP request for executable: reply with the binary for the right platform
86: - on HTTP request for new task: reply with the next command to execute and all parameters.
87: - on POST: validate the result, mark the parameter set as done and collect the results
88: - no more parameter sets to process: exit and display statistics.
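The initialisation step for one slave can be sketched as follows. This is only a sketch under assumptions, not the actual server: the function name `start_client`, the injectable `run` callable, the remote path `./client`, the local file name `client`, and the use of `true` as the dummy command are all hypothetical. It also assumes that the client detaches itself, so that the ssh call returns promptly.

```python
import subprocess
from typing import Callable, List

# A runner takes an argument vector and returns the exit code.
Runner = Callable[[List[str]], int]


def default_runner(argv: List[str]) -> int:
    """Run a command locally and return its exit code."""
    return subprocess.run(argv).returncode


def start_client(host: str, client_args: List[str],
                 run: Runner = default_runner,
                 remote_path: str = "./client") -> str:
    """Return 'started', 'unreachable' or 'failed' for one slave."""
    start_cmd = ["ssh", host, remote_path, *client_args]
    # 1. Try to start the client directly.
    if run(start_cmd) == 0:
        return "started"
    # 2. Start failed: check the SSH connection with a dummy command.
    if run(["ssh", host, "true"]) != 0:
        return "unreachable"
    # 3. The connection works, so copy the client over and retry once.
    if run(["scp", "client", f"{host}:{remote_path}"]) != 0:
        return "failed"
    return "started" if run(start_cmd) == 0 else "failed"
```

Passing the runner as a parameter keeps the decision logic testable without a real cluster; the server would call `start_client` once per line of the cluster file.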
89: 1.1 Client
90: - gets via command line: session ID, command name (path and name of the executable), MD5 checksum of the executable, server name and port
91: - register at the server and fetch configuration
92: - check for the executable: if it is missing or the MD5 checksum is wrong, fetch it (for the client's own platform) from the server
93: - fetch a task
94: - run program
95: - check return code: if the program failed, post the failure; otherwise take the results and post them.
96: - fetch next task.
97: - die if there are no more tasks or the server is not responding.
98: - different settings for termination: delete executable (if fetched), delete the client program?
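The executable check in the client list could look roughly like this. It is a minimal sketch: the function name `needs_fetch` is hypothetical, and it assumes the checksum is passed on the command line as a hexadecimal string.

```python
import hashlib
import os


def needs_fetch(path: str, expected_md5: str) -> bool:
    """Return True if the executable is missing or its MD5 checksum differs."""
    if not os.path.isfile(path):
        return True
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large binaries are not loaded at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() != expected_md5.lower()
```

When `needs_fetch` returns True, the client would request the binary from the server before fetching its first task.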
99:
100: 1 Protocol
101: 1.1 Configuration
102: - Request: GET \http://master/config?sessionID=SESSIONID
103: - Fail (due to wrong session id): 403 (Forbidden)
104: - Successful Reply: List of Key = Value pairs.
105: {code:none}
106: DeleteProgram=Yes/No
107: DeleteClient=Yes/No
108: Ping=#
109: {code}
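A client might parse this reply as sketched below. The function name `parse_config`, the returned dictionary keys, and the defaults used for missing keys (no deletion, no ping) are assumptions for illustration and are not part of the protocol.

```python
from typing import Dict


def parse_config(body: str) -> Dict[str, object]:
    """Parse the Key=Value configuration reply into typed settings."""
    settings = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return {
        # Assumed defaults: keep everything and do not ping.
        "delete_program": settings.get("DeleteProgram", "No").lower() == "yes",
        "delete_client": settings.get("DeleteClient", "No").lower() == "yes",
        "ping_interval": int(settings.get("Ping", "0")),
    }
```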
110: 1.1 Ping (HTTP)
111: - Ping interval is given in seconds. 0 for no ping.
112: - The purpose of the ping is to let the client notice when the server has been stopped, has finished, or has died.
113: - Request: GET \http://master/ping?sessionID=SESSIONID&ticket=TICKET
114: - Fail due to wrong session id: 403 (Forbidden)
115: - Successful, but ticket expired (task already done): 205 Reset Content
116: - Successful (keep on it!): 204 (No Content)
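The status codes above map naturally onto a small decision function on the client side. The sketch below is an assumption about how a client could react; the names `PingOutcome` and `interpret_ping` are hypothetical, and treating every other status (or no answer at all, passed as `None`) as "server gone" is a choice made here, not something the protocol states.

```python
from enum import Enum
from typing import Optional


class PingOutcome(Enum):
    CONTINUE = "continue"          # 204: keep working on the task
    ABANDON_TASK = "abandon_task"  # 205: ticket expired, task already done
    BAD_SESSION = "bad_session"    # 403: wrong session id
    SERVER_GONE = "server_gone"    # anything else, or no reply at all


def interpret_ping(status: Optional[int]) -> PingOutcome:
    """Translate the HTTP status of a ping reply into a client action."""
    if status == 204:
        return PingOutcome.CONTINUE
    if status == 205:
        return PingOutcome.ABANDON_TASK
    if status == 403:
        return PingOutcome.BAD_SESSION
    return PingOutcome.SERVER_GONE
```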
117: 1.1 Binary
118: - Request: GET \http://master/binary?sessionID=SESSIONID&name=NAME&platform=PLATFORM
119: - Fail (due to wrong session id, or wrong name): 403 (Forbidden)
120: - Fail (due to unsupported platform): 415 Unsupported Media Type
121: - Success: binary file
122: 1.1 Task
123: - Request: GET \http://master/task?sessionID=SESSIONID
124: - Fail (due to wrong session id): 403 (Forbidden)
125: - Fail (because no task left): 503 Service Unavailable
126: - Success:
127: {code:none}
128: Ticket=# (unique number (usually 5 digits))
129: Timeout=# (in seconds)
130: CommandLine=commandline
131: [Input]*
132: File=filename (or "stdin")
133: --begin content--
134: real file content here (binary)
135: --end content--
136: [Result]+
137: Name=resultname
138: File=filename (or "stdout")
139: {code}
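One way a client could read such a reply is sketched below. Several points are assumptions, since the format above does not settle them: that `[Input]` and `[Result]` appear literally as section header lines (with `*` and `+` only indicating repetition), that lines end with `\n`, that the content markers stand on lines of their own, and that the newline before the end marker is not part of the content. The function name `parse_task` is hypothetical.

```python
from typing import Dict

BEGIN = b"--begin content--"
END = b"--end content--"


def parse_task(body: bytes) -> Dict[str, object]:
    """Split a task reply into top-level keys, inputs and results."""
    task = {"inputs": [], "results": []}
    current = task  # the dict that receives the next Key=Value lines
    lines = body.split(b"\n")
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        i += 1
        if stripped == b"[Input]":
            current = {}
            task["inputs"].append(current)
        elif stripped == b"[Result]":
            current = {}
            task["results"].append(current)
        elif stripped == BEGIN:
            # Collect raw lines up to the end marker (kept as bytes).
            end = i
            while end < len(lines) and lines[end] != END:
                end += 1
            current["content"] = b"\n".join(lines[i:end])
            i = end + 1
        elif b"=" in stripped:
            key, value = stripped.split(b"=", 1)
            current[key.decode("ascii")] = value.decode("utf-8")
    return task
```

Note that this sketch cannot handle binary content that itself contains a line equal to the end marker; a length field or an escaping rule would be needed for that.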
140: 1.1 Task completed
141: - Successful: POST \http://master/complete?sessionID=SESSIONID&ticket=TICKET
142: {code:none}
143: [Result]+
144: Name=resultname
145: --begin content--
146: file content here (binary)
147: --end content--
148: {code}
149: - Failed: GET \http://master/failed?sessionID=SESSIONID&ticket=TICKET
150: - Reply Fail due to wrong session id: 403 (Forbidden)
151: - Reply Otherwise: 200 OK
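The two completion requests could be assembled on the client as in the following sketch. The function names are hypothetical, and the body layout assumes that `[Result]` appears literally as a section header line and that each element is terminated by `\n`; neither detail is fixed by the description above.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode


def build_completion(master: str, session_id: str, ticket: str,
                     results: Dict[str, bytes]) -> Tuple[str, bytes]:
    """Return the URL and body for the POST that reports success."""
    query = urlencode({"sessionID": session_id, "ticket": ticket})
    url = f"http://{master}/complete?{query}"
    parts = []
    for name, content in results.items():
        parts.append(b"[Result]\n")
        parts.append(b"Name=" + name.encode("utf-8") + b"\n")
        parts.append(b"--begin content--\n")
        parts.append(content + b"\n")
        parts.append(b"--end content--\n")
    return url, b"".join(parts)


def build_failed_url(master: str, session_id: str, ticket: str) -> str:
    """Return the URL for the GET that reports a failed task."""
    query = urlencode({"sessionID": session_id, "ticket": ticket})
    return f"http://{master}/failed?{query}"
```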
152: 1 Implementation Details
153: 1.1 Configuration and Files
154: - Cluster file: list of computers, one per line
155: - Parameter file: CSV file, cells separated with |, parameter names in the header line, and every following line contains one parameter set. All lines must have the same number of cells as the header line!
156: - Command-line specification: Command line: a Perl-syntax string with variables for the parameters; Input files: name of the file and name of the parameter to write into it.
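Reading the parameter file and filling in a command-line pattern could be sketched as follows. The function names are hypothetical, and Python's `string.Template` is used here only as a stand-in for Perl-style `$name` interpolation; the real server may substitute variables differently.

```python
import csv
from string import Template
from typing import Dict, Iterator, TextIO


def read_parameter_sets(stream: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per parameter set from a |-separated file."""
    reader = csv.reader(stream, delimiter="|")
    header = [name.strip() for name in next(reader)]
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore completely empty lines
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} cells, got {len(row)}")
        yield dict(zip(header, (cell.strip() for cell in row)))


def build_command(pattern: str, parameters: Dict[str, str]) -> str:
    """Substitute $name variables in the pattern with parameter values."""
    return Template(pattern).substitute(parameters)
```

For example, with a header line `n|seed` and a data line `10|1`, the pattern `worker -n $n --seed $seed` becomes `worker -n 10 --seed 1`. A line with the wrong number of cells raises an error instead of being silently padded.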