1 Specification
1.1 Cluster
- One Master
- Many Slaves
- primarily Linux machines; however, the design is platform independent
- any TCP/IP network connection (may be non-permanent, but should not be :-))
- SSH (public key authentication) on Slave
- SCP secure copy (public key authentication) on Slave
- HTTP (any port open on Master, preferably #80)
1.1 Features
- one master computer with the server program (HTTP server). SSH and SCP are needed to get the client program to the slaves and start it.
- list of slave computers (hostnames or IPs). Every slave acts as an HTTP client.
- platform-dependent workers possible
- command specification: command-line pattern with placeholders for variables and input file generation
- result specification: standard output and/or files
- validation of the results __(where: on the master or on the slaves?)__
- list of tasks characterised by parameters.
- timeouts and multiple task assignments if necessary (timeout, free resources and so on)
- collecting rules: a) plain concatenation b) blockwise with parameters
- simple statistics: which slave did what and which parameter sets failed.
- NFS aware
1.1 Error detection/handling
- error during connection/authentication (ssh, scp)
- machine dead (for whatever reason)
- program terminated without success
- program doesn't return within the timeout
- server crashes or gets stopped
1.1 Server
- initialisation: for every slave, try to start the client (ssh). If that fails, check the ssh connection with a dummy ssh command. On success, copy the client to the slave using scp and try to start it again.
- on HTTP request for configuration: reply with the client configuration (i.e. what to do on exit)
- on HTTP request for the worker executable: reply with the binary for the right platform
- on HTTP request for a new task: reply with the next command to execute and all parameters. __standard file format for a task?__
- on post: validate the result, mark the parameter set and collect the results __OK: result validation on the server__
- no more tasks to process: exit and display statistics. __statistics during runtime? per web page?__
1.1 Client
- gets via command line: session ID, server name and port __the command name and MD5 checksum should come over HTTP instead (the MD5 depends on the platform)__
- registers at the server and fetches the configuration __yes, this is the config__
- checks for the worker: if it is not already on the local filesystem, or the MD5 checksum is wrong, fetch it (for the client's own platform) from the server
- fetches a task
- runs the worker
- checks the return code: if it failed, post the failure; otherwise take the results and post them.
- fetches the next task.
- dies if there is no more task or the server is not responding.
- different settings for termination: delete the executable (if fetched), delete the client program?
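The client steps above can be sketched as a simple loop. The four callables stand in for the HTTP and process-spawning code; all names here are illustrative, not part of the specification:

```python
def client_loop(fetch_task, run_worker, post_results, post_failed):
    """Main loop of the slave client, following the steps above.

    fetch_task() returns a task (any object) or None once the server has
    no tasks left (503) or stops responding; run_worker(task) returns
    (exit_code, results); post_results/post_failed report back to the
    master. Returns the number of tasks processed before dying.
    """
    done = 0
    while True:
        task = fetch_task()
        if task is None:  # no more tasks, or server gone: die
            return done
        code, results = run_worker(task)
        if code != 0:
            post_failed(task)           # worker terminated without success
        else:
            post_results(task, results) # take the results and post them
        done += 1
```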

1 Protocol
1.1 Configuration
- Request: GET \http://master/config?sessionID=SESSIONID
- Fail (due to wrong session id): 403 (Forbidden)
- Successful reply: list of Key=Value pairs.
{code:none}
DeleteWorker=Yes/No
DeleteClient=Yes/No
Ping=#
{code}
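A client-side sketch of parsing such a reply (the function name and the tolerance for whitespace are illustrative, not mandated by the protocol):

```python
def parse_config(body):
    """Parse the Key=Value configuration reply into a dict.

    Lines look like 'DeleteWorker=Yes'; whitespace around '=' is
    tolerated, blank lines and lines without '=' are skipped.
    """
    config = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config
```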
1.1 Ping (HTTP)
- The ping interval is given in seconds; 0 means no ping.
- The purpose of the ping is that the client notices when the server is stopped, finished or even dead.
- Request: GET \http://master/ping?sessionID=SESSIONID&ticket=TICKET
- Fail due to wrong session id: 403 (Forbidden)
- Successful, but ticket expired (task already done): 205 (Reset Content)
- Successful (keep at it!): 204 (No Content)
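The client's reaction to each reply code can be captured in a small dispatch; a sketch (the function and action names are illustrative, only the status codes come from the protocol):

```python
def ping_action(status):
    """Map the ping reply's HTTP status to the client's next step.

    204: task still wanted, keep computing; 205: ticket expired (task
    already done elsewhere), abandon the current task and fetch a new
    one; 403: session invalid, terminate. Anything else (or a network
    error mapped to 0 by the caller) means the server is gone.
    """
    if status == 204:
        return "continue"
    if status == 205:
        return "abort-task"
    if status == 403:
        return "terminate"
    return "server-gone"
```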
1.1 Worker
- Request: GET \http://master/worker?sessionID=SESSIONID&name=NAME&platform=PLATFORM
- Fail (due to wrong session id or wrong name): 403 (Forbidden)
- Fail (due to unsupported platform): 415 (Unsupported Media Type)
- Success: binary file
1.1 Task
- Request: GET \http://master/task?sessionID=SESSIONID&worker=NAME
- Fail (due to wrong session id): 403 (Forbidden)
- Fail (because no task is left): 503 (Service Unavailable)
- Success:
{code:none}
Ticket=# (unique number, usually 5 digits) __unique within a session, or globally? why only 5 digits?__
Timeout=# (in seconds)
CommandLine=commandline
[Input]*
File=filename (or "stdin")
--begin content--
real file content here (binary)
--end content--
[Result]+
Name=resultname
File=filename (or "stdout")
{code}
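The reply above can be parsed line by line. A sketch under stated assumptions: each repetition of a section appears literally as an `[Input]` or `[Result]` header line, the parenthesised remarks are documentation rather than wire format, and the body is treated as text (truly binary content needs different framing, as noted for the /complete request):

```python
def parse_task(body):
    """Parse a task reply into header fields, inputs and result specs."""
    task = {"inputs": [], "results": []}
    lines = iter(body.splitlines())
    section = None
    for line in lines:
        if line == "[Input]":
            section = {"file": None, "content": None}
            task["inputs"].append(section)
        elif line == "[Result]":
            section = {"name": None, "file": None}
            task["results"].append(section)
        elif line == "--begin content--":
            # collect everything up to the end marker as this input's content
            content = []
            for content_line in lines:
                if content_line == "--end content--":
                    break
                content.append(content_line)
            section["content"] = "\n".join(content)
        elif "=" in line:
            key, value = line.split("=", 1)
            if section is None:
                task[key] = value          # header: Ticket, Timeout, CommandLine
            elif key == "File":
                section["file"] = value
            elif key == "Name":
                section["name"] = value
    return task
```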
1.1 Task completed
- Successful: POST \http://master/complete?sessionID=SESSIONID&ticket=TICKET
{code:none}
[Result]+
Name=resultname
--begin content--
file content here (binary)
--end content--
{code}
__binary content in an ASCII file? you must use file upload (a multipart request)__
- Failed: GET \http://master/failed?sessionID=SESSIONID&ticket=TICKET
- Reply, fail due to wrong session id: 403 (Forbidden)
- Reply otherwise: 200 (OK)
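As the note above says, raw binary pasted into an ASCII key=value body is fragile; a multipart/form-data upload avoids that. A sketch of building such a body, one file part per result (function name and the octet-stream part headers are illustrative choices, not specified by the protocol):

```python
import uuid

def build_multipart(results):
    """Build a multipart/form-data body for the /complete POST.

    `results` maps result names to raw bytes; each becomes one file
    part, so binary content travels unmangled. Returns (body bytes,
    Content-Type header value carrying the boundary).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, data in results.items():
        parts.append(b"--" + boundary.encode())
        parts.append(
            b'Content-Disposition: form-data; name="%s"; filename="%s"'
            % (name.encode(), name.encode())
        )
        parts.append(b"Content-Type: application/octet-stream")
        parts.append(b"")       # blank line separates headers from data
        parts.append(data)
    parts.append(b"--" + boundary.encode() + b"--")  # closing boundary
    body = b"\r\n".join(parts)
    return body, "multipart/form-data; boundary=" + boundary
```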
1 Implementation Details
1.1 Configuration and Files
- Cluster file: list of computers, one per line
- Parameter file: CSV file with cells separated by |; the headline holds the parameter names and every following line contains one parameter set. All lines must have the same number of cells as the headline!
- Specification of the worker: command line: a Perl-syntax string with variables for the parameters; input files: the name of the file and the parameter name to write into it.
- Result specification: a result consists of a list of name and value pairs, where the name identifies the particular output and the value decides where the output comes from. For example output="stdout", outputfile="out.txt".
- Validation function: a function template that gets the parameter set and the result of the worker and returns success or failure
- Collection function: gets the parameter set and the result of the worker and can do whatever it wants with it (usually print it to a file).
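The parameter file format above can be read with a few lines. A sketch (function name illustrative) that also enforces the equal-cell-count rule:

```python
def read_parameter_file(text):
    """Read the |-separated parameter file.

    The first non-empty line holds the parameter names; each further
    line is one parameter set. Raises ValueError when a row's cell
    count differs from the headline's, as the spec requires.
    """
    lines = [l for l in text.splitlines() if l.strip()]
    names = [c.strip() for c in lines[0].split("|")]
    sets = []
    for row in lines[1:]:
        cells = [c.strip() for c in row.split("|")]
        if len(cells) != len(names):
            raise ValueError(
                "row has %d cells, expected %d" % (len(cells), len(names)))
        sets.append(dict(zip(names, cells)))
    return sets
```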
1.1 NFS awareness
The problem is that if some slaves share files via NFS or another network filesystem, different clients
could overwrite each other's data. Basically there are three cases:
1. the client is copied
1. the client fetches the worker
1. the worker writes its data to a file.
Solutions:
1. a) start and copy the clients serially (very slow) b) copy just one client at a time, but start them in parallel (fast on NFS, slow otherwise)
1. before fetching the worker, the client creates a .lock file. The other clients check for its existence and wait for the worker.
1. every worker is started in a separate directory, named after the session id and the ticket number
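Solution 2 can be sketched with an atomically created lock file; O_CREAT|O_EXCL is atomic on a local filesystem and usable for this purpose on NFSv3 and later (it was unreliable on old NFSv2 servers). The function name and polling interval are illustrative:

```python
import os
import time

def fetch_with_lock(path, fetch, timeout=300.0):
    """Serialise the worker download on a shared (NFS) filesystem.

    The first client to create path + '.lock' performs the download via
    fetch(path); the others poll until the lock disappears and then use
    the already-downloaded worker.
    """
    lock = path + ".lock"
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # another client is downloading: wait for its lock to vanish
        deadline = time.time() + timeout
        while os.path.exists(lock):
            if time.time() > deadline:
                raise TimeoutError("worker download stuck: " + lock)
            time.sleep(1.0)
        return
    try:
        fetch(path)  # download the worker binary to `path`
    finally:
        os.close(fd)
        os.unlink(lock)
```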
1.1 Error detection
Remote shell command (ssh) termination code:
- 0 => Success: the worker has been executed successfully! __?? I think the client goes into the background with nohup.__