- Master: the computer where the server runs
- Slave: one of many computers that do the work
- Server: the program that coordinates the process
- Client: the program that runs on the slaves
- Worker: the program that does the computation (can be any program)
- Task: a set of parameters/settings for the Worker
- Result: a file with the results of the computation
- SessionID: unique number for client-server communication (should be unique over multiple runs)
- Ticket: computation identification, unique within one run
1 Specification
1.1 Cluster
- One Master
- Many Slaves
- primarily Linux machines, but the design is platform independent
- any TCP/IP network connection (may be non-permanent, but should not be :-))
- SSH (public key authentication) on the Slaves
- SCP secure copy (public key authentication) on the Slaves
- HTTP (any open port on the Master, preferably port 80)
1.1 Features
- one master computer with the server program (an HTTP server). SSH and SCP are needed to get the client program onto the slaves and start it.
- list of slave computers (host names or IPs). Every slave acts as an HTTP client.
- platform-dependent workers possible
- command specification: command-line pattern with placeholders for variables and input file generation
- result specification: standard output and/or files
- validation of the results (on the master)
- list of tasks characterised by parameters
- timeouts and multiple task assignments if necessary (timeout, free resources and so on)
- collecting rules: a) plain concatenation b) blockwise with parameters
- simple statistics: which slave did what and which parameter sets failed
- NFS aware
1.1 Error detection/handling
- error during connection/authentication (ssh, scp)
- slave dead / client killed (don't care, there are other slaves :-))
- server breaks or gets stopped (all clients should terminate soon afterwards)
- worker terminates without success
- worker doesn't return within the timeout
1.1 Server
* The format of the communication is specified in the [Simple Parallel Exec#Protocol] section.
- initialisation: for every slave, try to start the client (ssh). If that fails, check the ssh connection with a dummy ssh command. On success, copy the client to the slave using scp and try to start it again.
- on HTTP request for the configuration: reply with the client configuration (i.e. worker name, MD5 checksum and what to do on exit)
- on HTTP request for the worker executable: reply with the binary for the right platform
- on HTTP request for a new task: reply with the next command to execute and all parameters. __Open question: is there a standard file format for a task?__
- on post: validate the result (validation happens on the server), mark the task as done and collect the results
- no more tasks to process: exit and display statistics. __Open question: statistics during runtime, e.g. via a web page?__
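The initialisation step can be sketched roughly as follows. The client path, the argument layout and the injectable `run` callable are illustrative assumptions, not part of the spec:

```python
import subprocess

def start_client(slave, run=subprocess.call):
    """Try to start the client on a slave via ssh; on failure, verify the
    ssh connection and copy the client over with scp before retrying.
    Paths and argument order are illustrative assumptions, not the spec.
    `run` executes a command list and returns its exit code."""
    start = ["ssh", slave, "nohup ./spe_client SESSIONID http://master:80 &"]
    if run(start) == 0:
        return "started"
    if run(["ssh", slave, "echo"]) != 0:
        return "unreachable"                      # connection/authentication error
    run(["scp", "spe_client", slave + ":"])       # client missing: copy it over
    return "started" if run(start) == 0 else "failed"
```

Injecting `run` keeps the decision logic testable without a real cluster.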
1.1 Client
* The format of the communication is specified in the [Simple Parallel Exec#Protocol] section.
* gets via the command line: session ID, server URL and port
- register at the server and fetch the configuration
- check for the worker: if it is not already on the local filesystem, or its MD5 checksum is wrong, fetch it (for the own platform) from the server
- fetch a task
- run the worker
- check the return code: if it failed, post a failure; otherwise take the results and post them
- fetch the next task
- die if there are no more tasks or the server is not responding
* different settings for termination: delete the executable (if fetched), delete the client program, delete results?

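The client steps above amount to a simple fetch/run/report loop. In this sketch the four callables stand in for the HTTP and process-execution details, which are not specified here:

```python
def client_loop(fetch_task, run_worker, post_result, post_failure):
    """Main loop of the client: fetch tasks until the server reports
    there are none left, run the worker, and report success or failure.
    The four callables are placeholders for the HTTP/exec machinery."""
    while True:
        task = fetch_task()              # None once the server replies 503
        if task is None:
            break                        # no more tasks: die
        returncode, results = run_worker(task)
        if returncode != 0:
            post_failure(task)           # GET /failed
        else:
            post_result(task, results)   # POST /complete
```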
1 Protocol
1.1 Configuration
- Request: GET \http://master/config?sessionID=SESSIONID
- Fail (due to a wrong session ID): 403 (Forbidden)
- Successful reply: list of Key=Value pairs.
{code:none}
DeleteWorker=Yes/No
DeleteClient=Yes/No
Ping=#
{code}
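A minimal client-side sketch for parsing such a reply, assuming surrounding whitespace is insignificant:

```python
def parse_config(body):
    """Parse a Key=Value reply body into a dict; skips blank lines and
    lines without an '=' sign."""
    config = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config
```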
1.1 Ping (HTTP)
* The ping interval is given in seconds; 0 means no ping.
* The purpose of the ping is that the client notices if the server has stopped, finished or even died.
- Request: GET \http://master/ping?sessionID=SESSIONID&ticket=TICKET
- Fail due to a wrong session ID: 403 (Forbidden)
- Successful, but ticket expired (task already done): 205 (Reset Content)
- Successful (keep at it!): 204 (No Content)
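A sketch of how a client might map these ping replies to actions; the action names are hypothetical:

```python
def handle_ping_status(status):
    """Map the ping HTTP status codes from the protocol to client actions.
    204: task still wanted, keep working; 205: ticket expired, abandon the
    current task and fetch a new one; 403: session invalid, terminate."""
    if status == 204:
        return "continue"
    if status == 205:
        return "abandon-task"
    if status == 403:
        return "terminate"
    return "server-error"    # anything else: treat the server as broken
```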
1.1 Worker
- Request: GET \http://master/worker?sessionID=SESSIONID&name=NAME&platform=PLATFORM
- Fail (due to a wrong session ID or a wrong name): 403 (Forbidden)
- Fail (due to an unsupported platform): 415 (Unsupported Media Type)
- Success: binary file
1.1 Task
- Request: GET \http://master/task?sessionID=SESSIONID
- Fail (due to a wrong session ID): 403 (Forbidden)
- Fail (because no task is left): 503 (Service Unavailable)
- Success:
{code:none}
Ticket=# (unique number within the session, usually on the order of the number of tasks)
Timeout=# (in seconds)
CommandLine=commandline
[Input]*
File=filename (or "stdin")
--begin content--
real file content here (binary)
--end content--
[Result]+
Name=resultname
File=filename (or "stdout")
{code}
* the * behind a section means there can be _zero_ or more such sections
* the + behind a section means there can be _one_ or more such sections
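Assuming each section appears in an actual reply as a bare [Input] or [Result] header (the * and + being only cardinality markers), a simplified ASCII-only parser could look like this:

```python
def parse_task(body):
    """Parse a task reply into ticket, command line, inputs and results.
    A simplified sketch: assumes a well-formed, ASCII-only body."""
    lines = body.splitlines()
    task = {"inputs": [], "results": []}
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("Ticket="):
            task["ticket"] = line.split("=", 1)[1]
        elif line.startswith("CommandLine="):
            task["commandline"] = line.split("=", 1)[1]
        elif line == "[Input]":
            fname = lines[i + 1].split("=", 1)[1]
            i += 3                        # skip File= and --begin content--
            content = []
            while lines[i] != "--end content--":
                content.append(lines[i])
                i += 1
            task["inputs"].append((fname, "\n".join(content)))
        elif line == "[Result]":
            name = lines[i + 1].split("=", 1)[1]
            fname = lines[i + 2].split("=", 1)[1]
            task["results"].append((name, fname))
            i += 2
        i += 1
    return task
```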
1.1 Task completed
- Successful: POST \http://master/complete?sessionID=SESSIONID&ticket=TICKET
{code:none}
[Result]+
Name=resultname
Content=<<ENDOFCONTENT
file content here (ASCII)
ENDOFCONTENT
{code}
- Failed: GET \http://master/failed?sessionID=SESSIONID&ticket=TICKET
- Reply on a wrong session ID: 403 (Forbidden)
- Reply otherwise: 200 (OK)
* binary content is not supported
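Building the POST body for /complete from a list of (name, content) pairs could look like this (ASCII content only, and assuming actual messages use a bare [Result] header):

```python
def build_complete_body(results):
    """Build the POST body for /complete from (name, content) pairs,
    using the heredoc-style delimiter from the protocol."""
    parts = []
    for name, content in results:
        parts.append("[Result]")
        parts.append("Name=" + name)
        parts.append("Content=<<ENDOFCONTENT")
        parts.append(content)
        parts.append("ENDOFCONTENT")
    return "\n".join(parts) + "\n"
```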
1 Implementation Details
1.1 Configuration and Files
- Server config: list of slave computers, one host name or IP per line. TODO: describe the remaining settings.
- Parameter file: a CSV file with cells separated by |; the parameter names are in the headline and every following line contains one parameter set. All lines must have the same number of cells as the headline!
- Specification of the worker: command line: a Perl-syntax string with variables for the parameters; input files: the name of the file and the parameter name to write into it.
- Result specification: a result consists of a list of name - value pairs, where the name identifies the particular output and the value decides where the output comes from. For example myoutput="stdout", myfileoutput="out.txt".
- Validation: standard implementations are provided, and a custom implementation can be supplied by the user as a Perl function. A validation function gets the result of the worker and returns success or failure.
- Collection: standard implementations are provided, and a custom implementation can be supplied by the user as a Perl function. A collection function gets the task description (number, parameter set) and the result of the worker and can do whatever it wants with it (usually it writes to a file).
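The parameter file and the command-line pattern described above can be handled roughly as follows; the $name placeholder syntax is an assumption about what "Perl-syntax string with variables" means:

```python
import re

def parse_parameter_file(text):
    """Parse the |-separated parameter file: the first line holds the
    parameter names, each following line one parameter set."""
    lines = [l for l in text.splitlines() if l.strip()]
    names = [c.strip() for c in lines[0].split("|")]
    sets = []
    for line in lines[1:]:
        cells = [c.strip() for c in line.split("|")]
        assert len(cells) == len(names), "row width must match the headline"
        sets.append(dict(zip(names, cells)))
    return sets

def fill_command(template, params):
    """Substitute $name variables in a command-line template (the exact
    placeholder syntax is an assumption)."""
    return re.sub(r"\$(\w+)", lambda m: params[m.group(1)], template)
```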
1.1 NFS awareness
The problem is that if some slaves share files via NFS or another network filesystem, different clients
could overwrite each other's data. Basically there are three points where this occurs:
1. the client is copied
1. the client fetches the worker
1. the worker writes its data to a file.
Solutions:
1. a) start and copy the clients serially (very slow) b) copy just one client at a time, but start them in parallel (fast on NFS, slow otherwise)
1. before fetching the worker, the client creates a .lock file. The other clients check for its existence and wait for the worker.
1. every worker is started in a separate directory, named after the session ID and the ticket number
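Solution 2 can be sketched with an exclusive-create lock file. Note that O_EXCL semantics over NFS are version-dependent, so this is only a sketch of the idea:

```python
import os
import time

def fetch_worker_with_lock(path, fetch, timeout=60):
    """Download the worker once per shared filesystem: the first client
    creates path + '.lock' exclusively, calls fetch(path) to write the
    worker, then removes the lock; the other clients just wait until the
    worker file appears."""
    lock = path + ".lock"
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        deadline = time.time() + timeout
        while not os.path.exists(path):      # another client is fetching
            if time.time() > deadline:
                raise TimeoutError("worker did not appear")
            time.sleep(0.1)
        return
    try:
        fetch(path)
    finally:
        os.close(fd)
        os.remove(lock)
```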
1.1 Error detection
Remote shell command (ssh) termination code:
- 0 => Success: the executed command completed successfully!
- otherwise => Failure: possible reasons: the connection failed, the program was not found, or it terminated without success.
To check the connection and the authentication:
{code:shell}ssh host echo{code}
- return 0 (success): the connection is OK and the machine has a shell. (TODO: check for Windows and Mac machines)