- one master computer with the server program (acts as HTTP server). SSH and SCP are needed to get the client program to the slaves and start it.
- list of slave computers (host names or IPs). Every slave acts as an HTTP client.
- platform-dependent workers possible
- command specification: command-line pattern with placeholders for variables, and input file generation
- result specification: standard output and/or files
- validation of the results on the master
- list of tasks characterised by parameter sets.
- timeouts and multiple task assignments if necessary (timeout, free resources and so on)
- collecting rules: a) plain concatenation b) blockwise with parameters
- simple online (via web) and offline statistics: which slave did what and which parameter sets failed.
- test mode with different bias to check the configuration.
- NFS aware
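To make the command specification concrete: a command-line pattern can contain placeholders that are filled from a task's parameter set. A minimal Python sketch (the %name% placeholder syntax and the function name are illustrative assumptions, not taken from the actual implementation):

```python
import re

def fill_pattern(pattern, params):
    """Replace every %name% placeholder in the command-line
    pattern with the value from the parameter set."""
    return re.sub(r"%(\w+)%", lambda m: str(params[m.group(1)]), pattern)

# Hypothetical pattern and parameter set (names are illustrative only).
cmd = fill_pattern("worker --count %counter% --label %string%",
                   {"counter": 2, "string": "zwei"})
print(cmd)  # worker --count 2 --label zwei
```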
1.1 Error detection / handling
- errors during connection/authentication (ssh, scp)
- slave dead / client killed (don't care, there are other slaves :-) )
- server crashes or is stopped (all clients should terminate soon afterwards)
- worker terminates without success
- worker doesn't return within the timeout
1.1 Server
* Format of communication is specified in the [Protocol|Simple Parallel Exec#Protocol] section
- initialisation: for every slave: try to start the client (ssh). If that fails: check the ssh connection with a dummy ssh command. On success -> copy the client to the slave using scp and try to start it again.
- on HTTP request for the configuration: reply with the client configuration (i.e. worker name, MD5 checksum and what to do on exit)
- on HTTP request for the worker executable: reply with the binary for the right platform
- on HTTP request for a new task: reply with the next command to execute and all parameters.
- on HTTP request for statistics (normal website): reply with the statistics web page
- on POST: validate the result, mark the task as completed and collect the results
- no more tasks to process: exit and display statistics.
1.1 Client
* Format of communication is specified in the [Protocol|Simple Parallel Exec#Protocol] section
* gets via the command line: session ID, server URL and port
- register at the server and fetch the configuration
- check for the worker: if it is not already on the local filesystem or the MD5 checksum is wrong: fetch it (for the own platform) from the server
- fetch a task
- run the worker
- check the return code: if it failed -> post a failure, otherwise take the results and post them.
- fetch the next task
- die if there are no more tasks or the server is not responding.
* different settings for termination: delete the executable (if fetched), delete the client program, delete the results?
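The worker check above amounts to comparing a local file's MD5 checksum with the one from the configuration. A Python sketch, under the assumption that a missing file simply counts as invalid (the helper names are illustrative, the real client is written differently):

```python
import hashlib

def md5_of(path):
    """Hex-encoded MD5 checksum of a local file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def worker_is_valid(path, expected_md5):
    """Keep a cached worker only if it exists locally and its
    MD5 checksum matches the one from the configuration reply."""
    try:
        return md5_of(path) == expected_md5
    except FileNotFoundError:
        return False  # no cached worker -> fetch it from the server
```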

1 Protocol {anchor:Protocol}
1.1 Configuration
- Request: GET \http://master/config?sessionid=SESSIONID&platform=PLATFORM
- Fail (due to unsupported platform): 415 (Unsupported Media Type)
- Fail (due to wrong session id): 403 (Forbidden)
- Successful reply: list of Key = Value pairs.
{code:none}
Worker=name of the executable
MD5=md5 checksum of the executable
DeleteWorker=Yes/No
DeleteClient=Yes/No
DeleteResults=Yes/No
Ping=#
{code}
* PLATFORM: one of "Linux, Unix, BSD, WinNT, Win95" (TODO: need a better way than Perl's $^O)
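Parsing the Key = Value reply on the client side could look like this (a Python sketch; it assumes one pair per line, as in the example above, and skips blank lines):

```python
def parse_config(body):
    """Parse the configuration reply: one Key=Value pair per line.
    A value keeps everything after the first '='."""
    config = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config

cfg = parse_config("Worker=sim.bin\nMD5=abc123\nPing=30\n")
print(cfg["Worker"], cfg["Ping"])  # sim.bin 30
```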
1.1 Ping (HTTP)
* The ping interval is given in seconds; 0 means no ping.
* The purpose of the ping is that the client realises when the server is stopped, finished or even dead.
- Request: GET \http://master/ping?sessionid=SESSIONID&ticket=TICKET
- Fail (due to wrong session id): 403 (Forbidden)
- Successful, but ticket expired (task already done): 205 (Reset Content)
- Successful (keep on it!): 204 (No Content)
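The client's reaction to the ping replies listed above can be sketched as a small decision function (Python; the action names are illustrative, the status-code meanings follow this section):

```python
def ping_action(status_code):
    """Map the server's ping reply to the client's next step."""
    if status_code == 204:   # No Content: keep working on the task
        return "continue"
    if status_code == 205:   # Reset Content: ticket expired, task done elsewhere
        return "abort-task"
    if status_code == 403:   # Forbidden: wrong session id
        return "die"
    return "die"             # anything else (server stopped or dead) -> terminate

print(ping_action(204))  # continue
```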
1.1 Worker
- Request: GET \http://master/worker?sessionid=SESSIONID
- Fail (due to wrong session id): 403 (Forbidden)
- Fail (due to file not found): 403 (Forbidden)
- Success: binary file
1.1 Task
- Request: GET \http://master/task?sessionid=SESSIONID
- Fail (due to wrong session id): 403 (Forbidden)
- Fail (because no task is left): 503 (Service Unavailable)
- Success:
{code:none}
[Task]
Ticket=# (unique number within the session)
CommandLine=commandline
[Input filename]*
Content=single-line file content
or
Content= <<EOT
multi-line file content here (ASCII)
EOT
[Result name]+
File=filename
{code}
* the * behind a section name means there can be _zero_ or more such sections
* the + behind a section name means there can be _one_ or more such sections
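Reading a Content value in the format above, including the <<EOT heredoc form, can be sketched as follows (Python; it assumes the closing EOT stands alone on its own line, as shown):

```python
def parse_content(lines, i):
    """Read a Content= value starting at lines[i].
    Returns (value, next_index); supports the <<EOT heredoc form."""
    value = lines[i].split("=", 1)[1].strip()
    i += 1
    if value == "<<EOT":                  # multi-line content follows
        body = []
        while lines[i].strip() != "EOT":
            body.append(lines[i])
            i += 1
        return "\n".join(body), i + 1     # skip the closing EOT line
    return value, i

text, nxt = parse_content(["Content= <<EOT", "line one", "line two", "EOT"], 0)
print(text)
```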
1.1 Task completed (successfully or with failure)
- __Successful__: POST \http://master/completed?sessionid=SESSIONID&ticket=TICKET
{code:none}
[Result name]+
Content= single-line file content
or
Content= <<EOT
multi-line file content here (ASCII)
EOT
{code}
- __Failed__: GET \http://master/failed?sessionid=SESSIONID&ticket=TICKET
- Reply (due to wrong session id): 403 (Forbidden)
- Reply (last task, no more tasks left): 204 (No Content)
- Reply (otherwise): 202 (Accepted)
* binary content is not supported
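Building the POST body for a successful task could look like this (a Python sketch; values containing a newline use the <<EOT form, single-line values are written directly):

```python
def format_results(results):
    """Build the POST body: one [Result name] section per result,
    with single-line or <<EOT heredoc Content."""
    parts = []
    for name, text in results.items():
        parts.append(f"[{name}]")
        if "\n" in text:                  # multi-line -> heredoc form
            parts.append("Content= <<EOT")
            parts.append(text)
            parts.append("EOT")
        else:
            parts.append(f"Content={text}")
    return "\n".join(parts) + "\n"

print(format_results({"Result": "42"}))
# prints:
# [Result]
# Content=42
```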
1.1 Client Starting Status
* This is used for communication between threads in the server.
- GET \http://master/clientstatus?slaveid=SLAVEID&status=STATUS
- STATUS: -1: no ssh; 0: error while starting or copying; 1: started
- Reply (due to wrong slave id): 403 (Forbidden)
- Reply (otherwise): 204 (No Content)
1.1 Client Died Notification
* The client notifies the server that it is about to die
- GET \http://master/died?session=SESSIONID&normal=yes/no&reason=REASON
- REASON: a string that describes why
- Reply (due to wrong session id): 403 (Forbidden)
- Reply (otherwise): 204 (No Content)
1 Implementation Details
1.1 Configuration and Files
- Server config: See __server.sample.conf__ in the side bar.
- Task file: __tasks.csv__ : a CSV file with cells separated by |, parameter names in the headline, and every following line contains one parameter set. All lines must have the same number of cells as the headline!
Comments can occur before the headline; they start with # and run to the end of the line.
{code:none}
#comment
string|counter
"eins"|1
"zwei"|2
{code}
- Worker config: See __worker.sample.conf__ in the side bar.
- Input specification: The input consists of the command line and one or more files. In the example above the parameter "counter" is passed as a command-line argument and the parameter "string" is written to the file input.file. This file is used as standard input for the worker. One can specify other files as well, in case the worker reads them.
- Result specification: A result has a name and a filename to get the result values from. In the above example one result is called "Result" and it comes from the file result.file, which is the standard output of the worker. The second result is called "Output" and is read from output.file. If the worker doesn't write to this file the result will be empty.
- Validation: standard implementations are provided, and a custom implementation can be supplied by the user as a Perl function. A validation function gets the result of the worker and returns success or failure. (See Validate.pm)
- Collection: standard implementations are provided, and a custom implementation can be supplied by the user as a Perl function. A collection function gets the task description (number, parameter set) and the result of the worker and can do whatever it wants with it (usually it writes to a file). (See Collect.pm)
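Reading the tasks.csv format described above can be sketched as follows (Python; the real parser is Perl and its quoting rules are not specified here, so quotes are kept verbatim):

```python
def parse_tasks(text):
    """Parse tasks.csv: '#' comment lines before the headline,
    '|'-separated cells, first non-comment line names the parameters."""
    rows = [l for l in text.splitlines() if l and not l.startswith("#")]
    header = rows[0].split("|")
    tasks = []
    for line in rows[1:]:
        cells = line.split("|")
        if len(cells) != len(header):   # every row must match the headline
            raise ValueError("row has wrong number of cells: " + line)
        tasks.append(dict(zip(header, cells)))
    return tasks

tasks = parse_tasks('#comment\nstring|counter\n"eins"|1\n"zwei"|2\n')
print(tasks[0])  # {'string': '"eins"', 'counter': '1'}
```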
1.1 Server
- The server is implemented in Perl.
- Perl has no reasonable way to use shared memory across multiple threads. Since the program is written using Perl objects and objects can't be shared, I decided to make a __serial__ implementation first.
- That means only one request can be answered at a time
- However, I managed to implement the swarming in parallel, which means that starting is reasonably fast.
- Consequence: less suitable for very small tasks (short computation time) with large input/response data
1.1 NFS awareness
The problem is that if some slaves share files via NFS or another network filesystem,
different clients could overwrite each other's files. Basically there are three points where this can occur:
1. the client is copied
1. the client fetches the worker
1. the worker writes its data to a file.
Solutions:
1. a) start and copy the clients serially (very slow) b) start the first client and copy the client if necessary. After that, start the remaining clients in parallel (quite fast, current implementation :-))