Informatik, Modellbau und Privates von Georg

Changes of Simple Parallel Exec from #8 to #9

Changed lines at line 34
119: 1.1 Error detection and handling
120: - error during connection/authentication (ssh, scp)
121: - machine dead (for whatever reason)
122: - program terminated without success
123: - program doesn't return within the timeout
124: - server crashes or gets stopped
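The two program-related error cases (unsuccessful termination and timeout) could be detected on the client roughly as follows. This is only a sketch: the function name `run_worker` and the three outcome strings are assumptions, not part of the design above.

```python
import subprocess


def run_worker(argv, timeout_seconds):
    """Run a worker command and classify the outcome.

    Returns "ok", "failed" (non-zero exit code or not startable) or "timeout".
    """
    try:
        completed = subprocess.run(argv, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return "timeout"
    except OSError:
        # The executable is missing or cannot be started.
        return "failed"
    return "ok" if completed.returncode == 0 else "failed"
```

The timeout would come from the `Timeout` field of the task reply.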
125: 1.1 Server
126: - initialisation: for every slave, try to start the client (ssh). If that fails, check the ssh connection with a dummy ssh command. On success, copy the client to the slave using scp and try to start it again.
127: - on http request for configuration: reply with the client configuration (e.g. what to do on exit)
128: - on http request for worker executable: reply with the binary for the right platform
129: - on http request for new task: reply with the next command to execute and all its parameters.
130: - on post: validate the result, mark the parameter set as done and collect the results
131: - no more parameter sets to process: display statistics and exit.
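The initialisation step can be sketched as a small decision function. The three callables `start_client`, `check_ssh` and `copy_client` are hypothetical hooks standing in for the real ssh/scp invocations, which are not specified here.

```python
def start_client_on(slave, start_client, check_ssh, copy_client):
    """Return True if the client ends up running on `slave`.

    start_client(slave) -> bool : try to start the client via ssh
    check_ssh(slave)    -> bool : run a dummy ssh command
    copy_client(slave)          : copy the client via scp
    (all three are assumed helpers, injected for illustration)
    """
    if start_client(slave):
        return True
    if not check_ssh(slave):
        # Connection or authentication problem: give up on this slave.
        return False
    # ssh works, so the client is probably just not installed yet.
    copy_client(slave)
    return bool(start_client(slave))
```

Passing the ssh/scp operations in as arguments keeps the decision logic testable without a cluster.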
132: 1.1 Client
133: - gets via command line: session ID, command name (path and name of the executable), MD5 checksum of the executable, server name and port
134: - register at the server and fetch configuration
135: - check for the executable: if it is missing or its MD5 checksum is wrong, fetch it (for the client's own platform) from the server
136: - fetch a task
137: - run worker
138: - check return code: if the worker failed, post the failure; otherwise take the results and post them.
139: - fetch next task.
140: - die if there is no more task or the server is not responding.
141: - different settings for termination: delete executable (if fetched), delete the client program?
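The executable check in the client could look roughly like this. The function name `needs_fetch` is an assumption; the checksum is expected as a hex string, which is also an assumption since the command line format is not fixed above.

```python
import hashlib
import os


def needs_fetch(path, expected_md5):
    """Return True if the executable is missing or its MD5 checksum differs."""
    if not os.path.isfile(path):
        return True
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() != expected_md5.lower()
```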
142:
143: 1 Protocol
144: 1.1 Configuration
145: - Request: GET \http://master/config?sessionID=SESSIONID
146: - Fail (due to wrong session id): 403 (Forbidden)
147: - Successful Reply: List of Key = Value pairs.
148: {code:none}
149: DeleteWorker=Yes/No
150: DeleteClient=Yes/No
151: Ping=#
152: {code}
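A minimal sketch of how the client might read this reply. The defaults used when a key is missing (`No` and `0`) are assumptions, as are the function names.

```python
def parse_config(text):
    """Parse 'Key=Value' lines into a dict, skipping blank lines."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def client_settings(config):
    """Convert the raw strings into typed settings (defaults are assumed)."""
    return {
        "delete_worker": config.get("DeleteWorker", "No") == "Yes",
        "delete_client": config.get("DeleteClient", "No") == "Yes",
        "ping_interval": int(config.get("Ping", "0")),  # seconds, 0 = no ping
    }
```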
153: 1.1 Ping (HTTP)
154: - Ping interval is given in seconds. 0 for no ping.
155: - The purpose of the ping is to let the client notice when the server has been stopped, has finished, or has died.
156: - Request: GET \http://master/ping?sessionID=SESSIONID&ticket=TICKET
157: - Fail due to wrong session id: 403 (Forbidden)
158: - Successful, but ticket expired (task already done): 205 Reset Content
159: - Successful (keep on it!): 204 (No Content)
159: - Successful (keep on it!): 204 (No Content)
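On the client side, the ping replies could be mapped to actions as sketched below. The action names are made up for illustration; treating any other status (or no response at all, passed as `None`) like a dead server is an assumption that follows the "die if the server is not responding" rule of the client.

```python
def ping_action(status):
    """Map a ping reply to a client action.

    `status` is the HTTP status code, or None if the server did not respond.
    """
    if status == 204:
        return "continue"      # keep working on the current task
    if status == 205:
        return "abandon_task"  # ticket expired, the task is already done
    # 403 (wrong session id), no response, or anything unexpected.
    return "exit"


def should_ping(interval_seconds):
    """A ping interval of 0 disables pinging."""
    return interval_seconds > 0
```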
160: 1.1 Worker
161: - Request: GET \http://master/worker?sessionID=SESSIONID&name=NAME&platform=PLATFORM
162: - Fail (due to wrong session id, or wrong name): 403 (Forbidden)
163: - Fail (due to unsupported platform): 415 Unsupported Media Type
164: - Success: binary file
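The worker request URL could be assembled like this. The format of the PLATFORM value is not defined above; the `system-machine` tag used here (for example `linux-x86_64`) is purely an assumption.

```python
import platform
from urllib.parse import urlencode


def platform_tag():
    """Hypothetical platform identifier, e.g. 'linux-x86_64'."""
    return f"{platform.system()}-{platform.machine()}".lower()


def worker_url(master, session_id, name, platform_name=None):
    """Build the GET URL for fetching the worker executable."""
    query = urlencode({
        "sessionID": session_id,
        "name": name,
        "platform": platform_name or platform_tag(),
    })
    return f"http://{master}/worker?{query}"
```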
165: 1.1 Task
166: - Request: GET \http://master/task?sessionID=SESSIONID
167: - Fail (due to wrong session id): 403 (Forbidden)
168: - Fail (because no task left): 503 Service Unavailable
169: - Success:
170: {code:none}
171: Ticket=# (unique number (usually 5 digits))
172: Timeout=# (in seconds)
173: CommandLine=commandline
174: [Input]*
175: File=filename (or "stdin")
176: --begin content--
177: real file content here (binary)
178: --end content--
179: [Result]+
180: Name=resultname
181: File=filename (or "stdout")
182: {code}
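A rough parser for the task reply, to make the format concrete. It assumes that lines end with `\n`, that the section and content markers stand alone on their lines, and that file content never contains a line equal to the end marker. None of this is fixed by the format above, so truly binary content would need a stricter framing (e.g. a length field).

```python
BEGIN = b"--begin content--"
END = b"--end content--"


def parse_task(data):
    """Parse a task reply (bytes) into header fields, inputs and results."""
    task = {"header": {}, "inputs": [], "results": []}
    current = task["header"]
    lines = data.split(b"\n")
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line == b"[Input]":
            current = {}
            task["inputs"].append(current)
        elif line == b"[Result]":
            current = {}
            task["results"].append(current)
        elif line == BEGIN:
            start = i
            while i < len(lines) and lines[i] != END:
                i += 1
            current["content"] = b"\n".join(lines[start:i])
            i += 1  # skip the end marker
        elif b"=" in line:
            key, _, value = line.partition(b"=")
            current[key.decode()] = value.decode()
    return task
```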
183: 1.1 Task completed
184: - Successful: POST \http://master/complete?sessionID=SESSIONID&ticket=TICKET
185: {code:none}
186: [Result]+
187: Name=resultname
188: --begin content--
189: file content here (binary)
190: --end content--
191: {code}
192: - Failed: GET \http://master/failed?sessionID=SESSIONID&ticket=TICKET
193: - Reply Fail due to wrong session id: 403 (Forbidden)
194: - Reply Otherwise: 200 OK
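The two report requests and the POST body could be built as follows. The function names are assumptions, and the body layout simply mirrors the block above with `\n` line endings (also an assumption).

```python
from urllib.parse import urlencode


def report_request(master, session_id, ticket, success):
    """Return (HTTP method, URL) for reporting a finished or failed task."""
    query = urlencode({"sessionID": session_id, "ticket": ticket})
    if success:
        return "POST", f"http://{master}/complete?{query}"
    return "GET", f"http://{master}/failed?{query}"


def build_result_body(results):
    """Serialise a list of (name, content_bytes) pairs into the POST body."""
    parts = []
    for name, content in results:
        parts.append(b"[Result]\n")
        parts.append(b"Name=" + name.encode() + b"\n")
        parts.append(b"--begin content--\n")
        parts.append(content + b"\n")
        parts.append(b"--end content--\n")
    return b"".join(parts)
```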
195: 1 Implementation Details
196: 1.1 Configuration and Files
197: - Cluster file: list of computers, one per line
198: - Parameter file: csv file with cells separated by |. The header line holds the parameter names and every following line contains one parameter set. Every line must have the same number of cells as the header line!
199: - Specification of worker: Command line: a Perl-syntax string with variables for the parameters; Input files: name of the file and name of the parameter whose value is written into it.
200: - Result specification: A result consists of a list of name and value pairs, where the name identifies the particular output and the value specifies where the output comes from. For example output="stdout", outputfile="out.txt".
201: - Validation function: function template that gets the parameter set and the result of the worker and returns whether the result is valid
202: - Collection function: gets the parameter set and the result of the worker and can do whatever it wants with them (usually print them to a file).
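Reading the parameter file and filling in the command line might look like the sketch below. Python's `string.Template` is used here as a stand-in for the Perl-style `$name` variables; that substitution, the function names and the example parameters are all assumptions.

```python
import csv
import io
from string import Template


def read_parameter_sets(text):
    """Parse a |-separated parameter file into a list of dicts."""
    rows = list(csv.reader(io.StringIO(text), delimiter="|"))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    sets = []
    for number, row in enumerate(body, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {number}: expected {len(header)} cells, got {len(row)}")
        sets.append(dict(zip(header, row)))
    return sets


def build_command_line(template, parameters):
    """Replace $name variables in the template with parameter values."""
    return Template(template).substitute(parameters)
```

For example, the template `./sim --size $size --seed $seed` combined with the set `size=10, seed=1` yields `./sim --size 10 --seed 1`.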
203: 1.1 NFS awareness
204: The problem is that if some slaves share files via NFS or another network filesystem, different clients
205: may overwrite each other's data. Basically there are three cases:
206: 1. the client is copied
207: 1. the client fetches the worker
208: 1. the worker writes its data to a file.
209: Solutions:
210: 1. a) start and copy clients serially (very slow) b) copy just one client at a time, but start them in parallel (fast on NFS, slow otherwise)
211: 1. before fetching the worker, the client creates a .lock file. The other clients check for its existence and wait for the worker.
212: 1. every worker is started in a separate directory, given by the session id and the ticket number
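Solutions 2 and 3 could be sketched like this. The function names and the directory layout `base/sessionID/ticket` are assumptions. The lock relies on exclusive file creation (`O_EXCL`), which is not reliable on old NFS versions, so this should be checked for the target cluster.

```python
import os


def try_acquire_lock(lock_path):
    """Atomically create the lock file; True if this process created it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another client is already fetching the worker.
        return False
    os.close(fd)
    return True


def task_directory(base, session_id, ticket):
    """Create and return a separate working directory for one task."""
    path = os.path.join(base, str(session_id), str(ticket))
    os.makedirs(path, exist_ok=True)
    return path
```

A client that does not get the lock would wait until the worker file is complete instead of fetching it again.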
