[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

bugzilla-daemon at mcs.anl.gov bugzilla-daemon at mcs.anl.gov
Sat Jun 30 15:33:50 CDT 2007


http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72


iraicu at cs.uchicago.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |iraicu at cs.uchicago.edu




------- Comment #4 from iraicu at cs.uchicago.edu  2007-06-30 15:33 -------
Hi again,
Here is an update on yesterday's 244-molecule run.  The experiment ran further
than before, but it still did not complete.  240 molecules completed
successfully (in the previous run, no molecule finished), but 4 molecules still
did not finish.

Here is the breakdown of the tasks:
Exit Code 0: 20695 tasks
Exit Code -3: 6 tasks
Exit Code -1: 3585 tasks
=====================
Total: 24286 tasks

The 3 usual Falkon graphs can be found here:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/executor_graph.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/summary_graph.jpg
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/task_graph.jpg

The relevant Falkon logs are here (there are more if people are interested;
over 600MB of logs in total):
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/Falkon_logs/
The Swift log is here:
http://people.cs.uchicago.edu/~iraicu/research/docs/MolDyn/244-mol-failed/Swift_logs/MolDyn-244-63ar6atbg2ae1.log

From Falkon's point of view, things looked fine: tasks came in, they got
processed, and they got returned.

We haven't had a chance to analyze the Swift end of the logs yet, so we don't
know for sure what happened.  We fixed the potential synchronization issue
Mihael pointed out.  We also fixed a badly handled exception in the Falkon
provider, which would give up very easily and exit the Falkon provider thread
on any exception, even a non-fatal one.  This time around, we changed the logic
to simply print the exception, if there was one, and continue rather than exit
the Falkon provider.  Personally, I think this exception-handling logic was
causing the Falkon provider to exit prematurely, and hence not send any more
tasks to Falkon... note that Swift was setting the status of submitted tasks to
the Falkon provider in a separate thread, which did not necessarily exit when
the Falkon provider did, and hence we had the scenario in which Swift thought
it sent out more tasks than Falkon really saw.
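
For illustration only, here is a minimal sketch of that kind of change in a
provider-style dispatch loop; the class, interface, and method names below are
hypothetical and not taken from the actual Falkon provider source:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch -- not the actual Falkon provider code.
public class FalkonDispatchLoop implements Runnable {

    // Placeholder for whatever Swift hands to the provider.
    public interface Task {
        void submitToFalkon() throws Exception;
    }

    private final BlockingQueue<Task> queue = new LinkedBlockingQueue<Task>();
    private volatile boolean running = true;

    public void enqueue(Task t) {
        queue.offer(t);
    }

    public void run() {
        while (running) {
            try {
                Task task = queue.take();
                task.submitToFalkon();
            } catch (InterruptedException ie) {
                // Clean shutdown request: stop the loop.
                Thread.currentThread().interrupt();
                running = false;
            } catch (Exception e) {
                // Old behavior: bail out of the thread on any exception,
                // silently dropping every task queued after it.
                // New behavior: print the exception and keep dispatching
                // unless the error is actually fatal.
                e.printStackTrace();
            }
        }
    }
}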

Now, on to the issue that I think stopped this experiment.  The last thing
Swift printed on the console was a "stack overflow error"; I don't think it was
printed in the logs, just on the console.  I believe this is a JVM error that
occurs when a thread recurses too deeply and the thread's stack size is not
large enough.  We saw the same error on Thursday in some synthetic experiments
with 20K sleep jobs, but it was not repeatable every time.  Does anyone have
any idea where this stack overflow could be coming from?
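
As a general illustration of the JVM behavior (not a diagnosis of where Swift
is recursing), a java.lang.StackOverflowError is thrown once a thread's
recursion exhausts its stack; the stack can be enlarged with the -Xss JVM
option, or per thread via the Thread constructor that takes a stackSize
argument:

public class StackDepthDemo {

    // Recurse until the stack is exhausted and report how deep we got.
    static int depth(int n) {
        try {
            return depth(n + 1);
        } catch (StackOverflowError e) {
            return n;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Depth with the default stack size (or whatever -Xss was given).
        System.out.println("default stack depth: " + depth(0));

        // A thread created with an explicit 8 MB stack; the stackSize value
        // is only a hint to the JVM, but it normally permits much deeper
        // recursion than the default.
        Thread big = new Thread(null, new Runnable() {
            public void run() {
                System.out.println("8MB stack depth: " + depth(0));
            }
        }, "big-stack", 8L * 1024 * 1024);
        big.start();
        big.join();
    }
}

If the overflow turns out to be legitimate deep recursion rather than a bug,
bumping the stack on the Swift side (e.g. java -Xss4m ...) might be a quick
workaround while we track down the real cause.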

Ioan


-- 
Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.


