[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

bugzilla-daemon at mcs.anl.gov bugzilla-daemon at mcs.anl.gov
Fri Jul 6 11:43:40 CDT 2007


http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72





------- Comment #17 from iraicu at cs.uchicago.edu  2007-07-06 11:43 -------
(In reply to comment #16)
> The latest Karajan fix seems to work (i.e. Workflow compiles). Falcon
> experiences some problems. Ioan, please post the details of the current
> problems here.
> 
I made some changes in the last few days to fix some known issues I have had
with Falkon, although none of these issues were relevant to the MolDyn runs we
have been making recently.  I ran some small sanity checks after making the
changes, and everything seemed fine.  Then, yesterday, when we tried the 244
mol run again, Falkon seemed to be having problems within the first 100 jobs.

It looked like notifications to the workers weren't always going through (which
has never happened before). This would cause some number of CPUs to sit idle
until Falkon recovered (by default, it cleans up every 60 sec).  I ran some
more synthetic tests from my command-line client (independent of Swift), and
the problem reproduced on each of the 3 or 4 attempts I made in a row.  Then I
even managed to crash the GT4 container: it locked up and would not do
anything.  This was also a first; I have never before managed to get the GT4
container into a state where it would not answer any more WS calls while the
CPU on the machine was idle.  On the surface, it looked like all hell broke
loose....

I added some more debugging statements and turned on all possible debugging...
and a few hours later (last night), I tried again and everything was working
perfectly!  I ran some 100K jobs through it and it seemed to work perfectly.  I
even disabled all the debugging that I had added, just to see if that made a
difference, and things were still perfect.  It blows my mind what could have
happened to go from something that was repeatable every time to something
that I can't reproduce, all in the same environment, configuration, and
hardware.  I'll dig around some more to try to make sense of what happened,
and perhaps we can try the 244 mol run again once I am convinced that I have
not broken anything with my latest changes from earlier this week.

Ioan

PS: I could also try reverting to the earlier version from before my changes,
especially as the changes I made were not geared toward the MolDyn app but
were more general.

> Nika
> 




