[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

bugzilla-daemon at mcs.anl.gov bugzilla-daemon at mcs.anl.gov
Sun Jul 1 11:36:30 CDT 2007


http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72





------- Comment #13 from hategan at mcs.anl.gov  2007-07-01 11:36 -------
(In reply to comment #11)
> (In reply to comment #9)
> > (In reply to comment #7)
> > > (In reply to comment #6)
> > > > (In reply to comment #4)
> > > The same machine (tg-v024) that we had trouble with before acted up again; I
> > > should have removed it before we started the experiment.  If this is the
> > > consensus, we can certainly try it again and make sure this machine is not in
> > > the resource pool.  Another idea is to increase the retry count from 3 to
> > > something higher, maybe 10, 30, etc.?
> > 
> > Not a good idea in the general case, since many times the error may not be
> > something temporary. The swift scheduler takes bad machines into account and
> > attempts to avoid submitting to them.
> >
> Yes, but in this case, Falkon was the only set of resources available
> to Swift, so giving up early means giving up on the entire workflow.  If the
> number of failures did indeed reach the maximum of 3 and that is why
> the workflow didn't complete, I would argue that it would be worthwhile to
> increase this ceiling, at least when running solely with Falkon, or at
> the very least for this experiment, to see the 244 mol run succeed.  Remember
> that Falkon is much faster than GRAM/PBS, so if errors happen quickly, as in the
> case of this tg-v024 node, where they happen in <50 ms, then 1000s of errors can
> happen in a matter of seconds to minutes.  I am not sure what the correct
> solution is, but it is something to consider, as the dynamics of the problem are
> now different than they were prior to Falkon.

By themselves, retries don't solve the problem. There must be a reasonable
chance that a job will finish. If you have 999 busy workers and 1 bad worker,
every retry goes to the bad worker, since it is the only idle one, so
restarting 100 times will still cause the workflow to fail, and the fact that
the restarts happen fast does not help.
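
To make the 999-busy/1-bad argument concrete, here is a minimal sketch in
Python. It is not Swift or Falkon code: the function name, the (busy, healthy)
worker model, and the "hand the job to the first idle worker" dispatch rule are
all assumptions made for illustration.

```python
def run_with_retries(workers, max_retries):
    """Try one job up to 1 + max_retries times on a hypothetical pool.

    Each worker is a (busy, healthy) pair. The dispatcher is assumed to hand
    the job to the first idle worker; a bad worker fails immediately and is
    idle again for the next attempt.
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    for _ in range(1 + max_retries):
        # Collect the health flag of every worker that is currently idle.
        idle = [healthy for busy, healthy in workers if not busy]
        if not idle:
            break  # nothing to dispatch to in this simplified model
        attempts += 1
        if idle[0]:
            return True, attempts
    return False, attempts


# 999 healthy but busy workers plus 1 idle bad worker, as in the example above.
pool = [(True, True)] * 999 + [(False, False)]

for retries in (3, 10, 100):
    print(retries, run_with_retries(pool, retries))
```

Under these assumptions the job fails for every retry limit tried, because
each attempt lands on the same bad worker. Raising the limit only changes how
many attempts are wasted before the failure is reported.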

While I am a bit reluctant to add more options, I guess the number of restarts
could become one in the future.

> 
> Ioan 
> > > 
> > > Ioan
> > > 
> > 
> 

