[Swift-devel] [Bug 72] Campaign for scaling wf up to 244 molecules

bugzilla-daemon at mcs.anl.gov bugzilla-daemon at mcs.anl.gov
Sun Jul 1 00:47:49 CDT 2007


http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=72





------- Comment #8 from iraicu at cs.uchicago.edu  2007-07-01 00:47 -------
(In reply to comment #5)
> First of all, can you commit the changes to SVN?
> 
Yong made the changes; I am sure he will commit them the first chance he gets!

> (In reply to comment #4)
> > We fixed the potential synchronization issue
> > Mihael pointed out.
> 
> There were two.
> 
I meant to say "issues".  From the discussion I had with Yong, I believe he
addressed both of them.

> > We also fixed a badly handled exception we had in the
> > Falkon provider, that would give up very easily and exit the Falkon provider
> > thread in case of an exception, even if it wasn't a fatal one.  This time
> > around, we changed the logic to simply print the exception, if there were any,
> > and not exit the Falkon provider, just continue.  Personally, I think this
> > logic on handling exceptions in the Falkon provider was causing the Falkon
> > provider to exit prematurely, and hence not send any more tasks to Falkon...
> 
> I can't seem to find anything that would fit that profile in the provider code.
> Can you be more specific? If the provider was setting the status of the task to
> failed, then it doesn't matter. Swift retries failed things.
> 
Sure.  Take a look at SubmissionThread.java, and notice that the thread will
live as long as exit is not set:
Line 54:    public void run() {
        while(!exit) {

exit is initially set to false, but as soon as anything sets it to true, the
submission thread will exit.

Now look at the setStatus(Executable[]) function at the end of the file:
Line 98:    public void setStatus (Executable execs[]) {
        try {
            for (int i=0; i<execs.length; i++) {
                Task task = rp.removeTask(execs[i].getId());
                task.setStatus(Status.FAILED);
                System.out.println("*****************************SUPER_DEBUG: setStatus(execs): " + i);
            }
        } catch (Exception e) {
            //no-op
            e.printStackTrace();
        }
        //this.exit = true;
    }

Notice the line that set exit to true (it is shown commented out above, which
is the fix).  This setStatus function is called in a single place in that file:
Line 91:            } catch (Exception e) {
                setStatus(execs);
                e.printStackTrace();
            } 

So, before the fix, any exception caught here would essentially kill the
submission thread.

Also, check StatusThread.java:
...
Line 66:            } catch (Exception e) {
                logger.debug("Error removing tasks");
                e.printStackTrace();
                //exit = true;
            } 
Before the fix, an exception here would have caused the StatusThread to exit,
meaning that no new notifications would be received.

Both of these exception handlers have been modified so that they no longer
shut down their respective threads; they simply no longer change the exit
value from false to true.
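
To make the difference concrete, here is a minimal sketch of the two
behaviours.  This is not the Falkon provider code: SubmitLoopSketch, submit()
and drain() are hypothetical names, and the "bad" task prefix is just a way to
force an exception.  It only illustrates how setting exit inside the catch
block strands the remaining tasks, while logging and continuing does not.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class SubmitLoopSketch {
    // Hypothetical stand-in for the real submit call; fails for "bad" tasks.
    static void submit(String task) {
        if (task.startsWith("bad")) {
            throw new IllegalStateException("transient failure for " + task);
        }
    }

    // Runs the loop over the given tasks and returns
    // {submitted, failed, leftInQueue}.
    static int[] drain(List<String> tasks, boolean exitOnException) {
        Deque<String> queue = new ArrayDeque<>(tasks);
        boolean exit = false;
        int submitted = 0;
        int failed = 0;
        while (!exit && !queue.isEmpty()) {
            String task = queue.poll();
            try {
                submit(task);
                submitted++;
            } catch (Exception e) {
                // Analogous to marking the task FAILED so it can be retried.
                failed++;
                System.err.println("submit failed: " + e.getMessage());
                if (exitOnException) {
                    exit = true; // old behaviour: the loop ends here
                }
                // new behaviour: fall through and keep looping
            }
        }
        return new int[] {submitted, failed, queue.size()};
    }

    public static void main(String[] args) {
        List<String> tasks = Arrays.asList("t1", "bad-t2", "t3", "t4");
        System.out.println(Arrays.toString(drain(tasks, true)));  // [1, 1, 2]
        System.out.println(Arrays.toString(drain(tasks, false))); // [3, 1, 0]
    }
}
```

With the old behaviour, two tasks are left in the queue after a single
non-fatal exception; with the new behaviour, everything except the failing
task goes through.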

We'll dig through the Falkon provider logs to find out the exceptions that were
thrown throughout the application run (assuming that some were thrown like in
the past), so we can better understand why those exceptions were happening in
the first place, and hopefully find a solution so they do not happen in the
future!

> > note that Swift was setting the set status of submitted tasks to the Falkon
> > provider in a separate thread,
> 
> Swift does not set status of tasks. That's what the provider is supposed to do.
> 
OK, there are several separate threads: one that sets the status of the task
for Swift, another that performs the submit, another that receives
notifications, etc.  The common data structure between the set status thread
and the submit thread is a queue.  If the submission thread dies, the queue is
still valid, so the set status thread can still insert tasks into the queue
and set their status to submitted, even though there is no submission thread
alive to perform the actual submission to Falkon.
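
The scenario above can be sketched roughly as follows.  Again, this is only an
illustration under assumptions, not the provider code: OrphanedQueueSketch and
simulate() are hypothetical names, and task ids are plain integers.  The
submitter uses the old exit-on-exception behaviour, while the producer keeps
queueing tasks and counting them as submitted.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class OrphanedQueueSketch {
    // Returns {markedSubmitted, actuallySent, stuckInQueue}.
    // failAt must be in [0, totalTasks), otherwise the submitter never exits.
    static int[] simulate(int totalTasks, int failAt) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        AtomicInteger sent = new AtomicInteger();

        // Hypothetical submission thread with the old behaviour:
        // it exits on the first exception it sees.
        Thread submitter = new Thread(() -> {
            boolean exit = false;
            while (!exit) {
                try {
                    int id = queue.take();
                    if (id == failAt) {
                        throw new IllegalStateException("send failed for " + id);
                    }
                    sent.incrementAndGet();
                } catch (Exception e) {
                    exit = true;
                }
            }
        });
        submitter.start();

        int marked = 0;
        for (int id = 0; id < totalTasks; id++) {
            queue.put(id); // the queue stays valid even if the submitter is dead
            marked++;      // status is set to "submitted" as soon as it is queued
        }
        submitter.join();  // returns once the submitter has hit failAt and exited
        return new int[] {marked, sent.get(), queue.size()};
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(simulate(10, 3))); // [10, 3, 6]
    }
}
```

Here 10 tasks are counted as submitted, but only 3 were actually sent; the
rest sit in a queue that nobody is reading.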

> > which was not necessarily exiting when the Falkon
> > provider was, and hence we had the scenario in which Swift thought it sent out
> > more tasks than Falkon really saw. 
> 
> Can you be more specific? If there is a problem in Swift, we need to fix it,
> but your comment is too vague.
> 
> > 
> > Now, the issue that I think stopped this experiment.  On the console of Swift,
> > the last thing that it printed was a "stack overflow error"; I don't think this
> > printed in the logs, just on the console.
> 
> Without the stack trace, the information is not very useful.
> 
Nika said it was simply a message printed on the console.  This was the same as
the case we saw on Thursday.  This was not a regular exception handled by Swift
or the Falkon provider, so no printStackTrace output came along with it.  As
far as I could tell, it was an error from the JVM, and it was not accompanied
by any stack trace.  If you don't even know where to start looking, let's run
some quick synthetic runs of 20K jobs together on Monday; hopefully we can
reproduce the stack overflow error, and you can see it in person!
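
One possible reason we saw no trace, assuming the console message was a
java.lang.StackOverflowError: that class is an Error, not an Exception, so a
catch (Exception e) block like the ones quoted above never sees it.  The
sketch below (StackOverflowSketch, recurse() and classify() are hypothetical
names, unrelated to the Swift or Falkon code) shows the difference, and how
catching the Error explicitly gives access to the stack frames.

```java
public class StackOverflowSketch {
    // Unbounded recursion to force a StackOverflowError.
    static int recurse(int depth) {
        return recurse(depth + 1) + 1;
    }

    // Reports which handler actually catches the overflow.
    static String classify() {
        try {
            try {
                recurse(0);
            } catch (Exception e) {
                // Never reached: StackOverflowError is not an Exception.
                return "caught as Exception";
            }
        } catch (StackOverflowError err) {
            StackTraceElement[] frames = err.getStackTrace();
            if (frames.length > 0) {
                System.err.println("top frame: " + frames[0]);
            }
            return "caught as Error";
        }
        return "no error";
    }

    public static void main(String[] args) {
        System.out.println(classify()); // caught as Error
    }
}
```

If this is what is happening, temporarily catching Throwable (or this Error)
around the suspect code and printing the trace during the Monday runs might
tell us where the recursion is.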

Ioan

