[Swift-devel] Q about MolDyn
Mihael Hategan
hategan at mcs.anl.gov
Tue Aug 7 11:24:12 CDT 2007
Well, it doesn't look like the Falkon provider in SVN has been updated
at all to fix the synchronization issues. All commits on
provider-deef come from either Ben or me:
bash-3.1$ svn log
------------------------------------------------------------------------
r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug
2007) | 1 line
removed gt4 stuff and added them as a dependency
------------------------------------------------------------------------
r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug
2007) | 1 line
removed gt4 stuff and added them as a dependency
------------------------------------------------------------------------
r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug
2007) | 1 line
a very small readme for provider-deef
------------------------------------------------------------------------
r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun
2007) | 1 line
remove dist directory form svn
------------------------------------------------------------------------
r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun
2007) | 20 lines
provider-deef, the Falkon/cog provider
based on source in below message, with .class files deleted
Date: Wed, 27 Jun 2007 09:27:23 -0500
From: Veronika Nefedova <nefedova at mcs.anl.gov>
To: Yong Zhao <yongzh at cs.uchicago.edu>
Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
<hategan at mcs.anl.gov>,
iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
Mike Wilde <wilde at mcs.anl.gov>,
Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
Subject: Re: 244 molecule MolDyn run...
its on viper.uchicago.edu
in : /home/nefedova/cogl/modules/provider-deef/
I also tared it up and put in my home on terminable: ~nefedova/cogl.tgz
Nika
------------------------------------------------------------------------
On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
> Mihael, do you have any clues on why this run has failed? Ioan - my
> answers to your questions are below...
>
> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>
> > It looks like viper (where Swift is running) is idle, and so is
> > tg-viz-login2 (where Falkon is running).
> > For reference, the normal sequence of log events for a
> > successful task is:
> > iraicu at viper:/home/nefedova/alamines> grep "urn:
> > 0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
> > 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:
> > 0-1-73-2-31-0-0-1186444341989) setting status to Submitted
> > 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:
> > 0-1-73-2-31-0-0-1186444341989 0
> > 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:
> > 0-1-73-2-31-0-0-1186444341989) setting status to Completed
> >
> > iraicu at viper:/home/nefedova/alamines> grep "setting status to
> > Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > 17566 175660 2179412
> >
> > iraicu at viper:/home/nefedova/alamines> grep "NotificationThread
> > notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > 7959 55713 785035
> >
> > iraicu at viper:/home/nefedova/alamines> grep "setting status to
> > Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > 190968 1909680 24003796
> >
> > Now, 17566 tasks were submitted, 7959 notifications were received
> > from Falkon, and 190968 tasks were set to completed...
> >
> > Obviously this isn't right. Falkon only saw 7959 tasks, so I would
> > argue that the # of notifications received is correct. The
> > submitted # of tasks looks like the # I would have expected, but
> > all the tasks did not make it to Falkon. The Falkon provider is
> > what sits between the change of status to submitted, and the
> > receipt of the notification, so I would say that is the first place
> > we need to look for more details... there used to be some extra debug
> > info in the Falkon provider that simply printed all the tasks that
> > were actually being submitted to Falkon (as opposed to just the
> > change of status within Karajan). I don't see those debug
> > statements, I bet they got overwritten in the SVN update.
> > What about the completed tasks, why are there so many (190K)
> > completed tasks? Where did they come from?
> >
>
>
> "Task" doesn't mean job. It could be just data being staged in, etc.
> The first 2 are important -- (Submitted vs Completed). Since they
> differ, this is the problem...
>
>
> > Yong, are you keeping up with these emails? Do you still have a
> > copy of the latest Falkon provider that you edited just before you
> > left? Can you just take a look through there to make sure nothing
> > has been broken with the SVN updates? If you don't have time for
> > this now (considering today was your first day on the new job),
> > I'll dig through there and see if I can make some sense of what is
> > happening!
> >
> > One last thing, Ben mentioned that the Falkon provider you saw in
> > Nika's account was different than what was in SVN. Ben, did you at
> > least look at modification dates? How old was one as opposed to
> > the other? I hope we did not revert back to an older version that
> > might have had some bug in it....
> >
>
> I had to update to the latest version of provider-deef from SVN since
> without the update nothing worked. The version I am at now is 1050.
> But this is exactly the same version of swift/deef I used for our
> Friday run (which 'worked' from the Falkon/Swift point of view).
>
> Nika
>
>
> > Ioan
> >
> > Veronika Nefedova wrote:
> >> Well, there are some discrepancies:
> >>
> >> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-
> >> zhgo6be8tjhi1.log | wc
> >> 7959 244749 3241072
> >> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-
> >> zhgo6be8tjhi1.log | wc
> >> 17207 564648 7949388
> >> nefedova at viper:~/alamines>
> >>
> >> I.e. more than half of the jobs haven't finished (according to Swift)
> >>
> >> I also have some exceptions:
> >>
> >> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:
> >> 0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception
> >> in getFile
> >> (80 of those):
> >> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-
> >> zhgo6be8tjhi1.log | wc
> >> 80 880 9705
> >> nefedova at viper:~/alamines>
> >>
> >>
> >> Nika
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>
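The grep counts quoted above are totals over every task in the log, and Nika's point that a "Task" is not necessarily a job suggests breaking them down by task type and matching them up per identity. A minimal Python sketch of that reconciliation follows. It assumes each log record sits on a single line (the break after "urn:" in the quoted samples looks like mail wrapping) and that type=1 marks job submissions, as the sample lines suggest; neither is confirmed in the thread. The names reconcile and reconcile_file are hypothetical helpers, not part of Swift or Falkon.

```python
import re
from collections import Counter

# Assumed record shapes, taken from the sample lines quoted in this thread.
STATUS_RE = re.compile(
    r"Task\(type=(\d+), identity=urn:\s*([\d-]+)\) setting status to (\w+)"
)
NOTIFY_RE = re.compile(r"NotificationThread notification: urn:\s*([\d-]+)")

JOB_TYPE = "1"  # assumption: type=1 is a job submission


def reconcile(lines):
    """Return (status counts keyed by (type, status), unnotified job ids)."""
    counts = Counter()
    submitted = set()
    notified = set()
    for line in lines:
        status = STATUS_RE.search(line)
        if status:
            task_type, ident, state = status.groups()
            counts[(task_type, state)] += 1
            if task_type == JOB_TYPE and state == "Submitted":
                submitted.add(ident)
            continue
        note = NOTIFY_RE.search(line)
        if note:
            notified.add(note.group(1))
    # Jobs that were marked Submitted but never produced a notification.
    return counts, submitted - notified


def reconcile_file(path):
    """Run reconcile() over a log file, reading it line by line."""
    with open(path, errors="replace") as log:
        return reconcile(log)
```

For example, reconcile_file("MolDyn-244-loops-zhgo6be8tjhi1.log") would return the status counts per task type together with the set of type=1 identities that were set to Submitted but never appeared in a NotificationThread line, which is the gap between the 17566 and 7959 figures discussed above.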