[Swift-devel] Q about MolDyn
Mihael Hategan
hategan at mcs.anl.gov
Wed Aug 8 13:19:05 CDT 2007
On Wed, 2007-08-08 at 13:04 -0500, Ioan Raicu wrote:
> Shouldn't we be certain that things work before we commit the changes?
No.
> I thought the commit would take place after we try MolDyn out and we
> see things are back to normal.
The whole problem we've seen the past few days was due to the fact that
Nika had no clear place to get the code from, so she repeatedly ended up
with broken versions. So put the changes in SVN!
>
> Ioan
>
> Mihael Hategan wrote:
> > On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
> >
> > > OK everyone, I found Yong's version of the provider dated July 26th,
> > > much more recent than what was in SVN on June 27th. I updated Nika's
> > > version of the provider (which has been checked out of SVN),
> > >
> >
> > No. Put the changes in SVN!
> >
> >
> > > and recompiled and deployed!
> > >
> > > ant distclean
> > > ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
> > >
> > > I even updated some of the logging info to use the logger
> > > (some were not using the logger).
> > >
> > > Nika, Falkon is freshly restarted and ready for another test run!
> > >
> > > Falkon Factory Service:
> > > http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
> > > Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
> > >
> > > Ioan
> > >
> > > Veronika Nefedova wrote:
> > >
> > > > Ioan,
> > > >
> > > >
> > > > It looks like Falkon (including provider-deef) was put in SVN on
> > > > June 27th. You really were supposed to use the SVN code from that
> > > > point. Sigh. Did you make any changes to the viper install after
> > > > June 27th?
> > > >
> > > >
> > > > Nika
> > > >
> > > > On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
> > > >
> > > >
> > > > > Could it be that the fixes were done before the original SVN
> > > > > checkin? If not, then at least we know why things aren't
> > > > > working. I bet the latest provider source was in Nika's Swift
> > > > > install on viper. Nika, I take it you don't have this anymore, as
> > > > > SVN updates overwrote this. Yong, is there any other place you
> > > > > might have the latest provider source? If not, I guess we need to
> > > > > take another look through the provider source to fix the issues
> > > > > that we knew of...
> > > > >
> > > > > Ioan
> > > > >
> > > > > Mihael Hategan wrote:
> > > > >
> > > > > > Well, it doesn't look like the falkon provider in SVN has been updated
> > > > > > at all in terms of fixing synchronization issues. All commits on
> > > > > > provider-deef come from either ben or me:
> > > > > >
> > > > > > bash-3.1$ svn log
> > > > > > ------------------------------------------------------------------------
> > > > > > > r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug 2007) | 1 line
> > > > > > >
> > > > > > > removed gt4 stuff and added them as a dependency
> > > > > > > ------------------------------------------------------------------------
> > > > > > > r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug 2007) | 1 line
> > > > > > >
> > > > > > > removed gt4 stuff and added them as a dependency
> > > > > > > ------------------------------------------------------------------------
> > > > > > > r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug 2007) | 1 line
> > > > > > >
> > > > > > > a very small readme for provider-deef
> > > > > > > ------------------------------------------------------------------------
> > > > > > > r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun 2007) | 1 line
> > > > > > >
> > > > > > > remove dist directory form svn
> > > > > > > ------------------------------------------------------------------------
> > > > > > > r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun 2007) | 20 lines
> > > > > >
> > > > > > provider-deef, the Falkon/cog provider
> > > > > >
> > > > > > based on source in below message, with .class files deleted
> > > > > >
> > > > > >
> > > > > > Date: Wed, 27 Jun 2007 09:27:23 -0500
> > > > > > From: Veronika Nefedova <nefedova at mcs.anl.gov>
> > > > > > To: Yong Zhao <yongzh at cs.uchicago.edu>
> > > > > > Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
> > > > > > <hategan at mcs.anl.gov>,
> > > > > > iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
> > > > > > Mike Wilde <wilde at mcs.anl.gov>,
> > > > > > Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
> > > > > > Subject: Re: 244 molecule MolDyn run...
> > > > > >
> > > > > > its on viper.uchicago.edu
> > > > > > in : /home/nefedova/cogl/modules/provider-deef/
> > > > > > I also tared it up and put in my home on terminable: ~nefedova/cogl.tgz
> > > > > >
> > > > > > Nika
> > > > > >
> > > > > >
> > > > > > ------------------------------------------------------------------------
> > > > > >
> > > > > >
> > > > > > On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
> > > > > >
> > > > > >
> > > > > > > Mihael, do you have any clues on why this run has failed? Ioan - my
> > > > > > > answers to your questions are below...
> > > > > > >
> > > > > > > On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > > It looks like viper (where Swift is running) is idle, and so is
> > > > > > > > > tg-viz-login2 (where Falkon is running).
> > > > > > > > > What looks evident to me is that the normal sequence of events
> > > > > > > > > for a successful task is:
> > > > > > > > > iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
> > > > > > > > > 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
> > > > > > > > > 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
> > > > > > > > > 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
> > > > > > > > >
> > > > > > > > > iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > > > > 17566 175660 2179412
> > > > > > > > >
> > > > > > > > > iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > > > > 7959 55713 785035
> > > > > > > > >
> > > > > > > > > iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > > > > 190968 1909680 24003796
> > > > > > > >
> > > > > > > > > Now, 17566 tasks were submitted, 7959 notifications were received
> > > > > > > > from Falkon, and 190968 tasks were set to completed...
> > > > > > > >
> > > > > > > > Obviously this isn't right. Falkon only saw 7959 tasks, so I would
> > > > > > > > argue that the # of notifications received is correct. The
> > > > > > > > submitted # of tasks looks like the # I would have expected, but
> > > > > > > > all the tasks did not make it to Falkon. The Falkon provider is
> > > > > > > > what sits between the change of status to submitted, and the
> > > > > > > > receipt of the notification, so I would say that is the first place
> > > > > > > > > we need to look for more details... there used to be some extra debug
> > > > > > > > info in the Falkon provider that simply printed all the tasks that
> > > > > > > > were actually being submitted to Falkon (as opposed to just the
> > > > > > > > change of status within Karajan). I don't see those debug
> > > > > > > > statements, I bet they got overwritten in the SVN update.
> > > > > > > > What about the completed tasks, why are there so many (190K)
> > > > > > > > completed tasks? Where did they come from?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > "Task" doesn't mean job. It could be just data being staged in, etc.
> > > > > > > > The first 2 are important (Submitted vs. Completed). Since they
> > > > > > > > differ, this is the problem...
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Yong, are you keeping up with these emails? Do you still have a
> > > > > > > > copy of the latest Falkon provider that you edited just before you
> > > > > > > > left? Can you just take a look through there to make sure nothing
> > > > > > > > has been broken with the SVN updates? If you don't have time for
> > > > > > > > this now (considering today was your first day on the new job),
> > > > > > > > I'll dig through there and see if I can make some sense of what is
> > > > > > > > happening!
> > > > > > > >
> > > > > > > > One last thing, Ben mentioned that the Falkon provider you saw in
> > > > > > > > Nika's account was different than what was in SVN. Ben, did you at
> > > > > > > > least look at modification dates? How old was one as opposed to
> > > > > > > > the other? I hope we did not revert back to an older version that
> > > > > > > > might have had some bug in it....
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > I had to update to the latest version of provider-deef from SVN since
> > > > > > > without the update nothing worked. The version I am at now is 1050.
> > > > > > > But this is exactly the same version of swift/deef I used for our
> > > > > > > > Friday run (which 'worked' from the Falkon/Swift point of view)
> > > > > > >
> > > > > > > Nika
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Ioan
> > > > > > > >
> > > > > > > > Veronika Nefedova wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > Well, there are some discrepancies:
> > > > > > > > >
> > > > > > > > > > nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > > > > > 7959 244749 3241072
> > > > > > > > > > nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > > > > > 17207 564648 7949388
> > > > > > > > > > nefedova at viper:~/alamines>
> > > > > > > > >
> > > > > > > > > > I.e. more than half of the jobs haven't finished (according to swift)
> > > > > > > > >
> > > > > > > > > I also have some exceptions:
> > > > > > > > >
> > > > > > > > > 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:
> > > > > > > > > 0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception
> > > > > > > > > in getFile
> > > > > > > > > (80 of those):
> > > > > > > > > nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-
> > > > > > > > > zhgo6be8tjhi1.log | wc
> > > > > > > > > 80 880 9705
> > > > > > > > > nefedova at viper:~/alamines>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Nika
> > > > > > > > >
> > > > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Swift-devel mailing list
> > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> >
> >
> >