[Swift-devel] Q about MolDyn

Mihael Hategan hategan at mcs.anl.gov
Wed Aug 8 13:00:43 CDT 2007


On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
> OK everyone, I found Yong's version of the provider dated July 26th,
> much more recent than what was in SVN on June 27th.  I updated Nika's
> version of the provider (which has been checked out of SVN),

No. P u t  t h e  c h a n g e s  i n  S V N !

>  and recompiled and deployed!
> 
>   ant distclean
>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
> 
> I even updated some of the logging info to use the logger
> (some were not using the logger).
> 
> Nika, Falkon is freshly restarted and ready for another test run!
> 
> Falkon Factory Service:
> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
> 
> Ioan
> 
> Veronika Nefedova wrote: 
> > Ioan, 
> > 
> > 
> > It looks like Falkon (including provider-deef) was put in SVN on
> > June 27th. You really were supposed to use the SVN code from that
> > point. Sigh. Did you make any changes to the viper install after June
> > 27th?
> > 
> > 
> > Nika
> > 
> > On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
> > 
> > > Could it be that the fixes were done before the original SVN
> > > checkin?   If not, then at least we know why things aren't
> > > working.  I bet the latest provider source was in Nika's Swift
> > > install on viper.  Nika, I take it you don't have this anymore, as
> > > SVN updates overwrote this.  Yong, is there any other place you
> > > might have the latest provider source?  If not, I guess we need to
> > > take another look through the provider source to fix the issues
> > > that we knew of...
> > > 
> > > Ioan
> > > 
> > > Mihael Hategan wrote: 
> > > > Well, it doesn't look like the falkon provider in SVN has been updated
> > > > at all in terms of fixing synchronization issues. All commits on
> > > > provider-deef come from either ben or me:
> > > > 
> > > > bash-3.1$ svn log
> > > > ------------------------------------------------------------------------
> > > > r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug
> > > > 2007) | 1 line
> > > > 
> > > > removed gt4 stuff and added them as a dependency
> > > > ------------------------------------------------------------------------
> > > > r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug
> > > > 2007) | 1 line
> > > > 
> > > > removed gt4 stuff and added them as a dependency
> > > > ------------------------------------------------------------------------
> > > > r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug
> > > > 2007) | 1 line
> > > > 
> > > > a very small readme for provider-deef
> > > > ------------------------------------------------------------------------
> > > > r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun
> > > > 2007) | 1 line
> > > > 
> > > > remove dist directory form svn
> > > > ------------------------------------------------------------------------
> > > > r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun
> > > > 2007) | 20 lines
> > > > 
> > > > provider-deef, the Falkon/cog provider
> > > > 
> > > > based on source in below message, with .class files deleted
> > > > 
> > > > 
> > > > Date: Wed, 27 Jun 2007 09:27:23 -0500
> > > > From: Veronika Nefedova <nefedova at mcs.anl.gov>
> > > > To: Yong Zhao <yongzh at cs.uchicago.edu>
> > > > Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
> > > > <hategan at mcs.anl.gov>,
> > > >     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
> > > >     Mike Wilde <wilde at mcs.anl.gov>,
> > > >     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
> > > > Subject: Re: 244 molecule MolDyn run...
> > > > 
> > > > its on viper.uchicago.edu
> > > > in : /home/nefedova/cogl/modules/provider-deef/
> > > > I also tared it up and put in my home on terminable: ~nefedova/cogl.tgz
> > > > 
> > > > Nika
> > > > 
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > 
> > > > On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
> > > >   
> > > > > Mihael, do you have any clues on why this run has failed? Ioan - my  
> > > > > answers to your questions are below...
> > > > > 
> > > > > On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
> > > > > 
> > > > >     
> > > > > > > It looks like viper (where Swift is running) is idle, and so is
> > > > > > > tg-viz-login2 (where Falkon is running).
> > > > > > > The normal sequence of events for a successful task looks like
> > > > > > > this:
> > > > > > > iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
> > > > > > > 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
> > > > > > > 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
> > > > > > > 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
> > > > > > 
> > > > > > > iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > >  17566  175660 2179412
> > > > > > > 
> > > > > > > iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > >   7959   55713  785035
> > > > > > > 
> > > > > > > iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > > 190968 1909680 24003796
> > > > > > 
> > > > > > > Now, 17566 tasks were submitted, 7959 notifications were received  
> > > > > > > from Falkon, and 190968 tasks were set to completed...
> > > > > > 
> > > > > > Obviously this isn't right.  Falkon only saw 7959 tasks, so I would  
> > > > > > argue that the # of notifications received is correct.  The  
> > > > > > submitted # of tasks looks like the # I would have expected, but  
> > > > > > all the tasks did not make it to Falkon.  The Falkon provider is  
> > > > > > what sits between the change of status to submitted, and the  
> > > > > > receipt of the notification, so I would say that is the first place  
> > > > > > > we need to look for more details... there used to be some extra debug  
> > > > > > info in the Falkon provider that simply printed all the tasks that  
> > > > > > were actually being submitted to Falkon (as opposed to just the  
> > > > > > change of status within Karajan).  I don't see those debug  
> > > > > > statements, I bet they got overwritten in the SVN update.
> > > > > > What about the completed tasks, why are there so many (190K)  
> > > > > > completed tasks?  Where did they come from?
> > > > > > 
> > > > > >       
> > > > > > "Task" doesn't mean job. It could be just data being staged in, etc.  
> > > > > The first 2 are important -- (Submitted vs Completed). Since it  
> > > > > differs, this is the problem...
> > > > > 
> > > > > 
> > > > >     
> > > > > > Yong, are you keeping up with these emails?  Do you still have a  
> > > > > > copy of the latest Falkon provider that you edited just before you  
> > > > > > left?  Can you just take a look through there to make sure nothing  
> > > > > > has been broken with the SVN updates?  If you don't have time for  
> > > > > > this now (considering today was your first day on the new job),  
> > > > > > I'll dig through there and see if I can make some sense of what is  
> > > > > > happening!
> > > > > > 
> > > > > > One last thing, Ben mentioned that the Falkon provider you saw in  
> > > > > > Nika's account was different than what was in SVN.  Ben, did you at  
> > > > > > least look at modification dates?  How old was one as opposed to  
> > > > > > the other?  I hope we did not revert back to an older version that  
> > > > > > might have had some bug in it....
> > > > > > 
> > > > > >       
> > > > > I had to update to the latest version of provider-deef from SVN since  
> > > > > without the update nothing worked. The version I am at now is 1050.  
> > > > > But this is exactly the same version of swift/deef I used for our  
> > > > > > Friday run (which 'worked' from the Falkon/Swift point of view)
> > > > > 
> > > > > Nika
> > > > > 
> > > > > 
> > > > >     
> > > > > > Ioan
> > > > > > 
> > > > > > Veronika Nefedova wrote:
> > > > > >       
> > > > > > > Well, there are some discrepancies:
> > > > > > > 
> > > > > > > > nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > > >    7959  244749 3241072
> > > > > > > > nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> > > > > > > >   17207  564648 7949388
> > > > > > > nefedova at viper:~/alamines>
> > > > > > > 
> > > > > > > I.e. almost half of the jobs haven't finished (according to swift)
> > > > > > > 
> > > > > > > I also have some exceptions:
> > > > > > > 
> > > > > > > 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn: 
> > > > > > > 0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception  
> > > > > > > in getFile
> > > > > > > (80 of those):
> > > > > > > nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops- 
> > > > > > > zhgo6be8tjhi1.log | wc
> > > > > > >      80     880    9705
> > > > > > > nefedova at viper:~/alamines>
> > > > > > > 
> > > > > > > 
> > > > > > > Nika
> > > > > > >         
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > > > 
> > > > >     
> > 
> > 
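
The aggregate grep/wc counts quoted above show that fewer notifications
arrived than tasks were submitted, but not which tasks went missing. A
minimal sketch of a per-task comparison follows. It assumes the log line
formats are exactly those quoted in the thread, and that type=1 marks job
submissions (an assumption; the thread only says that a "Task" is not
necessarily a job). The function name unnotified is hypothetical and is not
part of Swift, Karajan, or the Falkon provider.

```python
import re

# Line formats taken from the three log lines quoted in the thread.
STATUS = re.compile(
    r"Task\(type=(\d+), identity=(urn:[\w-]+)\) setting status to (\w+)"
)
NOTIFY = re.compile(r"NotificationThread notification: (urn:[\w-]+)")


def unnotified(lines, task_type="1"):
    """Return identities set to Submitted that never got a notification.

    task_type filters on the type= field; "1" is assumed to mean a job.
    """
    submitted, notified = set(), set()
    for line in lines:
        m = STATUS.search(line)
        if m:
            if m.group(1) == task_type and m.group(3) == "Submitted":
                submitted.add(m.group(2))
            continue
        m = NOTIFY.search(line)
        if m:
            notified.add(m.group(1))
    return sorted(submitted - notified)
```

Passing an open file object for the Swift log, for example
unnotified(open("MolDyn-244-loops-zhgo6be8tjhi1.log")), would list the
identities to look for on the Falkon side.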



