[Swift-devel] Q about MolDyn

Ioan Raicu iraicu at cs.uchicago.edu
Wed Aug 8 14:20:48 CDT 2007


All my work was related to the deef-provider... I did not touch anything 
else!

in the folder
nefedova at viper:~/cogl/modules/provider-deef

I did:

cp yongs_source_files src/org/globus/cog/abstraction/impl/execution/deef/
svn update
ant distclean
ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/

Now why would this screw up your logging or anything else in Swift?  
Unless it screwed something up in the deef-provider (which was already 
screwed up prior).  Now, the message "booting deef" comes from 
Boot.java.  This file was from SVN, as Mihael modified it a few days 
ago, so Yong's Boot.java was not carried over.  Should I have used the 
older Boot.java (Yong's version from July 26th)?  If this is not the 
issue, and it's something else related to the deef-provider, you can find 
the old deef-provider that you had before at:
viper:/home/nefedova/cogl/modules/provider-deef_8-8-07_svn

Ioan
PS: I don't have rights to commit changes to SVN, so if you don't want 
me to make any more changes to your Swift install, we can wait until I 
get the right to commit my changes so you can see them and pull them in 
yourself through SVN.

Veronika Nefedova wrote:
> the current changes screwed up my logging again...
> Please, do not touch my install --- I'd rather get everything from SVN,
>
> nefedova at viper:~/alamines> swift -tc.file tc-uc.data -sites.file 
> sites-uc-64.xml -debug MolDyn-244-loops.swift&
> [1] 10562
> nefedova at viper:~/alamines> WARN   - Failed to configure log file name
> DEBUG  - Booting deef
>
>
> Nika
>
> On Aug 8, 2007, at 1:19 PM, Mihael Hategan wrote:
>
>> On Wed, 2007-08-08 at 13:04 -0500, Ioan Raicu wrote:
>>> Shouldn't we be certain that things work before we commit the changes?
>>
>> No.
>>
>>>   I thought the commit would take place after we try MolDyn out and we
>>> see things are back to normal.
>>
>> The whole problem we've seen the past few days was due to the fact that
>> Nika had no clear place to get the code from, so she repeatedly ended up
>> with broken versions. S o  p u t  t h e  c h a n g e s  i n  S V N !
>>
>>>
>>> Ioan
>>>
>>> Mihael Hategan wrote:
>>>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>>>
>>>>> OK everyone, I found Yong's version of the provider dated July 26th,
>>>>> much more recent than what was in SVN on June 27th.  I updated Nika's
>>>>> version of the provider (which has been checked out of SVN),
>>>>>
>>>>
>>>> No. P u t  t h e  c h a n g e s  i n  S V N !
>>>>
>>>>
>>>>> and recompiled and deployed!
>>>>>
>>>>>   ant distclean
>>>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/
>>>>> dist
>>>>>
>>>>> I even updated some of the logging info to use the logger
>>>>> (some were not using the logger).
>>>>>
>>>>> Nika, Falkon is freshly restarted and ready for another test run!
>>>>>
>>>>> Falkon Factory Service:
>>>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService 
>>>>>
>>>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>>
>>>>> Ioan
>>>>>
>>>>> Veronika Nefedova wrote:
>>>>>
>>>>>> Ioan,
>>>>>>
>>>>>>
>>>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>>>> June 27th. You really were supposed to use the SVN code from that
>>>>>> point. Sigh. Did you make any changes to the viper install after June
>>>>>> 27th?
>>>>>>
>>>>>>
>>>>>> Nika
>>>>>>
>>>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>>>
>>>>>>
>>>>>>> Could it be that the fixes were done before the original SVN
>>>>>>> checkin?   If not, then at least we know why things aren't
>>>>>>> working.  I bet the latest provider source was in Nika's Swift
>>>>>>> install on viper.  Nika, I take it you don't have this anymore, as
>>>>>>> SVN updates overwrote this.  Yong, is there any other place you
>>>>>>> might have the latest provider source?  If not, I guess we need to
>>>>>>> take another look through the provider source to fix the issues
>>>>>>> that we knew of...
>>>>>>>
>>>>>>> Ioan
>>>>>>>
>>>>>>> Mihael Hategan wrote:
>>>>>>>
>>>>>>>> Well, it doesn't look like the falkon provider in SVN has been
>>>>>>>> updated at all in terms of fixing synchronization issues. All
>>>>>>>> commits on provider-deef come from either ben or me:
>>>>>>>>
>>>>>>>> bash-3.1$ svn log
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>
>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>
>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>
>>>>>>>> a very small readme for provider-deef
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun 2007) | 1 line
>>>>>>>>
>>>>>>>> remove dist directory form svn
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun 2007) | 20 lines
>>>>>>>>
>>>>>>>> provider-deef, the Falkon/cog provider
>>>>>>>>
>>>>>>>> based on source in below message, with .class files deleted
>>>>>>>>
>>>>>>>>
>>>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>>>> <hategan at mcs.anl.gov>,
>>>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>>>
>>>>>>>> its on viper.uchicago.edu
>>>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>>>> I also tared it up and put in my home on terminable: 
>>>>>>>> ~nefedova/cogl.tgz
>>>>>>>>
>>>>>>>> Nika
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------ 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Mihael, do you have any clues on why this run has failed? Ioan -
>>>>>>>>> my answers to your questions are below...
>>>>>>>>>
>>>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> It looks like viper (where Swift is running) is idle, and so is
>>>>>>>>>> tg-viz-login2 (where Falkon is running).
>>>>>>>>>> The normal sequence of events for a successful task looks like this:
>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>>>
>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>  17566  175660 2179412
>>>>>>>>>>
>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>   7959   55713  785035
>>>>>>>>>>
>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>> 190968 1909680 24003796
>>>>>>>>>>
>>>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were received
>>>>>>>>>> from Falkon, and 190968 tasks were set to completed...
>>>>>>>>>>
>>>>>>>>>> Obviously this isn't right.  Falkon only saw 7959 tasks, so I would
>>>>>>>>>> argue that the # of notifications received is correct.  The
>>>>>>>>>> submitted # of tasks looks like the # I would have expected, but
>>>>>>>>>> not all of the tasks made it to Falkon.  The Falkon provider is
>>>>>>>>>> what sits between the change of status to submitted and the
>>>>>>>>>> receipt of the notification, so I would say that is the first place
>>>>>>>>>> we need to look for more details... there used to be some extra
>>>>>>>>>> debug info in the Falkon provider that simply printed all the tasks
>>>>>>>>>> that were actually being submitted to Falkon (as opposed to just the
>>>>>>>>>> change of status within Karajan).  I don't see those debug
>>>>>>>>>> statements; I bet they got overwritten in the SVN update.
>>>>>>>>>> What about the completed tasks, why are there so many (190K)
>>>>>>>>>> completed tasks?  Where did they come from?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> "Task" doesn't mean job. It could be just data being staged in, etc.
>>>>>>>>> The first 2 are important -- (Submitted vs Completed). Since they
>>>>>>>>> differ, this is the problem...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Yong, are you keeping up with these emails?  Do you still have a
>>>>>>>>>> copy of the latest Falkon provider that you edited just before you
>>>>>>>>>> left?  Can you just take a look through there to make sure nothing
>>>>>>>>>> has been broken with the SVN updates?  If you don't have time for
>>>>>>>>>> this now (considering today was your first day on the new job),
>>>>>>>>>> I'll dig through there and see if I can make some sense of what is
>>>>>>>>>> happening!
>>>>>>>>>>
>>>>>>>>>> One last thing, Ben mentioned that the Falkon provider you saw in
>>>>>>>>>> Nika's account was different than what was in SVN.  Ben, did you at
>>>>>>>>>> least look at modification dates?  How old was one as opposed to
>>>>>>>>>> the other?  I hope we did not revert back to an older version that
>>>>>>>>>> might have had some bug in it....
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> I had to update to the latest version of provider-deef from SVN,
>>>>>>>>> since without the update nothing worked. The version I am at now is
>>>>>>>>> 1050. But this is exactly the same version of swift/deef I used for
>>>>>>>>> our Friday run (which 'worked' from the Falkon/Swift point of view)
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Ioan
>>>>>>>>>>
>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>>>
>>>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>    7959  244749 3241072
>>>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>   17207  564648 7949388
>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>
>>>>>>>>>>> I.e. more than half of the jobs haven't finished (according to swift)
>>>>>>>>>>>
>>>>>>>>>>> I also have some exceptions:
>>>>>>>>>>>
>>>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile
>>>>>>>>>>> (80 of those):
>>>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>      80     880    9705
>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Nika
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Swift-devel mailing list
>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>
>>>>
>>>>
>>
>
>

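The grep counts quoted in this thread mix job submissions with other task types (as Nika notes, a "Task" can also be data staging). A rough way to separate them is to count status changes per task type and to list the job URNs that were submitted but never produced a Falkon notification. The following is only a sketch under stated assumptions: it assumes each log record sits on a single line in the unwrapped form of the samples quoted above, and that type=1 marks job submissions (inferred from the samples, not confirmed in the thread). The function name `summarize` and the command-line usage are hypothetical.

```python
import re
import sys
from collections import Counter

# Assumed single-line record formats, reconstructed from the samples in the thread.
STATUS_RE = re.compile(
    r"Task\(type=(\d+), identity=(urn:[^)\s]+)\) setting status to (\w+)"
)
NOTIFY_RE = re.compile(r"NotificationThread notification: (urn:\S+)")

JOB_TYPE = 1  # assumption: type=1 is a job submission, other types are staging etc.


def summarize(lines):
    """Return (status counts keyed by (type, status), job URNs with no notification)."""
    status_counts = Counter()
    submitted_jobs = set()
    notified = set()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            task_type, urn, status = int(m.group(1)), m.group(2), m.group(3)
            status_counts[(task_type, status)] += 1
            if task_type == JOB_TYPE and status == "Submitted":
                submitted_jobs.add(urn)
            continue
        m = NOTIFY_RE.search(line)
        if m:
            notified.add(m.group(1))
    # Jobs handed to the provider for which no notification was ever logged.
    return status_counts, submitted_jobs - notified


if __name__ == "__main__":
    # Usage (hypothetical script name): python summarize_log.py MolDyn-244-loops-zhgo6be8tjhi1.log
    with open(sys.argv[1], errors="replace") as log:
        counts, missing = summarize(log)
    for (task_type, status), n in sorted(counts.items()):
        print(f"type={task_type} {status}: {n}")
    print(f"submitted jobs without a notification: {len(missing)}")
```

Splitting the counts by type should show whether the 190968 "Completed" lines come mostly from non-job tasks, and the set of un-notified URNs gives concrete tasks to trace through the provider.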