[Swift-devel] Q about MolDyn
Ioan Raicu
iraicu at cs.uchicago.edu
Wed Aug 8 14:20:48 CDT 2007
All my work was related to the deef-provider... I did not touch anything
else!
In the folder
nefedova at viper:~/cogl/modules/provider-deef
I did:
cp yongs_source_files src/org/globus/cog/abstraction/impl/execution/deef/
svn update
ant distclean
ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/
Now why would this screw up your logging or anything else in Swift?
Unless it screwed something up in the deef-provider (which was already
screwed up prior). Now, the message "booting deef" comes from
Boot.java. This file was from SVN, as Mihael modified it a few days
ago, so Yong's Boot.java was not carried over. Should I have used the
older Boot.java (Yong's version from July 26th)? If this is not the
issue, and it's something else related to the deef-provider, you can find
the old deef-provider that you had before at:
viper:/home/nefedova/cogl/modules/provider-deef_8-8-07_svn
Ioan
PS: I don't have rights to commit changes to SVN, so if you don't want
me to make any more changes to your Swift install, we can wait until I
get the right to commit my changes so you can see them and pull them in
yourself through SVN.
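
For the Submitted vs. notification mismatch in the quoted log analysis below, here is a minimal Python sketch of a per-task comparison, as an alternative to comparing grep | wc totals. It assumes each log entry sits on a single line in the form shown in the quoted excerpts, and that a task is identified by the digits and hyphens following "urn:". The names classify and summarize are made up for illustration and are not part of Swift, Karajan, or Falkon.

```python
import re
import sys
from collections import Counter

# Assumption: a task ID is the run of digits and hyphens after "urn:".
URN = re.compile(r"urn:\s*([0-9-]+)")

# Substrings taken from the quoted log excerpts, mapped to short event names.
EVENTS = {
    "setting status to Submitted": "submitted",
    "NotificationThread notification": "notified",
    "setting status to Completed": "completed",
}


def classify(line):
    """Return (event, task_id) for a line of interest, else None."""
    match = URN.search(line)
    if match is None:
        return None
    for marker, event in EVENTS.items():
        if marker in line:
            return event, match.group(1)
    return None


def summarize(lines):
    """Count events and list submitted tasks that were never notified."""
    counts = Counter()
    seen = {event: set() for event in EVENTS.values()}
    for line in lines:
        hit = classify(line)
        if hit is not None:
            event, task_id = hit
            counts[event] += 1
            seen[event].add(task_id)
    missing = sorted(seen["submitted"] - seen["notified"])
    return counts, missing


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        counts, missing = summarize(log)
    print(dict(counts))
    print(len(missing), "submitted tasks without a notification")
    for task_id in missing[:20]:
        print("  urn:" + task_id)
```

Run against the log file named in the quoted commands, this would print the per-event counts and the first few submitted tasks that never got a notification, which is where the provider would need a closer look.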
Veronika Nefedova wrote:
> The current changes screwed up my logging again...
> Please do not touch my install --- I'd rather get everything from SVN.
>
> nefedova at viper:~/alamines> swift -tc.file tc-uc.data -sites.file
> sites-uc-64.xml -debug MolDyn-244-loops.swift&
> [1] 10562
> nefedova at viper:~/alamines> WARN - Failed to configure log file name
> DEBUG - Booting deef
>
>
> Nika
>
> On Aug 8, 2007, at 1:19 PM, Mihael Hategan wrote:
>
>> On Wed, 2007-08-08 at 13:04 -0500, Ioan Raicu wrote:
>>> Shouldn't we be certain that things work before we commit the changes?
>>
>> No.
>>
>>> I thought the commit would take place after we try MolDyn out and we
>>> see things are back to normal.
>>
>> The whole problem we've seen the past few days was due to the fact that
>> Nika had no clear place to get the code from, so she repeatedly ended up
>> with broken versions. S o p u t t h e c h a n g e s i n S V N !
>>
>>>
>>> Ioan
>>>
>>> Mihael Hategan wrote:
>>>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>>>
>>>>> OK everyone, I found Yong's version of the provider dated July 26th,
>>>>> much more recent than what was in SVN on June 27th. I updated Nika's
>>>>> version of the provider (which has been checked out of SVN),
>>>>>
>>>>
>>>> No. P u t t h e c h a n g e s i n S V N !
>>>>
>>>>
>>>>> and recompiled & deployed!
>>>>>
>>>>> ant distclean
>>>>> ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/
>>>>> dist
>>>>>
>>>>> I even updated some of the logging info to use the logger
>>>>> (some were not using the logger).
>>>>>
>>>>> Nika, Falkon is freshly restarted and ready for another test run!
>>>>>
>>>>> Falkon Factory Service:
>>>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>>>>
>>>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>>
>>>>> Ioan
>>>>>
>>>>> Veronika Nefedova wrote:
>>>>>
>>>>>> Ioan,
>>>>>>
>>>>>>
>>>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>>>> June 27th. You really were supposed to use the SVN code from that
>>>>>> point. Sigh. Did you make any changes to the viper install after
>>>>>> June 27th?
>>>>>>
>>>>>>
>>>>>> Nika
>>>>>>
>>>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>>>
>>>>>>
>>>>>>> Could it be that the fixes were done before the original SVN
>>>>>>> checkin? If not, then at least we know why things aren't
>>>>>>> working. I bet the latest provider source was in Nika's Swift
>>>>>>> install on viper. Nika, I take it you don't have this anymore, as
>>>>>>> SVN updates overwrote this. Yong, is there any other place you
>>>>>>> might have the latest provider source? If not, I guess we need to
>>>>>>> take another look through the provider source to fix the issues
>>>>>>> that we knew of...
>>>>>>>
>>>>>>> Ioan
>>>>>>>
>>>>>>> Mihael Hategan wrote:
>>>>>>>
>>>>>>>> Well, it doesn't look like the falkon provider in SVN has been
>>>>>>>> updated at all in terms of fixing synchronization issues. All
>>>>>>>> commits on provider-deef come from either ben or me:
>>>>>>>>
>>>>>>>> bash-3.1$ svn log
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500
>>>>>>>> (Fri, 03 Aug
>>>>>>>> 2007) | 1 line
>>>>>>>>
>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500
>>>>>>>> (Fri, 03 Aug
>>>>>>>> 2007) | 1 line
>>>>>>>>
>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri,
>>>>>>>> 03 Aug
>>>>>>>> 2007) | 1 line
>>>>>>>>
>>>>>>>> a very small readme for provider-deef
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed,
>>>>>>>> 27 Jun
>>>>>>>> 2007) | 1 line
>>>>>>>>
>>>>>>>> remove dist directory form svn
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed,
>>>>>>>> 27 Jun
>>>>>>>> 2007) | 20 lines
>>>>>>>>
>>>>>>>> provider-deef, the Falkon/cog provider
>>>>>>>>
>>>>>>>> based on source in below message, with .class files deleted
>>>>>>>>
>>>>>>>>
>>>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>>>> <hategan at mcs.anl.gov>,
>>>>>>>> iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>>> Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>>> Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>>>
>>>>>>>> its on viper.uchicago.edu
>>>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>>>> I also tared it up and put in my home on terminable:
>>>>>>>> ~nefedova/cogl.tgz
>>>>>>>>
>>>>>>>> Nika
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Mihael, do you have any clues on why this run has failed? Ioan -
>>>>>>>>> my answers to your questions are below...
>>>>>>>>>
>>>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> It looks like viper (where Swift is running) is idle, and so is
>>>>>>>>>> tg-viz-login2 (where Falkon is running).
>>>>>>>>>> The normal sequence of events for a successful task looks like
>>>>>>>>>> this:
>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:
>>>>>>>>>> 0-1-73-2-31-0-0-1186444341989"
>>>>>>>>>> MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1,
>>>>>>>>>> identity=urn:
>>>>>>>>>> 0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread
>>>>>>>>>> notification: urn:
>>>>>>>>>> 0-1-73-2-31-0-0-1186444341989 0
>>>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1,
>>>>>>>>>> identity=urn:
>>>>>>>>>> 0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>>>
>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to
>>>>>>>>>> Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>> 17566 175660 2179412
>>>>>>>>>>
>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread
>>>>>>>>>> notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>> 7959 55713 785035
>>>>>>>>>>
>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to
>>>>>>>>>> Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>> 190968 1909680 24003796
>>>>>>>>>>
>>>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were received
>>>>>>>>>> from Falkon, and 190968 tasks were set to completed...
>>>>>>>>>>
>>>>>>>>>> Obviously this isn't right. Falkon only saw 7959 tasks, so I would
>>>>>>>>>> argue that the # of notifications received is correct. The
>>>>>>>>>> submitted # of tasks looks like the # I would have expected, but
>>>>>>>>>> not all of the tasks made it to Falkon. The Falkon provider is
>>>>>>>>>> what sits between the change of status to submitted and the
>>>>>>>>>> receipt of the notification, so I would say that is the first
>>>>>>>>>> place we need to look for more details... There used to be some
>>>>>>>>>> extra debug info in the Falkon provider that simply printed all
>>>>>>>>>> the tasks that were actually being submitted to Falkon (as
>>>>>>>>>> opposed to just the change of status within Karajan). I don't see
>>>>>>>>>> those debug statements; I bet they got overwritten in the SVN
>>>>>>>>>> update.
>>>>>>>>>> What about the completed tasks, why are there so many (190K)
>>>>>>>>>> completed tasks? Where did they come from?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> "Task" doesn't mean job. It could be just data being staged in,
>>>>>>>>> etc. The first 2 are important -- (Submitted vs Completed). Since
>>>>>>>>> they differ, this is the problem...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Yong, are you keeping up with these emails? Do you still have a
>>>>>>>>>> copy of the latest Falkon provider that you edited just before
>>>>>>>>>> you left? Can you just take a look through there to make sure
>>>>>>>>>> nothing has been broken with the SVN updates? If you don't have
>>>>>>>>>> time for this now (considering today was your first day on the
>>>>>>>>>> new job), I'll dig through there and see if I can make some sense
>>>>>>>>>> of what is happening!
>>>>>>>>>>
>>>>>>>>>> One last thing, Ben mentioned that the Falkon provider you saw in
>>>>>>>>>> Nika's account was different than what was in SVN. Ben, did you
>>>>>>>>>> at least look at modification dates? How old was one as opposed
>>>>>>>>>> to the other? I hope we did not revert back to an older version
>>>>>>>>>> that might have had some bug in it....
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> I had to update to the latest version of provider-deef from SVN,
>>>>>>>>> since without the update nothing worked. The version I am at now
>>>>>>>>> is 1050. But this is exactly the same version of swift/deef I used
>>>>>>>>> for our Friday run (which 'worked' from the Falkon/Swift point of
>>>>>>>>> view).
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Ioan
>>>>>>>>>>
>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>>>
>>>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>> 7959 244749 3241072
>>>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>> 17207 564648 7949388
>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>
>>>>>>>>>>> I.e. almost half of the jobs haven't finished (according to
>>>>>>>>>>> swift)
>>>>>>>>>>>
>>>>>>>>>>> I also have some exceptions:
>>>>>>>>>>>
>>>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2,
>>>>>>>>>>> identity=urn:
>>>>>>>>>>> 0-1-101-2-37-0-0-1186444363341) setting status to Failed
>>>>>>>>>>> Exception
>>>>>>>>>>> in getFile
>>>>>>>>>>> (80 of those):
>>>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>> 80 880 9705
>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Nika
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Swift-devel mailing list
>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>
>>>>
>>>>
>>
>
>