[Swift-devel] Q about MolDyn

Veronika Nefedova nefedova at mcs.anl.gov
Wed Aug 8 14:52:55 CDT 2007


anyway - I fixed the log4j.properties file and started the run

Nika

On Aug 8, 2007, at 2:20 PM, Ioan Raicu wrote:

> All my work was related to the deef-provider... I did not touch  
> anything else!
>
> in the folder
> nefedova at viper:~/cogl/modules/provider-deef
>
> I did:
>
> cp yongs_source_files src/org/globus/cog/abstraction/impl/execution/deef/
> svn update
> ant distclean
> ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/
>
> Now why would this screw up your logging or anything else in  
> Swift?  Unless it screwed something up in the deef-provider (which  
> was already screwed up prior).  Now, the message "booting deef"  
> comes from Boot.java.  This file was from SVN, as Mihael modified  
> it a few days ago, so Yong's Boot.java was not carried over.   
> Should I have used the older Boot.java (Yong's version from July  
> 26th)?  If this is not the issue, and it's something else related to
> the deef-provider, you can find the old deef-provider that you had  
> before at:
> viper:/home/nefedova/cogl/modules/provider-deef_8-8-07_svn
>
> Ioan
> PS: I don't have rights to commit changes to SVN, so if you don't  
> want me to make any more changes to your Swift install, we can wait  
> until I get the right to commit my changes so you can see them and  
> pull them in yourself through SVN.
>
> Veronika Nefedova wrote:
>> the current changes screwed up my logging again...
>> Please, do not touch my install --- I'd rather get everything from  
>> SVN,
>>
>> nefedova at viper:~/alamines> swift -tc.file tc-uc.data -sites.file sites-uc-64.xml -debug MolDyn-244-loops.swift&
>> [1] 10562
>> nefedova at viper:~/alamines> WARN   - Failed to configure log file name
>> DEBUG  - Booting deef
>>
>>
>> Nika
>>
>> On Aug 8, 2007, at 1:19 PM, Mihael Hategan wrote:
>>
>>> On Wed, 2007-08-08 at 13:04 -0500, Ioan Raicu wrote:
>>>> Shouldn't we be certain that things work before we commit the  
>>>> changes?
>>>
>>> No.
>>>
>>>>   I thought the commit would take place after we try MolDyn out  
>>>> and we
>>>> see things are back to normal.
>>>
>>> The whole problem we've seen the past few days was due to the  
>>> fact that
>>> Nika had no clear place to get the code from, so she repeatedly  
>>> ended up
>>> with broken versions. S o  p u t  t h e  c h a n g e s  i n  S V N !
>>>
>>>>
>>>> Ioan
>>>>
>>>> Mihael Hategan wrote:
>>>>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>>>>
>>>>>> OK everyone, I found Yong's version of the provider dated July  
>>>>>> 26th,
>>>>>> much more recent than what was in SVN on June 27th.  I updated  
>>>>>> Nika's
>>>>>> version of the provider (which has been checked out of SVN),
>>>>>>
>>>>>
>>>>> No. P u t  t h e  c h a n g e s  i n  S V N !
>>>>>
>>>>>
>>>>>> and recompiled and deployed!
>>>>>>
>>>>>>   ant distclean
>>>>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
>>>>>>
>>>>>> I even updated some of the logging info to use the logger
>>>>>> (some were not using the logger).
>>>>>>
>>>>>> Nika, Falkon is freshly restarted and ready for another test run!
>>>>>>
>>>>>> Falkon Factory Service:
>>>>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>>>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Veronika Nefedova wrote:
>>>>>>
>>>>>>> Ioan,
>>>>>>>
>>>>>>>
>>>>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>>>>> June 27th. You really were supposed to use the SVN code from that
>>>>>>> point. Sigh. Did you make any changes to the viper install after
>>>>>>> June 27th?
>>>>>>>
>>>>>>>
>>>>>>> Nika
>>>>>>>
>>>>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Could it be that the fixes were done before the original SVN
>>>>>>>> checkin?   If not, then at least we know why things aren't
>>>>>>>> working.  I bet the latest provider source was in Nika's Swift
>>>>>>>> install on viper.  Nika, I take it you don't have this  
>>>>>>>> anymore, as
>>>>>>>> SVN updates overwrote this.  Yong, is there any other place you
>>>>>>>> might have the latest provider source?  If not, I guess we  
>>>>>>>> need to
>>>>>>>> take another look through the provider source to fix the issues
>>>>>>>> that we knew of...
>>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Mihael Hategan wrote:
>>>>>>>>
>>>>>>>>> Well, it doesn't look like the falkon provider in SVN has  
>>>>>>>>> been updated
>>>>>>>>> at all in terms of fixing synchronization issues. All  
>>>>>>>>> commits on
>>>>>>>>> provider-deef come from either ben or me:
>>>>>>>>>
>>>>>>>>> bash-3.1$ svn log
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>
>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>
>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>
>>>>>>>>> a very small readme for provider-deef
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun 2007) | 1 line
>>>>>>>>>
>>>>>>>>> remove dist directory form svn
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun 2007) | 20 lines
>>>>>>>>>
>>>>>>>>> provider-deef, the Falkon/cog provider
>>>>>>>>>
>>>>>>>>> based on source in below message, with .class files deleted
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>>>>> <hategan at mcs.anl.gov>,
>>>>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>>>>
>>>>>>>>> its on viper.uchicago.edu
>>>>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>>>>> I also tared it up and put in my home on terminable:  
>>>>>>>>> ~nefedova/cogl.tgz
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Mihael, do you have any clues on why this run has failed?  
>>>>>>>>>> Ioan - my
>>>>>>>>>> answers to your questions are below...
>>>>>>>>>>
>>>>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> It looks like viper (where Swift is running) is idle, and so is
>>>>>>>>>>> tg-viz-login2 (where Falkon is running).
>>>>>>>>>>> What looks evident to me is the normal list of events for a
>>>>>>>>>>> successful task:
>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>>>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>>>>
>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>  17566  175660 2179412
>>>>>>>>>>>
>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>   7959   55713  785035
>>>>>>>>>>>
>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>> 190968 1909680 24003796
>>>>>>>>>>>
>>>>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were received
>>>>>>>>>>> from Falkon, and 190968 tasks were set to completed...
>>>>>>>>>>>
>>>>>>>>>>> Obviously this isn't right.  Falkon only saw 7959 tasks,  
>>>>>>>>>>> so I would
>>>>>>>>>>> argue that the # of notifications received is correct.  The
>>>>>>>>>>> submitted # of tasks looks like the # I would have  
>>>>>>>>>>> expected, but
>>>>>>>>>>> all the tasks did not make it to Falkon.  The Falkon  
>>>>>>>>>>> provider is
>>>>>>>>>>> what sits between the change of status to submitted, and the
>>>>>>>>>>> receipt of the notification, so I would say that is the  
>>>>>>>>>>> first place
>>>>>>>>>>> we need to look for more details... there used to be some
>>>>>>>>>>> extra debug
>>>>>>>>>>> info in the Falkon provider that simply printed all the  
>>>>>>>>>>> tasks that
>>>>>>>>>>> were actually being submitted to Falkon (as opposed to  
>>>>>>>>>>> just the
>>>>>>>>>>> change of status within Karajan).  I don't see those debug
>>>>>>>>>>> statements, I bet they got overwritten in the SVN update.
>>>>>>>>>>> What about the completed tasks, why are there so many (190K)
>>>>>>>>>>> completed tasks?  Where did they come from?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> "Task" doesn't mean job. It could be just data being staged in, etc.
>>>>>>>>>> The first 2 are important (Submitted vs Completed). Since they
>>>>>>>>>> differ, this is the problem...
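
The grep counts above lump every task type together. A minimal sketch, assuming the log lines have exactly the three formats quoted in this thread (each on a single line in the log file), that breaks the status changes down by task type and lists the type=1 tasks that were set to Submitted but never produced a notification. The function name summarize is hypothetical, and treating type=1 as the tasks that go through Falkon is an assumption based only on the one example task shown above.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this thread; the real
# log may contain other formats that these do not match.
STATUS_RE = re.compile(
    r"Task\(type=(\d+), identity=(urn:\S+?)\) setting status to (\w+)"
)
NOTIFY_RE = re.compile(r"NotificationThread notification: (urn:\S+)")


def summarize(lines):
    """Return (counts, missing).

    counts  maps (task type, status) to the number of status-change lines.
    missing is the set of type=1 identities that were set to Submitted
            but for which no notification line was seen.
    """
    counts = Counter()
    submitted = set()
    notified = set()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            task_type, identity, status = m.groups()
            counts[(task_type, status)] += 1
            # Assumption: type=1 marks the tasks handed to Falkon.
            if task_type == "1" and status == "Submitted":
                submitted.add(identity)
            continue
        m = NOTIFY_RE.search(line)
        if m:
            notified.add(m.group(1))
    return counts, submitted - notified
```

Passing the open log file (for example MolDyn-244-loops-zhgo6be8tjhi1.log) to summarize would give the per-type counts, so the Submitted and Completed numbers can be compared for type=1 alone, plus the identities to look for on the provider side.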
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Yong, are you keeping up with these emails?  Do you still  
>>>>>>>>>>> have a
>>>>>>>>>>> copy of the latest Falkon provider that you edited just  
>>>>>>>>>>> before you
>>>>>>>>>>> left?  Can you just take a look through there to make  
>>>>>>>>>>> sure nothing
>>>>>>>>>>> has been broken with the SVN updates?  If you don't have  
>>>>>>>>>>> time for
>>>>>>>>>>> this now (considering today was your first day on the new  
>>>>>>>>>>> job),
>>>>>>>>>>> I'll dig through there and see if I can make some sense  
>>>>>>>>>>> of what is
>>>>>>>>>>> happening!
>>>>>>>>>>>
>>>>>>>>>>> One last thing, Ben mentioned that the Falkon provider  
>>>>>>>>>>> you saw in
>>>>>>>>>>> Nika's account was different than what was in SVN.  Ben,  
>>>>>>>>>>> did you at
>>>>>>>>>>> least look at modification dates?  How old was one as  
>>>>>>>>>>> opposed to
>>>>>>>>>>> the other?  I hope we did not revert back to an older  
>>>>>>>>>>> version that
>>>>>>>>>>> might have had some bug in it....
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> I had to update to the latest version of provider-deef  
>>>>>>>>>> from SVN since
>>>>>>>>>> without the update nothing worked. The version I am at now  
>>>>>>>>>> is 1050.
>>>>>>>>>> But this is exactly the same version of swift/deef I used  
>>>>>>>>>> for our
>>>>>>>>>> Friday run (which 'worked' from Falkon/Swift point of view)
>>>>>>>>>>
>>>>>>>>>> Nika
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Ioan
>>>>>>>>>>>
>>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>>>>
>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>    7959  244749 3241072
>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>   17207  564648 7949388
>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>
>>>>>>>>>>>> I.e. almost half of the jobs haven't finished (according  
>>>>>>>>>>>> to swift)
>>>>>>>>>>>>
>>>>>>>>>>>> I also have some exceptions:
>>>>>>>>>>>>
>>>>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile
>>>>>>>>>>>> (80 of those):
>>>>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>      80     880    9705
>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Nika
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>>
>
