[Swift-devel] Q about MolDyn

Veronika Nefedova nefedova at mcs.anl.gov
Wed Aug 8 13:25:32 CDT 2007


The current changes screwed up my logging again...
Please, do not touch my install --- I'd rather get everything from SVN.

nefedova at viper:~/alamines> swift -tc.file tc-uc.data -sites.file sites-uc-64.xml -debug MolDyn-244-loops.swift&
[1] 10562
nefedova at viper:~/alamines> WARN   - Failed to configure log file name
DEBUG  - Booting deef


Nika

On Aug 8, 2007, at 1:19 PM, Mihael Hategan wrote:

> On Wed, 2007-08-08 at 13:04 -0500, Ioan Raicu wrote:
>> Shouldn't we be certain that things work before we commit the  
>> changes?
>
> No.
>
>>   I thought the commit would take place after we try MolDyn out  
>> and we
>> see things are back to normal.
>
> The whole problem we've seen the past few days was due to the fact  
> that
> Nika had no clear place to get the code from, so she repeatedly  
> ended up
> with broken versions. S o  p u t  t h e  c h a n g e s  i n  S V N !
>
>>
>> Ioan
>>
>> Mihael Hategan wrote:
>>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>>
>>>> OK everyone, I found Yong's version of the provider dated July  
>>>> 26th,
>>>> much more recent than what was in SVN on June 27th.  I updated  
>>>> Nika's
>>>> version of the provider (which has been checked out of SVN),
>>>>
>>>
>>> No. P u t  t h e  c h a n g e s  i n  S V N !
>>>
>>>
>>>> and recompiled and deployed!
>>>>
>>>>   ant distclean
>>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
>>>>
>>>> I even updated some of the logging info to use the logger
>>>> (some were not using the logger).
>>>>
>>>> Nika, Falkon is freshly restarted and ready for another test run!
>>>>
>>>> Falkon Factory Service:
>>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>
>>>> Ioan
>>>>
>>>> Veronika Nefedova wrote:
>>>>
>>>>> Ioan,
>>>>>
>>>>>
>>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>>> June 27th. You really were supposed to use the SVN code from that
>>>>> point. Sigh. Did you make any changes to the viper install after
>>>>> June 27th?
>>>>>
>>>>>
>>>>> Nika
>>>>>
>>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>>
>>>>>
>>>>>> Could it be that the fixes were done before the original SVN
>>>>>> checkin?   If not, then at least we know why things aren't
>>>>>> working.  I bet the latest provider source was in Nika's Swift
>>>>>> install on viper.  Nika, I take it you don't have this  
>>>>>> anymore, as
>>>>>> SVN updates overwrote this.  Yong, is there any other place you
>>>>>> might have the latest provider source?  If not, I guess we  
>>>>>> need to
>>>>>> take another look through the provider source to fix the issues
>>>>>> that we knew of...
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Mihael Hategan wrote:
>>>>>>
>>>>>>> Well, it doesn't look like the falkon provider in SVN has  
>>>>>>> been updated
>>>>>>> at all in terms of fixing synchronization issues. All commits on
>>>>>>> provider-deef come from either ben or me:
>>>>>>>
>>>>>>> bash-3.1$ svn log
>>>>>>> ------------------------------------------------------------------------
>>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>
>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>> ------------------------------------------------------------------------
>>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>
>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>> ------------------------------------------------------------------------
>>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>
>>>>>>> a very small readme for provider-deef
>>>>>>> ------------------------------------------------------------------------
>>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun 2007) | 1 line
>>>>>>>
>>>>>>> remove dist directory form svn
>>>>>>> ------------------------------------------------------------------------
>>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun 2007) | 20 lines
>>>>>>>
>>>>>>> provider-deef, the Falkon/cog provider
>>>>>>>
>>>>>>> based on source in below message, with .class files deleted
>>>>>>>
>>>>>>>
>>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>>> <hategan at mcs.anl.gov>,
>>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>>
>>>>>>> its on viper.uchicago.edu
>>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>>> I also tared it up and put in my home on terminable:  
>>>>>>> ~nefedova/cogl.tgz
>>>>>>>
>>>>>>> Nika
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Mihael, do you have any clues on why this run has failed?  
>>>>>>>> Ioan - my
>>>>>>>> answers to your questions are below...
>>>>>>>>
>>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> It looks like viper (where Swift is running) is idle, and so is
>>>>>>>>> tg-viz-login2 (where Falkon is running).
>>>>>>>>> What looks evident to me is that the normal list of events for a
>>>>>>>>> successful task is:
>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>>
>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>  17566  175660 2179412
>>>>>>>>>
>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>   7959   55713  785035
>>>>>>>>>
>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>> 190968 1909680 24003796
>>>>>>>>>
>>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were received
>>>>>>>>> from Falkon, and 190968 tasks were set to completed...
>>>>>>>>>
>>>>>>>>> Obviously this isn't right.  Falkon only saw 7959 tasks, so I would
>>>>>>>>> argue that the # of notifications received is correct.  The
>>>>>>>>> submitted # of tasks looks like the # I would have expected, but
>>>>>>>>> not all the tasks made it to Falkon.  The Falkon provider is
>>>>>>>>> what sits between the change of status to submitted and the
>>>>>>>>> receipt of the notification, so I would say that is the first place
>>>>>>>>> we need to look for more details... there used to be some extra
>>>>>>>>> debug info in the Falkon provider that simply printed all the tasks
>>>>>>>>> that were actually being submitted to Falkon (as opposed to just the
>>>>>>>>> change of status within Karajan).  I don't see those debug
>>>>>>>>> statements; I bet they got overwritten in the SVN update.
>>>>>>>>> What about the completed tasks, why are there so many (190K)
>>>>>>>>> completed tasks?  Where did they come from?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> "Task" doesn't mean job. It could be just data being staged in, etc.
>>>>>>>> The first 2 are important -- (Submitted vs Completed). Since they
>>>>>>>> differ, this is the problem...
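
The comparison above (status changes vs. Falkon notifications, restricted to real jobs) can be scripted rather than eyeballed with grep | wc. What follows is only a rough sketch, not anything from the provider or Swift itself. It assumes the log lines look exactly like the ones quoted in this thread once the mail wrapping is undone (identity=urn:0-1-... with no space after "urn:"), and it assumes type=1 marks job submissions, which is a guess based on the type=1 lines getting notifications and the type=2 line failing in getFile. The function names are made up for illustration.

```python
import re
import sys
from collections import defaultdict

# Patterns mirror the log lines quoted in this thread; other layouts are not handled.
STATUS_RE = re.compile(
    r"Task\(type=(\d+), identity=(urn:[\w-]+)\) setting status to (\w+)"
)
NOTIFY_RE = re.compile(r"NotificationThread notification: (urn:[\w-]+)")


def summarize(lines, task_type="1"):
    """Return (submitted, notified, completed) sets of task URNs.

    Only status lines whose type equals task_type are counted
    (assumption: type=1 means a job, not a file transfer).
    """
    seen = defaultdict(set)  # status name -> URNs that reached that status
    notified = set()         # URNs that appear in a Falkon notification line
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            if match.group(1) == task_type:
                seen[match.group(3)].add(match.group(2))
            continue
        match = NOTIFY_RE.search(line)
        if match:
            notified.add(match.group(1))
    return seen["Submitted"], notified, seen["Completed"]


def missing_notifications(lines, task_type="1"):
    """URNs that were set to Submitted but never showed up in a notification."""
    submitted, notified, _ = summarize(lines, task_type)
    return sorted(submitted - notified)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as log:
        log_lines = log.readlines()
    submitted, notified, completed = summarize(log_lines)
    print("submitted:", len(submitted))
    print("notified: ", len(notified))
    print("completed:", len(completed))
    for urn in missing_notifications(log_lines):
        print("no notification for", urn)
```

Run against the log file name as the only argument, this would list the individual URNs that were submitted but never notified, which is a narrower starting point for digging into the provider than the raw counts.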
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Yong, are you keeping up with these emails?  Do you still  
>>>>>>>>> have a
>>>>>>>>> copy of the latest Falkon provider that you edited just  
>>>>>>>>> before you
>>>>>>>>> left?  Can you just take a look through there to make sure  
>>>>>>>>> nothing
>>>>>>>>> has been broken with the SVN updates?  If you don't have  
>>>>>>>>> time for
>>>>>>>>> this now (considering today was your first day on the new  
>>>>>>>>> job),
>>>>>>>>> I'll dig through there and see if I can make some sense of  
>>>>>>>>> what is
>>>>>>>>> happening!
>>>>>>>>>
>>>>>>>>> One last thing, Ben mentioned that the Falkon provider you  
>>>>>>>>> saw in
>>>>>>>>> Nika's account was different than what was in SVN.  Ben,  
>>>>>>>>> did you at
>>>>>>>>> least look at modification dates?  How old was one as  
>>>>>>>>> opposed to
>>>>>>>>> the other?  I hope we did not revert back to an older  
>>>>>>>>> version that
>>>>>>>>> might have had some bug in it....
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I had to update to the latest version of provider-deef from SVN since
>>>>>>>> without the update nothing worked. The version I am at now is 1050.
>>>>>>>> But this is exactly the same version of swift/deef I used for our
>>>>>>>> Friday run (which 'worked' from the Falkon/Swift point of view).
>>>>>>>>
>>>>>>>> Nika
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Ioan
>>>>>>>>>
>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>>
>>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>    7959  244749 3241072
>>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>   17207  564648 7949388
>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>
>>>>>>>>>> I.e. almost half of the jobs haven't finished (according  
>>>>>>>>>> to swift)
>>>>>>>>>>
>>>>>>>>>> I also have some exceptions:
>>>>>>>>>>
>>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile
>>>>>>>>>> (80 of those):
>>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>      80     880    9705
>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Nika
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Swift-devel mailing list
>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>
>>>
>>>
>



