[Swift-devel] Q about MolDyn

Veronika Nefedova nefedova at mcs.anl.gov
Wed Aug 8 15:45:52 CDT 2007


Nope, it's a 244-mol workflow.

I have no errors or exceptions in the log.

nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-p2p6vy21s5fj0.log | wc
     247    6411   45191
nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-p2p6vy21s5fj0.log | wc
     489   13923  149727
nefedova at viper:~/alamines> grep "xception" MolDyn-244-loops-p2p6vy21s5fj0.log | wc
       0       0       0
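
For repeat checks, the three greps can be folded into one pass over the log. This is a minimal sketch in Python, not anything from the Swift tree; the function names are made up for illustration, and the substrings are the ones grepped for above:

```python
from collections import Counter

# Substrings taken from the grep commands above. Each is counted once per
# matching line, like the first column of `grep ... | wc`.
PATTERNS = ("Running job", "Completed job", "xception")


def count_patterns(lines, patterns=PATTERNS):
    """Return a Counter mapping each substring to the number of lines containing it."""
    counts = Counter({p: 0 for p in patterns})
    for line in lines:
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts


def count_in_file(path, patterns=PATTERNS):
    """Hypothetical convenience wrapper: run count_patterns over a log file."""
    with open(path, errors="replace") as fh:
        return count_patterns(fh, patterns)
```

Calling count_in_file("MolDyn-244-loops-p2p6vy21s5fj0.log") should give the same per-line counts as the greps, as long as the log is read the same way.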


So I guess something else is wrong here?
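
One way to narrow it down would be to list the tasks that were marked Submitted but never got a notification back from Falkon. This is a rough sketch, assuming the log lines look like the TaskImpl and NotificationThread lines quoted further down in this thread, and assuming type=1 tasks are the job submissions (the thread only shows type=1 for a job and type=2 for a failed getFile). The function name is hypothetical:

```python
import re

# Assumed line shapes, copied from the log excerpts quoted later in this thread:
#   ... TaskImpl Task(type=1, identity=urn:0-1-73-...) setting status to Submitted
#   ... NotificationThread notification: urn:0-1-73-... 0
SUBMITTED = re.compile(
    r"Task\(type=1,\s*identity=urn:\s*([0-9-]+)\) setting status to Submitted"
)
NOTIFIED = re.compile(r"NotificationThread notification: urn:\s*([0-9-]+)")


def unnotified_tasks(lines):
    """Return sorted task ids that were Submitted but have no notification line."""
    submitted, notified = set(), set()
    for line in lines:
        m = SUBMITTED.search(line)
        if m:
            submitted.add(m.group(1))
            continue
        m = NOTIFIED.search(line)
        if m:
            notified.add(m.group(1))
    return sorted(submitted - notified)
```

Feeding it the open log file (for example, unnotified_tasks(open(path))) would give the ids to look for on the Falkon side.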

Nika


On Aug 8, 2007, at 3:35 PM, Ioan Raicu wrote:

> Did you try just a small workflow to test? It looks to be idle
>
> 13014.996 0 1 42 188 188 0 0 0.0 0.0 0.0 0.0 489.0 0.0
> 13015.996 0 1 42 188 188 0 0 0.0 0.0 0.0 0.0 489.0 0.0
> 13016.996 0 1 42 188 188 0 0 0.0 0.0 0.0 0.0 489.0 0.0
> 13017.996 0 1 42 188 188 0 0 0.0 0.0 0.0 0.0 489.0 0.0
> 13018.996 0 1 42 188 188 0 0 0.0 0.0 0.0 0.0 489.0 0.0
> 13019.996 0 1 42 188 188 0 0 0.0 0.0 0.0 0.0 489.0 0.0
>
> with 489 jobs completed... is this normal?
>
> Veronika Nefedova wrote:
>> anyway - I fixed the log4j.properties file and started the run
>>
>> Nika
>>
>> On Aug 8, 2007, at 2:20 PM, Ioan Raicu wrote:
>>
>>> All my work was related to the deef-provider... I did not touch  
>>> anything else!
>>>
>>> in the folder
>>> nefedova at viper:~/cogl/modules/provider-deef
>>>
>>> I did:
>>>
>>> cp yongs_source_files src/org/globus/cog/abstraction/impl/execution/deef/
>>> svn update
>>> ant distclean
>>> ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/
>>>
>>> Now why would this screw up your logging or anything else in  
>>> Swift?  Unless it screwed something up in the deef-provider  
>>> (which was already screwed up prior).  Now, the message "booting  
>>> deef" comes from Boot.java.  This file was from SVN, as Mihael  
>>> modified it a few days ago, so Yong's Boot.java was not carried  
>>> over.  Should I have used the older Boot.java (Yong's version  
>>> from July 26th)?  If this is not the issue, and it's something
>>> else related to the deef-provider, you can find the old
>>> deef-provider that you had before at:
>>> viper:/home/nefedova/cogl/modules/provider-deef_8-8-07_svn
>>>
>>> Ioan
>>> PS: I don't have rights to commit changes to SVN, so if you don't  
>>> want me to make any more changes to your Swift install, we can  
>>> wait until I get the right to commit my changes so you can see  
>>> them and pull them in yourself through SVN.
>>>
>>> Veronika Nefedova wrote:
>>>> the current changes screwed up my logging again...
>>>> Please, do not touch my install --- I'd rather get everything  
>>>> from SVN,
>>>>
>>>> nefedova at viper:~/alamines> swift -tc.file tc-uc.data -sites.file sites-uc-64.xml -debug MolDyn-244-loops.swift&
>>>> [1] 10562
>>>> nefedova at viper:~/alamines> WARN   - Failed to configure log file name
>>>> DEBUG  - Booting deef
>>>>
>>>>
>>>> Nika
>>>>
>>>> On Aug 8, 2007, at 1:19 PM, Mihael Hategan wrote:
>>>>
>>>>> On Wed, 2007-08-08 at 13:04 -0500, Ioan Raicu wrote:
>>>>>> Shouldn't we be certain that things work before we commit the  
>>>>>> changes?
>>>>>
>>>>> No.
>>>>>
>>>>>>   I thought the commit would take place after we try MolDyn  
>>>>>> out and we
>>>>>> see things are back to normal.
>>>>>
>>>>> The whole problem we've seen the past few days was due to the fact that
>>>>> Nika had no clear place to get the code from, so she repeatedly ended up
>>>>> with broken versions. S o  p u t  t h e  c h a n g e s  i n  S V N !
>>>>>
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Mihael Hategan wrote:
>>>>>>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>>>>>>
>>>>>>>> OK everyone, I found Yong's version of the provider dated  
>>>>>>>> July 26th,
>>>>>>>> much more recent than what was in SVN on June 27th.  I  
>>>>>>>> updated Nika's
>>>>>>>> version of the provider (which has been checked out of SVN),
>>>>>>>>
>>>>>>>
>>>>>>> No. P u t  t h e  c h a n g e s  i n  S V N !
>>>>>>>
>>>>>>>
>>>>>>>> and recompiled and deployed!
>>>>>>>>
>>>>>>>>   ant distclean
>>>>>>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
>>>>>>>>
>>>>>>>> I even updated some of the logging info to use the logger
>>>>>>>> (some were not using the logger).
>>>>>>>>
>>>>>>>> Nika, Falkon is freshly restarted and ready for another test  
>>>>>>>> run!
>>>>>>>>
>>>>>>>> Falkon Factory Service:
>>>>>>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>>>>>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>
>>>>>>>>> Ioan,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>>>>>>> June 27th. You really were supposed to use the SVN code from that
>>>>>>>>> point. Sigh. Did you make any changes to the viper install after
>>>>>>>>> June 27th?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Could it be that the fixes were done before the original SVN
>>>>>>>>>> checkin?   If not, then at least we know why things aren't
>>>>>>>>>> working.  I bet the latest provider source was in Nika's  
>>>>>>>>>> Swift
>>>>>>>>>> install on viper.  Nika, I take it you don't have this  
>>>>>>>>>> anymore, as
>>>>>>>>>> SVN updates overwrote this.  Yong, is there any other  
>>>>>>>>>> place you
>>>>>>>>>> might have the latest provider source?  If not, I guess we  
>>>>>>>>>> need to
>>>>>>>>>> take another look through the provider source to fix the  
>>>>>>>>>> issues
>>>>>>>>>> that we knew of...
>>>>>>>>>>
>>>>>>>>>> Ioan
>>>>>>>>>>
>>>>>>>>>> Mihael Hategan wrote:
>>>>>>>>>>
>>>>>>>>>>> Well, it doesn't look like the falkon provider in SVN has  
>>>>>>>>>>> been updated
>>>>>>>>>>> at all in terms of fixing synchronization issues. All  
>>>>>>>>>>> commits on
>>>>>>>>>>> provider-deef come from either ben or me:
>>>>>>>>>>>
>>>>>>>>>>> bash-3.1$ svn log
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>>
>>>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>>
>>>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>>
>>>>>>>>>>> a very small readme for provider-deef
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun 2007) | 1 line
>>>>>>>>>>>
>>>>>>>>>>> remove dist directory form svn
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun 2007) | 20 lines
>>>>>>>>>>>
>>>>>>>>>>> provider-deef, the Falkon/cog provider
>>>>>>>>>>>
>>>>>>>>>>> based on source in below message, with .class files deleted
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>>>>>>> <hategan at mcs.anl.gov>,
>>>>>>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>>>>>>
>>>>>>>>>>> its on viper.uchicago.edu
>>>>>>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>>>>>>> I also tarred it up and put it in my home on terminable: ~nefedova/cogl.tgz
>>>>>>>>>>>
>>>>>>>>>>> Nika
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Mihael, do you have any clues on why this run has  
>>>>>>>>>>>> failed? Ioan - my
>>>>>>>>>>>> answers to your questions are below...
>>>>>>>>>>>>
>>>>>>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> It looks like viper (where Swift is running) is idle, and so is
>>>>>>>>>>>>> tg-viz-login2 (where Falkon is running).
>>>>>>>>>>>>> The normal sequence of events for a successful task looks like this:
>>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>>>>>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>>>>>>
>>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>  17566  175660 2179412
>>>>>>>>>>>>>
>>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>   7959   55713  785035
>>>>>>>>>>>>>
>>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>> 190968 1909680 24003796
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were received
>>>>>>>>>>>>> from Falkon, and 190968 tasks were set to completed...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Obviously this isn't right.  Falkon only saw 7959  
>>>>>>>>>>>>> tasks, so I would
>>>>>>>>>>>>> argue that the # of notifications received is correct.   
>>>>>>>>>>>>> The
>>>>>>>>>>>>> submitted # of tasks looks like the # I would have  
>>>>>>>>>>>>> expected, but
>>>>>>>>>>>>> all the tasks did not make it to Falkon.  The Falkon  
>>>>>>>>>>>>> provider is
>>>>>>>>>>>>> what sits between the change of status to submitted,  
>>>>>>>>>>>>> and the
>>>>>>>>>>>>> receipt of the notification, so I would say that is the  
>>>>>>>>>>>>> first place
>>>>>>>>>>>>> we need to look for more details... there used to be some
>>>>>>>>>>>>> extra debug
>>>>>>>>>>>>> info in the Falkon provider that simply printed all the  
>>>>>>>>>>>>> tasks that
>>>>>>>>>>>>> were actually being submitted to Falkon (as opposed to  
>>>>>>>>>>>>> just the
>>>>>>>>>>>>> change of status within Karajan).  I don't see those debug
>>>>>>>>>>>>> statements, I bet they got overwritten in the SVN update.
>>>>>>>>>>>>> What about the completed tasks, why are there so many  
>>>>>>>>>>>>> (190K)
>>>>>>>>>>>>> completed tasks?  Where did they come from?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> "Task" doesn't mean job. It could be just data being staged in, etc.
>>>>>>>>>>>> The first 2 are important -- (Submitted vs Completed). Since they
>>>>>>>>>>>> differ, this is the problem...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Yong, are you keeping up with these emails?  Do you  
>>>>>>>>>>>>> still have a
>>>>>>>>>>>>> copy of the latest Falkon provider that you edited just  
>>>>>>>>>>>>> before you
>>>>>>>>>>>>> left?  Can you just take a look through there to make  
>>>>>>>>>>>>> sure nothing
>>>>>>>>>>>>> has been broken with the SVN updates?  If you don't  
>>>>>>>>>>>>> have time for
>>>>>>>>>>>>> this now (considering today was your first day on the  
>>>>>>>>>>>>> new job),
>>>>>>>>>>>>> I'll dig through there and see if I can make some sense  
>>>>>>>>>>>>> of what is
>>>>>>>>>>>>> happening!
>>>>>>>>>>>>>
>>>>>>>>>>>>> One last thing, Ben mentioned that the Falkon provider  
>>>>>>>>>>>>> you saw in
>>>>>>>>>>>>> Nika's account was different than what was in SVN.   
>>>>>>>>>>>>> Ben, did you at
>>>>>>>>>>>>> least look at modification dates?  How old was one as  
>>>>>>>>>>>>> opposed to
>>>>>>>>>>>>> the other?  I hope we did not revert back to an older  
>>>>>>>>>>>>> version that
>>>>>>>>>>>>> might have had some bug in it....
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> I had to update to the latest version of provider-deef  
>>>>>>>>>>>> from SVN since
>>>>>>>>>>>> without the update nothing worked. The version I am at  
>>>>>>>>>>>> now is 1050.
>>>>>>>>>>>> But this is exactly the same version of swift/deef I used for our
>>>>>>>>>>>> Friday run (which 'worked' from the Falkon/Swift point of view)
>>>>>>>>>>>>
>>>>>>>>>>>> Nika
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Ioan
>>>>>>>>>>>>>
>>>>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>>    7959  244749 3241072
>>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>>   17207  564648 7949388
>>>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I.e. more than half of the jobs haven't finished (according to swift)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also have some exceptions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile
>>>>>>>>>>>>>> (80 of those):
>>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>>      80     880    9705
>>>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>



