[Swift-devel] Q about MolDyn

Ioan Raicu iraicu at cs.uchicago.edu
Tue Aug 7 11:32:13 CDT 2007


Could it be that the fixes were done before the original SVN checkin?   
If not, then at least we know why things aren't working.  I bet the 
latest provider source was in Nika's Swift install on viper.  Nika, I 
take it you don't have this anymore, as SVN updates overwrote this.  
Yong, is there any other place you might have the latest provider 
source?  If not, I guess we need to take another look through the 
provider source to fix the issues that we knew of...

Ioan

Mihael Hategan wrote:
> Well, it doesn't look like the falkon provider in SVN has been updated
> at all in terms of fixing synchronization issues. All commits on
> provider-deef come from either ben or me:
>
> bash-3.1$ svn log
> ------------------------------------------------------------------------
> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug 2007) | 1 line
>
> removed gt4 stuff and added them as a dependency
> ------------------------------------------------------------------------
> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug 2007) | 1 line
>
> removed gt4 stuff and added them as a dependency
> ------------------------------------------------------------------------
> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug 2007) | 1 line
>
> a very small readme for provider-deef
> ------------------------------------------------------------------------
> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun 2007) | 1 line
>
> remove dist directory form svn
> ------------------------------------------------------------------------
> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun 2007) | 20 lines
>
> provider-deef, the Falkon/cog provider
>
> based on source in below message, with .class files deleted
>
>
> Date: Wed, 27 Jun 2007 09:27:23 -0500
> From: Veronika Nefedova <nefedova at mcs.anl.gov>
> To: Yong Zhao <yongzh at cs.uchicago.edu>
> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
> <hategan at mcs.anl.gov>,
>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>     Mike Wilde <wilde at mcs.anl.gov>,
>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
> Subject: Re: 244 molecule MolDyn run...
>
> its on viper.uchicago.edu
> in : /home/nefedova/cogl/modules/provider-deef/
> I also tared it up and put in my home on terminable: ~nefedova/cogl.tgz
>
> Nika
>
>
> ------------------------------------------------------------------------
>
>
> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>   
>> Mihael, do you have any clues on why this run has failed? Ioan - my  
>> answers to your questions are below...
>>
>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>
>>     
>>> It looks like viper (where Swift is running) is idle, and so is
>>> tg-viz-login2 (where Falkon is running).
>>> What looks evident to me is the normal sequence of events for a
>>> successful task:
>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>
>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>  17566  175660 2179412
>>>
>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>   7959   55713  785035
>>>
>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>> 190968 1909680 24003796
>>>
>>> Now, 17566 tasks were submitted, 7959 notifications were received
>>> from Falkon, and 190968 tasks were set to Completed...
>>>
>>> Obviously this isn't right.  Falkon only saw 7959 tasks, so I would
>>> argue that the # of notifications received is correct.  The
>>> submitted # of tasks looks like the # I would have expected, but
>>> not all of the tasks made it to Falkon.  The Falkon provider is
>>> what sits between the change of status to Submitted and the
>>> receipt of the notification, so I would say that is the first place
>>> we need to look for more details... there used to be some extra debug
>>> info in the Falkon provider that simply printed all the tasks that
>>> were actually being submitted to Falkon (as opposed to just the
>>> change of status within Karajan).  I don't see those debug
>>> statements; I bet they got overwritten in the SVN update.
>>> What about the completed tasks: why are there so many (190K)
>>> completed tasks?  Where did they come from?
>>>
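The aggregate counts above say that tasks went missing between Submitted and the Falkon notification, but not which ones. A minimal sketch of a per-task reconciliation follows, assuming every relevant log line has the same layout as the three lines quoted above; the regular expressions, the function names, and the second task ID in the sample are illustrative, not taken from the real log.

```python
import re
from collections import defaultdict

# Patterns modelled on the three quoted log lines (layout is an assumption).
# "\s*" after "urn:" tolerates the space that mail wrapping may have inserted.
STATUS_RE = re.compile(r"identity=(urn:\s*[\d-]+)\) setting status to (\w+)")
NOTIFY_RE = re.compile(r"NotificationThread notification: (urn:\s*[\d-]+)")


def normalize(urn):
    """Strip any whitespace inside a task identity."""
    return re.sub(r"\s+", "", urn)


def reconcile(lines):
    """Return (submitted, notified, completed) sets of task identities."""
    seen = defaultdict(set)
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            seen[m.group(2)].add(normalize(m.group(1)))
            continue
        m = NOTIFY_RE.search(line)
        if m:
            seen["Notified"].add(normalize(m.group(1)))
    return seen["Submitted"], seen["Notified"], seen["Completed"]


SAMPLE = [
    "2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted",
    "2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0",
    "2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed",
    # Hypothetical task (made-up ID) that was submitted but never notified.
    "2007-08-06 19:08:26,000 DEBUG TaskImpl Task(type=1, identity=urn:0-1-99-9-99-0-0-1111111111111) setting status to Submitted",
]

submitted, notified, completed = reconcile(SAMPLE)
lost = submitted - notified  # submitted in Karajan, never reported by Falkon
print(sorted(lost))
```

To run it against the real log, pass an open file handle for MolDyn-244-loops-zhgo6be8tjhi1.log to reconcile() in place of SAMPLE; the resulting set of "lost" identities could then be compared with whatever the Falkon side logged.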
>>>       
>> "Task" doesn't mean job. It could be just data being staged in, etc.
>> The first 2 are important -- (Submitted vs Completed). Since they
>> differ, this is the problem...
>>
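Since a "Task" is not necessarily a job, the 190K Completed count may be easier to read once it is split by the type= field that TaskImpl logs. In the lines quoted in this thread, type=1 appears on the job task and type=2 on the getFile failure; treating type=1 as job submission and type=2 as file transfer is an assumption here, not something the log itself states. A rough sketch, with an illustrative regular expression and function name:

```python
import re
from collections import Counter

# Layout taken from the TaskImpl lines quoted in this thread (an assumption
# for the rest of the log).
TASK_RE = re.compile(r"Task\(type=(\d+), identity=[^)]*\) setting status to (\w+)")


def count_by_type(lines):
    """Count status changes keyed by (task type, status)."""
    counts = Counter()
    for line in lines:
        m = TASK_RE.search(line)
        if m:
            counts[(int(m.group(1)), m.group(2))] += 1
    return counts


SAMPLE = [
    "2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted",
    "2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed",
    "2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile",
]

counts = count_by_type(SAMPLE)
for (task_type, status), n in sorted(counts.items()):
    print(task_type, status, n)
```

If most of the Completed lines in the real log turn out to carry a type other than 1, that would be consistent with the extra completions being staging or file operations rather than jobs.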
>>
>>     
>>> Yong, are you keeping up with these emails?  Do you still have a  
>>> copy of the latest Falkon provider that you edited just before you  
>>> left?  Can you just take a look through there to make sure nothing  
>>> has been broken with the SVN updates?  If you don't have time for  
>>> this now (considering today was your first day on the new job),  
>>> I'll dig through there and see if I can make some sense of what is  
>>> happening!
>>>
>>> One last thing, Ben mentioned that the Falkon provider you saw in  
>>> Nika's account was different than what was in SVN.  Ben, did you at  
>>> least look at modification dates?  How old was one as opposed to  
>>> the other?  I hope we did not revert back to an older version that  
>>> might have had some bug in it....
>>>
>>>       
>> I had to update to the latest version of provider-deef from SVN since
>> without the update nothing worked. The version I am at now is 1050.
>> But this is exactly the same version of swift/deef I used for our
>> Friday run (which 'worked' from the Falkon/Swift point of view)
>>
>> Nika
>>
>>
>>     
>>> Ioan
>>>
>>> Veronika Nefedova wrote:
>>>       
>>>> Well, there are some discrepancies:
>>>>
>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>    7959  244749 3241072
>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>   17207  564648 7949388
>>>> nefedova at viper:~/alamines>
>>>>
>>>> I.e. more than half of the jobs (9248 of 17207) haven't finished
>>>> (according to Swift)
>>>>
>>>> I also have some exceptions:
>>>>
>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile
>>>> (80 of those):
>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>      80     880    9705
>>>> nefedova at viper:~/alamines>
>>>>
>>>>
>>>> Nika
>>>>         