[Swift-devel] Q about MolDyn

Ioan Raicu iraicu at cs.uchicago.edu
Thu Aug 9 13:01:54 CDT 2007


I don't know, and the machine looked relatively idle...  I am trying to 
track down which Falkon stubs are in Nika's Swift install; from the looks 
of it, they are really old, from back in March.  I have not made many 
changes, but in the last month or so, I did change the notification 
engine to support persistent connections to speed it up a bit.  I made 
it backwards compatible, but maybe it's not 100%.  Basically, the 
service was using persistent socket support, while the client was not, 
and maybe that caused some problem.  I am now updating the Falkon stubs.  
I found a single jar file in modules/provider-deef, which I have 
updated and committed!  But there are a bunch more of them all over 
the place... let me track them down, and see if I can clean them up, 
as I suppose there should only be a single instance of these stubs, in a 
lib directory!  Which lib directory should this master jar live in?
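
A minimal sketch of how the stray stub jars could be tracked down, 
assuming a plain directory walk over the checkout is enough.  The 
function names (sha256_of, find_jar_copies, report) are hypothetical 
and not part of Swift or Falkon; the idea is only to list every jar 
name that appears more than once and show whether the copies differ.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_jar_copies(root):
    """Map each jar name found more than once under root to its (path, digest) pairs."""
    copies = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".jar"):
                path = os.path.join(dirpath, name)
                copies[name].append((path, sha256_of(path)))
    # Keep only names that occur in more than one place.
    return {name: sorted(found) for name, found in copies.items() if len(found) > 1}


def report(root):
    """Print every duplicated jar, flagging copies whose contents differ."""
    for name, found in sorted(find_jar_copies(root).items()):
        digests = {digest for _path, digest in found}
        status = "identical" if len(digests) == 1 else "DIFFERENT CONTENT"
        print(f"{name}: {len(found)} copies, {status}")
        for path, digest in found:
            print(f"  {digest[:12]}  {path}")
```

Calling report() on the root of the install (for example the cogl 
checkout on viper) would list the duplicated jars; copies flagged as 
different are the ones worth comparing against the freshly committed 
jar before deleting anything.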

Ioan

Mihael Hategan wrote:
> So I see a gap in the log from 19:32 to 21:45. No log messages
> whatsoever in between. Which is weird. I wonder what could cause log4j
> to stop writing things to the log file.
>
> On Wed, 2007-08-08 at 20:34 -0500, Ioan Raicu wrote:
>   
>> Things are not halted; Falkon is still running, and it's delivering 
>> results really slowly...
>>
>> http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>
>> Notice the black area in the second graph: that is the time to deliver 
>> notifications to Swift... all machines are basically idle, I don't know 
>> what it could be... there is ample space on the disks... CPU is idle, 
>> memory is OK, yet things are just crawling, and Swift seems to have 
>> stopped printing anything to the screen or file. 
>>
>> The logs show nothing strange... but there is obviously something that 
>> is not right...
>>
>> I'll let the experiment keep going for now, and I'll dig into it deeper 
>> later tonight...
>>
>> Ioan
>>
>> Veronika Nefedova wrote:
>>     
>>> Everything seemed to come to a halt.
>>>
>>> This is the last stdout that I have:
>>>
>>> Staged out 
>>> MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040.wham 
>>> to solv_repu_0.7_0.8_a0_m040.wham from UC-64
>>> Staged out 
>>> MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040_done 
>>> to solv_repu_0.7_0.8_a0_m040_done from UC-64
>>> Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510)
>>> No host specified
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting 
>>> status to Active
>>> Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513)
>>> No host specified
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting 
>>> status to Completed
>>> Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516)
>>> No host specified
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting 
>>> status to Active
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) Completed. 
>>> Waiting: 1, Running: 14926. Heap size: 1518M, Heap free: 962M, Max 
>>> heap: 1518M
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting 
>>> status to Completed
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) Completed. 
>>> Waiting: 0, Running: 14926. Heap size: 1518M, Heap free: 962M, Max 
>>> heap: 1518M
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting 
>>> status to Active
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting 
>>> status to Completed
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) Completed. 
>>> Waiting: 0, Running: 14925. Heap size: 1518M, Heap free: 962M, Max 
>>> heap: 1518M
>>> Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519)
>>> No host specified
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting 
>>> status to Active
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting 
>>> status to Completed
>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) Completed. 
>>> Waiting: 0, Running: 14925. Heap size: 1518M, Heap free: 962M, Max 
>>> heap: 1518M
>>> Resolved 2078 to UC-64
>>> chrm_long completed
>>>
>>>
>>>
>>> Notice 'No host specified' -- this message was printing throughout the 
>>> whole execution, from the very beginning.
>>> The log is in ~nefedova/alamines/MolDyn-244-loops-knt9h8fru9sm2.log on 
>>> viper
>>>
>>> Nika
>>>
>>> On Aug 8, 2007, at 5:36 PM, Ioan Raicu wrote:
>>>
>>>       
>>>> On viper, in Yong's account... he ran some tests with this version 
>>>> just before he left, and it worked just fine!
>>>> I saved Nika's provider which I replaced, so we can always go back to 
>>>> that if we need to.
>>>>
>>>> Ioan
>>>>
>>>> Mihael Hategan wrote:
>>>>         
>>>>> Where exactly is this version?
>>>>>
>>>>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>>>>   
>>>>>           
>>>>>> OK everyone, I found Yong's version of the provider dated July 26th,
>>>>>> much more recent than what was in SVN on June 27th.  I updated Nika's
>>>>>> version of the provider (which had been checked out of SVN), and
>>>>>> recompiled and deployed it!  
>>>>>>
>>>>>>   ant distclean
>>>>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
>>>>>>
>>>>>> I even updated some of the logging info to use the logger
>>>>>> (some statements were not using the logger).
>>>>>>
>>>>>> Nika, Falkon is freshly restarted and ready for another test run!
>>>>>>
>>>>>> Falkon Factory Service:
>>>>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>>>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Veronika Nefedova wrote: 
>>>>>>     
>>>>>>             
>>>>>>> Ioan, 
>>>>>>>
>>>>>>>
>>>>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>>>>> June 27th. You really were supposed to use the SVN code from that
>>>>>>> point. Sigh. Did you make any changes to the viper install after
>>>>>>> June 27th?
>>>>>>>
>>>>>>>
>>>>>>> Nika
>>>>>>>
>>>>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>>>>
>>>>>>>       
>>>>>>>               
>>>>>>>> Could it be that the fixes were done before the original SVN
>>>>>>>> checkin?   If not, then at least we know why things aren't
>>>>>>>> working.  I bet the latest provider source was in Nika's Swift
>>>>>>>> install on viper.  Nika, I take it you don't have this anymore, as
>>>>>>>> SVN updates overwrote this.  Yong, is there any other place you
>>>>>>>> might have the latest provider source?  If not, I guess we need to
>>>>>>>> take another look through the provider source to fix the issues
>>>>>>>> that we knew of...
>>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Mihael Hategan wrote: 
>>>>>>>>         
>>>>>>>>                 
>>>>>>>>> Well, it doesn't look like the falkon provider in SVN has been updated
>>>>>>>>> at all in terms of fixing synchronization issues. All commits on
>>>>>>>>> provider-deef come from either ben or me:
>>>>>>>>>
>>>>>>>>> bash-3.1$ svn log
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug
>>>>>>>>> 2007) | 1 line
>>>>>>>>>
>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug
>>>>>>>>> 2007) | 1 line
>>>>>>>>>
>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug
>>>>>>>>> 2007) | 1 line
>>>>>>>>>
>>>>>>>>> a very small readme for provider-deef
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun
>>>>>>>>> 2007) | 1 line
>>>>>>>>>
>>>>>>>>> remove dist directory form svn
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun
>>>>>>>>> 2007) | 20 lines
>>>>>>>>>
>>>>>>>>> provider-deef, the Falkon/cog provider
>>>>>>>>>
>>>>>>>>> based on source in below message, with .class files deleted
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>>>>> <hategan at mcs.anl.gov>,
>>>>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>>>>
>>>>>>>>> its on viper.uchicago.edu
>>>>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>>>>> I also tared it up and put in my home on terminable: ~nefedova/cogl.tgz
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>>>>   
>>>>>>>>>           
>>>>>>>>>                   
>>>>>>>>>> Mihael, do you have any clues on why this run has failed? Ioan - my  
>>>>>>>>>> answers to your questions are below...
>>>>>>>>>>
>>>>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>>>>
>>>>>>>>>>     
>>>>>>>>>>             
>>>>>>>>>>                     
>>>>>>>>>>> It looks like viper (where Swift is running) is idle, and so is
>>>>>>>>>>> tg-viz-login2 (where Falkon is running).
>>>>>>>>>>> The normal list of events for a successful task looks like this:
>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>>>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>>>>
>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>  17566  175660 2179412
>>>>>>>>>>>
>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>   7959   55713  785035
>>>>>>>>>>>
>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>> 190968 1909680 24003796
>>>>>>>>>>>
>>>>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were received
>>>>>>>>>>> from Falkon, and 190968 tasks were set to completed...
>>>>>>>>>>>
>>>>>>>>>>> Obviously this isn't right.  Falkon only saw 7959 tasks, so I would
>>>>>>>>>>> argue that the # of notifications received is correct.  The
>>>>>>>>>>> submitted # of tasks looks like the # I would have expected, but
>>>>>>>>>>> not all the tasks made it to Falkon.  The Falkon provider is
>>>>>>>>>>> what sits between the change of status to submitted and the
>>>>>>>>>>> receipt of the notification, so I would say that is the first place
>>>>>>>>>>> we need to look for more details... there used to be some extra debug
>>>>>>>>>>> info in the Falkon provider that simply printed all the tasks that
>>>>>>>>>>> were actually being submitted to Falkon (as opposed to just the
>>>>>>>>>>> change of status within Karajan).  I don't see those debug
>>>>>>>>>>> statements; I bet they got overwritten in the SVN update.
>>>>>>>>>>> What about the completed tasks, why are there so many (190K)
>>>>>>>>>>> completed tasks?  Where did they come from?
>>>>>>>>>>>
>>>>>>>>>>>       
>>>>>>>>>>>               
>>>>>>>>>>>                       
>>>>>>>>>> "Task" doesn't mean job. It could be just data being staged in, etc.
>>>>>>>>>> The first 2 are important -- (Submitted vs Completed). Since they
>>>>>>>>>> differ, this is the problem...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     
>>>>>>>>>>             
>>>>>>>>>>                     
>>>>>>>>>>> Yong, are you keeping up with these emails?  Do you still have a  
>>>>>>>>>>> copy of the latest Falkon provider that you edited just before you  
>>>>>>>>>>> left?  Can you just take a look through there to make sure nothing  
>>>>>>>>>>> has been broken with the SVN updates?  If you don't have time for  
>>>>>>>>>>> this now (considering today was your first day on the new job),  
>>>>>>>>>>> I'll dig through there and see if I can make some sense of what is  
>>>>>>>>>>> happening!
>>>>>>>>>>>
>>>>>>>>>>> One last thing, Ben mentioned that the Falkon provider you saw in  
>>>>>>>>>>> Nika's account was different than what was in SVN.  Ben, did you at  
>>>>>>>>>>> least look at modification dates?  How old was one as opposed to  
>>>>>>>>>>> the other?  I hope we did not revert back to an older version that  
>>>>>>>>>>> might have had some bug in it....
>>>>>>>>>>>
>>>>>>>>>>>       
>>>>>>>>>>>               
>>>>>>>>>>>                       
>>>>>>>>>> I had to update to the latest version of provider-deef from SVN, since
>>>>>>>>>> nothing worked without the update. The version I am at now is 1050.
>>>>>>>>>> But this is exactly the same version of swift/deef I used for our
>>>>>>>>>> Friday run (which 'worked' from the Falkon/Swift point of view).
>>>>>>>>>>
>>>>>>>>>> Nika
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     
>>>>>>>>>>             
>>>>>>>>>>                     
>>>>>>>>>>> Ioan
>>>>>>>>>>>
>>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>>       
>>>>>>>>>>>               
>>>>>>>>>>>                       
>>>>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>>>>
>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>    7959  244749 3241072
>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>   17207  564648 7949388
>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>
>>>>>>>>>>>> I.e. almost half of the jobs haven't finished (according to swift)
>>>>>>>>>>>>
>>>>>>>>>>>> I also have some exceptions:
>>>>>>>>>>>>
>>>>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception
>>>>>>>>>>>> in getFile
>>>>>>>>>>>> (80 of those):
>>>>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>      80     880    9705
>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Nika
>>>>>>>>>>>>         
>>>>>>>>>>>>                 
>>>>>>>>>>>>                         
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>
>>>>>>>>>>     
>>>>>>>>>>             
>>>>>>>>>>                     
>>>>>>>       
>>>>>>>               
>>>>>   
>>>>>           
>
>
>   


