[Swift-devel] Q about MolDyn

Ioan Raicu iraicu at cs.uchicago.edu
Thu Aug 9 13:55:27 CDT 2007


Right, that is what we are trying to do now... get a working version of 
the Falkon provider, and commit the changes...

Veronika Nefedova wrote:
> It's such a mess... really, we should start using SVN asap.
>
> Nika
>
> On Aug 9, 2007, at 1:01 PM, Ioan Raicu wrote:
>
>> I don't know, and the machine looked relatively idle...  I am trying 
>> to track down which Falkon stubs are in Nika's Swift install; from the 
>> looks of it, they are really old, dating back to March.  I have not 
>> made many changes, but in the last month or so, I did change the 
>> notification engine to support persistent connections to speed it up 
>> a bit.  I made it backwards compatible, but maybe it's not 
>> 100%.  Basically, the service was using persistent socket support, 
>> while the client was not, and maybe that caused some problem.  I am 
>> now updating the Falkon stubs.  I found a single jar file in 
>> modules/provider-deef, which I have updated and committed!  But 
>> there are a bunch of others all over the place... let me track 
>> them down, and see if I can clean them up... as I suppose there 
>> should only be a single instance of these stubs, in a lib directory!  
>> Where should this master jar be, in which lib directory?
>> Ioan
>>
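
The hunt for stray copies of the stubs jar described above can be sketched as a short script. This is only a minimal sketch under assumptions: the jar name `FalkonStubs.jar` and the directory layout in the usage note are hypothetical placeholders, not the real names in the Swift tree.

```python
import os
from collections import defaultdict


def find_jars(root):
    """Yield the path of every .jar file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".jar"):
                yield os.path.join(dirpath, name)


def group_duplicates(paths):
    """Map each jar file name that occurs more than once to its locations."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[os.path.basename(path)].append(path)
    return {name: sorted(found) for name, found in by_name.items() if len(found) > 1}
```

Calling `group_duplicates(find_jars("/path/to/checkout"))` would list every jar name that appears in more than one directory. Comparing only file names does not show which copy is newest, so modification times or checksums would still need a manual look.
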
>> Mihael Hategan wrote:
>>> So I see a gap in the log from 19:32 to 21:45. No log messages
>>> whatsoever in between. Which is weird. I wonder what could cause log4j
>>> to stop writing things to the log file.
>>>
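
A gap like the one described above (no messages between 19:32 and 21:45) can be located mechanically. The following is a minimal sketch, assuming every log entry starts with a timestamp in the `2007-08-06 19:08:25,121` form seen elsewhere in this thread; the 10-minute threshold is an arbitrary choice, not something the log configuration defines.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # length of "2007-08-06 19:08:25,121"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (before, after) timestamp pairs separated by more than threshold."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # continuation lines (e.g. stack traces) carry no timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Running `find_gaps(open(logfile))` over the Swift log would show whether the silence is a single long pause or several shorter ones.
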
>>> On Wed, 2007-08-08 at 20:34 -0500, Ioan Raicu wrote:
>>>
>>>> Things are not halted, Falkon is still running, and it's delivering 
>>>> results really slowly...
>>>>
>>>> http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>
>>>> Notice the black area in the second graph; that is the time to 
>>>> deliver notifications to Swift... all machines are basically idle, 
>>>> I don't know what it could be... there is ample space on the 
>>>> disks... CPU is idle, memory is OK, yet things are just crawling, 
>>>> and Swift seems to have stopped printing anything to the screen or 
>>>> file.
>>>> The logs show nothing strange... but there is obviously something 
>>>> that is not right...
>>>>
>>>> I'll let the experiment keep going for now, and I'll dig into it 
>>>> deeper later tonight...
>>>>
>>>> Ioan
>>>>
>>>> Veronika Nefedova wrote:
>>>>
>>>>> Everything seemed to come to a halt.
>>>>>
>>>>> This is the last stdout that I have:
>>>>>
>>>>> Staged out 
>>>>> MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040.wham 
>>>>> to solv_repu_0.7_0.8_a0_m040.wham from UC-64
>>>>> Staged out 
>>>>> MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040_done 
>>>>> to solv_repu_0.7_0.8_a0_m040_done from UC-64
>>>>> Submitting task Task(type=4, 
>>>>> identity=urn:0-1-91-2-29-0-0-2-1186617126510)
>>>>> No host specified
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting 
>>>>> status to Active
>>>>> Submitting task Task(type=4, 
>>>>> identity=urn:0-1-91-2-29-0-0-1-1186617126513)
>>>>> No host specified
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting 
>>>>> status to Completed
>>>>> Submitting task Task(type=4, 
>>>>> identity=urn:0-1-91-2-29-0-0-3-1186617126516)
>>>>> No host specified
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting 
>>>>> status to Active
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) 
>>>>> Completed. Waiting: 1, Running: 14926. Heap size: 1518M, Heap 
>>>>> free: 962M, Max heap: 1518M
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting 
>>>>> status to Completed
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) 
>>>>> Completed. Waiting: 0, Running: 14926. Heap size: 1518M, Heap 
>>>>> free: 962M, Max heap: 1518M
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting 
>>>>> status to Active
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting 
>>>>> status to Completed
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) 
>>>>> Completed. Waiting: 0, Running: 14925. Heap size: 1518M, Heap 
>>>>> free: 962M, Max heap: 1518M
>>>>> Submitting task Task(type=4, 
>>>>> identity=urn:0-1-91-2-29-0-0-4-1186617126519)
>>>>> No host specified
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting 
>>>>> status to Active
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting 
>>>>> status to Completed
>>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) 
>>>>> Completed. Waiting: 0, Running: 14925. Heap size: 1518M, Heap 
>>>>> free: 962M, Max heap: 1518M
>>>>> Resolved 2078 to UC-64
>>>>> chrm_long completed
>>>>>
>>>>>
>>>>>
>>>>> Notice 'No host specified' -- this message was printing throughout 
>>>>> the whole execution, from the very beginning.
>>>>> The log is in 
>>>>> ~nefedova/alamines/MolDyn-244-loops-knt9h8fru9sm2.log on viper
>>>>>
>>>>> Nika
>>>>>
>>>>> On Aug 8, 2007, at 5:36 PM, Ioan Raicu wrote:
>>>>>
>>>>>
>>>>>> On viper, in Yong's account... he ran some tests just before he left 
>>>>>> with this version, and it worked just fine!
>>>>>> I saved Nika's provider which I replaced, so we can always go 
>>>>>> back to that if we need to.
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>>> Mihael Hategan wrote:
>>>>>>
>>>>>>> Where exactly is this version?
>>>>>>>
>>>>>>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>>>>>>
>>>>>>>> OK everyone, I found Yong's version of the provider dated July 26th,
>>>>>>>> much more recent than what was in SVN on June 27th.  I updated Nika's
>>>>>>>> version of the provider (which had been checked out of SVN), and
>>>>>>>> recompiled and deployed it!
>>>>>>>>   ant distclean
>>>>>>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
>>>>>>>>
>>>>>>>> I even updated some of the logging info to use the logger
>>>>>>>> (some were not using the logger).
>>>>>>>>
>>>>>>>> Nika, Falkon is freshly restarted and ready for another test run!
>>>>>>>>
>>>>>>>> Falkon Factory Service:
>>>>>>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService 
>>>>>>>>
>>>>>>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>> Ioan,
>>>>>>>>>
>>>>>>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>>>>>>> June 27th. You really were supposed to use the SVN code from that
>>>>>>>>> point. Sigh. Did you make any changes to the viper install after
>>>>>>>>> June 27th?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Could it be that the fixes were done before the original SVN
>>>>>>>>>> checkin?   If not, then at least we know why things aren't
>>>>>>>>>> working.  I bet the latest provider source was in Nika's Swift
>>>>>>>>>> install on viper.  Nika, I take it you don't have this 
>>>>>>>>>> anymore, as
>>>>>>>>>> SVN updates overwrote this.  Yong, is there any other place you
>>>>>>>>>> might have the latest provider source?  If not, I guess we 
>>>>>>>>>> need to
>>>>>>>>>> take another look through the provider source to fix the issues
>>>>>>>>>> that we knew of...
>>>>>>>>>>
>>>>>>>>>> Ioan
>>>>>>>>>>
>>>>>>>>>> Mihael Hategan wrote:
>>>>>>>>>>> Well, it doesn't look like the falkon provider in SVN has 
>>>>>>>>>>> been updated
>>>>>>>>>>> at all in terms of fixing synchronization issues. All 
>>>>>>>>>>> commits on
>>>>>>>>>>> provider-deef come from either ben or me:
>>>>>>>>>>>
>>>>>>>>>>> bash-3.1$ svn log
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>>
>>>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>>
>>>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>>
>>>>>>>>>>> a very small readme for provider-deef
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun 2007) | 1 line
>>>>>>>>>>>
>>>>>>>>>>> remove dist directory form svn
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun 2007) | 20 lines
>>>>>>>>>>>
>>>>>>>>>>> provider-deef, the Falkon/cog provider
>>>>>>>>>>>
>>>>>>>>>>> based on source in below message, with .class files deleted
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>>>>>>> <hategan at mcs.anl.gov>,
>>>>>>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>>>>>>
>>>>>>>>>>> its on viper.uchicago.edu
>>>>>>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>>>>>>> I also tared it up and put in my home on terminable: 
>>>>>>>>>>> ~nefedova/cogl.tgz
>>>>>>>>>>>
>>>>>>>>>>> Nika
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------ 
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Mihael, do you have any clues on why this run has failed? 
>>>>>>>>>>>> Ioan - my answers to your questions are below...
>>>>>>>>>>>>
>>>>>>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> It looks like viper (where Swift is running) is idle, and 
>>>>>>>>>>>>> so is tg-viz-login2 (where Falkon is running).
>>>>>>>>>>>>> For reference, the normal sequence of events for a 
>>>>>>>>>>>>> successful task is:
>>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>>>>>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>>>>>>
>>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>  17566  175660 2179412
>>>>>>>>>>>>>
>>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>   7959   55713  785035
>>>>>>>>>>>>>
>>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>> 190968 1909680 24003796
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were 
>>>>>>>>>>>>> received from Falkon, and 190968 tasks were set to 
>>>>>>>>>>>>> Completed...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Obviously this isn't right.  Falkon only saw 7959 tasks, 
>>>>>>>>>>>>> so I would argue that the # of notifications received is 
>>>>>>>>>>>>> correct.  The # of submitted tasks looks like what I 
>>>>>>>>>>>>> would have expected, but not all the tasks made it to 
>>>>>>>>>>>>> Falkon.  The Falkon provider is what sits between the 
>>>>>>>>>>>>> change of status to Submitted and the receipt of the 
>>>>>>>>>>>>> notification, so I would say that is the first place we 
>>>>>>>>>>>>> need to look for more details... there used to be some 
>>>>>>>>>>>>> extra debug info in the Falkon provider that simply printed 
>>>>>>>>>>>>> all the tasks that were actually being submitted to Falkon 
>>>>>>>>>>>>> (as opposed to just the change of status within 
>>>>>>>>>>>>> Karajan).  I don't see those debug statements; I bet they 
>>>>>>>>>>>>> got overwritten in the SVN update.
>>>>>>>>>>>>> What about the completed tasks, why are there so many 
>>>>>>>>>>>>> (190K) completed tasks?  Where did they come from?
>>>>>>>>>>>>>
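
The comparison above between Submitted lines and Falkon notifications can be taken one step further by listing which task identities never received a notification. This is a minimal sketch, assuming each log entry sits on one line in the format of the grep output quoted earlier; the function name is a hypothetical helper, not part of Swift or Falkon.

```python
import re

# Patterns follow the sample log lines in this thread; optional whitespace
# after "urn:" tolerates copies of the log that were re-wrapped by a mailer.
SUBMITTED_RE = re.compile(r"identity=urn:\s*([\d-]+)\) setting status to Submitted")
NOTIFIED_RE = re.compile(r"NotificationThread\s+notification:\s*urn:\s*([\d-]+)")


def unnotified_tasks(lines):
    """Return the task ids that were Submitted but never seen in a notification."""
    submitted, notified = set(), set()
    for line in lines:
        match = SUBMITTED_RE.search(line)
        if match:
            submitted.add(match.group(1))
            continue
        match = NOTIFIED_RE.search(line)
        if match:
            notified.add(match.group(1))
    return submitted - notified
```

The resulting ids could then be checked against the Falkon service logs to see whether those tasks ever reached Falkon at all.
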
>>>>>>>>>>>>>
>>>>>>>>>>>> "Task" doesn't mean job. It could be just data being staged 
>>>>>>>>>>>> in, etc. The first 2 are important (Submitted vs 
>>>>>>>>>>>> Completed). Since they differ, this is the problem...
>>>>>>>>>>>>
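
Since a "Task" may be a job or a data transfer, a raw count of "setting status to Completed" lines mixes the two. A minimal sketch, assuming one log entry per line in the format quoted in this thread, tallies status changes per numeric task type without assuming what each type number means:

```python
import re
from collections import Counter

STATUS_RE = re.compile(
    r"Task\(type=(\d+), identity=urn:\s*[\d-]+\) setting status to (\w+)"
)


def tally_statuses(lines):
    """Count status transitions, keyed by (task type, new status)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts
```

Comparing, say, `counts[(1, "Submitted")]` with `counts[(1, "Completed")]` keeps the job-type tasks apart from the much larger number of other task types.
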
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Yong, are you keeping up with these emails?  Do you still 
>>>>>>>>>>>>> have a copy of the latest Falkon provider that you edited 
>>>>>>>>>>>>> just before you left?  Can you just take a look through 
>>>>>>>>>>>>> there to make sure nothing has been broken with the SVN 
>>>>>>>>>>>>> updates?  If you don't have time for this now 
>>>>>>>>>>>>> (considering today was your first day on the new job), 
>>>>>>>>>>>>> I'll dig through there and see if I can make some sense of 
>>>>>>>>>>>>> what is happening!
>>>>>>>>>>>>>
>>>>>>>>>>>>> One last thing, Ben mentioned that the Falkon provider you 
>>>>>>>>>>>>> saw in Nika's account was different than what was in 
>>>>>>>>>>>>> SVN.  Ben, did you at least look at modification dates?  
>>>>>>>>>>>>> How old was one as opposed to the other?  I hope we did 
>>>>>>>>>>>>> not revert back to an older version that might have had 
>>>>>>>>>>>>> some bug in it....
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> I had to update to the latest version of provider-deef from 
>>>>>>>>>>>> SVN since without the update nothing worked. The version I 
>>>>>>>>>>>> am at now is 1050. But this is exactly the same version of 
>>>>>>>>>>>> swift/deef I used for our Friday run (which 'worked' from 
>>>>>>>>>>>> the Falkon/Swift point of view)
>>>>>>>>>>>>
>>>>>>>>>>>> Nika
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Ioan
>>>>>>>>>>>>>
>>>>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>>    7959  244749 3241072
>>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>>   17207  564648 7949388
>>>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I.e. more than half of the jobs haven't finished (according 
>>>>>>>>>>>>>> to swift)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also have some exceptions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile
>>>>>>>>>>>>>> (80 of those):
>>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>>      80     880    9705
>>>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>
>>>
>>>
>>
>
>
