[Swift-devel] Q about MolDyn

Veronika Nefedova nefedova at mcs.anl.gov
Thu Aug 9 13:07:48 CDT 2007


It's such a mess... really, we should start using SVN ASAP.

Nika

On Aug 9, 2007, at 1:01 PM, Ioan Raicu wrote:

> I don't know, and the machine looked relatively idle... I am
> trying to track down which Falkon stubs are in Nika's Swift install;
> from the looks of it, they are really old, from back in March. I have
> not made many changes, but in the last month or so, I did change
> the notification engine to support persistent connections to speed
> it up a bit. I made it backwards compatible, but maybe it's
> not 100%. Basically, the service was using persistent socket
> support while the client was not, and maybe that caused some
> problem. I am now updating the Falkon stubs. I found a single jar
> file in modules/provider-deef, which I have updated and
> committed! But there is another bunch of them all over the
> place... let me track them down and see if I can clean them up,
> as I suppose there should only be a single instance of these stubs,
> in a lib directory! Where should this master jar be, in which lib
> directory?
> Ioan
>
> Mihael Hategan wrote:
>> So I see a gap in the log from 19:32 to 21:45. No log messages
>> whatsoever in between. Which is weird. I wonder what could cause
>> log4j to stop writing things to the log file.
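
One way to locate a silence like the 19:32 to 21:45 gap described above is to scan the log for consecutive timestamps that are far apart. The sketch below is only an illustration: it assumes each log entry starts with a timestamp in the form seen elsewhere in this thread (for example "2007-08-06 19:08:25,121"), and the name find_gaps and the 10-minute threshold are hypothetical choices, not part of Swift or log4j.

```python
from datetime import datetime, timedelta

# Timestamp prefix as it appears in the log excerpts in this thread,
# e.g. "2007-08-06 19:08:25,121" (23 characters).
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23


def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs separated by more than threshold.

    Lines that do not start with a parseable timestamp (stack traces,
    continuation lines) are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            current = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # not a timestamped entry
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Passing an open log file to find_gaps would list each silent interval. This only shows when logging stopped and resumed, not why it stopped.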
>>
>> On Wed, 2007-08-08 at 20:34 -0500, Ioan Raicu wrote:
>>
>>> Things are not halted; Falkon is still running, and it's
>>> delivering results really slowly...
>>>
>>> http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>
>>> Notice the black area in the second graph; that is the time to
>>> deliver notifications to Swift... all machines are basically
>>> idle, I don't know what it could be... there is ample space on
>>> the disks, CPU is idle, memory is OK, yet things are just
>>> crawling, and Swift seems to have stopped printing anything to
>>> the screen or file.
>>> The logs show nothing strange... but there is obviously something
>>> that is not right...
>>>
>>> I'll let the experiment keep going for now, and I'll dig into it  
>>> deeper later tonight...
>>>
>>> Ioan
>>>
>>> Veronika Nefedova wrote:
>>>
>>>> Everything seemed to come to a halt.
>>>>
>>>> This is the last stdout that I have:
>>>>
>>>> Staged out MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040.wham to solv_repu_0.7_0.8_a0_m040.wham from UC-64
>>>> Staged out MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040_done to solv_repu_0.7_0.8_a0_m040_done from UC-64
>>>> Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510)
>>>> No host specified
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting status to Active
>>>> Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513)
>>>> No host specified
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting status to Completed
>>>> Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516)
>>>> No host specified
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting status to Active
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) Completed. Waiting: 1, Running: 14926. Heap size: 1518M, Heap free: 962M, Max heap: 1518M
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting status to Completed
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) Completed. Waiting: 0, Running: 14926. Heap size: 1518M, Heap free: 962M, Max heap: 1518M
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting status to Active
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting status to Completed
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) Completed. Waiting: 0, Running: 14925. Heap size: 1518M, Heap free: 962M, Max heap: 1518M
>>>> Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519)
>>>> No host specified
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting status to Active
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting status to Completed
>>>> Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) Completed. Waiting: 0, Running: 14925. Heap size: 1518M, Heap free: 962M, Max heap: 1518M
>>>> Resolved 2078 to UC-64
>>>> chrm_long completed
>>>>
>>>>
>>>>
>>>> Notice 'No host specified' -- this message was printed
>>>> throughout the whole execution, from the very beginning.
>>>> The log is in ~nefedova/alamines/MolDyn-244-loops-knt9h8fru9sm2.log
>>>> on viper
>>>>
>>>> Nika
>>>>
>>>> On Aug 8, 2007, at 5:36 PM, Ioan Raicu wrote:
>>>>
>>>>
>>>>> viper in Yong's account... he ran some tests just before he  
>>>>> left with this version, and it worked just fine!
>>>>> I saved Nika's provider which I replaced, so we can always go  
>>>>> back to that if we need to.
>>>>>
>>>>> Ioan
>>>>>
>>>>> Mihael Hategan wrote:
>>>>>
>>>>>> Where exactly is this version?
>>>>>>
>>>>>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>>>>>
>>>>>>> OK everyone, I found Yong's version of the provider dated
>>>>>>> July 26th, much more recent than what was in SVN on June 27th. I
>>>>>>> updated Nika's version of the provider (which has been checked
>>>>>>> out of SVN), then recompiled and deployed it:
>>>>>>>   ant distclean
>>>>>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
>>>>>>>
>>>>>>> I even updated some of the logging info to use the logger
>>>>>>> (some were not using the logger).
>>>>>>>
>>>>>>> Nika, Falkon is freshly restarted and ready for another test  
>>>>>>> run!
>>>>>>>
>>>>>>> Falkon Factory Service:
>>>>>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>>>>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>>>>>
>>>>>>> Ioan
>>>>>>>
>>>>>>> Veronika Nefedova wrote:
>>>>>>>> Ioan,
>>>>>>>>
>>>>>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>>>>>> June 27th. You really were supposed to use the SVN code from that
>>>>>>>> point. Sigh. Did you make any changes to the viper install after
>>>>>>>> June 27th?
>>>>>>>>
>>>>>>>>
>>>>>>>> Nika
>>>>>>>>
>>>>>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Could it be that the fixes were done before the original SVN
>>>>>>>>> checkin? If not, then at least we know why things aren't
>>>>>>>>> working. I bet the latest provider source was in Nika's Swift
>>>>>>>>> install on viper. Nika, I take it you don't have this anymore,
>>>>>>>>> as SVN updates overwrote this. Yong, is there any other place
>>>>>>>>> you might have the latest provider source? If not, I guess we
>>>>>>>>> need to take another look through the provider source to fix
>>>>>>>>> the issues that we knew of...
>>>>>>>>>
>>>>>>>>> Ioan
>>>>>>>>>
>>>>>>>>> Mihael Hategan wrote:
>>>>>>>>>> Well, it doesn't look like the Falkon provider in SVN has been
>>>>>>>>>> updated at all in terms of fixing synchronization issues. All
>>>>>>>>>> commits on provider-deef come from either Ben or me:
>>>>>>>>>>
>>>>>>>>>> bash-3.1$ svn log
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>
>>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>
>>>>>>>>>> removed gt4 stuff and added them as a dependency
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug 2007) | 1 line
>>>>>>>>>>
>>>>>>>>>> a very small readme for provider-deef
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun 2007) | 1 line
>>>>>>>>>>
>>>>>>>>>> remove dist directory form svn
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun 2007) | 20 lines
>>>>>>>>>>
>>>>>>>>>> provider-deef, the Falkon/cog provider
>>>>>>>>>>
>>>>>>>>>> based on source in below message, with .class files deleted
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>>>>>> <hategan at mcs.anl.gov>,
>>>>>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>>>>>
>>>>>>>>>> its on viper.uchicago.edu
>>>>>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>>>>>> I also tared it up and put in my home on terminable: ~nefedova/cogl.tgz
>>>>>>>>>>
>>>>>>>>>> Nika
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>>>>>
>>>>>>>>>>> Mihael, do you have any clues on why this run has failed?
>>>>>>>>>>> Ioan - my answers to your questions are below...
>>>>>>>>>>>
>>>>>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> It looks like viper (where Swift is running) is idle,
>>>>>>>>>>>> and so is tg-viz-login2 (where Falkon is running).
>>>>>>>>>>>> What looks evident to me is that the normal list of
>>>>>>>>>>>> events for a successful task is:
>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>>>>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>>>>>
>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>  17566  175660 2179412
>>>>>>>>>>>>
>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>   7959   55713  785035
>>>>>>>>>>>>
>>>>>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>> 190968 1909680 24003796
>>>>>>>>>>>>
>>>>>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were
>>>>>>>>>>>> received from Falkon, and 190968 tasks were set to
>>>>>>>>>>>> completed...
>>>>>>>>>>>>
>>>>>>>>>>>> Obviously this isn't right. Falkon only saw 7959 tasks,
>>>>>>>>>>>> so I would argue that the # of notifications received
>>>>>>>>>>>> is correct. The submitted # of tasks looks like the #
>>>>>>>>>>>> I would have expected, but not all the tasks made
>>>>>>>>>>>> it to Falkon. The Falkon provider is what sits between
>>>>>>>>>>>> the change of status to submitted and the receipt of
>>>>>>>>>>>> the notification, so I would say that is the first
>>>>>>>>>>>> place we need to look for more details... there used to
>>>>>>>>>>>> be some extra debug info in the Falkon provider that
>>>>>>>>>>>> simply printed all the tasks that were actually being
>>>>>>>>>>>> submitted to Falkon (as opposed to just the change of
>>>>>>>>>>>> status within Karajan). I don't see those debug
>>>>>>>>>>>> statements; I bet they got overwritten in the SVN update.
>>>>>>>>>>>> What about the completed tasks, why are there so many
>>>>>>>>>>>> (190K) completed tasks? Where did they come from?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> "Task" doesn't mean job. It could be just data being
>>>>>>>>>>> staged in, etc. The first 2 are important -- (Submitted
>>>>>>>>>>> vs Completed). Since they differ, this is the problem...
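
Since a "Task" can be a job or just a data transfer, the status transitions can be tallied per (type, status) pair instead of with one grep over all tasks. The following is a minimal sketch, assuming log lines shaped like the excerpts quoted in this thread; the name summarize is hypothetical, and what each type number stands for is not stated here.

```python
import re
from collections import Counter

# Matches entries such as:
#   ... DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-...) setting status to Submitted
STATUS_RE = re.compile(
    r"Task\(type=(\d+), identity=[^)]*\)\s+setting status to (\w+)"
)
NOTIFICATION_MARKER = "NotificationThread notification"


def summarize(lines):
    """Return (Counter keyed by (task type, status), notification count)."""
    transitions = Counter()
    notifications = 0
    for line in lines:
        if NOTIFICATION_MARKER in line:
            notifications += 1
            continue
        match = STATUS_RE.search(line)
        if match:
            task_type = int(match.group(1))
            status = match.group(2)
            transitions[(task_type, status)] += 1
    return transitions, notifications
```

In the excerpt quoted in this thread, the task that received a Falkon notification is type=1, so comparing the type=1 Submitted and Completed counts against the notification count may be a more meaningful check than the totals across all types.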
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Yong, are you keeping up with these emails? Do you
>>>>>>>>>>>> still have a copy of the latest Falkon provider that
>>>>>>>>>>>> you edited just before you left? Can you just take a
>>>>>>>>>>>> look through there to make sure nothing has been broken
>>>>>>>>>>>> with the SVN updates? If you don't have time for this
>>>>>>>>>>>> now (considering today was your first day on the new
>>>>>>>>>>>> job), I'll dig through there and see if I can make some
>>>>>>>>>>>> sense of what is happening!
>>>>>>>>>>>>
>>>>>>>>>>>> One last thing, Ben mentioned that the Falkon provider
>>>>>>>>>>>> you saw in Nika's account was different than what was
>>>>>>>>>>>> in SVN. Ben, did you at least look at modification
>>>>>>>>>>>> dates? How old was one as opposed to the other? I
>>>>>>>>>>>> hope we did not revert back to an older version that
>>>>>>>>>>>> might have had some bug in it....
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> I had to update to the latest version of provider-deef
>>>>>>>>>>> from SVN since without the update nothing worked. The
>>>>>>>>>>> version I am at now is 1050. But this is exactly the
>>>>>>>>>>> same version of swift/deef I used for our Friday run
>>>>>>>>>>> (which 'worked' from the Falkon/Swift point of view)
>>>>>>>>>>>
>>>>>>>>>>> Nika
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Ioan
>>>>>>>>>>>>
>>>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>>>>>
>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>    7959  244749 3241072
>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>   17207  564648 7949388
>>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I.e. almost half of the jobs haven't finished
>>>>>>>>>>>>> (according to swift)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also have some exceptions:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile
>>>>>>>>>>>>> (80 of those):
>>>>>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>>>>>      80     880    9705
>>>>>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>
>>
>>
>



