[Swift-devel] Q about MolDyn

Veronika Nefedova nefedova at mcs.anl.gov
Wed Aug 8 19:44:41 CDT 2007


Everything seemed to come to a halt.

This is the last stdout that I have:

Staged out MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040.wham to solv_repu_0.7_0.8_a0_m040.wham from UC-64
Staged out MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040_done to solv_repu_0.7_0.8_a0_m040_done from UC-64
Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510)
No host specified
Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting status to Active
Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513)
No host specified
Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting status to Completed
Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516)
No host specified
Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting status to Active
Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) Completed. Waiting: 1, Running: 14926. Heap size: 1518M, Heap free: 962M, Max heap: 1518M
Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting status to Completed
Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) Completed. Waiting: 0, Running: 14926. Heap size: 1518M, Heap free: 962M, Max heap: 1518M
Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting status to Active
Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting status to Completed
Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) Completed. Waiting: 0, Running: 14925. Heap size: 1518M, Heap free: 962M, Max heap: 1518M
Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519)
No host specified
Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting status to Active
Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting status to Completed
Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) Completed. Waiting: 0, Running: 14925. Heap size: 1518M, Heap free: 962M, Max heap: 1518M
Resolved 2078 to UC-64
chrm_long completed



Notice 'No host specified' -- this message was printed throughout
the whole execution, from the very beginning.
The log is in ~nefedova/alamines/MolDyn-244-loops-knt9h8fru9sm2.log  
on viper
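
In case it helps when going through the log, here is a rough Python sketch (not part of Swift or Falkon) that reads stdout lines like the ones above, remembers the last status seen for each task identity, and lists the tasks that never reached Completed or Failed. The function name summarize, the regular expression, and the sample lines are my own assumptions based only on the output format pasted above; the sample is deliberately cut off before the second task completes, so that something shows up as stuck.

```python
import re
from collections import Counter

# Assumed line format, inferred from the stdout excerpt in this message.
STATUS_RE = re.compile(
    r"Task\(type=(\d+), identity=(urn:[0-9-]+)\) setting status to (\w+)"
)

def summarize(lines):
    """Return (status counts, stuck task URNs, number of 'No host specified' lines)."""
    last_status = {}
    no_host = 0
    for line in lines:
        if "No host specified" in line:
            no_host += 1
        match = STATUS_RE.search(line)
        if match:
            _task_type, urn, status = match.groups()
            last_status[urn] = status  # later lines overwrite earlier ones
    stuck = sorted(
        urn for urn, status in last_status.items()
        if status not in ("Completed", "Failed")
    )
    return Counter(last_status.values()), stuck, no_host

# Hypothetical, truncated sample built from the lines above (unwrapped).
sample = [
    "Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510)",
    "No host specified",
    "Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting status to Active",
    "Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting status to Completed",
    "Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting status to Active",
]

counts, stuck, no_host = summarize(sample)
print(counts)
print(stuck)
print(no_host)
```

To run it over the real output, pass an open file instead of the sample list, e.g. summarize(open(path)), where path is wherever the stdout was saved.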

Nika

On Aug 8, 2007, at 5:36 PM, Ioan Raicu wrote:

> viper in Yong's account... he ran some tests just before he left  
> with this version, and it worked just fine!
> I saved Nika's provider which I replaced, so we can always go back  
> to that if we need to.
>
> Ioan
>
> Mihael Hategan wrote:
>> Where exactly is this version?
>>
>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
>>
>>> OK everyone, I found Yong's version of the provider dated July 26th,
>>> much more recent than what was in SVN on June 27th.  I updated Nika's
>>> version of the provider (which has been checked out of SVN), and
>>> recompiled and deployed it!
>>>
>>>   ant distclean
>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/
>>> dist
>>>
>>> I even updated some of the logging info to use the logger
>>> (some were not using the logger).
>>>
>>> Nika, Falkon is freshly restarted and ready for another test run!
>>>
>>> Falkon Factory Service:
>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>
>>> Ioan
>>>
>>> Veronika Nefedova wrote:
>>>
>>>> Ioan,
>>>>
>>>>
>>>> It looks like Falkon (including provider-deef) was put in SVN on
>>>> June 27th. You really were supposed to use the SVN code from that
>>>> point. Sigh. Did you make any changes to the viper install after
>>>> June 27th?
>>>>
>>>>
>>>> Nika
>>>>
>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
>>>>
>>>>
>>>>> Could it be that the fixes were done before the original SVN
>>>>> checkin?   If not, then at least we know why things aren't
>>>>> working.  I bet the latest provider source was in Nika's Swift
>>>>> install on viper.  Nika, I take it you don't have this anymore, as
>>>>> SVN updates overwrote this.  Yong, is there any other place you
>>>>> might have the latest provider source?  If not, I guess we need to
>>>>> take another look through the provider source to fix the issues
>>>>> that we knew of...
>>>>>
>>>>> Ioan
>>>>>
>>>>> Mihael Hategan wrote:
>>>>>
>>>>>> Well, it doesn't look like the falkon provider in SVN has been  
>>>>>> updated
>>>>>> at all in terms of fixing synchronization issues. All commits on
>>>>>> provider-deef come from either ben or me:
>>>>>>
>>>>>> bash-3.1$ svn log
>>>>>> ----------------------------------------------------------------- 
>>>>>> -------
>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500  
>>>>>> (Fri, 03 Aug
>>>>>> 2007) | 1 line
>>>>>>
>>>>>> removed gt4 stuff and added them as a dependency
>>>>>> ----------------------------------------------------------------- 
>>>>>> -------
>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500  
>>>>>> (Fri, 03 Aug
>>>>>> 2007) | 1 line
>>>>>>
>>>>>> removed gt4 stuff and added them as a dependency
>>>>>> ----------------------------------------------------------------- 
>>>>>> -------
>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri,  
>>>>>> 03 Aug
>>>>>> 2007) | 1 line
>>>>>>
>>>>>> a very small readme for provider-deef
>>>>>> ----------------------------------------------------------------- 
>>>>>> -------
>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed,  
>>>>>> 27 Jun
>>>>>> 2007) | 1 line
>>>>>>
>>>>>> remove dist directory form svn
>>>>>> ----------------------------------------------------------------- 
>>>>>> -------
>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed,  
>>>>>> 27 Jun
>>>>>> 2007) | 20 lines
>>>>>>
>>>>>> provider-deef, the Falkon/cog provider
>>>>>>
>>>>>> based on source in below message, with .class files deleted
>>>>>>
>>>>>>
>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
>>>>>> <hategan at mcs.anl.gov>,
>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
>>>>>> Subject: Re: 244 molecule MolDyn run...
>>>>>>
>>>>>> its on viper.uchicago.edu
>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
>>>>>> I also tared it up and put in my home on terminable: ~nefedova/ 
>>>>>> cogl.tgz
>>>>>>
>>>>>> Nika
>>>>>>
>>>>>>
>>>>>> ----------------------------------------------------------------- 
>>>>>> -------
>>>>>>
>>>>>>
>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
>>>>>>
>>>>>>
>>>>>>> Mihael, do you have any clues on why this run has failed?  
>>>>>>> Ioan - my
>>>>>>> answers to your questions are below...
>>>>>>>
>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> It looks like viper (where Swift is running) is idle, and so  
>>>>>>>> is tg-
>>>>>>>> viz-login2 (where Falkon is running).
>>>>>>>> What looks evident to me is that the normal list of events  
>>>>>>>> is for a
>>>>>>>> successful task:
>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
>>>>>>>>
>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>  17566  175660 2179412
>>>>>>>>
>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>   7959   55713  785035
>>>>>>>>
>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>> 190968 1909680 24003796
>>>>>>>>
>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were received
>>>>>>>> from Falkon, and 190968 tasks were set to completed...
>>>>>>>>
>>>>>>>> Obviously this isn't right.  Falkon only saw 7959 tasks, so  
>>>>>>>> I would
>>>>>>>> argue that the # of notifications received is correct.  The
>>>>>>>> submitted # of tasks looks like the # I would have expected,  
>>>>>>>> but
>>>>>>>> all the tasks did not make it to Falkon.  The Falkon  
>>>>>>>> provider is
>>>>>>>> what sits between the change of status to submitted, and the
>>>>>>>> receipt of the notification, so I would say that is the  
>>>>>>>> first place
>>>>>>>> we need to look for more details... there used to be some extra
>>>>>>>> debug
>>>>>>>> info in the Falkon provider that simply printed all the  
>>>>>>>> tasks that
>>>>>>>> were actually being submitted to Falkon (as opposed to just the
>>>>>>>> change of status within Karajan).  I don't see those debug
>>>>>>>> statements, I bet they got overwritten in the SVN update.
>>>>>>>> What about the completed tasks, why are there so many (190K)
>>>>>>>> completed tasks?  Where did they come from?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> "Task" doesn't mean job. It could be just data being staged in, etc.
>>>>>>> The first 2 are important -- (Submitted vs Completed). Since they
>>>>>>> differ, this is the problem...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Yong, are you keeping up with these emails?  Do you still  
>>>>>>>> have a
>>>>>>>> copy of the latest Falkon provider that you edited just  
>>>>>>>> before you
>>>>>>>> left?  Can you just take a look through there to make sure  
>>>>>>>> nothing
>>>>>>>> has been broken with the SVN updates?  If you don't have  
>>>>>>>> time for
>>>>>>>> this now (considering today was your first day on the new job),
>>>>>>>> I'll dig through there and see if I can make some sense of  
>>>>>>>> what is
>>>>>>>> happening!
>>>>>>>>
>>>>>>>> One last thing, Ben mentioned that the Falkon provider you  
>>>>>>>> saw in
>>>>>>>> Nika's account was different than what was in SVN.  Ben, did  
>>>>>>>> you at
>>>>>>>> least look at modification dates?  How old was one as  
>>>>>>>> opposed to
>>>>>>>> the other?  I hope we did not revert back to an older  
>>>>>>>> version that
>>>>>>>> might have had some bug in it....
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I had to update to the latest version of provider-deef from  
>>>>>>> SVN since
>>>>>>> without the update nothing worked. The version I am at now is  
>>>>>>> 1050.
>>>>>>> But this is exactly the same version of swift/deef I used for  
>>>>>>> our
>>>>>>> Friday run (which 'worked' from Falkon/Swift point of view)
>>>>>>>
>>>>>>> Nika
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Ioan
>>>>>>>>
>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Well, there are some discrepancies:
>>>>>>>>>
>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>    7959  244749 3241072
>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>   17207  564648 7949388
>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>
>>>>>>>>> I.e. almost half of the jobs haven't finished (according to  
>>>>>>>>> swift)
>>>>>>>>>
>>>>>>>>> I also have some exceptions:
>>>>>>>>>
>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception in getFile
>>>>>>>>> (80 of those):
>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>>>>>>>>>      80     880    9705
>>>>>>>>> nefedova at viper:~/alamines>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nika
>>>>>>>>>
>>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Swift-devel mailing list
>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>>
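
For reference, the grep | wc tallies quoted earlier in this thread can be reproduced in a single pass over the log. The following is only a sketch under assumptions: the function names count_events and unaccounted are made up for illustration, the three substrings are copied from the quoted grep commands, and the sample lines are the three quoted log lines for one successful task, unwrapped. The last step simply subtracts the reported notification count from the reported submitted count.

```python
# Substrings copied from the grep commands quoted in this thread.
PATTERNS = {
    "submitted": "setting status to Submitted",
    "notified": "NotificationThread notification",
    "completed": "setting status to Completed",
}

def count_events(lines, patterns=PATTERNS):
    """Count lines containing each substring, like `grep PATTERN file | wc -l`."""
    counts = dict.fromkeys(patterns, 0)
    for line in lines:
        for name, needle in patterns.items():
            if needle in line:
                counts[name] += 1
    return counts

def unaccounted(counts):
    """Tasks marked Submitted for which no Falkon notification was logged."""
    return counts["submitted"] - counts["notified"]

# The three quoted log lines for one successful task (unwrapped).
sample = [
    "2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted",
    "2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0",
    "2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed",
]
sample_counts = count_events(sample)
print(sample_counts)  # one hit per pattern for a healthy task

# Counts reported in the thread for MolDyn-244-loops-zhgo6be8tjhi1.log.
reported = {"submitted": 17566, "notified": 7959, "completed": 190968}
gap = unaccounted(reported)
print(gap)  # 9607
```

Note that, as pointed out in the quoted reply, the Completed count covers every kind of task (including data staging), so only the Submitted and notification counts are compared here. To run it on the real log, pass an open file object in place of the sample list.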
