[Swift-devel] Q about MolDyn

Mihael Hategan hategan at mcs.anl.gov
Thu Aug 9 12:33:17 CDT 2007


So I see a gap in the log from 19:32 to 21:45. No log messages
whatsoever in between. Which is weird. I wonder what could cause log4j
to stop writing things to the log file.
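
To pin down exactly where such a gap starts and ends, something like the following could be run over the log. It is only a sketch. It assumes each entry starts with a log4j-style timestamp such as "2007-08-06 19:08:25,121" (the format seen in the quoted log lines below). The function names and the five-minute threshold are made up for illustration and are not part of Swift or log4j.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from the quoted log lines:
# "2007-08-06 19:08:25,121 DEBUG TaskImpl ..."
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    m = TS.match(line)
    if not m:
        return None  # continuation line (e.g. wrapped message or stack trace)
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(m.group(2)))


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (last_before, first_after) pairs for silences >= threshold."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev is not None and ts - prev >= threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


def report(path):
    """Print every gap found in the log file at the given path."""
    with open(path) as f:
        for start, end in find_gaps(f):
            print(f"no log output from {start} to {end} ({end - start})")
```

Calling report() on the run's .log file would list each silent stretch. Whether those stretches line up with the slow notification delivery seen on the Falkon side is a separate question.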

On Wed, 2007-08-08 at 20:34 -0500, Ioan Raicu wrote:
> Things are not halted; Falkon is still running, and it's delivering
> results really slowly...
> 
> http://tg-viz-login2.uc.teragrid.org:51000/index.htm
> 
> notice the black area in the second graph, that is the time to deliver
> notifications to Swift... all machines are basically idle, I don't know
> what it could be... there is ample space on the disks... CPU is idle, 
> memory is OK, yet things are just crawling, and Swift seems to have 
> stopped printing anything to the screen or file. 
> 
> The logs show nothing strange... but there is obviously something that
> is not right...
> 
> I'll let the experiment keep going for now, and I'll dig into it deeper 
> later tonight...
> 
> Ioan
> 
> Veronika Nefedova wrote:
> > Everything seemed to come to a halt.
> >
> > This is the last stdout that I have:
> >
> > Staged out 
> > MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040.wham 
> > to solv_repu_0.7_0.8_a0_m040.wham from UC-64
> > Staged out 
> > MolDyn-244-loops-knt9h8fru9sm2/shared/solv_repu_0.7_0.8_a0_m040_done 
> > to solv_repu_0.7_0.8_a0_m040_done from UC-64
> > Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510)
> > No host specified
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting 
> > status to Active
> > Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513)
> > No host specified
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) setting 
> > status to Completed
> > Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516)
> > No host specified
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting 
> > status to Active
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-2-1186617126510) Completed. 
> > Waiting: 1, Running: 14926. Heap size: 1518M, Heap free: 962M, Max 
> > heap: 1518M
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) setting 
> > status to Completed
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-1-1186617126513) Completed. 
> > Waiting: 0, Running: 14926. Heap size: 1518M, Heap free: 962M, Max 
> > heap: 1518M
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting 
> > status to Active
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) setting 
> > status to Completed
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-3-1186617126516) Completed. 
> > Waiting: 0, Running: 14925. Heap size: 1518M, Heap free: 962M, Max 
> > heap: 1518M
> > Submitting task Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519)
> > No host specified
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting 
> > status to Active
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) setting 
> > status to Completed
> > Task(type=4, identity=urn:0-1-91-2-29-0-0-4-1186617126519) Completed. 
> > Waiting: 0, Running: 14925. Heap size: 1518M, Heap free: 962M, Max 
> > heap: 1518M
> > Resolved 2078 to UC-64
> > chrm_long completed
> >
> >
> >
> > Notice 'No host specified' -- this message was printing throughout the 
> > whole execution, from the very beginning.
> > The log is in ~nefedova/alamines/MolDyn-244-loops-knt9h8fru9sm2.log on 
> > viper
> >
> > Nika
> >
> > On Aug 8, 2007, at 5:36 PM, Ioan Raicu wrote:
> >
> >> viper in Yong's account... he ran some tests just before he left with 
> >> this version, and it worked just fine!
> >> I saved Nika's provider which I replaced, so we can always go back to 
> >> that if we need to.
> >>
> >> Ioan
> >>
> >> Mihael Hategan wrote:
> >>> Where exactly is this version?
> >>>
> >>> On Wed, 2007-08-08 at 11:59 -0500, Ioan Raicu wrote:
> >>>   
> >>>> OK everyone, I found Yong's version of the provider dated July 26th,
> >>>> much more recent than what was in SVN on June 27th.  I updated Nika's
> >>>> version of the provider (which has been checked out of SVN), and
> >>>> recompiled and deployed!
> >>>>
> >>>>   ant distclean
> >>>>   ant -Ddist.dir=/home/nefedova/cogl/modules/vdsk/dist/vdsk-0.2-dev/ dist
> >>>>
> >>>> I even updated some of the logging statements to use the logger
> >>>> (some were not using the logger).
> >>>>
> >>>> Nika, Falkon is freshly restarted and ready for another test run!
> >>>>
> >>>> Falkon Factory Service:
> >>>> http://tg-viz-login2.uc.teragrid.org:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
> >>>> Web Server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
> >>>>
> >>>> Ioan
> >>>>
> >>>> Veronika Nefedova wrote: 
> >>>>     
> >>>>> Ioan, 
> >>>>>
> >>>>>
> >>>>> It looks like Falkon (including provider-deef) was put in SVN on
> >>>>> June 27th. You really were supposed to use the SVN code from that
> >>>>> point. Sigh. Did you do any changes to viper install after June
> >>>>> 27th?
> >>>>>
> >>>>>
> >>>>> Nika
> >>>>>
> >>>>> On Aug 7, 2007, at 11:32 AM, Ioan Raicu wrote:
> >>>>>
> >>>>>       
> >>>>>> Could it be that the fixes were done before the original SVN
> >>>>>> checkin?   If not, then at least we know why things aren't
> >>>>>> working.  I bet the latest provider source was in Nika's Swift
> >>>>>> install on viper.  Nika, I take it you don't have this anymore, as
> >>>>>> SVN updates overwrote this.  Yong, is there any other place you
> >>>>>> might have the latest provider source?  If not, I guess we need to
> >>>>>> take another look through the provider source to fix the issues
> >>>>>> that we knew of...
> >>>>>>
> >>>>>> Ioan
> >>>>>>
> >>>>>> Mihael Hategan wrote: 
> >>>>>>         
> >>>>>>> Well, it doesn't look like the falkon provider in SVN has been updated
> >>>>>>> at all in terms of fixing synchronization issues. All commits on
> >>>>>>> provider-deef come from either ben or me:
> >>>>>>>
> >>>>>>> bash-3.1$ svn log
> >>>>>>> ------------------------------------------------------------------------
> >>>>>>> r1053 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:49:48 -0500 (Fri, 03 Aug
> >>>>>>> 2007) | 1 line
> >>>>>>>
> >>>>>>> removed gt4 stuff and added them as a dependency
> >>>>>>> ------------------------------------------------------------------------
> >>>>>>> r1052 | hategan at CI.UCHICAGO.EDU | 2007-08-03 14:48:25 -0500 (Fri, 03 Aug
> >>>>>>> 2007) | 1 line
> >>>>>>>
> >>>>>>> removed gt4 stuff and added them as a dependency
> >>>>>>> ------------------------------------------------------------------------
> >>>>>>> r1051 | benc at CI.UCHICAGO.EDU | 2007-08-03 14:20:21 -0500 (Fri, 03 Aug
> >>>>>>> 2007) | 1 line
> >>>>>>>
> >>>>>>> a very small readme for provider-deef
> >>>>>>> ------------------------------------------------------------------------
> >>>>>>> r875 | benc at CI.UCHICAGO.EDU | 2007-06-27 15:00:12 -0500 (Wed, 27 Jun
> >>>>>>> 2007) | 1 line
> >>>>>>>
> >>>>>>> remove dist directory form svn
> >>>>>>> ------------------------------------------------------------------------
> >>>>>>> r873 | benc at CI.UCHICAGO.EDU | 2007-06-27 10:23:15 -0500 (Wed, 27 Jun
> >>>>>>> 2007) | 20 lines
> >>>>>>>
> >>>>>>> provider-deef, the Falkon/cog provider
> >>>>>>>
> >>>>>>> based on source in below message, with .class files deleted
> >>>>>>>
> >>>>>>>
> >>>>>>> Date: Wed, 27 Jun 2007 09:27:23 -0500
> >>>>>>> From: Veronika Nefedova <nefedova at mcs.anl.gov>
> >>>>>>> To: Yong Zhao <yongzh at cs.uchicago.edu>
> >>>>>>> Cc: Ben Clifford <benc at hawaga.org.uk>, Mihael Hategan
> >>>>>>> <hategan at mcs.anl.gov>,
> >>>>>>>     iraicu at cs.uchicago.edu, Ian Foster <foster at mcs.anl.gov>,
> >>>>>>>     Mike Wilde <wilde at mcs.anl.gov>,
> >>>>>>>     Tiberiu Stef-Praun <tiberius at ci.uchicago.edu>
> >>>>>>> Subject: Re: 244 molecule MolDyn run...
> >>>>>>>
> >>>>>>> its on viper.uchicago.edu
> >>>>>>> in : /home/nefedova/cogl/modules/provider-deef/
> >>>>>>> I also tared it up and put in my home on terminable: ~nefedova/cogl.tgz
> >>>>>>>
> >>>>>>> Nika
> >>>>>>>
> >>>>>>>
> >>>>>>> ------------------------------------------------------------------------
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, 2007-08-07 at 10:01 -0500, Veronika Nefedova wrote:
> >>>>>>>   
> >>>>>>>           
> >>>>>>>> Mihael, do you have any clues on why this run has failed? Ioan - my  
> >>>>>>>> answers to your questions are below...
> >>>>>>>>
> >>>>>>>> On Aug 6, 2007, at 10:28 PM, Ioan Raicu wrote:
> >>>>>>>>
> >>>>>>>>     
> >>>>>>>>             
> >>>>>>>>> It looks like viper (where Swift is running) is idle, and so is tg- 
> >>>>>>>>> viz-login2 (where Falkon is running).
> >>>>>>>>> What looks evident to me is that the normal list of events is for a  
> >>>>>>>>> successful task:
> >>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "urn: 
> >>>>>>>>> 0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
> >>>>>>>>> 2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn: 
> >>>>>>>>> 0-1-73-2-31-0-0-1186444341989) setting status to Submitted
> >>>>>>>>> 2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn: 
> >>>>>>>>> 0-1-73-2-31-0-0-1186444341989 0
> >>>>>>>>> 2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn: 
> >>>>>>>>> 0-1-73-2-31-0-0-1186444341989) setting status to Completed
> >>>>>>>>>
> >>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to  
> >>>>>>>>> Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> >>>>>>>>>  17566  175660 2179412
> >>>>>>>>>
> >>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "NotificationThread  
> >>>>>>>>> notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> >>>>>>>>>   7959   55713  785035
> >>>>>>>>>
> >>>>>>>>> iraicu at viper:/home/nefedova/alamines> grep "setting status to  
> >>>>>>>>> Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> >>>>>>>>> 190968 1909680 24003796
> >>>>>>>>>
> >>>>>>>>> Now, 17566 tasks were submitted, 7959 notifications were received
> >>>>>>>>> from Falkon, and 190968 tasks were set to completed...
> >>>>>>>>>
> >>>>>>>>> Obviously this isn't right.  Falkon only saw 7959 tasks, so I would  
> >>>>>>>>> argue that the # of notifications received is correct.  The  
> >>>>>>>>> submitted # of tasks looks like the # I would have expected, but  
> >>>>>>>>> all the tasks did not make it to Falkon.  The Falkon provider is  
> >>>>>>>>> what sits between the change of status to submitted, and the  
> >>>>>>>>> receipt of the notification, so I would say that is the first place  
> >>>>>>>>> we need to look for more details... there used to be some extra debug
> >>>>>>>>> info in the Falkon provider that simply printed all the tasks that  
> >>>>>>>>> were actually being submitted to Falkon (as opposed to just the  
> >>>>>>>>> change of status within Karajan).  I don't see those debug  
> >>>>>>>>> statements, I bet they got overwritten in the SVN update.
> >>>>>>>>> What about the completed tasks, why are there so many (190K)  
> >>>>>>>>> completed tasks?  Where did they come from?
> >>>>>>>>>
> >>>>>>>>>       
> >>>>>>>>>               
> >>>>>>>> "Task" doesn't mean job. It could be just data being staged in, etc.
> >>>>>>>> The first 2 are important -- (Submitted vs Completed). Since they
> >>>>>>>> differ, this is the problem...
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>     
> >>>>>>>>             
> >>>>>>>>> Yong, are you keeping up with these emails?  Do you still have a  
> >>>>>>>>> copy of the latest Falkon provider that you edited just before you  
> >>>>>>>>> left?  Can you just take a look through there to make sure nothing  
> >>>>>>>>> has been broken with the SVN updates?  If you don't have time for  
> >>>>>>>>> this now (considering today was your first day on the new job),  
> >>>>>>>>> I'll dig through there and see if I can make some sense of what is  
> >>>>>>>>> happening!
> >>>>>>>>>
> >>>>>>>>> One last thing, Ben mentioned that the Falkon provider you saw in  
> >>>>>>>>> Nika's account was different than what was in SVN.  Ben, did you at  
> >>>>>>>>> least look at modification dates?  How old was one as opposed to  
> >>>>>>>>> the other?  I hope we did not revert back to an older version that  
> >>>>>>>>> might have had some bug in it....
> >>>>>>>>>
> >>>>>>>>>       
> >>>>>>>>>               
> >>>>>>>> I had to update to the latest version of provider-deef from SVN since  
> >>>>>>>> without the update nothing worked. The version I am at now is 1050.  
> >>>>>>>> But this is exactly the same version of swift/deef I used for our  
> >>>>>>>> Friday run (which 'worked' from Falkon/Swift point of view)
> >>>>>>>>
> >>>>>>>> Nika
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>     
> >>>>>>>>             
> >>>>>>>>> Ioan
> >>>>>>>>>
> >>>>>>>>> Veronika Nefedova wrote:
> >>>>>>>>>       
> >>>>>>>>>               
> >>>>>>>>>> Well, there are some discrepancies:
> >>>>>>>>>>
> >>>>>>>>>> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops- 
> >>>>>>>>>> zhgo6be8tjhi1.log | wc
> >>>>>>>>>>    7959  244749 3241072
> >>>>>>>>>> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops- 
> >>>>>>>>>> zhgo6be8tjhi1.log | wc
> >>>>>>>>>>   17207  564648 7949388
> >>>>>>>>>> nefedova at viper:~/alamines>
> >>>>>>>>>>
> >>>>>>>>>> I.e. more than half of the jobs haven't finished (according to Swift)
> >>>>>>>>>>
> >>>>>>>>>> I also have some exceptions:
> >>>>>>>>>>
> >>>>>>>>>> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn: 
> >>>>>>>>>> 0-1-101-2-37-0-0-1186444363341) setting status to Failed Exception  
> >>>>>>>>>> in getFile
> >>>>>>>>>> (80 of those):
> >>>>>>>>>> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops- 
> >>>>>>>>>> zhgo6be8tjhi1.log | wc
> >>>>>>>>>>      80     880    9705
> >>>>>>>>>> nefedova at viper:~/alamines>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Nika
> >>>>>>>>>>         
> >>>>>>>>>>                 
> >>>>>>>> _______________________________________________
> >>>>>>>> Swift-devel mailing list
> >>>>>>>> Swift-devel at ci.uchicago.edu
> >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>>>>
> >>>>>>>>     
> >>>>>>>>             
> >>>>>>         
> >>>>>       
> >>>   
> >
> 



