From benc at hawaga.org.uk Fri Feb 1 16:03:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 1 Feb 2008 22:03:07 +0000 (GMT) Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> Message-ID: related to this, Swift can use PBS directly if its run on the headnode. in some cases, this is going to be preferable to using either version of GRAM. I think this would have avoided the particular problem encountered here. I haven't tried this on TG-UC, but it seems to work ok for me on teraport. -- From benc at hawaga.org.uk Fri Feb 1 17:29:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 1 Feb 2008 23:29:51 +0000 (GMT) Subject: [Swift-devel] nightly rebuild of documentation Message-ID: I finally got round to setting up a cron job to update the webspace from SVN every 24h (i.e. it runs update.sh) So now, unless its urgent, you can commit doc changes to SVN and not have to log in to update the actual deployment of those docs. -- From hategan at mcs.anl.gov Fri Feb 1 19:07:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 01 Feb 2008 19:07:02 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <1201750553.11697.8.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> <1201718259.5465.1.camel@blabla.mcs.anl.gov> <69921E77-384E-4D36-922C-BED4D52F2177@mcs.anl.gov> <1201742109.9441.3.camel@blabla.mcs.anl.gov> <47A13F4C.60507@mcs.anl.gov> <1201750553.11697.8.camel@blabla.mcs.anl.gov> Message-ID: <1201914422.5589.1.camel@blabla.mcs.anl.gov> Nice. 
I can see other people's jobs: hategan at tg-grid1:~> cat /soft/prews-gram-4.0.1-r3/tmp/gram_job_state/job.tg-grid1.uc.teragrid.org.9324.1184364728 https://tg-grid1.uc.teragrid.org:50170/9324/1184364728/ 12 128 0 1460443.tg-master.uc.teragrid.org &(rsl_substitution=(GRIDMANAGER_GASS_URL https://sidgrid.ci.uchicago.edu:60651))(executable='/home/skenny/vds_32/bin/kickstart')(directory='/home/skenny/sidgrid_out/skenny/skenny/wf_test/run0001')(arguments=-n upload::uploader -N sidgrid::UploadClient -R ANLUCTERAGRID32 /home/skenny/sidgrid/soft/upload/uploader skenny wf_test graspB.lh.forperm.txt_260.output graspB.lh.forperm.txt_261.output graspB.lh.forperm.txt_262.output graspB.lh.forperm.txt_263.output graspB.lh.forperm.txt_264.output graspB.lh.forperm.txt_265.output)(stderr=$(GLOBUS_CACHED_STDERR))(file_stage_out=($(GLOBUS_CACHED_STDERR) $(GRIDMANAGER_GASS_URL)#'/ci/sidgrid.ci.uchicago.edu/htdocs/sidgrid/sidgrid_test_server/sidgrid/transformations/sidgridUsers/skenny/wf_test/run0001/uploader_ID000007.err'))(environment=(app '/app/osg_app')(data '/home/skenny/data')(tmp '/tmp')(wntmp '/tmp'))(proxy_timeout=240)(save_state=yes)(two_phase=600)(remote_io_url=$(GRIDMANAGER_GASS_URL))(jobtype=single)(maxwalltime=2400) https://tg-grid1.uc.teragrid.org:50170/9324/1184364728/ ... On Wed, 2008-01-30 at 21:35 -0600, Mihael Hategan wrote: > Sure, I'd do that anyway to test the testing script(s)/process. I mean > if I do mess it, I want to make sure I only need to do it once. > > But I'm thinking it's better to agree on some time than for Joe or Ti or > JP to randomly wonder what's going on. > > On the other hand, seeing many processes in my name will probably > eliminate the confusion :) > > On Wed, 2008-01-30 at 21:23 -0600, Michael Wilde wrote: > > I suggested we start the tests at a moderate intensity, and record the > > impact on CPU, mem, qlength, etc. > > > > Then ramp up untl those indicators start to suggest that the gk is under > > strain. > > > > Its not 100% foolproof, but better than blind stress testing. > > > > - mike > > > > > > On 1/30/08 7:15 PM, Mihael Hategan wrote: > > > Me doing such tests will probably mess the gatekeeper node again. How do > > > we proceed? > > > > > > On Wed, 2008-01-30 at 13:19 -0600, Stuart Martin wrote: > > >> I'm saying run swift tests using GRAM4 and see what you get. Run a > > >> similar job scenario like 2000 jobs to the same GRAM4 service. I will > > >> be interested to see how swift does for performance, scalability, > > >> errors... > > >> It's possible that condor-g is not optimal, so seeing how another > > >> GRAM4 client dong similar job submission scenarios fares would make > > >> for an interesting comparison. > > >> > > >> -Stu > > >> > > >> On Jan 30, 2008, at Jan 30, 12:37 PM, Mihael Hategan wrote: > > >> > > >>> I'm confused. Why would you want to test GRAM scalability while > > >>> introducing additional biasing elements, such as Condor-G? > > >>> > > >>> On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote: > > >>>> All, > > >>>> > > >>>> I wanted to chime in with a number of things being discussed here. > > >>>> > > >>>> There is a GRAM RFT Core reliability group focused on ensuring the > > >>>> GRAM service stays up and functional in spit of an onslaught from a > > >>>> client. http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team > > >>>> > > >>>> The ultimate goal here is that a client may get a timeout and that > > >>>> would be the signal to backoff some. 
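(A rough sketch of the client-side back-off idea described just above — this is not from the original thread, and submit() below is a placeholder rather than the real GRAM4 client API.)

    import java.util.concurrent.TimeoutException;

    // Sketch only: treat a submission timeout as the signal to back off, then
    // retry after an exponentially growing pause. submit() stands in for the
    // actual GRAM job submission call.
    public class BackoffSubmit {
        static void submit(String jobSpec) throws TimeoutException {
            // placeholder for the real submission to the GRAM service
        }

        public static void main(String[] args) throws InterruptedException {
            long delayMillis = 1000;            // first pause: 1 second
            final long maxDelayMillis = 60000;  // cap the pause at one minute
            while (true) {
                try {
                    submit("example-job");
                    break;                      // accepted by the service
                } catch (TimeoutException e) {
                    Thread.sleep(delayMillis);  // overloaded: wait before retrying
                    delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
                }
            }
        }
    }
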
> > >>>> > > >>>> ----- > > >>>> > > >>>> OSG - VO testing: We worked with Terrence (CMS) recently and here are > > >>>> his test results. > > >>>> http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests > > >>>> > > >>>> GRAM2 handled this 2000 jobs x 2 condor-g clients to the same GRAM > > >>>> service better than GRAM4. But again, this is with the condor-g > > >>>> tricks. Without the tricks, GRAM2 will handle the load better. > > >>>> > > >>>> OSG VTB testing: These were using globusrun-ws and also condor-g. > > >>>> https://twiki.grid.iu.edu/twiki/bin/view/Integration/ > > >>>> WSGramValidation > > >>>> > > >>>> clients in these tests got a variety of errors depending on the jobs > > >>>> run: timeouts, GridFTP authentication errors, client-side OOM, ... > > >>>> GRAM4 functions pretty well, but it was not able to handle Terrence's > > >>>> scenario. But it handled 1000 jobs x 1 condor-g client just fine. > > >>>> > > >>>> ----- > > >>>> > > >>>> It would be very interesting to see how swift does with GRAM4. This > > >>>> would make for a nice comparison to condor-g. > > >>>> > > >>>> As far as having functioning GRAM4 services on TG, things have > > >>>> improved. LEAD is using GRAM4 exclusively and we've been working > > >>>> with > > >>>> them to make sure the GRAM4 services are up and functioning. INCA > > >>>> has > > >>>> been updated to more effectively test and monitor GRAM4 and GridFTP > > >>>> services that LEAD is targeting. This could be extended for any > > >>>> hosts > > >>>> that swift would like to test against. Here are some interesting > > >>>> charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi > > >>>> > > >>>> -Stu > > >>>> > > >>>> On Jan 30, 2008, at Jan 30, 10:00 AM, Ti Leggett wrote: > > >>>> > > >>>>> On Jan 30, 2008, at 01/30/08 09:48 AM, Ben Clifford wrote: > > >>>>> > > >>>>> [snip] > > >>>>> > > >>>>>> No. The default behaviour when working with a user who is "just > > >>>>>> trying to > > >>>>>> get their stuff to run" is "screw this, use GRAM2 because it > > >>>>>> works". > > >>>>>> > > >>>>>> Its a self-reinforcing feedback loop, that will be broken at the > > >>>>>> point > > >>>>>> that it becomes easier for people to stick with GRAM4 than default > > >>>>>> back to > > >>>>>> GRAM2. I guess we need to keep trying every now and then and hope > > >>>>>> that one > > >>>>>> time it sticks ;-) > > >>>>>> > > >>>>>> -- > > >>>>> Well this works to a point, but if falling back to a technology that > > >>>>> is known to not be scalable for your sizes results in killing a > > >>>>> machine, I, as a site admin, will eventually either a) deny you > > >>>>> service b) shut down the poorly performing service or c) all of the > > >>>>> above. So it's in your best interest to find and use those > > >>>>> technologies that are best suited to the task at hand so the users > > >>>>> of your software don't get nailed by (a). > > >>>>> > > >>>>> In this case it seems to me that using WS-GRAM, extending WS-GRAM > > >>>>> and/or MDS to report site statistics, and/or modifying WS-GRAM to > > >>>>> throttle itself (think of how apache reports "Server busy. Try again > > >>>>> later") is the best path forward. 
For the short term, it seems that > > >>>>> the Swift developers should manually find those limits for sites > > >>>>> that the users use regularly for them to use, *and* educate their > > >>>>> users on how to identify that they could be adversely affecting a > > >>>>> resource and throttle themselves till the ideal, automated method is > > >>>>> a usable reality. > > >>>>> > > >>>> _______________________________________________ > > >>>> Swift-devel mailing list > > >>>> Swift-devel at ci.uchicago.edu > > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>>> > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From leggett at mcs.anl.gov Sun Feb 3 15:34:45 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Sun, 3 Feb 2008 15:34:45 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <921658.18899.qm@web52308.mail.re2.yahoo.com> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> Message-ID: <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> Mike, You're killing tg-grid1 again. Can someone work with Mike to get some swift settings that don't kill our server? On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > Yes, I'm submitting molecular dynamics simulations > using Swift. > > Is there a default wall-time limit for jobs on tg-uc? > > > > --- joseph insley wrote: > >> Actually, these numbers are now escalating... >> >> top - 17:18:54 up 2:29, 1 user, load average: >> 149.02, 123.63, 91.94 >> Tasks: 469 total, 4 running, 465 sleeping, 0 >> stopped, 0 zombie >> >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >> 479 >> >> insley at tg-viz-login1:~> time globusrun -a -r >> tg-grid.uc.teragrid.org >> GRAM Authentication test successful >> real 0m26.134s >> user 0m0.090s >> sys 0m0.010s >> >> >> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >> >>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >> TG GRAM host) >>> became unresponsive and had to be rebooted. I am >> now seeing slow >>> response times from the Gatekeeper there again. >> Authenticating to >>> the gatekeeper should only take a second or two, >> but it is >>> periodically taking up to 16 seconds: >>> >>> insley at tg-viz-login1:~> time globusrun -a -r >> tg-grid.uc.teragrid.org >>> GRAM Authentication test successful >>> real 0m16.096s >>> user 0m0.060s >>> sys 0m0.020s >>> >>> looking at the load on tg-grid, it is rather high: >>> >>> top - 16:55:26 up 2:06, 1 user, load average: >> 89.59, 78.69, 62.92 >>> Tasks: 398 total, 20 running, 378 sleeping, 0 >> stopped, 0 zombie >>> >>> And there appear to be a large number of processes >> owned by kubal: >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>> 380 >>> >>> I assume that Mike is using swift to do the job >> submission. Is >>> there some throttling of the rate at which jobs >> are submitted to >>> the gatekeeper that could be done that would >> lighten this load >>> some? (Or has that already been done since >> earlier today?) The >>> current response times are not unacceptable, but >> I'm hoping to >>> avoid having the machine grind to a halt as it did >> earlier today. >>> >>> Thanks, >>> joe. >>> >>> >>> >> =================================================== >>> joseph a. 
>>> insley >> >>> insley at mcs.anl.gov >>> mathematics & computer science division >> (630) 252-5649 >>> argonne national laboratory >> (630) >>> 252-5986 (fax) >>> >>> >> >> =================================================== >> joseph a. insley >> >> insley at mcs.anl.gov >> mathematics & computer science division (630) >> 252-5649 >> argonne national laboratory >> (630) >> 252-5986 (fax) >> >> >> > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > From leggett at mcs.anl.gov Sun Feb 3 15:36:57 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Sun, 3 Feb 2008 15:36:57 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> Message-ID: <2D238AED-3D5C-479D-B017-AE8105F5ABA5@mcs.anl.gov> I should say I killed all your processes running on tg-grid1 so your jobs most likely are going to fail. On Feb 3, 2008, at 3:34 PM, Ti Leggett wrote: > Mike, You're killing tg-grid1 again. Can someone work with Mike to > get some swift settings that don't kill our server? > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >> Yes, I'm submitting molecular dynamics simulations >> using Swift. >> >> Is there a default wall-time limit for jobs on tg-uc? >> >> >> >> --- joseph insley wrote: >> >>> Actually, these numbers are now escalating... >>> >>> top - 17:18:54 up 2:29, 1 user, load average: >>> 149.02, 123.63, 91.94 >>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>> stopped, 0 zombie >>> >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>> 479 >>> >>> insley at tg-viz-login1:~> time globusrun -a -r >>> tg-grid.uc.teragrid.org >>> GRAM Authentication test successful >>> real 0m26.134s >>> user 0m0.090s >>> sys 0m0.010s >>> >>> >>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>> >>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>> TG GRAM host) >>>> became unresponsive and had to be rebooted. I am >>> now seeing slow >>>> response times from the Gatekeeper there again. >>> Authenticating to >>>> the gatekeeper should only take a second or two, >>> but it is >>>> periodically taking up to 16 seconds: >>>> >>>> insley at tg-viz-login1:~> time globusrun -a -r >>> tg-grid.uc.teragrid.org >>>> GRAM Authentication test successful >>>> real 0m16.096s >>>> user 0m0.060s >>>> sys 0m0.020s >>>> >>>> looking at the load on tg-grid, it is rather high: >>>> >>>> top - 16:55:26 up 2:06, 1 user, load average: >>> 89.59, 78.69, 62.92 >>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>> stopped, 0 zombie >>>> >>>> And there appear to be a large number of processes >>> owned by kubal: >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>> 380 >>>> >>>> I assume that Mike is using swift to do the job >>> submission. Is >>>> there some throttling of the rate at which jobs >>> are submitted to >>>> the gatekeeper that could be done that would >>> lighten this load >>>> some? (Or has that already been done since >>> earlier today?) The >>>> current response times are not unacceptable, but >>> I'm hoping to >>>> avoid having the machine grind to a halt as it did >>> earlier today. >>>> >>>> Thanks, >>>> joe. >>>> >>>> >>>> >>> =================================================== >>>> joseph a. 
>>>> insley >>> >>>> insley at mcs.anl.gov >>>> mathematics & computer science division >>> (630) 252-5649 >>>> argonne national laboratory >>> (630) >>>> 252-5986 (fax) >>>> >>>> >>> >>> =================================================== >>> joseph a. insley >>> >>> insley at mcs.anl.gov >>> mathematics & computer science division (630) >>> 252-5649 >>> argonne national laboratory >>> (630) >>> 252-5986 (fax) >>> >>> >>> >> >> >> >> >> ____________________________________________________________________________________ >> Be a better friend, newshound, and >> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >> > From hategan at mcs.anl.gov Sun Feb 3 21:09:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 03 Feb 2008 21:09:13 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> Message-ID: <1202094553.13259.4.camel@blabla.mcs.anl.gov> So I was trying some stuff on Friday night. I guess I've found the strategy on when to run the tests: when nobody else has jobs there (besides Buzz doing gridftp tests, Ioan having some Falkon workers running, and the occasional Inca tests). In any event, the machine jumps to about 100% utilization at around 130 jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to 1 in swift.properties. There's still more work I need to do test-wise. On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > Mike, You're killing tg-grid1 again. Can someone work with Mike to get > some swift settings that don't kill our server? > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > Yes, I'm submitting molecular dynamics simulations > > using Swift. > > > > Is there a default wall-time limit for jobs on tg-uc? > > > > > > > > --- joseph insley wrote: > > > >> Actually, these numbers are now escalating... > >> > >> top - 17:18:54 up 2:29, 1 user, load average: > >> 149.02, 123.63, 91.94 > >> Tasks: 469 total, 4 running, 465 sleeping, 0 > >> stopped, 0 zombie > >> > >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >> 479 > >> > >> insley at tg-viz-login1:~> time globusrun -a -r > >> tg-grid.uc.teragrid.org > >> GRAM Authentication test successful > >> real 0m26.134s > >> user 0m0.090s > >> sys 0m0.010s > >> > >> > >> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > >> > >>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL > >> TG GRAM host) > >>> became unresponsive and had to be rebooted. I am > >> now seeing slow > >>> response times from the Gatekeeper there again. > >> Authenticating to > >>> the gatekeeper should only take a second or two, > >> but it is > >>> periodically taking up to 16 seconds: > >>> > >>> insley at tg-viz-login1:~> time globusrun -a -r > >> tg-grid.uc.teragrid.org > >>> GRAM Authentication test successful > >>> real 0m16.096s > >>> user 0m0.060s > >>> sys 0m0.020s > >>> > >>> looking at the load on tg-grid, it is rather high: > >>> > >>> top - 16:55:26 up 2:06, 1 user, load average: > >> 89.59, 78.69, 62.92 > >>> Tasks: 398 total, 20 running, 378 sleeping, 0 > >> stopped, 0 zombie > >>> > >>> And there appear to be a large number of processes > >> owned by kubal: > >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>> 380 > >>> > >>> I assume that Mike is using swift to do the job > >> submission. 
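(To make the throttling advice earlier in this message concrete — this snippet is not part of the original mail. throttle.score.job.factor is the property named in the thread; the other throttle.* names are recalled from Swift's configuration of that era and should be checked against the swift.properties shipped with your release.)

    # Slow the rate at which Swift ramps up concurrent jobs on a site
    throttle.score.job.factor=1

    # Related knobs bounding submission and transfer concurrency (names assumed;
    # verify against etc/swift.properties in the Swift distribution)
    throttle.submit=4
    throttle.host.submit=2
    throttle.transfers=4
    throttle.file.operations=8

With the job factor set to 1, the per-site score should translate into far fewer simultaneous GRAM submissions, which is the point of the request above.
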
Is > >>> there some throttling of the rate at which jobs > >> are submitted to > >>> the gatekeeper that could be done that would > >> lighten this load > >>> some? (Or has that already been done since > >> earlier today?) The > >>> current response times are not unacceptable, but > >> I'm hoping to > >>> avoid having the machine grind to a halt as it did > >> earlier today. > >>> > >>> Thanks, > >>> joe. > >>> > >>> > >>> > >> =================================================== > >>> joseph a. > >>> insley > >> > >>> insley at mcs.anl.gov > >>> mathematics & computer science division > >> (630) 252-5649 > >>> argonne national laboratory > >> (630) > >>> 252-5986 (fax) > >>> > >>> > >> > >> =================================================== > >> joseph a. insley > >> > >> insley at mcs.anl.gov > >> mathematics & computer science division (630) > >> 252-5649 > >> argonne national laboratory > >> (630) > >> 252-5986 (fax) > >> > >> > >> > > > > > > > > > > ____________________________________________________________________________________ > > Be a better friend, newshound, and > > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From foster at mcs.anl.gov Sun Feb 3 21:12:08 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 03 Feb 2008 21:12:08 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202094553.13259.4.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> Message-ID: <47A68288.8060702@mcs.anl.gov> Mihael: Is there any chance you can try GRAM4, as was requested early last week? Ian. Mihael Hategan wrote: > So I was trying some stuff on Friday night. I guess I've found the > strategy on when to run the tests: when nobody else has jobs there > (besides Buzz doing gridftp tests, Ioan having some Falkon workers > running, and the occasional Inca tests). > > In any event, the machine jumps to about 100% utilization at around 130 > jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to > 1 in swift.properties. > > There's still more work I need to do test-wise. > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >> Mike, You're killing tg-grid1 again. Can someone work with Mike to get >> some swift settings that don't kill our server? >> >> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >> >> >>> Yes, I'm submitting molecular dynamics simulations >>> using Swift. >>> >>> Is there a default wall-time limit for jobs on tg-uc? >>> >>> >>> >>> --- joseph insley wrote: >>> >>> >>>> Actually, these numbers are now escalating... >>>> >>>> top - 17:18:54 up 2:29, 1 user, load average: >>>> 149.02, 123.63, 91.94 >>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>> stopped, 0 zombie >>>> >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>> 479 >>>> >>>> insley at tg-viz-login1:~> time globusrun -a -r >>>> tg-grid.uc.teragrid.org >>>> GRAM Authentication test successful >>>> real 0m26.134s >>>> user 0m0.090s >>>> sys 0m0.010s >>>> >>>> >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>> >>>> >>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>> >>>> TG GRAM host) >>>> >>>>> became unresponsive and had to be rebooted. 
I am >>>>> >>>> now seeing slow >>>> >>>>> response times from the Gatekeeper there again. >>>>> >>>> Authenticating to >>>> >>>>> the gatekeeper should only take a second or two, >>>>> >>>> but it is >>>> >>>>> periodically taking up to 16 seconds: >>>>> >>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>> >>>> tg-grid.uc.teragrid.org >>>> >>>>> GRAM Authentication test successful >>>>> real 0m16.096s >>>>> user 0m0.060s >>>>> sys 0m0.020s >>>>> >>>>> looking at the load on tg-grid, it is rather high: >>>>> >>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>> >>>> 89.59, 78.69, 62.92 >>>> >>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>> >>>> stopped, 0 zombie >>>> >>>>> And there appear to be a large number of processes >>>>> >>>> owned by kubal: >>>> >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>> 380 >>>>> >>>>> I assume that Mike is using swift to do the job >>>>> >>>> submission. Is >>>> >>>>> there some throttling of the rate at which jobs >>>>> >>>> are submitted to >>>> >>>>> the gatekeeper that could be done that would >>>>> >>>> lighten this load >>>> >>>>> some? (Or has that already been done since >>>>> >>>> earlier today?) The >>>> >>>>> current response times are not unacceptable, but >>>>> >>>> I'm hoping to >>>> >>>>> avoid having the machine grind to a halt as it did >>>>> >>>> earlier today. >>>> >>>>> Thanks, >>>>> joe. >>>>> >>>>> >>>>> >>>>> >>>> =================================================== >>>> >>>>> joseph a. >>>>> insley >>>>> >>>>> insley at mcs.anl.gov >>>>> mathematics & computer science division >>>>> >>>> (630) 252-5649 >>>> >>>>> argonne national laboratory >>>>> >>>> (630) >>>> >>>>> 252-5986 (fax) >>>>> >>>>> >>>>> >>>> =================================================== >>>> joseph a. insley >>>> >>>> insley at mcs.anl.gov >>>> mathematics & computer science division (630) >>>> 252-5649 >>>> argonne national laboratory >>>> (630) >>>> 252-5986 (fax) >>>> >>>> >>>> >>>> >>> >>> >>> ____________________________________________________________________________________ >>> Be a better friend, newshound, and >>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sun Feb 3 21:16:05 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 03 Feb 2008 21:16:05 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <47A68288.8060702@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> Message-ID: <1202094965.13259.8.camel@blabla.mcs.anl.gov> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: > Mihael: > > Is there any chance you can try GRAM4, as was requested early last > week? For the tests, sure. That's a big part of why I'm doing them. If we're talking about the workflow that seems to be repeatedly killing tg-grid1, then Mike Kubal would be the right person to ask. > > Ian. 
> > Mihael Hategan wrote: > > So I was trying some stuff on Friday night. I guess I've found the > > strategy on when to run the tests: when nobody else has jobs there > > (besides Buzz doing gridftp tests, Ioan having some Falkon workers > > running, and the occasional Inca tests). > > > > In any event, the machine jumps to about 100% utilization at around 130 > > jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to > > 1 in swift.properties. > > > > There's still more work I need to do test-wise. > > > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > > > > > Mike, You're killing tg-grid1 again. Can someone work with Mike to get > > > some swift settings that don't kill our server? > > > > > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > > > > > > > > Yes, I'm submitting molecular dynamics simulations > > > > using Swift. > > > > > > > > Is there a default wall-time limit for jobs on tg-uc? > > > > > > > > > > > > > > > > --- joseph insley wrote: > > > > > > > > > > > > > Actually, these numbers are now escalating... > > > > > > > > > > top - 17:18:54 up 2:29, 1 user, load average: > > > > > 149.02, 123.63, 91.94 > > > > > Tasks: 469 total, 4 running, 465 sleeping, 0 > > > > > stopped, 0 zombie > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > > > 479 > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > tg-grid.uc.teragrid.org > > > > > GRAM Authentication test successful > > > > > real 0m26.134s > > > > > user 0m0.090s > > > > > sys 0m0.010s > > > > > > > > > > > > > > > On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > > > > > > > > > > > > > > > > Earlier today tg-grid.uc.teragrid.org (the UC/ANL > > > > > > > > > > > TG GRAM host) > > > > > > > > > > > became unresponsive and had to be rebooted. I am > > > > > > > > > > > now seeing slow > > > > > > > > > > > response times from the Gatekeeper there again. > > > > > > > > > > > Authenticating to > > > > > > > > > > > the gatekeeper should only take a second or two, > > > > > > > > > > > but it is > > > > > > > > > > > periodically taking up to 16 seconds: > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > > > > > > > tg-grid.uc.teragrid.org > > > > > > > > > > > GRAM Authentication test successful > > > > > > real 0m16.096s > > > > > > user 0m0.060s > > > > > > sys 0m0.020s > > > > > > > > > > > > looking at the load on tg-grid, it is rather high: > > > > > > > > > > > > top - 16:55:26 up 2:06, 1 user, load average: > > > > > > > > > > > 89.59, 78.69, 62.92 > > > > > > > > > > > Tasks: 398 total, 20 running, 378 sleeping, 0 > > > > > > > > > > > stopped, 0 zombie > > > > > > > > > > > And there appear to be a large number of processes > > > > > > > > > > > owned by kubal: > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > > > > 380 > > > > > > > > > > > > I assume that Mike is using swift to do the job > > > > > > > > > > > submission. Is > > > > > > > > > > > there some throttling of the rate at which jobs > > > > > > > > > > > are submitted to > > > > > > > > > > > the gatekeeper that could be done that would > > > > > > > > > > > lighten this load > > > > > > > > > > > some? (Or has that already been done since > > > > > > > > > > > earlier today?) The > > > > > > > > > > > current response times are not unacceptable, but > > > > > > > > > > > I'm hoping to > > > > > > > > > > > avoid having the machine grind to a halt as it did > > > > > > > > > > > earlier today. 
> > > > > > > > > > > Thanks, > > > > > > joe. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > > > > > > > joseph a. > > > > > > insley > > > > > > > > > > > > insley at mcs.anl.gov > > > > > > mathematics & computer science division > > > > > > > > > > > (630) 252-5649 > > > > > > > > > > > argonne national laboratory > > > > > > > > > > > (630) > > > > > > > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > joseph a. insley > > > > > > > > > > insley at mcs.anl.gov > > > > > mathematics & computer science division (630) > > > > > 252-5649 > > > > > argonne national laboratory > > > > > (630) > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Be a better friend, newshound, and > > > > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From foster at mcs.anl.gov Sun Feb 3 21:23:24 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 03 Feb 2008 21:23:24 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202094965.13259.8.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> Message-ID: <47A6852C.9080208@mcs.anl.gov> Mihael: The motivation for doing the tests is so that we can provide appropriate advice to Mike, our super-high-priority Swift user who we want to help as much and as quickly as possible. I'm concerned that we don't seem to feel any sense of urgency in doing this. I'd like to emphasize that the sole reason for anyone funding work on Swift is because they believe us when we say that Swift can help people make more effective use of high-performance computing systems (parallel and grid). Mike K. is our most engaged and committed user, and if he is successful, will bring us fame and fortune (and fun, I think, to provide three Fs!). It shouldn't take a week for us to get back to him with information on how to run his application efficiently on TG. Ian. Mihael Hategan wrote: > On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: > >> Mihael: >> >> Is there any chance you can try GRAM4, as was requested early last >> week? >> > > For the tests, sure. That's a big part of why I'm doing them. > > If we're talking about the workflow that seems to be repeatedly killing > tg-grid1, then Mike Kubal would be the right person to ask. > > >> Ian. >> >> Mihael Hategan wrote: >> >>> So I was trying some stuff on Friday night. I guess I've found the >>> strategy on when to run the tests: when nobody else has jobs there >>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers >>> running, and the occasional Inca tests). >>> >>> In any event, the machine jumps to about 100% utilization at around 130 >>> jobs with pre-ws gram. 
So Mike, please set throttle.score.job.factor to >>> 1 in swift.properties. >>> >>> There's still more work I need to do test-wise. >>> >>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>> >>> >>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get >>>> some swift settings that don't kill our server? >>>> >>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>> >>>> >>>> >>>>> Yes, I'm submitting molecular dynamics simulations >>>>> using Swift. >>>>> >>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>> >>>>> >>>>> >>>>> --- joseph insley wrote: >>>>> >>>>> >>>>> >>>>>> Actually, these numbers are now escalating... >>>>>> >>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>> 149.02, 123.63, 91.94 >>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>> stopped, 0 zombie >>>>>> >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>> 479 >>>>>> >>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>> tg-grid.uc.teragrid.org >>>>>> GRAM Authentication test successful >>>>>> real 0m26.134s >>>>>> user 0m0.090s >>>>>> sys 0m0.010s >>>>>> >>>>>> >>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>>> >>>>>>> >>>>>> TG GRAM host) >>>>>> >>>>>> >>>>>>> became unresponsive and had to be rebooted. I am >>>>>>> >>>>>>> >>>>>> now seeing slow >>>>>> >>>>>> >>>>>>> response times from the Gatekeeper there again. >>>>>>> >>>>>>> >>>>>> Authenticating to >>>>>> >>>>>> >>>>>>> the gatekeeper should only take a second or two, >>>>>>> >>>>>>> >>>>>> but it is >>>>>> >>>>>> >>>>>>> periodically taking up to 16 seconds: >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> >>>>>>> >>>>>> tg-grid.uc.teragrid.org >>>>>> >>>>>> >>>>>>> GRAM Authentication test successful >>>>>>> real 0m16.096s >>>>>>> user 0m0.060s >>>>>>> sys 0m0.020s >>>>>>> >>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>> >>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>>> >>>>>>> >>>>>> 89.59, 78.69, 62.92 >>>>>> >>>>>> >>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>>> >>>>>>> >>>>>> stopped, 0 zombie >>>>>> >>>>>> >>>>>>> And there appear to be a large number of processes >>>>>>> >>>>>>> >>>>>> owned by kubal: >>>>>> >>>>>> >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 380 >>>>>>> >>>>>>> I assume that Mike is using swift to do the job >>>>>>> >>>>>>> >>>>>> submission. Is >>>>>> >>>>>> >>>>>>> there some throttling of the rate at which jobs >>>>>>> >>>>>>> >>>>>> are submitted to >>>>>> >>>>>> >>>>>>> the gatekeeper that could be done that would >>>>>>> >>>>>>> >>>>>> lighten this load >>>>>> >>>>>> >>>>>>> some? (Or has that already been done since >>>>>>> >>>>>>> >>>>>> earlier today?) The >>>>>> >>>>>> >>>>>>> current response times are not unacceptable, but >>>>>>> >>>>>>> >>>>>> I'm hoping to >>>>>> >>>>>> >>>>>>> avoid having the machine grind to a halt as it did >>>>>>> >>>>>>> >>>>>> earlier today. >>>>>> >>>>>> >>>>>>> Thanks, >>>>>>> joe. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> =================================================== >>>>>> >>>>>> >>>>>>> joseph a. 
>>>>>>> insley >>>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division >>>>>>> >>>>>>> >>>>>> (630) 252-5649 >>>>>> >>>>>> >>>>>>> argonne national laboratory >>>>>>> >>>>>>> >>>>>> (630) >>>>>> >>>>>> >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> =================================================== >>>>>> joseph a. insley >>>>>> >>>>>> insley at mcs.anl.gov >>>>>> mathematics & computer science division (630) >>>>>> 252-5649 >>>>>> argonne national laboratory >>>>>> (630) >>>>>> 252-5986 (fax) >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> ____________________________________________________________________________________ >>>>> Be a better friend, newshound, and >>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>> >>>>> >>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sun Feb 3 21:53:51 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 03 Feb 2008 21:53:51 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <47A6852C.9080208@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> <47A6852C.9080208@mcs.anl.gov> Message-ID: <1202097231.13666.21.camel@blabla.mcs.anl.gov> If you want to prioritize things differently, then please do so from the beginning instead of pointing out the priorities were wrong after a while. So please stop doing this. It is frustrating and it is not what I signed up for. Mihael On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote: > Mihael: > > The motivation for doing the tests is so that we can provide > appropriate advice to Mike, our super-high-priority Swift user who we > want to help as much and as quickly as possible. I'm concerned that we > don't seem to feel any sense of urgency in doing this. I'd like to > emphasize that the sole reason for anyone funding work on Swift is > because they believe us when we say that Swift can help people make > more effective use of high-performance computing systems (parallel and > grid). Mike K. is our most engaged and committed user, and if he is > successful, will bring us fame and fortune (and fun, I think, to > provide three Fs!). It shouldn't take a week for us to get back to him > with information on how to run his application efficiently on TG. > > Ian. > > Mihael Hategan wrote: > > On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: > > > > > Mihael: > > > > > > Is there any chance you can try GRAM4, as was requested early last > > > week? > > > > > > > For the tests, sure. That's a big part of why I'm doing them. > > > > If we're talking about the workflow that seems to be repeatedly killing > > tg-grid1, then Mike Kubal would be the right person to ask. > > > > > > > Ian. > > > > > > Mihael Hategan wrote: > > > > > > > So I was trying some stuff on Friday night. 
I guess I've found the > > > > strategy on when to run the tests: when nobody else has jobs there > > > > (besides Buzz doing gridftp tests, Ioan having some Falkon workers > > > > running, and the occasional Inca tests). > > > > > > > > In any event, the machine jumps to about 100% utilization at around 130 > > > > jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to > > > > 1 in swift.properties. > > > > > > > > There's still more work I need to do test-wise. > > > > > > > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > > > > > > > > > > > > > Mike, You're killing tg-grid1 again. Can someone work with Mike to get > > > > > some swift settings that don't kill our server? > > > > > > > > > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > > > > > > > > > > > > > > > > > > > Yes, I'm submitting molecular dynamics simulations > > > > > > using Swift. > > > > > > > > > > > > Is there a default wall-time limit for jobs on tg-uc? > > > > > > > > > > > > > > > > > > > > > > > > --- joseph insley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Actually, these numbers are now escalating... > > > > > > > > > > > > > > top - 17:18:54 up 2:29, 1 user, load average: > > > > > > > 149.02, 123.63, 91.94 > > > > > > > Tasks: 469 total, 4 running, 465 sleeping, 0 > > > > > > > stopped, 0 zombie > > > > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > > > > > 479 > > > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > > > tg-grid.uc.teragrid.org > > > > > > > GRAM Authentication test successful > > > > > > > real 0m26.134s > > > > > > > user 0m0.090s > > > > > > > sys 0m0.010s > > > > > > > > > > > > > > > > > > > > > On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Earlier today tg-grid.uc.teragrid.org (the UC/ANL > > > > > > > > > > > > > > > > > > > > > > > TG GRAM host) > > > > > > > > > > > > > > > > > > > > > > became unresponsive and had to be rebooted. I am > > > > > > > > > > > > > > > > > > > > > > > now seeing slow > > > > > > > > > > > > > > > > > > > > > > response times from the Gatekeeper there again. 
> > > > > > > > > > > > > > > > > > > > > > > Authenticating to > > > > > > > > > > > > > > > > > > > > > > the gatekeeper should only take a second or two, > > > > > > > > > > > > > > > > > > > > > > > but it is > > > > > > > > > > > > > > > > > > > > > > periodically taking up to 16 seconds: > > > > > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > > > > > > > > > > > > > > > > > > > tg-grid.uc.teragrid.org > > > > > > > > > > > > > > > > > > > > > > GRAM Authentication test successful > > > > > > > > real 0m16.096s > > > > > > > > user 0m0.060s > > > > > > > > sys 0m0.020s > > > > > > > > > > > > > > > > looking at the load on tg-grid, it is rather high: > > > > > > > > > > > > > > > > top - 16:55:26 up 2:06, 1 user, load average: > > > > > > > > > > > > > > > > > > > > > > > 89.59, 78.69, 62.92 > > > > > > > > > > > > > > > > > > > > > > Tasks: 398 total, 20 running, 378 sleeping, 0 > > > > > > > > > > > > > > > > > > > > > > > stopped, 0 zombie > > > > > > > > > > > > > > > > > > > > > > And there appear to be a large number of processes > > > > > > > > > > > > > > > > > > > > > > > owned by kubal: > > > > > > > > > > > > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > > > > > > 380 > > > > > > > > > > > > > > > > I assume that Mike is using swift to do the job > > > > > > > > > > > > > > > > > > > > > > > submission. Is > > > > > > > > > > > > > > > > > > > > > > there some throttling of the rate at which jobs > > > > > > > > > > > > > > > > > > > > > > > are submitted to > > > > > > > > > > > > > > > > > > > > > > the gatekeeper that could be done that would > > > > > > > > > > > > > > > > > > > > > > > lighten this load > > > > > > > > > > > > > > > > > > > > > > some? (Or has that already been done since > > > > > > > > > > > > > > > > > > > > > > > earlier today?) The > > > > > > > > > > > > > > > > > > > > > > current response times are not unacceptable, but > > > > > > > > > > > > > > > > > > > > > > > I'm hoping to > > > > > > > > > > > > > > > > > > > > > > avoid having the machine grind to a halt as it did > > > > > > > > > > > > > > > > > > > > > > > earlier today. > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > joe. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > > > > > > > > > > > > > > > > > > joseph a. > > > > > > > > insley > > > > > > > > > > > > > > > > insley at mcs.anl.gov > > > > > > > > mathematics & computer science division > > > > > > > > > > > > > > > > > > > > > > > (630) 252-5649 > > > > > > > > > > > > > > > > > > > > > > argonne national laboratory > > > > > > > > > > > > > > > > > > > > > > > (630) > > > > > > > > > > > > > > > > > > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > > > joseph a. insley > > > > > > > > > > > > > > insley at mcs.anl.gov > > > > > > > mathematics & computer science division (630) > > > > > > > 252-5649 > > > > > > > argonne national laboratory > > > > > > > (630) > > > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > > Be a better friend, newshound, and > > > > > > know-it-all with Yahoo! Mobile. Try it now. 
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > From wilde at mcs.anl.gov Sun Feb 3 22:02:02 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 03 Feb 2008 22:02:02 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202097231.13666.21.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> <47A6852C.9080208@mcs.anl.gov> <1202097231.13666.21.camel@blabla.mcs.anl.gov> Message-ID: <47A68E3A.1090603@mcs.anl.gov> Ian, Mihael, confusion on the priorities is my fault, and I'll work to fix that. - Mike On 2/3/08 9:53 PM, Mihael Hategan wrote: > If you want to prioritize things differently, then please do so from the > beginning instead of pointing out the priorities were wrong after a > while. So please stop doing this. It is frustrating and it is not what I > signed up for. > > Mihael > > On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote: >> Mihael: >> >> The motivation for doing the tests is so that we can provide >> appropriate advice to Mike, our super-high-priority Swift user who we >> want to help as much and as quickly as possible. I'm concerned that we >> don't seem to feel any sense of urgency in doing this. I'd like to >> emphasize that the sole reason for anyone funding work on Swift is >> because they believe us when we say that Swift can help people make >> more effective use of high-performance computing systems (parallel and >> grid). Mike K. is our most engaged and committed user, and if he is >> successful, will bring us fame and fortune (and fun, I think, to >> provide three Fs!). It shouldn't take a week for us to get back to him >> with information on how to run his application efficiently on TG. >> >> Ian. >> >> Mihael Hategan wrote: >>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: >>> >>>> Mihael: >>>> >>>> Is there any chance you can try GRAM4, as was requested early last >>>> week? >>>> >>> For the tests, sure. That's a big part of why I'm doing them. >>> >>> If we're talking about the workflow that seems to be repeatedly killing >>> tg-grid1, then Mike Kubal would be the right person to ask. >>> >>> >>>> Ian. >>>> >>>> Mihael Hategan wrote: >>>> >>>>> So I was trying some stuff on Friday night. I guess I've found the >>>>> strategy on when to run the tests: when nobody else has jobs there >>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers >>>>> running, and the occasional Inca tests). >>>>> >>>>> In any event, the machine jumps to about 100% utilization at around 130 >>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to >>>>> 1 in swift.properties. >>>>> >>>>> There's still more work I need to do test-wise. >>>>> >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>> >>>>> >>>>>> Mike, You're killing tg-grid1 again. 
Can someone work with Mike to get >>>>>> some swift settings that don't kill our server? >>>>>> >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Yes, I'm submitting molecular dynamics simulations >>>>>>> using Swift. >>>>>>> >>>>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- joseph insley wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Actually, these numbers are now escalating... >>>>>>>> >>>>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>>>> 149.02, 123.63, 91.94 >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 479 >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m26.134s >>>>>>>> user 0m0.090s >>>>>>>> sys 0m0.010s >>>>>>>> >>>>>>>> >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>>>>> >>>>>>>>> >>>>>>>> TG GRAM host) >>>>>>>> >>>>>>>> >>>>>>>>> became unresponsive and had to be rebooted. I am >>>>>>>>> >>>>>>>>> >>>>>>>> now seeing slow >>>>>>>> >>>>>>>> >>>>>>>>> response times from the Gatekeeper there again. >>>>>>>>> >>>>>>>>> >>>>>>>> Authenticating to >>>>>>>> >>>>>>>> >>>>>>>>> the gatekeeper should only take a second or two, >>>>>>>>> >>>>>>>>> >>>>>>>> but it is >>>>>>>> >>>>>>>> >>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>> >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>> >>>>>>>>> >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> >>>>>>>> >>>>>>>>> GRAM Authentication test successful >>>>>>>>> real 0m16.096s >>>>>>>>> user 0m0.060s >>>>>>>>> sys 0m0.020s >>>>>>>>> >>>>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>>>> >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>>>>> >>>>>>>>> >>>>>>>> 89.59, 78.69, 62.92 >>>>>>>> >>>>>>>> >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>>>>> >>>>>>>>> >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> >>>>>>>>> And there appear to be a large number of processes >>>>>>>>> >>>>>>>>> >>>>>>>> owned by kubal: >>>>>>>> >>>>>>>> >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>> 380 >>>>>>>>> >>>>>>>>> I assume that Mike is using swift to do the job >>>>>>>>> >>>>>>>>> >>>>>>>> submission. Is >>>>>>>> >>>>>>>> >>>>>>>>> there some throttling of the rate at which jobs >>>>>>>>> >>>>>>>>> >>>>>>>> are submitted to >>>>>>>> >>>>>>>> >>>>>>>>> the gatekeeper that could be done that would >>>>>>>>> >>>>>>>>> >>>>>>>> lighten this load >>>>>>>> >>>>>>>> >>>>>>>>> some? (Or has that already been done since >>>>>>>>> >>>>>>>>> >>>>>>>> earlier today?) The >>>>>>>> >>>>>>>> >>>>>>>>> current response times are not unacceptable, but >>>>>>>>> >>>>>>>>> >>>>>>>> I'm hoping to >>>>>>>> >>>>>>>> >>>>>>>>> avoid having the machine grind to a halt as it did >>>>>>>>> >>>>>>>>> >>>>>>>> earlier today. >>>>>>>> >>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> joe. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> =================================================== >>>>>>>> >>>>>>>> >>>>>>>>> joseph a. 
>>>>>>>>> insley >>>>>>>>> >>>>>>>>> insley at mcs.anl.gov >>>>>>>>> mathematics & computer science division >>>>>>>>> >>>>>>>>> >>>>>>>> (630) 252-5649 >>>>>>>> >>>>>>>> >>>>>>>>> argonne national laboratory >>>>>>>>> >>>>>>>>> >>>>>>>> (630) >>>>>>>> >>>>>>>> >>>>>>>>> 252-5986 (fax) >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> =================================================== >>>>>>>> joseph a. insley >>>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division (630) >>>>>>>> 252-5649 >>>>>>>> argonne national laboratory >>>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> ____________________________________________________________________________________ >>>>>>> Be a better friend, newshound, and >>>>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>>> >>>>> >>> > > From foster at mcs.anl.gov Sun Feb 3 22:05:03 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 03 Feb 2008 22:05:03 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202097231.13666.21.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> <47A6852C.9080208@mcs.anl.gov> <1202097231.13666.21.camel@blabla.mcs.anl.gov> Message-ID: <47A68EEF.50804@mcs.anl.gov> Mihael: The point of my mail was to express what I think our priorities should be. It would be useful to have a discussion of what our priorities are, and how they differ from what I think they should be. But probably we shouldn't do that via email. Ian. Mihael Hategan wrote: > If you want to prioritize things differently, then please do so from the > beginning instead of pointing out the priorities were wrong after a > while. So please stop doing this. It is frustrating and it is not what I > signed up for. > > Mihael > > On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote: > >> Mihael: >> >> The motivation for doing the tests is so that we can provide >> appropriate advice to Mike, our super-high-priority Swift user who we >> want to help as much and as quickly as possible. I'm concerned that we >> don't seem to feel any sense of urgency in doing this. I'd like to >> emphasize that the sole reason for anyone funding work on Swift is >> because they believe us when we say that Swift can help people make >> more effective use of high-performance computing systems (parallel and >> grid). Mike K. is our most engaged and committed user, and if he is >> successful, will bring us fame and fortune (and fun, I think, to >> provide three Fs!). It shouldn't take a week for us to get back to him >> with information on how to run his application efficiently on TG. >> >> Ian. >> >> Mihael Hategan wrote: >> >>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: >>> >>> >>>> Mihael: >>>> >>>> Is there any chance you can try GRAM4, as was requested early last >>>> week? >>>> >>>> >>> For the tests, sure. 
That's a big part of why I'm doing them. >>> >>> If we're talking about the workflow that seems to be repeatedly killing >>> tg-grid1, then Mike Kubal would be the right person to ask. >>> >>> >>> >>>> Ian. >>>> >>>> Mihael Hategan wrote: >>>> >>>> >>>>> So I was trying some stuff on Friday night. I guess I've found the >>>>> strategy on when to run the tests: when nobody else has jobs there >>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers >>>>> running, and the occasional Inca tests). >>>>> >>>>> In any event, the machine jumps to about 100% utilization at around 130 >>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to >>>>> 1 in swift.properties. >>>>> >>>>> There's still more work I need to do test-wise. >>>>> >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>> >>>>> >>>>> >>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get >>>>>> some swift settings that don't kill our server? >>>>>> >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Yes, I'm submitting molecular dynamics simulations >>>>>>> using Swift. >>>>>>> >>>>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- joseph insley wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Actually, these numbers are now escalating... >>>>>>>> >>>>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>>>> 149.02, 123.63, 91.94 >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 479 >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m26.134s >>>>>>>> user 0m0.090s >>>>>>>> sys 0m0.010s >>>>>>>> >>>>>>>> >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> TG GRAM host) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> became unresponsive and had to be rebooted. I am >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> now seeing slow >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> response times from the Gatekeeper there again. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> Authenticating to >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> the gatekeeper should only take a second or two, >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> but it is >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>> >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> GRAM Authentication test successful >>>>>>>>> real 0m16.096s >>>>>>>>> user 0m0.060s >>>>>>>>> sys 0m0.020s >>>>>>>>> >>>>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>>>> >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> 89.59, 78.69, 62.92 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> And there appear to be a large number of processes >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> owned by kubal: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>> 380 >>>>>>>>> >>>>>>>>> I assume that Mike is using swift to do the job >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> submission. 
Is >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> there some throttling of the rate at which jobs >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> are submitted to >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> the gatekeeper that could be done that would >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> lighten this load >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> some? (Or has that already been done since >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> earlier today?) The >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> current response times are not unacceptable, but >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> I'm hoping to >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> avoid having the machine grind to a halt as it did >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> earlier today. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> joe. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> =================================================== >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> joseph a. >>>>>>>>> insley >>>>>>>>> >>>>>>>>> insley at mcs.anl.gov >>>>>>>>> mathematics & computer science division >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> (630) 252-5649 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> argonne national laboratory >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> (630) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> 252-5986 (fax) >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> =================================================== >>>>>>>> joseph a. insley >>>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division (630) >>>>>>>> 252-5649 >>>>>>>> argonne national laboratory >>>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> ____________________________________________________________________________________ >>>>>>> Be a better friend, newshound, and >>>>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>>> >>>>> >>>>> >>> >>> > > From hategan at mcs.anl.gov Sun Feb 3 22:39:05 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 03 Feb 2008 22:39:05 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <47A68EEF.50804@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> <47A6852C.9080208@mcs.anl.gov> <1202097231.13666.21.camel@blabla.mcs.anl.gov> <47A68EEF.50804@mcs.anl.gov> Message-ID: <1202099945.14375.22.camel@blabla.mcs.anl.gov> We cannot define priorities about things we don't know. This management by crisis (i.e. every new thing is of utmost priority, and maybe some older things that used to be of utmost priority may or may not still be of utmost priority) doesn't seem to work well. Add to that the implications that x didn't do things right (so that we make it slightly personal), and you've got a recipe for things not working well at all. Repeat this a few times, and even the most resilient of people will begin having second thoughts. 
And the reaction to things one cannot control are not those of fight but those of flight. Now, onto the problem. The tests are no easy thing. I need time to find the right settings, the right ways to do it, and the right times to do it (the process involves getting that machine close to the point of crashing). And then some way to transform some seemingly garbage like log files into something meaningful. So no, it's not a one day job. In the mean time, Mike was informed about what we believe might be better ways to make things work (throttling parameters, trying ws-gram, local PBS). Mihael On Sun, 2008-02-03 at 22:05 -0600, Ian Foster wrote: > Mihael: > > The point of my mail was to express what I think our priorities should be. > > It would be useful to have a discussion of what our priorities are, and > how they differ from what I think they should be. But probably we > shouldn't do that via email. > > Ian. > > Mihael Hategan wrote: > > If you want to prioritize things differently, then please do so from the > > beginning instead of pointing out the priorities were wrong after a > > while. So please stop doing this. It is frustrating and it is not what I > > signed up for. > > > > Mihael > > > > On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote: > > > >> Mihael: > >> > >> The motivation for doing the tests is so that we can provide > >> appropriate advice to Mike, our super-high-priority Swift user who we > >> want to help as much and as quickly as possible. I'm concerned that we > >> don't seem to feel any sense of urgency in doing this. I'd like to > >> emphasize that the sole reason for anyone funding work on Swift is > >> because they believe us when we say that Swift can help people make > >> more effective use of high-performance computing systems (parallel and > >> grid). Mike K. is our most engaged and committed user, and if he is > >> successful, will bring us fame and fortune (and fun, I think, to > >> provide three Fs!). It shouldn't take a week for us to get back to him > >> with information on how to run his application efficiently on TG. > >> > >> Ian. > >> > >> Mihael Hategan wrote: > >> > >>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: > >>> > >>> > >>>> Mihael: > >>>> > >>>> Is there any chance you can try GRAM4, as was requested early last > >>>> week? > >>>> > >>>> > >>> For the tests, sure. That's a big part of why I'm doing them. > >>> > >>> If we're talking about the workflow that seems to be repeatedly killing > >>> tg-grid1, then Mike Kubal would be the right person to ask. > >>> > >>> > >>> > >>>> Ian. > >>>> > >>>> Mihael Hategan wrote: > >>>> > >>>> > >>>>> So I was trying some stuff on Friday night. I guess I've found the > >>>>> strategy on when to run the tests: when nobody else has jobs there > >>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers > >>>>> running, and the occasional Inca tests). > >>>>> > >>>>> In any event, the machine jumps to about 100% utilization at around 130 > >>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to > >>>>> 1 in swift.properties. > >>>>> > >>>>> There's still more work I need to do test-wise. > >>>>> > >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>> > >>>>> > >>>>> > >>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get > >>>>>> some swift settings that don't kill our server? 
> >>>>>> > >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>>> Yes, I'm submitting molecular dynamics simulations > >>>>>>> using Swift. > >>>>>>> > >>>>>>> Is there a default wall-time limit for jobs on tg-uc? > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> --- joseph insley wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> Actually, these numbers are now escalating... > >>>>>>>> > >>>>>>>> top - 17:18:54 up 2:29, 1 user, load average: > >>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 > >>>>>>>> stopped, 0 zombie > >>>>>>>> > >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>> 479 > >>>>>>>> > >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>> GRAM Authentication test successful > >>>>>>>> real 0m26.134s > >>>>>>>> user 0m0.090s > >>>>>>>> sys 0m0.010s > >>>>>>>> > >>>>>>>> > >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> TG GRAM host) > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> became unresponsive and had to be rebooted. I am > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> now seeing slow > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> response times from the Gatekeeper there again. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> Authenticating to > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> the gatekeeper should only take a second or two, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> but it is > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>> > >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> GRAM Authentication test successful > >>>>>>>>> real 0m16.096s > >>>>>>>>> user 0m0.060s > >>>>>>>>> sys 0m0.020s > >>>>>>>>> > >>>>>>>>> looking at the load on tg-grid, it is rather high: > >>>>>>>>> > >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> stopped, 0 zombie > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> And there appear to be a large number of processes > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> owned by kubal: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>> 380 > >>>>>>>>> > >>>>>>>>> I assume that Mike is using swift to do the job > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> submission. Is > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> there some throttling of the rate at which jobs > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> are submitted to > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> the gatekeeper that could be done that would > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> lighten this load > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> some? (Or has that already been done since > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> earlier today?) 
The > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> current response times are not unacceptable, but > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> I'm hoping to > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> avoid having the machine grind to a halt as it did > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> earlier today. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> joe. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> =================================================== > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> joseph a. > >>>>>>>>> insley > >>>>>>>>> > >>>>>>>>> insley at mcs.anl.gov > >>>>>>>>> mathematics & computer science division > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> (630) 252-5649 > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> argonne national laboratory > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> (630) > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> 252-5986 (fax) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> =================================================== > >>>>>>>> joseph a. insley > >>>>>>>> > >>>>>>>> insley at mcs.anl.gov > >>>>>>>> mathematics & computer science division (630) > >>>>>>>> 252-5649 > >>>>>>>> argonne national laboratory > >>>>>>>> (630) > >>>>>>>> 252-5986 (fax) > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> ____________________________________________________________________________________ > >>>>>>> Be a better friend, newshound, and > >>>>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-devel mailing list > >>>>>> Swift-devel at ci.uchicago.edu > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>> > >>>>> > >>>>> > >>>>> > >>> > >>> > > > > > From mikekubal at yahoo.com Mon Feb 4 00:11:34 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Sun, 3 Feb 2008 22:11:34 -0800 (PST) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202094553.13259.4.camel@blabla.mcs.anl.gov> Message-ID: <548830.35963.qm@web52311.mail.re2.yahoo.com> Sorry for killing the server. I'm pushing to get results to guide the selection of compounds for wet-lab testing. I had set the throttle.score.job.factor to 1 in the swift.properties file. I certainly appreciate everyone's efforts and responsiveness. Let me know what to try next, before I kill again. Cheers, Mike --- Mihael Hategan wrote: > So I was trying some stuff on Friday night. I guess > I've found the > strategy on when to run the tests: when nobody else > has jobs there > (besides Buzz doing gridftp tests, Ioan having some > Falkon workers > running, and the occasional Inca tests). > > In any event, the machine jumps to about 100% > utilization at around 130 > jobs with pre-ws gram. So Mike, please set > throttle.score.job.factor to > 1 in swift.properties. > > There's still more work I need to do test-wise. > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > > Mike, You're killing tg-grid1 again. Can someone > work with Mike to get > > some swift settings that don't kill our server? 
> > > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > > > Yes, I'm submitting molecular dynamics > simulations > > > using Swift. > > > > > > Is there a default wall-time limit for jobs on > tg-uc? > > > > > > > > > > > > --- joseph insley wrote: > > > > > >> Actually, these numbers are now escalating... > > >> > > >> top - 17:18:54 up 2:29, 1 user, load > average: > > >> 149.02, 123.63, 91.94 > > >> Tasks: 469 total, 4 running, 465 sleeping, > 0 > > >> stopped, 0 zombie > > >> > > >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > >> 479 > > >> > > >> insley at tg-viz-login1:~> time globusrun -a -r > > >> tg-grid.uc.teragrid.org > > >> GRAM Authentication test successful > > >> real 0m26.134s > > >> user 0m0.090s > > >> sys 0m0.010s > > >> > > >> > > >> On Jan 28, 2008, at 5:15 PM, joseph insley > wrote: > > >> > > >>> Earlier today tg-grid.uc.teragrid.org (the > UC/ANL > > >> TG GRAM host) > > >>> became unresponsive and had to be rebooted. I > am > > >> now seeing slow > > >>> response times from the Gatekeeper there > again. > > >> Authenticating to > > >>> the gatekeeper should only take a second or > two, > > >> but it is > > >>> periodically taking up to 16 seconds: > > >>> > > >>> insley at tg-viz-login1:~> time globusrun -a -r > > >> tg-grid.uc.teragrid.org > > >>> GRAM Authentication test successful > > >>> real 0m16.096s > > >>> user 0m0.060s > > >>> sys 0m0.020s > > >>> > > >>> looking at the load on tg-grid, it is rather > high: > > >>> > > >>> top - 16:55:26 up 2:06, 1 user, load > average: > > >> 89.59, 78.69, 62.92 > > >>> Tasks: 398 total, 20 running, 378 sleeping, > 0 > > >> stopped, 0 zombie > > >>> > > >>> And there appear to be a large number of > processes > > >> owned by kubal: > > >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > >>> 380 > > >>> > > >>> I assume that Mike is using swift to do the > job > > >> submission. Is > > >>> there some throttling of the rate at which > jobs > > >> are submitted to > > >>> the gatekeeper that could be done that would > > >> lighten this load > > >>> some? (Or has that already been done since > > >> earlier today?) The > > >>> current response times are not unacceptable, > but > > >> I'm hoping to > > >>> avoid having the machine grind to a halt as it > did > > >> earlier today. > > >>> > > >>> Thanks, > > >>> joe. > > >>> > > >>> > > >>> > > >> > =================================================== > > >>> joseph a. > > >>> insley > > >> > > >>> insley at mcs.anl.gov > > >>> mathematics & computer science division > > >> (630) 252-5649 > > >>> argonne national laboratory > > >> (630) > > >>> 252-5986 (fax) > > >>> > > >>> > > >> > > >> > =================================================== > > >> joseph a. insley > > >> > > >> insley at mcs.anl.gov > > >> mathematics & computer science division > (630) > > >> 252-5649 > > >> argonne national laboratory > > >> (630) > > >> 252-5986 (fax) > > >> > > >> > > >> > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Be a better friend, newshound, and > > > know-it-all with Yahoo! Mobile. Try it now. 
> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From hategan at mcs.anl.gov Mon Feb 4 00:14:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 00:14:09 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <548830.35963.qm@web52311.mail.re2.yahoo.com> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: <1202105649.15397.46.camel@blabla.mcs.anl.gov> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > Sorry for killing the server. I'm pushing to get > results to guide the selection of compounds for > wet-lab testing. > > I had set the throttle.score.job.factor to 1 in the > swift.properties file. Hmm. Ti, at the time of the massacre, how many did you kill? Mihael > > I certainly appreciate everyone's efforts and > responsiveness. > > Let me know what to try next, before I kill again. > > Cheers, > > Mike > > > > --- Mihael Hategan wrote: > > > So I was trying some stuff on Friday night. I guess > > I've found the > > strategy on when to run the tests: when nobody else > > has jobs there > > (besides Buzz doing gridftp tests, Ioan having some > > Falkon workers > > running, and the occasional Inca tests). > > > > In any event, the machine jumps to about 100% > > utilization at around 130 > > jobs with pre-ws gram. So Mike, please set > > throttle.score.job.factor to > > 1 in swift.properties. > > > > There's still more work I need to do test-wise. > > > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > > > Mike, You're killing tg-grid1 again. Can someone > > work with Mike to get > > > some swift settings that don't kill our server? > > > > > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > > > > > Yes, I'm submitting molecular dynamics > > simulations > > > > using Swift. > > > > > > > > Is there a default wall-time limit for jobs on > > tg-uc? > > > > > > > > > > > > > > > > --- joseph insley wrote: > > > > > > > >> Actually, these numbers are now escalating... > > > >> > > > >> top - 17:18:54 up 2:29, 1 user, load > > average: > > > >> 149.02, 123.63, 91.94 > > > >> Tasks: 469 total, 4 running, 465 sleeping, > > 0 > > > >> stopped, 0 zombie > > > >> > > > >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > >> 479 > > > >> > > > >> insley at tg-viz-login1:~> time globusrun -a -r > > > >> tg-grid.uc.teragrid.org > > > >> GRAM Authentication test successful > > > >> real 0m26.134s > > > >> user 0m0.090s > > > >> sys 0m0.010s > > > >> > > > >> > > > >> On Jan 28, 2008, at 5:15 PM, joseph insley > > wrote: > > > >> > > > >>> Earlier today tg-grid.uc.teragrid.org (the > > UC/ANL > > > >> TG GRAM host) > > > >>> became unresponsive and had to be rebooted. I > > am > > > >> now seeing slow > > > >>> response times from the Gatekeeper there > > again. 
> > > >> Authenticating to > > > >>> the gatekeeper should only take a second or > > two, > > > >> but it is > > > >>> periodically taking up to 16 seconds: > > > >>> > > > >>> insley at tg-viz-login1:~> time globusrun -a -r > > > >> tg-grid.uc.teragrid.org > > > >>> GRAM Authentication test successful > > > >>> real 0m16.096s > > > >>> user 0m0.060s > > > >>> sys 0m0.020s > > > >>> > > > >>> looking at the load on tg-grid, it is rather > > high: > > > >>> > > > >>> top - 16:55:26 up 2:06, 1 user, load > > average: > > > >> 89.59, 78.69, 62.92 > > > >>> Tasks: 398 total, 20 running, 378 sleeping, > > 0 > > > >> stopped, 0 zombie > > > >>> > > > >>> And there appear to be a large number of > > processes > > > >> owned by kubal: > > > >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > >>> 380 > > > >>> > > > >>> I assume that Mike is using swift to do the > > job > > > >> submission. Is > > > >>> there some throttling of the rate at which > > jobs > > > >> are submitted to > > > >>> the gatekeeper that could be done that would > > > >> lighten this load > > > >>> some? (Or has that already been done since > > > >> earlier today?) The > > > >>> current response times are not unacceptable, > > but > > > >> I'm hoping to > > > >>> avoid having the machine grind to a halt as it > > did > > > >> earlier today. > > > >>> > > > >>> Thanks, > > > >>> joe. > > > >>> > > > >>> > > > >>> > > > >> > > =================================================== > > > >>> joseph a. > > > >>> insley > > > >> > > > >>> insley at mcs.anl.gov > > > >>> mathematics & computer science division > > > >> (630) 252-5649 > > > >>> argonne national laboratory > > > >> (630) > > > >>> 252-5986 (fax) > > > >>> > > > >>> > > > >> > > > >> > > =================================================== > > > >> joseph a. insley > > > >> > > > >> insley at mcs.anl.gov > > > >> mathematics & computer science division > > (630) > > > >> 252-5649 > > > >> argonne national laboratory > > > >> (630) > > > >> 252-5986 (fax) > > > >> > > > >> > > > >> > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Be a better friend, newshound, and > > > > know-it-all with Yahoo! Mobile. Try it now. > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > From leggett at mcs.anl.gov Mon Feb 4 07:16:38 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 4 Feb 2008 07:16:38 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202105649.15397.46.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> Message-ID: <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> Around 80. On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > > On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >> Sorry for killing the server. 
I'm pushing to get >> results to guide the selection of compounds for >> wet-lab testing. >> >> I had set the throttle.score.job.factor to 1 in the >> swift.properties file. > > Hmm. Ti, at the time of the massacre, how many did you kill? > > Mihael > >> >> I certainly appreciate everyone's efforts and >> responsiveness. >> >> Let me know what to try next, before I kill again. >> >> Cheers, >> >> Mike >> >> >> >> --- Mihael Hategan wrote: >> >>> So I was trying some stuff on Friday night. I guess >>> I've found the >>> strategy on when to run the tests: when nobody else >>> has jobs there >>> (besides Buzz doing gridftp tests, Ioan having some >>> Falkon workers >>> running, and the occasional Inca tests). >>> >>> In any event, the machine jumps to about 100% >>> utilization at around 130 >>> jobs with pre-ws gram. So Mike, please set >>> throttle.score.job.factor to >>> 1 in swift.properties. >>> >>> There's still more work I need to do test-wise. >>> >>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>> Mike, You're killing tg-grid1 again. Can someone >>> work with Mike to get >>>> some swift settings that don't kill our server? >>>> >>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>> >>>>> Yes, I'm submitting molecular dynamics >>> simulations >>>>> using Swift. >>>>> >>>>> Is there a default wall-time limit for jobs on >>> tg-uc? >>>>> >>>>> >>>>> >>>>> --- joseph insley wrote: >>>>> >>>>>> Actually, these numbers are now escalating... >>>>>> >>>>>> top - 17:18:54 up 2:29, 1 user, load >>> average: >>>>>> 149.02, 123.63, 91.94 >>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>> 0 >>>>>> stopped, 0 zombie >>>>>> >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>> 479 >>>>>> >>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>> tg-grid.uc.teragrid.org >>>>>> GRAM Authentication test successful >>>>>> real 0m26.134s >>>>>> user 0m0.090s >>>>>> sys 0m0.010s >>>>>> >>>>>> >>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>> wrote: >>>>>> >>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>> UC/ANL >>>>>> TG GRAM host) >>>>>>> became unresponsive and had to be rebooted. I >>> am >>>>>> now seeing slow >>>>>>> response times from the Gatekeeper there >>> again. >>>>>> Authenticating to >>>>>>> the gatekeeper should only take a second or >>> two, >>>>>> but it is >>>>>>> periodically taking up to 16 seconds: >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>> tg-grid.uc.teragrid.org >>>>>>> GRAM Authentication test successful >>>>>>> real 0m16.096s >>>>>>> user 0m0.060s >>>>>>> sys 0m0.020s >>>>>>> >>>>>>> looking at the load on tg-grid, it is rather >>> high: >>>>>>> >>>>>>> top - 16:55:26 up 2:06, 1 user, load >>> average: >>>>>> 89.59, 78.69, 62.92 >>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>> 0 >>>>>> stopped, 0 zombie >>>>>>> >>>>>>> And there appear to be a large number of >>> processes >>>>>> owned by kubal: >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 380 >>>>>>> >>>>>>> I assume that Mike is using swift to do the >>> job >>>>>> submission. Is >>>>>>> there some throttling of the rate at which >>> jobs >>>>>> are submitted to >>>>>>> the gatekeeper that could be done that would >>>>>> lighten this load >>>>>>> some? (Or has that already been done since >>>>>> earlier today?) The >>>>>>> current response times are not unacceptable, >>> but >>>>>> I'm hoping to >>>>>>> avoid having the machine grind to a halt as it >>> did >>>>>> earlier today. >>>>>>> >>>>>>> Thanks, >>>>>>> joe. 
>>>>>>> >>>>>>> >>>>>>> >>>>>> >>> =================================================== >>>>>>> joseph a. >>>>>>> insley >>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division >>>>>> (630) 252-5649 >>>>>>> argonne national laboratory >>>>>> (630) >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>> >>>>>> >>> =================================================== >>>>>> joseph a. insley >>>>>> >>>>>> insley at mcs.anl.gov >>>>>> mathematics & computer science division >>> (630) >>>>>> 252-5649 >>>>>> argonne national laboratory >>>>>> (630) >>>>>> 252-5986 (fax) >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>> >> ____________________________________________________________________________________ >>>>> Be a better friend, newshound, and >>>>> know-it-all with Yahoo! Mobile. Try it now. >>> >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> >>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >> >> >> >> >> ____________________________________________________________________________________ >> Never miss a thing. Make Yahoo your home page. >> http://www.yahoo.com/r/hs >> > From wilde at mcs.anl.gov Mon Feb 4 08:13:36 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 04 Feb 2008 08:13:36 -0600 Subject: [Swift-devel] Swift throttling In-Reply-To: <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> Message-ID: <47A71D90.6080907@mcs.anl.gov> Mihael, Ben - bear with me - I'd like to revisit where we are on throttling. The following may already be in place, but I think we need to review and clarify it, maybe re-assess the numbers: Seems like for both pre-WS and WS-GRAM we need to stay within two roughly-known limits: - number of jobs submitted per second - total # of jobs that can be submitted at once It seems that we need to set limits on these two parameters, *around* the slow-start algorithm that tries to sense a sustainable maximum rate of job submission. To what extent is that in the code already, and does it need improvement? I thought that for pre-WS GRAM the parameters are approximately - .5 jobs/sec - < 100 jobs in queue I realize that these can only be limited on a per-workflow basis, but for interactions between two workflows, hopefully the slow-start sensing algorithms will sense that resource is already under strain and stay at a low submission rate. So what I'm suggesting here is: - we agree on some arbitrary conservative numbers for the moment (till we can do more measurement) - we modify the code to enable explicit limits on the algorithm to be set by the user, eg: throttle.host.submitlimit - max # jobs that can be queued to a host throttle.host.submitrate - max #jobs/sec that can be queued to a host (float) Does Ti's report of 80 jobs indicate that maybe even 100 jobs in the queue is too much (for pre-WS)? Does this seem reasonable? If not, what is the mechanism by which we can reliably avoid over-running a site? - Mike On 2/4/08 7:16 AM, Ti Leggett wrote: > Around 80.
> > On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >> >> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>> Sorry for killing the server. I'm pushing to get >>> results to guide the selection of compounds for >>> wet-lab testing. >>> >>> I had set the throttle.score.job.factor to 1 in the >>> swift.properties file. >> >> Hmm. Ti, at the time of the massacre, how many did you kill? >> >> Mihael >> >>> >>> I certainly appreciate everyone's efforts and >>> responsiveness. >>> >>> Let me know what to try next, before I kill again. >>> >>> Cheers, >>> >>> Mike >>> >>> >>> >>> --- Mihael Hategan wrote: >>> >>>> So I was trying some stuff on Friday night. I guess >>>> I've found the >>>> strategy on when to run the tests: when nobody else >>>> has jobs there >>>> (besides Buzz doing gridftp tests, Ioan having some >>>> Falkon workers >>>> running, and the occasional Inca tests). >>>> >>>> In any event, the machine jumps to about 100% >>>> utilization at around 130 >>>> jobs with pre-ws gram. So Mike, please set >>>> throttle.score.job.factor to >>>> 1 in swift.properties. >>>> >>>> There's still more work I need to do test-wise. >>>> >>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>> Mike, You're killing tg-grid1 again. Can someone >>>> work with Mike to get >>>>> some swift settings that don't kill our server? >>>>> >>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>> >>>>>> Yes, I'm submitting molecular dynamics >>>> simulations >>>>>> using Swift. >>>>>> >>>>>> Is there a default wall-time limit for jobs on >>>> tg-uc? >>>>>> >>>>>> >>>>>> >>>>>> --- joseph insley wrote: >>>>>> >>>>>>> Actually, these numbers are now escalating... >>>>>>> >>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>> average: >>>>>>> 149.02, 123.63, 91.94 >>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>> 0 >>>>>>> stopped, 0 zombie >>>>>>> >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 479 >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> tg-grid.uc.teragrid.org >>>>>>> GRAM Authentication test successful >>>>>>> real 0m26.134s >>>>>>> user 0m0.090s >>>>>>> sys 0m0.010s >>>>>>> >>>>>>> >>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>> wrote: >>>>>>> >>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>> UC/ANL >>>>>>> TG GRAM host) >>>>>>>> became unresponsive and had to be rebooted. I >>>> am >>>>>>> now seeing slow >>>>>>>> response times from the Gatekeeper there >>>> again. >>>>>>> Authenticating to >>>>>>>> the gatekeeper should only take a second or >>>> two, >>>>>>> but it is >>>>>>>> periodically taking up to 16 seconds: >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m16.096s >>>>>>>> user 0m0.060s >>>>>>>> sys 0m0.020s >>>>>>>> >>>>>>>> looking at the load on tg-grid, it is rather >>>> high: >>>>>>>> >>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>> average: >>>>>>> 89.59, 78.69, 62.92 >>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>> 0 >>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> And there appear to be a large number of >>>> processes >>>>>>> owned by kubal: >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 380 >>>>>>>> >>>>>>>> I assume that Mike is using swift to do the >>>> job >>>>>>> submission. Is >>>>>>>> there some throttling of the rate at which >>>> jobs >>>>>>> are submitted to >>>>>>>> the gatekeeper that could be done that would >>>>>>> lighten this load >>>>>>>> some? 
(Or has that already been done since >>>>>>> earlier today?) The >>>>>>>> current response times are not unacceptable, >>>> but >>>>>>> I'm hoping to >>>>>>>> avoid having the machine grind to a halt as it >>>> did >>>>>>> earlier today. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> joe. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>> =================================================== >>>>>>>> joseph a. >>>>>>>> insley >>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division >>>>>>> (630) 252-5649 >>>>>>>> argonne national laboratory >>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>> =================================================== >>>>>>> joseph a. insley >>>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division >>>> (630) >>>>>>> 252-5649 >>>>>>> argonne national laboratory >>>>>>> (630) >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>> >>> ____________________________________________________________________________________ >>> >>>>>> Be a better friend, newshound, and >>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>> >>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> >>>> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>>> >>> >>> >>> >>> >>> ____________________________________________________________________________________ >>> >>> Never miss a thing. Make Yahoo your home page. >>> http://www.yahoo.com/r/hs >>> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Mon Feb 4 09:30:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 09:30:54 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> Message-ID: <1202139054.16407.5.camel@blabla.mcs.anl.gov> That's odd. Clearly if that's not acceptable from your perspective, yet I thought 130 are fine, there's a disconnect between what you think is acceptable and what I think is acceptable. What was that prompted you to conclude things are bad? On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > Around 80. > > On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > > > > > On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >> Sorry for killing the server. I'm pushing to get > >> results to guide the selection of compounds for > >> wet-lab testing. > >> > >> I had set the throttle.score.job.factor to 1 in the > >> swift.properties file. > > > > Hmm. Ti, at the time of the massacre, how many did you kill? > > > > Mihael > > > >> > >> I certainly appreciate everyone's efforts and > >> responsiveness. > >> > >> Let me know what to try next, before I kill again. > >> > >> Cheers, > >> > >> Mike > >> > >> > >> > >> --- Mihael Hategan wrote: > >> > >>> So I was trying some stuff on Friday night. 
I guess > >>> I've found the > >>> strategy on when to run the tests: when nobody else > >>> has jobs there > >>> (besides Buzz doing gridftp tests, Ioan having some > >>> Falkon workers > >>> running, and the occasional Inca tests). > >>> > >>> In any event, the machine jumps to about 100% > >>> utilization at around 130 > >>> jobs with pre-ws gram. So Mike, please set > >>> throttle.score.job.factor to > >>> 1 in swift.properties. > >>> > >>> There's still more work I need to do test-wise. > >>> > >>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>> Mike, You're killing tg-grid1 again. Can someone > >>> work with Mike to get > >>>> some swift settings that don't kill our server? > >>>> > >>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>> > >>>>> Yes, I'm submitting molecular dynamics > >>> simulations > >>>>> using Swift. > >>>>> > >>>>> Is there a default wall-time limit for jobs on > >>> tg-uc? > >>>>> > >>>>> > >>>>> > >>>>> --- joseph insley wrote: > >>>>> > >>>>>> Actually, these numbers are now escalating... > >>>>>> > >>>>>> top - 17:18:54 up 2:29, 1 user, load > >>> average: > >>>>>> 149.02, 123.63, 91.94 > >>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>> 0 > >>>>>> stopped, 0 zombie > >>>>>> > >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>> 479 > >>>>>> > >>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>> tg-grid.uc.teragrid.org > >>>>>> GRAM Authentication test successful > >>>>>> real 0m26.134s > >>>>>> user 0m0.090s > >>>>>> sys 0m0.010s > >>>>>> > >>>>>> > >>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>> wrote: > >>>>>> > >>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>> UC/ANL > >>>>>> TG GRAM host) > >>>>>>> became unresponsive and had to be rebooted. I > >>> am > >>>>>> now seeing slow > >>>>>>> response times from the Gatekeeper there > >>> again. > >>>>>> Authenticating to > >>>>>>> the gatekeeper should only take a second or > >>> two, > >>>>>> but it is > >>>>>>> periodically taking up to 16 seconds: > >>>>>>> > >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>> tg-grid.uc.teragrid.org > >>>>>>> GRAM Authentication test successful > >>>>>>> real 0m16.096s > >>>>>>> user 0m0.060s > >>>>>>> sys 0m0.020s > >>>>>>> > >>>>>>> looking at the load on tg-grid, it is rather > >>> high: > >>>>>>> > >>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>> average: > >>>>>> 89.59, 78.69, 62.92 > >>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>> 0 > >>>>>> stopped, 0 zombie > >>>>>>> > >>>>>>> And there appear to be a large number of > >>> processes > >>>>>> owned by kubal: > >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>> 380 > >>>>>>> > >>>>>>> I assume that Mike is using swift to do the > >>> job > >>>>>> submission. Is > >>>>>>> there some throttling of the rate at which > >>> jobs > >>>>>> are submitted to > >>>>>>> the gatekeeper that could be done that would > >>>>>> lighten this load > >>>>>>> some? (Or has that already been done since > >>>>>> earlier today?) The > >>>>>>> current response times are not unacceptable, > >>> but > >>>>>> I'm hoping to > >>>>>>> avoid having the machine grind to a halt as it > >>> did > >>>>>> earlier today. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> joe. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>> =================================================== > >>>>>>> joseph a. 
> >>>>>>> insley > >>>>>> > >>>>>>> insley at mcs.anl.gov > >>>>>>> mathematics & computer science division > >>>>>> (630) 252-5649 > >>>>>>> argonne national laboratory > >>>>>> (630) > >>>>>>> 252-5986 (fax) > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>> =================================================== > >>>>>> joseph a. insley > >>>>>> > >>>>>> insley at mcs.anl.gov > >>>>>> mathematics & computer science division > >>> (630) > >>>>>> 252-5649 > >>>>>> argonne national laboratory > >>>>>> (630) > >>>>>> 252-5986 (fax) > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>> > >> ____________________________________________________________________________________ > >>>>> Be a better friend, newshound, and > >>>>> know-it-all with Yahoo! Mobile. Try it now. > >>> > >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>> > >>>> > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> > >>> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > >> > >> > >> > >> > >> ____________________________________________________________________________________ > >> Never miss a thing. Make Yahoo your home page. > >> http://www.yahoo.com/r/hs > >> > > > From leggett at mcs.anl.gov Mon Feb 4 09:58:40 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 4 Feb 2008 09:58:40 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202139054.16407.5.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> Message-ID: <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> That inca tests were timing out after 5 minutes and the load on the machine was ~27. How are you concluding when things aren't acceptable? On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > That's odd. Clearly if that's not acceptable from your perspective, > yet > I thought 130 are fine, there's a disconnect between what you think is > acceptable and what I think is acceptable. > > What was that prompted you to conclude things are bad? > > On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: >> Around 80. >> >> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: >> >>> >>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>>> Sorry for killing the server. I'm pushing to get >>>> results to guide the selection of compounds for >>>> wet-lab testing. >>>> >>>> I had set the throttle.score.job.factor to 1 in the >>>> swift.properties file. >>> >>> Hmm. Ti, at the time of the massacre, how many did you kill? >>> >>> Mihael >>> >>>> >>>> I certainly appreciate everyone's efforts and >>>> responsiveness. >>>> >>>> Let me know what to try next, before I kill again. >>>> >>>> Cheers, >>>> >>>> Mike >>>> >>>> >>>> >>>> --- Mihael Hategan wrote: >>>> >>>>> So I was trying some stuff on Friday night. I guess >>>>> I've found the >>>>> strategy on when to run the tests: when nobody else >>>>> has jobs there >>>>> (besides Buzz doing gridftp tests, Ioan having some >>>>> Falkon workers >>>>> running, and the occasional Inca tests). 
>>>>> >>>>> In any event, the machine jumps to about 100% >>>>> utilization at around 130 >>>>> jobs with pre-ws gram. So Mike, please set >>>>> throttle.score.job.factor to >>>>> 1 in swift.properties. >>>>> >>>>> There's still more work I need to do test-wise. >>>>> >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>>> Mike, You're killing tg-grid1 again. Can someone >>>>> work with Mike to get >>>>>> some swift settings that don't kill our server? >>>>>> >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>> >>>>>>> Yes, I'm submitting molecular dynamics >>>>> simulations >>>>>>> using Swift. >>>>>>> >>>>>>> Is there a default wall-time limit for jobs on >>>>> tg-uc? >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- joseph insley wrote: >>>>>>> >>>>>>>> Actually, these numbers are now escalating... >>>>>>>> >>>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>>> average: >>>>>>>> 149.02, 123.63, 91.94 >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>>> 0 >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 479 >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m26.134s >>>>>>>> user 0m0.090s >>>>>>>> sys 0m0.010s >>>>>>>> >>>>>>>> >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>>> wrote: >>>>>>>> >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>>> UC/ANL >>>>>>>> TG GRAM host) >>>>>>>>> became unresponsive and had to be rebooted. I >>>>> am >>>>>>>> now seeing slow >>>>>>>>> response times from the Gatekeeper there >>>>> again. >>>>>>>> Authenticating to >>>>>>>>> the gatekeeper should only take a second or >>>>> two, >>>>>>>> but it is >>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>> >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>> GRAM Authentication test successful >>>>>>>>> real 0m16.096s >>>>>>>>> user 0m0.060s >>>>>>>>> sys 0m0.020s >>>>>>>>> >>>>>>>>> looking at the load on tg-grid, it is rather >>>>> high: >>>>>>>>> >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>>> average: >>>>>>>> 89.59, 78.69, 62.92 >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>>> 0 >>>>>>>> stopped, 0 zombie >>>>>>>>> >>>>>>>>> And there appear to be a large number of >>>>> processes >>>>>>>> owned by kubal: >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>> 380 >>>>>>>>> >>>>>>>>> I assume that Mike is using swift to do the >>>>> job >>>>>>>> submission. Is >>>>>>>>> there some throttling of the rate at which >>>>> jobs >>>>>>>> are submitted to >>>>>>>>> the gatekeeper that could be done that would >>>>>>>> lighten this load >>>>>>>>> some? (Or has that already been done since >>>>>>>> earlier today?) The >>>>>>>>> current response times are not unacceptable, >>>>> but >>>>>>>> I'm hoping to >>>>>>>>> avoid having the machine grind to a halt as it >>>>> did >>>>>>>> earlier today. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> joe. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>> =================================================== >>>>>>>>> joseph a. >>>>>>>>> insley >>>>>>>> >>>>>>>>> insley at mcs.anl.gov >>>>>>>>> mathematics & computer science division >>>>>>>> (630) 252-5649 >>>>>>>>> argonne national laboratory >>>>>>>> (630) >>>>>>>>> 252-5986 (fax) >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>> =================================================== >>>>>>>> joseph a. 
insley >>>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division >>>>> (630) >>>>>>>> 252-5649 >>>>>>>> argonne national laboratory >>>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> >>>> ____________________________________________________________________________________ >>>>>>> Be a better friend, newshound, and >>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>> >>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> >>>>> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>>> >>>> >>>> >>>> >>>> >>>> ____________________________________________________________________________________ >>>> Never miss a thing. Make Yahoo your home page. >>>> http://www.yahoo.com/r/hs >>>> >>> >> > From benc at hawaga.org.uk Mon Feb 4 10:03:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 16:03:17 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <548830.35963.qm@web52311.mail.re2.yahoo.com> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: On Sun, 3 Feb 2008, Mike Kubal wrote: > Let me know what to try next, before I kill again. You can try the PBS local provider perhaps. That needs you to run Swift on eg tg-grid1 rather than on some arbitrary grid machine. Then use a sites.xml entry something like: /home/benc/swift-run-dir/ Make sure the workdirectory is somewhere shared (the directory you're using at the moment is probably OK). I've run this on teraport with a hundred or so jobs without any apparent problem, so hopefully this will scale better. -- From benc at hawaga.org.uk Mon Feb 4 10:09:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 16:09:40 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <548830.35963.qm@web52311.mail.re2.yahoo.com> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: On Sun, 3 Feb 2008, Mike Kubal wrote: > Let me know what to try next, before I kill again. Also, there is a clustering mechanism - this is where swift takes a bunch of jobs and aggregates them into a single submission to GRAM or PBS. If you know a maximum execution time for your jobs, you can do that. There's a users guide section with some details: http://www.ci.uchicago.edu/swift/guides/userguide.php#clustering Basically, you need to set a maximum time for your executables in your tc.data file using the maxwalltime profile and then specify a value for the clustering.min.time property. Clusters of about 10 jobs are perhaps the size to aim for. You can use this with any submission mechanism - GRAM2, GRAM4 or PBS. -- From benc at hawaga.org.uk Mon Feb 4 10:13:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 16:13:07 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: out of the clustering and pbs suggestions, I'd try PBS first... 
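[Archive note: the sites.xml entry in the first message above appears to have lost its XML markup, leaving only the work directory path. A minimal sketch of what such a PBS entry might have looked like, by analogy with Swift sites.xml files of this period (the pool handle and the gridftp line are illustrative guesses, not Ben's verbatim example):

    <pool handle="tguc-pbs">
      <gridftp url="local://localhost"/>
      <execution provider="pbs"/>
      <workdirectory>/home/benc/swift-run-dir/</workdirectory>
    </pool>

The clustering suggestion amounts to a maxwalltime profile on each tc.data entry plus the clustering properties in swift.properties. The site handle, application name, path, and numeric values below are placeholders, and property names other than clustering.min.time should be checked against the user guide section linked above, which remains the reference for exact names and units:

    tguc-pbs  mdsim  /path/to/mdsim  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=40

    clustering.enabled=true
    clustering.min.time=3600 ]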
-- From benc at hawaga.org.uk Mon Feb 4 10:17:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 16:17:52 +0000 (GMT) Subject: [Swift-devel] Swift throttling In-Reply-To: <47A71D90.6080907@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <47A71D90.6080907@mcs.anl.gov> Message-ID: On Mon, 4 Feb 2008, Michael Wilde wrote: > hopefully the slow-start sensing > algorithms will sense that resource is already under strain and stay at a low > submission rate. I don't think it does that at all. > - we modify the code to enable explicit limits on the algorithm to be set by > the user, eg: > throttle.host.submitlimit - max # jobs that can be queued to a host > throttle.host.submitrate - max #jobs/sec that can be queued to a host > (float) parameters that control thsoe exist already, I think, for the whole workflow. In the single site case, site specific ones aren't needed because of that. If they were being implemented, it would probably be better to make them settable in the sites catalog so that they can be defined differently for each site. throttle.scote.job.factor limits the number of concurrent jobs to 2 + 100*throttle.score.job.factor (so to achieve a limit of 52, set throttle.score.job.factor to 0.5) There's a per-site profile setting: maxSubmitRate - limits the maximum rate of job submission, in jobs per second. -- From hategan at mcs.anl.gov Mon Feb 4 10:18:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 10:18:36 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> Message-ID: <1202141916.17237.4.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > That inca tests were timing out after 5 minutes and the load on the > machine was ~27. How are you concluding when things aren't acceptable? It's got 2 cpus. So to me an average load of under 100 and the SSH session being responsive looks fine. The fact that inca tests are timing out may be because inca has too low of a tolerance for things. > > On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > > > That's odd. Clearly if that's not acceptable from your perspective, > > yet > > I thought 130 are fine, there's a disconnect between what you think is > > acceptable and what I think is acceptable. > > > > What was that prompted you to conclude things are bad? > > > > On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >> Around 80. > >> > >> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >> > >>> > >>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>> Sorry for killing the server. I'm pushing to get > >>>> results to guide the selection of compounds for > >>>> wet-lab testing. > >>>> > >>>> I had set the throttle.score.job.factor to 1 in the > >>>> swift.properties file. > >>> > >>> Hmm. Ti, at the time of the massacre, how many did you kill? > >>> > >>> Mihael > >>> > >>>> > >>>> I certainly appreciate everyone's efforts and > >>>> responsiveness. > >>>> > >>>> Let me know what to try next, before I kill again. 
> >>>> > >>>> Cheers, > >>>> > >>>> Mike > >>>> > >>>> > >>>> > >>>> --- Mihael Hategan wrote: > >>>> > >>>>> So I was trying some stuff on Friday night. I guess > >>>>> I've found the > >>>>> strategy on when to run the tests: when nobody else > >>>>> has jobs there > >>>>> (besides Buzz doing gridftp tests, Ioan having some > >>>>> Falkon workers > >>>>> running, and the occasional Inca tests). > >>>>> > >>>>> In any event, the machine jumps to about 100% > >>>>> utilization at around 130 > >>>>> jobs with pre-ws gram. So Mike, please set > >>>>> throttle.score.job.factor to > >>>>> 1 in swift.properties. > >>>>> > >>>>> There's still more work I need to do test-wise. > >>>>> > >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>>> Mike, You're killing tg-grid1 again. Can someone > >>>>> work with Mike to get > >>>>>> some swift settings that don't kill our server? > >>>>>> > >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>> > >>>>>>> Yes, I'm submitting molecular dynamics > >>>>> simulations > >>>>>>> using Swift. > >>>>>>> > >>>>>>> Is there a default wall-time limit for jobs on > >>>>> tg-uc? > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> --- joseph insley wrote: > >>>>>>> > >>>>>>>> Actually, these numbers are now escalating... > >>>>>>>> > >>>>>>>> top - 17:18:54 up 2:29, 1 user, load > >>>>> average: > >>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>>>> 0 > >>>>>>>> stopped, 0 zombie > >>>>>>>> > >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>> 479 > >>>>>>>> > >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>> GRAM Authentication test successful > >>>>>>>> real 0m26.134s > >>>>>>>> user 0m0.090s > >>>>>>>> sys 0m0.010s > >>>>>>>> > >>>>>>>> > >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>>>> wrote: > >>>>>>>> > >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>>>> UC/ANL > >>>>>>>> TG GRAM host) > >>>>>>>>> became unresponsive and had to be rebooted. I > >>>>> am > >>>>>>>> now seeing slow > >>>>>>>>> response times from the Gatekeeper there > >>>>> again. > >>>>>>>> Authenticating to > >>>>>>>>> the gatekeeper should only take a second or > >>>>> two, > >>>>>>>> but it is > >>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>> > >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>> GRAM Authentication test successful > >>>>>>>>> real 0m16.096s > >>>>>>>>> user 0m0.060s > >>>>>>>>> sys 0m0.020s > >>>>>>>>> > >>>>>>>>> looking at the load on tg-grid, it is rather > >>>>> high: > >>>>>>>>> > >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>>>> average: > >>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>>>> 0 > >>>>>>>> stopped, 0 zombie > >>>>>>>>> > >>>>>>>>> And there appear to be a large number of > >>>>> processes > >>>>>>>> owned by kubal: > >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>> 380 > >>>>>>>>> > >>>>>>>>> I assume that Mike is using swift to do the > >>>>> job > >>>>>>>> submission. Is > >>>>>>>>> there some throttling of the rate at which > >>>>> jobs > >>>>>>>> are submitted to > >>>>>>>>> the gatekeeper that could be done that would > >>>>>>>> lighten this load > >>>>>>>>> some? (Or has that already been done since > >>>>>>>> earlier today?) 
The > >>>>>>>>> current response times are not unacceptable, > >>>>> but > >>>>>>>> I'm hoping to > >>>>>>>>> avoid having the machine grind to a halt as it > >>>>> did > >>>>>>>> earlier today. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> joe. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>> =================================================== > >>>>>>>>> joseph a. > >>>>>>>>> insley > >>>>>>>> > >>>>>>>>> insley at mcs.anl.gov > >>>>>>>>> mathematics & computer science division > >>>>>>>> (630) 252-5649 > >>>>>>>>> argonne national laboratory > >>>>>>>> (630) > >>>>>>>>> 252-5986 (fax) > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>> =================================================== > >>>>>>>> joseph a. insley > >>>>>>>> > >>>>>>>> insley at mcs.anl.gov > >>>>>>>> mathematics & computer science division > >>>>> (630) > >>>>>>>> 252-5649 > >>>>>>>> argonne national laboratory > >>>>>>>> (630) > >>>>>>>> 252-5986 (fax) > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>>> ____________________________________________________________________________________ > >>>>>>> Be a better friend, newshound, and > >>>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>>>> > >>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-devel mailing list > >>>>>> Swift-devel at ci.uchicago.edu > >>>>>> > >>>>> > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>> > >>>>> > >>>> > >>>> > >>>> > >>>> > >>>> ____________________________________________________________________________________ > >>>> Never miss a thing. Make Yahoo your home page. > >>>> http://www.yahoo.com/r/hs > >>>> > >>> > >> > > > From hategan at mcs.anl.gov Mon Feb 4 10:23:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 10:23:36 -0600 Subject: [Swift-devel] Swift throttling In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <47A71D90.6080907@mcs.anl.gov> Message-ID: <1202142217.17237.9.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 16:17 +0000, Ben Clifford wrote: > > On Mon, 4 Feb 2008, Michael Wilde wrote: > > > hopefully the slow-start sensing > > algorithms will sense that resource is already under strain and stay at a low > > submission rate. > > I don't think it does that at all. Actually, yes. There's the submit throttle which limits the submission parallelism. When remote site is under load and accepts jobs slowly, the client will invariably submit slower. And now that I think of it, the maxSubmitRate looks a lot like it could be integrated here. > > > - we modify the code to enable explicit limits on the algorithm to be set by > > the user, eg: > > throttle.host.submitlimit - max # jobs that can be queued to a host > > throttle.host.submitrate - max #jobs/sec that can be queued to a host > > (float) > > parameters that control thsoe exist already, I think, for the whole > workflow. In the single site case, site specific ones aren't needed > because of that. 
If they were being implemented, it would probably be > better to make them settable in the sites catalog so that they can be > defined differently for each site. > > throttle.scote.job.factor limits the number of concurrent jobs to 2 + > 100*throttle.score.job.factor (so to achieve a limit of 52, set > throttle.score.job.factor to 0.5) Unfortunately that's an int. So it won't work. I'll make it a float. > > There's a per-site profile setting: > maxSubmitRate - limits the maximum rate of job submission, in jobs per > second. > From leggett at mcs.anl.gov Mon Feb 4 10:28:38 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 4 Feb 2008 10:28:38 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202141916.17237.4.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> Message-ID: <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> Then I'd say we have very different levels of acceptable. A simple job submission test should never take longer than 5 minutes to complete and a load of 27 is not acceptable when the responsiveness of the machine is impacted. And since we're having this conversation, there is a perceived problem on our end so an adjustment to our definition of acceptable is needed. On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > > On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: >> That inca tests were timing out after 5 minutes and the load on the >> machine was ~27. How are you concluding when things aren't >> acceptable? > > It's got 2 cpus. So to me an average load of under 100 and the SSH > session being responsive looks fine. > > The fact that inca tests are timing out may be because inca has too > low > of a tolerance for things. > >> >> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: >> >>> That's odd. Clearly if that's not acceptable from your perspective, >>> yet >>> I thought 130 are fine, there's a disconnect between what you >>> think is >>> acceptable and what I think is acceptable. >>> >>> What was that prompted you to conclude things are bad? >>> >>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: >>>> Around 80. >>>> >>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: >>>> >>>>> >>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>>>>> Sorry for killing the server. I'm pushing to get >>>>>> results to guide the selection of compounds for >>>>>> wet-lab testing. >>>>>> >>>>>> I had set the throttle.score.job.factor to 1 in the >>>>>> swift.properties file. >>>>> >>>>> Hmm. Ti, at the time of the massacre, how many did you kill? >>>>> >>>>> Mihael >>>>> >>>>>> >>>>>> I certainly appreciate everyone's efforts and >>>>>> responsiveness. >>>>>> >>>>>> Let me know what to try next, before I kill again. >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Mike >>>>>> >>>>>> >>>>>> >>>>>> --- Mihael Hategan wrote: >>>>>> >>>>>>> So I was trying some stuff on Friday night. I guess >>>>>>> I've found the >>>>>>> strategy on when to run the tests: when nobody else >>>>>>> has jobs there >>>>>>> (besides Buzz doing gridftp tests, Ioan having some >>>>>>> Falkon workers >>>>>>> running, and the occasional Inca tests). >>>>>>> >>>>>>> In any event, the machine jumps to about 100% >>>>>>> utilization at around 130 >>>>>>> jobs with pre-ws gram. 
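As a concrete illustration of the throttle settings discussed in this thread, a minimal swift.properties sketch (values are examples only; as noted above, throttle.score.job.factor currently only accepts integers, so the fractional variant assumes the float fix Mihael mentions has landed):

# limit concurrent jobs via the 2 + 100*factor formula quoted above
# factor 1 caps at roughly 2 + 100*1 = 102 concurrent jobs
throttle.score.job.factor=1
# once fractional values are accepted, 0.5 would cap at 2 + 100*0.5 = 52
#throttle.score.job.factor=0.5

The per-site rate limit would go into the sites catalog as the maxSubmitRate profile; the karajan namespace and the 0.2 jobs/second value below are illustrative guesses rather than anything stated in this thread:

<profile namespace="karajan" key="maxSubmitRate">0.2</profile>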
So Mike, please set >>>>>>> throttle.score.job.factor to >>>>>>> 1 in swift.properties. >>>>>>> >>>>>>> There's still more work I need to do test-wise. >>>>>>> >>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>>>>> Mike, You're killing tg-grid1 again. Can someone >>>>>>> work with Mike to get >>>>>>>> some swift settings that don't kill our server? >>>>>>>> >>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>>>> >>>>>>>>> Yes, I'm submitting molecular dynamics >>>>>>> simulations >>>>>>>>> using Swift. >>>>>>>>> >>>>>>>>> Is there a default wall-time limit for jobs on >>>>>>> tg-uc? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> --- joseph insley wrote: >>>>>>>>> >>>>>>>>>> Actually, these numbers are now escalating... >>>>>>>>>> >>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>>>>> average: >>>>>>>>>> 149.02, 123.63, 91.94 >>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>>>>> 0 >>>>>>>>>> stopped, 0 zombie >>>>>>>>>> >>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>> 479 >>>>>>>>>> >>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>> GRAM Authentication test successful >>>>>>>>>> real 0m26.134s >>>>>>>>>> user 0m0.090s >>>>>>>>>> sys 0m0.010s >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>>>>> UC/ANL >>>>>>>>>> TG GRAM host) >>>>>>>>>>> became unresponsive and had to be rebooted. I >>>>>>> am >>>>>>>>>> now seeing slow >>>>>>>>>>> response times from the Gatekeeper there >>>>>>> again. >>>>>>>>>> Authenticating to >>>>>>>>>>> the gatekeeper should only take a second or >>>>>>> two, >>>>>>>>>> but it is >>>>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>>>> >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>> real 0m16.096s >>>>>>>>>>> user 0m0.060s >>>>>>>>>>> sys 0m0.020s >>>>>>>>>>> >>>>>>>>>>> looking at the load on tg-grid, it is rather >>>>>>> high: >>>>>>>>>>> >>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>>>>> average: >>>>>>>>>> 89.59, 78.69, 62.92 >>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>>>>> 0 >>>>>>>>>> stopped, 0 zombie >>>>>>>>>>> >>>>>>>>>>> And there appear to be a large number of >>>>>>> processes >>>>>>>>>> owned by kubal: >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>> 380 >>>>>>>>>>> >>>>>>>>>>> I assume that Mike is using swift to do the >>>>>>> job >>>>>>>>>> submission. Is >>>>>>>>>>> there some throttling of the rate at which >>>>>>> jobs >>>>>>>>>> are submitted to >>>>>>>>>>> the gatekeeper that could be done that would >>>>>>>>>> lighten this load >>>>>>>>>>> some? (Or has that already been done since >>>>>>>>>> earlier today?) The >>>>>>>>>>> current response times are not unacceptable, >>>>>>> but >>>>>>>>>> I'm hoping to >>>>>>>>>>> avoid having the machine grind to a halt as it >>>>>>> did >>>>>>>>>> earlier today. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> joe. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>> =================================================== >>>>>>>>>>> joseph a. 
>>>>>>>>>>> insley >>>>>>>>>> >>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>> mathematics & computer science division >>>>>>>>>> (630) 252-5649 >>>>>>>>>>> argonne national laboratory >>>>>>>>>> (630) >>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>> =================================================== >>>>>>>>>> joseph a. insley >>>>>>>>>> >>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>> mathematics & computer science division >>>>>>> (630) >>>>>>>>>> 252-5649 >>>>>>>>>> argonne national laboratory >>>>>>>>>> (630) >>>>>>>>>> 252-5986 (fax) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> >>>>>> ____________________________________________________________________________________ >>>>>>>>> Be a better friend, newshound, and >>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>>> >>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-devel mailing list >>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>> >>>>>>> >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ____________________________________________________________________________________ >>>>>> Never miss a thing. Make Yahoo your home page. >>>>>> http://www.yahoo.com/r/hs >>>>>> >>>>> >>>> >>> >> > From foster at mcs.anl.gov Mon Feb 4 10:31:59 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 04 Feb 2008 10:31:59 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> Message-ID: <47A73DFF.3010402@mcs.anl.gov> It would be really wonderful if someone can try GRAM4, which we believe addresses this problem. Ian. Ti Leggett wrote: > Then I'd say we have very different levels of acceptable. A simple job > submission test should never take longer than 5 minutes to complete > and a load of 27 is not acceptable when the responsiveness of the > machine is impacted. And since we're having this conversation, there > is a perceived problem on our end so an adjustment to our definition > of acceptable is needed. > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > >> >> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: >>> That inca tests were timing out after 5 minutes and the load on the >>> machine was ~27. How are you concluding when things aren't acceptable? >> >> It's got 2 cpus. So to me an average load of under 100 and the SSH >> session being responsive looks fine. >> >> The fact that inca tests are timing out may be because inca has too low >> of a tolerance for things. >> >>> >>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: >>> >>>> That's odd. Clearly if that's not acceptable from your perspective, >>>> yet >>>> I thought 130 are fine, there's a disconnect between what you think is >>>> acceptable and what I think is acceptable. 
>>>> >>>> What was that prompted you to conclude things are bad? >>>> >>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: >>>>> Around 80. >>>>> >>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: >>>>> >>>>>> >>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>>>>>> Sorry for killing the server. I'm pushing to get >>>>>>> results to guide the selection of compounds for >>>>>>> wet-lab testing. >>>>>>> >>>>>>> I had set the throttle.score.job.factor to 1 in the >>>>>>> swift.properties file. >>>>>> >>>>>> Hmm. Ti, at the time of the massacre, how many did you kill? >>>>>> >>>>>> Mihael >>>>>> >>>>>>> >>>>>>> I certainly appreciate everyone's efforts and >>>>>>> responsiveness. >>>>>>> >>>>>>> Let me know what to try next, before I kill again. >>>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> Mike >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- Mihael Hategan wrote: >>>>>>> >>>>>>>> So I was trying some stuff on Friday night. I guess >>>>>>>> I've found the >>>>>>>> strategy on when to run the tests: when nobody else >>>>>>>> has jobs there >>>>>>>> (besides Buzz doing gridftp tests, Ioan having some >>>>>>>> Falkon workers >>>>>>>> running, and the occasional Inca tests). >>>>>>>> >>>>>>>> In any event, the machine jumps to about 100% >>>>>>>> utilization at around 130 >>>>>>>> jobs with pre-ws gram. So Mike, please set >>>>>>>> throttle.score.job.factor to >>>>>>>> 1 in swift.properties. >>>>>>>> >>>>>>>> There's still more work I need to do test-wise. >>>>>>>> >>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>>>>>> Mike, You're killing tg-grid1 again. Can someone >>>>>>>> work with Mike to get >>>>>>>>> some swift settings that don't kill our server? >>>>>>>>> >>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>>>>> >>>>>>>>>> Yes, I'm submitting molecular dynamics >>>>>>>> simulations >>>>>>>>>> using Swift. >>>>>>>>>> >>>>>>>>>> Is there a default wall-time limit for jobs on >>>>>>>> tg-uc? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> --- joseph insley wrote: >>>>>>>>>> >>>>>>>>>>> Actually, these numbers are now escalating... >>>>>>>>>>> >>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>>>>>> average: >>>>>>>>>>> 149.02, 123.63, 91.94 >>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>>>>>> 0 >>>>>>>>>>> stopped, 0 zombie >>>>>>>>>>> >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>> 479 >>>>>>>>>>> >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>> real 0m26.134s >>>>>>>>>>> user 0m0.090s >>>>>>>>>>> sys 0m0.010s >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>>>>>> UC/ANL >>>>>>>>>>> TG GRAM host) >>>>>>>>>>>> became unresponsive and had to be rebooted. I >>>>>>>> am >>>>>>>>>>> now seeing slow >>>>>>>>>>>> response times from the Gatekeeper there >>>>>>>> again. 
>>>>>>>>>>> Authenticating to >>>>>>>>>>>> the gatekeeper should only take a second or >>>>>>>> two, >>>>>>>>>>> but it is >>>>>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>>>>> >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>>> real 0m16.096s >>>>>>>>>>>> user 0m0.060s >>>>>>>>>>>> sys 0m0.020s >>>>>>>>>>>> >>>>>>>>>>>> looking at the load on tg-grid, it is rather >>>>>>>> high: >>>>>>>>>>>> >>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>>>>>> average: >>>>>>>>>>> 89.59, 78.69, 62.92 >>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>>>>>> 0 >>>>>>>>>>> stopped, 0 zombie >>>>>>>>>>>> >>>>>>>>>>>> And there appear to be a large number of >>>>>>>> processes >>>>>>>>>>> owned by kubal: >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>>> 380 >>>>>>>>>>>> >>>>>>>>>>>> I assume that Mike is using swift to do the >>>>>>>> job >>>>>>>>>>> submission. Is >>>>>>>>>>>> there some throttling of the rate at which >>>>>>>> jobs >>>>>>>>>>> are submitted to >>>>>>>>>>>> the gatekeeper that could be done that would >>>>>>>>>>> lighten this load >>>>>>>>>>>> some? (Or has that already been done since >>>>>>>>>>> earlier today?) The >>>>>>>>>>>> current response times are not unacceptable, >>>>>>>> but >>>>>>>>>>> I'm hoping to >>>>>>>>>>>> avoid having the machine grind to a halt as it >>>>>>>> did >>>>>>>>>>> earlier today. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> joe. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>> =================================================== >>>>>>>>>>>> joseph a. >>>>>>>>>>>> insley >>>>>>>>>>> >>>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>>> mathematics & computer science division >>>>>>>>>>> (630) 252-5649 >>>>>>>>>>>> argonne national laboratory >>>>>>>>>>> (630) >>>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>> =================================================== >>>>>>>>>>> joseph a. insley >>>>>>>>>>> >>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>> mathematics & computer science division >>>>>>>> (630) >>>>>>>>>>> 252-5649 >>>>>>>>>>> argonne national laboratory >>>>>>>>>>> (630) >>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>> ____________________________________________________________________________________ >>>>>>> >>>>>>>>>> Be a better friend, newshound, and >>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>>>> >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Swift-devel mailing list >>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>> >>>>>>>> >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-devel mailing list >>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>> >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> ____________________________________________________________________________________ >>>>>>> >>>>>>> Never miss a thing. Make Yahoo your home page. 
>>>>>>> http://www.yahoo.com/r/hs >>>>>>> >>>>>> >>>>> >>>> >>> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Feb 4 10:47:33 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 10:47:33 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> Message-ID: <1202143654.17665.12.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: > Then I'd say we have very different levels of acceptable. Yes, that's why we're having this discussion. > A simple job > submission test should never take longer than 5 minutes to complete > and a load of 27 is not acceptable when the responsiveness of the > machine is impacted. And since we're having this conversation, there > is a perceived problem on our end so an adjustment to our definition > of acceptable is needed. And we need to adjust our definition of not-acceptable. So we need to meet in the middle. So, 25 (sustained) reasonably acceptable average load? That amounts to about 13 hungry processes per cpu. Even with a 100Hz time slice, each process would get 8 slices per second on average. > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > > > > > On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > >> That inca tests were timing out after 5 minutes and the load on the > >> machine was ~27. How are you concluding when things aren't > >> acceptable? > > > > It's got 2 cpus. So to me an average load of under 100 and the SSH > > session being responsive looks fine. > > > > The fact that inca tests are timing out may be because inca has too > > low > > of a tolerance for things. > > > >> > >> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > >> > >>> That's odd. Clearly if that's not acceptable from your perspective, > >>> yet > >>> I thought 130 are fine, there's a disconnect between what you > >>> think is > >>> acceptable and what I think is acceptable. > >>> > >>> What was that prompted you to conclude things are bad? > >>> > >>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >>>> Around 80. > >>>> > >>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >>>> > >>>>> > >>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>>>> Sorry for killing the server. I'm pushing to get > >>>>>> results to guide the selection of compounds for > >>>>>> wet-lab testing. > >>>>>> > >>>>>> I had set the throttle.score.job.factor to 1 in the > >>>>>> swift.properties file. > >>>>> > >>>>> Hmm. Ti, at the time of the massacre, how many did you kill? > >>>>> > >>>>> Mihael > >>>>> > >>>>>> > >>>>>> I certainly appreciate everyone's efforts and > >>>>>> responsiveness. > >>>>>> > >>>>>> Let me know what to try next, before I kill again. > >>>>>> > >>>>>> Cheers, > >>>>>> > >>>>>> Mike > >>>>>> > >>>>>> > >>>>>> > >>>>>> --- Mihael Hategan wrote: > >>>>>> > >>>>>>> So I was trying some stuff on Friday night. 
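As a quick check of the arithmetic in Mihael's note above: a sustained load of 25 spread across the machine's 2 CPUs is about 25/2, roughly 13 runnable processes per CPU, and with a 100 Hz scheduler tick each of those gets roughly 100/13, about 8 timeslices per second, which is where those figures come from.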
I guess > >>>>>>> I've found the > >>>>>>> strategy on when to run the tests: when nobody else > >>>>>>> has jobs there > >>>>>>> (besides Buzz doing gridftp tests, Ioan having some > >>>>>>> Falkon workers > >>>>>>> running, and the occasional Inca tests). > >>>>>>> > >>>>>>> In any event, the machine jumps to about 100% > >>>>>>> utilization at around 130 > >>>>>>> jobs with pre-ws gram. So Mike, please set > >>>>>>> throttle.score.job.factor to > >>>>>>> 1 in swift.properties. > >>>>>>> > >>>>>>> There's still more work I need to do test-wise. > >>>>>>> > >>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>>>>> Mike, You're killing tg-grid1 again. Can someone > >>>>>>> work with Mike to get > >>>>>>>> some swift settings that don't kill our server? > >>>>>>>> > >>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>>>> > >>>>>>>>> Yes, I'm submitting molecular dynamics > >>>>>>> simulations > >>>>>>>>> using Swift. > >>>>>>>>> > >>>>>>>>> Is there a default wall-time limit for jobs on > >>>>>>> tg-uc? > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> --- joseph insley wrote: > >>>>>>>>> > >>>>>>>>>> Actually, these numbers are now escalating... > >>>>>>>>>> > >>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load > >>>>>>> average: > >>>>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>>>>>> 0 > >>>>>>>>>> stopped, 0 zombie > >>>>>>>>>> > >>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>> 479 > >>>>>>>>>> > >>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>> real 0m26.134s > >>>>>>>>>> user 0m0.090s > >>>>>>>>>> sys 0m0.010s > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>>>>>> UC/ANL > >>>>>>>>>> TG GRAM host) > >>>>>>>>>>> became unresponsive and had to be rebooted. I > >>>>>>> am > >>>>>>>>>> now seeing slow > >>>>>>>>>>> response times from the Gatekeeper there > >>>>>>> again. > >>>>>>>>>> Authenticating to > >>>>>>>>>>> the gatekeeper should only take a second or > >>>>>>> two, > >>>>>>>>>> but it is > >>>>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>>>> > >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>> real 0m16.096s > >>>>>>>>>>> user 0m0.060s > >>>>>>>>>>> sys 0m0.020s > >>>>>>>>>>> > >>>>>>>>>>> looking at the load on tg-grid, it is rather > >>>>>>> high: > >>>>>>>>>>> > >>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>>>>>> average: > >>>>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>>>>>> 0 > >>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>> > >>>>>>>>>>> And there appear to be a large number of > >>>>>>> processes > >>>>>>>>>> owned by kubal: > >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>> 380 > >>>>>>>>>>> > >>>>>>>>>>> I assume that Mike is using swift to do the > >>>>>>> job > >>>>>>>>>> submission. Is > >>>>>>>>>>> there some throttling of the rate at which > >>>>>>> jobs > >>>>>>>>>> are submitted to > >>>>>>>>>>> the gatekeeper that could be done that would > >>>>>>>>>> lighten this load > >>>>>>>>>>> some? (Or has that already been done since > >>>>>>>>>> earlier today?) 
The > >>>>>>>>>>> current response times are not unacceptable, > >>>>>>> but > >>>>>>>>>> I'm hoping to > >>>>>>>>>>> avoid having the machine grind to a halt as it > >>>>>>> did > >>>>>>>>>> earlier today. > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> joe. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> =================================================== > >>>>>>>>>>> joseph a. > >>>>>>>>>>> insley > >>>>>>>>>> > >>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>> mathematics & computer science division > >>>>>>>>>> (630) 252-5649 > >>>>>>>>>>> argonne national laboratory > >>>>>>>>>> (630) > >>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>> =================================================== > >>>>>>>>>> joseph a. insley > >>>>>>>>>> > >>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>> mathematics & computer science division > >>>>>>> (630) > >>>>>>>>>> 252-5649 > >>>>>>>>>> argonne national laboratory > >>>>>>>>>> (630) > >>>>>>>>>> 252-5986 (fax) > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>> ____________________________________________________________________________________ > >>>>>>>>> Be a better friend, newshound, and > >>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>>>>>> > >>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>>>> > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>>>>>> > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Swift-devel mailing list > >>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>> > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> ____________________________________________________________________________________ > >>>>>> Never miss a thing. Make Yahoo your home page. > >>>>>> http://www.yahoo.com/r/hs > >>>>>> > >>>>> > >>>> > >>> > >> > > > From hategan at mcs.anl.gov Mon Feb 4 10:48:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 10:48:31 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <47A73DFF.3010402@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <47A73DFF.3010402@mcs.anl.gov> Message-ID: <1202143711.17665.13.camel@blabla.mcs.anl.gov> Yes, and I will. But unless we're completely dropping support for pre-ws GRAM, we still need to do this. On Mon, 2008-02-04 at 10:31 -0600, Ian Foster wrote: > It would be really wonderful if someone can try GRAM4, which we believe > addresses this problem. > > Ian. > > Ti Leggett wrote: > > Then I'd say we have very different levels of acceptable. A simple job > > submission test should never take longer than 5 minutes to complete > > and a load of 27 is not acceptable when the responsiveness of the > > machine is impacted. And since we're having this conversation, there > > is a perceived problem on our end so an adjustment to our definition > > of acceptable is needed. 
> > > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > > > >> > >> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > >>> That inca tests were timing out after 5 minutes and the load on the > >>> machine was ~27. How are you concluding when things aren't acceptable? > >> > >> It's got 2 cpus. So to me an average load of under 100 and the SSH > >> session being responsive looks fine. > >> > >> The fact that inca tests are timing out may be because inca has too low > >> of a tolerance for things. > >> > >>> > >>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > >>> > >>>> That's odd. Clearly if that's not acceptable from your perspective, > >>>> yet > >>>> I thought 130 are fine, there's a disconnect between what you think is > >>>> acceptable and what I think is acceptable. > >>>> > >>>> What was that prompted you to conclude things are bad? > >>>> > >>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >>>>> Around 80. > >>>>> > >>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >>>>> > >>>>>> > >>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>>>>> Sorry for killing the server. I'm pushing to get > >>>>>>> results to guide the selection of compounds for > >>>>>>> wet-lab testing. > >>>>>>> > >>>>>>> I had set the throttle.score.job.factor to 1 in the > >>>>>>> swift.properties file. > >>>>>> > >>>>>> Hmm. Ti, at the time of the massacre, how many did you kill? > >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>>> > >>>>>>> I certainly appreciate everyone's efforts and > >>>>>>> responsiveness. > >>>>>>> > >>>>>>> Let me know what to try next, before I kill again. > >>>>>>> > >>>>>>> Cheers, > >>>>>>> > >>>>>>> Mike > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> --- Mihael Hategan wrote: > >>>>>>> > >>>>>>>> So I was trying some stuff on Friday night. I guess > >>>>>>>> I've found the > >>>>>>>> strategy on when to run the tests: when nobody else > >>>>>>>> has jobs there > >>>>>>>> (besides Buzz doing gridftp tests, Ioan having some > >>>>>>>> Falkon workers > >>>>>>>> running, and the occasional Inca tests). > >>>>>>>> > >>>>>>>> In any event, the machine jumps to about 100% > >>>>>>>> utilization at around 130 > >>>>>>>> jobs with pre-ws gram. So Mike, please set > >>>>>>>> throttle.score.job.factor to > >>>>>>>> 1 in swift.properties. > >>>>>>>> > >>>>>>>> There's still more work I need to do test-wise. > >>>>>>>> > >>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>>>>>> Mike, You're killing tg-grid1 again. Can someone > >>>>>>>> work with Mike to get > >>>>>>>>> some swift settings that don't kill our server? > >>>>>>>>> > >>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>>>>> > >>>>>>>>>> Yes, I'm submitting molecular dynamics > >>>>>>>> simulations > >>>>>>>>>> using Swift. > >>>>>>>>>> > >>>>>>>>>> Is there a default wall-time limit for jobs on > >>>>>>>> tg-uc? > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> --- joseph insley wrote: > >>>>>>>>>> > >>>>>>>>>>> Actually, these numbers are now escalating... 
> >>>>>>>>>>> > >>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load > >>>>>>>> average: > >>>>>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>>>>>>> 0 > >>>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>> > >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>> 479 > >>>>>>>>>>> > >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>> real 0m26.134s > >>>>>>>>>>> user 0m0.090s > >>>>>>>>>>> sys 0m0.010s > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>>>>>>> UC/ANL > >>>>>>>>>>> TG GRAM host) > >>>>>>>>>>>> became unresponsive and had to be rebooted. I > >>>>>>>> am > >>>>>>>>>>> now seeing slow > >>>>>>>>>>>> response times from the Gatekeeper there > >>>>>>>> again. > >>>>>>>>>>> Authenticating to > >>>>>>>>>>>> the gatekeeper should only take a second or > >>>>>>>> two, > >>>>>>>>>>> but it is > >>>>>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>>>>> > >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>>> real 0m16.096s > >>>>>>>>>>>> user 0m0.060s > >>>>>>>>>>>> sys 0m0.020s > >>>>>>>>>>>> > >>>>>>>>>>>> looking at the load on tg-grid, it is rather > >>>>>>>> high: > >>>>>>>>>>>> > >>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>>>>>>> average: > >>>>>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>>>>>>> 0 > >>>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>>> > >>>>>>>>>>>> And there appear to be a large number of > >>>>>>>> processes > >>>>>>>>>>> owned by kubal: > >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>>> 380 > >>>>>>>>>>>> > >>>>>>>>>>>> I assume that Mike is using swift to do the > >>>>>>>> job > >>>>>>>>>>> submission. Is > >>>>>>>>>>>> there some throttling of the rate at which > >>>>>>>> jobs > >>>>>>>>>>> are submitted to > >>>>>>>>>>>> the gatekeeper that could be done that would > >>>>>>>>>>> lighten this load > >>>>>>>>>>>> some? (Or has that already been done since > >>>>>>>>>>> earlier today?) The > >>>>>>>>>>>> current response times are not unacceptable, > >>>>>>>> but > >>>>>>>>>>> I'm hoping to > >>>>>>>>>>>> avoid having the machine grind to a halt as it > >>>>>>>> did > >>>>>>>>>>> earlier today. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> joe. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>> =================================================== > >>>>>>>>>>>> joseph a. > >>>>>>>>>>>> insley > >>>>>>>>>>> > >>>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>>> mathematics & computer science division > >>>>>>>>>>> (630) 252-5649 > >>>>>>>>>>>> argonne national laboratory > >>>>>>>>>>> (630) > >>>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>> =================================================== > >>>>>>>>>>> joseph a. 
insley > >>>>>>>>>>> > >>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>> mathematics & computer science division > >>>>>>>> (630) > >>>>>>>>>>> 252-5649 > >>>>>>>>>>> argonne national laboratory > >>>>>>>>>>> (630) > >>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>> ____________________________________________________________________________________ > >>>>>>> > >>>>>>>>>> Be a better friend, newshound, and > >>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>>>>>>> > >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-devel mailing list > >>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>> > >>>>>>>> > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> ____________________________________________________________________________________ > >>>>>>> > >>>>>>> Never miss a thing. Make Yahoo your home page. > >>>>>>> http://www.yahoo.com/r/hs > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From leggett at mcs.anl.gov Mon Feb 4 10:55:48 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 4 Feb 2008 10:55:48 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202143654.17665.12.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <1202143654.17665.12.camel@blabla.mcs.anl.gov> Message-ID: load average is only an indication of what may be a problem. I've seen a load of 10000 on a machine and it still be very responsive because the processes weren't CPU hungry. So using load as a metric for determining acceptability is a small piece. In this case it should be the response of the gatekeeper. For instance, the inca jobs were timing out getting a response from the gatekeeper after 5 minutes. This is unacceptable. I would say as soon as it takes more than a minute for the GK to respond, back off. On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote: > > On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: >> Then I'd say we have very different levels of acceptable. > > Yes, that's why we're having this discussion. > >> A simple job >> submission test should never take longer than 5 minutes to complete >> and a load of 27 is not acceptable when the responsiveness of the >> machine is impacted. And since we're having this conversation, there >> is a perceived problem on our end so an adjustment to our definition >> of acceptable is needed. > > And we need to adjust our definition of not-acceptable. So we need to > meet in the middle. > > So, 25 (sustained) reasonably acceptable average load? That amounts to > about 13 hungry processes per cpu. 
Even with a 100Hz time slice, each > process would get 8 slices per second on average. > >> >> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: >> >>> >>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: >>>> That inca tests were timing out after 5 minutes and the load on the >>>> machine was ~27. How are you concluding when things aren't >>>> acceptable? >>> >>> It's got 2 cpus. So to me an average load of under 100 and the SSH >>> session being responsive looks fine. >>> >>> The fact that inca tests are timing out may be because inca has too >>> low >>> of a tolerance for things. >>> >>>> >>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: >>>> >>>>> That's odd. Clearly if that's not acceptable from your >>>>> perspective, >>>>> yet >>>>> I thought 130 are fine, there's a disconnect between what you >>>>> think is >>>>> acceptable and what I think is acceptable. >>>>> >>>>> What was that prompted you to conclude things are bad? >>>>> >>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: >>>>>> Around 80. >>>>>> >>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: >>>>>> >>>>>>> >>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>>>>>>> Sorry for killing the server. I'm pushing to get >>>>>>>> results to guide the selection of compounds for >>>>>>>> wet-lab testing. >>>>>>>> >>>>>>>> I had set the throttle.score.job.factor to 1 in the >>>>>>>> swift.properties file. >>>>>>> >>>>>>> Hmm. Ti, at the time of the massacre, how many did you kill? >>>>>>> >>>>>>> Mihael >>>>>>> >>>>>>>> >>>>>>>> I certainly appreciate everyone's efforts and >>>>>>>> responsiveness. >>>>>>>> >>>>>>>> Let me know what to try next, before I kill again. >>>>>>>> >>>>>>>> Cheers, >>>>>>>> >>>>>>>> Mike >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> --- Mihael Hategan wrote: >>>>>>>> >>>>>>>>> So I was trying some stuff on Friday night. I guess >>>>>>>>> I've found the >>>>>>>>> strategy on when to run the tests: when nobody else >>>>>>>>> has jobs there >>>>>>>>> (besides Buzz doing gridftp tests, Ioan having some >>>>>>>>> Falkon workers >>>>>>>>> running, and the occasional Inca tests). >>>>>>>>> >>>>>>>>> In any event, the machine jumps to about 100% >>>>>>>>> utilization at around 130 >>>>>>>>> jobs with pre-ws gram. So Mike, please set >>>>>>>>> throttle.score.job.factor to >>>>>>>>> 1 in swift.properties. >>>>>>>>> >>>>>>>>> There's still more work I need to do test-wise. >>>>>>>>> >>>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>>>>>>> Mike, You're killing tg-grid1 again. Can someone >>>>>>>>> work with Mike to get >>>>>>>>>> some swift settings that don't kill our server? >>>>>>>>>> >>>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>>>>>> >>>>>>>>>>> Yes, I'm submitting molecular dynamics >>>>>>>>> simulations >>>>>>>>>>> using Swift. >>>>>>>>>>> >>>>>>>>>>> Is there a default wall-time limit for jobs on >>>>>>>>> tg-uc? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> --- joseph insley wrote: >>>>>>>>>>> >>>>>>>>>>>> Actually, these numbers are now escalating... 
>>>>>>>>>>>> >>>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>>>>>>> average: >>>>>>>>>>>> 149.02, 123.63, 91.94 >>>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>>>>>>> 0 >>>>>>>>>>>> stopped, 0 zombie >>>>>>>>>>>> >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>>> 479 >>>>>>>>>>>> >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>>> real 0m26.134s >>>>>>>>>>>> user 0m0.090s >>>>>>>>>>>> sys 0m0.010s >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>>>>>>> UC/ANL >>>>>>>>>>>> TG GRAM host) >>>>>>>>>>>>> became unresponsive and had to be rebooted. I >>>>>>>>> am >>>>>>>>>>>> now seeing slow >>>>>>>>>>>>> response times from the Gatekeeper there >>>>>>>>> again. >>>>>>>>>>>> Authenticating to >>>>>>>>>>>>> the gatekeeper should only take a second or >>>>>>>>> two, >>>>>>>>>>>> but it is >>>>>>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>>>>>> >>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>>>> real 0m16.096s >>>>>>>>>>>>> user 0m0.060s >>>>>>>>>>>>> sys 0m0.020s >>>>>>>>>>>>> >>>>>>>>>>>>> looking at the load on tg-grid, it is rather >>>>>>>>> high: >>>>>>>>>>>>> >>>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>>>>>>> average: >>>>>>>>>>>> 89.59, 78.69, 62.92 >>>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>>>>>>> 0 >>>>>>>>>>>> stopped, 0 zombie >>>>>>>>>>>>> >>>>>>>>>>>>> And there appear to be a large number of >>>>>>>>> processes >>>>>>>>>>>> owned by kubal: >>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>>>> 380 >>>>>>>>>>>>> >>>>>>>>>>>>> I assume that Mike is using swift to do the >>>>>>>>> job >>>>>>>>>>>> submission. Is >>>>>>>>>>>>> there some throttling of the rate at which >>>>>>>>> jobs >>>>>>>>>>>> are submitted to >>>>>>>>>>>>> the gatekeeper that could be done that would >>>>>>>>>>>> lighten this load >>>>>>>>>>>>> some? (Or has that already been done since >>>>>>>>>>>> earlier today?) The >>>>>>>>>>>>> current response times are not unacceptable, >>>>>>>>> but >>>>>>>>>>>> I'm hoping to >>>>>>>>>>>>> avoid having the machine grind to a halt as it >>>>>>>>> did >>>>>>>>>>>> earlier today. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> joe. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>> =================================================== >>>>>>>>>>>>> joseph a. >>>>>>>>>>>>> insley >>>>>>>>>>>> >>>>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>>>> mathematics & computer science division >>>>>>>>>>>> (630) 252-5649 >>>>>>>>>>>>> argonne national laboratory >>>>>>>>>>>> (630) >>>>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>> =================================================== >>>>>>>>>>>> joseph a. 
insley >>>>>>>>>>>> >>>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>>> mathematics & computer science division >>>>>>>>> (630) >>>>>>>>>>>> 252-5649 >>>>>>>>>>>> argonne national laboratory >>>>>>>>>>>> (630) >>>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>>> ____________________________________________________________________________________ >>>>>>>>>>> Be a better friend, newshound, and >>>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>>>>> >>>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Swift-devel mailing list >>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>> >>>>>>>>> >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Swift-devel mailing list >>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>> >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> ____________________________________________________________________________________ >>>>>>>> Never miss a thing. Make Yahoo your home page. >>>>>>>> http://www.yahoo.com/r/hs >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > From hategan at mcs.anl.gov Mon Feb 4 11:27:15 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 11:27:15 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <1202143654.17665.12.camel@blabla.mcs.anl.gov> Message-ID: <1202146035.18610.0.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 10:55 -0600, Ti Leggett wrote: > load average is only an indication of what may be a problem. I've seen > a load of 10000 on a machine and it still be very responsive because > the processes weren't CPU hungry. So using load as a metric for > determining acceptability is a small piece. In this case it should be > the response of the gatekeeper. For instance, the inca jobs were > timing out getting a response from the gatekeeper after 5 minutes. > This is unacceptable. I would say as soon as it takes more than a > minute for the GK to respond, back off. Excellent. Now we have a useable metric and value. > > On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote: > > > > > On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: > >> Then I'd say we have very different levels of acceptable. > > > > Yes, that's why we're having this discussion. > > > >> A simple job > >> submission test should never take longer than 5 minutes to complete > >> and a load of 27 is not acceptable when the responsiveness of the > >> machine is impacted. And since we're having this conversation, there > >> is a perceived problem on our end so an adjustment to our definition > >> of acceptable is needed. > > > > And we need to adjust our definition of not-acceptable. So we need to > > meet in the middle. > > > > So, 25 (sustained) reasonably acceptable average load? That amounts to > > about 13 hungry processes per cpu. 
Even with a 100Hz time slice, each > > process would get 8 slices per second on average. > > > >> > >> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > >> > >>> > >>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > >>>> That inca tests were timing out after 5 minutes and the load on the > >>>> machine was ~27. How are you concluding when things aren't > >>>> acceptable? > >>> > >>> It's got 2 cpus. So to me an average load of under 100 and the SSH > >>> session being responsive looks fine. > >>> > >>> The fact that inca tests are timing out may be because inca has too > >>> low > >>> of a tolerance for things. > >>> > >>>> > >>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > >>>> > >>>>> That's odd. Clearly if that's not acceptable from your > >>>>> perspective, > >>>>> yet > >>>>> I thought 130 are fine, there's a disconnect between what you > >>>>> think is > >>>>> acceptable and what I think is acceptable. > >>>>> > >>>>> What was that prompted you to conclude things are bad? > >>>>> > >>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >>>>>> Around 80. > >>>>>> > >>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >>>>>> > >>>>>>> > >>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>>>>>> Sorry for killing the server. I'm pushing to get > >>>>>>>> results to guide the selection of compounds for > >>>>>>>> wet-lab testing. > >>>>>>>> > >>>>>>>> I had set the throttle.score.job.factor to 1 in the > >>>>>>>> swift.properties file. > >>>>>>> > >>>>>>> Hmm. Ti, at the time of the massacre, how many did you kill? > >>>>>>> > >>>>>>> Mihael > >>>>>>> > >>>>>>>> > >>>>>>>> I certainly appreciate everyone's efforts and > >>>>>>>> responsiveness. > >>>>>>>> > >>>>>>>> Let me know what to try next, before I kill again. > >>>>>>>> > >>>>>>>> Cheers, > >>>>>>>> > >>>>>>>> Mike > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> --- Mihael Hategan wrote: > >>>>>>>> > >>>>>>>>> So I was trying some stuff on Friday night. I guess > >>>>>>>>> I've found the > >>>>>>>>> strategy on when to run the tests: when nobody else > >>>>>>>>> has jobs there > >>>>>>>>> (besides Buzz doing gridftp tests, Ioan having some > >>>>>>>>> Falkon workers > >>>>>>>>> running, and the occasional Inca tests). > >>>>>>>>> > >>>>>>>>> In any event, the machine jumps to about 100% > >>>>>>>>> utilization at around 130 > >>>>>>>>> jobs with pre-ws gram. So Mike, please set > >>>>>>>>> throttle.score.job.factor to > >>>>>>>>> 1 in swift.properties. > >>>>>>>>> > >>>>>>>>> There's still more work I need to do test-wise. > >>>>>>>>> > >>>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>>>>>>> Mike, You're killing tg-grid1 again. Can someone > >>>>>>>>> work with Mike to get > >>>>>>>>>> some swift settings that don't kill our server? > >>>>>>>>>> > >>>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>>>>>> > >>>>>>>>>>> Yes, I'm submitting molecular dynamics > >>>>>>>>> simulations > >>>>>>>>>>> using Swift. > >>>>>>>>>>> > >>>>>>>>>>> Is there a default wall-time limit for jobs on > >>>>>>>>> tg-uc? > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> --- joseph insley wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Actually, these numbers are now escalating... 
> >>>>>>>>>>>> > >>>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load > >>>>>>>>> average: > >>>>>>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>>>>>>>> 0 > >>>>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>>> > >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>>> 479 > >>>>>>>>>>>> > >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>>> real 0m26.134s > >>>>>>>>>>>> user 0m0.090s > >>>>>>>>>>>> sys 0m0.010s > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>>>>>>>> UC/ANL > >>>>>>>>>>>> TG GRAM host) > >>>>>>>>>>>>> became unresponsive and had to be rebooted. I > >>>>>>>>> am > >>>>>>>>>>>> now seeing slow > >>>>>>>>>>>>> response times from the Gatekeeper there > >>>>>>>>> again. > >>>>>>>>>>>> Authenticating to > >>>>>>>>>>>>> the gatekeeper should only take a second or > >>>>>>>>> two, > >>>>>>>>>>>> but it is > >>>>>>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>>>>>> > >>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>>>> real 0m16.096s > >>>>>>>>>>>>> user 0m0.060s > >>>>>>>>>>>>> sys 0m0.020s > >>>>>>>>>>>>> > >>>>>>>>>>>>> looking at the load on tg-grid, it is rather > >>>>>>>>> high: > >>>>>>>>>>>>> > >>>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>>>>>>>> average: > >>>>>>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>>>>>>>> 0 > >>>>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>>>> > >>>>>>>>>>>>> And there appear to be a large number of > >>>>>>>>> processes > >>>>>>>>>>>> owned by kubal: > >>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>>>> 380 > >>>>>>>>>>>>> > >>>>>>>>>>>>> I assume that Mike is using swift to do the > >>>>>>>>> job > >>>>>>>>>>>> submission. Is > >>>>>>>>>>>>> there some throttling of the rate at which > >>>>>>>>> jobs > >>>>>>>>>>>> are submitted to > >>>>>>>>>>>>> the gatekeeper that could be done that would > >>>>>>>>>>>> lighten this load > >>>>>>>>>>>>> some? (Or has that already been done since > >>>>>>>>>>>> earlier today?) The > >>>>>>>>>>>>> current response times are not unacceptable, > >>>>>>>>> but > >>>>>>>>>>>> I'm hoping to > >>>>>>>>>>>>> avoid having the machine grind to a halt as it > >>>>>>>>> did > >>>>>>>>>>>> earlier today. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> joe. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>> =================================================== > >>>>>>>>>>>>> joseph a. > >>>>>>>>>>>>> insley > >>>>>>>>>>>> > >>>>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>>>> mathematics & computer science division > >>>>>>>>>>>> (630) 252-5649 > >>>>>>>>>>>>> argonne national laboratory > >>>>>>>>>>>> (630) > >>>>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>> =================================================== > >>>>>>>>>>>> joseph a. 
insley > >>>>>>>>>>>> > >>>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>>> mathematics & computer science division > >>>>>>>>> (630) > >>>>>>>>>>>> 252-5649 > >>>>>>>>>>>> argonne national laboratory > >>>>>>>>>>>> (630) > >>>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>> ____________________________________________________________________________________ > >>>>>>>>>>> Be a better friend, newshound, and > >>>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>>>>>>>> > >>>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Swift-devel mailing list > >>>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-devel mailing list > >>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>> > >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> ____________________________________________________________________________________ > >>>>>>>> Never miss a thing. Make Yahoo your home page. > >>>>>>>> http://www.yahoo.com/r/hs > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > From mikekubal at yahoo.com Mon Feb 4 12:30:07 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Mon, 4 Feb 2008 10:30:07 -0800 (PST) Subject: [Swift-devel] throttle.score.job.transfer Message-ID: <424515.93529.qm@web52305.mail.re2.yahoo.com> I attempted to run a job with throttle.score.job.transfer of .5 and the job failed with the following: Execution failed: Could not convert value to number: .5 Caused by: For input string: ".5" -MikeK ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From benc at hawaga.org.uk Mon Feb 4 12:40:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 18:40:19 +0000 (GMT) Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: <424515.93529.qm@web52305.mail.re2.yahoo.com> References: <424515.93529.qm@web52305.mail.re2.yahoo.com> Message-ID: On Mon, 4 Feb 2008, Mike Kubal wrote: > I attempted to run a job with > throttle.score.job.transfer of .5 and the job failed > with the following: > > Execution failed: > Could not convert value to number: .5 > Caused by: > For input string: ".5" yeah, turns out its an integer only field. my bad for telling you otherwise. -- From benc at hawaga.org.uk Mon Feb 4 12:44:38 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 18:44:38 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: you can also try out gram4 as follows: * get swift r1609 from SVN * set a site entry like this: /home/benc TG-CCR080002N Change the project key to a project that you are on (or, if you have a default project, you can remove it). 
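The XML of the site entry above was flattened when this message was archived; only the work directory (/home/benc) and the project key (TG-CCR080002N) survive. Below is a rough sketch of the kind of GRAM4 pool entry being described, with the handle, host names and exact element/attribute names guessed from memory rather than taken from r1609, so treat it purely as an illustration:

<pool handle="uc-teragrid">
  <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
  <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
  <workdirectory>/home/benc</workdirectory>
  <profile namespace="globus" key="project">TG-CCR080002N</profile>
</pool>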
I've run this from from teraport submitting to TG-UC using the default load parameters and it has made it through 730 or so jobs of a 1000 node workflow without apparently excessive load (its still running - also I got some ftp failures, but job retry should handle those) -- From mikekubal at yahoo.com Mon Feb 4 12:49:23 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Mon, 4 Feb 2008 10:49:23 -0800 (PST) Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: Message-ID: <690293.82602.qm@web52311.mail.re2.yahoo.com> Unless there are any objections, I'd like to submit a maximum of 21 jobs to the UC -teragrid with the throttling thresholds limited to the following so a baseline metric could be established: throttle.submit = 2 (default 4) throttle.host.submit = 1 (default 2) throttle.score.job.factor = 1 (default 4) throttle.transfers = 2 (default 4) throttle.file.operations = 4 (default 8) Ti, I'll let you know as soon as I launch it. Depending on how this goes, I plan to try Ben's local PB approach next. Cheers, Mike --- Ben Clifford wrote: > > > On Mon, 4 Feb 2008, Mike Kubal wrote: > > > I attempted to run a job with > > throttle.score.job.transfer of .5 and the job > failed > > with the following: > > > > Execution failed: > > Could not convert value to number: .5 > > Caused by: > > For input string: ".5" > > yeah, turns out its an integer only field. my bad > for telling you > otherwise. > > -- > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From mikekubal at yahoo.com Mon Feb 4 12:55:06 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Mon, 4 Feb 2008 10:55:06 -0800 (PST) Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: <690293.82602.qm@web52311.mail.re2.yahoo.com> Message-ID: <767735.55578.qm@web52312.mail.re2.yahoo.com> or I'll try Ben's r1609 approach, unless folks would like a baseline. --- Mike Kubal wrote: > Unless there are any objections, I'd like to submit > a > maximum of 21 jobs to the UC -teragrid with the > throttling thresholds limited to the following so a > baseline metric could be established: > > throttle.submit = 2 (default 4) > throttle.host.submit = 1 (default 2) > throttle.score.job.factor = 1 (default 4) > throttle.transfers = 2 (default 4) > throttle.file.operations = 4 (default 8) > > Ti, I'll let you know as soon as I launch it. > > Depending on how this goes, I plan to try Ben's > local > PB approach next. > > Cheers, > > Mike > > --- Ben Clifford wrote: > > > > > > > On Mon, 4 Feb 2008, Mike Kubal wrote: > > > > > I attempted to run a job with > > > throttle.score.job.transfer of .5 and the job > > failed > > > with the following: > > > > > > Execution failed: > > > Could not convert value to number: .5 > > > Caused by: > > > For input string: ".5" > > > > yeah, turns out its an integer only field. my bad > > for telling you > > otherwise. > > > > -- > > > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. 
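Written out as a swift.properties stanza, the baseline Mike proposes above would look roughly like this (a sketch assuming the property names are spelled exactly as in his list; note, per the exchange just above, that throttle.score.job.transfer at least is an integer-only field, so fractional values such as .5 are rejected):

  throttle.submit = 2
  throttle.host.submit = 1
  throttle.score.job.factor = 1
  throttle.transfers = 2
  throttle.file.operations = 4

The defaults Mike quotes are 4, 2, 4, 4 and 8 respectively, so this is roughly a factor-of-two reduction across the board.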
http://www.yahoo.com/r/hs From benc at hawaga.org.uk Mon Feb 4 12:59:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 18:59:40 +0000 (GMT) Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: <767735.55578.qm@web52312.mail.re2.yahoo.com> References: <767735.55578.qm@web52312.mail.re2.yahoo.com> Message-ID: On Mon, 4 Feb 2008, Mike Kubal wrote: > or I'll try Ben's r1609 approach, unless folks would > like a baseline. I think trying PBS and GRAM4 are better things for you to do than continue spending time with GRAM2. A comparison of how PBS and GRAM4 weigh up would be very interesting (to me). When you run anything there, please save your log files - I can do interesting things with them (for example, watching how the Swift internal scheduler is behaving). Also if you are getting kickstart records, save those too. There's a single commandline to type to do this, at the bottom of the user guide: rsync --ignore-existing *.log *-kickstart.xml login.ci.uchicago.edu:/home/benc/swift-logs/ --verbose -- From benc at hawaga.org.uk Mon Feb 4 13:27:31 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 19:27:31 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: On Mon, 4 Feb 2008, Ben Clifford wrote: > I've run this from from teraport submitting to TG-UC using the default > load parameters and it has made it through 730 or so jobs of a 1000 node > workflow without apparently excessive load (its still running - also I got > some ftp failures, but job retry should handle those) actually, that run got stuck - it seems to have lost one job (as in, it isn't in the PBS queue but Swift still thinks its in progress) I'll look at that closer somewhat later. -- From hategan at mcs.anl.gov Mon Feb 4 14:43:21 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 14:43:21 -0600 Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: <424515.93529.qm@web52305.mail.re2.yahoo.com> References: <424515.93529.qm@web52305.mail.re2.yahoo.com> Message-ID: <1202157801.20465.1.camel@blabla.mcs.anl.gov> Yes. That will not work. You need integral numbers there. I will fix this hopefully tonight. On Mon, 2008-02-04 at 10:30 -0800, Mike Kubal wrote: > I attempted to run a job with > throttle.score.job.transfer of .5 and the job failed > with the following: > > Execution failed: > Could not convert value to number: .5 > Caused by: > For input string: ".5" > > -MikeK > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Feb 4 14:45:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 14:45:06 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: <1202157906.20465.4.camel@blabla.mcs.anl.gov> For that to work, you'll need to fetch Swift from SVN. 
You can find instructions on how to do that here: http://www.ci.uchicago.edu/swift/downloads/index.php Mihael On Mon, 2008-02-04 at 18:44 +0000, Ben Clifford wrote: > you can also try out gram4 as follows: > > * get swift r1609 from SVN > > * set a site entry like this: > > gridlaunch="/home/wilde/vds/mystart"> > storage="/home/benc" maj > or="2" minor="2" /> > url="tg-grid.uc.teragrid.org" /> > /home/benc > TG-CCR080002N > > > Change the project key to a project that you are on (or, if you have a > default project, you can remove it). > > I've run this from from teraport submitting to TG-UC using the default > load parameters and it has made it through 730 or so jobs of a 1000 node > workflow without apparently excessive load (its still running - also I got > some ftp failures, but job retry should handle those) > From hategan at mcs.anl.gov Mon Feb 4 16:32:32 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 16:32:32 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202143711.17665.13.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <47A73DFF.3010402@mcs.anl.gov> <1202143711.17665.13.camel@blabla.mcs.anl.gov> Message-ID: <1202164352.22470.4.camel@blabla.mcs.anl.gov> So WS-GRAM in terms of machine load seems to work better (i.e. barely visible), which is to be expected. Swift does however run out of memory faster. Whereas I could safely (from the client side perspective) run 256 parallel jobs with the default 64M of heap space, with WS-GRAM it dies. I don't have an exact dependence of load vs. number of jobs yet, but I'll be working on that. Mihael On Mon, 2008-02-04 at 10:48 -0600, Mihael Hategan wrote: > Yes, and I will. But unless we're completely dropping support for pre-ws > GRAM, we still need to do this. > > > On Mon, 2008-02-04 at 10:31 -0600, Ian Foster wrote: > > It would be really wonderful if someone can try GRAM4, which we believe > > addresses this problem. > > > > Ian. > > > > Ti Leggett wrote: > > > Then I'd say we have very different levels of acceptable. A simple job > > > submission test should never take longer than 5 minutes to complete > > > and a load of 27 is not acceptable when the responsiveness of the > > > machine is impacted. And since we're having this conversation, there > > > is a perceived problem on our end so an adjustment to our definition > > > of acceptable is needed. > > > > > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > > > > > >> > > >> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > > >>> That inca tests were timing out after 5 minutes and the load on the > > >>> machine was ~27. How are you concluding when things aren't acceptable? > > >> > > >> It's got 2 cpus. So to me an average load of under 100 and the SSH > > >> session being responsive looks fine. > > >> > > >> The fact that inca tests are timing out may be because inca has too low > > >> of a tolerance for things. > > >> > > >>> > > >>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > > >>> > > >>>> That's odd. 
> > >>>>>>>> > > >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> _______________________________________________ > > >>>>>>>>> Swift-devel mailing list > > >>>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>> _______________________________________________ > > >>>>>>>> Swift-devel mailing list > > >>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>> > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> ____________________________________________________________________________________ > > >>>>>>> > > >>>>>>> Never miss a thing. Make Yahoo your home page. > > >>>>>>> http://www.yahoo.com/r/hs > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Feb 4 17:16:05 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 17:16:05 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202164352.22470.4.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <47A73DFF.3010402@mcs.anl.gov> <1202143711.17665.13.camel@blabla.mcs.anl.gov> <1202164352.22470.4.camel@blabla.mcs.anl.gov> Message-ID: <1202166965.22912.0.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 16:32 -0600, Mihael Hategan wrote: > So WS-GRAM in terms of machine load seems to work better (i.e. barely > visible), which is to be expected. Swift does however run out of memory > faster. Whereas I could safely (from the client side perspective) run > 256 parallel jobs with ... pre-WS-GRAM and... > the default 64M of heap space, with WS-GRAM it > dies. > > I don't have an exact dependence of load vs. number of jobs yet, but > I'll be working on that. > > Mihael > > On Mon, 2008-02-04 at 10:48 -0600, Mihael Hategan wrote: > > Yes, and I will. But unless we're completely dropping support for pre-ws > > GRAM, we still need to do this. > > > > > > On Mon, 2008-02-04 at 10:31 -0600, Ian Foster wrote: > > > It would be really wonderful if someone can try GRAM4, which we believe > > > addresses this problem. > > > > > > Ian. > > > > > > Ti Leggett wrote: > > > > Then I'd say we have very different levels of acceptable. A simple job > > > > submission test should never take longer than 5 minutes to complete > > > > and a load of 27 is not acceptable when the responsiveness of the > > > > machine is impacted. And since we're having this conversation, there > > > > is a perceived problem on our end so an adjustment to our definition > > > > of acceptable is needed. 
insley > > > >>>>>>>>>>> > > > >>>>>>>>>>> insley at mcs.anl.gov > > > >>>>>>>>>>> mathematics & computer science division > > > >>>>>>>> (630) > > > >>>>>>>>>>> 252-5649 > > > >>>>>>>>>>> argonne national laboratory > > > >>>>>>>>>>> (630) > > > >>>>>>>>>>> 252-5986 (fax) > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>> > > > >>>>>>> ____________________________________________________________________________________ > > > >>>>>>> > > > >>>>>>>>>> Be a better friend, newshound, and > > > >>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. > > > >>>>>>>> > > > >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> _______________________________________________ > > > >>>>>>>>> Swift-devel mailing list > > > >>>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> _______________________________________________ > > > >>>>>>>> Swift-devel mailing list > > > >>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>> > > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> ____________________________________________________________________________________ > > > >>>>>>> > > > >>>>>>> Never miss a thing. Make Yahoo your home page. > > > >>>>>>> http://www.yahoo.com/r/hs > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >>> > > > >> > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Tue Feb 5 14:21:01 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 05 Feb 2008 14:21:01 -0600 Subject: [Swift-devel] RFF (request for feature) In-Reply-To: References: Message-ID: <1202242861.4718.9.camel@blabla.mcs.anl.gov> You can probably simulate lots of these with arrays. For example: int queue[]; foreach i in [0:100] { queue[i] = ...; } foreach x in queue { f(x); } On Mon, 2008-01-28 at 15:51 -0600, Tiberiu Stef-Praun wrote: > Hi gang, > > I find myself in the need for a queuing facility in swift with the > following operations: > > createQ > submitQ(function) > triggerQ(function, #jobs in queue) - to signal empty queues, for instance > deleteQ > > I would think that in addition to atomic functions and composite > functions, we will have the queue facility acting as an intermediary. > > Is any of this possible/doable in a data-flow language ? 
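To make the array-as-queue idiom from the reply above a little more concrete, here is an untested SwiftScript sketch; the type, the app name and the mapper use are invented for illustration and are not from this thread. The first foreach plays the role of submitQ by filling the array, and each iteration of the second foreach fires as soon as its element has been assigned:

  type file;

  app (file out) consume (int x) {
      echo x stdout=@filename(out);
  }

  int queue[];

  foreach i in [0:99] {
      queue[i] = 2 * i;                  // producer: the "submitQ" side
  }

  foreach x, i in queue {
      file out <single_file_mapper; file=@strcat("item-", i, ".out")>;
      out = consume(x);                  // consumer runs once per element
  }

The array is closed once the producing foreach finishes, which is roughly the analogue of deleteQ; something like triggerQ, reacting to the number of jobs still queued, does not map as naturally onto dataflow.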
> > Thanks > Tibi > From hategan at mcs.anl.gov Thu Feb 7 15:24:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 07 Feb 2008 15:24:50 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <1202143654.17665.12.camel@blabla.mcs.anl.gov> Message-ID: <1202419491.13362.5.camel@blabla.mcs.anl.gov> Ok, so I'll change the scheduler feedback loop to aim towards a 20 s max submission time. This should apply nicely to all providers. Any objections? On Mon, 2008-02-04 at 10:55 -0600, Ti Leggett wrote: > load average is only an indication of what may be a problem. I've seen > a load of 10000 on a machine and it still be very responsive because > the processes weren't CPU hungry. So using load as a metric for > determining acceptability is a small piece. In this case it should be > the response of the gatekeeper. For instance, the inca jobs were > timing out getting a response from the gatekeeper after 5 minutes. > This is unacceptable. I would say as soon as it takes more than a > minute for the GK to respond, back off. > > On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote: > > > > > On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: > >> Then I'd say we have very different levels of acceptable. > > > > Yes, that's why we're having this discussion. > > > >> A simple job > >> submission test should never take longer than 5 minutes to complete > >> and a load of 27 is not acceptable when the responsiveness of the > >> machine is impacted. And since we're having this conversation, there > >> is a perceived problem on our end so an adjustment to our definition > >> of acceptable is needed. > > > > And we need to adjust our definition of not-acceptable. So we need to > > meet in the middle. > > > > So, 25 (sustained) reasonably acceptable average load? That amounts to > > about 13 hungry processes per cpu. Even with a 100Hz time slice, each > > process would get 8 slices per second on average. > > > >> > >> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > >> > >>> > >>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > >>>> That inca tests were timing out after 5 minutes and the load on the > >>>> machine was ~27. How are you concluding when things aren't > >>>> acceptable? > >>> > >>> It's got 2 cpus. So to me an average load of under 100 and the SSH > >>> session being responsive looks fine. > >>> > >>> The fact that inca tests are timing out may be because inca has too > >>> low > >>> of a tolerance for things. > >>> > >>>> > >>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > >>>> > >>>>> That's odd. Clearly if that's not acceptable from your > >>>>> perspective, > >>>>> yet > >>>>> I thought 130 are fine, there's a disconnect between what you > >>>>> think is > >>>>> acceptable and what I think is acceptable. > >>>>> > >>>>> What was that prompted you to conclude things are bad? > >>>>> > >>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >>>>>> Around 80. > >>>>>> > >>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >>>>>> > >>>>>>> > >>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>>>>>> Sorry for killing the server. 
> >>>>>>>>> > >>>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Swift-devel mailing list > >>>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-devel mailing list > >>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>> > >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> ____________________________________________________________________________________ > >>>>>>>> Never miss a thing. Make Yahoo your home page. > >>>>>>>> http://www.yahoo.com/r/hs > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > From leggett at mcs.anl.gov Thu Feb 7 15:34:46 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Thu, 7 Feb 2008 15:34:46 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202419491.13362.5.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <1202143654.17665.12.camel@blabla.mcs.anl.gov> <1202419491.13362.5.camel@blabla.mcs.anl.gov> Message-ID: <09616F48-D103-43FB-9E2C-9FFC470D0AC5@mcs.anl.gov> This sounds like a good place to start. On Feb 7, 2008, at 3:24 PM, Mihael Hategan wrote: > Ok, so I'll change the scheduler feedback loop to aim towards a 20 s > max > submission time. This should apply nicely to all providers. > > Any objections? > > On Mon, 2008-02-04 at 10:55 -0600, Ti Leggett wrote: >> load average is only an indication of what may be a problem. I've >> seen >> a load of 10000 on a machine and it still be very responsive because >> the processes weren't CPU hungry. So using load as a metric for >> determining acceptability is a small piece. In this case it should be >> the response of the gatekeeper. For instance, the inca jobs were >> timing out getting a response from the gatekeeper after 5 minutes. >> This is unacceptable. I would say as soon as it takes more than a >> minute for the GK to respond, back off. >> >> On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote: >> >>> >>> On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: >>>> Then I'd say we have very different levels of acceptable. >>> >>> Yes, that's why we're having this discussion. >>> >>>> A simple job >>>> submission test should never take longer than 5 minutes to complete >>>> and a load of 27 is not acceptable when the responsiveness of the >>>> machine is impacted. And since we're having this conversation, >>>> there >>>> is a perceived problem on our end so an adjustment to our >>>> definition >>>> of acceptable is needed. >>> >>> And we need to adjust our definition of not-acceptable. So we need >>> to >>> meet in the middle. >>> >>> So, 25 (sustained) reasonably acceptable average load? That >>> amounts to >>> about 13 hungry processes per cpu. Even with a 100Hz time slice, >>> each >>> process would get 8 slices per second on average. 
insley >>>>>>>>>>>>>> >>>>>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>>>>> mathematics & computer science division >>>>>>>>>>> (630) >>>>>>>>>>>>>> 252-5649 >>>>>>>>>>>>>> argonne national laboratory >>>>>>>>>>>>>> (630) >>>>>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> ____________________________________________________________________________________ >>>>>>>>>>>>> Be a better friend, newshound, and >>>>>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>>>>>>> >>>>>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Swift-devel mailing list >>>>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Swift-devel mailing list >>>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>>> >>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ____________________________________________________________________________________ >>>>>>>>>> Never miss a thing. Make Yahoo your home page. >>>>>>>>>> http://www.yahoo.com/r/hs >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > From hategan at mcs.anl.gov Thu Feb 7 20:34:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 07 Feb 2008 20:34:13 -0600 Subject: [Swift-devel] ws-gram tests Message-ID: <1202438053.26812.12.camel@blabla.mcs.anl.gov> I did a 1024 job run today with ws-gram. I painted the results here: http://www-unix.mcs.anl.gov/~hategan/s/g.html Seems like client memory per job is about 370k. Which is quite a lot. What kinda worries me is that it doesn't seem to go down after the jobs are done, so maybe there's a memory leak, or maybe the garbage collector doesn't do any major collections. I'll need to profile this to see exactly what we're talking about. The container memory is figured by looking at the process in /proc. It's total memory including shared libraries and things. But libraries take a fixed amount of space, so a fuzzy correlation can probably be made. It looks quite similar to the amount of memory eaten on the client side (per job). CPU-load-wise, WS-GRAM behaves. There is some work during the time the jobs are submitted, but the machine itself seems responsive. I have yet to plot the exact submission time for each job. So at this point I would recommend trying ws-gram as long as there aren't too many jobs involved (i.e. under 4000 parallel jobs), and while making sure the jvm has enough heap. More than that seems like a gamble. Mihael From hategan at mcs.anl.gov Thu Feb 7 20:41:33 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 07 Feb 2008 20:41:33 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202438053.26812.12.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> Message-ID: <1202438493.27139.0.camel@blabla.mcs.anl.gov> > So at this point I would recommend trying ws-gram as long as there > aren't too many jobs involved (i.e. under 4000 parallel jobs), .. actually submitted jobs. This may be somewhat unlikely to occur. > and while > making sure the jvm has enough heap. More than that seems like a gamble. 
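As a back-of-envelope check on what "enough heap" means with the ~370k-per-job figure above (rough numbers only; the real per-job cost will vary with the job descriptions and with whether that memory is ever reclaimed):

   256 jobs x ~370 KB = about 95 MB   (which fits the earlier observation that 256 jobs run in the default 64M heap with pre-WS GRAM but die with WS-GRAM)
  1024 jobs x ~370 KB = about 380 MB
  4000 jobs x ~370 KB = about 1.5 GB

So a run on the scale of the 1024-job test wants something like -Xmx512m on the client JVM instead of the stock 64M; exactly how that flag gets passed depends on the launcher script in the release being used.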
> > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri Feb 8 08:01:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 14:01:17 +0000 (GMT) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202438053.26812.12.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> Message-ID: some rough numbers in this space that I collected yesterday: I ran a 10000 (10^5) parallel jobs workflow on teraport using the PBS provider. It launched up to 401 jobs at once, as per the default config file. It took about 6h but ran ok. I didn't keep any other statistics, though - just set it going in the morning and let it run. That's a throughput of about one job every couple of seconds. My laptop can do about 5000 of the same kind of job through the fork provider, running at most 2 jobs at once, in about 15 minutes. -- From foster at mcs.anl.gov Fri Feb 8 09:19:21 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 08 Feb 2008 09:19:21 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202438053.26812.12.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> Message-ID: <47AC72F9.8010701@mcs.anl.gov> Mihael: That's great, thanks! Ian. Mihael Hategan wrote: > I did a 1024 job run today with ws-gram. > I painted the results here: > http://www-unix.mcs.anl.gov/~hategan/s/g.html > > Seems like client memory per job is about 370k. Which is quite a lot. > What kinda worries me is that it doesn't seem to go down after the jobs > are done, so maybe there's a memory leak, or maybe the garbage collector > doesn't do any major collections. I'll need to profile this to see > exactly what we're talking about. > > The container memory is figured by looking at the process in /proc. It's > total memory including shared libraries and things. But libraries take a > fixed amount of space, so a fuzzy correlation can probably be made. It > looks quite similar to the amount of memory eaten on the client side > (per job). > > CPU-load-wise, WS-GRAM behaves. There is some work during the time the > jobs are submitted, but the machine itself seems responsive. I have yet > to plot the exact submission time for each job. > > So at this point I would recommend trying ws-gram as long as there > aren't too many jobs involved (i.e. under 4000 parallel jobs), and while > making sure the jvm has enough heap. More than that seems like a gamble. > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Fri Feb 8 09:33:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 09:33:43 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> Message-ID: <1202484823.4800.2.camel@blabla.mcs.anl.gov> I disabled the job throttle in this case. I also didn't consider failures (restarts were at 0). In the last run I got exactly one failure, but in some previous runs (with 256 jobs) I got more. All that needs to be debugged. On Fri, 2008-02-08 at 14:01 +0000, Ben Clifford wrote: > some rough numbers in this space that I collected yesterday: > > I ran a 10000 (10^5) parallel jobs workflow on teraport using the PBS > provider. 
It launched up to 401 jobs at once, as per the default config > file. It took about 6h but ran ok. I didn't keep any other statistics, > though - just set it going in the morning and let it run. That's a > throughput of about one job every couple of seconds. > > My laptop can do about 5000 of the same kind of job through the fork > provider, running at most 2 jobs at once, in about 15 minutes. > From smartin at mcs.anl.gov Fri Feb 8 09:33:53 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Fri, 8 Feb 2008 09:33:53 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <47AC72F9.8010701@mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> Message-ID: <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> Mihael, Glad to hear things are improved with GRAM4. Lets keep going to have swift using GRAM4 routinely. Below is a recent thread that looked at this exact issue with condor- g. But it is entirely relevant to your use of GRAM4. the 2 issues to look for are 1) your use of notifications >> I ran one more test with the improved callback code. This time, I >> stopped storing the notification producer EPRs associated with the >> GRAM job resources. Memory usage went down markedly. 2) you could avoid notifications and instead do client-side polling for job state. This has shown to be more reliable than notifications under heavy loads, condor-g processing 1000s of jobs. The core team will be looking at improving notifications once their other 4.2 deliverables are done. -Stu Begin forwarded message: > From: feller at mcs.anl.gov > Date: February 1, 2008 9:41:05 AM CST > To: "Jaime Frey" > Cc: "Stuart Martin" , "Terrence Martin" >, "Martin Feller" , "charles bacon" >, "Suchandra Thapa" , "Rob Gardner" >, "Jeff Porter" , "Alain Roy" , > "Todd Tannenbaum" , "Miron Livny" > > Subject: Re: Condor-G WS GRAM memory usage > >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >>> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >>>> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM >>>>> raised some concerns about memory usage on the client side. I did >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared >>>>> to be the primary memory consumer. The GAHP server is a wrapper >>>>> around the java client libraries for WS GRAM. >>>>> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a >>>>> time. The jobs were 2-minute sleep jobs with minimal data >>>>> transfer. All of the jobs overlapped in submission and execution. >>>>> Here is what I've discovered so far. >>>>> >>>>> Aside from the heap available to the java code, the jvm used 117 >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >>>>> >>>>> The maximum jvm heap usage (as reported by the garbage collector) >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >>>>> quiescent (jobs executing, Condor-G waiting for them to complete), >>>>> heap usage was about 5 megs plus 0.6 megs per job. >>>>> >>>>> The only long-term memory per job that I know of in the GAHP is >>>>> for the notification sink for job status callbacks. 600kb seems a >>>>> little high for that. Stu, could someone on Globus help us >>>>> determine if we're using the notification sinks inefficiently? >>>> >>>> Martin just looked and for the most part, there is nothing wrong >>>> with how condor-g manages the callback sink. 
>>>> However, one improvement that would reduce the memory used per job >>>> would be to not have a notification consumer per job. Instead use >>>> one for all jobs. >>>> >>>> Also, Martin recently did some analysis on condor-g stress tests >>>> and found that notifications are building up on the in the GRAM4 >>>> service container and that is causing delays which seem to be >>>> causing multiple problems. We're looking at this in a separate >>>> effort with the GT Core team. But, after this was clear, Martin >>>> re- >>>> ran the condor-g test and relied on polling between condor-g and >>>> the GRAM4 service instead of notifications. Jaime, could you >>>> repeat the no-notification test and see the difference in memory? >>>> The changes would be to increase the polling frequency in condor-g >>>> and comment out the subscribe for notification. You could also >>>> comment out the notification listener call(s) too. >>> >>> >>> I did two new sets of tests today. The first used more efficient >>> callback code in the GAHP (one notification consumer rather than one >>> per job). The second disabled notifications and relied on polling >>> for job status changes. >>> >>> The more efficient callback code did not produce a noticeable >>> reduction in memory usage. >>> >>> Disabling notifications did reduce memory usage. The maximum jvm >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum >>> heap usage after job submission and before job completion was about >>> 4 megs + 0.1 megs per job. >> >> >> I ran one more test with the improved callback code. This time, I >> stopped storing the notification producer EPRs associated with the >> GRAM job resources. Memory usage went down markedly. >> >> I was told the client had to explicitly destroy these serve-side >> notification producer resources when it destroys the job, otherwise >> they hang around bogging down the server. Is this still the case? The >> server can't destroy notification producers when their sources of >> information are destroyed? >> > > This reminds me of the odd fact that i had to suddenly grant much more > memory to Condor-g as soon as condor-g started storing EPRs of > subscription resources to be able to destroy them eventually. > Those EPR's are maybe not so tiny as they look like. > > For 4.0: yes, currently you'll have to store and eventually destroy > subscription resources manually to avoid heaping up persistence data > on the server-side. > For 4.2: no, you won't have to store them. A job resource will > destroy all subscription resources when it's destroyed. > > Overall i suggest to concentrate on 4.2 gram since the "container > hangs in job destruction" problem won't exist anymore. > > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > sense > for us to do the 4.2-related changes in Gahp and hand it to you for > fine-tuning then? > > Martin On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > Mihael: > > That's great, thanks! > > Ian. > > Mihael Hategan wrote: >> I did a 1024 job run today with ws-gram. >> I painted the results here: >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> Seems like client memory per job is about 370k. Which is quite a lot. >> What kinda worries me is that it doesn't seem to go down after the >> jobs >> are done, so maybe there's a memory leak, or maybe the garbage >> collector >> doesn't do any major collections. I'll need to profile this to see >> exactly what we're talking about. 
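(One cheap way to tell those two cases apart -- a real leak versus the collector simply not having run a major collection yet -- is to force full GCs after the run and see whether used heap falls back. The following is a generic JVM sketch, not Swift or GRAM code; the job-submission part is elided.)

    // Rough check: is per-job memory still reachable after the run, or
    // merely waiting for a major collection? Generic Java, illustrative only.
    public class HeapCheck {
        static long usedHeap() {
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }

        public static void main(String[] args) throws InterruptedException {
            long before = usedHeap();
            // ... submit jobs and wait for them all to finish here ...
            long after = usedHeap();
            // System.gc() is only a hint, so ask a few times and pause.
            for (int i = 0; i < 3; i++) {
                System.gc();
                Thread.sleep(500);
            }
            long afterGc = usedHeap();
            System.out.println("used before jobs:   " + before / 1024 + " KB");
            System.out.println("used after jobs:    " + after / 1024 + " KB");
            System.out.println("used after full GC: " + afterGc / 1024 + " KB");
            // If the last figure stays near "after", the per-job memory is
            // still referenced (a leak); if it drops back toward "before",
            // the heap just had not seen a major collection yet.
        }
    }
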
>> >> The container memory is figured by looking at the process in /proc. >> It's >> total memory including shared libraries and things. But libraries >> take a >> fixed amount of space, so a fuzzy correlation can probably be made. >> It >> looks quite similar to the amount of memory eaten on the client side >> (per job). >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time >> the >> jobs are submitted, but the machine itself seems responsive. I have >> yet >> to plot the exact submission time for each job. >> >> So at this point I would recommend trying ws-gram as long as there >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and >> while >> making sure the jvm has enough heap. More than that seems like a >> gamble. >> >> Mihael >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > From hategan at mcs.anl.gov Fri Feb 8 09:46:42 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 09:46:42 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> Message-ID: <1202485602.4800.13.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 09:33 -0600, Stuart Martin wrote: > Mihael, > > Glad to hear things are improved with GRAM4. Lets keep going to have > swift using GRAM4 routinely. You're being a bit assertive there. > > Below is a recent thread that looked at this exact issue with condor- > g. But it is entirely relevant to your use of GRAM4. the 2 issues to > look for are > > 1) your use of notifications > > >> I ran one more test with the improved callback code. This time, I > >> stopped storing the notification producer EPRs associated with the > >> GRAM job resources. Memory usage went down markedly. > > 2) you could avoid notifications and instead do client-side polling > for job state. This has shown to be more reliable than notifications > under heavy loads, condor-g processing 1000s of jobs. These are both hacks. I'm not sure I want to go there. 300K per job is a bit too much considering that swift (which has to consider many more things) has less than 10K overhead per job. > > The core team will be looking at improving notifications once their > other 4.2 deliverables are done. > > -Stu > > Begin forwarded message: > > > From: feller at mcs.anl.gov > > Date: February 1, 2008 9:41:05 AM CST > > To: "Jaime Frey" > > Cc: "Stuart Martin" , "Terrence Martin" > >, "Martin Feller" , "charles bacon" > >, "Suchandra Thapa" , "Rob Gardner" > >, "Jeff Porter" , "Alain Roy" , > > "Todd Tannenbaum" , "Miron Livny" > > > > Subject: Re: Condor-G WS GRAM memory usage > > > >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> > >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >>> > >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >>>> > >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM > >>>>> raised some concerns about memory usage on the client side. I did > >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared > >>>>> to be the primary memory consumer. The GAHP server is a wrapper > >>>>> around the java client libraries for WS GRAM. > >>>>> > >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a > >>>>> time. 
The jobs were 2-minute sleep jobs with minimal data > >>>>> transfer. All of the jobs overlapped in submission and execution. > >>>>> Here is what I've discovered so far. > >>>>> > >>>>> Aside from the heap available to the java code, the jvm used 117 > >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G > >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >>>>> > >>>>> The maximum jvm heap usage (as reported by the garbage collector) > >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >>>>> quiescent (jobs executing, Condor-G waiting for them to complete), > >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >>>>> > >>>>> The only long-term memory per job that I know of in the GAHP is > >>>>> for the notification sink for job status callbacks. 600kb seems a > >>>>> little high for that. Stu, could someone on Globus help us > >>>>> determine if we're using the notification sinks inefficiently? > >>>> > >>>> Martin just looked and for the most part, there is nothing wrong > >>>> with how condor-g manages the callback sink. > >>>> However, one improvement that would reduce the memory used per job > >>>> would be to not have a notification consumer per job. Instead use > >>>> one for all jobs. > >>>> > >>>> Also, Martin recently did some analysis on condor-g stress tests > >>>> and found that notifications are building up on the in the GRAM4 > >>>> service container and that is causing delays which seem to be > >>>> causing multiple problems. We're looking at this in a separate > >>>> effort with the GT Core team. But, after this was clear, Martin > >>>> re- > >>>> ran the condor-g test and relied on polling between condor-g and > >>>> the GRAM4 service instead of notifications. Jaime, could you > >>>> repeat the no-notification test and see the difference in memory? > >>>> The changes would be to increase the polling frequency in condor-g > >>>> and comment out the subscribe for notification. You could also > >>>> comment out the notification listener call(s) too. > >>> > >>> > >>> I did two new sets of tests today. The first used more efficient > >>> callback code in the GAHP (one notification consumer rather than one > >>> per job). The second disabled notifications and relied on polling > >>> for job status changes. > >>> > >>> The more efficient callback code did not produce a noticeable > >>> reduction in memory usage. > >>> > >>> Disabling notifications did reduce memory usage. The maximum jvm > >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum > >>> heap usage after job submission and before job completion was about > >>> 4 megs + 0.1 megs per job. > >> > >> > >> I ran one more test with the improved callback code. This time, I > >> stopped storing the notification producer EPRs associated with the > >> GRAM job resources. Memory usage went down markedly. > >> > >> I was told the client had to explicitly destroy these serve-side > >> notification producer resources when it destroys the job, otherwise > >> they hang around bogging down the server. Is this still the case? The > >> server can't destroy notification producers when their sources of > >> information are destroyed? > >> > > > > This reminds me of the odd fact that i had to suddenly grant much more > > memory to Condor-g as soon as condor-g started storing EPRs of > > subscription resources to be able to destroy them eventually. > > Those EPR's are maybe not so tiny as they look like. 
> > > > For 4.0: yes, currently you'll have to store and eventually destroy > > subscription resources manually to avoid heaping up persistence data > > on the server-side. > > For 4.2: no, you won't have to store them. A job resource will > > destroy all subscription resources when it's destroyed. > > > > Overall i suggest to concentrate on 4.2 gram since the "container > > hangs in job destruction" problem won't exist anymore. > > > > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes > > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > > sense > > for us to do the 4.2-related changes in Gahp and hand it to you for > > fine-tuning then? > > > > Martin > > > > > On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > > > Mihael: > > > > That's great, thanks! > > > > Ian. > > > > Mihael Hategan wrote: > >> I did a 1024 job run today with ws-gram. > >> I painted the results here: > >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> > >> Seems like client memory per job is about 370k. Which is quite a lot. > >> What kinda worries me is that it doesn't seem to go down after the > >> jobs > >> are done, so maybe there's a memory leak, or maybe the garbage > >> collector > >> doesn't do any major collections. I'll need to profile this to see > >> exactly what we're talking about. > >> > >> The container memory is figured by looking at the process in /proc. > >> It's > >> total memory including shared libraries and things. But libraries > >> take a > >> fixed amount of space, so a fuzzy correlation can probably be made. > >> It > >> looks quite similar to the amount of memory eaten on the client side > >> (per job). > >> > >> CPU-load-wise, WS-GRAM behaves. There is some work during the time > >> the > >> jobs are submitted, but the machine itself seems responsive. I have > >> yet > >> to plot the exact submission time for each job. > >> > >> So at this point I would recommend trying ws-gram as long as there > >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and > >> while > >> making sure the jvm has enough heap. More than that seems like a > >> gamble. > >> > >> Mihael > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> > > > From feller at mcs.anl.gov Fri Feb 8 10:09:34 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 10:09:34 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202485602.4800.13.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> Message-ID: <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> > > On Fri, 2008-02-08 at 09:33 -0600, Stuart Martin wrote: >> Mihael, >> >> Glad to hear things are improved with GRAM4. Lets keep going to have >> swift using GRAM4 routinely. > > You're being a bit assertive there. > >> >> Below is a recent thread that looked at this exact issue with condor- >> g. But it is entirely relevant to your use of GRAM4. the 2 issues to >> look for are >> >> 1) your use of notifications >> >> >> I ran one more test with the improved callback code. This time, I >> >> stopped storing the notification producer EPRs associated with the >> >> GRAM job resources. Memory usage went down markedly. 
>> >> 2) you could avoid notifications and instead do client-side polling >> for job state. This has shown to be more reliable than notifications >> under heavy loads, condor-g processing 1000s of jobs. > > These are both hacks. I'm not sure I want to go there. 300K per job is a > bit too much considering that swift (which has to consider many more > things) has less than 10K overhead per job. > For my better understanding: Do you start up your own notification consumer manager that listens for notifications of all jobs or do you let each GramJob instance listen for notifications itself? In case you listen for notifications yourself: do you store GramJob objects or just EPR's of jobs and create GramJob objects if needed? Martin >> >> The core team will be looking at improving notifications once their >> other 4.2 deliverables are done. >> >> -Stu >> >> Begin forwarded message: >> >> > From: feller at mcs.anl.gov >> > Date: February 1, 2008 9:41:05 AM CST >> > To: "Jaime Frey" >> > Cc: "Stuart Martin" , "Terrence Martin" >> > > >, "Martin Feller" , "charles bacon" >> > > >, "Suchandra Thapa" , "Rob Gardner" >> > > >, "Jeff Porter" , "Alain Roy" , >> > "Todd Tannenbaum" , "Miron Livny" >> > > > >> > Subject: Re: Condor-G WS GRAM memory usage >> > >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >>>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM >> >>>>> raised some concerns about memory usage on the client side. I did >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper >> >>>>> around the java client libraries for WS GRAM. >> >>>>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data >> >>>>> transfer. All of the jobs overlapped in submission and execution. >> >>>>> Here is what I've discovered so far. >> >>>>> >> >>>>> Aside from the heap available to the java code, the jvm used 117 >> >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >> >>>>> >> >>>>> The maximum jvm heap usage (as reported by the garbage collector) >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >> >>>>> quiescent (jobs executing, Condor-G waiting for them to complete), >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >>>>> >> >>>>> The only long-term memory per job that I know of in the GAHP is >> >>>>> for the notification sink for job status callbacks. 600kb seems a >> >>>>> little high for that. Stu, could someone on Globus help us >> >>>>> determine if we're using the notification sinks inefficiently? >> >>>> >> >>>> Martin just looked and for the most part, there is nothing wrong >> >>>> with how condor-g manages the callback sink. >> >>>> However, one improvement that would reduce the memory used per job >> >>>> would be to not have a notification consumer per job. Instead use >> >>>> one for all jobs. >> >>>> >> >>>> Also, Martin recently did some analysis on condor-g stress tests >> >>>> and found that notifications are building up on the in the GRAM4 >> >>>> service container and that is causing delays which seem to be >> >>>> causing multiple problems. We're looking at this in a separate >> >>>> effort with the GT Core team. 
But, after this was clear, Martin >> >>>> re- >> >>>> ran the condor-g test and relied on polling between condor-g and >> >>>> the GRAM4 service instead of notifications. Jaime, could you >> >>>> repeat the no-notification test and see the difference in memory? >> >>>> The changes would be to increase the polling frequency in condor-g >> >>>> and comment out the subscribe for notification. You could also >> >>>> comment out the notification listener call(s) too. >> >>> >> >>> >> >>> I did two new sets of tests today. The first used more efficient >> >>> callback code in the GAHP (one notification consumer rather than one >> >>> per job). The second disabled notifications and relied on polling >> >>> for job status changes. >> >>> >> >>> The more efficient callback code did not produce a noticeable >> >>> reduction in memory usage. >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum jvm >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum >> >>> heap usage after job submission and before job completion was about >> >>> 4 megs + 0.1 megs per job. >> >> >> >> >> >> I ran one more test with the improved callback code. This time, I >> >> stopped storing the notification producer EPRs associated with the >> >> GRAM job resources. Memory usage went down markedly. >> >> >> >> I was told the client had to explicitly destroy these serve-side >> >> notification producer resources when it destroys the job, otherwise >> >> they hang around bogging down the server. Is this still the case? The >> >> server can't destroy notification producers when their sources of >> >> information are destroyed? >> >> >> > >> > This reminds me of the odd fact that i had to suddenly grant much more >> > memory to Condor-g as soon as condor-g started storing EPRs of >> > subscription resources to be able to destroy them eventually. >> > Those EPR's are maybe not so tiny as they look like. >> > >> > For 4.0: yes, currently you'll have to store and eventually destroy >> > subscription resources manually to avoid heaping up persistence data >> > on the server-side. >> > For 4.2: no, you won't have to store them. A job resource will >> > destroy all subscription resources when it's destroyed. >> > >> > Overall i suggest to concentrate on 4.2 gram since the "container >> > hangs in job destruction" problem won't exist anymore. >> > >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes >> > sense >> > for us to do the 4.2-related changes in Gahp and hand it to you for >> > fine-tuning then? >> > >> > Martin >> >> >> >> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> > Mihael: >> > >> > That's great, thanks! >> > >> > Ian. >> > >> > Mihael Hategan wrote: >> >> I did a 1024 job run today with ws-gram. >> >> I painted the results here: >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> >> >> Seems like client memory per job is about 370k. Which is quite a lot. >> >> What kinda worries me is that it doesn't seem to go down after the >> >> jobs >> >> are done, so maybe there's a memory leak, or maybe the garbage >> >> collector >> >> doesn't do any major collections. I'll need to profile this to see >> >> exactly what we're talking about. >> >> >> >> The container memory is figured by looking at the process in /proc. >> >> It's >> >> total memory including shared libraries and things. 
But libraries >> >> take a >> >> fixed amount of space, so a fuzzy correlation can probably be made. >> >> It >> >> looks quite similar to the amount of memory eaten on the client side >> >> (per job). >> >> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time >> >> the >> >> jobs are submitted, but the machine itself seems responsive. I have >> >> yet >> >> to plot the exact submission time for each job. >> >> >> >> So at this point I would recommend trying ws-gram as long as there >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and >> >> while >> >> making sure the jvm has enough heap. More than that seems like a >> >> gamble. >> >> >> >> Mihael >> >> >> >> _______________________________________________ >> >> Swift-devel mailing list >> >> Swift-devel at ci.uchicago.edu >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >> >> >> > >> > > From hategan at mcs.anl.gov Fri Feb 8 10:18:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 10:18:13 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202487494.5642.7.camel@blabla.mcs.anl.gov> > > > > These are both hacks. I'm not sure I want to go there. 300K per job is a > > bit too much considering that swift (which has to consider many more > > things) has less than 10K overhead per job. > > > > > For my better understanding: > Do you start up your own notification consumer manager that listens for > notifications of all jobs or do you let each GramJob instance listen for > notifications itself? > In case you listen for notifications yourself: do you store > GramJob objects or just EPR's of jobs and create GramJob objects if > needed? Excellent points. I let each GramJob instance listen for notifications itself. What I observed is that it uses only one container for that. Due to the above, a reference to the GramJob is kept anyway, regardless of whether that reference is in client code or the local container. I'll try to profile a run and see if I can spot where the problems are. > > Martin > > >> > >> The core team will be looking at improving notifications once their > >> other 4.2 deliverables are done. > >> > >> -Stu > >> > >> Begin forwarded message: > >> > >> > From: feller at mcs.anl.gov > >> > Date: February 1, 2008 9:41:05 AM CST > >> > To: "Jaime Frey" > >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >> > >, "Martin Feller" , "charles bacon" > >> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >> > >, "Jeff Porter" , "Alain Roy" , > >> > "Todd Tannenbaum" , "Miron Livny" > >> >> > > > >> > Subject: Re: Condor-G WS GRAM memory usage > >> > > >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >> > >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >>> > >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >>>> > >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM > >> >>>>> raised some concerns about memory usage on the client side. I did > >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared > >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper > >> >>>>> around the java client libraries for WS GRAM. 
> >> >>>>> > >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a > >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >>>>> transfer. All of the jobs overlapped in submission and execution. > >> >>>>> Here is what I've discovered so far. > >> >>>>> > >> >>>>> Aside from the heap available to the java code, the jvm used 117 > >> >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G > >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >>>>> > >> >>>>> The maximum jvm heap usage (as reported by the garbage collector) > >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >>>>> quiescent (jobs executing, Condor-G waiting for them to complete), > >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >>>>> > >> >>>>> The only long-term memory per job that I know of in the GAHP is > >> >>>>> for the notification sink for job status callbacks. 600kb seems a > >> >>>>> little high for that. Stu, could someone on Globus help us > >> >>>>> determine if we're using the notification sinks inefficiently? > >> >>>> > >> >>>> Martin just looked and for the most part, there is nothing wrong > >> >>>> with how condor-g manages the callback sink. > >> >>>> However, one improvement that would reduce the memory used per job > >> >>>> would be to not have a notification consumer per job. Instead use > >> >>>> one for all jobs. > >> >>>> > >> >>>> Also, Martin recently did some analysis on condor-g stress tests > >> >>>> and found that notifications are building up on the in the GRAM4 > >> >>>> service container and that is causing delays which seem to be > >> >>>> causing multiple problems. We're looking at this in a separate > >> >>>> effort with the GT Core team. But, after this was clear, Martin > >> >>>> re- > >> >>>> ran the condor-g test and relied on polling between condor-g and > >> >>>> the GRAM4 service instead of notifications. Jaime, could you > >> >>>> repeat the no-notification test and see the difference in memory? > >> >>>> The changes would be to increase the polling frequency in condor-g > >> >>>> and comment out the subscribe for notification. You could also > >> >>>> comment out the notification listener call(s) too. > >> >>> > >> >>> > >> >>> I did two new sets of tests today. The first used more efficient > >> >>> callback code in the GAHP (one notification consumer rather than one > >> >>> per job). The second disabled notifications and relied on polling > >> >>> for job status changes. > >> >>> > >> >>> The more efficient callback code did not produce a noticeable > >> >>> reduction in memory usage. > >> >>> > >> >>> Disabling notifications did reduce memory usage. The maximum jvm > >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum > >> >>> heap usage after job submission and before job completion was about > >> >>> 4 megs + 0.1 megs per job. > >> >> > >> >> > >> >> I ran one more test with the improved callback code. This time, I > >> >> stopped storing the notification producer EPRs associated with the > >> >> GRAM job resources. Memory usage went down markedly. > >> >> > >> >> I was told the client had to explicitly destroy these serve-side > >> >> notification producer resources when it destroys the job, otherwise > >> >> they hang around bogging down the server. Is this still the case? The > >> >> server can't destroy notification producers when their sources of > >> >> information are destroyed? 
> >> >> > >> > > >> > This reminds me of the odd fact that i had to suddenly grant much more > >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> > subscription resources to be able to destroy them eventually. > >> > Those EPR's are maybe not so tiny as they look like. > >> > > >> > For 4.0: yes, currently you'll have to store and eventually destroy > >> > subscription resources manually to avoid heaping up persistence data > >> > on the server-side. > >> > For 4.2: no, you won't have to store them. A job resource will > >> > destroy all subscription resources when it's destroyed. > >> > > >> > Overall i suggest to concentrate on 4.2 gram since the "container > >> > hangs in job destruction" problem won't exist anymore. > >> > > >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes > >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > >> > sense > >> > for us to do the 4.2-related changes in Gahp and hand it to you for > >> > fine-tuning then? > >> > > >> > Martin > >> > >> > >> > >> > >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> > >> > Mihael: > >> > > >> > That's great, thanks! > >> > > >> > Ian. > >> > > >> > Mihael Hategan wrote: > >> >> I did a 1024 job run today with ws-gram. > >> >> I painted the results here: > >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >> > >> >> Seems like client memory per job is about 370k. Which is quite a lot. > >> >> What kinda worries me is that it doesn't seem to go down after the > >> >> jobs > >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >> collector > >> >> doesn't do any major collections. I'll need to profile this to see > >> >> exactly what we're talking about. > >> >> > >> >> The container memory is figured by looking at the process in /proc. > >> >> It's > >> >> total memory including shared libraries and things. But libraries > >> >> take a > >> >> fixed amount of space, so a fuzzy correlation can probably be made. > >> >> It > >> >> looks quite similar to the amount of memory eaten on the client side > >> >> (per job). > >> >> > >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time > >> >> the > >> >> jobs are submitted, but the machine itself seems responsive. I have > >> >> yet > >> >> to plot the exact submission time for each job. > >> >> > >> >> So at this point I would recommend trying ws-gram as long as there > >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and > >> >> while > >> >> making sure the jvm has enough heap. More than that seems like a > >> >> gamble. > >> >> > >> >> Mihael > >> >> > >> >> _______________________________________________ > >> >> Swift-devel mailing list > >> >> Swift-devel at ci.uchicago.edu > >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >> > >> >> > >> > > >> > > > > > > From feller at mcs.anl.gov Fri Feb 8 10:26:26 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 10:26:26 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202487494.5642.7.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> Message-ID: <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> >> > >> > These are both hacks. 
I'm not sure I want to go there. 300K per job is >> a >> > bit too much considering that swift (which has to consider many more >> > things) has less than 10K overhead per job. >> > >> >> >> For my better understanding: >> Do you start up your own notification consumer manager that listens for >> notifications of all jobs or do you let each GramJob instance listen for >> notifications itself? >> In case you listen for notifications yourself: do you store >> GramJob objects or just EPR's of jobs and create GramJob objects if >> needed? > > Excellent points. I let each GramJob instance listen for notifications > itself. What I observed is that it uses only one container for that. > Shoot! i didn't know that and thought there would be a container per GramJob in that case. That's the core mysteries with notifications. Anyway: I did a quick check some days ago and found that GramJob is surprisingly greedy regarding memory as you said. I'll have to further check what it is, but will probably not do that before 4.2 is out. > Due to the above, a reference to the GramJob is kept anyway, regardless > of whether that reference is in client code or the local container. > > I'll try to profile a run and see if I can spot where the problems are. > >> >> Martin >> >> >> >> >> The core team will be looking at improving notifications once their >> >> other 4.2 deliverables are done. >> >> >> >> -Stu >> >> >> >> Begin forwarded message: >> >> >> >> > From: feller at mcs.anl.gov >> >> > Date: February 1, 2008 9:41:05 AM CST >> >> > To: "Jaime Frey" >> >> > Cc: "Stuart Martin" , "Terrence Martin" >> >> > >> > >, "Martin Feller" , "charles bacon" >> >> > >> > >, "Suchandra Thapa" , "Rob Gardner" >> >> > >> > >, "Jeff Porter" , "Alain Roy" , >> >> > "Todd Tannenbaum" , "Miron Livny" >> >> > >> > > >> >> > Subject: Re: Condor-G WS GRAM memory usage >> >> > >> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> >> >> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >> >>> >> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >> >>>> >> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM >> >> >>>>> raised some concerns about memory usage on the client side. I >> did >> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >> appeared >> >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper >> >> >>>>> around the java client libraries for WS GRAM. >> >> >>>>> >> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a >> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data >> >> >>>>> transfer. All of the jobs overlapped in submission and >> execution. >> >> >>>>> Here is what I've discovered so far. >> >> >>>>> >> >> >>>>> Aside from the heap available to the java code, the jvm used >> 117 >> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >> Condor-G >> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >> >> >>>>> >> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> collector) >> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >> complete), >> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >> >>>>> >> >> >>>>> The only long-term memory per job that I know of in the GAHP is >> >> >>>>> for the notification sink for job status callbacks. 600kb seems >> a >> >> >>>>> little high for that. 
Stu, could someone on Globus help us >> >> >>>>> determine if we're using the notification sinks inefficiently? >> >> >>>> >> >> >>>> Martin just looked and for the most part, there is nothing wrong >> >> >>>> with how condor-g manages the callback sink. >> >> >>>> However, one improvement that would reduce the memory used per >> job >> >> >>>> would be to not have a notification consumer per job. Instead >> use >> >> >>>> one for all jobs. >> >> >>>> >> >> >>>> Also, Martin recently did some analysis on condor-g stress tests >> >> >>>> and found that notifications are building up on the in the GRAM4 >> >> >>>> service container and that is causing delays which seem to be >> >> >>>> causing multiple problems. We're looking at this in a separate >> >> >>>> effort with the GT Core team. But, after this was clear, Martin >> >> >>>> re- >> >> >>>> ran the condor-g test and relied on polling between condor-g and >> >> >>>> the GRAM4 service instead of notifications. Jaime, could you >> >> >>>> repeat the no-notification test and see the difference in >> memory? >> >> >>>> The changes would be to increase the polling frequency in >> condor-g >> >> >>>> and comment out the subscribe for notification. You could also >> >> >>>> comment out the notification listener call(s) too. >> >> >>> >> >> >>> >> >> >>> I did two new sets of tests today. The first used more efficient >> >> >>> callback code in the GAHP (one notification consumer rather than >> one >> >> >>> per job). The second disabled notifications and relied on polling >> >> >>> for job status changes. >> >> >>> >> >> >>> The more efficient callback code did not produce a noticeable >> >> >>> reduction in memory usage. >> >> >>> >> >> >>> Disabling notifications did reduce memory usage. The maximum jvm >> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum >> >> >>> heap usage after job submission and before job completion was >> about >> >> >>> 4 megs + 0.1 megs per job. >> >> >> >> >> >> >> >> >> I ran one more test with the improved callback code. This time, I >> >> >> stopped storing the notification producer EPRs associated with the >> >> >> GRAM job resources. Memory usage went down markedly. >> >> >> >> >> >> I was told the client had to explicitly destroy these serve-side >> >> >> notification producer resources when it destroys the job, >> otherwise >> >> >> they hang around bogging down the server. Is this still the case? >> The >> >> >> server can't destroy notification producers when their sources of >> >> >> information are destroyed? >> >> >> >> >> > >> >> > This reminds me of the odd fact that i had to suddenly grant much >> more >> >> > memory to Condor-g as soon as condor-g started storing EPRs of >> >> > subscription resources to be able to destroy them eventually. >> >> > Those EPR's are maybe not so tiny as they look like. >> >> > >> >> > For 4.0: yes, currently you'll have to store and eventually destroy >> >> > subscription resources manually to avoid heaping up persistence >> data >> >> > on the server-side. >> >> > For 4.2: no, you won't have to store them. A job resource will >> >> > destroy all subscription resources when it's destroyed. >> >> > >> >> > Overall i suggest to concentrate on 4.2 gram since the "container >> >> > hangs in job destruction" problem won't exist anymore. >> >> > >> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 >> changes >> >> > in Gram in 4.2. I'll do so as soon as i can. 
I wonder if it makes >> >> > sense >> >> > for us to do the 4.2-related changes in Gahp and hand it to you for >> >> > fine-tuning then? >> >> > >> >> > Martin >> >> >> >> >> >> >> >> >> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> >> >> > Mihael: >> >> > >> >> > That's great, thanks! >> >> > >> >> > Ian. >> >> > >> >> > Mihael Hategan wrote: >> >> >> I did a 1024 job run today with ws-gram. >> >> >> I painted the results here: >> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> >> >> >> >> Seems like client memory per job is about 370k. Which is quite a >> lot. >> >> >> What kinda worries me is that it doesn't seem to go down after the >> >> >> jobs >> >> >> are done, so maybe there's a memory leak, or maybe the garbage >> >> >> collector >> >> >> doesn't do any major collections. I'll need to profile this to see >> >> >> exactly what we're talking about. >> >> >> >> >> >> The container memory is figured by looking at the process in >> /proc. >> >> >> It's >> >> >> total memory including shared libraries and things. But libraries >> >> >> take a >> >> >> fixed amount of space, so a fuzzy correlation can probably be >> made. >> >> >> It >> >> >> looks quite similar to the amount of memory eaten on the client >> side >> >> >> (per job). >> >> >> >> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time >> >> >> the >> >> >> jobs are submitted, but the machine itself seems responsive. I >> have >> >> >> yet >> >> >> to plot the exact submission time for each job. >> >> >> >> >> >> So at this point I would recommend trying ws-gram as long as there >> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and >> >> >> while >> >> >> making sure the jvm has enough heap. More than that seems like a >> >> >> gamble. >> >> >> >> >> >> Mihael >> >> >> >> >> >> _______________________________________________ >> >> >> Swift-devel mailing list >> >> >> Swift-devel at ci.uchicago.edu >> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >> >> >> >> >> >> > >> >> >> > >> > >> >> > > From benc at hawaga.org.uk Fri Feb 8 10:18:26 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 16:18:26 +0000 (GMT) Subject: [Swift-devel] behaviour on out-of-memory Message-ID: On my laptop when I run swift with so many jobs that it runs out of memory, it gives the below errors and hangs. It doesn't seem to exit. That's icky for using this in any automated environment. $ swift -tc.file ./tc.data -sites.file ./sites.xml badmonkey.swift -goodmonkeys=10000 Swift v0.3-dev r1609 (modified locally) RunID: 20080208-1015-5h5huekc Exception in thread "Worker 0" Exception in thread "Timer-0" Exception in thread "Worker 2" java.lang.OutOfMemoryError: Java heap space Exception in thread "Worker 3" java.lang.OutOfMemoryError: Java heap space Exception in thread "Worker 1" java.lang.OutOfMemoryError: Java heap space From hategan at mcs.anl.gov Fri Feb 8 11:11:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:11:13 -0600 Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: References: Message-ID: <1202490673.8302.3.camel@blabla.mcs.anl.gov> Yep. Hard problem. In general, OOMs are tricky to handle. I was thinking of pre-allocating some space to use in such cases for clean shutdown, but given the concurrency, this may or may not work properly. 
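A minimal sketch of that pre-allocation idea, in generic Java rather than Swift's actual worker classes; whether dropping a reserve buffer frees enough headroom for a clean shutdown once several threads are already failing is exactly the open question here.

    // Sketch: keep a reserve buffer that is released when an OOM is caught,
    // hopefully leaving enough heap to log, cancel work and exit with an
    // error code. Class and method names are illustrative only.
    public class OomGuard {
        // A few MB held purely so it can be given back on OOM.
        private static byte[] reserve = new byte[4 * 1024 * 1024];

        public static void runGuarded(Runnable work) {
            try {
                work.run();
            } catch (OutOfMemoryError oom) {
                reserve = null;   // return some headroom to the heap
                System.gc();
                System.err.println("out of memory, shutting down");
                // attempt whatever cleanup is still possible, then fail hard
                System.exit(1);
            }
        }
    }

Other threads can still hit the OOM first, so at best this improves the odds of a clean exit; it is not a guarantee.
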
On Fri, 2008-02-08 at 16:18 +0000, Ben Clifford wrote: > On my laptop when I run swift with so many jobs that it runs out of > memory, it gives the below errors and hangs. It doesn't seem to exit. > That's icky for using this in any automated environment. > > $ swift -tc.file ./tc.data -sites.file ./sites.xml badmonkey.swift > -goodmonkeys=10000 > Swift v0.3-dev r1609 (modified locally) > > RunID: 20080208-1015-5h5huekc > Exception in thread "Worker 0" Exception in thread "Timer-0" Exception in > thread "Worker 2" java.lang.OutOfMemoryError: Java heap space > Exception in thread "Worker 3" java.lang.OutOfMemoryError: Java heap space > Exception in thread "Worker 1" java.lang.OutOfMemoryError: Java heap space > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Feb 8 11:16:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:16:30 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202490990.8302.9.camel@blabla.mcs.anl.gov> > Shoot! i didn't know that and thought there would be a container per > GramJob in that case. Yep. There was even a bug, not sure if it was fixed, that would mess up the port for that container on subsequent requests (basically a second sequential job would start the container on 8443 instead of whatever was in the port range). > That's the core mysteries with notifications. > Anyway: I did a quick check some days ago and found that GramJob is > surprisingly greedy regarding memory as you said. I'll have to further > check what it is, but will probably not do that before 4.2 is out. I'll try to profile it today. You should get a license for YJP so that you can look at the snapshots I might come up with. It's free for OSS projects (just point them to the globus page that has your name). > > > > Due to the above, a reference to the GramJob is kept anyway, regardless > > of whether that reference is in client code or the local container. > > > > I'll try to profile a run and see if I can spot where the problems are. > > > >> > >> Martin > >> > >> >> > >> >> The core team will be looking at improving notifications once their > >> >> other 4.2 deliverables are done. 
> >> >> > >> >> -Stu > >> >> > >> >> Begin forwarded message: > >> >> > >> >> > From: feller at mcs.anl.gov > >> >> > Date: February 1, 2008 9:41:05 AM CST > >> >> > To: "Jaime Frey" > >> >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >> >> >> > >, "Martin Feller" , "charles bacon" > >> >> >> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >> >> >> > >, "Jeff Porter" , "Alain Roy" , > >> >> > "Todd Tannenbaum" , "Miron Livny" > >> >> >> >> > > > >> >> > Subject: Re: Condor-G WS GRAM memory usage > >> >> > > >> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >> >> > >> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >> >>> > >> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >> >>>> > >> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM > >> >> >>>>> raised some concerns about memory usage on the client side. I > >> did > >> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >> appeared > >> >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper > >> >> >>>>> around the java client libraries for WS GRAM. > >> >> >>>>> > >> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a > >> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >> >>>>> transfer. All of the jobs overlapped in submission and > >> execution. > >> >> >>>>> Here is what I've discovered so far. > >> >> >>>>> > >> >> >>>>> Aside from the heap available to the java code, the jvm used > >> 117 > >> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >> Condor-G > >> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >> >>>>> > >> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> collector) > >> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >> complete), > >> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >> >>>>> > >> >> >>>>> The only long-term memory per job that I know of in the GAHP is > >> >> >>>>> for the notification sink for job status callbacks. 600kb seems > >> a > >> >> >>>>> little high for that. Stu, could someone on Globus help us > >> >> >>>>> determine if we're using the notification sinks inefficiently? > >> >> >>>> > >> >> >>>> Martin just looked and for the most part, there is nothing wrong > >> >> >>>> with how condor-g manages the callback sink. > >> >> >>>> However, one improvement that would reduce the memory used per > >> job > >> >> >>>> would be to not have a notification consumer per job. Instead > >> use > >> >> >>>> one for all jobs. > >> >> >>>> > >> >> >>>> Also, Martin recently did some analysis on condor-g stress tests > >> >> >>>> and found that notifications are building up on the in the GRAM4 > >> >> >>>> service container and that is causing delays which seem to be > >> >> >>>> causing multiple problems. We're looking at this in a separate > >> >> >>>> effort with the GT Core team. But, after this was clear, Martin > >> >> >>>> re- > >> >> >>>> ran the condor-g test and relied on polling between condor-g and > >> >> >>>> the GRAM4 service instead of notifications. Jaime, could you > >> >> >>>> repeat the no-notification test and see the difference in > >> memory? > >> >> >>>> The changes would be to increase the polling frequency in > >> condor-g > >> >> >>>> and comment out the subscribe for notification. You could also > >> >> >>>> comment out the notification listener call(s) too. 
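(For reference, the no-notification approach described above amounts to a polling loop of roughly the following shape. fetchState() stands in for whatever query-the-job-resource call the client library provides; it and the 30-second interval are purely illustrative, not the real condor-g or WS-GRAM API.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: poll each job's state on a timer instead of subscribing
    // for notifications.
    public class StatePoller {
        private final Map<String, String> lastState =
            new ConcurrentHashMap<String, String>();
        private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

        public void watch(final String jobId) {
            timer.scheduleWithFixedDelay(new Runnable() {
                public void run() {
                    String state = fetchState(jobId);   // hypothetical remote query
                    String previous = lastState.put(jobId, state);
                    if (!state.equals(previous)) {
                        onStateChange(jobId, state);
                    }
                }
            }, 0, 30, TimeUnit.SECONDS);
        }

        String fetchState(String jobId) { return "Active"; }        // stub
        void onStateChange(String jobId, String state) { /* bookkeeping */ }
    }
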
> >> >> >>> > >> >> >>> > >> >> >>> I did two new sets of tests today. The first used more efficient > >> >> >>> callback code in the GAHP (one notification consumer rather than > >> one > >> >> >>> per job). The second disabled notifications and relied on polling > >> >> >>> for job status changes. > >> >> >>> > >> >> >>> The more efficient callback code did not produce a noticeable > >> >> >>> reduction in memory usage. > >> >> >>> > >> >> >>> Disabling notifications did reduce memory usage. The maximum jvm > >> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum > >> >> >>> heap usage after job submission and before job completion was > >> about > >> >> >>> 4 megs + 0.1 megs per job. > >> >> >> > >> >> >> > >> >> >> I ran one more test with the improved callback code. This time, I > >> >> >> stopped storing the notification producer EPRs associated with the > >> >> >> GRAM job resources. Memory usage went down markedly. > >> >> >> > >> >> >> I was told the client had to explicitly destroy these serve-side > >> >> >> notification producer resources when it destroys the job, > >> otherwise > >> >> >> they hang around bogging down the server. Is this still the case? > >> The > >> >> >> server can't destroy notification producers when their sources of > >> >> >> information are destroyed? > >> >> >> > >> >> > > >> >> > This reminds me of the odd fact that i had to suddenly grant much > >> more > >> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> >> > subscription resources to be able to destroy them eventually. > >> >> > Those EPR's are maybe not so tiny as they look like. > >> >> > > >> >> > For 4.0: yes, currently you'll have to store and eventually destroy > >> >> > subscription resources manually to avoid heaping up persistence > >> data > >> >> > on the server-side. > >> >> > For 4.2: no, you won't have to store them. A job resource will > >> >> > destroy all subscription resources when it's destroyed. > >> >> > > >> >> > Overall i suggest to concentrate on 4.2 gram since the "container > >> >> > hangs in job destruction" problem won't exist anymore. > >> >> > > >> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >> changes > >> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > >> >> > sense > >> >> > for us to do the 4.2-related changes in Gahp and hand it to you for > >> >> > fine-tuning then? > >> >> > > >> >> > Martin > >> >> > >> >> > >> >> > >> >> > >> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> >> > >> >> > Mihael: > >> >> > > >> >> > That's great, thanks! > >> >> > > >> >> > Ian. > >> >> > > >> >> > Mihael Hategan wrote: > >> >> >> I did a 1024 job run today with ws-gram. > >> >> >> I painted the results here: > >> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >> >> > >> >> >> Seems like client memory per job is about 370k. Which is quite a > >> lot. > >> >> >> What kinda worries me is that it doesn't seem to go down after the > >> >> >> jobs > >> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >> >> collector > >> >> >> doesn't do any major collections. I'll need to profile this to see > >> >> >> exactly what we're talking about. > >> >> >> > >> >> >> The container memory is figured by looking at the process in > >> /proc. > >> >> >> It's > >> >> >> total memory including shared libraries and things. But libraries > >> >> >> take a > >> >> >> fixed amount of space, so a fuzzy correlation can probably be > >> made. 
> >> >> >> It > >> >> >> looks quite similar to the amount of memory eaten on the client > >> side > >> >> >> (per job). > >> >> >> > >> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time > >> >> >> the > >> >> >> jobs are submitted, but the machine itself seems responsive. I > >> have > >> >> >> yet > >> >> >> to plot the exact submission time for each job. > >> >> >> > >> >> >> So at this point I would recommend trying ws-gram as long as there > >> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and > >> >> >> while > >> >> >> making sure the jvm has enough heap. More than that seems like a > >> >> >> gamble. > >> >> >> > >> >> >> Mihael > >> >> >> > >> >> >> _______________________________________________ > >> >> >> Swift-devel mailing list > >> >> >> Swift-devel at ci.uchicago.edu > >> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >> >> > >> >> >> > >> >> > > >> >> > >> > > >> > > >> > >> > > > > > > From benc at hawaga.org.uk Fri Feb 8 11:19:37 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 17:19:37 +0000 (GMT) Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: <1202490673.8302.3.camel@blabla.mcs.anl.gov> References: <1202490673.8302.3.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 8 Feb 2008, Mihael Hategan wrote: > Yep. Hard problem. In general, OOMs are tricky to handle. I was thinking > of pre-allocating some space to use in such cases for clean shutdown, > but given the concurrency, this may or may not work properly. For my purposes, I don't really need anything cleaner than the JVM exiting with an error code sometime around the memory running out. I hacked in a try/catch around karajan's EventWorker.run() which is catching enough for me at the moment. -- From feller at mcs.anl.gov Fri Feb 8 11:19:40 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 11:19:40 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> Message-ID: <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> Mihael, i think i found the memory hole in GramJob. 100 jobs in a test of mine consumed about 23MB (constantly growing) before the fix and 8MB (very slowly growing) after the fix. The big part of that (7MB) is used right from the first job which may be the NotificationConsumerManager. Will commit that change soon to 4.0 branch and you may try it then. Are you using 4.0.x in your tests? Martin >>> > >>> > These are both hacks. I'm not sure I want to go there. 300K per job >>> is >>> a >>> > bit too much considering that swift (which has to consider many more >>> > things) has less than 10K overhead per job. >>> > >>> >>> >>> For my better understanding: >>> Do you start up your own notification consumer manager that listens for >>> notifications of all jobs or do you let each GramJob instance listen >>> for >>> notifications itself? >>> In case you listen for notifications yourself: do you store >>> GramJob objects or just EPR's of jobs and create GramJob objects if >>> needed? >> >> Excellent points. 
I let each GramJob instance listen for notifications >> itself. What I observed is that it uses only one container for that. >> > > Shoot! i didn't know that and thought there would be a container per > GramJob in that case. That's the core mysteries with notifications. > Anyway: I did a quick check some days ago and found that GramJob is > surprisingly greedy regarding memory as you said. I'll have to further > check what it is, but will probably not do that before 4.2 is out. > > >> Due to the above, a reference to the GramJob is kept anyway, regardless >> of whether that reference is in client code or the local container. >> >> I'll try to profile a run and see if I can spot where the problems are. >> >>> >>> Martin >>> >>> >> >>> >> The core team will be looking at improving notifications once their >>> >> other 4.2 deliverables are done. >>> >> >>> >> -Stu >>> >> >>> >> Begin forwarded message: >>> >> >>> >> > From: feller at mcs.anl.gov >>> >> > Date: February 1, 2008 9:41:05 AM CST >>> >> > To: "Jaime Frey" >>> >> > Cc: "Stuart Martin" , "Terrence Martin" >>> >> >> >> > >, "Martin Feller" , "charles bacon" >>> >> >> >> > >, "Suchandra Thapa" , "Rob Gardner" >>> >> >> >> > >, "Jeff Porter" , "Alain Roy" >>> , >>> >> > "Todd Tannenbaum" , "Miron Livny" >>> >> >> >> > > >>> >> > Subject: Re: Condor-G WS GRAM memory usage >>> >> > >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >>> >> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >>> >> >>> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >>> >> >>>> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM >>> >> >>>>> raised some concerns about memory usage on the client side. I >>> did >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >>> appeared >>> >> >>>>> to be the primary memory consumer. The GAHP server is a >>> wrapper >>> >> >>>>> around the java client libraries for WS GRAM. >>> >> >>>>> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at >>> a >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data >>> >> >>>>> transfer. All of the jobs overlapped in submission and >>> execution. >>> >> >>>>> Here is what I've discovered so far. >>> >> >>>>> >>> >> >>>>> Aside from the heap available to the java code, the jvm used >>> 117 >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >>> Condor-G >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >>> >> >>>>> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >>> collector) >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >>> complete), >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >>> >> >>>>> >>> >> >>>>> The only long-term memory per job that I know of in the GAHP >>> is >>> >> >>>>> for the notification sink for job status callbacks. 600kb >>> seems >>> a >>> >> >>>>> little high for that. Stu, could someone on Globus help us >>> >> >>>>> determine if we're using the notification sinks inefficiently? >>> >> >>>> >>> >> >>>> Martin just looked and for the most part, there is nothing >>> wrong >>> >> >>>> with how condor-g manages the callback sink. >>> >> >>>> However, one improvement that would reduce the memory used per >>> job >>> >> >>>> would be to not have a notification consumer per job. Instead >>> use >>> >> >>>> one for all jobs. 
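(The shared-consumer arrangement suggested there has roughly this shape: one listener endpoint, with incoming messages dispatched to per-job records by job id. This is a structural sketch only and does not use the real WS-GRAM or WS-Notification classes.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: one notification sink shared by all jobs, instead of one
    // consumer (and its supporting machinery) allocated per job.
    public class SharedNotificationSink {
        public interface JobCallback {
            void stateChanged(String newState);
        }

        // Per-job callbacks, keyed by job id (or EPR string).
        private final Map<String, JobCallback> jobs =
            new ConcurrentHashMap<String, JobCallback>();

        public void register(String jobId, JobCallback cb) {
            jobs.put(jobId, cb);
        }

        public void unregister(String jobId) {
            jobs.remove(jobId);   // drop the reference once the job is done
        }

        // Called by the single listener endpoint for every incoming message.
        public void deliver(String jobId, String newState) {
            JobCallback cb = jobs.get(jobId);
            if (cb != null) {
                cb.stateChanged(newState);
            }
        }
    }
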
>>> >> >>>> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress >>> tests >>> >> >>>> and found that notifications are building up on the in the >>> GRAM4 >>> >> >>>> service container and that is causing delays which seem to be >>> >> >>>> causing multiple problems. We're looking at this in a separate >>> >> >>>> effort with the GT Core team. But, after this was clear, >>> Martin >>> >> >>>> re- >>> >> >>>> ran the condor-g test and relied on polling between condor-g >>> and >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could you >>> >> >>>> repeat the no-notification test and see the difference in >>> memory? >>> >> >>>> The changes would be to increase the polling frequency in >>> condor-g >>> >> >>>> and comment out the subscribe for notification. You could also >>> >> >>>> comment out the notification listener call(s) too. >>> >> >>> >>> >> >>> >>> >> >>> I did two new sets of tests today. The first used more efficient >>> >> >>> callback code in the GAHP (one notification consumer rather than >>> one >>> >> >>> per job). The second disabled notifications and relied on >>> polling >>> >> >>> for job status changes. >>> >> >>> >>> >> >>> The more efficient callback code did not produce a noticeable >>> >> >>> reduction in memory usage. >>> >> >>> >>> >> >>> Disabling notifications did reduce memory usage. The maximum jvm >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum >>> >> >>> heap usage after job submission and before job completion was >>> about >>> >> >>> 4 megs + 0.1 megs per job. >>> >> >> >>> >> >> >>> >> >> I ran one more test with the improved callback code. This time, I >>> >> >> stopped storing the notification producer EPRs associated with >>> the >>> >> >> GRAM job resources. Memory usage went down markedly. >>> >> >> >>> >> >> I was told the client had to explicitly destroy these serve-side >>> >> >> notification producer resources when it destroys the job, >>> otherwise >>> >> >> they hang around bogging down the server. Is this still the case? >>> The >>> >> >> server can't destroy notification producers when their sources of >>> >> >> information are destroyed? >>> >> >> >>> >> > >>> >> > This reminds me of the odd fact that i had to suddenly grant much >>> more >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of >>> >> > subscription resources to be able to destroy them eventually. >>> >> > Those EPR's are maybe not so tiny as they look like. >>> >> > >>> >> > For 4.0: yes, currently you'll have to store and eventually >>> destroy >>> >> > subscription resources manually to avoid heaping up persistence >>> data >>> >> > on the server-side. >>> >> > For 4.2: no, you won't have to store them. A job resource will >>> >> > destroy all subscription resources when it's destroyed. >>> >> > >>> >> > Overall i suggest to concentrate on 4.2 gram since the "container >>> >> > hangs in job destruction" problem won't exist anymore. >>> >> > >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 >>> changes >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes >>> >> > sense >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you >>> for >>> >> > fine-tuning then? >>> >> > >>> >> > Martin >>> >> >>> >> >>> >> >>> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >>> >> >>> >> > Mihael: >>> >> > >>> >> > That's great, thanks! >>> >> > >>> >> > Ian. 
>>> >> > >>> >> > Mihael Hategan wrote: >>> >> >> I did a 1024 job run today with ws-gram. >>> >> >> I painted the results here: >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >>> >> >> >>> >> >> Seems like client memory per job is about 370k. Which is quite a >>> lot. >>> >> >> What kinda worries me is that it doesn't seem to go down after >>> the >>> >> >> jobs >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage >>> >> >> collector >>> >> >> doesn't do any major collections. I'll need to profile this to >>> see >>> >> >> exactly what we're talking about. >>> >> >> >>> >> >> The container memory is figured by looking at the process in >>> /proc. >>> >> >> It's >>> >> >> total memory including shared libraries and things. But libraries >>> >> >> take a >>> >> >> fixed amount of space, so a fuzzy correlation can probably be >>> made. >>> >> >> It >>> >> >> looks quite similar to the amount of memory eaten on the client >>> side >>> >> >> (per job). >>> >> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the >>> time >>> >> >> the >>> >> >> jobs are submitted, but the machine itself seems responsive. I >>> have >>> >> >> yet >>> >> >> to plot the exact submission time for each job. >>> >> >> >>> >> >> So at this point I would recommend trying ws-gram as long as >>> there >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), >>> and >>> >> >> while >>> >> >> making sure the jvm has enough heap. More than that seems like a >>> >> >> gamble. >>> >> >> >>> >> >> Mihael >>> >> >> >>> >> >> _______________________________________________ >>> >> >> Swift-devel mailing list >>> >> >> Swift-devel at ci.uchicago.edu >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> >>> >> >> >>> >> > >>> >> >>> > >>> > >>> >>> >> >> > > > From benc at hawaga.org.uk Fri Feb 8 11:22:08 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 17:22:08 +0000 (GMT) Subject: [Swift-devel] local provider maximum simultaneous jobs Message-ID: I'd like to make it so out-of-the-box the localhost site does not try to run more than a handful of jobs at once - in almost any case, that is the desired behaviour, I think. There's no documented per-site profile entry for rate limiting like this. Is there a secret one? -- From hategan at mcs.anl.gov Fri Feb 8 11:24:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:24:43 -0600 Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: References: <1202490673.8302.3.camel@blabla.mcs.anl.gov> Message-ID: <1202491483.9045.4.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 17:19 +0000, Ben Clifford wrote: > > On Fri, 8 Feb 2008, Mihael Hategan wrote: > > > Yep. Hard problem. In general, OOMs are tricky to handle. I was thinking > > of pre-allocating some space to use in such cases for clean shutdown, > > but given the concurrency, this may or may not work properly. > > For my purposes, I don't really need anything cleaner than the JVM exiting > with an error code sometime around the memory running out. Not correct semantics when swift acts as a service (think I2U2). I should probably find a way to immediately cancel a whole workflow when OOMs are caught so that client software can un-reference it and eventually get back to stability. But again, not having enough memory may cause arbitrary breakage in arbitrary threads, so it's hard to guarantee consistency after such a thing. So let's keep chatting, maybe something will come up. 
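For concreteness, the pre-allocation idea mentioned above can be sketched in a few lines of plain Java. This is only an illustration of the pattern, not Swift or Karajan code; the class name, the 1 MB reserve size and the exit code are invented here, and, as noted, concurrency may still defeat it because other threads can fail before the guarded one does:

    public class OomGuard {
        // Reserve a little heap up front; drop it when an OOM is caught so
        // that the shutdown path has some memory left to work with.
        private static byte[] reserve = new byte[1024 * 1024];

        public static void runGuarded(Runnable body) {
            try {
                body.run();
            } catch (OutOfMemoryError oom) {
                reserve = null;   // free the reserve for the cleanup code
                System.err.println("Out of memory, shutting down: " + oom);
                // cancel the running workflow / notify the embedding service
                // here, then exit as a last resort
                System.exit(3);
            }
        }
    }

Whether the catch block should exit the JVM or merely cancel the workflow is exactly the standalone-versus-service question being discussed in this thread.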
> > I hacked in a try/catch around karajan's EventWorker.run() which is > catching enough for me at the moment. Normally it should generate a fault and propagate it up the call stack, but that may itself require memory. > From hategan at mcs.anl.gov Fri Feb 8 11:27:29 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:27:29 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202491649.9045.8.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > Mihael, > > i think i found the memory hole in GramJob. > 100 jobs in a test of mine consumed about 23MB (constantly > growing) before the fix and 8MB (very slowly growing) after > the fix. The big part of that (7MB) is used right from the > first job which may be the NotificationConsumerManager. > Will commit that change soon to 4.0 branch and you may try > it then. > Are you using 4.0.x in your tests? Yes. If there are no API changes, you can send me the jar file. I don't have enough knowledge to selectively build WS-GRAM, nor enough disk space to build the whole GT. > > Martin > > >>> > > >>> > These are both hacks. I'm not sure I want to go there. 300K per job > >>> is > >>> a > >>> > bit too much considering that swift (which has to consider many more > >>> > things) has less than 10K overhead per job. > >>> > > >>> > >>> > >>> For my better understanding: > >>> Do you start up your own notification consumer manager that listens for > >>> notifications of all jobs or do you let each GramJob instance listen > >>> for > >>> notifications itself? > >>> In case you listen for notifications yourself: do you store > >>> GramJob objects or just EPR's of jobs and create GramJob objects if > >>> needed? > >> > >> Excellent points. I let each GramJob instance listen for notifications > >> itself. What I observed is that it uses only one container for that. > >> > > > > Shoot! i didn't know that and thought there would be a container per > > GramJob in that case. That's the core mysteries with notifications. > > Anyway: I did a quick check some days ago and found that GramJob is > > surprisingly greedy regarding memory as you said. I'll have to further > > check what it is, but will probably not do that before 4.2 is out. > > > > > >> Due to the above, a reference to the GramJob is kept anyway, regardless > >> of whether that reference is in client code or the local container. > >> > >> I'll try to profile a run and see if I can spot where the problems are. > >> > >>> > >>> Martin > >>> > >>> >> > >>> >> The core team will be looking at improving notifications once their > >>> >> other 4.2 deliverables are done. 
> >>> >> > >>> >> -Stu > >>> >> > >>> >> Begin forwarded message: > >>> >> > >>> >> > From: feller at mcs.anl.gov > >>> >> > Date: February 1, 2008 9:41:05 AM CST > >>> >> > To: "Jaime Frey" > >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > >>> >> >>> >> > >, "Martin Feller" , "charles bacon" > >>> >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > >>> >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >>> , > >>> >> > "Todd Tannenbaum" , "Miron Livny" > >>> >> >>> >> > > > >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >>> >> > > >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >>> >> >> > >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >>> >> >>> > >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >>> >> >>>> > >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM > >>> >> >>>>> raised some concerns about memory usage on the client side. I > >>> did > >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >>> appeared > >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > >>> wrapper > >>> >> >>>>> around the java client libraries for WS GRAM. > >>> >> >>>>> > >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at > >>> a > >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >>> execution. > >>> >> >>>>> Here is what I've discovered so far. > >>> >> >>>>> > >>> >> >>>>> Aside from the heap available to the java code, the jvm used > >>> 117 > >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >>> Condor-G > >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >>> >> >>>>> > >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >>> collector) > >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >>> complete), > >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >>> >> >>>>> > >>> >> >>>>> The only long-term memory per job that I know of in the GAHP > >>> is > >>> >> >>>>> for the notification sink for job status callbacks. 600kb > >>> seems > >>> a > >>> >> >>>>> little high for that. Stu, could someone on Globus help us > >>> >> >>>>> determine if we're using the notification sinks inefficiently? > >>> >> >>>> > >>> >> >>>> Martin just looked and for the most part, there is nothing > >>> wrong > >>> >> >>>> with how condor-g manages the callback sink. > >>> >> >>>> However, one improvement that would reduce the memory used per > >>> job > >>> >> >>>> would be to not have a notification consumer per job. Instead > >>> use > >>> >> >>>> one for all jobs. > >>> >> >>>> > >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > >>> tests > >>> >> >>>> and found that notifications are building up on the in the > >>> GRAM4 > >>> >> >>>> service container and that is causing delays which seem to be > >>> >> >>>> causing multiple problems. We're looking at this in a separate > >>> >> >>>> effort with the GT Core team. But, after this was clear, > >>> Martin > >>> >> >>>> re- > >>> >> >>>> ran the condor-g test and relied on polling between condor-g > >>> and > >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could you > >>> >> >>>> repeat the no-notification test and see the difference in > >>> memory? 
> >>> >> >>>> The changes would be to increase the polling frequency in > >>> condor-g > >>> >> >>>> and comment out the subscribe for notification. You could also > >>> >> >>>> comment out the notification listener call(s) too. > >>> >> >>> > >>> >> >>> > >>> >> >>> I did two new sets of tests today. The first used more efficient > >>> >> >>> callback code in the GAHP (one notification consumer rather than > >>> one > >>> >> >>> per job). The second disabled notifications and relied on > >>> polling > >>> >> >>> for job status changes. > >>> >> >>> > >>> >> >>> The more efficient callback code did not produce a noticeable > >>> >> >>> reduction in memory usage. > >>> >> >>> > >>> >> >>> Disabling notifications did reduce memory usage. The maximum jvm > >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum > >>> >> >>> heap usage after job submission and before job completion was > >>> about > >>> >> >>> 4 megs + 0.1 megs per job. > >>> >> >> > >>> >> >> > >>> >> >> I ran one more test with the improved callback code. This time, I > >>> >> >> stopped storing the notification producer EPRs associated with > >>> the > >>> >> >> GRAM job resources. Memory usage went down markedly. > >>> >> >> > >>> >> >> I was told the client had to explicitly destroy these serve-side > >>> >> >> notification producer resources when it destroys the job, > >>> otherwise > >>> >> >> they hang around bogging down the server. Is this still the case? > >>> The > >>> >> >> server can't destroy notification producers when their sources of > >>> >> >> information are destroyed? > >>> >> >> > >>> >> > > >>> >> > This reminds me of the odd fact that i had to suddenly grant much > >>> more > >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >>> >> > subscription resources to be able to destroy them eventually. > >>> >> > Those EPR's are maybe not so tiny as they look like. > >>> >> > > >>> >> > For 4.0: yes, currently you'll have to store and eventually > >>> destroy > >>> >> > subscription resources manually to avoid heaping up persistence > >>> data > >>> >> > on the server-side. > >>> >> > For 4.2: no, you won't have to store them. A job resource will > >>> >> > destroy all subscription resources when it's destroyed. > >>> >> > > >>> >> > Overall i suggest to concentrate on 4.2 gram since the "container > >>> >> > hangs in job destruction" problem won't exist anymore. > >>> >> > > >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >>> changes > >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > >>> >> > sense > >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > >>> for > >>> >> > fine-tuning then? > >>> >> > > >>> >> > Martin > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >>> >> > >>> >> > Mihael: > >>> >> > > >>> >> > That's great, thanks! > >>> >> > > >>> >> > Ian. > >>> >> > > >>> >> > Mihael Hategan wrote: > >>> >> >> I did a 1024 job run today with ws-gram. > >>> >> >> I painted the results here: > >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >>> >> >> > >>> >> >> Seems like client memory per job is about 370k. Which is quite a > >>> lot. > >>> >> >> What kinda worries me is that it doesn't seem to go down after > >>> the > >>> >> >> jobs > >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >>> >> >> collector > >>> >> >> doesn't do any major collections. 
I'll need to profile this to > >>> see > >>> >> >> exactly what we're talking about. > >>> >> >> > >>> >> >> The container memory is figured by looking at the process in > >>> /proc. > >>> >> >> It's > >>> >> >> total memory including shared libraries and things. But libraries > >>> >> >> take a > >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > >>> made. > >>> >> >> It > >>> >> >> looks quite similar to the amount of memory eaten on the client > >>> side > >>> >> >> (per job). > >>> >> >> > >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > >>> time > >>> >> >> the > >>> >> >> jobs are submitted, but the machine itself seems responsive. I > >>> have > >>> >> >> yet > >>> >> >> to plot the exact submission time for each job. > >>> >> >> > >>> >> >> So at this point I would recommend trying ws-gram as long as > >>> there > >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > >>> and > >>> >> >> while > >>> >> >> making sure the jvm has enough heap. More than that seems like a > >>> >> >> gamble. > >>> >> >> > >>> >> >> Mihael > >>> >> >> > >>> >> >> _______________________________________________ > >>> >> >> Swift-devel mailing list > >>> >> >> Swift-devel at ci.uchicago.edu > >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> >> >> > >>> >> >> > >>> >> > > >>> >> > >>> > > >>> > > >>> > >>> > >> > >> > > > > > > > > From hategan at mcs.anl.gov Fri Feb 8 11:28:23 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:28:23 -0600 Subject: [Swift-devel] local provider maximum simultaneous jobs In-Reply-To: References: Message-ID: <1202491703.9045.10.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 17:22 +0000, Ben Clifford wrote: > I'd like to make it so out-of-the-box the localhost site does not try to > run more than a handful of jobs at once - in almost any case, that is the > desired behaviour, I think. > > There's no documented per-site profile entry for rate limiting like this. > Is there a secret one? Yep. It involves writing some Java code ;) > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Feb 8 11:34:20 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:34:20 -0600 Subject: [Swift-devel] local provider maximum simultaneous jobs In-Reply-To: <1202491703.9045.10.camel@blabla.mcs.anl.gov> References: <1202491703.9045.10.camel@blabla.mcs.anl.gov> Message-ID: <1202492061.9775.1.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 11:28 -0600, Mihael Hategan wrote: > On Fri, 2008-02-08 at 17:22 +0000, Ben Clifford wrote: > > I'd like to make it so out-of-the-box the localhost site does not try to > > run more than a handful of jobs at once - in almost any case, that is the > > desired behaviour, I think. > > > > There's no documented per-site profile entry for rate limiting like this. > > Is there a secret one? > > Yep. It involves writing some Java code ;) I'd say file a bug report and I'll probably get to it next week, since I'll be playing with the scheduler anyway to put the gram responsiveness stuff in. 
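What such a profile entry would ultimately control is just a cap on the number of local jobs started concurrently. A minimal sketch of that cap, assuming a hypothetical LocalJobThrottle class rather than the actual scheduler code Mihael refers to:

    public class LocalJobThrottle {
        private final int max;
        private int running = 0;

        public LocalJobThrottle(int max) {
            this.max = max;
        }

        // Block the submitting thread until one of the slots is free.
        public synchronized void acquire() throws InterruptedException {
            while (running >= max) {
                wait();
            }
            running++;
        }

        // Call when a local job finishes or fails.
        public synchronized void release() {
            running--;
            notify();
        }
    }

Each local job submission would call acquire() before forking the process and release() on completion, so that out of the box only a handful of jobs run at once; the per-site profile entry would just set max.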
> > > > > -- > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri Feb 8 11:36:21 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 17:36:21 +0000 (GMT) Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: <1202491483.9045.4.camel@blabla.mcs.anl.gov> References: <1202490673.8302.3.camel@blabla.mcs.anl.gov> <1202491483.9045.4.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 8 Feb 2008, Mihael Hategan wrote: > Not correct semantics when swift acts as a service (think I2U2). I > should probably find a way to immediately cancel a whole workflow when > OOMs are caught so that client software can un-reference it and > eventually get back to stability. But again, not having enough memory > may cause arbitrary breakage in arbitrary threads, so it's hard to > guarantee consistency after such a thing. My philosophy, which is sort of backed up by the javadocs, is that OOM Errors are a signal that the JVM is so broken that it cannot continue - its the end of the universe as far as the JVM is concerned and there's nothing you can do. If you're so foolish as to run something (eg Swift) in your web server JVM that puts the JVM into that state, then sucker to you! cf. javadoc VirtualMachineError: > Thrown to indicate that the Java Virtual Machine is broken or has run > out of resources necessary for it to continue operating. There's a bunch more memory management stuff in java 5 (eg the MXBeans) which are perhaps interesting - eg. when memory gets low, stop doing certain things / more cleanly abort select pieces of what lives in the JVM. -- From hategan at mcs.anl.gov Fri Feb 8 11:44:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:44:03 -0600 Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: References: <1202490673.8302.3.camel@blabla.mcs.anl.gov> <1202491483.9045.4.camel@blabla.mcs.anl.gov> Message-ID: <1202492643.10121.4.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 17:36 +0000, Ben Clifford wrote: > > On Fri, 8 Feb 2008, Mihael Hategan wrote: > > > Not correct semantics when swift acts as a service (think I2U2). I > > should probably find a way to immediately cancel a whole workflow when > > OOMs are caught so that client software can un-reference it and > > eventually get back to stability. But again, not having enough memory > > may cause arbitrary breakage in arbitrary threads, so it's hard to > > guarantee consistency after such a thing. > > My philosophy, which is sort of backed up by the javadocs, is that OOM > Errors are a signal that the JVM is so broken that it cannot continue - > its the end of the universe as far as the JVM is concerned and there's > nothing you can do. If you're so foolish as to run something (eg Swift) in > your web server JVM that puts the JVM into that state, then sucker to you! Yes and no. I there are cases when one can safely deal with it and other cases when it's ok to let it provide partial functionality. I don't want to definitely do/say one thing or the other at this point. I've had the same argument with Jarek (or rather the reverse argument). The WSRF container catches OOMs and does some cleanup and continues. I said it shouldn't be done. 
When you have an OOM it's safer to have no service than to risk nondeterministic behavior or even potential security problems. So yes, I also happen to agree with you besides disagreeing with you. > > cf. javadoc VirtualMachineError: > > > Thrown to indicate that the Java Virtual Machine is broken or has run > > out of resources necessary for it to continue operating. > > There's a bunch more memory management stuff in java 5 (eg the MXBeans) > which are perhaps interesting - eg. when memory gets low, stop doing > certain things / more cleanly abort select pieces of what lives in the > JVM. > Hmm. Interesting. I have to look at that. It may be time to slowly move towards java 5. From feller at mcs.anl.gov Fri Feb 8 13:21:29 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 13:21:29 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202491649.9045.8.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> Message-ID: <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> Try the attached 4.0 compliant jar in your tests by dropping it in your 4.0.x $GLOBUS_LOCATION/lib. My tests showed about 2MB memory increase per 100 GramJob objects which sounds to me like a reasonable number (about 20k per GramJob object ignoring the notification consumer manager in one job - if my calculations are right) Martin > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> Mihael, >> >> i think i found the memory hole in GramJob. >> 100 jobs in a test of mine consumed about 23MB (constantly >> growing) before the fix and 8MB (very slowly growing) after >> the fix. The big part of that (7MB) is used right from the >> first job which may be the NotificationConsumerManager. >> Will commit that change soon to 4.0 branch and you may try >> it then. >> Are you using 4.0.x in your tests? > > Yes. If there are no API changes, you can send me the jar file. I don't > have enough knowledge to selectively build WS-GRAM, nor enough disk > space to build the whole GT. > >> >> Martin >> >> >>> > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per >> job >> >>> is >> >>> a >> >>> > bit too much considering that swift (which has to consider many >> more >> >>> > things) has less than 10K overhead per job. >> >>> > >> >>> >> >>> >> >>> For my better understanding: >> >>> Do you start up your own notification consumer manager that listens >> for >> >>> notifications of all jobs or do you let each GramJob instance listen >> >>> for >> >>> notifications itself? >> >>> In case you listen for notifications yourself: do you store >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if >> >>> needed? >> >> >> >> Excellent points. I let each GramJob instance listen for >> notifications >> >> itself. What I observed is that it uses only one container for that. >> >> >> > >> > Shoot! i didn't know that and thought there would be a container per >> > GramJob in that case. That's the core mysteries with notifications. 
>> > Anyway: I did a quick check some days ago and found that GramJob is >> > surprisingly greedy regarding memory as you said. I'll have to further >> > check what it is, but will probably not do that before 4.2 is out. >> > >> > >> >> Due to the above, a reference to the GramJob is kept anyway, >> regardless >> >> of whether that reference is in client code or the local container. >> >> >> >> I'll try to profile a run and see if I can spot where the problems >> are. >> >> >> >>> >> >>> Martin >> >>> >> >>> >> >> >>> >> The core team will be looking at improving notifications once >> their >> >>> >> other 4.2 deliverables are done. >> >>> >> >> >>> >> -Stu >> >>> >> >> >>> >> Begin forwarded message: >> >>> >> >> >>> >> > From: feller at mcs.anl.gov >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> >>> >> > To: "Jaime Frey" >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" >> >>> >> > >>> >> > >, "Martin Feller" , "charles bacon" >> >>> >> > >>> >> > >, "Suchandra Thapa" , "Rob Gardner" >> >>> >> > >>> >> > >, "Jeff Porter" , "Alain Roy" >> >>> , >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> >>> >> > >>> >> > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> >>> >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >>> >> >> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >>> >> >>> >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >>> >> >>>> >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS >> GRAM >> >>> >> >>>>> raised some concerns about memory usage on the client side. >> I >> >>> did >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >> >>> appeared >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a >> >>> wrapper >> >>> >> >>>>> around the java client libraries for WS GRAM. >> >>> >> >>>>> >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 >> at >> >>> a >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and >> >>> execution. >> >>> >> >>>>> Here is what I've discovered so far. >> >>> >> >>>>> >> >>> >> >>>>> Aside from the heap available to the java code, the jvm >> used >> >>> 117 >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >> >>> Condor-G >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >> >>> >> >>>>> >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> >>> collector) >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >> >>> complete), >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >>> >> >>>>> >> >>> >> >>>>> The only long-term memory per job that I know of in the >> GAHP >> >>> is >> >>> >> >>>>> for the notification sink for job status callbacks. 600kb >> >>> seems >> >>> a >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us >> >>> >> >>>>> determine if we're using the notification sinks >> inefficiently? >> >>> >> >>>> >> >>> >> >>>> Martin just looked and for the most part, there is nothing >> >>> wrong >> >>> >> >>>> with how condor-g manages the callback sink. >> >>> >> >>>> However, one improvement that would reduce the memory used >> per >> >>> job >> >>> >> >>>> would be to not have a notification consumer per job. >> Instead >> >>> use >> >>> >> >>>> one for all jobs. 
>> >>> >> >>>> >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress >> >>> tests >> >>> >> >>>> and found that notifications are building up on the in the >> >>> GRAM4 >> >>> >> >>>> service container and that is causing delays which seem to >> be >> >>> >> >>>> causing multiple problems. We're looking at this in a >> separate >> >>> >> >>>> effort with the GT Core team. But, after this was clear, >> >>> Martin >> >>> >> >>>> re- >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g >> >>> and >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could >> you >> >>> >> >>>> repeat the no-notification test and see the difference in >> >>> memory? >> >>> >> >>>> The changes would be to increase the polling frequency in >> >>> condor-g >> >>> >> >>>> and comment out the subscribe for notification. You could >> also >> >>> >> >>>> comment out the notification listener call(s) too. >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> I did two new sets of tests today. The first used more >> efficient >> >>> >> >>> callback code in the GAHP (one notification consumer rather >> than >> >>> one >> >>> >> >>> per job). The second disabled notifications and relied on >> >>> polling >> >>> >> >>> for job status changes. >> >>> >> >>> >> >>> >> >>> The more efficient callback code did not produce a noticeable >> >>> >> >>> reduction in memory usage. >> >>> >> >>> >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum >> jvm >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The >> minimum >> >>> >> >>> heap usage after job submission and before job completion was >> >>> about >> >>> >> >>> 4 megs + 0.1 megs per job. >> >>> >> >> >> >>> >> >> >> >>> >> >> I ran one more test with the improved callback code. This >> time, I >> >>> >> >> stopped storing the notification producer EPRs associated with >> >>> the >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> >>> >> >> >> >>> >> >> I was told the client had to explicitly destroy these >> serve-side >> >>> >> >> notification producer resources when it destroys the job, >> >>> otherwise >> >>> >> >> they hang around bogging down the server. Is this still the >> case? >> >>> The >> >>> >> >> server can't destroy notification producers when their sources >> of >> >>> >> >> information are destroyed? >> >>> >> >> >> >>> >> > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant >> much >> >>> more >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of >> >>> >> > subscription resources to be able to destroy them eventually. >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> >>> >> > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually >> >>> destroy >> >>> >> > subscription resources manually to avoid heaping up persistence >> >>> data >> >>> >> > on the server-side. >> >>> >> > For 4.2: no, you won't have to store them. A job resource will >> >>> >> > destroy all subscription resources when it's destroyed. >> >>> >> > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> "container >> >>> >> > hangs in job destruction" problem won't exist anymore. >> >>> >> > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 >> >>> changes >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it >> makes >> >>> >> > sense >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you >> >>> for >> >>> >> > fine-tuning then? 
>> >>> >> > >> >>> >> > Martin >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >>> >> >> >>> >> > Mihael: >> >>> >> > >> >>> >> > That's great, thanks! >> >>> >> > >> >>> >> > Ian. >> >>> >> > >> >>> >> > Mihael Hategan wrote: >> >>> >> >> I did a 1024 job run today with ws-gram. >> >>> >> >> I painted the results here: >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >>> >> >> >> >>> >> >> Seems like client memory per job is about 370k. Which is quite >> a >> >>> lot. >> >>> >> >> What kinda worries me is that it doesn't seem to go down after >> >>> the >> >>> >> >> jobs >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage >> >>> >> >> collector >> >>> >> >> doesn't do any major collections. I'll need to profile this to >> >>> see >> >>> >> >> exactly what we're talking about. >> >>> >> >> >> >>> >> >> The container memory is figured by looking at the process in >> >>> /proc. >> >>> >> >> It's >> >>> >> >> total memory including shared libraries and things. But >> libraries >> >>> >> >> take a >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be >> >>> made. >> >>> >> >> It >> >>> >> >> looks quite similar to the amount of memory eaten on the >> client >> >>> side >> >>> >> >> (per job). >> >>> >> >> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the >> >>> time >> >>> >> >> the >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I >> >>> have >> >>> >> >> yet >> >>> >> >> to plot the exact submission time for each job. >> >>> >> >> >> >>> >> >> So at this point I would recommend trying ws-gram as long as >> >>> there >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), >> >>> and >> >>> >> >> while >> >>> >> >> making sure the jvm has enough heap. More than that seems like >> a >> >>> >> >> gamble. >> >>> >> >> >> >>> >> >> Mihael >> >>> >> >> >> >>> >> >> _______________________________________________ >> >>> >> >> Swift-devel mailing list >> >>> >> >> Swift-devel at ci.uchicago.edu >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >>> >> >> >> >>> >> >> >> >>> >> > >> >>> >> >> >>> > >> >>> > >> >>> >> >>> >> >> >> >> >> > >> > >> > >> >> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: gram-client.jar Type: application/octet-stream Size: 35825 bytes Desc: not available URL: From hategan at mcs.anl.gov Fri Feb 8 13:29:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 13:29:03 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202498943.15258.4.camel@blabla.mcs.anl.gov> Thanks. I'll give it a try as people head home for the weekend and the heat in the queues is allowed to dissipate. My profiler says that some hefty amount of heap is used by a relatively low number of EndpointReferenceType objects. 
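The per-job figures being traded in this thread (Mihael's roughly 370k per job on the client side, Martin's roughly 20k per GramJob object after the fix) can be cross-checked without a profiler by a crude used-heap measurement. A rough sketch, where makeJob() merely stands in for whatever allocates and retains a GramJob, and the result is only as trustworthy as System.gc() happens to be:

    public class HeapPerObject {
        public static void main(String[] args) throws Exception {
            int n = 100;
            Object[] keep = new Object[n];
            long before = usedHeap();
            for (int i = 0; i < n; i++) {
                keep[i] = makeJob();      // stand-in for creating a GramJob
            }
            long after = usedHeap();
            System.out.println("approx bytes per object: "
                               + (after - before) / n);
        }

        // Force a couple of collections and report used heap.
        static long usedHeap() throws InterruptedException {
            Runtime rt = Runtime.getRuntime();
            rt.gc();
            Thread.sleep(200);
            rt.gc();
            return rt.totalMemory() - rt.freeMemory();
        }

        static Object makeJob() {
            return new byte[20 * 1024];   // placeholder allocation only
        }
    }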
Btw, where do I get the sources for addressing? On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > Try the attached 4.0 compliant jar in your tests by dropping > it in your 4.0.x $GLOBUS_LOCATION/lib. > My tests showed about 2MB memory increase per 100 GramJob > objects which sounds to me like a reasonable number (about 20k > per GramJob object ignoring the notification consumer manager > in one job - if my calculations are right) > > Martin > > > > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > >> Mihael, > >> > >> i think i found the memory hole in GramJob. > >> 100 jobs in a test of mine consumed about 23MB (constantly > >> growing) before the fix and 8MB (very slowly growing) after > >> the fix. The big part of that (7MB) is used right from the > >> first job which may be the NotificationConsumerManager. > >> Will commit that change soon to 4.0 branch and you may try > >> it then. > >> Are you using 4.0.x in your tests? > > > > Yes. If there are no API changes, you can send me the jar file. I don't > > have enough knowledge to selectively build WS-GRAM, nor enough disk > > space to build the whole GT. > > > >> > >> Martin > >> > >> >>> > > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per > >> job > >> >>> is > >> >>> a > >> >>> > bit too much considering that swift (which has to consider many > >> more > >> >>> > things) has less than 10K overhead per job. > >> >>> > > >> >>> > >> >>> > >> >>> For my better understanding: > >> >>> Do you start up your own notification consumer manager that listens > >> for > >> >>> notifications of all jobs or do you let each GramJob instance listen > >> >>> for > >> >>> notifications itself? > >> >>> In case you listen for notifications yourself: do you store > >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if > >> >>> needed? > >> >> > >> >> Excellent points. I let each GramJob instance listen for > >> notifications > >> >> itself. What I observed is that it uses only one container for that. > >> >> > >> > > >> > Shoot! i didn't know that and thought there would be a container per > >> > GramJob in that case. That's the core mysteries with notifications. > >> > Anyway: I did a quick check some days ago and found that GramJob is > >> > surprisingly greedy regarding memory as you said. I'll have to further > >> > check what it is, but will probably not do that before 4.2 is out. > >> > > >> > > >> >> Due to the above, a reference to the GramJob is kept anyway, > >> regardless > >> >> of whether that reference is in client code or the local container. > >> >> > >> >> I'll try to profile a run and see if I can spot where the problems > >> are. > >> >> > >> >>> > >> >>> Martin > >> >>> > >> >>> >> > >> >>> >> The core team will be looking at improving notifications once > >> their > >> >>> >> other 4.2 deliverables are done. 
> >> >>> >> > >> >>> >> -Stu > >> >>> >> > >> >>> >> Begin forwarded message: > >> >>> >> > >> >>> >> > From: feller at mcs.anl.gov > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > >> >>> >> > To: "Jaime Frey" > >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >>> >> >> >>> >> > >, "Martin Feller" , "charles bacon" > >> >>> >> >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >>> >> >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >> >>> , > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > >> >>> >> >> >>> >> > > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >> >>> >> > > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >>> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >>> >> >>> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >>> >> >>>> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS > >> GRAM > >> >>> >> >>>>> raised some concerns about memory usage on the client side. > >> I > >> >>> did > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >> >>> appeared > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > >> >>> wrapper > >> >>> >> >>>>> around the java client libraries for WS GRAM. > >> >>> >> >>>>> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 > >> at > >> >>> a > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >> >>> execution. > >> >>> >> >>>>> Here is what I've discovered so far. > >> >>> >> >>>>> > >> >>> >> >>>>> Aside from the heap available to the java code, the jvm > >> used > >> >>> 117 > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >> >>> Condor-G > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >>> >> >>>>> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> >>> collector) > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >> >>> complete), > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >>> >> >>>>> > >> >>> >> >>>>> The only long-term memory per job that I know of in the > >> GAHP > >> >>> is > >> >>> >> >>>>> for the notification sink for job status callbacks. 600kb > >> >>> seems > >> >>> a > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us > >> >>> >> >>>>> determine if we're using the notification sinks > >> inefficiently? > >> >>> >> >>>> > >> >>> >> >>>> Martin just looked and for the most part, there is nothing > >> >>> wrong > >> >>> >> >>>> with how condor-g manages the callback sink. > >> >>> >> >>>> However, one improvement that would reduce the memory used > >> per > >> >>> job > >> >>> >> >>>> would be to not have a notification consumer per job. > >> Instead > >> >>> use > >> >>> >> >>>> one for all jobs. > >> >>> >> >>>> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > >> >>> tests > >> >>> >> >>>> and found that notifications are building up on the in the > >> >>> GRAM4 > >> >>> >> >>>> service container and that is causing delays which seem to > >> be > >> >>> >> >>>> causing multiple problems. We're looking at this in a > >> separate > >> >>> >> >>>> effort with the GT Core team. 
But, after this was clear, > >> >>> Martin > >> >>> >> >>>> re- > >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g > >> >>> and > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could > >> you > >> >>> >> >>>> repeat the no-notification test and see the difference in > >> >>> memory? > >> >>> >> >>>> The changes would be to increase the polling frequency in > >> >>> condor-g > >> >>> >> >>>> and comment out the subscribe for notification. You could > >> also > >> >>> >> >>>> comment out the notification listener call(s) too. > >> >>> >> >>> > >> >>> >> >>> > >> >>> >> >>> I did two new sets of tests today. The first used more > >> efficient > >> >>> >> >>> callback code in the GAHP (one notification consumer rather > >> than > >> >>> one > >> >>> >> >>> per job). The second disabled notifications and relied on > >> >>> polling > >> >>> >> >>> for job status changes. > >> >>> >> >>> > >> >>> >> >>> The more efficient callback code did not produce a noticeable > >> >>> >> >>> reduction in memory usage. > >> >>> >> >>> > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum > >> jvm > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > >> minimum > >> >>> >> >>> heap usage after job submission and before job completion was > >> >>> about > >> >>> >> >>> 4 megs + 0.1 megs per job. > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> I ran one more test with the improved callback code. This > >> time, I > >> >>> >> >> stopped storing the notification producer EPRs associated with > >> >>> the > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > >> >>> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these > >> serve-side > >> >>> >> >> notification producer resources when it destroys the job, > >> >>> otherwise > >> >>> >> >> they hang around bogging down the server. Is this still the > >> case? > >> >>> The > >> >>> >> >> server can't destroy notification producers when their sources > >> of > >> >>> >> >> information are destroyed? > >> >>> >> >> > >> >>> >> > > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant > >> much > >> >>> more > >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> >>> >> > subscription resources to be able to destroy them eventually. > >> >>> >> > Those EPR's are maybe not so tiny as they look like. > >> >>> >> > > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually > >> >>> destroy > >> >>> >> > subscription resources manually to avoid heaping up persistence > >> >>> data > >> >>> >> > on the server-side. > >> >>> >> > For 4.2: no, you won't have to store them. A job resource will > >> >>> >> > destroy all subscription resources when it's destroyed. > >> >>> >> > > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > >> "container > >> >>> >> > hangs in job destruction" problem won't exist anymore. > >> >>> >> > > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >> >>> changes > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it > >> makes > >> >>> >> > sense > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > >> >>> for > >> >>> >> > fine-tuning then? > >> >>> >> > > >> >>> >> > Martin > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> >>> >> > >> >>> >> > Mihael: > >> >>> >> > > >> >>> >> > That's great, thanks! 
> >> >>> >> > > >> >>> >> > Ian. > >> >>> >> > > >> >>> >> > Mihael Hategan wrote: > >> >>> >> >> I did a 1024 job run today with ws-gram. > >> >>> >> >> I painted the results here: > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >>> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is quite > >> a > >> >>> lot. > >> >>> >> >> What kinda worries me is that it doesn't seem to go down after > >> >>> the > >> >>> >> >> jobs > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >>> >> >> collector > >> >>> >> >> doesn't do any major collections. I'll need to profile this to > >> >>> see > >> >>> >> >> exactly what we're talking about. > >> >>> >> >> > >> >>> >> >> The container memory is figured by looking at the process in > >> >>> /proc. > >> >>> >> >> It's > >> >>> >> >> total memory including shared libraries and things. But > >> libraries > >> >>> >> >> take a > >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > >> >>> made. > >> >>> >> >> It > >> >>> >> >> looks quite similar to the amount of memory eaten on the > >> client > >> >>> side > >> >>> >> >> (per job). > >> >>> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > >> >>> time > >> >>> >> >> the > >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I > >> >>> have > >> >>> >> >> yet > >> >>> >> >> to plot the exact submission time for each job. > >> >>> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long as > >> >>> there > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > >> >>> and > >> >>> >> >> while > >> >>> >> >> making sure the jvm has enough heap. More than that seems like > >> a > >> >>> >> >> gamble. > >> >>> >> >> > >> >>> >> >> Mihael > >> >>> >> >> > >> >>> >> >> _______________________________________________ > >> >>> >> >> Swift-devel mailing list > >> >>> >> >> Swift-devel at ci.uchicago.edu > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >>> >> >> > >> >>> >> >> > >> >>> >> > > >> >>> >> > >> >>> > > >> >>> > > >> >>> > >> >>> > >> >> > >> >> > >> > > >> > > >> > > >> > >> > > > > From feller at mcs.anl.gov Fri Feb 8 13:46:00 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 13:46:00 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202498943.15258.4.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202498943.15258.4.camel@blabla.mcs.anl.gov> Message-ID: <11618.208.54.7.179.1202499960.squirrel@www-unix.mcs.anl.gov> > Thanks. I'll give it a try as people head home for the weekend and the > heat in the queues is allowed to dissipate. > > My profiler says that some hefty amount of heap is used by a relatively > low number of EndpointReferenceType objects. Btw, where do I get the > sources for addressing? It's included as a jar in wsrf, but you can also see the sources by extracting java/lib-src/ws-addressing/ws-addressing.tar.gz of the wsrf package. 
so: cvs co -r globus_4_0_6 wsrf cd wsrf/java/lib-src/ws-addressing/ ... And yes, it seems to be the objects of type EndpointReferenceType. Those seem to be beasts. Rachana once mentioned that they should be trimmed when you get them from the stubs because they contain "SOAP crap". GramJob stored the job-EPR and subscription-EPR as they came from the output from the call to the factory stub. In the new jar trimmed eprs (provided by ObjectSerializer.clone(eprObject)) are stored in GramJob objects instead of the raw ones. Martin > On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: >> Try the attached 4.0 compliant jar in your tests by dropping >> it in your 4.0.x $GLOBUS_LOCATION/lib. >> My tests showed about 2MB memory increase per 100 GramJob >> objects which sounds to me like a reasonable number (about 20k >> per GramJob object ignoring the notification consumer manager >> in one job - if my calculations are right) >> >> Martin >> >> > >> > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> >> Mihael, >> >> >> >> i think i found the memory hole in GramJob. >> >> 100 jobs in a test of mine consumed about 23MB (constantly >> >> growing) before the fix and 8MB (very slowly growing) after >> >> the fix. The big part of that (7MB) is used right from the >> >> first job which may be the NotificationConsumerManager. >> >> Will commit that change soon to 4.0 branch and you may try >> >> it then. >> >> Are you using 4.0.x in your tests? >> > >> > Yes. If there are no API changes, you can send me the jar file. I >> don't >> > have enough knowledge to selectively build WS-GRAM, nor enough disk >> > space to build the whole GT. >> > >> >> >> >> Martin >> >> >> >> >>> > >> >> >>> > These are both hacks. I'm not sure I want to go there. 300K per >> >> job >> >> >>> is >> >> >>> a >> >> >>> > bit too much considering that swift (which has to consider many >> >> more >> >> >>> > things) has less than 10K overhead per job. >> >> >>> > >> >> >>> >> >> >>> >> >> >>> For my better understanding: >> >> >>> Do you start up your own notification consumer manager that >> listens >> >> for >> >> >>> notifications of all jobs or do you let each GramJob instance >> listen >> >> >>> for >> >> >>> notifications itself? >> >> >>> In case you listen for notifications yourself: do you store >> >> >>> GramJob objects or just EPR's of jobs and create GramJob objects >> if >> >> >>> needed? >> >> >> >> >> >> Excellent points. I let each GramJob instance listen for >> >> notifications >> >> >> itself. What I observed is that it uses only one container for >> that. >> >> >> >> >> > >> >> > Shoot! i didn't know that and thought there would be a container >> per >> >> > GramJob in that case. That's the core mysteries with notifications. >> >> > Anyway: I did a quick check some days ago and found that GramJob is >> >> > surprisingly greedy regarding memory as you said. I'll have to >> further >> >> > check what it is, but will probably not do that before 4.2 is out. >> >> > >> >> > >> >> >> Due to the above, a reference to the GramJob is kept anyway, >> >> regardless >> >> >> of whether that reference is in client code or the local >> container. >> >> >> >> >> >> I'll try to profile a run and see if I can spot where the problems >> >> are. >> >> >> >> >> >>> >> >> >>> Martin >> >> >>> >> >> >>> >> >> >> >>> >> The core team will be looking at improving notifications once >> >> their >> >> >>> >> other 4.2 deliverables are done. 
>> >> >>> >> >> >> >>> >> -Stu >> >> >>> >> >> >> >>> >> Begin forwarded message: >> >> >>> >> >> >> >>> >> > From: feller at mcs.anl.gov >> >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> >> >>> >> > To: "Jaime Frey" >> >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" >> >> >>> >> > >> >>> >> > >, "Martin Feller" , "charles bacon" >> >> >>> >> > >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" >> >> >>> >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" >> >> >>> , >> >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> >> >>> >> > >> >>> >> > > >> >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> >> >>> >> > >> >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> >>> >> >> >> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >> >>> >> >>> >> >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >> >>> >> >>>> >> >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with >> WS >> >> GRAM >> >> >>> >> >>>>> raised some concerns about memory usage on the client >> side. >> >> I >> >> >>> did >> >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >> >> >>> appeared >> >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a >> >> >>> wrapper >> >> >>> >> >>>>> around the java client libraries for WS GRAM. >> >> >>> >> >>>>> >> >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to >> 30 >> >> at >> >> >>> a >> >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal >> data >> >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and >> >> >>> execution. >> >> >>> >> >>>>> Here is what I've discovered so far. >> >> >>> >> >>>>> >> >> >>> >> >>>>> Aside from the heap available to the java code, the jvm >> >> used >> >> >>> 117 >> >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >> >> >>> Condor-G >> >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) >> pair. >> >> >>> >> >>>>> >> >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> >> >>> collector) >> >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP >> was >> >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >> >> >>> complete), >> >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >> >>> >> >>>>> >> >> >>> >> >>>>> The only long-term memory per job that I know of in the >> >> GAHP >> >> >>> is >> >> >>> >> >>>>> for the notification sink for job status callbacks. >> 600kb >> >> >>> seems >> >> >>> a >> >> >>> >> >>>>> little high for that. Stu, could someone on Globus help >> us >> >> >>> >> >>>>> determine if we're using the notification sinks >> >> inefficiently? >> >> >>> >> >>>> >> >> >>> >> >>>> Martin just looked and for the most part, there is >> nothing >> >> >>> wrong >> >> >>> >> >>>> with how condor-g manages the callback sink. >> >> >>> >> >>>> However, one improvement that would reduce the memory >> used >> >> per >> >> >>> job >> >> >>> >> >>>> would be to not have a notification consumer per job. >> >> Instead >> >> >>> use >> >> >>> >> >>>> one for all jobs. >> >> >>> >> >>>> >> >> >>> >> >>>> Also, Martin recently did some analysis on condor-g >> stress >> >> >>> tests >> >> >>> >> >>>> and found that notifications are building up on the in >> the >> >> >>> GRAM4 >> >> >>> >> >>>> service container and that is causing delays which seem >> to >> >> be >> >> >>> >> >>>> causing multiple problems. 
We're looking at this in a >> >> separate >> >> >>> >> >>>> effort with the GT Core team. But, after this was clear, >> >> >>> Martin >> >> >>> >> >>>> re- >> >> >>> >> >>>> ran the condor-g test and relied on polling between >> condor-g >> >> >>> and >> >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could >> >> you >> >> >>> >> >>>> repeat the no-notification test and see the difference in >> >> >>> memory? >> >> >>> >> >>>> The changes would be to increase the polling frequency in >> >> >>> condor-g >> >> >>> >> >>>> and comment out the subscribe for notification. You >> could >> >> also >> >> >>> >> >>>> comment out the notification listener call(s) too. >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> I did two new sets of tests today. The first used more >> >> efficient >> >> >>> >> >>> callback code in the GAHP (one notification consumer >> rather >> >> than >> >> >>> one >> >> >>> >> >>> per job). The second disabled notifications and relied on >> >> >>> polling >> >> >>> >> >>> for job status changes. >> >> >>> >> >>> >> >> >>> >> >>> The more efficient callback code did not produce a >> noticeable >> >> >>> >> >>> reduction in memory usage. >> >> >>> >> >>> >> >> >>> >> >>> Disabling notifications did reduce memory usage. The >> maximum >> >> jvm >> >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The >> >> minimum >> >> >>> >> >>> heap usage after job submission and before job completion >> was >> >> >>> about >> >> >>> >> >>> 4 megs + 0.1 megs per job. >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> I ran one more test with the improved callback code. This >> >> time, I >> >> >>> >> >> stopped storing the notification producer EPRs associated >> with >> >> >>> the >> >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> >> >>> >> >> >> >> >>> >> >> I was told the client had to explicitly destroy these >> >> serve-side >> >> >>> >> >> notification producer resources when it destroys the job, >> >> >>> otherwise >> >> >>> >> >> they hang around bogging down the server. Is this still the >> >> case? >> >> >>> The >> >> >>> >> >> server can't destroy notification producers when their >> sources >> >> of >> >> >>> >> >> information are destroyed? >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> > This reminds me of the odd fact that i had to suddenly grant >> >> much >> >> >>> more >> >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs >> of >> >> >>> >> > subscription resources to be able to destroy them >> eventually. >> >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> >> >>> >> > >> >> >>> >> > For 4.0: yes, currently you'll have to store and eventually >> >> >>> destroy >> >> >>> >> > subscription resources manually to avoid heaping up >> persistence >> >> >>> data >> >> >>> >> > on the server-side. >> >> >>> >> > For 4.2: no, you won't have to store them. A job resource >> will >> >> >>> >> > destroy all subscription resources when it's destroyed. >> >> >>> >> > >> >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> >> "container >> >> >>> >> > hangs in job destruction" problem won't exist anymore. >> >> >>> >> > >> >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable >> 4.2 >> >> >>> changes >> >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it >> >> makes >> >> >>> >> > sense >> >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to >> you >> >> >>> for >> >> >>> >> > fine-tuning then? 
>> >> >>> >> > >> >> >>> >> > Martin >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> >>> >> >> >> >>> >> > Mihael: >> >> >>> >> > >> >> >>> >> > That's great, thanks! >> >> >>> >> > >> >> >>> >> > Ian. >> >> >>> >> > >> >> >>> >> > Mihael Hategan wrote: >> >> >>> >> >> I did a 1024 job run today with ws-gram. >> >> >>> >> >> I painted the results here: >> >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> >>> >> >> >> >> >>> >> >> Seems like client memory per job is about 370k. Which is >> quite >> >> a >> >> >>> lot. >> >> >>> >> >> What kinda worries me is that it doesn't seem to go down >> after >> >> >>> the >> >> >>> >> >> jobs >> >> >>> >> >> are done, so maybe there's a memory leak, or maybe the >> garbage >> >> >>> >> >> collector >> >> >>> >> >> doesn't do any major collections. I'll need to profile this >> to >> >> >>> see >> >> >>> >> >> exactly what we're talking about. >> >> >>> >> >> >> >> >>> >> >> The container memory is figured by looking at the process >> in >> >> >>> /proc. >> >> >>> >> >> It's >> >> >>> >> >> total memory including shared libraries and things. But >> >> libraries >> >> >>> >> >> take a >> >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably >> be >> >> >>> made. >> >> >>> >> >> It >> >> >>> >> >> looks quite similar to the amount of memory eaten on the >> >> client >> >> >>> side >> >> >>> >> >> (per job). >> >> >>> >> >> >> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during >> the >> >> >>> time >> >> >>> >> >> the >> >> >>> >> >> jobs are submitted, but the machine itself seems >> responsive. I >> >> >>> have >> >> >>> >> >> yet >> >> >>> >> >> to plot the exact submission time for each job. >> >> >>> >> >> >> >> >>> >> >> So at this point I would recommend trying ws-gram as long >> as >> >> >>> there >> >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel >> jobs), >> >> >>> and >> >> >>> >> >> while >> >> >>> >> >> making sure the jvm has enough heap. More than that seems >> like >> >> a >> >> >>> >> >> gamble. 
>> >> >>> >> >> >> >> >>> >> >> Mihael >> >> >>> >> >> >> >> >>> >> >> _______________________________________________ >> >> >>> >> >> Swift-devel mailing list >> >> >>> >> >> Swift-devel at ci.uchicago.edu >> >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> >> >> >>> > >> >> >>> > >> >> >>> >> >> >>> >> >> >> >> >> >> >> >> > >> >> > >> >> > >> >> >> >> >> > >> > > > From hategan at mcs.anl.gov Fri Feb 8 13:57:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 13:57:53 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202500673.17544.1.camel@blabla.mcs.anl.gov> Won't fly: java.lang.NoClassDefFoundError: org/globus/exec/utils/audit/AuditUtil at org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:952) at org.globus.exec.client.GramJob.submit(GramJob.java:447) On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > Try the attached 4.0 compliant jar in your tests by dropping > it in your 4.0.x $GLOBUS_LOCATION/lib. > My tests showed about 2MB memory increase per 100 GramJob > objects which sounds to me like a reasonable number (about 20k > per GramJob object ignoring the notification consumer manager > in one job - if my calculations are right) > > Martin > > > > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > >> Mihael, > >> > >> i think i found the memory hole in GramJob. > >> 100 jobs in a test of mine consumed about 23MB (constantly > >> growing) before the fix and 8MB (very slowly growing) after > >> the fix. The big part of that (7MB) is used right from the > >> first job which may be the NotificationConsumerManager. > >> Will commit that change soon to 4.0 branch and you may try > >> it then. > >> Are you using 4.0.x in your tests? > > > > Yes. If there are no API changes, you can send me the jar file. I don't > > have enough knowledge to selectively build WS-GRAM, nor enough disk > > space to build the whole GT. > > > >> > >> Martin > >> > >> >>> > > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per > >> job > >> >>> is > >> >>> a > >> >>> > bit too much considering that swift (which has to consider many > >> more > >> >>> > things) has less than 10K overhead per job. > >> >>> > > >> >>> > >> >>> > >> >>> For my better understanding: > >> >>> Do you start up your own notification consumer manager that listens > >> for > >> >>> notifications of all jobs or do you let each GramJob instance listen > >> >>> for > >> >>> notifications itself? > >> >>> In case you listen for notifications yourself: do you store > >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if > >> >>> needed? > >> >> > >> >> Excellent points. I let each GramJob instance listen for > >> notifications > >> >> itself. What I observed is that it uses only one container for that. > >> >> > >> > > >> > Shoot! 
i didn't know that and thought there would be a container per > >> > GramJob in that case. That's the core mysteries with notifications. > >> > Anyway: I did a quick check some days ago and found that GramJob is > >> > surprisingly greedy regarding memory as you said. I'll have to further > >> > check what it is, but will probably not do that before 4.2 is out. > >> > > >> > > >> >> Due to the above, a reference to the GramJob is kept anyway, > >> regardless > >> >> of whether that reference is in client code or the local container. > >> >> > >> >> I'll try to profile a run and see if I can spot where the problems > >> are. > >> >> > >> >>> > >> >>> Martin > >> >>> > >> >>> >> > >> >>> >> The core team will be looking at improving notifications once > >> their > >> >>> >> other 4.2 deliverables are done. > >> >>> >> > >> >>> >> -Stu > >> >>> >> > >> >>> >> Begin forwarded message: > >> >>> >> > >> >>> >> > From: feller at mcs.anl.gov > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > >> >>> >> > To: "Jaime Frey" > >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >>> >> >> >>> >> > >, "Martin Feller" , "charles bacon" > >> >>> >> >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >>> >> >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >> >>> , > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > >> >>> >> >> >>> >> > > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >> >>> >> > > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >>> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >>> >> >>> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >>> >> >>>> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS > >> GRAM > >> >>> >> >>>>> raised some concerns about memory usage on the client side. > >> I > >> >>> did > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >> >>> appeared > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > >> >>> wrapper > >> >>> >> >>>>> around the java client libraries for WS GRAM. > >> >>> >> >>>>> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 > >> at > >> >>> a > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >> >>> execution. > >> >>> >> >>>>> Here is what I've discovered so far. > >> >>> >> >>>>> > >> >>> >> >>>>> Aside from the heap available to the java code, the jvm > >> used > >> >>> 117 > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >> >>> Condor-G > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >>> >> >>>>> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> >>> collector) > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >> >>> complete), > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >>> >> >>>>> > >> >>> >> >>>>> The only long-term memory per job that I know of in the > >> GAHP > >> >>> is > >> >>> >> >>>>> for the notification sink for job status callbacks. 600kb > >> >>> seems > >> >>> a > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us > >> >>> >> >>>>> determine if we're using the notification sinks > >> inefficiently? 
> >> >>> >> >>>> > >> >>> >> >>>> Martin just looked and for the most part, there is nothing > >> >>> wrong > >> >>> >> >>>> with how condor-g manages the callback sink. > >> >>> >> >>>> However, one improvement that would reduce the memory used > >> per > >> >>> job > >> >>> >> >>>> would be to not have a notification consumer per job. > >> Instead > >> >>> use > >> >>> >> >>>> one for all jobs. > >> >>> >> >>>> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > >> >>> tests > >> >>> >> >>>> and found that notifications are building up on the in the > >> >>> GRAM4 > >> >>> >> >>>> service container and that is causing delays which seem to > >> be > >> >>> >> >>>> causing multiple problems. We're looking at this in a > >> separate > >> >>> >> >>>> effort with the GT Core team. But, after this was clear, > >> >>> Martin > >> >>> >> >>>> re- > >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g > >> >>> and > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could > >> you > >> >>> >> >>>> repeat the no-notification test and see the difference in > >> >>> memory? > >> >>> >> >>>> The changes would be to increase the polling frequency in > >> >>> condor-g > >> >>> >> >>>> and comment out the subscribe for notification. You could > >> also > >> >>> >> >>>> comment out the notification listener call(s) too. > >> >>> >> >>> > >> >>> >> >>> > >> >>> >> >>> I did two new sets of tests today. The first used more > >> efficient > >> >>> >> >>> callback code in the GAHP (one notification consumer rather > >> than > >> >>> one > >> >>> >> >>> per job). The second disabled notifications and relied on > >> >>> polling > >> >>> >> >>> for job status changes. > >> >>> >> >>> > >> >>> >> >>> The more efficient callback code did not produce a noticeable > >> >>> >> >>> reduction in memory usage. > >> >>> >> >>> > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum > >> jvm > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > >> minimum > >> >>> >> >>> heap usage after job submission and before job completion was > >> >>> about > >> >>> >> >>> 4 megs + 0.1 megs per job. > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> I ran one more test with the improved callback code. This > >> time, I > >> >>> >> >> stopped storing the notification producer EPRs associated with > >> >>> the > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > >> >>> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these > >> serve-side > >> >>> >> >> notification producer resources when it destroys the job, > >> >>> otherwise > >> >>> >> >> they hang around bogging down the server. Is this still the > >> case? > >> >>> The > >> >>> >> >> server can't destroy notification producers when their sources > >> of > >> >>> >> >> information are destroyed? > >> >>> >> >> > >> >>> >> > > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant > >> much > >> >>> more > >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> >>> >> > subscription resources to be able to destroy them eventually. > >> >>> >> > Those EPR's are maybe not so tiny as they look like. > >> >>> >> > > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually > >> >>> destroy > >> >>> >> > subscription resources manually to avoid heaping up persistence > >> >>> data > >> >>> >> > on the server-side. > >> >>> >> > For 4.2: no, you won't have to store them. 
A job resource will > >> >>> >> > destroy all subscription resources when it's destroyed. > >> >>> >> > > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > >> "container > >> >>> >> > hangs in job destruction" problem won't exist anymore. > >> >>> >> > > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >> >>> changes > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it > >> makes > >> >>> >> > sense > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > >> >>> for > >> >>> >> > fine-tuning then? > >> >>> >> > > >> >>> >> > Martin > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> >>> >> > >> >>> >> > Mihael: > >> >>> >> > > >> >>> >> > That's great, thanks! > >> >>> >> > > >> >>> >> > Ian. > >> >>> >> > > >> >>> >> > Mihael Hategan wrote: > >> >>> >> >> I did a 1024 job run today with ws-gram. > >> >>> >> >> I painted the results here: > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >>> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is quite > >> a > >> >>> lot. > >> >>> >> >> What kinda worries me is that it doesn't seem to go down after > >> >>> the > >> >>> >> >> jobs > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >>> >> >> collector > >> >>> >> >> doesn't do any major collections. I'll need to profile this to > >> >>> see > >> >>> >> >> exactly what we're talking about. > >> >>> >> >> > >> >>> >> >> The container memory is figured by looking at the process in > >> >>> /proc. > >> >>> >> >> It's > >> >>> >> >> total memory including shared libraries and things. But > >> libraries > >> >>> >> >> take a > >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > >> >>> made. > >> >>> >> >> It > >> >>> >> >> looks quite similar to the amount of memory eaten on the > >> client > >> >>> side > >> >>> >> >> (per job). > >> >>> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > >> >>> time > >> >>> >> >> the > >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I > >> >>> have > >> >>> >> >> yet > >> >>> >> >> to plot the exact submission time for each job. > >> >>> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long as > >> >>> there > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > >> >>> and > >> >>> >> >> while > >> >>> >> >> making sure the jvm has enough heap. More than that seems like > >> a > >> >>> >> >> gamble. 
> >> >>> >> >> > >> >>> >> >> Mihael > >> >>> >> >> > >> >>> >> >> _______________________________________________ > >> >>> >> >> Swift-devel mailing list > >> >>> >> >> Swift-devel at ci.uchicago.edu > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >>> >> >> > >> >>> >> >> > >> >>> >> > > >> >>> >> > >> >>> > > >> >>> > > >> >>> > >> >>> > >> >> > >> >> > >> > > >> > > >> > > >> > >> > > > > From feller at mcs.anl.gov Fri Feb 8 14:15:09 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 14:15:09 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202500673.17544.1.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202500673.17544.1.camel@blabla.mcs.anl.gov> Message-ID: <21032.208.54.7.179.1202501709.squirrel@www-unix.mcs.anl.gov> ok, replace all gram jars with the attached ones. > Won't fly: > > java.lang.NoClassDefFoundError: org/globus/exec/utils/audit/AuditUtil > at > org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:952) > at org.globus.exec.client.GramJob.submit(GramJob.java:447) > > On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: >> Try the attached 4.0 compliant jar in your tests by dropping >> it in your 4.0.x $GLOBUS_LOCATION/lib. >> My tests showed about 2MB memory increase per 100 GramJob >> objects which sounds to me like a reasonable number (about 20k >> per GramJob object ignoring the notification consumer manager >> in one job - if my calculations are right) >> >> Martin >> >> > >> > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> >> Mihael, >> >> >> >> i think i found the memory hole in GramJob. >> >> 100 jobs in a test of mine consumed about 23MB (constantly >> >> growing) before the fix and 8MB (very slowly growing) after >> >> the fix. The big part of that (7MB) is used right from the >> >> first job which may be the NotificationConsumerManager. >> >> Will commit that change soon to 4.0 branch and you may try >> >> it then. >> >> Are you using 4.0.x in your tests? >> > >> > Yes. If there are no API changes, you can send me the jar file. I >> don't >> > have enough knowledge to selectively build WS-GRAM, nor enough disk >> > space to build the whole GT. >> > >> >> >> >> Martin >> >> >> >> >>> > >> >> >>> > These are both hacks. I'm not sure I want to go there. 300K per >> >> job >> >> >>> is >> >> >>> a >> >> >>> > bit too much considering that swift (which has to consider many >> >> more >> >> >>> > things) has less than 10K overhead per job. >> >> >>> > >> >> >>> >> >> >>> >> >> >>> For my better understanding: >> >> >>> Do you start up your own notification consumer manager that >> listens >> >> for >> >> >>> notifications of all jobs or do you let each GramJob instance >> listen >> >> >>> for >> >> >>> notifications itself? >> >> >>> In case you listen for notifications yourself: do you store >> >> >>> GramJob objects or just EPR's of jobs and create GramJob objects >> if >> >> >>> needed? >> >> >> >> >> >> Excellent points. 
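[The NoClassDefFoundError above (org/globus/exec/utils/audit/AuditUtil thrown from GramJob.createJobEndpoint) is what the "replace all gram jars" step fixes: the rebuilt client classes apparently reference AuditUtil, so swapping in only one jar leaves it unresolved. A hypothetical pre-flight check, using nothing but standard Java, could confirm the full jar set is on the classpath before starting a large run:]

public class GramJarCheck {
    public static void main(String[] args) {
        try {
            // The class the rebuilt GramJob failed to resolve in the test above.
            Class.forName("org.globus.exec.utils.audit.AuditUtil");
            System.out.println("AuditUtil resolved: updated gram jars appear complete");
        } catch (ClassNotFoundException e) {
            System.out.println("AuditUtil missing: replace all gram jars in "
                    + "$GLOBUS_LOCATION/lib, not just one of them");
        }
    }
}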
I let each GramJob instance listen for >> >> notifications >> >> >> itself. What I observed is that it uses only one container for >> that. >> >> >> >> >> > >> >> > Shoot! i didn't know that and thought there would be a container >> per >> >> > GramJob in that case. That's the core mysteries with notifications. >> >> > Anyway: I did a quick check some days ago and found that GramJob is >> >> > surprisingly greedy regarding memory as you said. I'll have to >> further >> >> > check what it is, but will probably not do that before 4.2 is out. >> >> > >> >> > >> >> >> Due to the above, a reference to the GramJob is kept anyway, >> >> regardless >> >> >> of whether that reference is in client code or the local >> container. >> >> >> >> >> >> I'll try to profile a run and see if I can spot where the problems >> >> are. >> >> >> >> >> >>> >> >> >>> Martin >> >> >>> >> >> >>> >> >> >> >>> >> The core team will be looking at improving notifications once >> >> their >> >> >>> >> other 4.2 deliverables are done. >> >> >>> >> >> >> >>> >> -Stu >> >> >>> >> >> >> >>> >> Begin forwarded message: >> >> >>> >> >> >> >>> >> > From: feller at mcs.anl.gov >> >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> >> >>> >> > To: "Jaime Frey" >> >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" >> >> >>> >> > >> >>> >> > >, "Martin Feller" , "charles bacon" >> >> >>> >> > >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" >> >> >>> >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" >> >> >>> , >> >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> >> >>> >> > >> >>> >> > > >> >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> >> >>> >> > >> >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> >>> >> >> >> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >> >>> >> >>> >> >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >> >>> >> >>>> >> >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with >> WS >> >> GRAM >> >> >>> >> >>>>> raised some concerns about memory usage on the client >> side. >> >> I >> >> >>> did >> >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >> >> >>> appeared >> >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a >> >> >>> wrapper >> >> >>> >> >>>>> around the java client libraries for WS GRAM. >> >> >>> >> >>>>> >> >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to >> 30 >> >> at >> >> >>> a >> >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal >> data >> >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and >> >> >>> execution. >> >> >>> >> >>>>> Here is what I've discovered so far. >> >> >>> >> >>>>> >> >> >>> >> >>>>> Aside from the heap available to the java code, the jvm >> >> used >> >> >>> 117 >> >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >> >> >>> Condor-G >> >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) >> pair. >> >> >>> >> >>>>> >> >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> >> >>> collector) >> >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP >> was >> >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >> >> >>> complete), >> >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >> >>> >> >>>>> >> >> >>> >> >>>>> The only long-term memory per job that I know of in the >> >> GAHP >> >> >>> is >> >> >>> >> >>>>> for the notification sink for job status callbacks. 
>> 600kb >> >> >>> seems >> >> >>> a >> >> >>> >> >>>>> little high for that. Stu, could someone on Globus help >> us >> >> >>> >> >>>>> determine if we're using the notification sinks >> >> inefficiently? >> >> >>> >> >>>> >> >> >>> >> >>>> Martin just looked and for the most part, there is >> nothing >> >> >>> wrong >> >> >>> >> >>>> with how condor-g manages the callback sink. >> >> >>> >> >>>> However, one improvement that would reduce the memory >> used >> >> per >> >> >>> job >> >> >>> >> >>>> would be to not have a notification consumer per job. >> >> Instead >> >> >>> use >> >> >>> >> >>>> one for all jobs. >> >> >>> >> >>>> >> >> >>> >> >>>> Also, Martin recently did some analysis on condor-g >> stress >> >> >>> tests >> >> >>> >> >>>> and found that notifications are building up on the in >> the >> >> >>> GRAM4 >> >> >>> >> >>>> service container and that is causing delays which seem >> to >> >> be >> >> >>> >> >>>> causing multiple problems. We're looking at this in a >> >> separate >> >> >>> >> >>>> effort with the GT Core team. But, after this was clear, >> >> >>> Martin >> >> >>> >> >>>> re- >> >> >>> >> >>>> ran the condor-g test and relied on polling between >> condor-g >> >> >>> and >> >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could >> >> you >> >> >>> >> >>>> repeat the no-notification test and see the difference in >> >> >>> memory? >> >> >>> >> >>>> The changes would be to increase the polling frequency in >> >> >>> condor-g >> >> >>> >> >>>> and comment out the subscribe for notification. You >> could >> >> also >> >> >>> >> >>>> comment out the notification listener call(s) too. >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> I did two new sets of tests today. The first used more >> >> efficient >> >> >>> >> >>> callback code in the GAHP (one notification consumer >> rather >> >> than >> >> >>> one >> >> >>> >> >>> per job). The second disabled notifications and relied on >> >> >>> polling >> >> >>> >> >>> for job status changes. >> >> >>> >> >>> >> >> >>> >> >>> The more efficient callback code did not produce a >> noticeable >> >> >>> >> >>> reduction in memory usage. >> >> >>> >> >>> >> >> >>> >> >>> Disabling notifications did reduce memory usage. The >> maximum >> >> jvm >> >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The >> >> minimum >> >> >>> >> >>> heap usage after job submission and before job completion >> was >> >> >>> about >> >> >>> >> >>> 4 megs + 0.1 megs per job. >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> I ran one more test with the improved callback code. This >> >> time, I >> >> >>> >> >> stopped storing the notification producer EPRs associated >> with >> >> >>> the >> >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> >> >>> >> >> >> >> >>> >> >> I was told the client had to explicitly destroy these >> >> serve-side >> >> >>> >> >> notification producer resources when it destroys the job, >> >> >>> otherwise >> >> >>> >> >> they hang around bogging down the server. Is this still the >> >> case? >> >> >>> The >> >> >>> >> >> server can't destroy notification producers when their >> sources >> >> of >> >> >>> >> >> information are destroyed? >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> > This reminds me of the odd fact that i had to suddenly grant >> >> much >> >> >>> more >> >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs >> of >> >> >>> >> > subscription resources to be able to destroy them >> eventually. 
>> >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> >> >>> >> > >> >> >>> >> > For 4.0: yes, currently you'll have to store and eventually >> >> >>> destroy >> >> >>> >> > subscription resources manually to avoid heaping up >> persistence >> >> >>> data >> >> >>> >> > on the server-side. >> >> >>> >> > For 4.2: no, you won't have to store them. A job resource >> will >> >> >>> >> > destroy all subscription resources when it's destroyed. >> >> >>> >> > >> >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> >> "container >> >> >>> >> > hangs in job destruction" problem won't exist anymore. >> >> >>> >> > >> >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable >> 4.2 >> >> >>> changes >> >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it >> >> makes >> >> >>> >> > sense >> >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to >> you >> >> >>> for >> >> >>> >> > fine-tuning then? >> >> >>> >> > >> >> >>> >> > Martin >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> >>> >> >> >> >>> >> > Mihael: >> >> >>> >> > >> >> >>> >> > That's great, thanks! >> >> >>> >> > >> >> >>> >> > Ian. >> >> >>> >> > >> >> >>> >> > Mihael Hategan wrote: >> >> >>> >> >> I did a 1024 job run today with ws-gram. >> >> >>> >> >> I painted the results here: >> >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> >>> >> >> >> >> >>> >> >> Seems like client memory per job is about 370k. Which is >> quite >> >> a >> >> >>> lot. >> >> >>> >> >> What kinda worries me is that it doesn't seem to go down >> after >> >> >>> the >> >> >>> >> >> jobs >> >> >>> >> >> are done, so maybe there's a memory leak, or maybe the >> garbage >> >> >>> >> >> collector >> >> >>> >> >> doesn't do any major collections. I'll need to profile this >> to >> >> >>> see >> >> >>> >> >> exactly what we're talking about. >> >> >>> >> >> >> >> >>> >> >> The container memory is figured by looking at the process >> in >> >> >>> /proc. >> >> >>> >> >> It's >> >> >>> >> >> total memory including shared libraries and things. But >> >> libraries >> >> >>> >> >> take a >> >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably >> be >> >> >>> made. >> >> >>> >> >> It >> >> >>> >> >> looks quite similar to the amount of memory eaten on the >> >> client >> >> >>> side >> >> >>> >> >> (per job). >> >> >>> >> >> >> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during >> the >> >> >>> time >> >> >>> >> >> the >> >> >>> >> >> jobs are submitted, but the machine itself seems >> responsive. I >> >> >>> have >> >> >>> >> >> yet >> >> >>> >> >> to plot the exact submission time for each job. >> >> >>> >> >> >> >> >>> >> >> So at this point I would recommend trying ws-gram as long >> as >> >> >>> there >> >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel >> jobs), >> >> >>> and >> >> >>> >> >> while >> >> >>> >> >> making sure the jvm has enough heap. More than that seems >> like >> >> a >> >> >>> >> >> gamble. 
>> >> >>> >> >> >> >> >>> >> >> Mihael >> >> >>> >> >> >> >> >>> >> >> _______________________________________________ >> >> >>> >> >> Swift-devel mailing list >> >> >>> >> >> Swift-devel at ci.uchicago.edu >> >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> >> >> >>> > >> >> >>> > >> >> >>> >> >> >>> >> >> >> >> >> >> >> >> > >> >> > >> >> > >> >> >> >> >> > >> > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: gramjars.tar.gz Type: application/x-gzip Size: 531778 bytes Desc: not available URL: From hategan at mcs.anl.gov Fri Feb 8 15:02:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 15:02:30 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202504550.21618.0.camel@blabla.mcs.anl.gov> On a first look it indeed looks like the gc is more successful at cleaning stuff up. On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > Try the attached 4.0 compliant jar in your tests by dropping > it in your 4.0.x $GLOBUS_LOCATION/lib. > My tests showed about 2MB memory increase per 100 GramJob > objects which sounds to me like a reasonable number (about 20k > per GramJob object ignoring the notification consumer manager > in one job - if my calculations are right) > > Martin > > > > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > >> Mihael, > >> > >> i think i found the memory hole in GramJob. > >> 100 jobs in a test of mine consumed about 23MB (constantly > >> growing) before the fix and 8MB (very slowly growing) after > >> the fix. The big part of that (7MB) is used right from the > >> first job which may be the NotificationConsumerManager. > >> Will commit that change soon to 4.0 branch and you may try > >> it then. > >> Are you using 4.0.x in your tests? > > > > Yes. If there are no API changes, you can send me the jar file. I don't > > have enough knowledge to selectively build WS-GRAM, nor enough disk > > space to build the whole GT. > > > >> > >> Martin > >> > >> >>> > > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per > >> job > >> >>> is > >> >>> a > >> >>> > bit too much considering that swift (which has to consider many > >> more > >> >>> > things) has less than 10K overhead per job. > >> >>> > > >> >>> > >> >>> > >> >>> For my better understanding: > >> >>> Do you start up your own notification consumer manager that listens > >> for > >> >>> notifications of all jobs or do you let each GramJob instance listen > >> >>> for > >> >>> notifications itself? > >> >>> In case you listen for notifications yourself: do you store > >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if > >> >>> needed? > >> >> > >> >> Excellent points. I let each GramJob instance listen for > >> notifications > >> >> itself. What I observed is that it uses only one container for that. 
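[The "gc is more successful at cleaning stuff up" observation above, and the earlier worry that client memory per job never came back down after the jobs finished, can be checked with a crude heap probe between runs. This is a hypothetical helper using only the standard Runtime API, not something from the thread:]

public class HeapProbe {
    // Request a collection first so lingering garbage is not counted, then
    // report the heap actually in use. Comparing this figure before and
    // after a batch of jobs shows whether per-job memory is being released.
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.out.println("used heap: " + usedHeapBytes() + " bytes");
    }
}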
> >> >> > >> > > >> > Shoot! i didn't know that and thought there would be a container per > >> > GramJob in that case. That's the core mysteries with notifications. > >> > Anyway: I did a quick check some days ago and found that GramJob is > >> > surprisingly greedy regarding memory as you said. I'll have to further > >> > check what it is, but will probably not do that before 4.2 is out. > >> > > >> > > >> >> Due to the above, a reference to the GramJob is kept anyway, > >> regardless > >> >> of whether that reference is in client code or the local container. > >> >> > >> >> I'll try to profile a run and see if I can spot where the problems > >> are. > >> >> > >> >>> > >> >>> Martin > >> >>> > >> >>> >> > >> >>> >> The core team will be looking at improving notifications once > >> their > >> >>> >> other 4.2 deliverables are done. > >> >>> >> > >> >>> >> -Stu > >> >>> >> > >> >>> >> Begin forwarded message: > >> >>> >> > >> >>> >> > From: feller at mcs.anl.gov > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > >> >>> >> > To: "Jaime Frey" > >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >>> >> >> >>> >> > >, "Martin Feller" , "charles bacon" > >> >>> >> >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >>> >> >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >> >>> , > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > >> >>> >> >> >>> >> > > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >> >>> >> > > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >>> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >>> >> >>> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >>> >> >>>> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS > >> GRAM > >> >>> >> >>>>> raised some concerns about memory usage on the client side. > >> I > >> >>> did > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >> >>> appeared > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > >> >>> wrapper > >> >>> >> >>>>> around the java client libraries for WS GRAM. > >> >>> >> >>>>> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 > >> at > >> >>> a > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >> >>> execution. > >> >>> >> >>>>> Here is what I've discovered so far. > >> >>> >> >>>>> > >> >>> >> >>>>> Aside from the heap available to the java code, the jvm > >> used > >> >>> 117 > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >> >>> Condor-G > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >>> >> >>>>> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> >>> collector) > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >> >>> complete), > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >>> >> >>>>> > >> >>> >> >>>>> The only long-term memory per job that I know of in the > >> GAHP > >> >>> is > >> >>> >> >>>>> for the notification sink for job status callbacks. 600kb > >> >>> seems > >> >>> a > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us > >> >>> >> >>>>> determine if we're using the notification sinks > >> inefficiently? 
> >> >>> >> >>>> > >> >>> >> >>>> Martin just looked and for the most part, there is nothing > >> >>> wrong > >> >>> >> >>>> with how condor-g manages the callback sink. > >> >>> >> >>>> However, one improvement that would reduce the memory used > >> per > >> >>> job > >> >>> >> >>>> would be to not have a notification consumer per job. > >> Instead > >> >>> use > >> >>> >> >>>> one for all jobs. > >> >>> >> >>>> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > >> >>> tests > >> >>> >> >>>> and found that notifications are building up on the in the > >> >>> GRAM4 > >> >>> >> >>>> service container and that is causing delays which seem to > >> be > >> >>> >> >>>> causing multiple problems. We're looking at this in a > >> separate > >> >>> >> >>>> effort with the GT Core team. But, after this was clear, > >> >>> Martin > >> >>> >> >>>> re- > >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g > >> >>> and > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could > >> you > >> >>> >> >>>> repeat the no-notification test and see the difference in > >> >>> memory? > >> >>> >> >>>> The changes would be to increase the polling frequency in > >> >>> condor-g > >> >>> >> >>>> and comment out the subscribe for notification. You could > >> also > >> >>> >> >>>> comment out the notification listener call(s) too. > >> >>> >> >>> > >> >>> >> >>> > >> >>> >> >>> I did two new sets of tests today. The first used more > >> efficient > >> >>> >> >>> callback code in the GAHP (one notification consumer rather > >> than > >> >>> one > >> >>> >> >>> per job). The second disabled notifications and relied on > >> >>> polling > >> >>> >> >>> for job status changes. > >> >>> >> >>> > >> >>> >> >>> The more efficient callback code did not produce a noticeable > >> >>> >> >>> reduction in memory usage. > >> >>> >> >>> > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum > >> jvm > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > >> minimum > >> >>> >> >>> heap usage after job submission and before job completion was > >> >>> about > >> >>> >> >>> 4 megs + 0.1 megs per job. > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> I ran one more test with the improved callback code. This > >> time, I > >> >>> >> >> stopped storing the notification producer EPRs associated with > >> >>> the > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > >> >>> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these > >> serve-side > >> >>> >> >> notification producer resources when it destroys the job, > >> >>> otherwise > >> >>> >> >> they hang around bogging down the server. Is this still the > >> case? > >> >>> The > >> >>> >> >> server can't destroy notification producers when their sources > >> of > >> >>> >> >> information are destroyed? > >> >>> >> >> > >> >>> >> > > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant > >> much > >> >>> more > >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> >>> >> > subscription resources to be able to destroy them eventually. > >> >>> >> > Those EPR's are maybe not so tiny as they look like. > >> >>> >> > > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually > >> >>> destroy > >> >>> >> > subscription resources manually to avoid heaping up persistence > >> >>> data > >> >>> >> > on the server-side. > >> >>> >> > For 4.2: no, you won't have to store them. 
A job resource will > >> >>> >> > destroy all subscription resources when it's destroyed. > >> >>> >> > > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > >> "container > >> >>> >> > hangs in job destruction" problem won't exist anymore. > >> >>> >> > > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >> >>> changes > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it > >> makes > >> >>> >> > sense > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > >> >>> for > >> >>> >> > fine-tuning then? > >> >>> >> > > >> >>> >> > Martin > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> >>> >> > >> >>> >> > Mihael: > >> >>> >> > > >> >>> >> > That's great, thanks! > >> >>> >> > > >> >>> >> > Ian. > >> >>> >> > > >> >>> >> > Mihael Hategan wrote: > >> >>> >> >> I did a 1024 job run today with ws-gram. > >> >>> >> >> I painted the results here: > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >>> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is quite > >> a > >> >>> lot. > >> >>> >> >> What kinda worries me is that it doesn't seem to go down after > >> >>> the > >> >>> >> >> jobs > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >>> >> >> collector > >> >>> >> >> doesn't do any major collections. I'll need to profile this to > >> >>> see > >> >>> >> >> exactly what we're talking about. > >> >>> >> >> > >> >>> >> >> The container memory is figured by looking at the process in > >> >>> /proc. > >> >>> >> >> It's > >> >>> >> >> total memory including shared libraries and things. But > >> libraries > >> >>> >> >> take a > >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > >> >>> made. > >> >>> >> >> It > >> >>> >> >> looks quite similar to the amount of memory eaten on the > >> client > >> >>> side > >> >>> >> >> (per job). > >> >>> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > >> >>> time > >> >>> >> >> the > >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I > >> >>> have > >> >>> >> >> yet > >> >>> >> >> to plot the exact submission time for each job. > >> >>> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long as > >> >>> there > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > >> >>> and > >> >>> >> >> while > >> >>> >> >> making sure the jvm has enough heap. More than that seems like > >> a > >> >>> >> >> gamble. 
> >> >>> >> >> > >> >>> >> >> Mihael > >> >>> >> >> > >> >>> >> >> _______________________________________________ > >> >>> >> >> Swift-devel mailing list > >> >>> >> >> Swift-devel at ci.uchicago.edu > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >>> >> >> > >> >>> >> >> > >> >>> >> > > >> >>> >> > >> >>> > > >> >>> > > >> >>> > >> >>> > >> >> > >> >> > >> > > >> > > >> > > >> > >> > > > > From hategan at mcs.anl.gov Fri Feb 8 16:12:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 16:12:43 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202504550.21618.0.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202504550.21618.0.camel@blabla.mcs.anl.gov> Message-ID: <1202508763.25421.0.camel@blabla.mcs.anl.gov> Yep. Looks much better. How stable is this otherwise? On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote: > On a first look it indeed looks like the gc is more successful at > cleaning stuff up. > > On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > > Try the attached 4.0 compliant jar in your tests by dropping > > it in your 4.0.x $GLOBUS_LOCATION/lib. > > My tests showed about 2MB memory increase per 100 GramJob > > objects which sounds to me like a reasonable number (about 20k > > per GramJob object ignoring the notification consumer manager > > in one job - if my calculations are right) > > > > Martin > > > > > > > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > > >> Mihael, > > >> > > >> i think i found the memory hole in GramJob. > > >> 100 jobs in a test of mine consumed about 23MB (constantly > > >> growing) before the fix and 8MB (very slowly growing) after > > >> the fix. The big part of that (7MB) is used right from the > > >> first job which may be the NotificationConsumerManager. > > >> Will commit that change soon to 4.0 branch and you may try > > >> it then. > > >> Are you using 4.0.x in your tests? > > > > > > Yes. If there are no API changes, you can send me the jar file. I don't > > > have enough knowledge to selectively build WS-GRAM, nor enough disk > > > space to build the whole GT. > > > > > >> > > >> Martin > > >> > > >> >>> > > > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per > > >> job > > >> >>> is > > >> >>> a > > >> >>> > bit too much considering that swift (which has to consider many > > >> more > > >> >>> > things) has less than 10K overhead per job. > > >> >>> > > > >> >>> > > >> >>> > > >> >>> For my better understanding: > > >> >>> Do you start up your own notification consumer manager that listens > > >> for > > >> >>> notifications of all jobs or do you let each GramJob instance listen > > >> >>> for > > >> >>> notifications itself? > > >> >>> In case you listen for notifications yourself: do you store > > >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if > > >> >>> needed? > > >> >> > > >> >> Excellent points. I let each GramJob instance listen for > > >> notifications > > >> >> itself. 
What I observed is that it uses only one container for that. > > >> >> > > >> > > > >> > Shoot! i didn't know that and thought there would be a container per > > >> > GramJob in that case. That's the core mysteries with notifications. > > >> > Anyway: I did a quick check some days ago and found that GramJob is > > >> > surprisingly greedy regarding memory as you said. I'll have to further > > >> > check what it is, but will probably not do that before 4.2 is out. > > >> > > > >> > > > >> >> Due to the above, a reference to the GramJob is kept anyway, > > >> regardless > > >> >> of whether that reference is in client code or the local container. > > >> >> > > >> >> I'll try to profile a run and see if I can spot where the problems > > >> are. > > >> >> > > >> >>> > > >> >>> Martin > > >> >>> > > >> >>> >> > > >> >>> >> The core team will be looking at improving notifications once > > >> their > > >> >>> >> other 4.2 deliverables are done. > > >> >>> >> > > >> >>> >> -Stu > > >> >>> >> > > >> >>> >> Begin forwarded message: > > >> >>> >> > > >> >>> >> > From: feller at mcs.anl.gov > > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > > >> >>> >> > To: "Jaime Frey" > > >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > > >> >>> >> > >> >>> >> > >, "Martin Feller" , "charles bacon" > > >> >>> >> > >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > > >> >>> >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" > > >> >>> , > > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > > >> >>> >> > >> >>> >> > > > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > > >> >>> >> > > > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > > >> >>> >> >> > > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > > >> >>> >> >>> > > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > > >> >>> >> >>>> > > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS > > >> GRAM > > >> >>> >> >>>>> raised some concerns about memory usage on the client side. > > >> I > > >> >>> did > > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > > >> >>> appeared > > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > > >> >>> wrapper > > >> >>> >> >>>>> around the java client libraries for WS GRAM. > > >> >>> >> >>>>> > > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 > > >> at > > >> >>> a > > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > > >> >>> execution. > > >> >>> >> >>>>> Here is what I've discovered so far. > > >> >>> >> >>>>> > > >> >>> >> >>>>> Aside from the heap available to the java code, the jvm > > >> used > > >> >>> 117 > > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > > >> >>> Condor-G > > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > > >> >>> >> >>>>> > > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > > >> >>> collector) > > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > > >> >>> complete), > > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > > >> >>> >> >>>>> > > >> >>> >> >>>>> The only long-term memory per job that I know of in the > > >> GAHP > > >> >>> is > > >> >>> >> >>>>> for the notification sink for job status callbacks. 
600kb > > >> >>> seems > > >> >>> a > > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us > > >> >>> >> >>>>> determine if we're using the notification sinks > > >> inefficiently? > > >> >>> >> >>>> > > >> >>> >> >>>> Martin just looked and for the most part, there is nothing > > >> >>> wrong > > >> >>> >> >>>> with how condor-g manages the callback sink. > > >> >>> >> >>>> However, one improvement that would reduce the memory used > > >> per > > >> >>> job > > >> >>> >> >>>> would be to not have a notification consumer per job. > > >> Instead > > >> >>> use > > >> >>> >> >>>> one for all jobs. > > >> >>> >> >>>> > > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > > >> >>> tests > > >> >>> >> >>>> and found that notifications are building up on the in the > > >> >>> GRAM4 > > >> >>> >> >>>> service container and that is causing delays which seem to > > >> be > > >> >>> >> >>>> causing multiple problems. We're looking at this in a > > >> separate > > >> >>> >> >>>> effort with the GT Core team. But, after this was clear, > > >> >>> Martin > > >> >>> >> >>>> re- > > >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g > > >> >>> and > > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could > > >> you > > >> >>> >> >>>> repeat the no-notification test and see the difference in > > >> >>> memory? > > >> >>> >> >>>> The changes would be to increase the polling frequency in > > >> >>> condor-g > > >> >>> >> >>>> and comment out the subscribe for notification. You could > > >> also > > >> >>> >> >>>> comment out the notification listener call(s) too. > > >> >>> >> >>> > > >> >>> >> >>> > > >> >>> >> >>> I did two new sets of tests today. The first used more > > >> efficient > > >> >>> >> >>> callback code in the GAHP (one notification consumer rather > > >> than > > >> >>> one > > >> >>> >> >>> per job). The second disabled notifications and relied on > > >> >>> polling > > >> >>> >> >>> for job status changes. > > >> >>> >> >>> > > >> >>> >> >>> The more efficient callback code did not produce a noticeable > > >> >>> >> >>> reduction in memory usage. > > >> >>> >> >>> > > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum > > >> jvm > > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > > >> minimum > > >> >>> >> >>> heap usage after job submission and before job completion was > > >> >>> about > > >> >>> >> >>> 4 megs + 0.1 megs per job. > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> >> I ran one more test with the improved callback code. This > > >> time, I > > >> >>> >> >> stopped storing the notification producer EPRs associated with > > >> >>> the > > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > > >> >>> >> >> > > >> >>> >> >> I was told the client had to explicitly destroy these > > >> serve-side > > >> >>> >> >> notification producer resources when it destroys the job, > > >> >>> otherwise > > >> >>> >> >> they hang around bogging down the server. Is this still the > > >> case? > > >> >>> The > > >> >>> >> >> server can't destroy notification producers when their sources > > >> of > > >> >>> >> >> information are destroyed? > > >> >>> >> >> > > >> >>> >> > > > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant > > >> much > > >> >>> more > > >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > > >> >>> >> > subscription resources to be able to destroy them eventually. 
> > >> >>> >> > Those EPR's are maybe not so tiny as they look like. > > >> >>> >> > > > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually > > >> >>> destroy > > >> >>> >> > subscription resources manually to avoid heaping up persistence > > >> >>> data > > >> >>> >> > on the server-side. > > >> >>> >> > For 4.2: no, you won't have to store them. A job resource will > > >> >>> >> > destroy all subscription resources when it's destroyed. > > >> >>> >> > > > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > > >> "container > > >> >>> >> > hangs in job destruction" problem won't exist anymore. > > >> >>> >> > > > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > > >> >>> changes > > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it > > >> makes > > >> >>> >> > sense > > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > > >> >>> for > > >> >>> >> > fine-tuning then? > > >> >>> >> > > > >> >>> >> > Martin > > >> >>> >> > > >> >>> >> > > >> >>> >> > > >> >>> >> > > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > > >> >>> >> > > >> >>> >> > Mihael: > > >> >>> >> > > > >> >>> >> > That's great, thanks! > > >> >>> >> > > > >> >>> >> > Ian. > > >> >>> >> > > > >> >>> >> > Mihael Hategan wrote: > > >> >>> >> >> I did a 1024 job run today with ws-gram. > > >> >>> >> >> I painted the results here: > > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > > >> >>> >> >> > > >> >>> >> >> Seems like client memory per job is about 370k. Which is quite > > >> a > > >> >>> lot. > > >> >>> >> >> What kinda worries me is that it doesn't seem to go down after > > >> >>> the > > >> >>> >> >> jobs > > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > > >> >>> >> >> collector > > >> >>> >> >> doesn't do any major collections. I'll need to profile this to > > >> >>> see > > >> >>> >> >> exactly what we're talking about. > > >> >>> >> >> > > >> >>> >> >> The container memory is figured by looking at the process in > > >> >>> /proc. > > >> >>> >> >> It's > > >> >>> >> >> total memory including shared libraries and things. But > > >> libraries > > >> >>> >> >> take a > > >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > > >> >>> made. > > >> >>> >> >> It > > >> >>> >> >> looks quite similar to the amount of memory eaten on the > > >> client > > >> >>> side > > >> >>> >> >> (per job). > > >> >>> >> >> > > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > > >> >>> time > > >> >>> >> >> the > > >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I > > >> >>> have > > >> >>> >> >> yet > > >> >>> >> >> to plot the exact submission time for each job. > > >> >>> >> >> > > >> >>> >> >> So at this point I would recommend trying ws-gram as long as > > >> >>> there > > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > > >> >>> and > > >> >>> >> >> while > > >> >>> >> >> making sure the jvm has enough heap. More than that seems like > > >> a > > >> >>> >> >> gamble. 
> > >> >>> >> >> > > >> >>> >> >> Mihael > > >> >>> >> >> > > >> >>> >> >> _______________________________________________ > > >> >>> >> >> Swift-devel mailing list > > >> >>> >> >> Swift-devel at ci.uchicago.edu > > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> > > > >> >>> >> > > >> >>> > > > >> >>> > > > >> >>> > > >> >>> > > >> >> > > >> >> > > >> > > > >> > > > >> > > > >> > > >> > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From feller at mcs.anl.gov Fri Feb 8 16:32:06 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 16:32:06 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202508763.25421.0.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202504550.21618.0.camel@blabla.mcs.anl.gov> <1202508763.25421.0.camel@blabla.mcs.anl.gov> Message-ID: <58838.208.54.7.179.1202509926.squirrel@www-unix.mcs.anl.gov> I can't see any stability issues here. The only thing i changed is using EndpointReferenceType jobEPR = (EndpointReferenceType) ObjectSerializer.clone(response.getManagedJobEndpoint()); instead of EndpointReferenceType jobEPR = response.getManagedJobEndpoint(); at 2 or 3 locations in the code. Rachana uses cloning in core too. So it's supposed to be a stable thing. A question though: Do you see a speedup in submission? Martin > Yep. Looks much better. How stable is this otherwise? > > On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote: >> On a first look it indeed looks like the gc is more successful at >> cleaning stuff up. >> >> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: >> > Try the attached 4.0 compliant jar in your tests by dropping >> > it in your 4.0.x $GLOBUS_LOCATION/lib. >> > My tests showed about 2MB memory increase per 100 GramJob >> > objects which sounds to me like a reasonable number (about 20k >> > per GramJob object ignoring the notification consumer manager >> > in one job - if my calculations are right) >> > >> > Martin >> > >> > > >> > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> > >> Mihael, >> > >> >> > >> i think i found the memory hole in GramJob. >> > >> 100 jobs in a test of mine consumed about 23MB (constantly >> > >> growing) before the fix and 8MB (very slowly growing) after >> > >> the fix. The big part of that (7MB) is used right from the >> > >> first job which may be the NotificationConsumerManager. >> > >> Will commit that change soon to 4.0 branch and you may try >> > >> it then. >> > >> Are you using 4.0.x in your tests? >> > > >> > > Yes. If there are no API changes, you can send me the jar file. I >> don't >> > > have enough knowledge to selectively build WS-GRAM, nor enough disk >> > > space to build the whole GT. >> > > >> > >> >> > >> Martin >> > >> >> > >> >>> > >> > >> >>> > These are both hacks. I'm not sure I want to go there. 
300K >> per >> > >> job >> > >> >>> is >> > >> >>> a >> > >> >>> > bit too much considering that swift (which has to consider >> many >> > >> more >> > >> >>> > things) has less than 10K overhead per job. >> > >> >>> > >> > >> >>> >> > >> >>> >> > >> >>> For my better understanding: >> > >> >>> Do you start up your own notification consumer manager that >> listens >> > >> for >> > >> >>> notifications of all jobs or do you let each GramJob instance >> listen >> > >> >>> for >> > >> >>> notifications itself? >> > >> >>> In case you listen for notifications yourself: do you store >> > >> >>> GramJob objects or just EPR's of jobs and create GramJob >> objects if >> > >> >>> needed? >> > >> >> >> > >> >> Excellent points. I let each GramJob instance listen for >> > >> notifications >> > >> >> itself. What I observed is that it uses only one container for >> that. >> > >> >> >> > >> > >> > >> > Shoot! i didn't know that and thought there would be a container >> per >> > >> > GramJob in that case. That's the core mysteries with >> notifications. >> > >> > Anyway: I did a quick check some days ago and found that GramJob >> is >> > >> > surprisingly greedy regarding memory as you said. I'll have to >> further >> > >> > check what it is, but will probably not do that before 4.2 is >> out. >> > >> > >> > >> > >> > >> >> Due to the above, a reference to the GramJob is kept anyway, >> > >> regardless >> > >> >> of whether that reference is in client code or the local >> container. >> > >> >> >> > >> >> I'll try to profile a run and see if I can spot where the >> problems >> > >> are. >> > >> >> >> > >> >>> >> > >> >>> Martin >> > >> >>> >> > >> >>> >> >> > >> >>> >> The core team will be looking at improving notifications >> once >> > >> their >> > >> >>> >> other 4.2 deliverables are done. >> > >> >>> >> >> > >> >>> >> -Stu >> > >> >>> >> >> > >> >>> >> Begin forwarded message: >> > >> >>> >> >> > >> >>> >> > From: feller at mcs.anl.gov >> > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> > >> >>> >> > To: "Jaime Frey" >> > >> >>> >> > Cc: "Stuart Martin" , "Terrence >> Martin" >> > >> >>> >> > > >> >>> >> > >, "Martin Feller" , "charles bacon" >> > >> >>> >> > > >> >>> >> > >, "Suchandra Thapa" , "Rob >> Gardner" >> > >> >>> >> > > >> >>> >> > >, "Jeff Porter" , "Alain Roy" >> > >> >>> , >> > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> > >> >>> >> > > >> >>> >> > > >> > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> > >> >>> >> > >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> > >> >>> >> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> > >> >>> >> >>> >> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> > >> >>> >> >>>> >> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with >> WS >> > >> GRAM >> > >> >>> >> >>>>> raised some concerns about memory usage on the client >> side. >> > >> I >> > >> >>> did >> > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, >> which >> > >> >>> appeared >> > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is >> a >> > >> >>> wrapper >> > >> >>> >> >>>>> around the java client libraries for WS GRAM. >> > >> >>> >> >>>>> >> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up >> to 30 >> > >> at >> > >> >>> a >> > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal >> data >> > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and >> > >> >>> execution. 
>> > >> >>> >> >>>>> Here is what I've discovered so far. >> > >> >>> >> >>>>> >> > >> >>> >> >>>>> Aside from the heap available to the java code, the >> jvm >> > >> used >> > >> >>> 117 >> > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared >> memory. >> > >> >>> Condor-G >> > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) >> pair. >> > >> >>> >> >>>>> >> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> > >> >>> collector) >> > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP >> was >> > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them >> to >> > >> >>> complete), >> > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> > >> >>> >> >>>>> >> > >> >>> >> >>>>> The only long-term memory per job that I know of in >> the >> > >> GAHP >> > >> >>> is >> > >> >>> >> >>>>> for the notification sink for job status callbacks. >> 600kb >> > >> >>> seems >> > >> >>> a >> > >> >>> >> >>>>> little high for that. Stu, could someone on Globus >> help us >> > >> >>> >> >>>>> determine if we're using the notification sinks >> > >> inefficiently? >> > >> >>> >> >>>> >> > >> >>> >> >>>> Martin just looked and for the most part, there is >> nothing >> > >> >>> wrong >> > >> >>> >> >>>> with how condor-g manages the callback sink. >> > >> >>> >> >>>> However, one improvement that would reduce the memory >> used >> > >> per >> > >> >>> job >> > >> >>> >> >>>> would be to not have a notification consumer per job. >> > >> Instead >> > >> >>> use >> > >> >>> >> >>>> one for all jobs. >> > >> >>> >> >>>> >> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g >> stress >> > >> >>> tests >> > >> >>> >> >>>> and found that notifications are building up on the in >> the >> > >> >>> GRAM4 >> > >> >>> >> >>>> service container and that is causing delays which seem >> to >> > >> be >> > >> >>> >> >>>> causing multiple problems. We're looking at this in a >> > >> separate >> > >> >>> >> >>>> effort with the GT Core team. But, after this was >> clear, >> > >> >>> Martin >> > >> >>> >> >>>> re- >> > >> >>> >> >>>> ran the condor-g test and relied on polling between >> condor-g >> > >> >>> and >> > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, >> could >> > >> you >> > >> >>> >> >>>> repeat the no-notification test and see the difference >> in >> > >> >>> memory? >> > >> >>> >> >>>> The changes would be to increase the polling frequency >> in >> > >> >>> condor-g >> > >> >>> >> >>>> and comment out the subscribe for notification. You >> could >> > >> also >> > >> >>> >> >>>> comment out the notification listener call(s) too. >> > >> >>> >> >>> >> > >> >>> >> >>> >> > >> >>> >> >>> I did two new sets of tests today. The first used more >> > >> efficient >> > >> >>> >> >>> callback code in the GAHP (one notification consumer >> rather >> > >> than >> > >> >>> one >> > >> >>> >> >>> per job). The second disabled notifications and relied >> on >> > >> >>> polling >> > >> >>> >> >>> for job status changes. >> > >> >>> >> >>> >> > >> >>> >> >>> The more efficient callback code did not produce a >> noticeable >> > >> >>> >> >>> reduction in memory usage. >> > >> >>> >> >>> >> > >> >>> >> >>> Disabling notifications did reduce memory usage. The >> maximum >> > >> jvm >> > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. 
The >> > >> minimum >> > >> >>> >> >>> heap usage after job submission and before job >> completion was >> > >> >>> about >> > >> >>> >> >>> 4 megs + 0.1 megs per job. >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> >> I ran one more test with the improved callback code. This >> > >> time, I >> > >> >>> >> >> stopped storing the notification producer EPRs associated >> with >> > >> >>> the >> > >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> > >> >>> >> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these >> > >> serve-side >> > >> >>> >> >> notification producer resources when it destroys the job, >> > >> >>> otherwise >> > >> >>> >> >> they hang around bogging down the server. Is this still >> the >> > >> case? >> > >> >>> The >> > >> >>> >> >> server can't destroy notification producers when their >> sources >> > >> of >> > >> >>> >> >> information are destroyed? >> > >> >>> >> >> >> > >> >>> >> > >> > >> >>> >> > This reminds me of the odd fact that i had to suddenly >> grant >> > >> much >> > >> >>> more >> > >> >>> >> > memory to Condor-g as soon as condor-g started storing >> EPRs of >> > >> >>> >> > subscription resources to be able to destroy them >> eventually. >> > >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> > >> >>> >> > >> > >> >>> >> > For 4.0: yes, currently you'll have to store and >> eventually >> > >> >>> destroy >> > >> >>> >> > subscription resources manually to avoid heaping up >> persistence >> > >> >>> data >> > >> >>> >> > on the server-side. >> > >> >>> >> > For 4.2: no, you won't have to store them. A job resource >> will >> > >> >>> >> > destroy all subscription resources when it's destroyed. >> > >> >>> >> > >> > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> > >> "container >> > >> >>> >> > hangs in job destruction" problem won't exist anymore. >> > >> >>> >> > >> > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable >> 4.2 >> > >> >>> changes >> > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if >> it >> > >> makes >> > >> >>> >> > sense >> > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it >> to you >> > >> >>> for >> > >> >>> >> > fine-tuning then? >> > >> >>> >> > >> > >> >>> >> > Martin >> > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> > >> >>> >> >> > >> >>> >> > Mihael: >> > >> >>> >> > >> > >> >>> >> > That's great, thanks! >> > >> >>> >> > >> > >> >>> >> > Ian. >> > >> >>> >> > >> > >> >>> >> > Mihael Hategan wrote: >> > >> >>> >> >> I did a 1024 job run today with ws-gram. >> > >> >>> >> >> I painted the results here: >> > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> > >> >>> >> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is >> quite >> > >> a >> > >> >>> lot. >> > >> >>> >> >> What kinda worries me is that it doesn't seem to go down >> after >> > >> >>> the >> > >> >>> >> >> jobs >> > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the >> garbage >> > >> >>> >> >> collector >> > >> >>> >> >> doesn't do any major collections. I'll need to profile >> this to >> > >> >>> see >> > >> >>> >> >> exactly what we're talking about. >> > >> >>> >> >> >> > >> >>> >> >> The container memory is figured by looking at the process >> in >> > >> >>> /proc. >> > >> >>> >> >> It's >> > >> >>> >> >> total memory including shared libraries and things. 
But >> > >> libraries >> > >> >>> >> >> take a >> > >> >>> >> >> fixed amount of space, so a fuzzy correlation can >> probably be >> > >> >>> made. >> > >> >>> >> >> It >> > >> >>> >> >> looks quite similar to the amount of memory eaten on the >> > >> client >> > >> >>> side >> > >> >>> >> >> (per job). >> > >> >>> >> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during >> the >> > >> >>> time >> > >> >>> >> >> the >> > >> >>> >> >> jobs are submitted, but the machine itself seems >> responsive. I >> > >> >>> have >> > >> >>> >> >> yet >> > >> >>> >> >> to plot the exact submission time for each job. >> > >> >>> >> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long >> as >> > >> >>> there >> > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel >> jobs), >> > >> >>> and >> > >> >>> >> >> while >> > >> >>> >> >> making sure the jvm has enough heap. More than that seems >> like >> > >> a >> > >> >>> >> >> gamble. >> > >> >>> >> >> >> > >> >>> >> >> Mihael >> > >> >>> >> >> >> > >> >>> >> >> _______________________________________________ >> > >> >>> >> >> Swift-devel mailing list >> > >> >>> >> >> Swift-devel at ci.uchicago.edu >> > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> > >> > >> >>> >> >> > >> >>> > >> > >> >>> > >> > >> >>> >> > >> >>> >> > >> >> >> > >> >> >> > >> > >> > >> > >> > >> > >> > >> >> > >> >> > > >> > > >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From hategan at mcs.anl.gov Fri Feb 8 16:37:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 16:37:18 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <58838.208.54.7.179.1202509926.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202504550.21618.0.camel@blabla.mcs.anl.gov> <1202508763.25421.0.camel@blabla.mcs.anl.gov> <58838.208.54.7.179.1202509926.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202510238.26717.2.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 16:32 -0600, feller at mcs.anl.gov wrote: > I can't see any stability issues here. The only thing i changed > is using > > EndpointReferenceType jobEPR = (EndpointReferenceType) > ObjectSerializer.clone(response.getManagedJobEndpoint()); > > instead of > > EndpointReferenceType jobEPR = response.getManagedJobEndpoint(); > > at 2 or 3 locations in the code. > > Rachana uses cloning in core too. So it's supposed to be > a stable thing. > > A question though: Do you see a speedup in submission? I wasn't looking for that. Anything I should be aware of? > > Martin > > > > Yep. Looks much better. How stable is this otherwise? > > > > On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote: > >> On a first look it indeed looks like the gc is more successful at > >> cleaning stuff up. 
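A minimal sketch of the cloning change described above, assuming the GT 4.0.x client classes named in this thread (the two assignment patterns mirror Martin's message; the wrapper class, comments, and import paths are assumptions, not the actual GramJob patch):

    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.globus.wsrf.encoding.ObjectSerializer;

    // Hypothetical helper: detach an EPR from the response it came from by
    // deep-copying it before it is stored for the lifetime of the job.
    public final class EprUtil {
        private EprUtil() {}

        // Old pattern: jobEPR = response.getManagedJobEndpoint();
        //   which keeps the returned EPR tied to the deserialized response.
        // New pattern: clone through the serializer so only the small copy
        //   is retained and the rest of the response can be collected.
        public static EndpointReferenceType detach(EndpointReferenceType epr)
                throws Exception {
            return (EndpointReferenceType) ObjectSerializer.clone(epr);
        }
    }

The presumed effect is what is reported above: with the clone in place the garbage collector can actually reclaim the per-job response objects.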
> >> > >> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > >> > Try the attached 4.0 compliant jar in your tests by dropping > >> > it in your 4.0.x $GLOBUS_LOCATION/lib. > >> > My tests showed about 2MB memory increase per 100 GramJob > >> > objects which sounds to me like a reasonable number (about 20k > >> > per GramJob object ignoring the notification consumer manager > >> > in one job - if my calculations are right) > >> > > >> > Martin > >> > > >> > > > >> > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > >> > >> Mihael, > >> > >> > >> > >> i think i found the memory hole in GramJob. > >> > >> 100 jobs in a test of mine consumed about 23MB (constantly > >> > >> growing) before the fix and 8MB (very slowly growing) after > >> > >> the fix. The big part of that (7MB) is used right from the > >> > >> first job which may be the NotificationConsumerManager. > >> > >> Will commit that change soon to 4.0 branch and you may try > >> > >> it then. > >> > >> Are you using 4.0.x in your tests? > >> > > > >> > > Yes. If there are no API changes, you can send me the jar file. I > >> don't > >> > > have enough knowledge to selectively build WS-GRAM, nor enough disk > >> > > space to build the whole GT. > >> > > > >> > >> > >> > >> Martin > >> > >> > >> > >> >>> > > >> > >> >>> > These are both hacks. I'm not sure I want to go there. 300K > >> per > >> > >> job > >> > >> >>> is > >> > >> >>> a > >> > >> >>> > bit too much considering that swift (which has to consider > >> many > >> > >> more > >> > >> >>> > things) has less than 10K overhead per job. > >> > >> >>> > > >> > >> >>> > >> > >> >>> > >> > >> >>> For my better understanding: > >> > >> >>> Do you start up your own notification consumer manager that > >> listens > >> > >> for > >> > >> >>> notifications of all jobs or do you let each GramJob instance > >> listen > >> > >> >>> for > >> > >> >>> notifications itself? > >> > >> >>> In case you listen for notifications yourself: do you store > >> > >> >>> GramJob objects or just EPR's of jobs and create GramJob > >> objects if > >> > >> >>> needed? > >> > >> >> > >> > >> >> Excellent points. I let each GramJob instance listen for > >> > >> notifications > >> > >> >> itself. What I observed is that it uses only one container for > >> that. > >> > >> >> > >> > >> > > >> > >> > Shoot! i didn't know that and thought there would be a container > >> per > >> > >> > GramJob in that case. That's the core mysteries with > >> notifications. > >> > >> > Anyway: I did a quick check some days ago and found that GramJob > >> is > >> > >> > surprisingly greedy regarding memory as you said. I'll have to > >> further > >> > >> > check what it is, but will probably not do that before 4.2 is > >> out. > >> > >> > > >> > >> > > >> > >> >> Due to the above, a reference to the GramJob is kept anyway, > >> > >> regardless > >> > >> >> of whether that reference is in client code or the local > >> container. > >> > >> >> > >> > >> >> I'll try to profile a run and see if I can spot where the > >> problems > >> > >> are. > >> > >> >> > >> > >> >>> > >> > >> >>> Martin > >> > >> >>> > >> > >> >>> >> > >> > >> >>> >> The core team will be looking at improving notifications > >> once > >> > >> their > >> > >> >>> >> other 4.2 deliverables are done. 
> >> > >> >>> >> > >> > >> >>> >> -Stu > >> > >> >>> >> > >> > >> >>> >> Begin forwarded message: > >> > >> >>> >> > >> > >> >>> >> > From: feller at mcs.anl.gov > >> > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > >> > >> >>> >> > To: "Jaime Frey" > >> > >> >>> >> > Cc: "Stuart Martin" , "Terrence > >> Martin" > >> > >> >>> >> >> > >> >>> >> > >, "Martin Feller" , "charles bacon" > >> > >> >>> >> >> > >> >>> >> > >, "Suchandra Thapa" , "Rob > >> Gardner" > >> > >> >>> >> >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >> > >> >>> , > >> > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > >> > >> >>> >> >> > >> >>> >> > > > >> > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >> > >> >>> >> > > >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> > >> >>> >> >> > >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> > >> >>> >> >>> > >> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> > >> >>> >> >>>> > >> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with > >> WS > >> > >> GRAM > >> > >> >>> >> >>>>> raised some concerns about memory usage on the client > >> side. > >> > >> I > >> > >> >>> did > >> > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, > >> which > >> > >> >>> appeared > >> > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is > >> a > >> > >> >>> wrapper > >> > >> >>> >> >>>>> around the java client libraries for WS GRAM. > >> > >> >>> >> >>>>> > >> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up > >> to 30 > >> > >> at > >> > >> >>> a > >> > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal > >> data > >> > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >> > >> >>> execution. > >> > >> >>> >> >>>>> Here is what I've discovered so far. > >> > >> >>> >> >>>>> > >> > >> >>> >> >>>>> Aside from the heap available to the java code, the > >> jvm > >> > >> used > >> > >> >>> 117 > >> > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared > >> memory. > >> > >> >>> Condor-G > >> > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) > >> pair. > >> > >> >>> >> >>>>> > >> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> > >> >>> collector) > >> > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP > >> was > >> > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them > >> to > >> > >> >>> complete), > >> > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> > >> >>> >> >>>>> > >> > >> >>> >> >>>>> The only long-term memory per job that I know of in > >> the > >> > >> GAHP > >> > >> >>> is > >> > >> >>> >> >>>>> for the notification sink for job status callbacks. > >> 600kb > >> > >> >>> seems > >> > >> >>> a > >> > >> >>> >> >>>>> little high for that. Stu, could someone on Globus > >> help us > >> > >> >>> >> >>>>> determine if we're using the notification sinks > >> > >> inefficiently? > >> > >> >>> >> >>>> > >> > >> >>> >> >>>> Martin just looked and for the most part, there is > >> nothing > >> > >> >>> wrong > >> > >> >>> >> >>>> with how condor-g manages the callback sink. > >> > >> >>> >> >>>> However, one improvement that would reduce the memory > >> used > >> > >> per > >> > >> >>> job > >> > >> >>> >> >>>> would be to not have a notification consumer per job. > >> > >> Instead > >> > >> >>> use > >> > >> >>> >> >>>> one for all jobs. 
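A rough sketch of that "one notification consumer for all jobs" suggestion, using the GT4 Java WS Core notification classes as they appear in its samples (class and method names here are assumptions, and the wiring into the per-job subscribe requests is omitted):

    import java.util.List;
    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.globus.wsrf.NotificationConsumerManager;
    import org.globus.wsrf.NotifyCallback;

    // One consumer endpoint shared by every job, instead of one per GramJob.
    public class SharedJobStateConsumer implements NotifyCallback {
        private final NotificationConsumerManager manager;
        private final EndpointReferenceType consumerEPR;

        public SharedJobStateConsumer() throws Exception {
            manager = NotificationConsumerManager.getInstance();
            manager.startListening();
            // Every job's subscription points at this single endpoint.
            consumerEPR = manager.createNotificationConsumer(this);
        }

        public EndpointReferenceType getConsumerEPR() {
            return consumerEPR;
        }

        // Called for state notifications from any job; the producer EPR
        // identifies the job resource, so dispatch on it to find the job.
        public void deliver(List topicPath, EndpointReferenceType producer,
                Object message) {
            // look up the job record keyed by the producer EPR and update it
        }

        public void shutdown() throws Exception {
            manager.stopListening();
        }
    }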
> >> > >> >>> >> >>>> > >> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g > >> stress > >> > >> >>> tests > >> > >> >>> >> >>>> and found that notifications are building up on the in > >> the > >> > >> >>> GRAM4 > >> > >> >>> >> >>>> service container and that is causing delays which seem > >> to > >> > >> be > >> > >> >>> >> >>>> causing multiple problems. We're looking at this in a > >> > >> separate > >> > >> >>> >> >>>> effort with the GT Core team. But, after this was > >> clear, > >> > >> >>> Martin > >> > >> >>> >> >>>> re- > >> > >> >>> >> >>>> ran the condor-g test and relied on polling between > >> condor-g > >> > >> >>> and > >> > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, > >> could > >> > >> you > >> > >> >>> >> >>>> repeat the no-notification test and see the difference > >> in > >> > >> >>> memory? > >> > >> >>> >> >>>> The changes would be to increase the polling frequency > >> in > >> > >> >>> condor-g > >> > >> >>> >> >>>> and comment out the subscribe for notification. You > >> could > >> > >> also > >> > >> >>> >> >>>> comment out the notification listener call(s) too. > >> > >> >>> >> >>> > >> > >> >>> >> >>> > >> > >> >>> >> >>> I did two new sets of tests today. The first used more > >> > >> efficient > >> > >> >>> >> >>> callback code in the GAHP (one notification consumer > >> rather > >> > >> than > >> > >> >>> one > >> > >> >>> >> >>> per job). The second disabled notifications and relied > >> on > >> > >> >>> polling > >> > >> >>> >> >>> for job status changes. > >> > >> >>> >> >>> > >> > >> >>> >> >>> The more efficient callback code did not produce a > >> noticeable > >> > >> >>> >> >>> reduction in memory usage. > >> > >> >>> >> >>> > >> > >> >>> >> >>> Disabling notifications did reduce memory usage. The > >> maximum > >> > >> jvm > >> > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > >> > >> minimum > >> > >> >>> >> >>> heap usage after job submission and before job > >> completion was > >> > >> >>> about > >> > >> >>> >> >>> 4 megs + 0.1 megs per job. > >> > >> >>> >> >> > >> > >> >>> >> >> > >> > >> >>> >> >> I ran one more test with the improved callback code. This > >> > >> time, I > >> > >> >>> >> >> stopped storing the notification producer EPRs associated > >> with > >> > >> >>> the > >> > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > >> > >> >>> >> >> > >> > >> >>> >> >> I was told the client had to explicitly destroy these > >> > >> serve-side > >> > >> >>> >> >> notification producer resources when it destroys the job, > >> > >> >>> otherwise > >> > >> >>> >> >> they hang around bogging down the server. Is this still > >> the > >> > >> case? > >> > >> >>> The > >> > >> >>> >> >> server can't destroy notification producers when their > >> sources > >> > >> of > >> > >> >>> >> >> information are destroyed? > >> > >> >>> >> >> > >> > >> >>> >> > > >> > >> >>> >> > This reminds me of the odd fact that i had to suddenly > >> grant > >> > >> much > >> > >> >>> more > >> > >> >>> >> > memory to Condor-g as soon as condor-g started storing > >> EPRs of > >> > >> >>> >> > subscription resources to be able to destroy them > >> eventually. > >> > >> >>> >> > Those EPR's are maybe not so tiny as they look like. 
> >> > >> >>> >> > > >> > >> >>> >> > For 4.0: yes, currently you'll have to store and > >> eventually > >> > >> >>> destroy > >> > >> >>> >> > subscription resources manually to avoid heaping up > >> persistence > >> > >> >>> data > >> > >> >>> >> > on the server-side. > >> > >> >>> >> > For 4.2: no, you won't have to store them. A job resource > >> will > >> > >> >>> >> > destroy all subscription resources when it's destroyed. > >> > >> >>> >> > > >> > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > >> > >> "container > >> > >> >>> >> > hangs in job destruction" problem won't exist anymore. > >> > >> >>> >> > > >> > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable > >> 4.2 > >> > >> >>> changes > >> > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if > >> it > >> > >> makes > >> > >> >>> >> > sense > >> > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it > >> to you > >> > >> >>> for > >> > >> >>> >> > fine-tuning then? > >> > >> >>> >> > > >> > >> >>> >> > Martin > >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> > >> >>> >> > >> > >> >>> >> > Mihael: > >> > >> >>> >> > > >> > >> >>> >> > That's great, thanks! > >> > >> >>> >> > > >> > >> >>> >> > Ian. > >> > >> >>> >> > > >> > >> >>> >> > Mihael Hategan wrote: > >> > >> >>> >> >> I did a 1024 job run today with ws-gram. > >> > >> >>> >> >> I painted the results here: > >> > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> > >> >>> >> >> > >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is > >> quite > >> > >> a > >> > >> >>> lot. > >> > >> >>> >> >> What kinda worries me is that it doesn't seem to go down > >> after > >> > >> >>> the > >> > >> >>> >> >> jobs > >> > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the > >> garbage > >> > >> >>> >> >> collector > >> > >> >>> >> >> doesn't do any major collections. I'll need to profile > >> this to > >> > >> >>> see > >> > >> >>> >> >> exactly what we're talking about. > >> > >> >>> >> >> > >> > >> >>> >> >> The container memory is figured by looking at the process > >> in > >> > >> >>> /proc. > >> > >> >>> >> >> It's > >> > >> >>> >> >> total memory including shared libraries and things. But > >> > >> libraries > >> > >> >>> >> >> take a > >> > >> >>> >> >> fixed amount of space, so a fuzzy correlation can > >> probably be > >> > >> >>> made. > >> > >> >>> >> >> It > >> > >> >>> >> >> looks quite similar to the amount of memory eaten on the > >> > >> client > >> > >> >>> side > >> > >> >>> >> >> (per job). > >> > >> >>> >> >> > >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during > >> the > >> > >> >>> time > >> > >> >>> >> >> the > >> > >> >>> >> >> jobs are submitted, but the machine itself seems > >> responsive. I > >> > >> >>> have > >> > >> >>> >> >> yet > >> > >> >>> >> >> to plot the exact submission time for each job. > >> > >> >>> >> >> > >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long > >> as > >> > >> >>> there > >> > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel > >> jobs), > >> > >> >>> and > >> > >> >>> >> >> while > >> > >> >>> >> >> making sure the jvm has enough heap. More than that seems > >> like > >> > >> a > >> > >> >>> >> >> gamble. 
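As a back-of-the-envelope check of that recommendation (the ~370k-per-job figure and the 4000-job ceiling are the numbers quoted above; the fixed base overhead is a guess):

    // Editorial sketch: rough client-side heap estimate for N concurrent jobs.
    public class HeapEstimate {
        static long estimatedClientHeapBytes(int jobs) {
            long perJob = 370L * 1024;       // ~370 KB per job, as measured above
            long base = 8L * 1024 * 1024;    // assumed fixed overhead of a few MB
            return base + perJob * jobs;
        }

        public static void main(String[] args) {
            // Roughly 1.5e9 bytes for 4000 jobs, hence the advice to give the
            // JVM a generous heap (-Xmx) for runs of that size.
            System.out.println(estimatedClientHeapBytes(4000));
        }
    }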
> >> > >> >>> >> >> > >> > >> >>> >> >> Mihael > >> > >> >>> >> >> > >> > >> >>> >> >> _______________________________________________ > >> > >> >>> >> >> Swift-devel mailing list > >> > >> >>> >> >> Swift-devel at ci.uchicago.edu > >> > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> >>> >> >> > >> > >> >>> >> >> > >> > >> >>> >> > > >> > >> >>> >> > >> > >> >>> > > >> > >> >>> > > >> > >> >>> > >> > >> >>> > >> > >> >> > >> > >> >> > >> > >> > > >> > >> > > >> > >> > > >> > >> > >> > >> > >> > > > >> > > > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > > From feller at mcs.anl.gov Fri Feb 8 16:46:06 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 16:46:06 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202510238.26717.2.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202504550.21618.0.camel@blabla.mcs.anl.gov> <1202508763.25421.0.camel@blabla.mcs.anl.gov> <58838.208.54.7.179.1202509926.squirrel@www-unix.mcs.anl.gov> <1202510238.26717.2.camel@blabla.mcs.anl.gov> Message-ID: <61240.208.54.7.179.1202510766.squirrel@www-unix.mcs.anl.gov> > > On Fri, 2008-02-08 at 16:32 -0600, feller at mcs.anl.gov wrote: >> I can't see any stability issues here. The only thing i changed >> is using >> >> EndpointReferenceType jobEPR = (EndpointReferenceType) >> ObjectSerializer.clone(response.getManagedJobEndpoint()); >> >> instead of >> >> EndpointReferenceType jobEPR = response.getManagedJobEndpoint(); >> >> at 2 or 3 locations in the code. >> >> Rachana uses cloning in core too. So it's supposed to be >> a stable thing. >> >> A question though: Do you see a speedup in submission? > > I wasn't looking for that. Anything I should be aware of? > Well, i can see a quite big speedup and can't really explain it. The only thing i did was that cloning. But i'm working on trunk and i changed some things in job creation that allow faster job creation. In 4.0 you might only see it in jobs without delegation. It would be interesting for me if you see a higher submission rate in jobs that don't have any links to delegated credentials in the job description (so no jobCredentialEndpoint, no stagingCredentialEndpoint, no transferCredentialEndpoints). Martin >> >> Martin >> >> >> > Yep. Looks much better. How stable is this otherwise? >> > >> > On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote: >> >> On a first look it indeed looks like the gc is more successful at >> >> cleaning stuff up. >> >> >> >> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: >> >> > Try the attached 4.0 compliant jar in your tests by dropping >> >> > it in your 4.0.x $GLOBUS_LOCATION/lib. 
>> >> > My tests showed about 2MB memory increase per 100 GramJob >> >> > objects which sounds to me like a reasonable number (about 20k >> >> > per GramJob object ignoring the notification consumer manager >> >> > in one job - if my calculations are right) >> >> > >> >> > Martin >> >> > >> >> > > >> >> > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> >> > >> Mihael, >> >> > >> >> >> > >> i think i found the memory hole in GramJob. >> >> > >> 100 jobs in a test of mine consumed about 23MB (constantly >> >> > >> growing) before the fix and 8MB (very slowly growing) after >> >> > >> the fix. The big part of that (7MB) is used right from the >> >> > >> first job which may be the NotificationConsumerManager. >> >> > >> Will commit that change soon to 4.0 branch and you may try >> >> > >> it then. >> >> > >> Are you using 4.0.x in your tests? >> >> > > >> >> > > Yes. If there are no API changes, you can send me the jar file. I >> >> don't >> >> > > have enough knowledge to selectively build WS-GRAM, nor enough >> disk >> >> > > space to build the whole GT. >> >> > > >> >> > >> >> >> > >> Martin >> >> > >> >> >> > >> >>> > >> >> > >> >>> > These are both hacks. I'm not sure I want to go there. >> 300K >> >> per >> >> > >> job >> >> > >> >>> is >> >> > >> >>> a >> >> > >> >>> > bit too much considering that swift (which has to consider >> >> many >> >> > >> more >> >> > >> >>> > things) has less than 10K overhead per job. >> >> > >> >>> > >> >> > >> >>> >> >> > >> >>> >> >> > >> >>> For my better understanding: >> >> > >> >>> Do you start up your own notification consumer manager that >> >> listens >> >> > >> for >> >> > >> >>> notifications of all jobs or do you let each GramJob >> instance >> >> listen >> >> > >> >>> for >> >> > >> >>> notifications itself? >> >> > >> >>> In case you listen for notifications yourself: do you store >> >> > >> >>> GramJob objects or just EPR's of jobs and create GramJob >> >> objects if >> >> > >> >>> needed? >> >> > >> >> >> >> > >> >> Excellent points. I let each GramJob instance listen for >> >> > >> notifications >> >> > >> >> itself. What I observed is that it uses only one container >> for >> >> that. >> >> > >> >> >> >> > >> > >> >> > >> > Shoot! i didn't know that and thought there would be a >> container >> >> per >> >> > >> > GramJob in that case. That's the core mysteries with >> >> notifications. >> >> > >> > Anyway: I did a quick check some days ago and found that >> GramJob >> >> is >> >> > >> > surprisingly greedy regarding memory as you said. I'll have to >> >> further >> >> > >> > check what it is, but will probably not do that before 4.2 is >> >> out. >> >> > >> > >> >> > >> > >> >> > >> >> Due to the above, a reference to the GramJob is kept anyway, >> >> > >> regardless >> >> > >> >> of whether that reference is in client code or the local >> >> container. >> >> > >> >> >> >> > >> >> I'll try to profile a run and see if I can spot where the >> >> problems >> >> > >> are. >> >> > >> >> >> >> > >> >>> >> >> > >> >>> Martin >> >> > >> >>> >> >> > >> >>> >> >> >> > >> >>> >> The core team will be looking at improving notifications >> >> once >> >> > >> their >> >> > >> >>> >> other 4.2 deliverables are done. 
>> >> > >> >>> >> >> >> > >> >>> >> -Stu >> >> > >> >>> >> >> >> > >> >>> >> Begin forwarded message: >> >> > >> >>> >> >> >> > >> >>> >> > From: feller at mcs.anl.gov >> >> > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> >> > >> >>> >> > To: "Jaime Frey" >> >> > >> >>> >> > Cc: "Stuart Martin" , "Terrence >> >> Martin" >> >> > >> >>> >> > >> > >> >>> >> > >, "Martin Feller" , "charles >> bacon" >> >> > >> >>> >> > >> > >> >>> >> > >, "Suchandra Thapa" , "Rob >> >> Gardner" >> >> > >> >>> >> > >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" >> >> > >> >>> , >> >> > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> >> > >> >>> >> > >> > >> >>> >> > > >> >> > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> >> > >> >>> >> > >> >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> > >> >>> >> >> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >> > >> >>> >> >>> >> >> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey >> wrote: >> >> > >> >>> >> >>>> >> >> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G >> with >> >> WS >> >> > >> GRAM >> >> > >> >>> >> >>>>> raised some concerns about memory usage on the >> client >> >> side. >> >> > >> I >> >> > >> >>> did >> >> > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, >> >> which >> >> > >> >>> appeared >> >> > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server >> is >> >> a >> >> > >> >>> wrapper >> >> > >> >>> >> >>>>> around the java client libraries for WS GRAM. >> >> > >> >>> >> >>>>> >> >> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs >> up >> >> to 30 >> >> > >> at >> >> > >> >>> a >> >> > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with >> minimal >> >> data >> >> > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission >> and >> >> > >> >>> execution. >> >> > >> >>> >> >>>>> Here is what I've discovered so far. >> >> > >> >>> >> >>>>> >> >> > >> >>> >> >>>>> Aside from the heap available to the java code, the >> >> jvm >> >> > >> used >> >> > >> >>> 117 >> >> > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared >> >> memory. >> >> > >> >>> Condor-G >> >> > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 >> DN) >> >> pair. >> >> > >> >>> >> >>>>> >> >> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the >> garbage >> >> > >> >>> collector) >> >> > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the >> GAHP >> >> was >> >> > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for >> them >> >> to >> >> > >> >>> complete), >> >> > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >> > >> >>> >> >>>>> >> >> > >> >>> >> >>>>> The only long-term memory per job that I know of in >> >> the >> >> > >> GAHP >> >> > >> >>> is >> >> > >> >>> >> >>>>> for the notification sink for job status callbacks. >> >> 600kb >> >> > >> >>> seems >> >> > >> >>> a >> >> > >> >>> >> >>>>> little high for that. Stu, could someone on Globus >> >> help us >> >> > >> >>> >> >>>>> determine if we're using the notification sinks >> >> > >> inefficiently? >> >> > >> >>> >> >>>> >> >> > >> >>> >> >>>> Martin just looked and for the most part, there is >> >> nothing >> >> > >> >>> wrong >> >> > >> >>> >> >>>> with how condor-g manages the callback sink. 
>> >> > >> >>> >> >>>> However, one improvement that would reduce the >> memory >> >> used >> >> > >> per >> >> > >> >>> job >> >> > >> >>> >> >>>> would be to not have a notification consumer per >> job. >> >> > >> Instead >> >> > >> >>> use >> >> > >> >>> >> >>>> one for all jobs. >> >> > >> >>> >> >>>> >> >> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g >> >> stress >> >> > >> >>> tests >> >> > >> >>> >> >>>> and found that notifications are building up on the >> in >> >> the >> >> > >> >>> GRAM4 >> >> > >> >>> >> >>>> service container and that is causing delays which >> seem >> >> to >> >> > >> be >> >> > >> >>> >> >>>> causing multiple problems. We're looking at this in >> a >> >> > >> separate >> >> > >> >>> >> >>>> effort with the GT Core team. But, after this was >> >> clear, >> >> > >> >>> Martin >> >> > >> >>> >> >>>> re- >> >> > >> >>> >> >>>> ran the condor-g test and relied on polling between >> >> condor-g >> >> > >> >>> and >> >> > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, >> >> could >> >> > >> you >> >> > >> >>> >> >>>> repeat the no-notification test and see the >> difference >> >> in >> >> > >> >>> memory? >> >> > >> >>> >> >>>> The changes would be to increase the polling >> frequency >> >> in >> >> > >> >>> condor-g >> >> > >> >>> >> >>>> and comment out the subscribe for notification. You >> >> could >> >> > >> also >> >> > >> >>> >> >>>> comment out the notification listener call(s) too. >> >> > >> >>> >> >>> >> >> > >> >>> >> >>> >> >> > >> >>> >> >>> I did two new sets of tests today. The first used >> more >> >> > >> efficient >> >> > >> >>> >> >>> callback code in the GAHP (one notification consumer >> >> rather >> >> > >> than >> >> > >> >>> one >> >> > >> >>> >> >>> per job). The second disabled notifications and >> relied >> >> on >> >> > >> >>> polling >> >> > >> >>> >> >>> for job status changes. >> >> > >> >>> >> >>> >> >> > >> >>> >> >>> The more efficient callback code did not produce a >> >> noticeable >> >> > >> >>> >> >>> reduction in memory usage. >> >> > >> >>> >> >>> >> >> > >> >>> >> >>> Disabling notifications did reduce memory usage. The >> >> maximum >> >> > >> jvm >> >> > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. >> The >> >> > >> minimum >> >> > >> >>> >> >>> heap usage after job submission and before job >> >> completion was >> >> > >> >>> about >> >> > >> >>> >> >>> 4 megs + 0.1 megs per job. >> >> > >> >>> >> >> >> >> > >> >>> >> >> >> >> > >> >>> >> >> I ran one more test with the improved callback code. >> This >> >> > >> time, I >> >> > >> >>> >> >> stopped storing the notification producer EPRs >> associated >> >> with >> >> > >> >>> the >> >> > >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> >> > >> >>> >> >> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these >> >> > >> serve-side >> >> > >> >>> >> >> notification producer resources when it destroys the >> job, >> >> > >> >>> otherwise >> >> > >> >>> >> >> they hang around bogging down the server. Is this >> still >> >> the >> >> > >> case? >> >> > >> >>> The >> >> > >> >>> >> >> server can't destroy notification producers when their >> >> sources >> >> > >> of >> >> > >> >>> >> >> information are destroyed? 
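For context, a sketch of the cleanup burden this implies for a 4.0 client, per the reply that follows; it assumes the 4.0.x GramJob client's destroy() call, and the subscription cleanup is left schematic because the exact stub calls are not shown in this thread:

    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.globus.exec.client.GramJob;

    public class JobCleanup {
        public static void cleanup(GramJob job, EndpointReferenceType subscriptionEPR)
                throws Exception {
            job.destroy();   // destroy the ManagedJob resource

            // GT 4.0: the subscription resource created for state notifications
            // must be destroyed separately, or its persistence data piles up on
            // the server side; the concrete call goes through the generated
            // notification/lifetime stubs and is omitted here.
            // destroySubscription(subscriptionEPR);

            // GT 4.2 (per the reply below): destroying the job resource also
            // destroys its subscription resources, so job.destroy() alone is
            // enough there.
        }
    }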
>> >> > >> >>> >> >> >> >> > >> >>> >> > >> >> > >> >>> >> > This reminds me of the odd fact that i had to suddenly >> >> grant >> >> > >> much >> >> > >> >>> more >> >> > >> >>> >> > memory to Condor-g as soon as condor-g started storing >> >> EPRs of >> >> > >> >>> >> > subscription resources to be able to destroy them >> >> eventually. >> >> > >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> >> > >> >>> >> > >> >> > >> >>> >> > For 4.0: yes, currently you'll have to store and >> >> eventually >> >> > >> >>> destroy >> >> > >> >>> >> > subscription resources manually to avoid heaping up >> >> persistence >> >> > >> >>> data >> >> > >> >>> >> > on the server-side. >> >> > >> >>> >> > For 4.2: no, you won't have to store them. A job >> resource >> >> will >> >> > >> >>> >> > destroy all subscription resources when it's destroyed. >> >> > >> >>> >> > >> >> > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> >> > >> "container >> >> > >> >>> >> > hangs in job destruction" problem won't exist anymore. >> >> > >> >>> >> > >> >> > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% >> reliable >> >> 4.2 >> >> > >> >>> changes >> >> > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder >> if >> >> it >> >> > >> makes >> >> > >> >>> >> > sense >> >> > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand >> it >> >> to you >> >> > >> >>> for >> >> > >> >>> >> > fine-tuning then? >> >> > >> >>> >> > >> >> > >> >>> >> > Martin >> >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> > >> >>> >> >> >> > >> >>> >> > Mihael: >> >> > >> >>> >> > >> >> > >> >>> >> > That's great, thanks! >> >> > >> >>> >> > >> >> > >> >>> >> > Ian. >> >> > >> >>> >> > >> >> > >> >>> >> > Mihael Hategan wrote: >> >> > >> >>> >> >> I did a 1024 job run today with ws-gram. >> >> > >> >>> >> >> I painted the results here: >> >> > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> > >> >>> >> >> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which >> is >> >> quite >> >> > >> a >> >> > >> >>> lot. >> >> > >> >>> >> >> What kinda worries me is that it doesn't seem to go >> down >> >> after >> >> > >> >>> the >> >> > >> >>> >> >> jobs >> >> > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the >> >> garbage >> >> > >> >>> >> >> collector >> >> > >> >>> >> >> doesn't do any major collections. I'll need to profile >> >> this to >> >> > >> >>> see >> >> > >> >>> >> >> exactly what we're talking about. >> >> > >> >>> >> >> >> >> > >> >>> >> >> The container memory is figured by looking at the >> process >> >> in >> >> > >> >>> /proc. >> >> > >> >>> >> >> It's >> >> > >> >>> >> >> total memory including shared libraries and things. >> But >> >> > >> libraries >> >> > >> >>> >> >> take a >> >> > >> >>> >> >> fixed amount of space, so a fuzzy correlation can >> >> probably be >> >> > >> >>> made. >> >> > >> >>> >> >> It >> >> > >> >>> >> >> looks quite similar to the amount of memory eaten on >> the >> >> > >> client >> >> > >> >>> side >> >> > >> >>> >> >> (per job). >> >> > >> >>> >> >> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work >> during >> >> the >> >> > >> >>> time >> >> > >> >>> >> >> the >> >> > >> >>> >> >> jobs are submitted, but the machine itself seems >> >> responsive. 
I >> >> > >> >>> have >> >> > >> >>> >> >> yet >> >> > >> >>> >> >> to plot the exact submission time for each job. >> >> > >> >>> >> >> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as >> long >> >> as >> >> > >> >>> there >> >> > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 >> parallel >> >> jobs), >> >> > >> >>> and >> >> > >> >>> >> >> while >> >> > >> >>> >> >> making sure the jvm has enough heap. More than that >> seems >> >> like >> >> > >> a >> >> > >> >>> >> >> gamble. >> >> > >> >>> >> >> >> >> > >> >>> >> >> Mihael >> >> > >> >>> >> >> >> >> > >> >>> >> >> _______________________________________________ >> >> > >> >>> >> >> Swift-devel mailing list >> >> > >> >>> >> >> Swift-devel at ci.uchicago.edu >> >> > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > >> >>> >> >> >> >> > >> >>> >> >> >> >> > >> >>> >> > >> >> > >> >>> >> >> >> > >> >>> > >> >> > >> >>> > >> >> > >> >>> >> >> > >> >>> >> >> > >> >> >> >> > >> >> >> >> > >> > >> >> > >> > >> >> > >> > >> >> > >> >> >> > >> >> >> > > >> >> > > >> >> >> >> _______________________________________________ >> >> Swift-devel mailing list >> >> Swift-devel at ci.uchicago.edu >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >> > >> > >> >> > > From benc at hawaga.org.uk Sun Feb 10 05:50:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 10 Feb 2008 11:50:05 +0000 (GMT) Subject: [Swift-devel] program order Message-ID: This works in the present code - type and mapping declaration after assignment (see tests/language-behaviour/040-program-order.swift) outfile = greeting("hi"); messagefile outfile <"040-program-order.out">; When implementing some more compile time checking, I rediscovered this. I'm not sure whether I prefer this to be permitted or to be prohibited. -- From hategan at mcs.anl.gov Sun Feb 10 11:45:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 10 Feb 2008 11:45:10 -0600 Subject: [Swift-devel] program order In-Reply-To: References: Message-ID: <1202665510.5770.15.camel@blabla.mcs.anl.gov> On Sun, 2008-02-10 at 11:50 +0000, Ben Clifford wrote: > This works in the present code - type and mapping declaration after > assignment (see tests/language-behaviour/040-program-order.swift) > > outfile = greeting("hi"); > messagefile outfile <"040-program-order.out">; > > When implementing some more compile time checking, I rediscovered this. > I'm not sure whether I prefer this to be permitted or to be prohibited. Does it really? Or does it cause a race condition which happens to work most of the times? > From benc at hawaga.org.uk Sun Feb 10 12:12:11 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 10 Feb 2008 18:12:11 +0000 (GMT) Subject: [Swift-devel] program order In-Reply-To: <1202665510.5770.15.camel@blabla.mcs.anl.gov> References: <1202665510.5770.15.camel@blabla.mcs.anl.gov> Message-ID: On Sun, 10 Feb 2008, Mihael Hategan wrote: > Does it really? Or does it cause a race condition which happens to work > most of the times? It produces almost the same KML either way, though at least with the partial closing stuff that I put in a month or two ago enough to be significant. It is not a race in karajan execution because variable declarations get compiled to a separate block that is always placed before the parallel execution of assignments/procedures; so the ordering is irrelevant from that perspective (That is related roblem with using not-yet-evaluated variables in mapper parameters). 
I suspect some funny stuff will happen with partial closing when arrays are used in this order at the moment though. -- From benc at hawaga.org.uk Mon Feb 11 08:28:57 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Feb 2008 14:28:57 +0000 (GMT) Subject: [Swift-devel] cog r1871 Message-ID: On my laptop, I'm getting the below error message with latest cog and my development swift. cog r1864 doesn't give this problem. cog r1871 does (there's nothing in between those two commits in the piece of the cog svn that swift uses) echo failed Execution failed: Exception in echo: Arguments: [hello] Host: tp-fork-gram2 Directory: 001-echo-20080211-1419-kvenil6g/jobs/m/echo-mn9fr9oi stderr.txt:. stdout.txt:. ---- Caused by: Exception in getFile Caused by: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 451 refusing to store with active mode org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) at org.globus.ftp.extended.GridFTPServerFacade.store(GridFTPServerFacade.java:317) at org.globus.ftp.FTPClient.get(FTPClient.java:1236) at org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java:359) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doSource(DelegatedFileTransferHandler.java:275) at org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doSource(CachingDelegatedFileTransferHandler.java:60) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:490) at java.lang.Thread.run(Thread.java:613) -- From hategan at mcs.anl.gov Mon Feb 11 10:01:21 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 10:01:21 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: References: Message-ID: <1202745681.15887.10.camel@blabla.mcs.anl.gov> On Mon, 2008-02-11 at 14:28 +0000, Ben Clifford wrote: > On my laptop, I'm getting the below error message with latest cog and my > development swift. > > cog r1864 doesn't give this problem. cog r1871 does (there's nothing in > between those two commits in the piece of the cog svn that swift uses) I know what causes the problem. It's r1871, as you say. > > > echo failed > Execution failed: > Exception in echo: > Arguments: [hello] > Host: tp-fork-gram2 > Directory: 001-echo-20080211-1419-kvenil6g/jobs/m/echo-mn9fr9oi > stderr.txt:. > > stdout.txt:. > > ---- > > Caused by: > Exception in getFile > Caused by: > Server refused performing the request. 
Custom message: (error > code 1) [Nested exception message: Custom message: Unexpected reply: 451 > refusing to store with active mode > org.globus.ftp.exception.DataChannelException: setPassive() must match > store() and setActive() - retrieve() (error code 2) > org.globus.ftp.exception.DataChannelException: setPassive() must match > store() and setActive() - retrieve() (error code 2) > at > org.globus.ftp.extended.GridFTPServerFacade.store(GridFTPServerFacade.java:317) > at org.globus.ftp.FTPClient.get(FTPClient.java:1236) > at > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java:359) > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doSource(DelegatedFileTransferHandler.java:275) > at > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doSource(CachingDelegatedFileTransferHandler.java:60) > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:490) > at java.lang.Thread.run(Thread.java:613) > > From hategan at mcs.anl.gov Mon Feb 11 10:59:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 10:59:52 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: <1202745681.15887.10.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> Message-ID: <1202749192.18234.0.camel@blabla.mcs.anl.gov> On Mon, 2008-02-11 at 10:01 -0600, Mihael Hategan wrote: > On Mon, 2008-02-11 at 14:28 +0000, Ben Clifford wrote: > > On my laptop, I'm getting the below error message with latest cog and my > > development swift. > > > > cog r1864 doesn't give this problem. cog r1871 does (there's nothing in > > between those two commits in the piece of the cog svn that swift uses) > > I know what causes the problem. Actually I don't. I only have a suspicion. Can you send me the logs? > It's r1871, as you say. > > > > > > > echo failed > > Execution failed: > > Exception in echo: > > Arguments: [hello] > > Host: tp-fork-gram2 > > Directory: 001-echo-20080211-1419-kvenil6g/jobs/m/echo-mn9fr9oi > > stderr.txt:. > > > > stdout.txt:. > > > > ---- > > > > Caused by: > > Exception in getFile > > Caused by: > > Server refused performing the request. 
Custom message: (error > > code 1) [Nested exception message: Custom message: Unexpected reply: 451 > > refusing to store with active mode > > org.globus.ftp.exception.DataChannelException: setPassive() must match > > store() and setActive() - retrieve() (error code 2) > > org.globus.ftp.exception.DataChannelException: setPassive() must match > > store() and setActive() - retrieve() (error code 2) > > at > > org.globus.ftp.extended.GridFTPServerFacade.store(GridFTPServerFacade.java:317) > > at org.globus.ftp.FTPClient.get(FTPClient.java:1236) > > at > > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java:359) > > at > > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doSource(DelegatedFileTransferHandler.java:275) > > at > > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doSource(CachingDelegatedFileTransferHandler.java:60) > > at > > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:490) > > at java.lang.Thread.run(Thread.java:613) > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Feb 11 11:25:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Feb 2008 17:25:05 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202749192.18234.0.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 11 Feb 2008, Mihael Hategan wrote: > Actually I don't. I only have a suspicion. Can you send me the logs? the log for running tests/language-behaviour/061-cattwo to tg-uc from my laptop is here: http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080211-1720-a2hqh596.log -- From hategan at mcs.anl.gov Mon Feb 11 13:34:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 13:34:03 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> Message-ID: <1202758443.28686.0.camel@blabla.mcs.anl.gov> On Mon, 2008-02-11 at 17:25 +0000, Ben Clifford wrote: > > On Mon, 11 Feb 2008, Mihael Hategan wrote: > > > Actually I don't. I only have a suspicion. Can you send me the logs? > > the log for running tests/language-behaviour/061-cattwo to tg-uc from my > laptop is here: > > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080211-1720-a2hqh596.log > r1875 should fix this. From benc at hawaga.org.uk Mon Feb 11 16:19:15 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Feb 2008 22:19:15 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202758443.28686.0.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 11 Feb 2008, Mihael Hategan wrote: > r1875 should fix this. yes, it seems to. 
-- From hategan at mcs.anl.gov Mon Feb 11 16:37:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 16:37:12 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> Message-ID: <1202769433.31985.1.camel@blabla.mcs.anl.gov> Also, r1876 updates the gram4 client to a patched version of 4.0.6 which seems to eat less memory than 4.0.6 and earlier. On Mon, 2008-02-11 at 22:19 +0000, Ben Clifford wrote: > > On Mon, 11 Feb 2008, Mihael Hategan wrote: > > > r1875 should fix this. > > yes, it seems to. > From benc at hawaga.org.uk Mon Feb 11 16:50:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Feb 2008 22:50:56 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202769433.31985.1.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> Message-ID: I'm seeing repeatable cleanup errors like the below. The workflows run to completion, though. RunID: 20080211-2248-rsqe1da0 cat started cat completed The following warnings have occurred: 1. Cleanup on tguc failed Caused by: Cannot submit job: null Caused by: java.lang.NullPointerException at org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) at org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) at org.globus.exec.client.GramJob.submit(GramJob.java:447) at org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) -- From hategan at mcs.anl.gov Mon Feb 11 17:46:42 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 17:46:42 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> Message-ID: <1202773602.779.0.camel@blabla.mcs.anl.gov> Martin? On Mon, 2008-02-11 at 22:50 +0000, Ben Clifford wrote: > I'm seeing repeatable cleanup errors like the below. The workflows run to > completion, though. > > RunID: 20080211-2248-rsqe1da0 > cat started > cat completed > The following warnings have occurred: > 1. 
Cleanup on tguc failed > Caused by: > Cannot submit job: null > Caused by: > java.lang.NullPointerException > at > org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) > at > org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) > at org.globus.exec.client.GramJob.submit(GramJob.java:447) > at > org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) > at > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) > at > edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > > From feller at mcs.anl.gov Mon Feb 11 23:28:05 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Mon, 11 Feb 2008 23:28:05 -0600 (CST) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202773602.779.0.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> Message-ID: <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> My fault, not the ObjectSerializers one. You submitted in batch-mode? The attached jar should fix that. Hope the java version is fine. Martin > Martin? > > On Mon, 2008-02-11 at 22:50 +0000, Ben Clifford wrote: >> I'm seeing repeatable cleanup errors like the below. The workflows run >> to >> completion, though. >> >> RunID: 20080211-2248-rsqe1da0 >> cat started >> cat completed >> The following warnings have occurred: >> 1. Cleanup on tguc failed >> Caused by: >> Cannot submit job: null >> Caused by: >> java.lang.NullPointerException >> at >> org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) >> at >> org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) >> at org.globus.exec.client.GramJob.submit(GramJob.java:447) >> at >> org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) >> at >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) >> at >> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) >> at >> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >> >> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: gram-client.jar Type: application/octet-stream Size: 35855 bytes Desc: not available URL: From benc at hawaga.org.uk Tue Feb 12 06:47:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 12:47:23 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: On Mon, 11 Feb 2008, feller at mcs.anl.gov wrote: > My fault, not the ObjectSerializers one. > You submitted in batch-mode? The final cleanup job that failed is in batch mode, yes. Its the only one submitted that way. > The attached jar should fix that. 
With your new jar, I no longer get that error. I did once get the below stack trace, though execution appeared to continue. It hasn't happened a second time or third time on running the same tests. touch started Unable to destroy remote service for task urn:0-1-1202817892228 java.lang.NullPointerException at org.globus.exec.generated.service.ManagedJobServiceAddressingLocator.getManagedJobPortTypePort(ManagedJobServiceAddressingLocator.java:12) at org.globus.exec.utils.client.ManagedJobClientHelper.getPort(ManagedJobClientHelper.java:32) at org.globus.exec.client.GramJob.destroy(GramJob.java:1303) at org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.cleanup(JobSubmissionTaskHandler.java:431) at org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.stateChanged(JobSubmissionTaskHandler.java:397) at org.globus.exec.client.GramJob.setState(GramJob.java:321) at org.globus.exec.client.GramJob.deliver(GramJob.java:1677) at org.globus.wsrf.impl.notification.NotificationConsumerProvider.notify(NotificationConsumerProvider.java:126) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.apache.axis.providers.java.RPCProvider.invokeMethod(RPCProvider.java:384) at org.apache.axis.providers.java.RPCProvider.processMessage(RPCProvider.java:281) at org.apache.axis.providers.java.JavaProvider.invoke(JavaProvider.java:319) at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) at org.apache.axis.handlers.soap.SOAPService.invoke(SOAPService.java:450) at org.apache.axis.server.AxisServer.invoke(AxisServer.java:285) at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664) at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382) at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:147) at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291) touch completed -- From benc at hawaga.org.uk Tue Feb 12 06:49:21 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 12:49:21 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202769433.31985.1.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 11 Feb 2008, Mihael Hategan wrote: > Also, r1876 updates the gram4 client to a patched version of 4.0.6 which > seems to eat less memory than 4.0.6 and earlier. For source code reproducibility when some sucker wants to go look at the source code, can you label the gram jars with a timestamp (until such time as GT moves to a version control system with commit IDs)? 
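One low-tech way to get that kind of label (purely an illustration, not something the GT or cog build currently emits) is to stamp the patched jar's META-INF/MANIFEST.MF with the build time and the source it was cut from, so the provenance can be read straight out of a deployed checkout:

Implementation-Title: gram-client (patched)
Implementation-Version: 4.0.6-p1
Built-On: 2008-02-11T16:37-0600
Built-From: gt 4.0.6 plus the memory patch

Something like "unzip -p gram-client.jar META-INF/MANIFEST.MF" then shows exactly which build a given cog tree is carrying.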
-- From feller at mcs.anl.gov Tue Feb 12 09:33:14 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Tue, 12 Feb 2008 09:33:14 -0600 (CST) Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: <49223.130.202.97.10.1202830394.squirrel@www-unix.mcs.anl.gov> > On Mon, 11 Feb 2008, feller at mcs.anl.gov wrote: > >> My fault, not the ObjectSerializers one. >> You submitted in batch-mode? > > The final cleanup job that failed is in batch mode, yes. Its the only one > submitted that way. > >> The attached jar should fix that. > > With your new jar, I no longer get that error. I did once get the below > stack trace, though execution appeared to continue. It hasn't happened a > second time or third time on running the same tests. > > touch started > Unable to destroy remote service for task urn:0-1-1202817892228 > java.lang.NullPointerException > at > org.globus.exec.generated.service.ManagedJobServiceAddressingLocator.getManagedJobPortTypePort(ManagedJobServiceAddressingLocator.java:12) > at > org.globus.exec.utils.client.ManagedJobClientHelper.getPort(ManagedJobClientHelper.java:32) > at org.globus.exec.client.GramJob.destroy(GramJob.java:1303) > at > org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.cleanup(JobSubmissionTaskHandler.java:431) > at > org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.stateChanged(JobSubmissionTaskHandler.java:397) > at org.globus.exec.client.GramJob.setState(GramJob.java:321) > at org.globus.exec.client.GramJob.deliver(GramJob.java:1677) > at > org.globus.wsrf.impl.notification.NotificationConsumerProvider.notify(NotificationConsumerProvider.java:126) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:585) > at > org.apache.axis.providers.java.RPCProvider.invokeMethod(RPCProvider.java:384) > at > org.apache.axis.providers.java.RPCProvider.processMessage(RPCProvider.java:281) > at > org.apache.axis.providers.java.JavaProvider.invoke(JavaProvider.java:319) > at > org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) > at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) > at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) > at > org.apache.axis.handlers.soap.SOAPService.invoke(SOAPService.java:450) > at org.apache.axis.server.AxisServer.invoke(AxisServer.java:285) > at > org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664) > at > org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382) > at > org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:147) > at > org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291) > touch completed > > -- This is odd. Can you have an eye on that in further tests? May it happen that you use GramJob.setEndpoint(EndpointReferenceType) before destruction at some point and pass null as argument? That's the only situation where i can see that this can happen right now without an exception being thrown before. 
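If that hypothesis is right, the client-side workaround is just a guard: skip the destroy call when no endpoint was ever recorded for the job, instead of passing null into GramJob.setEndpoint(). A sketch using only the two GramJob methods named in this thread; the helper, its arguments and the logging are illustrative, and the EndpointReferenceType import assumes the Axis addressing class that GT 4.0 clients normally use:

import org.apache.axis.message.addressing.EndpointReferenceType;
import org.globus.exec.client.GramJob;

public final class JobCleanupSketch {
    // Best-effort destroy of the remote ManagedJob resource.  Passing a
    // null EPR into setEndpoint() before destroy() is the failure mode
    // Martin describes above, so bail out early instead.
    static void destroyQuietly(GramJob job, EndpointReferenceType epr, String taskId) {
        if (epr == null) {
            System.err.println("Skipping destroy, no endpoint recorded for " + taskId);
            return;
        }
        try {
            job.setEndpoint(epr);
            job.destroy();
        } catch (Exception e) {
            // Cleanup failures are warnings only; the workflow already completed.
            System.err.println("Unable to destroy remote service for " + taskId + ": " + e);
        }
    }
}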
Martin From mikekubal at yahoo.com Tue Feb 12 12:08:51 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 10:08:51 -0800 (PST) Subject: [Swift-devel] latest attempt with GRAM4 In-Reply-To: Message-ID: <734334.98494.qm@web52305.mail.re2.yahoo.com> Hello All, I am running with the cog and swift from svn as of Monday afternoon, 2/11. The swift script ran successfully when using pre-ws, but failed with ws-gram. I am also running with kickstart on, but will now test with kickstart off to see if this is the problem. This is the error I get back. (I rsynced the log files to Ben's dir at UC, job gtxa3945 is the one that failed). Failed to transfer kickstart records from run_MD_pipeline_loop_for_impdh-20080212-1152-gtxa3945/kickstart/l/UC-64Exception in getFile task:transfer @ vdl-int.k, line: 322 sys:try @ vdl-int.k, line: 322 vdl:transferkickstartrec @ vdl-int.k, line: 409 sys:set @ vdl-int.k, line: 409 sys:sequential @ vdl-int.k, line: 409 sys:try @ vdl-int.k, line: 408 sys:else @ vdl-int.k, line: 407 sys:if @ vdl-int.k, line: 405 sys:set @ vdl-int.k, line: 404 sys:catch @ vdl-int.k, line: 396 sys:try @ vdl-int.k, line: 354 task:allocatehost @ vdl-int.k, line: 334 vdl:execute2 @ execute-default.k, line: 23 sys:restartonerror @ execute-default.k, line: 21 sys:sequential @ execute-default.k, line: 19 sys:try @ execute-default.k, line: 18 sys:if @ execute-default.k, line: 17 sys:then @ execute-default.k, line: 16 sys:if @ execute-default.k, line: 15 vdl:execute @ run_MD_pipeline_loop_for_impdh.kml, line: 67 prepare_ligand @ run_MD_pipeline_loop_for_impdh.kml, line: 585 sys:sequential @ run_MD_pipeline_loop_for_impdh.kml, line: 584 sys:parallel @ run_MD_pipeline_loop_for_impdh.kml, line: 583 sys:parallelfor @ run_MD_pipeline_loop_for_impdh.kml, line: 450 sys:sequential @ run_MD_pipeline_loop_for_impdh.kml, line: 449 vdl:mainp @ run_MD_pipeline_loop_for_impdh.kml, line: 448 mainp @ vdl.k, line: 150 vdl:mains @ run_MD_pipeline_loop_for_impdh.kml, line: 447 vdl:mains @ run_MD_pipeline_loop_for_impdh.kml, line: 447 rlog:restartlog @ run_MD_pipeline_loop_for_impdh.kml, line: 446 kernel:project @ run_MD_pipeline_loop_for_impdh.kml, line: 2 run_MD_pipeline_loop_for_impdh-20080212-1152-gtxa3945 Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Exception in getFile Caused by: org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed. : globus_l_gfs_file_open failed. 500-globus_xio: Unable to open file /home/kubal/Swift_Runs/run_MD_pipeline_loop_for_impdh-20080212-1152-gtxa3945/kickstart/l/amberize_ligand-lqshnboi-kickstart.xml 500-globus_xio: System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500- 500 End.] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 500-Command failed. : globus_l_gfs_file_open failed. 500-globus_xio: Unable to open file /home/kubal/Swift_Runs/run_MD_pipeline_loop_for_impdh-20080212-1152-gtxa3945/kickstart/l/amberize_ligand-lqshnboi-kickstart.xml 500-globus_xio: System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500- 500 End.] --- Ben Clifford wrote: > > On Mon, 11 Feb 2008, Mihael Hategan wrote: > > > Also, r1876 updates the gram4 client to a patched > version of 4.0.6 which > > seems to eat less memory than 4.0.6 and earlier. 
> > For source code reproducibility when some sucker > wants to go look at the > source code, can you label the gram jars with a > timestamp (until such time > as GT moves to a version control system with commit > IDs)? > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping From benc at hawaga.org.uk Tue Feb 12 12:33:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 18:33:34 +0000 (GMT) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <734334.98494.qm@web52305.mail.re2.yahoo.com> References: <734334.98494.qm@web52305.mail.re2.yahoo.com> Message-ID: yeah, run that same without kickstart. the error reported is that kickstart didn't work right - but there's perhaps some underlying error. -- From mikekubal at yahoo.com Tue Feb 12 13:36:20 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 11:36:20 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: Message-ID: <283100.87314.qm@web52307.mail.re2.yahoo.com> Yes, I believe you are right. The kickstart message may be only a warning. After digging a little deeper it appears the job is failing due to a project/account id problem. I get the following error: Caused by: The executable could not be started., qsub: Invalid Account MSG=invalid account I am specifying the same TG-account in my site-file for the gram4 run that fails, as in the site-file for the pre-ws job that suceeds. This is the same project, TG-MCA01S018, that is set in my .tg_default_project file in ~kubal/ on the UC teragrid. --- Ben Clifford wrote: > yeah, run that same without kickstart. the error > reported is that > kickstart didn't work right - but there's perhaps > some underlying error. > -- > > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From hategan at mcs.anl.gov Tue Feb 12 13:45:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 13:45:02 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <283100.87314.qm@web52307.mail.re2.yahoo.com> References: <283100.87314.qm@web52307.mail.re2.yahoo.com> Message-ID: <1202845502.13985.1.camel@blabla.mcs.anl.gov> While this doesn't solve the underlying problem, it may help you get this to work: log into tg-login1.uc..., set this project as default, then remove the project spec from the sites file and try again. Mihael On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: > Yes, I believe you are right. The kickstart message > may be only a warning. After digging a little deeper > it appears the job is failing due to a project/account > id problem. I get the following error: > > Caused by: > The executable could not be started., qsub: > Invalid Account MSG=invalid account > > I am specifying the same TG-account in my site-file > for the gram4 run that fails, as in the site-file for > the pre-ws job that suceeds. This is the same project, > TG-MCA01S018, that is set in my .tg_default_project > file in ~kubal/ on the UC teragrid. > > > > > > > > > > --- Ben Clifford wrote: > > > yeah, run that same without kickstart. 
the error > > reported is that > > kickstart didn't work right - but there's perhaps > > some underlying error. > > -- > > > > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > From mikekubal at yahoo.com Tue Feb 12 14:09:17 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 12:09:17 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202845502.13985.1.camel@blabla.mcs.anl.gov> Message-ID: <874540.48019.qm@web52309.mail.re2.yahoo.com> I'll give it a try. When using GRAM4, is qsub the method used to ultimately put the job in the queue? MikeK --- Mihael Hategan wrote: > While this doesn't solve the underlying problem, it > may help you get > this to work: log into tg-login1.uc..., set this > project as default, > then remove the project spec from the sites file and > try again. > > Mihael > > On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: > > Yes, I believe you are right. The kickstart > message > > may be only a warning. After digging a little > deeper > > it appears the job is failing due to a > project/account > > id problem. I get the following error: > > > > Caused by: > > The executable could not be started., > qsub: > > Invalid Account MSG=invalid account > > > > I am specifying the same TG-account in my > site-file > > for the gram4 run that fails, as in the site-file > for > > the pre-ws job that suceeds. This is the same > project, > > TG-MCA01S018, that is set in my > .tg_default_project > > file in ~kubal/ on the UC teragrid. > > > > > > > > > > > > > > > > > > > > --- Ben Clifford wrote: > > > > > yeah, run that same without kickstart. the error > > > reported is that > > > kickstart didn't work right - but there's > perhaps > > > some underlying error. > > > -- > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Never miss a thing. Make Yahoo your home page. > > http://www.yahoo.com/r/hs > > > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From hategan at mcs.anl.gov Tue Feb 12 14:15:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 14:15:06 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <874540.48019.qm@web52309.mail.re2.yahoo.com> References: <874540.48019.qm@web52309.mail.re2.yahoo.com> Message-ID: <1202847306.14542.1.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal wrote: > I'll give it a try. > > When using GRAM4, is qsub the method used to > ultimately put the job in the queue? Looks like it. I also believe it's the case with pre-ws gram. Stu may be able to clarify. > > MikeK > --- Mihael Hategan wrote: > > > While this doesn't solve the underlying problem, it > > may help you get > > this to work: log into tg-login1.uc..., set this > > project as default, > > then remove the project spec from the sites file and > > try again. > > > > Mihael > > > > On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: > > > Yes, I believe you are right. The kickstart > > message > > > may be only a warning. After digging a little > > deeper > > > it appears the job is failing due to a > > project/account > > > id problem. 
I get the following error: > > > > > > Caused by: > > > The executable could not be started., > > qsub: > > > Invalid Account MSG=invalid account > > > > > > I am specifying the same TG-account in my > > site-file > > > for the gram4 run that fails, as in the site-file > > for > > > the pre-ws job that suceeds. This is the same > > project, > > > TG-MCA01S018, that is set in my > > .tg_default_project > > > file in ~kubal/ on the UC teragrid. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- Ben Clifford wrote: > > > > > > > yeah, run that same without kickstart. the error > > > > reported is that > > > > kickstart didn't work right - but there's > > perhaps > > > > some underlying error. > > > > -- > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Never miss a thing. Make Yahoo your home page. > > > http://www.yahoo.com/r/hs > > > > > > > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > From smartin at mcs.anl.gov Tue Feb 12 14:20:39 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 12 Feb 2008 14:20:39 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202847306.14542.1.camel@blabla.mcs.anl.gov> References: <874540.48019.qm@web52309.mail.re2.yahoo.com> <1202847306.14542.1.camel@blabla.mcs.anl.gov> Message-ID: that's right, qsub is used for PBS (and some others too) bsub is LSF condor_q for condor ... -Stu On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael Hategan wrote: > > On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal wrote: >> I'll give it a try. >> >> When using GRAM4, is qsub the method used to >> ultimately put the job in the queue? > > Looks like it. I also believe it's the case with pre-ws gram. Stu > may be > able to clarify. > >> >> MikeK >> --- Mihael Hategan wrote: >> >>> While this doesn't solve the underlying problem, it >>> may help you get >>> this to work: log into tg-login1.uc..., set this >>> project as default, >>> then remove the project spec from the sites file and >>> try again. >>> >>> Mihael >>> >>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: >>>> Yes, I believe you are right. The kickstart >>> message >>>> may be only a warning. After digging a little >>> deeper >>>> it appears the job is failing due to a >>> project/account >>>> id problem. I get the following error: >>>> >>>> Caused by: >>>> The executable could not be started., >>> qsub: >>>> Invalid Account MSG=invalid account >>>> >>>> I am specifying the same TG-account in my >>> site-file >>>> for the gram4 run that fails, as in the site-file >>> for >>>> the pre-ws job that suceeds. This is the same >>> project, >>>> TG-MCA01S018, that is set in my >>> .tg_default_project >>>> file in ~kubal/ on the UC teragrid. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> --- Ben Clifford wrote: >>>> >>>>> yeah, run that same without kickstart. the error >>>>> reported is that >>>>> kickstart didn't work right - but there's >>> perhaps >>>>> some underlying error. >>>>> -- >>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> >>> >> ____________________________________________________________________________________ >>>> Never miss a thing. Make Yahoo your home page. 
>>>> http://www.yahoo.com/r/hs >>>> >>> >>> >> >> >> >> >> ____________________________________________________________________________________ >> Be a better friend, newshound, and >> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >> > From hategan at mcs.anl.gov Tue Feb 12 14:23:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 14:23:22 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: References: <874540.48019.qm@web52309.mail.re2.yahoo.com> <1202847306.14542.1.camel@blabla.mcs.anl.gov> Message-ID: <1202847802.15085.0.camel@blabla.mcs.anl.gov> Is this the same for pre-WS GRAM? On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin wrote: > that's right, qsub is used for PBS (and some others too) > bsub is LSF > condor_q for condor > ... > > -Stu > > On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael Hategan wrote: > > > > > On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal wrote: > >> I'll give it a try. > >> > >> When using GRAM4, is qsub the method used to > >> ultimately put the job in the queue? > > > > Looks like it. I also believe it's the case with pre-ws gram. Stu > > may be > > able to clarify. > > > >> > >> MikeK > >> --- Mihael Hategan wrote: > >> > >>> While this doesn't solve the underlying problem, it > >>> may help you get > >>> this to work: log into tg-login1.uc..., set this > >>> project as default, > >>> then remove the project spec from the sites file and > >>> try again. > >>> > >>> Mihael > >>> > >>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: > >>>> Yes, I believe you are right. The kickstart > >>> message > >>>> may be only a warning. After digging a little > >>> deeper > >>>> it appears the job is failing due to a > >>> project/account > >>>> id problem. I get the following error: > >>>> > >>>> Caused by: > >>>> The executable could not be started., > >>> qsub: > >>>> Invalid Account MSG=invalid account > >>>> > >>>> I am specifying the same TG-account in my > >>> site-file > >>>> for the gram4 run that fails, as in the site-file > >>> for > >>>> the pre-ws job that suceeds. This is the same > >>> project, > >>>> TG-MCA01S018, that is set in my > >>> .tg_default_project > >>>> file in ~kubal/ on the UC teragrid. > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> --- Ben Clifford wrote: > >>>> > >>>>> yeah, run that same without kickstart. the error > >>>>> reported is that > >>>>> kickstart didn't work right - but there's > >>> perhaps > >>>>> some underlying error. > >>>>> -- > >>>>> > >>>>> > >>>>> > >>>> > >>>> > >>>> > >>>> > >>> > >> ____________________________________________________________________________________ > >>>> Never miss a thing. Make Yahoo your home page. > >>>> http://www.yahoo.com/r/hs > >>>> > >>> > >>> > >> > >> > >> > >> > >> ____________________________________________________________________________________ > >> Be a better friend, newshound, and > >> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >> > > > From smartin at mcs.anl.gov Tue Feb 12 14:26:44 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 12 Feb 2008 14:26:44 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202847802.15085.0.camel@blabla.mcs.anl.gov> References: <874540.48019.qm@web52309.mail.re2.yahoo.com> <1202847306.14542.1.camel@blabla.mcs.anl.gov> <1202847802.15085.0.camel@blabla.mcs.anl.gov> Message-ID: <41A561F6-5D46-4B2C-96B5-E693290C41C6@mcs.anl.gov> Yes. 
Both versions use the *same* perl scripts to submit jobs. On Feb 12, 2008, at Feb 12, 2:23 PM, Mihael Hategan wrote: > Is this the same for pre-WS GRAM? > > On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin wrote: >> that's right, qsub is used for PBS (and some others too) >> bsub is LSF >> condor_q for condor >> ... >> >> -Stu >> >> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael Hategan wrote: >> >>> >>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal wrote: >>>> I'll give it a try. >>>> >>>> When using GRAM4, is qsub the method used to >>>> ultimately put the job in the queue? >>> >>> Looks like it. I also believe it's the case with pre-ws gram. Stu >>> may be >>> able to clarify. >>> >>>> >>>> MikeK >>>> --- Mihael Hategan wrote: >>>> >>>>> While this doesn't solve the underlying problem, it >>>>> may help you get >>>>> this to work: log into tg-login1.uc..., set this >>>>> project as default, >>>>> then remove the project spec from the sites file and >>>>> try again. >>>>> >>>>> Mihael >>>>> >>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: >>>>>> Yes, I believe you are right. The kickstart >>>>> message >>>>>> may be only a warning. After digging a little >>>>> deeper >>>>>> it appears the job is failing due to a >>>>> project/account >>>>>> id problem. I get the following error: >>>>>> >>>>>> Caused by: >>>>>> The executable could not be started., >>>>> qsub: >>>>>> Invalid Account MSG=invalid account >>>>>> >>>>>> I am specifying the same TG-account in my >>>>> site-file >>>>>> for the gram4 run that fails, as in the site-file >>>>> for >>>>>> the pre-ws job that suceeds. This is the same >>>>> project, >>>>>> TG-MCA01S018, that is set in my >>>>> .tg_default_project >>>>>> file in ~kubal/ on the UC teragrid. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> --- Ben Clifford wrote: >>>>>> >>>>>>> yeah, run that same without kickstart. the error >>>>>>> reported is that >>>>>>> kickstart didn't work right - but there's >>>>> perhaps >>>>>>> some underlying error. >>>>>>> -- >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> ____________________________________________________________________________________ >>>>>> Never miss a thing. Make Yahoo your home page. >>>>>> http://www.yahoo.com/r/hs >>>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> >>>> ____________________________________________________________________________________ >>>> Be a better friend, newshound, and >>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>> >>> >> > From mikekubal at yahoo.com Tue Feb 12 14:34:28 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 12:34:28 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202847802.15085.0.camel@blabla.mcs.anl.gov> Message-ID: <660399.26765.qm@web52308.mail.re2.yahoo.com> I tried running with the account id removed from the sites.file as in the following line: but received the same error. --- Mihael Hategan wrote: > Is this the same for pre-WS GRAM? > > On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin > wrote: > > that's right, qsub is used for PBS (and some > others too) > > bsub is LSF > > condor_q for condor > > ... > > > > -Stu > > > > On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > Hategan wrote: > > > > > > > > On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal > wrote: > > >> I'll give it a try. > > >> > > >> When using GRAM4, is qsub the method used to > > >> ultimately put the job in the queue? > > > > > > Looks like it. 
I also believe it's the case with > pre-ws gram. Stu > > > may be > > > able to clarify. > > > > > >> > > >> MikeK > > >> --- Mihael Hategan wrote: > > >> > > >>> While this doesn't solve the underlying > problem, it > > >>> may help you get > > >>> this to work: log into tg-login1.uc..., set > this > > >>> project as default, > > >>> then remove the project spec from the sites > file and > > >>> try again. > > >>> > > >>> Mihael > > >>> > > >>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal > wrote: > > >>>> Yes, I believe you are right. The kickstart > > >>> message > > >>>> may be only a warning. After digging a little > > >>> deeper > > >>>> it appears the job is failing due to a > > >>> project/account > > >>>> id problem. I get the following error: > > >>>> > > >>>> Caused by: > > >>>> The executable could not be started., > > >>> qsub: > > >>>> Invalid Account MSG=invalid account > > >>>> > > >>>> I am specifying the same TG-account in my > > >>> site-file > > >>>> for the gram4 run that fails, as in the > site-file > > >>> for > > >>>> the pre-ws job that suceeds. This is the same > > >>> project, > > >>>> TG-MCA01S018, that is set in my > > >>> .tg_default_project > > >>>> file in ~kubal/ on the UC teragrid. > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> --- Ben Clifford wrote: > > >>>> > > >>>>> yeah, run that same without kickstart. the > error > > >>>>> reported is that > > >>>>> kickstart didn't work right - but there's > > >>> perhaps > > >>>>> some underlying error. > > >>>>> -- > > >>>>> > > >>>>> > > >>>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>> > > >> > ____________________________________________________________________________________ > > >>>> Never miss a thing. Make Yahoo your home > page. > > >>>> http://www.yahoo.com/r/hs > > >>>> > > >>> > > >>> > > >> > > >> > > >> > > >> > > >> > ____________________________________________________________________________________ > > >> Be a better friend, newshound, and > > >> know-it-all with Yahoo! Mobile. Try it now. > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > >> > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping From hategan at mcs.anl.gov Tue Feb 12 14:37:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 14:37:18 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <660399.26765.qm@web52308.mail.re2.yahoo.com> References: <660399.26765.qm@web52308.mail.re2.yahoo.com> Message-ID: <1202848638.15905.0.camel@blabla.mcs.anl.gov> You should probably remove the line completely. Did you chose a default project on the login node with tgprojects? On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal wrote: > I tried running with the account id removed from the > sites.file as in the following line: > > > > but received the same error. > > > > --- Mihael Hategan wrote: > > > Is this the same for pre-WS GRAM? > > > > On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin > > wrote: > > > that's right, qsub is used for PBS (and some > > others too) > > > bsub is LSF > > > condor_q for condor > > > ... 
> > > > > > -Stu > > > > > > On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > Hategan wrote: > > > > > > > > > > > On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal > > wrote: > > > >> I'll give it a try. > > > >> > > > >> When using GRAM4, is qsub the method used to > > > >> ultimately put the job in the queue? > > > > > > > > Looks like it. I also believe it's the case with > > pre-ws gram. Stu > > > > may be > > > > able to clarify. > > > > > > > >> > > > >> MikeK > > > >> --- Mihael Hategan wrote: > > > >> > > > >>> While this doesn't solve the underlying > > problem, it > > > >>> may help you get > > > >>> this to work: log into tg-login1.uc..., set > > this > > > >>> project as default, > > > >>> then remove the project spec from the sites > > file and > > > >>> try again. > > > >>> > > > >>> Mihael > > > >>> > > > >>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal > > wrote: > > > >>>> Yes, I believe you are right. The kickstart > > > >>> message > > > >>>> may be only a warning. After digging a little > > > >>> deeper > > > >>>> it appears the job is failing due to a > > > >>> project/account > > > >>>> id problem. I get the following error: > > > >>>> > > > >>>> Caused by: > > > >>>> The executable could not be started., > > > >>> qsub: > > > >>>> Invalid Account MSG=invalid account > > > >>>> > > > >>>> I am specifying the same TG-account in my > > > >>> site-file > > > >>>> for the gram4 run that fails, as in the > > site-file > > > >>> for > > > >>>> the pre-ws job that suceeds. This is the same > > > >>> project, > > > >>>> TG-MCA01S018, that is set in my > > > >>> .tg_default_project > > > >>>> file in ~kubal/ on the UC teragrid. > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> --- Ben Clifford wrote: > > > >>>> > > > >>>>> yeah, run that same without kickstart. the > > error > > > >>>>> reported is that > > > >>>>> kickstart didn't work right - but there's > > > >>> perhaps > > > >>>>> some underlying error. > > > >>>>> -- > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>> > > > >> > > > ____________________________________________________________________________________ > > > >>>> Never miss a thing. Make Yahoo your home > > page. > > > >>>> http://www.yahoo.com/r/hs > > > >>>> > > > >>> > > > >>> > > > >> > > > >> > > > >> > > > >> > > > >> > > > ____________________________________________________________________________________ > > > >> Be a better friend, newshound, and > > > >> know-it-all with Yahoo! Mobile. Try it now. > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > >> > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Looking for last minute shopping deals? > Find them fast with Yahoo! Search. 
http://tools.search.yahoo.com/newsearch/category.php?category=shopping > From insley at mcs.anl.gov Tue Feb 12 14:45:29 2008 From: insley at mcs.anl.gov (joseph insley) Date: Tue, 12 Feb 2008 14:45:29 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202848638.15905.0.camel@blabla.mcs.anl.gov> References: <660399.26765.qm@web52308.mail.re2.yahoo.com> <1202848638.15905.0.camel@blabla.mcs.anl.gov> Message-ID: <54A8DBC2-386E-4446-B29C-64952AD7B782@mcs.anl.gov> Mike K, looks like you have the wrong value in your .tg_default_project file: insley at tg-viz-login1:~> more ~kubal/.tg_default_project TG-MCA01S018 you should be using: TG-MCB010025N insley at tg-viz-login1:~> tgusage -i -u kubal [snip] Account: TG-MCA01S018 Title: Computational Studies of Complex Processes in Biological Macromolecular Systems Resource: teragrid **** Local project name on dtf.anl.teragrid is TG-MCB010025N **** Allocation Period: 2007-08-03 to 2008-03-31 Name (Last First) or Account Total Remaining Usage ---------------------------- ---------- ------------ ---------- Kubal Michael 101880 SU 99358 SU 296 SU ---------------------------------------------------------------------- TG-MCA01S018 101880 SU 99358 SU 2522 SU On Feb 12, 2008, at 2:37 PM, Mihael Hategan wrote: > You should probably remove the line completely. > > Did you chose a default project on the login node with tgprojects? > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal wrote: >> I tried running with the account id removed from the >> sites.file as in the following line: >> >> >> >> but received the same error. >> >> >> >> --- Mihael Hategan wrote: >> >>> Is this the same for pre-WS GRAM? >>> >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin >>> wrote: >>>> that's right, qsub is used for PBS (and some >>> others too) >>>> bsub is LSF >>>> condor_q for condor >>>> ... >>>> >>>> -Stu >>>> >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael >>> Hategan wrote: >>>> >>>>> >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal >>> wrote: >>>>>> I'll give it a try. >>>>>> >>>>>> When using GRAM4, is qsub the method used to >>>>>> ultimately put the job in the queue? >>>>> >>>>> Looks like it. I also believe it's the case with >>> pre-ws gram. Stu >>>>> may be >>>>> able to clarify. >>>>> >>>>>> >>>>>> MikeK >>>>>> --- Mihael Hategan wrote: >>>>>> >>>>>>> While this doesn't solve the underlying >>> problem, it >>>>>>> may help you get >>>>>>> this to work: log into tg-login1.uc..., set >>> this >>>>>>> project as default, >>>>>>> then remove the project spec from the sites >>> file and >>>>>>> try again. >>>>>>> >>>>>>> Mihael >>>>>>> >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal >>> wrote: >>>>>>>> Yes, I believe you are right. The kickstart >>>>>>> message >>>>>>>> may be only a warning. After digging a little >>>>>>> deeper >>>>>>>> it appears the job is failing due to a >>>>>>> project/account >>>>>>>> id problem. I get the following error: >>>>>>>> >>>>>>>> Caused by: >>>>>>>> The executable could not be started., >>>>>>> qsub: >>>>>>>> Invalid Account MSG=invalid account >>>>>>>> >>>>>>>> I am specifying the same TG-account in my >>>>>>> site-file >>>>>>>> for the gram4 run that fails, as in the >>> site-file >>>>>>> for >>>>>>>> the pre-ws job that suceeds. This is the same >>>>>>> project, >>>>>>>> TG-MCA01S018, that is set in my >>>>>>> .tg_default_project >>>>>>>> file in ~kubal/ on the UC teragrid. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> --- Ben Clifford wrote: >>>>>>>> >>>>>>>>> yeah, run that same without kickstart. the >>> error >>>>>>>>> reported is that >>>>>>>>> kickstart didn't work right - but there's >>>>>>> perhaps >>>>>>>>> some underlying error. >>>>>>>>> -- >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>> >> _____________________________________________________________________ >> _______________ >>>>>>>> Never miss a thing. Make Yahoo your home >>> page. >>>>>>>> http://www.yahoo.com/r/hs >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>> >> _____________________________________________________________________ >> _______________ >>>>>> Be a better friend, newshound, and >>>>>> know-it-all with Yahoo! Mobile. Try it now. >>> >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >> >> >> >> >> _____________________________________________________________________ >> _______________ >> Looking for last minute shopping deals? >> Find them fast with Yahoo! Search. http://tools.search.yahoo.com/ >> newsearch/category.php?category=shopping >> > =================================================== joseph a. insley insley at mcs.anl.gov mathematics & computer science division (630) 252-5649 argonne national laboratory (630) 252-5986 (fax) -------------- next part -------------- An HTML attachment was scrubbed... URL: From mikekubal at yahoo.com Tue Feb 12 16:20:27 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 14:20:27 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <54A8DBC2-386E-4446-B29C-64952AD7B782@mcs.anl.gov> Message-ID: <341523.12842.qm@web52302.mail.re2.yahoo.com> Thanks Joe. This solved the account id problem. --- joseph insley wrote: > Mike K, > > looks like you have the wrong value in your > .tg_default_project file: > > insley at tg-viz-login1:~> more > ~kubal/.tg_default_project > TG-MCA01S018 > > you should be using: TG-MCB010025N > > insley at tg-viz-login1:~> tgusage -i -u kubal > > [snip] > > Account: TG-MCA01S018 > Title: Computational Studies of Complex Processes in > Biological > Macromolecular Systems > Resource: teragrid > > **** > Local project name on dtf.anl.teragrid is > TG-MCB010025N > **** > > Allocation Period: 2007-08-03 to 2008-03-31 > > Name (Last First) or Account Total > Remaining Usage > ---------------------------- ---------- > ------------ ---------- > Kubal Michael 101880 SU > 99358 SU 296 SU > ---------------------------------------------------------------------- > TG-MCA01S018 101880 SU > 99358 SU 2522 SU > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan wrote: > > > You should probably remove the line completely. > > > > Did you chose a default project on the login node > with tgprojects? > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > wrote: > >> I tried running with the account id removed from > the > >> sites.file as in the following line: > >> > >> > >> > >> but received the same error. > >> > >> > >> > >> --- Mihael Hategan wrote: > >> > >>> Is this the same for pre-WS GRAM? 
> >>> > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin > >>> wrote: > >>>> that's right, qsub is used for PBS (and some > >>> others too) > >>>> bsub is LSF > >>>> condor_q for condor > >>>> ... > >>>> > >>>> -Stu > >>>> > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > >>> Hategan wrote: > >>>> > >>>>> > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal > >>> wrote: > >>>>>> I'll give it a try. > >>>>>> > >>>>>> When using GRAM4, is qsub the method used to > >>>>>> ultimately put the job in the queue? > >>>>> > >>>>> Looks like it. I also believe it's the case > with > >>> pre-ws gram. Stu > >>>>> may be > >>>>> able to clarify. > >>>>> > >>>>>> > >>>>>> MikeK > >>>>>> --- Mihael Hategan > wrote: > >>>>>> > >>>>>>> While this doesn't solve the underlying > >>> problem, it > >>>>>>> may help you get > >>>>>>> this to work: log into tg-login1.uc..., set > >>> this > >>>>>>> project as default, > >>>>>>> then remove the project spec from the sites > >>> file and > >>>>>>> try again. > >>>>>>> > >>>>>>> Mihael > >>>>>>> > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > Kubal > >>> wrote: > >>>>>>>> Yes, I believe you are right. The kickstart > >>>>>>> message > >>>>>>>> may be only a warning. After digging a > little > >>>>>>> deeper > >>>>>>>> it appears the job is failing due to a > >>>>>>> project/account > >>>>>>>> id problem. I get the following error: > >>>>>>>> > >>>>>>>> Caused by: > >>>>>>>> The executable could not be > started., > >>>>>>> qsub: > >>>>>>>> Invalid Account MSG=invalid account > >>>>>>>> > >>>>>>>> I am specifying the same TG-account in my > >>>>>>> site-file > >>>>>>>> for the gram4 run that fails, as in the > >>> site-file > >>>>>>> for > >>>>>>>> the pre-ws job that suceeds. This is the > same > >>>>>>> project, > >>>>>>>> TG-MCA01S018, that is set in my > >>>>>>> .tg_default_project > >>>>>>>> file in ~kubal/ on the UC teragrid. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> --- Ben Clifford > wrote: > >>>>>>>> > >>>>>>>>> yeah, run that same without kickstart. the > >>> error > >>>>>>>>> reported is that > >>>>>>>>> kickstart didn't work right - but there's > >>>>>>> perhaps > >>>>>>>>> some underlying error. > >>>>>>>>> -- > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>> > >> > _____________________________________________________________________ > > >> _______________ > >>>>>>>> Never miss a thing. Make Yahoo your home > >>> page. > >>>>>>>> http://www.yahoo.com/r/hs > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>> > >> > _____________________________________________________________________ > > >> _______________ > >>>>>> Be a better friend, newshound, and > >>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>> > >> > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>> > >>>>> > >>>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> > >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > === message truncated === ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. 
http://tools.search.yahoo.com/newsearch/category.php?category=shopping From hategan at mcs.anl.gov Tue Feb 12 16:23:34 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 16:23:34 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <341523.12842.qm@web52302.mail.re2.yahoo.com> References: <341523.12842.qm@web52302.mail.re2.yahoo.com> Message-ID: <1202855014.23472.0.camel@blabla.mcs.anl.gov> Would it be worth trying to find out why it worked with pre-WS GRAM? Mihael On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: > Thanks Joe. This solved the account id problem. > > --- joseph insley wrote: > > > Mike K, > > > > looks like you have the wrong value in your > > .tg_default_project file: > > > > insley at tg-viz-login1:~> more > > ~kubal/.tg_default_project > > TG-MCA01S018 > > > > you should be using: TG-MCB010025N > > > > insley at tg-viz-login1:~> tgusage -i -u kubal > > > > [snip] > > > > Account: TG-MCA01S018 > > Title: Computational Studies of Complex Processes in > > Biological > > Macromolecular Systems > > Resource: teragrid > > > > **** > > Local project name on dtf.anl.teragrid is > > TG-MCB010025N > > **** > > > > Allocation Period: 2007-08-03 to 2008-03-31 > > > > Name (Last First) or Account Total > > Remaining Usage > > ---------------------------- ---------- > > ------------ ---------- > > Kubal Michael 101880 SU > > 99358 SU 296 SU > > > ---------------------------------------------------------------------- > > TG-MCA01S018 101880 SU > > 99358 SU 2522 SU > > > > > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan wrote: > > > > > You should probably remove the line completely. > > > > > > Did you chose a default project on the login node > > with tgprojects? > > > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > wrote: > > >> I tried running with the account id removed from > > the > > >> sites.file as in the following line: > > >> > > >> > > >> > > >> but received the same error. > > >> > > >> > > >> > > >> --- Mihael Hategan wrote: > > >> > > >>> Is this the same for pre-WS GRAM? > > >>> > > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin > > >>> wrote: > > >>>> that's right, qsub is used for PBS (and some > > >>> others too) > > >>>> bsub is LSF > > >>>> condor_q for condor > > >>>> ... > > >>>> > > >>>> -Stu > > >>>> > > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > >>> Hategan wrote: > > >>>> > > >>>>> > > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal > > >>> wrote: > > >>>>>> I'll give it a try. > > >>>>>> > > >>>>>> When using GRAM4, is qsub the method used to > > >>>>>> ultimately put the job in the queue? > > >>>>> > > >>>>> Looks like it. I also believe it's the case > > with > > >>> pre-ws gram. Stu > > >>>>> may be > > >>>>> able to clarify. > > >>>>> > > >>>>>> > > >>>>>> MikeK > > >>>>>> --- Mihael Hategan > > wrote: > > >>>>>> > > >>>>>>> While this doesn't solve the underlying > > >>> problem, it > > >>>>>>> may help you get > > >>>>>>> this to work: log into tg-login1.uc..., set > > >>> this > > >>>>>>> project as default, > > >>>>>>> then remove the project spec from the sites > > >>> file and > > >>>>>>> try again. > > >>>>>>> > > >>>>>>> Mihael > > >>>>>>> > > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > Kubal > > >>> wrote: > > >>>>>>>> Yes, I believe you are right. The kickstart > > >>>>>>> message > > >>>>>>>> may be only a warning. 
After digging a > > little > > >>>>>>> deeper > > >>>>>>>> it appears the job is failing due to a > > >>>>>>> project/account > > >>>>>>>> id problem. I get the following error: > > >>>>>>>> > > >>>>>>>> Caused by: > > >>>>>>>> The executable could not be > > started., > > >>>>>>> qsub: > > >>>>>>>> Invalid Account MSG=invalid account > > >>>>>>>> > > >>>>>>>> I am specifying the same TG-account in my > > >>>>>>> site-file > > >>>>>>>> for the gram4 run that fails, as in the > > >>> site-file > > >>>>>>> for > > >>>>>>>> the pre-ws job that suceeds. This is the > > same > > >>>>>>> project, > > >>>>>>>> TG-MCA01S018, that is set in my > > >>>>>>> .tg_default_project > > >>>>>>>> file in ~kubal/ on the UC teragrid. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> --- Ben Clifford > > wrote: > > >>>>>>>> > > >>>>>>>>> yeah, run that same without kickstart. the > > >>> error > > >>>>>>>>> reported is that > > >>>>>>>>> kickstart didn't work right - but there's > > >>>>>>> perhaps > > >>>>>>>>> some underlying error. > > >>>>>>>>> -- > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>> > > >> > > > _____________________________________________________________________ > > > > >> _______________ > > >>>>>>>> Never miss a thing. Make Yahoo your home > > >>> page. > > >>>>>>>> http://www.yahoo.com/r/hs > > >>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>> > > >> > > > _____________________________________________________________________ > > > > >> _______________ > > >>>>>> Be a better friend, newshound, and > > >>>>>> know-it-all with Yahoo! Mobile. Try it now. > > >>> > > >> > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > >>>>>> > > >>>>> > > >>>> > > >>> > > >>> _______________________________________________ > > >>> Swift-devel mailing list > > >>> Swift-devel at ci.uchicago.edu > > >>> > > >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>> > > > === message truncated === > > > > ____________________________________________________________________________________ > Looking for last minute shopping deals? > Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping > From mikekubal at yahoo.com Tue Feb 12 16:43:38 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 14:43:38 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202855014.23472.0.camel@blabla.mcs.anl.gov> Message-ID: <876972.28192.qm@web52307.mail.re2.yahoo.com> Just to be sure I tested with pre-WS and it worked also. --- Mihael Hategan wrote: > Would it be worth trying to find out why it worked > with pre-WS GRAM? > > Mihael > > On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: > > Thanks Joe. This solved the account id problem. 
> > > > --- joseph insley wrote: > > > > > Mike K, > > > > > > looks like you have the wrong value in your > > > .tg_default_project file: > > > > > > insley at tg-viz-login1:~> more > > > ~kubal/.tg_default_project > > > TG-MCA01S018 > > > > > > you should be using: TG-MCB010025N > > > > > > insley at tg-viz-login1:~> tgusage -i -u kubal > > > > > > [snip] > > > > > > Account: TG-MCA01S018 > > > Title: Computational Studies of Complex > Processes in > > > Biological > > > Macromolecular Systems > > > Resource: teragrid > > > > > > **** > > > Local project name on dtf.anl.teragrid is > > > TG-MCB010025N > > > **** > > > > > > Allocation Period: 2007-08-03 to 2008-03-31 > > > > > > Name (Last First) or Account Total > > > Remaining Usage > > > ---------------------------- ---------- > > > ------------ ---------- > > > Kubal Michael 101880 SU > > > 99358 SU 296 SU > > > > > > ---------------------------------------------------------------------- > > > TG-MCA01S018 101880 SU > > > 99358 SU 2522 SU > > > > > > > > > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan > wrote: > > > > > > > You should probably remove the line > completely. > > > > > > > > Did you chose a default project on the login > node > > > with tgprojects? > > > > > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > > wrote: > > > >> I tried running with the account id removed > from > > > the > > > >> sites.file as in the following line: > > > >> > > > >> > > > >> > > > >> but received the same error. > > > >> > > > >> > > > >> > > > >> --- Mihael Hategan > wrote: > > > >> > > > >>> Is this the same for pre-WS GRAM? > > > >>> > > > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > Martin > > > >>> wrote: > > > >>>> that's right, qsub is used for PBS (and > some > > > >>> others too) > > > >>>> bsub is LSF > > > >>>> condor_q for condor > > > >>>> ... > > > >>>> > > > >>>> -Stu > > > >>>> > > > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > > >>> Hategan wrote: > > > >>>> > > > >>>>> > > > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > Kubal > > > >>> wrote: > > > >>>>>> I'll give it a try. > > > >>>>>> > > > >>>>>> When using GRAM4, is qsub the method used > to > > > >>>>>> ultimately put the job in the queue? > > > >>>>> > > > >>>>> Looks like it. I also believe it's the > case > > > with > > > >>> pre-ws gram. Stu > > > >>>>> may be > > > >>>>> able to clarify. > > > >>>>> > > > >>>>>> > > > >>>>>> MikeK > > > >>>>>> --- Mihael Hategan > > > wrote: > > > >>>>>> > > > >>>>>>> While this doesn't solve the underlying > > > >>> problem, it > > > >>>>>>> may help you get > > > >>>>>>> this to work: log into tg-login1.uc..., > set > > > >>> this > > > >>>>>>> project as default, > > > >>>>>>> then remove the project spec from the > sites > > > >>> file and > > > >>>>>>> try again. > > > >>>>>>> > > > >>>>>>> Mihael > > > >>>>>>> > > > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > > Kubal > > > >>> wrote: > > > >>>>>>>> Yes, I believe you are right. The > kickstart > > > >>>>>>> message > > > >>>>>>>> may be only a warning. After digging a > > > little > > > >>>>>>> deeper > > > >>>>>>>> it appears the job is failing due to a > > > >>>>>>> project/account > > > >>>>>>>> id problem. 
I get the following error: > > > >>>>>>>> > > > >>>>>>>> Caused by: > > > >>>>>>>> The executable could not be > > > started., > > > >>>>>>> qsub: > > > >>>>>>>> Invalid Account MSG=invalid account > > > >>>>>>>> > > > >>>>>>>> I am specifying the same TG-account in > my > > > >>>>>>> site-file > > > >>>>>>>> for the gram4 run that fails, as in the > > > >>> site-file > > > >>>>>>> for > > > >>>>>>>> the pre-ws job that suceeds. This is > the > > > same > > > >>>>>>> project, > > > >>>>>>>> TG-MCA01S018, that is set in my > > > >>>>>>> .tg_default_project > > > >>>>>>>> file in ~kubal/ on the UC teragrid. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> --- Ben Clifford > > > wrote: > > > >>>>>>>> > > > >>>>>>>>> yeah, run that same without kickstart. > the > > > >>> error > > > >>>>>>>>> reported is that > > > >>>>>>>>> kickstart didn't work right - but > there's > > > >>>>>>> perhaps > > > >>>>>>>>> some underlying error. > > > >>>>>>>>> -- > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>> > > > >> > > > > > > _____________________________________________________________________ > > > > > > >> _______________ > === message truncated === ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From insley at mcs.anl.gov Tue Feb 12 16:48:05 2008 From: insley at mcs.anl.gov (joseph insley) Date: Tue, 12 Feb 2008 16:48:05 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202855014.23472.0.camel@blabla.mcs.anl.gov> References: <341523.12842.qm@web52302.mail.re2.yahoo.com> <1202855014.23472.0.camel@blabla.mcs.anl.gov> Message-ID: If a project id is specified explicitly in the job description, that takes precedence over the default project. Could it be that the correct one was previously specified in the job request? joe. On Feb 12, 2008, at 4:23 PM, Mihael Hategan wrote: > Would it be worth trying to find out why it worked with pre-WS GRAM? > > Mihael > > On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: >> Thanks Joe. This solved the account id problem. >> >> --- joseph insley wrote: >> >>> Mike K, >>> >>> looks like you have the wrong value in your >>> .tg_default_project file: >>> >>> insley at tg-viz-login1:~> more >>> ~kubal/.tg_default_project >>> TG-MCA01S018 >>> >>> you should be using: TG-MCB010025N >>> >>> insley at tg-viz-login1:~> tgusage -i -u kubal >>> >>> [snip] >>> >>> Account: TG-MCA01S018 >>> Title: Computational Studies of Complex Processes in >>> Biological >>> Macromolecular Systems >>> Resource: teragrid >>> >>> **** >>> Local project name on dtf.anl.teragrid is >>> TG-MCB010025N >>> **** >>> >>> Allocation Period: 2007-08-03 to 2008-03-31 >>> >>> Name (Last First) or Account Total >>> Remaining Usage >>> ---------------------------- ---------- >>> ------------ ---------- >>> Kubal Michael 101880 SU >>> 99358 SU 296 SU >>> >> --------------------------------------------------------------------- >> - >>> TG-MCA01S018 101880 SU >>> 99358 SU 2522 SU >>> >>> >>> >>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan wrote: >>> >>>> You should probably remove the line completely. >>>> >>>> Did you chose a default project on the login node >>> with tgprojects? 
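For reference, the checks being discussed here come down to a few commands on the TeraGrid login node. This is a minimal sketch using the commands named in this thread; exact output varies by site, and whether tgprojects sets or merely lists the default is an assumption here.

  cat ~/.tg_default_project     # the project jobs are charged to by default
  tgusage -i -u $USER           # lists allocations; note the "Local project name" line
  tgprojects                    # inspect/choose the default project on this login node

  # If the file holds the grid-wide award number, point it at the local
  # project name instead (assuming the file is just the plain-text name,
  # as the 'more' output quoted above suggests):
  echo TG-MCB010025N > ~/.tg_default_project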
>>>> >>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal >>> wrote: >>>>> I tried running with the account id removed from >>> the >>>>> sites.file as in the following line: >>>>> >>>>> >>>>> >>>>> but received the same error. >>>>> >>>>> >>>>> >>>>> --- Mihael Hategan wrote: >>>>> >>>>>> Is this the same for pre-WS GRAM? >>>>>> >>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin >>>>>> wrote: >>>>>>> that's right, qsub is used for PBS (and some >>>>>> others too) >>>>>>> bsub is LSF >>>>>>> condor_q for condor >>>>>>> ... >>>>>>> >>>>>>> -Stu >>>>>>> >>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael >>>>>> Hategan wrote: >>>>>>> >>>>>>>> >>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal >>>>>> wrote: >>>>>>>>> I'll give it a try. >>>>>>>>> >>>>>>>>> When using GRAM4, is qsub the method used to >>>>>>>>> ultimately put the job in the queue? >>>>>>>> >>>>>>>> Looks like it. I also believe it's the case >>> with >>>>>> pre-ws gram. Stu >>>>>>>> may be >>>>>>>> able to clarify. >>>>>>>> >>>>>>>>> >>>>>>>>> MikeK >>>>>>>>> --- Mihael Hategan >>> wrote: >>>>>>>>> >>>>>>>>>> While this doesn't solve the underlying >>>>>> problem, it >>>>>>>>>> may help you get >>>>>>>>>> this to work: log into tg-login1.uc..., set >>>>>> this >>>>>>>>>> project as default, >>>>>>>>>> then remove the project spec from the sites >>>>>> file and >>>>>>>>>> try again. >>>>>>>>>> >>>>>>>>>> Mihael >>>>>>>>>> >>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike >>> Kubal >>>>>> wrote: >>>>>>>>>>> Yes, I believe you are right. The kickstart >>>>>>>>>> message >>>>>>>>>>> may be only a warning. After digging a >>> little >>>>>>>>>> deeper >>>>>>>>>>> it appears the job is failing due to a >>>>>>>>>> project/account >>>>>>>>>>> id problem. I get the following error: >>>>>>>>>>> >>>>>>>>>>> Caused by: >>>>>>>>>>> The executable could not be >>> started., >>>>>>>>>> qsub: >>>>>>>>>>> Invalid Account MSG=invalid account >>>>>>>>>>> >>>>>>>>>>> I am specifying the same TG-account in my >>>>>>>>>> site-file >>>>>>>>>>> for the gram4 run that fails, as in the >>>>>> site-file >>>>>>>>>> for >>>>>>>>>>> the pre-ws job that suceeds. This is the >>> same >>>>>>>>>> project, >>>>>>>>>>> TG-MCA01S018, that is set in my >>>>>>>>>> .tg_default_project >>>>>>>>>>> file in ~kubal/ on the UC teragrid. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> --- Ben Clifford >>> wrote: >>>>>>>>>>> >>>>>>>>>>>> yeah, run that same without kickstart. the >>>>>> error >>>>>>>>>>>> reported is that >>>>>>>>>>>> kickstart didn't work right - but there's >>>>>>>>>> perhaps >>>>>>>>>>>> some underlying error. >>>>>>>>>>>> -- >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>> >>>>> >>> >> _____________________________________________________________________ >>> >>>>> _______________ >>>>>>>>>>> Never miss a thing. Make Yahoo your home >>>>>> page. >>>>>>>>>>> http://www.yahoo.com/r/hs >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>> >>>>> >>> >> _____________________________________________________________________ >>> >>>>> _______________ >>>>>>>>> Be a better friend, newshound, and >>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. 
>>>>>> >>>>> >>> >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> >>>>> >>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>> >> === message truncated === >> >> >> >> >> _____________________________________________________________________ >> _______________ >> Looking for last minute shopping deals? >> Find them fast with Yahoo! Search. http://tools.search.yahoo.com/ >> newsearch/category.php?category=shopping >> > From wilde at mcs.anl.gov Tue Feb 12 16:51:31 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 Feb 2008 16:51:31 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <876972.28192.qm@web52307.mail.re2.yahoo.com> References: <876972.28192.qm@web52307.mail.re2.yahoo.com> Message-ID: <47B222F3.3010709@mcs.anl.gov> Mike, did you do a recent test with pre-WS-GRAM with the .tg_default_project file set *incorrectly*? I think the puzzle was why this would cause WS-GRAM to fail but not pre-WS-GRAM, as it would seem they would both get the TG account to use in the same manner. - mikew On 2/12/08 4:43 PM, Mike Kubal wrote: > Just to be sure I tested with pre-WS and it worked > also. > > --- Mihael Hategan wrote: > >> Would it be worth trying to find out why it worked >> with pre-WS GRAM? >> >> Mihael >> >> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: >>> Thanks Joe. This solved the account id problem. >>> >>> --- joseph insley wrote: >>> >>>> Mike K, >>>> >>>> looks like you have the wrong value in your >>>> .tg_default_project file: >>>> >>>> insley at tg-viz-login1:~> more >>>> ~kubal/.tg_default_project >>>> TG-MCA01S018 >>>> >>>> you should be using: TG-MCB010025N >>>> >>>> insley at tg-viz-login1:~> tgusage -i -u kubal >>>> >>>> [snip] >>>> >>>> Account: TG-MCA01S018 >>>> Title: Computational Studies of Complex >> Processes in >>>> Biological >>>> Macromolecular Systems >>>> Resource: teragrid >>>> >>>> **** >>>> Local project name on dtf.anl.teragrid is >>>> TG-MCB010025N >>>> **** >>>> >>>> Allocation Period: 2007-08-03 to 2008-03-31 >>>> >>>> Name (Last First) or Account Total >>>> Remaining Usage >>>> ---------------------------- ---------- >>>> ------------ ---------- >>>> Kubal Michael 101880 SU >>>> 99358 SU 296 SU >>>> > ---------------------------------------------------------------------- >>>> TG-MCA01S018 101880 SU >>>> 99358 SU 2522 SU >>>> >>>> >>>> >>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan >> wrote: >>>>> You should probably remove the line >> completely. >>>>> Did you chose a default project on the login >> node >>>> with tgprojects? >>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal >>>> wrote: >>>>>> I tried running with the account id removed >> from >>>> the >>>>>> sites.file as in the following line: >>>>>> >>>>>> >>>>>> >>>>>> but received the same error. >>>>>> >>>>>> >>>>>> >>>>>> --- Mihael Hategan >> wrote: >>>>>>> Is this the same for pre-WS GRAM? >>>>>>> >>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart >> Martin >>>>>>> wrote: >>>>>>>> that's right, qsub is used for PBS (and >> some >>>>>>> others too) >>>>>>>> bsub is LSF >>>>>>>> condor_q for condor >>>>>>>> ... >>>>>>>> >>>>>>>> -Stu >>>>>>>> >>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael >>>>>>> Hategan wrote: >>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike >> Kubal >>>>>>> wrote: >>>>>>>>>> I'll give it a try. 
>>>>>>>>>> >>>>>>>>>> When using GRAM4, is qsub the method used >> to >>>>>>>>>> ultimately put the job in the queue? >>>>>>>>> Looks like it. I also believe it's the >> case >>>> with >>>>>>> pre-ws gram. Stu >>>>>>>>> may be >>>>>>>>> able to clarify. >>>>>>>>> >>>>>>>>>> MikeK >>>>>>>>>> --- Mihael Hategan >>>> wrote: >>>>>>>>>>> While this doesn't solve the underlying >>>>>>> problem, it >>>>>>>>>>> may help you get >>>>>>>>>>> this to work: log into tg-login1.uc..., >> set >>>>>>> this >>>>>>>>>>> project as default, >>>>>>>>>>> then remove the project spec from the >> sites >>>>>>> file and >>>>>>>>>>> try again. >>>>>>>>>>> >>>>>>>>>>> Mihael >>>>>>>>>>> >>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike >>>> Kubal >>>>>>> wrote: >>>>>>>>>>>> Yes, I believe you are right. The >> kickstart >>>>>>>>>>> message >>>>>>>>>>>> may be only a warning. After digging a >>>> little >>>>>>>>>>> deeper >>>>>>>>>>>> it appears the job is failing due to a >>>>>>>>>>> project/account >>>>>>>>>>>> id problem. I get the following error: >>>>>>>>>>>> >>>>>>>>>>>> Caused by: >>>>>>>>>>>> The executable could not be >>>> started., >>>>>>>>>>> qsub: >>>>>>>>>>>> Invalid Account MSG=invalid account >>>>>>>>>>>> >>>>>>>>>>>> I am specifying the same TG-account in >> my >>>>>>>>>>> site-file >>>>>>>>>>>> for the gram4 run that fails, as in the >>>>>>> site-file >>>>>>>>>>> for >>>>>>>>>>>> the pre-ws job that suceeds. This is >> the >>>> same >>>>>>>>>>> project, >>>>>>>>>>>> TG-MCA01S018, that is set in my >>>>>>>>>>> .tg_default_project >>>>>>>>>>>> file in ~kubal/ on the UC teragrid. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> --- Ben Clifford >>>> wrote: >>>>>>>>>>>>> yeah, run that same without kickstart. >> the >>>>>>> error >>>>>>>>>>>>> reported is that >>>>>>>>>>>>> kickstart didn't work right - but >> there's >>>>>>>>>>> perhaps >>>>>>>>>>>>> some underlying error. >>>>>>>>>>>>> -- >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> > _____________________________________________________________________ >>>>>> _______________ > === message truncated === > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From mikekubal at yahoo.com Tue Feb 12 17:00:28 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 15:00:28 -0800 (PST) Subject: [Swift-devel] next hurdle In-Reply-To: <1202855014.23472.0.camel@blabla.mcs.anl.gov> Message-ID: <236032.16528.qm@web52307.mail.re2.yahoo.com> One of the applications (antechamber) being launched by swift on the uc-teragrid is failing with an exit code of 1 and a message of 'cannot execute binary' file. It sounds like it might be attempting to run on one of the 32-bit nodes, though in my tc-file, it specifies to run only on the 64-bit nodes. The only difference between a successful run and the error above are the lines below in from the sites-file: (with this line I get the error above) (with this line instead the job succeeds) I rsynced the log and kickstart files to /home/benc/swift-logs at UC. 
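As background, restricting a transformation to a particular platform is done in the tc-file (tc.data), whose whitespace-separated columns are: site handle, transformation name, path to the executable, installation type, platform string, and profile entries. The entry below is purely illustrative; the site handle, path, and platform string are made up, not taken from the actual file in use here.

  #site   transformation  pfn                           type       platform         profiles
  UC-64   antechamber     /home/kubal/apps/antechamber  INSTALLED  INTEL64::LINUX   null

The platform column only matters if the site entry it is matched against advertises the corresponding sysinfo, which is why a sites-file entry that ends up forking on the wrong node can still defeat it.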
Cheers, Mike --- Mihael Hategan wrote: > Would it be worth trying to find out why it worked > with pre-WS GRAM? > > Mihael > > On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: > > Thanks Joe. This solved the account id problem. > > > > --- joseph insley wrote: > > > > > Mike K, > > > > > > looks like you have the wrong value in your > > > .tg_default_project file: > > > > > > insley at tg-viz-login1:~> more > > > ~kubal/.tg_default_project > > > TG-MCA01S018 > > > > > > you should be using: TG-MCB010025N > > > > > > insley at tg-viz-login1:~> tgusage -i -u kubal > > > > > > [snip] > > > > > > Account: TG-MCA01S018 > > > Title: Computational Studies of Complex > Processes in > > > Biological > > > Macromolecular Systems > > > Resource: teragrid > > > > > > **** > > > Local project name on dtf.anl.teragrid is > > > TG-MCB010025N > > > **** > > > > > > Allocation Period: 2007-08-03 to 2008-03-31 > > > > > > Name (Last First) or Account Total > > > Remaining Usage > > > ---------------------------- ---------- > > > ------------ ---------- > > > Kubal Michael 101880 SU > > > 99358 SU 296 SU > > > > > > ---------------------------------------------------------------------- > > > TG-MCA01S018 101880 SU > > > 99358 SU 2522 SU > > > > > > > > > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan > wrote: > > > > > > > You should probably remove the line > completely. > > > > > > > > Did you chose a default project on the login > node > > > with tgprojects? > > > > > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > > wrote: > > > >> I tried running with the account id removed > from > > > the > > > >> sites.file as in the following line: > > > >> > > > >> > > > >> > > > >> but received the same error. > > > >> > > > >> > > > >> > > > >> --- Mihael Hategan > wrote: > > > >> > > > >>> Is this the same for pre-WS GRAM? > > > >>> > > > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > Martin > > > >>> wrote: > > > >>>> that's right, qsub is used for PBS (and > some > > > >>> others too) > > > >>>> bsub is LSF > > > >>>> condor_q for condor > > > >>>> ... > > > >>>> > > > >>>> -Stu > > > >>>> > > > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > > >>> Hategan wrote: > > > >>>> > > > >>>>> > > > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > Kubal > > > >>> wrote: > > > >>>>>> I'll give it a try. > > > >>>>>> > > > >>>>>> When using GRAM4, is qsub the method used > to > > > >>>>>> ultimately put the job in the queue? > > > >>>>> > > > >>>>> Looks like it. I also believe it's the > case > > > with > > > >>> pre-ws gram. Stu > > > >>>>> may be > > > >>>>> able to clarify. > > > >>>>> > > > >>>>>> > > > >>>>>> MikeK > > > >>>>>> --- Mihael Hategan > > > wrote: > > > >>>>>> > > > >>>>>>> While this doesn't solve the underlying > > > >>> problem, it > > > >>>>>>> may help you get > > > >>>>>>> this to work: log into tg-login1.uc..., > set > > > >>> this > > > >>>>>>> project as default, > > > >>>>>>> then remove the project spec from the > sites > > > >>> file and > > > >>>>>>> try again. > > > >>>>>>> > > > >>>>>>> Mihael > > > >>>>>>> > > > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > > Kubal > > > >>> wrote: > > > >>>>>>>> Yes, I believe you are right. The > kickstart > > > >>>>>>> message > > > >>>>>>>> may be only a warning. After digging a > > > little > > > >>>>>>> deeper > > > >>>>>>>> it appears the job is failing due to a > > > >>>>>>> project/account > > > >>>>>>>> id problem. 
I get the following error: > > > >>>>>>>> > > > >>>>>>>> Caused by: > > > >>>>>>>> The executable could not be > > > started., > > > >>>>>>> qsub: > > > >>>>>>>> Invalid Account MSG=invalid account > > > >>>>>>>> > > > >>>>>>>> I am specifying the same TG-account in > my > > > >>>>>>> site-file > > > >>>>>>>> for the gram4 run that fails, as in the > > > >>> site-file > > > >>>>>>> for > > > >>>>>>>> the pre-ws job that suceeds. This is > the > > > same > > > >>>>>>> project, > > > >>>>>>>> TG-MCA01S018, that is set in my > > > >>>>>>> .tg_default_project > > > >>>>>>>> file in ~kubal/ on the UC teragrid. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> --- Ben Clifford > > > wrote: > > > >>>>>>>> > > > >>>>>>>>> yeah, run that same without kickstart. > the > > > >>> error > > > >>>>>>>>> reported is that > > > >>>>>>>>> kickstart didn't work right - but > there's > > > >>>>>>> perhaps > > > >>>>>>>>> some underlying error. > > > >>>>>>>>> -- > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>> > > > >> > > > > > > _____________________________________________________________________ > > > > > > >> _______________ > === message truncated === ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From hategan at mcs.anl.gov Tue Feb 12 17:06:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 17:06:44 -0600 Subject: [Swift-devel] Re: next hurdle In-Reply-To: <236032.16528.qm@web52307.mail.re2.yahoo.com> References: <236032.16528.qm@web52307.mail.re2.yahoo.com> Message-ID: <1202857604.26548.1.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 15:00 -0800, Mike Kubal wrote: > One of the applications (antechamber) being launched > by swift on the uc-teragrid is failing with an exit > code of 1 and a message of 'cannot execute binary' > file. It sounds like it might be attempting to run on > one of the 32-bit nodes, though in my tc-file, it > specifies to run only on the 64-bit nodes. > > The only difference between a successful run and the > error above are the lines below in from the > sites-file: > > (with this line I get the error above) > url="tg-grid1.uc.teragrid.org" /> > > > (with this line instead the job succeeds) > url="tg-grid1.uc.teragrid.org" major="4" minor="0" > patch="0"/> That's running with fork on the head node. > > I rsynced the log and kickstart files to > /home/benc/swift-logs at UC. > > Cheers, > > Mike > > --- Mihael Hategan wrote: > > > Would it be worth trying to find out why it worked > > with pre-WS GRAM? > > > > Mihael > > > > On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: > > > Thanks Joe. This solved the account id problem. 
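A quick way to tell whether jobs from a given sites-file entry really went through PBS or just forked on the gatekeeper is to look at both the queue and the process table on that node while a workflow is running. A rough sketch; qstat is standard PBS, and the grep target is only an example application name.

  qstat -u $USER                   # PBS-routed jobs show up in the batch queue
  ps -fu $USER | grep antechamber  # fork-routed jobs show the application itself
                                   # running on the head node

Seeing the application (rather than just globus-job-manager processes) on the head node is the tell-tale sign of the fork case described above.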
> > > > > > --- joseph insley wrote: > > > > > > > Mike K, > > > > > > > > looks like you have the wrong value in your > > > > .tg_default_project file: > > > > > > > > insley at tg-viz-login1:~> more > > > > ~kubal/.tg_default_project > > > > TG-MCA01S018 > > > > > > > > you should be using: TG-MCB010025N > > > > > > > > insley at tg-viz-login1:~> tgusage -i -u kubal > > > > > > > > [snip] > > > > > > > > Account: TG-MCA01S018 > > > > Title: Computational Studies of Complex > > Processes in > > > > Biological > > > > Macromolecular Systems > > > > Resource: teragrid > > > > > > > > **** > > > > Local project name on dtf.anl.teragrid is > > > > TG-MCB010025N > > > > **** > > > > > > > > Allocation Period: 2007-08-03 to 2008-03-31 > > > > > > > > Name (Last First) or Account Total > > > > Remaining Usage > > > > ---------------------------- ---------- > > > > ------------ ---------- > > > > Kubal Michael 101880 SU > > > > 99358 SU 296 SU > > > > > > > > > > ---------------------------------------------------------------------- > > > > TG-MCA01S018 101880 SU > > > > 99358 SU 2522 SU > > > > > > > > > > > > > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan > > wrote: > > > > > > > > > You should probably remove the line > > completely. > > > > > > > > > > Did you chose a default project on the login > > node > > > > with tgprojects? > > > > > > > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > > > wrote: > > > > >> I tried running with the account id removed > > from > > > > the > > > > >> sites.file as in the following line: > > > > >> > > > > >> > > > > >> > > > > >> but received the same error. > > > > >> > > > > >> > > > > >> > > > > >> --- Mihael Hategan > > wrote: > > > > >> > > > > >>> Is this the same for pre-WS GRAM? > > > > >>> > > > > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > > Martin > > > > >>> wrote: > > > > >>>> that's right, qsub is used for PBS (and > > some > > > > >>> others too) > > > > >>>> bsub is LSF > > > > >>>> condor_q for condor > > > > >>>> ... > > > > >>>> > > > > >>>> -Stu > > > > >>>> > > > > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > > > >>> Hategan wrote: > > > > >>>> > > > > >>>>> > > > > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > > Kubal > > > > >>> wrote: > > > > >>>>>> I'll give it a try. > > > > >>>>>> > > > > >>>>>> When using GRAM4, is qsub the method used > > to > > > > >>>>>> ultimately put the job in the queue? > > > > >>>>> > > > > >>>>> Looks like it. I also believe it's the > > case > > > > with > > > > >>> pre-ws gram. Stu > > > > >>>>> may be > > > > >>>>> able to clarify. > > > > >>>>> > > > > >>>>>> > > > > >>>>>> MikeK > > > > >>>>>> --- Mihael Hategan > > > > wrote: > > > > >>>>>> > > > > >>>>>>> While this doesn't solve the underlying > > > > >>> problem, it > > > > >>>>>>> may help you get > > > > >>>>>>> this to work: log into tg-login1.uc..., > > set > > > > >>> this > > > > >>>>>>> project as default, > > > > >>>>>>> then remove the project spec from the > > sites > > > > >>> file and > > > > >>>>>>> try again. > > > > >>>>>>> > > > > >>>>>>> Mihael > > > > >>>>>>> > > > > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > > > Kubal > > > > >>> wrote: > > > > >>>>>>>> Yes, I believe you are right. The > > kickstart > > > > >>>>>>> message > > > > >>>>>>>> may be only a warning. After digging a > > > > little > > > > >>>>>>> deeper > > > > >>>>>>>> it appears the job is failing due to a > > > > >>>>>>> project/account > > > > >>>>>>>> id problem. 
I get the following error: > > > > >>>>>>>> > > > > >>>>>>>> Caused by: > > > > >>>>>>>> The executable could not be > > > > started., > > > > >>>>>>> qsub: > > > > >>>>>>>> Invalid Account MSG=invalid account > > > > >>>>>>>> > > > > >>>>>>>> I am specifying the same TG-account in > > my > > > > >>>>>>> site-file > > > > >>>>>>>> for the gram4 run that fails, as in the > > > > >>> site-file > > > > >>>>>>> for > > > > >>>>>>>> the pre-ws job that suceeds. This is > > the > > > > same > > > > >>>>>>> project, > > > > >>>>>>>> TG-MCA01S018, that is set in my > > > > >>>>>>> .tg_default_project > > > > >>>>>>>> file in ~kubal/ on the UC teragrid. > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> --- Ben Clifford > > > > wrote: > > > > >>>>>>>> > > > > >>>>>>>>> yeah, run that same without kickstart. > > the > > > > >>> error > > > > >>>>>>>>> reported is that > > > > >>>>>>>>> kickstart didn't work right - but > > there's > > > > >>>>>>> perhaps > > > > >>>>>>>>> some underlying error. > > > > >>>>>>>>> -- > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>> > > > > >> > > > > > > > > > > _____________________________________________________________________ > > > > > > > > >> _______________ > > > === message truncated === > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > From mikekubal at yahoo.com Tue Feb 12 17:14:21 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 15:14:21 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <47B222F3.3010709@mcs.anl.gov> Message-ID: <283035.8845.qm@web52304.mail.re2.yahoo.com> With pre-WS-GRAM, it doesn't seem to matter which account/project id I use where. I can have the TG-MCB010025N specified in the sites-files and TG-MCA01S018 specified in ~kubal/.tg-default_project on the uc teragrid, or vice versa and it still works, or having them match in both places. With WS-GRAM, I have to use TG-MCB010025N, the local uc-teragrid project id, in both places. Using TG-MCA01S018, the teragrid wide charge number/account number, causes the qsub failure error. --- Michael Wilde wrote: > Mike, did you do a recent test with pre-WS-GRAM with > the > .tg_default_project file set *incorrectly*? > > I think the puzzle was why this would cause WS-GRAM > to fail but not > pre-WS-GRAM, as it would seem they would both get > the TG account to use > in the same manner. > > - mikew > > On 2/12/08 4:43 PM, Mike Kubal wrote: > > Just to be sure I tested with pre-WS and it worked > > also. > > > > --- Mihael Hategan wrote: > > > >> Would it be worth trying to find out why it > worked > >> with pre-WS GRAM? > >> > >> Mihael > >> > >> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal > wrote: > >>> Thanks Joe. This solved the account id problem. 
> >>> > >>> --- joseph insley wrote: > >>> > >>>> Mike K, > >>>> > >>>> looks like you have the wrong value in your > >>>> .tg_default_project file: > >>>> > >>>> insley at tg-viz-login1:~> more > >>>> ~kubal/.tg_default_project > >>>> TG-MCA01S018 > >>>> > >>>> you should be using: TG-MCB010025N > >>>> > >>>> insley at tg-viz-login1:~> tgusage -i -u kubal > >>>> > >>>> [snip] > >>>> > >>>> Account: TG-MCA01S018 > >>>> Title: Computational Studies of Complex > >> Processes in > >>>> Biological > >>>> Macromolecular Systems > >>>> Resource: teragrid > >>>> > >>>> **** > >>>> Local project name on dtf.anl.teragrid is > >>>> TG-MCB010025N > >>>> **** > >>>> > >>>> Allocation Period: 2007-08-03 to 2008-03-31 > >>>> > >>>> Name (Last First) or Account Total > >>>> Remaining Usage > >>>> ---------------------------- ---------- > >>>> ------------ ---------- > >>>> Kubal Michael 101880 SU > >>>> 99358 SU 296 SU > >>>> > > > ---------------------------------------------------------------------- > >>>> TG-MCA01S018 101880 SU > >>>> 99358 SU 2522 SU > >>>> > >>>> > >>>> > >>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan > >> wrote: > >>>>> You should probably remove the line > >> completely. > >>>>> Did you chose a default project on the login > >> node > >>>> with tgprojects? > >>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > >>>> wrote: > >>>>>> I tried running with the account id removed > >> from > >>>> the > >>>>>> sites.file as in the following line: > >>>>>> > >>>>>> > >>>>>> > >>>>>> but received the same error. > >>>>>> > >>>>>> > >>>>>> > >>>>>> --- Mihael Hategan > >> wrote: > >>>>>>> Is this the same for pre-WS GRAM? > >>>>>>> > >>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > >> Martin > >>>>>>> wrote: > >>>>>>>> that's right, qsub is used for PBS (and > >> some > >>>>>>> others too) > >>>>>>>> bsub is LSF > >>>>>>>> condor_q for condor > >>>>>>>> ... > >>>>>>>> > >>>>>>>> -Stu > >>>>>>>> > >>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > >>>>>>> Hategan wrote: > >>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > >> Kubal > >>>>>>> wrote: > >>>>>>>>>> I'll give it a try. > >>>>>>>>>> > >>>>>>>>>> When using GRAM4, is qsub the method used > >> to > >>>>>>>>>> ultimately put the job in the queue? > >>>>>>>>> Looks like it. I also believe it's the > >> case > >>>> with > >>>>>>> pre-ws gram. Stu > >>>>>>>>> may be > >>>>>>>>> able to clarify. > >>>>>>>>> > >>>>>>>>>> MikeK > >>>>>>>>>> --- Mihael Hategan > >>>> wrote: > >>>>>>>>>>> While this doesn't solve the underlying > >>>>>>> problem, it > >>>>>>>>>>> may help you get > >>>>>>>>>>> this to work: log into tg-login1.uc..., > >> set > >>>>>>> this > >>>>>>>>>>> project as default, > >>>>>>>>>>> then remove the project spec from the > >> sites > >>>>>>> file and > >>>>>>>>>>> try again. > >>>>>>>>>>> > >>>>>>>>>>> Mihael > >>>>>>>>>>> > >>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > >>>> Kubal > >>>>>>> wrote: > >>>>>>>>>>>> Yes, I believe you are right. The > >> kickstart > >>>>>>>>>>> message > >>>>>>>>>>>> may be only a warning. After digging a > >>>> little > >>>>>>>>>>> deeper > >>>>>>>>>>>> it appears the job is failing due to a > >>>>>>>>>>> project/account > >>>>>>>>>>>> id problem. 
I get the following error: > >>>>>>>>>>>> > >>>>>>>>>>>> Caused by: > >>>>>>>>>>>> The executable could not be > >>>> started., > >>>>>>>>>>> qsub: > >>>>>>>>>>>> Invalid Account MSG=invalid account > >>>>>>>>>>>> > >>>>>>>>>>>> I am specifying the same TG-account in > >> my > >>>>>>>>>>> site-file > >>>>>>>>>>>> for the gram4 run that fails, as in the > >>>>>>> site-file > >>>>>>>>>>> for > >>>>>>>>>>>> the pre-ws job that suceeds. This is > >> the > >>>> same > >>>>>>>>>>> project, > >>>>>>>>>>>> TG-MCA01S018, that is set in my > >>>>>>>>>>> .tg_default_project > >>>>>>>>>>>> file in ~kubal/ on the UC teragrid. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> --- Ben Clifford > >>>> wrote: > >>>>>>>>>>>>> yeah, run that same without kickstart. > >> the > >>>>>>> error > >>>>>>>>>>>>> reported is that > >>>>>>>>>>>>> kickstart didn't work right - but > >> there's > >>>>>>>>>>> perhaps > >>>>>>>>>>>>> some underlying error. > >>>>>>>>>>>>> -- > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > === message truncated === ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping From hategan at mcs.anl.gov Tue Feb 12 17:17:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 17:17:12 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <283035.8845.qm@web52304.mail.re2.yahoo.com> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> Message-ID: <1202858233.27191.0.camel@blabla.mcs.anl.gov> Are you sure you were using PBS with pre-ws GRAM and not fork? On Tue, 2008-02-12 at 15:14 -0800, Mike Kubal wrote: > With pre-WS-GRAM, it doesn't seem to matter which > account/project id I use where. I can have the > TG-MCB010025N specified in the sites-files and > TG-MCA01S018 specified in ~kubal/.tg-default_project > on the uc teragrid, or vice versa and it still works, > or having them match in both places. > > With WS-GRAM, I have to use TG-MCB010025N, the local > uc-teragrid project id, in both places. Using > TG-MCA01S018, the teragrid wide charge number/account > number, causes the qsub failure error. > > > > > --- Michael Wilde wrote: > > > Mike, did you do a recent test with pre-WS-GRAM with > > the > > .tg_default_project file set *incorrectly*? > > > > I think the puzzle was why this would cause WS-GRAM > > to fail but not > > pre-WS-GRAM, as it would seem they would both get > > the TG account to use > > in the same manner. > > > > - mikew > > > > On 2/12/08 4:43 PM, Mike Kubal wrote: > > > Just to be sure I tested with pre-WS and it worked > > > also. > > > > > > --- Mihael Hategan wrote: > > > > > >> Would it be worth trying to find out why it > > worked > > >> with pre-WS GRAM? > > >> > > >> Mihael > > >> > > >> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal > > wrote: > > >>> Thanks Joe. This solved the account id problem. 
> > >>> > > >>> --- joseph insley wrote: > > >>> > > >>>> Mike K, > > >>>> > > >>>> looks like you have the wrong value in your > > >>>> .tg_default_project file: > > >>>> > > >>>> insley at tg-viz-login1:~> more > > >>>> ~kubal/.tg_default_project > > >>>> TG-MCA01S018 > > >>>> > > >>>> you should be using: TG-MCB010025N > > >>>> > > >>>> insley at tg-viz-login1:~> tgusage -i -u kubal > > >>>> > > >>>> [snip] > > >>>> > > >>>> Account: TG-MCA01S018 > > >>>> Title: Computational Studies of Complex > > >> Processes in > > >>>> Biological > > >>>> Macromolecular Systems > > >>>> Resource: teragrid > > >>>> > > >>>> **** > > >>>> Local project name on dtf.anl.teragrid is > > >>>> TG-MCB010025N > > >>>> **** > > >>>> > > >>>> Allocation Period: 2007-08-03 to 2008-03-31 > > >>>> > > >>>> Name (Last First) or Account Total > > >>>> Remaining Usage > > >>>> ---------------------------- ---------- > > >>>> ------------ ---------- > > >>>> Kubal Michael 101880 SU > > >>>> 99358 SU 296 SU > > >>>> > > > > > > ---------------------------------------------------------------------- > > >>>> TG-MCA01S018 101880 SU > > >>>> 99358 SU 2522 SU > > >>>> > > >>>> > > >>>> > > >>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan > > >> wrote: > > >>>>> You should probably remove the line > > >> completely. > > >>>>> Did you chose a default project on the login > > >> node > > >>>> with tgprojects? > > >>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > >>>> wrote: > > >>>>>> I tried running with the account id removed > > >> from > > >>>> the > > >>>>>> sites.file as in the following line: > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> but received the same error. > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> --- Mihael Hategan > > >> wrote: > > >>>>>>> Is this the same for pre-WS GRAM? > > >>>>>>> > > >>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > > >> Martin > > >>>>>>> wrote: > > >>>>>>>> that's right, qsub is used for PBS (and > > >> some > > >>>>>>> others too) > > >>>>>>>> bsub is LSF > > >>>>>>>> condor_q for condor > > >>>>>>>> ... > > >>>>>>>> > > >>>>>>>> -Stu > > >>>>>>>> > > >>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > >>>>>>> Hategan wrote: > > >>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > > >> Kubal > > >>>>>>> wrote: > > >>>>>>>>>> I'll give it a try. > > >>>>>>>>>> > > >>>>>>>>>> When using GRAM4, is qsub the method used > > >> to > > >>>>>>>>>> ultimately put the job in the queue? > > >>>>>>>>> Looks like it. I also believe it's the > > >> case > > >>>> with > > >>>>>>> pre-ws gram. Stu > > >>>>>>>>> may be > > >>>>>>>>> able to clarify. > > >>>>>>>>> > > >>>>>>>>>> MikeK > > >>>>>>>>>> --- Mihael Hategan > > >>>> wrote: > > >>>>>>>>>>> While this doesn't solve the underlying > > >>>>>>> problem, it > > >>>>>>>>>>> may help you get > > >>>>>>>>>>> this to work: log into tg-login1.uc..., > > >> set > > >>>>>>> this > > >>>>>>>>>>> project as default, > > >>>>>>>>>>> then remove the project spec from the > > >> sites > > >>>>>>> file and > > >>>>>>>>>>> try again. > > >>>>>>>>>>> > > >>>>>>>>>>> Mihael > > >>>>>>>>>>> > > >>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > >>>> Kubal > > >>>>>>> wrote: > > >>>>>>>>>>>> Yes, I believe you are right. The > > >> kickstart > > >>>>>>>>>>> message > > >>>>>>>>>>>> may be only a warning. After digging a > > >>>> little > > >>>>>>>>>>> deeper > > >>>>>>>>>>>> it appears the job is failing due to a > > >>>>>>>>>>> project/account > > >>>>>>>>>>>> id problem. 
I get the following error: > > >>>>>>>>>>>> > > >>>>>>>>>>>> Caused by: > > >>>>>>>>>>>> The executable could not be > > >>>> started., > > >>>>>>>>>>> qsub: > > >>>>>>>>>>>> Invalid Account MSG=invalid account > > >>>>>>>>>>>> > > >>>>>>>>>>>> I am specifying the same TG-account in > > >> my > > >>>>>>>>>>> site-file > > >>>>>>>>>>>> for the gram4 run that fails, as in the > > >>>>>>> site-file > > >>>>>>>>>>> for > > >>>>>>>>>>>> the pre-ws job that suceeds. This is > > >> the > > >>>> same > > >>>>>>>>>>> project, > > >>>>>>>>>>>> TG-MCA01S018, that is set in my > > >>>>>>>>>>> .tg_default_project > > >>>>>>>>>>>> file in ~kubal/ on the UC teragrid. > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> --- Ben Clifford > > >>>> wrote: > > >>>>>>>>>>>>> yeah, run that same without kickstart. > > >> the > > >>>>>>> error > > >>>>>>>>>>>>> reported is that > > >>>>>>>>>>>>> kickstart didn't work right - but > > >> there's > > >>>>>>>>>>> perhaps > > >>>>>>>>>>>>> some underlying error. > > >>>>>>>>>>>>> -- > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > > === message truncated === > > > > ____________________________________________________________________________________ > Looking for last minute shopping deals? > Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping > From wilde at mcs.anl.gov Tue Feb 12 17:41:18 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 Feb 2008 17:41:18 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202858233.27191.0.camel@blabla.mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> Message-ID: <47B22E9E.2060000@mcs.anl.gov> That would certainly explain this clobbered the head node. Im sorry that we all missed this last week. If true, we would have seen the applications running on the headnode. I wonder if anyone noticed? Mike, heres a sample entry Ive used in the past for UC-TG: /home/wilde/swiftdata/UC/work The missing part is the "/jobmanager-pbs" in the url= tag of the element. - mikew On 2/12/08 5:17 PM, Mihael Hategan wrote: > Are you sure you were using PBS with pre-ws GRAM and not fork? > > On Tue, 2008-02-12 at 15:14 -0800, Mike Kubal wrote: >> With pre-WS-GRAM, it doesn't seem to matter which >> account/project id I use where. I can have the >> TG-MCB010025N specified in the sites-files and >> TG-MCA01S018 specified in ~kubal/.tg-default_project >> on the uc teragrid, or vice versa and it still works, >> or having them match in both places. >> >> With WS-GRAM, I have to use TG-MCB010025N, the local >> uc-teragrid project id, in both places. Using >> TG-MCA01S018, the teragrid wide charge number/account >> number, causes the qsub failure error. >> >> >> >> >> --- Michael Wilde wrote: >> >>> Mike, did you do a recent test with pre-WS-GRAM with >>> the >>> .tg_default_project file set *incorrectly*? >>> >>> I think the puzzle was why this would cause WS-GRAM >>> to fail but not >>> pre-WS-GRAM, as it would seem they would both get >>> the TG account to use >>> in the same manner. >>> >>> - mikew >>> >>> On 2/12/08 4:43 PM, Mike Kubal wrote: >>>> Just to be sure I tested with pre-WS and it worked >>>> also. >>>> >>>> --- Mihael Hategan wrote: >>>> >>>>> Would it be worth trying to find out why it >>> worked >>>>> with pre-WS GRAM? 
>>>>> >>>>> Mihael >>>>> >>>>> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal >>> wrote: >>>>>> Thanks Joe. This solved the account id problem. >>>>>> >>>>>> --- joseph insley wrote: >>>>>> >>>>>>> Mike K, >>>>>>> >>>>>>> looks like you have the wrong value in your >>>>>>> .tg_default_project file: >>>>>>> >>>>>>> insley at tg-viz-login1:~> more >>>>>>> ~kubal/.tg_default_project >>>>>>> TG-MCA01S018 >>>>>>> >>>>>>> you should be using: TG-MCB010025N >>>>>>> >>>>>>> insley at tg-viz-login1:~> tgusage -i -u kubal >>>>>>> >>>>>>> [snip] >>>>>>> >>>>>>> Account: TG-MCA01S018 >>>>>>> Title: Computational Studies of Complex >>>>> Processes in >>>>>>> Biological >>>>>>> Macromolecular Systems >>>>>>> Resource: teragrid >>>>>>> >>>>>>> **** >>>>>>> Local project name on dtf.anl.teragrid is >>>>>>> TG-MCB010025N >>>>>>> **** >>>>>>> >>>>>>> Allocation Period: 2007-08-03 to 2008-03-31 >>>>>>> >>>>>>> Name (Last First) or Account Total >>>>>>> Remaining Usage >>>>>>> ---------------------------- ---------- >>>>>>> ------------ ---------- >>>>>>> Kubal Michael 101880 SU >>>>>>> 99358 SU 296 SU >>>>>>> >> ---------------------------------------------------------------------- >>>>>>> TG-MCA01S018 101880 SU >>>>>>> 99358 SU 2522 SU >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan >>>>> wrote: >>>>>>>> You should probably remove the line >>>>> completely. >>>>>>>> Did you chose a default project on the login >>>>> node >>>>>>> with tgprojects? >>>>>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal >>>>>>> wrote: >>>>>>>>> I tried running with the account id removed >>>>> from >>>>>>> the >>>>>>>>> sites.file as in the following line: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> but received the same error. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> --- Mihael Hategan >>>>> wrote: >>>>>>>>>> Is this the same for pre-WS GRAM? >>>>>>>>>> >>>>>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart >>>>> Martin >>>>>>>>>> wrote: >>>>>>>>>>> that's right, qsub is used for PBS (and >>>>> some >>>>>>>>>> others too) >>>>>>>>>>> bsub is LSF >>>>>>>>>>> condor_q for condor >>>>>>>>>>> ... >>>>>>>>>>> >>>>>>>>>>> -Stu >>>>>>>>>>> >>>>>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael >>>>>>>>>> Hategan wrote: >>>>>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike >>>>> Kubal >>>>>>>>>> wrote: >>>>>>>>>>>>> I'll give it a try. >>>>>>>>>>>>> >>>>>>>>>>>>> When using GRAM4, is qsub the method used >>>>> to >>>>>>>>>>>>> ultimately put the job in the queue? >>>>>>>>>>>> Looks like it. I also believe it's the >>>>> case >>>>>>> with >>>>>>>>>> pre-ws gram. Stu >>>>>>>>>>>> may be >>>>>>>>>>>> able to clarify. >>>>>>>>>>>> >>>>>>>>>>>>> MikeK >>>>>>>>>>>>> --- Mihael Hategan >>>>>>> wrote: >>>>>>>>>>>>>> While this doesn't solve the underlying >>>>>>>>>> problem, it >>>>>>>>>>>>>> may help you get >>>>>>>>>>>>>> this to work: log into tg-login1.uc..., >>>>> set >>>>>>>>>> this >>>>>>>>>>>>>> project as default, >>>>>>>>>>>>>> then remove the project spec from the >>>>> sites >>>>>>>>>> file and >>>>>>>>>>>>>> try again. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Mihael >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike >>>>>>> Kubal >>>>>>>>>> wrote: >>>>>>>>>>>>>>> Yes, I believe you are right. The >>>>> kickstart >>>>>>>>>>>>>> message >>>>>>>>>>>>>>> may be only a warning. After digging a >>>>>>> little >>>>>>>>>>>>>> deeper >>>>>>>>>>>>>>> it appears the job is failing due to a >>>>>>>>>>>>>> project/account >>>>>>>>>>>>>>> id problem. 
I get the following error: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Caused by: >>>>>>>>>>>>>>> The executable could not be >>>>>>> started., >>>>>>>>>>>>>> qsub: >>>>>>>>>>>>>>> Invalid Account MSG=invalid account >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I am specifying the same TG-account in >>>>> my >>>>>>>>>>>>>> site-file >>>>>>>>>>>>>>> for the gram4 run that fails, as in the >>>>>>>>>> site-file >>>>>>>>>>>>>> for >>>>>>>>>>>>>>> the pre-ws job that suceeds. This is >>>>> the >>>>>>> same >>>>>>>>>>>>>> project, >>>>>>>>>>>>>>> TG-MCA01S018, that is set in my >>>>>>>>>>>>>> .tg_default_project >>>>>>>>>>>>>>> file in ~kubal/ on the UC teragrid. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> --- Ben Clifford >>>>>>> wrote: >>>>>>>>>>>>>>>> yeah, run that same without kickstart. >>>>> the >>>>>>>>>> error >>>>>>>>>>>>>>>> reported is that >>>>>>>>>>>>>>>> kickstart didn't work right - but >>>>> there's >>>>>>>>>>>>>> perhaps >>>>>>>>>>>>>>>> some underlying error. >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >> === message truncated === >> >> >> >> ____________________________________________________________________________________ >> Looking for last minute shopping deals? >> Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping >> > > From benc at hawaga.org.uk Tue Feb 12 17:51:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 23:51:05 +0000 (GMT) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <283100.87314.qm@web52307.mail.re2.yahoo.com> References: <283100.87314.qm@web52307.mail.re2.yahoo.com> Message-ID: On Tue, 12 Feb 2008, Mike Kubal wrote: > Yes, I believe you are right. The kickstart message > may be only a warning. After digging a little deeper > it appears the job is failing due to a project/account > id problem. I get the following error: > > Caused by: > The executable could not be started., qsub: > Invalid Account MSG=invalid account > I am specifying the same TG-account in my site-file > for the gram4 run that fails, as in the site-file for > the pre-ws job that suceeds. This is the same project, > TG-MCA01S018, that is set in my .tg_default_project > file in ~kubal/ on the UC teragrid. ok. there's something wrong there. run the command: tgprojects and paste its output. also paste your sites file again. -- From hategan at mcs.anl.gov Tue Feb 12 17:53:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 17:53:50 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <47B22E9E.2060000@mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> Message-ID: <1202860430.28705.7.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 17:41 -0600, Michael Wilde wrote: > That would certainly explain this clobbered the head node. > Im sorry that we all missed this last week. This is what Joe killed at the time: > kubal 28202 19438 1 16:50 ? 00:00:00 /usr/bin/perl /soft/ > prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f / > tmp/gram_SPsdme -c poll Looks like PBS. > > If true, we would have seen the applications running on the headnode. > I wonder if anyone noticed? 
> > Mike, heres a sample entry Ive used in the past for UC-TG: > > sysinfo="INTEL32::LINUX"> > storage="/home/wilde/swiftdata/UC/storage" major="2" minor="2" /> > url="tg-grid.uc.teragrid.org/jobmanager-pbs" major="2" minor="2"/> > /home/wilde/swiftdata/UC/work > > > > The missing part is the "/jobmanager-pbs" in the url= tag of the > element. I think we may want to discourage that since it's not portable. I'd say instead of , one should use Mihael > > - mikew > > > On 2/12/08 5:17 PM, Mihael Hategan wrote: > > Are you sure you were using PBS with pre-ws GRAM and not fork? > > > > On Tue, 2008-02-12 at 15:14 -0800, Mike Kubal wrote: > >> With pre-WS-GRAM, it doesn't seem to matter which > >> account/project id I use where. I can have the > >> TG-MCB010025N specified in the sites-files and > >> TG-MCA01S018 specified in ~kubal/.tg-default_project > >> on the uc teragrid, or vice versa and it still works, > >> or having them match in both places. > >> > >> With WS-GRAM, I have to use TG-MCB010025N, the local > >> uc-teragrid project id, in both places. Using > >> TG-MCA01S018, the teragrid wide charge number/account > >> number, causes the qsub failure error. > >> > >> > >> > >> > >> --- Michael Wilde wrote: > >> > >>> Mike, did you do a recent test with pre-WS-GRAM with > >>> the > >>> .tg_default_project file set *incorrectly*? > >>> > >>> I think the puzzle was why this would cause WS-GRAM > >>> to fail but not > >>> pre-WS-GRAM, as it would seem they would both get > >>> the TG account to use > >>> in the same manner. > >>> > >>> - mikew > >>> > >>> On 2/12/08 4:43 PM, Mike Kubal wrote: > >>>> Just to be sure I tested with pre-WS and it worked > >>>> also. > >>>> > >>>> --- Mihael Hategan wrote: > >>>> > >>>>> Would it be worth trying to find out why it > >>> worked > >>>>> with pre-WS GRAM? > >>>>> > >>>>> Mihael > >>>>> > >>>>> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal > >>> wrote: > >>>>>> Thanks Joe. This solved the account id problem. > >>>>>> > >>>>>> --- joseph insley wrote: > >>>>>> > >>>>>>> Mike K, > >>>>>>> > >>>>>>> looks like you have the wrong value in your > >>>>>>> .tg_default_project file: > >>>>>>> > >>>>>>> insley at tg-viz-login1:~> more > >>>>>>> ~kubal/.tg_default_project > >>>>>>> TG-MCA01S018 > >>>>>>> > >>>>>>> you should be using: TG-MCB010025N > >>>>>>> > >>>>>>> insley at tg-viz-login1:~> tgusage -i -u kubal > >>>>>>> > >>>>>>> [snip] > >>>>>>> > >>>>>>> Account: TG-MCA01S018 > >>>>>>> Title: Computational Studies of Complex > >>>>> Processes in > >>>>>>> Biological > >>>>>>> Macromolecular Systems > >>>>>>> Resource: teragrid > >>>>>>> > >>>>>>> **** > >>>>>>> Local project name on dtf.anl.teragrid is > >>>>>>> TG-MCB010025N > >>>>>>> **** > >>>>>>> > >>>>>>> Allocation Period: 2007-08-03 to 2008-03-31 > >>>>>>> > >>>>>>> Name (Last First) or Account Total > >>>>>>> Remaining Usage > >>>>>>> ---------------------------- ---------- > >>>>>>> ------------ ---------- > >>>>>>> Kubal Michael 101880 SU > >>>>>>> 99358 SU 296 SU > >>>>>>> > >> ---------------------------------------------------------------------- > >>>>>>> TG-MCA01S018 101880 SU > >>>>>>> 99358 SU 2522 SU > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan > >>>>> wrote: > >>>>>>>> You should probably remove the line > >>>>> completely. > >>>>>>>> Did you chose a default project on the login > >>>>> node > >>>>>>> with tgprojects? 
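The XML markup in the sample entry above did not survive the list archiver. Reconstructed from the surviving attributes, and assuming the Swift sites.xml element names of the time (the pool handle and the gridftp URL here are guesses), it would have looked roughly like:

  <pool handle="UC-TG" sysinfo="INTEL32::LINUX">
    <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"
             storage="/home/wilde/swiftdata/UC/storage" major="2" minor="2"/>
    <jobmanager url="tg-grid.uc.teragrid.org/jobmanager-pbs" major="2" minor="2"/>
    <workdirectory>/home/wilde/swiftdata/UC/work</workdirectory>
  </pool>

and the more portable form being suggested instead names the job manager as an attribute rather than appending it to the URL (the element name and provider value are assumptions; the jobManager and url attributes are the ones quoted later in this thread):

  <pool handle="UC-TG" sysinfo="INTEL32::LINUX">
    <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"
             storage="/home/wilde/swiftdata/UC/storage"/>
    <execution provider="gt2" jobManager="pbs" url="tg-grid.uc.teragrid.org"/>
    <workdirectory>/home/wilde/swiftdata/UC/work</workdirectory>
  </pool>

The attraction of the attribute form is that the same entry can be switched between pre-WS GRAM and GRAM4 by changing the provider, without relying on the /jobmanager-pbs URL convention that only pre-WS GRAM understands.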
> >>>>>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > >>>>>>> wrote: > >>>>>>>>> I tried running with the account id removed > >>>>> from > >>>>>>> the > >>>>>>>>> sites.file as in the following line: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> but received the same error. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> --- Mihael Hategan > >>>>> wrote: > >>>>>>>>>> Is this the same for pre-WS GRAM? > >>>>>>>>>> > >>>>>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > >>>>> Martin > >>>>>>>>>> wrote: > >>>>>>>>>>> that's right, qsub is used for PBS (and > >>>>> some > >>>>>>>>>> others too) > >>>>>>>>>>> bsub is LSF > >>>>>>>>>>> condor_q for condor > >>>>>>>>>>> ... > >>>>>>>>>>> > >>>>>>>>>>> -Stu > >>>>>>>>>>> > >>>>>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > >>>>>>>>>> Hategan wrote: > >>>>>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > >>>>> Kubal > >>>>>>>>>> wrote: > >>>>>>>>>>>>> I'll give it a try. > >>>>>>>>>>>>> > >>>>>>>>>>>>> When using GRAM4, is qsub the method used > >>>>> to > >>>>>>>>>>>>> ultimately put the job in the queue? > >>>>>>>>>>>> Looks like it. I also believe it's the > >>>>> case > >>>>>>> with > >>>>>>>>>> pre-ws gram. Stu > >>>>>>>>>>>> may be > >>>>>>>>>>>> able to clarify. > >>>>>>>>>>>> > >>>>>>>>>>>>> MikeK > >>>>>>>>>>>>> --- Mihael Hategan > >>>>>>> wrote: > >>>>>>>>>>>>>> While this doesn't solve the underlying > >>>>>>>>>> problem, it > >>>>>>>>>>>>>> may help you get > >>>>>>>>>>>>>> this to work: log into tg-login1.uc..., > >>>>> set > >>>>>>>>>> this > >>>>>>>>>>>>>> project as default, > >>>>>>>>>>>>>> then remove the project spec from the > >>>>> sites > >>>>>>>>>> file and > >>>>>>>>>>>>>> try again. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Mihael > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > >>>>>>> Kubal > >>>>>>>>>> wrote: > >>>>>>>>>>>>>>> Yes, I believe you are right. The > >>>>> kickstart > >>>>>>>>>>>>>> message > >>>>>>>>>>>>>>> may be only a warning. After digging a > >>>>>>> little > >>>>>>>>>>>>>> deeper > >>>>>>>>>>>>>>> it appears the job is failing due to a > >>>>>>>>>>>>>> project/account > >>>>>>>>>>>>>>> id problem. I get the following error: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Caused by: > >>>>>>>>>>>>>>> The executable could not be > >>>>>>> started., > >>>>>>>>>>>>>> qsub: > >>>>>>>>>>>>>>> Invalid Account MSG=invalid account > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I am specifying the same TG-account in > >>>>> my > >>>>>>>>>>>>>> site-file > >>>>>>>>>>>>>>> for the gram4 run that fails, as in the > >>>>>>>>>> site-file > >>>>>>>>>>>>>> for > >>>>>>>>>>>>>>> the pre-ws job that suceeds. This is > >>>>> the > >>>>>>> same > >>>>>>>>>>>>>> project, > >>>>>>>>>>>>>>> TG-MCA01S018, that is set in my > >>>>>>>>>>>>>> .tg_default_project > >>>>>>>>>>>>>>> file in ~kubal/ on the UC teragrid. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> --- Ben Clifford > >>>>>>> wrote: > >>>>>>>>>>>>>>>> yeah, run that same without kickstart. > >>>>> the > >>>>>>>>>> error > >>>>>>>>>>>>>>>> reported is that > >>>>>>>>>>>>>>>> kickstart didn't work right - but > >>>>> there's > >>>>>>>>>>>>>> perhaps > >>>>>>>>>>>>>>>> some underlying error. 
> >>>>>>>>>>>>>>>> -- > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >> === message truncated === > >> > >> > >> > >> ____________________________________________________________________________________ > >> Looking for last minute shopping deals? > >> Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping > >> > > > > > From benc at hawaga.org.uk Tue Feb 12 17:57:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 23:57:34 +0000 (GMT) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202860430.28705.7.camel@blabla.mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> <1202860430.28705.7.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 12 Feb 2008, Mihael Hategan wrote: > I think we may want to discourage that since it's not portable. I'd say > instead of , one should use jobManager="pbs" url="tg-grid.uc.teragrid.org"/> which is more portable...? -- From hategan at mcs.anl.gov Tue Feb 12 18:04:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 18:04:36 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> <1202860430.28705.7.camel@blabla.mcs.anl.gov> Message-ID: <1202861076.29685.4.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford wrote: > > On Tue, 12 Feb 2008, Mihael Hategan wrote: > > > I think we may want to discourage that since it's not portable. I'd say > > instead of , one should use > jobManager="pbs" url="tg-grid.uc.teragrid.org"/> > > which is more portable...? Hmm? From wilde at mcs.anl.gov Tue Feb 12 18:26:41 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 Feb 2008 18:26:41 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202861076.29685.4.camel@blabla.mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> <1202860430.28705.7.camel@blabla.mcs.anl.gov> <1202861076.29685.4.camel@blabla.mcs.anl.gov> Message-ID: <47B23941.7050201@mcs.anl.gov> I think that makes sense - you mean that jobManager="pbs" works for both WS-GRAM and pre-WS-GRAM, right? On 2/12/08 6:04 PM, Mihael Hategan wrote: > On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford wrote: >> On Tue, 12 Feb 2008, Mihael Hategan wrote: >> >>> I think we may want to discourage that since it's not portable. I'd say >>> instead of , one should use >> jobManager="pbs" url="tg-grid.uc.teragrid.org"/> >> which is more portable...? > > Hmm? 
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Tue Feb 12 18:31:55 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 18:31:55 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <47B23941.7050201@mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> <1202860430.28705.7.camel@blabla.mcs.anl.gov> <1202861076.29685.4.camel@blabla.mcs.anl.gov> <47B23941.7050201@mcs.anl.gov> Message-ID: <1202862715.31941.5.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde wrote: > I think that makes sense - you mean that jobManager="pbs" works for both > WS-GRAM and pre-WS-GRAM, right? Yes. Not only that, with and WS-GRAM there is no (known to me) way to specify a job manager. Somewhat ironic. > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford wrote: > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > >> > >>> I think we may want to discourage that since it's not portable. I'd say > >>> instead of , one should use >>> jobManager="pbs" url="tg-grid.uc.teragrid.org"/> > >> which is more portable...? > > > > Hmm? I'm asking Ben "Hmm?" because I thought he was aware of the above fact and so unsure what exactly he wanted to know. > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From benc at hawaga.org.uk Wed Feb 13 06:53:24 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 13 Feb 2008 12:53:24 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: On Tue, 12 Feb 2008, Ben Clifford wrote: > > The attached jar should fix that. > > With your new jar, I no longer get that error. I did once get the below > stack trace, though execution appeared to continue. It hasn't happened a > second time or third time on running the same tests. This change should probably find its way into the swift distribution via commits to the various dependencies that I don't commit to (GRAM4 and cog). -- From feller at mcs.anl.gov Wed Feb 13 08:49:03 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Wed, 13 Feb 2008 08:49:03 -0600 (CST) Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: <22184.208.54.7.179.1202914143.squirrel@www-unix.mcs.anl.gov> > > On Tue, 12 Feb 2008, Ben Clifford wrote: > >> > The attached jar should fix that. >> >> With your new jar, I no longer get that error. I did once get the below >> stack trace, though execution appeared to continue. It hasn't happened a >> second time or third time on running the same tests. 
> > This change should probably find its way into the swift distribution via > commits to the various dependencies that I don't commit to (GRAM4 and > cog). > The change has not been committed yet to any branch in ws-gram. As far as i know cog has its own gram jars. Is that right? jars from what GT version are in the latest cog version? From hategan at mcs.anl.gov Wed Feb 13 10:05:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 13 Feb 2008 10:05:02 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: <22184.208.54.7.179.1202914143.squirrel@www-unix.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> <22184.208.54.7.179.1202914143.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202918702.16251.1.camel@blabla.mcs.anl.gov> On Wed, 2008-02-13 at 08:49 -0600, feller at mcs.anl.gov wrote: > > > > On Tue, 12 Feb 2008, Ben Clifford wrote: > > > >> > The attached jar should fix that. > >> > >> With your new jar, I no longer get that error. I did once get the below > >> stack trace, though execution appeared to continue. It hasn't happened a > >> second time or third time on running the same tests. > > > > This change should probably find its way into the swift distribution via > > commits to the various dependencies that I don't commit to (GRAM4 and > > cog). > > > > The change has not been committed yet to any branch in ws-gram. > As far as i know cog has its own gram jars. Is that right? > jars from what GT version are in the latest cog version? Right now it's the first thing you sent. I think before that it was 4.0.2. Mihaek > > From mikekubal at yahoo.com Wed Feb 13 14:03:44 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Wed, 13 Feb 2008 12:03:44 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202862715.31941.5.camel@blabla.mcs.anl.gov> Message-ID: <338963.32679.qm@web52306.mail.re2.yahoo.com> It worked swimmingly with Mihael's suggestion to change gt4 to gt2 in the following line in my sites file: The only warning I get is a failure to transfer kickstart records if I include the gridlaunch argument as in the line below: Cheers, Mike --- Mihael Hategan wrote: > On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde > wrote: > > I think that makes sense - you mean that > jobManager="pbs" works for both > > WS-GRAM and pre-WS-GRAM, right? > > Yes. Not only that, with and WS-GRAM > there is no (known to > me) way to specify a job manager. Somewhat ironic. > > > > > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > > On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford > wrote: > > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > > >> > > >>> I think we may want to discourage that since > it's not portable. I'd say > > >>> instead of , one should use > > >>> jobManager="pbs" > url="tg-grid.uc.teragrid.org"/> > > >> which is more portable...? > > > > > > Hmm? > > I'm asking Ben "Hmm?" because I thought he was aware > of the above fact > and so unsure what exactly he wanted to know. 
> > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping From hategan at mcs.anl.gov Wed Feb 13 14:15:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 13 Feb 2008 14:15:36 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <338963.32679.qm@web52306.mail.re2.yahoo.com> References: <338963.32679.qm@web52306.mail.re2.yahoo.com> Message-ID: <1202933737.21302.4.camel@blabla.mcs.anl.gov> On Wed, 2008-02-13 at 12:03 -0800, Mike Kubal wrote: > It worked swimmingly with Mihael's suggestion to > change gt4 to gt2 Ouch. That was a bit of a mistake there. I was pointing out that should be used instead of . GT2 was accidental. You should probably change that to GT4 unless you're using a checkout more current than yesterday which has some throttling patches to try to prevent killing the head node. > in the following line in my sites > file: > > url="tg-grid1.uc.teragrid.org" /> > > The only warning I get is a failure to transfer > kickstart records if I include the gridlaunch argument > as in the line below: > gridlaunch="/home/wilde/vds/mystart"> > > Cheers, > > Mike > > > > > --- Mihael Hategan wrote: > > > On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde > > wrote: > > > I think that makes sense - you mean that > > jobManager="pbs" works for both > > > WS-GRAM and pre-WS-GRAM, right? > > > > Yes. Not only that, with and WS-GRAM > > there is no (known to > > me) way to specify a job manager. Somewhat ironic. > > > > > > > > > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > > > On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford > > wrote: > > > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > > > >> > > > >>> I think we may want to discourage that since > > it's not portable. I'd say > > > >>> instead of , one should use > > > > >>> jobManager="pbs" > > url="tg-grid.uc.teragrid.org"/> > > > >> which is more portable...? > > > > > > > > Hmm? > > > > I'm asking Ben "Hmm?" because I thought he was aware > > of the above fact > > and so unsure what exactly he wanted to know. > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Looking for last minute shopping deals? > Find them fast with Yahoo! Search. 
http://tools.search.yahoo.com/newsearch/category.php?category=shopping > From mikekubal at yahoo.com Wed Feb 13 14:55:34 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Wed, 13 Feb 2008 12:55:34 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202933737.21302.4.camel@blabla.mcs.anl.gov> Message-ID: <570155.71653.qm@web52302.mail.re2.yahoo.com> What does the gt2/gt4 signify? Using gt4 in the line below causes app on the uc-teragrid to fail with message "cannot execute binary file": Cheers, Mike --- Mihael Hategan wrote: > On Wed, 2008-02-13 at 12:03 -0800, Mike Kubal wrote: > > It worked swimmingly with Mihael's suggestion to > > change gt4 to gt2 > > Ouch. That was a bit of a mistake there. I was > pointing out that > should be used instead of . > GT2 was accidental. > You should probably change that to GT4 unless you're > using a checkout > more current than yesterday which has some > throttling patches to try to > prevent killing the head node. > > > in the following line in my sites > > file: > > > > > url="tg-grid1.uc.teragrid.org" /> > > > > The only warning I get is a failure to transfer > > kickstart records if I include the gridlaunch > argument > > as in the line below: > > > gridlaunch="/home/wilde/vds/mystart"> > > > > Cheers, > > > > Mike > > > > > > > > > > --- Mihael Hategan wrote: > > > > > On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde > > > wrote: > > > > I think that makes sense - you mean that > > > jobManager="pbs" works for both > > > > WS-GRAM and pre-WS-GRAM, right? > > > > > > Yes. Not only that, with and > WS-GRAM > > > there is no (known to > > > me) way to specify a job manager. Somewhat > ironic. > > > > > > > > > > > > > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > > > > On Tue, 2008-02-12 at 23:57 +0000, Ben > Clifford > > > wrote: > > > > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > > > > >> > > > > >>> I think we may want to discourage that > since > > > it's not portable. I'd say > > > > >>> instead of , one should use > > > > > > >>> jobManager="pbs" > > > url="tg-grid.uc.teragrid.org"/> > > > > >> which is more portable...? > > > > > > > > > > Hmm? > > > > > > I'm asking Ben "Hmm?" because I thought he was > aware > > > of the above fact > > > and so unsure what exactly he wanted to know. > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Looking for last minute shopping deals? > > Find them fast with Yahoo! Search. > http://tools.search.yahoo.com/newsearch/category.php?category=shopping > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. 
http://www.yahoo.com/r/hs From hategan at mcs.anl.gov Wed Feb 13 15:03:27 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 13 Feb 2008 15:03:27 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <570155.71653.qm@web52302.mail.re2.yahoo.com> References: <570155.71653.qm@web52302.mail.re2.yahoo.com> Message-ID: <1202936607.22610.0.camel@blabla.mcs.anl.gov> On Wed, 2008-02-13 at 12:55 -0800, Mike Kubal wrote: > What does the gt2/gt4 signify? gt2 - pre-ws gram gt4 - ws gram > > Using gt4 in the line below causes app on the > uc-teragrid to fail with message "cannot execute > binary file": Nevermind then. Though we should probably debug that. > > url="tg-grid1.uc.teragrid.org" /> > > Cheers, > > Mike > > --- Mihael Hategan wrote: > > > On Wed, 2008-02-13 at 12:03 -0800, Mike Kubal wrote: > > > It worked swimmingly with Mihael's suggestion to > > > change gt4 to gt2 > > > > Ouch. That was a bit of a mistake there. I was > > pointing out that > > should be used instead of . > > GT2 was accidental. > > You should probably change that to GT4 unless you're > > using a checkout > > more current than yesterday which has some > > throttling patches to try to > > prevent killing the head node. > > > > > in the following line in my sites > > > file: > > > > > > > > url="tg-grid1.uc.teragrid.org" /> > > > > > > The only warning I get is a failure to transfer > > > kickstart records if I include the gridlaunch > > argument > > > as in the line below: > > > > > gridlaunch="/home/wilde/vds/mystart"> > > > > > > Cheers, > > > > > > Mike > > > > > > > > > > > > > > > --- Mihael Hategan wrote: > > > > > > > On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde > > > > wrote: > > > > > I think that makes sense - you mean that > > > > jobManager="pbs" works for both > > > > > WS-GRAM and pre-WS-GRAM, right? > > > > > > > > Yes. Not only that, with and > > WS-GRAM > > > > there is no (known to > > > > me) way to specify a job manager. Somewhat > > ironic. > > > > > > > > > > > > > > > > > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > > > > > On Tue, 2008-02-12 at 23:57 +0000, Ben > > Clifford > > > > wrote: > > > > > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > > > > > >> > > > > > >>> I think we may want to discourage that > > since > > > > it's not portable. I'd say > > > > > >>> instead of , one should use > > > > > > > > >>> jobManager="pbs" > > > > url="tg-grid.uc.teragrid.org"/> > > > > > >> which is more portable...? > > > > > > > > > > > > Hmm? > > > > > > > > I'm asking Ben "Hmm?" because I thought he was > > aware > > > > of the above fact > > > > and so unsure what exactly he wanted to know. > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Looking for last minute shopping deals? > > > Find them fast with Yahoo! Search. 
> > > http://tools.search.yahoo.com/newsearch/category.php?category=shopping > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > From benc at hawaga.org.uk Thu Feb 14 03:25:35 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 14 Feb 2008 09:25:35 +0000 (GMT) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <570155.71653.qm@web52302.mail.re2.yahoo.com> References: <570155.71653.qm@web52302.mail.re2.yahoo.com> Message-ID: On Wed, 13 Feb 2008, Mike Kubal wrote: > What does the gt2/gt4 signify? There are two totally different job submission systems, both called GRAM. GRAM2 is more deployed but much older. GRAM4 is newer, less used, but has the promise of being (much) more scalable. > Using gt4 in the line below causes app on the > uc-teragrid to fail with message "cannot execute > binary file": > > url="tg-grid1.uc.teragrid.org" /> For debugging these problems, see if you can run the example workflow, examples/vdsk/first.swift - that should help isolate execution problems in general with something application specific. -- From benc at hawaga.org.uk Fri Feb 15 16:41:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Feb 2008 22:41:53 +0000 (GMT) Subject: [Swift-devel] placement of large amounts of client side kickstart records Message-ID: At present, kickstart records go to $PWD. That's lame - 10000 jobs give 10000 files that are i) in $PWD and ii) all in the same directory. I'd like to do something about that. i) is most important - perhaps put them first in a subdirectory named by workflow run ID, eg fmri-20080215-1828-2d433ro1.d/ ii) matters a bit less; however, kickstart records could be staged back into a hierarchy split up by job ID, in the same way that they are split up on the execute side. -- From wilde at mcs.anl.gov Fri Feb 15 16:53:48 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 15 Feb 2008 16:53:48 -0600 Subject: [Swift-devel] placement of large amounts of client side kickstart records In-Reply-To: References: Message-ID: <47B617FC.7090108@mcs.anl.gov> This sounds very good to me. I had to hack around this in the work I did on angle in November. On 2/15/08 4:41 PM, Ben Clifford wrote: > At present, kickstart records go to $PWD. > > That's lame - 10000 jobs give 10000 files that are i) in $PWD and ii) all > in the same directory. > > I'd like to do something about that. > > i) is most important - perhaps put them first in a subdirectory named by > workflow run ID, eg fmri-20080215-1828-2d433ro1.d/ > > ii) matters a bit less; however, kickstart records could be staged back > into a hierarchy split up by job ID, in the same way that they are split > up on the execute side. > From wilde at mcs.anl.gov Tue Feb 19 15:52:48 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Feb 2008 15:52:48 -0600 Subject: [Swift-devel] Re: Swift running errors In-Reply-To: <20080219150017.AWQ22172@m4500-03.uchicago.edu> References: <20080219150017.AWQ22172@m4500-03.uchicago.edu> Message-ID: <47BB4FB0.3030202@mcs.anl.gov> Xi, Regarding the kickstart problem - this is just a warning, possibly due to an incorrect spec in your sites.xml file on where kickstart is installed. We can look into this. 
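For reference, the various sites.xml pieces discussed in this thread might fit together roughly as in the sketch below. The handle, hostname, and paths are purely illustrative, and the attribute names should be checked against the sites.xml documentation for the Swift version in use; the thread above suggests that gridlaunch points at the kickstart install, provider="gt2" selects pre-WS GRAM (gt4 selects WS GRAM), and jobManager="pbs" names the local scheduler instead of encoding it in the URL as .../jobmanager-pbs:

<pool handle="uc-example" gridlaunch="/path/to/kickstart" sysinfo="INTEL32::LINUX">
  <execution provider="gt2" jobManager="pbs" url="tg-grid1.uc.teragrid.org"/>
  <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/path/to/storage" major="2" minor="2"/>
  <workdirectory>/path/to/work</workdirectory>
</pool>

If the gridlaunch path does not actually exist on the remote site, warnings like the kickstart one mentioned above are the likely symptom.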
Regarding "too many open files" - its possible that swift is trying to run too much in parallel and thus opening too many files at once. Mihael or Ben, could this be due to lack of or incorrect setting of the throttling parameters? I cant tell if this is hitting a per-host or per-process limit, but I suspect its the latter. Xi, until you hear from others, look at the throttling parameters and set them to a modest value to start with. I need to go back to my notes for this - and we should document this more clearly in the user guide. - mike On 2/19/08 3:00 PM, lixi at uchicago.edu wrote: > Hi, > > I have two problems. > > 1. Today, when I try to run swift workflow on muliple OSG > sites, I always encounter the following errors which cause > the running failed: > [lixi at login remote]$ swift - > tc.file /home/lixi/swift/test/tc.data - > sites.file /home/lixi/swift/test/OSGEDU_Sites.xml > workflowtest.swift > Swift v0.3-dev r1674 (modified locally) > > RunID: 20080219-1447-1hztqje9 > node started > Failed to transfer kickstart records from workflowtest- > 20080219-1447-1hztqje9/kickstart/8/CIT_CMS_T2Exception in > getFile > task:transfer @ vdl-int.k, line: 322 > sys:try @ vdl-int.k, line: 322 > vdl:transferkickstartrec @ vdl-int.k, line: 409 > sys:set @ vdl-int.k, line: 409 > sys:sequential @ vdl-int.k, line: 409 > sys:try @ vdl-int.k, line: 408 > sys:else @ vdl-int.k, line: 407 > sys:if @ vdl-int.k, line: 405 > sys:set @ vdl-int.k, line: 404 > sys:catch @ vdl-int.k, line: 396 > sys:try @ vdl-int.k, line: 354 > task:allocatehost @ vdl-int.k, line: 334 > vdl:execute2 @ execute-default.k, line: 23 > sys:restartonerror @ execute-default.k, line: 21 > sys:sequential @ execute-default.k, line: 19 > sys:try @ execute-default.k, line: 18 > sys:if @ execute-default.k, line: 17 > sys:then @ execute-default.k, line: 16 > sys:if @ execute-default.k, line: 15 > vdl:execute @ workflowtest.kml, line: 31 > worknode @ workflowtest.kml, line: 79 > sys:sequential @ workflowtest.kml, line: 78 > sys:parallel @ workflowtest.kml, line: 77 > vdl:mainp @ workflowtest.kml, line: 76 > mainp @ vdl.k, line: 150 > vdl:mains @ workflowtest.kml, line: 75 > vdl:mains @ workflowtest.kml, line: 75 > rlog:restartlog @ workflowtest.kml, line: 74 > kernel:project @ workflowtest.kml, line: 2 > workflowtest-20080219-1447-1hztqje9 > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: > Exception in getFile > Caused by: org.globus.ftp.exception.ServerException: Server > refused performing the request. Custom message: (error code > 1) [Nested exception message: Custom message: Unexpected > reply: 500-Command failed. : > globus_gridftp_server_file.c:globus_l_gfs_file_send:2190: > 500-globus_l_gfs_file_open failed. > 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: > 500-globus_xio_register_open failed. > 500-globus_xio_file_driver.c:globus_l_xio_file_open:438: > 500-Unable to open file /raid2/osg-data/lixi/workflowtest- > 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi- > kickstart.xml > 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: > 500-System error in open: No such file or directory > 500-globus_xio: A system call failed: No such file or > directory > 500 End.] [Nested exception is > org.globus.ftp.exception.UnexpectedReplyCodeException: > Custom message: Unexpected reply: 500-Command failed. : > globus_gridftp_server_file.c:globus_l_gfs_file_send:2190: > 500-globus_l_gfs_file_open failed. 
> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: > 500-globus_xio_register_open failed. > 500-globus_xio_file_driver.c:globus_l_xio_file_open:438: > 500-Unable to open file /raid2/osg-data/lixi/workflowtest- > 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi- > kickstart.xml > 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: > 500-System error in open: No such file or directory > 500-globus_xio: A system call failed: No such file or > directory > 500 End.] > > 2. When runing a workflow which involves 1000nodes, I > encounter the following errors very frequently, but not all > the time: > ... > node completed > node completed > node completed > node completed > node completed > node failed > Execution failed: > Exception in node: > Arguments: [_concurrent/intermediatefile-b5b5dc39-df70-4137- > 8149-c20f5d1af839-, out.0132.txt] > Host: localhost > Directory: workflowtest-20080219-1443-2qx4ctkc/jobs/6/node- > 64kddnoi > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > java.io.IOException: Too many open files > > Could you tell me why and teach me how to resolve such > problems? > > Thanks, > > Xi > > From hategan at mcs.anl.gov Thu Feb 21 12:53:25 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 21 Feb 2008 12:53:25 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: <1203620005.2850.9.camel@blabla.mcs.anl.gov> An email message in a different thread brings the question: is this compiled for 1.4 or 1.5? Mihael On Mon, 2008-02-11 at 23:28 -0600, feller at mcs.anl.gov wrote: > My fault, not the ObjectSerializers one. > You submitted in batch-mode? > The attached jar should fix that. > Hope the java version is fine. > Martin > > > Martin? > > > > On Mon, 2008-02-11 at 22:50 +0000, Ben Clifford wrote: > >> I'm seeing repeatable cleanup errors like the below. The workflows run > >> to > >> completion, though. > >> > >> RunID: 20080211-2248-rsqe1da0 > >> cat started > >> cat completed > >> The following warnings have occurred: > >> 1. 
Cleanup on tguc failed > >> Caused by: > >> Cannot submit job: null > >> Caused by: > >> java.lang.NullPointerException > >> at > >> org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) > >> at > >> org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) > >> at org.globus.exec.client.GramJob.submit(GramJob.java:447) > >> at > >> org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) > >> at > >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) > >> at > >> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) > >> at > >> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > >> > >> > > > > From feller at mcs.anl.gov Thu Feb 21 13:32:29 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Thu, 21 Feb 2008 13:32:29 -0600 (CST) Subject: [Swift-devel] cog r1871 In-Reply-To: <1203620005.2850.9.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> <1203620005.2850.9.camel@blabla.mcs.anl.gov> Message-ID: <21069.208.54.7.178.1203622349.squirrel@www-unix.mcs.anl.gov> 99.63% sure that it was built with java 1.4 if it was 1.5 a client running under 1.4 should see errors. Martin > An email message in a different thread brings the question: is this > compiled for 1.4 or 1.5? > > Mihael > > On Mon, 2008-02-11 at 23:28 -0600, feller at mcs.anl.gov wrote: >> My fault, not the ObjectSerializers one. >> You submitted in batch-mode? >> The attached jar should fix that. >> Hope the java version is fine. >> Martin >> >> > Martin? >> > >> > On Mon, 2008-02-11 at 22:50 +0000, Ben Clifford wrote: >> >> I'm seeing repeatable cleanup errors like the below. The workflows >> run >> >> to >> >> completion, though. >> >> >> >> RunID: 20080211-2248-rsqe1da0 >> >> cat started >> >> cat completed >> >> The following warnings have occurred: >> >> 1. Cleanup on tguc failed >> >> Caused by: >> >> Cannot submit job: null >> >> Caused by: >> >> java.lang.NullPointerException >> >> at >> >> org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) >> >> at >> >> org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) >> >> at org.globus.exec.client.GramJob.submit(GramJob.java:447) >> >> at >> >> org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) >> >> at >> >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) >> >> at >> >> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) >> >> at >> >> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >> >> >> >> >> > >> > > > From benc at hawaga.org.uk Sun Feb 24 17:12:16 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 24 Feb 2008 23:12:16 +0000 (GMT) Subject: [Swift-devel] some racelike condition on file stagein Message-ID: I see errors like the below sometimes on my laptop and on the NMI test systems, happening in file stagein. 
It happens sporadically and seemingly not very often if at all on i386 linux, so feels like some kind of race condition. Every time I've seen it, its been creating a file under a non-trivial directory structure (eg under _concurrent) so maybe there's something funny going on there. Its come to my attention because I put in a test that turns off execution retries and runs all the local behaviour tests, on the basis that local execution should be very unlikely to need retries. org.globus.cog.abstraction.impl.file.FileResourceException: Failed to create directory: _concurrent/aligned-a0f7b757-142a-4c66-8288-bd06ec2d591c--array//elt-4.-field http://nmi-s005.cs.wisc.edu:80/nmi/run/benc/2008/02/benc_nmi-s005.cs.wisc.edu_1203870306_20341/userdir/nmi:x86_fc_3/remote_task.err I'll put some more details in the bugzilla. -- From lixi at uchicago.edu Tue Feb 19 15:00:17 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Tue, 19 Feb 2008 15:00:17 -0600 (CST) Subject: [Swift-devel] Swift running errors Message-ID: <20080219150017.AWQ22172@m4500-03.uchicago.edu> Hi, I have two problems. 1. Today, when I try to run swift workflow on muliple OSG sites, I always encounter the following errors which cause the running failed: [lixi at login remote]$ swift - tc.file /home/lixi/swift/test/tc.data - sites.file /home/lixi/swift/test/OSGEDU_Sites.xml workflowtest.swift Swift v0.3-dev r1674 (modified locally) RunID: 20080219-1447-1hztqje9 node started Failed to transfer kickstart records from workflowtest- 20080219-1447-1hztqje9/kickstart/8/CIT_CMS_T2Exception in getFile task:transfer @ vdl-int.k, line: 322 sys:try @ vdl-int.k, line: 322 vdl:transferkickstartrec @ vdl-int.k, line: 409 sys:set @ vdl-int.k, line: 409 sys:sequential @ vdl-int.k, line: 409 sys:try @ vdl-int.k, line: 408 sys:else @ vdl-int.k, line: 407 sys:if @ vdl-int.k, line: 405 sys:set @ vdl-int.k, line: 404 sys:catch @ vdl-int.k, line: 396 sys:try @ vdl-int.k, line: 354 task:allocatehost @ vdl-int.k, line: 334 vdl:execute2 @ execute-default.k, line: 23 sys:restartonerror @ execute-default.k, line: 21 sys:sequential @ execute-default.k, line: 19 sys:try @ execute-default.k, line: 18 sys:if @ execute-default.k, line: 17 sys:then @ execute-default.k, line: 16 sys:if @ execute-default.k, line: 15 vdl:execute @ workflowtest.kml, line: 31 worknode @ workflowtest.kml, line: 79 sys:sequential @ workflowtest.kml, line: 78 sys:parallel @ workflowtest.kml, line: 77 vdl:mainp @ workflowtest.kml, line: 76 mainp @ vdl.k, line: 150 vdl:mains @ workflowtest.kml, line: 75 vdl:mains @ workflowtest.kml, line: 75 rlog:restartlog @ workflowtest.kml, line: 74 kernel:project @ workflowtest.kml, line: 2 workflowtest-20080219-1447-1hztqje9 Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Exception in getFile Caused by: org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190: 500-globus_l_gfs_file_open failed. 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: 500-globus_xio_register_open failed. 
500-globus_xio_file_driver.c:globus_l_xio_file_open:438: 500-Unable to open file /raid2/osg-data/lixi/workflowtest- 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi- kickstart.xml 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: 500-System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500 End.] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190: 500-globus_l_gfs_file_open failed. 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: 500-globus_xio_register_open failed. 500-globus_xio_file_driver.c:globus_l_xio_file_open:438: 500-Unable to open file /raid2/osg-data/lixi/workflowtest- 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi- kickstart.xml 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: 500-System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500 End.] 2. When runing a workflow which involves 1000nodes, I encounter the following errors very frequently, but not all the time: ... node completed node completed node completed node completed node completed node failed Execution failed: Exception in node: Arguments: [_concurrent/intermediatefile-b5b5dc39-df70-4137- 8149-c20f5d1af839-, out.0132.txt] Host: localhost Directory: workflowtest-20080219-1443-2qx4ctkc/jobs/6/node- 64kddnoi stderr.txt: stdout.txt: ---- Caused by: java.io.IOException: Too many open files Could you tell me why and teach me how to resolve such problems? Thanks, Xi From zhoujianghua1017 at 163.com Tue Feb 26 07:48:26 2008 From: zhoujianghua1017 at 163.com (jezhee) Date: Tue, 26 Feb 2008 21:48:26 +0800 Subject: [Swift-devel] Some questions about Swift Message-ID: <200802262145466621306@163.com> swift-devel? Hi. This is Zhou Jianghua from China and come into some problems when using Swift. Waiting for your guide and thanks a lot. I have installed the Swift environment in my computer and run some simple examples in local machine. All things were normal except that the exection was very slow. A simple program just displaying text on the screen took 5 to 10 seconds. Could you tell me why? Besides, I followed the instructions in the documentation, Swift lab at University of Chicago Computation Institute,part I: Grid workflow(url:http://www.ci.uchicago.edu/osgedu/schools/swiftlab/). BUt, I didn't find the folder sw in my machine, and the file sites-chicago.xml neither. So, I can't let my program run at a remote host. How to solve this? ?Regards. 2008-02-26 ////////////////////////////////////////// // Zhou Jianghua zhoujianghua1017 at 163.com // EI Dep, Huazhong Uni of Sci & Tech // Internet Technology and Engineering Center // http://www.itec.org.cn // // Tel?(86)27-87792139 // Fax?(86)27-87540745 // Zipcode?430074 // Address?Luoyu Road 1037, Wuhan, Hubei, China ///////////////////////////////////////// From benc at hawaga.org.uk Tue Feb 26 10:42:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 Feb 2008 16:42:04 +0000 (GMT) Subject: [Swift-devel] Some questions about Swift In-Reply-To: <200802262145466621306@163.com> References: <200802262145466621306@163.com> Message-ID: On Tue, 26 Feb 2008, jezhee wrote: > I have installed the Swift environment in my computer and run some > simple examples in local machine. All things were normal except that > the exection was very slow. 
A simple program just displaying text on the > screen took 5 to 10 seconds. Could you tell me why? There is a lot of startup involved with running a swift program - that is probably most of the time you see. This time consists of loading the JVM, loading various libraries and compiling your program. However, if you run two programs, you should find that it takes about the same amount of time, not twice as long. > Besides, I followed the instructions in the documentation, Swift lab > at University of Chicago Computation Institute,part I: Grid > workflow(url:http://www.ci.uchicago.edu/osgedu/schools/swiftlab/). BUt, > I didn't find the folder sw in my machine, and the file > sites-chicago.xml neither. So, I can't let my program run at a remote > host. How to solve this? Those instructions won't work if you are working on your own machine. Have you ever used Globus to run a job on the grid before? If so, then I can show you how to use Swift to submit jobs from there using your existing setup. If you have not, then you should get set up to submit jobs to some execution system first - for example, apply for an account on the CI gridlab at http://www.ci.uchicago.edu/osgedu/schools/gridlab/ -- From benc at hawaga.org.uk Wed Feb 27 02:32:43 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Feb 2008 08:32:43 +0000 (GMT) Subject: [Swift-devel] Re: [Newslab] Re: Getting all RSS data into database In-Reply-To: <4290b6c60802260953q1098c639s7eb0db3531052868@mail.gmail.com> References: <47C2E38D.6060700@mcs.anl.gov> <4290b6c60802250920p108c77f6j17879871c532490@mail.gmail.com> <47C2FA1B.40509@mcs.anl.gov> <4290b6c60802251006w7694ectbe773d6211ba8cbb@mail.gmail.com> <4290b6c60802260831q79b1acbeoaf8a454c09e6a9a0@mail.gmail.com> <4290b6c60802260953q1098c639s7eb0db3531052868@mail.gmail.com> Message-ID: Note: I added swift-devel to this piece of the thread because it is relevant there; and perhaps now not so relevant to the newslab list. On Tue, 26 Feb 2008, Quan Tran Pham wrote: > > What I think you are trying to do is merge a bunch of files into a single > big file? (which is not in itself a merge sort) > I have merge2 that merge two sorted files (contain sorted key + value) into > one big sorted file. A different way of thinking about this, which is perhaps more of interest to the swift development group rather than newslab directly: Define a binary operator like >+ meaning somthing like ordered-concatenate, which will combine two files in the appropriate ordered fashion. file >+ file --> file This operator is commutative. Then have foldC able to fold knowing that the supplied operator is commutative (so it can split up in a binary fashion, or however other way it cares to). Now say: file[] inputs file output output = foldC (>+) inputs Perhaps foldC should be provided by Swift, with >+ provided as a procedure. -- From benc at hawaga.org.uk Wed Feb 27 03:20:11 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Feb 2008 09:20:11 +0000 (GMT) Subject: [Swift-devel] compile time error handling Message-ID: Over the past day or so, I've committed a bunch of compile time error handling changes. r1691 is the last one of those for now. Some more compile error messages will now have source line numbers in them. There is more compile-time static analysis of the program, which should result in errors occurring at compile time rather than part-way through workflow execution. Programs which previously ran OK should still compile OK. As always, indicate here if not. 
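Coming back to the foldC sketch a couple of messages up: the reason a commutative combining operator helps is that the runtime is free to pair up inputs in any order and reduce them as a tree rather than strictly left to right. Outside Swift, the same idea can be tried in plain shell, using "sort -m" (ordered merge) as a stand-in for the >+ ordered-concatenate operator; the file names here are illustrative:

# illustrative only: reduce a set of sorted files pairwise,
# with "sort -m" standing in for the >+ operator
set -- part-*.sorted
round=0
while [ $# -gt 1 ]; do
    next=""
    while [ $# -ge 2 ]; do
        out="merge-$round-$#.sorted"
        sort -m "$1" "$2" > "$out"
        next="$next $out"
        shift 2
    done
    [ $# -eq 1 ] && next="$next $1"
    set -- $next
    round=$((round + 1))
done
mv "$1" merged.sorted

Each pass halves the number of files, and the merges within a pass are independent of one another, which is exactly the parallelism a foldC that knows the operator is commutative could exploit.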
-- From benc at hawaga.org.uk Wed Feb 27 03:45:25 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Feb 2008 09:45:25 +0000 (GMT) Subject: [Swift-devel] get an @arg as an int. Message-ID: I want to make my load tests take the number of procedures to run as a commandline @arg. So I want to iterate over [1:@arg(foo)] or something like that. But @arg(...) has type string. I have a straightforward implementation of @toint(string) that fixes this. I am however slightly concerned about a lack of coherency in what can be cast to what / what can be read from a file / the forms those things take (eg. @extractint, readdata, @toint) -- From benc at hawaga.org.uk Wed Feb 27 08:07:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Feb 2008 14:07:19 +0000 (GMT) Subject: [Swift-devel] Re: some racelike condition on file stagein In-Reply-To: References: Message-ID: File.mkdirs() is not thread-safe, according to typing "threadsafe java mkdirs" into google. I applied the below patch to my cog checkout and the error goes away for me. However, I'm not a cog developer, so someone else needs to fix this in the CoG SVN. Index: cog/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java =================================================================== --- cog.orig/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java 2007-08-27 09:30:23.000000000 +0100 +++ cog/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java 2008-02-27 13:51:42.000000000 +0000 @@ -146,15 +146,19 @@ } } +static Object mkdirlock = new Object(); + public void createDirectories(String directory) throws FileResourceException { if (directory == null || directory.equals("")) { return; } File f = resolve(directory); + synchronized(mkdirlock) { if (!f.mkdirs() && !f.exists()) { throw new FileResourceException("Failed to create directory: " + directory); } + } } public void deleteDirectory(String dir, boolean force) throws FileResourceException { From hategan at mcs.anl.gov Wed Feb 27 09:33:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 Feb 2008 09:33:36 -0600 Subject: [Swift-devel] Re: some racelike condition on file stagein In-Reply-To: References: Message-ID: <1204126416.17698.7.camel@blabla.mcs.anl.gov> Right. The same problem likely applies to gridftp. On Wed, 2008-02-27 at 14:07 +0000, Ben Clifford wrote: > File.mkdirs() is not thread-safe, according to typing "threadsafe java > mkdirs" into google. > > I applied the below patch to my cog checkout and the error goes away for > me. However, I'm not a cog developer, so someone else needs to fix this in > the CoG SVN. 
> > Index: cog/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java > =================================================================== > --- cog.orig/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java 2007-08-27 09:30:23.000000000 +0100 > +++ cog/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java 2008-02-27 13:51:42.000000000 +0000 > @@ -146,15 +146,19 @@ > } > } > > +static Object mkdirlock = new Object(); > + > public void createDirectories(String directory) > throws FileResourceException { > if (directory == null || directory.equals("")) { > return; > } > File f = resolve(directory); > + synchronized(mkdirlock) { > if (!f.mkdirs() && !f.exists()) { > throw new FileResourceException("Failed to create directory: " + directory); > } > + } > } > > public void deleteDirectory(String dir, boolean force) throws FileResourceException { > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From quanpt at gmail.com Wed Feb 27 10:00:12 2008 From: quanpt at gmail.com (Quan Tran Pham) Date: Wed, 27 Feb 2008 10:00:12 -0600 Subject: [Swift-devel] Re: [Newslab] Re: Getting all RSS data into database In-Reply-To: References: <47C2E38D.6060700@mcs.anl.gov> <4290b6c60802250920p108c77f6j17879871c532490@mail.gmail.com> <47C2FA1B.40509@mcs.anl.gov> <4290b6c60802251006w7694ectbe773d6211ba8cbb@mail.gmail.com> <4290b6c60802260831q79b1acbeoaf8a454c09e6a9a0@mail.gmail.com> <4290b6c60802260953q1098c639s7eb0db3531052868@mail.gmail.com> Message-ID: <4290b6c60802270800j6577a5e2ue684164a2fe4bac4@mail.gmail.com> I would support the idea. That foldC-like function has been used in some other languages: + reduce in python (they have order from left to right, by the way) + reduce phase in MapReduce programming model ( http://labs.google.com/papers/mapreduce.html) Quan On Wed, Feb 27, 2008 at 2:32 AM, Ben Clifford wrote: > > Note: I added swift-devel to this piece of the thread because it is > relevant there; and perhaps now not so relevant to the newslab list. > > On Tue, 26 Feb 2008, Quan Tran Pham wrote: > > > > What I think you are trying to do is merge a bunch of files into a > single > > big file? (which is not in itself a merge sort) > > > I have merge2 that merge two sorted files (contain sorted key + value) > into > > one big sorted file. > > A different way of thinking about this, which is perhaps more of interest > to the swift development group rather than newslab directly: > > Define a binary operator like >+ meaning somthing like > ordered-concatenate, which will combine two files in the appropriate > ordered fashion. > > file >+ file --> file > > This operator is commutative. > > Then have foldC able to fold knowing that the supplied operator is > commutative (so it can split up in a binary fashion, or however other way > it cares to). > > Now say: > > file[] inputs > file output > output = foldC (>+) inputs > > Perhaps foldC should be provided by Swift, with >+ provided as a > procedure. > > -- > -- Quan Tran Pham PhD Student Department of Computer Science University of Chicago 1100 E 58th Street, Chicago, IL 60637 Office: Ryerson 178 Phone: (773)702-4227 Fax: (773)702-8487 quanpt at cs.uchicago.edu --- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Thu Feb 28 16:22:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 28 Feb 2008 22:22:34 +0000 (GMT) Subject: [Swift-devel] runtime console stats Message-ID: In the style of the RFT client, I implemented a runtime progress ticker that every few seconds outputs a line of how many jobs are in each internal state. See below for an example output. I think exposing the various internal states on the console is a useful thing to do. The states in the below example are a bit lame - should probably have something like: Waiting for a site to be allocated; staging in; submitted for execution; staging out; all finished.. $ swift 130-fmri.swift Swift v0.3-dev r1689 (modified locally) RunID: 20080228-1619-xkb5elaf Progress: touch started touch started touch started touch started Progress: EXECUTE:3 STAGEOUT:1 START:4 touch completed touch completed touch completed touch completed touch started Progress: EXECUTE2DONE:1 END:4 START:3 touch completed touch started touch started touch started touch completed touch completed touch started touch started touch completed touch started Progress: EXECUTE:2 EXECUTE2:1 END:8 touch completed touch completed touch completed Final status: END:11 From foster at mcs.anl.gov Thu Feb 28 18:06:58 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 28 Feb 2008 18:06:58 -0600 Subject: [Swift-devel] runtime console stats In-Reply-To: References: Message-ID: <47C74CA2.4090606@mcs.anl.gov> cool! Ben Clifford wrote: > In the style of the RFT client, I implemented a runtime progress ticker > that every few seconds outputs a line of how many jobs are in each > internal state. See below for an example output. > > I think exposing the various internal states on the console is a useful > thing to do. > > The states in the below example are a bit lame - should probably have > something like: Waiting for a site to be allocated; staging in; submitted > for execution; staging out; all finished.. > > > $ swift 130-fmri.swift > Swift v0.3-dev r1689 (modified locally) > > RunID: 20080228-1619-xkb5elaf > Progress: > touch started > touch started > touch started > touch started > Progress: EXECUTE:3 STAGEOUT:1 START:4 > touch completed > touch completed > touch completed > touch completed > touch started > Progress: EXECUTE2DONE:1 END:4 START:3 > touch completed > touch started > touch started > touch started > touch completed > touch completed > touch started > touch started > touch completed > touch started > Progress: EXECUTE:2 EXECUTE2:1 END:8 > touch completed > touch completed > touch completed > Final status: END:11 > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Fri Feb 29 06:51:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 29 Feb 2008 12:51:53 +0000 (GMT) Subject: [Swift-devel] execute side md5sum Message-ID: Its pretty straightforward to modify the wrapper to take a hash (eg md5sum) of input files before and output files after execution (I made a prototype yesterday afternoon) and log those hashes. This gives a convenient summary of the content of the inputs and outputs that is automated and hard to break through lack of attention; and so is probably useful for questions like "was this run with the same version or a different version of a particular input file". 
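As a rough illustration of the wrapper change being proposed (the variable names here are made up; the real wrapper script keeps its own notion of the staged-in and staged-out file lists and of where its log goes):

# illustrative only: hash declared input files before the application runs
# and declared output files afterwards, appending to the per-job log
for f in $STAGED_IN_FILES; do
    md5sum "$f" >> "$INFO_LOG"
done
run_the_application   # stands in for the wrapper's normal application invocation
for f in $STAGED_OUT_FILES; do
    md5sum "$f" >> "$INFO_LOG"
done

Any stable digest would do here; md5sum is simply the one named in the prototype described above.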
(this is not some universal versioning solution that will solve the unsolvable; instead it provides an answer to 'was this file the same or different than some other file?' over which can be laid other exciting versioning systems) I think having something like this is probably useful optional functionality (enabled in the same was as kickstart, perhaps). -- From itf at mcs.anl.gov Fri Feb 29 07:03:23 2008 From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=) Date: Fri, 29 Feb 2008 13:03:23 +0000 Subject: [Swift-devel] execute side md5sum In-Reply-To: References: Message-ID: <1468697544-1204290242-cardhu_decombobulator_blackberry.rim.net-447726758-@bxe122.bisx.prod.on.blackberry> Definitely. How about the exeuctable as well? Ian Sent via BlackBerry from T-Mobile -----Original Message----- From: Ben Clifford Date: Fri, 29 Feb 2008 12:51:53 To:swift-devel at ci.uchicago.edu Subject: [Swift-devel] execute side md5sum Its pretty straightforward to modify the wrapper to take a hash (eg md5sum) of input files before and output files after execution (I made a prototype yesterday afternoon) and log those hashes. This gives a convenient summary of the content of the inputs and outputs that is automated and hard to break through lack of attention; and so is probably useful for questions like "was this run with the same version or a different version of a particular input file". (this is not some universal versioning solution that will solve the unsolvable; instead it provides an answer to 'was this file the same or different than some other file?' over which can be laid other exciting versioning systems) I think having something like this is probably useful optional functionality (enabled in the same was as kickstart, perhaps). -- _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel