From benc at hawaga.org.uk Fri Feb 1 16:03:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 1 Feb 2008 22:03:07 +0000 (GMT) Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> Message-ID: related to this, Swift can use PBS directly if its run on the headnode. in some cases, this is going to be preferable to using either version of GRAM. I think this would have avoided the particular problem encountered here. I haven't tried this on TG-UC, but it seems to work ok for me on teraport. -- From benc at hawaga.org.uk Fri Feb 1 17:29:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 1 Feb 2008 23:29:51 +0000 (GMT) Subject: [Swift-devel] nightly rebuild of documentation Message-ID: I finally got round to setting up a cron job to update the webspace from SVN every 24h (i.e. it runs update.sh) So now, unless its urgent, you can commit doc changes to SVN and not have to log in to update the actual deployment of those docs. -- From hategan at mcs.anl.gov Fri Feb 1 19:07:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 01 Feb 2008 19:07:02 -0600 Subject: [Swift-devel] Support request: Swift jobs flooding uc-teragrid? In-Reply-To: <1201750553.11697.8.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <5A5A2046-8B6E-43AA-83CE-7DDE560E2F8E@mcs.anl.gov> <479F5047.2040805@mcs.anl.gov> <479F7B67.7070400@mcs.anl.gov> <479F809B.5050306@cs.uchicago.edu> <3B95054D-AECE-438E-8F94-2A7547304372@mcs.anl.gov> <523CF375-AFD1-46E1-A6B3-40EF0A4258D5@mcs.anl.gov> <1201664122.32688.36.camel@blabla.mcs.anl.gov> <47A08A36.1000502@mcs.anl.gov> <0E3D6F9E-5D4C-42D3-A693-62B043406BBF@mcs.anl.gov> <1201718259.5465.1.camel@blabla.mcs.anl.gov> <69921E77-384E-4D36-922C-BED4D52F2177@mcs.anl.gov> <1201742109.9441.3.camel@blabla.mcs.anl.gov> <47A13F4C.60507@mcs.anl.gov> <1201750553.11697.8.camel@blabla.mcs.anl.gov> Message-ID: <1201914422.5589.1.camel@blabla.mcs.anl.gov> Nice. 
I can see other people's jobs: hategan at tg-grid1:~> cat /soft/prews-gram-4.0.1-r3/tmp/gram_job_state/job.tg-grid1.uc.teragrid.org.9324.1184364728 https://tg-grid1.uc.teragrid.org:50170/9324/1184364728/ 12 128 0 1460443.tg-master.uc.teragrid.org &(rsl_substitution=(GRIDMANAGER_GASS_URL https://sidgrid.ci.uchicago.edu:60651))(executable='/home/skenny/vds_32/bin/kickstart')(directory='/home/skenny/sidgrid_out/skenny/skenny/wf_test/run0001')(arguments=-n upload::uploader -N sidgrid::UploadClient -R ANLUCTERAGRID32 /home/skenny/sidgrid/soft/upload/uploader skenny wf_test graspB.lh.forperm.txt_260.output graspB.lh.forperm.txt_261.output graspB.lh.forperm.txt_262.output graspB.lh.forperm.txt_263.output graspB.lh.forperm.txt_264.output graspB.lh.forperm.txt_265.output)(stderr=$(GLOBUS_CACHED_STDERR))(file_stage_out=($(GLOBUS_CACHED_STDERR) $(GRIDMANAGER_GASS_URL)#'/ci/sidgrid.ci.uchicago.edu/htdocs/sidgrid/sidgrid_test_server/sidgrid/transformations/sidgridUsers/skenny/wf_test/run0001/uploader_ID000007.err'))(environment=(app '/app/osg_app')(data '/home/skenny/data')(tmp '/tmp')(wntmp '/tmp'))(proxy_timeout=240)(save_state=yes)(two_phase=600)(remote_io_url=$(GRIDMANAGER_GASS_URL))(jobtype=single)(maxwalltime=2400) https://tg-grid1.uc.teragrid.org:50170/9324/1184364728/ ... On Wed, 2008-01-30 at 21:35 -0600, Mihael Hategan wrote: > Sure, I'd do that anyway to test the testing script(s)/process. I mean > if I do mess it, I want to make sure I only need to do it once. > > But I'm thinking it's better to agree on some time than for Joe or Ti or > JP to randomly wonder what's going on. > > On the other hand, seeing many processes in my name will probably > eliminate the confusion :) > > On Wed, 2008-01-30 at 21:23 -0600, Michael Wilde wrote: > > I suggested we start the tests at a moderate intensity, and record the > > impact on CPU, mem, qlength, etc. > > > > Then ramp up untl those indicators start to suggest that the gk is under > > strain. > > > > Its not 100% foolproof, but better than blind stress testing. > > > > - mike > > > > > > On 1/30/08 7:15 PM, Mihael Hategan wrote: > > > Me doing such tests will probably mess the gatekeeper node again. How do > > > we proceed? > > > > > > On Wed, 2008-01-30 at 13:19 -0600, Stuart Martin wrote: > > >> I'm saying run swift tests using GRAM4 and see what you get. Run a > > >> similar job scenario like 2000 jobs to the same GRAM4 service. I will > > >> be interested to see how swift does for performance, scalability, > > >> errors... > > >> It's possible that condor-g is not optimal, so seeing how another > > >> GRAM4 client dong similar job submission scenarios fares would make > > >> for an interesting comparison. > > >> > > >> -Stu > > >> > > >> On Jan 30, 2008, at Jan 30, 12:37 PM, Mihael Hategan wrote: > > >> > > >>> I'm confused. Why would you want to test GRAM scalability while > > >>> introducing additional biasing elements, such as Condor-G? > > >>> > > >>> On Wed, 2008-01-30 at 11:21 -0600, Stuart Martin wrote: > > >>>> All, > > >>>> > > >>>> I wanted to chime in with a number of things being discussed here. > > >>>> > > >>>> There is a GRAM RFT Core reliability group focused on ensuring the > > >>>> GRAM service stays up and functional in spit of an onslaught from a > > >>>> client. http://confluence.globus.org/display/CDIGS/GRAM-RFT-Core+Reliability+Tiger+Team > > >>>> > > >>>> The ultimate goal here is that a client may get a timeout and that > > >>>> would be the signal to backoff some. 
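(A rough sketch of the client-side back-off idea described just above — this is not from the original thread, and submit() below is a placeholder rather than the real GRAM4 client API.)

    import java.util.concurrent.TimeoutException;

    // Sketch only: treat a submission timeout as the signal to back off, then
    // retry after an exponentially growing pause. submit() stands in for the
    // actual GRAM job submission call.
    public class BackoffSubmit {
        static void submit(String jobSpec) throws TimeoutException {
            // placeholder for the real submission to the GRAM service
        }

        public static void main(String[] args) throws InterruptedException {
            long delayMillis = 1000;            // first pause: 1 second
            final long maxDelayMillis = 60000;  // cap the pause at one minute
            while (true) {
                try {
                    submit("example-job");
                    break;                      // accepted by the service
                } catch (TimeoutException e) {
                    Thread.sleep(delayMillis);  // overloaded: wait before retrying
                    delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
                }
            }
        }
    }
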
> > >>>> > > >>>> ----- > > >>>> > > >>>> OSG - VO testing: We worked with Terrence (CMS) recently and here are > > >>>> his test results. > > >>>> http://hepuser.ucsd.edu/twiki/bin/view/UCSDTier2/WSGramTests > > >>>> > > >>>> GRAM2 handled this 2000 jobs x 2 condor-g clients to the same GRAM > > >>>> service better than GRAM4. But again, this is with the condor-g > > >>>> tricks. Without the tricks, GRAM2 will handle the load better. > > >>>> > > >>>> OSG VTB testing: These were using globusrun-ws and also condor-g. > > >>>> https://twiki.grid.iu.edu/twiki/bin/view/Integration/ > > >>>> WSGramValidation > > >>>> > > >>>> clients in these tests got a variety of errors depending on the jobs > > >>>> run: timeouts, GridFTP authentication errors, client-side OOM, ... > > >>>> GRAM4 functions pretty well, but it was not able to handle Terrence's > > >>>> scenario. But it handled 1000 jobs x 1 condor-g client just fine. > > >>>> > > >>>> ----- > > >>>> > > >>>> It would be very interesting to see how swift does with GRAM4. This > > >>>> would make for a nice comparison to condor-g. > > >>>> > > >>>> As far as having functioning GRAM4 services on TG, things have > > >>>> improved. LEAD is using GRAM4 exclusively and we've been working > > >>>> with > > >>>> them to make sure the GRAM4 services are up and functioning. INCA > > >>>> has > > >>>> been updated to more effectively test and monitor GRAM4 and GridFTP > > >>>> services that LEAD is targeting. This could be extended for any > > >>>> hosts > > >>>> that swift would like to test against. Here are some interesting > > >>>> charts from INCA - http://cuzco.sdsc.edu:8085/cgi-bin/lead.cgi > > >>>> > > >>>> -Stu > > >>>> > > >>>> On Jan 30, 2008, at Jan 30, 10:00 AM, Ti Leggett wrote: > > >>>> > > >>>>> On Jan 30, 2008, at 01/30/08 09:48 AM, Ben Clifford wrote: > > >>>>> > > >>>>> [snip] > > >>>>> > > >>>>>> No. The default behaviour when working with a user who is "just > > >>>>>> trying to > > >>>>>> get their stuff to run" is "screw this, use GRAM2 because it > > >>>>>> works". > > >>>>>> > > >>>>>> Its a self-reinforcing feedback loop, that will be broken at the > > >>>>>> point > > >>>>>> that it becomes easier for people to stick with GRAM4 than default > > >>>>>> back to > > >>>>>> GRAM2. I guess we need to keep trying every now and then and hope > > >>>>>> that one > > >>>>>> time it sticks ;-) > > >>>>>> > > >>>>>> -- > > >>>>> Well this works to a point, but if falling back to a technology that > > >>>>> is known to not be scalable for your sizes results in killing a > > >>>>> machine, I, as a site admin, will eventually either a) deny you > > >>>>> service b) shut down the poorly performing service or c) all of the > > >>>>> above. So it's in your best interest to find and use those > > >>>>> technologies that are best suited to the task at hand so the users > > >>>>> of your software don't get nailed by (a). > > >>>>> > > >>>>> In this case it seems to me that using WS-GRAM, extending WS-GRAM > > >>>>> and/or MDS to report site statistics, and/or modifying WS-GRAM to > > >>>>> throttle itself (think of how apache reports "Server busy. Try again > > >>>>> later") is the best path forward. 
For the short term, it seems that > > >>>>> the Swift developers should manually find those limits for sites > > >>>>> that the users use regularly for them to use, *and* educate their > > >>>>> users on how to identify that they could be adversely affecting a > > >>>>> resource and throttle themselves till the ideal, automated method is > > >>>>> a usable reality. > > >>>>> > > >>>> _______________________________________________ > > >>>> Swift-devel mailing list > > >>>> Swift-devel at ci.uchicago.edu > > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>>> > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From leggett at mcs.anl.gov Sun Feb 3 15:34:45 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Sun, 3 Feb 2008 15:34:45 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <921658.18899.qm@web52308.mail.re2.yahoo.com> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> Message-ID: <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> Mike, You're killing tg-grid1 again. Can someone work with Mike to get some swift settings that don't kill our server? On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > Yes, I'm submitting molecular dynamics simulations > using Swift. > > Is there a default wall-time limit for jobs on tg-uc? > > > > --- joseph insley wrote: > >> Actually, these numbers are now escalating... >> >> top - 17:18:54 up 2:29, 1 user, load average: >> 149.02, 123.63, 91.94 >> Tasks: 469 total, 4 running, 465 sleeping, 0 >> stopped, 0 zombie >> >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >> 479 >> >> insley at tg-viz-login1:~> time globusrun -a -r >> tg-grid.uc.teragrid.org >> GRAM Authentication test successful >> real 0m26.134s >> user 0m0.090s >> sys 0m0.010s >> >> >> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >> >>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >> TG GRAM host) >>> became unresponsive and had to be rebooted. I am >> now seeing slow >>> response times from the Gatekeeper there again. >> Authenticating to >>> the gatekeeper should only take a second or two, >> but it is >>> periodically taking up to 16 seconds: >>> >>> insley at tg-viz-login1:~> time globusrun -a -r >> tg-grid.uc.teragrid.org >>> GRAM Authentication test successful >>> real 0m16.096s >>> user 0m0.060s >>> sys 0m0.020s >>> >>> looking at the load on tg-grid, it is rather high: >>> >>> top - 16:55:26 up 2:06, 1 user, load average: >> 89.59, 78.69, 62.92 >>> Tasks: 398 total, 20 running, 378 sleeping, 0 >> stopped, 0 zombie >>> >>> And there appear to be a large number of processes >> owned by kubal: >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>> 380 >>> >>> I assume that Mike is using swift to do the job >> submission. Is >>> there some throttling of the rate at which jobs >> are submitted to >>> the gatekeeper that could be done that would >> lighten this load >>> some? (Or has that already been done since >> earlier today?) The >>> current response times are not unacceptable, but >> I'm hoping to >>> avoid having the machine grind to a halt as it did >> earlier today. >>> >>> Thanks, >>> joe. >>> >>> >>> >> =================================================== >>> joseph a. 
>>> insley >> >>> insley at mcs.anl.gov >>> mathematics & computer science division >> (630) 252-5649 >>> argonne national laboratory >> (630) >>> 252-5986 (fax) >>> >>> >> >> =================================================== >> joseph a. insley >> >> insley at mcs.anl.gov >> mathematics & computer science division (630) >> 252-5649 >> argonne national laboratory >> (630) >> 252-5986 (fax) >> >> >> > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > From leggett at mcs.anl.gov Sun Feb 3 15:36:57 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Sun, 3 Feb 2008 15:36:57 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> Message-ID: <2D238AED-3D5C-479D-B017-AE8105F5ABA5@mcs.anl.gov> I should say I killed all your processes running on tg-grid1 so your jobs most likely are going to fail. On Feb 3, 2008, at 3:34 PM, Ti Leggett wrote: > Mike, You're killing tg-grid1 again. Can someone work with Mike to > get some swift settings that don't kill our server? > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >> Yes, I'm submitting molecular dynamics simulations >> using Swift. >> >> Is there a default wall-time limit for jobs on tg-uc? >> >> >> >> --- joseph insley wrote: >> >>> Actually, these numbers are now escalating... >>> >>> top - 17:18:54 up 2:29, 1 user, load average: >>> 149.02, 123.63, 91.94 >>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>> stopped, 0 zombie >>> >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>> 479 >>> >>> insley at tg-viz-login1:~> time globusrun -a -r >>> tg-grid.uc.teragrid.org >>> GRAM Authentication test successful >>> real 0m26.134s >>> user 0m0.090s >>> sys 0m0.010s >>> >>> >>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>> >>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>> TG GRAM host) >>>> became unresponsive and had to be rebooted. I am >>> now seeing slow >>>> response times from the Gatekeeper there again. >>> Authenticating to >>>> the gatekeeper should only take a second or two, >>> but it is >>>> periodically taking up to 16 seconds: >>>> >>>> insley at tg-viz-login1:~> time globusrun -a -r >>> tg-grid.uc.teragrid.org >>>> GRAM Authentication test successful >>>> real 0m16.096s >>>> user 0m0.060s >>>> sys 0m0.020s >>>> >>>> looking at the load on tg-grid, it is rather high: >>>> >>>> top - 16:55:26 up 2:06, 1 user, load average: >>> 89.59, 78.69, 62.92 >>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>> stopped, 0 zombie >>>> >>>> And there appear to be a large number of processes >>> owned by kubal: >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>> 380 >>>> >>>> I assume that Mike is using swift to do the job >>> submission. Is >>>> there some throttling of the rate at which jobs >>> are submitted to >>>> the gatekeeper that could be done that would >>> lighten this load >>>> some? (Or has that already been done since >>> earlier today?) The >>>> current response times are not unacceptable, but >>> I'm hoping to >>>> avoid having the machine grind to a halt as it did >>> earlier today. >>>> >>>> Thanks, >>>> joe. >>>> >>>> >>>> >>> =================================================== >>>> joseph a. 
>>>> insley >>> >>>> insley at mcs.anl.gov >>>> mathematics & computer science division >>> (630) 252-5649 >>>> argonne national laboratory >>> (630) >>>> 252-5986 (fax) >>>> >>>> >>> >>> =================================================== >>> joseph a. insley >>> >>> insley at mcs.anl.gov >>> mathematics & computer science division (630) >>> 252-5649 >>> argonne national laboratory >>> (630) >>> 252-5986 (fax) >>> >>> >>> >> >> >> >> >> ____________________________________________________________________________________ >> Be a better friend, newshound, and >> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >> > From hategan at mcs.anl.gov Sun Feb 3 21:09:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 03 Feb 2008 21:09:13 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> Message-ID: <1202094553.13259.4.camel@blabla.mcs.anl.gov> So I was trying some stuff on Friday night. I guess I've found the strategy on when to run the tests: when nobody else has jobs there (besides Buzz doing gridftp tests, Ioan having some Falkon workers running, and the occasional Inca tests). In any event, the machine jumps to about 100% utilization at around 130 jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to 1 in swift.properties. There's still more work I need to do test-wise. On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > Mike, You're killing tg-grid1 again. Can someone work with Mike to get > some swift settings that don't kill our server? > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > Yes, I'm submitting molecular dynamics simulations > > using Swift. > > > > Is there a default wall-time limit for jobs on tg-uc? > > > > > > > > --- joseph insley wrote: > > > >> Actually, these numbers are now escalating... > >> > >> top - 17:18:54 up 2:29, 1 user, load average: > >> 149.02, 123.63, 91.94 > >> Tasks: 469 total, 4 running, 465 sleeping, 0 > >> stopped, 0 zombie > >> > >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >> 479 > >> > >> insley at tg-viz-login1:~> time globusrun -a -r > >> tg-grid.uc.teragrid.org > >> GRAM Authentication test successful > >> real 0m26.134s > >> user 0m0.090s > >> sys 0m0.010s > >> > >> > >> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > >> > >>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL > >> TG GRAM host) > >>> became unresponsive and had to be rebooted. I am > >> now seeing slow > >>> response times from the Gatekeeper there again. > >> Authenticating to > >>> the gatekeeper should only take a second or two, > >> but it is > >>> periodically taking up to 16 seconds: > >>> > >>> insley at tg-viz-login1:~> time globusrun -a -r > >> tg-grid.uc.teragrid.org > >>> GRAM Authentication test successful > >>> real 0m16.096s > >>> user 0m0.060s > >>> sys 0m0.020s > >>> > >>> looking at the load on tg-grid, it is rather high: > >>> > >>> top - 16:55:26 up 2:06, 1 user, load average: > >> 89.59, 78.69, 62.92 > >>> Tasks: 398 total, 20 running, 378 sleeping, 0 > >> stopped, 0 zombie > >>> > >>> And there appear to be a large number of processes > >> owned by kubal: > >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>> 380 > >>> > >>> I assume that Mike is using swift to do the job > >> submission. 
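(To make the throttling advice earlier in this message concrete — this snippet is not part of the original mail. throttle.score.job.factor is the property named in the thread; the other throttle.* names are recalled from Swift's configuration of that era and should be checked against the swift.properties shipped with your release.)

    # Slow the rate at which Swift ramps up concurrent jobs on a site
    throttle.score.job.factor=1

    # Related knobs bounding submission and transfer concurrency (names assumed;
    # verify against etc/swift.properties in the Swift distribution)
    throttle.submit=4
    throttle.host.submit=2
    throttle.transfers=4
    throttle.file.operations=8

With the job factor set to 1, the per-site score should translate into far fewer simultaneous GRAM submissions, which is the point of the request above.
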
Is > >>> there some throttling of the rate at which jobs > >> are submitted to > >>> the gatekeeper that could be done that would > >> lighten this load > >>> some? (Or has that already been done since > >> earlier today?) The > >>> current response times are not unacceptable, but > >> I'm hoping to > >>> avoid having the machine grind to a halt as it did > >> earlier today. > >>> > >>> Thanks, > >>> joe. > >>> > >>> > >>> > >> =================================================== > >>> joseph a. > >>> insley > >> > >>> insley at mcs.anl.gov > >>> mathematics & computer science division > >> (630) 252-5649 > >>> argonne national laboratory > >> (630) > >>> 252-5986 (fax) > >>> > >>> > >> > >> =================================================== > >> joseph a. insley > >> > >> insley at mcs.anl.gov > >> mathematics & computer science division (630) > >> 252-5649 > >> argonne national laboratory > >> (630) > >> 252-5986 (fax) > >> > >> > >> > > > > > > > > > > ____________________________________________________________________________________ > > Be a better friend, newshound, and > > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From foster at mcs.anl.gov Sun Feb 3 21:12:08 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 03 Feb 2008 21:12:08 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202094553.13259.4.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> Message-ID: <47A68288.8060702@mcs.anl.gov> Mihael: Is there any chance you can try GRAM4, as was requested early last week? Ian. Mihael Hategan wrote: > So I was trying some stuff on Friday night. I guess I've found the > strategy on when to run the tests: when nobody else has jobs there > (besides Buzz doing gridftp tests, Ioan having some Falkon workers > running, and the occasional Inca tests). > > In any event, the machine jumps to about 100% utilization at around 130 > jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to > 1 in swift.properties. > > There's still more work I need to do test-wise. > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >> Mike, You're killing tg-grid1 again. Can someone work with Mike to get >> some swift settings that don't kill our server? >> >> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >> >> >>> Yes, I'm submitting molecular dynamics simulations >>> using Swift. >>> >>> Is there a default wall-time limit for jobs on tg-uc? >>> >>> >>> >>> --- joseph insley wrote: >>> >>> >>>> Actually, these numbers are now escalating... >>>> >>>> top - 17:18:54 up 2:29, 1 user, load average: >>>> 149.02, 123.63, 91.94 >>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>> stopped, 0 zombie >>>> >>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>> 479 >>>> >>>> insley at tg-viz-login1:~> time globusrun -a -r >>>> tg-grid.uc.teragrid.org >>>> GRAM Authentication test successful >>>> real 0m26.134s >>>> user 0m0.090s >>>> sys 0m0.010s >>>> >>>> >>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>> >>>> >>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>> >>>> TG GRAM host) >>>> >>>>> became unresponsive and had to be rebooted. 
I am >>>>> >>>> now seeing slow >>>> >>>>> response times from the Gatekeeper there again. >>>>> >>>> Authenticating to >>>> >>>>> the gatekeeper should only take a second or two, >>>>> >>>> but it is >>>> >>>>> periodically taking up to 16 seconds: >>>>> >>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>> >>>> tg-grid.uc.teragrid.org >>>> >>>>> GRAM Authentication test successful >>>>> real 0m16.096s >>>>> user 0m0.060s >>>>> sys 0m0.020s >>>>> >>>>> looking at the load on tg-grid, it is rather high: >>>>> >>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>> >>>> 89.59, 78.69, 62.92 >>>> >>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>> >>>> stopped, 0 zombie >>>> >>>>> And there appear to be a large number of processes >>>>> >>>> owned by kubal: >>>> >>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>> 380 >>>>> >>>>> I assume that Mike is using swift to do the job >>>>> >>>> submission. Is >>>> >>>>> there some throttling of the rate at which jobs >>>>> >>>> are submitted to >>>> >>>>> the gatekeeper that could be done that would >>>>> >>>> lighten this load >>>> >>>>> some? (Or has that already been done since >>>>> >>>> earlier today?) The >>>> >>>>> current response times are not unacceptable, but >>>>> >>>> I'm hoping to >>>> >>>>> avoid having the machine grind to a halt as it did >>>>> >>>> earlier today. >>>> >>>>> Thanks, >>>>> joe. >>>>> >>>>> >>>>> >>>>> >>>> =================================================== >>>> >>>>> joseph a. >>>>> insley >>>>> >>>>> insley at mcs.anl.gov >>>>> mathematics & computer science division >>>>> >>>> (630) 252-5649 >>>> >>>>> argonne national laboratory >>>>> >>>> (630) >>>> >>>>> 252-5986 (fax) >>>>> >>>>> >>>>> >>>> =================================================== >>>> joseph a. insley >>>> >>>> insley at mcs.anl.gov >>>> mathematics & computer science division (630) >>>> 252-5649 >>>> argonne national laboratory >>>> (630) >>>> 252-5986 (fax) >>>> >>>> >>>> >>>> >>> >>> >>> ____________________________________________________________________________________ >>> Be a better friend, newshound, and >>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sun Feb 3 21:16:05 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 03 Feb 2008 21:16:05 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <47A68288.8060702@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> Message-ID: <1202094965.13259.8.camel@blabla.mcs.anl.gov> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: > Mihael: > > Is there any chance you can try GRAM4, as was requested early last > week? For the tests, sure. That's a big part of why I'm doing them. If we're talking about the workflow that seems to be repeatedly killing tg-grid1, then Mike Kubal would be the right person to ask. > > Ian. 
> > Mihael Hategan wrote: > > So I was trying some stuff on Friday night. I guess I've found the > > strategy on when to run the tests: when nobody else has jobs there > > (besides Buzz doing gridftp tests, Ioan having some Falkon workers > > running, and the occasional Inca tests). > > > > In any event, the machine jumps to about 100% utilization at around 130 > > jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to > > 1 in swift.properties. > > > > There's still more work I need to do test-wise. > > > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > > > > > Mike, You're killing tg-grid1 again. Can someone work with Mike to get > > > some swift settings that don't kill our server? > > > > > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > > > > > > > > Yes, I'm submitting molecular dynamics simulations > > > > using Swift. > > > > > > > > Is there a default wall-time limit for jobs on tg-uc? > > > > > > > > > > > > > > > > --- joseph insley wrote: > > > > > > > > > > > > > Actually, these numbers are now escalating... > > > > > > > > > > top - 17:18:54 up 2:29, 1 user, load average: > > > > > 149.02, 123.63, 91.94 > > > > > Tasks: 469 total, 4 running, 465 sleeping, 0 > > > > > stopped, 0 zombie > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > > > 479 > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > tg-grid.uc.teragrid.org > > > > > GRAM Authentication test successful > > > > > real 0m26.134s > > > > > user 0m0.090s > > > > > sys 0m0.010s > > > > > > > > > > > > > > > On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > > > > > > > > > > > > > > > > Earlier today tg-grid.uc.teragrid.org (the UC/ANL > > > > > > > > > > > TG GRAM host) > > > > > > > > > > > became unresponsive and had to be rebooted. I am > > > > > > > > > > > now seeing slow > > > > > > > > > > > response times from the Gatekeeper there again. > > > > > > > > > > > Authenticating to > > > > > > > > > > > the gatekeeper should only take a second or two, > > > > > > > > > > > but it is > > > > > > > > > > > periodically taking up to 16 seconds: > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > > > > > > > tg-grid.uc.teragrid.org > > > > > > > > > > > GRAM Authentication test successful > > > > > > real 0m16.096s > > > > > > user 0m0.060s > > > > > > sys 0m0.020s > > > > > > > > > > > > looking at the load on tg-grid, it is rather high: > > > > > > > > > > > > top - 16:55:26 up 2:06, 1 user, load average: > > > > > > > > > > > 89.59, 78.69, 62.92 > > > > > > > > > > > Tasks: 398 total, 20 running, 378 sleeping, 0 > > > > > > > > > > > stopped, 0 zombie > > > > > > > > > > > And there appear to be a large number of processes > > > > > > > > > > > owned by kubal: > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > > > > 380 > > > > > > > > > > > > I assume that Mike is using swift to do the job > > > > > > > > > > > submission. Is > > > > > > > > > > > there some throttling of the rate at which jobs > > > > > > > > > > > are submitted to > > > > > > > > > > > the gatekeeper that could be done that would > > > > > > > > > > > lighten this load > > > > > > > > > > > some? (Or has that already been done since > > > > > > > > > > > earlier today?) The > > > > > > > > > > > current response times are not unacceptable, but > > > > > > > > > > > I'm hoping to > > > > > > > > > > > avoid having the machine grind to a halt as it did > > > > > > > > > > > earlier today. 
> > > > > > > > > > > Thanks, > > > > > > joe. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > > > > > > > joseph a. > > > > > > insley > > > > > > > > > > > > insley at mcs.anl.gov > > > > > > mathematics & computer science division > > > > > > > > > > > (630) 252-5649 > > > > > > > > > > > argonne national laboratory > > > > > > > > > > > (630) > > > > > > > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > joseph a. insley > > > > > > > > > > insley at mcs.anl.gov > > > > > mathematics & computer science division (630) > > > > > 252-5649 > > > > > argonne national laboratory > > > > > (630) > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Be a better friend, newshound, and > > > > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From foster at mcs.anl.gov Sun Feb 3 21:23:24 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 03 Feb 2008 21:23:24 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202094965.13259.8.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> Message-ID: <47A6852C.9080208@mcs.anl.gov> Mihael: The motivation for doing the tests is so that we can provide appropriate advice to Mike, our super-high-priority Swift user who we want to help as much and as quickly as possible. I'm concerned that we don't seem to feel any sense of urgency in doing this. I'd like to emphasize that the sole reason for anyone funding work on Swift is because they believe us when we say that Swift can help people make more effective use of high-performance computing systems (parallel and grid). Mike K. is our most engaged and committed user, and if he is successful, will bring us fame and fortune (and fun, I think, to provide three Fs!). It shouldn't take a week for us to get back to him with information on how to run his application efficiently on TG. Ian. Mihael Hategan wrote: > On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: > >> Mihael: >> >> Is there any chance you can try GRAM4, as was requested early last >> week? >> > > For the tests, sure. That's a big part of why I'm doing them. > > If we're talking about the workflow that seems to be repeatedly killing > tg-grid1, then Mike Kubal would be the right person to ask. > > >> Ian. >> >> Mihael Hategan wrote: >> >>> So I was trying some stuff on Friday night. I guess I've found the >>> strategy on when to run the tests: when nobody else has jobs there >>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers >>> running, and the occasional Inca tests). >>> >>> In any event, the machine jumps to about 100% utilization at around 130 >>> jobs with pre-ws gram. 
So Mike, please set throttle.score.job.factor to >>> 1 in swift.properties. >>> >>> There's still more work I need to do test-wise. >>> >>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>> >>> >>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get >>>> some swift settings that don't kill our server? >>>> >>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>> >>>> >>>> >>>>> Yes, I'm submitting molecular dynamics simulations >>>>> using Swift. >>>>> >>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>> >>>>> >>>>> >>>>> --- joseph insley wrote: >>>>> >>>>> >>>>> >>>>>> Actually, these numbers are now escalating... >>>>>> >>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>> 149.02, 123.63, 91.94 >>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>> stopped, 0 zombie >>>>>> >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>> 479 >>>>>> >>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>> tg-grid.uc.teragrid.org >>>>>> GRAM Authentication test successful >>>>>> real 0m26.134s >>>>>> user 0m0.090s >>>>>> sys 0m0.010s >>>>>> >>>>>> >>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>>> >>>>>>> >>>>>> TG GRAM host) >>>>>> >>>>>> >>>>>>> became unresponsive and had to be rebooted. I am >>>>>>> >>>>>>> >>>>>> now seeing slow >>>>>> >>>>>> >>>>>>> response times from the Gatekeeper there again. >>>>>>> >>>>>>> >>>>>> Authenticating to >>>>>> >>>>>> >>>>>>> the gatekeeper should only take a second or two, >>>>>>> >>>>>>> >>>>>> but it is >>>>>> >>>>>> >>>>>>> periodically taking up to 16 seconds: >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> >>>>>>> >>>>>> tg-grid.uc.teragrid.org >>>>>> >>>>>> >>>>>>> GRAM Authentication test successful >>>>>>> real 0m16.096s >>>>>>> user 0m0.060s >>>>>>> sys 0m0.020s >>>>>>> >>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>> >>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>>> >>>>>>> >>>>>> 89.59, 78.69, 62.92 >>>>>> >>>>>> >>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>>> >>>>>>> >>>>>> stopped, 0 zombie >>>>>> >>>>>> >>>>>>> And there appear to be a large number of processes >>>>>>> >>>>>>> >>>>>> owned by kubal: >>>>>> >>>>>> >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 380 >>>>>>> >>>>>>> I assume that Mike is using swift to do the job >>>>>>> >>>>>>> >>>>>> submission. Is >>>>>> >>>>>> >>>>>>> there some throttling of the rate at which jobs >>>>>>> >>>>>>> >>>>>> are submitted to >>>>>> >>>>>> >>>>>>> the gatekeeper that could be done that would >>>>>>> >>>>>>> >>>>>> lighten this load >>>>>> >>>>>> >>>>>>> some? (Or has that already been done since >>>>>>> >>>>>>> >>>>>> earlier today?) The >>>>>> >>>>>> >>>>>>> current response times are not unacceptable, but >>>>>>> >>>>>>> >>>>>> I'm hoping to >>>>>> >>>>>> >>>>>>> avoid having the machine grind to a halt as it did >>>>>>> >>>>>>> >>>>>> earlier today. >>>>>> >>>>>> >>>>>>> Thanks, >>>>>>> joe. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> =================================================== >>>>>> >>>>>> >>>>>>> joseph a. 
>>>>>>> insley >>>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division >>>>>>> >>>>>>> >>>>>> (630) 252-5649 >>>>>> >>>>>> >>>>>>> argonne national laboratory >>>>>>> >>>>>>> >>>>>> (630) >>>>>> >>>>>> >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> =================================================== >>>>>> joseph a. insley >>>>>> >>>>>> insley at mcs.anl.gov >>>>>> mathematics & computer science division (630) >>>>>> 252-5649 >>>>>> argonne national laboratory >>>>>> (630) >>>>>> 252-5986 (fax) >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> ____________________________________________________________________________________ >>>>> Be a better friend, newshound, and >>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>> >>>>> >>>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>>> >>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >>> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sun Feb 3 21:53:51 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 03 Feb 2008 21:53:51 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <47A6852C.9080208@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> <47A6852C.9080208@mcs.anl.gov> Message-ID: <1202097231.13666.21.camel@blabla.mcs.anl.gov> If you want to prioritize things differently, then please do so from the beginning instead of pointing out the priorities were wrong after a while. So please stop doing this. It is frustrating and it is not what I signed up for. Mihael On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote: > Mihael: > > The motivation for doing the tests is so that we can provide > appropriate advice to Mike, our super-high-priority Swift user who we > want to help as much and as quickly as possible. I'm concerned that we > don't seem to feel any sense of urgency in doing this. I'd like to > emphasize that the sole reason for anyone funding work on Swift is > because they believe us when we say that Swift can help people make > more effective use of high-performance computing systems (parallel and > grid). Mike K. is our most engaged and committed user, and if he is > successful, will bring us fame and fortune (and fun, I think, to > provide three Fs!). It shouldn't take a week for us to get back to him > with information on how to run his application efficiently on TG. > > Ian. > > Mihael Hategan wrote: > > On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: > > > > > Mihael: > > > > > > Is there any chance you can try GRAM4, as was requested early last > > > week? > > > > > > > For the tests, sure. That's a big part of why I'm doing them. > > > > If we're talking about the workflow that seems to be repeatedly killing > > tg-grid1, then Mike Kubal would be the right person to ask. > > > > > > > Ian. > > > > > > Mihael Hategan wrote: > > > > > > > So I was trying some stuff on Friday night. 
I guess I've found the > > > > strategy on when to run the tests: when nobody else has jobs there > > > > (besides Buzz doing gridftp tests, Ioan having some Falkon workers > > > > running, and the occasional Inca tests). > > > > > > > > In any event, the machine jumps to about 100% utilization at around 130 > > > > jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to > > > > 1 in swift.properties. > > > > > > > > There's still more work I need to do test-wise. > > > > > > > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > > > > > > > > > > > > > Mike, You're killing tg-grid1 again. Can someone work with Mike to get > > > > > some swift settings that don't kill our server? > > > > > > > > > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > > > > > > > > > > > > > > > > > > > Yes, I'm submitting molecular dynamics simulations > > > > > > using Swift. > > > > > > > > > > > > Is there a default wall-time limit for jobs on tg-uc? > > > > > > > > > > > > > > > > > > > > > > > > --- joseph insley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Actually, these numbers are now escalating... > > > > > > > > > > > > > > top - 17:18:54 up 2:29, 1 user, load average: > > > > > > > 149.02, 123.63, 91.94 > > > > > > > Tasks: 469 total, 4 running, 465 sleeping, 0 > > > > > > > stopped, 0 zombie > > > > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > > > > > 479 > > > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > > > tg-grid.uc.teragrid.org > > > > > > > GRAM Authentication test successful > > > > > > > real 0m26.134s > > > > > > > user 0m0.090s > > > > > > > sys 0m0.010s > > > > > > > > > > > > > > > > > > > > > On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Earlier today tg-grid.uc.teragrid.org (the UC/ANL > > > > > > > > > > > > > > > > > > > > > > > TG GRAM host) > > > > > > > > > > > > > > > > > > > > > > became unresponsive and had to be rebooted. I am > > > > > > > > > > > > > > > > > > > > > > > now seeing slow > > > > > > > > > > > > > > > > > > > > > > response times from the Gatekeeper there again. 
> > > > > > > > > > > > > > > > > > > > > > > Authenticating to > > > > > > > > > > > > > > > > > > > > > > the gatekeeper should only take a second or two, > > > > > > > > > > > > > > > > > > > > > > > but it is > > > > > > > > > > > > > > > > > > > > > > periodically taking up to 16 seconds: > > > > > > > > > > > > > > > > insley at tg-viz-login1:~> time globusrun -a -r > > > > > > > > > > > > > > > > > > > > > > > tg-grid.uc.teragrid.org > > > > > > > > > > > > > > > > > > > > > > GRAM Authentication test successful > > > > > > > > real 0m16.096s > > > > > > > > user 0m0.060s > > > > > > > > sys 0m0.020s > > > > > > > > > > > > > > > > looking at the load on tg-grid, it is rather high: > > > > > > > > > > > > > > > > top - 16:55:26 up 2:06, 1 user, load average: > > > > > > > > > > > > > > > > > > > > > > > 89.59, 78.69, 62.92 > > > > > > > > > > > > > > > > > > > > > > Tasks: 398 total, 20 running, 378 sleeping, 0 > > > > > > > > > > > > > > > > > > > > > > > stopped, 0 zombie > > > > > > > > > > > > > > > > > > > > > > And there appear to be a large number of processes > > > > > > > > > > > > > > > > > > > > > > > owned by kubal: > > > > > > > > > > > > > > > > > > > > > > insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > > > > > > 380 > > > > > > > > > > > > > > > > I assume that Mike is using swift to do the job > > > > > > > > > > > > > > > > > > > > > > > submission. Is > > > > > > > > > > > > > > > > > > > > > > there some throttling of the rate at which jobs > > > > > > > > > > > > > > > > > > > > > > > are submitted to > > > > > > > > > > > > > > > > > > > > > > the gatekeeper that could be done that would > > > > > > > > > > > > > > > > > > > > > > > lighten this load > > > > > > > > > > > > > > > > > > > > > > some? (Or has that already been done since > > > > > > > > > > > > > > > > > > > > > > > earlier today?) The > > > > > > > > > > > > > > > > > > > > > > current response times are not unacceptable, but > > > > > > > > > > > > > > > > > > > > > > > I'm hoping to > > > > > > > > > > > > > > > > > > > > > > avoid having the machine grind to a halt as it did > > > > > > > > > > > > > > > > > > > > > > > earlier today. > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > joe. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > > > > > > > > > > > > > > > > > > joseph a. > > > > > > > > insley > > > > > > > > > > > > > > > > insley at mcs.anl.gov > > > > > > > > mathematics & computer science division > > > > > > > > > > > > > > > > > > > > > > > (630) 252-5649 > > > > > > > > > > > > > > > > > > > > > > argonne national laboratory > > > > > > > > > > > > > > > > > > > > > > > (630) > > > > > > > > > > > > > > > > > > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > =================================================== > > > > > > > joseph a. insley > > > > > > > > > > > > > > insley at mcs.anl.gov > > > > > > > mathematics & computer science division (630) > > > > > > > 252-5649 > > > > > > > argonne national laboratory > > > > > > > (630) > > > > > > > 252-5986 (fax) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > > > Be a better friend, newshound, and > > > > > > know-it-all with Yahoo! Mobile. Try it now. 
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > From wilde at mcs.anl.gov Sun Feb 3 22:02:02 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 03 Feb 2008 22:02:02 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202097231.13666.21.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> <47A6852C.9080208@mcs.anl.gov> <1202097231.13666.21.camel@blabla.mcs.anl.gov> Message-ID: <47A68E3A.1090603@mcs.anl.gov> Ian, Mihael, confusion on the priorities is my fault, and I'll work to fix that. - Mike On 2/3/08 9:53 PM, Mihael Hategan wrote: > If you want to prioritize things differently, then please do so from the > beginning instead of pointing out the priorities were wrong after a > while. So please stop doing this. It is frustrating and it is not what I > signed up for. > > Mihael > > On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote: >> Mihael: >> >> The motivation for doing the tests is so that we can provide >> appropriate advice to Mike, our super-high-priority Swift user who we >> want to help as much and as quickly as possible. I'm concerned that we >> don't seem to feel any sense of urgency in doing this. I'd like to >> emphasize that the sole reason for anyone funding work on Swift is >> because they believe us when we say that Swift can help people make >> more effective use of high-performance computing systems (parallel and >> grid). Mike K. is our most engaged and committed user, and if he is >> successful, will bring us fame and fortune (and fun, I think, to >> provide three Fs!). It shouldn't take a week for us to get back to him >> with information on how to run his application efficiently on TG. >> >> Ian. >> >> Mihael Hategan wrote: >>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: >>> >>>> Mihael: >>>> >>>> Is there any chance you can try GRAM4, as was requested early last >>>> week? >>>> >>> For the tests, sure. That's a big part of why I'm doing them. >>> >>> If we're talking about the workflow that seems to be repeatedly killing >>> tg-grid1, then Mike Kubal would be the right person to ask. >>> >>> >>>> Ian. >>>> >>>> Mihael Hategan wrote: >>>> >>>>> So I was trying some stuff on Friday night. I guess I've found the >>>>> strategy on when to run the tests: when nobody else has jobs there >>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers >>>>> running, and the occasional Inca tests). >>>>> >>>>> In any event, the machine jumps to about 100% utilization at around 130 >>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to >>>>> 1 in swift.properties. >>>>> >>>>> There's still more work I need to do test-wise. >>>>> >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>> >>>>> >>>>>> Mike, You're killing tg-grid1 again. 
Can someone work with Mike to get >>>>>> some swift settings that don't kill our server? >>>>>> >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Yes, I'm submitting molecular dynamics simulations >>>>>>> using Swift. >>>>>>> >>>>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- joseph insley wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Actually, these numbers are now escalating... >>>>>>>> >>>>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>>>> 149.02, 123.63, 91.94 >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 479 >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m26.134s >>>>>>>> user 0m0.090s >>>>>>>> sys 0m0.010s >>>>>>>> >>>>>>>> >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>>>>> >>>>>>>>> >>>>>>>> TG GRAM host) >>>>>>>> >>>>>>>> >>>>>>>>> became unresponsive and had to be rebooted. I am >>>>>>>>> >>>>>>>>> >>>>>>>> now seeing slow >>>>>>>> >>>>>>>> >>>>>>>>> response times from the Gatekeeper there again. >>>>>>>>> >>>>>>>>> >>>>>>>> Authenticating to >>>>>>>> >>>>>>>> >>>>>>>>> the gatekeeper should only take a second or two, >>>>>>>>> >>>>>>>>> >>>>>>>> but it is >>>>>>>> >>>>>>>> >>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>> >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>> >>>>>>>>> >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> >>>>>>>> >>>>>>>>> GRAM Authentication test successful >>>>>>>>> real 0m16.096s >>>>>>>>> user 0m0.060s >>>>>>>>> sys 0m0.020s >>>>>>>>> >>>>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>>>> >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>>>>> >>>>>>>>> >>>>>>>> 89.59, 78.69, 62.92 >>>>>>>> >>>>>>>> >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>>>>> >>>>>>>>> >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> >>>>>>>>> And there appear to be a large number of processes >>>>>>>>> >>>>>>>>> >>>>>>>> owned by kubal: >>>>>>>> >>>>>>>> >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>> 380 >>>>>>>>> >>>>>>>>> I assume that Mike is using swift to do the job >>>>>>>>> >>>>>>>>> >>>>>>>> submission. Is >>>>>>>> >>>>>>>> >>>>>>>>> there some throttling of the rate at which jobs >>>>>>>>> >>>>>>>>> >>>>>>>> are submitted to >>>>>>>> >>>>>>>> >>>>>>>>> the gatekeeper that could be done that would >>>>>>>>> >>>>>>>>> >>>>>>>> lighten this load >>>>>>>> >>>>>>>> >>>>>>>>> some? (Or has that already been done since >>>>>>>>> >>>>>>>>> >>>>>>>> earlier today?) The >>>>>>>> >>>>>>>> >>>>>>>>> current response times are not unacceptable, but >>>>>>>>> >>>>>>>>> >>>>>>>> I'm hoping to >>>>>>>> >>>>>>>> >>>>>>>>> avoid having the machine grind to a halt as it did >>>>>>>>> >>>>>>>>> >>>>>>>> earlier today. >>>>>>>> >>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> joe. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> =================================================== >>>>>>>> >>>>>>>> >>>>>>>>> joseph a. 
>>>>>>>>> insley >>>>>>>>> >>>>>>>>> insley at mcs.anl.gov >>>>>>>>> mathematics & computer science division >>>>>>>>> >>>>>>>>> >>>>>>>> (630) 252-5649 >>>>>>>> >>>>>>>> >>>>>>>>> argonne national laboratory >>>>>>>>> >>>>>>>>> >>>>>>>> (630) >>>>>>>> >>>>>>>> >>>>>>>>> 252-5986 (fax) >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> =================================================== >>>>>>>> joseph a. insley >>>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division (630) >>>>>>>> 252-5649 >>>>>>>> argonne national laboratory >>>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> ____________________________________________________________________________________ >>>>>>> Be a better friend, newshound, and >>>>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>>> >>>>> >>> > > From foster at mcs.anl.gov Sun Feb 3 22:05:03 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Sun, 03 Feb 2008 22:05:03 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202097231.13666.21.camel@blabla.mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> <47A6852C.9080208@mcs.anl.gov> <1202097231.13666.21.camel@blabla.mcs.anl.gov> Message-ID: <47A68EEF.50804@mcs.anl.gov> Mihael: The point of my mail was to express what I think our priorities should be. It would be useful to have a discussion of what our priorities are, and how they differ from what I think they should be. But probably we shouldn't do that via email. Ian. Mihael Hategan wrote: > If you want to prioritize things differently, then please do so from the > beginning instead of pointing out the priorities were wrong after a > while. So please stop doing this. It is frustrating and it is not what I > signed up for. > > Mihael > > On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote: > >> Mihael: >> >> The motivation for doing the tests is so that we can provide >> appropriate advice to Mike, our super-high-priority Swift user who we >> want to help as much and as quickly as possible. I'm concerned that we >> don't seem to feel any sense of urgency in doing this. I'd like to >> emphasize that the sole reason for anyone funding work on Swift is >> because they believe us when we say that Swift can help people make >> more effective use of high-performance computing systems (parallel and >> grid). Mike K. is our most engaged and committed user, and if he is >> successful, will bring us fame and fortune (and fun, I think, to >> provide three Fs!). It shouldn't take a week for us to get back to him >> with information on how to run his application efficiently on TG. >> >> Ian. >> >> Mihael Hategan wrote: >> >>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: >>> >>> >>>> Mihael: >>>> >>>> Is there any chance you can try GRAM4, as was requested early last >>>> week? >>>> >>>> >>> For the tests, sure. 
That's a big part of why I'm doing them. >>> >>> If we're talking about the workflow that seems to be repeatedly killing >>> tg-grid1, then Mike Kubal would be the right person to ask. >>> >>> >>> >>>> Ian. >>>> >>>> Mihael Hategan wrote: >>>> >>>> >>>>> So I was trying some stuff on Friday night. I guess I've found the >>>>> strategy on when to run the tests: when nobody else has jobs there >>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers >>>>> running, and the occasional Inca tests). >>>>> >>>>> In any event, the machine jumps to about 100% utilization at around 130 >>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to >>>>> 1 in swift.properties. >>>>> >>>>> There's still more work I need to do test-wise. >>>>> >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>> >>>>> >>>>> >>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get >>>>>> some swift settings that don't kill our server? >>>>>> >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Yes, I'm submitting molecular dynamics simulations >>>>>>> using Swift. >>>>>>> >>>>>>> Is there a default wall-time limit for jobs on tg-uc? >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- joseph insley wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Actually, these numbers are now escalating... >>>>>>>> >>>>>>>> top - 17:18:54 up 2:29, 1 user, load average: >>>>>>>> 149.02, 123.63, 91.94 >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 479 >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m26.134s >>>>>>>> user 0m0.090s >>>>>>>> sys 0m0.010s >>>>>>>> >>>>>>>> >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> TG GRAM host) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> became unresponsive and had to be rebooted. I am >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> now seeing slow >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> response times from the Gatekeeper there again. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> Authenticating to >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> the gatekeeper should only take a second or two, >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> but it is >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>> >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> GRAM Authentication test successful >>>>>>>>> real 0m16.096s >>>>>>>>> user 0m0.060s >>>>>>>>> sys 0m0.020s >>>>>>>>> >>>>>>>>> looking at the load on tg-grid, it is rather high: >>>>>>>>> >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> 89.59, 78.69, 62.92 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> And there appear to be a large number of processes >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> owned by kubal: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>> 380 >>>>>>>>> >>>>>>>>> I assume that Mike is using swift to do the job >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> submission. 
Is >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> there some throttling of the rate at which jobs >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> are submitted to >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> the gatekeeper that could be done that would >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> lighten this load >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> some? (Or has that already been done since >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> earlier today?) The >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> current response times are not unacceptable, but >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> I'm hoping to >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> avoid having the machine grind to a halt as it did >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> earlier today. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> joe. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> =================================================== >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> joseph a. >>>>>>>>> insley >>>>>>>>> >>>>>>>>> insley at mcs.anl.gov >>>>>>>>> mathematics & computer science division >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> (630) 252-5649 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> argonne national laboratory >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> (630) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> 252-5986 (fax) >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> =================================================== >>>>>>>> joseph a. insley >>>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division (630) >>>>>>>> 252-5649 >>>>>>>> argonne national laboratory >>>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> ____________________________________________________________________________________ >>>>>>> Be a better friend, newshound, and >>>>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>>>>> >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>>> >>>>> >>>>> >>> >>> > > From hategan at mcs.anl.gov Sun Feb 3 22:39:05 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 03 Feb 2008 22:39:05 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <47A68EEF.50804@mcs.anl.gov> References: <921658.18899.qm@web52308.mail.re2.yahoo.com> <50432C18-E863-4E5F-B108-A5AF57AD45D2@mcs.anl.gov> <1202094553.13259.4.camel@blabla.mcs.anl.gov> <47A68288.8060702@mcs.anl.gov> <1202094965.13259.8.camel@blabla.mcs.anl.gov> <47A6852C.9080208@mcs.anl.gov> <1202097231.13666.21.camel@blabla.mcs.anl.gov> <47A68EEF.50804@mcs.anl.gov> Message-ID: <1202099945.14375.22.camel@blabla.mcs.anl.gov> We cannot define priorities about things we don't know. This management by crisis (i.e. every new thing is of utmost priority, and maybe some older things that used to be of utmost priority may or may not still be of utmost priority) doesn't seem to work well. Add to that the implications that x didn't do things right (so that we make it slightly personal), and you've got a recipe for things not working well at all. Repeat this a few times, and even the most resilient of people will begin having second thoughts. 
And the reaction to things one cannot control are not those of fight but those of flight. Now, onto the problem. The tests are no easy thing. I need time to find the right settings, the right ways to do it, and the right times to do it (the process involves getting that machine close to the point of crashing). And then some way to transform some seemingly garbage like log files into something meaningful. So no, it's not a one day job. In the mean time, Mike was informed about what we believe might be better ways to make things work (throttling parameters, trying ws-gram, local PBS). Mihael On Sun, 2008-02-03 at 22:05 -0600, Ian Foster wrote: > Mihael: > > The point of my mail was to express what I think our priorities should be. > > It would be useful to have a discussion of what our priorities are, and > how they differ from what I think they should be. But probably we > shouldn't do that via email. > > Ian. > > Mihael Hategan wrote: > > If you want to prioritize things differently, then please do so from the > > beginning instead of pointing out the priorities were wrong after a > > while. So please stop doing this. It is frustrating and it is not what I > > signed up for. > > > > Mihael > > > > On Sun, 2008-02-03 at 21:23 -0600, Ian Foster wrote: > > > >> Mihael: > >> > >> The motivation for doing the tests is so that we can provide > >> appropriate advice to Mike, our super-high-priority Swift user who we > >> want to help as much and as quickly as possible. I'm concerned that we > >> don't seem to feel any sense of urgency in doing this. I'd like to > >> emphasize that the sole reason for anyone funding work on Swift is > >> because they believe us when we say that Swift can help people make > >> more effective use of high-performance computing systems (parallel and > >> grid). Mike K. is our most engaged and committed user, and if he is > >> successful, will bring us fame and fortune (and fun, I think, to > >> provide three Fs!). It shouldn't take a week for us to get back to him > >> with information on how to run his application efficiently on TG. > >> > >> Ian. > >> > >> Mihael Hategan wrote: > >> > >>> On Sun, 2008-02-03 at 21:12 -0600, Ian Foster wrote: > >>> > >>> > >>>> Mihael: > >>>> > >>>> Is there any chance you can try GRAM4, as was requested early last > >>>> week? > >>>> > >>>> > >>> For the tests, sure. That's a big part of why I'm doing them. > >>> > >>> If we're talking about the workflow that seems to be repeatedly killing > >>> tg-grid1, then Mike Kubal would be the right person to ask. > >>> > >>> > >>> > >>>> Ian. > >>>> > >>>> Mihael Hategan wrote: > >>>> > >>>> > >>>>> So I was trying some stuff on Friday night. I guess I've found the > >>>>> strategy on when to run the tests: when nobody else has jobs there > >>>>> (besides Buzz doing gridftp tests, Ioan having some Falkon workers > >>>>> running, and the occasional Inca tests). > >>>>> > >>>>> In any event, the machine jumps to about 100% utilization at around 130 > >>>>> jobs with pre-ws gram. So Mike, please set throttle.score.job.factor to > >>>>> 1 in swift.properties. > >>>>> > >>>>> There's still more work I need to do test-wise. > >>>>> > >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>> > >>>>> > >>>>> > >>>>>> Mike, You're killing tg-grid1 again. Can someone work with Mike to get > >>>>>> some swift settings that don't kill our server? 
> >>>>>> > >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>>> Yes, I'm submitting molecular dynamics simulations > >>>>>>> using Swift. > >>>>>>> > >>>>>>> Is there a default wall-time limit for jobs on tg-uc? > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> --- joseph insley wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> Actually, these numbers are now escalating... > >>>>>>>> > >>>>>>>> top - 17:18:54 up 2:29, 1 user, load average: > >>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, 0 > >>>>>>>> stopped, 0 zombie > >>>>>>>> > >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>> 479 > >>>>>>>> > >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>> GRAM Authentication test successful > >>>>>>>> real 0m26.134s > >>>>>>>> user 0m0.090s > >>>>>>>> sys 0m0.010s > >>>>>>>> > >>>>>>>> > >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the UC/ANL > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> TG GRAM host) > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> became unresponsive and had to be rebooted. I am > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> now seeing slow > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> response times from the Gatekeeper there again. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> Authenticating to > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> the gatekeeper should only take a second or two, > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> but it is > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>> > >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> GRAM Authentication test successful > >>>>>>>>> real 0m16.096s > >>>>>>>>> user 0m0.060s > >>>>>>>>> sys 0m0.020s > >>>>>>>>> > >>>>>>>>> looking at the load on tg-grid, it is rather high: > >>>>>>>>> > >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load average: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, 0 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> stopped, 0 zombie > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> And there appear to be a large number of processes > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> owned by kubal: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>> 380 > >>>>>>>>> > >>>>>>>>> I assume that Mike is using swift to do the job > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> submission. Is > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> there some throttling of the rate at which jobs > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> are submitted to > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> the gatekeeper that could be done that would > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> lighten this load > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> some? (Or has that already been done since > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> earlier today?) 
The > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> current response times are not unacceptable, but > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> I'm hoping to > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> avoid having the machine grind to a halt as it did > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> earlier today. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> joe. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> =================================================== > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> joseph a. > >>>>>>>>> insley > >>>>>>>>> > >>>>>>>>> insley at mcs.anl.gov > >>>>>>>>> mathematics & computer science division > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> (630) 252-5649 > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> argonne national laboratory > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> (630) > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>> 252-5986 (fax) > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> =================================================== > >>>>>>>> joseph a. insley > >>>>>>>> > >>>>>>>> insley at mcs.anl.gov > >>>>>>>> mathematics & computer science division (630) > >>>>>>>> 252-5649 > >>>>>>>> argonne national laboratory > >>>>>>>> (630) > >>>>>>>> 252-5986 (fax) > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> ____________________________________________________________________________________ > >>>>>>> Be a better friend, newshound, and > >>>>>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-devel mailing list > >>>>>> Swift-devel at ci.uchicago.edu > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>> > >>>>> > >>>>> > >>>>> > >>> > >>> > > > > > From mikekubal at yahoo.com Mon Feb 4 00:11:34 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Sun, 3 Feb 2008 22:11:34 -0800 (PST) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202094553.13259.4.camel@blabla.mcs.anl.gov> Message-ID: <548830.35963.qm@web52311.mail.re2.yahoo.com> Sorry for killing the server. I'm pushing to get results to guide the selection of compounds for wet-lab testing. I had set the throttle.score.job.factor to 1 in the swift.properties file. I certainly appreciate everyone's efforts and responsiveness. Let me know what to try next, before I kill again. Cheers, Mike --- Mihael Hategan wrote: > So I was trying some stuff on Friday night. I guess > I've found the > strategy on when to run the tests: when nobody else > has jobs there > (besides Buzz doing gridftp tests, Ioan having some > Falkon workers > running, and the occasional Inca tests). > > In any event, the machine jumps to about 100% > utilization at around 130 > jobs with pre-ws gram. So Mike, please set > throttle.score.job.factor to > 1 in swift.properties. > > There's still more work I need to do test-wise. > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > > Mike, You're killing tg-grid1 again. Can someone > work with Mike to get > > some swift settings that don't kill our server? 
> > > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > > > Yes, I'm submitting molecular dynamics > simulations > > > using Swift. > > > > > > Is there a default wall-time limit for jobs on > tg-uc? > > > > > > > > > > > > --- joseph insley wrote: > > > > > >> Actually, these numbers are now escalating... > > >> > > >> top - 17:18:54 up 2:29, 1 user, load > average: > > >> 149.02, 123.63, 91.94 > > >> Tasks: 469 total, 4 running, 465 sleeping, > 0 > > >> stopped, 0 zombie > > >> > > >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > >> 479 > > >> > > >> insley at tg-viz-login1:~> time globusrun -a -r > > >> tg-grid.uc.teragrid.org > > >> GRAM Authentication test successful > > >> real 0m26.134s > > >> user 0m0.090s > > >> sys 0m0.010s > > >> > > >> > > >> On Jan 28, 2008, at 5:15 PM, joseph insley > wrote: > > >> > > >>> Earlier today tg-grid.uc.teragrid.org (the > UC/ANL > > >> TG GRAM host) > > >>> became unresponsive and had to be rebooted. I > am > > >> now seeing slow > > >>> response times from the Gatekeeper there > again. > > >> Authenticating to > > >>> the gatekeeper should only take a second or > two, > > >> but it is > > >>> periodically taking up to 16 seconds: > > >>> > > >>> insley at tg-viz-login1:~> time globusrun -a -r > > >> tg-grid.uc.teragrid.org > > >>> GRAM Authentication test successful > > >>> real 0m16.096s > > >>> user 0m0.060s > > >>> sys 0m0.020s > > >>> > > >>> looking at the load on tg-grid, it is rather > high: > > >>> > > >>> top - 16:55:26 up 2:06, 1 user, load > average: > > >> 89.59, 78.69, 62.92 > > >>> Tasks: 398 total, 20 running, 378 sleeping, > 0 > > >> stopped, 0 zombie > > >>> > > >>> And there appear to be a large number of > processes > > >> owned by kubal: > > >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > >>> 380 > > >>> > > >>> I assume that Mike is using swift to do the > job > > >> submission. Is > > >>> there some throttling of the rate at which > jobs > > >> are submitted to > > >>> the gatekeeper that could be done that would > > >> lighten this load > > >>> some? (Or has that already been done since > > >> earlier today?) The > > >>> current response times are not unacceptable, > but > > >> I'm hoping to > > >>> avoid having the machine grind to a halt as it > did > > >> earlier today. > > >>> > > >>> Thanks, > > >>> joe. > > >>> > > >>> > > >>> > > >> > =================================================== > > >>> joseph a. > > >>> insley > > >> > > >>> insley at mcs.anl.gov > > >>> mathematics & computer science division > > >> (630) 252-5649 > > >>> argonne national laboratory > > >> (630) > > >>> 252-5986 (fax) > > >>> > > >>> > > >> > > >> > =================================================== > > >> joseph a. insley > > >> > > >> insley at mcs.anl.gov > > >> mathematics & computer science division > (630) > > >> 252-5649 > > >> argonne national laboratory > > >> (630) > > >> 252-5986 (fax) > > >> > > >> > > >> > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Be a better friend, newshound, and > > > know-it-all with Yahoo! Mobile. Try it now. 
> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From hategan at mcs.anl.gov Mon Feb 4 00:14:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 00:14:09 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <548830.35963.qm@web52311.mail.re2.yahoo.com> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: <1202105649.15397.46.camel@blabla.mcs.anl.gov> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > Sorry for killing the server. I'm pushing to get > results to guide the selection of compounds for > wet-lab testing. > > I had set the throttle.score.job.factor to 1 in the > swift.properties file. Hmm. Ti, at the time of the massacre, how many did you kill? Mihael > > I certainly appreciate everyone's efforts and > responsiveness. > > Let me know what to try next, before I kill again. > > Cheers, > > Mike > > > > --- Mihael Hategan wrote: > > > So I was trying some stuff on Friday night. I guess > > I've found the > > strategy on when to run the tests: when nobody else > > has jobs there > > (besides Buzz doing gridftp tests, Ioan having some > > Falkon workers > > running, and the occasional Inca tests). > > > > In any event, the machine jumps to about 100% > > utilization at around 130 > > jobs with pre-ws gram. So Mike, please set > > throttle.score.job.factor to > > 1 in swift.properties. > > > > There's still more work I need to do test-wise. > > > > On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > > > Mike, You're killing tg-grid1 again. Can someone > > work with Mike to get > > > some swift settings that don't kill our server? > > > > > > On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > > > > > > > Yes, I'm submitting molecular dynamics > > simulations > > > > using Swift. > > > > > > > > Is there a default wall-time limit for jobs on > > tg-uc? > > > > > > > > > > > > > > > > --- joseph insley wrote: > > > > > > > >> Actually, these numbers are now escalating... > > > >> > > > >> top - 17:18:54 up 2:29, 1 user, load > > average: > > > >> 149.02, 123.63, 91.94 > > > >> Tasks: 469 total, 4 running, 465 sleeping, > > 0 > > > >> stopped, 0 zombie > > > >> > > > >> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > >> 479 > > > >> > > > >> insley at tg-viz-login1:~> time globusrun -a -r > > > >> tg-grid.uc.teragrid.org > > > >> GRAM Authentication test successful > > > >> real 0m26.134s > > > >> user 0m0.090s > > > >> sys 0m0.010s > > > >> > > > >> > > > >> On Jan 28, 2008, at 5:15 PM, joseph insley > > wrote: > > > >> > > > >>> Earlier today tg-grid.uc.teragrid.org (the > > UC/ANL > > > >> TG GRAM host) > > > >>> became unresponsive and had to be rebooted. I > > am > > > >> now seeing slow > > > >>> response times from the Gatekeeper there > > again. 
> > > >> Authenticating to > > > >>> the gatekeeper should only take a second or > > two, > > > >> but it is > > > >>> periodically taking up to 16 seconds: > > > >>> > > > >>> insley at tg-viz-login1:~> time globusrun -a -r > > > >> tg-grid.uc.teragrid.org > > > >>> GRAM Authentication test successful > > > >>> real 0m16.096s > > > >>> user 0m0.060s > > > >>> sys 0m0.020s > > > >>> > > > >>> looking at the load on tg-grid, it is rather > > high: > > > >>> > > > >>> top - 16:55:26 up 2:06, 1 user, load > > average: > > > >> 89.59, 78.69, 62.92 > > > >>> Tasks: 398 total, 20 running, 378 sleeping, > > 0 > > > >> stopped, 0 zombie > > > >>> > > > >>> And there appear to be a large number of > > processes > > > >> owned by kubal: > > > >>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > > > >>> 380 > > > >>> > > > >>> I assume that Mike is using swift to do the > > job > > > >> submission. Is > > > >>> there some throttling of the rate at which > > jobs > > > >> are submitted to > > > >>> the gatekeeper that could be done that would > > > >> lighten this load > > > >>> some? (Or has that already been done since > > > >> earlier today?) The > > > >>> current response times are not unacceptable, > > but > > > >> I'm hoping to > > > >>> avoid having the machine grind to a halt as it > > did > > > >> earlier today. > > > >>> > > > >>> Thanks, > > > >>> joe. > > > >>> > > > >>> > > > >>> > > > >> > > =================================================== > > > >>> joseph a. > > > >>> insley > > > >> > > > >>> insley at mcs.anl.gov > > > >>> mathematics & computer science division > > > >> (630) 252-5649 > > > >>> argonne national laboratory > > > >> (630) > > > >>> 252-5986 (fax) > > > >>> > > > >>> > > > >> > > > >> > > =================================================== > > > >> joseph a. insley > > > >> > > > >> insley at mcs.anl.gov > > > >> mathematics & computer science division > > (630) > > > >> 252-5649 > > > >> argonne national laboratory > > > >> (630) > > > >> 252-5986 (fax) > > > >> > > > >> > > > >> > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Be a better friend, newshound, and > > > > know-it-all with Yahoo! Mobile. Try it now. > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > From leggett at mcs.anl.gov Mon Feb 4 07:16:38 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 4 Feb 2008 07:16:38 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202105649.15397.46.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> Message-ID: <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> Around 80. On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > > On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >> Sorry for killing the server. 
I'm pushing to get >> results to guide the selection of compounds for >> wet-lab testing. >> >> I had set the throttle.score.job.factor to 1 in the >> swift.properties file. > > Hmm. Ti, at the time of the massacre, how many did you kill? > > Mihael > >> >> I certainly appreciate everyone's efforts and >> responsiveness. >> >> Let me know what to try next, before I kill again. >> >> Cheers, >> >> Mike >> >> >> >> --- Mihael Hategan wrote: >> >>> So I was trying some stuff on Friday night. I guess >>> I've found the >>> strategy on when to run the tests: when nobody else >>> has jobs there >>> (besides Buzz doing gridftp tests, Ioan having some >>> Falkon workers >>> running, and the occasional Inca tests). >>> >>> In any event, the machine jumps to about 100% >>> utilization at around 130 >>> jobs with pre-ws gram. So Mike, please set >>> throttle.score.job.factor to >>> 1 in swift.properties. >>> >>> There's still more work I need to do test-wise. >>> >>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>> Mike, You're killing tg-grid1 again. Can someone >>> work with Mike to get >>>> some swift settings that don't kill our server? >>>> >>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>> >>>>> Yes, I'm submitting molecular dynamics >>> simulations >>>>> using Swift. >>>>> >>>>> Is there a default wall-time limit for jobs on >>> tg-uc? >>>>> >>>>> >>>>> >>>>> --- joseph insley wrote: >>>>> >>>>>> Actually, these numbers are now escalating... >>>>>> >>>>>> top - 17:18:54 up 2:29, 1 user, load >>> average: >>>>>> 149.02, 123.63, 91.94 >>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>> 0 >>>>>> stopped, 0 zombie >>>>>> >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>> 479 >>>>>> >>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>> tg-grid.uc.teragrid.org >>>>>> GRAM Authentication test successful >>>>>> real 0m26.134s >>>>>> user 0m0.090s >>>>>> sys 0m0.010s >>>>>> >>>>>> >>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>> wrote: >>>>>> >>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>> UC/ANL >>>>>> TG GRAM host) >>>>>>> became unresponsive and had to be rebooted. I >>> am >>>>>> now seeing slow >>>>>>> response times from the Gatekeeper there >>> again. >>>>>> Authenticating to >>>>>>> the gatekeeper should only take a second or >>> two, >>>>>> but it is >>>>>>> periodically taking up to 16 seconds: >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>> tg-grid.uc.teragrid.org >>>>>>> GRAM Authentication test successful >>>>>>> real 0m16.096s >>>>>>> user 0m0.060s >>>>>>> sys 0m0.020s >>>>>>> >>>>>>> looking at the load on tg-grid, it is rather >>> high: >>>>>>> >>>>>>> top - 16:55:26 up 2:06, 1 user, load >>> average: >>>>>> 89.59, 78.69, 62.92 >>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>> 0 >>>>>> stopped, 0 zombie >>>>>>> >>>>>>> And there appear to be a large number of >>> processes >>>>>> owned by kubal: >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 380 >>>>>>> >>>>>>> I assume that Mike is using swift to do the >>> job >>>>>> submission. Is >>>>>>> there some throttling of the rate at which >>> jobs >>>>>> are submitted to >>>>>>> the gatekeeper that could be done that would >>>>>> lighten this load >>>>>>> some? (Or has that already been done since >>>>>> earlier today?) The >>>>>>> current response times are not unacceptable, >>> but >>>>>> I'm hoping to >>>>>>> avoid having the machine grind to a halt as it >>> did >>>>>> earlier today. >>>>>>> >>>>>>> Thanks, >>>>>>> joe. 
>>>>>>> >>>>>>> >>>>>>> >>>>>> >>> =================================================== >>>>>>> joseph a. >>>>>>> insley >>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division >>>>>> (630) 252-5649 >>>>>>> argonne national laboratory >>>>>> (630) >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>> >>>>>> >>> =================================================== >>>>>> joseph a. insley >>>>>> >>>>>> insley at mcs.anl.gov >>>>>> mathematics & computer science division >>> (630) >>>>>> 252-5649 >>>>>> argonne national laboratory >>>>>> (630) >>>>>> 252-5986 (fax) >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>> >> ____________________________________________________________________________________ >>>>> Be a better friend, newshound, and >>>>> know-it-all with Yahoo! Mobile. Try it now. >>> >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> >>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >> >> >> >> >> ____________________________________________________________________________________ >> Never miss a thing. Make Yahoo your home page. >> http://www.yahoo.com/r/hs >> > From wilde at mcs.anl.gov Mon Feb 4 08:13:36 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 04 Feb 2008 08:13:36 -0600 Subject: [Swift-devel] Swift throttling In-Reply-To: <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> Message-ID: <47A71D90.6080907@mcs.anl.gov> Mihael, Ben - bear with me - I'd like to revisit where we are on throttling. The following may already be in place, but I think we need to review and clarify it, maybe re-assess the numbers: Seems like for both pre-WS and WS-GRAM we need to stay within two roughly-known limits: - number of jobs submitted per second - total # of jobs that can be submitted at once It seems that we need to set limits on these two parameters, *around* the slow-start algorithm that tries to sense a sustainable maximum rate of job submission. To what extent is that in the code already, and does it need improvement? I thought that for pre-WS GRAM the parameters are approximately - .5 jobs/sec - < 100 jobs in queue I realize that these can only be limited on a per-workflow basis, but for interactions between two workflows, hopefully the slow-start sensing algorithms will sense that resource is already under strain and stay at a low submission rate. So what I'm suggesting here is: - we agree on some arbitrary conservative numbers for the moment (till we can do more measurement) - we modify the code to enable explicit limits on the algorithm to be set by the user, eg: throttle.host.submitlimit - max # jobs that can be queued to a host throttle.host.submitrate - max #jobs/sec that can be queued to a host (float) Does Ti's report of 80 jobs indicate that maybe even 100 jobs in the queue is too much (for pre-WS)? Does this seem reasonable? If not, what is the mechanism by which we can reliably avoid over-running a site? - Mike On 2/4/08 7:16 AM, Ti Leggett wrote: > Around 80.
> > On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >> >> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>> Sorry for killing the server. I'm pushing to get >>> results to guide the selection of compounds for >>> wet-lab testing. >>> >>> I had set the throttle.score.job.factor to 1 in the >>> swift.properties file. >> >> Hmm. Ti, at the time of the massacre, how many did you kill? >> >> Mihael >> >>> >>> I certainly appreciate everyone's efforts and >>> responsiveness. >>> >>> Let me know what to try next, before I kill again. >>> >>> Cheers, >>> >>> Mike >>> >>> >>> >>> --- Mihael Hategan wrote: >>> >>>> So I was trying some stuff on Friday night. I guess >>>> I've found the >>>> strategy on when to run the tests: when nobody else >>>> has jobs there >>>> (besides Buzz doing gridftp tests, Ioan having some >>>> Falkon workers >>>> running, and the occasional Inca tests). >>>> >>>> In any event, the machine jumps to about 100% >>>> utilization at around 130 >>>> jobs with pre-ws gram. So Mike, please set >>>> throttle.score.job.factor to >>>> 1 in swift.properties. >>>> >>>> There's still more work I need to do test-wise. >>>> >>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>> Mike, You're killing tg-grid1 again. Can someone >>>> work with Mike to get >>>>> some swift settings that don't kill our server? >>>>> >>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>> >>>>>> Yes, I'm submitting molecular dynamics >>>> simulations >>>>>> using Swift. >>>>>> >>>>>> Is there a default wall-time limit for jobs on >>>> tg-uc? >>>>>> >>>>>> >>>>>> >>>>>> --- joseph insley wrote: >>>>>> >>>>>>> Actually, these numbers are now escalating... >>>>>>> >>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>> average: >>>>>>> 149.02, 123.63, 91.94 >>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>> 0 >>>>>>> stopped, 0 zombie >>>>>>> >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>> 479 >>>>>>> >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> tg-grid.uc.teragrid.org >>>>>>> GRAM Authentication test successful >>>>>>> real 0m26.134s >>>>>>> user 0m0.090s >>>>>>> sys 0m0.010s >>>>>>> >>>>>>> >>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>> wrote: >>>>>>> >>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>> UC/ANL >>>>>>> TG GRAM host) >>>>>>>> became unresponsive and had to be rebooted. I >>>> am >>>>>>> now seeing slow >>>>>>>> response times from the Gatekeeper there >>>> again. >>>>>>> Authenticating to >>>>>>>> the gatekeeper should only take a second or >>>> two, >>>>>>> but it is >>>>>>>> periodically taking up to 16 seconds: >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m16.096s >>>>>>>> user 0m0.060s >>>>>>>> sys 0m0.020s >>>>>>>> >>>>>>>> looking at the load on tg-grid, it is rather >>>> high: >>>>>>>> >>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>> average: >>>>>>> 89.59, 78.69, 62.92 >>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>> 0 >>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> And there appear to be a large number of >>>> processes >>>>>>> owned by kubal: >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 380 >>>>>>>> >>>>>>>> I assume that Mike is using swift to do the >>>> job >>>>>>> submission. Is >>>>>>>> there some throttling of the rate at which >>>> jobs >>>>>>> are submitted to >>>>>>>> the gatekeeper that could be done that would >>>>>>> lighten this load >>>>>>>> some? 
(Or has that already been done since >>>>>>> earlier today?) The >>>>>>>> current response times are not unacceptable, >>>> but >>>>>>> I'm hoping to >>>>>>>> avoid having the machine grind to a halt as it >>>> did >>>>>>> earlier today. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> joe. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>> =================================================== >>>>>>>> joseph a. >>>>>>>> insley >>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division >>>>>>> (630) 252-5649 >>>>>>>> argonne national laboratory >>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>> =================================================== >>>>>>> joseph a. insley >>>>>>> >>>>>>> insley at mcs.anl.gov >>>>>>> mathematics & computer science division >>>> (630) >>>>>>> 252-5649 >>>>>>> argonne national laboratory >>>>>>> (630) >>>>>>> 252-5986 (fax) >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>> >>> ____________________________________________________________________________________ >>> >>>>>> Be a better friend, newshound, and >>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>> >>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> >>>> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>> >>>> _______________________________________________ >>>> Swift-devel mailing list >>>> Swift-devel at ci.uchicago.edu >>>> >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>> >>>> >>> >>> >>> >>> >>> ____________________________________________________________________________________ >>> >>> Never miss a thing. Make Yahoo your home page. >>> http://www.yahoo.com/r/hs >>> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Mon Feb 4 09:30:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 09:30:54 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> Message-ID: <1202139054.16407.5.camel@blabla.mcs.anl.gov> That's odd. Clearly if that's not acceptable from your perspective, yet I thought 130 are fine, there's a disconnect between what you think is acceptable and what I think is acceptable. What was that prompted you to conclude things are bad? On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > Around 80. > > On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > > > > > On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >> Sorry for killing the server. I'm pushing to get > >> results to guide the selection of compounds for > >> wet-lab testing. > >> > >> I had set the throttle.score.job.factor to 1 in the > >> swift.properties file. > > > > Hmm. Ti, at the time of the massacre, how many did you kill? > > > > Mihael > > > >> > >> I certainly appreciate everyone's efforts and > >> responsiveness. > >> > >> Let me know what to try next, before I kill again. > >> > >> Cheers, > >> > >> Mike > >> > >> > >> > >> --- Mihael Hategan wrote: > >> > >>> So I was trying some stuff on Friday night. 
I guess > >>> I've found the > >>> strategy on when to run the tests: when nobody else > >>> has jobs there > >>> (besides Buzz doing gridftp tests, Ioan having some > >>> Falkon workers > >>> running, and the occasional Inca tests). > >>> > >>> In any event, the machine jumps to about 100% > >>> utilization at around 130 > >>> jobs with pre-ws gram. So Mike, please set > >>> throttle.score.job.factor to > >>> 1 in swift.properties. > >>> > >>> There's still more work I need to do test-wise. > >>> > >>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>> Mike, You're killing tg-grid1 again. Can someone > >>> work with Mike to get > >>>> some swift settings that don't kill our server? > >>>> > >>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>> > >>>>> Yes, I'm submitting molecular dynamics > >>> simulations > >>>>> using Swift. > >>>>> > >>>>> Is there a default wall-time limit for jobs on > >>> tg-uc? > >>>>> > >>>>> > >>>>> > >>>>> --- joseph insley wrote: > >>>>> > >>>>>> Actually, these numbers are now escalating... > >>>>>> > >>>>>> top - 17:18:54 up 2:29, 1 user, load > >>> average: > >>>>>> 149.02, 123.63, 91.94 > >>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>> 0 > >>>>>> stopped, 0 zombie > >>>>>> > >>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>> 479 > >>>>>> > >>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>> tg-grid.uc.teragrid.org > >>>>>> GRAM Authentication test successful > >>>>>> real 0m26.134s > >>>>>> user 0m0.090s > >>>>>> sys 0m0.010s > >>>>>> > >>>>>> > >>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>> wrote: > >>>>>> > >>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>> UC/ANL > >>>>>> TG GRAM host) > >>>>>>> became unresponsive and had to be rebooted. I > >>> am > >>>>>> now seeing slow > >>>>>>> response times from the Gatekeeper there > >>> again. > >>>>>> Authenticating to > >>>>>>> the gatekeeper should only take a second or > >>> two, > >>>>>> but it is > >>>>>>> periodically taking up to 16 seconds: > >>>>>>> > >>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>> tg-grid.uc.teragrid.org > >>>>>>> GRAM Authentication test successful > >>>>>>> real 0m16.096s > >>>>>>> user 0m0.060s > >>>>>>> sys 0m0.020s > >>>>>>> > >>>>>>> looking at the load on tg-grid, it is rather > >>> high: > >>>>>>> > >>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>> average: > >>>>>> 89.59, 78.69, 62.92 > >>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>> 0 > >>>>>> stopped, 0 zombie > >>>>>>> > >>>>>>> And there appear to be a large number of > >>> processes > >>>>>> owned by kubal: > >>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>> 380 > >>>>>>> > >>>>>>> I assume that Mike is using swift to do the > >>> job > >>>>>> submission. Is > >>>>>>> there some throttling of the rate at which > >>> jobs > >>>>>> are submitted to > >>>>>>> the gatekeeper that could be done that would > >>>>>> lighten this load > >>>>>>> some? (Or has that already been done since > >>>>>> earlier today?) The > >>>>>>> current response times are not unacceptable, > >>> but > >>>>>> I'm hoping to > >>>>>>> avoid having the machine grind to a halt as it > >>> did > >>>>>> earlier today. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> joe. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>> =================================================== > >>>>>>> joseph a. 
> >>>>>>> insley > >>>>>> > >>>>>>> insley at mcs.anl.gov > >>>>>>> mathematics & computer science division > >>>>>> (630) 252-5649 > >>>>>>> argonne national laboratory > >>>>>> (630) > >>>>>>> 252-5986 (fax) > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>> =================================================== > >>>>>> joseph a. insley > >>>>>> > >>>>>> insley at mcs.anl.gov > >>>>>> mathematics & computer science division > >>> (630) > >>>>>> 252-5649 > >>>>>> argonne national laboratory > >>>>>> (630) > >>>>>> 252-5986 (fax) > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>> > >> ____________________________________________________________________________________ > >>>>> Be a better friend, newshound, and > >>>>> know-it-all with Yahoo! Mobile. Try it now. > >>> > >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>> > >>>> > >>>> _______________________________________________ > >>>> Swift-devel mailing list > >>>> Swift-devel at ci.uchicago.edu > >>>> > >>> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > >> > >> > >> > >> > >> ____________________________________________________________________________________ > >> Never miss a thing. Make Yahoo your home page. > >> http://www.yahoo.com/r/hs > >> > > > From leggett at mcs.anl.gov Mon Feb 4 09:58:40 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 4 Feb 2008 09:58:40 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202139054.16407.5.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> Message-ID: <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> That inca tests were timing out after 5 minutes and the load on the machine was ~27. How are you concluding when things aren't acceptable? On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > That's odd. Clearly if that's not acceptable from your perspective, > yet > I thought 130 are fine, there's a disconnect between what you think is > acceptable and what I think is acceptable. > > What was that prompted you to conclude things are bad? > > On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: >> Around 80. >> >> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: >> >>> >>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>>> Sorry for killing the server. I'm pushing to get >>>> results to guide the selection of compounds for >>>> wet-lab testing. >>>> >>>> I had set the throttle.score.job.factor to 1 in the >>>> swift.properties file. >>> >>> Hmm. Ti, at the time of the massacre, how many did you kill? >>> >>> Mihael >>> >>>> >>>> I certainly appreciate everyone's efforts and >>>> responsiveness. >>>> >>>> Let me know what to try next, before I kill again. >>>> >>>> Cheers, >>>> >>>> Mike >>>> >>>> >>>> >>>> --- Mihael Hategan wrote: >>>> >>>>> So I was trying some stuff on Friday night. I guess >>>>> I've found the >>>>> strategy on when to run the tests: when nobody else >>>>> has jobs there >>>>> (besides Buzz doing gridftp tests, Ioan having some >>>>> Falkon workers >>>>> running, and the occasional Inca tests). 
>>>>> >>>>> In any event, the machine jumps to about 100% >>>>> utilization at around 130 >>>>> jobs with pre-ws gram. So Mike, please set >>>>> throttle.score.job.factor to >>>>> 1 in swift.properties. >>>>> >>>>> There's still more work I need to do test-wise. >>>>> >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>>> Mike, You're killing tg-grid1 again. Can someone >>>>> work with Mike to get >>>>>> some swift settings that don't kill our server? >>>>>> >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>> >>>>>>> Yes, I'm submitting molecular dynamics >>>>> simulations >>>>>>> using Swift. >>>>>>> >>>>>>> Is there a default wall-time limit for jobs on >>>>> tg-uc? >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- joseph insley wrote: >>>>>>> >>>>>>>> Actually, these numbers are now escalating... >>>>>>>> >>>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>>> average: >>>>>>>> 149.02, 123.63, 91.94 >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>>> 0 >>>>>>>> stopped, 0 zombie >>>>>>>> >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>> 479 >>>>>>>> >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>> GRAM Authentication test successful >>>>>>>> real 0m26.134s >>>>>>>> user 0m0.090s >>>>>>>> sys 0m0.010s >>>>>>>> >>>>>>>> >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>>> wrote: >>>>>>>> >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>>> UC/ANL >>>>>>>> TG GRAM host) >>>>>>>>> became unresponsive and had to be rebooted. I >>>>> am >>>>>>>> now seeing slow >>>>>>>>> response times from the Gatekeeper there >>>>> again. >>>>>>>> Authenticating to >>>>>>>>> the gatekeeper should only take a second or >>>>> two, >>>>>>>> but it is >>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>> >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>> GRAM Authentication test successful >>>>>>>>> real 0m16.096s >>>>>>>>> user 0m0.060s >>>>>>>>> sys 0m0.020s >>>>>>>>> >>>>>>>>> looking at the load on tg-grid, it is rather >>>>> high: >>>>>>>>> >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>>> average: >>>>>>>> 89.59, 78.69, 62.92 >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>>> 0 >>>>>>>> stopped, 0 zombie >>>>>>>>> >>>>>>>>> And there appear to be a large number of >>>>> processes >>>>>>>> owned by kubal: >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>> 380 >>>>>>>>> >>>>>>>>> I assume that Mike is using swift to do the >>>>> job >>>>>>>> submission. Is >>>>>>>>> there some throttling of the rate at which >>>>> jobs >>>>>>>> are submitted to >>>>>>>>> the gatekeeper that could be done that would >>>>>>>> lighten this load >>>>>>>>> some? (Or has that already been done since >>>>>>>> earlier today?) The >>>>>>>>> current response times are not unacceptable, >>>>> but >>>>>>>> I'm hoping to >>>>>>>>> avoid having the machine grind to a halt as it >>>>> did >>>>>>>> earlier today. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> joe. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>> =================================================== >>>>>>>>> joseph a. >>>>>>>>> insley >>>>>>>> >>>>>>>>> insley at mcs.anl.gov >>>>>>>>> mathematics & computer science division >>>>>>>> (630) 252-5649 >>>>>>>>> argonne national laboratory >>>>>>>> (630) >>>>>>>>> 252-5986 (fax) >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>> =================================================== >>>>>>>> joseph a. 
insley >>>>>>>> >>>>>>>> insley at mcs.anl.gov >>>>>>>> mathematics & computer science division >>>>> (630) >>>>>>>> 252-5649 >>>>>>>> argonne national laboratory >>>>>>>> (630) >>>>>>>> 252-5986 (fax) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> >>>> ____________________________________________________________________________________ >>>>>>> Be a better friend, newshound, and >>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>> >>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> >>>>> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Swift-devel mailing list >>>>> Swift-devel at ci.uchicago.edu >>>>> >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>> >>>>> >>>> >>>> >>>> >>>> >>>> ____________________________________________________________________________________ >>>> Never miss a thing. Make Yahoo your home page. >>>> http://www.yahoo.com/r/hs >>>> >>> >> > From benc at hawaga.org.uk Mon Feb 4 10:03:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 16:03:17 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <548830.35963.qm@web52311.mail.re2.yahoo.com> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: On Sun, 3 Feb 2008, Mike Kubal wrote: > Let me know what to try next, before I kill again. You can try the PBS local provider perhaps. That needs you to run Swift on eg tg-grid1 rather than on some arbitrary grid machine. Then use a sites.xml entry something like: /home/benc/swift-run-dir/ Make sure the workdirectory is somewhere shared (the directory you're using at the moment is probably OK). I've run this on teraport with a hundred or so jobs without any apparent problem, so hopefully this will scale better. -- From benc at hawaga.org.uk Mon Feb 4 10:09:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 16:09:40 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <548830.35963.qm@web52311.mail.re2.yahoo.com> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: On Sun, 3 Feb 2008, Mike Kubal wrote: > Let me know what to try next, before I kill again. Also, there is a clustering mechanism - this is where swift takes a bunch of jobs and aggregates them into a single submission to GRAM or PBS. If you know a maximum execution time for your jobs, you can do that. There's a users guide section with some details: http://www.ci.uchicago.edu/swift/guides/userguide.php#clustering Basically, you need to set a maximum time for your executables in your tc.data file using the maxwalltime profile and then specify a value for the clustering.min.time property. Clusters of about 10 jobs are perhaps the size to aim for. You can use this with any submission mechanism - GRAM2, GRAM4 or PBS. -- From benc at hawaga.org.uk Mon Feb 4 10:13:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 16:13:07 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: out of the clustering and pbs suggestions, I'd try PBS first... 
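[Archive note: the sites.xml entry in the first message above appears to have lost its XML markup, leaving only the work directory path. A minimal sketch of what such a PBS entry might have looked like, by analogy with Swift sites.xml files of this period (the pool handle and the gridftp line are illustrative guesses, not Ben's verbatim example):

    <pool handle="tguc-pbs">
      <gridftp url="local://localhost"/>
      <execution provider="pbs"/>
      <workdirectory>/home/benc/swift-run-dir/</workdirectory>
    </pool>

The clustering suggestion amounts to a maxwalltime profile on each tc.data entry plus the clustering properties in swift.properties. The site handle, application name, path, and numeric values below are placeholders, and property names other than clustering.min.time should be checked against the user guide section linked above, which remains the reference for exact names and units:

    tguc-pbs  mdsim  /path/to/mdsim  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=40

    clustering.enabled=true
    clustering.min.time=3600 ]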
-- From benc at hawaga.org.uk Mon Feb 4 10:17:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 16:17:52 +0000 (GMT) Subject: [Swift-devel] Swift throttling In-Reply-To: <47A71D90.6080907@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <47A71D90.6080907@mcs.anl.gov> Message-ID: On Mon, 4 Feb 2008, Michael Wilde wrote: > hopefully the slow-start sensing > algorithms will sense that resource is already under strain and stay at a low > submission rate. I don't think it does that at all. > - we modify the code to enable explicit limits on the algorithm to be set by > the user, eg: > throttle.host.submitlimit - max # jobs that can be queued to a host > throttle.host.submitrate - max #jobs/sec that can be queued to a host > (float) parameters that control thsoe exist already, I think, for the whole workflow. In the single site case, site specific ones aren't needed because of that. If they were being implemented, it would probably be better to make them settable in the sites catalog so that they can be defined differently for each site. throttle.scote.job.factor limits the number of concurrent jobs to 2 + 100*throttle.score.job.factor (so to achieve a limit of 52, set throttle.score.job.factor to 0.5) There's a per-site profile setting: maxSubmitRate - limits the maximum rate of job submission, in jobs per second. -- From hategan at mcs.anl.gov Mon Feb 4 10:18:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 10:18:36 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> Message-ID: <1202141916.17237.4.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > That inca tests were timing out after 5 minutes and the load on the > machine was ~27. How are you concluding when things aren't acceptable? It's got 2 cpus. So to me an average load of under 100 and the SSH session being responsive looks fine. The fact that inca tests are timing out may be because inca has too low of a tolerance for things. > > On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > > > That's odd. Clearly if that's not acceptable from your perspective, > > yet > > I thought 130 are fine, there's a disconnect between what you think is > > acceptable and what I think is acceptable. > > > > What was that prompted you to conclude things are bad? > > > > On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >> Around 80. > >> > >> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >> > >>> > >>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>> Sorry for killing the server. I'm pushing to get > >>>> results to guide the selection of compounds for > >>>> wet-lab testing. > >>>> > >>>> I had set the throttle.score.job.factor to 1 in the > >>>> swift.properties file. > >>> > >>> Hmm. Ti, at the time of the massacre, how many did you kill? > >>> > >>> Mihael > >>> > >>>> > >>>> I certainly appreciate everyone's efforts and > >>>> responsiveness. > >>>> > >>>> Let me know what to try next, before I kill again. 
> >>>> > >>>> Cheers, > >>>> > >>>> Mike > >>>> > >>>> > >>>> > >>>> --- Mihael Hategan wrote: > >>>> > >>>>> So I was trying some stuff on Friday night. I guess > >>>>> I've found the > >>>>> strategy on when to run the tests: when nobody else > >>>>> has jobs there > >>>>> (besides Buzz doing gridftp tests, Ioan having some > >>>>> Falkon workers > >>>>> running, and the occasional Inca tests). > >>>>> > >>>>> In any event, the machine jumps to about 100% > >>>>> utilization at around 130 > >>>>> jobs with pre-ws gram. So Mike, please set > >>>>> throttle.score.job.factor to > >>>>> 1 in swift.properties. > >>>>> > >>>>> There's still more work I need to do test-wise. > >>>>> > >>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>>> Mike, You're killing tg-grid1 again. Can someone > >>>>> work with Mike to get > >>>>>> some swift settings that don't kill our server? > >>>>>> > >>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>> > >>>>>>> Yes, I'm submitting molecular dynamics > >>>>> simulations > >>>>>>> using Swift. > >>>>>>> > >>>>>>> Is there a default wall-time limit for jobs on > >>>>> tg-uc? > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> --- joseph insley wrote: > >>>>>>> > >>>>>>>> Actually, these numbers are now escalating... > >>>>>>>> > >>>>>>>> top - 17:18:54 up 2:29, 1 user, load > >>>>> average: > >>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>>>> 0 > >>>>>>>> stopped, 0 zombie > >>>>>>>> > >>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>> 479 > >>>>>>>> > >>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>> GRAM Authentication test successful > >>>>>>>> real 0m26.134s > >>>>>>>> user 0m0.090s > >>>>>>>> sys 0m0.010s > >>>>>>>> > >>>>>>>> > >>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>>>> wrote: > >>>>>>>> > >>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>>>> UC/ANL > >>>>>>>> TG GRAM host) > >>>>>>>>> became unresponsive and had to be rebooted. I > >>>>> am > >>>>>>>> now seeing slow > >>>>>>>>> response times from the Gatekeeper there > >>>>> again. > >>>>>>>> Authenticating to > >>>>>>>>> the gatekeeper should only take a second or > >>>>> two, > >>>>>>>> but it is > >>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>> > >>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>> GRAM Authentication test successful > >>>>>>>>> real 0m16.096s > >>>>>>>>> user 0m0.060s > >>>>>>>>> sys 0m0.020s > >>>>>>>>> > >>>>>>>>> looking at the load on tg-grid, it is rather > >>>>> high: > >>>>>>>>> > >>>>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>>>> average: > >>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>>>> 0 > >>>>>>>> stopped, 0 zombie > >>>>>>>>> > >>>>>>>>> And there appear to be a large number of > >>>>> processes > >>>>>>>> owned by kubal: > >>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>> 380 > >>>>>>>>> > >>>>>>>>> I assume that Mike is using swift to do the > >>>>> job > >>>>>>>> submission. Is > >>>>>>>>> there some throttling of the rate at which > >>>>> jobs > >>>>>>>> are submitted to > >>>>>>>>> the gatekeeper that could be done that would > >>>>>>>> lighten this load > >>>>>>>>> some? (Or has that already been done since > >>>>>>>> earlier today?) 
The > >>>>>>>>> current response times are not unacceptable, > >>>>> but > >>>>>>>> I'm hoping to > >>>>>>>>> avoid having the machine grind to a halt as it > >>>>> did > >>>>>>>> earlier today. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> joe. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>> =================================================== > >>>>>>>>> joseph a. > >>>>>>>>> insley > >>>>>>>> > >>>>>>>>> insley at mcs.anl.gov > >>>>>>>>> mathematics & computer science division > >>>>>>>> (630) 252-5649 > >>>>>>>>> argonne national laboratory > >>>>>>>> (630) > >>>>>>>>> 252-5986 (fax) > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>> =================================================== > >>>>>>>> joseph a. insley > >>>>>>>> > >>>>>>>> insley at mcs.anl.gov > >>>>>>>> mathematics & computer science division > >>>>> (630) > >>>>>>>> 252-5649 > >>>>>>>> argonne national laboratory > >>>>>>>> (630) > >>>>>>>> 252-5986 (fax) > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>>> ____________________________________________________________________________________ > >>>>>>> Be a better friend, newshound, and > >>>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>>>> > >>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-devel mailing list > >>>>>> Swift-devel at ci.uchicago.edu > >>>>>> > >>>>> > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> Swift-devel mailing list > >>>>> Swift-devel at ci.uchicago.edu > >>>>> > >>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>> > >>>>> > >>>> > >>>> > >>>> > >>>> > >>>> ____________________________________________________________________________________ > >>>> Never miss a thing. Make Yahoo your home page. > >>>> http://www.yahoo.com/r/hs > >>>> > >>> > >> > > > From hategan at mcs.anl.gov Mon Feb 4 10:23:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 10:23:36 -0600 Subject: [Swift-devel] Swift throttling In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <47A71D90.6080907@mcs.anl.gov> Message-ID: <1202142217.17237.9.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 16:17 +0000, Ben Clifford wrote: > > On Mon, 4 Feb 2008, Michael Wilde wrote: > > > hopefully the slow-start sensing > > algorithms will sense that resource is already under strain and stay at a low > > submission rate. > > I don't think it does that at all. Actually, yes. There's the submit throttle which limits the submission parallelism. When remote site is under load and accepts jobs slowly, the client will invariably submit slower. And now that I think of it, the maxSubmitRate looks a lot like it could be integrated here. > > > - we modify the code to enable explicit limits on the algorithm to be set by > > the user, eg: > > throttle.host.submitlimit - max # jobs that can be queued to a host > > throttle.host.submitrate - max #jobs/sec that can be queued to a host > > (float) > > parameters that control thsoe exist already, I think, for the whole > workflow. In the single site case, site specific ones aren't needed > because of that. 
If they were being implemented, it would probably be > better to make them settable in the sites catalog so that they can be > defined differently for each site. > > throttle.scote.job.factor limits the number of concurrent jobs to 2 + > 100*throttle.score.job.factor (so to achieve a limit of 52, set > throttle.score.job.factor to 0.5) Unfortunately that's an int. So it won't work. I'll make it a float. > > There's a per-site profile setting: > maxSubmitRate - limits the maximum rate of job submission, in jobs per > second. > From leggett at mcs.anl.gov Mon Feb 4 10:28:38 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 4 Feb 2008 10:28:38 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202141916.17237.4.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> Message-ID: <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> Then I'd say we have very different levels of acceptable. A simple job submission test should never take longer than 5 minutes to complete and a load of 27 is not acceptable when the responsiveness of the machine is impacted. And since we're having this conversation, there is a perceived problem on our end so an adjustment to our definition of acceptable is needed. On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > > On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: >> That inca tests were timing out after 5 minutes and the load on the >> machine was ~27. How are you concluding when things aren't >> acceptable? > > It's got 2 cpus. So to me an average load of under 100 and the SSH > session being responsive looks fine. > > The fact that inca tests are timing out may be because inca has too > low > of a tolerance for things. > >> >> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: >> >>> That's odd. Clearly if that's not acceptable from your perspective, >>> yet >>> I thought 130 are fine, there's a disconnect between what you >>> think is >>> acceptable and what I think is acceptable. >>> >>> What was that prompted you to conclude things are bad? >>> >>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: >>>> Around 80. >>>> >>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: >>>> >>>>> >>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>>>>> Sorry for killing the server. I'm pushing to get >>>>>> results to guide the selection of compounds for >>>>>> wet-lab testing. >>>>>> >>>>>> I had set the throttle.score.job.factor to 1 in the >>>>>> swift.properties file. >>>>> >>>>> Hmm. Ti, at the time of the massacre, how many did you kill? >>>>> >>>>> Mihael >>>>> >>>>>> >>>>>> I certainly appreciate everyone's efforts and >>>>>> responsiveness. >>>>>> >>>>>> Let me know what to try next, before I kill again. >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Mike >>>>>> >>>>>> >>>>>> >>>>>> --- Mihael Hategan wrote: >>>>>> >>>>>>> So I was trying some stuff on Friday night. I guess >>>>>>> I've found the >>>>>>> strategy on when to run the tests: when nobody else >>>>>>> has jobs there >>>>>>> (besides Buzz doing gridftp tests, Ioan having some >>>>>>> Falkon workers >>>>>>> running, and the occasional Inca tests). >>>>>>> >>>>>>> In any event, the machine jumps to about 100% >>>>>>> utilization at around 130 >>>>>>> jobs with pre-ws gram. 
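As a concrete illustration of the throttle settings discussed in this thread, a minimal swift.properties sketch (values are examples only; as noted above, throttle.score.job.factor currently only accepts integers, so the fractional variant assumes the float fix Mihael mentions has landed):

# limit concurrent jobs via the 2 + 100*factor formula quoted above
# factor 1 caps at roughly 2 + 100*1 = 102 concurrent jobs
throttle.score.job.factor=1
# once fractional values are accepted, 0.5 would cap at 2 + 100*0.5 = 52
#throttle.score.job.factor=0.5

The per-site rate limit would go into the sites catalog as the maxSubmitRate profile; the karajan namespace and the 0.2 jobs/second value below are illustrative guesses rather than anything stated in this thread:

<profile namespace="karajan" key="maxSubmitRate">0.2</profile>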
So Mike, please set >>>>>>> throttle.score.job.factor to >>>>>>> 1 in swift.properties. >>>>>>> >>>>>>> There's still more work I need to do test-wise. >>>>>>> >>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>>>>> Mike, You're killing tg-grid1 again. Can someone >>>>>>> work with Mike to get >>>>>>>> some swift settings that don't kill our server? >>>>>>>> >>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>>>> >>>>>>>>> Yes, I'm submitting molecular dynamics >>>>>>> simulations >>>>>>>>> using Swift. >>>>>>>>> >>>>>>>>> Is there a default wall-time limit for jobs on >>>>>>> tg-uc? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> --- joseph insley wrote: >>>>>>>>> >>>>>>>>>> Actually, these numbers are now escalating... >>>>>>>>>> >>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>>>>> average: >>>>>>>>>> 149.02, 123.63, 91.94 >>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>>>>> 0 >>>>>>>>>> stopped, 0 zombie >>>>>>>>>> >>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>> 479 >>>>>>>>>> >>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>> GRAM Authentication test successful >>>>>>>>>> real 0m26.134s >>>>>>>>>> user 0m0.090s >>>>>>>>>> sys 0m0.010s >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>>>>> UC/ANL >>>>>>>>>> TG GRAM host) >>>>>>>>>>> became unresponsive and had to be rebooted. I >>>>>>> am >>>>>>>>>> now seeing slow >>>>>>>>>>> response times from the Gatekeeper there >>>>>>> again. >>>>>>>>>> Authenticating to >>>>>>>>>>> the gatekeeper should only take a second or >>>>>>> two, >>>>>>>>>> but it is >>>>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>>>> >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>> real 0m16.096s >>>>>>>>>>> user 0m0.060s >>>>>>>>>>> sys 0m0.020s >>>>>>>>>>> >>>>>>>>>>> looking at the load on tg-grid, it is rather >>>>>>> high: >>>>>>>>>>> >>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>>>>> average: >>>>>>>>>> 89.59, 78.69, 62.92 >>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>>>>> 0 >>>>>>>>>> stopped, 0 zombie >>>>>>>>>>> >>>>>>>>>>> And there appear to be a large number of >>>>>>> processes >>>>>>>>>> owned by kubal: >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>> 380 >>>>>>>>>>> >>>>>>>>>>> I assume that Mike is using swift to do the >>>>>>> job >>>>>>>>>> submission. Is >>>>>>>>>>> there some throttling of the rate at which >>>>>>> jobs >>>>>>>>>> are submitted to >>>>>>>>>>> the gatekeeper that could be done that would >>>>>>>>>> lighten this load >>>>>>>>>>> some? (Or has that already been done since >>>>>>>>>> earlier today?) The >>>>>>>>>>> current response times are not unacceptable, >>>>>>> but >>>>>>>>>> I'm hoping to >>>>>>>>>>> avoid having the machine grind to a halt as it >>>>>>> did >>>>>>>>>> earlier today. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> joe. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>> =================================================== >>>>>>>>>>> joseph a. 
>>>>>>>>>>> insley >>>>>>>>>> >>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>> mathematics & computer science division >>>>>>>>>> (630) 252-5649 >>>>>>>>>>> argonne national laboratory >>>>>>>>>> (630) >>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>> =================================================== >>>>>>>>>> joseph a. insley >>>>>>>>>> >>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>> mathematics & computer science division >>>>>>> (630) >>>>>>>>>> 252-5649 >>>>>>>>>> argonne national laboratory >>>>>>>>>> (630) >>>>>>>>>> 252-5986 (fax) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> >>>>>> ____________________________________________________________________________________ >>>>>>>>> Be a better friend, newshound, and >>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>>> >>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-devel mailing list >>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>> >>>>>>> >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Swift-devel mailing list >>>>>>> Swift-devel at ci.uchicago.edu >>>>>>> >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> ____________________________________________________________________________________ >>>>>> Never miss a thing. Make Yahoo your home page. >>>>>> http://www.yahoo.com/r/hs >>>>>> >>>>> >>>> >>> >> > From foster at mcs.anl.gov Mon Feb 4 10:31:59 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 04 Feb 2008 10:31:59 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> Message-ID: <47A73DFF.3010402@mcs.anl.gov> It would be really wonderful if someone can try GRAM4, which we believe addresses this problem. Ian. Ti Leggett wrote: > Then I'd say we have very different levels of acceptable. A simple job > submission test should never take longer than 5 minutes to complete > and a load of 27 is not acceptable when the responsiveness of the > machine is impacted. And since we're having this conversation, there > is a perceived problem on our end so an adjustment to our definition > of acceptable is needed. > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > >> >> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: >>> That inca tests were timing out after 5 minutes and the load on the >>> machine was ~27. How are you concluding when things aren't acceptable? >> >> It's got 2 cpus. So to me an average load of under 100 and the SSH >> session being responsive looks fine. >> >> The fact that inca tests are timing out may be because inca has too low >> of a tolerance for things. >> >>> >>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: >>> >>>> That's odd. Clearly if that's not acceptable from your perspective, >>>> yet >>>> I thought 130 are fine, there's a disconnect between what you think is >>>> acceptable and what I think is acceptable. 
>>>> >>>> What was that prompted you to conclude things are bad? >>>> >>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: >>>>> Around 80. >>>>> >>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: >>>>> >>>>>> >>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>>>>>> Sorry for killing the server. I'm pushing to get >>>>>>> results to guide the selection of compounds for >>>>>>> wet-lab testing. >>>>>>> >>>>>>> I had set the throttle.score.job.factor to 1 in the >>>>>>> swift.properties file. >>>>>> >>>>>> Hmm. Ti, at the time of the massacre, how many did you kill? >>>>>> >>>>>> Mihael >>>>>> >>>>>>> >>>>>>> I certainly appreciate everyone's efforts and >>>>>>> responsiveness. >>>>>>> >>>>>>> Let me know what to try next, before I kill again. >>>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> Mike >>>>>>> >>>>>>> >>>>>>> >>>>>>> --- Mihael Hategan wrote: >>>>>>> >>>>>>>> So I was trying some stuff on Friday night. I guess >>>>>>>> I've found the >>>>>>>> strategy on when to run the tests: when nobody else >>>>>>>> has jobs there >>>>>>>> (besides Buzz doing gridftp tests, Ioan having some >>>>>>>> Falkon workers >>>>>>>> running, and the occasional Inca tests). >>>>>>>> >>>>>>>> In any event, the machine jumps to about 100% >>>>>>>> utilization at around 130 >>>>>>>> jobs with pre-ws gram. So Mike, please set >>>>>>>> throttle.score.job.factor to >>>>>>>> 1 in swift.properties. >>>>>>>> >>>>>>>> There's still more work I need to do test-wise. >>>>>>>> >>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>>>>>> Mike, You're killing tg-grid1 again. Can someone >>>>>>>> work with Mike to get >>>>>>>>> some swift settings that don't kill our server? >>>>>>>>> >>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>>>>> >>>>>>>>>> Yes, I'm submitting molecular dynamics >>>>>>>> simulations >>>>>>>>>> using Swift. >>>>>>>>>> >>>>>>>>>> Is there a default wall-time limit for jobs on >>>>>>>> tg-uc? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> --- joseph insley wrote: >>>>>>>>>> >>>>>>>>>>> Actually, these numbers are now escalating... >>>>>>>>>>> >>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>>>>>> average: >>>>>>>>>>> 149.02, 123.63, 91.94 >>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>>>>>> 0 >>>>>>>>>>> stopped, 0 zombie >>>>>>>>>>> >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>> 479 >>>>>>>>>>> >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>> real 0m26.134s >>>>>>>>>>> user 0m0.090s >>>>>>>>>>> sys 0m0.010s >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>>>>>> UC/ANL >>>>>>>>>>> TG GRAM host) >>>>>>>>>>>> became unresponsive and had to be rebooted. I >>>>>>>> am >>>>>>>>>>> now seeing slow >>>>>>>>>>>> response times from the Gatekeeper there >>>>>>>> again. 
>>>>>>>>>>> Authenticating to >>>>>>>>>>>> the gatekeeper should only take a second or >>>>>>>> two, >>>>>>>>>>> but it is >>>>>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>>>>> >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>>> real 0m16.096s >>>>>>>>>>>> user 0m0.060s >>>>>>>>>>>> sys 0m0.020s >>>>>>>>>>>> >>>>>>>>>>>> looking at the load on tg-grid, it is rather >>>>>>>> high: >>>>>>>>>>>> >>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>>>>>> average: >>>>>>>>>>> 89.59, 78.69, 62.92 >>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>>>>>> 0 >>>>>>>>>>> stopped, 0 zombie >>>>>>>>>>>> >>>>>>>>>>>> And there appear to be a large number of >>>>>>>> processes >>>>>>>>>>> owned by kubal: >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>>> 380 >>>>>>>>>>>> >>>>>>>>>>>> I assume that Mike is using swift to do the >>>>>>>> job >>>>>>>>>>> submission. Is >>>>>>>>>>>> there some throttling of the rate at which >>>>>>>> jobs >>>>>>>>>>> are submitted to >>>>>>>>>>>> the gatekeeper that could be done that would >>>>>>>>>>> lighten this load >>>>>>>>>>>> some? (Or has that already been done since >>>>>>>>>>> earlier today?) The >>>>>>>>>>>> current response times are not unacceptable, >>>>>>>> but >>>>>>>>>>> I'm hoping to >>>>>>>>>>>> avoid having the machine grind to a halt as it >>>>>>>> did >>>>>>>>>>> earlier today. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> joe. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>> =================================================== >>>>>>>>>>>> joseph a. >>>>>>>>>>>> insley >>>>>>>>>>> >>>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>>> mathematics & computer science division >>>>>>>>>>> (630) 252-5649 >>>>>>>>>>>> argonne national laboratory >>>>>>>>>>> (630) >>>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>> =================================================== >>>>>>>>>>> joseph a. insley >>>>>>>>>>> >>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>> mathematics & computer science division >>>>>>>> (630) >>>>>>>>>>> 252-5649 >>>>>>>>>>> argonne national laboratory >>>>>>>>>>> (630) >>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>> ____________________________________________________________________________________ >>>>>>> >>>>>>>>>> Be a better friend, newshound, and >>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>>>> >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Swift-devel mailing list >>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>> >>>>>>>> >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-devel mailing list >>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>> >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> ____________________________________________________________________________________ >>>>>>> >>>>>>> Never miss a thing. Make Yahoo your home page. 
>>>>>>> http://www.yahoo.com/r/hs >>>>>>> >>>>>> >>>>> >>>> >>> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Feb 4 10:47:33 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 10:47:33 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> Message-ID: <1202143654.17665.12.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: > Then I'd say we have very different levels of acceptable. Yes, that's why we're having this discussion. > A simple job > submission test should never take longer than 5 minutes to complete > and a load of 27 is not acceptable when the responsiveness of the > machine is impacted. And since we're having this conversation, there > is a perceived problem on our end so an adjustment to our definition > of acceptable is needed. And we need to adjust our definition of not-acceptable. So we need to meet in the middle. So, 25 (sustained) reasonably acceptable average load? That amounts to about 13 hungry processes per cpu. Even with a 100Hz time slice, each process would get 8 slices per second on average. > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > > > > > On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > >> That inca tests were timing out after 5 minutes and the load on the > >> machine was ~27. How are you concluding when things aren't > >> acceptable? > > > > It's got 2 cpus. So to me an average load of under 100 and the SSH > > session being responsive looks fine. > > > > The fact that inca tests are timing out may be because inca has too > > low > > of a tolerance for things. > > > >> > >> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > >> > >>> That's odd. Clearly if that's not acceptable from your perspective, > >>> yet > >>> I thought 130 are fine, there's a disconnect between what you > >>> think is > >>> acceptable and what I think is acceptable. > >>> > >>> What was that prompted you to conclude things are bad? > >>> > >>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >>>> Around 80. > >>>> > >>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >>>> > >>>>> > >>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>>>> Sorry for killing the server. I'm pushing to get > >>>>>> results to guide the selection of compounds for > >>>>>> wet-lab testing. > >>>>>> > >>>>>> I had set the throttle.score.job.factor to 1 in the > >>>>>> swift.properties file. > >>>>> > >>>>> Hmm. Ti, at the time of the massacre, how many did you kill? > >>>>> > >>>>> Mihael > >>>>> > >>>>>> > >>>>>> I certainly appreciate everyone's efforts and > >>>>>> responsiveness. > >>>>>> > >>>>>> Let me know what to try next, before I kill again. > >>>>>> > >>>>>> Cheers, > >>>>>> > >>>>>> Mike > >>>>>> > >>>>>> > >>>>>> > >>>>>> --- Mihael Hategan wrote: > >>>>>> > >>>>>>> So I was trying some stuff on Friday night. 
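As a quick check of the arithmetic in Mihael's note above: a sustained load of 25 spread across the machine's 2 CPUs is about 25/2, roughly 13 runnable processes per CPU, and with a 100 Hz scheduler tick each of those gets roughly 100/13, about 8 timeslices per second, which is where those figures come from.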
I guess > >>>>>>> I've found the > >>>>>>> strategy on when to run the tests: when nobody else > >>>>>>> has jobs there > >>>>>>> (besides Buzz doing gridftp tests, Ioan having some > >>>>>>> Falkon workers > >>>>>>> running, and the occasional Inca tests). > >>>>>>> > >>>>>>> In any event, the machine jumps to about 100% > >>>>>>> utilization at around 130 > >>>>>>> jobs with pre-ws gram. So Mike, please set > >>>>>>> throttle.score.job.factor to > >>>>>>> 1 in swift.properties. > >>>>>>> > >>>>>>> There's still more work I need to do test-wise. > >>>>>>> > >>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>>>>> Mike, You're killing tg-grid1 again. Can someone > >>>>>>> work with Mike to get > >>>>>>>> some swift settings that don't kill our server? > >>>>>>>> > >>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>>>> > >>>>>>>>> Yes, I'm submitting molecular dynamics > >>>>>>> simulations > >>>>>>>>> using Swift. > >>>>>>>>> > >>>>>>>>> Is there a default wall-time limit for jobs on > >>>>>>> tg-uc? > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> --- joseph insley wrote: > >>>>>>>>> > >>>>>>>>>> Actually, these numbers are now escalating... > >>>>>>>>>> > >>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load > >>>>>>> average: > >>>>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>>>>>> 0 > >>>>>>>>>> stopped, 0 zombie > >>>>>>>>>> > >>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>> 479 > >>>>>>>>>> > >>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>> real 0m26.134s > >>>>>>>>>> user 0m0.090s > >>>>>>>>>> sys 0m0.010s > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>>>>>> UC/ANL > >>>>>>>>>> TG GRAM host) > >>>>>>>>>>> became unresponsive and had to be rebooted. I > >>>>>>> am > >>>>>>>>>> now seeing slow > >>>>>>>>>>> response times from the Gatekeeper there > >>>>>>> again. > >>>>>>>>>> Authenticating to > >>>>>>>>>>> the gatekeeper should only take a second or > >>>>>>> two, > >>>>>>>>>> but it is > >>>>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>>>> > >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>> real 0m16.096s > >>>>>>>>>>> user 0m0.060s > >>>>>>>>>>> sys 0m0.020s > >>>>>>>>>>> > >>>>>>>>>>> looking at the load on tg-grid, it is rather > >>>>>>> high: > >>>>>>>>>>> > >>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>>>>>> average: > >>>>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>>>>>> 0 > >>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>> > >>>>>>>>>>> And there appear to be a large number of > >>>>>>> processes > >>>>>>>>>> owned by kubal: > >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>> 380 > >>>>>>>>>>> > >>>>>>>>>>> I assume that Mike is using swift to do the > >>>>>>> job > >>>>>>>>>> submission. Is > >>>>>>>>>>> there some throttling of the rate at which > >>>>>>> jobs > >>>>>>>>>> are submitted to > >>>>>>>>>>> the gatekeeper that could be done that would > >>>>>>>>>> lighten this load > >>>>>>>>>>> some? (Or has that already been done since > >>>>>>>>>> earlier today?) 
The > >>>>>>>>>>> current response times are not unacceptable, > >>>>>>> but > >>>>>>>>>> I'm hoping to > >>>>>>>>>>> avoid having the machine grind to a halt as it > >>>>>>> did > >>>>>>>>>> earlier today. > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> joe. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>> =================================================== > >>>>>>>>>>> joseph a. > >>>>>>>>>>> insley > >>>>>>>>>> > >>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>> mathematics & computer science division > >>>>>>>>>> (630) 252-5649 > >>>>>>>>>>> argonne national laboratory > >>>>>>>>>> (630) > >>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>> =================================================== > >>>>>>>>>> joseph a. insley > >>>>>>>>>> > >>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>> mathematics & computer science division > >>>>>>> (630) > >>>>>>>>>> 252-5649 > >>>>>>>>>> argonne national laboratory > >>>>>>>>>> (630) > >>>>>>>>>> 252-5986 (fax) > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>> ____________________________________________________________________________________ > >>>>>>>>> Be a better friend, newshound, and > >>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>>>>>> > >>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>>>> > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>>>>>> > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Swift-devel mailing list > >>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>> > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> ____________________________________________________________________________________ > >>>>>> Never miss a thing. Make Yahoo your home page. > >>>>>> http://www.yahoo.com/r/hs > >>>>>> > >>>>> > >>>> > >>> > >> > > > From hategan at mcs.anl.gov Mon Feb 4 10:48:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 10:48:31 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <47A73DFF.3010402@mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <47A73DFF.3010402@mcs.anl.gov> Message-ID: <1202143711.17665.13.camel@blabla.mcs.anl.gov> Yes, and I will. But unless we're completely dropping support for pre-ws GRAM, we still need to do this. On Mon, 2008-02-04 at 10:31 -0600, Ian Foster wrote: > It would be really wonderful if someone can try GRAM4, which we believe > addresses this problem. > > Ian. > > Ti Leggett wrote: > > Then I'd say we have very different levels of acceptable. A simple job > > submission test should never take longer than 5 minutes to complete > > and a load of 27 is not acceptable when the responsiveness of the > > machine is impacted. And since we're having this conversation, there > > is a perceived problem on our end so an adjustment to our definition > > of acceptable is needed. 
> > > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > > > >> > >> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > >>> That inca tests were timing out after 5 minutes and the load on the > >>> machine was ~27. How are you concluding when things aren't acceptable? > >> > >> It's got 2 cpus. So to me an average load of under 100 and the SSH > >> session being responsive looks fine. > >> > >> The fact that inca tests are timing out may be because inca has too low > >> of a tolerance for things. > >> > >>> > >>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > >>> > >>>> That's odd. Clearly if that's not acceptable from your perspective, > >>>> yet > >>>> I thought 130 are fine, there's a disconnect between what you think is > >>>> acceptable and what I think is acceptable. > >>>> > >>>> What was that prompted you to conclude things are bad? > >>>> > >>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >>>>> Around 80. > >>>>> > >>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >>>>> > >>>>>> > >>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>>>>> Sorry for killing the server. I'm pushing to get > >>>>>>> results to guide the selection of compounds for > >>>>>>> wet-lab testing. > >>>>>>> > >>>>>>> I had set the throttle.score.job.factor to 1 in the > >>>>>>> swift.properties file. > >>>>>> > >>>>>> Hmm. Ti, at the time of the massacre, how many did you kill? > >>>>>> > >>>>>> Mihael > >>>>>> > >>>>>>> > >>>>>>> I certainly appreciate everyone's efforts and > >>>>>>> responsiveness. > >>>>>>> > >>>>>>> Let me know what to try next, before I kill again. > >>>>>>> > >>>>>>> Cheers, > >>>>>>> > >>>>>>> Mike > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> --- Mihael Hategan wrote: > >>>>>>> > >>>>>>>> So I was trying some stuff on Friday night. I guess > >>>>>>>> I've found the > >>>>>>>> strategy on when to run the tests: when nobody else > >>>>>>>> has jobs there > >>>>>>>> (besides Buzz doing gridftp tests, Ioan having some > >>>>>>>> Falkon workers > >>>>>>>> running, and the occasional Inca tests). > >>>>>>>> > >>>>>>>> In any event, the machine jumps to about 100% > >>>>>>>> utilization at around 130 > >>>>>>>> jobs with pre-ws gram. So Mike, please set > >>>>>>>> throttle.score.job.factor to > >>>>>>>> 1 in swift.properties. > >>>>>>>> > >>>>>>>> There's still more work I need to do test-wise. > >>>>>>>> > >>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>>>>>> Mike, You're killing tg-grid1 again. Can someone > >>>>>>>> work with Mike to get > >>>>>>>>> some swift settings that don't kill our server? > >>>>>>>>> > >>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>>>>> > >>>>>>>>>> Yes, I'm submitting molecular dynamics > >>>>>>>> simulations > >>>>>>>>>> using Swift. > >>>>>>>>>> > >>>>>>>>>> Is there a default wall-time limit for jobs on > >>>>>>>> tg-uc? > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> --- joseph insley wrote: > >>>>>>>>>> > >>>>>>>>>>> Actually, these numbers are now escalating... 
> >>>>>>>>>>> > >>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load > >>>>>>>> average: > >>>>>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>>>>>>> 0 > >>>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>> > >>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>> 479 > >>>>>>>>>>> > >>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>> real 0m26.134s > >>>>>>>>>>> user 0m0.090s > >>>>>>>>>>> sys 0m0.010s > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>>>>>>> UC/ANL > >>>>>>>>>>> TG GRAM host) > >>>>>>>>>>>> became unresponsive and had to be rebooted. I > >>>>>>>> am > >>>>>>>>>>> now seeing slow > >>>>>>>>>>>> response times from the Gatekeeper there > >>>>>>>> again. > >>>>>>>>>>> Authenticating to > >>>>>>>>>>>> the gatekeeper should only take a second or > >>>>>>>> two, > >>>>>>>>>>> but it is > >>>>>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>>>>> > >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>>> real 0m16.096s > >>>>>>>>>>>> user 0m0.060s > >>>>>>>>>>>> sys 0m0.020s > >>>>>>>>>>>> > >>>>>>>>>>>> looking at the load on tg-grid, it is rather > >>>>>>>> high: > >>>>>>>>>>>> > >>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>>>>>>> average: > >>>>>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>>>>>>> 0 > >>>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>>> > >>>>>>>>>>>> And there appear to be a large number of > >>>>>>>> processes > >>>>>>>>>>> owned by kubal: > >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>>> 380 > >>>>>>>>>>>> > >>>>>>>>>>>> I assume that Mike is using swift to do the > >>>>>>>> job > >>>>>>>>>>> submission. Is > >>>>>>>>>>>> there some throttling of the rate at which > >>>>>>>> jobs > >>>>>>>>>>> are submitted to > >>>>>>>>>>>> the gatekeeper that could be done that would > >>>>>>>>>>> lighten this load > >>>>>>>>>>>> some? (Or has that already been done since > >>>>>>>>>>> earlier today?) The > >>>>>>>>>>>> current response times are not unacceptable, > >>>>>>>> but > >>>>>>>>>>> I'm hoping to > >>>>>>>>>>>> avoid having the machine grind to a halt as it > >>>>>>>> did > >>>>>>>>>>> earlier today. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> joe. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>> =================================================== > >>>>>>>>>>>> joseph a. > >>>>>>>>>>>> insley > >>>>>>>>>>> > >>>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>>> mathematics & computer science division > >>>>>>>>>>> (630) 252-5649 > >>>>>>>>>>>> argonne national laboratory > >>>>>>>>>>> (630) > >>>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>> =================================================== > >>>>>>>>>>> joseph a. 
insley > >>>>>>>>>>> > >>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>> mathematics & computer science division > >>>>>>>> (630) > >>>>>>>>>>> 252-5649 > >>>>>>>>>>> argonne national laboratory > >>>>>>>>>>> (630) > >>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>> ____________________________________________________________________________________ > >>>>>>> > >>>>>>>>>> Be a better friend, newshound, and > >>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>>>>>>> > >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-devel mailing list > >>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>> > >>>>>>>> > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-devel mailing list > >>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>> > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> ____________________________________________________________________________________ > >>>>>>> > >>>>>>> Never miss a thing. Make Yahoo your home page. > >>>>>>> http://www.yahoo.com/r/hs > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From leggett at mcs.anl.gov Mon Feb 4 10:55:48 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Mon, 4 Feb 2008 10:55:48 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202143654.17665.12.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <1202143654.17665.12.camel@blabla.mcs.anl.gov> Message-ID: load average is only an indication of what may be a problem. I've seen a load of 10000 on a machine and it still be very responsive because the processes weren't CPU hungry. So using load as a metric for determining acceptability is a small piece. In this case it should be the response of the gatekeeper. For instance, the inca jobs were timing out getting a response from the gatekeeper after 5 minutes. This is unacceptable. I would say as soon as it takes more than a minute for the GK to respond, back off. On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote: > > On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: >> Then I'd say we have very different levels of acceptable. > > Yes, that's why we're having this discussion. > >> A simple job >> submission test should never take longer than 5 minutes to complete >> and a load of 27 is not acceptable when the responsiveness of the >> machine is impacted. And since we're having this conversation, there >> is a perceived problem on our end so an adjustment to our definition >> of acceptable is needed. > > And we need to adjust our definition of not-acceptable. So we need to > meet in the middle. > > So, 25 (sustained) reasonably acceptable average load? That amounts to > about 13 hungry processes per cpu. 
Even with a 100Hz time slice, each > process would get 8 slices per second on average. > >> >> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: >> >>> >>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: >>>> That inca tests were timing out after 5 minutes and the load on the >>>> machine was ~27. How are you concluding when things aren't >>>> acceptable? >>> >>> It's got 2 cpus. So to me an average load of under 100 and the SSH >>> session being responsive looks fine. >>> >>> The fact that inca tests are timing out may be because inca has too >>> low >>> of a tolerance for things. >>> >>>> >>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: >>>> >>>>> That's odd. Clearly if that's not acceptable from your >>>>> perspective, >>>>> yet >>>>> I thought 130 are fine, there's a disconnect between what you >>>>> think is >>>>> acceptable and what I think is acceptable. >>>>> >>>>> What was that prompted you to conclude things are bad? >>>>> >>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: >>>>>> Around 80. >>>>>> >>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: >>>>>> >>>>>>> >>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: >>>>>>>> Sorry for killing the server. I'm pushing to get >>>>>>>> results to guide the selection of compounds for >>>>>>>> wet-lab testing. >>>>>>>> >>>>>>>> I had set the throttle.score.job.factor to 1 in the >>>>>>>> swift.properties file. >>>>>>> >>>>>>> Hmm. Ti, at the time of the massacre, how many did you kill? >>>>>>> >>>>>>> Mihael >>>>>>> >>>>>>>> >>>>>>>> I certainly appreciate everyone's efforts and >>>>>>>> responsiveness. >>>>>>>> >>>>>>>> Let me know what to try next, before I kill again. >>>>>>>> >>>>>>>> Cheers, >>>>>>>> >>>>>>>> Mike >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> --- Mihael Hategan wrote: >>>>>>>> >>>>>>>>> So I was trying some stuff on Friday night. I guess >>>>>>>>> I've found the >>>>>>>>> strategy on when to run the tests: when nobody else >>>>>>>>> has jobs there >>>>>>>>> (besides Buzz doing gridftp tests, Ioan having some >>>>>>>>> Falkon workers >>>>>>>>> running, and the occasional Inca tests). >>>>>>>>> >>>>>>>>> In any event, the machine jumps to about 100% >>>>>>>>> utilization at around 130 >>>>>>>>> jobs with pre-ws gram. So Mike, please set >>>>>>>>> throttle.score.job.factor to >>>>>>>>> 1 in swift.properties. >>>>>>>>> >>>>>>>>> There's still more work I need to do test-wise. >>>>>>>>> >>>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: >>>>>>>>>> Mike, You're killing tg-grid1 again. Can someone >>>>>>>>> work with Mike to get >>>>>>>>>> some swift settings that don't kill our server? >>>>>>>>>> >>>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: >>>>>>>>>> >>>>>>>>>>> Yes, I'm submitting molecular dynamics >>>>>>>>> simulations >>>>>>>>>>> using Swift. >>>>>>>>>>> >>>>>>>>>>> Is there a default wall-time limit for jobs on >>>>>>>>> tg-uc? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> --- joseph insley wrote: >>>>>>>>>>> >>>>>>>>>>>> Actually, these numbers are now escalating... 
>>>>>>>>>>>> >>>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load >>>>>>>>> average: >>>>>>>>>>>> 149.02, 123.63, 91.94 >>>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, >>>>>>>>> 0 >>>>>>>>>>>> stopped, 0 zombie >>>>>>>>>>>> >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>>> 479 >>>>>>>>>>>> >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>>> real 0m26.134s >>>>>>>>>>>> user 0m0.090s >>>>>>>>>>>> sys 0m0.010s >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley >>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the >>>>>>>>> UC/ANL >>>>>>>>>>>> TG GRAM host) >>>>>>>>>>>>> became unresponsive and had to be rebooted. I >>>>>>>>> am >>>>>>>>>>>> now seeing slow >>>>>>>>>>>>> response times from the Gatekeeper there >>>>>>>>> again. >>>>>>>>>>>> Authenticating to >>>>>>>>>>>>> the gatekeeper should only take a second or >>>>>>>>> two, >>>>>>>>>>>> but it is >>>>>>>>>>>>> periodically taking up to 16 seconds: >>>>>>>>>>>>> >>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r >>>>>>>>>>>> tg-grid.uc.teragrid.org >>>>>>>>>>>>> GRAM Authentication test successful >>>>>>>>>>>>> real 0m16.096s >>>>>>>>>>>>> user 0m0.060s >>>>>>>>>>>>> sys 0m0.020s >>>>>>>>>>>>> >>>>>>>>>>>>> looking at the load on tg-grid, it is rather >>>>>>>>> high: >>>>>>>>>>>>> >>>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load >>>>>>>>> average: >>>>>>>>>>>> 89.59, 78.69, 62.92 >>>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, >>>>>>>>> 0 >>>>>>>>>>>> stopped, 0 zombie >>>>>>>>>>>>> >>>>>>>>>>>>> And there appear to be a large number of >>>>>>>>> processes >>>>>>>>>>>> owned by kubal: >>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l >>>>>>>>>>>>> 380 >>>>>>>>>>>>> >>>>>>>>>>>>> I assume that Mike is using swift to do the >>>>>>>>> job >>>>>>>>>>>> submission. Is >>>>>>>>>>>>> there some throttling of the rate at which >>>>>>>>> jobs >>>>>>>>>>>> are submitted to >>>>>>>>>>>>> the gatekeeper that could be done that would >>>>>>>>>>>> lighten this load >>>>>>>>>>>>> some? (Or has that already been done since >>>>>>>>>>>> earlier today?) The >>>>>>>>>>>>> current response times are not unacceptable, >>>>>>>>> but >>>>>>>>>>>> I'm hoping to >>>>>>>>>>>>> avoid having the machine grind to a halt as it >>>>>>>>> did >>>>>>>>>>>> earlier today. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks, >>>>>>>>>>>>> joe. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>> =================================================== >>>>>>>>>>>>> joseph a. >>>>>>>>>>>>> insley >>>>>>>>>>>> >>>>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>>>> mathematics & computer science division >>>>>>>>>>>> (630) 252-5649 >>>>>>>>>>>>> argonne national laboratory >>>>>>>>>>>> (630) >>>>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>> =================================================== >>>>>>>>>>>> joseph a. 
insley >>>>>>>>>>>> >>>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>>> mathematics & computer science division >>>>>>>>> (630) >>>>>>>>>>>> 252-5649 >>>>>>>>>>>> argonne national laboratory >>>>>>>>>>>> (630) >>>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>>> ____________________________________________________________________________________ >>>>>>>>>>> Be a better friend, newshound, and >>>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>>>>> >>>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Swift-devel mailing list >>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>> >>>>>>>>> >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Swift-devel mailing list >>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>> >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> ____________________________________________________________________________________ >>>>>>>> Never miss a thing. Make Yahoo your home page. >>>>>>>> http://www.yahoo.com/r/hs >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > From hategan at mcs.anl.gov Mon Feb 4 11:27:15 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 11:27:15 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <1202143654.17665.12.camel@blabla.mcs.anl.gov> Message-ID: <1202146035.18610.0.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 10:55 -0600, Ti Leggett wrote: > load average is only an indication of what may be a problem. I've seen > a load of 10000 on a machine and it still be very responsive because > the processes weren't CPU hungry. So using load as a metric for > determining acceptability is a small piece. In this case it should be > the response of the gatekeeper. For instance, the inca jobs were > timing out getting a response from the gatekeeper after 5 minutes. > This is unacceptable. I would say as soon as it takes more than a > minute for the GK to respond, back off. Excellent. Now we have a useable metric and value. > > On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote: > > > > > On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: > >> Then I'd say we have very different levels of acceptable. > > > > Yes, that's why we're having this discussion. > > > >> A simple job > >> submission test should never take longer than 5 minutes to complete > >> and a load of 27 is not acceptable when the responsiveness of the > >> machine is impacted. And since we're having this conversation, there > >> is a perceived problem on our end so an adjustment to our definition > >> of acceptable is needed. > > > > And we need to adjust our definition of not-acceptable. So we need to > > meet in the middle. > > > > So, 25 (sustained) reasonably acceptable average load? That amounts to > > about 13 hungry processes per cpu. 
Even with a 100Hz time slice, each > > process would get 8 slices per second on average. > > > >> > >> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > >> > >>> > >>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > >>>> That inca tests were timing out after 5 minutes and the load on the > >>>> machine was ~27. How are you concluding when things aren't > >>>> acceptable? > >>> > >>> It's got 2 cpus. So to me an average load of under 100 and the SSH > >>> session being responsive looks fine. > >>> > >>> The fact that inca tests are timing out may be because inca has too > >>> low > >>> of a tolerance for things. > >>> > >>>> > >>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > >>>> > >>>>> That's odd. Clearly if that's not acceptable from your > >>>>> perspective, > >>>>> yet > >>>>> I thought 130 are fine, there's a disconnect between what you > >>>>> think is > >>>>> acceptable and what I think is acceptable. > >>>>> > >>>>> What was that prompted you to conclude things are bad? > >>>>> > >>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >>>>>> Around 80. > >>>>>> > >>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >>>>>> > >>>>>>> > >>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>>>>>> Sorry for killing the server. I'm pushing to get > >>>>>>>> results to guide the selection of compounds for > >>>>>>>> wet-lab testing. > >>>>>>>> > >>>>>>>> I had set the throttle.score.job.factor to 1 in the > >>>>>>>> swift.properties file. > >>>>>>> > >>>>>>> Hmm. Ti, at the time of the massacre, how many did you kill? > >>>>>>> > >>>>>>> Mihael > >>>>>>> > >>>>>>>> > >>>>>>>> I certainly appreciate everyone's efforts and > >>>>>>>> responsiveness. > >>>>>>>> > >>>>>>>> Let me know what to try next, before I kill again. > >>>>>>>> > >>>>>>>> Cheers, > >>>>>>>> > >>>>>>>> Mike > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> --- Mihael Hategan wrote: > >>>>>>>> > >>>>>>>>> So I was trying some stuff on Friday night. I guess > >>>>>>>>> I've found the > >>>>>>>>> strategy on when to run the tests: when nobody else > >>>>>>>>> has jobs there > >>>>>>>>> (besides Buzz doing gridftp tests, Ioan having some > >>>>>>>>> Falkon workers > >>>>>>>>> running, and the occasional Inca tests). > >>>>>>>>> > >>>>>>>>> In any event, the machine jumps to about 100% > >>>>>>>>> utilization at around 130 > >>>>>>>>> jobs with pre-ws gram. So Mike, please set > >>>>>>>>> throttle.score.job.factor to > >>>>>>>>> 1 in swift.properties. > >>>>>>>>> > >>>>>>>>> There's still more work I need to do test-wise. > >>>>>>>>> > >>>>>>>>> On Sun, 2008-02-03 at 15:34 -0600, Ti Leggett wrote: > >>>>>>>>>> Mike, You're killing tg-grid1 again. Can someone > >>>>>>>>> work with Mike to get > >>>>>>>>>> some swift settings that don't kill our server? > >>>>>>>>>> > >>>>>>>>>> On Jan 28, 2008, at 7:13 PM, Mike Kubal wrote: > >>>>>>>>>> > >>>>>>>>>>> Yes, I'm submitting molecular dynamics > >>>>>>>>> simulations > >>>>>>>>>>> using Swift. > >>>>>>>>>>> > >>>>>>>>>>> Is there a default wall-time limit for jobs on > >>>>>>>>> tg-uc? > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> --- joseph insley wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Actually, these numbers are now escalating... 
> >>>>>>>>>>>> > >>>>>>>>>>>> top - 17:18:54 up 2:29, 1 user, load > >>>>>>>>> average: > >>>>>>>>>>>> 149.02, 123.63, 91.94 > >>>>>>>>>>>> Tasks: 469 total, 4 running, 465 sleeping, > >>>>>>>>> 0 > >>>>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>>> > >>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>>> 479 > >>>>>>>>>>>> > >>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>>> real 0m26.134s > >>>>>>>>>>>> user 0m0.090s > >>>>>>>>>>>> sys 0m0.010s > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Jan 28, 2008, at 5:15 PM, joseph insley > >>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Earlier today tg-grid.uc.teragrid.org (the > >>>>>>>>> UC/ANL > >>>>>>>>>>>> TG GRAM host) > >>>>>>>>>>>>> became unresponsive and had to be rebooted. I > >>>>>>>>> am > >>>>>>>>>>>> now seeing slow > >>>>>>>>>>>>> response times from the Gatekeeper there > >>>>>>>>> again. > >>>>>>>>>>>> Authenticating to > >>>>>>>>>>>>> the gatekeeper should only take a second or > >>>>>>>>> two, > >>>>>>>>>>>> but it is > >>>>>>>>>>>>> periodically taking up to 16 seconds: > >>>>>>>>>>>>> > >>>>>>>>>>>>> insley at tg-viz-login1:~> time globusrun -a -r > >>>>>>>>>>>> tg-grid.uc.teragrid.org > >>>>>>>>>>>>> GRAM Authentication test successful > >>>>>>>>>>>>> real 0m16.096s > >>>>>>>>>>>>> user 0m0.060s > >>>>>>>>>>>>> sys 0m0.020s > >>>>>>>>>>>>> > >>>>>>>>>>>>> looking at the load on tg-grid, it is rather > >>>>>>>>> high: > >>>>>>>>>>>>> > >>>>>>>>>>>>> top - 16:55:26 up 2:06, 1 user, load > >>>>>>>>> average: > >>>>>>>>>>>> 89.59, 78.69, 62.92 > >>>>>>>>>>>>> Tasks: 398 total, 20 running, 378 sleeping, > >>>>>>>>> 0 > >>>>>>>>>>>> stopped, 0 zombie > >>>>>>>>>>>>> > >>>>>>>>>>>>> And there appear to be a large number of > >>>>>>>>> processes > >>>>>>>>>>>> owned by kubal: > >>>>>>>>>>>>> insley at tg-grid1:~> ps -ef | grep kubal | wc -l > >>>>>>>>>>>>> 380 > >>>>>>>>>>>>> > >>>>>>>>>>>>> I assume that Mike is using swift to do the > >>>>>>>>> job > >>>>>>>>>>>> submission. Is > >>>>>>>>>>>>> there some throttling of the rate at which > >>>>>>>>> jobs > >>>>>>>>>>>> are submitted to > >>>>>>>>>>>>> the gatekeeper that could be done that would > >>>>>>>>>>>> lighten this load > >>>>>>>>>>>>> some? (Or has that already been done since > >>>>>>>>>>>> earlier today?) The > >>>>>>>>>>>>> current response times are not unacceptable, > >>>>>>>>> but > >>>>>>>>>>>> I'm hoping to > >>>>>>>>>>>>> avoid having the machine grind to a halt as it > >>>>>>>>> did > >>>>>>>>>>>> earlier today. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> joe. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>> =================================================== > >>>>>>>>>>>>> joseph a. > >>>>>>>>>>>>> insley > >>>>>>>>>>>> > >>>>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>>>> mathematics & computer science division > >>>>>>>>>>>> (630) 252-5649 > >>>>>>>>>>>>> argonne national laboratory > >>>>>>>>>>>> (630) > >>>>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>> =================================================== > >>>>>>>>>>>> joseph a. 
insley > >>>>>>>>>>>> > >>>>>>>>>>>> insley at mcs.anl.gov > >>>>>>>>>>>> mathematics & computer science division > >>>>>>>>> (630) > >>>>>>>>>>>> 252-5649 > >>>>>>>>>>>> argonne national laboratory > >>>>>>>>>>>> (630) > >>>>>>>>>>>> 252-5986 (fax) > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>> ____________________________________________________________________________________ > >>>>>>>>>>> Be a better friend, newshound, and > >>>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>>>>>>>> > >>>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Swift-devel mailing list > >>>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-devel mailing list > >>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>> > >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> ____________________________________________________________________________________ > >>>>>>>> Never miss a thing. Make Yahoo your home page. > >>>>>>>> http://www.yahoo.com/r/hs > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > From mikekubal at yahoo.com Mon Feb 4 12:30:07 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Mon, 4 Feb 2008 10:30:07 -0800 (PST) Subject: [Swift-devel] throttle.score.job.transfer Message-ID: <424515.93529.qm@web52305.mail.re2.yahoo.com> I attempted to run a job with throttle.score.job.transfer of .5 and the job failed with the following: Execution failed: Could not convert value to number: .5 Caused by: For input string: ".5" -MikeK ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From benc at hawaga.org.uk Mon Feb 4 12:40:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 18:40:19 +0000 (GMT) Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: <424515.93529.qm@web52305.mail.re2.yahoo.com> References: <424515.93529.qm@web52305.mail.re2.yahoo.com> Message-ID: On Mon, 4 Feb 2008, Mike Kubal wrote: > I attempted to run a job with > throttle.score.job.transfer of .5 and the job failed > with the following: > > Execution failed: > Could not convert value to number: .5 > Caused by: > For input string: ".5" yeah, turns out its an integer only field. my bad for telling you otherwise. -- From benc at hawaga.org.uk Mon Feb 4 12:44:38 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 18:44:38 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: you can also try out gram4 as follows: * get swift r1609 from SVN * set a site entry like this: /home/benc TG-CCR080002N Change the project key to a project that you are on (or, if you have a default project, you can remove it). 
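The XML of the site entry above was flattened when this message was archived; only the work directory (/home/benc) and the project key (TG-CCR080002N) survive. Below is a rough sketch of the kind of GRAM4 pool entry being described, with the handle, host names and exact element/attribute names guessed from memory rather than taken from r1609, so treat it purely as an illustration:

<pool handle="uc-teragrid">
  <execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
  <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
  <workdirectory>/home/benc</workdirectory>
  <profile namespace="globus" key="project">TG-CCR080002N</profile>
</pool>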
I've run this from from teraport submitting to TG-UC using the default load parameters and it has made it through 730 or so jobs of a 1000 node workflow without apparently excessive load (its still running - also I got some ftp failures, but job retry should handle those) -- From mikekubal at yahoo.com Mon Feb 4 12:49:23 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Mon, 4 Feb 2008 10:49:23 -0800 (PST) Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: Message-ID: <690293.82602.qm@web52311.mail.re2.yahoo.com> Unless there are any objections, I'd like to submit a maximum of 21 jobs to the UC -teragrid with the throttling thresholds limited to the following so a baseline metric could be established: throttle.submit = 2 (default 4) throttle.host.submit = 1 (default 2) throttle.score.job.factor = 1 (default 4) throttle.transfers = 2 (default 4) throttle.file.operations = 4 (default 8) Ti, I'll let you know as soon as I launch it. Depending on how this goes, I plan to try Ben's local PB approach next. Cheers, Mike --- Ben Clifford wrote: > > > On Mon, 4 Feb 2008, Mike Kubal wrote: > > > I attempted to run a job with > > throttle.score.job.transfer of .5 and the job > failed > > with the following: > > > > Execution failed: > > Could not convert value to number: .5 > > Caused by: > > For input string: ".5" > > yeah, turns out its an integer only field. my bad > for telling you > otherwise. > > -- > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From mikekubal at yahoo.com Mon Feb 4 12:55:06 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Mon, 4 Feb 2008 10:55:06 -0800 (PST) Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: <690293.82602.qm@web52311.mail.re2.yahoo.com> Message-ID: <767735.55578.qm@web52312.mail.re2.yahoo.com> or I'll try Ben's r1609 approach, unless folks would like a baseline. --- Mike Kubal wrote: > Unless there are any objections, I'd like to submit > a > maximum of 21 jobs to the UC -teragrid with the > throttling thresholds limited to the following so a > baseline metric could be established: > > throttle.submit = 2 (default 4) > throttle.host.submit = 1 (default 2) > throttle.score.job.factor = 1 (default 4) > throttle.transfers = 2 (default 4) > throttle.file.operations = 4 (default 8) > > Ti, I'll let you know as soon as I launch it. > > Depending on how this goes, I plan to try Ben's > local > PB approach next. > > Cheers, > > Mike > > --- Ben Clifford wrote: > > > > > > > On Mon, 4 Feb 2008, Mike Kubal wrote: > > > > > I attempted to run a job with > > > throttle.score.job.transfer of .5 and the job > > failed > > > with the following: > > > > > > Execution failed: > > > Could not convert value to number: .5 > > > Caused by: > > > For input string: ".5" > > > > yeah, turns out its an integer only field. my bad > > for telling you > > otherwise. > > > > -- > > > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. 
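Written out as a swift.properties stanza, the baseline Mike proposes above would look roughly like this (a sketch assuming the property names are spelled exactly as in his list; note, per the exchange just above, that throttle.score.job.transfer at least is an integer-only field, so fractional values such as .5 are rejected):

  throttle.submit = 2
  throttle.host.submit = 1
  throttle.score.job.factor = 1
  throttle.transfers = 2
  throttle.file.operations = 4

The defaults Mike quotes are 4, 2, 4, 4 and 8 respectively, so this is roughly a factor-of-two reduction across the board.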
http://www.yahoo.com/r/hs From benc at hawaga.org.uk Mon Feb 4 12:59:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 18:59:40 +0000 (GMT) Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: <767735.55578.qm@web52312.mail.re2.yahoo.com> References: <767735.55578.qm@web52312.mail.re2.yahoo.com> Message-ID: On Mon, 4 Feb 2008, Mike Kubal wrote: > or I'll try Ben's r1609 approach, unless folks would > like a baseline. I think trying PBS and GRAM4 are better things for you to do than continue spending time with GRAM2. A comparison of how PBS and GRAM4 weigh up would be very interesting (to me). When you run anything there, please save your log files - I can do interesting things with them (for example, watching how the Swift internal scheduler is behaving). Also if you are getting kickstart records, save those too. There's a single commandline to type to do this, at the bottom of the user guide: rsync --ignore-existing *.log *-kickstart.xml login.ci.uchicago.edu:/home/benc/swift-logs/ --verbose -- From benc at hawaga.org.uk Mon Feb 4 13:27:31 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 4 Feb 2008 19:27:31 +0000 (GMT) Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: On Mon, 4 Feb 2008, Ben Clifford wrote: > I've run this from from teraport submitting to TG-UC using the default > load parameters and it has made it through 730 or so jobs of a 1000 node > workflow without apparently excessive load (its still running - also I got > some ftp failures, but job retry should handle those) actually, that run got stuck - it seems to have lost one job (as in, it isn't in the PBS queue but Swift still thinks its in progress) I'll look at that closer somewhat later. -- From hategan at mcs.anl.gov Mon Feb 4 14:43:21 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 14:43:21 -0600 Subject: [Swift-devel] throttle.score.job.transfer In-Reply-To: <424515.93529.qm@web52305.mail.re2.yahoo.com> References: <424515.93529.qm@web52305.mail.re2.yahoo.com> Message-ID: <1202157801.20465.1.camel@blabla.mcs.anl.gov> Yes. That will not work. You need integral numbers there. I will fix this hopefully tonight. On Mon, 2008-02-04 at 10:30 -0800, Mike Kubal wrote: > I attempted to run a job with > throttle.score.job.transfer of .5 and the job failed > with the following: > > Execution failed: > Could not convert value to number: .5 > Caused by: > For input string: ".5" > > -MikeK > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Feb 4 14:45:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 14:45:06 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> Message-ID: <1202157906.20465.4.camel@blabla.mcs.anl.gov> For that to work, you'll need to fetch Swift from SVN. 
You can find instructions on how to do that here: http://www.ci.uchicago.edu/swift/downloads/index.php Mihael On Mon, 2008-02-04 at 18:44 +0000, Ben Clifford wrote: > you can also try out gram4 as follows: > > * get swift r1609 from SVN > > * set a site entry like this: > > gridlaunch="/home/wilde/vds/mystart"> > storage="/home/benc" maj > or="2" minor="2" /> > url="tg-grid.uc.teragrid.org" /> > /home/benc > TG-CCR080002N > > > Change the project key to a project that you are on (or, if you have a > default project, you can remove it). > > I've run this from from teraport submitting to TG-UC using the default > load parameters and it has made it through 730 or so jobs of a 1000 node > workflow without apparently excessive load (its still running - also I got > some ftp failures, but job retry should handle those) > From hategan at mcs.anl.gov Mon Feb 4 16:32:32 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 16:32:32 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202143711.17665.13.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <47A73DFF.3010402@mcs.anl.gov> <1202143711.17665.13.camel@blabla.mcs.anl.gov> Message-ID: <1202164352.22470.4.camel@blabla.mcs.anl.gov> So WS-GRAM in terms of machine load seems to work better (i.e. barely visible), which is to be expected. Swift does however run out of memory faster. Whereas I could safely (from the client side perspective) run 256 parallel jobs with the default 64M of heap space, with WS-GRAM it dies. I don't have an exact dependence of load vs. number of jobs yet, but I'll be working on that. Mihael On Mon, 2008-02-04 at 10:48 -0600, Mihael Hategan wrote: > Yes, and I will. But unless we're completely dropping support for pre-ws > GRAM, we still need to do this. > > > On Mon, 2008-02-04 at 10:31 -0600, Ian Foster wrote: > > It would be really wonderful if someone can try GRAM4, which we believe > > addresses this problem. > > > > Ian. > > > > Ti Leggett wrote: > > > Then I'd say we have very different levels of acceptable. A simple job > > > submission test should never take longer than 5 minutes to complete > > > and a load of 27 is not acceptable when the responsiveness of the > > > machine is impacted. And since we're having this conversation, there > > > is a perceived problem on our end so an adjustment to our definition > > > of acceptable is needed. > > > > > > On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > > > > > >> > > >> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > > >>> That inca tests were timing out after 5 minutes and the load on the > > >>> machine was ~27. How are you concluding when things aren't acceptable? > > >> > > >> It's got 2 cpus. So to me an average load of under 100 and the SSH > > >> session being responsive looks fine. > > >> > > >> The fact that inca tests are timing out may be because inca has too low > > >> of a tolerance for things. > > >> > > >>> > > >>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > > >>> > > >>>> That's odd. 
> > >>>>>>>> > > >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> _______________________________________________ > > >>>>>>>>> Swift-devel mailing list > > >>>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>> _______________________________________________ > > >>>>>>>> Swift-devel mailing list > > >>>>>>>> Swift-devel at ci.uchicago.edu > > >>>>>>>> > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> ____________________________________________________________________________________ > > >>>>>>> > > >>>>>>> Never miss a thing. Make Yahoo your home page. > > >>>>>>> http://www.yahoo.com/r/hs > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Mon Feb 4 17:16:05 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 04 Feb 2008 17:16:05 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202164352.22470.4.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <47A73DFF.3010402@mcs.anl.gov> <1202143711.17665.13.camel@blabla.mcs.anl.gov> <1202164352.22470.4.camel@blabla.mcs.anl.gov> Message-ID: <1202166965.22912.0.camel@blabla.mcs.anl.gov> On Mon, 2008-02-04 at 16:32 -0600, Mihael Hategan wrote: > So WS-GRAM in terms of machine load seems to work better (i.e. barely > visible), which is to be expected. Swift does however run out of memory > faster. Whereas I could safely (from the client side perspective) run > 256 parallel jobs with ... pre-WS-GRAM and... > the default 64M of heap space, with WS-GRAM it > dies. > > I don't have an exact dependence of load vs. number of jobs yet, but > I'll be working on that. > > Mihael > > On Mon, 2008-02-04 at 10:48 -0600, Mihael Hategan wrote: > > Yes, and I will. But unless we're completely dropping support for pre-ws > > GRAM, we still need to do this. > > > > > > On Mon, 2008-02-04 at 10:31 -0600, Ian Foster wrote: > > > It would be really wonderful if someone can try GRAM4, which we believe > > > addresses this problem. > > > > > > Ian. > > > > > > Ti Leggett wrote: > > > > Then I'd say we have very different levels of acceptable. A simple job > > > > submission test should never take longer than 5 minutes to complete > > > > and a load of 27 is not acceptable when the responsiveness of the > > > > machine is impacted. And since we're having this conversation, there > > > > is a perceived problem on our end so an adjustment to our definition > > > > of acceptable is needed. 
insley > > > >>>>>>>>>>> > > > >>>>>>>>>>> insley at mcs.anl.gov > > > >>>>>>>>>>> mathematics & computer science division > > > >>>>>>>> (630) > > > >>>>>>>>>>> 252-5649 > > > >>>>>>>>>>> argonne national laboratory > > > >>>>>>>>>>> (630) > > > >>>>>>>>>>> 252-5986 (fax) > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>> > > > >>>>>>> ____________________________________________________________________________________ > > > >>>>>>> > > > >>>>>>>>>> Be a better friend, newshound, and > > > >>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. > > > >>>>>>>> > > > >>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> _______________________________________________ > > > >>>>>>>>> Swift-devel mailing list > > > >>>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> _______________________________________________ > > > >>>>>>>> Swift-devel mailing list > > > >>>>>>>> Swift-devel at ci.uchicago.edu > > > >>>>>>>> > > > >>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> ____________________________________________________________________________________ > > > >>>>>>> > > > >>>>>>> Never miss a thing. Make Yahoo your home page. > > > >>>>>>> http://www.yahoo.com/r/hs > > > >>>>>>> > > > >>>>>> > > > >>>>> > > > >>>> > > > >>> > > > >> > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Tue Feb 5 14:21:01 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 05 Feb 2008 14:21:01 -0600 Subject: [Swift-devel] RFF (request for feature) In-Reply-To: References: Message-ID: <1202242861.4718.9.camel@blabla.mcs.anl.gov> You can probably simulate lots of these with arrays. For example: int queue[]; foreach i in [0:100] { queue[i] = ...; } foreach x in queue { f(x); } On Mon, 2008-01-28 at 15:51 -0600, Tiberiu Stef-Praun wrote: > Hi gang, > > I find myself in the need for a queuing facility in swift with the > following operations: > > createQ > submitQ(function) > triggerQ(function, #jobs in queue) - to signal empty queues, for instance > deleteQ > > I would think that in addition to atomic functions and composite > functions, we will have the queue facility acting as an intermediary. > > Is any of this possible/doable in a data-flow language ? 
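To make the array-as-queue idiom from the reply above a little more concrete, here is an untested SwiftScript sketch; the type, the app name and the mapper use are invented for illustration and are not from this thread. The first foreach plays the role of submitQ by filling the array, and each iteration of the second foreach fires as soon as its element has been assigned:

  type file;

  app (file out) consume (int x) {
      echo x stdout=@filename(out);
  }

  int queue[];

  foreach i in [0:99] {
      queue[i] = 2 * i;                  // producer: the "submitQ" side
  }

  foreach x, i in queue {
      file out <single_file_mapper; file=@strcat("item-", i, ".out")>;
      out = consume(x);                  // consumer runs once per element
  }

The array is closed once the producing foreach finishes, which is roughly the analogue of deleteQ; something like triggerQ, reacting to the number of jobs still queued, does not map as naturally onto dataflow.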
> > Thanks > Tibi > From hategan at mcs.anl.gov Thu Feb 7 15:24:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 07 Feb 2008 15:24:50 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <1202143654.17665.12.camel@blabla.mcs.anl.gov> Message-ID: <1202419491.13362.5.camel@blabla.mcs.anl.gov> Ok, so I'll change the scheduler feedback loop to aim towards a 20 s max submission time. This should apply nicely to all providers. Any objections? On Mon, 2008-02-04 at 10:55 -0600, Ti Leggett wrote: > load average is only an indication of what may be a problem. I've seen > a load of 10000 on a machine and it still be very responsive because > the processes weren't CPU hungry. So using load as a metric for > determining acceptability is a small piece. In this case it should be > the response of the gatekeeper. For instance, the inca jobs were > timing out getting a response from the gatekeeper after 5 minutes. > This is unacceptable. I would say as soon as it takes more than a > minute for the GK to respond, back off. > > On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote: > > > > > On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: > >> Then I'd say we have very different levels of acceptable. > > > > Yes, that's why we're having this discussion. > > > >> A simple job > >> submission test should never take longer than 5 minutes to complete > >> and a load of 27 is not acceptable when the responsiveness of the > >> machine is impacted. And since we're having this conversation, there > >> is a perceived problem on our end so an adjustment to our definition > >> of acceptable is needed. > > > > And we need to adjust our definition of not-acceptable. So we need to > > meet in the middle. > > > > So, 25 (sustained) reasonably acceptable average load? That amounts to > > about 13 hungry processes per cpu. Even with a 100Hz time slice, each > > process would get 8 slices per second on average. > > > >> > >> On Feb 4, 2008, at 10:18 AM, Mihael Hategan wrote: > >> > >>> > >>> On Mon, 2008-02-04 at 09:58 -0600, Ti Leggett wrote: > >>>> That inca tests were timing out after 5 minutes and the load on the > >>>> machine was ~27. How are you concluding when things aren't > >>>> acceptable? > >>> > >>> It's got 2 cpus. So to me an average load of under 100 and the SSH > >>> session being responsive looks fine. > >>> > >>> The fact that inca tests are timing out may be because inca has too > >>> low > >>> of a tolerance for things. > >>> > >>>> > >>>> On Feb 4, 2008, at 9:30 AM, Mihael Hategan wrote: > >>>> > >>>>> That's odd. Clearly if that's not acceptable from your > >>>>> perspective, > >>>>> yet > >>>>> I thought 130 are fine, there's a disconnect between what you > >>>>> think is > >>>>> acceptable and what I think is acceptable. > >>>>> > >>>>> What was that prompted you to conclude things are bad? > >>>>> > >>>>> On Mon, 2008-02-04 at 07:16 -0600, Ti Leggett wrote: > >>>>>> Around 80. > >>>>>> > >>>>>> On Feb 4, 2008, at 12:14 AM, Mihael Hategan wrote: > >>>>>> > >>>>>>> > >>>>>>> On Sun, 2008-02-03 at 22:11 -0800, Mike Kubal wrote: > >>>>>>>> Sorry for killing the server. 
> >>>>>>>>> > >>>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Swift-devel mailing list > >>>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> Swift-devel mailing list > >>>>>>>>> Swift-devel at ci.uchicago.edu > >>>>>>>>> > >>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> ____________________________________________________________________________________ > >>>>>>>> Never miss a thing. Make Yahoo your home page. > >>>>>>>> http://www.yahoo.com/r/hs > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > > From leggett at mcs.anl.gov Thu Feb 7 15:34:46 2008 From: leggett at mcs.anl.gov (Ti Leggett) Date: Thu, 7 Feb 2008 15:34:46 -0600 Subject: [Swift-devel] Re: Swift jobs on UC/ANL TG In-Reply-To: <1202419491.13362.5.camel@blabla.mcs.anl.gov> References: <548830.35963.qm@web52311.mail.re2.yahoo.com> <1202105649.15397.46.camel@blabla.mcs.anl.gov> <4915E73B-B112-46E7-A83C-352D6A3F9C1E@mcs.anl.gov> <1202139054.16407.5.camel@blabla.mcs.anl.gov> <814F55AE-BAE4-402E-BDF6-B31D0B4DD17E@mcs.anl.gov> <1202141916.17237.4.camel@blabla.mcs.anl.gov> <963B4080-FDDC-4B42-8D2E-1EEF6FB10755@mcs.anl.gov> <1202143654.17665.12.camel@blabla.mcs.anl.gov> <1202419491.13362.5.camel@blabla.mcs.anl.gov> Message-ID: <09616F48-D103-43FB-9E2C-9FFC470D0AC5@mcs.anl.gov> This sounds like a good place to start. On Feb 7, 2008, at 3:24 PM, Mihael Hategan wrote: > Ok, so I'll change the scheduler feedback loop to aim towards a 20 s > max > submission time. This should apply nicely to all providers. > > Any objections? > > On Mon, 2008-02-04 at 10:55 -0600, Ti Leggett wrote: >> load average is only an indication of what may be a problem. I've >> seen >> a load of 10000 on a machine and it still be very responsive because >> the processes weren't CPU hungry. So using load as a metric for >> determining acceptability is a small piece. In this case it should be >> the response of the gatekeeper. For instance, the inca jobs were >> timing out getting a response from the gatekeeper after 5 minutes. >> This is unacceptable. I would say as soon as it takes more than a >> minute for the GK to respond, back off. >> >> On Feb 4, 2008, at 10:47 AM, Mihael Hategan wrote: >> >>> >>> On Mon, 2008-02-04 at 10:28 -0600, Ti Leggett wrote: >>>> Then I'd say we have very different levels of acceptable. >>> >>> Yes, that's why we're having this discussion. >>> >>>> A simple job >>>> submission test should never take longer than 5 minutes to complete >>>> and a load of 27 is not acceptable when the responsiveness of the >>>> machine is impacted. And since we're having this conversation, >>>> there >>>> is a perceived problem on our end so an adjustment to our >>>> definition >>>> of acceptable is needed. >>> >>> And we need to adjust our definition of not-acceptable. So we need >>> to >>> meet in the middle. >>> >>> So, 25 (sustained) reasonably acceptable average load? That >>> amounts to >>> about 13 hungry processes per cpu. Even with a 100Hz time slice, >>> each >>> process would get 8 slices per second on average. 
insley >>>>>>>>>>>>>> >>>>>>>>>>>>>> insley at mcs.anl.gov >>>>>>>>>>>>>> mathematics & computer science division >>>>>>>>>>> (630) >>>>>>>>>>>>>> 252-5649 >>>>>>>>>>>>>> argonne national laboratory >>>>>>>>>>>>>> (630) >>>>>>>>>>>>>> 252-5986 (fax) >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> ____________________________________________________________________________________ >>>>>>>>>>>>> Be a better friend, newshound, and >>>>>>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. >>>>>>>>>>> >>>>>>>>>> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Swift-devel mailing list >>>>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Swift-devel mailing list >>>>>>>>>>> Swift-devel at ci.uchicago.edu >>>>>>>>>>> >>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ____________________________________________________________________________________ >>>>>>>>>> Never miss a thing. Make Yahoo your home page. >>>>>>>>>> http://www.yahoo.com/r/hs >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > From hategan at mcs.anl.gov Thu Feb 7 20:34:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 07 Feb 2008 20:34:13 -0600 Subject: [Swift-devel] ws-gram tests Message-ID: <1202438053.26812.12.camel@blabla.mcs.anl.gov> I did a 1024 job run today with ws-gram. I painted the results here: http://www-unix.mcs.anl.gov/~hategan/s/g.html Seems like client memory per job is about 370k. Which is quite a lot. What kinda worries me is that it doesn't seem to go down after the jobs are done, so maybe there's a memory leak, or maybe the garbage collector doesn't do any major collections. I'll need to profile this to see exactly what we're talking about. The container memory is figured by looking at the process in /proc. It's total memory including shared libraries and things. But libraries take a fixed amount of space, so a fuzzy correlation can probably be made. It looks quite similar to the amount of memory eaten on the client side (per job). CPU-load-wise, WS-GRAM behaves. There is some work during the time the jobs are submitted, but the machine itself seems responsive. I have yet to plot the exact submission time for each job. So at this point I would recommend trying ws-gram as long as there aren't too many jobs involved (i.e. under 4000 parallel jobs), and while making sure the jvm has enough heap. More than that seems like a gamble. Mihael From hategan at mcs.anl.gov Thu Feb 7 20:41:33 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 07 Feb 2008 20:41:33 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202438053.26812.12.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> Message-ID: <1202438493.27139.0.camel@blabla.mcs.anl.gov> > So at this point I would recommend trying ws-gram as long as there > aren't too many jobs involved (i.e. under 4000 parallel jobs), .. actually submitted jobs. This may be somewhat unlikely to occur. > and while > making sure the jvm has enough heap. More than that seems like a gamble. 
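As a back-of-envelope check on what "enough heap" means with the ~370k-per-job figure above (rough numbers only; the real per-job cost will vary with the job descriptions and with whether that memory is ever reclaimed):

   256 jobs x ~370 KB = about 95 MB   (which fits the earlier observation that 256 jobs run in the default 64M heap with pre-WS GRAM but die with WS-GRAM)
  1024 jobs x ~370 KB = about 380 MB
  4000 jobs x ~370 KB = about 1.5 GB

So a run on the scale of the 1024-job test wants something like -Xmx512m on the client JVM instead of the stock 64M; exactly how that flag gets passed depends on the launcher script in the release being used.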
> > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri Feb 8 08:01:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 14:01:17 +0000 (GMT) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202438053.26812.12.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> Message-ID: some rough numbers in this space that I collected yesterday: I ran a 10000 (10^5) parallel jobs workflow on teraport using the PBS provider. It launched up to 401 jobs at once, as per the default config file. It took about 6h but ran ok. I didn't keep any other statistics, though - just set it going in the morning and let it run. That's a throughput of about one job every couple of seconds. My laptop can do about 5000 of the same kind of job through the fork provider, running at most 2 jobs at once, in about 15 minutes. -- From foster at mcs.anl.gov Fri Feb 8 09:19:21 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Fri, 08 Feb 2008 09:19:21 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202438053.26812.12.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> Message-ID: <47AC72F9.8010701@mcs.anl.gov> Mihael: That's great, thanks! Ian. Mihael Hategan wrote: > I did a 1024 job run today with ws-gram. > I painted the results here: > http://www-unix.mcs.anl.gov/~hategan/s/g.html > > Seems like client memory per job is about 370k. Which is quite a lot. > What kinda worries me is that it doesn't seem to go down after the jobs > are done, so maybe there's a memory leak, or maybe the garbage collector > doesn't do any major collections. I'll need to profile this to see > exactly what we're talking about. > > The container memory is figured by looking at the process in /proc. It's > total memory including shared libraries and things. But libraries take a > fixed amount of space, so a fuzzy correlation can probably be made. It > looks quite similar to the amount of memory eaten on the client side > (per job). > > CPU-load-wise, WS-GRAM behaves. There is some work during the time the > jobs are submitted, but the machine itself seems responsive. I have yet > to plot the exact submission time for each job. > > So at this point I would recommend trying ws-gram as long as there > aren't too many jobs involved (i.e. under 4000 parallel jobs), and while > making sure the jvm has enough heap. More than that seems like a gamble. > > Mihael > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Fri Feb 8 09:33:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 09:33:43 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> Message-ID: <1202484823.4800.2.camel@blabla.mcs.anl.gov> I disabled the job throttle in this case. I also didn't consider failures (restarts were at 0). In the last run I got exactly one failure, but in some previous runs (with 256 jobs) I got more. All that needs to be debugged. On Fri, 2008-02-08 at 14:01 +0000, Ben Clifford wrote: > some rough numbers in this space that I collected yesterday: > > I ran a 10000 (10^5) parallel jobs workflow on teraport using the PBS > provider. 
It launched up to 401 jobs at once, as per the default config > file. It took about 6h but ran ok. I didn't keep any other statistics, > though - just set it going in the morning and let it run. That's a > throughput of about one job every couple of seconds. > > My laptop can do about 5000 of the same kind of job through the fork > provider, running at most 2 jobs at once, in about 15 minutes. > From smartin at mcs.anl.gov Fri Feb 8 09:33:53 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Fri, 8 Feb 2008 09:33:53 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <47AC72F9.8010701@mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> Message-ID: <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> Mihael, Glad to hear things are improved with GRAM4. Lets keep going to have swift using GRAM4 routinely. Below is a recent thread that looked at this exact issue with condor- g. But it is entirely relevant to your use of GRAM4. the 2 issues to look for are 1) your use of notifications >> I ran one more test with the improved callback code. This time, I >> stopped storing the notification producer EPRs associated with the >> GRAM job resources. Memory usage went down markedly. 2) you could avoid notifications and instead do client-side polling for job state. This has shown to be more reliable than notifications under heavy loads, condor-g processing 1000s of jobs. The core team will be looking at improving notifications once their other 4.2 deliverables are done. -Stu Begin forwarded message: > From: feller at mcs.anl.gov > Date: February 1, 2008 9:41:05 AM CST > To: "Jaime Frey" > Cc: "Stuart Martin" , "Terrence Martin" >, "Martin Feller" , "charles bacon" >, "Suchandra Thapa" , "Rob Gardner" >, "Jeff Porter" , "Alain Roy" , > "Todd Tannenbaum" , "Miron Livny" > > Subject: Re: Condor-G WS GRAM memory usage > >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >>> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >>>> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM >>>>> raised some concerns about memory usage on the client side. I did >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared >>>>> to be the primary memory consumer. The GAHP server is a wrapper >>>>> around the java client libraries for WS GRAM. >>>>> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a >>>>> time. The jobs were 2-minute sleep jobs with minimal data >>>>> transfer. All of the jobs overlapped in submission and execution. >>>>> Here is what I've discovered so far. >>>>> >>>>> Aside from the heap available to the java code, the jvm used 117 >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >>>>> >>>>> The maximum jvm heap usage (as reported by the garbage collector) >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >>>>> quiescent (jobs executing, Condor-G waiting for them to complete), >>>>> heap usage was about 5 megs plus 0.6 megs per job. >>>>> >>>>> The only long-term memory per job that I know of in the GAHP is >>>>> for the notification sink for job status callbacks. 600kb seems a >>>>> little high for that. Stu, could someone on Globus help us >>>>> determine if we're using the notification sinks inefficiently? >>>> >>>> Martin just looked and for the most part, there is nothing wrong >>>> with how condor-g manages the callback sink. 
>>>> However, one improvement that would reduce the memory used per job >>>> would be to not have a notification consumer per job. Instead use >>>> one for all jobs. >>>> >>>> Also, Martin recently did some analysis on condor-g stress tests >>>> and found that notifications are building up on the in the GRAM4 >>>> service container and that is causing delays which seem to be >>>> causing multiple problems. We're looking at this in a separate >>>> effort with the GT Core team. But, after this was clear, Martin >>>> re- >>>> ran the condor-g test and relied on polling between condor-g and >>>> the GRAM4 service instead of notifications. Jaime, could you >>>> repeat the no-notification test and see the difference in memory? >>>> The changes would be to increase the polling frequency in condor-g >>>> and comment out the subscribe for notification. You could also >>>> comment out the notification listener call(s) too. >>> >>> >>> I did two new sets of tests today. The first used more efficient >>> callback code in the GAHP (one notification consumer rather than one >>> per job). The second disabled notifications and relied on polling >>> for job status changes. >>> >>> The more efficient callback code did not produce a noticeable >>> reduction in memory usage. >>> >>> Disabling notifications did reduce memory usage. The maximum jvm >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum >>> heap usage after job submission and before job completion was about >>> 4 megs + 0.1 megs per job. >> >> >> I ran one more test with the improved callback code. This time, I >> stopped storing the notification producer EPRs associated with the >> GRAM job resources. Memory usage went down markedly. >> >> I was told the client had to explicitly destroy these serve-side >> notification producer resources when it destroys the job, otherwise >> they hang around bogging down the server. Is this still the case? The >> server can't destroy notification producers when their sources of >> information are destroyed? >> > > This reminds me of the odd fact that i had to suddenly grant much more > memory to Condor-g as soon as condor-g started storing EPRs of > subscription resources to be able to destroy them eventually. > Those EPR's are maybe not so tiny as they look like. > > For 4.0: yes, currently you'll have to store and eventually destroy > subscription resources manually to avoid heaping up persistence data > on the server-side. > For 4.2: no, you won't have to store them. A job resource will > destroy all subscription resources when it's destroyed. > > Overall i suggest to concentrate on 4.2 gram since the "container > hangs in job destruction" problem won't exist anymore. > > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > sense > for us to do the 4.2-related changes in Gahp and hand it to you for > fine-tuning then? > > Martin On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > Mihael: > > That's great, thanks! > > Ian. > > Mihael Hategan wrote: >> I did a 1024 job run today with ws-gram. >> I painted the results here: >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> Seems like client memory per job is about 370k. Which is quite a lot. >> What kinda worries me is that it doesn't seem to go down after the >> jobs >> are done, so maybe there's a memory leak, or maybe the garbage >> collector >> doesn't do any major collections. I'll need to profile this to see >> exactly what we're talking about. 
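(One cheap way to tell those two cases apart -- a real leak versus the collector simply not having run a major collection yet -- is to force full GCs after the run and see whether used heap falls back. The following is a generic JVM sketch, not Swift or GRAM code; the job-submission part is elided.)

    // Rough check: is per-job memory still reachable after the run, or
    // merely waiting for a major collection? Generic Java, illustrative only.
    public class HeapCheck {
        static long usedHeap() {
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }

        public static void main(String[] args) throws InterruptedException {
            long before = usedHeap();
            // ... submit jobs and wait for them all to finish here ...
            long after = usedHeap();
            // System.gc() is only a hint, so ask a few times and pause.
            for (int i = 0; i < 3; i++) {
                System.gc();
                Thread.sleep(500);
            }
            long afterGc = usedHeap();
            System.out.println("used before jobs:   " + before / 1024 + " KB");
            System.out.println("used after jobs:    " + after / 1024 + " KB");
            System.out.println("used after full GC: " + afterGc / 1024 + " KB");
            // If the last figure stays near "after", the per-job memory is
            // still referenced (a leak); if it drops back toward "before",
            // the heap just had not seen a major collection yet.
        }
    }
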
>> >> The container memory is figured by looking at the process in /proc. >> It's >> total memory including shared libraries and things. But libraries >> take a >> fixed amount of space, so a fuzzy correlation can probably be made. >> It >> looks quite similar to the amount of memory eaten on the client side >> (per job). >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time >> the >> jobs are submitted, but the machine itself seems responsive. I have >> yet >> to plot the exact submission time for each job. >> >> So at this point I would recommend trying ws-gram as long as there >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and >> while >> making sure the jvm has enough heap. More than that seems like a >> gamble. >> >> Mihael >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > From hategan at mcs.anl.gov Fri Feb 8 09:46:42 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 09:46:42 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> Message-ID: <1202485602.4800.13.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 09:33 -0600, Stuart Martin wrote: > Mihael, > > Glad to hear things are improved with GRAM4. Lets keep going to have > swift using GRAM4 routinely. You're being a bit assertive there. > > Below is a recent thread that looked at this exact issue with condor- > g. But it is entirely relevant to your use of GRAM4. the 2 issues to > look for are > > 1) your use of notifications > > >> I ran one more test with the improved callback code. This time, I > >> stopped storing the notification producer EPRs associated with the > >> GRAM job resources. Memory usage went down markedly. > > 2) you could avoid notifications and instead do client-side polling > for job state. This has shown to be more reliable than notifications > under heavy loads, condor-g processing 1000s of jobs. These are both hacks. I'm not sure I want to go there. 300K per job is a bit too much considering that swift (which has to consider many more things) has less than 10K overhead per job. > > The core team will be looking at improving notifications once their > other 4.2 deliverables are done. > > -Stu > > Begin forwarded message: > > > From: feller at mcs.anl.gov > > Date: February 1, 2008 9:41:05 AM CST > > To: "Jaime Frey" > > Cc: "Stuart Martin" , "Terrence Martin" > >, "Martin Feller" , "charles bacon" > >, "Suchandra Thapa" , "Rob Gardner" > >, "Jeff Porter" , "Alain Roy" , > > "Todd Tannenbaum" , "Miron Livny" > > > > Subject: Re: Condor-G WS GRAM memory usage > > > >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> > >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >>> > >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >>>> > >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM > >>>>> raised some concerns about memory usage on the client side. I did > >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared > >>>>> to be the primary memory consumer. The GAHP server is a wrapper > >>>>> around the java client libraries for WS GRAM. > >>>>> > >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a > >>>>> time. 
The jobs were 2-minute sleep jobs with minimal data > >>>>> transfer. All of the jobs overlapped in submission and execution. > >>>>> Here is what I've discovered so far. > >>>>> > >>>>> Aside from the heap available to the java code, the jvm used 117 > >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G > >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >>>>> > >>>>> The maximum jvm heap usage (as reported by the garbage collector) > >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >>>>> quiescent (jobs executing, Condor-G waiting for them to complete), > >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >>>>> > >>>>> The only long-term memory per job that I know of in the GAHP is > >>>>> for the notification sink for job status callbacks. 600kb seems a > >>>>> little high for that. Stu, could someone on Globus help us > >>>>> determine if we're using the notification sinks inefficiently? > >>>> > >>>> Martin just looked and for the most part, there is nothing wrong > >>>> with how condor-g manages the callback sink. > >>>> However, one improvement that would reduce the memory used per job > >>>> would be to not have a notification consumer per job. Instead use > >>>> one for all jobs. > >>>> > >>>> Also, Martin recently did some analysis on condor-g stress tests > >>>> and found that notifications are building up on the in the GRAM4 > >>>> service container and that is causing delays which seem to be > >>>> causing multiple problems. We're looking at this in a separate > >>>> effort with the GT Core team. But, after this was clear, Martin > >>>> re- > >>>> ran the condor-g test and relied on polling between condor-g and > >>>> the GRAM4 service instead of notifications. Jaime, could you > >>>> repeat the no-notification test and see the difference in memory? > >>>> The changes would be to increase the polling frequency in condor-g > >>>> and comment out the subscribe for notification. You could also > >>>> comment out the notification listener call(s) too. > >>> > >>> > >>> I did two new sets of tests today. The first used more efficient > >>> callback code in the GAHP (one notification consumer rather than one > >>> per job). The second disabled notifications and relied on polling > >>> for job status changes. > >>> > >>> The more efficient callback code did not produce a noticeable > >>> reduction in memory usage. > >>> > >>> Disabling notifications did reduce memory usage. The maximum jvm > >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum > >>> heap usage after job submission and before job completion was about > >>> 4 megs + 0.1 megs per job. > >> > >> > >> I ran one more test with the improved callback code. This time, I > >> stopped storing the notification producer EPRs associated with the > >> GRAM job resources. Memory usage went down markedly. > >> > >> I was told the client had to explicitly destroy these serve-side > >> notification producer resources when it destroys the job, otherwise > >> they hang around bogging down the server. Is this still the case? The > >> server can't destroy notification producers when their sources of > >> information are destroyed? > >> > > > > This reminds me of the odd fact that i had to suddenly grant much more > > memory to Condor-g as soon as condor-g started storing EPRs of > > subscription resources to be able to destroy them eventually. > > Those EPR's are maybe not so tiny as they look like. 
> > > > For 4.0: yes, currently you'll have to store and eventually destroy > > subscription resources manually to avoid heaping up persistence data > > on the server-side. > > For 4.2: no, you won't have to store them. A job resource will > > destroy all subscription resources when it's destroyed. > > > > Overall i suggest to concentrate on 4.2 gram since the "container > > hangs in job destruction" problem won't exist anymore. > > > > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes > > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > > sense > > for us to do the 4.2-related changes in Gahp and hand it to you for > > fine-tuning then? > > > > Martin > > > > > On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > > > Mihael: > > > > That's great, thanks! > > > > Ian. > > > > Mihael Hategan wrote: > >> I did a 1024 job run today with ws-gram. > >> I painted the results here: > >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> > >> Seems like client memory per job is about 370k. Which is quite a lot. > >> What kinda worries me is that it doesn't seem to go down after the > >> jobs > >> are done, so maybe there's a memory leak, or maybe the garbage > >> collector > >> doesn't do any major collections. I'll need to profile this to see > >> exactly what we're talking about. > >> > >> The container memory is figured by looking at the process in /proc. > >> It's > >> total memory including shared libraries and things. But libraries > >> take a > >> fixed amount of space, so a fuzzy correlation can probably be made. > >> It > >> looks quite similar to the amount of memory eaten on the client side > >> (per job). > >> > >> CPU-load-wise, WS-GRAM behaves. There is some work during the time > >> the > >> jobs are submitted, but the machine itself seems responsive. I have > >> yet > >> to plot the exact submission time for each job. > >> > >> So at this point I would recommend trying ws-gram as long as there > >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and > >> while > >> making sure the jvm has enough heap. More than that seems like a > >> gamble. > >> > >> Mihael > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> > > > From feller at mcs.anl.gov Fri Feb 8 10:09:34 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 10:09:34 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202485602.4800.13.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> Message-ID: <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> > > On Fri, 2008-02-08 at 09:33 -0600, Stuart Martin wrote: >> Mihael, >> >> Glad to hear things are improved with GRAM4. Lets keep going to have >> swift using GRAM4 routinely. > > You're being a bit assertive there. > >> >> Below is a recent thread that looked at this exact issue with condor- >> g. But it is entirely relevant to your use of GRAM4. the 2 issues to >> look for are >> >> 1) your use of notifications >> >> >> I ran one more test with the improved callback code. This time, I >> >> stopped storing the notification producer EPRs associated with the >> >> GRAM job resources. Memory usage went down markedly. 
>> >> 2) you could avoid notifications and instead do client-side polling >> for job state. This has shown to be more reliable than notifications >> under heavy loads, condor-g processing 1000s of jobs. > > These are both hacks. I'm not sure I want to go there. 300K per job is a > bit too much considering that swift (which has to consider many more > things) has less than 10K overhead per job. > For my better understanding: Do you start up your own notification consumer manager that listens for notifications of all jobs or do you let each GramJob instance listen for notifications itself? In case you listen for notifications yourself: do you store GramJob objects or just EPR's of jobs and create GramJob objects if needed? Martin >> >> The core team will be looking at improving notifications once their >> other 4.2 deliverables are done. >> >> -Stu >> >> Begin forwarded message: >> >> > From: feller at mcs.anl.gov >> > Date: February 1, 2008 9:41:05 AM CST >> > To: "Jaime Frey" >> > Cc: "Stuart Martin" , "Terrence Martin" >> > > >, "Martin Feller" , "charles bacon" >> > > >, "Suchandra Thapa" , "Rob Gardner" >> > > >, "Jeff Porter" , "Alain Roy" , >> > "Todd Tannenbaum" , "Miron Livny" >> > > > >> > Subject: Re: Condor-G WS GRAM memory usage >> > >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >>>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM >> >>>>> raised some concerns about memory usage on the client side. I did >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper >> >>>>> around the java client libraries for WS GRAM. >> >>>>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data >> >>>>> transfer. All of the jobs overlapped in submission and execution. >> >>>>> Here is what I've discovered so far. >> >>>>> >> >>>>> Aside from the heap available to the java code, the jvm used 117 >> >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >> >>>>> >> >>>>> The maximum jvm heap usage (as reported by the garbage collector) >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >> >>>>> quiescent (jobs executing, Condor-G waiting for them to complete), >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >>>>> >> >>>>> The only long-term memory per job that I know of in the GAHP is >> >>>>> for the notification sink for job status callbacks. 600kb seems a >> >>>>> little high for that. Stu, could someone on Globus help us >> >>>>> determine if we're using the notification sinks inefficiently? >> >>>> >> >>>> Martin just looked and for the most part, there is nothing wrong >> >>>> with how condor-g manages the callback sink. >> >>>> However, one improvement that would reduce the memory used per job >> >>>> would be to not have a notification consumer per job. Instead use >> >>>> one for all jobs. >> >>>> >> >>>> Also, Martin recently did some analysis on condor-g stress tests >> >>>> and found that notifications are building up on the in the GRAM4 >> >>>> service container and that is causing delays which seem to be >> >>>> causing multiple problems. We're looking at this in a separate >> >>>> effort with the GT Core team. 
But, after this was clear, Martin >> >>>> re- >> >>>> ran the condor-g test and relied on polling between condor-g and >> >>>> the GRAM4 service instead of notifications. Jaime, could you >> >>>> repeat the no-notification test and see the difference in memory? >> >>>> The changes would be to increase the polling frequency in condor-g >> >>>> and comment out the subscribe for notification. You could also >> >>>> comment out the notification listener call(s) too. >> >>> >> >>> >> >>> I did two new sets of tests today. The first used more efficient >> >>> callback code in the GAHP (one notification consumer rather than one >> >>> per job). The second disabled notifications and relied on polling >> >>> for job status changes. >> >>> >> >>> The more efficient callback code did not produce a noticeable >> >>> reduction in memory usage. >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum jvm >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum >> >>> heap usage after job submission and before job completion was about >> >>> 4 megs + 0.1 megs per job. >> >> >> >> >> >> I ran one more test with the improved callback code. This time, I >> >> stopped storing the notification producer EPRs associated with the >> >> GRAM job resources. Memory usage went down markedly. >> >> >> >> I was told the client had to explicitly destroy these serve-side >> >> notification producer resources when it destroys the job, otherwise >> >> they hang around bogging down the server. Is this still the case? The >> >> server can't destroy notification producers when their sources of >> >> information are destroyed? >> >> >> > >> > This reminds me of the odd fact that i had to suddenly grant much more >> > memory to Condor-g as soon as condor-g started storing EPRs of >> > subscription resources to be able to destroy them eventually. >> > Those EPR's are maybe not so tiny as they look like. >> > >> > For 4.0: yes, currently you'll have to store and eventually destroy >> > subscription resources manually to avoid heaping up persistence data >> > on the server-side. >> > For 4.2: no, you won't have to store them. A job resource will >> > destroy all subscription resources when it's destroyed. >> > >> > Overall i suggest to concentrate on 4.2 gram since the "container >> > hangs in job destruction" problem won't exist anymore. >> > >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes >> > sense >> > for us to do the 4.2-related changes in Gahp and hand it to you for >> > fine-tuning then? >> > >> > Martin >> >> >> >> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> > Mihael: >> > >> > That's great, thanks! >> > >> > Ian. >> > >> > Mihael Hategan wrote: >> >> I did a 1024 job run today with ws-gram. >> >> I painted the results here: >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> >> >> Seems like client memory per job is about 370k. Which is quite a lot. >> >> What kinda worries me is that it doesn't seem to go down after the >> >> jobs >> >> are done, so maybe there's a memory leak, or maybe the garbage >> >> collector >> >> doesn't do any major collections. I'll need to profile this to see >> >> exactly what we're talking about. >> >> >> >> The container memory is figured by looking at the process in /proc. >> >> It's >> >> total memory including shared libraries and things. 
But libraries >> >> take a >> >> fixed amount of space, so a fuzzy correlation can probably be made. >> >> It >> >> looks quite similar to the amount of memory eaten on the client side >> >> (per job). >> >> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time >> >> the >> >> jobs are submitted, but the machine itself seems responsive. I have >> >> yet >> >> to plot the exact submission time for each job. >> >> >> >> So at this point I would recommend trying ws-gram as long as there >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and >> >> while >> >> making sure the jvm has enough heap. More than that seems like a >> >> gamble. >> >> >> >> Mihael >> >> >> >> _______________________________________________ >> >> Swift-devel mailing list >> >> Swift-devel at ci.uchicago.edu >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >> >> >> > >> > > From hategan at mcs.anl.gov Fri Feb 8 10:18:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 10:18:13 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202487494.5642.7.camel@blabla.mcs.anl.gov> > > > > These are both hacks. I'm not sure I want to go there. 300K per job is a > > bit too much considering that swift (which has to consider many more > > things) has less than 10K overhead per job. > > > > > For my better understanding: > Do you start up your own notification consumer manager that listens for > notifications of all jobs or do you let each GramJob instance listen for > notifications itself? > In case you listen for notifications yourself: do you store > GramJob objects or just EPR's of jobs and create GramJob objects if > needed? Excellent points. I let each GramJob instance listen for notifications itself. What I observed is that it uses only one container for that. Due to the above, a reference to the GramJob is kept anyway, regardless of whether that reference is in client code or the local container. I'll try to profile a run and see if I can spot where the problems are. > > Martin > > >> > >> The core team will be looking at improving notifications once their > >> other 4.2 deliverables are done. > >> > >> -Stu > >> > >> Begin forwarded message: > >> > >> > From: feller at mcs.anl.gov > >> > Date: February 1, 2008 9:41:05 AM CST > >> > To: "Jaime Frey" > >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >> > >, "Martin Feller" , "charles bacon" > >> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >> > >, "Jeff Porter" , "Alain Roy" , > >> > "Todd Tannenbaum" , "Miron Livny" > >> >> > > > >> > Subject: Re: Condor-G WS GRAM memory usage > >> > > >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >> > >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >>> > >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >>>> > >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM > >> >>>>> raised some concerns about memory usage on the client side. I did > >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which appeared > >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper > >> >>>>> around the java client libraries for WS GRAM. 
> >> >>>>> > >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a > >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >>>>> transfer. All of the jobs overlapped in submission and execution. > >> >>>>> Here is what I've discovered so far. > >> >>>>> > >> >>>>> Aside from the heap available to the java code, the jvm used 117 > >> >>>>> megs of non-shared memory and 74 megs of shared memory. Condor-G > >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >>>>> > >> >>>>> The maximum jvm heap usage (as reported by the garbage collector) > >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >>>>> quiescent (jobs executing, Condor-G waiting for them to complete), > >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >>>>> > >> >>>>> The only long-term memory per job that I know of in the GAHP is > >> >>>>> for the notification sink for job status callbacks. 600kb seems a > >> >>>>> little high for that. Stu, could someone on Globus help us > >> >>>>> determine if we're using the notification sinks inefficiently? > >> >>>> > >> >>>> Martin just looked and for the most part, there is nothing wrong > >> >>>> with how condor-g manages the callback sink. > >> >>>> However, one improvement that would reduce the memory used per job > >> >>>> would be to not have a notification consumer per job. Instead use > >> >>>> one for all jobs. > >> >>>> > >> >>>> Also, Martin recently did some analysis on condor-g stress tests > >> >>>> and found that notifications are building up on the in the GRAM4 > >> >>>> service container and that is causing delays which seem to be > >> >>>> causing multiple problems. We're looking at this in a separate > >> >>>> effort with the GT Core team. But, after this was clear, Martin > >> >>>> re- > >> >>>> ran the condor-g test and relied on polling between condor-g and > >> >>>> the GRAM4 service instead of notifications. Jaime, could you > >> >>>> repeat the no-notification test and see the difference in memory? > >> >>>> The changes would be to increase the polling frequency in condor-g > >> >>>> and comment out the subscribe for notification. You could also > >> >>>> comment out the notification listener call(s) too. > >> >>> > >> >>> > >> >>> I did two new sets of tests today. The first used more efficient > >> >>> callback code in the GAHP (one notification consumer rather than one > >> >>> per job). The second disabled notifications and relied on polling > >> >>> for job status changes. > >> >>> > >> >>> The more efficient callback code did not produce a noticeable > >> >>> reduction in memory usage. > >> >>> > >> >>> Disabling notifications did reduce memory usage. The maximum jvm > >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum > >> >>> heap usage after job submission and before job completion was about > >> >>> 4 megs + 0.1 megs per job. > >> >> > >> >> > >> >> I ran one more test with the improved callback code. This time, I > >> >> stopped storing the notification producer EPRs associated with the > >> >> GRAM job resources. Memory usage went down markedly. > >> >> > >> >> I was told the client had to explicitly destroy these serve-side > >> >> notification producer resources when it destroys the job, otherwise > >> >> they hang around bogging down the server. Is this still the case? The > >> >> server can't destroy notification producers when their sources of > >> >> information are destroyed? 
> >> >> > >> > > >> > This reminds me of the odd fact that i had to suddenly grant much more > >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> > subscription resources to be able to destroy them eventually. > >> > Those EPR's are maybe not so tiny as they look like. > >> > > >> > For 4.0: yes, currently you'll have to store and eventually destroy > >> > subscription resources manually to avoid heaping up persistence data > >> > on the server-side. > >> > For 4.2: no, you won't have to store them. A job resource will > >> > destroy all subscription resources when it's destroyed. > >> > > >> > Overall i suggest to concentrate on 4.2 gram since the "container > >> > hangs in job destruction" problem won't exist anymore. > >> > > >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 changes > >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > >> > sense > >> > for us to do the 4.2-related changes in Gahp and hand it to you for > >> > fine-tuning then? > >> > > >> > Martin > >> > >> > >> > >> > >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> > >> > Mihael: > >> > > >> > That's great, thanks! > >> > > >> > Ian. > >> > > >> > Mihael Hategan wrote: > >> >> I did a 1024 job run today with ws-gram. > >> >> I painted the results here: > >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >> > >> >> Seems like client memory per job is about 370k. Which is quite a lot. > >> >> What kinda worries me is that it doesn't seem to go down after the > >> >> jobs > >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >> collector > >> >> doesn't do any major collections. I'll need to profile this to see > >> >> exactly what we're talking about. > >> >> > >> >> The container memory is figured by looking at the process in /proc. > >> >> It's > >> >> total memory including shared libraries and things. But libraries > >> >> take a > >> >> fixed amount of space, so a fuzzy correlation can probably be made. > >> >> It > >> >> looks quite similar to the amount of memory eaten on the client side > >> >> (per job). > >> >> > >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time > >> >> the > >> >> jobs are submitted, but the machine itself seems responsive. I have > >> >> yet > >> >> to plot the exact submission time for each job. > >> >> > >> >> So at this point I would recommend trying ws-gram as long as there > >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and > >> >> while > >> >> making sure the jvm has enough heap. More than that seems like a > >> >> gamble. > >> >> > >> >> Mihael > >> >> > >> >> _______________________________________________ > >> >> Swift-devel mailing list > >> >> Swift-devel at ci.uchicago.edu > >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >> > >> >> > >> > > >> > > > > > > From feller at mcs.anl.gov Fri Feb 8 10:26:26 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 10:26:26 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202487494.5642.7.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> Message-ID: <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> >> > >> > These are both hacks. 
I'm not sure I want to go there. 300K per job is >> a >> > bit too much considering that swift (which has to consider many more >> > things) has less than 10K overhead per job. >> > >> >> >> For my better understanding: >> Do you start up your own notification consumer manager that listens for >> notifications of all jobs or do you let each GramJob instance listen for >> notifications itself? >> In case you listen for notifications yourself: do you store >> GramJob objects or just EPR's of jobs and create GramJob objects if >> needed? > > Excellent points. I let each GramJob instance listen for notifications > itself. What I observed is that it uses only one container for that. > Shoot! i didn't know that and thought there would be a container per GramJob in that case. That's the core mysteries with notifications. Anyway: I did a quick check some days ago and found that GramJob is surprisingly greedy regarding memory as you said. I'll have to further check what it is, but will probably not do that before 4.2 is out. > Due to the above, a reference to the GramJob is kept anyway, regardless > of whether that reference is in client code or the local container. > > I'll try to profile a run and see if I can spot where the problems are. > >> >> Martin >> >> >> >> >> The core team will be looking at improving notifications once their >> >> other 4.2 deliverables are done. >> >> >> >> -Stu >> >> >> >> Begin forwarded message: >> >> >> >> > From: feller at mcs.anl.gov >> >> > Date: February 1, 2008 9:41:05 AM CST >> >> > To: "Jaime Frey" >> >> > Cc: "Stuart Martin" , "Terrence Martin" >> >> > >> > >, "Martin Feller" , "charles bacon" >> >> > >> > >, "Suchandra Thapa" , "Rob Gardner" >> >> > >> > >, "Jeff Porter" , "Alain Roy" , >> >> > "Todd Tannenbaum" , "Miron Livny" >> >> > >> > > >> >> > Subject: Re: Condor-G WS GRAM memory usage >> >> > >> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> >> >> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >> >>> >> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >> >>>> >> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM >> >> >>>>> raised some concerns about memory usage on the client side. I >> did >> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >> appeared >> >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper >> >> >>>>> around the java client libraries for WS GRAM. >> >> >>>>> >> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a >> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data >> >> >>>>> transfer. All of the jobs overlapped in submission and >> execution. >> >> >>>>> Here is what I've discovered so far. >> >> >>>>> >> >> >>>>> Aside from the heap available to the java code, the jvm used >> 117 >> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >> Condor-G >> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >> >> >>>>> >> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> collector) >> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >> complete), >> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >> >>>>> >> >> >>>>> The only long-term memory per job that I know of in the GAHP is >> >> >>>>> for the notification sink for job status callbacks. 600kb seems >> a >> >> >>>>> little high for that. 
Stu, could someone on Globus help us >> >> >>>>> determine if we're using the notification sinks inefficiently? >> >> >>>> >> >> >>>> Martin just looked and for the most part, there is nothing wrong >> >> >>>> with how condor-g manages the callback sink. >> >> >>>> However, one improvement that would reduce the memory used per >> job >> >> >>>> would be to not have a notification consumer per job. Instead >> use >> >> >>>> one for all jobs. >> >> >>>> >> >> >>>> Also, Martin recently did some analysis on condor-g stress tests >> >> >>>> and found that notifications are building up on the in the GRAM4 >> >> >>>> service container and that is causing delays which seem to be >> >> >>>> causing multiple problems. We're looking at this in a separate >> >> >>>> effort with the GT Core team. But, after this was clear, Martin >> >> >>>> re- >> >> >>>> ran the condor-g test and relied on polling between condor-g and >> >> >>>> the GRAM4 service instead of notifications. Jaime, could you >> >> >>>> repeat the no-notification test and see the difference in >> memory? >> >> >>>> The changes would be to increase the polling frequency in >> condor-g >> >> >>>> and comment out the subscribe for notification. You could also >> >> >>>> comment out the notification listener call(s) too. >> >> >>> >> >> >>> >> >> >>> I did two new sets of tests today. The first used more efficient >> >> >>> callback code in the GAHP (one notification consumer rather than >> one >> >> >>> per job). The second disabled notifications and relied on polling >> >> >>> for job status changes. >> >> >>> >> >> >>> The more efficient callback code did not produce a noticeable >> >> >>> reduction in memory usage. >> >> >>> >> >> >>> Disabling notifications did reduce memory usage. The maximum jvm >> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum >> >> >>> heap usage after job submission and before job completion was >> about >> >> >>> 4 megs + 0.1 megs per job. >> >> >> >> >> >> >> >> >> I ran one more test with the improved callback code. This time, I >> >> >> stopped storing the notification producer EPRs associated with the >> >> >> GRAM job resources. Memory usage went down markedly. >> >> >> >> >> >> I was told the client had to explicitly destroy these serve-side >> >> >> notification producer resources when it destroys the job, >> otherwise >> >> >> they hang around bogging down the server. Is this still the case? >> The >> >> >> server can't destroy notification producers when their sources of >> >> >> information are destroyed? >> >> >> >> >> > >> >> > This reminds me of the odd fact that i had to suddenly grant much >> more >> >> > memory to Condor-g as soon as condor-g started storing EPRs of >> >> > subscription resources to be able to destroy them eventually. >> >> > Those EPR's are maybe not so tiny as they look like. >> >> > >> >> > For 4.0: yes, currently you'll have to store and eventually destroy >> >> > subscription resources manually to avoid heaping up persistence >> data >> >> > on the server-side. >> >> > For 4.2: no, you won't have to store them. A job resource will >> >> > destroy all subscription resources when it's destroyed. >> >> > >> >> > Overall i suggest to concentrate on 4.2 gram since the "container >> >> > hangs in job destruction" problem won't exist anymore. >> >> > >> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 >> changes >> >> > in Gram in 4.2. I'll do so as soon as i can. 
I wonder if it makes >> >> > sense >> >> > for us to do the 4.2-related changes in Gahp and hand it to you for >> >> > fine-tuning then? >> >> > >> >> > Martin >> >> >> >> >> >> >> >> >> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> >> >> > Mihael: >> >> > >> >> > That's great, thanks! >> >> > >> >> > Ian. >> >> > >> >> > Mihael Hategan wrote: >> >> >> I did a 1024 job run today with ws-gram. >> >> >> I painted the results here: >> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> >> >> >> >> Seems like client memory per job is about 370k. Which is quite a >> lot. >> >> >> What kinda worries me is that it doesn't seem to go down after the >> >> >> jobs >> >> >> are done, so maybe there's a memory leak, or maybe the garbage >> >> >> collector >> >> >> doesn't do any major collections. I'll need to profile this to see >> >> >> exactly what we're talking about. >> >> >> >> >> >> The container memory is figured by looking at the process in >> /proc. >> >> >> It's >> >> >> total memory including shared libraries and things. But libraries >> >> >> take a >> >> >> fixed amount of space, so a fuzzy correlation can probably be >> made. >> >> >> It >> >> >> looks quite similar to the amount of memory eaten on the client >> side >> >> >> (per job). >> >> >> >> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time >> >> >> the >> >> >> jobs are submitted, but the machine itself seems responsive. I >> have >> >> >> yet >> >> >> to plot the exact submission time for each job. >> >> >> >> >> >> So at this point I would recommend trying ws-gram as long as there >> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and >> >> >> while >> >> >> making sure the jvm has enough heap. More than that seems like a >> >> >> gamble. >> >> >> >> >> >> Mihael >> >> >> >> >> >> _______________________________________________ >> >> >> Swift-devel mailing list >> >> >> Swift-devel at ci.uchicago.edu >> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >> >> >> >> >> >> > >> >> >> > >> > >> >> > > From benc at hawaga.org.uk Fri Feb 8 10:18:26 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 16:18:26 +0000 (GMT) Subject: [Swift-devel] behaviour on out-of-memory Message-ID: On my laptop when I run swift with so many jobs that it runs out of memory, it gives the below errors and hangs. It doesn't seem to exit. That's icky for using this in any automated environment. $ swift -tc.file ./tc.data -sites.file ./sites.xml badmonkey.swift -goodmonkeys=10000 Swift v0.3-dev r1609 (modified locally) RunID: 20080208-1015-5h5huekc Exception in thread "Worker 0" Exception in thread "Timer-0" Exception in thread "Worker 2" java.lang.OutOfMemoryError: Java heap space Exception in thread "Worker 3" java.lang.OutOfMemoryError: Java heap space Exception in thread "Worker 1" java.lang.OutOfMemoryError: Java heap space From hategan at mcs.anl.gov Fri Feb 8 11:11:13 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:11:13 -0600 Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: References: Message-ID: <1202490673.8302.3.camel@blabla.mcs.anl.gov> Yep. Hard problem. In general, OOMs are tricky to handle. I was thinking of pre-allocating some space to use in such cases for clean shutdown, but given the concurrency, this may or may not work properly. 
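A minimal sketch of that pre-allocation idea, in generic Java rather than Swift's actual worker classes; whether dropping a reserve buffer frees enough headroom for a clean shutdown once several threads are already failing is exactly the open question here.

    // Sketch: keep a reserve buffer that is released when an OOM is caught,
    // hopefully leaving enough heap to log, cancel work and exit with an
    // error code. Class and method names are illustrative only.
    public class OomGuard {
        // A few MB held purely so it can be given back on OOM.
        private static byte[] reserve = new byte[4 * 1024 * 1024];

        public static void runGuarded(Runnable work) {
            try {
                work.run();
            } catch (OutOfMemoryError oom) {
                reserve = null;   // return some headroom to the heap
                System.gc();
                System.err.println("out of memory, shutting down");
                // attempt whatever cleanup is still possible, then fail hard
                System.exit(1);
            }
        }
    }

Other threads can still hit the OOM first, so at best this improves the odds of a clean exit; it is not a guarantee.
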
On Fri, 2008-02-08 at 16:18 +0000, Ben Clifford wrote: > On my laptop when I run swift with so many jobs that it runs out of > memory, it gives the below errors and hangs. It doesn't seem to exit. > That's icky for using this in any automated environment. > > $ swift -tc.file ./tc.data -sites.file ./sites.xml badmonkey.swift > -goodmonkeys=10000 > Swift v0.3-dev r1609 (modified locally) > > RunID: 20080208-1015-5h5huekc > Exception in thread "Worker 0" Exception in thread "Timer-0" Exception in > thread "Worker 2" java.lang.OutOfMemoryError: Java heap space > Exception in thread "Worker 3" java.lang.OutOfMemoryError: Java heap space > Exception in thread "Worker 1" java.lang.OutOfMemoryError: Java heap space > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Feb 8 11:16:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:16:30 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202490990.8302.9.camel@blabla.mcs.anl.gov> > Shoot! i didn't know that and thought there would be a container per > GramJob in that case. Yep. There was even a bug, not sure if it was fixed, that would mess up the port for that container on subsequent requests (basically a second sequential job would start the container on 8443 instead of whatever was in the port range). > That's the core mysteries with notifications. > Anyway: I did a quick check some days ago and found that GramJob is > surprisingly greedy regarding memory as you said. I'll have to further > check what it is, but will probably not do that before 4.2 is out. I'll try to profile it today. You should get a license for YJP so that you can look at the snapshots I might come up with. It's free for OSS projects (just point them to the globus page that has your name). > > > > Due to the above, a reference to the GramJob is kept anyway, regardless > > of whether that reference is in client code or the local container. > > > > I'll try to profile a run and see if I can spot where the problems are. > > > >> > >> Martin > >> > >> >> > >> >> The core team will be looking at improving notifications once their > >> >> other 4.2 deliverables are done. 
> >> >> > >> >> -Stu > >> >> > >> >> Begin forwarded message: > >> >> > >> >> > From: feller at mcs.anl.gov > >> >> > Date: February 1, 2008 9:41:05 AM CST > >> >> > To: "Jaime Frey" > >> >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >> >> >> > >, "Martin Feller" , "charles bacon" > >> >> >> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >> >> >> > >, "Jeff Porter" , "Alain Roy" , > >> >> > "Todd Tannenbaum" , "Miron Livny" > >> >> >> >> > > > >> >> > Subject: Re: Condor-G WS GRAM memory usage > >> >> > > >> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >> >> > >> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >> >>> > >> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >> >>>> > >> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM > >> >> >>>>> raised some concerns about memory usage on the client side. I > >> did > >> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >> appeared > >> >> >>>>> to be the primary memory consumer. The GAHP server is a wrapper > >> >> >>>>> around the java client libraries for WS GRAM. > >> >> >>>>> > >> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at a > >> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >> >>>>> transfer. All of the jobs overlapped in submission and > >> execution. > >> >> >>>>> Here is what I've discovered so far. > >> >> >>>>> > >> >> >>>>> Aside from the heap available to the java code, the jvm used > >> 117 > >> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >> Condor-G > >> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >> >>>>> > >> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> collector) > >> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >> complete), > >> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >> >>>>> > >> >> >>>>> The only long-term memory per job that I know of in the GAHP is > >> >> >>>>> for the notification sink for job status callbacks. 600kb seems > >> a > >> >> >>>>> little high for that. Stu, could someone on Globus help us > >> >> >>>>> determine if we're using the notification sinks inefficiently? > >> >> >>>> > >> >> >>>> Martin just looked and for the most part, there is nothing wrong > >> >> >>>> with how condor-g manages the callback sink. > >> >> >>>> However, one improvement that would reduce the memory used per > >> job > >> >> >>>> would be to not have a notification consumer per job. Instead > >> use > >> >> >>>> one for all jobs. > >> >> >>>> > >> >> >>>> Also, Martin recently did some analysis on condor-g stress tests > >> >> >>>> and found that notifications are building up on the in the GRAM4 > >> >> >>>> service container and that is causing delays which seem to be > >> >> >>>> causing multiple problems. We're looking at this in a separate > >> >> >>>> effort with the GT Core team. But, after this was clear, Martin > >> >> >>>> re- > >> >> >>>> ran the condor-g test and relied on polling between condor-g and > >> >> >>>> the GRAM4 service instead of notifications. Jaime, could you > >> >> >>>> repeat the no-notification test and see the difference in > >> memory? > >> >> >>>> The changes would be to increase the polling frequency in > >> condor-g > >> >> >>>> and comment out the subscribe for notification. You could also > >> >> >>>> comment out the notification listener call(s) too. 
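(For reference, the no-notification approach described above amounts to a polling loop of roughly the following shape. fetchState() stands in for whatever query-the-job-resource call the client library provides; it and the 30-second interval are purely illustrative, not the real condor-g or WS-GRAM API.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: poll each job's state on a timer instead of subscribing
    // for notifications.
    public class StatePoller {
        private final Map<String, String> lastState =
            new ConcurrentHashMap<String, String>();
        private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

        public void watch(final String jobId) {
            timer.scheduleWithFixedDelay(new Runnable() {
                public void run() {
                    String state = fetchState(jobId);   // hypothetical remote query
                    String previous = lastState.put(jobId, state);
                    if (!state.equals(previous)) {
                        onStateChange(jobId, state);
                    }
                }
            }, 0, 30, TimeUnit.SECONDS);
        }

        String fetchState(String jobId) { return "Active"; }        // stub
        void onStateChange(String jobId, String state) { /* bookkeeping */ }
    }
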
> >> >> >>> > >> >> >>> > >> >> >>> I did two new sets of tests today. The first used more efficient > >> >> >>> callback code in the GAHP (one notification consumer rather than > >> one > >> >> >>> per job). The second disabled notifications and relied on polling > >> >> >>> for job status changes. > >> >> >>> > >> >> >>> The more efficient callback code did not produce a noticeable > >> >> >>> reduction in memory usage. > >> >> >>> > >> >> >>> Disabling notifications did reduce memory usage. The maximum jvm > >> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum > >> >> >>> heap usage after job submission and before job completion was > >> about > >> >> >>> 4 megs + 0.1 megs per job. > >> >> >> > >> >> >> > >> >> >> I ran one more test with the improved callback code. This time, I > >> >> >> stopped storing the notification producer EPRs associated with the > >> >> >> GRAM job resources. Memory usage went down markedly. > >> >> >> > >> >> >> I was told the client had to explicitly destroy these serve-side > >> >> >> notification producer resources when it destroys the job, > >> otherwise > >> >> >> they hang around bogging down the server. Is this still the case? > >> The > >> >> >> server can't destroy notification producers when their sources of > >> >> >> information are destroyed? > >> >> >> > >> >> > > >> >> > This reminds me of the odd fact that i had to suddenly grant much > >> more > >> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> >> > subscription resources to be able to destroy them eventually. > >> >> > Those EPR's are maybe not so tiny as they look like. > >> >> > > >> >> > For 4.0: yes, currently you'll have to store and eventually destroy > >> >> > subscription resources manually to avoid heaping up persistence > >> data > >> >> > on the server-side. > >> >> > For 4.2: no, you won't have to store them. A job resource will > >> >> > destroy all subscription resources when it's destroyed. > >> >> > > >> >> > Overall i suggest to concentrate on 4.2 gram since the "container > >> >> > hangs in job destruction" problem won't exist anymore. > >> >> > > >> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >> changes > >> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > >> >> > sense > >> >> > for us to do the 4.2-related changes in Gahp and hand it to you for > >> >> > fine-tuning then? > >> >> > > >> >> > Martin > >> >> > >> >> > >> >> > >> >> > >> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> >> > >> >> > Mihael: > >> >> > > >> >> > That's great, thanks! > >> >> > > >> >> > Ian. > >> >> > > >> >> > Mihael Hategan wrote: > >> >> >> I did a 1024 job run today with ws-gram. > >> >> >> I painted the results here: > >> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >> >> > >> >> >> Seems like client memory per job is about 370k. Which is quite a > >> lot. > >> >> >> What kinda worries me is that it doesn't seem to go down after the > >> >> >> jobs > >> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >> >> collector > >> >> >> doesn't do any major collections. I'll need to profile this to see > >> >> >> exactly what we're talking about. > >> >> >> > >> >> >> The container memory is figured by looking at the process in > >> /proc. > >> >> >> It's > >> >> >> total memory including shared libraries and things. But libraries > >> >> >> take a > >> >> >> fixed amount of space, so a fuzzy correlation can probably be > >> made. 
> >> >> >> It > >> >> >> looks quite similar to the amount of memory eaten on the client > >> side > >> >> >> (per job). > >> >> >> > >> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the time > >> >> >> the > >> >> >> jobs are submitted, but the machine itself seems responsive. I > >> have > >> >> >> yet > >> >> >> to plot the exact submission time for each job. > >> >> >> > >> >> >> So at this point I would recommend trying ws-gram as long as there > >> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), and > >> >> >> while > >> >> >> making sure the jvm has enough heap. More than that seems like a > >> >> >> gamble. > >> >> >> > >> >> >> Mihael > >> >> >> > >> >> >> _______________________________________________ > >> >> >> Swift-devel mailing list > >> >> >> Swift-devel at ci.uchicago.edu > >> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >> >> > >> >> >> > >> >> > > >> >> > >> > > >> > > >> > >> > > > > > > From benc at hawaga.org.uk Fri Feb 8 11:19:37 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 17:19:37 +0000 (GMT) Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: <1202490673.8302.3.camel@blabla.mcs.anl.gov> References: <1202490673.8302.3.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 8 Feb 2008, Mihael Hategan wrote: > Yep. Hard problem. In general, OOMs are tricky to handle. I was thinking > of pre-allocating some space to use in such cases for clean shutdown, > but given the concurrency, this may or may not work properly. For my purposes, I don't really need anything cleaner than the JVM exiting with an error code sometime around the memory running out. I hacked in a try/catch around karajan's EventWorker.run() which is catching enough for me at the moment. -- From feller at mcs.anl.gov Fri Feb 8 11:19:40 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 11:19:40 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> Message-ID: <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> Mihael, i think i found the memory hole in GramJob. 100 jobs in a test of mine consumed about 23MB (constantly growing) before the fix and 8MB (very slowly growing) after the fix. The big part of that (7MB) is used right from the first job which may be the NotificationConsumerManager. Will commit that change soon to 4.0 branch and you may try it then. Are you using 4.0.x in your tests? Martin >>> > >>> > These are both hacks. I'm not sure I want to go there. 300K per job >>> is >>> a >>> > bit too much considering that swift (which has to consider many more >>> > things) has less than 10K overhead per job. >>> > >>> >>> >>> For my better understanding: >>> Do you start up your own notification consumer manager that listens for >>> notifications of all jobs or do you let each GramJob instance listen >>> for >>> notifications itself? >>> In case you listen for notifications yourself: do you store >>> GramJob objects or just EPR's of jobs and create GramJob objects if >>> needed? >> >> Excellent points. 
I let each GramJob instance listen for notifications >> itself. What I observed is that it uses only one container for that. >> > > Shoot! i didn't know that and thought there would be a container per > GramJob in that case. That's the core mysteries with notifications. > Anyway: I did a quick check some days ago and found that GramJob is > surprisingly greedy regarding memory as you said. I'll have to further > check what it is, but will probably not do that before 4.2 is out. > > >> Due to the above, a reference to the GramJob is kept anyway, regardless >> of whether that reference is in client code or the local container. >> >> I'll try to profile a run and see if I can spot where the problems are. >> >>> >>> Martin >>> >>> >> >>> >> The core team will be looking at improving notifications once their >>> >> other 4.2 deliverables are done. >>> >> >>> >> -Stu >>> >> >>> >> Begin forwarded message: >>> >> >>> >> > From: feller at mcs.anl.gov >>> >> > Date: February 1, 2008 9:41:05 AM CST >>> >> > To: "Jaime Frey" >>> >> > Cc: "Stuart Martin" , "Terrence Martin" >>> >> >> >> > >, "Martin Feller" , "charles bacon" >>> >> >> >> > >, "Suchandra Thapa" , "Rob Gardner" >>> >> >> >> > >, "Jeff Porter" , "Alain Roy" >>> , >>> >> > "Todd Tannenbaum" , "Miron Livny" >>> >> >> >> > > >>> >> > Subject: Re: Condor-G WS GRAM memory usage >>> >> > >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >>> >> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >>> >> >>> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >>> >> >>>> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM >>> >> >>>>> raised some concerns about memory usage on the client side. I >>> did >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >>> appeared >>> >> >>>>> to be the primary memory consumer. The GAHP server is a >>> wrapper >>> >> >>>>> around the java client libraries for WS GRAM. >>> >> >>>>> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at >>> a >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data >>> >> >>>>> transfer. All of the jobs overlapped in submission and >>> execution. >>> >> >>>>> Here is what I've discovered so far. >>> >> >>>>> >>> >> >>>>> Aside from the heap available to the java code, the jvm used >>> 117 >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >>> Condor-G >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >>> >> >>>>> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >>> collector) >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >>> complete), >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >>> >> >>>>> >>> >> >>>>> The only long-term memory per job that I know of in the GAHP >>> is >>> >> >>>>> for the notification sink for job status callbacks. 600kb >>> seems >>> a >>> >> >>>>> little high for that. Stu, could someone on Globus help us >>> >> >>>>> determine if we're using the notification sinks inefficiently? >>> >> >>>> >>> >> >>>> Martin just looked and for the most part, there is nothing >>> wrong >>> >> >>>> with how condor-g manages the callback sink. >>> >> >>>> However, one improvement that would reduce the memory used per >>> job >>> >> >>>> would be to not have a notification consumer per job. Instead >>> use >>> >> >>>> one for all jobs. 
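(The shared-consumer arrangement suggested there has roughly this shape: one listener endpoint, with incoming messages dispatched to per-job records by job id. This is a structural sketch only and does not use the real WS-GRAM or WS-Notification classes.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: one notification sink shared by all jobs, instead of one
    // consumer (and its supporting machinery) allocated per job.
    public class SharedNotificationSink {
        public interface JobCallback {
            void stateChanged(String newState);
        }

        // Per-job callbacks, keyed by job id (or EPR string).
        private final Map<String, JobCallback> jobs =
            new ConcurrentHashMap<String, JobCallback>();

        public void register(String jobId, JobCallback cb) {
            jobs.put(jobId, cb);
        }

        public void unregister(String jobId) {
            jobs.remove(jobId);   // drop the reference once the job is done
        }

        // Called by the single listener endpoint for every incoming message.
        public void deliver(String jobId, String newState) {
            JobCallback cb = jobs.get(jobId);
            if (cb != null) {
                cb.stateChanged(newState);
            }
        }
    }
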
>>> >> >>>> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress >>> tests >>> >> >>>> and found that notifications are building up on the in the >>> GRAM4 >>> >> >>>> service container and that is causing delays which seem to be >>> >> >>>> causing multiple problems. We're looking at this in a separate >>> >> >>>> effort with the GT Core team. But, after this was clear, >>> Martin >>> >> >>>> re- >>> >> >>>> ran the condor-g test and relied on polling between condor-g >>> and >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could you >>> >> >>>> repeat the no-notification test and see the difference in >>> memory? >>> >> >>>> The changes would be to increase the polling frequency in >>> condor-g >>> >> >>>> and comment out the subscribe for notification. You could also >>> >> >>>> comment out the notification listener call(s) too. >>> >> >>> >>> >> >>> >>> >> >>> I did two new sets of tests today. The first used more efficient >>> >> >>> callback code in the GAHP (one notification consumer rather than >>> one >>> >> >>> per job). The second disabled notifications and relied on >>> polling >>> >> >>> for job status changes. >>> >> >>> >>> >> >>> The more efficient callback code did not produce a noticeable >>> >> >>> reduction in memory usage. >>> >> >>> >>> >> >>> Disabling notifications did reduce memory usage. The maximum jvm >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum >>> >> >>> heap usage after job submission and before job completion was >>> about >>> >> >>> 4 megs + 0.1 megs per job. >>> >> >> >>> >> >> >>> >> >> I ran one more test with the improved callback code. This time, I >>> >> >> stopped storing the notification producer EPRs associated with >>> the >>> >> >> GRAM job resources. Memory usage went down markedly. >>> >> >> >>> >> >> I was told the client had to explicitly destroy these serve-side >>> >> >> notification producer resources when it destroys the job, >>> otherwise >>> >> >> they hang around bogging down the server. Is this still the case? >>> The >>> >> >> server can't destroy notification producers when their sources of >>> >> >> information are destroyed? >>> >> >> >>> >> > >>> >> > This reminds me of the odd fact that i had to suddenly grant much >>> more >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of >>> >> > subscription resources to be able to destroy them eventually. >>> >> > Those EPR's are maybe not so tiny as they look like. >>> >> > >>> >> > For 4.0: yes, currently you'll have to store and eventually >>> destroy >>> >> > subscription resources manually to avoid heaping up persistence >>> data >>> >> > on the server-side. >>> >> > For 4.2: no, you won't have to store them. A job resource will >>> >> > destroy all subscription resources when it's destroyed. >>> >> > >>> >> > Overall i suggest to concentrate on 4.2 gram since the "container >>> >> > hangs in job destruction" problem won't exist anymore. >>> >> > >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 >>> changes >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes >>> >> > sense >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you >>> for >>> >> > fine-tuning then? >>> >> > >>> >> > Martin >>> >> >>> >> >>> >> >>> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >>> >> >>> >> > Mihael: >>> >> > >>> >> > That's great, thanks! >>> >> > >>> >> > Ian. 
>>> >> > >>> >> > Mihael Hategan wrote: >>> >> >> I did a 1024 job run today with ws-gram. >>> >> >> I painted the results here: >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >>> >> >> >>> >> >> Seems like client memory per job is about 370k. Which is quite a >>> lot. >>> >> >> What kinda worries me is that it doesn't seem to go down after >>> the >>> >> >> jobs >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage >>> >> >> collector >>> >> >> doesn't do any major collections. I'll need to profile this to >>> see >>> >> >> exactly what we're talking about. >>> >> >> >>> >> >> The container memory is figured by looking at the process in >>> /proc. >>> >> >> It's >>> >> >> total memory including shared libraries and things. But libraries >>> >> >> take a >>> >> >> fixed amount of space, so a fuzzy correlation can probably be >>> made. >>> >> >> It >>> >> >> looks quite similar to the amount of memory eaten on the client >>> side >>> >> >> (per job). >>> >> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the >>> time >>> >> >> the >>> >> >> jobs are submitted, but the machine itself seems responsive. I >>> have >>> >> >> yet >>> >> >> to plot the exact submission time for each job. >>> >> >> >>> >> >> So at this point I would recommend trying ws-gram as long as >>> there >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), >>> and >>> >> >> while >>> >> >> making sure the jvm has enough heap. More than that seems like a >>> >> >> gamble. >>> >> >> >>> >> >> Mihael >>> >> >> >>> >> >> _______________________________________________ >>> >> >> Swift-devel mailing list >>> >> >> Swift-devel at ci.uchicago.edu >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> >>> >> >> >>> >> > >>> >> >>> > >>> > >>> >>> >> >> > > > From benc at hawaga.org.uk Fri Feb 8 11:22:08 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 17:22:08 +0000 (GMT) Subject: [Swift-devel] local provider maximum simultaneous jobs Message-ID: I'd like to make it so out-of-the-box the localhost site does not try to run more than a handful of jobs at once - in almost any case, that is the desired behaviour, I think. There's no documented per-site profile entry for rate limiting like this. Is there a secret one? -- From hategan at mcs.anl.gov Fri Feb 8 11:24:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:24:43 -0600 Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: References: <1202490673.8302.3.camel@blabla.mcs.anl.gov> Message-ID: <1202491483.9045.4.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 17:19 +0000, Ben Clifford wrote: > > On Fri, 8 Feb 2008, Mihael Hategan wrote: > > > Yep. Hard problem. In general, OOMs are tricky to handle. I was thinking > > of pre-allocating some space to use in such cases for clean shutdown, > > but given the concurrency, this may or may not work properly. > > For my purposes, I don't really need anything cleaner than the JVM exiting > with an error code sometime around the memory running out. Not correct semantics when swift acts as a service (think I2U2). I should probably find a way to immediately cancel a whole workflow when OOMs are caught so that client software can un-reference it and eventually get back to stability. But again, not having enough memory may cause arbitrary breakage in arbitrary threads, so it's hard to guarantee consistency after such a thing. So let's keep chatting, maybe something will come up. 
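For concreteness, the pre-allocation idea mentioned above can be sketched in a few lines of plain Java. This is only an illustration of the pattern, not Swift or Karajan code; the class name, the 1 MB reserve size and the exit code are invented here, and, as noted, concurrency may still defeat it because other threads can fail before the guarded one does:

    public class OomGuard {
        // Reserve a little heap up front; drop it when an OOM is caught so
        // that the shutdown path has some memory left to work with.
        private static byte[] reserve = new byte[1024 * 1024];

        public static void runGuarded(Runnable body) {
            try {
                body.run();
            } catch (OutOfMemoryError oom) {
                reserve = null;   // free the reserve for the cleanup code
                System.err.println("Out of memory, shutting down: " + oom);
                // cancel the running workflow / notify the embedding service
                // here, then exit as a last resort
                System.exit(3);
            }
        }
    }

Whether the catch block should exit the JVM or merely cancel the workflow is exactly the standalone-versus-service question being discussed in this thread.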
> > I hacked in a try/catch around karajan's EventWorker.run() which is > catching enough for me at the moment. Normally it should generate a fault and propagate it up the call stack, but that may itself require memory. > From hategan at mcs.anl.gov Fri Feb 8 11:27:29 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:27:29 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202491649.9045.8.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > Mihael, > > i think i found the memory hole in GramJob. > 100 jobs in a test of mine consumed about 23MB (constantly > growing) before the fix and 8MB (very slowly growing) after > the fix. The big part of that (7MB) is used right from the > first job which may be the NotificationConsumerManager. > Will commit that change soon to 4.0 branch and you may try > it then. > Are you using 4.0.x in your tests? Yes. If there are no API changes, you can send me the jar file. I don't have enough knowledge to selectively build WS-GRAM, nor enough disk space to build the whole GT. > > Martin > > >>> > > >>> > These are both hacks. I'm not sure I want to go there. 300K per job > >>> is > >>> a > >>> > bit too much considering that swift (which has to consider many more > >>> > things) has less than 10K overhead per job. > >>> > > >>> > >>> > >>> For my better understanding: > >>> Do you start up your own notification consumer manager that listens for > >>> notifications of all jobs or do you let each GramJob instance listen > >>> for > >>> notifications itself? > >>> In case you listen for notifications yourself: do you store > >>> GramJob objects or just EPR's of jobs and create GramJob objects if > >>> needed? > >> > >> Excellent points. I let each GramJob instance listen for notifications > >> itself. What I observed is that it uses only one container for that. > >> > > > > Shoot! i didn't know that and thought there would be a container per > > GramJob in that case. That's the core mysteries with notifications. > > Anyway: I did a quick check some days ago and found that GramJob is > > surprisingly greedy regarding memory as you said. I'll have to further > > check what it is, but will probably not do that before 4.2 is out. > > > > > >> Due to the above, a reference to the GramJob is kept anyway, regardless > >> of whether that reference is in client code or the local container. > >> > >> I'll try to profile a run and see if I can spot where the problems are. > >> > >>> > >>> Martin > >>> > >>> >> > >>> >> The core team will be looking at improving notifications once their > >>> >> other 4.2 deliverables are done. 
> >>> >> > >>> >> -Stu > >>> >> > >>> >> Begin forwarded message: > >>> >> > >>> >> > From: feller at mcs.anl.gov > >>> >> > Date: February 1, 2008 9:41:05 AM CST > >>> >> > To: "Jaime Frey" > >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > >>> >> >>> >> > >, "Martin Feller" , "charles bacon" > >>> >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > >>> >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >>> , > >>> >> > "Todd Tannenbaum" , "Miron Livny" > >>> >> >>> >> > > > >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >>> >> > > >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >>> >> >> > >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >>> >> >>> > >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >>> >> >>>> > >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS GRAM > >>> >> >>>>> raised some concerns about memory usage on the client side. I > >>> did > >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >>> appeared > >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > >>> wrapper > >>> >> >>>>> around the java client libraries for WS GRAM. > >>> >> >>>>> > >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 at > >>> a > >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >>> execution. > >>> >> >>>>> Here is what I've discovered so far. > >>> >> >>>>> > >>> >> >>>>> Aside from the heap available to the java code, the jvm used > >>> 117 > >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >>> Condor-G > >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >>> >> >>>>> > >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >>> collector) > >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >>> complete), > >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >>> >> >>>>> > >>> >> >>>>> The only long-term memory per job that I know of in the GAHP > >>> is > >>> >> >>>>> for the notification sink for job status callbacks. 600kb > >>> seems > >>> a > >>> >> >>>>> little high for that. Stu, could someone on Globus help us > >>> >> >>>>> determine if we're using the notification sinks inefficiently? > >>> >> >>>> > >>> >> >>>> Martin just looked and for the most part, there is nothing > >>> wrong > >>> >> >>>> with how condor-g manages the callback sink. > >>> >> >>>> However, one improvement that would reduce the memory used per > >>> job > >>> >> >>>> would be to not have a notification consumer per job. Instead > >>> use > >>> >> >>>> one for all jobs. > >>> >> >>>> > >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > >>> tests > >>> >> >>>> and found that notifications are building up on the in the > >>> GRAM4 > >>> >> >>>> service container and that is causing delays which seem to be > >>> >> >>>> causing multiple problems. We're looking at this in a separate > >>> >> >>>> effort with the GT Core team. But, after this was clear, > >>> Martin > >>> >> >>>> re- > >>> >> >>>> ran the condor-g test and relied on polling between condor-g > >>> and > >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could you > >>> >> >>>> repeat the no-notification test and see the difference in > >>> memory? 
> >>> >> >>>> The changes would be to increase the polling frequency in > >>> condor-g > >>> >> >>>> and comment out the subscribe for notification. You could also > >>> >> >>>> comment out the notification listener call(s) too. > >>> >> >>> > >>> >> >>> > >>> >> >>> I did two new sets of tests today. The first used more efficient > >>> >> >>> callback code in the GAHP (one notification consumer rather than > >>> one > >>> >> >>> per job). The second disabled notifications and relied on > >>> polling > >>> >> >>> for job status changes. > >>> >> >>> > >>> >> >>> The more efficient callback code did not produce a noticeable > >>> >> >>> reduction in memory usage. > >>> >> >>> > >>> >> >>> Disabling notifications did reduce memory usage. The maximum jvm > >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The minimum > >>> >> >>> heap usage after job submission and before job completion was > >>> about > >>> >> >>> 4 megs + 0.1 megs per job. > >>> >> >> > >>> >> >> > >>> >> >> I ran one more test with the improved callback code. This time, I > >>> >> >> stopped storing the notification producer EPRs associated with > >>> the > >>> >> >> GRAM job resources. Memory usage went down markedly. > >>> >> >> > >>> >> >> I was told the client had to explicitly destroy these serve-side > >>> >> >> notification producer resources when it destroys the job, > >>> otherwise > >>> >> >> they hang around bogging down the server. Is this still the case? > >>> The > >>> >> >> server can't destroy notification producers when their sources of > >>> >> >> information are destroyed? > >>> >> >> > >>> >> > > >>> >> > This reminds me of the odd fact that i had to suddenly grant much > >>> more > >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >>> >> > subscription resources to be able to destroy them eventually. > >>> >> > Those EPR's are maybe not so tiny as they look like. > >>> >> > > >>> >> > For 4.0: yes, currently you'll have to store and eventually > >>> destroy > >>> >> > subscription resources manually to avoid heaping up persistence > >>> data > >>> >> > on the server-side. > >>> >> > For 4.2: no, you won't have to store them. A job resource will > >>> >> > destroy all subscription resources when it's destroyed. > >>> >> > > >>> >> > Overall i suggest to concentrate on 4.2 gram since the "container > >>> >> > hangs in job destruction" problem won't exist anymore. > >>> >> > > >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >>> changes > >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it makes > >>> >> > sense > >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > >>> for > >>> >> > fine-tuning then? > >>> >> > > >>> >> > Martin > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >>> >> > >>> >> > Mihael: > >>> >> > > >>> >> > That's great, thanks! > >>> >> > > >>> >> > Ian. > >>> >> > > >>> >> > Mihael Hategan wrote: > >>> >> >> I did a 1024 job run today with ws-gram. > >>> >> >> I painted the results here: > >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >>> >> >> > >>> >> >> Seems like client memory per job is about 370k. Which is quite a > >>> lot. > >>> >> >> What kinda worries me is that it doesn't seem to go down after > >>> the > >>> >> >> jobs > >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >>> >> >> collector > >>> >> >> doesn't do any major collections. 
I'll need to profile this to > >>> see > >>> >> >> exactly what we're talking about. > >>> >> >> > >>> >> >> The container memory is figured by looking at the process in > >>> /proc. > >>> >> >> It's > >>> >> >> total memory including shared libraries and things. But libraries > >>> >> >> take a > >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > >>> made. > >>> >> >> It > >>> >> >> looks quite similar to the amount of memory eaten on the client > >>> side > >>> >> >> (per job). > >>> >> >> > >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > >>> time > >>> >> >> the > >>> >> >> jobs are submitted, but the machine itself seems responsive. I > >>> have > >>> >> >> yet > >>> >> >> to plot the exact submission time for each job. > >>> >> >> > >>> >> >> So at this point I would recommend trying ws-gram as long as > >>> there > >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > >>> and > >>> >> >> while > >>> >> >> making sure the jvm has enough heap. More than that seems like a > >>> >> >> gamble. > >>> >> >> > >>> >> >> Mihael > >>> >> >> > >>> >> >> _______________________________________________ > >>> >> >> Swift-devel mailing list > >>> >> >> Swift-devel at ci.uchicago.edu > >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> >> >> > >>> >> >> > >>> >> > > >>> >> > >>> > > >>> > > >>> > >>> > >> > >> > > > > > > > > From hategan at mcs.anl.gov Fri Feb 8 11:28:23 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:28:23 -0600 Subject: [Swift-devel] local provider maximum simultaneous jobs In-Reply-To: References: Message-ID: <1202491703.9045.10.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 17:22 +0000, Ben Clifford wrote: > I'd like to make it so out-of-the-box the localhost site does not try to > run more than a handful of jobs at once - in almost any case, that is the > desired behaviour, I think. > > There's no documented per-site profile entry for rate limiting like this. > Is there a secret one? Yep. It involves writing some Java code ;) > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Feb 8 11:34:20 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:34:20 -0600 Subject: [Swift-devel] local provider maximum simultaneous jobs In-Reply-To: <1202491703.9045.10.camel@blabla.mcs.anl.gov> References: <1202491703.9045.10.camel@blabla.mcs.anl.gov> Message-ID: <1202492061.9775.1.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 11:28 -0600, Mihael Hategan wrote: > On Fri, 2008-02-08 at 17:22 +0000, Ben Clifford wrote: > > I'd like to make it so out-of-the-box the localhost site does not try to > > run more than a handful of jobs at once - in almost any case, that is the > > desired behaviour, I think. > > > > There's no documented per-site profile entry for rate limiting like this. > > Is there a secret one? > > Yep. It involves writing some Java code ;) I'd say file a bug report and I'll probably get to it next week, since I'll be playing with the scheduler anyway to put the gram responsiveness stuff in. 
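What such a profile entry would ultimately control is just a cap on the number of local jobs started concurrently. A minimal sketch of that cap, assuming a hypothetical LocalJobThrottle class rather than the actual scheduler code Mihael refers to:

    public class LocalJobThrottle {
        private final int max;
        private int running = 0;

        public LocalJobThrottle(int max) {
            this.max = max;
        }

        // Block the submitting thread until one of the slots is free.
        public synchronized void acquire() throws InterruptedException {
            while (running >= max) {
                wait();
            }
            running++;
        }

        // Call when a local job finishes or fails.
        public synchronized void release() {
            running--;
            notify();
        }
    }

Each local job submission would call acquire() before forking the process and release() on completion, so that out of the box only a handful of jobs run at once; the per-site profile entry would just set max.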
> > > > > -- > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Fri Feb 8 11:36:21 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 8 Feb 2008 17:36:21 +0000 (GMT) Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: <1202491483.9045.4.camel@blabla.mcs.anl.gov> References: <1202490673.8302.3.camel@blabla.mcs.anl.gov> <1202491483.9045.4.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 8 Feb 2008, Mihael Hategan wrote: > Not correct semantics when swift acts as a service (think I2U2). I > should probably find a way to immediately cancel a whole workflow when > OOMs are caught so that client software can un-reference it and > eventually get back to stability. But again, not having enough memory > may cause arbitrary breakage in arbitrary threads, so it's hard to > guarantee consistency after such a thing. My philosophy, which is sort of backed up by the javadocs, is that OOM Errors are a signal that the JVM is so broken that it cannot continue - its the end of the universe as far as the JVM is concerned and there's nothing you can do. If you're so foolish as to run something (eg Swift) in your web server JVM that puts the JVM into that state, then sucker to you! cf. javadoc VirtualMachineError: > Thrown to indicate that the Java Virtual Machine is broken or has run > out of resources necessary for it to continue operating. There's a bunch more memory management stuff in java 5 (eg the MXBeans) which are perhaps interesting - eg. when memory gets low, stop doing certain things / more cleanly abort select pieces of what lives in the JVM. -- From hategan at mcs.anl.gov Fri Feb 8 11:44:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 11:44:03 -0600 Subject: [Swift-devel] behaviour on out-of-memory In-Reply-To: References: <1202490673.8302.3.camel@blabla.mcs.anl.gov> <1202491483.9045.4.camel@blabla.mcs.anl.gov> Message-ID: <1202492643.10121.4.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 17:36 +0000, Ben Clifford wrote: > > On Fri, 8 Feb 2008, Mihael Hategan wrote: > > > Not correct semantics when swift acts as a service (think I2U2). I > > should probably find a way to immediately cancel a whole workflow when > > OOMs are caught so that client software can un-reference it and > > eventually get back to stability. But again, not having enough memory > > may cause arbitrary breakage in arbitrary threads, so it's hard to > > guarantee consistency after such a thing. > > My philosophy, which is sort of backed up by the javadocs, is that OOM > Errors are a signal that the JVM is so broken that it cannot continue - > its the end of the universe as far as the JVM is concerned and there's > nothing you can do. If you're so foolish as to run something (eg Swift) in > your web server JVM that puts the JVM into that state, then sucker to you! Yes and no. I there are cases when one can safely deal with it and other cases when it's ok to let it provide partial functionality. I don't want to definitely do/say one thing or the other at this point. I've had the same argument with Jarek (or rather the reverse argument). The WSRF container catches OOMs and does some cleanup and continues. I said it shouldn't be done. 
When you have an OOM it's safer to have no service than to risk nondeterministic behavior or even potential security problems. So yes, I also happen to agree with you besides disagreeing with you. > > cf. javadoc VirtualMachineError: > > > Thrown to indicate that the Java Virtual Machine is broken or has run > > out of resources necessary for it to continue operating. > > There's a bunch more memory management stuff in java 5 (eg the MXBeans) > which are perhaps interesting - eg. when memory gets low, stop doing > certain things / more cleanly abort select pieces of what lives in the > JVM. > Hmm. Interesting. I have to look at that. It may be time to slowly move towards java 5. From feller at mcs.anl.gov Fri Feb 8 13:21:29 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 13:21:29 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202491649.9045.8.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> Message-ID: <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> Try the attached 4.0 compliant jar in your tests by dropping it in your 4.0.x $GLOBUS_LOCATION/lib. My tests showed about 2MB memory increase per 100 GramJob objects which sounds to me like a reasonable number (about 20k per GramJob object ignoring the notification consumer manager in one job - if my calculations are right) Martin > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> Mihael, >> >> i think i found the memory hole in GramJob. >> 100 jobs in a test of mine consumed about 23MB (constantly >> growing) before the fix and 8MB (very slowly growing) after >> the fix. The big part of that (7MB) is used right from the >> first job which may be the NotificationConsumerManager. >> Will commit that change soon to 4.0 branch and you may try >> it then. >> Are you using 4.0.x in your tests? > > Yes. If there are no API changes, you can send me the jar file. I don't > have enough knowledge to selectively build WS-GRAM, nor enough disk > space to build the whole GT. > >> >> Martin >> >> >>> > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per >> job >> >>> is >> >>> a >> >>> > bit too much considering that swift (which has to consider many >> more >> >>> > things) has less than 10K overhead per job. >> >>> > >> >>> >> >>> >> >>> For my better understanding: >> >>> Do you start up your own notification consumer manager that listens >> for >> >>> notifications of all jobs or do you let each GramJob instance listen >> >>> for >> >>> notifications itself? >> >>> In case you listen for notifications yourself: do you store >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if >> >>> needed? >> >> >> >> Excellent points. I let each GramJob instance listen for >> notifications >> >> itself. What I observed is that it uses only one container for that. >> >> >> > >> > Shoot! i didn't know that and thought there would be a container per >> > GramJob in that case. That's the core mysteries with notifications. 
>> > Anyway: I did a quick check some days ago and found that GramJob is >> > surprisingly greedy regarding memory as you said. I'll have to further >> > check what it is, but will probably not do that before 4.2 is out. >> > >> > >> >> Due to the above, a reference to the GramJob is kept anyway, >> regardless >> >> of whether that reference is in client code or the local container. >> >> >> >> I'll try to profile a run and see if I can spot where the problems >> are. >> >> >> >>> >> >>> Martin >> >>> >> >>> >> >> >>> >> The core team will be looking at improving notifications once >> their >> >>> >> other 4.2 deliverables are done. >> >>> >> >> >>> >> -Stu >> >>> >> >> >>> >> Begin forwarded message: >> >>> >> >> >>> >> > From: feller at mcs.anl.gov >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> >>> >> > To: "Jaime Frey" >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" >> >>> >> > >>> >> > >, "Martin Feller" , "charles bacon" >> >>> >> > >>> >> > >, "Suchandra Thapa" , "Rob Gardner" >> >>> >> > >>> >> > >, "Jeff Porter" , "Alain Roy" >> >>> , >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> >>> >> > >>> >> > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> >>> >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >>> >> >> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >>> >> >>> >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >>> >> >>>> >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS >> GRAM >> >>> >> >>>>> raised some concerns about memory usage on the client side. >> I >> >>> did >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >> >>> appeared >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a >> >>> wrapper >> >>> >> >>>>> around the java client libraries for WS GRAM. >> >>> >> >>>>> >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 >> at >> >>> a >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and >> >>> execution. >> >>> >> >>>>> Here is what I've discovered so far. >> >>> >> >>>>> >> >>> >> >>>>> Aside from the heap available to the java code, the jvm >> used >> >>> 117 >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >> >>> Condor-G >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. >> >>> >> >>>>> >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> >>> collector) >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >> >>> complete), >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >>> >> >>>>> >> >>> >> >>>>> The only long-term memory per job that I know of in the >> GAHP >> >>> is >> >>> >> >>>>> for the notification sink for job status callbacks. 600kb >> >>> seems >> >>> a >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us >> >>> >> >>>>> determine if we're using the notification sinks >> inefficiently? >> >>> >> >>>> >> >>> >> >>>> Martin just looked and for the most part, there is nothing >> >>> wrong >> >>> >> >>>> with how condor-g manages the callback sink. >> >>> >> >>>> However, one improvement that would reduce the memory used >> per >> >>> job >> >>> >> >>>> would be to not have a notification consumer per job. >> Instead >> >>> use >> >>> >> >>>> one for all jobs. 
>> >>> >> >>>> >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress >> >>> tests >> >>> >> >>>> and found that notifications are building up on the in the >> >>> GRAM4 >> >>> >> >>>> service container and that is causing delays which seem to >> be >> >>> >> >>>> causing multiple problems. We're looking at this in a >> separate >> >>> >> >>>> effort with the GT Core team. But, after this was clear, >> >>> Martin >> >>> >> >>>> re- >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g >> >>> and >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could >> you >> >>> >> >>>> repeat the no-notification test and see the difference in >> >>> memory? >> >>> >> >>>> The changes would be to increase the polling frequency in >> >>> condor-g >> >>> >> >>>> and comment out the subscribe for notification. You could >> also >> >>> >> >>>> comment out the notification listener call(s) too. >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> I did two new sets of tests today. The first used more >> efficient >> >>> >> >>> callback code in the GAHP (one notification consumer rather >> than >> >>> one >> >>> >> >>> per job). The second disabled notifications and relied on >> >>> polling >> >>> >> >>> for job status changes. >> >>> >> >>> >> >>> >> >>> The more efficient callback code did not produce a noticeable >> >>> >> >>> reduction in memory usage. >> >>> >> >>> >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum >> jvm >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The >> minimum >> >>> >> >>> heap usage after job submission and before job completion was >> >>> about >> >>> >> >>> 4 megs + 0.1 megs per job. >> >>> >> >> >> >>> >> >> >> >>> >> >> I ran one more test with the improved callback code. This >> time, I >> >>> >> >> stopped storing the notification producer EPRs associated with >> >>> the >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> >>> >> >> >> >>> >> >> I was told the client had to explicitly destroy these >> serve-side >> >>> >> >> notification producer resources when it destroys the job, >> >>> otherwise >> >>> >> >> they hang around bogging down the server. Is this still the >> case? >> >>> The >> >>> >> >> server can't destroy notification producers when their sources >> of >> >>> >> >> information are destroyed? >> >>> >> >> >> >>> >> > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant >> much >> >>> more >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of >> >>> >> > subscription resources to be able to destroy them eventually. >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> >>> >> > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually >> >>> destroy >> >>> >> > subscription resources manually to avoid heaping up persistence >> >>> data >> >>> >> > on the server-side. >> >>> >> > For 4.2: no, you won't have to store them. A job resource will >> >>> >> > destroy all subscription resources when it's destroyed. >> >>> >> > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> "container >> >>> >> > hangs in job destruction" problem won't exist anymore. >> >>> >> > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 >> >>> changes >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it >> makes >> >>> >> > sense >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you >> >>> for >> >>> >> > fine-tuning then? 
>> >>> >> > >> >>> >> > Martin >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >>> >> >> >>> >> > Mihael: >> >>> >> > >> >>> >> > That's great, thanks! >> >>> >> > >> >>> >> > Ian. >> >>> >> > >> >>> >> > Mihael Hategan wrote: >> >>> >> >> I did a 1024 job run today with ws-gram. >> >>> >> >> I painted the results here: >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >>> >> >> >> >>> >> >> Seems like client memory per job is about 370k. Which is quite >> a >> >>> lot. >> >>> >> >> What kinda worries me is that it doesn't seem to go down after >> >>> the >> >>> >> >> jobs >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage >> >>> >> >> collector >> >>> >> >> doesn't do any major collections. I'll need to profile this to >> >>> see >> >>> >> >> exactly what we're talking about. >> >>> >> >> >> >>> >> >> The container memory is figured by looking at the process in >> >>> /proc. >> >>> >> >> It's >> >>> >> >> total memory including shared libraries and things. But >> libraries >> >>> >> >> take a >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be >> >>> made. >> >>> >> >> It >> >>> >> >> looks quite similar to the amount of memory eaten on the >> client >> >>> side >> >>> >> >> (per job). >> >>> >> >> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the >> >>> time >> >>> >> >> the >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I >> >>> have >> >>> >> >> yet >> >>> >> >> to plot the exact submission time for each job. >> >>> >> >> >> >>> >> >> So at this point I would recommend trying ws-gram as long as >> >>> there >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), >> >>> and >> >>> >> >> while >> >>> >> >> making sure the jvm has enough heap. More than that seems like >> a >> >>> >> >> gamble. >> >>> >> >> >> >>> >> >> Mihael >> >>> >> >> >> >>> >> >> _______________________________________________ >> >>> >> >> Swift-devel mailing list >> >>> >> >> Swift-devel at ci.uchicago.edu >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >>> >> >> >> >>> >> >> >> >>> >> > >> >>> >> >> >>> > >> >>> > >> >>> >> >>> >> >> >> >> >> > >> > >> > >> >> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: gram-client.jar Type: application/octet-stream Size: 35825 bytes Desc: not available URL: From hategan at mcs.anl.gov Fri Feb 8 13:29:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 13:29:03 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202498943.15258.4.camel@blabla.mcs.anl.gov> Thanks. I'll give it a try as people head home for the weekend and the heat in the queues is allowed to dissipate. My profiler says that some hefty amount of heap is used by a relatively low number of EndpointReferenceType objects. 
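The per-job figures being traded in this thread (Mihael's roughly 370k per job on the client side, Martin's roughly 20k per GramJob object after the fix) can be cross-checked without a profiler by a crude used-heap measurement. A rough sketch, where makeJob() merely stands in for whatever allocates and retains a GramJob, and the result is only as trustworthy as System.gc() happens to be:

    public class HeapPerObject {
        public static void main(String[] args) throws Exception {
            int n = 100;
            Object[] keep = new Object[n];
            long before = usedHeap();
            for (int i = 0; i < n; i++) {
                keep[i] = makeJob();      // stand-in for creating a GramJob
            }
            long after = usedHeap();
            System.out.println("approx bytes per object: "
                               + (after - before) / n);
        }

        // Force a couple of collections and report used heap.
        static long usedHeap() throws InterruptedException {
            Runtime rt = Runtime.getRuntime();
            rt.gc();
            Thread.sleep(200);
            rt.gc();
            return rt.totalMemory() - rt.freeMemory();
        }

        static Object makeJob() {
            return new byte[20 * 1024];   // placeholder allocation only
        }
    }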
Btw, where do I get the sources for addressing? On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > Try the attached 4.0 compliant jar in your tests by dropping > it in your 4.0.x $GLOBUS_LOCATION/lib. > My tests showed about 2MB memory increase per 100 GramJob > objects which sounds to me like a reasonable number (about 20k > per GramJob object ignoring the notification consumer manager > in one job - if my calculations are right) > > Martin > > > > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > >> Mihael, > >> > >> i think i found the memory hole in GramJob. > >> 100 jobs in a test of mine consumed about 23MB (constantly > >> growing) before the fix and 8MB (very slowly growing) after > >> the fix. The big part of that (7MB) is used right from the > >> first job which may be the NotificationConsumerManager. > >> Will commit that change soon to 4.0 branch and you may try > >> it then. > >> Are you using 4.0.x in your tests? > > > > Yes. If there are no API changes, you can send me the jar file. I don't > > have enough knowledge to selectively build WS-GRAM, nor enough disk > > space to build the whole GT. > > > >> > >> Martin > >> > >> >>> > > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per > >> job > >> >>> is > >> >>> a > >> >>> > bit too much considering that swift (which has to consider many > >> more > >> >>> > things) has less than 10K overhead per job. > >> >>> > > >> >>> > >> >>> > >> >>> For my better understanding: > >> >>> Do you start up your own notification consumer manager that listens > >> for > >> >>> notifications of all jobs or do you let each GramJob instance listen > >> >>> for > >> >>> notifications itself? > >> >>> In case you listen for notifications yourself: do you store > >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if > >> >>> needed? > >> >> > >> >> Excellent points. I let each GramJob instance listen for > >> notifications > >> >> itself. What I observed is that it uses only one container for that. > >> >> > >> > > >> > Shoot! i didn't know that and thought there would be a container per > >> > GramJob in that case. That's the core mysteries with notifications. > >> > Anyway: I did a quick check some days ago and found that GramJob is > >> > surprisingly greedy regarding memory as you said. I'll have to further > >> > check what it is, but will probably not do that before 4.2 is out. > >> > > >> > > >> >> Due to the above, a reference to the GramJob is kept anyway, > >> regardless > >> >> of whether that reference is in client code or the local container. > >> >> > >> >> I'll try to profile a run and see if I can spot where the problems > >> are. > >> >> > >> >>> > >> >>> Martin > >> >>> > >> >>> >> > >> >>> >> The core team will be looking at improving notifications once > >> their > >> >>> >> other 4.2 deliverables are done. 
> >> >>> >> > >> >>> >> -Stu > >> >>> >> > >> >>> >> Begin forwarded message: > >> >>> >> > >> >>> >> > From: feller at mcs.anl.gov > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > >> >>> >> > To: "Jaime Frey" > >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >>> >> >> >>> >> > >, "Martin Feller" , "charles bacon" > >> >>> >> >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >>> >> >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >> >>> , > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > >> >>> >> >> >>> >> > > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >> >>> >> > > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >>> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >>> >> >>> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >>> >> >>>> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS > >> GRAM > >> >>> >> >>>>> raised some concerns about memory usage on the client side. > >> I > >> >>> did > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >> >>> appeared > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > >> >>> wrapper > >> >>> >> >>>>> around the java client libraries for WS GRAM. > >> >>> >> >>>>> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 > >> at > >> >>> a > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >> >>> execution. > >> >>> >> >>>>> Here is what I've discovered so far. > >> >>> >> >>>>> > >> >>> >> >>>>> Aside from the heap available to the java code, the jvm > >> used > >> >>> 117 > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >> >>> Condor-G > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >>> >> >>>>> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> >>> collector) > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >> >>> complete), > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >>> >> >>>>> > >> >>> >> >>>>> The only long-term memory per job that I know of in the > >> GAHP > >> >>> is > >> >>> >> >>>>> for the notification sink for job status callbacks. 600kb > >> >>> seems > >> >>> a > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us > >> >>> >> >>>>> determine if we're using the notification sinks > >> inefficiently? > >> >>> >> >>>> > >> >>> >> >>>> Martin just looked and for the most part, there is nothing > >> >>> wrong > >> >>> >> >>>> with how condor-g manages the callback sink. > >> >>> >> >>>> However, one improvement that would reduce the memory used > >> per > >> >>> job > >> >>> >> >>>> would be to not have a notification consumer per job. > >> Instead > >> >>> use > >> >>> >> >>>> one for all jobs. > >> >>> >> >>>> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > >> >>> tests > >> >>> >> >>>> and found that notifications are building up on the in the > >> >>> GRAM4 > >> >>> >> >>>> service container and that is causing delays which seem to > >> be > >> >>> >> >>>> causing multiple problems. We're looking at this in a > >> separate > >> >>> >> >>>> effort with the GT Core team. 
But, after this was clear, > >> >>> Martin > >> >>> >> >>>> re- > >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g > >> >>> and > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could > >> you > >> >>> >> >>>> repeat the no-notification test and see the difference in > >> >>> memory? > >> >>> >> >>>> The changes would be to increase the polling frequency in > >> >>> condor-g > >> >>> >> >>>> and comment out the subscribe for notification. You could > >> also > >> >>> >> >>>> comment out the notification listener call(s) too. > >> >>> >> >>> > >> >>> >> >>> > >> >>> >> >>> I did two new sets of tests today. The first used more > >> efficient > >> >>> >> >>> callback code in the GAHP (one notification consumer rather > >> than > >> >>> one > >> >>> >> >>> per job). The second disabled notifications and relied on > >> >>> polling > >> >>> >> >>> for job status changes. > >> >>> >> >>> > >> >>> >> >>> The more efficient callback code did not produce a noticeable > >> >>> >> >>> reduction in memory usage. > >> >>> >> >>> > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum > >> jvm > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > >> minimum > >> >>> >> >>> heap usage after job submission and before job completion was > >> >>> about > >> >>> >> >>> 4 megs + 0.1 megs per job. > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> I ran one more test with the improved callback code. This > >> time, I > >> >>> >> >> stopped storing the notification producer EPRs associated with > >> >>> the > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > >> >>> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these > >> serve-side > >> >>> >> >> notification producer resources when it destroys the job, > >> >>> otherwise > >> >>> >> >> they hang around bogging down the server. Is this still the > >> case? > >> >>> The > >> >>> >> >> server can't destroy notification producers when their sources > >> of > >> >>> >> >> information are destroyed? > >> >>> >> >> > >> >>> >> > > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant > >> much > >> >>> more > >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> >>> >> > subscription resources to be able to destroy them eventually. > >> >>> >> > Those EPR's are maybe not so tiny as they look like. > >> >>> >> > > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually > >> >>> destroy > >> >>> >> > subscription resources manually to avoid heaping up persistence > >> >>> data > >> >>> >> > on the server-side. > >> >>> >> > For 4.2: no, you won't have to store them. A job resource will > >> >>> >> > destroy all subscription resources when it's destroyed. > >> >>> >> > > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > >> "container > >> >>> >> > hangs in job destruction" problem won't exist anymore. > >> >>> >> > > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >> >>> changes > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it > >> makes > >> >>> >> > sense > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > >> >>> for > >> >>> >> > fine-tuning then? > >> >>> >> > > >> >>> >> > Martin > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> >>> >> > >> >>> >> > Mihael: > >> >>> >> > > >> >>> >> > That's great, thanks! 
> >> >>> >> > > >> >>> >> > Ian. > >> >>> >> > > >> >>> >> > Mihael Hategan wrote: > >> >>> >> >> I did a 1024 job run today with ws-gram. > >> >>> >> >> I painted the results here: > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >>> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is quite > >> a > >> >>> lot. > >> >>> >> >> What kinda worries me is that it doesn't seem to go down after > >> >>> the > >> >>> >> >> jobs > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >>> >> >> collector > >> >>> >> >> doesn't do any major collections. I'll need to profile this to > >> >>> see > >> >>> >> >> exactly what we're talking about. > >> >>> >> >> > >> >>> >> >> The container memory is figured by looking at the process in > >> >>> /proc. > >> >>> >> >> It's > >> >>> >> >> total memory including shared libraries and things. But > >> libraries > >> >>> >> >> take a > >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > >> >>> made. > >> >>> >> >> It > >> >>> >> >> looks quite similar to the amount of memory eaten on the > >> client > >> >>> side > >> >>> >> >> (per job). > >> >>> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > >> >>> time > >> >>> >> >> the > >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I > >> >>> have > >> >>> >> >> yet > >> >>> >> >> to plot the exact submission time for each job. > >> >>> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long as > >> >>> there > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > >> >>> and > >> >>> >> >> while > >> >>> >> >> making sure the jvm has enough heap. More than that seems like > >> a > >> >>> >> >> gamble. > >> >>> >> >> > >> >>> >> >> Mihael > >> >>> >> >> > >> >>> >> >> _______________________________________________ > >> >>> >> >> Swift-devel mailing list > >> >>> >> >> Swift-devel at ci.uchicago.edu > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >>> >> >> > >> >>> >> >> > >> >>> >> > > >> >>> >> > >> >>> > > >> >>> > > >> >>> > >> >>> > >> >> > >> >> > >> > > >> > > >> > > >> > >> > > > > From feller at mcs.anl.gov Fri Feb 8 13:46:00 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 13:46:00 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202498943.15258.4.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202498943.15258.4.camel@blabla.mcs.anl.gov> Message-ID: <11618.208.54.7.179.1202499960.squirrel@www-unix.mcs.anl.gov> > Thanks. I'll give it a try as people head home for the weekend and the > heat in the queues is allowed to dissipate. > > My profiler says that some hefty amount of heap is used by a relatively > low number of EndpointReferenceType objects. Btw, where do I get the > sources for addressing? It's included as a jar in wsrf, but you can also see the sources by extracting java/lib-src/ws-addressing/ws-addressing.tar.gz of the wsrf package. 
so: cvs co -r globus_4_0_6 wsrf cd wsrf/java/lib-src/ws-addressing/ ... And yes, it seems to be the objects of type EndpointReferenceType. Those seem to be beasts. Rachana once mentioned that they should be trimmed when you get them from the stubs because they contain "SOAP crap". GramJob stored the job-EPR and subscription-EPR as they came from the output from the call to the factory stub. In the new jar trimmed eprs (provided by ObjectSerializer.clone(eprObject)) are stored in GramJob objects instead of the raw ones. Martin > On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: >> Try the attached 4.0 compliant jar in your tests by dropping >> it in your 4.0.x $GLOBUS_LOCATION/lib. >> My tests showed about 2MB memory increase per 100 GramJob >> objects which sounds to me like a reasonable number (about 20k >> per GramJob object ignoring the notification consumer manager >> in one job - if my calculations are right) >> >> Martin >> >> > >> > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> >> Mihael, >> >> >> >> i think i found the memory hole in GramJob. >> >> 100 jobs in a test of mine consumed about 23MB (constantly >> >> growing) before the fix and 8MB (very slowly growing) after >> >> the fix. The big part of that (7MB) is used right from the >> >> first job which may be the NotificationConsumerManager. >> >> Will commit that change soon to 4.0 branch and you may try >> >> it then. >> >> Are you using 4.0.x in your tests? >> > >> > Yes. If there are no API changes, you can send me the jar file. I >> don't >> > have enough knowledge to selectively build WS-GRAM, nor enough disk >> > space to build the whole GT. >> > >> >> >> >> Martin >> >> >> >> >>> > >> >> >>> > These are both hacks. I'm not sure I want to go there. 300K per >> >> job >> >> >>> is >> >> >>> a >> >> >>> > bit too much considering that swift (which has to consider many >> >> more >> >> >>> > things) has less than 10K overhead per job. >> >> >>> > >> >> >>> >> >> >>> >> >> >>> For my better understanding: >> >> >>> Do you start up your own notification consumer manager that >> listens >> >> for >> >> >>> notifications of all jobs or do you let each GramJob instance >> listen >> >> >>> for >> >> >>> notifications itself? >> >> >>> In case you listen for notifications yourself: do you store >> >> >>> GramJob objects or just EPR's of jobs and create GramJob objects >> if >> >> >>> needed? >> >> >> >> >> >> Excellent points. I let each GramJob instance listen for >> >> notifications >> >> >> itself. What I observed is that it uses only one container for >> that. >> >> >> >> >> > >> >> > Shoot! i didn't know that and thought there would be a container >> per >> >> > GramJob in that case. That's the core mysteries with notifications. >> >> > Anyway: I did a quick check some days ago and found that GramJob is >> >> > surprisingly greedy regarding memory as you said. I'll have to >> further >> >> > check what it is, but will probably not do that before 4.2 is out. >> >> > >> >> > >> >> >> Due to the above, a reference to the GramJob is kept anyway, >> >> regardless >> >> >> of whether that reference is in client code or the local >> container. >> >> >> >> >> >> I'll try to profile a run and see if I can spot where the problems >> >> are. >> >> >> >> >> >>> >> >> >>> Martin >> >> >>> >> >> >>> >> >> >> >>> >> The core team will be looking at improving notifications once >> >> their >> >> >>> >> other 4.2 deliverables are done. 
>> >> >>> >> >> >> >>> >> -Stu >> >> >>> >> >> >> >>> >> Begin forwarded message: >> >> >>> >> >> >> >>> >> > From: feller at mcs.anl.gov >> >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> >> >>> >> > To: "Jaime Frey" >> >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" >> >> >>> >> > >> >>> >> > >, "Martin Feller" , "charles bacon" >> >> >>> >> > >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" >> >> >>> >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" >> >> >>> , >> >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> >> >>> >> > >> >>> >> > > >> >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> >> >>> >> > >> >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> >>> >> >> >> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >> >>> >> >>> >> >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >> >>> >> >>>> >> >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with >> WS >> >> GRAM >> >> >>> >> >>>>> raised some concerns about memory usage on the client >> side. >> >> I >> >> >>> did >> >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >> >> >>> appeared >> >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a >> >> >>> wrapper >> >> >>> >> >>>>> around the java client libraries for WS GRAM. >> >> >>> >> >>>>> >> >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to >> 30 >> >> at >> >> >>> a >> >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal >> data >> >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and >> >> >>> execution. >> >> >>> >> >>>>> Here is what I've discovered so far. >> >> >>> >> >>>>> >> >> >>> >> >>>>> Aside from the heap available to the java code, the jvm >> >> used >> >> >>> 117 >> >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >> >> >>> Condor-G >> >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) >> pair. >> >> >>> >> >>>>> >> >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> >> >>> collector) >> >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP >> was >> >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >> >> >>> complete), >> >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >> >>> >> >>>>> >> >> >>> >> >>>>> The only long-term memory per job that I know of in the >> >> GAHP >> >> >>> is >> >> >>> >> >>>>> for the notification sink for job status callbacks. >> 600kb >> >> >>> seems >> >> >>> a >> >> >>> >> >>>>> little high for that. Stu, could someone on Globus help >> us >> >> >>> >> >>>>> determine if we're using the notification sinks >> >> inefficiently? >> >> >>> >> >>>> >> >> >>> >> >>>> Martin just looked and for the most part, there is >> nothing >> >> >>> wrong >> >> >>> >> >>>> with how condor-g manages the callback sink. >> >> >>> >> >>>> However, one improvement that would reduce the memory >> used >> >> per >> >> >>> job >> >> >>> >> >>>> would be to not have a notification consumer per job. >> >> Instead >> >> >>> use >> >> >>> >> >>>> one for all jobs. >> >> >>> >> >>>> >> >> >>> >> >>>> Also, Martin recently did some analysis on condor-g >> stress >> >> >>> tests >> >> >>> >> >>>> and found that notifications are building up on the in >> the >> >> >>> GRAM4 >> >> >>> >> >>>> service container and that is causing delays which seem >> to >> >> be >> >> >>> >> >>>> causing multiple problems. 
We're looking at this in a >> >> separate >> >> >>> >> >>>> effort with the GT Core team. But, after this was clear, >> >> >>> Martin >> >> >>> >> >>>> re- >> >> >>> >> >>>> ran the condor-g test and relied on polling between >> condor-g >> >> >>> and >> >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could >> >> you >> >> >>> >> >>>> repeat the no-notification test and see the difference in >> >> >>> memory? >> >> >>> >> >>>> The changes would be to increase the polling frequency in >> >> >>> condor-g >> >> >>> >> >>>> and comment out the subscribe for notification. You >> could >> >> also >> >> >>> >> >>>> comment out the notification listener call(s) too. >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> I did two new sets of tests today. The first used more >> >> efficient >> >> >>> >> >>> callback code in the GAHP (one notification consumer >> rather >> >> than >> >> >>> one >> >> >>> >> >>> per job). The second disabled notifications and relied on >> >> >>> polling >> >> >>> >> >>> for job status changes. >> >> >>> >> >>> >> >> >>> >> >>> The more efficient callback code did not produce a >> noticeable >> >> >>> >> >>> reduction in memory usage. >> >> >>> >> >>> >> >> >>> >> >>> Disabling notifications did reduce memory usage. The >> maximum >> >> jvm >> >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The >> >> minimum >> >> >>> >> >>> heap usage after job submission and before job completion >> was >> >> >>> about >> >> >>> >> >>> 4 megs + 0.1 megs per job. >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> I ran one more test with the improved callback code. This >> >> time, I >> >> >>> >> >> stopped storing the notification producer EPRs associated >> with >> >> >>> the >> >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> >> >>> >> >> >> >> >>> >> >> I was told the client had to explicitly destroy these >> >> serve-side >> >> >>> >> >> notification producer resources when it destroys the job, >> >> >>> otherwise >> >> >>> >> >> they hang around bogging down the server. Is this still the >> >> case? >> >> >>> The >> >> >>> >> >> server can't destroy notification producers when their >> sources >> >> of >> >> >>> >> >> information are destroyed? >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> > This reminds me of the odd fact that i had to suddenly grant >> >> much >> >> >>> more >> >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs >> of >> >> >>> >> > subscription resources to be able to destroy them >> eventually. >> >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> >> >>> >> > >> >> >>> >> > For 4.0: yes, currently you'll have to store and eventually >> >> >>> destroy >> >> >>> >> > subscription resources manually to avoid heaping up >> persistence >> >> >>> data >> >> >>> >> > on the server-side. >> >> >>> >> > For 4.2: no, you won't have to store them. A job resource >> will >> >> >>> >> > destroy all subscription resources when it's destroyed. >> >> >>> >> > >> >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> >> "container >> >> >>> >> > hangs in job destruction" problem won't exist anymore. >> >> >>> >> > >> >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable >> 4.2 >> >> >>> changes >> >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it >> >> makes >> >> >>> >> > sense >> >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to >> you >> >> >>> for >> >> >>> >> > fine-tuning then? 
>> >> >>> >> > >> >> >>> >> > Martin >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> >>> >> >> >> >>> >> > Mihael: >> >> >>> >> > >> >> >>> >> > That's great, thanks! >> >> >>> >> > >> >> >>> >> > Ian. >> >> >>> >> > >> >> >>> >> > Mihael Hategan wrote: >> >> >>> >> >> I did a 1024 job run today with ws-gram. >> >> >>> >> >> I painted the results here: >> >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> >>> >> >> >> >> >>> >> >> Seems like client memory per job is about 370k. Which is >> quite >> >> a >> >> >>> lot. >> >> >>> >> >> What kinda worries me is that it doesn't seem to go down >> after >> >> >>> the >> >> >>> >> >> jobs >> >> >>> >> >> are done, so maybe there's a memory leak, or maybe the >> garbage >> >> >>> >> >> collector >> >> >>> >> >> doesn't do any major collections. I'll need to profile this >> to >> >> >>> see >> >> >>> >> >> exactly what we're talking about. >> >> >>> >> >> >> >> >>> >> >> The container memory is figured by looking at the process >> in >> >> >>> /proc. >> >> >>> >> >> It's >> >> >>> >> >> total memory including shared libraries and things. But >> >> libraries >> >> >>> >> >> take a >> >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably >> be >> >> >>> made. >> >> >>> >> >> It >> >> >>> >> >> looks quite similar to the amount of memory eaten on the >> >> client >> >> >>> side >> >> >>> >> >> (per job). >> >> >>> >> >> >> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during >> the >> >> >>> time >> >> >>> >> >> the >> >> >>> >> >> jobs are submitted, but the machine itself seems >> responsive. I >> >> >>> have >> >> >>> >> >> yet >> >> >>> >> >> to plot the exact submission time for each job. >> >> >>> >> >> >> >> >>> >> >> So at this point I would recommend trying ws-gram as long >> as >> >> >>> there >> >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel >> jobs), >> >> >>> and >> >> >>> >> >> while >> >> >>> >> >> making sure the jvm has enough heap. More than that seems >> like >> >> a >> >> >>> >> >> gamble. 
>> >> >>> >> >> >> >> >>> >> >> Mihael >> >> >>> >> >> >> >> >>> >> >> _______________________________________________ >> >> >>> >> >> Swift-devel mailing list >> >> >>> >> >> Swift-devel at ci.uchicago.edu >> >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> >> >> >>> > >> >> >>> > >> >> >>> >> >> >>> >> >> >> >> >> >> >> >> > >> >> > >> >> > >> >> >> >> >> > >> > > > From hategan at mcs.anl.gov Fri Feb 8 13:57:53 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 13:57:53 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202500673.17544.1.camel@blabla.mcs.anl.gov> Won't fly: java.lang.NoClassDefFoundError: org/globus/exec/utils/audit/AuditUtil at org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:952) at org.globus.exec.client.GramJob.submit(GramJob.java:447) On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > Try the attached 4.0 compliant jar in your tests by dropping > it in your 4.0.x $GLOBUS_LOCATION/lib. > My tests showed about 2MB memory increase per 100 GramJob > objects which sounds to me like a reasonable number (about 20k > per GramJob object ignoring the notification consumer manager > in one job - if my calculations are right) > > Martin > > > > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > >> Mihael, > >> > >> i think i found the memory hole in GramJob. > >> 100 jobs in a test of mine consumed about 23MB (constantly > >> growing) before the fix and 8MB (very slowly growing) after > >> the fix. The big part of that (7MB) is used right from the > >> first job which may be the NotificationConsumerManager. > >> Will commit that change soon to 4.0 branch and you may try > >> it then. > >> Are you using 4.0.x in your tests? > > > > Yes. If there are no API changes, you can send me the jar file. I don't > > have enough knowledge to selectively build WS-GRAM, nor enough disk > > space to build the whole GT. > > > >> > >> Martin > >> > >> >>> > > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per > >> job > >> >>> is > >> >>> a > >> >>> > bit too much considering that swift (which has to consider many > >> more > >> >>> > things) has less than 10K overhead per job. > >> >>> > > >> >>> > >> >>> > >> >>> For my better understanding: > >> >>> Do you start up your own notification consumer manager that listens > >> for > >> >>> notifications of all jobs or do you let each GramJob instance listen > >> >>> for > >> >>> notifications itself? > >> >>> In case you listen for notifications yourself: do you store > >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if > >> >>> needed? > >> >> > >> >> Excellent points. I let each GramJob instance listen for > >> notifications > >> >> itself. What I observed is that it uses only one container for that. > >> >> > >> > > >> > Shoot! 
i didn't know that and thought there would be a container per > >> > GramJob in that case. That's the core mysteries with notifications. > >> > Anyway: I did a quick check some days ago and found that GramJob is > >> > surprisingly greedy regarding memory as you said. I'll have to further > >> > check what it is, but will probably not do that before 4.2 is out. > >> > > >> > > >> >> Due to the above, a reference to the GramJob is kept anyway, > >> regardless > >> >> of whether that reference is in client code or the local container. > >> >> > >> >> I'll try to profile a run and see if I can spot where the problems > >> are. > >> >> > >> >>> > >> >>> Martin > >> >>> > >> >>> >> > >> >>> >> The core team will be looking at improving notifications once > >> their > >> >>> >> other 4.2 deliverables are done. > >> >>> >> > >> >>> >> -Stu > >> >>> >> > >> >>> >> Begin forwarded message: > >> >>> >> > >> >>> >> > From: feller at mcs.anl.gov > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > >> >>> >> > To: "Jaime Frey" > >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >>> >> >> >>> >> > >, "Martin Feller" , "charles bacon" > >> >>> >> >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >>> >> >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >> >>> , > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > >> >>> >> >> >>> >> > > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >> >>> >> > > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >>> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >>> >> >>> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >>> >> >>>> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS > >> GRAM > >> >>> >> >>>>> raised some concerns about memory usage on the client side. > >> I > >> >>> did > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >> >>> appeared > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > >> >>> wrapper > >> >>> >> >>>>> around the java client libraries for WS GRAM. > >> >>> >> >>>>> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 > >> at > >> >>> a > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >> >>> execution. > >> >>> >> >>>>> Here is what I've discovered so far. > >> >>> >> >>>>> > >> >>> >> >>>>> Aside from the heap available to the java code, the jvm > >> used > >> >>> 117 > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >> >>> Condor-G > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >>> >> >>>>> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> >>> collector) > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >> >>> complete), > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >>> >> >>>>> > >> >>> >> >>>>> The only long-term memory per job that I know of in the > >> GAHP > >> >>> is > >> >>> >> >>>>> for the notification sink for job status callbacks. 600kb > >> >>> seems > >> >>> a > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us > >> >>> >> >>>>> determine if we're using the notification sinks > >> inefficiently? 
> >> >>> >> >>>> > >> >>> >> >>>> Martin just looked and for the most part, there is nothing > >> >>> wrong > >> >>> >> >>>> with how condor-g manages the callback sink. > >> >>> >> >>>> However, one improvement that would reduce the memory used > >> per > >> >>> job > >> >>> >> >>>> would be to not have a notification consumer per job. > >> Instead > >> >>> use > >> >>> >> >>>> one for all jobs. > >> >>> >> >>>> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > >> >>> tests > >> >>> >> >>>> and found that notifications are building up on the in the > >> >>> GRAM4 > >> >>> >> >>>> service container and that is causing delays which seem to > >> be > >> >>> >> >>>> causing multiple problems. We're looking at this in a > >> separate > >> >>> >> >>>> effort with the GT Core team. But, after this was clear, > >> >>> Martin > >> >>> >> >>>> re- > >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g > >> >>> and > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could > >> you > >> >>> >> >>>> repeat the no-notification test and see the difference in > >> >>> memory? > >> >>> >> >>>> The changes would be to increase the polling frequency in > >> >>> condor-g > >> >>> >> >>>> and comment out the subscribe for notification. You could > >> also > >> >>> >> >>>> comment out the notification listener call(s) too. > >> >>> >> >>> > >> >>> >> >>> > >> >>> >> >>> I did two new sets of tests today. The first used more > >> efficient > >> >>> >> >>> callback code in the GAHP (one notification consumer rather > >> than > >> >>> one > >> >>> >> >>> per job). The second disabled notifications and relied on > >> >>> polling > >> >>> >> >>> for job status changes. > >> >>> >> >>> > >> >>> >> >>> The more efficient callback code did not produce a noticeable > >> >>> >> >>> reduction in memory usage. > >> >>> >> >>> > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum > >> jvm > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > >> minimum > >> >>> >> >>> heap usage after job submission and before job completion was > >> >>> about > >> >>> >> >>> 4 megs + 0.1 megs per job. > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> I ran one more test with the improved callback code. This > >> time, I > >> >>> >> >> stopped storing the notification producer EPRs associated with > >> >>> the > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > >> >>> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these > >> serve-side > >> >>> >> >> notification producer resources when it destroys the job, > >> >>> otherwise > >> >>> >> >> they hang around bogging down the server. Is this still the > >> case? > >> >>> The > >> >>> >> >> server can't destroy notification producers when their sources > >> of > >> >>> >> >> information are destroyed? > >> >>> >> >> > >> >>> >> > > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant > >> much > >> >>> more > >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> >>> >> > subscription resources to be able to destroy them eventually. > >> >>> >> > Those EPR's are maybe not so tiny as they look like. > >> >>> >> > > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually > >> >>> destroy > >> >>> >> > subscription resources manually to avoid heaping up persistence > >> >>> data > >> >>> >> > on the server-side. > >> >>> >> > For 4.2: no, you won't have to store them. 
A job resource will > >> >>> >> > destroy all subscription resources when it's destroyed. > >> >>> >> > > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > >> "container > >> >>> >> > hangs in job destruction" problem won't exist anymore. > >> >>> >> > > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >> >>> changes > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it > >> makes > >> >>> >> > sense > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > >> >>> for > >> >>> >> > fine-tuning then? > >> >>> >> > > >> >>> >> > Martin > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> >>> >> > >> >>> >> > Mihael: > >> >>> >> > > >> >>> >> > That's great, thanks! > >> >>> >> > > >> >>> >> > Ian. > >> >>> >> > > >> >>> >> > Mihael Hategan wrote: > >> >>> >> >> I did a 1024 job run today with ws-gram. > >> >>> >> >> I painted the results here: > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >>> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is quite > >> a > >> >>> lot. > >> >>> >> >> What kinda worries me is that it doesn't seem to go down after > >> >>> the > >> >>> >> >> jobs > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >>> >> >> collector > >> >>> >> >> doesn't do any major collections. I'll need to profile this to > >> >>> see > >> >>> >> >> exactly what we're talking about. > >> >>> >> >> > >> >>> >> >> The container memory is figured by looking at the process in > >> >>> /proc. > >> >>> >> >> It's > >> >>> >> >> total memory including shared libraries and things. But > >> libraries > >> >>> >> >> take a > >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > >> >>> made. > >> >>> >> >> It > >> >>> >> >> looks quite similar to the amount of memory eaten on the > >> client > >> >>> side > >> >>> >> >> (per job). > >> >>> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > >> >>> time > >> >>> >> >> the > >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I > >> >>> have > >> >>> >> >> yet > >> >>> >> >> to plot the exact submission time for each job. > >> >>> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long as > >> >>> there > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > >> >>> and > >> >>> >> >> while > >> >>> >> >> making sure the jvm has enough heap. More than that seems like > >> a > >> >>> >> >> gamble. 
> >> >>> >> >> > >> >>> >> >> Mihael > >> >>> >> >> > >> >>> >> >> _______________________________________________ > >> >>> >> >> Swift-devel mailing list > >> >>> >> >> Swift-devel at ci.uchicago.edu > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >>> >> >> > >> >>> >> >> > >> >>> >> > > >> >>> >> > >> >>> > > >> >>> > > >> >>> > >> >>> > >> >> > >> >> > >> > > >> > > >> > > >> > >> > > > > From feller at mcs.anl.gov Fri Feb 8 14:15:09 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 14:15:09 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202500673.17544.1.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202500673.17544.1.camel@blabla.mcs.anl.gov> Message-ID: <21032.208.54.7.179.1202501709.squirrel@www-unix.mcs.anl.gov> ok, replace all gram jars with the attached ones. > Won't fly: > > java.lang.NoClassDefFoundError: org/globus/exec/utils/audit/AuditUtil > at > org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:952) > at org.globus.exec.client.GramJob.submit(GramJob.java:447) > > On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: >> Try the attached 4.0 compliant jar in your tests by dropping >> it in your 4.0.x $GLOBUS_LOCATION/lib. >> My tests showed about 2MB memory increase per 100 GramJob >> objects which sounds to me like a reasonable number (about 20k >> per GramJob object ignoring the notification consumer manager >> in one job - if my calculations are right) >> >> Martin >> >> > >> > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> >> Mihael, >> >> >> >> i think i found the memory hole in GramJob. >> >> 100 jobs in a test of mine consumed about 23MB (constantly >> >> growing) before the fix and 8MB (very slowly growing) after >> >> the fix. The big part of that (7MB) is used right from the >> >> first job which may be the NotificationConsumerManager. >> >> Will commit that change soon to 4.0 branch and you may try >> >> it then. >> >> Are you using 4.0.x in your tests? >> > >> > Yes. If there are no API changes, you can send me the jar file. I >> don't >> > have enough knowledge to selectively build WS-GRAM, nor enough disk >> > space to build the whole GT. >> > >> >> >> >> Martin >> >> >> >> >>> > >> >> >>> > These are both hacks. I'm not sure I want to go there. 300K per >> >> job >> >> >>> is >> >> >>> a >> >> >>> > bit too much considering that swift (which has to consider many >> >> more >> >> >>> > things) has less than 10K overhead per job. >> >> >>> > >> >> >>> >> >> >>> >> >> >>> For my better understanding: >> >> >>> Do you start up your own notification consumer manager that >> listens >> >> for >> >> >>> notifications of all jobs or do you let each GramJob instance >> listen >> >> >>> for >> >> >>> notifications itself? >> >> >>> In case you listen for notifications yourself: do you store >> >> >>> GramJob objects or just EPR's of jobs and create GramJob objects >> if >> >> >>> needed? >> >> >> >> >> >> Excellent points. 
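[The NoClassDefFoundError above (org/globus/exec/utils/audit/AuditUtil thrown from GramJob.createJobEndpoint) is what the "replace all gram jars" step fixes: the rebuilt client classes apparently reference AuditUtil, so swapping in only one jar leaves it unresolved. A hypothetical pre-flight check, using nothing but standard Java, could confirm the full jar set is on the classpath before starting a large run:]

public class GramJarCheck {
    public static void main(String[] args) {
        try {
            // The class the rebuilt GramJob failed to resolve in the test above.
            Class.forName("org.globus.exec.utils.audit.AuditUtil");
            System.out.println("AuditUtil resolved: updated gram jars appear complete");
        } catch (ClassNotFoundException e) {
            System.out.println("AuditUtil missing: replace all gram jars in "
                    + "$GLOBUS_LOCATION/lib, not just one of them");
        }
    }
}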
I let each GramJob instance listen for >> >> notifications >> >> >> itself. What I observed is that it uses only one container for >> that. >> >> >> >> >> > >> >> > Shoot! i didn't know that and thought there would be a container >> per >> >> > GramJob in that case. That's the core mysteries with notifications. >> >> > Anyway: I did a quick check some days ago and found that GramJob is >> >> > surprisingly greedy regarding memory as you said. I'll have to >> further >> >> > check what it is, but will probably not do that before 4.2 is out. >> >> > >> >> > >> >> >> Due to the above, a reference to the GramJob is kept anyway, >> >> regardless >> >> >> of whether that reference is in client code or the local >> container. >> >> >> >> >> >> I'll try to profile a run and see if I can spot where the problems >> >> are. >> >> >> >> >> >>> >> >> >>> Martin >> >> >>> >> >> >>> >> >> >> >>> >> The core team will be looking at improving notifications once >> >> their >> >> >>> >> other 4.2 deliverables are done. >> >> >>> >> >> >> >>> >> -Stu >> >> >>> >> >> >> >>> >> Begin forwarded message: >> >> >>> >> >> >> >>> >> > From: feller at mcs.anl.gov >> >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> >> >>> >> > To: "Jaime Frey" >> >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" >> >> >>> >> > >> >>> >> > >, "Martin Feller" , "charles bacon" >> >> >>> >> > >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" >> >> >>> >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" >> >> >>> , >> >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> >> >>> >> > >> >>> >> > > >> >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> >> >>> >> > >> >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> >>> >> >> >> >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >> >>> >> >>> >> >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> >> >>> >> >>>> >> >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with >> WS >> >> GRAM >> >> >>> >> >>>>> raised some concerns about memory usage on the client >> side. >> >> I >> >> >>> did >> >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which >> >> >>> appeared >> >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a >> >> >>> wrapper >> >> >>> >> >>>>> around the java client libraries for WS GRAM. >> >> >>> >> >>>>> >> >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to >> 30 >> >> at >> >> >>> a >> >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal >> data >> >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and >> >> >>> execution. >> >> >>> >> >>>>> Here is what I've discovered so far. >> >> >>> >> >>>>> >> >> >>> >> >>>>> Aside from the heap available to the java code, the jvm >> >> used >> >> >>> 117 >> >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. >> >> >>> Condor-G >> >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) >> pair. >> >> >>> >> >>>>> >> >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> >> >>> collector) >> >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP >> was >> >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to >> >> >>> complete), >> >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >> >>> >> >>>>> >> >> >>> >> >>>>> The only long-term memory per job that I know of in the >> >> GAHP >> >> >>> is >> >> >>> >> >>>>> for the notification sink for job status callbacks. 
>> 600kb >> >> >>> seems >> >> >>> a >> >> >>> >> >>>>> little high for that. Stu, could someone on Globus help >> us >> >> >>> >> >>>>> determine if we're using the notification sinks >> >> inefficiently? >> >> >>> >> >>>> >> >> >>> >> >>>> Martin just looked and for the most part, there is >> nothing >> >> >>> wrong >> >> >>> >> >>>> with how condor-g manages the callback sink. >> >> >>> >> >>>> However, one improvement that would reduce the memory >> used >> >> per >> >> >>> job >> >> >>> >> >>>> would be to not have a notification consumer per job. >> >> Instead >> >> >>> use >> >> >>> >> >>>> one for all jobs. >> >> >>> >> >>>> >> >> >>> >> >>>> Also, Martin recently did some analysis on condor-g >> stress >> >> >>> tests >> >> >>> >> >>>> and found that notifications are building up on the in >> the >> >> >>> GRAM4 >> >> >>> >> >>>> service container and that is causing delays which seem >> to >> >> be >> >> >>> >> >>>> causing multiple problems. We're looking at this in a >> >> separate >> >> >>> >> >>>> effort with the GT Core team. But, after this was clear, >> >> >>> Martin >> >> >>> >> >>>> re- >> >> >>> >> >>>> ran the condor-g test and relied on polling between >> condor-g >> >> >>> and >> >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could >> >> you >> >> >>> >> >>>> repeat the no-notification test and see the difference in >> >> >>> memory? >> >> >>> >> >>>> The changes would be to increase the polling frequency in >> >> >>> condor-g >> >> >>> >> >>>> and comment out the subscribe for notification. You >> could >> >> also >> >> >>> >> >>>> comment out the notification listener call(s) too. >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >>> I did two new sets of tests today. The first used more >> >> efficient >> >> >>> >> >>> callback code in the GAHP (one notification consumer >> rather >> >> than >> >> >>> one >> >> >>> >> >>> per job). The second disabled notifications and relied on >> >> >>> polling >> >> >>> >> >>> for job status changes. >> >> >>> >> >>> >> >> >>> >> >>> The more efficient callback code did not produce a >> noticeable >> >> >>> >> >>> reduction in memory usage. >> >> >>> >> >>> >> >> >>> >> >>> Disabling notifications did reduce memory usage. The >> maximum >> >> jvm >> >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The >> >> minimum >> >> >>> >> >>> heap usage after job submission and before job completion >> was >> >> >>> about >> >> >>> >> >>> 4 megs + 0.1 megs per job. >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> I ran one more test with the improved callback code. This >> >> time, I >> >> >>> >> >> stopped storing the notification producer EPRs associated >> with >> >> >>> the >> >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> >> >>> >> >> >> >> >>> >> >> I was told the client had to explicitly destroy these >> >> serve-side >> >> >>> >> >> notification producer resources when it destroys the job, >> >> >>> otherwise >> >> >>> >> >> they hang around bogging down the server. Is this still the >> >> case? >> >> >>> The >> >> >>> >> >> server can't destroy notification producers when their >> sources >> >> of >> >> >>> >> >> information are destroyed? >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> > This reminds me of the odd fact that i had to suddenly grant >> >> much >> >> >>> more >> >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs >> of >> >> >>> >> > subscription resources to be able to destroy them >> eventually. 
>> >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> >> >>> >> > >> >> >>> >> > For 4.0: yes, currently you'll have to store and eventually >> >> >>> destroy >> >> >>> >> > subscription resources manually to avoid heaping up >> persistence >> >> >>> data >> >> >>> >> > on the server-side. >> >> >>> >> > For 4.2: no, you won't have to store them. A job resource >> will >> >> >>> >> > destroy all subscription resources when it's destroyed. >> >> >>> >> > >> >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> >> "container >> >> >>> >> > hangs in job destruction" problem won't exist anymore. >> >> >>> >> > >> >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable >> 4.2 >> >> >>> changes >> >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it >> >> makes >> >> >>> >> > sense >> >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to >> you >> >> >>> for >> >> >>> >> > fine-tuning then? >> >> >>> >> > >> >> >>> >> > Martin >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> >>> >> >> >> >>> >> > Mihael: >> >> >>> >> > >> >> >>> >> > That's great, thanks! >> >> >>> >> > >> >> >>> >> > Ian. >> >> >>> >> > >> >> >>> >> > Mihael Hategan wrote: >> >> >>> >> >> I did a 1024 job run today with ws-gram. >> >> >>> >> >> I painted the results here: >> >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> >>> >> >> >> >> >>> >> >> Seems like client memory per job is about 370k. Which is >> quite >> >> a >> >> >>> lot. >> >> >>> >> >> What kinda worries me is that it doesn't seem to go down >> after >> >> >>> the >> >> >>> >> >> jobs >> >> >>> >> >> are done, so maybe there's a memory leak, or maybe the >> garbage >> >> >>> >> >> collector >> >> >>> >> >> doesn't do any major collections. I'll need to profile this >> to >> >> >>> see >> >> >>> >> >> exactly what we're talking about. >> >> >>> >> >> >> >> >>> >> >> The container memory is figured by looking at the process >> in >> >> >>> /proc. >> >> >>> >> >> It's >> >> >>> >> >> total memory including shared libraries and things. But >> >> libraries >> >> >>> >> >> take a >> >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably >> be >> >> >>> made. >> >> >>> >> >> It >> >> >>> >> >> looks quite similar to the amount of memory eaten on the >> >> client >> >> >>> side >> >> >>> >> >> (per job). >> >> >>> >> >> >> >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during >> the >> >> >>> time >> >> >>> >> >> the >> >> >>> >> >> jobs are submitted, but the machine itself seems >> responsive. I >> >> >>> have >> >> >>> >> >> yet >> >> >>> >> >> to plot the exact submission time for each job. >> >> >>> >> >> >> >> >>> >> >> So at this point I would recommend trying ws-gram as long >> as >> >> >>> there >> >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel >> jobs), >> >> >>> and >> >> >>> >> >> while >> >> >>> >> >> making sure the jvm has enough heap. More than that seems >> like >> >> a >> >> >>> >> >> gamble. 
>> >> >>> >> >> >> >> >>> >> >> Mihael >> >> >>> >> >> >> >> >>> >> >> _______________________________________________ >> >> >>> >> >> Swift-devel mailing list >> >> >>> >> >> Swift-devel at ci.uchicago.edu >> >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> >> >> >>> > >> >> >>> > >> >> >>> >> >> >>> >> >> >> >> >> >> >> >> > >> >> > >> >> > >> >> >> >> >> > >> > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: gramjars.tar.gz Type: application/x-gzip Size: 531778 bytes Desc: not available URL: From hategan at mcs.anl.gov Fri Feb 8 15:02:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 15:02:30 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202504550.21618.0.camel@blabla.mcs.anl.gov> On a first look it indeed looks like the gc is more successful at cleaning stuff up. On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > Try the attached 4.0 compliant jar in your tests by dropping > it in your 4.0.x $GLOBUS_LOCATION/lib. > My tests showed about 2MB memory increase per 100 GramJob > objects which sounds to me like a reasonable number (about 20k > per GramJob object ignoring the notification consumer manager > in one job - if my calculations are right) > > Martin > > > > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > >> Mihael, > >> > >> i think i found the memory hole in GramJob. > >> 100 jobs in a test of mine consumed about 23MB (constantly > >> growing) before the fix and 8MB (very slowly growing) after > >> the fix. The big part of that (7MB) is used right from the > >> first job which may be the NotificationConsumerManager. > >> Will commit that change soon to 4.0 branch and you may try > >> it then. > >> Are you using 4.0.x in your tests? > > > > Yes. If there are no API changes, you can send me the jar file. I don't > > have enough knowledge to selectively build WS-GRAM, nor enough disk > > space to build the whole GT. > > > >> > >> Martin > >> > >> >>> > > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per > >> job > >> >>> is > >> >>> a > >> >>> > bit too much considering that swift (which has to consider many > >> more > >> >>> > things) has less than 10K overhead per job. > >> >>> > > >> >>> > >> >>> > >> >>> For my better understanding: > >> >>> Do you start up your own notification consumer manager that listens > >> for > >> >>> notifications of all jobs or do you let each GramJob instance listen > >> >>> for > >> >>> notifications itself? > >> >>> In case you listen for notifications yourself: do you store > >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if > >> >>> needed? > >> >> > >> >> Excellent points. I let each GramJob instance listen for > >> notifications > >> >> itself. What I observed is that it uses only one container for that. 
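[The "gc is more successful at cleaning stuff up" observation above, and the earlier worry that client memory per job never came back down after the jobs finished, can be checked with a crude heap probe between runs. This is a hypothetical helper using only the standard Runtime API, not something from the thread:]

public class HeapProbe {
    // Request a collection first so lingering garbage is not counted, then
    // report the heap actually in use. Comparing this figure before and
    // after a batch of jobs shows whether per-job memory is being released.
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.out.println("used heap: " + usedHeapBytes() + " bytes");
    }
}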
> >> >> > >> > > >> > Shoot! i didn't know that and thought there would be a container per > >> > GramJob in that case. That's the core mysteries with notifications. > >> > Anyway: I did a quick check some days ago and found that GramJob is > >> > surprisingly greedy regarding memory as you said. I'll have to further > >> > check what it is, but will probably not do that before 4.2 is out. > >> > > >> > > >> >> Due to the above, a reference to the GramJob is kept anyway, > >> regardless > >> >> of whether that reference is in client code or the local container. > >> >> > >> >> I'll try to profile a run and see if I can spot where the problems > >> are. > >> >> > >> >>> > >> >>> Martin > >> >>> > >> >>> >> > >> >>> >> The core team will be looking at improving notifications once > >> their > >> >>> >> other 4.2 deliverables are done. > >> >>> >> > >> >>> >> -Stu > >> >>> >> > >> >>> >> Begin forwarded message: > >> >>> >> > >> >>> >> > From: feller at mcs.anl.gov > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > >> >>> >> > To: "Jaime Frey" > >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > >> >>> >> >> >>> >> > >, "Martin Feller" , "charles bacon" > >> >>> >> >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > >> >>> >> >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >> >>> , > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > >> >>> >> >> >>> >> > > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >> >>> >> > > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> >>> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> >>> >> >>> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> >>> >> >>>> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS > >> GRAM > >> >>> >> >>>>> raised some concerns about memory usage on the client side. > >> I > >> >>> did > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > >> >>> appeared > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > >> >>> wrapper > >> >>> >> >>>>> around the java client libraries for WS GRAM. > >> >>> >> >>>>> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 > >> at > >> >>> a > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >> >>> execution. > >> >>> >> >>>>> Here is what I've discovered so far. > >> >>> >> >>>>> > >> >>> >> >>>>> Aside from the heap available to the java code, the jvm > >> used > >> >>> 117 > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > >> >>> Condor-G > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > >> >>> >> >>>>> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> >>> collector) > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > >> >>> complete), > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> >>> >> >>>>> > >> >>> >> >>>>> The only long-term memory per job that I know of in the > >> GAHP > >> >>> is > >> >>> >> >>>>> for the notification sink for job status callbacks. 600kb > >> >>> seems > >> >>> a > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us > >> >>> >> >>>>> determine if we're using the notification sinks > >> inefficiently? 
> >> >>> >> >>>> > >> >>> >> >>>> Martin just looked and for the most part, there is nothing > >> >>> wrong > >> >>> >> >>>> with how condor-g manages the callback sink. > >> >>> >> >>>> However, one improvement that would reduce the memory used > >> per > >> >>> job > >> >>> >> >>>> would be to not have a notification consumer per job. > >> Instead > >> >>> use > >> >>> >> >>>> one for all jobs. > >> >>> >> >>>> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > >> >>> tests > >> >>> >> >>>> and found that notifications are building up on the in the > >> >>> GRAM4 > >> >>> >> >>>> service container and that is causing delays which seem to > >> be > >> >>> >> >>>> causing multiple problems. We're looking at this in a > >> separate > >> >>> >> >>>> effort with the GT Core team. But, after this was clear, > >> >>> Martin > >> >>> >> >>>> re- > >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g > >> >>> and > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could > >> you > >> >>> >> >>>> repeat the no-notification test and see the difference in > >> >>> memory? > >> >>> >> >>>> The changes would be to increase the polling frequency in > >> >>> condor-g > >> >>> >> >>>> and comment out the subscribe for notification. You could > >> also > >> >>> >> >>>> comment out the notification listener call(s) too. > >> >>> >> >>> > >> >>> >> >>> > >> >>> >> >>> I did two new sets of tests today. The first used more > >> efficient > >> >>> >> >>> callback code in the GAHP (one notification consumer rather > >> than > >> >>> one > >> >>> >> >>> per job). The second disabled notifications and relied on > >> >>> polling > >> >>> >> >>> for job status changes. > >> >>> >> >>> > >> >>> >> >>> The more efficient callback code did not produce a noticeable > >> >>> >> >>> reduction in memory usage. > >> >>> >> >>> > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum > >> jvm > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > >> minimum > >> >>> >> >>> heap usage after job submission and before job completion was > >> >>> about > >> >>> >> >>> 4 megs + 0.1 megs per job. > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> I ran one more test with the improved callback code. This > >> time, I > >> >>> >> >> stopped storing the notification producer EPRs associated with > >> >>> the > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > >> >>> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these > >> serve-side > >> >>> >> >> notification producer resources when it destroys the job, > >> >>> otherwise > >> >>> >> >> they hang around bogging down the server. Is this still the > >> case? > >> >>> The > >> >>> >> >> server can't destroy notification producers when their sources > >> of > >> >>> >> >> information are destroyed? > >> >>> >> >> > >> >>> >> > > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant > >> much > >> >>> more > >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > >> >>> >> > subscription resources to be able to destroy them eventually. > >> >>> >> > Those EPR's are maybe not so tiny as they look like. > >> >>> >> > > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually > >> >>> destroy > >> >>> >> > subscription resources manually to avoid heaping up persistence > >> >>> data > >> >>> >> > on the server-side. > >> >>> >> > For 4.2: no, you won't have to store them. 
A job resource will > >> >>> >> > destroy all subscription resources when it's destroyed. > >> >>> >> > > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > >> "container > >> >>> >> > hangs in job destruction" problem won't exist anymore. > >> >>> >> > > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > >> >>> changes > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it > >> makes > >> >>> >> > sense > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > >> >>> for > >> >>> >> > fine-tuning then? > >> >>> >> > > >> >>> >> > Martin > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> >>> >> > >> >>> >> > Mihael: > >> >>> >> > > >> >>> >> > That's great, thanks! > >> >>> >> > > >> >>> >> > Ian. > >> >>> >> > > >> >>> >> > Mihael Hategan wrote: > >> >>> >> >> I did a 1024 job run today with ws-gram. > >> >>> >> >> I painted the results here: > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> >>> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is quite > >> a > >> >>> lot. > >> >>> >> >> What kinda worries me is that it doesn't seem to go down after > >> >>> the > >> >>> >> >> jobs > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > >> >>> >> >> collector > >> >>> >> >> doesn't do any major collections. I'll need to profile this to > >> >>> see > >> >>> >> >> exactly what we're talking about. > >> >>> >> >> > >> >>> >> >> The container memory is figured by looking at the process in > >> >>> /proc. > >> >>> >> >> It's > >> >>> >> >> total memory including shared libraries and things. But > >> libraries > >> >>> >> >> take a > >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > >> >>> made. > >> >>> >> >> It > >> >>> >> >> looks quite similar to the amount of memory eaten on the > >> client > >> >>> side > >> >>> >> >> (per job). > >> >>> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > >> >>> time > >> >>> >> >> the > >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I > >> >>> have > >> >>> >> >> yet > >> >>> >> >> to plot the exact submission time for each job. > >> >>> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long as > >> >>> there > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > >> >>> and > >> >>> >> >> while > >> >>> >> >> making sure the jvm has enough heap. More than that seems like > >> a > >> >>> >> >> gamble. 
> >> >>> >> >> > >> >>> >> >> Mihael > >> >>> >> >> > >> >>> >> >> _______________________________________________ > >> >>> >> >> Swift-devel mailing list > >> >>> >> >> Swift-devel at ci.uchicago.edu > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> >>> >> >> > >> >>> >> >> > >> >>> >> > > >> >>> >> > >> >>> > > >> >>> > > >> >>> > >> >>> > >> >> > >> >> > >> > > >> > > >> > > >> > >> > > > > From hategan at mcs.anl.gov Fri Feb 8 16:12:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 16:12:43 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202504550.21618.0.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202504550.21618.0.camel@blabla.mcs.anl.gov> Message-ID: <1202508763.25421.0.camel@blabla.mcs.anl.gov> Yep. Looks much better. How stable is this otherwise? On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote: > On a first look it indeed looks like the gc is more successful at > cleaning stuff up. > > On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > > Try the attached 4.0 compliant jar in your tests by dropping > > it in your 4.0.x $GLOBUS_LOCATION/lib. > > My tests showed about 2MB memory increase per 100 GramJob > > objects which sounds to me like a reasonable number (about 20k > > per GramJob object ignoring the notification consumer manager > > in one job - if my calculations are right) > > > > Martin > > > > > > > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > > >> Mihael, > > >> > > >> i think i found the memory hole in GramJob. > > >> 100 jobs in a test of mine consumed about 23MB (constantly > > >> growing) before the fix and 8MB (very slowly growing) after > > >> the fix. The big part of that (7MB) is used right from the > > >> first job which may be the NotificationConsumerManager. > > >> Will commit that change soon to 4.0 branch and you may try > > >> it then. > > >> Are you using 4.0.x in your tests? > > > > > > Yes. If there are no API changes, you can send me the jar file. I don't > > > have enough knowledge to selectively build WS-GRAM, nor enough disk > > > space to build the whole GT. > > > > > >> > > >> Martin > > >> > > >> >>> > > > >> >>> > These are both hacks. I'm not sure I want to go there. 300K per > > >> job > > >> >>> is > > >> >>> a > > >> >>> > bit too much considering that swift (which has to consider many > > >> more > > >> >>> > things) has less than 10K overhead per job. > > >> >>> > > > >> >>> > > >> >>> > > >> >>> For my better understanding: > > >> >>> Do you start up your own notification consumer manager that listens > > >> for > > >> >>> notifications of all jobs or do you let each GramJob instance listen > > >> >>> for > > >> >>> notifications itself? > > >> >>> In case you listen for notifications yourself: do you store > > >> >>> GramJob objects or just EPR's of jobs and create GramJob objects if > > >> >>> needed? > > >> >> > > >> >> Excellent points. I let each GramJob instance listen for > > >> notifications > > >> >> itself. 
What I observed is that it uses only one container for that. > > >> >> > > >> > > > >> > Shoot! i didn't know that and thought there would be a container per > > >> > GramJob in that case. That's the core mysteries with notifications. > > >> > Anyway: I did a quick check some days ago and found that GramJob is > > >> > surprisingly greedy regarding memory as you said. I'll have to further > > >> > check what it is, but will probably not do that before 4.2 is out. > > >> > > > >> > > > >> >> Due to the above, a reference to the GramJob is kept anyway, > > >> regardless > > >> >> of whether that reference is in client code or the local container. > > >> >> > > >> >> I'll try to profile a run and see if I can spot where the problems > > >> are. > > >> >> > > >> >>> > > >> >>> Martin > > >> >>> > > >> >>> >> > > >> >>> >> The core team will be looking at improving notifications once > > >> their > > >> >>> >> other 4.2 deliverables are done. > > >> >>> >> > > >> >>> >> -Stu > > >> >>> >> > > >> >>> >> Begin forwarded message: > > >> >>> >> > > >> >>> >> > From: feller at mcs.anl.gov > > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > > >> >>> >> > To: "Jaime Frey" > > >> >>> >> > Cc: "Stuart Martin" , "Terrence Martin" > > >> >>> >> > >> >>> >> > >, "Martin Feller" , "charles bacon" > > >> >>> >> > >> >>> >> > >, "Suchandra Thapa" , "Rob Gardner" > > >> >>> >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" > > >> >>> , > > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > > >> >>> >> > >> >>> >> > > > > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > > >> >>> >> > > > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > > >> >>> >> >> > > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > > >> >>> >> >>> > > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > > >> >>> >> >>>> > > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with WS > > >> GRAM > > >> >>> >> >>>>> raised some concerns about memory usage on the client side. > > >> I > > >> >>> did > > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, which > > >> >>> appeared > > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is a > > >> >>> wrapper > > >> >>> >> >>>>> around the java client libraries for WS GRAM. > > >> >>> >> >>>>> > > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up to 30 > > >> at > > >> >>> a > > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal data > > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > > >> >>> execution. > > >> >>> >> >>>>> Here is what I've discovered so far. > > >> >>> >> >>>>> > > >> >>> >> >>>>> Aside from the heap available to the java code, the jvm > > >> used > > >> >>> 117 > > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared memory. > > >> >>> Condor-G > > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) pair. > > >> >>> >> >>>>> > > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > > >> >>> collector) > > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP was > > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them to > > >> >>> complete), > > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > > >> >>> >> >>>>> > > >> >>> >> >>>>> The only long-term memory per job that I know of in the > > >> GAHP > > >> >>> is > > >> >>> >> >>>>> for the notification sink for job status callbacks. 
600kb > > >> >>> seems > > >> >>> a > > >> >>> >> >>>>> little high for that. Stu, could someone on Globus help us > > >> >>> >> >>>>> determine if we're using the notification sinks > > >> inefficiently? > > >> >>> >> >>>> > > >> >>> >> >>>> Martin just looked and for the most part, there is nothing > > >> >>> wrong > > >> >>> >> >>>> with how condor-g manages the callback sink. > > >> >>> >> >>>> However, one improvement that would reduce the memory used > > >> per > > >> >>> job > > >> >>> >> >>>> would be to not have a notification consumer per job. > > >> Instead > > >> >>> use > > >> >>> >> >>>> one for all jobs. > > >> >>> >> >>>> > > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g stress > > >> >>> tests > > >> >>> >> >>>> and found that notifications are building up on the in the > > >> >>> GRAM4 > > >> >>> >> >>>> service container and that is causing delays which seem to > > >> be > > >> >>> >> >>>> causing multiple problems. We're looking at this in a > > >> separate > > >> >>> >> >>>> effort with the GT Core team. But, after this was clear, > > >> >>> Martin > > >> >>> >> >>>> re- > > >> >>> >> >>>> ran the condor-g test and relied on polling between condor-g > > >> >>> and > > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, could > > >> you > > >> >>> >> >>>> repeat the no-notification test and see the difference in > > >> >>> memory? > > >> >>> >> >>>> The changes would be to increase the polling frequency in > > >> >>> condor-g > > >> >>> >> >>>> and comment out the subscribe for notification. You could > > >> also > > >> >>> >> >>>> comment out the notification listener call(s) too. > > >> >>> >> >>> > > >> >>> >> >>> > > >> >>> >> >>> I did two new sets of tests today. The first used more > > >> efficient > > >> >>> >> >>> callback code in the GAHP (one notification consumer rather > > >> than > > >> >>> one > > >> >>> >> >>> per job). The second disabled notifications and relied on > > >> >>> polling > > >> >>> >> >>> for job status changes. > > >> >>> >> >>> > > >> >>> >> >>> The more efficient callback code did not produce a noticeable > > >> >>> >> >>> reduction in memory usage. > > >> >>> >> >>> > > >> >>> >> >>> Disabling notifications did reduce memory usage. The maximum > > >> jvm > > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > > >> minimum > > >> >>> >> >>> heap usage after job submission and before job completion was > > >> >>> about > > >> >>> >> >>> 4 megs + 0.1 megs per job. > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> >> I ran one more test with the improved callback code. This > > >> time, I > > >> >>> >> >> stopped storing the notification producer EPRs associated with > > >> >>> the > > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > > >> >>> >> >> > > >> >>> >> >> I was told the client had to explicitly destroy these > > >> serve-side > > >> >>> >> >> notification producer resources when it destroys the job, > > >> >>> otherwise > > >> >>> >> >> they hang around bogging down the server. Is this still the > > >> case? > > >> >>> The > > >> >>> >> >> server can't destroy notification producers when their sources > > >> of > > >> >>> >> >> information are destroyed? > > >> >>> >> >> > > >> >>> >> > > > >> >>> >> > This reminds me of the odd fact that i had to suddenly grant > > >> much > > >> >>> more > > >> >>> >> > memory to Condor-g as soon as condor-g started storing EPRs of > > >> >>> >> > subscription resources to be able to destroy them eventually. 
> > >> >>> >> > Those EPR's are maybe not so tiny as they look like. > > >> >>> >> > > > >> >>> >> > For 4.0: yes, currently you'll have to store and eventually > > >> >>> destroy > > >> >>> >> > subscription resources manually to avoid heaping up persistence > > >> >>> data > > >> >>> >> > on the server-side. > > >> >>> >> > For 4.2: no, you won't have to store them. A job resource will > > >> >>> >> > destroy all subscription resources when it's destroyed. > > >> >>> >> > > > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > > >> "container > > >> >>> >> > hangs in job destruction" problem won't exist anymore. > > >> >>> >> > > > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable 4.2 > > >> >>> changes > > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if it > > >> makes > > >> >>> >> > sense > > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it to you > > >> >>> for > > >> >>> >> > fine-tuning then? > > >> >>> >> > > > >> >>> >> > Martin > > >> >>> >> > > >> >>> >> > > >> >>> >> > > >> >>> >> > > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > > >> >>> >> > > >> >>> >> > Mihael: > > >> >>> >> > > > >> >>> >> > That's great, thanks! > > >> >>> >> > > > >> >>> >> > Ian. > > >> >>> >> > > > >> >>> >> > Mihael Hategan wrote: > > >> >>> >> >> I did a 1024 job run today with ws-gram. > > >> >>> >> >> I painted the results here: > > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > > >> >>> >> >> > > >> >>> >> >> Seems like client memory per job is about 370k. Which is quite > > >> a > > >> >>> lot. > > >> >>> >> >> What kinda worries me is that it doesn't seem to go down after > > >> >>> the > > >> >>> >> >> jobs > > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the garbage > > >> >>> >> >> collector > > >> >>> >> >> doesn't do any major collections. I'll need to profile this to > > >> >>> see > > >> >>> >> >> exactly what we're talking about. > > >> >>> >> >> > > >> >>> >> >> The container memory is figured by looking at the process in > > >> >>> /proc. > > >> >>> >> >> It's > > >> >>> >> >> total memory including shared libraries and things. But > > >> libraries > > >> >>> >> >> take a > > >> >>> >> >> fixed amount of space, so a fuzzy correlation can probably be > > >> >>> made. > > >> >>> >> >> It > > >> >>> >> >> looks quite similar to the amount of memory eaten on the > > >> client > > >> >>> side > > >> >>> >> >> (per job). > > >> >>> >> >> > > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during the > > >> >>> time > > >> >>> >> >> the > > >> >>> >> >> jobs are submitted, but the machine itself seems responsive. I > > >> >>> have > > >> >>> >> >> yet > > >> >>> >> >> to plot the exact submission time for each job. > > >> >>> >> >> > > >> >>> >> >> So at this point I would recommend trying ws-gram as long as > > >> >>> there > > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel jobs), > > >> >>> and > > >> >>> >> >> while > > >> >>> >> >> making sure the jvm has enough heap. More than that seems like > > >> a > > >> >>> >> >> gamble. 
> > >> >>> >> >> > > >> >>> >> >> Mihael > > >> >>> >> >> > > >> >>> >> >> _______________________________________________ > > >> >>> >> >> Swift-devel mailing list > > >> >>> >> >> Swift-devel at ci.uchicago.edu > > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> > > > >> >>> >> > > >> >>> > > > >> >>> > > > >> >>> > > >> >>> > > >> >> > > >> >> > > >> > > > >> > > > >> > > > >> > > >> > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From feller at mcs.anl.gov Fri Feb 8 16:32:06 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 16:32:06 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202508763.25421.0.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202504550.21618.0.camel@blabla.mcs.anl.gov> <1202508763.25421.0.camel@blabla.mcs.anl.gov> Message-ID: <58838.208.54.7.179.1202509926.squirrel@www-unix.mcs.anl.gov> I can't see any stability issues here. The only thing i changed is using EndpointReferenceType jobEPR = (EndpointReferenceType) ObjectSerializer.clone(response.getManagedJobEndpoint()); instead of EndpointReferenceType jobEPR = response.getManagedJobEndpoint(); at 2 or 3 locations in the code. Rachana uses cloning in core too. So it's supposed to be a stable thing. A question though: Do you see a speedup in submission? Martin > Yep. Looks much better. How stable is this otherwise? > > On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote: >> On a first look it indeed looks like the gc is more successful at >> cleaning stuff up. >> >> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: >> > Try the attached 4.0 compliant jar in your tests by dropping >> > it in your 4.0.x $GLOBUS_LOCATION/lib. >> > My tests showed about 2MB memory increase per 100 GramJob >> > objects which sounds to me like a reasonable number (about 20k >> > per GramJob object ignoring the notification consumer manager >> > in one job - if my calculations are right) >> > >> > Martin >> > >> > > >> > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> > >> Mihael, >> > >> >> > >> i think i found the memory hole in GramJob. >> > >> 100 jobs in a test of mine consumed about 23MB (constantly >> > >> growing) before the fix and 8MB (very slowly growing) after >> > >> the fix. The big part of that (7MB) is used right from the >> > >> first job which may be the NotificationConsumerManager. >> > >> Will commit that change soon to 4.0 branch and you may try >> > >> it then. >> > >> Are you using 4.0.x in your tests? >> > > >> > > Yes. If there are no API changes, you can send me the jar file. I >> don't >> > > have enough knowledge to selectively build WS-GRAM, nor enough disk >> > > space to build the whole GT. >> > > >> > >> >> > >> Martin >> > >> >> > >> >>> > >> > >> >>> > These are both hacks. I'm not sure I want to go there. 
300K >> per >> > >> job >> > >> >>> is >> > >> >>> a >> > >> >>> > bit too much considering that swift (which has to consider >> many >> > >> more >> > >> >>> > things) has less than 10K overhead per job. >> > >> >>> > >> > >> >>> >> > >> >>> >> > >> >>> For my better understanding: >> > >> >>> Do you start up your own notification consumer manager that >> listens >> > >> for >> > >> >>> notifications of all jobs or do you let each GramJob instance >> listen >> > >> >>> for >> > >> >>> notifications itself? >> > >> >>> In case you listen for notifications yourself: do you store >> > >> >>> GramJob objects or just EPR's of jobs and create GramJob >> objects if >> > >> >>> needed? >> > >> >> >> > >> >> Excellent points. I let each GramJob instance listen for >> > >> notifications >> > >> >> itself. What I observed is that it uses only one container for >> that. >> > >> >> >> > >> > >> > >> > Shoot! i didn't know that and thought there would be a container >> per >> > >> > GramJob in that case. That's the core mysteries with >> notifications. >> > >> > Anyway: I did a quick check some days ago and found that GramJob >> is >> > >> > surprisingly greedy regarding memory as you said. I'll have to >> further >> > >> > check what it is, but will probably not do that before 4.2 is >> out. >> > >> > >> > >> > >> > >> >> Due to the above, a reference to the GramJob is kept anyway, >> > >> regardless >> > >> >> of whether that reference is in client code or the local >> container. >> > >> >> >> > >> >> I'll try to profile a run and see if I can spot where the >> problems >> > >> are. >> > >> >> >> > >> >>> >> > >> >>> Martin >> > >> >>> >> > >> >>> >> >> > >> >>> >> The core team will be looking at improving notifications >> once >> > >> their >> > >> >>> >> other 4.2 deliverables are done. >> > >> >>> >> >> > >> >>> >> -Stu >> > >> >>> >> >> > >> >>> >> Begin forwarded message: >> > >> >>> >> >> > >> >>> >> > From: feller at mcs.anl.gov >> > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> > >> >>> >> > To: "Jaime Frey" >> > >> >>> >> > Cc: "Stuart Martin" , "Terrence >> Martin" >> > >> >>> >> > > >> >>> >> > >, "Martin Feller" , "charles bacon" >> > >> >>> >> > > >> >>> >> > >, "Suchandra Thapa" , "Rob >> Gardner" >> > >> >>> >> > > >> >>> >> > >, "Jeff Porter" , "Alain Roy" >> > >> >>> , >> > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> > >> >>> >> > > >> >>> >> > > >> > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> > >> >>> >> > >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> > >> >>> >> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> > >> >>> >> >>> >> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: >> > >> >>> >> >>>> >> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with >> WS >> > >> GRAM >> > >> >>> >> >>>>> raised some concerns about memory usage on the client >> side. >> > >> I >> > >> >>> did >> > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, >> which >> > >> >>> appeared >> > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is >> a >> > >> >>> wrapper >> > >> >>> >> >>>>> around the java client libraries for WS GRAM. >> > >> >>> >> >>>>> >> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up >> to 30 >> > >> at >> > >> >>> a >> > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal >> data >> > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and >> > >> >>> execution. 
>> > >> >>> >> >>>>> Here is what I've discovered so far. >> > >> >>> >> >>>>> >> > >> >>> >> >>>>> Aside from the heap available to the java code, the >> jvm >> > >> used >> > >> >>> 117 >> > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared >> memory. >> > >> >>> Condor-G >> > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) >> pair. >> > >> >>> >> >>>>> >> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage >> > >> >>> collector) >> > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP >> was >> > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them >> to >> > >> >>> complete), >> > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> > >> >>> >> >>>>> >> > >> >>> >> >>>>> The only long-term memory per job that I know of in >> the >> > >> GAHP >> > >> >>> is >> > >> >>> >> >>>>> for the notification sink for job status callbacks. >> 600kb >> > >> >>> seems >> > >> >>> a >> > >> >>> >> >>>>> little high for that. Stu, could someone on Globus >> help us >> > >> >>> >> >>>>> determine if we're using the notification sinks >> > >> inefficiently? >> > >> >>> >> >>>> >> > >> >>> >> >>>> Martin just looked and for the most part, there is >> nothing >> > >> >>> wrong >> > >> >>> >> >>>> with how condor-g manages the callback sink. >> > >> >>> >> >>>> However, one improvement that would reduce the memory >> used >> > >> per >> > >> >>> job >> > >> >>> >> >>>> would be to not have a notification consumer per job. >> > >> Instead >> > >> >>> use >> > >> >>> >> >>>> one for all jobs. >> > >> >>> >> >>>> >> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g >> stress >> > >> >>> tests >> > >> >>> >> >>>> and found that notifications are building up on the in >> the >> > >> >>> GRAM4 >> > >> >>> >> >>>> service container and that is causing delays which seem >> to >> > >> be >> > >> >>> >> >>>> causing multiple problems. We're looking at this in a >> > >> separate >> > >> >>> >> >>>> effort with the GT Core team. But, after this was >> clear, >> > >> >>> Martin >> > >> >>> >> >>>> re- >> > >> >>> >> >>>> ran the condor-g test and relied on polling between >> condor-g >> > >> >>> and >> > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, >> could >> > >> you >> > >> >>> >> >>>> repeat the no-notification test and see the difference >> in >> > >> >>> memory? >> > >> >>> >> >>>> The changes would be to increase the polling frequency >> in >> > >> >>> condor-g >> > >> >>> >> >>>> and comment out the subscribe for notification. You >> could >> > >> also >> > >> >>> >> >>>> comment out the notification listener call(s) too. >> > >> >>> >> >>> >> > >> >>> >> >>> >> > >> >>> >> >>> I did two new sets of tests today. The first used more >> > >> efficient >> > >> >>> >> >>> callback code in the GAHP (one notification consumer >> rather >> > >> than >> > >> >>> one >> > >> >>> >> >>> per job). The second disabled notifications and relied >> on >> > >> >>> polling >> > >> >>> >> >>> for job status changes. >> > >> >>> >> >>> >> > >> >>> >> >>> The more efficient callback code did not produce a >> noticeable >> > >> >>> >> >>> reduction in memory usage. >> > >> >>> >> >>> >> > >> >>> >> >>> Disabling notifications did reduce memory usage. The >> maximum >> > >> jvm >> > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. 
The >> > >> minimum >> > >> >>> >> >>> heap usage after job submission and before job >> completion was >> > >> >>> about >> > >> >>> >> >>> 4 megs + 0.1 megs per job. >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> >> I ran one more test with the improved callback code. This >> > >> time, I >> > >> >>> >> >> stopped storing the notification producer EPRs associated >> with >> > >> >>> the >> > >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> > >> >>> >> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these >> > >> serve-side >> > >> >>> >> >> notification producer resources when it destroys the job, >> > >> >>> otherwise >> > >> >>> >> >> they hang around bogging down the server. Is this still >> the >> > >> case? >> > >> >>> The >> > >> >>> >> >> server can't destroy notification producers when their >> sources >> > >> of >> > >> >>> >> >> information are destroyed? >> > >> >>> >> >> >> > >> >>> >> > >> > >> >>> >> > This reminds me of the odd fact that i had to suddenly >> grant >> > >> much >> > >> >>> more >> > >> >>> >> > memory to Condor-g as soon as condor-g started storing >> EPRs of >> > >> >>> >> > subscription resources to be able to destroy them >> eventually. >> > >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> > >> >>> >> > >> > >> >>> >> > For 4.0: yes, currently you'll have to store and >> eventually >> > >> >>> destroy >> > >> >>> >> > subscription resources manually to avoid heaping up >> persistence >> > >> >>> data >> > >> >>> >> > on the server-side. >> > >> >>> >> > For 4.2: no, you won't have to store them. A job resource >> will >> > >> >>> >> > destroy all subscription resources when it's destroyed. >> > >> >>> >> > >> > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> > >> "container >> > >> >>> >> > hangs in job destruction" problem won't exist anymore. >> > >> >>> >> > >> > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable >> 4.2 >> > >> >>> changes >> > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if >> it >> > >> makes >> > >> >>> >> > sense >> > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it >> to you >> > >> >>> for >> > >> >>> >> > fine-tuning then? >> > >> >>> >> > >> > >> >>> >> > Martin >> > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> > >> >>> >> >> > >> >>> >> > Mihael: >> > >> >>> >> > >> > >> >>> >> > That's great, thanks! >> > >> >>> >> > >> > >> >>> >> > Ian. >> > >> >>> >> > >> > >> >>> >> > Mihael Hategan wrote: >> > >> >>> >> >> I did a 1024 job run today with ws-gram. >> > >> >>> >> >> I painted the results here: >> > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> > >> >>> >> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is >> quite >> > >> a >> > >> >>> lot. >> > >> >>> >> >> What kinda worries me is that it doesn't seem to go down >> after >> > >> >>> the >> > >> >>> >> >> jobs >> > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the >> garbage >> > >> >>> >> >> collector >> > >> >>> >> >> doesn't do any major collections. I'll need to profile >> this to >> > >> >>> see >> > >> >>> >> >> exactly what we're talking about. >> > >> >>> >> >> >> > >> >>> >> >> The container memory is figured by looking at the process >> in >> > >> >>> /proc. >> > >> >>> >> >> It's >> > >> >>> >> >> total memory including shared libraries and things. 
But >> > >> libraries >> > >> >>> >> >> take a >> > >> >>> >> >> fixed amount of space, so a fuzzy correlation can >> probably be >> > >> >>> made. >> > >> >>> >> >> It >> > >> >>> >> >> looks quite similar to the amount of memory eaten on the >> > >> client >> > >> >>> side >> > >> >>> >> >> (per job). >> > >> >>> >> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during >> the >> > >> >>> time >> > >> >>> >> >> the >> > >> >>> >> >> jobs are submitted, but the machine itself seems >> responsive. I >> > >> >>> have >> > >> >>> >> >> yet >> > >> >>> >> >> to plot the exact submission time for each job. >> > >> >>> >> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long >> as >> > >> >>> there >> > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel >> jobs), >> > >> >>> and >> > >> >>> >> >> while >> > >> >>> >> >> making sure the jvm has enough heap. More than that seems >> like >> > >> a >> > >> >>> >> >> gamble. >> > >> >>> >> >> >> > >> >>> >> >> Mihael >> > >> >>> >> >> >> > >> >>> >> >> _______________________________________________ >> > >> >>> >> >> Swift-devel mailing list >> > >> >>> >> >> Swift-devel at ci.uchicago.edu >> > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> > >> > >> >>> >> >> > >> >>> > >> > >> >>> > >> > >> >>> >> > >> >>> >> > >> >> >> > >> >> >> > >> > >> > >> > >> > >> > >> > >> >> > >> >> > > >> > > >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From hategan at mcs.anl.gov Fri Feb 8 16:37:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 08 Feb 2008 16:37:18 -0600 Subject: [Swift-devel] ws-gram tests In-Reply-To: <58838.208.54.7.179.1202509926.squirrel@www-unix.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202504550.21618.0.camel@blabla.mcs.anl.gov> <1202508763.25421.0.camel@blabla.mcs.anl.gov> <58838.208.54.7.179.1202509926.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202510238.26717.2.camel@blabla.mcs.anl.gov> On Fri, 2008-02-08 at 16:32 -0600, feller at mcs.anl.gov wrote: > I can't see any stability issues here. The only thing i changed > is using > > EndpointReferenceType jobEPR = (EndpointReferenceType) > ObjectSerializer.clone(response.getManagedJobEndpoint()); > > instead of > > EndpointReferenceType jobEPR = response.getManagedJobEndpoint(); > > at 2 or 3 locations in the code. > > Rachana uses cloning in core too. So it's supposed to be > a stable thing. > > A question though: Do you see a speedup in submission? I wasn't looking for that. Anything I should be aware of? > > Martin > > > > Yep. Looks much better. How stable is this otherwise? > > > > On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote: > >> On a first look it indeed looks like the gc is more successful at > >> cleaning stuff up. 
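A minimal sketch of the cloning change described above, assuming the GT 4.0.x client classes named in this thread (the two assignment patterns mirror Martin's message; the wrapper class, comments, and import paths are assumptions, not the actual GramJob patch):

    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.globus.wsrf.encoding.ObjectSerializer;

    // Hypothetical helper: detach an EPR from the response it came from by
    // deep-copying it before it is stored for the lifetime of the job.
    public final class EprUtil {
        private EprUtil() {}

        // Old pattern: jobEPR = response.getManagedJobEndpoint();
        //   which keeps the returned EPR tied to the deserialized response.
        // New pattern: clone through the serializer so only the small copy
        //   is retained and the rest of the response can be collected.
        public static EndpointReferenceType detach(EndpointReferenceType epr)
                throws Exception {
            return (EndpointReferenceType) ObjectSerializer.clone(epr);
        }
    }

The presumed effect is what is reported above: with the clone in place the garbage collector can actually reclaim the per-job response objects.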
> >> > >> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: > >> > Try the attached 4.0 compliant jar in your tests by dropping > >> > it in your 4.0.x $GLOBUS_LOCATION/lib. > >> > My tests showed about 2MB memory increase per 100 GramJob > >> > objects which sounds to me like a reasonable number (about 20k > >> > per GramJob object ignoring the notification consumer manager > >> > in one job - if my calculations are right) > >> > > >> > Martin > >> > > >> > > > >> > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: > >> > >> Mihael, > >> > >> > >> > >> i think i found the memory hole in GramJob. > >> > >> 100 jobs in a test of mine consumed about 23MB (constantly > >> > >> growing) before the fix and 8MB (very slowly growing) after > >> > >> the fix. The big part of that (7MB) is used right from the > >> > >> first job which may be the NotificationConsumerManager. > >> > >> Will commit that change soon to 4.0 branch and you may try > >> > >> it then. > >> > >> Are you using 4.0.x in your tests? > >> > > > >> > > Yes. If there are no API changes, you can send me the jar file. I > >> don't > >> > > have enough knowledge to selectively build WS-GRAM, nor enough disk > >> > > space to build the whole GT. > >> > > > >> > >> > >> > >> Martin > >> > >> > >> > >> >>> > > >> > >> >>> > These are both hacks. I'm not sure I want to go there. 300K > >> per > >> > >> job > >> > >> >>> is > >> > >> >>> a > >> > >> >>> > bit too much considering that swift (which has to consider > >> many > >> > >> more > >> > >> >>> > things) has less than 10K overhead per job. > >> > >> >>> > > >> > >> >>> > >> > >> >>> > >> > >> >>> For my better understanding: > >> > >> >>> Do you start up your own notification consumer manager that > >> listens > >> > >> for > >> > >> >>> notifications of all jobs or do you let each GramJob instance > >> listen > >> > >> >>> for > >> > >> >>> notifications itself? > >> > >> >>> In case you listen for notifications yourself: do you store > >> > >> >>> GramJob objects or just EPR's of jobs and create GramJob > >> objects if > >> > >> >>> needed? > >> > >> >> > >> > >> >> Excellent points. I let each GramJob instance listen for > >> > >> notifications > >> > >> >> itself. What I observed is that it uses only one container for > >> that. > >> > >> >> > >> > >> > > >> > >> > Shoot! i didn't know that and thought there would be a container > >> per > >> > >> > GramJob in that case. That's the core mysteries with > >> notifications. > >> > >> > Anyway: I did a quick check some days ago and found that GramJob > >> is > >> > >> > surprisingly greedy regarding memory as you said. I'll have to > >> further > >> > >> > check what it is, but will probably not do that before 4.2 is > >> out. > >> > >> > > >> > >> > > >> > >> >> Due to the above, a reference to the GramJob is kept anyway, > >> > >> regardless > >> > >> >> of whether that reference is in client code or the local > >> container. > >> > >> >> > >> > >> >> I'll try to profile a run and see if I can spot where the > >> problems > >> > >> are. > >> > >> >> > >> > >> >>> > >> > >> >>> Martin > >> > >> >>> > >> > >> >>> >> > >> > >> >>> >> The core team will be looking at improving notifications > >> once > >> > >> their > >> > >> >>> >> other 4.2 deliverables are done. 
> >> > >> >>> >> > >> > >> >>> >> -Stu > >> > >> >>> >> > >> > >> >>> >> Begin forwarded message: > >> > >> >>> >> > >> > >> >>> >> > From: feller at mcs.anl.gov > >> > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST > >> > >> >>> >> > To: "Jaime Frey" > >> > >> >>> >> > Cc: "Stuart Martin" , "Terrence > >> Martin" > >> > >> >>> >> >> > >> >>> >> > >, "Martin Feller" , "charles bacon" > >> > >> >>> >> >> > >> >>> >> > >, "Suchandra Thapa" , "Rob > >> Gardner" > >> > >> >>> >> >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" > >> > >> >>> , > >> > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" > >> > >> >>> >> >> > >> >>> >> > > > >> > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage > >> > >> >>> >> > > >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: > >> > >> >>> >> >> > >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: > >> > >> >>> >> >>> > >> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey wrote: > >> > >> >>> >> >>>> > >> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G with > >> WS > >> > >> GRAM > >> > >> >>> >> >>>>> raised some concerns about memory usage on the client > >> side. > >> > >> I > >> > >> >>> did > >> > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, > >> which > >> > >> >>> appeared > >> > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server is > >> a > >> > >> >>> wrapper > >> > >> >>> >> >>>>> around the java client libraries for WS GRAM. > >> > >> >>> >> >>>>> > >> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs up > >> to 30 > >> > >> at > >> > >> >>> a > >> > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with minimal > >> data > >> > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission and > >> > >> >>> execution. > >> > >> >>> >> >>>>> Here is what I've discovered so far. > >> > >> >>> >> >>>>> > >> > >> >>> >> >>>>> Aside from the heap available to the java code, the > >> jvm > >> > >> used > >> > >> >>> 117 > >> > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared > >> memory. > >> > >> >>> Condor-G > >> > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 DN) > >> pair. > >> > >> >>> >> >>>>> > >> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the garbage > >> > >> >>> collector) > >> > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the GAHP > >> was > >> > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for them > >> to > >> > >> >>> complete), > >> > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. > >> > >> >>> >> >>>>> > >> > >> >>> >> >>>>> The only long-term memory per job that I know of in > >> the > >> > >> GAHP > >> > >> >>> is > >> > >> >>> >> >>>>> for the notification sink for job status callbacks. > >> 600kb > >> > >> >>> seems > >> > >> >>> a > >> > >> >>> >> >>>>> little high for that. Stu, could someone on Globus > >> help us > >> > >> >>> >> >>>>> determine if we're using the notification sinks > >> > >> inefficiently? > >> > >> >>> >> >>>> > >> > >> >>> >> >>>> Martin just looked and for the most part, there is > >> nothing > >> > >> >>> wrong > >> > >> >>> >> >>>> with how condor-g manages the callback sink. > >> > >> >>> >> >>>> However, one improvement that would reduce the memory > >> used > >> > >> per > >> > >> >>> job > >> > >> >>> >> >>>> would be to not have a notification consumer per job. > >> > >> Instead > >> > >> >>> use > >> > >> >>> >> >>>> one for all jobs. 
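A rough sketch of that "one notification consumer for all jobs" suggestion, using the GT4 Java WS Core notification classes as they appear in its samples (class and method names here are assumptions, and the wiring into the per-job subscribe requests is omitted):

    import java.util.List;
    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.globus.wsrf.NotificationConsumerManager;
    import org.globus.wsrf.NotifyCallback;

    // One consumer endpoint shared by every job, instead of one per GramJob.
    public class SharedJobStateConsumer implements NotifyCallback {
        private final NotificationConsumerManager manager;
        private final EndpointReferenceType consumerEPR;

        public SharedJobStateConsumer() throws Exception {
            manager = NotificationConsumerManager.getInstance();
            manager.startListening();
            // Every job's subscription points at this single endpoint.
            consumerEPR = manager.createNotificationConsumer(this);
        }

        public EndpointReferenceType getConsumerEPR() {
            return consumerEPR;
        }

        // Called for state notifications from any job; the producer EPR
        // identifies the job resource, so dispatch on it to find the job.
        public void deliver(List topicPath, EndpointReferenceType producer,
                Object message) {
            // look up the job record keyed by the producer EPR and update it
        }

        public void shutdown() throws Exception {
            manager.stopListening();
        }
    }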
> >> > >> >>> >> >>>> > >> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g > >> stress > >> > >> >>> tests > >> > >> >>> >> >>>> and found that notifications are building up on the in > >> the > >> > >> >>> GRAM4 > >> > >> >>> >> >>>> service container and that is causing delays which seem > >> to > >> > >> be > >> > >> >>> >> >>>> causing multiple problems. We're looking at this in a > >> > >> separate > >> > >> >>> >> >>>> effort with the GT Core team. But, after this was > >> clear, > >> > >> >>> Martin > >> > >> >>> >> >>>> re- > >> > >> >>> >> >>>> ran the condor-g test and relied on polling between > >> condor-g > >> > >> >>> and > >> > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, > >> could > >> > >> you > >> > >> >>> >> >>>> repeat the no-notification test and see the difference > >> in > >> > >> >>> memory? > >> > >> >>> >> >>>> The changes would be to increase the polling frequency > >> in > >> > >> >>> condor-g > >> > >> >>> >> >>>> and comment out the subscribe for notification. You > >> could > >> > >> also > >> > >> >>> >> >>>> comment out the notification listener call(s) too. > >> > >> >>> >> >>> > >> > >> >>> >> >>> > >> > >> >>> >> >>> I did two new sets of tests today. The first used more > >> > >> efficient > >> > >> >>> >> >>> callback code in the GAHP (one notification consumer > >> rather > >> > >> than > >> > >> >>> one > >> > >> >>> >> >>> per job). The second disabled notifications and relied > >> on > >> > >> >>> polling > >> > >> >>> >> >>> for job status changes. > >> > >> >>> >> >>> > >> > >> >>> >> >>> The more efficient callback code did not produce a > >> noticeable > >> > >> >>> >> >>> reduction in memory usage. > >> > >> >>> >> >>> > >> > >> >>> >> >>> Disabling notifications did reduce memory usage. The > >> maximum > >> > >> jvm > >> > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. The > >> > >> minimum > >> > >> >>> >> >>> heap usage after job submission and before job > >> completion was > >> > >> >>> about > >> > >> >>> >> >>> 4 megs + 0.1 megs per job. > >> > >> >>> >> >> > >> > >> >>> >> >> > >> > >> >>> >> >> I ran one more test with the improved callback code. This > >> > >> time, I > >> > >> >>> >> >> stopped storing the notification producer EPRs associated > >> with > >> > >> >>> the > >> > >> >>> >> >> GRAM job resources. Memory usage went down markedly. > >> > >> >>> >> >> > >> > >> >>> >> >> I was told the client had to explicitly destroy these > >> > >> serve-side > >> > >> >>> >> >> notification producer resources when it destroys the job, > >> > >> >>> otherwise > >> > >> >>> >> >> they hang around bogging down the server. Is this still > >> the > >> > >> case? > >> > >> >>> The > >> > >> >>> >> >> server can't destroy notification producers when their > >> sources > >> > >> of > >> > >> >>> >> >> information are destroyed? > >> > >> >>> >> >> > >> > >> >>> >> > > >> > >> >>> >> > This reminds me of the odd fact that i had to suddenly > >> grant > >> > >> much > >> > >> >>> more > >> > >> >>> >> > memory to Condor-g as soon as condor-g started storing > >> EPRs of > >> > >> >>> >> > subscription resources to be able to destroy them > >> eventually. > >> > >> >>> >> > Those EPR's are maybe not so tiny as they look like. 
> >> > >> >>> >> > > >> > >> >>> >> > For 4.0: yes, currently you'll have to store and > >> eventually > >> > >> >>> destroy > >> > >> >>> >> > subscription resources manually to avoid heaping up > >> persistence > >> > >> >>> data > >> > >> >>> >> > on the server-side. > >> > >> >>> >> > For 4.2: no, you won't have to store them. A job resource > >> will > >> > >> >>> >> > destroy all subscription resources when it's destroyed. > >> > >> >>> >> > > >> > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the > >> > >> "container > >> > >> >>> >> > hangs in job destruction" problem won't exist anymore. > >> > >> >>> >> > > >> > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% reliable > >> 4.2 > >> > >> >>> changes > >> > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder if > >> it > >> > >> makes > >> > >> >>> >> > sense > >> > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand it > >> to you > >> > >> >>> for > >> > >> >>> >> > fine-tuning then? > >> > >> >>> >> > > >> > >> >>> >> > Martin > >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: > >> > >> >>> >> > >> > >> >>> >> > Mihael: > >> > >> >>> >> > > >> > >> >>> >> > That's great, thanks! > >> > >> >>> >> > > >> > >> >>> >> > Ian. > >> > >> >>> >> > > >> > >> >>> >> > Mihael Hategan wrote: > >> > >> >>> >> >> I did a 1024 job run today with ws-gram. > >> > >> >>> >> >> I painted the results here: > >> > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html > >> > >> >>> >> >> > >> > >> >>> >> >> Seems like client memory per job is about 370k. Which is > >> quite > >> > >> a > >> > >> >>> lot. > >> > >> >>> >> >> What kinda worries me is that it doesn't seem to go down > >> after > >> > >> >>> the > >> > >> >>> >> >> jobs > >> > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the > >> garbage > >> > >> >>> >> >> collector > >> > >> >>> >> >> doesn't do any major collections. I'll need to profile > >> this to > >> > >> >>> see > >> > >> >>> >> >> exactly what we're talking about. > >> > >> >>> >> >> > >> > >> >>> >> >> The container memory is figured by looking at the process > >> in > >> > >> >>> /proc. > >> > >> >>> >> >> It's > >> > >> >>> >> >> total memory including shared libraries and things. But > >> > >> libraries > >> > >> >>> >> >> take a > >> > >> >>> >> >> fixed amount of space, so a fuzzy correlation can > >> probably be > >> > >> >>> made. > >> > >> >>> >> >> It > >> > >> >>> >> >> looks quite similar to the amount of memory eaten on the > >> > >> client > >> > >> >>> side > >> > >> >>> >> >> (per job). > >> > >> >>> >> >> > >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work during > >> the > >> > >> >>> time > >> > >> >>> >> >> the > >> > >> >>> >> >> jobs are submitted, but the machine itself seems > >> responsive. I > >> > >> >>> have > >> > >> >>> >> >> yet > >> > >> >>> >> >> to plot the exact submission time for each job. > >> > >> >>> >> >> > >> > >> >>> >> >> So at this point I would recommend trying ws-gram as long > >> as > >> > >> >>> there > >> > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 parallel > >> jobs), > >> > >> >>> and > >> > >> >>> >> >> while > >> > >> >>> >> >> making sure the jvm has enough heap. More than that seems > >> like > >> > >> a > >> > >> >>> >> >> gamble. 
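As a back-of-the-envelope check of that recommendation (the ~370k-per-job figure and the 4000-job ceiling are the numbers quoted above; the fixed base overhead is a guess):

    // Editorial sketch: rough client-side heap estimate for N concurrent jobs.
    public class HeapEstimate {
        static long estimatedClientHeapBytes(int jobs) {
            long perJob = 370L * 1024;       // ~370 KB per job, as measured above
            long base = 8L * 1024 * 1024;    // assumed fixed overhead of a few MB
            return base + perJob * jobs;
        }

        public static void main(String[] args) {
            // Roughly 1.5e9 bytes for 4000 jobs, hence the advice to give the
            // JVM a generous heap (-Xmx) for runs of that size.
            System.out.println(estimatedClientHeapBytes(4000));
        }
    }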
> >> > >> >>> >> >> > >> > >> >>> >> >> Mihael > >> > >> >>> >> >> > >> > >> >>> >> >> _______________________________________________ > >> > >> >>> >> >> Swift-devel mailing list > >> > >> >>> >> >> Swift-devel at ci.uchicago.edu > >> > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > >> >>> >> >> > >> > >> >>> >> >> > >> > >> >>> >> > > >> > >> >>> >> > >> > >> >>> > > >> > >> >>> > > >> > >> >>> > >> > >> >>> > >> > >> >> > >> > >> >> > >> > >> > > >> > >> > > >> > >> > > >> > >> > >> > >> > >> > > > >> > > > >> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > > From feller at mcs.anl.gov Fri Feb 8 16:46:06 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Fri, 8 Feb 2008 16:46:06 -0600 (CST) Subject: [Swift-devel] ws-gram tests In-Reply-To: <1202510238.26717.2.camel@blabla.mcs.anl.gov> References: <1202438053.26812.12.camel@blabla.mcs.anl.gov> <47AC72F9.8010701@mcs.anl.gov> <3A6401A7-FC8D-40C4-B00C-3FFF8B01580F@mcs.anl.gov> <1202485602.4800.13.camel@blabla.mcs.anl.gov> <33402.208.54.7.179.1202486974.squirrel@www-unix.mcs.anl.gov> <1202487494.5642.7.camel@blabla.mcs.anl.gov> <34703.208.54.7.179.1202487986.squirrel@www-unix.mcs.anl.gov> <43720.208.54.7.179.1202491180.squirrel@www-unix.mcs.anl.gov> <1202491649.9045.8.camel@blabla.mcs.anl.gov> <5915.208.54.7.179.1202498489.squirrel@www-unix.mcs.anl.gov> <1202504550.21618.0.camel@blabla.mcs.anl.gov> <1202508763.25421.0.camel@blabla.mcs.anl.gov> <58838.208.54.7.179.1202509926.squirrel@www-unix.mcs.anl.gov> <1202510238.26717.2.camel@blabla.mcs.anl.gov> Message-ID: <61240.208.54.7.179.1202510766.squirrel@www-unix.mcs.anl.gov> > > On Fri, 2008-02-08 at 16:32 -0600, feller at mcs.anl.gov wrote: >> I can't see any stability issues here. The only thing i changed >> is using >> >> EndpointReferenceType jobEPR = (EndpointReferenceType) >> ObjectSerializer.clone(response.getManagedJobEndpoint()); >> >> instead of >> >> EndpointReferenceType jobEPR = response.getManagedJobEndpoint(); >> >> at 2 or 3 locations in the code. >> >> Rachana uses cloning in core too. So it's supposed to be >> a stable thing. >> >> A question though: Do you see a speedup in submission? > > I wasn't looking for that. Anything I should be aware of? > Well, i can see a quite big speedup and can't really explain it. The only thing i did was that cloning. But i'm working on trunk and i changed some things in job creation that allow faster job creation. In 4.0 you might only see it in jobs without delegation. It would be interesting for me if you see a higher submission rate in jobs that don't have any links to delegated credentials in the job description (so no jobCredentialEndpoint, no stagingCredentialEndpoint, no transferCredentialEndpoints). Martin >> >> Martin >> >> >> > Yep. Looks much better. How stable is this otherwise? >> > >> > On Fri, 2008-02-08 at 15:02 -0600, Mihael Hategan wrote: >> >> On a first look it indeed looks like the gc is more successful at >> >> cleaning stuff up. >> >> >> >> On Fri, 2008-02-08 at 13:21 -0600, feller at mcs.anl.gov wrote: >> >> > Try the attached 4.0 compliant jar in your tests by dropping >> >> > it in your 4.0.x $GLOBUS_LOCATION/lib. 
>> >> > My tests showed about 2MB memory increase per 100 GramJob >> >> > objects which sounds to me like a reasonable number (about 20k >> >> > per GramJob object ignoring the notification consumer manager >> >> > in one job - if my calculations are right) >> >> > >> >> > Martin >> >> > >> >> > > >> >> > > On Fri, 2008-02-08 at 11:19 -0600, feller at mcs.anl.gov wrote: >> >> > >> Mihael, >> >> > >> >> >> > >> i think i found the memory hole in GramJob. >> >> > >> 100 jobs in a test of mine consumed about 23MB (constantly >> >> > >> growing) before the fix and 8MB (very slowly growing) after >> >> > >> the fix. The big part of that (7MB) is used right from the >> >> > >> first job which may be the NotificationConsumerManager. >> >> > >> Will commit that change soon to 4.0 branch and you may try >> >> > >> it then. >> >> > >> Are you using 4.0.x in your tests? >> >> > > >> >> > > Yes. If there are no API changes, you can send me the jar file. I >> >> don't >> >> > > have enough knowledge to selectively build WS-GRAM, nor enough >> disk >> >> > > space to build the whole GT. >> >> > > >> >> > >> >> >> > >> Martin >> >> > >> >> >> > >> >>> > >> >> > >> >>> > These are both hacks. I'm not sure I want to go there. >> 300K >> >> per >> >> > >> job >> >> > >> >>> is >> >> > >> >>> a >> >> > >> >>> > bit too much considering that swift (which has to consider >> >> many >> >> > >> more >> >> > >> >>> > things) has less than 10K overhead per job. >> >> > >> >>> > >> >> > >> >>> >> >> > >> >>> >> >> > >> >>> For my better understanding: >> >> > >> >>> Do you start up your own notification consumer manager that >> >> listens >> >> > >> for >> >> > >> >>> notifications of all jobs or do you let each GramJob >> instance >> >> listen >> >> > >> >>> for >> >> > >> >>> notifications itself? >> >> > >> >>> In case you listen for notifications yourself: do you store >> >> > >> >>> GramJob objects or just EPR's of jobs and create GramJob >> >> objects if >> >> > >> >>> needed? >> >> > >> >> >> >> > >> >> Excellent points. I let each GramJob instance listen for >> >> > >> notifications >> >> > >> >> itself. What I observed is that it uses only one container >> for >> >> that. >> >> > >> >> >> >> > >> > >> >> > >> > Shoot! i didn't know that and thought there would be a >> container >> >> per >> >> > >> > GramJob in that case. That's the core mysteries with >> >> notifications. >> >> > >> > Anyway: I did a quick check some days ago and found that >> GramJob >> >> is >> >> > >> > surprisingly greedy regarding memory as you said. I'll have to >> >> further >> >> > >> > check what it is, but will probably not do that before 4.2 is >> >> out. >> >> > >> > >> >> > >> > >> >> > >> >> Due to the above, a reference to the GramJob is kept anyway, >> >> > >> regardless >> >> > >> >> of whether that reference is in client code or the local >> >> container. >> >> > >> >> >> >> > >> >> I'll try to profile a run and see if I can spot where the >> >> problems >> >> > >> are. >> >> > >> >> >> >> > >> >>> >> >> > >> >>> Martin >> >> > >> >>> >> >> > >> >>> >> >> >> > >> >>> >> The core team will be looking at improving notifications >> >> once >> >> > >> their >> >> > >> >>> >> other 4.2 deliverables are done. 
>> >> > >> >>> >> >> >> > >> >>> >> -Stu >> >> > >> >>> >> >> >> > >> >>> >> Begin forwarded message: >> >> > >> >>> >> >> >> > >> >>> >> > From: feller at mcs.anl.gov >> >> > >> >>> >> > Date: February 1, 2008 9:41:05 AM CST >> >> > >> >>> >> > To: "Jaime Frey" >> >> > >> >>> >> > Cc: "Stuart Martin" , "Terrence >> >> Martin" >> >> > >> >>> >> > >> > >> >>> >> > >, "Martin Feller" , "charles >> bacon" >> >> > >> >>> >> > >> > >> >>> >> > >, "Suchandra Thapa" , "Rob >> >> Gardner" >> >> > >> >>> >> > >> > >> >>> >> > >, "Jeff Porter" , "Alain Roy" >> >> > >> >>> , >> >> > >> >>> >> > "Todd Tannenbaum" , "Miron Livny" >> >> > >> >>> >> > >> > >> >>> >> > > >> >> > >> >>> >> > Subject: Re: Condor-G WS GRAM memory usage >> >> > >> >>> >> > >> >> > >> >>> >> >> On Jan 31, 2008, at 6:26 PM, Jaime Frey wrote: >> >> > >> >>> >> >> >> >> > >> >>> >> >>> On Jan 30, 2008, at 12:25 PM, Stuart Martin wrote: >> >> > >> >>> >> >>> >> >> > >> >>> >> >>>> On Jan 30, 2008, at Jan 30, 11:46 AM, Jaime Frey >> wrote: >> >> > >> >>> >> >>>> >> >> > >> >>> >> >>>>> Terrence Martin's scalability testing of Condor-G >> with >> >> WS >> >> > >> GRAM >> >> > >> >>> >> >>>>> raised some concerns about memory usage on the >> client >> >> side. >> >> > >> I >> >> > >> >>> did >> >> > >> >>> >> >>>>> some profiling of Condor-G's WS GRAM GAHP server, >> >> which >> >> > >> >>> appeared >> >> > >> >>> >> >>>>> to be the primary memory consumer. The GAHP server >> is >> >> a >> >> > >> >>> wrapper >> >> > >> >>> >> >>>>> around the java client libraries for WS GRAM. >> >> > >> >>> >> >>>>> >> >> > >> >>> >> >>>>> In my tests, I submitted variable numbers of jobs >> up >> >> to 30 >> >> > >> at >> >> > >> >>> a >> >> > >> >>> >> >>>>> time. The jobs were 2-minute sleep jobs with >> minimal >> >> data >> >> > >> >>> >> >>>>> transfer. All of the jobs overlapped in submission >> and >> >> > >> >>> execution. >> >> > >> >>> >> >>>>> Here is what I've discovered so far. >> >> > >> >>> >> >>>>> >> >> > >> >>> >> >>>>> Aside from the heap available to the java code, the >> >> jvm >> >> > >> used >> >> > >> >>> 117 >> >> > >> >>> >> >>>>> megs of non-shared memory and 74 megs of shared >> >> memory. >> >> > >> >>> Condor-G >> >> > >> >>> >> >>>>> creates one GAHP server for each (local uid, X509 >> DN) >> >> pair. >> >> > >> >>> >> >>>>> >> >> > >> >>> >> >>>>> The maximum jvm heap usage (as reported by the >> garbage >> >> > >> >>> collector) >> >> > >> >>> >> >>>>> was about 9 megs plus 0.9 megs per job. When the >> GAHP >> >> was >> >> > >> >>> >> >>>>> quiescent (jobs executing, Condor-G waiting for >> them >> >> to >> >> > >> >>> complete), >> >> > >> >>> >> >>>>> heap usage was about 5 megs plus 0.6 megs per job. >> >> > >> >>> >> >>>>> >> >> > >> >>> >> >>>>> The only long-term memory per job that I know of in >> >> the >> >> > >> GAHP >> >> > >> >>> is >> >> > >> >>> >> >>>>> for the notification sink for job status callbacks. >> >> 600kb >> >> > >> >>> seems >> >> > >> >>> a >> >> > >> >>> >> >>>>> little high for that. Stu, could someone on Globus >> >> help us >> >> > >> >>> >> >>>>> determine if we're using the notification sinks >> >> > >> inefficiently? >> >> > >> >>> >> >>>> >> >> > >> >>> >> >>>> Martin just looked and for the most part, there is >> >> nothing >> >> > >> >>> wrong >> >> > >> >>> >> >>>> with how condor-g manages the callback sink. 
>> >> > >> >>> >> >>>> However, one improvement that would reduce the >> memory >> >> used >> >> > >> per >> >> > >> >>> job >> >> > >> >>> >> >>>> would be to not have a notification consumer per >> job. >> >> > >> Instead >> >> > >> >>> use >> >> > >> >>> >> >>>> one for all jobs. >> >> > >> >>> >> >>>> >> >> > >> >>> >> >>>> Also, Martin recently did some analysis on condor-g >> >> stress >> >> > >> >>> tests >> >> > >> >>> >> >>>> and found that notifications are building up on the >> in >> >> the >> >> > >> >>> GRAM4 >> >> > >> >>> >> >>>> service container and that is causing delays which >> seem >> >> to >> >> > >> be >> >> > >> >>> >> >>>> causing multiple problems. We're looking at this in >> a >> >> > >> separate >> >> > >> >>> >> >>>> effort with the GT Core team. But, after this was >> >> clear, >> >> > >> >>> Martin >> >> > >> >>> >> >>>> re- >> >> > >> >>> >> >>>> ran the condor-g test and relied on polling between >> >> condor-g >> >> > >> >>> and >> >> > >> >>> >> >>>> the GRAM4 service instead of notifications. Jaime, >> >> could >> >> > >> you >> >> > >> >>> >> >>>> repeat the no-notification test and see the >> difference >> >> in >> >> > >> >>> memory? >> >> > >> >>> >> >>>> The changes would be to increase the polling >> frequency >> >> in >> >> > >> >>> condor-g >> >> > >> >>> >> >>>> and comment out the subscribe for notification. You >> >> could >> >> > >> also >> >> > >> >>> >> >>>> comment out the notification listener call(s) too. >> >> > >> >>> >> >>> >> >> > >> >>> >> >>> >> >> > >> >>> >> >>> I did two new sets of tests today. The first used >> more >> >> > >> efficient >> >> > >> >>> >> >>> callback code in the GAHP (one notification consumer >> >> rather >> >> > >> than >> >> > >> >>> one >> >> > >> >>> >> >>> per job). The second disabled notifications and >> relied >> >> on >> >> > >> >>> polling >> >> > >> >>> >> >>> for job status changes. >> >> > >> >>> >> >>> >> >> > >> >>> >> >>> The more efficient callback code did not produce a >> >> noticeable >> >> > >> >>> >> >>> reduction in memory usage. >> >> > >> >>> >> >>> >> >> > >> >>> >> >>> Disabling notifications did reduce memory usage. The >> >> maximum >> >> > >> jvm >> >> > >> >>> >> >>> heap usage was roughly 8 megs plus 0.5 megs per job. >> The >> >> > >> minimum >> >> > >> >>> >> >>> heap usage after job submission and before job >> >> completion was >> >> > >> >>> about >> >> > >> >>> >> >>> 4 megs + 0.1 megs per job. >> >> > >> >>> >> >> >> >> > >> >>> >> >> >> >> > >> >>> >> >> I ran one more test with the improved callback code. >> This >> >> > >> time, I >> >> > >> >>> >> >> stopped storing the notification producer EPRs >> associated >> >> with >> >> > >> >>> the >> >> > >> >>> >> >> GRAM job resources. Memory usage went down markedly. >> >> > >> >>> >> >> >> >> > >> >>> >> >> I was told the client had to explicitly destroy these >> >> > >> serve-side >> >> > >> >>> >> >> notification producer resources when it destroys the >> job, >> >> > >> >>> otherwise >> >> > >> >>> >> >> they hang around bogging down the server. Is this >> still >> >> the >> >> > >> case? >> >> > >> >>> The >> >> > >> >>> >> >> server can't destroy notification producers when their >> >> sources >> >> > >> of >> >> > >> >>> >> >> information are destroyed? 
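For context, a sketch of the cleanup burden this implies for a 4.0 client, per the reply that follows; it assumes the 4.0.x GramJob client's destroy() call, and the subscription cleanup is left schematic because the exact stub calls are not shown in this thread:

    import org.apache.axis.message.addressing.EndpointReferenceType;
    import org.globus.exec.client.GramJob;

    public class JobCleanup {
        public static void cleanup(GramJob job, EndpointReferenceType subscriptionEPR)
                throws Exception {
            job.destroy();   // destroy the ManagedJob resource

            // GT 4.0: the subscription resource created for state notifications
            // must be destroyed separately, or its persistence data piles up on
            // the server side; the concrete call goes through the generated
            // notification/lifetime stubs and is omitted here.
            // destroySubscription(subscriptionEPR);

            // GT 4.2 (per the reply below): destroying the job resource also
            // destroys its subscription resources, so job.destroy() alone is
            // enough there.
        }
    }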
>> >> > >> >>> >> >> >> >> > >> >>> >> > >> >> > >> >>> >> > This reminds me of the odd fact that i had to suddenly >> >> grant >> >> > >> much >> >> > >> >>> more >> >> > >> >>> >> > memory to Condor-g as soon as condor-g started storing >> >> EPRs of >> >> > >> >>> >> > subscription resources to be able to destroy them >> >> eventually. >> >> > >> >>> >> > Those EPR's are maybe not so tiny as they look like. >> >> > >> >>> >> > >> >> > >> >>> >> > For 4.0: yes, currently you'll have to store and >> >> eventually >> >> > >> >>> destroy >> >> > >> >>> >> > subscription resources manually to avoid heaping up >> >> persistence >> >> > >> >>> data >> >> > >> >>> >> > on the server-side. >> >> > >> >>> >> > For 4.2: no, you won't have to store them. A job >> resource >> >> will >> >> > >> >>> >> > destroy all subscription resources when it's destroyed. >> >> > >> >>> >> > >> >> > >> >>> >> > Overall i suggest to concentrate on 4.2 gram since the >> >> > >> "container >> >> > >> >>> >> > hangs in job destruction" problem won't exist anymore. >> >> > >> >>> >> > >> >> > >> >>> >> > Sorry, Jaime, i still can't provide you with 100% >> reliable >> >> 4.2 >> >> > >> >>> changes >> >> > >> >>> >> > in Gram in 4.2. I'll do so as soon as i can. I wonder >> if >> >> it >> >> > >> makes >> >> > >> >>> >> > sense >> >> > >> >>> >> > for us to do the 4.2-related changes in Gahp and hand >> it >> >> to you >> >> > >> >>> for >> >> > >> >>> >> > fine-tuning then? >> >> > >> >>> >> > >> >> > >> >>> >> > Martin >> >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> On Feb 8, 2008, at Feb 8, 9:19 AM, Ian Foster wrote: >> >> > >> >>> >> >> >> > >> >>> >> > Mihael: >> >> > >> >>> >> > >> >> > >> >>> >> > That's great, thanks! >> >> > >> >>> >> > >> >> > >> >>> >> > Ian. >> >> > >> >>> >> > >> >> > >> >>> >> > Mihael Hategan wrote: >> >> > >> >>> >> >> I did a 1024 job run today with ws-gram. >> >> > >> >>> >> >> I painted the results here: >> >> > >> >>> >> >> http://www-unix.mcs.anl.gov/~hategan/s/g.html >> >> > >> >>> >> >> >> >> > >> >>> >> >> Seems like client memory per job is about 370k. Which >> is >> >> quite >> >> > >> a >> >> > >> >>> lot. >> >> > >> >>> >> >> What kinda worries me is that it doesn't seem to go >> down >> >> after >> >> > >> >>> the >> >> > >> >>> >> >> jobs >> >> > >> >>> >> >> are done, so maybe there's a memory leak, or maybe the >> >> garbage >> >> > >> >>> >> >> collector >> >> > >> >>> >> >> doesn't do any major collections. I'll need to profile >> >> this to >> >> > >> >>> see >> >> > >> >>> >> >> exactly what we're talking about. >> >> > >> >>> >> >> >> >> > >> >>> >> >> The container memory is figured by looking at the >> process >> >> in >> >> > >> >>> /proc. >> >> > >> >>> >> >> It's >> >> > >> >>> >> >> total memory including shared libraries and things. >> But >> >> > >> libraries >> >> > >> >>> >> >> take a >> >> > >> >>> >> >> fixed amount of space, so a fuzzy correlation can >> >> probably be >> >> > >> >>> made. >> >> > >> >>> >> >> It >> >> > >> >>> >> >> looks quite similar to the amount of memory eaten on >> the >> >> > >> client >> >> > >> >>> side >> >> > >> >>> >> >> (per job). >> >> > >> >>> >> >> >> >> > >> >>> >> >> CPU-load-wise, WS-GRAM behaves. There is some work >> during >> >> the >> >> > >> >>> time >> >> > >> >>> >> >> the >> >> > >> >>> >> >> jobs are submitted, but the machine itself seems >> >> responsive. 
I >> >> > >> >>> have >> >> > >> >>> >> >> yet >> >> > >> >>> >> >> to plot the exact submission time for each job. >> >> > >> >>> >> >> >> >> > >> >>> >> >> So at this point I would recommend trying ws-gram as >> long >> >> as >> >> > >> >>> there >> >> > >> >>> >> >> aren't too many jobs involved (i.e. under 4000 >> parallel >> >> jobs), >> >> > >> >>> and >> >> > >> >>> >> >> while >> >> > >> >>> >> >> making sure the jvm has enough heap. More than that >> seems >> >> like >> >> > >> a >> >> > >> >>> >> >> gamble. >> >> > >> >>> >> >> >> >> > >> >>> >> >> Mihael >> >> > >> >>> >> >> >> >> > >> >>> >> >> _______________________________________________ >> >> > >> >>> >> >> Swift-devel mailing list >> >> > >> >>> >> >> Swift-devel at ci.uchicago.edu >> >> > >> >>> >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> > >> >>> >> >> >> >> > >> >>> >> >> >> >> > >> >>> >> > >> >> > >> >>> >> >> >> > >> >>> > >> >> > >> >>> > >> >> > >> >>> >> >> > >> >>> >> >> > >> >> >> >> > >> >> >> >> > >> > >> >> > >> > >> >> > >> > >> >> > >> >> >> > >> >> >> > > >> >> > > >> >> >> >> _______________________________________________ >> >> Swift-devel mailing list >> >> Swift-devel at ci.uchicago.edu >> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> >> >> > >> > >> >> > > From benc at hawaga.org.uk Sun Feb 10 05:50:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 10 Feb 2008 11:50:05 +0000 (GMT) Subject: [Swift-devel] program order Message-ID: This works in the present code - type and mapping declaration after assignment (see tests/language-behaviour/040-program-order.swift) outfile = greeting("hi"); messagefile outfile <"040-program-order.out">; When implementing some more compile time checking, I rediscovered this. I'm not sure whether I prefer this to be permitted or to be prohibited. -- From hategan at mcs.anl.gov Sun Feb 10 11:45:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 10 Feb 2008 11:45:10 -0600 Subject: [Swift-devel] program order In-Reply-To: References: Message-ID: <1202665510.5770.15.camel@blabla.mcs.anl.gov> On Sun, 2008-02-10 at 11:50 +0000, Ben Clifford wrote: > This works in the present code - type and mapping declaration after > assignment (see tests/language-behaviour/040-program-order.swift) > > outfile = greeting("hi"); > messagefile outfile <"040-program-order.out">; > > When implementing some more compile time checking, I rediscovered this. > I'm not sure whether I prefer this to be permitted or to be prohibited. Does it really? Or does it cause a race condition which happens to work most of the times? > From benc at hawaga.org.uk Sun Feb 10 12:12:11 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 10 Feb 2008 18:12:11 +0000 (GMT) Subject: [Swift-devel] program order In-Reply-To: <1202665510.5770.15.camel@blabla.mcs.anl.gov> References: <1202665510.5770.15.camel@blabla.mcs.anl.gov> Message-ID: On Sun, 10 Feb 2008, Mihael Hategan wrote: > Does it really? Or does it cause a race condition which happens to work > most of the times? It produces almost the same KML either way, though at least with the partial closing stuff that I put in a month or two ago enough to be significant. It is not a race in karajan execution because variable declarations get compiled to a separate block that is always placed before the parallel execution of assignments/procedures; so the ordering is irrelevant from that perspective (That is related roblem with using not-yet-evaluated variables in mapper parameters). 
I suspect some funny stuff will happen with partial closing when arrays are used in this order at the moment though. -- From benc at hawaga.org.uk Mon Feb 11 08:28:57 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Feb 2008 14:28:57 +0000 (GMT) Subject: [Swift-devel] cog r1871 Message-ID: On my laptop, I'm getting the below error message with latest cog and my development swift. cog r1864 doesn't give this problem. cog r1871 does (there's nothing in between those two commits in the piece of the cog svn that swift uses) echo failed Execution failed: Exception in echo: Arguments: [hello] Host: tp-fork-gram2 Directory: 001-echo-20080211-1419-kvenil6g/jobs/m/echo-mn9fr9oi stderr.txt:. stdout.txt:. ---- Caused by: Exception in getFile Caused by: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 451 refusing to store with active mode org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) at org.globus.ftp.extended.GridFTPServerFacade.store(GridFTPServerFacade.java:317) at org.globus.ftp.FTPClient.get(FTPClient.java:1236) at org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java:359) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doSource(DelegatedFileTransferHandler.java:275) at org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doSource(CachingDelegatedFileTransferHandler.java:60) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:490) at java.lang.Thread.run(Thread.java:613) -- From hategan at mcs.anl.gov Mon Feb 11 10:01:21 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 10:01:21 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: References: Message-ID: <1202745681.15887.10.camel@blabla.mcs.anl.gov> On Mon, 2008-02-11 at 14:28 +0000, Ben Clifford wrote: > On my laptop, I'm getting the below error message with latest cog and my > development swift. > > cog r1864 doesn't give this problem. cog r1871 does (there's nothing in > between those two commits in the piece of the cog svn that swift uses) I know what causes the problem. It's r1871, as you say. > > > echo failed > Execution failed: > Exception in echo: > Arguments: [hello] > Host: tp-fork-gram2 > Directory: 001-echo-20080211-1419-kvenil6g/jobs/m/echo-mn9fr9oi > stderr.txt:. > > stdout.txt:. > > ---- > > Caused by: > Exception in getFile > Caused by: > Server refused performing the request. 
Custom message: (error > code 1) [Nested exception message: Custom message: Unexpected reply: 451 > refusing to store with active mode > org.globus.ftp.exception.DataChannelException: setPassive() must match > store() and setActive() - retrieve() (error code 2) > org.globus.ftp.exception.DataChannelException: setPassive() must match > store() and setActive() - retrieve() (error code 2) > at > org.globus.ftp.extended.GridFTPServerFacade.store(GridFTPServerFacade.java:317) > at org.globus.ftp.FTPClient.get(FTPClient.java:1236) > at > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java:359) > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doSource(DelegatedFileTransferHandler.java:275) > at > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doSource(CachingDelegatedFileTransferHandler.java:60) > at > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:490) > at java.lang.Thread.run(Thread.java:613) > > From hategan at mcs.anl.gov Mon Feb 11 10:59:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 10:59:52 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: <1202745681.15887.10.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> Message-ID: <1202749192.18234.0.camel@blabla.mcs.anl.gov> On Mon, 2008-02-11 at 10:01 -0600, Mihael Hategan wrote: > On Mon, 2008-02-11 at 14:28 +0000, Ben Clifford wrote: > > On my laptop, I'm getting the below error message with latest cog and my > > development swift. > > > > cog r1864 doesn't give this problem. cog r1871 does (there's nothing in > > between those two commits in the piece of the cog svn that swift uses) > > I know what causes the problem. Actually I don't. I only have a suspicion. Can you send me the logs? > It's r1871, as you say. > > > > > > > echo failed > > Execution failed: > > Exception in echo: > > Arguments: [hello] > > Host: tp-fork-gram2 > > Directory: 001-echo-20080211-1419-kvenil6g/jobs/m/echo-mn9fr9oi > > stderr.txt:. > > > > stdout.txt:. > > > > ---- > > > > Caused by: > > Exception in getFile > > Caused by: > > Server refused performing the request. 
Custom message: (error > > code 1) [Nested exception message: Custom message: Unexpected reply: 451 > > refusing to store with active mode > > org.globus.ftp.exception.DataChannelException: setPassive() must match > > store() and setActive() - retrieve() (error code 2) > > org.globus.ftp.exception.DataChannelException: setPassive() must match > > store() and setActive() - retrieve() (error code 2) > > at > > org.globus.ftp.extended.GridFTPServerFacade.store(GridFTPServerFacade.java:317) > > at org.globus.ftp.FTPClient.get(FTPClient.java:1236) > > at > > org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.getFile(FileResourceImpl.java:359) > > at > > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doSource(DelegatedFileTransferHandler.java:275) > > at > > org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doSource(CachingDelegatedFileTransferHandler.java:60) > > at > > org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:490) > > at java.lang.Thread.run(Thread.java:613) > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Feb 11 11:25:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Feb 2008 17:25:05 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202749192.18234.0.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 11 Feb 2008, Mihael Hategan wrote: > Actually I don't. I only have a suspicion. Can you send me the logs? the log for running tests/language-behaviour/061-cattwo to tg-uc from my laptop is here: http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080211-1720-a2hqh596.log -- From hategan at mcs.anl.gov Mon Feb 11 13:34:03 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 13:34:03 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> Message-ID: <1202758443.28686.0.camel@blabla.mcs.anl.gov> On Mon, 2008-02-11 at 17:25 +0000, Ben Clifford wrote: > > On Mon, 11 Feb 2008, Mihael Hategan wrote: > > > Actually I don't. I only have a suspicion. Can you send me the logs? > > the log for running tests/language-behaviour/061-cattwo to tg-uc from my > laptop is here: > > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080211-1720-a2hqh596.log > r1875 should fix this. From benc at hawaga.org.uk Mon Feb 11 16:19:15 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Feb 2008 22:19:15 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202758443.28686.0.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 11 Feb 2008, Mihael Hategan wrote: > r1875 should fix this. yes, it seems to. 
-- From hategan at mcs.anl.gov Mon Feb 11 16:37:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 16:37:12 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> Message-ID: <1202769433.31985.1.camel@blabla.mcs.anl.gov> Also, r1876 updates the gram4 client to a patched version of 4.0.6 which seems to eat less memory than 4.0.6 and earlier. On Mon, 2008-02-11 at 22:19 +0000, Ben Clifford wrote: > > On Mon, 11 Feb 2008, Mihael Hategan wrote: > > > r1875 should fix this. > > yes, it seems to. > From benc at hawaga.org.uk Mon Feb 11 16:50:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 11 Feb 2008 22:50:56 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202769433.31985.1.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> Message-ID: I'm seeing repeatable cleanup errors like the below. The workflows run to completion, though. RunID: 20080211-2248-rsqe1da0 cat started cat completed The following warnings have occurred: 1. Cleanup on tguc failed Caused by: Cannot submit job: null Caused by: java.lang.NullPointerException at org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) at org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) at org.globus.exec.client.GramJob.submit(GramJob.java:447) at org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) -- From hategan at mcs.anl.gov Mon Feb 11 17:46:42 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 11 Feb 2008 17:46:42 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> Message-ID: <1202773602.779.0.camel@blabla.mcs.anl.gov> Martin? On Mon, 2008-02-11 at 22:50 +0000, Ben Clifford wrote: > I'm seeing repeatable cleanup errors like the below. The workflows run to > completion, though. > > RunID: 20080211-2248-rsqe1da0 > cat started > cat completed > The following warnings have occurred: > 1. 
Cleanup on tguc failed > Caused by: > Cannot submit job: null > Caused by: > java.lang.NullPointerException > at > org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) > at > org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) > at org.globus.exec.client.GramJob.submit(GramJob.java:447) > at > org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) > at > org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) > at > edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > > From feller at mcs.anl.gov Mon Feb 11 23:28:05 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Mon, 11 Feb 2008 23:28:05 -0600 (CST) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202773602.779.0.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> Message-ID: <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> My fault, not the ObjectSerializers one. You submitted in batch-mode? The attached jar should fix that. Hope the java version is fine. Martin > Martin? > > On Mon, 2008-02-11 at 22:50 +0000, Ben Clifford wrote: >> I'm seeing repeatable cleanup errors like the below. The workflows run >> to >> completion, though. >> >> RunID: 20080211-2248-rsqe1da0 >> cat started >> cat completed >> The following warnings have occurred: >> 1. Cleanup on tguc failed >> Caused by: >> Cannot submit job: null >> Caused by: >> java.lang.NullPointerException >> at >> org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) >> at >> org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) >> at org.globus.exec.client.GramJob.submit(GramJob.java:447) >> at >> org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) >> at >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) >> at >> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) >> at >> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >> >> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: gram-client.jar Type: application/octet-stream Size: 35855 bytes Desc: not available URL: From benc at hawaga.org.uk Tue Feb 12 06:47:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 12:47:23 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: On Mon, 11 Feb 2008, feller at mcs.anl.gov wrote: > My fault, not the ObjectSerializers one. > You submitted in batch-mode? The final cleanup job that failed is in batch mode, yes. Its the only one submitted that way. > The attached jar should fix that. 
With your new jar, I no longer get that error. I did once get the below stack trace, though execution appeared to continue. It hasn't happened a second time or third time on running the same tests. touch started Unable to destroy remote service for task urn:0-1-1202817892228 java.lang.NullPointerException at org.globus.exec.generated.service.ManagedJobServiceAddressingLocator.getManagedJobPortTypePort(ManagedJobServiceAddressingLocator.java:12) at org.globus.exec.utils.client.ManagedJobClientHelper.getPort(ManagedJobClientHelper.java:32) at org.globus.exec.client.GramJob.destroy(GramJob.java:1303) at org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.cleanup(JobSubmissionTaskHandler.java:431) at org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.stateChanged(JobSubmissionTaskHandler.java:397) at org.globus.exec.client.GramJob.setState(GramJob.java:321) at org.globus.exec.client.GramJob.deliver(GramJob.java:1677) at org.globus.wsrf.impl.notification.NotificationConsumerProvider.notify(NotificationConsumerProvider.java:126) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.apache.axis.providers.java.RPCProvider.invokeMethod(RPCProvider.java:384) at org.apache.axis.providers.java.RPCProvider.processMessage(RPCProvider.java:281) at org.apache.axis.providers.java.JavaProvider.invoke(JavaProvider.java:319) at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) at org.apache.axis.handlers.soap.SOAPService.invoke(SOAPService.java:450) at org.apache.axis.server.AxisServer.invoke(AxisServer.java:285) at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664) at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382) at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:147) at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291) touch completed -- From benc at hawaga.org.uk Tue Feb 12 06:49:21 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 12:49:21 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: <1202769433.31985.1.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 11 Feb 2008, Mihael Hategan wrote: > Also, r1876 updates the gram4 client to a patched version of 4.0.6 which > seems to eat less memory than 4.0.6 and earlier. For source code reproducibility when some sucker wants to go look at the source code, can you label the gram jars with a timestamp (until such time as GT moves to a version control system with commit IDs)? 
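One low-tech way to get that kind of label (purely an illustration, not something the GT or cog build currently emits) is to stamp the patched jar's META-INF/MANIFEST.MF with the build time and the source it was cut from, so the provenance can be read straight out of a deployed checkout:

Implementation-Title: gram-client (patched)
Implementation-Version: 4.0.6-p1
Built-On: 2008-02-11T16:37-0600
Built-From: gt 4.0.6 plus the memory patch

Something like "unzip -p gram-client.jar META-INF/MANIFEST.MF" then shows exactly which build a given cog tree is carrying.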
-- From feller at mcs.anl.gov Tue Feb 12 09:33:14 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Tue, 12 Feb 2008 09:33:14 -0600 (CST) Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: <49223.130.202.97.10.1202830394.squirrel@www-unix.mcs.anl.gov> > On Mon, 11 Feb 2008, feller at mcs.anl.gov wrote: > >> My fault, not the ObjectSerializers one. >> You submitted in batch-mode? > > The final cleanup job that failed is in batch mode, yes. Its the only one > submitted that way. > >> The attached jar should fix that. > > With your new jar, I no longer get that error. I did once get the below > stack trace, though execution appeared to continue. It hasn't happened a > second time or third time on running the same tests. > > touch started > Unable to destroy remote service for task urn:0-1-1202817892228 > java.lang.NullPointerException > at > org.globus.exec.generated.service.ManagedJobServiceAddressingLocator.getManagedJobPortTypePort(ManagedJobServiceAddressingLocator.java:12) > at > org.globus.exec.utils.client.ManagedJobClientHelper.getPort(ManagedJobClientHelper.java:32) > at org.globus.exec.client.GramJob.destroy(GramJob.java:1303) > at > org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.cleanup(JobSubmissionTaskHandler.java:431) > at > org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.stateChanged(JobSubmissionTaskHandler.java:397) > at org.globus.exec.client.GramJob.setState(GramJob.java:321) > at org.globus.exec.client.GramJob.deliver(GramJob.java:1677) > at > org.globus.wsrf.impl.notification.NotificationConsumerProvider.notify(NotificationConsumerProvider.java:126) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:585) > at > org.apache.axis.providers.java.RPCProvider.invokeMethod(RPCProvider.java:384) > at > org.apache.axis.providers.java.RPCProvider.processMessage(RPCProvider.java:281) > at > org.apache.axis.providers.java.JavaProvider.invoke(JavaProvider.java:319) > at > org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) > at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) > at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) > at > org.apache.axis.handlers.soap.SOAPService.invoke(SOAPService.java:450) > at org.apache.axis.server.AxisServer.invoke(AxisServer.java:285) > at > org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664) > at > org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382) > at > org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:147) > at > org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291) > touch completed > > -- This is odd. Can you have an eye on that in further tests? May it happen that you use GramJob.setEndpoint(EndpointReferenceType) before destruction at some point and pass null as argument? That's the only situation where i can see that this can happen right now without an exception being thrown before. 
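If that hypothesis is right, the client-side workaround is just a guard: skip the destroy call when no endpoint was ever recorded for the job, instead of passing null into GramJob.setEndpoint(). A sketch using only the two GramJob methods named in this thread; the helper, its arguments and the logging are illustrative, and the EndpointReferenceType import assumes the Axis addressing class that GT 4.0 clients normally use:

import org.apache.axis.message.addressing.EndpointReferenceType;
import org.globus.exec.client.GramJob;

public final class JobCleanupSketch {
    // Best-effort destroy of the remote ManagedJob resource.  Passing a
    // null EPR into setEndpoint() before destroy() is the failure mode
    // Martin describes above, so bail out early instead.
    static void destroyQuietly(GramJob job, EndpointReferenceType epr, String taskId) {
        if (epr == null) {
            System.err.println("Skipping destroy, no endpoint recorded for " + taskId);
            return;
        }
        try {
            job.setEndpoint(epr);
            job.destroy();
        } catch (Exception e) {
            // Cleanup failures are warnings only; the workflow already completed.
            System.err.println("Unable to destroy remote service for " + taskId + ": " + e);
        }
    }
}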
Martin From mikekubal at yahoo.com Tue Feb 12 12:08:51 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 10:08:51 -0800 (PST) Subject: [Swift-devel] latest attempt with GRAM4 In-Reply-To: Message-ID: <734334.98494.qm@web52305.mail.re2.yahoo.com> Hello All, I am running with the cog and swift from svn as of Monday afternoon, 2/11. The swift script ran successfully when using pre-ws, but failed with ws-gram. I am also running with kickstart on, but will now test with kickstart off to see if this is the problem. This is the error I get back. (I rsynced the log files to Ben's dir at UC, job gtxa3945 is the one that failed). Failed to transfer kickstart records from run_MD_pipeline_loop_for_impdh-20080212-1152-gtxa3945/kickstart/l/UC-64Exception in getFile task:transfer @ vdl-int.k, line: 322 sys:try @ vdl-int.k, line: 322 vdl:transferkickstartrec @ vdl-int.k, line: 409 sys:set @ vdl-int.k, line: 409 sys:sequential @ vdl-int.k, line: 409 sys:try @ vdl-int.k, line: 408 sys:else @ vdl-int.k, line: 407 sys:if @ vdl-int.k, line: 405 sys:set @ vdl-int.k, line: 404 sys:catch @ vdl-int.k, line: 396 sys:try @ vdl-int.k, line: 354 task:allocatehost @ vdl-int.k, line: 334 vdl:execute2 @ execute-default.k, line: 23 sys:restartonerror @ execute-default.k, line: 21 sys:sequential @ execute-default.k, line: 19 sys:try @ execute-default.k, line: 18 sys:if @ execute-default.k, line: 17 sys:then @ execute-default.k, line: 16 sys:if @ execute-default.k, line: 15 vdl:execute @ run_MD_pipeline_loop_for_impdh.kml, line: 67 prepare_ligand @ run_MD_pipeline_loop_for_impdh.kml, line: 585 sys:sequential @ run_MD_pipeline_loop_for_impdh.kml, line: 584 sys:parallel @ run_MD_pipeline_loop_for_impdh.kml, line: 583 sys:parallelfor @ run_MD_pipeline_loop_for_impdh.kml, line: 450 sys:sequential @ run_MD_pipeline_loop_for_impdh.kml, line: 449 vdl:mainp @ run_MD_pipeline_loop_for_impdh.kml, line: 448 mainp @ vdl.k, line: 150 vdl:mains @ run_MD_pipeline_loop_for_impdh.kml, line: 447 vdl:mains @ run_MD_pipeline_loop_for_impdh.kml, line: 447 rlog:restartlog @ run_MD_pipeline_loop_for_impdh.kml, line: 446 kernel:project @ run_MD_pipeline_loop_for_impdh.kml, line: 2 run_MD_pipeline_loop_for_impdh-20080212-1152-gtxa3945 Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Exception in getFile Caused by: org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed. : globus_l_gfs_file_open failed. 500-globus_xio: Unable to open file /home/kubal/Swift_Runs/run_MD_pipeline_loop_for_impdh-20080212-1152-gtxa3945/kickstart/l/amberize_ligand-lqshnboi-kickstart.xml 500-globus_xio: System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500- 500 End.] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 500-Command failed. : globus_l_gfs_file_open failed. 500-globus_xio: Unable to open file /home/kubal/Swift_Runs/run_MD_pipeline_loop_for_impdh-20080212-1152-gtxa3945/kickstart/l/amberize_ligand-lqshnboi-kickstart.xml 500-globus_xio: System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500- 500 End.] --- Ben Clifford wrote: > > On Mon, 11 Feb 2008, Mihael Hategan wrote: > > > Also, r1876 updates the gram4 client to a patched > version of 4.0.6 which > > seems to eat less memory than 4.0.6 and earlier. 
> > For source code reproducibility when some sucker > wants to go look at the > source code, can you label the gram jars with a > timestamp (until such time > as GT moves to a version control system with commit > IDs)? > > -- > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping From benc at hawaga.org.uk Tue Feb 12 12:33:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 18:33:34 +0000 (GMT) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <734334.98494.qm@web52305.mail.re2.yahoo.com> References: <734334.98494.qm@web52305.mail.re2.yahoo.com> Message-ID: yeah, run that same without kickstart. the error reported is that kickstart didn't work right - but there's perhaps some underlying error. -- From mikekubal at yahoo.com Tue Feb 12 13:36:20 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 11:36:20 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: Message-ID: <283100.87314.qm@web52307.mail.re2.yahoo.com> Yes, I believe you are right. The kickstart message may be only a warning. After digging a little deeper it appears the job is failing due to a project/account id problem. I get the following error: Caused by: The executable could not be started., qsub: Invalid Account MSG=invalid account I am specifying the same TG-account in my site-file for the gram4 run that fails, as in the site-file for the pre-ws job that suceeds. This is the same project, TG-MCA01S018, that is set in my .tg_default_project file in ~kubal/ on the UC teragrid. --- Ben Clifford wrote: > yeah, run that same without kickstart. the error > reported is that > kickstart didn't work right - but there's perhaps > some underlying error. > -- > > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From hategan at mcs.anl.gov Tue Feb 12 13:45:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 13:45:02 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <283100.87314.qm@web52307.mail.re2.yahoo.com> References: <283100.87314.qm@web52307.mail.re2.yahoo.com> Message-ID: <1202845502.13985.1.camel@blabla.mcs.anl.gov> While this doesn't solve the underlying problem, it may help you get this to work: log into tg-login1.uc..., set this project as default, then remove the project spec from the sites file and try again. Mihael On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: > Yes, I believe you are right. The kickstart message > may be only a warning. After digging a little deeper > it appears the job is failing due to a project/account > id problem. I get the following error: > > Caused by: > The executable could not be started., qsub: > Invalid Account MSG=invalid account > > I am specifying the same TG-account in my site-file > for the gram4 run that fails, as in the site-file for > the pre-ws job that suceeds. This is the same project, > TG-MCA01S018, that is set in my .tg_default_project > file in ~kubal/ on the UC teragrid. > > > > > > > > > > --- Ben Clifford wrote: > > > yeah, run that same without kickstart. 
the error > > reported is that > > kickstart didn't work right - but there's perhaps > > some underlying error. > > -- > > > > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > From mikekubal at yahoo.com Tue Feb 12 14:09:17 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 12:09:17 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202845502.13985.1.camel@blabla.mcs.anl.gov> Message-ID: <874540.48019.qm@web52309.mail.re2.yahoo.com> I'll give it a try. When using GRAM4, is qsub the method used to ultimately put the job in the queue? MikeK --- Mihael Hategan wrote: > While this doesn't solve the underlying problem, it > may help you get > this to work: log into tg-login1.uc..., set this > project as default, > then remove the project spec from the sites file and > try again. > > Mihael > > On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: > > Yes, I believe you are right. The kickstart > message > > may be only a warning. After digging a little > deeper > > it appears the job is failing due to a > project/account > > id problem. I get the following error: > > > > Caused by: > > The executable could not be started., > qsub: > > Invalid Account MSG=invalid account > > > > I am specifying the same TG-account in my > site-file > > for the gram4 run that fails, as in the site-file > for > > the pre-ws job that suceeds. This is the same > project, > > TG-MCA01S018, that is set in my > .tg_default_project > > file in ~kubal/ on the UC teragrid. > > > > > > > > > > > > > > > > > > > > --- Ben Clifford wrote: > > > > > yeah, run that same without kickstart. the error > > > reported is that > > > kickstart didn't work right - but there's > perhaps > > > some underlying error. > > > -- > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Never miss a thing. Make Yahoo your home page. > > http://www.yahoo.com/r/hs > > > > ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From hategan at mcs.anl.gov Tue Feb 12 14:15:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 14:15:06 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <874540.48019.qm@web52309.mail.re2.yahoo.com> References: <874540.48019.qm@web52309.mail.re2.yahoo.com> Message-ID: <1202847306.14542.1.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal wrote: > I'll give it a try. > > When using GRAM4, is qsub the method used to > ultimately put the job in the queue? Looks like it. I also believe it's the case with pre-ws gram. Stu may be able to clarify. > > MikeK > --- Mihael Hategan wrote: > > > While this doesn't solve the underlying problem, it > > may help you get > > this to work: log into tg-login1.uc..., set this > > project as default, > > then remove the project spec from the sites file and > > try again. > > > > Mihael > > > > On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: > > > Yes, I believe you are right. The kickstart > > message > > > may be only a warning. After digging a little > > deeper > > > it appears the job is failing due to a > > project/account > > > id problem. 
I get the following error: > > > > > > Caused by: > > > The executable could not be started., > > qsub: > > > Invalid Account MSG=invalid account > > > > > > I am specifying the same TG-account in my > > site-file > > > for the gram4 run that fails, as in the site-file > > for > > > the pre-ws job that suceeds. This is the same > > project, > > > TG-MCA01S018, that is set in my > > .tg_default_project > > > file in ~kubal/ on the UC teragrid. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- Ben Clifford wrote: > > > > > > > yeah, run that same without kickstart. the error > > > > reported is that > > > > kickstart didn't work right - but there's > > perhaps > > > > some underlying error. > > > > -- > > > > > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Never miss a thing. Make Yahoo your home page. > > > http://www.yahoo.com/r/hs > > > > > > > > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > From smartin at mcs.anl.gov Tue Feb 12 14:20:39 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 12 Feb 2008 14:20:39 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202847306.14542.1.camel@blabla.mcs.anl.gov> References: <874540.48019.qm@web52309.mail.re2.yahoo.com> <1202847306.14542.1.camel@blabla.mcs.anl.gov> Message-ID: that's right, qsub is used for PBS (and some others too) bsub is LSF condor_q for condor ... -Stu On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael Hategan wrote: > > On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal wrote: >> I'll give it a try. >> >> When using GRAM4, is qsub the method used to >> ultimately put the job in the queue? > > Looks like it. I also believe it's the case with pre-ws gram. Stu > may be > able to clarify. > >> >> MikeK >> --- Mihael Hategan wrote: >> >>> While this doesn't solve the underlying problem, it >>> may help you get >>> this to work: log into tg-login1.uc..., set this >>> project as default, >>> then remove the project spec from the sites file and >>> try again. >>> >>> Mihael >>> >>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: >>>> Yes, I believe you are right. The kickstart >>> message >>>> may be only a warning. After digging a little >>> deeper >>>> it appears the job is failing due to a >>> project/account >>>> id problem. I get the following error: >>>> >>>> Caused by: >>>> The executable could not be started., >>> qsub: >>>> Invalid Account MSG=invalid account >>>> >>>> I am specifying the same TG-account in my >>> site-file >>>> for the gram4 run that fails, as in the site-file >>> for >>>> the pre-ws job that suceeds. This is the same >>> project, >>>> TG-MCA01S018, that is set in my >>> .tg_default_project >>>> file in ~kubal/ on the UC teragrid. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> --- Ben Clifford wrote: >>>> >>>>> yeah, run that same without kickstart. the error >>>>> reported is that >>>>> kickstart didn't work right - but there's >>> perhaps >>>>> some underlying error. >>>>> -- >>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> >>> >> ____________________________________________________________________________________ >>>> Never miss a thing. Make Yahoo your home page. 
>>>> http://www.yahoo.com/r/hs >>>> >>> >>> >> >> >> >> >> ____________________________________________________________________________________ >> Be a better friend, newshound, and >> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >> > From hategan at mcs.anl.gov Tue Feb 12 14:23:22 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 14:23:22 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: References: <874540.48019.qm@web52309.mail.re2.yahoo.com> <1202847306.14542.1.camel@blabla.mcs.anl.gov> Message-ID: <1202847802.15085.0.camel@blabla.mcs.anl.gov> Is this the same for pre-WS GRAM? On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin wrote: > that's right, qsub is used for PBS (and some others too) > bsub is LSF > condor_q for condor > ... > > -Stu > > On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael Hategan wrote: > > > > > On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal wrote: > >> I'll give it a try. > >> > >> When using GRAM4, is qsub the method used to > >> ultimately put the job in the queue? > > > > Looks like it. I also believe it's the case with pre-ws gram. Stu > > may be > > able to clarify. > > > >> > >> MikeK > >> --- Mihael Hategan wrote: > >> > >>> While this doesn't solve the underlying problem, it > >>> may help you get > >>> this to work: log into tg-login1.uc..., set this > >>> project as default, > >>> then remove the project spec from the sites file and > >>> try again. > >>> > >>> Mihael > >>> > >>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: > >>>> Yes, I believe you are right. The kickstart > >>> message > >>>> may be only a warning. After digging a little > >>> deeper > >>>> it appears the job is failing due to a > >>> project/account > >>>> id problem. I get the following error: > >>>> > >>>> Caused by: > >>>> The executable could not be started., > >>> qsub: > >>>> Invalid Account MSG=invalid account > >>>> > >>>> I am specifying the same TG-account in my > >>> site-file > >>>> for the gram4 run that fails, as in the site-file > >>> for > >>>> the pre-ws job that suceeds. This is the same > >>> project, > >>>> TG-MCA01S018, that is set in my > >>> .tg_default_project > >>>> file in ~kubal/ on the UC teragrid. > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> --- Ben Clifford wrote: > >>>> > >>>>> yeah, run that same without kickstart. the error > >>>>> reported is that > >>>>> kickstart didn't work right - but there's > >>> perhaps > >>>>> some underlying error. > >>>>> -- > >>>>> > >>>>> > >>>>> > >>>> > >>>> > >>>> > >>>> > >>> > >> ____________________________________________________________________________________ > >>>> Never miss a thing. Make Yahoo your home page. > >>>> http://www.yahoo.com/r/hs > >>>> > >>> > >>> > >> > >> > >> > >> > >> ____________________________________________________________________________________ > >> Be a better friend, newshound, and > >> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >> > > > From smartin at mcs.anl.gov Tue Feb 12 14:26:44 2008 From: smartin at mcs.anl.gov (Stuart Martin) Date: Tue, 12 Feb 2008 14:26:44 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202847802.15085.0.camel@blabla.mcs.anl.gov> References: <874540.48019.qm@web52309.mail.re2.yahoo.com> <1202847306.14542.1.camel@blabla.mcs.anl.gov> <1202847802.15085.0.camel@blabla.mcs.anl.gov> Message-ID: <41A561F6-5D46-4B2C-96B5-E693290C41C6@mcs.anl.gov> Yes. 
Both versions use the *same* perl scripts to submit jobs. On Feb 12, 2008, at Feb 12, 2:23 PM, Mihael Hategan wrote: > Is this the same for pre-WS GRAM? > > On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin wrote: >> that's right, qsub is used for PBS (and some others too) >> bsub is LSF >> condor_q for condor >> ... >> >> -Stu >> >> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael Hategan wrote: >> >>> >>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal wrote: >>>> I'll give it a try. >>>> >>>> When using GRAM4, is qsub the method used to >>>> ultimately put the job in the queue? >>> >>> Looks like it. I also believe it's the case with pre-ws gram. Stu >>> may be >>> able to clarify. >>> >>>> >>>> MikeK >>>> --- Mihael Hategan wrote: >>>> >>>>> While this doesn't solve the underlying problem, it >>>>> may help you get >>>>> this to work: log into tg-login1.uc..., set this >>>>> project as default, >>>>> then remove the project spec from the sites file and >>>>> try again. >>>>> >>>>> Mihael >>>>> >>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal wrote: >>>>>> Yes, I believe you are right. The kickstart >>>>> message >>>>>> may be only a warning. After digging a little >>>>> deeper >>>>>> it appears the job is failing due to a >>>>> project/account >>>>>> id problem. I get the following error: >>>>>> >>>>>> Caused by: >>>>>> The executable could not be started., >>>>> qsub: >>>>>> Invalid Account MSG=invalid account >>>>>> >>>>>> I am specifying the same TG-account in my >>>>> site-file >>>>>> for the gram4 run that fails, as in the site-file >>>>> for >>>>>> the pre-ws job that suceeds. This is the same >>>>> project, >>>>>> TG-MCA01S018, that is set in my >>>>> .tg_default_project >>>>>> file in ~kubal/ on the UC teragrid. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> --- Ben Clifford wrote: >>>>>> >>>>>>> yeah, run that same without kickstart. the error >>>>>>> reported is that >>>>>>> kickstart didn't work right - but there's >>>>> perhaps >>>>>>> some underlying error. >>>>>>> -- >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> ____________________________________________________________________________________ >>>>>> Never miss a thing. Make Yahoo your home page. >>>>>> http://www.yahoo.com/r/hs >>>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> >>>> ____________________________________________________________________________________ >>>> Be a better friend, newshound, and >>>> know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>> >>> >> > From mikekubal at yahoo.com Tue Feb 12 14:34:28 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 12:34:28 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202847802.15085.0.camel@blabla.mcs.anl.gov> Message-ID: <660399.26765.qm@web52308.mail.re2.yahoo.com> I tried running with the account id removed from the sites.file as in the following line: but received the same error. --- Mihael Hategan wrote: > Is this the same for pre-WS GRAM? > > On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin > wrote: > > that's right, qsub is used for PBS (and some > others too) > > bsub is LSF > > condor_q for condor > > ... > > > > -Stu > > > > On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > Hategan wrote: > > > > > > > > On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal > wrote: > > >> I'll give it a try. > > >> > > >> When using GRAM4, is qsub the method used to > > >> ultimately put the job in the queue? > > > > > > Looks like it. 
I also believe it's the case with > pre-ws gram. Stu > > > may be > > > able to clarify. > > > > > >> > > >> MikeK > > >> --- Mihael Hategan wrote: > > >> > > >>> While this doesn't solve the underlying > problem, it > > >>> may help you get > > >>> this to work: log into tg-login1.uc..., set > this > > >>> project as default, > > >>> then remove the project spec from the sites > file and > > >>> try again. > > >>> > > >>> Mihael > > >>> > > >>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal > wrote: > > >>>> Yes, I believe you are right. The kickstart > > >>> message > > >>>> may be only a warning. After digging a little > > >>> deeper > > >>>> it appears the job is failing due to a > > >>> project/account > > >>>> id problem. I get the following error: > > >>>> > > >>>> Caused by: > > >>>> The executable could not be started., > > >>> qsub: > > >>>> Invalid Account MSG=invalid account > > >>>> > > >>>> I am specifying the same TG-account in my > > >>> site-file > > >>>> for the gram4 run that fails, as in the > site-file > > >>> for > > >>>> the pre-ws job that suceeds. This is the same > > >>> project, > > >>>> TG-MCA01S018, that is set in my > > >>> .tg_default_project > > >>>> file in ~kubal/ on the UC teragrid. > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> --- Ben Clifford wrote: > > >>>> > > >>>>> yeah, run that same without kickstart. the > error > > >>>>> reported is that > > >>>>> kickstart didn't work right - but there's > > >>> perhaps > > >>>>> some underlying error. > > >>>>> -- > > >>>>> > > >>>>> > > >>>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>> > > >> > ____________________________________________________________________________________ > > >>>> Never miss a thing. Make Yahoo your home > page. > > >>>> http://www.yahoo.com/r/hs > > >>>> > > >>> > > >>> > > >> > > >> > > >> > > >> > > >> > ____________________________________________________________________________________ > > >> Be a better friend, newshound, and > > >> know-it-all with Yahoo! Mobile. Try it now. > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > >> > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping From hategan at mcs.anl.gov Tue Feb 12 14:37:18 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 14:37:18 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <660399.26765.qm@web52308.mail.re2.yahoo.com> References: <660399.26765.qm@web52308.mail.re2.yahoo.com> Message-ID: <1202848638.15905.0.camel@blabla.mcs.anl.gov> You should probably remove the line completely. Did you chose a default project on the login node with tgprojects? On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal wrote: > I tried running with the account id removed from the > sites.file as in the following line: > > > > but received the same error. > > > > --- Mihael Hategan wrote: > > > Is this the same for pre-WS GRAM? > > > > On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin > > wrote: > > > that's right, qsub is used for PBS (and some > > others too) > > > bsub is LSF > > > condor_q for condor > > > ... 
> > > > > > -Stu > > > > > > On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > Hategan wrote: > > > > > > > > > > > On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal > > wrote: > > > >> I'll give it a try. > > > >> > > > >> When using GRAM4, is qsub the method used to > > > >> ultimately put the job in the queue? > > > > > > > > Looks like it. I also believe it's the case with > > pre-ws gram. Stu > > > > may be > > > > able to clarify. > > > > > > > >> > > > >> MikeK > > > >> --- Mihael Hategan wrote: > > > >> > > > >>> While this doesn't solve the underlying > > problem, it > > > >>> may help you get > > > >>> this to work: log into tg-login1.uc..., set > > this > > > >>> project as default, > > > >>> then remove the project spec from the sites > > file and > > > >>> try again. > > > >>> > > > >>> Mihael > > > >>> > > > >>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal > > wrote: > > > >>>> Yes, I believe you are right. The kickstart > > > >>> message > > > >>>> may be only a warning. After digging a little > > > >>> deeper > > > >>>> it appears the job is failing due to a > > > >>> project/account > > > >>>> id problem. I get the following error: > > > >>>> > > > >>>> Caused by: > > > >>>> The executable could not be started., > > > >>> qsub: > > > >>>> Invalid Account MSG=invalid account > > > >>>> > > > >>>> I am specifying the same TG-account in my > > > >>> site-file > > > >>>> for the gram4 run that fails, as in the > > site-file > > > >>> for > > > >>>> the pre-ws job that suceeds. This is the same > > > >>> project, > > > >>>> TG-MCA01S018, that is set in my > > > >>> .tg_default_project > > > >>>> file in ~kubal/ on the UC teragrid. > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> --- Ben Clifford wrote: > > > >>>> > > > >>>>> yeah, run that same without kickstart. the > > error > > > >>>>> reported is that > > > >>>>> kickstart didn't work right - but there's > > > >>> perhaps > > > >>>>> some underlying error. > > > >>>>> -- > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>> > > > >> > > > ____________________________________________________________________________________ > > > >>>> Never miss a thing. Make Yahoo your home > > page. > > > >>>> http://www.yahoo.com/r/hs > > > >>>> > > > >>> > > > >>> > > > >> > > > >> > > > >> > > > >> > > > >> > > > ____________________________________________________________________________________ > > > >> Be a better friend, newshound, and > > > >> know-it-all with Yahoo! Mobile. Try it now. > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > > >> > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Looking for last minute shopping deals? > Find them fast with Yahoo! Search. 
http://tools.search.yahoo.com/newsearch/category.php?category=shopping > From insley at mcs.anl.gov Tue Feb 12 14:45:29 2008 From: insley at mcs.anl.gov (joseph insley) Date: Tue, 12 Feb 2008 14:45:29 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202848638.15905.0.camel@blabla.mcs.anl.gov> References: <660399.26765.qm@web52308.mail.re2.yahoo.com> <1202848638.15905.0.camel@blabla.mcs.anl.gov> Message-ID: <54A8DBC2-386E-4446-B29C-64952AD7B782@mcs.anl.gov> Mike K, looks like you have the wrong value in your .tg_default_project file: insley at tg-viz-login1:~> more ~kubal/.tg_default_project TG-MCA01S018 you should be using: TG-MCB010025N insley at tg-viz-login1:~> tgusage -i -u kubal [snip] Account: TG-MCA01S018 Title: Computational Studies of Complex Processes in Biological Macromolecular Systems Resource: teragrid **** Local project name on dtf.anl.teragrid is TG-MCB010025N **** Allocation Period: 2007-08-03 to 2008-03-31 Name (Last First) or Account Total Remaining Usage ---------------------------- ---------- ------------ ---------- Kubal Michael 101880 SU 99358 SU 296 SU ---------------------------------------------------------------------- TG-MCA01S018 101880 SU 99358 SU 2522 SU On Feb 12, 2008, at 2:37 PM, Mihael Hategan wrote: > You should probably remove the line completely. > > Did you chose a default project on the login node with tgprojects? > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal wrote: >> I tried running with the account id removed from the >> sites.file as in the following line: >> >> >> >> but received the same error. >> >> >> >> --- Mihael Hategan wrote: >> >>> Is this the same for pre-WS GRAM? >>> >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin >>> wrote: >>>> that's right, qsub is used for PBS (and some >>> others too) >>>> bsub is LSF >>>> condor_q for condor >>>> ... >>>> >>>> -Stu >>>> >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael >>> Hategan wrote: >>>> >>>>> >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal >>> wrote: >>>>>> I'll give it a try. >>>>>> >>>>>> When using GRAM4, is qsub the method used to >>>>>> ultimately put the job in the queue? >>>>> >>>>> Looks like it. I also believe it's the case with >>> pre-ws gram. Stu >>>>> may be >>>>> able to clarify. >>>>> >>>>>> >>>>>> MikeK >>>>>> --- Mihael Hategan wrote: >>>>>> >>>>>>> While this doesn't solve the underlying >>> problem, it >>>>>>> may help you get >>>>>>> this to work: log into tg-login1.uc..., set >>> this >>>>>>> project as default, >>>>>>> then remove the project spec from the sites >>> file and >>>>>>> try again. >>>>>>> >>>>>>> Mihael >>>>>>> >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike Kubal >>> wrote: >>>>>>>> Yes, I believe you are right. The kickstart >>>>>>> message >>>>>>>> may be only a warning. After digging a little >>>>>>> deeper >>>>>>>> it appears the job is failing due to a >>>>>>> project/account >>>>>>>> id problem. I get the following error: >>>>>>>> >>>>>>>> Caused by: >>>>>>>> The executable could not be started., >>>>>>> qsub: >>>>>>>> Invalid Account MSG=invalid account >>>>>>>> >>>>>>>> I am specifying the same TG-account in my >>>>>>> site-file >>>>>>>> for the gram4 run that fails, as in the >>> site-file >>>>>>> for >>>>>>>> the pre-ws job that suceeds. This is the same >>>>>>> project, >>>>>>>> TG-MCA01S018, that is set in my >>>>>>> .tg_default_project >>>>>>>> file in ~kubal/ on the UC teragrid. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> --- Ben Clifford wrote: >>>>>>>> >>>>>>>>> yeah, run that same without kickstart. the >>> error >>>>>>>>> reported is that >>>>>>>>> kickstart didn't work right - but there's >>>>>>> perhaps >>>>>>>>> some underlying error. >>>>>>>>> -- >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>> >> _____________________________________________________________________ >> _______________ >>>>>>>> Never miss a thing. Make Yahoo your home >>> page. >>>>>>>> http://www.yahoo.com/r/hs >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>> >> _____________________________________________________________________ >> _______________ >>>>>> Be a better friend, newshound, and >>>>>> know-it-all with Yahoo! Mobile. Try it now. >>> >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >> >> >> >> >> _____________________________________________________________________ >> _______________ >> Looking for last minute shopping deals? >> Find them fast with Yahoo! Search. http://tools.search.yahoo.com/ >> newsearch/category.php?category=shopping >> > =================================================== joseph a. insley insley at mcs.anl.gov mathematics & computer science division (630) 252-5649 argonne national laboratory (630) 252-5986 (fax) -------------- next part -------------- An HTML attachment was scrubbed... URL: From mikekubal at yahoo.com Tue Feb 12 16:20:27 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 14:20:27 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <54A8DBC2-386E-4446-B29C-64952AD7B782@mcs.anl.gov> Message-ID: <341523.12842.qm@web52302.mail.re2.yahoo.com> Thanks Joe. This solved the account id problem. --- joseph insley wrote: > Mike K, > > looks like you have the wrong value in your > .tg_default_project file: > > insley at tg-viz-login1:~> more > ~kubal/.tg_default_project > TG-MCA01S018 > > you should be using: TG-MCB010025N > > insley at tg-viz-login1:~> tgusage -i -u kubal > > [snip] > > Account: TG-MCA01S018 > Title: Computational Studies of Complex Processes in > Biological > Macromolecular Systems > Resource: teragrid > > **** > Local project name on dtf.anl.teragrid is > TG-MCB010025N > **** > > Allocation Period: 2007-08-03 to 2008-03-31 > > Name (Last First) or Account Total > Remaining Usage > ---------------------------- ---------- > ------------ ---------- > Kubal Michael 101880 SU > 99358 SU 296 SU > ---------------------------------------------------------------------- > TG-MCA01S018 101880 SU > 99358 SU 2522 SU > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan wrote: > > > You should probably remove the line completely. > > > > Did you chose a default project on the login node > with tgprojects? > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > wrote: > >> I tried running with the account id removed from > the > >> sites.file as in the following line: > >> > >> > >> > >> but received the same error. > >> > >> > >> > >> --- Mihael Hategan wrote: > >> > >>> Is this the same for pre-WS GRAM? 
> >>> > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin > >>> wrote: > >>>> that's right, qsub is used for PBS (and some > >>> others too) > >>>> bsub is LSF > >>>> condor_q for condor > >>>> ... > >>>> > >>>> -Stu > >>>> > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > >>> Hategan wrote: > >>>> > >>>>> > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal > >>> wrote: > >>>>>> I'll give it a try. > >>>>>> > >>>>>> When using GRAM4, is qsub the method used to > >>>>>> ultimately put the job in the queue? > >>>>> > >>>>> Looks like it. I also believe it's the case > with > >>> pre-ws gram. Stu > >>>>> may be > >>>>> able to clarify. > >>>>> > >>>>>> > >>>>>> MikeK > >>>>>> --- Mihael Hategan > wrote: > >>>>>> > >>>>>>> While this doesn't solve the underlying > >>> problem, it > >>>>>>> may help you get > >>>>>>> this to work: log into tg-login1.uc..., set > >>> this > >>>>>>> project as default, > >>>>>>> then remove the project spec from the sites > >>> file and > >>>>>>> try again. > >>>>>>> > >>>>>>> Mihael > >>>>>>> > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > Kubal > >>> wrote: > >>>>>>>> Yes, I believe you are right. The kickstart > >>>>>>> message > >>>>>>>> may be only a warning. After digging a > little > >>>>>>> deeper > >>>>>>>> it appears the job is failing due to a > >>>>>>> project/account > >>>>>>>> id problem. I get the following error: > >>>>>>>> > >>>>>>>> Caused by: > >>>>>>>> The executable could not be > started., > >>>>>>> qsub: > >>>>>>>> Invalid Account MSG=invalid account > >>>>>>>> > >>>>>>>> I am specifying the same TG-account in my > >>>>>>> site-file > >>>>>>>> for the gram4 run that fails, as in the > >>> site-file > >>>>>>> for > >>>>>>>> the pre-ws job that suceeds. This is the > same > >>>>>>> project, > >>>>>>>> TG-MCA01S018, that is set in my > >>>>>>> .tg_default_project > >>>>>>>> file in ~kubal/ on the UC teragrid. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> --- Ben Clifford > wrote: > >>>>>>>> > >>>>>>>>> yeah, run that same without kickstart. the > >>> error > >>>>>>>>> reported is that > >>>>>>>>> kickstart didn't work right - but there's > >>>>>>> perhaps > >>>>>>>>> some underlying error. > >>>>>>>>> -- > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>> > >> > _____________________________________________________________________ > > >> _______________ > >>>>>>>> Never miss a thing. Make Yahoo your home > >>> page. > >>>>>>>> http://www.yahoo.com/r/hs > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>> > >> > _____________________________________________________________________ > > >> _______________ > >>>>>> Be a better friend, newshound, and > >>>>>> know-it-all with Yahoo! Mobile. Try it now. > >>> > >> > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > >>>>>> > >>>>> > >>>> > >>> > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> > >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > === message truncated === ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. 
http://tools.search.yahoo.com/newsearch/category.php?category=shopping From hategan at mcs.anl.gov Tue Feb 12 16:23:34 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 16:23:34 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <341523.12842.qm@web52302.mail.re2.yahoo.com> References: <341523.12842.qm@web52302.mail.re2.yahoo.com> Message-ID: <1202855014.23472.0.camel@blabla.mcs.anl.gov> Would it be worth trying to find out why it worked with pre-WS GRAM? Mihael On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: > Thanks Joe. This solved the account id problem. > > --- joseph insley wrote: > > > Mike K, > > > > looks like you have the wrong value in your > > .tg_default_project file: > > > > insley at tg-viz-login1:~> more > > ~kubal/.tg_default_project > > TG-MCA01S018 > > > > you should be using: TG-MCB010025N > > > > insley at tg-viz-login1:~> tgusage -i -u kubal > > > > [snip] > > > > Account: TG-MCA01S018 > > Title: Computational Studies of Complex Processes in > > Biological > > Macromolecular Systems > > Resource: teragrid > > > > **** > > Local project name on dtf.anl.teragrid is > > TG-MCB010025N > > **** > > > > Allocation Period: 2007-08-03 to 2008-03-31 > > > > Name (Last First) or Account Total > > Remaining Usage > > ---------------------------- ---------- > > ------------ ---------- > > Kubal Michael 101880 SU > > 99358 SU 296 SU > > > ---------------------------------------------------------------------- > > TG-MCA01S018 101880 SU > > 99358 SU 2522 SU > > > > > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan wrote: > > > > > You should probably remove the line completely. > > > > > > Did you chose a default project on the login node > > with tgprojects? > > > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > wrote: > > >> I tried running with the account id removed from > > the > > >> sites.file as in the following line: > > >> > > >> > > >> > > >> but received the same error. > > >> > > >> > > >> > > >> --- Mihael Hategan wrote: > > >> > > >>> Is this the same for pre-WS GRAM? > > >>> > > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin > > >>> wrote: > > >>>> that's right, qsub is used for PBS (and some > > >>> others too) > > >>>> bsub is LSF > > >>>> condor_q for condor > > >>>> ... > > >>>> > > >>>> -Stu > > >>>> > > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > >>> Hategan wrote: > > >>>> > > >>>>> > > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal > > >>> wrote: > > >>>>>> I'll give it a try. > > >>>>>> > > >>>>>> When using GRAM4, is qsub the method used to > > >>>>>> ultimately put the job in the queue? > > >>>>> > > >>>>> Looks like it. I also believe it's the case > > with > > >>> pre-ws gram. Stu > > >>>>> may be > > >>>>> able to clarify. > > >>>>> > > >>>>>> > > >>>>>> MikeK > > >>>>>> --- Mihael Hategan > > wrote: > > >>>>>> > > >>>>>>> While this doesn't solve the underlying > > >>> problem, it > > >>>>>>> may help you get > > >>>>>>> this to work: log into tg-login1.uc..., set > > >>> this > > >>>>>>> project as default, > > >>>>>>> then remove the project spec from the sites > > >>> file and > > >>>>>>> try again. > > >>>>>>> > > >>>>>>> Mihael > > >>>>>>> > > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > Kubal > > >>> wrote: > > >>>>>>>> Yes, I believe you are right. The kickstart > > >>>>>>> message > > >>>>>>>> may be only a warning. 
After digging a > > little > > >>>>>>> deeper > > >>>>>>>> it appears the job is failing due to a > > >>>>>>> project/account > > >>>>>>>> id problem. I get the following error: > > >>>>>>>> > > >>>>>>>> Caused by: > > >>>>>>>> The executable could not be > > started., > > >>>>>>> qsub: > > >>>>>>>> Invalid Account MSG=invalid account > > >>>>>>>> > > >>>>>>>> I am specifying the same TG-account in my > > >>>>>>> site-file > > >>>>>>>> for the gram4 run that fails, as in the > > >>> site-file > > >>>>>>> for > > >>>>>>>> the pre-ws job that suceeds. This is the > > same > > >>>>>>> project, > > >>>>>>>> TG-MCA01S018, that is set in my > > >>>>>>> .tg_default_project > > >>>>>>>> file in ~kubal/ on the UC teragrid. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> --- Ben Clifford > > wrote: > > >>>>>>>> > > >>>>>>>>> yeah, run that same without kickstart. the > > >>> error > > >>>>>>>>> reported is that > > >>>>>>>>> kickstart didn't work right - but there's > > >>>>>>> perhaps > > >>>>>>>>> some underlying error. > > >>>>>>>>> -- > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>> > > >> > > > _____________________________________________________________________ > > > > >> _______________ > > >>>>>>>> Never miss a thing. Make Yahoo your home > > >>> page. > > >>>>>>>> http://www.yahoo.com/r/hs > > >>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>> > > >> > > > _____________________________________________________________________ > > > > >> _______________ > > >>>>>> Be a better friend, newshound, and > > >>>>>> know-it-all with Yahoo! Mobile. Try it now. > > >>> > > >> > > > http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > >>>>>> > > >>>>> > > >>>> > > >>> > > >>> _______________________________________________ > > >>> Swift-devel mailing list > > >>> Swift-devel at ci.uchicago.edu > > >>> > > >> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > >>> > > > === message truncated === > > > > ____________________________________________________________________________________ > Looking for last minute shopping deals? > Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping > From mikekubal at yahoo.com Tue Feb 12 16:43:38 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 14:43:38 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202855014.23472.0.camel@blabla.mcs.anl.gov> Message-ID: <876972.28192.qm@web52307.mail.re2.yahoo.com> Just to be sure I tested with pre-WS and it worked also. --- Mihael Hategan wrote: > Would it be worth trying to find out why it worked > with pre-WS GRAM? > > Mihael > > On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: > > Thanks Joe. This solved the account id problem. 
> > > > --- joseph insley wrote: > > > > > Mike K, > > > > > > looks like you have the wrong value in your > > > .tg_default_project file: > > > > > > insley at tg-viz-login1:~> more > > > ~kubal/.tg_default_project > > > TG-MCA01S018 > > > > > > you should be using: TG-MCB010025N > > > > > > insley at tg-viz-login1:~> tgusage -i -u kubal > > > > > > [snip] > > > > > > Account: TG-MCA01S018 > > > Title: Computational Studies of Complex > Processes in > > > Biological > > > Macromolecular Systems > > > Resource: teragrid > > > > > > **** > > > Local project name on dtf.anl.teragrid is > > > TG-MCB010025N > > > **** > > > > > > Allocation Period: 2007-08-03 to 2008-03-31 > > > > > > Name (Last First) or Account Total > > > Remaining Usage > > > ---------------------------- ---------- > > > ------------ ---------- > > > Kubal Michael 101880 SU > > > 99358 SU 296 SU > > > > > > ---------------------------------------------------------------------- > > > TG-MCA01S018 101880 SU > > > 99358 SU 2522 SU > > > > > > > > > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan > wrote: > > > > > > > You should probably remove the line > completely. > > > > > > > > Did you chose a default project on the login > node > > > with tgprojects? > > > > > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > > wrote: > > > >> I tried running with the account id removed > from > > > the > > > >> sites.file as in the following line: > > > >> > > > >> > > > >> > > > >> but received the same error. > > > >> > > > >> > > > >> > > > >> --- Mihael Hategan > wrote: > > > >> > > > >>> Is this the same for pre-WS GRAM? > > > >>> > > > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > Martin > > > >>> wrote: > > > >>>> that's right, qsub is used for PBS (and > some > > > >>> others too) > > > >>>> bsub is LSF > > > >>>> condor_q for condor > > > >>>> ... > > > >>>> > > > >>>> -Stu > > > >>>> > > > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > > >>> Hategan wrote: > > > >>>> > > > >>>>> > > > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > Kubal > > > >>> wrote: > > > >>>>>> I'll give it a try. > > > >>>>>> > > > >>>>>> When using GRAM4, is qsub the method used > to > > > >>>>>> ultimately put the job in the queue? > > > >>>>> > > > >>>>> Looks like it. I also believe it's the > case > > > with > > > >>> pre-ws gram. Stu > > > >>>>> may be > > > >>>>> able to clarify. > > > >>>>> > > > >>>>>> > > > >>>>>> MikeK > > > >>>>>> --- Mihael Hategan > > > wrote: > > > >>>>>> > > > >>>>>>> While this doesn't solve the underlying > > > >>> problem, it > > > >>>>>>> may help you get > > > >>>>>>> this to work: log into tg-login1.uc..., > set > > > >>> this > > > >>>>>>> project as default, > > > >>>>>>> then remove the project spec from the > sites > > > >>> file and > > > >>>>>>> try again. > > > >>>>>>> > > > >>>>>>> Mihael > > > >>>>>>> > > > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > > Kubal > > > >>> wrote: > > > >>>>>>>> Yes, I believe you are right. The > kickstart > > > >>>>>>> message > > > >>>>>>>> may be only a warning. After digging a > > > little > > > >>>>>>> deeper > > > >>>>>>>> it appears the job is failing due to a > > > >>>>>>> project/account > > > >>>>>>>> id problem. 
I get the following error: > > > >>>>>>>> > > > >>>>>>>> Caused by: > > > >>>>>>>> The executable could not be > > > started., > > > >>>>>>> qsub: > > > >>>>>>>> Invalid Account MSG=invalid account > > > >>>>>>>> > > > >>>>>>>> I am specifying the same TG-account in > my > > > >>>>>>> site-file > > > >>>>>>>> for the gram4 run that fails, as in the > > > >>> site-file > > > >>>>>>> for > > > >>>>>>>> the pre-ws job that suceeds. This is > the > > > same > > > >>>>>>> project, > > > >>>>>>>> TG-MCA01S018, that is set in my > > > >>>>>>> .tg_default_project > > > >>>>>>>> file in ~kubal/ on the UC teragrid. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> --- Ben Clifford > > > wrote: > > > >>>>>>>> > > > >>>>>>>>> yeah, run that same without kickstart. > the > > > >>> error > > > >>>>>>>>> reported is that > > > >>>>>>>>> kickstart didn't work right - but > there's > > > >>>>>>> perhaps > > > >>>>>>>>> some underlying error. > > > >>>>>>>>> -- > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>> > > > >> > > > > > > _____________________________________________________________________ > > > > > > >> _______________ > === message truncated === ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ From insley at mcs.anl.gov Tue Feb 12 16:48:05 2008 From: insley at mcs.anl.gov (joseph insley) Date: Tue, 12 Feb 2008 16:48:05 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202855014.23472.0.camel@blabla.mcs.anl.gov> References: <341523.12842.qm@web52302.mail.re2.yahoo.com> <1202855014.23472.0.camel@blabla.mcs.anl.gov> Message-ID: If a project id is specified explicitly in the job description, that takes precedence over the default project. Could it be that the correct one was previously specified in the job request? joe. On Feb 12, 2008, at 4:23 PM, Mihael Hategan wrote: > Would it be worth trying to find out why it worked with pre-WS GRAM? > > Mihael > > On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: >> Thanks Joe. This solved the account id problem. >> >> --- joseph insley wrote: >> >>> Mike K, >>> >>> looks like you have the wrong value in your >>> .tg_default_project file: >>> >>> insley at tg-viz-login1:~> more >>> ~kubal/.tg_default_project >>> TG-MCA01S018 >>> >>> you should be using: TG-MCB010025N >>> >>> insley at tg-viz-login1:~> tgusage -i -u kubal >>> >>> [snip] >>> >>> Account: TG-MCA01S018 >>> Title: Computational Studies of Complex Processes in >>> Biological >>> Macromolecular Systems >>> Resource: teragrid >>> >>> **** >>> Local project name on dtf.anl.teragrid is >>> TG-MCB010025N >>> **** >>> >>> Allocation Period: 2007-08-03 to 2008-03-31 >>> >>> Name (Last First) or Account Total >>> Remaining Usage >>> ---------------------------- ---------- >>> ------------ ---------- >>> Kubal Michael 101880 SU >>> 99358 SU 296 SU >>> >> --------------------------------------------------------------------- >> - >>> TG-MCA01S018 101880 SU >>> 99358 SU 2522 SU >>> >>> >>> >>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan wrote: >>> >>>> You should probably remove the line completely. >>>> >>>> Did you chose a default project on the login node >>> with tgprojects? 
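For reference, the checks being discussed here come down to a few commands on the TeraGrid login node. This is a minimal sketch using the commands named in this thread; exact output varies by site, and whether tgprojects sets or merely lists the default is an assumption here.

  cat ~/.tg_default_project     # the project jobs are charged to by default
  tgusage -i -u $USER           # lists allocations; note the "Local project name" line
  tgprojects                    # inspect/choose the default project on this login node

  # If the file holds the grid-wide award number, point it at the local
  # project name instead (assuming the file is just the plain-text name,
  # as the 'more' output quoted above suggests):
  echo TG-MCB010025N > ~/.tg_default_project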
>>>> >>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal >>> wrote: >>>>> I tried running with the account id removed from >>> the >>>>> sites.file as in the following line: >>>>> >>>>> >>>>> >>>>> but received the same error. >>>>> >>>>> >>>>> >>>>> --- Mihael Hategan wrote: >>>>> >>>>>> Is this the same for pre-WS GRAM? >>>>>> >>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart Martin >>>>>> wrote: >>>>>>> that's right, qsub is used for PBS (and some >>>>>> others too) >>>>>>> bsub is LSF >>>>>>> condor_q for condor >>>>>>> ... >>>>>>> >>>>>>> -Stu >>>>>>> >>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael >>>>>> Hategan wrote: >>>>>>> >>>>>>>> >>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike Kubal >>>>>> wrote: >>>>>>>>> I'll give it a try. >>>>>>>>> >>>>>>>>> When using GRAM4, is qsub the method used to >>>>>>>>> ultimately put the job in the queue? >>>>>>>> >>>>>>>> Looks like it. I also believe it's the case >>> with >>>>>> pre-ws gram. Stu >>>>>>>> may be >>>>>>>> able to clarify. >>>>>>>> >>>>>>>>> >>>>>>>>> MikeK >>>>>>>>> --- Mihael Hategan >>> wrote: >>>>>>>>> >>>>>>>>>> While this doesn't solve the underlying >>>>>> problem, it >>>>>>>>>> may help you get >>>>>>>>>> this to work: log into tg-login1.uc..., set >>>>>> this >>>>>>>>>> project as default, >>>>>>>>>> then remove the project spec from the sites >>>>>> file and >>>>>>>>>> try again. >>>>>>>>>> >>>>>>>>>> Mihael >>>>>>>>>> >>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike >>> Kubal >>>>>> wrote: >>>>>>>>>>> Yes, I believe you are right. The kickstart >>>>>>>>>> message >>>>>>>>>>> may be only a warning. After digging a >>> little >>>>>>>>>> deeper >>>>>>>>>>> it appears the job is failing due to a >>>>>>>>>> project/account >>>>>>>>>>> id problem. I get the following error: >>>>>>>>>>> >>>>>>>>>>> Caused by: >>>>>>>>>>> The executable could not be >>> started., >>>>>>>>>> qsub: >>>>>>>>>>> Invalid Account MSG=invalid account >>>>>>>>>>> >>>>>>>>>>> I am specifying the same TG-account in my >>>>>>>>>> site-file >>>>>>>>>>> for the gram4 run that fails, as in the >>>>>> site-file >>>>>>>>>> for >>>>>>>>>>> the pre-ws job that suceeds. This is the >>> same >>>>>>>>>> project, >>>>>>>>>>> TG-MCA01S018, that is set in my >>>>>>>>>> .tg_default_project >>>>>>>>>>> file in ~kubal/ on the UC teragrid. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> --- Ben Clifford >>> wrote: >>>>>>>>>>> >>>>>>>>>>>> yeah, run that same without kickstart. the >>>>>> error >>>>>>>>>>>> reported is that >>>>>>>>>>>> kickstart didn't work right - but there's >>>>>>>>>> perhaps >>>>>>>>>>>> some underlying error. >>>>>>>>>>>> -- >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>> >>>>> >>> >> _____________________________________________________________________ >>> >>>>> _______________ >>>>>>>>>>> Never miss a thing. Make Yahoo your home >>>>>> page. >>>>>>>>>>> http://www.yahoo.com/r/hs >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>> >>>>> >>> >> _____________________________________________________________________ >>> >>>>> _______________ >>>>>>>>> Be a better friend, newshound, and >>>>>>>>> know-it-all with Yahoo! Mobile. Try it now. 
>>>>>> >>>>> >>> >> http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-devel mailing list >>>>>> Swift-devel at ci.uchicago.edu >>>>>> >>>>> >>> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>>>>> >>> >> === message truncated === >> >> >> >> >> _____________________________________________________________________ >> _______________ >> Looking for last minute shopping deals? >> Find them fast with Yahoo! Search. http://tools.search.yahoo.com/ >> newsearch/category.php?category=shopping >> > From wilde at mcs.anl.gov Tue Feb 12 16:51:31 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 Feb 2008 16:51:31 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <876972.28192.qm@web52307.mail.re2.yahoo.com> References: <876972.28192.qm@web52307.mail.re2.yahoo.com> Message-ID: <47B222F3.3010709@mcs.anl.gov> Mike, did you do a recent test with pre-WS-GRAM with the .tg_default_project file set *incorrectly*? I think the puzzle was why this would cause WS-GRAM to fail but not pre-WS-GRAM, as it would seem they would both get the TG account to use in the same manner. - mikew On 2/12/08 4:43 PM, Mike Kubal wrote: > Just to be sure I tested with pre-WS and it worked > also. > > --- Mihael Hategan wrote: > >> Would it be worth trying to find out why it worked >> with pre-WS GRAM? >> >> Mihael >> >> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: >>> Thanks Joe. This solved the account id problem. >>> >>> --- joseph insley wrote: >>> >>>> Mike K, >>>> >>>> looks like you have the wrong value in your >>>> .tg_default_project file: >>>> >>>> insley at tg-viz-login1:~> more >>>> ~kubal/.tg_default_project >>>> TG-MCA01S018 >>>> >>>> you should be using: TG-MCB010025N >>>> >>>> insley at tg-viz-login1:~> tgusage -i -u kubal >>>> >>>> [snip] >>>> >>>> Account: TG-MCA01S018 >>>> Title: Computational Studies of Complex >> Processes in >>>> Biological >>>> Macromolecular Systems >>>> Resource: teragrid >>>> >>>> **** >>>> Local project name on dtf.anl.teragrid is >>>> TG-MCB010025N >>>> **** >>>> >>>> Allocation Period: 2007-08-03 to 2008-03-31 >>>> >>>> Name (Last First) or Account Total >>>> Remaining Usage >>>> ---------------------------- ---------- >>>> ------------ ---------- >>>> Kubal Michael 101880 SU >>>> 99358 SU 296 SU >>>> > ---------------------------------------------------------------------- >>>> TG-MCA01S018 101880 SU >>>> 99358 SU 2522 SU >>>> >>>> >>>> >>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan >> wrote: >>>>> You should probably remove the line >> completely. >>>>> Did you chose a default project on the login >> node >>>> with tgprojects? >>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal >>>> wrote: >>>>>> I tried running with the account id removed >> from >>>> the >>>>>> sites.file as in the following line: >>>>>> >>>>>> >>>>>> >>>>>> but received the same error. >>>>>> >>>>>> >>>>>> >>>>>> --- Mihael Hategan >> wrote: >>>>>>> Is this the same for pre-WS GRAM? >>>>>>> >>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart >> Martin >>>>>>> wrote: >>>>>>>> that's right, qsub is used for PBS (and >> some >>>>>>> others too) >>>>>>>> bsub is LSF >>>>>>>> condor_q for condor >>>>>>>> ... >>>>>>>> >>>>>>>> -Stu >>>>>>>> >>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael >>>>>>> Hategan wrote: >>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike >> Kubal >>>>>>> wrote: >>>>>>>>>> I'll give it a try. 
>>>>>>>>>> >>>>>>>>>> When using GRAM4, is qsub the method used >> to >>>>>>>>>> ultimately put the job in the queue? >>>>>>>>> Looks like it. I also believe it's the >> case >>>> with >>>>>>> pre-ws gram. Stu >>>>>>>>> may be >>>>>>>>> able to clarify. >>>>>>>>> >>>>>>>>>> MikeK >>>>>>>>>> --- Mihael Hategan >>>> wrote: >>>>>>>>>>> While this doesn't solve the underlying >>>>>>> problem, it >>>>>>>>>>> may help you get >>>>>>>>>>> this to work: log into tg-login1.uc..., >> set >>>>>>> this >>>>>>>>>>> project as default, >>>>>>>>>>> then remove the project spec from the >> sites >>>>>>> file and >>>>>>>>>>> try again. >>>>>>>>>>> >>>>>>>>>>> Mihael >>>>>>>>>>> >>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike >>>> Kubal >>>>>>> wrote: >>>>>>>>>>>> Yes, I believe you are right. The >> kickstart >>>>>>>>>>> message >>>>>>>>>>>> may be only a warning. After digging a >>>> little >>>>>>>>>>> deeper >>>>>>>>>>>> it appears the job is failing due to a >>>>>>>>>>> project/account >>>>>>>>>>>> id problem. I get the following error: >>>>>>>>>>>> >>>>>>>>>>>> Caused by: >>>>>>>>>>>> The executable could not be >>>> started., >>>>>>>>>>> qsub: >>>>>>>>>>>> Invalid Account MSG=invalid account >>>>>>>>>>>> >>>>>>>>>>>> I am specifying the same TG-account in >> my >>>>>>>>>>> site-file >>>>>>>>>>>> for the gram4 run that fails, as in the >>>>>>> site-file >>>>>>>>>>> for >>>>>>>>>>>> the pre-ws job that suceeds. This is >> the >>>> same >>>>>>>>>>> project, >>>>>>>>>>>> TG-MCA01S018, that is set in my >>>>>>>>>>> .tg_default_project >>>>>>>>>>>> file in ~kubal/ on the UC teragrid. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> --- Ben Clifford >>>> wrote: >>>>>>>>>>>>> yeah, run that same without kickstart. >> the >>>>>>> error >>>>>>>>>>>>> reported is that >>>>>>>>>>>>> kickstart didn't work right - but >> there's >>>>>>>>>>> perhaps >>>>>>>>>>>>> some underlying error. >>>>>>>>>>>>> -- >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> > _____________________________________________________________________ >>>>>> _______________ > === message truncated === > > > > ____________________________________________________________________________________ > Be a better friend, newshound, and > know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From mikekubal at yahoo.com Tue Feb 12 17:00:28 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 15:00:28 -0800 (PST) Subject: [Swift-devel] next hurdle In-Reply-To: <1202855014.23472.0.camel@blabla.mcs.anl.gov> Message-ID: <236032.16528.qm@web52307.mail.re2.yahoo.com> One of the applications (antechamber) being launched by swift on the uc-teragrid is failing with an exit code of 1 and a message of 'cannot execute binary' file. It sounds like it might be attempting to run on one of the 32-bit nodes, though in my tc-file, it specifies to run only on the 64-bit nodes. The only difference between a successful run and the error above are the lines below in from the sites-file: (with this line I get the error above) (with this line instead the job succeeds) I rsynced the log and kickstart files to /home/benc/swift-logs at UC. 
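As background, restricting a transformation to a particular platform is done in the tc-file (tc.data), whose whitespace-separated columns are: site handle, transformation name, path to the executable, installation type, platform string, and profile entries. The entry below is purely illustrative; the site handle, path, and platform string are made up, not taken from the actual file in use here.

  #site   transformation  pfn                           type       platform         profiles
  UC-64   antechamber     /home/kubal/apps/antechamber  INSTALLED  INTEL64::LINUX   null

The platform column only matters if the site entry it is matched against advertises the corresponding sysinfo, which is why a sites-file entry that ends up forking on the wrong node can still defeat it.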
Cheers, Mike --- Mihael Hategan wrote: > Would it be worth trying to find out why it worked > with pre-WS GRAM? > > Mihael > > On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: > > Thanks Joe. This solved the account id problem. > > > > --- joseph insley wrote: > > > > > Mike K, > > > > > > looks like you have the wrong value in your > > > .tg_default_project file: > > > > > > insley at tg-viz-login1:~> more > > > ~kubal/.tg_default_project > > > TG-MCA01S018 > > > > > > you should be using: TG-MCB010025N > > > > > > insley at tg-viz-login1:~> tgusage -i -u kubal > > > > > > [snip] > > > > > > Account: TG-MCA01S018 > > > Title: Computational Studies of Complex > Processes in > > > Biological > > > Macromolecular Systems > > > Resource: teragrid > > > > > > **** > > > Local project name on dtf.anl.teragrid is > > > TG-MCB010025N > > > **** > > > > > > Allocation Period: 2007-08-03 to 2008-03-31 > > > > > > Name (Last First) or Account Total > > > Remaining Usage > > > ---------------------------- ---------- > > > ------------ ---------- > > > Kubal Michael 101880 SU > > > 99358 SU 296 SU > > > > > > ---------------------------------------------------------------------- > > > TG-MCA01S018 101880 SU > > > 99358 SU 2522 SU > > > > > > > > > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan > wrote: > > > > > > > You should probably remove the line > completely. > > > > > > > > Did you chose a default project on the login > node > > > with tgprojects? > > > > > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > > wrote: > > > >> I tried running with the account id removed > from > > > the > > > >> sites.file as in the following line: > > > >> > > > >> > > > >> > > > >> but received the same error. > > > >> > > > >> > > > >> > > > >> --- Mihael Hategan > wrote: > > > >> > > > >>> Is this the same for pre-WS GRAM? > > > >>> > > > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > Martin > > > >>> wrote: > > > >>>> that's right, qsub is used for PBS (and > some > > > >>> others too) > > > >>>> bsub is LSF > > > >>>> condor_q for condor > > > >>>> ... > > > >>>> > > > >>>> -Stu > > > >>>> > > > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > > >>> Hategan wrote: > > > >>>> > > > >>>>> > > > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > Kubal > > > >>> wrote: > > > >>>>>> I'll give it a try. > > > >>>>>> > > > >>>>>> When using GRAM4, is qsub the method used > to > > > >>>>>> ultimately put the job in the queue? > > > >>>>> > > > >>>>> Looks like it. I also believe it's the > case > > > with > > > >>> pre-ws gram. Stu > > > >>>>> may be > > > >>>>> able to clarify. > > > >>>>> > > > >>>>>> > > > >>>>>> MikeK > > > >>>>>> --- Mihael Hategan > > > wrote: > > > >>>>>> > > > >>>>>>> While this doesn't solve the underlying > > > >>> problem, it > > > >>>>>>> may help you get > > > >>>>>>> this to work: log into tg-login1.uc..., > set > > > >>> this > > > >>>>>>> project as default, > > > >>>>>>> then remove the project spec from the > sites > > > >>> file and > > > >>>>>>> try again. > > > >>>>>>> > > > >>>>>>> Mihael > > > >>>>>>> > > > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > > Kubal > > > >>> wrote: > > > >>>>>>>> Yes, I believe you are right. The > kickstart > > > >>>>>>> message > > > >>>>>>>> may be only a warning. After digging a > > > little > > > >>>>>>> deeper > > > >>>>>>>> it appears the job is failing due to a > > > >>>>>>> project/account > > > >>>>>>>> id problem. 
I get the following error: > > > >>>>>>>> > > > >>>>>>>> Caused by: > > > >>>>>>>> The executable could not be > > > started., > > > >>>>>>> qsub: > > > >>>>>>>> Invalid Account MSG=invalid account > > > >>>>>>>> > > > >>>>>>>> I am specifying the same TG-account in > my > > > >>>>>>> site-file > > > >>>>>>>> for the gram4 run that fails, as in the > > > >>> site-file > > > >>>>>>> for > > > >>>>>>>> the pre-ws job that suceeds. This is > the > > > same > > > >>>>>>> project, > > > >>>>>>>> TG-MCA01S018, that is set in my > > > >>>>>>> .tg_default_project > > > >>>>>>>> file in ~kubal/ on the UC teragrid. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> --- Ben Clifford > > > wrote: > > > >>>>>>>> > > > >>>>>>>>> yeah, run that same without kickstart. > the > > > >>> error > > > >>>>>>>>> reported is that > > > >>>>>>>>> kickstart didn't work right - but > there's > > > >>>>>>> perhaps > > > >>>>>>>>> some underlying error. > > > >>>>>>>>> -- > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>> > > > >> > > > > > > _____________________________________________________________________ > > > > > > >> _______________ > === message truncated === ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs From hategan at mcs.anl.gov Tue Feb 12 17:06:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 17:06:44 -0600 Subject: [Swift-devel] Re: next hurdle In-Reply-To: <236032.16528.qm@web52307.mail.re2.yahoo.com> References: <236032.16528.qm@web52307.mail.re2.yahoo.com> Message-ID: <1202857604.26548.1.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 15:00 -0800, Mike Kubal wrote: > One of the applications (antechamber) being launched > by swift on the uc-teragrid is failing with an exit > code of 1 and a message of 'cannot execute binary' > file. It sounds like it might be attempting to run on > one of the 32-bit nodes, though in my tc-file, it > specifies to run only on the 64-bit nodes. > > The only difference between a successful run and the > error above are the lines below in from the > sites-file: > > (with this line I get the error above) > url="tg-grid1.uc.teragrid.org" /> > > > (with this line instead the job succeeds) > url="tg-grid1.uc.teragrid.org" major="4" minor="0" > patch="0"/> That's running with fork on the head node. > > I rsynced the log and kickstart files to > /home/benc/swift-logs at UC. > > Cheers, > > Mike > > --- Mihael Hategan wrote: > > > Would it be worth trying to find out why it worked > > with pre-WS GRAM? > > > > Mihael > > > > On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal wrote: > > > Thanks Joe. This solved the account id problem. 
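A quick way to tell whether jobs from a given sites-file entry really went through PBS or just forked on the gatekeeper is to look at both the queue and the process table on that node while a workflow is running. A rough sketch; qstat is standard PBS, and the grep target is only an example application name.

  qstat -u $USER                   # PBS-routed jobs show up in the batch queue
  ps -fu $USER | grep antechamber  # fork-routed jobs show the application itself
                                   # running on the head node

Seeing the application (rather than just globus-job-manager processes) on the head node is the tell-tale sign of the fork case described above.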
> > > > > > --- joseph insley wrote: > > > > > > > Mike K, > > > > > > > > looks like you have the wrong value in your > > > > .tg_default_project file: > > > > > > > > insley at tg-viz-login1:~> more > > > > ~kubal/.tg_default_project > > > > TG-MCA01S018 > > > > > > > > you should be using: TG-MCB010025N > > > > > > > > insley at tg-viz-login1:~> tgusage -i -u kubal > > > > > > > > [snip] > > > > > > > > Account: TG-MCA01S018 > > > > Title: Computational Studies of Complex > > Processes in > > > > Biological > > > > Macromolecular Systems > > > > Resource: teragrid > > > > > > > > **** > > > > Local project name on dtf.anl.teragrid is > > > > TG-MCB010025N > > > > **** > > > > > > > > Allocation Period: 2007-08-03 to 2008-03-31 > > > > > > > > Name (Last First) or Account Total > > > > Remaining Usage > > > > ---------------------------- ---------- > > > > ------------ ---------- > > > > Kubal Michael 101880 SU > > > > 99358 SU 296 SU > > > > > > > > > > ---------------------------------------------------------------------- > > > > TG-MCA01S018 101880 SU > > > > 99358 SU 2522 SU > > > > > > > > > > > > > > > > On Feb 12, 2008, at 2:37 PM, Mihael Hategan > > wrote: > > > > > > > > > You should probably remove the line > > completely. > > > > > > > > > > Did you chose a default project on the login > > node > > > > with tgprojects? > > > > > > > > > > On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > > > wrote: > > > > >> I tried running with the account id removed > > from > > > > the > > > > >> sites.file as in the following line: > > > > >> > > > > >> > > > > >> > > > > >> but received the same error. > > > > >> > > > > >> > > > > >> > > > > >> --- Mihael Hategan > > wrote: > > > > >> > > > > >>> Is this the same for pre-WS GRAM? > > > > >>> > > > > >>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > > Martin > > > > >>> wrote: > > > > >>>> that's right, qsub is used for PBS (and > > some > > > > >>> others too) > > > > >>>> bsub is LSF > > > > >>>> condor_q for condor > > > > >>>> ... > > > > >>>> > > > > >>>> -Stu > > > > >>>> > > > > >>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > > > >>> Hategan wrote: > > > > >>>> > > > > >>>>> > > > > >>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > > Kubal > > > > >>> wrote: > > > > >>>>>> I'll give it a try. > > > > >>>>>> > > > > >>>>>> When using GRAM4, is qsub the method used > > to > > > > >>>>>> ultimately put the job in the queue? > > > > >>>>> > > > > >>>>> Looks like it. I also believe it's the > > case > > > > with > > > > >>> pre-ws gram. Stu > > > > >>>>> may be > > > > >>>>> able to clarify. > > > > >>>>> > > > > >>>>>> > > > > >>>>>> MikeK > > > > >>>>>> --- Mihael Hategan > > > > wrote: > > > > >>>>>> > > > > >>>>>>> While this doesn't solve the underlying > > > > >>> problem, it > > > > >>>>>>> may help you get > > > > >>>>>>> this to work: log into tg-login1.uc..., > > set > > > > >>> this > > > > >>>>>>> project as default, > > > > >>>>>>> then remove the project spec from the > > sites > > > > >>> file and > > > > >>>>>>> try again. > > > > >>>>>>> > > > > >>>>>>> Mihael > > > > >>>>>>> > > > > >>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > > > Kubal > > > > >>> wrote: > > > > >>>>>>>> Yes, I believe you are right. The > > kickstart > > > > >>>>>>> message > > > > >>>>>>>> may be only a warning. After digging a > > > > little > > > > >>>>>>> deeper > > > > >>>>>>>> it appears the job is failing due to a > > > > >>>>>>> project/account > > > > >>>>>>>> id problem. 
I get the following error: > > > > >>>>>>>> > > > > >>>>>>>> Caused by: > > > > >>>>>>>> The executable could not be > > > > started., > > > > >>>>>>> qsub: > > > > >>>>>>>> Invalid Account MSG=invalid account > > > > >>>>>>>> > > > > >>>>>>>> I am specifying the same TG-account in > > my > > > > >>>>>>> site-file > > > > >>>>>>>> for the gram4 run that fails, as in the > > > > >>> site-file > > > > >>>>>>> for > > > > >>>>>>>> the pre-ws job that suceeds. This is > > the > > > > same > > > > >>>>>>> project, > > > > >>>>>>>> TG-MCA01S018, that is set in my > > > > >>>>>>> .tg_default_project > > > > >>>>>>>> file in ~kubal/ on the UC teragrid. > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> --- Ben Clifford > > > > wrote: > > > > >>>>>>>> > > > > >>>>>>>>> yeah, run that same without kickstart. > > the > > > > >>> error > > > > >>>>>>>>> reported is that > > > > >>>>>>>>> kickstart didn't work right - but > > there's > > > > >>>>>>> perhaps > > > > >>>>>>>>> some underlying error. > > > > >>>>>>>>> -- > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>> > > > > >> > > > > > > > > > > _____________________________________________________________________ > > > > > > > > >> _______________ > > > === message truncated === > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > From mikekubal at yahoo.com Tue Feb 12 17:14:21 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Tue, 12 Feb 2008 15:14:21 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <47B222F3.3010709@mcs.anl.gov> Message-ID: <283035.8845.qm@web52304.mail.re2.yahoo.com> With pre-WS-GRAM, it doesn't seem to matter which account/project id I use where. I can have the TG-MCB010025N specified in the sites-files and TG-MCA01S018 specified in ~kubal/.tg-default_project on the uc teragrid, or vice versa and it still works, or having them match in both places. With WS-GRAM, I have to use TG-MCB010025N, the local uc-teragrid project id, in both places. Using TG-MCA01S018, the teragrid wide charge number/account number, causes the qsub failure error. --- Michael Wilde wrote: > Mike, did you do a recent test with pre-WS-GRAM with > the > .tg_default_project file set *incorrectly*? > > I think the puzzle was why this would cause WS-GRAM > to fail but not > pre-WS-GRAM, as it would seem they would both get > the TG account to use > in the same manner. > > - mikew > > On 2/12/08 4:43 PM, Mike Kubal wrote: > > Just to be sure I tested with pre-WS and it worked > > also. > > > > --- Mihael Hategan wrote: > > > >> Would it be worth trying to find out why it > worked > >> with pre-WS GRAM? > >> > >> Mihael > >> > >> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal > wrote: > >>> Thanks Joe. This solved the account id problem. 
> >>> > >>> --- joseph insley wrote: > >>> > >>>> Mike K, > >>>> > >>>> looks like you have the wrong value in your > >>>> .tg_default_project file: > >>>> > >>>> insley at tg-viz-login1:~> more > >>>> ~kubal/.tg_default_project > >>>> TG-MCA01S018 > >>>> > >>>> you should be using: TG-MCB010025N > >>>> > >>>> insley at tg-viz-login1:~> tgusage -i -u kubal > >>>> > >>>> [snip] > >>>> > >>>> Account: TG-MCA01S018 > >>>> Title: Computational Studies of Complex > >> Processes in > >>>> Biological > >>>> Macromolecular Systems > >>>> Resource: teragrid > >>>> > >>>> **** > >>>> Local project name on dtf.anl.teragrid is > >>>> TG-MCB010025N > >>>> **** > >>>> > >>>> Allocation Period: 2007-08-03 to 2008-03-31 > >>>> > >>>> Name (Last First) or Account Total > >>>> Remaining Usage > >>>> ---------------------------- ---------- > >>>> ------------ ---------- > >>>> Kubal Michael 101880 SU > >>>> 99358 SU 296 SU > >>>> > > > ---------------------------------------------------------------------- > >>>> TG-MCA01S018 101880 SU > >>>> 99358 SU 2522 SU > >>>> > >>>> > >>>> > >>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan > >> wrote: > >>>>> You should probably remove the line > >> completely. > >>>>> Did you chose a default project on the login > >> node > >>>> with tgprojects? > >>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > >>>> wrote: > >>>>>> I tried running with the account id removed > >> from > >>>> the > >>>>>> sites.file as in the following line: > >>>>>> > >>>>>> > >>>>>> > >>>>>> but received the same error. > >>>>>> > >>>>>> > >>>>>> > >>>>>> --- Mihael Hategan > >> wrote: > >>>>>>> Is this the same for pre-WS GRAM? > >>>>>>> > >>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > >> Martin > >>>>>>> wrote: > >>>>>>>> that's right, qsub is used for PBS (and > >> some > >>>>>>> others too) > >>>>>>>> bsub is LSF > >>>>>>>> condor_q for condor > >>>>>>>> ... > >>>>>>>> > >>>>>>>> -Stu > >>>>>>>> > >>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > >>>>>>> Hategan wrote: > >>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > >> Kubal > >>>>>>> wrote: > >>>>>>>>>> I'll give it a try. > >>>>>>>>>> > >>>>>>>>>> When using GRAM4, is qsub the method used > >> to > >>>>>>>>>> ultimately put the job in the queue? > >>>>>>>>> Looks like it. I also believe it's the > >> case > >>>> with > >>>>>>> pre-ws gram. Stu > >>>>>>>>> may be > >>>>>>>>> able to clarify. > >>>>>>>>> > >>>>>>>>>> MikeK > >>>>>>>>>> --- Mihael Hategan > >>>> wrote: > >>>>>>>>>>> While this doesn't solve the underlying > >>>>>>> problem, it > >>>>>>>>>>> may help you get > >>>>>>>>>>> this to work: log into tg-login1.uc..., > >> set > >>>>>>> this > >>>>>>>>>>> project as default, > >>>>>>>>>>> then remove the project spec from the > >> sites > >>>>>>> file and > >>>>>>>>>>> try again. > >>>>>>>>>>> > >>>>>>>>>>> Mihael > >>>>>>>>>>> > >>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > >>>> Kubal > >>>>>>> wrote: > >>>>>>>>>>>> Yes, I believe you are right. The > >> kickstart > >>>>>>>>>>> message > >>>>>>>>>>>> may be only a warning. After digging a > >>>> little > >>>>>>>>>>> deeper > >>>>>>>>>>>> it appears the job is failing due to a > >>>>>>>>>>> project/account > >>>>>>>>>>>> id problem. 
I get the following error: > >>>>>>>>>>>> > >>>>>>>>>>>> Caused by: > >>>>>>>>>>>> The executable could not be > >>>> started., > >>>>>>>>>>> qsub: > >>>>>>>>>>>> Invalid Account MSG=invalid account > >>>>>>>>>>>> > >>>>>>>>>>>> I am specifying the same TG-account in > >> my > >>>>>>>>>>> site-file > >>>>>>>>>>>> for the gram4 run that fails, as in the > >>>>>>> site-file > >>>>>>>>>>> for > >>>>>>>>>>>> the pre-ws job that suceeds. This is > >> the > >>>> same > >>>>>>>>>>> project, > >>>>>>>>>>>> TG-MCA01S018, that is set in my > >>>>>>>>>>> .tg_default_project > >>>>>>>>>>>> file in ~kubal/ on the UC teragrid. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> --- Ben Clifford > >>>> wrote: > >>>>>>>>>>>>> yeah, run that same without kickstart. > >> the > >>>>>>> error > >>>>>>>>>>>>> reported is that > >>>>>>>>>>>>> kickstart didn't work right - but > >> there's > >>>>>>>>>>> perhaps > >>>>>>>>>>>>> some underlying error. > >>>>>>>>>>>>> -- > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > === message truncated === ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping From hategan at mcs.anl.gov Tue Feb 12 17:17:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 17:17:12 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <283035.8845.qm@web52304.mail.re2.yahoo.com> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> Message-ID: <1202858233.27191.0.camel@blabla.mcs.anl.gov> Are you sure you were using PBS with pre-ws GRAM and not fork? On Tue, 2008-02-12 at 15:14 -0800, Mike Kubal wrote: > With pre-WS-GRAM, it doesn't seem to matter which > account/project id I use where. I can have the > TG-MCB010025N specified in the sites-files and > TG-MCA01S018 specified in ~kubal/.tg-default_project > on the uc teragrid, or vice versa and it still works, > or having them match in both places. > > With WS-GRAM, I have to use TG-MCB010025N, the local > uc-teragrid project id, in both places. Using > TG-MCA01S018, the teragrid wide charge number/account > number, causes the qsub failure error. > > > > > --- Michael Wilde wrote: > > > Mike, did you do a recent test with pre-WS-GRAM with > > the > > .tg_default_project file set *incorrectly*? > > > > I think the puzzle was why this would cause WS-GRAM > > to fail but not > > pre-WS-GRAM, as it would seem they would both get > > the TG account to use > > in the same manner. > > > > - mikew > > > > On 2/12/08 4:43 PM, Mike Kubal wrote: > > > Just to be sure I tested with pre-WS and it worked > > > also. > > > > > > --- Mihael Hategan wrote: > > > > > >> Would it be worth trying to find out why it > > worked > > >> with pre-WS GRAM? > > >> > > >> Mihael > > >> > > >> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal > > wrote: > > >>> Thanks Joe. This solved the account id problem. 
> > >>> > > >>> --- joseph insley wrote: > > >>> > > >>>> Mike K, > > >>>> > > >>>> looks like you have the wrong value in your > > >>>> .tg_default_project file: > > >>>> > > >>>> insley at tg-viz-login1:~> more > > >>>> ~kubal/.tg_default_project > > >>>> TG-MCA01S018 > > >>>> > > >>>> you should be using: TG-MCB010025N > > >>>> > > >>>> insley at tg-viz-login1:~> tgusage -i -u kubal > > >>>> > > >>>> [snip] > > >>>> > > >>>> Account: TG-MCA01S018 > > >>>> Title: Computational Studies of Complex > > >> Processes in > > >>>> Biological > > >>>> Macromolecular Systems > > >>>> Resource: teragrid > > >>>> > > >>>> **** > > >>>> Local project name on dtf.anl.teragrid is > > >>>> TG-MCB010025N > > >>>> **** > > >>>> > > >>>> Allocation Period: 2007-08-03 to 2008-03-31 > > >>>> > > >>>> Name (Last First) or Account Total > > >>>> Remaining Usage > > >>>> ---------------------------- ---------- > > >>>> ------------ ---------- > > >>>> Kubal Michael 101880 SU > > >>>> 99358 SU 296 SU > > >>>> > > > > > > ---------------------------------------------------------------------- > > >>>> TG-MCA01S018 101880 SU > > >>>> 99358 SU 2522 SU > > >>>> > > >>>> > > >>>> > > >>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan > > >> wrote: > > >>>>> You should probably remove the line > > >> completely. > > >>>>> Did you chose a default project on the login > > >> node > > >>>> with tgprojects? > > >>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > > >>>> wrote: > > >>>>>> I tried running with the account id removed > > >> from > > >>>> the > > >>>>>> sites.file as in the following line: > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> but received the same error. > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> --- Mihael Hategan > > >> wrote: > > >>>>>>> Is this the same for pre-WS GRAM? > > >>>>>>> > > >>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > > >> Martin > > >>>>>>> wrote: > > >>>>>>>> that's right, qsub is used for PBS (and > > >> some > > >>>>>>> others too) > > >>>>>>>> bsub is LSF > > >>>>>>>> condor_q for condor > > >>>>>>>> ... > > >>>>>>>> > > >>>>>>>> -Stu > > >>>>>>>> > > >>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > > >>>>>>> Hategan wrote: > > >>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > > >> Kubal > > >>>>>>> wrote: > > >>>>>>>>>> I'll give it a try. > > >>>>>>>>>> > > >>>>>>>>>> When using GRAM4, is qsub the method used > > >> to > > >>>>>>>>>> ultimately put the job in the queue? > > >>>>>>>>> Looks like it. I also believe it's the > > >> case > > >>>> with > > >>>>>>> pre-ws gram. Stu > > >>>>>>>>> may be > > >>>>>>>>> able to clarify. > > >>>>>>>>> > > >>>>>>>>>> MikeK > > >>>>>>>>>> --- Mihael Hategan > > >>>> wrote: > > >>>>>>>>>>> While this doesn't solve the underlying > > >>>>>>> problem, it > > >>>>>>>>>>> may help you get > > >>>>>>>>>>> this to work: log into tg-login1.uc..., > > >> set > > >>>>>>> this > > >>>>>>>>>>> project as default, > > >>>>>>>>>>> then remove the project spec from the > > >> sites > > >>>>>>> file and > > >>>>>>>>>>> try again. > > >>>>>>>>>>> > > >>>>>>>>>>> Mihael > > >>>>>>>>>>> > > >>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > > >>>> Kubal > > >>>>>>> wrote: > > >>>>>>>>>>>> Yes, I believe you are right. The > > >> kickstart > > >>>>>>>>>>> message > > >>>>>>>>>>>> may be only a warning. After digging a > > >>>> little > > >>>>>>>>>>> deeper > > >>>>>>>>>>>> it appears the job is failing due to a > > >>>>>>>>>>> project/account > > >>>>>>>>>>>> id problem. 
I get the following error: > > >>>>>>>>>>>> > > >>>>>>>>>>>> Caused by: > > >>>>>>>>>>>> The executable could not be > > >>>> started., > > >>>>>>>>>>> qsub: > > >>>>>>>>>>>> Invalid Account MSG=invalid account > > >>>>>>>>>>>> > > >>>>>>>>>>>> I am specifying the same TG-account in > > >> my > > >>>>>>>>>>> site-file > > >>>>>>>>>>>> for the gram4 run that fails, as in the > > >>>>>>> site-file > > >>>>>>>>>>> for > > >>>>>>>>>>>> the pre-ws job that suceeds. This is > > >> the > > >>>> same > > >>>>>>>>>>> project, > > >>>>>>>>>>>> TG-MCA01S018, that is set in my > > >>>>>>>>>>> .tg_default_project > > >>>>>>>>>>>> file in ~kubal/ on the UC teragrid. > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> --- Ben Clifford > > >>>> wrote: > > >>>>>>>>>>>>> yeah, run that same without kickstart. > > >> the > > >>>>>>> error > > >>>>>>>>>>>>> reported is that > > >>>>>>>>>>>>> kickstart didn't work right - but > > >> there's > > >>>>>>>>>>> perhaps > > >>>>>>>>>>>>> some underlying error. > > >>>>>>>>>>>>> -- > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > > === message truncated === > > > > ____________________________________________________________________________________ > Looking for last minute shopping deals? > Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping > From wilde at mcs.anl.gov Tue Feb 12 17:41:18 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 Feb 2008 17:41:18 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202858233.27191.0.camel@blabla.mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> Message-ID: <47B22E9E.2060000@mcs.anl.gov> That would certainly explain this clobbered the head node. Im sorry that we all missed this last week. If true, we would have seen the applications running on the headnode. I wonder if anyone noticed? Mike, heres a sample entry Ive used in the past for UC-TG: /home/wilde/swiftdata/UC/work The missing part is the "/jobmanager-pbs" in the url= tag of the element. - mikew On 2/12/08 5:17 PM, Mihael Hategan wrote: > Are you sure you were using PBS with pre-ws GRAM and not fork? > > On Tue, 2008-02-12 at 15:14 -0800, Mike Kubal wrote: >> With pre-WS-GRAM, it doesn't seem to matter which >> account/project id I use where. I can have the >> TG-MCB010025N specified in the sites-files and >> TG-MCA01S018 specified in ~kubal/.tg-default_project >> on the uc teragrid, or vice versa and it still works, >> or having them match in both places. >> >> With WS-GRAM, I have to use TG-MCB010025N, the local >> uc-teragrid project id, in both places. Using >> TG-MCA01S018, the teragrid wide charge number/account >> number, causes the qsub failure error. >> >> >> >> >> --- Michael Wilde wrote: >> >>> Mike, did you do a recent test with pre-WS-GRAM with >>> the >>> .tg_default_project file set *incorrectly*? >>> >>> I think the puzzle was why this would cause WS-GRAM >>> to fail but not >>> pre-WS-GRAM, as it would seem they would both get >>> the TG account to use >>> in the same manner. >>> >>> - mikew >>> >>> On 2/12/08 4:43 PM, Mike Kubal wrote: >>>> Just to be sure I tested with pre-WS and it worked >>>> also. >>>> >>>> --- Mihael Hategan wrote: >>>> >>>>> Would it be worth trying to find out why it >>> worked >>>>> with pre-WS GRAM? 
>>>>> >>>>> Mihael >>>>> >>>>> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal >>> wrote: >>>>>> Thanks Joe. This solved the account id problem. >>>>>> >>>>>> --- joseph insley wrote: >>>>>> >>>>>>> Mike K, >>>>>>> >>>>>>> looks like you have the wrong value in your >>>>>>> .tg_default_project file: >>>>>>> >>>>>>> insley at tg-viz-login1:~> more >>>>>>> ~kubal/.tg_default_project >>>>>>> TG-MCA01S018 >>>>>>> >>>>>>> you should be using: TG-MCB010025N >>>>>>> >>>>>>> insley at tg-viz-login1:~> tgusage -i -u kubal >>>>>>> >>>>>>> [snip] >>>>>>> >>>>>>> Account: TG-MCA01S018 >>>>>>> Title: Computational Studies of Complex >>>>> Processes in >>>>>>> Biological >>>>>>> Macromolecular Systems >>>>>>> Resource: teragrid >>>>>>> >>>>>>> **** >>>>>>> Local project name on dtf.anl.teragrid is >>>>>>> TG-MCB010025N >>>>>>> **** >>>>>>> >>>>>>> Allocation Period: 2007-08-03 to 2008-03-31 >>>>>>> >>>>>>> Name (Last First) or Account Total >>>>>>> Remaining Usage >>>>>>> ---------------------------- ---------- >>>>>>> ------------ ---------- >>>>>>> Kubal Michael 101880 SU >>>>>>> 99358 SU 296 SU >>>>>>> >> ---------------------------------------------------------------------- >>>>>>> TG-MCA01S018 101880 SU >>>>>>> 99358 SU 2522 SU >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan >>>>> wrote: >>>>>>>> You should probably remove the line >>>>> completely. >>>>>>>> Did you chose a default project on the login >>>>> node >>>>>>> with tgprojects? >>>>>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal >>>>>>> wrote: >>>>>>>>> I tried running with the account id removed >>>>> from >>>>>>> the >>>>>>>>> sites.file as in the following line: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> but received the same error. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> --- Mihael Hategan >>>>> wrote: >>>>>>>>>> Is this the same for pre-WS GRAM? >>>>>>>>>> >>>>>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart >>>>> Martin >>>>>>>>>> wrote: >>>>>>>>>>> that's right, qsub is used for PBS (and >>>>> some >>>>>>>>>> others too) >>>>>>>>>>> bsub is LSF >>>>>>>>>>> condor_q for condor >>>>>>>>>>> ... >>>>>>>>>>> >>>>>>>>>>> -Stu >>>>>>>>>>> >>>>>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael >>>>>>>>>> Hategan wrote: >>>>>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike >>>>> Kubal >>>>>>>>>> wrote: >>>>>>>>>>>>> I'll give it a try. >>>>>>>>>>>>> >>>>>>>>>>>>> When using GRAM4, is qsub the method used >>>>> to >>>>>>>>>>>>> ultimately put the job in the queue? >>>>>>>>>>>> Looks like it. I also believe it's the >>>>> case >>>>>>> with >>>>>>>>>> pre-ws gram. Stu >>>>>>>>>>>> may be >>>>>>>>>>>> able to clarify. >>>>>>>>>>>> >>>>>>>>>>>>> MikeK >>>>>>>>>>>>> --- Mihael Hategan >>>>>>> wrote: >>>>>>>>>>>>>> While this doesn't solve the underlying >>>>>>>>>> problem, it >>>>>>>>>>>>>> may help you get >>>>>>>>>>>>>> this to work: log into tg-login1.uc..., >>>>> set >>>>>>>>>> this >>>>>>>>>>>>>> project as default, >>>>>>>>>>>>>> then remove the project spec from the >>>>> sites >>>>>>>>>> file and >>>>>>>>>>>>>> try again. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Mihael >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike >>>>>>> Kubal >>>>>>>>>> wrote: >>>>>>>>>>>>>>> Yes, I believe you are right. The >>>>> kickstart >>>>>>>>>>>>>> message >>>>>>>>>>>>>>> may be only a warning. After digging a >>>>>>> little >>>>>>>>>>>>>> deeper >>>>>>>>>>>>>>> it appears the job is failing due to a >>>>>>>>>>>>>> project/account >>>>>>>>>>>>>>> id problem. 
I get the following error: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Caused by: >>>>>>>>>>>>>>> The executable could not be >>>>>>> started., >>>>>>>>>>>>>> qsub: >>>>>>>>>>>>>>> Invalid Account MSG=invalid account >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I am specifying the same TG-account in >>>>> my >>>>>>>>>>>>>> site-file >>>>>>>>>>>>>>> for the gram4 run that fails, as in the >>>>>>>>>> site-file >>>>>>>>>>>>>> for >>>>>>>>>>>>>>> the pre-ws job that suceeds. This is >>>>> the >>>>>>> same >>>>>>>>>>>>>> project, >>>>>>>>>>>>>>> TG-MCA01S018, that is set in my >>>>>>>>>>>>>> .tg_default_project >>>>>>>>>>>>>>> file in ~kubal/ on the UC teragrid. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> --- Ben Clifford >>>>>>> wrote: >>>>>>>>>>>>>>>> yeah, run that same without kickstart. >>>>> the >>>>>>>>>> error >>>>>>>>>>>>>>>> reported is that >>>>>>>>>>>>>>>> kickstart didn't work right - but >>>>> there's >>>>>>>>>>>>>> perhaps >>>>>>>>>>>>>>>> some underlying error. >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >> === message truncated === >> >> >> >> ____________________________________________________________________________________ >> Looking for last minute shopping deals? >> Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping >> > > From benc at hawaga.org.uk Tue Feb 12 17:51:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 23:51:05 +0000 (GMT) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <283100.87314.qm@web52307.mail.re2.yahoo.com> References: <283100.87314.qm@web52307.mail.re2.yahoo.com> Message-ID: On Tue, 12 Feb 2008, Mike Kubal wrote: > Yes, I believe you are right. The kickstart message > may be only a warning. After digging a little deeper > it appears the job is failing due to a project/account > id problem. I get the following error: > > Caused by: > The executable could not be started., qsub: > Invalid Account MSG=invalid account > I am specifying the same TG-account in my site-file > for the gram4 run that fails, as in the site-file for > the pre-ws job that suceeds. This is the same project, > TG-MCA01S018, that is set in my .tg_default_project > file in ~kubal/ on the UC teragrid. ok. there's something wrong there. run the command: tgprojects and paste its output. also paste your sites file again. -- From hategan at mcs.anl.gov Tue Feb 12 17:53:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 17:53:50 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <47B22E9E.2060000@mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> Message-ID: <1202860430.28705.7.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 17:41 -0600, Michael Wilde wrote: > That would certainly explain this clobbered the head node. > Im sorry that we all missed this last week. This is what Joe killed at the time: > kubal 28202 19438 1 16:50 ? 00:00:00 /usr/bin/perl /soft/ > prews-gram-4.0.1-r3/libexec/globus-job-manager-script.pl -m pbs -f / > tmp/gram_SPsdme -c poll Looks like PBS. > > If true, we would have seen the applications running on the headnode. > I wonder if anyone noticed? 
> > Mike, heres a sample entry Ive used in the past for UC-TG: > > sysinfo="INTEL32::LINUX"> > storage="/home/wilde/swiftdata/UC/storage" major="2" minor="2" /> > url="tg-grid.uc.teragrid.org/jobmanager-pbs" major="2" minor="2"/> > /home/wilde/swiftdata/UC/work > > > > The missing part is the "/jobmanager-pbs" in the url= tag of the > element. I think we may want to discourage that since it's not portable. I'd say instead of , one should use Mihael > > - mikew > > > On 2/12/08 5:17 PM, Mihael Hategan wrote: > > Are you sure you were using PBS with pre-ws GRAM and not fork? > > > > On Tue, 2008-02-12 at 15:14 -0800, Mike Kubal wrote: > >> With pre-WS-GRAM, it doesn't seem to matter which > >> account/project id I use where. I can have the > >> TG-MCB010025N specified in the sites-files and > >> TG-MCA01S018 specified in ~kubal/.tg-default_project > >> on the uc teragrid, or vice versa and it still works, > >> or having them match in both places. > >> > >> With WS-GRAM, I have to use TG-MCB010025N, the local > >> uc-teragrid project id, in both places. Using > >> TG-MCA01S018, the teragrid wide charge number/account > >> number, causes the qsub failure error. > >> > >> > >> > >> > >> --- Michael Wilde wrote: > >> > >>> Mike, did you do a recent test with pre-WS-GRAM with > >>> the > >>> .tg_default_project file set *incorrectly*? > >>> > >>> I think the puzzle was why this would cause WS-GRAM > >>> to fail but not > >>> pre-WS-GRAM, as it would seem they would both get > >>> the TG account to use > >>> in the same manner. > >>> > >>> - mikew > >>> > >>> On 2/12/08 4:43 PM, Mike Kubal wrote: > >>>> Just to be sure I tested with pre-WS and it worked > >>>> also. > >>>> > >>>> --- Mihael Hategan wrote: > >>>> > >>>>> Would it be worth trying to find out why it > >>> worked > >>>>> with pre-WS GRAM? > >>>>> > >>>>> Mihael > >>>>> > >>>>> On Tue, 2008-02-12 at 14:20 -0800, Mike Kubal > >>> wrote: > >>>>>> Thanks Joe. This solved the account id problem. > >>>>>> > >>>>>> --- joseph insley wrote: > >>>>>> > >>>>>>> Mike K, > >>>>>>> > >>>>>>> looks like you have the wrong value in your > >>>>>>> .tg_default_project file: > >>>>>>> > >>>>>>> insley at tg-viz-login1:~> more > >>>>>>> ~kubal/.tg_default_project > >>>>>>> TG-MCA01S018 > >>>>>>> > >>>>>>> you should be using: TG-MCB010025N > >>>>>>> > >>>>>>> insley at tg-viz-login1:~> tgusage -i -u kubal > >>>>>>> > >>>>>>> [snip] > >>>>>>> > >>>>>>> Account: TG-MCA01S018 > >>>>>>> Title: Computational Studies of Complex > >>>>> Processes in > >>>>>>> Biological > >>>>>>> Macromolecular Systems > >>>>>>> Resource: teragrid > >>>>>>> > >>>>>>> **** > >>>>>>> Local project name on dtf.anl.teragrid is > >>>>>>> TG-MCB010025N > >>>>>>> **** > >>>>>>> > >>>>>>> Allocation Period: 2007-08-03 to 2008-03-31 > >>>>>>> > >>>>>>> Name (Last First) or Account Total > >>>>>>> Remaining Usage > >>>>>>> ---------------------------- ---------- > >>>>>>> ------------ ---------- > >>>>>>> Kubal Michael 101880 SU > >>>>>>> 99358 SU 296 SU > >>>>>>> > >> ---------------------------------------------------------------------- > >>>>>>> TG-MCA01S018 101880 SU > >>>>>>> 99358 SU 2522 SU > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Feb 12, 2008, at 2:37 PM, Mihael Hategan > >>>>> wrote: > >>>>>>>> You should probably remove the line > >>>>> completely. > >>>>>>>> Did you chose a default project on the login > >>>>> node > >>>>>>> with tgprojects? 
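The XML markup in the sample entry above did not survive the list archiver. Reconstructed from the surviving attributes, and assuming the Swift sites.xml element names of the time (the pool handle and the gridftp URL here are guesses), it would have looked roughly like:

  <pool handle="UC-TG" sysinfo="INTEL32::LINUX">
    <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"
             storage="/home/wilde/swiftdata/UC/storage" major="2" minor="2"/>
    <jobmanager url="tg-grid.uc.teragrid.org/jobmanager-pbs" major="2" minor="2"/>
    <workdirectory>/home/wilde/swiftdata/UC/work</workdirectory>
  </pool>

and the more portable form being suggested instead names the job manager as an attribute rather than appending it to the URL (the element name and provider value are assumptions; the jobManager and url attributes are the ones quoted later in this thread):

  <pool handle="UC-TG" sysinfo="INTEL32::LINUX">
    <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org"
             storage="/home/wilde/swiftdata/UC/storage"/>
    <execution provider="gt2" jobManager="pbs" url="tg-grid.uc.teragrid.org"/>
    <workdirectory>/home/wilde/swiftdata/UC/work</workdirectory>
  </pool>

The attraction of the attribute form is that the same entry can be switched between pre-WS GRAM and GRAM4 by changing the provider, without relying on the /jobmanager-pbs URL convention that only pre-WS GRAM understands.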
> >>>>>>>> On Tue, 2008-02-12 at 12:34 -0800, Mike Kubal > >>>>>>> wrote: > >>>>>>>>> I tried running with the account id removed > >>>>> from > >>>>>>> the > >>>>>>>>> sites.file as in the following line: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> but received the same error. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> --- Mihael Hategan > >>>>> wrote: > >>>>>>>>>> Is this the same for pre-WS GRAM? > >>>>>>>>>> > >>>>>>>>>> On Tue, 2008-02-12 at 14:20 -0600, Stuart > >>>>> Martin > >>>>>>>>>> wrote: > >>>>>>>>>>> that's right, qsub is used for PBS (and > >>>>> some > >>>>>>>>>> others too) > >>>>>>>>>>> bsub is LSF > >>>>>>>>>>> condor_q for condor > >>>>>>>>>>> ... > >>>>>>>>>>> > >>>>>>>>>>> -Stu > >>>>>>>>>>> > >>>>>>>>>>> On Feb 12, 2008, at Feb 12, 2:15 PM, Mihael > >>>>>>>>>> Hategan wrote: > >>>>>>>>>>>> On Tue, 2008-02-12 at 12:09 -0800, Mike > >>>>> Kubal > >>>>>>>>>> wrote: > >>>>>>>>>>>>> I'll give it a try. > >>>>>>>>>>>>> > >>>>>>>>>>>>> When using GRAM4, is qsub the method used > >>>>> to > >>>>>>>>>>>>> ultimately put the job in the queue? > >>>>>>>>>>>> Looks like it. I also believe it's the > >>>>> case > >>>>>>> with > >>>>>>>>>> pre-ws gram. Stu > >>>>>>>>>>>> may be > >>>>>>>>>>>> able to clarify. > >>>>>>>>>>>> > >>>>>>>>>>>>> MikeK > >>>>>>>>>>>>> --- Mihael Hategan > >>>>>>> wrote: > >>>>>>>>>>>>>> While this doesn't solve the underlying > >>>>>>>>>> problem, it > >>>>>>>>>>>>>> may help you get > >>>>>>>>>>>>>> this to work: log into tg-login1.uc..., > >>>>> set > >>>>>>>>>> this > >>>>>>>>>>>>>> project as default, > >>>>>>>>>>>>>> then remove the project spec from the > >>>>> sites > >>>>>>>>>> file and > >>>>>>>>>>>>>> try again. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Mihael > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Tue, 2008-02-12 at 11:36 -0800, Mike > >>>>>>> Kubal > >>>>>>>>>> wrote: > >>>>>>>>>>>>>>> Yes, I believe you are right. The > >>>>> kickstart > >>>>>>>>>>>>>> message > >>>>>>>>>>>>>>> may be only a warning. After digging a > >>>>>>> little > >>>>>>>>>>>>>> deeper > >>>>>>>>>>>>>>> it appears the job is failing due to a > >>>>>>>>>>>>>> project/account > >>>>>>>>>>>>>>> id problem. I get the following error: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Caused by: > >>>>>>>>>>>>>>> The executable could not be > >>>>>>> started., > >>>>>>>>>>>>>> qsub: > >>>>>>>>>>>>>>> Invalid Account MSG=invalid account > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I am specifying the same TG-account in > >>>>> my > >>>>>>>>>>>>>> site-file > >>>>>>>>>>>>>>> for the gram4 run that fails, as in the > >>>>>>>>>> site-file > >>>>>>>>>>>>>> for > >>>>>>>>>>>>>>> the pre-ws job that suceeds. This is > >>>>> the > >>>>>>> same > >>>>>>>>>>>>>> project, > >>>>>>>>>>>>>>> TG-MCA01S018, that is set in my > >>>>>>>>>>>>>> .tg_default_project > >>>>>>>>>>>>>>> file in ~kubal/ on the UC teragrid. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> --- Ben Clifford > >>>>>>> wrote: > >>>>>>>>>>>>>>>> yeah, run that same without kickstart. > >>>>> the > >>>>>>>>>> error > >>>>>>>>>>>>>>>> reported is that > >>>>>>>>>>>>>>>> kickstart didn't work right - but > >>>>> there's > >>>>>>>>>>>>>> perhaps > >>>>>>>>>>>>>>>> some underlying error. 
> >>>>>>>>>>>>>>>> -- > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >> === message truncated === > >> > >> > >> > >> ____________________________________________________________________________________ > >> Looking for last minute shopping deals? > >> Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping > >> > > > > > From benc at hawaga.org.uk Tue Feb 12 17:57:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 12 Feb 2008 23:57:34 +0000 (GMT) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202860430.28705.7.camel@blabla.mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> <1202860430.28705.7.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 12 Feb 2008, Mihael Hategan wrote: > I think we may want to discourage that since it's not portable. I'd say > instead of , one should use jobManager="pbs" url="tg-grid.uc.teragrid.org"/> which is more portable...? -- From hategan at mcs.anl.gov Tue Feb 12 18:04:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 18:04:36 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> <1202860430.28705.7.camel@blabla.mcs.anl.gov> Message-ID: <1202861076.29685.4.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford wrote: > > On Tue, 12 Feb 2008, Mihael Hategan wrote: > > > I think we may want to discourage that since it's not portable. I'd say > > instead of , one should use > jobManager="pbs" url="tg-grid.uc.teragrid.org"/> > > which is more portable...? Hmm? From wilde at mcs.anl.gov Tue Feb 12 18:26:41 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 12 Feb 2008 18:26:41 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202861076.29685.4.camel@blabla.mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> <1202860430.28705.7.camel@blabla.mcs.anl.gov> <1202861076.29685.4.camel@blabla.mcs.anl.gov> Message-ID: <47B23941.7050201@mcs.anl.gov> I think that makes sense - you mean that jobManager="pbs" works for both WS-GRAM and pre-WS-GRAM, right? On 2/12/08 6:04 PM, Mihael Hategan wrote: > On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford wrote: >> On Tue, 12 Feb 2008, Mihael Hategan wrote: >> >>> I think we may want to discourage that since it's not portable. I'd say >>> instead of , one should use >> jobManager="pbs" url="tg-grid.uc.teragrid.org"/> >> which is more portable...? > > Hmm? 
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Tue Feb 12 18:31:55 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 12 Feb 2008 18:31:55 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <47B23941.7050201@mcs.anl.gov> References: <283035.8845.qm@web52304.mail.re2.yahoo.com> <1202858233.27191.0.camel@blabla.mcs.anl.gov> <47B22E9E.2060000@mcs.anl.gov> <1202860430.28705.7.camel@blabla.mcs.anl.gov> <1202861076.29685.4.camel@blabla.mcs.anl.gov> <47B23941.7050201@mcs.anl.gov> Message-ID: <1202862715.31941.5.camel@blabla.mcs.anl.gov> On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde wrote: > I think that makes sense - you mean that jobManager="pbs" works for both > WS-GRAM and pre-WS-GRAM, right? Yes. Not only that, with and WS-GRAM there is no (known to me) way to specify a job manager. Somewhat ironic. > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford wrote: > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > >> > >>> I think we may want to discourage that since it's not portable. I'd say > >>> instead of , one should use >>> jobManager="pbs" url="tg-grid.uc.teragrid.org"/> > >> which is more portable...? > > > > Hmm? I'm asking Ben "Hmm?" because I thought he was aware of the above fact and so unsure what exactly he wanted to know. > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From benc at hawaga.org.uk Wed Feb 13 06:53:24 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 13 Feb 2008 12:53:24 +0000 (GMT) Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: On Tue, 12 Feb 2008, Ben Clifford wrote: > > The attached jar should fix that. > > With your new jar, I no longer get that error. I did once get the below > stack trace, though execution appeared to continue. It hasn't happened a > second time or third time on running the same tests. This change should probably find its way into the swift distribution via commits to the various dependencies that I don't commit to (GRAM4 and cog). -- From feller at mcs.anl.gov Wed Feb 13 08:49:03 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Wed, 13 Feb 2008 08:49:03 -0600 (CST) Subject: [Swift-devel] cog r1871 In-Reply-To: References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: <22184.208.54.7.179.1202914143.squirrel@www-unix.mcs.anl.gov> > > On Tue, 12 Feb 2008, Ben Clifford wrote: > >> > The attached jar should fix that. >> >> With your new jar, I no longer get that error. I did once get the below >> stack trace, though execution appeared to continue. It hasn't happened a >> second time or third time on running the same tests. 
> > This change should probably find its way into the swift distribution via > commits to the various dependencies that I don't commit to (GRAM4 and > cog). > The change has not been committed yet to any branch in ws-gram. As far as i know cog has its own gram jars. Is that right? jars from what GT version are in the latest cog version? From hategan at mcs.anl.gov Wed Feb 13 10:05:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 13 Feb 2008 10:05:02 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: <22184.208.54.7.179.1202914143.squirrel@www-unix.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> <22184.208.54.7.179.1202914143.squirrel@www-unix.mcs.anl.gov> Message-ID: <1202918702.16251.1.camel@blabla.mcs.anl.gov> On Wed, 2008-02-13 at 08:49 -0600, feller at mcs.anl.gov wrote: > > > > On Tue, 12 Feb 2008, Ben Clifford wrote: > > > >> > The attached jar should fix that. > >> > >> With your new jar, I no longer get that error. I did once get the below > >> stack trace, though execution appeared to continue. It hasn't happened a > >> second time or third time on running the same tests. > > > > This change should probably find its way into the swift distribution via > > commits to the various dependencies that I don't commit to (GRAM4 and > > cog). > > > > The change has not been committed yet to any branch in ws-gram. > As far as i know cog has its own gram jars. Is that right? > jars from what GT version are in the latest cog version? Right now it's the first thing you sent. I think before that it was 4.0.2. Mihaek > > From mikekubal at yahoo.com Wed Feb 13 14:03:44 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Wed, 13 Feb 2008 12:03:44 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202862715.31941.5.camel@blabla.mcs.anl.gov> Message-ID: <338963.32679.qm@web52306.mail.re2.yahoo.com> It worked swimmingly with Mihael's suggestion to change gt4 to gt2 in the following line in my sites file: The only warning I get is a failure to transfer kickstart records if I include the gridlaunch argument as in the line below: Cheers, Mike --- Mihael Hategan wrote: > On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde > wrote: > > I think that makes sense - you mean that > jobManager="pbs" works for both > > WS-GRAM and pre-WS-GRAM, right? > > Yes. Not only that, with and WS-GRAM > there is no (known to > me) way to specify a job manager. Somewhat ironic. > > > > > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > > On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford > wrote: > > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > > >> > > >>> I think we may want to discourage that since > it's not portable. I'd say > > >>> instead of , one should use > > >>> jobManager="pbs" > url="tg-grid.uc.teragrid.org"/> > > >> which is more portable...? > > > > > > Hmm? > > I'm asking Ben "Hmm?" because I thought he was aware > of the above fact > and so unsure what exactly he wanted to know. 
> > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping From hategan at mcs.anl.gov Wed Feb 13 14:15:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 13 Feb 2008 14:15:36 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <338963.32679.qm@web52306.mail.re2.yahoo.com> References: <338963.32679.qm@web52306.mail.re2.yahoo.com> Message-ID: <1202933737.21302.4.camel@blabla.mcs.anl.gov> On Wed, 2008-02-13 at 12:03 -0800, Mike Kubal wrote: > It worked swimmingly with Mihael's suggestion to > change gt4 to gt2 Ouch. That was a bit of a mistake there. I was pointing out that should be used instead of . GT2 was accidental. You should probably change that to GT4 unless you're using a checkout more current than yesterday which has some throttling patches to try to prevent killing the head node. > in the following line in my sites > file: > > url="tg-grid1.uc.teragrid.org" /> > > The only warning I get is a failure to transfer > kickstart records if I include the gridlaunch argument > as in the line below: > gridlaunch="/home/wilde/vds/mystart"> > > Cheers, > > Mike > > > > > --- Mihael Hategan wrote: > > > On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde > > wrote: > > > I think that makes sense - you mean that > > jobManager="pbs" works for both > > > WS-GRAM and pre-WS-GRAM, right? > > > > Yes. Not only that, with and WS-GRAM > > there is no (known to > > me) way to specify a job manager. Somewhat ironic. > > > > > > > > > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > > > On Tue, 2008-02-12 at 23:57 +0000, Ben Clifford > > wrote: > > > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > > > >> > > > >>> I think we may want to discourage that since > > it's not portable. I'd say > > > >>> instead of , one should use > > > > >>> jobManager="pbs" > > url="tg-grid.uc.teragrid.org"/> > > > >> which is more portable...? > > > > > > > > Hmm? > > > > I'm asking Ben "Hmm?" because I thought he was aware > > of the above fact > > and so unsure what exactly he wanted to know. > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Looking for last minute shopping deals? > Find them fast with Yahoo! Search. 
http://tools.search.yahoo.com/newsearch/category.php?category=shopping > From mikekubal at yahoo.com Wed Feb 13 14:55:34 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Wed, 13 Feb 2008 12:55:34 -0800 (PST) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <1202933737.21302.4.camel@blabla.mcs.anl.gov> Message-ID: <570155.71653.qm@web52302.mail.re2.yahoo.com> What does the gt2/gt4 signify? Using gt4 in the line below causes app on the uc-teragrid to fail with message "cannot execute binary file": Cheers, Mike --- Mihael Hategan wrote: > On Wed, 2008-02-13 at 12:03 -0800, Mike Kubal wrote: > > It worked swimmingly with Mihael's suggestion to > > change gt4 to gt2 > > Ouch. That was a bit of a mistake there. I was > pointing out that > should be used instead of . > GT2 was accidental. > You should probably change that to GT4 unless you're > using a checkout > more current than yesterday which has some > throttling patches to try to > prevent killing the head node. > > > in the following line in my sites > > file: > > > > > url="tg-grid1.uc.teragrid.org" /> > > > > The only warning I get is a failure to transfer > > kickstart records if I include the gridlaunch > argument > > as in the line below: > > > gridlaunch="/home/wilde/vds/mystart"> > > > > Cheers, > > > > Mike > > > > > > > > > > --- Mihael Hategan wrote: > > > > > On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde > > > wrote: > > > > I think that makes sense - you mean that > > > jobManager="pbs" works for both > > > > WS-GRAM and pre-WS-GRAM, right? > > > > > > Yes. Not only that, with and > WS-GRAM > > > there is no (known to > > > me) way to specify a job manager. Somewhat > ironic. > > > > > > > > > > > > > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > > > > On Tue, 2008-02-12 at 23:57 +0000, Ben > Clifford > > > wrote: > > > > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > > > > >> > > > > >>> I think we may want to discourage that > since > > > it's not portable. I'd say > > > > >>> instead of , one should use > > > > > > >>> jobManager="pbs" > > > url="tg-grid.uc.teragrid.org"/> > > > > >> which is more portable...? > > > > > > > > > > Hmm? > > > > > > I'm asking Ben "Hmm?" because I thought he was > aware > > > of the above fact > > > and so unsure what exactly he wanted to know. > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Looking for last minute shopping deals? > > Find them fast with Yahoo! Search. > http://tools.search.yahoo.com/newsearch/category.php?category=shopping > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. 
http://www.yahoo.com/r/hs From hategan at mcs.anl.gov Wed Feb 13 15:03:27 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 13 Feb 2008 15:03:27 -0600 Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <570155.71653.qm@web52302.mail.re2.yahoo.com> References: <570155.71653.qm@web52302.mail.re2.yahoo.com> Message-ID: <1202936607.22610.0.camel@blabla.mcs.anl.gov> On Wed, 2008-02-13 at 12:55 -0800, Mike Kubal wrote: > What does the gt2/gt4 signify? gt2 - pre-ws gram gt4 - ws gram > > Using gt4 in the line below causes app on the > uc-teragrid to fail with message "cannot execute > binary file": Nevermind then. Though we should probably debug that. > > url="tg-grid1.uc.teragrid.org" /> > > Cheers, > > Mike > > --- Mihael Hategan wrote: > > > On Wed, 2008-02-13 at 12:03 -0800, Mike Kubal wrote: > > > It worked swimmingly with Mihael's suggestion to > > > change gt4 to gt2 > > > > Ouch. That was a bit of a mistake there. I was > > pointing out that > > should be used instead of . > > GT2 was accidental. > > You should probably change that to GT4 unless you're > > using a checkout > > more current than yesterday which has some > > throttling patches to try to > > prevent killing the head node. > > > > > in the following line in my sites > > > file: > > > > > > > > url="tg-grid1.uc.teragrid.org" /> > > > > > > The only warning I get is a failure to transfer > > > kickstart records if I include the gridlaunch > > argument > > > as in the line below: > > > > > gridlaunch="/home/wilde/vds/mystart"> > > > > > > Cheers, > > > > > > Mike > > > > > > > > > > > > > > > --- Mihael Hategan wrote: > > > > > > > On Tue, 2008-02-12 at 18:26 -0600, Michael Wilde > > > > wrote: > > > > > I think that makes sense - you mean that > > > > jobManager="pbs" works for both > > > > > WS-GRAM and pre-WS-GRAM, right? > > > > > > > > Yes. Not only that, with and > > WS-GRAM > > > > there is no (known to > > > > me) way to specify a job manager. Somewhat > > ironic. > > > > > > > > > > > > > > > > > > On 2/12/08 6:04 PM, Mihael Hategan wrote: > > > > > > On Tue, 2008-02-12 at 23:57 +0000, Ben > > Clifford > > > > wrote: > > > > > >> On Tue, 12 Feb 2008, Mihael Hategan wrote: > > > > > >> > > > > > >>> I think we may want to discourage that > > since > > > > it's not portable. I'd say > > > > > >>> instead of , one should use > > > > > > > > >>> jobManager="pbs" > > > > url="tg-grid.uc.teragrid.org"/> > > > > > >> which is more portable...? > > > > > > > > > > > > Hmm? > > > > > > > > I'm asking Ben "Hmm?" because I thought he was > > aware > > > > of the above fact > > > > and so unsure what exactly he wanted to know. > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > > > > > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Looking for last minute shopping deals? > > > Find them fast with Yahoo! Search. 
> > > http://tools.search.yahoo.com/newsearch/category.php?category=shopping > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > ____________________________________________________________________________________ > Never miss a thing. Make Yahoo your home page. > http://www.yahoo.com/r/hs > From benc at hawaga.org.uk Thu Feb 14 03:25:35 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 14 Feb 2008 09:25:35 +0000 (GMT) Subject: [Swift-devel] Re: latest attempt with GRAM4 In-Reply-To: <570155.71653.qm@web52302.mail.re2.yahoo.com> References: <570155.71653.qm@web52302.mail.re2.yahoo.com> Message-ID: On Wed, 13 Feb 2008, Mike Kubal wrote: > What does the gt2/gt4 signify? There are two totally different job submission systems, both called GRAM. GRAM2 is more deployed but much older. GRAM4 is newer, less used, but has the promise of being (much) more scalable. > Using gt4 in the line below causes app on the > uc-teragrid to fail with message "cannot execute > binary file": > > url="tg-grid1.uc.teragrid.org" /> For debugging these problems, see if you can run the example workflow, examples/vdsk/first.swift - that should help isolate execution problems in general with something application specific. -- From benc at hawaga.org.uk Fri Feb 15 16:41:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 15 Feb 2008 22:41:53 +0000 (GMT) Subject: [Swift-devel] placement of large amounts of client side kickstart records Message-ID: At present, kickstart records go to $PWD. That's lame - 10000 jobs give 10000 files that are i) in $PWD and ii) all in the same directory. I'd like to do something about that. i) is most important - perhaps put them first in a subdirectory named by workflow run ID, eg fmri-20080215-1828-2d433ro1.d/ ii) matters a bit less; however, kickstart records could be staged back into a hierarchy split up by job ID, in the same way that they are split up on the execute side. -- From wilde at mcs.anl.gov Fri Feb 15 16:53:48 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 15 Feb 2008 16:53:48 -0600 Subject: [Swift-devel] placement of large amounts of client side kickstart records In-Reply-To: References: Message-ID: <47B617FC.7090108@mcs.anl.gov> This sounds very good to me. I had to hack around this in the work I did on angle in November. On 2/15/08 4:41 PM, Ben Clifford wrote: > At present, kickstart records go to $PWD. > > That's lame - 10000 jobs give 10000 files that are i) in $PWD and ii) all > in the same directory. > > I'd like to do something about that. > > i) is most important - perhaps put them first in a subdirectory named by > workflow run ID, eg fmri-20080215-1828-2d433ro1.d/ > > ii) matters a bit less; however, kickstart records could be staged back > into a hierarchy split up by job ID, in the same way that they are split > up on the execute side. > From wilde at mcs.anl.gov Tue Feb 19 15:52:48 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 19 Feb 2008 15:52:48 -0600 Subject: [Swift-devel] Re: Swift running errors In-Reply-To: <20080219150017.AWQ22172@m4500-03.uchicago.edu> References: <20080219150017.AWQ22172@m4500-03.uchicago.edu> Message-ID: <47BB4FB0.3030202@mcs.anl.gov> Xi, Regarding the kickstart problem - this is just a warning, possibly due to an incorrect spec in your sites.xml file on where kickstart is installed. We can look into this. 
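For reference, the various sites.xml pieces discussed in this thread might fit together roughly as in the sketch below. The handle, hostname, and paths are purely illustrative, and the attribute names should be checked against the sites.xml documentation for the Swift version in use; the thread above suggests that gridlaunch points at the kickstart install, provider="gt2" selects pre-WS GRAM (gt4 selects WS GRAM), and jobManager="pbs" names the local scheduler instead of encoding it in the URL as .../jobmanager-pbs:

<pool handle="uc-example" gridlaunch="/path/to/kickstart" sysinfo="INTEL32::LINUX">
  <execution provider="gt2" jobManager="pbs" url="tg-grid1.uc.teragrid.org"/>
  <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" storage="/path/to/storage" major="2" minor="2"/>
  <workdirectory>/path/to/work</workdirectory>
</pool>

If the gridlaunch path does not actually exist on the remote site, warnings like the kickstart one mentioned above are the likely symptom.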
Regarding "too many open files" - its possible that swift is trying to run too much in parallel and thus opening too many files at once. Mihael or Ben, could this be due to lack of or incorrect setting of the throttling parameters? I cant tell if this is hitting a per-host or per-process limit, but I suspect its the latter. Xi, until you hear from others, look at the throttling parameters and set them to a modest value to start with. I need to go back to my notes for this - and we should document this more clearly in the user guide. - mike On 2/19/08 3:00 PM, lixi at uchicago.edu wrote: > Hi, > > I have two problems. > > 1. Today, when I try to run swift workflow on muliple OSG > sites, I always encounter the following errors which cause > the running failed: > [lixi at login remote]$ swift - > tc.file /home/lixi/swift/test/tc.data - > sites.file /home/lixi/swift/test/OSGEDU_Sites.xml > workflowtest.swift > Swift v0.3-dev r1674 (modified locally) > > RunID: 20080219-1447-1hztqje9 > node started > Failed to transfer kickstart records from workflowtest- > 20080219-1447-1hztqje9/kickstart/8/CIT_CMS_T2Exception in > getFile > task:transfer @ vdl-int.k, line: 322 > sys:try @ vdl-int.k, line: 322 > vdl:transferkickstartrec @ vdl-int.k, line: 409 > sys:set @ vdl-int.k, line: 409 > sys:sequential @ vdl-int.k, line: 409 > sys:try @ vdl-int.k, line: 408 > sys:else @ vdl-int.k, line: 407 > sys:if @ vdl-int.k, line: 405 > sys:set @ vdl-int.k, line: 404 > sys:catch @ vdl-int.k, line: 396 > sys:try @ vdl-int.k, line: 354 > task:allocatehost @ vdl-int.k, line: 334 > vdl:execute2 @ execute-default.k, line: 23 > sys:restartonerror @ execute-default.k, line: 21 > sys:sequential @ execute-default.k, line: 19 > sys:try @ execute-default.k, line: 18 > sys:if @ execute-default.k, line: 17 > sys:then @ execute-default.k, line: 16 > sys:if @ execute-default.k, line: 15 > vdl:execute @ workflowtest.kml, line: 31 > worknode @ workflowtest.kml, line: 79 > sys:sequential @ workflowtest.kml, line: 78 > sys:parallel @ workflowtest.kml, line: 77 > vdl:mainp @ workflowtest.kml, line: 76 > mainp @ vdl.k, line: 150 > vdl:mains @ workflowtest.kml, line: 75 > vdl:mains @ workflowtest.kml, line: 75 > rlog:restartlog @ workflowtest.kml, line: 74 > kernel:project @ workflowtest.kml, line: 2 > workflowtest-20080219-1447-1hztqje9 > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: > Exception in getFile > Caused by: org.globus.ftp.exception.ServerException: Server > refused performing the request. Custom message: (error code > 1) [Nested exception message: Custom message: Unexpected > reply: 500-Command failed. : > globus_gridftp_server_file.c:globus_l_gfs_file_send:2190: > 500-globus_l_gfs_file_open failed. > 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: > 500-globus_xio_register_open failed. > 500-globus_xio_file_driver.c:globus_l_xio_file_open:438: > 500-Unable to open file /raid2/osg-data/lixi/workflowtest- > 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi- > kickstart.xml > 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: > 500-System error in open: No such file or directory > 500-globus_xio: A system call failed: No such file or > directory > 500 End.] [Nested exception is > org.globus.ftp.exception.UnexpectedReplyCodeException: > Custom message: Unexpected reply: 500-Command failed. : > globus_gridftp_server_file.c:globus_l_gfs_file_send:2190: > 500-globus_l_gfs_file_open failed. 
> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: > 500-globus_xio_register_open failed. > 500-globus_xio_file_driver.c:globus_l_xio_file_open:438: > 500-Unable to open file /raid2/osg-data/lixi/workflowtest- > 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi- > kickstart.xml > 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: > 500-System error in open: No such file or directory > 500-globus_xio: A system call failed: No such file or > directory > 500 End.] > > 2. When runing a workflow which involves 1000nodes, I > encounter the following errors very frequently, but not all > the time: > ... > node completed > node completed > node completed > node completed > node completed > node failed > Execution failed: > Exception in node: > Arguments: [_concurrent/intermediatefile-b5b5dc39-df70-4137- > 8149-c20f5d1af839-, out.0132.txt] > Host: localhost > Directory: workflowtest-20080219-1443-2qx4ctkc/jobs/6/node- > 64kddnoi > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > java.io.IOException: Too many open files > > Could you tell me why and teach me how to resolve such > problems? > > Thanks, > > Xi > > From hategan at mcs.anl.gov Thu Feb 21 12:53:25 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 21 Feb 2008 12:53:25 -0600 Subject: [Swift-devel] cog r1871 In-Reply-To: <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> Message-ID: <1203620005.2850.9.camel@blabla.mcs.anl.gov> An email message in a different thread brings the question: is this compiled for 1.4 or 1.5? Mihael On Mon, 2008-02-11 at 23:28 -0600, feller at mcs.anl.gov wrote: > My fault, not the ObjectSerializers one. > You submitted in batch-mode? > The attached jar should fix that. > Hope the java version is fine. > Martin > > > Martin? > > > > On Mon, 2008-02-11 at 22:50 +0000, Ben Clifford wrote: > >> I'm seeing repeatable cleanup errors like the below. The workflows run > >> to > >> completion, though. > >> > >> RunID: 20080211-2248-rsqe1da0 > >> cat started > >> cat completed > >> The following warnings have occurred: > >> 1. 
Cleanup on tguc failed > >> Caused by: > >> Cannot submit job: null > >> Caused by: > >> java.lang.NullPointerException > >> at > >> org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) > >> at > >> org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) > >> at org.globus.exec.client.GramJob.submit(GramJob.java:447) > >> at > >> org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) > >> at > >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) > >> at > >> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) > >> at > >> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) > >> > >> > > > > From feller at mcs.anl.gov Thu Feb 21 13:32:29 2008 From: feller at mcs.anl.gov (feller at mcs.anl.gov) Date: Thu, 21 Feb 2008 13:32:29 -0600 (CST) Subject: [Swift-devel] cog r1871 In-Reply-To: <1203620005.2850.9.camel@blabla.mcs.anl.gov> References: <1202745681.15887.10.camel@blabla.mcs.anl.gov> <1202749192.18234.0.camel@blabla.mcs.anl.gov> <1202758443.28686.0.camel@blabla.mcs.anl.gov> <1202769433.31985.1.camel@blabla.mcs.anl.gov> <1202773602.779.0.camel@blabla.mcs.anl.gov> <49572.207.229.171.174.1202794085.squirrel@www-unix.mcs.anl.gov> <1203620005.2850.9.camel@blabla.mcs.anl.gov> Message-ID: <21069.208.54.7.178.1203622349.squirrel@www-unix.mcs.anl.gov> 99.63% sure that it was built with java 1.4 if it was 1.5 a client running under 1.4 should see errors. Martin > An email message in a different thread brings the question: is this > compiled for 1.4 or 1.5? > > Mihael > > On Mon, 2008-02-11 at 23:28 -0600, feller at mcs.anl.gov wrote: >> My fault, not the ObjectSerializers one. >> You submitted in batch-mode? >> The attached jar should fix that. >> Hope the java version is fine. >> Martin >> >> > Martin? >> > >> > On Mon, 2008-02-11 at 22:50 +0000, Ben Clifford wrote: >> >> I'm seeing repeatable cleanup errors like the below. The workflows >> run >> >> to >> >> completion, though. >> >> >> >> RunID: 20080211-2248-rsqe1da0 >> >> cat started >> >> cat completed >> >> The following warnings have occurred: >> >> 1. Cleanup on tguc failed >> >> Caused by: >> >> Cannot submit job: null >> >> Caused by: >> >> java.lang.NullPointerException >> >> at >> >> org.globus.wsrf.encoding.ObjectSerializer.clone(ObjectSerializer.java:211) >> >> at >> >> org.globus.exec.client.GramJob.createJobEndpoint(GramJob.java:970) >> >> at org.globus.exec.client.GramJob.submit(GramJob.java:447) >> >> at >> >> org.globus.cog.abstraction.impl.execution.gt4_0_0.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:189) >> >> at >> >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:54) >> >> at >> >> org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86) >> >> at >> >> edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431) >> >> >> >> >> > >> > > > From benc at hawaga.org.uk Sun Feb 24 17:12:16 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 24 Feb 2008 23:12:16 +0000 (GMT) Subject: [Swift-devel] some racelike condition on file stagein Message-ID: I see errors like the below sometimes on my laptop and on the NMI test systems, happening in file stagein. 
It happens sporadically and seemingly not very often if at all on i386 linux, so feels like some kind of race condition. Every time I've seen it, its been creating a file under a non-trivial directory structure (eg under _concurrent) so maybe there's something funny going on there. Its come to my attention because I put in a test that turns off execution retries and runs all the local behaviour tests, on the basis that local execution should be very unlikely to need retries. org.globus.cog.abstraction.impl.file.FileResourceException: Failed to create directory: _concurrent/aligned-a0f7b757-142a-4c66-8288-bd06ec2d591c--array//elt-4.-field http://nmi-s005.cs.wisc.edu:80/nmi/run/benc/2008/02/benc_nmi-s005.cs.wisc.edu_1203870306_20341/userdir/nmi:x86_fc_3/remote_task.err I'll put some more details in the bugzilla. -- From lixi at uchicago.edu Tue Feb 19 15:00:17 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Tue, 19 Feb 2008 15:00:17 -0600 (CST) Subject: [Swift-devel] Swift running errors Message-ID: <20080219150017.AWQ22172@m4500-03.uchicago.edu> Hi, I have two problems. 1. Today, when I try to run swift workflow on muliple OSG sites, I always encounter the following errors which cause the running failed: [lixi at login remote]$ swift - tc.file /home/lixi/swift/test/tc.data - sites.file /home/lixi/swift/test/OSGEDU_Sites.xml workflowtest.swift Swift v0.3-dev r1674 (modified locally) RunID: 20080219-1447-1hztqje9 node started Failed to transfer kickstart records from workflowtest- 20080219-1447-1hztqje9/kickstart/8/CIT_CMS_T2Exception in getFile task:transfer @ vdl-int.k, line: 322 sys:try @ vdl-int.k, line: 322 vdl:transferkickstartrec @ vdl-int.k, line: 409 sys:set @ vdl-int.k, line: 409 sys:sequential @ vdl-int.k, line: 409 sys:try @ vdl-int.k, line: 408 sys:else @ vdl-int.k, line: 407 sys:if @ vdl-int.k, line: 405 sys:set @ vdl-int.k, line: 404 sys:catch @ vdl-int.k, line: 396 sys:try @ vdl-int.k, line: 354 task:allocatehost @ vdl-int.k, line: 334 vdl:execute2 @ execute-default.k, line: 23 sys:restartonerror @ execute-default.k, line: 21 sys:sequential @ execute-default.k, line: 19 sys:try @ execute-default.k, line: 18 sys:if @ execute-default.k, line: 17 sys:then @ execute-default.k, line: 16 sys:if @ execute-default.k, line: 15 vdl:execute @ workflowtest.kml, line: 31 worknode @ workflowtest.kml, line: 79 sys:sequential @ workflowtest.kml, line: 78 sys:parallel @ workflowtest.kml, line: 77 vdl:mainp @ workflowtest.kml, line: 76 mainp @ vdl.k, line: 150 vdl:mains @ workflowtest.kml, line: 75 vdl:mains @ workflowtest.kml, line: 75 rlog:restartlog @ workflowtest.kml, line: 74 kernel:project @ workflowtest.kml, line: 2 workflowtest-20080219-1447-1hztqje9 Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Exception in getFile Caused by: org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190: 500-globus_l_gfs_file_open failed. 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: 500-globus_xio_register_open failed. 
500-globus_xio_file_driver.c:globus_l_xio_file_open:438: 500-Unable to open file /raid2/osg-data/lixi/workflowtest- 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi- kickstart.xml 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: 500-System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500 End.] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_send:2190: 500-globus_l_gfs_file_open failed. 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694: 500-globus_xio_register_open failed. 500-globus_xio_file_driver.c:globus_l_xio_file_open:438: 500-Unable to open file /raid2/osg-data/lixi/workflowtest- 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi- kickstart.xml 500-globus_xio_file_driver.c:globus_l_xio_file_open:381: 500-System error in open: No such file or directory 500-globus_xio: A system call failed: No such file or directory 500 End.] 2. When runing a workflow which involves 1000nodes, I encounter the following errors very frequently, but not all the time: ... node completed node completed node completed node completed node completed node failed Execution failed: Exception in node: Arguments: [_concurrent/intermediatefile-b5b5dc39-df70-4137- 8149-c20f5d1af839-, out.0132.txt] Host: localhost Directory: workflowtest-20080219-1443-2qx4ctkc/jobs/6/node- 64kddnoi stderr.txt: stdout.txt: ---- Caused by: java.io.IOException: Too many open files Could you tell me why and teach me how to resolve such problems? Thanks, Xi From zhoujianghua1017 at 163.com Tue Feb 26 07:48:26 2008 From: zhoujianghua1017 at 163.com (jezhee) Date: Tue, 26 Feb 2008 21:48:26 +0800 Subject: [Swift-devel] Some questions about Swift Message-ID: <200802262145466621306@163.com> swift-devel? Hi. This is Zhou Jianghua from China and come into some problems when using Swift. Waiting for your guide and thanks a lot. I have installed the Swift environment in my computer and run some simple examples in local machine. All things were normal except that the exection was very slow. A simple program just displaying text on the screen took 5 to 10 seconds. Could you tell me why? Besides, I followed the instructions in the documentation, Swift lab at University of Chicago Computation Institute,part I: Grid workflow(url:http://www.ci.uchicago.edu/osgedu/schools/swiftlab/). BUt, I didn't find the folder sw in my machine, and the file sites-chicago.xml neither. So, I can't let my program run at a remote host. How to solve this? ?Regards. 2008-02-26 ////////////////////////////////////////// // Zhou Jianghua zhoujianghua1017 at 163.com // EI Dep, Huazhong Uni of Sci & Tech // Internet Technology and Engineering Center // http://www.itec.org.cn // // Tel?(86)27-87792139 // Fax?(86)27-87540745 // Zipcode?430074 // Address?Luoyu Road 1037, Wuhan, Hubei, China ///////////////////////////////////////// From benc at hawaga.org.uk Tue Feb 26 10:42:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 26 Feb 2008 16:42:04 +0000 (GMT) Subject: [Swift-devel] Some questions about Swift In-Reply-To: <200802262145466621306@163.com> References: <200802262145466621306@163.com> Message-ID: On Tue, 26 Feb 2008, jezhee wrote: > I have installed the Swift environment in my computer and run some > simple examples in local machine. All things were normal except that > the exection was very slow. 
A simple program just displaying text on the > screen took 5 to 10 seconds. Could you tell me why? There is a lot of startup involved with running a swift program - that is probably most of the time you see. This time consists of loading the JVM, loading various libraries and compiling your program. However, if you run two programs, you should find that it takes about the same amount of time, not twice as long. > Besides, I followed the instructions in the documentation, Swift lab > at University of Chicago Computation Institute,part I: Grid > workflow(url:http://www.ci.uchicago.edu/osgedu/schools/swiftlab/). BUt, > I didn't find the folder sw in my machine, and the file > sites-chicago.xml neither. So, I can't let my program run at a remote > host. How to solve this? Those instructions won't work if you are working on your own machine. Have you ever used Globus to run a job on the grid before? If so, then I can show you how to use Swift to submit jobs from there using your existing setup. If you have not, then you should get set up to submit jobs to some execution system first - for example, apply for an account on the CI gridlab at http://www.ci.uchicago.edu/osgedu/schools/gridlab/ -- From benc at hawaga.org.uk Wed Feb 27 02:32:43 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Feb 2008 08:32:43 +0000 (GMT) Subject: [Swift-devel] Re: [Newslab] Re: Getting all RSS data into database In-Reply-To: <4290b6c60802260953q1098c639s7eb0db3531052868@mail.gmail.com> References: <47C2E38D.6060700@mcs.anl.gov> <4290b6c60802250920p108c77f6j17879871c532490@mail.gmail.com> <47C2FA1B.40509@mcs.anl.gov> <4290b6c60802251006w7694ectbe773d6211ba8cbb@mail.gmail.com> <4290b6c60802260831q79b1acbeoaf8a454c09e6a9a0@mail.gmail.com> <4290b6c60802260953q1098c639s7eb0db3531052868@mail.gmail.com> Message-ID: Note: I added swift-devel to this piece of the thread because it is relevant there; and perhaps now not so relevant to the newslab list. On Tue, 26 Feb 2008, Quan Tran Pham wrote: > > What I think you are trying to do is merge a bunch of files into a single > big file? (which is not in itself a merge sort) > I have merge2 that merge two sorted files (contain sorted key + value) into > one big sorted file. A different way of thinking about this, which is perhaps more of interest to the swift development group rather than newslab directly: Define a binary operator like >+ meaning somthing like ordered-concatenate, which will combine two files in the appropriate ordered fashion. file >+ file --> file This operator is commutative. Then have foldC able to fold knowing that the supplied operator is commutative (so it can split up in a binary fashion, or however other way it cares to). Now say: file[] inputs file output output = foldC (>+) inputs Perhaps foldC should be provided by Swift, with >+ provided as a procedure. -- From benc at hawaga.org.uk Wed Feb 27 03:20:11 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Feb 2008 09:20:11 +0000 (GMT) Subject: [Swift-devel] compile time error handling Message-ID: Over the past day or so, I've committed a bunch of compile time error handling changes. r1691 is the last one of those for now. Some more compile error messages will now have source line numbers in them. There is more compile-time static analysis of the program, which should result in errors occurring at compile time rather than part-way through workflow execution. Programs which previously ran OK should still compile OK. As always, indicate here if not. 
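Coming back to the foldC sketch a couple of messages up: the reason a commutative combining operator helps is that the runtime is free to pair up inputs in any order and reduce them as a tree rather than strictly left to right. Outside Swift, the same idea can be tried in plain shell, using "sort -m" (ordered merge) as a stand-in for the >+ ordered-concatenate operator; the file names here are illustrative:

# illustrative only: reduce a set of sorted files pairwise,
# with "sort -m" standing in for the >+ operator
set -- part-*.sorted
round=0
while [ $# -gt 1 ]; do
    next=""
    while [ $# -ge 2 ]; do
        out="merge-$round-$#.sorted"
        sort -m "$1" "$2" > "$out"
        next="$next $out"
        shift 2
    done
    [ $# -eq 1 ] && next="$next $1"
    set -- $next
    round=$((round + 1))
done
mv "$1" merged.sorted

Each pass halves the number of files, and the merges within a pass are independent of one another, which is exactly the parallelism a foldC that knows the operator is commutative could exploit.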
-- From benc at hawaga.org.uk Wed Feb 27 03:45:25 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Feb 2008 09:45:25 +0000 (GMT) Subject: [Swift-devel] get an @arg as an int. Message-ID: I want to make my load tests take the number of procedures to run as a commandline @arg. So I want to iterate over [1:@arg(foo)] or something like that. But @arg(...) has type string. I have a straightforward implementation of @toint(string) that fixes this. I am however slightly concerned about a lack of coherency in what can be cast to what / what can be read from a file / the forms those things take (eg. @extractint, readdata, @toint) -- From benc at hawaga.org.uk Wed Feb 27 08:07:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 27 Feb 2008 14:07:19 +0000 (GMT) Subject: [Swift-devel] Re: some racelike condition on file stagein In-Reply-To: References: Message-ID: File.mkdirs() is not thread-safe, according to typing "threadsafe java mkdirs" into google. I applied the below patch to my cog checkout and the error goes away for me. However, I'm not a cog developer, so someone else needs to fix this in the CoG SVN. Index: cog/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java =================================================================== --- cog.orig/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java 2007-08-27 09:30:23.000000000 +0100 +++ cog/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java 2008-02-27 13:51:42.000000000 +0000 @@ -146,15 +146,19 @@ } } +static Object mkdirlock = new Object(); + public void createDirectories(String directory) throws FileResourceException { if (directory == null || directory.equals("")) { return; } File f = resolve(directory); + synchronized(mkdirlock) { if (!f.mkdirs() && !f.exists()) { throw new FileResourceException("Failed to create directory: " + directory); } + } } public void deleteDirectory(String dir, boolean force) throws FileResourceException { From hategan at mcs.anl.gov Wed Feb 27 09:33:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 27 Feb 2008 09:33:36 -0600 Subject: [Swift-devel] Re: some racelike condition on file stagein In-Reply-To: References: Message-ID: <1204126416.17698.7.camel@blabla.mcs.anl.gov> Right. The same problem likely applies to gridftp. On Wed, 2008-02-27 at 14:07 +0000, Ben Clifford wrote: > File.mkdirs() is not thread-safe, according to typing "threadsafe java > mkdirs" into google. > > I applied the below patch to my cog checkout and the error goes away for > me. However, I'm not a cog developer, so someone else needs to fix this in > the CoG SVN. 
> > Index: cog/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java > =================================================================== > --- cog.orig/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java 2007-08-27 09:30:23.000000000 +0100 > +++ cog/modules/provider-local/src/org/globus/cog/abstraction/impl/file/local/FileResourceImpl.java 2008-02-27 13:51:42.000000000 +0000 > @@ -146,15 +146,19 @@ > } > } > > +static Object mkdirlock = new Object(); > + > public void createDirectories(String directory) > throws FileResourceException { > if (directory == null || directory.equals("")) { > return; > } > File f = resolve(directory); > + synchronized(mkdirlock) { > if (!f.mkdirs() && !f.exists()) { > throw new FileResourceException("Failed to create directory: " + directory); > } > + } > } > > public void deleteDirectory(String dir, boolean force) throws FileResourceException { > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From quanpt at gmail.com Wed Feb 27 10:00:12 2008 From: quanpt at gmail.com (Quan Tran Pham) Date: Wed, 27 Feb 2008 10:00:12 -0600 Subject: [Swift-devel] Re: [Newslab] Re: Getting all RSS data into database In-Reply-To: References: <47C2E38D.6060700@mcs.anl.gov> <4290b6c60802250920p108c77f6j17879871c532490@mail.gmail.com> <47C2FA1B.40509@mcs.anl.gov> <4290b6c60802251006w7694ectbe773d6211ba8cbb@mail.gmail.com> <4290b6c60802260831q79b1acbeoaf8a454c09e6a9a0@mail.gmail.com> <4290b6c60802260953q1098c639s7eb0db3531052868@mail.gmail.com> Message-ID: <4290b6c60802270800j6577a5e2ue684164a2fe4bac4@mail.gmail.com> I would support the idea. That foldC-like function has been used in some other languages: + reduce in python (they have order from left to right, by the way) + reduce phase in MapReduce programming model ( http://labs.google.com/papers/mapreduce.html) Quan On Wed, Feb 27, 2008 at 2:32 AM, Ben Clifford wrote: > > Note: I added swift-devel to this piece of the thread because it is > relevant there; and perhaps now not so relevant to the newslab list. > > On Tue, 26 Feb 2008, Quan Tran Pham wrote: > > > > What I think you are trying to do is merge a bunch of files into a > single > > big file? (which is not in itself a merge sort) > > > I have merge2 that merge two sorted files (contain sorted key + value) > into > > one big sorted file. > > A different way of thinking about this, which is perhaps more of interest > to the swift development group rather than newslab directly: > > Define a binary operator like >+ meaning somthing like > ordered-concatenate, which will combine two files in the appropriate > ordered fashion. > > file >+ file --> file > > This operator is commutative. > > Then have foldC able to fold knowing that the supplied operator is > commutative (so it can split up in a binary fashion, or however other way > it cares to). > > Now say: > > file[] inputs > file output > output = foldC (>+) inputs > > Perhaps foldC should be provided by Swift, with >+ provided as a > procedure. > > -- > -- Quan Tran Pham PhD Student Department of Computer Science University of Chicago 1100 E 58th Street, Chicago, IL 60637 Office: Ryerson 178 Phone: (773)702-4227 Fax: (773)702-8487 quanpt at cs.uchicago.edu --- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Thu Feb 28 16:22:34 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 28 Feb 2008 22:22:34 +0000 (GMT) Subject: [Swift-devel] runtime console stats Message-ID: In the style of the RFT client, I implemented a runtime progress ticker that every few seconds outputs a line of how many jobs are in each internal state. See below for an example output. I think exposing the various internal states on the console is a useful thing to do. The states in the below example are a bit lame - should probably have something like: Waiting for a site to be allocated; staging in; submitted for execution; staging out; all finished.. $ swift 130-fmri.swift Swift v0.3-dev r1689 (modified locally) RunID: 20080228-1619-xkb5elaf Progress: touch started touch started touch started touch started Progress: EXECUTE:3 STAGEOUT:1 START:4 touch completed touch completed touch completed touch completed touch started Progress: EXECUTE2DONE:1 END:4 START:3 touch completed touch started touch started touch started touch completed touch completed touch started touch started touch completed touch started Progress: EXECUTE:2 EXECUTE2:1 END:8 touch completed touch completed touch completed Final status: END:11 From foster at mcs.anl.gov Thu Feb 28 18:06:58 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Thu, 28 Feb 2008 18:06:58 -0600 Subject: [Swift-devel] runtime console stats In-Reply-To: References: Message-ID: <47C74CA2.4090606@mcs.anl.gov> cool! Ben Clifford wrote: > In the style of the RFT client, I implemented a runtime progress ticker > that every few seconds outputs a line of how many jobs are in each > internal state. See below for an example output. > > I think exposing the various internal states on the console is a useful > thing to do. > > The states in the below example are a bit lame - should probably have > something like: Waiting for a site to be allocated; staging in; submitted > for execution; staging out; all finished.. > > > $ swift 130-fmri.swift > Swift v0.3-dev r1689 (modified locally) > > RunID: 20080228-1619-xkb5elaf > Progress: > touch started > touch started > touch started > touch started > Progress: EXECUTE:3 STAGEOUT:1 START:4 > touch completed > touch completed > touch completed > touch completed > touch started > Progress: EXECUTE2DONE:1 END:4 START:3 > touch completed > touch started > touch started > touch started > touch completed > touch completed > touch started > touch started > touch completed > touch started > Progress: EXECUTE:2 EXECUTE2:1 END:8 > touch completed > touch completed > touch completed > Final status: END:11 > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Fri Feb 29 06:51:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 29 Feb 2008 12:51:53 +0000 (GMT) Subject: [Swift-devel] execute side md5sum Message-ID: Its pretty straightforward to modify the wrapper to take a hash (eg md5sum) of input files before and output files after execution (I made a prototype yesterday afternoon) and log those hashes. This gives a convenient summary of the content of the inputs and outputs that is automated and hard to break through lack of attention; and so is probably useful for questions like "was this run with the same version or a different version of a particular input file". 
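As a rough illustration of the wrapper change being proposed (the variable names here are made up; the real wrapper script keeps its own notion of the staged-in and staged-out file lists and of where its log goes):

# illustrative only: hash declared input files before the application runs
# and declared output files afterwards, appending to the per-job log
for f in $STAGED_IN_FILES; do
    md5sum "$f" >> "$INFO_LOG"
done
run_the_application   # stands in for the wrapper's normal application invocation
for f in $STAGED_OUT_FILES; do
    md5sum "$f" >> "$INFO_LOG"
done

Any stable digest would do here; md5sum is simply the one named in the prototype described above.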
(this is not some universal versioning solution that will solve the unsolvable; instead it provides an answer to 'was this file the same or different than some other file?' over which can be laid other exciting versioning systems) I think having something like this is probably useful optional functionality (enabled in the same was as kickstart, perhaps). -- From itf at mcs.anl.gov Fri Feb 29 07:03:23 2008 From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=) Date: Fri, 29 Feb 2008 13:03:23 +0000 Subject: [Swift-devel] execute side md5sum In-Reply-To: References: Message-ID: <1468697544-1204290242-cardhu_decombobulator_blackberry.rim.net-447726758-@bxe122.bisx.prod.on.blackberry> Definitely. How about the exeuctable as well? Ian Sent via BlackBerry from T-Mobile -----Original Message----- From: Ben Clifford Date: Fri, 29 Feb 2008 12:51:53 To:swift-devel at ci.uchicago.edu Subject: [Swift-devel] execute side md5sum Its pretty straightforward to modify the wrapper to take a hash (eg md5sum) of input files before and output files after execution (I made a prototype yesterday afternoon) and log those hashes. This gives a convenient summary of the content of the inputs and outputs that is automated and hard to break through lack of attention; and so is probably useful for questions like "was this run with the same version or a different version of a particular input file". (this is not some universal versioning solution that will solve the unsolvable; instead it provides an answer to 'was this file the same or different than some other file?' over which can be laid other exciting versioning systems) I think having something like this is probably useful optional functionality (enabled in the same was as kickstart, perhaps). -- _______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel