From hategan at mcs.anl.gov Wed Apr 1 10:22:55 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 01 Apr 2009 10:22:55 -0500 Subject: [Swift-user] possible coasters problem In-Reply-To: <49D29BF2.6060101@uchicago.edu> References: <49D29BF2.6060101@uchicago.edu> Message-ID: <1238599375.9751.0.camel@localhost> On Tue, 2009-03-31 at 17:40 -0500, Glen Hocky wrote: > Hi Guys, > Do you think this is a problem with coasters or just the way i'm using it... It's a problem with coasters. What version of cog/swift is this? > > Thanks, > Glen > > Exception in runoops: > > Arguments: [input/fasta/T1ubq.fasta, > > teraportoutdir.100/T1ubq/T1ubq.ST50.TU200.0000.secseq, > > input/native/T1ubq.pdb, > > teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.pdt, > > teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.rmsd, > > 6, DEFAULT_INIT_TEMP_=_50, TEMP_UPDATE_INTERVAL_=_200, > > MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_55] > > Host: teraport > > Directory: oops-20090331-1701-fpuie7be/jobs/d/runoops-dsmccq8j > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > java.lang.IllegalArgumentException: No worker with id=1956306968 > > at > > org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:85) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.CoasterQueueProcessor.run(CoasterQueueProcessor.java:71) > > Caused by: java.lang.IllegalArgumentException: No worker with > > id=1956306968 > > at > > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.getChannelContext(WorkerManager.java:483) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:78) > > ... 1 more > > > > Cleaning up... > > Shutting down service at https://128.135.125.118:55513 > > Got channel MetaChannel: 22129174 -> GSSSChannel-null(1) > > - Done > > Command exited with non-zero status 2 > > real 1628.27 > > user 169.87 > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Wed Apr 1 10:28:48 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 01 Apr 2009 10:28:48 -0500 Subject: [Swift-user] possible coasters problem In-Reply-To: <1238599375.9751.0.camel@localhost> References: <49D29BF2.6060101@uchicago.edu> <1238599375.9751.0.camel@localhost> Message-ID: <49D38830.8070903@mcs.anl.gov> I think it was run on cog 2349, swift 2787. 
com$ svn info cog Path: cog URL: https://cogkit.svn.sourceforge.net/svnroot/cogkit/trunk/current/src/cog Repository Root: https://cogkit.svn.sourceforge.net/svnroot/cogkit Repository UUID: 5b74d2a0-fa0e-0410-85ed-ffba77ec0bde Revision: 2349 Node Kind: directory Schedule: normal Last Changed Author: hategan Last Changed Rev: 2349 Last Changed Date: 2009-03-29 14:58:40 -0500 (Sun, 29 Mar 2009) com$ svn info cog/modules/swift Path: cog/modules/swift URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk Repository Root: https://svn.ci.uchicago.edu/svn/vdl2 Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8 Revision: 2788 Node Kind: directory Schedule: normal Last Changed Author: hategan Last Changed Rev: 2787 Last Changed Date: 2009-03-30 19:31:33 -0500 (Mon, 30 Mar 2009) com$ On 4/1/09 10:22 AM, Mihael Hategan wrote: > On Tue, 2009-03-31 at 17:40 -0500, Glen Hocky wrote: >> Hi Guys, >> Do you think this is a problem with coasters or just the way i'm using it... > > It's a problem with coasters. > > What version of cog/swift is this? > >> Thanks, >> Glen >>> Exception in runoops: >>> Arguments: [input/fasta/T1ubq.fasta, >>> teraportoutdir.100/T1ubq/T1ubq.ST50.TU200.0000.secseq, >>> input/native/T1ubq.pdb, >>> teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.pdt, >>> teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.rmsd, >>> 6, DEFAULT_INIT_TEMP_=_50, TEMP_UPDATE_INTERVAL_=_200, >>> MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_55] >>> Host: teraport >>> Directory: oops-20090331-1701-fpuie7be/jobs/d/runoops-dsmccq8j >>> stderr.txt: >>> >>> stdout.txt: >>> >>> ---- >>> >>> Caused by: >>> >>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >>> java.lang.IllegalArgumentException: No worker with id=1956306968 >>> at >>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:85) >>> at >>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterQueueProcessor.run(CoasterQueueProcessor.java:71) >>> Caused by: java.lang.IllegalArgumentException: No worker with >>> id=1956306968 >>> at >>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.getChannelContext(WorkerManager.java:483) >>> at >>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:78) >>> ... 1 more >>> >>> Cleaning up... >>> Shutting down service at https://128.135.125.118:55513 >>> Got channel MetaChannel: 22129174 -> GSSSChannel-null(1) >>> - Done >>> Command exited with non-zero status 2 >>> real 1628.27 >>> user 169.87 >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From benc at hawaga.org.uk Wed Apr 1 10:36:06 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 1 Apr 2009 15:36:06 +0000 (GMT) Subject: [Swift-user] possible coasters problem In-Reply-To: <49D38830.8070903@mcs.anl.gov> References: <49D29BF2.6060101@uchicago.edu> <1238599375.9751.0.camel@localhost> <49D38830.8070903@mcs.anl.gov> Message-ID: On Wed, 1 Apr 2009, Michael Wilde wrote: > I think it was run on cog 2349, swift 2787. 
You can tell more accurately from the run log with a command like this (that is, it will give what the executing Swift thought it was, rather than what was in your repo). grep swift-r pc3-20090331-1506-bk02i344.log -- From wilde at mcs.anl.gov Wed Apr 1 10:39:10 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 01 Apr 2009 10:39:10 -0500 Subject: [Swift-user] possible coasters problem In-Reply-To: References: <49D29BF2.6060101@uchicago.edu> <1238599375.9751.0.camel@localhost> <49D38830.8070903@mcs.anl.gov> Message-ID: <49D38A9E.3030006@mcs.anl.gov> certainly. Glen, can you provide that? I didnt have the time/info to hunt that down, sorry. On 4/1/09 10:36 AM, Ben Clifford wrote: > On Wed, 1 Apr 2009, Michael Wilde wrote: > >> I think it was run on cog 2349, swift 2787. > > You can tell more accurately from the run log with a command like this > (that is, it will give what the executing Swift thought it was, rather > than what was in your repo). > > grep swift-r pc3-20090331-1506-bk02i344.log > From hockyg at uchicago.edu Wed Apr 1 13:56:55 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Wed, 01 Apr 2009 13:56:55 -0500 Subject: [Swift-user] possible coasters problem In-Reply-To: <49D38830.8070903@mcs.anl.gov> References: <49D29BF2.6060101@uchicago.edu> <1238599375.9751.0.camel@localhost> <49D38830.8070903@mcs.anl.gov> Message-ID: <49D3B8F7.2060008@uchicago.edu> it's "Swift svn swift-r2788 cog-r2349" Michael Wilde wrote: > I think it was run on cog 2349, swift 2787. > > com$ svn info cog > Path: cog > URL: > https://cogkit.svn.sourceforge.net/svnroot/cogkit/trunk/current/src/cog > Repository Root: https://cogkit.svn.sourceforge.net/svnroot/cogkit > Repository UUID: 5b74d2a0-fa0e-0410-85ed-ffba77ec0bde > Revision: 2349 > Node Kind: directory > Schedule: normal > Last Changed Author: hategan > Last Changed Rev: 2349 > Last Changed Date: 2009-03-29 14:58:40 -0500 (Sun, 29 Mar 2009) > > com$ svn info cog/modules/swift > Path: cog/modules/swift > URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk > Repository Root: https://svn.ci.uchicago.edu/svn/vdl2 > Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8 > Revision: 2788 > Node Kind: directory > Schedule: normal > Last Changed Author: hategan > Last Changed Rev: 2787 > Last Changed Date: 2009-03-30 19:31:33 -0500 (Mon, 30 Mar 2009) > > com$ > > > On 4/1/09 10:22 AM, Mihael Hategan wrote: >> On Tue, 2009-03-31 at 17:40 -0500, Glen Hocky wrote: >>> Hi Guys, >>> Do you think this is a problem with coasters or just the way i'm >>> using it... >> >> It's a problem with coasters. >> >> What version of cog/swift is this? 
>> >>> Thanks, >>> Glen >>>> Exception in runoops: >>>> Arguments: [input/fasta/T1ubq.fasta, >>>> teraportoutdir.100/T1ubq/T1ubq.ST50.TU200.0000.secseq, >>>> input/native/T1ubq.pdb, >>>> teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.pdt, >>>> teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.rmsd, >>>> 6, DEFAULT_INIT_TEMP_=_50, TEMP_UPDATE_INTERVAL_=_200, >>>> MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_55] >>>> Host: teraport >>>> Directory: oops-20090331-1701-fpuie7be/jobs/d/runoops-dsmccq8j >>>> stderr.txt: >>>> >>>> stdout.txt: >>>> >>>> ---- >>>> >>>> Caused by: >>>> >>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >>>> java.lang.IllegalArgumentException: No worker with id=1956306968 >>>> at >>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:85) >>>> >>>> at >>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterQueueProcessor.run(CoasterQueueProcessor.java:71) >>>> >>>> Caused by: java.lang.IllegalArgumentException: No worker with >>>> id=1956306968 >>>> at >>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.getChannelContext(WorkerManager.java:483) >>>> >>>> at >>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:78) >>>> >>>> ... 1 more >>>> >>>> Cleaning up... >>>> Shutting down service at https://128.135.125.118:55513 >>>> Got channel MetaChannel: 22129174 -> GSSSChannel-null(1) >>>> - Done >>>> Command exited with non-zero status 2 >>>> real 1628.27 >>>> user 169.87 >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From hockyg at uchicago.edu Thu Apr 2 15:06:19 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 02 Apr 2009 15:06:19 -0500 Subject: [Swift-user] coasters problem on teraport Message-ID: <49D51ABB.80807@uchicago.edu> I get the following error trying to run on teraport w/ coasters > Progress: Submitting:4 Failed:1 Finished successfully:10 > Execution failed: > Exception in runoops: > Arguments: [input/fasta/T1dcj.fasta, > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj/T1dcj.ST10.TU10.0000.secseq, > input/native/T1dcj.pdb, > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj//ST10.TU10/0000/00/00/T1dcj.ST10.TU10.0000.0000.pdt, > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj//ST10.TU10/0000/00/00/T1dcj.ST10.TU10.0000.0000.rmsd, > 0, DEFAULT_INIT_TEMP_=_10, TEMP_UPDATE_INTERVAL_=_10, > MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_30] > Host: teraport > Directory: oops-20090402-1307-6ud4sy60/jobs/9/runoops-9h3jft8j > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Could not submit job > Caused by: > Could not start coaster service > Caused by: > Cannot submit job > Caused by: > The job manager failed to open stderr > Cleaning up... 
> Done > Command exited with non-zero status 2 > real 189.93 > user 13.20 > sys 1.94 With sites files containing: > > fast > key="coasterWorkerMaxwalltime">01:00:00 > url="coaster-gt2://tp-grid1.ci.uchicago.edu" /> > jobmanager="gt2:gt2:pbs" /> > /home/hockyg/swiftwork > > > fast > key="coasterWorkerMaxwalltime">01:00:00 > > jobmanager="gt2:gt2:pbs" /> > /home/hockyg/swiftwork > From hockyg at uchicago.edu Thu Apr 2 15:08:54 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 02 Apr 2009 15:08:54 -0500 Subject: [Swift-user] re: coasters problem on terapo Message-ID: <49D51B56.7050500@uchicago.edu> More details, sorry: Swift svn swift-r2809 cog-r2350 From hategan at mcs.anl.gov Thu Apr 2 15:15:10 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 02 Apr 2009 15:15:10 -0500 Subject: [Swift-user] coasters problem on teraport In-Reply-To: <49D51ABB.80807@uchicago.edu> References: <49D51ABB.80807@uchicago.edu> Message-ID: <1238703310.13579.1.camel@localhost> Swift cannot properly guess your client machine's address. Do export GLOBUS_HOSTNAME=your.address.or.ip before invoking swift. On Thu, 2009-04-02 at 15:06 -0500, Glen Hocky wrote: > I get the following error trying to run on teraport w/ coasters > > Progress: Submitting:4 Failed:1 Finished successfully:10 > > Execution failed: > > Exception in runoops: > > Arguments: [input/fasta/T1dcj.fasta, > > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj/T1dcj.ST10.TU10.0000.secseq, > > input/native/T1dcj.pdb, > > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj//ST10.TU10/0000/00/00/T1dcj.ST10.TU10.0000.0000.pdt, > > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj//ST10.TU10/0000/00/00/T1dcj.ST10.TU10.0000.0000.rmsd, > > 0, DEFAULT_INIT_TEMP_=_10, TEMP_UPDATE_INTERVAL_=_10, > > MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_30] > > Host: teraport > > Directory: oops-20090402-1307-6ud4sy60/jobs/9/runoops-9h3jft8j > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > Could not submit job > > Caused by: > > Could not start coaster service > > Caused by: > > Cannot submit job > > Caused by: > > The job manager failed to open stderr > > Cleaning up... > > Done > > Command exited with non-zero status 2 > > real 189.93 > > user 13.20 > > sys 1.94 > With sites files containing: > > > > fast > > > key="coasterWorkerMaxwalltime">01:00:00 > > > url="coaster-gt2://tp-grid1.ci.uchicago.edu" /> > > > jobmanager="gt2:gt2:pbs" /> > > /home/hockyg/swiftwork > > > > > > fast > > > key="coasterWorkerMaxwalltime">01:00:00 > > > > > jobmanager="gt2:gt2:pbs" /> > > /home/hockyg/swiftwork > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Thu Apr 2 17:19:28 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 02 Apr 2009 17:19:28 -0500 Subject: [Swift-user] Re: installing swift In-Reply-To: References: Message-ID: <49D539F0.4090206@mcs.anl.gov> Hi Marco, I am not sure about cog-setup, but I would skip that and proceed to the examples, after putting bin/ in your path and making sure that you have CA certificates so you can make a proxy (I assume you get all this from your OSG environment). (I assume that was removed on purpose and the quickstart was not updated to reflect it. We need to fix that). Note that Swift's bin/ provides some tools that are also in OSG bin dirs. 
And you should direct all your questions to this list, swift-user, because that's where the developers listen for users who need help. To test if its working, do the first example in the Swift tutorial: http://www.ci.uchicago.edu/swift/guides/tutorial.php Then move on to the Swift tutorial for the OSG Grid School, I think, would be the best approach. - Mike On 4/2/09 5:04 PM, Marco Mambelli wrote: > Hi Mike, > I'm setting up another host that could be used for the workshop (in case > the VM has problem, to have a VDT supported platform). > > I need some help installing swift. > I tried to follow the instructions in > http://www.ci.uchicago.edu/swift/guides/quickstartguide.php > > The files cog-setup and example.swift are not in the tarfiles (version > 8) that I downloaded. > > I don't know exactly how to configure it and/or test if it is working. > > Thank you, > Marco From benc at hawaga.org.uk Thu Apr 2 17:28:13 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 2 Apr 2009 22:28:13 +0000 (GMT) Subject: [Swift-user] Re: installing swift In-Reply-To: <49D539F0.4090206@mcs.anl.gov> References: <49D539F0.4090206@mcs.anl.gov> Message-ID: Yeah, that's quite out of date, it seems. If you're installing on a machine with an OSG stack, get the version without extra stuff (swift-0.8-stripped.tar.gz) from the download page. Untar it, and put its bin/ directory on your system path. The stripped version does not contain commands like grid-proxy-init, to avoid conflict with real versions deployed elsewhere (i.e. in an OSG install). To test, go into the examples/swift/ directory, type: swift first.swift and check that a file hello.txt appears. -- From wilde at mcs.anl.gov Thu Apr 2 21:01:22 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 02 Apr 2009 21:01:22 -0500 Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs by CPU Message-ID: <49D56DF2.7060005@mcs.anl.gov> Some sites, like TeraPort, (I think) place independent jobs on all CPUS. When using coasters, is it true that the user should not specify coastersPerNode? Or at least not set it to > 1? We should clarify this in the users guide. From wilde at mcs.anl.gov Thu Apr 2 21:20:05 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 02 Apr 2009 21:20:05 -0500 Subject: [Swift-user] Please clarify throttle parameters for coasters and GT2 issues Message-ID: <49D57255.9090605@mcs.anl.gov> I (and colleagues Im working with) have a few related questions: At some point guidelines were posted regarding "safe" throttle values for sending to GT2 GRAM sites. I recall "max 40 jobs". Can you clarify if that number is still the best practice, and how to the 4-5 throttle parameters to conform? Then, do those same values apply to coasters? Finally, with the recent successes in high-volume coaster runs on Ranger - 190K jobs in ~ 5 hours, and even the earlier runs of 65K jobs - how were those achieved without tripping into the GRAM overhead limits, given that Ranger as far as I know has only GT2 GRAM and even submitting locallly to SGE must go through GRAM since we have no direct SGE provider? Are the "safe" limits for Ranger simply higer,or is there something else involved that makes this practical? In other words, please share and post how to get lots of jobs through Ranger. 
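(For reference, the throttle parameters discussed here live in Swift's swift.properties file. The snippet below is only an illustrative, conservative GT2-oriented sketch, not necessarily the shipped defaults; see the replies that follow for concrete guidance.)

    # Illustrative, conservative values only -- tune per site and provider.
    # Concurrent job submissions, overall and per site:
    throttle.submit=4
    throttle.host.submit=2
    # Scales how many jobs a well-scoring site may hold at once:
    throttle.score.job.factor=4
    # Concurrent file transfers and remote file operations:
    throttle.transfers=4
    throttle.file.operations=8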
From hategan at mcs.anl.gov Thu Apr 2 22:52:20 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 02 Apr 2009 22:52:20 -0500 Subject: [Swift-user] Please clarify throttle parameters for coasters and GT2 issues In-Reply-To: <49D57255.9090605@mcs.anl.gov> References: <49D57255.9090605@mcs.anl.gov> Message-ID: <1238730740.21734.2.camel@localhost> On Thu, 2009-04-02 at 21:20 -0500, Michael Wilde wrote: > I (and colleagues Im working with) have a few related questions: > > At some point guidelines were posted regarding "safe" throttle values > for sending to GT2 GRAM sites. I recall "max 40 jobs". Can you clarify > if that number is still the best practice, Yes. GT2 hasn't changed much since. > and how to the 4-5 throttle > parameters to conform? The defaults are pretty much geared towards the gt2/gridftp combo. > > Then, do those same values apply to coasters? No. Throttles in the range of 32-256 (or maybe even more) are not unreasonable with coasters. > > Finally, with the recent successes in high-volume coaster runs on Ranger > - 190K jobs in ~ 5 hours, and even the earlier runs of 65K jobs - how > were those achieved without tripping into the GRAM overhead limits, > given that Ranger as far as I know has only GT2 GRAM and even submitting > locallly to SGE must go through GRAM since we have no direct SGE > provider? Are the "safe" limits for Ranger simply higer,or is there > something else involved that makes this practical? In other words, > please share and post how to get lots of jobs through Ranger. coasterWorkersPerNode=16. That gives you 640 cpus with exactly 40 gram jobs. From hategan at mcs.anl.gov Thu Apr 2 23:05:13 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 02 Apr 2009 23:05:13 -0500 Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs by CPU In-Reply-To: <49D56DF2.7060005@mcs.anl.gov> References: <49D56DF2.7060005@mcs.anl.gov> Message-ID: <1238731513.22128.0.camel@localhost> On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote: > Some sites, like TeraPort, (I think) place independent jobs on all CPUS. > > When using coasters, is it true that the user should not specify > coastersPerNode? Or at least not set it to > 1? Yes. I believe "coastersPerNode" is misleading. It should probably be "coastersPerWorkerJob", but that may sound cryptic. > > We should clarify this in the users guide. > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From hockyg at uchicago.edu Fri Apr 3 01:05:16 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Fri, 03 Apr 2009 01:05:16 -0500 Subject: [Swift-user] what does this kind of error mean? Message-ID: <49D5A71C.8030302@uchicago.edu> log attached Thanks, Glen p.s. can i get it to run more than 37-38 jobs concurrently on one site? -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: teraport.out.5000 URL: From benc at hawaga.org.uk Fri Apr 3 02:36:22 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 3 Apr 2009 07:36:22 +0000 (GMT) Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs by CPU In-Reply-To: <49D56DF2.7060005@mcs.anl.gov> References: <49D56DF2.7060005@mcs.anl.gov> Message-ID: On Thu, 2 Apr 2009, Michael Wilde wrote: > Some sites, like TeraPort, (I think) place independent jobs on all CPUS. > > When using coasters, is it true that the user should not specify > coastersPerNode? 
Or at least not set it to > 1? Pretty much, yes. -- From benc at hawaga.org.uk Fri Apr 3 02:41:44 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 3 Apr 2009 07:41:44 +0000 (GMT) Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs by CPU In-Reply-To: References: <49D56DF2.7060005@mcs.anl.gov> Message-ID: On Fri, 3 Apr 2009, Ben Clifford wrote: > > Some sites, like TeraPort, (I think) place independent jobs on all CPUS. > > > > When using coasters, is it true that the user should not specify > > coastersPerNode? Or at least not set it to > 1? > > Pretty much, yes. * Although it provides way to get node overcomitting, which I think in some applications is good (i.e. we have two cores, try to run 4 jobs on them). * If you're running on a node that allocates jobs per CPU by default, its probably going to present less load to the worker submission system (eg GRAM2 or PBS) if you can make it submit one per physical machine and have coastersPerNode allocate the CPUs instead of PBS. This is what you'd do in qsub with the ppn option. Off the top of my head, I don't know how (or if its possible at all) with coasters+gram2+pbs. -- From wilde at mcs.anl.gov Fri Apr 3 07:55:16 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 03 Apr 2009 07:55:16 -0500 Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs by CPU In-Reply-To: References: <49D56DF2.7060005@mcs.anl.gov> Message-ID: <49D60734.6070102@mcs.anl.gov> On 4/3/09 2:41 AM, Ben Clifford wrote: > On Fri, 3 Apr 2009, Ben Clifford wrote: > >>> Some sites, like TeraPort, (I think) place independent jobs on all CPUS. >>> >>> When using coasters, is it true that the user should not specify >>> coastersPerNode? Or at least not set it to > 1? >> Pretty much, yes. > > * Although it provides way to get node overcomitting, which I think in > some applications is good (i.e. we have two cores, try to run 4 jobs on > them). Sounds reasonable; would be good to try for IO-bound jobs. > * If you're running on a node that allocates jobs per CPU by default, its > probably going to present less load to the worker submission system (eg > GRAM2 or PBS) if you can make it submit one per physical machine and have > coastersPerNode allocate the CPUs instead of PBS. This is what you'd do in > qsub with the ppn option. Off the top of my head, I don't know how (or if > its possible at all) with coasters+gram2+pbs. I see the possibility (this could make the "allocate all the cores I want in one job" feature work for us today with no code change). But I dont understand the mechanism, even w/o GRAM. Youre implying this would work with coasters today, in provider=local:pbs mode, right? But how would you specify it? Lets say you want 32 cores, on teraport. if you say coastersPerNode=32, you would get 32 coasters per core (overcommiting, as before). Did you mean "If you're running on a node that allocates jobs per HOST by default"? Eg, systems like Abe and Ranger, the systems with substantial cores per host (8,16)? So say we run on Abe, which I think has PBS, and say coastersPerNode=32, you think we would get 4 hosts running 32 coasters, 8 per host, 1 per core? That would be cool to try, and then to try over GRAM. But this direction depends somewhat on how Mihael will specify and design the coaster provisioning feature. From hategan at mcs.anl.gov Fri Apr 3 10:16:09 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 03 Apr 2009 10:16:09 -0500 Subject: [Swift-user] what does this kind of error mean? 
In-Reply-To: <49D5A71C.8030302@uchicago.edu> References: <49D5A71C.8030302@uchicago.edu> Message-ID: <1238771769.23515.2.camel@localhost> The GridFTP error, I don't know. What are your throttling parameters in swift.properties? From hockyg at uchicago.edu Fri Apr 3 10:17:27 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Fri, 03 Apr 2009 10:17:27 -0500 Subject: [Swift-user] what does this kind of error mean? In-Reply-To: <1238771769.23515.2.camel@localhost> References: <49D5A71C.8030302@uchicago.edu> <1238771769.23515.2.camel@localhost> Message-ID: <49D62887.9060807@uchicago.edu> throttle.submit=10 throttle.host.submit=10 throttle.score.job.factor=100.0 throttle.transfers=10 throttle.file.operations=10 Mihael Hategan wrote: > The GridFTP error, I don't know. What are your throttling parameters in > swift.properties? > > > From hategan at mcs.anl.gov Fri Apr 3 10:21:29 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 03 Apr 2009 10:21:29 -0500 Subject: [Swift-user] what does this kind of error mean? In-Reply-To: <49D62887.9060807@uchicago.edu> References: <49D5A71C.8030302@uchicago.edu> <1238771769.23515.2.camel@localhost> <49D62887.9060807@uchicago.edu> Message-ID: <1238772089.23804.0.camel@localhost> Those are a bit too high. Do things work with the defaults? On Fri, 2009-04-03 at 10:17 -0500, Glen Hocky wrote: > throttle.submit=10 > throttle.host.submit=10 > throttle.score.job.factor=100.0 > throttle.transfers=10 > throttle.file.operations=10 > > > Mihael Hategan wrote: > > The GridFTP error, I don't know. What are your throttling parameters in > > swift.properties? > > > > > > > From aespinosa at cs.uchicago.edu Fri Apr 3 11:58:16 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 3 Apr 2009 09:58:16 -0700 Subject: [Swift-user] Please clarify throttle parameters for coasters and GT2 issues In-Reply-To: <1238730740.21734.2.camel@localhost> References: <49D57255.9090605@mcs.anl.gov> <1238730740.21734.2.camel@localhost> Message-ID: <50b07b4b0904030958o36051630i5b330f7068fc102d@mail.gmail.com> In pushing Ranger's current scheduling policies of 50 SGE jobs, we can push a max of 800 cpus. I have tried this before using the gt2 interface. -Allan On Thu, Apr 2, 2009 at 8:52 PM, Mihael Hategan wrote: > On Thu, 2009-04-02 at 21:20 -0500, Michael Wilde wrote: >> I (and colleagues Im working with) have a few related questions: >> >> At some point guidelines were posted regarding "safe" throttle values >> for sending to GT2 GRAM sites. I recall "max 40 jobs". Can you clarify >> if that number is still the best practice, > > Yes. GT2 hasn't changed much since. > >> ?and how to the 4-5 throttle >> parameters to conform? > > The defaults are pretty much geared towards the gt2/gridftp combo. > >> >> Then, do those same values apply to coasters? > > No. Throttles in the range of 32-256 (or maybe even more) are not > unreasonable with coasters. > >> >> Finally, with the recent successes in high-volume coaster runs on Ranger >> - 190K jobs in ~ 5 hours, and even the earlier runs of 65K jobs - how >> were those achieved without tripping into the GRAM overhead limits, >> given that Ranger as far as I know has only GT2 GRAM and even submitting >> locallly to SGE must go through GRAM since we have no direct SGE >> provider? ?Are the "safe" limits for Ranger simply higer,or is there >> something else involved that makes this practical? ?In other words, >> please share and post how to get lots of jobs through Ranger. > > coasterWorkersPerNode=16. 
That gives you 640 cpus with exactly 40 gram > jobs. > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From benc at hawaga.org.uk Fri Apr 3 13:14:29 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 3 Apr 2009 18:14:29 +0000 (GMT) Subject: [Swift-user] Re: installing swift In-Reply-To: References: <49D539F0.4090206@mcs.anl.gov> Message-ID: Every user can share one swift installation, but your pwd needs to be writable by the user - so run in ~ rather than in the Swift installation directory. On Fri, 3 Apr 2009, Marco Mambelli wrote: > Hi Ben, > the user owning the installation ran succesfully: > Swift 0.8 (stripped) swift-r2448 cog-r2261 > > RunID: 20090403-1305-k22hrieg > Progress: > Final status: Finished successfully:1 > > Another user had permission problems: > swift.log exists, it was created by the first user but it is writable also by > this one (I tryed to append something to it). > Which permission is looking for? does it need to be the owner? Group write is > not sufficient? > Should every user have its own swift installation? > > Below the java trace. > Thanks, > Marco > > > [train02 at grid07 swift]$ swift first.swift > log4j:ERROR setFile(null,true) call failed. > java.io.FileNotFoundException: swift.log (Permission denied) > at java.io.FileOutputStream.openAppend(Native Method) > at java.io.FileOutputStream.(FileOutputStream.java:177) > at java.io.FileOutputStream.(FileOutputStream.java:102) > at org.apache.log4j.FileAppender.setFile(FileAppender.java:272) > at > org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151) > at > org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247) > at > org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123) > at > org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87) > at > org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645) > at > org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603) > at > org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500) > at > org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406) > at > org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432) > at > org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460) > at org.apache.log4j.LogManager.(LogManager.java:113) > at org.apache.log4j.Logger.getLogger(Logger.java:94) > at org.globus.cog.karajan.Loader.(Loader.java:43) > Could not start execution. > first.xml (Permission denied) > > > On Thu, 2 Apr 2009, Ben Clifford wrote: > > > > > Yeah, that's quite out of date, it seems. > > > > If you're installing on a machine with an OSG stack, get the version > > without extra stuff (swift-0.8-stripped.tar.gz) from the download page. > > Untar it, and put its bin/ directory on your system path. > > > > The stripped version does not contain commands like grid-proxy-init, to > > avoid conflict with real versions deployed elsewhere (i.e. in an OSG > > install). > > > > To test, go into the examples/swift/ directory, type: > > > > swift first.swift > > > > and check that a file hello.txt appears. 
> > > > > > From marco at hep.uchicago.edu Fri Apr 3 13:07:49 2009 From: marco at hep.uchicago.edu (Marco Mambelli) Date: Fri, 3 Apr 2009 13:07:49 -0500 (CDT) Subject: [Swift-user] Re: installing swift In-Reply-To: References: <49D539F0.4090206@mcs.anl.gov> Message-ID: Hi Ben, the user owning the installation ran succesfully: Swift 0.8 (stripped) swift-r2448 cog-r2261 RunID: 20090403-1305-k22hrieg Progress: Final status: Finished successfully:1 Another user had permission problems: swift.log exists, it was created by the first user but it is writable also by this one (I tryed to append something to it). Which permission is looking for? does it need to be the owner? Group write is not sufficient? Should every user have its own swift installation? Below the java trace. Thanks, Marco [train02 at grid07 swift]$ swift first.swift log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: swift.log (Permission denied) at java.io.FileOutputStream.openAppend(Native Method) at java.io.FileOutputStream.(FileOutputStream.java:177) at java.io.FileOutputStream.(FileOutputStream.java:102) at org.apache.log4j.FileAppender.setFile(FileAppender.java:272) at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151) at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247) at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123) at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87) at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645) at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603) at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432) at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460) at org.apache.log4j.LogManager.(LogManager.java:113) at org.apache.log4j.Logger.getLogger(Logger.java:94) at org.globus.cog.karajan.Loader.(Loader.java:43) Could not start execution. first.xml (Permission denied) On Thu, 2 Apr 2009, Ben Clifford wrote: > > Yeah, that's quite out of date, it seems. > > If you're installing on a machine with an OSG stack, get the version > without extra stuff (swift-0.8-stripped.tar.gz) from the download page. > Untar it, and put its bin/ directory on your system path. > > The stripped version does not contain commands like grid-proxy-init, to > avoid conflict with real versions deployed elsewhere (i.e. in an OSG > install). > > To test, go into the examples/swift/ directory, type: > > swift first.swift > > and check that a file hello.txt appears. > > From marco at hep.uchicago.edu Fri Apr 3 13:27:44 2009 From: marco at hep.uchicago.edu (Marco Mambelli) Date: Fri, 3 Apr 2009 13:27:44 -0500 (CDT) Subject: [Swift-user] Re: installing swift In-Reply-To: References: <49D539F0.4090206@mcs.anl.gov> Message-ID: My bad, I changed permission to the tree (files and subdir) but not to the directory itself and I was confused by a swift.log somewhere else in the path. swift is installed and works fine Thanks, Marco On Fri, 3 Apr 2009, Ben Clifford wrote: > > Every user can share one swift installation, but your pwd needs to be > writable by the user - so run in ~ rather than in the Swift installation > directory. 
> > On Fri, 3 Apr 2009, Marco Mambelli wrote: > >> Hi Ben, >> the user owning the installation ran succesfully: >> Swift 0.8 (stripped) swift-r2448 cog-r2261 >> >> RunID: 20090403-1305-k22hrieg >> Progress: >> Final status: Finished successfully:1 >> >> Another user had permission problems: >> swift.log exists, it was created by the first user but it is writable also by >> this one (I tryed to append something to it). >> Which permission is looking for? does it need to be the owner? Group write is >> not sufficient? >> Should every user have its own swift installation? >> >> Below the java trace. >> Thanks, >> Marco >> >> >> [train02 at grid07 swift]$ swift first.swift >> log4j:ERROR setFile(null,true) call failed. >> java.io.FileNotFoundException: swift.log (Permission denied) >> at java.io.FileOutputStream.openAppend(Native Method) >> at java.io.FileOutputStream.(FileOutputStream.java:177) >> at java.io.FileOutputStream.(FileOutputStream.java:102) >> at org.apache.log4j.FileAppender.setFile(FileAppender.java:272) >> at >> org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151) >> at >> org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247) >> at >> org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123) >> at >> org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87) >> at >> org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645) >> at >> org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603) >> at >> org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500) >> at >> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406) >> at >> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432) >> at >> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460) >> at org.apache.log4j.LogManager.(LogManager.java:113) >> at org.apache.log4j.Logger.getLogger(Logger.java:94) >> at org.globus.cog.karajan.Loader.(Loader.java:43) >> Could not start execution. >> first.xml (Permission denied) >> >> >> On Thu, 2 Apr 2009, Ben Clifford wrote: >> >>> >>> Yeah, that's quite out of date, it seems. >>> >>> If you're installing on a machine with an OSG stack, get the version >>> without extra stuff (swift-0.8-stripped.tar.gz) from the download page. >>> Untar it, and put its bin/ directory on your system path. >>> >>> The stripped version does not contain commands like grid-proxy-init, to >>> avoid conflict with real versions deployed elsewhere (i.e. in an OSG >>> install). >>> >>> To test, go into the examples/swift/ directory, type: >>> >>> swift first.swift >>> >>> and check that a file hello.txt appears. >>> >>> >> >> > From wilde at mcs.anl.gov Fri Apr 3 15:57:22 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 03 Apr 2009 15:57:22 -0500 Subject: [Swift-user] Please clarify throttle parameters for coasters and GT2 issues In-Reply-To: <50b07b4b0904030958o36051630i5b330f7068fc102d@mail.gmail.com> References: <49D57255.9090605@mcs.anl.gov> <1238730740.21734.2.camel@localhost> <50b07b4b0904030958o36051630i5b330f7068fc102d@mail.gmail.com> Message-ID: <49D67832.1070202@mcs.anl.gov> Allan, 50 SGE jobs is Ranger's queue limit, but its above the 40-job "safe" limit suggested by the Swift developers. Above 40 we risk causing sever overhead on their gatekeeper, which I think doubles as one of the login hosts. 
So I would urge you that when using GT2 to submit (with and without coasters) that you stay under 40, until coasters can create more workers with less jobs. - Mike On 4/3/09 11:58 AM, Allan Espinosa wrote: > In pushing Ranger's current scheduling policies of 50 SGE jobs, we can > push a max of 800 cpus. I have tried this before using the gt2 > interface. > > -Allan > > On Thu, Apr 2, 2009 at 8:52 PM, Mihael Hategan wrote: >> On Thu, 2009-04-02 at 21:20 -0500, Michael Wilde wrote: >>> I (and colleagues Im working with) have a few related questions: >>> >>> At some point guidelines were posted regarding "safe" throttle values >>> for sending to GT2 GRAM sites. I recall "max 40 jobs". Can you clarify >>> if that number is still the best practice, >> Yes. GT2 hasn't changed much since. >> >>> and how to the 4-5 throttle >>> parameters to conform? >> The defaults are pretty much geared towards the gt2/gridftp combo. >> >>> Then, do those same values apply to coasters? >> No. Throttles in the range of 32-256 (or maybe even more) are not >> unreasonable with coasters. >> >>> Finally, with the recent successes in high-volume coaster runs on Ranger >>> - 190K jobs in ~ 5 hours, and even the earlier runs of 65K jobs - how >>> were those achieved without tripping into the GRAM overhead limits, >>> given that Ranger as far as I know has only GT2 GRAM and even submitting >>> locallly to SGE must go through GRAM since we have no direct SGE >>> provider? Are the "safe" limits for Ranger simply higer,or is there >>> something else involved that makes this practical? In other words, >>> please share and post how to get lots of jobs through Ranger. >> coasterWorkersPerNode=16. That gives you 640 cpus with exactly 40 gram >> jobs. >> > > > > From hockyg at uchicago.edu Mon Apr 6 21:42:08 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Mon, 06 Apr 2009 21:42:08 -0500 Subject: [Swift-user] consultation about error messages, coaster usage Message-ID: <49DABD80.8010508@uchicago.edu> Hi Guys, I just ran (and killed) too big runs w/ swift, one on ranger, one on abe. I stopped them because in each case there were many "Failed but can retry" jobs, several "Failed to transfer wrapper log" errors and at the point where i stopped them, many more cpu's allocated than "Active" jobs. E.g. on ranger there were 14 running jobs in the queue w/ over an hour left (so 224 cpus) but only 76 "Active" jobs. Could someone take a look at the logs and tell me if things are working properly? It's a little hard to tell from a user end... On a ci home machine, All run related files for abe are in > /home/hockyg/oops/swift/output/abeoutdir.5/ and for ranger > /home/hockyg/oops/swift/output/rangeroutdir.5/ In those directories, there will be a file $site.out.5 which has the stdout and xout.XXXXX which has a log of all the commands run including the swift invocation the tc.data file used is $site.data and the sites.xml file is $site.xml Thanks, Glen From hategan at mcs.anl.gov Mon Apr 6 22:08:41 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 22:08:41 -0500 Subject: [Swift-user] consultation about error messages, coaster usage In-Reply-To: <49DABD80.8010508@uchicago.edu> References: <49DABD80.8010508@uchicago.edu> Message-ID: <1239073721.14311.2.camel@localhost> You seem to be using a particularly bad version of swift. I suggest trying the latest version. Mihael On Mon, 2009-04-06 at 21:42 -0500, Glen Hocky wrote: > Hi Guys, > I just ran (and killed) too big runs w/ swift, one on ranger, one on > abe. 
I stopped them because in each case there were many "Failed but can > retry" jobs, several "Failed to transfer wrapper log" errors and at the > point where i stopped them, many more cpu's allocated than "Active" > jobs. E.g. on ranger there were 14 running jobs in the queue w/ over an > hour left (so 224 cpus) but only 76 "Active" jobs. > > Could someone take a look at the logs and tell me if things are working > properly? It's a little hard to tell from a user end... > On a ci home machine, > All run related files for abe are in > > /home/hockyg/oops/swift/output/abeoutdir.5/ > and for ranger > > > /home/hockyg/oops/swift/output/rangeroutdir.5/ > In those directories, there will be a file $site.out.5 which has the stdout > and xout.XXXXX which has a log of all the commands run including the > swift invocation > the tc.data file used is $site.data and the sites.xml file is $site.xml > > Thanks, > Glen > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Mon Apr 6 23:15:36 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 06 Apr 2009 23:15:36 -0500 Subject: [Swift-user] consultation about error messages, coaster usage In-Reply-To: <1239073721.14311.2.camel@localhost> References: <49DABD80.8010508@uchicago.edu> <1239073721.14311.2.camel@localhost> Message-ID: <49DAD368.5000006@mcs.anl.gov> OK, will do. I think the fix you applied at 5PM enables us to go back to the latest rev. This morning we updated, then reverted back to Tuesday 3/31. On 4/6/09 10:08 PM, Mihael Hategan wrote: > You seem to be using a particularly bad version of swift. I suggest > trying the latest version. > > Mihael > > On Mon, 2009-04-06 at 21:42 -0500, Glen Hocky wrote: >> Hi Guys, >> I just ran (and killed) too big runs w/ swift, one on ranger, one on >> abe. I stopped them because in each case there were many "Failed but can >> retry" jobs, several "Failed to transfer wrapper log" errors and at the >> point where i stopped them, many more cpu's allocated than "Active" >> jobs. E.g. on ranger there were 14 running jobs in the queue w/ over an >> hour left (so 224 cpus) but only 76 "Active" jobs. >> >> Could someone take a look at the logs and tell me if things are working >> properly? It's a little hard to tell from a user end... 
>> On a ci home machine, >> All run related files for abe are in >>> /home/hockyg/oops/swift/output/abeoutdir.5/ >> and for ranger >> >>> /home/hockyg/oops/swift/output/rangeroutdir.5/ >> In those directories, there will be a file $site.out.5 which has the stdout >> and xout.XXXXX which has a log of all the commands run including the >> swift invocation >> the tc.data file used is $site.data and the sites.xml file is $site.xml >> >> Thanks, >> Glen >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From hategan at mcs.anl.gov Mon Apr 6 23:40:42 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 06 Apr 2009 23:40:42 -0500 Subject: [Swift-user] consultation about error messages, coaster usage In-Reply-To: <49DAD368.5000006@mcs.anl.gov> References: <49DABD80.8010508@uchicago.edu> <1239073721.14311.2.camel@localhost> <49DAD368.5000006@mcs.anl.gov> Message-ID: <1239079242.15719.0.camel@localhost> On Mon, 2009-04-06 at 23:15 -0500, Michael Wilde wrote: > OK, will do. I think the fix you applied at 5PM enables us to go back to > the latest rev. This morning we updated, then reverted back to Tuesday 3/31. Yes. Sorry about that one. It happens though that Tuesday 3/31 was also pretty unstable. From hockyg at uchicago.edu Thu Apr 9 13:40:34 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 09 Apr 2009 13:40:34 -0500 Subject: [Swift-user] Is there a way to have an optional command line argument to a swift script? Message-ID: <49DE4122.8020409@uchicago.edu> I have a new command line argument for my script and I want to check if it's there or not. doing @arg("foo") just gives Missing command line argument: foo Thanks, Glen From benc at hawaga.org.uk Thu Apr 9 13:45:18 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 9 Apr 2009 18:45:18 +0000 (GMT) Subject: [Swift-user] Is there a way to have an optional command line argument to a swift script? In-Reply-To: <49DE4122.8020409@uchicago.edu> References: <49DE4122.8020409@uchicago.edu> Message-ID: On Thu, 9 Apr 2009, Glen Hocky wrote: > I have a new command line argument for my script and I want to check if it's > there or not. You can have a default value for an argument. The user guide describes how. If you choose a sufficiently obscure default value, you can pretty much detect if you got the default value or not and change behaviour based on it. Or if you only want to know if its there or not in order to assign a default value you get that automatically. -- From hategan at mcs.anl.gov Thu Apr 9 13:49:22 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 09 Apr 2009 13:49:22 -0500 Subject: [Swift-user] Is there a way to have an optional command line argument to a swift script? In-Reply-To: <49DE4122.8020409@uchicago.edu> References: <49DE4122.8020409@uchicago.edu> Message-ID: <1239302962.5659.0.camel@localhost> On Thu, 2009-04-09 at 13:40 -0500, Glen Hocky wrote: > I have a new command line argument for my script and I want to check if > it's there or not. > doing @arg("foo") just gives > Missing command line argument: foo What's the command line you use to start the script? 
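(A sketch of the default-value form mentioned above: a second argument to @arg supplies the value used when the option is absent from the command line. The argument name and sentinel value here are illustrative.)

    string foo = @arg("foo", "none");  // yields "none" unless -foo=... is passed

Invoked as swift myscript.swift -foo=bar, this yields "bar"; run without -foo it yields "none", which the script can test against to change behaviour.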
> > > Thanks, > Glen > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Wed Apr 15 12:41:30 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 15 Apr 2009 12:41:30 -0500 Subject: [Swift-user] Swift restarts with iterate? Message-ID: <49E61C4A.90902@mcs.anl.gov> Is the restart feature designed to correctly handle restarts of scripts with active, possibly nested, iterate statements? The use case of interest here is to run a single copy of swift continuously, or for extended periods, doing a task graph of work, sleeping, and repeating indefinitely. The thought was that you could restart swift indefinitely after any failures, or periodically if for the time being it cant run indefinitely due to memory or other resource consumption issues. The use case involves applying it to process log data on a continuing basis. I suspect the pattern may also be useful in digesting, eg, news data. Comments or advice on feasibility would be useful before experimenting. From benc at hawaga.org.uk Wed Apr 15 14:02:15 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 19:02:15 +0000 (GMT) Subject: [Swift-user] Swift restarts with iterate? In-Reply-To: <49E61C4A.90902@mcs.anl.gov> References: <49E61C4A.90902@mcs.anl.gov> Message-ID: On Wed, 15 Apr 2009, Michael Wilde wrote: > Is the restart feature designed to correctly handle restarts of scripts with > active, possibly nested, iterate statements? There was no intention that such would not work. > The use case of interest here is to run a single copy of swift continuously, > or for extended periods, doing a task graph of work, sleeping, and repeating > indefinitely. I've considered that as a way of handling something like streaming datasets. Doing that should work in as much as it should accomodate new data appearing. However I'm unsure of the memory usage scalability compared to a run where you had all the data in place at the start of a single run - Swift will still make karajan threads to attempt (and then optimise away) already done executions, and will still have an in-memory representation of each data object already processed. >From a SwiftScript language perspective, the above fits in just fine, I think. >From a practical perspective as it is now, you will need something that depends on the array being closed and fails (for example, call /bin/false with the array a an input). -- From hategan at mcs.anl.gov Wed Apr 15 14:12:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 15 Apr 2009 14:12:58 -0500 Subject: [Swift-user] Swift restarts with iterate? In-Reply-To: References: <49E61C4A.90902@mcs.anl.gov> Message-ID: <1239822778.23411.26.camel@localhost> On Wed, 2009-04-15 at 19:02 +0000, Ben Clifford wrote: > On Wed, 15 Apr 2009, Michael Wilde wrote: > > > Is the restart feature designed to correctly handle restarts of scripts with > > active, possibly nested, iterate statements? > > There was no intention that such would not work. > > > The use case of interest here is to run a single copy of swift continuously, > > or for extended periods, doing a task graph of work, sleeping, and repeating > > indefinitely. > > I've considered that as a way of handling something like streaming > datasets. > > Doing that should work in as much as it should accomodate new data > appearing. 
> > However I'm unsure of the memory usage scalability compared to a run where > you had all the data in place at the start of a single run - Swift will > still make karajan threads to attempt (and then optimise away) already > done executions, and will still have an in-memory representation of each > data object already processed. While thinking of the scalability issues a while ago when I did the foreach limiting, I concluded that a solution to that may exist. Currently we use a certain scheme to detect when a piece of data stops being written to, such that it can be considered "closed". Similarly, it may be possible to determine when a piece of data will not be referenced any more, and consequently remove in-memory references to it and its associated data structures. From benc at hawaga.org.uk Wed Apr 15 14:34:43 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 15 Apr 2009 19:34:43 +0000 (GMT) Subject: [Swift-user] Swift restarts with iterate? In-Reply-To: <49E61C4A.90902@mcs.anl.gov> References: <49E61C4A.90902@mcs.anl.gov> Message-ID: On Wed, 15 Apr 2009, Michael Wilde wrote: > Is the restart feature designed to correctly handle restarts of scripts with > active, possibly nested, iterate statements? I think, though, that foreach is more the construct for the use case of iterating over a growing collection of files. Do a foreach over a mapepd collection of files. Run swift again with a bigger mapped collection of files, and if you are using the restart stuff described earlier in this thread, it will run only on the new entries. -- From yuechen at bsd.uchicago.edu Sat Apr 18 14:03:24 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Sat, 18 Apr 2009 14:03:24 -0500 Subject: [Swift-user] job waiting Message-ID: Hi, I'm using Swift and PTMap to analyze E coli genome data. Right now, I'm only mapped on SDSC DTF and NCSA mercury, so I'm trying to use only these two computer clusters. Total number of jobs should be around 4127. After it started, the application runs normally. However, after 3769 jobs returned successfully, it could not receive any more data and the system kept waiting. On these computers, if I use qstat, I cannot find any active job. In my email, I received 45 emails like the following: ///// PBS Job Id: 1932326.tg-master.ncsa.teragrid.org Job Name: null job deleted Job deleted at request of root at tg-master.ncsa.teragrid.org MOAB_INFO: job exceeded wallclock limit ////// I'm wondering if I did something wrong and how I can avoid this situation. The log of the search should be /home/yuechen/PTMap2/PTMap2-unmod-20090418-1254-d8loarc1.log. /************* The swift script I used is /home/yuechen/PTMap2/PTMap2-unmod.swift /************* tc.data is /home/yuechen/PTMap2/tc.data /************* sites.xml is /home/yuechen/PTMap2/sites.xml and the following are the two sites I used. /gpfs-wan/scratch/yuechen fast /gpfs_scratch1/yuechen/swiftwork Thank you very much! Best regards, Chen, Yue This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Sat Apr 18 17:01:16 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 18 Apr 2009 22:01:16 +0000 (GMT) Subject: [Swift-user] job waiting In-Reply-To: References: Message-ID: On Sat, 18 Apr 2009, Yue, Chen - BMD wrote: > MOAB_INFO: job exceeded wallclock limit This message means that some of the josb that you tried to run took longer than is allowed by default. I plotted your logs using swift-plot-log, and from the graph 'karajan active JOB_SUBMISSION cumulative duration' in the karajan tab (http://www.ci.uchicago.edu/~benc/tmp/report-PTMap2-unmod-20090418-1254-d8loarc1/karajan.html) it looks like while most of your jobs take somewhere between seconds to a few minutes, a number of your jobs take longer (up to 3000 seconds in that graph) Check that: i) an hour is a sane time for your programs to be taking, and ii) that the queues that you are submitting to (the default on ncsa, for example) allow this length of time. -- From benc at hawaga.org.uk Sat Apr 18 17:05:55 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 18 Apr 2009 22:05:55 +0000 (GMT) Subject: [Swift-user] job waiting In-Reply-To: References: Message-ID: actually, I see you're using coasters on NCSA, so the actual numbers for walltimes being submitted into NCSA's queueing system will be a little strange. But my first question, that some jobs taking around an hour still stands. Also I notice a large number of jobs being submitted at the start of your run - have you adjusted the default throttles on your swift installation to some larger value? -- From yuechen at bsd.uchicago.edu Sat Apr 18 18:20:43 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Sat, 18 Apr 2009 18:20:43 -0500 Subject: [Swift-user] job waiting References: Message-ID: Hi Ben, Thanks for answering my question. This phenomena occur after half an hour of execution. If all the jobs finish execution at original speed, it would probably take not more than 40 min. How the system figure out that some jobs will take more than 1 hour? Should I request more time when I execute "grid-proxy-init"? I did not change the default throttles. How much is more appropriate? The total number of jobs in my application typically run between 4000 and 30000 and typically each job can be finished within a couple of minutes. Thanks! Chen, Yue ________________________________ From: Ben Clifford [mailto:benc at hawaga.org.uk] Sent: Sat 4/18/2009 5:05 PM To: Yue, Chen - BMD Cc: swift user Subject: Re: [Swift-user] job waiting actually, I see you're using coasters on NCSA, so the actual numbers for walltimes being submitted into NCSA's queueing system will be a little strange. But my first question, that some jobs taking around an hour still stands. Also I notice a large number of jobs being submitted at the start of your run - have you adjusted the default throttles on your swift installation to some larger value? -- This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Sun Apr 19 02:07:13 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 19 Apr 2009 07:07:13 +0000 (GMT) Subject: [Swift-user] job waiting In-Reply-To: References: Message-ID: On Sat, 18 Apr 2009, Yue, Chen - BMD wrote: > Thanks for answering my question. This phenomena occur after half an > hour of execution. If all the jobs finish execution at original speed, > it would probably take not more than 40 min. How the system figure out > that some jobs will take more than 1 hour? Should I request more time > when I execute "grid-proxy-init"? Not with grid-proxy-init. You can specify a parameter called maxwalltime in your sites file or your tc.data file that will tell Swift an upper bound on how long your job will run. In Swift 0.8, coasters assume something like 10 minutes if you do not specify a walltime, so you will run into trouble. For example, change the null at the end of your tc.data lines to globus::maxwalltime=50 to mean 50 minutes maxwalltime. There has been work done on coasters since Swift 0.8, and so Mihael may have some other recommendations. > I did not change the default throttles. How much is more appropriate? > The total number of jobs in my application typically run between 4000 > and 30000 and typically each job can be finished within a couple of > minutes. Where is your Swift installation? I would liek to look at it. -- From hategan at mcs.anl.gov Sun Apr 19 11:20:14 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 19 Apr 2009 11:20:14 -0500 Subject: [Swift-user] job waiting In-Reply-To: References: Message-ID: <1240158014.25901.1.camel@localhost> On Sun, 2009-04-19 at 07:07 +0000, Ben Clifford wrote: > On Sat, 18 Apr 2009, Yue, Chen - BMD wrote: > > > Thanks for answering my question. This phenomena occur after half an > > hour of execution. If all the jobs finish execution at original speed, > > it would probably take not more than 40 min. How the system figure out > > that some jobs will take more than 1 hour? Should I request more time > > when I execute "grid-proxy-init"? > > Not with grid-proxy-init. You can specify a parameter called maxwalltime > in your sites file or your tc.data file that will tell Swift an upper > bound on how long your job will run. In Swift 0.8, coasters assume > something like 10 minutes if you do not specify a walltime, so you will > run into trouble. > > For example, change the null at the end of your tc.data lines to > globus::maxwalltime=50 to mean 50 minutes maxwalltime. > > There has been work done on coasters since Swift 0.8, and so Mihael may > have some other recommendations. Yes. Coasters are experimental. As such, there are problems. However, you may get better results with the current development version. > > > I did not change the default throttles. How much is more appropriate? > > The total number of jobs in my application typically run between 4000 > > and 30000 and typically each job can be finished within a couple of > > minutes. > > Where is your Swift installation? I would liek to look at it. > From yuechen at bsd.uchicago.edu Tue Apr 21 10:53:02 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Tue, 21 Apr 2009 10:53:02 -0500 Subject: [Swift-user] job waiting References: <1240158014.25901.1.camel@localhost> Message-ID: Hi Ben and Mihael, Thanks for answering my questions. I will try to set the maxwalltime in tc.data in my application and let you know how it works. My Swift installation is at /home/yuechen/swift-0.8 on Communicado. 
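For reference, the tc.data change Ben describes is just the last column of the relevant line; a sketch of what a Mercury entry might look like, where the site handle and install path are placeholders (only the Abe install path appears later in this thread) and the PTMap2 name and globus::maxwalltime=50 profile are taken from the thread itself:

#sitename     transformation  path             INSTALLED  platform        profiles
# site handle and path below are placeholders, not the user's actual entry
NCSA_MERCURY  PTMap2          /path/to/PTMap2  INSTALLED  INTEL32::LINUX  globus::maxwalltime=50

The profiles column replaces the "null" that otherwise ends the line, and the 50 is interpreted as minutes.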
Please let me know if you see any problem in my setup. Thank you very much! Chen, Yue ________________________________ From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Sun 4/19/2009 11:20 AM To: Ben Clifford Cc: Yue, Chen - BMD; swift user Subject: RE: [Swift-user] job waiting On Sun, 2009-04-19 at 07:07 +0000, Ben Clifford wrote: > On Sat, 18 Apr 2009, Yue, Chen - BMD wrote: > > > Thanks for answering my question. This phenomena occur after half an > > hour of execution. If all the jobs finish execution at original speed, > > it would probably take not more than 40 min. How the system figure out > > that some jobs will take more than 1 hour? Should I request more time > > when I execute "grid-proxy-init"? > > Not with grid-proxy-init. You can specify a parameter called maxwalltime > in your sites file or your tc.data file that will tell Swift an upper > bound on how long your job will run. In Swift 0.8, coasters assume > something like 10 minutes if you do not specify a walltime, so you will > run into trouble. > > For example, change the null at the end of your tc.data lines to > globus::maxwalltime=50 to mean 50 minutes maxwalltime. > > There has been work done on coasters since Swift 0.8, and so Mihael may > have some other recommendations. Yes. Coasters are experimental. As such, there are problems. However, you may get better results with the current development version. > > > I did not change the default throttles. How much is more appropriate? > > The total number of jobs in my application typically run between 4000 > > and 30000 and typically each job can be finished within a couple of > > minutes. > > Where is your Swift installation? I would liek to look at it. > This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Wed Apr 22 11:30:17 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 22 Apr 2009 16:30:17 +0000 (GMT) Subject: [Swift-user] job waiting In-Reply-To: References: Message-ID: You might find it useful to try out coasters from swift 0.9rc2, which is a much more recent testing version of Swift compared to 0.8 (which I think you are using) You can get that here: www.ci.uchicago.edu/~benc/swift-0.9rc2.tar.gz Your existing SwiftScript and site files should work with that. -- From yuechen at bsd.uchicago.edu Wed Apr 22 11:25:42 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Wed, 22 Apr 2009 11:25:42 -0500 Subject: [Swift-user] job waiting References: Message-ID: Hi Ben, Yesterday, I tested my application a few times on NCSA mercury only with coaster and with the specification of globus::maxwalltime=50 in tc.data. Similar to previous try, in several runs, the application keeps waiting after 4076, 4052, 4099, 4048, 4051 successful returns respectively. Does this relate to my setting? 
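As a side note, Ben's earlier reply said the walltime bound can also be given in the sites file rather than tc.data, and with coasters there is a separate coasterWorkerMaxwalltime setting that shows up later in this thread. A minimal sketch of a Mercury pool entry combining the two - the host name and work directory match the ones quoted elsewhere in this thread, but the element layout and the one-hour worker walltime are illustrative, not a copy of the user's actual file:

<pool handle="NCSA_MERCURY">
  <!-- sketch only; the gt2:gt2:PBS jobManager form is the one recommended later in this thread -->
  <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" jobManager="gt2:gt2:PBS"/>
  <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
  <profile namespace="globus" key="maxwalltime">50</profile>
  <profile namespace="globus" key="coasterWorkerMaxwalltime">01:00:00</profile>
</pool>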
The log for the last run is at: /home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log I started to receive email with the following content after about 10 min of execution, ///////// PBS Job Id: 1947957.tg-master.ncsa.teragrid.org Job Name: null job deleted Job deleted at request of root at tg-master.ncsa.teragrid.org MOAB_INFO: job exceeded wallclock limit ///////// However, Swift did not indicate any job failure, so should I worry about the success of those jobs? I also tried NCSA mercury only without coaster, but the submitted jobs do not seem to return successfully. I notice that if I use coaster, typicaly max number jobs I have on NCSA is about 130, but if I do not use coaster, I can have more than 300 jobs queued on NCSA computer. Is this related with the throttle setting? I also tried SDSC dtf server without coaster, but the jobs submitted do not get started on SDSC dtf server. Instead, I got many error messages like the following. Should I contact teragrid for these errors? Progress: Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished successfully:230 Failed but can retry:45 Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs The following is my sites.xml content for NCSA mercury with and without coaster and SDSC DTF: /gpfs_scratch1/yuechen/swiftwork /gpfs_scratch1/yuechen/swiftwork /gpfs-wan/scratch/yuechen fast The swift script I used is at: /home/yuechen/PTMap2/PTMap2-unmod.swift The tc.data I used is: /home/yuechen/PTMap2/tc.data I will start to try other servers to see if I can run all jobs successfully. Thank you very much for help! Chen, Yue ________________________________ From: Ben Clifford [mailto:benc at hawaga.org.uk] Sent: Sun 4/19/2009 2:07 AM To: Yue, Chen - BMD Cc: swift user Subject: RE: [Swift-user] job waiting On Sat, 18 Apr 2009, Yue, Chen - BMD wrote: > Thanks for answering my question. This phenomena occur after half an > hour of execution. If all the jobs finish execution at original speed, > it would probably take not more than 40 min. How the system figure out > that some jobs will take more than 1 hour? Should I request more time > when I execute "grid-proxy-init"? Not with grid-proxy-init. You can specify a parameter called maxwalltime in your sites file or your tc.data file that will tell Swift an upper bound on how long your job will run. In Swift 0.8, coasters assume something like 10 minutes if you do not specify a walltime, so you will run into trouble. For example, change the null at the end of your tc.data lines to globus::maxwalltime=50 to mean 50 minutes maxwalltime. There has been work done on coasters since Swift 0.8, and so Mihael may have some other recommendations. > I did not change the default throttles. How much is more appropriate? > The total number of jobs in my application typically run between 4000 > and 30000 and typically each job can be finished within a couple of > minutes. Where is your Swift installation? I would liek to look at it. 
-- -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Apr 22 11:44:14 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 22 Apr 2009 11:44:14 -0500 Subject: [Swift-user] job waiting In-Reply-To: References: Message-ID: <1240418654.7409.0.camel@localhost> This behavior was observed previously with the version you have. I strongly recommend upgrading to the version Ben mentions. On Wed, 2009-04-22 at 11:25 -0500, Yue, Chen - BMD wrote: > Hi Ben, > > Yesterday, I tested my application a few times on NCSA mercury only > with coaster and with the specification of globus::maxwalltime=50 in > tc.data. Similar to previous try, in several runs, the application > keeps waiting after 4076, 4052, 4099, 4048, 4051 successful returns > respectively. Does this relate to my setting? The log for the last run > is at: > > /home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log > > I started to receive email with the following content after about 10 > min of execution, > > ///////// > PBS Job Id: 1947957.tg-master.ncsa.teragrid.org > Job Name: null > job deleted > Job deleted at request of root at tg-master.ncsa.teragrid.org > MOAB_INFO: job exceeded wallclock limit > ///////// > > However, Swift did not indicate any job failure, so should I worry > about the success of those jobs? > > I also tried NCSA mercury only without coaster, but the submitted jobs > do not seem to return successfully. I notice that if I use coaster, > typicaly max number jobs I have on NCSA is about 130, but if I do not > use coaster, I can have more than 300 jobs queued on NCSA computer. Is > this related with the throttle setting? > > I also tried SDSC dtf server without coaster, but the jobs submitted > do not get started on SDSC dtf server. Instead, I got many error > messages like the following. Should I contact teragrid for these > errors? > > Progress: Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished > successfully:230 Failed but can retry:45 > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs > > The following is my sites.xml content for NCSA mercury with and > without coaster and SDSC DTF: > > > > jobManager="gt2:PBS"/> > /gpfs_scratch1/yuechen/swiftwork > > > > url="grid-hg.ncsa.teragrid.org:2119/jobmanager-pbs" major="2" /> > /gpfs_scratch1/yuechen/swiftwork > > > > url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2" /> > /gpfs-wan/scratch/yuechen > fast > > > The swift script I used is at: > > /home/yuechen/PTMap2/PTMap2-unmod.swift > > The tc.data I used is: > > /home/yuechen/PTMap2/tc.data > > I will start to try other servers to see if I can run all jobs > successfully. > > Thank you very much for help! 
> > Chen, Yue > > > > > > > > > > > > > ______________________________________________________________________ > From: Ben Clifford [mailto:benc at hawaga.org.uk] > Sent: Sun 4/19/2009 2:07 AM > To: Yue, Chen - BMD > Cc: swift user > Subject: RE: [Swift-user] job waiting > > > > On Sat, 18 Apr 2009, Yue, Chen - BMD wrote: > > > Thanks for answering my question. This phenomena occur after half an > > hour of execution. If all the jobs finish execution at original > speed, > > it would probably take not more than 40 min. How the system figure > out > > that some jobs will take more than 1 hour? Should I request more > time > > when I execute "grid-proxy-init"? > > Not with grid-proxy-init. You can specify a parameter called > maxwalltime > in your sites file or your tc.data file that will tell Swift an upper > bound on how long your job will run. In Swift 0.8, coasters assume > something like 10 minutes if you do not specify a walltime, so you > will > run into trouble. > > For example, change the null at the end of your tc.data lines to > globus::maxwalltime=50 to mean 50 minutes maxwalltime. > > There has been work done on coasters since Swift 0.8, and so Mihael > may > have some other recommendations. > > > I did not change the default throttles. How much is more > appropriate? > > The total number of jobs in my application typically run between > 4000 > > and 30000 and typically each job can be finished within a couple of > > minutes. > > Where is your Swift installation? I would liek to look at it. > > -- > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From yuechen at bsd.uchicago.edu Wed Apr 22 21:29:51 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Wed, 22 Apr 2009 21:29:51 -0500 Subject: [Swift-user] job waiting References: <1240418654.7409.0.camel@localhost> Message-ID: Hi Mihael and Ben, Thanks for your information. The new version of coasters works very well on NCSA mercury and I don't receive those email any more. But I run into some problem with SDSC server. I will send separate email tomorrow after I get response from SDSC people. Best, Chen, Yue ________________________________ From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wed 4/22/2009 11:44 AM To: Yue, Chen - BMD Cc: Ben Clifford; swift user Subject: RE: [Swift-user] job waiting This behavior was observed previously with the version you have. I strongly recommend upgrading to the version Ben mentions. On Wed, 2009-04-22 at 11:25 -0500, Yue, Chen - BMD wrote: > Hi Ben, > > Yesterday, I tested my application a few times on NCSA mercury only > with coaster and with the specification of globus::maxwalltime=50 in > tc.data. Similar to previous try, in several runs, the application > keeps waiting after 4076, 4052, 4099, 4048, 4051 successful returns > respectively. Does this relate to my setting? The log for the last run > is at: > > /home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log > > I started to receive email with the following content after about 10 > min of execution, > > ///////// > PBS Job Id: 1947957.tg-master.ncsa.teragrid.org > Job Name: null > job deleted > Job deleted at request of root at tg-master.ncsa.teragrid.org > MOAB_INFO: job exceeded wallclock limit > ///////// > > However, Swift did not indicate any job failure, so should I worry > about the success of those jobs? 
> > I also tried NCSA mercury only without coaster, but the submitted jobs > do not seem to return successfully. I notice that if I use coaster, > typicaly max number jobs I have on NCSA is about 130, but if I do not > use coaster, I can have more than 300 jobs queued on NCSA computer. Is > this related with the throttle setting? > > I also tried SDSC dtf server without coaster, but the jobs submitted > do not get started on SDSC dtf server. Instead, I got many error > messages like the following. Should I contact teragrid for these > errors? > > Progress: Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished > successfully:230 Failed but can retry:45 > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs > Failed to transfer wrapper log from > PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs > > The following is my sites.xml content for NCSA mercury with and > without coaster and SDSC DTF: > > > > jobManager="gt2:PBS"/> > /gpfs_scratch1/yuechen/swiftwork > > > > url="grid-hg.ncsa.teragrid.org:2119/jobmanager-pbs" major="2" /> > /gpfs_scratch1/yuechen/swiftwork > > > > url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2" /> > /gpfs-wan/scratch/yuechen > fast > > > The swift script I used is at: > > /home/yuechen/PTMap2/PTMap2-unmod.swift > > The tc.data I used is: > > /home/yuechen/PTMap2/tc.data > > I will start to try other servers to see if I can run all jobs > successfully. > > Thank you very much for help! > > Chen, Yue > > > > > > > > > > > > > ______________________________________________________________________ > From: Ben Clifford [mailto:benc at hawaga.org.uk] > Sent: Sun 4/19/2009 2:07 AM > To: Yue, Chen - BMD > Cc: swift user > Subject: RE: [Swift-user] job waiting > > > > On Sat, 18 Apr 2009, Yue, Chen - BMD wrote: > > > Thanks for answering my question. This phenomena occur after half an > > hour of execution. If all the jobs finish execution at original > speed, > > it would probably take not more than 40 min. How the system figure > out > > that some jobs will take more than 1 hour? Should I request more > time > > when I execute "grid-proxy-init"? > > Not with grid-proxy-init. You can specify a parameter called > maxwalltime > in your sites file or your tc.data file that will tell Swift an upper > bound on how long your job will run. In Swift 0.8, coasters assume > something like 10 minutes if you do not specify a walltime, so you > will > run into trouble. > > For example, change the null at the end of your tc.data lines to > globus::maxwalltime=50 to mean 50 minutes maxwalltime. > > There has been work done on coasters since Swift 0.8, and so Mihael > may > have some other recommendations. > > > I did not change the default throttles. How much is more > appropriate? > > The total number of jobs in my application typically run between > 4000 > > and 30000 and typically each job can be finished within a couple of > > minutes. > > Where is your Swift installation? I would liek to look at it. 
> > -- > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hockyg at uchicago.edu Thu Apr 23 14:21:36 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 14:21:36 -0500 Subject: [Swift-user] max number of jobs? Message-ID: <49F0BFC0.1040504@uchicago.edu> Hi everyone. I was wondering if there is a cap on number of coasters or jobs in the queue on some machines. I've had a lot of success running on Ranger but I've never had more than 256 active jobs (i.e. 16x16) even with very high initial score and throttle settings. Glen From hockyg at uchicago.edu Thu Apr 23 14:43:00 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 14:43:00 -0500 Subject: [Swift-user] max number of jobs? In-Reply-To: <49F0C3FD.7040907@cs.uchicago.edu> References: <49F0BFC0.1040504@uchicago.edu> <49F0C3FD.7040907@cs.uchicago.edu> Message-ID: <49F0C4C4.6050403@uchicago.edu> Ah, I do intend to try running under ranger but the one main reason I haven't is I'm trying to run from a single location (a ci machine) because it's easier to keep managed that way. I'm running in the normal queue, but all of my jobs are 16 node, the reason for that I think is that there is no way to get larger block allocations as was being discussed a few weeks ago. I would be better off w/ larger because I'm sure the wait time is the same for 16 or 32 or 64... Ioan Raicu wrote: > I can't help with the Swift or coaster settings, but don't forget that > Falkon is also installed on Ranger, and you can use it the same way > that you use it on intrepid. I have yet to do extremely large runs on > ranger to see how well things scale, but you might want to give Falkon > a try as well. > > Also, I recall something about the development queue being limited to > 16 or 32 nodes. The normal queue, which allows larger allocations, > usually also has higher wait times. Coaster might be configured to use > the faster development queue, which has a limited number of nodes you > can use. You might want to look into changing the queue Swift/Coaster > submits the jobs to. Perhaps Mihael or others can offer details on how > to change the queue Swift will submit to. > > Ioan > > Glen Hocky wrote: >> Hi everyone. >> I was wondering if there is a cap on number of coasters or jobs in >> the queue on some machines. I've had a lot of success running on >> Ranger but I've never had more than 256 active jobs (i.e. 16x16) even >> with very high initial score and throttle settings. >> >> Glen >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> > From iraicu at cs.uchicago.edu Thu Apr 23 14:39:41 2009 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 23 Apr 2009 14:39:41 -0500 Subject: [Swift-user] max number of jobs? 
In-Reply-To: <49F0BFC0.1040504@uchicago.edu> References: <49F0BFC0.1040504@uchicago.edu> Message-ID: <49F0C3FD.7040907@cs.uchicago.edu> I can't help with the Swift or coaster settings, but don't forget that Falkon is also installed on Ranger, and you can use it the same way that you use it on intrepid. I have yet to do extremely large runs on ranger to see how well things scale, but you might want to give Falkon a try as well. Also, I recall something about the development queue being limited to 16 or 32 nodes. The normal queue, which allows larger allocations, usually also has higher wait times. Coaster might be configured to use the faster development queue, which has a limited number of nodes you can use. You might want to look into changing the queue Swift/Coaster submits the jobs to. Perhaps Mihael or others can offer details on how to change the queue Swift will submit to. Ioan Glen Hocky wrote: > Hi everyone. > I was wondering if there is a cap on number of coasters or jobs in the > queue on some machines. I've had a lot of success running on Ranger > but I've never had more than 256 active jobs (i.e. 16x16) even with > very high initial score and throttle settings. > > Glen > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -- =================================================== Ioan Raicu, Ph.D. =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hockyg at uchicago.edu Thu Apr 23 14:45:55 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 14:45:55 -0500 Subject: [Swift-user] max number of jobs? 
In-Reply-To: <49F0C415.7040809@mcs.anl.gov> References: <49F0BFC0.1040504@uchicago.edu> <49F0C415.7040809@mcs.anl.gov> Message-ID: <49F0C573.6040002@uchicago.edu> ranger.data > #sitename transformation path INSTALLED > platform profiles > localhost echo /bin/echo INSTALLED > INTEL32::LINUX null > localhost cat /bin/cat INSTALLED > INTEL32::LINUX null > localhost ls /bin/ls INSTALLED > INTEL32::LINUX null > localhost grep /bin/grep INSTALLED > INTEL32::LINUX null > localhost sort /bin/sort INSTALLED > INTEL32::LINUX null > localhost paste /bin/paste INSTALLED > INTEL32::LINUX null > localhost sed /bin/sed INSTALLED > INTEL32::LINUX null > localhost cp /bin/cp INSTALLED > INTEL32::LINUX null > localhost sumarizeStudy > /home/hockyg/oops/swift/genPlotFiles.py INSTALLED INTEL32::LINUX > null > > > ranger runoops > /share/home/01021/hockyg/oops/trunk/bin/runoops.sh > INSTALLED INTEL32::LINUX null > ranger runrama > /share/home/01021/hockyg/oops/trunk/bin/runrama.sh > INSTALLED INTEL32::LINUX null > ranger runramaSpeed > /share/home/01021/hockyg/oops/trunk/bin/runramaSpeed.sh > INSTALLED INTEL32::LINUX null > ranger analyze_round_dir > /share/home/01021/hockyg/oops/trunk/bin/analyze_round_dir.sh > INSTALLED INTEL32::LINUX null ranger.xml > > > > > /home/hockyg/swiftwork > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > > > key="SWIFT_JOBDIR_PATH">/tmp/hockyg/jobdir > TG-MCB080099N > 16 > key="coasterWorkerMaxwalltime">05:00:00 > 60 > 50 > 10 > /share/home/01021/hockyg/swiftwork > > > [hockyg at communicado rangeroutdir.1002]$ less > /home/hockyg/.swift/swift.properties > sitedir.keep=true > lazy.errors=false > #execution.retries=0 > status.mode=provider Michael Wilde wrote: > Glen, can you post your sites.xml, tc.data and swift.properties files? > > On 4/23/09 2:21 PM, Glen Hocky wrote: >> Hi everyone. >> I was wondering if there is a cap on number of coasters or jobs in >> the queue on some machines. I've had a lot of success running on >> Ranger but I've never had more than 256 active jobs (i.e. 16x16) even >> with very high initial score and throttle settings. >> >> Glen >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From benc at hawaga.org.uk Thu Apr 23 16:53:24 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 23 Apr 2009 21:53:24 +0000 (GMT) Subject: [Swift-user] max number of jobs? In-Reply-To: <49F0BFC0.1040504@uchicago.edu> References: <49F0BFC0.1040504@uchicago.edu> Message-ID: On Thu, 23 Apr 2009, Glen Hocky wrote: > I was wondering if there is a cap on number of coasters or jobs in the queue > on some machines. I've had a lot of success running on Ranger but I've never > had more than 256 active jobs (i.e. 16x16) even with very high initial score > and throttle settings. Do you see other jobs from Swift sitting in the queue on Ranger in queued/waiting state (rather than running)? Or do you only see exactly 256 jobs in the queue? (this being from wahtever Ranger's equivalent of qstat is) -- From hockyg at uchicago.edu Thu Apr 23 16:55:48 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 23 Apr 2009 16:55:48 -0500 Subject: [Swift-user] max number of jobs? 
In-Reply-To: References: <49F0BFC0.1040504@uchicago.edu> Message-ID: <49F0E3E4.6010909@uchicago.edu> With these settings, exactly 16 jobs immediately go into the queue (which on ranger goes to 256 coasters) and that number never changes Ben Clifford wrote: > On Thu, 23 Apr 2009, Glen Hocky wrote: > > >> I was wondering if there is a cap on number of coasters or jobs in the queue >> on some machines. I've had a lot of success running on Ranger but I've never >> had more than 256 active jobs (i.e. 16x16) even with very high initial score >> and throttle settings. >> > > Do you see other jobs from Swift sitting in the queue on Ranger in > queued/waiting state (rather than running)? Or do you only see exactly 256 > jobs in the queue? (this being from wahtever Ranger's equivalent of qstat > is) > > From aespinosa at cs.uchicago.edu Fri Apr 24 13:05:44 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Fri, 24 Apr 2009 13:05:44 -0500 Subject: [Swift-user] swift-plot-log with svg graphics Message-ID: <20090424180543.GA4121@origin> I wanted to zoom into the plots as much as I want so i changed the png term configured in gnuplot invocations to svg. Much prettier than png plots in my opinion :) My patch for swift-plot-log is in http://www.ci.uchicago.edu/~aespinosa/swiftplot_svg-r2874.patch Sample plots: http://www.ci.uchicago.edu/~aespinosa/swift/report-blast-20090410-2357-j4nnkrg1/ Known issues: firefox does not properly render svg graphics produced by gnuplot4.0patch0 (like the one installed in communicado). Although this is fixed in gnuplot4.0patch2. below's a small sed script to fix that: sed -i 's/xmlns/xmlns="http:\/\/www.w3.org\/2000\/svg" xmlns/g' *.svg -Allan From hategan at mcs.anl.gov Fri Apr 24 13:15:07 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 24 Apr 2009 13:15:07 -0500 Subject: [Swift-user] swift-plot-log with svg graphics In-Reply-To: <20090424180543.GA4121@origin> References: <20090424180543.GA4121@origin> Message-ID: <1240596907.9287.8.camel@localhost> Prettier indeed. The rendering on my browser is slow though. So this should probably be an option. Or maybe the pages should show a png and have a link to the high-resolution svg. On Fri, 2009-04-24 at 13:05 -0500, Allan Espinosa wrote: > I wanted to zoom into the plots as much as I want so i changed the png term configured in gnuplot invocations to svg. Much prettier than png plots in my opinion :) > > My patch for swift-plot-log is in http://www.ci.uchicago.edu/~aespinosa/swiftplot_svg-r2874.patch > > Sample plots: http://www.ci.uchicago.edu/~aespinosa/swift/report-blast-20090410-2357-j4nnkrg1/ > > > Known issues: firefox does not properly render svg graphics produced by gnuplot4.0patch0 (like the one installed in communicado). Although this is fixed in gnuplot4.0patch2. below's a small sed script to fix that: > > sed -i 's/xmlns/xmlns="http:\/\/www.w3.org\/2000\/svg" xmlns/g' *.svg > > -Allan > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From benc at hawaga.org.uk Mon Apr 27 07:41:39 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 27 Apr 2009 12:41:39 +0000 (GMT) Subject: [Swift-user] swift-plot-log with svg graphics In-Reply-To: <20090424180543.GA4121@origin> References: <20090424180543.GA4121@origin> Message-ID: I think general SVG support isn't good enough for SVG to be the main default format. 
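One way to get the png-plus-a-link-to-svg behaviour Mihael suggests above from a single gnuplot script is to switch terminals and replot; a rough sketch, with the output and data file names invented for illustration:

set terminal png
set output "activeplot.png"
plot "jobs.data" using 1:2 with lines title "active jobs"   # data file and columns are hypothetical
set terminal svg
set output "activeplot.svg"
replot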
But if its making better images, it would be good to incorporate it somehow, like mihael suggests. At the very least, make a bugzilla enhancement request; even better, write the code and contribute it... -- From benc at hawaga.org.uk Mon Apr 27 08:24:05 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 27 Apr 2009 13:24:05 +0000 (GMT) Subject: [Swift-user] Swift 0.9 released. Message-ID: Swift 0.9 is released. Download it at http://www.ci.uchicago.edu/swift/downloads/ The release notes, with more information on new features and bugfixes, are available at: http://www.ci.uchicago.edu/swift/downloads/release-notes-0.9.txt -- From yuechen at bsd.uchicago.edu Wed Apr 29 12:38:09 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Wed, 29 Apr 2009 12:38:09 -0500 Subject: [Swift-user] errors in file transfer Message-ID: Hi, I was trying to test PTMap application using NCSA Abe. However, I got many error messages like the following and no process started on the Abe. Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/b on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/h on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/0 on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/i on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/u on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/n on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/z on NCSA_Abe The log of the search is on communicado: /home/yuechen/PTMap2/PTMap2-unmod-20090429-1222-926eesff.log In the sites.xml, the entry for NCSA Abe is : /cfs/scratch/users/yuechen/swiftwork fast In the tc.data, the entry is: NCSA_Abe PTMap2 /u/ac/yuechen/PTMap2/PTMap2 INSTALLED INTEL32::LINUX globus::maxwalltime=50 I'm wondering if I have any setup problem or I should contact NCSA administrator. Thank you very much! Regards, Chen, Yue This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Apr 29 12:51:17 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 12:51:17 -0500 Subject: [Swift-user] errors in file transfer In-Reply-To: References: Message-ID: <1241027477.14561.0.camel@localhost> This is the error: Cannot submit job: Could not submit job (qsub reported an exit code o f 170). no error output Try jobmanager="gt2:gt2:PBS" instead of "gt2:PBS". Mihael On Wed, 2009-04-29 at 12:38 -0500, Yue, Chen - BMD wrote: > Hi, > > I was trying to test PTMap application using NCSA Abe. However, I > got many error messages like the following and no process started on > the Abe. 
> > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/b on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/h on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/0 on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/i on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/u on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/n on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/z on NCSA_Abe > > The log of the search is on communicado: > > /home/yuechen/PTMap2/PTMap2-unmod-20090429-1222-926eesff.log > > In the sites.xml, the entry for NCSA Abe is : > > > > jobManager="gt2:PBS"/> > > /cfs/scratch/users/yuechen/swiftwork > fast > > > In the tc.data, the entry is: > > NCSA_Abe PTMap2 /u/ac/yuechen/PTMap2/PTMap2 > INSTALLED INTEL32::LINUX globus::maxwalltime=50 > > I'm wondering if I have any setup problem or I should contact NCSA > administrator. > > Thank you very much! > > Regards, > > Chen, Yue > > > > > > > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged > and confidential. If the reader of this email message is not the > intended recipient, you are hereby notified that any dissemination, > distribution, or copying of this communication is prohibited. If you > have received this email in error, please notify the sender and > destroy/delete all copies of the transmittal. Thank you. > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From yuechen at bsd.uchicago.edu Wed Apr 29 14:04:11 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Wed, 29 Apr 2009 14:04:11 -0500 Subject: [Swift-user] errors in file transfer References: <1241027477.14561.0.camel@localhost> Message-ID: Hi Mihael, I tried the setting but it still gives me the following error: Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/q on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/m on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/l on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/9 on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/f on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/k on NCSA_Abe Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/7 on NCSA_Abe The log for the search is at: /home/yuechen/PTMap2/PTMap2-unmod-20090429-1402-b5t8cqua.log Thanks! Chen, Yue ________________________________ From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wed 4/29/2009 12:51 PM To: Yue, Chen - BMD Cc: swift user Subject: Re: [Swift-user] errors in file transfer This is the error: Cannot submit job: Could not submit job (qsub reported an exit code o f 170). no error output Try jobmanager="gt2:gt2:PBS" instead of "gt2:PBS". Mihael On Wed, 2009-04-29 at 12:38 -0500, Yue, Chen - BMD wrote: > Hi, > > I was trying to test PTMap application using NCSA Abe. However, I > got many error messages like the following and no process started on > the Abe. 
> > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/b on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/h on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/0 on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/i on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/u on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/n on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1222-926eesff/info/z on NCSA_Abe > > The log of the search is on communicado: > > /home/yuechen/PTMap2/PTMap2-unmod-20090429-1222-926eesff.log > > In the sites.xml, the entry for NCSA Abe is : > > > > jobManager="gt2:PBS"/> > > /cfs/scratch/users/yuechen/swiftwork > fast > > > In the tc.data, the entry is: > > NCSA_Abe PTMap2 /u/ac/yuechen/PTMap2/PTMap2 > INSTALLED INTEL32::LINUX globus::maxwalltime=50 > > I'm wondering if I have any setup problem or I should contact NCSA > administrator. > > Thank you very much! > > Regards, > > Chen, Yue > > > > > > > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged > and confidential. If the reader of this email message is not the > intended recipient, you are hereby notified that any dissemination, > distribution, or copying of this communication is prohibited. If you > have received this email in error, please notify the sender and > destroy/delete all copies of the transmittal. Thank you. > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Apr 29 14:19:26 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 14:19:26 -0500 Subject: [Swift-user] errors in file transfer In-Reply-To: References: <1241027477.14561.0.camel@localhost> Message-ID: <1241032766.16150.2.camel@localhost> On Wed, 2009-04-29 at 14:04 -0500, Yue, Chen - BMD wrote: > Hi Mihael, > > I tried the setting but it still gives me the following error: > > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/q on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/m on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/l on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/9 on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/f on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/k on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/7 on NCSA_Abe Those are more like warnings, not errors. 
The real error should be displayed towards the end of the run. Anyway, it says: "org.globus.gram.GramException: The provided RSL 'queue' parameter is invalid" From yuechen at bsd.uchicago.edu Wed Apr 29 16:06:28 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Wed, 29 Apr 2009 16:06:28 -0500 Subject: [Swift-user] errors in file transfer References: <1241027477.14561.0.camel@localhost> <1241032766.16150.2.camel@localhost> Message-ID: Hi Mihael, I deleted the following line in my sites.xml file for NCSA_Abe and the wrapper transfer warnings are gone. fast I can also find jobs queuing on Abe. However, after quite a while, no job returned. I guess it is because I didn't set a priority and all the jobs are waiting. Is there other way to set priority? I will try again later. I then tested the IU BigRed with my application. Swift showed me the following error and I don't know if this is because of my setting: Progress: Selecting site:1019 Initializing site shared directory:4 Execution failed: Could not initialize shared directory on IU_BigRed Caused by: org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server Caused by: Server refused performing the request. Custom message: Server refused GSSAPI authentication. (error code 1) [Nested exception message: Custom message: Unexpected reply: 530-globus_xio: Server side credential failure 530-globus_gsi_gssapi: Error with GSI credential 530-globus_gsi_gssapi: Error with gss credential handle 530-globus_credential: Error with credential: The host credential: /etc/grid-security/hostcert.pem 530- with subject: /C=US/O=National Center for Supercomputing Applications/CN=gridftp4.bigred.teragrid.iu.edu 530- has expired 4459 minutes ago. 530- 530 End.] The log for this search is at : /home/yuechen/PTMap2/PTMap2-unmod-20090429-1553-vz669563.log In the sites.xml, the entry for the BigRed is : /N/u/tg-yuechen/BigRed/swiftwork fast Thank you for help! Chen, Yue ________________________________ From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wed 4/29/2009 2:19 PM To: Yue, Chen - BMD Cc: swift user Subject: RE: [Swift-user] errors in file transfer On Wed, 2009-04-29 at 14:04 -0500, Yue, Chen - BMD wrote: > Hi Mihael, > > I tried the setting but it still gives me the following error: > > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/q on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/m on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/l on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/9 on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/f on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/k on NCSA_Abe > Failed to transfer wrapper log from > PTMap2-unmod-20090429-1402-b5t8cqua/info/7 on NCSA_Abe Those are more like warnings, not errors. The real error should be displayed towards the end of the run. Anyway, it says: "org.globus.gram.GramException: The provided RSL 'queue' parameter is invalid" This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. 
If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Apr 29 16:23:42 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 16:23:42 -0500 Subject: [Swift-user] errors in file transfer In-Reply-To: References: <1241027477.14561.0.camel@localhost> <1241032766.16150.2.camel@localhost> Message-ID: <1241040222.18377.7.camel@localhost> On Wed, 2009-04-29 at 16:06 -0500, Yue, Chen - BMD wrote: > Hi Mihael, > > I deleted the following line in my sites.xml file for NCSA_Abe and the > wrapper transfer warnings are gone. > > fast > > I can also find jobs queuing on Abe. However, after quite a while, no > job returned. I guess it is because I didn't set a priority and all > the jobs are waiting. When you do qstat, are your jobs in a queued state? > Is there other way to set priority? You should be able to specify the queue. The only problem is that you are specifying a queue that doesn't exist on Abe. This is what I've found online: http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/Doc/Jobs.html#Queues You can also log in, and do a qstat -q, which will show the following: [hategan at honest2 ~]$ qstat -q server: abem5.ncsa.uiuc.edu Queue Memory CPU Time Walltime Node Run Que Lm State ---------------- ------ -------- -------- ---- --- --- -- ----- normal -- -- 48:00:00 600 82 928 -- E R iacat2 -- -- 241:00:0 -- 0 20 -- E R indprio -- -- 48:00:00 600 0 0 -- E R long -- -- 168:00:0 600 13 15 -- E R iacat -- -- 241:00:0 -- 0 0 -- E R industrial -- -- 336:00:0 600 14 32 -- E R lincoln -- -- 241:00:0 -- 2 0 -- E R wide -- -- 48:00:00 1196 6 344 -- E R mlinglin -- -- 168:00:0 256 2 0 -- E R debug -- -- 00:30:00 16 0 4 -- E R fernsler -- -- 168:00:0 32 0 0 -- E R specreq -- -- 241:00:0 600 2 0 -- E R ----- ----- 121 1343 > I will try again later. > > I then tested the IU BigRed with my application. Swift showed me the > following error and I don't know if this is because of my setting: > > Progress: Selecting site:1019 Initializing site shared directory:4 > Execution failed: > Could not initialize shared directory on IU_BigRed > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: > Error communicating with the GridFTP server > Caused by: > Server refused performing the request. Custom message: Server > refused GSSAPI authentication. (error code 1) [Nested exception > message: Custom message: Unexpected reply: 530-globus_xio: Server > side credential failure > 530-globus_gsi_gssapi: Error with GSI credential > 530-globus_gsi_gssapi: Error with gss credential handle > 530-globus_credential: Error with credential: The host > credential: /etc/grid-security/hostcert.pem > 530- with subject: /C=US/O=National Center for Supercomputing > Applications/CN=gridftp4.bigred.teragrid.iu.edu > 530- has expired 4459 minutes ago. > 530- > 530 End.] Bigred, it would seem, has an expired host certificate. This is a problem with the site. I would suggest seding an email to help at teragrid.org with the above message (from "Server refused performing the request" to "530 End.]"). 
From yuechen at bsd.uchicago.edu Wed Apr 29 17:30:07 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Wed, 29 Apr 2009 17:30:07 -0500 Subject: [Swift-user] errors in file transfer References: <1241027477.14561.0.camel@localhost><1241032766.16150.2.camel@localhost> <1241040222.18377.7.camel@localhost> Message-ID: Hi Mihael, When I do qstat, it shows the following line for all my jobs in the queue: 937872.abem5 null yuechen 0 Q(null) normal It looks like no job is running. I did the qstat -q. Should I use the following line instead in sites.xml for shorter Walltime? debug I will send email to help at teragrid.org about the Bigred certificate problem. Thanks! Chen, Yue ________________________________ From: Mihael Hategan [mailto:hategan at mcs.anl.gov] Sent: Wed 4/29/2009 4:23 PM To: Yue, Chen - BMD Cc: swift user Subject: RE: [Swift-user] errors in file transfer On Wed, 2009-04-29 at 16:06 -0500, Yue, Chen - BMD wrote: > Hi Mihael, > > I deleted the following line in my sites.xml file for NCSA_Abe and the > wrapper transfer warnings are gone. > > fast > > I can also find jobs queuing on Abe. However, after quite a while, no > job returned. I guess it is because I didn't set a priority and all > the jobs are waiting. When you do qstat, are your jobs in a queued state? > Is there other way to set priority? You should be able to specify the queue. The only problem is that you are specifying a queue that doesn't exist on Abe. This is what I've found online: http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/Doc/Jobs.html#Queues You can also log in, and do a qstat -q, which will show the following: [hategan at honest2 ~]$ qstat -q server: abem5.ncsa.uiuc.edu Queue Memory CPU Time Walltime Node Run Que Lm State ---------------- ------ -------- -------- ---- --- --- -- ----- normal -- -- 48:00:00 600 82 928 -- E R iacat2 -- -- 241:00:0 -- 0 20 -- E R indprio -- -- 48:00:00 600 0 0 -- E R long -- -- 168:00:0 600 13 15 -- E R iacat -- -- 241:00:0 -- 0 0 -- E R industrial -- -- 336:00:0 600 14 32 -- E R lincoln -- -- 241:00:0 -- 2 0 -- E R wide -- -- 48:00:00 1196 6 344 -- E R mlinglin -- -- 168:00:0 256 2 0 -- E R debug -- -- 00:30:00 16 0 4 -- E R fernsler -- -- 168:00:0 32 0 0 -- E R specreq -- -- 241:00:0 600 2 0 -- E R ----- ----- 121 1343 > I will try again later. > > I then tested the IU BigRed with my application. Swift showed me the > following error and I don't know if this is because of my setting: > > Progress: Selecting site:1019 Initializing site shared directory:4 > Execution failed: > Could not initialize shared directory on IU_BigRed > Caused by: > org.globus.cog.abstraction.impl.file.FileResourceException: > Error communicating with the GridFTP server > Caused by: > Server refused performing the request. Custom message: Server > refused GSSAPI authentication. (error code 1) [Nested exception > message: Custom message: Unexpected reply: 530-globus_xio: Server > side credential failure > 530-globus_gsi_gssapi: Error with GSI credential > 530-globus_gsi_gssapi: Error with gss credential handle > 530-globus_credential: Error with credential: The host > credential: /etc/grid-security/hostcert.pem > 530- with subject: /C=US/O=National Center for Supercomputing > Applications/CN=gridftp4.bigred.teragrid.iu.edu > 530- has expired 4459 minutes ago. > 530- > 530 End.] Bigred, it would seem, has an expired host certificate. This is a problem with the site. 
I would suggest seding an email to help at teragrid.org with the above message (from "Server refused performing the request" to "530 End.]"). This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Apr 29 17:58:12 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 29 Apr 2009 17:58:12 -0500 Subject: [Swift-user] errors in file transfer In-Reply-To: References: <1241027477.14561.0.camel@localhost> <1241032766.16150.2.camel@localhost> <1241040222.18377.7.camel@localhost> Message-ID: <1241045892.21707.1.camel@localhost> On Wed, 2009-04-29 at 17:30 -0500, Yue, Chen - BMD wrote: > Hi Mihael, > > When I do qstat, it shows the following line for all my jobs in the > queue: > > 937872.abem5 null yuechen 0 > Q(null) normal > > It looks like no job is running. Yep. That's what it looks like. > > I did the qstat -q. Should I use the following line instead in > sites.xml for shorter Walltime? > > debug I think so. Though make sure to set coasterWorkerMaxwalltime to 30 minutes if you do. > > I will send email to help at teragrid.org about the Bigred certificate > problem. > > Thanks! You're welcome. > From yuechen at bsd.uchicago.edu Thu Apr 30 12:08:57 2009 From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD) Date: Thu, 30 Apr 2009 12:08:57 -0500 Subject: [Swift-user] Execution error Message-ID: Hi, I came back to re-run my application on NCSA Mercury which was tested successfully last week after I just set up coasters with swift 0.9, but I got many messages like the following: Progress: Stage in:219 Submitting:803 Submitted:1 Progress: Stage in:129 Submitting:703 Submitted:190 Failed but can retry:1 Progress: Stage in:38 Submitting:425 Submitted:556 Failed but can retry:4 Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY Progress: Stage in:1 Submitted:1013 Active:1 Failed but can retry:8 Progress: Submitted:1011 Active:1 Failed but can retry:11 The log file for the successful run last week is ; /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log The log file for the failed run is : /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log I don't think I did anything different, so I don't know why this time they failed. The sites.xml for Mercury is: /gpfs_scratch1/yuechen/swiftwork debug Thank you for help! Chen, Yue This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. 
From wilde at mcs.anl.gov Thu Apr 30 12:20:03 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 12:20:03 -0500 Subject: [Swift-user] Execution error In-Reply-To: References: Message-ID: <49F9DDC3.5090907@mcs.anl.gov> Yue, I'm looking at your logs. I see that swift is encountering qsub errors (code "-68"). Did you either get emails from qsub on Mercury, or have logs in your home dir there or your .globus dir? (I can't access your home directory). Can you look for log files in both $HOME and below .globus (maybe deeper) for that time period, and put them somewhere (like back on CI in a tarball) where we can access them? - Mike On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > Hi, > > I came back to re-run my application on NCSA Mercury which was tested > successfully last week after I just set up coasters with swift 0.9, but > I got many messages like the following: > > Progress: Stage in:219 Submitting:803 Submitted:1 > Progress: Stage in:129 Submitting:703 Submitted:190 Failed but can > retry:1 > Progress: Stage in:38 Submitting:425 Submitted:556 Failed but can retry:4 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY > Progress: Stage in:1 Submitted:1013 Active:1 Failed but can retry:8 > Progress: Submitted:1011 Active:1 Failed but can retry:11 > The log file for the successful run last week is ; > /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > > The log file for the failed run is : > /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > > I don't think I did anything different, so I don't know why this time > they failed. The sites.xml for Mercury is: > > > > jobManager="gt2:PBS"/> > /gpfs_scratch1/yuechen/swiftwork > debug > > > Thank you for help! > > Chen, Yue > > > > > > > > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged and > confidential. If the reader of this email message is not the intended > recipient, you are hereby notified that any dissemination, distribution, > or copying of this communication is prohibited. If you have received > this email in error, please notify the sender and destroy/delete all > copies of the transmittal. Thank you.
> > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Thu Apr 30 12:23:27 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 12:23:27 -0500 Subject: [Swift-user] Execution error In-Reply-To: References: Message-ID: <49F9DE8F.1070404@mcs.anl.gov> Also, what account are you running under? We may need to change you to a new account - as the OSG Training account expires today. If that happend at Noon, it *might* be the problem. - Mike On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: > Hi, > > I came back to re-run my application on NCSA Mercury which was tested > successfully last week after I just set up coasters with swift 0.9, but > I got many messages like the following: > > Progress: Stage in:219 Submitting:803 Submitted:1 > Progress: Stage in:129 Submitting:703 Submitted:190 Failed but can > retry:1 > Progress: Stage in:38 Submitting:425 Submitted:556 Failed but can retry:4 > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY > Failed to transfer wrapper log from > PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY > Progress: Stage in:1 Submitted:1013 Active:1 Failed but can retry:8 > Progress: Submitted:1011 Active:1 Failed but can retry:11 > The log file for the successful run last week is ; > /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log > > The log file for the failed run is : > /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log > > I don't think I did anything different, so I don't know why this time > they failed. The sites.xml for Mercury is: > > > > jobManager="gt2:PBS"/> > /gpfs_scratch1/yuechen/swiftwork > debug > > > Thank you for help! > > Chen, Yue > > > > > > > > > > > > This email is intended only for the use of the individual or entity to > which it is addressed and may contain information that is privileged and > confidential. If the reader of this email message is not the intended > recipient, you are hereby notified that any dissemination, distribution, > or copying of this communication is prohibited. If you have received > this email in error, please notify the sender and destroy/delete all > copies of the transmittal. Thank you. > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Thu Apr 30 12:40:40 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 12:40:40 -0500 Subject: [Swift-user] Execution error In-Reply-To: <49F9DE8F.1070404@mcs.anl.gov> References: <49F9DE8F.1070404@mcs.anl.gov> Message-ID: <49F9E298.8030801@mcs.anl.gov> I just checked - TG-CDA070002T has indeed expired. The best for now is to move to use (only) Ranger, under this account: TG-CCR080022N I will locate and send you a sites.xml entry in a moment. 
You need to go to a web page to activate your Ranger login. Best to contact me in IM and we can work this out. - Mike On 4/30/09 12:23 PM, Michael Wilde wrote: > Also, what account are you running under? We may need to change you to a > new account - as the OSG Training account expires today. > If that happend at Noon, it *might* be the problem. > > - Mike > > > On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: >> Hi, >> >> I came back to re-run my application on NCSA Mercury which was tested >> successfully last week after I just set up coasters with swift 0.9, >> but I got many messages like the following: >> >> Progress: Stage in:219 Submitting:803 Submitted:1 >> Progress: Stage in:129 Submitting:703 Submitted:190 Failed but can >> retry:1 >> Progress: Stage in:38 Submitting:425 Submitted:556 Failed but can >> retry:4 >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY >> Failed to transfer wrapper log from >> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY >> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can retry:8 >> Progress: Submitted:1011 Active:1 Failed but can retry:11 >> The log file for the successful run last week is ; >> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log >> >> The log file for the failed run is : >> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log >> >> I don't think I did anything different, so I don't know why this time >> they failed. The sites.xml for Mercury is: >> >> >> >> > jobManager="gt2:PBS"/> >> /gpfs_scratch1/yuechen/swiftwork >> debug >> >> >> Thank you for help! >> >> Chen, Yue >> >> >> >> >> >> >> >> >> >> >> >> This email is intended only for the use of the individual or entity to >> which it is addressed and may contain information that is privileged >> and confidential. If the reader of this email message is not the >> intended recipient, you are hereby notified that any dissemination, >> distribution, or copying of this communication is prohibited. If you >> have received this email in error, please notify the sender and >> destroy/delete all copies of the transmittal. Thank you. 
>> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From wilde at mcs.anl.gov Thu Apr 30 13:07:55 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 30 Apr 2009 13:07:55 -0500 Subject: [Swift-user] Execution error In-Reply-To: <49F9E298.8030801@mcs.anl.gov> References: <49F9DE8F.1070404@mcs.anl.gov> <49F9E298.8030801@mcs.anl.gov> Message-ID: <49F9E8FB.9020500@mcs.anl.gov> Yue, use this XML pool element to access ranger: /tmp/yuechen/jobdir TG-CCR080022N 16 development 00:40:00 31 50 10 /work/00306/tg455797/swiftwork You will need to also do these steps: Go to this web page to enable your Ranger account: https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx Then login to Ranger via the TeraGrid portal and put your ssh keys in place (assuming you use ssh keys, which you should) While on Ranger, do this: echo $WORK mkdir $work/swiftwork and put the full path of your $WORK/swiftwork directory in the element above. (My login is tg455etc, yours is yuechen) Then scp your code to Ranger and compile it. Then create a tc.data entry for your ptmap app Next, set your time values in the sites.xml entry above to suitable values for Ranger. You'll need to measure times, but I think you will find Ranger about twice as fast as Mercury for CPU-bound jobs. The values above were set for one app job per coaster. I think you can probably do more. If you estimate a run time of 5 minutes, use: 00:30:00 5 Other people on the list - please sanity check what I suggest here. - Mike On 4/30/09 12:40 PM, Michael Wilde wrote: > I just checked - TG-CDA070002T has indeed expired. > > The best for now is to move to use (only) Ranger, under this account: > TG-CCR080022N > > I will locate and send you a sites.xml entry in a moment. > > You need to go to a web page to activate your Ranger login. > > Best to contact me in IM and we can work this out. > > - Mike > > > > On 4/30/09 12:23 PM, Michael Wilde wrote: >> Also, what account are you running under? We may need to change you to >> a new account - as the OSG Training account expires today. >> If that happend at Noon, it *might* be the problem. 
>> >> - Mike >> >> >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote: >>> Hi, >>> >>> I came back to re-run my application on NCSA Mercury which was tested >>> successfully last week after I just set up coasters with swift 0.9, >>> but I got many messages like the following: >>> >>> Progress: Stage in:219 Submitting:803 Submitted:1 >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed but can >>> retry:1 >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed but can >>> retry:4 >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY >>> Failed to transfer wrapper log from >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can retry:8 >>> Progress: Submitted:1011 Active:1 Failed but can retry:11 >>> The log file for the successful run last week is ; >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log >>> >>> The log file for the failed run is : >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log >>> >>> I don't think I did anything different, so I don't know why this time >>> they failed. The sites.xml for Mercury is: >>> >>> >>> >>> >> jobManager="gt2:PBS"/> >>> /gpfs_scratch1/yuechen/swiftwork >>> debug >>> >>> >>> Thank you for help! >>> >>> Chen, Yue >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> This email is intended only for the use of the individual or entity >>> to which it is addressed and may contain information that is >>> privileged and confidential. If the reader of this email message is >>> not the intended recipient, you are hereby notified that any >>> dissemination, distribution, or copying of this communication is >>> prohibited. If you have received this email in error, please notify >>> the sender and destroy/delete all copies of the transmittal. Thank you. >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
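The Ranger pool element in the 13:07 message above was also flattened by the archive; only the values /tmp/yuechen/jobdir, TG-CCR080022N, 16, development, 00:40:00, 31, 50, 10 and /work/00306/tg455797/swiftwork remain. A hedged sketch of how some of those values could sit in a coaster pool entry follows; the element and key names, the gt2:SGE job manager and the host placeholders are assumptions, and the numeric values 16, 50 and 10 are deliberately left unmapped because their keys cannot be recovered from the text:

    <pool handle="RANGER">
      <!-- host names are placeholders; key names are assumed, not quoted from the message -->
      <execution provider="coaster" url="RANGER-GATEKEEPER-HOST" jobManager="gt2:SGE"/>
      <gridftp url="gsiftp://RANGER-GRIDFTP-HOST"/>
      <scratch>/tmp/yuechen/jobdir</scratch>
      <profile namespace="globus" key="project">TG-CCR080022N</profile>
      <profile namespace="globus" key="queue">development</profile>
      <!-- 00:40:00 and 31 read like the coaster worker walltime and the per-job
           walltime that the follow-up ("00:30:00 5") asks to be retuned -->
      <profile namespace="globus" key="coasterWorkerMaxwalltime">00:40:00</profile>
      <profile namespace="globus" key="maxwalltime">31</profile>
      <workdirectory>/work/00306/tg455797/swiftwork</workdirectory>
    </pool>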
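The same message asks for a tc.data entry for the ptmap application. A tc.data line in Swift of this vintage is whitespace-separated: site handle, transformation name, path to the executable on that site, an installation flag, a platform string and a profile column. A sketch, with the executable path purely illustrative since it is not given anywhere in the thread:

    #site    transformation  path                              type       platform        profile
    RANGER   ptmap           /work/00306/tg455797/bin/PTMap2   INSTALLED  INTEL32::LINUX  null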