From zhaozhang at uchicago.edu Tue Jun 2 15:42:44 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 02 Jun 2009 15:42:44 -0500
Subject: [Swift-devel] coaster test failed on SDSC IA-64 Cluster
Message-ID: <4A258EC4.3010307@uchicago.edu>

Hi, Mihael

The language behavior test failed on the SDSC cluster. All gram and coaster
log files can be found on the CI network:
/home/zzhang/swift_coaster/cog/modules/swift/tests/sites/sdsc/

I am attaching the stdout and the sites.xml definition of the SDSC cluster
site. Could you help find out where things
go wrong? Thanks.

best
zhao

[zzhang at communicado sites]$ cat coaster_new/tgsdsc-pbs-gram2.xml
TG-DBS080005N /users/zzhang/work 4 2 5 1 2 false

[zzhang at communicado sites]$ ./run-site coaster_new/tgsdsc-pbs-gram2.xml
testing site configuration: coaster_new/tgsdsc-pbs-gram2.xml
Removing files from previous runs
Running test 061-cattwo at Tue Jun 2 15:27:48 CDT 2009
Swift svn swift-r2949 cog-r2406

RunID: 20090602-1527-q2au81s2
Progress:
Progress: Stage in:1
Progress: Submitting:1
Progress: Submitting:1
Progress: Submitting:1
Progress: Submitting:1
Progress: Submitted:1
Progress: Active:1
Progress: Failed but can retry:1
Failed to transfer wrapper log from 061-cattwo-20090602-1527-q2au81s2/info/c on tgsdsc
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090602-1527-q2au81s2/info/e on tgsdsc
Progress: Stage in:1
Progress: Active:1
Failed to transfer wrapper log from 061-cattwo-20090602-1527-q2au81s2/info/g on tgsdsc
Progress: Failed:1
Execution failed:
Exception in cat:
Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
Host: tgsdsc
Directory: 061-cattwo-20090602-1527-q2au81s2/jobs/g/cat-gvrvuobj
stderr.txt:
stdout.txt:
----
Caused by:
Block task failed:
org.globus.gram.GramException: The job failed when the job manager attempted to run it
at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
at org.globus.gram.GramJob.setStatus(GramJob.java:184)
at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
at java.lang.Thread.run(Thread.java:534)

Cleaning up...
Shutting down service at https://198.202.112.33:45214
Got channel MetaChannel: 25326891 -> GSSSChannel-null(1)
- Done
SWIFT RETURN CODE NON-ZERO - test 061-cattwo

From hategan at mcs.anl.gov Tue Jun 2 17:19:02 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 02 Jun 2009 17:19:02 -0500
Subject: [Swift-devel] coaster test failed on SDSC IA-64 Cluster
In-Reply-To: <4A258EC4.3010307@uchicago.edu>
References: <4A258EC4.3010307@uchicago.edu>
Message-ID: <1243981142.31356.0.camel@localhost>

Gram is failing, so you should try a plain gram job to confirm that it
is indeed the issue.

On Tue, 2009-06-02 at 15:42 -0500, Zhao Zhang wrote:
> Hi, Mihael
>
> The language behavior test failed on the SDSC cluster. All gram and coaster
> log files can be found on the CI network:
> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/sdsc/
>
> I am attaching the stdout and the sites.xml definition of the SDSC cluster
> site. Could you help find out where things
> go wrong? Thanks.
>
> best
> zhao
>
> [zzhang at communicado sites]$ cat coaster_new/tgsdsc-pbs-gram2.xml
>
>
>
> jobManager="gt2:gt2:pbs"/>
> TG-DBS080005N
> /users/zzhang/work
> 4
> 2
> 5
> 1
> 2
> false
>
>
>
>
> [zzhang at communicado sites]$ ./run-site coaster_new/tgsdsc-pbs-gram2.xml
> testing site configuration: coaster_new/tgsdsc-pbs-gram2.xml
> Removing files from previous runs
> Running test 061-cattwo at Tue Jun 2 15:27:48 CDT 2009
> Swift svn swift-r2949 cog-r2406
>
> RunID: 20090602-1527-q2au81s2
> Progress:
> Progress: Stage in:1
> Progress: Submitting:1
> Progress: Submitting:1
> Progress: Submitting:1
> Progress: Submitting:1
> Progress: Submitted:1
> Progress: Active:1
> Progress: Failed but can retry:1
> Failed to transfer wrapper log from
> 061-cattwo-20090602-1527-q2au81s2/info/c on tgsdsc
> Progress: Active:1
> Failed to transfer wrapper log from
> 061-cattwo-20090602-1527-q2au81s2/info/e on tgsdsc
> Progress: Stage in:1
> Progress: Active:1
> Failed to transfer wrapper log from
> 061-cattwo-20090602-1527-q2au81s2/info/g on tgsdsc
> Progress: Failed:1
> Execution failed:
> Exception in cat:
> Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
> Host: tgsdsc
> Directory: 061-cattwo-20090602-1527-q2au81s2/jobs/g/cat-gvrvuobj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Block task failed:
> org.globus.gram.GramException: The job failed when the job manager
> attempted to run it
> at
> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
> at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
> at java.lang.Thread.run(Thread.java:534)
>
> Cleaning up...
> Shutting down service at https://198.202.112.33:45214
> Got channel MetaChannel: 25326891 -> GSSSChannel-null(1)
> - Done
> SWIFT RETURN CODE NON-ZERO - test 061-cattwo
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From zhaozhang at uchicago.edu Tue Jun 2 18:13:59 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Tue, 02 Jun 2009 18:13:59 -0500
Subject: [Swift-devel] coaster test failed on SDSC IA-64 Cluster
In-Reply-To: <1243981142.31356.0.camel@localhost>
References: <4A258EC4.3010307@uchicago.edu> <1243981142.31356.0.camel@localhost>
Message-ID: <4A25B237.2080000@uchicago.edu>

Do you mean a globus-job-run, or a test with a jobmanager other than
coaster? I did a globus-job-run, and it was fine.

[zzhang at communicado coaster_new]$ globus-job-run tg-login1.sdsc.teragrid.org /usr/bin/id
uid=501593(zzhang) gid=5387(anl101) groups=5387(anl101)

zhao

Mihael Hategan wrote:
> Gram is failing, so you should try a plain gram job to confirm that it
> is indeed the issue.
>
> On Tue, 2009-06-02 at 15:42 -0500, Zhao Zhang wrote:
>
>> Hi, Mihael
>>
>> The language behavior test failed on the SDSC cluster. All gram and coaster
>> log files can be found on the CI network:
>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/sdsc/
>>
>> I am attaching the stdout and the sites.xml definition of the SDSC cluster
>> site. Could you help find out where things
>> go wrong? Thanks.
>>
>> best
>> zhao
>>
>> [zzhang at communicado sites]$ cat coaster_new/tgsdsc-pbs-gram2.xml
>>
>>
>>
>> jobManager="gt2:gt2:pbs"/>
>> TG-DBS080005N
>> /users/zzhang/work
>> 4
>> 2
>> 5
>> 1
>> 2
>> false
>>
>>
>>
>>
>> [zzhang at communicado sites]$ ./run-site coaster_new/tgsdsc-pbs-gram2.xml
>> testing site configuration: coaster_new/tgsdsc-pbs-gram2.xml
>> Removing files from previous runs
>> Running test 061-cattwo at Tue Jun 2 15:27:48 CDT 2009
>> Swift svn swift-r2949 cog-r2406
>>
>> RunID: 20090602-1527-q2au81s2
>> Progress:
>> Progress: Stage in:1
>> Progress: Submitting:1
>> Progress: Submitting:1
>> Progress: Submitting:1
>> Progress: Submitting:1
>> Progress: Submitted:1
>> Progress: Active:1
>> Progress: Failed but can retry:1
>> Failed to transfer wrapper log from
>> 061-cattwo-20090602-1527-q2au81s2/info/c on tgsdsc
>> Progress: Active:1
>> Failed to transfer wrapper log from
>> 061-cattwo-20090602-1527-q2au81s2/info/e on tgsdsc
>> Progress: Stage in:1
>> Progress: Active:1
>> Failed to transfer wrapper log from
>> 061-cattwo-20090602-1527-q2au81s2/info/g on tgsdsc
>> Progress: Failed:1
>> Execution failed:
>> Exception in cat:
>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
>> Host: tgsdsc
>> Directory: 061-cattwo-20090602-1527-q2au81s2/jobs/g/cat-gvrvuobj
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>> Block task failed:
>> org.globus.gram.GramException: The job failed when the job manager
>> attempted to run it
>> at
>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
>> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>> at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>> at java.lang.Thread.run(Thread.java:534)
>>
>> Cleaning up...
>> Shutting down service at https://198.202.112.33:45214
>> Got channel MetaChannel: 25326891 -> GSSSChannel-null(1)
>> - Done
>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>
>

From wilde at mcs.anl.gov Tue Jun 2 19:18:56 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 02 Jun 2009 19:18:56 -0500
Subject: [Swift-devel] coaster test failed on SDSC IA-64 Cluster
In-Reply-To: <4A25B237.2080000@uchicago.edu>
References: <4A258EC4.3010307@uchicago.edu> <1243981142.31356.0.camel@localhost> <4A25B237.2080000@uchicago.edu>
Message-ID: <4A25C170.5050606@mcs.anl.gov>

Try doing a globus-job-run to the PBS job manager, specifying your
project. Also try a simple swift job to the PBS job manager, with the
same project.

- Mike

On 6/2/09 6:13 PM, Zhao Zhang wrote:
> Do you mean a globus-job-run, or a test with a jobmanager other than
> coaster? I did a globus-job-run, and it was fine.
> [zzhang at communicado coaster_new]$ globus-job-run
> tg-login1.sdsc.teragrid.org /usr/bin/id
> uid=501593(zzhang) gid=5387(anl101) groups=5387(anl101)
>
> zhao
>
> Mihael Hategan wrote:
>> Gram is failing, so you should try a plain gram job to confirm that it
>> is indeed the issue.
>>
>> On Tue, 2009-06-02 at 15:42 -0500, Zhao Zhang wrote:
>>
>>> Hi, Mihael
>>>
>>> The language behavior test failed on the SDSC cluster. All gram and
>>> coaster log files can be found on the CI network:
>>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/sdsc/
>>>
>>> I am attaching the stdout and the sites.xml definition of the SDSC
>>> cluster site.
>>> Could you help find out where things
>>> go wrong? Thanks.
>>>
>>> best
>>> zhao
>>>
>>> [zzhang at communicado sites]$ cat coaster_new/tgsdsc-pbs-gram2.xml
>>>
>>>
>>>
>>> jobManager="gt2:gt2:pbs"/>
>>> TG-DBS080005N
>>> /users/zzhang/work
>>> 4
>>> 2
>>> 5
>>> 1
>>> 2
>>> key="remoteMonitorEnabled">false
>>>
>>>
>>>
>>> [zzhang at communicado sites]$ ./run-site coaster_new/tgsdsc-pbs-gram2.xml
>>> testing site configuration: coaster_new/tgsdsc-pbs-gram2.xml
>>> Removing files from previous runs
>>> Running test 061-cattwo at Tue Jun 2 15:27:48 CDT 2009
>>> Swift svn swift-r2949 cog-r2406
>>>
>>> RunID: 20090602-1527-q2au81s2
>>> Progress:
>>> Progress: Stage in:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitted:1
>>> Progress: Active:1
>>> Progress: Failed but can retry:1
>>> Failed to transfer wrapper log from
>>> 061-cattwo-20090602-1527-q2au81s2/info/c on tgsdsc
>>> Progress: Active:1
>>> Failed to transfer wrapper log from
>>> 061-cattwo-20090602-1527-q2au81s2/info/e on tgsdsc
>>> Progress: Stage in:1
>>> Progress: Active:1
>>> Failed to transfer wrapper log from
>>> 061-cattwo-20090602-1527-q2au81s2/info/g on tgsdsc
>>> Progress: Failed:1
>>> Execution failed:
>>> Exception in cat:
>>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
>>> Host: tgsdsc
>>> Directory: 061-cattwo-20090602-1527-q2au81s2/jobs/g/cat-gvrvuobj
>>> stderr.txt:
>>>
>>> stdout.txt:
>>>
>>> ----
>>>
>>> Caused by:
>>> Block task failed:
>>> org.globus.gram.GramException: The job failed when the job manager
>>> attempted to run it
>>> at
>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:531)
>>>
>>> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>>> at
>>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>>> at java.lang.Thread.run(Thread.java:534)
>>>
>>> Cleaning up...
>>> Shutting down service at https://198.202.112.33:45214
>>> Got channel MetaChannel: 25326891 -> GSSSChannel-null(1)
>>> - Done
>>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>
>>
>>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Wed Jun 3 10:10:52 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 3 Jun 2009 15:10:52 +0000 (GMT)
Subject: [Swift-devel] small-scale swift pseudopublications
Message-ID: 

There's a bunch of stuff floating around that isn't published material but is still interesting - what mostly makes me think of this is the stuff to be presented at the DSL Workshop this year, but I suspect there is other material floating around too. It might be interesting to keep a list of this online on the swift website, linking to whatever material is available.
--

From wilde at mcs.anl.gov Wed Jun 3 16:24:08 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 03 Jun 2009 16:24:08 -0500
Subject: [Swift-devel] [Fwd: [teraport-notify] tp-login2 job submission failures]
Message-ID: <4A26E9F8.7060609@mcs.anl.gov>

just fyi

-------- Original Message --------
Subject: [teraport-notify] tp-login2 job submission failures
Date: Wed, 3 Jun 2009 14:56:34 -0500
From: Greg Cross
To: teraport-notify at ci.uchicago.edu

Users have reported failures when attempting job submissions to Teraport's scheduler using the "qsub" command. This problem has been isolated to the node tp-login2.ci.uchicago.edu. The cause of these failures stems from recent misconfigurations in DNS service, which is out of the CI's scope of control. You will receive notification when the responsible authority has corrected the misconfigurations. In the meantime, all users should use tp-login1.ci.uchicago.edu exclusively for job submission purposes.

_______________________________________________
teraport-notify mailing list
teraport-notify at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/teraport-notify

From zhaozhang at uchicago.edu Thu Jun 4 14:08:09 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 04 Jun 2009 14:08:09 -0500
Subject: [Swift-devel] Test Cases for Swift Test
Message-ID: <4A281B99.7080901@uchicago.edu>

Hi, Ben

Here is a list of the tests I put in the regular Swift test with coasters on TeraGrid for now. Soon, we are going to test Swift with Condor-G on OSG, too. Are there any other tests you want to put in the regular test for Swift - more sites, more applications, or more Swift features?

best
zhao

1 Sanity Test (Swift language behavior test)
061-cattwo
130-fmri
103-quote.swift
1032-singlequote.swift
1031-quote.swift
1033-singlequote.swift
141-space-in-filename
142-space-and-quotes

2 Data Movement Test
foreach data in {1KB, 1MB, "10MB, 100MB, 1GB"}
  foreach i in {1, 10, 100, 1000}
    copy data to site
    redirect data to output_file
    copy output_file back to submit host
  done
done

3. Application Test
SCIP - 1000 jobs on Ranger; I am building scip on uc-teragrid and Abe now, and hopefully can add those sites today.
DOCK6 - 1000 jobs on uc teragrid.

From aespinosa at cs.uchicago.edu Mon Jun 8 16:24:37 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 8 Jun 2009 16:24:37 -0500
Subject: [Swift-devel] block coasters not registering on proper queue
Message-ID: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>

Is there a default maxwalltime being submitted to the LRM if nothing
is specified? I set this configuration to use the "fast" queue in
sites.xml, but I keep getting placed in the "extended" queue.

sites.xml

fast
/home/aespinosa/work
50
10
20

gram log snippet:
...
...
Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir: /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960 Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed (may be harmless): Operation not permitted Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may be harmless): Operation not permitted Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from job description Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time limit from job description Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60 Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir: /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960 Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "http://128.135.125.118:56015" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "http://128.135.125.118:56015" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000" ... ... $grep fast gram*.log: gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:08 
JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) Swift version: Swift svn swift-r2949 cog-r2406 -Allan From hategan at mcs.anl.gov Mon Jun 8 17:49:25 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 08 Jun 2009 17:49:25 -0500 Subject: [Swift-devel] block coasters not registering on proper queue In-Reply-To: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com> References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com> Message-ID: <1244501365.5919.3.camel@localhost> On Mon, 2009-06-08 at 16:24 -0500, Allan Espinosa wrote: > Is there a default maxwalltime being submitted to the LRM if nothing > is specified? The block maxwalltime varies depending on the job maxwalltimes and the overallocation parameters. So in a sense, yes. > I made in this configuration to use the "fast" ueue in > sites.xml but i keep getting placed inside the "exteneded" queue. My gut feeling tells me that the LRM would not change the queue if the walltime didn't fit, but would instead complain that the maxwalltime is larger than what the queue accepts. So it looks more like the queue parameter doesn't get passed to the LRM properly. > > sites.xml > > > > jobmanager="gt2:gt2:pbs" /> > fast > /home/aespinosa/work > 50 > 10 > 20 > > > > gram log snippet: > ... > ... > Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created. 
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir: > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960 > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit > Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed > (may be harmless): Operation not permitted > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl > Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may > be harmless): Operation not permitted > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from > job description > Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time > limit from job description > Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60 > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir: > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960 > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument > "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl" > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to > "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl" > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument > "http://128.135.125.118:56015" > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to > "http://128.135.125.118:56015" > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000" > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000" > ... > ... > > $grep fast gram*.log: > gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) 
> gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
>
>
> Swift version: Swift svn swift-r2949 cog-r2406
>
> -Allan
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Tue Jun 9 03:37:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 9 Jun 2009 08:37:13 +0000 (GMT)
Subject: [Swift-devel] block coasters not registering on proper queue
In-Reply-To: <1244501365.5919.3.camel@localhost>
References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com> <1244501365.5919.3.camel@localhost>
Message-ID: 

> My gut feeling tells me that the LRM would not change the queue if the
> walltime didn't fit, but would instead complain that the maxwalltime is
> larger than what the queue accepts.

That's also my understanding of how teraport behaves.

--

From hockyg at uchicago.edu Tue Jun 9 14:33:29 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Tue, 09 Jun 2009 14:33:29 -0500
Subject: [Swift-devel] Error conditional on name of sites.file
Message-ID: <4A2EB909.9020106@uchicago.edu>

Hi everyone,
When I use this file:

>
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5">
>
>
>
> /home/hockyg/swiftwork
>
>

I get this error:
> swift sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input
> -nsims=1 -kill=1
> Execution failed:
> Could not load file sites.local:
> org.globus.cog.karajan.translator.TranslationException:
> org.globus.cog.karajan.parser.ParsingException: Line 8:
>
> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER but
> got '/'

When I rename it to something else:
> swift sites.local.xml -prot=T1af7
> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1
> Swift svn swift-r2953 cog-r2406
>
> RunID: 20090609-1430-851tnjye
> Progress:
> Progress: Active:1
> Progress: Active:1
> Progress: Checking status:1
> Final status: Finished successfully:1

From wilde at mcs.anl.gov Tue Jun 9 14:56:34 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 09 Jun 2009 14:56:34 -0500
Subject: [Swift-devel] "multiple writers-in-iterate" problem
Message-ID: <4A2EBE72.1020105@mcs.anl.gov>

Ben, can you fix this in the next release?

-------- Original Message --------
Subject: Re: [Swift-devel] continued questions on iterate
Date: Sun, 01 Mar 2009 23:39:42 -0600
From: Michael Wilde
To: swift-devel
References: <49AB6653.6010306 at mcs.anl.gov>

I'm able to work around this by moving the s[0] assignments inside the iterate block, in an if(i==0) {} else {} construct. Still, it seems the restriction is not intended.
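In other words, something like the following - a minimal sketch only, reconstructed from the description above and not tested here, with the indices shifted so that each iteration writes just s[i] and every write to s stays inside the one iterate block:

// untested sketch of the if(i==0) {} else {} workaround described above
string s[];
iterate i {
    if (i == 0) {
        s[i] = "hi ";
    } else {
        s[i] = @strcat(s[i-1], "hi ");
    }
    trace(s[i]);
} until (i == 5);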
- Mike

On 3/1/09 10:53 PM, Michael Wilde wrote:
> This program:
>
> string s[];
> s[0]="hi ";
> iterate i {
> s[i+1] = @strcat(s[i],"hi ");
> trace(s[i]);
> } until(i==5);
>
> Gives:
>
> com$ swift it4.swift
> Could not start execution.
> variable s has multiple writers.
>
> --
> It's similar to the tutorial example:
>
> counterfile a[] ;
>
> a[0] = echo("793578934574893");
>
> iterate v {
> a[v+1] = countstep(a[v]);
> print("extract int value ",@extractint(a[v+1]));
> } until (@extractint(a[v+1]) <= 1);
>
> --
>
> ...which I reported earlier as having problems (I think in addition to
> the one above?)
>
> This is using the latest swift, rev 2631, and latest cog.
>
> I thought I had issues like this licked, but then updated the code to
> get closer to what the user needs.
>
> In this example, I don't see any violation of single-assignment, but
> apparently swift does.
>
> The full example that the test case above is for is at:
> www.ci.uchicago.edu/~wilde/oops8.swift, which encounters the same
> multiple-writer problem.
>
> I start with an initial "secondary structure" string of all A's, same
> length as the protein sequence. After each folding round, a new
> structure is derived for analysis and used as the starting point for the
> next round. This has the same data access pattern as array s[] above:
>
> foreach p, pn in protein {
> OOPSOut result[][] ;
> SecSeq secseq[] prefix=@strcat("seqseq/",p,"/"),suffix=".secseq">;
> OOPSIn oopsin ;
> secseq[0] = sedfasta(oopsin.fasta, ["-e","s/./A/g"]);
> boolean converged[];
> iterate i {
> SecSeq s;
> result[i] = doRound(p,oopsin,secseq[i],i);
> (converged[i],s) = analyzeResult(result[i], p, i, secseq[i]);
> secseq[i+1] = s;
> } until (converged[i] || (i==3));
> }
>
> In this case, I get the same message for array secseq (variable has
> multiple writers).
>
> I
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

_______________________________________________
Swift-devel mailing list
Swift-devel at ci.uchicago.edu
http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Tue Jun 9 15:00:09 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 09 Jun 2009 15:00:09 -0500
Subject: [Swift-devel] Error conditional on name of sites.file
In-Reply-To: <4A2EB909.9020106@uchicago.edu>
References: <4A2EB909.9020106@uchicago.edu>
Message-ID: <4A2EBF49.1070706@mcs.anl.gov>

Looks like the sites file needs to end in .xml?
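If so, then presumably anything whose name doesn't end in .xml is handed to the Karajan script parser rather than the XML sites parser - which would explain why the failure above surfaces as a karajan.parser.ParsingException instead of a clearer "not a sites file" message. Until the message improves, the workaround is just to keep the suffix, along the lines of the second run above:

cp sites.local sites.local.xml   # identical content, .xml suffix
swift sites.local.xml ...        # same arguments as before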
On 6/9/09 2:33 PM, Glen Hocky wrote: > Hi everyone, > When i use this file: > >> >> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5"> >> >> >> >> /home/hockyg/swiftwork >> >> >> > > I get this error >> swift > sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input >> -nsims=1 -kill=1 >> Execution failed: >> Could not load file sites.local: >> org.globus.cog.karajan.translator.TranslationException: >> org.globus.cog.karajan.parser.ParsingException: Line 8: >> >> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER but >> got '/' > When I rename it to something else > >> swift > sites.local.xml -prot=T1af7 >> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1 >> Swift svn swift-r2953 cog-r2406 >> >> RunID: 20090609-1430-851tnjye >> Progress: >> Progress: Active:1 >> Progress: Active:1 >> Progress: Checking status:1 >> Final status: Finished successfully:1 > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Tue Jun 9 15:10:13 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 9 Jun 2009 15:10:13 -0500 Subject: [Swift-devel] active jobs vs available processors on submitted coaster queues Message-ID: <50b07b4b0906091310j502935b2jc144e126aeb4edf8@mail.gmail.com> I was expecting to have 2 active jobs at a time from the swift log but instead got only one at a time: Swift svn swift-r2949 cog-r2406 RunID: out.run_000 Progress: Progress: Selecting site:4 Initializing site shared directory:1 Stage in:1 Progress: Stage in:6 Progress: Stage in:6 Progress: Stage in:6 Progress: Stage in:6 Progress: Stage in:6 Progress: Stage in:6 Progress: Stage in:5 Submitting:1 Progress: Submitting:5 Submitted:1 Progress: Submitted:6 Progress: Submitted:5 Active:1 Progress: Submitted:5 Active:1 Progress: Submitted:5 Active:1 Progress: Submitted:5 Active:1 Progress: Submitted:5 Active:1 Progress: Submitted:5 Checking status:1 Progress: Submitted:4 Active:1 Finished successfully:1 Progress: Submitted:4 Active:1 Finished successfully:1 Progress: Submitted:4 Active:1 Finished successfully:1 Progress: Submitted:4 Active:1 Finished successfully:1 Progress: Submitted:4 Active:1 Finished successfully:1 Progress: Submitted:4 Active:1 Finished successfully:1 Progress: Submitted:4 Checking status:1 Finished successfully:1 Progress: Submitted:3 Active:1 Finished successfully:2 Progress: Submitted:3 Active:1 Finished successfully:2 Progress: Submitted:3 Active:1 Finished successfully:2 Progress: Submitted:3 Checking status:1 Finished successfully:2 Progress: Submitted:2 Active:1 Finished successfully:3 ... ... uc-teragrid queue status: $showq -u $USER [aespinosa at tg-login1 ~]$ showq -u $USER active jobs------------------------ JOBID USERNAME STATE PROCS REMAINING STARTTIME 2015982 aespinos Running 2 00:55:41 Tue Jun 9 15:02:18 1 active job 2 of 116 processors in use by local jobs (1.72%) 42 of 58 nodes active (72.41%) eligible jobs---------------------- JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME 0 eligible jobs blocked jobs----------------------- JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME 0 blocked jobs Total job: 1 sites.xml: /home/aespinosa/blast-runs 5 1.26 ia64-compute 4 16 -- Allan M. 
Espinosa
PhD student, Computer Science
University of Chicago

From aespinosa at cs.uchicago.edu Tue Jun 9 15:31:06 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 9 Jun 2009 15:31:06 -0500
Subject: [Swift-devel] block coasters not registering on proper queue
In-Reply-To: <1244501365.5919.3.camel@localhost>
References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com> <1244501365.5919.3.camel@localhost>
Message-ID: <50b07b4b0906091331y598949cfi353155ae646b7826@mail.gmail.com>

OK, so I guess I should file this on Bugzilla...

2009/6/8 Mihael Hategan :
> On Mon, 2009-06-08 at 16:24 -0500, Allan Espinosa wrote:
>> Is there a default maxwalltime being submitted to the LRM if nothing
>> is specified?
>
> The block maxwalltime varies depending on the job maxwalltimes and the
> overallocation parameters. So in a sense, yes.
>
>> I set this configuration to use the "fast" queue in
>> sites.xml, but I keep getting placed in the "extended" queue.
>
> My gut feeling tells me that the LRM would not change the queue if the
> walltime didn't fit, but would instead complain that the maxwalltime is
> larger than what the queue accepts. So it looks more like the queue
> parameter doesn't get passed to the LRM properly.
>
>>
>> sites.xml
>>
>>
>> jobmanager="gt2:gt2:pbs" />
>> fast
>> /home/aespinosa/work
>> 50
>> 10
>> 20
>>
>>
>> gram log snippet:
>> ...
>> ...
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed
>> (may be harmless): Operation not permitted
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may
>> be harmless): Operation not permitted
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from
>> job description
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT:    using queue default
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time
>> limit from job description
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT:    using maxwalltime of 60
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
>> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
>> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
>> "http://128.135.125.118:56015"
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
>> "http://128.135.125.118:56015"
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000"
>> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000"
>> ...
>> ...
>> >> $grep fast gram*.log: >> gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 = >> GLOBUS_FAILURE (try Perl scripts) >> >> >> Swift version: Swift svn swift-r2949 cog-r2406 >> >> -Allan -- Allan M. 
Espinosa
PhD student, Computer Science
University of Chicago

From hockyg at uchicago.edu Tue Jun 9 15:34:43 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Tue, 09 Jun 2009 15:34:43 -0500
Subject: [Swift-devel] Error conditional on name of sites.file
In-Reply-To: <4A2EBF49.1070706@mcs.anl.gov>
References: <4A2EB909.9020106@uchicago.edu> <4A2EBF49.1070706@mcs.anl.gov>
Message-ID: <4A2EC763.9030303@uchicago.edu>

To clarify, I don't mind if the filename must end in .xml or w/e; I just wish the error message had told me that, rather than my figuring it out by trial and error.

Michael Wilde wrote:
> Looks like the sites file needs to end in .xml?
>
> On 6/9/09 2:33 PM, Glen Hocky wrote:
>> Hi everyone,
>> When I use this file:
>>
>>>
>>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5">
>>>
>>>
>>>
>>> /home/hockyg/swiftwork
>>>
>>>
>>
>> I get this error
>>> swift
>> sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input
>>> -nsims=1 -kill=1
>>> Execution failed:
>>> Could not load file sites.local:
>>> org.globus.cog.karajan.translator.TranslationException:
>>> org.globus.cog.karajan.parser.ParsingException: Line 8:
>>>
>>> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER
>>> but got '/'
>> When I rename it to something else
>>
>>> swift
>> sites.local.xml -prot=T1af7
>>> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1
>>> Swift svn swift-r2953 cog-r2406
>>>
>>> RunID: 20090609-1430-851tnjye
>>> Progress:
>>> Progress: Active:1
>>> Progress: Active:1
>>> Progress: Checking status:1
>>> Final status: Finished successfully:1

From bugzilla-daemon at mcs.anl.gov Tue Jun 9 15:34:25 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Tue, 9 Jun 2009 15:34:25 -0500 (CDT)
Subject: [Swift-devel] [Bug 211] New: block coasters not registering on proper queue
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=211

           Summary: block coasters not registering on proper queue
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Specific site issues
        AssignedTo: hategan at mcs.anl.gov
        ReportedBy: aespinosa at cs.uchicago.edu

Is there a default maxwalltime being submitted to the LRM if nothing is
specified? I set this configuration to use the "fast" queue in sites.xml,
but I keep getting placed in the "extended" queue.

sites.xml

fast
/home/aespinosa/work
50
10
20

gram log snippet:
...
...
Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir: /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960 Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed (may be harmless): Operation not permitted Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may be harmless): Operation not permitted Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from job description Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time limit from job description Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60 Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir: /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960 Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "http://128.135.125.118:56015" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "http://128.135.125.118:56015" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000" Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000" ... ... $grep fast gram*.log: gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:08 
JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts) Swift version: Swift svn swift-r2949 cog-r2406 -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From wilde at mcs.anl.gov Tue Jun 9 16:09:48 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 09 Jun 2009 16:09:48 -0500 Subject: [Swift-devel] Error conditional on name of sites.file In-Reply-To: <4A2EC763.9030303@uchicago.edu> References: <4A2EB909.9020106@uchicago.edu> <4A2EBF49.1070706@mcs.anl.gov> <4A2EC763.9030303@uchicago.edu> Message-ID: <4A2ECF9C.9030902@mcs.anl.gov> Indeed. We should file it as a bug. On 6/9/09 3:34 PM, Glen Hocky wrote: > To clarify, I don't mind if the filename must end in .xml or w/e, I just > wish the error message would have told me that rather than figuring it > out by trial and error > > Michael Wilde wrote: >> Looks like the sites file needs to end in .xml? >> >> On 6/9/09 2:33 PM, Glen Hocky wrote: >>> Hi everyone, >>> When i use this file: >>> >>>> >>>> >>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5"> >>>> >>>> >>>> >>>> /home/hockyg/swiftwork >>>> >>>> >>>> >>> >>> I get this error >>>> swift >>> sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input >>>> -nsims=1 -kill=1 >>>> Execution failed: >>>> Could not load file sites.local: >>>> org.globus.cog.karajan.translator.TranslationException: >>>> org.globus.cog.karajan.parser.ParsingException: Line 8: >>>> >>>> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER >>>> but got '/' >>> When I rename it to something else >>> >>>> swift >>> sites.local.xml -prot=T1af7 >>>> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1 >>>> Swift svn swift-r2953 cog-r2406 >>>> >>>> RunID: 20090609-1430-851tnjye >>>> Progress: >>>> Progress: Active:1 >>>> Progress: Active:1 >>>> Progress: Checking status:1 >>>> Final status: Finished successfully:1 >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Jun 10 02:26:51 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 10 Jun 2009 02:26:51 -0500 Subject: [Swift-devel] Error conditional on name of sites.file In-Reply-To: <4A2EBF49.1070706@mcs.anl.gov> References: <4A2EB909.9020106@uchicago.edu> <4A2EBF49.1070706@mcs.anl.gov> Message-ID: <1244618811.16077.0.camel@localhost> On Tue, 2009-06-09 at 15:00 -0500, Michael Wilde wrote: > Looks like the sites file needs to end in .xml? Yes. 
> > On 6/9/09 2:33 PM, Glen Hocky wrote: > > Hi everyone, > > When i use this file: > > > >> > >> >> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.5"> > >> > >> > >> > >> /home/hockyg/swiftwork > >> > >> > >> > > > > I get this error > >> swift >> sites.local -prot=T1af7 -indir=/home/hockyg/swoops/test_swift/input > >> -nsims=1 -kill=1 > >> Execution failed: > >> Could not load file sites.local: > >> org.globus.cog.karajan.translator.TranslationException: > >> org.globus.cog.karajan.parser.ParsingException: Line 8: > >> > >> Expected '[' or '(' or DIGITS() or '+' or '-' or '"' or IDENTIFIER but > >> got '/' > > When I rename it to something else > > > >> swift >> sites.local.xml -prot=T1af7 > >> -indir=/home/hockyg/swoops/test_swift/input -nsims=1 -kill=1 > >> Swift svn swift-r2953 cog-r2406 > >> > >> RunID: 20090609-1430-851tnjye > >> Progress: > >> Progress: Active:1 > >> Progress: Active:1 > >> Progress: Checking status:1 > >> Final status: Finished successfully:1 > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Wed Jun 10 02:29:16 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 10 Jun 2009 02:29:16 -0500 Subject: [Swift-devel] active jobs vs available processors on submitted coaster queues In-Reply-To: <50b07b4b0906091310j502935b2jc144e126aeb4edf8@mail.gmail.com> References: <50b07b4b0906091310j502935b2jc144e126aeb4edf8@mail.gmail.com> Message-ID: <1244618956.16077.2.camel@localhost> I need to look at the coaster log. On Tue, 2009-06-09 at 15:10 -0500, Allan Espinosa wrote: > I was expecting to have 2 active jobs at a time from the swift log but > instead got only one at a time: > Swift svn swift-r2949 cog-r2406 > > RunID: out.run_000 > Progress: > Progress: Selecting site:4 Initializing site shared directory:1 Stage in:1 > Progress: Stage in:6 > Progress: Stage in:6 > > > > Progress: Stage in:6 > Progress: Stage in:6 > Progress: Stage in:6 > Progress: Stage in:6 > Progress: Stage in:5 Submitting:1 > Progress: Submitting:5 Submitted:1 > Progress: Submitted:6 > Progress: Submitted:5 Active:1 > Progress: Submitted:5 Active:1 > Progress: Submitted:5 Active:1 > Progress: Submitted:5 Active:1 > Progress: Submitted:5 Active:1 > Progress: Submitted:5 Checking status:1 > Progress: Submitted:4 Active:1 Finished successfully:1 > Progress: Submitted:4 Active:1 Finished successfully:1 > Progress: Submitted:4 Active:1 Finished successfully:1 > Progress: Submitted:4 Active:1 Finished successfully:1 > Progress: Submitted:4 Active:1 Finished successfully:1 > Progress: Submitted:4 Active:1 Finished successfully:1 > Progress: Submitted:4 Checking status:1 Finished successfully:1 > Progress: Submitted:3 Active:1 Finished successfully:2 > Progress: Submitted:3 Active:1 Finished successfully:2 > Progress: Submitted:3 Active:1 Finished successfully:2 > Progress: Submitted:3 Checking status:1 Finished successfully:2 > Progress: Submitted:2 Active:1 Finished successfully:3 > ... > ... 
>
> uc-teragrid queue status: $showq -u $USER
> [aespinosa at tg-login1 ~]$ showq -u $USER
>
> active jobs------------------------
> JOBID USERNAME STATE PROCS REMAINING STARTTIME
>
> 2015982 aespinos Running 2 00:55:41 Tue Jun 9 15:02:18
>
> 1 active job 2 of 116 processors in use by local jobs (1.72%)
> 42 of 58 nodes active (72.41%)
>
> eligible jobs----------------------
> JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
>
> 0 eligible jobs
>
> blocked jobs-----------------------
> JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
>
> 0 blocked jobs
>
> Total job: 1
>
> sites.xml:
>
> url="tg-grid.uc.teragrid.org" jobmanager="gt2:gt2:pbs" />
>
> /home/aespinosa/blast-runs
>
> 5
> 1.26
> key="host_types">ia64-compute
> 4
> 16
>

From benc at hawaga.org.uk Wed Jun 10 06:10:02 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Jun 2009 11:10:02 +0000 (GMT)
Subject: [Swift-devel] someone to own swift NMI build and test
Message-ID: 

The Swift NMI daily and per-commit builds run in my user account in the
NMI build and test system.

I'm going to turn those off before the 17th of July 2009.

Who wants to own them now?

--

From wilde at mcs.anl.gov Wed Jun 10 06:40:53 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 10 Jun 2009 06:40:53 -0500
Subject: [Swift-devel] someone to own swift NMI build and test
In-Reply-To: 
References: 
Message-ID: <4A2F9BC5.4010405@mcs.anl.gov>

What's involved? Does someone need to establish a metronome login?
Is there an automated way to push tests from swift svn to metronome?
Do errors get emailed or does one have to check the logs via the web?

On 6/10/09 6:10 AM, Ben Clifford wrote:
> The Swift NMI daily and per-commit builds run in my user account in the
> NMI build and test system.
>
> I'm going to turn those off before the 17th of July 2009.
>
> Who wants to own them now?
>

From benc at hawaga.org.uk Wed Jun 10 07:04:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Jun 2009 12:04:08 +0000 (GMT)
Subject: [Swift-devel] someone to own swift NMI build and test
In-Reply-To: <4A2F9BC5.4010405@mcs.anl.gov>
References: <4A2F9BC5.4010405@mcs.anl.gov>
Message-ID: 

On Wed, 10 Jun 2009, Michael Wilde wrote:

> What's involved? Does someone need to establish a metronome login?

yes - contact nmi-support at ci.uchicago.edu

> Is there an automated way to push tests from swift svn to metronome?

The tests are checked out every time they run.

> Do errors get emailed or does one have to check the logs via the web?

they get emailed to me when they finish.

--

From benc at hawaga.org.uk Wed Jun 10 07:11:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Jun 2009 12:11:39 +0000 (GMT)
Subject: [Swift-devel] someone to own swift NMI build and test
In-Reply-To: 
References: <4A2F9BC5.4010405@mcs.anl.gov>
Message-ID: 
scip - Chris Henry Zhao turned over prototype script to them and did initial tests; Chris needs to provide run definitions; we need to give him a Swift starter release and a tailored README to get him started. He (pr we) need to create a TG startup account. 2. oops - Glen and Aashish They are making progress on their own 3. dock - Andrew Binkowski Andrew is running DOCK on Falkon on his own; can try to convert to Swift; for now, bigger Falkon runs are needed for INCITE app 4. oops - Mike Kubal Mike wants to run OOPS for other studies; needs startup help; waiting on data and on a new oops version 5. ptmap - Yue Chen Yue is focusing elsewhere; will get back to running at some point 6. see - Joshua Elliot and Todd Munson, ampl runs This just became "ready to swift"; need to do initial scripts and runs 7. PIR BLAST - Baris Suszek Allan doing this as a demo for them; Based on this, Zhao, Allan, I think the next step is to write the Swift script for SEE, #6; help Chris, #1; prepare to help MikeK, #4. I will send more details. - Mike From wilde at mcs.anl.gov Wed Jun 10 10:21:42 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 10 Jun 2009 10:21:42 -0500 Subject: [Swift-devel] Application engagements In-Reply-To: <4A2FC740.9030901@mcs.anl.gov> References: <4A2FC740.9030901@mcs.anl.gov> Message-ID: <4A2FCF86.1070601@mcs.anl.gov> additions: 8. Matlab workflows for David Biron's "worm lab" (C.elegans) 9. Matlab workflows for Andrew Jamieison / Giger lab Alex Moore, student, is making progress on 8 with some help from me. 9. is on hold - waiting for time and interest on their part. - Mike On 6/10/09 9:46 AM, Michael Wilde wrote: > Here's an update on 7 application engagements we have going at the > moment. This doesnt include the ongoing CNARI work Sarah is supporting. > > The exact things to do next for each varies based on where each user is > and how well they are making progress: > > 1. scip - Chris Henry > Zhao turned over prototype script to them and did initial tests; > Chris needs to provide run definitions; > we need to give him a Swift starter release and a tailored README > to get him started. He (pr we) need to create a TG startup account. > 2. oops - Glen and Aashish > They are making progress on their own > 3. dock - Andrew Binkowski > Andrew is running DOCK on Falkon on his own; > can try to convert to Swift; for now, bigger Falkon runs > are needed for INCITE app > 4. oops - Mike Kubal > Mike wants to run OOPS for other studies; needs startup help; > waiting on data and on a new oops version > 5. ptmap - Yue Chen > Yue is focusing elsewhere; will get back to running at some point > 6. see - Joshua Elliot and Todd Munson, ampl runs > This just became "ready to swift"; > need to do initial scripts and runs > 7. PIR BLAST - Baris Suszek > Allan doing this as a demo for them; > > Based on this, Zhao, Allan, I think the next step is to write the Swift > script for SEE, #6; help Chris, #1; prepare to help MikeK, #4. I will > send more details. 
> > - Mike > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From aespinosa at cs.uchicago.edu Wed Jun 10 13:45:45 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 10 Jun 2009 13:45:45 -0500 Subject: [Swift-devel] coaster job "completed" but swift reports check-status-failed Message-ID: <50b07b4b0906101145q6831e076oc5050efef8a82f5e@mail.gmail.com> attached are the corresponding swift logs, coaster logs and gram logs. sites.xml: /home/aespinosa/blast-runs 1 1.26 ia64-compute 4 2 swift session stdout: RunID: out.run_000 Progress: Progress: uninitialized:1 Progress: Initializing:1000 Selecting site:1 Progress: Selecting site:1000 Initializing site shared directory:1 Progress: Selecting site:999 Initializing site shared directory:1 Stage in:1 Progress: Selecting site:996 Stage in:5 Progress: Selecting site:996 Stage in:5 Progress: Selecting site:995 Stage in:6 Progress: Selecting site:994 Stage in:7 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:992 Stage in:9 Progress: Selecting site:992 Stage in:9 Progress: Selecting site:992 Stage in:8 Submitting:1 Progress: Selecting site:991 Stage in:1 Submitting:9 Progress: Selecting site:991 Stage in:1 Submitting:8 Submitted:1 Progress: Selecting site:991 Submitted:9 Active:1 Progress: Selecting site:991 Submitted:8 Active:2 Progress: Selecting site:991 Active:7 Checking status:2 Failed but can retry:1 Progress: Selecting site:991 Active:2 Checking status:3 Failed but can retry:5 Progress: Selecting site:991 Active:2 Failed but can retry:8 Progress: Selecting site:991 Active:2 Failed but can retry:8 Progress: Selecting site:991 Active:2 Failed but can retry:8 Progress: Selecting site:991 Active:2 Failed but can retry:8 Progress: Selecting site:991 Active:2 Failed but can retry:8 Progress: Selecting site:991 Active:2 Failed but can retry:8 Progress: Selecting site:991 Active:1 Checking status:1 Failed but can retry:8 Progress: Selecting site:989 Active:2 Checking status:1 Finished successfully:1 Failed but can retry:8 Progress: Selecting site:988 Stage in:1 Active:1 Finished successfully:1 Failed but can retry:10 ... ... -- Allan M. Espinosa PhD student, Computer Science University of Chicago -------------- next part -------------- A non-text attachment was scrubbed... Name: tarball.tar.gz Type: application/x-gzip Size: 172173 bytes Desc: not available URL: From aespinosa at cs.uchicago.edu Wed Jun 10 16:42:35 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 10 Jun 2009 16:42:35 -0500 Subject: [Swift-devel] active jobs vs available processors on submitted coaster queues In-Reply-To: <50b07b4b0906101213g34368050re6e6b7b2b0992d9a@mail.gmail.com> References: <50b07b4b0906091310j502935b2jc144e126aeb4edf8@mail.gmail.com> <1244618956.16077.2.camel@localhost> <50b07b4b0906101213g34368050re6e6b7b2b0992d9a@mail.gmail.com> Message-ID: <50b07b4b0906101442t14695d3ei808adaa740b7ac1d@mail.gmail.com> Here's run on 1k jobs: only 2 jobs were active . 
the 18 procs here in the LRM i think is the 2nd block request: [aespinosa at tg-login1 ~]$ showq -u $USER active jobs------------------------ JOBID USERNAME STATE PROCS REMAINING STARTTIME 2016757 aespinos Running 18 00:15:09 Wed Jun 10 16:29:31 1 active job 18 of 114 processors in use by local jobs (15.79%) 50 of 57 nodes active (87.72%) eligible jobs---------------------- JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME 0 eligible jobs blocked jobs----------------------- JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME swift session: Swift svn swift-r2949 cog-r2406 RunID: out.run_000 Progress: Progress: uninitialized:1 Progress: Initializing:1000 Selecting site:1 Progress: Selecting site:1000 Initializing site shared directory:1 Progress: Selecting site:999 Initializing site shared directory:1 Stage in:1 Progress: Selecting site:996 Stage in:5 Progress: Selecting site:996 Stage in:5 Progress: Selecting site:995 Stage in:6 Progress: Selecting site:994 Stage in:7 Progress: Selecting site:994 Stage in:7 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:993 Stage in:8 Progress: Selecting site:992 Stage in:9 Progress: Selecting site:992 Stage in:9 Progress: Selecting site:992 Stage in:8 Submitting:1 Progress: Selecting site:991 Stage in:1 Submitting:8 Submitted:1 Progress: Selecting site:991 Submitted:9 Active:1 Progress: Selecting site:991 Submitted:9 Active:1 Progress: Selecting site:991 Submitted:8 Active:2 Progress: Selecting site:991 Submitted:1 Active:2 Checking status:6 Failed but can retry:1 Progress: Selecting site:991 Active:1 Checking status:4 Failed but can retry:5 Progress: Selecting site:990 Stage in:1 Active:1 Failed but can retry:9 Progress: Selecting site:990 Active:1 Checking status:1 Failed but can retry:9 Progress: Selecting site:989 Submitting:1 Active:1 Failed but can retry:10 Progress: Selecting site:989 Active:1 Checking status:1 Failed but can retry:10 Progress: Selecting site:988 Submitting:1 Active:1 Failed but can retry:11 Progress: Selecting site:988 Active:1 Checking status:1 Failed but can retry:11 Progress: Selecting site:987 Submitting:1 Active:1 Failed but can retry:12 Progress: Selecting site:987 Active:1 Checking status:1 Failed but can retry:12 Progress: Selecting site:986 Stage in:1 Active:1 Failed but can retry:13 Progress: Selecting site:986 Active:1 Checking status:1 Failed but can retry:13 Progress: Selecting site:985 Stage in:1 Active:1 Failed but can retry:14 Progress: Selecting site:985 Active:1 Checking status:1 Failed but can retry:14 Progress: Selecting site:984 Stage in:1 Active:1 Failed but can retry:15 Progress: Selecting site:984 Active:1 Checking status:1 Failed but can retry:15 Progress: Selecting site:983 Stage in:1 Active:1 Failed but can retry:16 Progress: Selecting site:983 Active:2 Failed but can retry:16 Progress: Selecting site:983 Active:2 Failed but can retry:16 Progress: Selecting site:983 Active:1 Checking status:1 Failed but can retry:16 Progress: Selecting site:982 Stage in:1 Active:1 Finished successfully:1 Failed but can retry:16 Progress: Selecting site:982 Active:1 Checking status:1 Finished successfully:1 Failed but can retry:16 Progress: Selecting site:981 Submitting:1 Active:1 Finished successfully:1 Failed but can retry:17 Progress: Selecting site:981 Active:1 Finished successfully:1 Failed but can retry:18 Progress: Selecting site:980 Submitting:1 Active:1 Finished successfully:1 Failed but can retry:18 Progress: Selecting site:980 
Active:1 Checking status:1 Finished successfully:1 Failed but can retry:18
Progress: Selecting site:979 Stage in:1 Active:1 Finished successfully:1 Failed but can retry:19
Progress: Selecting site:979 Active:1 Checking status:1 Finished successfully:1 Failed but can retry:19
Progress: Selecting site:979 Active:1 Finished successfully:1 Failed but can retry:20
Progress: Selecting site:978 Stage in:1 Active:1 Finished successfully:1 Failed but can retry:20
Progress: Selecting site:978 Active:1 Checking status:1 Finished successfully:1 Failed but can retry:20
Progress: Selecting site:977 Stage in:1 Active:1 Finished successfully:1 Failed but can retry:21
Progress: Selecting site:977 Active:1 Checking status:1 Finished successfully:1 Failed but can retry:21
Progress: Selecting site:976 Stage in:1 Active:1 Finished successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Active:1 Finished successfully:1 Failed but can retry:22
Progress: Selecting site:976 Submitted:1 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Stage in:1 Submitted:1 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1 Failed but can retry:23
Progress: Selecting site:975 Submitted:2 Finished successfully:1 Failed but can retry:23

2009/6/10 Allan Espinosa :
> hi mihael,
>
> I reran the job and attached the log files (coaster log, swift-log, gram logs).
>
> swift session:
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Active:1 Finished successfully:4
> Progress: Submitted:1 Checking status:1 Finished successfully:4
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Active:1 Finished successfully:5
> Progress: Checking status:1 Finished successfully:5
> Progress: Stage out:1 Finished successfully:5
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> Progress: Submitted:1 Finished successfully:6
> ...
>
> sites.xml (I may have changed it during this run):
>
>
> url="tg-grid.uc.teragrid.org" jobmanager="gt2:gt2:pbs" />
>
> /home/aespinosa/blast-runs
>
> 1
> 1.26
>
> key="host_types">ia64-compute
> 4
> 2
>
>
> it looks like the last job was submitted but has not yet registered
> with the gram service in the ucanl remote site. At this point the
> coaster for the previous 5 jobs already ended.
> -Allan
>
> 2009/6/10 Mihael Hategan :
>> I need to look at the coaster log.
>>
>> On Tue, 2009-06-09 at 15:10 -0500, Allan Espinosa wrote:
>>> I was expecting to have 2 active jobs at a time from the swift log but
>>> instead got only one at a time:
>>> Swift svn swift-r2949 cog-r2406
>>>
>>> RunID: out.run_000
>>> Progress:
>>> Progress: Selecting site:4 Initializing site shared directory:1 Stage in:1
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>>
>>>
>>>
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:6
>>> Progress: Stage in:5 Submitting:1
>>> Progress: Submitting:5 Submitted:1
>>> Progress: Submitted:6
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Active:1
>>> Progress: Submitted:5 Checking status:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Active:1 Finished successfully:1
>>> Progress: Submitted:4 Checking status:1 Finished successfully:1
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Active:1 Finished successfully:2
>>> Progress: Submitted:3 Checking status:1 Finished successfully:2
>>> Progress: Submitted:2 Active:1 Finished successfully:3
>>> ...
>>> ...
>>>
>>>
>>> uc-teragrid queue status: $showq -u $USER
>>> [aespinosa at tg-login1 ~]$ showq -u $USER
>>>
>>> active jobs------------------------
>>> JOBID USERNAME STATE PROCS REMAINING STARTTIME
>>>
>>> 2015982 aespinos Running 2 00:55:41 Tue Jun 9 15:02:18
>>>
>>> 1 active job 2 of 116 processors in use by local jobs (1.72%)
>>> 42 of 58 nodes active (72.41%)
>>>
>>> eligible jobs----------------------
>>> JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
>>>
>>>
>>> 0 eligible jobs
>>>
>>> blocked jobs-----------------------
>>> JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
>>>
>>>
>>> 0 blocked jobs
>>>
>>> Total job: 1
>>>
>>>
>>> sites.xml:
>>>
>>>
>>> url="tg-grid.uc.teragrid.org" jobmanager="gt2:gt2:pbs" />
>>>
>>> /home/aespinosa/blast-runs
>>>
>>> 5
>>> 1.26
>>>
>>> key="host_types">ia64-compute
>>> 4
>>> 16
>>>
>>>

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tarball.tar.gz
Type: application/x-gzip
Size: 182455 bytes
Desc: not available
URL: 

From zhaozhang at uchicago.edu Thu Jun 11 09:24:57 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 11 Jun 2009 09:24:57 -0500
Subject: [Swift-devel] coaster error on ranger
In-Reply-To: <4A30EDBD.50108@mcs.anl.gov>
References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov>
Message-ID: <4A3113B9.9080303@uchicago.edu>

Hi, Mike and Mihael

Here is the error, I think this is related to the job wall time of
coaster settings.

Mihael, could you give me some suggestions on how to set the parameters
for coasters on ranger?

For now I am running 100 jobs, each job could take 2~3 hours. Thanks.

best
zhao

Execution failed:
Exception in run_ampl:
Arguments: [run70, template, armington.mod, armington_process.cmd,
armington_ou\
tput.cmd, subproblems/producer_tree.mod, ces.so]
Host: tgtacc
Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
stderr.txt:

stdout.txt:
----

Caused by:
Shutting down worker
Cleaning up...
Shutting down service at https://129.114.50.163:58556

And here is my sites.xml
bash-3.00$ cat tgranger-sge-gram2.xml


url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>

TG-CCR080022N
/work/00946/zzhang/work
key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
16
development
100
10
20
5
1
5




From hategan at mcs.anl.gov Thu Jun 11 10:22:23 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 11 Jun 2009 10:22:23 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <4A3113B9.9080303@uchicago.edu>
References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu>
Message-ID: <1244733743.18728.1.camel@localhost>

On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
> Hi, Mike and Mihael
>
> Here is the error, I think this is related to the job wall time of
> coaster settings.
>
> Mihael, could you give me some suggestions on how to set the parameters
> for coasters on ranger?

I need to know what the problem is first. And for that I need to take a
look at the coaster log (and possibly gram logs). So if you could copy
that to some shared space in the CI, that would be good.

> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>
> best
> zhao
>
> Execution failed:
> Exception in run_ampl:
> Arguments: [run70, template, armington.mod, armington_process.cmd,
> armington_ou\
> tput.cmd, subproblems/producer_tree.mod, ces.so]
> Host: tgtacc
> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
> stderr.txt:
>
> stdout.txt:
> ----
>
> Caused by:
> Shutting down worker
> Cleaning up...
> Shutting down service at https://129.114.50.163:58556
>
> And here is my sites.xml
> bash-3.00$ cat tgranger-sge-gram2.xml
>
>
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>
> TG-CCR080022N
> /work/00946/zzhang/work
> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
> 16
> development
> 100
> 10
> 20
> 5
> 1
> 5
>
>
>
>
>

From wilde at mcs.anl.gov Thu Jun 11 10:29:35 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Jun 2009 10:29:35 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <1244733743.18728.1.camel@localhost>
References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost>
Message-ID: <4A3122DF.2090301@mcs.anl.gov>

There is some likelihood that ampl itself is exiting with a non-zero
exit code (12 I suspect) due to a subscript error at the near-correct
termination of the model (i.e. it runs usefully to the end, then dies
when it runs off the end of an array). We know the fix for this.

But I wonder, in the case below, Zhao: is this happening when ampl
gets one of these errors, or is it running one job OK on a coaster,
and then running into a timeout on the next job?

What was the mapping of the number of jobs in this script (100 I
think) to the number of coasters started? Did the error occur when it
tried to start a second long job on a coaster after a prior (long) job
had already completed?

- Mike

On 6/11/09 10:22 AM, Mihael Hategan wrote:
> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>> Hi, Mike and Mihael
>>
>> Here is the error, I think this is related to the job wall time of
>> coaster settings.
>>
>> Mihael, could you give me some suggestions on how to set the
>> parameters for coasters on ranger?
>
> I need to know what the problem is first. And for that I need to take a
> look at the coaster log (and possibly gram logs). So if you could copy
> that to some shared space in the CI, that would be good.
>
>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>
>> best
>> zhao
>>
>> Execution failed:
>> Exception in run_ampl:
>> Arguments: [run70, template, armington.mod, armington_process.cmd,
>> armington_ou\
>> tput.cmd, subproblems/producer_tree.mod, ces.so]
>> Host: tgtacc
>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>> stderr.txt:
>>
>> stdout.txt:
>> ----
>>
>> Caused by:
>> Shutting down worker
>> Cleaning up...
>> Shutting down service at https://129.114.50.163:58556
>>
>> And here is my sites.xml
>> bash-3.00$ cat tgranger-sge-gram2.xml
>>
>>
>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>
>> TG-CCR080022N
>> /work/00946/zzhang/work
>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
>> 16
>> development
>> 100
>> 10
>> 20
>> 5
>> 1
>> 5
>>
>>
>>
>>
>>

From zhaozhang at uchicago.edu Thu Jun 11 10:37:02 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 11 Jun 2009 10:37:02 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <1244733743.18728.1.camel@localhost>
References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost>
Message-ID: <4A31249E.4060806@uchicago.edu>

Hi, Mihael

The coaster log is at /home/zzhang/see/logs/coasters.log. The latest
record should be the run that failed last night.

best
zhao

Mihael Hategan wrote:
> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>
>> Hi, Mike and Mihael
>>
>> Here is the error, I think this is related to the job wall time of
>> coaster settings.
>>
>> Mihael, could you give me some suggestions on how to set the parameters
>> for coasters on ranger?
>>
>
> I need to know what the problem is first. And for that I need to take a
> look at the coaster log (and possibly gram logs). So if you could copy
> that to some shared space in the CI, that would be good.
>
>
>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>
>> best
>> zhao
>>
>> Execution failed:
>> Exception in run_ampl:
>> Arguments: [run70, template, armington.mod, armington_process.cmd,
>> armington_ou\
>> tput.cmd, subproblems/producer_tree.mod, ces.so]
>> Host: tgtacc
>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>> stderr.txt:
>>
>> stdout.txt:
>> ----
>>
>> Caused by:
>> Shutting down worker
>> Cleaning up...
>> Shutting down service at https://129.114.50.163:58556
>>
>> And here is my sites.xml
>> bash-3.00$ cat tgranger-sge-gram2.xml
>>
>>
>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>
>> TG-CCR080022N
>> /work/00946/zzhang/work
>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir
>> 16
>> development
>> 100
>> 10
>> 20
>> 5
>> 1
>> 5
>>
>>
>>
>>
>>

From zhaozhang at uchicago.edu Thu Jun 11 11:06:48 2009
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 11 Jun 2009 11:06:48 -0500
Subject: [Swift-devel] Re: coaster error on ranger
In-Reply-To: <4A3122DF.2090301@mcs.anl.gov>
References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost> <4A3122DF.2090301@mcs.anl.gov>
Message-ID: <4A312B98.2090702@uchicago.edu>

Hi, Mike

I am attaching the whole log at the end. From the log, we could tell
that no job is successful at the point when the workflow exits. And the
workflow has been running for only 13 minutes.

I also copied the swift-work dir back to the CI network; it is at
/home/zzhang/see/logs/ampl-20090611-0122-hzktisu5.

Although no job in the workflow returned successfully, I did find 22
result files in /home/zzhang/see/logs/ampl-20090611-0122-hzktisu5/shared/result
You could take a look at run14 as an example. I echo the exit code of
the ampl script at the end of run_ampl:

** EXIT - solution found.

Major Iterations. . . . 4
Minor Iterations. . . . 36
Restarts. . . . . . . . 0
Crash Iterations. . . . 0
Gradient Steps. . . . . 0
Function Evaluations. . 5
Gradient Evaluations. . 5
Basis Time. . . . . . . 25.713607
Total Time. . . . . . . 27.701732
Residual. . . . . . . . 2.998933e-07
Postsolved residual: 2.9989e-07
Path 4.7.01: Solution found.
4 iterations (0 for crash); 36 pivots.
5 function, 5 gradient evaluations.
exitcode 2

See here? The exit code is 2, which means the ampl script itself has an
error. I know you said Todd has a fix for this, but I didn't find it.
The code I was running is the latest from svn. Any idea about this?

best wishes
zhao

Swift svn swift-r2953 cog-r2406

RunID: 20090611-0122-hzktisu5
Progress:
Progress: uninitialized:1
Progress: Selecting site:98 Initializing site shared directory:1 Stage in:1
Progress: Stage in:99 Submitting:1
Progress: Submitting:99 Submitted:1
Progress: Submitted:100
Progress: Submitted:100
Progress: Submitted:100
Progress: Submitted:100
Progress: Submitted:99 Active:1
Progress: Submitted:82 Active:18
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:23
Progress: Submitted:77 Active:22 Failed but can retry:1
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/h on tgtacc
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/s on tgtacc
Progress: Submitted:75 Active:21 Failed:1 Failed but can retry:3
Execution failed:
Exception in run_ampl:
Arguments: [run70, template, armington.mod, armington_process.cmd,
armington_output.cmd, subproblems/producer_tree.mod, ces.so]
Host: tgtacc
Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
stderr.txt:

stdout.txt:
----

Caused by:
Shutting down worker
Cleaning up...
Shutting down service at https://129.114.50.163:58556
Got channel MetaChannel: 6217586 -> GSSSChannel-null(1)
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/o on tgtacc
- Done

Michael Wilde wrote:
> There is some likelihood that ampl itself is exiting with a non-zero
> exit code (12 I suspect) due to a subscript error at the near-correct
> termination of the model (i.e. it runs usefully to the end, then dies
> when it runs off the end of an array). We know the fix for this.
>
> But I wonder, in the case below, Zhao: is this happening when ampl
> gets one of these errors, or is it running one job OK on a coaster,
> and then running into a timeout on the next job?
>
> What was the mapping of the number of jobs in this script (100 I
> think) to the number of coasters started?
Did the error occur when it > tried to start a second long job on a coaster after a prior (long) job > had already completed? > > - Mike > > > On 6/11/09 10:22 AM, Mihael Hategan wrote: >> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote: >>> Hi, Mike and Mihael >>> >>> Here is the error, I think this is related to the job wall time of >>> coaster settings. >>> >>> Mihael, could you give me some suggestions on how to set the >>> parameters for coasters on ranger? >> >> I need to know what the problem is first. And for that I need to take a >> look at the coaster log (and possibly gram logs). So if you could copy >> that to some shared space in the CI, that would be good. >> >>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks. >>> >>> best >>> zhao >>> >>> Execution failed: >>> Exception in run_ampl: >>> Arguments: [run70, template, armington.mod, armington_process.cmd, >>> armington_ou\ >>> tput.cmd, subproblems/producer_tree.mod, ces.so] >>> Host: tgtacc >>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj >>> stderr.txt: >>> >>> stdout.txt: >>> ---- >>> >>> Caused by: >>> Shutting down worker >>> Cleaning up... >>> Shutting down service at https://129.114.50.163:58556 >>> >>> And here is my sites.xml >>> bash-3.00$ cat tgranger-sge-gram2.xml >>> >>> >>> >>> >> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> >>> >>> TG-CCR080022N >>> /work/00946/zzhang/work >>> >> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir >>> 16 >>> development >>> 100 >>> 10 >>> 20 >>> 5 >>> 1 >>> 5 >>> >>> >>> >>> >>> >>> >> > From hategan at mcs.anl.gov Thu Jun 11 12:32:34 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Jun 2009 12:32:34 -0500 Subject: [Swift-devel] Re: coaster error on ranger In-Reply-To: <4A31249E.4060806@uchicago.edu> References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost> <4A31249E.4060806@uchicago.edu> Message-ID: <1244741554.23235.0.camel@localhost> Your jobs seem to not have a walltime specified. Can you post your tc.data? On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote: > Hi, Mihael > > The coaster log is at /home/zzhang/see/logs/coasters.log. The latest > record should be the run that failed last night. > > best > zhao > > Mihael Hategan wrote: > > On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote: > > > >> Hi, Mike and Mihael > >> > >> Here is the error, I think this is related to the job wall time of > >> coaster settings. > >> > >> Mihael, could you give me some suggestions on how to set the parameters > >> for coasters on ranger? > >> > > > > I need to know what the problem is first. And for that I need to take a > > look at the coaster log (and possibly gram logs). So if you could copy > > that to some shared space in the CI, that would be good. > > > > > >> For now I am running 100 jobs, each job could take 2~3 hours. Thanks. > >> > >> best > >> zhao > >> > >> Execution failed: > >> Exception in run_ampl: > >> Arguments: [run70, template, armington.mod, armington_process.cmd, > >> armington_ou\ > >> tput.cmd, subproblems/producer_tree.mod, ces.so] > >> Host: tgtacc > >> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj > >> stderr.txt: > >> > >> stdout.txt: > >> ---- > >> > >> Caused by: > >> Shutting down worker > >> Cleaning up... 
> >> Shutting down service at https://129.114.50.163:58556 > >> > >> And here is my sites.xml > >> bash-3.00$ cat tgranger-sge-gram2.xml > >> > >> > >> > >> >> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > >> > >> TG-CCR080022N > >> /work/00946/zzhang/work > >> >> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir > >> 16 > >> development > >> 100 > >> 10 > >> 20 > >> 5 > >> 1 > >> 5 > >> > >> > >> > >> > >> > >> > >> > > > > > > From zhaozhang at uchicago.edu Thu Jun 11 13:04:42 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 11 Jun 2009 13:04:42 -0500 Subject: [Swift-devel] Re: coaster error on ranger In-Reply-To: <1244741554.23235.0.camel@localhost> References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost> <4A31249E.4060806@uchicago.edu> <1244741554.23235.0.camel@localhost> Message-ID: <4A31473A.1090006@uchicago.edu> No, I don't specify any wall time. The last entry is for the run_ampl script. zhao login3% cat tc.data #This is the transformation catalog. # #It comes pre-configured with a number of simple transformations with #paths that are likely to work on a linux box. However, on some systems, #the paths to these executables will be different (for example, sometimes #some of these programs are found in /usr/bin rather than in /bin) # #NOTE WELL: fields in this file must be separated by tabs, not spaces; and #there must be no trailing whitespace at the end of each line. # # sitename transformation path INSTALLED platform profiles bgps echo /bin/echo INSTALLED INTEL32::LINUX null bgp000 cat /bin/cat INSTALLED INTEL32::LINUX null localhost sleep /bin/sleep INSTALLED INTEL32::LINUX null localhost echo /bin/echo INSTALLED INTEL32::LINUX null localhost ls /bin/ls INSTALLED INTEL32::LINUX null localhost wc /bin/wc INSTALLED INTEL32::LINUX null localhost grep /bin/grep INSTALLED INTEL32::LINUX null localhost sort /bin/sort INSTALLED INTEL32::LINUX null localhost paste /bin/paste INSTALLED INTEL32::LINUX null localhost date /bin/date INSTALLED INTEL32::LINUX null localhost db /home/wilde/angle/data/db INSTALLED INTEL32::LINUX null localhost set1 /home/wilde/angle/data/set1 INSTALLED INTEL32::LINUX null localhost set3 /home/wilde/angle/data/set3 INSTALLED INTEL32::LINUX null localhost run_ampl /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED INTEL32::LINUX null tgtacc run_ampl /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED INTEL32::LINUX null Mihael Hategan wrote: > Your jobs seem to not have a walltime specified. Can you post your > tc.data? > > On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote: > >> Hi, Mihael >> >> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest >> record should be the run that failed last night. >> >> best >> zhao >> >> Mihael Hategan wrote: >> >>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote: >>> >>> >>>> Hi, Mike and Mihael >>>> >>>> Here is the error, I think this is related to the job wall time of >>>> coaster settings. >>>> >>>> Mihael, could you give me some suggestions on how to set the parameters >>>> for coasters on ranger? >>>> >>>> >>> I need to know what the problem is first. And for that I need to take a >>> look at the coaster log (and possibly gram logs). So if you could copy >>> that to some shared space in the CI, that would be good. 
>>> >>> >>> >>>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks. >>>> >>>> best >>>> zhao >>>> >>>> Execution failed: >>>> Exception in run_ampl: >>>> Arguments: [run70, template, armington.mod, armington_process.cmd, >>>> armington_ou\ >>>> tput.cmd, subproblems/producer_tree.mod, ces.so] >>>> Host: tgtacc >>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj >>>> stderr.txt: >>>> >>>> stdout.txt: >>>> ---- >>>> >>>> Caused by: >>>> Shutting down worker >>>> Cleaning up... >>>> Shutting down service at https://129.114.50.163:58556 >>>> >>>> And here is my sites.xml >>>> bash-3.00$ cat tgranger-sge-gram2.xml >>>> >>>> >>>> >>>> >>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> >>>> >>>> TG-CCR080022N >>>> /work/00946/zzhang/work >>>> >>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir >>>> 16 >>>> development >>>> 100 >>>> 10 >>>> 20 >>>> 5 >>>> 1 >>>> 5 >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>> >>> > > > From hategan at mcs.anl.gov Thu Jun 11 13:09:20 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Jun 2009 13:09:20 -0500 Subject: [Swift-devel] Re: coaster error on ranger In-Reply-To: <4A31473A.1090006@uchicago.edu> References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost> <4A31249E.4060806@uchicago.edu> <1244741554.23235.0.camel@localhost> <4A31473A.1090006@uchicago.edu> Message-ID: <1244743761.24254.1.camel@localhost> On Thu, 2009-06-11 at 13:04 -0500, Zhao Zhang wrote: > No, I don't specify any wall time. Well, you need to specify one. > The last entry is for the run_ampl script. > > zhao > > login3% cat tc.data > #This is the transformation catalog. > # > #It comes pre-configured with a number of simple transformations with > #paths that are likely to work on a linux box. However, on some systems, > #the paths to these executables will be different (for example, sometimes > #some of these programs are found in /usr/bin rather than in /bin) > # > #NOTE WELL: fields in this file must be separated by tabs, not spaces; and > #there must be no trailing whitespace at the end of each line. > # > # sitename transformation path INSTALLED platform profiles > bgps echo /bin/echo INSTALLED INTEL32::LINUX null > bgp000 cat /bin/cat INSTALLED INTEL32::LINUX null > localhost sleep /bin/sleep INSTALLED > INTEL32::LINUX null > localhost echo /bin/echo INSTALLED > INTEL32::LINUX null > localhost ls /bin/ls INSTALLED > INTEL32::LINUX null > localhost wc /bin/wc INSTALLED > INTEL32::LINUX null > localhost grep /bin/grep INSTALLED > INTEL32::LINUX null > localhost sort /bin/sort INSTALLED > INTEL32::LINUX null > localhost paste /bin/paste INSTALLED > INTEL32::LINUX null > localhost date /bin/date INSTALLED > INTEL32::LINUX null > localhost db /home/wilde/angle/data/db > INSTALLED INTEL32::LINUX null > localhost set1 /home/wilde/angle/data/set1 > INSTALLED INTEL32::LINUX null > localhost set3 /home/wilde/angle/data/set3 > INSTALLED INTEL32::LINUX null > localhost run_ampl > /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED > INTEL32::LINUX null > tgtacc run_ampl > /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED > INTEL32::LINUX null > > > Mihael Hategan wrote: > > Your jobs seem to not have a walltime specified. Can you post your > > tc.data? 
> > > > On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote: > > > >> Hi, Mihael > >> > >> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest > >> record should be the run that failed last night. > >> > >> best > >> zhao > >> > >> Mihael Hategan wrote: > >> > >>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote: > >>> > >>> > >>>> Hi, Mike and Mihael > >>>> > >>>> Here is the error, I think this is related to the job wall time of > >>>> coaster settings. > >>>> > >>>> Mihael, could you give me some suggestions on how to set the parameters > >>>> for coasters on ranger? > >>>> > >>>> > >>> I need to know what the problem is first. And for that I need to take a > >>> look at the coaster log (and possibly gram logs). So if you could copy > >>> that to some shared space in the CI, that would be good. > >>> > >>> > >>> > >>>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks. > >>>> > >>>> best > >>>> zhao > >>>> > >>>> Execution failed: > >>>> Exception in run_ampl: > >>>> Arguments: [run70, template, armington.mod, armington_process.cmd, > >>>> armington_ou\ > >>>> tput.cmd, subproblems/producer_tree.mod, ces.so] > >>>> Host: tgtacc > >>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj > >>>> stderr.txt: > >>>> > >>>> stdout.txt: > >>>> ---- > >>>> > >>>> Caused by: > >>>> Shutting down worker > >>>> Cleaning up... > >>>> Shutting down service at https://129.114.50.163:58556 > >>>> > >>>> And here is my sites.xml > >>>> bash-3.00$ cat tgranger-sge-gram2.xml > >>>> > >>>> > >>>> > >>>> >>>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > >>>> > >>>> TG-CCR080022N > >>>> /work/00946/zzhang/work > >>>> >>>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir > >>>> 16 > >>>> development > >>>> 100 > >>>> 10 > >>>> 20 > >>>> 5 > >>>> 1 > >>>> 5 > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>> > >>> > > > > > > From zhaozhang at uchicago.edu Thu Jun 11 13:13:34 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 11 Jun 2009 13:13:34 -0500 Subject: [Swift-devel] Re: coaster error on ranger In-Reply-To: <1244743761.24254.1.camel@localhost> References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost> <4A31249E.4060806@uchicago.edu> <1244741554.23235.0.camel@localhost> <4A31473A.1090006@uchicago.edu> <1244743761.24254.1.camel@localhost> Message-ID: <4A31494E.8060209@uchicago.edu> Hi, Mihael Actually, I have no idea how long would these jobs run. Some of them just took ~10 minutes, and some of them went far more than this. What if I set the wall to 120 minutes, what will happen when the wall time is up but the job doesn't finish? 120 zhao Mihael Hategan wrote: > On Thu, 2009-06-11 at 13:04 -0500, Zhao Zhang wrote: > >> No, I don't specify any wall time. >> > > Well, you need to specify one. > > >> The last entry is for the run_ampl script. >> >> zhao >> >> login3% cat tc.data >> #This is the transformation catalog. >> # >> #It comes pre-configured with a number of simple transformations with >> #paths that are likely to work on a linux box. 
However, on some systems, >> #the paths to these executables will be different (for example, sometimes >> #some of these programs are found in /usr/bin rather than in /bin) >> # >> #NOTE WELL: fields in this file must be separated by tabs, not spaces; and >> #there must be no trailing whitespace at the end of each line. >> # >> # sitename transformation path INSTALLED platform profiles >> bgps echo /bin/echo INSTALLED INTEL32::LINUX null >> bgp000 cat /bin/cat INSTALLED INTEL32::LINUX null >> localhost sleep /bin/sleep INSTALLED >> INTEL32::LINUX null >> localhost echo /bin/echo INSTALLED >> INTEL32::LINUX null >> localhost ls /bin/ls INSTALLED >> INTEL32::LINUX null >> localhost wc /bin/wc INSTALLED >> INTEL32::LINUX null >> localhost grep /bin/grep INSTALLED >> INTEL32::LINUX null >> localhost sort /bin/sort INSTALLED >> INTEL32::LINUX null >> localhost paste /bin/paste INSTALLED >> INTEL32::LINUX null >> localhost date /bin/date INSTALLED >> INTEL32::LINUX null >> localhost db /home/wilde/angle/data/db >> INSTALLED INTEL32::LINUX null >> localhost set1 /home/wilde/angle/data/set1 >> INSTALLED INTEL32::LINUX null >> localhost set3 /home/wilde/angle/data/set3 >> INSTALLED INTEL32::LINUX null >> localhost run_ampl >> /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED >> INTEL32::LINUX null >> tgtacc run_ampl >> /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED >> INTEL32::LINUX null >> >> >> Mihael Hategan wrote: >> >>> Your jobs seem to not have a walltime specified. Can you post your >>> tc.data? >>> >>> On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote: >>> >>> >>>> Hi, Mihael >>>> >>>> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest >>>> record should be the run that failed last night. >>>> >>>> best >>>> zhao >>>> >>>> Mihael Hategan wrote: >>>> >>>> >>>>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote: >>>>> >>>>> >>>>> >>>>>> Hi, Mike and Mihael >>>>>> >>>>>> Here is the error, I think this is related to the job wall time of >>>>>> coaster settings. >>>>>> >>>>>> Mihael, could you give me some suggestions on how to set the parameters >>>>>> for coasters on ranger? >>>>>> >>>>>> >>>>>> >>>>> I need to know what the problem is first. And for that I need to take a >>>>> look at the coaster log (and possibly gram logs). So if you could copy >>>>> that to some shared space in the CI, that would be good. >>>>> >>>>> >>>>> >>>>> >>>>>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks. >>>>>> >>>>>> best >>>>>> zhao >>>>>> >>>>>> Execution failed: >>>>>> Exception in run_ampl: >>>>>> Arguments: [run70, template, armington.mod, armington_process.cmd, >>>>>> armington_ou\ >>>>>> tput.cmd, subproblems/producer_tree.mod, ces.so] >>>>>> Host: tgtacc >>>>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj >>>>>> stderr.txt: >>>>>> >>>>>> stdout.txt: >>>>>> ---- >>>>>> >>>>>> Caused by: >>>>>> Shutting down worker >>>>>> Cleaning up... 
>>>>>> Shutting down service at https://129.114.50.163:58556 >>>>>> >>>>>> And here is my sites.xml >>>>>> bash-3.00$ cat tgranger-sge-gram2.xml >>>>>> >>>>>> >>>>>> >>>>>> >>>>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> >>>>>> >>>>>> TG-CCR080022N >>>>>> /work/00946/zzhang/work >>>>>> >>>>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir >>>>>> 16 >>>>>> development >>>>>> 100 >>>>>> 10 >>>>>> 20 >>>>>> 5 >>>>>> 1 >>>>>> 5 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>> >>> > > > From hategan at mcs.anl.gov Thu Jun 11 13:24:58 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Jun 2009 13:24:58 -0500 Subject: [Swift-devel] Re: coaster error on ranger In-Reply-To: <4A31494E.8060209@uchicago.edu> References: <4A300069.6030103@anl.gov> <4A3005CF.2090107@mcs.anl.gov> <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> <1244733743.18728.1.camel@localhost> <4A31249E.4060806@uchicago.edu> <1244741554.23235.0.camel@localhost> <4A31473A.1090006@uchicago.edu> <1244743761.24254.1.camel@localhost> <4A31494E.8060209@uchicago.edu> Message-ID: <1244744698.24446.7.camel@localhost> On Thu, 2009-06-11 at 13:13 -0500, Zhao Zhang wrote: > Hi, Mihael > > Actually, I have no idea how long would these jobs run. Some of them > just took ~10 minutes, and some of them went far more than this. > What if I set the wall to 120 minutes, what will happen when the wall > time is up but the job doesn't finish? You'll probably send another message to the mailing list saying that things don't work properly. And I'll ask you again to gather all logs. Then I'll tell you the same thing, which is that you should set a proper maxwalltime for the job. > 120 Read the swift documentation, in particular http://www.ci.uchicago.edu/swift/guides/userguide.php#profile.globus (how to specify the maxwalltime and what it means, and the meaning of "maxtime" which has nothing to do with your job's maximum walltime). From bugzilla-daemon at mcs.anl.gov Thu Jun 11 15:02:57 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Thu, 11 Jun 2009 15:02:57 -0500 (CDT) Subject: [Swift-devel] [Bug 212] New: support for multiple arguments Message-ID: https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=212 Summary: support for multiple arguments Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: enhancement Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: aespinosa at cs.uchicago.edu Created an attachment (id=287) --> (http://bugzilla.mcs.anl.gov/swift/attachment.cgi?id=287) job with 10k arguments attached job is an invocation with lots of arguments (10k). we can generalize these kinds of jobs as "summarizers". this occurs most likely because of the number of arguments limits in the shell when a job is invoked by _swiftwrap. A mapreduce-like approach to reduce the data chunks-at-a-time would be a possible solution. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. 
From bugzilla-daemon at mcs.anl.gov Thu Jun 11 15:10:50 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 11 Jun 2009 15:10:50 -0500 (CDT)
Subject: [Swift-devel] [Bug 212] support for lots of arguments
In-Reply-To: 
References: 
Message-ID: <20090611201050.A973B2B886@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=212

Allan Espinosa changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|support for multiple        |support for lots of
                   |arguments                   |arguments

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

From benc at hawaga.org.uk Fri Jun 12 13:21:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 12 Jun 2009 18:21:08 +0000 (GMT)
Subject: [Swift-devel] swift in softenv at CI
Message-ID: 

For people using CI maintained login machines, Swift is now available in
softenv. To get Swift 0.9 in the same way that you get other CI software,
add the line:

@swift

to the start of your ~/.soft file.

The full commentary from CI support is:

> There's a @swift macro which is recommended to use before the @default
> macro and that will pull in the Sun Java for that OS and swift. And
> there's also a +swift if someone wants to try out a different Java.

--

From aespinosa at cs.uchicago.edu Fri Jun 12 15:33:24 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 12 Jun 2009 15:33:24 -0500
Subject: [Swift-devel] Re: block coasters not registering on proper queue
In-Reply-To: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com>
Message-ID: <50b07b4b0906121333s62ea38f6hbfc8835c534af4b6@mail.gmail.com>

I did a rebuild (ant redist) today and it looks like everything is
working fine. It looks like some files during my previous build were
not updated properly. I was using rsync to copy files from swift-svn
to another directory. I guess that was a bad idea.

./runtest.sh
Swift svn swift-r2949 cog-r2406

RunID: coasterrun
Progress: uninitialized:1
Progress: Submitted:1
Progress: Active:1
Final status: Finished successfully:1
Cleaning up...
Shutting down service at https://128.135.125.117:43627
Got channel MetaChannel: 1910518671 -> GSSSChannel-null(1)
- Done

qstat:
1095930.tp-mgt null aespinosa 0 R fast

-Allan

2009/6/8 Allan Espinosa :
> Is there a default maxwalltime being submitted to the LRM if nothing
> is specified? I made this configuration to use the "fast" queue in
> sites.xml but I keep getting placed inside the "extended" queue.
>
> sites.xml
>
>
>
> jobmanager="gt2:gt2:pbs" />
> fast
> /home/aespinosa/work
> 50
> 10
> 20
>
>
>
> gram log snippet:
> ...
> ...
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created.
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed
> (may be harmless): Operation not permitted
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may
> be harmless): Operation not permitted
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from
> job description
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time
> limit from job description
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir:
> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
> "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument
> "http://128.135.125.118:56015"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to
> "http://128.135.125.118:56015"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000"
> Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000"
> ...
> ...
>
> $grep fast gram*.log:
> gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 =
> GLOBUS_FAILURE (try Perl scripts)
> gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 =
GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 = > GLOBUS_FAILURE (try Perl scripts) > > > Swift version: Swift svn swift-r2949 cog-r2406 > > -Allan > -- Allan M. Espinosa PhD student, Computer Science University of Chicago From bugzilla-daemon at mcs.anl.gov Fri Jun 12 15:35:06 2009 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 12 Jun 2009 15:35:06 -0500 (CDT) Subject: [Swift-devel] [Bug 211] block coasters not registering on proper queue In-Reply-To: References: Message-ID: <20090612203506.C209C2CB39@wind.mcs.anl.gov> https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=211 Allan Espinosa changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |INVALID --- Comment #1 from Allan Espinosa 2009-06-12 15:35:06 --- I made a mistake in my build scripts. apparently rsync is not the way to copy of builds in swift-svn to another directory. -- Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are watching the assignee of the bug. From hategan at mcs.anl.gov Fri Jun 12 19:20:17 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 12 Jun 2009 19:20:17 -0500 Subject: [Swift-devel] Re: block coasters not registering on proper queue In-Reply-To: <50b07b4b0906121333s62ea38f6hbfc8835c534af4b6@mail.gmail.com> References: <50b07b4b0906081424jeab39a3kdb7b22aec84c1328@mail.gmail.com> <50b07b4b0906121333s62ea38f6hbfc8835c534af4b6@mail.gmail.com> Message-ID: <1244852417.10588.1.camel@localhost> There's a chance an intermittent coaster bug still exists on this issue. So it would be useful to test the same configuration some more. On Fri, 2009-06-12 at 15:33 -0500, Allan Espinosa wrote: > I did a rebuild (ant redist) today and it looks like everything is > working fine. It looks like some files during my previous build were > not updated properly. I was using rsync to copy files from swift-svn > to another directory. i guess that was a bad idea. > > ./runtest.sh > Swift svn swift-r2949 cog-r2406 > > RunID: coasterrun > Progress: uninitialized:1 > Progress: Submitted:1 > Progress: Active:1 > Final status: Finished successfully:1 > Cleaning up... > Shutting down service at https://128.135.125.117:43627 > Got channel MetaChannel: 1910518671 -> GSSSChannel-null(1) > - Done > > qstat: > 1095930.tp-mgt null aespinosa 0 R fast > > -Allan > > > 2009/6/8 Allan Espinosa : > > Is there a default maxwalltime being submitted to the LRM if nothing > > is specified? 
I made this configuration use the "fast" queue in > > sites.xml but I keep getting placed inside the "extended" queue. > > > > sites.xml > > > > > > > jobmanager="gt2:gt2:pbs" /> > > fast > > /home/aespinosa/work > > 50 > > 10 > > 20 > > > > > > > > gram log snippet: > > ... > > ... > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: New Perl JobManager created. > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir: > > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960 > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Entering pbs submit > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /usr/bin/perl failed > > (may be harmless): Operation not permitted > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /usr/bin/perl > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: NFS sync for /dev/null failed (may > > be harmless): Operation not permitted > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Sent NFS sync for /dev/null > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max time cpu from > > job description > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: using queue default > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Determining job max wall time > > limit from job description > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: using maxwalltime of 60 > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Building job script > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Using jm supplied job dir: > > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/18222.1244495960 > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument > > "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl" > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to > > "/home/aespinosa/.globus/coasters/cscript6820117662705473060.pl" > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument > > "http://128.135.125.118:56015" > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to > > "http://128.135.125.118:56015" > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transforming argument "0608-190415-000000" > > Mon Jun 8 16:19:21 2009 JM_SCRIPT: Transformed to "0608-190415-000000" > > ... > > ... 
> > > > $grep fast gram*.log: > > gram_job_mgr_18021.log:6/8 16:19:15 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18021.log:6/8 16:19:25 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18021.log:6/8 16:19:36 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:19:22 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:19:32 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:19:43 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:19:53 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:20:03 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:20:14 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:20:24 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:20:34 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:20:45 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:20:55 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:21:05 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:21:16 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:21:26 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:21:36 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:21:47 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:21:57 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:22:08 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:22:18 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:22:28 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:22:39 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:22:49 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:23:00 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:23:10 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > gram_job_mgr_18222.log:6/8 16:23:20 JMI: poll_fast: returning -1 = > > GLOBUS_FAILURE (try Perl scripts) > > > > > > Swift version: Swift svn swift-r2949 cog-r2406 > > > > -Allan > > > > > From benc at hawaga.org.uk Mon Jun 15 05:37:42 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 15 Jun 2009 10:37:42 +0000 (GMT) Subject: [Swift-devel] pc3 swift slides Message-ID: Although I think they're of minimal interest to most, here are the slides I presented in the swift slot at PC3 last week at Universiteit van Amsterdam. http://www.ci.uchicago.edu/~benc/pc3-swift-slides.pdf Some substantially more meaty technical report on Swift vs PC3 should appear later. 
-- From zhaozhang at uchicago.edu Mon Jun 15 16:48:01 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 15 Jun 2009 16:48:01 -0500 Subject: [Swift-devel] Two problems regarding coaster Message-ID: <4A36C191.9050701@uchicago.edu> Hi, Mihael I encountered two problems with coasters. 1. The log file coasters.log is growing too fast. For a two-hour run, the log file could be 5 GB, and swift would fail if there is no space left for the coaster to produce logs. 2. On Ranger there are two file systems that I am using now. One is $HOME, the other is $WORK, with quotas of 6 GB and 350 GB respectively. By default coasters.log is produced at $HOME/.globus/coaster; I set up a symbolic link there that actually points to a place in $WORK. This fixes the problem I saw on Sunday, but there comes a new one. I am not sure why swift failed; could you help find out whether it is because the job reached the maxwalltime, or something else? My sites.xml is at the end of the email. The worker logs, gram logs, swift logs and standard output are on the CI network, /home/zzhang/ranger-logs/2009-06-15 The coasters.log is too big; if you need it, I will try to trim it down. Let me know. best zhao From aespinosa at cs.uchicago.edu Mon Jun 15 18:37:54 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 15 Jun 2009 18:37:54 -0500 Subject: [Swift-devel] coaster error on ranger In-Reply-To: <4A3113B9.9080303@uchicago.edu> References: <4A301544.4030806@uchicago.edu> <4A3074B6.3070003@uchicago.edu> <4A3083E9.2040603@mcs.anl.gov> <4A30A404.7050004@uchicago.edu> <4A30EDBD.50108@mcs.anl.gov> <4A3113B9.9080303@uchicago.edu> Message-ID: <50b07b4b0906151637t7dd7653eu29a2ab0d17f76153@mail.gmail.com> Isn't coastersPerNode already deprecated as a configuration parameter? 2009/6/11 Zhao Zhang > Hi, Mike and Mihael > > Here is the error; I think this is related to the coaster job walltime > settings. > > Mihael, could you give me some suggestions on how to set the parameters for > coasters on ranger? > For now I am running 100 jobs, each job could take 2~3 hours. Thanks. > > best > zhao > > Execution failed: > Exception in run_ampl: > Arguments: [run70, template, armington.mod, armington_process.cmd, > armington_output.cmd, subproblems/producer_tree.mod, ces.so] > Host: tgtacc > Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj > stderr.txt: > > stdout.txt: > ---- > > Caused by: > Shutting down worker > Cleaning up... > Shutting down service at https://129.114.50.163:58556 > > And here is my sites.xml > bash-3.00$ cat tgranger-sge-gram2.xml > > > > jobManager="gt2:gt2:SGE"/> > > TG-CCR080022N > /work/00946/zzhang/work > key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir > 16 > development > 100 > 10 > 20 > 5 > 1 > 5 > > > > -- Allan M. Espinosa PhD student, Computer Science University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Jun 16 03:53:53 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 16 Jun 2009 08:53:53 +0000 (GMT) Subject: [Swift-devel] passing very long lists of files to applications Message-ID: Some applications have a problem where a component program is to process a very large number of files. Swift can deal with files in two basic ways at the moment: i) explicitly pass the filenames on the commandline ii) stage the files into the job input directory without explicitly naming the files on the commandline, with the component program inspecting the working directory on the execution side to decide which files to process. 
i) has the disadvantage that the commandline limits the number of filenames that can be passed ii) has the disadvantage that the component program must be able to distinguish which of its working directory files are the relevant input files. A further option which could be implemented is to provide the ability to write out a list of filenames into a file, and have that file staged as input. This needs the component program to be able to take a list of files from a file rather than from the command line (for example, the -T option of tar). This could be implemented, I think, by providing a writeData procedure which is the inverse of readData, and writing something like this: file l = writeData(@filenames(f)) p(l,f); app p(file l, file f[]) { myproc "-T" @l } comments? -- From benc at hawaga.org.uk Tue Jun 16 05:10:59 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 16 Jun 2009 10:10:59 +0000 (GMT) Subject: [Swift-devel] passing very long lists of files to applications In-Reply-To: References: Message-ID: A related idea: when you are using some component program that is summarising data and whose work is associative (and maybe has other properties), the associativity could be indicated to Swift, and Swift could then make use of that to generate an arbitrary number of app calls. For example, the numerical operations max or sum fit this, but mean does not. max (100,8,1,1,33,8,7,423,46,2,222) = max( max(100,8,1,1), max(33,8), max(7,423,46), max(2,222) ) so it's possible to evaluate the max without any individual invocation having more than 4 parameters. This fits in quite nicely with ideas of having Swift stuff be expressed more functionally, and have Swift able to make its own decisions about exactly how things are run. I don't think this is going to be something that goes in the language soon, but if anyone happens to pursue the functional direction further, this is a case that should be kept in mind. -- From aespinosa at cs.uchicago.edu Tue Jun 16 17:30:03 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Tue, 16 Jun 2009 17:30:03 -0500 Subject: [Swift-devel] more active processes than requested cores Message-ID: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> By the throttling parameters below, I do expect to have a thousand jobs active at a time. But shouldn't the coaster request larger blocks to accommodate the 277 active jobs? sge snapshot: ACTIVE JOBS-------------------------- JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME ================================================================================ 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 15:59:41 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 17:09:38 swift session snippet Progress: Selecting site:38 Submitted:707 Active:278 Finished successfully:1861 Progress: Selecting site:38 Submitted:707 Active:277 Checking status:1 Finished successfully:1861 sites.xml TG-CCR080022N /work/01035/tg802895/blast-runs 16 development 4 00:30:00 2 2 10 I'll send the swift and coaster logs once the run finishes. -Allan -- Allan M. 
Espinosa PhD student, Computer Science University of Chicago From benc at hawaga.org.uk Tue Jun 16 17:33:54 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 16 Jun 2009 22:33:54 +0000 (GMT) Subject: [Swift-devel] more active processes than requested cores In-Reply-To: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> Message-ID: Can you compare with the post-processed logs (especially info/worker logs, not execution layer stats), not the runtime counter - the runtime counter is necessarily reliant on the realtime delivery of status changes; the post-processed wrapper logs are not. So maybe this is too many jobs running at once; maybe this is delayed statistics updates (as has been discussed here) You need to turn on the wrapper log always transfer option in the config file to get all the wrapper logs back if you don't already have that. On Tue, 16 Jun 2009, Allan Espinosa wrote: > By the throttling parameters below, i do expect to have a thousand > jobs active at a time. But shouldn't the coaster request larger > blocks to accommodate the 277 active jobs? > > sge snapshot: > ACTIVE JOBS-------------------------- > JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME > ================================================================================ > 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 15:59:41 > 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41 > 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41 > 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 17:09:38 > > > swift session snipper > Progress: Selecting site:38 Submitted:707 Active:278 Finished > successfully:1861 > Progress: Selecting site:38 Submitted:707 Active:277 Checking > status:1 Finished successfully:1861 > > > sites.xml > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > TG-CCR080022N > /work/01035/tg802895/blast-runs > 16 > development > 4 > 00:30:00 > 2 > 2 > 10 > > > > i'll send the swift and coaster logs once the run finishes. > > -Allan > > > From hategan at mcs.anl.gov Wed Jun 17 07:12:55 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 17 Jun 2009 07:12:55 -0500 Subject: [Swift-devel] Two problems regarding coaster In-Reply-To: <4A36C191.9050701@uchicago.edu> References: <4A36C191.9050701@uchicago.edu> Message-ID: <1245240775.8776.3.camel@localhost> On Mon, 2009-06-15 at 16:48 -0500, Zhao Zhang wrote: > Hi, Mihael > > I encountered two problems on coasters. > 1. The log file coasters.log is increasing too fast. For a two hour run, > the log file could be 5GB. > And swift would fail if there is no space for coaster to produce logs. That's the temporary price we pay while the thing isn't tested much in order to be able to find bugs quickly. From hategan at mcs.anl.gov Wed Jun 17 07:17:37 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 17 Jun 2009 07:17:37 -0500 Subject: [Swift-devel] more active processes than requested cores In-Reply-To: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> Message-ID: <1245241057.8776.6.camel@localhost> On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote: > By the throttling parameters below, i do expect to have a thousand > jobs active at a time. But shouldn't the coaster request larger > blocks to accommodate the 277 active jobs? Not if they fit in existing blocks (either vertically or horizontally). 
This is something that should be thought of some more, but for short jobs it seems ok. > > sge snapshot: > ACTIVE JOBS-------------------------- > JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME > ================================================================================ > 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 15:59:41 > 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41 > 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 17:07:41 > 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 17:09:38 > > > swift session snippet > Progress: Selecting site:38 Submitted:707 Active:278 Finished > successfully:1861 > Progress: Selecting site:38 Submitted:707 Active:277 Checking > status:1 Finished successfully:1861 > > > sites.xml > > > > url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > TG-CCR080022N > /work/01035/tg802895/blast-runs > 16 > development > 4 > 00:30:00 > 2 > 2 > 10 > > > > I'll send the swift and coaster logs once the run finishes. > > -Allan > > From zhaozhang at uchicago.edu Wed Jun 17 10:11:22 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Wed, 17 Jun 2009 10:11:22 -0500 Subject: [Swift-devel] more active processes than requested cores In-Reply-To: <1245241057.8776.6.camel@localhost> References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> <1245241057.8776.6.camel@localhost> Message-ID: <4A39079A.3050001@uchicago.edu> Hi, All Here is something in my test case: Swift says: Progress: Selecting site:80 Submitted:828 Active:115 Finished in previous run:487 Finished successfully:295 Progress: Selecting site:80 Submitted:828 Active:115 Finished in previous run:487 Finished successfully:295 Progress: Selecting site:80 Submitted:828 Active:115 Finished in previous run:487 Finished successfully:295 Progress: Selecting site:80 Submitted:828 Active:115 Finished in previous run:487 Finished successfully:295 Progress: Selecting site:80 Submitted:828 Active:115 Finished in previous run:487 Finished successfully:295 And showq -u says login3% showq -u ACTIVE JOBS-------------------------- JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME ================================================================================ 0 active jobs : 0 of 3828 hosts ( 0.00 %) Why are there no active SGE jobs when swift says there are 115 active jobs? zhao Mihael Hategan wrote: > On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote: > >> By the throttling parameters below, I do expect to have a thousand >> jobs active at a time. But shouldn't the coaster request larger >> blocks to accommodate the 277 active jobs? >> > > Not if they fit in existing blocks (either vertically or horizontally). > This is something that should be thought of some more, but for short > jobs it seems ok. 
> > >> sge snapshot: >> ACTIVE JOBS-------------------------- >> JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME >> ================================================================================ >> 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 >> 15:59:41 >> 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 >> 17:07:41 >> 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 >> 17:07:41 >> 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 >> 17:09:38 >> >> >> swift session snippet >> Progress: Selecting site:38 Submitted:707 Active:278 Finished >> successfully:1861 >> Progress: Selecting site:38 Submitted:707 Active:277 Checking >> status:1 Finished successfully:1861 >> >> >> sites.xml >> >> >> >> >> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> >> TG-CCR080022N >> /work/01035/tg802895/blast-runs >> 16 >> development >> 4 >> 00:30:00 >> 2 >> 2 >> 10 >> >> >> >> I'll send the swift and coaster logs once the run finishes. >> >> -Allan >> >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From aespinosa at cs.uchicago.edu Wed Jun 17 14:08:50 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 17 Jun 2009 14:08:50 -0500 Subject: [Swift-devel] more active processes than requested cores In-Reply-To: <4A39079A.3050001@uchicago.edu> References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> <1245241057.8776.6.camel@localhost> <4A39079A.3050001@uchicago.edu> Message-ID: <50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com> I also get this after a while. Attached are the logs when the workflow finished. Actually it did not finish because the coaster got an out of memory error. This does not happen if coasters were not used. 2009/6/17 Zhao Zhang : > Hi, All > > Here is something in my test case: > > Swift says: > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > previous run:487 Finished successfully:295 > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > previous run:487 Finished successfully:295 > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > previous run:487 Finished successfully:295 > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > previous run:487 Finished successfully:295 > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > previous run:487 Finished successfully:295 > > And showq -u says > login3% showq -u > ACTIVE JOBS-------------------------- > JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME > ================================================================================ > > 0 active jobs : 0 of 3828 hosts ( 0.00 %) > > Why are there no active SGE jobs when swift says there are 115 active jobs? > > zhao > > Mihael Hategan wrote: > >> > >> On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote: > >> > >>> > >>> By the throttling parameters below, I do expect to have a thousand > >>> jobs active at a time. But shouldn't the coaster request larger > >>> blocks to accommodate the 277 active jobs? > >>> > >> > >> Not if they fit in existing blocks (either vertically or horizontally). > >> This is something that should be thought of some more, but for short > >> jobs it seems ok. > >> > >> > >>> > >>> sge snapshot: > >>> ACTIVE JOBS-------------------------- > >>> JOBID JOBNAME USERNAME STATE 
CORE REMAINING STARTTIME >>> >>> ================================================================================ >>> 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 >>> 15:59:41 >>> 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 >>> 17:07:41 >>> 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 >>> 17:07:41 >>> 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 >>> 17:09:38 >>> >>> >>> swift session snippet >>> Progress: Selecting site:38 Submitted:707 Active:278 Finished >>> successfully:1861 >>> Progress: Selecting site:38 Submitted:707 Active:277 Checking >>> status:1 Finished successfully:1861 >>> >>> >>> sites.xml >>> >>> >>> >>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> >>> TG-CCR080022N >>> /work/01035/tg802895/blast-runs >>> 16 >>> development >>> 4 >>> 00:30:00 >>> 2 >>> 2 >>> 10 >>> >>> >>> I'll send the swift and coaster logs once the run finishes. >>> >>> -Allan >>> -- Allan M. Espinosa PhD student, Computer Science University of Chicago -------------- next part -------------- A non-text attachment was scrubbed... Name: bug05.tar.gz Type: application/x-gzip Size: 5444754 bytes Desc: not available URL: From HodgessE at uhd.edu Wed Jun 17 14:16:07 2009 From: HodgessE at uhd.edu (Hodgess, Erin) Date: Wed, 17 Jun 2009 14:16:07 -0500 Subject: [Swift-devel] updated files for tutorial Message-ID: <70A5AC06FDB5E54482D19E1C04CDFCF307C37048@BALI.uhd.campus> Dear Swift Development: Please find the locations of the appropriate updated files for the tutorial (on home machine). /home/erin/cog/modules/swift/docs The files are tutorial.php and tutorial.html respectively. Please let me know if I need to do further changes. Thanks, Erin Erin M. Hodgess, PhD Associate Professor Department of Computer and Mathematical Sciences University of Houston - Downtown mailto: hodgesse at uhd.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From aespinosa at cs.uchicago.edu Wed Jun 17 14:22:55 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Wed, 17 Jun 2009 14:22:55 -0500 Subject: [Swift-devel] more active processes than requested cores In-Reply-To: <50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com> References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> <1245241057.8776.6.camel@localhost> <4A39079A.3050001@uchicago.edu> <50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com> Message-ID: <50b07b4b0906171222k172bf7a1y9e9d44841c7c8b3d@mail.gmail.com> Oops, I forgot all the wrapper logs; this next attachment should have them. 2009/6/17 Allan Espinosa : > I also get this after a while. > > Attached are the logs when the workflow finished. Actually it did not > finish because the coaster got an out of memory error. This does not > happen if coasters were not used. 
> > 2009/6/17 Zhao Zhang : >> Hi, All >> >> Here is something in my test case: >> >> Swift says: >> Progress: Selecting site:80 Submitted:828 Active:115 Finished in >> previous run:487 Finished successfully:295 >> Progress: Selecting site:80 Submitted:828 Active:115 Finished in >> previous run:487 Finished successfully:295 >> Progress: Selecting site:80 Submitted:828 Active:115 Finished in >> previous run:487 Finished successfully:295 >> Progress: Selecting site:80 Submitted:828 Active:115 Finished in >> previous run:487 Finished successfully:295 >> Progress: Selecting site:80 Submitted:828 Active:115 Finished in >> previous run:487 Finished successfully:295 >> >> And showq -u says >> login3% showq -u >> ACTIVE JOBS-------------------------- >> JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME >> ================================================================================ >> >> 0 active jobs : 0 of 3828 hosts ( 0.00 %) >> >> Why are there no active SGE jobs when swift says there are 115 active jobs? >> >> zhao >> >> Mihael Hategan wrote: >>> >>> On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote: >>> >>>> >>>> By the throttling parameters below, I do expect to have a thousand >>>> jobs active at a time. But shouldn't the coaster request larger >>>> blocks to accommodate the 277 active jobs? >>>> >>> >>> Not if they fit in existing blocks (either vertically or horizontally). >>> This is something that should be thought of some more, but for short >>> jobs it seems ok. >>> >>> >>>> >>>> sge snapshot: >>>> ACTIVE JOBS-------------------------- >>>> JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME >>>> >>>> ================================================================================ >>>> 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 >>>> 15:59:41 >>>> 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 >>>> 17:07:41 >>>> 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 >>>> 17:07:41 >>>> 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 >>>> 17:09:38 >>>> >>>> >>>> swift session snippet >>>> Progress: Selecting site:38 Submitted:707 Active:278 Finished >>>> successfully:1861 >>>> Progress: Selecting site:38 Submitted:707 Active:277 Checking >>>> status:1 Finished successfully:1861 >>>> >>>> >>>> sites.xml >>>> >>>> >>>> >>>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> >>>> TG-CCR080022N >>>> /work/01035/tg802895/blast-runs >>>> 16 >>>> development >>>> 4 >>>> 00:30:00 >>>> 2 >>>> 2 >>>> 10 >>>> >>>> >>>> I'll send the swift and coaster logs once the run finishes. -------------- next part -------------- A non-text attachment was scrubbed... Name: buf04.tar.gz Type: application/x-gzip Size: 5654046 bytes Desc: not available URL: From benc at hawaga.org.uk Thu Jun 18 03:35:44 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 18 Jun 2009 08:35:44 +0000 (GMT) Subject: [Swift-devel] updated files for tutorial In-Reply-To: <70A5AC06FDB5E54482D19E1C04CDFCF307C37048@BALI.uhd.campus> References: <70A5AC06FDB5E54482D19E1C04CDFCF307C37048@BALI.uhd.campus> Message-ID: You need to submit the changes to the .xml files, not the generated ones. Do this: in your docs directory, type: svn diff > whatever.diff and then make that whatever.diff available here. 
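For example, using the docs path you mentioned (a sketch; the name of the diff file is arbitrary):
cd /home/erin/cog/modules/swift/docs
svn diff > tutorial.diff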
That gives specific information about the changes you made to the XML file, rather than the end PHPs and HTMLs - the PHP and HTML files on the website are generated from the latest SVN version every night. To contribute to Swift, you need to have gone through the dev.globus contributor licensing paperwork - basically you and your employer need to fill out a licence and give it to Gigi at Argonne (who sits in C101). On Wed, 17 Jun 2009, Hodgess, Erin wrote: > Dear Swift Development: > > Please find the locations of the appropriate updated files for the tutorial (on home machine). > /home/erin/cog/modules/swift/docs > > The files are tutorial.php and tutorial.html respectively. > > > > Please let me know if I need to do further changes. > > Thanks, > Erin > > > Erin M. Hodgess, PhD > Associate Professor > Department of Computer and Mathematical Sciences > University of Houston - Downtown > mailto: hodgesse at uhd.edu > > > From benc at hawaga.org.uk Thu Jun 18 03:59:02 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 18 Jun 2009 08:59:02 +0000 (GMT) Subject: [Swift-devel] Re: [metrics-dev] feasability of collecting swift usage stats through the globus usage stats mechanism (fwd) Message-ID: It's been mooted a few times over the last year or so, so I enquired with metrics-dev about using the globus usage stats stuff for very basic swift usage info. Here's the response below in case anyone is interested in following up. ---------- Forwarded message ---------- Date: Wed, 17 Jun 2009 09:31:19 -0600 From: Lee Liming To: Ben Clifford Cc: metrics-dev at globus.org Subject: Re: [metrics-dev] feasability of collecting swift usage stats through the globus usage stats mechanism Absolutely yes. See http://dev.globus.org/wiki/Incubator/Metrics#FAQ for the information you ask about here. I think all of the topics are covered in the FAQ and linked docs. In summary, the code and mechanism needed to do this is totally open and available. Using the CDIGS listener service itself requires coordination with the person who operates it (currently Joe Bester) but it is do-able. Running your own listener requires no coordination and is a good option. You may want to consider operating your own listener service. The "global" CDIGS listener service is experiencing growing pains at the moment, and is not currently (this week, maybe next) available for you to experiment with because it's being serviced. It's also pretty heavily loaded, so your performance (e.g., report generation) will not be stellar. It's quite easy to bring up a listener service, and if you have control of your code deployment (the code being reported on), you can easily configure where it sends reports. You could even have it report to multiple listeners, such as a Swift-specific listener *and* the CDIGS listener. The largest challenge in running your own listener would be sustaining its operation over time, and you will have to think a bit about what your requirements are in that area. (How badly you want to have *every* usage report.) If it's not vital that you have each and every usage report (but get a good sampling, for example, and keep track of when you were vs. weren't listening), then this should be a pretty lightweight thing to do. CDIGS has tried to be meticulous about high availability and not losing any data, and our record over several years is quite good, but it can be a high-stress enterprise and requires significant attention for short (mostly unpredictable) times. 
I am not 100% sure we know the return we're getting for such effort. --- Lee On Jun 17, 2009, at 2:39 AM, Ben Clifford wrote: > > I would like to investigate the feasability of collecting basic usage > stats for Swift through the globus usage stats mechanism. Specifically: > > 1) is the usage stats mechanism even open to other dev.globus projects > (sociopolitically, not technically) > > 2) what actual code needs to be added to the client to send usage packets > (is there a packaged library?) > > 3) what needs to happen at the receiving end to get useful reports. > > The sort of information I suspect being logged would be (for each run > where Swift ends naturally): > > i) svn revision number > ii) number of tasks executed > > -- From hategan at mcs.anl.gov Thu Jun 18 07:14:47 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Jun 2009 07:14:47 -0500 Subject: [Swift-devel] more active processes than requested cores In-Reply-To: <50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com> References: <50b07b4b0906161530g9c37926i610c83901dc9f914@mail.gmail.com> <1245241057.8776.6.camel@localhost> <4A39079A.3050001@uchicago.edu> <50b07b4b0906171208h24d38d9ey64c8e6cd9708b5ce@mail.gmail.com> Message-ID: <1245327287.25261.3.camel@localhost> Ok. This is getting messy, and I need to be able to reproduce it. I suggest testing with one of the existing workflows, such as 066-many.swift, and if that does not trigger the problem, a custom version of it with /bin/sleep instead. If that fails too, I'll need access to your blast installation. I also need to know if this is an intermittent issue or not, so testing more than once would be desirable. On Wed, 2009-06-17 at 14:08 -0500, Allan Espinosa wrote: > I also get this after a while. > > Attached are the logs when the workflow finished. Actually it did not > finish because the coaster got an out of memory error. This does not > happen if coasters were not used. > > 2009/6/17 Zhao Zhang : > > Hi, All > > > > Here is something in my test case: > > > > Swift says: > > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > > previous run:487 Finished successfully:295 > > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > > previous run:487 Finished successfully:295 > > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > > previous run:487 Finished successfully:295 > > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > > previous run:487 Finished successfully:295 > > Progress: Selecting site:80 Submitted:828 Active:115 Finished in > > previous run:487 Finished successfully:295 > > > > And showq -u says > > login3% showq -u > > ACTIVE JOBS-------------------------- > > JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME > > ================================================================================ > > > > 0 active jobs : 0 of 3828 hosts ( 0.00 %) > > > > Why there are no active SGE jobs, but swift says there are 115 active jobs? > > > > zhao > > > > Mihael Hategan wrote: > >> > >> On Tue, 2009-06-16 at 17:30 -0500, Allan Espinosa wrote: > >> > >>> > >>> By the throttling parameters below, i do expect to have a thousand > >>> jobs active at a time. But shouldn't the coaster request larger > >>> blocks to accommodate the 277 active jobs? > >>> > >> > >> Not if they fit in existing blocks (either vertically or horizontally). > >> This is something that should be thought of some more, but for short > >> jobs it seems ok. 
> >> > >> > >>> > >>> sge snapshot: > >>> ACTIVE JOBS-------------------------- > >>> JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME > >>> > >>> ================================================================================ > >>> 779616 data tg802895 Running 16 00:36:01 Tue Jun 16 > >>> 15:59:41 > >>> 779723 data tg802895 Running 16 01:44:01 Tue Jun 16 > >>> 17:07:41 > >>> 779724 data tg802895 Running 16 01:44:01 Tue Jun 16 > >>> 17:07:41 > >>> 779727 data tg802895 Running 16 01:45:58 Tue Jun 16 > >>> 17:09:38 > >>> > >>> > >>> swift session snippet > >>> Progress: Selecting site:38 Submitted:707 Active:278 Finished > >>> successfully:1861 > >>> Progress: Selecting site:38 Submitted:707 Active:277 Checking > >>> status:1 Finished successfully:1861 > >>> > >>> > >>> sites.xml > >>> > >>> > >>> > >>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/> > >>> TG-CCR080022N > >>> /work/01035/tg802895/blast-runs > >>> 16 > >>> development > >>> 4 > >>> 00:30:00 > >>> 2 > >>> 2 > >>> 10 > >>> > >>> > >>> > >>> I'll send the swift and coaster logs once the run finishes. > >>> > >>> -Allan > >>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Thu Jun 18 07:19:17 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Jun 2009 07:19:17 -0500 Subject: [Swift-devel] Re: [metrics-dev] feasability of collecting swift usage stats through the globus usage stats mechanism (fwd) In-Reply-To: References: Message-ID: <1245327557.25261.8.camel@localhost> On Thu, 2009-06-18 at 08:59 +0000, Ben Clifford wrote: > You may want to consider operating your own listener service. The "global" > CDIGS listener service is experiencing growing pains at the moment, The "global CDIGS listener service" was experiencing growing pains from the start. Table-top software is a different business from scalable software. From aespinosa at cs.uchicago.edu Thu Jun 18 13:04:18 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 18 Jun 2009 13:04:18 -0500 Subject: [Swift-devel] scheduler scoring with file transfer Message-ID: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com> I observed in the swift logs that there are scheduler score updates after FILE_OPERATIONs. As we can see below, in the no-stagein workflow there are fewer submitted jobs than in the one with stageins. 
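For reference, this is roughly how I spotted the score updates (a sketch; the actual swift log file name differs per run):
grep FILE_OPERATION *.log | head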
Same score and throttling parameters in the sites.xml file: /home/aespinosa/workflows/activelog/workdir 2.02 1.98 fast 4 2 64 00:06:00 3600 [aespinosa at tp-login1 blast]$ ./demoblast.sh (blast.swift with coasters) Swift svn swift-r2949 cog-r2406 RunID: out.run_000 Progress: Progress: Progress: Progress: Progress: Progress: uninitialized:1 Progress: Initializing:1022 Selecting site:1 Progress: Selecting site:1022 Initializing site shared directory:1 Progress: Selecting site:1011 Stage in:12 Progress: Selecting site:1010 Stage in:13 Progress: Selecting site:1005 Stage in:18 Progress: Selecting site:998 Stage in:25 Progress: Selecting site:989 Stage in:34 Progress: Selecting site:988 Stage in:35 Progress: Selecting site:984 Stage in:39 Progress: Selecting site:983 Stage in:40 Progress: Selecting site:974 Stage in:49 Progress: Selecting site:973 Submitting:49 Submitted:1 Progress: Selecting site:973 Submitted:50 Progress: Selecting site:973 Submitted:50 ACTIVE JOBS-------------------- JOBNAME USERNAME STATE PROC REMAINING STARTTIME 0 Active Jobs 174 of 200 Processors Active (87.00%) 94 of 100 Nodes Active (94.00%) IDLE JOBS---------------------- JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME 1101631 aespinosa Idle 23 00:54:00 Thu Jun 18 12:47:56 066-many.swift (no stageins) [aespinosa at tp-login1 activelog]$ ./runtest.sh Swift svn swift-r2949 cog-r2406 RunID: activelog Progress: Progress: uninitialized:1 Progress: Initializing:1022 Selecting site:1 Progress: Selecting site:1022 Initializing site shared directory:1 Progress: Selecting site:1013 Submitting:9 Submitted:1 Progress: Selecting site:1013 Submitted:10 ... -- Allan M. Espinosa PhD student, Computer Science University of Chicago From hategan at mcs.anl.gov Thu Jun 18 13:12:32 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Jun 2009 13:12:32 -0500 Subject: [Swift-devel] scheduler scoring with file transfer In-Reply-To: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com> References: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com> Message-ID: <1245348752.1601.2.camel@localhost> On Thu, 2009-06-18 at 13:04 -0500, Allan Espinosa wrote: > I observed in the swift logs that there are scheduler score updates after > FILE_OPERATIONs. As we can see below, in the no-stagein workflow there > are fewer submitted jobs than in the one with stageins. Yes. There is more load on the site when there are files to transfer than when there are no files to transfer. > > Does this mean I have to match my file transfer throttles with job > submission throttles? I don't know what that means. From benc at hawaga.org.uk Thu Jun 18 13:19:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 18 Jun 2009 18:19:26 +0000 (GMT) Subject: [Swift-devel] scheduler scoring with file transfer In-Reply-To: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com> References: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com> Message-ID: On Thu, 18 Jun 2009, Allan Espinosa wrote: > Does this mean I have to match my file transfer throttles with job > submission throttles? no. While score capacity on a site is used up dealing with files, that same capacity won't be used to submit jobs - the adaptive rate limiting attempts to restrict the load put on a site, not the number of jobs submitted to a site. 
File transfer and operation load is still load; although it is qualitatively different from job submission load, the scheduler doesn't make that distinction. -- From aespinosa at cs.uchicago.edu Thu Jun 18 13:22:35 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 18 Jun 2009 13:22:35 -0500 Subject: [Swift-devel] scheduler scoring with file transfer In-Reply-To: <1245348752.1601.2.camel@localhost> References: <50b07b4b0906181104g1834a4c0n60c863e1b9e4aad@mail.gmail.com> <1245348752.1601.2.camel@localhost> Message-ID: <50b07b4b0906181122s19eaa8c2j5adf2b5e27b93bcf@mail.gmail.com> 2009/6/18 Mihael Hategan : > On Thu, 2009-06-18 at 13:04 -0500, Allan Espinosa wrote: >> I observed in the swift logs that there are scheduler score updates after >> FILE_OPERATIONs. As we can see below, in the no-stagein workflow there >> are fewer submitted jobs than in the one with stageins. > > Yes. There is more load on the site when there are files to transfer > than when there are no files to transfer. > >> I want to have the same number of jobs at the point of job submission to replicate some bugs. Guess I'll just add file transfers in my 066-many.swift workflow. >> Does this mean I have to match my file transfer throttles with job >> submission throttles? > > I don't know what that means. > > > From wilde at mcs.anl.gov Thu Jun 18 14:22:10 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 18 Jun 2009 14:22:10 -0500 Subject: [Swift-devel] Cant run condor-g on TeraPort Message-ID: <4A3A93E2.2080805@mcs.anl.gov> As far as I can tell, the condor client code is broken on TeraPort. I've tried this on tp-login and tp-osg; I am using +osg-client and @osg in my .soft. I source $VDT_LOCATION/setup.sh Zhao, Glen, can you cross-check and see if you are now seeing the same thing? My suspicion is that the condor client config broke in the last month, through OSG changes, CI Support work, etc etc. - Mike I get this from condor_q: tp$ condor_q Error: Extra Info: You probably saw this error because the condor_schedd is not running on the machine you are trying to query. If the condor_schedd is not running, the Condor system will not be able to find an address and port to connect to and satisfy this request. Please make sure the Condor daemons are running and try again. Extra Info: If the condor_schedd is running on the machine you are trying to query and you still see the error, the most likely cause is that you have setup a personal Condor, you have not defined SCHEDD_NAME in your condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE setting. You must define either or both of those settings in your config file, or you must use the -name option to condor_q. Please see the Condor manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE. tp$ and this from swift: tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift Swift svn swift-r2890 cog-r2392 RunID: 20090618-1404-mo0thjj4 Progress: Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h on firefly Progress: Failed:1 Execution failed: Exception in cat: Arguments: [data.txt] Host: firefly Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj stderr.txt: stdout.txt: ---- Caused by: Cannot submit job: Could not submit job (condor_submit reported an exit code of 1). 
no error output tp-grid1$ ls -- Using this sites file: grid gt2 ff-grid.unl.edu/jobmanager-pbs /panfs/panasas/CMS/data/oops/wilde/swiftwork From hategan at mcs.anl.gov Thu Jun 18 14:25:16 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 18 Jun 2009 14:25:16 -0500 Subject: [Swift-devel] Cant run condor-g on TeraPort In-Reply-To: <4A3A93E2.2080805@mcs.anl.gov> References: <4A3A93E2.2080805@mcs.anl.gov> Message-ID: <1245353116.2875.0.camel@localhost> Send mail to Ti to restart the daemon (or fix whatever configuration problems prevent it from starting). On Thu, 2009-06-18 at 14:22 -0500, Michael Wilde wrote: > As far as I can tell, the condor client code is broken on TeraPort. > > I've tried this on tp-login and tp-osg; I am using +osg-client and @osg > in my .soft. I source $VDT_LOCATION/setup.sh > > Zhao, Glen, can you cross-check and see if you are now seeing the same > thing? > > My suspicion is that the condor client config broke in the last month, > through OSG changes, CI Support work, etc etc. > > - Mike > > > I get this from condor_q: > > tp$ condor_q > Error: > > Extra Info: You probably saw this error because the condor_schedd is not > running on the machine you are trying to query. If the condor_schedd is not > running, the Condor system will not be able to find an address and port to > connect to and satisfy this request. Please make sure the Condor daemons > are > running and try again. > > Extra Info: If the condor_schedd is running on the machine you are > trying to > query and you still see the error, the most likely cause is that you have > setup a personal Condor, you have not defined SCHEDD_NAME in your > condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE > setting. You must define either or both of those settings in your config > file, or you must use the -name option to condor_q. Please see the Condor > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE. > tp$ > > and this from swift: > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift > Swift svn swift-r2890 cog-r2392 > > RunID: 20090618-1404-mo0thjj4 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h on > firefly > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [data.txt] > Host: firefly > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: Could not submit job (condor_submit reported an exit > code of 1). no error output > tp-grid1$ ls > > -- > > Using this sites file: > > > > > > grid > gt2 > ff-grid.unl.edu/jobmanager-pbs > >/panfs/panasas/CMS/data/oops/wilde/swiftwork > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From zhaozhang at uchicago.edu Thu Jun 18 14:29:07 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 18 Jun 2009 14:29:07 -0500 Subject: [Swift-devel] condor-g test on ff-grid site Message-ID: <4A3A9583.8010005@uchicago.edu> Dear All I am trying to run a workflow on the ff-grid site with the condor-g feature. My submit host is tp-osg.ci.uchicago.edu. I have a question about the remote site requirements. Does the remote site require a Condor jobmanager in order for us to run swift with condor-g there? I ask because ff-grid only has a PBS jobmanager. 
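As I understand it, condor-g just turns each job into a grid-universe submit file along these lines (a hand-written sketch, not the exact file swift generates; /bin/id is only a placeholder executable):
universe = grid
grid_resource = gt2 ff-grid.unl.edu/jobmanager-pbs
executable = /bin/id
output = test.out
error = test.err
log = test.log
queue
So the remote side would only need the GRAM gatekeeper and its PBS jobmanager, not Condor itself.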
Here is my sites.xml [zzhang at tp-grid1 sites]$ cat condor-g_new/ff-grid.xml /mnt/panasas/CMS/grid_users/osg/ grid gt2 ff-grid.unl.edu/jobmanager-pbs The reason I am asking this is because my test failed on ff-grid site. All related logs are at CI network /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/logs/ff-grid/ Execution failed: Exception in cat: Arguments: [061-cattwo.1.in, 061-cattwo.2.in] Host: ff-grid Directory: 061-cattwo-20090618-1407-gfg03g57/jobs/v/cat-v66x3gcj stderr.txt: stdout.txt: ---- Caused by: No status file was found. Check the shared filesystem on ff-grid SWIFT RETURN CODE NON-ZERO - test 061-cattwo On the remote site, the shared dir was created, but the jobs dir wasn't. [zzhang at tp-grid1 ~]$ globus-job-run ff-grid.unl.edu /bin/ls 061-cattwo-20090618- 1407-gfg03g57/ info kickstart shared status Any idea on the job failure? Also, to make sure it is not the test workflow's problem, I tested exactly the same suite on the GLOW site. best zhao From zhaozhang at uchicago.edu Thu Jun 18 14:32:19 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Thu, 18 Jun 2009 14:32:19 -0500 Subject: [Swift-devel] Cant run condor-g on TeraPort In-Reply-To: <4A3A93E2.2080805@mcs.anl.gov> References: <4A3A93E2.2080805@mcs.anl.gov> Message-ID: <4A3A9643.6050908@uchicago.edu> Hi, Mike Michael Wilde wrote: > As far as I can tell, the condor client code is broken on TeraPort. > > Ive tried this on tp-login and tp-osg; I am using +osg-client and @osg > in my .soft. I source $VDT_LOCATION/setup.sh > > Zhao, Glen, can you cross-check and see if you are now seeing the same > thing? > > My suspicion is that the condor client config broke in the last month, > through OSG changes, CI Support work, etc etc. > > - Mike > > > I get this from condor_q: condor_q is working for me [zzhang at tp-grid1 sites]$ condor_q -- Submitter: tp-grid1.ci.uchicago.edu : <128.135.125.118:43109> : tp-grid1.ci.uchicago.edu ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 101.0 zzhang 4/28 16:15 0+00:02:06 X 0 1.0 bash /nfs/home/osg 137.0 zzhang 4/29 12:25 0+00:00:00 X 0 1.0 bash /nfs/osg-data 138.0 zzhang 4/29 13:02 0+00:00:00 X 0 1.0 bash /scratch/ufhp 139.0 zzhang 4/29 16:15 0+00:00:00 X 0 1.0 bash /opt/osg/data 140.0 zzhang 5/5 14:12 0+00:00:43 X 0 1.0 bash /nfs/osg-data 157.0 zzhang 5/5 14:49 0+00:00:00 X 0 1.0 bash /atlas/data08 158.0 zzhang 5/5 14:59 0+00:00:00 X 0 1.0 bash /raid2/osg-da 159.0 zzhang 5/5 15:03 0+00:00:00 X 0 1.0 bash /raid2/osg-da The source file in my .bashrc is "source /opt/osg/setup.sh" not "/opt/osg-ce-1.0.0-r2/setup.sh". [zzhang at tp-grid1 sites]$ echo $VDT_LOCATION /opt/osg-ce-1.0.0-r2 zhao > > tp$ condor_q > Error: > > Extra Info: You probably saw this error because the condor_schedd is not > running on the machine you are trying to query. If the condor_schedd > is not > running, the Condor system will not be able to find an address and > port to > connect to and satisfy this request. Please make sure the Condor > daemons are > running and try again. > > Extra Info: If the condor_schedd is running on the machine you are > trying to > query and you still see the error, the most likely cause is that you have > setup a personal Condor, you have not defined SCHEDD_NAME in your > condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE > setting. You must define either or both of those settings in your config > file, or you must use the -name option to condor_q. Please see the Condor > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE. 
> tp$ > > and this from swift: > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift > Swift svn swift-r2890 cog-r2392 > > RunID: 20090618-1404-mo0thjj4 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h > on firefly > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [data.txt] > Host: firefly > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: Could not submit job (condor_submit reported an > exit code of 1). no error output > tp-grid1$ ls > > -- > > Using this sites file: > > > > > > grid > gt2 > ff-grid.unl.edu/jobmanager-pbs > >/panfs/panasas/CMS/data/oops/wilde/swiftwork > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Thu Jun 18 14:31:26 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT) Subject: [Swift-devel] Cant run condor-g on TeraPort In-Reply-To: <4A3A93E2.2080805@mcs.anl.gov> References: <4A3A93E2.2080805@mcs.anl.gov> Message-ID: condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather than use softenv. it doesn't work for me if I use @osg in softenv, with the error you report. On Thu, 18 Jun 2009, Michael Wilde wrote: > As far as I can tell, the condor client code is broken on TeraPort. > > Ive tried this on tp-login and tp-osg; I am using +osg-client and @osg in my > .soft. I source $VDT_LOCATION/setup.sh > > Zhao, Glen, can you cross-check and see if you are now seeing the same thing? > > My suspicion is that the condor client config broke in the last month, through > OSG changes, CI Support work, etc etc. > > - Mike > > > I get this from condor_q: > > tp$ condor_q > Error: > > Extra Info: You probably saw this error because the condor_schedd is not > running on the machine you are trying to query. If the condor_schedd is not > running, the Condor system will not be able to find an address and port to > connect to and satisfy this request. Please make sure the Condor daemons are > running and try again. > > Extra Info: If the condor_schedd is running on the machine you are trying to > query and you still see the error, the most likely cause is that you have > setup a personal Condor, you have not defined SCHEDD_NAME in your > condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE > setting. You must define either or both of those settings in your config > file, or you must use the -name option to condor_q. Please see the Condor > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE. > tp$ > > and this from swift: > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift > Swift svn swift-r2890 cog-r2392 > > RunID: 20090618-1404-mo0thjj4 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h on > firefly > Progress: Failed:1 > Execution failed: > Exception in cat: > Arguments: [data.txt] > Host: firefly > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Cannot submit job: Could not submit job (condor_submit reported an > exit code of 1). 
no error output > tp-grid1$ ls > > -- > > Using this sites file: > > > > > > > > grid > gt2 > ff-grid.unl.edu/jobmanager-pbs > /panfs/panasas/CMS/data/oops/wilde/swiftwork > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Thu Jun 18 14:38:32 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 18 Jun 2009 19:38:32 +0000 (GMT) Subject: [Swift-devel] condor-g test on ff-grid site In-Reply-To: <4A3A9583.8010005@uchicago.edu> References: <4A3A9583.8010005@uchicago.edu> Message-ID: On Thu, 18 Jun 2009, Zhao Zhang wrote: > I have a question about the remote site requirements. Does the remote site require > a Condor jobmanager in order > for us to run swift with condor-g there? no. condor-g is a submit-side only requirement. Does the site work using swift+plain gram2 instead of swift+condor-g? -- From hockyg at uchicago.edu Thu Jun 18 15:15:03 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Thu, 18 Jun 2009 15:15:03 -0500 Subject: [Swift-devel] condor-g test on ff-grid site In-Reply-To: References: <4A3A9583.8010005@uchicago.edu> Message-ID: Hey Zhao, I couldn't get it to work from teraport, but from the engage login host, engage-submit, I can with this default grid gt2 ff-grid.unl.edu/jobmanager-pbs /panfs/panasas/CMS/data/oops/swiftwork On Thu, Jun 18, 2009 at 2:38 PM, Ben Clifford wrote: > > On Thu, 18 Jun 2009, Zhao Zhang wrote: > > > I have a question about the remote site requirements. Does the remote site > require > > a Condor jobmanager in order > > for us to run swift with condor-g there? > > no. condor-g is a submit-side only requirement. > > Does the site work using swift+plain gram2 instead of swift+condor-g? > > -- > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Jun 18 16:44:03 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 18 Jun 2009 16:44:03 -0500 Subject: [Swift-devel] How to set .soft and env to run condor on TeraPort? Message-ID: <4A3AB523.2060205@mcs.anl.gov> Hi, Swift users need to run the condor-g client in order to send jobs to OSG sites from a Swift script. Can you tell us how to set .soft and env so that condor_submit to "grid" universe works? We've had all sorts of problems in getting this to work well: - the version of condor client code on communicado is too new to run with Swift. - On teraport, it seems difficult to get the right settings of .soft entries and setup.sh scripts to work correctly together - I still don't know if what worked for Zhao on tp-osg a month ago still works. It seems not to, and I can't tell if it's because of a change in .soft or env settings, or some other software issue - We would like to run from Teraport compute nodes with qsub -I, and hope that whatever we determine to be the right settings for login nodes work on interactive compute nodes as well. - It would be good *not* to run on tp-osg. Suchandra, Ti, or Greg, can you help us sort out how to set things correctly? 
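For clarity, the exact sequence that fails for me is simply (with +osg-client and @osg already in .soft):
tp$ source $VDT_LOCATION/setup.sh
tp$ condor_q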
Thanks,

Mike

-------- Original Message --------
Subject: Re: [Swift-devel] Cant run condor-g on TeraPort
Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
From: Ben Clifford
To: Michael Wilde
CC: swift-devel
References: <4A3A93E2.2080805 at mcs.anl.gov>

condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather than
use softenv. it doesn't work for me if I use @osg in softenv, with the
error you report.

On Thu, 18 Jun 2009, Michael Wilde wrote:

> As far as I can tell, the condor client code is broken on TeraPort.
>
> I've tried this on tp-login and tp-osg; I am using +osg-client and @osg in my
> .soft. I source $VDT_LOCATION/setup.sh
>
> Zhao, Glen, can you cross-check and see if you are now seeing the same thing?
>
> My suspicion is that the condor client config broke in the last month, through
> OSG changes, CI Support work, etc etc.
>
> - Mike
>
> I get this from condor_q:
>
> tp$ condor_q
> Error:
>
> Extra Info: You probably saw this error because the condor_schedd is not
> running on the machine you are trying to query. If the condor_schedd is not
> running, the Condor system will not be able to find an address and port to
> connect to and satisfy this request. Please make sure the Condor daemons are
> running and try again.
>
> Extra Info: If the condor_schedd is running on the machine you are trying to
> query and you still see the error, the most likely cause is that you have
> setup a personal Condor, you have not defined SCHEDD_NAME in your
> condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
> setting. You must define either or both of those settings in your config
> file, or you must use the -name option to condor_q. Please see the Condor
> manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
> tp$
>
> and this from swift:
>
> tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift
> Swift svn swift-r2890 cog-r2392
>
> RunID: 20090618-1404-mo0thjj4
> Progress:
> Progress: Stage in:1
> Progress: Submitted:1
> Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h on
> firefly
> Progress: Failed:1
> Execution failed:
> Exception in cat:
> Arguments: [data.txt]
> Host: firefly
> Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Cannot submit job: Could not submit job (condor_submit reported an
> exit code of 1). no error output
> tp-grid1$ ls
>
> --
>
> Using this sites file:
>
> grid
> gt2
> ff-grid.unl.edu/jobmanager-pbs
> /panfs/panasas/CMS/data/oops/wilde/swiftwork
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From wilde at mcs.anl.gov Thu Jun 18 16:59:02 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 18 Jun 2009 16:59:02 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
Message-ID: <4A3AB8A6.9070706@mcs.anl.gov>

Zhao and Allan have been testing the new coaster block-allocation version
on Ranger.

They have reported some issues, and need to work with Mihael to better
characterize the errors, and try to reproduce them in a way that Mihael
can also do.

From working with them, I see two more issues that should be discussed and
resolved, which I think they have not yet mentioned on the list. Zhao will
discuss at least one of these, but is swamped getting a science run
completed for the SEE project.
The issues:

1) It's hard to configure the time dimensions of the allocator, and to
make it work well with Swift retry parameters. The properties listed in
the table in the User Guide coaster section need more explanation and
examples. I think Zhao in his latest run got these working OK for the
"ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with
him on this, but help from others is welcome.

2) Allan and Zhao got kicked off of Ranger because the Coaster service was
consuming too much time on the head node, which is also "login3". We were
impacting other users, and got a "cease and desist" order from the Ranger
sysadmin. They have at least one anecdotal "top" snapshot from the host
that indicates the service was indeed using a lot of time (on his 2000 job
x 2 hour script). At the same time, Zhao sees a huge coaster (service?)
log. Maybe related?

Allan and Zhao, please keep updates flowing to swift-devel with the list
and status of coaster issues (ideally bugzilla'ed when appropriate), and
work with Mihael to capture the logs and test cases he needs to see for
each problem. Can you both work together to make a list, and with Mihael
to decide which items need to be tracked as bugs?

Thanks,

Mike

From iraicu at cs.uchicago.edu Thu Jun 18 21:33:29 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 18 Jun 2009 21:33:29 -0500
Subject: [Swift-devel] [Fwd: Updated Call for participation in eScience09, please distribute]
Message-ID: <4A3AF8F9.9030302@cs.uchicago.edu>

Hi,
Here is an interesting conference on e-Science.
Cheers,
Ioan

-------- Original Message --------
Subject: Updated Call for participation in eScience09, please distribute
Date: Wed, 10 Jun 2009 05:00:15 -0700
From:
Reply-To: david.wallom at oerc.ox.ac.uk
To: iraicu at cs.uchicago.edu

Dear Ioan Raicu

Following a note that the earlier call was missing the submission URLs,
please circulate the updated call for papers for eScience09 below through
your network of contacts.

Regards
David

+++++++++++++++++++++++++++++++
e-Science 2009, call for papers
+++++++++++++++++++++++++++++++

About
-----
Scientific research is increasingly carried out by communities of
researchers that span disciplines, laboratories, organizations, and
national boundaries. The e-Science 2009 conference is designed to bring
together leading international and interdisciplinary research communities,
developers, and users of e-Science applications and enabling technologies.
The conference serves as a forum to present the results of the latest
research and product/tool developments and to highlight related activities
from around the world.

The sixth IEEE e-Science conference will be held in Oxford, UK from Dec
9-11. The meeting coincides with the UK e-Science All Hands Meeting that
will be held from Dec 7-9th, 2009.

Building on the successes of previous meetings, we would like to develop
some themes at the conference. These include

1. Arts and Humanities and e-Social Science
2. Bioinformatics and Health
3. Climate and Earth Sciences
4. Digital Repositories and Data Management
5. eScience Practice and Education
6. Physical Sciences and Engineering
7. Research Tools, Workflow and Systems

There is also the opportunity to submit a workshop programme that is
focussed on newer and less well developed areas of research. e-Science
2009 will also feature exhibits. As well as the vibrant research agenda,
the meeting will offer the opportunity to meet socially with colleagues in
some of the UK's most spectacular University venues.
We look forward to seeing you in Oxford,

Anne Trefethen (co-chair)
Dave De Roure (co-chair)
Paul Roe (Programme co-chair)
David Wallom (Programme co-chair)
Mark Baker (Workshop chair)

Instructions
------------
Authors are invited to submit papers with unpublished, original work of
not more than 8 pages of double column text using single spaced 10 point
size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines
(see website list below). Authors should submit a PDF or PostScript
(level 2) file that will print on a PostScript printer. Papers conforming
to the above guidelines can be submitted through the e-Science 2009 paper
submission system (see URL below). It is expected that the proceedings
will be published by the IEEE CS Press, USA and will be made available
online through the IEEE Digital Library.

The following topics concerning e-Science are of interest, but not
restrictive:

1. Arts and Humanities and e-Social Science
2. Bioinformatics and Health
3. Climate and Earth Sciences
4. Digital Repositories and Data Management
5. eScience Practice and Education
6. Physical Sciences and Engineering
7. Research Tools, Workflow and Systems

Important Dates
Papers Due: Friday 31st July, 2009
Notification of Acceptance: Tuesday 1st September, 2009
Camera Ready Papers Due: Friday 18th September, 2009

Publication Policy
All papers will be peer-reviewed. Accepted papers from both the main track
and the workshops will be published in pre-conference proceedings
published by IEEE. Selected excellent work may be eligible for additional
post-conference publication as extended papers in selected journals, such
as FGCS ( http://www.elsevier.com/locate/fgcs )

Websites
http://www.escience-meeting.org/eScience2006/instructions.html
https://cmt.research.microsoft.com/ESCIENCE2009/Default.aspx

--
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wilde at mcs.anl.gov Fri Jun 19 07:47:47 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 07:47:47 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <4A3AB8A6.9070706@mcs.anl.gov>
References: <4A3AB8A6.9070706@mcs.anl.gov>
Message-ID: <4A3B88F3.7020206@mcs.anl.gov>

More thoughts on this:

(2) is a showstopper on Ranger (and possibly elsewhere) and hence a much
more important issue than (1).

It seems like this problem merits a 2-pronged attack:

a) reduce the overhead. Is it logging, or intrinsic to the protocol?
   -- is it obvious from the log what's causing the high overhead?
   -- is it a situation where the overhead is incurred even when
      jobs are not running, just queued?

b) see if the service can be moved to a worker node

Mike

On 6/18/09 4:59 PM, Michael Wilde wrote:
> Zhao and Allan have been testing the new coaster block-allocation
> version on Ranger.
> > They have reported some issues, and need to work with Mihael to better > characterize the errors, and try to reproduce them in a way that Mihael > can also do. > > From working with them, I see two more issues that should be discussed > and resolved, which I think they have not yet mentioned on the list. > Zhao will discuss at least one of these, but is swamped getting a > science run completed for the SEE project. > > The issues: > > 1) Its hard to configure the time dimensions of the allocator, and to > make it work well with Swift retry parameters. The properties listed in > the table in the User Guide coaster section need more explanation and > examples. I think Zhao in his latest run got these working OK for the > "ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with > him on this, but help from others is welcome. > > 2) Allan and Zhao got kicked off of Ranger because the Coaster service > was consuming too much time on the head node, which is also "login3". We > were impacting other users, and got a "cease and desist" order from the > Ranger sysadmin. They have at least one anecdotal "top" snapshot from > the host that indicates the service was indeed using a lot of time (on > his 2000 job x 2 hour script). At the same time, Zhao sees a huge > coaster (service?) log. Maybe related? > > Allan and Zhao, please keep updates flowing to swift-devel with the list > and status of coaster issues (ideally bugzilla'ed when appropriate), and > work with Mihael to capture the logs and test cases he needs to see for > each problem. Can you both work together to make a list, and with > Mihael to decide which items need to be tracked as bugs? > > Thanks, > > Mike > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Fri Jun 19 08:23:28 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 19 Jun 2009 08:23:28 -0500 Subject: [Swift-devel] Overview of coaster block-allocation-version issues In-Reply-To: <4A3B88F3.7020206@mcs.anl.gov> References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov> Message-ID: <1245417808.18736.2.camel@localhost> On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote: > More thoughts on this: > > (2) is a showstopper on Ranger (and possible elsewhere) and hence a much > more important issue than (1). > > It seems like this problem merits a 2-pronged attack: > > a) reduce the overhead. Is it logging, or intrinsic to the protocol? > -- is it obvious from the log whats causing the high overhead? > -- its it a situation where the overhead is incurred even when > jobs are not running, just queued? Some profiling needs to be done. > b) see if the service can be moved to a worker node > > Mike > > > On 6/18/09 4:59 PM, Michael Wilde wrote: > > Zhao and Allan have been testing the new coaster block-allocation > > version on Ranger. > > > > They have reported some issues, and need to work with Mihael to better > > characterize the errors, and try to reproduce them in a way that Mihael > > can also do. > > > > From working with them, I see two more issues that should be discussed > > and resolved, which I think they have not yet mentioned on the list. > > Zhao will discuss at least one of these, but is swamped getting a > > science run completed for the SEE project. 
> > > > The issues: > > > > 1) Its hard to configure the time dimensions of the allocator, and to > > make it work well with Swift retry parameters. The properties listed in > > the table in the User Guide coaster section need more explanation and > > examples. I think Zhao in his latest run got these working OK for the > > "ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with > > him on this, but help from others is welcome. > > > > 2) Allan and Zhao got kicked off of Ranger because the Coaster service > > was consuming too much time on the head node, which is also "login3". We > > were impacting other users, and got a "cease and desist" order from the > > Ranger sysadmin. They have at least one anecdotal "top" snapshot from > > the host that indicates the service was indeed using a lot of time (on > > his 2000 job x 2 hour script). At the same time, Zhao sees a huge > > coaster (service?) log. Maybe related? > > > > Allan and Zhao, please keep updates flowing to swift-devel with the list > > and status of coaster issues (ideally bugzilla'ed when appropriate), and > > work with Mihael to capture the logs and test cases he needs to see for > > each problem. Can you both work together to make a list, and with > > Mihael to decide which items need to be tracked as bugs? > > > > Thanks, > > > > Mike > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From wilde at mcs.anl.gov Fri Jun 19 08:31:32 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 19 Jun 2009 08:31:32 -0500 Subject: [Swift-devel] Overview of coaster block-allocation-version issues In-Reply-To: <1245417808.18736.2.camel@localhost> References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov> <1245417808.18736.2.camel@localhost> Message-ID: <4A3B9334.80200@mcs.anl.gov> On 6/19/09 8:23 AM, Mihael Hategan wrote: > On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote: >> More thoughts on this: >> >> (2) is a showstopper on Ranger (and possible elsewhere) and hence a much >> more important issue than (1). >> >> It seems like this problem merits a 2-pronged attack: >> >> a) reduce the overhead. Is it logging, or intrinsic to the protocol? >> -- is it obvious from the log whats causing the high overhead? >> -- its it a situation where the overhead is incurred even when >> jobs are not running, just queued? > > Some profiling needs to be done. Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid, using a simple script and dummy app so that Mihael can readily reproduce? Mihael, do you want them to run with profiling and post results? - Mike > >> b) see if the service can be moved to a worker node >> >> Mike >> >> >> On 6/18/09 4:59 PM, Michael Wilde wrote: >>> Zhao and Allan have been testing the new coaster block-allocation >>> version on Ranger. >>> >>> They have reported some issues, and need to work with Mihael to better >>> characterize the errors, and try to reproduce them in a way that Mihael >>> can also do. >>> >>> From working with them, I see two more issues that should be discussed >>> and resolved, which I think they have not yet mentioned on the list. >>> Zhao will discuss at least one of these, but is swamped getting a >>> science run completed for the SEE project. 
>>> >>> The issues: >>> >>> 1) Its hard to configure the time dimensions of the allocator, and to >>> make it work well with Swift retry parameters. The properties listed in >>> the table in the User Guide coaster section need more explanation and >>> examples. I think Zhao in his latest run got these working OK for the >>> "ampl" SEE model he's running (2000 jobs, 2 hours each). I'll work with >>> him on this, but help from others is welcome. >>> >>> 2) Allan and Zhao got kicked off of Ranger because the Coaster service >>> was consuming too much time on the head node, which is also "login3". We >>> were impacting other users, and got a "cease and desist" order from the >>> Ranger sysadmin. They have at least one anecdotal "top" snapshot from >>> the host that indicates the service was indeed using a lot of time (on >>> his 2000 job x 2 hour script). At the same time, Zhao sees a huge >>> coaster (service?) log. Maybe related? >>> >>> Allan and Zhao, please keep updates flowing to swift-devel with the list >>> and status of coaster issues (ideally bugzilla'ed when appropriate), and >>> work with Mihael to capture the logs and test cases he needs to see for >>> each problem. Can you both work together to make a list, and with >>> Mihael to decide which items need to be tracked as bugs? >>> >>> Thanks, >>> >>> Mike >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Fri Jun 19 08:35:25 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 19 Jun 2009 08:35:25 -0500 Subject: [Swift-devel] Overview of coaster block-allocation-version issues In-Reply-To: <4A3B9334.80200@mcs.anl.gov> References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov> <1245417808.18736.2.camel@localhost> <4A3B9334.80200@mcs.anl.gov> Message-ID: <1245418525.19007.1.camel@localhost> On Fri, 2009-06-19 at 08:31 -0500, Michael Wilde wrote: > > On 6/19/09 8:23 AM, Mihael Hategan wrote: > > On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote: > >> More thoughts on this: > >> > >> (2) is a showstopper on Ranger (and possible elsewhere) and hence a much > >> more important issue than (1). > >> > >> It seems like this problem merits a 2-pronged attack: > >> > >> a) reduce the overhead. Is it logging, or intrinsic to the protocol? > >> -- is it obvious from the log whats causing the high overhead? > >> -- its it a situation where the overhead is incurred even when > >> jobs are not running, just queued? > > > > Some profiling needs to be done. > > Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid, > using a simple script and dummy app so that Mihael can readily reproduce? > > Mihael, do you want them to run with profiling and post results? That would be great. Get a hprof dump with cpu tracing enabled. 
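Roughly like this on the service JVM (a sketch from memory, so double-check
the flags against the article below; "interval" is the sample period in ms
and "depth" the recorded stack depth):

java -agentlib:hprof=cpu=samples,interval=20,depth=10 <usual coaster service command line>

When the JVM exits it writes java.hprof.txt in the working directory; post
that file.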
See http://java.sun.com/developer/technicalArticles/Programming/HPROF.html From hategan at mcs.anl.gov Fri Jun 19 08:47:23 2009 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 19 Jun 2009 08:47:23 -0500 Subject: [Swift-devel] Overview of coaster block-allocation-version issues In-Reply-To: <1245418525.19007.1.camel@localhost> References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov> <1245417808.18736.2.camel@localhost> <4A3B9334.80200@mcs.anl.gov> <1245418525.19007.1.camel@localhost> Message-ID: <1245419243.19245.1.camel@localhost> On Fri, 2009-06-19 at 08:35 -0500, Mihael Hategan wrote: > On Fri, 2009-06-19 at 08:31 -0500, Michael Wilde wrote: > > > > On 6/19/09 8:23 AM, Mihael Hategan wrote: > > > On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote: > > >> More thoughts on this: > > >> > > >> (2) is a showstopper on Ranger (and possible elsewhere) and hence a much > > >> more important issue than (1). > > >> > > >> It seems like this problem merits a 2-pronged attack: > > >> > > >> a) reduce the overhead. Is it logging, or intrinsic to the protocol? > > >> -- is it obvious from the log whats causing the high overhead? > > >> -- its it a situation where the overhead is incurred even when > > >> jobs are not running, just queued? > > > > > > Some profiling needs to be done. > > > > Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid, > > using a simple script and dummy app so that Mihael can readily reproduce? > > > > Mihael, do you want them to run with profiling and post results? > > That would be great. Get a hprof dump with cpu tracing enabled. See > http://java.sun.com/developer/technicalArticles/Programming/HPROF.html Bootstrap.java will also need to be modified for the relevant profiling parameters to be passed to the coaster service JVM. addDebuggingOptions() may be the right place to do so. From support at ci.uchicago.edu Fri Jun 19 08:49:17 2009 From: support at ci.uchicago.edu (Ti Leggett) Date: Fri, 19 Jun 2009 08:49:17 -0500 Subject: [Swift-devel] [CI Ticketing System #1074] How to set .soft and env to run condor on TeraPort? In-Reply-To: <4A3AB523.2060205@mcs.anl.gov> References: <4A3AB523.2060205@mcs.anl.gov> Message-ID: There were some misconfigurations in the @globus-4 macro for rhel-5 and condor that I've just fixed. Can you set your ~/.soft to look like below and then run resoft: @globus-4 @default You should be using /soft/condor-7.0.5-r1 and /soft/globus-4.2.1-r2 after that. Let me know if that works for you, or if anything changes. On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov wrote: > Hi, > > Swift users need to run the condor-g client in order to send jobs to > OSG > sites from a Swift script. > > Can you tell us how to set .soft and env so that condor_submit to > "grid" > universe works? > > We've had all sorts of problems in getting this to work well: > > - the version of condor client code on communicado is too new to run > with Swift. > > - On teraport, it seems difficult to get the right settings of .soft > entries and setup.sh scripts to work corrcetly together > > - I still dont know if what worked for Zhao on tp-osg a month ago > still > works. It seems not to, and I cant tell if its because of a change in > .soft or env settings, or some other software issue > > - We would like to run from Teraport compute nodes with qsub -I, and > hope that whatever we determine to be the right settings for login > nodes > work on interactive compute nodes as well. > > - It would be good *not* to run on tp-osg. 
> > Suchandra, Ti, or Greg, can you help us sort out how to set things > correctly? > > Tanks, > > Mike > > > -------- Original Message -------- > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT) > From: Ben Clifford > To: Michael Wilde > CC: swift-devel > References: <4A3A93E2.2080805 at mcs.anl.gov> > > > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather > than > use softenv. it doesn't work for me if I use @osg in softenv, with the > error you report. > > On Thu, 18 Jun 2009, Michael Wilde wrote: > > > As far as I can tell, the condor client code is broken on TeraPort. > > > > Ive tried this on tp-login and tp-osg; I am using +osg-client and > @osg in my > > .soft. I source $VDT_LOCATION/setup.sh > > > > Zhao, Glen, can you cross-check and see if you are now seeing the > same thing? > > > > My suspicion is that the condor client config broke in the last > month, through > > OSG changes, CI Support work, etc etc. > > > > - Mike > > > > > > I get this from condor_q: > > > > tp$ condor_q > > Error: > > > > Extra Info: You probably saw this error because the condor_schedd is > not > > running on the machine you are trying to query. If the condor_schedd > is not > > running, the Condor system will not be able to find an address and > port to > > connect to and satisfy this request. Please make sure the Condor > daemons are > > running and try again. > > > > Extra Info: If the condor_schedd is running on the machine you are > trying to > > query and you still see the error, the most likely cause is that you > have > > setup a personal Condor, you have not defined SCHEDD_NAME in your > > condor_config file, and something is wrong with your > SCHEDD_ADDRESS_FILE > > setting. You must define either or both of those settings in your > config > > file, or you must use the -name option to condor_q. Please see the > Condor > > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE. > > tp$ > > > > and this from swift: > > > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml > cat.swift > > Swift svn swift-r2890 cog-r2392 > > > > RunID: 20090618-1404-mo0thjj4 > > Progress: > > Progress: Stage in:1 > > Progress: Submitted:1 > > Failed to transfer wrapper log from cat-20090618-1404- > mo0thjj4/info/h on > > firefly > > Progress: Failed:1 > > Execution failed: > > Exception in cat: > > Arguments: [data.txt] > > Host: firefly > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj > > stderr.txt: > > > > stdout.txt: > > > > ---- > > > > Caused by: > > Cannot submit job: Could not submit job (condor_submit reported an > > exit code of 1). 
no error output > > tp-grid1$ ls > > > > -- > > > > Using this sites file: > > > > > > > > > > > > grid > > gt2 > > ff-grid.unl.edu/jobmanager-pbs > > >/panfs/panasas/CMS/data/oops/wilde/swiftwork > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > From wilde at mcs.anl.gov Fri Jun 19 09:40:00 2009 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 19 Jun 2009 09:40:00 -0500 Subject: [Swift-devel] Overview of coaster block-allocation-version issues In-Reply-To: <1245419243.19245.1.camel@localhost> References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov> <1245417808.18736.2.camel@localhost> <4A3B9334.80200@mcs.anl.gov> <1245418525.19007.1.camel@localhost> <1245419243.19245.1.camel@localhost> Message-ID: <4A3BA340.7000004@mcs.anl.gov> might in addition be good to start with empty logs, and summarize the record type counts in the log. That 5GB log size is a bit of a concern. Might be something simple like n^2 debug logging. On 6/19/09 8:47 AM, Mihael Hategan wrote: > On Fri, 2009-06-19 at 08:35 -0500, Mihael Hategan wrote: >> On Fri, 2009-06-19 at 08:31 -0500, Michael Wilde wrote: >>> On 6/19/09 8:23 AM, Mihael Hategan wrote: >>>> On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote: >>>>> More thoughts on this: >>>>> >>>>> (2) is a showstopper on Ranger (and possible elsewhere) and hence a much >>>>> more important issue than (1). >>>>> >>>>> It seems like this problem merits a 2-pronged attack: >>>>> >>>>> a) reduce the overhead. Is it logging, or intrinsic to the protocol? >>>>> -- is it obvious from the log whats causing the high overhead? >>>>> -- its it a situation where the overhead is incurred even when >>>>> jobs are not running, just queued? >>>> Some profiling needs to be done. >>> Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid, >>> using a simple script and dummy app so that Mihael can readily reproduce? >>> >>> Mihael, do you want them to run with profiling and post results? >> That would be great. Get a hprof dump with cpu tracing enabled. See >> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html > > Bootstrap.java will also need to be modified for the relevant profiling > parameters to be passed to the coaster service JVM. > > addDebuggingOptions() may be the right place to do so. > > From smartin at mcs.anl.gov Fri Jun 19 10:21:57 2009 From: smartin at mcs.anl.gov (Stuart Martin) Date: Fri, 19 Jun 2009 10:21:57 -0500 Subject: [Swift-devel] swift testing of gram5 on teraport Message-ID: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov> Hi Mike, Ben was planning on testing GRAM5 on teraport for Swift. Now that Ben is moving on, I am wondering what the plan is for that. Do you still plan to do that? Is there someone else that will do the testing? Ti was going to install GRAM5 for Ben to try out, but he has been delayed dealing with other issues. GRAM5 has not yet been installed on teraport. I was going to ask him again to install it, but I don't know who will now drive this testing. 
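Whoever picks this up: since GRAM5 talks the same wire protocol as GRAM2,
a first smoke test before any Swift-level runs could be as simple as the
following (a sketch; the hostname and jobmanager here are placeholders for
wherever Ti ends up installing it):

globusrun -a -r tp-grid1.ci.uchicago.edu          (authentication-only check)
globus-job-run tp-grid1.ci.uchicago.edu/jobmanager-pbs /bin/date

If those pass, Swift tests against the existing gram2 provider can follow.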
-Stu

From hategan at mcs.anl.gov Fri Jun 19 10:26:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 10:26:48 -0500
Subject: [Swift-devel] Overview of coaster block-allocation-version issues
In-Reply-To: <4A3BA340.7000004@mcs.anl.gov>
References: <4A3AB8A6.9070706@mcs.anl.gov> <4A3B88F3.7020206@mcs.anl.gov>
	<1245417808.18736.2.camel@localhost> <4A3B9334.80200@mcs.anl.gov>
	<1245418525.19007.1.camel@localhost> <1245419243.19245.1.camel@localhost>
	<4A3BA340.7000004@mcs.anl.gov>
Message-ID: <1245425208.20840.0.camel@localhost>

On Fri, 2009-06-19 at 09:40 -0500, Michael Wilde wrote:
> might in addition be good to start with empty logs, and summarize the
> record type counts in the log. That 5GB log size is a bit of a concern.
> Might be something simple like n^2 debug logging.

:) No. It's verbose because the software is new and at this point it's
better to have more information than less.

>
> On 6/19/09 8:47 AM, Mihael Hategan wrote:
> > On Fri, 2009-06-19 at 08:35 -0500, Mihael Hategan wrote:
> >> On Fri, 2009-06-19 at 08:31 -0500, Michael Wilde wrote:
> >>> On 6/19/09 8:23 AM, Mihael Hategan wrote:
> >>>> On Fri, 2009-06-19 at 07:47 -0500, Michael Wilde wrote:
> >>>>> More thoughts on this:
> >>>>>
> >>>>> (2) is a showstopper on Ranger (and possibly elsewhere) and hence a much
> >>>>> more important issue than (1).
> >>>>>
> >>>>> It seems like this problem merits a 2-pronged attack:
> >>>>>
> >>>>> a) reduce the overhead. Is it logging, or intrinsic to the protocol?
> >>>>> -- is it obvious from the log what's causing the high overhead?
> >>>>> -- is it a situation where the overhead is incurred even when
> >>>>> jobs are not running, just queued?
> >>>> Some profiling needs to be done.
> >>> Zhao or Allan, can you reproduce the problem on TeraPort or UC Teragrid,
> >>> using a simple script and dummy app so that Mihael can readily reproduce?
> >>>
> >>> Mihael, do you want them to run with profiling and post results?
> >> That would be great. Get a hprof dump with cpu tracing enabled. See
> >> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html
> >
> > Bootstrap.java will also need to be modified for the relevant profiling
> > parameters to be passed to the coaster service JVM.
> >
> > addDebuggingOptions() may be the right place to do so.
> >

From wilde at mcs.anl.gov Fri Jun 19 10:54:15 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 10:54:15 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
Message-ID: <4A3BB4A7.2070708@mcs.anl.gov>

We'll find a way to do this, Stu, but it may go a little slower than
desired due to heavy multi-tasking in the group.

So you should push forward to get it testable; that's step zero, I think.

In parallel, we should discuss on the list what, if any, Swift changes are
needed to use it. I don't have my head around the issue at the moment.
Where can we read the specs of how it affects the user?

We have a pretty swamped schedule through July, so I'd expect to slot this
for late Jul / early Aug.

Thanks,

Mike

On 6/19/09 10:21 AM, Stuart Martin wrote:
> Hi Mike,
>
> Ben was planning on testing GRAM5 on teraport for Swift. Now that
> Ben is moving on, I am wondering what the plan is for that. Do you
> still plan to do that? Is there someone else that will do the
> testing?
> > Ti was going to install GRAM5 for Ben to try out, but he has been > delayed dealing with other issues. GRAM5 has not yet been installed on > teraport. I was going to ask him again to install it, but I don't know > who will now drive this testing. > > -Stu From hockyg at uchicago.edu Fri Jun 19 10:56:50 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Fri, 19 Jun 2009 10:56:50 -0500 Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and env to run condor on TeraPort? In-Reply-To: References: <4A3AB523.2060205@mcs.anl.gov> Message-ID: This did update my condor and globus locations, but did not fix the problem. Hopefully Zhao can tell me what to do next [hockyg at tp-grid1 swift]$ which condor_q /soft/condor-7.0.5-r1/bin/condor_q [hockyg at tp-grid1 swift]$ condor_q Neither the environment variable CONDOR_CONFIG, /etc/condor/, nor ~condor/ contain a condor_config source. Either set CONDOR_CONFIG to point to a valid config source, or put a "condor_config" file in /etc/condor or ~condor/ Exiting. On Fri, Jun 19, 2009 at 8:49 AM, Ti Leggett wrote: > There were some misconfigurations in the @globus-4 macro for rhel-5 and > condor > that I've just fixed. Can you set your ~/.soft to look like below and then > run > resoft: > > @globus-4 > > @default > > You should be using /soft/condor-7.0.5-r1 and /soft/globus-4.2.1-r2 after > that. > Let me know if that works for you, or if anything changes. > > On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov wrote: > > Hi, > > > > Swift users need to run the condor-g client in order to send jobs to > > OSG > > sites from a Swift script. > > > > Can you tell us how to set .soft and env so that condor_submit to > > "grid" > > universe works? > > > > We've had all sorts of problems in getting this to work well: > > > > - the version of condor client code on communicado is too new to run > > with Swift. > > > > - On teraport, it seems difficult to get the right settings of .soft > > entries and setup.sh scripts to work corrcetly together > > > > - I still dont know if what worked for Zhao on tp-osg a month ago > > still > > works. It seems not to, and I cant tell if its because of a change in > > .soft or env settings, or some other software issue > > > > - We would like to run from Teraport compute nodes with qsub -I, and > > hope that whatever we determine to be the right settings for login > > nodes > > work on interactive compute nodes as well. > > > > - It would be good *not* to run on tp-osg. > > > > Suchandra, Ti, or Greg, can you help us sort out how to set things > > correctly? > > > > Tanks, > > > > Mike > > > > > > -------- Original Message -------- > > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort > > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT) > > From: Ben Clifford > > To: Michael Wilde > > CC: swift-devel > > References: <4A3A93E2.2080805 at mcs.anl.gov> > > > > > > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather > > than > > use softenv. it doesn't work for me if I use @osg in softenv, with the > > error you report. > > > > On Thu, 18 Jun 2009, Michael Wilde wrote: > > > > > As far as I can tell, the condor client code is broken on TeraPort. > > > > > > Ive tried this on tp-login and tp-osg; I am using +osg-client and > > @osg in my > > > .soft. I source $VDT_LOCATION/setup.sh > > > > > > Zhao, Glen, can you cross-check and see if you are now seeing the > > same thing? 
> > > > > > My suspicion is that the condor client config broke in the last > > month, through > > > OSG changes, CI Support work, etc etc. > > > > > > - Mike > > > > > > > > > I get this from condor_q: > > > > > > tp$ condor_q > > > Error: > > > > > > Extra Info: You probably saw this error because the condor_schedd is > > not > > > running on the machine you are trying to query. If the condor_schedd > > is not > > > running, the Condor system will not be able to find an address and > > port to > > > connect to and satisfy this request. Please make sure the Condor > > daemons are > > > running and try again. > > > > > > Extra Info: If the condor_schedd is running on the machine you are > > trying to > > > query and you still see the error, the most likely cause is that you > > have > > > setup a personal Condor, you have not defined SCHEDD_NAME in your > > > condor_config file, and something is wrong with your > > SCHEDD_ADDRESS_FILE > > > setting. You must define either or both of those settings in your > > config > > > file, or you must use the -name option to condor_q. Please see the > > Condor > > > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE. > > > tp$ > > > > > > and this from swift: > > > > > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml > > cat.swift > > > Swift svn swift-r2890 cog-r2392 > > > > > > RunID: 20090618-1404-mo0thjj4 > > > Progress: > > > Progress: Stage in:1 > > > Progress: Submitted:1 > > > Failed to transfer wrapper log from cat-20090618-1404- > > mo0thjj4/info/h on > > > firefly > > > Progress: Failed:1 > > > Execution failed: > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: firefly > > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj > > > stderr.txt: > > > > > > stdout.txt: > > > > > > ---- > > > > > > Caused by: > > > Cannot submit job: Could not submit job (condor_submit reported an > > > exit code of 1). no error output > > > tp-grid1$ ls > > > > > > -- > > > > > > Using this sites file: > > > > > > > > > > > > > > > > > > grid > > > gt2 > > > ff-grid.unl.edu/jobmanager-pbs > > > > >/panfs/panasas/CMS/data/oops/wilde/swiftwork > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From support at ci.uchicago.edu Fri Jun 19 10:56:58 2009 From: support at ci.uchicago.edu (Glen Hocky) Date: Fri, 19 Jun 2009 10:56:58 -0500 Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and env to run condor on TeraPort? In-Reply-To: References: <4A3AB523.2060205@mcs.anl.gov> Message-ID: This did update my condor and globus locations, but did not fix the problem. Hopefully Zhao can tell me what to do next [hockyg at tp-grid1 swift]$ which condor_q /soft/condor-7.0.5-r1/bin/condor_q [hockyg at tp-grid1 swift]$ condor_q Neither the environment variable CONDOR_CONFIG, /etc/condor/, nor ~condor/ contain a condor_config source. Either set CONDOR_CONFIG to point to a valid config source, or put a "condor_config" file in /etc/condor or ~condor/ Exiting. On Fri, Jun 19, 2009 at 8:49 AM, Ti Leggett wrote: > There were some misconfigurations in the @globus-4 macro for rhel-5 and > condor > that I've just fixed. 
Can you set your ~/.soft to look like below and then > run > resoft: > > @globus-4 > > @default > > You should be using /soft/condor-7.0.5-r1 and /soft/globus-4.2.1-r2 after > that. > Let me know if that works for you, or if anything changes. > > On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov wrote: > > Hi, > > > > Swift users need to run the condor-g client in order to send jobs to > > OSG > > sites from a Swift script. > > > > Can you tell us how to set .soft and env so that condor_submit to > > "grid" > > universe works? > > > > We've had all sorts of problems in getting this to work well: > > > > - the version of condor client code on communicado is too new to run > > with Swift. > > > > - On teraport, it seems difficult to get the right settings of .soft > > entries and setup.sh scripts to work corrcetly together > > > > - I still dont know if what worked for Zhao on tp-osg a month ago > > still > > works. It seems not to, and I cant tell if its because of a change in > > .soft or env settings, or some other software issue > > > > - We would like to run from Teraport compute nodes with qsub -I, and > > hope that whatever we determine to be the right settings for login > > nodes > > work on interactive compute nodes as well. > > > > - It would be good *not* to run on tp-osg. > > > > Suchandra, Ti, or Greg, can you help us sort out how to set things > > correctly? > > > > Tanks, > > > > Mike > > > > > > -------- Original Message -------- > > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort > > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT) > > From: Ben Clifford > > To: Michael Wilde > > CC: swift-devel > > References: <4A3A93E2.2080805 at mcs.anl.gov> > > > > > > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather > > than > > use softenv. it doesn't work for me if I use @osg in softenv, with the > > error you report. > > > > On Thu, 18 Jun 2009, Michael Wilde wrote: > > > > > As far as I can tell, the condor client code is broken on TeraPort. > > > > > > Ive tried this on tp-login and tp-osg; I am using +osg-client and > > @osg in my > > > .soft. I source $VDT_LOCATION/setup.sh > > > > > > Zhao, Glen, can you cross-check and see if you are now seeing the > > same thing? > > > > > > My suspicion is that the condor client config broke in the last > > month, through > > > OSG changes, CI Support work, etc etc. > > > > > > - Mike > > > > > > > > > I get this from condor_q: > > > > > > tp$ condor_q > > > Error: > > > > > > Extra Info: You probably saw this error because the condor_schedd is > > not > > > running on the machine you are trying to query. If the condor_schedd > > is not > > > running, the Condor system will not be able to find an address and > > port to > > > connect to and satisfy this request. Please make sure the Condor > > daemons are > > > running and try again. > > > > > > Extra Info: If the condor_schedd is running on the machine you are > > trying to > > > query and you still see the error, the most likely cause is that you > > have > > > setup a personal Condor, you have not defined SCHEDD_NAME in your > > > condor_config file, and something is wrong with your > > SCHEDD_ADDRESS_FILE > > > setting. You must define either or both of those settings in your > > config > > > file, or you must use the -name option to condor_q. Please see the > > Condor > > > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE. 
> > > tp$ > > > > > > and this from swift: > > > > > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml > > cat.swift > > > Swift svn swift-r2890 cog-r2392 > > > > > > RunID: 20090618-1404-mo0thjj4 > > > Progress: > > > Progress: Stage in:1 > > > Progress: Submitted:1 > > > Failed to transfer wrapper log from cat-20090618-1404- > > mo0thjj4/info/h on > > > firefly > > > Progress: Failed:1 > > > Execution failed: > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: firefly > > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj > > > stderr.txt: > > > > > > stdout.txt: > > > > > > ---- > > > > > > Caused by: > > > Cannot submit job: Could not submit job (condor_submit reported an > > > exit code of 1). no error output > > > tp-grid1$ ls > > > > > > -- > > > > > > Using this sites file: > > > > > > > > > > > > > > > > > > grid > > > gt2 > > > ff-grid.unl.edu/jobmanager-pbs > > > > >/panfs/panasas/CMS/data/oops/wilde/swiftwork > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > From benc at hawaga.org.uk Fri Jun 19 10:58:17 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 19 Jun 2009 15:58:17 +0000 (GMT) Subject: [Swift-devel] Re: swift testing of gram5 on teraport In-Reply-To: <4A3BB4A7.2070708@mcs.anl.gov> References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov> <4A3BB4A7.2070708@mcs.anl.gov> Message-ID: On Fri, 19 Jun 2009, Michael Wilde wrote: > In parallel, we should discuss on the list what ifany Swift changes are needed > to use it. It dont have my head around the issue at the moment. Where can we > read the specs of how it affects the user? Theoretically it will Just Work with the GRAM2 provider. Evidence thus far suggests this might be true (for example, apparently the gram2 cog stuff can submit to gram5 ok) but there hasn't been any swift-level testing to see how it all fits together. -- From zhaozhang at uchicago.edu Fri Jun 19 11:00:11 2009 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Fri, 19 Jun 2009 11:00:11 -0500 Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and env to run condor on TeraPort? In-Reply-To: References: <4A3AB523.2060205@mcs.anl.gov> Message-ID: <4A3BB60B.3040008@uchicago.edu> Here is my .soft [zzhang at tp-grid1 ~]$ cat .soft # # This is your SoftEnv configuration run control file. # # It is used to tell SoftEnv how to customize your environment by # setting up variables such as PATH and MANPATH. To learn more # about this file, do a "man softenv". # +java-sun +osg-client +maui +torque @python-2.5 @osg @default @globus-4 And the source file is source /opt/osg/setup.sh zhao Glen Hocky wrote: > This did update my condor and globus locations, but did not fix the > problem. Hopefully Zhao can tell me what to do next > > [hockyg at tp-grid1 swift]$ which condor_q > /soft/condor-7.0.5-r1/bin/condor_q > [hockyg at tp-grid1 swift]$ condor_q > > Neither the environment variable CONDOR_CONFIG, > /etc/condor/, nor ~condor/ contain a condor_config source. > Either set CONDOR_CONFIG to point to a valid config source, > or put a "condor_config" file in /etc/condor or ~condor/ > Exiting. > > > On Fri, Jun 19, 2009 at 8:49 AM, Ti Leggett > wrote: > > There were some misconfigurations in the @globus-4 macro for > rhel-5 and condor > that I've just fixed. 
Can you set your ~/.soft to look like below > and then run > resoft: > > @globus-4 > > @default > > You should be using /soft/condor-7.0.5-r1 and > /soft/globus-4.2.1-r2 after that. > Let me know if that works for you, or if anything changes. > > On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov > wrote: > > Hi, > > > > Swift users need to run the condor-g client in order to send jobs to > > OSG > > sites from a Swift script. > > > > Can you tell us how to set .soft and env so that condor_submit to > > "grid" > > universe works? > > > > We've had all sorts of problems in getting this to work well: > > > > - the version of condor client code on communicado is too new to run > > with Swift. > > > > - On teraport, it seems difficult to get the right settings of .soft > > entries and setup.sh scripts to work corrcetly together > > > > - I still dont know if what worked for Zhao on tp-osg a month ago > > still > > works. It seems not to, and I cant tell if its because of a > change in > > .soft or env settings, or some other software issue > > > > - We would like to run from Teraport compute nodes with qsub -I, and > > hope that whatever we determine to be the right settings for login > > nodes > > work on interactive compute nodes as well. > > > > - It would be good *not* to run on tp-osg. > > > > Suchandra, Ti, or Greg, can you help us sort out how to set things > > correctly? > > > > Tanks, > > > > Mike > > > > > > -------- Original Message -------- > > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort > > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT) > > From: Ben Clifford > > > To: Michael Wilde > > > CC: swift-devel > > > References: <4A3A93E2.2080805 at mcs.anl.gov > > > > > > > > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh > rather > > than > > use softenv. it doesn't work for me if I use @osg in softenv, > with the > > error you report. > > > > On Thu, 18 Jun 2009, Michael Wilde wrote: > > > > > As far as I can tell, the condor client code is broken on > TeraPort. > > > > > > Ive tried this on tp-login and tp-osg; I am using +osg-client and > > @osg in my > > > .soft. I source $VDT_LOCATION/setup.sh > > > > > > Zhao, Glen, can you cross-check and see if you are now seeing the > > same thing? > > > > > > My suspicion is that the condor client config broke in the last > > month, through > > > OSG changes, CI Support work, etc etc. > > > > > > - Mike > > > > > > > > > I get this from condor_q: > > > > > > tp$ condor_q > > > Error: > > > > > > Extra Info: You probably saw this error because the > condor_schedd is > > not > > > running on the machine you are trying to query. If the > condor_schedd > > is not > > > running, the Condor system will not be able to find an address and > > port to > > > connect to and satisfy this request. Please make sure the Condor > > daemons are > > > running and try again. > > > > > > Extra Info: If the condor_schedd is running on the machine you are > > trying to > > > query and you still see the error, the most likely cause is > that you > > have > > > setup a personal Condor, you have not defined SCHEDD_NAME in your > > > condor_config file, and something is wrong with your > > SCHEDD_ADDRESS_FILE > > > setting. You must define either or both of those settings in your > > config > > > file, or you must use the -name option to condor_q. Please see the > > Condor > > > manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE. 
> > > tp$ > > > > > > and this from swift: > > > > > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml > > cat.swift > > > Swift svn swift-r2890 cog-r2392 > > > > > > RunID: 20090618-1404-mo0thjj4 > > > Progress: > > > Progress: Stage in:1 > > > Progress: Submitted:1 > > > Failed to transfer wrapper log from cat-20090618-1404- > > mo0thjj4/info/h on > > > firefly > > > Progress: Failed:1 > > > Execution failed: > > > Exception in cat: > > > Arguments: [data.txt] > > > Host: firefly > > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj > > > stderr.txt: > > > > > > stdout.txt: > > > > > > ---- > > > > > > Caused by: > > > Cannot submit job: Could not submit job (condor_submit reported an > > > exit code of 1). no error output > > > tp-grid1$ ls > > > > > > -- > > > > > > Using this sites file: > > > > > > > > > > > > > > > > > > grid > > > gt2 > > > ff-grid.unl.edu/jobmanager-pbs > > > > > >/panfs/panasas/CMS/data/oops/wilde/swiftwork > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > > > From benc at hawaga.org.uk Fri Jun 19 11:00:13 2009 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 19 Jun 2009 16:00:13 +0000 (GMT) Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and env to run condor on TeraPort? In-Reply-To: References: <4A3AB523.2060205@mcs.anl.gov> Message-ID: my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor work. this suggests perhaps that in a working environment, condor should be coming from that OSG stack and not from a specific condor softenv key. -- From support at ci.uchicago.edu Fri Jun 19 11:00:27 2009 From: support at ci.uchicago.edu (Ben Clifford) Date: Fri, 19 Jun 2009 11:00:27 -0500 Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and env to run condor on TeraPort? In-Reply-To: References: <4A3AB523.2060205@mcs.anl.gov> Message-ID: my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor work. this suggests perhaps that in a working environment, condor should be coming from that OSG stack and not from a specific condor softenv key. -- From smartin at mcs.anl.gov Fri Jun 19 11:02:33 2009 From: smartin at mcs.anl.gov (Stuart Martin) Date: Fri, 19 Jun 2009 11:02:33 -0500 Subject: [Swift-devel] Re: swift testing of gram5 on teraport In-Reply-To: <4A3BB4A7.2070708@mcs.anl.gov> References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov> <4A3BB4A7.2070708@mcs.anl.gov> Message-ID: On Jun 19, 2009, at Jun 19, 10:54 AM, Michael Wilde wrote: > We'll find a way to do this, STu, but it may go a little slower than > desired due to heavy multi-tasking in the group. > > So you should push forward to get it testable, thats step zero I > think. I am pushing forward with groups where there is someone to drive the testing. For example, Jaime Frey is testing gram5 with condor-g. CMS will be doing some testing in early July. Then there is the swift testing... > > > In parallel, we should discuss on the list what ifany Swift changes > are needed to use it. It dont have my head around the issue at the > moment. Where can we read the specs of how it affects the user? > > We have a pretty swamped schedule through July, so I'd expect to > slot this for late Jul early Aug. > > Thanks, > > Mike > > > On 6/19/09 10:21 AM, Stuart Martin wrote: >> Hi Mike, >> Ben was planning on testing GRAM5 on teraport for Swift. 
Now that >> Ben is moving on, I am wondering what the plan is for that. Do you >> still plan to do that? Is there someone else that will do the >> testing? >> Ti was going to install GRAM5 for Ben to try out, but he has been >> delayed dealing with other issues. GRAM5 has not yet been >> installed on teraport. I was going to ask him again to install it, >> but I don't know who will now drive this testing. >> -Stu From hockyg at uchicago.edu Fri Jun 19 11:05:52 2009 From: hockyg at uchicago.edu (Glen Hocky) Date: Fri, 19 Jun 2009 11:05:52 -0500 Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft and env to run condor on TeraPort? In-Reply-To: <4A3BB60B.3040008@uchicago.edu> References: <4A3AB523.2060205@mcs.anl.gov> <4A3BB60B.3040008@uchicago.edu> Message-ID: That did it for me! Thanks Zhao On Fri, Jun 19, 2009 at 11:00 AM, Zhao Zhang wrote: > Here is my .soft > > [zzhang at tp-grid1 ~]$ cat .soft > # > # This is your SoftEnv configuration run control file. > # > # It is used to tell SoftEnv how to customize your environment by > # setting up variables such as PATH and MANPATH. To learn more > # about this file, do a "man softenv". > # > +java-sun > +osg-client > +maui > +torque > @python-2.5 > @osg > @default > @globus-4 > > And the source file is > source /opt/osg/setup.sh > > zhao > > Glen Hocky wrote: > >> This did update my condor and globus locations, but did not fix the >> problem. Hopefully Zhao can tell me what to do next >> >> [hockyg at tp-grid1 swift]$ which condor_q >> /soft/condor-7.0.5-r1/bin/condor_q >> [hockyg at tp-grid1 swift]$ condor_q >> >> Neither the environment variable CONDOR_CONFIG, >> /etc/condor/, nor ~condor/ contain a condor_config source. >> Either set CONDOR_CONFIG to point to a valid config source, >> or put a "condor_config" file in /etc/condor or ~condor/ >> Exiting. >> >> >> On Fri, Jun 19, 2009 at 8:49 AM, Ti Leggett > support at ci.uchicago.edu>> wrote: >> >> There were some misconfigurations in the @globus-4 macro for >> rhel-5 and condor >> that I've just fixed. Can you set your ~/.soft to look like below >> and then run >> resoft: >> >> @globus-4 >> >> @default >> >> You should be using /soft/condor-7.0.5-r1 and >> /soft/globus-4.2.1-r2 after that. >> Let me know if that works for you, or if anything changes. >> >> On Thu Jun 18 16:44:15 2009, wilde at mcs.anl.gov >> wrote: >> > Hi, >> > >> > Swift users need to run the condor-g client in order to send jobs to >> > OSG >> > sites from a Swift script. >> > >> > Can you tell us how to set .soft and env so that condor_submit to >> > "grid" >> > universe works? >> > >> > We've had all sorts of problems in getting this to work well: >> > >> > - the version of condor client code on communicado is too new to run >> > with Swift. >> > >> > - On teraport, it seems difficult to get the right settings of .soft >> > entries and setup.sh scripts to work corrcetly together >> > >> > - I still dont know if what worked for Zhao on tp-osg a month ago >> > still >> > works. It seems not to, and I cant tell if its because of a >> change in >> > .soft or env settings, or some other software issue >> > >> > - We would like to run from Teraport compute nodes with qsub -I, and >> > hope that whatever we determine to be the right settings for login >> > nodes >> > work on interactive compute nodes as well. >> > >> > - It would be good *not* to run on tp-osg. >> > >> > Suchandra, Ti, or Greg, can you help us sort out how to set things >> > correctly? 
>> >
>> > Thanks,
>> >
>> > Mike
>> >
>> > -------- Original Message --------
>> > Subject: Re: [Swift-devel] Cant run condor-g on TeraPort
>> > Date: Thu, 18 Jun 2009 19:31:26 +0000 (GMT)
>> > From: Ben Clifford
>> > To: Michael Wilde
>> > CC: swift-devel
>> > References: <4A3A93E2.2080805 at mcs.anl.gov>
>> >
>> > condor_q works for me on tp-osg if I source /opt/osg/setenv.sh rather
>> > than use softenv. it doesn't work for me if I use @osg in softenv,
>> > with the error you report.
>> >
>> > On Thu, 18 Jun 2009, Michael Wilde wrote:
>> >
>> > > As far as I can tell, the condor client code is broken on TeraPort.
>> > >
>> > > I've tried this on tp-login and tp-osg; I am using +osg-client and
>> > > @osg in my .soft. I source $VDT_LOCATION/setup.sh
>> > >
>> > > Zhao, Glen, can you cross-check and see if you are now seeing the
>> > > same thing?
>> > >
>> > > My suspicion is that the condor client config broke in the last
>> > > month, through OSG changes, CI Support work, etc etc.
>> > >
>> > > - Mike
>> > >
>> > > I get this from condor_q:
>> > >
>> > > tp$ condor_q
>> > > Error:
>> > >
>> > > Extra Info: You probably saw this error because the condor_schedd is
>> > > not running on the machine you are trying to query. If the
>> > > condor_schedd is not running, the Condor system will not be able to
>> > > find an address and port to connect to and satisfy this request.
>> > > Please make sure the Condor daemons are running and try again.
>> > >
>> > > Extra Info: If the condor_schedd is running on the machine you are
>> > > trying to query and you still see the error, the most likely cause
>> > > is that you have setup a personal Condor, you have not defined
>> > > SCHEDD_NAME in your condor_config file, and something is wrong with
>> > > your SCHEDD_ADDRESS_FILE setting. You must define either or both of
>> > > those settings in your config file, or you must use the -name option
>> > > to condor_q. Please see the Condor manual for details on SCHEDD_NAME
>> > > and SCHEDD_ADDRESS_FILE.
>> > > tp$
>> > >
>> > > and this from swift:
>> > >
>> > > tp-grid1$ swift -tc.file tc.data -sites.file sites.condorg.xml cat.swift
>> > > Swift svn swift-r2890 cog-r2392
>> > >
>> > > RunID: 20090618-1404-mo0thjj4
>> > > Progress:
>> > > Progress: Stage in:1
>> > > Progress: Submitted:1
>> > > Failed to transfer wrapper log from cat-20090618-1404-mo0thjj4/info/h
>> > > on firefly
>> > > Progress: Failed:1
>> > > Execution failed:
>> > > Exception in cat:
>> > > Arguments: [data.txt]
>> > > Host: firefly
>> > > Directory: cat-20090618-1404-mo0thjj4/jobs/h/cat-hv5s3gcj
>> > > stderr.txt:
>> > >
>> > > stdout.txt:
>> > >
>> > > ----
>> > >
>> > > Caused by:
>> > > Cannot submit job: Could not submit job (condor_submit reported an
>> > > exit code of 1). no error output
>> > > tp-grid1$ ls
>> > >
>> > > --
>> > >
>> > > Using this sites file:
>> > >
>> > > grid
>> > > gt2
>> > > ff-grid.unl.edu/jobmanager-pbs
>> > > /panfs/panasas/CMS/data/oops/wilde/swiftwork
>> > >
>> > > _______________________________________________
>> > > Swift-devel mailing list
>> > > Swift-devel at ci.uchicago.edu
>> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
From hockyg at uchicago.edu  Fri Jun 19 11:06:24 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 19 Jun 2009 11:06:24 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft
	and env to run condor on TeraPort?
In-Reply-To: 
References: <4A3AB523.2060205@mcs.anl.gov>
	<4A3BB60B.3040008@uchicago.edu>
Message-ID: 

(and ben)

On Fri, Jun 19, 2009 at 11:05 AM, Glen Hocky wrote:

> That did it for me! Thanks Zhao

From support at ci.uchicago.edu  Fri Jun 19 11:07:02 2009
From: support at ci.uchicago.edu (Ben Clifford)
Date: Fri, 19 Jun 2009 11:07:02 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft
	and env to run condor on TeraPort? (fwd)
In-Reply-To: 
References: 
Message-ID: 

ci support got removed from this thread but I believe this is relevant.
Zhao also reports the same way of getting it working, in another
non-ci-support message.

---------- Forwarded message ----------
Date: Fri, 19 Jun 2009 16:00:13 +0000 (GMT)
From: Ben Clifford
To: Glen Hocky
Cc: wilde at mcs.anl.gov, sthapa at ci.uchicago.edu, swift-devel at ci.uchicago.edu,
    zhaozhang at uchicago.edu, papka at ci.uchicago.edu
Subject: Re: [CI Ticketing System #1074] How to set .soft and env to run condor
    on TeraPort?

my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor
work. this suggests perhaps that in a working environment, condor should
be coming from that OSG stack and not from a specific condor softenv key.

--

From wilde at mcs.anl.gov  Fri Jun 19 11:08:02 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 11:08:02 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To: 
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
	<4A3BB4A7.2070708@mcs.anl.gov>
Message-ID: <4A3BB7E2.3020503@mcs.anl.gov>

Is what we're looking to see here the ability to run Swift with a full
or wide throttle to Gram5, directly, without Condor-G, and the ability
to have (a) lots of jobs in the queue and (b) many more jobs running at
once, while watching the gatekeeper host for CPU stress and memory
pressure? Where say (a) is a few thousand jobs and (b) is the full
cluster busy?

I wonder if we can get a full-system reservation on TeraPort to test
this?

We're also testing Swift via Condor-G at the moment on UNL's new
cluster "Firefly" which has 6000 cores of which 3000 are accessible to
OSG. As it's a new and lightly loaded cluster, perhaps Brian Bockelman
would be willing to test GRAM5 on it? (it's a PBS cluster)

So, now that I think about it, as long as there's a GRAM5 gatekeeper we
can use, since it should Just Work, I'm sure we can give it some
informal usage as soon as it's available.

Stu, do you have plans for testing beyond Teraport on larger clusters?

I wonder, maybe we could test it in AWS at large scales too, on a
Nimbus workspace?

- Mike

On 6/19/09 10:58 AM, Ben Clifford wrote:
> On Fri, 19 Jun 2009, Michael Wilde wrote:
>
>> In parallel, we should discuss on the list what if any Swift changes
>> are needed to use it. I don't have my head around the issue at the
>> moment. Where can we read the specs of how it affects the user?
>
> Theoretically it will Just Work with the GRAM2 provider. Evidence thus far
> suggests this might be true (for example, apparently the gram2 cog stuff
> can submit to gram5 ok) but there hasn't been any swift-level testing to
> see how it all fits together.
From wilde at mcs.anl.gov  Fri Jun 19 11:13:16 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 11:13:16 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft
	and env to run condor on TeraPort? (fwd)
In-Reply-To: 
References: 
Message-ID: <4A3BB91C.9060906@mcs.anl.gov>

Zhao, when the best way for regular users to do this is determined
(sounds like it's close) please put instructions for how to do it on:

http://www.ci.uchicago.edu/wiki/bin/view/SWFT
as page SwiftQuickStartForCondorG

(including all the Swift config issues, e.g. sites.xml, etc)

Thanks,

Mike

--

ps. I think you have a few other pages that should go there, e.g., BGP,
Ranger/Coasters, etc.

How to run on your own local PBS cluster

On 6/19/09 11:07 AM, Ben Clifford wrote:
> ci support got removed from this thread but I believe this is relevant.
> Zhao also reports the same way of getting it working, in another
> non-ci-support message.
>
> ---------- Forwarded message ----------
> Date: Fri, 19 Jun 2009 16:00:13 +0000 (GMT)
> From: Ben Clifford
> To: Glen Hocky
> Cc: wilde at mcs.anl.gov, sthapa at ci.uchicago.edu, swift-devel at ci.uchicago.edu,
>     zhaozhang at uchicago.edu, papka at ci.uchicago.edu
> Subject: Re: [CI Ticketing System #1074] How to set .soft and env to run condor
>     on TeraPort?
>
> my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor
> work. this suggests perhaps that in a working environment, condor should
> be coming from that OSG stack and not from a specific condor softenv key.
From wilde at mcs.anl.gov  Fri Jun 19 11:17:04 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 11:17:04 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To: 
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
	<4A3BB4A7.2070708@mcs.anl.gov>
Message-ID: <4A3BBA00.2070805@mcs.anl.gov>

On 6/19/09 11:02 AM, Stuart Martin wrote:
> On Jun 19, 2009, at 10:54 AM, Michael Wilde wrote:
>> We'll find a way to do this, Stu, but it may go a little slower
>> than desired due to heavy multi-tasking in the group.
>>
>> So you should push forward to get it testable, that's step zero I
>> think.
> I am pushing forward with groups where there is someone to drive
> the testing. For example, Jaime Frey is testing gram5 with condor-g.
> CMS will be doing some testing in early July. Then there is
> the swift testing...

Stu,

I would suggest not to delay on getting it installed where we can
test with swift. My prior comment was based on a poor initial guess
of what's involved. But that's your call; when it's installed where we
can run Swift jobs, we'll test it.

E.g.: we are running 2 apps on Firefly at the moment. If you can get
it installed there, we can test on it even more simply than we are
testing over Condor-G.

Do you have a way to capture gatekeeper stress during such tests?

- Mike

>>
>> In parallel, we should discuss on the list what if any Swift changes
>> are needed to use it. I don't have my head around the issue at the
>> moment. Where can we read the specs of how it affects the user?
>>
>> We have a pretty swamped schedule through July, so I'd expect to slot
>> this for late July / early August.
>>
>> Thanks,
>>
>> Mike
>>
>> On 6/19/09 10:21 AM, Stuart Martin wrote:
>>> Hi Mike,
>>> Ben was planning on testing GRAM5 on teraport for Swift. Now that
>>> Ben is moving on, I am wondering what the plan is for that. Do you
>>> still plan to do that? Is there someone else that will do the
>>> testing?
>>> Ti was going to install GRAM5 for Ben to try out, but he has been
>>> delayed dealing with other issues. GRAM5 has not yet been installed
>>> on teraport. I was going to ask him again to install it, but I don't
>>> know who will now drive this testing.
>>> -Stu

From benc at hawaga.org.uk  Fri Jun 19 11:17:47 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 19 Jun 2009 16:17:47 +0000 (GMT)
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft
	and env to run condor on TeraPort? (fwd)
In-Reply-To: 
References: <4A3BB91C.9060906@mcs.anl.gov>
Message-ID: 

If it's being exposed to users, this should be fixed in softenv, not worked
around with sourcing setup shell scripts.

On Fri, 19 Jun 2009, Mike Wilde wrote:

> Zhao, when the best way for regular users to do this is determined
> (sounds like it's close) please put instructions for how to do it on:
>
> http://www.ci.uchicago.edu/wiki/bin/view/SWFT
> as page SwiftQuickStartForCondorG
>
> (including all the Swift config issues, e.g. sites.xml, etc)
>
> Thanks,
>
> Mike
>
> --
>
> ps. I think you have a few other pages that should go there, e.g., BGP,
> Ranger/Coasters, etc.
>
> How to run on your own local PBS cluster
>
> On 6/19/09 11:07 AM, Ben Clifford wrote:
> > ci support got removed from this thread but I believe this is relevant.
> > Zhao also reports the same way of getting it working, in another
> > non-ci-support message.
> >
> > ---------- Forwarded message ----------
> > Date: Fri, 19 Jun 2009 16:00:13 +0000 (GMT)
> > From: Ben Clifford
> > To: Glen Hocky
> > Cc: wilde at mcs.anl.gov, sthapa at ci.uchicago.edu, swift-devel at ci.uchicago.edu,
> >     zhaozhang at uchicago.edu, papka at ci.uchicago.edu
> > Subject: Re: [CI Ticketing System #1074] How to set .soft and env to run condor
> >     on TeraPort?
> >
> > my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor
> > work. this suggests perhaps that in a working environment, condor should
> > be coming from that OSG stack and not from a specific condor softenv key.

From support at ci.uchicago.edu  Fri Jun 19 11:21:21 2009
From: support at ci.uchicago.edu (Ti Leggett)
Date: Fri, 19 Jun 2009 11:21:21 -0500
Subject: [Swift-devel] [CI Ticketing System #1074] How to set .soft and
	env to run condor on TeraPort?
In-Reply-To: 
References: <4A3AB523.2060205@mcs.anl.gov>
Message-ID: 

If this is the case, then it sounds like @osg needs to be used on the
Teraport instead of @globus-4. Can you try that and see if that helps?

On Fri Jun 19 11:00:27 2009, benc at hawaga.org.uk wrote:
> my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor
> work. this suggests perhaps that in a working environment, condor should
> be coming from that OSG stack and not from a specific condor softenv key.

From wilde at mcs.anl.gov  Fri Jun 19 11:24:39 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 19 Jun 2009 11:24:39 -0500
Subject: [Swift-devel] Re: [CI Ticketing System #1074] How to set .soft
	and env to run condor on TeraPort?
	(fwd)
In-Reply-To: 
References: <4A3BB91C.9060906@mcs.anl.gov>
Message-ID: <4A3BBBC7.4020907@mcs.anl.gov>

On 6/19/09 11:18 AM, Ben Clifford wrote:
> If it's being exposed to users, this should be fixed in softenv, not worked
> around with sourcing setup shell scripts.

Indeed, yes. Maybe the prelim info that works can be posted till then,
if this takes much longer.

> On Fri, 19 Jun 2009, Mike Wilde wrote:
>
>> Zhao, when the best way for regular users to do this is determined
>> (sounds like it's close) please put instructions for how to do it on:
>>
>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT
>> as page SwiftQuickStartForCondorG
>>
>> (including all the Swift config issues, e.g. sites.xml, etc)
>>
>> Thanks,
>>
>> Mike
>>
>> --
>>
>> ps. I think you have a few other pages that should go there, e.g., BGP,
>> Ranger/Coasters, etc.
>>
>> How to run on your own local PBS cluster
>>
>> On 6/19/09 11:07 AM, Ben Clifford wrote:
>>> ci support got removed from this thread but I believe this is relevant.
>>> Zhao also reports the same way of getting it working, in another
>>> non-ci-support message.
>>>
>>> ---------- Forwarded message ----------
>>> Date: Fri, 19 Jun 2009 16:00:13 +0000 (GMT)
>>> From: Ben Clifford
>>> To: Glen Hocky
>>> Cc: wilde at mcs.anl.gov, sthapa at ci.uchicago.edu, swift-devel at ci.uchicago.edu,
>>>     zhaozhang at uchicago.edu, papka at ci.uchicago.edu
>>> Subject: Re: [CI Ticketing System #1074] How to set .soft and env to run condor
>>>     on TeraPort?
>>>
>>> my experience on tp-osg is that sourcing /opt/osg/setup.sh makes condor
>>> work. this suggests perhaps that in a working environment, condor should
>>> be coming from that OSG stack and not from a specific condor softenv key.

From smartin at mcs.anl.gov  Fri Jun 19 13:23:22 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Fri, 19 Jun 2009 13:23:22 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To: <4A3BB7E2.3020503@mcs.anl.gov>
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
	<4A3BB4A7.2070708@mcs.anl.gov> <4A3BB7E2.3020503@mcs.anl.gov>
Message-ID: 

On Jun 19, 2009, at 11:08 AM, Michael Wilde wrote:

> Is what we're looking to see here the ability to run Swift with a
> full or wide throttle to Gram5, directly, without Condor-G, and the
> ability to have (a) lots of jobs in the queue and (b) many more jobs
> running at once, while watching the gatekeeper host for CPU stress
> and memory pressure?

Yes - exactly.

> Where say (a) is a few thousand jobs and (b) is the full cluster busy?

Yes and Yes.

> I wonder if we can get a full-system reservation on TeraPort to test
> this?

I don't know.

> We're also testing Swift via Condor-G at the moment on UNL's new
> cluster "Firefly" which has 6000 cores of which 3000 are accessible
> to OSG. As it's a new and lightly loaded cluster, perhaps Brian
> Bockelman would be willing to test GRAM5 on it? (it's a PBS cluster)

Ok - I'll check with Brian.

> So, now that I think about it, as long as there's a GRAM5 gatekeeper
> we can use, since it should Just Work, I'm sure we can give it some
> informal usage as soon as it's available.

Cool.

> Stu, do you have plans for testing beyond Teraport on larger clusters?

Yes. CMS will be doing the initial test for OSG. If that goes well,
then it could be used throughout OSG. So, CMS using GRAM5 will be a
good test.

> I wonder, maybe we could test it in AWS at large scales too, on a
> Nimbus workspace?

I suppose. What would that entail? You'd want an image with a gram5
service running that interfaces with some LRM (PBS, SGE, Condor, ...)
system. Then that LRM managing a set of worker VMs? What have you done
with AWS/Nimbus so far? Anything like this?

> - Mike
>
> On 6/19/09 10:58 AM, Ben Clifford wrote:
>> On Fri, 19 Jun 2009, Michael Wilde wrote:
>>> In parallel, we should discuss on the list what if any Swift
>>> changes are needed to use it. I don't have my head around the
>>> issue at the moment. Where can we read the specs of how it affects
>>> the user?
>> Theoretically it will Just Work with the GRAM2 provider. Evidence
>> thus far suggests this might be true (for example, apparently the
>> gram2 cog stuff can submit to gram5 ok) but there hasn't been any
>> swift-level testing to see how it all fits together.
From aespinosa at cs.uchicago.edu  Fri Jun 19 13:27:42 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 19 Jun 2009 13:27:42 -0500
Subject: [Swift-devel] hprof profiling of coaster services
Message-ID: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com>

The profile trace is in run02/java.hprof.txt

summary:
CPU SAMPLES BEGIN (total = 19493) Fri Jun 19 13:22:27 2009
rank   self  accum   count trace method
   1 48.99% 48.99%    9550 300225 java.net.PlainSocketImpl.socketAccept
   2 25.08% 74.07%    4888 300411 java.lang.UNIXProcess.waitForProcessExit
   3 25.07% 99.14%    4887 300487 org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run
   4  0.18% 99.32%      36 300465 java.net.SocketInputStream.socketRead0
   5  0.09% 99.42%      18 300472 java.net.SocketInputStream.socketRead0
   6  0.05% 99.46%       9 300051 java.lang.ClassLoader.defineClass1
   7  0.02% 99.48%       4 300498 java.lang.Shutdown.halt0
   8  0.02% 99.50%       3 300101 java.lang.ClassLoader.findBootstrapClass
   9  0.02% 99.51%       3 300123 java.util.zip.ZipFile.getEntry
  10  0.02% 99.53%       3 300492 java.io.FileInputStream.available
  11  0.01% 99.54%       2 300433 COM.claymoresystems.cert.CertContext.<init>
  12  0.01% 99.55%       2 300435 java.io.FileInputStream.open
  13  0.01% 99.56%       2 300447 java.lang.Throwable.fillInStackTrace
  14  0.01% 99.57%       2 300460 java.util.zip.Inflater.inflateBytes
  15  0.01% 99.58%       2 300491 java.lang.Throwable.fillInStackTrace
CPU SAMPLES END

-Allan

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

-------------- next part --------------
A non-text attachment was scrubbed...
Name: run02.tar.gz
Type: application/x-gzip
Size: 17948 bytes
Desc: not available
URL: 

From smartin at mcs.anl.gov  Fri Jun 19 13:27:59 2009
From: smartin at mcs.anl.gov (Stuart Martin)
Date: Fri, 19 Jun 2009 13:27:59 -0500
Subject: [Swift-devel] Re: swift testing of gram5 on teraport
In-Reply-To: <4A3BBA00.2070805@mcs.anl.gov>
References: <3AED3D0B-C847-47BA-AE00-092CC3239754@mcs.anl.gov>
	<4A3BB4A7.2070708@mcs.anl.gov> <4A3BBA00.2070805@mcs.anl.gov>
Message-ID: <2CDC0912-D7FF-4421-8090-F44F3FEF0F7F@mcs.anl.gov>

On Jun 19, 2009, at 11:17 AM, Michael Wilde wrote:

> On 6/19/09 11:02 AM, Stuart Martin wrote:
>> On Jun 19, 2009, at 10:54 AM, Michael Wilde wrote:
>>> We'll find a way to do this, Stu, but it may go a little slower
>>> than desired due to heavy multi-tasking in the group.
>>>
>>> So you should push forward to get it testable, that's step zero I
>>> think.
>> I am pushing forward with groups where there is someone to drive
>> the testing. For example, Jaime Frey is testing gram5 with condor-g.
>> CMS will be doing some testing in early July. Then there is
>> the swift testing...
>
> Stu,
>
> I would suggest not to delay on getting it installed where we can
> test with swift. My prior comment was based on a poor initial guess
> of what's involved. But that's your call; when it's installed where
> we can run Swift jobs, we'll test it.

Ok - I'll ask Ti again to install it on teraport.

> E.g.: we are running 2 apps on Firefly at the moment. If you can get
> it installed there, we can test on it even more simply than we are
> testing over Condor-G.

Where is Firefly? Who do I talk to about that? Or maybe the request is
best to come from the user (you). E.g. "please install gram5 on
firefly, that has the potential to improve scalability for my swift
jobs"...

> Do you have a way to capture gatekeeper stress during such tests?
Joe has used ganglia for these test results:
http://dev.globus.org/wiki/GRAM5_Scalability_Results#Test_Results

> - Mike
>
>>> In parallel, we should discuss on the list what if any Swift
>>> changes are needed to use it. I don't have my head around the
>>> issue at the moment. Where can we read the specs of how it affects
>>> the user?
>>>
>>> We have a pretty swamped schedule through July, so I'd expect to slot
>>> this for late July / early August.
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> On 6/19/09 10:21 AM, Stuart Martin wrote:
>>>> Hi Mike,
>>>> Ben was planning on testing GRAM5 on teraport for Swift. Now
>>>> that Ben is moving on, I am wondering what the plan is for that.
>>>> Do you still plan to do that? Is there someone else that will do
>>>> the testing?
>>>> Ti was going to install GRAM5 for Ben to try out, but he has been
>>>> delayed dealing with other issues. GRAM5 has not yet been
>>>> installed on teraport. I was going to ask him again to install
>>>> it, but I don't know who will now drive this testing.
>>>> -Stu

From hategan at mcs.anl.gov  Fri Jun 19 13:42:31 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 13:42:31 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com>
Message-ID: <1245436951.28833.4.camel@localhost>

I'm looking for a complete trace (make sure you use cpu=times) in binary
format (format=b) that I can load and analyze in some profiler.

The thing below says that waiting takes time, which is not interesting
because at the same time waiting doesn't eat CPU.

On Fri, 2009-06-19 at 13:27 -0500, Allan Espinosa wrote:
> The profile trace is in run02/java.hprof.txt
>
> summary:
> CPU SAMPLES BEGIN (total = 19493) Fri Jun 19 13:22:27 2009
> rank   self  accum   count trace method
>    1 48.99% 48.99%    9550 300225 java.net.PlainSocketImpl.socketAccept
>    2 25.08% 74.07%    4888 300411 java.lang.UNIXProcess.waitForProcessExit
>    3 25.07% 99.14%    4887 300487 org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run
>    4  0.18% 99.32%      36 300465 java.net.SocketInputStream.socketRead0
>    5  0.09% 99.42%      18 300472 java.net.SocketInputStream.socketRead0
>    6  0.05% 99.46%       9 300051 java.lang.ClassLoader.defineClass1
>    7  0.02% 99.48%       4 300498 java.lang.Shutdown.halt0
>    8  0.02% 99.50%       3 300101 java.lang.ClassLoader.findBootstrapClass
>    9  0.02% 99.51%       3 300123 java.util.zip.ZipFile.getEntry
>   10  0.02% 99.53%       3 300492 java.io.FileInputStream.available
>   11  0.01% 99.54%       2 300433 COM.claymoresystems.cert.CertContext.<init>
>   12  0.01% 99.55%       2 300435 java.io.FileInputStream.open
>   13  0.01% 99.56%       2 300447 java.lang.Throwable.fillInStackTrace
>   14  0.01% 99.57%       2 300460 java.util.zip.Inflater.inflateBytes
>   15  0.01% 99.58%       2 300491 java.lang.Throwable.fillInStackTrace
> CPU SAMPLES END
>
> -Allan
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
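[For concreteness: HPROF is enabled through a JVM flag, so, assuming the
coaster service JVM can be launched by hand with extra options, the trace
Mihael asks for would come from an invocation along the lines of

    java -Xrunhprof:cpu=times,format=b,file=coasters.hprof <service class and arguments as usual>

where cpu=times records actual per-method CPU time rather than periodic
stack samples, format=b writes the binary dump a profiler can import, and
the output file name here is only a placeholder.]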
From hategan at mcs.anl.gov  Fri Jun 19 14:43:48 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 14:43:48 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <1245436951.28833.4.camel@localhost>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com>
	<1245436951.28833.4.camel@localhost>
Message-ID: <1245440628.30342.0.camel@localhost>

How many jobs did this run have?

Did you observe a high cpu load on the head node while it was running?

On Fri, 2009-06-19 at 13:42 -0500, Mihael Hategan wrote:
> I'm looking for a complete trace (make sure you use cpu=times) in binary
> format (format=b) that I can load and analyze in some profiler.
>
> The thing below says that waiting takes time, which is not interesting
> because at the same time waiting doesn't eat CPU.
>
> On Fri, 2009-06-19 at 13:27 -0500, Allan Espinosa wrote:
> > The profile trace is in run02/java.hprof.txt
> >
> > summary:
> > CPU SAMPLES BEGIN (total = 19493) Fri Jun 19 13:22:27 2009
> > rank   self  accum   count trace method
> >    1 48.99% 48.99%    9550 300225 java.net.PlainSocketImpl.socketAccept
> >    2 25.08% 74.07%    4888 300411 java.lang.UNIXProcess.waitForProcessExit
> >    3 25.07% 99.14%    4887 300487 org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run
> >    4  0.18% 99.32%      36 300465 java.net.SocketInputStream.socketRead0
> >    5  0.09% 99.42%      18 300472 java.net.SocketInputStream.socketRead0
> >    6  0.05% 99.46%       9 300051 java.lang.ClassLoader.defineClass1
> >    7  0.02% 99.48%       4 300498 java.lang.Shutdown.halt0
> >    8  0.02% 99.50%       3 300101 java.lang.ClassLoader.findBootstrapClass
> >    9  0.02% 99.51%       3 300123 java.util.zip.ZipFile.getEntry
> >   10  0.02% 99.53%       3 300492 java.io.FileInputStream.available
> >   11  0.01% 99.54%       2 300433 COM.claymoresystems.cert.CertContext.<init>
> >   12  0.01% 99.55%       2 300435 java.io.FileInputStream.open
> >   13  0.01% 99.56%       2 300447 java.lang.Throwable.fillInStackTrace
> >   14  0.01% 99.57%       2 300460 java.util.zip.Inflater.inflateBytes
> >   15  0.01% 99.58%       2 300491 java.lang.Throwable.fillInStackTrace
> > CPU SAMPLES END
> >
> > -Allan
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From aespinosa at cs.uchicago.edu  Fri Jun 19 14:52:14 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 19 Jun 2009 14:52:14 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <1245440628.30342.0.camel@localhost>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com>
	<1245436951.28833.4.camel@localhost>
	<1245440628.30342.0.camel@localhost>
Message-ID: <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com>

It has 200 jobs, using the vanilla 066-many.swift workflow.

Yes. CPU usage indicates 100-200% utilization in the duration of the
workflow. I made a script(1) recording of the top session and attached
it in this email.

-Allan

2009/6/19 Mihael Hategan :
> How many jobs did this run have?
>
> Did you observe a high cpu load on the head node while it was running?
>
> On Fri, 2009-06-19 at 13:42 -0500, Mihael Hategan wrote:
>> I'm looking for a complete trace (make sure you use cpu=times) in binary
>> format (format=b) that I can load and analyze in some profiler.
>>
>> The thing below says that waiting takes time, which is not interesting
>> because at the same time waiting doesn't eat CPU.
>>
>> On Fri, 2009-06-19 at 13:27 -0500, Allan Espinosa wrote:
>> > The profile trace is in run02/java.hprof.txt
>> >
>> > summary:
>> > CPU SAMPLES BEGIN (total = 19493) Fri Jun 19 13:22:27 2009
>> > rank   self  accum   count trace method
>> >    1 48.99% 48.99%    9550 300225 java.net.PlainSocketImpl.socketAccept
>> >    2 25.08% 74.07%    4888 300411 java.lang.UNIXProcess.waitForProcessExit
>> >    3 25.07% 99.14%    4887 300487 org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run
>> >    4  0.18% 99.32%      36 300465 java.net.SocketInputStream.socketRead0
>> >    5  0.09% 99.42%      18 300472 java.net.SocketInputStream.socketRead0
>> >    6  0.05% 99.46%       9 300051 java.lang.ClassLoader.defineClass1
>> >    7  0.02% 99.48%       4 300498 java.lang.Shutdown.halt0
>> >    8  0.02% 99.50%       3 300101 java.lang.ClassLoader.findBootstrapClass
>> >    9  0.02% 99.51%       3 300123 java.util.zip.ZipFile.getEntry
>> >   10  0.02% 99.53%       3 300492 java.io.FileInputStream.available
>> >   11  0.01% 99.54%       2 300433 COM.claymoresystems.cert.CertContext.<init>
>> >   12  0.01% 99.55%       2 300435 java.io.FileInputStream.open
>> >   13  0.01% 99.56%       2 300447 java.lang.Throwable.fillInStackTrace
>> >   14  0.01% 99.57%       2 300460 java.util.zip.Inflater.inflateBytes
>> >   15  0.01% 99.58%       2 300491 java.lang.Throwable.fillInStackTrace
>> > CPU SAMPLES END
>> >
>> > -Allan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: top.gz
Type: application/x-gzip
Size: 8912 bytes
Desc: not available
URL: 

From support at ci.uchicago.edu  Fri Jun 19 14:56:59 2009
From: support at ci.uchicago.edu (Ti Leggett)
Date: Fri, 19 Jun 2009 14:56:59 -0500
Subject: [Swift-devel] [CI Ticketing System #1074] How to set .soft and
	env to run condor on TeraPort?
In-Reply-To: <4A3AB523.2060205@mcs.anl.gov>
References: <4A3AB523.2060205@mcs.anl.gov>
Message-ID: 

This may have gotten lost in the correspondence, but did anyone try
using @osg instead of @globus-4 in their ~/.soft?

From hategan at mcs.anl.gov  Fri Jun 19 15:23:39 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 15:23:39 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com>
	<1245436951.28833.4.camel@localhost>
	<1245440628.30342.0.camel@localhost>
	<50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com>
Message-ID: <1245443019.30342.2.camel@localhost>

When profiling it's bound to consume a lot of cpu. Can you verify the
same without the profiler enabled?

On Fri, 2009-06-19 at 14:52 -0500, Allan Espinosa wrote:
> It has 200 jobs, using the vanilla 066-many.swift workflow.
>
> Yes. CPU usage indicates 100-200% utilization in the duration of the
> workflow. I made a script(1) recording of the top session and
> attached it in this email.
>
> -Allan
>
> 2009/6/19 Mihael Hategan :
> > How many jobs did this run have?
> >
> > Did you observe a high cpu load on the head node while it was running?
> >
> > On Fri, 2009-06-19 at 13:42 -0500, Mihael Hategan wrote:
> >> I'm looking for a complete trace (make sure you use cpu=times) in binary
> >> format (format=b) that I can load and analyze in some profiler.
> >>
> >> The thing below says that waiting takes time, which is not interesting
> >> because at the same time waiting doesn't eat CPU.
> >>
> >> On Fri, 2009-06-19 at 13:27 -0500, Allan Espinosa wrote:
> >> > The profile trace is in run02/java.hprof.txt
> >> >
> >> > summary:
> >> > CPU SAMPLES BEGIN (total = 19493) Fri Jun 19 13:22:27 2009
> >> > rank   self  accum   count trace method
> >> >    1 48.99% 48.99%    9550 300225 java.net.PlainSocketImpl.socketAccept
> >> >    2 25.08% 74.07%    4888 300411 java.lang.UNIXProcess.waitForProcessExit
> >> >    3 25.07% 99.14%    4887 300487 org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run
> >> >    4  0.18% 99.32%      36 300465 java.net.SocketInputStream.socketRead0
> >> >    5  0.09% 99.42%      18 300472 java.net.SocketInputStream.socketRead0
> >> >    6  0.05% 99.46%       9 300051 java.lang.ClassLoader.defineClass1
> >> >    7  0.02% 99.48%       4 300498 java.lang.Shutdown.halt0
> >> >    8  0.02% 99.50%       3 300101 java.lang.ClassLoader.findBootstrapClass
> >> >    9  0.02% 99.51%       3 300123 java.util.zip.ZipFile.getEntry
> >> >   10  0.02% 99.53%       3 300492 java.io.FileInputStream.available
> >> >   11  0.01% 99.54%       2 300433 COM.claymoresystems.cert.CertContext.<init>
> >> >   12  0.01% 99.55%       2 300435 java.io.FileInputStream.open
> >> >   13  0.01% 99.56%       2 300447 java.lang.Throwable.fillInStackTrace
> >> >   14  0.01% 99.57%       2 300460 java.util.zip.Inflater.inflateBytes
> >> >   15  0.01% 99.58%       2 300491 java.lang.Throwable.fillInStackTrace
> >> > CPU SAMPLES END
> >> >
> >> > -Allan

From aespinosa at cs.uchicago.edu  Fri Jun 19 18:30:54 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 19 Jun 2009 18:30:54 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <1245443019.30342.2.camel@localhost>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com>
	<1245436951.28833.4.camel@localhost>
	<1245440628.30342.0.camel@localhost>
	<50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com>
	<1245443019.30342.2.camel@localhost>
Message-ID: <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com>

Here's a script recording without profiling of top:

http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz

It still consumes some cpu, but does not spike to 200% utilization.

Thanks
-Allan

2009/6/19 Mihael Hategan :
> When profiling it's bound to consume a lot of cpu. Can you verify the
> same without the profiler enabled?
>
> On Fri, 2009-06-19 at 14:52 -0500, Allan Espinosa wrote:
>> It has 200 jobs, using the vanilla 066-many.swift workflow.
>>
>> Yes. CPU usage indicates 100-200% utilization in the duration of the
>> workflow. I made a script(1) recording of the top session and
>> attached it in this email.
>>
>> -Allan
>>
>> 2009/6/19 Mihael Hategan :
>> > How many jobs did this run have?
>> >
>> > Did you observe a high cpu load on the head node while it was running?
>> >
>> > On Fri, 2009-06-19 at 13:42 -0500, Mihael Hategan wrote:
>> >> I'm looking for a complete trace (make sure you use cpu=times) in binary
>> >> format (format=b) that I can load and analyze in some profiler.
>> >>
>> >> The thing below says that waiting takes time, which is not interesting
>> >> because at the same time waiting doesn't eat CPU.
>> >>
>> >> On Fri, 2009-06-19 at 13:27 -0500, Allan Espinosa wrote:
>> >> > The profile trace is in run02/java.hprof.txt
>> >> >
>> >> > summary:
>> >> > CPU SAMPLES BEGIN (total = 19493) Fri Jun 19 13:22:27 2009
>> >> > rank   self  accum   count trace method
>> >> >    1 48.99% 48.99%    9550 300225 java.net.PlainSocketImpl.socketAccept
>> >> >    2 25.08% 74.07%    4888 300411 java.lang.UNIXProcess.waitForProcessExit
>> >> >    3 25.07% 99.14%    4887 300487 org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run
>> >> >    4  0.18% 99.32%      36 300465 java.net.SocketInputStream.socketRead0
>> >> >    5  0.09% 99.42%      18 300472 java.net.SocketInputStream.socketRead0
>> >> >    6  0.05% 99.46%       9 300051 java.lang.ClassLoader.defineClass1
>> >> >    7  0.02% 99.48%       4 300498 java.lang.Shutdown.halt0
>> >> >    8  0.02% 99.50%       3 300101 java.lang.ClassLoader.findBootstrapClass
>> >> >    9  0.02% 99.51%       3 300123 java.util.zip.ZipFile.getEntry
>> >> >   10  0.02% 99.53%       3 300492 java.io.FileInputStream.available
>> >> >   11  0.01% 99.54%       2 300433 COM.claymoresystems.cert.CertContext.<init>
>> >> >   12  0.01% 99.55%       2 300435 java.io.FileInputStream.open
>> >> >   13  0.01% 99.56%       2 300447 java.lang.Throwable.fillInStackTrace
>> >> >   14  0.01% 99.57%       2 300460 java.util.zip.Inflater.inflateBytes
>> >> >   15  0.01% 99.58%       2 300491 java.lang.Throwable.fillInStackTrace
>> >> > CPU SAMPLES END
>> >> >
>> >> > -Allan

From hategan at mcs.anl.gov  Fri Jun 19 19:53:47 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Jun 2009 19:53:47 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com>
	<1245436951.28833.4.camel@localhost>
	<1245440628.30342.0.camel@localhost>
	<50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com>
	<1245443019.30342.2.camel@localhost>
	<50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com>
Message-ID: <1245459227.6629.6.camel@localhost>

On Fri, 2009-06-19 at 18:30 -0500, Allan Espinosa wrote:
> Here's a script recording without profiling of top:
>
> http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz
>
> It still consumes some cpu, but does not spike to 200% utilization.

Right, 10% != 200%. Now, as you're probably already guessing, I would
need:

1. a situation (workflow/site/etc.) in which the usage does go crazy
without the profiler (as in what triggered you getting kicked off
ranger); repeatable

2. a profiler dump of a run in such a situation

Btw, the ranger issue, where was swift running?

From tfreeman at mcs.anl.gov  Sun Jun 21 09:59:10 2009
From: tfreeman at mcs.anl.gov (Tim Freeman)
Date: Sun, 21 Jun 2009 09:59:10 -0500
Subject: [Swift-devel] gparallelizer DataflowConcurrency
Message-ID: <20090621095910.5b8ffe8a@sietch>

Thought this may be of interest here:

gparallelizer DataflowConcurrency
http://code.google.com/p/gparallelizer/wiki/DataflowConcurrency

Tim

From benc at hawaga.org.uk  Sun Jun 21 11:28:28 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 21 Jun 2009 16:28:28 +0000 (GMT)
Subject: [Swift-devel] gparallelizer DataflowConcurrency
In-Reply-To: <20090621095910.5b8ffe8a@sietch>
References: <20090621095910.5b8ffe8a@sietch>
Message-ID: 

One thing that is apparent here that Swift doesn't have (and I think is
better for Swift not to have) is the distinction between immediate
variables (that have a value) and future values (represented by
DataFlowVariable objects) - to get access to those future values you need
to dereference their value with ~ (which returns the relevant immediate
value or if that immediate value is not available, suspends the present
thread of execution until such time as that immediate value is available).
The distinction between immediate and future values in the language is a
little awkward - for example, instead of z=x+y, you must write z << ~x +
~y to interface the immediate-value-desiring + operator with the future
value variables z, x and y, using ~ to dereference and << instead of = to
assign. This exposure of the interfacing between immediate and future
worlds seems a bit awkward.

Swift used to distinguish some between immediate and future values in its
implementation (although not really in the syntax); I think it's good that
we got rid of that distinction and have only future values for everything.

It would of course be possible to write some kind of + operator that did
the referencing and dereferencing, but this would not in general happen
automatically for every possible function and operator.

This is the same trouble as using pointers in C with the * operator to
access the value of that pointer, or using weak references in Java using
.get() syntax to access the value of the weak reference - it's nothing
particularly specific to programming using futures. But it looks ugly and
awkward in all of them...

--

From hategan at mcs.anl.gov  Sun Jun 21 12:26:16 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 21 Jun 2009 12:26:16 -0500
Subject: [Swift-devel] gparallelizer DataflowConcurrency
In-Reply-To: 
References: <20090621095910.5b8ffe8a@sietch>
Message-ID: <1245605176.25711.16.camel@localhost>

On Sun, 2009-06-21 at 16:28 +0000, Ben Clifford wrote:
> One thing that is apparent here that Swift doesn't have (and I think is
> better for Swift not to have) is the distinction between immediate
> variables (that have a value) and future values (represented by
> DataFlowVariable objects) - to get access to those future values you need
> to dereference their value with ~ (which returns the relevant immediate
> value or if that immediate value is not available, suspends the present
> thread of execution until such time as that immediate value is available).

This is, from the usability perspective, an advantage. It is a problem in
compiled languages because all variable accesses need to check whether a
variable is a future or not (hence the explicit checks in some of such
languages).

A possible optimization is to have futures also be part of the type
system, but it works in a limited number of cases (unless you generate
code for functions for all possible combinations of futures/non-futures,
which can lead to a crazy amount of compiled code, not to mention you
can't do it for collections).

Of course, swift being coarse-grained and all, this doesn't make much
difference.

> The distinction between immediate and future values in the language is a
> little awkward - for example, instead of z=x+y, you must write z << ~x +
> ~y to interface the immediate-value-desiring + operator with the future
> value variables z, x and y, using ~ to dereference and << instead of = to
> assign.

If you have both futures and non-futures explicitly, you need a way to
distinguish between them at assignment time, whether it's x << 1 or x =
future(1).

> This exposure of the interfacing between immediate and future
> worlds seems a bit awkward.
>
> Swift used to distinguish some between immediate and future values in its
> implementation (although not really in the syntax); I think it's good that
> we got rid of that distinction and have only future values for everything.
> It would of course be possible to write some kind of + operator that did
> the referencing and dereferencing, but this would not in general happen
> automatically for every possible function and operator.

Right. If you want "auto-futures", you need to bind it at a low level in
the language implementation.

Also, you don't want to do it for all functions. Composite functions
should not wait for futures:

add(x, y) {
   return x + y;
}

You (probably) only want + to sync on x and y, not add().

> This is the same trouble as using pointers in C with the * operator to
> access the value of that pointer, or using weak references in Java using
> .get() syntax to access the value of the weak reference - it's nothing
> particularly specific to programming using futures. But it looks ugly and
> awkward in all of them...

Again, the problem is that if you want seamless futures, they can't be
added as an afterthought to your language. And if you do have them as
part of your language, you get a performance penalty.

It may be possible, however, to compile Swift to this thing, since I
assume you can use Java libraries in Groovy seamlessly. I'm not sure,
however, of their green thread implementation. I think it's a very
difficult business to make green threads appear preemptible.

From benc at hawaga.org.uk  Mon Jun 22 03:12:08 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 22 Jun 2009 08:12:08 +0000 (GMT)
Subject: [Swift-devel] globally readable variables
Message-ID: 

It seems desirable to make some kind of global variable in SwiftScript. I
will not repeat the arguments here.

There seem to be two obvious syntax choices:

i) all top level variables become readable at all scopes.

ii) top level variables may be annotated with a "global" modifier to make
them accessible at all scopes; otherwise they are accessible only in the
same places they presently are accessible, the top level scope and its
subscopes.

I'm aware of a preference from one person for i) and a slight preference
from another for ii) - does anyone have particular thoughts?

--

From hategan at mcs.anl.gov  Mon Jun 22 11:17:59 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 22 Jun 2009 11:17:59 -0500
Subject: [Swift-devel] globally readable variables
In-Reply-To: 
References: 
Message-ID: <1245687479.6383.1.camel@localhost>

On Mon, 2009-06-22 at 08:12 +0000, Ben Clifford wrote:
> It seems desirable to make some kind of global variable in SwiftScript. I
> will not repeat the arguments here.
>
> There seem to be two obvious syntax choices:
>
> i) all top level variables become readable at all scopes.
>
> ii) top level variables may be annotated with a "global" modifier

I would like to express my strong preference for "constant" or "const"
instead of "global".

> to make
> them accessible at all scopes; otherwise they are accessible only in the
> same places they presently are accessible, the top level scope and its
> subscopes.
>
> I'm aware of a preference from one person for i) and a slight preference
> from another for ii) - does anyone have particular thoughts?
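[To make the explicit-dereference style in the thread above concrete: the
gparallelizer pattern binds a future with << and dereferences it with ~,
while SwiftScript makes every variable an implicit future. Below is a
minimal, hypothetical Java sketch of the explicit style, using nothing
beyond java.util.concurrent; DataflowVar, bind(), and get() are invented
names for illustration, not gparallelizer's actual (Groovy) API:

import java.util.concurrent.CountDownLatch;

public class DataflowSketch {

    // A write-once dataflow variable: bind() plays the role of <<,
    // get() the role of ~ (it suspends the caller until a value exists).
    static class DataflowVar<T> {
        private final CountDownLatch bound = new CountDownLatch(1);
        private volatile T value;

        void bind(T v) {          // like x << 1
            value = v;
            bound.countDown();
        }

        T get() throws InterruptedException {  // like ~x
            bound.await();
            return value;
        }
    }

    // A composite function written against plain ints; it contains no
    // future logic and never waits.
    static int add(int x, int y) {
        return x + y;
    }

    public static void main(String[] args) throws InterruptedException {
        final DataflowVar<Integer> x = new DataflowVar<Integer>();
        final DataflowVar<Integer> y = new DataflowVar<Integer>();
        final DataflowVar<Integer> z = new DataflowVar<Integer>();

        // The analogue of z << ~x + ~y: the two get() calls are the only
        // synchronization points, so this thread suspends until x and y
        // are bound, then binds z.
        new Thread(new Runnable() {
            public void run() {
                try {
                    z.bind(add(x.get(), y.get()));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }).start();

        x.bind(1);  // binding the inputs releases the waiting thread
        y.bind(2);
        System.out.println(z.get());  // prints 3
    }
}

The placement of the get() calls is exactly Mihael's sync-point
observation: add() is written against plain ints and never waits; only
the call site dereferences the futures. In Swift, z = x + y behaves the
same way with no explicit bind/get noise, because every variable is
already a future.]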
From iraicu at cs.uchicago.edu  Mon Jun 22 14:42:16 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 22 Jun 2009 14:42:16 -0500
Subject: [Swift-devel] CFP: 2nd ACM Workshop on Many-Task Computing on
	Grids and Supercomputers (MTAGS09) at Supercomputing 2009
Message-ID: <4A3FDE98.6030102@cs.uchicago.edu>

Call for Papers
---------------------------------------------------------------------------------------
The 2nd ACM Workshop on Many-Task Computing on Grids and Supercomputers
(MTAGS) 2009
http://dsl.cs.uchicago.edu/MTAGS09/
---------------------------------------------------------------------------------------
November 16th, 2009
Portland, Oregon, USA

Co-located with IEEE/ACM International Conference for High Performance
Computing, Networking, Storage and Analysis (SC09)
=======================================================================================

The 2nd workshop on Many-Task Computing on Grids and Supercomputers
(MTAGS) will provide the scientific community a dedicated forum for
presenting new research, development, and deployment efforts of loosely
coupled large scale applications on large scale clusters, Grids,
Supercomputers, and Cloud Computing infrastructure. Many-task computing
(MTC), the theme of the workshop, encompasses loosely coupled
applications, which are generally composed of many tasks (both
independent and dependent tasks) to achieve some larger application
goal. This workshop will cover challenges that can hamper efficiency and
utilization in running applications on large-scale systems, such as
local resource manager scalability and granularity, efficient
utilization of the raw hardware, parallel file system contention and
scalability, reliability at scale, and application scalability. We
welcome paper submissions on all topics related to MTC on large scale
systems. Papers will be peer-reviewed, and accepted papers will be
published in the workshop proceedings as part of the ACM digital
library. The workshop will be co-located with the IEEE/ACM
Supercomputing 2009 Conference in Portland, Oregon on November 16th,
2009. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS09/.

Scope
---------------------------------------------------------------------------------------
This workshop will focus on the ability to manage and execute large
scale applications on today's largest clusters, Grids, and
Supercomputers. Clusters with 50K+ processor cores are beginning to come
online (e.g. TACC Sun Constellation System - Ranger), Grids (e.g.
TeraGrid) with a dozen sites and 100K+ processors, and supercomputers
with 160K processors (e.g. IBM BlueGene/P). Large clusters and
supercomputers have traditionally been high performance computing (HPC)
systems, as they are efficient at executing tightly coupled parallel
jobs within a particular machine with low-latency interconnects; the
applications typically use message passing interface (MPI) to achieve
the needed inter-process communication. On the other hand, Grids have
been the preferred platform for more loosely coupled applications that
tend to be managed and executed through workflow systems. In contrast to
HPC (tightly coupled applications), these loosely coupled applications
make up a new class of applications as what we call Many-Task Computing
(MTC). MTC systems generally involve the execution of independent,
sequential jobs that can be individually scheduled on many different
computing resources across multiple administrative boundaries.
MTC systems typically achieve this using various grid computing technologies and techniques, and often use files for inter-process communication as an alternative to MPI. MTC is reminiscent of High Throughput Computing (HTC); however, MTC differs from HTC in its emphasis on using many computing resources over short periods of time to accomplish many computational tasks, where the primary metrics are measured in seconds (e.g. FLOPS, tasks/sec, MB/s I/O rates). HTC, on the other hand, requires large amounts of computing for longer times (months and years, rather than hours and days), and is generally measured in operations per month.

Today's existing HPC systems are a viable platform to host MTC applications. However, some challenges arise in large scale applications when run on large scale systems, which can hamper the efficiency and utilization of these large scale systems. These challenges range from local resource manager scalability and granularity, efficient utilization of the raw hardware, shared file system contention and scalability, reliability at scale, and application scalability, to understanding the limitations of the HPC systems in order to identify good candidate MTC applications. Furthermore, the MTC paradigm can be naturally applied to the emerging Cloud Computing paradigm due to its loosely coupled nature; Cloud Computing is being adopted by industry as the next wave of technological advancement to reduce operational costs while improving efficiencies in large scale infrastructures.

For an interesting discussion of the difference between MTC and HTC, please see Ian Foster's blog at http://ianfoster.typepad.com/blog/2008/07/many-tasks-comp.html. We also published two papers that are highly relevant to this workshop. One paper is titled "Toward Loosely Coupled Programming on Petascale Systems", and was published in SC08; the second paper is titled "Many-Task Computing for Grids and Supercomputers", which was published in MTAGS08. Furthermore, to see last year's workshop program agenda, and accepted papers and presentations, please see http://dsl.cs.uchicago.edu/MTAGS08/. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS09/.
Topics
---------------------------------------------------------------------------------------
MTAGS 2009 topics of interest include, but are not limited to:

* Compute Resource Management in large scale clusters, large Grids, Supercomputers, or Cloud Computing infrastructure
  o Scheduling
  o Job execution frameworks
  o Local resource manager extensions
  o Performance evaluation of resource managers in use on large scale systems
  o Challenges and opportunities in running many-task workloads on HPC systems
  o Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure
* Data Management in large scale Grid and Supercomputer environments:
  o Data-Aware Scheduling
  o Parallel File System performance and scalability in large deployments
  o Distributed file systems
  o Data caching frameworks and techniques
* Large-Scale Workflow Systems
  o Workflow system performance and scalability analysis
  o Scalability of workflow systems
  o Workflow infrastructure and e-Science middleware
  o Programming Paradigms and Models
* Large-Scale Many-Task Applications
  o Large-scale many-task applications
  o Large-scale many-task data-intensive applications
  o Large-scale high throughput computing (HTC) applications
  o Quasi-supercomputing applications, deployments, and experiences

Paper Submission and Publication
---------------------------------------------------------------------------------------
Authors are invited to submit papers with unpublished, original work of not more than 10 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per ACM 8.5 x 11 manuscript guidelines (http://www.acm.org/publications/instructions_for_proceedings_volumes); document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates. A 250 word abstract (PDF format) must be submitted online at https://cmt.research.microsoft.com/MTAGS2009/ before the deadline of August 1st, 2009 at 11:59PM PST; the final 10 page papers in PDF format will be due on September 1st, 2009 at 11:59PM PST. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. Notifications of the paper decisions will be sent out by October 1st, 2009. Selected excellent work will be invited to submit extended versions of the workshop paper to the IEEE Transactions on Parallel and Distributed Systems (TPDS) Journal, Special Issue on Many-Task Computing (due December 21st, 2009); for more information about this journal special issue, please visit http://dsl.cs.uchicago.edu/TPDS_MTC/. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS09/.
Important Dates
---------------------------------------------------------------------------------------
* Abstract Due: August 1st, 2009
* Papers Due: September 1st, 2009
* Notification of Acceptance: October 1st, 2009
* Camera Ready Papers Due: November 1st, 2009
* Workshop Date: November 16th, 2009

Committee Members
---------------------------------------------------------------------------------------
Workshop Chairs
* Ioan Raicu, University of Chicago
* Ian Foster, University of Chicago & Argonne National Laboratory
* Yong Zhao, Microsoft

Technical Committee (confirmed)
* David Abramson, Monash University, Australia
* Pete Beckman, Argonne National Laboratory, USA
* Peter Dinda, Northwestern University, USA
* Ian Foster, University of Chicago & Argonne National Laboratory, USA
* Bob Grossman, University of Illinois at Chicago, USA
* Indranil Gupta, University of Illinois at Urbana Champaign, USA
* Alexandru Iosup, Delft University of Technology, Netherlands
* Kamil Iskra, Argonne National Laboratory, USA
* Chuang Liu, Ask.com, USA
* Zhou Lei, Shanghai University, China
* Shiyong Lu, Wayne State University, USA
* Reagan Moore, University of North Carolina at Chapel Hill, USA
* Marlon Pierce, Indiana University, USA
* Ioan Raicu, University of Chicago, USA
* Matei Ripeanu, University of British Columbia, Canada
* David Swanson, University of Nebraska, USA
* Greg Thain, University of Wisconsin, USA
* Matthew Woitaszek, The University Corporation for Atmospheric Research, USA
* Sherali Zeadally, University of the District of Columbia, USA
* Yong Zhao, Microsoft, USA

From iraicu at cs.uchicago.edu Mon Jun 22 14:42:42 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 22 Jun 2009 14:42:42 -0500
Subject: [Swift-devel] CFP: Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS) Journal
Message-ID: <4A3FDEB2.1090109@cs.uchicago.edu>

Call for Papers
---------------------------------------------------------------------------------------
IEEE Transactions on Parallel and Distributed Systems
Special Issue on Many-Task Computing on Grids and Supercomputers
http://dsl.cs.uchicago.edu/TPDS_MTC/
=======================================================================================

The Special Issue on Many-Task Computing (MTC) will provide the scientific community a dedicated forum, within the prestigious IEEE Transactions on Parallel and Distributed Systems Journal, for presenting new research, development, and deployment efforts of loosely coupled large scale applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the focus of the special issue, encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. This special issue will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of the raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions on all topics related to MTC on large scale systems. For more information on this special issue, please see http://dsl.cs.uchicago.edu/TPDS_MTC/.
Scope
---------------------------------------------------------------------------------------
This special issue will focus on the ability to manage and execute large scale applications on today's largest clusters, Grids, and Supercomputers. Clusters with tens of thousands of processor cores, Grids (i.e. TeraGrid) with a dozen sites and 100K+ processors, and supercomputers with up to 200K processors (i.e. IBM BlueGene/L and BlueGene/P, Cray XT5, Sun Constellation) are all now available to the broader scientific community for open science research. Large clusters and supercomputers have traditionally been high performance computing (HPC) systems, as they are efficient at executing tightly coupled parallel jobs within a particular machine with low-latency interconnects; the applications typically use the message passing interface (MPI) to achieve the needed inter-process communication. On the other hand, Grids have been the preferred platform for more loosely coupled applications that tend to be managed and executed through workflow systems, commonly known to fit in the high-throughput computing (HTC) paradigm.

Many-task computing (MTC) aims to bridge the gap between these two computing paradigms, HTC and HPC. MTC is reminiscent of HTC, but it differs in its emphasis on using many computing resources over short periods of time to accomplish many computational tasks (i.e. including both dependent and independent tasks), where the primary metrics are measured in seconds (e.g. FLOPS, tasks/s, MB/s I/O rates), as opposed to operations (e.g. jobs) per month. MTC denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. Tasks may be small or large, uniprocessor or multiprocessor, compute-intensive or data-intensive. The set of tasks may be static or dynamic, homogeneous or heterogeneous, loosely coupled or tightly coupled. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. MTC includes loosely coupled applications that are generally communication-intensive but not naturally expressed using the standard message passing interface commonly found in HPC, drawing attention to the many computations that are heterogeneous but not "happily" parallel.

There is more to HPC than tightly coupled MPI, and more to HTC than embarrassingly parallel long running jobs. Like HPC applications, and science itself, applications are becoming increasingly complex, opening new doors for many opportunities to apply HPC in new ways if we broaden our perspective. Some applications have so many simple tasks that managing them is hard. Applications that operate on or produce large amounts of data need sophisticated data management in order to scale. There exist applications that involve many tasks, each composed of tightly coupled MPI tasks. Loosely coupled applications often have dependencies among tasks, and typically use files for inter-process communication. Efficient support for these sorts of applications on existing large scale systems will involve substantial technical challenges and will have a big impact on science.

Today's existing HPC systems are a viable platform to host MTC applications. However, some challenges arise in large scale applications when run on large scale systems, which can hamper the efficiency and utilization of these large scale systems.
These challenges range from local resource manager scalability and granularity, efficient utilization of the raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability, to understanding the limitations of the HPC systems in order to identify good candidate MTC applications. Furthermore, the MTC paradigm can be naturally applied to the emerging Cloud Computing paradigm due to its loosely coupled nature; Cloud Computing is being adopted by industry as the next wave of technological advancement to reduce operational costs while improving efficiencies in large scale infrastructures.

For an interesting discussion of the difference between MTC and HTC, please see Ian Foster's blog at http://ianfoster.typepad.com/blog/2008/07/many-tasks-comp.html. The proposed editors have also published several papers highly relevant to this special issue. One paper is titled "Toward Loosely Coupled Programming on Petascale Systems", and was published in the IEEE/ACM Supercomputing 2008 (SC08) Conference; the second paper is titled "Many-Task Computing for Grids and Supercomputers", which was published in the IEEE Workshop on Many-Task Computing on Grids and Supercomputers 2008 (MTAGS08). To see last year's workshop program agenda, and accepted papers and presentations, please see http://dsl.cs.uchicago.edu/MTAGS08/. To see this year's workshop web site, see http://dsl.cs.uchicago.edu/MTAGS09/.

Topics
---------------------------------------------------------------------------------------
Topics of interest include, but are not limited to:

* Compute Resource Management in large scale clusters, large Grids, Supercomputers, or Cloud Computing infrastructure
  o Scheduling
  o Job execution frameworks
  o Local resource manager extensions
  o Performance evaluation of resource managers in use on large scale systems
  o Challenges and opportunities in running many-task workloads on HPC systems
  o Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure
* Data Management in large scale Grid and Supercomputer environments:
  o Data-Aware Scheduling
  o Parallel File System performance and scalability in large deployments
  o Distributed file systems
  o Data caching frameworks and techniques
* Large-Scale Workflow Systems
  o Workflow system performance and scalability analysis
  o Scalability of workflow systems
  o Workflow infrastructure and e-Science middleware
  o Programming Paradigms and Models
* Large-Scale Many-Task Applications
  o Large-scale many-task applications
  o Large-scale many-task data-intensive applications
  o Large-scale high throughput computing (HTC) applications
  o Quasi-supercomputing applications, deployments, and experiences

Paper Submission and Publication
---------------------------------------------------------------------------------------
Authors are invited to submit papers with unpublished, original work of not more than 14 pages of double column text using single spaced 9.5 point size on 8.5 x 11 inch pages and 0.5 inch margins (http://www2.computer.org/portal/c/document_library/get_file?uuid=02e1509b-5526-4658-afb2-fe8b35044552&groupId=525767). Papers will be peer-reviewed, and accepted papers will be published in the IEEE digital library. For more information, please visit http://dsl.cs.uchicago.edu/TPDS_MTC/.
Important Dates
---------------------------------------------------------------------------------------
* Abstract Due: December 1st, 2009
* Papers Due: December 21st, 2009
* First Round Decisions: February 22nd, 2010
* Major Revisions if needed: April 19th, 2010
* Second Round Decisions: May 24th, 2010
* Minor Revisions if needed: June 7th, 2010
* Final Decision: June 21st, 2010
* Publication Date: November, 2010

Guest Editors and Potential Reviewers
---------------------------------------------------------------------------------------
Special Issue Guest Editors
* Ian Foster, University of Chicago & Argonne National Laboratory
* Ioan Raicu, University of Chicago
* Yong Zhao, Microsoft

From benc at hawaga.org.uk Tue Jun 23 04:27:32 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Jun 2009 09:27:32 +0000 (GMT)
Subject: [Swift-devel] globally readable variables
In-Reply-To: <1245687479.6383.1.camel@localhost>
References: <1245687479.6383.1.camel@localhost>
Message-ID: 

On Mon, 22 Jun 2009, Mihael Hategan wrote:

> I would like to express my strong preference for "constant" or "const"
> instead of "global".

Everything is a constant, though.

-- 

From aespinosa at cs.uchicago.edu Tue Jun 23 17:11:39 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Tue, 23 Jun 2009 17:11:39 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <1245459227.6629.6.camel@localhost>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost>
Message-ID: <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com>

ok this looks like a good replicable case.

the workflow is 2000 invocations of touch using 066-many.swift

run04.tar.gz (logs and a "top -b" dump for the coaster service). cpu utilization averages to 99-100%. I ran this for five trials and got the same results.

run05.tar.gz - same run with profiling information, in java.hprof.bin

2009/6/19 Mihael Hategan :
> On Fri, 2009-06-19 at 18:30 -0500, Allan Espinosa wrote:
>> here's a script recording without profiling of top:
>>
>> http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz
>>
>> it still consumes some cpu. but does not spike to 200% utilization.
>
> Right, 10% != 200%.
>
> Now, as you're probably already guessing, I would need:
> 1. a situation (workflow/site/etc.) in which the usage does go crazy
> without the profiler (as in what triggered you getting kicked off
> ranger); repeatable
> 2. a profiler dump of a run in such a situation
>
> Btw, the ranger issue, where was swift running?

on communicado.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: run04.tar.gz
Type: application/x-gzip
Size: 2130511 bytes
Desc: not available
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run05.tar.gz
Type: application/x-gzip
Size: 2306688 bytes
Desc: not available
URL: 

From hategan at mcs.anl.gov Wed Jun 24 18:30:00 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 24 Jun 2009 18:30:00 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost>
Message-ID: <1245886200.11848.1.camel@localhost>

Try the following:

In the source tree, edit cog/modules/provider-coaster/resources/log4j.properties and change the INFO categories to WARN.

Then re-compile and see if the usage is still high.

On Tue, 2009-06-23 at 17:11 -0500, Allan Espinosa wrote:
> ok this looks like a good replicable case.
>
> the workflow is 2000 invocations of touch using 066-many.swift
>
> run04.tar.gz (logs and a "top -b" dump for the coaster service). cpu
> utilization averages to 99-100%. I ran this for five trials and got
> the same results.
>
> run05.tar.gz - same run with profiling information, in java.hprof.bin
>
> 2009/6/19 Mihael Hategan :
> > On Fri, 2009-06-19 at 18:30 -0500, Allan Espinosa wrote:
> >> here's a script recording without profiling of top:
> >>
> >> http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz
> >>
> >> it still consumes some cpu. but does not spike to 200% utilization.
> >
> > Right, 10% != 200%.
> >
> > Now, as you're probably already guessing, I would need:
> > 1. a situation (workflow/site/etc.) in which the usage does go crazy
> > without the profiler (as in what triggered you getting kicked off
> > ranger); repeatable
> > 2. a profiler dump of a run in such a situation
> >
> > Btw, the ranger issue, where was swift running?
>
> on communicado.

From bugzilla-daemon at mcs.anl.gov Thu Jun 25 03:28:29 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 25 Jun 2009 03:28:29 -0500 (CDT)
Subject: [Swift-devel] [Bug 213] New: assignment before declaration causes runtime error but not compile error.
Message-ID: 

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=213

           Summary: assignment before declaration causes runtime error but
                    not compile error.
           Product: Swift
           Version: unspecified
          Platform: PC
        OS/Version: Mac OS
            Status: NEW
          Severity: normal
          Priority: P2
         Component: General
        AssignedTo: benc at hawaga.org.uk
        ReportedBy: benc at hawaga.org.uk

This code causes a runtime error. It should either be detected at compile time or work.

m = "hi";
string m;

$ swift tmp-sa.swift
Swift svn swift-r2980 cog-r2407

RunID: 20090625-0927-ipsz0v96
Progress:
Execution failed:
        java.lang.IllegalArgumentException: m is closed with a value of null

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Thu Jun 25 08:09:39 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 25 Jun 2009 08:09:39 -0500 (CDT)
Subject: [Swift-devel] [Bug 200] Add global variables to swift
In-Reply-To: 
References: 
Message-ID: <20090625130939.CE6DC2CB03@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=200

Ben Clifford changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
         Status    |NEW         |RESOLVED
     Resolution    |            |FIXED

--- Comment #1 from Ben Clifford 2009-06-25 08:09:39 ---
r2981 adds this

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From bugzilla-daemon at mcs.anl.gov Thu Jun 25 08:17:44 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 25 Jun 2009 08:17:44 -0500 (CDT)
Subject: [Swift-devel] [Bug 165] wrapper.sh and seq.sh name conflicts with "obvious" application-level names
In-Reply-To: 
References: 
Message-ID: <20090625131744.E8FE82CA9A@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=165

Ben Clifford changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
         Status    |NEW         |RESOLVED
     Resolution    |            |FIXED

--- Comment #1 from Ben Clifford 2009-06-25 08:17:44 ---
fixed in r2747

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From bugzilla-daemon at mcs.anl.gov Thu Jun 25 08:19:30 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 25 Jun 2009 08:19:30 -0500 (CDT)
Subject: [Swift-devel] [Bug 203] Jobs skipped in a run due to restart appear in status ticker as "Initializing" which is confusing
In-Reply-To: 
References: 
Message-ID: <20090625131930.22DC52CA9A@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=203

Ben Clifford changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
         Status    |ASSIGNED    |RESOLVED
     Resolution    |            |FIXED

--- Comment #1 from Ben Clifford 2009-06-25 08:19:29 ---
fixed in r2905

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching the reporter.

From bugzilla-daemon at mcs.anl.gov Thu Jun 25 08:21:42 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 25 Jun 2009 08:21:42 -0500 (CDT)
Subject: [Swift-devel] [Bug 191] procedures invoked inside iterate{} don't get unique execution IDs
In-Reply-To: 
References: 
Message-ID: <20090625132142.206B92CA9A@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=191

Ben Clifford changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
         Status    |NEW         |RESOLVED
     Resolution    |            |FIXED

--- Comment #1 from Ben Clifford 2009-06-25 08:21:41 ---
fixed in r2889

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
You are watching the assignee of the bug.
You are watching the reporter.
From bugzilla-daemon at mcs.anl.gov Thu Jun 25 08:26:09 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 25 Jun 2009 08:26:09 -0500 (CDT)
Subject: [Swift-devel] [Bug 41] Deadlock in atomic procedures
In-Reply-To: 
References: 
Message-ID: <20090625132609.73FD52CB61@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=41

Ben Clifford changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
         Status    |NEW         |RESOLVED
     Resolution    |            |FIXED

--- Comment #2 from Ben Clifford 2009-06-25 08:26:09 ---
fixed in r2498

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.
You are watching the reporter.

From bugzilla-daemon at mcs.anl.gov Thu Jun 25 08:30:15 2009
From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov)
Date: Thu, 25 Jun 2009 08:30:15 -0500 (CDT)
Subject: [Swift-devel] [Bug 174] Type string is not defined
In-Reply-To: 
References: 
Message-ID: <20090625133015.9AC702CB61@wind.mcs.anl.gov>

https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=174

--- Comment #1 from Ben Clifford 2009-06-25 08:30:15 ---
at r2981, a different error message (worse) is produced:

$ cat b.swift
string labels[];
trace(labels[0].label);

$ swift -debug b.swift
Max heap: 266403840
b.swift: source file is new. Recompiling.
Validation of XML intermediate file was successful
Detailed exception:
org.griphyn.vdl.karajan.CompilationException: Failed to convert .xml to .kml for b.swift
        at org.griphyn.vdl.karajan.Loader.compile(Loader.java:283)
        at org.griphyn.vdl.karajan.Loader.main(Loader.java:129)
Caused by: java.lang.NullPointerException
        at org.griphyn.vdl.engine.Karajan.expressionToKarajan(Karajan.java:1076)
        at org.griphyn.vdl.engine.Karajan.actualParameter(Karajan.java:779)
        at org.griphyn.vdl.engine.Karajan.call(Karajan.java:497)
        at org.griphyn.vdl.engine.Karajan.statement(Karajan.java:421)
        at org.griphyn.vdl.engine.Karajan.statements(Karajan.java:389)
        at org.griphyn.vdl.engine.Karajan.program(Karajan.java:192)
        at org.griphyn.vdl.engine.Karajan.compile(Karajan.java:96)
        at org.griphyn.vdl.karajan.Loader.compile(Loader.java:264)
        ... 1 more
Could not start execution. Failed to convert .xml to .kml for b.swift

--
Configure bugmail: https://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

From hategan at mcs.anl.gov Thu Jun 25 21:18:32 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 25 Jun 2009 21:18:32 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <1245886200.11848.1.camel@localhost>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost> <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com> <1245886200.11848.1.camel@localhost>
Message-ID: <1245982712.2206.4.camel@localhost>

I also committed a patch (cog r2409) to give the service job a lower priority (nice 10). This won't, however, prevent the process from consuming 100% CPU if that much CPU is available.
On Wed, 2009-06-24 at 18:30 -0500, Mihael Hategan wrote:
> Try the following:
>
> In the source tree, edit
> cog/modules/provider-coaster/resources/log4j.properties and change the
> INFO categories to WARN.
>
> Then re-compile and see if the usage is still high.
>
> On Tue, 2009-06-23 at 17:11 -0500, Allan Espinosa wrote:
> > ok this looks like a good replicable case.
> >
> > the workflow is 2000 invocations of touch using 066-many.swift
> >
> > run04.tar.gz (logs and a "top -b" dump for the coaster service). cpu
> > utilization averages to 99-100%. I ran this for five trials and got
> > the same results.
> >
> > run05.tar.gz - same run with profiling information, in java.hprof.bin
> >
> > 2009/6/19 Mihael Hategan :
> > > On Fri, 2009-06-19 at 18:30 -0500, Allan Espinosa wrote:
> > >> here's a script recording without profiling of top:
> > >>
> > >> http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz
> > >>
> > >> it still consumes some cpu. but does not spike to 200% utilization.
> > >
> > > Right, 10% != 200%.
> > >
> > > Now, as you're probably already guessing, I would need:
> > > 1. a situation (workflow/site/etc.) in which the usage does go crazy
> > > without the profiler (as in what triggered you getting kicked off
> > > ranger); repeatable
> > > 2. a profiler dump of a run in such a situation
> > >
> > > Btw, the ranger issue, where was swift running?
> >
> > on communicado.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Fri Jun 26 07:43:00 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 26 Jun 2009 07:43:00 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <1245886200.11848.1.camel@localhost>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost> <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com> <1245886200.11848.1.camel@localhost>
Message-ID: <4A44C254.5080600@mcs.anl.gov>

Allan,

It's not clear to me that you have really reproduced the problem yet.

My understanding was that Ranger sysadmins observed two processes, owned by you and Zhao, burning 100% CPU on login3, which is also the GRAM gatekeeper host. This was when you were running "mock BLAST" and Zhao was running AMPL.

I know that the AMPL tasks were running 1-2 hours, so one would expect negligible overhead from the coaster service in that case. Your BLAST tasks were, I thought, about 60 seconds in length, for which I would also think that the coaster service would have low overhead.

So I don't think that running 2000 "touch" processes is a good way to reproduce the problem. If you don't see high coaster service overhead with something like "sleep 60" (or better yet, just use your same mock-BLAST run) then I think that the problem has not yet been reproduced.

The original thread where the Ranger sysadmin complained is below. Looking back at it, do we have clear evidence that it was really coaster services, vs., say, GRAM jobmanagers, that were causing the load? Is it possible that coaster settings caused too many GRAM jobs to be run?
I think we should do this:

- review the evidence to see if high coaster service CPU % was really observed
- if so, run the BLAST test elsewhere, and see if it causes such overhead
- run several tests of large numbers of sleep jobs, and record the coaster service CPU utilization

You could have the coaster service log its own CPU and memory utilization to the logfile every 60 seconds, say. You could overload the #coasters per host (say to 16 or higher since they are sleep jobs), so you can readily do this on teraport. From such a plot, we could more scientifically see if we have a coaster service overhead problem or not.

In other words, the Ranger sysadmin did not say "your coaster process is consuming CPU", he just said your jobs are causing the login3 host to be slow for other users. I think the only evidence that points to coasters is your observations, Allan, when you and Zhao were running. Is it possible that one of you was running the Swift command on login3 (in addition to its role as a gatekeeper host)?

Let's go back and review this carefully, as this sysadmin complaint has essentially shut down production coaster usage, which is bad, and we need to determine what was the real cause of the situation that prompted the complaint.

- Mike

On 6/24/09 6:30 PM, Mihael Hategan wrote:
> Try the following:
>
> In the source tree, edit
> cog/modules/provider-coaster/resources/log4j.properties and change the
> INFO categories to WARN.
>
> Then re-compile and see if the usage is still high.
>
> On Tue, 2009-06-23 at 17:11 -0500, Allan Espinosa wrote:
>> ok this looks like a good replicable case.
>>
>> the workflow is 2000 invocations of touch using 066-many.swift
>>
>> run04.tar.gz (logs and a "top -b" dump for the coaster service). cpu
>> utilization averages to 99-100%. I ran this for five trials and got
>> the same results.
>>
>> run05.tar.gz - same run with profiling information, in java.hprof.bin
>>
>> 2009/6/19 Mihael Hategan :
>>> On Fri, 2009-06-19 at 18:30 -0500, Allan Espinosa wrote:
>>>> here's a script recording without profiling of top:
>>>>
>>>> http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz
>>>>
>>>> it still consumes some cpu. but does not spike to 200% utilization.
>>> Right, 10% != 200%.
>>>
>>> Now, as you're probably already guessing, I would need:
>>> 1. a situation (workflow/site/etc.) in which the usage does go crazy
>>> without the profiler (as in what triggered you getting kicked off
>>> ranger); repeatable
>>> 2. a profiler dump of a run in such a situation
>>>
>>> Btw, the ranger issue, where was swift running?
>> on communicado.
> > _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Fri Jun 26 07:46:04 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 26 Jun 2009 07:46:04 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <1245886200.11848.1.camel@localhost>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost> <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com> <1245886200.11848.1.camel@localhost>
Message-ID: <4A44C30C.4000905@mcs.anl.gov>

Allan, can you try this, as part of the other testing that I suggested in the prior message on this thread? I.e., re-run the suggested coaster overhead measurement tests without logging, but only if the tests *with* logging indicate high overhead?

It would be interesting to also compare logfile sizes in both cases.

- Mike

On 6/24/09 6:30 PM, Mihael Hategan wrote:
> Try the following:
>
> In the source tree, edit
> cog/modules/provider-coaster/resources/log4j.properties and change the
> INFO categories to WARN.
>
> Then re-compile and see if the usage is still high.
>
> On Tue, 2009-06-23 at 17:11 -0500, Allan Espinosa wrote:
>> ok this looks like a good replicable case.
>>
>> the workflow is 2000 invocations of touch using 066-many.swift
>>
>> run04.tar.gz (logs and a "top -b" dump for the coaster service). cpu
>> utilization averages to 99-100%. I ran this for five trials and got
>> the same results.
>>
>> run05.tar.gz - same run with profiling information, in java.hprof.bin
>>
>> 2009/6/19 Mihael Hategan :
>>> On Fri, 2009-06-19 at 18:30 -0500, Allan Espinosa wrote:
>>>> here's a script recording without profiling of top:
>>>>
>>>> http://www.ci.uchicago.edu/~aespinosa/top_vanilla.gz
>>>>
>>>> it still consumes some cpu. but does not spike to 200% utilization.
>>> Right, 10% != 200%.
>>>
>>> Now, as you're probably already guessing, I would need:
>>> 1. a situation (workflow/site/etc.) in which the usage does go crazy
>>> without the profiler (as in what triggered you getting kicked off
>>> ranger); repeatable
>>> 2. a profiler dump of a run in such a situation
>>>
>>> Btw, the ranger issue, where was swift running?
>> on communicado.
> > _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From benc at hawaga.org.uk Fri Jun 26 07:52:04 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 26 Jun 2009 12:52:04 +0000 (GMT)
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <4A44C254.5080600@mcs.anl.gov>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost> <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com> <1245886200.11848.1.camel@localhost> <4A44C254.5080600@mcs.anl.gov>
Message-ID: 

> In other words, the Ranger sysadmin did not say "your coaster process is
> consuming CPU", he just said your jobs are causing the login3 host to be
> slow for other users.

Do you have what the Ranger sysadmin actually said, in its entirety?

-- 

From wilde at mcs.anl.gov Fri Jun 26 07:55:05 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 26 Jun 2009 07:55:05 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: 
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost> <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com> <1245886200.11848.1.camel@localhost> <4A44C254.5080600@mcs.anl.gov>
Message-ID: <4A44C529.6010607@mcs.anl.gov>

Yes, sorry, I meant to paste it at the bottom of the previous message but forgot. The sysadmin message is at the bottom of the thread below.

-------- Original Message --------
Subject: Re: Ranger @ TACC - Jobs Running On Head Node creating heavy load
Date: Wed, 17 Jun 2009 15:26:37 -0500
From: Allan Espinosa
To: Zhao Zhang
CC: wilde at mcs.anl.gov
References: <1245269425.13629.23.camel at lockman-d630.tacc.utexas.edu> <4A394FBD.5040301 at uchicago.edu> <50b07b4b0906171322k26976392s4a99144749c437e7 at mail.gmail.com>

Zhao,

your coaster services and gram callback daemons are eating 2 cores. You should kill these too as you abort your swift run.

-Allan

2009/6/17 Allan Espinosa :
> I am guessing that these are the coaster services running on the GRAM
> head node (gateway.ranger points to login3.ranger).
>
> I made the run last night. I am currently running stuff on teraport.
>
> 2009/6/17 Zhao Zhang :
>> Hi, Mike
>>
>> That is me and Allan. I am running the remaining part of the AMPL workflow,
>> 800 jobs left. What shall I do now?
>>
>> zhao
>>
>> John Lockman wrote:
>>>
>>> Dr. Wilde,
>>>
>>> Two users on your project, [zzhang & tg802895] are running jobs on the
>>> Ranger head node [login3] which is slowing the system down dramatically
>>> for other users.
>>> Can these jobs be run on the compute nodes and not the head node?
>>>
>>> Thanks,
>>>

On 6/26/09 7:52 AM, Ben Clifford wrote:
>> In other words, the Ranger sysadmin did not say "your coaster process is
>> consuming CPU", he just said your jobs are causing the login3 host to be
>> slow for other users.
>
> Do you have what the Ranger sysadmin actually said, in its entirety?
From hategan at mcs.anl.gov Fri Jun 26 09:23:46 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 26 Jun 2009 09:23:46 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <4A44C529.6010607@mcs.anl.gov>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost> <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com> <1245886200.11848.1.camel@localhost> <4A44C254.5080600@mcs.anl.gov> <4A44C529.6010607@mcs.anl.gov>
Message-ID: <1246026226.13694.4.camel@localhost>

I think we should ask the sysadmins for clarification on what the problem was (their email leaves room for interpretation) and talk to them and see what we can do to solve it.

On Fri, 2009-06-26 at 07:55 -0500, Michael Wilde wrote:
> Yes, sorry, I meant to paste it at the bottom of the previous message
> but forgot. The sysadmin message is at the bottom of the thread below.
>
> -------- Original Message --------
> Subject: Re: Ranger @ TACC - Jobs Running On Head Node creating heavy load
> Date: Wed, 17 Jun 2009 15:26:37 -0500
> From: Allan Espinosa
> To: Zhao Zhang
> CC: wilde at mcs.anl.gov
> References: <1245269425.13629.23.camel at lockman-d630.tacc.utexas.edu>
> <4A394FBD.5040301 at uchicago.edu>
> <50b07b4b0906171322k26976392s4a99144749c437e7 at mail.gmail.com>
>
> Zhao,
>
> your coaster services and gram callback daemons are eating 2 cores.
> You should kill these too as you abort your swift run.
>
> -Allan
>
> 2009/6/17 Allan Espinosa :
> > I am guessing that these are the coaster services running on the GRAM
> > head node (gateway.ranger points to login3.ranger).
> >
> > I made the run last night. I am currently running stuff on teraport.
> >
> > 2009/6/17 Zhao Zhang :
> >> Hi, Mike
> >>
> >> That is me and Allan. I am running the remaining part of the AMPL workflow,
> >> 800 jobs left. What shall I do now?
> >>
> >> zhao
> >>
> >> John Lockman wrote:
> >>>
> >>> Dr. Wilde,
> >>>
> >>> Two users on your project, [zzhang & tg802895] are running jobs on the
> >>> Ranger head node [login3] which is slowing the system down dramatically
> >>> for other users.
> >>> Can these jobs be run on the compute nodes and not the head node?
> >>>
> >>> Thanks,
> >>>
>
> On 6/26/09 7:52 AM, Ben Clifford wrote:
> >> In other words, the Ranger sysadmin did not say "your coaster process is
> >> consuming CPU", he just said your jobs are causing the login3 host to be
> >> slow for other users.
> >
> > Do you have what the Ranger sysadmin actually said, in its entirety?
> > _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

From wilde at mcs.anl.gov Fri Jun 26 09:29:04 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 26 Jun 2009 09:29:04 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <1246026226.13694.4.camel@localhost>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost> <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com> <1245886200.11848.1.camel@localhost> <4A44C254.5080600@mcs.anl.gov> <4A44C529.6010607@mcs.anl.gov> <1246026226.13694.4.camel@localhost>
Message-ID: <4A44DB30.8060104@mcs.anl.gov>

I will try, and cc the list. It's not clear that John knows more than he reported - most likely he saw processes owned by Zhao and Allan at the top of "top", and concluded that they were running application tests on the login hosts. But worth a try.

I think in the meantime it's worthwhile for Allan to get reliable data on coaster performance. That seems useful for our own needs and for eventual publication.

If we determine that the overhead should indeed be small, I suspect we can coordinate with the sysadmins and start running again, watching closely to make sure we do no harm.

- Mike

On 6/26/09 9:23 AM, Mihael Hategan wrote:
> I think we should ask the sysadmins for clarification on what the
> problem was (their email leaves room for interpretation) and talk to them
> and see what we can do to solve it.
>
> On Fri, 2009-06-26 at 07:55 -0500, Michael Wilde wrote:
>> Yes, sorry, I meant to paste it at the bottom of the previous message
>> but forgot. The sysadmin message is at the bottom of the thread below.
>>
>> -------- Original Message --------
>> Subject: Re: Ranger @ TACC - Jobs Running On Head Node creating heavy load
>> Date: Wed, 17 Jun 2009 15:26:37 -0500
>> From: Allan Espinosa
>> To: Zhao Zhang
>> CC: wilde at mcs.anl.gov
>> References: <1245269425.13629.23.camel at lockman-d630.tacc.utexas.edu>
>> <4A394FBD.5040301 at uchicago.edu>
>> <50b07b4b0906171322k26976392s4a99144749c437e7 at mail.gmail.com>
>>
>> Zhao,
>>
>> your coaster services and gram callback daemons are eating 2 cores.
>> You should kill these too as you abort your swift run.
>>
>> -Allan
>>
>> 2009/6/17 Allan Espinosa :
>> > I am guessing that these are the coaster services running on the GRAM
>> > head node (gateway.ranger points to login3.ranger).
>> >
>> > I made the run last night. I am currently running stuff on teraport.
>> >
>> > 2009/6/17 Zhao Zhang :
>> >> Hi, Mike
>> >>
>> >> That is me and Allan. I am running the remaining part of the AMPL workflow,
>> >> 800 jobs left. What shall I do now?
>> >>
>> >> zhao
>> >>
>> >> John Lockman wrote:
>> >>>
>> >>> Dr. Wilde,
>> >>>
>> >>> Two users on your project, [zzhang & tg802895] are running jobs on the
>> >>> Ranger head node [login3] which is slowing the system down dramatically
>> >>> for other users.
>> >>> Can these jobs be run on the compute nodes and not the head node?
>> >>>
>> >>> Thanks,
>> >>>
>>
>> On 6/26/09 7:52 AM, Ben Clifford wrote:
>>>> In other words, the Ranger sysadmin did not say "your coaster process is
>>>> consuming CPU", he just said your jobs are causing the login3 host to be
>>>> slow for other users.
>>> Do you have what the Ranger sysadmin actually said, in its entirety?
>>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

From hategan at mcs.anl.gov Fri Jun 26 09:33:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 26 Jun 2009 09:33:58 -0500
Subject: [Swift-devel] hprof profiling of coaster services
In-Reply-To: <4A44DB30.8060104@mcs.anl.gov>
References: <50b07b4b0906191127h669ed48bybee61cb855643c7c@mail.gmail.com> <1245436951.28833.4.camel@localhost> <1245440628.30342.0.camel@localhost> <50b07b4b0906191252x8a5f712k84a3709dc5d9dbc9@mail.gmail.com> <1245443019.30342.2.camel@localhost> <50b07b4b0906191630pef1b4e4lb6f28324bffccb4@mail.gmail.com> <1245459227.6629.6.camel@localhost> <50b07b4b0906231511te3384d9uc59c679a5a3471e2@mail.gmail.com> <1245886200.11848.1.camel@localhost> <4A44C254.5080600@mcs.anl.gov> <4A44C529.6010607@mcs.anl.gov> <1246026226.13694.4.camel@localhost> <4A44DB30.8060104@mcs.anl.gov>
Message-ID: <1246026838.13953.1.camel@localhost>

On Fri, 2009-06-26 at 09:29 -0500, Michael Wilde wrote:
> I will try, and cc the list. It's not clear that John knows more than he
> reported - most likely he saw processes owned by Zhao and Allan at the
> top of "top", and concluded that they were running application tests on
> the login hosts. But worth a try.

Well, we keep guessing while there was a specific thing that triggered the email.
It certainly possible as John suggests that the excessive load occurred when the coaster server encouterred a problem, rather than during normal operation. That would be very visile in a plot of its CPU usage over time - we'd see it spike up after running OK for a while. Other/better suggestions welcome. - Mike -------- Original Message -------- Subject: Re: Ranger @ TACC - Jobs Running On Head Node creating heavy load Date: Fri, 26 Jun 2009 11:41:03 -0400 From: John Lockman To: Michael Wilde References: <1245269425.13629.23.camel at lockman-d630.tacc.utexas.edu> <4A44DD70.2080207 at mcs.anl.gov> Mike, It is still unclear as to why your java processes were chewing up so much time and CPU resources, we have a couple of other folks who do similar activities monitoring jobs and they don't seem to trigger such a load. I have a feeling your code may not be cleaning up the processes after something maybe goes wrong and then the java process goes spinning out of control. If you would like to begin testing again, it will be okay on login3 for now. Also, we are investigating adding additional system resources to Ranger to better support these types of activities and move some of the globus workload off of the login nodes. Cheers! -- John Lockman III High Performance Computing +1.512.471.4097 ROC 1.428 Texas Advanced Computing Center The University of Texas at Austin On 6/26/09 9:33 AM, Mihael Hategan wrote: > On Fri, 2009-06-26 at 09:29 -0500, Michael Wilde wrote: >> I will try, and cc the list. Its not clear that John knows more than he >> reported - most likely he saw processes owned by Zhao and Allan at the >> top of "top", and concluded that they were runing application tests on >> the login hosts. But worth a try. > > Well, we keep guessing while there was a specific thing that triggered > the email. > > From aespinosa at cs.uchicago.edu Sat Jun 27 14:28:11 2009 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Sat, 27 Jun 2009 14:28:11 -0500 Subject: [Swift-devel] Is there a site count limit? In-Reply-To: <1239394510.27021.1.camel@localhost> References: <49DF75ED.1060704@mcs.anl.gov> <1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov> <1239383885.10739.3.camel@localhost> <1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov> <49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost> Message-ID: <50b07b4b0906271228pcfdfab4h836ba63dfe22e4cd@mail.gmail.com> ok this gets me confused. in the swift docs: throttle.score.job.factor Valid values: , off Default value: 4 The Swift scheduler has the ability to limit the number of concurrent jobs allowed on a site based on the performance history of that site. Each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula: 2 + score*throttle.score.job.factor so the score can exceed 100? 2009/4/10 Mihael Hategan : > On Fri, 2009-04-10 at 14:44 -0500, Michael Wilde wrote: >> Mihael, your suggestion of: >> >> 2.56 >> 1000 >> >> Is *almost* right on: >> >> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c | awk >> '{ sum += $1} END {print sum}' >> 8131 >> int$ grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c >> >> ? ? ? ?3 >> ? ? ?254 host=bgp000 >> ? ? ?254 host=bgp001 >> ? ? ?254 host=bgp002 >> ? ? ?... >> ? ? ?254 host=bgp030 >> ? ? 
?254 host=bgp031 >> int$ >> >> Can you suggest how to tweak it up to 256? Use jobThrottle=2.58 maybe? > > Make the initial score larger. 10000 should be enough. As it goes to > +inf, you should have a max of 100*jobThrottle + 1 jobs. > >> ?I >> will experiment, but if there's a precise way to hit it "just right" >> that would be great. If not, we will adjust as needed and reduce the >> total # of jobs. >> >> Is this a roundoff issue, or does the formula subtract 2 somewhere from >> the throttle * score product? >> >> - Mike >> >> >> On 4/10/09 12:39 PM, Michael Wilde wrote: >> > >> > >> > On 4/10/09 12:22 PM, Mihael Hategan wrote: >> >> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote: >> >>> Increase foreach.max.threads to at least 4096. >> > >> > it was set to 100000 (100K) >> > >> >> That doesn't seem to be the cause though. Do you have all the >> >> sites/executables properly in tc.data? >> > >> > duh. of course not :) >> > >> > thats the problem, thanks. >> > >> >> >> >>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote: >> >>>> They are in ci:/home/wilde/oops.1063.2 >> >>>> >> >>>> I spotted the anomaly (if thats what it is) as below. >> >>>> >> >>>> Also: we discussed on the list way way back how to get the swift >> >>>> scheduler to send no more jobs to each "site" than there are cores >> >>>> in that site (for this bgp/falkon case) so that jobs dont get >> >>>> committed to busy sites while other sites have free cores. >> >>>> >> >>>> In this run, we are trying to send 32K jobs to 32K cores. >> >>>> Each of the 128 "sites" have 256 cores. >> >>>> >> >>>> The #s below show about 19K of those jobs as having been dispatched >> >>>> to 32*256 = 8192 cores. >> >>>> >> >>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c >> >>>> ? ? ? 24 >> >>>> ? ? ?365 host=bgp000 >> >>>> ? ? ?790 host=bgp001 >> >>>> ? ? ?371 host=bgp002 >> >>>> ? ? ?383 host=bgp003 >> >>>> ? ? ?365 host=bgp004 >> >>>> ? ? ?791 host=bgp005 >> >>>> ? ? ?415 host=bgp006 >> >>>> ? ? ?775 host=bgp007 >> >>>> ? ? ?790 host=bgp008 >> >>>> ? ? ?791 host=bgp009 >> >>>> ? ? ?369 host=bgp010 >> >>>> ? ? ?790 host=bgp011 >> >>>> ? ? ?359 host=bgp012 >> >>>> ? ? ?791 host=bgp013 >> >>>> ? ? ?394 host=bgp014 >> >>>> ? ? ?402 host=bgp015 >> >>>> ? ? ?358 host=bgp016 >> >>>> ? ? ?595 host=bgp017 >> >>>> ? ? ?790 host=bgp018 >> >>>> ? ? ?790 host=bgp019 >> >>>> ? ? ?791 host=bgp020 >> >>>> ? ? ?790 host=bgp021 >> >>>> ? ? ?370 host=bgp022 >> >>>> ? ? ?790 host=bgp023 >> >>>> ? ? ?790 host=bgp024 >> >>>> ? ? ?674 host=bgp025 >> >>>> ? ? ?567 host=bgp026 >> >>>> ? ? ?389 host=bgp027 >> >>>> ? ? ?778 host=bgp028 >> >>>> ? ? ?366 host=bgp029 >> >>>> ? ? ?787 host=bgp030 >> >>>> ? ? ?695 host=bgp031 >> >>>> int$ pwd >> >>>> >> >>>> >> >>>> On 4/10/09 11:42 AM, Mihael Hategan wrote: >> >>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote: >> >>>>>> Hi, >> >>>>>> >> >>>>>> We're trying to run an oops run on 8 racks of the BGP. Its >> >>>>>> possible this is larger than has been done to date with swift. >> >>>>>> >> >>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for >> >>>>>> each pset in the 8-rack partition. >> >>>>>> >> >>>>>> ?From what I can tell, Swift sees all 128 sites, but only sends >> >>>>>> jobs to exactly the first 32, bgp000-bgp031. >> >>>>>> >> >>>>>> While I debug this further, does anyone know of some hardwired >> >>>>>> limit that would cause swift to send to only the first 32 bgp sites? >> >>>>> I can't think of anything that would make that the case. 
>> I will experiment, but if there's a precise way to hit it "just right"
>> that would be great. If not, we will adjust as needed and reduce the
>> total # of jobs.
>>
>> Is this a roundoff issue, or does the formula subtract 2 somewhere from
>> the throttle * score product?
>>
>> - Mike
>>
>> On 4/10/09 12:39 PM, Michael Wilde wrote:
>> > On 4/10/09 12:22 PM, Mihael Hategan wrote:
>> >> On Fri, 2009-04-10 at 12:18 -0500, Mihael Hategan wrote:
>> >>> Increase foreach.max.threads to at least 4096.
>> >
>> > it was set to 100000 (100K)
>> >
>> >> That doesn't seem to be the cause though. Do you have all the
>> >> sites/executables properly in tc.data?
>> >
>> > duh. of course not :)
>> >
>> > that's the problem, thanks.
>> >
>> >>> On Fri, 2009-04-10 at 12:00 -0500, Michael Wilde wrote:
>> >>>> They are in ci:/home/wilde/oops.1063.2
>> >>>>
>> >>>> I spotted the anomaly (if that's what it is) as below.
>> >>>>
>> >>>> Also: we discussed on the list way way back how to get the swift
>> >>>> scheduler to send no more jobs to each "site" than there are cores
>> >>>> in that site (for this bgp/falkon case), so that jobs don't get
>> >>>> committed to busy sites while other sites have free cores.
>> >>>>
>> >>>> In this run, we are trying to send 32K jobs to 32K cores.
>> >>>> Each of the 128 "sites" has 256 cores.
>> >>>>
>> >>>> The #s below show about 19K of those jobs as having been dispatched
>> >>>> to 32*256 = 8192 cores.
>> >>>>
>> >>>> int$ grep JOB_START *nr3.log | awk '{print $19}' | sort | uniq -c
>> >>>>       24
>> >>>>      365 host=bgp000
>> >>>>      790 host=bgp001
>> >>>>      371 host=bgp002
>> >>>>      383 host=bgp003
>> >>>>      365 host=bgp004
>> >>>>      791 host=bgp005
>> >>>>      415 host=bgp006
>> >>>>      775 host=bgp007
>> >>>>      790 host=bgp008
>> >>>>      791 host=bgp009
>> >>>>      369 host=bgp010
>> >>>>      790 host=bgp011
>> >>>>      359 host=bgp012
>> >>>>      791 host=bgp013
>> >>>>      394 host=bgp014
>> >>>>      402 host=bgp015
>> >>>>      358 host=bgp016
>> >>>>      595 host=bgp017
>> >>>>      790 host=bgp018
>> >>>>      790 host=bgp019
>> >>>>      791 host=bgp020
>> >>>>      790 host=bgp021
>> >>>>      370 host=bgp022
>> >>>>      790 host=bgp023
>> >>>>      790 host=bgp024
>> >>>>      674 host=bgp025
>> >>>>      567 host=bgp026
>> >>>>      389 host=bgp027
>> >>>>      778 host=bgp028
>> >>>>      366 host=bgp029
>> >>>>      787 host=bgp030
>> >>>>      695 host=bgp031
>> >>>> int$ pwd
>> >>>>
>> >>>> On 4/10/09 11:42 AM, Mihael Hategan wrote:
>> >>>>> On Fri, 2009-04-10 at 11:38 -0500, Michael Wilde wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> We're trying to run an oops run on 8 racks of the BGP. It's
>> >>>>>> possible this is larger than has been done to date with swift.
>> >>>>>>
>> >>>>>> Our sites.xml file has localhost plus 128 Falkon sites, one for
>> >>>>>> each pset in the 8-rack partition.
>> >>>>>>
>> >>>>>> From what I can tell, Swift sees all 128 sites, but only sends
>> >>>>>> jobs to exactly the first 32, bgp000-bgp031.
>> >>>>>>
>> >>>>>> While I debug this further, does anyone know of some hardwired
>> >>>>>> limit that would cause swift to send to only the first 32 bgp sites?
>> >>>>> I can't think of anything that would make that the case. The sites
>> >>>>> file and a log would be useful.
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From hategan at mcs.anl.gov  Sat Jun 27 15:14:09 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 27 Jun 2009 15:14:09 -0500
Subject: [Swift-devel] Is there a site count limit?
In-Reply-To: <50b07b4b0906271228pcfdfab4h836ba63dfe22e4cd@mail.gmail.com>
References: <49DF75ED.1060704@mcs.anl.gov>
	<1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov>
	<1239383885.10739.3.camel@localhost>
	<1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov>
	<49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost>
	<50b07b4b0906271228pcfdfab4h836ba63dfe22e4cd@mail.gmail.com>
Message-ID: <1246133649.31151.2.camel@localhost>

On Sat, 2009-06-27 at 14:28 -0500, Allan Espinosa wrote:
> OK, this gets me confused. In the swift docs:
>
>  throttle.score.job.factor
>
>     Valid values: <number>, off
>
>     Default value: 4
>
>     The Swift scheduler has the ability to limit the number of
> concurrent jobs allowed on a site based on the performance history of
> that site. Each site is assigned a score (initially 1), which can
> increase or decrease based on whether the site yields successful or
> faulty job runs. The score for a site can take values in the (0.1,
> 100) interval. The number of allowed jobs is calculated using the
> following formula:
>
>     2 + score*throttle.score.job.factor
>
> So the score can exceed 100?

There are two scores. The raw score, which can be anything, and the
scaled score, which goes from 0.1 to 100. There's a 1-to-1 mapping
between the two.
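[The raw/scaled distinction can be pictured with any strictly increasing
squashing function. The one below is a toy stand-in: the thread does not
show the actual mapping used by the cog scheduler, and the constants are
chosen only so that a raw score of 1 lands near a scaled score of 1:]

    # Toy illustration of a 1-to-1 map from raw scores (any real number)
    # into the (0.1, 100) scaled-score interval. Not the real cog mapping.
    import math

    def scaled_score(raw, k=0.0095, midpoint=500.0):
        lo, hi = 0.1, 100.0
        # strictly increasing, so raw and scaled scores correspond 1-to-1
        return lo + (hi - lo) / (1.0 + math.exp(-k * (raw - midpoint)))

    print(scaled_score(1))        # ~1: a fresh site admits few jobs
    print(scaled_score(10000))    # ~100: saturated, maximum throttle
    print(scaled_score(-10000))   # ~0.1: a failing site is nearly shut off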
From aespinosa at cs.uchicago.edu  Sat Jun 27 17:40:04 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Sat, 27 Jun 2009 17:40:04 -0500
Subject: [Swift-devel] Is there a site count limit?
In-Reply-To: <1246133649.31151.2.camel@localhost>
References: <49DF75ED.1060704@mcs.anl.gov>
	<1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov>
	<1239383885.10739.3.camel@localhost>
	<1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov>
	<49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost>
	<50b07b4b0906271228pcfdfab4h836ba63dfe22e4cd@mail.gmail.com>
	<1246133649.31151.2.camel@localhost>
Message-ID: <50b07b4b0906271540g5d88cfc1ncd9faaa4b6ed8bfb@mail.gmail.com>

OK, I sort of understand now.

We set the throttle to set the max number of jobs on a site (calculated
from the scaled score). Then, to ensure we send the max number of jobs
to the site from the start, we set a ridiculously high initial score
(for the raw score) so it scales to 100. Correct?

Thanks!
-Allan

2009/6/27 Mihael Hategan <hategan at mcs.anl.gov>:
> On Sat, 2009-06-27 at 14:28 -0500, Allan Espinosa wrote:
>> OK, this gets me confused. In the swift docs:
>>
>>  throttle.score.job.factor
>>
>>     Valid values: <number>, off
>>
>>     Default value: 4
>>
>>     The Swift scheduler has the ability to limit the number of
>> concurrent jobs allowed on a site based on the performance history of
>> that site. Each site is assigned a score (initially 1), which can
>> increase or decrease based on whether the site yields successful or
>> faulty job runs. The score for a site can take values in the (0.1,
>> 100) interval. The number of allowed jobs is calculated using the
>> following formula:
>>
>>     2 + score*throttle.score.job.factor
>>
>> So the score can exceed 100?
>
> There are two scores. The raw score, which can be anything, and the
> scaled score, which goes from 0.1 to 100. There's a 1-to-1 mapping
> between the two.

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From hategan at mcs.anl.gov  Sat Jun 27 19:49:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 27 Jun 2009 19:49:58 -0500
Subject: [Swift-devel] Is there a site count limit?
In-Reply-To: <50b07b4b0906271540g5d88cfc1ncd9faaa4b6ed8bfb@mail.gmail.com>
References: <49DF75ED.1060704@mcs.anl.gov>
	<1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov>
	<1239383885.10739.3.camel@localhost>
	<1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov>
	<49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost>
	<50b07b4b0906271228pcfdfab4h836ba63dfe22e4cd@mail.gmail.com>
	<1246133649.31151.2.camel@localhost>
	<50b07b4b0906271540g5d88cfc1ncd9faaa4b6ed8bfb@mail.gmail.com>
Message-ID: <1246150198.3376.1.camel@localhost>

On Sat, 2009-06-27 at 17:40 -0500, Allan Espinosa wrote:
> OK, I sort of understand now.
>
> We set the throttle to set the max number of jobs on a site (calculated
> from the scaled score). Then, to ensure we send the max number of jobs
> to the site from the start, we set a ridiculously high initial score
> (for the raw score) so it scales to 100. Correct?

Exactly.

Ben had some plans to allow setting the maximum and initial number of
jobs instead of fiddling with abstract scores, but I think he quit or
something.

Mihael

From benc at hawaga.org.uk  Sun Jun 28 16:13:54 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 28 Jun 2009 21:13:54 +0000 (GMT)
Subject: [Swift-devel] Is there a site count limit?
In-Reply-To: <1246150198.3376.1.camel@localhost>
References: <49DF75ED.1060704@mcs.anl.gov>
	<1239381726.8860.0.camel@localhost> <49DF7B2A.4000201@mcs.anl.gov>
	<1239383885.10739.3.camel@localhost>
	<1239384130.10739.5.camel@localhost> <49DF845E.6000908@mcs.anl.gov>
	<49DFA1AD.8050000@mcs.anl.gov> <1239394510.27021.1.camel@localhost>
	<50b07b4b0906271228pcfdfab4h836ba63dfe22e4cd@mail.gmail.com>
	<1246133649.31151.2.camel@localhost>
	<50b07b4b0906271540g5d88cfc1ncd9faaa4b6ed8bfb@mail.gmail.com>
	<1246150198.3376.1.camel@localhost>
Message-ID:

On Sat, 27 Jun 2009, Mihael Hategan wrote:

> Ben had some plans to allow setting the maximum and initial number of
> jobs instead of fiddling with abstract scores, but I think he quit or
> something.

Actually, the summer student who was going to do it quit (or rather, got
a more enticing summer activity).

--
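[For completeness, the per-host accounting done earlier in the thread
with grep/awk/sort/uniq can be reproduced in a few lines of Python. This
is a rough equivalent only; like the awk '{print $19}', it assumes the
19th whitespace-separated field of a JOB_START line is the host=...
entry, so adjust for your log layout:]

    # Rough Python equivalent of:
    #   grep JOB_START *45.log | awk '{print $19}' | sort | uniq -c
    from collections import Counter
    import glob

    counts = Counter()
    for path in glob.glob("*45.log"):
        with open(path) as f:
            for line in f:
                if "JOB_START" in line:
                    fields = line.split()
                    if len(fields) >= 19:
                        counts[fields[18]] += 1  # awk's $19 is index 18

    for host, n in sorted(counts.items()):
        print(f"{n:7d} {host}")
    print("total:", sum(counts.values()))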