From abejan at ci.uchicago.edu  Tue Sep  2 11:13:20 2008
From: abejan at ci.uchicago.edu (Alina Bejan)
Date: Tue, 02 Sep 2008 11:13:20 -0500
Subject: [Swift-user] swift on fermigrid site question
Message-ID: <48BD6620.4040504@ci.uchicago.edu>

Hello --

I have the following very basic problem: I am trying to run the Hello
World swift program on the fermigrid site (with OSG VO credentials). I am
getting the following error:

[abejan at communicado examples]$ swift -tc.file tc.data -sites.file fermi.xml first.swift
Swift 0.5 swift-r1783 cog-r1962

RunID: 20080902-1035-1dnbi226
Progress:
echo started
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1
Progress:  Failed but can retry:1
Failed to transfer wrapper log from first-20080902-1035-1dnbi226/info/t/fermigridosg1.fnal.gov
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1

where tc.data is only one line:

fermigridosg1.fnal.gov   echo   /bin/echo   INSTALLED   INTEL32::LINUX   null

and the sites file is fermi.xml:

/grid/data

The results (writing "Hello" into a hello.txt file) are not generated.
Could you please explain to me what is wrong?

Thanks,
Alina

From benc at hawaga.org.uk  Tue Sep  2 11:23:18 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 2 Sep 2008 16:23:18 +0000 (GMT)
Subject: [Swift-user] swift on fermigrid site question
In-Reply-To: <48BD6620.4040504@ci.uchicago.edu>
References: <48BD6620.4040504@ci.uchicago.edu>
Message-ID:

On Tue, 2 Sep 2008, Alina Bejan wrote:

> I have the following very basic problem: I am trying to run the Hello
> World swift program on the fermigrid site (with OSG VO credentials). I am
> getting the following error:

Without seeing the log file, my first guess would be that something is
failing after the site has had your job in a queue for a while.

Try these two commands:

globus-job-run fermigridosg1.fnal.gov/jobmanager-condor /bin/echo foo

globus-url-copy file:///etc/group gsiftp://fermigridosg1.fnal.gov/grid/data/abejan-test-1

and post the results.

--

From abejan at ci.uchicago.edu  Tue Sep  2 11:38:12 2008
From: abejan at ci.uchicago.edu (Alina Bejan)
Date: Tue, 02 Sep 2008 11:38:12 -0500
Subject: [Swift-user] swift on fermigrid site question
In-Reply-To:
References: <48BD6620.4040504@ci.uchicago.edu>
Message-ID: <48BD6BF4.4070302@ci.uchicago.edu>

Alright, these are the outputs:

[abejan at communicado examples]$ globus-job-run fermigridosg1.fnal.gov/jobmanager-condor /bin/echo foo
condor_exec.exe: /lib/tls/libc.so.6: version `GLIBC_2.4' not found (required by condor_exec.exe)

[abejan at communicado examples]$ globus-url-copy file:///etc/group gsiftp://fermigridosg1.fnal.gov/grid/data/abejan-test-1
GlobusUrlCopy error: UrlCopy transfer failed. [Caused by: etc/group (No such file or directory)]

For the logs, here's the path: /home/abejan/workflow-ex/examples

Thanks.
Alina

Ben Clifford wrote:
> On Tue, 2 Sep 2008, Alina Bejan wrote:
>
>> I have the following very basic problem: I am trying to run the Hello
>> World swift program on the fermigrid site (with OSG VO credentials). I am
>> getting the following error:
>
> Without seeing the log file, my first guess would be that something is
> failing after the site has had your job in a queue for a while.
>
> Try these two commands:
>
> globus-job-run fermigridosg1.fnal.gov/jobmanager-condor /bin/echo foo
>
> globus-url-copy file:///etc/group
> gsiftp://fermigridosg1.fnal.gov/grid/data/abejan-test-1
>
> and post the results.
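P.S. The XML of fermi.xml got eaten in my first message; it is roughly
the following (a sketch from memory - the exact pool/gridftp/execution
attributes may differ slightly, but the jobmanager is condor and the
work directory is /grid/data):

<pool handle="fermigridosg1.fnal.gov">
  <gridftp url="gsiftp://fermigridosg1.fnal.gov"/>
  <execution provider="gt2" url="fermigridosg1.fnal.gov/jobmanager-condor"/>
  <workdirectory>/grid/data</workdirectory>
</pool>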
From benc at hawaga.org.uk  Tue Sep  2 11:42:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 2 Sep 2008 16:42:27 +0000 (GMT)
Subject: [Swift-user] swift on fermigrid site question
In-Reply-To: <48BD6BF4.4070302@ci.uchicago.edu>
References: <48BD6620.4040504@ci.uchicago.edu> <48BD6BF4.4070302@ci.uchicago.edu>
Message-ID:

On Tue, 2 Sep 2008, Alina Bejan wrote:

> [abejan at communicado examples]$ globus-job-run
> fermigridosg1.fnal.gov/jobmanager-condor /bin/echo foo
>
> condor_exec.exe: /lib/tls/libc.so.6: version `GLIBC_2.4' not found (required
> by condor_exec.exe)

OK, so that's a problem with running jobs in Condor. Probably best to
talk to the Fermilab site admins and show them this.

> [abejan at communicado examples]$ globus-url-copy file:///etc/group
> gsiftp://fermigridosg1.fnal.gov/grid/data/abejan-test-1
> GlobusUrlCopy error: UrlCopy transfer failed. [Caused by: etc/group (No such
> file or directory)]

My bad - you're not using the Globus Toolkit version of globus-url-copy,
so you need a different command line (the CoG version uses different URL
semantics). However, the globus-job-run failure looks like the main thing
to investigate.

--

From fedorov at cs.wm.edu  Thu Sep  4 09:39:44 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 4 Sep 2008 10:39:44 -0400
Subject: [Swift-user] Swift scheduler
In-Reply-To:
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
Message-ID: <82f536810809040739j4b92e27bq4bc32992b75b706@mail.gmail.com>

>> Can any of the developers point me to the specific part of the source
>> that is responsible for scheduling, so that I could try to figure this
>> out myself?
>
> Start here:
>
> cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java

I looked at the latest cog release, r2130, and line 342 says

else if ("best".equals("value")) {

Am I wrong, or is "value" in quotes here a bug? It looks like
POLICY_BEST_SCORE is never used because of this.

--
Andrey Fedorov

From hategan at mcs.anl.gov  Thu Sep  4 10:38:47 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 04 Sep 2008 10:38:47 -0500
Subject: [Swift-user] Swift scheduler
In-Reply-To: <82f536810809040739j4b92e27bq4bc32992b75b706@mail.gmail.com>
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com> <82f536810809040739j4b92e27bq4bc32992b75b706@mail.gmail.com>
Message-ID: <1220542727.5823.18.camel@localhost>

On Thu, 2008-09-04 at 10:39 -0400, Andriy Fedorov wrote:
> >> Can any of the developers point me to the specific part of the source
> >> that is responsible for scheduling, so that I could try to figure this
> >> out myself?
> >
> > Start here:
> >
> > cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java
>
> I looked at the latest cog release, r2130, and line 342 says
>
> else if ("best".equals("value")) {
>
> Am I wrong, or is "value" in quotes here a bug? It looks like
> POLICY_BEST_SCORE is never used because of this.

That is funny, but yes, it is a bug. However, that selection policy is
not used in Swift anyway, nor is it very useful otherwise. The weighted
random selection distributes load more smoothly across sites.

Anyway, I fixed it in SVN.
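(For the record, the fix is simply dropping the quotes around value, so
that the line reads: else if ("best".equals(value)) { )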
Mihael

From fedorov at cs.wm.edu  Tue Sep  9 21:55:50 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 9 Sep 2008 22:55:50 -0400
Subject: [Swift-user] Swift build problems
Message-ID: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>

Hi,

While trying to build a development checkout of Swift, I ran into the
problem below: a strange warning about running out of memory, followed by
a failure that doesn't make sense. I saw similar reports about other
packages that use ant, but I didn't find a general solution.

I have openSuse 11.0; uname -a says

Linux beat 2.6.25.11-0.1-default #1 SMP 2008-07-13 20:48:28 +0200 x86_64 x86_64 x86_64 GNU/Linux

Has anyone seen something like this?

[andrey at beat vdsk] ant dist
Buildfile: build.xml

generateVersion:

antlr:
[java] ANTLR Parser Generator   Version 2.7.5 (20050128)   1989-2005 jGuru.com
[java] resources/swiftscript.g:944: warning: nondeterminism upon
[java] resources/swiftscript.g:944:     k==1:LBRACK
[java] resources/swiftscript.g:944:     k==2:ID,STRING_LITERAL,LBRACK,LPAREN,AT,PLUS,MINUS,STAR,NOT,INT_LITERAL,FLOAT_LITERAL,"true","false"
[java] resources/swiftscript.g:944:     between alt 1 and exit branch of block

compileSchema:
GC Warning: Out of Memory!  Returning NIL!
GC Warning: Repeated allocation of very large block (appr. size 131072000):
    May lead to memory leak and poor performance.
[java] Time to build schema type system: 67.854 seconds
[java] Time to generate code: 6.224 seconds
[java] java.io.IOException: Cannot run program "/home/andrey/local/src/cog/modules/vdsk/javac": java.io.IOException: error=2, No such file or directory
[java] java.io.IOException: java.io.IOException: error=2, No such file or directory
[java] java.io.IOException: Cannot run program "/home/andrey/local/src/cog/modules/vdsk/javac": java.io.IOException: error=2, No such file or directory
[java]     at java.lang.ProcessBuilder.start(ProcessBuilder.java:474)
[java]     at java.lang.Runtime.exec(Runtime.java:610)
[java]     at java.lang.Runtime.exec(Runtime.java:483)
[java]     at org.apache.xmlbeans.impl.tool.CodeGenUtil.externalCompile(CodeGenUtil.java:231)
[java]     at org.apache.xmlbeans.impl.tool.SchemaCompiler.compile(SchemaCompiler.java:1126)
[java]     at org.apache.xmlbeans.impl.tool.SchemaCompiler.main(SchemaCompiler.java:368)
[java] Caused by: java.io.IOException: java.io.IOException: error=2, No such file or directory
[java]     at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
[java]     at java.lang.ProcessImpl.start(ProcessImpl.java:81)
[java]     at java.lang.ProcessBuilder.start(ProcessBuilder.java:467)
[java]     ... 5 more
[java] BUILD FAILED

BUILD FAILED
/home/andrey/local/src/cog/modules/vdsk/build.xml:152: Java returned: 1

Total time: 1 minute 30 seconds

--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov

From hategan at mcs.anl.gov  Tue Sep  9 23:14:48 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 09 Sep 2008 23:14:48 -0500
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
Message-ID: <1221020088.5312.2.camel@localhost>

On Tue, 2008-09-09 at 22:55 -0400, Andriy Fedorov wrote:
> [java] java.io.IOException: Cannot run program
> "/home/andrey/local/src/cog/modules/vdsk/javac": java.io.IOException:
> error=2, No such file or directory

Is JAVA_HOME set?
From benc at hawaga.org.uk  Wed Sep 10 01:56:21 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Sep 2008 06:56:21 +0000 (GMT)
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
Message-ID:

What java are you using?

$ java -version
$ javac -version

--

From fedorov at cs.wm.edu  Wed Sep 10 06:44:18 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 10 Sep 2008 07:44:18 -0400
Subject: [Swift-user] Swift build problems
In-Reply-To:
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
Message-ID: <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>

[andrey at beat cog] echo $JAVA_HOME
/usr/lib64/jvm/java
[andrey at beat cog] java -version
java version "1.6.0"
IcedTea Runtime Environment (build 1.6.0-b09)
OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
[andrey at beat cog] javac -version
javac: unrecognized option '-version'
javac: no input files
[andrey at beat cog]

On Wed, Sep 10, 2008 at 2:56 AM, Ben Clifford wrote:
> What java are you using?
>
> $ java -version
> $ javac -version
>
> --

From benc at hawaga.org.uk  Wed Sep 10 06:49:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Sep 2008 11:49:46 +0000 (GMT)
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com> <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
Message-ID:

I've not heard of that one before. Have you used it much for compiling
non-trivial stuff?

On Wed, 10 Sep 2008, Andriy Fedorov wrote:

> [andrey at beat cog] echo $JAVA_HOME
> /usr/lib64/jvm/java
> [andrey at beat cog] java -version
> java version "1.6.0"
> IcedTea Runtime Environment (build 1.6.0-b09)
> OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
> [andrey at beat cog] javac -version
> javac: unrecognized option '-version'
> javac: no input files
> [andrey at beat cog]

From fedorov at cs.wm.edu  Wed Sep 10 08:06:26 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 10 Sep 2008 09:06:26 -0400
Subject: [Swift-user] Swift build problems
In-Reply-To:
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com> <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
Message-ID: <82f536810809100606r49ceefd4jb915b59d79529f35@mail.gmail.com>

On Wed, Sep 10, 2008 at 7:49 AM, Ben Clifford wrote:
>
> I've not heard of that one before. Have you used it much for compiling
> non-trivial stuff?

No, I haven't...

From hategan at mcs.anl.gov  Wed Sep 10 10:30:46 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Sep 2008 10:30:46 -0500
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com> <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
Message-ID: <1221060646.16054.4.camel@localhost>

That tool is looking for a javac in ${java.home}/bin/ (and I assume it
sets that system property from JAVA_HOME). Not finding it, it tries the
current directory.

On Wed, 2008-09-10 at 07:44 -0400, Andriy Fedorov wrote:
> [andrey at beat cog] echo $JAVA_HOME
> /usr/lib64/jvm/java
> [andrey at beat cog] java -version
> java version "1.6.0"
> IcedTea Runtime Environment (build 1.6.0-b09)
> OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
> [andrey at beat cog] javac -version
> javac: unrecognized option '-version'
> javac: no input files
> [andrey at beat cog]

From fedorov at cs.wm.edu  Wed Sep 10 10:48:44 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 10 Sep 2008 11:48:44 -0400
Subject: [Swift-user] Swift build problems
In-Reply-To: <1221060646.16054.4.camel@localhost>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com> <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com> <1221060646.16054.4.camel@localhost>
Message-ID: <82f536810809100848g79d0c7a7vd6d6fbe5b07114a7@mail.gmail.com>

On Wed, Sep 10, 2008 at 11:30 AM, Mihael Hategan wrote:
> That tool is looking for a javac in ${java.home}/bin/ (and I assume it
> sets that system property from JAVA_HOME). Not finding it, it tries the
> current directory.

This may well be true. However, the question is why it is not finding
javac in $JAVA_HOME/bin, while it is clearly present there? My guess is
that the earlier failed memory allocation left garbage somewhere in
memory that confused ant about the javac location. So there may be hope
of resolving the weird javac error by resolving the memory allocation
problem.

Having googled related issues, I have not been able to find any solution
to this, so I am just not going to use that particular system with svn
Swift. This is not critical right now; I was just wondering if this is an
issue known to the community.

> On Wed, 2008-09-10 at 07:44 -0400, Andriy Fedorov wrote:
>> [andrey at beat cog] echo $JAVA_HOME
>> /usr/lib64/jvm/java
>> [andrey at beat cog] java -version
>> java version "1.6.0"
>> IcedTea Runtime Environment (build 1.6.0-b09)
>> OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
>> [andrey at beat cog] javac -version
>> javac: unrecognized option '-version'
>> javac: no input files
>> [andrey at beat cog]

From benc at hawaga.org.uk  Wed Sep 10 11:02:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Sep 2008 16:02:23 +0000 (GMT)
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809100848g79d0c7a7vd6d6fbe5b07114a7@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com> <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com> <1221060646.16054.4.camel@localhost> <82f536810809100848g79d0c7a7vd6d6fbe5b07114a7@mail.gmail.com>
Message-ID:

On Wed, 10 Sep 2008, Andriy Fedorov wrote:

[...]

> I am just not going to
> use that particular system with svn Swift.
You can rsync vdsk/dist/swift-svn (or tar or otherwise copy it) from one
machine to another (assuming compatible JREs); that's basically what a
Swift release is.

--

From zhouxy at uchicago.edu  Wed Sep 17 14:22:30 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Wed, 17 Sep 2008 14:22:30 -0500
Subject: [Swift-user] Problem about configuration for swift
Message-ID: <8C388ACD0A8F42458DEC650238543E10@VXAVIER>

Hi all,

I just got my certificate, and I can run grid-proxy-init. But when I
change tc.data to

teraport   echo   /bin/echo   INSTALLED   INTEL32::LINUX   null

and sites.xml to

/home/zhouxy/swift/working

I get the following error when I try to run the first.swift example:

2008.09.17 14:20:08.306 CDT: [FATAL ERROR] You forgot to set -Dpegasus.home=$PEGASUS_HOME!

Can anyone help me with this? Thanks!

Xueyuan

From benc at hawaga.org.uk  Wed Sep 17 14:31:38 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 17 Sep 2008 19:31:38 +0000 (GMT)
Subject: [Swift-user] Problem about configuration for swift
In-Reply-To: <8C388ACD0A8F42458DEC650238543E10@VXAVIER>
References: <8C388ACD0A8F42458DEC650238543E10@VXAVIER>
Message-ID:

On Wed, 17 Sep 2008, Xueyuan Zhou wrote:

> 2008.09.17 14:20:08.306 CDT: [FATAL ERROR] You forgot to set
> -Dpegasus.home=$PEGASUS_HOME!

I've seen this error when both of the following conditions are true:

i) Pegasus is installed on the same machine

ii) You are using an old version of swift (build prior to cog r2007,
2008-05-12)

Are the above true for you? If so, move to a recent version of swift,
e.g. 0.6, which is the latest release.

--

From zhouxy at uchicago.edu  Wed Sep 17 14:48:09 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Wed, 17 Sep 2008 14:48:09 -0500
Subject: [Swift-user] Problem about configuration for swift
References: <8C388ACD0A8F42458DEC650238543E10@VXAVIER>
Message-ID: <474C1AB60DEB44A89CA8F66479917D39@VXAVIER>

Thanks a lot, Ben! After I installed version 0.6, it works!

-Xueyuan

----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Wednesday, September 17, 2008 2:31 PM
Subject: Re: [Swift-user] Problem about configuration for swift

> On Wed, 17 Sep 2008, Xueyuan Zhou wrote:
>
>> 2008.09.17 14:20:08.306 CDT: [FATAL ERROR] You forgot to set
>> -Dpegasus.home=$PEGASUS_HOME!
>
> I've seen this error when both of the following conditions are true:
>
> i) Pegasus is installed on the same machine
>
> ii) You are using an old version of swift (build prior to cog r2007,
> 2008-05-12)
>
> Are the above true for you? If so, move to a recent version of swift,
> e.g. 0.6, which is the latest release.
>
> --

From fedorov at cs.wm.edu  Fri Sep 19 09:22:48 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 19 Sep 2008 10:22:48 -0400
Subject: [Swift-user] Swift+MPI+LSF
Message-ID: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>

Hi,

I am trying to use Swift to run an MPI job via the LSF scheduler (TG
Lonestar, http://www.tacc.utexas.edu/services/userguides/lonestar/).
Previously, I had problems running stuff like this with PBS (see
http://mail.ci.uchicago.edu/pipermail/swift-user/2008-July/000443.html).

Right now I am using the solution suggested by Ben (submit a single-node
job, and run a shell wrapper to launch mpirun). This doesn't seem to work
with LSF. I specify "GLOBUS::jobType=single,host_xcount=10", and have my
shell wrapper run

#!/bin/bash
ibrun /home/teragrid/tg457149/meshreg/trunk/build-mpicc/bin/blockMatchingMPI $*

but I only get one node allocated.

According to the Lonestar manual, the number of nodes is specified in the
script like this (note: CPUs are specified in #BSUB, not as an argument
to ibrun):

#!/bin/tcsh                # first line specifies shell
#BSUB -J jobname           # name the job "jobname"
#BSUB -o out.o%J           # output -> out.o<jobID>
#BSUB -e err.o%J           # error  -> error.o<jobID>
#BSUB -n 4 -W 1:30         # 4 CPU cores and 1hr+30min
#BSUB -q normal            # use normal queue
set echo                   # echo all commands
cd $LS_SUBCWD              # cd to directory of submission
ibrun ./a.out              # use ibrun for "pam -g 1 mvapich_wrapper";
                           # CPUs are specified above in the -n option

Is this a known issue? Has anyone run into something like this?

--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov

From hategan at mcs.anl.gov  Fri Sep 19 11:46:03 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Sep 2008 11:46:03 -0500
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
Message-ID: <1221842763.6071.3.camel@localhost>

On Fri, 2008-09-19 at 10:22 -0400, Andriy Fedorov wrote:
> Hi,
>
> I am trying to use Swift to run an MPI job via the LSF scheduler (TG
> Lonestar, http://www.tacc.utexas.edu/services/userguides/lonestar/).
> Previously, I had problems running stuff like this with PBS (see
> http://mail.ci.uchicago.edu/pipermail/swift-user/2008-July/000443.html).
>
> Right now I am using the solution suggested by Ben (submit a single-node
> job, and run a shell wrapper to launch mpirun). This doesn't seem to
> work with LSF. I specify "GLOBUS::jobType=single,host_xcount=10", and
> have my shell wrapper run

Try "count" instead of "host_xcount".

Mihael

From fedorov at cs.wm.edu  Fri Sep 19 12:04:10 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 19 Sep 2008 13:04:10 -0400
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <1221842763.6071.3.camel@localhost>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost>
Message-ID: <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>

On Fri, Sep 19, 2008 at 12:46 PM, Mihael Hategan wrote:
> On Fri, 2008-09-19 at 10:22 -0400, Andriy Fedorov wrote:
>> Right now I am using the solution suggested by Ben (submit a single-node
>> job, and run a shell wrapper to launch mpirun). This doesn't seem to
>> work with LSF. I specify "GLOBUS::jobType=single,host_xcount=10", and
>> have my shell wrapper run
>
> Try "count" instead of "host_xcount".

This was indeed the solution. It seems to work. Thank you, Mihael!

From hategan at mcs.anl.gov  Fri Sep 19 12:13:48 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Sep 2008 12:13:48 -0500
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
Message-ID: <1221844428.7599.1.camel@localhost>

On Fri, 2008-09-19 at 13:04 -0400, Andriy Fedorov wrote:
> On Fri, Sep 19, 2008 at 12:46 PM, Mihael Hategan wrote:
> > Try "count" instead of "host_xcount".
>
> This was indeed the solution. It seems to work. Thank you, Mihael!

It looks like it works on other sites too (I tried mercury at NCSA, which
has PBS). So it looks like UC is broken in this respect. I'll send a
ticket to TG.

From fedorov at cs.wm.edu  Fri Sep 19 12:27:40 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 19 Sep 2008 13:27:40 -0400
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <1221844428.7599.1.camel@localhost>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com> <1221844428.7599.1.camel@localhost>
Message-ID: <82f536810809191027p51f0bf75h4874d09d52771e78@mail.gmail.com>

On Fri, Sep 19, 2008 at 1:13 PM, Mihael Hategan wrote:
> It looks like it works on other sites too (I tried mercury at NCSA, which
> has PBS). So it looks like UC is broken in this respect. I'll send a
> ticket to TG.

UC is different from other TG sites because it is heterogeneous. You need
to use something more than just "count": you need to pass the "compute"
type to the node specs. I use "host_types" on UC, and it works for me...
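For the record, on Lonestar the working combination amounts to these two
profile entries in the sites.xml pool definition (a sketch using the
standard Swift profile syntax; the values are the ones from this thread):

<profile namespace="globus" key="jobType">single</profile>
<profile namespace="globus" key="count">10</profile>

plus the shell wrapper above that calls ibrun.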
From hategan at mcs.anl.gov  Fri Sep 19 12:45:48 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Sep 2008 12:45:48 -0500
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <82f536810809191027p51f0bf75h4874d09d52771e78@mail.gmail.com>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com> <1221844428.7599.1.camel@localhost> <82f536810809191027p51f0bf75h4874d09d52771e78@mail.gmail.com>
Message-ID: <1221846348.8384.1.camel@localhost>

On Fri, 2008-09-19 at 13:27 -0400, Andriy Fedorov wrote:
> On Fri, Sep 19, 2008 at 1:13 PM, Mihael Hategan wrote:
> > It looks like it works on other sites too (I tried mercury at NCSA, which
> > has PBS). So it looks like UC is broken in this respect. I'll send a
> > ticket to TG.
>
> UC is different from other TG sites because it is heterogeneous. You need
> to use something more than just "count": you need to pass the "compute"
> type to the node specs. I use "host_types" on UC, and it works for me...

I see little reason to not have "count" work. And if not, things should
be documented properly.

From benc at hawaga.org.uk  Sat Sep 20 10:49:22 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 20 Sep 2008 15:49:22 +0000 (GMT)
Subject: [Swift-user] multipage user guide
Message-ID:

The user guide is now available in one-page-per-section format, which
some people regard as easier to view online than the existing
one-page-for-whole-guide format.

Both the one-page-per-section and one-page-for-whole-guide formats are
linked from the documentation webpage at
http://www.ci.uchicago.edu/swift/docs/ and both are generated from the
same docbook source, so they should always have the same content.

--

From zhouxy at uchicago.edu  Mon Sep 22 18:20:30 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Mon, 22 Sep 2008 18:20:30 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com> <1221844428.7599.1.camel@localhost>
Message-ID: <176C82530E0344829746E3493B289CF3@VXAVIER>

Hi all,

I am using swift on Teraport, and in some cases I only got 4 nodes! So it
is extremely slow - even slower than a single machine. I have 21 input
files, each about 26 MB. The computation is straightforward:

foreach ifile, i in infile {
    outfile[i] = opt(ifile, myjar);
}

In this case I only got 4 nodes. When the input files are smaller, like
1 MB, the speed and number of nodes are pretty reasonable (about 30
nodes, 10 min).

Does anyone know why this happens? How can I find out how to optimize
this? Is it because my application code uses too much memory, or is too
slow for large data sets?

Thanks,
Xueyuan

From benc at hawaga.org.uk  Mon Sep 22 18:25:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 22 Sep 2008 23:25:54 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <176C82530E0344829746E3493B289CF3@VXAVIER>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com> <1221844428.7599.1.camel@localhost> <176C82530E0344829746E3493B289CF3@VXAVIER>
Message-ID:

On Mon, 22 Sep 2008, Xueyuan Zhou wrote:

> I am using swift on Teraport, and in some cases I only got 4 nodes! So it
> is extremely slow - even slower than a single machine.
Put a log file from a complete run online where I can download it and I
will use the Swift log-processing module to generate some graphs that
might be useful.

(The log file is the file that looks like foo-20089999-9999-abcdef.log)

There are a number of possible causes, and looking at the graphs will
help.

--

From zhouxy at uchicago.edu  Mon Sep 22 20:23:10 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Mon, 22 Sep 2008 20:23:10 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com> <1221844428.7599.1.camel@localhost> <176C82530E0344829746E3493B289CF3@VXAVIER>
Message-ID: <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>

I tried to use less input to finish a complete run, but it failed during
the last input.

All logs and related outputs are in (test3)
/home/zhouxy/dic_parser/swift_script

Thanks,
Xueyuan

----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Monday, September 22, 2008 6:25 PM
Subject: Re: [Swift-user] Why so few nodes ?

> On Mon, 22 Sep 2008, Xueyuan Zhou wrote:
>
>> I am using swift on Teraport, and in some cases I only got 4 nodes! So it
>> is extremely slow - even slower than a single machine.
>
> Put a log file from a complete run online where I can download it and I
> will use the Swift log-processing module to generate some graphs that
> might be useful.
>
> (The log file is the file that looks like foo-20089999-9999-abcdef.log)
>
> There are a number of possible causes, and looking at the graphs will
> help.

From zhouxy at uchicago.edu  Mon Sep 22 20:55:01 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Mon, 22 Sep 2008 20:55:01 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com> <1221844428.7599.1.camel@localhost> <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:

/home/zhouxy/dic_parser/swift_script2 has a successful complete log, with
a much smaller input file than
/home/zhouxy/dic_parser/swift_script

Thanks,
Xueyuan

----- Original Message -----
From: "Xueyuan Zhou"
To: "Ben Clifford"
Cc:
Sent: Monday, September 22, 2008 8:23 PM
Subject: Re: [Swift-user] Why so few nodes ?

> I tried to use less input to finish a complete run, but it failed during
> the last input.
>
> All logs and related outputs are in (test3)
> /home/zhouxy/dic_parser/swift_script
>
> Thanks,
>
> Xueyuan

From fedorov at cs.wm.edu  Tue Sep 23 10:12:14 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 23 Sep 2008 11:12:14 -0400
Subject: [Swift-user] Swift job environment on NCSA Abe
Message-ID: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>

Hi,

I had difficulties running MPI jobs on Abe using the wrapper, so I tried
to debug the problem. It appears that Swift jobs submitted to Abe do not
get their environment initialized properly.

Here's the test shell script I am trying to run on Abe:

[fedorov at TG/Abe:honest1 etc] cat /u/ac/fedorov/local/env_wrapper.sh
#!/usr/local/bin/bash
which mpirun

When I run this using a PBS script directly on Abe, I get this output:

----------------------------------------
Begin Torque Prologue (Tue Sep 23 09:49:36 2008)
Job ID:     545989
Username:   fedorov
Group:      bri
Job Name:   env.pbs
Limits:     ncpus=1,neednodes=abe0726,nodes=1,walltime=00:01:00
Job Queue:  normal
Account:    bri
Nodes:      abe0726
End Torque Prologue
----------------------------------------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/opt/mpich-vmi-2.2.0-3-gcc-ofed-1.2/bin/mpirun
----------------------------------------
Begin Torque Epilogue (Tue Sep 23 09:49:39 2008)
Job ID:     545989
Username:   fedorov
Group:      bri
Job Name:   env.pbs
Session:    721
Limits:     ncpus=1,nodes=1,walltime=00:01:00
Resources:  cput=00:00:00,mem=2960kb,vmem=13044kb,walltime=00:00:03
Job Queue:  normal
Account:    bri
Nodes:      abe0726
Killing leftovers...
End Torque Epilogue
----------------------------------------

Now, when I try to run the same script from Swift, I get:

Execution failed:
    Exception in ABE_env_wrapped:
Arguments: []
Host: Abe-GT4
Directory: site_test-20080923-1107-or2r0ut7/jobs/9/ABE_env_wrapped-9doxytzi
stderr.txt: which: no mpirun in ((null))
stdout.txt:
----
Caused by: Exit code 1

Here's the relevant line from tc.data:

[fedorov at mistral runs.d] grep ABE_env ~/local/vdsk-0.6/etc/tc.data
Abe-GT4   ABE_env_wrapped   /u/ac/fedorov/local/env_wrapper.sh   INSTALLED   INTEL32::LINUX   null

I have this problem only on Abe.

--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov

From benc at hawaga.org.uk  Tue Sep 23 10:21:29 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 15:21:29 +0000 (GMT)
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
Message-ID:

Can you try running the script with the relevant GRAM command line tool
(globus-job-run or globusrun-ws, depending on whether you are using the
gt2 or gt4 provider in Swift) to bisect the problem?

--

From fedorov at cs.wm.edu  Tue Sep 23 10:30:42 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 23 Sep 2008 11:30:42 -0400
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To:
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
Message-ID: <82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>

On Tue, Sep 23, 2008 at 11:21 AM, Ben Clifford wrote:
> Can you try running the script with the relevant GRAM command line tool
> (globus-job-run or globusrun-ws, depending on whether you are using the
> gt2 or gt4 provider in Swift) to bisect the problem?

Right, I should have done that before.
Here's the result - it appears to work with globusrun-ws:

[fedorov at mistral runs.d] globusrun-ws -submit -s -F https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService -Ft PBS -c /u/ac/fedorov/local/env_wrapper.sh
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:f9d5d2a8-8983-11dd-ad02-4d401e450000
Termination time: 09/24/2008 15:26 GMT
Current job state: Pending
Current job state: Active
Current job state: CleanUp-Hold
----------------------------------------
Begin Torque Prologue (Tue Sep 23 10:27:28 2008)
Job ID:     546065
Username:   fedorov
Group:      bri
Job Name:   STDIN
Limits:     ncpus=1,neednodes=abe1103,nodes=1,walltime=00:10:00
Job Queue:  normal
Account:    bri
Nodes:      abe1103
End Torque Prologue
----------------------------------------
/opt/mpich-vmi-2.2.0-3-gcc-ofed-1.2/bin/mpirun
----------------------------------------
Begin Torque Epilogue (Tue Sep 23 10:27:30 2008)
Job ID:     546065
Username:   fedorov
Group:      bri
Job Name:   STDIN
Session:    5071
Limits:     ncpus=1,nodes=1,walltime=00:10:00
Resources:  cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:02
Job Queue:  normal
Account:    bri
Nodes:      abe1103
Killing leftovers...
End Torque Epilogue
----------------------------------------
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.

--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov

From benc at hawaga.org.uk  Tue Sep 23 10:33:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 15:33:07 +0000 (GMT)
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To: <82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com> <82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
Message-ID:

What's in your sites.xml?

--

From wilde at mcs.anl.gov  Tue Sep 23 10:45:05 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 23 Sep 2008 10:45:05 -0500
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To:
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com> <82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
Message-ID: <48D90F01.8050106@mcs.anl.gov>

Andriy, in case this is of use:

To make coasters work on Abe, I also had to, as I recall, remove the "-l"
option from an ssh command generated by coaster setup, and get it to run
my own version of /etc/profile in which I disabled the following code:

> if [ -f /etc/dedicated ] && [ `/usr/bin/id -u` -ne 0 ]
> then
>    grep "^$LOGNAME\$" /etc/dedicated-users > /dev/null
>    if [ $? -ne 0 ] && [ -f /etc/dedicated-users ]
>    then
>       cat /etc/dedicated
>       sleep 5
>       exit
>    else
>       echo "`hostname` is in dedicated user mode"
>    fi
> fi

Otherwise, I would get the "...is in dedicated user mode" error, above.
(I'll look for my patch for this, but can't now; you or Ben may find it
quicker.)

Two other problems I encountered, which have been fixed in SVN, were:

- hostspernode was getting sent into the RSL and needed to be filtered out
- the random number generator called by coasters was blocking on
  /dev/random and needed to be switched to /dev/urandom

When I tested Abe a week ago, with just the /etc/profile fix above, it
seemed to work. My sites entry was:

/u/ac/wilde/swiftwork
8

- Mike

On 9/23/08 10:33 AM, Ben Clifford wrote:
> What's in your sites.xml?
From fedorov at cs.wm.edu  Tue Sep 23 10:46:22 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 23 Sep 2008 11:46:22 -0400
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To:
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com> <82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
Message-ID: <82f536810809230846o1c09e50dubc474f1a1900fd9e@mail.gmail.com>

On Tue, Sep 23, 2008 at 11:33 AM, Ben Clifford wrote:
> What's in your sites.xml?

This was it. Here's the original relevant part of my sites.xml:

url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
/u/ac/fedorov/scratch-global/scratch

In the globusrun-ws test I used the "pbs" jobmanager. This makes a big
difference.
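With the jobmanager switched to PBS, the pool entry looks roughly like
this (a sketch - only the factory URL and the work directory above are
verbatim; the surrounding tags were stripped from the mail):

<pool handle="Abe-GT4">
  <execution provider="gt4" jobmanager="PBS"
      url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>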
I assume the environment is initialized differently for fork jobs than
for PBS jobs. When I repeat the globusrun test with the fork jobmanager,
I have the same problem, so it's obviously not Swift.

After substituting the jobmanager from Fork to PBS in sites.xml, Swift
works as expected. Thank you for helping to figure this out!

--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov

From benc at hawaga.org.uk  Tue Sep 23 11:02:13 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 16:02:13 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com> <1221844428.7599.1.camel@localhost> <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:

On Mon, 22 Sep 2008, Xueyuan Zhou wrote:

> /home/zhouxy/dic_parser/swift_script2 has a successful complete log, with
> a much smaller input file than
> /home/zhouxy/dic_parser/swift_script

I plotted that log file here:

http://www.ci.uchicago.edu/~benc/tmp/report-test3-20080922-2040-h1383jdc/

The log file seems to suggest that between 5 and 12 jobs are running at
any one time according to the submit side, after about 200 seconds into
the run.

To begin with, each site will only get two nodes at once; as your run
progresses successfully on a site, that site will be given more nodes.

Have a look at the graph labelled:

execute2 tasks, coloured by site

and you can see how many jobs are sent to each site at once.

You have two sites defined - localhost and teraport. The localhost
definition looks a bit suspicious, though - the usual swift localhost
definition never allows more than 2 jobs to run at once, but on your run
you often have more than that.

What does your sites.xml file look like?

--

From hategan at mcs.anl.gov  Tue Sep 23 11:20:38 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 23 Sep 2008 11:20:38 -0500
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To: <82f536810809230846o1c09e50dubc474f1a1900fd9e@mail.gmail.com>
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com> <82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com> <82f536810809230846o1c09e50dubc474f1a1900fd9e@mail.gmail.com>
Message-ID: <1222186838.31824.2.camel@localhost>

On Tue, 2008-09-23 at 11:46 -0400, Andriy Fedorov wrote:
> On Tue, Sep 23, 2008 at 11:33 AM, Ben Clifford wrote:
> > What's in your sites.xml?
>
> This was it.
>
> Here's the original relevant part of my sites.xml:
>
> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
> /u/ac/fedorov/scratch-global/scratch
>
> In the globusrun-ws test I used the "pbs" jobmanager. This makes a big
> difference. I assume the environment is initialized differently for fork
> jobs than for PBS jobs. When I repeat the globusrun test with the fork
> jobmanager, I have the same problem, so it's obviously not Swift.
>
> After substituting the jobmanager from Fork to PBS in sites.xml, Swift
> works as expected.

"Fork" is Globus for "fork a job on the *head node* and don't use the
queuing system" - which is probably not what you want in general.

Mihael

From zhouxy at uchicago.edu  Tue Sep 23 11:30:04 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 11:30:04 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com> <1221842763.6071.3.camel@localhost> <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com> <1221844428.7599.1.camel@localhost> <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:

My sites.xml is:

/home/zhouxy/swift/working
fast

and

/home/zhouxy/swift/working
fast

I ran a longer job last night, which is at
/home/zhouxy/dic_parser/swift_script3; it has 10 times more jobs than
/home/zhouxy/dic_parser/swift_script2.

Still, except at the very beginning and end (< 4 nodes), the most nodes I
can get is about 4; since the nodes seem to be dual-core, that is 8
running jobs. I also noticed there are some free nodes there. I am
wondering why I can only have about 8 running jobs (on 4 nodes) and about
20-30 in the queue while dozens of nodes are free. In this case, it is
acting like a slightly faster single machine.

If I could have more running jobs, the "execute2 tasks, coloured by site"
graph would be steeper, I think. But it seems swift does not do that.

Thanks.

----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Tuesday, September 23, 2008 11:02 AM
Subject: Re: [Swift-user] Why so few nodes ?

> On Mon, 22 Sep 2008, Xueyuan Zhou wrote:
>
>> /home/zhouxy/dic_parser/swift_script2 has a successful complete log, with
>> a much smaller input file than
>> /home/zhouxy/dic_parser/swift_script
>
> I plotted that log file here:
>
> http://www.ci.uchicago.edu/~benc/tmp/report-test3-20080922-2040-h1383jdc/
>
> The log file seems to suggest that between 5 and 12 jobs are running at
> any one time according to the submit side, after about 200 seconds into
> the run.
>
> To begin with, each site will only get two nodes at once; as your run
> progresses successfully on a site, that site will be given more nodes.
>
> Have a look at the graph labelled:
>
> execute2 tasks, coloured by site
>
> and you can see how many jobs are sent to each site at once.
>
> You have two sites defined - localhost and teraport. The localhost
> definition looks a bit suspicious, though - the usual swift localhost
> definition never allows more than 2 jobs to run at once, but on your run
> you often have more than that.
>
> What does your sites.xml file look like?
>
> --

From benc at hawaga.org.uk  Tue Sep 23 11:37:50 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 16:37:50 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:

On Tue, 23 Sep 2008, Xueyuan Zhou wrote:

> /home/zhouxy/swift/working
> fast

What machine are you running your swift command on?
--

From zhouxy at uchicago.edu  Tue Sep 23 11:41:01 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 11:41:01 -0500
Subject: [Swift-user] Why so few nodes ?
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:

Most on tp-login1.ci.uchicago.edu and some on tp-login2.

----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Tuesday, September 23, 2008 11:37 AM
Subject: Re: [Swift-user] Why so few nodes ?

> On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
>
>> /home/zhouxy/swift/working
>> fast
>
> What machine are you running your swift command on?
>
> --

From benc at hawaga.org.uk  Tue Sep 23 11:47:48 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 16:47:48 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:

On Tue, 23 Sep 2008, Xueyuan Zhou wrote:

> Most on tp-login1.ci.uchicago.edu and some on tp-login2.

OK. So you have two site definitions that will both, through different
mechanisms, go into the teraport PBS queue. That probably isn't going to
hurt things, but you could use just one of those definitions.

I'm looking at your swift_script3 logs now...

--

From zhouxy at uchicago.edu  Tue Sep 23 11:50:48 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 11:50:48 -0500
Subject: [Swift-user] Why so few nodes ?
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:

Thanks. I remember that if I only use teraport, it seems I get just half
the nodes. I will try again to make sure.

----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Tuesday, September 23, 2008 11:47 AM
Subject: Re: [Swift-user] Why so few nodes ?

> On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
>
>> Most on tp-login1.ci.uchicago.edu and some on tp-login2.
>
> OK. So you have two site definitions that will both, through different
> mechanisms, go into the teraport PBS queue. That probably isn't going to
> hurt things, but you could use just one of those definitions.
>
> I'm looking at your swift_script3 logs now...
>
> --

From benc at hawaga.org.uk  Tue Sep 23 11:54:10 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 16:54:10 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:

On Tue, 23 Sep 2008, Xueyuan Zhou wrote:

> Thanks. I remember that if I only use teraport, it seems I get just half
> the nodes. I will try again to make sure.

It will be half, throughout the whole run.
To use more, you can adjust the jobThrottle parameter.

I'm not sure which is better to use - the gt2 provider or the pbs
provider. The gt2 provider is much more tested; the pbs provider might
put less load on the head node (or might not) - not many people have
used it.

--

From hategan at mcs.anl.gov  Tue Sep 23 12:04:05 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 23 Sep 2008 12:04:05 -0500
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID: <1222189445.32470.2.camel@localhost>

On Tue, 2008-09-23 at 11:30 -0500, Xueyuan Zhou wrote:
> Still, except at the very beginning and end (< 4 nodes), the most nodes I
> can get is about 4; since the nodes seem to be dual-core, that is 8
> running jobs.

How do you reach that conclusion? As far as I can tell, by default, one
job maps to one node (regardless of the number of cores).

> I also noticed there are some free nodes there. I am wondering why I can
> only have about 8 running jobs (on 4 nodes) and about 20-30 in the queue
> while dozens of nodes are free.

Could it be because you're using the "fast" queue, which may have a
limitation on the maximum number of nodes you can get at one time?

From zhouxy at uchicago.edu  Tue Sep 23 12:12:27 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 12:12:27 -0500
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <1222189445.32470.2.camel@localhost>
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER> <1222189445.32470.2.camel@localhost>
Message-ID:

> On Tue, 2008-09-23 at 11:30 -0500, Xueyuan Zhou wrote:
>> Still, except at the very beginning and end (< 4 nodes), the most nodes I
>> can get is about 4; since the nodes seem to be dual-core, that is 8
>> running jobs.
>
> How do you reach that conclusion? As far as I can tell, by default, one
> job maps to one node (regardless of the number of cores).

I check qstat -n -1 -u zhouxy frequently, and I think Ben's graphs also
confirm this. It looks like this:

703373.tp-mgt.ci.uch zhouxy fast STDIN 16543 1 -- -- 01:00 R -- tp-c114/0
703374.tp-mgt.ci.uch zhouxy fast STDIN 16863 1 -- -- 01:00 R -- tp-c114/1
703375.tp-mgt.ci.uch zhouxy fast STDIN 27455 1 -- -- 01:00 R -- tp-c064/0
703376.tp-mgt.ci.uch zhouxy fast STDIN 27884 1 -- -- 01:00 R -- tp-c064/1
703377.tp-mgt.ci.uch zhouxy fast STDIN 22513 1 -- -- 01:00 R -- tp-c043/0
703378.tp-mgt.ci.uch zhouxy fast STDIN 13364 1 -- -- 01:00 R -- tp-c119/0
703379.tp-mgt.ci.uch zhouxy fast STDIN 13673 1 -- -- 01:00 R -- tp-c119/1
703380.tp-mgt.ci.uch zhouxy fast STDIN --    1 -- -- 01:00 R -- tp-c118/0

>> I also noticed there are some free nodes there. I am wondering why I can
>> only have about 8 running jobs (on 4 nodes) and about 20-30 in the queue
>> while dozens of nodes are free.
>
> Could it be because you're using the "fast" queue, which may have a
> limitation on the maximum number of nodes you can get at one time?

I tried it without "fast" for a task with 26 MB input files - even
slower; the jobs all go into the queue and wait there. And there are even
fewer nodes for larger input files, seemingly only 2 nodes. I'll try some
more.

From benc at hawaga.org.uk  Tue Sep 23 12:16:40 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 17:16:40 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER> <1222189445.32470.2.camel@localhost>
Message-ID:

On Tue, 23 Sep 2008, Xueyuan Zhou wrote:

> I also noticed there are some free nodes there. I am wondering why I can
> only have about 8 running jobs (on 4 nodes) and about 20-30 in the queue
> while dozens of nodes are free.

If you have jobs sitting in the queue on teraport, then that is something
to do with the way that the queueing system is working on teraport.

I put graphs here:

http://www.ci.uchicago.edu/~benc/tmp/report-test1M-20080922-2203-gamhpzp1/

It looks like you're getting around 8 jobs running at once, which matches
up with your observation in the above quote.

Mihael says this:

> Could it be because you're using the "fast" queue, which may have a
> limitation on the maximum number of nodes you can get at one time?

though the queue policy page here:

http://www.ci.uchicago.edu/wiki/bin/view/Teraport/QueuePolicies

does not mention anything about it. Though the wiki also contradicts this
from Mihael:

> How do you reach that conclusion? As far as I can tell, by default, one
> job maps to one node (regardless of the number of cores).

by saying:

> Single-processor jobs submitted by a user are paired up on a single node
> to maximize cluster efficiency.

I haven't tried that myself.

--

From benc at hawaga.org.uk  Tue Sep 23 12:55:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 17:55:27 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <1222189445.32470.2.camel@localhost>
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER> <1222189445.32470.2.camel@localhost>
Message-ID:

On Tue, 23 Sep 2008, Mihael Hategan wrote:

> Could it be because you're using the "fast" queue, which may have a
> limitation on the maximum number of nodes you can get at one time?

Another thing might be scheduling delay in the teraport scheduler. Not
really sure if that would affect things, but the jobs in this workflow
seem quite short.

At least based on the Submitted->Active->Completed transitions in
karajan, jobs seem to only take 10..45 seconds.

See the graph I just added, titled:

karajan active JOB_SUBMISSION cumulative duration

on

http://www.ci.uchicago.edu/~benc/tmp/report-test1M-20080922-2203-gamhpzp1/

--

From zhouxy at uchicago.edu  Tue Sep 23 13:01:36 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 13:01:36 -0500
Subject: [Swift-user] Why so few nodes ?
References: <176C82530E0344829746E3493B289CF3@VXAVIER> <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER> <1222189445.32470.2.camel@localhost>
Message-ID: <59B9BC637E9940E49144BC58098C329C@VXAVIER>

Thanks a lot. Yes, the jobs are not long.

I am wondering, is there any priority setting for different users?
Because what I got are really short jobs, yet quite a few nodes are free
instead of being allocated to me.

----- Original Message -----
From: "Ben Clifford"
To: "Mihael Hategan"
Cc: "Xueyuan Zhou" ;
Sent: Tuesday, September 23, 2008 12:55 PM
Subject: Re: [Swift-user] Why so few nodes ?

> On Tue, 23 Sep 2008, Mihael Hategan wrote:
>
>> Might it be because you're using the "fast" queue, which may have a
>> limitation on the maximum number of nodes you can get at one time?
>
> Another thing might be scheduling delay in the teraport scheduler. I'm
> not really sure whether that would affect things, but the jobs in this
> workflow seem quite short.
>
> At least based on the Submitted->Active->Completed transitions in
> karajan, jobs seem to take only 10..45 seconds.
>
> See the graph I just added, titled:
>
> karajan active JOB_SUBMISSION cumulative duration
>
> on
> http://www.ci.uchicago.edu/~benc/tmp/report-test1M-20080922-2203-gamhpzp1/
>
> --

From benc at hawaga.org.uk Tue Sep 23 13:07:32 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 18:07:32 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <59B9BC637E9940E49144BC58098C329C@VXAVIER>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
    <1221842763.6071.3.camel@localhost>
    <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
    <1221844428.7599.1.camel@localhost>
    <176C82530E0344829746E3493B289CF3@VXAVIER>
    <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
    <1222189445.32470.2.camel@localhost>
    <59B9BC637E9940E49144BC58098C329C@VXAVIER>
Message-ID: 

On Tue, 23 Sep 2008, Xueyuan Zhou wrote:

> I am wondering, is there any priority setting for different users?
> Because what I got are really short jobs, yet quite a few nodes are free
> instead of being allocated to me.

No idea. You could ask the teraport admins.

--

From benc at hawaga.org.uk Tue Sep 23 13:11:25 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 18:11:25 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: 
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
    <1221842763.6071.3.camel@localhost>
    <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
    <1221844428.7599.1.camel@localhost>
    <176C82530E0344829746E3493B289CF3@VXAVIER>
    <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
    <1222189445.32470.2.camel@localhost>
    <59B9BC637E9940E49144BC58098C329C@VXAVIER>
Message-ID: 

If this is coming from queuing delays in teraport's PBS, maybe trying
coasters would help in this situation. I'm not sure whether they work on
teraport at the moment or not. If they do, see the paragraph about
coasters at the end of section 16.3 here:

http://www.ci.uchicago.edu/swift/guides/userguide/sitecatalog.php

--
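For reference, switching a site to coasters is mostly a change to the
execution element; a sketch under the same assumed gatekeeper hostname as
above (illustrative only - the user guide section cited above is the
authoritative reference):

  <pool handle="teraport-coasters">
    <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu"/>
    <!-- bootstrap the coaster service via GRAM2; it then holds worker
         jobs in PBS and reuses them, so each short task skips the
         per-job PBS queue wait -->
    <execution provider="coaster" url="tp-grid1.ci.uchicago.edu"
               jobmanager="gt2:pbs"/>
    <workdirectory>/home/zhouxy/swiftwork</workdirectory>
  </pool>

With 10..45 second jobs, a per-job queue wait is exactly the delay under
discussion, which is why coasters could help here.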
From zhouxy at uchicago.edu Tue Sep 23 13:12:09 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 13:12:09 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
    <1221842763.6071.3.camel@localhost>
    <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
    <1221844428.7599.1.camel@localhost>
    <176C82530E0344829746E3493B289CF3@VXAVIER>
    <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
    <1222189445.32470.2.camel@localhost>
    <59B9BC637E9940E49144BC58098C329C@VXAVIER>
Message-ID: <112EC5CDCDC34B58AB0A06CBAEF8A41B@VXAVIER>

Thanks guys, it helps a lot.

----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc: "Mihael Hategan" ;
Sent: Tuesday, September 23, 2008 1:07 PM
Subject: Re: [Swift-user] Why so few nodes ?

> On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
>
>> I am wondering, is there any priority setting for different users?
>> Because what I got are really short jobs, yet quite a few nodes are
>> free instead of being allocated to me.
>
> No idea. You could ask the teraport admins.
>
> --

From benc at hawaga.org.uk Tue Sep 23 13:25:10 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 18:25:10 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <1222189445.32470.2.camel@localhost>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
    <1221842763.6071.3.camel@localhost>
    <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
    <1221844428.7599.1.camel@localhost>
    <176C82530E0344829746E3493B289CF3@VXAVIER>
    <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
    <1222189445.32470.2.camel@localhost>
Message-ID: 

On Tue, 23 Sep 2008, Mihael Hategan wrote:

> Might it be because you're using the "fast" queue, which may have a
> limitation on the maximum number of nodes you can get at one time?

There is evidence that sending jobs twice as fast (by defining two Swift
sites for teraport) made twice as many jobs run. That counts against the
argument that there is a maximum node count, and supports there being some
long scheduling delay in the LRM, I think.

I guess increasing the jobThrottle to a larger number would therefore get
more jobs into the queue and more jobs running, at the cost of putting
more load on the head node.

Xueyuan, add this line to one of your site definitions (but not both):

<profile namespace="karajan" key="jobThrottle">0.5</profile>

then make a run and send a log.

--

From hategan at mcs.anl.gov Tue Sep 23 13:34:31 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 23 Sep 2008 13:34:31 -0500
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <59B9BC637E9940E49144BC58098C329C@VXAVIER>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
    <1221842763.6071.3.camel@localhost>
    <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
    <1221844428.7599.1.camel@localhost>
    <176C82530E0344829746E3493B289CF3@VXAVIER>
    <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
    <1222189445.32470.2.camel@localhost>
    <59B9BC637E9940E49144BC58098C329C@VXAVIER>
Message-ID: <1222194871.1798.0.camel@localhost>

On Tue, 2008-09-23 at 13:01 -0500, Xueyuan Zhou wrote:
> Thanks a lot. Yes, the jobs are not long.
>
> I am wondering, is there any priority setting for different users?
> Because what I got are really short jobs, yet quite a few nodes are free
> instead of being allocated to me.

The "debug" queue has a higher priority but a limited number of nodes
allocated (I think 4).

From zhouxy at uchicago.edu Tue Sep 23 13:38:24 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 13:38:24 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
    <1221842763.6071.3.camel@localhost>
    <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
    <1221844428.7599.1.camel@localhost>
    <176C82530E0344829746E3493B289CF3@VXAVIER>
    <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
    <1222189445.32470.2.camel@localhost>
    <59B9BC637E9940E49144BC58098C329C@VXAVIER>
    <1222194871.1798.0.camel@localhost>
Message-ID: 

Ben, I am running the same task with jobThrottle 0.5 now.

Mihael, by the "debug" queue, do you mean the "fast" queue I used? So if I
use "extended", might I get more nodes?

----- Original Message -----
From: "Mihael Hategan"
To: "Xueyuan Zhou"
Cc: "Ben Clifford" ;
Sent: Tuesday, September 23, 2008 1:34 PM
Subject: Re: [Swift-user] Why so few nodes ?

> On Tue, 2008-09-23 at 13:01 -0500, Xueyuan Zhou wrote:
>> Thanks a lot. Yes, the jobs are not long.
>>
>> I am wondering, is there any priority setting for different users?
>> Because what I got are really short jobs, yet quite a few nodes are
>> free instead of being allocated to me.
>
> The "debug" queue has a higher priority but a limited number of nodes
> allocated (I think 4).
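As a rough guide to the jobThrottle value Ben suggests above: the karajan
scheduler's per-site cap on concurrent jobs works out to roughly
jobThrottle * 100 + 1 (so 2.55 gives 256, and 0.5 about 51). That reading
is taken from Swift documentation of this period and is an assumption
worth checking against the release in use. In a site definition the line
sits with the other profile entries, e.g. (queue name illustrative):

  <pool handle="teraport">
    ...
    <profile namespace="globus" key="queue">fast</profile>
    <profile namespace="karajan" key="jobThrottle">0.5</profile>
    ...
  </pool>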
From zhouxy at uchicago.edu Tue Sep 23 13:49:55 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 13:49:55 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
    <1221842763.6071.3.camel@localhost>
    <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
    <1221844428.7599.1.camel@localhost>
    <176C82530E0344829746E3493B289CF3@VXAVIER>
    <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
    <1222189445.32470.2.camel@localhost>
Message-ID: <399D9AF1930247A1AB1BB4EA89C1C3B6@VXAVIER>

The result is in /home/zhouxy/dic_parser/swift_script4

One thing I noticed is that when I use only one site, it starts with 2
running jobs, then increases, and finally also reaches 8 running jobs -
the same as with two sites. So the total time with one site and with two
sites is not very different.

----- Original Message -----
From: "Ben Clifford"
To: "Mihael Hategan"
Cc: "Xueyuan Zhou" ;
Sent: Tuesday, September 23, 2008 1:25 PM
Subject: Re: [Swift-user] Why so few nodes ?

> On Tue, 23 Sep 2008, Mihael Hategan wrote:
>
>> Might it be because you're using the "fast" queue, which may have a
>> limitation on the maximum number of nodes you can get at one time?
>
> There is evidence that sending jobs twice as fast (by defining two Swift
> sites for teraport) made twice as many jobs run. That counts against the
> argument that there is a maximum node count, and supports there being
> some long scheduling delay in the LRM, I think.
>
> I guess increasing the jobThrottle to a larger number would therefore
> get more jobs into the queue and more jobs running, at the cost of
> putting more load on the head node.
>
> Xueyuan, add this line to one of your site definitions (but not both):
>
> <profile namespace="karajan" key="jobThrottle">0.5</profile>
>
> then make a run and send a log.
>
> --
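The ramp Xueyuan describes above (2 running jobs climbing to 8) is
consistent with Swift's adaptive site score, which starts each site at a
low score and raises it as jobs succeed, so full throughput is reached
only part-way through a run. If the Swift release in use exposes the
karajan initialScore profile key (an assumption - check the release
notes), seeding a high score would skip the ramp:

  <!-- start the site near full score instead of ramping up; the key and
       value here are illustrative, not taken from this thread -->
  <profile namespace="karajan" key="initialScore">10000</profile>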