From abejan at ci.uchicago.edu Tue Sep 2 11:13:20 2008
From: abejan at ci.uchicago.edu (Alina Bejan)
Date: Tue, 02 Sep 2008 11:13:20 -0500
Subject: [Swift-user] swift on fermigrid site question
Message-ID: <48BD6620.4040504@ci.uchicago.edu>
Hello --
I have the following very basic problem: I am trying to run the
Hello World Swift program on the fermigrid site (with OSG VO
credentials). I am getting the following error:
[abejan at communicado examples]$ swift -tc.file tc.data -sites.file
fermi.xml first.swift
Swift 0.5 swift-r1783 cog-r1962
RunID: 20080902-1035-1dnbi226
Progress:
echo started
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Failed but can retry:1
Failed to transfer wrapper log from
first-20080902-1035-1dnbi226/info/t/fermigridosg1.fnal.gov
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Where:
tc.data is only one line:
fermigridosg1.fnal.gov echo /bin/echo INSTALLED INTEL32::LINUX null
and the sites file is fermi.xml:
/grid/data
The expected result (writing "Hello" to a hello.txt file) is not generated.
Could you please explain what is wrong?
Thanks,
Alina
From benc at hawaga.org.uk Tue Sep 2 11:23:18 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 2 Sep 2008 16:23:18 +0000 (GMT)
Subject: [Swift-user] swift on fermigrid site question
In-Reply-To: <48BD6620.4040504@ci.uchicago.edu>
References: <48BD6620.4040504@ci.uchicago.edu>
Message-ID:
On Tue, 2 Sep 2008, Alina Bejan wrote:
> I have the following very basic problem: I am trying to run the Hello
> World swift program on the fermigrid site (with OSG VO credentials). I am
> getting the following error:
Without seeing the log file, my first guess would be that something is
failing after the site has had your job in a queue for a while.
Try these two commands:
globus-job-run fermigridosg1.fnal.gov/jobmanager-condor /bin/echo foo
globus-url-copy file:///etc/group gsiftp://fermigridosg1.fnal.gov/grid/data/abejan-test-1
and post the results.
--
From abejan at ci.uchicago.edu Tue Sep 2 11:38:12 2008
From: abejan at ci.uchicago.edu (Alina Bejan)
Date: Tue, 02 Sep 2008 11:38:12 -0500
Subject: [Swift-user] swift on fermigrid site question
In-Reply-To:
References: <48BD6620.4040504@ci.uchicago.edu>
Message-ID: <48BD6BF4.4070302@ci.uchicago.edu>
Alright, these are the outputs:
[abejan at communicado examples]$ globus-job-run fermigridosg1.fnal.gov/jobmanager-condor /bin/echo foo
condor_exec.exe: /lib/tls/libc.so.6: version `GLIBC_2.4' not found (required by condor_exec.exe)
[abejan at communicado examples]$ globus-url-copy file:///etc/group gsiftp://fermigridosg1.fnal.gov/grid/data/abejan-test-1
GlobusUrlCopy error: UrlCopy transfer failed. [Caused by: etc/group (No such file or directory)]
For the logs, here's the path: /home/abejan/workflow-ex/examples
Thanks.
Alina
Ben Clifford wrote:
> On Tue, 2 Sep 2008, Alina Bejan wrote:
>
>
>> I have the following very basic problem: I am trying to run the Hello
>> World swift program on the fermigrid site (with OSG VO credentials). I am
>> getting the following error:
>>
>
> Without seeing the log file, my first guess would be that something is
> failing after the site has had your job in a queue for a while.
>
> Try these two commands:
>
> globus-job-run fermigridosg1.fnal.gov/jobmanager-condor /bin/echo foo
>
> globus-url-copy file:///etc/group
> gsiftp://fermigridosg1.fnal.gov/grid/data/abejan-test-1
>
> and post the results.
>
>
From benc at hawaga.org.uk Tue Sep 2 11:42:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 2 Sep 2008 16:42:27 +0000 (GMT)
Subject: [Swift-user] swift on fermigrid site question
In-Reply-To: <48BD6BF4.4070302@ci.uchicago.edu>
References: <48BD6620.4040504@ci.uchicago.edu>
<48BD6BF4.4070302@ci.uchicago.edu>
Message-ID:
On Tue, 2 Sep 2008, Alina Bejan wrote:
> [abejan at communicado examples]$ globus-job-run
> fermigridosg1.fnal.gov/jobmanager-condor /bin/echo foo
>
> condor_exec.exe: /lib/tls/libc.so.6: version `GLIBC_2.4' not found (required
> by condor_exec.exe)
OK, so that's a problem with running jobs in Condor. It's probably
best to talk to the Fermilab site admins and show them this.
> [abejan at communicado examples]$ globus-url-copy file:///etc/group
> gsiftp://fermigridosg1.fnal.gov/grid/data/abejan-test-1
> GlobusUrlCopy error: UrlCopy transfer failed. [Caused by: etc/group (No such
> file or directory)
My bad - you're not using the Globus Toolkit version of globus-url-copy, so
you need a different command line (the cog version uses different URL
semantics). However, the globus-job-run problem looks like the main
one to investigate.
--
From fedorov at cs.wm.edu Thu Sep 4 09:39:44 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 4 Sep 2008 10:39:44 -0400
Subject: [Swift-user] Swift scheduler
In-Reply-To:
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
Message-ID: <82f536810809040739j4b92e27bq4bc32992b75b706@mail.gmail.com>
>> Can any of the developers point me to the specific part of the source
>> that is responsible for scheduling, so that I could try to figure this
>> out myself?
>
> Start here:
>
> cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java.
>
I looked at the latest cog release, r2130, and line 342 says
else if ("best".equals("value")) {
Am I wrong, or are the quotes around value here a bug? It looks like
POLICY_BEST_SCORE is never used because of this.
--
Andrey Fedorov
From hategan at mcs.anl.gov Thu Sep 4 10:38:47 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 04 Sep 2008 10:38:47 -0500
Subject: [Swift-user] Swift scheduler
In-Reply-To: <82f536810809040739j4b92e27bq4bc32992b75b706@mail.gmail.com>
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
<82f536810809040739j4b92e27bq4bc32992b75b706@mail.gmail.com>
Message-ID: <1220542727.5823.18.camel@localhost>
On Thu, 2008-09-04 at 10:39 -0400, Andriy Fedorov wrote:
> >> Can any of the developers point me to the specific part of the source
> >> that is responsible for scheduling, so that I could try to figure this
> >> out myself?
> >
> > Start here:
> >
> > cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java.
> >
>
> I look at the latest cog release 2130, and line 342 says
>
> else if ("best".equals("value")) {
>
> Am I wrong, or value in quotes here is a bug? It looks like
> POLICY_BEST_SCORE is never used because of this.
That is funny, but yes, it is a bug. However, that selection policy is
not used in Swift anyway, nor is it very useful otherwise. The weighted
random selection distributes load more smoothly across sites.
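As a shell-flavored sketch of the dead branch (the original code is Java, so the names here are only illustrative):

```shell
value="best"

# Buggy form: two string literals are compared, so the condition is
# always false and the "best score" branch can never be taken.
if [ "best" = "value" ]; then
    taken="buggy"
fi

# Fixed form: compare the literal against the variable's contents.
if [ "best" = "$value" ]; then
    taken="fixed"
fi

echo "$taken"
```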
Anyway, I fixed it in SVN.
Mihael
From fedorov at cs.wm.edu Tue Sep 9 21:55:50 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 9 Sep 2008 22:55:50 -0400
Subject: [Swift-user] Swift build problems
Message-ID: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
Hi,
While trying to build a development checkout of Swift, I hit the
problem below: a strange warning about running out of memory, followed
by a failure that doesn't make sense. I saw similar reports about
other packages that use ant, but I didn't find a general solution.
I have openSUSE 11.0; uname -a says:
Linux beat 2.6.25.11-0.1-default #1 SMP 2008-07-13 20:48:28 +0200 x86_64 x86_64 x86_64 GNU/Linux
Has anyone seen something like this?
[andrey at beat vdsk] ant dist
Buildfile: build.xml
generateVersion:
antlr:
[java] ANTLR Parser Generator Version 2.7.5 (20050128)
1989-2005 jGuru.com
[java] resources/swiftscript.g:944: warning:nondeterminism upon
[java] resources/swiftscript.g:944: k==1:LBRACK
[java] resources/swiftscript.g:944:
k==2:ID,STRING_LITERAL,LBRACK,LPAREN,AT,PLUS,MINUS,STAR,NOT,INT_LITERAL,FLOAT_LITERAL,"true","false"
[java] resources/swiftscript.g:944: between alt 1 and exit
branch of block
compileSchema:
GC Warning: Out of Memory! Returning NIL!
GC Warning: Repeated allocation of very large block (appr. size 131072000):
May lead to memory leak and poor performance.
[java] Time to build schema type system: 67.854 seconds
[java] Time to generate code: 6.224 seconds
[java] java.io.IOException: Cannot run program
"/home/andrey/local/src/cog/modules/vdsk/javac": java.io.IOException:
error=2, No such file or directory
[java] java.io.IOException: java.io.IOException: error=2, No such
file or directory
[java] java.io.IOException: Cannot run program
"/home/andrey/local/src/cog/modules/vdsk/javac": java.io.IOException:
error=2, No such file or directory
[java] at java.lang.ProcessBuilder.start(ProcessBuilder.java:474)
[java] at java.lang.Runtime.exec(Runtime.java:610)
[java] at java.lang.Runtime.exec(Runtime.java:483)
[java] at
org.apache.xmlbeans.impl.tool.CodeGenUtil.externalCompile(CodeGenUtil.java:231)
[java] at
org.apache.xmlbeans.impl.tool.SchemaCompiler.compile(SchemaCompiler.java:1126)
[java] at
org.apache.xmlbeans.impl.tool.SchemaCompiler.main(SchemaCompiler.java:368)
[java] Caused by: java.io.IOException: java.io.IOException:
error=2, No such file or directory
[java] at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
[java] at java.lang.ProcessImpl.start(ProcessImpl.java:81)
[java] at java.lang.ProcessBuilder.start(ProcessBuilder.java:467)
[java] ... 5 more
[java] BUILD FAILED
BUILD FAILED
/home/andrey/local/src/cog/modules/vdsk/build.xml:152: Java returned: 1
Total time: 1 minute 30 seconds
--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov
From hategan at mcs.anl.gov Tue Sep 9 23:14:48 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 09 Sep 2008 23:14:48 -0500
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
Message-ID: <1221020088.5312.2.camel@localhost>
On Tue, 2008-09-09 at 22:55 -0400, Andriy Fedorov wrote:
> [java] java.io.IOException: Cannot run program
> "/home/andrey/local/src/cog/modules/vdsk/javac": java.io.IOException:
> error=2, No such file or directory
Is JAVA_HOME set?
From benc at hawaga.org.uk Wed Sep 10 01:56:21 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Sep 2008 06:56:21 +0000 (GMT)
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
Message-ID:
what java are you using?
$ java -version
$ javac -version
--
From fedorov at cs.wm.edu Wed Sep 10 06:44:18 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 10 Sep 2008 07:44:18 -0400
Subject: [Swift-user] Swift build problems
In-Reply-To:
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
Message-ID: <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
[andrey at beat cog] echo $JAVA_HOME
/usr/lib64/jvm/java
[andrey at beat cog] java -version
java version "1.6.0"
IcedTea Runtime Environment (build 1.6.0-b09)
OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
[andrey at beat cog] javac -version
javac: unrecognized option '-version'
javac: no input files
[andrey at beat cog]
On Wed, Sep 10, 2008 at 2:56 AM, Ben Clifford wrote:
> what java are you using?
>
>
> $ java -version
> $ javac -version
>
> --
>
>
>
From benc at hawaga.org.uk Wed Sep 10 06:49:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Sep 2008 11:49:46 +0000 (GMT)
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
<82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
Message-ID:
I've not heard of that one before. Have you used it much for compiling
non-trivial stuff?
On Wed, 10 Sep 2008, Andriy Fedorov wrote:
> [andrey at beat cog] echo $JAVA_HOME
> /usr/lib64/jvm/java
> [andrey at beat cog] java -version
> java version "1.6.0"
> IcedTea Runtime Environment (build 1.6.0-b09)
> OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
> [andrey at beat cog] javac -version
> javac: unrecognized option '-version'
> javac: no input files
> [andrey at beat cog]
>
>
>
> On Wed, Sep 10, 2008 at 2:56 AM, Ben Clifford wrote:
> > what java are you using?
> >
> >
> > $ java -version
> > $ javac -version
> >
> > --
> >
> >
> >
>
>
From fedorov at cs.wm.edu Wed Sep 10 08:06:26 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 10 Sep 2008 09:06:26 -0400
Subject: [Swift-user] Swift build problems
In-Reply-To:
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
<82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
Message-ID: <82f536810809100606r49ceefd4jb915b59d79529f35@mail.gmail.com>
On Wed, Sep 10, 2008 at 7:49 AM, Ben Clifford wrote:
>
> I've not heard of that one before. Have you used it much for compiling
> non-trivial stuff?
>
No, I haven't...
>
> On Wed, 10 Sep 2008, Andriy Fedorov wrote:
>
>> [andrey at beat cog] echo $JAVA_HOME
>> /usr/lib64/jvm/java
>> [andrey at beat cog] java -version
>> java version "1.6.0"
>> IcedTea Runtime Environment (build 1.6.0-b09)
>> OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
>> [andrey at beat cog] javac -version
>> javac: unrecognized option '-version'
>> javac: no input files
>> [andrey at beat cog]
>>
>>
>>
>> On Wed, Sep 10, 2008 at 2:56 AM, Ben Clifford wrote:
>> > what java are you using?
>> >
>> >
>> > $ java -version
>> > $ javac -version
>> >
>> > --
>> >
>> >
>> >
>>
>>
>
From hategan at mcs.anl.gov Wed Sep 10 10:30:46 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Sep 2008 10:30:46 -0500
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
<82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
Message-ID: <1221060646.16054.4.camel@localhost>
That tool is looking for a javac in ${java.home}/bin/ (and I assume it
sets that system property from JAVA_HOME). Not finding it, it tries the
current directory.
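A minimal shell sketch of that lookup order (this is an assumption about the tool's behavior, not the actual xmlbeans code; `find_javac` is a hypothetical helper):

```shell
# Prefer ${JAVA_HOME}/bin/javac; fall back to ./javac in the current
# directory, which would produce exactly the ".../modules/vdsk/javac"
# path seen in the build error.
find_javac() {
    if [ -x "${JAVA_HOME}/bin/javac" ]; then
        echo "${JAVA_HOME}/bin/javac"
    else
        echo "${PWD}/javac"
    fi
}
```

With JAVA_HOME unset, or pointing at a JRE that ships no usable javac, the fallback path is what ends up in the error message.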
On Wed, 2008-09-10 at 07:44 -0400, Andriy Fedorov wrote:
> [andrey at beat cog] echo $JAVA_HOME
> /usr/lib64/jvm/java
> [andrey at beat cog] java -version
> java version "1.6.0"
> IcedTea Runtime Environment (build 1.6.0-b09)
> OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
> [andrey at beat cog] javac -version
> javac: unrecognized option '-version'
> javac: no input files
> [andrey at beat cog]
>
From fedorov at cs.wm.edu Wed Sep 10 10:48:44 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 10 Sep 2008 11:48:44 -0400
Subject: [Swift-user] Swift build problems
In-Reply-To: <1221060646.16054.4.camel@localhost>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
<82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
<1221060646.16054.4.camel@localhost>
Message-ID: <82f536810809100848g79d0c7a7vd6d6fbe5b07114a7@mail.gmail.com>
On Wed, Sep 10, 2008 at 11:30 AM, Mihael Hategan wrote:
> That tool is looking for a javac in ${java.home}/bin/ (and I assume it
> sets that system property from JAVA_HOME). Not finding it, it tries the
> current directory.
>
This may well be true.
However, the question is why it is not finding javac in
$JAVA_HOME/bin when it is clearly present there.
My guess is that the earlier failed memory allocation left garbage
somewhere in memory that confused ant about the javac location.
So there may be hope of resolving the weird javac error by resolving
the memory allocation problem. Having googled related issues, I have
not been able to find any solution, so I am just not going to
use that particular system with SVN Swift. This is not critical right
now; I was just wondering whether this is an issue known to the community.
> On Wed, 2008-09-10 at 07:44 -0400, Andriy Fedorov wrote:
>> [andrey at beat cog] echo $JAVA_HOME
>> /usr/lib64/jvm/java
>> [andrey at beat cog] java -version
>> java version "1.6.0"
>> IcedTea Runtime Environment (build 1.6.0-b09)
>> OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
>> [andrey at beat cog] javac -version
>> javac: unrecognized option '-version'
>> javac: no input files
>> [andrey at beat cog]
>>
>
>
>
From benc at hawaga.org.uk Wed Sep 10 11:02:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 10 Sep 2008 16:02:23 +0000 (GMT)
Subject: [Swift-user] Swift build problems
In-Reply-To: <82f536810809100848g79d0c7a7vd6d6fbe5b07114a7@mail.gmail.com>
References: <82f536810809091955x566981dfr5b81c0b1b70a6f92@mail.gmail.com>
<82f536810809100444g1476c9c4w63d2b24c7316d8aa@mail.gmail.com>
<1221060646.16054.4.camel@localhost>
<82f536810809100848g79d0c7a7vd6d6fbe5b07114a7@mail.gmail.com>
Message-ID:
On Wed, 10 Sep 2008, Andriy Fedorov wrote:
[...]
> I am just not going to
> use that particular system with svn Swift.
You can rsync vdsk/dist/swift-svn (or tar or otherwise copy it) from one
machine to another (assuming compatible JREs); that's basically what a
Swift release is.
--
From zhouxy at uchicago.edu Wed Sep 17 14:22:30 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Wed, 17 Sep 2008 14:22:30 -0500
Subject: [Swift-user] Problem about configuration for swift
Message-ID: <8C388ACD0A8F42458DEC650238543E10@VXAVIER>
Hi all,
I just got my certificate, and I can run grid-proxy-init. But when I change
tc.data to
teraport echo /bin/echo INSTALLED INTEL32::LINUX null
and sites.xml to
/home/zhouxy/swift/working
I get the following error when I try to run the first.swift example:
2008.09.17 14:20:08.306 CDT: [FATAL ERROR] You forgot to
set -Dpegasus.home=$PEGASUS_HOME!
Can anyone help me with this?
Thanks!
Xueyuan
From benc at hawaga.org.uk Wed Sep 17 14:31:38 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 17 Sep 2008 19:31:38 +0000 (GMT)
Subject: [Swift-user] Problem about configuration for swift
In-Reply-To: <8C388ACD0A8F42458DEC650238543E10@VXAVIER>
References: <8C388ACD0A8F42458DEC650238543E10@VXAVIER>
Message-ID:
On Wed, 17 Sep 2008, Xueyuan Zhou wrote:
> 2008.09.17 14:20:08.306 CDT: [FATAL ERROR] You forgot to set
> -Dpegasus.home=$PEGASUS_HOME!
I've seen this error when both of the following conditions are true:
i) Pegasus is installed on the same machine
ii) You are using an old version of swift (build prior to cog r2007,
2008-05-12)
Are the above true for you? If so, move to a recent version of Swift,
e.g. 0.6, which is the latest release.
--
From zhouxy at uchicago.edu Wed Sep 17 14:48:09 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Wed, 17 Sep 2008 14:48:09 -0500
Subject: [Swift-user] Problem about configuration for swift
References: <8C388ACD0A8F42458DEC650238543E10@VXAVIER>
Message-ID: <474C1AB60DEB44A89CA8F66479917D39@VXAVIER>
Thanks a lot, Ben!
After I installed version 0.6, it works!
-Xueyuan
----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Wednesday, September 17, 2008 2:31 PM
Subject: Re: [Swift-user] Problem about configuration for swift
>
> On Wed, 17 Sep 2008, Xueyuan Zhou wrote:
>
>> 2008.09.17 14:20:08.306 CDT: [FATAL ERROR] You forgot to set
>> -Dpegasus.home=$PEGASUS_HOME!
>
> I've seen this error when both of the following conditions are true:
>
> i) Pegasus is installed on the same machine
>
> ii) You are using an old version of swift (build prior to cog r2007,
> 2008-05-12)
>
> Are the above true for you? If so, move to a recent version of swift, eg
> 0.6 which is the latest release.
>
> --
>
>
From fedorov at cs.wm.edu Fri Sep 19 09:22:48 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 19 Sep 2008 10:22:48 -0400
Subject: [Swift-user] Swift+MPI+LSF
Message-ID: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
Hi,
I am trying to use Swift to run an MPI job via LSF scheduler (TG
Lonestar, http://www.tacc.utexas.edu/services/userguides/lonestar/).
Previously, I had problems running stuff like this with PBS (see
http://mail.ci.uchicago.edu/pipermail/swift-user/2008-July/000443.html).
Right now I am using the solution suggested by Ben (submit a single-node
job, and run a shell wrapper to launch mpirun). This doesn't seem to
work with LSF. I specify "GLOBUS::jobType=single,host_xcount=10", and
have my shell wrapper run
#!/bin/bash
ibrun /home/teragrid/tg457149/meshreg/trunk/build-mpicc/bin/blockMatchingMPI $*
but I get one node allocated.
According to the Lonestar manual, the number of nodes is specified in the
script like this (note: CPUs are specified via #BSUB directives, not as an
argument to ibrun):
#!/bin/tcsh
# first line specifies shell
#BSUB -J jobname #name the job "jobname"
#BSUB -o out.o%J #output-> out.o<jobID>
#BSUB -e err.o%J #error -> error.o<jobID>
#BSUB -n 4 -W 1:30 #4 CPU cores and 1hr+30min
#BSUB -q normal #Use normal queue.
set echo #Echo all commands.
cd $LS_SUBCWD #cd to directory of submission
ibrun ./a.out #use ibrun for "pam -g 1 mvapich_wrapper"
#CPUs are specified above in -n option.
Is this a known issue? Has anyone run into something like this?
--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov
From hategan at mcs.anl.gov Fri Sep 19 11:46:03 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Sep 2008 11:46:03 -0500
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
Message-ID: <1221842763.6071.3.camel@localhost>
On Fri, 2008-09-19 at 10:22 -0400, Andriy Fedorov wrote:
> Hi,
>
> I am trying to use Swift to run an MPI job via LSF scheduler (TG
> Lonestar, http://www.tacc.utexas.edu/services/userguides/lonestar/).
> Previously, I had problems running stuff like this with PBS (see
> http://mail.ci.uchicago.edu/pipermail/swift-user/2008-July/000443.html).
>
> Right now I am using the solution suggested by Ben (submit single node
> job, and run a shell wrapper to launch mpirun, ). This doesn't seem to
> work with LSF. I specify "GLOBUS::jobType=single,host_xcount=10", and
> have my shell wrapper run
Try "count" instead of "host_xcount".
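Applied to the spec from the original message, the suggested change would look like this (a sketch; the count value is just illustrative):

```shell
# Before (from the original message):
#   GLOBUS::jobType=single,host_xcount=10
# After, requesting 10 CPUs via "count" instead:
globus_spec="GLOBUS::jobType=single,count=10"
echo "$globus_spec"
```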
Mihael
From fedorov at cs.wm.edu Fri Sep 19 12:04:10 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 19 Sep 2008 13:04:10 -0400
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <1221842763.6071.3.camel@localhost>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
Message-ID: <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
On Fri, Sep 19, 2008 at 12:46 PM, Mihael Hategan wrote:
> On Fri, 2008-09-19 at 10:22 -0400, Andriy Fedorov wrote:
>> Hi,
>>
>> I am trying to use Swift to run an MPI job via LSF scheduler (TG
>> Lonestar, http://www.tacc.utexas.edu/services/userguides/lonestar/).
>> Previously, I had problems running stuff like this with PBS (see
>> http://mail.ci.uchicago.edu/pipermail/swift-user/2008-July/000443.html).
>>
>> Right now I am using the solution suggested by Ben (submit single node
>> job, and run a shell wrapper to launch mpirun, ). This doesn't seem to
>> work with LSF. I specify "GLOBUS::jobType=single,host_xcount=10", and
>> have my shell wrapper run
>
> Try "count" instead of "host_xcount".
>
This was indeed the solution. It seems to work. Thank you, Mihael!
>
>
From hategan at mcs.anl.gov Fri Sep 19 12:13:48 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Sep 2008 12:13:48 -0500
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
Message-ID: <1221844428.7599.1.camel@localhost>
On Fri, 2008-09-19 at 13:04 -0400, Andriy Fedorov wrote:
> On Fri, Sep 19, 2008 at 12:46 PM, Mihael Hategan wrote:
> > On Fri, 2008-09-19 at 10:22 -0400, Andriy Fedorov wrote:
> >> Hi,
> >>
> >> I am trying to use Swift to run an MPI job via LSF scheduler (TG
> >> Lonestar, http://www.tacc.utexas.edu/services/userguides/lonestar/).
> >> Previously, I had problems running stuff like this with PBS (see
> >> http://mail.ci.uchicago.edu/pipermail/swift-user/2008-July/000443.html).
> >>
> >> Right now I am using the solution suggested by Ben (submit single node
> >> job, and run a shell wrapper to launch mpirun, ). This doesn't seem to
> >> work with LSF. I specify "GLOBUS::jobType=single,host_xcount=10", and
> >> have my shell wrapper run
> >
> > Try "count" instead of "host_xcount".
> >
>
> This was indeed the solution. It seems to work. Thank you, Mihael!
It looks like it works on other sites too (I tried mercury at NCSA, which
has PBS). So it looks like UC is broken in this respect. I'll file a
ticket with TG.
From fedorov at cs.wm.edu Fri Sep 19 12:27:40 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 19 Sep 2008 13:27:40 -0400
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <1221844428.7599.1.camel@localhost>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
Message-ID: <82f536810809191027p51f0bf75h4874d09d52771e78@mail.gmail.com>
On Fri, Sep 19, 2008 at 1:13 PM, Mihael Hategan wrote:
> It looks like it works on other sites too (I tried mercury at ncsa, which
> has PBS). So it looks like UC is broken in this respect. I'll send a
> ticket to TG.
>
UC is different from other TG sites because it is heterogeneous. You
need something more than just "count", because you need to pass the
"compute" type in the node specs. I use "host_types" on UC, and it
works for me...
>
>
From hategan at mcs.anl.gov Fri Sep 19 12:45:48 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 19 Sep 2008 12:45:48 -0500
Subject: [Swift-user] Swift+MPI+LSF
In-Reply-To: <82f536810809191027p51f0bf75h4874d09d52771e78@mail.gmail.com>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<82f536810809191027p51f0bf75h4874d09d52771e78@mail.gmail.com>
Message-ID: <1221846348.8384.1.camel@localhost>
On Fri, 2008-09-19 at 13:27 -0400, Andriy Fedorov wrote:
> On Fri, Sep 19, 2008 at 1:13 PM, Mihael Hategan wrote:
> > It looks like it works on other sites too (I tried mercury at ncsa, which
> > has PBS). So it looks like UC is broken in this respect. I'll send a
> > ticket to TG.
> >
>
> UC is different from other TG sites, because it is heterogeneous. You
> need to use something more than just "count", because you need to pass
> "compute" type to the node specs. I use "host_types" on UC, and it
> works for me...
I see little reason not to have "count" work. And if it doesn't, things
should be documented properly.
From benc at hawaga.org.uk Sat Sep 20 10:49:22 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 20 Sep 2008 15:49:22 +0000 (GMT)
Subject: [Swift-user] multipage user guide
Message-ID:
The user guide is now available in a one-page-per-section format, which
some people regard as easier to view online than the existing
one-page-for-the-whole-guide format.
Both formats are linked from the documentation webpage at
http://www.ci.uchicago.edu/swift/docs/ and both are generated from the
same DocBook source, so they should always have the same content.
--
From zhouxy at uchicago.edu Mon Sep 22 18:20:30 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Mon, 22 Sep 2008 18:20:30 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
Message-ID: <176C82530E0344829746E3493B289CF3@VXAVIER>
Hi all,
I am using Swift on Teraport, and in some cases I only get 4 nodes! So it is
extremely slow, even slower than a single machine.
I have 21 input files, each about 26 MB. The computation is straightforward:
foreach ifile, i in infile {
    outfile[i] = opt(ifile, myjar);
}
In this case I only got 4 nodes.
When the input files are smaller, around 1 MB, the speed and node count are
pretty reasonable (about 30+ nodes, 10 min).
Does anyone know why this happens? How can I find out how to optimize it? Is
it because my application code uses too much memory, or is too slow for large
data sets?
Thanks,
Xueyuan
From benc at hawaga.org.uk Mon Sep 22 18:25:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 22 Sep 2008 23:25:54 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <176C82530E0344829746E3493B289CF3@VXAVIER>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
Message-ID:
On Mon, 22 Sep 2008, Xueyuan Zhou wrote:
> I am using swift on Teraport, in some cases, I only got 4 nodes! So it is
> extremely slow. Even slower than single machine.
Put a log file from a complete run online where I can download it and I
will use the Swift log-processing module to generate some graphs that
might be useful.
(The log file is the file that looks like foo-20089999-9999-abcdef.log)
There are a number of possible causes, and looking at the graphs will
help.
--
From zhouxy at uchicago.edu Mon Sep 22 20:23:10 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Mon, 22 Sep 2008 20:23:10 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
Message-ID: <3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
I tried to use less input to finish a complete run, but during the last
input it failed.
All logs and related outputs are in (test3)
/home/zhouxy/dic_parser/swift_script
Thanks,
Xueyuan
----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Monday, September 22, 2008 6:25 PM
Subject: Re: [Swift-user] Why so few nodes ?
>
> On Mon, 22 Sep 2008, Xueyuan Zhou wrote:
>
>> I am using swift on Teraport, in some cases, I only got 4 nodes! So it is
>> extremely slow. Even slower than single machine.
>
> Put a log file from a complete run online where I can download it and I
> will use the Swift log-processing module to generate some graphs that
> might be useful.
>
> (The log file is the file that looks like foo-20089999-9999-abcdef.log)
>
> There are a number of possible causes, and looking at the graphs will
> help.
>
> --
>
>
From zhouxy at uchicago.edu Mon Sep 22 20:55:01 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Mon, 22 Sep 2008 20:55:01 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
/home/zhouxy/dic_parser/swift_script2 has a successful complete log, with a
much smaller input file than
/home/zhouxy/dic_parser/swift_script
Thanks,
Xueyuan
----- Original Message -----
From: "Xueyuan Zhou"
To: "Ben Clifford"
Cc:
Sent: Monday, September 22, 2008 8:23 PM
Subject: Re: [Swift-user] Why so few nodes ?
>I tried to use less input to finish a complete run, but, duiring the last
>input, it failed.
>
> all log and related outputs are in (test3)
> /home/zhouxy/dic_parser/swift_script
>
>
> Thanks,
>
> Xueyuan
>
>
>
>
> ----- Original Message -----
> From: "Ben Clifford"
> To: "Xueyuan Zhou"
> Cc:
> Sent: Monday, September 22, 2008 6:25 PM
> Subject: Re: [Swift-user] Why so few nodes ?
>
>
>>
>> On Mon, 22 Sep 2008, Xueyuan Zhou wrote:
>>
>>> I am using swift on Teraport, in some cases, I only got 4 nodes! So it
>>> is
>>> extremely slow. Even slower than single machine.
>>
>> Put a log file from a complete run online where I can download it and I
>> will use the Swift log-processing module to generate some graphs that
>> might be useful.
>>
>> (The log file is the file that looks like foo-20089999-9999-abcdef.log)
>>
>> There are a number of possible causes, and looking at the graphs will
>> help.
>>
>> --
>>
>>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>
From fedorov at cs.wm.edu Tue Sep 23 10:12:14 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 23 Sep 2008 11:12:14 -0400
Subject: [Swift-user] Swift job environment on NCSA Abe
Message-ID: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
Hi,
I had difficulties running MPI jobs on Abe using the wrapper, so I
tried to debug the problem.
It appears that Swift jobs submitted to Abe do not get environment
initialized properly.
Here's the test shell script I am trying to run on Abe:
[fedorov at TG/Abe:honest1 etc] cat /u/ac/fedorov/local/env_wrapper.sh
#!/usr/local/bin/bash
which mpirun
When I run this using PBS script directly on Abe, I get this output:
----------------------------------------
Begin Torque Prologue (Tue Sep 23 09:49:36 2008)
Job ID: 545989
Username: fedorov
Group: bri
Job Name: env.pbs
Limits: ncpus=1,neednodes=abe0726,nodes=1,walltime=00:01:00
Job Queue: normal
Account: bri
Nodes: abe0726
End Torque Prologue
----------------------------------------
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/opt/mpich-vmi-2.2.0-3-gcc-ofed-1.2/bin/mpirun
----------------------------------------
Begin Torque Epilogue (Tue Sep 23 09:49:39 2008)
Job ID: 545989
Username: fedorov
Group: bri
Job Name: env.pbs
Session: 721
Limits: ncpus=1,nodes=1,walltime=00:01:00
Resources: cput=00:00:00,mem=2960kb,vmem=13044kb,walltime=00:00:03
Job Queue: normal
Account: bri
Nodes: abe0726
Killing leftovers...
End Torque Epilogue
----------------------------------------
Now, when I try to run the same script from Swift, I get
Execution failed:
Exception in ABE_env_wrapped:
Arguments: []
Host: Abe-GT4
Directory: site_test-20080923-1107-or2r0ut7/jobs/9/ABE_env_wrapped-9doxytzi
stderr.txt: which: no mpirun in ((null))
stdout.txt:
----
Caused by:
Exit code 1
Here's the relevant line from tc.data:
[fedorov at mistral runs.d] grep ABE_env ~/local/vdsk-0.6/etc/tc.data
Abe-GT4 ABE_env_wrapped /u/ac/fedorov/local/env_wrapper.sh INSTALLED
INTEL32::LINUX null
I have this problem only on Abe.
--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov
From benc at hawaga.org.uk Tue Sep 23 10:21:29 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 15:21:29 +0000 (GMT)
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
Message-ID:
can you try running the script with the relevant GRAM command line tool
(globus-job-run or globusrun-ws depending on if you are using the gt2 or
gt4 provider in swift) to bisect the problem.
--
From fedorov at cs.wm.edu Tue Sep 23 10:30:42 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 23 Sep 2008 11:30:42 -0400
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To:
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
Message-ID: <82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
On Tue, Sep 23, 2008 at 11:21 AM, Ben Clifford wrote:
> can you try running the script with the relevant GRAM command line tool
> (globus-job-run or globusrun-ws depending on if you are using the gt2 or
> gt4 provider in swift) to bisect the problem.
Right, should have done it before.
Here's the result. Appears to work with globusrun-ws:
[fedorov at mistral runs.d] globusrun-ws -submit -s -F
https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService
-Ft PBS -c /u/ac/fedorov/local/env_wrapper.sh
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:f9d5d2a8-8983-11dd-ad02-4d401e450000
Termination time: 09/24/2008 15:26 GMT
Current job state: Pending
Current job state: Active
Current job state: CleanUp-Hold
----------------------------------------
Begin Torque Prologue (Tue Sep 23 10:27:28 2008)
Job ID: 546065
Username: fedorov
Group: bri
Job Name: STDIN
Limits: ncpus=1,neednodes=abe1103,nodes=1,walltime=00:10:00
Job Queue: normal
Account: bri
Nodes: abe1103
End Torque Prologue
----------------------------------------
/opt/mpich-vmi-2.2.0-3-gcc-ofed-1.2/bin/mpirun
----------------------------------------
Begin Torque Epilogue (Tue Sep 23 10:27:30 2008)
Job ID: 546065
Username: fedorov
Group: bri
Job Name: STDIN
Session: 5071
Limits: ncpus=1,nodes=1,walltime=00:10:00
Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:02
Job Queue: normal
Account: bri
Nodes: abe1103
Killing leftovers...
End Torque Epilogue
----------------------------------------
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov
From benc at hawaga.org.uk Tue Sep 23 10:33:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 15:33:07 +0000 (GMT)
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To: <82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
<82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
Message-ID:
What's in your sites.xml?
--
From wilde at mcs.anl.gov Tue Sep 23 10:45:05 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 23 Sep 2008 10:45:05 -0500
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To:
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
<82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
Message-ID: <48D90F01.8050106@mcs.anl.gov>
Andriy, in case this is of use:
To make coasters work on Abe, I also had to, as I recall, remove the
"-l" option from an ssh command generated by coaster setup, and get it to
run my own version of /etc/profile, in which I disabled the following code:
> if [ -f /etc/dedicated ] && [ `/usr/bin/id -u` -ne 0 ]
> then
> grep "^$LOGNAME\$" /etc/dedicated-users > /dev/null
> if [ $? -ne 0 ] && [ -f /etc/dedicated-users ]
> then
> cat /etc/dedicated
> sleep 5
> exit
> else
> echo "`hostname` is in dedicated user mode"
> fi
> fi
Otherwise, I would get the "...is in dedicated user mode" error, above.
(I'll look for my patch for this, but can't right now; you or Ben may find
it quicker.)
Two other problems I encountered, which have been fixed in SVN, were:
- hostspernode was getting sent into the RSL and needed to be filtered out
- the random number generator called by the coaster service was blocking on
/dev/random and needed to be switched to /dev/urandom
When I tested Abe a week ago, with just the /etc/profile fix above, it
seemed to work.
My sites entry was:
/u/ac/wilde/swiftwork
8
- Mike
On 9/23/08 10:33 AM, Ben Clifford wrote:
> whats in your sites.xml?
>
From fedorov at cs.wm.edu Tue Sep 23 10:46:22 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 23 Sep 2008 11:46:22 -0400
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To:
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
<82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
Message-ID: <82f536810809230846o1c09e50dubc474f1a1900fd9e@mail.gmail.com>
On Tue, Sep 23, 2008 at 11:33 AM, Ben Clifford wrote:
> whats in your sites.xml?
>
This was it.
Here's the original relevant part of my sites.xml:
/u/ac/fedorov/scratch-global/scratch
In the globusrun-ws test I used the "pbs" jobmanager. This makes a big
difference. I assume the environment is initialized differently for fork
jobs than for pbs jobs. When I repeat the globusrun-ws test with the fork
jobmanager, I see the same problem, so it's clearly not Swift.
After changing the jobmanager from fork to pbs in sites.xml, Swift
works as expected.
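For anyone hitting the same thing: a GT4 execution entry that routes jobs
through PBS might look roughly like the sketch below. This is from memory of
the sites.xml syntax rather than the poster's actual file, so treat the pool
handle as a placeholder (the URL and work directory are the ones quoted
earlier in this thread):

```xml
<pool handle="Abe-GT4">
  <!-- jobmanager="PBS" sends jobs through the batch system; the default
       (Fork) runs them directly on the head node with a bare environment -->
  <execution provider="gt4" jobmanager="PBS"
      url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>
```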
Thank you for helping to figure this out!
--
Andrey Fedorov
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov
> --
>
>
>
From benc at hawaga.org.uk Tue Sep 23 11:02:13 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 16:02:13 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
On Mon, 22 Sep 2008, Xueyuan Zhou wrote:
> /home/zhouxy/dic_parser/swift_script2 has a succesful complete log, with mush
> smaller input file than
> /home/zhouxy/dic_parser/swift_script
I plotted that log file here:
http://www.ci.uchicago.edu/~benc/tmp/report-test3-20080922-2040-h1383jdc/
The log file seems to suggest that between 5 and 12 jobs are running at
any one time according to the submit side after about 200 seconds into the
run.
To begin with, each site will only get two nodes at once; as your run
progresses successfully on a site, that site will be given more nodes.
Have a look at the graph labelled:
execute2 tasks, coloured by site
and you can see how many jobs are sent to each site at once.
You have two sites defined - localhost and teraport. The localhost
definition looks a bit suspicious, though - the usual swift localhost
definition never allows more than 2 jobs to run at once, but on your run
you often have more than that.
What does your sites.xml file look like?
--
From hategan at mcs.anl.gov Tue Sep 23 11:20:38 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 23 Sep 2008 11:20:38 -0500
Subject: [Swift-user] Swift job environment on NCSA Abe
In-Reply-To: <82f536810809230846o1c09e50dubc474f1a1900fd9e@mail.gmail.com>
References: <82f536810809230812q5be54e5fx59d63881c24d2ec8@mail.gmail.com>
<82f536810809230830u5ce08b4ta78e8a5e50027cd9@mail.gmail.com>
<82f536810809230846o1c09e50dubc474f1a1900fd9e@mail.gmail.com>
Message-ID: <1222186838.31824.2.camel@localhost>
On Tue, 2008-09-23 at 11:46 -0400, Andriy Fedorov wrote:
> On Tue, Sep 23, 2008 at 11:33 AM, Ben Clifford wrote:
> > whats in your sites.xml?
> >
>
> This was it.
>
> Here's the original relevant part of my sites.xml:
>
>
>
> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
> /u/ac/fedorov/scratch-global/scratch
>
>
> In the globusrun-ws test I used "pbs" jobmanager. This makes a big
> difference. I assume, environment is initialized differently for fork
> jobs than for pbs jobs. When I repeat the globusrun test with fork
> provider, I have the same problem, so it's obviously not Swift.
>
> After substituting jobmanager from fork to pbs in sites.xml, Swift
> works as expected.
"Fork" is Globus-speak for "fork a job on the *head node* and don't use the
queuing system", which is probably not what you want in general.
Mihael
From zhouxy at uchicago.edu Tue Sep 23 11:30:04 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 11:30:04 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
my sites.xml is
/home/zhouxy/swift/working
fast
and
/home/zhouxy/swift/working
fast
I ran a longer job last night, which is at
/home/zhouxy/dic_parser/swift_script3; it has about 10 times more jobs than
/home/zhouxy/dic_parser/swift_script2
Still, except at the very beginning and end (< 4 nodes), the most nodes I can
get is about 4; since the nodes seem to be dual-core, that is 8 running jobs.
I also noticed there are some free nodes there. I am wondering why I can
only have about 8 running jobs (on 4 nodes), and about 20~30 in the queue,
while dozens of nodes are free.
In this case, it is acting like a slightly faster single machine.
If I could have more running jobs, that "execute2 tasks, coloured by site"
graph would be steeper, I think. But it seems swift does not do that.
Thanks.
----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Tuesday, September 23, 2008 11:02 AM
Subject: Re: [Swift-user] Why so few nodes ?
>
> On Mon, 22 Sep 2008, Xueyuan Zhou wrote:
>
>> /home/zhouxy/dic_parser/swift_script2 has a succesful complete log, with
>> mush
>> smaller input file than
>> /home/zhouxy/dic_parser/swift_script
>
> I plotted that log file here:
>
> http://www.ci.uchicago.edu/~benc/tmp/report-test3-20080922-2040-h1383jdc/
>
> The log file seems to suggest that between 5 and 12 jobs are running at
> any one time according to the submit side after about 200 seconds into the
> run.
>
> To begin with, each site will only get two nodes at once; as your run
> progresses successfully on a site, that site will be given more nodes.
>
> Have a look at the graph labelled:
>
> execute2 tasks, coloured by site
>
> and you can see how many jobs are sent to each site at once.
>
> You have two sites defined - localhost and teraport. The localhost
> definition looks a bit suspicious, though - the usual swift localhost
> definition never allows more than 2 jobs to run at once, but on your run
> you often have more than that.
>
> What does your sites.xml file look like?
>
> --
>
>
>
>
From benc at hawaga.org.uk Tue Sep 23 11:37:50 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 16:37:50 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
>
>
>
> /home/zhouxy/swift/working
> fast
>
what machine are you running your swift command on?
--
From zhouxy at uchicago.edu Tue Sep 23 11:41:01 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 11:41:01 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
most on tp-login1.ci.uchicago.edu and some on tp-login2.
----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Tuesday, September 23, 2008 11:37 AM
Subject: Re: [Swift-user] Why so few nodes ?
>
> On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
>
>>
>>
>>
>> /home/zhouxy/swift/working
>> fast
>>
>
> what machine are you running your swift command on?
>
> --
>
>
From benc at hawaga.org.uk Tue Sep 23 11:47:48 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 16:47:48 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
> most on tp-login1.ci.uchicago.edu and some on tp-login2.
ok. So you have two site definitions that will both, through different
mechanisms, go into the teraport PBS queue. That probably isn't going to
hurt things, but you could probably use only one of those definitions.
I'm looking at your swift_script3 logs now...
--
From zhouxy at uchicago.edu Tue Sep 23 11:50:48 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 11:50:48 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
Thanks. I remember that if I only use teraport, it seems I get just half the
nodes. I will try again to make sure.
----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc:
Sent: Tuesday, September 23, 2008 11:47 AM
Subject: Re: [Swift-user] Why so few nodes ?
>
> On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
>
>> most on tp-login1.ci.uchicago.edu and some on tp-login2.
>
>
> ok. So you have two site definitions that will both, through different
> mechanisms, go into the teraport PBS queue. That probably isn't going to
> hurt things, but you could probably use only one of those definitions.
>
> I'm looking at your swift_script3 logs now...
>
> --
>
From benc at hawaga.org.uk Tue Sep 23 11:54:10 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 16:54:10 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
> thanks, I remember it seems if I only use teraport, it is just half nodes. I
> will try again go make sure.
It will be half, throughout the whole run. To use more, you can adjust
the jobThrottle parameter.
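(For reference, and as I recall from the user guide, so double-check this
against the documentation: a karajan jobThrottle value of t allows roughly
t*100+1 jobs to be submitted to a site at once. A quick sketch of that
relationship:)

```python
def max_concurrent_jobs(job_throttle: float) -> int:
    """Approximate per-site concurrent-job limit for a karajan jobThrottle
    value, per my reading of the Swift user guide (an assumption, not a
    statement from this thread)."""
    return int(round(job_throttle * 100)) + 1

# The very small initial throttle matches the "only two nodes at once"
# behaviour described earlier; 0.5 is the value suggested in this thread.
print(max_concurrent_jobs(0.01))  # 2
print(max_concurrent_jobs(0.5))   # 51
```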
I'm not sure which is better to use - the gt2 provider or the pbs
provider. The gt2 provider is much more tested; the pbs provider might put
less load on the head node (or might not) - not many people have used it.
--
From hategan at mcs.anl.gov Tue Sep 23 12:04:05 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 23 Sep 2008 12:04:05 -0500
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID: <1222189445.32470.2.camel@localhost>
On Tue, 2008-09-23 at 11:30 -0500, Xueyuan Zhou wrote:
> still, except the very beginning and ending(< 4 nodes), the most nodes I can
> get is about 4 nodes, since it seems to be dual core, it is 8 running jobs.
How do you reach that conclusion? As far as I can tell, by default, one
job maps to one node (regardless of the number of cores).
> I also noticed there are some free nodes there. I am wondering why I can
> only have about 8 running job (on 4 nodes), and about 20~30 in Q, while
> dozons of nodes are free.
May it be because you're using the "fast" queue, which may have a
limitation on the maximum number of nodes you can get at one time?
From zhouxy at uchicago.edu Tue Sep 23 12:12:27 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 12:12:27 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
Message-ID:
> On Tue, 2008-09-23 at 11:30 -0500, Xueyuan Zhou wrote:
>> still, except the very beginning and ending(< 4 nodes), the most nodes I
>> can
>> get is about 4 nodes, since it seems to be dual core, it is 8 running
>> jobs.
>
> How do you reach that conclusion? As far as I can tell, by default, one
> job maps to one node (regardless of the number of cores).
>
I check qstat -n -1 -u zhouxy frequently, and Ben's graph result also
confirms that, I think.
It looks like this:
703373.tp-mgt.ci.uch zhouxy fast STDIN 16543 1 -- -- 01:00 R -- tp-c114/0
703374.tp-mgt.ci.uch zhouxy fast STDIN 16863 1 -- -- 01:00 R -- tp-c114/1
703375.tp-mgt.ci.uch zhouxy fast STDIN 27455 1 -- -- 01:00 R -- tp-c064/0
703376.tp-mgt.ci.uch zhouxy fast STDIN 27884 1 -- -- 01:00 R -- tp-c064/1
703377.tp-mgt.ci.uch zhouxy fast STDIN 22513 1 -- -- 01:00 R -- tp-c043/0
703378.tp-mgt.ci.uch zhouxy fast STDIN 13364 1 -- -- 01:00 R -- tp-c119/0
703379.tp-mgt.ci.uch zhouxy fast STDIN 13673 1 -- -- 01:00 R -- tp-c119/1
703380.tp-mgt.ci.uch zhouxy fast STDIN -- 1 -- -- 01:00 R -- tp-c118/0
>> I also noticed there are some free nodes there. I am wondering why I can
>> only have about 8 running job (on 4 nodes), and about 20~30 in Q, while
>> dozons of nodes are free.
>
> May it be because you're using the "fast" queue, which may have a
> limitation on the maximum number of nodes you can get at one time?
>
I tried it without "fast" for a task with a 26M input file; it was even
slower, and the jobs all sat waiting in the queue. And I got even fewer nodes
for the larger input file, seemingly only 2.
I'll try some more.
>
From benc at hawaga.org.uk Tue Sep 23 12:16:40 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 17:16:40 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com><1221842763.6071.3.camel@localhost><82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com><1221844428.7599.1.camel@localhost><176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
Message-ID:
On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
> also noticed there are some free nodes there. I am wondering why I can only
> have about 8 running job (on 4 nodes), and about 20~30 in Q, while dozons of
> nodes are free.
If you have jobs sitting in the queue on teraport then that is something
to do with the way the queueing system is working on teraport.
I put graphs here:
http://www.ci.uchicago.edu/~benc/tmp/report-test1M-20080922-2203-gamhpzp1/
It looks like you're getting around 8 jobs running at once, which matches up
with your observation in the above quote.
Mihael says this:
> May it be because you're using the "fast" queue, which may have a
> limitation on the maximum number of nodes you can get at one time?
though the queue policy page here:
http://www.ci.uchicago.edu/wiki/bin/view/Teraport/QueuePolicies
does not mention anything about it.
Though the wiki also contradicts this statement from Mihael:
> How do you reach that conclusion? As far as I can tell, by default, one
> job maps to one node (regardless of the number of cores).
by saying:
> Single-processor jobs submitted by a user are paired up on a single node
> to maximize cluster efficiency.
I haven't tried that myself.
--
From benc at hawaga.org.uk Tue Sep 23 12:55:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 17:55:27 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <1222189445.32470.2.camel@localhost>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
Message-ID:
On Tue, 23 Sep 2008, Mihael Hategan wrote:
> May it be because you're using the "fast" queue, which may have a
> limitation on the maximum number of nodes you can get at one time?
Another thing might be scheduling delay in the teraport scheduler. I'm not
really sure whether that would affect things, but the jobs in this workflow
seem quite short.
At least based on the Submitted->Active->Completed transitions in karajan,
jobs seem to only take 10..45 seconds.
See the graph I just added titled:
karajan active JOB_SUBMISSION cumulative duration
on
http://www.ci.uchicago.edu/~benc/tmp/report-test1M-20080922-2203-gamhpzp1/
--
From zhouxy at uchicago.edu Tue Sep 23 13:01:36 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 13:01:36 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
Message-ID: <59B9BC637E9940E49144BC58098C329C@VXAVIER>
Thanks a lot. Yes, the jobs are not long.
I am wondering, is there any priority setting for different users? Because
the jobs I have are really short, but there are quite a few nodes sitting
free instead of being allocated to me.
----- Original Message -----
From: "Ben Clifford"
To: "Mihael Hategan"
Cc: "Xueyuan Zhou" ;
Sent: Tuesday, September 23, 2008 12:55 PM
Subject: Re: [Swift-user] Why so few nodes ?
>
> On Tue, 23 Sep 2008, Mihael Hategan wrote:
>
>> May it be because you're using the "fast" queue, which may have a
>> limitation on the maximum number of nodes you can get at one time?
>
> Another thing might be scheduling delay in the teraport scheduler. Not
> relly sure if that would affect things, but the jobs in this workflow seem
> quite short.
>
> At least based on the Submitted->Active->Completed transitions in karajan,
> jobs seem to only take 10..45 seconds.
>
> See the graph I just added titled:
>
> karajan active JOB_SUBMISSION cumulative duration
>
> on
> http://www.ci.uchicago.edu/~benc/tmp/report-test1M-20080922-2203-gamhpzp1/
>
> --
>
>
From benc at hawaga.org.uk Tue Sep 23 13:07:32 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 18:07:32 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <59B9BC637E9940E49144BC58098C329C@VXAVIER>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
<59B9BC637E9940E49144BC58098C329C@VXAVIER>
Message-ID:
On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
> I am wondering, is there any priority setting for different user ? Because
> what I got are really short jobs, but there are quite a few nodes are free,
> instead of being allocated to me.
No idea. You could ask teraport admins.
--
From benc at hawaga.org.uk Tue Sep 23 13:11:25 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 18:11:25 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To:
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
<59B9BC637E9940E49144BC58098C329C@VXAVIER>
Message-ID:
If this is coming from queuing delays in teraport's PBS, maybe trying
coasters would help in this situation.
I'm not sure whether they work on teraport at the moment. If they do,
see the paragraph about coasters at the end of section 16.3 here:
http://www.ci.uchicago.edu/swift/guides/userguide/sitecatalog.php
--
From zhouxy at uchicago.edu Tue Sep 23 13:12:09 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 13:12:09 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
<59B9BC637E9940E49144BC58098C329C@VXAVIER>
Message-ID: <112EC5CDCDC34B58AB0A06CBAEF8A41B@VXAVIER>
thanks guys, it helps a lot.
----- Original Message -----
From: "Ben Clifford"
To: "Xueyuan Zhou"
Cc: "Mihael Hategan" ;
Sent: Tuesday, September 23, 2008 1:07 PM
Subject: Re: [Swift-user] Why so few nodes ?
>
> On Tue, 23 Sep 2008, Xueyuan Zhou wrote:
>
>> I am wondering, is there any priority setting for different user ?
>> Because
>> what I got are really short jobs, but there are quite a few nodes are
>> free,
>> instead of being allocated to me.
>
> No idea. You could ask teraport admins.
>
> --
>
>
From benc at hawaga.org.uk Tue Sep 23 13:25:10 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 23 Sep 2008 18:25:10 +0000 (GMT)
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <1222189445.32470.2.camel@localhost>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
Message-ID:
On Tue, 23 Sep 2008, Mihael Hategan wrote:
> May it be because you're using the "fast" queue, which may have a
> limitation on the maximum number of nodes you can get at one time?
There is evidence that sending jobs twice as fast (by defining two swift
sites for teraport) made twice as many jobs run. That would count against
the argument that there is a max node count, and support there being some
long scheduling delay in the LRM, I think.
I guess increasing the jobThrottle to a larger number would therefore get
more jobs in the queue and more jobs run, but would put more load on the
head node.
Xueyuan, add this line to one of your site definitions: (but not both)
<profile namespace="karajan" key="jobThrottle">0.5</profile>
then make a run and send a log.
--
From hategan at mcs.anl.gov Tue Sep 23 13:34:31 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 23 Sep 2008 13:34:31 -0500
Subject: [Swift-user] Why so few nodes ?
In-Reply-To: <59B9BC637E9940E49144BC58098C329C@VXAVIER>
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
<59B9BC637E9940E49144BC58098C329C@VXAVIER>
Message-ID: <1222194871.1798.0.camel@localhost>
On Tue, 2008-09-23 at 13:01 -0500, Xueyuan Zhou wrote:
> thanks a lot. Yes, the job is not long.
>
>
> I am wondering, is there any priority setting for different users? Because
> what I submit are really short jobs, but quite a few nodes are free
> instead of being allocated to me.
The "debug" queue has a higher priority but a limited number of nodes (I
think 4) allocated.
From zhouxy at uchicago.edu Tue Sep 23 13:38:24 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 13:38:24 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
<59B9BC637E9940E49144BC58098C329C@VXAVIER>
<1222194871.1798.0.camel@localhost>
Message-ID:
Ben, I am running the same task with 0.5 now.
Mihael, by "debug" queue, do you mean the "fast" queue I used? So if I use
"extended", might it have more nodes?
----- Original Message -----
From: "Mihael Hategan"
To: "Xueyuan Zhou"
Cc: "Ben Clifford" ;
Sent: Tuesday, September 23, 2008 1:34 PM
Subject: Re: [Swift-user] Why so few nodes ?
> On Tue, 2008-09-23 at 13:01 -0500, Xueyuan Zhou wrote:
>> thanks a lot. Yes, the job is not long.
>>
>>
>> I am wondering, is there any priority setting for different users?
>> Because what I submit are really short jobs, but quite a few nodes
>> are free instead of being allocated to me.
>
> The "debug" queue has a higher priority but a limited number of nodes (I
> think 4) allocated.
>
>
>
From zhouxy at uchicago.edu Tue Sep 23 13:49:55 2008
From: zhouxy at uchicago.edu (Xueyuan Zhou)
Date: Tue, 23 Sep 2008 13:49:55 -0500
Subject: [Swift-user] Why so few nodes ?
References: <82f536810809190722s2396c516we8a2d0db92670c42@mail.gmail.com>
<1221842763.6071.3.camel@localhost>
<82f536810809191004i90b2087y363f55adc01bfbc9@mail.gmail.com>
<1221844428.7599.1.camel@localhost>
<176C82530E0344829746E3493B289CF3@VXAVIER>
<3BA90E9DD8AC4FB689F14C40E3390402@VXAVIER>
<1222189445.32470.2.camel@localhost>
Message-ID: <399D9AF1930247A1AB1BB4EA89C1C3B6@VXAVIER>
the result is in /home/zhouxy/dic_parser/swift_script4
One thing I noticed is that, when I only use one site, it starts with 2
running jobs, then increases, and finally also reaches 8 running jobs, which
is the same as with two sites. So the total time with one site and with two
sites is not very different.
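For anyone reproducing this later: the "two swift sites for teraport" trick discussed earlier in the thread amounts to two <pool> entries in the sites file pointing at the same physical host, so Swift schedules against each independently. The sketch below is an assumption about what that could look like, not a copy of Xueyuan's actual file; the handles, contact URLs, and work directory are placeholders:

```xml
<config>
  <!-- two logical sites backed by the same host; Swift treats them
       independently, roughly doubling the job submission rate -->
  <pool handle="teraport-a">
    <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu"/>
    <jobmanager universe="vanilla"
                url="tp-grid1.ci.uchicago.edu/jobmanager-pbs" major="2"/>
    <workdirectory>/home/zhouxy/swiftwork</workdirectory>
  </pool>
  <pool handle="teraport-b">
    <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu"/>
    <jobmanager universe="vanilla"
                url="tp-grid1.ci.uchicago.edu/jobmanager-pbs" major="2"/>
    <workdirectory>/home/zhouxy/swiftwork</workdirectory>
    <!-- per Ben's suggestion: raise the throttle on one site only -->
    <profile namespace="karajan" key="jobThrottle">0.5</profile>
  </pool>
</config>
```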
----- Original Message -----
From: "Ben Clifford"
To: "Mihael Hategan"
Cc: "Xueyuan Zhou" ;
Sent: Tuesday, September 23, 2008 1:25 PM
Subject: Re: [Swift-user] Why so few nodes ?
>
> On Tue, 23 Sep 2008, Mihael Hategan wrote:
>
>> May it be because you're using the "fast" queue, which may have a
>> limitation on the maximum number of nodes you can get at one time?
>
> There is evidence that sending jobs twice as fast (by defining two swift
> sites for teraport) made twice as many jobs run. This would be counter to
> the argument that there is a max node count, and supportive of there being
> some long scheduling delay in the LRM, I think.
>
> I guess increasing the jobThrottle to a larger value would therefore get
> more jobs into the queue and more jobs running, but would also put more
> load on the head node.
>
> Xueyuan, add this line to one of your site definitions (but not both):
>
> <profile namespace="karajan" key="jobThrottle">0.5</profile>
>
> then make a run and send a log.
>
> --
>