From fedorov at cs.wm.edu Tue Jul 1 08:39:28 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 1 Jul 2008 09:39:28 -0400
Subject: [Swift-user] Passing hostType for MPI jobs
Message-ID: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
Hi,
I am having problems passing host type for MPI jobs. This appears to
happen both when I am using globusrun-ws (XML job description) and
when I am using Swift, although the errors are different.
I am trying to request nodes of type "compute" on UC TeraGrid site.
This host type is recognized by PBS when I pass it to "qsub".
Basically, when I am using XML job description, I am specifying
hostType using Job description extension support
(http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes).
What happens is that I get the correct type of nodes, but the count is
not what I request.
When I specify hostType parameter in tc.data I either get an error
(when I have hostCount="4:compute"):
===>
RunID: 20080701-0829-xstp5l98
Progress:
hello_mpi started
Progress: Stage in:1
Failed to transfer wrapper log from
hello_mpi_swift-20080701-0829-xstp5l98/info/a/UC-GT4
Failed to transfer wrapper log from
hello_mpi_swift-20080701-0829-xstp5l98/info/b/UC-GT4
Failed to transfer wrapper log from
hello_mpi_swift-20080701-0829-xstp5l98/info/c/UC-GT4
hello_mpi failed
Execution failed:
Exception in hello_mpi:
Arguments: []
Host: UC-GT4
Directory: hello_mpi_swift-20080701-0829-xstp5l98/jobs/c/hello_mpi-cltmnvui
stderr.txt:
stdout.txt:
----
Caused by:
For input string: "4:compute"
<===
or I get the nodes of the wrong type (when I use hostType="compute" --
looks like it is just ignored).
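A note on the first error: "For input string: ..." is the standard message text of a Java NumberFormatException, which suggests (this is an inference, not something confirmed in this thread) that the hostCount value is parsed as a plain integer, so a combined PBS-style "4:compute" value cannot be accepted there. A minimal Python sketch of the same distinction; `split_host_spec` and `parses_as_integer` are hypothetical helpers for illustration and are not part of Swift or GRAM:

```python
def split_host_spec(spec):
    """Split a PBS-style "count:type" string into (count, host_type).

    Hypothetical helper for illustration only; not a Swift or GRAM API.
    """
    count_text, _, host_type = spec.partition(":")
    return int(count_text), (host_type or None)


def parses_as_integer(text):
    """Return True when the whole string is a plain integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    print(parses_as_integer("4"))          # a bare count
    print(parses_as_integer("4:compute"))  # the combined count:type form
    print(split_host_spec("4:compute"))
```

Under that assumption, the count and the host type have to be passed as two separate attributes rather than as one combined string.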
Does anyone know how to specify host type correctly? Is this a GT4
bug? I suspect there is a GT4 bug involved, because when I skip
, I can correctly run MPI job on 4 hosts. I don't know
what is the Swift support for host type functionality.
For the reference, I attach my XML job description, tc.data,
sites.xml, Swift script, and the simple MPI "hello world" code.
hello_mpi.c (compile with `mpicc -o hello_mpi hello_mpi.c') ==>
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv){
  int myrank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  fprintf(stderr, "Hello, world from cpu %i (total %i)\n",
          myrank, size);
  MPI_Finalize();
  return 0;
}
<===
hello_mpi_xml.xml ===>
https://tg-grid.uc.teragrid.org:8443/wsrf/services/ManagedJobFactoryService
PBS
/home/fedorov/local/bin/hello_mpi
/home/fedorov/scratch/hello_mpi_xml.stdout
/home/fedorov/scratch/hello_mpi_xml.stderr
4
4
10
mpi
compute
<===
hello_mpi_swift.swift ===>
type messagefile {}
(messagefile t) greeting() {
app {
hello_mpi stderr=@filename(t);
}
}
messagefile outfile <"hello_mpi.txt">;
outfile = greeting();
<===
tc.data ===>
UC-GT4 hello_mpi /home/fedorov/local/bin/hello_mpi_v INSTALLED
INTEL32::LINUX GLOBUS::hostCount="4",jobType=mpi,maxWallTime="10",count="4",hostType="compute"
<===
sites.xml ===>
/home/fedorov/scratch
<===
From benc at hawaga.org.uk Tue Jul 1 09:34:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 Jul 2008 14:34:15 +0000 (GMT)
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
Message-ID:
On Tue, 1 Jul 2008, Andriy Fedorov wrote:
> I am having problems passing host type for MPI jobs. This appears to
> happen both when I am using globusrun-ws (XML job description), although
> the errors are different.
I've been working on running MPI jobs inside Swift today. On TG UC I find
a problem that sounds like that when using GRAM4. Using GRAM2 works ok
(but slower). I can specify the host type ok, but not the job node count.
I will interact with the TG UC admins to see if they know what is going on
there.
--
From fedorov at cs.wm.edu Tue Jul 1 09:47:21 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 1 Jul 2008 10:47:21 -0400
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To:
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
Message-ID: <82f536810807010747i1615b5c8l87186035aed3f118@mail.gmail.com>
> I've been working on running MPI jobs inside Swift today. On TG UC I find
> a problem that sounds like that when using GRAM4. Using GRAM2 works ok
> (but slower). I can specify the host type ok, but not the job node count.
>
By the way, specifying the job node count works fine for me both with
GRAM4+Swift, and with just GRAM4 (XML) -- try the configuration and
scripts I attach to the initial post. It does NOT work if I try to
specify both host count and host type for GRAM4 XML.
Andrey
From benc at hawaga.org.uk Wed Jul 2 03:34:35 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Jul 2008 08:34:35 +0000 (GMT)
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
Message-ID:
So:
/bin/hostname
/home/benc/mpi
/home/benc/mpi/test.stdout
/home/benc/mpi/test.stderr
3
allocates three hosts for me, without specifying the type. This seems to
give the correct behaviour.
/bin/hostname
/home/benc/mpi
/home/benc/mpi/test.stdout
/home/benc/mpi/test.stderr
3
ia64-compute
allocates one host for me (ignoring the hostCount) but it is of the
correct type, ia64-compute. This seems to be incorrect behaviour because
it ignores the hostcount.
A different approach, using a different hostcount field that the job
extensions web page at
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html
suggests:
/bin/hostname
/home/benc/mpi
/home/benc/mpi/test.stdout
/home/benc/mpi/test.stderr
ia64-compute
3
results in:
[benc at tg-login1 mpi]$ globusrun-ws -submit -Ft PBS -F
tg-grid.uc.teragrid.org -job-description-file ./gram4-dbg.rsl
Submitting job...Done.
Job ID: uuid:cc6b465e-4810-11dd-9981-0007e9d811ce
Termination time: 07/03/2008 08:28 GMT
Current job state: Failed
Destroying job...Done.
globusrun-ws: Job failed: The executable could not be started.
qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
Likewise if I use this extension:
ia64-compute
3
But finally...
ia64-compute
5
1
allocates 5 hosts.
So it looks like you need to specify both hostCount and cpusPerHost.
So that is how to specify it with GRAM4 direct submission.
I'll have to have a play around to figure out how that can be specified in
Swift+GRAM4.
--
From lixi at uchicago.edu Wed Jul 2 08:07:49 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Wed, 2 Jul 2008 08:07:49 -0500 (CDT)
Subject: [Swift-user] Re: No response of Swift run
Message-ID: <20080702080749.BBV69776@m4500-03.uchicago.edu>
>Hi,
>
>I launched a Swift workflow (including 2001 jobs) at 16:16
>yesterday. At 17:20, it returned the results of 2000 jobs,
>then there is no response any more. I wonder why? I enabled
>the replication option.
>
>The log file is very large (more than 1G) and is on CI:
>/home/lixi/newswift/test/newversion/workflowtest-20080629-1616-c4h22j03.log
>
>Please check it, thanks
>
A similar result occurred again. The log file is on CI:
/home/lixi/newswift/test/newversion/0701/workflowtest-20080701-1206-sjuu3cnc.log
Thanks,
Xi
From benc at hawaga.org.uk Wed Jul 2 08:14:17 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Jul 2008 13:14:17 +0000 (GMT)
Subject: [Swift-user] Re: [Swift-devel] Re: No response of Swift run
In-Reply-To: <20080702080749.BBV69776@m4500-03.uchicago.edu>
References: <20080702080749.BBV69776@m4500-03.uchicago.edu>
Message-ID:
cog r2064 and r2065 introduce some changes in the scheduling code which
will reduce the size of log files substantially and fix a hanging problem
that was introduced with my r2058 scheduler changes.
This might or might not fix your problem. I think probably not, but it is
worth a try.
--
From lixi at uchicago.edu Wed Jul 2 08:34:04 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Wed, 2 Jul 2008 08:34:04 -0500 (CDT)
Subject: [Swift-user] Re: [Swift-devel] Re: No response of Swift run
Message-ID: <20080702083404.BBV71639@m4500-03.uchicago.edu>
>cog r2064 and r2065 introduce some changes in the scheduling code which
>will reduce the size of log files substantially and fix a hanging problem
>that was introduced with my r2058 scheduler changes.
>
>This might or might not fix your problem. I think probably not, but it is
>worth a try.
>
Thanks, I'll try.
In fact, this is the result of Swift svn swift-r2079 cog-r2063.
Xi
From benc at hawaga.org.uk Wed Jul 2 08:43:00 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Jul 2008 13:43:00 +0000 (GMT)
Subject: [Swift-user] Re: [Swift-devel] Re: No response of Swift run
In-Reply-To: <20080702083404.BBV71639@m4500-03.uchicago.edu>
References: <20080702083404.BBV71639@m4500-03.uchicago.edu>
Message-ID:
On Wed, 2 Jul 2008, lixi at uchicago.edu wrote:
> In fact, this is the result of Swift svn swift-r2079 cog-r2063.
Yes, I can see that from the log file. Actually it is r2063 with some
changes that you have applied, according to the log file (presumably one
of the patches that mihael and I sent earlier that you will not need to
use after r2065)
In your log workflowtest-20080701-1206-sjuu3cnc, a single task appears to
still be in 'Active' state, which is possibly why the run does not end.
The task ID for that is 0-1-1550-2-1214932015745. It is a file transfer of
some kind, I think to site AGLT2, though the log information is a little
vague; we should probably log more information there.
--
From benc at hawaga.org.uk Wed Jul 2 09:10:03 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Jul 2008 14:10:03 +0000 (GMT)
Subject: [Swift-user] Re: [Swift-devel] Re: No response of Swift run
In-Reply-To: <20080702083404.BBV71639@m4500-03.uchicago.edu>
References: <20080702083404.BBV71639@m4500-03.uchicago.edu>
Message-ID:
unrelated to your problem:
here are log plots:
http://www.ci.uchicago.edu/~benc/tmp/report-workflowtest-20080701-1206-sjuu3cnc/
the table: 'sites/success table' gives some quantification of what
replication is doing. the columns in that table mean, basically:
JOB_SUCCESS - a job ran all the way through
APPLICATION_EXCEPTION - a job was attempted but failed
JOB_CANCELLED - a job was submitted to the queue, but a replica ran
first so this was cancelled.
On the big (high success rate) sites, it looks like around a third of
submissions end up getting cancelled due to replication.
--
From fedorov at cs.wm.edu Wed Jul 2 10:01:02 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 2 Jul 2008 11:01:02 -0400
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To:
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
Message-ID: <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com>
> But finally...
>
>
>
> ia64-compute
> 5
> 1
>
>
>
> allocates 5 hosts.
>
> So it looks like you need to specify both hostCount and cpusPerHost.
>
Ok, I tried that. It indeed allocates the correct number of the requested
hosts. But there's still a problem. It appears that only one instance
of the executable is running, at least when I set jobType to mpi.
I am not sure it is being run as an MPI job. I have a simple MPI code
that outputs rank and COMM_WORLD size, and the test says I have a
total of 1 process when I submit my job with the following job
specification:
/home/fedorov/local/bin/hello_mpi
/home/fedorov/scratch/hello_mpi_xml.stdout
/home/fedorov/scratch/hello_mpi_xml.stderr
10
mpi
compute
4
1
4
Ben, can you try to run some MPI executable, and see if it works for you?
By the way, I also discovered that sometimes the order of tags in
the .xml makes a difference (meaning, with a certain order of "count",
"walltime" and "hostCount", globusrun-ws will abort). I had no idea
that order matters...
Andrey
From hategan at mcs.anl.gov Wed Jul 2 10:12:44 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 02 Jul 2008 10:12:44 -0500
Subject: [Swift-user] Re: No response of Swift run
In-Reply-To: <20080702080749.BBV69776@m4500-03.uchicago.edu>
References: <20080702080749.BBV69776@m4500-03.uchicago.edu>
Message-ID: <1215011564.469.4.camel@localhost>
Could you do the following for me:
1. edit dist/vdsk-xyz/bin/swift
2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug
-Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n"' (a
single line) (you may need to do this every time you compile swift)
3. then run it again and let me know when it hangs. Don't kill the
hanging workflow. Let it hang instead.
4. Also let me know what machine you run this on.
On Wed, 2008-07-02 at 08:07 -0500, lixi at uchicago.edu wrote:
> >Hi,
> >
> >I launched a Swift workflow (including 2001 jobs) at 16:16
> >yesterday. At 17:20, it returned the results of 2000 jobs,
> >then there is no response any more. I wonder why? I enabled
> >the replication option.
> >
> >The log file is very large (more than 1G) and is on CI:
> >/home/lixi/newswift/test/newversion/workflowtest-20080629-1616-c4h22j03.log
> >
> >Please check it, thanks
> >
> A similar result occurred again. The log file is on CI:
> /home/lixi/newswift/test/newversion/0701/workflowtest-20080701-1206-sjuu3cnc.log
>
> Thanks,
>
> Xi
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
From lixi at uchicago.edu Wed Jul 2 12:22:09 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Wed, 2 Jul 2008 12:22:09 -0500 (CDT)
Subject: [Swift-user] Re: No response of Swift run
Message-ID: <20080702122209.BBV97884@m4500-03.uchicago.edu>
>Could you do the following for me:
>1. edit dist/vdsk-xyz/bin/swift
>2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug
>-Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n"' (a
>single line) (you may need to do this every time you compile swift)
>3. then run it again and let me know when it hangs. Don't kill the
>hanging workflow. Let it hang instead.
>4. Also let me know what machine you run this on.
Now I'm running this workflow again on login.ci.uchicago.edu.
Meanwhile, I launched another swift run to test a single
site, but I got this error:
[lixi at login GLOW]$ swift -sites.file GLOW.sites.xml -tc.file tc.data workflowtest.swift
ERROR: transport error 202: bind failed: Address already in use ["transport.c",L41]
ERROR: JDWP Transport dt_socket failed to initialize, TRANSPORT_INIT(510) ["debugInit.c",L500]
JDWP exit error JVMTI_ERROR_INTERNAL(113): No transports initializedFATAL ERROR in native method: JDWP No transports initialized, jvmtiError=JVMTI_ERROR_INTERNAL(113)
Does this have something to do with this option?
Thanks,
Xi
From hategan at mcs.anl.gov Wed Jul 2 12:30:52 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 02 Jul 2008 12:30:52 -0500
Subject: [Swift-user] Re: No response of Swift run
In-Reply-To: <20080702122209.BBV97884@m4500-03.uchicago.edu>
References: <20080702122209.BBV97884@m4500-03.uchicago.edu>
Message-ID: <1215019852.3631.4.camel@localhost>
On Wed, 2008-07-02 at 12:22 -0500, lixi at uchicago.edu wrote:
> >Could you do the following for me:
> >1. edit dist/vdsk-xyz/bin/swift
> >2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug
> >-Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n"' (a
> >single line) (you may need to do this every time you compile swift)
> >3. then run it again and let me know when it hangs. Don't kill the
> >hanging workflow. Let it hang instead.
> >4. Also let me know what machine you run this on.
>
> Now I'm running this workflow again on login.ci.uchicago.edu.
>
> Meanwhile, I launched another swift run to test a single
> site, but I got this error:
> [lixi at login GLOW]$ swift -sites.file GLOW.sites.xml -tc.file tc.data workflowtest.swift
> ERROR: transport error 202: bind failed: Address already in use ["transport.c",L41]
> ERROR: JDWP Transport dt_socket failed to initialize, TRANSPORT_INIT(510) ["debugInit.c",L500]
> JDWP exit error JVMTI_ERROR_INTERNAL(113): No transports initializedFATAL ERROR in native method: JDWP No transports initialized, jvmtiError=JVMTI_ERROR_INTERNAL(113)
>
> Does this have something to do with this option?
It has everything to do with that option :)
As far as I remember, things should continue to run ok (except for the
debugger not being started), so you should ignore the error message. If
swift doesn't run, then you could make two copies of the swift startup
script (say swift-debugger with the option and swift without the
option). Then if you want the debugger on, use swift-debugger, and for
normal runs, use swift.
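The "Address already in use" error arises because every run started from the edited script asks its JVM to listen on the same fixed debug port (address=8888), and only one process can hold that port at a time. A minimal Python sketch, assuming you simply want to check whether the port is taken before deciding which startup script to use; `port_is_free` is a hypothetical helper, not part of Swift:

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to host:port.

    host="" means all local interfaces. The check is only a snapshot:
    another process may grab the port right after it returns.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # 8888 is the debug port used in the OPTIONS line quoted above.
    script = "swift-debugger" if port_is_free(8888) else "swift"
    print("use", script)
```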
>
> Thanks,
>
> Xi
From lixi at uchicago.edu Wed Jul 2 12:38:29 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Wed, 2 Jul 2008 12:38:29 -0500 (CDT)
Subject: [Swift-user] Re: No response of Swift run
Message-ID: <20080702123829.BBV99464@m4500-03.uchicago.edu>
>> Now I'm running this workflow again on login.ci.uchicago.edu.
This workflow with 2001 jobs finished successfully and
quickly without hanging. Next I will launch a
workflow with 3001 jobs and see the result.
>As far as I remember, things should continue to run ok (except for the
>debugger not being started), so you should ignore the error message. If
>swift doesn't run, then you could make two copies of the swift startup
>script (say swift-debugger with the option and swift without the
>option). Then if you want the debugger on, use swift-debugger, and for
>normal runs, use swift.
Do you mean that I could copy swift to swift-debugger
(specifying that option in the copy) and then choose either of these
ways to run swift, e.g.:
swift first.swift
swift-debugger first.swift
so that it will invoke the corresponding script?
>>
>> Thanks,
>>
>> Xi
>
From benc at hawaga.org.uk Wed Jul 2 14:10:56 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Jul 2008 19:10:56 +0000 (GMT)
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To: <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com>
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
<82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com>
Message-ID:
On Wed, 2 Jul 2008, Andriy Fedorov wrote:
> Ok, I tried that. It indeed allocates correct number of the requested
> hosts. But, there's still a problem. It appears that only one instance
> of the executable is running, at least when I set jobType to mpi.
Specify jobType=single. Don't specify jobtype=mpi. Then in your
executable, use mpirun. The idea is to make GRAM run only a single job,
and use mpirun to launch the executables. Look at mpi.sh in the example
that I posted.
--
From hategan at mcs.anl.gov Wed Jul 2 14:26:49 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 02 Jul 2008 14:26:49 -0500
Subject: [Swift-user] Re: No response of Swift run
In-Reply-To: <20080702123829.BBV99464@m4500-03.uchicago.edu>
References: <20080702123829.BBV99464@m4500-03.uchicago.edu>
Message-ID: <1215026809.5659.0.camel@localhost>
On Wed, 2008-07-02 at 12:38 -0500, lixi at uchicago.edu wrote:
> >> Now I'm running this workflow again on login.ci.uchicago.edu.
>
> This workflow with 2001 jobs finished successfully and
> quickly without hanging. Next I will launch a
> workflow with 3001 jobs and see the result.
>
> >As far as I remember, things should continue to run ok (except for the
> >debugger not being started), so you should ignore the error message. If
> >swift doesn't run, then you could make two copies of the swift startup
> >script (say swift-debugger with the option and swift without the
> >option). Then if you want the debugger on, use swift-debugger, and for
> >normal runs, use swift.
>
> Do you mean that I could copy swift to swift-debugger
> (specifying that option in the copy) and then choose either of these
> ways to run swift, e.g.:
> swift first.swift
> swift-debugger first.swift
>
> so that it will invoke the corresponding script?
Yes.
>
> >>
> >> Thanks,
> >>
> >> Xi
> >
From fedorov at cs.wm.edu Wed Jul 2 14:43:42 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 2 Jul 2008 15:43:42 -0400
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To:
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
<82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com>
Message-ID: <82f536810807021243y1922913brf2fb5d0b2b41bbf@mail.gmail.com>
On Wed, Jul 2, 2008 at 3:10 PM, Ben Clifford wrote:
>
> On Wed, 2 Jul 2008, Andriy Fedorov wrote:
>
>> Ok, I tried that. It indeed allocates correct number of the requested
>> hosts. But, there's still a problem. It appears that only one instance
>> of the executable is running, at least when I set jobType to mpi.
>
> Specify jobType=single. Don't specify jobtype=mpi. Then in your
> executable, use mpirun. The idea is to make GRAM run only a single job,
> and use mpirun to launch the executables. Look at mpi.sh in the example
> that I posted.
>
I was referring to using GT4 GRAM directly -- no Swift. What is
happening doesn't seem right to me. Not that this is the right place
to talk about GRAM issues, just reporting my experience.
> --
>
From benc at hawaga.org.uk Wed Jul 2 17:14:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 2 Jul 2008 22:14:55 +0000 (GMT)
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To: <82f536810807021243y1922913brf2fb5d0b2b41bbf@mail.gmail.com>
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
<82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com>
<82f536810807021243y1922913brf2fb5d0b2b41bbf@mail.gmail.com>
Message-ID:
On Wed, 2 Jul 2008, Andriy Fedorov wrote:
> I was referring to using GT4 GRAM directly -- no Swift. What is
> happening doesn't seem right to me. Not that this is the right place
> to talk about GRAM issues, just reporting my experience.
what I was showing was specifically for running swift+mpi - it needs to
happen very differently to plain gram+mpi because of the server-side
components of Swift.
--
From lixi at uchicago.edu Wed Jul 2 17:40:45 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Wed, 2 Jul 2008 17:40:45 -0500 (CDT)
Subject: [Swift-user] Swift run finished with errors
Message-ID: <20080702174045.BBW30506@m4500-03.uchicago.edu>
Hi,
I just ran a workflow with 5001 jobs, but it terminated with
errors during execution. It seems that job 0-1-588 produced
a failure that was caused by a sudden connection error at site
SWT2_CPB and led to the failure of the whole workflow.
The log file plot is:
http://www.ci.uchicago.edu/~lixi/Log/report-workflowtest-20080702-1415-s9vmjplf/
The log file is on CI:
/home/lixi/newswift/test/newversion/0702/workflowtest-20080702-1415-s9vmjplf.log
Could you find out whether this job was resubmitted to another site or
to the same site before the final failure?
Thanks,
Xi
From benc at hawaga.org.uk Thu Jul 3 03:22:57 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Jul 2008 08:22:57 +0000 (GMT)
Subject: [Swift-user] Swift run finished with errors
In-Reply-To: <20080702174045.BBW30506@m4500-03.uchicago.edu>
References: <20080702174045.BBW30506@m4500-03.uchicago.edu>
Message-ID:
That job failed 3 times. Sometimes that will happen.
There are various things you can do to reduce the effect this has on your
run:
Turn on lazy.errors in swift.properties:
Normally when one job has failed (eg. it has used up all of its
retries) then the whole run is immediately abandoned.
If you turn on lazy errors, then the rest of the run will attempt to
continue. This means that you might end up with a run in which only
that one job (or perhaps only a small number of jobs) has failed. The
restart log (*.rlog) should then let you run again to try that small
number again.
Increase the number of retries in swift.properties - execution.retries.
This is set to 2 by default, meaning that a job will be executed up to
three times - once originally, and twice more as retries if there are
failures. You can increase this a small amount, eg to 5, to massively
reduce the probability of a job failing because of random job failures. (eg
if you have p=0.01 chance of a job submission failing, then
execution.retries=2 gives p^3 = 0.000001 chance of failure; but
execution.retries=5 gives p^6 = 0.000000000001 chance of failure.)
This does not help when the failures are caused by a broken job (such
as missing input files on the submit side); in such a case it will
increase load on remote systems and slow the run down.
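The arithmetic above can be checked with a short Python sketch. It assumes, as the example in the text does, that every attempt fails independently with the same probability p; real failures on a broken site are often correlated, so treat the numbers as a best case. The function name is made up for illustration:

```python
def run_failure_probability(p_fail, retries):
    """Probability that a job fails on its first attempt and on every retry.

    Assumes each attempt fails independently with probability p_fail.
    A job is attempted retries + 1 times in total.
    """
    attempts = retries + 1
    return p_fail ** attempts


if __name__ == "__main__":
    # Compare the default (2 retries) with the suggested 5 retries.
    for retries in (2, 5):
        print(retries, run_failure_probability(0.01, retries))
```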
--
From benc at hawaga.org.uk Thu Jul 3 03:34:18 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Jul 2008 08:34:18 +0000 (GMT)
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To: <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com>
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
<82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com>
Message-ID:
On Wed, 2 Jul 2008, Andriy Fedorov wrote:
> Ok, I tried that. It indeed allocates correct number of the requested
> hosts. But, there's still a problem. It appears that only one instance
> of the executable is running, at least when I set jobType to mpi.
> I am not sure it is being run as an MPI job.
I can replicate that with plain GRAM4 on TG UC. In the PBS Epilogue, I
see:
Limits: nodes=5:ia64-compute:ppn=1,walltime=00:15:00
Nodes: tg-c053 tg-c034 tg-c020 tg-c011 tg-c007
but my code only has COMM_WORLD size 1.
This code doesn't run at all if it is not run through mpi, so I think the
code *is* being run as an mpi job but the mpi node count is not getting
specified correctly.
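For reference, the "Limits" line in the epilogue shows how the three values from the job description end up combined in the PBS request. The sketch below only reproduces that string format as seen in this thread; it is an inference from the output above, not taken from the GRAM PBS adapter, and `pbs_nodes_spec` is a hypothetical name:

```python
def pbs_nodes_spec(host_count, host_type=None, cpus_per_host=None):
    """Build a PBS-style node request such as "5:ia64-compute:ppn=1".

    Illustrative only: mirrors the format seen in the epilogue output.
    """
    parts = [str(host_count)]
    if host_type:
        parts.append(host_type)
    if cpus_per_host is not None:
        parts.append("ppn={}".format(cpus_per_host))
    return ":".join(parts)


if __name__ == "__main__":
    print("nodes=" + pbs_nodes_spec(5, "ia64-compute", 1))
```

Note that this string only controls what PBS allocates. How many MPI processes get started on those nodes is decided separately, which is the part that appears to go wrong here.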
My present recommended way of doing mpi in Swift is not using jobtype=mpi
in gram, though, so I don't want to spend too much time figuring this out.
The gram-user at globus.org list and/or help at teragrid.org probably can offer
more.
> By the way, I also discovered, that sometimes the order of tags in
> .xml makes difference (meaning, with certain order of "count",
> "walltime" and "hostCount" globusrun-ws will abort). I had no idea
> order matters...
yes.
Those options are defined with an XML Schema which means, to be
valid, they must appear in the order they are defined in:
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/schemas/gram_job_description.html#type_JobDescriptionType
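If you generate job descriptions from a script, one way to avoid ordering mistakes is to sort the child elements before writing the file. A minimal Python sketch using the standard library; the order list in the usage part is a placeholder built from element names mentioned in this thread, and the real order must be read off the schema page above. Real job descriptions may also use namespace-qualified tags, which this sketch does not handle:

```python
import xml.etree.ElementTree as ET


def reorder_children(parent, schema_order):
    """Sort the children of parent to follow schema_order, in place.

    Children whose tag is not listed keep their relative order and are
    placed after all listed ones (sorted() is stable).
    """
    rank = {tag: index for index, tag in enumerate(schema_order)}
    ordered = sorted(parent, key=lambda child: rank.get(child.tag, len(rank)))
    parent[:] = ordered
    return parent


if __name__ == "__main__":
    job = ET.fromstring(
        "<job><jobType>mpi</jobType><count>4</count>"
        "<hostCount>4</hostCount></job>"
    )
    # Placeholder order for illustration; check it against the schema.
    reorder_children(job, ["count", "hostCount", "jobType"])
    print(ET.tostring(job, encoding="unicode"))
```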
--
From lixi at uchicago.edu Thu Jul 3 07:21:24 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Thu, 3 Jul 2008 07:21:24 -0500 (CDT)
Subject: [Swift-user] Swift run finished with errors
Message-ID: <20080703072124.BBW63459@m4500-03.uchicago.edu>
Thank you for the detailed explanations.
In addition, I want to know which sites these 3 tries were
submitted to, and how replication behaved, because I want to
explore the details of the scheduler's behavior.
Thanks,
Xi
---- Original message ----
>Date: Thu, 3 Jul 2008 08:22:57 +0000 (GMT)
>From: Ben Clifford
>Subject: Re: [Swift-user] Swift run finished with errors
>To: lixi at uchicago.edu
>Cc: swift-user
>
>
>That job failed 3 times. Sometimes that will happen.
>
>There are various things you can do to reduce the effect this has on your
>run:
>
>Turn on lazy.errors in swift.properties:
> Normally when one job has failed (eg. it has used up all of its
> retries) then the whole run is immediately abandoned.
> If you turn on lazy errors, then the rest of the run will attempt to
> continue. This means that you might end up with a run in which only
> that one job (or perhaps only a small number of jobs) has failed. The
> restart log (*.rlog) should then let you run again to try that small
> number again.
>
>Increase the number of retries in swift.properties - execution.retries.
> This is set to 2 by default, meaning that a job will be executed up to
> three times - once originally, and twice more as retries if there are
> failures. You can increase this a small amount, eg to 5, to massively
> reduce the probability of a job failing because of random job failures. (eg
> if you have p=0.01 chance of a job submission failing, then
> execution.retries=2 gives p^3 = 0.000001 chance of failure; but
> execution.retries=5 gives p^6 = 0.000000000001 chance of failure.)
>
> This does not help when the failures are caused by a broken job (such
> as missing input files on the submit side); in such a case it will
> increase load on remote systems and slow the run down.
>
>--
>
From benc at hawaga.org.uk Thu Jul 3 07:57:14 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Jul 2008 12:57:14 +0000 (GMT)
Subject: [Swift-user] Swift run finished with errors
In-Reply-To: <20080703072124.BBW63459@m4500-03.uchicago.edu>
References: <20080703072124.BBW63459@m4500-03.uchicago.edu>
Message-ID:
On Thu, 3 Jul 2008, lixi at uchicago.edu wrote:
> Thank you for the detailed explanations.
>
> In addition, I want to know which sites these 3 tries were
> submitted to, and how replication behaved, because I want to
> explore the details of the scheduler's behavior.
You can get such numbers from the sites/score table in log processing
output. APPLICATION_EXCEPTION means a job failed on a site; and
JOB_CANCELED (using log processing >r2082) means a job was cancelled on
this site, which is usually because replication meant a different site ran
the same job.
--
From lixi at uchicago.edu Thu Jul 3 08:05:44 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Thu, 3 Jul 2008 08:05:44 -0500 (CDT)
Subject: [Swift-user] Swift run finished with errors
Message-ID: <20080703080544.BBW65610@m4500-03.uchicago.edu>
>You can get such numbers from the sites/score table in log processing
>output. APPLICATION_EXCEPTION means a job failed on a site; and
>JOB_CANCELED (using log processing >r2082) means a job was cancelled on
>this site, which is usually because replication meant a different site ran
>the same job.
Do you mean the sites/success table? Yes, I got it. However,
that only gives general information for all jobs. I
really want to know the trace of this single failed job.
Sorry to trouble you.
From lixi at uchicago.edu Sun Jul 6 12:29:01 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sun, 6 Jul 2008 12:29:01 -0500 (CDT)
Subject: [Swift-user] Re: No response of Swift run
Message-ID: <20080706122901.BBY13774@m4500-03.uchicago.edu>
>Could you do the following for me:
>1. edit dist/vdsk-xyz/bin/swift
>2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug
>-Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n"' (a
>single line) (you may need to do this every time you compile swift)
>3. then run it again and let me know when it hangs. Don't kill the
>hanging workflow. Let it hang instead.
>4. Also let me know what machine you run this on.
>
Today, I ran a workflow with 5001 jobs using swift-debugger,
but it finished with this error message:
ERROR: transport error 202: handshake failed - received >GET http://www< - excepted >JDWP-Handshake< ["transport.c",L41]
This is the first time I have encountered this error. The
log file is on
CI: /home/lixi/newswift/test/newversion/0706/workflowtest-20080706-1134-o8s4a3ig.log
Thanks,
xi
From hategan at mcs.anl.gov Sun Jul 6 22:04:36 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 06 Jul 2008 22:04:36 -0500
Subject: [Swift-user] Re: No response of Swift run
In-Reply-To: <20080706122901.BBY13774@m4500-03.uchicago.edu>
References: <20080706122901.BBY13774@m4500-03.uchicago.edu>
Message-ID: <1215399876.29501.2.camel@localhost>
On Sun, 2008-07-06 at 12:29 -0500, lixi at uchicago.edu wrote:
> >Could you do the following for me:
> >1. edit dist/vdsk-xyz/bin/swift
> >2. replace the 'OPTIONS=' line with
> >'OPTIONS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n"'
> >(a single line) (you may need to do this every time you compile swift)
> >3. then run it again and let me know when it hangs. Don't kill the
> >hanging workflow. Let it hang instead.
> >4. Also let me know what machine you run this on.
> >
>
> Today, I ran a workflow with 5001 jobs using swift-debugger,
> but it finished with error message:
> ERROR: transport error 202: handshake failed - received >GET
> http://www< - excepted >JDWP-Handshake< ["transport.c",L41]
>
> This is the first time for me to encounter this error. The
> log file is on CI:
> /home/lixi/newswift/test/newversion/0706/workflowtest-20080706-1134-o8s4a3ig.log
Well, probably somebody was nice enough to portscan that machine while
the workflow was running. I guess there isn't any easy solution to this.
Maybe somebody else has a better idea.
>
> Thanks,
>
> xi
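For readers puzzling over the error text in this thread: a JDWP debug port expects every new connection to begin with the 14-byte ASCII string "JDWP-Handshake". The quoted "GET http://www" is also exactly 14 characters, which is consistent with some HTTP client or scanner having connected to port 8888. The following is only an illustrative Python sketch of that first check (the function name is made up for this example, and it is not taken from the JVM source):

```python
# Illustrative sketch only: mimics the first check a JDWP debug port
# performs on a new connection. Not taken from the JVM source.

JDWP_HANDSHAKE = b"JDWP-Handshake"  # 14 ASCII bytes a debugger must send first


def check_handshake(first_bytes: bytes) -> bool:
    """Return True if the peer opened the connection like a real debugger."""
    received = first_bytes[: len(JDWP_HANDSHAKE)]
    return received == JDWP_HANDSHAKE


# A debugger attaching to the port sends the handshake string first.
print(check_handshake(b"JDWP-Handshake"))
# An HTTP client sends a request line instead, as in the error above.
print(check_handshake(b"GET http://www.example.com/ HTTP/1.0\r\n"))
```

The first call prints True and the second prints False. Anything that connects to the debug port and speaks another protocol will fail this check, which fits the portscan explanation given above.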
From benc at hawaga.org.uk Tue Jul 8 01:59:45 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 8 Jul 2008 06:59:45 +0000 (GMT)
Subject: [Swift-user] suggestion for program flow control
In-Reply-To:
References: <380944.70847.qm@web52304.mail.re2.yahoo.com>
Message-ID:
On Tue, 17 Jun 2008, Ben Clifford wrote:
> > I can definitely see the benefit of having separate pipelines for
> > non-dependent parts within the same script, but perhaps there is a way
> > to chain dependent functions that is not dependent on files produced by
> > previous functions?
>
> I've been playing with some code to do that as someone else requested it.
>
> Basically you will be able to have a swiftscript variable that expresses
> the dependency, but doesn't have any actual content (such as a file).
>
> Hopefully later this week there will be something in SVN.
Somewhat later than I'd hoped. Swift SVN r2095 has 'external' types. You
can use them like this:
(external o) a() {
  app {
    helperA @strcat(@arg("dir"),"/restart-extern.1.out") "/etc/group" "qux";
  }
}
b(external o) {
  app {
    helperB @strcat(@arg("dir"),"/restart-extern.2.out") "/etc/group" "baz";
  }
}
external sync;
sync=a();
b(sync);
This makes a dependency between a and b, but doesn't actually move any
data around; it's entirely up to you to ensure that when the a procedure
finishes your data is in the right place for b to find it.
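As a rough analogy for readers more familiar with general-purpose languages, the same "dependency without data" idea can be sketched in Python with futures. This is not Swift and not how Swift implements it; the names a, b, sync and log simply mirror the example above, and the future carries no value, only the fact that a has finished:

```python
from concurrent.futures import ThreadPoolExecutor

log = []  # records the order in which the two steps run


def a():
    """First step: returns nothing, so completion is the only signal."""
    log.append("a finished")
    return None


def b(sync):
    """Second step: waits on the token, then does its own work."""
    sync.result()  # blocks until a() has completed; the value is unused
    log.append("b started")


with ThreadPoolExecutor(max_workers=2) as pool:
    sync = pool.submit(a)        # token that stands for "a is done"
    done = pool.submit(b, sync)  # b is ordered after a via the token
    done.result()

print(log)
```

As in the Swift version, nothing here moves any data between the two steps; the token only fixes the order.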
--
From iraicu at cs.uchicago.edu Tue Jul 8 15:14:04 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 08 Jul 2008 15:14:04 -0500
Subject: [Swift-user] talk at SCC08 / SWF08 on "Scientific Workflow Systems
for 21st Century, New Bottle or New Wine?"
Message-ID: <4873CA8C.80509@cs.uchicago.edu>
Hi all,
In case any of you are attending SCC08 or SWF08 in Hawaii, please join
me for a talk on Scientific Workflow Systems for 21st Century, which will
take place at 1:30PM (Hawaii time).
Here are the slides to my talk:
http://people.cs.uchicago.edu/~iraicu/presentations/2008_SWF08.pdf
Cheers,
Ioan
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From iraicu at cs.uchicago.edu Wed Jul 9 11:43:43 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 09 Jul 2008 11:43:43 -0500
Subject: [Swift-user] CFP: Workshop on Many-Task Computing on Grids and
Supercomputers (MTAGS08) co-located with IEEE/ACM SC08
Message-ID: <4874EABF.7080503@cs.uchicago.edu>
-------------------------------------------------------------------------------
Call for Papers
-------------------------------------------------------------------------------
The 1st Workshop on Many-Task Computing on Grids and Supercomputers
(MTAGS08)
http://dsl.cs.uchicago.edu/MTAGS08/
http://dsl.cs.uchicago.edu/MTAGS08/MTAGS08_CFP.txt
http://dsl.cs.uchicago.edu/MTAGS08/MTAGS08_CFP.pdf
-------------------------------------------------------------------------------
November 17, 2008
Austin, Texas, USA
Co-located with IEEE/ACM International Conference for
High Performance Computing, Networking, Storage and Analysis (SC08)
===============================================================================
The 1st workshop on Many-Task Computing on Grids and Supercomputers (MTAGS)
will provide the scientific community a dedicated forum for presenting new
research, development, and deployment efforts of loosely coupled large scale
applications on large scale clusters, Grids, and/or Supercomputers. Many-task
computing, the theme of the workshop, encompasses loosely coupled applications,
which are generally composed of many tasks (both independent and dependent
tasks) to achieve some larger application goal. We welcome paper submissions
on all topics related to MTC on large scale systems. Papers will be
peer-reviewed, and accepted papers will be published by IEEE/ACM through the
SC08 proceedings (pending approval). For more information, please visit
http://dsl.cs.uchicago.edu/MTAGS08/.
Scope
-------------------------------------------------------------------------------
This workshop will focus on the ability to manage and execute large scale
applications on today's largest clusters, Grids, and Supercomputers. Clusters
with 50K+ processor cores are beginning to come online (i.e. TACC Sun
Constellation System - Ranger), Grids (i.e. TeraGrid) with a dozen sites and
100K+ processors, and supercomputers with 160K processors (i.e. IBM
BlueGene/P).
Large clusters and supercomputers have traditionally been high performance
computing (HPC) systems, as they are efficient at executing tightly coupled
parallel jobs within a particular machine with low-latency interconnects; the
applications typically use message passing interface (MPI) to achieve the
needed inter-process communication. On the other hand, Grids have been the
preferred platform for more loosely coupled applications that tend to be
managed and executed through workflow systems. In contrast to HPC (tightly
coupled applications), these loosely coupled applications make up a new class
of applications that we call Many-Task Computing (MTC). MTC systems generally
involve the execution of independent, sequential jobs that can be individually
scheduled on many different computing resources across multiple administrative
boundaries. MTC systems typically achieve this using various grid computing
technologies and techniques, and often use files for inter-process
communication as an alternative to MPI.
MTC is reminiscent of High Throughput Computing (HTC); however, MTC differs
from HTC in the emphasis on using many computing resources over short periods
of time to accomplish many computational tasks, where the primary metrics are
measured in seconds (e.g. FLOPS, tasks/sec, MB/s I/O rates). HTC, on the other
hand, requires large amounts of computing for longer times (months and years,
rather than hours and days), and is generally measured in operations per
month.
Today's existing HPC systems are a viable platform to host MTC applications.
However, some challenges arise in large scale applications when run on large
scale systems, which can hamper the efficiency and utilization of these large
scale systems. These challenges range from local resource manager scalability
and granularity, efficient utilization of the raw hardware, shared file system
contention and scalability, reliability at scale, and application scalability,
to understanding the limitations of the HPC systems in order to identify good
candidate MTC applications. For more information, please visit
http://dsl.cs.uchicago.edu/MTAGS08/.
Topics
-------------------------------------------------------------------------------
MTAGS 2008 topics of interest include, but are not limited to:
* Compute Resource Management in large scale clusters, large Grids,
  and Supercomputers
    o Scheduling
    o Job execution frameworks
    o Local resource manager extensions
    o Performance evaluation of resource managers in use on large
      scale systems
    o Challenges in running many-task workloads on HPC systems
* Data Management in large scale Grid and Supercomputer environments:
    o Data-Aware Scheduling
    o Shared File System performance and scalability in large deployments
    o Distributed file systems
    o Data caching frameworks and techniques
* Large-Scale Workflow Systems
    o Workflow system performance and scalability analysis
    o Scalability of workflow systems
    o Workflow infrastructure and e-Science middleware
    o Programming Paradigms and Models
* Large-Scale Many-Task Applications
    o Large-scale many-task applications
    o Large-scale many-task data-intensive applications
    o Large-scale high throughput computing (HTC) applications
    o Quasi-supercomputing applications, deployments, and experiences
Paper Submission and Publication
-------------------------------------------------------------------------------
Authors are invited to submit papers with unpublished, original work of not
more than 6/10 pages (6 pages for short papers, and 10 pages for standard
papers) of double column text using single spaced 9 point size on 8.5 x 11
inch pages, as per ACM 8.5 x 11 manuscript guidelines
(http://www.acm.org/sigs/publications/proceedings-templates).
Papers conforming to the above guidelines (in PDF format) can be submitted via
email to yozha at microsoft.com and iraicu at cs.uchicago.edu before the
deadline of August 15th, 2008; please use the subject "MTAGS paper
submission". Accepted papers from this workshop will be published by IEEE/ACM
through the SC08 proceedings (pending approval). Selected excellent work may
be eligible for additional post-conference publication as journal articles or
book chapters. Submission implies the willingness of at least one of the
authors to register and present the paper. For more information, please visit
http://dsl.cs.uchicago.edu/MTAGS08/.
Important Dates
-------------------------------------------------------------------------------
* Papers Due: August 15th, 2008
* Notification of Acceptance: October 1st, 2008
* Camera Ready Papers Due: October 15th, 2008
* Workshop Date: November 17th, 2008
Committee Members
-------------------------------------------------------------------------------
Workshop Chairs
* Yong Zhao, Microsoft
* Ian Foster, University of Chicago & Argonne National Laboratory
* Ioan Raicu, University of Chicago
Technical Committee
* Ian Foster, University of Chicago & Argonne National Laboratory
* David Abramson, Monash University
* Dan Ardelean, Google
* Pete Beckman, Argonne National Laboratory
* Bob Grossman, University of Illinois at Chicago
* Indranil Gupta, University of Illinois at Urbana Champaign
* Tevfik Kosar, Louisiana State University
* Chuang Liu, Ask.com
* Shiyong Lu, Wayne State University
* Reagan Moore, University of California at San Diego
* Cristina Nita-Rotaru, Purdue University
* Marlon Pierce, Indiana University
* Ioan Raicu, University of Chicago
* Dan Reed, Microsoft
* Matei Ripeanu, University of British Columbia
* Alex Szalay, The Johns Hopkins University
* Douglas Thain, University of Notre Dame
* Mike Wilde, University of Chicago & Argonne National Laboratory
* Matthew Woitaszek, The University Corporation for Atmospheric Research
* Lingyun Yang, Yahoo Search
* Sherali Zeadally, University of the District of Columbia
* Yong Zhao, Microsoft
From abejan at ci.uchicago.edu Thu Jul 10 07:23:44 2008
From: abejan at ci.uchicago.edu (Alina Bejan)
Date: Thu, 10 Jul 2008 14:23:44 +0200
Subject: [Swift-user] BioInformatics app question
Message-ID: <4875FF50.9070201@ci.uchicago.edu>
Hello Ben/Mihael (I guess),
I have a Swift data structure question, which I'll describe below. I am
trying to write a script that performs the following workflow (the example
below is a computation between one genome and another, i.e. between
2 .faa files):
formatdb -i Ban.faa
formatdb -i Bce.faa
blastall -p blastp -d Ban.faa -i Bce.faa -m 9 -o out.Bce2Ban.txt
blastall -p blastp -d Bce.faa -i Ban.faa -m 9 -o out.Ban2Bce.txt
simple_reciprocal_best_hits.00.pl -i1 out.Bce2Ban.txt -i2 out.Ban2Bce.txt -o ortholog.pairs.txt
The 1-1 example works just fine (ortho.swift included). This file also
works well on multiple OSGEDU sites (that is, when I use it with the
osgedu-sites.xml included).
I am now trying to scale this up, using a set of 30 genomes (i.e.
30x30/2 computations - due to symmetry) -- ortho-many.swift included.
The 30 .faa files are located in the abejan/testBLAST/FASTA directory.
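To make the size of the scaled-up run concrete, here is a small Python sketch that enumerates each unordered pair of genomes once. It assumes self-comparisons are excluded, which gives 30*29/2 = 435 pairs (close to the rough 30x30/2 figure above); the file names are placeholders, not the real contents of the FASTA directory:

```python
from itertools import combinations


def genome_pairs(genomes):
    """Return each unordered pair of distinct genomes exactly once."""
    return list(combinations(genomes, 2))


# Placeholder names standing in for the 30 real .faa files.
genomes = [f"genome{n:02d}.faa" for n in range(30)]
pairs = genome_pairs(genomes)

print(len(pairs))  # 30 * 29 / 2 = 435
print(pairs[0])
```

Following the 1-1 workflow above, each pair would still need two blastall runs (one per direction) and one reciprocal-best-hits step, while formatdb only has to run once per genome.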
The problem is that I can't find a suitable mapper for this. The idea is
that I need to store the intermediate files generated in the formatdb
step (the .phr, .pin, .psq files) in a structure, and map the
components of this structure to the newly generated files. Swift
complains about the way I do it now. Ultimately I would like to run
'ortho-many' on multiple sites.
Any help is appreciated.
Thanks,
Alina
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ortho.swift
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: osgedu-sites.xml
Type: text/xml
Size: 3921 bytes
Desc: not available
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ortho-many.swift
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: blast-tc.data
URL:
From Majordomo at globus.org Thu Jul 10 15:41:14 2008
From: Majordomo at globus.org (Majordomo at globus.org)
Date: Thu, 10 Jul 2008 15:41:14 -0500 (CDT)
Subject: [Swift-user] Welcome to swift-user
Message-ID: <20080710204114.D77A812DA7@mailbouncer.mcs.anl.gov>
--
Welcome to the swift-user mailing list!
Please save this message for future reference. Thank you.
If you ever want to remove yourself from this mailing list,
you can send mail to with the following
command in the body of your email message:
unsubscribe swift-user
or from another account, besides swift-user at ci.uchicago.edu:
unsubscribe swift-user swift-user at ci.uchicago.edu
If you ever need to get in contact with the owner of the list,
(if you have trouble unsubscribing, or have questions about the
list itself) send email to .
This is the general rule for most mailing lists when you need
to contact a human.
Here's the general information for the list you've subscribed to,
in case you don't already have it:
Discussion list for swift users
From zhengxiongh at uchicago.edu Mon Jul 14 14:34:15 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Mon, 14 Jul 2008 14:34:15 -0500 (CDT)
Subject: [Swift-user] readdata or csv_mapper problem
Message-ID: <20080714143415.BIO88651@m4500-01.uchicago.edu>
Hello,
When I try to run many dock jobs on the OSG grid sites,
I have problems using readdata or csv_mapper. Please
help me solve them. Here is the experiment on localhost:
The swift code is as follows:
***********************************************************
[houzx at communicado dock]$ cat grid-many-dock6-string.swift
type file;
(file t) dockcompute (string ligandsfile, string targetlist)
{
  app {
    rundock ligandsfile targetlist stdout=@filename(t);
  }
}
type params {
  string ligandsfile;
  string targetlist;
}
#params pset[] ;
doall(params pset[])
{
  foreach params,i in pset {
    #string mol2file ;
    #string target ;
    file sout ;
    sout = dockcompute(pset[i].ligandsfile,pset[i].targetlist);
  }
}
params p[];
p = readdata("paramslist.txt");
doall(p);
***********************************************************
The content of "paramslist.txt" is as follows:
[houzx at communicado dock]$ cat paramslist.txt
ligandsfile,targetlist
/home/houzx/dock-run/databases/KEGG_and_Drugs/D00180.mol2,1F9Y
/home/houzx/dock-run/databases/KEGG_and_Drugs/D00181.mol2,1F9Y
/home/houzx/dock-run/databases/KEGG_and_Drugs/D00182.mol2,1F9Y
(1) Use this "readdata" code, and the log file is in the
attachment "grid-many-dock6-string-20080714-readdata.log".
[houzx at communicado dock]$ swift grid-many-dock6-string.swift
Swift v0.4 swift-r1718 cog-r1934
RunID: 20080714-1405-letz6tcb
Progress:
Execution failed:
File header does not match type. Expected the
following header items (in no particular order):
[ligandsfile, targetlist]. Instead, the header was (again,
in no particular order): [ligandsfile,targetlist]
(2) Use csv_mapper, and the log file is in the attachment
"grid-many-dock6-string-20080714-1417-csv.log"
[houzx at communicado dock]$ swift grid-many-dock6-string.swift
Swift v0.4 swift-r1718 cog-r1934
RunID: 20080714-1417-pmo8hsjf
Progress:
rundock started
rundock started
rundock started
rundock completed
rundock completed
rundock completed
Final status: Finished:3
***********************************************************
[houzx at communicado dock]$ cat grid-many-dock6-string.swift
type file;
(file t) dockcompute (string ligandsfile, string targetlist)
{
  app {
    rundock ligandsfile targetlist stdout=@filename(t);
  }
}
type params {
  string ligandsfile;
  string targetlist;
}
params pset[] ;
foreach params,i in pset {
  file sout ;
  sout = dockcompute(pset[i].ligandsfile,pset[i].targetlist);
}
***********************************************************
But in "/home/houzx/dock-run/databases/results/", the created files
are: null-0-stdout.txt, null-1-stdout.txt, null-2-stdout.txt.
This means that "pset[i].targetlist" is set to "null", not
the data "1F9Y" from paramslist.txt!
If I use "Swift v0.3", the created files are: true-0-stdout.txt,
true-1-stdout.txt, true-2-stdout.txt.
Thanks!
B.R.
zhengxiong
-------------- next part --------------
A non-text attachment was scrubbed...
Name: grid-many-dock6-string-20080714-1417-csv.log
Type: application/octet-stream
Size: 77494 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: grid-many-dock6-string-20080714-readdata.log
Type: application/octet-stream
Size: 10140 bytes
Desc: not available
URL:
From wilde at mcs.anl.gov Mon Jul 14 14:43:16 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 14 Jul 2008 14:43:16 -0500
Subject: [Swift-user] readdata or csv_mapper problem
In-Reply-To: <20080714143415.BIO88651@m4500-01.uchicago.edu>
References: <20080714143415.BIO88651@m4500-01.uchicago.edu>
Message-ID: <487BAC54.9050705@mcs.anl.gov>
I think the readdata problem is that you need whitespace between the var
names in line 1. The Users Guide says:
"For structs of scalars, the file should contain two rows. The first row
should be structure member names separated by whitespace. The second row
should be the corresponding values for each structure member, separated
by whitespace, in the same order as the header row."
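As a small illustration of the fix suggested above, this Python sketch rewrites comma-separated rows as whitespace-separated rows. The function name is made up for this example, and it assumes no field contains a comma or whitespace, which holds for the paramslist.txt quoted below:

```python
def commas_to_whitespace(text: str) -> str:
    """Rewrite comma-separated rows as whitespace-separated rows.

    Assumes no field itself contains a comma or whitespace.
    """
    lines = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


# Header and first data row, taken from the paramslist.txt in this thread.
sample = (
    "ligandsfile,targetlist\n"
    "/home/houzx/dock-run/databases/KEGG_and_Drugs/D00180.mol2,1F9Y\n"
)
print(commas_to_whitespace(sample), end="")
```

The header row comes out as "ligandsfile targetlist", which is the whitespace-separated form the Users Guide passage describes.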
On 7/14/08 2:34 PM, Zhengxiong Hou wrote:
> Hello,
> When I try to run many dock jobs on the osg grid sites,
> there are problems for using readdata or csv_mapper. Please
> help to solve it. Here is the experiment at localhost:
>
> The swift code is as follows:
> ***********************************************************
> [houzx at communicado dock]$ cat grid-many-dock6-string.swift
> type file;
> (file t) dockcompute (string ligandsfile, string targetlist)
> {
> app {
> rundock ligandsfile targetlist stdout=@filename(t);
> }
> }
>
> type params {
> string ligandsfile;
> string targetlist;
> }
>
> #params pset[] ;
> doall(params pset[])
> {
> foreach params,i in pset {
> #string mol2file [i].ligandsfile>;
> #string target [i].targetlist>;
> file sout ("/home/houzx/dock-run/databases/results/",pset[i].targetlist,"-",i,"-stdout.txt")>;
> sout = dockcompute(pset[i].ligandsfile,pset[i].targetlist);
> }
> }
>
> params p[];
> p = readdata("paramslist.txt");
> doall(p);
> ***********************************************************
>
> The content of "paramslist.txt" is as follows:
> [houzx at communicado dock]$ cat paramslist.txt
> ligandsfile,targetlist
> /home/houzx/dock-run/databases/KEGG_and_Drugs/D00180.mol2,1F9Y
> /home/houzx/dock-run/databases/KEGG_and_Drugs/D00181.mol2,1F9Y
> /home/houzx/dock-run/databases/KEGG_and_Drugs/D00182.mol2,1F9Y
>
>
> (1) Use this "readdata" code, and the log file is in the
> attachment "grid-many-dock6-string-20080714-readdata.log".
>
> [houzx at communicado dock]$ swift grid-many-dock6-string.swift
> Swift v0.4 swift-r1718 cog-r1934
>
> RunID: 20080714-1405-letz6tcb
> Progress:
> Execution failed:
> File header does not match type. Expected the
> following header items (in no particular order):
> [ligandsfile, targetlist]. Instead, the header was (again,
> in no particular order): [ligandsfile,targetlist]
>
>
> (2) Use csv_mapper, and the log file is in the attachment
> "grid-many-dock6-string-20080714-1417-csv.log"
> [houzx at communicado dock]$ swift grid-many-dock6-string.swift
> Swift v0.4 swift-r1718 cog-r1934
>
> RunID: 20080714-1417-pmo8hsjf
> Progress:
> rundock started
> rundock started
> rundock started
> rundock completed
> rundock completed
> rundock completed
> Final status: Finished:3
> ***********************************************************
> [houzx at communicado dock]$ cat grid-many-dock6-string.swift
> type file;
> (file t) dockcompute (string ligandsfile, string targetlist)
> {
> app {
> rundock ligandsfile targetlist stdout=@filename(t);
> }
> }
>
> type params {
> string ligandsfile;
> string targetlist;
> }
>
> params pset[] ;
>
> foreach params,i in pset {
> file sout ("/home/houzx/dock-run/databases/results/",pset[i].targetlist,"-",i,"-stdout.txt")>;
> sout = dockcompute(pset[i].ligandsfile,pset[i].targetlist);
> }
> ***********************************************************
>
> But, in the "/home/houzx/dock-run/databases/results/", the
> created files are: null-0-stdout.txt, null-1-stdout.txt,
> null-2-stdout.txt.
> It means that "pset[i].targetlist" is set to be "null", not
> the data "1F9Y" from paramslist.txt!
> If I use "Swift v0.3", the created files are: true-0-stdout.txt,
> true-1-stdout.txt, true-2-stdout.txt.
>
>
> Thanks!
> B.R.
> zhengxiong
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
From zhengxiongh at uchicago.edu Mon Jul 14 15:08:25 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Mon, 14 Jul 2008 15:08:25 -0500 (CDT)
Subject: [Swift-user] readdata or csv_mapper problem
Message-ID: <20080714150825.BIO93135@m4500-01.uchicago.edu>
Hi Mike,
It works now for Swift 0.4. Actually, I had modified it in
Swift 0.3 to use spaces, but the problem seemed to still be
there. I was puzzled.
Anyway, it works now!
Thanks much!
Zhengxiong
---- Original message ----
>Date: Mon, 14 Jul 2008 14:43:16 -0500
>From: Michael Wilde
>Subject: Re: [Swift-user] readdata or csv_mapper problem
>To: Zhengxiong Hou
>Cc: swift-user at ci.uchicago.edu
>
>I think the readdata problem is that you need whitespace between the var
>names in line 1. The Users Guide says:
>
>"For structs of scalars, the file should contain two rows. The first row
>should be structure member names separated by whitespace. The second row
>should be the corresponding values for each structure member, separated
>by whitespace, in the same order as the header row."
>
From benc at hawaga.org.uk Tue Jul 15 05:30:41 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 15 Jul 2008 10:30:41 +0000 (GMT)
Subject: [Swift-user] BioInformatics app question
In-Reply-To: <4875FF50.9070201@ci.uchicago.edu>
References: <4875FF50.9070201@ci.uchicago.edu>
Message-ID:
This is not particularly elegant, but it will mostly do what I think you
want:
make two custom mappers as shell scripts:
$ cat inmapper
#!/bin/bash
i=0
ls data/* | while read n; do
  echo [$i] $n
  i=$(( $i + 1 ))
done
and medmapper contains:
$ cat medmapper
#!/bin/bash
i=0
ls data/* | while read n; do
  echo [$i].left ${n}.left
  echo [$i].right ${n}.right
  i=$(( $i + 1 ))
done
This pair relies on the fact that mapping will happen at the start and
that ls will return files in the same order in both of them.
Then you can use them like this:
type file;
type medfiles {
  file left;
  file right;
}
(medfiles o) preprocess(file i) {
  o.left = touch();
  o.right = touch();
}
compare(file l, file r, medfiles lm, medfiles rm) {
  trace("comparing ",@l," and ",@r);
  process(lm.left);
  process(rm.left);
}
(file f) touch() {
  app {
    echo "hi" stdout=@f;
  }
}
process(file f) {
  app {
    cat "/dev/null" ;
  }
}
file inputs[] ;
medfiles intermediates[] ;
foreach input,i in inputs {
  intermediates[i] = preprocess(input);
}
foreach left, il in inputs {
  foreach right, ir in inputs {
    compare(left, right, intermediates[il], intermediates[ir]);
  }
}
Also, because the intermediate files are stored in the same directory as
the source data, and there is nothing in the mappers to detect whether a
file is an input or an intermediate file, if you run the same workflow twice
you will find the previous generation's .left and .right files being picked
up as inputs. You will need to rm -v data/*.left data/*.right between
runs. This could be fixed in the mappers in a couple of ways, left as an
exercise to the reader.
A more elegant solution might involve the mapper for intermediates[] doing
a transform on the way that inputs[] is mapped, but there is no mapper to
do that at the moment in a way that is useful here. (I have some thoughts
about what it would look like but they are not developed enough for
implementation).
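One possible take on the mapper fix left as an exercise above is to skip anything ending in .left or .right when listing inputs. The sketch below does this in Python rather than bash; the function names are invented for the example, and it assumes the same "[index] path" output format as the two shell mappers. Sorting the names keeps both mappers in the same order without relying on ls:

```python
INTERMEDIATE_SUFFIXES = (".left", ".right")


def real_inputs(names):
    """Drop intermediate files left over from an earlier run; sort the rest."""
    return sorted(n for n in names if not n.endswith(INTERMEDIATE_SUFFIXES))


def input_mapper_lines(names):
    """Lines in the style of inmapper: '[i] path'."""
    return [f"[{i}] {name}" for i, name in enumerate(real_inputs(names))]


def intermediate_mapper_lines(names):
    """Lines in the style of medmapper: '[i].left path.left' and '.right'."""
    lines = []
    for i, name in enumerate(real_inputs(names)):
        lines.append(f"[{i}].left {name}.left")
        lines.append(f"[{i}].right {name}.right")
    return lines


# Directory listing after a previous run (example names only).
listing = ["data/Bce.faa", "data/Ban.faa", "data/Ban.faa.left", "data/Ban.faa.right"]
print("\n".join(input_mapper_lines(listing)))
print("\n".join(intermediate_mapper_lines(listing)))
```

To use this as an actual mapper script, the listing would come from something like glob.glob("data/*") and the lines would be printed to stdout, but that wiring is left out here to keep the sketch independent of any particular directory.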
--
From lixi at uchicago.edu Sat Jul 19 10:13:48 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 10:13:48 -0500 (CDT)
Subject: [Swift-user] GT4
Message-ID: <20080719101348.BCI34463@m4500-03.uchicago.edu>
Hi,
In past experiments, I always used gt2 as the provider. Now I
think that it's better to use gt4 instead. The only way I
know to migrate from gt2 to gt4 in Swift is to modify the
sites file. Is that right?
In my current sites file, the site item is as follows:
/atlas/data08/OSG/DATA
Now, according to the default sites.xml, I replaced it with:
/atlas/data08/OSG/DATA
Then I'm going to test if it works by running first.swift on
each site one by one. Is that the right way to test if we can
use WS GRAM for that site? For the first site, AGLT2, I got
this output:
[lixi at communicado AGLT2]$ swift -sites.file AGLT2.WSGRAM.sites.xml -tc.file /home/lixi/osg/swifttest/tc.data ../first.swift
Unable to find required classes
(javax.activation.DataHandler and
javax.mail.internet.MimeMultipart). Attachment support is
disabled.
Swift svn swift-r2081 cog-r2065
RunID: 20080719-1005-dutdv6p4
Progress:
echo started
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
It seems that it doesn't work well.
Could you give me some instructions? Thanks,
Xi
From hategan at mcs.anl.gov Sat Jul 19 10:31:26 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 19 Jul 2008 10:31:26 -0500
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719101348.BCI34463@m4500-03.uchicago.edu>
References: <20080719101348.BCI34463@m4500-03.uchicago.edu>
Message-ID: <1216481486.11366.5.camel@localhost>
On Sat, 2008-07-19 at 10:13 -0500, lixi at uchicago.edu wrote:
> Hi,
>
> In the past experiments, I always use gt2 as provider. Now I
> think that it's better to use gt4 instead. The only way I
> know to migrate from gt4 to gt2 in Swift is to modify the
> sites file. Is that right?
>
> In my current sites file, the site item is as follows:
>
>
> url="gate01.aglt2.org/jobmanager-condor" major="2" />
> /atlas/data08/OSG/DATA
>
>
> Now according the default sites.xml, I replaced it with:
>
>
> url="gate01.aglt2.org" />
> /atlas/data08/OSG/DATA
>
That looks about right. But you have to be sure there is a GT4 container
on that site.
>
> Then I'm going to test if it works by running first.swift on
> each site one by one. Is it the right way to test if we can
> use WS GRAM for that site?.
I don't think there is a "right" way here. Though there was this script
I wrote somewhere to test such things. It's in bin and called
checksites.k. It would test all the sites in sites.xml.
> For the first site AGLT2, I got
> such output:
> [lixi at communicado AGLT2]$ swift -sites.file AGLT2.WSGRAM.sites.xml -tc.file /home/lixi/osg/swifttest/tc.data ../first.swift
> Unable to find required classes
> (javax.activation.DataHandler and
> javax.mail.internet.MimeMultipart). Attachment support is
> disabled.
You can ignore that. It doesn't have any effect on things.
> Swift svn swift-r2081 cog-r2065
>
> RunID: 20080719-1005-dutdv6p4
> Progress:
> echo started
> Progress: Selecting site:1
> Progress: Selecting site:1
> Progress: Selecting site:1
> Progress: Selecting site:1
> Progress: Selecting site:1
> Progress: Selecting site:1
> Progress: Selecting site:1
>
> It seems that it doesn't work well.
Can you send logs?
>
> Could you give me some instructions? Thanks,
>
> Xi
From lixi at uchicago.edu Sat Jul 19 10:39:36 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 10:39:36 -0500 (CDT)
Subject: [Swift-user] Re: GT4
Message-ID: <20080719103936.BCI35055@m4500-03.uchicago.edu>
>I don't think there is a "right" way here. Though there was this script
>I wrote somewhere to test such things. It's in bin and called
>checksites.k. It would test all the sites in sites.xml.
I see that script. Can I run it on a specified sites file
alone? Could you give me an example?
>Can you send logs?
I ran it again with swift-debugger, and it seems to hang. I
just let it be. The log file is:
/home/lixi/osg/swifttest/AGLT2/first-20080719-1032-h40vbfc8.log
Thanks,
Xi
From hategan at mcs.anl.gov Sat Jul 19 10:47:07 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 19 Jul 2008 10:47:07 -0500
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719103936.BCI35055@m4500-03.uchicago.edu>
References: <20080719103936.BCI35055@m4500-03.uchicago.edu>
Message-ID: <1216482427.11775.0.camel@localhost>
On Sat, 2008-07-19 at 10:39 -0500, lixi at uchicago.edu wrote:
> >I don't think there is a "right" way here. Though there was this script
> >I wrote somewhere to test such things. It's in bin and called
> >checksites.k. It would test all the sites in sites.xml.
>
> I see that script. Can I run it on specified sites file
> alone? Could you give me an example?
cog-workflow checksites.k mysitesfile.xml
>
> >Can you send logs?
> I run it again with swift-debugger, it seems hanging up. I
> just let it be. The log file is:
> /home/lixi/osg/swifttest/AGLT2/first-20080719-1032-h40vbfc8.log
>
> Thanks,
>
> Xi
From hategan at mcs.anl.gov Sat Jul 19 10:49:10 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 19 Jul 2008 10:49:10 -0500
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719103936.BCI35055@m4500-03.uchicago.edu>
References: <20080719103936.BCI35055@m4500-03.uchicago.edu>
Message-ID: <1216482550.11775.3.camel@localhost>
On Sat, 2008-07-19 at 10:39 -0500, lixi at uchicago.edu wrote:
> >Can you send logs?
> I run it again with swift-debugger, it seems hanging up. I
> just let it be. The log file is:
> /home/lixi/osg/swifttest/AGLT2/first-20080719-1032-h40vbfc8.log
Did you manually stop swift there?
Btw, this seems to be the problem:
2008-07-19 10:32:04,845-0500 DEBUG TaskImpl Task(type=FILE_OPERATION,
identity=urn:0-1-1216481524364) setting status to Failed
org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: Error
communicating with the GridFTP server
So it has nothing to do with GRAM.
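A rough way to pull such entries out of a Swift log, sketched in Python. It assumes each status change sits on a single line in the format quoted above; the function name failed_tasks and the sample text are made up for illustration and are not part of Swift or CoG.

```python
import re

# Matches "setting status to Failed" entries like the one quoted above.
# Assumption: each entry occupies a single line in the log file.
FAILED = re.compile(
    r"Task\(type=(?P<type>\w+),\s*identity=(?P<identity>[^)]+)\)"
    r"\s+setting status to Failed\s*(?P<reason>.*)"
)

def failed_tasks(log_text):
    """Return (task type, identity, reason) for every failed task in log_text."""
    hits = []
    for line in log_text.splitlines():
        m = FAILED.search(line)
        if m:
            hits.append((m.group("type"), m.group("identity"),
                         m.group("reason").strip()))
    return hits

# Hypothetical one-line version of the entry from this thread.
sample = (
    "2008-07-19 10:32:04,845-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, "
    "identity=urn:0-1-1216481524364) setting status to Failed "
    "org.globus.cog.abstraction.impl.file.IrrecoverableResourceException: "
    "Error communicating with the GridFTP server"
)

for task_type, identity, reason in failed_tasks(sample):
    print(task_type, identity, reason)
```

A FILE_OPERATION entry like this one points at file transfer rather than job submission, which is why the trouble here is on the GridFTP side.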
>
> Thanks,
>
> Xi
From lixi at uchicago.edu Sat Jul 19 10:52:38 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 10:52:38 -0500 (CDT)
Subject: [Swift-user] Re: GT4
Message-ID: <20080719105238.BCI35341@m4500-03.uchicago.edu>
>Btw, this seems to be the problem:
>2008-07-19 10:32:04,845-0500 DEBUG TaskImpl Task
(type=FILE_OPERATION,
>identity=urn:0-1-1216481524364) setting status to Failed
org.glo
>bus.cog.abstraction.impl.file.IrrecoverableResourceException
: Error
>communicating with the GridFTP server
I see, :)
Thanks,
Xi
From lixi at uchicago.edu Sat Jul 19 10:57:32 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 10:57:32 -0500 (CDT)
Subject: [Swift-user] Re: GT4
Message-ID: <20080719105732.BCI35433@m4500-03.uchicago.edu>
>cog-workflow checksites.k mysitesfile.xml
I ran it and got this output:
[lixi at communicado bin]$ cog-workflow checksites.k /home/lixi/osg/swifttest/AGLT2/AGLT2.WSGRAM.sites.xml
Execution failed:
Missing argument major for sys:element(url, storage, major, minor, patch)
gridftp @ checksites.k, line: 12
pool @ AGLT2.WSGRAM.sites.xml, line: 38
pool @ AGLT2.WSGRAM.sites.xml, line: 38
org.globus.cog.karajan.workflow.nodes.Sequential @ AGLT2.WSGRAM.sites.xml
sys:executefile @ checksites.k, line: 42
list:list @ checksites.k, line: 42
sys:set @ checksites.k, line: 42
kernel:karajan @ checksites.k, line: 1
checksites.k
Detailed exception:
Missing argument major for sys:element(url, storage, major, minor, patch)
gridftp @ checksites.k, line: 12
pool @ AGLT2.WSGRAM.sites.xml, line: 38
pool @ AGLT2.WSGRAM.sites.xml, line: 38
org.globus.cog.karajan.workflow.nodes.Sequential @ AGLT2.WSGRAM.sites.xml
sys:executefile @ checksites.k, line: 42
list:list @ checksites.k, line: 42
sys:set @ checksites.k, line: 42
kernel:karajan @ checksites.k, line: 1
checksites.k
at org.globus.cog.karajan.workflow.nodes.user.UserDefinedElement.prepareInstanceArguments(UserDefinedElement.java:196)
at org.globus.cog.karajan.workflow.nodes.user.UserDefinedElement.startBody(UserDefinedElement.java:170)
at org.globus.cog.karajan.workflow.nodes.user.SequentialImplicitExecutionUDE.startBody(SequentialImplicitExecutionUDE.java:55)
at org.globus.cog.karajan.workflow.nodes.user.SequentialImplicitExecutionUDE.childCompleted(SequentialImplicitExecutionUDE.java:82)
at org.globus.cog.karajan.workflow.nodes.Sequential.notificationEvent(Sequential.java:33)
at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:335)
at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificationEvent(FlowNode.java:173)
at org.globus.cog.karajan.workflow.nodes.FlowNode.complete(FlowNode.java:299)
at org.globus.cog.karajan.workflow.nodes.FlowContainer.post(FlowContainer.java:58)
at org.globus.cog.karajan.workflow.nodes.Sequential.startNext(Sequential.java:51)
at org.globus.cog.karajan.workflow.nodes.Sequential.executeChildren(Sequential.java:27)
at org.globus.cog.karajan.workflow.nodes.user.UDEWrapper.executeWrapper(UDEWrapper.java:115)
at org.globus.cog.karajan.workflow.nodes.user.SequentialImplicitExecutionUDE.startArguments(SequentialImplicitExecutionUDE.java:46)
at org.globus.cog.karajan.workflow.nodes.user.SequentialImplicitExecutionUDE.startInstance(SequentialImplicitExecutionUDE.java:37)
at org.globus.cog.karajan.workflow.nodes.user.UDEWrapper.pre(UDEWrapper.java:75)
at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java:62)
at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java:240)
at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java:281)
at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java:393)
at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java:332)
at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java:227)
at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java:125)
at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java:99)
at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java:69)
AGLT2.WSGRAM.sites.xml contains this content:
/atlas/data08/OSG/DATA
Does this output mean my sites file is wrong?
Thanks,
Xi
From benc at hawaga.org.uk Sat Jul 19 11:04:23 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 19 Jul 2008 16:04:23 +0000 (GMT)
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719105732.BCI35433@m4500-03.uchicago.edu>
References: <20080719105732.BCI35433@m4500-03.uchicago.edu>
Message-ID:
gate01.aglt2.org is reachable from login.ci but not from my UK machine.
So there might be some strange network stuff going on with that host.
For GRAM4 on that machine you need (by the looks of it) to use tcp port
9443, not the default 8443 which is running some other services. I think
this will in general be true for all OSG resources.
I can't submit to GRAM4 on that machine because I'm not authorized:
$ globusrun-ws -submit -F gate01.aglt2.org:9443 -c /bin/hostname
Submitting job...Failed.
globusrun-ws: Error submitting job
globus_soap_message_module: SOAP Fault
Fault code: soapenv:Server.userException
Fault string:
org.globus.wsrf.impl.security.authorization.exceptions.AuthorizationException:
"/DC=org/DC=doegrids/OU=People/CN=Benjamin Clifford 418168" is not
authorized to use operation:
{http://www.globus.org/namespaces/2004/10/gram/job}createManagedJob on
this service
nor can I use gridftp on that machine:
$ globus-url-copy file:///etc/group gsiftp://gate01.aglt2.org/tmp/benc008
error: globus_ftp_client: the server responded with an error
530 530-Login incorrect. :
gridmap.c:globus_gss_assist_map_and_authorize:1944:
530-Error invoking callout
530-globus_callout.c:globus_callout_handle_call_type:727:
530-The callout returned an error
530-prima_module.c:Globus Gridmap Callout:470:
530-Gridmap lookup failure: Could not retrieve mapping for
/DC=org/DC=doegrids/OU=People/CN=Benjamin Clifford 418168 from identity
mapping server
530-
530 End.
You could try the above two commands yourself and see what you get.
--
From hategan at mcs.anl.gov Sat Jul 19 11:10:10 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 19 Jul 2008 11:10:10 -0500
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719105732.BCI35433@m4500-03.uchicago.edu>
References: <20080719105732.BCI35433@m4500-03.uchicago.edu>
Message-ID: <1216483810.12254.0.camel@localhost>
On Sat, 2008-07-19 at 10:57 -0500, lixi at uchicago.edu wrote:
> >cog-workflow checksites.k mysitesfile.xml
>
> I ran it and got such output:
>
> [lixi at communicado bin]$ cog-workflow
> checksites.k /home/lixi/osg/swifttest/AGLT2/AGLT2.WSGRAM.site
> s.xml
>
> Execution failed:
> Missing argument major for sys:element(url, storage, major,
> minor, patch)
Seems like it hasn't been updated in a while.
From lixi at uchicago.edu Sat Jul 19 11:10:36 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 11:10:36 -0500 (CDT)
Subject: [Swift-user] Re: GT4
Message-ID: <20080719111036.BCI35848@m4500-03.uchicago.edu>
>For GRAM4 on that machine you need (by the looks of it) to
use tcp port
>9443, not the default 8443 which is running some other
services. I think
>this will in general be true for all OSG resources.
How do I change it? Is it like this:
/atlas/data08/OSG/DATA
>You could try the above two commands yourself and see what
you get.
[lixi at communicado AGLT2]$ globus-url-copy file:///home/lixi/osg/swifttest/AGLT2/currenttime.tmp gsiftp://gate01.aglt2.org/atlas/data08/OSG/APP/osglixi/
error: globus_ftp_client: the server responded with an error
530 530-Login incorrect. :
globus_i_gfs_data.c:globus_l_gfs_data_authorize:1050:
530-Mapped user 'osg' is invalid.
530 End.
[lixi at communicado AGLT2]$ globusrun-ws -submit -F gate01.aglt2.org:9443 -c /bin/hostname
Submitting job...Done.
Job ID: uuid:b7d614b8-55ac-11dd-b2fe-001a64784960
Termination time: 07/20/2008 16:06 GMT
Current job state: Failed
Destroying job...Done.
globusrun-ws: Job failed: Error code: 201
Script stderr:
/usr/bin/sudo: uid 825675 does not exist in the passwd file!
Thanks,
Xi
From hategan at mcs.anl.gov Sat Jul 19 11:13:25 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 19 Jul 2008 11:13:25 -0500
Subject: [Swift-user] Re: GT4
In-Reply-To: <1216483810.12254.0.camel@localhost>
References: <20080719105732.BCI35433@m4500-03.uchicago.edu>
<1216483810.12254.0.camel@localhost>
Message-ID: <1216484005.12366.0.camel@localhost>
On Sat, 2008-07-19 at 11:10 -0500, Mihael Hategan wrote:
> On Sat, 2008-07-19 at 10:57 -0500, lixi at uchicago.edu wrote:
> > >cog-workflow checksites.k mysitesfile.xml
> >
> > I ran it and got such output:
> >
> > [lixi at communicado bin]$ cog-workflow
> > checksites.k /home/lixi/osg/swifttest/AGLT2/AGLT2.WSGRAM.site
> > s.xml
> >
> > Execution failed:
> > Missing argument major for sys:element(url, storage, major,
> > minor, patch)
>
> Seems like it hasn't been updated in a while.
In other words, use what Ben mentions for now.
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
From benc at hawaga.org.uk Sat Jul 19 11:12:20 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 19 Jul 2008 16:12:20 +0000 (GMT)
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719111036.BCI35848@m4500-03.uchicago.edu>
References: <20080719111036.BCI35848@m4500-03.uchicago.edu>
Message-ID:
On Sat, 19 Jul 2008, lixi at uchicago.edu wrote:
> >You could try the above two commands yourself and see what
> you get.
so they both fail for you. interact with the site admins for that site to
make them work. when they work, try swift again.
--
From hategan at mcs.anl.gov Sat Jul 19 11:17:01 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 19 Jul 2008 11:17:01 -0500
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719111036.BCI35848@m4500-03.uchicago.edu>
References: <20080719111036.BCI35848@m4500-03.uchicago.edu>
Message-ID: <1216484221.12426.1.camel@localhost>
On Sat, 2008-07-19 at 11:10 -0500, lixi at uchicago.edu wrote:
> >For GRAM4 on that machine you need (by the looks of it) to
> use tcp port
> >9443, not the default 8443 which is running some other
> services. I think
> >this will in general be true for all OSG resources.
>
> How to change it? Does it like this:
>
>
> url="gate01.aglt2.org:9443" />
> /atlas/data08/OSG/DATA
>
Yes.
>
> >You could try the above two commands yourself and see what
> you get.
>
> [lixi at communicado AGLT2]$ globus-url-copy
> file:///home/lixi/osg/swifttest/AGLT2/currenttime.tmp
> gsiftp://gate01.aglt2.org/atlas/data08/OSG/APP/osglixi/
>
> error: globus_ftp_client: the server responded with an error
> 530 530-Login incorrect. :
> globus_i_gfs_data.c:globus_l_gfs_data_authorize:1050:
> 530-Mapped user 'osg' is invalid.
> 530 End.
Your account there seems messed up.
>
> [lixi at communicado AGLT2]$ globusrun-ws -submit -F
> gate01.aglt2.org:9443 -c /bin/hostname
> Submitting job...Done.
> Job ID: uuid:b7d614b8-55ac-11dd-b2fe-001a64784960
> Termination time: 07/20/2008 16:06 GMT
> Current job state: Failed
> Destroying job...Done.
> globusrun-ws: Job failed: Error code: 201
> Script stderr:
> /usr/bin/sudo: uid 825675 does not exist in the passwd file!
Again, your account seems broken. Did this ever work for you?
>
> Thanks,
>
> Xi
From lixi at uchicago.edu Sat Jul 19 11:14:28 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 11:14:28 -0500 (CDT)
Subject: [Swift-user] Re: GT4
Message-ID: <20080719111428.BCI35928@m4500-03.uchicago.edu>
>so they both fail for you. interact with the site admins
for that site to
>make them work. when they work, try swift again.
Thanks. So is this also the right way to check the other sites,
one by one?
Xi
From benc at hawaga.org.uk Sat Jul 19 11:16:00 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 19 Jul 2008 16:16:00 +0000 (GMT)
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719111428.BCI35928@m4500-03.uchicago.edu>
References: <20080719111428.BCI35928@m4500-03.uchicago.edu>
Message-ID:
On Sat, 19 Jul 2008, lixi at uchicago.edu wrote:
> >so they both fail for you. interact with the site admins
> for that site to
> >make them work. when they work, try swift again.
>
> Thanks, so is it also the right way to check other sites one
> by one ?
Those two commands would be the commands I would use to test a site.
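For checking several sites one by one, the two commands can be wrapped in a small script. This is only a sketch in Python: the helper names, the default remote path /tmp/sitecheck.tmp and the idea that a non-zero exit status means failure are assumptions for illustration, and only the command-line options already shown in this thread are used.

```python
import subprocess

def site_check_commands(host, gram_port=9443, remote_path="/tmp/sitecheck.tmp"):
    """Build the two test commands from this thread for one site.

    remote_path is a made-up default; use a path you are allowed to write to.
    """
    gram = ["globusrun-ws", "-submit", "-F", f"{host}:{gram_port}",
            "-c", "/bin/hostname"]
    gridftp = ["globus-url-copy", "file:///etc/group",
               f"gsiftp://{host}{remote_path}"]
    return [gram, gridftp]

def check_site(host, **kwargs):
    """Run both commands; map each command name to True if it exited with 0."""
    results = {}
    for cmd in site_check_commands(host, **kwargs):
        try:
            results[cmd[0]] = subprocess.run(cmd).returncode == 0
        except FileNotFoundError:
            # The Globus client tools are not installed on this machine.
            results[cmd[0]] = False
    return results

# Print the commands instead of running them, so they can be inspected first.
for cmd in site_check_commands("gate01.aglt2.org"):
    print(" ".join(cmd))
```

Calling check_site() for each host in turn gives a quick pass/fail table before Swift is involved at all.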
--
From lixi at uchicago.edu Sat Jul 19 11:18:15 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 11:18:15 -0500 (CDT)
Subject: [Swift-user] Re: GT4
Message-ID: <20080719111815.BCI36021@m4500-03.uchicago.edu>
>Again, your account seems broken. Did this ever work for
you?
Although I have never used WS-GRAM on these sites, I have used GRAM
and GridFTP successfully to run Swift workflows on that site before.
The most recent successful run was the day before yesterday.
>> Thanks,
>>
>> Xi
>
From hategan at mcs.anl.gov Sat Jul 19 11:23:36 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sat, 19 Jul 2008 11:23:36 -0500
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719111815.BCI36021@m4500-03.uchicago.edu>
References: <20080719111815.BCI36021@m4500-03.uchicago.edu>
Message-ID: <1216484616.12635.0.camel@localhost>
On Sat, 2008-07-19 at 11:18 -0500, lixi at uchicago.edu wrote:
> >Again, your account seems broken. Did this ever work for
> you?
>
> Although I never use WS GRAM on sites, I use GRAM and
> GridFtp well for running Swift workflow on that site before.
> The most recent run was done successfully the day before
> yesterday.
Did you use a different VO?
>
>
> >> Thanks,
> >>
> >> Xi
> >
From lixi at uchicago.edu Sat Jul 19 11:23:42 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 11:23:42 -0500 (CDT)
Subject: [Swift-user] Re: GT4
Message-ID: <20080719112342.BCI36162@m4500-03.uchicago.edu>
>Did you use a different VO?
I used the OSGEDU VO before, but I switched to the OSG VO
more than a month ago.
From benc at hawaga.org.uk Sat Jul 19 11:27:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 19 Jul 2008 16:27:15 +0000 (GMT)
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719111815.BCI36021@m4500-03.uchicago.edu>
References: <20080719111815.BCI36021@m4500-03.uchicago.edu>
Message-ID:
On Sat, 19 Jul 2008, lixi at uchicago.edu wrote:
> >Again, your account seems broken. Did this ever work for
> you?
>
> Although I never use WS GRAM on sites, I use GRAM and
> GridFtp well for running Swift workflow on that site before.
> The most recent run was done successfully the day before
> yesterday.
ok.
That gridftp command is nothing gram4-specific. So if it doesn't work, it
suggests very strongly that there is a general site problem that has
arisen in the past few days.
Try the previous gram2 swift submissions and I think you will probably see
that that also does not work.
--
From tiejing at gmail.com Sat Jul 19 11:44:04 2008
From: tiejing at gmail.com (Jing Tie)
Date: Sat, 19 Jul 2008 11:44:04 -0500
Subject: [Swift-user] Re: GT4
In-Reply-To: <20080719112342.BCI36162@m4500-03.uchicago.edu>
References: <20080719112342.BCI36162@m4500-03.uchicago.edu>
Message-ID:
Yes, the site is failing the authentication test. It worked yesterday.
Jing
On Sat, Jul 19, 2008 at 11:23 AM, wrote:
>>Did you use a different VO?
>
> Before I use OSGEDU VO, but I already switch to use OSG VO
> for more than a month.
From tiberius at ci.uchicago.edu Mon Jul 21 10:39:32 2008
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Mon, 21 Jul 2008 10:39:32 -0500
Subject: [Swift-user] Help needed with batching up parallel runs
Message-ID:
Hi
I work with some code that at some point generates a number (300 in my
case) of identical parallel runs, and I need to batch those up (10 at
a time in my case) because each individual run is too short.
I don't want Falkon at this point, and I'm not sure about the status
of the coaster provider, so I would prefer a clean Swift solution.
I was thinking of some array manipulation, but it was not obvious how
to do it with Swift.
Thanks !
Tibi
Here is the code I have so far, with the part I need help with marked:
// This is the code that batches a number of runs: based on the size of
// the array (determined where I make the call), it returns the set of
// parallel run results.
(file simFile[])gj_batch_sim(file policyFile, file logFile){
    app{
        gj_batch_sim @filename(policyFile) @filename(logFile) @filenames(simFile);
    }
}
int parallelInstances=300;
file simOutputs[];
(file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){
    // this is just some needed input
    file logFile;
    // I want to have batches of size 10
    int localBatchSize=10;
    // truncate the quotient to a whole number of batches
    int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*"));
    trace("Times to do batch_gj_batch_sim",batchRange);
    foreach i in [1:batchRange] {
        // HELP HERE: how to do this ?
        // Essentially I need to map the proper batch of file
        // names into the call of gj_batch_sim.
        // The slice below is pseudocode for what I want.
        simResults[localBatchSize*i:localBatchSize*(i+1)-1]=gj_batch_sim(policyFile, logFile);
    }
}
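The index arithmetic behind the batching can be checked outside Swift. The sketch below is Python, not SwiftScript, and the function name batch_ranges is invented for illustration; it only shows which output indices each batch should cover.

```python
def batch_ranges(total, batch_size):
    """Return inclusive (first, last) index pairs that cover 0..total-1."""
    # Round up so that a short final batch is not dropped.
    n_batches = (total + batch_size - 1) // batch_size
    return [
        (b * batch_size, min((b + 1) * batch_size, total) - 1)
        for b in range(n_batches)
    ]

ranges = batch_ranges(300, 10)
print(len(ranges))            # number of gj_batch_sim calls: 30
print(ranges[0], ranges[-1])  # (0, 9) (290, 299)
```

Note that the batch counter starts at 0 here. With a counter running from 1 to 30 and the bounds size*i to size*(i+1)-1, the first batch would start at index 10 and the last would end at 309, so either the range or the bounds would need shifting by one batch.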
--
Tiberiu (Tibi) Stef-Praun, PhD
Computational Sciences Researcher
Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/
From wilde at mcs.anl.gov Mon Jul 21 11:46:42 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 21 Jul 2008 11:46:42 -0500
Subject: [Swift-user] Help needed with batching up parallel runs
In-Reply-To:
References:
Message-ID: <4884BD72.1050409@mcs.anl.gov>
Tibi, can you use the Swift clustering mechanism?
http://www.ci.uchicago.edu/swift/guides/userguide.php#clustering
It's meant for this sort of thing, and is nice because you don't need to
explicitly do the clustering in your Swift script.
"Swift can group a number of short job submissions into a single larger
job submission to minimize overhead involved in launching jobs..."
- Mike
On 7/21/08 10:39 AM, Tiberiu Stef-Praun wrote:
> Hi
>
> I work with some code that generates at some point a number (300 in my
> case) of parallel identical runs, and I need to batch those up (10 at
> a time in my case) because each individual run is too short.
> I don't want Falkon at this point, and I'm not sure about the status
> of the coaster provider, so I would prefer a clean swift solution
> I was thinking of some array manipulation, but it was not obvious how
> to do it with swift.
>
> Thanks !
> Tibi
>
> Here is the code that I have so far, and I need help for:
>
>
>
> //this is the code that batches a number of runs: based on the size of
> the array (determined where I make the call), I will return the set of
> parallel run results
> (file simFile[])gj_batch_sim(file policyFile, file logFile){
> app{
> gj_batch_sim @filename(policyFile) @filename(logFile)
> @filenames(simFile);
> }
> }
>
> int parallelInstances=300;
> file simOutputs[];
>
> (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){
> // this is just some needed input
> file logFile;
>
> // I want to have batches of size 10
> int localBatchSize=10;
>
> int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*")
> trace("Times to do batch_gj_batch_sim",batchRange);
>
> foreach i in [1:batchRange] {
> // HELP HERE: how to do this ?
> // essentially I need to map the proper batch of file
> names into the call of gj_batch_sim
>
> simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile,
> logFile);
> }
> }
>
>
>
From zhaozhang at uchicago.edu Mon Jul 21 17:26:24 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 21 Jul 2008 17:26:24 -0500
Subject: [Swift-user] A naive run of Falkon+Swift on BGP login node.
Message-ID: <48850D10.7050103@uchicago.edu>
Hi,
I started a test on the BGP login nodes, running the Falkon service and
Swift on Login6, and a worker on Login2.
The good news is that I got the output file and Swift returned
successfully. The bad news is that there are some problems I don't
understand.
The swift stdout:
Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file
./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
Line 2: Unable to find required classes (javax.activation.DataHandler
and javax.mail.internet.MimeMultipart). Attachment support is disabled.
Line 3: Swift svn swift-r2140 cog-r2070
Line 4: RunID: 20080721-1713-zkz78kcf
Line 5: Progress:
Line 6: echo started
Line 7: error: Notification(int timeout): socket = new
ServerSocket(recvPort); Address already in use
Line 8: Waiting for notification for 0 ms
Line 9: Received notification with 1 messages
Line 10: echo completed
Line 11: Final status: Finished successfully:1
1. What is the exception in Line 2? Is it ignorable or not?
2. What is the error in Line 7? Is it printed by swift or the
deef-provider? Is this ignorable or not?
The following exception from Falkon only occurs when I specify the
ip.address property in swift
The falkon stdout:
2008-07-21 17:00:46,325 ERROR handler.AddressingHandler [ServiceThread-6,invoke:120] Exception in AddressingHandler
AxisFault
faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
faultSubcode:
faultString: java.io.IOException: '' For input string: ""
faultActor:
faultNode:
faultDetail:
{http://xml.apache.org/axis/}stackTrace:java.io.IOException: '' For input string: ""
at org.apache.axis.transport.http.ChunkedInputStream.getChunked(ChunkedInputStream.java:161)
at org.apache.axis.transport.http.ChunkedInputStream.read(ChunkedInputStream.java:53)
at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645)
at org.apache.axis.Message.getSOAPEnvelope(Message.java:424)
at org.apache.axis.message.addressing.handler.AddressingHandler.processServerRequest(AddressingHandler.java:328)
at org.globus.wsrf.handlers.AddressingHandler.processServerRequest(AddressingHandler.java:77)
at org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:114)
at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
at org.apache.axis.server.AxisServer.invoke(AxisServer.java:248)
at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664)
at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382)
at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291)
{http://xml.apache.org/axis/}hostname:login6
Ioan, any idea about this?
I am also attaching the Swift log. Could anyone check it and tell me
whether there is a problem there? Most importantly, is Swift using the
IP address I specified in the -ip.address parameter?
Thanks so much for the help
best wishes
zhangzhao
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: first-20080721-1713-zkz78kcf.log
URL:
From hategan at mcs.anl.gov Mon Jul 21 17:39:09 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 21 Jul 2008 17:39:09 -0500
Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP
login node.
In-Reply-To: <48850D10.7050103@uchicago.edu>
References: <48850D10.7050103@uchicago.edu>
Message-ID: <1216679949.18694.10.camel@localhost>
On Mon, 2008-07-21 at 17:26 -0500, Zhao Zhang wrote:
> Hi,
>
> I started a test on BGP login nodes, running falkon service and swift on
> Login6, and a worker on Login2.
> Good news is I got the output file. Swift return successful. Bad news is
> there are some problems I don't
> understand.
>
> The swift stdout:
> /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file
> ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
> Line 2: Unable to find required classes (javax.activation.DataHandler
> and javax.mail.internet.MimeMultipart). Attachment support is disabled.
> Line 3: Swift svn swift-r2140 cog-r2070
>
> Line 4: RunID: 20080721-1713-zkz78kcf
> Line 5: Progress:
> Line 6: echo started
> Line 7: error: Notification(int timeout): socket = new
> ServerSocket(recvPort); Address already in use
> Line 8: Waiting for notification for 0 ms
> Line 9: Received notification with 1 messages
> Line 10: echo completed
> Line 11: Final status: Finished successfully:1/
>
> 1. What is the exception in Line 2? is this ignorable or not?
Yes. It's axis complaining about some missing stuff that is never used
in this case.
> 2. What is the error in Line 7? Is it printed by swift or the
> deef-provider?
provider-deef. Do you have another swift instance running by any chance?
> Is this ignorable or not?
It isn't. It probably means that the falkon notifications won't get to
you.
>
>
>
> The following exception from Falkon only occurs when I specify the
> ip.address property in swift
What exactly did you set it to?
Mihael
From hategan at mcs.anl.gov Mon Jul 21 17:41:36 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 21 Jul 2008 17:41:36 -0500
Subject: [Swift-user] Re: [Swift-devel] Re: A naive run of Falkon+Swift on
BGP login node.
In-Reply-To: <48850F53.3010300@cs.uchicago.edu>
References: <48850D10.7050103@uchicago.edu> <48850F53.3010300@cs.uchicago.edu>
Message-ID: <1216680096.18694.14.camel@localhost>
> > Line 7: error: Notification(int timeout): socket = new
> > ServerSocket(recvPort); Address already in use
> > Line 8: Waiting for notification for 0 ms
> > Line 9: Received notification with 1 messages
> > Line 10: echo completed
> > Line 11: Final status: Finished successfully:1/
> >
> > 1. What is the exception in Line 2? is this ignorable or not?
> This is not a Falkon provider exception, so I don't know.
> > 2. What is the error in Line 7? Is it printed by swift or the
> > deef-provider? Is this ignorable or not?
> >
> You can ignore this, it should really be just a warning.
Oops. Sorry. Nevermind what I said.
From hategan at mcs.anl.gov Mon Jul 21 17:43:30 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 21 Jul 2008 17:43:30 -0500
Subject: [Swift-user] Re: [Swift-devel] Re: A naive run of Falkon+Swift on
BGP login node.
In-Reply-To: <48850F53.3010300@cs.uchicago.edu>
References: <48850D10.7050103@uchicago.edu> <48850F53.3010300@cs.uchicago.edu>
Message-ID: <1216680210.18694.17.camel@localhost>
On Mon, 2008-07-21 at 17:36 -0500, Ioan Raicu wrote:
> > Ioan, any idea about this?
> Not really sure what is wrong. Try to fix the exception from line 2
> first.
Not the problem. Normally in the wsrf log4j.properties this is masked
out. It's the log4j.properties in swift that doesn't. We should change
that.
> Also, Falkon is using GT4.0.x, is Swift still on GT4.0.x libs?
Yes. It's still on gt4.0
From iraicu at cs.uchicago.edu Mon Jul 21 17:38:35 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 21 Jul 2008 17:38:35 -0500
Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP
login node.
In-Reply-To: <1216679949.18694.10.camel@localhost>
References: <48850D10.7050103@uchicago.edu>
<1216679949.18694.10.camel@localhost>
Message-ID: <48850FEB.2020108@cs.uchicago.edu>
Mihael Hategan wrote:
> On Mon, 2008-07-21 at 17:26 -0500, Zhao Zhang wrote:
>
>> Hi,
>>
>> I started a test on BGP login nodes, running falkon service and swift on
>> Login6, and a worker on Login2.
>> Good news is I got the output file. Swift return successful. Bad news is
>> there are some problems I don't
>> understand.
>>
>> The swift stdout:
>> /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file
>> ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
>> Line 2: Unable to find required classes (javax.activation.DataHandler
>> and javax.mail.internet.MimeMultipart). Attachment support is disabled.
>> Line 3: Swift svn swift-r2140 cog-r2070
>>
>> Line 4: RunID: 20080721-1713-zkz78kcf
>> Line 5: Progress:
>> Line 6: echo started
>> Line 7: error: Notification(int timeout): socket = new
>> ServerSocket(recvPort); Address already in use
>> Line 8: Waiting for notification for 0 ms
>> Line 9: Received notification with 1 messages
>> Line 10: echo completed
>> Line 11: Final status: Finished successfully:1/
>>
>> 1. What is the exception in Line 2? is this ignorable or not?
>>
>
> Yes. It's axis complaining about some missing stuff that is never used
> in this case.
>
>
>> 2. What is the error in Line 7? Is it printed by swift or the
>> deef-provider?
>>
>
> provider-deef. Do you have another swift instance running by any chance?
>
>
>> Is this ignorable or not?
>>
>
> It isn't. It probably means that the falkon notifications won't get to
> you.
>
This error should just be a warning... as it tries a different port
until it finds a good one. It should only print an error when it gives
up. So, that is not your problem Zhao, especially as you seem to have
run OK, right?
Line 11: Final status: Finished successfully:1/
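The behaviour described here (try a port, move on to the next one if it is taken, and only report an error after giving up) can be sketched as follows. This is a Python illustration under stated assumptions, not the provider-deef or Falkon code; the function name bind_first_free, the attempt limit and the loopback address are all made up for the example.

```python
import socket

def bind_first_free(start_port, attempts=20, host="127.0.0.1"):
    """Try start_port, start_port+1, ... and return a listening socket
    bound to the first port that is free."""
    last_error = None
    for port in range(start_port, min(start_port + attempts, 65536)):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(1)
            return sock
        except OSError as err:
            # A busy port (e.g. "Address already in use") is only a warning,
            # because the next candidate port may still work.
            sock.close()
            last_error = err
            print(f"warning: port {port} unavailable ({err}); trying the next one")
    # Only now is it a real error: every candidate port was refused.
    raise OSError(
        f"no free port found in {attempts} attempts from {start_port}"
    ) from last_error
```

With this shape, a message like the one in Line 7 would appear once per busy port, and the run would still succeed as long as some later port can be bound.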
Ioan
>
>>
>> The following exception from Falkon only occurs when I specify the
>> ip.address property in swift
>>
>
> What exactly did you set it to?
>
> Mihael
>
>
>
--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================
From hategan at mcs.anl.gov Mon Jul 21 17:44:50 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 21 Jul 2008 17:44:50 -0500
Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP
login node.
In-Reply-To: <48850FEB.2020108@cs.uchicago.edu>
References: <48850D10.7050103@uchicago.edu>
<1216679949.18694.10.camel@localhost>
<48850FEB.2020108@cs.uchicago.edu>
Message-ID: <1216680290.20073.0.camel@localhost>
On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote:
> >
> This error should just be a warning... as it tries a different port
> until it finds a good one. It should only print an error when it
> gives up. So, that is not your problem Zhao, especially as you seem
> to have run OK, right?
>
> Line 11: Final status: Finished successfully:1/
Yep. Sorry. Spoke without knowing.
From iraicu at cs.uchicago.edu Mon Jul 21 17:36:03 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 21 Jul 2008 17:36:03 -0500
Subject: [Swift-user] Re: A naive run of Falkon+Swift on BGP login node.
In-Reply-To: <48850D10.7050103@uchicago.edu>
References: <48850D10.7050103@uchicago.edu>
Message-ID: <48850F53.3010300@cs.uchicago.edu>
Zhao Zhang wrote:
> Hi,
>
> I started a test on BGP login nodes, running falkon service and swift
> on Login6, and a worker on Login2.
> Good news is I got the output file. Swift return successful. Bad news
> is there are some problems I don't
> understand.
>
> The swift stdout:
> Line 1: zzhang@login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
> Line 2: Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled.
> Line 3: Swift svn swift-r2140 cog-r2070
>
> Line 4: RunID: 20080721-1713-zkz78kcf
> Line 5: Progress:
> Line 6: echo started
> Line 7: error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use
> Line 8: Waiting for notification for 0 ms
> Line 9: Received notification with 1 messages
> Line 10: echo completed
> Line 11: Final status: Finished successfully:1
>
> 1. What is the exception in Line 2? is this ignorable or not?
This is not a Falkon provider exception, so I don't know.
> 2. What is the error in Line 7? Is it printed by swift or the
> deef-provider? Is this ignorable or not?
>
You can ignore this, it should really be just a warning.
>
>
> The following exception from Falkon only occurs when I specify the
> ip.address property in swift
> The falkon stdout:
>
> 2008-07-21 17:00:46,325 ERROR handler.AddressingHandler [ServiceThread-6,invoke:120] Exception in AddressingHandler
> AxisFault
>  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
>  faultSubcode:
>  faultString: java.io.IOException: '' For input string: ""
>  faultActor:
>  faultNode:
>  faultDetail:
>     {http://xml.apache.org/axis/}stackTrace:java.io.IOException: '' For input string: ""
>     at org.apache.axis.transport.http.ChunkedInputStream.getChunked(ChunkedInputStream.java:161)
>     at org.apache.axis.transport.http.ChunkedInputStream.read(ChunkedInputStream.java:53)
>     at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
>     at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>     at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
>     at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645)
>     at org.apache.axis.Message.getSOAPEnvelope(Message.java:424)
>     at org.apache.axis.message.addressing.handler.AddressingHandler.processServerRequest(AddressingHandler.java:328)
>     at org.globus.wsrf.handlers.AddressingHandler.processServerRequest(AddressingHandler.java:77)
>     at org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:114)
>     at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
>     at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
>     at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
>     at org.apache.axis.server.AxisServer.invoke(AxisServer.java:248)
>     at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664)
>     at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382)
>     at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291)
>
>     {http://xml.apache.org/axis/}hostname:login6
>
> Ioan, any idea about this?
Not really sure what is wrong. Try to fix the exception from line 2
first. Also, Falkon is using GT4.0.x; is Swift still on GT4.0.x libs?
Ioan
>
> I am also attaching the swift log, could anyone check this to tell if
> there is a problem there, and most important thing
> is that if swift is using the IP address I specified in the
> --ip.address parameter?
>
> Thanks so much for the help
>
> best wishes
> zhangzhao
From iraicu at cs.uchicago.edu Mon Jul 21 17:46:32 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 21 Jul 2008 17:46:32 -0500
Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP
login node.
In-Reply-To: <1216680290.20073.0.camel@localhost>
References: <48850D10.7050103@uchicago.edu>
<1216679949.18694.10.camel@localhost>
<48850FEB.2020108@cs.uchicago.edu>
<1216680290.20073.0.camel@localhost>
Message-ID: <488511C8.2000707@cs.uchicago.edu>
So Zhao, did it actually work, but you got those two errors and wanted
to know what the errors were? If things worked as expected, then you
should be fine, you can ignore both of those errors (I think). If
things didn't work as expected, then we need to dig deeper to find out why.
Ioan
Mihael Hategan wrote:
> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote:
>
>>>
>>>
>> This error should just be a warning... as it tries a different port
>> until it finds a good one. It should only print an error when it
>> gives up. So, that is not your problem Zhao, especially as you seem
>> to have run OK, right?
>>
>> Line 11: Final status: Finished successfully:1/
>>
>
> Yep. Sorry. Spoke without knowing.
>
>
>
>
From zhaozhang at uchicago.edu Mon Jul 21 18:04:41 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 21 Jul 2008 18:04:41 -0500
Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP
login node.
In-Reply-To: <488511C8.2000707@cs.uchicago.edu>
References: <48850D10.7050103@uchicago.edu>
<1216679949.18694.10.camel@localhost>
<48850FEB.2020108@cs.uchicago.edu>
<1216680290.20073.0.camel@localhost>
<488511C8.2000707@cs.uchicago.edu>
Message-ID: <48851609.8050909@uchicago.edu>
In this test case, it actually worked. I talked with Mike, and we don't
quite understand these 2 things, so I sent them out.
After that I started another test, running Swift on the login node, the
Falkon service on an IO node, and BGexec on the CN.
At the very end of the service log, I got this:
847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
This means that we are still suffering from the endpoint problem, right?
And from swift stdout,
zzhang@login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled.
Swift svn swift-r2140 cog-r2070
RunID: 20080721-1748-m9d39dg9
Progress:
echo started
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Swift kept waiting, which means that -ip.address doesn't work as we expected.
zhao
Ioan Raicu wrote:
> So Zhao, did it actually work, but you got those two errors and wanted
> to know what the errors were? If things worked as expected, then you
> should be fine, you can ignore both of those errors (I think). If
> things didn't work as expected, then we need to dig deeper to find out
> why.
>
> Ioan
>
> Mihael Hategan wrote:
>> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote:
>>
>>>>
>>>>
>>> This error should just be a warning... as it tries a different port
>>> until it finds a good one. It should only print an error when it
>>> gives up. So, that is not your problem Zhao, especially as you seem
>>> to have run OK, right?
>>>
>>> Line 11: Final status: Finished successfully:1/
>>>
>>
>> Yep. Sorry. Spoke without knowing.
>>
>>
>>
>>
>
From iraicu at cs.uchicago.edu Mon Jul 21 18:08:26 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 21 Jul 2008 18:08:26 -0500
Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP
login node.
In-Reply-To: <48851609.8050909@uchicago.edu>
References: <48850D10.7050103@uchicago.edu>
<1216679949.18694.10.camel@localhost>
<48850FEB.2020108@cs.uchicago.edu>
<1216680290.20073.0.camel@localhost>
<488511C8.2000707@cs.uchicago.edu> <48851609.8050909@uchicago.edu>
Message-ID: <488516EA.1080703@cs.uchicago.edu>
Zhao Zhang wrote:
> In this test case, it actually worked. I talked with Mike, and we
> don't quite understand these 2 things. So I sent them out.
>
> After that I started another test. Running, swift on Login Node,
> falkon service on IO node, and BGexec on CN.
> At the very end of the service log, I got his:
> 847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
> 848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
> 849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
> 850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
> 851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
Right, it can't deliver the 2 tasks; if it had, there would have been a 2
before the 0.0 in the middle.
>
> This means that we are still suffering the endpoint problem, right?
Right!
You might want to put some debug statements in the Falkon provider to
print the end point IP address, to make sure it is the one you are
expecting.
Ioan
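One quick check that complements the debug statements suggested above is to confirm that the address passed via -ip.address (172.17.3.16 in this thread) is actually assigned to an interface on the machine where Swift runs. The following is a minimal sketch in plain Java, not code from Swift or Falkon; the class and method names are hypothetical, and it says nothing about whether the workers can route to that address.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

class LocalAddressCheck {
    /** Returns the textual form of every address bound to a local interface. */
    static List<String> localAddresses() throws SocketException {
        List<String> out = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return out; // no interfaces reported on this machine
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                out.add(addr.getHostAddress());
            }
        }
        return out;
    }

    /** True if the given address string matches one of the local addresses. */
    static boolean isAssignedLocally(String address) throws SocketException {
        return localAddresses().contains(address);
    }

    public static void main(String[] args) throws SocketException {
        // Default to the address used in this thread; pass another as the first argument.
        String expected = args.length > 0 ? args[0] : "172.17.3.16";
        System.out.println("local addresses: " + localAddresses());
        System.out.println(expected + (isAssignedLocally(expected) ? " is" : " is NOT")
                + " assigned to a local interface");
    }
}
```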
>
> And from swift stdout,
> zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml
> -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
> Unable to find required classes (javax.activation.DataHandler and
> javax.mail.internet.MimeMultipart). Attachment support is disabled.
> Swift svn swift-r2140 cog-r2070
>
> RunID: 20080721-1748-m9d39dg9
> Progress:
> echo started
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
> Progress: Executing:1
>
> Swift kept waiting, which mean the -ip.address doesn't work as we
> expexted.
>
> zhao
>
> Ioan Raicu wrote:
>> So Zhao, did it actually work, but you got those two errors and
>> wanted to know what the errors were? If things worked as expected,
>> then you should be fine, you can ignore both of those errors (I
>> think). If things didn't work as expected, then we need to dig
>> deeper to find out why.
>>
>> Ioan
>>
>> Mihael Hategan wrote:
>>> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote:
>>>
>>>>>
>>>> This error should just be a warning... as it tries a different port
>>>> until it finds a good one. It should only print an error when it
>>>> gives up. So, that is not your problem Zhao, especially as you seem
>>>> to have run OK, right?
>>>> Line 11: Final status: Finished successfully:1/
>>>>
>>>
>>> Yep. Sorry. Spoke without knowing.
>>>
>>>
>>>
>>>
>>
>
From wilde at mcs.anl.gov Mon Jul 21 18:10:45 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 21 Jul 2008 18:10:45 -0500
Subject: [Swift-user] Using > 1 CPU per compute node under GRAM
Message-ID: <48851775.5000604@mcs.anl.gov>
I'm asking this on behalf of Mike Kubal while I wait for more info on his
settings:
Mike is running under Swift on teragrid/Abe which has 8-core nodes. His
jobs are all running 1-job-per-node, wasting 7 cores.
I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
In the meantime, does anyone know if there's a way to specify
compute-node-sharing between separate single-cpu jobs via both GRAMs?
And is this dependent on the local job manager code or settings (i.e.,
might it work on some sites but not others)?
On globus doc page:
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes
I see:
...
but can't tell if this applies to single-core jobs or only to multi-core
jobs.
This will ideally be handled as desired by Falkon or Coaster, but in the
meantime I was hoping there was a simple setting to give MikeK better
CPU yield on Abe.
- Mike Wilde
---
A sample of one of his jobs looks like this under qstat -ef:
Job Id: 395980.abem5.ncsa.uiuc.edu
    Job_Name = STDIN
    Job_Owner = mkubal@abe1196
    job_state = Q
    queue = normal
    server = abem5.ncsa.uiuc.edu
    Account_Name = onm
    Checkpoint = u
    ctime = Mon Jul 21 17:43:47 2008
    Error_Path = abe1196:/dev/null
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Mon Jul 21 17:43:47 2008
    Output_Path = abe1196:/dev/null
    Priority = 0
    qtime = Mon Jul 21 17:43:47 2008
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 00:10:00
    Shell_Path_List = /bin/sh
    etime = Mon Jul 21 17:43:47 2008
    submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
And his jobs show up like this under qstat -n (i.e., all are on core /0):
395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1 -- 00:10 R --
   abe0872/0
While multi-core jobs use
   +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
   +abe0579/3+abe0579/2+abe0579/1+abe0579/0
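To make the difference between the two qstat -n listings easier to see, the node/core strings above can be tallied per node. The sketch below is only an illustration: it assumes the format is exactly what the samples show (entries of the form node/core joined by "+"), and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExecHostCores {
    /** Counts how many core slots each node contributes in a "+node/core+node/core" string. */
    static Map<String, Integer> coresPerNode(String execHost) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String slot : execHost.split("\\+")) {
            slot = slot.trim();
            if (slot.isEmpty()) {
                continue; // a leading "+" produces an empty first entry
            }
            int slash = slot.indexOf('/');
            String node = slash >= 0 ? slot.substring(0, slash) : slot;
            counts.merge(node, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The two wrapped lines of the multi-core sample, joined into one string.
        String multiCore = "+abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4"
                + "+abe0579/3+abe0579/2+abe0579/1+abe0579/0";
        String singleCore = "abe0872/0";
        System.out.println(coresPerNode(multiCore));  // prints {abe0582=3, abe0579=8}
        System.out.println(coresPerNode(singleCore)); // prints {abe0872=1}
    }
}
```

A count of 1 for a node in the output corresponds to the single-job case described above, where only core /0 is listed.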
From wilde at mcs.anl.gov Mon Jul 21 18:18:16 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 21 Jul 2008 18:18:16 -0500
Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP
login node.
In-Reply-To: <488516EA.1080703@cs.uchicago.edu>
References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost> <488511C8.2000707@cs.uchicago.edu>
<48851609.8050909@uchicago.edu> <488516EA.1080703@cs.uchicago.edu>
Message-ID: <48851938.4070507@mcs.anl.gov>
On 7/21/08 6:08 PM, Ioan Raicu wrote:
>
>
> Zhao Zhang wrote:
>> In this test case, it actually worked. I talked with Mike, and we
>> don't quite understand these 2 things. So I sent them out.
>>
>> After that I started another test. Running, swift on Login Node,
>> falkon service on IO node, and BGexec on CN.
>> At the very end of the service log, I got his:
>> 847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
>> 848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
>> 849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
>> 850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
>> 851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
> Right, it can't deliver the 2 tasks, as there would have been a 2 before
> the 0.0 in the middle.
>>
>> This means that we are still suffering the endpoint problem, right?
> Right!
>
> You might want to put some debug statements in the Falkon provider to
> print the end point IP address, to make sure it is the one you are
> expecting.
That debug logging is there, but I'm not sure if or where it's getting logged.
In
src/cog/modules/provider-deef/src/org/globus/cog/abstraction/impl/execution/deef/ResourcePool.java
the changed code tries to log as follows:

    public static String getMachNamePort(Notification userNot){
        //String machIP = VDL2Config.getIP();
        String machIP = CoGProperties.getDefault().getIPAddress();
        String machNamePort = new String (machIP + ":" + userNot.recvPort);
        logger.debug("WORKER: Machine ID = " + machNamePort);
        return machNamePort;
    }

Zhao, did you see "WORKER: Machine ID = " in your swift log?
- Mike
> Ioan
>>
>> And from swift stdout,
>> zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml
>> -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
>> Unable to find required classes (javax.activation.DataHandler and
>> javax.mail.internet.MimeMultipart). Attachment support is disabled.
>> Swift svn swift-r2140 cog-r2070
>>
>> RunID: 20080721-1748-m9d39dg9
>> Progress:
>> echo started
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>> Progress: Executing:1
>>
>> Swift kept waiting, which mean the -ip.address doesn't work as we
>> expexted.
>>
>> zhao
>>
>> Ioan Raicu wrote:
>>> So Zhao, did it actually work, but you got those two errors and
>>> wanted to know what the errors were? If things worked as expected,
>>> then you should be fine, you can ignore both of those errors (I
>>> think). If things didn't work as expected, then we need to dig
>>> deeper to find out why.
>>>
>>> Ioan
>>>
>>> Mihael Hategan wrote:
>>>> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote:
>>>>
>>>>>>
>>>>> This error should just be a warning... as it tries a different port
>>>>> until it finds a good one. It should only print an error when it
>>>>> gives up. So, that is not your problem Zhao, especially as you seem
>>>>> to have run OK, right? Line 11: Final status: Finished
>>>>> successfully:1/
>>>>>
>>>>
>>>> Yep. Sorry. Spoke without knowing.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
From wilde at mcs.anl.gov Mon Jul 21 18:45:24 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 21 Jul 2008 18:45:24 -0500
Subject: [Swift-user] Re: Using > 1 CPU per compute node under GRAM
In-Reply-To:
References: <48851775.5000604@mcs.anl.gov>
Message-ID: <48851F94.6010509@mcs.anl.gov>
Thanks, JP.
I'll forward this to the TeraGrid Help Desk and report back to this list.
- Mike
On 7/21/08 6:28 PM, JP Navarro wrote:
> It's definitely subject to local resource manager/scheduling policy
> configuration. At UC/ANL, for example, there's an explicit policy that
> says 1 job per node. Each job can of course run 1-n processes that share
> the 2 processors. There's nothing GRAM can do to get around that policy.
>
> You'll need to ask NCSA whether their policies allow multiple jobs on one
> node. If Abe allows only one job per node, then it's up to your one job to
> spawn off enough processes/threads to use the 8 cores.
>
> JP
>
> On Jul 21, 2008, at 6:10 PM, Michael Wilde wrote:
>
>> Im asking this on behalf of Mike Kubal while I wait for more info on
>> his settings:
>>
>> Mike is running under Swift on teragrid/Abe which has 8-core nodes.
>> His jobs are all running 1-job-per-node, wasting 7 cores.
>>
>> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
>>
>> In the meantime, does anyone know if there's a way to specify
>> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>>
>> And if this is dependent on the local job manager code or settings?
>> (Ie might work on some sites but not others)?
>>
>> On globus doc page:
>> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes
>>
>>
>> I see:
>>
>> ...
>>
>>
>> but cant tell if this applies to single-core jobs or only to
>> multi-core jobs.
>>
>> This will ideally be handled as desired by Falkon or Coaster, but in
>> the meantime I was hoping there was a simple setting to give MikeK
>> better CPU yield on Abe.
>>
>> - Mike Wilde
>>
>> ---
>>
>> A sample of one of his jobs looks like this under qstat -ef:
>>
>> Job Id: 395980.abem5.ncsa.uiuc.edu
>> Job_Name = STDIN
>> Job_Owner = mkubal at abe1196
>> job_state = Q
>> queue = normal
>> server = abem5.ncsa.uiuc.edu
>> Account_Name = onm
>> Checkpoint = u
>> ctime = Mon Jul 21 17:43:47 2008
>> Error_Path = abe1196:/dev/null
>> Hold_Types = n
>> Join_Path = n
>> Keep_Files = n
>> Mail_Points = n
>> mtime = Mon Jul 21 17:43:47 2008
>> Output_Path = abe1196:/dev/null
>> Priority = 0
>> qtime = Mon Jul 21 17:43:47 2008
>> Rerunable = True
>> Resource_List.ncpus = 1
>> Resource_List.nodect = 1
>> Resource_List.nodes = 1
>> Resource_List.walltime = 00:10:00
>> Shell_Path_List = /bin/sh
>> etime = Mon Jul 21 17:43:47 2008
>> submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
>>
>> And his jobs show up like this under qstat -n (ie are all on core /0 ):
>>
>> 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1
>> -- 00:10 R --
>> abe0872/0
>>
>> While multi-core jobs use
>>
>> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
>> +abe0579/3+abe0579/2+abe0579/1+abe0579/0
>
From iraicu at cs.uchicago.edu Mon Jul 21 18:57:26 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 21 Jul 2008 18:57:26 -0500
Subject: [Swift-user] Using > 1 CPU per compute node under GRAM
In-Reply-To: <48851775.5000604@mcs.anl.gov>
References: <48851775.5000604@mcs.anl.gov>
Message-ID: <48852266.2000502@cs.uchicago.edu>
In the past (i.e., with MolDyn), I don't think we ever found an easy solution
to this when running straight through GRAM (if the LRM didn't support
this policy). But, as JP said, it is site specific, so some sites will
allow getting only 1 CPU per node, such as Teraport, in which case GRAM
should work just fine.
Ioan
Michael Wilde wrote:
> Im asking this on behalf of Mike Kubal while I wait for more info on
> his settings:
>
> Mike is running under Swift on teragrid/Abe which has 8-core nodes.
> His jobs are all running 1-job-per-node, wasting 7 cores.
>
> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
>
> In the meantime, does anyone know if there's a way to specify
> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>
> And if this is dependent on the local job manager code or settings?
> (Ie might work on some sites but not others)?
>
> On globus doc page:
> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes
>
>
> I see:
>
> ...
>
>
> but cant tell if this applies to single-core jobs or only to
> multi-core jobs.
>
> This will ideally be handled as desired by Falkon or Coaster, but in
> the meantime I was hoping there was a simple setting to give MikeK
> better CPU yield on Abe.
>
> - Mike Wilde
>
> ---
>
> A sample of one of his jobs looks like this under qstat -ef:
>
> Job Id: 395980.abem5.ncsa.uiuc.edu
> Job_Name = STDIN
> Job_Owner = mkubal at abe1196
> job_state = Q
> queue = normal
> server = abem5.ncsa.uiuc.edu
> Account_Name = onm
> Checkpoint = u
> ctime = Mon Jul 21 17:43:47 2008
> Error_Path = abe1196:/dev/null
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Mon Jul 21 17:43:47 2008
> Output_Path = abe1196:/dev/null
> Priority = 0
> qtime = Mon Jul 21 17:43:47 2008
> Rerunable = True
> Resource_List.ncpus = 1
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> Resource_List.walltime = 00:10:00
> Shell_Path_List = /bin/sh
> etime = Mon Jul 21 17:43:47 2008
> submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
>
> And his jobs show up like this under qstat -n (ie are all on core /0 ):
>
> 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1
> -- 00:10 R --
> abe0872/0
>
> While multi-core jobs use
>
> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
> +abe0579/3+abe0579/2+abe0579/1+abe0579/0
From mikekubal at yahoo.com Mon Jul 21 22:42:22 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Mon, 21 Jul 2008 20:42:22 -0700 (PDT)
Subject: [Swift-user] Using > 1 CPU per compute node under GRAM
In-Reply-To: <48852266.2000502@cs.uchicago.edu>
Message-ID: <510897.59042.qm@web52303.mail.re2.yahoo.com>
I'm using pre-WS-GRAM.
MikeK
--- On Mon, 7/21/08, Ioan Raicu wrote:
From: Ioan Raicu
Subject: Re: [Swift-user] Using > 1 CPU per compute node under GRAM
To: "Michael Wilde"
Cc: "Swift User Discussion List", "Stu Martin", "Martin Feller", "JP Navarro", "Mike Kubal"
Date: Monday, July 21, 2008, 6:57 PM
In the past (i.e., with MolDyn), I don't think we ever found an easy solution
to this when running straight through GRAM (if the LRM didn't support
this policy). But, as JP said, it is site specific, so some sites will
allow getting only 1 CPU per node, such as Teraport, in which case GRAM
should work just fine.
Ioan
Michael Wilde wrote:
> Im asking this on behalf of Mike Kubal while I wait for more info on
> his settings:
>
> Mike is running under Swift on teragrid/Abe which has 8-core nodes.
> His jobs are all running 1-job-per-node, wasting 7 cores.
>
> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
>
> In the meantime, does anyone know if there's a way to specify
> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>
> And if this is dependent on the local job manager code or settings?
> (Ie might work on some sites but not others)?
>
> On globus doc page:
> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes
>
>
> I see:
>
> ...
>
>
> but cant tell if this applies to single-core jobs or only to
> multi-core jobs.
>
> This will ideally be handled as desired by Falkon or Coaster, but in
> the meantime I was hoping there was a simple setting to give MikeK
> better CPU yield on Abe.
>
> - Mike Wilde
>
> ---
>
> A sample of one of his jobs looks like this under qstat -ef:
>
> Job Id: 395980.abem5.ncsa.uiuc.edu
> Job_Name = STDIN
> Job_Owner = mkubal at abe1196
> job_state = Q
> queue = normal
> server = abem5.ncsa.uiuc.edu
> Account_Name = onm
> Checkpoint = u
> ctime = Mon Jul 21 17:43:47 2008
> Error_Path = abe1196:/dev/null
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Mon Jul 21 17:43:47 2008
> Output_Path = abe1196:/dev/null
> Priority = 0
> qtime = Mon Jul 21 17:43:47 2008
> Rerunable = True
> Resource_List.ncpus = 1
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> Resource_List.walltime = 00:10:00
> Shell_Path_List = /bin/sh
> etime = Mon Jul 21 17:43:47 2008
> submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
>
> And his jobs show up like this under qstat -n (ie are all on core /0 ):
>
> 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1
> -- 00:10 R --
> abe0872/0
>
> While multi-core jobs use
>
> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
> +abe0579/3+abe0579/2+abe0579/1+abe0579/0
From navarro at mcs.anl.gov Mon Jul 21 18:28:28 2008
From: navarro at mcs.anl.gov (JP Navarro)
Date: Mon, 21 Jul 2008 18:28:28 -0500
Subject: [Swift-user] Re: Using > 1 CPU per compute node under GRAM
In-Reply-To: <48851775.5000604@mcs.anl.gov>
References: <48851775.5000604@mcs.anl.gov>
Message-ID:
It's definitely subject to local resource manager/scheduling policy
configuration.
At UC/ANL, for example, there's an explicit policy that says 1 job per
node. Each job can of course run 1-n processes that share the 2
processors. There's nothing GRAM can do to get around that policy.
You'll need to ask NCSA whether their policies allow multiple jobs on
one node.
If Abe allows only one job per node, then it's up to your one job to
spawn off enough processes/threads to use the 8 cores.
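As a rough illustration of a single job spawning its own processes, here
is a minimal shell sketch. It assumes the site allows one job to use every
core of its node; the helper name run_per_core and the TASK_INDEX variable
are hypothetical and are not part of PBS, GRAM, or Swift.

```shell
#!/bin/sh
# Hypothetical helper: start N copies of a command in the background
# inside one job, then wait for all of them to finish.
run_per_core() {
    cores=$1; shift
    i=0
    while [ "$i" -lt "$cores" ]; do
        # TASK_INDEX is an illustrative name telling each copy which
        # slice of the work it owns; it is not a scheduler variable.
        TASK_INDEX=$i "$@" &
        i=$((i + 1))
    done
    wait   # block until every background copy has exited
}

# Example: 8 copies, one per core on an 8-core node.
run_per_core 8 sh -c 'echo "task $TASK_INDEX"'
```

The output order is not deterministic, since the copies run concurrently.
Whether the copies actually land on separate cores is up to the node's
operating system, not this script.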
JP
On Jul 21, 2008, at 6:10 PM, Michael Wilde wrote:
> I'm asking this on behalf of Mike Kubal while I wait for more info on
> his settings:
>
> Mike is running under Swift on teragrid/Abe which has 8-core nodes.
> His jobs are all running 1-job-per-node, wasting 7 cores.
>
> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM.
>
> In the meantime, does anyone know if there's a way to specify
> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>
> And if this is dependent on the local job manager code or settings?
> (Ie might work on some sites but not others)?
>
> On globus doc page:
> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes
>
> I see:
>
> ...
>
>
> but can't tell if this applies to single-core jobs or only to
> multi-core jobs.
>
> This will ideally be handled as desired by Falkon or Coaster, but in
> the meantime I was hoping there was a simple setting to give MikeK
> better CPU yield on Abe.
>
> - Mike Wilde
>
> ---
>
> A sample of one of his jobs looks like this under qstat -ef:
>
> Job Id: 395980.abem5.ncsa.uiuc.edu
> Job_Name = STDIN
> Job_Owner = mkubal at abe1196
> job_state = Q
> queue = normal
> server = abem5.ncsa.uiuc.edu
> Account_Name = onm
> Checkpoint = u
> ctime = Mon Jul 21 17:43:47 2008
> Error_Path = abe1196:/dev/null
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Mon Jul 21 17:43:47 2008
> Output_Path = abe1196:/dev/null
> Priority = 0
> qtime = Mon Jul 21 17:43:47 2008
> Rerunable = True
> Resource_List.ncpus = 1
> Resource_List.nodect = 1
> Resource_List.nodes = 1
> Resource_List.walltime = 00:10:00
> Shell_Path_List = /bin/sh
> etime = Mon Jul 21 17:43:47 2008
> submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN
>
> And his jobs show up like this under qstat -n (ie are all on core /0 ):
>
> 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1 -- 00:10 R --
> abe0872/0
>
> While multi-core jobs use
>
> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4
> +abe0579/3+abe0579/2+abe0579/1+abe0579/0
From benc at hawaga.org.uk Tue Jul 22 02:14:06 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 07:14:06 +0000 (GMT)
Subject: [Swift-user] Help needed with batching up parallel runs
In-Reply-To:
References:
Message-ID:
Use clustering. Read the docs Mike linked to. Basically you need to
specify a maxwalltime for the jobs you want clustered, and then a
clustering time that is some multiple of it (e.g. 10 times, in your case).
You might try coasters if you are submitting using GT2 (there is something
wrong with GT4 + coasters at the moment that prevents them being used
together).
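The arithmetic behind "some multiple" can be sketched as below. This is
an illustration, not a tested configuration: the 60-second per-job
walltime is an assumed value, and the property names clustering.enabled
and clustering.min.time are quoted from the clustering section of the
Swift user guide as best recalled, so check them against the guide for
your Swift version before relying on them.

```shell
#!/bin/sh
# Sketch: derive a clustering time from a per-job walltime and a batch size.
per_job_seconds=60      # ASSUMED maxwalltime of one short run (1 minute)
jobs_per_cluster=10     # batch size asked for in the original question

cluster_seconds=$((per_job_seconds * jobs_per_cluster))

# Print the two property lines; names should be verified against the
# Swift user guide for the installed version.
swift_clustering_properties() {
    printf 'clustering.enabled=true\n'
    printf 'clustering.min.time=%s\n' "$1"
}

swift_clustering_properties "$cluster_seconds"
```

With these numbers the clustering time comes out at 600 seconds, i.e. ten
one-minute jobs per submitted cluster.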
On Mon, 21 Jul 2008, Tiberiu Stef-Praun wrote:
> Hi
>
> I work with some code that generates at some point a number (300 in my
> case) of parallel identical runs, and I need to batch those up (10 at
> a time in my case) because each individual run is too short.
> I don't want Falkon at this point, and I'm not sure about the status
> of the coaster provider, so I would prefer a clean swift solution
> I was thinking of some array manipulation, but it was not obvious how
> to do it with swift.
>
> Thanks !
> Tibi
>
> Here is the code that I have so far, and I need help for:
>
>
>
> //this is the code that batches a number of runs: based on the size of
> the array (determined where I make the call), I will return the set of
> parallel run results
> (file simFile[])gj_batch_sim(file policyFile, file logFile){
> app{
> gj_batch_sim @filename(policyFile) @filename(logFile)
> @filenames(simFile);
> }
> }
>
> int parallelInstances=300;
> file simOutputs[];
>
> (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){
> // this is just some needed input
> file logFile;
>
> // I want to have batches of size 10
> int localBatchSize=10;
>
> int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*")
> trace("Times to do batch_gj_batch_sim",batchRange);
>
> foreach i in [1:batchRange] {
> // HELP HERE: how to do this ?
> // essentially I need to map the proper batch of file
> names into the call of gj_batch_sim
>
> simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile,
> logFile);
> }
> }
>
>
>
>
From benc at hawaga.org.uk Tue Jul 22 02:19:57 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 07:19:57 +0000 (GMT)
Subject: [Swift-user] Using > 1 CPU per compute node under GRAM
In-Reply-To: <48851775.5000604@mcs.anl.gov>
References: <48851775.5000604@mcs.anl.gov>
Message-ID:
On Mon, 21 Jul 2008, Michael Wilde wrote:
> In the meantime, does anyone know if there's a way to specify
> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>
> And if this is dependent on the local job manager code or settings? (Ie might
> work on some sites but not others)?
You can specify via GRAM RSL; however at least TGUC deliberately does not
allow that - one job gets an entire node. I imagine other sites are
similar.
Coasters should allow this to be done, by running two coaster workers on
one node. I plan to look at doing that sometime.
> This will ideally be handled as desired by Falkon or Coaster, but in the
> meantime I was hoping there was a simple setting to give MikeK better
> CPU yield on Abe.
There isn't.
He and I have investigated this before, I think. I've just put this in
the swift bugzilla as bug 150.
--
From benc at hawaga.org.uk Tue Jul 22 03:48:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 08:48:55 +0000 (GMT)
Subject: [Swift-user] swift + mpi
Message-ID:
I added a note to the swift user guide a couple weeks ago about how to run
MPI jobs in Swift:
http://www.ci.uchicago.edu/swift/guides/userguide.php#tips.mpi
This is based on some playing round by Andriy Fedorov and myself.
--
From wilde at mcs.anl.gov Tue Jul 22 11:21:20 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 22 Jul 2008 11:21:20 -0500
Subject: [Swift-user] Problem with scope and writability for complex types
Message-ID: <48860900.7040805@mcs.anl.gov>
This script compiles:
1 type file;
2
3 file fphr[];
4 file fpin[];
5 file fpsq[];
6
7 (file phr, file pin, file psq) formatdb (file input) {
8 app {
9 formatdb "-i" @input ;
10 }
11 }
12
13 file inputs[] ;
14
15 foreach f, i in inputs {
16 (fphr[i], fpin[i], fpsq[i]) = formatdb(f);
17 }
--- while this script gives a compile-time error:
1 type file;
2
3 type aux {
4 file phr;
5 file pin;
6 file psq;
7 };
8
9 (file phr, file pin, file psq) formatdb (file input) {
10 app {
11 formatdb "-i" @input ;
12 }
13 }
14
15 file inputs[] ;
16 aux a[];
17
18 foreach f, i in inputs {
19 (a[i].phr, a[i].pin, a[i].psq) = formatdb(f);
20 }
--- error is:
Could not start execution.
Compile error in foreach statement at line 18: Compile error in
procedure invocation at line 19: variable a is not writeable in this scope
---
It seems that the second script should be valid. Both set global
variables from within a foreach() in global scope.
When the variable is of complex type "array of file" the variable
indices seem to be writable.
When the variable is of complex type "array of struct of file" the
indexed struct fields seem not to be writable.
From benc at hawaga.org.uk Tue Jul 22 11:38:47 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 16:38:47 +0000 (GMT)
Subject: [Swift-user] Problem with scope and writability for complex types
In-Reply-To: <48860900.7040805@mcs.anl.gov>
References: <48860900.7040805@mcs.anl.gov>
Message-ID:
On Tue, 22 Jul 2008, Michael Wilde wrote:
> It seems that the second script should be valid. Both set global variables
> from within a foreach() in global scope.
yes, I think that is correct.
seems to be a bug.
(the handling of write-once semantics at compile time for SwiftScript is
kinda hard because this array syntax doesn't look like write-once...)
--
From zhengxiongh at uchicago.edu Tue Jul 22 18:33:32 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Tue, 22 Jul 2008 18:33:32 -0500 (CDT)
Subject: [Swift-user] How to transmit data dynamically on Grid
Message-ID: <20080722183332.BIW75329@m4500-01.uchicago.edu>
Hi,
I'm using Swift to execute application jobs on the OSG grid sites.
In the sites.xml file, if the jobmanager is not "fork", e.g.
url="abitibi.sbgrid.org/jobmanager-condor", the job is usually executed
on a local compute node, which is not the "gateway node" of the grid
site.
While the job is executing, in the executable command (such as a wrapper
script "rundock"), I want to dynamically transfer the input data files
from CI to the remote grid site with "globus-url-copy", e.g.
globus-url-copy gsiftp://communicado.ci.uchicago.edu$ligpath file://$work/$ligfile
and transfer the result data from the remote grid site back to the CI
machine, e.g.
globus-url-copy file://$work/result.tar.gz gsiftp://communicado.ci.uchicago.edu/home/houzx/dock-run/databases/results/abitibi.sbgrid.org-$ligfile-result.tar.gz
The problem is that the executing compute node is not connected to the
outside network, so the "globus-url-copy" fails! It succeeds only with
"jobmanager-fork", because then the job is executed on the gateway node
of the grid site.
The user may want to use "jobmanager-condor" to execute the jobs. At the
same time, for whichever grid sites Swift selects dynamically, they also
want to transfer the input and result data dynamically and
automatically, as they can with "jobmanager-fork", because it is
troublesome to "globus-url-copy" the input and result data to the remote
grid sites manually when there are large numbers of data files.
So, the question is how to implement this in Swift? Maybe it's a common
problem, but I didn't find it in the documentation.
Thanks,
Zhengxiong
From wilde at mcs.anl.gov Tue Jul 22 22:16:48 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 22 Jul 2008 22:16:48 -0500
Subject: [Swift-user] How to transmit data dynamically on Grid
In-Reply-To: <20080722183332.BIW75329@m4500-01.uchicago.edu>
References: <20080722183332.BIW75329@m4500-01.uchicago.edu>
Message-ID: <4886A2A0.9010604@mcs.anl.gov>
Zhengxiong,
By default, Swift automatically moves your data from a directory on the
submit host (the host on which you run the swift command) to a shared
directory on the execution site, where it is accessed by your job, running
on a worker node in the remote cluster.
This is explained in the User Guide intro:
http://www.ci.uchicago.edu/swift/guides/userguide.php
"SwiftScript programs are dataflow oriented - they are primarily
concerned with processing (possibly large) data files, by invoking
programs to do that processing. Swift handles execution of such programs
on remote sites by choosing sites, handling the staging of input and
output files to and from the chosen sites and remote execution of
program code".
Staging is detailed in section 8: "Invoking an Application from Swift":
http://www.ci.uchicago.edu/swift/guides/userguide.php#id2931120
I think the example shell script you are looking at, "rundock", is
misleading you, because it was written to run under Falkon without
Swift, and hence does some staging between the cluster's shared
filesystem and local worker-node directories.
I would start by dividing the files that DOCK uses into two categories:
1) files that you will declare as inputs or outputs of Swift atomic
procedures, which you should let Swift stage in and out automatically;
and 2) files which can be considered part of the application's install
directory (which can stay on each cluster's shared filesystem with the
application code, or which can be shipped to each site in a preparation
stage).
In addition Swift will, within the execution of a script, avoid staging
a file in twice, if it can. The users guide explains this under the
property "caching.algorithm":
"Swift caches files that are staged in on remote resources, and files
that are produced remotely by applications, such that they can be
re-used if needed without being transfered again. However, the amount of
remote file system space to be used for caching can be limited using the
swift:storagesize profile entry in the sites.xml file."
So you could let Swift bring in even large files for you, to the shared
filesystem, and your application wrapper script can cache these in a
persistent application directory on the worker node.
In rundock, you could use this approach by declaring the "receptor"
protein molecule files (grid files and "selected spheres") as Swift
inputs, and letting Swift bring them to the grid site for you.
Lastly, see this note in the Environment Variables section of the users
guide:
"SWIFT_JOBDIR_PATH - set in env namespace profiles. If set, then Swift
will use the path specified here as a worker-node local temporary
directory to copy input files to before running a job. If unset, Swift
will keep input files on the site-shared filesystem. In some cases,
copying to a worker-node local directory can be much faster than having
applications access the site-shared filesystem directly."
You can achieve the same effect of copying data to the local worker node
disk by doing so explicitly in your application wrapper script
("rundock" in your case). If you know that you will be running many
applications consecutively on the same worker nodes, e.g. because you are
using Coaster or Falkon, then you can do what rundock does on the BG/P,
and cache data in a local directory *between* jobs. But, like rundock,
you need to be careful to avoid races between multiple jobs on the same
node, and must ensure that you can always get your data from the shared
filesystem when it's not already cached there. Bash functions in rundock
have the locking logic to do this.
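The kind of locking described above could look roughly like the sketch
below. This is not rundock's actual code: the function name cache_fetch
and the cache directory argument are hypothetical, and the lock is a plain
mkdir, chosen because creating a directory either succeeds or fails
atomically.

```shell
#!/bin/sh
# Copy a file from the shared filesystem into a node-local cache once,
# even if several jobs on the same node ask for it at the same time.
cache_fetch() {
    src=$1          # file on the site-shared filesystem
    cache_dir=$2    # node-local cache directory (hypothetical location)
    dest=$cache_dir/$(basename "$src")
    lock=$dest.lock

    mkdir -p "$cache_dir"
    until mkdir "$lock" 2>/dev/null; do
        sleep 1     # another job holds the lock; wait for it to finish
    done
    if [ ! -f "$dest" ]; then
        # copy under a temporary name, then rename, so that readers
        # never see a partially written file
        cp "$src" "$dest.tmp.$$" && mv "$dest.tmp.$$" "$dest"
    fi
    status=$?
    rmdir "$lock"
    # print the cached path only if the file is really there
    [ "$status" -eq 0 ] && echo "$dest"
}
```

A wrapper would call it as, say, cache_fetch "$shared/receptor.grid"
"$localdir" and fall back to reading the shared copy when the call fails.
The sketch does not handle a job dying while it holds the lock; a real
wrapper would need a timeout or stale-lock check for that case.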
Caching data that will be read by many jobs on the worker node disk
makes sense for the receptor files, as each of these will be read by 15K
jobs.
So there are actually several ways in which to manage your data.
Let's work out some of these cases, and then document them in the users
guide for future users, with examples.
- Mike
On 7/22/08 6:33 PM, Zhengxiong Hou wrote:
> Hi,
> I'm using the Swift to execute application jobs on the
> OSG grid sites.
> In the sites.xml file, if the jobmanager is not "fork",
> e.g. url="abitibi.sbgrid.org/jobmanager-condor".
> The job is usually executed on a local computing node,
> which is not the "Gateway node" of the grid site.
> But when executing the job, in the executable command,
> such as a wrapper script "rundock", I want to dynamically
> transmit the input data files from CI to the remote grid
> site by "globus-url-copy". e.g. (
> globus-url-copy gsiftp://communicado.ci.uchicago.edu$ligpath
> file://$work/$ligfile)
> And transmit the results data from remote grid site to CI
> machine, e.g. (globus-url-copy file://$work/result.tar.gz
> gsiftp://communicado.ci.uchicago.edu/home/houzx/dock-
> run/databases/results/abitibi.sbgrid.org-$ligfile-
> result.tar.gz)
> The problem is that, the executing computing node is not
> connected to the outside network, So the "globus-url-copy"
> fails! Only using "jobmanager-fork", can it succeed, because
> the job is executed on the Gateway node of the Grid site.
>
> The user may want to use the "jobmanager-condor" to
> execute the jobs. At the same time, according to the
> dynamically seleted grid sites of Swift,they also want to
> transmit the input and results data dynamically and
> automatically by "jobmanager-fork". Because it is
> troublesome to "globus-url-copy" the input and results data
> to the remote grid sites manually, if there are large
> amounts of data files.
>
> So, the quesiton is how to implement it in Swift? Maybe
> it's a common problem, but I didn't find it in the documents.
>
> Thanks,
> Zhengxiong
From tiberius at ci.uchicago.edu Wed Jul 23 15:37:10 2008
From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun)
Date: Wed, 23 Jul 2008 15:37:10 -0500
Subject: [Swift-user] Help needed with batching up parallel runs
In-Reply-To:
References:
Message-ID:
Hi, thanks, I forgot about that.
I tried running it on teraport, and it failed.
The log is here:
http://www.ci.uchicago.edu/~tiberius/issues/gj-batched-20080723-1522-fypk29g6.log
Mihael suggested that I should file a bug report, but I'm not sure
what to report (other than the failure)
Tibi
On Tue, Jul 22, 2008 at 2:14 AM, Ben Clifford wrote:
>
> Use clustering. Read the docs mike linked to. Basically you need to
> specify a maxwalltime for the jobs you want clustered, and then a
> clustering time that is some multiple (eg 10 in your case).
>
> You might try coasters if you are submitting using GT2 (there is something
> wrong with gt4 + coasters at the moment that prevents them being used
> together).
>
> On Mon, 21 Jul 2008, Tiberiu Stef-Praun wrote:
>
>> Hi
>>
>> I work with some code that generates at some point a number (300 in my
>> case) of parallel identical runs, and I need to batch those up (10 at
>> a time in my case) because each individual run is too short.
>> I don't want Falkon at this point, and I'm not sure about the status
>> of the coaster provider, so I would prefer a clean swift solution
>> I was thinking of some array manipulation, but it was not obvious how
>> to do it with swift.
>>
>> Thanks !
>> Tibi
>>
>> Here is the code that I have so far, and I need help for:
>>
>>
>>
>> //this is the code that batches a number of runs: based on the size of
>> the array (determined where I make the call), I will return the set of
>> parallel run results
>> (file simFile[])gj_batch_sim(file policyFile, file logFile){
>> app{
>> gj_batch_sim @filename(policyFile) @filename(logFile)
>> @filenames(simFile);
>> }
>> }
>>
>> int parallelInstances=300;
>> file simOutputs[];
>>
>> (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){
>> // this is just some needed input
>> file logFile;
>>
>> // I want to have batches of size 10
>> int localBatchSize=10;
>>
>> int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*")
>> trace("Times to do batch_gj_batch_sim",batchRange);
>>
>> foreach i in [1:batchRange] {
>> // HELP HERE: how to do this ?
>> // essentially I need to map the proper batch of file
>> names into the call of gj_batch_sim
>>
>> simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile,
>> logFile);
>> }
>> }
>>
>>
>>
>>
>
--
Tiberiu (Tibi) Stef-Praun, PhD
Computational Sciences Researcher
Computation Institute
5640 S. Ellis Ave, #405
University of Chicago
http://www-unix.mcs.anl.gov/~tiberius/
From benc at hawaga.org.uk Sun Jul 27 06:03:02 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 27 Jul 2008 11:03:02 +0000 (GMT)
Subject: [Swift-user] Help needed with batching up parallel runs
In-Reply-To:
References:
Message-ID:
If you're using GRAM4 to submit, then it looks like you are hitting a bug
that I fixed a week or so ago, cog svn r2066, which deals with the way
that walltimes are formatted.
On Wed, 23 Jul 2008, Tiberiu Stef-Praun wrote:
> Hi, thanks, I forgot about that.
> I tried running it on teraport, and it failed.
> The log is here:
> http://www.ci.uchicago.edu/~tiberius/issues/gj-batched-20080723-1522-fypk29g6.log
>
> Mihael suggested that I should file a bug report, but I'm not sure
> what to report (other than the failure)
>
> Tibi
>
> On Tue, Jul 22, 2008 at 2:14 AM, Ben Clifford wrote:
> >
> > Use clustering. Read the docs mike linked to. Basically you need to
> > specify a maxwalltime for the jobs you want clustered, and then a
> > clustering time that is some multiple (eg 10 in your case).
> >
> > You might try coasters if you are submitting using GT2 (there is something
> > wrong with gt4 + coasters at the moment that prevents them being used
> > together).
> >
> > On Mon, 21 Jul 2008, Tiberiu Stef-Praun wrote:
> >
> >> Hi
> >>
> >> I work with some code that generates at some point a number (300 in my
> >> case) of parallel identical runs, and I need to batch those up (10 at
> >> a time in my case) because each individual run is too short.
> >> I don't want Falkon at this point, and I'm not sure about the status
> >> of the coaster provider, so I would prefer a clean swift solution
> >> I was thinking of some array manipulation, but it was not obvious how
> >> to do it with swift.
> >>
> >> Thanks !
> >> Tibi
> >>
> >> Here is the code that I have so far, and I need help for:
> >>
> >>
> >>
> >> //this is the code that batches a number of runs: based on the size of
> >> the array (determined where I make the call), I will return the set of
> >> parallel run results
> >> (file simFile[])gj_batch_sim(file policyFile, file logFile){
> >> app{
> >> gj_batch_sim @filename(policyFile) @filename(logFile)
> >> @filenames(simFile);
> >> }
> >> }
> >>
> >> int parallelInstances=300;
> >> file simOutputs[];
> >>
> >> (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){
> >> // this is just some needed input
> >> file logFile;
> >>
> >> // I want to have batches of size 10
> >> int localBatchSize=10;
> >>
> >> int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*")
> >> trace("Times to do batch_gj_batch_sim",batchRange);
> >>
> >> foreach i in [1:batchRange] {
> >> // HELP HERE: how to do this ?
> >> // essentially I need to map the proper batch of file
> >> names into the call of gj_batch_sim
> >>
> >> simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile,
> >> logFile);
> >> }
> >> }
> >>
> >>
> >>
> >>
> >
>
>
>
>
From zhengxiongh at uchicago.edu Tue Jul 29 09:22:49 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Tue, 29 Jul 2008 09:22:49 -0500 (CDT)
Subject: [Swift-user] pegasus?
Message-ID: <20080729092249.BJC48144@m4500-01.uchicago.edu>
Hi,
Recently, I met with an error when using Swift:
[houzx at communicado results]$ swift -sites.file ./sites-20.xml -tc.file ./tc.data grid-many-dock6-auto.swift
2008.07.29 08:45:37.416 CDT: [FATAL ERROR] You forgot to set -Dpegasus.home=$PEGASUS_HOME!
[houzx at communicado dock]$ swift flipper.swift
2008.07.29 08:55:56.512 CDT: [FATAL ERROR] You forgot to set -Dpegasus.home=$PEGASUS_HOME!
Swift did NOT need this. Is there anything wrong with my
account at CI?
[houzx at communicado dock]$ echo $PEGASUS_HOME
/soft/osg-client-1.0.0-r1/pegasus
[houzx at communicado dock]$ cd ~
[houzx at communicado ~]$ cat .soft
#
# This is your SoftEnv configuration run control file.
#
# It is used to tell SoftEnv how to customize your environment by
# setting up variables such as PATH and MANPATH. To learn more
# about this file, do a "man softenv".
#
@default
@osg
@globus-4
Thanks!
Zhengxiong
From wilde at mcs.anl.gov Tue Jul 29 09:35:05 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 09:35:05 -0500
Subject: [Swift-user] pegasus?
In-Reply-To: <20080729092249.BJC48144@m4500-01.uchicago.edu>
References: <20080729092249.BJC48144@m4500-01.uchicago.edu>
Message-ID: <488F2A99.9080500@mcs.anl.gov>
See if you have CLASSPATH set, and have Pegasus jars in it.
Then try unsetting CLASSPATH and see if the same error occurs.
The Swift command should put the correct Swift jars in the final
classpath before any of your local jars, but perhaps there's some
strange dynamic class interaction between the Swift version of
tcdata/sites code and code that you have been experimenting with from
the Pegasus release (e.g. get-sites etc).
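The CLASSPATH check suggested above can be sketched in shell as follows.
The function name pegasus_entries is made up for illustration; the rest
uses only standard tr and grep.

```shell
#!/bin/sh
# List the entries of a colon-separated classpath that mention Pegasus
# (case-insensitive). Exit status is non-zero when there are none.
pegasus_entries() {
    printf '%s\n' "$1" | tr ':' '\n' | grep -i pegasus
}

if pegasus_entries "${CLASSPATH:-}"; then
    echo "Pegasus jars found on CLASSPATH; try: unset CLASSPATH"
else
    echo "no Pegasus entries on CLASSPATH"
fi
```

Note that unset CLASSPATH only affects the current shell; if SoftEnv or a
login script sets the variable, it will come back in the next session.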
- Mike
On 7/29/08 9:22 AM, Zhengxiong Hou wrote:
> Hi,
> Recently, I met with an error when using Swift:
>
> [houzx at communicado results]$ swift -sites.file ./sites-
> 20.xml -tc.file ./tc.data grid-many-dock6-auto.swift
> 2008.07.29 08:45:37.416 CDT: [FATAL ERROR] You forgot to
> set -Dpegasus.home=$PEGASUS_HOME!
>
> [houzx at communicado dock]$ swift flipper.swift
> 2008.07.29 08:55:56.512 CDT: [FATAL ERROR] You forgot to
> set -Dpegasus.home=$PEGASUS_HOME!
>
> Swift did NOT need this. Is there anything wrong with my
> account at CI?
>
> [houzx at communicado dock]$ echo $PEGASUS_HOME
> /soft/osg-client-1.0.0-r1/pegasus
> [houzx at communicado dock]$ cd ~
> [houzx at communicado ~]$ cat .soft
> #
> # This is your SoftEnv configuration run control file.
> #
> # It is used to tell SoftEnv how to customize your
> environment by
> # setting up variables such as PATH and MANPATH. To learn
> more
> # about this file, do a "man softenv".
> #
> @default
> @osg
> @globus-4
>
> Thanks!
> Zhengxiong
From benc at hawaga.org.uk Tue Jul 29 09:40:48 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 14:40:48 +0000 (GMT)
Subject: [Swift-user] pegasus?
In-Reply-To: <488F2A99.9080500@mcs.anl.gov>
References: <20080729092249.BJC48144@m4500-01.uchicago.edu>
<488F2A99.9080500@mcs.anl.gov>
Message-ID:
On Tue, 29 Jul 2008, Michael Wilde wrote:
> The Swift command should put the correct Swift jars in the final classpath
> before any of your local jars, but perhaps there's some strange dynamic class
Right. Although this was only changed around the time that we did the grid
school in Georgetown (April?). Versions of Swift older than that will have
the originally posted problem.
--
From zhengxiongh at uchicago.edu Tue Jul 29 11:13:30 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Tue, 29 Jul 2008 11:13:30 -0500 (CDT)
Subject: [Swift-user] Illegal character
Message-ID: <20080729111330.BJC64453@m4500-01.uchicago.edu>
Hi Mike,
Yes, you are right.
If I unset CLASSPATH in .soft.cache.sh, or just comment out the @osg
line (#@osg) in the .soft file, the original error disappears.
But there is a new ERROR, although the Swift job finished.
[houzx at communicado dock]$ swift flipper.swift
2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on line 19 Illegal character ' 'at position 5 :Illegal character ' '
Swift 0.5 swift-r1783 cog-r1962
RunID: 20080729-1105-qbqofzya
Progress:
convert started
2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on line 19 Illegal character ' 'at position 5 :Illegal character ' '
convert completed
Final status: Finished successfully:1 Finished:1
Thanks much,
zhengxiong
---- Original message ----
>Date: Tue, 29 Jul 2008 09:35:05 -0500
>From: Michael Wilde
>Subject: Re: [Swift-user] pegasus?
>To: Zhengxiong Hou
>Cc: swift-user at ci.uchicago.edu, support at ci.uchicago.edu
>
>See if you have CLASSPATH set, and have Pegasus jars in it.
>Then try unsetting CLASSPATH and see if the same error occurs.
>
>The Swift command should put the correct Swift jars in the final
>classpath before any of your local jars, but perhaps there's some
>strange dynamic class interaction between the Swift version of
>tcdata/sites code and code that you have been experimenting with from
>the Pegasus release (eg get-sites etc).
>
>- Mike
>
>
>On 7/29/08 9:22 AM, Zhengxiong Hou wrote:
>> Hi,
>> Recently, I met with an error when using Swift:
>>
>> [houzx at communicado results]$ swift -sites.file ./sites-
>> 20.xml -tc.file ./tc.data grid-many-dock6-auto.swift
>> 2008.07.29 08:45:37.416 CDT: [FATAL ERROR] You forgot to
>> set -Dpegasus.home=$PEGASUS_HOME!
>>
>> [houzx at communicado dock]$ swift flipper.swift
>> 2008.07.29 08:55:56.512 CDT: [FATAL ERROR] You forgot to
>> set -Dpegasus.home=$PEGASUS_HOME!
>>
>> Swift did NOT need this. Is there anything wrong with
my
>> account at CI?
>>
>> [houzx at communicado dock]$ echo $PEGASUS_HOME
>> /soft/osg-client-1.0.0-r1/pegasus
>> [houzx at communicado dock]$ cd ~
>> [houzx at communicado ~]$ cat .soft
>> #
>> # This is your SoftEnv configuration run control file.
>> #
>> # It is used to tell SoftEnv how to customize your
>> environment by
>> # setting up variables such as PATH and MANPATH. To
learn
>> more
>> # about this file, do a "man softenv".
>> #
>> @default
>> @osg
>> @globus-4
>>
>> Thanks!
>> Zhengxiong
From benc at hawaga.org.uk Tue Jul 29 11:21:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 16:21:46 +0000 (GMT)
Subject: [Swift-user] Illegal character
In-Reply-To: <20080729111330.BJC64453@m4500-01.uchicago.edu>
References: <20080729111330.BJC64453@m4500-01.uchicago.edu>
Message-ID:
(I removed CI support as this is not their business now)
The fifth byte of the 19th line of your tc.data file is something
unexpected.
Type:
hexdump -C tc.data
(for the tc.data file that you are using)
and send that output.
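To look at just the suspect line rather than the whole file, something
like the following sketch may help. It uses od instead of hexdump only
because od is specified by POSIX; the helper names dump_line and byte_at
are made up for illustration.

```shell
#!/bin/sh
# Dump one line of a file so that hidden bytes (tabs, CRs, etc.) show up.
dump_line() {
    # $1 = file, $2 = line number
    sed -n "${2}p" "$1" | od -c
}

# Print the hex value of a single byte of a given line.
byte_at() {
    # $1 = file, $2 = line number, $3 = byte position (1-based)
    sed -n "${2}p" "$1" | cut -b "$3" | tr -d '\n' | od -An -tx1
}

# Usage for the error reported in this thread:
#   dump_line tc.data 19
#   byte_at tc.data 19 5
```

A result of 09 from byte_at would mean a tab and 20 a plain space, which
narrows down what the parser is objecting to.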
On Tue, 29 Jul 2008, Zhengxiong Hou wrote:
> Hi Mike,
> Yes,you are right.
> If I unset CLASSPATH in .soft.cache.sh, or just mark
> #@osg in the .soft file, the original error disappeared.
>
> But there is a new ERROR, although swift job was finished.
>
> [houzx at communicado dock]$ swift flipper.swift
> 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on
> line 19 Illegal character ' 'at position 5 :Illegal
> character ' '
> Swift 0.5 swift-r1783 cog-r1962
>
> RunID: 20080729-1105-qbqofzya
> Progress:
> convert started
> 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on
> line 19 Illegal character ' 'at position 5 :Illegal
> character ' '
> convert completed
> Final status: Finished successfully:1 Finished:1
--
From zhengxiongh at uchicago.edu Tue Jul 29 12:03:16 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Tue, 29 Jul 2008 12:03:16 -0500 (CDT)
Subject: [Swift-user] Illegal character
Message-ID: <20080729120316.BJC71514@m4500-01.uchicago.edu>
Hi Benc,
Sorry, the error message did come from tc.data. I have just
re-edited it; maybe the error was due to a stray "space". Now it works
normally.
Thanks!
zhengxiong
---- Original message ----
>Date: Tue, 29 Jul 2008 16:21:46 +0000 (GMT)
>From: Ben Clifford
>Subject: Re: [Swift-user] Illegal character
>To: Zhengxiong Hou
>Cc: Michael Wilde , swift-user at ci.uchicago.edu
>
>
>(I removed CI support as this is not their business now)
>
>The fifth byte of the 19th line of your tc.data file is something
>unexpected.
>
>Type:
>
>hexdump -C tc.data
>
>(for the tc.data file that you are using)
>
>and send that output.
>
>On Tue, 29 Jul 2008, Zhengxiong Hou wrote:
>
>> Hi Mike,
>> Yes,you are right.
>> If I unset CLASSPATH in .soft.cache.sh, or just mark
>> #@osg in the .soft file, the original error disappeared.
>>
>> But there is a new ERROR, although swift job was
finished.
>>
>> [houzx at communicado dock]$ swift flipper.swift
>> 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on
>> line 19 Illegal character ' 'at position 5 :Illegal
>> character ' '
>> Swift 0.5 swift-r1783 cog-r1962
>>
>> RunID: 20080729-1105-qbqofzya
>> Progress:
>> convert started
>> 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on
>> line 19 Illegal character ' 'at position 5 :Illegal
>> character ' '
>> convert completed
>> Final status: Finished successfully:1 Finished:1
>
>--
From wilde at mcs.anl.gov Tue Jul 29 12:11:09 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 12:11:09 -0500
Subject: [Swift-user] Illegal character
In-Reply-To:
References: <20080729111330.BJC64453@m4500-01.uchicago.edu>
Message-ID: <488F4F2D.1030207@mcs.anl.gov>
And I see that you're using Swift 0.5, which may not have the CLASSPATH
improvements in the swift command as Ben mentioned.
Ben, should the nightly builds show up on this page, and if so should
local developers use those to get a recent snapshot:
http://www.ci.uchicago.edu/swift/tests/tests-2008-07-13.html#packages
(in other words, is that page broken, or was it never intended to be a
source of nightly snapshots for download?)
You can also build your own Swift release from SVN. Instructions are at:
http://www.ci.uchicago.edu/swift/downloads/index.php
- Mike
On 7/29/08 11:21 AM, Ben Clifford wrote:
> (I removed CI support as this is not their business now)
>
> The fifth byte of the 19th line of your tc.data file is something
> unexpected.
>
> Type:
>
> hexdump -C tc.data
>
> (for the tc.data file that you are using)
>
> and send that output.
>
> On Tue, 29 Jul 2008, Zhengxiong Hou wrote:
>
>> Hi Mike,
>> Yes,you are right.
>> If I unset CLASSPATH in .soft.cache.sh, or just mark
>> #@osg in the .soft file, the original error disappeared.
>>
>> But there is a new ERROR, although swift job was finished.
>>
>> [houzx at communicado dock]$ swift flipper.swift
>> 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on
>> line 19 Illegal character ' 'at position 5 :Illegal
>> character ' '
>> Swift 0.5 swift-r1783 cog-r1962
>>
>> RunID: 20080729-1105-qbqofzya
>> Progress:
>> convert started
>> 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on
>> line 19 Illegal character ' 'at position 5 :Illegal
>> character ' '
>> convert completed
>> Final status: Finished successfully:1 Finished:1
>
From benc at hawaga.org.uk Tue Jul 29 12:26:04 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 17:26:04 +0000 (GMT)
Subject: [Swift-user] Illegal character
In-Reply-To: <488F4F2D.1030207@mcs.anl.gov>
References: <20080729111330.BJC64453@m4500-01.uchicago.edu>
<488F4F2D.1030207@mcs.anl.gov>
Message-ID:
On Tue, 29 Jul 2008, Michael Wilde wrote:
> Ben, should the nightly builds show up on this page, and if so should local
> developers use those to get a recent snapshot:
>
> http://www.ci.uchicago.edu/swift/tests/tests-2008-07-13.html#packages
It's broken, I suspect. It always had a tendency to go wrong, and from a
testing perspective it has been mostly replaced by the NMI testing system.
> (in other words, is that page broken, or was it never intended to be a source
> of nightly snapshots for download?)
It was originally intended so. However, I mostly find myself preferring
that users either stick with a real release or build from source, as you
say below, because people using the latest often want features more
rapidly than waiting a day, so they end up building from SVN source anyway.
> You can also build your own Swift release from SVN. Instructions are at:
> http://www.ci.uchicago.edu/swift/downloads/index.php
--
From zhaozhang at uchicago.edu Thu Jul 31 12:41:35 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 31 Jul 2008 12:41:35 -0500
Subject: [Swift-user] swift script calling procedure
Message-ID: <4891F94F.6090004@uchicago.edu>
Hi, Mike
I am using the same structure as the Swift script you used to run dock5
in April. The old file is at surveyor:/home/wilde/doc5/run01.
Could someone take a look at this and point out why it fails to
compile with the procedure readdata()? Thanks so much.
best wishes
zhangzhao
My script is like this:
type DockOut;
type Mol2;

dock (string id, Mol2 mfile, DockOut ofile, string protein)
{
    app { rundock @id @mfile @ofile; }
}

type params {
    string idname;
    string mname;
    string oname;
    string pname;
};

doall(params pset[])
{
    foreach p in pset {
        string id=p.idname;
        Mol2 mfile=p.mname;
        DockOut ofile=p.oname;
        string protein=p.pname;
        dock(id, mfile, ofile, protein);
    }
}

// Main

params p[];
p = readdata("paramlist");
doall(p);
It failed to compile with this message:
zzhang at login6.surveyor:~/swift/etc> swift dock2.swift
Could not start execution.
Compile error in procedure invocation at line 30: Procedure readdata is not declared.
I am also attaching the paramlist file:
idname mname oname pname
0 /home/zzhang/swift_dock6/run05/000/000/run05_in.0000000.mol2 /home/zzhang/swift_dock6/run05/000/000/run05_out.0000000.tar.gz 1KQP
From wilde at mcs.anl.gov Thu Jul 31 13:21:50 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 31 Jul 2008 13:21:50 -0500
Subject: [Swift-user] Re: swift script calling procedure
In-Reply-To: <4891F94F.6090004@uchicago.edu>
References: <4891F94F.6090004@uchicago.edu>
Message-ID: <489202BE.8090606@mcs.anl.gov>
Zhao reports that the script compiles correctly after changing readdata()
to readData(), as per the User Guide.
Perhaps the code he tried was wrong or never worked, or perhaps the case
of the function name or the case-checking rules changed between the time
this last worked and now.
- Mike
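As a side note on the paramlist file quoted in this thread: my understanding from the User Guide (an assumption, not checked against the Swift source) is that readData fills an array of structs from a whitespace-separated file whose first row names the struct fields. The Python sketch below only illustrates that column-to-field mapping so the input file can be sanity-checked before running Swift. parse_paramlist is a made-up name and is not part of Swift.

```python
def parse_paramlist(text):
    """Map each data row to a dict keyed by the header row's field names.

    Illustrative only: mimics the header-to-field mapping that readData
    is assumed to perform for an array of structs.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header, *records = rows
    parsed = []
    for row_no, fields in enumerate(records, start=2):
        # A row with the wrong number of columns cannot fill the struct.
        if len(fields) != len(header):
            raise ValueError(
                f"row {row_no}: expected {len(header)} fields, got {len(fields)}"
            )
        parsed.append(dict(zip(header, fields)))
    return parsed
```

Running it on the file contents would show whether every row supplies a value for idname, mname, oname and pname. A data row that was accidentally split across two lines would be reported as having the wrong field count.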
On 7/31/08 12:41 PM, Zhao Zhang wrote:
> Hi, Mike
>
> I am using the same structure of the swift script you used to run dock5
> in April. The old file is at surveyor:/home/wilde/doc5/run01.
> Could some one take a look at this and point out why it failed to
> compile with the procedure readdata( ) ? Thanks so much.
>
> best wishes
> zhangzhao
>
>
> My script is like this:
>
> type DockOut;
> type Mol2;
>
> dock (string id, Mol2 mfile, DockOut ofile, string protein)
> {
> app { rundock @id @mfile @ofile; }
> }
>
> type params {
> string idname;
> string mname;
> string oname;
> string pname;
> };
>
> doall(params pset[])
> {
> foreach p in pset {
> string id=p.idname;
> Mol2 mfile=p.mname;
> DockOut ofile=p.oname;
> string protein=p.pname;
> dock(id, mfile, ofile, protein);
> }
> }
>
> // Main
>
> params p[];
> p = readdata("paramlist");
> doall(p);
>
> It failed to be compiled with this message:
> zzhang at login6.surveyor:~/swift/etc> swift dock2.swift
> Could not start execution.
> Compile error in procedure invocation at line 30: Procedure
> readdata is not declared.
>
>
> I am also attaching the paramlist file:
>
> idname mname oname pname
> 0 /home/zzhang/swift_dock6/run05/000/000/run05_in.0000000.mol2
> /home/zzhang/swift_dock6/run05/000/000/run05_out.0000000.tar.gz 1KQP
From grog at ci.uchicago.edu Tue Jul 29 14:08:09 2008
From: grog at ci.uchicago.edu (Greg Cross)
Date: Tue, 29 Jul 2008 14:08:09 -0500
Subject: [Swift-user] Illegal character
In-Reply-To: <20080729111330.BJC64453@m4500-01.uchicago.edu>
References: <20080729111330.BJC64453@m4500-01.uchicago.edu>
Message-ID: <1CC22143-5A0D-46A8-91FD-561694CFD654@ci.uchicago.edu>
OSG sets dozens of environment variables. Normally this is done by
sourcing the setup.*sh files in $VDT_LOCATION, but softenv does the
same thing automatically with any "osg" macro.
Unfortunately, many variables get set that would otherwise be
unnecessary, and (obviously) the result isn't always desirable. So you
either have to remove the macro, as you did, or have CLASSPATH and
other variables defined/appended for swift (and anything else that
conflicts).
-- Greg
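A third option, for anyone who wants to keep the "osg" macro, is to strip the conflicting variables only for the Swift invocation. The Python sketch below is a minimal illustration under that assumption. env_without is a hypothetical helper name, and whether CLASSPATH is the only variable that conflicts on a given site is not something this thread establishes.

```python
import os


def env_without(names, base=None):
    """Return a copy of the environment with the given variables removed.

    Hypothetical helper: 'base' defaults to the current process
    environment, and the original mapping is left untouched.
    """
    env = dict(os.environ if base is None else base)
    for name in names:
        env.pop(name, None)  # ignore variables that are not set
    return env
```

The result can be passed as the env argument of subprocess.run, for example subprocess.run(["swift", "flipper.swift"], env=env_without(["CLASSPATH"])), so the login environment itself stays as softenv set it up.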
On Tue 29 Jul 2008, at 11:13, Zhengxiong Hou wrote:
> Hi Mike,
> Yes,you are right.
> If I unset CLASSPATH in .soft.cache.sh, or just mark
> #@osg in the .soft file, the original error disappeared.
>
> But there is a new ERROR, although swift job was finished.
>
> [houzx at communicado dock]$ swift flipper.swift
> 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on
> line 19 Illegal character ' 'at position 5 :Illegal
> character ' '
> Swift 0.5 swift-r1783 cog-r1962
>
> RunID: 20080729-1105-qbqofzya
> Progress:
> convert started
> 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on
> line 19 Illegal character ' 'at position 5 :Illegal
> character ' '
> convert completed
> Final status: Finished successfully:1 Finished:1
>
> Thanks much,
> zhengxiong
> ---- Original message ----
>> Date: Tue, 29 Jul 2008 09:35:05 -0500
>> From: Michael Wilde
>> Subject: Re: [Swift-user] pegasus?
>> To: Zhengxiong Hou
>> Cc: swift-user at ci.uchicago.edu, support at ci.uchicago.edu
>>
>> See if you have CLASSPATH set, and have Pegasus jars in it.
>> Then try unsetting CLASSPATH and see if the same error occurs.
>>
>> The Swift command should put the correct Swift jars in the final
>> classpath before any of your local jars, but perhaps there's some
>> strange dynamic class interaction between the Swift version of
>> tcdata/sites code and code that you have been experimenting with from
>> the Pegasus release (eg get-sites etc).
>>
>> - Mike
>>
>>
>> On 7/29/08 9:22 AM, Zhengxiong Hou wrote:
>>> Hi,
>>> Recently, I met with an error when using Swift:
>>>
>>> [houzx at communicado results]$ swift -sites.file ./sites-20.xml -tc.file ./tc.data grid-many-dock6-auto.swift
>>> 2008.07.29 08:45:37.416 CDT: [FATAL ERROR] You forgot to
>>> set -Dpegasus.home=$PEGASUS_HOME!
>>>
>>> [houzx at communicado dock]$ swift flipper.swift
>>> 2008.07.29 08:55:56.512 CDT: [FATAL ERROR] You forgot to
>>> set -Dpegasus.home=$PEGASUS_HOME!
>>>
>>> Swift did NOT need this. Is there anything wrong with my
>>> account at CI?
>>>
>>> [houzx at communicado dock]$ echo $PEGASUS_HOME
>>> /soft/osg-client-1.0.0-r1/pegasus
>>> [houzx at communicado dock]$ cd ~
>>> [houzx at communicado ~]$ cat .soft
>>> #
>>> # This is your SoftEnv configuration run control file.
>>> #
>>> # It is used to tell SoftEnv how to customize your environment by
>>> # setting up variables such as PATH and MANPATH. To learn more
>>> # about this file, do a "man softenv".
>>> #
>>> @default
>>> @osg
>>> @globus-4
>>>
>>> Thanks!
>>> Zhengxiong
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user