From fedorov at cs.wm.edu Tue Jul 1 08:39:28 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 1 Jul 2008 09:39:28 -0400
Subject: [Swift-user] Passing hostType for MPI jobs
Message-ID: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>

Hi,

I am having problems passing host type for MPI jobs. This appears to happen both when I am using globusrun-ws (XML job description) and when I am using Swift, although the errors are different.

I am trying to request nodes of type "compute" on the UC TeraGrid site. This host type is recognized by PBS when I pass it to "qsub".

Basically, when I am using the XML job description, I am specifying hostType using Job description extension support (http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes). What happens is that I get the correct type of nodes, but the count is not what I request.

When I specify the hostType parameter in tc.data I either get an error (when I have hostCount="4:compute"):

===>
RunID: 20080701-0829-xstp5l98
Progress: hello_mpi started
Progress: Stage in:1
Failed to transfer wrapper log from hello_mpi_swift-20080701-0829-xstp5l98/info/a/UC-GT4
Failed to transfer wrapper log from hello_mpi_swift-20080701-0829-xstp5l98/info/b/UC-GT4
Failed to transfer wrapper log from hello_mpi_swift-20080701-0829-xstp5l98/info/c/UC-GT4
hello_mpi failed
Execution failed:
Exception in hello_mpi:
Arguments: []
Host: UC-GT4
Directory: hello_mpi_swift-20080701-0829-xstp5l98/jobs/c/hello_mpi-cltmnvui
stderr.txt:
stdout.txt:
----
Caused by:
For input string: "4:compute"
<===

or I get the nodes of the wrong type (when I use hostType="compute" -- looks like it is just ignored).

Does anyone know how to specify host type correctly? Is this a GT4 bug? I suspect there is a GT4 bug involved, because when I skip , I can correctly run MPI job on 4 hosts. I don't know what Swift's support for host type functionality is.
For reference, I attach my XML job description, tc.data, sites.xml, Swift script, and the simple MPI "hello world" code.

hello_mpi.c (compile with `mpicc -o hello_mpi hello_mpi.c')
==>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv){
  int myrank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  fprintf(stderr, "Hello, world from cpu %i (total %i)\n", myrank, size);

  MPI_Finalize();
  return 0;
}
<===

hello_mpi_xml.xml
===>
https://tg-grid.uc.teragrid.org:8443/wsrf/services/ManagedJobFactoryService PBS /home/fedorov/local/bin/hello_mpi /home/fedorov/scratch/hello_mpi_xml.stdout /home/fedorov/scratch/hello_mpi_xml.stderr 4 4 10 mpi compute
<===

hello_mpi_swift.swift
===>
type messagefile {}

(messagefile t) greeting() {
  app {
    hello_mpi stderr=@filename(t);
  }
}

messagefile outfile <"hello_mpi.txt">;

outfile = greeting();
<===

tc.data
===>
UC-GT4 hello_mpi /home/fedorov/local/bin/hello_mpi_v INSTALLED INTEL32::LINUX GLOBUS::hostCount="4",jobType=mpi,maxWallTime="10",count="4",hostType="compute"
<===

sites.xml
===>
/home/fedorov/scratch
<===

From benc at hawaga.org.uk Tue Jul 1 09:34:15 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 1 Jul 2008 14:34:15 +0000 (GMT)
Subject: [Swift-user] Passing hostType for MPI jobs
In-Reply-To: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com>
Message-ID:

On Tue, 1 Jul 2008, Andriy Fedorov wrote:

> I am having problems passing host type for MPI jobs. This appears to
> happen both when I am using globusrun-ws (XML job description), although
> the errors are different.

I've been working on running MPI jobs inside Swift today. On TG UC I find a problem that sounds like that when using GRAM4. Using GRAM2 works ok (but slower). I can specify the host type ok, but not the job node count.

I will interact with the TG UC admins to see if they know what is going on there.
-- From fedorov at cs.wm.edu Tue Jul 1 09:47:21 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Tue, 1 Jul 2008 10:47:21 -0400 Subject: [Swift-user] Passing hostType for MPI jobs In-Reply-To: References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com> Message-ID: <82f536810807010747i1615b5c8l87186035aed3f118@mail.gmail.com> > I've been working on running MPI jobs inside Swift today. On TG UC I find > a problem that sounds like that when using GRAM4. Using GRAM2 works ok > (but slower). I can specify the host type ok, but not the job node count. > By the way, specifying the job node count works fine for me both with GRAM4+Swift, and with just GRAM4 (XML) -- try the configuration and scripts I attach to the initial post. It does NOT work if I try to specify both host count and host type for GRAM4 XML. Andrey From benc at hawaga.org.uk Wed Jul 2 03:34:35 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 08:34:35 +0000 (GMT) Subject: [Swift-user] Passing hostType for MPI jobs In-Reply-To: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com> References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com> Message-ID: So: /bin/hostname /home/benc/mpi /home/benc/mpi/test.stdout /home/benc/mpi/test.stderr 3 allocates three hosts for me, without specifying the type. This seems to give the correct behaviour. /bin/hostname /home/benc/mpi /home/benc/mpi/test.stdout /home/benc/mpi/test.stderr 3 ia64-compute allocates one host for me (ignoring the hostCount) but it is of the correct type, ia64-compute. This seems to be incorrect behaviour because it ignores the hostcount. 
A different approach, using a different hostcount field that the job extensions web page at http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html suggests: /bin/hostname /home/benc/mpi /home/benc/mpi/test.stdout /home/benc/mpi/test.stderr ia64-compute 3 results in: [benc at tg-login1 mpi]$ globusrun-ws -submit -Ft PBS -F tg-grid.uc.teragrid.org -job-description-file ./gram4-dbg.rsl Submitting job...Done. Job ID: uuid:cc6b465e-4810-11dd-9981-0007e9d811ce Termination time: 07/03/2008 08:28 GMT Current job state: Failed Destroying job...Done. globusrun-ws: Job failed: The executable could not be started. qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes Likewise if I use this extension: ia64-compute 3 But finally... ia64-compute 5 1 allocates 5 hosts. So it looks like you need to specify both hostCount and cpusPerHost. So that is how to specify it with GRAM4 direct submission. I'll have to have a play around to figure out how that can be specified in Swift+GRAM4. -- From lixi at uchicago.edu Wed Jul 2 08:07:49 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 08:07:49 -0500 (CDT) Subject: [Swift-user] Re: No response of Swift run Message-ID: <20080702080749.BBV69776@m4500-03.uchicago.edu> >Hi, > >I launched a Swift workflow (including 2001 jobs) at 16:16 >yesterday. At 17:20, it returned the results of 2000 jobs, >then there is no reponse any more. I wonder why? I enabled >the replication option. > >The log file is very large (more 1G) and is on CI: >/home/lixi/newswift/test/newversion/workflowtest-20080629- >1616-c4h22j03.log > >Please check it, thanks > The similar execution result occurred again. 
The log file is on CI: /home/lixi/newswift/test/newversion/0701/workflowtest- 20080701-1206-sjuu3cnc.log Thanks, Xi From benc at hawaga.org.uk Wed Jul 2 08:14:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 13:14:17 +0000 (GMT) Subject: [Swift-user] Re: [Swift-devel] Re: No response of Swift run In-Reply-To: <20080702080749.BBV69776@m4500-03.uchicago.edu> References: <20080702080749.BBV69776@m4500-03.uchicago.edu> Message-ID: cog r2064 and r2065 introduce some changes in the scheduling code which will reduce the size of log files substantially and fix a hanging problem that was introduced with my r2058 scheduler changes. This might or might not fix your problem. I think probably not, but it is worth a try. -- From lixi at uchicago.edu Wed Jul 2 08:34:04 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 08:34:04 -0500 (CDT) Subject: [Swift-user] Re: [Swift-devel] Re: No response of Swift run Message-ID: <20080702083404.BBV71639@m4500-03.uchicago.edu> >cog r2064 and r2065 introduce some changes in the scheduling code which >will reduce the size of log files substantially and fix a hanging problem >that was introduced with my r2058 scheduler changes. > >This might or might not fix your problem. I think probably not, but it is >worth a try. > Thanks, I'll try. In fact, this is the result of Swift svn swift-r2079 cog- r2063. Xi From benc at hawaga.org.uk Wed Jul 2 08:43:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 13:43:00 +0000 (GMT) Subject: [Swift-user] Re: [Swift-devel] Re: No response of Swift run In-Reply-To: <20080702083404.BBV71639@m4500-03.uchicago.edu> References: <20080702083404.BBV71639@m4500-03.uchicago.edu> Message-ID: On Wed, 2 Jul 2008, lixi at uchicago.edu wrote: > In fact, this is the result of Swift svn swift-r2079 cog- > r2063. Yes, I can see that from the log file. 
Actually it is r2063 with some changes that you have applied, according to the log file (presumably one of the patches that mihael and I sent earlier that you will not need to use after r2065) In your log workflowtest-20080701-1206-sjuu3cnc, a single task appears to still be in 'Active' state, which is possibly why the run does not end. The task ID for that is 0-1-1550-2-1214932015745. It is a file transfer of some kind. I think to site AGLT2 though the log information is a little vague - probably we should give more information there. -- From benc at hawaga.org.uk Wed Jul 2 09:10:03 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 14:10:03 +0000 (GMT) Subject: [Swift-user] Re: [Swift-devel] Re: No response of Swift run In-Reply-To: <20080702083404.BBV71639@m4500-03.uchicago.edu> References: <20080702083404.BBV71639@m4500-03.uchicago.edu> Message-ID: unrelated to your problem: here are log plots: http://www.ci.uchicago.edu/~benc/tmp/report-workflowtest-20080701-1206-sjuu3cnc/ the table: 'sites/success table' gives some quantification of what replication is doing. the columns in that table mean, basically: JOB_SUCCESS - a job ran all the way through APPLICATION_EXCEPTION - a job was attempted but failed JOB_CANCELLED - a job was submitted to the queue, but a replica ran first so this was cancelled. On the big (high success rate) sites, it looks like around a third of submissions end up getting cancelled due to replication. -- From fedorov at cs.wm.edu Wed Jul 2 10:01:02 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Wed, 2 Jul 2008 11:01:02 -0400 Subject: [Swift-user] Passing hostType for MPI jobs In-Reply-To: References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com> Message-ID: <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com> > But finally... > > > > ia64-compute > 5 > 1 > > > > allocates 5 hosts. > > So it looks like you need to specify both hostCount and cpusPerHost. > Ok, I tried that. 
It indeed allocates the correct number of requested hosts. But there's still a problem. It appears that only one instance of the executable is running, at least when I set jobType to mpi. I am not sure it is being run as an MPI job. I have a simple MPI code that outputs rank and COMM_WORLD size, and the test says I have a total of 1 process when I submit my job with the following job specification:

/home/fedorov/local/bin/hello_mpi /home/fedorov/scratch/hello_mpi_xml.stdout /home/fedorov/scratch/hello_mpi_xml.stderr 10 mpi compute 4 1 4

Ben, can you try to run some MPI executable, and see if it works for you?

By the way, I also discovered that sometimes the order of tags in .xml makes a difference (meaning, with certain order of "count", "walltime" and "hostCount" globusrun-ws will abort). I had no idea order matters...

Andrey

From hategan at mcs.anl.gov Wed Jul 2 10:12:44 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 02 Jul 2008 10:12:44 -0500
Subject: [Swift-user] Re: No response of Swift run
In-Reply-To: <20080702080749.BBV69776@m4500-03.uchicago.edu>
References: <20080702080749.BBV69776@m4500-03.uchicago.edu>
Message-ID: <1215011564.469.4.camel@localhost>

Could you do the following for me:
1. edit dist/vdsk-xyz/bin/swift
2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n"' (a single line) (you may need to do this every time you compile swift)
3. then run it again and let me know when it hangs. Don't kill the hanging workflow. Let it hang instead.
4. Also let me know what machine you run this on.

On Wed, 2008-07-02 at 08:07 -0500, lixi at uchicago.edu wrote:
> >Hi,
> >
> >I launched a Swift workflow (including 2001 jobs) at 16:16
> >yesterday. At 17:20, it returned the results of 2000 jobs,
> >then there is no response any more. I wonder why? I enabled
> >the replication option.
> > > >The log file is very large (more 1G) and is on CI: > >/home/lixi/newswift/test/newversion/workflowtest-20080629- > >1616-c4h22j03.log > > > >Please check it, thanks > > > The similar execution result occurred again. The log file is > on CI: > /home/lixi/newswift/test/newversion/0701/workflowtest- > 20080701-1206-sjuu3cnc.log > > Thanks, > > Xi > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From lixi at uchicago.edu Wed Jul 2 12:22:09 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 12:22:09 -0500 (CDT) Subject: [Swift-user] Re: No response of Swift run Message-ID: <20080702122209.BBV97884@m4500-03.uchicago.edu> >Could you do the following for me: >1. edit dist/vdsk-xyz/bin/swift >2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug >- Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n" ' (a >single line) (you may need to do this every time you compile swift) >3. then run it again and let me know when it hangs. Don't kill the >hanging workflow. Let it hang instead. >4. Also let me know what machine you run this on. Now I'm running this workflow again on login.ci.uchicago.edu. Meanwhile, I launched another swift run to test a single site, but I got such error: [lixi at login GLOW]$ swift -sites.file GLOW.sites.xml -tc.file tc.data workflowtest.swift ERROR: transport error 202: bind failed: Address already in use ["transport.c",L41] ERROR: JDWP Transport dt_socket failed to initialize, TRANSPORT_INIT(510) ["debugInit.c",L500] JDWP exit error JVMTI_ERROR_INTERNAL(113): No transports initializedFATAL ERROR in native method: JDWP No transports initialized, jvmtiError=JVMTI_ERROR_INTERNAL(113) Is there something to do with this option? 
Thanks, Xi From hategan at mcs.anl.gov Wed Jul 2 12:30:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 02 Jul 2008 12:30:52 -0500 Subject: [Swift-user] Re: No response of Swift run In-Reply-To: <20080702122209.BBV97884@m4500-03.uchicago.edu> References: <20080702122209.BBV97884@m4500-03.uchicago.edu> Message-ID: <1215019852.3631.4.camel@localhost> On Wed, 2008-07-02 at 12:22 -0500, lixi at uchicago.edu wrote: > >Could you do the following for me: > >1. edit dist/vdsk-xyz/bin/swift > >2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug > >- > Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n" > ' (a > >single line) (you may need to do this every time you > compile swift) > >3. then run it again and let me know when it hangs. Don't > kill the > >hanging workflow. Let it hang instead. > >4. Also let me know what machine you run this on. > > Now I'm running this workflow again on login.ci.uchicago.edu. > > Meanwhile, I launched another swift run to test a single > site, but I got such error: > [lixi at login GLOW]$ swift -sites.file GLOW.sites.xml -tc.file > tc.data workflowtest.swift > ERROR: transport error 202: bind failed: Address already in > use ["transport.c",L41] > ERROR: JDWP Transport dt_socket failed to initialize, > TRANSPORT_INIT(510) ["debugInit.c",L500] > JDWP exit error JVMTI_ERROR_INTERNAL(113): No transports > initializedFATAL ERROR in native method: JDWP No transports > initialized, jvmtiError=JVMTI_ERROR_INTERNAL(113) > > Is there something to do with this option? It has everything to do with that option :) As far as I remember, things should continue to run ok (except for the debugger not being started), so you should ignore the error message. If swift doesn't run, then you could make two copies of the swift startup script (say swift-debugger with the option and swift without the option). Then if you want the debugger on, use swift-debugger, and for normal runs, use swift. 
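The two-launcher arrangement described above can be scripted. What follows is only a sketch under stated assumptions, not part of Swift: the helper names are hypothetical, and it assumes the launcher script contains a line that starts with OPTIONS=, as the earlier instructions imply. The port parameter is an addition for illustration, so that two debug-enabled runs on one machine need not both try to bind 8888.

```python
from pathlib import Path

# JDWP flags as quoted in this thread; only the port is made configurable.
JDWP_TEMPLATE = (
    "-Xdebug -Xrunjdwp:transport=dt_socket,"
    "address={port},server=y,suspend=n"
)


def with_debugger_options(script_text: str, port: int = 8888) -> str:
    """Return launcher text with its first OPTIONS= line replaced by JDWP flags.

    Assumption: the launcher has a line beginning with 'OPTIONS='.
    """
    replacement = 'OPTIONS="' + JDWP_TEMPLATE.format(port=port) + '"'
    lines = script_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("OPTIONS="):
            lines[i] = replacement
            return "\n".join(lines) + "\n"
    raise ValueError("no line starting with OPTIONS= found")


def make_debugger_copy(swift_script: Path, port: int = 8888) -> Path:
    """Write 'swift-debugger' next to the launcher; the original is not modified."""
    target = swift_script.with_name("swift-debugger")
    target.write_text(with_debugger_options(swift_script.read_text(), port))
    target.chmod(0o755)  # keep the copy executable
    return target


# Hypothetical usage (path taken from the instructions in this thread):
#   make_debugger_copy(Path("dist/vdsk-xyz/bin/swift"))
```

Keeping the plain swift launcher untouched means ordinary runs never attempt to open the debug port, so the "Address already in use" message only arises when two swift-debugger runs share a port.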
> > Thanks, > > Xi From lixi at uchicago.edu Wed Jul 2 12:38:29 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 12:38:29 -0500 (CDT) Subject: [Swift-user] Re: No response of Swift run Message-ID: <20080702123829.BBV99464@m4500-03.uchicago.edu> >> Now I'm running this workflow again on login.ci.uchicago.edu. This workflow with 2001 jobs finished successfully and quickly without hanging up. Then I continue to launch a workflow with 3001 jobs and see the result. >As far as I remember, things should continue to run ok (except for the >debugger not being started), so you should ignore the error message. If >swift doesn't run, then you could make two copies of the swift startup >script (say swift-debugger with the option and swift without the >option). Then if you want the debugger on, use swift- debugger, and for >normal runs, use swift. Do you mean that I could copy swift into swift-debugger (specifying that option). I could choose one of these ways to run swift, e.g: swift first.swift swift-debugger first.swift Then it will invoke the corresponding script. >> >> Thanks, >> >> Xi > From benc at hawaga.org.uk Wed Jul 2 14:10:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 19:10:56 +0000 (GMT) Subject: [Swift-user] Passing hostType for MPI jobs In-Reply-To: <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com> References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com> <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com> Message-ID: On Wed, 2 Jul 2008, Andriy Fedorov wrote: > Ok, I tried that. It indeed allocates correct number of the requested > hosts. But, there's still a problem. It appears that only one instance > of the executable is running, at least when I specify jpbType to mpi. Specify jobType=single. Don't specify jobtype=mpi. Then in your executable, use mpirun. The idea is to make GRAM run only a single job, and use mpirun to launch the executables. 
Look at mpi.sh in the example that I posted. -- From hategan at mcs.anl.gov Wed Jul 2 14:26:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 02 Jul 2008 14:26:49 -0500 Subject: [Swift-user] Re: No response of Swift run In-Reply-To: <20080702123829.BBV99464@m4500-03.uchicago.edu> References: <20080702123829.BBV99464@m4500-03.uchicago.edu> Message-ID: <1215026809.5659.0.camel@localhost> On Wed, 2008-07-02 at 12:38 -0500, lixi at uchicago.edu wrote: > >> Now I'm running this workflow again on > login.ci.uchicago.edu. > > This workflow with 2001 jobs finished successfully and > quickly without hanging up. Then I continue to launch a > workflow with 3001 jobs and see the result. > > >As far as I remember, things should continue to run ok > (except for the > >debugger not being started), so you should ignore the error > message. If > >swift doesn't run, then you could make two copies of the > swift startup > >script (say swift-debugger with the option and swift > without the > >option). Then if you want the debugger on, use swift- > debugger, and for > >normal runs, use swift. > > Do you mean that I could copy swift into swift-debugger > (specifying that option). I could choose one of these ways > to run swift, e.g: > swift first.swift > swift-debugger first.swift > > Then it will invoke the corresponding script. Yes. > > >> > >> Thanks, > >> > >> Xi > > From fedorov at cs.wm.edu Wed Jul 2 14:43:42 2008 From: fedorov at cs.wm.edu (Andriy Fedorov) Date: Wed, 2 Jul 2008 15:43:42 -0400 Subject: [Swift-user] Passing hostType for MPI jobs In-Reply-To: References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com> <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com> Message-ID: <82f536810807021243y1922913brf2fb5d0b2b41bbf@mail.gmail.com> On Wed, Jul 2, 2008 at 3:10 PM, Ben Clifford wrote: > > On Wed, 2 Jul 2008, Andriy Fedorov wrote: > >> Ok, I tried that. It indeed allocates correct number of the requested >> hosts. 
But, there's still a problem. It appears that only one instance >> of the executable is running, at least when I specify jpbType to mpi. > > Specify jobType=single. Don't specify jobtype=mpi. Then in your > executable, use mpirun. The idea is to make GRAM run only a single job, > and use mpirun to launch the executables. Look at mpi.sh in the example > that I posted. > I was referring to using GT4 GRAM directly -- no Swift. What is happening doesn't seem right to me. Not that this is the right place to talk about GRAM issues, just reporting my experience. > -- > From benc at hawaga.org.uk Wed Jul 2 17:14:55 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 2 Jul 2008 22:14:55 +0000 (GMT) Subject: [Swift-user] Passing hostType for MPI jobs In-Reply-To: <82f536810807021243y1922913brf2fb5d0b2b41bbf@mail.gmail.com> References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com> <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com> <82f536810807021243y1922913brf2fb5d0b2b41bbf@mail.gmail.com> Message-ID: On Wed, 2 Jul 2008, Andriy Fedorov wrote: > I was referring to using GT4 GRAM directly -- no Swift. What is > happening doesn't seem right to me. Not that this is the right place > to talk about GRAM issues, just reporting my experience. what I was showing was specifically for running swift+mpi - it needs to happen very differently to plain gram+mpi because of the server-side components of Swift. -- From lixi at uchicago.edu Wed Jul 2 17:40:45 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Wed, 2 Jul 2008 17:40:45 -0500 (CDT) Subject: [Swift-user] Swift run finished with errors Message-ID: <20080702174045.BBW30506@m4500-03.uchicago.edu> Hi, I just ran a workflow with 5001 jobs, but it terminated with errors during execution. It seems that job 0-1-588 produces a failure which is caused by site SWT2_CPB's sudden connection error and leads to the failure of whole workflow. 
The log file plot is:
http://www.ci.uchicago.edu/~lixi/Log/report-workflowtest-20080702-1415-s9vmjplf/

The log file is on CI:
/home/lixi/newswift/test/newversion/0702/workflowtest-20080702-1415-s9vmjplf.log

Could you find out whether this job was resubmitted to another site or to the same site before the final failure?

Thanks,

Xi

From benc at hawaga.org.uk Thu Jul 3 03:22:57 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Jul 2008 08:22:57 +0000 (GMT)
Subject: [Swift-user] Swift run finished with errors
In-Reply-To: <20080702174045.BBW30506@m4500-03.uchicago.edu>
References: <20080702174045.BBW30506@m4500-03.uchicago.edu>
Message-ID:

That job failed 3 times. Sometimes that will happen.

There are various things you can do to reduce the effect this has on your run:

Turn on lazy.errors in swift.properties:
  Normally when one job has failed (eg. it has used up all of its retries) then the whole run is immediately abandoned.
  If you turn on lazy errors, then the rest of the run will attempt to continue. This means that you might end up with a run in which only that one job (or perhaps only a small number of jobs) has failed. The restart log (*.rlog) should then let you run again to try that small number again.

Increase the number of retries in swift.properties - execution.retries.
  This is set to 2 by default, meaning that a job will be executed up to three times - once originally, and twice more as retries if there are failures. You can increase this a small amount, eg to 5, to massively reduce the probability of a job failing because of random job failures. (eg if you have p=0.01 chance of a job submission failing, then execution.retries=2 gives p^3 = 0.000001 chance of failure; but execution.retries=5 gives p^6 = 0.000000000001 chance of failure)

  This does not help when the failures are caused by a broken job (such as missing input files on the submit side); in such a case it will increase load on remote systems and slow the run down.
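The retry arithmetic above can be checked with a short sketch. It assumes every attempt fails independently with the same probability p, which is a simplification (failures caused by a site outage tend to be correlated); the function name is made up for illustration and is not part of Swift.

```python
def failure_probability(p: float, retries: int) -> float:
    """Chance that a job fails on every attempt, assuming independent failures.

    With execution.retries = r, a job is attempted up to r + 1 times
    (once originally, plus r retries), so all attempts fail with p ** (r + 1).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if retries < 0:
        raise ValueError("retries must be non-negative")
    return p ** (retries + 1)


if __name__ == "__main__":
    # The two settings discussed above, with p = 0.01 per attempt.
    for retries in (2, 5):
        prob = failure_probability(0.01, retries)
        print(f"execution.retries={retries}: {prob:.0e}")
```

As noted above, this only models random failures: a job that is broken (p close to 1) gains nothing from extra retries and simply costs more attempts.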
-- From benc at hawaga.org.uk Thu Jul 3 03:34:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 3 Jul 2008 08:34:18 +0000 (GMT) Subject: [Swift-user] Passing hostType for MPI jobs In-Reply-To: <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com> References: <82f536810807010639k7fc97510gf0dde83b47038fb3@mail.gmail.com> <82f536810807020801u16fcb952i14d8fc5246f432a7@mail.gmail.com> Message-ID: On Wed, 2 Jul 2008, Andriy Fedorov wrote: > Ok, I tried that. It indeed allocates correct number of the requested > hosts. But, there's still a problem. It appears that only one instance > of the executable is running, at least when I specify jpbType to mpi. > I am not sure it is being run as an MPI job. I can replicate that with plain GRAM4 on TG UC. In the PBS Epilogue, I see: Limits: nodes=5:ia64-compute:ppn=1,walltime=00:15:00 Nodes: tg-c053 tg-c034 tg-c020 tg-c011 tg-c007 but my code only has COMM_WORLD size 1. This code doesn't run at all if it is not run through mpi, so I think the code *is* being run as an mpi job but the mpi node count is not getting specified correctly. My present recommended way of doing mpi in Swift is not using jobtype=mpi in gram, though, so I don't want to spend too much time figuring this out. The gram-user at globus.org list and/or help at teragrid.org probably can offer more. > By the way, I also discovered, that sometimes the order of tags in > .xml makes difference (meaning, with certain order of "count", > "walltime" and "hostCount" globusrun-ws will abort). I had no idea > order matters... yes. 
Those options are defined with an XML Schema which means, to be valid, they must appear in the order they are defined in: http://www.globus.org/toolkit/docs/4.0/execution/wsgram/schemas/gram_job_description.html#type_JobDescriptionType -- From lixi at uchicago.edu Thu Jul 3 07:21:24 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Thu, 3 Jul 2008 07:21:24 -0500 (CDT) Subject: [Swift-user] Swift run finished with errors Message-ID: <20080703072124.BBW63459@m4500-03.uchicago.edu> Thank you for detailed explanations. In addition, I want to know to which sites were this 3 tries submitted and how about the replications, because I want to explore details of scheduler's behavior. Thanks, Xi ---- Original message ---- >Date: Thu, 3 Jul 2008 08:22:57 +0000 (GMT) >From: Ben Clifford >Subject: Re: [Swift-user] Swift run finished with errors >To: lixi at uchicago.edu >Cc: swift-user > > >That job failed 3 times. Sometimes that will happen. > >There are various things you can do to reduce the effect this has on your >run: > >Turn on lazy.errors in swift.properties: > Normally when one job has failed (eg. it has used up all of its > retries) then the whole run is immediately abandoned. > If you turn on lazy errors, then the rest of the run will attempt to > continue. This means that you might end up with a run in which only > that one job (or perhaps only a small number of jobs) has failed. The > restart log (*.rlog) should then let you run again to try that small > number again. > >Increase the number of retries in swift.properties - execution.retries. > This is set to 2 by default, meaning that a job will be executed up to > three times - once originally, and twice more as retries if there are > failures. You can increase this a small amount, eg to 5, to massively > reduce the probability of of a job caused by random job failures. 
(eg
>  if you have p=0.01 chance of a job submission failing, then
>  execution.retries=2 gives p^3 = 0.000001 chance of failure; but
>  execution.retries=5 gives p^6 = 0.000000000001 chance of failure)
>
>  This does not help when the failures are caused by a broken job (such
>  as missing input files on the submit side); in such a case it will
>  increase load on remote systems and slow the run down.
>
>--
>

From benc at hawaga.org.uk Thu Jul 3 07:57:14 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 3 Jul 2008 12:57:14 +0000 (GMT)
Subject: [Swift-user] Swift run finished with errors
In-Reply-To: <20080703072124.BBW63459@m4500-03.uchicago.edu>
References: <20080703072124.BBW63459@m4500-03.uchicago.edu>
Message-ID:

On Thu, 3 Jul 2008, lixi at uchicago.edu wrote:

> Thank you for detailed explanations.
>
> In addition, I want to know to which sites were this 3 tries
> submitted and how about the replications, because I want to
> explore details of scheduler's behavior.

You can get such numbers from the sites/score table in log processing output. APPLICATION_EXCEPTION means a job failed on a site; and JOB_CANCELED (using log processing >r2082) means a job was cancelled on this site, which is usually because replication meant a different site ran the same job.

--

From lixi at uchicago.edu Thu Jul 3 08:05:44 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Thu, 3 Jul 2008 08:05:44 -0500 (CDT)
Subject: [Swift-user] Swift run finished with errors
Message-ID: <20080703080544.BBW65610@m4500-03.uchicago.edu>

>You can get such numbers from the sites/score table in log processing
>output. APPLICATION_EXCEPTION means a job failed on a site; and
>JOB_CANCELED (using log processing >r2082) means a job was cancelled on
>this site, which is usually because replication meant a different site ran
>the same job.

Do you mean sites/success table? Yes, I got it. However, that could only give the general information for all jobs.
I really want to know the trace of this single failed job. Sorry to trouble. From lixi at uchicago.edu Sun Jul 6 12:29:01 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sun, 6 Jul 2008 12:29:01 -0500 (CDT) Subject: [Swift-user] Re: No response of Swift run Message-ID: <20080706122901.BBY13774@m4500-03.uchicago.edu> >Could you do the following for me: >1. edit dist/vdsk-xyz/bin/swift >2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug >- Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n" ' (a >single line) (you may need to do this every time you compile swift) >3. then run it again and let me know when it hangs. Don't kill the >hanging workflow. Let it hang instead. >4. Also let me know what machine you run this on. > Today, I ran a workflow with 5001 jobs using swift-debugger, but it finished with error message: ERROR: transport error 202: handshake failed - received >GET http://www< - excepted >JDWP-Handshake< ["transport.c",L41] This is the first time for me to encounter this error. The log file is on CI: /home/lixi/newswift/test/newversion/0706/workflowtest- 20080706-1134-o8s4a3ig.log Thanks, xi From hategan at mcs.anl.gov Sun Jul 6 22:04:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 06 Jul 2008 22:04:36 -0500 Subject: [Swift-user] Re: No response of Swift run In-Reply-To: <20080706122901.BBY13774@m4500-03.uchicago.edu> References: <20080706122901.BBY13774@m4500-03.uchicago.edu> Message-ID: <1215399876.29501.2.camel@localhost> On Sun, 2008-07-06 at 12:29 -0500, lixi at uchicago.edu wrote: > >Could you do the following for me: > >1. edit dist/vdsk-xyz/bin/swift > >2. replace the 'OPTIONS=' line with 'OPTIONS="-Xdebug > >- > Xrunjdwp:transport=dt_socket,address=8888,server=y,suspend=n" > ' (a > >single line) (you may need to do this every time you > compile swift) > >3. then run it again and let me know when it hangs. Don't > kill the > >hanging workflow. Let it hang instead. > >4. 
Also let me know what machine you run this on. > > > > Today, I ran a workflow with 5001 jobs using swift-debugger, > but it finished with error message: > ERROR: transport error 202: handshake failed - received >GET > http://www< - excepted >JDWP-Handshake< ["transport.c",L41] > > This is the first time for me to encounter this error. The > log file is on > CI: /home/lixi/newswift/test/newversion/0706/workflowtest- > 20080706-1134-o8s4a3ig.log Well, probably somebody was nice enough to portscan that machine while the workflow was running. I guess there isn't any easy solution to this. Maybe somebody else has a better idea. > > Thanks, > > xi From benc at hawaga.org.uk Tue Jul 8 01:59:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 8 Jul 2008 06:59:45 +0000 (GMT) Subject: [Swift-user] suggestion for program flow control In-Reply-To: References: <380944.70847.qm@web52304.mail.re2.yahoo.com> Message-ID: On Tue, 17 Jun 2008, Ben Clifford wrote: > > I can definitely see the benefit of having separate pipelines for > > non-dependent parts within the same script, but perhaps there is a way > > to chain dependent functions that is not dependent on files produced by > > previous functions? > > I've been playing with some code to do that as someone else requested it. > > Basically you will be able to have a swiftscript variable that expresses > the dependency, but doesn't have any actual content (such as a file). > > Hopefully later this week there will be something in SVN. Somewhat later than I'd hoped. Swift SVN r2095 has 'extern' types. 
You can use them like this:

(external o) a() {
  app {
    helperA @strcat(@arg("dir"),"/restart-extern.1.out") "/etc/group" "qux";
  }
}

b(external o) {
  app {
    helperB @strcat(@arg("dir"),"/restart-extern.2.out") "/etc/group" "baz";
  }
}

external sync;

sync=a();
b(sync);

This makes a dependency between a and b, but doesn't actually move any data around; it's entirely up to you to ensure that when the a procedure finishes your data is in the right place for b to find it.

--

From iraicu at cs.uchicago.edu Tue Jul 8 15:14:04 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 08 Jul 2008 15:14:04 -0500
Subject: [Swift-user] talk at SCC08 / SWF08 on "Scientific Workflow Systems for 21st Century, New Bottle or New Wine?"
Message-ID: <4873CA8C.80509@cs.uchicago.edu>

Hi all,

In case any of you are attending SCC08 or SWF08 in Hawaii, please join me at a talk on Scientific Workflow Systems for 21st Century, which will take place at 1:30PM (Hawaii time). Here are the slides to my talk: http://people.cs.uchicago.edu/~iraicu/presentations/2008_SWF08.pdf

Cheers,
Ioan

--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E.
58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From iraicu at cs.uchicago.edu Wed Jul 9 11:43:43 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Wed, 09 Jul 2008 11:43:43 -0500
Subject: [Swift-user] CFP: Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08) co-located with IEEE/ACM SC08
Message-ID: <4874EABF.7080503@cs.uchicago.edu>

-------------------------------------------------------------------------------
Call for Papers
-------------------------------------------------------------------------------
The 1st Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS08)
http://dsl.cs.uchicago.edu/MTAGS08/
http://dsl.cs.uchicago.edu/MTAGS08/MTAGS08_CFP.txt
http://dsl.cs.uchicago.edu/MTAGS08/MTAGS08_CFP.pdf
-------------------------------------------------------------------------------
November 17, 2008
Austin, Texas, USA
Co-located with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC08)
===============================================================================

The 1st workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of loosely coupled large scale applications on large scale clusters, Grids, and/or Supercomputers. Many-task computing, the theme of the workshop, encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. We welcome paper submissions on all topics related to MTC on large scale systems.
Papers will be peer-reviewed, and accepted papers will be published by IEEE/ACM through the SC08 proceedings (pending approval). For more information, please visit http://dsl.cs.uchicago.edu/MTAGS08/.

Scope
-------------------------------------------------------------------------------
This workshop will focus on the ability to manage and execute large scale applications on today's largest clusters, Grids, and Supercomputers. Clusters with 50K+ processor cores (e.g. TACC Sun Constellation System - Ranger), Grids with a dozen sites and 100K+ processors (e.g. TeraGrid), and supercomputers with 160K processors (e.g. IBM BlueGene/P) are beginning to come online. Large clusters and supercomputers have traditionally been high performance computing (HPC) systems, as they are efficient at executing tightly coupled parallel jobs within a particular machine with low-latency interconnects; the applications typically use message passing interface (MPI) to achieve the needed inter-process communication. On the other hand, Grids have been the preferred platform for more loosely coupled applications that tend to be managed and executed through workflow systems. In contrast to HPC (tightly coupled applications), these loosely coupled applications make up a new class of applications that we call Many-Task Computing (MTC). MTC systems generally involve the execution of independent, sequential jobs that can be individually scheduled on many different computing resources across multiple administrative boundaries. MTC systems typically achieve this using various grid computing technologies and techniques, and often use files for inter-process communication as an alternative to MPI. MTC is reminiscent of High Throughput Computing (HTC); however, MTC differs from HTC in its emphasis on using many computing resources over short periods of time to accomplish many computational tasks, where the primary metrics are measured in seconds (e.g.
FLOPS, tasks/sec, MB/s I/O rates). HTC on the other hand requires large amounts of computing for longer times (months and years, rather than hours and days, and are generally measured in operations per month). Today's existing HPC systems are a viable platform to host MTC applications. However, some challenges arise in large scale applications when run on large scale systems, which can hamper the efficiency and utilization of these large scale systems. These challenges vary from local resource manager scalability and granularity, efficient utilization of the raw hardware, shared file system contention and scalability, reliability at scale, application scalability, and understanding the limitations of the HPC systems in order to identify good candidate MTC applications. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS08/. Topics ------------------------------------------------------------------------------- MTAGS 2008 topics of interest include, but are not limited to: * Compute Resource Management in large scale clusters, large Grids, and Supercomputers o Scheduling o Job execution frameworks o Local resource manager extensions o Performance evaluation of resource managers in use on large scale systems o Challenges in running many-task workloads on HPC systems * Data Management in large scale Grid and Supercomputer environments: o Data-Aware Scheduling o Shared File System performance and scalability in large deployments o Distributed file systems o Data caching frameworks and techniques * Large-Scale Workflow Systems o Workflow system performance and scalability analysis o Scalability of workflow systems o Workflow infrastructure and e-Science middleware o Programming Paradigms and Models * Large-Scale Many-Task Applications o Large-scale many-task applications o Large-scale many-task data-intensive applications o Large-scale high throughput computing (HTC) applications o Quasi-supercomputing applications, deployments, and experiences Paper 
Submission and Publication ------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 6/10 pages (6 pages for short papers, and 10 pages for standard papers) of double column text using single spaced 9 point size on 8.5 x 11 inch pages, as per ACM 8.5 x 11 manuscript guidelines (http://www.acm.org/sigs/publications/proceedings-templates). Papers conforming to the above guidelines (in PDF format) can be submitted via email to yozha at microsoft.com and iraicu at cs.uchicago.edu before the deadline of August 15th, 2008; please use the subject "MTAGS paper submission". Accepted papers from this workshop will be published by IEEE/ACM through the SC08 proceedings (pending approval). Selected excellent work may be eligible for additional post-conference publication as journal articles or book chapters. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please visit http://dsl.cs.uchicago.edu/MTAGS08/. 
Important Dates ------------------------------------------------------------------------------- * Papers Due: August 15th, 2008 * Notification of Acceptance: October 1st, 2008 * Camera Ready Papers Due: October 15th, 2008 * Workshop Date: November 17th, 2008 Committee Members ------------------------------------------------------------------------------- Workshop Chairs * Yong Zhao, Microsoft * Ian Foster, University of Chicago & Argonne National Laboratory * Ioan Raicu, University of Chicago Technical Committee * Ian Foster, University of Chicago & Argonne National Laboratory * David Abramson, Monash University * Dan Ardelean, Google * Pete Beckman, Argonne National Laboratory * Bob Grossman, University of Illinois at Chicago * Indranil Gupta, University of Illinois at Urbana Champaign * Tevfik Kosar, Louisiana State University * Chuang Liu, Ask.com * Shiyong Lu, Wayne State University * Reagan Moore, University of California at San Diego * Cristina Nita-Rotaru, Purdue University * Marlon Pierce, Indiana University * Ioan Raicu, University of Chicago * Dan Reed, Microsoft * Matei Ripeanu, University of British Columbia * Alex Szalay, The Johns Hopkins University * Douglas Thain, University of Notre Dame * Mike Wilde, University of Chicago & Argonne National Laboratory * Matthew Woitaszek, The University Corporation for Atmospheric Research * Lingyun Yang, Yahoo Search * Sherali Zeadally, University of the District of Columbia * Yong Zhao, Microsoft -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From abejan at ci.uchicago.edu Thu Jul 10 07:23:44 2008
From: abejan at ci.uchicago.edu (Alina Bejan)
Date: Thu, 10 Jul 2008 14:23:44 +0200
Subject: [Swift-user] BioInformatics app question
Message-ID: <4875FF50.9070201@ci.uchicago.edu>

Hello Ben/Mihael (I guess),

I have a Swift data structure question, which I'll describe below.

I am trying to write a script that performs the following workflow (the example below is a computation between one genome and another, i.e. between 2 .faa files):

formatdb -i Ban.faa
formatdb -i Bce.faa
blastall -p blastp -d Ban.faa -i Bce.faa -m 9 -o out.Bce2Ban.txt
blastall -p blastp -d Bce.faa -i Ban.faa -m 9 -o out.Ban2Bce.txt
simple_reciprocal_best_hits.00.pl -i1 out.Bce2Ban.txt -i2 out.Ban2Bce.txt -o ortholog.pairs.txt

The 1-1 example works just fine (ortho.swift included). This file also works well on multiple OSGEDU sites (that is, when I use it with the osgedu-sites.xml included).

I am now trying to scale this up, using a set of 30 genomes (i.e. 30x30/2 computations - due to symmetry) -- ortho-many.swift included. The 30 .faa files are located in the abejan/testBLAST/FASTA directory.

The problem is that I don't find a suitable mapper for this -- the idea is that I need to store the intermediate files generated in the formatdb step (the .phr, .pin, .psq files) into a structure, and map the components of this structure to the newly generated files. Swift complains about the way I do it now.

Ultimately I would like to run 'ortho-many' on multiple sites.

Any help is appreciated.
Thanks, Alina -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ortho.swift URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: osgedu-sites.xml Type: text/xml Size: 3921 bytes Desc: not available URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ortho-many.swift URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: blast-tc.data URL: From Majordomo at globus.org Thu Jul 10 15:41:14 2008 From: Majordomo at globus.org (Majordomo at globus.org) Date: Thu, 10 Jul 2008 15:41:14 -0500 (CDT) Subject: [Swift-user] Welcome to swift-user Message-ID: <20080710204114.D77A812DA7@mailbouncer.mcs.anl.gov> -- Welcome to the swift-user mailing list! Please save this message for future reference. Thank you. If you ever want to remove yourself from this mailing list, you can send mail to with the following command in the body of your email message: unsubscribe swift-user or from another account, besides swift-user at ci.uchicago.edu: unsubscribe swift-user swift-user at ci.uchicago.edu If you ever need to get in contact with the owner of the list, (if you have trouble unsubscribing, or have questions about the list itself) send email to . This is the general rule for most mailing lists when you need to contact a human. Here's the general information for the list you've subscribed to, in case you don't already have it: Discussion list for swift users From zhengxiongh at uchicago.edu Mon Jul 14 14:34:15 2008 From: zhengxiongh at uchicago.edu (Zhengxiong Hou) Date: Mon, 14 Jul 2008 14:34:15 -0500 (CDT) Subject: [Swift-user] readdata or csv_mapper problem Message-ID: <20080714143415.BIO88651@m4500-01.uchicago.edu> Hello, When I try to run many dock jobs on the osg grid sites, there are problems for using readdata or csv_mapper. Please help to solve it. 
Here is the experiment at localhost: The swift code is as follows: *********************************************************** [houzx at communicado dock]$ cat grid-many-dock6-string.swift type file; (file t) dockcompute (string ligandsfile, string targetlist) { app { rundock ligandsfile targetlist stdout=@filename(t); } } type params { string ligandsfile; string targetlist; } #params pset[] ; doall(params pset[]) { foreach params,i in pset { #string mol2file ; #string target ; file sout ; sout = dockcompute(pset[i].ligandsfile,pset [i].targetlist); } } params p[]; p = readdata("paramslist.txt"); doall(p); *********************************************************** The content of "paramslist.txt" is as follows: [houzx at communicado dock]$ cat paramslist.txt ligandsfile,targetlist /home/houzx/dock- run/databases/KEGG_and_Drugs/D00180.mol2,1F9Y /home/houzx/dock- run/databases/KEGG_and_Drugs/D00181.mol2,1F9Y /home/houzx/dock- run/databases/KEGG_and_Drugs/D00182.mol2,1F9Y (1) Use this "readdata" code, and the log file is in the attachment "grid-many-dock6-string-20080714-readdata.log". [houzx at communicado dock]$ swift grid-many-dock6-string.swift Swift v0.4 swift-r1718 cog-r1934 RunID: 20080714-1405-letz6tcb Progress: Execution failed: File header does not match type. Expected the following header items (in no particular order): [ligandsfile, targetlist]. 
Instead, the header was (again, in no particular order): [ligandsfile,targetlist] (2) Use csv_mapper, and the log file is in the attachment "grid-many-dock6-string-20080714-1417-csv.log" [houzx at communicado dock]$ swift grid-many-dock6-string.swift Swift v0.4 swift-r1718 cog-r1934 RunID: 20080714-1417-pmo8hsjf Progress: rundock started rundock started rundock started rundock completed rundock completed rundock completed Final status: Finished:3 *********************************************************** [houzx at communicado dock]$ cat grid-many-dock6-string.swift type file; (file t) dockcompute (string ligandsfile, string targetlist) { app { rundock ligandsfile targetlist stdout=@filename(t); } } type params { string ligandsfile; string targetlist; } params pset[] ; foreach params,i in pset { file sout ; sout = dockcompute(pset[i].ligandsfile,pset [i].targetlist); } *********************************************************** But, in the "/home/houzx/dock-run/databases/results/", the created files are: null-0-stdout.txt, null-1-stdout.txt,null- 2-stdout.txt. It means that "pset[i].targetlist" is set to be "null", not the data "1F9Y" from paramslist.txt! If I use "Swift v0.3", the created files are:true-0- stdout.txt,true-1-stdout.txt, true-2-stdout.txt. Thanks! B.R. zhengxiong -------------- next part -------------- A non-text attachment was scrubbed... Name: grid-many-dock6-string-20080714-1417-csv.log Type: application/octet-stream Size: 77494 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: grid-many-dock6-string-20080714-readdata.log Type: application/octet-stream Size: 10140 bytes Desc: not available URL: From wilde at mcs.anl.gov Mon Jul 14 14:43:16 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 14 Jul 2008 14:43:16 -0500 Subject: [Swift-user] readdata or csv_mapper problem In-Reply-To: <20080714143415.BIO88651@m4500-01.uchicago.edu> References: <20080714143415.BIO88651@m4500-01.uchicago.edu> Message-ID: <487BAC54.9050705@mcs.anl.gov> I think the readdata problem is that you need whitespace between the var names in line 1. The Users Guide says: "For structs of scalars, the file should contain two rows. The first row should be structure member names separated by whitespace. The second row should be the corresponding values for each structure member, separated by whitespace, in the same order as the header row." On 7/14/08 2:34 PM, Zhengxiong Hou wrote: > Hello, > When I try to run many dock jobs on the osg grid sites, > there are problems for using readdata or csv_mapper. Please > help to solve it. 
Here is the experiment at localhost: > > The swift code is as follows: > *********************************************************** > [houzx at communicado dock]$ cat grid-many-dock6-string.swift > type file; > (file t) dockcompute (string ligandsfile, string targetlist) > { > app { > rundock ligandsfile targetlist stdout=@filename(t); > } > } > > type params { > string ligandsfile; > string targetlist; > } > > #params pset[] ; > doall(params pset[]) > { > foreach params,i in pset { > #string mol2file [i].ligandsfile>; > #string target [i].targetlist>; > file sout ("/home/houzx/dock-run/databases/results/",pset > [i].targetlist,"-",i,"-stdout.txt")>; > sout = dockcompute(pset[i].ligandsfile,pset > [i].targetlist); > } > } > > params p[]; > p = readdata("paramslist.txt"); > doall(p); > *********************************************************** > > The content of "paramslist.txt" is as follows: > [houzx at communicado dock]$ cat paramslist.txt > ligandsfile,targetlist > /home/houzx/dock- > run/databases/KEGG_and_Drugs/D00180.mol2,1F9Y > /home/houzx/dock- > run/databases/KEGG_and_Drugs/D00181.mol2,1F9Y > /home/houzx/dock- > run/databases/KEGG_and_Drugs/D00182.mol2,1F9Y > > > (1) Use this "readdata" code, and the log file is in the > attachment "grid-many-dock6-string-20080714-readdata.log". > > [houzx at communicado dock]$ swift grid-many-dock6-string.swift > Swift v0.4 swift-r1718 cog-r1934 > > RunID: 20080714-1405-letz6tcb > Progress: > Execution failed: > File header does not match type. Expected the > following header items (in no particular order): > [ligandsfile, targetlist]. 
Instead, the header was (again, > in no particular order): [ligandsfile,targetlist] > > > (2) Use csv_mapper, and the log file is in the attachment > "grid-many-dock6-string-20080714-1417-csv.log" > [houzx at communicado dock]$ swift grid-many-dock6-string.swift > Swift v0.4 swift-r1718 cog-r1934 > > RunID: 20080714-1417-pmo8hsjf > Progress: > rundock started > rundock started > rundock started > rundock completed > rundock completed > rundock completed > Final status: Finished:3 > *********************************************************** > [houzx at communicado dock]$ cat grid-many-dock6-string.swift > type file; > (file t) dockcompute (string ligandsfile, string targetlist) > { > app { > rundock ligandsfile targetlist stdout=@filename(t); > } > } > > type params { > string ligandsfile; > string targetlist; > } > > params pset[] ; > > foreach params,i in pset { > file sout ("/home/houzx/dock-run/databases/results/",pset > [i].targetlist,"-",i,"-stdout.txt")>; > sout = dockcompute(pset[i].ligandsfile,pset > [i].targetlist); > } > *********************************************************** > > But, in the "/home/houzx/dock-run/databases/results/", the > created files are: null-0-stdout.txt, null-1-stdout.txt,null- > 2-stdout.txt. > It means that "pset[i].targetlist" is set to be "null", not > the data "1F9Y" from paramslist.txt! > If I use "Swift v0.3", the created files are:true-0- > stdout.txt,true-1-stdout.txt, true-2-stdout.txt. > > > Thanks! > B.R. 
> zhengxiong > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From zhengxiongh at uchicago.edu Mon Jul 14 15:08:25 2008 From: zhengxiongh at uchicago.edu (Zhengxiong Hou) Date: Mon, 14 Jul 2008 15:08:25 -0500 (CDT) Subject: [Swift-user] readdata or csv_mapper problem Message-ID: <20080714150825.BIO93135@m4500-01.uchicago.edu> Hi Mike, It works now for Swift 0.4. Actually, I modified it in Swift 0.3 to use "space", but it seemed that the problem was still there. I was puzzled. Anyway, it can works now! Thanks much! Zhengxiong ---- Original message ---- >Date: Mon, 14 Jul 2008 14:43:16 -0500 >From: Michael Wilde >Subject: Re: [Swift-user] readdata or csv_mapper problem >To: Zhengxiong Hou >Cc: swift-user at ci.uchicago.edu > >I think the readdata problem is that you need whitespace between the var >names in line 1. The Users Guide says: > >"For structs of scalars, the file should contain two rows. The first row >should be structure member names separated by whitespace. The second row >should be the corresponding values for each structure member, separated >by whitespace, in the same order as the header row." 
>

From benc at hawaga.org.uk Tue Jul 15 05:30:41 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 15 Jul 2008 10:30:41 +0000 (GMT)
Subject: [Swift-user] BioInformatics app question
In-Reply-To: <4875FF50.9070201@ci.uchicago.edu>
References: <4875FF50.9070201@ci.uchicago.edu>
Message-ID:

this is not particularly elegant, but it will mostly do what i think you want:

make two custom mappers as shell scripts:

$ cat inmapper
#!/bin/bash
i=0
ls data/* | while read n; do
  echo [$i] $n
  i=$(( $i + 1 ))
done

and medmapper contains:

$ cat medmapper
#!/bin/bash
i=0
ls data/* | while read n; do
  echo [$i].left ${n}.left
  echo [$i].right ${n}.right
  i=$(( $i + 1 ))
done

This pair relies on the fact that mapping will happen at the start and that ls will return files in the same order in both of them.

Then you can use them like this:

type file;

type medfiles {
  file left;
  file right;
}

(medfiles o) preprocess(file i) {
  o.left = touch();
  o.right = touch();
}

compare(file l, file r, medfiles lm, medfiles rm) {
  trace("comparing ",@l," and ",@r);
  process(lm.left);
  process(rm.left);
}

(file f) touch() {
  app { echo "hi" stdout=@f; }
}

process(file f) {
  app { cat "/dev/null" ; }
}

file inputs[] ;
medfiles intermediates[] ;

foreach input,i in inputs {
  intermediates[i] = preprocess(input);
}

foreach left, il in inputs {
  foreach right, ir in inputs {
    compare(left, right, intermediates[il], intermediates[ir]);
  }
}

Also, because the intermediate files are stored in the same directory as the source data, and there is nothing in the mappers to detect if a file is an input or intermediate file, if you run the same workflow twice you will find the previous generation's .left and .right files being picked up as inputs. You will need to rm -v data/*.left data/*.right between runs. This could be fixed in the mappers in a couple of ways, left as an exercise to the reader.
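One possible take on that exercise, as a minimal sketch rather than a tested fix: have the input mapper skip anything that already ends in .left or .right. The function name map_inputs and the optional directory argument are made up for illustration and are not part of Swift; the sketch also assumes that shell glob expansion sorts the same way as the plain ls used in inmapper and medmapper above.

```shell
#!/bin/bash
# Sketch of an input mapper that ignores intermediate files.
# Assumption: intermediates always end in .left or .right, as produced
# by the medmapper naming scheme shown earlier in this message.
# map_inputs is a hypothetical helper name, not a Swift facility.
map_inputs() {
  local dir="${1:-data}"
  local i=0
  local n
  for n in "$dir"/*; do
    [ -e "$n" ] || continue        # empty or missing directory: nothing to map
    case "$n" in
      *.left|*.right) continue ;;  # skip outputs left over from an earlier run
    esac
    echo "[$i] $n"
    i=$(( i + 1 ))
  done
}

# Same output format as inmapper: one "[index] path" line per input file.
map_inputs "${1:-data}"
```

medmapper would need the same case filter, otherwise the indices of inputs[] and intermediates[] stop lining up. With that in place the rm between runs should no longer be needed, provided no real input file name ends in .left or .right.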
A more elegant solution might involve the mapper for intermediates[] doing a transform on the way that inputs[] is mapped, but there is no mapper to do that at the moment in a way that is useful here. (I have some thoughts about what it would look like but they are not developed enough for implementation).

--

From lixi at uchicago.edu Sat Jul 19 10:13:48 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Sat, 19 Jul 2008 10:13:48 -0500 (CDT)
Subject: [Swift-user] GT4
Message-ID: <20080719101348.BCI34463@m4500-03.uchicago.edu>

Hi,

In past experiments, I always used gt2 as the provider. Now I think that it's better to use gt4 instead. The only way I know to migrate from gt2 to gt4 in Swift is to modify the sites file. Is that right?

In my current sites file, the site item is as follows:

/atlas/data08/OSG/DATA

Now, according to the default sites.xml, I replaced it with:

/atlas/data08/OSG/DATA

Then I'm going to test if it works by running first.swift on each site one by one. Is it the right way to test if we can use WS GRAM for that site? For the first site, AGLT2, I got this output:

[lixi at communicado AGLT2]$ swift -sites.file AGLT2.WSGRAM.sites.xml -tc.file /home/lixi/osg/swifttest/tc.data ../first.swift
Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled.
Swift svn swift-r2081 cog-r2065

RunID: 20080719-1005-dutdv6p4
Progress:
echo started
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1
Progress: Selecting site:1

It seems that it doesn't work well.

Could you give me some instructions?
Thanks, Xi From hategan at mcs.anl.gov Sat Jul 19 10:31:26 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 19 Jul 2008 10:31:26 -0500 Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719101348.BCI34463@m4500-03.uchicago.edu> References: <20080719101348.BCI34463@m4500-03.uchicago.edu> Message-ID: <1216481486.11366.5.camel@localhost> On Sat, 2008-07-19 at 10:13 -0500, lixi at uchicago.edu wrote: > Hi, > > In the past experiments, I always use gt2 as provider. Now I > think that it's better to use gt4 instead. The only way I > know to migrate from gt4 to gt2 in Swift is to modify the > sites file. Is that right? > > In my current sites file, the site item is as follows: > > > url="gate01.aglt2.org/jobmanager-condor" major="2" /> > /atlas/data08/OSG/DATA > > > Now according the default sites.xml, I replaced it with: > > > url="gate01.aglt2.org" /> > /atlas/data08/OSG/DATA > That looks about right. But you have to be sure there is a GT4 container on that site. > > Then I'm going to test if it works by running first.swift on > each site one by one. Is it the right way to test if we can > use WS GRAM for that site?. I don't think there is a "right" way here. Though there was this script I wrote somewhere to test such things. It's in bin and called checksites.k. It would test all the sites in sites.xml. > For the first site AGLT2, I got > such output: > [lixi at communicado AGLT2]$ swift -sites.file > AGLT2.WSGRAM.sites.xml - > tc.file /home/lixi/osg/swifttest/tc.data ../first.swift > Unable to find required classes > (javax.activation.DataHandler and > javax.mail.internet.MimeMultipart). Attachment support is > disabled. You can ignore that. It doesn't have any effects on things. 
> Swift svn swift-r2081 cog-r2065 > > RunID: 20080719-1005-dutdv6p4 > Progress: > echo started > Progress: Selecting site:1 > Progress: Selecting site:1 > Progress: Selecting site:1 > Progress: Selecting site:1 > Progress: Selecting site:1 > Progress: Selecting site:1 > Progress: Selecting site:1 > > It seems that it doesn't work well. Can you send logs? > > Could you give me some instructions? Thanks, > > Xi From lixi at uchicago.edu Sat Jul 19 10:39:36 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sat, 19 Jul 2008 10:39:36 -0500 (CDT) Subject: [Swift-user] Re: GT4 Message-ID: <20080719103936.BCI35055@m4500-03.uchicago.edu> >I don't think there is a "right" way here. Though there was this script >I wrote somewhere to test such things. It's in bin and called >checksites.k. It would test all the sites in sites.xml. I see that script. Can I run it on specified sites file alone? Could you give me an example? >Can you send logs? I run it again with swift-debugger, it seems hanging up. I just let it be. The log file is: /home/lixi/osg/swifttest/AGLT2/first-20080719-1032- h40vbfc8.log Thanks, Xi From hategan at mcs.anl.gov Sat Jul 19 10:47:07 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 19 Jul 2008 10:47:07 -0500 Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719103936.BCI35055@m4500-03.uchicago.edu> References: <20080719103936.BCI35055@m4500-03.uchicago.edu> Message-ID: <1216482427.11775.0.camel@localhost> On Sat, 2008-07-19 at 10:39 -0500, lixi at uchicago.edu wrote: > >I don't think there is a "right" way here. Though there was > this script > >I wrote somewhere to test such things. It's in bin and > called > >checksites.k. It would test all the sites in sites.xml. > > I see that script. Can I run it on specified sites file > alone? Could you give me an example? cog-workflow checksites.k mysitesfile.xml > > >Can you send logs? > I run it again with swift-debugger, it seems hanging up. I > just let it be. 
The log file is: > /home/lixi/osg/swifttest/AGLT2/first-20080719-1032- > h40vbfc8.log > > Thanks, > > Xi From hategan at mcs.anl.gov Sat Jul 19 10:49:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 19 Jul 2008 10:49:10 -0500 Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719103936.BCI35055@m4500-03.uchicago.edu> References: <20080719103936.BCI35055@m4500-03.uchicago.edu> Message-ID: <1216482550.11775.3.camel@localhost> On Sat, 2008-07-19 at 10:39 -0500, lixi at uchicago.edu wrote: > >Can you send logs? > I run it again with swift-debugger, it seems hanging up. I > just let it be. The log file is: > /home/lixi/osg/swifttest/AGLT2/first-20080719-1032- > h40vbfc8.log Did you manually stop swift there? Btw, this seems to be the problem: 2008-07-19 10:32:04,845-0500 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1216481524364) setting status to Failed org.glo bus.cog.abstraction.impl.file.IrrecoverableResourceException: Error communicating with the GridFTP server So it has nothing to do with GRAM. 
> > Thanks, > > Xi From lixi at uchicago.edu Sat Jul 19 10:52:38 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sat, 19 Jul 2008 10:52:38 -0500 (CDT) Subject: [Swift-user] Re: GT4 Message-ID: <20080719105238.BCI35341@m4500-03.uchicago.edu> >Btw, this seems to be the problem: >2008-07-19 10:32:04,845-0500 DEBUG TaskImpl Task (type=FILE_OPERATION, >identity=urn:0-1-1216481524364) setting status to Failed org.glo >bus.cog.abstraction.impl.file.IrrecoverableResourceException : Error >communicating with the GridFTP server I see, :) Thanks, Xi From lixi at uchicago.edu Sat Jul 19 10:57:32 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sat, 19 Jul 2008 10:57:32 -0500 (CDT) Subject: [Swift-user] Re: GT4 Message-ID: <20080719105732.BCI35433@m4500-03.uchicago.edu> >cog-workflow checksites.k mysitesfile.xml I ran it and got such output: [lixi at communicado bin]$ cog-workflow checksites.k /home/lixi/osg/swifttest/AGLT2/AGLT2.WSGRAM.site s.xml Execution failed: Missing argument major for sys:element(url, storage, major, minor, patch) gridftp @ checksites.k, line: 12 pool @ AGLT2.WSGRAM.sites.xml, line: 38 pool @ AGLT2.WSGRAM.sites.xml, line: 38 org.globus.cog.karajan.workflow.nodes.Sequential @ AGLT2.WSGRAM.sites.xml sys:executefile @ checksites.k, line: 42 list:list @ checksites.k, line: 42 sys:set @ checksites.k, line: 42 kernel:karajan @ checksites.k, line: 1 checksites.k Detailed exception: Missing argument major for sys:element(url, storage, major, minor, patch) gridftp @ checksites.k, line: 12 pool @ AGLT2.WSGRAM.sites.xml, line: 38 pool @ AGLT2.WSGRAM.sites.xml, line: 38 org.globus.cog.karajan.workflow.nodes.Sequential @ AGLT2.WSGRAM.sites.xml sys:executefile @ checksites.k, line: 42 list:list @ checksites.k, line: 42 sys:set @ checksites.k, line: 42 kernel:karajan @ checksites.k, line: 1 checksites.k at org.globus.cog.karajan.workflow.nodes.user.UserDefinedElement .prepareInstanceArguments(UserDefinedElement.java:196) at 
org.globus.cog.karajan.workflow.nodes.user.UserDefinedElement .startBody(UserDefinedElement.java:170) at org.globus.cog.karajan.workflow.nodes.user.SequentialImplicit ExecutionUDE.startBody (SequentialImplicitExecutionUDE.java:55) at org.globus.cog.karajan.workflow.nodes.user.SequentialImplicit ExecutionUDE.childCompleted (SequentialImplicitExecutionUDE.java:82) at org.globus.cog.karajan.workflow.nodes.Sequential.notification Event(Sequential.java:33) at org.globus.cog.karajan.workflow.nodes.FlowNode.event (FlowNode.java:335) at org.globus.cog.karajan.workflow.events.EventBus.send (EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked (EventBus.java:99) at org.globus.cog.karajan.workflow.nodes.FlowNode.fireNotificati onEvent(FlowNode.java:173) at org.globus.cog.karajan.workflow.nodes.FlowNode.complete (FlowNode.java:299) at org.globus.cog.karajan.workflow.nodes.FlowContainer.post (FlowContainer.java:58) at org.globus.cog.karajan.workflow.nodes.Sequential.startNext (Sequential.java:51) at org.globus.cog.karajan.workflow.nodes.Sequential.executeChild ren(Sequential.java:27) at org.globus.cog.karajan.workflow.nodes.user.UDEWrapper.execute Wrapper(UDEWrapper.java:115) at org.globus.cog.karajan.workflow.nodes.user.SequentialImplicit ExecutionUDE.startArguments (SequentialImplicitExecutionUDE.java:46) at org.globus.cog.karajan.workflow.nodes.user.SequentialImplicit ExecutionUDE.startInstance (SequentialImplicitExecutionUDE.java:37) at org.globus.cog.karajan.workflow.nodes.user.UDEWrapper.pre (UDEWrapper.java:75) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute (FlowContainer.java:62) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart (FlowNode.java:240) at org.globus.cog.karajan.workflow.nodes.FlowNode.start (FlowNode.java:281) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent (FlowNode.java:393) at org.globus.cog.karajan.workflow.nodes.FlowNode.event (FlowNode.java:332) at 
org.globus.cog.karajan.workflow.FlowElementWrapper.event (FlowElementWrapper.java:227) at org.globus.cog.karajan.workflow.events.EventBus.send (EventBus.java:125) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked (EventBus.java:99) at org.globus.cog.karajan.workflow.events.EventWorker.run (EventWorker.java:69) AGLT2.WSGRAM.sites.xml includes such content: /atlas/data08/OSG/DATA Does this output prove my sites file is improper? Thanks, Xi From benc at hawaga.org.uk Sat Jul 19 11:04:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 19 Jul 2008 16:04:23 +0000 (GMT) Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719105732.BCI35433@m4500-03.uchicago.edu> References: <20080719105732.BCI35433@m4500-03.uchicago.edu> Message-ID: gate01.aglt2.org is reachable from login.ci but not from my UK machine. So there might be some strange network stuff going on with that host. For GRAM4 on that machine you need (by the looks of it) to use tcp port 9443, not the default 8443 which is running some other services. I think this will in general be true for all OSG resources. I can't submit to GRAM4 on that machine because I'm not authorized: $ globusrun-ws -submit -F gate01.aglt2.org:9443 -c /bin/hostname Submitting job...Failed. globusrun-ws: Error submitting job globus_soap_message_module: SOAP Fault Fault code: soapenv:Server.userException Fault string: org.globus.wsrf.impl.security.authorization.exceptions.AuthorizationException: "/DC=org/DC=doegrids/OU=People/CN=Benjamin Clifford 418168" is not authorized to use operation: {http://www.globus.org/namespaces/2004/10/gram/job}createManagedJob on this service nor can I use gridftp on that machine: $ globus-url-copy file:///etc/group gsiftp://gate01.aglt2.org/tmp/benc008 error: globus_ftp_client: the server responded with an error 530 530-Login incorrect. 
: gridmap.c:globus_gss_assist_map_and_authorize:1944: 530-Error invoking callout 530-globus_callout.c:globus_callout_handle_call_type:727: 530-The callout returned an error 530-prima_module.c:Globus Gridmap Callout:470: 530-Gridmap lookup failure: Could not retrieve mapping for /DC=org/DC=doegrids/OU=People/CN=Benjamin Clifford 418168 from identity mapping server 530- 530 End. You could try the above two commands yourself and see what you get. -- From hategan at mcs.anl.gov Sat Jul 19 11:10:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 19 Jul 2008 11:10:10 -0500 Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719105732.BCI35433@m4500-03.uchicago.edu> References: <20080719105732.BCI35433@m4500-03.uchicago.edu> Message-ID: <1216483810.12254.0.camel@localhost> On Sat, 2008-07-19 at 10:57 -0500, lixi at uchicago.edu wrote: > >cog-workflow checksites.k mysitesfile.xml > > I ran it and got such output: > > [lixi at communicado bin]$ cog-workflow > checksites.k /home/lixi/osg/swifttest/AGLT2/AGLT2.WSGRAM.site > s.xml > > Execution failed: > Missing argument major for sys:element(url, storage, major, > minor, patch) Seems like it hasn't been updated in a while. From lixi at uchicago.edu Sat Jul 19 11:10:36 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sat, 19 Jul 2008 11:10:36 -0500 (CDT) Subject: [Swift-user] Re: GT4 Message-ID: <20080719111036.BCI35848@m4500-03.uchicago.edu> >For GRAM4 on that machine you need (by the looks of it) to use tcp port >9443, not the default 8443 which is running some other services. I think >this will in general be true for all OSG resources. How to change it? Does it like this: /atlas/data08/OSG/DATA >You could try the above two commands yourself and see what you get. [lixi at communicado AGLT2]$ globus-url-copy file:///home/lixi/osg/swifttest/AGLT2/currenttime.tmp gsiftp://gate01.aglt2.org/atlas/data08/OSG/APP/osglixi/ error: globus_ftp_client: the server responded with an error 530 530-Login incorrect. 
: globus_i_gfs_data.c:globus_l_gfs_data_authorize:1050: 530-Mapped user 'osg' is invalid. 530 End. [lixi at communicado AGLT2]$ globusrun-ws -submit -F gate01.aglt2.org:9443 -c /bin/hostname Submitting job...Done. Job ID: uuid:b7d614b8-55ac-11dd-b2fe-001a64784960 Termination time: 07/20/2008 16:06 GMT Current job state: Failed Destroying job...Done. globusrun-ws: Job failed: Error code: 201 Script stderr: /usr/bin/sudo: uid 825675 does not exist in the passwd file! Thanks, Xi From hategan at mcs.anl.gov Sat Jul 19 11:13:25 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 19 Jul 2008 11:13:25 -0500 Subject: [Swift-user] Re: GT4 In-Reply-To: <1216483810.12254.0.camel@localhost> References: <20080719105732.BCI35433@m4500-03.uchicago.edu> <1216483810.12254.0.camel@localhost> Message-ID: <1216484005.12366.0.camel@localhost> On Sat, 2008-07-19 at 11:10 -0500, Mihael Hategan wrote: > On Sat, 2008-07-19 at 10:57 -0500, lixi at uchicago.edu wrote: > > >cog-workflow checksites.k mysitesfile.xml > > > > I ran it and got such output: > > > > [lixi at communicado bin]$ cog-workflow > > checksites.k /home/lixi/osg/swifttest/AGLT2/AGLT2.WSGRAM.site > > s.xml > > > > Execution failed: > > Missing argument major for sys:element(url, storage, major, > > minor, patch) > > Seems like it hasn't been updated in a while. In other words, use what Ben mentions for now. > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From benc at hawaga.org.uk Sat Jul 19 11:12:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 19 Jul 2008 16:12:20 +0000 (GMT) Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719111036.BCI35848@m4500-03.uchicago.edu> References: <20080719111036.BCI35848@m4500-03.uchicago.edu> Message-ID: On Sat, 19 Jul 2008, lixi at uchicago.edu wrote: > >You could try the above two commands yourself and see what > you get. so they both fail for you. 
interact with the site admins for that site to make them work. when they work, try swift again. -- From hategan at mcs.anl.gov Sat Jul 19 11:17:01 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 19 Jul 2008 11:17:01 -0500 Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719111036.BCI35848@m4500-03.uchicago.edu> References: <20080719111036.BCI35848@m4500-03.uchicago.edu> Message-ID: <1216484221.12426.1.camel@localhost> On Sat, 2008-07-19 at 11:10 -0500, lixi at uchicago.edu wrote: > >For GRAM4 on that machine you need (by the looks of it) to > use tcp port > >9443, not the default 8443 which is running some other > services. I think > >this will in general be true for all OSG resources. > > How to change it? Does it like this: > > > url="gate01.aglt2.org:9443" /> > /atlas/data08/OSG/DATA > Yes. > > >You could try the above two commands yourself and see what > you get. > > [lixi at communicado AGLT2]$ globus-url-copy > file:///home/lixi/osg/swifttest/AGLT2/currenttime.tmp > gsiftp://gate01.aglt2.org/atlas/data08/OSG/APP/osglixi/ > > error: globus_ftp_client: the server responded with an error > 530 530-Login incorrect. : > globus_i_gfs_data.c:globus_l_gfs_data_authorize:1050: > 530-Mapped user 'osg' is invalid. > 530 End. Your account there seems messed up. > > [lixi at communicado AGLT2]$ globusrun-ws -submit -F > gate01.aglt2.org:9443 -c /bin/hostname > Submitting job...Done. > Job ID: uuid:b7d614b8-55ac-11dd-b2fe-001a64784960 > Termination time: 07/20/2008 16:06 GMT > Current job state: Failed > Destroying job...Done. > globusrun-ws: Job failed: Error code: 201 > Script stderr: > /usr/bin/sudo: uid 825675 does not exist in the passwd file! Again, your account seems broken. Did this ever work for you? 
> > Thanks, > > Xi From lixi at uchicago.edu Sat Jul 19 11:14:28 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sat, 19 Jul 2008 11:14:28 -0500 (CDT) Subject: [Swift-user] Re: GT4 Message-ID: <20080719111428.BCI35928@m4500-03.uchicago.edu> >so they both fail for you. interact with the site admins for that site to >make them work. when they work, try swift again. Thanks, so is it also the right way to check other sites one by one ? Xi From benc at hawaga.org.uk Sat Jul 19 11:16:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 19 Jul 2008 16:16:00 +0000 (GMT) Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719111428.BCI35928@m4500-03.uchicago.edu> References: <20080719111428.BCI35928@m4500-03.uchicago.edu> Message-ID: On Sat, 19 Jul 2008, lixi at uchicago.edu wrote: > >so they both fail for you. interact with the site admins > for that site to > >make them work. when they work, try swift again. > > Thanks, so is it also the right way to check other sites one > by one ? Those two commands would be the commands I would use to test a site. -- From lixi at uchicago.edu Sat Jul 19 11:18:15 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sat, 19 Jul 2008 11:18:15 -0500 (CDT) Subject: [Swift-user] Re: GT4 Message-ID: <20080719111815.BCI36021@m4500-03.uchicago.edu> >Again, your account seems broken. Did this ever work for you? Although I never use WS GRAM on sites, I use GRAM and GridFtp well for running Swift workflow on that site before. The most recent run was done successfully the day before yesterday. 
>> Thanks, >> >> Xi > From hategan at mcs.anl.gov Sat Jul 19 11:23:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 19 Jul 2008 11:23:36 -0500 Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719111815.BCI36021@m4500-03.uchicago.edu> References: <20080719111815.BCI36021@m4500-03.uchicago.edu> Message-ID: <1216484616.12635.0.camel@localhost> On Sat, 2008-07-19 at 11:18 -0500, lixi at uchicago.edu wrote: > >Again, your account seems broken. Did this ever work for > you? > > Although I never use WS GRAM on sites, I use GRAM and > GridFtp well for running Swift workflow on that site before. > The most recent run was done successfully the day before > yesterday. Did you use a different VO? > > > >> Thanks, > >> > >> Xi > > From lixi at uchicago.edu Sat Jul 19 11:23:42 2008 From: lixi at uchicago.edu (lixi at uchicago.edu) Date: Sat, 19 Jul 2008 11:23:42 -0500 (CDT) Subject: [Swift-user] Re: GT4 Message-ID: <20080719112342.BCI36162@m4500-03.uchicago.edu> >Did you use a different VO? Before I use OSGEDU VO, but I already switch to use OSG VO for more than a month. From benc at hawaga.org.uk Sat Jul 19 11:27:15 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 19 Jul 2008 16:27:15 +0000 (GMT) Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719111815.BCI36021@m4500-03.uchicago.edu> References: <20080719111815.BCI36021@m4500-03.uchicago.edu> Message-ID: On Sat, 19 Jul 2008, lixi at uchicago.edu wrote: > >Again, your account seems broken. Did this ever work for > you? > > Although I never use WS GRAM on sites, I use GRAM and > GridFtp well for running Swift workflow on that site before. > The most recent run was done successfully the day before > yesterday. ok. That gridftp command is nothing gram4-specific. So if it doesn't work, it suggests very strongly that there is a general site problem that has arisen in the past few days. Try the previous gram2 swift submissions and I think you will probably see that that also does not work. 
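The two site checks mentioned earlier in this thread can be wrapped in a small dry-run helper. This is only a sketch: the function name site_check_cmds and the remote scratch path are made up for illustration, the two commands are the ones already quoted in this thread, and the script only prints them rather than contacting any site.

```shell
#!/bin/sh
# Dry-run helper: print the GRAM4 and GridFTP checks from this thread
# for one host. Nothing is submitted; the commands are only echoed.
site_check_cmds() {
    host="$1"
    port="${2:-9443}"                 # 9443: GRAM4 port reported for this OSG site in the thread
    remote="${3:-/tmp/sitecheck.$$}"  # hypothetical remote scratch path (assumption)
    echo "globusrun-ws -submit -F ${host}:${port} -c /bin/hostname"
    echo "globus-url-copy file:///etc/group gsiftp://${host}${remote}"
}

# Example: show the checks for the site discussed above.
site_check_cmds gate01.aglt2.org 9443 /tmp/sitecheck
```

Piping the output to sh would actually run the checks; doing that for each site in turn gives the one-by-one test discussed above.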
-- From tiejing at gmail.com Sat Jul 19 11:44:04 2008 From: tiejing at gmail.com (Jing Tie) Date: Sat, 19 Jul 2008 11:44:04 -0500 Subject: [Swift-user] Re: GT4 In-Reply-To: <20080719112342.BCI36162@m4500-03.uchicago.edu> References: <20080719112342.BCI36162@m4500-03.uchicago.edu> Message-ID: Yes, the site is failing the authentication test. It worked yesterday. Jing On Sat, Jul 19, 2008 at 11:23 AM, wrote: >>Did you use a different VO? > > Before I use OSGEDU VO, but I already switch to use OSG VO > for more than a month. > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > From tiberius at ci.uchicago.edu Mon Jul 21 10:39:32 2008 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Mon, 21 Jul 2008 10:39:32 -0500 Subject: [Swift-user] Help needed with batching up parallel runs Message-ID: Hi I work with some code that generates at some point a number (300 in my case) of parallel identical runs, and I need to batch those up (10 at a time in my case) because each individual run is too short. I don't want Falkon at this point, and I'm not sure about the status of the coaster provider, so I would prefer a clean swift solution I was thinking of some array manipulation, but it was not obvious how to do it with swift. Thanks !
Tibi Here is the code that I have so far, and I need help for: //this is the code that batches a number of runs: based on the size of the array (determined where I make the call), I will return the set of parallel run results (file simFile[])gj_batch_sim(file policyFile, file logFile){ app{ gj_batch_sim @filename(policyFile) @filename(logFile) @filenames(simFile); } } int parallelInstances=300; file simOutputs[]; (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){ // this is just some needed input file logFile; // I want to have batches of size 10 int localBatchSize=10; int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*") trace("Times to do batch_gj_batch_sim",batchRange); foreach i in [1:batchRange] { // HELP HERE: how to do this ? // essentially I need to map the proper batch of file names into the call of gj_batch_sim simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile, logFile); } } -- Tiberiu (Tibi) Stef-Praun, PhD Computational Sciences Researcher Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From wilde at mcs.anl.gov Mon Jul 21 11:46:42 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 21 Jul 2008 11:46:42 -0500 Subject: [Swift-user] Help needed with batching up parallel runs In-Reply-To: References: Message-ID: <4884BD72.1050409@mcs.anl.gov> Tibi, can you use the Swift clustering mechanism? http://www.ci.uchicago.edu/swift/guides/userguide.php#clustering Its meant for this sort of thing, and is nice because you dont need to explicitly do the clustering in your Swift script. "Swift can group a number of short job submissions into a single larger job submission to minimize overhead involved in launching jobs..." 
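A minimal sketch of what enabling clustering might look like in swift.properties. The property names are given as recalled from the clustering section of the user guide and should be checked against the installed version; the values are illustrative and not tuned for this workload.

```
# swift.properties (illustrative values, not taken from this thread)
clustering.enabled=true
# seconds between passes over the clustering queue
clustering.queue.delay=4
# jobs whose maxwalltime is below this threshold (seconds) are candidates for clustering
clustering.min.time=60
```

Under this assumption, the short jobs would also need a maxwalltime profile set (for example in tc.data) so that Swift can tell they fall under the threshold.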
- Mike On 7/21/08 10:39 AM, Tiberiu Stef-Praun wrote: > Hi > > I work with some code that generates at some point a number (300 in my > case) of parallel identical runs, and I need to batch those up (10 at > a time in my case) because each individual run is too short. > I don't want Falkon at this point, and I'm not sure about the status > of the coaster provider, so I would prefer a clean swift solution > I was thinking of some array manipulation, but it was not obvious how > to do it with swift. > > Thanks ! > Tibi > > Here is the code that I have so far, and I need help for: > > > > //this is the code that batches a number of runs: based on the size of > the array (determined where I make the call), I will return the set of > parallel run results > (file simFile[])gj_batch_sim(file policyFile, file logFile){ > app{ > gj_batch_sim @filename(policyFile) @filename(logFile) > @filenames(simFile); > } > } > > int parallelInstances=300; > file simOutputs[]; > > (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){ > // this is just some needed input > file logFile; > > // I want to have batches of size 10 > int localBatchSize=10; > > int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*") > trace("Times to do batch_gj_batch_sim",batchRange); > > foreach i in [1:batchRange] { > // HELP HERE: how to do this ? > // essentially I need to map the proper batch of file > names into the call of gj_batch_sim > > simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile, > logFile); > } > } > > > From zhaozhang at uchicago.edu Mon Jul 21 17:26:24 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 21 Jul 2008 17:26:24 -0500 Subject: [Swift-user] A naive run of Falkon+Swift on BGP login node. Message-ID: <48850D10.7050103@uchicago.edu> Hi, I started a test on BGP login nodes, running falkon service and swift on Login6, and a worker on Login2. Good news is I got the output file. Swift return successful. 
Bad news is there are some problems I don't understand. The swift stdout: /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift Line 2: Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled. Line 3: Swift svn swift-r2140 cog-r2070 Line 4: RunID: 20080721-1713-zkz78kcf Line 5: Progress: Line 6: echo started Line 7: error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use Line 8: Waiting for notification for 0 ms Line 9: Received notification with 1 messages Line 10: echo completed Line 11: Final status: Finished successfully:1/ 1. What is the exception in Line 2? is this ignorable or not? 2. What is the error in Line 7? Is it printed by swift or the deef-provider? Is this ignorable or not? The following exception from Falkon only occurs when I specify the ip.address property in swift The falkon stdout: /2008-07-21 17:00:46,325 ERROR handler.AddressingHandler [ServiceThread-6,invoke:120] Exception in AddressingHandler AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException faultSubcode: faultString: java.io.IOException: '' For input string: "" faultActor: faultNode: faultDetail: {http://xml.apache.org/axis/}stackTrace:java.io.IOException: '' For input string: "" at org.apache.axis.transport.http.ChunkedInputStream.getChunked(ChunkedInputStream.java:161) at org.apache.axis.transport.http.ChunkedInputStream.read(ChunkedInputStream.java:53) at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source) at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source) at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at 
org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source) at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227) at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645) at org.apache.axis.Message.getSOAPEnvelope(Message.java:424) at org.apache.axis.message.addressing.handler.AddressingHandler.processServerRequest(AddressingHandler.java:328) at org.globus.wsrf.handlers.AddressingHandler.processServerRequest(AddressingHandler.java:77) at org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:114) at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) at org.apache.axis.server.AxisServer.invoke(AxisServer.java:248) at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664) at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382) at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291) {http://xml.apache.org/axis/}hostname:login6 / Ioan, any idea about this? I am also attaching the swift log, could anyone check this to tell if there is a problem there, and most important thing is that if swift is using the IP address I specified in the --ip.address parameter? Thanks so much for the help best wishes zhangzhao -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: first-20080721-1713-zkz78kcf.log URL: From hategan at mcs.anl.gov Mon Jul 21 17:39:09 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:39:09 -0500 Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP login node. 
In-Reply-To: <48850D10.7050103@uchicago.edu> References: <48850D10.7050103@uchicago.edu> Message-ID: <1216679949.18694.10.camel@localhost> On Mon, 2008-07-21 at 17:26 -0500, Zhao Zhang wrote: > Hi, > > I started a test on BGP login nodes, running falkon service and swift on > Login6, and a worker on Login2. > Good news is I got the output file. Swift return successful. Bad news is > there are some problems I don't > understand. > > The swift stdout: > /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file > ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift > Line 2: Unable to find required classes (javax.activation.DataHandler > and javax.mail.internet.MimeMultipart). Attachment support is disabled. > Line 3: Swift svn swift-r2140 cog-r2070 > > Line 4: RunID: 20080721-1713-zkz78kcf > Line 5: Progress: > Line 6: echo started > Line 7: error: Notification(int timeout): socket = new > ServerSocket(recvPort); Address already in use > Line 8: Waiting for notification for 0 ms > Line 9: Received notification with 1 messages > Line 10: echo completed > Line 11: Final status: Finished successfully:1/ > > 1. What is the exception in Line 2? is this ignorable or not? Yes. It's axis complaining about some missing stuff that is never used in this case. > 2. What is the error in Line 7? Is it printed by swift or the > deef-provider? provider-deef. Do you have another swift instance running by any chance? > Is this ignorable or not? It isn't. It probably means that the falkon notifications won't get to you. > > > > The following exception from Falkon only occurs when I specify the > ip.address property in swift What exactly did you set it to? Mihael From hategan at mcs.anl.gov Mon Jul 21 17:41:36 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:41:36 -0500 Subject: [Swift-user] Re: [Swift-devel] Re: A naive run of Falkon+Swift on BGP login node. 
In-Reply-To: <48850F53.3010300@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <48850F53.3010300@cs.uchicago.edu> Message-ID: <1216680096.18694.14.camel@localhost> > > Line 7: error: Notification(int timeout): socket = new > > ServerSocket(recvPort); Address already in use > > Line 8: Waiting for notification for 0 ms > > Line 9: Received notification with 1 messages > > Line 10: echo completed > > Line 11: Final status: Finished successfully:1/ > > > > 1. What is the exception in Line 2? is this ignorable or not? > This is not a Falkon provider exception, so I don't know. > > 2. What is the error in Line 7? Is it printed by swift or the > > deef-provider? Is this ignorable or not? > > > You can ignore this, it should really be just a warning. Oops. Sorry. Nevermind what I said. From hategan at mcs.anl.gov Mon Jul 21 17:43:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:43:30 -0500 Subject: [Swift-user] Re: [Swift-devel] Re: A naive run of Falkon+Swift on BGP login node. In-Reply-To: <48850F53.3010300@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <48850F53.3010300@cs.uchicago.edu> Message-ID: <1216680210.18694.17.camel@localhost> On Mon, 2008-07-21 at 17:36 -0500, Ioan Raicu wrote: > > Ioan, any idea about this? > Not really sure what is wrong. Try to fix the exception from line 2 > first. Not the problem. Normally in the wsrf log4j.properties this is masked out. It's the log4j.properties in swift that doesn't. We should change that. > Also, Falkon is using GT4.0.x, is Swift still on GT4.0.x libs? Yes. It's still on gt4.0 From iraicu at cs.uchicago.edu Mon Jul 21 17:38:35 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 21 Jul 2008 17:38:35 -0500 Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP login node. 
In-Reply-To: <1216679949.18694.10.camel@localhost> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> Message-ID: <48850FEB.2020108@cs.uchicago.edu> Mihael Hategan wrote: > On Mon, 2008-07-21 at 17:26 -0500, Zhao Zhang wrote: > >> Hi, >> >> I started a test on BGP login nodes, running falkon service and swift on >> Login6, and a worker on Login2. >> Good news is I got the output file. Swift return successful. Bad news is >> there are some problems I don't >> understand. >> >> The swift stdout: >> /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file >> ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift >> Line 2: Unable to find required classes (javax.activation.DataHandler >> and javax.mail.internet.MimeMultipart). Attachment support is disabled. >> Line 3: Swift svn swift-r2140 cog-r2070 >> >> Line 4: RunID: 20080721-1713-zkz78kcf >> Line 5: Progress: >> Line 6: echo started >> Line 7: error: Notification(int timeout): socket = new >> ServerSocket(recvPort); Address already in use >> Line 8: Waiting for notification for 0 ms >> Line 9: Received notification with 1 messages >> Line 10: echo completed >> Line 11: Final status: Finished successfully:1/ >> >> 1. What is the exception in Line 2? is this ignorable or not? >> > > Yes. It's axis complaining about some missing stuff that is never used > in this case. > > >> 2. What is the error in Line 7? Is it printed by swift or the >> deef-provider? >> > > provider-deef. Do you have another swift instance running by any chance? > > >> Is this ignorable or not? >> > > It isn't. It probably means that the falkon notifications won't get to > you. > This error should just be a warning... as it tries a different port until it finds a good one. It should only print an error when it gives up. So, that is not your problem Zhao, especially as you seem to have run OK, right? 
Line 11: Final status: Finished successfully:1/ Ioan > >> >> The following exception from Falkon only occurs when I specify the >> ip.address property in swift >> > > What exactly did you set it to? > > Mihael > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From hategan at mcs.anl.gov Mon Jul 21 17:44:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 21 Jul 2008 17:44:50 -0500 Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP login node. In-Reply-To: <48850FEB.2020108@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> Message-ID: <1216680290.20073.0.camel@localhost> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote: > > > This error should just be a warning... as it tries a different port > until it finds a good one. It should only print an error when it > gives up. So, that is not your problem Zhao, especially as you seem > to have run OK, right? > > Line 11: Final status: Finished successfully:1/ Yep. Sorry. Spoke without knowing. From iraicu at cs.uchicago.edu Mon Jul 21 17:36:03 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 21 Jul 2008 17:36:03 -0500 Subject: [Swift-user] Re: A naive run of Falkon+Swift on BGP login node.
In-Reply-To: <48850D10.7050103@uchicago.edu> References: <48850D10.7050103@uchicago.edu> Message-ID: <48850F53.3010300@cs.uchicago.edu> Zhao Zhang wrote: > Hi, > > I started a test on BGP login nodes, running falkon service and swift > on Login6, and a worker on Login2. > Good news is I got the output file. Swift return successful. Bad news > is there are some problems I don't > understand. > > The swift stdout: > /Line 1: zzhang at login6.surveyor:~/swift/etc> swift -sites.file > ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift > Line 2: Unable to find required classes > (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). > Attachment support is disabled. > Line 3: Swift svn swift-r2140 cog-r2070 > > Line 4: RunID: 20080721-1713-zkz78kcf > Line 5: Progress: > Line 6: echo started > Line 7: error: Notification(int timeout): socket = new > ServerSocket(recvPort); Address already in use > Line 8: Waiting for notification for 0 ms > Line 9: Received notification with 1 messages > Line 10: echo completed > Line 11: Final status: Finished successfully:1/ > > 1. What is the exception in Line 2? is this ignorable or not? This is not a Falkon provider exception, so I don't know. > 2. What is the error in Line 7? Is it printed by swift or the > deef-provider? Is this ignorable or not? > You can ignore this, it should really be just a warning. 
> > > The following exception from Falkon only occurs when I specify the > ip.address property in swift > The falkon stdout: > > /2008-07-21 17:00:46,325 ERROR handler.AddressingHandler > [ServiceThread-6,invoke:120] Exception in AddressingHandler > AxisFault > faultCode: > {http://schemas.xmlsoap.org/soap/envelope/}Server.userException > faultSubcode: > faultString: java.io.IOException: '' For input string: "" > faultActor: > faultNode: > faultDetail: > {http://xml.apache.org/axis/}stackTrace:java.io.IOException: '' > For input string: "" > at > org.apache.axis.transport.http.ChunkedInputStream.getChunked(ChunkedInputStream.java:161) > > at > org.apache.axis.transport.http.ChunkedInputStream.read(ChunkedInputStream.java:53) > > at > org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown > Source) > at > org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown > Source) > at > org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown > Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown > Source) > at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) > at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown > Source) > at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source) > at > org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227) > > at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645) > at org.apache.axis.Message.getSOAPEnvelope(Message.java:424) > at > org.apache.axis.message.addressing.handler.AddressingHandler.processServerRequest(AddressingHandler.java:328) > > at > org.globus.wsrf.handlers.AddressingHandler.processServerRequest(AddressingHandler.java:77) > > at > org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:114) > > at > org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) > > at 
org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) > at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) > at org.apache.axis.server.AxisServer.invoke(AxisServer.java:248) > at > org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664) > at > org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382) > at > org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291) > > {http://xml.apache.org/axis/}hostname:login6 > / > Ioan, any idea about this? Not really sure what is wrong. Try to fix the exception from line 2 first. Also, Falkon is using GT4.0.x, is Swift still on GT4.0.x libs? Ioan > > I am also attaching the swift log, could anyone check this to tell if > there is a problem there, and most important thing > is that if swift is using the IP address I specified in the > --ip.address parameter? > > Thanks so much for the help > > best wishes > zhangzhao -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Mon Jul 21 17:46:32 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 21 Jul 2008 17:46:32 -0500 Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP login node. 
In-Reply-To: <1216680290.20073.0.camel@localhost>
References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost>
Message-ID: <488511C8.2000707@cs.uchicago.edu>

So Zhao, did it actually work, but you got those two errors and wanted to know what the errors were? If things worked as expected, then you should be fine, you can ignore both of those errors (I think). If things didn't work as expected, then we need to dig deeper to find out why.

Ioan

Mihael Hategan wrote:
> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote:
>
>>>
>>>
>> This error should just be a warning... as it tries a different port
>> until it finds a good one. It should only print an error when it
>> gives up. So, that is not your problem Zhao, especially as you seem
>> to have run OK, right?
>>
>> Line 11: Final status: Finished successfully:1/
>>
>
> Yep. Sorry. Spoke without knowing.
>
>
>
>

--
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From zhaozhang at uchicago.edu Mon Jul 21 18:04:41 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 21 Jul 2008 18:04:41 -0500
Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP login node.
In-Reply-To: <488511C8.2000707@cs.uchicago.edu>
References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost> <488511C8.2000707@cs.uchicago.edu>
Message-ID: <48851609.8050909@uchicago.edu>

In this test case, it actually worked. I talked with Mike, and we don't quite understand these 2 things, so I sent them out.

After that I started another test, running Swift on the login node, the Falkon service on an IO node, and BGexec on a CN.
At the very end of the service log, I got this:
847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 288 512
849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512
851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 100 512 287 512

This means that we are still suffering from the endpoint problem, right?

And from the swift stdout:
zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml -tc.file ./tc.data -ip.address 172.17.3.16 first.swift
Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipart). Attachment support is disabled.
Swift svn swift-r2140 cog-r2070

RunID: 20080721-1748-m9d39dg9
Progress:
echo started
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1
Progress: Executing:1

Swift kept waiting, which means that -ip.address doesn't work as we expected.

zhao

Ioan Raicu wrote:
> So Zhao, did it actually work, but you got those two errors and wanted
> to know what the errors were?
If things worked as expected, then you > should be fine, you can ignore both of those errors (I think). If > things didn't work as expected, then we need to dig deeper to find out > why. > > Ioan > > Mihael Hategan wrote: >> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote: >> >>>> >>>> >>> This error should just be a warning... as it tries a different port >>> until it finds a good one. It should only print an error when it >>> gives up. So, that is not your problem Zhao, especially as you seem >>> to have run OK, right? >>> >>> Line 11: Final status: Finished successfully:1/ >>> >> >> Yep. Sorry. Spoke without knowing. >> >> >> >> > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > From iraicu at cs.uchicago.edu Mon Jul 21 18:08:26 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 21 Jul 2008 18:08:26 -0500 Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP login node. In-Reply-To: <48851609.8050909@uchicago.edu> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost> <488511C8.2000707@cs.uchicago.edu> <48851609.8050909@uchicago.edu> Message-ID: <488516EA.1080703@cs.uchicago.edu> Zhao Zhang wrote: > In this test case, it actually worked. I talked with Mike, and we > don't quite understand these 2 things. So I sent them out. 
> > After that I started another test. Running, swift on Login Node, > falkon service on IO node, and BGexec on CN. > At the very end of the service log, I got his: > 847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 288 512 > 848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 288 512 > 849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 287 512 > 850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 287 512 > 851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 > 100 512 287 512 \ Right, it can't deliver the 2 tasks, as there would have been a 2 before the 0.0 in the middle. > > This means that we are still suffering the endpoint problem, right? Right! You might want to put some debug statements in the Falkon provider to print the end point IP address, to make sure it is the one you are expecting. Ioan > > And from swift stdout, > zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml > -tc.file ./tc.data -ip.address 172.17.3.16 first.swift > Unable to find required classes (javax.activation.DataHandler and > javax.mail.internet.MimeMultipart). Attachment support is disabled. > Swift svn swift-r2140 cog-r2070 > > RunID: 20080721-1748-m9d39dg9 > Progress: > echo started > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > Progress: Executing:1 > > Swift kept waiting, which mean the -ip.address doesn't work as we > expexted. > > zhao > > Ioan Raicu wrote: >> So Zhao, did it actually work, but you got those two errors and >> wanted to know what the errors were? If things worked as expected, >> then you should be fine, you can ignore both of those errors (I >> think). 
If things didn't work as expected, then we need to dig >> deeper to find out why. >> >> Ioan >> >> Mihael Hategan wrote: >>> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote: >>> >>>>> >>>> This error should just be a warning... as it tries a different port >>>> until it finds a good one. It should only print an error when it >>>> gives up. So, that is not your problem Zhao, especially as you seem >>>> to have run OK, right? >>>> Line 11: Final status: Finished successfully:1/ >>>> >>> >>> Yep. Sorry. Spoke without knowing. >>> >>> >>> >>> >> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Mon Jul 21 18:10:45 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 21 Jul 2008 18:10:45 -0500 Subject: [Swift-user] Using > 1 CPU per compute node under GRAM Message-ID: <48851775.5000604@mcs.anl.gov> Im asking this on behalf of Mike Kubal while I wait for more info on his settings: Mike is running under Swift on teragrid/Abe which has 8-core nodes. His jobs are all running 1-job-per-node, wasting 7 cores. I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM. In the meantime, does anyone know if there's a way to specify compute-node-sharing between separate single-cpu jobs via both GRAMs? And if this is dependent on the local job manager code or settings? (Ie might work on some sites but not others)? On globus doc page: http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes I see: ... but cant tell if this applies to single-core jobs or only to multi-core jobs. This will ideally be handled as desired by Falkon or Coaster, but in the meantime I was hoping there was a simple setting to give MikeK better CPU yield on Abe. 
- Mike Wilde --- A sample of one of his jobs looks like this under qstat -ef: Job Id: 395980.abem5.ncsa.uiuc.edu Job_Name = STDIN Job_Owner = mkubal at abe1196 job_state = Q queue = normal server = abem5.ncsa.uiuc.edu Account_Name = onm Checkpoint = u ctime = Mon Jul 21 17:43:47 2008 Error_Path = abe1196:/dev/null Hold_Types = n Join_Path = n Keep_Files = n Mail_Points = n mtime = Mon Jul 21 17:43:47 2008 Output_Path = abe1196:/dev/null Priority = 0 qtime = Mon Jul 21 17:43:47 2008 Rerunable = True Resource_List.ncpus = 1 Resource_List.nodect = 1 Resource_List.nodes = 1 Resource_List.walltime = 00:10:00 Shell_Path_List = /bin/sh etime = Mon Jul 21 17:43:47 2008 submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN And his jobs show up like this under qstat -n (ie are all on core /0 ): 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1 -- 00:10 R -- abe0872/0 While multi-core jobs use +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4 +abe0579/3+abe0579/2+abe0579/1+abe0579/0 From wilde at mcs.anl.gov Mon Jul 21 18:18:16 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 21 Jul 2008 18:18:16 -0500 Subject: [Swift-user] Re: [Swift-devel] A naive run of Falkon+Swift on BGP login node. In-Reply-To: <488516EA.1080703@cs.uchicago.edu> References: <48850D10.7050103@uchicago.edu> <1216679949.18694.10.camel@localhost> <48850FEB.2020108@cs.uchicago.edu> <1216680290.20073.0.camel@localhost> <488511C8.2000707@cs.uchicago.edu> <48851609.8050909@uchicago.edu> <488516EA.1080703@cs.uchicago.edu> Message-ID: <48851938.4070507@mcs.anl.gov> On 7/21/08 6:08 PM, Ioan Raicu wrote: > > > Zhao Zhang wrote: >> In this test case, it actually worked. I talked with Mike, and we >> don't quite understand these 2 things. So I sent them out. >> >> After that I started another test. Running, swift on Login Node, >> falkon service on IO node, and BGexec on CN. 
>> At the very end of the service log, I got his: >> 847.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 288 512 >> 848.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 288 512 >> 849.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 287 512 >> 850.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 287 512 >> 851.985 2 2 25 256 256 0 0 0 0 0 0 0 0.0 2 0 0 0 0 0 0 0 0 0.0 0.0 0 0 >> 100 512 287 512 \ > Right, it can't deliver the 2 tasks, as there would have been a 2 before > the 0.0 in the middle. >> >> This means that we are still suffering the endpoint problem, right? > Right! > > You might want to put some debug statements in the Falkon provider to > print the end point IP address, to make sure it is the one you are > expecting. that debug logging is there, but not sure if or where its getting logged: In src/cog/modules/provider-deef/src/org/globus/cog/abstraction/impl/execution/deef/ResourcePool.java the changed code tries to log as follows: public static String getMachNamePort(Notification userNot){ //String machIP = VDL2Config.getIP(); String machIP = CoGProperties.getDefault().getIPAddress(); String machNamePort = new String (machIP + ":" + userNot.recvPort); logger.debug("WORKER: Machine ID = " + machNamePort); return machNamePort; } Zhao, did you see "WORKER: Machine ID = " in your swift log? - Mike > Ioan >> >> And from swift stdout, >> zzhang at login6.surveyor:~/swift/etc> swift -sites.file ./sites.xml >> -tc.file ./tc.data -ip.address 172.17.3.16 first.swift >> Unable to find required classes (javax.activation.DataHandler and >> javax.mail.internet.MimeMultipart). Attachment support is disabled. 
>> Swift svn swift-r2140 cog-r2070 >> >> RunID: 20080721-1748-m9d39dg9 >> Progress: >> echo started >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> Progress: Executing:1 >> >> Swift kept waiting, which mean the -ip.address doesn't work as we >> expexted. >> >> zhao >> >> Ioan Raicu wrote: >>> So Zhao, did it actually work, but you got those two errors and >>> wanted to know what the errors were? If things worked as expected, >>> then you should be fine, you can ignore both of those errors (I >>> think). If things didn't work as expected, then we need to dig >>> deeper to find out why. >>> >>> Ioan >>> >>> Mihael Hategan wrote: >>>> On Mon, 2008-07-21 at 17:38 -0500, Ioan Raicu wrote: >>>> >>>>>> >>>>> This error should just be a warning... as it tries a different port >>>>> until it finds a good one. It should only print an error when it >>>>> gives up. So, that is not your problem Zhao, especially as you seem >>>>> to have run OK, right? Line 11: Final status: Finished >>>>> successfully:1/ >>>>> >>>> >>>> Yep. Sorry. Spoke without knowing. >>>> >>>> >>>> >>>> >>> >>> -- >>> =================================================== >>> Ioan Raicu >>> Ph.D. Candidate >>> =================================================== >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 
58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> =================================================== >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dev.globus.org/wiki/Incubator/Falkon >>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >>> =================================================== >>> =================================================== >>> >>> >> > From wilde at mcs.anl.gov Mon Jul 21 18:45:24 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 21 Jul 2008 18:45:24 -0500 Subject: [Swift-user] Re: Using > 1 CPU per compute node under GRAM In-Reply-To: References: <48851775.5000604@mcs.anl.gov> Message-ID: <48851F94.6010509@mcs.anl.gov> Thanks, JP. I'll forward this to the TeraGrid Help Desk and report back to this list. - Mike On 7/21/08 6:28 PM, JP Navarro wrote: > It's definitely subject to local resource manager/scheduling policy > configuration. > At UC/ANL, for example, there's an explicit policy that says 1 job per > node. Each > job can of course run 1-n processes that share the 2 processors. There's > nothing > gram can do to get around that policy. > > You'll need to ask NCSA whether their policies allow multiple jobs on > one node. > If Abe allows only one job per node, then it's up to your one job to > spawn off > enough processes/threads to use the 8 cores. > > JP > > On Jul 21, 2008, at 6:10 PM, Michael Wilde wrote: > >> Im asking this on behalf of Mike Kubal while I wait for more info on >> his settings: >> >> Mike is running under Swift on teragrid/Abe which has 8-core nodes. >> His jobs are all running 1-job-per-node, wasting 7 cores. >> >> I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM. >> >> In the meantime, does anyone know if there's a way to specify >> compute-node-sharing between separate single-cpu jobs via both GRAMs? >> >> And if this is dependent on the local job manager code or settings? >> (Ie might work on some sites but not others)? 
>> >> On globus doc page: >> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes >> >> >> I see: >> >> ... >> >> >> but cant tell if this applies to single-core jobs or only to >> multi-core jobs. >> >> This will ideally be handled as desired by Falkon or Coaster, but in >> the meantime I was hoping there was a simple setting to give MikeK >> better CPU yield on Abe. >> >> - Mike Wilde >> >> --- >> >> A sample of one of his jobs looks like this under qstat -ef: >> >> Job Id: 395980.abem5.ncsa.uiuc.edu >> Job_Name = STDIN >> Job_Owner = mkubal at abe1196 >> job_state = Q >> queue = normal >> server = abem5.ncsa.uiuc.edu >> Account_Name = onm >> Checkpoint = u >> ctime = Mon Jul 21 17:43:47 2008 >> Error_Path = abe1196:/dev/null >> Hold_Types = n >> Join_Path = n >> Keep_Files = n >> Mail_Points = n >> mtime = Mon Jul 21 17:43:47 2008 >> Output_Path = abe1196:/dev/null >> Priority = 0 >> qtime = Mon Jul 21 17:43:47 2008 >> Rerunable = True >> Resource_List.ncpus = 1 >> Resource_List.nodect = 1 >> Resource_List.nodes = 1 >> Resource_List.walltime = 00:10:00 >> Shell_Path_List = /bin/sh >> etime = Mon Jul 21 17:43:47 2008 >> submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN >> >> And his jobs show up like this under qstat -n (ie are all on core /0 ): >> >> 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1 >> -- 00:10 R -- >> abe0872/0 >> >> While multi-core jobs use >> >> +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4 >> +abe0579/3+abe0579/2+abe0579/1+abe0579/0 > From iraicu at cs.uchicago.edu Mon Jul 21 18:57:26 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 21 Jul 2008 18:57:26 -0500 Subject: [Swift-user] Using > 1 CPU per compute node under GRAM In-Reply-To: <48851775.5000604@mcs.anl.gov> References: <48851775.5000604@mcs.anl.gov> Message-ID: <48852266.2000502@cs.uchicago.edu> In the past (i.e. 
MolDyn), I don't think we ever found a easy solution to this when running straight through GRAM (if the LRM didn't support this policy). But, as JP said, it is site specific, so some sites will allow getting only 1 CPU per node, such as Teraport, in which case GRAM should work just fine. Ioan Michael Wilde wrote: > Im asking this on behalf of Mike Kubal while I wait for more info on > his settings: > > Mike is running under Swift on teragrid/Abe which has 8-core nodes. > His jobs are all running 1-job-per-node, wasting 7 cores. > > I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM. > > In the meantime, does anyone know if there's a way to specify > compute-node-sharing between separate single-cpu jobs via both GRAMs? > > And if this is dependent on the local job manager code or settings? > (Ie might work on some sites but not others)? > > On globus doc page: > http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes > > > I see: > > ... > > > but cant tell if this applies to single-core jobs or only to > multi-core jobs. > > This will ideally be handled as desired by Falkon or Coaster, but in > the meantime I was hoping there was a simple setting to give MikeK > better CPU yield on Abe. 
> > - Mike Wilde > > --- > > A sample of one of his jobs looks like this under qstat -ef: > > Job Id: 395980.abem5.ncsa.uiuc.edu > Job_Name = STDIN > Job_Owner = mkubal at abe1196 > job_state = Q > queue = normal > server = abem5.ncsa.uiuc.edu > Account_Name = onm > Checkpoint = u > ctime = Mon Jul 21 17:43:47 2008 > Error_Path = abe1196:/dev/null > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Mon Jul 21 17:43:47 2008 > Output_Path = abe1196:/dev/null > Priority = 0 > qtime = Mon Jul 21 17:43:47 2008 > Rerunable = True > Resource_List.ncpus = 1 > Resource_List.nodect = 1 > Resource_List.nodes = 1 > Resource_List.walltime = 00:10:00 > Shell_Path_List = /bin/sh > etime = Mon Jul 21 17:43:47 2008 > submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN > > And his jobs show up like this under qstat -n (ie are all on core /0 ): > > 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1 > -- 00:10 R -- > abe0872/0 > > While multi-core jobs use > > +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4 > +abe0579/3+abe0579/2+abe0579/1+abe0579/0 > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From mikekubal at yahoo.com Mon Jul 21 22:42:22 2008 From: mikekubal at yahoo.com (Mike Kubal) Date: Mon, 21 Jul 2008 20:42:22 -0700 (PDT) Subject: [Swift-user] Using > 1 CPU per compute node under GRAM In-Reply-To: <48852266.2000502@cs.uchicago.edu> Message-ID: <510897.59042.qm@web52303.mail.re2.yahoo.com> I'm using pre-WS-GRAM. MikeK --- On Mon, 7/21/08, Ioan Raicu wrote: From: Ioan Raicu Subject: Re: [Swift-user] Using > 1 CPU per compute node under GRAM To: "Michael Wilde" Cc: "Swift User Discussion List" , "Stu Martin" , "Martin Feller" , "JP Navarro" , "Mike Kubal" Date: Monday, July 21, 2008, 6:57 PM In the past (i.e. MolDyn), I don't think we ever found a easy solution to this when running straight through GRAM (if the LRM didn't support this policy). But, as JP said, it is site specific, so some sites will allow getting only 1 CPU per node, such as Teraport, in which case GRAM should work just fine. Ioan Michael Wilde wrote: > Im asking this on behalf of Mike Kubal while I wait for more info on > his settings: > > Mike is running under Swift on teragrid/Abe which has 8-core nodes. > His jobs are all running 1-job-per-node, wasting 7 cores. > > I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM. > > In the meantime, does anyone know if there's a way to specify > compute-node-sharing between separate single-cpu jobs via both GRAMs? > > And if this is dependent on the local job manager code or settings? > (Ie might work on some sites but not others)? 
> > On globus doc page: > http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes > > > I see: > > ... > > > but cant tell if this applies to single-core jobs or only to > multi-core jobs. > > This will ideally be handled as desired by Falkon or Coaster, but in > the meantime I was hoping there was a simple setting to give MikeK > better CPU yield on Abe. > > - Mike Wilde > > --- > > A sample of one of his jobs looks like this under qstat -ef: > > Job Id: 395980.abem5.ncsa.uiuc.edu > Job_Name = STDIN > Job_Owner = mkubal at abe1196 > job_state = Q > queue = normal > server = abem5.ncsa.uiuc.edu > Account_Name = onm > Checkpoint = u > ctime = Mon Jul 21 17:43:47 2008 > Error_Path = abe1196:/dev/null > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Mon Jul 21 17:43:47 2008 > Output_Path = abe1196:/dev/null > Priority = 0 > qtime = Mon Jul 21 17:43:47 2008 > Rerunable = True > Resource_List.ncpus = 1 > Resource_List.nodect = 1 > Resource_List.nodes = 1 > Resource_List.walltime = 00:10:00 > Shell_Path_List = /bin/sh > etime = Mon Jul 21 17:43:47 2008 > submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN > > And his jobs show up like this under qstat -n (ie are all on core /0 ): > > 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 1 > -- 00:10 R -- > abe0872/0 > > While multi-core jobs use > > +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4 > +abe0579/3+abe0579/2+abe0579/1+abe0579/0 > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From navarro at mcs.anl.gov Mon Jul 21 18:28:28 2008 From: navarro at mcs.anl.gov (JP Navarro) Date: Mon, 21 Jul 2008 18:28:28 -0500 Subject: [Swift-user] Re: Using > 1 CPU per compute node under GRAM In-Reply-To: <48851775.5000604@mcs.anl.gov> References: <48851775.5000604@mcs.anl.gov> Message-ID: It's definitely subject to local resource manager/scheduling policy configuration. At UC/ANL, for example, there's an explicit policy that says 1 job per node. Each job can of course run 1-n processes that share the 2 processors. There's nothing gram can do to get around that policy. You'll need to ask NCSA whether their policies allow multiple jobs on one node. If Abe allows only one job per node, then it's up to your one job to spawn off enough processes/threads to use the 8 cores. JP On Jul 21, 2008, at 6:10 PM, Michael Wilde wrote: > Im asking this on behalf of Mike Kubal while I wait for more info on > his settings: > > Mike is running under Swift on teragrid/Abe which has 8-core nodes. > His jobs are all running 1-job-per-node, wasting 7 cores. > > I am waiting to hear if he is running on WS-GRAM or pre-WS-GRAM. > > In the meantime, does anyone know if there's a way to specify > compute-node-sharing between separate single-cpu jobs via both GRAMs? > > And if this is dependent on the local job manager code or settings? > (Ie might work on some sites but not others)? 
> > On globus doc page: > http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html#r-wsgram-extensions-constructs-nodes > > I see: > > ... > > > but cant tell if this applies to single-core jobs or only to multi- > core jobs. > > This will ideally be handled as desired by Falkon or Coaster, but in > the meantime I was hoping there was a simple setting to give MikeK > better CPU yield on Abe. > > - Mike Wilde > > --- > > A sample of one of his jobs looks like this under qstat -ef: > > Job Id: 395980.abem5.ncsa.uiuc.edu > Job_Name = STDIN > Job_Owner = mkubal at abe1196 > job_state = Q > queue = normal > server = abem5.ncsa.uiuc.edu > Account_Name = onm > Checkpoint = u > ctime = Mon Jul 21 17:43:47 2008 > Error_Path = abe1196:/dev/null > Hold_Types = n > Join_Path = n > Keep_Files = n > Mail_Points = n > mtime = Mon Jul 21 17:43:47 2008 > Output_Path = abe1196:/dev/null > Priority = 0 > qtime = Mon Jul 21 17:43:47 2008 > Rerunable = True > Resource_List.ncpus = 1 > Resource_List.nodect = 1 > Resource_List.nodes = 1 > Resource_List.walltime = 00:10:00 > Shell_Path_List = /bin/sh > etime = Mon Jul 21 17:43:47 2008 > submit_args = -A onm /tmp/.pbs_mkubal_21430/STDIN > > And his jobs show up like this under qstat -n (ie are all on core / > 0 ): > > 395653.abem5.ncsa.ui mkubal normal STDIN 1767 1 > 1 -- 00:10 R -- > abe0872/0 > > While multi-core jobs use > > +abe0582/2+abe0582/1+abe0582/0+abe0579/7+abe0579/6+abe0579/5+abe0579/4 > +abe0579/3+abe0579/2+abe0579/1+abe0579/0 From benc at hawaga.org.uk Tue Jul 22 02:14:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 22 Jul 2008 07:14:06 +0000 (GMT) Subject: [Swift-user] Help needed with batching up parallel runs In-Reply-To: References: Message-ID: Use clustering. Read the docs mike linked to. Basically you need to specify a maxwalltime for the jobs you want clustered, and then a clustering time that is some multiple (eg 10 in your case). 
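
As a rough illustration of the numbers involved, 300 runs in groups of 10 give 30 batches, and the clustering time is ten times the per-job maxwalltime. The sketch below is plain Python rather than SwiftScript. The 60-second per-job wall time is a made-up value, and treating the clustering time as simply batch size times maxwalltime is a simplifying assumption, not a description of how Swift's clustering is implemented.

```python
def plan_batches(total_runs, batch_size):
    """Return zero-based, inclusive (first, last) index ranges covering all runs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ranges = []
    for first in range(0, total_runs, batch_size):
        # The last batch may be shorter when total_runs is not a multiple.
        last = min(first + batch_size, total_runs) - 1
        ranges.append((first, last))
    return ranges


def clustering_time(max_wall_time_s, jobs_per_cluster):
    """Assumed relation: a cluster lasts as long as its jobs run back to back."""
    return max_wall_time_s * jobs_per_cluster


batches = plan_batches(300, 10)
print(len(batches), batches[0], batches[-1])  # prints: 30 (0, 9) (290, 299)
print(clustering_time(60, 10))                # prints: 600 (60 s is hypothetical)
```

Note that the ranges start at index 0, so a loop over batch numbers 0..29 covers indices 0..299 exactly.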

You might try coasters if you are submitting using GT2 (there is something wrong with gt4 + coasters at the moment that prevents them being used together).

On Mon, 21 Jul 2008, Tiberiu Stef-Praun wrote:

> Hi
>
> I work with some code that generates at some point a number (300 in my
> case) of parallel identical runs, and I need to batch those up (10 at
> a time in my case) because each individual run is too short.
> I don't want Falkon at this point, and I'm not sure about the status
> of the coaster provider, so I would prefer a clean swift solution
> I was thinking of some array manipulation, but it was not obvious how
> to do it with swift.
>
> Thanks !
> Tibi
>
> Here is the code that I have so far, and I need help for:
>
>
>
> //this is the code that batches a number of runs: based on the size of
> the array (determined where I make the call), I will return the set of
> parallel run results
> (file simFile[])gj_batch_sim(file policyFile, file logFile){
>     app{
>         gj_batch_sim @filename(policyFile) @filename(logFile)
> @filenames(simFile);
>     }
> }
>
> int parallelInstances=300;
> file simOutputs[];
>
> (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){
>     // this is just some needed input
>     file logFile;
>
>     // I want to have batches of size 10
>     int localBatchSize=10;
>
>     int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*")
>     trace("Times to do batch_gj_batch_sim",batchRange);
>
>     foreach i in [1:batchRange] {
>         // HELP HERE: how to do this ?
>         // essentially I need to map the proper batch of file
> names into the call of gj_batch_sim
>
>         simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile,
> logFile);
>     }
> }
>
>
>
>

From benc at hawaga.org.uk Tue Jul 22 02:19:57 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 07:19:57 +0000 (GMT)
Subject: [Swift-user] Using > 1 CPU per compute node under GRAM
In-Reply-To: <48851775.5000604@mcs.anl.gov>
References: <48851775.5000604@mcs.anl.gov>
Message-ID:

On Mon, 21 Jul 2008, Michael Wilde wrote:

> In the meantime, does anyone know if there's a way to specify
> compute-node-sharing between separate single-cpu jobs via both GRAMs?
>
> And if this is dependent on the local job manager code or settings? (Ie might
> work on some sites but not others)?

You can specify via GRAM RSL; however at least TGUC deliberately does not allow that - one job gets an entire node. I imagine other sites are similar.

Coasters should allow this to be done, by running two coaster workers on one node. I plan to look at doing that sometime.

> This will ideally be handled as desired by Falkon or Coaster, but in the
> meantime I was hoping there was a simple setting to give MikeK better
> CPU yield on Abe.

There isn't. He and I have investigated this before, I think.

I've just put this in the swift bugzilla as bug 150.

--

From benc at hawaga.org.uk Tue Jul 22 03:48:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 08:48:55 +0000 (GMT)
Subject: [Swift-user] swift + mpi
Message-ID:

I added a note to the swift user guide a couple of weeks ago about how to run MPI jobs in Swift:

http://www.ci.uchicago.edu/swift/guides/userguide.php#tips.mpi

This is based on some playing around by Andriy Fedorov and myself.
--

From wilde at mcs.anl.gov Tue Jul 22 11:21:20 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 22 Jul 2008 11:21:20 -0500
Subject: [Swift-user] Problem with scope and writability for complex types
Message-ID: <48860900.7040805@mcs.anl.gov>

This script compiles:

type file;

file fphr[];
file fpin[];
file fpsq[];

(file phr, file pin, file psq) formatdb (file input) {
  app {
    formatdb "-i" @input ;
  }
}

file inputs[] ;

foreach f, i in inputs {
  (fphr[i], fpin[i], fpsq[i]) = formatdb(f);
}

---

while this script gives a compile-time error:

 1 type file;
 2
 3 type aux {
 4   file phr;
 5   file pin;
 6   file psq;
 7 };
 8
 9 (file phr, file pin, file psq) formatdb (file input) {
10   app {
11     formatdb "-i" @input ;
12   }
13 }
14
15 file inputs[] ;
16 aux a[];
17
18 foreach f, i in inputs {
19   (a[i].phr, a[i].pin, a[i].psq) = formatdb(f);
20 }

---

error is:

Could not start execution.
Compile error in foreach statement at line 18: Compile error in procedure invocation at line 19: variable a is not writeable in this scope

---

It seems that the second script should be valid. Both set global variables from within a foreach() in global scope.

When the variable is of complex type "array of file", the indexed elements seem to be writable. When the variable is of complex type "array of struct of file", the indexed struct fields seem not to be writable.

From benc at hawaga.org.uk Tue Jul 22 11:38:47 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 22 Jul 2008 16:38:47 +0000 (GMT)
Subject: [Swift-user] Problem with scope and writability for complex types
In-Reply-To: <48860900.7040805@mcs.anl.gov>
References: <48860900.7040805@mcs.anl.gov>
Message-ID:

On Tue, 22 Jul 2008, Michael Wilde wrote:

> It seems that the second script should be valid. Both set global variables
> from within a foreach() in global scope.

yes, I think that is correct. seems to be a bug.

(the handling of write-once semantics at compile time for SwiftScript is kinda hard because this array syntax doesn't look like write-once...)

--

From zhengxiongh at uchicago.edu Tue Jul 22 18:33:32 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Tue, 22 Jul 2008 18:33:32 -0500 (CDT)
Subject: [Swift-user] How to transmit data dynamically on Grid
Message-ID: <20080722183332.BIW75329@m4500-01.uchicago.edu>

Hi,

I'm using Swift to execute application jobs on the OSG grid sites. In the sites.xml file, if the jobmanager is not "fork", e.g. url="abitibi.sbgrid.org/jobmanager-condor", the job is usually executed on a local computing node, which is not the "Gateway node" of the grid site. But when executing the job, in the executable command, such as a wrapper script "rundock", I want to dynamically transmit the input data files from CI to the remote grid site by "globus-url-copy", e.g.

(globus-url-copy gsiftp://communicado.ci.uchicago.edu$ligpath file://$work/$ligfile)

and transmit the results data from the remote grid site to the CI machine, e.g.

(globus-url-copy file://$work/result.tar.gz gsiftp://communicado.ci.uchicago.edu/home/houzx/dock-run/databases/results/abitibi.sbgrid.org-$ligfile-result.tar.gz)

The problem is that the executing computing node is not connected to the outside network, so the "globus-url-copy" fails! It succeeds only with "jobmanager-fork", because then the job is executed on the Gateway node of the Grid site.

The user may want to use "jobmanager-condor" to execute the jobs. At the same time, according to the grid sites dynamically selected by Swift, they also want to transmit the input and results data dynamically and automatically by "jobmanager-fork", because it is troublesome to "globus-url-copy" the input and results data to the remote grid sites manually if there are large amounts of data files.

So, the question is how to implement it in Swift? Maybe it's a common problem, but I didn't find it in the documents.

Thanks,
Zhengxiong

From wilde at mcs.anl.gov Tue Jul 22 22:16:48 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 22 Jul 2008 22:16:48 -0500
Subject: [Swift-user] How to transmit data dynamically on Grid
In-Reply-To: <20080722183332.BIW75329@m4500-01.uchicago.edu>
References: <20080722183332.BIW75329@m4500-01.uchicago.edu>
Message-ID: <4886A2A0.9010604@mcs.anl.gov>

Zhengxiong,

By default, Swift automatically moves your data from a directory on the submit host (the host on which you run the swift command) to a shared directory on the execution site, where it's accessed by your job, running on a worker node in the remote cluster.

This is explained in the User Guide intro:
http://www.ci.uchicago.edu/swift/guides/userguide.php

"SwiftScript programs are dataflow oriented - they are primarily concerned with processing (possibly large) data files, by invoking programs to do that processing. Swift handles execution of such programs on remote sites by choosing sites, handling the staging of input and output files to and from the chosen sites and remote execution of program code".

Staging is detailed in section 8, "Invoking an Application from Swift":
http://www.ci.uchicago.edu/swift/guides/userguide.php#id2931120

I think the example shell script you are looking at, "rundock", is misleading you, because it was written to run under Falkon without Swift, and hence does some staging between the cluster's shared filesystem and local worker-node directories.

I would start by dividing the files that DOCK uses into two categories:

1) files that you will declare as inputs or outputs of Swift atomic procedures, which you should let Swift stage in and out automatically; and

2) files which can be considered part of the application's install directory (which can stay on each cluster's shared filesystem with the application code, or which can be shipped to each site in a preparation stage).

In addition, Swift will, within the execution of a script, avoid staging a file in twice, if it can. The users guide explains this under the property "caching.algorithm":

"Swift caches files that are staged in on remote resources, and files that are produced remotely by applications, such that they can be re-used if needed without being transfered again. However, the amount of remote file system space to be used for caching can be limited using the swift:storagesize profile entry in the sites.xml file."

So you could let Swift bring in even large files for you, to the shared filesystem, and your application wrapper script can cache these in a persistent application directory on the worker node.

In rundock, you could use this approach by declaring the "receptor" protein molecule files (grid files and "selected spheres") as Swift inputs, and let swift bring them to the grid site for you.

Lastly, see this note in the Environment Variables section of the users guide:

"SWIFT_JOBDIR_PATH - set in env namespace profiles. If set, then Swift will use the path specified here as a worker-node local temporary directory to copy input files to before running a job. If unset, Swift will keep input files on the site-shared filesystem. In some cases, copying to a worker-node local directory can be much faster than having applications access the site-shared filesystem directly."

You can achieve the same effect of copying data to the local worker node disk by doing so explicitly in your application wrapper script ("rundock" in your case).

If you know that you will be running many applications consecutively on the same worker nodes, eg because you are using Coaster or Falkon, then you can do what rundock does on the BG/P, and cache data in a local directory *between* jobs. But, like rundock, you need to be careful to avoid races between multiple jobs on the same node, and must ensure that you can always get your data from the shared filesystem when it's not already cached there.
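
A minimal sketch of such a lock-protected worker-node cache, written as a shell function that a wrapper script could call. This is not the actual rundock code: the function name cache_input, the 30-second wait limit, and the use of mkdir as an atomic lock are all assumptions made for illustration. If the cache cannot be used for any reason, the function falls back to the copy on the shared filesystem.

```shell
#!/bin/bash
# Hypothetical helper (not taken from rundock): copy a file from the
# site-shared filesystem into a node-local cache directory, guarding the
# copy with a lock so that concurrent jobs on the same node do not race.
# Prints the path the job should read from.
cache_input() {
    local shared_file cache_dir cached lock tries
    shared_file="$1"        # file on the site-shared filesystem
    cache_dir="$2"          # node-local cache directory
    cached="$cache_dir/$(basename "$shared_file")"
    lock="$cached.lock"
    tries=0

    # If the cache directory cannot be created, read from shared storage.
    mkdir -p "$cache_dir" 2>/dev/null || { echo "$shared_file"; return 0; }

    while [ ! -f "$cached" ]; do
        # mkdir either creates the lock directory or fails, atomically,
        # so only one job at a time gets past this test.
        if mkdir "$lock" 2>/dev/null; then
            # Copy under a temporary name, then rename. The rename is
            # atomic, so other jobs never see a partially copied file.
            if cp "$shared_file" "$cached.tmp.$$" &&
               mv "$cached.tmp.$$" "$cached"; then
                rmdir "$lock"
                break
            fi
            # Copy failed: clean up and fall back to the shared copy.
            rm -f "$cached.tmp.$$"
            rmdir "$lock"
            echo "$shared_file"
            return 0
        fi
        # Another job holds the lock. Wait a little; if the lock looks
        # stale (holder was killed), give up and use the shared copy.
        tries=$((tries + 1))
        if [ "$tries" -ge 30 ]; then
            echo "$shared_file"
            return 0
        fi
        sleep 1
    done
    echo "$cached"
}
```

In a wrapper script this might be used as receptor=$(cache_input "$shared_dir/receptor.grid" /tmp/dock-cache), where both paths are placeholders. A stale lock directory is never removed here; a real script would need a policy for that.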
Bash functions in rundock have the locking logic to do this. Caching data that will be read by many jobs on the worker node disk makes sense for the receptor files, as each of these will be read by 15K jobs.

So there are actually several ways in which to manage your data. Let's work out some of these cases, and then document them in the users guide for future users, with examples.

- Mike

On 7/22/08 6:33 PM, Zhengxiong Hou wrote:
> Hi,
> I'm using Swift to execute application jobs on the
> OSG grid sites.
> In the sites.xml file, if the jobmanager is not "fork",
> e.g. url="abitibi.sbgrid.org/jobmanager-condor".
> The job is usually executed on a local computing node,
> which is not the "Gateway node" of the grid site.
> But when executing the job, in the executable command,
> such as a wrapper script "rundock", I want to dynamically
> transmit the input data files from CI to the remote grid
> site by "globus-url-copy". e.g. (
> globus-url-copy gsiftp://communicado.ci.uchicago.edu$ligpath
> file://$work/$ligfile)
> And transmit the results data from remote grid site to CI
> machine, e.g. (globus-url-copy file://$work/result.tar.gz
> gsiftp://communicado.ci.uchicago.edu/home/houzx/dock-run/databases/results/abitibi.sbgrid.org-$ligfile-result.tar.gz)
> The problem is that, the executing computing node is not
> connected to the outside network, So the "globus-url-copy"
> fails! Only using "jobmanager-fork", can it succeed, because
> the job is executed on the Gateway node of the Grid site.
>
> The user may want to use the "jobmanager-condor" to
> execute the jobs. At the same time, according to the
> dynamically selected grid sites of Swift, they also want to
> transmit the input and results data dynamically and
> automatically by "jobmanager-fork". Because it is
> troublesome to "globus-url-copy" the input and results data
> to the remote grid sites manually, if there are large
> amounts of data files.
>
> So, the question is how to implement it in Swift?
Maybe > it's a common problem, but I didn't find it in the documents. > > Thanks, > Zhengxiong > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From tiberius at ci.uchicago.edu Wed Jul 23 15:37:10 2008 From: tiberius at ci.uchicago.edu (Tiberiu Stef-Praun) Date: Wed, 23 Jul 2008 15:37:10 -0500 Subject: [Swift-user] Help needed with batching up parallel runs In-Reply-To: References: Message-ID: Hi, thanks, I forgot about that. I tried running it on teraport, and it failed. The log is here: http://www.ci.uchicago.edu/~tiberius/issues/gj-batched-20080723-1522-fypk29g6.log Mihael suggested that I should file a bug report, but I'm not sure what to report (other than the failure) Tibi On Tue, Jul 22, 2008 at 2:14 AM, Ben Clifford wrote: > > Use clustering. Read the docs mike linked to. Basically you need to > specify a maxwalltime for the jobs you want clustered, and then a > clustering time that is some multiple (eg 10 in your case). > > You might try coasters if you are submitting using GT2 (there is something > wrong with gt4 + coasters at the moment that prevents them being used > together). > > On Mon, 21 Jul 2008, Tiberiu Stef-Praun wrote: > >> Hi >> >> I work with some code that generates at some point a number (300 in my >> case) of parallel identical runs, and I need to batch those up (10 at >> a time in my case) because each individual run is too short. >> I don't want Falkon at this point, and I'm not sure about the status >> of the coaster provider, so I would prefer a clean swift solution >> I was thinking of some array manipulation, but it was not obvious how >> to do it with swift. >> >> Thanks ! 
>> Tibi >> >> Here is the code that I have so far, and I need help for: >> >> >> >> //this is the code that batches a number of runs: based on the size of >> the array (determined where I make the call), I will return the set of >> parallel run results >> (file simFile[])gj_batch_sim(file policyFile, file logFile){ >> app{ >> gj_batch_sim @filename(policyFile) @filename(logFile) >> @filenames(simFile); >> } >> } >> >> int parallelInstances=300; >> file simOutputs[]; >> >> (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){ >> // this is just some needed input >> file logFile; >> >> // I want to have batches of size 10 >> int localBatchSize=10; >> >> int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*") >> trace("Times to do batch_gj_batch_sim",batchRange); >> >> foreach i in [1:batchRange] { >> // HELP HERE: how to do this ? >> // essentially I need to map the proper batch of file >> names into the call of gj_batch_sim >> >> simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile, >> logFile); >> } >> } >> >> >> >> > -- Tiberiu (Tibi) Stef-Praun, PhD Computational Sciences Researcher Computation Institute 5640 S. Ellis Ave, #405 University of Chicago http://www-unix.mcs.anl.gov/~tiberius/ From benc at hawaga.org.uk Sun Jul 27 06:03:02 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 27 Jul 2008 11:03:02 +0000 (GMT) Subject: [Swift-user] Help needed with batching up parallel runs In-Reply-To: References: Message-ID: If you're using GRAM4 to submit, then it looks like you are hitting a bug that I fixed a week or so ago, cog svn r2066, which deals with the way that walltimes are formatted. On Wed, 23 Jul 2008, Tiberiu Stef-Praun wrote: > Hi, thanks, I forgot about that. > I tried running it on teraport, and it failed. 
> The log is here: > http://www.ci.uchicago.edu/~tiberius/issues/gj-batched-20080723-1522-fypk29g6.log > > Mihael suggested that I should file a bug report, but I'm not sure > what to report (other than the failure) > > Tibi > > On Tue, Jul 22, 2008 at 2:14 AM, Ben Clifford wrote: > > > > Use clustering. Read the docs mike linked to. Basically you need to > > specify a maxwalltime for the jobs you want clustered, and then a > > clustering time that is some multiple (eg 10 in your case). > > > > You might try coasters if you are submitting using GT2 (there is something > > wrong with gt4 + coasters at the moment that prevents them being used > > together). > > > > On Mon, 21 Jul 2008, Tiberiu Stef-Praun wrote: > > > >> Hi > >> > >> I work with some code that generates at some point a number (300 in my > >> case) of parallel identical runs, and I need to batch those up (10 at > >> a time in my case) because each individual run is too short. > >> I don't want Falkon at this point, and I'm not sure about the status > >> of the coaster provider, so I would prefer a clean swift solution > >> I was thinking of some array manipulation, but it was not obvious how > >> to do it with swift. > >> > >> Thanks ! 
> >> Tibi
> >>
> >> Here is the code that I have so far, and I need help for:
> >>
> >>
> >>
> >> //this is the code that batches a number of runs: based on the size of
> >> the array (determined where I make the call), I will return the set of
> >> parallel run results
> >> (file simFile[])gj_batch_sim(file policyFile, file logFile){
> >>     app{
> >>         gj_batch_sim @filename(policyFile) @filename(logFile)
> >> @filenames(simFile);
> >>     }
> >> }
> >>
> >> int parallelInstances=300;
> >> file simOutputs[];
> >>
> >> (file simResults[])batch_gj_batch_sim(file policyFile, int parallelInstances){
> >>     // this is just some needed input
> >>     file logFile;
> >>
> >>     // I want to have batches of size 10
> >>     int localBatchSize=10;
> >>
> >>     int batchRange=@toint(@strcut(@strcat(parallelInstances/localBatchSize),"([0-9]*).?[0-9]*")
> >>     trace("Times to do batch_gj_batch_sim",batchRange);
> >>
> >>     foreach i in [1:batchRange] {
> >>         // HELP HERE: how to do this ?
> >>         // essentially I need to map the proper batch of file
> >> names into the call of gj_batch_sim
> >>
> >>         simResults[batchSize*i:batchSize*(i+1)-1]=gj_batch_sim(policyFile,
> >> logFile);
> >>     }
> >> }
> >>
> >>
> >>
> >>
> >
> >
> >

From zhengxiongh at uchicago.edu Tue Jul 29 09:22:49 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Tue, 29 Jul 2008 09:22:49 -0500 (CDT)
Subject: [Swift-user] pegasus?
Message-ID: <20080729092249.BJC48144@m4500-01.uchicago.edu>

Hi,

Recently, I met with an error when using Swift:

[houzx at communicado results]$ swift -sites.file ./sites-20.xml -tc.file ./tc.data grid-many-dock6-auto.swift
2008.07.29 08:45:37.416 CDT: [FATAL ERROR] You forgot to set -Dpegasus.home=$PEGASUS_HOME!

[houzx at communicado dock]$ swift flipper.swift
2008.07.29 08:55:56.512 CDT: [FATAL ERROR] You forgot to set -Dpegasus.home=$PEGASUS_HOME!

Swift did NOT need this. Is there anything wrong with my account at CI?

[houzx at communicado dock]$ echo $PEGASUS_HOME
/soft/osg-client-1.0.0-r1/pegasus
[houzx at communicado dock]$ cd ~
[houzx at communicado ~]$ cat .soft
#
# This is your SoftEnv configuration run control file.
#
# It is used to tell SoftEnv how to customize your environment by
# setting up variables such as PATH and MANPATH. To learn more
# about this file, do a "man softenv".
#
@default
@osg
@globus-4

Thanks!
Zhengxiong

From wilde at mcs.anl.gov Tue Jul 29 09:35:05 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 29 Jul 2008 09:35:05 -0500
Subject: [Swift-user] pegasus?
In-Reply-To: <20080729092249.BJC48144@m4500-01.uchicago.edu>
References: <20080729092249.BJC48144@m4500-01.uchicago.edu>
Message-ID: <488F2A99.9080500@mcs.anl.gov>

See if you have CLASSPATH set, and have Pegasus jars in it. Then try unsetting CLASSPATH and see if the same error occurs.

The Swift command should put the correct Swift jars in the final classpath before any of your local jars, but perhaps there's some strange dynamic class interaction between the Swift version of tcdata/sites code and code that you have been experimenting with from the Pegasus release (eg get-sites etc).

- Mike

On 7/29/08 9:22 AM, Zhengxiong Hou wrote:
> Hi,
> Recently, I met with an error when using Swift:
>
> [houzx at communicado results]$ swift -sites.file ./sites-20.xml -tc.file ./tc.data grid-many-dock6-auto.swift
> 2008.07.29 08:45:37.416 CDT: [FATAL ERROR] You forgot to
> set -Dpegasus.home=$PEGASUS_HOME!
>
> [houzx at communicado dock]$ swift flipper.swift
> 2008.07.29 08:55:56.512 CDT: [FATAL ERROR] You forgot to
> set -Dpegasus.home=$PEGASUS_HOME!
>
> Swift did NOT need this. Is there anything wrong with my
> account at CI?
>
> [houzx at communicado dock]$ echo $PEGASUS_HOME
> /soft/osg-client-1.0.0-r1/pegasus
> [houzx at communicado dock]$ cd ~
> [houzx at communicado ~]$ cat .soft
> #
> # This is your SoftEnv configuration run control file.
> # > # It is used to tell SoftEnv how to customize your > environment by > # setting up variables such as PATH and MANPATH. To learn > more > # about this file, do a "man softenv". > # > @default > @osg > @globus-4 > > Thanks! > Zhengxiong > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From benc at hawaga.org.uk Tue Jul 29 09:40:48 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Jul 2008 14:40:48 +0000 (GMT) Subject: [Swift-user] pegasus? In-Reply-To: <488F2A99.9080500@mcs.anl.gov> References: <20080729092249.BJC48144@m4500-01.uchicago.edu> <488F2A99.9080500@mcs.anl.gov> Message-ID: On Tue, 29 Jul 2008, Michael Wilde wrote: > The Swift command should put the correct Swift jars in the final classpath > before any of your local jars, but perhaps there's some strange dynamic class Right. Although this was only changed around the time that we did the grid school in georgetown (?April). Versions of Swift older than that will have the originally posted problem. -- From zhengxiongh at uchicago.edu Tue Jul 29 11:13:30 2008 From: zhengxiongh at uchicago.edu (Zhengxiong Hou) Date: Tue, 29 Jul 2008 11:13:30 -0500 (CDT) Subject: [Swift-user] Illegal character Message-ID: <20080729111330.BJC64453@m4500-01.uchicago.edu> Hi Mike, Yes,you are right. If I unset CLASSPATH in .soft.cache.sh, or just mark #@osg in the .soft file, the original error disappeared. But there is a new ERROR, although swift job was finished. 
[houzx at communicado dock]$ swift flipper.swift 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on line 19 Illegal character ' 'at position 5 :Illegal character ' ' Swift 0.5 swift-r1783 cog-r1962 RunID: 20080729-1105-qbqofzya Progress: convert started 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on line 19 Illegal character ' 'at position 5 :Illegal character ' ' convert completed Final status: Finished successfully:1 Finished:1 Thanks much, zhengxiong ---- Original message ---- >Date: Tue, 29 Jul 2008 09:35:05 -0500 >From: Michael Wilde >Subject: Re: [Swift-user] pegasus? >To: Zhengxiong Hou >Cc: swift-user at ci.uchicago.edu, support at ci.uchicago.edu > >See if you have CLASSPATH set, and have Pegasus jars in it. >Then try unsetting CLASSPATH and see if the same error occurs. > >The Swift command should put the correct Swift jars in the final >classpath before any of your local jars, but perhaps there's some >strange dynamic class interaction between the Swift version of >tcdata/sites code and code that you have been experimenting with from >the Peagsus release (eg get-sites etc). > >- Mike > > >On 7/29/08 9:22 AM, Zhengxiong Hou wrote: >> Hi, >> Recently, I met with an error when using Swift: >> >> [houzx at communicado results]$ swift -sites.file ./sites- >> 20.xml -tc.file ./tc.data grid-many-dock6-auto.swift >> 2008.07.29 08:45:37.416 CDT: [FATAL ERROR] You forgot to >> set -Dpegasus.home=$PEGASUS_HOME! >> >> [houzx at communicado dock]$ swift flipper.swift >> 2008.07.29 08:55:56.512 CDT: [FATAL ERROR] You forgot to >> set -Dpegasus.home=$PEGASUS_HOME! >> >> Swift did NOT need this. Is there anything wrong with my >> account at CI? >> >> [houzx at communicado dock]$ echo $PEGASUS_HOME >> /soft/osg-client-1.0.0-r1/pegasus >> [houzx at communicado dock]$ cd ~ >> [houzx at communicado ~]$ cat .soft >> # >> # This is your SoftEnv configuration run control file. 
>> # >> # It is used to tell SoftEnv how to customize your >> environment by >> # setting up variables such as PATH and MANPATH. To learn >> more >> # about this file, do a "man softenv". >> # >> @default >> @osg >> @globus-4 >> >> Thanks! >> Zhengxiong >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From benc at hawaga.org.uk Tue Jul 29 11:21:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 29 Jul 2008 16:21:46 +0000 (GMT) Subject: [Swift-user] Illegal character In-Reply-To: <20080729111330.BJC64453@m4500-01.uchicago.edu> References: <20080729111330.BJC64453@m4500-01.uchicago.edu> Message-ID: (I removed CI support as this is not their business now) The fifth byte of the 19th line of your tc.data file is something unexpected. Type: hexdump -C tc.data (for the tc.data file that you are using) and send that output. On Tue, 29 Jul 2008, Zhengxiong Hou wrote: > Hi Mike, > Yes,you are right. > If I unset CLASSPATH in .soft.cache.sh, or just mark > #@osg in the .soft file, the original error disappeared. > > But there is a new ERROR, although swift job was finished. 
> > [houzx at communicado dock]$ swift flipper.swift > 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on > line 19 Illegal character ' 'at position 5 :Illegal > character ' ' > Swift 0.5 swift-r1783 cog-r1962 > > RunID: 20080729-1105-qbqofzya > Progress: > convert started > 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on > line 19 Illegal character ' 'at position 5 :Illegal > character ' ' > convert completed > Final status: Finished successfully:1 Finished:1 -- From zhengxiongh at uchicago.edu Tue Jul 29 12:03:16 2008 From: zhengxiongh at uchicago.edu (Zhengxiong Hou) Date: Tue, 29 Jul 2008 12:03:16 -0500 (CDT) Subject: [Swift-user] Illegal character Message-ID: <20080729120316.BJC71514@m4500-01.uchicago.edu> Hi Benc, Sorry, so the error message came from the tc.data. I just re-edit it. Maybe it is due to a "space". Now, it works normally. Thanks! zhengxiong ---- Original message ---- >Date: Tue, 29 Jul 2008 16:21:46 +0000 (GMT) >From: Ben Clifford >Subject: Re: [Swift-user] Illegal character >To: Zhengxiong Hou >Cc: Michael Wilde , swift- user at ci.uchicago.edu > > >(I removed CI support as this is not their business now) > >The fifth byte of the 19th line of your tc.data file is something >unexpected. > >Type: > >hexdump -C tc.data > >(for the tc.data file that you are using) > >and send that output. > >On Tue, 29 Jul 2008, Zhengxiong Hou wrote: > >> Hi Mike, >> Yes,you are right. >> If I unset CLASSPATH in .soft.cache.sh, or just mark >> #@osg in the .soft file, the original error disappeared. >> >> But there is a new ERROR, although swift job was finished. 
>> >> [houzx at communicado dock]$ swift flipper.swift >> 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on >> line 19 Illegal character ' 'at position 5 :Illegal >> character ' ' >> Swift 0.5 swift-r1783 cog-r1962 >> >> RunID: 20080729-1105-qbqofzya >> Progress: >> convert started >> 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on >> line 19 Illegal character ' 'at position 5 :Illegal >> character ' ' >> convert completed >> Final status: Finished successfully:1 Finished:1 > >-- From wilde at mcs.anl.gov Tue Jul 29 12:11:09 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 29 Jul 2008 12:11:09 -0500 Subject: [Swift-user] Illegal character In-Reply-To: References: <20080729111330.BJC64453@m4500-01.uchicago.edu> Message-ID: <488F4F2D.1030207@mcs.anl.gov> And I see that you're using Swift 0.5, which may not have the CLASSPATH improvements in the swift command as Ben mentioned. Ben, should the nightly builds show up on this page, and if so should local developers use those to get a recent snapshot: http://www.ci.uchicago.edu/swift/tests/tests-2008-07-13.html#packages (in other words, is that page broken, or was it never intended to be a source of nightly snapshots for download?) You can also build your own Swift release from SVN. Instructions are at: http://www.ci.uchicago.edu/swift/downloads/index.php - Mike On 7/29/08 11:21 AM, Ben Clifford wrote: > (I removed CI support as this is not their business now) > > The fifth byte of the 19th line of your tc.data file is something > unexpected. > > Type: > > hexdump -C tc.data > > (for the tc.data file that you are using) > > and send that output. > > On Tue, 29 Jul 2008, Zhengxiong Hou wrote: > >> Hi Mike, >> Yes,you are right. >> If I unset CLASSPATH in .soft.cache.sh, or just mark >> #@osg in the .soft file, the original error disappeared. >> >> But there is a new ERROR, although swift job was finished. 
>>
>> [houzx at communicado dock]$ swift flipper.swift
>> 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on line 19 Illegal character ' 'at position 5 :Illegal character ' '
>> Swift 0.5 swift-r1783 cog-r1962
>>
>> RunID: 20080729-1105-qbqofzya
>> Progress:
>> convert started
>> 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on line 19 Illegal character ' 'at position 5 :Illegal character ' '
>> convert completed
>> Final status: Finished successfully:1 Finished:1
>

From benc at hawaga.org.uk Tue Jul 29 12:26:04 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 29 Jul 2008 17:26:04 +0000 (GMT)
Subject: [Swift-user] Illegal character
In-Reply-To: <488F4F2D.1030207@mcs.anl.gov>
References: <20080729111330.BJC64453@m4500-01.uchicago.edu> <488F4F2D.1030207@mcs.anl.gov>
Message-ID:

On Tue, 29 Jul 2008, Michael Wilde wrote:

> Ben, should the nightly builds show up on this page, and if so should local
> developers use those to get a recent snapshot:
>
> http://www.ci.uchicago.edu/swift/tests/tests-2008-07-13.html#packages

It's broken, I suspect - it always had a tendency to go wrong; and from a testing perspective it has been mostly replaced by the NMI testing system.

> (in other words, is that page broken, or was it never intended to be a source
> of nightly snapshots for download?)

It was originally intended so; however, mostly I find myself preferring users to either stick with a real release or build from source like you say below, because the people using latest often seem to want features more rapidly than waiting a day, so they end up building from SVN source anyway.

> You can also build your own Swift release from SVN.
Instructions are at:
> http://www.ci.uchicago.edu/swift/downloads/index.php

--

From zhaozhang at uchicago.edu Thu Jul 31 12:41:35 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Thu, 31 Jul 2008 12:41:35 -0500
Subject: [Swift-user] swift script calling procedure
Message-ID: <4891F94F.6090004@uchicago.edu>

Hi, Mike

I am using the same structure of the swift script you used to run dock5 in April. The old file is at surveyor:/home/wilde/doc5/run01. Could someone take a look at this and point out why it failed to compile with the procedure readdata( ) ? Thanks so much.

best wishes
zhangzhao

My script is like this:

type DockOut;
type Mol2;

dock (string id, Mol2 mfile, DockOut ofile, string protein)
{
  app { rundock @id @mfile @ofile; }
}

type params {
  string idname;
  string mname;
  string oname;
  string pname;
};

doall(params pset[])
{
  foreach p in pset {
    string id=p.idname;
    Mol2 mfile=p.mname;
    DockOut ofile=p.oname;
    string protein=p.pname;
    dock(id, mfile, ofile, protein);
  }
}

// Main

params p[];
p = readdata("paramlist");
doall(p);

It failed to compile with this message:

zzhang at login6.surveyor:~/swift/etc> swift dock2.swift
Could not start execution.
Compile error in procedure invocation at line 30: Procedure readdata is not declared.

I am also attaching the paramlist file:

idname mname oname pname
0 /home/zzhang/swift_dock6/run05/000/000/run05_in.0000000.mol2 /home/zzhang/swift_dock6/run05/000/000/run05_out.0000000.tar.gz 1KQP

From wilde at mcs.anl.gov Thu Jul 31 13:21:50 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 31 Jul 2008 13:21:50 -0500
Subject: [Swift-user] Re: swift script calling procedure
In-Reply-To: <4891F94F.6090004@uchicago.edu>
References: <4891F94F.6090004@uchicago.edu>
Message-ID: <489202BE.8090606@mcs.anl.gov>

Zhao reports that changing readdata to readData() as per the User Guide compiles correctly.
Perhaps the code he tried was wrong or never worked, or perhaps the case of the function name or the case-checking rules changed between the time this last worked and now.

- Mike

On 7/31/08 12:41 PM, Zhao Zhang wrote:
> Hi, Mike
>
> I am using the same structure of the swift script you used to run dock5
> in April. The old file is at surveyor:/home/wilde/doc5/run01.
> Could someone take a look at this and point out why it failed to
> compile with the procedure readdata()? Thanks so much.
>
> best wishes
> zhangzhao
>
>
> My script is like this:
>
> type DockOut;
> type Mol2;
>
> dock (string id, Mol2 mfile, DockOut ofile, string protein)
> {
>     app { rundock @id @mfile @ofile; }
> }
>
> type params {
>     string idname;
>     string mname;
>     string oname;
>     string pname;
> };
>
> doall(params pset[])
> {
>     foreach p in pset {
>         string id=p.idname;
>         Mol2 mfile=p.mname;
>         DockOut ofile=p.oname;
>         string protein=p.pname;
>         dock(id, mfile, ofile, protein);
>     }
> }
>
> // Main
>
> params p[];
> p = readdata("paramlist");
> doall(p);
>
> It failed to compile with this message:
>
> zzhang at login6.surveyor:~/swift/etc> swift dock2.swift
> Could not start execution.
> Compile error in procedure invocation at line 30: Procedure
> readdata is not declared.
>
>
> I am also attaching the paramlist file:
>
> idname mname oname pname
> 0 /home/zzhang/swift_dock6/run05/000/000/run05_in.0000000.mol2
> /home/zzhang/swift_dock6/run05/000/000/run05_out.0000000.tar.gz 1KQP

From grog at ci.uchicago.edu Tue Jul 29 14:08:09 2008
From: grog at ci.uchicago.edu (Greg Cross)
Date: Tue, 29 Jul 2008 14:08:09 -0500
Subject: [Swift-user] Illegal character
In-Reply-To: <20080729111330.BJC64453@m4500-01.uchicago.edu>
References: <20080729111330.BJC64453@m4500-01.uchicago.edu>
Message-ID: <1CC22143-5A0D-46A8-91FD-561694CFD654@ci.uchicago.edu>

OSG sets dozens of environment variables.
Normally this is done through sourcing the setup.*sh files in $VDT_LOCATION, but softenv does the same thing automatically with any "osg" macro. Unfortunately, many variables get set that would otherwise be unnecessary. The result (obviously) isn't always desirable, so you either have to remove it as you did or have CLASSPATH and other variables defined/appended for swift (and anything else that conflicts).

-- Greg

On Tue 29 Jul 2008, at 11:13, Zhengxiong Hou wrote:

> Hi Mike,
> Yes, you are right.
> If I unset CLASSPATH in .soft.cache.sh, or just mark
> #@osg in the .soft file, the original error disappeared.
>
> But there is a new ERROR, although the swift job finished.
>
> [houzx at communicado dock]$ swift flipper.swift
> 2008.07.29 11:05:59.115 CDT: [ERROR] Parsing profiles on
> line 19 Illegal character ' 'at position 5 :Illegal
> character ' '
> Swift 0.5 swift-r1783 cog-r1962
>
> RunID: 20080729-1105-qbqofzya
> Progress:
> convert started
> 2008.07.29 11:05:59.900 CDT: [ERROR] Parsing profiles on
> line 19 Illegal character ' 'at position 5 :Illegal
> character ' '
> convert completed
> Final status: Finished successfully:1 Finished:1
>
> Thanks much,
> zhengxiong
> ---- Original message ----
>> Date: Tue, 29 Jul 2008 09:35:05 -0500
>> From: Michael Wilde
>> Subject: Re: [Swift-user] pegasus?
>> To: Zhengxiong Hou
>> Cc: swift-user at ci.uchicago.edu, support at ci.uchicago.edu
>>
>> See if you have CLASSPATH set, and have Pegasus jars in it.
>> Then try unsetting CLASSPATH and see if the same error occurs.
>>
>> The Swift command should put the correct Swift jars in the final
>> classpath before any of your local jars, but perhaps there's some
>> strange dynamic class interaction between the Swift version of
>> tcdata/sites code and code that you have been experimenting with from
>> the Pegasus release (e.g. get-sites etc.).
>>
>> - Mike
>>
>>
>> On 7/29/08 9:22 AM, Zhengxiong Hou wrote:
>>> Hi,
>>> Recently, I met with an error when using Swift:
>>>
>>> [houzx at communicado results]$ swift -sites.file ./sites-
>>> 20.xml -tc.file ./tc.data grid-many-dock6-auto.swift
>>> 2008.07.29 08:45:37.416 CDT: [FATAL ERROR] You forgot to
>>> set -Dpegasus.home=$PEGASUS_HOME!
>>>
>>> [houzx at communicado dock]$ swift flipper.swift
>>> 2008.07.29 08:55:56.512 CDT: [FATAL ERROR] You forgot to
>>> set -Dpegasus.home=$PEGASUS_HOME!
>>>
>>> Swift did NOT need this. Is there anything wrong with my
>>> account at CI?
>>>
>>> [houzx at communicado dock]$ echo $PEGASUS_HOME
>>> /soft/osg-client-1.0.0-r1/pegasus
>>> [houzx at communicado dock]$ cd ~
>>> [houzx at communicado ~]$ cat .soft
>>> #
>>> # This is your SoftEnv configuration run control file.
>>> #
>>> # It is used to tell SoftEnv how to customize your environment by
>>> # setting up variables such as PATH and MANPATH. To learn more
>>> # about this file, do a "man softenv".
>>> #
>>> @default
>>> @osg
>>> @globus-4
>>>
>>> Thanks!
>>> Zhengxiong
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user