From jamalphd at gmail.com  Mon Jun  2 18:20:46 2008
From: jamalphd at gmail.com (J A)
Date: Mon, 2 Jun 2008 19:20:46 -0400
Subject: [Swift-user] Performance of Swift
Message-ID: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>

Hi All:


Based on my reading, the performance from execution a swift workflow depends
on the parallelism that a workflow has.


If I have a workflow that contains several processors where each processor
(procedure) depends on the previous one (output of a processor "A" is the
input for processors "B" and so on.)


How the performance of using swift in this case compare to other systems
that execute workflows where there isn't any parallelism in the workflow?


-- 
Thanks,

Jamal
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20080602/ff7a94c6/attachment.html>

From hategan at mcs.anl.gov  Mon Jun  2 18:40:34 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 02 Jun 2008 18:40:34 -0500
Subject: [Swift-user] Performance of Swift
In-Reply-To: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
References: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
Message-ID: <1212450034.22473.5.camel@localhost>

The swift engine itself should be about equally fast (or equally slow,
depending on perspective) whether you have N jobs that are sequential or
N jobs that are parallel.

However, there may be scheduling constraints (such as "don't run more
than 2 parallel jobs on this site right now") that may interfere with
that.

You can measure a large portion of the Swift overhead by using something
like -dryrun on the command line. Some of it is constant overhead (i.e.
the jvm + engine startup), and some of it is nearly linear in the number
of jobs (whether parallel or sequential).

In terms of comparisons with other systems, I am not aware of any such
recent comparisons. Others may know different.

Mihael

On Mon, 2008-06-02 at 19:20 -0400, J A wrote:
> Hi All:
>  
> Based on my reading, the performance from execution a swift workflow
> depends on the parallelism that a workflow has.
> 
>  
> 
> If I have a workflow that contains several processors where each
> processor (procedure) depends on the previous one (output of a
> processor "A" is the input for processors "B" and so on.)
> 
>  
> 
> How the performance of using swift in this case compare to other
> systems that execute workflows where there isn't any parallelism in
> the workflow?
> 
>  
> 
> -- 
> Thanks,
> 
> Jamal 
> 
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From wilde at mcs.anl.gov  Mon Jun  2 18:47:44 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 02 Jun 2008 18:47:44 -0500
Subject: [Swift-user] Performance of Swift
In-Reply-To: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
References: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
Message-ID: <484486A0.5050508@mcs.anl.gov>

We dont have any recent data that I know of comparing Swift performance 
to other workflow systems.

When executing a serial workflow, performance will depend a lot on what 
job and data provider you are using.

What kind of application are you considering, and what kind of latency 
between jobs are you looking for?

What execution environment are you interested in? (GRAM, PBS, local, or 
something different?)

Whats the profile of your jobs in terms of input data size, job 
duration, output data size?

All of these will affect the performance of a serial job pipeline of the 
kind you describe.

If your procedures behave in a streaming manner (in that they start 
writing output while still reading input) then perhaps you want to run 
them as a UNIX pipeline instead of separate procedures under Swift.

- Mike


On 6/2/08 6:20 PM, J A wrote:
> Hi All:
>  
> 
> Based on my reading, the performance from execution a swift workflow 
> depends on the parallelism that a workflow has.
> 
>  
> 
> If I have a workflow that contains several processors where each 
> processor (procedure) depends on the previous one (output of a processor 
> "A" is the input for processors "B" and so on.)
> 
>  
> 
> How the performance of using swift in this case compare to other systems 
> that execute workflows where there isn't any parallelism in the workflow?
> 
>  
> 
> -- 
> Thanks,
> 
> Jamal
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
-- 
Michael Wilde
Computation Institute
University of Chicago and Argonne National Laboratory


From iraicu at cs.uchicago.edu  Tue Jun  3 11:05:57 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 03 Jun 2008 11:05:57 -0500
Subject: [Swift-user] Performance of Swift
In-Reply-To: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
References: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
Message-ID: <48456BE5.3080307@cs.uchicago.edu>

Hi,
There are several papers out there from our group that shows different 
aspects of performance of Swift.  Here are a few:

    * http://people.cs.uchicago.edu/~iraicu/publications/2008_NOVA08_book-chapter_Swift.pdf
          o Figure 9: Shows the memory footprint per job (aka tasks, or
            nodes in the DAG graph)
                + shows memory footprint of 3.2KB per node
          o Figure 18: Shows a large scale application
                + 20K tasks on 200 CPUs with average task lengths of 200
                  seconds is a comfortable range for Swift and Falkon
                + we have more recent results, not published yet, that
                  has 16K tasks on 2048 CPUs with an average task length
                  of 87 seconds which worked well
    * http://people.cs.uchicago.edu/~iraicu/publications/2007_SWF07_Swift.pdf
          o Figure 6: Shows the speedup achieved with different task lengths
                + Conclusion is that using multi-level scheduling with
                  the Falkon provider, even tasks in the range of
                  seconds long can achieve good speedup
          o Figure 7: Shows the throughput in tasks/sec achieved by Swift
                + shows Swift achieving 50+ tasks/sec throughputs using
                  Falkon
                + the paragraph right after this figure mentions that
                  Swift running directly with GRAM2 and PBS can achieve
                  2 jobs/sec; the implication of this is that jobs
                  typically take 15~60 seconds to startup, which
                  reflects the cost of scheduling, scheduling cycles,
                  and local resource manager's (LRM) time to setup the
                  remote nodes; there are also limitations on how many
                  jobs can be submitted at a time, as each job queued
                  might consume some resources on the LRM, or there
                  might be policies in place that limit the number of
                  jobs that can be queued; this means that aggressive
                  throttling must take place, which in practice, reduces
                  the sustained rate that Swift can submit/execute jobs
                  to a single site, to even lower than 2 jobs/sec

So, to answer you question, the performance of Swift (and any other 
workflow system) will heavily rely on how efficient you can dispatch 
jobs/tasks to remote resources, how long jobs/tasks are, how data 
intensive the application is, and how much data movement must happen 
before the job runs and after.  If you have a fast enough file system, 
and your application execution times are small, you can expect anywhere 
from 1 to 50 jobs/sec from Swift, depending on what technologies you use 
to interface between Swift and the remote resources (e.g. GRAM, PBS, 
Condor, Falkon, etc).

Cheers,
Ioan


J A wrote:
> Hi All:
>  
>
> Based on my reading, the performance from execution a swift workflow 
> depends on the parallelism that a workflow has.
>
>  
>
> If I have a workflow that contains several processors where each 
> processor (procedure) depends on the previous one (output of a 
> processor "A" is the input for processors "B" and so on.)
>
>  
>
> How the performance of using swift in this case compare to other 
> systems that execute workflows where there isn't any parallelism in 
> the workflow?
>
>  
>
> -- 
> Thanks,
>
> Jamal
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>   

-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20080603/73779a77/attachment.html>

From lixi at uchicago.edu  Tue Jun  3 14:24:11 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Tue,  3 Jun 2008 14:24:11 -0500 (CDT)
Subject: [Swift-user] Failed to link input file 
Message-ID: <20080603142411.BAW94307@m4500-03.uchicago.edu>

Hi,

Recently, I encountered the following error many times:
...
Host: UCSDT2
Directory: workflowtest-20080603-0934-cctuq211/jobs/2/node-
2sl4okti
stderr.txt: 

stdout.txt: 

----

Caused by:
       UCSDT2 Failed to link input file 
_concurrent/intermediatefile-272352a4-9803-4509-8f19-
fcddb7de230b-
...

In fact, it did happen before on other sites 
except "UCSDT2". Who can help me to determine what on earth 
the error is. Is something to do with my workflow itself or 
remote sites. If I want to avoid such things, what 
conditions I could judge based on when selecting sites? I 
will appreciate any suggestions.

Thanks,

Xi 


From hategan at mcs.anl.gov  Tue Jun  3 14:40:31 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Jun 2008 14:40:31 -0500
Subject: [Swift-user] Failed to link input file
In-Reply-To: <20080603142411.BAW94307@m4500-03.uchicago.edu>
References: <20080603142411.BAW94307@m4500-03.uchicago.edu>
Message-ID: <1212522031.14982.13.camel@localhost>

On Tue, 2008-06-03 at 14:24 -0500, lixi at uchicago.edu wrote:
> Hi,
> 
> Recently, I encountered the following error many times:
> ...
> Host: UCSDT2
> Directory: workflowtest-20080603-0934-cctuq211/jobs/2/node-
> 2sl4okti
> stderr.txt: 
> 
> stdout.txt: 
> 
> ----
> 
> Caused by:
>        UCSDT2 Failed to link input file 
> _concurrent/intermediatefile-272352a4-9803-4509-8f19-
> fcddb7de230b-
> ...

What's in the wrapper log for that job?


From lixi at uchicago.edu  Tue Jun  3 14:52:47 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Tue,  3 Jun 2008 14:52:47 -0500 (CDT)
Subject: [Swift-user] Failed to link input file
Message-ID: <20080603145247.BAX00142@m4500-03.uchicago.edu>

I couldn't find the wrapper log for that node. 

The log file and wrapper log are on CI:
/home/lixi/newswift/latest/score/1000/workflowtest-20080603-
1243-i380lqr6.log 
/home/lixi/newswift/latest/score/1000/workflowtest-20080603-
1243-i380lqr6.d

Thanks,

Xi

---- Original message ----
>Date: Tue, 03 Jun 2008 14:40:31 -0500
>From: Mihael Hategan <hategan at mcs.anl.gov>  
>Subject: Re: [Swift-user] Failed to link input file  
>To: lixi at uchicago.edu
>Cc: swift-user <swift-user at ci.uchicago.edu>
>
>On Tue, 2008-06-03 at 14:24 -0500, lixi at uchicago.edu wrote:
>> Hi,
>> 
>> Recently, I encountered the following error many times:
>> ...
>> Host: UCSDT2
>> Directory: workflowtest-20080603-0934-
cctuq211/jobs/2/node-
>> 2sl4okti
>> stderr.txt: 
>> 
>> stdout.txt: 
>> 
>> ----
>> 
>> Caused by:
>>        UCSDT2 Failed to link input file 
>> _concurrent/intermediatefile-272352a4-9803-4509-8f19-
>> fcddb7de230b-
>> ...
>
>What's in the wrapper log for that job?
>
>


From hategan at mcs.anl.gov  Tue Jun  3 14:58:19 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 03 Jun 2008 14:58:19 -0500
Subject: [Swift-user] Failed to link input file
In-Reply-To: <20080603145247.BAX00142@m4500-03.uchicago.edu>
References: <20080603145247.BAX00142@m4500-03.uchicago.edu>
Message-ID: <1212523099.16304.0.camel@localhost>

On Tue, 2008-06-03 at 14:52 -0500, lixi at uchicago.edu wrote:
> I couldn't find the wrapper log for that node. 

That very likely means that the shared filesystem is not working
properly on that node.

> 
> The log file and wrapper log are on CI:
> /home/lixi/newswift/latest/score/1000/workflowtest-20080603-
> 1243-i380lqr6.log 
> /home/lixi/newswift/latest/score/1000/workflowtest-20080603-
> 1243-i380lqr6.d
> 
> Thanks,
> 
> Xi
> 
> ---- Original message ----
> >Date: Tue, 03 Jun 2008 14:40:31 -0500
> >From: Mihael Hategan <hategan at mcs.anl.gov>  
> >Subject: Re: [Swift-user] Failed to link input file  
> >To: lixi at uchicago.edu
> >Cc: swift-user <swift-user at ci.uchicago.edu>
> >
> >On Tue, 2008-06-03 at 14:24 -0500, lixi at uchicago.edu wrote:
> >> Hi,
> >> 
> >> Recently, I encountered the following error many times:
> >> ...
> >> Host: UCSDT2
> >> Directory: workflowtest-20080603-0934-
> cctuq211/jobs/2/node-
> >> 2sl4okti
> >> stderr.txt: 
> >> 
> >> stdout.txt: 
> >> 
> >> ----
> >> 
> >> Caused by:
> >>        UCSDT2 Failed to link input file 
> >> _concurrent/intermediatefile-272352a4-9803-4509-8f19-
> >> fcddb7de230b-
> >> ...
> >
> >What's in the wrapper log for that job?
> >
> >


From jamalphd at gmail.com  Tue Jun  3 15:41:44 2008
From: jamalphd at gmail.com (J A)
Date: Tue, 3 Jun 2008 16:41:44 -0400
Subject: [Swift-user] Performance of Swift
In-Reply-To: <48456BE5.3080307@cs.uchicago.edu>
References: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
	<48456BE5.3080307@cs.uchicago.edu>
Message-ID: <b19f113c0806031341y37a57ce4t90ec90e54964ec00@mail.gmail.com>

Thank you all for your replies ...

I have a workflow that i developed using C code.  I am thinking of using
Swift to execute the workflow, so my thinking is that i need first to change
the code to be Swift script.

More info about my workflow:

The workflow consist of several major tasks:

Task 1:  create a 1000 uniquely strings where each string is 1000 bytes.
Task 2:  merge the strings  where every 2 strings ( A, B) will exchange a
segment of it at a certain point and produce 1 string (C) with the same
length (1000 bytes).  Then C replaces B.
Task 3:  duplicate the list of string so we will have now 2000 strings
Task 4: randomly choose 1000 strings from the current 2000 strings.
Task 5: repeat Tasks 2, 3, and 4   for N times  (N is given) and now the
list of strings used in the next iteration is the output of Task 4.


Do you think changing the whole program into Swift script is necessary or
just certain sections?  Can i just use wrappers around certain tasks and use
Swift Script to call these tasks?

Will the performance be the same?


Any suggestion will be really appreciated.

Thanks,
Jamal


On 6/3/08, Ioan Raicu <iraicu at cs.uchicago.edu> wrote:
>
> Hi,
> There are several papers out there from our group that shows different
> aspects of performance of Swift.  Here are a few:
>
>    -
>    http://people.cs.uchicago.edu/~iraicu/publications/2008_NOVA08_book-chapter_Swift.pdf<http://people.cs.uchicago.edu/%7Eiraicu/publications/2008_NOVA08_book-chapter_Swift.pdf>
>       - Figure 9: Shows the memory footprint per job (aka tasks, or nodes
>       in the DAG graph)
>          - shows memory footprint of 3.2KB per node
>       - Figure 18: Shows a large scale application
>          - 20K tasks on 200 CPUs with average task lengths of 200 seconds
>          is a comfortable range for Swift and Falkon
>          - we have more recent results, not published yet, that has 16K
>          tasks on 2048 CPUs with an average task length of 87 seconds which worked
>          well
>          -
>    http://people.cs.uchicago.edu/~iraicu/publications/2007_SWF07_Swift.pdf<http://people.cs.uchicago.edu/%7Eiraicu/publications/2007_SWF07_Swift.pdf>
>       - Figure 6: Shows the speedup achieved with different task lengths
>          - Conclusion is that using multi-level scheduling with the Falkon
>          provider, even tasks in the range of seconds long can achieve good speedup
>          - Figure 7: Shows the throughput in tasks/sec achieved by Swift
>          - shows Swift achieving 50+ tasks/sec throughputs using Falkon
>          - the paragraph right after this figure mentions that Swift
>          running directly with GRAM2 and PBS can achieve 2 jobs/sec; the implication
>          of this is that jobs typically take 15~60 seconds to startup, which reflects
>          the cost of scheduling, scheduling cycles, and local resource manager's
>          (LRM) time to setup the remote nodes; there are also limitations on how many
>          jobs can be submitted at a time, as each job queued might consume some
>          resources on the LRM, or there might be policies in place that limit the
>          number of jobs that can be queued; this means that aggressive throttling
>          must take place, which in practice, reduces the sustained rate that Swift
>          can submit/execute jobs to a single site, to even lower than 2 jobs/sec
>
> So, to answer you question, the performance of Swift (and any other
> workflow system) will heavily rely on how efficient you can dispatch
> jobs/tasks to remote resources, how long jobs/tasks are, how data intensive
> the application is, and how much data movement must happen before the job
> runs and after.  If you have a fast enough file system, and your application
> execution times are small, you can expect anywhere from 1 to 50 jobs/sec
> from Swift, depending on what technologies you use to interface between
> Swift and the remote resources (e.g. GRAM, PBS, Condor, Falkon, etc).
>
> Cheers,
> Ioan
>
>
> J A wrote:
>
>  Hi All:
>
>
> Based on my reading, the performance from execution a swift workflow
> depends on the parallelism that a workflow has.
>
>
>
> If I have a workflow that contains several processors where each processor
> (procedure) depends on the previous one (output of a processor "A" is the
> input for processors "B" and so on.)
>
>
>
> How the performance of using swift in this case compare to other systems
> that execute workflows where there isn't any parallelism in the workflow?
>
>
>
> --
> Thanks,
>
> Jamal
>
> ------------------------------
>
> _______________________________________________
> Swift-user mailing listSwift-user at ci.uchicago.eduhttp://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>
>
> --
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E. 58th Street, Ryerson Hall
> Chicago, IL 60637
> ===================================================
> Email: iraicu at cs.uchicago.edu
> Web:   http://www.cs.uchicago.edu/~iraicu <http://www.cs.uchicago.edu/%7Eiraicu>http://dev.globus.org/wiki/Incubator/Falkonhttp://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
> ===================================================
> ===================================================
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20080603/4060fe51/attachment.html>

From jamalphd at gmail.com  Sat Jun  7 05:18:29 2008
From: jamalphd at gmail.com (J A)
Date: Sat, 7 Jun 2008 06:18:29 -0400
Subject: [Swift-user] Performance of Swift
In-Reply-To: <b19f113c0806031341y37a57ce4t90ec90e54964ec00@mail.gmail.com>
References: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
	<48456BE5.3080307@cs.uchicago.edu>
	<b19f113c0806031341y37a57ce4t90ec90e54964ec00@mail.gmail.com>
Message-ID: <b19f113c0806070318x13882f46r81c79bdd78bfd783@mail.gmail.com>

Thank you all for your replies ...


>
> I have a workflow that i developed using C code.  I am thinking of using
> Swift to execute the workflow, so my thinking is that i need first to change
> the code to be Swift script.
>
> More info about my workflow:
>
> The workflow consist of several major tasks:
>
> Task 1:  create a 1000 uniquely strings where each string is 1000 bytes.
> Task 2:  merge the strings  where every 2 strings ( A, B) will exchange a
> segment of it at a certain point and produce 1 string (C) with the same
> length (1000 bytes).  Then C replaces B.
> Task 3:  duplicate the list of string so we will have now 2000 strings
> Task 4: randomly choose 1000 strings from the current 2000 strings.
> Task 5: repeat Tasks 2, 3, and 4   for N times  (N is given) and now the
> list of strings used in the next iteration is the output of Task 4.
>
>
> Do you think changing the whole program into Swift script is necessary or
> just certain sections?  Can i just use wrappers around certain tasks and use
> Swift Script to call these tasks?
>
> Will the performance be the same?
>
>
> Any suggestion will be really appreciated.
>
> Thanks,
> Jamal
>
>
>
>
>
> On 6/3/08, Ioan Raicu <iraicu at cs.uchicago.edu> wrote:
>>
>> Hi,
>> There are several papers out there from our group that shows different
>> aspects of performance of Swift.  Here are a few:
>>
>>    -
>>    http://people.cs.uchicago.edu/~iraicu/publications/2008_NOVA08_book-chapter_Swift.pdf<http://people.cs.uchicago.edu/%7Eiraicu/publications/2008_NOVA08_book-chapter_Swift.pdf>
>>       - Figure 9: Shows the memory footprint per job (aka tasks, or nodes
>>       in the DAG graph)
>>          - shows memory footprint of 3.2KB per node
>>       - Figure 18: Shows a large scale application
>>          - 20K tasks on 200 CPUs with average task lengths of 200 seconds
>>          is a comfortable range for Swift and Falkon
>>          - we have more recent results, not published yet, that has 16K
>>          tasks on 2048 CPUs with an average task length of 87 seconds which worked
>>          well
>>          -
>>    http://people.cs.uchicago.edu/~iraicu/publications/2007_SWF07_Swift.pdf<http://people.cs.uchicago.edu/%7Eiraicu/publications/2007_SWF07_Swift.pdf>
>>       - Figure 6: Shows the speedup achieved with different task lengths
>>          - Conclusion is that using multi-level scheduling with the
>>          Falkon provider, even tasks in the range of seconds long can achieve good
>>          speedup
>>          - Figure 7: Shows the throughput in tasks/sec achieved by Swift
>>          - shows Swift achieving 50+ tasks/sec throughputs using Falkon
>>          - the paragraph right after this figure mentions that Swift
>>          running directly with GRAM2 and PBS can achieve 2 jobs/sec; the implication
>>          of this is that jobs typically take 15~60 seconds to startup, which reflects
>>          the cost of scheduling, scheduling cycles, and local resource manager's
>>          (LRM) time to setup the remote nodes; there are also limitations on how many
>>          jobs can be submitted at a time, as each job queued might consume some
>>          resources on the LRM, or there might be policies in place that limit the
>>          number of jobs that can be queued; this means that aggressive throttling
>>          must take place, which in practice, reduces the sustained rate that Swift
>>          can submit/execute jobs to a single site, to even lower than 2 jobs/sec
>>
>> So, to answer you question, the performance of Swift (and any other
>> workflow system) will heavily rely on how efficient you can dispatch
>> jobs/tasks to remote resources, how long jobs/tasks are, how data intensive
>> the application is, and how much data movement must happen before the job
>> runs and after.  If you have a fast enough file system, and your application
>> execution times are small, you can expect anywhere from 1 to 50 jobs/sec
>> from Swift, depending on what technologies you use to interface between
>> Swift and the remote resources (e.g. GRAM, PBS, Condor, Falkon, etc).
>>
>> Cheers,
>> Ioan
>>
>>
>> J A wrote:
>>
>>  Hi All:
>>
>>
>> Based on my reading, the performance from execution a swift workflow
>> depends on the parallelism that a workflow has.
>>
>>
>>
>> If I have a workflow that contains several processors where each processor
>> (procedure) depends on the previous one (output of a processor "A" is the
>> input for processors "B" and so on.)
>>
>>
>>
>> How the performance of using swift in this case compare to other systems
>> that execute workflows where there isn't any parallelism in the workflow?
>>
>>
>>
>> --
>> Thanks,
>>
>> Jamal
>>
>> ------------------------------
>>
>> _______________________________________________
>> Swift-user mailing listSwift-user at ci.uchicago.eduhttp://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
>>
>> --
>> ===================================================
>> Ioan Raicu
>> Ph.D. Candidate
>> ===================================================
>> Distributed Systems Laboratory
>> Computer Science Department
>> University of Chicago
>> 1100 E. 58th Street, Ryerson Hall
>> Chicago, IL 60637
>> ===================================================
>> Email: iraicu at cs.uchicago.edu
>> Web:   http://www.cs.uchicago.edu/~iraicu <http://www.cs.uchicago.edu/%7Eiraicu>http://dev.globus.org/wiki/Incubator/Falkonhttp://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>> ===================================================
>> ===================================================
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20080607/48a89d31/attachment.html>

From benc at hawaga.org.uk  Sat Jun  7 07:03:21 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 7 Jun 2008 12:03:21 +0000 (GMT)
Subject: [Swift-user] Performance of Swift
In-Reply-To: <b19f113c0806070318x13882f46r81c79bdd78bfd783@mail.gmail.com>
References: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
	<48456BE5.3080307@cs.uchicago.edu>
	<b19f113c0806031341y37a57ce4t90ec90e54964ec00@mail.gmail.com>
	<b19f113c0806070318x13882f46r81c79bdd78bfd783@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0806071152530.11097@dildano.hawaga.org.uk>


> > Do you think changing the whole program into Swift script is necessary or
> > just certain sections?  Can i just use wrappers around certain tasks and use
> > Swift Script to call these tasks?

Almost definitely do not convert that whole program into SwiftScript - 
Swift is not intended to efficiently execute "short" operations like 
string operations. It would deal better with plugging together larger 
pieces of your application (eg pieces that take minutes to run), with 
those implemented in (in your case) perhaps C.

To get decent benefit, though, I think you will need to figure out which 
pieces can run in parallel - breaking your app into eg 4 pieces and then 
only running them in sequence won't give much/any performance improvement.

You program looks almost, but not quite, like a genetic algorithm 
implementation; and there is a lot on the web about parallelising those.


-- 


From jamalphd at gmail.com  Tue Jun 10 08:10:43 2008
From: jamalphd at gmail.com (J A)
Date: Tue, 10 Jun 2008 09:10:43 -0400
Subject: [Swift-user] Performance of Swift
In-Reply-To: <Pine.LNX.4.64.0806071152530.11097@dildano.hawaga.org.uk>
References: <b19f113c0806021620v48c5a116r6f990f43877b899b@mail.gmail.com>
	<48456BE5.3080307@cs.uchicago.edu>
	<b19f113c0806031341y37a57ce4t90ec90e54964ec00@mail.gmail.com>
	<b19f113c0806070318x13882f46r81c79bdd78bfd783@mail.gmail.com>
	<Pine.LNX.4.64.0806071152530.11097@dildano.hawaga.org.uk>
Message-ID: <b19f113c0806100610wbd16b16m4d3fb59a6e132fcf@mail.gmail.com>

Thanks ...

On 6/7/08, Ben Clifford <benc at hawaga.org.uk> wrote:
>
>
> > > Do you think changing the whole program into Swift script is necessary
> or
> > > just certain sections?  Can i just use wrappers around certain tasks
> and use
> > > Swift Script to call these tasks?
>
> Almost definitely do not convert that whole program into SwiftScript -
> Swift is not intended to efficiently execute "short" operations like
> string operations. It would deal better with plugging together larger
> pieces of your application (eg pieces that take minutes to run), with
> those implemented in (in your case) perhaps C.
>
> To get decent benefit, though, I think you will need to figure out which
> pieces can run in parallel - breaking your app into eg 4 pieces and then
> only running them in sequence won't give much/any performance improvement.
>
> You program looks almost, but not quite, like a genetic algorithm
> implementation; and there is a lot on the web about parallelising those.
>
>
>
> --
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20080610/6d03feee/attachment.html>

From mikekubal at yahoo.com  Wed Jun 11 13:48:36 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Wed, 11 Jun 2008 11:48:36 -0700 (PDT)
Subject: [Swift-user] suggestion for program flow control
Message-ID: <370852.53715.qm@web52312.mail.re2.yahoo.com>

In a swift script I have a loop that iterates over potentially thousands of files and performs a function on each.

After the loop ends another swift function is called to parse the results.

Since I do no want to pass a result file for each of the thousand processed files in the loop, I have added an additional localhost script that checks to see if X number of results file have been produced and then generates a text file that  is passed into the swift function that parses the results. 

It wasn't a big deal to add the extra script, but it may be desirable for the swift program to not move to a function call outside of the loop until the loop is finished.  

Mike


From hategan at mcs.anl.gov  Wed Jun 11 14:16:51 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Jun 2008 14:16:51 -0500
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <370852.53715.qm@web52312.mail.re2.yahoo.com>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com>
Message-ID: <1213211811.7071.10.camel@localhost>

On Wed, 2008-06-11 at 11:48 -0700, Mike Kubal wrote:

> It wasn't a big deal to add the extra script, but it may be desirable for the swift program to not move to a function call outside of the loop until the loop is finished.  

One of the design issues with Swift was that it is also desirable for
dependent loops to be pipelined. Which seems to be the opposite of what
you want.

Why is it that you don't want to pass a result file for each of the
thousand processed files?

> 
> Mike
> 
> 
>       
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From benc at hawaga.org.uk  Wed Jun 11 16:50:34 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Jun 2008 21:50:34 +0000 (GMT)
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <370852.53715.qm@web52312.mail.re2.yahoo.com>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>


> Since I do no want to pass a result file for each of the thousand 
> processed files in the loop, I have added an additional localhost script 
> that checks to see if X number of results file have been produced and 
> then generates a text file that is passed into the swift function that 
> parses the results.

>From a swift-purist perspective, you shouldn't be saying "I do not want to 
pass files around" without substantial justification (eg. evidence that it 
hurts performance - which it presumably does?). More interesting is to see 
how stuff works entirely file-based and figure out what is going wrong 
with that approach.

What you post sounds like there's some foldy stuff that has been talked 
about before - its probably interesting to talk about that a bit more - eg 
imagine if you could write what you are doing in SwiftScript and point out 
what is going wrong at the moment.

-- 


From hategan at mcs.anl.gov  Wed Jun 11 16:53:05 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Jun 2008 16:53:05 -0500
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com>
	<Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>
Message-ID: <1213221185.11775.0.camel@localhost>

On Wed, 2008-06-11 at 21:50 +0000, Ben Clifford wrote:
> > Since I do no want to pass a result file for each of the thousand 
> > processed files in the loop, I have added an additional localhost script 
> > that checks to see if X number of results file have been produced and 
> > then generates a text file that is passed into the swift function that 
> > parses the results.
> 
> >From a swift-purist perspective, you shouldn't be saying "I do not want to 
> pass files around" without substantial justification (eg. evidence that it 
> hurts performance - which it presumably does?). More interesting is to see 
> how stuff works entirely file-based and figure out what is going wrong 
> with that approach.
> 
> What you post sounds like there's some foldy

reducy?

>  stuff that has been talked 
> about before - its probably interesting to talk about that a bit more - eg 
> imagine if you could write what you are doing in SwiftScript and point out 
> what is going wrong at the moment.
> 


From mikekubal at yahoo.com  Wed Jun 11 16:54:35 2008
From: mikekubal at yahoo.com (Mike Kubal)
Date: Wed, 11 Jun 2008 14:54:35 -0700 (PDT)
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <1213211811.7071.10.camel@localhost>
Message-ID: <380944.70847.qm@web52304.mail.re2.yahoo.com>


--- On Wed, 6/11/08, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> From: Mihael Hategan <hategan at mcs.anl.gov>
> Subject: Re: [Swift-user] suggestion for program flow control
> To: mikekubal at yahoo.com
> Cc: swift-user at ci.uchicago.edu
> Date: Wednesday, June 11, 2008, 2:16 PM
> On Wed, 2008-06-11 at 11:48 -0700, Mike Kubal wrote:
> 
> > It wasn't a big deal to add the extra script, but
> it may be desirable for the swift program to not move to a
> function call outside of the loop until the loop is
> finished.  
> 
> One of the design issues with Swift was that it is also
> desirable for
> dependent loops to be pipelined. Which seems to be the
> opposite of what
> you want.
> 
> Why is it that you don't want to pass a result file for
> each of the
> thousand processed files?

laziness, 

and also I'm not sure what that should look like in the function definition? 

Is there a way I can make the argument list for the function dynamic since the number of result files will vary based on the selected database to process?

I can definitely see the benefit of having separate pipelines for non-dependent parts within the same script, but perhaps there is a way to chain dependent functions that is not dependent on files produced by previous functions? 

Like I said it wasn't a big deal to add the extra script to pause and count files, just different behavior than I expected from the loop code.

> 
> > 
> > Mike
> > 
> > 
> >       
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> >
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From benc at hawaga.org.uk  Wed Jun 11 16:54:52 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Jun 2008 21:54:52 +0000 (GMT)
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <1213221185.11775.0.camel@localhost>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com> 
	<Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>
	<1213221185.11775.0.camel@localhost>
Message-ID: <Pine.LNX.4.64.0806112154380.28721@dildano.hawaga.org.uk>


> > What you post sounds like there's some foldy
> 
> reducy?

same as.

-- 


From hategan at mcs.anl.gov  Wed Jun 11 17:09:16 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Jun 2008 17:09:16 -0500
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <380944.70847.qm@web52304.mail.re2.yahoo.com>
References: <380944.70847.qm@web52304.mail.re2.yahoo.com>
Message-ID: <1213222156.11996.6.camel@localhost>

> > 
> > Why is it that you don't want to pass a result file for
> > each of the
> > thousand processed files?
> 
> laziness, 

Fair enough an argument for me.

> 
> and also I'm not sure what that should look like in the function definition? 
> 
> Is there a way I can make the argument list for the function dynamic since the number of result files will vary based on the selected database to process?

If I understand this correctly, then you can pass the whole array. You'd
get each file name on the command line. There's a question of whether
you'd cross the command line length limitation.

We should have array slices in Swift maybe?

> 
> I can definitely see the benefit of having separate pipelines for non-dependent parts within the same script, but perhaps there is a way to chain dependent functions that is not dependent on files produced by previous functions? 

Only that it is, but Swift has no idea about it. The dependency is
handled/created by your script.

> 
> Like I said it wasn't a big deal to add the extra script to pause and count files, just different behavior than I expected from the loop code.

Right. If you look at it through the prism of an intuition constructed
by doing C and Java and the likes, it doesn't look right.

> 
> > 
> > > 
> > > Mike
> > > 
> > > 
> > >       
> > > _______________________________________________
> > > Swift-user mailing list
> > > Swift-user at ci.uchicago.edu
> > >
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> > 
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> 
> 
>       


From hategan at mcs.anl.gov  Wed Jun 11 17:09:54 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Jun 2008 17:09:54 -0500
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <Pine.LNX.4.64.0806112154380.28721@dildano.hawaga.org.uk>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com>
	<Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>
	<1213221185.11775.0.camel@localhost>
	<Pine.LNX.4.64.0806112154380.28721@dildano.hawaga.org.uk>
Message-ID: <1213222194.11996.8.camel@localhost>

On Wed, 2008-06-11 at 21:54 +0000, Ben Clifford wrote:
> > > What you post sounds like there's some foldy
> > 
> > reducy?
> 
> same as.
> 

How so?


From benc at hawaga.org.uk  Wed Jun 11 17:13:36 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 11 Jun 2008 22:13:36 +0000 (GMT)
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <1213222194.11996.8.camel@localhost>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com> 
	<Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk> 
	<1213221185.11775.0.camel@localhost>
	<Pine.LNX.4.64.0806112154380.28721@dildano.hawaga.org.uk>
	<1213222194.11996.8.camel@localhost>
Message-ID: <Pine.LNX.4.64.0806112212320.28721@dildano.hawaga.org.uk>


On Wed, 11 Jun 2008, Mihael Hategan wrote:

> On Wed, 2008-06-11 at 21:54 +0000, Ben Clifford wrote:
> > > > What you post sounds like there's some foldy
> > > 
> > > reducy?
> > 
> > same as.
> > 
> 
> How so?

repeated application of X -> x -> X

-- 


From wilde at mcs.anl.gov  Wed Jun 11 17:51:32 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 11 Jun 2008 17:51:32 -0500
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com>
	<Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>
Message-ID: <485056F4.3050904@mcs.anl.gov>

My understanding was that MikeK was not trying to do this for 
performance, but was simply unsure of how to pass a large set of files 
from one function to the next.

When he said "I do not want to pass a result file for each of the 
thousand processed files in the loop" I think he was talking about a 
coding issue, not a performance issue.

I did not yet look at his latest Swift code, but suggested he forward it 
to the list to ask for advice on how best to express the data flow.

I suspect that one of the existing mappers and correct use of a dataset 
type can solve his problem and eliminate the need to for localhost "file 
waiting" script.  Unless he's tripping into the issue of having an 
non-predetermined number of files in the dataset.

Its not clear to me if, when he does that, that new performance issues 
wont arise. But lets at least first look at how best to express the problem.

Also, I think the discussion of fold(y) and reduce(y) concepts is likely 
very cryptic to non-functional programmers.

- MikeW

On 6/11/08 4:50 PM, Ben Clifford wrote:
>> Since I do no want to pass a result file for each of the thousand 
>> processed files in the loop, I have added an additional localhost script 
>> that checks to see if X number of results file have been produced and 
>> then generates a text file that is passed into the swift function that 
>> parses the results.
> 
>>From a swift-purist perspective, you shouldn't be saying "I do not want to 
> pass files around" without substantial justification (eg. evidence that it 
> hurts performance - which it presumably does?). More interesting is to see 
> how stuff works entirely file-based and figure out what is going wrong 
> with that approach.
> 
> What you post sounds like there's some foldy stuff that has been talked 
> about before - its probably interesting to talk about that a bit more - eg 
> imagine if you could write what you are doing in SwiftScript and point out 
> what is going wrong at the moment.
> 


From hategan at mcs.anl.gov  Wed Jun 11 18:44:10 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Jun 2008 18:44:10 -0500
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <Pine.LNX.4.64.0806112212320.28721@dildano.hawaga.org.uk>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com>
	<Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>
	<1213221185.11775.0.camel@localhost>
	<Pine.LNX.4.64.0806112154380.28721@dildano.hawaga.org.uk>
	<1213222194.11996.8.camel@localhost>
	<Pine.LNX.4.64.0806112212320.28721@dildano.hawaga.org.uk>
Message-ID: <1213227850.12786.1.camel@localhost>

On Wed, 2008-06-11 at 22:13 +0000, Ben Clifford wrote:
> On Wed, 11 Jun 2008, Mihael Hategan wrote:
> 
> > On Wed, 2008-06-11 at 21:54 +0000, Ben Clifford wrote:
> > > > > What you post sounds like there's some foldy
> > > > 
> > > > reducy?
> > > 
> > > same as.
> > > 
> > 
> > How so?
> 
> repeated application of X -> x -> X
> 

Foldy less general than reducy here. The former assumes one step at a
time.


From hategan at mcs.anl.gov  Wed Jun 11 18:47:29 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 11 Jun 2008 18:47:29 -0500
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <485056F4.3050904@mcs.anl.gov>
References: <370852.53715.qm@web52312.mail.re2.yahoo.com>
	<Pine.LNX.4.64.0806112140110.11097@dildano.hawaga.org.uk>
	<485056F4.3050904@mcs.anl.gov>
Message-ID: <1213228049.12786.4.camel@localhost>

On Wed, 2008-06-11 at 17:51 -0500, Michael Wilde wrote:

> 
> Also, I think the discussion of fold(y) and reduce(y) concepts is likely 
> very cryptic to non-functional programmers.
> 

Is that meant to be gentle slap on the wrist?


From fedorov at cs.wm.edu  Thu Jun 12 20:19:11 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 12 Jun 2008 21:19:11 -0400
Subject: [Swift-user] Running first.swift remotely on NCSA
Message-ID: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>

Hello,

I am a beginner user of TeraGrid/Swift, with very little experience.

I am trying to run first.swift on NCSA Mercury. I updated
etc/sites.xml with (what I think) the latest correct information from
teragrid.org:

<pool handle="Mercury" >
    <gridftp  url="gsiftp://gridftp-hg.ncsa.teragrid.org" />
    <jobmanager universe="vanilla"
url="grid-hg.ncsa.teragrid.org/jobmanager" major="2" />
    <workdirectory >/home/ac/fedorov/scratch</workdirectory>
</pool>

I updated the etc/tc.data with the line where to find echo on Mercury:

Mercury     echo   /bin/echo INSTALLED INTEL32::LINUX null

I ran grid-proxy-init, and myproxy-init. I installed userkey.pem and
usercert.pem in ~/.globus, and I installed the package with the root
certificates from here
http://security.teragrid.org/docs/teragrid-certs.tar.gz in
~/.globus/certificates.

$GLOBUS_PATH is set, swift is in the $PATH.

I am able to log on to Mercury with gsissh, and I am able to execute
globus-url-copy, both without being asked password  I did run
'globusrun -a -r grid-hg.ncsa.teragrid.org' and the authentication
test was uccessful.

Now is my question, finally:

Why then, when I run 'swift first.swift', I see this (below) forever?
What did I miss? How to find what is the problem? Doesn't seem normal,
that 'echo' + scp takes minutes. I didn't have patience to wait until
it finishes, I'll leave it overnight to see if it finishes tomorrow.


[fedorov at ri vdsk] swift first.swift
Swift 0.5 swift-r1783 cog-r1962

RunID: 20080612-2101-yvp36l3c
Progress:
echo started
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1
Progress:  Executing:1


Thanks in advance for your kind attention......


Andrey Fedorov
--
Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov


From hategan at mcs.anl.gov  Thu Jun 12 20:26:28 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 12 Jun 2008 20:26:28 -0500
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
Message-ID: <1213320388.1281.1.camel@localhost>

On Thu, 2008-06-12 at 21:19 -0400, Andriy Fedorov wrote:
> Hello,
> 
> I am a beginner user of TeraGrid/Swift, with very little experience.
> 
> I am trying to run first.swift on NCSA Mercury. I updated
> etc/sites.xml with (what I think) the latest correct information from
> teragrid.org:
> 
> <pool handle="Mercury" >
>     <gridftp  url="gsiftp://gridftp-hg.ncsa.teragrid.org" />
>     <jobmanager universe="vanilla"
> url="grid-hg.ncsa.teragrid.org/jobmanager" major="2" />

That doesn't look right. You need a specific job manager, such as "fork"
or "pbs". I'd recommend trying "fork" for simple testing.

> 
> [fedorov at ri vdsk] swift first.swift
> Swift 0.5 swift-r1783 cog-r1962
> 
> RunID: 20080612-2101-yvp36l3c
> Progress:
> echo started
> Progress:  Executing:1
> Progress:  Executing:1
> Progress:  Executing:1
> Progress:  Executing:1
> Progress:  Executing:1
> Progress:  Executing:1

This may happen if the callback address for you submit host is unknown
to the GRAM service or if you're behind a fierewall or NAT. If you're
not, try setting $GLOBUS_HOSTNAME with your DNS address or IP.

Mihael


From fedorov at cs.wm.edu  Fri Jun 13 08:27:58 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 13 Jun 2008 09:27:58 -0400
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <1213320388.1281.1.camel@localhost>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
Message-ID: <82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>

Michael,

Thank you for the reply. Unfortunately, your suggestions didn't help.

>> <pool handle="Mercury" >
>>     <gridftp  url="gsiftp://gridftp-hg.ncsa.teragrid.org" />
>>     <jobmanager universe="vanilla"
>> url="grid-hg.ncsa.teragrid.org/jobmanager" major="2" />
>
> That doesn't look right. You need a specific job manager, such as "fork"
> or "pbs". I'd recommend trying "fork" for simple testing.
>

I got "grid-hg.ncsa.teragrid.org/jobmanager" from here:
http://www.teragrid.org/userinfo/jobs/gram.php I think it is just an
alias for "fork". I substituted "jobmanager" with "jobmanager-fork",
but everything is the same way.

I also tried to use "jobmanager-pbs", and I could see my job in the
queue, but the same result "Progress:  Executing:1" on the client
host.

>> [fedorov at ri vdsk] swift first.swift
>> Swift 0.5 swift-r1783 cog-r1962
>>
>> RunID: 20080612-2101-yvp36l3c
>> Progress:
>> echo started
>> Progress:  Executing:1
>> Progress:  Executing:1
>> Progress:  Executing:1
>> Progress:  Executing:1
>> Progress:  Executing:1
>> Progress:  Executing:1
>
> This may happen if the callback address for you submit host is unknown
> to the GRAM service or if you're behind a fierewall or NAT. If you're
> not, try setting $GLOBUS_HOSTNAME with your DNS address or IP.
>

No, I have a valid IP. I do have $GLOBUS_HOSTNAME set now, but this
doesn't help.

I did some more looking around, and I found directories named
"first-<date>-<time>-<RunID>" in the scratch directory on Mercury. The
log in "info" directory looks good to me:

Progress  2008-06-12 20:18:06.%N-0500  LOG_START

_____________________________________________________________________________

        Wrapper
_____________________________________________________________________________

DIR=jobs/5/echo-5zu081ui
EXEC=/bin/echo
STDIN=
STDOUT=hello.txt
STDERR=stderr.txt
DIRS=
INF=
OUTF=hello.txt
KICKSTART=
ARGS=Hello, world!
Progress  2008-06-12 20:18:06.%N-0500  CREATE_JOBDIR
Created job directory: jobs/5/echo-5zu081ui
Progress  2008-06-12 20:18:06.%N-0500  CREATE_INPUTDIR
Progress  2008-06-12 20:18:06.%N-0500  LINK_INPUTS
Progress  2008-06-12 20:18:06.%N-0500  EXECUTE
Progress  2008-06-12 20:18:06.%N-0500  EXECUTE_DONE
Job ran successfully
Progress  2008-06-12 20:18:06.%N-0500  COPYING_OUTPUTS
Progress  2008-06-12 20:18:06.%N-0500  RM_JOBDIR
Progress  2008-06-12 20:18:06.%N-0500  TOUCH_SUCCESS
Progress  2008-06-12 20:18:06.%N-0500  END

Note that I had this same successful (?) run when I used plain "jobmanager".

It seems that there is a problem returning the result back. I tried to
set GLOBUS_TCP_PORT_RANGE to 45000,45100, as suggested here
http://www-128.ibm.com/developerworks/cn/grid/gr-gsi4intro/index_eng.html,
but this also was of no help.

Does anybody know what is wrong?

Fedorov


From hategan at mcs.anl.gov  Fri Jun 13 08:59:41 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 13 Jun 2008 08:59:41 -0500
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
Message-ID: <1213365581.6448.8.camel@localhost>


> I got "grid-hg.ncsa.teragrid.org/jobmanager" from here:
> http://www.teragrid.org/userinfo/jobs/gram.php I think it is just an
> alias for "fork".

That makes sense/

> 
> I also tried to use "jobmanager-pbs", and I could see my job in the
> queue, but the same result "Progress:  Executing:1" on the client
> host.

Ok, that's good.

> 
> >> [fedorov at ri vdsk] swift first.swift
> >> Swift 0.5 swift-r1783 cog-r1962
> >>
> >> RunID: 20080612-2101-yvp36l3c
> >> Progress:
> >> echo started
> >> Progress:  Executing:1
> >> Progress:  Executing:1
> >> Progress:  Executing:1
> >> Progress:  Executing:1
> >> Progress:  Executing:1
> >> Progress:  Executing:1
> >
> > This may happen if the callback address for you submit host is unknown
> > to the GRAM service or if you're behind a fierewall or NAT. If you're
> > not, try setting $GLOBUS_HOSTNAME with your DNS address or IP.
> >
> 
> No, I have a valid IP. I do have $GLOBUS_HOSTNAME set now, but this
> doesn't help.

Well, your problem is somewhere around there.

> 
> It seems that there is a problem returning the result back. I tried to
> set GLOBUS_TCP_PORT_RANGE to 45000,45100, as suggested here

I don't think setting them to a random range will help. It would need to
match an open port range on your assumed firewall.

> http://www-128.ibm.com/developerworks/cn/grid/gr-gsi4intro/index_eng.html,
> but this also was of no help.
> 
> Does anybody know what is wrong?

You'll need to do some more troubleshooting.
Id try the following:
- Start swift as you did before and wait for the job to be submitted
- figure out what port the callback service is on (do a netstat -lntp)
- log into NCSA and see if you can telnet to GLOBUS_HOSTNAME:port. You
probably can't. You'll have to figure out why.

Mihael

> 
> Fedorov


From fedorov at cs.wm.edu  Fri Jun 13 09:34:05 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 13 Jun 2008 10:34:05 -0400
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <1213365581.6448.8.camel@localhost>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
Message-ID: <82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>

> You'll need to do some more troubleshooting.
> Id try the following:
> - Start swift as you did before and wait for the job to be submitted
> - figure out what port the callback service is on (do a netstat -lntp)
> - log into NCSA and see if you can telnet to GLOBUS_HOSTNAME:port. You
> probably can't. You'll have to figure out why.
>

Yep, you were right...

[fedorov at ri ~/.globus] netstat -lntp | grep java
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:45000           0.0.0.0:*
LISTEN      25468/java
tcp        0      0 127.0.0.1:10479         0.0.0.0:*
LISTEN      25468/java

and telnet to port 45000 from Mercury gets "connection refused"

To add more details, I am unable to execute jobs via globusrun either:

[fedorov at ri vdsk] globusrun -o -r grid-hg.ncsa.teragrid.org/jobmanager
'&(executable=/usr/bin/cal)'
GRAM Job submission failed
org.globus.gram.GramException: The job manager failed to open stderr
        at org.globus.gram.Gram.request(Gram.java:358)
        at org.globus.gram.GramJob.request(GramJob.java:262)
        at org.globus.tools.GlobusRun.gramRun(GlobusRun.java:557)
        at org.globus.tools.GlobusRun.main(GlobusRun.java:530)
GRAM Job submission failed: The job manager failed to open stderr

I will have to figure it out...

Andrey


From fedorov at cs.wm.edu  Mon Jun 16 16:59:12 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Mon, 16 Jun 2008 17:59:12 -0400
Subject: [Swift-user] Passing GRAM parameters
Message-ID: <82f536810806161459t378f673fr524f7128d578957@mail.gmail.com>

Hello,

1. I have a workflow, which contains an MPI-parallelized component. I
haven't seen a reference on how explicitly parallel jobs can be
launched in the Swift guide. Is this functionality implemented, or
this requires extending Swift (writing a new site selector, maybe?)

2. How can I specify the walltime for a job? I see a reference to
GLOBUS::maxwalltime in the swift.properties comments, which seems to
be related, but no explanation. The ability to specify walltime seems
to me a basic functionality.

I know these two can be specified in an RSL for GRAM, but can I do it
with Swift? If not, is it very difficult to access the RSL passed to
GRAM?

Thanks
--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov


From benc at hawaga.org.uk  Mon Jun 16 17:40:24 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 16 Jun 2008 22:40:24 +0000 (GMT)
Subject: [Swift-user] Passing GRAM parameters
In-Reply-To: <82f536810806161459t378f673fr524f7128d578957@mail.gmail.com>
References: <82f536810806161459t378f673fr524f7128d578957@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0806162228230.4338@dildano.hawaga.org.uk>


On Mon, 16 Jun 2008, Andriy Fedorov wrote:

> 1. I have a workflow, which contains an MPI-parallelized component. I
> haven't seen a reference on how explicitly parallel jobs can be
> launched in the Swift guide. Is this functionality implemented, or
> this requires extending Swift (writing a new site selector, maybe?)

I've never done anything with MPI-parallel stuff in GRAM. I think tis is 
normally specified in GRAM RSL. Do you have a command for GRAM to do it?

> 2. How can I specify the walltime for a job? I see a reference to
> GLOBUS::maxwalltime in the swift.properties comments, which seems to
> be related, but no explanation. The ability to specify walltime seems
> to me a basic functionality.

You can specify GRAM RSL properties in two places: either for every job 
running on a site in sites.xml, or for a particular executable in tc.data

For the sites.xml approach, this is an example of how I set the 
maxwalltime to 1 minute on the teragrid TACC site:

 <config>

  <pool handle="tgtacc" >
    <gridftp  url="gsiftp://tg-login.tacc.teragrid.org" /> 
    <jobmanager universe="vanilla" 
url="tg-login.tacc.teragrid.org:2119/jobmanag
er-lsf" major="2" />
    <workdirectory >/home/teragrid/tg458015/swifttest</workdirectory>
    <profile namespace="globus" key="queue">development</profile> 
    <profile namespace="globus" key="maxwalltime">1</profile> 
  </pool>

</config>

Two RSL values get defined there: &(queue=development)(maxwalltime=1)

In tc.data, you can specify per-executable by sticking it on the end of 
the tc.data line for that executable. Where, by default, there is 'null', 
instead write:  GLOBUS::maxwalltime=1

For example, something like this:

UCTERAPORT air::alignlinear /gpfs1/osg_data/fMRI_tools/AIR/bin/alignlinear 
INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=1


Hopefully you can specify what you need for the MPI stuff in this way too.

-- 


From benc at hawaga.org.uk  Tue Jun 17 08:53:55 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 17 Jun 2008 13:53:55 +0000 (GMT)
Subject: [Swift-user] suggestion for program flow control
In-Reply-To: <380944.70847.qm@web52304.mail.re2.yahoo.com>
References: <380944.70847.qm@web52304.mail.re2.yahoo.com>
Message-ID: <Pine.LNX.4.64.0806171351470.4338@dildano.hawaga.org.uk>


On Wed, 11 Jun 2008, Mike Kubal wrote:

> I can definitely see the benefit of having separate pipelines for 
> non-dependent parts within the same script, but perhaps there is a way 
> to chain dependent functions that is not dependent on files produced by 
> previous functions?

I've been playing with some code to do that as someone else requested it.

Basically you will be able to have a swiftscript variable that expresses 
the dependency, but doesn't have any actual content (such as a file).

Hopefully later this week there will be something in SVN.

-- 


From fedorov at cs.wm.edu  Tue Jun 17 11:21:48 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 17 Jun 2008 12:21:48 -0400
Subject: [Swift-user] Passing GRAM parameters
In-Reply-To: <Pine.LNX.4.64.0806162228230.4338@dildano.hawaga.org.uk>
References: <82f536810806161459t378f673fr524f7128d578957@mail.gmail.com>
	<Pine.LNX.4.64.0806162228230.4338@dildano.hawaga.org.uk>
Message-ID: <82f536810806170921r6435386bh1eb2b670f6231c30@mail.gmail.com>

>
>> 1. I have a workflow, which contains an MPI-parallelized component. I
>> haven't seen a reference on how explicitly parallel jobs can be
>> launched in the Swift guide. Is this functionality implemented, or
>> this requires extending Swift (writing a new site selector, maybe?)
>
> I've never done anything with MPI-parallel stuff in GRAM. I think tis is
> normally specified in GRAM RSL. Do you have a command for GRAM to do it?

I know how to do that in pre-WS GRAM -- there were "count" and
"hostCount" attributes. I assume, same attributes should work with WS
GRAM.


>> 2. How can I specify the walltime for a job? I see a reference to
>> GLOBUS::maxwalltime in the swift.properties comments, which seems to
>> be related, but no explanation. The ability to specify walltime seems
>> to me a basic functionality.
>
> You can specify GRAM RSL properties in two places: either for every job
> running on a site in sites.xml, or for a particular executable in tc.data
>

Thank you for this explanation. I think for me it is only possible to
specify walltime in tc.data, because walltime will be different for
different tasks.

Can you tell me how to specify multiple GLOBUS::-like parameters? For
example, if I want to try to specify walltime, count and hostCount for
each item in tc.data, is this possible? I understand that I can
probably specify these attributes in sites.xml, but in this case the
resource request for that specific site will always ask for
"hostCount" nodes, even if I can only use 1.


Andrey


From hategan at mcs.anl.gov  Tue Jun 17 11:34:09 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 17 Jun 2008 11:34:09 -0500
Subject: [Swift-user] Passing GRAM parameters
In-Reply-To: <82f536810806170921r6435386bh1eb2b670f6231c30@mail.gmail.com>
References: <82f536810806161459t378f673fr524f7128d578957@mail.gmail.com>
	<Pine.LNX.4.64.0806162228230.4338@dildano.hawaga.org.uk>
	<82f536810806170921r6435386bh1eb2b670f6231c30@mail.gmail.com>
Message-ID: <1213720449.14337.2.camel@localhost>

On Tue, 2008-06-17 at 12:21 -0400, Andriy Fedorov wrote:
> >
> >> 1. I have a workflow, which contains an MPI-parallelized component. I
> >> haven't seen a reference on how explicitly parallel jobs can be
> >> launched in the Swift guide. Is this functionality implemented, or
> >> this requires extending Swift (writing a new site selector, maybe?)
> >
> > I've never done anything with MPI-parallel stuff in GRAM. I think tis is
> > normally specified in GRAM RSL. Do you have a command for GRAM to do it?
> 
> I know how to do that in pre-WS GRAM -- there were "count" and
> "hostCount" attributes. I assume, same attributes should work with WS
> GRAM.

An possibly jobType="mpi".

Yes, the same attributes should work with WS-GRAM. If they don't, I
would consider it a bug.

> 
> 
> >>...
> 
> Can you tell me how to specify multiple GLOBUS::-like parameters? For
> example, if I want to try to specify walltime, count and hostCount for
> each item in tc.data, is this possible? I understand that I can
> probably specify these attributes in sites.xml, but in this case the
> resource request for that specific site will always ask for
> "hostCount" nodes, even if I can only use 1.

You should be able to separate them using commas, if they share the same
namespace. For example:
GLOBUS::walltime="10:10:00",count=4,hostCount=2


From fedorov at cs.wm.edu  Tue Jun 17 11:37:46 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 17 Jun 2008 12:37:46 -0400
Subject: [Swift-user] Passing GRAM parameters
In-Reply-To: <1213720449.14337.2.camel@localhost>
References: <82f536810806161459t378f673fr524f7128d578957@mail.gmail.com>
	<Pine.LNX.4.64.0806162228230.4338@dildano.hawaga.org.uk>
	<82f536810806170921r6435386bh1eb2b670f6231c30@mail.gmail.com>
	<1213720449.14337.2.camel@localhost>
Message-ID: <82f536810806170937h1022ec3dp447a332638ad3a1c@mail.gmail.com>

Michael,

Thank you for this clarification. I will post an update if I have problems.

I am sorry I cannot try things out right away, but I am in the process
of setting up my software on TeraGrid, and it's going to take some
time... But I WILL have updates on whether this works or not,
eventually! :)

Andrey


On Tue, Jun 17, 2008 at 12:34 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Tue, 2008-06-17 at 12:21 -0400, Andriy Fedorov wrote:
>> >
>> >> 1. I have a workflow, which contains an MPI-parallelized component. I
>> >> haven't seen a reference on how explicitly parallel jobs can be
>> >> launched in the Swift guide. Is this functionality implemented, or
>> >> this requires extending Swift (writing a new site selector, maybe?)
>> >
>> > I've never done anything with MPI-parallel stuff in GRAM. I think tis is
>> > normally specified in GRAM RSL. Do you have a command for GRAM to do it?
>>
>> I know how to do that in pre-WS GRAM -- there were "count" and
>> "hostCount" attributes. I assume, same attributes should work with WS
>> GRAM.
>
> An possibly jobType="mpi".
>
> Yes, the same attributes should work with WS-GRAM. If they don't, I
> would consider it a bug.
>
>>
>>
>> >>...
>>
>> Can you tell me how to specify multiple GLOBUS::-like parameters? For
>> example, if I want to try to specify walltime, count and hostCount for
>> each item in tc.data, is this possible? I understand that I can
>> probably specify these attributes in sites.xml, but in this case the
>> resource request for that specific site will always ask for
>> "hostCount" nodes, even if I can only use 1.
>
> You should be able to separate them using commas, if they share the same
> namespace. For example:
> GLOBUS::walltime="10:10:00",count=4,hostCount=2
>
>
>


From fedorov at cs.wm.edu  Fri Jun 20 15:23:54 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 20 Jun 2008 16:23:54 -0400
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
Message-ID: <82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>

>
> I will have to figure it out...
>

The problem turned out to be in the setup of our firewall. For the
time being, I cannot resolve it, so I am trying to use TeraGrid UC
site for testing.

Unfortunately, I ran into problems again.

Here is my sites.xml:

<config>

<pool handle="UC-GT2-Fork">
        <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
        <jobmanager universe="vanilla"
url="tg-grid.uc.teragrid.org/jobmanager-fork" major="2" />
        <workdirectory>/home/fedorov/scratch</workdirectory>
</pool>

<pool handle="UC-GT2-PBS">
        <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
        <jobmanager universe="vanilla"
url="tg-grid.uc.teragrid.org/jobmanager-pbs" major="2" />
        <workdirectory>/home/fedorov/scratch</workdirectory>
</pool>

<pool handle="UC-GT4">
        <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
        <execution provider="gt4" jobmanager="PBS"
        url="https://tg-grid.uc.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"
/>
        <workdirectory>/home/fedorov/scratch</workdirectory>
</pool>

</config>

And here is tc.data:

UC-GT2-Fork     echo_gt2_fork   /bin/echo       INSTALLED
INTEL32::LINUX  null
UC-GT2-PBS      echo_gt2_pbs    /bin/echo       INSTALLED
INTEL32::LINUX  null
UC-GT4          echo_gt4        /bin/echo       INSTALLED
INTEL32::LINUX  null

I am trying to run example first.swift, but use echo_gt2_fork,
echo_gt2_pbs, and echo_gt4 instead, to test different setups.

I have success with echo_gt2_fork and echo_gt4. But when I try to use
echo_gt2_pbs, it never finishes, and I see this:

[fedorov at tg-login1 vdsk]$ swift first.swift
Unable to find required classes (javax.activation.DataHandler and
javax.mail.internet.MimeMultipart). Attachment support is disabled.
Swift 0.5 swift-r1783 cog-r1962

RunID: 20080620-1514-in8cf2yg
Progress:
echo_gt2_pbs started
Progress:  Stage in:1
Progress:  Executing:1
Progress:  Executing:1
...

Note, that I can see my job started and completed with "qstat" on UC
site, but the result never gets back. This is on
tg-login.uc.teragrid.org, so there should be no problem with firewall.

The only reason I would like to get this working for pre-WS is because
my true goal is to be able to run MPI job, and be able to pass node
type to PBS. The only way to specify host type for GT4 GRAM is through
Job description extensions (see
http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html),
and I don't know how this can be translated into tc.data. With GT2, I
can simply use "host_count=2:compute".

Can someone help me with this simple example, so that I can try to
make more complicated scripts work and run MPI applications?

Thank you

Andrey


From hategan at mcs.anl.gov  Fri Jun 20 15:39:42 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 20 Jun 2008 15:39:42 -0500
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
	<82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
Message-ID: <1213994382.16201.5.camel@localhost>

On Fri, 2008-06-20 at 16:23 -0400, Andriy Fedorov wrote:
> >
> > I will have to figure it out...
> >
> 
> The problem turned out to be in the setup of our firewall. For the
> time being, I cannot resolve it, so I am trying to use TeraGrid UC
> site for testing.

You could set up an SSH tunnel...

> 
> Unfortunately, I ran into problems again.
> 
[...]
> RunID: 20080620-1514-in8cf2yg
> Progress:
> echo_gt2_pbs started
> Progress:  Stage in:1
> Progress:  Executing:1
> Progress:  Executing:1
> ...

Well, swift thinks the job is running.

> 
> Note, that I can see my job started and completed with "qstat" on UC
> site, but the result never gets back.

By never, do you mean more than 1 minute or less?

>  This is on
> tg-login.uc.teragrid.org, so there should be no problem with firewall.

Though there might still be a problem with GLOBUS_HOSTNAME. Is that set
properly? Can you do the telnet thingy?

> 
> The only reason I would like to get this working for pre-WS is because
> my true goal is to be able to run MPI job, and be able to pass node
> type to PBS. The only way to specify host type for GT4 GRAM is through
> Job description extensions (see
> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html),
> and I don't know how this can be translated into tc.data. With GT2, I
> can simply use "host_count=2:compute".

I think you should be able to use host_types=compute with both pre-WS
GRAM and WS-GRAM.

Mihael


From fedorov at cs.wm.edu  Fri Jun 20 16:08:57 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 20 Jun 2008 17:08:57 -0400
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <1213994382.16201.5.camel@localhost>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
	<82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
	<1213994382.16201.5.camel@localhost>
Message-ID: <82f536810806201408t7b7643adw733756be522327ae@mail.gmail.com>

>> Note, that I can see my job started and completed with "qstat" on UC
>> site, but the result never gets back.
>
> By never, do you mean more than 1 minute or less?
>

More than 5 minutes.

>>  This is on
>> tg-login.uc.teragrid.org, so there should be no problem with firewall.
>
> Though there might still be a problem with GLOBUS_HOSTNAME. Is that set
> properly? Can you do the telnet thingy?
>

GLOBUS_HOSTNAME was not set. I set it, but nothing changed.
GLOBUS_TCP_PORT_RANGE is set to "50000,51000", port 50000 is open,
yes, I can telnet to that port.

>>
>> The only reason I would like to get this working for pre-WS is because
>> my true goal is to be able to run MPI job, and be able to pass node
>> type to PBS. The only way to specify host type for GT4 GRAM is through
>> Job description extensions (see
>> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html),
>> and I don't know how this can be translated into tc.data. With GT2, I
>> can simply use "host_count=2:compute".
>
> I think you should be able to use host_types=compute with both pre-WS
> GRAM and WS-GRAM.
>

Ok, so I added this line to tc.data:

UC-GT4          echo_gt4_mpi    /bin/echo       INSTALLED
INTEL32::LINUX  GLOBUS::host_count="2:compute",jobType=mpi

and ran first.swift with echo_gt4_mpi. The result is very strange.

In the qstat I see one job running on 1 node (the node type is NOT
"compute"), then it finishes, then Swift script reports "Final status:
 Finished successfully:1", and then I see SECOND job with 1 node
running in qstat.

Then I tried and added ",count=2" in the end of GLOBUS attributes.
Then I saw first job running on 2 nodes (again, node types were not
what I requested), then again Swift finished, and again 1-node job
started and finished....

So, I see two problems: 1) problems using jobmanager-pbs with pre-WS
GRAM, and 2) problems passing arguments to Globus for running MPI-type
jobs.

Note, I can request resources correctly and run MPI jobs when using
RSL job description with globusrun, and using XML (with job
description extensions) and globusrun-ws. I can post those, if you are
interested. Therefore, I do not think the second problem I mentioned
is a GT problem.

Andrey


From hategan at mcs.anl.gov  Fri Jun 20 16:50:48 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 20 Jun 2008 16:50:48 -0500
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <82f536810806201408t7b7643adw733756be522327ae@mail.gmail.com>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
	<82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
	<1213994382.16201.5.camel@localhost>
	<82f536810806201408t7b7643adw733756be522327ae@mail.gmail.com>
Message-ID: <1213998648.17754.2.camel@localhost>

On Fri, 2008-06-20 at 17:08 -0400, Andriy Fedorov wrote:
> >> Note, that I can see my job started and completed with "qstat" on UC
> >> site, but the result never gets back.
> >
> > By never, do you mean more than 1 minute or less?
> >
> 
> More than 5 minutes.

That "never" enough.

> 
> >>  This is on
> >> tg-login.uc.teragrid.org, so there should be no problem with firewall.
> >
> > Though there might still be a problem with GLOBUS_HOSTNAME. Is that set
> > properly? Can you do the telnet thingy?
> >
> 
> GLOBUS_HOSTNAME was not set. I set it, but nothing changed.
> GLOBUS_TCP_PORT_RANGE is set to "50000,51000", port 50000 is open,
> yes, I can telnet to that port.

Ok, can you send me a link to the swift log for the run that behaves
badly?

> 
> >>
> >> The only reason I would like to get this working for pre-WS is because
> >> my true goal is to be able to run MPI job, and be able to pass node
> >> type to PBS. The only way to specify host type for GT4 GRAM is through
> >> Job description extensions (see
> >> http://www.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Job_Desc_Extensions.html),
> >> and I don't know how this can be translated into tc.data. With GT2, I
> >> can simply use "host_count=2:compute".
> >
> > I think you should be able to use host_types=compute with both pre-WS
> > GRAM and WS-GRAM.
> >
> 
> Ok, so I added this line to tc.data:
> 
> UC-GT4          echo_gt4_mpi    /bin/echo       INSTALLED
> INTEL32::LINUX  GLOBUS::host_count="2:compute",jobType=mpi

I'm a bit unsure whether that would work. I think that
host_types="ia32-compute" has worked in the past on that site.

Mihael

> 
> and ran first.swift with echo_gt4_mpi. The result is very strange.
> 
> In the qstat I see one job running on 1 node (the node type is NOT
> "compute"), then it finishes, then Swift script reports "Final status:
>  Finished successfully:1", and then I see SECOND job with 1 node
> running in qstat.
> 
> Then I tried and added ",count=2" in the end of GLOBUS attributes.
> Then I saw first job running on 2 nodes (again, node types were not
> what I requested), then again Swift finished, and again 1-node job
> started and finished....
> 
> So, I see two problems: 1) problems using jobmanager-pbs with pre-WS
> GRAM, and 2) problems passing arguments to Globus for running MPI-type
> jobs.
> 
> Note, I can request resources correctly and run MPI jobs when using
> RSL job description with globusrun, and using XML (with job
> description extensions) and globusrun-ws. I can post those, if you are
> interested. Therefore, I do not think the second problem I mentioned
> is a GT problem.
> 
> Andrey


From fedorov at cs.wm.edu  Fri Jun 20 17:27:48 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 20 Jun 2008 18:27:48 -0400
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <1213998648.17754.2.camel@localhost>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
	<82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
	<1213994382.16201.5.camel@localhost>
	<82f536810806201408t7b7643adw733756be522327ae@mail.gmail.com>
	<1213998648.17754.2.camel@localhost>
Message-ID: <82f536810806201527o1461538alf5e2da171cd3c3e8@mail.gmail.com>

>> > By never, do you mean more than 1 minute or less?
>> >
>>
>> More than 5 minutes.
>
> That "never" enough.
>

You were EXACTLY right! It took 5 minutes 46 seconds, and it finished!
My fault, I guess, didn't wait long enough...

But it is just "echo"! Is this normal that it takes so long? When I do
the same test with WS GRAM and Fork, it completes in less than minute.


>> UC-GT4          echo_gt4_mpi    /bin/echo       INSTALLED
>> INTEL32::LINUX  GLOBUS::host_count="2:compute",jobType=mpi
>
> I'm a bit unsure whether that would work. I think that
> host_types="ia32-compute" has worked in the past on that site.
>

Well, it does work with "qsub":

[fedorov at tg-login1 GT_test]$ qsub -I -l nodes=1:compute
qsub: waiting for job 1724062.tg-master.uc.teragrid.org to start
qsub: job 1724062.tg-master.uc.teragrid.org ready

----------------------------------------
Begin PBS Prologue Fri Jun 20 16:50:24 CDT 2008
Job ID:         1724062.tg-master.uc.teragrid.org
Username:       fedorov
Group:          allocate
Nodes:          tg-c013
End PBS Prologue Fri Jun 20 16:50:24 CDT 2008
----------------------------------------
[fedorov at tg-c013 ~]$ logout

It also does work with pre-WS and WS GRAM on UC, when I use globusrun.
Why doesn't it work with Swift tc.data?

ia32-compute requests visualization nodes on UC, I need 64-bit compute
nodes (node names look like tg-c0*). I tried to use ia64-compute:

UC-GT4          echo_gt4_mpi    /bin/echo       INSTALLED
INTEL32::LINUX  GLOBUS::host_count="2:ia64-compute",jobType=mpi

but this doesn't finish at all (this time I waited for 10 minutes, to
be safe), and the job doesn't appear in the qsub.

Andrey


From benc at hawaga.org.uk  Sun Jun 22 14:01:16 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 22 Jun 2008 19:01:16 +0000 (GMT)
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <82f536810806201408t7b7643adw733756be522327ae@mail.gmail.com>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
	<82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
	<1213994382.16201.5.camel@localhost>
	<82f536810806201408t7b7643adw733756be522327ae@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0806221858530.3645@dildano.hawaga.org.uk>


On Fri, 20 Jun 2008, Andriy Fedorov wrote:

> In the qstat I see one job running on 1 node (the node type is NOT
> "compute"), then it finishes, then Swift script reports "Final status:
>  Finished successfully:1", and then I see SECOND job with 1 node
> running in qstat.

You should see a final job run after every successful Swift run, to clean 
up working files. So this second job does not seem incorrect - it isn't 
specified in your SwiftScript; instead you get it automatically.


> Note, I can request resources correctly and run MPI jobs when using
> RSL job description with globusrun, and using XML (with job
> description extensions) and globusrun-ws. I can post those, if you are
> interested. Therefore, I do not think the second problem I mentioned
> is a GT problem.

Yes, please post.

If you could make it something that we could run ourselves to recreate 
when poking round (eg some kind of hello world rather than a complicated 
application) that would be especially useful.

-- 


From benc at hawaga.org.uk  Sun Jun 22 14:05:58 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 22 Jun 2008 19:05:58 +0000 (GMT)
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
	<82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0806221901470.3645@dildano.hawaga.org.uk>


On Fri, 20 Jun 2008, Andriy Fedorov wrote:

> The problem turned out to be in the setup of our firewall. For the
> time being, I cannot resolve it, so I am trying to use TeraGrid UC
> site for testing.

Were you running the Swift command on a TG UC login node, submitting to TG 
UC?


> $ swift first.swift                                     
> Unable to find required classes (javax.activation.DataHandler and               
> javax.mail.internet.MimeMultipart). Attachment support is disabled.             
> Swift 0.5 swift-r1783 cog-r1962                                                 
                                                                                
> RunID: 20080620-1514-in8cf2yg                                                   
> Progress:                                                                       
> echo_gt2_pbs started                                                            
> Progress:  Stage in:1                                                           
> Progress:  Executing:1                                                          
> Progress:  Executing:1                                                          
> ...                             

The above looks more like something I'd expect from a WS-GRAM setup - 
specifically, the "Unable to find required classes..." message (which is 
safe to ignore here) and the symptom of going into "Executing" state, 
seeing it run in qstat but not detected so by Swift.

Please can you run this one again and check that you really are submitting 
to pre-WS gram there?


-- 


From benc at hawaga.org.uk  Sun Jun 22 14:14:14 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 22 Jun 2008 19:14:14 +0000 (GMT)
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <82f536810806201527o1461538alf5e2da171cd3c3e8@mail.gmail.com>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<1213320388.1281.1.camel@localhost>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
	<82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
	<1213994382.16201.5.camel@localhost>
	<82f536810806201408t7b7643adw733756be522327ae@mail.gmail.com>
	<1213998648.17754.2.camel@localhost>
	<82f536810806201527o1461538alf5e2da171cd3c3e8@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0806221911030.3645@dildano.hawaga.org.uk>


On Fri, 20 Jun 2008, Andriy Fedorov wrote:

> But it is just "echo"! Is this normal that it takes so long? When I do
> the same test with WS GRAM and Fork, it completes in less than minute.

Its a symptom of pre-ws gram + pbs + the way PBS is configured on TG UC.

pre-ws gram reports a job finished to the caller (eg Swift) at the point 
that it disappears from the queue.

wsgram reports a job finished to the called (eg Swift) at the point that 
it is marked completed in the queue.

PBS on TG UC is configured to keep jobs around for a long time (5 mins or 
so) in completed state - that means you can run qstat and see the jobs 
after they have finished, and see that they have reached completed state. 
But it also means that pre-WS gram does not report completion for much 
longer.

That happens on all my pre-ws gram testing of UC and is expected (by me) 
behaviour.

-- 


From fedorov at cs.wm.edu  Sun Jun 22 14:44:17 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Sun, 22 Jun 2008 15:44:17 -0400
Subject: [Swift-user] Running first.swift remotely on NCSA
In-Reply-To: <Pine.LNX.4.64.0806221911030.3645@dildano.hawaga.org.uk>
References: <82f536810806121819g546ea332w187039d75b39be02@mail.gmail.com>
	<82f536810806130627k4de1cc1dta4860206a80de8a1@mail.gmail.com>
	<1213365581.6448.8.camel@localhost>
	<82f536810806130734m32aa1ac8sb1324bd9b78a931d@mail.gmail.com>
	<82f536810806201323y73684caaifa2770ce052da6c3@mail.gmail.com>
	<1213994382.16201.5.camel@localhost>
	<82f536810806201408t7b7643adw733756be522327ae@mail.gmail.com>
	<1213998648.17754.2.camel@localhost>
	<82f536810806201527o1461538alf5e2da171cd3c3e8@mail.gmail.com>
	<Pine.LNX.4.64.0806221911030.3645@dildano.hawaga.org.uk>
Message-ID: <82f536810806221244u4b786aam911b15197c01efe5@mail.gmail.com>

>> In the qstat I see one job running on 1 node (the node type is NOT
>> "compute"), then it finishes, then Swift script reports "Final status:
>>  Finished successfully:1", and then I see SECOND job with 1 node
>> running in qstat.

> You should see a final job run after every successful Swift run, to clean
> up working files. So this second job does not seem incorrect - it isn't
> specified in your SwiftScript; instead you get it automatically.

I understand. I was not aware of this.

>> But it is just "echo"! Is this normal that it takes so long? When I do
>> the same test with WS GRAM and Fork, it completes in less than minute.
>
> Its a symptom of pre-ws gram + pbs + the way PBS is configured on TG UC.
>

Your explanation makes perfect sense! I assume, in order to avoid this
delay, I better figure out how to run everything with WS GRAM.

It's really nice to know these little details, so that I can
understand a little bit how it works under the hood.


>> The problem turned out to be in the setup of our firewall. For the
>> time being, I cannot resolve it, so I am trying to use TeraGrid UC
>> site for testing.

> Were you running the Swift command on a TG UC login node, submitting to TG
> UC?

Yes, correct.

>> $ swift first.swift
>> Unable to find required classes (javax.activation.DataHandler and
>> javax.mail.internet.MimeMultipart). Attachment support is disabled.
>> Swift 0.5 swift-r1783 cog-r1962

>> RunID: 20080620-1514-in8cf2yg
>> Progress:
>> echo_gt2_pbs started
>> Progress:  Stage in:1
>> Progress:  Executing:1
>> Progress:  Executing:1
>> ...

> The above looks more like something I'd expect from a WS-GRAM setup -
> specifically, the "Unable to find required classes..." message (which is
> safe to ignore here) and the symptom of going into "Executing" state,
> seeing it run in qstat but not detected so by Swift.

> Please can you run this one again and check that you really are submitting
> to pre-WS gram there?

Yes, it was submitted to pre-WS GRAM. I just ran it again, and it
behaves the same way, the only difference now is that I can explain
what is going on!

I think the first problem I had (running Swift with pre-WS+PBS) is
completely resolved, now that I understand the sources of those
delays.

I will do some testing for the second issue (passing GRAM parameters
for WS-GRAM+PBS), and prepare simple scripts and configuration files I
use, so that you could recreate the problem.

Thanks again for your help!

Andrey


From lixi at uchicago.edu  Mon Jun 30 09:03:09 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Mon, 30 Jun 2008 09:03:09 -0500 (CDT)
Subject: [Swift-user] No response of Swift run
Message-ID: <20080630090309.BBT73100@m4500-03.uchicago.edu>

Hi,

I launched a Swift workflow (including 2001 jobs) at 16:16 
yesterday. At 17:20, it returned the results of 2000 jobs, 
then there is no reponse any more. I wonder why? I enabled 
the replication option.

The log file is very large (more 1G) and is on CI:
/home/lixi/newswift/test/newversion/workflowtest-20080629-
1616-c4h22j03.log

Please check it, thanks

Xi