From benc at hawaga.org.uk  Fri Aug  1 02:12:25 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 07:12:25 +0000 (GMT)
Subject: [Swift-user] Re: swift script calling procedure
In-Reply-To: <489202BE.8090606@mcs.anl.gov>
References: <4891F94F.6090004@uchicago.edu> <489202BE.8090606@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0808010709350.5076@dildano.hawaga.org.uk>


On Thu, 31 Jul 2008, Michael Wilde wrote:

> Perhaps the code he tried was wrong or never worked, or perhaps the case of
> the function or the case checking rules changed between the time this last
> worked and now.

The checking rules are definitely stronger than they were before - this is 
likely from Milena's checks.

Whether Swift should be case sensitive on identifiers seems poorly 
defined; karajan will take either case, but the XML-related history would 
suggest that case should be significant (specifically a QName is case 
sensitive).

I'm not terribly fussed either way (though I'd probably go for case 
sensitivity to be more C/Java-like) but I think it would be good for this 
to be defined deliberately rather than as an accident of the compile-time 
and runtime layers.

-- 


From wilde at mcs.anl.gov  Fri Aug  1 06:37:31 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 01 Aug 2008 06:37:31 -0500
Subject: [Swift-user] Re: swift script calling procedure
In-Reply-To: <Pine.LNX.4.64.0808010709350.5076@dildano.hawaga.org.uk>
References: <4891F94F.6090004@uchicago.edu> <489202BE.8090606@mcs.anl.gov>
	<Pine.LNX.4.64.0808010709350.5076@dildano.hawaga.org.uk>
Message-ID: <4892F57B.2040406@mcs.anl.gov>

I vote in favor of case sensitivity.

On 8/1/08 2:12 AM, Ben Clifford wrote:
> On Thu, 31 Jul 2008, Michael Wilde wrote:
> 
>> Perhaps the code he tried was wrong or never worked, or perhaps the case of
>> the function or the case checking rules changed between the time this last
>> worked and now.
> 
> The checking rules are definitely stronger than they were before - this is 
> likely from Milena's checks.
> 
> Whether Swift should be case sensitive on identifiers seems poorly 
> defined; karajan will take either case, but the XML-related history would 
> suggest that case should be significant (specifically a QName is case 
> sensitive).
> 
> I'm not terribly fussed either way (though I'd probably go for case 
> sensitivity to be more C/Java-like) but I think it would be good to be 
> defined more than as accidents of the compile-time and runtime layer.
> 


From zhaozhang at uchicago.edu  Fri Aug  1 09:37:56 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 Aug 2008 09:37:56 -0500
Subject: [Swift-user] problem starting swift
Message-ID: <48931FC4.4050902@uchicago.edu>

Hi

I started swift to run 15352 jobs, then swift failed to start with this 
message

Execution failed:
        java.util.ConcurrentModificationException

The log file is at 
http://www.ci.uchicago.edu/~zzhang/dock2-20080801-0923-menbb7zg.log

So I started just the first 12000 tasks instead. That is OK for now: jobs 
are going through, and I have seen some return successfully.

best wishes
zhangzhao


From benc at hawaga.org.uk  Fri Aug  1 09:42:13 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 14:42:13 +0000 (GMT)
Subject: [Swift-user] problem starting swift
In-Reply-To: <48931FC4.4050902@uchicago.edu>
References: <48931FC4.4050902@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808011440450.5076@dildano.hawaga.org.uk>


On Fri, 1 Aug 2008, Zhao Zhang wrote:

> I started swift to run 15352 jobs, then swift failed to start with this
> message
> 
> Execution failed:
>        java.util.ConcurrentModificationException

ok, I see what causes that.

-- 


From zhaozhang at uchicago.edu  Fri Aug  1 10:29:36 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 Aug 2008 10:29:36 -0500
Subject: [Swift-user] problem starting swift
In-Reply-To: <Pine.LNX.4.64.0808011440450.5076@dildano.hawaga.org.uk>
References: <48931FC4.4050902@uchicago.edu>
	<Pine.LNX.4.64.0808011440450.5076@dildano.hawaga.org.uk>
Message-ID: <48932BE0.6010206@uchicago.edu>

Hi, Ben

Could you tell me more details?
Thanks

zhao

Ben Clifford wrote:
> On Fri, 1 Aug 2008, Zhao Zhang wrote:
>
>   
>> I started swift to run 15352 jobs, then swift failed to start with this
>> message
>>
>> Execution failed:
>>        java.util.ConcurrentModificationException
>>     
>
> ok, I see what causes that.
>
>   


From benc at hawaga.org.uk  Fri Aug  1 10:25:36 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 15:25:36 +0000 (GMT)
Subject: [Swift-user] problem starting swift
In-Reply-To: <Pine.LNX.4.64.0808011440450.5076@dildano.hawaga.org.uk>
References: <48931FC4.4050902@uchicago.edu>
	<Pine.LNX.4.64.0808011440450.5076@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0808011524530.22488@dildano.hawaga.org.uk>

try swift r2168


From benc at hawaga.org.uk  Fri Aug  1 10:38:52 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 15:38:52 +0000 (GMT)
Subject: [Swift-user] problem starting swift
In-Reply-To: <48932BE0.6010206@uchicago.edu>
References: <48931FC4.4050902@uchicago.edu>
	<Pine.LNX.4.64.0808011440450.5076@dildano.hawaga.org.uk>
	<48932BE0.6010206@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808011536350.22488@dildano.hawaga.org.uk>


On Fri, 1 Aug 2008, Zhao Zhang wrote:

> could you tell me more in details?

There is a map of job statuses maintained for the progress status display. 
Every time the progress line is displayed all of these statuses are 
counted. If any status is changed while that count is happening, then the 
exception you see is raised.
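
As a rough illustration of that failure mode (a sketch only, not the Swift 
code: Swift is written in Java, where the error is 
java.util.ConcurrentModificationException, while Python's closest analogue 
is a RuntimeError raised when a dict changes size during iteration; all 
names below are hypothetical):

```python
def count_statuses(job_status):
    """Count how many jobs are in each state, as a progress line might."""
    counts = {}
    for _job, status in job_status.items():
        counts[status] = counts.get(status, 0) + 1
    return counts


def count_while_modified(job_status):
    """Same count, but a new job is registered part-way through the loop."""
    counts = {}
    try:
        for _job, status in job_status.items():
            counts[status] = counts.get(status, 0) + 1
            # Stand-in for another thread touching the map mid-count.
            job_status["job-late"] = "Submitted"
    except RuntimeError as exc:
        return "failed: %s" % exc
    return counts


def count_snapshot(job_status):
    """Iterate over a copy so later changes cannot disturb the count."""
    counts = {}
    for _job, status in list(job_status.items()):
        counts[status] = counts.get(status, 0) + 1
        job_status["job-late"] = "Submitted"  # harmless: we loop over the copy
    return counts


jobs = {"job-1": "Active", "job-2": "Active", "job-3": "Submitted"}
print(count_statuses(dict(jobs)))
print(count_while_modified(dict(jobs)))
print(count_snapshot(dict(jobs)))
```

In a genuinely multi-threaded program the copy itself would also have to be 
taken under whatever lock the writers use; the snapshot here only shows the 
general idea.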

-- 


From fedorov at cs.wm.edu  Fri Aug  1 13:31:51 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 1 Aug 2008 14:31:51 -0400
Subject: [Swift-user] Swift scheduler
Message-ID: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>

Hi,

I have some general questions about the scheduling policy Swift is using.

For example, suppose I have an application, which has multiple
mappings to different remote sites. How is the submission site going
to be selected? In case I have long queueing delays on the selected
site, can Swift detect that, and submit job to a different site?

Can any of the developers point me to the specific part of the source
that is responsible for scheduling, so that I could try to figure this
out myself?

Thanks!

--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov


From zhaozhang at uchicago.edu  Fri Aug  1 13:37:45 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 Aug 2008 13:37:45 -0500
Subject: [Swift-user] problem starting swift
In-Reply-To: <Pine.LNX.4.64.0808011536350.22488@dildano.hawaga.org.uk>
References: <48931FC4.4050902@uchicago.edu>
	<Pine.LNX.4.64.0808011440450.5076@dildano.hawaga.org.uk>
	<48932BE0.6010206@uchicago.edu>
	<Pine.LNX.4.64.0808011536350.22488@dildano.hawaga.org.uk>
Message-ID: <489357F9.9040707@uchicago.edu>

Thanks, Ben

I rebuilt swift and ran a small-scale test; it works OK.
I am ready for a larger test, but am still waiting for resources.
I will post the results as soon as I get them.

zhao

Ben Clifford wrote:
> On Fri, 1 Aug 2008, Zhao Zhang wrote:
>
>   
>> could you tell me more in details?
>>     
>
> There is a map of job statuses maintained for the progress status display. 
> Every time the progress line is displayed all of these statuses are 
> counted. If any status is changed while that count is happening, then the 
> exception you see is raised.
>
>   


From benc at hawaga.org.uk  Fri Aug  1 13:39:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 18:39:46 +0000 (GMT)
Subject: [Swift-user] Swift scheduler
In-Reply-To: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808011835410.5076@dildano.hawaga.org.uk>


On Fri, 1 Aug 2008, Andriy Fedorov wrote:

> For example, suppose I have an application, which has multiple
> mappings to different remote sites. How is the submission site going
> to be selected? 

Each site has a score which reflects how many jobs will be sent to that 
site at once. The score goes up as the site is used successfully and goes 
down as there are problems with the site.

When it's time to submit a job, one of the sites which has free space 
(score - actual load) will get the job.
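
A minimal sketch of that bookkeeping, assuming a made-up Site class and an 
invented fixed adjustment step (the real scheduler's arithmetic is not 
described in this thread):

```python
class Site:
    """Hypothetical per-site record: a score and the current load."""

    def __init__(self, name, score=1.0):
        self.name = name
        self.score = score
        self.load = 0

    def free_space(self):
        # Free space as described above: score minus actual load.
        return self.score - self.load

    def job_started(self):
        self.load += 1

    def job_finished(self, ok, step=0.5):
        # 'step' is an invented value, used only to show the direction
        # in which the score moves on success and on failure.
        self.load -= 1
        self.score += step if ok else -step


def sites_with_free_space(sites):
    """Return the sites that could accept another job right now."""
    return [site for site in sites if site.free_space() > 0]


a, b = Site("siteA", score=2.0), Site("siteB", score=1.0)
a.job_started()
b.job_started()
print([site.name for site in sites_with_free_space([a, b])])
```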

> In case I have long queueing delays on the selected site, can Swift 
> detect that, and submit job to a different site?

yes. In recent trunk code there is a feature called 'replication' whereby 
jobs will be submitted up to two (three?) more times if they take more 
than three times the average time for jobs. Look in swift.properties for 
the three replication.* properties.

In the past we've discussed more complicated selection algorithms than 3 * 
mean.
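
The "3 * mean" rule could be sketched like this. The function and parameter 
names are hypothetical, and the cap of two extra copies is just the first 
of the two figures guessed at above:

```python
from statistics import mean


def should_replicate(elapsed, finished_durations, copies_so_far,
                     factor=3.0, max_extra_copies=2):
    """Decide whether a slow job deserves another copy elsewhere.

    elapsed            -- seconds this job has been waiting or running
    finished_durations -- durations of jobs that have already completed
    copies_so_far      -- extra copies already submitted for this job
    """
    if not finished_durations:
        return False  # no average to compare against yet
    if copies_so_far >= max_extra_copies:
        return False  # replication limit reached
    return elapsed > factor * mean(finished_durations)


history = [10, 20, 30]  # mean is 20, so the threshold is 60 seconds
print(should_replicate(100, history, copies_so_far=0))
print(should_replicate(50, history, copies_so_far=0))
```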

> Can any of the developers point me to the specific part of the source
> that is responsible for scheduling, so that I could try to figure this
> out myself?

Start here:

cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java.

-- 


From hategan at mcs.anl.gov  Fri Aug  1 13:46:29 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 01 Aug 2008 13:46:29 -0500
Subject: [Swift-user] Swift scheduler
In-Reply-To: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
Message-ID: <1217616389.22481.3.camel@localhost>

On Fri, 2008-08-01 at 14:31 -0400, Andriy Fedorov wrote:
> Hi,
> 
> I have some general questions about the scheduling policy Swift is using.
> 
> For example, suppose I have an application, which has multiple
> mappings to different remote sites. How is the submission site going
> to be selected?

In principle a score is kept for each site. The score varies based on
the number of successful submissions to that site and (negatively) with
current load. Sites are picked using a weighted random out of the pool
of sites (the weights being the scores).
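
A small sketch of a weighted random pick with the scores as weights. This 
uses Python's random.choices as a stand-in and is not taken from the 
scheduler source; the site names and scores are made up:

```python
import random


def pick_site(scores, rng=random):
    """Pick one site name at random, weighted by its score."""
    names = list(scores)
    # Treat negative scores as zero so they can never be chosen.
    weights = [max(scores[name], 0.0) for name in names]
    if sum(weights) <= 0:
        raise ValueError("no site has a positive score")
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
scores = {"siteA": 8.0, "siteB": 2.0, "siteC": 0.0}
picks = [pick_site(scores, rng) for _ in range(1000)]
print({name: picks.count(name) for name in scores})
```

Over many picks siteA should receive roughly four times as many jobs as 
siteB, and siteC none at all.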

>  In case I have long queueing delays on the selected
> site, can Swift detect that, and submit job to a different site?

There's something called replication which can be enabled in
swift.properties to do that.

> 
> Can any of the developers point me to the specific part of the source
> that is responsible for scheduling, so that I could try to figure this
> out myself?

Things start around here in principle:
http://cogkit.svn.sourceforge.net/viewvc/cogkit/trunk/current/src/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java?view=log

> 
> Thanks!
> 
> --
> Andrey Fedorov
> 
> Center for Real-Time Computing
> College of William and Mary
> http://www.cs.wm.edu/~fedorov


From iraicu at cs.uchicago.edu  Fri Aug  1 13:19:42 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 01 Aug 2008 13:19:42 -0500
Subject: [Swift-user] CFP: Workshop on Many-Task Computing on Grids and
 Supercomputers (MTAGS08) co-located with IEEE/ACM SC08
Message-ID: <489353BE.8000205@cs.uchicago.edu>

Dear all,
This is our final CFP for MTAGS08.  Note that the submission guidelines have changed.  The relevant change is:
*A 250 word abstract (PDF format) must be submitted online at 
https://cmt.research.microsoft.com/MTAGS2008/ before the deadline of August 15th, 
2008 at 11:59PM PST; the final 6/10 page papers in PDF format will be due on 
September 6th, 2008 at 11:59PM PST.*

We look forward to a successful workshop!

Cheers,
Ioan Raicu
http://dsl.cs.uchicago.edu/MTAGS08/

================================================================================

Call for Papers

--------------------------------------------------------------------------------
The 1st IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS)
http://dsl.cs.uchicago.edu/MTAGS08/  
--------------------------------------------------------------------------------
November 17, 2008
Austin, Texas, USA

Co-located with IEEE/ACM International Conference for 
High Performance Computing, Networking, Storage and Analysis (SC08)

================================================================================
The 1st workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 
will provide the scientific community a dedicated forum for presenting new 
research, development, and deployment efforts of loosely coupled large scale 
applications on large scale clusters, Grids, and/or Supercomputers. Many-task 
computing, the theme of the workshop, encompasses loosely coupled applications, 
which are generally composed of many tasks (both independent and dependent 
tasks) to achieve some larger application goal.  We welcome paper submissions 
on all topics related to MTC on large scale systems.  Papers will be 
peer-reviewed, and accepted papers will be published by IEEE/ACM through the 
SC08 proceedings (pending approval). For more information, please visit 
http://dsl.cs.uchicago.edu/MTAGS08/.

Scope
--------------------------------------------------------------------------------
This workshop will focus on the ability to manage and execute large scale 
applications on today's largest clusters, Grids, and Supercomputers. Clusters 
with 50K+ processor cores are beginning to come online (i.e. TACC Sun 
Constellation System - Ranger), Grids (i.e. TeraGrid) with a dozen sites and 
100K+ processors, and supercomputers with 160K processors (i.e. IBM BlueGene/P). 
Large clusters and supercomputers have traditionally been high performance 
computing (HPC) systems, as they are efficient at executing tightly coupled 
parallel jobs within a particular machine with low-latency interconnects; the 
applications typically use message passing interface (MPI) to achieve the needed 
inter-process communication. On the other hand, Grids have been the preferred 
platform for more loosely coupled applications that tend to be managed and 
executed through workflow systems. In contrast to HPC (tightly coupled 
applications), these loosely coupled applications make up a new class of 
applications that we call Many-Task Computing (MTC). MTC systems generally 
involve the execution of independent, sequential jobs that can be individually 
scheduled on many different computing resources across multiple administrative 
boundaries. MTC systems typically achieve this using various grid computing 
technologies and techniques, and often use files for inter-process 
communication as an alternative to MPI. 
MTC is reminiscent of High Throughput Computing (HTC); however, MTC differs 
from HTC in the emphasis of using many computing resources over short periods 
of time to accomplish many computational tasks, where the primary metrics are 
measured in seconds (e.g. FLOPS, tasks/sec, MB/s I/O rates). HTC, on the other 
hand, requires large amounts of computing for longer times (months and years, 
rather than hours and days), and is generally measured in operations per month.
  
Today's existing HPC systems are a viable platform to host MTC applications. 
However, some challenges arise in large scale applications when run on large 
scale systems, which can hamper the efficiency and utilization of these large 
scale systems.  These challenges include local resource manager scalability 
and granularity, efficient utilization of the raw hardware, shared file system 
contention and scalability, reliability at scale, application scalability, and 
understanding the limitations of the HPC systems in order to identify good 
candidate MTC applications. 

For more information, please visit http://dsl.cs.uchicago.edu/MTAGS08/. 

Topics
--------------------------------------------------------------------------------
MTAGS 2008 topics of interest include, but are not limited to:
*	Compute Resource Management in large scale clusters, large Grids, and Supercomputers
	o	Scheduling
	o	Job execution frameworks
	o	Local resource manager extensions
	o	Performance evaluation of resource managers in use on large scale systems
	o	Challenges in running many-task workloads on HPC systems
*	Data Management in large scale Grid and Supercomputer environments: 
	o	Data-Aware Scheduling
	o	Shared File System performance and scalability in large deployments
	o	Distributed file systems
	o	Data caching frameworks and techniques
*	Large-Scale Workflow Systems
	o	Workflow system performance and scalability analysis
	o	Scalability of workflow systems
	o	Workflow infrastructure and e-Science middleware
	o	Programming Paradigms and Models
*	Large-Scale Many-Task Applications
	o	Large-scale many-task applications
	o	Large-scale many-task data-intensive applications
	o	Large-scale high throughput computing (HTC) applications
	o	Quasi-supercomputing applications, deployments, and experiences 

Paper Submission and Publication
--------------------------------------------------------------------------------
Authors are invited to submit papers with unpublished, original work of not more 
than 6/10 pages (6 pages for short papers, and 10 pages for standard papers) of 
double column text using single spaced 10 point size on 8.5 x 11 inch pages, as 
per IEEE 8.5 x 11 manuscript guidelines 
(ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct.pdf or 
ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct.doc). A 250 word 
abstract (PDF format) must be submitted online at 
https://cmt.research.microsoft.com/MTAGS2008/ before the deadline of August 15th, 
2008 at 11:59PM PST; the final 6/10 page papers in PDF format will be due on 
September 6th, 2008 at 11:59PM PST.  Papers will be peer-reviewed, and accepted 
papers will be published in the workshop proceedings as part of the IEEE digital 
library. Notifications of the paper decisions will be sent out by October 1st. 
Selected excellent work may be eligible for additional post-conference publication 
as journal articles or book chapters. Submission implies the willingness of at 
least one of the authors to register and present the paper. For more information, 
please visit http://dsl.cs.uchicago.edu/MTAGS08/.  

Important Dates
--------------------------------------------------------------------------------
*	Abstract Due:			August 15th, 2008
*	Papers Due:			September 6th, 2008
*	Notification of Acceptance:	October 1st, 2008
*	Camera Ready Papers Due:	October 15th, 2008
*	Workshop Date:			November 17th, 2008


Committee Members
--------------------------------------------------------------------------------
Workshop Chairs
*	Yong Zhao, Microsoft
*	Ian Foster, University of Chicago & Argonne National Laboratory
*	Ioan Raicu, University of Chicago

Technical Committee
*	David Abramson, Monash University, Australia
*	Dan Ardelean, Google, USA
*	Pete Beckman, Argonne National Laboratory, USA 
*	Peter Dinda, Northwestern University, USA
*	Ian Foster, University of Chicago & Argonne National Laboratory, USA 
*	Alan Gara, IBM, USA 
*	Bob Grossman, University of Illinois at Chicago, USA 
*	Indranil Gupta, University of Illinois at Urbana Champaign, USA 
*	Alexandru Iosup, Delft University of Technology, Netherlands 
*	Tevfik Kosar, Louisiana State University, USA 
*	Chuang Liu, Ask.com, USA 
*	Shiyong Lu, Wayne State University, USA 
*	Reagan Moore, University of California at San Diego, USA 
*	Steven Newhouse, Microsoft, USA
*	Cristina Nita-Rotaru, Purdue University, USA 
*	Marlon Pierce, Indiana University, USA
*	Ioan Raicu, University of Chicago, USA 
*	Dan Reed, Microsoft, USA 
*	Matei Ripeanu, University of British Columbia, Canada 
*	Rick Stevens, University of Chicago & Argonne National Laboratory, USA 
*	Xian-He Sun, Illinois Institute of Technology, USA
*	Alex Szalay, The Johns Hopkins University, USA 
*	Douglas Thain, University of Notre Dame, USA 
*	Greg Thain, University of Wisconsin, USA
*	Mike Wilde, University of Chicago & Argonne National Laboratory, USA
*	Matthew Woitaszek, The University Corporation for Atmospheric Research, USA 
*	Lingyun Yang, Yahoo Search, USA 
*	Sherali Zeadally, University of the District of Columbia, USA 
*	Yong Zhao, Microsoft, USA


-- 
===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================



From zhaozhang at uchicago.edu  Fri Aug  1 13:49:25 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 01 Aug 2008 13:49:25 -0500
Subject: [Swift-user] is there a way to send 256 tasks at a time to one site
Message-ID: <48935AB5.2080807@uchicago.edu>

Hi,

For the purpose of efficiency of swift on BGP, is there a way for us to 
send 256 tasks at a time to one site?
Thanks

zhao


From fedorov at cs.wm.edu  Fri Aug  1 13:53:50 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 1 Aug 2008 14:53:50 -0400
Subject: [Swift-user] Swift scheduler
In-Reply-To: <1217616389.22481.3.camel@localhost>
References: <82f536810808011131qc6ca52v528ef4f2cac73371@mail.gmail.com>
	<1217616389.22481.3.camel@localhost>
Message-ID: <82f536810808011153t4f2fbd60pccff7c912b997b36@mail.gmail.com>

Ben, Michael -- thank you for your quick and complete answers!


On Fri, Aug 1, 2008 at 2:46 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Fri, 2008-08-01 at 14:31 -0400, Andriy Fedorov wrote:
>> Hi,
>>
>> I have some general questions about the scheduling policy Swift is using.
>>
>> For example, suppose I have an application, which has multiple
>> mappings to different remote sites. How is the submission site going
>> to be selected?
>
> In principle a score is kept for each site. The score varies based on
> the number of successful submissions to that site and (negatively) with
> current load. Sites are picked using a weighted random out of the pool
> of sites (the weights being the scores).
>
>>  In case I have long queueing delays on the selected
>> site, can Swift detect that, and submit job to a different site?
>
> There's something called replication which can be enabled in
> swift.properties to do that.
>
>>
>> Can any of the developers point me to the specific part of the source
>> that is responsible for scheduling, so that I could try to figure this
>> out myself?
>
> Things start around here in principle:
> http://cogkit.svn.sourceforge.net/viewvc/cogkit/trunk/current/src/cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHostScoreScheduler.java?view=log
>
>>
>> Thanks!
>>
>> --
>> Andrey Fedorov
>>
>> Center for Real-Time Computing
>> College of William and Mary
>> http://www.cs.wm.edu/~fedorov
>
>


From benc at hawaga.org.uk  Fri Aug  1 13:54:09 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 18:54:09 +0000 (GMT)
Subject: [Swift-user] is there a way to send 256 tasks at a time to one
	site
In-Reply-To: <48935AB5.2080807@uchicago.edu>
References: <48935AB5.2080807@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808011850340.5076@dildano.hawaga.org.uk>


On Fri, 1 Aug 2008, Zhao Zhang wrote:

> For the purpose of efficiency of swift on BGP, is there a way for us to send
> 256 tasks at a time to one site?

Do you mean in one single job submission call? Swift will send the tasks 
separately to the provider layer (which probably in your case is 
provider-deef), so no. How the provider-layer gets tasks to the execution 
site is up to the provider - so potentially yes there. It might involve 
changing the falkon wire protocol though.

-- 


From iraicu at cs.uchicago.edu  Fri Aug  1 13:56:59 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 01 Aug 2008 13:56:59 -0500
Subject: [Swift-user] is there a way to send 256 tasks at a time to one
	site
In-Reply-To: <48935AB5.2080807@uchicago.edu>
References: <48935AB5.2080807@uchicago.edu>
Message-ID: <48935C7B.8060807@cs.uchicago.edu>

Zhao,
The Falkon provider has a queue in which tasks from Karajan pile up, 
and once every so often (on some polling interval), the provider will 
submit tasks to Falkon (from this queue).  I think the polling interval 
is set to 1 second, so whatever tasks end up in the queue every second 
will go out in 1 WS call to Falkon.  We can make this polling interval 
longer by changing the code and recompiling.  Now, if you are referring 
to the Swift scheduler not sending enough tasks (i.e. 256 of them), 
so that you never get to populate all CPUs with work, then that 
is a different question, which Mihael or Ben can hopefully answer.

Ioan

Zhao Zhang wrote:
> Hi,
>
> For the purpose of efficiency of swift on BGP, is there a way for us 
> to send 256 tasks at a time to one site?
> Thanks
>
> zhao
>



From iraicu at cs.uchicago.edu  Fri Aug  1 13:58:58 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 01 Aug 2008 13:58:58 -0500
Subject: [Swift-user] is there a way to send 256 tasks at a time to one
	site
In-Reply-To: <Pine.LNX.4.64.0808011850340.5076@dildano.hawaga.org.uk>
References: <48935AB5.2080807@uchicago.edu>
	<Pine.LNX.4.64.0808011850340.5076@dildano.hawaga.org.uk>
Message-ID: <48935CF2.6050601@cs.uchicago.edu>

The Falkon provider already bunches multiple tasks into a single 
WS call, as long as they are in the provider queue at the time when it 
checks the queue, which it does every second... this polling interval 
can be increased if you are finding that only a few tasks 
get submitted every time.  Or we can make it threshold-based as well: 
wait X seconds, or Y tasks... it wouldn't be hard to implement different 
strategies to wait for more tasks...
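
One way the "wait X seconds, or Y tasks" idea might look, as a sketch only. 
TaskBatcher and its methods are hypothetical and are not part of the 
Falkon provider; time is passed in explicitly so the behaviour is easy to 
follow:

```python
class TaskBatcher:
    """Hold tasks until enough have arrived or enough time has passed."""

    def __init__(self, max_wait, max_tasks):
        self.max_wait = max_wait    # X: seconds to wait at most
        self.max_tasks = max_tasks  # Y: queue length that forces a flush
        self.pending = []
        self.first_arrival = None

    def add(self, task, now):
        if not self.pending:
            self.first_arrival = now  # the wait starts with the first task
        self.pending.append(task)

    def poll(self, now):
        """Return a batch to submit, or [] if it is worth waiting longer."""
        if not self.pending:
            return []
        waited = now - self.first_arrival
        if len(self.pending) >= self.max_tasks or waited >= self.max_wait:
            batch, self.pending = self.pending, []
            self.first_arrival = None
            return batch
        return []


batcher = TaskBatcher(max_wait=5.0, max_tasks=256)
for i in range(10):
    batcher.add("task-%d" % i, now=0.0)
print(len(batcher.poll(now=1.0)))  # too few tasks, too early: nothing sent
print(len(batcher.poll(now=5.0)))  # waited long enough: all 10 go out
```

Each returned batch would correspond to one WS call; everything pending is 
sent at once, so a batch may hold more than Y tasks.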

Ioan

Ben Clifford wrote:
> On Fri, 1 Aug 2008, Zhao Zhang wrote:
>
>   
>> For the purpose of efficiency of swift on BGP, is there a way for us to send
>> 256 tasks at a time to one site?
>>     
>
> Do you mean in one single job submission call? Swift will send the tasks 
> separately to the provider layer (which probably in your case is 
> provider-deef), so no. How the provider-layer gets tasks to the execution 
> site is up to the provider - so potentially yes there. It might involve 
> changing the falkon wire protocol though.
>
>   




From benc at hawaga.org.uk  Fri Aug  1 14:08:59 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 1 Aug 2008 19:08:59 +0000 (GMT)
Subject: [Swift-user] is there a way to send 256 tasks at a time to one
	site
In-Reply-To: <48935C7B.8060807@cs.uchicago.edu>
References: <48935AB5.2080807@uchicago.edu> <48935C7B.8060807@cs.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808011859130.5076@dildano.hawaga.org.uk>


On Fri, 1 Aug 2008, Ioan Raicu wrote:

> recompiling.  Now, if you are referring to the Swift scheduler, that it doesn't
> send enough tasks (i.e. 256 of them), which means that you never get to
> populate all CPUs with work, then that is a different question, which Mihael
> or Ben can hopefully answer.

You can make swift send jobs quite fast; fiddle with jobThrottle and 
initialScore values for your site.

If you want swift to peak at sending three times as many jobs as you 
have CPUs, set job throttle to 3 * numCPUs / 100 (e.g. for 50 CPUs, set it 
to 1.5).

You can set initialScore to make submissions start nearer the full rate 
rather than starting slowly. Set it high (a few hundred, the exact value 
doesn't matter so much here).

Both of these are keys to set in the karajan namespace in profile entries 
in your sites file:

  <profile namespace="karajan" key="jobThrottle">1.5</profile>
  <profile namespace="karajan" key="initialScore">1000</profile>
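
The arithmetic above as a small helper, in case it is useful for generating 
the entries. The function names are made up; only the 3 * numCPUs / 100 
rule and the two profile keys come from this message:

```python
def job_throttle(num_cpus, jobs_per_cpu=3):
    """jobThrottle value for a peak of jobs_per_cpu jobs per CPU."""
    return jobs_per_cpu * num_cpus / 100


def profile_lines(num_cpus, initial_score=1000, jobs_per_cpu=3):
    """Render the two karajan profile entries for a sites file."""
    throttle = job_throttle(num_cpus, jobs_per_cpu)
    return [
        '<profile namespace="karajan" key="jobThrottle">%s</profile>'
        % throttle,
        '<profile namespace="karajan" key="initialScore">%s</profile>'
        % initial_score,
    ]


for line in profile_lines(50):
    print(line)
```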

-- 


From wilde at mcs.anl.gov  Fri Aug  1 19:04:17 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 01 Aug 2008 19:04:17 -0500
Subject: [Swift-user] ext mapper int params getting coerced to float?
Message-ID: <4893A481.5090505@mcs.anl.gov>

It seems as if params to the ext mapper are getting coerced to floats.

The mapper is invoked like this:

   DockOut out < ext; exec="map.dockoutput", d=dir, r=run, n=ndir, o=i >;

When the int values i and ndir reach my mapper script they are coming in 
with ".0" after the int values:

map.docoutput: arg: 8 dir: /home/wilde/ligandatlas/dock/ runid: run06 ndir: 5000.0 outid: 3.0
map.docoutput: arg: 8 dir: /home/wilde/ligandatlas/dock/ runid: run06 ndir: 5000.0 outid: 1.0
map.docoutput: arg: 8 dir: /home/wilde/ligandatlas/dock/ runid: run06 ndir: 5000.0 outid: 0.0
map.docoutput: arg: 8 dir: /home/wilde/ligandatlas/dock/ runid: run06 ndir: 5000.0 outid: 2.0

Is this a result of the internal numeric types being uplifted to floats 
to simplify type checking?

I'm assuming this does not happen when ints are expanded into a command 
line inside an app{} construct, so I would have expected it would not 
happen when passed to a mapper.

--

Script is:

type File;

type DockTarget {
   File nrg;
   File bmp;
   File spheres;
}

type Mol2;
type DockOut;

// rundock-core runid ligfile outfile target grid.bmp grid.nrg selected.spheres

(DockOut out) rundock ( Mol2 ligand, string targetname, DockTarget t)
{
   app { rundockcore ligand out targetname t.nrg t.bmp t.spheres; }
}

string targetName = "1K4M";
string dir = "/home/wilde/ligandatlas/dock/";
string run = "run06";
int ndir = 5000;

DockTarget targetProtein<ext; exec="map.target">;

Mol2 ligand[] <filesys_mapper; 
location="/disks/gpfs/ligandatlas/databases/KEGG_and_Drugs-test", 
suffix=".mol2">;

foreach compound, i in ligand {
   DockOut out < ext; exec="map.dockoutput", d=dir, r=run, n=ndir, o=i >;
   out = rundock( compound, targetName, targetProtein );
}


--

mapper map.dockoutput starts with:

#! /bin/bash

while getopts ":d:r:n:o:" options; do
   case $options in
     d) dir=$OPTARG;;
     r) runid=$OPTARG;;
     n) ndir=$OPTARG;;
     o) outid=$OPTARG;;
     *) echo $usage
        exit 1;;
   esac
done

echo map.docoutput: $*
echo map.docoutput: arg: $#  dir: $dir runid: $runid ndir: $ndir outid: $outid  >> mapper.log
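
A possible workaround inside the mapper is to strip the suffix before the
values are used. This is only a minimal sketch: it assumes whole numbers
always arrive with a literal ".0" suffix, as in the log above, and the
helper name strip_dot_zero is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: drop a trailing ".0" so that 5000.0 becomes 5000.
# Values without that exact suffix (e.g. 2.5 or run06) pass through unchanged.
strip_dot_zero() {
  printf '%s\n' "${1%.0}"
}

# Example values taken from the mapper log.
ndir=$(strip_dot_zero "5000.0")
outid=$(strip_dot_zero "3.0")
echo "ndir: $ndir outid: $outid"
```

In the real mapper the helper would be applied to $ndir and $outid right
after the getopts loop.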


From benc at hawaga.org.uk  Sat Aug  2 04:44:31 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 2 Aug 2008 09:44:31 +0000 (GMT)
Subject: [Swift-user] ext mapper int params getting coerced to float?
In-Reply-To: <4893A481.5090505@mcs.anl.gov>
References: <4893A481.5090505@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0808020941110.5076@dildano.hawaga.org.uk>


On Fri, 1 Aug 2008, Michael Wilde wrote:

> When the int values i and ndir reach my mapper script they are coming in with
> ".0" after the int values:

> Is this a result of the internal numeric types being uplifted to floats
> to simplify type checking?

It's not a result of recent type checking work (or shouldn't be) - that 
work is only prohibitive, in as much as Swift programs which work now 
should behave as they did before, but the set of Swift programs which will 
be accepted is now smaller.

It's likely an artifact of the rather hazy internal handling of numbers in 
the runtime layer.

> I'm assuming this does not happen when ints are expanded into a command line
> inside an app{} construct

No idea, but it would be interesting to try.

-- 


From benc at hawaga.org.uk  Tue Aug  5 06:27:40 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 5 Aug 2008 11:27:40 +0000 (GMT)
Subject: [Swift-user] ext mapper int params getting coerced to float?
In-Reply-To: <4893A481.5090505@mcs.anl.gov>
References: <4893A481.5090505@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0808051126380.29009@dildano.hawaga.org.uk>


On Fri, 1 Aug 2008, Michael Wilde wrote:

> When the int values i and ndir reach my mapper script they are coming in with
> ".0" after the int values:

Swift r2175 changes this so that the same formatting code is used as for 
application arguments. You should now not see the .0 for integers.

-- 


From zhengxiongh at uchicago.edu  Wed Aug  6 16:00:27 2008
From: zhengxiongh at uchicago.edu (Zhengxiong Hou)
Date: Wed,  6 Aug 2008 16:00:27 -0500 (CDT)
Subject: [Swift-user] job monitor
Message-ID: <20080806160027.BJK70244@m4500-01.uchicago.edu>

Hi,
   When executing thousands of independent jobs with Swift on OSG, I
have several questions about job monitoring:

(1) After the jobs have started, how can I monitor their status
(such as queueing, running, completed, failed) in real time?

(2) Are the jobs dispatched to the grid sites, where they wait
for execution in the LOCAL queues of the local resource
manager/job scheduler (such as Condor, PBS, LSF, etc.)?

(3) The standard output of the Swift execution contains
information such as the following:
Sorted: [AGLT2:15.215(39.422):16/16 overload: 0, NYSGRID-CCR-U2:21.317(50.894):18/21 overload: 0]
Sorted: [GLOW-CMS:14.071(36.751):14/15 overload: 0]
Sorted: [GLOW-CMS:14.071(36.751):15/15 overload: 0]

  Does this mean that 16 jobs were dispatched to grid site "AGLT2",
and 15 jobs were dispatched to grid site "GLOW-CMS"?


Thanks!
B.R.
zhengxiong


From lixi at uchicago.edu  Wed Aug  6 23:18:56 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Wed,  6 Aug 2008 23:18:56 -0500 (CDT)
Subject: [Swift-user] Swift run: java.io.IOException: Unknown error
 512
Message-ID: <20080806231856.BCW02035@m4500-03.uchicago.edu>

Hi,

I ran a workflow like this:
[lixi at communicado test]$ /home/lixi/performancetest/4/cog/modules/vdsk/dist/vdsk-svn/bin/swift \
    -sites.file ../sitesfile/SELECT1/sites2.0808062300.xml \
    -tc.file ../tc.data testworkflow.swift >0808062300.log 2>&1 &

During the execution, it stopped suddenly. The stdout and stderr are in
/home/lixi/performancetest/test/0808062300.log. It seems that it stopped
due to "java.io.IOException: Unknown error 512".

The log file is
/home/lixi/performancetest/test/testworkflow-20080806-2301-m1qbxjr3.log

[lixi at communicado test]$ tail -n 20 0808062300.log 
Sorted: [LIGO_UWM_NEMO:140.112(90.071):37/37 overload: 0]
node10 completed
Sorted: [FLTECH:144.563(90.361):37/37 overload: 0]
node10 completed
Sorted: [UTA_SWT2:147.336(90.533):37/37 overload: 0]
node10 completed
Sorted: [FLTECH:146.739(90.497):37/37 overload: 0]
node10 completed
Sorted: [TTU-ANTAEUS:21.888(51.767):21/21 overload: 0]
Sorted: [TTU-ANTAEUS:22.888(53.230):21/22 overload: 0]
Sorted: [TTU-ANTAEUS:22.888(53.230):22/22 overload: 0]
node10 completed
Progress:  Selecting site:1497 Stage in:19 Executing:170 Stage out:165 Finished successfully:106 Initializing site shared directory:2 Failed but can retry:41
java.io.IOException: Unknown error 512
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:194)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.griphyn.vdl.karajan.InHook.run(InHook.java:39)
        at java.lang.Thread.run(Thread.java:595)

Would you please tell me why such an error happened and what 
to do with it?

Thanks,

Xi


From benc at hawaga.org.uk  Thu Aug  7 02:52:32 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 Aug 2008 07:52:32 +0000 (GMT)
Subject: [Swift-user] job monitor
In-Reply-To: <20080806160027.BJK70244@m4500-01.uchicago.edu>
References: <20080806160027.BJK70244@m4500-01.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808070747310.29009@dildano.hawaga.org.uk>


On Wed, 6 Aug 2008, Zhengxiong Hou wrote:

> (1) After the jobs started, How to monitor the jobs status
> (such as queueing,running,completed,failed) in real-time?

With any Swift after r1696 you should see a status line every 5 to 60 
seconds that looks like this:

Progress:  Selecting site:777 Stage in:80 Executing:83 Stage out:19 Finished successfully:41

Do you want more than this?
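
If the run's output is redirected to a file, the latest status line can be
pulled out with standard tools. The sketch below is self-contained: it
writes a small sample log first (the first Progress line is made up for
illustration, the other lines are copied from this thread) and then
extracts the newest status line and one counter from it. The variable
names are arbitrary.

```shell
#!/bin/bash
# Sample log content for illustration; a real run would write its own log.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
Progress:  Selecting site:900 Stage in:10 Executing:40 Stage out:5 Finished successfully:12
Sorted: [GLOW-CMS:14.071(36.751):15/15 overload: 0]
Progress:  Selecting site:777 Stage in:80 Executing:83 Stage out:19 Finished successfully:41
EOF

# Keep only the status lines and take the most recent one.
latest=$(grep '^Progress:' "$logfile" | tail -n 1)

# Pull a single counter out of that line.
finished=$(printf '%s\n' "$latest" | sed -n 's/.*Finished successfully:\([0-9][0-9]*\).*/\1/p')

echo "$latest"
echo "finished=$finished"
```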

> (2) Are the jobs dispatched to the grid sites, and waiting 
> for execution in the LOCAL queues of the local resource 
> manager/job scheduler (such as Condor,PBS,LSF,etc.)?

Pretty much; jobs will be queued for a while before executing, yes.

> (3) In the standard output of swift execution, there are 
> some information as follow:
> Sorted: [AGLT2:15.215(39.422):16/16 overload: 0, NYSGRID-CCR-U2:21.317(50.894):18/21 overload: 0]
> Sorted: [GLOW-CMS:14.071(36.751):14/15 overload: 0]
> Sorted: [GLOW-CMS:14.071(36.751):15/15 overload: 0]
>      
>   Does this mean that 16 jobs were dispatched to grid 
> site "AGLT2", and 15 jobs were dispatched to grid site "GLOW-CMS"?

Basically yes.

-- 


From benc at hawaga.org.uk  Thu Aug  7 03:09:53 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 Aug 2008 08:09:53 +0000 (GMT)
Subject: [Swift-user] Re: [Swift-devel] Swift run: java.io.IOException:
	Unknown error 512
In-Reply-To: <20080806231856.BCW02035@m4500-03.uchicago.edu>
References: <20080806231856.BCW02035@m4500-03.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808070805470.29009@dildano.hawaga.org.uk>


Can you reproduce it?

Google shows occurrences of that exception (unknown err 512 in 
FileInputStream.readBytes) happening when the java process has been set to 
run in the background, when reading from the console.

Were you doing anything like that? (eg running with & after the command or 
pressing ctrl-z)
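
If the run does have to go into the background, one common way to keep a
background process from reading the terminal is to redirect its standard
input from /dev/null. The sketch below uses a stand-in function (run_client
is hypothetical and only imitates a program that reads stdin); whether
Swift itself then runs cleanly is an assumption that would need testing.

```shell
#!/bin/bash
# Hypothetical stand-in for a long-running client that reads standard input.
run_client() {
  cat > /dev/null          # consumes stdin until end of file
  echo "client finished"
}

logfile=$(mktemp)

# stdin comes from /dev/null, so the background job never reads the terminal.
run_client < /dev/null > "$logfile" 2>&1 &
wait "$!"

cat "$logfile"
```

For the real command this would mean adding "< /dev/null" before the
output redirection.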

-- 


From lixi at uchicago.edu  Thu Aug  7 08:01:59 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Thu,  7 Aug 2008 08:01:59 -0500 (CDT)
Subject: [Swift-user] Fwd: Re: [Swift-devel] Swift run: java.io.IOException:
 Unknown error 512
Message-ID: <20080807080159.BCW25121@m4500-03.uchicago.edu>


-------------- next part --------------
An embedded message was scrubbed...
From: <lixi at uchicago.edu>
Subject: Re: [Swift-devel] Swift run: java.io.IOException: Unknown error 512
Date: Thu,  7 Aug 2008 07:26:49 -0500 (CDT)
Size: 971
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20080807/07d08dd3/attachment.eml>

From lixi at uchicago.edu  Thu Aug  7 08:02:17 2008
From: lixi at uchicago.edu (lixi at uchicago.edu)
Date: Thu,  7 Aug 2008 08:02:17 -0500 (CDT)
Subject: [Swift-user] Fwd: Re: [Swift-devel] Swift run: java.io.IOException:
 Unknown error 512
Message-ID: <20080807080217.BCW25140@m4500-03.uchicago.edu>


-------------- next part --------------
An embedded message was scrubbed...
From: <lixi at uchicago.edu>
Subject: Re: [Swift-devel] Swift run: java.io.IOException: Unknown error 512
Date: Thu,  7 Aug 2008 07:38:01 -0500 (CDT)
Size: 1075
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20080807/8de91ef7/attachment.eml>

From fedorov at cs.wm.edu  Thu Aug  7 09:47:52 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 7 Aug 2008 10:47:52 -0400
Subject: [Swift-user] Need help debugging strange problem...
Message-ID: <82f536810808070747u39a91838l3b41a01fd5fac602@mail.gmail.com>

Hi,

I have a Swift script that runs fine on the UC TG site, and now I am
trying to add NCSA to the set of execution sites, but I have some
strange problems, and I am not sure how to debug them.

First, I submit a simple script (below) to NCSA Mercury with the GT4 Fork
jobmanager, and it works. When I change the jobmanager from "fork" to
"PBS", the Swift execution does not finish after the PBS job
completes. I see the job submitted, queued in PBS, running, and
completing, and I see the output file produced in the scratch
directory, but on the submission site I have "Progress: Executing:1".
The submission site is the same as for the example with the "fork"
jobmanager, so I don't see how a firewall can be the issue, and I can
telnet to the submission site from NCSA.

Note that I was able to run the same simple test with both the fork and
PBS jobmanagers on the SDSC TG site.

How can I figure out what is wrong with NCSA Mercury?


sites.xml: (as in http://www.teragrid.org/userinfo/jobs/gram.php)

<pool handle="NCSA-GT4">
  <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org:2811/" />
  <!-- the jobmanager attribute below is where I change PBS/fork -->
  <execution provider="gt4" jobmanager="PBS"
  url="https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
  <workdirectory>/home/ac/fedorov/scratch</workdirectory>
</pool>


tc.data:

NCSA-GT4   NCSA_hostname /sbin/ifconfig INSTALLED INTEL32::LINUX null

hello.swift:

type messagefile{}

(messagefile uc_hostname) hostname2(){
  app{
    NCSA_hostname stdout=@filename(uc_hostname);
  }
}

messagefile uc_hostname<"uc_hostname.txt">;
messagefile ncsa_hostname<"ncsa_hostname.txt">;

ncsa_hostname = hostname2();

--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov


From benc at hawaga.org.uk  Thu Aug  7 10:27:13 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 7 Aug 2008 15:27:13 +0000 (GMT)
Subject: [Swift-user] Need help debugging strange problem...
In-Reply-To: <82f536810808070747u39a91838l3b41a01fd5fac602@mail.gmail.com>
References: <82f536810808070747u39a91838l3b41a01fd5fac602@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808071521390.29009@dildano.hawaga.org.uk>

There is a somewhat common misconfiguration of GRAM4 on the server side 
where it is wired into the local queueing system incorrectly, so that 
completion notifications do not find their way back. This matches the 
symptoms you describe: fork works but PBS doesn't, even though the 
job appears to have run.

I just tried a submission using the GT4 command line job submission 
command:

$ globusrun-ws -submit -F \
    https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
    -Ft Fork -job-command /bin/hostname
Submitting job...


But it appears to hang without submitting. Not sure what is happening with 
that site...

Aside from that, my advice for diagnosis would be to try the above command 
with both Fork and PBS and see if you get the same difference in behaviour 
between the two.

-- 


From fedorov at cs.wm.edu  Thu Aug  7 11:23:53 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 7 Aug 2008 12:23:53 -0400
Subject: [Swift-user] Need help debugging strange problem...
In-Reply-To: <Pine.LNX.4.64.0808071521390.29009@dildano.hawaga.org.uk>
References: <82f536810808070747u39a91838l3b41a01fd5fac602@mail.gmail.com>
	<Pine.LNX.4.64.0808071521390.29009@dildano.hawaga.org.uk>
Message-ID: <82f536810808070923o7ff2d3a9le86d0f7db41c61dc@mail.gmail.com>

Ben,

I tried what you suggested, and I have globusrun-ws working from UC
submitting to NCSA, using Fork factory type:

[fedorov at TG/UC:tg-login1 ~/swiftBiofem] globusrun-ws -submit -F
https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService
-Ft Fork -job-command /bin/hostname
Submitting job...Done.
Job ID: uuid:3b8f1662-649c-11dd-9347-0007e9d811ce
Termination time: 08/08/2008 16:16 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.

But it fails when I am using PBS factory. globusrun-ws doesn't exit,
while I see job finished on NCSA.

[fedorov at TG/UC:tg-login1 ~/swiftBiofem] globusrun-ws -submit -F
https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService
-Ft PBS -job-command /bin/hostname
Submitting job...Done.
Job ID: uuid:dc23433c-649c-11dd-9671-0007e9d811ce
Termination time: 08/08/2008 16:21 GMT
Current job state: Unsubmitted

I am going to report this to TG help.

--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov




From feller at mcs.anl.gov  Thu Aug  7 11:29:09 2008
From: feller at mcs.anl.gov (Martin Feller)
Date: Thu, 07 Aug 2008 11:29:09 -0500
Subject: [Swift-user] Re: Need help debugging strange problem...
In-Reply-To: <FFDD81BC-AC71-48BE-9947-2028608BADFE@mcs.anl.gov>
References: <Pine.LNX.4.64.0808071521390.29009@dildano.hawaga.org.uk>
	<FFDD81BC-AC71-48BE-9947-2028608BADFE@mcs.anl.gov>
Message-ID: <489B22D5.6010500@mcs.anl.gov>

Andriy:

Can you please try the following:

Submit a dummy job in batch mode to Fork and PBS and query for job status
instead of relying on notifications:

globusrun-ws -submit \
   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
   -Ft Fork \
   -b -e forkJob.epr \
   -c /bin/hostname

then try

globusrun-ws -status -j forkJob.epr

and see if you see changes in state of your job after a while

Same for PBS:

globusrun-ws -submit \
   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
   -Ft PBS \
   -b -e pbsJob.epr \
   -c /bin/hostname

globusrun-ws -status -j pbsJob.epr

(
  later on remove those jobs calling
  globusrun-ws -kill -j pbsJob.epr
  globusrun-ws -kill -j forkJob.epr
)

If you see job state changes that had not been reported using globusrun-ws in
interactive mode, then it's a notification problem. But I don't think this is
the case.
I suspect the problem is that GRAM4 does not get informed about job state changes
by the scheduler event generator (SEG).
We once had the problem that the job state changes just didn't show up in the
SEG logs, due to SEG <--> filesystem issues (I think it was Lustre).

Before speculating about this: please run the batch jobs and tell me what you get.

Martin




From feller at mcs.anl.gov  Thu Aug  7 11:32:00 2008
From: feller at mcs.anl.gov (Martin Feller)
Date: Thu, 07 Aug 2008 11:32:00 -0500
Subject: [Swift-user] Re: Need help debugging strange problem...
In-Reply-To: <489B22D5.6010500@mcs.anl.gov>
References: <Pine.LNX.4.64.0808071521390.29009@dildano.hawaga.org.uk>	<FFDD81BC-AC71-48BE-9947-2028608BADFE@mcs.anl.gov>
	<489B22D5.6010500@mcs.anl.gov>
Message-ID: <489B2380.3020804@mcs.anl.gov>

Oh, I see I made an error here:
please replace "-b -e" with "-b -o" in the globusrun-ws options.
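
With that correction applied, the two batch submissions and the status
queries might look as follows. This is a sketch only: the run helper and
the DRY_RUN switch are hypothetical additions so that the commands can be
inspected before anything is submitted, and only globusrun-ws flags
already used in this thread appear.

```shell
#!/bin/bash
FACTORY=https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService

# Set DRY_RUN=0 to actually run the commands; by default they are only printed.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Submit a dummy batch job for one factory type and save its EPR to a file.
submit_batch() {
  run globusrun-ws -submit -F "$FACTORY" -Ft "$1" -b -o "$2" -c /bin/hostname
}

submit_batch Fork forkJob.epr
submit_batch PBS  pbsJob.epr

# Query the job state instead of waiting for notifications.
run globusrun-ws -status -j forkJob.epr
run globusrun-ws -status -j pbsJob.epr
```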

Martin



From fedorov at cs.wm.edu  Thu Aug  7 11:39:30 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 7 Aug 2008 12:39:30 -0400
Subject: [Swift-user] Re: Need help debugging strange problem...
Message-ID: <82f536810808070939s6092a279q7280c03874599640@mail.gmail.com>

Martin,

I tried what you suggested. The status of the job remains
"Unsubmitted" on the submission site, while I see the job completes on
NCSA Mercury.

I reported this problem to TG help, and will post an update if I hear
any explanation from them.

Andrey




From benc at hawaga.org.uk  Fri Aug  8 03:38:27 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 8 Aug 2008 08:38:27 +0000 (GMT)
Subject: [Swift-user] job monitor
In-Reply-To: <20080806160027.BJK70244@m4500-01.uchicago.edu>
References: <20080806160027.BJK70244@m4500-01.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0808080836210.29009@dildano.hawaga.org.uk>


On Wed, 6 Aug 2008, Zhengxiong Hou wrote:

> (2) Are the jobs dispatched to the grid sites, and waiting 
> for execution in the LOCAL queues of the local resource 
> manager/job scheduler (such as Condor,PBS,LSF,etc.)?

Related to this, in swift r2183, I changed the progress ticker a bit to 
reflect some of this state a bit more. What used to be "Executing" is now 
three different states:

 Submitting -> Submitted -> Active

'Submitted' is when the job is in a queue at the remote site, and 'Active' 
is when the job is running on the remote site (at least as far as the 
Swift runtime is aware).

-- 


From hategan at mcs.anl.gov  Mon Aug 11 17:23:27 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 11 Aug 2008 17:23:27 -0500
Subject: [Swift-user] test; please ignore
Message-ID: <1218493407.13994.2.camel@localhost>


From fedorov at cs.wm.edu  Thu Aug 14 16:16:11 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 14 Aug 2008 17:16:11 -0400
Subject: [Swift-user] Node specification for local PBS/Torque provider
Message-ID: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com>

Hi,

I was unable to find the details on how nodes should be specified when
the local PBS provider is used.

With sequential jobs, I am able to use the GLOBUS attribute maxWallTime.
When requesting multiple nodes, I tried to specify the number with the
host_types and host_count attributes, but each time I get only one node,
and the PBS_* variables (like PBS_NODEFILE) are not defined on that
single allocated node.

Can anyone help me with this?

Thanks
--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov


From hategan at mcs.anl.gov  Thu Aug 14 18:59:26 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 14 Aug 2008 18:59:26 -0500
Subject: [Swift-user] Node specification for local PBS/Torque provider
In-Reply-To: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com>
References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com>
Message-ID: <1218758366.25840.3.camel@localhost>

On Thu, 2008-08-14 at 17:16 -0400, Andriy Fedorov wrote:
> Hi,
> 
> I was unable to find the details on how nodes should be specified when
> local PBS provider is used.

Looking at the code, host_types and host_count don't seem to be handled
by the pbs provider.
You should file a bug report with cog
(http://bugzilla.mcs.anl.gov/globus/enter_bug.cgi?product=CoG%20Kit) (so
that I don't forget about it) and it will get fixed eventually. The
exact timing depends on how badly you need it.

Mihael


From fedorov at cs.wm.edu  Fri Aug 15 07:51:38 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 15 Aug 2008 08:51:38 -0400
Subject: [Swift-user] Node specification for local PBS/Torque provider
In-Reply-To: <1218758366.25840.3.camel@localhost>
References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com>
	<1218758366.25840.3.camel@localhost>
Message-ID: <82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com>

Mihael,

thank you for the reply. I looked through the cog source and the qsub
scripts it generates, and I think I localized the problem.

> You should file a bug report with cog
> (http://bugzilla.mcs.anl.gov/globus/enter_bug.cgi?product=CoG%20Kit) (so
> that I don't forget about it) and it will get fixed eventually.

I logged the bug.

> The exact timing depends on how badly you need it.

Well, I am not sure what you mean by this... I certainly did not find
this bug having nothing to do and browsing through the CoG code. I
tried to accomplish something, and now I am blocked because of this
bug.

If it takes too long, I guess I will have to fix it myself. I don't
know how to rush a bug fix. I know you guys are very busy and have
your own priorities...

--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov




From benc at hawaga.org.uk  Fri Aug 15 08:15:01 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 15 Aug 2008 13:15:01 +0000 (GMT)
Subject: [Swift-user] Node specification for local PBS/Torque provider
In-Reply-To: <82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com>
References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com>
	<1218758366.25840.3.camel@localhost>
	<82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808151313430.29009@dildano.hawaga.org.uk>


On Fri, 15 Aug 2008, Andriy Fedorov wrote:

> If it takes too long, I guess I will have to fix it myself.
> I don't know how to rush a bug fix. I know you guys are very busy and 
> have your own priorities...

Bug fixes are pretty much always welcome if you do fix it yourself. My 
priority at the moment is a week-long vacation, but if it's still broken 
when I come back, I'll look at it then.

-- 


From fedorov at cs.wm.edu  Fri Aug 15 08:32:42 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 15 Aug 2008 09:32:42 -0400
Subject: [Swift-user] Node specification for local PBS/Torque provider
In-Reply-To: <Pine.LNX.4.64.0808151313430.29009@dildano.hawaga.org.uk>
References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com>
	<1218758366.25840.3.camel@localhost>
	<82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com>
	<Pine.LNX.4.64.0808151313430.29009@dildano.hawaga.org.uk>
Message-ID: <82f536810808150632r2fdb458aoea9553c724498186@mail.gmail.com>

Looking at the cog code, I see that I can use the attribute "count" to
specify the number and type of nodes for PBS. For me this is good
enough. I updated the bug status accordingly to WORKSFORME.

--
Fedorov

On Fri, Aug 15, 2008 at 9:15 AM, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> On Fri, 15 Aug 2008, Andriy Fedorov wrote:
>
>> If it takes too long, I guess I will have to fix it myself.
>> I don't know how to rush a bug fix. I know you guys are very busy and
>> have your own priorities...
>
> Bug fixes are pretty much always welcome if you do fix it yourself. My
> priority at the moment is a week-long vacation, but if it's still broken
> when I come back, I'll look at it then.
>
> --
>
>


From hategan at mcs.anl.gov  Fri Aug 15 10:33:31 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 15 Aug 2008 10:33:31 -0500
Subject: [Swift-user] Node specification for local PBS/Torque provider
In-Reply-To: <82f536810808150632r2fdb458aoea9553c724498186@mail.gmail.com>
References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com>
	<1218758366.25840.3.camel@localhost>
	<82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com>
	<Pine.LNX.4.64.0808151313430.29009@dildano.hawaga.org.uk>
	<82f536810808150632r2fdb458aoea9553c724498186@mail.gmail.com>
Message-ID: <1218814411.31924.2.camel@localhost>

On Fri, 2008-08-15 at 09:32 -0400, Andriy Fedorov wrote:
> Looking at the cog code, I see that I can use the attribute "count" to
> specify the number and type of nodes for PBS. For me this is good
> enough. I updated the bug status accordingly to WORKSFORME.

There should be a uniform way of specifying this stuff. So I think that
should be fixed.

Mihael


From fedorov at cs.wm.edu  Fri Aug 15 10:35:18 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Fri, 15 Aug 2008 11:35:18 -0400
Subject: [Swift-user] Node specification for local PBS/Torque provider
In-Reply-To: <1218814411.31924.2.camel@localhost>
References: <82f536810808141416g16b14a78kd0829b3fe30ae211@mail.gmail.com>
	<1218758366.25840.3.camel@localhost>
	<82f536810808150551l67fcc8e7m54bf38778f6cc6d7@mail.gmail.com>
	<Pine.LNX.4.64.0808151313430.29009@dildano.hawaga.org.uk>
	<82f536810808150632r2fdb458aoea9553c724498186@mail.gmail.com>
	<1218814411.31924.2.camel@localhost>
Message-ID: <82f536810808150835j32cfe274x4b7c94e38837b889@mail.gmail.com>

On Fri, Aug 15, 2008 at 11:33 AM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Fri, 2008-08-15 at 09:32 -0400, Andriy Fedorov wrote:
>> Looking at the cog code, I see that I can use the attribute "count" to
>> specify the number and type of nodes for PBS. For me this is good
>> enough. I updated the bug status accordingly to WORKSFORME.
>
> There should be a uniform way of specifying this stuff.

Oh, I totally agree with you on this one!...

>
> Mihael
>
>
>


From benc at hawaga.org.uk  Mon Aug 25 03:51:31 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 25 Aug 2008 08:51:31 +0000 (GMT)
Subject: [Swift-user] Swift 0.6 released
Message-ID: <Pine.LNX.4.64.0808250831430.22488@dildano.hawaga.org.uk>


Swift 0.6 is online for download at 
http://www.ci.uchicago.edu/swift/downloads/

In addition to a bunch of bugfixes, the most interesting changes are:

 * much more rigorous compile-time type checking - this catches many
   more errors at the start rather than hours into a run, and gives more
   useful error reports.

 * better multisite handling:
     +  job replication - when a job has been queued for much longer than 
        average, Swift can launch a replica of the job on another site. 
        This helps when making multisite runs where one site has a much
        longer queue time than another.
     +  rate limiting for bad sites - poorly scored sites are now rate
        limited much more than in previous versions of Swift, with very
        poorly scored sites being delayed between executions.

 * cog coasters - this is a new execution provider that allows a single
   'coaster' job to be submitted per worker node which pulls in Swift 
   jobs. This can greatly reduce the number of jobs submitted to the
   underlying job submission mechanism (such as GRAM2) allowing more jobs 
   to be submitted; it also can reduce the amount of time jobs spend in
   the LRM queue by sending them directly to an already-executing coaster.


-- 


From fedorov at cs.wm.edu  Mon Aug 25 18:46:46 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Mon, 25 Aug 2008 19:46:46 -0400
Subject: [Swift-user] Returning arrays of files
Message-ID: <82f536810808251646h18ada8e1ibb1e08fcb40d7eb0@mail.gmail.com>

Hi,

I have a procedure that wraps an application, which creates a large
number of files. The names of those files are not passed as input
arguments, but I know their names in advance.

I was trying to handle this by doing the following (this is a fragment
of the complete code):

(file bImages[], file wImages[]) prepareImages(file fImage, file rImage, file pList){
  app {
    BMPrepareImages @fImage @rImage @pList;
  }
}

prepareImages1(file fImage, file rImage, file pList){
  app {
    BMPrepareImages @fImage @rImage @pList;
  }
}

string bImageNames[];
string wImageNames[];

iterate i {
  bImageNames[i] = @strcat("block_",i,".nii.gz");
  wImageNames[i] = @strcat("window_",i,".nii.gz");
}until(i==numPoints-1);

file bImages[]<array_mapper;files=bImageNames>;
file wImages[]<array_mapper;files=wImageNames>;

(bImages,wImages) = prepareImages(fImageRsmooth,rImageRsmooth,fImagePointList);

But when I run this script, Swift tells me

Progress:
Progress:
Progress:
Progress:
Progress:
...

and nothing happens...

When I run "prepareImages1" instead of "prepareImages", it finishes,
but of course I don't get the output files. So the problem cannot be
in the application. I suspect there's something wrong with the way I
specify the output of the procedure.

Can anyone tell me what is wrong here?

--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov


From benc at hawaga.org.uk  Tue Aug 26 03:34:21 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 26 Aug 2008 08:34:21 +0000 (GMT)
Subject: [Swift-user] Returning arrays of files
In-Reply-To: <82f536810808251646h18ada8e1ibb1e08fcb40d7eb0@mail.gmail.com>
References: <82f536810808251646h18ada8e1ibb1e08fcb40d7eb0@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808260705490.29009@dildano.hawaga.org.uk>


Probably what is happening is that, in the section below, Swift can't deal 
with mapper parameter inputs being constructed dynamically. One day 
hopefully it will be able to - it makes sense in the language.

This situation is a bit annoying - you could use the simple mapper if these 
were input files...

> string bImageNames[];
> 
> iterate i {
>   bImageNames[i] = @strcat("block_",i,".nii.gz");
>   wImageNames[i] = @strcat("window_",i,".nii.gz");
> }until(i==numPoints-1);
> 
> file bImages[]<array_mapper;files=bImageNames>;

-- 


From fedorov at cs.wm.edu  Tue Aug 26 14:29:16 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 26 Aug 2008 15:29:16 -0400
Subject: [Swift-user] NullPointerException
Message-ID: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com>

Hi,

Is there any advice on how I can debug the following error while
trying to run a Swift script:

Could not compile SwiftScript source: java.lang.NullPointerException

(this is the only error message I get)

--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov


From hategan at mcs.anl.gov  Tue Aug 26 14:45:17 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Tue, 26 Aug 2008 14:45:17 -0500
Subject: [Swift-user] NullPointerException
In-Reply-To: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com>
Message-ID: <1219779917.13302.5.camel@localhost>

On Tue, 2008-08-26 at 15:29 -0400, Andriy Fedorov wrote:
> Hi,
> 
> Is there any advice on how I can debug the following error while
> trying to run a Swift script:
> 
> Could not compile SwiftScript source: java.lang.NullPointerException
> 
> (this is the only error message I get)

The -d flag to swift should provide more details (perhaps a stack
trace).


From fedorov at cs.wm.edu  Tue Aug 26 15:03:57 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 26 Aug 2008 16:03:57 -0400
Subject: [Swift-user] NullPointerException
In-Reply-To: <1219779917.13302.5.camel@localhost>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com>
	<1219779917.13302.5.camel@localhost>
Message-ID: <82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>

On Tue, Aug 26, 2008 at 3:45 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Tue, 2008-08-26 at 15:29 -0400, Andriy Fedorov wrote:
>> Hi,
>>
>> Is there any advice on how I can debug the following error while
>> trying to run a Swift script:
>>
>> Could not compile SwiftScript source: java.lang.NullPointerException
>>
>> (this is the only error message I get)
>
> The -d flag to swift should provide more details (perhaps a stack
> trace).
>

Yes, thanks. Still, no idea how to connect this error with my Swift source:

Could not compile SwiftScript source: java.lang.NullPointerException
Full parser exception
java.lang.NullPointerException
        at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:515)
        at org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:458)
        at org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
        at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:93)
        at org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:693)
        at org.antlr.stringtemplate.StringTemplate.toString(StringTemplate.java:1412)
        at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:64)
        at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:46)
        at org.griphyn.vdl.karajan.Loader.compile(Loader.java:207)
        at org.griphyn.vdl.karajan.Loader.main(Loader.java:123)
Exception when compiling ../pe_script_500-1.swift
org.griphyn.vdl.toolkit.VDLt2VDLx$ParsingException: Could not compile SwiftScript source: null
        at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:68)
        at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:46)
        at org.griphyn.vdl.karajan.Loader.compile(Loader.java:207)
        at org.griphyn.vdl.karajan.Loader.main(Loader.java:123)
Caused by:
java.lang.NullPointerException
        at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:515)
        at org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:458)
        at org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
        at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:93)
        at org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:693)
        at org.antlr.stringtemplate.StringTemplate.toString(StringTemplate.java:1412)
        at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:64)
        ... 3 more


>
>


From benc at hawaga.org.uk  Tue Aug 26 15:10:10 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 26 Aug 2008 20:10:10 +0000 (GMT)
Subject: [Swift-user] NullPointerException
In-Reply-To: <82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com>
	<1219779917.13302.5.camel@localhost>
	<82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808262009370.29009@dildano.hawaga.org.uk>


Something is not right in the compiler - it should either be working or 
giving a more meaningful error message. Can you send the source file that 
causes this?

On Tue, 26 Aug 2008, Andriy Fedorov wrote:

> On Tue, Aug 26, 2008 at 3:45 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > On Tue, 2008-08-26 at 15:29 -0400, Andriy Fedorov wrote:
> >> Hi,
> >>
> >> Is there any advice on how I can debug the following error while
> >> trying to run a Swift script:
> >>
> >> Could not compile SwiftScript source: java.lang.NullPointerException
> >>
> >> (this is the only error message I get)
> >
> > The -d flag to swift should provide more details (perhaps a stack
> > trace).
> >
> 
> Yes, thanks. Still, no idea how to connect this error with my Swift source:
> 
> Could not compile SwiftScript source: java.lang.NullPointerException
> Full parser exception
> java.lang.NullPointerException
>         at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:515)
>         at org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:458)
>         at org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
>         at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:93)
>         at org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:693)
>         at org.antlr.stringtemplate.StringTemplate.toString(StringTemplate.java:1412)
>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:64)
>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:46)
>         at org.griphyn.vdl.karajan.Loader.compile(Loader.java:207)
>         at org.griphyn.vdl.karajan.Loader.main(Loader.java:123)
> Exception when compiling ../pe_script_500-1.swift
> org.griphyn.vdl.toolkit.VDLt2VDLx$ParsingException: Could not compile
> SwiftScript source: null
>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:68)
>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:46)
>         at org.griphyn.vdl.karajan.Loader.compile(Loader.java:207)
>         at org.griphyn.vdl.karajan.Loader.main(Loader.java:123)
> Caused by:
> java.lang.NullPointerException
>         at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:515)
>         at org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:458)
>         at org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
>         at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:93)
>         at org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:693)
>         at org.antlr.stringtemplate.StringTemplate.toString(StringTemplate.java:1412)
>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:64)
>         ... 3 more
> 
> 
> >
> >
> 
> 


From fedorov at cs.wm.edu  Tue Aug 26 15:52:43 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Tue, 26 Aug 2008 16:52:43 -0400
Subject: [Swift-user] NullPointerException
In-Reply-To: <Pine.LNX.4.64.0808262009370.29009@dildano.hawaga.org.uk>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com>
	<1219779917.13302.5.camel@localhost>
	<82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>
	<Pine.LNX.4.64.0808262009370.29009@dildano.hawaga.org.uk>
Message-ID: <82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>

Just to update the list on the resolution of this problem (thanks, Ben
Clifford!).

I was not supposed to have ";" at the end of the "foreach" construct:

foreach i in [0:numPoints-1] {
 ...
}    // <=== NO ";" HERE!!!

--
Andrey Fedorov


On Tue, Aug 26, 2008 at 4:10 PM, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> something not right in the compiler - it should be either working or
> giving a more meaningful error message. can you send the source file that
> causes this?
>
> On Tue, 26 Aug 2008, Andriy Fedorov wrote:
>
>> On Tue, Aug 26, 2008 at 3:45 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>> > On Tue, 2008-08-26 at 15:29 -0400, Andriy Fedorov wrote:
>> >> Hi,
>> >>
>> >> Is there any advice on how I can debug the following error while
>> >> trying to run a Swift script:
>> >>
>> >> Could not compile SwiftScript source: java.lang.NullPointerException
>> >>
>> >> (this is the only error message I get)
>> >
>> > The -d flag to swift should provide more details (perhaps a stack
>> > trace).
>> >
>>
>> Yes, thanks. Still, no idea how to connect this error with my Swift source:
>>
>> Could not compile SwiftScript source: java.lang.NullPointerException
>> Full parser exception
>> java.lang.NullPointerException
>>         at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:515)
>>         at org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:458)
>>         at org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
>>         at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:93)
>>         at org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:693)
>>         at org.antlr.stringtemplate.StringTemplate.toString(StringTemplate.java:1412)
>>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:64)
>>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:46)
>>         at org.griphyn.vdl.karajan.Loader.compile(Loader.java:207)
>>         at org.griphyn.vdl.karajan.Loader.main(Loader.java:123)
>> Exception when compiling ../pe_script_500-1.swift
>> org.griphyn.vdl.toolkit.VDLt2VDLx$ParsingException: Could not compile
>> SwiftScript source: null
>>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:68)
>>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:46)
>>         at org.griphyn.vdl.karajan.Loader.compile(Loader.java:207)
>>         at org.griphyn.vdl.karajan.Loader.main(Loader.java:123)
>> Caused by:
>> java.lang.NullPointerException
>>         at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:515)
>>         at org.antlr.stringtemplate.language.ASTExpr.writeAttribute(ASTExpr.java:458)
>>         at org.antlr.stringtemplate.language.ActionEvaluator.action(ActionEvaluator.java:86)
>>         at org.antlr.stringtemplate.language.ASTExpr.write(ASTExpr.java:93)
>>         at org.antlr.stringtemplate.StringTemplate.write(StringTemplate.java:693)
>>         at org.antlr.stringtemplate.StringTemplate.toString(StringTemplate.java:1412)
>>         at org.griphyn.vdl.toolkit.VDLt2VDLx.compile(VDLt2VDLx.java:64)
>>         ... 3 more
>>
>>
>> >
>> >
>>
>>
>


From benc at hawaga.org.uk  Tue Aug 26 15:54:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 26 Aug 2008 20:54:46 +0000 (GMT)
Subject: [Swift-user] NullPointerException
In-Reply-To: <82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> 
	<1219779917.13302.5.camel@localhost>
	<82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>
	<Pine.LNX.4.64.0808262009370.29009@dildano.hawaga.org.uk>
	<82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808262054160.22488@dildano.hawaga.org.uk>


On Tue, 26 Aug 2008, Andriy Fedorov wrote:

> Just to update the list on the resolution of this problem (thanks, Ben
> Clifford!).
> 
> I was not supposed to have ";" at the end of the "foreach" construct:
> 
> foreach i in [0:numPoints-1] {
>  ...
> }    // <=== NO ";" HERE!!!

Though this perhaps should be allowed, inasmuch as empty statements are 
allowed.

-- 


From benc at hawaga.org.uk  Wed Aug 27 04:49:29 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 27 Aug 2008 09:49:29 +0000 (GMT)
Subject: [Swift-user] NullPointerException
In-Reply-To: <82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>
References: <82f536810808261229t2c515e18s43c5ac1aee429e09@mail.gmail.com> 
	<1219779917.13302.5.camel@localhost>
	<82f536810808261303x607b1070l3ad7ddf5f7a5b90f@mail.gmail.com>
	<Pine.LNX.4.64.0808262009370.29009@dildano.hawaga.org.uk>
	<82f536810808261352o483d3239id30d98d550f846e0@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808270937070.22488@dildano.hawaga.org.uk>


On Tue, 26 Aug 2008, Andriy Fedorov wrote:

> I was not supposed to have ";" at the end of the "foreach" construct:

It actually has nothing to do with the foreach - a Swift program consisting 
of only this line:

  ;;

exhibits the same error.

I've changed the language grammar to disallow empty statements, in r2205. 
This will give a different error message now, pointing at the semicolon.
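
For builds older than that revision, a crude text scan can point at likely 
culprits before the compiler falls over. This is only a sketch: the 
find_empty_statements helper and the file name are made up for illustration, 
it is not part of Swift, and because it is a plain pattern match it will 
also flag semicolons inside strings or comments.

```shell
# Hypothetical helper, not part of Swift: print (with line numbers) every line
# where a ";" directly follows the start of the line, another ";", or a "}".
# This is a text heuristic only; it does not parse SwiftScript.
find_empty_statements() {
  grep -nE '(^|[;}])[[:space:]]*;' "$1"
}

# Assumed usage; the file name is a placeholder.
# find_empty_statements my_script.swift
```

An ordinary statement such as "x = 1;" is not reported, because its 
semicolon follows an expression rather than a "}" or another ";".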

-- 


From fedorov at cs.wm.edu  Wed Aug 27 15:08:59 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Wed, 27 Aug 2008 16:08:59 -0400
Subject: [Swift-user] tcp.port.range in Swift 0.6
Message-ID: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com>

Hello,

It appears to me that Swift 0.6 does not properly handle a custom
GLOBUS_TCP_PORT_RANGE.

In the previous release, I used env variable $GLOBUS_TCP_PORT_RANGE to
set this range.

In the current release, it seems like tcp.port.range can be specified
in etc/swift.properties.

I still have the env variable, and tried to specify the port range in
swift.properties, and I tried passing it as -tcp.port.range, but Swift
keeps opening ports outside the specified range, as reported by
netstat. The same simple test script works with 0.5, but not with 0.6.

This seems like a bug to me.

--
Andrey Fedorov

Center for Real-Time Computing
College of William and Mary
http://www.cs.wm.edu/~fedorov


From benc at hawaga.org.uk  Thu Aug 28 10:09:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 28 Aug 2008 15:09:54 +0000 (GMT)
Subject: [Swift-user] tcp.port.range in Swift 0.6
In-Reply-To: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com>
References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808281504500.29009@dildano.hawaga.org.uk>


On Wed, 27 Aug 2008, Andriy Fedorov wrote:

> It appears to me that Swift 0.6 does not handle properly custom
> GLOBUS_TCP_PORT_RANGE.

I just tried 0.6 and a head build from about 24h ago on my laptop, and 
both appear to respect GLOBUS_TCP_PORT_RANGE for the purposes of gram4 
notification sinks.

Can you give more details about what you see? For example, the output of 
/usr/bin/env, the lines that you see in netstat -pant, and the sites.xml 
file that you are using.

> I still have the env variable, and tried to specify the port range in
> swift.properties, and I tried passing it as -tcp.port.range, but Swift
> keeps opening ports outside the specified range, as reported by
> netstat. The same simple test script works with 0.5, but not with 0.6.

Note that GLOBUS_TCP_PORT_RANGE controls which ports are used for server 
sockets; it does not control which ports are used for outbound 
connections. There is a different variable, GLOBUS_TCP_SOURCE_PORT_RANGE 
that should control the latter.
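
As a rough sketch of the distinction (the 50000,51000 range and the 
port_in_range helper are assumptions for illustration only; use whatever 
range your firewall really allows), both variables take a min,max pair and 
can be exported before starting swift:

```shell
# Assumed example range; substitute the range your firewall permits.
export GLOBUS_TCP_PORT_RANGE=50000,51000          # listening (server) sockets
export GLOBUS_TCP_SOURCE_PORT_RANGE=50000,51000   # outbound connections

# Hypothetical helper: succeed if port $1 lies inside the "min,max" range $2.
port_in_range() {
  port=$1
  min=${2%,*}   # text before the comma
  max=${2#*,}   # text after the comma
  [ "$port" -ge "$min" ] && [ "$port" -le "$max" ]
}

# Example: check a port seen in the LISTEN lines of netstat.
port_in_range 50000 "$GLOBUS_TCP_PORT_RANGE" && echo "50000 is inside the range"
```

When reading netstat output, only the LISTEN entries should be compared 
against GLOBUS_TCP_PORT_RANGE; local ports of established outbound 
connections fall under the source range instead.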

-- 


From fedorov at cs.wm.edu  Thu Aug 28 11:10:39 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 28 Aug 2008 12:10:39 -0400
Subject: [Swift-user] tcp.port.range in Swift 0.6
In-Reply-To: <Pine.LNX.4.64.0808281504500.29009@dildano.hawaga.org.uk>
References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com>
	<Pine.LNX.4.64.0808281504500.29009@dildano.hawaga.org.uk>
Message-ID: <82f536810808280910t3b4b90a1k69b55400ea67e325@mail.gmail.com>

On Thu, Aug 28, 2008 at 11:09 AM, Ben Clifford <benc at hawaga.org.uk> wrote:
>
> On Wed, 27 Aug 2008, Andriy Fedorov wrote:
>
>> It appears to me that Swift 0.6 does not handle properly custom
>> GLOBUS_TCP_PORT_RANGE.
>
> I just tried 0.6 and a head build from about 24h ago on my laptop, and
> both appear to respect GLOBUS_TCP_PORT_RANGE for the purposes of gram4
> notification sinks
>

Ok, I don't understand what's going on. I am sure it worked with 0.5
yesterday, but it doesn't anymore.

This happens only for the UC TG site, though. I checked netstat again, and
I do have port 50000 listening. I am also able to telnet to this port
from UC, so the firewall is ok. The job is submitted and executed, but
apparently the notification doesn't reach my server.

The relevant part of my sites.xml has not changed since the last time
I had it working:

<pool handle="UC-GT4">
  <gridftp url="gsiftp://tg-gridftp.uc.teragrid.org" />
  <execution provider="gt4" jobmanager="PBS"
    url="https://tg-grid.uc.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
  <workdirectory>/home/fedorov/scratch</workdirectory>
</pool>


From hategan at mcs.anl.gov  Thu Aug 28 11:24:43 2008
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 28 Aug 2008 11:24:43 -0500
Subject: [Swift-user] tcp.port.range in Swift 0.6
In-Reply-To: <82f536810808280910t3b4b90a1k69b55400ea67e325@mail.gmail.com>
References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com>
	<Pine.LNX.4.64.0808281504500.29009@dildano.hawaga.org.uk>
	<82f536810808280910t3b4b90a1k69b55400ea67e325@mail.gmail.com>
Message-ID: <1219940683.18551.9.camel@localhost>

On Thu, 2008-08-28 at 12:10 -0400, Andriy Fedorov wrote:
> On Thu, Aug 28, 2008 at 11:09 AM, Ben Clifford <benc at hawaga.org.uk> wrote:
> >
> > On Wed, 27 Aug 2008, Andriy Fedorov wrote:
> >
> >> It appears to me that Swift 0.6 does not handle properly custom
> >> GLOBUS_TCP_PORT_RANGE.
> >
> > I just tried 0.6 and a head build from about 24h ago on my laptop, and
> > both appear to respect GLOBUS_TCP_PORT_RANGE for the purposes of gram4
> > notification sinks
> >
> 
> Ok, I don't understand what's going on. I am sure it worked with 0.5
> yesterday, but it doesn't anymore.
> 
> This happens only for UC TG site though. I checked again netstat, and
> I do have port 50000 listening. I am also able to telnet to this port
> from UC, firewall is ok. The job is submitted and executed, but
> apparently notification doesn't reach my server.
> 
> The relevant part of my sites.xml has not changed since the last time
> I had it working:

There is one change which might affect things, depending on your exact
configuration.

Previously, in cog.properties or swift.properties, the ip= setting was
the one to use to force a specific client IP. This has changed to
hostname=. However, a higher priority is given to $GLOBUS_HOSTNAME,
which is copied by the startup scripts from $HOSTNAME (if not explicitly
set).

This was done because Ben noticed that it may be desirable to be able to
pass an unresolved DNS name as a callback address, which should be
resolved by the servers when trying to... call back.

So if you were previously relying on ip= in cog/swift.properties, try
setting hostname= instead (it can be a numeric IP). Remove ip=, which
has been deprecated in swift.properties. If that doesn't work (i.e. you
have an improper $HOSTNAME) set $GLOBUS_HOSTNAME.
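
To illustrate the environment part of that precedence, here is a small 
sketch. It is not the actual startup script; pick_callback_host is a 
made-up helper name, and it only covers the fallback from $GLOBUS_HOSTNAME 
to $HOSTNAME, not the hostname= property.

```shell
# Hypothetical helper mirroring the precedence described above:
# an explicitly set GLOBUS_HOSTNAME wins, otherwise fall back to HOSTNAME.
pick_callback_host() {
  explicit=$1
  fallback=$2
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  else
    printf '%s\n' "$fallback"
  fi
}

# Export the result so that programs started from this shell see it.
GLOBUS_HOSTNAME=$(pick_callback_host "${GLOBUS_HOSTNAME:-}" "${HOSTNAME:-}")
export GLOBUS_HOSTNAME
```

If $HOSTNAME is wrong for callbacks on your machine, export GLOBUS_HOSTNAME 
yourself (a name or a numeric IP) before running swift, so the fallback is 
never used.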

Mihael


From benc at hawaga.org.uk  Thu Aug 28 12:11:54 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 28 Aug 2008 17:11:54 +0000 (GMT)
Subject: [Swift-user] tcp.port.range in Swift 0.6
In-Reply-To: <Pine.LNX.4.64.0808281504500.29009@dildano.hawaga.org.uk>
References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com>
	<Pine.LNX.4.64.0808281504500.29009@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0808281710560.22488@dildano.hawaga.org.uk>

tg uc gram4 seems a bit funny at the moment - it hangs like this when I do 
a manual job submission from a UC machine. That might be the problem, not 
the Swift version.

$ globusrun-ws -submit -F tg-grid.uc.teragrid.org -Ft Fork -c /bin/hostname
Submitting job...Done.
Job ID: uuid:407bb778-7524-11dd-8991-001a64784960
Termination time: 08/29/2008 17:10 GMT

-- 


From fedorov at cs.wm.edu  Thu Aug 28 12:25:09 2008
From: fedorov at cs.wm.edu (Andriy Fedorov)
Date: Thu, 28 Aug 2008 13:25:09 -0400
Subject: [Swift-user] tcp.port.range in Swift 0.6
In-Reply-To: <Pine.LNX.4.64.0808281710560.22488@dildano.hawaga.org.uk>
References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com>
	<Pine.LNX.4.64.0808281504500.29009@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0808281710560.22488@dildano.hawaga.org.uk>
Message-ID: <82f536810808281025i2f9ec69aj8f4af24b6638c5cb@mail.gmail.com>

On Thu, Aug 28, 2008 at 1:11 PM, Ben Clifford <benc at hawaga.org.uk> wrote:
> tg uc gram4 seems a bit funny at the moment - it hangs like this when I do
> a manual job submission from a UC machine. That might be the problem, not
> the Swift version.
>

This is what I was going to try right before I received your email.

I wanted to comment in general on TeraGrid. I find it to be a very
painful experience using it through GRAM. Initially, I wanted to use 4
sites: Linux clusters on SDSC, UC and NCSA, and Abe at NCSA. I found
problems with GRAM configuration on Abe and NCSA cluster, I reported
that to help about 3 weeks ago, and every time I ask about updates,
they tell me they will notify me. Now UC is also out of the loop, and
I have only SDSC left, which usually has a very tight queue schedule.

I assume these problems would be resolved more quickly if more people
were doing something similar to what I am doing.

I will keep bugging TG help until I run out of time for this project....

> $ globusrun-ws -submit -F tg-grid.uc.teragrid.org -Ft Fork -c
> /bin/hostname
> Submitting job...Done.
> Job ID: uuid:407bb778-7524-11dd-8991-001a64784960
> Termination time: 08/29/2008 17:10 GMT
>
> --
>
>


From benc at hawaga.org.uk  Thu Aug 28 12:56:50 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 28 Aug 2008 17:56:50 +0000 (GMT)
Subject: [Swift-user] tcp.port.range in Swift 0.6
In-Reply-To: <82f536810808281025i2f9ec69aj8f4af24b6638c5cb@mail.gmail.com>
References: <82f536810808271308g774416dbl5f49d0286987a567@mail.gmail.com> 
	<Pine.LNX.4.64.0808281504500.29009@dildano.hawaga.org.uk> 
	<Pine.LNX.4.64.0808281710560.22488@dildano.hawaga.org.uk>
	<82f536810808281025i2f9ec69aj8f4af24b6638c5cb@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0808281750220.22488@dildano.hawaga.org.uk>


On Thu, 28 Aug 2008, Andriy Fedorov wrote:

> I wanted to comment in general on TeraGrid. I find it to be a very 
> painful experience using it through GRAM. Initially, I wanted to use 4

In cases where GRAM4 isn't working, you can probably submit (at a lower 
rate) with GRAM2 to the same site - that often seems to be more reliable.

--