From iraicu at cs.uchicago.edu Tue Nov 2 14:47:36 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 02 Nov 2010 14:47:36 -0500
Subject: [Swift-user] Call for Participation: 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10), co-located with Supercomputing 2010 -- win an Apple iPad!!!
Message-ID: <4CD06AD8.1050201@cs.uchicago.edu>

Dear all,

We invite you to participate in the 3rd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) on Monday, November 15th, 2010, co-located with IEEE/ACM Supercomputing 2010 in New Orleans, LA. MTAGS will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large-scale clusters, grids, supercomputers, and cloud computing infrastructure.

A few highlights of the workshop:

* Workshop program: the program can be found at http://www.cs.iit.edu/~iraicu/MTAGS10/program.htm; papers and slides will be posted by November 15th, 2010
* Keynote speaker: Roger Barga, PhD, Architect, Extreme Computing Group, Microsoft Research
* Best Paper nominees:
  o Timothy Armstrong, Mike Wilde, Daniel Katz, Zhao Zhang, Ian Foster. "Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks", 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) 2010
  o Thomas Budnik, Brant Knudson, Mark Megerian, Sam Miller, Mike Mundy, Will Stockdell. "Blue Gene/Q Resource Management Architecture", 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) 2010
* Attendance prize: there will be a free Apple iPad giveaway at the end of the workshop; you must attend at least one talk during the day and be present at 6:15PM, at the end of the workshop, to win

The workshop program is:

* 9:00AM Opening Remarks
* 9:10AM Keynote: Data Laden Clouds, Roger Barga, PhD, Architect, Extreme Computing Group, Microsoft Research
* Session 1: Applications
  o 10:30AM Many Task Computing for Modeling the Fate of Oil Discharged from the Deep Water Horizon Well Blowout
  o 11:00AM Many-Task Applications in the Integrated Plasma Simulator
  o 11:30AM Compute and data management strategies for grid deployment of high throughput protein structure studies
* Session 2: Storage
  o 1:30PM Processing Massive Sized Graphs Using Sector/Sphere
  o 2:00PM Easy and Instantaneous Processing for Data-Intensive Workflows
  o 2:30PM Detecting Bottlenecks in Parallel DAG-based Data Flow Programs
* Session 3: Resource Management
  o 3:30PM Improving Many-Task Computing in Scientific Workflows Using P2P Techniques
  o 4:00PM Dynamic Task Scheduling for the Uintah Framework
  o 4:30PM Automatic and Coordinated Job Recovery for High Performance Computing
* Session 4: Best Paper Nominees
  o 5:15PM Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks
  o 5:45PM Blue Gene/Q Resource Management Architecture
* 6:15PM Best Paper Award, Attendance Prizes, & Closing Remarks

We look forward to seeing you at the workshop in less than 2 weeks!

Regards,
Ioan Raicu, Yong Zhao, and Ian Foster
MTAGS10 Chairs
http://www.cs.iit.edu/~iraicu/MTAGS10/

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W.
31st Street, Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cell: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
=================================================================

From wilde at mcs.anl.gov Wed Nov 3 15:37:07 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Nov 2010 15:37:07 -0500 (CDT)
Subject: [Swift-user] Re: 1.0 vs 1
In-Reply-To: <0663F080-2B06-45C2-A822-4936D3895BF3@iro.umontreal.ca>
Message-ID: <329307237.9932.1288816627030.JavaMail.root@zimbra.anl.gov>

Marc, the solution here is to say:

    string str_modulo = @strcat( int_index, ":1000" );

instead of:

    string str_modulo = @strcat( @tostring( int_index ), ":1000" );

It's a handy feature of @strcat() that it coerces its arguments to strings when they are numeric, and does so in the desired way, unlike @tostring(). I am not sure whether @tostring() has always behaved this way (i.e., always formatting as if its argument were a float), or whether that was a recent, perhaps undesirable or inadvertent, change. This is the kind of question you should submit to swift-user for general discussion, so that other developers can offer advice.

Also, we have a sprintf()-like function that I think is not yet documented, if you need it. I need to find the details on that.

- Mike

----- Original Message -----
> Hi guys,
>
> I'm trying to generate a string whose content should be "0:1000",
> "1:1000", etc...
>
>     foreach mod_index in [0:1]
>     {
>         int int_index = mod_index;
>         string str_modulo = @strcat( @tostring( int_index ), ":1000" );
>         ...
>     }
>
> That str_modulo string ends up containing "0.0:1000", "1.0:1000", etc.
>
> I thought it was an int? So the culprit is the tostring function then?
> It is not defined for the int type, so the value is silently converted to
> float and then passed to the tostring function??
>
> Very Best,
> Marc.

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From matthew.woitaszek at gmail.com Wed Nov 3 18:28:50 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Wed, 3 Nov 2010 17:28:50 -0600
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
Message-ID:

Good afternoon,

Is there a way to update PBS resource requests when using coasters to supply modified PBS resource strings such as "nodes=1:ppn=8"? (Or other arbitrary resource requests, such as node properties?)

Of course, I'm just trying to get coasters to allocate all of the processors on an 8-core node, using either the "gt2:gt2:pbs" or "local:pbs" provider. Both submit jobs just fine. I found no discernible difference with the "host_types" Globus namespace variable, presuming I'm setting it right.

The particular cluster I'm using allows node packing for users that run lots of single-processor tasks, so without ppn, it will assume nodes=1,ncpus=1 and thus pack 8 jobs on each node before moving on to the next node. (I know it won't be an issue at sites that make nodes exclusive. On this system, the queue default is "nodes=1:ppn=8", but because coasters explicitly specifies the number of nodes in its generated resource request, the ppn default seems to get lost!)
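To make the difference concrete, the two resource requests look like this in standard Torque qsub syntax (the script name here is just a placeholder):

    # what gets requested today; Torque treats it as nodes=1:ppn=1,
    # so other jobs may be packed onto the remaining cores
    qsub -l nodes=1 worker-launch.sh

    # what I'd like it to request: one node with all 8 cores
    qsub -l nodes=1:ppn=8 worker-launch.sh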
I see that this has been discussed as far back as 2007, and I found Marcin and Mike's previous discussion of the topic at

    http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html

but there didn't seem to be any definitive conclusion. Any suggestions would be appreciated!

Matthew

From aespinosa at cs.uchicago.edu Wed Nov 3 18:41:00 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 3 Nov 2010 18:41:00 -0500
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To:
References:
Message-ID:

Hi Matthew,

Does this mean coasters will now submit nodes=1:ppn=1 and do node packing?

If there is no node packing being initiated by PBS, you can just specify workersPerNode=8. But then what you request from PBS is different from what you actually use.

-Allan

2010/11/3 Matthew Woitaszek :
> Good afternoon,
>
> Is there a way to update PBS resource requests when using coasters to supply
> modified PBS resource strings such as "nodes=1:ppn=8"? (Or other arbitrary
> resource requests, such as node properties?)
>
> Of course, I'm just trying to get coasters to allocate all of the processors
> on an 8-core node, using either the "gt2:gt2:pbs" or "local:pbs" provider.
> Both submit jobs just fine. I found no discernible difference with the
> "host_types" Globus namespace variable, presuming I'm setting it right.
>
> The particular cluster I'm using allows node packing for users that run lots
> of single-processor tasks, so without ppn, it will assume nodes=1,ncpus=1
> and thus pack 8 jobs on each node before moving on to the next node. (I know
> it won't be an issue at sites that make nodes exclusive. On this system, the
> queue default is "nodes=1:ppn=8", but because coasters explicitly specifies
> the number of nodes in its generated resource request, the ppn default seems
> to get lost!)
>
> I see that this has been discussed as far back as 2007, and I found Marcin
> and Mike's previous discussion of the topic at
>
> http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html
>
> but there didn't seem to be any definitive conclusion. Any suggestions would
> be appreciated!
>
> Matthew

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From wilde at mcs.anl.gov Thu Nov 4 10:06:58 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 4 Nov 2010 10:06:58 -0500 (CDT)
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To:
Message-ID: <234626292.12872.1288883218697.JavaMail.root@zimbra.anl.gov>

[long response follows, sorry - I tried to condense, but this hits messy issues]

Hi Matthew,

Your question hits issues that we need to resolve and do more testing on. Most common modes seem to be working, but I have been worried that some bugs remain, and it's possible - but not 100% clear - that we'll need more attribute-setting control. I think that some node-packing issues Marcin encountered on the Argonne PBS Fusion cluster went unresolved.

Specifically, I've been suspicious that with automated (default) coaster operation, there may be cases with PBS and SGE where we either get too few (1 instead of N) or too many (N^2 instead of N) jobs running per node.

I'll try to cover these by answering your questions, below.

----- Original Message -----
> Good afternoon,
>
> Is there a way to update PBS resource requests when using coasters to
> supply modified PBS resource strings such as "nodes=1:ppn=8"?
> (Or other arbitrary resource requests, such as node properties?)

Not that I know of. You can set the number of cores that should be used on each node using the coasters pool attribute "workersPerNode". But see the issues in the summary below.

You can also start coaster workers manually, in which case you can set any scheduler attributes explicitly. We have a growing set of scripts that enable this, but they're not ready for release yet. We hope to integrate this option into the evolving swiftconfig/swiftrun tools that you may have seen discussed on the list, which are in the trunk but not yet documented in the users guide. Let's discuss this possibility in a separate thread if, after reading this, you feel you need it.

> Of course, I'm just trying to get coasters to allocate all of the
> processors on an 8-core node, using either the "gt2:gt2:pbs" or
> "local:pbs" provider. Both submit jobs just fine. I found no
> discernible difference with the "host_types" Globus namespace
> variable, presuming I'm setting it right.

Did you try just setting workersPerNode (in the Globus profile) to 8? This should be working with coasters on PBS and gt2:gt2:pbs, and I'm pretty sure it is working on TACC Ranger (an SGE machine with N=16). Note that this attribute is in the "Globus" profile set, but that's a misnomer - many attributes in that profile affect coasters and the local providers and are unrelated to Globus operation per se.

> The particular cluster I'm using allows node packing for users that
> run lots of single-processor tasks, so without ppn, it will assume
> nodes=1,ncpus=1 and thus pack 8 jobs on each node before moving on to
> the next node. (I know it won't be an issue at sites that make nodes
> exclusive. On this system, the queue default is "nodes=1:ppn=8", but
> because coasters explicitly specifies the number of nodes in its
> generated resource request, the ppn default seems to get lost!)

You can set "debug=true" in etc/provider-pbs.properties, and then Swift will retain the submit file in $HOME/.globus/scripts, so you can verify the scheduler directives that Swift is setting.

> I see that this has been discussed as far back as 2007, and I found
> Marcin and Mike's previous discussion of the topic at
>
> http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html

Right - that issue is still unresolved. I'll try to push it forward. Can you help us determine the right conventions and then verify that they work for you?

The issue, I think, is that the user either needs to know whether or not the scheduler does node packing, or needs a way to specify job attributes that makes such knowledge unnecessary. What I think we need is:

- a set of attributes that forces the scheduler to allocate complete nodes and gives the user control over how many jobs to run per node
- a set of attributes that assumes the scheduler *will* pack nodes, and that does the right thing in that case
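To make that concrete, here is roughly how the two setups that work today look in sites.xml - a sketch using only the existing workersPerNode attribute, with the rest of the pool definition omitted:

    <!-- scheduler packs nodes: request single cores, one worker each -->
    <profile namespace="globus" key="workersPerNode">1</profile>

    <!-- scheduler hands out whole 8-core nodes: run 8 workers on each -->
    <profile namespace="globus" key="workersPerNode">8</profile>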
In summary, I think the current situation is this:

- when coasters submits a 1-node job:
  -- workersPerNode=1
     o works fine if the scheduler packs nodes
     o uses only 1 core if the scheduler does not pack nodes
  -- workersPerNode=N
     o runs up to N^2 tasks if the scheduler packs nodes
     o works fine if the scheduler does not pack nodes
- when coasters submits an N-node job, N>1:
  -- workersPerNode=1
     o works fine if the scheduler packs nodes
     o uses only 1 core per node if the scheduler does not pack nodes
  -- workersPerNode=N
     o runs up to N^2 tasks if the scheduler packs nodes
     o works fine if the scheduler does not pack nodes

Based on the above cases, it seems that "all is fine" as long as the sites description is set according to whether the scheduler will node-pack or not, and workersPerNode is set correctly. Typically, you want to run either in 1-core packing mode or in N-core full-node-allocation mode.

I *think* that some schedulers (PBS, maybe SGE) may determine packing behavior based on the queue or, in the case of SGE, perhaps by the parallel environment (PE). For now, the user must know how to match the sites.xml spec to the behavior of the target cluster. We're trying to work out suggested specs for all the clusters in the Argonne/UChicago/TeraGrid environment, and most of Open Science Grid as well.

I am worried that there remains an issue in what we call "multi-node" operation. When a coaster job ("slot") uses more than one node, Swift itself needs to start the coaster agent, worker.pl, on each node in the job. This is done with explicit shell code that Swift places in the submit file.

I have argued in the past that we simply need one more attribute (I called it coresPerNode) which tells Swift exactly how many cores per node to request from the scheduler. The two typical values would be 1 (if the user wants to use node packing) and N, where N is the actual number of cores per node, when the user wants to allocate entire nodes. I think this may be needed for SGE but possibly not PBS. I'm pretty sure we have cases in SGE local-scheduler coaster mode where the provider needs this in order to formulate a submit file that SGE will accept. Mihael did not agree that this was necessary, and we never resolved the issue.

So what we need to discuss, test, and resolve is:

- is the coaster provider correctly handling both "node-packed" and "full-node" mode?
- is it handling these modes correctly with both the local-scheduler parent provider and with GT2?
- with SGE, are we working correctly with all or most parallel environments and job-launching programs? How do we know, for a given SGE deployment? Are we starting multi-node jobs correctly on all schedulers in all modes?
- do we have a sufficient way to set scheduler attributes?
- we need to automate the testing of the many mode combinations that are likely to be used

If you are willing to help define (or even develop) and test improvements, we'd welcome your assistance.

Sorry for the long response. I think with more analysis we can simplify the issue.

Regards,

Mike

> but there didn't seem to be any definitive conclusion. Any suggestions
> would be appreciated!
> > Matthew > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From matthew.woitaszek at gmail.com Thu Nov 4 10:06:48 2010 From: matthew.woitaszek at gmail.com (Matthew Woitaszek) Date: Thu, 4 Nov 2010 09:06:48 -0600 Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn In-Reply-To: References: Message-ID: Hi Allan, Yep, that's it. When the coasters resource request comes in with just "nodes=1", it gets interpreted by PBS as nodes=1:ppn=1, and thus PBS puts other jobs on the node, too, until all 8 CPUs are allocated (e.g., 8 1-cpu PBS jobs are running on it). I'd like to find some way to make the request as: nodes=1:ppn=8 along with workersPerNode=8 so that PBS allocates one node and all 8 processors, and then one Coasters job would put 8 workers on it, matching the resource request with the use. Matthew On Wed, Nov 3, 2010 at 5:41 PM, Allan Espinosa wrote: > Hi Matthew, > > Does this mean, coasters will now submit nodes=1;ppn=1 and do node packing? > > If there is no node packing being initiated by PBS, you can just > specify workersPerNode=8 . But then what you request to PBS is now > different to what you actually use. > > -Allan > > 2010/11/3 Matthew Woitaszek : > > Good afternoon, > > > > Is there a way to update PBS resource requests when using coasters to > supply > > modified PBS resource strings such as "nodes=1:ppn=8"? (Or other > arbitrary > > resource requests, such as node properties?) > > > > Of course, I'm just trying to get coasters to allocate all of the > processors > > on an 8-core node, using either the "gt2:gt2:pbs" or "local:pbs" > provider. > > Both submit jobs just fine. I found no discernible difference with the > > "host_types" Globus namespace variable, presuming I'm setting it right. > > > > The particular cluster I'm using allows node packing for users that run > lots > > of single-processor tasks, so without ppn, it will assume nodes=1,ncpus=1 > > and thus pack 8 jobs on each node before moving on to the next node. (I > know > > it won't be an issue at sites that make nodes exclusive. On this system, > the > > queue default is "nodes=1:ppn=8", but because coasters explicitly > specifies > > the number of nodes in its generated resource request, the ppn default > seems > > to get lost!) > > > > I see that this has been discussed as far back as 2007, and I found > Marcin > > and Mike's previous discussion of the topic at > > > > > http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html > > > > but there didn't seem to be any definitive conclusion. Any suggestions > would > > be appreciated! > > > > Matthew > > > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Nov 4 10:20:12 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 4 Nov 2010 10:20:12 -0500 (CDT) Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn In-Reply-To: Message-ID: <634679765.13000.1288884012023.JavaMail.root@zimbra.anl.gov> Let me see if Mihael or I can add the ppn spec as a simple test to experiment with. I'll look for my coresPerNode mod which I think did this. - Mike ----- Original Message ----- Hi Allan, Yep, that's it. 
When the coasters resource request comes in with just "nodes=1", it gets interpreted by PBS as nodes=1:ppn=1, and thus PBS puts other jobs on the node, too, until all 8 CPUs are allocated (e.g., 8 1-cpu PBS jobs are running on it). I'd like to find some way to make the request as: nodes=1:ppn=8 along with workersPerNode=8 so that PBS allocates one node and all 8 processors, and then one Coasters job would put 8 workers on it, matching the resource request with the use. Matthew On Wed, Nov 3, 2010 at 5:41 PM, Allan Espinosa < aespinosa at cs.uchicago.edu > wrote: Hi Matthew, Does this mean, coasters will now submit nodes=1;ppn=1 and do node packing? If there is no node packing being initiated by PBS, you can just specify workersPerNode=8 . But then what you request to PBS is now different to what you actually use. -Allan 2010/11/3 Matthew Woitaszek < matthew.woitaszek at gmail.com >: > Good afternoon, > > Is there a way to update PBS resource requests when using coasters to supply > modified PBS resource strings such as "nodes=1:ppn=8"? (Or other arbitrary > resource requests, such as node properties?) > > Of course, I'm just trying to get coasters to allocate all of the processors > on an 8-core node, using either the "gt2:gt2:pbs" or "local:pbs" provider. > Both submit jobs just fine. I found no discernible difference with the > "host_types" Globus namespace variable, presuming I'm setting it right. > > The particular cluster I'm using allows node packing for users that run lots > of single-processor tasks, so without ppn, it will assume nodes=1,ncpus=1 > and thus pack 8 jobs on each node before moving on to the next node. (I know > it won't be an issue at sites that make nodes exclusive. On this system, the > queue default is "nodes=1:ppn=8", but because coasters explicitly specifies > the number of nodes in its generated resource request, the ppn default seems > to get lost!) > > I see that this has been discussed as far back as 2007, and I found Marcin > and Mike's previous discussion of the topic at > > http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html > > but there didn't seem to be any definitive conclusion. Any suggestions would > be appreciated! > > Matthew > -- Allan M. Espinosa < http://amespinosa.wordpress.com > PhD student, Computer Science University of Chicago < http://people.cs.uchicago.edu/~aespinosa > _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Nov 4 15:41:51 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 04 Nov 2010 13:41:51 -0700 Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn In-Reply-To: <634679765.13000.1288884012023.JavaMail.root@zimbra.anl.gov> References: <634679765.13000.1288884012023.JavaMail.root@zimbra.anl.gov> Message-ID: <1288903311.3175.1.camel@blabla2.none> You can add the ppn profile in sites.xml: 8 This works in trunk and might work in the stable branch. Mihael On Thu, 2010-11-04 at 10:20 -0500, Michael Wilde wrote: > Let me see if Mihael or I can add the ppn spec as a simple test to > experiment with. I'll look for my coresPerNode mod which I think did > this. 
> > > - Mike > > > ______________________________________________________________________ > Hi Allan, > > Yep, that's it. When the coasters resource request comes in > with just "nodes=1", it gets interpreted by PBS as > nodes=1:ppn=1, and thus PBS puts other jobs on the node, too, > until all 8 CPUs are allocated (e.g., 8 1-cpu PBS jobs are > running on it). > > I'd like to find some way to make the request as: > nodes=1:ppn=8 > along with > workersPerNode=8 > so that PBS allocates one node and all 8 processors, and then > one Coasters job would put 8 workers on it, matching the > resource request with the use. > > Matthew > > > > > On Wed, Nov 3, 2010 at 5:41 PM, Allan Espinosa > wrote: > Hi Matthew, > > Does this mean, coasters will now submit nodes=1;ppn=1 > and do node packing? > > If there is no node packing being initiated by PBS, > you can just > specify workersPerNode=8 . But then what you request > to PBS is now > different to what you actually use. > > -Allan > > 2010/11/3 Matthew Woitaszek > : > > > Good afternoon, > > > > Is there a way to update PBS resource requests when > using coasters to supply > > modified PBS resource strings such as > "nodes=1:ppn=8"? (Or other arbitrary > > resource requests, such as node properties?) > > > > Of course, I'm just trying to get coasters to > allocate all of the processors > > on an 8-core node, using either the "gt2:gt2:pbs" or > "local:pbs" provider. > > Both submit jobs just fine. I found no discernible > difference with the > > "host_types" Globus namespace variable, presuming > I'm setting it right. > > > > The particular cluster I'm using allows node packing > for users that run lots > > of single-processor tasks, so without ppn, it will > assume nodes=1,ncpus=1 > > and thus pack 8 jobs on each node before moving on > to the next node. (I know > > it won't be an issue at sites that make nodes > exclusive. On this system, the > > queue default is "nodes=1:ppn=8", but because > coasters explicitly specifies > > the number of nodes in its generated resource > request, the ppn default seems > > to get lost!) > > > > I see that this has been discussed as far back as > 2007, and I found Marcin > > and Mike's previous discussion of the topic at > > > > > http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html > > > > but there didn't seem to be any definitive > conclusion. Any suggestions would > > be appreciated! > > > > Matthew > > > > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From matthew.woitaszek at gmail.com Thu Nov 4 16:35:08 2010 From: matthew.woitaszek at gmail.com (Matthew Woitaszek) Date: Thu, 4 Nov 2010 15:35:08 -0600 Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn In-Reply-To: <1288903311.3175.1.camel@blabla2.none> References: <634679765.13000.1288884012023.JavaMail.root@zimbra.anl.gov> <1288903311.3175.1.camel@blabla2.none> Message-ID: Hi Mihael, Unfortunately, I can't seem to get that to work... I just did a svn update on the trunk. 
With that line in sites.xml, I still get the nodes=1 in the PBS file in ~/.globus/scripts using the local provider; the net result also seems the same with the gt2 provider. Do you have an example that I can try, to make sure I'm not botching it up?

Matthew

On Thu, Nov 4, 2010 at 2:41 PM, Mihael Hategan wrote:
> You can add the ppn profile in sites.xml:
>
>     <profile namespace="globus" key="ppn">8</profile>
>
> This works in trunk and might work in the stable branch.
>
> Mihael

From wilde at mcs.anl.gov Thu Nov 4 17:26:47 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 4 Nov 2010 17:26:47 -0500 (CDT)
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To:
Message-ID: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov>

I'll try later as well. Mihael, I'm wondering (from what I saw when I did the earlier experiment with a coresPerNode variable) if we need to add the line in BlockTask.java to copy the attribute from the coaster jobspec to the block's jobspec?

- Mike

----- Original Message -----

Hi Mihael,

Unfortunately, I can't seem to get that to work... I just did a svn update on the trunk. With that line in sites.xml, I still get the nodes=1 in the PBS file in ~/.globus/scripts using the local provider; the net result also seems the same with the gt2 provider. Do you have an example that I can try, to make sure I'm not botching it up?

Matthew

On Thu, Nov 4, 2010 at 2:41 PM, Mihael Hategan < hategan at mcs.anl.gov > wrote:

You can add the ppn profile in sites.xml:

    <profile namespace="globus" key="ppn">8</profile>

This works in trunk and might work in the stable branch.

Mihael

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Thu Nov 4 17:35:18 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 04 Nov 2010 15:35:18 -0700
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov>
References: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov>
Message-ID: <1288910118.3775.0.camel@blabla2.none>

On Thu, 2010-11-04 at 17:26 -0500, Michael Wilde wrote:
> I'll try later as well. Mihael, I'm wondering (from what I saw when I
> did the earlier experiment with a coresPerNode variable) if we need
> to add the line in BlockTask.java to copy the attribute from the
> coaster jobspec to the block's jobspec?

It should be copied. Maybe it's a bug. I'll have to try and see.

Mihael

From matthew.woitaszek at gmail.com Thu Nov 4 18:14:32 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Thu, 4 Nov 2010 17:14:32 -0600
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
Message-ID:

When running with coasters using the gt2:gt2:pbs provider, an excellent log file pops up in

    ~/.globus/coasters/coasters.log

containing useful lines like:

    DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...

This file isn't generated when using the local:pbs provider, at least out of the box.

Is there a way to turn that output log file back on, or to get those debug lines in the Swift job log file, using the local:pbs provider?

Thanks for your time,

Matthew
From wilde at mcs.anl.gov Thu Nov 4 22:47:52 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 4 Nov 2010 22:47:52 -0500 (CDT)
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To:
Message-ID: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov>

Matthew,

When the first token in the coaster jobmanager parameter is "local", as in "local:pbs", the coaster server threads run inside the main swift JVM, and thus log to the main swift log. That's the file with the long name starting with your swift script file name, followed by a timestamp and unique ID, and ending in .log.

All the log entries that you'd find in the coasters.log file should, I think, be in the main log.

- Mike

----- Original Message -----

When running with coasters using the gt2:gt2:pbs provider, an excellent log file pops up in ~/.globus/coasters/coasters.log containing useful lines like:

    DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...

This file isn't generated when using the local:pbs provider, at least out of the box.

Is there a way to turn that output log file back on, or to get those debug lines in the Swift job log file, using the local:pbs provider?

Thanks for your time,

Matthew

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From matthew.woitaszek at gmail.com Fri Nov 5 14:54:49 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Fri, 5 Nov 2010 13:54:49 -0600
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov>
References: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov>
Message-ID:

Hi Mike,

I do see some of the Coasters messages in the file, such as the INFO Cpu pull messages.

With apologies for a dumb question: Can you point me to how to turn on the DEBUG TaskImpl messages?

Matthew

On Thu, Nov 4, 2010 at 9:47 PM, Michael Wilde wrote:
> Matthew,
>
> When the first token in the coaster jobmanager parameter is "local", as in
> "local:pbs", the coaster server threads run inside the main swift JVM,
> and thus log to the main swift log. That's the file with the long name
> starting with your swift script file name, followed by a timestamp and
> unique ID, and ending in .log.
>
> All the log entries that you'd find in the coasters.log file should, I
> think, be in the main log.
>
> - Mike
>
> ------------------------------
>
> When running with coasters using the gt2:gt2:pbs provider, an excellent log
> file pops up in
> ~/.globus/coasters/coasters.log
> containing useful lines like:
> DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...
>
> This file isn't generated when using the local:pbs provider, at least
> out of the box.
>
> Is there a way to turn that output log file back on, or to get those debug
> lines in the Swift job log file, using the local:pbs provider?
>
> Thanks for your time,
>
> Matthew

> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

From wilde at mcs.anl.gov Fri Nov 5 15:28:16 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 5 Nov 2010 15:28:16 -0500 (CDT)
Subject: [Swift-user] Provider staging vs coaster data provider
In-Reply-To: <989023705.21562.1288986250337.JavaMail.root@zimbra.anl.gov>
Message-ID: <1952703623.21905.1288988896236.JavaMail.root@zimbra.anl.gov>

Mihael, you may have explained this to me already, but can you clarify:

In addition to provider staging using the coaster provider, it seems you can say (or ) which uses the coaster channel and agent to move data?

One difference is that with provider staging, *all* file transfer for all sites is done via that method, correct?

If a coaster filesystem provider is available, then the user can use that on selected sites, while other sites can use any other provider, correct?

Provider staging with the coaster provider is done via the coaster data channel, and all data goes directly to a job directory typically placed on the worker node local filesystem, right?

Now, with coaster data provider staging, does that same restriction apply, or can the coaster data provider (assuming it really exists) place data on any path accessible to the worker? I.e., one could use a standard shared work directory if one were so inclined, although that would be counter-productive.

Once I have all the above straight, I'm going to try to figure out how these two methods interact with CDM.

Thanks,

Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Fri Nov 5 17:53:25 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 5 Nov 2010 17:53:25 -0500 (CDT)
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To:
Message-ID: <483086198.22863.1288997605521.JavaMail.root@zimbra.anl.gov>

Matthew, that's actually a very *good* question, and I don't know the answer (but we should document it).

I think you want to set a log4j property in the Swift file etc/log4j.properties. Hopefully the message below, from the swift-user list, provides a clue; I'm not sure whether this is the exact setting you need or just something similar. I think that to find the precise answer you'd need to look for the message you want in the source code under provider-coaster, and then set the corresponding log4j property to debug.

See also Justin's posts on simplified logging, which may interact with this. Mihael can provide the exact answer; Justin is offline at the moment.

Mike

----- Forwarded Message -----
From: "Justin M Wozniak"
To: "Allan Espinosa"
Cc: "Swift-User"
Sent: Wednesday, October 20, 2010 1:23:41 PM
Subject: Re: [Swift-user] log4j settings of vdl:* elements

Try

    log4j.logger.swift=DEBUG

See org.griphyn.vdl.karajan.lib.Log for more info.

On Wed, 20 Oct 2010, Allan Espinosa wrote:
> Hi,
>
> I think I have asked this before, but can't find the previous posts about it.
>
> I would like to set the vdl:execute2 log level to DEBUG. Which
> package/class path should I adjust in log4j.properties?
>
> Thanks,
> -Allan

----- Original Message -----

Hi Mike,

I do see some of the Coasters messages in the file, such as the INFO Cpu pull messages.

With apologies for a dumb question: Can you point me to how to turn on the DEBUG TaskImpl messages?

Matthew

On Thu, Nov 4, 2010 at 9:47 PM, Michael Wilde < wilde at mcs.anl.gov > wrote:

Matthew,

When the first token in the coaster jobmanager parameter is "local", as in "local:pbs", the coaster server threads run inside the main swift JVM, and thus log to the main swift log. That's the file with the long name starting with your swift script file name, followed by a timestamp and unique ID, and ending in .log.

All the log entries that you'd find in the coasters.log file should, I think, be in the main log.

- Mike

When running with coasters using the gt2:gt2:pbs provider, an excellent log file pops up in ~/.globus/coasters/coasters.log containing useful lines like:

    DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...

This file isn't generated when using the local:pbs provider, at least out of the box.

Is there a way to turn that output log file back on, or to get those debug lines in the Swift job log file, using the local:pbs provider?

Thanks for your time,

Matthew

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sun Nov 7 13:57:55 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 07 Nov 2010 11:57:55 -0800
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To: <1288910118.3775.0.camel@blabla2.none>
References: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov> <1288910118.3775.0.camel@blabla2.none>
Message-ID: <1289159875.29800.1.camel@blabla2.none>

On Thu, 2010-11-04 at 15:35 -0700, Mihael Hategan wrote:
> On Thu, 2010-11-04 at 17:26 -0500, Michael Wilde wrote:
> > I'll try later as well. Mihael, I'm wondering (from what I saw when I
> > did the earlier experiment with a coresPerNode variable) if we need
> > to add the line in BlockTask.java to copy the attribute from the
> > coaster jobspec to the block's jobspec?
>
> It should be copied. Maybe it's a bug. I'll have to try and see.

It wasn't copied, and it wasn't a bug, but my misunderstanding.

Attributes are not directly copied, since there is no one-to-one mapping between jobs and coaster blocks. So theoretically some "merge" operation needs to exist.

I added "ppn" as one of the attributes that is copied from the first job, so the scenario I mentioned should now work.

This is cog r2927/trunk.
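So against trunk, asking Torque for whole 8-core nodes should now just be a matter of adding the profile from my earlier message to the pool entry (a sketch):

    <profile namespace="globus" key="ppn">8</profile>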
Mihael

From hategan at mcs.anl.gov Sun Nov 7 14:03:52 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 07 Nov 2010 12:03:52 -0800
Subject: [Swift-user] Re: Provider staging vs coaster data provider
In-Reply-To: <1952703623.21905.1288988896236.JavaMail.root@zimbra.anl.gov>
References: <1952703623.21905.1288988896236.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289160232.29800.7.camel@blabla2.none>

On Fri, 2010-11-05 at 15:28 -0500, Michael Wilde wrote:
> Mihael, you may have explained this to me already, but can you clarify:
>
> In addition to provider staging using the coaster provider, it seems
> you can say (or ) which uses
> the coaster channel and agent to move data?

Yes.

> One difference is that with provider staging, *all* file transfer for
> all sites is done via that method, correct?

And without provider staging, all file transfer for all sites is done without provider staging (i.e., the converse is also true). This is a consequence of vdl-int getting too messy to have both in the same run, but it's not a theoretical impossibility.

> If a coaster filesystem provider is available, then the user can use
> that on selected sites, while other sites can use any other provider,
> correct?

That is correct.

> Provider staging with the coaster provider is done via the coaster
> data channel and all data goes directly to a job directory typically
> placed on the worker node local filesystem, right?

Also correct.

> Now, with coaster data provider staging, does that same restriction
> apply, or can the coaster data provider (assuming it really exists)
> place data on any path accessible to the worker? I.e., one could use a
> standard shared workdirectory if one was so inclined, although that
> would be counter-productive.

The coaster data provider works like any other data provider, and its usage is consistent with Swift's traditional way of working. In other words, files are copied to a shared directory, cached there, and the shared directory must be accessible to the worker node. Though which one of the methods is "restrictive" I don't know.

Mihael

From hategan at mcs.anl.gov Sun Nov 7 14:08:57 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 07 Nov 2010 12:08:57 -0800
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To:
References: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289160537.29800.12.camel@blabla2.none>

On Fri, 2010-11-05 at 13:54 -0600, Matthew Woitaszek wrote:
> Hi Mike,
>
> I do see some of the Coasters messages in the file, such as the INFO
> Cpu pull messages.
>
> With apologies for a dumb question: Can you point me to how to turn on
> the DEBUG TaskImpl messages?

You would say something like:

    log4j.logger.org.globus.cog.abstraction=DEBUG

in log4j.properties.

Which log4j.properties that is depends on how you run this. In one of the local modes (i.e., coaster service in the same JVM as Swift = coaster messages in Swift logs), you would edit swift-dist/etc/log4j.properties.

In the remote service case, and assuming you are not using a persistent coaster service, you would have to edit cog/modules/provider-coaster/resources/log4j.properties and re-compile Swift. I don't much like this, but it is like that for now.
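For example, the local-mode case is a one-line edit (a sketch; the logger name is the package prefix given above):

    # in swift-dist/etc/log4j.properties (coaster service inside the Swift JVM)
    log4j.logger.org.globus.cog.abstraction=DEBUG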
Mihael

From matthew.woitaszek at gmail.com Sun Nov 7 17:57:05 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Sun, 7 Nov 2010 16:57:05 -0700
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To: <1289160537.29800.12.camel@blabla2.none>
References: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov> <1289160537.29800.12.camel@blabla2.none>
Message-ID:

Hi Mihael,

Thanks so much for the logging pointers! I have to confess that I'm really inexperienced with the log4j/Java logging facility, so those details were exactly what I needed.

I tracked down the Coasters configuration in cog/modules/provider-coaster/resources/log4j.properties, and the two lines that control what I was interested in were painfully obvious:

    log4j.logger.org.globus.cog.abstraction.coaster=INFO
    log4j.logger.org.globus.cog.abstraction.impl.common.task.TaskImpl=DEBUG

After copying those over to swift-dist/etc/log4j.properties, the Submitting|Submitted|Active messages I'm interested in get included when running Swift with Coasters in the "local" (one JVM) mode.

Thanks again for your help!

Matthew

On Sun, Nov 7, 2010 at 1:08 PM, Mihael Hategan wrote:
> On Fri, 2010-11-05 at 13:54 -0600, Matthew Woitaszek wrote:
> > Hi Mike,
> >
> > I do see some of the Coasters messages in the file, such as the INFO
> > Cpu pull messages.
> >
> > With apologies for a dumb question: Can you point me to how to turn on
> > the DEBUG TaskImpl messages?
>
> You would say something like:
> log4j.logger.org.globus.cog.abstraction=DEBUG
> in log4j.properties.
>
> Which log4j.properties that is depends on how you run this. In one of
> the local modes (i.e., coaster service in the same JVM as Swift = coaster
> messages in Swift logs), you would edit swift-dist/etc/log4j.properties.
>
> In the remote service case, and assuming you are not using a persistent
> coaster service, you would have to edit
> cog/modules/provider-coaster/resources/log4j.properties and re-compile
> Swift. I don't much like this, but it is like that for now.
>
> Mihael

From matthew.woitaszek at gmail.com Mon Nov 8 11:34:03 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Mon, 8 Nov 2010 10:34:03 -0700
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To: <1289159875.29800.1.camel@blabla2.none>
References: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov> <1288910118.3775.0.camel@blabla2.none> <1289159875.29800.1.camel@blabla2.none>
Message-ID:

Mihael,

I confirm that the "ppn" attribute now gets passed through to PBS, which can be used to force Torque-based clusters using the local:pbs provider to allocate the entire node. I tested with 1 and 2 nodes. This is exactly what I was hoping for -- thank you very much.

* * *

One observational note: At least on my Torque-scheduled cluster, using -l nodes=1:ppn=8 puts 8 copies of the hostname in the PBS_NODEFILE. Since the Coasters multi-node PBS script does a simple cat/loop/ssh over PBS_NODEFILE, when using ppn > 1, ppn copies of the Perl script get run on each node. Thus, it's important that workersPerNode be set to 1:

    <profile namespace="globus" key="workersPerNode">1</profile>

This works fine for me. I'll defer to the broader discussion of nodes, workers per node, and the variables that make things work... regarding whether something like NODES=`cat $PBS_NODEFILE | sort | uniq` would be preferred, to run just one script per node with worker count control returned to workersPerNode...
    NODES=`cat $PBS_NODEFILE`    [could be edited to enforce only one entry per physical node]
    ...
    for NODE in $NODES; do
        ...
        ssh $NODE /bin/bash -c ...

Matthew

On Sun, Nov 7, 2010 at 12:57 PM, Mihael Hategan wrote:
>
> Attributes are not directly copied since there is no one-to-one mapping
> between jobs and coaster blocks. So theoretically some "merge" operation
> needs to exist.
>
> I added "ppn" as one of the attributes that is copied from the first
> job, so the scenario I mentioned should now work.
>
> This is cog r2927/trunk.
>
> Mihael

From aespinosa at cs.uchicago.edu Mon Nov 8 20:50:54 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 8 Nov 2010 20:50:54 -0600
Subject: [Swift-user] throttle transfers and vdl:stagein graphs
Message-ID:

Hi,

In my workflow, I use the default throttle.transfers=4. But my dostagein-total plot indicates that there are 72 stage-in events going on for around 90 seconds. Shouldn't there be a linear ramp-up, or a saw-tooth pattern at the plateau, because of the throttled transfers? Or am I looking at the wrong setting for this behavior?

Thanks!
-Allan

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

[Attachment: dostagein-total.png (image/png, 3488 bytes)]

From hategan at mcs.anl.gov Mon Nov 8 22:34:50 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 08 Nov 2010 20:34:50 -0800
Subject: [Swift-user] throttle transfers and vdl:stagein graphs
In-Reply-To:
References:
Message-ID: <1289277290.18134.12.camel@blabla2.none>

On Mon, 2010-11-08 at 20:50 -0600, Allan Espinosa wrote:
> Hi,
>
> In my workflow, I use the default throttle.transfers=4. But my
> dostagein-total plot indicates that there are 72 stage-in events going
> on for around 90 seconds. Shouldn't there be a linear ramp-up or a
> saw-tooth pattern at the plateau because of having throttled
> transfers?

Lies. And statistics.

The plot indicates that a number of instances of a certain portion of vdl-int is executing. If you look at that portion of vdl-int (i.e., between setprogress("Stage in") and setprogress("Submitting")), there are a few things happening, including directory creation. Essentially you are dealing with the following pattern:

    parallelFor(...
        a()
        throttle(4, b())
        c()
    )

The graph would show something like the parallelism in the invocation of the body of parallelFor. And it is quite possible that all a() invocations start well before any of the b() invocations start.

The only accurate way to see the effect of the throttle is to trace the b() invocations, which you can probably do by looking at the status of the file transfer tasks (by enabling the relevant logging stuff).

Mihael

From mparisien at uchicago.edu Tue Nov 9 13:50:23 2010
From: mparisien at uchicago.edu (Marc Parisien)
Date: Tue, 9 Nov 2010 13:50:23 -0600
Subject: [Swift-user] using queuing system
Message-ID:

Hi All,

I'm sorry, but I really don't know what I'm doing. This is what I want to do: make Swift use the queuing system on the IBI cluster (qsub/qstat).

I made a sites.xml file, like this: IBI has 8-core nodes, and I want to use at most 4 cores/node.
--------------------------------------------------------
<profile namespace="globus" key="workersPerNode">4</profile>
<profile namespace="globus" key="slots">4096</profile>
<profile namespace="globus" key="nodeGranularity">1</profile>
<profile namespace="karajan" key="jobThrottle">2.55</profile>
<profile namespace="karajan" key="initialScore">10000</profile>
<workdirectory>/cchome/mparis_x/swift</workdirectory>
--------------------------------------------------------

Here's the shell trace of the exec:

--------------------------------------------------------
Swift svn swift-r3649 (swift modified locally) cog-r2890 (cog modified locally)

RunID: 20101109-1328-he8pfis2
Progress:
Progress: Selecting site:3 Initializing site shared directory:1
Progress: Selecting site:2 Initializing site shared directory:1 Stage in:1
Progress: Stage in:1 Submitting:3
Progress: Submitted:3 Active:1
Progress: Active:4
Worker task failed: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exitcode file not found 5 queue polls after the job was reported done
 at org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:66)
 at org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
 at org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
 at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
 at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
 at java.lang.Thread.run(Unknown Source)
Progress: Active:3 Failed but can retry:1
Progress: Stage in:1 Failed but can retry:3
Progress: Stage in:1 Active:2 Failed but can retry:1
Progress: Active:4
Progress: Active:4
Progress: Active:4
Progress: Active:4
Progress: Active:3 Checking status:1
Progress: Active:2 Checking status:1 Finished successfully:1
Progress: Checking status:1 Finished successfully:3
Progress: Checking status:1 Finished successfully:4
Final status: Finished successfully:5
Cleaning up...
Shutting down service at https://172.16.0.149:52228
Got channel MetaChannel: 1235930463[1821457857: {}] -> null[1821457857: {}]
+ Done
--------------------------------------------------------

Q1. When I log into the node that processes the job, I see that it has spawned 8 processes, but my swift script should only spawn at most 4 (because my for loop is [0:3]). Why? Because of the retries??

Q2. The worker task seems to fail, but then seems to come back on its feet (Active:4)? Active:4... no no, top tells me there are 8 processes running at the same time!

Q3. The swift run returns, and qstat shows that I don't have anything queued, but if I log into the node that processed the job, I still see active processes:

[mparis_x at compute-14-41 ~]$ ps aux | grep mparis_x
mparis_x 12754 0.0 0.0 8696 1004 ? Ss 13:29 0:00 bash /opt/gridengine/default/spool/compute-14-41/job_scripts/4026341
mparis_x 12755 0.0 0.0 39360 6368 ? S 13:29 0:00 /usr/bin/perl /cchome/mparis_x/.globus/coasters/cscript5281277440516286419.pl http://172.16.0.149:50608,http://172.18.0.149:50608,http://172.20.0.1:50608,http://172.30.0.1:50608 1109-290113-000000 /cchome/mparis_x/.globus/coasters

-> Are these going to "finish" anytime by themselves... they just seem to hang there...

Thanks for your time,
Marc.

From wilde at mcs.anl.gov Tue Nov 9 14:35:00 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 9 Nov 2010 14:35:00 -0600 (CST)
Subject: [Swift-user] using queuing system
In-Reply-To:
Message-ID: <1466816887.36711.1289334900697.JavaMail.root@zimbra.anl.gov>

Hi Marc,

The IBI cluster I think is an SGE machine, not PBS.
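If you want to double-check, SGE's qconf can list the parallel environments the cluster offers - a sketch, assuming "threaded" is the PE name on ibicluster:

    qconf -spl           # list the names of all parallel environments
    qconf -sp threaded   # show the settings of the "threaded" PE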
I had sent you previously a non-coaster-based sites entry that looked like:

    <profile namespace="globus" key="pe">threaded</profile>
    <profile namespace="karajan" key="jobThrottle">.49</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <workdirectory>$(pwd)/swiftwork</workdirectory>

so change the coaster version you posted, below, to this:

    <profile namespace="globus" key="pe">threaded</profile>
    <profile namespace="globus" key="workersPerNode">4</profile>
    <profile namespace="globus" key="slots">128</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxnodes">1</profile>
    <profile namespace="karajan" key="jobThrottle">5.11</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <workdirectory>/cchome/mparis_x/swift</workdirectory>

The changes above are:

- added the "pe" tag, needed for SGE ("parallel environment"); "threaded" seems to be the right PE for ibicluster
- changed slots to 128: submit up to 128 SGE jobs at once
- nodeGranularity 1, maxnodes 1: each job should request 1 node
- raised the throttle to allow up to 512 swift app() calls to run at once (4x128)

Then, change your tc.data file to read "sge" instead of "pbs".

Lastly, set the following in a file named "cf":

    wrapperlog.always.transfer=true
    sitedir.keep=true
    execution.retries=0
    lazy.errors=false
    status.mode=provider
    use.provider.staging=false
    provider.staging.pin.swiftfiles=false

and run swift using a command similar to this:

    swift -config cf -sites.file sites.xml -tc.file tc.data yourscript.swift -args=etc

(changing the file names to match yours)

I will need to add a config file into my "latest/" Swift release on ibicluster to retain SGE submit files and stdout/err logs. But for now, you can proceed as above, without that.

More notes below...

----- Original Message -----
> Hi All,
>
> I'm sorry, but I really don't know what I'm doing. This is what I want
> to do: make Swift use the queuing system on the IBI cluster
> (qsub/qstat).
>
> I made a sites.xml file, like this: IBI has 8-core nodes, and I want
> to use at most 4 cores/node.
>
> --------------------------------------------------------
> <profile namespace="globus" key="workersPerNode">4</profile>
> <profile namespace="globus" key="slots">4096</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="karajan" key="jobThrottle">2.55</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <workdirectory>/cchome/mparis_x/swift</workdirectory>
> --------------------------------------------------------
>
> Here's the shell trace of the exec:
>
> --------------------------------------------------------
> Swift svn swift-r3649 (swift modified locally) cog-r2890 (cog modified
> locally)
>
> RunID: 20101109-1328-he8pfis2
> Progress:
> Progress: Selecting site:3 Initializing site shared directory:1
> Progress: Selecting site:2 Initializing site shared directory:1 Stage in:1
> Progress: Stage in:1 Submitting:3
> Progress: Submitted:3 Active:1
> Progress: Active:4
> Worker task failed:
> org.globus.cog.abstraction.impl.scheduler.common.ProcessException:
> Exitcode file not found 5 queue polls after the job was reported done
> at

I suspect this is because Swift submitted PBS-style jobs to SGE, based on the incorrect sites.xml attributes.

> at java.lang.Thread.run(Unknown Source)
> Progress: Active:3 Failed but can retry:1
> Progress: Stage in:1 Failed but can retry:3
> Progress: Stage in:1 Active:2 Failed but can retry:1
> Progress: Active:4
> Progress: Active:4
> Progress: Active:4
> Progress: Active:4
> Progress: Active:3 Checking status:1
> Progress: Active:2 Checking status:1 Finished successfully:1
> Progress: Checking status:1 Finished successfully:3
> Progress: Checking status:1 Finished successfully:4
> Final status: Finished successfully:5
> Cleaning up...
> Shutting down service at https://172.16.0.149:52228
> Got channel MetaChannel: 1235930463[1821457857: {}] ->
> null[1821457857: {}]
> + Done
> --------------------------------------------------------
>
> Q1. When I log into the node that processes the job, I see that it has
> spawned 8 processes, but my swift script should only spawn at most 4
> (because my for loop is [0:3]). Why? Because of the retries??

I'm not sure. There should be one swift worker.pl process running per node.
If the problem persists, please send a snapshot of what you see in ps, using: ps -fjH -u mparis_x

> Q2. The worker task seems to fail, but then seems to come back on its feet (Active:4)? Active:4... no, top tells me there are 8 processes running at the same time!
>
> Q3. Swift returns, and qstat shows that I don't have anything queued, but if I log into the node that handled the job, I still see active processes:
>
> [mparis_x at compute-14-41 ~]$ ps aux | grep mparis_x
> mparis_x 12754 0.0 0.0 8696 1004 ? Ss 13:29 0:00 bash /opt/gridengine/default/spool/compute-14-41/job_scripts/4026341
> mparis_x 12755 0.0 0.0 39360 6368 ? S 13:29 0:00 /usr/bin/perl /cchome/mparis_x/.globus/coasters/cscript5281277440516286419.pl http://172.16.0.149:50608,http://172.18.0.149:50608,http://172.20.0.1:50608,http://172.30.0.1:50608 1109-290113-000000 /cchome/mparis_x/.globus/coasters
>
> -> Are these going to "finish" at some point by themselves? They just seem to hang there...

I'll look at this with you if it persists once we correct the sites file.

> Thanks for your time, Marc.

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From matthew.woitaszek at gmail.com Wed Nov 10 16:41:20 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Wed, 10 Nov 2010 15:41:20 -0700
Subject: [Swift-user] Coasters - idle time exceeded
Message-ID:

Good afternoon,

While running with Coasters, I occasionally get messages like this:

Idle time exceeded at /home/username/.globus/coasters/cscript....pl line 627.

Then things go horribly wrong and the processing usually doesn't complete.

At first I thought this was in cases where my workflow had a long tail and many workers were left idle as some long-running tasks finished up -- a symptom of my "let's try this 512-task workflow with 64-128 cores and see what happens!" experimentation phase. I got around it by just requesting fewer nodes from PBS in my Coasters configuration. But now it's popping up on smaller workflows. The susceptible workflows seem to be preloaded with less than one node's worth of tasks on the first round of dependencies.

Is there a way that I can increase the idle time limit? Ideally, I'd like the coasters to wait for the entire PBS job walltime.

Matthew

From wilde at mcs.anl.gov Wed Nov 10 19:47:24 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 10 Nov 2010 19:47:24 -0600 (CST)
Subject: [Swift-user] Coasters - idle time exceeded
In-Reply-To:
Message-ID: <121721591.44480.1289440044368.JavaMail.root@zimbra.anl.gov>

Hi Matthew,

Could you send your swift .log file to us at swift-devel, as well as your sites.xml file, tc.data, and swift.properties (if you have changed them)?

We'll also want to look at $HOME/.globus/coasters.log and any other coaster worker log files (from this run) that might be under .globus/coasters (although the latter is probably not there, as *I think* coasters doesn't write worker logs if there are more than some threshold of total workers).

We may need to reproduce this scenario here to debug it. Mihael may have better suggestions on how to proceed.
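Something along these lines, run on the submit host, should bundle everything up in one go; the worker-*.log pattern is my guess at the worker log names, so adjust it to whatever you actually see in that directory:

cd $HOME/.globus/coasters
ls -lt | head -20                                          # see which logs this run produced
tar czf ~/coaster-logs.tar.gz coasters.log worker-*.log    # worker-*.log: guessed pattern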
- Mike

----- Original Message -----
> Good afternoon,
>
> While running with Coasters, I occasionally get messages like this:
>
> Idle time exceeded at /home/username/.globus/coasters/cscript....pl line 627.
>
> Then things go horribly wrong and the processing usually doesn't complete.
>
> At first I thought this was in cases where my workflow had a long tail and many workers were left idle as some long-running tasks finished up -- a symptom of my "let's try this 512-task workflow with 64-128 cores and see what happens!" experimentation phase. I got around it by just requesting fewer nodes from PBS in my Coasters configuration. But now it's popping up on smaller workflows. The susceptible workflows seem to be preloaded with less than one node's worth of tasks on the first round of dependencies.
>
> Is there a way that I can increase the idle time limit? Ideally, I'd like the coasters to wait for the entire PBS job walltime.
>
> Matthew

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From hategan at mcs.anl.gov Wed Nov 10 21:06:51 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Nov 2010 19:06:51 -0800
Subject: [Swift-user] Coasters - idle time exceeded
In-Reply-To:
References:
Message-ID: <1289444811.24457.20.camel@blabla2.none>

There is a way to increase that limit. That parameter also seems to be a command-line argument, though I don't see it used that way. In any event, look for "my $IDLETIMEOUT" in provider-coaster/resources/worker.pl and change the default there (4 * 60) to whatever you want (I suggest "very large number"). Then re-compile and re-run.
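In sketch form, the logic in question is roughly the following; this is a paraphrase to show where the knob sits, not the literal worker.pl source, and $lastJobTime is a made-up variable name:

# Paraphrased sketch of the worker.pl idle check -- not the actual source.
my $IDLETIMEOUT = 4 * 60;   # seconds; raise this, e.g. toward the PBS job walltime
my $lastJobTime = time();   # hypothetical name: when this worker last received work

sub checkIdle {
    my $idle = time() - $lastJobTime;
    die "Idle time exceeded" if $idle > $IDLETIMEOUT;   # the error seen above
}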
The idle time was used in a previous version of the coasters (when there was no block allocation) as a mechanism to clean up unused workers. This is now done by the coaster service itself. The problem with letting the workers do this is that they have no knowledge that they are part of a block. In said previous version, a worker dying would be seen immediately by the service through the fact that the worker job ended. This is not the case with the current block scheme, in which workers are part of multi-node jobs. The advantage of letting the workers do this is that it is algorithmically simple.

So given the above, I'd be in favor of getting rid of this idle timeout. The only remaining concern is preventing workers from running when the coaster service has died. However, the heartbeat mechanism should take care of that. Opinions?

Mihael

On Wed, 2010-11-10 at 15:41 -0700, Matthew Woitaszek wrote:
> Good afternoon,
>
> While running with Coasters, I occasionally get messages like this:
>
> Idle time exceeded at /home/username/.globus/coasters/cscript....pl line 627.
>
> Then things go horribly wrong and the processing usually doesn't complete.
>
> At first I thought this was in cases where my workflow had a long tail and many workers were left idle as some long-running tasks finished up -- a symptom of my "let's try this 512-task workflow with 64-128 cores and see what happens!" experimentation phase. I got around it by just requesting fewer nodes from PBS in my Coasters configuration. But now it's popping up on smaller workflows. The susceptible workflows seem to be preloaded with less than one node's worth of tasks on the first round of dependencies.
>
> Is there a way that I can increase the idle time limit? Ideally, I'd like the coasters to wait for the entire PBS job walltime.
>
> Matthew

From aespinosa at cs.uchicago.edu Wed Nov 10 21:08:12 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 10 Nov 2010 21:08:12 -0600
Subject: [Swift-user] concurrent_mapper filenames
Message-ID:

Blowup fblow[] <concurrent_mapper; prefix="blowup-", suffix=".dat">;

Is this the expected generated sequence?

blowup/
blowup/_concurrent
blowup/_concurrent/blowup---array
blowup/_concurrent/blowup---array/elt-10.dat
blowup/_concurrent/blowup---array/elt-12.dat
blowup/_concurrent/blowup---array/h8
blowup/_concurrent/blowup---array/h8/elt-108.dat
blowup/_concurrent/blowup---array/elt-24.dat
blowup/_concurrent/blowup---array/elt-11.dat
blowup/_concurrent/blowup---array/h1
blowup/_concurrent/blowup---array/h1/elt-101.dat
blowup/_concurrent/blowup---array/h7
blowup/_concurrent/blowup---array/h7/elt-157.dat
blowup/_concurrent/blowup---array/h7/elt-107.dat
blowup/_concurrent/blowup---array/elt-14.dat
blowup/_concurrent/blowup---array/elt-15.dat
blowup/_concurrent/blowup---array/h14
blowup/_concurrent/blowup---array/h14/elt-39.dat
blowup/_concurrent/blowup---array/h14/elt-164.dat
blowup/_concurrent/blowup---array/elt-7.dat
blowup/_concurrent/blowup---array/h18
blowup/_concurrent/blowup---array/h18/elt-93.dat
blowup/_concurrent/blowup---array/h18/elt-168.dat

-- Allan M. Espinosa, PhD student, Computer Science, University of Chicago

From mparisien at uchicago.edu Thu Nov 11 09:16:48 2010
From: mparisien at uchicago.edu (Marc Parisien)
Date: Thu, 11 Nov 2010 09:16:48 -0600
Subject: [Swift-user] runnin' on IBI
Message-ID: <9B56F526-7E0A-43FC-BF3C-A5D0B0B6F920@uchicago.edu>

Hi, has anyone managed to run stuff on the IBI cluster? I've tried twice now, and both attempts eventually failed. The error goes along the lines of:

Caused by: Exception caught while reading exit code
Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exception caught while reading exit code
Caused by: java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Unknown Source)

which would be strange for a program to terminate without an exit code... I attached the last 300 lines of the log file, in case it helps!

Many Thanks, Marc.

-------------- next part -------------- A non-text attachment was scrubbed... Name: swift.log Type: application/octet-stream Size: 27895 bytes Desc: not available URL:

From wilde at mcs.anl.gov Thu Nov 11 13:56:03 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Nov 2010 13:56:03 -0600 (CST)
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To:
Message-ID: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>

Were you expecting something like: blowup/blowup-array/elt-10.dat ?

We could try to adjust the output behavior of this mapper. I'm more interested, though, in revamping our mapper set, and rolling out a new set while leaving the old ones in place (deprecated) for a while.

- Mike

----- Original Message -----
> Blowup fblow[] <concurrent_mapper; prefix="blowup-", suffix=".dat">;
>
> Is this the expected generated sequence?
>
> blowup/
> blowup/_concurrent
> blowup/_concurrent/blowup---array
> blowup/_concurrent/blowup---array/elt-10.dat
> blowup/_concurrent/blowup---array/elt-12.dat
> blowup/_concurrent/blowup---array/h8
> blowup/_concurrent/blowup---array/h8/elt-108.dat
> blowup/_concurrent/blowup---array/elt-24.dat
> blowup/_concurrent/blowup---array/elt-11.dat
> blowup/_concurrent/blowup---array/h1
> blowup/_concurrent/blowup---array/h1/elt-101.dat
> blowup/_concurrent/blowup---array/h7
> blowup/_concurrent/blowup---array/h7/elt-157.dat
> blowup/_concurrent/blowup---array/h7/elt-107.dat
> blowup/_concurrent/blowup---array/elt-14.dat
> blowup/_concurrent/blowup---array/elt-15.dat
> blowup/_concurrent/blowup---array/h14
> blowup/_concurrent/blowup---array/h14/elt-39.dat
> blowup/_concurrent/blowup---array/h14/elt-164.dat
> blowup/_concurrent/blowup---array/elt-7.dat
> blowup/_concurrent/blowup---array/h18
> blowup/_concurrent/blowup---array/h18/elt-93.dat
> blowup/_concurrent/blowup---array/h18/elt-168.dat
>
> -- Allan M. Espinosa, PhD student, Computer Science, University of Chicago

From aespinosa at cs.uchicago.edu Thu Nov 11 14:02:11 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 11 Nov 2010 14:02:11 -0600
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
References: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
Message-ID:

I was expecting something like blowup/blowup-XXXX.dat. But it doesn't matter, since I'm only after the output.

2010/11/11 Michael Wilde :
> Were you expecting something like:
> blowup/blowup-array/elt-10.dat
> ?
>
> We could try to adjust the output behavior of this mapper.
>
> I'm more interested, though, in revamping our mapper set, and rolling out a new set while leaving the old ones in place (deprecated) for a while.
>
> - Mike
>
> ----- Original Message -----
>> Blowup fblow[] <concurrent_mapper; prefix="blowup-", suffix=".dat">;
>>
>> Is this the expected generated sequence?
>>
>> blowup/
>> blowup/_concurrent
>> blowup/_concurrent/blowup---array
>> blowup/_concurrent/blowup---array/elt-10.dat
>> blowup/_concurrent/blowup---array/elt-12.dat
>> blowup/_concurrent/blowup---array/h8
>> blowup/_concurrent/blowup---array/h8/elt-108.dat
>> blowup/_concurrent/blowup---array/elt-24.dat
>> blowup/_concurrent/blowup---array/elt-11.dat
>> blowup/_concurrent/blowup---array/h1
>> blowup/_concurrent/blowup---array/h1/elt-101.dat
>> blowup/_concurrent/blowup---array/h7
>> blowup/_concurrent/blowup---array/h7/elt-157.dat
>> blowup/_concurrent/blowup---array/h7/elt-107.dat
>> blowup/_concurrent/blowup---array/elt-14.dat
>> blowup/_concurrent/blowup---array/elt-15.dat
>> blowup/_concurrent/blowup---array/h14
>> blowup/_concurrent/blowup---array/h14/elt-39.dat
>> blowup/_concurrent/blowup---array/h14/elt-164.dat
>> blowup/_concurrent/blowup---array/elt-7.dat
>> blowup/_concurrent/blowup---array/h18
>> blowup/_concurrent/blowup---array/h18/elt-93.dat
>> blowup/_concurrent/blowup---array/h18/elt-168.dat
>>
>> -- Allan M. Espinosa, PhD student, Computer Science, University of Chicago
From benc at hawaga.org.uk Thu Nov 11 14:05:32 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 11 Nov 2010 20:05:32 +0000 (GMT)
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To:
References: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
Message-ID:

> I was expecting something like blowup/blowup-XXXX.dat. But it doesn't matter, since I'm only after the output.

Roughly, the XXXXX number gets converted into a digit sequence in some base, and then each digit is used to make a new directory level, with the last digit being the actual file name.

The aim is to reduce the number of GPFS nodes accessing the same directory, which is/was a fairly serious scalability problem.

--

From benc at hawaga.org.uk Thu Nov 11 14:01:51 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 11 Nov 2010 20:01:51 +0000 (GMT)
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
References: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
Message-ID:

> Were you expecting something like:
> blowup/blowup-array/elt-10.dat
> ?
>
> We could try to adjust the output behavior of this mapper.

The concurrent mapper was not intended for file names that you would expect to use outside of Swift; there is already a mapper for numbering files based on e.g. array indices. It's going to generate "weird looking" filenames based on tuning I did to make it behave well on GPFS, rather than filenames that you should expect to be able to predict.

--

From aespinosa at cs.uchicago.edu Thu Nov 11 14:17:37 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 11 Nov 2010 14:17:37 -0600
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To:
References: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
Message-ID:

Ah, that makes sense. Thanks Ben!

-Allan

2010/11/11 Ben Clifford :
>> I was expecting something like blowup/blowup-XXXX.dat. But it doesn't matter, since I'm only after the output.
>
> Roughly, the XXXXX number gets converted into a digit sequence in some base, and then each digit is used to make a new directory level, with the last digit being the actual file name.
>
> The aim is to reduce the number of GPFS nodes accessing the same directory, which is/was a fairly serious scalability problem.
>

From wilde at mcs.anl.gov Thu Nov 11 18:11:19 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Nov 2010 18:11:19 -0600 (CST)
Subject: [Swift-user] runnin' on IBI
In-Reply-To: <9B56F526-7E0A-43FC-BF3C-A5D0B0B6F920@uchicago.edu>
Message-ID: <190998605.50353.1289520679078.JavaMail.root@zimbra.anl.gov>

Marc, can you point me to the directory in which you performed this run, and make sure that I can access it?

I am wondering if you changed swift.properties, either by specifying a -config file on the swift command line or by editing swift.properties in your swift build or $HOME/.swift directory? Specifically, I'm wondering what the property "status.mode" is set to. If it's set to:

status.mode=provider

then can you try again with it set to:

status.mode=file

and vice versa? It looks to me as if it's set (defaulting) to "file", but I can't tell without looking deeper into the swift code.

As I recall you are using the SGE provider in non-coaster mode, right?
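(For background, the two settings differ in how the app exit code is detected; this is my understanding of their meaning, worth verifying against the Swift user guide:)

status.mode=file       # the remote wrapper writes a per-job exit-status file, which Swift polls for
status.mode=provider   # Swift trusts the job status / exit code reported by the execution provider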
If you have not changed anything in swift.properties, but you are specifying -sites.file and -tc.file on the command line, then you should create a file (call it "cf") with these lines, in the directory in which you run the swift command:

wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=false
provider.staging.pin.swiftfiles=false

and then run the swift command with the additional arg (at the front):

swift -config cf -sites.file sites.xml etc etc

This may not address the problem, but it will help us diagnose it a bit further.

- Mike

----- Original Message -----
> Hi,
>
> has anyone managed to run stuff on the IBI cluster?
>
> I've tried twice now, and both attempts eventually failed. The error goes along the lines of:
>
> Caused by: Exception caught while reading exit code
> Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exception caught while reading exit code
> Caused by: java.lang.NumberFormatException: null
> at java.lang.Integer.parseInt(Unknown Source)
>
> which would be strange for a program to terminate without an exit code...
>
> I attached the last 300 lines of the log file, in case it helps!
>
> Many Thanks, Marc.

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From wilde at mcs.anl.gov Thu Nov 18 08:31:17 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 18 Nov 2010 08:31:17 -0600 (CST)
Subject: [Swift-user] runnin' on IBI
In-Reply-To: <7B605614-990E-403A-B01B-248C913640F0@uchicago.edu>
Message-ID: <438910860.76468.1290090677740.JavaMail.root@zimbra.anl.gov>

Hi Marc,

Sorry for the delayed response. I was suspicious that perhaps the (fairly) new SGE provider was not correctly handling error return codes. But I tested it (on ibicluster) with both non-zero return codes and with apps that fail and raise a signal (I tried a divide-by-0 fault). Everything I tested worked - I was unable to cause the error you received.

Then I tried to run your modftdock swift script - that worked as well (see below). This is in ~wilde/marc if you want to examine how I ran it (see run.sh).

Then I tested with the default Java, which had earlier caused first.swift to fail. To my surprise, that worked as well.

So, the next things to do here are:
- you may want to copy my ~wilde/marc directory and see if it works for you
- I would like to get the full logs (and ideally the work directory) from the run(s) that failed for you
- if you could, please try to reproduce the error again and point me to a directory with all your files and all the files that swift produced
- also send me your $PATH value so I know what Java you used (we should log this in the Swift log if it's not there already)
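For reference, the return-code test was along these lines; the names below (fail.sh, failprog, fail.out) are made up for illustration:

#!/bin/sh
# fail.sh -- a trivial app that exits with a non-zero code
echo "failing on purpose" >&2
exit 3

# tc.data entry mapping the app name to that script (one line):
# ibicluster  failprog  /cchome/wilde/marc/fail.sh  INSTALLED  INTEL32::LINUX  null

And the SwiftScript driver:

type file;
app (file o) failprog() {
    failprog stdout=@filename(o);
}
file out <"fail.out">;
out = failprog();

Swift should then report a clean non-zero exit status for the failprog job, rather than the NumberFormatException above.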
Thanks, Mike

My output was:

[wilde at ibicluster ~]$ pwd
/cchome/wilde
[wilde at ibicluster ~]$ cd marc
[wilde at ibicluster marc]$ ./run.sh
Swift svn swift-r3702 (swift modified locally) cog-r2924 (cog modified locally)
RunID: 20101118-0819-01ot1210
Progress:
SwiftScript trace: 1a25
SwiftScript trace: 1a2z
Progress: Stage in:8
Progress: Submitting:6 Submitted:2
Progress: Submitting:3 Submitted:5
Progress: Submitted:8
Progress: Submitted:7 Active:1
Progress: Active:8
Progress: Active:8
Progress: Active:8
Progress: Active:8
Progress: Active:8
Progress: Active:7 Checking status:1
Progress: Active:4 Stage out:1 Finished successfully:3
Progress: Submitted:1 Active:4 Finished successfully:4
Progress: Active:5 Finished successfully:4
Progress: Active:4 Finished successfully:5
Progress: Active:4 Finished successfully:5
Progress: Active:3 Checking status:1 Finished successfully:5
Progress: Active:3 Finished successfully:6
Progress: Active:2 Checking status:1 Finished successfully:6
Progress: Active:2 Finished successfully:7
Progress: Active:1 Checking status:1 Finished successfully:7
Progress: Submitted:1 Finished successfully:9
Progress: Submitted:1 Finished successfully:9
Progress: Active:1 Finished successfully:9
Final status: Finished successfully:10
[wilde at ibicluster marc]$

----- Original Message -----
> Hi Mike,
>
> > Marc, can you point me to the directory in which you performed this run, and make sure that I can access it?
>
> there's an awful lot of files here... (see below)
>
> > I am wondering if you changed swift.properties, either by specifying a -config file on the swift command line or by editing swift.properties in your swift build or $HOME/.swift directory?
>
> I use a "cf" file, and in it I have:
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=0
> lazy.errors=false
> status.mode=provider
> use.provider.staging=false
> provider.staging.pin.swiftfiles=false
>
> -> I will change the status.mode to "file" tomorrow, and I will let you know if it works! If not, then I'll open up my folders and let you in :-D
>
> > As I recall you are using the SGE provider in non-coaster mode, right?
>
> That's it:
>
> ...
>
> PS On Swift's website, perhaps you could put the parameter sets that "work" or that are "specific" for each site/cluster...? I hope I'm the first one using swift on IBI ;-)
>
> As for my code in SVN, I have no problem with that, but my main program has to be compiled 32-bit (I compiled it on godzilla). I tried compiling it on a 64-bit machine, but the program crashes when executing (most likely because of the FFT lib it uses). I would prefer transferring the executable instead of an in-place compilation.
>
> I'll keep you informed, Marc.

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From aespinosa at cs.uchicago.edu Thu Nov 18 18:54:29 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Nov 2010 18:54:29 -0600
Subject: [Swift-user] throttle transfers and vdl:stagein graphs
In-Reply-To: <1289277290.18134.12.camel@blabla2.none>
References: <1289277290.18134.12.camel@blabla2.none>
Message-ID:

Ah, I see that the log entry nearest the actual task:transfer() call is in vdl:dostageinfile (no parallelFor loop there). But I still see more concurrent transfers than the throttle in swift.properties allows.
One of the classes invoked by task:transfer() is /org/globus/cog/karajan/workflow/nodes/grid/GridTransfer.class ? I'll try adding this to my log4j.properties and see what will happen.

-Allan

2010/11/8 Mihael Hategan :
> On Mon, 2010-11-08 at 20:50 -0600, Allan Espinosa wrote:
>> Hi,
>>
>> In my workflow, I use the default throttle.transfers=4. But my dostagein-total plot indicates that there are 72 stagein events going on for around 90 seconds. Shouldn't there be a linear ramp-up, or a saw-tooth pattern at the plateau, because of having throttled transfers?
>
> Lies. And statistics.
>
> The plot indicates that a number of instances of a certain portion of vdl-int is executing.
>
> If you look at that portion of vdl-int (i.e. between setprogress("Stage in") and setprogress("Submitting")) there are a few things happening, including directory creation.
>
> Essentially you are dealing with the following pattern:
>
> parallelFor(...
>   a()
>   throttle(4, b())
>   c()
> )
>
> The graph would show something like the parallelism in the invocation of the body of parallelFor. And it is quite possible that all a() invocations start well before any of the b() invocations start. The only accurate way to see the effect of the throttle is to trace the b() invocations, which you can probably do by looking at the status of file transfer tasks (by enabling the relevant logging stuff).
>
> Mihael
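For concreteness, the log4j.properties addition contemplated above would look something like this; the GridTransfer logger name comes from the class path quoted above, while the abstractions-module package name is a guess to be checked against the source:

# trace the karajan transfer node named above
log4j.logger.org.globus.cog.karajan.workflow.nodes.grid.GridTransfer=DEBUG
# guessed package for transfer-task status changes in the abstractions module
log4j.logger.org.globus.cog.abstraction.impl.common.task=DEBUG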
From hategan at mcs.anl.gov Thu Nov 18 22:59:51 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Nov 2010 20:59:51 -0800
Subject: [Swift-user] throttle transfers and vdl:stagein graphs
In-Reply-To:
References: <1289277290.18134.12.camel@blabla2.none>
Message-ID: <1290142791.32540.0.camel@blabla2.none>

On Thu, 2010-11-18 at 18:54 -0600, Allan Espinosa wrote:
> Ah, I see that the log entry nearest the actual task:transfer() call is in vdl:dostageinfile (no parallelFor loop there). But I still see more concurrent transfers than the throttle in swift.properties allows.
>
> One of the classes invoked by task:transfer() is /org/globus/cog/karajan/workflow/nodes/grid/GridTransfer.class ?
>
> I'll try adding this to my log4j.properties and see what will happen.

The transfer task status is the best choice. That's in the abstractions module.

Mihael