From hategan at mcs.anl.gov  Wed Apr  1 10:22:55 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 01 Apr 2009 10:22:55 -0500
Subject: [Swift-user] possible coasters problem
In-Reply-To: <49D29BF2.6060101@uchicago.edu>
References: <49D29BF2.6060101@uchicago.edu>
Message-ID: <1238599375.9751.0.camel@localhost>

On Tue, 2009-03-31 at 17:40 -0500, Glen Hocky wrote:
> Hi Guys,
> Do you think this is a problem with coasters or just the way i'm using it...

It's a problem with coasters.

What version of cog/swift is this?

> 
> Thanks,
> Glen
> > Exception in runoops:
> > Arguments: [input/fasta/T1ubq.fasta, 
> > teraportoutdir.100/T1ubq/T1ubq.ST50.TU200.0000.secseq, 
> > input/native/T1ubq.pdb, 
> > teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.pdt, 
> > teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.rmsd, 
> > 6, DEFAULT_INIT_TEMP_=_50, TEMP_UPDATE_INTERVAL_=_200, 
> > MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_55]
> > Host: teraport
> > Directory: oops-20090331-1701-fpuie7be/jobs/d/runoops-dsmccq8j
> > stderr.txt:
> >
> > stdout.txt:
> >
> > ----
> >
> > Caused by:
> >
> > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
> > java.lang.IllegalArgumentException: No worker with id=1956306968
> > at 
> > org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:85)
> > at 
> > org.globus.cog.abstraction.coaster.service.job.manager.CoasterQueueProcessor.run(CoasterQueueProcessor.java:71)
> > Caused by: java.lang.IllegalArgumentException: No worker with 
> > id=1956306968
> > at 
> > org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.getChannelContext(WorkerManager.java:483)
> > at 
> > org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:78)
> > ... 1 more
> >
> > Cleaning up...
> > Shutting down service at https://128.135.125.118:55513
> > Got channel MetaChannel: 22129174 -> GSSSChannel-null(1)
> > - Done
> > Command exited with non-zero status 2
> > real 1628.27
> > user 169.87
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From wilde at mcs.anl.gov  Wed Apr  1 10:28:48 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 01 Apr 2009 10:28:48 -0500
Subject: [Swift-user] possible coasters problem
In-Reply-To: <1238599375.9751.0.camel@localhost>
References: <49D29BF2.6060101@uchicago.edu> <1238599375.9751.0.camel@localhost>
Message-ID: <49D38830.8070903@mcs.anl.gov>

I think it was run on cog 2349, swift 2787.

com$ svn info cog
Path: cog
URL: https://cogkit.svn.sourceforge.net/svnroot/cogkit/trunk/current/src/cog
Repository Root: https://cogkit.svn.sourceforge.net/svnroot/cogkit
Repository UUID: 5b74d2a0-fa0e-0410-85ed-ffba77ec0bde
Revision: 2349
Node Kind: directory
Schedule: normal
Last Changed Author: hategan
Last Changed Rev: 2349
Last Changed Date: 2009-03-29 14:58:40 -0500 (Sun, 29 Mar 2009)

com$ svn info cog/modules/swift
Path: cog/modules/swift
URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk
Repository Root: https://svn.ci.uchicago.edu/svn/vdl2
Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8
Revision: 2788
Node Kind: directory
Schedule: normal
Last Changed Author: hategan
Last Changed Rev: 2787
Last Changed Date: 2009-03-30 19:31:33 -0500 (Mon, 30 Mar 2009)

com$


On 4/1/09 10:22 AM, Mihael Hategan wrote:
> On Tue, 2009-03-31 at 17:40 -0500, Glen Hocky wrote:
>> Hi Guys,
>> Do you think this is a problem with coasters or just the way i'm using it...
> 
> It's a problem with coasters.
> 
> What version of cog/swift is this?
> 
>> Thanks,
>> Glen
>>> Exception in runoops:
>>> Arguments: [input/fasta/T1ubq.fasta, 
>>> teraportoutdir.100/T1ubq/T1ubq.ST50.TU200.0000.secseq, 
>>> input/native/T1ubq.pdb, 
>>> teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.pdt, 
>>> teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.rmsd, 
>>> 6, DEFAULT_INIT_TEMP_=_50, TEMP_UPDATE_INTERVAL_=_200, 
>>> MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_55]
>>> Host: teraport
>>> Directory: oops-20090331-1701-fpuie7be/jobs/d/runoops-dsmccq8j
>>> stderr.txt:
>>>
>>> stdout.txt:
>>>
>>> ----
>>>
>>> Caused by:
>>>
>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
>>> java.lang.IllegalArgumentException: No worker with id=1956306968
>>> at 
>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:85)
>>> at 
>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterQueueProcessor.run(CoasterQueueProcessor.java:71)
>>> Caused by: java.lang.IllegalArgumentException: No worker with 
>>> id=1956306968
>>> at 
>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.getChannelContext(WorkerManager.java:483)
>>> at 
>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:78)
>>> ... 1 more
>>>
>>> Cleaning up...
>>> Shutting down service at https://128.135.125.118:55513
>>> Got channel MetaChannel: 22129174 -> GSSSChannel-null(1)
>>> - Done
>>> Command exited with non-zero status 2
>>> real 1628.27
>>> user 169.87
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From benc at hawaga.org.uk  Wed Apr  1 10:36:06 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 1 Apr 2009 15:36:06 +0000 (GMT)
Subject: [Swift-user] possible coasters problem
In-Reply-To: <49D38830.8070903@mcs.anl.gov>
References: <49D29BF2.6060101@uchicago.edu> <1238599375.9751.0.camel@localhost>
	<49D38830.8070903@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0904011534210.29026@dildano.hawaga.org.uk>


On Wed, 1 Apr 2009, Michael Wilde wrote:

> I think it was run on cog 2349, swift 2787.

You can tell more accurately from the run log with a command like this 
(that is, it will give what the executing Swift thought it was, rather 
than what was in your repo).

grep swift-r pc3-20090331-1506-bk02i344.log

-- 


From wilde at mcs.anl.gov  Wed Apr  1 10:39:10 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 01 Apr 2009 10:39:10 -0500
Subject: [Swift-user] possible coasters problem
In-Reply-To: <Pine.LNX.4.64.0904011534210.29026@dildano.hawaga.org.uk>
References: <49D29BF2.6060101@uchicago.edu> <1238599375.9751.0.camel@localhost>
	<49D38830.8070903@mcs.anl.gov>
	<Pine.LNX.4.64.0904011534210.29026@dildano.hawaga.org.uk>
Message-ID: <49D38A9E.3030006@mcs.anl.gov>

certainly.  Glen, can you provide that? I didnt have the time/info to 
hunt that down, sorry.

On 4/1/09 10:36 AM, Ben Clifford wrote:
> On Wed, 1 Apr 2009, Michael Wilde wrote:
> 
>> I think it was run on cog 2349, swift 2787.
> 
> You can tell more accurately from the run log with a command like this 
> (that is, it will give what the executing Swift thought it was, rather 
> than what was in your repo).
> 
> grep swift-r pc3-20090331-1506-bk02i344.log
> 


From hockyg at uchicago.edu  Wed Apr  1 13:56:55 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Wed, 01 Apr 2009 13:56:55 -0500
Subject: [Swift-user] possible coasters problem
In-Reply-To: <49D38830.8070903@mcs.anl.gov>
References: <49D29BF2.6060101@uchicago.edu> <1238599375.9751.0.camel@localhost>
	<49D38830.8070903@mcs.anl.gov>
Message-ID: <49D3B8F7.2060008@uchicago.edu>

it's
"Swift svn swift-r2788 cog-r2349"


Michael Wilde wrote:
> I think it was run on cog 2349, swift 2787.
>
> com$ svn info cog
> Path: cog
> URL: 
> https://cogkit.svn.sourceforge.net/svnroot/cogkit/trunk/current/src/cog
> Repository Root: https://cogkit.svn.sourceforge.net/svnroot/cogkit
> Repository UUID: 5b74d2a0-fa0e-0410-85ed-ffba77ec0bde
> Revision: 2349
> Node Kind: directory
> Schedule: normal
> Last Changed Author: hategan
> Last Changed Rev: 2349
> Last Changed Date: 2009-03-29 14:58:40 -0500 (Sun, 29 Mar 2009)
>
> com$ svn info cog/modules/swift
> Path: cog/modules/swift
> URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk
> Repository Root: https://svn.ci.uchicago.edu/svn/vdl2
> Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8
> Revision: 2788
> Node Kind: directory
> Schedule: normal
> Last Changed Author: hategan
> Last Changed Rev: 2787
> Last Changed Date: 2009-03-30 19:31:33 -0500 (Mon, 30 Mar 2009)
>
> com$
>
>
> On 4/1/09 10:22 AM, Mihael Hategan wrote:
>> On Tue, 2009-03-31 at 17:40 -0500, Glen Hocky wrote:
>>> Hi Guys,
>>> Do you think this is a problem with coasters or just the way i'm 
>>> using it...
>>
>> It's a problem with coasters.
>>
>> What version of cog/swift is this?
>>
>>> Thanks,
>>> Glen
>>>> Exception in runoops:
>>>> Arguments: [input/fasta/T1ubq.fasta, 
>>>> teraportoutdir.100/T1ubq/T1ubq.ST50.TU200.0000.secseq, 
>>>> input/native/T1ubq.pdb, 
>>>> teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.pdt, 
>>>> teraportoutdir.100/T1ubq//ST50.TU200/0000/00/06/T1ubq.ST50.TU200.0000.0006.rmsd, 
>>>> 6, DEFAULT_INIT_TEMP_=_50, TEMP_UPDATE_INTERVAL_=_200, 
>>>> MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_55]
>>>> Host: teraport
>>>> Directory: oops-20090331-1701-fpuie7be/jobs/d/runoops-dsmccq8j
>>>> stderr.txt:
>>>>
>>>> stdout.txt:
>>>>
>>>> ----
>>>>
>>>> Caused by:
>>>>
>>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: 
>>>> java.lang.IllegalArgumentException: No worker with id=1956306968
>>>> at 
>>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:85) 
>>>>
>>>> at 
>>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterQueueProcessor.run(CoasterQueueProcessor.java:71) 
>>>>
>>>> Caused by: java.lang.IllegalArgumentException: No worker with 
>>>> id=1956306968
>>>> at 
>>>> org.globus.cog.abstraction.coaster.service.job.manager.WorkerManager.getChannelContext(WorkerManager.java:483) 
>>>>
>>>> at 
>>>> org.globus.cog.abstraction.coaster.service.job.manager.CoasterTaskHandler.submit(CoasterTaskHandler.java:78) 
>>>>
>>>> ... 1 more
>>>>
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:55513
>>>> Got channel MetaChannel: 22129174 -> GSSSChannel-null(1)
>>>> - Done
>>>> Command exited with non-zero status 2
>>>> real 1628.27
>>>> user 169.87
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From hockyg at uchicago.edu  Thu Apr  2 15:06:19 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 02 Apr 2009 15:06:19 -0500
Subject: [Swift-user] coasters problem on teraport
Message-ID: <49D51ABB.80807@uchicago.edu>

I get the following error trying to run on teraport w/ coasters
> Progress:  Submitting:4 Failed:1 Finished successfully:10
> Execution failed:
>         Exception in runoops:
> Arguments: [input/fasta/T1dcj.fasta, 
> home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj/T1dcj.ST10.TU10.0000.secseq, 
> input/native/T1dcj.pdb, 
> home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj//ST10.TU10/0000/00/00/T1dcj.ST10.TU10.0000.0000.pdt, 
> home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj//ST10.TU10/0000/00/00/T1dcj.ST10.TU10.0000.0000.rmsd, 
> 0, DEFAULT_INIT_TEMP_=_10, TEMP_UPDATE_INTERVAL_=_10, 
> MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_30]
> Host: teraport
> Directory: oops-20090402-1307-6ud4sy60/jobs/9/runoops-9h3jft8j
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
>         Could not submit job
> Caused by:
>         Could not start coaster service
> Caused by:
>         Cannot submit job
> Caused by:
>         The job manager failed to open stderr
> Cleaning up...
>  Done
> Command exited with non-zero status 2
> real 189.93
> user 13.20
> sys 1.94
With sites files containing:
> <pool handle="teraport" >
>   <profile namespace="globus" key="queue">fast</profile>
>   <profile namespace="globus" 
> key="coasterWorkerMaxwalltime">01:00:00</profile>
>   <filesystem provider="coaster" 
> url="coaster-gt2://tp-grid1.ci.uchicago.edu" />
>   <execution provider="coaster" url="tp-grid1.ci.uchicago.edu" 
> jobmanager="gt2:gt2:pbs" />
>   <workdirectory>/home/hockyg/swiftwork</workdirectory>
> </pool>
> <pool handle="teraport" >
>   <profile namespace="globus" key="queue">fast</profile>
>   <profile namespace="globus" 
> key="coasterWorkerMaxwalltime">01:00:00</profile>
>   <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
>   <execution provider="coaster" url="tp-grid1.ci.uchicago.edu" 
> jobmanager="gt2:gt2:pbs" />
>   <workdirectory>/home/hockyg/swiftwork</workdirectory>
> </pool>


From hockyg at uchicago.edu  Thu Apr  2 15:08:54 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 02 Apr 2009 15:08:54 -0500
Subject: [Swift-user] re: coasters problem on terapo
Message-ID: <49D51B56.7050500@uchicago.edu>

More details, sorry:

Swift svn swift-r2809 cog-r2350


From hategan at mcs.anl.gov  Thu Apr  2 15:15:10 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 02 Apr 2009 15:15:10 -0500
Subject: [Swift-user] coasters problem on teraport
In-Reply-To: <49D51ABB.80807@uchicago.edu>
References: <49D51ABB.80807@uchicago.edu>
Message-ID: <1238703310.13579.1.camel@localhost>

Swift cannot properly guess your client machine's address.

Do export GLOBUS_HOSTNAME=your.address.or.ip before invoking swift.

On Thu, 2009-04-02 at 15:06 -0500, Glen Hocky wrote:
> I get the following error trying to run on teraport w/ coasters
> > Progress:  Submitting:4 Failed:1 Finished successfully:10
> > Execution failed:
> >         Exception in runoops:
> > Arguments: [input/fasta/T1dcj.fasta, 
> > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj/T1dcj.ST10.TU10.0000.secseq, 
> > input/native/T1dcj.pdb, 
> > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj//ST10.TU10/0000/00/00/T1dcj.ST10.TU10.0000.0000.pdt, 
> > home/ghocks/oops/swift/output/teraportoutdir.1/T1dcj//ST10.TU10/0000/00/00/T1dcj.ST10.TU10.0000.0000.rmsd, 
> > 0, DEFAULT_INIT_TEMP_=_10, TEMP_UPDATE_INTERVAL_=_10, 
> > MAX_NUMBER_OF_ANNEALING_STEPS_=_0, KILL_TIME_=_30]
> > Host: teraport
> > Directory: oops-20090402-1307-6ud4sy60/jobs/9/runoops-9h3jft8j
> > stderr.txt:
> >
> > stdout.txt:
> >
> > ----
> >
> > Caused by:
> >         Could not submit job
> > Caused by:
> >         Could not start coaster service
> > Caused by:
> >         Cannot submit job
> > Caused by:
> >         The job manager failed to open stderr
> > Cleaning up...
> >  Done
> > Command exited with non-zero status 2
> > real 189.93
> > user 13.20
> > sys 1.94
> With sites files containing:
> > <pool handle="teraport" >
> >   <profile namespace="globus" key="queue">fast</profile>
> >   <profile namespace="globus" 
> > key="coasterWorkerMaxwalltime">01:00:00</profile>
> >   <filesystem provider="coaster" 
> > url="coaster-gt2://tp-grid1.ci.uchicago.edu" />
> >   <execution provider="coaster" url="tp-grid1.ci.uchicago.edu" 
> > jobmanager="gt2:gt2:pbs" />
> >   <workdirectory>/home/hockyg/swiftwork</workdirectory>
> > </pool>
> > <pool handle="teraport" >
> >   <profile namespace="globus" key="queue">fast</profile>
> >   <profile namespace="globus" 
> > key="coasterWorkerMaxwalltime">01:00:00</profile>
> >   <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
> >   <execution provider="coaster" url="tp-grid1.ci.uchicago.edu" 
> > jobmanager="gt2:gt2:pbs" />
> >   <workdirectory>/home/hockyg/swiftwork</workdirectory>
> > </pool>
> 
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From wilde at mcs.anl.gov  Thu Apr  2 17:19:28 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 02 Apr 2009 17:19:28 -0500
Subject: [Swift-user] Re: installing swift
In-Reply-To: <Pine.LNX.4.64.0904021701160.22851@hep.uchicago.edu>
References: <Pine.LNX.4.64.0904021701160.22851@hep.uchicago.edu>
Message-ID: <49D539F0.4090206@mcs.anl.gov>

Hi Marco,

I am not sure about cog-setup, but I would skip that and proceed to the 
examples, after putting bin/ in your path and making sure that you have 
CA certificates so you can make a proxy (I assume you get all this from 
your OSG environment).

(I assume that was removed on purpose and the quickstart was not updated 
to reflect it. We need to fix that).

Note that Swift's bin/ provides some tools that are also in OSG bin dirs.

And you should direct all your questions to this list, swift-user, 
because that's where the developers listen for users who need help.

To test if its working, do the first example in the Swift tutorial:

http://www.ci.uchicago.edu/swift/guides/tutorial.php

Then move on to the Swift tutorial for the OSG Grid School, I think, 
would be the best approach.

- Mike

On 4/2/09 5:04 PM, Marco Mambelli wrote:
> Hi Mike,
> I'm setting up another host that could be used for the workshop (in case 
> the VM has problem, to have a VDT supported platform).
> 
> I need some help installing swift.
> I tried to follow the instructions in
> http://www.ci.uchicago.edu/swift/guides/quickstartguide.php
> 
> The files cog-setup and example.swift are not in the tarfiles (version 
> 8) that I downloaded.
> 
> I don't know exactly how to configure it and/or test if it is working.
> 
> Thank you,
> Marco


From benc at hawaga.org.uk  Thu Apr  2 17:28:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 2 Apr 2009 22:28:13 +0000 (GMT)
Subject: [Swift-user] Re: installing swift
In-Reply-To: <49D539F0.4090206@mcs.anl.gov>
References: <Pine.LNX.4.64.0904021701160.22851@hep.uchicago.edu>
	<49D539F0.4090206@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0904022222530.29026@dildano.hawaga.org.uk>


Yeah, that's quite out of date, it seems.

If you're installing on a machine with an OSG stack, get the version 
without extra stuff (swift-0.8-stripped.tar.gz) from the download page. 
Untar it, and put its bin/ directory on your system path.

The stripped version does not contain commands like grid-proxy-init, to 
avoid conflict with real versions deployed elsewhere (i.e. in an OSG 
install).

To test, go into the examples/swift/ directory, type:

swift first.swift

and check that a file hello.txt appears.

-- 


From wilde at mcs.anl.gov  Thu Apr  2 21:01:22 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 02 Apr 2009 21:01:22 -0500
Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs by
	CPU
Message-ID: <49D56DF2.7060005@mcs.anl.gov>

Some sites, like TeraPort, (I think) place independent jobs on all CPUS.

When using coasters, is it true that the user should not specify 
coastersPerNode? Or at least not set it to > 1?

  We should clarify this in the users guide.


From wilde at mcs.anl.gov  Thu Apr  2 21:20:05 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 02 Apr 2009 21:20:05 -0500
Subject: [Swift-user] Please clarify throttle parameters for coasters and
	GT2 issues
Message-ID: <49D57255.9090605@mcs.anl.gov>

I (and colleagues Im working with) have a few related questions:

At some point guidelines were posted regarding "safe" throttle values 
for sending to GT2 GRAM sites. I recall "max 40 jobs". Can you clarify 
if that number is still the best practice, and how to the 4-5 throttle 
parameters to conform?

Then, do those same values apply to coasters?

Finally, with the recent successes in high-volume coaster runs on Ranger 
- 190K jobs in ~ 5 hours, and even the earlier runs of 65K jobs - how 
were those achieved without tripping into the GRAM overhead limits, 
given that Ranger as far as I know has only GT2 GRAM and even submitting 
locallly to SGE must go through GRAM since we have no direct SGE 
provider?  Are the "safe" limits for Ranger simply higer,or is there 
something else involved that makes this practical?  In other words, 
please share and post how to get lots of jobs through Ranger.


From hategan at mcs.anl.gov  Thu Apr  2 22:52:20 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 02 Apr 2009 22:52:20 -0500
Subject: [Swift-user] Please clarify throttle parameters for coasters
	and GT2 issues
In-Reply-To: <49D57255.9090605@mcs.anl.gov>
References: <49D57255.9090605@mcs.anl.gov>
Message-ID: <1238730740.21734.2.camel@localhost>

On Thu, 2009-04-02 at 21:20 -0500, Michael Wilde wrote:
> I (and colleagues Im working with) have a few related questions:
> 
> At some point guidelines were posted regarding "safe" throttle values 
> for sending to GT2 GRAM sites. I recall "max 40 jobs". Can you clarify 
> if that number is still the best practice,

Yes. GT2 hasn't changed much since.

>  and how to the 4-5 throttle 
> parameters to conform?

The defaults are pretty much geared towards the gt2/gridftp combo.

> 
> Then, do those same values apply to coasters?

No. Throttles in the range of 32-256 (or maybe even more) are not
unreasonable with coasters.

> 
> Finally, with the recent successes in high-volume coaster runs on Ranger 
> - 190K jobs in ~ 5 hours, and even the earlier runs of 65K jobs - how 
> were those achieved without tripping into the GRAM overhead limits, 
> given that Ranger as far as I know has only GT2 GRAM and even submitting 
> locallly to SGE must go through GRAM since we have no direct SGE 
> provider?  Are the "safe" limits for Ranger simply higer,or is there 
> something else involved that makes this practical?  In other words, 
> please share and post how to get lots of jobs through Ranger.

coasterWorkersPerNode=16. That gives you 640 cpus with exactly 40 gram
jobs.


From hategan at mcs.anl.gov  Thu Apr  2 23:05:13 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 02 Apr 2009 23:05:13 -0500
Subject: [Swift-user] Specifying coastersPerNode on sites that place
	jobs by CPU
In-Reply-To: <49D56DF2.7060005@mcs.anl.gov>
References: <49D56DF2.7060005@mcs.anl.gov>
Message-ID: <1238731513.22128.0.camel@localhost>

On Thu, 2009-04-02 at 21:01 -0500, Michael Wilde wrote:
> Some sites, like TeraPort, (I think) place independent jobs on all CPUS.
> 
> When using coasters, is it true that the user should not specify 
> coastersPerNode? Or at least not set it to > 1?

Yes. I believe "coastersPerNode" is misleading. It should probably be
"coastersPerWorkerJob", but that may sound cryptic.

> 
>   We should clarify this in the users guide.
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From hockyg at uchicago.edu  Fri Apr  3 01:05:16 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 03 Apr 2009 01:05:16 -0500
Subject: [Swift-user] what does this kind of error mean?
Message-ID: <49D5A71C.8030302@uchicago.edu>

log attached

Thanks,
Glen

p.s. can i get it to run more than 37-38 jobs concurrently on one site?
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: teraport.out.5000
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090403/644b3521/attachment.ksh>

From benc at hawaga.org.uk  Fri Apr  3 02:36:22 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 07:36:22 +0000 (GMT)
Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs
	by CPU
In-Reply-To: <49D56DF2.7060005@mcs.anl.gov>
References: <49D56DF2.7060005@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0904030735580.29026@dildano.hawaga.org.uk>


On Thu, 2 Apr 2009, Michael Wilde wrote:

> Some sites, like TeraPort, (I think) place independent jobs on all CPUS.
> 
> When using coasters, is it true that the user should not specify
> coastersPerNode? Or at least not set it to > 1?

Pretty much, yes.

-- 


From benc at hawaga.org.uk  Fri Apr  3 02:41:44 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 07:41:44 +0000 (GMT)
Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs
	by CPU
In-Reply-To: <Pine.LNX.4.64.0904030735580.29026@dildano.hawaga.org.uk>
References: <49D56DF2.7060005@mcs.anl.gov>
	<Pine.LNX.4.64.0904030735580.29026@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0904030738020.29026@dildano.hawaga.org.uk>


On Fri, 3 Apr 2009, Ben Clifford wrote:

> > Some sites, like TeraPort, (I think) place independent jobs on all CPUS.
> > 
> > When using coasters, is it true that the user should not specify
> > coastersPerNode? Or at least not set it to > 1?
> 
> Pretty much, yes.

* Although it provides way to get node overcomitting, which I think in 
some applications is good (i.e. we have two cores, try to run 4 jobs on 
them).

* If you're running on a node that allocates jobs per CPU by default, its 
probably going to present less load to the worker submission system (eg 
GRAM2 or PBS) if you can make it submit one per physical machine and have 
coastersPerNode allocate the CPUs instead of PBS. This is what you'd do in 
qsub with the ppn option. Off the top of my head, I don't know how (or if 
its possible at all) with coasters+gram2+pbs.

-- 


From wilde at mcs.anl.gov  Fri Apr  3 07:55:16 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 03 Apr 2009 07:55:16 -0500
Subject: [Swift-user] Specifying coastersPerNode on sites that place jobs
	by CPU
In-Reply-To: <Pine.LNX.4.64.0904030738020.29026@dildano.hawaga.org.uk>
References: <49D56DF2.7060005@mcs.anl.gov>
	<Pine.LNX.4.64.0904030735580.29026@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0904030738020.29026@dildano.hawaga.org.uk>
Message-ID: <49D60734.6070102@mcs.anl.gov>


On 4/3/09 2:41 AM, Ben Clifford wrote:
> On Fri, 3 Apr 2009, Ben Clifford wrote:
> 
>>> Some sites, like TeraPort, (I think) place independent jobs on all CPUS.
>>>
>>> When using coasters, is it true that the user should not specify
>>> coastersPerNode? Or at least not set it to > 1?
>> Pretty much, yes.
> 
> * Although it provides way to get node overcomitting, which I think in 
> some applications is good (i.e. we have two cores, try to run 4 jobs on 
> them).

Sounds reasonable; would be good to try for IO-bound jobs.

> * If you're running on a node that allocates jobs per CPU by default, its 
> probably going to present less load to the worker submission system (eg 
> GRAM2 or PBS) if you can make it submit one per physical machine and have 
> coastersPerNode allocate the CPUs instead of PBS. This is what you'd do in 
> qsub with the ppn option. Off the top of my head, I don't know how (or if 
> its possible at all) with coasters+gram2+pbs.

I see the possibility (this could make the "allocate all the cores I 
want in one job" feature work for us today with no code change).

But I dont understand the mechanism, even w/o GRAM. Youre implying this 
would work with coasters today, in provider=local:pbs mode, right?

But how would you specify it? Lets say you want 32 cores, on teraport. 
if you say coastersPerNode=32, you would get 32 coasters per core 
(overcommiting, as before).

Did you mean "If you're running on a node that allocates jobs per HOST 
by default"? Eg, systems like Abe and Ranger, the systems with 
substantial cores per host (8,16)?  So say we run on Abe, which I think 
has PBS, and say coastersPerNode=32, you think we would get 4 hosts 
running 32 coasters, 8 per host, 1 per core?  That would be cool to try, 
and then to try over GRAM.  But this direction depends somewhat on how 
Mihael will specify and design the coaster provisioning feature.


From hategan at mcs.anl.gov  Fri Apr  3 10:16:09 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 03 Apr 2009 10:16:09 -0500
Subject: [Swift-user] what does this kind of error mean?
In-Reply-To: <49D5A71C.8030302@uchicago.edu>
References: <49D5A71C.8030302@uchicago.edu>
Message-ID: <1238771769.23515.2.camel@localhost>

The GridFTP error, I don't know. What are your throttling parameters in
swift.properties?


From hockyg at uchicago.edu  Fri Apr  3 10:17:27 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Fri, 03 Apr 2009 10:17:27 -0500
Subject: [Swift-user] what does this kind of error mean?
In-Reply-To: <1238771769.23515.2.camel@localhost>
References: <49D5A71C.8030302@uchicago.edu>
	<1238771769.23515.2.camel@localhost>
Message-ID: <49D62887.9060807@uchicago.edu>

throttle.submit=10
throttle.host.submit=10
throttle.score.job.factor=100.0
throttle.transfers=10
throttle.file.operations=10


Mihael Hategan wrote:
> The GridFTP error, I don't know. What are your throttling parameters in
> swift.properties?
>
>
>   


From hategan at mcs.anl.gov  Fri Apr  3 10:21:29 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 03 Apr 2009 10:21:29 -0500
Subject: [Swift-user] what does this kind of error mean?
In-Reply-To: <49D62887.9060807@uchicago.edu>
References: <49D5A71C.8030302@uchicago.edu>
	<1238771769.23515.2.camel@localhost>  <49D62887.9060807@uchicago.edu>
Message-ID: <1238772089.23804.0.camel@localhost>

Those are a bit too high. Do things work with the defaults?

On Fri, 2009-04-03 at 10:17 -0500, Glen Hocky wrote:
> throttle.submit=10
> throttle.host.submit=10
> throttle.score.job.factor=100.0
> throttle.transfers=10
> throttle.file.operations=10
> 
> 
> Mihael Hategan wrote:
> > The GridFTP error, I don't know. What are your throttling parameters in
> > swift.properties?
> >
> >
> >   
> 


From aespinosa at cs.uchicago.edu  Fri Apr  3 11:58:16 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 3 Apr 2009 09:58:16 -0700
Subject: [Swift-user] Please clarify throttle parameters for coasters and 
	GT2 issues
In-Reply-To: <1238730740.21734.2.camel@localhost>
References: <49D57255.9090605@mcs.anl.gov> <1238730740.21734.2.camel@localhost>
Message-ID: <50b07b4b0904030958o36051630i5b330f7068fc102d@mail.gmail.com>

In pushing Ranger's current scheduling policies of 50 SGE jobs, we can
push a max of 800 cpus.  I have tried this before using the gt2
interface.

-Allan

On Thu, Apr 2, 2009 at 8:52 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Thu, 2009-04-02 at 21:20 -0500, Michael Wilde wrote:
>> I (and colleagues Im working with) have a few related questions:
>>
>> At some point guidelines were posted regarding "safe" throttle values
>> for sending to GT2 GRAM sites. I recall "max 40 jobs". Can you clarify
>> if that number is still the best practice,
>
> Yes. GT2 hasn't changed much since.
>
>> ?and how to the 4-5 throttle
>> parameters to conform?
>
> The defaults are pretty much geared towards the gt2/gridftp combo.
>
>>
>> Then, do those same values apply to coasters?
>
> No. Throttles in the range of 32-256 (or maybe even more) are not
> unreasonable with coasters.
>
>>
>> Finally, with the recent successes in high-volume coaster runs on Ranger
>> - 190K jobs in ~ 5 hours, and even the earlier runs of 65K jobs - how
>> were those achieved without tripping into the GRAM overhead limits,
>> given that Ranger as far as I know has only GT2 GRAM and even submitting
>> locallly to SGE must go through GRAM since we have no direct SGE
>> provider? ?Are the "safe" limits for Ranger simply higer,or is there
>> something else involved that makes this practical? ?In other words,
>> please share and post how to get lots of jobs through Ranger.
>
> coasterWorkersPerNode=16. That gives you 640 cpus with exactly 40 gram
> jobs.
>


-- 
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>


From benc at hawaga.org.uk  Fri Apr  3 13:14:29 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 3 Apr 2009 18:14:29 +0000 (GMT)
Subject: [Swift-user] Re: installing swift
In-Reply-To: <Pine.LNX.4.64.0904031301500.7189@hep.uchicago.edu>
References: <Pine.LNX.4.64.0904021701160.22851@hep.uchicago.edu>
	<49D539F0.4090206@mcs.anl.gov>
	<Pine.LNX.4.64.0904022222530.29026@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0904031301500.7189@hep.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0904031813560.14816@dildano.hawaga.org.uk>


Every user can share one swift installation, but your pwd needs to be 
writable by the user - so run in ~ rather than in the Swift installation 
directory.

On Fri, 3 Apr 2009, Marco Mambelli wrote:

> Hi Ben,
> the user owning the installation ran succesfully:
> Swift 0.8 (stripped) swift-r2448 cog-r2261
> 
> RunID: 20090403-1305-k22hrieg
> Progress:
> Final status:  Finished successfully:1
> 
> Another user had permission problems:
> swift.log exists, it was created by the first user but it is writable also by
> this one (I tryed to append something to it).
> Which permission is looking for? does it need to be the owner? Group write is
> not sufficient?
> Should every user have its own swift installation?
> 
> Below the java trace.
> Thanks,
> Marco
> 
> 
> [train02 at grid07 swift]$ swift first.swift
> log4j:ERROR setFile(null,true) call failed.
> java.io.FileNotFoundException: swift.log (Permission denied)
>         at java.io.FileOutputStream.openAppend(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
>         at org.apache.log4j.FileAppender.setFile(FileAppender.java:272)
>         at
> org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151)
>         at
> org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247)
>         at
> org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123)
>         at
> org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87)
>         at
> org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645)
>         at
> org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603)
>         at
> org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500)
>         at
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406)
>         at
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432)
>         at
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460)
>         at org.apache.log4j.LogManager.<clinit>(LogManager.java:113)
>         at org.apache.log4j.Logger.getLogger(Logger.java:94)
>         at org.globus.cog.karajan.Loader.<clinit>(Loader.java:43)
> Could not start execution.
>         first.xml (Permission denied)
> 
> 
> On Thu, 2 Apr 2009, Ben Clifford wrote:
> 
> > 
> > Yeah, that's quite out of date, it seems.
> > 
> > If you're installing on a machine with an OSG stack, get the version
> > without extra stuff (swift-0.8-stripped.tar.gz) from the download page.
> > Untar it, and put its bin/ directory on your system path.
> > 
> > The stripped version does not contain commands like grid-proxy-init, to
> > avoid conflict with real versions deployed elsewhere (i.e. in an OSG
> > install).
> > 
> > To test, go into the examples/swift/ directory, type:
> > 
> > swift first.swift
> > 
> > and check that a file hello.txt appears.
> > 
> > 
> 
> 


From marco at hep.uchicago.edu  Fri Apr  3 13:07:49 2009
From: marco at hep.uchicago.edu (Marco Mambelli)
Date: Fri, 3 Apr 2009 13:07:49 -0500 (CDT)
Subject: [Swift-user] Re: installing swift
In-Reply-To: <Pine.LNX.4.64.0904022222530.29026@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0904021701160.22851@hep.uchicago.edu>
	<49D539F0.4090206@mcs.anl.gov>
	<Pine.LNX.4.64.0904022222530.29026@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0904031301500.7189@hep.uchicago.edu>

Hi Ben,
the user owning the installation ran succesfully:
Swift 0.8 (stripped) swift-r2448 cog-r2261

RunID: 20090403-1305-k22hrieg
Progress:
Final status:  Finished successfully:1

Another user had permission problems:
swift.log exists, it was created by the first user but it is writable 
also by this one (I tryed to append something to it).
Which permission is looking for? does it need to be the owner? Group write 
is not sufficient?
Should every user have its own swift installation?

Below the java trace.
Thanks,
Marco


[train02 at grid07 swift]$ swift first.swift
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: swift.log (Permission denied)
         at java.io.FileOutputStream.openAppend(Native Method)
         at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
         at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
         at org.apache.log4j.FileAppender.setFile(FileAppender.java:272)
         at 
org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151)
         at 
org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247)
         at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123)
         at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87)
         at 
org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645)
         at 
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603)
         at 
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500)
         at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406)
         at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432)
         at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460)
         at org.apache.log4j.LogManager.<clinit>(LogManager.java:113)
         at org.apache.log4j.Logger.getLogger(Logger.java:94)
         at org.globus.cog.karajan.Loader.<clinit>(Loader.java:43)
Could not start execution.
         first.xml (Permission denied)


On Thu, 2 Apr 2009, Ben Clifford wrote:

>
> Yeah, that's quite out of date, it seems.
>
> If you're installing on a machine with an OSG stack, get the version
> without extra stuff (swift-0.8-stripped.tar.gz) from the download page.
> Untar it, and put its bin/ directory on your system path.
>
> The stripped version does not contain commands like grid-proxy-init, to
> avoid conflict with real versions deployed elsewhere (i.e. in an OSG
> install).
>
> To test, go into the examples/swift/ directory, type:
>
> swift first.swift
>
> and check that a file hello.txt appears.
>
>


From marco at hep.uchicago.edu  Fri Apr  3 13:27:44 2009
From: marco at hep.uchicago.edu (Marco Mambelli)
Date: Fri, 3 Apr 2009 13:27:44 -0500 (CDT)
Subject: [Swift-user] Re: installing swift
In-Reply-To: <Pine.LNX.4.64.0904031813560.14816@dildano.hawaga.org.uk>
References: <Pine.LNX.4.64.0904021701160.22851@hep.uchicago.edu>
	<49D539F0.4090206@mcs.anl.gov>
	<Pine.LNX.4.64.0904022222530.29026@dildano.hawaga.org.uk>
	<Pine.LNX.4.64.0904031301500.7189@hep.uchicago.edu>
	<Pine.LNX.4.64.0904031813560.14816@dildano.hawaga.org.uk>
Message-ID: <Pine.LNX.4.64.0904031324130.7189@hep.uchicago.edu>

My bad, I changed permission to the tree (files and subdir) but not to the 
directory itself and I was confused by a swift.log somewhere else in the 
path.

swift is installed and works fine

Thanks,
Marco


On Fri, 3 Apr 2009, Ben Clifford wrote:

>
> Every user can share one swift installation, but your pwd needs to be
> writable by the user - so run in ~ rather than in the Swift installation
> directory.
>
> On Fri, 3 Apr 2009, Marco Mambelli wrote:
>
>> Hi Ben,
>> the user owning the installation ran succesfully:
>> Swift 0.8 (stripped) swift-r2448 cog-r2261
>>
>> RunID: 20090403-1305-k22hrieg
>> Progress:
>> Final status:  Finished successfully:1
>>
>> Another user had permission problems:
>> swift.log exists, it was created by the first user but it is writable also by
>> this one (I tryed to append something to it).
>> Which permission is looking for? does it need to be the owner? Group write is
>> not sufficient?
>> Should every user have its own swift installation?
>>
>> Below the java trace.
>> Thanks,
>> Marco
>>
>>
>> [train02 at grid07 swift]$ swift first.swift
>> log4j:ERROR setFile(null,true) call failed.
>> java.io.FileNotFoundException: swift.log (Permission denied)
>>         at java.io.FileOutputStream.openAppend(Native Method)
>>         at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
>>         at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
>>         at org.apache.log4j.FileAppender.setFile(FileAppender.java:272)
>>         at
>> org.apache.log4j.FileAppender.activateOptions(FileAppender.java:151)
>>         at
>> org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:247)
>>         at
>> org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:123)
>>         at
>> org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:87)
>>         at
>> org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:645)
>>         at
>> org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:603)
>>         at
>> org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:500)
>>         at
>> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:406)
>>         at
>> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:432)
>>         at
>> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:460)
>>         at org.apache.log4j.LogManager.<clinit>(LogManager.java:113)
>>         at org.apache.log4j.Logger.getLogger(Logger.java:94)
>>         at org.globus.cog.karajan.Loader.<clinit>(Loader.java:43)
>> Could not start execution.
>>         first.xml (Permission denied)
>>
>>
>> On Thu, 2 Apr 2009, Ben Clifford wrote:
>>
>>>
>>> Yeah, that's quite out of date, it seems.
>>>
>>> If you're installing on a machine with an OSG stack, get the version
>>> without extra stuff (swift-0.8-stripped.tar.gz) from the download page.
>>> Untar it, and put its bin/ directory on your system path.
>>>
>>> The stripped version does not contain commands like grid-proxy-init, to
>>> avoid conflict with real versions deployed elsewhere (i.e. in an OSG
>>> install).
>>>
>>> To test, go into the examples/swift/ directory, type:
>>>
>>> swift first.swift
>>>
>>> and check that a file hello.txt appears.
>>>
>>>
>>
>>
>


From wilde at mcs.anl.gov  Fri Apr  3 15:57:22 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 03 Apr 2009 15:57:22 -0500
Subject: [Swift-user] Please clarify throttle parameters for coasters
	and 	GT2 issues
In-Reply-To: <50b07b4b0904030958o36051630i5b330f7068fc102d@mail.gmail.com>
References: <49D57255.9090605@mcs.anl.gov> <1238730740.21734.2.camel@localhost>
	<50b07b4b0904030958o36051630i5b330f7068fc102d@mail.gmail.com>
Message-ID: <49D67832.1070202@mcs.anl.gov>

Allan,

50 SGE jobs is Ranger's queue limit, but its above the 40-job "safe" 
limit suggested by the Swift developers. Above 40 we risk causing sever 
overhead on their gatekeeper, which I think doubles as one of the login 
hosts.

So I would urge you that when using GT2 to submit (with and without 
coasters) that you stay under 40, until coasters can create more workers 
with less jobs.

- Mike


On 4/3/09 11:58 AM, Allan Espinosa wrote:
> In pushing Ranger's current scheduling policies of 50 SGE jobs, we can
> push a max of 800 cpus.  I have tried this before using the gt2
> interface.
> 
> -Allan
> 
> On Thu, Apr 2, 2009 at 8:52 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>> On Thu, 2009-04-02 at 21:20 -0500, Michael Wilde wrote:
>>> I (and colleagues Im working with) have a few related questions:
>>>
>>> At some point guidelines were posted regarding "safe" throttle values
>>> for sending to GT2 GRAM sites. I recall "max 40 jobs". Can you clarify
>>> if that number is still the best practice,
>> Yes. GT2 hasn't changed much since.
>>
>>>  and how to the 4-5 throttle
>>> parameters to conform?
>> The defaults are pretty much geared towards the gt2/gridftp combo.
>>
>>> Then, do those same values apply to coasters?
>> No. Throttles in the range of 32-256 (or maybe even more) are not
>> unreasonable with coasters.
>>
>>> Finally, with the recent successes in high-volume coaster runs on Ranger
>>> - 190K jobs in ~ 5 hours, and even the earlier runs of 65K jobs - how
>>> were those achieved without tripping into the GRAM overhead limits,
>>> given that Ranger as far as I know has only GT2 GRAM and even submitting
>>> locallly to SGE must go through GRAM since we have no direct SGE
>>> provider?  Are the "safe" limits for Ranger simply higer,or is there
>>> something else involved that makes this practical?  In other words,
>>> please share and post how to get lots of jobs through Ranger.
>> coasterWorkersPerNode=16. That gives you 640 cpus with exactly 40 gram
>> jobs.
>>
> 
> 
> 
> 


From hockyg at uchicago.edu  Mon Apr  6 21:42:08 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Mon, 06 Apr 2009 21:42:08 -0500
Subject: [Swift-user] consultation about error messages, coaster usage
Message-ID: <49DABD80.8010508@uchicago.edu>

Hi Guys,
I just ran (and killed) too big runs w/ swift, one on ranger, one on 
abe. I stopped them because in each case there were many "Failed but can 
retry" jobs, several "Failed to transfer wrapper log" errors and at the 
point where i stopped them, many more cpu's allocated than "Active" 
jobs. E.g. on ranger there were 14 running jobs in the queue w/ over an 
hour left (so 224 cpus) but only 76 "Active" jobs.

Could someone take a look at the logs and tell me if things are working 
properly? It's a little hard to tell from a user end...
On a ci home machine,
All run related files for abe are in
> /home/hockyg/oops/swift/output/abeoutdir.5/
and for ranger

> /home/hockyg/oops/swift/output/rangeroutdir.5/
In those directories, there will be a file $site.out.5 which has the stdout
and xout.XXXXX which has a log of all the commands run including the 
swift invocation
the tc.data file used is $site.data and the sites.xml file is $site.xml

Thanks,
Glen


From hategan at mcs.anl.gov  Mon Apr  6 22:08:41 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 22:08:41 -0500
Subject: [Swift-user] consultation about error messages, coaster usage
In-Reply-To: <49DABD80.8010508@uchicago.edu>
References: <49DABD80.8010508@uchicago.edu>
Message-ID: <1239073721.14311.2.camel@localhost>

You seem to be using a particularly bad version of swift. I suggest
trying the latest version.

Mihael

On Mon, 2009-04-06 at 21:42 -0500, Glen Hocky wrote:
> Hi Guys,
> I just ran (and killed) too big runs w/ swift, one on ranger, one on 
> abe. I stopped them because in each case there were many "Failed but can 
> retry" jobs, several "Failed to transfer wrapper log" errors and at the 
> point where i stopped them, many more cpu's allocated than "Active" 
> jobs. E.g. on ranger there were 14 running jobs in the queue w/ over an 
> hour left (so 224 cpus) but only 76 "Active" jobs.
> 
> Could someone take a look at the logs and tell me if things are working 
> properly? It's a little hard to tell from a user end...
> On a ci home machine,
> All run related files for abe are in
> > /home/hockyg/oops/swift/output/abeoutdir.5/
> and for ranger
> 
> > /home/hockyg/oops/swift/output/rangeroutdir.5/
> In those directories, there will be a file $site.out.5 which has the stdout
> and xout.XXXXX which has a log of all the commands run including the 
> swift invocation
> the tc.data file used is $site.data and the sites.xml file is $site.xml
> 
> Thanks,
> Glen
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From wilde at mcs.anl.gov  Mon Apr  6 23:15:36 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 06 Apr 2009 23:15:36 -0500
Subject: [Swift-user] consultation about error messages, coaster usage
In-Reply-To: <1239073721.14311.2.camel@localhost>
References: <49DABD80.8010508@uchicago.edu>
	<1239073721.14311.2.camel@localhost>
Message-ID: <49DAD368.5000006@mcs.anl.gov>

OK, will do. I think the fix you applied at 5PM enables us to go back to 
the latest rev. This morning we updated, then reverted back to Tuesday 3/31.

On 4/6/09 10:08 PM, Mihael Hategan wrote:
> You seem to be using a particularly bad version of swift. I suggest
> trying the latest version.
> 
> Mihael
> 
> On Mon, 2009-04-06 at 21:42 -0500, Glen Hocky wrote:
>> Hi Guys,
>> I just ran (and killed) too big runs w/ swift, one on ranger, one on 
>> abe. I stopped them because in each case there were many "Failed but can 
>> retry" jobs, several "Failed to transfer wrapper log" errors and at the 
>> point where i stopped them, many more cpu's allocated than "Active" 
>> jobs. E.g. on ranger there were 14 running jobs in the queue w/ over an 
>> hour left (so 224 cpus) but only 76 "Active" jobs.
>>
>> Could someone take a look at the logs and tell me if things are working 
>> properly? It's a little hard to tell from a user end...
>> On a ci home machine,
>> All run related files for abe are in
>>> /home/hockyg/oops/swift/output/abeoutdir.5/
>> and for ranger
>>
>>> /home/hockyg/oops/swift/output/rangeroutdir.5/
>> In those directories, there will be a file $site.out.5 which has the stdout
>> and xout.XXXXX which has a log of all the commands run including the 
>> swift invocation
>> the tc.data file used is $site.data and the sites.xml file is $site.xml
>>
>> Thanks,
>> Glen
>>
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From hategan at mcs.anl.gov  Mon Apr  6 23:40:42 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 06 Apr 2009 23:40:42 -0500
Subject: [Swift-user] consultation about error messages, coaster usage
In-Reply-To: <49DAD368.5000006@mcs.anl.gov>
References: <49DABD80.8010508@uchicago.edu>
	<1239073721.14311.2.camel@localhost>  <49DAD368.5000006@mcs.anl.gov>
Message-ID: <1239079242.15719.0.camel@localhost>

On Mon, 2009-04-06 at 23:15 -0500, Michael Wilde wrote:
> OK, will do. I think the fix you applied at 5PM enables us to go back to 
> the latest rev. This morning we updated, then reverted back to Tuesday 3/31.

Yes. Sorry about that one. It happens though that Tuesday 3/31 was also
pretty unstable.


From hockyg at uchicago.edu  Thu Apr  9 13:40:34 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 09 Apr 2009 13:40:34 -0500
Subject: [Swift-user] Is there a way to have an optional command line
 argument to a swift script?
Message-ID: <49DE4122.8020409@uchicago.edu>

I have a new command line argument for my script and I want to check if 
it's there or not.
doing @arg("foo") just gives
    Missing command line argument: foo


Thanks,
Glen


From benc at hawaga.org.uk  Thu Apr  9 13:45:18 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 9 Apr 2009 18:45:18 +0000 (GMT)
Subject: [Swift-user] Is there a way to have an optional command line
	argument to a swift script?
In-Reply-To: <49DE4122.8020409@uchicago.edu>
References: <49DE4122.8020409@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0904091843590.29026@dildano.hawaga.org.uk>


On Thu, 9 Apr 2009, Glen Hocky wrote:

> I have a new command line argument for my script and I want to check if it's
> there or not.

You can have a default value for an argument. The user guide describes 
how.

If you choose a sufficiently obscure default value, you can pretty much 
detect if you got the default value or not and change behaviour based on 
it. Or if you only want to know if its there or not in order to assign a 
default value you get that automatically.

-- 


From hategan at mcs.anl.gov  Thu Apr  9 13:49:22 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 09 Apr 2009 13:49:22 -0500
Subject: [Swift-user] Is there a way to have an optional command line
	argument to a swift script?
In-Reply-To: <49DE4122.8020409@uchicago.edu>
References: <49DE4122.8020409@uchicago.edu>
Message-ID: <1239302962.5659.0.camel@localhost>

On Thu, 2009-04-09 at 13:40 -0500, Glen Hocky wrote:
> I have a new command line argument for my script and I want to check if 
> it's there or not.
> doing @arg("foo") just gives
>     Missing command line argument: foo

What's the command line you use to start the script?

> 
> 
> Thanks,
> Glen
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From wilde at mcs.anl.gov  Wed Apr 15 12:41:30 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 15 Apr 2009 12:41:30 -0500
Subject: [Swift-user] Swift restarts with iterate?
Message-ID: <49E61C4A.90902@mcs.anl.gov>

Is the restart feature designed to correctly handle restarts of scripts 
with active, possibly nested, iterate statements?

The use case of interest here is to run a single copy of swift 
continuously, or for extended periods, doing a task graph of work, 
sleeping, and repeating indefinitely.

The thought was that you could restart swift indefinitely after any 
failures, or periodically if for the time being it cant run indefinitely 
due to memory or other resource consumption issues.

The use case involves applying it to process log data on a continuing basis.

I suspect the pattern may also be useful in digesting, eg, news data.

Comments or advice on feasibility would be useful before experimenting.


From benc at hawaga.org.uk  Wed Apr 15 14:02:15 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 15 Apr 2009 19:02:15 +0000 (GMT)
Subject: [Swift-user] Swift restarts with iterate?
In-Reply-To: <49E61C4A.90902@mcs.anl.gov>
References: <49E61C4A.90902@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0904151852590.29026@dildano.hawaga.org.uk>


On Wed, 15 Apr 2009, Michael Wilde wrote:

> Is the restart feature designed to correctly handle restarts of scripts with
> active, possibly nested, iterate statements?

There was no intention that such would not work.

> The use case of interest here is to run a single copy of swift continuously,
> or for extended periods, doing a task graph of work, sleeping, and repeating
> indefinitely.

I've considered that as a way of handling something like streaming 
datasets.

Doing that should work in as much as it should accomodate new data 
appearing.

However I'm unsure of the memory usage scalability compared to a run where 
you had all the data in place at the start of a single run - Swift will 
still make karajan threads to attempt (and then optimise away) already 
done executions, and will still have an in-memory representation of each 
data object already processed.

>From a SwiftScript language perspective, the above fits in just fine, I 
think.

>From a practical perspective as it is now, you will need something that 
depends on the array being closed and fails (for example, call /bin/false 
with the array a an input).

-- 


From hategan at mcs.anl.gov  Wed Apr 15 14:12:58 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 15 Apr 2009 14:12:58 -0500
Subject: [Swift-user] Swift restarts with iterate?
In-Reply-To: <Pine.LNX.4.64.0904151852590.29026@dildano.hawaga.org.uk>
References: <49E61C4A.90902@mcs.anl.gov>
	<Pine.LNX.4.64.0904151852590.29026@dildano.hawaga.org.uk>
Message-ID: <1239822778.23411.26.camel@localhost>

On Wed, 2009-04-15 at 19:02 +0000, Ben Clifford wrote:
> On Wed, 15 Apr 2009, Michael Wilde wrote:
> 
> > Is the restart feature designed to correctly handle restarts of scripts with
> > active, possibly nested, iterate statements?
> 
> There was no intention that such would not work.
> 
> > The use case of interest here is to run a single copy of swift continuously,
> > or for extended periods, doing a task graph of work, sleeping, and repeating
> > indefinitely.
> 
> I've considered that as a way of handling something like streaming 
> datasets.
> 
> Doing that should work in as much as it should accomodate new data 
> appearing.
> 
> However I'm unsure of the memory usage scalability compared to a run where 
> you had all the data in place at the start of a single run - Swift will 
> still make karajan threads to attempt (and then optimise away) already 
> done executions, and will still have an in-memory representation of each 
> data object already processed.

While thinking of the scalability issues a while ago when I did the
foreach limiting, I concluded that a solution to that may exist.

Currently we use a certain scheme to detect when a piece of data stops
being written to, such that it can be considered "closed".

Similarly, it may be possible to determine when a piece of data will not
be referenced any more, and consequently remove in-memory references to
it and its associated data structures.


From benc at hawaga.org.uk  Wed Apr 15 14:34:43 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 15 Apr 2009 19:34:43 +0000 (GMT)
Subject: [Swift-user] Swift restarts with iterate?
In-Reply-To: <49E61C4A.90902@mcs.anl.gov>
References: <49E61C4A.90902@mcs.anl.gov>
Message-ID: <Pine.LNX.4.64.0904151933180.29026@dildano.hawaga.org.uk>


On Wed, 15 Apr 2009, Michael Wilde wrote:

> Is the restart feature designed to correctly handle restarts of scripts with
> active, possibly nested, iterate statements?

I think, though, that foreach is more the construct for the use case of 
iterating over a growing collection of files.

Do a foreach over a mapepd collection of files. Run swift again with a 
bigger mapped collection of files, and if you are using the restart stuff 
described earlier in this thread, it will run only on the new entries.

-- 


From yuechen at bsd.uchicago.edu  Sat Apr 18 14:03:24 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Sat, 18 Apr 2009 14:03:24 -0500
Subject: [Swift-user] job waiting
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi,
 
I'm using Swift and PTMap to analyze E coli genome data. Right now, I'm only mapped on SDSC DTF and NCSA mercury, so I'm trying to use only these two computer clusters. Total number of jobs should be around 4127. After it started, the application runs normally. However, after 3769 jobs returned successfully, it could not receive any more data and the system kept waiting. On these computers, if I use qstat, I cannot find any active job. In my email, I received 45 emails like the following:
 
/////
PBS Job Id: 1932326.tg-master.ncsa.teragrid.org
Job Name:   null
job deleted
Job deleted at request of root at tg-master.ncsa.teragrid.org
MOAB_INFO:  job exceeded wallclock limit
//////
 
I'm wondering if I did something wrong and how I can avoid this situation. The log of the search should be /home/yuechen/PTMap2/PTMap2-unmod-20090418-1254-d8loarc1.log.

/************* The swift script I used is /home/yuechen/PTMap2/PTMap2-unmod.swift

/************* tc.data is /home/yuechen/PTMap2/tc.data

/************* sites.xml is /home/yuechen/PTMap2/sites.xml and the following are the two sites I used.

 <pool handle="SDSC_dtf_prews_pbs">
   <gridftp  url="gsiftp://tg-gridftp.sdsc.teragrid.org:2811/" />
   <jobmanager universe="vanilla" url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2" />
   <workdirectory >/gpfs-wan/scratch/yuechen</workdirectory>
   <profile namespace="globus" key="queue">fast</profile>
 </pool>
<pool handle="NCSAMERCURY">
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" jobManager="gt2:PBS"/>
    <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
 </pool>


Thank you very much!

Best regards,

Chen, Yue


This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090418/5539d765/attachment.html>

From benc at hawaga.org.uk  Sat Apr 18 17:01:16 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 18 Apr 2009 22:01:16 +0000 (GMT)
Subject: [Swift-user] job waiting
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0904182140340.29026@dildano.hawaga.org.uk>


On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:

> MOAB_INFO:  job exceeded wallclock limit

This message means that some of the josb that you tried to run took longer 
than is allowed by default.

I plotted your logs using swift-plot-log, and from the graph 'karajan 
active JOB_SUBMISSION cumulative duration' in the karajan tab 
(http://www.ci.uchicago.edu/~benc/tmp/report-PTMap2-unmod-20090418-1254-d8loarc1/karajan.html) 
it looks like while most of your jobs take somewhere between seconds to a 
few minutes, a number of your jobs take longer (up to 3000 seconds in that 
graph)

Check that: i) an hour is a sane time for your programs to be taking, and 
ii) that the queues that you are submitting to (the default on ncsa, for 
example) allow this length of time.

-- 


From benc at hawaga.org.uk  Sat Apr 18 17:05:55 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 18 Apr 2009 22:05:55 +0000 (GMT)
Subject: [Swift-user] job waiting
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk>


actually, I see you're using coasters on NCSA, so the actual numbers for 
walltimes being submitted into NCSA's queueing system will be a little 
strange. But my first question, that some jobs taking around an hour still 
stands.

Also I notice a large number of jobs being submitted at the start of your 
run - have you adjusted the default throttles on your swift installation 
to some larger value?

-- 


From yuechen at bsd.uchicago.edu  Sat Apr 18 18:20:43 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Sat, 18 Apr 2009 18:20:43 -0500
Subject: [Swift-user] job waiting
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk>
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi Ben,
 
Thanks for answering my question. This phenomena occur after half an hour of execution. If all the jobs finish execution at original speed, it would probably take not more than 40 min. How the system figure out that some jobs will take more than 1 hour? Should I request more time when I execute "grid-proxy-init"?
 
I did not change the default throttles. How much is more appropriate? The total number of jobs in my application typically run between 4000 and 30000 and typically each job can be finished within a couple of minutes.
 
Thanks!
 
Chen, Yue
 

________________________________

From: Ben Clifford [mailto:benc at hawaga.org.uk]
Sent: Sat 4/18/2009 5:05 PM
To: Yue, Chen - BMD
Cc: swift user
Subject: Re: [Swift-user] job waiting


actually, I see you're using coasters on NCSA, so the actual numbers for
walltimes being submitted into NCSA's queueing system will be a little
strange. But my first question, that some jobs taking around an hour still
stands.

Also I notice a large number of jobs being submitted at the start of your
run - have you adjusted the default throttles on your swift installation
to some larger value?

--


This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090418/bdcccda7/attachment.html>

From benc at hawaga.org.uk  Sun Apr 19 02:07:13 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sun, 19 Apr 2009 07:07:13 +0000 (GMT)
Subject: [Swift-user] job waiting
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0904190702260.15342@dildano.hawaga.org.uk>


On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:

> Thanks for answering my question. This phenomena occur after half an 
> hour of execution. If all the jobs finish execution at original speed, 
> it would probably take not more than 40 min. How the system figure out 
> that some jobs will take more than 1 hour? Should I request more time 
> when I execute "grid-proxy-init"?

Not with grid-proxy-init. You can specify a parameter called maxwalltime 
in your sites file or your tc.data file that will tell Swift an upper 
bound on how long your job will run. In Swift 0.8, coasters assume 
something like 10 minutes if you do not specify a walltime, so you will 
run into trouble.

For example, change the null at the end of your tc.data lines to 
globus::maxwalltime=50  to mean 50 minutes maxwalltime.

There has been work done on coasters since Swift 0.8, and so Mihael may 
have some other recommendations.

> I did not change the default throttles. How much is more appropriate? 
> The total number of jobs in my application typically run between 4000 
> and 30000 and typically each job can be finished within a couple of 
> minutes.

Where is your Swift installation? I would liek to look at it.

-- 


From hategan at mcs.anl.gov  Sun Apr 19 11:20:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 19 Apr 2009 11:20:14 -0500
Subject: [Swift-user] job waiting
In-Reply-To: <Pine.LNX.4.64.0904190702260.15342@dildano.hawaga.org.uk>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904190702260.15342@dildano.hawaga.org.uk>
Message-ID: <1240158014.25901.1.camel@localhost>

On Sun, 2009-04-19 at 07:07 +0000, Ben Clifford wrote:
> On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:
> 
> > Thanks for answering my question. This phenomena occur after half an 
> > hour of execution. If all the jobs finish execution at original speed, 
> > it would probably take not more than 40 min. How the system figure out 
> > that some jobs will take more than 1 hour? Should I request more time 
> > when I execute "grid-proxy-init"?
> 
> Not with grid-proxy-init. You can specify a parameter called maxwalltime 
> in your sites file or your tc.data file that will tell Swift an upper 
> bound on how long your job will run. In Swift 0.8, coasters assume 
> something like 10 minutes if you do not specify a walltime, so you will 
> run into trouble.
> 
> For example, change the null at the end of your tc.data lines to 
> globus::maxwalltime=50  to mean 50 minutes maxwalltime.
> 
> There has been work done on coasters since Swift 0.8, and so Mihael may 
> have some other recommendations.

Yes. Coasters are experimental. As such, there are problems. However,
you may get better results with the current development version.

> 
> > I did not change the default throttles. How much is more appropriate? 
> > The total number of jobs in my application typically run between 4000 
> > and 30000 and typically each job can be finished within a couple of 
> > minutes.
> 
> Where is your Swift installation? I would liek to look at it.
> 


From yuechen at bsd.uchicago.edu  Tue Apr 21 10:53:02 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Tue, 21 Apr 2009 10:53:02 -0500
Subject: [Swift-user] job waiting
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu><Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk><AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu><Pine.LNX.4.64.0904190702260.15342@dildano.hawaga.org.uk>
	<1240158014.25901.1.camel@localhost>
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C88@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi Ben and Mihael,
 
Thanks for answering my questions. I will try to set the maxwalltime in tc.data in my application and let you know how it works.
 
My Swift installation is at /home/yuechen/swift-0.8 on Communicado. Please let me know if you see any problem in my setup.
 
Thank you very much!
 
Chen, Yue
 
 
________________________________

From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Sun 4/19/2009 11:20 AM
To: Ben Clifford
Cc: Yue, Chen - BMD; swift user
Subject: RE: [Swift-user] job waiting


On Sun, 2009-04-19 at 07:07 +0000, Ben Clifford wrote:
> On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:
>
> > Thanks for answering my question. This phenomena occur after half an
> > hour of execution. If all the jobs finish execution at original speed,
> > it would probably take not more than 40 min. How the system figure out
> > that some jobs will take more than 1 hour? Should I request more time
> > when I execute "grid-proxy-init"?
>
> Not with grid-proxy-init. You can specify a parameter called maxwalltime
> in your sites file or your tc.data file that will tell Swift an upper
> bound on how long your job will run. In Swift 0.8, coasters assume
> something like 10 minutes if you do not specify a walltime, so you will
> run into trouble.
>
> For example, change the null at the end of your tc.data lines to
> globus::maxwalltime=50  to mean 50 minutes maxwalltime.
>
> There has been work done on coasters since Swift 0.8, and so Mihael may
> have some other recommendations.

Yes. Coasters are experimental. As such, there are problems. However,
you may get better results with the current development version.

>
> > I did not change the default throttles. How much is more appropriate?
> > The total number of jobs in my application typically run between 4000
> > and 30000 and typically each job can be finished within a couple of
> > minutes.
>
> Where is your Swift installation? I would liek to look at it.
>


This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090421/11b936a8/attachment.html>

From benc at hawaga.org.uk  Wed Apr 22 11:30:17 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 22 Apr 2009 16:30:17 +0000 (GMT)
Subject: [Swift-user] job waiting
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C92@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904190702260.15342@dildano.hawaga.org.uk>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C92@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <Pine.LNX.4.64.0904221628020.14392@dildano.hawaga.org.uk>


You might find it useful to try out coasters from swift 0.9rc2, which is a 
much more recent testing version of Swift compared to 0.8 (which I think 
you are using)

You can get that here:

www.ci.uchicago.edu/~benc/swift-0.9rc2.tar.gz

Your existing SwiftScript and site files should work with that.

-- 


From yuechen at bsd.uchicago.edu  Wed Apr 22 11:25:42 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Wed, 22 Apr 2009 11:25:42 -0500
Subject: [Swift-user] job waiting
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904190702260.15342@dildano.hawaga.org.uk>
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C92@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi Ben,
 
Yesterday, I tested my application a few times on NCSA mercury only with coaster and with the specification of globus::maxwalltime=50 in tc.data. Similar to previous try, in several runs, the application keeps waiting after 4076, 4052, 4099, 4048, 4051 successful returns respectively. Does this relate to my setting? The log for the last run is at: 
 
/home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log

I started to receive email with the following content after about 10 min of execution,  
 
/////////
PBS Job Id: 1947957.tg-master.ncsa.teragrid.org
Job Name:   null
job deleted
Job deleted at request of root at tg-master.ncsa.teragrid.org
MOAB_INFO:  job exceeded wallclock limit
/////////
 
However, Swift did not indicate any job failure, so should I worry about the success of those jobs? 
 
I also tried NCSA mercury only without coaster, but the submitted jobs do not seem to return successfully. I notice that if I use coaster, typicaly max number jobs I have on NCSA is about 130, but if I do not use coaster, I can have more than 300 jobs queued on NCSA computer. Is this related with the throttle setting?
 
I also tried SDSC dtf server without coaster, but the jobs submitted do not get started on SDSC dtf server. Instead, I got many error messages like the following. Should I contact teragrid for these errors?
 
Progress:  Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished successfully:230 Failed but can retry:45
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
Failed to transfer wrapper log from PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
 
The following is my sites.xml content for NCSA mercury with and without coaster and SDSC DTF:
 
 <pool handle="NCSAMERCURY">
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" jobManager="gt2:PBS"/>
    <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
 </pool>
 <pool handle="NCSAMERCURY_nocoaster">
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
   <jobmanager universe="vanilla" url="grid-hg.ncsa.teragrid.org:2119/jobmanager-pbs" major="2" />
    <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
 </pool>
 <pool handle="SDSC_dtf_prews_pbs">
   <gridftp  url="gsiftp://tg-gridftp.sdsc.teragrid.org:2811/" />
   <jobmanager universe="vanilla" url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2" />
   <workdirectory >/gpfs-wan/scratch/yuechen</workdirectory>
   <profile namespace="globus" key="queue">fast</profile>
 </pool>
 
The swift script I used is at:
 
/home/yuechen/PTMap2/PTMap2-unmod.swift
 
The tc.data I used is:
 
/home/yuechen/PTMap2/tc.data
 
I will start to try other servers to see if I can run all jobs successfully.
 
Thank you very much for help!
 
Chen, Yue
 

________________________________

From: Ben Clifford [mailto:benc at hawaga.org.uk]
Sent: Sun 4/19/2009 2:07 AM
To: Yue, Chen - BMD
Cc: swift user
Subject: RE: [Swift-user] job waiting


On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:

> Thanks for answering my question. This phenomena occur after half an
> hour of execution. If all the jobs finish execution at original speed,
> it would probably take not more than 40 min. How the system figure out
> that some jobs will take more than 1 hour? Should I request more time
> when I execute "grid-proxy-init"?

Not with grid-proxy-init. You can specify a parameter called maxwalltime
in your sites file or your tc.data file that will tell Swift an upper
bound on how long your job will run. In Swift 0.8, coasters assume
something like 10 minutes if you do not specify a walltime, so you will
run into trouble.

For example, change the null at the end of your tc.data lines to
globus::maxwalltime=50  to mean 50 minutes maxwalltime.

There has been work done on coasters since Swift 0.8, and so Mihael may
have some other recommendations.

> I did not change the default throttles. How much is more appropriate?
> The total number of jobs in my application typically run between 4000
> and 30000 and typically each job can be finished within a couple of
> minutes.

Where is your Swift installation? I would liek to look at it.

--


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090422/df4ad558/attachment.html>

From hategan at mcs.anl.gov  Wed Apr 22 11:44:14 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 22 Apr 2009 11:44:14 -0500
Subject: [Swift-user] job waiting
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C92@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu>
	<Pine.LNX.4.64.0904190702260.15342@dildano.hawaga.org.uk>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C92@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <1240418654.7409.0.camel@localhost>

This behavior was observed previously with the version you have. I
strongly recommend upgrading to the version Ben mentions.

On Wed, 2009-04-22 at 11:25 -0500, Yue, Chen - BMD wrote:
> Hi Ben,
>  
> Yesterday, I tested my application a few times on NCSA mercury only
> with coaster and with the specification of globus::maxwalltime=50 in
> tc.data. Similar to previous try, in several runs, the application
> keeps waiting after 4076, 4052, 4099, 4048, 4051 successful returns
> respectively. Does this relate to my setting? The log for the last run
> is at: 
>  
> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log
> 
> I started to receive email with the following content after about 10
> min of execution,  
>  
> /////////
> PBS Job Id: 1947957.tg-master.ncsa.teragrid.org
> Job Name:   null
> job deleted
> Job deleted at request of root at tg-master.ncsa.teragrid.org
> MOAB_INFO:  job exceeded wallclock limit
> /////////
>  
> However, Swift did not indicate any job failure, so should I worry
> about the success of those jobs? 
>  
> I also tried NCSA mercury only without coaster, but the submitted jobs
> do not seem to return successfully. I notice that if I use coaster,
> typicaly max number jobs I have on NCSA is about 130, but if I do not
> use coaster, I can have more than 300 jobs queued on NCSA computer. Is
> this related with the throttle setting?
>  
> I also tried SDSC dtf server without coaster, but the jobs submitted
> do not get started on SDSC dtf server. Instead, I got many error
> messages like the following. Should I contact teragrid for these
> errors?
>  
> Progress:  Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished
> successfully:230 Failed but can retry:45
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
>  
> The following is my sites.xml content for NCSA mercury with and
> without coaster and SDSC DTF:
>  
>  <pool handle="NCSAMERCURY">
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>     <execution provider="coaster" url="grid-hg.ncsa.teragrid.org"
> jobManager="gt2:PBS"/>
>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>  </pool>
>  <pool handle="NCSAMERCURY_nocoaster">
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>    <jobmanager universe="vanilla"
> url="grid-hg.ncsa.teragrid.org:2119/jobmanager-pbs" major="2" />
>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>  </pool>
>  <pool handle="SDSC_dtf_prews_pbs">
>    <gridftp  url="gsiftp://tg-gridftp.sdsc.teragrid.org:2811/" />
>    <jobmanager universe="vanilla"
> url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2" />
>    <workdirectory >/gpfs-wan/scratch/yuechen</workdirectory>
>    <profile namespace="globus" key="queue">fast</profile>
>  </pool>
>  
> The swift script I used is at:
>  
> /home/yuechen/PTMap2/PTMap2-unmod.swift
>  
> The tc.data I used is:
>  
> /home/yuechen/PTMap2/tc.data
>  
> I will start to try other servers to see if I can run all jobs
> successfully.
>  
> Thank you very much for help!
>  
> Chen, Yue
>  
>  
> 
>  
> 
> 
> 
>  
>  
>  
> 
> 
> ______________________________________________________________________
> From: Ben Clifford [mailto:benc at hawaga.org.uk]
> Sent: Sun 4/19/2009 2:07 AM
> To: Yue, Chen - BMD
> Cc: swift user
> Subject: RE: [Swift-user] job waiting
> 
> 
> 
> On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:
> 
> > Thanks for answering my question. This phenomena occur after half an
> > hour of execution. If all the jobs finish execution at original
> speed,
> > it would probably take not more than 40 min. How the system figure
> out
> > that some jobs will take more than 1 hour? Should I request more
> time
> > when I execute "grid-proxy-init"?
> 
> Not with grid-proxy-init. You can specify a parameter called
> maxwalltime
> in your sites file or your tc.data file that will tell Swift an upper
> bound on how long your job will run. In Swift 0.8, coasters assume
> something like 10 minutes if you do not specify a walltime, so you
> will
> run into trouble.
> 
> For example, change the null at the end of your tc.data lines to
> globus::maxwalltime=50  to mean 50 minutes maxwalltime.
> 
> There has been work done on coasters since Swift 0.8, and so Mihael
> may
> have some other recommendations.
> 
> > I did not change the default throttles. How much is more
> appropriate?
> > The total number of jobs in my application typically run between
> 4000
> > and 30000 and typically each job can be finished within a couple of
> > minutes.
> 
> Where is your Swift installation? I would liek to look at it.
> 
> --
> 
> 
> 
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From yuechen at bsd.uchicago.edu  Wed Apr 22 21:29:51 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Wed, 22 Apr 2009 21:29:51 -0500
Subject: [Swift-user] job waiting
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C83@ADM-EXCHVS04.bsdad.uchicago.edu><Pine.LNX.4.64.0904182203480.29026@dildano.hawaga.org.uk><AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C85@ADM-EXCHVS04.bsdad.uchicago.edu><Pine.LNX.4.64.0904190702260.15342@dildano.hawaga.org.uk><AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C92@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1240418654.7409.0.camel@localhost>
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158C97@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi Mihael and Ben,
 
Thanks for your information. The new version of coasters works very well on NCSA mercury and I don't receive those email any more. But I run into some problem with SDSC server. I will send separate email tomorrow after I get response from SDSC people.
 
Best,
Chen, Yue
 

________________________________

From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Wed 4/22/2009 11:44 AM
To: Yue, Chen - BMD
Cc: Ben Clifford; swift user
Subject: RE: [Swift-user] job waiting


This behavior was observed previously with the version you have. I
strongly recommend upgrading to the version Ben mentions.

On Wed, 2009-04-22 at 11:25 -0500, Yue, Chen - BMD wrote:
> Hi Ben,
> 
> Yesterday, I tested my application a few times on NCSA mercury only
> with coaster and with the specification of globus::maxwalltime=50 in
> tc.data. Similar to previous try, in several runs, the application
> keeps waiting after 4076, 4052, 4099, 4048, 4051 successful returns
> respectively. Does this relate to my setting? The log for the last run
> is at:
> 
> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1036-07c88p47.log
>
> I started to receive email with the following content after about 10
> min of execution, 
> 
> /////////
> PBS Job Id: 1947957.tg-master.ncsa.teragrid.org
> Job Name:   null
> job deleted
> Job deleted at request of root at tg-master.ncsa.teragrid.org
> MOAB_INFO:  job exceeded wallclock limit
> /////////
> 
> However, Swift did not indicate any job failure, so should I worry
> about the success of those jobs?
> 
> I also tried NCSA mercury only without coaster, but the submitted jobs
> do not seem to return successfully. I notice that if I use coaster,
> typicaly max number jobs I have on NCSA is about 130, but if I do not
> use coaster, I can have more than 300 jobs queued on NCSA computer. Is
> this related with the throttle setting?
> 
> I also tried SDSC dtf server without coaster, but the jobs submitted
> do not get started on SDSC dtf server. Instead, I got many error
> messages like the following. Should I contact teragrid for these
> errors?
> 
> Progress:  Stage in:93 Submitted:3710 Active:45 Stage out:4 Finished
> successfully:230 Failed but can retry:45
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/f on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/o on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> Failed to transfer wrapper log from
> PTMap2-unmod-20090421-2214-e6ssbye5/info/t on SDSC_dtf_prews_pbs
> 
> The following is my sites.xml content for NCSA mercury with and
> without coaster and SDSC DTF:
> 
>  <pool handle="NCSAMERCURY">
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>     <execution provider="coaster" url="grid-hg.ncsa.teragrid.org"
> jobManager="gt2:PBS"/>
>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>  </pool>
>  <pool handle="NCSAMERCURY_nocoaster">
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>    <jobmanager universe="vanilla"
> url="grid-hg.ncsa.teragrid.org:2119/jobmanager-pbs" major="2" />
>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>  </pool>
>  <pool handle="SDSC_dtf_prews_pbs">
>    <gridftp  url="gsiftp://tg-gridftp.sdsc.teragrid.org:2811/" />
>    <jobmanager universe="vanilla"
> url="tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs" major="2" />
>    <workdirectory >/gpfs-wan/scratch/yuechen</workdirectory>
>    <profile namespace="globus" key="queue">fast</profile>
>  </pool>
> 
> The swift script I used is at:
> 
> /home/yuechen/PTMap2/PTMap2-unmod.swift
> 
> The tc.data I used is:
> 
> /home/yuechen/PTMap2/tc.data
> 
> I will start to try other servers to see if I can run all jobs
> successfully.
> 
> Thank you very much for help!
> 
> Chen, Yue
> 
> 
>
> 
>
>
>
> 
> 
> 
>
>
> ______________________________________________________________________
> From: Ben Clifford [mailto:benc at hawaga.org.uk]
> Sent: Sun 4/19/2009 2:07 AM
> To: Yue, Chen - BMD
> Cc: swift user
> Subject: RE: [Swift-user] job waiting
>
>
>
> On Sat, 18 Apr 2009, Yue, Chen - BMD wrote:
>
> > Thanks for answering my question. This phenomena occur after half an
> > hour of execution. If all the jobs finish execution at original
> speed,
> > it would probably take not more than 40 min. How the system figure
> out
> > that some jobs will take more than 1 hour? Should I request more
> time
> > when I execute "grid-proxy-init"?
>
> Not with grid-proxy-init. You can specify a parameter called
> maxwalltime
> in your sites file or your tc.data file that will tell Swift an upper
> bound on how long your job will run. In Swift 0.8, coasters assume
> something like 10 minutes if you do not specify a walltime, so you
> will
> run into trouble.
>
> For example, change the null at the end of your tc.data lines to
> globus::maxwalltime=50  to mean 50 minutes maxwalltime.
>
> There has been work done on coasters since Swift 0.8, and so Mihael
> may
> have some other recommendations.
>
> > I did not change the default throttles. How much is more
> appropriate?
> > The total number of jobs in my application typically run between
> 4000
> > and 30000 and typically each job can be finished within a couple of
> > minutes.
>
> Where is your Swift installation? I would liek to look at it.
>
> --
>
>
>
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090422/36890032/attachment.html>

From hockyg at uchicago.edu  Thu Apr 23 14:21:36 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 23 Apr 2009 14:21:36 -0500
Subject: [Swift-user] max number of jobs?
Message-ID: <49F0BFC0.1040504@uchicago.edu>

Hi everyone.
I was wondering if there is a cap on number of coasters or jobs in the 
queue on some machines. I've had a lot of success running on Ranger but 
I've never had more than 256 active jobs (i.e. 16x16) even with very 
high initial score and throttle settings.

Glen


From hockyg at uchicago.edu  Thu Apr 23 14:43:00 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 23 Apr 2009 14:43:00 -0500
Subject: [Swift-user] max number of jobs?
In-Reply-To: <49F0C3FD.7040907@cs.uchicago.edu>
References: <49F0BFC0.1040504@uchicago.edu> <49F0C3FD.7040907@cs.uchicago.edu>
Message-ID: <49F0C4C4.6050403@uchicago.edu>

Ah, I do intend to try running under ranger but the one main reason I 
haven't is I'm trying to run from a single location (a ci machine) 
because it's easier to keep managed that way.

I'm running in the normal queue, but all of my jobs are 16 node, the 
reason for that I think is that there is no way to get larger block 
allocations as was being discussed a few weeks ago. I would be better 
off w/ larger because I'm sure the wait time is the same for 16 or 32 or 
64...

Ioan Raicu wrote:
> I can't help with the Swift or coaster settings, but don't forget that 
> Falkon is also installed on Ranger, and you can use it the same way 
> that you use it on intrepid. I have yet to do extremely large runs on 
> ranger to see how well things scale, but you might want to give Falkon 
> a try as well.
>
> Also, I recall something about the development queue being limited to 
> 16 or 32 nodes. The normal queue, which allows larger allocations, 
> usually also has higher wait times. Coaster might be configured to use 
> the faster development queue, which has a limited number of nodes you 
> can use. You might want to look into changing the queue Swift/Coaster 
> submits the jobs to. Perhaps Mihael or others can offer details on how 
> to change the queue Swift will submit to.
>
> Ioan
>
> Glen Hocky wrote:
>> Hi everyone.
>> I was wondering if there is a cap on number of coasters or jobs in 
>> the queue on some machines. I've had a lot of success running on 
>> Ranger but I've never had more than 256 active jobs (i.e. 16x16) even 
>> with very high initial score and throttle settings.
>>
>> Glen
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
>


From iraicu at cs.uchicago.edu  Thu Apr 23 14:39:41 2009
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Thu, 23 Apr 2009 14:39:41 -0500
Subject: [Swift-user] max number of jobs?
In-Reply-To: <49F0BFC0.1040504@uchicago.edu>
References: <49F0BFC0.1040504@uchicago.edu>
Message-ID: <49F0C3FD.7040907@cs.uchicago.edu>

I can't help with the Swift or coaster settings, but don't forget that 
Falkon is also installed on Ranger, and you can use it the same way that 
you use it on intrepid. I have yet to do extremely large runs on ranger 
to see how well things scale, but you might want to give Falkon a try as 
well.

Also, I recall something about the development queue being limited to 16 
or 32 nodes. The normal queue, which allows larger allocations, usually 
also has higher wait times. Coaster might be configured to use the 
faster development queue, which has a limited number of nodes you can 
use. You might want to look into changing the queue Swift/Coaster 
submits the jobs to. Perhaps Mihael or others can offer details on how 
to change the queue Swift will submit to.

Ioan

Glen Hocky wrote:
> Hi everyone.
> I was wondering if there is a cap on number of coasters or jobs in the 
> queue on some machines. I've had a lot of success running on Ranger 
> but I've never had more than 256 active jobs (i.e. 16x16) even with 
> very high initial score and throttle settings.
>
> Glen
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>

-- 
===================================================
Ioan Raicu, Ph.D.
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================


From hockyg at uchicago.edu  Thu Apr 23 14:45:55 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 23 Apr 2009 14:45:55 -0500
Subject: [Swift-user] max number of jobs?
In-Reply-To: <49F0C415.7040809@mcs.anl.gov>
References: <49F0BFC0.1040504@uchicago.edu> <49F0C415.7040809@mcs.anl.gov>
Message-ID: <49F0C573.6040002@uchicago.edu>

ranger.data
> #sitename       transformation  path    INSTALLED       
> platform        profiles
> localhost       echo            /bin/echo       INSTALLED       
> INTEL32::LINUX  null
> localhost       cat             /bin/cat        INSTALLED       
> INTEL32::LINUX  null
> localhost       ls              /bin/ls         INSTALLED       
> INTEL32::LINUX  null
> localhost       grep            /bin/grep       INSTALLED       
> INTEL32::LINUX  null
> localhost       sort            /bin/sort       INSTALLED       
> INTEL32::LINUX  null
> localhost       paste           /bin/paste      INSTALLED       
> INTEL32::LINUX  null
> localhost       sed             /bin/sed        INSTALLED       
> INTEL32::LINUX  null
> localhost       cp              /bin/cp         INSTALLED       
> INTEL32::LINUX  null
> localhost       sumarizeStudy   
> /home/hockyg/oops/swift/genPlotFiles.py INSTALLED       INTEL32::LINUX 
> null
>
>
> ranger    runoops    
> /share/home/01021/hockyg/oops/trunk/bin/runoops.sh      
> INSTALLED       INTEL32::LINUX  null
> ranger    runrama    
> /share/home/01021/hockyg/oops/trunk/bin/runrama.sh      
> INSTALLED       INTEL32::LINUX  null
> ranger    runramaSpeed            
> /share/home/01021/hockyg/oops/trunk/bin/runramaSpeed.sh   
> INSTALLED       INTEL32::LINUX  null
> ranger    analyze_round_dir    
> /share/home/01021/hockyg/oops/trunk/bin/analyze_round_dir.sh      
> INSTALLED       INTEL32::LINUX  null


ranger.xml

> <config>
>  <pool handle="localhost" sysinfo="INTEL32::LINUX">
>     <gridftp url="local://localhost" />
>     <execution provider="local" url="none" />
>     <workdirectory>/home/hockyg/swiftwork</workdirectory>
>   </pool>
>
>   <pool handle="ranger">
>     <execution provider="coaster" 
> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>     <!--<filesystem provider="coaster" 
> url="gt2://gatekeeper.ranger.tacc.teragrid.org"/>-->
>     <gridftp url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" />
>     <profile namespace="env" 
> key="SWIFT_JOBDIR_PATH">/tmp/hockyg/jobdir</profile>
>     <profile namespace="globus" key="project">TG-MCB080099N</profile>
>     <profile namespace="globus" key="coastersPerNode">16</profile>
>     <profile namespace="globus" 
> key="coasterWorkerMaxwalltime">05:00:00</profile>
>     <profile namespace="globus" key="maxwalltime">60</profile>
>     <profile namespace="karajan" key="initialScore">50</profile>
>     <profile namespace="karajan" key="jobThrottle">10</profile>
>     <workdirectory>/share/home/01021/hockyg/swiftwork</workdirectory>
>   </pool>
> </config>


> [hockyg at communicado rangeroutdir.1002]$ less 
> /home/hockyg/.swift/swift.properties
> sitedir.keep=true
> lazy.errors=false
> #execution.retries=0
> status.mode=provider

Michael Wilde wrote:
> Glen, can you post your sites.xml, tc.data and swift.properties files?
>
> On 4/23/09 2:21 PM, Glen Hocky wrote:
>> Hi everyone.
>> I was wondering if there is a cap on number of coasters or jobs in 
>> the queue on some machines. I've had a lot of success running on 
>> Ranger but I've never had more than 256 active jobs (i.e. 16x16) even 
>> with very high initial score and throttle settings.
>>
>> Glen
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From benc at hawaga.org.uk  Thu Apr 23 16:53:24 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 23 Apr 2009 21:53:24 +0000 (GMT)
Subject: [Swift-user] max number of jobs?
In-Reply-To: <49F0BFC0.1040504@uchicago.edu>
References: <49F0BFC0.1040504@uchicago.edu>
Message-ID: <Pine.LNX.4.64.0904232152210.29026@dildano.hawaga.org.uk>


On Thu, 23 Apr 2009, Glen Hocky wrote:

> I was wondering if there is a cap on number of coasters or jobs in the queue
> on some machines. I've had a lot of success running on Ranger but I've never
> had more than 256 active jobs (i.e. 16x16) even with very high initial score
> and throttle settings.

Do you see other jobs from Swift sitting in the queue on Ranger in 
queued/waiting state (rather than running)? Or do you only see exactly 256 
jobs in the queue? (this being from wahtever Ranger's equivalent of qstat 
is)

-- 


From hockyg at uchicago.edu  Thu Apr 23 16:55:48 2009
From: hockyg at uchicago.edu (Glen Hocky)
Date: Thu, 23 Apr 2009 16:55:48 -0500
Subject: [Swift-user] max number of jobs?
In-Reply-To: <Pine.LNX.4.64.0904232152210.29026@dildano.hawaga.org.uk>
References: <49F0BFC0.1040504@uchicago.edu>
	<Pine.LNX.4.64.0904232152210.29026@dildano.hawaga.org.uk>
Message-ID: <49F0E3E4.6010909@uchicago.edu>

With these settings, exactly 16 jobs immediately go into the queue 
(which on ranger goes to 256 coasters) and that number never changes

Ben Clifford wrote:
> On Thu, 23 Apr 2009, Glen Hocky wrote:
>
>   
>> I was wondering if there is a cap on number of coasters or jobs in the queue
>> on some machines. I've had a lot of success running on Ranger but I've never
>> had more than 256 active jobs (i.e. 16x16) even with very high initial score
>> and throttle settings.
>>     
>
> Do you see other jobs from Swift sitting in the queue on Ranger in 
> queued/waiting state (rather than running)? Or do you only see exactly 256 
> jobs in the queue? (this being from wahtever Ranger's equivalent of qstat 
> is)
>
>   


From aespinosa at cs.uchicago.edu  Fri Apr 24 13:05:44 2009
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Fri, 24 Apr 2009 13:05:44 -0500
Subject: [Swift-user] swift-plot-log with svg graphics
Message-ID: <20090424180543.GA4121@origin>

I wanted to zoom into the plots as much as I want so i changed the png term configured in gnuplot invocations to svg. Much prettier than png plots in my opinion :)

My patch for swift-plot-log is in http://www.ci.uchicago.edu/~aespinosa/swiftplot_svg-r2874.patch

Sample plots:  http://www.ci.uchicago.edu/~aespinosa/swift/report-blast-20090410-2357-j4nnkrg1/


Known issues:  firefox does not properly render svg graphics produced by gnuplot4.0patch0 (like the one installed in communicado).  Although this is fixed in gnuplot4.0patch2.  below's a small sed script to fix that:

sed -i 's/xmlns/xmlns="http:\/\/www.w3.org\/2000\/svg" xmlns/g' *.svg

-Allan


From hategan at mcs.anl.gov  Fri Apr 24 13:15:07 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Fri, 24 Apr 2009 13:15:07 -0500
Subject: [Swift-user] swift-plot-log with svg graphics
In-Reply-To: <20090424180543.GA4121@origin>
References: <20090424180543.GA4121@origin>
Message-ID: <1240596907.9287.8.camel@localhost>

Prettier indeed. The rendering on my browser is slow though. So this
should probably be an option. Or maybe the pages should show a png and
have a link to the high-resolution svg.

On Fri, 2009-04-24 at 13:05 -0500, Allan Espinosa wrote:
> I wanted to zoom into the plots as much as I want so i changed the png term configured in gnuplot invocations to svg. Much prettier than png plots in my opinion :)
> 
> My patch for swift-plot-log is in http://www.ci.uchicago.edu/~aespinosa/swiftplot_svg-r2874.patch
> 
> Sample plots:  http://www.ci.uchicago.edu/~aespinosa/swift/report-blast-20090410-2357-j4nnkrg1/
> 
> 
> Known issues:  firefox does not properly render svg graphics produced by gnuplot4.0patch0 (like the one installed in communicado).  Although this is fixed in gnuplot4.0patch2.  below's a small sed script to fix that:
> 
> sed -i 's/xmlns/xmlns="http:\/\/www.w3.org\/2000\/svg" xmlns/g' *.svg
> 
> -Allan
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From benc at hawaga.org.uk  Mon Apr 27 07:41:39 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 27 Apr 2009 12:41:39 +0000 (GMT)
Subject: [Swift-user] swift-plot-log with svg graphics
In-Reply-To: <20090424180543.GA4121@origin>
References: <20090424180543.GA4121@origin>
Message-ID: <Pine.LNX.4.64.0904271240210.29026@dildano.hawaga.org.uk>

I think general SVG support isn't good enough for SVG to be the main 
default format. But if its making better images, it would be good to 
incorporate it somehow, like mihael suggests.

At the very least, make a bugzilla enhancement request; even better, write 
the code and contribute it...

-- 


From benc at hawaga.org.uk  Mon Apr 27 08:24:05 2009
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 27 Apr 2009 13:24:05 +0000 (GMT)
Subject: [Swift-user] Swift 0.9 released.
Message-ID: <Pine.LNX.4.64.0904271319360.29026@dildano.hawaga.org.uk>


Swift 0.9 is released. Download it at                                           
http://www.ci.uchicago.edu/swift/downloads/    

The release notes, with more information on new features and bugfixes, are 
available at:  
http://www.ci.uchicago.edu/swift/downloads/release-notes-0.9.txt

-- 


From yuechen at bsd.uchicago.edu  Wed Apr 29 12:38:09 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Wed, 29 Apr 2009 12:38:09 -0500
Subject: [Swift-user] errors in file transfer
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi,
 
I was trying to test PTMap application using NCSA Abe. However, I got many error messages like the following and no process started on the Abe.
 
Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/b on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/h on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/0 on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/i on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/u on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/n on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1222-926eesff/info/z on NCSA_Abe

The log of the search is on communicado:
 
/home/yuechen/PTMap2/PTMap2-unmod-20090429-1222-926eesff.log
 
In the sites.xml, the entry for NCSA Abe is :
 
 <pool handle="NCSA_Abe">
    <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
    <execution provider="coaster" url="grid-abe.ncsa.teragrid.org" jobManager="gt2:PBS"/>
    <workdirectory>/cfs/scratch/users/yuechen/swiftwork</workdirectory>
    <profile namespace="globus" key="queue">fast</profile>
 </pool>

In the tc.data, the entry is:
 
NCSA_Abe        PTMap2          /u/ac/yuechen/PTMap2/PTMap2     INSTALLED       INTEL32::LINUX  globus::maxwalltime=50
 
I'm wondering if I have any setup problem or I should contact NCSA administrator.
 
Thank you very much!
 
Regards,
 
Chen, Yue
 
 
This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090429/b0dd2898/attachment.html>

From hategan at mcs.anl.gov  Wed Apr 29 12:51:17 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 29 Apr 2009 12:51:17 -0500
Subject: [Swift-user] errors in file transfer
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <1241027477.14561.0.camel@localhost>

This is the error:
 Cannot submit job: Could not submit job (qsub reported an exit code o
f 170). no error output

Try jobmanager="gt2:gt2:PBS" instead of "gt2:PBS".

Mihael

On Wed, 2009-04-29 at 12:38 -0500, Yue, Chen - BMD wrote:
> Hi,
>  
> I was trying to test PTMap application using NCSA Abe. However, I
> got many error messages like the following and no process started on
> the Abe.
>  
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/b on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/h on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/0 on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/i on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/u on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/n on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/z on NCSA_Abe
> 
> The log of the search is on communicado:
>  
> /home/yuechen/PTMap2/PTMap2-unmod-20090429-1222-926eesff.log
>  
> In the sites.xml, the entry for NCSA Abe is :
>  
>  <pool handle="NCSA_Abe">
>     <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
>     <execution provider="coaster" url="grid-abe.ncsa.teragrid.org"
> jobManager="gt2:PBS"/>
> 
> <workdirectory>/cfs/scratch/users/yuechen/swiftwork</workdirectory>
>     <profile namespace="globus" key="queue">fast</profile>
>  </pool>
> 
> In the tc.data, the entry is:
>  
> NCSA_Abe        PTMap2          /u/ac/yuechen/PTMap2/PTMap2
> INSTALLED       INTEL32::LINUX  globus::maxwalltime=50
>  
> I'm wondering if I have any setup problem or I should contact NCSA
> administrator.
>  
> Thank you very much!
>  
> Regards,
>  
> Chen, Yue
>  
>  
>  
>  
>  
>  
>  
>  
> 
> 
> This email is intended only for the use of the individual or entity to
> which it is addressed and may contain information that is privileged
> and confidential. If the reader of this email message is not the
> intended recipient, you are hereby notified that any dissemination,
> distribution, or copying of this communication is prohibited. If you
> have received this email in error, please notify the sender and
> destroy/delete all copies of the transmittal. Thank you.
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From yuechen at bsd.uchicago.edu  Wed Apr 29 14:04:11 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Wed, 29 Apr 2009 14:04:11 -0500
Subject: [Swift-user] errors in file transfer
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241027477.14561.0.camel@localhost>
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA8@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi Mihael,
 
I tried the setting but it still gives me the following error:
 
Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/q on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/m on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/l on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/9 on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/f on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/k on NCSA_Abe
Failed to transfer wrapper log from PTMap2-unmod-20090429-1402-b5t8cqua/info/7 on NCSA_Abe

The log for the search is at:
 
/home/yuechen/PTMap2/PTMap2-unmod-20090429-1402-b5t8cqua.log
 
Thanks!
 
Chen, Yue
 
 
________________________________

From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Wed 4/29/2009 12:51 PM
To: Yue, Chen - BMD
Cc: swift user
Subject: Re: [Swift-user] errors in file transfer


This is the error:
 Cannot submit job: Could not submit job (qsub reported an exit code o
f 170). no error output

Try jobmanager="gt2:gt2:PBS" instead of "gt2:PBS".

Mihael

On Wed, 2009-04-29 at 12:38 -0500, Yue, Chen - BMD wrote:
> Hi,
> 
> I was trying to test PTMap application using NCSA Abe. However, I
> got many error messages like the following and no process started on
> the Abe.
> 
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/b on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/h on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/0 on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/i on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/u on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/n on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1222-926eesff/info/z on NCSA_Abe
>
> The log of the search is on communicado:
> 
> /home/yuechen/PTMap2/PTMap2-unmod-20090429-1222-926eesff.log
> 
> In the sites.xml, the entry for NCSA Abe is :
> 
>  <pool handle="NCSA_Abe">
>     <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org"/>
>     <execution provider="coaster" url="grid-abe.ncsa.teragrid.org"
> jobManager="gt2:PBS"/>
>
> <workdirectory>/cfs/scratch/users/yuechen/swiftwork</workdirectory>
>     <profile namespace="globus" key="queue">fast</profile>
>  </pool>
>
> In the tc.data, the entry is:
> 
> NCSA_Abe        PTMap2          /u/ac/yuechen/PTMap2/PTMap2
> INSTALLED       INTEL32::LINUX  globus::maxwalltime=50
> 
> I'm wondering if I have any setup problem or I should contact NCSA
> administrator.
> 
> Thank you very much!
> 
> Regards,
> 
> Chen, Yue
> 
> 
> 
> 
> 
> 
> 
> 
>
>
> This email is intended only for the use of the individual or entity to
> which it is addressed and may contain information that is privileged
> and confidential. If the reader of this email message is not the
> intended recipient, you are hereby notified that any dissemination,
> distribution, or copying of this communication is prohibited. If you
> have received this email in error, please notify the sender and
> destroy/delete all copies of the transmittal. Thank you.
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090429/210985d0/attachment.html>

From hategan at mcs.anl.gov  Wed Apr 29 14:19:26 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 29 Apr 2009 14:19:26 -0500
Subject: [Swift-user] errors in file transfer
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA8@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241027477.14561.0.camel@localhost>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA8@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <1241032766.16150.2.camel@localhost>

On Wed, 2009-04-29 at 14:04 -0500, Yue, Chen - BMD wrote:
> Hi Mihael,
>  
> I tried the setting but it still gives me the following error:
>  
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/q on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/m on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/l on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/9 on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/f on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/k on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/7 on NCSA_Abe

Those are more like warnings, not errors.

The real error should be displayed towards the end of the run. Anyway,
it says:
"org.globus.gram.GramException: The provided RSL 'queue' parameter is
invalid"


From yuechen at bsd.uchicago.edu  Wed Apr 29 16:06:28 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Wed, 29 Apr 2009 16:06:28 -0500
Subject: [Swift-user] errors in file transfer
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu><1241027477.14561.0.camel@localhost><AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA8@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241032766.16150.2.camel@localhost>
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CAA@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi Mihael,
 
I deleted the following line in my sites.xml file for NCSA_Abe and the wrapper transfer warnings are gone. 
 
<profile namespace="globus" key="queue">fast</profile>
 
I can also find jobs queuing on Abe. However, after quite a while, no job returned. I guess it is because I didn't set a priority and all the jobs are waiting. Is there other way to set priority? I will try again later.
 
I then tested the IU BigRed with my application. Swift showed me the following error and I don't know if this is because of my setting:
 
Progress:  Selecting site:1019  Initializing site shared directory:4
Execution failed:
        Could not initialize shared directory on IU_BigRed
Caused by:
        org.globus.cog.abstraction.impl.file.FileResourceException: Error communicating with the GridFTP server
Caused by:
        Server refused performing the request. Custom message: Server refused GSSAPI authentication. (error code 1) [Nested exception message:  Custom message: Unexpected reply: 530-globus_xio: Server side credential failure
530-globus_gsi_gssapi: Error with GSI credential
530-globus_gsi_gssapi: Error with gss credential handle
530-globus_credential: Error with credential: The host credential: /etc/grid-security/hostcert.pem
530-     with subject: /C=US/O=National Center for Supercomputing Applications/CN=gridftp4.bigred.teragrid.iu.edu
530-     has expired 4459 minutes ago.
530-
530 End.]

The log for this search is at :
 
/home/yuechen/PTMap2/PTMap2-unmod-20090429-1553-vz669563.log
 
In the sites.xml, the entry for the BigRed is :
 
 <pool handle="IU_BigRed">
    <gridftp url="gsiftp://gridftp.bigred.iu.teragrid.org"/>
    <execution provider="coaster" url="gatekeeper.bigred.iu.teragrid.org" jobManager="gt2:fork"/>
    <workdirectory>/N/u/tg-yuechen/BigRed/swiftwork</workdirectory>
    <profile namespace="globus" key="queue">fast</profile>
 </pool>

Thank you for help!
 
Chen, Yue
 
 
________________________________

From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Wed 4/29/2009 2:19 PM
To: Yue, Chen - BMD
Cc: swift user
Subject: RE: [Swift-user] errors in file transfer


On Wed, 2009-04-29 at 14:04 -0500, Yue, Chen - BMD wrote:
> Hi Mihael,
> 
> I tried the setting but it still gives me the following error:
> 
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/q on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/m on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/l on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/9 on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/f on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/k on NCSA_Abe
> Failed to transfer wrapper log from
> PTMap2-unmod-20090429-1402-b5t8cqua/info/7 on NCSA_Abe

Those are more like warnings, not errors.

The real error should be displayed towards the end of the run. Anyway,
it says:
"org.globus.gram.GramException: The provided RSL 'queue' parameter is
invalid"


This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090429/f9b3e88e/attachment.html>

From hategan at mcs.anl.gov  Wed Apr 29 16:23:42 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 29 Apr 2009 16:23:42 -0500
Subject: [Swift-user] errors in file transfer
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CAA@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241027477.14561.0.camel@localhost>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA8@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241032766.16150.2.camel@localhost>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CAA@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <1241040222.18377.7.camel@localhost>

On Wed, 2009-04-29 at 16:06 -0500, Yue, Chen - BMD wrote:
> Hi Mihael,
>  
> I deleted the following line in my sites.xml file for NCSA_Abe and the
> wrapper transfer warnings are gone. 
>  
> <profile namespace="globus" key="queue">fast</profile>
>  
> I can also find jobs queuing on Abe. However, after quite a while, no
> job returned. I guess it is because I didn't set a priority and all
> the jobs are waiting.

When you do qstat, are your jobs in a queued state?

>  Is there other way to set priority?

You should be able to specify the queue. The only problem is that you
are specifying a queue that doesn't exist on Abe.

This is what I've found online:
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/Doc/Jobs.html#Queues

You can also log in, and do a qstat -q, which will show the following:
[hategan at honest2 ~]$ qstat -q

server: abem5.ncsa.uiuc.edu

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
normal             --      --    48:00:00   600  82 928 --   E R
iacat2             --      --    241:00:0   --    0  20 --   E R
indprio            --      --    48:00:00   600   0   0 --   E R
long               --      --    168:00:0   600  13  15 --   E R
iacat              --      --    241:00:0   --    0   0 --   E R
industrial         --      --    336:00:0   600  14  32 --   E R
lincoln            --      --    241:00:0   --    2   0 --   E R
wide               --      --    48:00:00  1196   6 344 --   E R
mlinglin           --      --    168:00:0   256   2   0 --   E R
debug              --      --    00:30:00    16   0   4 --   E R
fernsler           --      --    168:00:0    32   0   0 --   E R
specreq            --      --    241:00:0   600   2   0 --   E R
                                               ----- -----
                                                 121  1343


>  I will try again later.
>  
> I then tested the IU BigRed with my application. Swift showed me the
> following error and I don't know if this is because of my setting:
>  
> Progress:  Selecting site:1019  Initializing site shared directory:4
> Execution failed:
>         Could not initialize shared directory on IU_BigRed
> Caused by:
>         org.globus.cog.abstraction.impl.file.FileResourceException:
> Error communicating with the GridFTP server
> Caused by:
>         Server refused performing the request. Custom message: Server
> refused GSSAPI authentication. (error code 1) [Nested exception
> message:  Custom message: Unexpected reply: 530-globus_xio: Server
> side credential failure
> 530-globus_gsi_gssapi: Error with GSI credential
> 530-globus_gsi_gssapi: Error with gss credential handle
> 530-globus_credential: Error with credential: The host
> credential: /etc/grid-security/hostcert.pem
> 530-     with subject: /C=US/O=National Center for Supercomputing
> Applications/CN=gridftp4.bigred.teragrid.iu.edu
> 530-     has expired 4459 minutes ago.
> 530-
> 530 End.]

Bigred, it would seem, has an expired host certificate. This is a
problem with the site. I would suggest seding an email to
help at teragrid.org with the above message (from "Server refused
performing the request" to "530 End.]").


From yuechen at bsd.uchicago.edu  Wed Apr 29 17:30:07 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Wed, 29 Apr 2009 17:30:07 -0500
Subject: [Swift-user] errors in file transfer
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu><1241027477.14561.0.camel@localhost><AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA8@ADM-EXCHVS04.bsdad.uchicago.edu><1241032766.16150.2.camel@localhost><AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CAA@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241040222.18377.7.camel@localhost>
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CAC@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi Mihael,
 
When I do qstat, it shows the following line for all my jobs in the queue:
 
937872.abem5              null             yuechen                0 Q(null) normal
 
It looks like no job is running. 
 
I did the qstat -q. Should I use the following line instead in sites.xml for shorter Walltime?
 
<profile namespace="globus" key="queue">debug</profile>

I will send email to help at teragrid.org about the Bigred certificate problem.
 
Thanks!
 
Chen, Yue
 
 
________________________________

From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Wed 4/29/2009 4:23 PM
To: Yue, Chen - BMD
Cc: swift user
Subject: RE: [Swift-user] errors in file transfer


On Wed, 2009-04-29 at 16:06 -0500, Yue, Chen - BMD wrote:
> Hi Mihael,
> 
> I deleted the following line in my sites.xml file for NCSA_Abe and the
> wrapper transfer warnings are gone.
> 
> <profile namespace="globus" key="queue">fast</profile>
> 
> I can also find jobs queuing on Abe. However, after quite a while, no
> job returned. I guess it is because I didn't set a priority and all
> the jobs are waiting.

When you do qstat, are your jobs in a queued state?

>  Is there other way to set priority?

You should be able to specify the queue. The only problem is that you
are specifying a queue that doesn't exist on Abe.

This is what I've found online:
http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/Doc/Jobs.html#Queues

You can also log in, and do a qstat -q, which will show the following:
[hategan at honest2 ~]$ qstat -q

server: abem5.ncsa.uiuc.edu

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
normal             --      --    48:00:00   600  82 928 --   E R
iacat2             --      --    241:00:0   --    0  20 --   E R
indprio            --      --    48:00:00   600   0   0 --   E R
long               --      --    168:00:0   600  13  15 --   E R
iacat              --      --    241:00:0   --    0   0 --   E R
industrial         --      --    336:00:0   600  14  32 --   E R
lincoln            --      --    241:00:0   --    2   0 --   E R
wide               --      --    48:00:00  1196   6 344 --   E R
mlinglin           --      --    168:00:0   256   2   0 --   E R
debug              --      --    00:30:00    16   0   4 --   E R
fernsler           --      --    168:00:0    32   0   0 --   E R
specreq            --      --    241:00:0   600   2   0 --   E R
                                               ----- -----
                                                 121  1343


>  I will try again later.
> 
> I then tested the IU BigRed with my application. Swift showed me the
> following error and I don't know if this is because of my setting:
> 
> Progress:  Selecting site:1019  Initializing site shared directory:4
> Execution failed:
>         Could not initialize shared directory on IU_BigRed
> Caused by:
>         org.globus.cog.abstraction.impl.file.FileResourceException:
> Error communicating with the GridFTP server
> Caused by:
>         Server refused performing the request. Custom message: Server
> refused GSSAPI authentication. (error code 1) [Nested exception
> message:  Custom message: Unexpected reply: 530-globus_xio: Server
> side credential failure
> 530-globus_gsi_gssapi: Error with GSI credential
> 530-globus_gsi_gssapi: Error with gss credential handle
> 530-globus_credential: Error with credential: The host
> credential: /etc/grid-security/hostcert.pem
> 530-     with subject: /C=US/O=National Center for Supercomputing
> Applications/CN=gridftp4.bigred.teragrid.iu.edu
> 530-     has expired 4459 minutes ago.
> 530-
> 530 End.]

Bigred, it would seem, has an expired host certificate. This is a
problem with the site. I would suggest seding an email to
help at teragrid.org with the above message (from "Server refused
performing the request" to "530 End.]").


This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090429/bf1fac8a/attachment.html>

From hategan at mcs.anl.gov  Wed Apr 29 17:58:12 2009
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 29 Apr 2009 17:58:12 -0500
Subject: [Swift-user] errors in file transfer
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CAC@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA5@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241027477.14561.0.camel@localhost>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CA8@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241032766.16150.2.camel@localhost>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CAA@ADM-EXCHVS04.bsdad.uchicago.edu>
	<1241040222.18377.7.camel@localhost>
	<AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CAC@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <1241045892.21707.1.camel@localhost>

On Wed, 2009-04-29 at 17:30 -0500, Yue, Chen - BMD wrote:
> Hi Mihael,
>  
> When I do qstat, it shows the following line for all my jobs in the
> queue:
>  
> 937872.abem5              null             yuechen                0
> Q(null) normal
>  
> It looks like no job is running. 

Yep. That's what it looks like.

>  
> I did the qstat -q. Should I use the following line instead in
> sites.xml for shorter Walltime?
>  
> <profile namespace="globus" key="queue">debug</profile>

I think so. Though make sure to set coasterWorkerMaxwalltime to 30
minutes if you do.

> 
> I will send email to help at teragrid.org about the Bigred certificate
> problem.
>  
> Thanks!

You're welcome.

> 


From yuechen at bsd.uchicago.edu  Thu Apr 30 12:08:57 2009
From: yuechen at bsd.uchicago.edu (Yue, Chen - BMD)
Date: Thu, 30 Apr 2009 12:08:57 -0500
Subject: [Swift-user] Execution error
Message-ID: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CB1@ADM-EXCHVS04.bsdad.uchicago.edu>

Hi,
 
I came back to re-run my application on NCSA Mercury which was tested successfully last week after I just set up coasters with swift 0.9, but I got many messages like the following:
 
Progress:  Stage in:219  Submitting:803  Submitted:1
Progress:  Stage in:129  Submitting:703  Submitted:190 Failed but can retry:1
Progress:  Stage in:38  Submitting:425  Submitted:556 Failed but can retry:4
Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY
Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY
Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY
Failed to transfer wrapper log from PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY
Progress:  Stage in:1  Submitted:1013  Active:1 Failed but can retry:8
Progress:  Submitted:1011  Active:1 Failed but can retry:11

The log file for the successful run last week is ;
/home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log
 
The log file for the failed run is :
/home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log
 
I don't think I did anything different, so I don't know why this time they failed. The sites.xml for Mercury is:
 
 <pool handle="NCSA_MERCURY">
    <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
    <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" jobManager="gt2:PBS"/>
    <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
    <profile namespace="globus" key="queue">debug</profile>
 </pool>
 
Thank you for help!
 
Chen, Yue
 
 
This email is intended only for the use of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this email message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is prohibited. If you have received this email in error, please notify the sender and destroy/delete all copies of the transmittal. Thank you.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20090430/9749b660/attachment.html>

From wilde at mcs.anl.gov  Thu Apr 30 12:20:03 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 30 Apr 2009 12:20:03 -0500
Subject: [Swift-user] Execution error
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CB1@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CB1@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <49F9DDC3.5090907@mcs.anl.gov>

Yue, I'm looking at your logs.

I see that swift is encountering qsub errors (code "-68").

Did you either get emails from qsub on Mercury, or have logs in your 
home dir there or your .globus dir? (I cant access your home directory).

Can you look for log files in both $HOME and below .globus (maybe 
deeper) for that time period, and put them somewhere (like back on CI in 
a tarball) where we can access them?

- Mike


On 4/30/09 12:08 PM, Yue, Chen - BMD wrote:
> Hi,
>  
> I came back to re-run my application on NCSA Mercury which was tested 
> successfully last week after I just set up coasters with swift 0.9, but 
> I got many messages like the following:
>  
> Progress:  Stage in:219  Submitting:803  Submitted:1
> Progress:  Stage in:129  Submitting:703  Submitted:190 Failed but can 
> retry:1
> Progress:  Stage in:38  Submitting:425  Submitted:556 Failed but can retry:4
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY
> Progress:  Stage in:1  Submitted:1013  Active:1 Failed but can retry:8
> Progress:  Submitted:1011  Active:1 Failed but can retry:11
> The log file for the successful run last week is ;
> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log
>  
> The log file for the failed run is :
> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log
>  
> I don't think I did anything different, so I don't know why this time 
> they failed. The sites.xml for Mercury is:
>  
>  <pool handle="NCSA_MERCURY">
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>     <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" 
> jobManager="gt2:PBS"/>
>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>     <profile namespace="globus" key="queue">debug</profile>
>  </pool>
>  
> Thank you for help!
>  
> Chen, Yue
>  
>  
> 
>  
> 
>  
> 
> 
>  
> 
> 
> This email is intended only for the use of the individual or entity to 
> which it is addressed and may contain information that is privileged and 
> confidential. If the reader of this email message is not the intended 
> recipient, you are hereby notified that any dissemination, distribution, 
> or copying of this communication is prohibited. If you have received 
> this email in error, please notify the sender and destroy/delete all 
> copies of the transmittal. Thank you.
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From wilde at mcs.anl.gov  Thu Apr 30 12:23:27 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 30 Apr 2009 12:23:27 -0500
Subject: [Swift-user] Execution error
In-Reply-To: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CB1@ADM-EXCHVS04.bsdad.uchicago.edu>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CB1@ADM-EXCHVS04.bsdad.uchicago.edu>
Message-ID: <49F9DE8F.1070404@mcs.anl.gov>

Also, what account are you running under? We may need to change you to a 
new account - as the OSG Training account expires today.
If that happend at Noon, it *might* be the problem.

- Mike


On 4/30/09 12:08 PM, Yue, Chen - BMD wrote:
> Hi,
>  
> I came back to re-run my application on NCSA Mercury which was tested 
> successfully last week after I just set up coasters with swift 0.9, but 
> I got many messages like the following:
>  
> Progress:  Stage in:219  Submitting:803  Submitted:1
> Progress:  Stage in:129  Submitting:703  Submitted:190 Failed but can 
> retry:1
> Progress:  Stage in:38  Submitting:425  Submitted:556 Failed but can retry:4
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY
> Failed to transfer wrapper log from 
> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY
> Progress:  Stage in:1  Submitted:1013  Active:1 Failed but can retry:8
> Progress:  Submitted:1011  Active:1 Failed but can retry:11
> The log file for the successful run last week is ;
> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log
>  
> The log file for the failed run is :
> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log
>  
> I don't think I did anything different, so I don't know why this time 
> they failed. The sites.xml for Mercury is:
>  
>  <pool handle="NCSA_MERCURY">
>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>     <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" 
> jobManager="gt2:PBS"/>
>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>     <profile namespace="globus" key="queue">debug</profile>
>  </pool>
>  
> Thank you for help!
>  
> Chen, Yue
>  
>  
> 
>  
> 
>  
> 
> 
>  
> 
> 
> This email is intended only for the use of the individual or entity to 
> which it is addressed and may contain information that is privileged and 
> confidential. If the reader of this email message is not the intended 
> recipient, you are hereby notified that any dissemination, distribution, 
> or copying of this communication is prohibited. If you have received 
> this email in error, please notify the sender and destroy/delete all 
> copies of the transmittal. Thank you.
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From wilde at mcs.anl.gov  Thu Apr 30 12:40:40 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 30 Apr 2009 12:40:40 -0500
Subject: [Swift-user] Execution error
In-Reply-To: <49F9DE8F.1070404@mcs.anl.gov>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CB1@ADM-EXCHVS04.bsdad.uchicago.edu>
	<49F9DE8F.1070404@mcs.anl.gov>
Message-ID: <49F9E298.8030801@mcs.anl.gov>

I just checked - TG-CDA070002T has indeed expired.

The best for now is to move to use (only) Ranger, under this account:
TG-CCR080022N

I will locate and send you a sites.xml entry in a moment.

You need to go to a web page to activate your Ranger login.

Best to contact me in IM and we can work this out.

- Mike


On 4/30/09 12:23 PM, Michael Wilde wrote:
> Also, what account are you running under? We may need to change you to a 
> new account - as the OSG Training account expires today.
> If that happend at Noon, it *might* be the problem.
> 
> - Mike
> 
> 
> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote:
>> Hi,
>>  
>> I came back to re-run my application on NCSA Mercury which was tested 
>> successfully last week after I just set up coasters with swift 0.9, 
>> but I got many messages like the following:
>>  
>> Progress:  Stage in:219  Submitting:803  Submitted:1
>> Progress:  Stage in:129  Submitting:703  Submitted:190 Failed but can 
>> retry:1
>> Progress:  Stage in:38  Submitting:425  Submitted:556 Failed but can 
>> retry:4
>> Failed to transfer wrapper log from 
>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
>> Failed to transfer wrapper log from 
>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
>> Failed to transfer wrapper log from 
>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY
>> Failed to transfer wrapper log from 
>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY
>> Failed to transfer wrapper log from 
>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY
>> Failed to transfer wrapper log from 
>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY
>> Progress:  Stage in:1  Submitted:1013  Active:1 Failed but can retry:8
>> Progress:  Submitted:1011  Active:1 Failed but can retry:11
>> The log file for the successful run last week is ;
>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log
>>  
>> The log file for the failed run is :
>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log
>>  
>> I don't think I did anything different, so I don't know why this time 
>> they failed. The sites.xml for Mercury is:
>>  
>>  <pool handle="NCSA_MERCURY">
>>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>>     <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" 
>> jobManager="gt2:PBS"/>
>>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>>     <profile namespace="globus" key="queue">debug</profile>
>>  </pool>
>>  
>> Thank you for help!
>>  
>> Chen, Yue
>>  
>>  
>>
>>  
>>
>>  
>>
>>
>>  
>>
>>
>> This email is intended only for the use of the individual or entity to 
>> which it is addressed and may contain information that is privileged 
>> and confidential. If the reader of this email message is not the 
>> intended recipient, you are hereby notified that any dissemination, 
>> distribution, or copying of this communication is prohibited. If you 
>> have received this email in error, please notify the sender and 
>> destroy/delete all copies of the transmittal. Thank you.
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user


From wilde at mcs.anl.gov  Thu Apr 30 13:07:55 2009
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 30 Apr 2009 13:07:55 -0500
Subject: [Swift-user] Execution error
In-Reply-To: <49F9E298.8030801@mcs.anl.gov>
References: <AD1FA15416EEBC49A0FE4F8B0C8AD7C5158CB1@ADM-EXCHVS04.bsdad.uchicago.edu>	<49F9DE8F.1070404@mcs.anl.gov>
	<49F9E298.8030801@mcs.anl.gov>
Message-ID: <49F9E8FB.9020500@mcs.anl.gov>

Yue, use this XML pool element to access ranger:

  <pool handle="ranger">
     <execution provider="coaster"
                url="gatekeeper.ranger.tacc.teragrid.org"
                jobManager="gt2:gt2:SGE"/>
     <gridftp url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" />
     <profile namespace="env"
              key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir</profile>
     <profile namespace="globus" key="project">TG-CCR080022N</profile>
     <profile namespace="globus" key="coastersPerNode">16</profile>
     <profile namespace="globus" key="queue">development</profile>
     <profile namespace="globus"
              key="coasterWorkerMaxwalltime">00:40:00</profile>
     <profile namespace="globus" key="maxwalltime">31</profile>
     <profile namespace="karajan" key="initialScore">50</profile>
     <profile namespace="karajan" key="jobThrottle">10</profile>
     <workdirectory>/work/00306/tg455797/swiftwork</workdirectory>
   </pool>


You will need to also do these steps:

Go to this web page to enable your Ranger account:

https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx

Then login to Ranger via the TeraGrid portal and put your ssh keys in 
place (assuming you use ssh keys, which you should)

While on Ranger, do this:

echo $WORK
mkdir $work/swiftwork

and put the full path of your $WORK/swiftwork directory in the 
<workdirectory> element above. (My login is tg455etc, yours is yuechen)

Then scp your code to Ranger and compile it.

Then create a tc.data entry for your ptmap app

Next, set your time values in the sites.xml entry above to suitable 
values for Ranger. You'll need to measure times, but I think you will 
find Ranger about twice as fast as Mercury for CPU-bound jobs.

The values above were set for one app job per coaster. I think you can 
probably do more.

If you estimate a run time of 5 minutes, use:

     <profile namespace="globus"
              key="coasterWorkerMaxwalltime">00:30:00</profile>
     <profile namespace="globus" key="maxwalltime">5</profile>

Other people on the list - please sanity check what I suggest here.

- Mike


On 4/30/09 12:40 PM, Michael Wilde wrote:
> I just checked - TG-CDA070002T has indeed expired.
> 
> The best for now is to move to use (only) Ranger, under this account:
> TG-CCR080022N
> 
> I will locate and send you a sites.xml entry in a moment.
> 
> You need to go to a web page to activate your Ranger login.
> 
> Best to contact me in IM and we can work this out.
> 
> - Mike
> 
> 
> 
> On 4/30/09 12:23 PM, Michael Wilde wrote:
>> Also, what account are you running under? We may need to change you to 
>> a new account - as the OSG Training account expires today.
>> If that happend at Noon, it *might* be the problem.
>>
>> - Mike
>>
>>
>> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote:
>>> Hi,
>>>  
>>> I came back to re-run my application on NCSA Mercury which was tested 
>>> successfully last week after I just set up coasters with swift 0.9, 
>>> but I got many messages like the following:
>>>  
>>> Progress:  Stage in:219  Submitting:803  Submitted:1
>>> Progress:  Stage in:129  Submitting:703  Submitted:190 Failed but can 
>>> retry:1
>>> Progress:  Stage in:38  Submitting:425  Submitted:556 Failed but can 
>>> retry:4
>>> Failed to transfer wrapper log from 
>>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
>>> Failed to transfer wrapper log from 
>>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
>>> Failed to transfer wrapper log from 
>>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY
>>> Failed to transfer wrapper log from 
>>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY
>>> Failed to transfer wrapper log from 
>>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY
>>> Failed to transfer wrapper log from 
>>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY
>>> Progress:  Stage in:1  Submitted:1013  Active:1 Failed but can retry:8
>>> Progress:  Submitted:1011  Active:1 Failed but can retry:11
>>> The log file for the successful run last week is ;
>>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log
>>>  
>>> The log file for the failed run is :
>>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log
>>>  
>>> I don't think I did anything different, so I don't know why this time 
>>> they failed. The sites.xml for Mercury is:
>>>  
>>>  <pool handle="NCSA_MERCURY">
>>>     <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>>>     <execution provider="coaster" url="grid-hg.ncsa.teragrid.org" 
>>> jobManager="gt2:PBS"/>
>>>     <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>>>     <profile namespace="globus" key="queue">debug</profile>
>>>  </pool>
>>>  
>>> Thank you for help!
>>>  
>>> Chen, Yue
>>>  
>>>  
>>>
>>>  
>>>
>>>  
>>>
>>>
>>>  
>>>
>>>
>>> This email is intended only for the use of the individual or entity 
>>> to which it is addressed and may contain information that is 
>>> privileged and confidential. If the reader of this email message is 
>>> not the intended recipient, you are hereby notified that any 
>>> dissemination, distribution, or copying of this communication is 
>>> prohibited. If you have received this email in error, please notify 
>>> the sender and destroy/delete all copies of the transmittal. Thank you.
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user