[Swift-user] Problems getting started with coasters

Tue Aug 25 11:54:37 CDT 2009

On 8/25/09 10:44 AM, Andriy Fedorov wrote:
> Michael,
> 
> Thanks for the reply.
> 
> So my understanding is, I should check out the trunk version and
> compile (yes, I've done this before), and try the real application
> with GT2+coasters.

Yes, that's a good step to re-master, in preparation for Mihael checking 
in Coaster fixes. He made significant enhancements to Coasters in the 
past 2 months, but has been working on a different project lately, and 
thus these are not yet sufficiently tested. If you're willing to help 
with the testing, that would be great.

If not, I think the next best approach to try is this:

- We have a small experimental mod that enables Swift GRAM2 jobs to use 
all cores of multi-core hosts (such as the 8-core hosts on Abe and 
QueenBee). Basically it uses the Swift clustering facility but runs jobs 
in parallel instead of serially.

It works well if your jobs have a very uniform runtime. If they don't, 
then it wastes CPU.  But it's a good interim solution for many apps until 
coasters is more stable.
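
To put a rough number on the "wastes CPU" point, here is a minimal sketch. It assumes the node is held until the slowest job in a parallel cluster finishes, which is an assumption about the mechanism rather than something confirmed in this thread. The function name and the sample runtimes (taken from the 2-5 minute range mentioned below) are illustrative only.

```python
# Sketch: estimate core utilization when a cluster of jobs runs in parallel
# on one multi-core node. ASSUMPTION: the node stays reserved until the
# slowest job in the cluster ends, so cores that finish early sit idle.

def cluster_utilization(runtimes, cores=8):
    """Return the fraction of reserved core-time spent doing useful work."""
    if not runtimes or len(runtimes) > cores:
        raise ValueError("need between 1 and `cores` jobs per cluster")
    wall = max(runtimes)        # node is busy until the longest job ends
    reserved = cores * wall     # core-minutes held for the whole cluster
    return sum(runtimes) / reserved

# Hypothetical runtimes in minutes, one job per core on an 8-core host.
uniform = [3.0] * 8
mixed = [2.0] * 7 + [5.0]       # seven short jobs and one straggler

print(f"uniform: {cluster_utilization(uniform):.1%}")
print(f"mixed:   {cluster_utilization(mixed):.1%}")
```

Under that assumption, the uniform cluster keeps all 8 cores busy, while the single 5-minute straggler leaves more than half of the reserved core-time idle.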

This is described at:
http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering

This info is very preliminary and not end-user ready. Tibi Stef-Praun, 
on this list, has tried it. Please start a new thread here if you want to 
discuss it or report experiences or problems with it.

- On QueenBee or other GRAM5-enabled systems (not many yet, as it's still 
in test mode) you can use the GRAM2 provider if submitting remotely.
On Abe and any other GRAM2 systems you should run this with the Condor-G 
provider if submitting remotely.

The rule of thumb here for submitting jobs to a site from Swift running 
remotely on a submit host is:

   -- up to 20 jobs in parallel you can use plain GRAM2
   -- above 20 jobs, use Condor-G or, where available, GRAM5
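
The rule of thumb can be written down as a small helper. This is only a sketch: the function name and the returned labels are made up for illustration (they are not Swift provider names), and GRAM5 is taken as the alternative to Condor-G above the threshold, GRAM5 being the more efficient service discussed in this thread.

```python
# Sketch of the submission rule of thumb for Swift running on a remote
# submit host. The labels returned are descriptive only, not sites.xml values.

PLAIN_GRAM2_LIMIT = 20  # threshold quoted in the rule of thumb

def choose_submission_path(parallel_jobs, gram5_available=False):
    """Pick a submission path for the given number of parallel jobs."""
    if parallel_jobs < 0:
        raise ValueError("parallel_jobs must be non-negative")
    if parallel_jobs <= PLAIN_GRAM2_LIMIT:
        return "GRAM2"
    # Above the threshold, prefer GRAM5 where the site offers it.
    return "GRAM5" if gram5_available else "Condor-G"

print(choose_submission_path(10))                         # GRAM2
print(choose_submission_path(200))                        # Condor-G
print(choose_submission_path(200, gram5_available=True))  # GRAM5
```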

- On Abe, QueenBee, and other PBS systems with login hosts, you can run 
Swift locally on the login host, and use the PBS provider with the 
parallel clustering approach.

We have a few other solutions that I will save till we explore these two 
solutions.

To prepare for this, try running your app on Abe using the PBS provider, 
with just 1 or 2 jobs, then try the parallel clustering tip above.

> I do have an account on Queen Bee. You say, it has GT GRAM5, but I
> thought you also said I should target using GT2. What is GRAM5?

GRAM5 is a new, more efficient version of GRAM2. It's fully compatible, 
so you just set Swift sites.xml exactly as for GRAM2. The only thing 
that changes is that you use a different URL for the GRAM gatekeeper 
contact string (i.e. a different host and/or port, that's all).
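
As an illustration of "same entry, different contact string", the sketch below builds two pool entries modeled on the Abe-GT2 entry quoted further down. Every host name, the PORT placeholder, the handles, and the work directory are hypothetical stand-ins. The real GRAM5 contact string for QueenBee is not known here, and the jobmanager-pbs suffix on the GRAM5 contact is likewise an assumption.

```python
import xml.etree.ElementTree as ET

def gt2_pool(handle, gridftp_url, contact, workdir):
    """Build a sites.xml <pool> element that uses the gt2 execution provider."""
    pool = ET.Element("pool", handle=handle)
    ET.SubElement(pool, "gridftp", url=gridftp_url)
    ET.SubElement(pool, "execution", provider="gt2", jobmanager="PBS",
                  url=contact)
    ET.SubElement(pool, "workdirectory").text = workdir
    return pool

# Hypothetical values; substitute the site's real endpoints.
GRIDFTP = "gsiftp://gridftp.example.org:2811/"
WORKDIR = "/path/to/scratch"

gram2 = gt2_pool("Site-GRAM2", GRIDFTP,
                 "gram2.example.org:2119/jobmanager-pbs", WORKDIR)
gram5 = gt2_pool("Site-GRAM5", GRIDFTP,
                 "gram5.example.org:PORT/jobmanager-pbs", WORKDIR)

for pool in (gram2, gram5):
    print(ET.tostring(pool, encoding="unicode"))
```

Apart from the handle, the two printed entries differ only in the url attribute of the execution element; the provider stays gt2 in both.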

I'll need to get you the contact string for GRAM5 on QueenBee if/when we 
both agree the time is right to try it.

> At
> this point, my preference is the system with lowest load and confirmed
> functional coaster provider, to save time debugging and getting up to
> speed. Should I use Abe or Queen Bee?

That's hard to answer, as the loads fluctuate.  You can examine the 
TeraPort system load monitor in the TG portal, which gives some rough 
estimates of load and queue time.  Then queue the jobs and wait. Best to 
run Swift under screen, so you can easily wait for and monitor your 
script executions from anywhere, and not be interrupted if long delays 
are encountered.

- Mike

> 
> As soon as I compile the current swift trunk and try GT2+coaster @Abe
> for my application, I will report to the list my experience.
> 
> --
> Andriy Fedorov, Ph.D.
> 
> Research Fellow
> Brigham and Women's Hospital
> Harvard Medical School
> 75 Francis Street
> Boston, MA 02115 USA
> fedorov at bwh.harvard.edu
> 
> 
> 
> On Tue, Aug 25, 2009 at 11:31, Michael Wilde<wilde at mcs.anl.gov> wrote:
>> Andrey,
>>
>> On 8/25/09 9:49 AM, Andrey Fedorov wrote:
>>> Hi,
>>>
>>> I have a processing step that takes roughly 2-5 min. It takes as
>>> input two ~5 MB files, and produces a small text file, which I need to
>>> store. I need to compute a large number of such jobs, using different
>>> parameters. It seems to me "coaster" is the best execution provider
>>> for my application.
>>>
>>> Trying to start simple, I am running first.swift (echo) example that
>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and
>>> GT4/coaster. All of this is done on Abe NCSA cluster.
>>>
>>> Here's my sites.xml:
>>>
>>> <pool handle="Abe-GT4">
>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>  <execution provider="gt4" jobmanager="PBS"
>>>  url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>> </pool>
>>>
>>> <pool handle="Abe-GT4-coasters">
>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>  <execution provider="coaster" jobmanager="gt4:gt4:pbs"
>>>  url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>> </pool>
>>>
>>> <pool handle="Abe-GT2">
>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>  <execution provider="gt2" jobmanager="PBS"
>>>  url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>> </pool>
>>>
>>> <pool handle="Abe-GT2-coasters">
>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>>>  url="grid-abe.ncsa.teragrid.org"/>
>>>  <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>> </pool>
>>>
>>> And tc.data is simply
>>>
>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null
>>>
>>> and I change the site to test different providers.
>>>
>>> Now, results:
>>>
>>> 1) both GT2 and GT4 providers work fine, script completes
>>>
>>> 2) with GT2+coaster provider, I can see the job in the PBS queue
>>> (requested time is 01:41, I guess this comes with the default coaster
>>> parameters, that I didn't change). The job appears to finish
>>> successfully, but then I get this error:
>>>
>>> Final status:  Finished successfully:1
>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
>>> Submitted task Task(type=JOB_SUBMISSION,
>>> identity=urn:0-1-1251210343871). Job id:
>>> urn:1251210343871-1251210376098-1251210376099
>>> Unregistering Command(21, SUBMITJOB)
>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>> Cleaning up...
>>> Shutting down service at https://141.142.68.180:45552
>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
>>> Command(22, SHUTDOWNSERVICE): handling reply timeout
>>> Command(22, SHUTDOWNSERVICE): failed too many times
>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>        at
>>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
>>>        at
>>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
>>>        at java.util.TimerThread.mainLoop(Timer.java:512)
>>>        at java.util.TimerThread.run(Timer.java:462)
>>> - Done
>> This seems like a low-priority error. I'll file it in bugzilla for now.
>> Let's see how coasters works for you on Abe using your real app and a
>> larger number of jobs, and come back to this shutdown problem if it
>> proves to be a blocker to getting work done.
>>
>> Coasters has a few other current issues - mainly not throttling work
>> efficiently - that we have a fix for; we need to apply and test that fix
>> first.
>>
>> We've also been experimenting with a non-coaster way to use all 8 cores of
>> machines like Abe, but let's try the coaster route first, if that's OK with
>> you, and let's focus on GT2/Coasters, as that will be more common.
>>
>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's
>> sister-system at LSU, which we can try, assuming your TG project lets you
>> run there.
>>
>> So please try to run the app, and we will try to get the latest coaster
>> fixes committed. (I assume you are comfortable extracting Swift from svn and
>> building it; if you have not done this before, can you try it, Andrey?)
>>
>> Regards,
>>
>> Mike
>>
>>
>>> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
>>> Possibly I am not setting up the site entry properly. I was not able
>>> to find any examples in the manual of how to set up coasters with GT4
>>> (can anyone provide an example?). Here's the error:
>>>
>>> Failed to transfer wrapper log from
>>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
>>> END_FAILURE thread=0 tr=echo
>>> Progress:  Failed:1
>>> Execution failed:
>>>        Exception in echo:
>>> Arguments: [Hello, world!]
>>> Host: Abe-GT4-coasters
>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
>>> stderr.txt:
>>>
>>> stdout.txt:
>>>
>>> ----
>>>
>>> Caused by:
>>>        Cannot submit job: Limited proxy is not accepted
>>>
>>>
>>> Can anybody help figure this out?
>>>
>>> Thanks
>>> --
>>> Andriy Fedorov, Ph.D.
>>>
>>> Research Fellow
>>> Brigham and Women's Hospital
>>> Harvard Medical School
>>> 75 Francis Street
>>> Boston, MA 02115 USA
>>> fedorov at bwh.harvard.edu
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user