[Swift-user] Fwd: Problems getting started with coasters

Andriy Fedorov fedorov at bwh.harvard.edu
Tue Aug 25 09:58:04 CDT 2009


Hi,

I have a processing step that takes roughly 2-5 min. It takes two
~5 MB files as input and produces a small text file, which I need to
store. I need to run a large number of such jobs with different
parameters. It seems to me that "coaster" is the best execution
provider for my application.

Trying to start simple, I am running the first.swift (echo) example
that comes with Swift using different providers: GT2, GT4, GT2/coaster,
and GT4/coaster. All of this is done on the NCSA Abe cluster.

Here's my sites.xml:

<pool handle="Abe-GT4">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="gt4" jobmanager="PBS"
 url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>

<pool handle="Abe-GT4-coasters">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="coaster" jobmanager="gt4:gt4:pbs"
 url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>

<pool handle="Abe-GT2">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="gt2" jobmanager="PBS"
 url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>

<pool handle="Abe-GT2-coasters">
 <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
 <execution provider="coaster" jobmanager="gt2:gt2:pbs"
 url="grid-abe.ncsa.teragrid.org"/>
 <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
 <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>

And tc.data is simply

Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null

and I change the site handle (the first column) to test different providers.
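
To make the switching step concrete, here is a minimal sketch of how the tc.data entry could be regenerated for each pool handle. It is only an illustration: the helper names (tc_entry, write_tc) are hypothetical and not part of Swift, and the column layout is copied from the single entry shown above rather than from any specification.

```python
# Pool handles as they appear in the sites.xml above.
SITES = ["Abe-GT2", "Abe-GT4", "Abe-GT2-coasters", "Abe-GT4-coasters"]


def tc_entry(site, name="echo", path="/bin/echo"):
    """Build one tc.data line, mirroring the column layout of the entry above.

    Hypothetical helper; the trailing fields are copied verbatim from the
    original entry.
    """
    return f"{site} {name} {path} INSTALLED INTEL32::LINUX null"


def write_tc(site, filename="tc.data"):
    """Overwrite tc.data so that it holds a single entry for the chosen site."""
    with open(filename, "w") as f:
        f.write(tc_entry(site) + "\n")


if __name__ == "__main__":
    # Print the entry that would be written for each provider under test.
    for site in SITES:
        print(tc_entry(site))
```

Calling write_tc("Abe-GT2-coasters") before a run would select the GT2/coaster pool; everything else in the setup stays the same.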

Now, results:

1) Both the GT2 and GT4 providers work fine; the script completes.

2) With the GT2+coaster provider, I can see the job in the PBS queue
(requested time is 01:41; I guess this comes from the default coaster
parameters, which I didn't change). The job appears to finish
successfully, and it seems the output file is fetched back, but
then I get this error:

Final status:  Finished successfully:1
START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
Submitted task Task(type=JOB_SUBMISSION,
identity=urn:0-1-1251210343871). Job id:
urn:1251210343871-1251210376098-1251210376099
Unregistering Command(21, SUBMITJOB)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
Cleaning up...
Shutting down service at https://141.142.68.180:45552
Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
Command(22, SHUTDOWNSERVICE): handling reply timeout
Command(22, SHUTDOWNSERVICE): failed too many times
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
       at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
       at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
       at java.util.TimerThread.mainLoop(Timer.java:512)
       at java.util.TimerThread.run(Timer.java:462)
- Done

3) With the GT4+coaster provider, I don't get as far as with GT2+coaster.
Possibly I am not setting up the site entry properly. I was not able
to find any examples in the manual of how to set up coasters with GT4
(can anyone provide an example?). Here's the error:

Failed to transfer wrapper log from
first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
END_FAILURE thread=0 tr=echo
Progress:  Failed:1
Execution failed:
       Exception in echo:
Arguments: [Hello, world!]
Host: Abe-GT4-coasters
Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
stderr.txt:

stdout.txt:

----

Caused by:
       Cannot submit job: Limited proxy is not accepted


Can anybody help me figure this out?

Thanks
--
Andriy Fedorov, Ph.D.

Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu


