[Swift-user] Problems getting started with coasters
Michael Wilde
wilde at mcs.anl.gov
Tue Aug 25 10:31:26 CDT 2009
Andrey,
On 8/25/09 9:49 AM, Andrey Fedorov wrote:
> Hi,
>
> I have a processing step that takes somewhere ~2-5 min. It takes on
> input two ~5Mb files, and produces a small text file, which I need to
> store. I need to compute large number of such jobs, using different
> parameters. It seems to me "coaster" is the best execution provider
> for my application.
>
> Trying to start simple, I am running first.swift (echo) example that
> comes with Swift using different providers: GT2, GT4, GT2/coaster, and
> GT4/coaster. All of this is done on Abe NCSA cluster.
>
> Here's my sites.xml:
>
> <pool handle="Abe-GT4">
> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
> <execution provider="gt4" jobmanager="PBS"
> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
>
> <pool handle="Abe-GT4-coasters">
> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
> <execution provider="coaster" jobmanager="gt4:gt4:pbs"
> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
>
> <pool handle="Abe-GT2">
> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
> <execution provider="gt2" jobmanager="PBS"
> url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
>
> <pool handle="Abe-GT2-coasters">
> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
> <execution provider="coaster" jobmanager="gt2:gt2:pbs"
> url="grid-abe.ncsa.teragrid.org"/>
> <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
>
> And tc.data is simply
>
> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null
>
> and I change the site to test different providers.
>
> Now, results:
>
> 1) both GT2 and GT4 providers work fine, script completes
>
> 2) with GT2+coaster provider, I can see the job in the PBS queue
> (requested time is 01:41, I guess this comes with the default coaster
> parameters, that I didn't change). The job appears to finish
> successfully, but then I get this error:
>
> Final status: Finished successfully:1
> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
> Submitted task Task(type=JOB_SUBMISSION,
> identity=urn:0-1-1251210343871). Job id:
> urn:1251210343871-1251210376098-1251210376099
> Unregistering Command(21, SUBMITJOB)
> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
> Cleaning up...
> Shutting down service at https://141.142.68.180:45552
> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
> Command(22, SHUTDOWNSERVICE): handling reply timeout
> Command(22, SHUTDOWNSERVICE): failed too many times
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> - Done
This seems like a low-priority error. I'll file it in Bugzilla for now.
Let's see how coasters work for you on Abe using your real app and a
larger number of jobs, and come back to this shutdown problem if it
proves to be a blocker to getting work done.
Coasters have a few other known issues, mainly that they do not throttle
work efficiently. We have a fix for that, and need to apply and test it
first.
We've also been experimenting with a non-coaster way to use all 8 cores
of machines like Abe, but let's try the coaster route first, if that's OK
with you, and let's focus on GT2/Coasters, as that will be more common.
In addition, there is a test version of GT GRAM5 on QueenBee, Abe's
sister system at LSU, which we can try, assuming your TG project lets
you run there.
So please try to run the app, and we will try to get the latest coaster
fixes committed. (I assume you are comfortable checking Swift out of svn
and building it; if you have not done this before, can you try it, Andrey?)
Regards,
Mike
> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
> Possibly I am not setting up properly the site entry. I was not able
> to find any examples in the manual how to set coasters with GT4 (can
> anyone provide an example?). Here's the error:
>
> Failed to transfer wrapper log from
> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
> END_FAILURE thread=0 tr=echo
> Progress: Failed:1
> Execution failed:
> Exception in echo:
> Arguments: [Hello, world!]
> Host: Abe-GT4-coasters
> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Cannot submit job: Limited proxy is not accepted
>
>
> Can anybody help figuring this out?
>
> Thanks
> --
> Andriy Fedorov, Ph.D.
>
> Research Fellow
> Brigham and Women's Hospital
> Harvard Medical School
> 75 Francis Street
> Boston, MA 02115 USA
> fedorov at bwh.harvard.edu
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user