[Swift-user] Problems getting started with coasters
Mihael Hategan
hategan at mcs.anl.gov
Tue Aug 25 10:49:49 CDT 2009
(2)
This isn't strictly a bug.
When shutting down a coaster service, the client sends a shutdown command
to the service and waits for an acknowledgement. The service
acknowledges it and then terminates. However, there is currently no
guarantee that the acknowledgement message is actually sent before the
service terminates (which is something that could be corrected, I guess).
In any case, the client only tries to shut down the service. It is not an
error condition if that doesn't succeed, but a diagnostic message gets
printed.
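
To illustrate the race described above, here is a minimal Python sketch. It is
not the actual coaster/CoG code: the names `service`, `shutdown_service`, and
`ack_before_exit`, as well as the timeout and retry values, are hypothetical
and chosen only for illustration. The toy service either gets its
acknowledgement out before it terminates or it doesn't, and the client treats
a missing acknowledgement as something to report rather than as a failure.

```python
import queue
import threading


def service(requests, replies, ack_before_exit):
    """Toy service: wait for a SHUTDOWN command, then terminate.

    ack_before_exit is a hypothetical switch that models the race:
    True  -> the acknowledgement is sent before the service exits,
    False -> the service exits before the acknowledgement goes out.
    """
    command = requests.get()
    if command == "SHUTDOWN" and ack_before_exit:
        replies.put("ACK")
    # Returning here stands in for the service process terminating.


def shutdown_service(ack_before_exit, timeout=0.2, retries=2):
    """Toy client: send SHUTDOWN and wait for an acknowledgement.

    A missing acknowledgement is not an error; the client prints a
    diagnostic message and returns a status string instead of raising.
    """
    requests, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=service, args=(requests, replies, ack_before_exit)
    )
    worker.start()
    requests.put("SHUTDOWN")

    status = "unacknowledged"
    for attempt in range(1, retries + 1):
        try:
            replies.get(timeout=timeout)
            status = "acknowledged"
            break
        except queue.Empty:
            print(f"SHUTDOWN: reply timeout (attempt {attempt} of {retries})")

    worker.join()  # the service has terminated in either case
    return status
```

Calling `shutdown_service(False)` prints the timeout diagnostics and returns
"unacknowledged" even though the toy service did shut down, which is the same
kind of harmless noise as in the log quoted below.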
(3)
I know what's happening. That is a bug. When using gt4:gt4:xxx,
delegation needs to be enabled on the first step. Delegation is disabled
(as much as possible) by default in all the providers. There should be a
fix in SVN this week.
Mihael
On Tue, 2009-08-25 at 10:49 -0400, Andrey Fedorov wrote:
> Hi,
>
> I have a processing step that takes roughly 2-5 min. It takes as
> input two ~5 MB files and produces a small text file, which I need to
> store. I need to run a large number of such jobs, using different
> parameters. It seems to me "coaster" is the best execution provider
> for my application.
>
> Trying to start simple, I am running the first.swift (echo) example that
> comes with Swift using different providers: GT2, GT4, GT2/coaster, and
> GT4/coaster. All of this is done on the NCSA Abe cluster.
>
> Here's my sites.xml:
>
> <pool handle="Abe-GT4">
>   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>   <execution provider="gt4" jobmanager="PBS"
>     url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>   <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
>
> <pool handle="Abe-GT4-coasters">
>   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>   <execution provider="coaster" jobmanager="gt4:gt4:pbs"
>     url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>   <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
>
> <pool handle="Abe-GT2">
>   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>   <execution provider="gt2" jobmanager="PBS"
>     url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
>   <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
>
> <pool handle="Abe-GT2-coasters">
>   <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>   <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>     url="grid-abe.ncsa.teragrid.org"/>
>   <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
>   <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
>
> And tc.data is simply
>
> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null
>
> and I change the site to test different providers.
>
> Now, results:
>
> 1) both GT2 and GT4 providers work fine, script completes
>
> 2) with GT2+coaster provider, I can see the job in the PBS queue
> (requested time is 01:41; I guess this comes from the default coaster
> parameters, which I didn't change). The job appears to finish
> successfully, but then I get this error:
>
> Final status: Finished successfully:1
> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
> Submitted task Task(type=JOB_SUBMISSION,
> identity=urn:0-1-1251210343871). Job id:
> urn:1251210343871-1251210376098-1251210376099
> Unregistering Command(21, SUBMITJOB)
> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
> Cleaning up...
> Shutting down service at https://141.142.68.180:45552
> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
> Command(22, SHUTDOWNSERVICE): handling reply timeout
> Command(22, SHUTDOWNSERVICE): failed too many times
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
> at java.util.TimerThread.mainLoop(Timer.java:512)
> at java.util.TimerThread.run(Timer.java:462)
> - Done
>
> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
> Possibly I am not setting up the site entry properly. I was not able
> to find any examples in the manual of how to set up coasters with GT4
> (can anyone provide an example?). Here's the error:
>
> Failed to transfer wrapper log from
> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
> END_FAILURE thread=0 tr=echo
> Progress: Failed:1
> Execution failed:
> Exception in echo:
> Arguments: [Hello, world!]
> Host: Abe-GT4-coasters
> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Cannot submit job: Limited proxy is not accepted
>
>
> Can anybody help me figure this out?
>
> Thanks
> --
> Andriy Fedorov, Ph.D.
>
> Research Fellow
> Brigham and Women's Hospital
> Harvard Medical School
> 75 Francis Street
> Boston, MA 02115 USA
> fedorov at bwh.harvard.edu