[Swift-user] Problems getting started with coasters

Andriy Fedorov fedorov at bwh.harvard.edu
Tue Aug 25 10:44:38 CDT 2009


Michael,

Thanks for the reply.

So my understanding is that I should check out the trunk version,
compile it (yes, I've done this before), and try the real application
with GT2+coasters.

I do have an account on Queen Bee. You say it has GT GRAM5, but I
thought you also said I should target GT2. What is GRAM5? At this
point, my preference is the system with the lowest load and a confirmed
functional coaster provider, to save time debugging and getting up to
speed. Should I use Abe or Queen Bee?

As soon as I compile the current Swift trunk and try GT2+coaster on Abe
with my application, I will report my experience to the list.

--
Andriy Fedorov, Ph.D.

Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu



On Tue, Aug 25, 2009 at 11:31, Michael Wilde<wilde at mcs.anl.gov> wrote:
> Andrey,
>
> On 8/25/09 9:49 AM, Andrey Fedorov wrote:
>>
>> Hi,
>>
>> I have a processing step that takes roughly 2-5 min. It takes as
>> input two ~5 MB files and produces a small text file, which I need to
>> store. I need to run a large number of such jobs, using different
>> parameters. It seems to me "coaster" is the best execution provider
>> for my application.
>>
>> Trying to start simple, I am running the first.swift (echo) example
>> that comes with Swift using different providers: GT2, GT4,
>> GT2/coaster, and GT4/coaster. All of this is done on the NCSA Abe
>> cluster.
>>
>> Here's my sites.xml:
>>
>> <pool handle="Abe-GT4">
>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>  <execution provider="gt4" jobmanager="PBS"
>>  url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>> </pool>
>>
>> <pool handle="Abe-GT4-coasters">
>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>  <execution provider="coaster" jobmanager="gt4:gt4:pbs"
>>  url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>> </pool>
>>
>> <pool handle="Abe-GT2">
>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>  <execution provider="gt2" jobmanager="PBS"
>>  url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>> </pool>
>>
>> <pool handle="Abe-GT2-coasters">
>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>>  url="grid-abe.ncsa.teragrid.org"/>
>>  <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>> </pool>
>>
>> And tc.data is simply
>>
>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null
>>
>> and I change the site to test different providers.
>>
>> Now, results:
>>
>> 1) Both the GT2 and GT4 providers work fine; the script completes.
>>
>> 2) With the GT2+coaster provider, I can see the job in the PBS queue
>> (the requested time is 01:41; I guess this comes from the default
>> coaster parameters, which I didn't change). The job appears to finish
>> successfully, but then I get this error:
>>
>> Final status:  Finished successfully:1
>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
>> Submitted task Task(type=JOB_SUBMISSION,
>> identity=urn:0-1-1251210343871). Job id:
>> urn:1251210343871-1251210376098-1251210376099
>> Unregistering Command(21, SUBMITJOB)
>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>> Cleaning up...
>> Shutting down service at https://141.142.68.180:45552
>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
>> Command(22, SHUTDOWNSERVICE): handling reply timeout
>> Command(22, SHUTDOWNSERVICE): failed too many times
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>        at
>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
>>        at
>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
>>        at java.util.TimerThread.mainLoop(Timer.java:512)
>>        at java.util.TimerThread.run(Timer.java:462)
>> - Done
>
> This seems like a low-priority error. I'll file it in bugzilla for now.
> Let's see how coasters works for you on Abe using your real app and a
> larger number of jobs, and come back to this shutdown problem if it proves
> to be a blocker to getting work done.
>
> Coasters has a few other current issues, mainly not throttling work
> efficiently. We have a fix for that, and need to apply and test it
> first.
>
> We've also been experimenting with a non-coaster way to use all 8 cores of
> machines like Abe, but let's try the coaster route first, if that's OK with
> you, and let's focus on GT2/Coasters, as that will be more common.
>
> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's
> sister-system at LSU, which we can try, assuming your TG project lets you
> run there.
>
> So please try to run the app, and we will try to get the latest coaster
> fixes committed. (I assume you are comfortable extracting Swift from svn and
> building it; if you have not done this before, can you try it, Andrey?)
>
> Regards,
>
> Mike
>
>
>> 3) With the GT4-coaster provider, I don't get as far as with GT2-coaster.
>> Possibly I am not setting up the site entry properly. I was not able
>> to find any examples in the manual of how to set up coasters with GT4
>> (can anyone provide an example?). Here's the error:
>>
>> Failed to transfer wrapper log from
>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
>> END_FAILURE thread=0 tr=echo
>> Progress:  Failed:1
>> Execution failed:
>>        Exception in echo:
>> Arguments: [Hello, world!]
>> Host: Abe-GT4-coasters
>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>>        Cannot submit job: Limited proxy is not accepted
>>
>>
>> Can anybody help figure this out?
>>
>> Thanks
>> --
>> Andriy Fedorov, Ph.D.
>>
>> Research Fellow
>> Brigham and Women's Hospital
>> Harvard Medical School
>> 75 Francis Street
>> Boston, MA 02115 USA
>> fedorov at bwh.harvard.edu
>


