[Swift-user] Problems getting started with coasters

Andriy Fedorov fedorov at bwh.harvard.edu
Tue Aug 25 12:11:21 CDT 2009


Michael --

Sounds like a plan, thanks :)

Let me digest this, and give it a try. I should get back to you and
the list with the report on my experience later this week (or earlier,
if I come across a stopper...)

--
Andriy Fedorov, Ph.D.

Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu



On Tue, Aug 25, 2009 at 13:04, Michael Wilde<wilde at mcs.anl.gov> wrote:
> Andrey, good news: GRAM5 is now available on Abe as well. Info and contact
> URLs, as well as some Swift usage experience reports, are at:
>
> http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
>
> So with this in mind, a good approach is:
>
> - sanity test your app using the PBS provider on Abe, with swift on the
> login host, just 1 or 2 jobs
>
> - sanity test 16 to 64 or so jobs, adding parallel clustering to the above
>
> - change from the PBS provider to the GRAM2 (pre-WS-GRAM) provider, but
> using the GRAM URLs at http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
> (still submitting from the Abe login host to Abe. You can keep the local
> data provider for this case)
>
> - Add Queenbee GRAM5 as a second site, using the gridftp data provider.
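>
> A minimal sketch of a sites.xml entry for the first step, assuming the
> stock "pbs" execution provider and "local" filesystem provider accept
> url="none" (the handle "Abe-PBS" is made up for illustration; the work
> directory is the one from the original post):
>
> ```xml
> <!-- Hypothetical entry: Swift runs on the Abe login host and submits
>      directly to PBS; data stays on the local filesystem. -->
> <pool handle="Abe-PBS">
>  <filesystem provider="local" url="none" />
>  <execution provider="pbs" url="none" />
>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
> </pool>
> ```
>
> The tc.data line would then name Abe-PBS as the site for echo.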
>
> Mike
>
>
> On 8/25/09 11:54 AM, Michael Wilde wrote:
>>
>> On 8/25/09 10:44 AM, Andriy Fedorov wrote:
>>>
>>> Michael,
>>>
>>> Thanks for the reply.
>>>
>>> So my understanding is, I should check out the trunk version and
>>> compile (yes, I've done this before), and try the real application
>>> with GT2+coasters.
>>
>> Yes, that's a good step to re-master, in preparation for Mihael checking in
>> Coaster fixes. He made significant enhancements to Coasters in the past 2
>> months, but has been working on a different project lately, so these are
>> not yet sufficiently tested. If you're willing to help with the testing,
>> that would be great.
>>
>> If not, I think the next best approach to try is this:
>>
>> - We have a small experimental mod that enables Swift GRAM2 jobs to use
>> all cores of multi-core hosts (such as the 8-core hosts on Abe and
>> QueenBee). Basically it uses the Swift clustering facility but runs jobs in
>> parallel instead of serially.
>>
>> It works well if your jobs have a very uniform runtime. If they don't, then
>> it wastes CPU. But it's a good interim solution for many apps until coasters
>> is more stable.
>>
>> This is described at:
>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering
>>
>> This info is very preliminary and not end-user ready. Tibi Stef-Praun, on
>> this list, has tried it. Please start a new thread here if you want to
>> discuss it or report experiences or problems with it.
>>
>> - On QueenBee or other GRAM5-enabled systems (there are not many yet, as
>> GRAM5 is still in test mode) you can use the GRAM2 provider if submitting
>> remotely. On Abe and any other GRAM2 systems you should run this with the
>> Condor-G provider if submitting remotely.
>>
>> The rule of thumb here for submitting jobs to a site from Swift running
>> remotely on a submit host is:
>>
>>   -- up to 20 jobs in parallel you can use plain GRAM2
>>   -- above 20 jobs, use Condor-G or, where available, GRAM5
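>>
>> As a rough sketch, the 20-job ceiling for plain GRAM2 could be set per
>> site with the karajan jobThrottle profile. This assumes the usual
>> approximation that Swift allows about jobThrottle*100+1 concurrent jobs;
>> the value below is illustrative, not tuned, and the rest of the entry is
>> the GT2 pool from the original post:
>>
>> ```xml
>> <pool handle="Abe-GT2">
>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>  <execution provider="gt2" jobmanager="PBS"
>>  url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
>>  <!-- Assumption: 0.19 * 100 + 1 gives roughly 20 concurrent jobs -->
>>  <profile namespace="karajan" key="jobThrottle">0.19</profile>
>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>> </pool>
>> ```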
>>
>> - On Abe, QueenBee, and other PBS systems with login hosts, you can run
>> Swift locally on the login host, and use the PBS provider with the parallel
>> clustering approach.
>>
>> We have a few other solutions that I will save till we explore these two
>> solutions.
>>
>> To prepare for this, try running your app on Abe using the PBS provider,
>> with just 1 or 2 jobs, then try the parallel clustering tip above.
>>
>>> I do have an account on Queen Bee. You say, it has GT GRAM5, but I
>>> thought you also said I should target using GT2. What is GRAM5?
>>
>> GRAM5 is a new, more efficient version of GRAM2. It's fully compatible, so
>> you just set up the Swift sites.xml exactly as for GRAM2. The only thing
>> that changes is that you use a different URL for the GRAM gatekeeper
>> contact string (i.e., a different host and/or port, that's all).
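>>
>> For illustration, a GRAM5 entry would look like an ordinary GT2 pool with
>> only the contact string changed. Everything in capitals below is a
>> placeholder, not a real value, and the handle is made up:
>>
>> ```xml
>> <pool handle="Site-GRAM5">
>>  <gridftp url="gsiftp://GRIDFTP-HOST:2811/" />
>>  <!-- Same provider and jobmanager as for GRAM2; only url differs.
>>       Replace GRAM5-HOST:PORT with the site's published contact string. -->
>>  <execution provider="gt2" jobmanager="PBS"
>>  url="GRAM5-HOST:PORT/jobmanager-pbs"/>
>>  <workdirectory>/PATH/TO/SCRATCH</workdirectory>
>> </pool>
>> ```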
>>
>> I'll need to get you the contact string for GRAM5 on QueenBee if/when we
>> both agree the time is right to try it.
>>
>>> At
>>> this point, my preference is the system with lowest load and confirmed
>>> functional coaster provider, to save time debugging and getting up to
>>> speed. Should I use Abe or Queen Bee?
>>
>> That's hard to answer, as the loads fluctuate.  You can examine the
>> TeraPort system load monitor in the TG portal, which gives some rough
>> estimates of load and queue time.  Then queue the jobs and wait. Best to run
>> Swift under screen, so you can easily wait for and monitor your script
>> executions from anywhere, and not be interrupted if long delays are
>> encountered.
>>
>> - Mike
>>
>>> As soon as I compile the current swift trunk and try GT2+coaster @Abe
>>> for my application, I will report to the list my experience.
>>>
>>> --
>>> Andriy Fedorov, Ph.D.
>>>
>>> Research Fellow
>>> Brigham and Women's Hospital
>>> Harvard Medical School
>>> 75 Francis Street
>>> Boston, MA 02115 USA
>>> fedorov at bwh.harvard.edu
>>>
>>>
>>>
>>> On Tue, Aug 25, 2009 at 11:31, Michael Wilde<wilde at mcs.anl.gov> wrote:
>>>>
>>>> Andrey,
>>>>
>>>> On 8/25/09 9:49 AM, Andrey Fedorov wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a processing step that takes roughly 2-5 min. It takes as
>>>>> input two ~5 MB files, and produces a small text file, which I need to
>>>>> store. I need to compute a large number of such jobs, using different
>>>>> parameters. It seems to me "coaster" is the best execution provider
>>>>> for my application.
>>>>>
>>>>> Trying to start simple, I am running the first.swift (echo) example that
>>>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and
>>>>> GT4/coaster. All of this is done on the Abe cluster at NCSA.
>>>>>
>>>>> Here's my sites.xml:
>>>>>
>>>>> <pool handle="Abe-GT4">
>>>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>>  <execution provider="gt4" jobmanager="PBS"
>>>>>  url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>> </pool>
>>>>>
>>>>> <pool handle="Abe-GT4-coasters">
>>>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>>  <execution provider="coaster" jobmanager="gt4:gt4:pbs"
>>>>>  url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>> </pool>
>>>>>
>>>>> <pool handle="Abe-GT2">
>>>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>>  <execution provider="gt2" jobmanager="PBS"
>>>>>  url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
>>>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>> </pool>
>>>>>
>>>>> <pool handle="Abe-GT2-coasters">
>>>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>>  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>>>>>  url="grid-abe.ncsa.teragrid.org"/>
>>>>>  <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
>>>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>> </pool>
>>>>>
>>>>> And tc.data is simply
>>>>>
>>>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null
>>>>>
>>>>> and I change the site to test different providers.
>>>>>
>>>>> Now, results:
>>>>>
>>>>> 1) both GT2 and GT4 providers work fine, script completes
>>>>>
>>>>> 2) with GT2+coaster provider, I can see the job in the PBS queue
>>>>> (requested time is 01:41; I guess this comes from the default coaster
>>>>> parameters, which I didn't change). The job appears to finish
>>>>> successfully, but then I get this error:
>>>>>
>>>>> Final status:  Finished successfully:1
>>>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
>>>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
>>>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
>>>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
>>>>> Submitted task Task(type=JOB_SUBMISSION,
>>>>> identity=urn:0-1-1251210343871). Job id:
>>>>> urn:1251210343871-1251210376098-1251210376099
>>>>> Unregistering Command(21, SUBMITJOB)
>>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
>>>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
>>>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>>>> Cleaning up...
>>>>> Shutting down service at https://141.142.68.180:45552
>>>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
>>>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
>>>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
>>>>> Command(22, SHUTDOWNSERVICE): handling reply timeout
>>>>> Command(22, SHUTDOWNSERVICE): failed too many times
>>>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>>>       at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
>>>>>       at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
>>>>>       at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>       at java.util.TimerThread.run(Timer.java:462)
>>>>> - Done
>>>>
>>>> This seems like a low-prio error. I'll file it in bugzilla for now. Let's
>>>> see how coasters works for you on Abe using your real app and a larger
>>>> number of jobs, and come back to this shutdown problem if it proves to be
>>>> a blocker to getting work done.
>>>>
>>>> Coasters has a few other current issues - mainly not throttling work
>>>> efficiently - that we have a fix for, and we need to apply and test that
>>>> one first.
>>>>
>>>> We've also been experimenting with a non-coaster way to use all 8 cores
>>>> of machines like Abe, but let's try the coaster route first, if that's OK
>>>> with you, and let's focus on GT2/Coasters, as that will be more common.
>>>>
>>>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's
>>>> sister-system at LSU, which we can try, assuming your TG project lets
>>>> you run there.
>>>>
>>>> So please try to run the app, and we will try to get the latest coaster
>>>> fixes committed. (I assume you are comfortable extracting Swift from svn
>>>> and building it; if you have not done this before, can you try it, Andrey?)
>>>>
>>>> Regards,
>>>>
>>>> Mike
>>>>
>>>>
>>>>> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
>>>>> Possibly I am not setting up the site entry properly. I was not able
>>>>> to find any examples in the manual of how to set up coasters with GT4
>>>>> (can anyone provide an example?). Here's the error:
>>>>>
>>>>> Failed to transfer wrapper log from
>>>>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
>>>>> END_FAILURE thread=0 tr=echo
>>>>> Progress:  Failed:1
>>>>> Execution failed:
>>>>>       Exception in echo:
>>>>> Arguments: [Hello, world!]
>>>>> Host: Abe-GT4-coasters
>>>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
>>>>> stderr.txt:
>>>>>
>>>>> stdout.txt:
>>>>>
>>>>> ----
>>>>>
>>>>> Caused by:
>>>>>       Cannot submit job: Limited proxy is not accepted
>>>>>
>>>>>
>>>>> Can anybody help figure this out?
>>>>>
>>>>> Thanks
>>>>> --
>>>>> Andriy Fedorov, Ph.D.
>>>>>
>>>>> Research Fellow
>>>>> Brigham and Women's Hospital
>>>>> Harvard Medical School
>>>>> 75 Francis Street
>>>>> Boston, MA 02115 USA
>>>>> fedorov at bwh.harvard.edu
>>>>> _______________________________________________
>>>>> Swift-user mailing list
>>>>> Swift-user at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
>


