[Swift-user] Problems getting started with coasters

Michael Wilde wilde at mcs.anl.gov
Tue Aug 25 12:04:12 CDT 2009


Andrey, good news: GRAM5 is now available on Abe as well. Info and 
contact URLs, as well as some Swift usage experience reports, are at:

http://dev.globus.org/wiki/GRAM/GRAM5#Deployments

So with this in mind, a good approach is:

- sanity test your app using the PBS provider on Abe, with swift on the 
login host, just 1 or 2 jobs

- sanity test 16 to 64 or so jobs, adding parallel clustering to the above

- change from the PBS provider to the GRAM2 (pre-WS-GRAM) provider, but 
using the GRAM URLs at http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
(still submitting from the Abe login host to Abe. You can keep the local 
data provider for this case)

- Add Queenbee GRAM5 as a second site, using the gridftp data provider.

Mike


On 8/25/09 11:54 AM, Michael Wilde wrote:
> On 8/25/09 10:44 AM, Andriy Fedorov wrote:
>> Michael,
>>
>> Thanks for the reply.
>>
>> So my understanding is, I should check out the trunk version and
>> compile (yes, I've done this before), and try the real application
>> with GT2+coasters.
> 
> Yes, that's a good step to re-master, in preparation for Mihael checking
> in Coaster fixes. He made significant enhancements to Coasters in the
> past 2 months, but has been working on a different project lately and
> thus these are not yet sufficiently tested. If you're willing to help in
> the testing that would be great.
> 
> If not, I think the next best approach to try is this:
> 
> - We have a small experimental mod that enables Swift GRAM2 jobs to use 
> all cores of multi-core hosts (such as the 8-core hosts on Abe and 
> QueenBee). Basically it uses the Swift clustering facility but runs jobs 
> in parallel instead of serially.
> 
> It works well if your jobs have a very uniform runtime. If they don't,
> then it wastes CPU.  But it's a good interim solution for many apps until
> coasters is more stable.
> 
> This is described at:
> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering
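
A rough way to see the uniform-runtime caveat, as a minimal sketch only. It assumes that a parallel cluster holds all cores of a node until its slowest job finishes, which is an assumption about the experimental mod rather than something stated here. The function name and the sample runtimes are illustrative; the 2 to 5 minute range comes from Andriy's description of his jobs further down the thread.

```python
def idle_fraction(runtimes_min, cores=8):
    """Fraction of reserved core-time left idle when the jobs run side by
    side and the node is held until the slowest job ends (assumed model)."""
    if not runtimes_min or len(runtimes_min) > cores:
        raise ValueError("need between 1 and `cores` jobs per cluster")
    held = cores * max(runtimes_min)   # core-minutes reserved on the node
    used = sum(runtimes_min)           # core-minutes spent doing real work
    return 1.0 - used / held


uniform = [3, 3, 3, 3, 3, 3, 3, 3]     # every job takes 3 minutes
mixed = [2, 2, 2, 3, 3, 3, 4, 5]       # jobs spread over 2 to 5 minutes

print(f"uniform runtimes: {idle_fraction(uniform):.2f} of core-time idle")
print(f"mixed runtimes:   {idle_fraction(mixed):.2f} of core-time idle")
```

Under this model the uniform batch wastes nothing, while the mixed batch leaves a large share of the node idle, which is why the approach suits apps with predictable job lengths.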
> 
> This info is very preliminary and not end-user ready. Tibi Stef-Praun,
> on this list, has tried it. Please start a new thread here if you want to
> discuss it or report experiences or problems with it.
> 
> - On QueenBee or other GRAM5-enabled systems (not many yet, as it's in
> test mode) you can use the GRAM2 provider if submitting remotely.
> On Abe and any other GRAM2 systems you should run this with the Condor-G
> provider if submitting remotely.
> 
> The rule of thumb here for submitting jobs to a site from Swift running 
> remotely on a submit host is:
> 
>    -- up to 20 jobs in parallel you can use plain GRAM2
>    -- above 20 jobs, use Condor-G or, where available, GRAM5
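
The rule of thumb above can be written down as a small helper. This is only a sketch: the function name and the returned labels are made up for illustration and are not Swift provider names, and it reads the "where available" alternative as GRAM5, in line with the QueenBee advice earlier in this message.

```python
def pick_submission(parallel_jobs, gram5_available=False):
    """Suggest how to submit from a remote submit host to one site.

    The returned strings are descriptive labels only (hypothetical),
    not values to paste into sites.xml.
    """
    if parallel_jobs < 0:
        raise ValueError("parallel_jobs must be non-negative")
    if parallel_jobs <= 20:
        return "plain GRAM2"
    if gram5_available:
        # GRAM5 is driven through the same GRAM2 provider, per this thread.
        return "GRAM2 provider against a GRAM5 contact"
    return "Condor-G"


for jobs, gram5 in [(2, False), (20, False), (64, False), (64, True)]:
    print(jobs, gram5, "->", pick_submission(jobs, gram5))
```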
> 
> - On Abe, QueenBee, and other PBS systems with login hosts, you can run 
> Swift locally on the login host, and use the PBS provider with the 
> parallel clustering approach.
> 
> We have a few other solutions that I will save till we explore these two 
> solutions.
> 
> To prepare for this, try running your app on Abe using the PBS provider, 
> with just 1 or 2 jobs, then try the parallel clustering tip above.
> 
>> I do have an account on Queen Bee. You say, it has GT GRAM5, but I
>> thought you also said I should target using GT2. What is GRAM5?
> 
> GRAM5 is a new, more efficient version of GRAM2. It's fully compatible,
> so you just set Swift sites.xml exactly as for GRAM2. The only thing
> that changes is that you use a different URL for the GRAM gatekeeper
> contact string (i.e. a different host and/or port, that's all).
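
To illustrate that only the contact string changes, here is a minimal sketch that takes the Abe-GT2 pool from Andriy's sites.xml (quoted further down) and swaps the `url` of its `<execution>` element. The helper name is hypothetical, and `GRAM5_CONTACT_FROM_WIKI` is a placeholder, not a real contact string; the real value has to come from the GRAM5 deployments page or the site administrators.

```python
import xml.etree.ElementTree as ET

# Pool entry copied from the sites.xml quoted in this thread.
GT2_POOL = """\
<pool handle="Abe-GT2">
  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
  <execution provider="gt2" jobmanager="PBS"
             url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
</pool>
"""


def with_contact(pool_xml, contact):
    """Return the pool entry with only the execution contact URL replaced."""
    pool = ET.fromstring(pool_xml)
    pool.find("execution").set("url", contact)
    return ET.tostring(pool, encoding="unicode")


# Placeholder value (assumption): substitute the published GRAM5 contact.
print(with_contact(GT2_POOL, "GRAM5_CONTACT_FROM_WIKI"))
```

The provider stays `gt2` and the data and work directory settings are untouched; whether the pool handle should also be renamed is left to the reader.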
> 
> I'll need to get you the contact string for GRAM5 on QueenBee if/when we 
> both agree the time is right to try it.
> 
>> At
>> this point, my preference is the system with lowest load and confirmed
>> functional coaster provider, to save time debugging and getting up to
>> speed. Should I use Abe or Queen Bee?
> 
> That's hard to answer, as the loads fluctuate.  You can examine the
> TeraPort system load monitor in the TG portal, which gives some rough 
> estimates of load and queue time.  Then queue the jobs and wait. Best to 
> run Swift under screen, so you can easily wait for and monitor your 
> script executions from anywhere, and not be interrupted if long delays 
> are encountered.
> 
> - Mike
> 
>> As soon as I compile the current swift trunk and try GT2+coaster @Abe
>> for my application, I will report to the list my experience.
>>
>> --
>> Andriy Fedorov, Ph.D.
>>
>> Research Fellow
>> Brigham and Women's Hospital
>> Harvard Medical School
>> 75 Francis Street
>> Boston, MA 02115 USA
>> fedorov at bwh.harvard.edu
>>
>>
>>
>> On Tue, Aug 25, 2009 at 11:31, Michael Wilde<wilde at mcs.anl.gov> wrote:
>>> Andrey,
>>>
>>> On 8/25/09 9:49 AM, Andrey Fedorov wrote:
>>>> Hi,
>>>>
>>>> I have a processing step that takes roughly 2-5 min. It takes two
>>>> ~5 MB files as input and produces a small text file, which I need to
>>>> store. I need to run a large number of such jobs, using different
>>>> parameters. It seems to me "coaster" is the best execution provider
>>>> for my application.
>>>>
>>>> Trying to start simple, I am running first.swift (echo) example that
>>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and
>>>> GT4/coaster. All of this is done on Abe NCSA cluster.
>>>>
>>>> Here's my sites.xml:
>>>>
>>>> <pool handle="Abe-GT4">
>>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>  <execution provider="gt4" jobmanager="PBS"
>>>>  url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>> </pool>
>>>>
>>>> <pool handle="Abe-GT4-coasters">
>>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>  <execution provider="coaster" jobmanager="gt4:gt4:pbs"
>>>>  url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>> </pool>
>>>>
>>>> <pool handle="Abe-GT2">
>>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>  <execution provider="gt2" jobmanager="PBS"
>>>>  url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
>>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>> </pool>
>>>>
>>>> <pool handle="Abe-GT2-coasters">
>>>>  <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>  <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>>>>  url="grid-abe.ncsa.teragrid.org"/>
>>>>  <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org" />
>>>>  <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>> </pool>
>>>>
>>>> And tc.data is simply
>>>>
>>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null
>>>>
>>>> and I change the site to test different providers.
>>>>
>>>> Now, results:
>>>>
>>>> 1) both GT2 and GT4 providers work fine, script completes
>>>>
>>>> 2) with GT2+coaster provider, I can see the job in the PBS queue
>>>> (requested time is 01:41, I guess this comes with the default coaster
>>>> parameters, that I didn't change). The job appears to finish
>>>> successfully, but then I get this error:
>>>>
>>>> Final status:  Finished successfully:1
>>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
>>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
>>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
>>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
>>>> Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871). Job id: urn:1251210343871-1251210376098-1251210376099
>>>> Unregistering Command(21, SUBMITJOB)
>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
>>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
>>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>>> Cleaning up...
>>>> Shutting down service at https://141.142.68.180:45552
>>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
>>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
>>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
>>>> Command(22, SHUTDOWNSERVICE): handling reply timeout
>>>> Command(22, SHUTDOWNSERVICE): failed too many times
>>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>>        at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
>>>>        at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
>>>>        at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>        at java.util.TimerThread.run(Timer.java:462)
>>>> - Done
>>> This seems like a low-priority error. I'll file it in bugzilla for now. Let's see
>>> how coasters works for you on Abe using your real app and a larger number of
>>> jobs, and come back to this shutdown problem if it proves to be a blocker to
>>> getting work done.
>>>
>>> Coasters has a few other current issues, mainly not throttling work
>>> efficiently. We have a fix for that, and need to apply and test that one
>>> first.
>>>
>>> We've also been experimenting with a non-coaster way to use all 8 cores of
>>> machines like Abe, but let's try the coaster route first, if that's OK with
>>> you, and let's focus on GT2/Coasters, as that will be more common.
>>>
>>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's
>>> sister-system at LSU, which we can try, assuming your TG project lets you
>>> run there.
>>>
>>> So please try to run the app, and we will try to get the latest coaster
>>> fixes committed. (I assume you are comfortable extracting Swift from svn and
>>> building it; if you have not done this before, can you try it, Andrey?)
>>>
>>> Regards,
>>>
>>> Mike
>>>
>>>
>>>> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
>>>> Possibly I am not setting up the site entry properly. I was not able
>>>> to find any examples in the manual of how to set up coasters with GT4
>>>> (can anyone provide an example?). Here's the error:
>>>>
>>>> Failed to transfer wrapper log from
>>>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
>>>> END_FAILURE thread=0 tr=echo
>>>> Progress:  Failed:1
>>>> Execution failed:
>>>>        Exception in echo:
>>>> Arguments: [Hello, world!]
>>>> Host: Abe-GT4-coasters
>>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
>>>> stderr.txt:
>>>>
>>>> stdout.txt:
>>>>
>>>> ----
>>>>
>>>> Caused by:
>>>>        Cannot submit job: Limited proxy is not accepted
>>>>
>>>>
>>>> Can anybody help figure this out?
>>>>
>>>> Thanks
>>>> --
>>>> Andriy Fedorov, Ph.D.
>>>>
>>>> Research Fellow
>>>> Brigham and Women's Hospital
>>>> Harvard Medical School
>>>> 75 Francis Street
>>>> Boston, MA 02115 USA
>>>> fedorov at bwh.harvard.edu
>>>> _______________________________________________
>>>> Swift-user mailing list
>>>> Swift-user at ci.uchicago.edu
>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> 


