[Swift-user] Problems getting started with coasters
Michael Wilde
wilde at mcs.anl.gov
Thu Aug 27 15:42:33 CDT 2009
Andriy, can you post your sites.xml file?
I *suspect* that you may (inadvertently) be using the Coaster data
provider, via an XML tag like this in the <pool> element for the local
site:
<filesystem provider="coaster" url="gt2://grid.myhost.org" />
If you are, remove that (for now). There are suspected problems with
coaster data transfer for larger files. It has worked well for large
sets of very small ones. (We need to get such alerts posted somewhere
visible; sorry.)
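As a quick way to confirm whether staging corrupted a file, comparing
checksums of the source and the staged copy works. A minimal sketch (not
part of Swift itself; paths are placeholders you'd fill in):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def files_match(src, staged):
    """True when the staged copy is byte-identical to the source."""
    return sha256_of(src) == sha256_of(staged)
```

On the login host, `md5sum` on the two paths does the same job in one line.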
Use (only) this tag for the data provider for the local-PBS "sanity" test:
<gridftp url="local://localhost" />
If you do *not* have a coaster filesystem tag in your <pool> element,
then I need to dig deeper, and may need some logs from you and/or access
to your directories on Abe.
Also note:
o you can and should use the gridftp/local tag above even when using
coasters as your execution provider, as long as you are running on a site
that has access to your local directories (e.g. when the worker nodes of
your target site can directly access the file names that your Swift script
is mapping).
o Mihael posted the promised fixes to Coasters last night; once we get
past this sanity test, you should try those. You may be trying them ahead
of me, so my apologies if you find some problems for us first.
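Putting those two notes together, a pool for running on the Abe login host
with coasters for execution but plain local data access might look roughly
like this (a sketch only: the handle and workdirectory are placeholders,
and the `local:pbs` jobmanager string is my assumption for local PBS
submission; adjust to your setup):

```xml
<pool handle="Abe-local-coasters">
  <!-- execute jobs through coasters over the local PBS queue -->
  <execution provider="coaster" jobmanager="local:pbs" url="localhost" />
  <!-- stage data with plain local file access, NOT the coaster filesystem -->
  <gridftp url="local://localhost" />
  <workdirectory>/u/ac/yourlogin/scratch-global/scratch</workdirectory>
</pool>
```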
- Mike
On 8/27/09 2:37 PM, Andriy Fedorov wrote:
> On Tue, Aug 25, 2009 at 13:04, Michael Wilde<wilde at mcs.anl.gov> wrote:
>> Andrey, good news: GRAM5 is now available on Abe as well. Info and contact
>> URLs, as well as some Swift usage experience reports, are at:
>>
>> http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
>>
>> So with this in mind, a good approach is:
>>
>> - sanity test your app using the PBS provider on Abe, with swift on the
>> login host, just 1 or 2 jobs
>>
>
> Michael,
>
> I am actually having trouble with this sanity test.
>
> I need to submit a file (about 5M) as an input to my application. What
> seems to be happening is that the file gets corrupted in transmission!
>
> I debugged this, and this appears to be the reason for my application to fail.
>
> The same application/swift script runs fine when I use plain gt2,
> without coasters.
>
> To debug, I echoed the directory where my application is started by
> Swift, so I could get the exact location of the file:
>
> [fedorov at TG/Abe:honest4 SlicerReg] cat fileInfo.txt
> lrwxrwxrwx 1 fedorov dkk 109 Aug 27 14:15 Data/MRMeningioma0.nrrd ->
> /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
>
> The file has the same size, but the content is not identical! Here's
> basically the story:
>
> [fedorov at TG/Abe:honest4 SlicerReg] ls -la Data/
> total 10960
> drwxr-x--- 2 fedorov dkk 4096 Aug 27 14:22 .
> drwxr-x--- 25 fedorov dkk 12288 Aug 27 14:27 ..
> -rw-r----- 1 fedorov dkk 5069225 Aug 25 15:49 MRMeningioma0.nrrd
> -rw-r----- 1 fedorov dkk 6132840 Aug 25 15:49 MRMeningioma1.nrrd
> [fedorov at TG/Abe:honest4 SlicerReg] ls -la
> /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data
> total 10952
> drwxr-xr-x 2 fedorov dkk 4096 Aug 27 14:22 .
> drwxr-xr-x 3 fedorov dkk 4096 Aug 27 14:10 ..
> -rw-r--r-- 1 fedorov dkk 5069225 Aug 27 14:11 MRMeningioma0.nrrd
> -rw-r--r-- 1 fedorov dkk 6132840 Aug 27 14:16 MRMeningioma1.nrrd
> [fedorov at TG/Abe:honest4 SlicerReg] diff Data/MRMeningioma0.nrrd
> /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
> Binary files Data/MRMeningioma0.nrrd and
> /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
> differ
>
> I can read my original file, but not the copied one:
>
> [fedorov at TG/Abe:honest4 SlicerReg] ~/Slicer3-lib/teem-build/bin/unu
> minmax /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
> unu minmax: trouble with
> "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd":
> [unu minmax] unu minmax: trouble loading
> "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd"
> [unu minmax] [nrrd] nrrdLoad: trouble reading
> "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd"
> [unu minmax] [nrrd] nrrdRead: trouble
> [unu minmax] [nrrd] _nrrdRead: trouble reading NRRD file
> [unu minmax] [nrrd] _nrrdFormatNRRD_read:
> [unu minmax] [nrrd] _nrrdEncodingGzip_read: error reading from gzFile
> [unu minmax] [nrrd] _nrrdGzRead: data read error
> [fedorov at TG/Abe:honest4 SlicerReg] ~/Slicer3-lib/teem-build/bin/unu
> minmax Data/MRMeningioma0.nrrd
> min: 0
> max: 695
>
>
>
> Have you guys run any applications with non-trivial input file sizes,
> and verified that file integrity is preserved?
>
>
>
>> - sanity test 16 to 64 or so jobs, adding parallel clustering to the above
>>
>> - change from the PBS provider to the GRAM2 (pre-WS-GRAM) provider, but
>> using the GRAM URLs at http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
>> (still submitting from the Abe login host to Abe. You can keep the local
>> data provider for this case)
>>
>> - Add Queenbee GRAM5 as a second site, using the gridftp data provider.
>>
>> Mike
>>
>>
>> On 8/25/09 11:54 AM, Michael Wilde wrote:
>>> On 8/25/09 10:44 AM, Andriy Fedorov wrote:
>>>> Michael,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> So my understanding is, I should check out the trunk version and
>>>> compile (yes, I've done this before), and try the real application
>>>> with GT2+coasters.
>>> Yes, that's a good step to re-master, in preparation for Mihael checking in
>>> Coaster fixes. He made significant enhancements to Coasters in the past 2
>>> months, but has been working on a different project lately, and thus these
>>> are not yet sufficiently tested. If you're willing to help in the testing,
>>> that would be great.
>>>
>>> If not, I think the next best approach to try is this:
>>>
>>> - We have a small experimental mod that enables Swift GRAM2 jobs to use
>>> all cores of multi-core hosts (such as the 8-core hosts on Abe and
>>> QueenBee). Basically it uses the Swift clustering facility but runs jobs in
>>> parallel instead of serially.
>>>
>>> It works well if your jobs have a very uniform runtime. If they don't, then
>>> it wastes CPU. But it's a good interim solution for many apps until coasters
>>> is more stable.
>>>
>>> This is described at:
>>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering
>>>
>>> This info is very preliminary and not end-user ready. Tibi Stef-Praun, on
>>> this list, has tried it. Please start a new thread here if you want to
>>> discuss it or report experiences or problems with it.
>>>
>>> - On QueenBee or other GRAM5-enabled systems (not many, as it's in test
>>> mode) you can use the GRAM2 provider if submitting remotely.
>>> On Abe and any other GRAM2 systems you should run this with the Condor-G
>>> provider if submitting remotely.
>>>
>>> The rule of thumb here for submitting jobs to a site from Swift running
>>> remotely on a submit host is:
>>>
>>> -- up to 20 jobs in parallel, you can use plain GRAM2
>>> -- above 20 jobs, use Condor-G or, where available, GRAM5
>>>
>>> - On Abe, QueenBee, and other PBS systems with login hosts, you can run
>>> Swift locally on the login host, and use the PBS provider with the parallel
>>> clustering approach.
>>>
>>> We have a few other solutions that I will save till we explore these two
>>> solutions.
>>>
>>> To prepare for this, try running your app on Abe using the PBS provider,
>>> with just 1 or 2 jobs, then try the parallel clustering tip above.
>>>
>>>> I do have an account on Queen Bee. You say, it has GT GRAM5, but I
>>>> thought you also said I should target using GT2. What is GRAM5?
>>> GRAM5 is a new, more efficient version of GRAM2. It's fully compatible, so
>>> you just set up Swift sites.xml exactly as for GRAM2. The only thing that
>>> changes is that you use a different URL for the GRAM gatekeeper contact
>>> string (i.e. a different host and/or port, that's all).
>>>
>>> I'll need to get you the contact string for GRAM5 on QueenBee if/when we
>>> both agree the time is right to try it.
>>>
>>>> At
>>>> this point, my preference is the system with lowest load and confirmed
>>>> functional coaster provider, to save time debugging and getting up to
>>>> speed. Should I use Abe or Queen Bee?
>>> That's hard to answer, as the loads fluctuate. You can examine the
>>> TeraPort system load monitor in the TG portal, which gives some rough
>>> estimates of load and queue time. Then queue the jobs and wait. It's best
>>> to run Swift under screen, so you can easily wait for and monitor your
>>> script executions from anywhere, and not be interrupted if long delays are
>>> encountered.
>>>
>>> - Mike
>>>
>>>> As soon as I compile the current swift trunk and try GT2+coaster @Abe
>>>> for my application, I will report to the list my experience.
>>>>
>>>> --
>>>> Andriy Fedorov, Ph.D.
>>>>
>>>> Research Fellow
>>>> Brigham and Women's Hospital
>>>> Harvard Medical School
>>>> 75 Francis Street
>>>> Boston, MA 02115 USA
>>>> fedorov at bwh.harvard.edu
>>>>
>>>>
>>>>
>>>> On Tue, Aug 25, 2009 at 11:31, Michael Wilde<wilde at mcs.anl.gov> wrote:
>>>>> Andrey,
>>>>>
>>>>> On 8/25/09 9:49 AM, Andrey Fedorov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have a processing step that takes somewhere ~2-5 min. It takes as
>>>>>> input two ~5Mb files, and produces a small text file, which I need to
>>>>>> store. I need to compute a large number of such jobs, using different
>>>>>> parameters. It seems to me "coaster" is the best execution provider
>>>>>> for my application.
>>>>>>
>>>>>> Trying to start simple, I am running first.swift (echo) example that
>>>>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and
>>>>>> GT4/coaster. All of this is done on Abe NCSA cluster.
>>>>>>
>>>>>> Here's my sites.xml:
>>>>>>
>>>>>> <pool handle="Abe-GT4">
>>>>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>>> <execution provider="gt4" jobmanager="PBS"
>>>>>>
>>>>>>
>>>>>> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>>>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>>> </pool>
>>>>>>
>>>>>> <pool handle="Abe-GT4-coasters">
>>>>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>>> <execution provider="coaster" jobmanager="gt4:gt4:pbs"
>>>>>>
>>>>>>
>>>>>> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>>>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>>> </pool>
>>>>>>
>>>>>> <pool handle="Abe-GT2">
>>>>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>>> <execution provider="gt2" jobmanager="PBS"
>>>>>> url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
>>>>>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>>> </pool>
>>>>>>
>>>>>> <pool handle="Abe-GT2-coasters">
>>>>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>>> <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>>>>>> url="grid-abe.ncsa.teragrid.org"/>
>>>>>> <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org"
>>>>>> />
>>>>>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>>> </pool>
>>>>>>
>>>>>> And tc.data is simply
>>>>>>
>>>>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null
>>>>>>
>>>>>> and I change the site to test different providers.
>>>>>>
>>>>>> Now, results:
>>>>>>
>>>>>> 1) both GT2 and GT4 providers work fine, script completes
>>>>>>
>>>>>> 2) with the GT2+coaster provider, I can see the job in the PBS queue
>>>>>> (requested time is 01:41; I guess this comes from the default coaster
>>>>>> parameters, which I didn't change). The job appears to finish
>>>>>> successfully, but then I get this error:
>>>>>>
>>>>>> Final status: Finished successfully:1
>>>>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
>>>>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>>>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
>>>>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
>>>>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
>>>>>> Submitted task Task(type=JOB_SUBMISSION,
>>>>>> identity=urn:0-1-1251210343871). Job id:
>>>>>> urn:1251210343871-1251210376098-1251210376099
>>>>>> Unregistering Command(21, SUBMITJOB)
>>>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
>>>>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
>>>>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>>>>> Cleaning up...
>>>>>> Shutting down service at https://141.142.68.180:45552
>>>>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
>>>>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
>>>>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
>>>>>> Command(22, SHUTDOWNSERVICE): handling reply timeout
>>>>>> Command(22, SHUTDOWNSERVICE): failed too many times
>>>>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>>>> at
>>>>>>
>>>>>> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
>>>>>> at
>>>>>>
>>>>>> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
>>>>>> at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>> at java.util.TimerThread.run(Timer.java:462)
>>>>>> - Done
>>>>> This seems like a low-prio error. I'll file it in bugzilla for now.
>>>>> Let's see how coasters works for you on Abe using your real app and a
>>>>> larger number of jobs, and come back to this shutdown problem if it
>>>>> proves to be a blocker to getting work done.
>>>>>
>>>>> Coasters has a few other current issues - mainly not throttling work
>>>>> efficiently - that we have a fix for, and need to apply and test that
>>>>> one first.
>>>>>
>>>>> We've also been experimenting with a non-coaster way to use all 8 cores
>>>>> of machines like Abe, but let's try the coaster route first, if that's
>>>>> OK with you, and let's focus on GT2/Coasters, as that will be more
>>>>> common.
>>>>>
>>>>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's
>>>>> sister-system at LSU, which we can try, assuming your TG project lets
>>>>> you run there.
>>>>>
>>>>> So please try to run the app, and we will try to get the latest coaster
>>>>> fixes committed. (I assume you are comfortable extracting Swift from svn
>>>>> and building it; if you have not done this before, can you try it,
>>>>> Andrey?)
>>>>>
>>>>> Regards,
>>>>>
>>>>> Mike
>>>>>
>>>>>
>>>>>> 3) with the GT4-coaster provider, I don't get as far as with
>>>>>> GT2-coaster. Possibly I am not setting up the site entry properly. I
>>>>>> was not able to find any examples in the manual of how to set up
>>>>>> coasters with GT4 (can anyone provide an example?). Here's the error:
>>>>>>
>>>>>> Failed to transfer wrapper log from
>>>>>> first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
>>>>>> END_FAILURE thread=0 tr=echo
>>>>>> Progress: Failed:1
>>>>>> Execution failed:
>>>>>> Exception in echo:
>>>>>> Arguments: [Hello, world!]
>>>>>> Host: Abe-GT4-coasters
>>>>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
>>>>>> stderr.txt:
>>>>>>
>>>>>> stdout.txt:
>>>>>>
>>>>>> ----
>>>>>>
>>>>>> Caused by:
>>>>>> Cannot submit job: Limited proxy is not accepted
>>>>>>
>>>>>>
>>>>>> Can anybody help figuring this out?
>>>>>>
>>>>>> Thanks
>>>>>> --
>>>>>> Andriy Fedorov, Ph.D.
>>>>>>
>>>>>> Research Fellow
>>>>>> Brigham and Women's Hospital
>>>>>> Harvard Medical School
>>>>>> 75 Francis Street
>>>>>> Boston, MA 02115 USA
>>>>>> fedorov at bwh.harvard.edu
>>>>>> _______________________________________________
>>>>>> Swift-user mailing list
>>>>>> Swift-user at ci.uchicago.edu
>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user