[Swift-user] Problems getting started with coasters
Andriy Fedorov
fedorov at bwh.harvard.edu
Thu Aug 27 14:37:51 CDT 2009
On Tue, Aug 25, 2009 at 13:04, Michael Wilde<wilde at mcs.anl.gov> wrote:
> Andrey, good news: GRAM5 is now available on Abe as well. Info and contact
> URLs, as well as some Swift usage experience reports, are at:
>
> http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
>
> So with this in mind, a good approach is:
>
> - sanity test your app using the PBS provider on Abe, with swift on the
> login host, just 1 or 2 jobs
>
Michael,
I am actually having trouble with this sanity test.
I need to submit a file (about 5 MB) as an input to my application. What
seems to be happening is that the file gets corrupted in transmission!
I debugged this, and the corruption appears to be the reason my application fails.
The same application/swift script runs fine when I use plain gt2,
without coasters.
To debug, I echoed the directory in which Swift starts my application,
so I have the exact location of the file:
[fedorov at TG/Abe:honest4 SlicerReg] cat fileInfo.txt
lrwxrwxrwx 1 fedorov dkk 109 Aug 27 14:15 Data/MRMeningioma0.nrrd -> /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
The file has the same size, but the content is not identical! Here's
basically the story:
[fedorov at TG/Abe:honest4 SlicerReg] ls -la Data/
total 10960
drwxr-x--- 2 fedorov dkk 4096 Aug 27 14:22 .
drwxr-x--- 25 fedorov dkk 12288 Aug 27 14:27 ..
-rw-r----- 1 fedorov dkk 5069225 Aug 25 15:49 MRMeningioma0.nrrd
-rw-r----- 1 fedorov dkk 6132840 Aug 25 15:49 MRMeningioma1.nrrd
[fedorov at TG/Abe:honest4 SlicerReg] ls -la /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data
total 10952
drwxr-xr-x 2 fedorov dkk 4096 Aug 27 14:22 .
drwxr-xr-x 3 fedorov dkk 4096 Aug 27 14:10 ..
-rw-r--r-- 1 fedorov dkk 5069225 Aug 27 14:11 MRMeningioma0.nrrd
-rw-r--r-- 1 fedorov dkk 6132840 Aug 27 14:16 MRMeningioma1.nrrd
[fedorov at TG/Abe:honest4 SlicerReg] diff Data/MRMeningioma0.nrrd /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
Binary files Data/MRMeningioma0.nrrd and /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd differ
I can read my original file, but not the copied one:
[fedorov at TG/Abe:honest4 SlicerReg] ~/Slicer3-lib/teem-build/bin/unu minmax /u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd
unu minmax: trouble with "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd":
[unu minmax] unu minmax: trouble loading "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd"
[unu minmax] [nrrd] nrrdLoad: trouble reading "/u/ac/fedorov/scratch-global/scratch/RigidRegistration1-20090827-1411-ttnpb5d3/shared/Data/MRMeningioma0.nrrd"
[unu minmax] [nrrd] nrrdRead: trouble
[unu minmax] [nrrd] _nrrdRead: trouble reading NRRD file
[unu minmax] [nrrd] _nrrdFormatNRRD_read:
[unu minmax] [nrrd] _nrrdEncodingGzip_read: error reading from gzFile
[unu minmax] [nrrd] _nrrdGzRead: data read error
[fedorov at TG/Abe:honest4 SlicerReg] ~/Slicer3-lib/teem-build/bin/unu minmax Data/MRMeningioma0.nrrd
min: 0
max: 695
Have you guys run any applications with non-trivial input file sizes,
and verified that file integrity is preserved?
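
For anyone who wants to repeat this check without relying on unu, here is a minimal Python sketch that compares an original file with its staged copy. It is only an illustration, not part of Swift: the helper names sha256_of and first_difference are made up for this example. It computes a SHA-256 digest of each file and reports the byte offset of the first mismatch, which may hint at where in the transfer the damage occurs.

```python
import hashlib

CHUNK = 1 << 20  # read in 1 MiB chunks so large inputs are not loaded whole


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return the byte offset where the two files first differ, or None if identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a, block_b = a.read(CHUNK), b.read(CHUNK)
            if block_a != block_b:
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        return offset + i
                # One block is a prefix of the other: the shorter file ends here.
                return offset + min(len(block_a), len(block_b))
            if not block_a:
                # Both files reached end-of-file with no mismatch.
                return None
            offset += len(block_a)
```

Calling first_difference("Data/MRMeningioma0.nrrd", "<path to the copy under shared/Data>") and comparing the two sha256_of digests would confirm the mismatch that diff reports above and show how far into the file the copy stays intact.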
> - sanity test 16 to 64 or so jobs, adding parallel clustering to the above
>
> - change from the PBS provider to the GRAM2 (pre-WS-GRAM) provider, but
> using the GRAM URLs at http://dev.globus.org/wiki/GRAM/GRAM5#Deployments
> (still submitting from the Abe login host to Abe. You can keep the local
> data provider for this case)
>
> - Add Queenbee GRAM5 as a second site, using the gridftp data provider.
>
> Mike
>
>
> On 8/25/09 11:54 AM, Michael Wilde wrote:
>>
>> On 8/25/09 10:44 AM, Andriy Fedorov wrote:
>>>
>>> Michael,
>>>
>>> Thanks for the reply.
>>>
>>> So my understanding is, I should check out the trunk version and
>>> compile (yes, I've done this before), and try the real application
>>> with GT2+coasters.
>>
>> Yes, that's a good step to re-master, in preparation for Mihael checking in
>> Coaster fixes. He made significant enhancements to Coasters in the past 2
>> months, but has been working on a different project lately and thus these are
>> not yet sufficiently tested. If you're willing to help in the testing that
>> would be great.
>>
>> If not, I think the next best approach to try is this:
>>
>> - We have a small experimental mod that enables Swift GRAM2 jobs to use
>> all cores of multi-core hosts (such as the 8-core hosts on Abe and
>> QueenBee). Basically it uses the Swift clustering facility but runs jobs in
>> parallel instead of serially.
>>
>> It works well if your jobs have a very uniform runtime. If they don't, then
>> it wastes CPU. But it's a good interim solution for many apps until coasters
>> is more stable.
>>
>> This is described at:
>> http://www.ci.uchicago.edu/wiki/bin/view/SWFT/SwiftParallelClustering
>>
>> This info is very preliminary and not end-user ready. Tibi Stef-Praun, on
>> this list, has tried it. Please start a new thread here if you want to
>> discuss it or report experiences or problems with it.
>>
>> - On QueenBee or other GRAM5-enabled systems (not many yet, as it's in test
>> mode) you can use the GRAM2 provider if submitting remotely.
>> On Abe and any other GRAM2 systems you should run this with the Condor-G
>> provider if submitting remotely.
>>
>> The rule of thumb here for submitting jobs to a site from Swift running
>> remotely on a submit host is:
>>
>> -- up to 20 jobs in parallel you can use plain GRAM2
>> -- above 20 jobs, use Condor-G or, where available, GRAM5
>>
>> - On Abe, QueenBee, and other PBS systems with login hosts, you can run
>> Swift locally on the login host, and use the PBS provider with the parallel
>> clustering approach.
>>
>> We have a few other solutions that I will save till we explore these two
>> solutions.
>>
>> To prepare for this, try running your app on Abe using the PBS provider,
>> with just 1 or 2 jobs, then try the parallel clustering tip above.
>>
>>> I do have an account on Queen Bee. You say, it has GT GRAM5, but I
>>> thought you also said I should target using GT2. What is GRAM5?
>>
>> GRAM5 is a new, more efficient version of GRAM2. It's fully compatible, so
>> you just set Swift sites.xml exactly as for GRAM2. The only thing that
>> changes is that you use a different URL for the GRAM gatekeeper contact
>> string (i.e., a different host and/or port, that's all).
>>
>> I'll need to get you the contact string for GRAM5 on QueenBee if/when we
>> both agree the time is right to try it.
>>
>>> At
>>> this point, my preference is the system with lowest load and confirmed
>>> functional coaster provider, to save time debugging and getting up to
>>> speed. Should I use Abe or Queen Bee?
>>
>> That's hard to answer, as the loads fluctuate. You can examine the
>> TeraPort system load monitor in the TG portal, which gives some rough
>> estimates of load and queue time. Then queue the jobs and wait. Best to run
>> Swift under screen, so you can easily wait for and monitor your script
>> executions from anywhere, and not be interrupted if long delays are
>> encountered.
>>
>> - Mike
>>
>>> As soon as I compile the current swift trunk and try GT2+coaster @Abe
>>> for my application, I will report to the list my experience.
>>>
>>> --
>>> Andriy Fedorov, Ph.D.
>>>
>>> Research Fellow
>>> Brigham and Women's Hospital
>>> Harvard Medical School
>>> 75 Francis Street
>>> Boston, MA 02115 USA
>>> fedorov at bwh.harvard.edu
>>>
>>>
>>>
>>> On Tue, Aug 25, 2009 at 11:31, Michael Wilde<wilde at mcs.anl.gov> wrote:
>>>>
>>>> Andrey,
>>>>
>>>> On 8/25/09 9:49 AM, Andrey Fedorov wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a processing step that takes roughly 2-5 min. It takes as
>>>>> input two ~5 MB files, and produces a small text file, which I need to
>>>>> store. I need to compute a large number of such jobs, using different
>>>>> parameters. It seems to me "coaster" is the best execution provider
>>>>> for my application.
>>>>>
>>>>> Trying to start simple, I am running the first.swift (echo) example that
>>>>> comes with Swift using different providers: GT2, GT4, GT2/coaster, and
>>>>> GT4/coaster. All of this is done on the Abe NCSA cluster.
>>>>>
>>>>> Here's my sites.xml:
>>>>>
>>>>> <pool handle="Abe-GT4">
>>>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>> <execution provider="gt4" jobmanager="PBS"
>>>>> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>> </pool>
>>>>>
>>>>> <pool handle="Abe-GT4-coasters">
>>>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>> <execution provider="coaster" jobmanager="gt4:gt4:pbs"
>>>>> url="https://grid-abe.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"/>
>>>>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>> </pool>
>>>>>
>>>>> <pool handle="Abe-GT2">
>>>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>> <execution provider="gt2" jobmanager="PBS"
>>>>> url="grid-abe.ncsa.teragrid.org:2119/jobmanager-pbs"/>
>>>>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>> </pool>
>>>>>
>>>>> <pool handle="Abe-GT2-coasters">
>>>>> <gridftp url="gsiftp://gridftp-abe.ncsa.teragrid.org:2811/" />
>>>>> <execution provider="coaster" jobmanager="gt2:gt2:pbs"
>>>>> url="grid-abe.ncsa.teragrid.org"/>
>>>>> <filesystem provider="coaster" url="gt2://grid-abe.ncsa.teragrid.org"/>
>>>>> <workdirectory>/u/ac/fedorov/scratch-global/scratch</workdirectory>
>>>>> </pool>
>>>>>
>>>>> And tc.data is simply
>>>>>
>>>>> Abe-GT4-coasters echo /bin/echo INSTALLED INTEL32::LINUX null
>>>>>
>>>>> and I change the site to test different providers.
>>>>>
>>>>> Now, results:
>>>>>
>>>>> 1) both GT2 and GT4 providers work fine, script completes
>>>>>
>>>>> 2) with GT2+coaster provider, I can see the job in the PBS queue
>>>>> (requested time is 01:41, I guess this comes with the default coaster
>>>>> parameters, that I didn't change). The job appears to finish
>>>>> successfully, but then I get this error:
>>>>>
>>>>> Final status: Finished successfully:1
>>>>> START cleanups=[[first-20090825-0925-emkt2qt0, Abe-GT2-coasters]]
>>>>> START dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>>>> Sending Command(21, SUBMITJOB) on GSSSChannel-null(1)
>>>>> Command(21, SUBMITJOB) CMD: Command(21, SUBMITJOB)
>>>>> GSSSChannel-null(1) REPL: Command(21, SUBMITJOB)
>>>>> Submitted task Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871). Job id: urn:1251210343871-1251210376098-1251210376099
>>>>> Unregistering Command(21, SUBMITJOB)
>>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>>>> GSSSChannel-null(1) REQ: Handler(JOBSTATUS)
>>>>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1251210343871) Completed.
>>>>> Waiting: 0, Running: 0. Heap size: 65M, Heap free: 42M, Max heap: 227M
>>>>> END dir=first-20090825-0925-emkt2qt0 host=Abe-GT2-coasters
>>>>> Cleaning up...
>>>>> Shutting down service at https://141.142.68.180:45552
>>>>> Got channel MetaChannel: 500265006 -> GSSSChannel-null(1)
>>>>> Sending Command(22, SHUTDOWNSERVICE) on GSSSChannel-null(1)
>>>>> Command(22, SHUTDOWNSERVICE) CMD: Command(22, SHUTDOWNSERVICE)
>>>>> Command(22, SHUTDOWNSERVICE): handling reply timeout
>>>>> Command(22, SHUTDOWNSERVICE): failed too many times
>>>>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>>>>> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:241)
>>>>> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:246)
>>>>> at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>> at java.util.TimerThread.run(Timer.java:462)
>>>>> - Done
>>>>
>>>> This seems like a low-priority error. I'll file it in bugzilla for now.
>>>> Let's see how coasters works for you on Abe using your real app and a
>>>> larger number of jobs, and come back to this shutdown problem if it
>>>> proves to be a blocker to getting work done.
>>>>
>>>> Coasters has a few other current issues - mainly not throttling work
>>>> efficiently - that we have a fix for, and need to apply and test that
>>>> one first.
>>>>
>>>> We've also been experimenting with a non-coaster way to use all 8 cores
>>>> of machines like Abe, but let's try the coaster route first, if that's OK
>>>> with you, and let's focus on GT2/Coasters, as that will be more common.
>>>>
>>>> In addition, there is a test version of GT GRAM5 on QueenBee, Abe's
>>>> sister-system at LSU, which we can try, assuming your TG project lets
>>>> you run there.
>>>>
>>>> So please try to run the app, and we will try to get the latest coaster
>>>> fixes committed. (I assume you are comfortable extracting Swift from svn
>>>> and building it; if you have not done this before, can you try it, Andrey?)
>>>>
>>>> Regards,
>>>>
>>>> Mike
>>>>
>>>>
>>>>> 3) with GT4-coaster provider, I don't get as far as with GT2-coaster.
>>>>> Possibly I am not setting up the site entry properly. I was not able
>>>>> to find any examples in the manual of how to set up coasters with GT4 (can
>>>>> anyone provide an example?). Here's the error:
>>>>>
>>>>> Failed to transfer wrapper log from first-20090825-0929-39x94x09/info/t on Abe-GT4-coasters
>>>>> END_FAILURE thread=0 tr=echo
>>>>> Progress: Failed:1
>>>>> Execution failed:
>>>>> Exception in echo:
>>>>> Arguments: [Hello, world!]
>>>>> Host: Abe-GT4-coasters
>>>>> Directory: first-20090825-0929-39x94x09/jobs/t/echo-t5oymmfj
>>>>> stderr.txt:
>>>>>
>>>>> stdout.txt:
>>>>>
>>>>> ----
>>>>>
>>>>> Caused by:
>>>>> Cannot submit job: Limited proxy is not accepted
>>>>>
>>>>>
>>>>> Can anybody help me figure this out?
>>>>>
>>>>> Thanks
>>>>> --
>>>>> Andriy Fedorov, Ph.D.
>>>>>
>>>>> Research Fellow
>>>>> Brigham and Women's Hospital
>>>>> Harvard Medical School
>>>>> 75 Francis Street
>>>>> Boston, MA 02115 USA
>>>>> fedorov at bwh.harvard.edu
>>>>> _______________________________________________
>>>>> Swift-user mailing list
>>>>> Swift-user at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
>