[Swift-devel] Problems with coaster data provider
wilde at mcs.anl.gov
Tue Jul 20 10:20:29 CDT 2010
I tried the coaster data provider from MCS host vanquish to crush (two of the compute servers) via ssh:local and got the error:
"org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH" (Full error text below).
Has anyone else tried the coaster data provider?
My sites file has the single pool:
<pool handle="crush">
<execution provider="coaster" url="crush.mcs.anl.gov" jobmanager="ssh:local"/>
<profile namespace="globus" key="workersPerNode">8</profile>
<profile namespace="globus" key="maxTime">3500</profile>
<profile namespace="globus" key="slots">1</profile>
<profile namespace="globus" key="nodeGranularity">1</profile>
<profile namespace="globus" key="maxNodes">1</profile>
<profile key="jobThrottle" namespace="karajan">.07</profile>
<profile namespace="karajan" key="initialScore">10000</profile>
<filesystem provider="coaster" url="ssh://crush.mcs.anl.gov" />
<workdirectory>/home/wilde/swiftwork/crush</workdirectory>
</pool>
Is that the correct url= value?
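To make the question concrete, here is a minimal Python sketch (not part of Swift; the helper name provider_urls is made up for illustration) that pulls the provider and url attributes out of the pool definition above so the two entries can be compared side by side. The profile lines are left out for brevity. It only displays the values and does not say which form is correct.

```python
import xml.etree.ElementTree as ET

# Pool definition copied from the sites file above (profile lines omitted).
SITES_XML = """
<pool handle="crush">
  <execution provider="coaster" url="crush.mcs.anl.gov" jobmanager="ssh:local"/>
  <filesystem provider="coaster" url="ssh://crush.mcs.anl.gov" />
  <workdirectory>/home/wilde/swiftwork/crush</workdirectory>
</pool>
"""

def provider_urls(pool_xml):
    """Return {tag: (provider, url)} for the execution and filesystem entries.

    Hypothetical helper for illustration only; Swift itself does not ship it.
    """
    pool = ET.fromstring(pool_xml.strip())
    return {
        tag: (elem.get("provider"), elem.get("url"))
        for tag in ("execution", "filesystem")
        for elem in pool.findall(tag)
    }

urls = provider_urls(SITES_XML)
for tag, (provider, url) in urls.items():
    print(f"{tag:<10} provider={provider} url={url}")
```

The execution entry uses a bare host name while the filesystem entry carries an ssh:// scheme, which is the difference the question is about.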
I set these properties:
wrapperlog.always.transfer=false
sitedir.keep=true
execution.retries=0
status.mode=provider
The run command, svn version, and full error text on stdout/err are:
vanquish$ swift -tc.file tc -sites.file crushds.xml -config cf catsn.swift -n=1
Swift svn swift-r3449 cog-r2816
RunID: 20100720-1006-z1vio8i1
Progress:
Progress: Failed:1
Execution failed:
Could not initialize shared directory on crush
Caused by:
org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH
# THIS SCRIPT MUST BE INVOKED INSIDE OF BASH, NOT PLAIN SH
# NOTE THAT THIS SCRIPT MODIFIES $IFS
INFOSECTION() {
...full text of _swiftwrap shows up here, in upper case...
# ENSURE WE EXIT WITH A 0 AFTER A SUCCESSFUL EXECUTION
EXIT 0
# LOCAL VARIABLES:
# MODE: SH
# SH-BASIC-OFFSET: 8
# END:
Cleaning up...
Shutting down service at https://140.221.8.62:59300
Got channel MetaChannel: 2039421489[205498061: {}] -> GSSSChannel-0494700354(1)[205498061: {}]
+ Done
vanquish$
The _swiftwrap file was created in the shared/ subdirectory of the workdirectory but has length zero. So presumably the file was created for the transfer, but the transfer itself failed.
vanquish$ ls -lR /home/wilde/swiftwork/crush/*8i1
/home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1:
total 2
drwxr-sr-x 2 wilde mcsz 3 Jul 20 10:06 shared/
/home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1/shared:
total 1
-rw-r--r-- 1 wilde mcsz 0 Jul 20 10:06 _swiftwrap
vanquish$
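For anyone who wants to check a run directory for the same symptom, the following is a minimal Python sketch that lists zero-length files under a directory tree. The function name zero_length_files is an assumption made for illustration; it is not part of Swift or CoG.

```python
import os

def zero_length_files(root):
    """Walk root and return the sorted paths of regular files of size 0."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # isfile() skips broken symlinks, so getsize() cannot fail on them
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return empty if not empty else sorted(empty)
```

Calling zero_length_files("/home/wilde/swiftwork/crush") against the run above should include the shared/_swiftwrap file shown in the listing.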
----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:
> Most of the problems that were obvious with coaster file staging should
> be fixed now. I ran a few tests for 1024 cat jobs on TP with ssh:pbs
> with 2-8 workers/node (such that "concurrent" workers are tested) and it
> consistently seemed fine.
>
> I also quickly made a fake provider and I am getting a rate of about 100
> j/s. So that seems not to infirm my previous suspicion.
>
> On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote:
> > Here's my view on these:
> >
> > > 2. test/fix coaster file staging
> >
> > This would be useful for both real apps and (I think) for CDM
> > testing. I would do this first.
> >
> > I would then add:
> >
> > 5. Adjustments needed, if any, on multicore handling in PBS and SGE
> > provider.
> >
> > 6. Adjustments and fixes for reliability and logging, if needed, in
> > Condor-G provider.
> >
> > I expect that 5 & 6 would be small tasks, and they are not yet
> > clearly defined. I think that other people could do them.
> >
> > Maybe add:
> >
> > 7. -tui fixes. Seems not to be working so well on recent tests;
> > several of the screens, including the source-code view, seem not to be
> > working.
> >
> > Then:
> >
> > > 1. make swift core faster
> >
> > I would do this second; I think you said you need about 7-10 days to
> > try things and see what can be done, maybe more after that if the
> > exploration suggests things that will take much (re)coding?
> >
> > > 3. standalone coaster service
> >
> > The current manual coasters is proving useful.
> > > 4. swift shell
> >
> > Let's defer (4) for now; if we can instead run swift repeatedly and
> > either have the coaster worker pool re-connect quickly to each new
> > swift, or quickly start new pools within the same cluster job(s), that
> > would suffice for now.
> >
> > Justin, do you want to weigh in on these?
> >
> > Thanks,
> >
> > Mike
> >
> >
> > > The idea is that some recent changes may have shifted the existing
> > > priorities. So think of this from the perspective of
> > > user/application/publication goals rather than what you think would be
> > > "nice to have".
> > >
> > > Mihael
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory