[Swift-devel] Problems with coaster data provider

wilde at mcs.anl.gov wilde at mcs.anl.gov
Tue Jul 20 10:20:29 CDT 2010


I tried the coaster data provider from MCS host vanquish to crush (two of the compute servers) via ssh:local and get the error:

"org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH" (Full error text below).
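
My guess at the failure mode, as a toy sketch (this is NOT the actual coaster wire protocol, just an illustration): the wrapper script's bytes seem to be landing on the command channel, where a line-oriented reader treats the shebang line as an unknown command. Why the text arrives upper-cased is a separate question.

```shell
# Hypothetical sketch (NOT the real coaster protocol): a line-oriented
# command reader fed raw script bytes treats the shebang line as a command.
msg=$(printf '#!/bin/bash\necho hello\n' | { read -r cmd _; echo "Unknown command: $cmd"; })
echo "$msg"   # -> Unknown command: #!/bin/bash
```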

Has anyone else tried the coaster data provider?

My sites file has the single pool:

  <pool handle="crush">
    <execution provider="coaster" url="crush.mcs.anl.gov" jobmanager="ssh:local"/>
    <profile namespace="globus" key="workersPerNode">8</profile>
    <profile namespace="globus" key="maxTime">3500</profile>
    <profile namespace="globus" key="slots">1</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">1</profile>

    <profile namespace="karajan" key="jobThrottle">.07</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>

    <filesystem provider="coaster" url="ssh://crush.mcs.anl.gov" />
    <workdirectory>/home/wilde/swiftwork/crush</workdirectory>
  </pool>

Is that the correct url= value?

I set these properties:

wrapperlog.always.transfer=false
sitedir.keep=true
execution.retries=0
status.mode=provider
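
For completeness, a sketch that recreates the cf file passed with -config below (property names and values exactly as above):

```shell
# Sketch: write the debugging properties above into the "cf" file that
# the run command passes with -config.
cat > cf <<'EOF'
wrapperlog.always.transfer=false
sitedir.keep=true
execution.retries=0
status.mode=provider
EOF
```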

The run command, svn version, and full error text on stdout/err are:

vanquish$ swift -tc.file tc -sites.file crushds.xml -config cf catsn.swift -n=1
Swift svn swift-r3449 cog-r2816

RunID: 20100720-1006-z1vio8i1
Progress:
Progress:  Failed:1
Execution failed:
	Could not initialize shared directory on crush
Caused by:
	org.globus.cog.abstraction.impl.file.FileResourceException: org.globus.cog.karajan.workflow.service.ProtocolException: Unknown command: #!/BIN/BASH
# THIS SCRIPT MUST BE INVOKED INSIDE OF BASH, NOT PLAIN SH
# NOTE THAT THIS SCRIPT MODIFIES $IFS

INFOSECTION() {

...full text of _swiftwrap shows up here, in upper case...

# ENSURE WE EXIT WITH A 0 AFTER A SUCCESSFUL EXECUTION
EXIT 0

# LOCAL VARIABLES: 
# MODE: SH
# SH-BASIC-OFFSET: 8
# END:

Cleaning up...
Shutting down service at https://140.221.8.62:59300
Got channel MetaChannel: 2039421489[205498061: {}] -> GSSSChannel-0494700354(1)[205498061: {}]
+ Done
vanquish$

The _swiftwrap file was created in the work directory's shared/ subdirectory but has length zero. So presumably the file was created for the transfer, but the transfer itself failed before any data was written.

vanquish$ ls -lR /home/wilde/swiftwork/crush/*8i1
/home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1:
total 2
drwxr-sr-x 2 wilde mcsz 3 Jul 20 10:06 shared/

/home/wilde/swiftwork/crush/catsn-20100720-1006-z1vio8i1/shared:
total 1
-rw-r--r-- 1 wilde mcsz 0 Jul 20 10:06 _swiftwrap
vanquish$ 
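
The same check can be scripted; a sketch against a throwaway directory (the paths are made up to mimic the layout above):

```shell
# Sketch: a generic check for failed coaster stage-ins, i.e. files that
# were created under the site work directory but never received data.
# A throwaway directory stands in for /home/wilde/swiftwork/crush here.
workdir=$(mktemp -d)
mkdir -p "$workdir/catsn-run/shared"
: > "$workdir/catsn-run/shared/_swiftwrap"          # zero-length, as in the run above
echo data > "$workdir/catsn-run/shared/data.txt"    # a successfully staged file, for contrast

# Any hit here is a file whose transfer never completed
find "$workdir" -type f -size 0
```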

----- "Mihael Hategan" <hategan at mcs.anl.gov> wrote:

> Most of the problems that were obvious with coaster file staging
> should be fixed now. I ran a few tests for 1024 cat jobs on TP with
> ssh:pbs with 2-8 workers/node (such that "concurrent" workers are
> tested) and it consistently seemed fine.
> 
> I also quickly made a fake provider and I am getting a rate of about
> 100 j/s. So that seems not to infirm my previous suspicion.
> 
> On Mon, 2010-07-12 at 11:52 -0500, Michael Wilde wrote:
> > Here's my view on these:
> > 
> > > 2. test/fix coaster file staging
> > 
> > This would be useful for both real apps and (I think) for CDM
> > testing. I would do this first.
> > 
> > I would then add:
> > 
> > 5. Adjustments needed, if any, on multicore handling in the PBS and
> > SGE providers.
> > 
> > 6. Adjustments and fixes for reliability and logging, if needed, in
> > the Condor-G provider.
> > 
> > I expect that 5 & 6 would be small tasks, and they are not yet
> > clearly defined. I think that other people could do them.
> > 
> > Maybe add:
> > 
> > 7. -tui fixes. Seems not to be working so well on recent tests;
> > several of the screens, including the source-code view, seem not to
> > be working.
> > 
> > Then:
> > 
> > > 1. make swift core faster
> > 
> > I would do this second; I think you said you need about 7-10 days
> > to try things and see what can be done, maybe more after that if
> > the exploration suggests things that will take much (re)coding?
> > 
> > > 3. standalone coaster service
> > 
> > The current manual coasters setup is proving useful.
> > 
> > > 4. swift shell
> > 
> > Let's defer (4) for now; if we can instead run swift repeatedly and
> > either have the coaster worker pool re-connect quickly to each new
> > swift, or quickly start new pools within the same cluster job(s),
> > that would suffice for now.
> > 
> > Justin, do you want to weigh in on these?
> > 
> > Thanks,
> > 
> > Mike
> > 
> > 
> > > The idea is that some recent changes may have shifted the
> > > existing priorities. So think of this from the perspective of
> > > user/application/publication goals rather than what you think
> > > would be "nice to have".
> > > 
> > > Mihael
> > > 
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



