[Swift-devel] Re: Manual start script for persistent coasters on Cobalt and other schedulers

Michael Wilde wilde at mcs.anl.gov
Wed Jan 12 08:54:48 CST 2011


David, let's do a Skype call in a few hours to discuss.

I *think* this command should "just work" to a large extent if you make sure that the helper script is accessible and the "R"-specific stuff is commented out.

I last tested it on SGE, but it has also worked on PADS/PBS.
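
For the Cobalt case, the workaround from the ALCF reply quoted below (set environment variables inside the script you submit, since Cobalt on clusters ignores job script arguments) could be sketched roughly as follows. This is only an illustration under stated assumptions, not what start-swift does: the function name make_cobalt_wrapper, the command name my-worker.sh, its arguments, and EXAMPLE_VAR are all hypothetical placeholders.

```python
import shlex


def make_cobalt_wrapper(command, args, env=None):
    """Return the text of a self-contained job script.

    Assumption: because arguments given after the job script are ignored,
    the arguments and environment are written into the script itself
    instead of being passed on the submit command line.
    """
    lines = ["#!/bin/bash"]
    # Export each variable inside the script, as the ALCF reply suggests.
    for name, value in sorted((env or {}).items()):
        lines.append("export {}={}".format(name, shlex.quote(str(value))))
    # Bake the shell-quoted command and arguments into the exec line.
    quoted = " ".join(shlex.quote(str(a)) for a in [command] + list(args))
    lines.append("exec " + quoted)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # my-worker.sh, its arguments, and EXAMPLE_VAR are placeholders only.
    print(make_cobalt_wrapper(
        "my-worker.sh",
        ["http://localhost:1234", "block 0"],
        env={"EXAMPLE_VAR": "value"},
    ))
```

The generated text would be saved to a file and submitted as the job script with no trailing arguments, so nothing depends on the scheduler passing them through.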

- Mike


----- Original Message -----
> Mike,
> 
> I will give it a try. Would the configuration for this be similar to
> the persistent passive coaster configuration used on the MCS machines?
> 
> For example:
> <execution provider="coaster-persistent" url="churn.mcs.anl.gov"
> jobmanager="local:local"/>
> <profile namespace="globus" key="workerManager">passive</profile>
> 
> With each of the 4 worker nodes having its own entry? Do you happen
> to know the names of the workers for Gadzooks?
> 
> Thanks,
> David
> 
> On Tue, Jan 11, 2011 at 7:08 PM, Michael Wilde <wilde at mcs.anl.gov>
> wrote:
> > (was: Re: [Swift-devel] Re: [alcf-support #60887] Can Cobalt
> > command-line bug on Eureka be fixed?)
> >
> > David, the evolving Swift R package has a start-swift command in
> > this directory:
> >
> >  https://svn.ci.uchicago.edu/svn/vdl2/SwiftApps/SwiftR/Swift/exec
> >
> > which has the logic needed to start a manual persistent passive
> > coaster pool on both clusters and workstations.
> >
> > You'll need to pick up the files that start-swift sources from that
> > same directory, and remove the final stage of the script where it
> > actually launches Swift (that part is just for the Swift R service).
> >
> > You'll want to keep the part where it launches the Swift script
> > "passivate.swift" to force the persistent service into passive mode.
> >
> > I think that with some cleanup and much testing, this script could
> > be adapted to launch all kinds of manual coaster configurations.
> >
> > Justin has expressed the view that perhaps this whole process cannot
> > be scripted cleanly, and that we should instead provide tools for
> > the user to do this manually.
> >
> > I would like to try, though, to see if this script can be made clean
> > and reliable, and then we could place it in Swift and factor it out
> > of SwiftR.
> >
> > I'm willing to help you get this set up and tested.
> >
> > - Mike
> >
> > ----- Original Message -----
> >> One workaround we can try here, which may be more valuable than a
> >> temp fix, would be to make a more user-ready script to launch manual
> >> coasters (persistent/passive) on any cluster.
> >>
> >> We have several such scripts floating around; probably Sheri could
> >> use one if it were only slightly polished.
> >>
> >> That would be a good project for you, David.
> >>
> >> Such a script would be useful on any cluster, and would need only
> >> slight flexibility to specify the batch jobs for various PBS, SGE,
> >> Cobalt, and Slurm systems.
> >>
> >> It has all the drawbacks of manual coasters (which some folks like)
> >> and is a usage mode we want to support.
> >>
> >> Justin, you noted yesterday that it's hard to make such a script
> >> general. Maybe if we split the script into 2 variants (one for
> >> clusters, and one for sets of workstations) that would make the
> >> resultant scripts more maintainable and testable?
> >>
> >> - Mike
> >>
> >>
> >> ----- Original Message -----
> >> > Thanks, Rich and Andrew, for the very fast responses.
> >> >
> >> > We'll try the work-around, then.
> >> >
> >> > Regards,
> >> >
> >> > - Mike
> >> >
> >> >
> >> > ----- Original Message -----
> >> > > Michael,
> >> > >
> >> > > Unfortunately a fix for this will, at this point in time, take a
> >> > > minimum of four weeks to deploy to a production resource like
> >> > > Eureka, due to our testing, upgrade and maintenance procedures.
> >> > >
> >> > > As a workaround for this on Eureka, since every job effectively
> >> > > runs in script mode, you should be able to set environment
> >> > > variables within the script that you submit to Cobalt.
> >> > >
> >> > > We apologize for the inconvenience. Let us know if you have any
> >> > > other questions.
> >> > >
> >> > > --
> >> > > Paul Rich
> >> > > ALCF Operations -- AIG
> >> > > richp at alcf.anl.gov
> >> > >
> >> > >
> >> > > On 1/11/11 4:48 PM, Michael Wilde wrote:
> >> > > > User info for wilde at mcs.anl.gov
> >> > > > =================================
> >> > > > Username: wilde
> >> > > > Full Name: Michael Wilde
> >> > > > Projects:
> >> > > > HTCScienceApps,JGI-Pilot,MTCScienceApps,OOPS,PTMAP,pilot-wilde
> >> > > >              ('*' denotes INCITE projects)
> >> > > > =================================
> >> > > >
> >> > > >
> >> > > > Hi ALCF Team,
> >> > > >
> >> > > > The following known issue in Cobalt is currently preventing us
> >> > > > from running Swift on Eureka:
> >> > > >
> >> > > >    http://trac.mcs.anl.gov/projects/cobalt/ticket/462
> >> > > >
> >> > > > With some additional development effort we can work around this,
> >> > > > but it would be much cleaner and better if this were fixed in
> >> > > > Cobalt, instead, as suggested in ticket 462 above.
> >> > > >
> >> > > > Is there any chance that can be done in the next few days?
> >> > > > If not, please let me know, and we will implement the work-around
> >> > > > instead.
> >> > > >
> >> > > > This is holding up work on the DOE ParVis project (Rob Jacob, PI)
> >> > > > and we've had to move some work we want to run on Eureka to other
> >> > > > platforms in the meantime.
> >> > > >
> >> > > > Thanks very much,
> >> > > >
> >> > > > Mike
> >> > > >
> >> > > > 462 is:
> >> > > >
> >> > > > Ticket #462 (new defect)
> >> > > > Opened 7 months ago
> >> > > > Cobalt on clusters ignores job script arguments
> >> > > >
> >> > > > Reported by: acherry
> >> > > > Priority: major
> >> > > > Component: clients
> >> > > >
> >> > > > Description
> >> > > >
> >> > > > It appears that cobalt-launcher.py does not support running a job
> >> > > > script or executable with command arguments, even though qsub
> >> > > > will accept the arguments, and the man page and help for qsub
> >> > > > indicate that arguments are accepted.
> >> > > >
> >> > > > I'm filing this as a bug rather than a feature request, since the
> >> > > > behavior isn't consistent with the documentation. But I'd rather
> >> > > > the fix be to add support for args than to change the docs to
> >> > > > say they aren't accepted. :-)
> >> > > >
> >> > > >
> >> >
> >> > --
> >> > Michael Wilde
> >> > Computation Institute, University of Chicago
> >> > Mathematics and Computer Science Division
> >> > Argonne National Laboratory
> >> >
> >> > _______________________________________________
> >> > Swift-devel mailing list
> >> > Swift-devel at ci.uchicago.edu
> >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>
> >
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



