[Swift-devel] auto-coaster bootstrap for stampede cluster

Ketan Maheshwari ketancmaheshwari at gmail.com
Sun Apr 28 00:04:10 CDT 2013


I tested this successfully from communicado into Stampede.
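
For anyone repeating the test, here is a rough sketch of the environment setup
that David's message (quoted below) asks for. The port range and the hostname
lookup are assumptions, not values from this thread, and setup_globus_env is
only an illustrative helper name; adjust both values to your own network and
firewall setup.

```shell
#!/bin/sh
# Sketch: environment for a GRAM/coaster run from a submit host.
# Assumptions: the default port range below and the hostname lookup.
setup_globus_env() {
    # Host name that remote services should use to call back to this machine.
    GLOBUS_HOSTNAME="${1:-$(hostname -f 2>/dev/null || hostname 2>/dev/null || echo localhost)}"
    # Range of open TCP ports, written as "min,max".
    GLOBUS_TCP_PORT_RANGE="${2:-50000,51000}"
    export GLOBUS_HOSTNAME GLOBUS_TCP_PORT_RANGE
}

setup_globus_env
echo "GLOBUS_HOSTNAME=$GLOBUS_HOSTNAME"
echo "GLOBUS_TCP_PORT_RANGE=$GLOBUS_TCP_PORT_RANGE"

# Then obtain a proxy as in David's message (prompts for a password):
#   myproxy-logon -l username -s myproxy.teragrid.org
```

Passing explicit arguments, e.g. setup_globus_env myhost.example.org 40000,40100,
overrides both defaults.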

One tricky issue I ran into: ibrun on Stampede checks ~/.ssh for the keys that
Stampede generated at the time of first login, and I had replaced them with my
own keypair that I use on other machines.

This caused the jobs to fail silently, with no explicit error message in
Swift's stderr or in the GRAM log.

I resolved the issue after digging into the Stampede manual and confirming it
against a similarly buried post on the XSEDE forum.
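
A quick way to spot this kind of problem might be to check whether the public
key currently in ~/.ssh is still listed in authorized_keys. The sketch below
assumes that this is what matters for ibrun's node-to-node ssh (an assumption,
not something confirmed in this thread) and that the key is named id_rsa;
key_is_authorized is a hypothetical helper, not part of Swift or Stampede.

```shell
#!/bin/sh
# Hypothetical helper: report whether a public key is listed in authorized_keys.
# Usage: key_is_authorized <public-key-file> <authorized_keys-file>
# Returns 0 if listed, 1 if not listed, 2 if a file is unreadable or malformed.
key_is_authorized() {
    pub="$1"
    auth="$2"
    [ -r "$pub" ] && [ -r "$auth" ] || return 2
    # Field 2 of an OpenSSH public key line is the base64 key material.
    material=$(awk 'NF >= 2 { print $2; exit }' "$pub")
    [ -n "$material" ] || return 2
    grep -qF -- "$material" "$auth"
}

# Example check against the default locations (assumed: an RSA key named id_rsa).
if key_is_authorized "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"; then
    echo "id_rsa.pub is present in authorized_keys"
else
    echo "id_rsa.pub is missing from authorized_keys (or files unreadable)"
fi
```

This only compares key material in two files; it does not reproduce whatever
check ibrun itself performs.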

Thanks,
Ketan


On Thu, Apr 25, 2013 at 9:25 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:

> David, this sounds great - nice work.
>
> Can you test with multiple, mixed sites and providers, and with gridftp
> staging? E.g., try Stampede+trestles(+midway+beagle+kraken).
>
> Also gt2:slurm:slurm might work well.
>
> Please add this all to the site guide (ideally with a diagram).
>
> Mihael, how hard would it be to make ssh-cl:slurm:slurm work? I.e.,
> start the coaster service on the remote site as a slurm job instead
> of on the login host, which is the objective of this configuration.
>
> Very cool.
>
> - Mike
>
> On 4/24/13, David Kelly <davidk at ci.uchicago.edu> wrote:
> > Ketan,
> >
> >
> > I have gram working to Stampede now. Given the restrictions about running
> > swift on the head nodes, I think this is the way to go. I'll add this info
> > to the site guide, but for now here is a quick overview of what's needed.
> >
> >
> > Get a proxy: myproxy-logon -l username -s myproxy.teragrid.org
> >
> >
> > Make sure you have GLOBUS_HOSTNAME and GLOBUS_TCP_PORT_RANGE defined
> > correctly.
> >
> >
> > Use something like this for your sites.xml (with work directory, project,
> > and throttle adjusted as needed)
> > ---
> >
> >
> > <config>
> > <pool handle="stampede">
> > <execution provider="coaster" jobmanager="gt2:gt2:slurm"
> > url="login5.stampede.tacc.utexas.edu:2119/jobmanager-slurm"/>
> > <filesystem provider="gsiftp"
> > url="gsiftp://gridftp.stampede.tacc.utexas.edu:2811"/>
> > <profile namespace="globus" key="jobsPerNode">16</profile>
> > <profile namespace="globus" key="ppn">16</profile>
> > <profile namespace="globus" key="maxTime">3600</profile>
> > <profile namespace="globus" key="maxwalltime">00:05:00</profile>
> > <profile namespace="globus" key="lowOverallocation">100</profile>
> > <profile namespace="globus" key="highOverallocation">100</profile>
> > <profile namespace="globus" key="queue">normal</profile>
> > <profile namespace="globus" key="nodeGranularity">1</profile>
> > <profile namespace="globus" key="maxNodes">1</profile>
> > <profile namespace="globus" key="project">TG-EAR130015</profile>
> > <profile namespace="karajan" key="jobThrottle">.3199</profile>
> > <profile namespace="karajan" key="initialScore">10000</profile>
> > <workdirectory>/scratch/01503/davidkel</workdirectory>
> > </pool>
> > </config>
> > ---
> >
> >
> > You'll also need the latest version of Swift from SVN. Swift was setting
> > some invalid gram RSL attributes that were causing jobs to fail. I added a
> > check to verify only valid attributes get set now. I've tested this with a
> > simple swift script that calls /bin/hostname and it ran across multiple
> > Stampede nodes. I haven't tested it with any larger applications yet -
> > please let me know if you run into any problems with it.
> >
> >
> > Thanks,
> > David
> > ----- Original Message -----
> >
> >
> > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > To: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Wednesday, April 17, 2013 3:51:31 PM
> > Subject: [Swift-devel] auto-coaster bootstrap for stampede cluster
> >
> >
> > I'm moving this topic to swift-devel, so others, in particular Mihael, can
> > weigh in.
> >
> > - Mike
> >
> > ----- Forwarded Message -----
> > From: "David Kelly" <davidk at ci.uchicago.edu>
> > To: "Ketan Maheshwari" <ketan at mcs.anl.gov>
> > Cc: "Wilde" <wilde at mcs.anl.gov>
> > Sent: Wednesday, April 17, 2013 3:45:30 PM
> > Subject: Fwd: auto-coaster bootstrap for stamped
> >
> > Hey Ketan,
> >
> > Mike mentioned that you were interested in running remotely to Stampede via
> > ssh-cl. Normally we could use ssh-cl like any other site, but the problem
> > we ran into here is that we can't run Swift on the stampede head node. We
> > need to ssh-cl AND also start swift on a remote worker node, which is a
> > setup that hasn't been tested very much.
> >
> > I believe you've used start-coaster-service before when we were running on
> > ec2. You can use this configuration for Stampede too. Modify
> > coaster-service.conf to set WORKER_NODE=slurm,
> > WORKER_RELAY_HOST=stampede.tacc.utexas.edu, and it will generate a slurm
> > script, scp it to stampede, and remotely start swift on a worker node. I'll
> > see if I can find an example config file for this.
> >
> > With automatic coasters it's a bit more complicated and completely untested
> > as far as I know.
> >
> > You may be able to use gram2. This worked on Ranger, but I haven't tried it
> > yet on Stampede.
> > Mike mentioned in the email below you may be able to change the ssh-cl
> > provider to add some kind of prefix command (srun).
> > Maybe you can modify your PATH so the 'ssh' command is actually a wrapper
> > you created and does something sneaky.
> > You may also be able to add a prefix command to
> > cog/modules/provider-coaster/resources/bootstrap.sh.
> >
> > Hopefully this can help you get started - let me know if any of this works
> > for you, curious to see how we can get it working well.
> >
> > David
> >
> > ----- Forwarded Message -----
> >
> >
> > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > To: "David Kelly" <davidk at ci.uchicago.edu>
> > Sent: Tuesday, April 16, 2013 10:59:22 AM
> > Subject: auto-coaster bootstrap for stamped
> >
> >
> > was: Re: Another item for the to-do list
> >
> > David, thanks for the details.
> >
> > I'm wondering, for systems like stampede, could automatic coasters work
> > to it (e.g. from swift.rcc) by adding a sinteractive or srun command into
> > the middle of the ssh command generated by the ssh-cl parameter?
> >
> > i.e. instead of doing: ssh -sshargshere auto-bootstrap-coaster-stuff-here.sh
> > do: ssh -sshargshere srun auto-bootstrap-coaster-stuff-here.sh
> >
> > ?
> >
> >> This is the only mode that I've been able to test on Stampede so far.
> >> I will experiment more the others when Stampede is back up.
> >
> > Others meaning GRAM? Perhaps using myproxy-logon? That *should* work out of
> > the box but we've not tested GRAM in ages so it probably doesn't.
> >
> > Let's keep this lower on the priority list. I just want to be sure we have
> > a ticket for this. Please create one if not - thanks.
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> >
> >
> >
>
> --
> Sent from my mobile device
>



-- 
Ketan