[Swift-devel] ssh:pbs to beagle

Michael Wilde wilde at mcs.anl.gov
Tue May 17 09:42:09 CDT 2011



----- Original Message -----
> With lots of help from Mike, yesterday we successfully submitted
> Swift jobs from the Bridled machine to Beagle via ssh:pbs.
> 
> Following are the notes from the exercise:
> 
> 1. Used automatic coasters with security, since there is no way to
> specify -nosec. This implies:
> a. Make sure the proxy is valid on both ends (bridled and beagle),
> using grid-proxy-init.
> b. Make sure the CA certs are present on both ends:
> X509_CERT_DIR=/home/ketan/TRUSTEDCA, X509_CADIR=/home/ketan/TRUSTEDCA
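
A pre-flight check along the following lines could catch a missing proxy or CA directory before Swift does. It is only a sketch: it assumes the usual Globus convention that the proxy lives at /tmp/x509up_u<uid> unless X509_USER_PROXY points elsewhere, and the function names are made up for illustration. It checks only that the files exist, not that the proxy is still valid (grid-proxy-info is the tool for that).

```python
import os

def proxy_path(env, uid):
    """Return where the Globus proxy file is expected (assumed default location)."""
    return env.get("X509_USER_PROXY") or "/tmp/x509up_u%d" % uid

def preflight(env, uid):
    """Return a list of readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    proxy = proxy_path(env, uid)
    if not os.path.isfile(proxy):
        problems.append("proxy file %s not found (run grid-proxy-init)" % proxy)
    cert_dir = env.get("X509_CERT_DIR")
    if not cert_dir:
        problems.append("X509_CERT_DIR is not set")
    elif not os.path.isdir(cert_dir):
        problems.append("X509_CERT_DIR=%s is not a directory" % cert_dir)
    return problems

if __name__ == "__main__":
    # Run this on both ends (the submit host and the Beagle login node).
    for problem in preflight(os.environ, os.getuid()):
        print("WARNING: " + problem)
```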

We could specify -nosec if we do a version of this with the external coaster-service process (at the cost of a bit more complexity, but we're trying to wrap that nicely in reliable scripts).

This suggests, though, that many of the options provided by coaster-service should also be made available when the coaster service is run inside Swift (-passive, -nosec, and the options to write at least the worker connection port# to a file for scripting).  The latter may still take some synchronization effort within a wrapper script that manually starts the workers.

- Mike


> 2. For ssh authentication, make sure auth.defaults is in place with
> proper authentication info and permissions:
> a. ~/.ssh/auth.defaults looks like the following for key-based
> access:
> 
> bridled.ci.uchicago.edu.type=key
> bridled.ci.uchicago.edu.username=uname
> bridled.ci.uchicago.edu.key=/path/to/your/private_key
> bridled.ci.uchicago.edu.passphrase=yourpassphrase
> 
> 
> login1.beagle.ci.uchicago.edu.type=key
> login1.beagle.ci.uchicago.edu.username=uname
> login1.beagle.ci.uchicago.edu.key=/path/to/your/private_key
> login1.beagle.ci.uchicago.edu.passphrase=yourpassphrase
> 
> b. Make sure you have 600 perm on this auth.defaults file.
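
The permission requirement is easy to forget after editing the file, so a small check like the one below may help. This is a sketch under assumptions: the function name is hypothetical, and it flags any group or other access bits rather than insisting on exactly mode 600.

```python
import os
import stat

def auth_defaults_problems(path):
    """Return a list of problems found with the auth.defaults file at `path`."""
    try:
        mode = stat.S_IMODE(os.stat(path).st_mode)
    except OSError:
        return ["%s does not exist or cannot be read" % path]
    # Any bit set for group or others means the passphrase is exposed.
    if mode & 0o077:
        return ["%s has mode %o; run: chmod 600 %s" % (path, mode, path)]
    return []

if __name__ == "__main__":
    target = os.path.expanduser("~/.ssh/auth.defaults")
    for problem in auth_defaults_problems(target):
        print("WARNING: " + problem)
```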
> 
> 3. Java: We found the following exception was occurring because of IBM
> Java on Beagle:
> 
> Could not start connection handler
> java.io.EOFException
> 
> We installed Sun Java locally and the above exception went away.

This makes me wonder if we should require Sun Java on Beagle (and make a module for it).

It also suggests that our test suite be used to "certify" Swift on various Java implementations: at least we should advertise which Java we do release testing on and show users how to certify the release themselves on other Javas.

> 
> 4. Because beagle login nodes cannot write to the /home filesystem,
> we encountered error 524 from worker.pl, which was unable to write
> workdirs/jobdirs to a workdir location previously set under /home.
> Make sure your workdir is set to /lustre/beagle/your/preferred/path.
> Alternatively, setting it to PADS /gpfs is also ok, since worker nodes
> can write there. Beagle admins do not encourage this, though.
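
A sites file can be scanned for this mistake before submitting. The sketch below assumes a sites file without an XML namespace, like the one later in this message; the function name and the default list of "bad" prefixes are hypothetical and would need adjusting per site.

```python
import xml.etree.ElementTree as ET

def workdirs_under(sites_xml_text, bad_prefixes=("/home/",)):
    """Map pool handle -> workdirectory for pools whose workdir starts with a bad prefix."""
    root = ET.fromstring(sites_xml_text)
    flagged = {}
    for pool in root.iter("pool"):
        workdir = (pool.findtext("workdirectory") or "").strip()
        if workdir.startswith(tuple(bad_prefixes)):
            flagged[pool.get("handle")] = workdir
    return flagged

if __name__ == "__main__":
    import sys
    # Usage: python check_workdir.py sites.xml
    with open(sys.argv[1]) as f:
        for handle, workdir in workdirs_under(f.read()).items():
            print("WARNING: pool %s uses workdir %s" % (handle, workdir))
```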

Mihael or Justin: I was surprised to see that on Beagle, where /tmp is not writeable, coaster provider staging used the <workdirectory> tag to determine the jobdir on the compute node.  I always thought that it would honor the <scratch> tag to let the user specify the provider staging jobdir, but this seems not to be the case.  Can you clarify how the jobdir is determined in the provider staging case, and when the scratch tag is and is not used?

- Mike

> 
> To wrap up, following are the relevant files:
> sites.xml:
> 
> <config>
> <pool handle="ssh-pbs">
> <execution provider="coaster" url="login1.beagle.ci.uchicago.edu"
> jobmanager="ssh:pbs"/>
> <profile namespace="globus" key="project">CI-CCR000013</profile>
> 
> <profile namespace="globus" key="ppn">24</profile>
> 
> <profile namespace="globus"
> key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>
> 
> <profile namespace="globus" key="jobsPerNode">24</profile>
> <profile namespace="globus" key="maxTime">1000</profile>
> <profile namespace="globus" key="slots">1</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="globus" key="maxNodes">1</profile>
> 
> <profile namespace="karajan" key="jobThrottle">.63</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> 
> <workdirectory>/lustre/beagle/ketan/swift.workdir</workdirectory>
> </pool>
> </config>
> ===========
> 
> tc:
> 
> ssh-pbs cat /bin/cat null null null
> ===========
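
The first column of each tc line has to match a pool handle in the sites file; the "not available in your tc.data catalog" error at the bottom of this thread came from exactly such a mismatch. A cross-check could look roughly like this. It is a sketch: the function names are hypothetical, it assumes a sites file without an XML namespace, and it treats lines starting with "#" as comments.

```python
import xml.etree.ElementTree as ET

def pool_handles(sites_xml_text):
    """Collect the handle attribute of every <pool> element."""
    root = ET.fromstring(sites_xml_text)
    return {pool.get("handle") for pool in root.iter("pool")}

def tc_sites(tc_text):
    """Collect the first column (site name) of every non-comment tc line."""
    sites = set()
    for line in tc_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            sites.add(line.split()[0])
    return sites

def unmatched_tc_sites(sites_xml_text, tc_text):
    """Return site names used in tc that have no matching pool handle."""
    return tc_sites(tc_text) - pool_handles(sites_xml_text)

if __name__ == "__main__":
    import sys
    # Usage: python check_tc.py sites.xml tc
    with open(sys.argv[1]) as s, open(sys.argv[2]) as t:
        for site in sorted(unmatched_tc_sites(s.read(), t.read())):
            print("WARNING: tc site %s has no pool in the sites file" % site)
```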
> 
> cf: (note: provider staging is enabled, which is required)
> 
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=10
> provenance.log=true
> ===========
> 
> swift commandline:
> 
> swift -config cf -tc.file tc -sites.file beagle-coaster.xml
> catsn.swift -n=1
> ===========
> 
> 
> Regards,
> Ketan
> 
> 
> On 4/28/11 2:17 PM, Mihael Hategan wrote:
> > What does your sites file look like?
> >
> > On Thu, 2011-04-28 at 13:36 -0500, Ketan Maheshwari wrote:
> >> Ok, I got past CredentialException with grid-proxy-init, now I am
> >> facing this (note: I have turned on provider staging) :
> >>
> >> ========
> >> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc
> >> -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
> >> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog
> >> modified locally)
> >>
> >> RunID: 20110428-1332-llaa031f
> >> Progress:
> >> Could not start connection handler
> >> java.io.EOFException
> >> 	at
> >> 	org.globus.gsi.gssapi.net.impl.GSIGssInputStream.readHandshakeToken(GSIGssInputStream.java:61)
> >> 	at
> >> 	org.globus.gsi.gssapi.net.impl.GSIGssSocket.readToken(GSIGssSocket.java:65)
> >> 	at
> >> 	org.globus.gsi.gssapi.net.GssSocket.authenticateServer(GssSocket.java:127)
> >> 	at
> >> 	org.globus.gsi.gssapi.net.GssSocket.startHandshake(GssSocket.java:147)
> >> 	at
> >> 	org.globus.gsi.gssapi.net.GssSocket.getInputStream(GssSocket.java:177)
> >> 	at
> >> 	org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.setSocket(AbstractTCPChannel.java:30)
> >> 	at
> >> 	org.globus.cog.karajan.workflow.service.channels.GSSChannel.<init>(GSSChannel.java:47)
> >> 	at
> >> 	org.globus.cog.karajan.workflow.service.ConnectionHandler.<init>(ConnectionHandler.java:41)
> >> 	at
> >> 	org.globus.cog.abstraction.coaster.service.local.LocalService.handleConnection(LocalService.java:63)
> >> 	at org.globus.net.BaseServer.run(BaseServer.java:247)
> >> 	at java.lang.Thread.run(Thread.java:662)
> >> Progress: Submitted:1
> >> Could not start connection handler
> >> java.io.EOFException
> >> 	at
> >> 	org.globus.gsi.gssapi.net.impl.GSIGssInputStream.readHandshakeToken(GSIGssInputStream.java:61)
> >> 	at
> >> 	org.globus.gsi.gssapi.net.impl.GSIGssSocket.readToken(GSIGssSocket.java:65)
> >> 	at
> >> 	org.globus.gsi.gssapi.net.GssSocket.authenticateServer(GssSocket.java:127)
> >> 	at
> >> 	org.globus.gsi.gssapi.net.GssSocket.startHandshake(GssSocket.java:147)
> >> 	at
> >> 	org.globus.gsi.gssapi.net.GssSocket.getInputStream(GssSocket.java:177)
> >> 	at
> >> 	org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.setSocket(AbstractTCPChannel.java:30)
> >> 	at
> >> 	org.globus.cog.karajan.workflow.service.channels.GSSChannel.<init>(GSSChannel.java:47)
> >> 	at
> >> 	org.globus.cog.karajan.workflow.service.ConnectionHandler.<init>(ConnectionHandler.java:41)
> >> 	at
> >> 	org.globus.cog.abstraction.coaster.service.local.LocalService.handleConnection(LocalService.java:63)
> >> 	at org.globus.net.BaseServer.run(BaseServer.java:247)
> >> 	at java.lang.Thread.run(Thread.java:662)
> >> Progress: Submitted:1
> >> Exception in cat:
> >> Arguments: [data.txt]
> >> Host: beagle-remote-pbs-coasters-ssh
> >> Directory: catsn-20110428-1332-llaa031f/jobs/b/cat-bxal1d9kTODO:
> >> outs
> >> ----
> >>
> >> Caused by: Could not submit job
> >> Caused by:
> >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> >> Could not submit job
> >> Caused by:
> >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> >> Could not start coaster service
> >> Caused by:
> >> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> >> Task ended before registration was received.
> >> STDOUT:
> >> STDERR:
> >> Caused by:
> >> org.globus.cog.abstraction.impl.common.execution.JobException: Job
> >> failed with an exit code of 1
> >> Final status: Failed:1
> >> The following errors have occurred:
> >> 1. Job failed with an exit code of 1
> >>
> >> ========
> >>
> >>
> >>  From bridled to communicado, I see the following error:
> >>
> >> **************
> >> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc
> >> -sites.file coaster-local-ssh-communicado.xml catsn.swift -n=1
> >> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog
> >> modified locally)
> >>
> >> RunID: 20110428-1335-k685b2ye
> >> Progress:
> >> Progress: Submitted:1
> >> Progress: Active:1
> >> Exception in cat:
> >> Arguments: [data.txt]
> >> Host: communicado-ssh
> >> Directory: catsn-20110428-1335-k685b2ye/jobs/c/cat-coip1d9kTODO:
> >> outs
> >> ----
> >>
> >> Caused by: Job failed with an exit code of 524
> >> Caused by:
> >> org.globus.cog.abstraction.impl.common.execution.JobException: Job
> >> failed with an exit code of 524
> >> Final status: Failed:1
> >> The following errors have occurred:
> >> 1. Job failed with an exit code of 524
> >>
> >> ************
> >>
> >>
> >> --
> >> Ketan
> >>
> >>
> >>
> >>
> >> On Apr 28, 2011, at 1:03 PM, Michael Wilde wrote:
> >>
> >>> For now, create a proxy using grid-proxy-init on the swift
> >>> execution machine.
> >>> I think there is an option to set "no security" for this config,
> >>> but I can't recall where it is specified; maybe swift.properties.
> >>>
> >>> - Mike
> >>>
> >>> ----- Original Message -----
> >>>> Hi,
> >>>>
> >>>> It looks better now. However, I am getting the following:
> >>>>
> >>>> =====
> >>>>
> >>>> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc
> >>>> -sites.file
> >>>> beagle-coaster-ssh-pbs.xml catsn.swift -n=1
> >>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog
> >>>> modified
> >>>> locally)
> >>>>
> >>>> RunID: 20110428-1251-oi9theh8
> >>>> Progress:
> >>>> Progress: Stage in:1
> >>>> Could not submit job
> >>>> Caused by:
> >>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> >>>> Could not submit job
> >>>> Caused by:
> >>>> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> >>>> Could not start coaster service
> >>>> Caused by:
> >>>> org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException:
> >>>> org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file
> >>>> (/tmp/x509up_u2006) not found.
> >>>> Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5]
> >>>> Proxy
> >>>> file (/tmp/x509up_u2006) not found.
> >>>> Failed to transfer wrapper log from
> >>>> catsn-20110428-1251-oi9theh8/info/e on
> >>>> beagle-remote-pbs-coasters-ssh
> >>>>
> >>>> =====
> >>>>
> >>>> How do I specify "-nosec" on automatic coasters?
> >>>>
> >>>> Ketan
> >>>>
> >>>> On Apr 28, 2011, at 12:00 PM, Michael Wilde wrote:
> >>>>
> >>>>> OK. Was there a cookbook on the ssh settings? Did you set up a
> >>>>> $HOME/.ssh/auth.defaults per the user guide?
> >>>>>
> >>>>> Here is an auth.defaults example. I'm not sure it's 100% correct,
> >>>>> but it could serve as a base for you:
> >>>>>
> >>>>> xlogin1.pads.ci.uchicago.edu.type=password
> >>>>> xlogin1.pads.ci.uchicago.edu.username=wilde
> >>>>>
> >>>>> login.pads.ci.uchicago.edu.type=key
> >>>>> login.pads.ci.uchicago.edu.username=wilde
> >>>>> login.pads.ci.uchicago.edu.key=/home/wilde/.ssh/swift_rsa
> >>>>> # MAKE SURE mode=600!!!
> >>>>> login.pads.ci.uchicago.edu.passphrase=yourpassphrasehere
> >>>>>
> >>>>> login1.pads.ci.uchicago.edu.type=key
> >>>>> login1.pads.ci.uchicago.edu.username=wilde
> >>>>> login1.pads.ci.uchicago.edu.key=/home/wilde/.ssh/swift_rsa
> >>>>> # MAKE SURE mode=600!!!
> >>>>> login1.pads.ci.uchicago.edu.passphrase=yourpassphrasehere
> >>>>>
> >>>>> login.mcs.anl.gov.type=key
> >>>>> login.mcs.anl.gov.username=wilde
> >>>>> login.mcs.anl.gov.key=/home/wilde/.ssh/swift_rsa
> >>>>> # MAKE SURE mode=600!!!
> >>>>> login.mcs.anl.gov.passphrase=yourpassphrasehere
> >>>>>
> >>>>> - Mike
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> It does look like an ssh problem. I am getting the same stderr and
> >>>>>> log messages when trying to communicate from Bridled to Communicado.
> >>>>>>
> >>>>>> Ketan
> >>>>>>
> >>>>>> On Apr 28, 2011, at 11:19 AM, Michael Wilde wrote:
> >>>>>>
> >>>>>>> Have you already run a simple hello-world swift test from
> >>>>>>> communicado to bridled to make sure you have ssh configured
> >>>>>>> correctly? I would do that first.
> >>>>>>>
> >>>>>>> I'm not sure whether an ssh problem explains what you show below.
> >>>>>>>
> >>>>>>> - Mike
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>>> Thanks, I made the change. However, I am now getting the
> >>>>>>>> following on my stderr:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ===========
> >>>>>>>> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc
> >>>>>>>> -sites.file
> >>>>>>>> beagle-coaster-ssh-pbs.xml catsn.swift -n=1
> >>>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog
> >>>>>>>> modified
> >>>>>>>> locally)
> >>>>>>>>
> >>>>>>>> RunID: 20110428-1022-n9s0k0e0
> >>>>>>>> Progress:
> >>>>>>>> [ketan]
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> [ketan] Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>> ========
> >>>>>>>>
> >>>>>>>> And from the log it seems some network transmission has
> >>>>>>>> failed:
> >>>>>>>>
> >>>>>>>> 2011-04-28 10:22:45,261-0500 INFO TransportProtocolCommon
> >>>>>>>> Sending
> >>>>>>>> SSH_MSG_SERVICE_REQUEST
> >>>>>>>> 2011-04-28 10:22:45,264-0500 INFO TransportProtocolCommon
> >>>>>>>> Received
> >>>>>>>> SSH_MSG_SERVICE_ACCEPT
> >>>>>>>> 2011-04-28 10:24:27,626-0500 INFO TransportProtocolCommon The
> >>>>>>>> Transport Protocol thread failed
> >>>>>>>> java.io.IOException: The socket is EOF
> >>>>>>>> at
> >>>>>>>> com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183)
> >>>>>>>> at
> >>>>>>>> com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226)
> >>>>>>>> at
> >>>>>>>> com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440)
> >>>>>>>> at
> >>>>>>>> com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034)
> >>>>>>>> at
> >>>>>>>> com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393)
> >>>>>>>> at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Any clues?
> >>>>>>>> Ketan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Apr 28, 2011, at 10:20 AM, Michael Wilde wrote:
> >>>>>>>>
> >>>>>>>>> The pool name in your sites file is pads-remote-pbs-coasters-ssh,
> >>>>>>>>> but you used pbs in your tc.data.
> >>>>>>>>>
> >>>>>>>>> - Mike
> >>>>>>>>>
> >>>>>>>>> ----- Original Message -----
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> Some context:
> >>>>>>>>>> I am trying to submit a big run on Beagle using swift + coasters.
> >>>>>>>>>> However, a previous run is already underway on Beagle. So there
> >>>>>>>>>> are two difficulties in starting a new run from its login node:
> >>>>>>>>>>
> >>>>>>>>>> 1. Running another swift from the same jvm will result in chaos
> >>>>>>>>>> in the logs (as far as I know; please correct me if this is no
> >>>>>>>>>> longer the case).
> >>>>>>>>>>
> >>>>>>>>>> 2. The login node is already under load from my previous big
> >>>>>>>>>> run, which is still running.
> >>>>>>>>>>
> >>>>>>>>>> /context
> >>>>>>>>>>
> >>>>>>>>>> So, I am now trying to submit this big run from a remote host
> >>>>>>>>>> (bridled). I know this has been done on PADS using ssh:pbs with
> >>>>>>>>>> the coaster provider. I tried a similar approach on a trial
> >>>>>>>>>> swift script but am getting an error.
> >>>>>>>>>>
> >>>>>>>>>> Following is the error message:
> >>>>>>>>>>
> >>>>>>>>>> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc
> >>>>>>>>>> -sites.file
> >>>>>>>>>> beagle-coaster-ssh-pbs.xml catsn.swift -n=1
> >>>>>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088
> >>>>>>>>>> (cog
> >>>>>>>>>> modified
> >>>>>>>>>> locally)
> >>>>>>>>>>
> >>>>>>>>>> RunID: 20110428-1002-c8rvqhe6
> >>>>>>>>>> Progress:
> >>>>>>>>>> The application "cat" is not available in your tc.data
> >>>>>>>>>> catalog
> >>>>>>>>>> Caused by:
> >>>>>>>>>> org.globus.cog.karajan.scheduler.NoSuchResourceException
> >>>>>>>>>> Final status: Failed:1
> >>>>>>>>>> The following errors have occurred:
> >>>>>>>>>> 1. The application "cat" is not available in your tc.data
> >>>>>>>>>> catalog
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Attached are my .swift, sites.xml and tc.data files.
> >>>>>>>>>>
> >>>>>>>>>> Could someone indicate whether what I am doing is doable and,
> >>>>>>>>>> if so, how I can correctly configure my sites and tc setup?
> >>>>>>>>>>
> >>>>>>>>>> Thanks.
> >>>>>>>>>> Ketan
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> Swift-devel mailing list
> >>>>>>>>>> Swift-devel at ci.uchicago.edu
> >>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>>>>> --
> >>>>>>>>> Michael Wilde
> >>>>>>>>> Computation Institute, University of Chicago
> >>>>>>>>> Mathematics and Computer Science Division
> >>>>>>>>> Argonne National Laboratory
> >>>>>>>>>
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



