[Swift-devel] ssh:pbs to beagle

Mihael Hategan hategan at mcs.anl.gov
Thu Apr 28 14:34:13 CDT 2011


You have a bunch of unknown CA errors in there.

You should have the CA public key for your proxy in
~/.globus/certificates (on both client and server machines).
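
A rough way to check this on each machine is sketched below. It assumes the
common GSI layout in which every trusted CA shows up as <hash>.0 (the CA
certificate) next to <hash>.signing_policy; the function name and the
8-hex-digit pattern are illustrative assumptions, not part of Swift or CoG.

```python
import os
import re

# Assumed layout: each trusted CA appears as <hash>.0 (the CA certificate)
# next to <hash>.signing_policy, where <hash> is 8 hex digits.
CA_CERT_PATTERN = re.compile(r"^([0-9a-f]{8})\.0$")


def summarize_ca_dir(cert_dir):
    """Map each CA hash found in cert_dir to whether its signing policy exists."""
    cert_dir = os.path.expanduser(cert_dir)
    if not os.path.isdir(cert_dir):
        # A missing directory means no trusted CAs are installed here.
        return {}
    names = set(os.listdir(cert_dir))
    summary = {}
    for name in sorted(names):
        match = CA_CERT_PATTERN.match(name)
        if match:
            ca_hash = match.group(1)
            summary[ca_hash] = (ca_hash + ".signing_policy") in names
    return summary


if __name__ == "__main__":
    found = summarize_ca_dir("~/.globus/certificates")
    if not found:
        print("no CA certificates found")
    for ca_hash, has_policy in found.items():
        status = "signing_policy present" if has_policy else "signing_policy MISSING"
        print(ca_hash, status)
```

Running it on both the client and the server and comparing the printed hashes
should show whether the same CA is installed on both sides. It does not verify
that the CA actually matches your proxy.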

Mihael

On Thu, 2011-04-28 at 14:29 -0500, Ketan Maheshwari wrote:
> They are here : /home/ketan/.globus/coasters
> 
> 
> On Apr 28, 2011, at 2:26 PM, Mihael Hategan wrote:
> 
> > That EOFException doesn't make much sense.
> > 
> > On beagle you should have something called coaster.log in
> > ~/.globus/coasters.
> > 
> > Can you post a link to that?
> > 
> > Mihael
> > 
> > On Thu, 2011-04-28 at 14:21 -0500, Ketan Maheshwari wrote:
> >> On Apr 28, 2011, at 2:17 PM, Mihael Hategan wrote:
> >> 
> >>> What does your sites file look like?
> >> 
> >> ** For beagle **
> >> 
> >> <config>
> >>    <!--<pool handle="pbs">-->
> >>  <pool handle="beagle-remote-pbs-coasters-ssh">
> >>    <execution provider="coaster" url="login1.beagle.ci.uchicago.edu" jobmanager="ssh:pbs"/>
> >>    <profile namespace="globus" key="project">CI-CCR000013</profile>
> >> 
> >>    <profile namespace="globus" key="ppn">24:cray:pack</profile>
> >> 
> >>    <profile namespace="globus" key="workersPerNode">24</profile>
> >>    <profile namespace="globus" key="maxTime">1000</profile>
> >>    <profile namespace="globus" key="slots">1</profile>
> >>    <profile namespace="globus" key="nodeGranularity">1</profile>
> >>    <profile namespace="globus" key="maxNodes">1</profile>
> >> 
> >>    <profile namespace="karajan" key="jobThrottle">.63</profile>
> >>    <profile namespace="karajan" key="initialScore">10000</profile>
> >> 
> >>    <filesystem provider="ssh" url="login1.beagle.ci.uchicago.edu" />
> >>    <workdirectory>$HOME/swift.workdir</workdirectory>
> >>  </pool>
> >> </config>
> >> 
> >> 
> >> 
> >> ** for communicado **
> >> 
> >> <config>
> >>    <!--<pool handle="pbs">-->
> >>  <pool handle="communicado-ssh">
> >>    <execution provider="coaster" url="communicado.ci.uchicago.edu" jobmanager="ssh:ssh"/>
> >> 
> >>    <profile namespace="karajan" key="jobThrottle">.63</profile>
> >>    <profile namespace="karajan" key="initialScore">10000</profile>
> >> 
> >>    <filesystem provider="ssh" url="communicado.ci.uchicago.edu" />
> >>    <workdirectory>$HOME/swift.workdir</workdirectory>
> >>  </pool>
> >> </config>
> >> 
> >> 
> >> 
> >>> 
> >>> On Thu, 2011-04-28 at 13:36 -0500, Ketan Maheshwari wrote:
> >>>> OK, I got past the CredentialException with grid-proxy-init. Now I am facing this (note: I have turned on provider staging):
> >>>> 
> >>>> ========
> >>>> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
> >>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
> >>>> 
> >>>> RunID: 20110428-1332-llaa031f
> >>>> Progress:
> >>>> Could not start connection handler
> >>>> java.io.EOFException
> >>>> 	at org.globus.gsi.gssapi.net.impl.GSIGssInputStream.readHandshakeToken(GSIGssInputStream.java:61)
> >>>> 	at org.globus.gsi.gssapi.net.impl.GSIGssSocket.readToken(GSIGssSocket.java:65)
> >>>> 	at org.globus.gsi.gssapi.net.GssSocket.authenticateServer(GssSocket.java:127)
> >>>> 	at org.globus.gsi.gssapi.net.GssSocket.startHandshake(GssSocket.java:147)
> >>>> 	at org.globus.gsi.gssapi.net.GssSocket.getInputStream(GssSocket.java:177)
> >>>> 	at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.setSocket(AbstractTCPChannel.java:30)
> >>>> 	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.<init>(GSSChannel.java:47)
> >>>> 	at org.globus.cog.karajan.workflow.service.ConnectionHandler.<init>(ConnectionHandler.java:41)
> >>>> 	at org.globus.cog.abstraction.coaster.service.local.LocalService.handleConnection(LocalService.java:63)
> >>>> 	at org.globus.net.BaseServer.run(BaseServer.java:247)
> >>>> 	at java.lang.Thread.run(Thread.java:662)
> >>>> Progress:  Submitted:1
> >>>> Could not start connection handler
> >>>> java.io.EOFException
> >>>> 	at org.globus.gsi.gssapi.net.impl.GSIGssInputStream.readHandshakeToken(GSIGssInputStream.java:61)
> >>>> 	at org.globus.gsi.gssapi.net.impl.GSIGssSocket.readToken(GSIGssSocket.java:65)
> >>>> 	at org.globus.gsi.gssapi.net.GssSocket.authenticateServer(GssSocket.java:127)
> >>>> 	at org.globus.gsi.gssapi.net.GssSocket.startHandshake(GssSocket.java:147)
> >>>> 	at org.globus.gsi.gssapi.net.GssSocket.getInputStream(GssSocket.java:177)
> >>>> 	at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.setSocket(AbstractTCPChannel.java:30)
> >>>> 	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.<init>(GSSChannel.java:47)
> >>>> 	at org.globus.cog.karajan.workflow.service.ConnectionHandler.<init>(ConnectionHandler.java:41)
> >>>> 	at org.globus.cog.abstraction.coaster.service.local.LocalService.handleConnection(LocalService.java:63)
> >>>> 	at org.globus.net.BaseServer.run(BaseServer.java:247)
> >>>> 	at java.lang.Thread.run(Thread.java:662)
> >>>> Progress:  Submitted:1
> >>>> Exception in cat:
> >>>> Arguments: [data.txt]
> >>>> Host: beagle-remote-pbs-coasters-ssh
> >>>> Directory: catsn-20110428-1332-llaa031f/jobs/b/cat-bxal1d9kTODO: outs
> >>>> ----
> >>>> 
> >>>> Caused by: Could not submit job
> >>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
> >>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
> >>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received. 
> >>>> STDOUT: 
> >>>> STDERR: 
> >>>> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 1
> >>>> Final status:  Failed:1
> >>>> The following errors have occurred:
> >>>> 1. Job failed with an exit code of 1
> >>>> 
> >>>> ========
> >>>> 
> >>>> 
> >>>> From bridled to communicado, I see the following error:
> >>>> 
> >>>> **************
> >>>> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc -sites.file coaster-local-ssh-communicado.xml catsn.swift -n=1
> >>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
> >>>> 
> >>>> RunID: 20110428-1335-k685b2ye
> >>>> Progress:
> >>>> Progress:  Submitted:1
> >>>> Progress:  Active:1
> >>>> Exception in cat:
> >>>> Arguments: [data.txt]
> >>>> Host: communicado-ssh
> >>>> Directory: catsn-20110428-1335-k685b2ye/jobs/c/cat-coip1d9kTODO: outs
> >>>> ----
> >>>> 
> >>>> Caused by: Job failed with an exit code of 524
> >>>> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 524
> >>>> Final status:  Failed:1
> >>>> The following errors have occurred:
> >>>> 1. Job failed with an exit code of 524
> >>>> 
> >>>> ************
> >>>> 
> >>>> 
> >>>> --
> >>>> Ketan
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> On Apr 28, 2011, at 1:03 PM, Michael Wilde wrote:
> >>>> 
> >>>>> For now, create a proxy using grid-proxy-init on the swift execution machine.
> >>>>> I think there is an option to set "no security" for this config, but I can't recall where it is specified. Maybe swift.properties.
> >>>>> 
> >>>>> - Mike
> >>>>> 
> >>>>> ----- Original Message -----
> >>>>>> Hi,
> >>>>>> 
> >>>>>> It looks better now. However, I am getting the following:
> >>>>>> 
> >>>>>> =====
> >>>>>> 
> >>>>>> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
> >>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
> >>>>>> 
> >>>>>> RunID: 20110428-1251-oi9theh8
> >>>>>> Progress:
> >>>>>> Progress: Stage in:1
> >>>>>> Could not submit job
> >>>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
> >>>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
> >>>>>> Caused by: org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u2006) not found.
> >>>>>> Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u2006) not found.
> >>>>>> Failed to transfer wrapper log from catsn-20110428-1251-oi9theh8/info/e on beagle-remote-pbs-coasters-ssh
> >>>>>> 
> >>>>>> =====
> >>>>>> 
> >>>>>> How do I specify "-nosec" on automatic coasters?
> >>>>>> 
> >>>>>> Ketan
> >>>>>> 
> >>>>>> On Apr 28, 2011, at 12:00 PM, Michael Wilde wrote:
> >>>>>> 
> >>>>>>> OK. Was there a cookbook on the ssh settings? Did you set up a
> >>>>>>> $HOME/.ssh/auth.defaults per the user guide?
> >>>>>>> 
> >>>>>>> Here is an auth.defaults example. I'm not sure it's 100% correct, but
> >>>>>>> it could serve as a base for you:
> >>>>>>> 
> >>>>>>> xlogin1.pads.ci.uchicago.edu.type=password
> >>>>>>> xlogin1.pads.ci.uchicago.edu.username=wilde
> >>>>>>> 
> >>>>>>> login.pads.ci.uchicago.edu.type=key
> >>>>>>> login.pads.ci.uchicago.edu.username=wilde
> >>>>>>> login.pads.ci.uchicago.edu.key=/home/wilde/.ssh/swift_rsa
> >>>>>>> # MAKE SURE mode=600!!!
> >>>>>>> login.pads.ci.uchicago.edu.passphrase=yourpassphrasehere
> >>>>>>> 
> >>>>>>> login1.pads.ci.uchicago.edu.type=key
> >>>>>>> login1.pads.ci.uchicago.edu.username=wilde
> >>>>>>> login1.pads.ci.uchicago.edu.key=/home/wilde/.ssh/swift_rsa
> >>>>>>> # MAKE SURE mode=600!!!
> >>>>>>> login1.pads.ci.uchicago.edu.passphrase=yourpassphrasehere
> >>>>>>> 
> >>>>>>> login.mcs.anl.gov.type=key
> >>>>>>> login.mcs.anl.gov.username=wilde
> >>>>>>> login.mcs.anl.gov.key=/home/wilde/.ssh/swift_rsa
> >>>>>>> # MAKE SURE mode=600!!!
> >>>>>>> login.mcs.anl.gov.passphrase=yourpassphrasehere
> >>>>>>> 
> >>>>>>> - Mike
> >>>>>>> 
> >>>>>>> ----- Original Message -----
> >>>>>>>> It does look like an ssh problem. I am getting the same stderr and log
> >>>>>>>> messages on trying to communicate from Bridled to Communicado.
> >>>>>>>> 
> >>>>>>>> Ketan
> >>>>>>>> 
> >>>>>>>> On Apr 28, 2011, at 11:19 AM, Michael Wilde wrote:
> >>>>>>>> 
> >>>>>>>>> Have you already run a simple hello-world swift test from
> >>>>>>>>> communicado to bridled to make sure you have ssh configured
> >>>>>>>>> correctly? I would do that first.
> >>>>>>>>> 
> >>>>>>>>> I'm not sure whether an ssh problem explains what you show below.
> >>>>>>>>> 
> >>>>>>>>> - Mike
> >>>>>>>>> 
> >>>>>>>>> ----- Original Message -----
> >>>>>>>>>> Thanks, I made the change. However, now I am getting the following on
> >>>>>>>>>> my stderr:
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> ===========
> >>>>>>>>>> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
> >>>>>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
> >>>>>>>>>> 
> >>>>>>>>>> RunID: 20110428-1022-n9s0k0e0
> >>>>>>>>>> Progress:
> >>>>>>>>>> [ketan]
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> [ketan] Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> Progress: Initializing site shared directory:1
> >>>>>>>>>> ========
> >>>>>>>>>> 
> >>>>>>>>>> And from the log it seems some network transmission has failed:
> >>>>>>>>>> 
> >>>>>>>>>> 2011-04-28 10:22:45,261-0500 INFO TransportProtocolCommon Sending SSH_MSG_SERVICE_REQUEST
> >>>>>>>>>> 2011-04-28 10:22:45,264-0500 INFO TransportProtocolCommon Received SSH_MSG_SERVICE_ACCEPT
> >>>>>>>>>> 2011-04-28 10:24:27,626-0500 INFO TransportProtocolCommon The Transport Protocol thread failed
> >>>>>>>>>> java.io.IOException: The socket is EOF
> >>>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183)
> >>>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226)
> >>>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440)
> >>>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034)
> >>>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393)
> >>>>>>>>>> 	at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> Any clues?
> >>>>>>>>>> Ketan
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> On Apr 28, 2011, at 10:20 AM, Michael Wilde wrote:
> >>>>>>>>>> 
> >>>>>>>>>>> The pool name in your sites file is pads-remote-pbs-coasters-ssh, but
> >>>>>>>>>>> you used pbs in your tc.data.
> >>>>>>>>>>> 
> >>>>>>>>>>> - Mike
> >>>>>>>>>>> 
> >>>>>>>>>>> ----- Original Message -----
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Some context:
> >>>>>>>>>>>> I am trying to submit a big run on Beagle using swift + coasters.
> >>>>>>>>>>>> However, a previous run is already underway on beagle. So, there are
> >>>>>>>>>>>> two difficulties in starting a new run from its login node:
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 1. Running another swift from the same jvm will result in chaos in the
> >>>>>>>>>>>> logs (as far as I know; please correct me if this is no longer the case).
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 2. The login node is already under load because of my previous big run,
> >>>>>>>>>>>> which is still running.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> /context
> >>>>>>>>>>>> 
> >>>>>>>>>>>> So, I am now trying to submit this big run from a remote host
> >>>>>>>>>>>> (bridled). I know this has been done on PADS using the coaster provider
> >>>>>>>>>>>> with ssh:pbs. I tried a similar approach on a trial swift script but
> >>>>>>>>>>>> am getting an error.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Following is the error message:
> >>>>>>>>>>>> 
> >>>>>>>>>>>> [ketan@bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
> >>>>>>>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> RunID: 20110428-1002-c8rvqhe6
> >>>>>>>>>>>> Progress:
> >>>>>>>>>>>> The application "cat" is not available in your tc.data catalog
> >>>>>>>>>>>> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> >>>>>>>>>>>> Final status: Failed:1
> >>>>>>>>>>>> The following errors have occurred:
> >>>>>>>>>>>> 1. The application "cat" is not available in your tc.data catalog
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Attached are my .swift, sites.xml and tc.data files.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Could someone indicate whether what I am doing is doable and, if so,
> >>>>>>>>>>>> how I can correctly configure my sites and tc setup?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Thanks.
> >>>>>>>>>>>> Ketan
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Swift-devel mailing list
> >>>>>>>>>>>> Swift-devel at ci.uchicago.edu
> >>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> >>>>>>>>>>> 
> >>>>>>>>>>> --
> >>>>>>>>>>> Michael Wilde
> >>>>>>>>>>> Computation Institute, University of Chicago
> >>>>>>>>>>> Mathematics and Computer Science Division
> >>>>>>>>>>> Argonne National Laboratory
> >>>>>>>>>>> 
> >>> 
> >>> 
> >> 
> > 
> > 
> 




