[Swift-devel] ssh:pbs to beagle

Ketan Maheshwari ketancmaheshwari at gmail.com
Thu Apr 28 15:09:46 CDT 2011


On Apr 28, 2011, at 2:32 PM, Michael Wilde wrote:

> What is your communicado pool trying to test?
> 
> If that's to run, e.g., from bridled to communicado, I think jobmanager should be jobmanager="ssh:local" ???

I am on bridled and want to run the coaster service on bridled (so local) and the workers on communicado (ssh). That is why I have jobmanager="local:ssh".
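
For reference, a sketch of the communicado pool entry that this describes. It assumes the quoted communicado entry below is otherwise unchanged and that only the jobmanager attribute differs. Whether the ssh worker provider takes its host from the url attribute is an assumption here, not something confirmed in this thread.

```xml
<config>
  <!-- Sketch only: coaster service started locally on bridled,
       workers started over ssh on communicado. -->
  <pool handle="communicado-ssh">
    <!-- jobmanager="local:ssh" as described above; the url value is
         copied from the quoted entry and is an assumption here. -->
    <execution provider="coaster" url="communicado.ci.uchicago.edu"
               jobmanager="local:ssh"/>
    <profile namespace="karajan" key="jobThrottle">.63</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="ssh" url="communicado.ci.uchicago.edu"/>
    <workdirectory>$HOME/swift.workdir</workdirectory>
  </pool>
</config>
```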

> 
> - Mike
> 
> ----- Original Message -----
>> On Apr 28, 2011, at 2:17 PM, Mihael Hategan wrote:
>> 
>>> What does your sites file look like?
>> 
>> ** For beagle **
>> 
>> <config>
>> <!--<pool handle="pbs">-->
>> <pool handle="beagle-remote-pbs-coasters-ssh">
>> <execution provider="coaster" url="login1.beagle.ci.uchicago.edu"
>> jobmanager="ssh:pbs"/>
>> <profile namespace="globus" key="project">CI-CCR000013</profile>
>> 
>> <profile namespace="globus" key="ppn">24:cray:pack</profile>
>> 
>> <profile namespace="globus" key="workersPerNode">24</profile>
>> <profile namespace="globus" key="maxTime">1000</profile>
>> <profile namespace="globus" key="slots">1</profile>
>> <profile namespace="globus" key="nodeGranularity">1</profile>
>> <profile namespace="globus" key="maxNodes">1</profile>
>> 
>> <profile namespace="karajan" key="jobThrottle">.63</profile>
>> <profile namespace="karajan" key="initialScore">10000</profile>
>> 
>> <filesystem provider="ssh" url="login1.beagle.ci.uchicago.edu" />
>> <workdirectory>$HOME/swift.workdir</workdirectory>
>> </pool>
>> </config>
>> 
>> 
>> 
>> ** for communicado **
>> 
>> <config>
>> <!--<pool handle="pbs">-->
>> <pool handle="communicado-ssh">
>> <execution provider="coaster" url="communicado.ci.uchicago.edu"
>> jobmanager="ssh:ssh"/>
>> 
>> <profile namespace="karajan" key="jobThrottle">.63</profile>
>> <profile namespace="karajan" key="initialScore">10000</profile>
>> 
>> <filesystem provider="ssh" url="communicado.ci.uchicago.edu" />
>> <workdirectory>$HOME/swift.workdir</workdirectory>
>> </pool>
>> </config>
>> 
>> 
>> 
>>> 
>>> On Thu, 2011-04-28 at 13:36 -0500, Ketan Maheshwari wrote:
>>>> Ok, I got past CredentialException with grid-proxy-init, now I am
>>>> facing this (note: I have turned on provider staging) :
>>>> 
>>>> ========
>>>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>>> 
>>>> RunID: 20110428-1332-llaa031f
>>>> Progress:
>>>> Could not start connection handler
>>>> java.io.EOFException
>>>> 	at org.globus.gsi.gssapi.net.impl.GSIGssInputStream.readHandshakeToken(GSIGssInputStream.java:61)
>>>> 	at org.globus.gsi.gssapi.net.impl.GSIGssSocket.readToken(GSIGssSocket.java:65)
>>>> 	at org.globus.gsi.gssapi.net.GssSocket.authenticateServer(GssSocket.java:127)
>>>> 	at org.globus.gsi.gssapi.net.GssSocket.startHandshake(GssSocket.java:147)
>>>> 	at org.globus.gsi.gssapi.net.GssSocket.getInputStream(GssSocket.java:177)
>>>> 	at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.setSocket(AbstractTCPChannel.java:30)
>>>> 	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.<init>(GSSChannel.java:47)
>>>> 	at org.globus.cog.karajan.workflow.service.ConnectionHandler.<init>(ConnectionHandler.java:41)
>>>> 	at org.globus.cog.abstraction.coaster.service.local.LocalService.handleConnection(LocalService.java:63)
>>>> 	at org.globus.net.BaseServer.run(BaseServer.java:247)
>>>> 	at java.lang.Thread.run(Thread.java:662)
>>>> Progress: Submitted:1
>>>> Could not start connection handler
>>>> java.io.EOFException
>>>> 	at org.globus.gsi.gssapi.net.impl.GSIGssInputStream.readHandshakeToken(GSIGssInputStream.java:61)
>>>> 	at org.globus.gsi.gssapi.net.impl.GSIGssSocket.readToken(GSIGssSocket.java:65)
>>>> 	at org.globus.gsi.gssapi.net.GssSocket.authenticateServer(GssSocket.java:127)
>>>> 	at org.globus.gsi.gssapi.net.GssSocket.startHandshake(GssSocket.java:147)
>>>> 	at org.globus.gsi.gssapi.net.GssSocket.getInputStream(GssSocket.java:177)
>>>> 	at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.setSocket(AbstractTCPChannel.java:30)
>>>> 	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.<init>(GSSChannel.java:47)
>>>> 	at org.globus.cog.karajan.workflow.service.ConnectionHandler.<init>(ConnectionHandler.java:41)
>>>> 	at org.globus.cog.abstraction.coaster.service.local.LocalService.handleConnection(LocalService.java:63)
>>>> 	at org.globus.net.BaseServer.run(BaseServer.java:247)
>>>> 	at java.lang.Thread.run(Thread.java:662)
>>>> Progress: Submitted:1
>>>> Exception in cat:
>>>> Arguments: [data.txt]
>>>> Host: beagle-remote-pbs-coasters-ssh
>>>> Directory: catsn-20110428-1332-llaa031f/jobs/b/cat-bxal1d9kTODO:
>>>> outs
>>>> ----
>>>> 
>>>> Caused by: Could not submit job
>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
>>>> STDOUT:
>>>> STDERR:
>>>> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 1
>>>> Final status: Failed:1
>>>> The following errors have occurred:
>>>> 1. Job failed with an exit code of 1
>>>> 
>>>> ========
>>>> 
>>>> 
>>>> From bridled to communicado, I see the following error:
>>>> 
>>>> **************
>>>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file coaster-local-ssh-communicado.xml catsn.swift -n=1
>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>>> 
>>>> RunID: 20110428-1335-k685b2ye
>>>> Progress:
>>>> Progress: Submitted:1
>>>> Progress: Active:1
>>>> Exception in cat:
>>>> Arguments: [data.txt]
>>>> Host: communicado-ssh
>>>> Directory: catsn-20110428-1335-k685b2ye/jobs/c/cat-coip1d9kTODO:
>>>> outs
>>>> ----
>>>> 
>>>> Caused by: Job failed with an exit code of 524
>>>> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 524
>>>> Final status: Failed:1
>>>> The following errors have occurred:
>>>> 1. Job failed with an exit code of 524
>>>> 
>>>> ************
>>>> 
>>>> 
>>>> --
>>>> Ketan
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Apr 28, 2011, at 1:03 PM, Michael Wilde wrote:
>>>> 
>>>>> For now - create a proxy using grid-proxy-init on the swift
>>>>> execution machine.
>>>>> I think there is an option to set "no security" for this config,
>>>>> but I can't recall where that is specified. Maybe swift.properties.
>>>>> 
>>>>> - Mike
>>>>> 
>>>>> ----- Original Message -----
>>>>>> Hi,
>>>>>> 
>>>>>> It looks better now. However, I am getting the following:
>>>>>> 
>>>>>> =====
>>>>>> 
>>>>>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>>>>> 
>>>>>> RunID: 20110428-1251-oi9theh8
>>>>>> Progress:
>>>>>> Progress: Stage in:1
>>>>>> Could not submit job
>>>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
>>>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
>>>>>> Caused by: org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u2006) not found.
>>>>>> Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u2006) not found.
>>>>>> Failed to transfer wrapper log from catsn-20110428-1251-oi9theh8/info/e on beagle-remote-pbs-coasters-ssh
>>>>>> 
>>>>>> =====
>>>>>> 
>>>>>> How do I specify "-nosec" on automatic coasters?
>>>>>> 
>>>>>> Ketan
>>>>>> 
>>>>>> On Apr 28, 2011, at 12:00 PM, Michael Wilde wrote:
>>>>>> 
>>>>>>> OK. Was there a cookbook on the ssh settings? Did you set up a
>>>>>>> $HOME/.ssh/auth.defaults per the user guide?
>>>>>>> 
>>>>>>> Here is an auth.defaults example. I'm not sure it's 100% correct,
>>>>>>> but it could serve as a base for you:
>>>>>>> 
>>>>>>> xlogin1.pads.ci.uchicago.edu.type=password
>>>>>>> xlogin1.pads.ci.uchicago.edu.username=wilde
>>>>>>> 
>>>>>>> login.pads.ci.uchicago.edu.type=key
>>>>>>> login.pads.ci.uchicago.edu.username=wilde
>>>>>>> login.pads.ci.uchicago.edu.key=/home/wilde/.ssh/swift_rsa
>>>>>>> login.pads.ci.uchicago.edu.passphrase=yourpassphrasehere # MAKE SURE mode=600!!!
>>>>>>> 
>>>>>>> login1.pads.ci.uchicago.edu.type=key
>>>>>>> login1.pads.ci.uchicago.edu.username=wilde
>>>>>>> login1.pads.ci.uchicago.edu.key=/home/wilde/.ssh/swift_rsa
>>>>>>> login1.pads.ci.uchicago.edu.passphrase=yourpassphrasehere # MAKE SURE mode=600!!!
>>>>>>> 
>>>>>>> login.mcs.anl.gov.type=key
>>>>>>> login.mcs.anl.gov.username=wilde
>>>>>>> login.mcs.anl.gov.key=/home/wilde/.ssh/swift_rsa
>>>>>>> login.mcs.anl.gov.passphrase=yourpassphrasehere # MAKE SURE mode=600!!!
>>>>>>> 
>>>>>>> - Mike
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>>> It does look like an ssh problem. I am getting the same stderr and log
>>>>>>>> messages on trying to communicate from Bridled to Communicado.
>>>>>>>> 
>>>>>>>> Ketan
>>>>>>>> 
>>>>>>>> On Apr 28, 2011, at 11:19 AM, Michael Wilde wrote:
>>>>>>>> 
>>>>>>>>> Have you already run a simple hello-world swift test from
>>>>>>>>> communicado to bridled to make sure you have ssh configured
>>>>>>>>> correctly? I would do that first.
>>>>>>>>> 
>>>>>>>>> I'm not sure whether an ssh problem explains what you show below.
>>>>>>>>> 
>>>>>>>>> - Mike
>>>>>>>>> 
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> Thanks, I made the change. However, now I am getting the following on
>>>>>>>>>> my stderr:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ===========
>>>>>>>>>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
>>>>>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>>>>>>>>> 
>>>>>>>>>> RunID: 20110428-1022-n9s0k0e0
>>>>>>>>>> Progress:
>>>>>>>>>> [ketan]
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> [ketan] Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>>>> ========
>>>>>>>>>> 
>>>>>>>>>> And from the log it seems some network transmission has
>>>>>>>>>> failed:
>>>>>>>>>> 
>>>>>>>>>> 2011-04-28 10:22:45,261-0500 INFO TransportProtocolCommon Sending SSH_MSG_SERVICE_REQUEST
>>>>>>>>>> 2011-04-28 10:22:45,264-0500 INFO TransportProtocolCommon Received SSH_MSG_SERVICE_ACCEPT
>>>>>>>>>> 2011-04-28 10:24:27,626-0500 INFO TransportProtocolCommon The Transport Protocol thread failed
>>>>>>>>>> java.io.IOException: The socket is EOF
>>>>>>>>>> at com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183)
>>>>>>>>>> at com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226)
>>>>>>>>>> at com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440)
>>>>>>>>>> at com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034)
>>>>>>>>>> at com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393)
>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Any clues?
>>>>>>>>>> Ketan
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Apr 28, 2011, at 10:20 AM, Michael Wilde wrote:
>>>>>>>>>> 
>>>>>>>>>>> The pool name in your sites file is pads-remote-pbs-coasters-ssh, but
>>>>>>>>>>> you used pbs in your tc.data.
>>>>>>>>>>> 
>>>>>>>>>>> - Mike
>>>>>>>>>>> 
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> 
>>>>>>>>>>>> Some context:
>>>>>>>>>>>> I am trying to submit a big run on Beagle using swift + coasters.
>>>>>>>>>>>> However, a previous run is already underway on beagle. So, there are
>>>>>>>>>>>> two difficulties with starting a new run from its login node:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. Running another swift from the same jvm will result in chaos in the
>>>>>>>>>>>> logs (as far as I know; please correct me if this is no longer the
>>>>>>>>>>>> case).
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. The login node is already under load because of my previous big run,
>>>>>>>>>>>> which is still running.
>>>>>>>>>>>> 
>>>>>>>>>>>> /context
>>>>>>>>>>>> 
>>>>>>>>>>>> So, I am now trying to submit this big run from a remote host
>>>>>>>>>>>> (bridled). I know this has been done on PADS using ssh:pbs with the
>>>>>>>>>>>> coaster provider. I tried a similar approach on a trial swift script
>>>>>>>>>>>> but am getting an error.
>>>>>>>>>>>> 
>>>>>>>>>>>> Following is the error message:
>>>>>>>>>>>> 
>>>>>>>>>>>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
>>>>>>>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>>>>>>>>>>> 
>>>>>>>>>>>> RunID: 20110428-1002-c8rvqhe6
>>>>>>>>>>>> Progress:
>>>>>>>>>>>> The application "cat" is not available in your tc.data catalog
>>>>>>>>>>>> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
>>>>>>>>>>>> Final status: Failed:1
>>>>>>>>>>>> The following errors have occurred:
>>>>>>>>>>>> 1. The application "cat" is not available in your tc.data catalog
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Attached are my .swift, sites.xml and tc.data files.
>>>>>>>>>>>> 
>>>>>>>>>>>> Could someone indicate whether what I am doing is doable and, if so,
>>>>>>>>>>>> how I can correctly configure my sites and tc setup?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> Ketan
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Michael Wilde
>>>>>>>>>>> Computation Institute, University of Chicago
>>>>>>>>>>> Mathematics and Computer Science Division
>>>>>>>>>>> Argonne National Laboratory
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 



