[Swift-devel] ssh:pbs to beagle

ketan ketancmaheshwari at gmail.com
Tue May 17 09:32:26 CDT 2011


With lots of help from Mike, we successfully submitted Swift jobs
yesterday from the bridled machine to Beagle via ssh:pbs.

Here are the notes from the exercise:

1. We used automatic coasters with security enabled, since there is no
way to specify -nosec. This implies:
     a. Make sure the proxy is valid on both ends (bridled and beagle),
using grid-proxy-init.
     b. Make sure the CA certs are present on both ends:
X509_CERT_DIR=/home/ketan/TRUSTEDCA, X509_CADIR=/home/ketan/TRUSTEDCA
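
A pre-flight check for note 1 could look like the sketch below. It is only
an illustration and not part of Swift: the function names are made up here,
and it assumes the usual Globus convention that the proxy lives in
$X509_USER_PROXY or, failing that, in /tmp/x509up_u<uid> (the path that
appears in the JGLOBUS-5 error quoted further down). It only checks that
the files exist, not that the proxy is still valid, so it would have to be
run on both bridled and beagle after grid-proxy-init.

```python
import os


def proxy_path(env, uid):
    """Expected proxy location: $X509_USER_PROXY if set, else /tmp/x509up_u<uid>."""
    return env.get("X509_USER_PROXY") or "/tmp/x509up_u%d" % uid


def check_gsi_setup(env=None, uid=None):
    """Return a list of problems found; an empty list means the basics are in place."""
    env = os.environ if env is None else env
    uid = os.getuid() if uid is None else uid
    problems = []

    # Assumption: the proxy sits at the conventional Globus location.
    proxy = proxy_path(env, uid)
    if not os.path.isfile(proxy):
        problems.append("proxy %s not found, run grid-proxy-init" % proxy)

    # The CA certificate directory must be set and must contain something.
    cert_dir = env.get("X509_CERT_DIR")
    if not cert_dir:
        problems.append("X509_CERT_DIR is not set")
    elif not os.path.isdir(cert_dir) or not os.listdir(cert_dir):
        problems.append("X509_CERT_DIR=%s is missing or empty" % cert_dir)
    return problems
```

Calling check_gsi_setup() with no arguments inspects the current
environment and user; passing env and uid explicitly makes it easy to try
out without touching the real settings.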

2. For ssh authentication, make sure auth.defaults is in place with the
proper authentication info and permissions:
     a. ~/.ssh/auth.defaults looks like the following for key-based
access:

         bridled.ci.uchicago.edu.type=key
         bridled.ci.uchicago.edu.username=uname
         bridled.ci.uchicago.edu.key=/path/to/your/private_key
         bridled.ci.uchicago.edu.passphrase=yourpassphrase


         login1.beagle.ci.uchicago.edu.type=key
         login1.beagle.ci.uchicago.edu.username=uname
         login1.beagle.ci.uchicago.edu.key=/path/to/your/private_key
         login1.beagle.ci.uchicago.edu.passphrase=yourpassphrase

     b. Make sure the auth.defaults file has mode 600.
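
Both points of note 2 can be checked mechanically. The sketch below is a
rough illustration, not anything shipped with Swift or CoG: the function
names are hypothetical, the set of required fields is an assumption read
off the examples in this thread (type, username and key for key-based
access; type and username for password access), and treating lines that
start with # as comments is also an assumption.

```python
import os
import stat

# Assumed required fields per entry type, based on the examples in this thread.
REQUIRED = {"key": ("username", "key"), "password": ("username",)}


def parse_auth_defaults(text):
    """Group 'host.field=value' lines into {host: {field: value}}."""
    hosts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # The field is whatever follows the last dot; the rest is the host name.
        host, _, field = name.strip().rpartition(".")
        hosts.setdefault(host, {})[field] = value.strip()
    return hosts


def missing_fields(hosts):
    """Return {host: [missing field, ...]} for entries that look incomplete."""
    report = {}
    for host, fields in hosts.items():
        needed = ("type",) + REQUIRED.get(fields.get("type"), ())
        missing = [f for f in needed if f not in fields]
        if missing:
            report[host] = missing
    return report


def has_mode_600(path):
    """True if the file's permission bits are exactly rw-------."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600
```

For example, has_mode_600(os.path.expanduser("~/.ssh/auth.defaults"))
covers point b, and missing_fields(parse_auth_defaults(text)) flags a host
entry that lacks a username or key line.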

3. Java: We found that the following exception occurred because of the
IBM Java on Beagle:

Could not start connection handler
java.io.EOFException

We installed Sun Java locally and the exception went away.

4. Because beagle login nodes cannot write to the /home filesystem, we
hit error 524: worker.pl could not write workdirs/jobdirs to the
workdir, which had previously been set to a location under /home.
Make sure your workdir is set to /lustre/beagle/your/preferred/path.
Alternatively, setting it to PADS /gpfs also works, since worker nodes
can write there, but the Beagle admins discourage this.
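
A tiny guard against the workdir mistake in note 4 might look like this.
It is a sketch under assumptions: the function name is invented, and the
only prefix it accepts by default is /lustre/beagle as named above; other
writable locations would have to be passed in explicitly.

```python
import posixpath

# Prefix taken from note 4; extend the tuple for other writable filesystems.
WRITABLE_PREFIXES = ("/lustre/beagle",)


def workdir_looks_writable(workdir, prefixes=WRITABLE_PREFIXES):
    """True if workdir, once normalized, sits under one of the given prefixes."""
    path = posixpath.normpath(workdir)
    return any(path == p or path.startswith(p + "/") for p in prefixes)
```

This only inspects the path string; it does not try to write anything on
the remote side.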

To wrap up, here are the relevant files:
sites.xml:

<config>
<pool handle="ssh-pbs">
<execution provider="coaster" url="login1.beagle.ci.uchicago.edu" jobmanager="ssh:pbs"/>
<profile namespace="globus" key="project">CI-CCR000013</profile>

<profile namespace="globus" key="ppn">24</profile>

<profile namespace="globus" key="providerAttributes">pbs.aprun;pbs.mpp;depth=24</profile>

<profile namespace="globus" key="jobsPerNode">24</profile>
<profile namespace="globus" key="maxTime">1000</profile>
<profile namespace="globus" key="slots">1</profile>
<profile namespace="globus" key="nodeGranularity">1</profile>
<profile namespace="globus" key="maxNodes">1</profile>

<profile namespace="karajan" key="jobThrottle">.63</profile>
<profile namespace="karajan" key="initialScore">10000</profile>

<workdirectory>/lustre/beagle/ketan/swift.workdir</workdirectory>
</pool>
</config>
===========

tc:

ssh-pbs cat /bin/cat null null null
===========
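
The first error in the thread below (the application "cat" is not
available in your tc.data catalog) came from a site name in tc that did
not match the pool handle in the sites file. A check along the following
lines would catch that before a run. It is a sketch only: the function
names are hypothetical, it assumes a sites file without an XML namespace
(as in the one above), and it assumes the first whitespace-separated
column of each tc line is the site name and that # starts a comment.

```python
import xml.etree.ElementTree as ET


def pool_handles(sites_xml):
    """Return the handle attribute of every <pool> element in a sites file."""
    root = ET.fromstring(sites_xml)
    return {pool.get("handle") for pool in root.iter("pool")}


def tc_sites(tc_text):
    """Return the site name (first column) of every non-comment tc line."""
    sites = set()
    for line in tc_text.splitlines():
        fields = line.split()
        if fields and not fields[0].startswith("#"):
            sites.add(fields[0])
    return sites


def unknown_tc_sites(sites_xml, tc_text):
    """Site names used in tc that have no matching <pool handle=...>."""
    return tc_sites(tc_text) - pool_handles(sites_xml)
```

With the files above, unknown_tc_sites(...) should come back empty, since
both use ssh-pbs; a tc line that starts with pbs would be reported.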

cf (note: provider staging is enabled, which is required):

wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=1
lazy.errors=true
status.mode=provider
use.provider.staging=true
provider.staging.pin.swiftfiles=false
foreach.max.threads=10
provenance.log=true
===========
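
Since provider staging is required for this setup, it may be worth
confirming the setting before launching. The sketch below is illustrative
only: the function names are made up, and it treats cf as plain key=value
lines with optional # comments, which matches the file above but is
otherwise an assumption.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def provider_staging_enabled(props):
    """True if use.provider.staging is set to 'true' (case-insensitive)."""
    return props.get("use.provider.staging", "").lower() == "true"
```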

swift command line:

swift -config cf -tc.file tc -sites.file beagle-coaster.xml catsn.swift -n=1
===========


Regards,
Ketan


On 4/28/11 2:17 PM, Mihael Hategan wrote:
> What does your sites file look like?
>
> On Thu, 2011-04-28 at 13:36 -0500, Ketan Maheshwari wrote:
>> OK, I got past the CredentialException with grid-proxy-init; now I am facing this (note: I have turned on provider staging):
>>
>> ========
>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>
>> RunID: 20110428-1332-llaa031f
>> Progress:
>> Could not start connection handler
>> java.io.EOFException
>> 	at org.globus.gsi.gssapi.net.impl.GSIGssInputStream.readHandshakeToken(GSIGssInputStream.java:61)
>> 	at org.globus.gsi.gssapi.net.impl.GSIGssSocket.readToken(GSIGssSocket.java:65)
>> 	at org.globus.gsi.gssapi.net.GssSocket.authenticateServer(GssSocket.java:127)
>> 	at org.globus.gsi.gssapi.net.GssSocket.startHandshake(GssSocket.java:147)
>> 	at org.globus.gsi.gssapi.net.GssSocket.getInputStream(GssSocket.java:177)
>> 	at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.setSocket(AbstractTCPChannel.java:30)
>> 	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.<init>(GSSChannel.java:47)
>> 	at org.globus.cog.karajan.workflow.service.ConnectionHandler.<init>(ConnectionHandler.java:41)
>> 	at org.globus.cog.abstraction.coaster.service.local.LocalService.handleConnection(LocalService.java:63)
>> 	at org.globus.net.BaseServer.run(BaseServer.java:247)
>> 	at java.lang.Thread.run(Thread.java:662)
>> Progress:  Submitted:1
>> Could not start connection handler
>> java.io.EOFException
>> 	at org.globus.gsi.gssapi.net.impl.GSIGssInputStream.readHandshakeToken(GSIGssInputStream.java:61)
>> 	at org.globus.gsi.gssapi.net.impl.GSIGssSocket.readToken(GSIGssSocket.java:65)
>> 	at org.globus.gsi.gssapi.net.GssSocket.authenticateServer(GssSocket.java:127)
>> 	at org.globus.gsi.gssapi.net.GssSocket.startHandshake(GssSocket.java:147)
>> 	at org.globus.gsi.gssapi.net.GssSocket.getInputStream(GssSocket.java:177)
>> 	at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.setSocket(AbstractTCPChannel.java:30)
>> 	at org.globus.cog.karajan.workflow.service.channels.GSSChannel.<init>(GSSChannel.java:47)
>> 	at org.globus.cog.karajan.workflow.service.ConnectionHandler.<init>(ConnectionHandler.java:41)
>> 	at org.globus.cog.abstraction.coaster.service.local.LocalService.handleConnection(LocalService.java:63)
>> 	at org.globus.net.BaseServer.run(BaseServer.java:247)
>> 	at java.lang.Thread.run(Thread.java:662)
>> Progress:  Submitted:1
>> Exception in cat:
>> Arguments: [data.txt]
>> Host: beagle-remote-pbs-coasters-ssh
>> Directory: catsn-20110428-1332-llaa031f/jobs/b/cat-bxal1d9kTODO: outs
>> ----
>>
>> Caused by: Could not submit job
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Task ended before registration was received.
>> STDOUT:
>> STDERR:
>> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 1
>> Final status:  Failed:1
>> The following errors have occurred:
>> 1. Job failed with an exit code of 1
>>
>> ========
>>
>>
>> From bridled to communicado, I see the following error:
>>
>> **************
>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file coaster-local-ssh-communicado.xml catsn.swift -n=1
>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>
>> RunID: 20110428-1335-k685b2ye
>> Progress:
>> Progress:  Submitted:1
>> Progress:  Active:1
>> Exception in cat:
>> Arguments: [data.txt]
>> Host: communicado-ssh
>> Directory: catsn-20110428-1335-k685b2ye/jobs/c/cat-coip1d9kTODO: outs
>> ----
>>
>> Caused by: Job failed with an exit code of 524
>> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 524
>> Final status:  Failed:1
>> The following errors have occurred:
>> 1. Job failed with an exit code of 524
>>
>> ************
>>
>>
>> --
>> Ketan
>>
>>
>>
>>
>> On Apr 28, 2011, at 1:03 PM, Michael Wilde wrote:
>>
>>> For now, create a proxy using grid-proxy-init on the swift execution machine.
>>> I think there is an option to set "no security" for this config, but I can't recall where it is specified. Maybe swift.properties.
>>>
>>> - Mike
>>>
>>> ----- Original Message -----
>>>> Hi,
>>>>
>>>> It looks better now. However, I am getting the following:
>>>>
>>>> =====
>>>>
>>>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>>>
>>>> RunID: 20110428-1251-oi9theh8
>>>> Progress:
>>>> Progress: Stage in:1
>>>> Could not submit job
>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not submit job
>>>> Caused by: org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Could not start coaster service
>>>> Caused by: org.globus.cog.abstraction.impl.common.task.InvalidSecurityContextException: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u2006) not found.
>>>> Caused by: org.globus.gsi.GlobusCredentialException: [JGLOBUS-5] Proxy file (/tmp/x509up_u2006) not found.
>>>> Failed to transfer wrapper log from catsn-20110428-1251-oi9theh8/info/e on beagle-remote-pbs-coasters-ssh
>>>>
>>>> =====
>>>>
>>>> How do I specify "-nosec" on automatic coasters?
>>>>
>>>> Ketan
>>>>
>>>> On Apr 28, 2011, at 12:00 PM, Michael Wilde wrote:
>>>>
>>>>> OK. Was there a cookbook on the ssh settings? Did you set up a
>>>>> $HOME/.ssh/auth.defaults per the user guide?
>>>>>
>>>>> Here is an auth.defaults example. I'm not sure it's 100% correct, but
>>>>> it could serve as a base for you:
>>>>>
>>>>> xlogin1.pads.ci.uchicago.edu.type=password
>>>>> xlogin1.pads.ci.uchicago.edu.username=wilde
>>>>>
>>>>> login.pads.ci.uchicago.edu.type=key
>>>>> login.pads.ci.uchicago.edu.username=wilde
>>>>> login.pads.ci.uchicago.edu.key=/home/wilde/.ssh/swift_rsa
>>>>> login.pads.ci.uchicago.edu.passphrase=yourpassphrasehere # MAKE SURE mode=600!!!
>>>>>
>>>>> login1.pads.ci.uchicago.edu.type=key
>>>>> login1.pads.ci.uchicago.edu.username=wilde
>>>>> login1.pads.ci.uchicago.edu.key=/home/wilde/.ssh/swift_rsa
>>>>> login1.pads.ci.uchicago.edu.passphrase=yourpassphrasehere # MAKE SURE mode=600!!!
>>>>>
>>>>> login.mcs.anl.gov.type=key
>>>>> login.mcs.anl.gov.username=wilde
>>>>> login.mcs.anl.gov.key=/home/wilde/.ssh/swift_rsa
>>>>> login.mcs.anl.gov.passphrase=yourpassphrasehere # MAKE SURE mode=600!!!
>>>>>
>>>>> - Mike
>>>>>
>>>>> ----- Original Message -----
>>>>>> It does look like an ssh problem. I am getting the same stderr and
>>>>>> log messages when trying to communicate from Bridled to Communicado.
>>>>>>
>>>>>> Ketan
>>>>>>
>>>>>> On Apr 28, 2011, at 11:19 AM, Michael Wilde wrote:
>>>>>>
>>>>>>> Have you already run a simple hello-world swift test from
>>>>>>> communicado to bridled to make sure you have ssh configured
>>>>>>> correctly? I would do that first.
>>>>>>>
>>>>>>> I'm not sure whether an ssh problem explains what you show below.
>>>>>>>
>>>>>>> - Mike
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> Thanks, I made the change. However, I am now getting the
>>>>>>>> following on my stderr:
>>>>>>>>
>>>>>>>>
>>>>>>>> ===========
>>>>>>>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
>>>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>>>>>>>
>>>>>>>> RunID: 20110428-1022-n9s0k0e0
>>>>>>>> Progress:
>>>>>>>> [ketan]
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> [ketan] Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> Progress: Initializing site shared directory:1
>>>>>>>> ========
>>>>>>>>
>>>>>>>> And from the log it seems some network transmission has failed:
>>>>>>>>
>>>>>>>> 2011-04-28 10:22:45,261-0500 INFO TransportProtocolCommon Sending SSH_MSG_SERVICE_REQUEST
>>>>>>>> 2011-04-28 10:22:45,264-0500 INFO TransportProtocolCommon Received SSH_MSG_SERVICE_ACCEPT
>>>>>>>> 2011-04-28 10:24:27,626-0500 INFO TransportProtocolCommon The Transport Protocol thread failed
>>>>>>>> java.io.IOException: The socket is EOF
>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolInputStream.readBufferedData(TransportProtocolInputStream.java:183)
>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolInputStream.readMessage(TransportProtocolInputStream.java:226)
>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolCommon.processMessages(TransportProtocolCommon.java:1440)
>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolCommon.startBinaryPacketProtocol(TransportProtocolCommon.java:1034)
>>>>>>>> 	at com.sshtools.j2ssh.transport.TransportProtocolCommon.run(TransportProtocolCommon.java:393)
>>>>>>>> 	at java.lang.Thread.run(Thread.java:662)
>>>>>>>>
>>>>>>>>
>>>>>>>> Any clues?
>>>>>>>> Ketan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 28, 2011, at 10:20 AM, Michael Wilde wrote:
>>>>>>>>
>>>>>>>>> The pool name in your sites file is pads-remote-pbs-coasters-ssh,
>>>>>>>>> but you used pbs in your tc.data.
>>>>>>>>>
>>>>>>>>> - Mike
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Some context:
>>>>>>>>>> I am trying to submit a big run on Beagle using swift + coasters.
>>>>>>>>>> However, a previous run is already underway on beagle. So, there
>>>>>>>>>> are two difficulties with starting a new run from its login node:
>>>>>>>>>>
>>>>>>>>>> 1. Running another swift from the same jvm will result in chaos in
>>>>>>>>>> the logs (as far as I know; please correct me if this is no longer
>>>>>>>>>> the case).
>>>>>>>>>>
>>>>>>>>>> 2. The login node is already under load because of my previous big
>>>>>>>>>> run, which is still running.
>>>>>>>>>>
>>>>>>>>>> /context
>>>>>>>>>>
>>>>>>>>>> So, I am now trying to submit this big run from a remote host
>>>>>>>>>> (bridled). I know this has been done on PADS using ssh:pbs with the
>>>>>>>>>> coaster provider. I tried a similar approach on a trial swift
>>>>>>>>>> script but am getting an error.
>>>>>>>>>>
>>>>>>>>>> Following is the error message:
>>>>>>>>>>
>>>>>>>>>> [ketan at bridled catsn.works]$ swift -config cf -tc.file tc -sites.file beagle-coaster-ssh-pbs.xml catsn.swift -n=1
>>>>>>>>>> Swift svn swift-r4252 (swift modified locally) cog-r3088 (cog modified locally)
>>>>>>>>>>
>>>>>>>>>> RunID: 20110428-1002-c8rvqhe6
>>>>>>>>>> Progress:
>>>>>>>>>> The application "cat" is not available in your tc.data catalog
>>>>>>>>>> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
>>>>>>>>>> Final status: Failed:1
>>>>>>>>>> The following errors have occurred:
>>>>>>>>>> 1. The application "cat" is not available in your tc.data catalog
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Attached are my .swift, sites.xml and tc.data files.
>>>>>>>>>>
>>>>>>>>>> Could someone indicate whether what I am doing is doable and, if so,
>>>>>>>>>> how I can correctly configure my sites and tc setup?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>> Ketan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>> --
>>>>>>>>> Michael Wilde
>>>>>>>>> Computation Institute, University of Chicago
>>>>>>>>> Mathematics and Computer Science Division
>>>>>>>>> Argonne National Laboratory
>>>>>>>>>
>


