[Swift-devel] Re: Using coaster provider with jobmanager ssh:pbs
Arjun Comar
mandaya at rose-hulman.edu
Mon Jun 7 09:01:09 CDT 2010
Ok, so I'm still having the issue, meaning it wasn't just a screwy
connection. I peeked into the logs and the first thing that pops out at
me is this set of lines:
2010-06-07 08:48:18,929-0500 INFO SshPrivateKeyFile Parsing private key file
2010-06-07 08:48:18,935-0500 INFO SshPrivateKeyFile Private key is not in the default format, attempting parse with other supported formats
2010-06-07 08:48:18,944-0500 INFO PublicKeyAuthenticationClient Generating data to sign
2010-06-07 08:48:18,945-0500 INFO PublicKeyAuthenticationClient Preparing public key authentication request
2010-06-07 08:48:19,006-0500 INFO TransportProtocolCommon Sending SSH_MSG_USERAUTH_REQUEST
2010-06-07 08:48:19,051-0500 INFO TransportProtocolCommon Received SSH_MSG_USERAUTH_SUCCESS
2010-06-07 08:48:19,051-0500 INFO ConnectionProtocol Registering connection protocol messages
2010-06-07 08:48:19,052-0500 INFO Service ssh-connection has been requested
2010-06-07 08:48:19,052-0500 INFO Service Starting ssh-connection service thread
2010-06-07 08:48:19,053-0500 INFO AuthenticationProtocolClient Requesting authentication methods
2010-06-07 08:48:19,053-0500 INFO TransportProtocolCommon Sending SSH_MSG_USERAUTH_REQUEST
2010-06-07 08:48:19,056-0500 INFO TransportProtocolCommon Received SSH_MSG_UNIMPLEMENTED
And that's the end of the log file. To test things, I tried sticking the
wrong password into the auth.defaults file to see if it would give me the
same error, but it didn't. This is the same private/public key pair I've
been using to ssh in for an interactive shell so I'm pretty sure the key's
not at fault. But from what I can tell, it hits that last INFO message, and
then produces no further logs. At least, I can't find any more. No files are
being produced and stuck into the directory that's created for the run. And
no directory is created under the work directory.
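
One way to double-check where the log stops is to parse it mechanically. The
Python sketch below is only an illustration. It assumes each record starts with
a timestamp followed by a level, a class name and a message (as in the excerpt
above), and it treats any line without a timestamp as a continuation of the
previous record, so it copes with records wrapped by a mail client. The names
parse_records and summarize_ssh are made up for this example and are not part
of Swift or CoG.

```python
import re

# Assumed record layout, taken from the log excerpt in this message:
# "<date> <time>,<ms><tz> <LEVEL> <Class> <message>"
RECORD = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4})\s+"
    r"(\w+)\s+(\S+)\s+(.*)$"
)


def parse_records(lines):
    """Return (timestamp, level, source, message) tuples.

    A line that does not start with a timestamp is appended to the
    message of the previous record (hypothetical handling of wrapping).
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        match = RECORD.match(line)
        if match:
            records.append(list(match.groups()))
        elif records and line.strip():
            records[-1][3] += " " + line.strip()
    return [tuple(record) for record in records]


def summarize_ssh(records):
    """Report whether auth succeeded and how the log ends."""
    messages = [record[3] for record in records]
    last = messages[-1] if messages else None
    return {
        "auth_succeeded": any("SSH_MSG_USERAUTH_SUCCESS" in m for m in messages),
        "last_message": last,
        "ends_unimplemented": last is not None and "SSH_MSG_UNIMPLEMENTED" in last,
    }


# A few records copied from the excerpt, still wrapped as they were mailed.
SAMPLE = """\
2010-06-07 08:48:19,006-0500 INFO TransportProtocolCommon Sending
SSH_MSG_USERAUTH_REQUEST
2010-06-07 08:48:19,051-0500 INFO TransportProtocolCommon Received
SSH_MSG_USERAUTH_SUCCESS
2010-06-07 08:48:19,053-0500 INFO TransportProtocolCommon Sending
SSH_MSG_USERAUTH_REQUEST
2010-06-07 08:48:19,056-0500 INFO TransportProtocolCommon Received
SSH_MSG_UNIMPLEMENTED
"""

if __name__ == "__main__":
    print(summarize_ssh(parse_records(SAMPLE.splitlines())))
```

Run against the sample records, this reports that authentication succeeded and
that the final record is the SSH_MSG_UNIMPLEMENTED one, which matches reading
the log by eye. To use it on a real run, pass the lines of the Swift .log file
instead of SAMPLE.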
Anyone have any thoughts? As far as I can tell, all logging stops as soon as
that "INFO TransportProtocolCommon Received SSH_MSG_UNIMPLEMENTED" line is
reached, and the progress indicator just loops, printing "Progress:
Initializing site shared directory:1" repeatedly.
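
To tell a genuine hang from a slow start, the captured progress output can be
checked for a long run of one identical line. This is a small Python sketch
under stated assumptions: the progress lines have already been collected into a
list (for example from redirected stdout), and the threshold of 10 repeats is
arbitrary. longest_repeat and looks_stalled are hypothetical helper names, not
anything Swift provides.

```python
from itertools import groupby


def longest_repeat(lines):
    """Return (text, count) for the longest run of identical consecutive lines.

    Blank lines are ignored and surrounding whitespace is stripped.
    """
    cleaned = (line.strip() for line in lines if line.strip())
    best_text, best_count = None, 0
    for text, run in groupby(cleaned):
        count = sum(1 for _ in run)
        if count > best_count:
            best_text, best_count = text, count
    return best_text, best_count


def looks_stalled(lines, threshold=10):
    """Guess that a run is stuck if one line repeats `threshold` times or more."""
    _, count = longest_repeat(lines)
    return count >= threshold


if __name__ == "__main__":
    output = ["Progress:"] + ["Progress: Initializing site shared directory:1"] * 15
    print(longest_repeat(output), looks_stalled(output))
```

This only flags the symptom. The cause still has to come from the detailed
.log file.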
Arjun
On Mon, Jun 7, 2010 at 6:37 AM, Arjun Comar <mandaya at rose-hulman.edu> wrote:
> You're right, I'd thought I stuck the PATH info to bashrc but looks like I
> forgot to. I fixed it and reran, and now I've got a totally new problem,
> though I suspect my internet connection on this one. When I try and run the
> script this time, rather than crash, it just loops on "Initializing site
> shared directory" a la:
> [arjun at bridled ~]$ swift -sites.file .swift/sites-pads-pbs-coasters.xml -tc.file .swift/tc.data helloworld.swift
> Swift svn swift-r3258 cog-r2726
>
> RunID: 20100607-0624-5dz82mtc
> Progress:
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
> Progress: Initializing site shared directory:1
>
> ad nauseam. I've had internet issues all night so I'm wondering if it's
> not a problem due to that, so I'll confirm once I come to Argonne in a
> couple hours. Haven't checked the logs yet, I'll do that at Argonne.
>
> Arjun
>
>
> On Mon, Jun 7, 2010 at 12:13 AM, wilde at mcs.anl.gov <wilde at mcs.anl.gov> wrote:
>
>> Arjun, looking briefly at your logs, it seems like the run you tried at
>> about 18:36 on Friday came close - it shows in your coasters.log file that
>> it failed because there was no valid proxy on login1.
>>
>> After that, you reverted from using the more recent stable branch code
>> (from /home/wilde/swift/src/stable/.../dist/) back to the old 0.9 release in
>> /common.
>>
>> Like I mentioned Friday, the old 0.9 release does not have the latest ssh
>> provider code and thus doesn't recognize your auth.defaults parameters.
>>
>> So use my swift (or build your own from stable branch), make sure you have
>> a valid proxy on both sides, and try again. I suspect that will progress
>> further.
>>
>> You can see that after you reverted back to 0.9, Swift never again got as
>> far as starting coasters (from your ~/.globus/coasters/coasters.log file)
>> because the ssh likely failed (I suspect).
>>
>> - Mike
>>
>> From your .log files:
>>
>> login1$ fgrep .home $(ls -1t hello*.log | head -20)
>>
>> helloworld-20100606-2209-uuldx126.log: vds.home = /software/common/swift-0.9-r1/bin/..
>> helloworld-20100606-2207-n9aul0q5.log: vds.home = /software/common/swift-0.9-r1/bin/..
>> helloworld-20100606-2204-f2x1rm9f.log: vds.home = /software/common/swift-0.9-r1/bin/..
>> helloworld-20100606-1958-zf7ppjl6.log: vds.home = /software/common/swift-0.9-r1/bin/..
>> helloworld-20100604-2208-omool1yb.log: vds.home = /software/common/swift-0.9-r1/bin/..
>> helloworld-20100604-2206-17fmgozg.log: vds.home = /software/common/swift-0.9-r1/bin/..
>> helloworld-20100604-1836-jp5jbuy5.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1835-83mngdfe.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1835-mvmb56f5.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1834-833fef14.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1833-7tgi5o87.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1832-gbenp2xa.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1831-044dbd38.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1830-ua5qxocg.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1827-b31yuh98.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1826-zxygui3c.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1824-iym4edt3.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> helloworld-20100604-1820-74936sp7.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
>> login1$
>>
>>
>>
>> ----- "Arjun Comar" <mandaya at rose-hulman.edu> wrote:
>>
>> > Alright, I've been playing with this for a few hours, but I can't
>> > manage to get any further. The sites.xml file isn't up to date, the
>> > one you want to see is sites-pads-pbs-coasters.xml. So I ran it a
>> > couple times, saving logs, etc. and noticed that in the
>> > .globus/coasters/coasters.log file, the jvm was being started with a
>> > -DGLOBUS_HOSTNAME=login.pads.ci.uchicago. So I tried setting
>> > GLOBUS_HOSTNAME to login1.pads.ci.uchicago. But even after that, the
>> > log file still showed the former. And the log shows an exception being
>> > thrown. So my hunch is to figure out how to force GLOBUS_HOSTNAME to
>> > get set. Anyone have any thoughts? Am I barking up the wrong tree?
>> >
>> > Arjun
>> >
>> >
>> > On Sat, Jun 5, 2010 at 9:53 AM, wilde at mcs.anl.gov <wilde at mcs.anl.gov> wrote:
>> >
>> >
>> > Looking at your latest logs, in particular coaster.log in your
>> > ~/.globus/coasters dir, Swift is still unable to create a secure
>> > connection using GSI: it thinks there is not a valid proxy in
>> > /tmp/x509/:
>> >
>> > Looking at your sites.xml files, this is because you are telling Swift
>> > to run at the hostname "login.ci.uchicago.edu", a load-balancing
>> > virtual DNS host that rotates between login1 and login2.
>> >
>> > I suspect that the coaster service tried to start on login2 while you
>> > made the proxy on login1, or something similar. It's a good exercise
>> > for you to examine all the logs involved to confirm or disprove this
>> > theory. Look at:
>> >
>> > - the detailed swift .log file
>> > - the $HOME/.globus/coasters/coasters.log file
>> > - the $HOME/.globus/scripts PBS submit file, stdout/err, and exitcode
>> > files
>> > - your proxy files in the local /tmp dirs of the machines that
>> > grid-proxy-init was run on
>> > - ifconfig (note that pads login hosts have multiple networks)
>> >
>> > ---
>> >
>> > login1.pads.ci.uchicago.edu
>> > login1$ ls -lt /tmp/x* | head
>> > -rw------- 1 arjun ci-users 2995 Jun 4 22:01 /tmp/x509up_u1857
>> > ---
>> >
>> > I don't have time at the moment to trace this all back for you, but I
>> > suggest two steps:
>> >
>> > 1) specify login1 everywhere you have "login" in sites.xml and
>> > auth.defaults
>> >
>> > 2) look at the logs in your ~/.globus/coasters and ~/.globus/scripts
>> > directories, perhaps moving the old logs out to a save/ directory each
>> > time (save them for debugging till you resolve this). You'll be able to
>> > tell which host was used from the host names and IP addresses.
>> >
>> > You may need to set GLOBUS_HOSTNAME, but I am not sure about that (see
>> > the users guide and swift-user and devel lists for more info on that,
>> > then ask on the list if still not clear).
>> >
>> > If the problem persists after you set everything to use the specific
>> > login host login1, then be sure to send the exact error message
>> > you are getting and the locations of all the log files, as even
>> > though the top-level error seems the same to you, the logs may
>> > indicate that the underlying error changes as you correct various
>> > aspects of the configuration and security context.
>> >
>> > - Mike
>> >
>> >
>> >
>> > login1$ grep login.pads *.xml
>> > sites.xml: <filesystem url="login.pads.ci.uchicago.edu" provider="ssh"/>
>> > sites.xml: <execution url="login.pads.ci.uchicago.edu" provider="ssh"/>
>> > testsites.xml: <execution provider="coaster" url="login.pads.ci.uchicago.edu" jobmanager="ssh:pbs"/>
>> > testsites.xml: <filesystem provider="ssh" url="login.pads.ci.uchicago.edu"/>
>> >
>> >
>> >
>> >
>> >
>> >
>> > ----- "Arjun Comar" <mandaya at rose-hulman.edu> wrote:
>> >
>> > > Just realized I only sent this to Mike. I'm resending it to
>> > > swift-devel.
>> > >
>> > >
>> > > On Fri, Jun 4, 2010 at 10:11 PM, Arjun Comar <mandaya at rose-hulman.edu> wrote:
>> > >
>> > >
>> > > Nope, no luck. Here's grid-proxy-info from both:
>> > >
>> > > pads:
>> > > subject : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820/CN=53942264
>> > > issuer : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820
>> > > identity : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820
>> > > type : RFC 3820 compliant impersonation proxy
>> > > strength : 512 bits
>> > > path : /tmp/x509up_u1857
>> > > timeleft : 11:52:08
>> > >
>> > > bridled:
>> > > subject : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820/CN=1363223477
>> > > issuer : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820
>> > > identity : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820
>> > > type : RFC 3820 compliant impersonation proxy
>> > > strength : 512 bits
>> > > path : /tmp/x509up_u1857
>> > > timeleft : 11:57:52
>> > >
>> > > Used the same passphrase to get both proxies, and set no options on
>> > > grid-proxy-init.
>> > >
>> > > Arjun
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Fri, Jun 4, 2010 at 9:00 PM, wilde at mcs.anl.gov <wilde at mcs.anl.gov> wrote:
>> > >
>> > >
>> > > When you use this configuration for running jobs from a submit host to
>> > > a PBS cluster using ssh to launch the coaster service on the PBS login
>> > > host, you need to create a GSI proxy (using grid-proxy-init) on both
>> > > the client and on the remote login host, using the same certificate.
>> > >
>> > > <pool handle="coasterpads">
>> > > <execution provider="coaster" url="login1.pads.ci.uchicago.edu" jobmanager="ssh:pbs"/>
>> > > <profile namespace="globus" key="maxtime">3000</profile>
>> > > <profile namespace="globus" key="workersPerNode">8</profile>
>> > > <profile namespace="globus" key="slots">1</profile>
>> > > <profile namespace="globus" key="nodeGranularity">1</profile>
>> > > <profile namespace="globus" key="maxNodes">1</profile>
>> > > <profile namespace="globus" key="queue">fast</profile>
>> > > <profile namespace="karajan" key="jobThrottle">0.5</profile>
>> > > <profile namespace="karajan" key="initialScore">10000</profile>
>> > > <filesystem provider="ssh" url="login1.pads.ci.uchicago.edu"/>
>> > > <workdirectory>/home/wilde/swift/lab</workdirectory>
>> > > </pool>
>> > >
>> > > Arjun, this is, I think, what was causing your workflow to fail.
>> > >
>> > > I thought that, in the past, we at least got a GSI (grid
>> > > security infrastructure) error in the detailed log file. But I don't
>> > > see that in this case.
>> > >
>> > > Let me know if creating proxies on both sides works for you. Be sure
>> > > to create it on the right PADS login host.
>> > >
>> > > David and Arjun, can you coordinate on integrating this use case into
>> > > the tutorial (and eventually the Users Guide)? I suggested we do a
>> > > series of "profiles" (with diagrams) to show the various ways of
>> > > running Swift locally and remotely, and provide accompanying site file
>> > > entries. Dennis, when you get started next week and try these cases,
>> > > we'll want to find a way to do automated tests for them.
>> > >
>> > > Thanks,
>> > >
>> > > Mike
>> > >
>> > > --
>> > >
>> > > Michael Wilde
>> > > Computation Institute, University of Chicago
>> > > Mathematics and Computer Science Division
>> > > Argonne National Laboratory
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Arjun Comar, Rose-Hulman '12
--
Arjun Comar, Rose-Hulman '12