Ok, so I'm still having the issue, which means it wasn't just a screwy connection. I peeked into the logs, and the first thing that pops out at me is these lines:<br>2010-06-07 08:48:18,929-0500 INFO SshPrivateKeyFile Parsing private key file<br>
2010-06-07 08:48:18,935-0500 INFO SshPrivateKeyFile Private key is not in the default format, attempting parse with other supported formats<br>2010-06-07 08:48:18,944-0500 INFO PublicKeyAuthenticationClient Generating data to sign<br>
2010-06-07 08:48:18,945-0500 INFO PublicKeyAuthenticationClient Preparing public key authentication request<br>2010-06-07 08:48:19,006-0500 INFO TransportProtocolCommon Sending SSH_MSG_USERAUTH_REQUEST<br>2010-06-07 08:48:19,051-0500 INFO TransportProtocolCommon Received SSH_MSG_USERAUTH_SUCCESS<br>
2010-06-07 08:48:19,051-0500 INFO ConnectionProtocol Registering connection protocol messages<br>2010-06-07 08:48:19,052-0500 INFO Service ssh-connection has been requested<br>2010-06-07 08:48:19,052-0500 INFO Service Starting ssh-connection service thread<br>
2010-06-07 08:48:19,053-0500 INFO AuthenticationProtocolClient Requesting authentication methods<br>2010-06-07 08:48:19,053-0500 INFO TransportProtocolCommon Sending SSH_MSG_USERAUTH_REQUEST<br>2010-06-07 08:48:19,056-0500 INFO TransportProtocolCommon Received SSH_MSG_UNIMPLEMENTED<br>
<br>And that's the end of the log file. To test things, I tried putting the wrong password into the auth.defaults file to see if it would give me the same error, but it didn't. This is the same private/public key pair I've been using to ssh in for an interactive shell, so I'm pretty sure the key's not at fault. From what I can tell, it hits that last INFO message and then produces no further logs; at least, I can't find any more. No files are being produced in the directory that's created for the run, and no directory is created under the work directory.<br>
<br>Anyone have any thoughts? As far as I can tell, all logging stops as soon as that " INFO TransportProtocolCommon Received SSH_MSG_UNIMPLEMENTED" line is reached, and the progress indicator just loops printing "Progress: Initializing site shared directory:1" repeatedly.<br>
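<br>For anyone who wants to compare against their own logs, here is a rough Python sketch of the check I did by eye. It is not part of Swift or CoG, and the function names are made up for illustration. It assumes each SSH event line contains "Sending" or "Received" followed by the message name, which is all I can infer from the excerpt above. It pulls out the SSH events in order and flags an SSH_MSG_USERAUTH_REQUEST that is sent after SSH_MSG_USERAUTH_SUCCESS has already come back, which is the sequence right before the SSH_MSG_UNIMPLEMENTED in my log.<br>
<br>
```python
import re

# Assumed line shape (taken from the excerpt in this message):
# 2010-06-07 08:48:19,056-0500 INFO TransportProtocolCommon Received SSH_MSG_UNIMPLEMENTED
SSH_EVENT = re.compile(r"(Sending|Received) (SSH_MSG_[A-Z_]+)")


def ssh_events(lines):
    """Return (direction, message) pairs in the order they appear in the log.

    Hypothetical helper for illustration; not a Swift or CoG API.
    """
    events = []
    for line in lines:
        match = SSH_EVENT.search(line)
        if match:
            events.append((match.group(1), match.group(2)))
    return events


def request_after_success(events):
    """True if a USERAUTH_REQUEST is sent after USERAUTH_SUCCESS was received."""
    authenticated = False
    for event in events:
        if event == ("Received", "SSH_MSG_USERAUTH_SUCCESS"):
            authenticated = True
        elif authenticated and event == ("Sending", "SSH_MSG_USERAUTH_REQUEST"):
            return True
    return False


# Four lines copied from the log excerpt above.
sample = [
    "2010-06-07 08:48:19,006-0500 INFO TransportProtocolCommon Sending SSH_MSG_USERAUTH_REQUEST",
    "2010-06-07 08:48:19,051-0500 INFO TransportProtocolCommon Received SSH_MSG_USERAUTH_SUCCESS",
    "2010-06-07 08:48:19,053-0500 INFO TransportProtocolCommon Sending SSH_MSG_USERAUTH_REQUEST",
    "2010-06-07 08:48:19,056-0500 INFO TransportProtocolCommon Received SSH_MSG_UNIMPLEMENTED",
]
events = ssh_events(sample)
print("last SSH event:", events[-1])
print("auth request sent after success:", request_after_success(events))
```
<br>To run it against a real log, pass the lines of the file (for example from open(path)) to ssh_events instead of the sample list. It only describes the sequence; it doesn't explain why the second request is sent.<br>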
<br>Arjun<br><br><br><div class="gmail_quote">On Mon, Jun 7, 2010 at 6:37 AM, Arjun Comar <span dir="ltr"><<a href="mailto:mandaya@rose-hulman.edu">mandaya@rose-hulman.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
You're right, I thought I'd added the PATH info to my bashrc, but it looks like I forgot to. I fixed it and reran, and now I've got a totally new problem, though I suspect my internet connection on this one. When I try to run the script this time, rather than crashing, it just loops on "Initializing site shared directory", a la:<br>
[arjun@bridled ~]$ swift -sites.file .swift/sites-pads-pbs-coasters.xml -tc.file .swift/tc.data helloworld.swift <br>Swift svn swift-r3258 cog-r2726<br><br>RunID: 20100607-0624-5dz82mtc<br>Progress:<br>Progress: Initializing site shared directory:1<br>
Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>
Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>
Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br>Progress: Initializing site shared directory:1<br><br>ad nauseam. I've had internet issues all night, so I'm wondering whether that's the cause; I'll confirm once I get to Argonne in a couple of hours. I haven't checked the logs yet; I'll do that at Argonne.<br>
<font color="#888888">
<br>Arjun</font><div><div></div><div class="h5"><br><br><div class="gmail_quote">On Mon, Jun 7, 2010 at 12:13 AM, <a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a> <span dir="ltr"><<a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Arjun, looking briefly at your logs, it seems the run you tried at about 18:36 on Friday came close: your coasters.log file shows that it failed because there was no valid proxy on login1.<br>
<br>
After that, you reverted from the more recent stable branch code (from /home/wilde/swift/src/stable/.../dist/) back to the old 0.9 release in /common.<br>
<br>
As I mentioned Friday, the old 0.9 release does not have the latest ssh provider code and thus doesn't recognize your auth.defaults parameters.<br>
<br>
So use my swift (or build your own from the stable branch), make sure you have a valid proxy on both sides, and try again. I suspect that will get further.<br>
<br>
You can see from your ~/.globus/coasters/coasters.log file that after you reverted to 0.9, Swift never again got as far as starting coasters, most likely because the ssh connection failed.<br>
<br>
- Mike<br>
<br>
From your .log files:<br>
<br>
login1$ fgrep .home $(ls -1t hello*.log | head -20)<br>
<br>
helloworld-20100606-2209-uuldx126.log: vds.home = /software/common/swift-0.9-r1/bin/..<br>
helloworld-20100606-2207-n9aul0q5.log: vds.home = /software/common/swift-0.9-r1/bin/..<br>
helloworld-20100606-2204-f2x1rm9f.log: vds.home = /software/common/swift-0.9-r1/bin/..<br>
helloworld-20100606-1958-zf7ppjl6.log: vds.home = /software/common/swift-0.9-r1/bin/..<br>
helloworld-20100604-2208-omool1yb.log: vds.home = /software/common/swift-0.9-r1/bin/..<br>
helloworld-20100604-2206-17fmgozg.log: vds.home = /software/common/swift-0.9-r1/bin/..<br>
helloworld-20100604-1836-jp5jbuy5.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1835-83mngdfe.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1835-mvmb56f5.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1834-833fef14.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1833-7tgi5o87.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1832-gbenp2xa.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1831-044dbd38.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1830-ua5qxocg.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1827-b31yuh98.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1826-zxygui3c.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1824-iym4edt3.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
helloworld-20100604-1820-74936sp7.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..<br>
login1$<br>
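<br>
If you'd rather check this from a script than by eye, here is a minimal Python sketch. It assumes the fgrep output format shown above (log name, colon, then "key = value"); the helper names are hypothetical and not part of Swift.<br>
<br>
```python
def install_per_run(grep_lines):
    """Map each log file name to the Swift home directory it recorded.

    Hypothetical helper; assumes lines shaped like the fgrep output above:
    "name.log: vds.home = /some/path" or "name.log: swift.home = /some/path".
    """
    runs = {}
    for line in grep_lines:
        log_name, sep, setting = line.partition(": ")
        if not sep:
            continue  # not a "file: key = value" line, e.g. a shell prompt
        _key, eq, home = setting.partition(" = ")
        if eq:
            runs[log_name.strip()] = home.strip()
    return runs


def runs_using(runs, fragment):
    """Return the sorted log names whose recorded home contains the fragment."""
    return sorted(name for name, home in runs.items() if fragment in home)


# A few lines copied from the listing above, plus the trailing prompt.
sample = [
    "helloworld-20100606-2209-uuldx126.log: vds.home = /software/common/swift-0.9-r1/bin/..",
    "helloworld-20100604-2206-17fmgozg.log: vds.home = /software/common/swift-0.9-r1/bin/..",
    "helloworld-20100604-1836-jp5jbuy5.log: swift.home = /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..",
    "login1$",
]
runs = install_per_run(sample)
print(runs_using(runs, "swift-0.9"))
```
<br>
Any run listed for "swift-0.9" was made with the old release rather than the stable branch build.<br>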
<div><div></div><div><br>
<br>
<br>
----- "Arjun Comar" <<a href="mailto:mandaya@rose-hulman.edu" target="_blank">mandaya@rose-hulman.edu</a>> wrote:<br>
<br>
> Alright, I've been playing with this for a few hours, but I can't<br>
> manage to get any further. The sites.xml file isn't up to date, the<br>
> one you want to see is sites-pads-pbs-coasters.xml. So I ran it a<br>
> couple times, saving logs, etc. and noticed that in the<br>
> .globus/coasters/coasters.log file, the JVM was being started with a<br>
> -DGLOBUS_HOSTNAME=login.pads.ci.uchicago. So I tried setting<br>
> GLOBUS_HOSTNAME to login1.pads.ci.uchicago. But even after that, the<br>
> log file still showed the former. And the log shows an exception being<br>
> thrown. So my hunch is to figure out how to force GLOBUS_HOSTNAME to<br>
> get set. Anyone have any thoughts? Am I barking up the wrong tree?<br>
><br>
> Arjun<br>
><br>
><br>
> On Sat, Jun 5, 2010 at 9:53 AM, <a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a> <<a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a>> wrote:<br>
><br>
><br>
> Looking at your latest logs, in particular coasters.log in your<br>
> ~/.globus/coasters dir, Swift is still unable to create a secure<br>
> connection using GSI: it thinks there is not a valid proxy in<br>
> /tmp/x509/:<br>
><br>
> Looking at your sites.xml files, this is because you are telling Swift<br>
> to run at the hostname "<a href="http://login.ci.uchicago.edu" target="_blank">login.ci.uchicago.edu</a>", a load-balancing<br>
> virtual DNS host that rotates between login1 and login2.<br>
><br>
> I suspect that the coaster service tried to start on login2 while you<br>
> made the proxy on login1, or something similar. It's a good exercise<br>
> for you to examine all the logs involved to confirm or disprove this<br>
> theory. Look at:<br>
><br>
> - the detailed swift .log file<br>
> - the $HOME/.globus/coasters/coasters.log file<br>
> - the $HOME/.globus/scripts PBS submit file, stdout/err, and exitcode<br>
> files<br>
> - your proxy files in the local /tmp dirs of the machines that<br>
> grid-proxy-init was run on<br>
> - ifconfig (note that pads login hosts have multiple networks)<br>
><br>
> ---<br>
><br>
> <a href="http://login1.pads.ci.uchicago.edu" target="_blank">login1.pads.ci.uchicago.edu</a><br>
> login1$ ls -lt /tmp/x* | head<br>
> -rw------- 1 arjun ci-users 2995 Jun 4 22:01 /tmp/x509up_u1857<br>
> ---<br>
><br>
> I don't have time at the moment to trace this all back for you, but I<br>
> suggest two steps:<br>
><br>
> 1) specify login1 everywhere you have "login" in sites.xml and<br>
> auth.defaults<br>
><br>
> 2) look at the logs in your ~/.globus/coasters and ~/.globus/scripts<br>
> directories, perhaps moving the old logs out to a save/ directory each<br>
> time (keep them for debugging until you resolve this). You'll be able<br>
> to tell from host names and IP addresses which login host was used.<br>
><br>
> You may need to set GLOBUS_HOSTNAME, but I am not sure about that (see<br>
> the users guide and swift-user and devel lists for more info on that,<br>
> then ask on the list if still not clear).<br>
><br>
> If the problem persists after you set everything to use the specific<br>
> login host login1, then be sure to send the exact error message<br>
> you are getting and the locations of all the log files. Even<br>
> though the top-level error may seem the same to you, the logs may<br>
> indicate that the underlying error changes as you correct various<br>
> aspects of the configuration and security context.<br>
><br>
> - Mike<br>
><br>
><br>
><br>
> login1$ grep login.pads *.xml<br>
> sites.xml: <filesystem url=" <a href="http://login.pads.ci.uchicago.edu" target="_blank">login.pads.ci.uchicago.edu</a> " provider="ssh"/><br>
> sites.xml: <execution url=" <a href="http://login.pads.ci.uchicago.edu" target="_blank">login.pads.ci.uchicago.edu</a> " provider="ssh"/><br>
> testsites.xml: <execution provider="coaster" url=" <a href="http://login.pads.ci.uchicago.edu" target="_blank">login.pads.ci.uchicago.edu</a> " jobmanager="ssh:pbs"/><br>
> testsites.xml: <filesystem provider="ssh" url=" <a href="http://login.pads.ci.uchicago.edu" target="_blank">login.pads.ci.uchicago.edu</a> "/><br>
><br>
><br>
><br>
><br>
><br>
><br>
> ----- "Arjun Comar" < <a href="mailto:mandaya@rose-hulman.edu" target="_blank">mandaya@rose-hulman.edu</a> > wrote:<br>
><br>
> > Just realized I only sent this to Mike. I'm resending it to<br>
> > swift-devel.<br>
> ><br>
> ><br>
> > On Fri, Jun 4, 2010 at 10:11 PM, Arjun Comar <<a href="mailto:mandaya@rose-hulman.edu" target="_blank">mandaya@rose-hulman.edu</a>> wrote:<br>
> ><br>
> ><br>
> > Nope, no luck. Here's grid-proxy-info from both:<br>
> ><br>
> > pads:<br>
> > subject : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820/CN=53942264<br>
> > issuer : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820<br>
> > identity : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820<br>
> > type : RFC 3820 compliant impersonation proxy<br>
> > strength : 512 bits<br>
> > path : /tmp/x509up_u1857<br>
> > timeleft : 11:52:08<br>
> ><br>
> > bridled:<br>
> > subject : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820/CN=1363223477<br>
> > issuer : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820<br>
> > identity : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820<br>
> > type : RFC 3820 compliant impersonation proxy<br>
> > strength : 512 bits<br>
> > path : /tmp/x509up_u1857<br>
> > timeleft : 11:57:52<br>
> ><br>
> > Used the same passphrase to get both proxies, and set no options on<br>
> > grid-proxy-init.<br>
> ><br>
> > Arjun<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > On Fri, Jun 4, 2010 at 9:00 PM, <a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a> <<a href="mailto:wilde@mcs.anl.gov" target="_blank">wilde@mcs.anl.gov</a>> wrote:<br>
> ><br>
> ><br>
> > When you use this configuration for running jobs from a submit host to<br>
> > a PBS cluster using ssh to launch the coaster service on the PBS login<br>
> > host, you need to create a GSI proxy (using grid-proxy-init) on both<br>
> > the client and on the remote login host, using the same certificate.<br>
> ><br>
> > <pool handle="coasterpads"><br>
> > <execution provider="coaster" url=" <a href="http://login1.pads.ci.uchicago.edu" target="_blank">login1.pads.ci.uchicago.edu</a> "<br>
> > jobmanager="ssh:pbs"/><br>
> > <profile namespace="globus" key="maxtime">3000</profile><br>
> > <profile namespace="globus" key="workersPerNode">8</profile><br>
> > <profile namespace="globus" key="slots">1</profile><br>
> > <profile namespace="globus" key="nodeGranularity">1</profile><br>
> > <profile namespace="globus" key="maxNodes">1</profile><br>
> > <profile namespace="globus" key="queue">fast</profile><br>
> > <profile namespace="karajan" key="jobThrottle">0.5</profile><br>
> > <profile namespace="karajan" key="initialScore">10000</profile><br>
> > <filesystem provider="ssh" url=" <a href="http://login1.pads.ci.uchicago.edu" target="_blank">login1.pads.ci.uchicago.edu</a> "/><br>
> > <workdirectory>/home/wilde/swift/lab</workdirectory><br>
> > </pool><br>
> ><br>
> > Arjun, this is, I think, what was causing your workflow to fail.<br>
> ><br>
> > I thought that, in the past, we used to get at least a GSI (grid<br>
> > security infrastructure) error in the detailed log file. But I don't<br>
> > see that in this case.<br>
> ><br>
> > Let me know if creating proxies on both sides works for you. Be sure<br>
> > to create it on the right PADS login host.<br>
> ><br>
> > David and Arjun, can you coordinate on integrating this use case into<br>
> > the tutorial (and eventually the Users Guide)? I suggested we do a<br>
> > series of "profiles" (with diagrams) to show the various ways of<br>
> > running Swift locally and remotely, and provide accompanying site file<br>
> > entries. Dennis, when you get started next week and try these cases,<br>
> > we'll want to find a way to do automated tests for them.<br>
> ><br>
> > Thanks,<br>
> ><br>
> > Mike<br>
> ><br>
> > --<br>
> ><br>
> > Michael Wilde<br>
> > Computation Institute, University of Chicago<br>
> > Mathematics and Computer Science Division<br>
> > Argonne National Laboratory<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > --<br>
> > Arjun Comar, Rose-Hulman '12<br>
><br>
> --<br>
><br>
><br>
><br>
> Michael Wilde<br>
> Computation Institute, University of Chicago<br>
> Mathematics and Computer Science Division<br>
> Argonne National Laboratory<br>
><br>
><br>
><br>
><br>
> --<br>
> Arjun Comar, Rose-Hulman '12<br>
<br>
</div></div>--<br>
<div><div></div><div>Michael Wilde<br>
Computation Institute, University of Chicago<br>
Mathematics and Computer Science Division<br>
Argonne National Laboratory<br>
<br>
</div></div></blockquote></div><br><br clear="all"><br></div></div>-- <br><div><div></div><div class="h5">Arjun Comar, Rose-Hulman '12<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Arjun Comar, Rose-Hulman '12<br>