[Swift-devel] Re: Using coaster provider with jobmanager ssh:pbs

Arjun Comar mandaya at rose-hulman.edu
Mon Jun 7 06:37:01 CDT 2010


You're right, I thought I had added the PATH info to my bashrc, but it looks
like I forgot to. I fixed it and reran, and now I've got a totally new
problem, though I suspect my internet connection on this one. When I run the
script this time, rather than crashing, it just loops on "Initializing site
shared directory", like so:
[arjun at bridled ~]$ swift -sites.file .swift/sites-pads-pbs-coasters.xml
-tc.file .swift/tc.data helloworld.swift
Swift svn swift-r3258 cog-r2726

RunID: 20100607-0624-5dz82mtc
Progress:
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1

ad nauseam. I've had internet issues all night, so I'm wondering whether
that is the cause; I'll confirm once I get to Argonne in a couple of hours.
I haven't checked the logs yet; I'll do that at Argonne.
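
In the meantime, a quick way to confirm that the PATH fix took effect might
be to look at the home directory recorded in the newest run log, along the
lines of the fgrep Mike ran (quoted below). This is only a sketch: the
function name is made up, and it assumes the run logs sit in the current
directory and each record a swift.home or vds.home line.

```shell
# Print the swift.home / vds.home line recorded in the newest run log.
# Hypothetical helper; assumes logs are named <script>-<runid>.log and
# live in the current directory.
newest_swift_home() {
    prefix=${1:-helloworld}
    # ls -1t lists newest first; keep only the first entry
    newest=$(ls -1t "$prefix"-*.log 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no $prefix-*.log files found" >&2
        return 1
    fi
    # -F matches the literal string ".home" rather than a regex
    grep -F -- '.home' "$newest"
}
```

If the line printed still points at the 0.9 release under /software/common,
the PATH change has not been picked up by the shell that launched swift.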

Arjun

On Mon, Jun 7, 2010 at 12:13 AM, wilde at mcs.anl.gov <wilde at mcs.anl.gov> wrote:

> Arjun, looking briefly at your logs, it seems like the run you tried at
> about 18:36 on Friday came close - it shows in your coasters.log file that
> it failed because there was no valid proxy on login1.
>
> After that, you reverted from using the more recent stable branch code
> (from /home/wilde/swift/src/stable/.../dist/) back to the old 0.9 release
> in /common.
>
> As I mentioned Friday, the old 0.9 release does not have the latest ssh
> provider code and thus doesn't recognize your auth.defaults parameters.
>
> So use my swift (or build your own from stable branch), make sure you have
> a valid proxy on both sides, and try again. I suspect that will progress
> further.
>
> You can see that after you reverted to 0.9, Swift never again got as
> far as starting coasters (from your ~/.globus/coasters/coasters.log file),
> most likely because the ssh connection failed.
>
> - Mike
>
> From your .log files:
>
> login1$ fgrep .home $(ls -1t hello*.log | head -20)
>
> helloworld-20100606-2209-uuldx126.log:  vds.home =
> /software/common/swift-0.9-r1/bin/..
> helloworld-20100606-2207-n9aul0q5.log:  vds.home =
> /software/common/swift-0.9-r1/bin/..
> helloworld-20100606-2204-f2x1rm9f.log:  vds.home =
> /software/common/swift-0.9-r1/bin/..
> helloworld-20100606-1958-zf7ppjl6.log:  vds.home =
> /software/common/swift-0.9-r1/bin/..
> helloworld-20100604-2208-omool1yb.log:  vds.home =
> /software/common/swift-0.9-r1/bin/..
> helloworld-20100604-2206-17fmgozg.log:  vds.home =
> /software/common/swift-0.9-r1/bin/..
> helloworld-20100604-1836-jp5jbuy5.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1835-83mngdfe.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1835-mvmb56f5.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1834-833fef14.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1833-7tgi5o87.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1832-gbenp2xa.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1831-044dbd38.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1830-ua5qxocg.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1827-b31yuh98.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1826-zxygui3c.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1824-iym4edt3.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> helloworld-20100604-1820-74936sp7.log:  swift.home =
> /home/wilde/swift/src/stable/cog/modules/swift/dist/swift-svn/bin/..
> login1$
>
>
>
> ----- "Arjun Comar" <mandaya at rose-hulman.edu> wrote:
>
> > Alright, I've been playing with this for a few hours, but I can't
> > manage to get any further. The sites.xml file isn't up to date, the
> > one you want to see is sites-pads-pbs-coasters.xml. So I ran it a
> > couple times, saving logs, etc. and noticed that in the
> > .globus/coasters/coasters.log file, the jvm was being started with a
> > -DGLOBUS_HOSTNAME=login.pads.ci.uchicago. So I tried setting
> > GLOBUS_HOSTNAME to login1.pads.ci.uchicago. But even after that, the
> > log file still showed the former. And the log shows an exception being
> > thrown. So my hunch is to figure out how to force GLOBUS_HOSTNAME to
> > get set. Anyone have any thoughts? Am I barking up the wrong tree?
> >
> > Arjun
> >
> >
> > On Sat, Jun 5, 2010 at 9:53 AM, wilde at mcs.anl.gov <wilde at mcs.anl.gov> wrote:
> >
> >
> > Looking at your latest logs, in particular coasters.log in your
> > ~/.globus/coasters dir, Swift is still unable to create a secure
> > connection using GSI: it thinks there is no valid proxy in
> > /tmp/x509/.
> >
> > Looking at your sites.xml files, this is because you are telling Swift
> > to run at the hostname "login.pads.ci.uchicago.edu", a load-balancing
> > virtual DNS host that rotates between login1 and login2.
> >
> > I suspect that the coaster service tried to start on login2 while you
> > made the proxy on login1, or something similar. It's a good exercise
> > for you to examine all the logs involved to confirm or disprove this
> > theory. Look at:
> >
> > - the detailed swift .log file
> > - the $HOME/.globus/coasters/coasters.log file
> > - the $HOME/.globus/scripts PBS submit file, stdout/err, and exitcode
> > files
> > - your proxy files in the local /tmp dirs of the machines that
> > grid-proxy-init was run on
> > - ifconfig (note that pads login hosts have multiple networks)
> >
> > ---
> >
> > login1.pads.ci.uchicago.edu
> > login1$ ls -lt /tmp/x* | head
> > -rw------- 1 arjun ci-users 2995 Jun 4 22:01 /tmp/x509up_u1857
> > ---
> >
> > I don't have time at the moment to trace this all back for you, but I
> > suggest two steps:
> >
> > 1) specify login1 everywhere you have "login" in sites.xml and
> > auth.defaults
> >
> > 2) look at the logs in your ~/.globus/coasters and ~/.globus/scripts
> > directories, perhaps moving the old logs out to a save/ directory each
> > time (save them for debugging till you resolve this). You'll be able to
> > tell from the host names and IP addresses which login host was used.
> >
> > You may need to set GLOBUS_HOSTNAME, but I am not sure about that (see
> > the users guide and swift-user and devel lists for more info on that,
> > then ask on the list if still not clear).
> >
> > If the problem persists after you set everything to use the specific
> > login host login1, then be sure to send the exact error message you
> > are getting and the locations of all the log files. Even though the
> > top-level error may seem the same to you, the logs may indicate that
> > the underlying error changes as you correct various aspects of the
> > configuration and security context.
> >
> > - Mike
> >
> >
> >
> > login1$ grep login.pads *.xml
> > sites.xml: <filesystem url="login.pads.ci.uchicago.edu"
> > provider="ssh"/>
> > sites.xml: <execution url="login.pads.ci.uchicago.edu"
> > provider="ssh"/>
> > testsites.xml: <execution provider="coaster"
> > url="login.pads.ci.uchicago.edu" jobmanager="ssh:pbs"/>
> > testsites.xml: <filesystem provider="ssh"
> > url="login.pads.ci.uchicago.edu"/>
> >
> >
> >
> >
> >
> >
> > ----- "Arjun Comar" < mandaya at rose-hulman.edu > wrote:
> >
> > > Just realized I only sent this to Mike. I'm resending it to
> > > swift-devel.
> > >
> > >
> > > On Fri, Jun 4, 2010 at 10:11 PM, Arjun Comar <mandaya at rose-hulman.edu> wrote:
> > >
> > >
> > > Nope, no luck. Here's grid-proxy-info from both:
> > >
> > > pads:
> > > subject : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar
> > > 693820/CN=53942264
> > > issuer : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820
> > > identity : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820
> > > type : RFC 3820 compliant impersonation proxy
> > > strength : 512 bits
> > > path : /tmp/x509up_u1857
> > > timeleft : 11:52:08
> > >
> > > bridled:
> > > subject : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar
> > > 693820/CN=1363223477
> > > issuer : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820
> > > identity : /DC=org/DC=doegrids/OU=People/CN=Arjun Comar 693820
> > > type : RFC 3820 compliant impersonation proxy
> > > strength : 512 bits
> > > path : /tmp/x509up_u1857
> > > timeleft : 11:57:52
> > >
> > > I used the same passphrase to get both proxies, and set no options on
> > > grid-proxy-init.
> > >
> > > Arjun
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jun 4, 2010 at 9:00 PM, wilde at mcs.anl.gov <wilde at mcs.anl.gov> wrote:
> > >
> > >
> > > When you use this configuration for running jobs from a submit host
> > > to a PBS cluster, using ssh to launch the coaster service on the PBS
> > > login host, you need to create a GSI proxy (using grid-proxy-init) on
> > > both the client and the remote login host, using the same certificate.
> > >
> > > <pool handle="coasterpads">
> > >   <execution provider="coaster" url="login1.pads.ci.uchicago.edu"
> > >              jobmanager="ssh:pbs"/>
> > >   <profile namespace="globus" key="maxtime">3000</profile>
> > >   <profile namespace="globus" key="workersPerNode">8</profile>
> > >   <profile namespace="globus" key="slots">1</profile>
> > >   <profile namespace="globus" key="nodeGranularity">1</profile>
> > >   <profile namespace="globus" key="maxNodes">1</profile>
> > >   <profile namespace="globus" key="queue">fast</profile>
> > >   <profile namespace="karajan" key="jobThrottle">0.5</profile>
> > >   <profile namespace="karajan" key="initialScore">10000</profile>
> > >   <filesystem provider="ssh" url="login1.pads.ci.uchicago.edu"/>
> > >   <workdirectory>/home/wilde/swift/lab</workdirectory>
> > > </pool>
> > >
> > > Arjun, I think this is what was causing your workflow to fail.
> > >
> > > I thought that in the past we at least got a GSI (Grid Security
> > > Infrastructure) error in the detailed log file, but I don't see that
> > > in this case.
> > >
> > > Let me know if creating proxies on both sides works for you. Be sure
> > > to create it on the right PADS login host.
> > >
> > > David and Arjun, can you coordinate on integrating this use case
> > > into the tutorial (and eventually the Users Guide)? I suggested we do
> > > a series of "profiles" (with diagrams) to show the various ways of
> > > running Swift locally and remotely, and provide accompanying site
> > > file entries. Dennis, when you get started next week and try these
> > > cases, we'll want to find a way to do automated tests for them.
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> > > --
> > >
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> > >
>
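
On Mike's point above about having a valid proxy on both sides: one way to
compare the two might be to turn the "timeleft" line that grid-proxy-info
prints (as in the output quoted earlier) into seconds on each host. This is
a rough sketch, not a tested recipe; the function name is made up, and it
assumes the "timeleft : H:M:S" format shown in that output.

```shell
# Read grid-proxy-info output on stdin and print the remaining proxy
# lifetime in seconds. Exits non-zero if no usable "timeleft" line is seen.
# Hypothetical helper; assumes a line of the form "timeleft : H:M:S".
proxy_seconds_left() {
    awk '$1 == "timeleft" {
             # the last field holds H:M:S
             n = split($NF, t, ":")
             if (n == 3) {
                 print t[1] * 3600 + t[2] * 60 + t[3]
                 found = 1
             }
         }
         END { exit !found }'
}
```

It would be used as "grid-proxy-info | proxy_seconds_left" on the submit
host and "ssh login1.pads.ci.uchicago.edu grid-proxy-info |
proxy_seconds_left" for the login host. A timeleft of 11:52:08 comes out as
42728.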


-- 
Arjun Comar, Rose-Hulman '12