[Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication

wilde at mcs.anl.gov wilde at mcs.anl.gov
Fri Apr 30 08:57:07 CDT 2010


Yi,

I'll leave question 2 for Mihael.

For your first problem (using a link to qsub), the logs below suggest that your coaster workers did start (but you should verify this).

Look in your ~/.globus/scripts directory to see if an error was returned by qsub. (I suspect not, since this would likely have generated a message on stdout/err and in your swift run log.)

Look in your work directory (the workdirectory setting in sites.xml) to see if the results of echo were generated.

Look in your coaster logs to see if there were any problems in launching coasters (but from the logs below it looks like there were not).

Try changing echo to sleep 12345 to see whether the app is starting. (Then use qstat to find the node, and ssh to the node to see if the "sleep" is running.)

If it got that far, perhaps a configuration error is preventing the result from getting back successfully.
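
As a rough sketch of the checks above (not a definitive recipe), something like the following could be run on the cluster head node. WORKDIR is an assumed location (use the workdirectory value from your sites.xml), RUN_ID is the run directory name from your swift output, and the bootstrap log location is assumed from where the log was found in this run:

```shell
#!/bin/sh
# Sketch of the debugging checks; values marked "assumption" must be adapted.

RUN_ID="first-20100430-0105-nzzk6xxd"      # run directory from the swift output
WORKDIR="${WORKDIR:-$HOME/swiftwork}"      # assumption: workdirectory from sites.xml
ME="${USER:-$(id -un)}"                    # user whose jobs we want to list

# 1. Did qsub leave an error behind in the Globus scripts directory?
ls -lt "$HOME/.globus/scripts" 2>/dev/null | head -n 10 || true

# 2. Did the job write anything (status files, output) under the work directory?
find "$WORKDIR/$RUN_ID" -type f 2>/dev/null | head -n 20 || true

# 3. Any coaster bootstrap logs? (assumption: they land in the home directory)
ls -lt "$HOME"/coaster-bootstrap-*.log 2>/dev/null | head -n 5 || true

# 4. With the app changed from echo to "sleep 12345", list your jobs and
#    the nodes assigned to them (-n adds the node list to the output).
if command -v qstat >/dev/null 2>&1; then
    qstat -n -u "$ME" || true
else
    echo "qstat not found in PATH: $PATH"
fi
```

For a node name reported by qstat (NODE here is a placeholder), running ssh NODE 'ps -ef | grep "[s]leep 12345"' should show whether the sleep is actually running there.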

I can meet with you after 5PM if you get stuck in debugging.

- Mike

----- "Yi Zhu" <yizhu at cs.uchicago.edu> wrote:

> On 4/29/2010 6:18 PM, Mihael Hategan wrote:
> 
> On Thu, 2010-04-29 at 17:34 -0500, Yi Zhu wrote:
> 
> Hi,
> 
> I've tried it with "gt2:pbs" and got a "qsub not found" error. To
> investigate further, I pulled the env used by Globus and found that
> "/opt/torque-2.3.6/bin/qsub" is not reachable through PATH. I think
> that is what causes the "qsub not found" problem.
> 
> Any suggested solution?
> 
> Two actually.
> 1. This is for the qsub problem: you can add the relevant environment
> variables (for Torque) in sites.xml.
> 
> I've tried to add
> <profile namespace="env" key="PATH">/opt/torque-2.3.6/bin</profile>
> 
> to the sites.xml, but I still get the same error: "qsub is not found".
> 
> Making a link from /opt/torque-2.3.6/bin/qsub to /usr/bin seems to work,
> but I get another error:
> 
> -bash-3.2$ swift -tc.file tc.test.data -sites.file sshpbscoast.xml first.swift
> Swift svn swift-r3262 cog-r2729 (cog modified locally)
> 
> RunID: 20100430-0105-nzzk6xxd
> Progress:
> Progress: Stage in:1
> Progress: Submitted:1
> Progress: Active:1
> Failed to transfer wrapper log from first-20100430-0105-nzzk6xxd/info/x on ec2
> Progress: Failed:1
> Execution failed:
> Exception in echo:
> Arguments: [Hello, world!]
> Host: ec2
> Directory: first-20100430-0105-nzzk6xxd/jobs/x/echo-xvom1arj
> stderr.txt:
> 
> stdout.txt:
> 
> ----
> 
> Caused by:
> No status file was found. Check the shared filesystem on ec2
> Cleaning up...
> Shutting down service at https://10.251.214.179:48615
> Got channel MetaChannel: 1317572826 -> GSSSChannel-11921994068(1)
> + Done
> 
> and the coaster-bootstrap log:
> 
> [torqueuser@ip-10-251-214-179 ~]$ cat coaster-bootstrap-11921994068.log
> using plain mode
> BS: http://tp-login2.ci.uchicago.edu:57278
> which: no gmd5sum in
> (/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin)
> Expected checksum: 9017a89a3a700d9866592187fdb27b5b
> Computed checksum: 9017a89a3a700d9866592187fdb27b5b
> JAVA=/opt/vdt-1.10.1/jdk1.5/bin/java
> plain /opt/vdt-1.10.1/jdk1.5/bin/java
> -Djava=/opt/vdt-1.10.1/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE=
> -DX509_USER_PROXY=/home/torqueuser/.globus/job/ec2-204-236-204-71.compute-1.amazonaws.com/31355.1272607512/x509_up
> -DX509_CERT_DIR=/etc/grid-security/certificates
> -DGLOBUS_HOSTNAME=ec2-204-236-204-71.compute-1.amazonaws.com -jar
> /tmp/bootstrap.t31454 http://tp-login2.ci.uchicago.edu:57278
> https://128.135.125.117:54201 11921994068
> Canceling job 28.ip-10-251-214-179.ec2.internal
> 
> EC: 0
> [torqueuser@ip-10-251-214-179 ~]$
> 
> 
> 
> 
> 2. This is for the DN issue with gt2:gt2:pbs: Edit /etc/hosts and make
> sure that the expected DN is the first entry for the internal IP passed
> to the coaster service. If the entry is not in there at all, add it.
> This is a way to impersonate a Globus service and possibly do a
> man-in-the-middle thing, but it may also work to fix the DN mismatch
> problem.
> 
> Mihael
> 
> By modifying the entry in /etc/hosts to the expected DN address, I
> solved the DN mismatch problem, but I still get a "No status file was
> found. Check the shared filesystem on ec2" error, the same as the one
> mentioned above.
> 
> -Yi Zhu

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



