[Swift-devel] Coasters failing on Teraport - can't find Java?
Michael Wilde
wilde at mcs.anl.gov
Tue Jan 27 14:14:35 CST 2009
Further info: I don't see any .profile or shell .rc files in the OSG home
directory, so I'm confused about how its environment is getting set up,
unless softenv is doing it all and acting differently for login and
non-login shells.
It seems backwards to me that (as in previous email) the "-l" shell,
which *should* do full initialization, is getting a smaller environment
than the non- "-l" shell, which has tons of osg directories in its path,
and includes java.
Running a globus fork job without a shell shows the full OSG PATH is set
up (see printenv below). My guess is that, because there are no .profile or
shell .rc files, /bin/sh -l replaces the PATH that was set up by default
(a login shell reads /etc/profile, which may rebuild PATH from scratch).
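To see exactly which directories the login shell loses, the two PATH values can be compared entry by entry. This is a minimal sketch in POSIX sh; `path_missing` is a name made up for illustration and is not part of Swift, Globus, or softenv.

```shell
#!/bin/sh
# Hypothetical helper: print each entry of the PATH-style list in $1
# that does not appear in the PATH-style list in $2, one per line.
path_missing() {
    full=$1
    reduced=$2
    old_ifs=$IFS
    IFS=:                              # split $full on colons
    for dir in $full; do
        case ":$reduced:" in
            *":$dir:"*) ;;             # entry is present in both lists
            *) printf '%s\n' "$dir" ;; # entry is missing from $reduced
        esac
    done
    IFS=$old_ifs
}

# Possible usage from the submit host, reusing the commands from this thread:
#   nonlogin=$(globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'echo $PATH')
#   login=$(globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'echo $PATH')
#   path_missing "$nonlogin" "$login"
```

If the jdk1.5/bin entry shows up in that output, the login shell is dropping it rather than the job manager failing to set it.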
Is globus doing some osg initialization when it launches jobs?
Can we have a per-site option to drop the "-l" when launching coasters?
Am I heading down the right path on this, or is the problem & solution
elsewhere?
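Independent of how the environment ends up being set, whatever starts the coaster service could look for java defensively instead of passing an empty `which java` result on to dirname. This is a rough sketch, assuming a plain POSIX sh bootstrap; `find_java` is hypothetical, and I have not checked how the real bootstrap script is written.

```shell
#!/bin/sh
# Hypothetical helper: print the path of a usable java binary, or return 1.
# It checks JAVA_HOME first, then PATH, and reports a clear error otherwise.
find_java() {
    if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        printf '%s\n' "$JAVA_HOME/bin/java"
        return 0
    fi
    # command -v is the POSIX way to look a command up in PATH
    java_bin=$(command -v java 2>/dev/null) || java_bin=
    if [ -n "$java_bin" ]; then
        printf '%s\n' "$java_bin"
        return 0
    fi
    echo "find_java: no java in JAVA_HOME or PATH ($PATH)" >&2
    return 1
}

# Possible usage in a bootstrap script:
#   java_bin=$(find_java) || exit 1
```

With a check like this, a missing java would stop the bootstrap with one readable message rather than a dirname error and a bogus java.home.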
- Mike
tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'ls -ld $HOME/.*'
drwxr-xr-x 4 osg osgvo 28672 Jan 27 13:50 /home/osgvo/osg/.
drwxr-xr-x 40 root root 4096 May 19 2008 /home/osgvo/osg/..
drwx------ 4 osg osgvo 4096 Jun 12 2008 /home/osgvo/osg/.globus
-rw------- 1 osg osgvo 245 Jun 22 2008 /home/osgvo/osg/.soft
-rw-r--r-- 1 osg osgvo 9044 Jan 27 11:04 /home/osgvo/osg/.soft.cache.csh
-rw-r--r-- 1 osg osgvo 9193 Jan 27 11:04 /home/osgvo/osg/.soft.cache.sh
drwx------ 2 osg osgvo 4096 Jun 22 2008 /home/osgvo/osg/.ssh
tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'cat $HOME/.soft'
#
# This is your SoftEnv configuration run control file.
#
# It is used to tell SoftEnv how to customize your environment by
# setting up variables such as PATH and MANPATH. To learn more
# about this file, do a "man softenv".
#
@default
tp$ globus-job-run tp-grid1.ci.uchicago.edu /usr/bin/printenv PATH
/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
tp$
On 1/27/09 1:41 PM, Michael Wilde wrote:
> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
>
> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs
> mode.
>
> I'm using 0.8rc1 and submitting from tp-login.
>
> I am running with a DOEgrids cert in the OSG VO.
>
> I *think* the issue is that when a gt2 job on this VO runs with a login
> shell, it doesn't get java in its path.
>
> When I run /bin/sh *without* the "-l" option, under globus, I do get a
> java in my path.
>
> Allan: what VO did you run on when you got a successful gt2:gt2:pbs
> coaster run on teraport, after you fixed the walltime issue?
>
> It seems to me that this is a rough edge with coaster startup. Recall
> that I had a similar problem running on abe last year: I had to edit out
> the "-l" and create a custom .profile to get coasters to work.
>
> It would be great if we can iron this out in 0.8 or soon after. I'm
> willing to do some testing and enlist help from Allan and Zhengxiong for
> wider testing.
>
> Do we need special site attributes for specific sites to override
> default behaviors when they don't work?
>
>
> My sites.xml is:
>
> <config>
> <pool handle="teraport" >
> <profile namespace="globus" key="queue">fast</profile>
> <profile namespace="globus" key="maxwalltime">00:05:00</profile>
> <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
> <execution provider="coaster"
> url="tp-grid1.ci.uchicago.edu"
> jobmanager="gt2:gt2:pbs" />
> <workdirectory>/gpfs1/osg/data/oops/swiftwork</workdirectory>
> </pool>
> </config>
>
> I get this on stdout/err:
>
> ---------------------------------------------
> Swift 0.8rc1 swift-r2448 cog-r2261
>
> RunID: 20090127-1305-hcxdpor3
> Progress:
> Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1
> Progress: Selecting site:2 Stage in:1 Submitting:1
> Progress: Selecting site:2 Submitting:1 Submitted:1
> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a
> on teraport
> Execution failed:
> Exception in runoops:
> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq,
> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1,
> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
> Host: teraport
> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Could not submit job
> Caused by:
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: which: no java in
> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>
> dirname: too few arguments
> Try `dirname --help' for more information.
> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No
> such file or directory
>
> STDERR: null
> Cleaning up...
> Done
>
> ------------------------------------
>
> Checking out the environment with this cert I see:
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version'
> /bin/sh: java: command not found
>
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
> java version "1.5.0_14"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java;
> echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
> JAVA_HOME IS:
> PATH IS:
> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin
>
> /usr/bin/which: no java in
> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>
> tp$
>
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo
> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>
> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
> JAVA_HOME IS:
> PATH IS:
> /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
> java version "1.5.0_14"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>
>
> - Mike
>
>
>
>
>
> On 1/24/09 5:03 PM, Allan Espinosa wrote:
>> Hi,
>>
>> I am using Swift 0.8rc1. The same also happens with v0.7.
>>
>> I tried submitting a job from communicado to tp-grid1 (teraport) using
>> coasters. The swift runtime does not give any error, but it does not
>> finish either. Looking through the files received by the teraport
>> head node, I observed that swift keeps submitting gram jobs. It looks
>> like the submitted pbs scripts keep finishing / failing.
>>
>> Digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler*, we
>> see that maxwalltime became 101:00 instead of 00:10:00 (in sites.xml):
>>
>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
>> "http://128.135.125.118:50001" "1728236079"
>> #! /bin/sh
>> # PBS batch job script built by Globus job manager
>> #
>> #PBS -S /bin/sh
>> #PBS -m n
>> #PBS -q fast
>> #PBS -l walltime=101:00
>> #PBS -o /dev/null
>> #PBS -e /dev/null
>> #PBS -l nodes=1
>> HOME="/home/aespinosa";
>> export HOME;
>> OSG_DATA="/gpfs1/osg/data";
>> ...
>> ...
>> counter=0
>> exit_code=0
>> while test $counter -lt 1; do
>> /bin/touch
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
>>
>>
>> read tmp_exit_code <
>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
>>
>> if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
>> exit_code=$tmp_exit_code
>> fi
>> counter=`expr $counter + 1`
>> done
>>
>> exit $exit_code
>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
>> walltime requirement
>>
>>
>>
>> Below is my sites.xml:
>>
>> <config>
>>
>> <pool handle="Teraport" sysinfo="INTEL32::LINUX">
>> <profile namespace="globus" key="queue">fast</profile>
>> <profile namespace="globus" key="maxwalltime">00:10:00</profile>
>> <gridftp
>> url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa"
>> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
>> </gridftp>
>> <execution provider="coaster" url="tp-grid1.uchicago.edu"
>> jobmanager="gt2:gt2:pbs" />
>> <filesystem provider="coaster" url="gt2://tp-grid1.uchicago.edu" />
>> <workdirectory >/disks/tp-gpfs/scratch/aespinosa</workdirectory>
>> </pool>
>>
>> </config>
>>
>> This does not happen if I use "local:pbs" as the jobmanager for the
>> coaster; with that setting I was able to run jobs successfully.
>> -Allan
>>
>>