[Swift-devel] Coasters failing on Teraport - can't find Java?
Mihael Hategan
hategan at mcs.anl.gov
Tue Jan 27 21:33:31 CST 2009
Hmm. Looks like -l has the opposite effect of what I thought it should
have (ending up with an environment equivalent to the one you get when
you log in interactively). Is it my misunderstanding or something else?
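For what it's worth, the "dirname: too few arguments" and
"-Djava.home=/.." lines in the log below look like what you would get
when the empty result of a failed `which java` is passed on to dirname.
That is a guess from the output, not something checked against the
bootstrap script. A minimal sketch of a more defensive lookup, assuming
a hypothetical helper named locate_java (it is not part of Swift or the
coaster bootstrap), could look like this:

```shell
#!/bin/sh
# Hypothetical helper, not part of the coaster bootstrap script: print the
# path of a java executable, or fail with a readable message.
locate_java() {
    # Prefer an explicit JAVA_HOME when it points at an executable java.
    if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        printf '%s\n' "$JAVA_HOME/bin/java"
        return 0
    fi
    # Otherwise walk PATH by hand, so no output of `which` has to be parsed.
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        if [ -x "$dir/java" ] && [ ! -d "$dir/java" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/java"
            return 0
        fi
    done
    IFS=$old_ifs
    echo "locate_java: no java in JAVA_HOME or PATH" >&2
    return 1
}
```

Calling locate_java from both a plain shell and a login shell on the
target host would show whether either environment can see a java at all,
and a caller can stop on the non-zero return instead of building a
java.home from an empty string.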
On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote:
> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
>
> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs
> mode.
>
> I'm using 0.8rc1 and submitting from tp-login.
>
> I am running with a DOEgrids cert in the OSG VO.
>
> I *think* the issue is that when a gt2 job in this VO runs with a login
> shell, it doesn't get java in its path.
>
> When I run /bin/sh *without* the "-l" option, under globus, I do get a
> java in my path.
>
> Allan: what VO did you run on when you got a successful gt2:gt2:pbs
> coaster run on teraport, after you fixed the walltime issue?
>
> It seems to me that this is a rough edge with coaster startup. Recall
> that I had a similar problem running on abe last year: I had to edit out
> the "-l" and create a custom .profile to get coasters to work.
>
> It would be great if we can iron this out in 0.8 or soon after. I'm
> willing to do some testing and enlist help from Allan and Zhengxiong for
> wider testing.
>
> Do we need special site attributes for specific sites to override
> default behaviors when they don't work?
>
>
> My sites.xml is:
>
> <config>
> <pool handle="teraport" >
> <profile namespace="globus" key="queue">fast</profile>
> <profile namespace="globus" key="maxwalltime">00:05:00</profile>
> <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
> <execution provider="coaster"
> url="tp-grid1.ci.uchicago.edu"
> jobmanager="gt2:gt2:pbs" />
> <workdirectory>/gpfs1/osg/data/oops/swiftwork</workdirectory>
> </pool>
> </config>
>
> I get this on stdout/err:
>
> ---------------------------------------------
> Swift 0.8rc1 swift-r2448 cog-r2261
>
> RunID: 20090127-1305-hcxdpor3
> Progress:
> Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1
> Progress: Selecting site:2 Stage in:1 Submitting:1
> Progress: Selecting site:2 Submitting:1 Submitted:1
> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a
> on teraport
> Execution failed:
> Exception in runoops:
> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq,
> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1,
> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
> Host: teraport
> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
> stderr.txt:
>
> stdout.txt:
>
> ----
>
> Caused by:
> Could not submit job
> Caused by:
> Could not start coaster service
> Caused by:
> Task ended before registration was received.
> STDOUT: which: no java in
> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
> dirname: too few arguments
> Try `dirname --help' for more information.
> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No
> such file or directory
>
> STDERR: null
> Cleaning up...
> Done
>
> ------------------------------------
>
> Checking out the environment with this cert I see:
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version'
> /bin/sh: java: command not found
>
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
> java version "1.5.0_14"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java;
> echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
> JAVA_HOME IS:
> PATH IS:
> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin
> /usr/bin/which: no java in
> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
> tp$
>
>
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo
> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>
> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
> JAVA_HOME IS:
> PATH IS:
> /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
> java version "1.5.0_14"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>
>
> - Mike
>
>
>
>
>
> On 1/24/09 5:03 PM, Allan Espinosa wrote:
> > Hi,
> >
> > I am using Swift 0.8rc1. The same also happens with v0.7.
> >
> > I tried submitting a job from communicado to tp-grid1 (teraport) using
> > coasters. The Swift runtime does not report any error, but it does not
> > finish either. Looking through the files received by the teraport
> > head node, I observed that Swift keeps submitting GRAM jobs. It looks
> > like the submitted PBS scripts keep finishing or failing.
> >
> > Digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we
> > see that maxwalltime became 101:00 instead of 00:10:00 (in sites.xml).
> >
> > /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
> > "http://128.135.125.118:50001" "1728236079"
> > #! /bin/sh
> > # PBS batch job script built by Globus job manager
> > #
> > #PBS -S /bin/sh
> > #PBS -m n
> > #PBS -q fast
> > #PBS -l walltime=101:00
> > #PBS -o /dev/null
> > #PBS -e /dev/null
> > #PBS -l nodes=1
> > HOME="/home/aespinosa";
> > export HOME;
> > OSG_DATA="/gpfs1/osg/data";
> > ...
> > ...
> > counter=0
> > exit_code=0
> > while test $counter -lt 1; do
> > /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
> >
> > read tmp_exit_code <
> > /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
> > if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
> > exit_code=$tmp_exit_code
> > fi
> > counter=`expr $counter + 1`
> > done
> >
> > exit $exit_code
> > qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
> > walltime requirement
> >
> >
> >
> > Below is my sites.xml:
> >
> > <config>
> >
> > <pool handle="Teraport" sysinfo="INTEL32::LINUX">
> > <profile namespace="globus" key="queue">fast</profile>
> > <profile namespace="globus" key="maxwalltime">00:10:00</profile>
> > <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa"
> > storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
> > </gridftp>
> > <execution provider="coaster" url="tp-grid1.uchicago.edu"
> > jobmanager="gt2:gt2:pbs" />
> > <filesystem provider="coaster" url="gt2://tp-grid1.uchicago.edu" />
> > <workdirectory >/disks/tp-gpfs/scratch/aespinosa</workdirectory>
> > </pool>
> >
> > </config>
> >
> > This does not happen if I use "local:pbs" as the jobmanager for the
> > coaster; with that setting I was able to run jobs successfully.
> > -Allan
> >
> >
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel