[Swift-devel] Coasters failing on Teraport - can't find Java?

Michael Wilde wilde at mcs.anl.gov
Tue Jan 27 13:41:49 CST 2009


Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs

I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs 
mode.

I'm using 0.8rc1 and submitting from tp-login.

I am running with a DOEgrids cert in the OSG VO.

I *think* the issue is that when a gt2 job in this VO runs with a login 
shell, it doesn't get java in its path.

When I run /bin/sh *without* the "-l" option, under globus, I do get a 
java in my path.

Allan: which VO did you run under when you got a successful gt2:gt2:pbs 
coaster run on teraport, after you fixed the walltime issue?

It seems to me that this is a rough edge with coaster startup. Recall 
that I had a similar problem running on abe last year: I had to edit out 
the "-l" and create a custom .profile to get coasters to work.
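A minimal sketch of that kind of .profile workaround, assuming the login shell started by the job manager reads ~/.profile and that the JDK lives where the non-login `which java` output further down reports it (/opt/osg-ce-0.8.0-r1/jdk1.5/bin). The function name and the JDK_BIN variable are made up for illustration and are not part of Swift or coasters:

```shell
# Hypothetical ~/.profile fragment: make java reachable from a login shell.
# JDK_BIN defaults to the directory reported by the non-login shell on
# TeraPort; this is a site-specific assumption and will differ elsewhere.
JDK_BIN="${JDK_BIN:-/opt/osg-ce-0.8.0-r1/jdk1.5/bin}"

add_java_to_path() {
    # Nothing to do if java is already on the PATH.
    if command -v java >/dev/null 2>&1; then
        return 0
    fi
    # Prepend the JDK directory only if it really contains an executable java.
    if [ -x "$JDK_BIN/java" ]; then
        PATH="$JDK_BIN:$PATH"
        export PATH
    fi
}

add_java_to_path
```

This only changes the PATH when java is missing, so it should be harmless on hosts where the login environment is already correct.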

It would be great if we can iron this out in 0.8 or soon after. I'm 
willing to do some testing and enlist help from Allan and Zhengxiong for 
wider testing.

Do we need special site attributes for specific sites to override 
default behaviors when they don't work?


My sites.xml is:

<config>
<pool handle="teraport" >
   <profile namespace="globus" key="queue">fast</profile>
   <profile namespace="globus" key="maxwalltime">00:05:00</profile>
   <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
   <execution provider="coaster"
      url="tp-grid1.ci.uchicago.edu"
      jobmanager="gt2:gt2:pbs" />
   <workdirectory>/gpfs1/osg/data/oops/swiftwork</workdirectory>
</pool>
</config>

I get this on stdout/err:

---------------------------------------------
Swift 0.8rc1 swift-r2448 cog-r2261

RunID: 20090127-1305-hcxdpor3
Progress:
Progress:  Selecting site:2 Stage in:1 Initializing site shared directory:1
Progress:  Selecting site:2 Stage in:1 Submitting:1
Progress:  Selecting site:2 Submitting:1 Submitted:1
Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a 
on teraport
Execution failed:
         Exception in runoops:
Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, 
input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, 
[TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
Host: teraport
Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
stderr.txt:

stdout.txt:

----

Caused by:
         Could not submit job
Caused by:
         Could not start coaster service
Caused by:
         Task ended before registration was received.
STDOUT: which: no java in 
(/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
dirname: too few arguments
Try `dirname --help' for more information.
http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No 
such file or directory

STDERR: null
Cleaning up...
  Done

------------------------------------
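The "dirname: too few arguments" and "-Djava.home=/..: No such file or directory" messages above are consistent with a startup script that derives the Java home from `which java` without checking for an empty result. The following is only a sketch of a guarded lookup under that assumption; it is not the actual coaster bootstrap code, and the function name find_java_home is hypothetical:

```shell
# Sketch: locate a Java home directory and fail loudly if none is found,
# instead of passing an empty string on to dirname.
find_java_home() {
    # Prefer an explicit JAVA_HOME if it points at a usable JVM.
    if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        echo "$JAVA_HOME"
        return 0
    fi
    java_bin=$(command -v java 2>/dev/null)
    if [ -z "$java_bin" ]; then
        echo "java not found in PATH=$PATH" >&2
        return 1
    fi
    # <home>/bin/java: strip two path components to get <home>.
    dirname "$(dirname "$java_bin")"
}
```

A caller could then stop with a clear message, for example `JAVA_HOME=$(find_java_home) || exit 1`, rather than trying to execute an empty command.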

Checking out the environment with this cert I see:

tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version'
/bin/sh: java: command not found


tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
java version "1.5.0_14"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)


tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
JAVA_HOME IS:
PATH IS: 
/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin
/usr/bin/which: no java in 
(/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
tp$


tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'

/opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
JAVA_HOME IS:
PATH IS: 
/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
java version "1.5.0_14"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
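
The manual checks above could be wrapped in a small probe for testing other sites. This is a sketch only: RUNNER and probe_java are hypothetical names, and it assumes the launcher relays the remote stdout the way globus-job-run does in the transcripts above. With RUNNER left empty the probe simply runs on the local machine.

```shell
# Sketch: compare java visibility in a plain shell and a login shell.
# RUNNER is the command prefix that launches the shell remotely, e.g.
#   RUNNER="globus-job-run tp-grid1.ci.uchicago.edu"
RUNNER="${RUNNER:-}"

probe_java() {
    # $1 is "" for a plain shell or "-l" for a login shell.
    label="non-login"
    [ "$1" = "-l" ] && label="login"
    # RUNNER and $1 are deliberately unquoted so they split into words
    # (or vanish when empty). Only the last output line is kept, because
    # login profiles may print their own messages first.
    out=$($RUNNER /bin/sh $1 -c 'command -v java || echo NO_JAVA' 2>/dev/null | tail -n 1)
    case "$out" in
        "")        echo "$label shell: no output (did the job run?)" ;;
        *NO_JAVA*) echo "$label shell: java not found" ;;
        *)         echo "$label shell: java at $out" ;;
    esac
}

probe_java ""
probe_java "-l"
```

The probe relies on captured output rather than the exit status, since it is not assumed here that the launcher passes the remote exit code back.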


- Mike





On 1/24/09 5:03 PM, Allan Espinosa wrote:
> Hi,
> 
> I am using Swift 0.8rc1. The same also happens with v0.7.
> 
> I tried submitting a job from communicado to tp-grid1 (teraport) using
> coasters.  The swift runtime does not give any error, but it does not
> finish either. Looking through the files received by the teraport
> head node, I observed that swift keeps submitting gram jobs.  It looks
> like the submitted pbs scripts kept finishing / failing.
> 
> Digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we
> see that maxwalltime became 101:00 instead of 00:10:00 (in sites.xml)
> 
> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
> "http://128.135.125.118:50001" "1728236079"
> #! /bin/sh
> # PBS batch job script built by Globus job manager
> #
> #PBS -S /bin/sh
> #PBS -m n
> #PBS -q fast
> #PBS -l walltime=101:00
> #PBS -o /dev/null
> #PBS -e /dev/null
> #PBS -l nodes=1
> HOME="/home/aespinosa";
> export HOME;
> OSG_DATA="/gpfs1/osg/data";
> ...
> ...
> counter=0
> exit_code=0
> while test $counter -lt 1; do
>     /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
> 
>     read tmp_exit_code < /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
>     if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
>         exit_code=$tmp_exit_code
>     fi
>     counter=`expr $counter + 1`
> done
> 
> exit $exit_code
> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
> walltime requirement
> 
> 
> 
> Below is my sites.xml:
> 
> <config>
> 
>   <pool handle="Teraport" sysinfo="INTEL32::LINUX">
>     <profile namespace="globus" key="queue">fast</profile>
>     <profile namespace="globus" key="maxwalltime">00:10:00</profile>
>     <gridftp  url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa"
> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
>     </gridftp>
>     <execution provider="coaster" url="tp-grid1.uchicago.edu"
> jobmanager="gt2:gt2:pbs" />
>     <filesystem provider="coaster" url="gt2://tp-grid1.uchicago.edu" />
>     <workdirectory >/disks/tp-gpfs/scratch/aespinosa</workdirectory>
>   </pool>
> 
> </config>
> 
> This does not happen if I use "local:pbs" as the jobmanager for the
> coaster; with that setting I was able to run jobs successfully.
> -Allan
> 
> 
