[Swift-devel] Coasters failing on Teraport - cant find Java?

Michael Wilde wilde at mcs.anl.gov
Tue Jan 27 23:03:59 CST 2009


I dug a bit deeper. As far as I can tell, this is what's happening:

1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars 
and set the PATH to contain OSG stuff. So if you do a globus-job-run of 
/usr/bin/printenv (i.e. with no shell) you see all this, including java 
in the path (from an osg dir).

2) when you globus-job-run /bin/sh, all this stays around, but

3) when you globus-job-run /bin/sh with -l, it runs /etc/profile, which 
un-does the path and LD_LIBRARY_PATH, setting PATH to some default and 
LD_LIBRARY_PATH to null.  I *think* this is being done by softenv which 
runs from /etc/profile.d, called at the end of /etc/profile.

You can simulate this with:

globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c "which java; source /etc/profile; which java"

(or try printenv instead of which java to see the details)
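
The same effect can be reproduced locally without a grid site. This is only a sketch under stated assumptions: `fakejava` stands in for the java that the jobmanager puts on the PATH, `fake_profile` stands in for /etc/profile plus softenv, and `SITE_DEFAULT_PATH` is a made-up variable for the default PATH the profile resets to. None of these exist on TeraPort.

```shell
#!/bin/sh
# Local sketch: a profile that overwrites PATH hides a tool that the
# job environment could see before the profile was read.

workdir=$(mktemp -d)
mkdir -p "$workdir/osg/bin"

# Stand-in for the java binary in the OSG directory (hypothetical name).
printf '#!/bin/sh\necho "stand-in for java"\n' > "$workdir/osg/bin/fakejava"
chmod +x "$workdir/osg/bin/fakejava"

# The default PATH the fake profile will reset to (assumption).
SITE_DEFAULT_PATH=$PATH

# PATH as the modified jobmanager would hand it to the job.
job_path="$workdir/osg/bin:$PATH"

# Stand-in for /etc/profile + softenv: replace PATH, clear LD_LIBRARY_PATH.
cat > "$workdir/fake_profile" <<'EOF'
PATH=$SITE_DEFAULT_PATH
LD_LIBRARY_PATH=
export PATH LD_LIBRARY_PATH
EOF

# Case 1: plain shell, jobmanager environment is left alone.
before=$(PATH=$job_path; export PATH; command -v fakejava || echo "not found")

# Case 2: what "sh -l" does first, i.e. read the profile, then look again.
after=$(PATH=$job_path; export PATH; . "$workdir/fake_profile"; command -v fakejava || echo "not found")

echo "without profile: $before"
echo "with profile:    $after"

rm -rf "$workdir"
```

Both lookups run in subshells so the calling shell's own PATH is not modified.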

So bottom line: there are at least two cases where -l hurts: this one, and 
abe, where attempts to run login shells from globus are thwarted.

If the purpose of -l was just to get java in the path, then
for OSG sites that behave like teraport, just omitting -l should work, 
because the OSG jobmanager mods put it in the path.

For sites like abe, bypassing -l, and forcing the user to put Java in 
the path with a .bashrc or equivalent, may work. The hack I used on abe 
was to remove the -l arg and insert this in bootstrap.sh:

+if [ -f ~/.myetcprofile ]; then
+  source ~/.myetcprofile
+else
+  source /etc/profile
+fi

One option is to accept a per-site option from sites.xml to bypass "-l" 
on the startup shell, and to insert logic like the above that sources 
something like ~/.coasterinit if the user provides it.
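
A minimal sketch of what that bootstrap logic might look like, assuming the startup shell is no longer run with -l. The function name `load_coaster_env` and the `COASTER_INIT` override variable are hypothetical, and `.coasterinit` is only the name floated above, not an existing Swift feature. The sketch uses `.` rather than `source` so that it also works when /bin/sh is not bash.

```shell
# Hypothetical helper for bootstrap.sh: prefer a user-supplied init file,
# otherwise fall back to the system profile.
load_coaster_env() {
    # COASTER_INIT is an assumed override; the default is ~/.coasterinit.
    init_file="${COASTER_INIT:-$HOME/.coasterinit}"
    if [ -f "$init_file" ]; then
        . "$init_file"
    elif [ -f /etc/profile ]; then
        . /etc/profile
    fi
}

# Report a missing java up front instead of failing later in the bootstrap.
require_java() {
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found in PATH=$PATH" >&2
        return 1
    fi
}
```

A user on a site like abe could then put a single line such as `PATH=/path/to/jdk/bin:$PATH; export PATH` in the init file, where the JDK path is whatever applies at that site.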

Another option is to put a +java line in the OSG .soft file on TeraPort.

It's possible this problem only exists on the few sites like teraport that 
run both OSG mods and softenv?

I think we need to test coasters broadly across OSG to be sure (Ben's IP 
problem is a case in point).  But a simple shell test across all the OSG 
VO sites could detect whether Java will be there or not, with and 
without -l.
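
Such a test could be sketched roughly as below. It only uses the globus-job-run invocation form shown earlier in this message; the function names, the `RUNNER` override (handy for dry runs), and the one-host-per-line input file are assumptions made for illustration.

```shell
# Command used to reach a site; defaults to globus-job-run.
# RUNNER is an assumed override so the probe can be dry-run with a stub.
RUNNER="${RUNNER:-globus-job-run}"

# probe_site HOST
# Prints one tab-separated line per mode: host, mode, java path or MISSING.
probe_site() {
    host=$1
    for flag in "" "-l"; do
        if [ -n "$flag" ]; then
            # Login shell: /etc/profile is read before the lookup.
            out=$("$RUNNER" "$host" /bin/sh -l -c 'command -v java' 2>/dev/null </dev/null)
        else
            # Plain shell: the jobmanager environment is used as-is.
            out=$("$RUNNER" "$host" /bin/sh -c 'command -v java' 2>/dev/null </dev/null)
        fi
        printf '%s\t%s\t%s\n' "$host" "${flag:-no-l}" "${out:-MISSING}"
    done
}

# probe_all FILE
# FILE holds one gatekeeper host name per line; blank lines are skipped.
probe_all() {
    while read -r host; do
        [ -n "$host" ] && probe_site "$host"
    done < "$1"
}
```

Running `probe_all sites.txt` with a file containing `tp-grid1.ci.uchicago.edu` would be expected to report a java path for the no-l row and MISSING for the -l row, matching the behavior described above. The runner's stdin is redirected from /dev/null so it cannot consume the host list being read by the loop.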

- Mike


On 1/27/09 9:33 PM, Mihael Hategan wrote:
> Hmm. Looks like -l has the opposite effect of what I thought it should
> do (end up with an environment equivalent to the one you get when you
> log in as an interactive session). Is it my misunderstanding or
> something else?
> 
> On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote:
>> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
>>
>> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs 
>> mode.
>>
>> I'm using 0.8rc1 and submitting from tp-login.
>>
>> I am running with a DOEgrids cert in the OSG VO.
>>
>> I *think* the issue is that when a gt2 job on this VO runs with a login 
>> shell, it doesn't get java in its path.
>>
>> When I run /bin/sh *without* the "-l" option, under globus, I do get a 
>> java in my path.
>>
>> Allan: what VO did you run on when you got a successful gt2:gt2:pbs 
>> coaster run on teraport, after you fixed the walltime issue?
>>
>> It seems to me that this is a rough edge with coaster startup. Recall 
>> that I had a similar problem running on abe last year: I had to edit out 
>> the "-l" and create a custom .profile to get coasters to work.
>>
>> It would be great if we can iron this out in 0.8 or soon after. I'm 
>> willing to do some testing and enlist help from Allan and Zhengxiong for 
>> wider testing.
>>
>> Do we need special site attributes for specific sites to override 
>> default behaviors when they don't work?
>>
>>
>> My sites.xml is:
>>
>> <config>
>> <pool handle="teraport" >
>>    <profile namespace="globus" key="queue">fast</profile>
>>    <profile namespace="globus" key="maxwalltime">00:05:00</profile>
>>    <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
>>    <execution provider="coaster"
>>       url="tp-grid1.ci.uchicago.edu"
>>       jobmanager="gt2:gt2:pbs" />
>>    <workdirectory>/gpfs1/osg/data/oops/swiftwork</workdirectory>
>> </pool>
>> </config>
>>
>> I get this on stdout/err:
>>
>> ---------------------------------------------
>> Swift 0.8rc1 swift-r2448 cog-r2261
>>
>> RunID: 20090127-1305-hcxdpor3
>> Progress:
>> Progress:  Selecting site:2 Stage in:1 Initializing site shared directory:1
>> Progress:  Selecting site:2 Stage in:1 Submitting:1
>> Progress:  Selecting site:2 Submitting:1 Submitted:1
>> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a 
>> on teraport
>> Execution failed:
>>          Exception in runoops:
>> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, 
>> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, 
>> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
>> Host: teraport
>> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
>> stderr.txt:
>>
>> stdout.txt:
>>
>> ----
>>
>> Caused by:
>>          Could not submit job
>> Caused by:
>>          Could not start coaster service
>> Caused by:
>>          Task ended before registration was received.
>> STDOUT: which: no java in 
>> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>> dirname: too few arguments
>> Try `dirname --help' for more information.
>> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No 
>> such file or directory
>>
>> STDERR: null
>> Cleaning up...
>>   Done
>>
>> ------------------------------------
>>
>> Checking out the environment with this cert I see:
>>
>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version'
>> /bin/sh: java: command not found
>>
>>
>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
>> java version "1.5.0_14"
>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>>
>>
>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; 
>> echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>> JAVA_HOME IS:
>> PATH IS: 
>> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin
>> /usr/bin/which: no java in 
>> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>> tp$
>>
>>
>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo 
>> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>>
>> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
>> JAVA_HOME IS:
>> PATH IS: 
>> /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
>> java version "1.5.0_14"
>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>>
>>
>> - Mike
>>
>>
>>
>>
>>
>> On 1/24/09 5:03 PM, Allan Espinosa wrote:
>>> Hi,
>>>
>>> I am using swift 0.8rc1. The same also happens with v0.7.
>>>
>>> I tried submitting a job from communicado to tp-grid1 (teraport) using
>>> coasters.  The swift runtime does not give any error but it does not
>>> finish either. Looking through the files received by the teraport
>>> head node, I observed that swift keeps submitting gram jobs.  It looks
>>> like the submitted pbs scripts kept finishing / failing.
>>>
>>> Digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we
>>> see that maxwalltime became 101:00 from 00:10:00 (in sites.xml)
>>>
>>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
>>> "http://128.135.125.118:50001" "1728236079"
>>> #! /bin/sh
>>> # PBS batch job script built by Globus job manager
>>> #
>>> #PBS -S /bin/sh
>>> #PBS -m n
>>> #PBS -q fast
>>> #PBS -l walltime=101:00
>>> #PBS -o /dev/null
>>> #PBS -e /dev/null
>>> #PBS -l nodes=1
>>> HOME="/home/aespinosa";
>>> export HOME;
>>> OSG_DATA="/gpfs1/osg/data";
>>> ...
>>> ...
>>> counter=0
>>> exit_code=0
>>> while test $counter -lt 1; do
>>>     /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
>>>
>>>     read tmp_exit_code <
>>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
>>>     if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
>>>         exit_code=$tmp_exit_code
>>>     fi
>>>     counter=`expr $counter + 1`
>>> done
>>>
>>> exit $exit_code
>>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
>>> walltime requirement
>>>
>>>
>>>
>>> Below is my sites.xml:
>>>
>>> <config>
>>>
>>>   <pool handle="Teraport" sysinfo="INTEL32::LINUX">
>>>     <profile namespace="globus" key="queue">fast</profile>
>>>     <profile namespace="globus" key="maxwalltime">00:10:00</profile>
>>>     <gridftp  url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa"
>>> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
>>>     </gridftp>
>>>     <execution provider="coaster" url="tp-grid1.uchicago.edu"
>>> jobmanager="gt2:gt2:pbs" />
>>>     <filesystem provider="coaster" url="gt2://tp-grid1.uchicago.edu" />
>>>     <workdirectory >/disks/tp-gpfs/scratch/aespinosa</workdirectory>
>>>   </pool>
>>>
>>> </config>
>>>
>>> This does not happen if I use "local:pbs" as the jobmanager for the
>>> coaster; with that setting I was successful in running jobs.
>>> -Allan
>>>
>>>


