[Swift-devel] Coasters failing on Teraport - cant find Java?
Michael Wilde
wilde at mcs.anl.gov
Tue Feb 3 08:44:47 CST 2009
I didn't see this message until now. I'll compare this to the approach I
was testing (see previous message) and see what works where.
- Mike
On 1/30/09 12:42 PM, Mihael Hategan wrote:
> Cog r2267 contains a tentative fix for this. The bootstrap script is
> started without -l, and if java cannot be found, it attempts to get that
> information using bash -l.
>
> I haven't tested it.
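
A minimal sketch of the fallback described above, not the actual Cog r2267 change: look for java in the inherited environment first, and consult a login shell only if that fails. The function name find_java is hypothetical, and taking the last line of the login shell's output is an assumption made to skip any banner text a profile might print.

```shell
#!/bin/sh
# Hypothetical helper (not the real bootstrap code): print the path to java.
find_java() {
    # 1) Prefer the environment handed to us by the job manager.
    if command -v java >/dev/null 2>&1; then
        command -v java
        return 0
    fi
    # 2) Fall back to a login shell; keep only the last line of its output
    #    in case the profile scripts print something first.
    login_java=$(bash -l -c 'command -v java' 2>/dev/null | tail -n 1)
    if [ -n "$login_java" ]; then
        printf '%s\n' "$login_java"
        return 0
    fi
    # 3) Nothing found either way.
    return 1
}
```

A caller could then do something like `JAVA=$(find_java) || exit 1` before starting the service.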
>
> On Tue, 2009-01-27 at 23:03 -0600, Michael Wilde wrote:
>> I dug a bit deeper. As far as I can tell, this is what's happening:
>>
>> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars
>> and set the PATH to contain OSG stuff. So if you do a globus-job-run of
>> /usr/bin/printenv (i.e. with no shell) you see all this, including java
>> in the path (from an osg dir).
>>
>> 2) when you globus-job-run /bin/sh, all this stays around, but
>>
>> 3) when you globus-job-run /bin/sh with -l, it runs /etc/profile, which
>> un-does the path and LD_LIBRARY_PATH, setting PATH to some default and
>> LD_LIBRARY_PATH to null. I *think* this is being done by softenv which
>> runs from /etc/profile.d, called at the end of /etc/profile.
>>
>> You can simulate this with:
>>
>> globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c "which java; source
>> /etc/profile; which java" (or try printenv instead of which java to see
>> the details)
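
If a profile has to be sourced on a site where it resets PATH, one possible workaround (an assumption, not something tested in this thread) is to save the PATH supplied by the job manager and append it again afterwards. The function name source_profile_keep_path is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: source a profile without losing the PATH we started with.
source_profile_keep_path() {
    profile=${1:-/etc/profile}
    saved_path=$PATH
    if [ -f "$profile" ]; then
        . "$profile"
    fi
    # Re-append the original PATH unless the profile left it in place.
    case ":$PATH:" in
        *":$saved_path:"*) ;;
        *) PATH="$PATH:$saved_path" ;;
    esac
    export PATH
}
```

The profile's entries stay first, so anything it sets up still wins; the job manager's entries (including the OSG java directory in the case above) only fill in what is missing. LD_LIBRARY_PATH could be handled the same way.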
>>
>> So bottom line: there are at least two cases where -l hurts: this one, and
>> abe, where attempts to run login shells from globus are thwarted.
>>
>> If the purpose of -l was just to get java in the path, then
>> for OSG sites that behave like teraport, just omitting -l should work,
>> because the OSG jobmanager mods put it in the path.
>>
>> For sites like abe, bypassing -l, and forcing the user to put Java in
>> the path with a .bashrc or equivalent, may work. The hack I used on abe
>> was to remove the -l arg and insert this in bootstrap.sh:
>>
>> +if [ -f ~/.myetcprofile ]; then
>> + source ~/.myetcprofile
>> +else
>> + source /etc/profile
>> +fi
>>
>> One option is to accept a per-site option from sites.xml to bypass "-l"
>> on the startup shell, and insert the logic above for something like
>> .coasterinit, sourcing that if the user provides it.
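
A rough sketch of how the .coasterinit idea might look in the bootstrap script. Only the file name .coasterinit comes from the proposal above; the function name and its optional argument are hypothetical. It uses `.` rather than `source` because `source` is not guaranteed to exist in a plain /bin/sh.

```shell
#!/bin/sh
# Hypothetical helper: load a user-supplied environment file if present,
# otherwise fall back to the system profile.
load_coaster_env() {
    init_file=${1:-$HOME/.coasterinit}
    if [ -f "$init_file" ]; then
        # The user-provided file takes precedence over the system profile.
        . "$init_file"
    elif [ -f /etc/profile ]; then
        . /etc/profile
    fi
}
```

On a site like teraport, where /etc/profile is what removes java from the path, the fallback branch would presumably have to be skipped, which is where a per-site option would come in.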
>>
>> Another option is to put a +java line in the OSG .soft file on TeraPort.
>>
>> It's possible this problem only exists on the few sites like teraport that
>> run both OSG mods and softenv?
>>
>> I think we need to test coasters broadly across OSG to be sure (Ben's IP
>> problem is a case in point). But a simple shell test across all the OSG
>> VO sites could detect whether Java will be there or not, with and
>> without -l.
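
The per-site check could be sketched roughly as follows. The invocation shape (globus-job-run HOST /bin/sh [-l] -c ...) is the one used elsewhere in this thread; the function name, the output format, and the optional second argument for swapping in a different runner are assumptions. It looks at stdout rather than the exit status, since it is not clear that the remote exit code is passed back.

```shell
#!/bin/sh
# Hypothetical helper: report whether java resolves on a site,
# both with a plain shell and with a login shell.
check_java_on_site() {
    host=$1
    runner=${2:-globus-job-run}
    for mode in plain login; do
        if [ "$mode" = login ]; then
            out=$("$runner" "$host" /bin/sh -l -c 'command -v java' 2>/dev/null)
        else
            out=$("$runner" "$host" /bin/sh -c 'command -v java' 2>/dev/null)
        fi
        # Any stdout is treated as "found"; a profile that prints a banner
        # would need stricter parsing.
        if [ -n "$out" ]; then status=found; else status=missing; fi
        printf '%s %s %s\n' "$host" "$mode" "$status"
    done
}
```

Running it in a loop over a list of gatekeeper hosts would give one "plain" and one "login" line per site, which is enough to see where -l helps and where it hurts.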
>>
>> - Mike
>>
>>
>> On 1/27/09 9:33 PM, Mihael Hategan wrote:
>>> Hmm. Looks like -l has the opposite effect of what I thought it should
>>> do (end up with an environment equivalent to the one you get when you
>>> log in to an interactive session). Is it my misunderstanding or
>>> something else?
>>>
>>> On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote:
>>>> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
>>>>
>>>> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs
>>>> mode.
>>>>
>>>> I'm using 0.8rc1 and submitting from tp-login.
>>>>
>>>> I am running with a DOEgrids cert in the OSG VO.
>>>>
>>>> I *think* the issue is that when a gt2 job in this VO runs with a login
>>>> shell, it doesn't get java in its path.
>>>>
>>>> When I run /bin/sh *without* the "-l" option, under globus, I do get a
>>>> java in my path.
>>>>
>>>> Allan: what VO did you run on when you got a successful gt2:gt2:pbs
>>>> coaster run on teraport, after you fixed the walltime issue?
>>>>
>>>> It seems to me that this is a rough edge with coaster startup. Recall
>>>> that I had a similar problem running on abe last year: I had to edit out
>>>> the "-l" and create a custom .profile to get coasters to work.
>>>>
>>>> It would be great if we can iron this out in 0.8 or soon after. I'm
>>>> willing to do some testing and enlist help from Allan and Zhengxiong for
>>>> wider testing.
>>>>
>>>> Do we need special site attributes for specific sites to override
>>>> default behaviors when they don't work?
>>>>
>>>>
>>>> My sites.xml is:
>>>>
>>>> <config>
>>>> <pool handle="teraport" >
>>>> <profile namespace="globus" key="queue">fast</profile>
>>>> <profile namespace="globus" key="maxwalltime">00:05:00</profile>
>>>> <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
>>>> <execution provider="coaster"
>>>> url="tp-grid1.ci.uchicago.edu"
>>>> jobmanager="gt2:gt2:pbs" />
>>>> <workdirectory>/gpfs1/osg/data/oops/swiftwork</workdirectory>
>>>> </pool>
>>>> </config>
>>>>
>>>> I get this on stdout/err:
>>>>
>>>> ---------------------------------------------
>>>> Swift 0.8rc1 swift-r2448 cog-r2261
>>>>
>>>> RunID: 20090127-1305-hcxdpor3
>>>> Progress:
>>>> Progress: Selecting site:2 Stage in:1 Initializing site shared directory:1
>>>> Progress: Selecting site:2 Stage in:1 Submitting:1
>>>> Progress: Selecting site:2 Submitting:1 Submitted:1
>>>> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a
>>>> on teraport
>>>> Execution failed:
>>>> Exception in runoops:
>>>> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq,
>>>> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1,
>>>> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
>>>> Host: teraport
>>>> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
>>>> stderr.txt:
>>>>
>>>> stdout.txt:
>>>>
>>>> ----
>>>>
>>>> Caused by:
>>>> Could not submit job
>>>> Caused by:
>>>> Could not start coaster service
>>>> Caused by:
>>>> Task ended before registration was received.
>>>> STDOUT: which: no java in
>>>> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>>>> dirname: too few arguments
>>>> Try `dirname --help' for more information.
>>>> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No
>>>> such file or directory
>>>>
>>>> STDERR: null
>>>> Cleaning up...
>>>> Done
>>>>
>>>> ------------------------------------
>>>>
>>>> Checking out the environment with this cert I see:
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version'
>>>> /bin/sh: java: command not found
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
>>>> java version "1.5.0_14"
>>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java;
>>>> echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>>>> JAVA_HOME IS:
>>>> PATH IS:
>>>> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin
>>>> /usr/bin/which: no java in
>>>> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
>>>> tp$
>>>>
>>>>
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo
>>>> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
>>>>
>>>> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
>>>> JAVA_HOME IS:
>>>> PATH IS:
>>>> /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
>>>> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
>>>> java version "1.5.0_14"
>>>> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
>>>>
>>>>
>>>> - Mike
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 1/24/09 5:03 PM, Allan Espinosa wrote:
>>>>> Hi,
>>>>>
>>>>> I am using Swift 0.8rc1. The same also happens with v0.7.
>>>>>
>>>>> I tried submitting a job from communicado to tp-grid1 (teraport) using
>>>>> coasters. The swift runtime does not give any error, but it does not
>>>>> finish either. Looking through the files received by the teraport
>>>>> head node, I observed that swift keeps submitting gram jobs. It looks
>>>>> like the submitted pbs scripts kept finishing / failing.
>>>>>
>>>>> Digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we
>>>>> see that maxwalltime became 101:00 from 00:10:00 (in sites.xml)
>>>>>
>>>>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
>>>>> "http://128.135.125.118:50001" "1728236079"
>>>>> #! /bin/sh
>>>>> # PBS batch job script built by Globus job manager
>>>>> #
>>>>> #PBS -S /bin/sh
>>>>> #PBS -m n
>>>>> #PBS -q fast
>>>>> #PBS -l walltime=101:00
>>>>> #PBS -o /dev/null
>>>>> #PBS -e /dev/null
>>>>> #PBS -l nodes=1
>>>>> HOME="/home/aespinosa";
>>>>> export HOME;
>>>>> OSG_DATA="/gpfs1/osg/data";
>>>>> ...
>>>>> ...
>>>>> counter=0
>>>>> exit_code=0
>>>>> while test $counter -lt 1; do
>>>>> /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
>>>>>
>>>>> read tmp_exit_code < /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
>>>>> if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
>>>>> exit_code=$tmp_exit_code
>>>>> fi
>>>>> counter=`expr $counter + 1`
>>>>> done
>>>>>
>>>>> exit $exit_code
>>>>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
>>>>> walltime requirement
>>>>>
>>>>>
>>>>>
>>>>> Below is my sites.xml:
>>>>>
>>>>> <config>
>>>>>
>>>>> <pool handle="Teraport" sysinfo="INTEL32::LINUX">
>>>>> <profile namespace="globus" key="queue">fast</profile>
>>>>> <profile namespace="globus" key="maxwalltime">00:10:00</profile>
>>>>> <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa"
>>>>> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
>>>>> </gridftp>
>>>>> <execution provider="coaster" url="tp-grid1.uchicago.edu"
>>>>> jobmanager="gt2:gt2:pbs" />
>>>>> <filesystem provider="coaster" url="gt2://tp-grid1.uchicago.edu" />
>>>>> <workdirectory >/disks/tp-gpfs/scratch/aespinosa</workdirectory>
>>>>> </pool>
>>>>>
>>>>> </config>
>>>>>
>>>>> This does not happen if I use "local:pbs" as the jobmanager for the
>>>>> coaster; with that setting I was able to run jobs successfully.
>>>>> -Allan
>>>>>
>>>>>