[Swift-devel] Coasters failing on Teraport - cant find Java?

Mihael Hategan hategan at mcs.anl.gov
Fri Jan 30 12:42:00 CST 2009


Cog r2267 contains a tentative fix for this. The bootstrap script is
started without -l, and if java cannot be found, it attempts to get that
information using bash -l.

I haven't tested it.
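
For reference, the fallback amounts to something like the sketch below. This
is not the r2267 code itself: the function name find_java is made up for
illustration, and taking only the last line of the login shell's output is an
assumption, in case profile scripts print something of their own.

```shell
#!/bin/sh
# Sketch only: find_java is a hypothetical helper, not the code in cog r2267.

find_java() {
    # First try the environment we were started with (no login shell),
    # so whatever the jobmanager put on PATH is preserved.
    if command -v java >/dev/null 2>&1; then
        command -v java
        return 0
    fi
    # Otherwise ask a login shell, which sources /etc/profile and friends.
    # tail -n 1 is an assumption: it skips anything profile scripts print.
    found=$(bash -l -c 'command -v java' 2>/dev/null | tail -n 1)
    if [ -n "$found" ]; then
        echo "$found"
        return 0
    fi
    # Java is not visible either way.
    return 1
}
```

A caller would use it along the lines of
JAVA=$(find_java) || echo "java not found" >&2
and only then work out the java home from the returned path.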

On Tue, 2009-01-27 at 23:03 -0600, Michael Wilde wrote:
> I dug a bit deeper. As far as I can tell, this is what's happening:
> 
> 1) On OSG sites, the jobmanager(s) are modified to insert OSG env vars 
> and set the PATH to contain OSG stuff. So if you do a globus-job-run of 
> /usr/bin/printenv (i.e. with no shell) you see all this, including java 
> in the path (from an osg dir).
> 
> 2) when you globus-job-run /bin/sh, all this stays around, but
> 
> 3) when you globus-job-run /bin/sh with -l, it runs /etc/profile, which 
> un-does the path and LD_LIBRARY_PATH, setting PATH to some default and 
> LD_LIBRARY_PATH to null.  I *think* this is being done by softenv which 
> runs from /etc/profile.d, called at the end of /etc/profile.
> 
> You can simulate this with:
> 
> globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c "which java; source 
> /etc/profile; which java"  (or try printenv instead of which java to see 
> the details)
> 
> So bottom line: there are at least two cases where -l hurts: this one, and 
> abe, where attempts to run login shells from globus are thwarted.
> 
> If the purpose of -l was just to get java in the path, then
> for OSG sites that behave like teraport, just omitting -l should work, 
> because the OSG jobmanager mods put it in the path.
> 
> For sites like abe, bypassing -l, and forcing the user to put Java in 
> the path with a .bashrc or equivalent, may work. The hack I used on abe 
> was to remove the -l arg and insert this in bootstrap.sh:
> 
> +if [ -f ~/.myetcprofile ]; then
> +  source ~/.myetcprofile
> +else
> +  source /etc/profile
> +fi
> 
> One option is to accept a per-site option from sites.xml to bypass "-l" 
> on the startup shell, and insert the logic above for something like 
> .coasterinit, sourcing that if the user provides it.
> 
> Another option is to put a +java line in the OSG .soft file on TeraPort.
> 
> It's possible this problem only exists on the few sites like teraport that 
> run both OSG mods and softenv?
> 
> I think we need to test coasters broadly across OSG to be sure (Ben's IP 
> problem is a case in point).  But a simple shell test across all the OSG 
> VO sites could detect whether Java will be there or not, with and 
> without -l.
> 
> - Mike
> 
> 
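
The shell test Mike suggests could be as small as the sketch below. The
function name probe_site, the RUNNER variable and the JAVA_FOUND/JAVA_MISSING
markers are all invented here; only the globus-job-run invocations follow the
commands quoted in this thread. It looks for a marker in the output rather
than at the exit status, since I'm not assuming globus-job-run passes the
remote exit code through.

```shell
#!/bin/sh
# Sketch only: probe_site, RUNNER and the markers are hypothetical names.

# Command used to reach a site; overridable so the logic can be tried locally.
RUNNER=${RUNNER:-globus-job-run}

# Remote command: print a marker saying whether java is on the PATH.
PROBE='which java >/dev/null 2>&1 && echo JAVA_FOUND || echo JAVA_MISSING'

# Print one line per site: "<host> plain=<yes|no> login=<yes|no>".
probe_site() {
    host=$1
    plain=no
    login=no
    # Non-login shell: keeps the environment set up by the jobmanager.
    out=$("$RUNNER" "$host" /bin/sh -c "$PROBE" 2>&1)
    case "$out" in *JAVA_FOUND*) plain=yes ;; esac
    # Login shell: sources /etc/profile, which may reset PATH.
    out=$("$RUNNER" "$host" /bin/sh -l -c "$PROBE" 2>&1)
    case "$out" in *JAVA_FOUND*) login=yes ;; esac
    echo "$host plain=$plain login=$login"
}
```

Looping probe_site over a list of gatekeeper hosts (say, one per line in a
hypothetical osg-sites.txt) would give one line per site. Going by the output
quoted below, teraport should come out as plain=yes login=no.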
> On 1/27/09 9:33 PM, Mihael Hategan wrote:
> > Hmm. Looks like -l has the opposite effect of what I thought it should
> > do (end up with an environment equivalent to the one you get when you
> > log in to an interactive session). Is it my misunderstanding or
> > something else?
> > 
> > On Tue, 2009-01-27 at 13:41 -0600, Michael Wilde wrote:
> >> Related to: Re: [Swift-devel] swift changing walltime of prews-gram jobs
> >>
> >> I can't get a Swift script to run on coasters on TeraPort in gt2:gt2:pbs 
> >> mode.
> >>
> >> I'm using 0.8rc1 and submitting from tp-login.
> >>
> >> I am running with a DOEgrids cert in the OSG VO.
> >>
> >> I *think* the issue is that when a gt2 job on this VO runs with a login 
> >> shell, it doesn't get java in its path.
> >>
> >> When I run /bin/sh *without* the "-l" option, under globus, I do get a 
> >> java in my path.
> >>
> >> Allan: what VO did you run on when you got a successful gt2:gt2:pbs 
> >> coaster run on teraport, after you fixed the walltime issue?
> >>
> >> It seems to me that this is a rough edge with coaster startup. Recall 
> >> that I had a similar problem running on abe last year: I had to edit out 
> >> the "-l" and create a custom .profile to get coasters to work.
> >>
> >> It would be great if we can iron this out in 0.8 or soon after. I'm 
> >> willing to do some testing and enlist help from Allan and Zhengxiong for 
> >> wider testing.
> >>
> >> Do we need special site attributes for specific sites to override 
> >> default behaviors when they don't work?
> >>
> >>
> >> My sites.xml is:
> >>
> >> <config>
> >> <pool handle="teraport" >
> >>    <profile namespace="globus" key="queue">fast</profile>
> >>    <profile namespace="globus" key="maxwalltime">00:05:00</profile>
> >>    <gridftp url="gsiftp://tp-grid1.ci.uchicago.edu" />
> >>    <execution provider="coaster"
> >>       url="tp-grid1.ci.uchicago.edu"
> >>       jobmanager="gt2:gt2:pbs" />
> >>    <workdirectory>/gpfs1/osg/data/oops/swiftwork</workdirectory>
> >> </pool>
> >> </config>
> >>
> >> I get this on stdout/err:
> >>
> >> ---------------------------------------------
> >> Swift 0.8rc1 swift-r2448 cog-r2261
> >>
> >> RunID: 20090127-1305-hcxdpor3
> >> Progress:
> >> Progress:  Selecting site:2 Stage in:1 Initializing site shared directory:1
> >> Progress:  Selecting site:2 Stage in:1 Submitting:1
> >> Progress:  Selecting site:2 Submitting:1 Submitted:1
> >> Failed to transfer wrapper log from oops5-20090127-1305-hcxdpor3/info/a 
> >> on teraport
> >> Execution failed:
> >>          Exception in runoops:
> >> Arguments: [input/fasta/T1af7.fasta, input/secseq/T1af7.secseq, 
> >> input/native/T1af7.pdb, output/T1af7.1.pdt, output/T1af7.1.rmsd, 1, 
> >> [TEMP UPDATE INTERVAL = 10, SMOOTH DEVIATION COEFFICIENT = 0.80001]]
> >> Host: teraport
> >> Directory: oops5-20090127-1305-hcxdpor3/jobs/a/runoops-asq0ir5j
> >> stderr.txt:
> >>
> >> stdout.txt:
> >>
> >> ----
> >>
> >> Caused by:
> >>          Could not submit job
> >> Caused by:
> >>          Could not start coaster service
> >> Caused by:
> >>          Task ended before registration was received.
> >> STDOUT: which: no java in 
> >> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
> >> dirname: too few arguments
> >> Try `dirname --help' for more information.
> >> http://tp-login2.ci.uchicago.edu:50001: line 55: -Djava.home=/..: No 
> >> such file or directory
> >>
> >> STDERR: null
> >> Cleaning up...
> >>   Done
> >>
> >> ------------------------------------
> >>
> >> Checking out the environment with this cert I see:
> >>
> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'java -version'
> >> /bin/sh: java: command not found
> >>
> >>
> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
> >> java version "1.5.0_14"
> >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
> >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
> >>
> >>
> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -l -c 'which java; 
> >> echo JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
> >> JAVA_HOME IS:
> >> PATH IS: 
> >> /usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin
> >> /usr/bin/which: no java in 
> >> (/usr/kerberos/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/bin:/software/common/softenv-1.6.0-r1/bin:/home/osgvo/osg/bin/linux-rhel4-x86_64:/home/osgvo/osg/bin:/soft/xcat-1.2.0-r1/bin:/soft/xcat-1.2.0-r1/sbin:/soft/xcat-1.2.0-r1/x86_64/bin:/soft/xcat-1.2.0-r1/x86_64/sbin:/soft/xcat-1.2.0-r1/contrib/bin:/soft/xcat-1.2.0-r1/contrib/sbin:/soft/xcat-1.2.0-r1/contrib/x86_64/bin:/soft/xcat-1.2.0-r1/contrib/x86_64/sbin)
> >> tp$
> >>
> >>
> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'which java; echo 
> >> JAVA_HOME IS: $JAVA_HOME;echo PATH IS: $PATH'
> >>
> >> /opt/osg-ce-0.8.0-r1/jdk1.5/bin/java
> >> JAVA_HOME IS:
> >> PATH IS: 
> >> /opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/opt/osg-ce-0.8.0-r1/globus/bin:/opt/osg-ce-0.8.0-r1/globus/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/opt/osg-ce-0.8.0-r1/condor/sbin:/opt/osg-ce-0.8.0-r1/condor/bin:/opt/osg-ce-0.8.0-r1/apache/bin:/opt/osg-ce-0.8.0-r1/srm-v2-client/bin:/opt/osg-ce-0.8.0-r1/srm-v1-client/sbin:/opt/osg-ce-0.8.0-r1/srm-v1-client/bin:/opt/osg-ce-0.8.0-r1/wget/bin:/opt/osg-ce-0.8.0-r1/gums/scripts:/opt/osg-ce-0.8.0-r1/cert-scripts/bin:/opt/osg-ce-0.8.0-r1/glite/sbin:/opt/osg-ce-0.8.0-r1/glite/bin:/opt/osg-ce-0.8.0-r1/edg/sbin:/opt/osg-ce-0.8.0-r1/prima/bin:/opt/osg-ce-0.8.0-r1/mysql/bin:/opt/osg-ce-0.8.0-r1/logrotate/sbin:/opt/osg-ce-0.8.0-r1/ant/bin:/opt/osg-ce-0.8.0-r1/jdk1.5/bin:/opt/osg-ce-0.8.0-r1/gpt/sbin:/software/linux-rhel4-x86_64/pacman-3.21-r1/bin:/opt/osg-ce-0.8.0-r1/vdt/sbin:/opt/osg-ce-0.8.0-r1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin
> >> tp$ globus-job-run tp-grid1.ci.uchicago.edu /bin/sh -c 'java -version'
> >> java version "1.5.0_14"
> >> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
> >> Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
> >>
> >>
> >> - Mike
> >>
> >>
> >>
> >>
> >>
> >> On 1/24/09 5:03 PM, Allan Espinosa wrote:
> >>> Hi,
> >>>
> >>> I am using swift 0.8rc1. The same also happens with v0.7.
> >>>
> >>> I tried submitting a job from communicado to tp-grid1 (teraport) using
> >>> coasters. The swift runtime does not give any error, but it does not
> >>> finish either. Looking through the files received by the teraport
> >>> head node, I observed that swift keeps submitting gram jobs. It looks
> >>> like the submitted pbs scripts kept finishing / failing.
> >>>
> >>> Digging through ~/.globus/jobs/tp-grid1.uchicago.edu/*/scheduler* we
> >>> see that maxwalltime became 101:00 instead of the 00:10:00 set in sites.xml:
> >>>
> >>> /usr/bin/perl "/home/aespinosa/.globus/coasters/cscript63266.pl"
> >>> "http://128.135.125.118:50001" "1728236079"
> >>> #! /bin/sh
> >>> # PBS batch job script built by Globus job manager
> >>> #
> >>> #PBS -S /bin/sh
> >>> #PBS -m n
> >>> #PBS -q fast
> >>> #PBS -l walltime=101:00
> >>> #PBS -o /dev/null
> >>> #PBS -e /dev/null
> >>> #PBS -l nodes=1
> >>> HOME="/home/aespinosa";
> >>> export HOME;
> >>> OSG_DATA="/gpfs1/osg/data";
> >>> ...
> >>> ...
> >>> counter=0
> >>> exit_code=0
> >>> while test $counter -lt 1; do
> >>>     /bin/touch /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter;
> >>>
> >>>     read tmp_exit_code <
> >>> /home/aespinosa/.globus/job/tp-grid1.ci.uchicago.edu/7432.1232837576/exit.$counter
> >>>     if [ $exit_code = 0 -a $tmp_exit_code != 0 ]; then
> >>>         exit_code=$tmp_exit_code
> >>>     fi
> >>>     counter=`expr $counter + 1`
> >>> done
> >>>
> >>> exit $exit_code
> >>> qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max
> >>> walltime requirement
> >>>
> >>>
> >>>
> >>> Below is my sites.xml:
> >>>
> >>> <config>
> >>>
> >>>   <pool handle="Teraport" sysinfo="INTEL32::LINUX">
> >>>     <profile namespace="globus" key="queue">fast</profile>
> >>>     <profile namespace="globus" key="maxwalltime">00:10:00</profile>
> >>>     <gridftp  url="gsiftp://tp-grid1.ci.uchicago.edu/disks/tp-gpfs/scratch/aespinosa"
> >>> storage="/opt/osg/data/aespinosa" major="2" minor="2" patch="4">
> >>>     </gridftp>
> >>>     <execution provider="coaster" url="tp-grid1.uchicago.edu"
> >>> jobmanager="gt2:gt2:pbs" />
> >>>     <filesystem provider="coaster" url="gt2://tp-grid1.uchicago.edu" />
> >>>     <workdirectory >/disks/tp-gpfs/scratch/aespinosa</workdirectory>
> >>>   </pool>
> >>>
> >>> </config>
> >>>
> >>> This does not happen if I use "local:pbs" as the jobmanager for the
> >>> coaster; with that setting I was able to run jobs successfully.
> >>> -Allan
> >>>
> >>>
> >> _______________________________________________
> >> Swift-devel mailing list
> >> Swift-devel at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 



