[Swift-user] Error message on Cray XE6

Michael Wilde wilde at mcs.anl.gov
Tue Apr 17 12:38:22 CDT 2012


I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the resource manager (RM): being able to queue jobs that *might* be runnable in the future iff new queue settings are made. I don't know a good way for Swift to detect this, but that's something to discuss (and create a ticket for).

- Mike
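
A possible starting point for detecting a job that is stuck in the queue, as a minimal sketch only. It assumes Torque/PBS-style `qstat -f <jobid>` output that contains a `job_state = X` line. The helper names `job_state_from_qstat` and `is_queued` are hypothetical and are not part of Swift.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swift): read `qstat -f <jobid>` output on
# stdin and print the value of the first job_state field (e.g. Q, R, C).
job_state_from_qstat() {
    sed -n 's/^[[:space:]]*job_state = \(.*\)$/\1/p' | head -n 1
}

# Hypothetical helper: exit status 0 when the job on stdin is still queued.
is_queued() {
    [ "$(job_state_from_qstat)" = "Q" ]
}
```

A wrapper could run something like `qstat -f "$JOBID" | is_queued` at intervals and warn when a job is still queued after a site-specific timeout. The state alone does not say *why* the job is queued, so this would only be a hint, not a diagnosis.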

----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Lorenzo Pesce" <lpesce at uchicago.edu>
> Cc: "Michael Wilde" <wilde at mcs.anl.gov>, swift-user at ci.uchicago.edu
> Sent: Tuesday, April 17, 2012 12:08:49 PM
> Subject: Re: [Swift-user] Error message on Cray XE6
> There is a site file entry for that.
> 
> <profile namespace="globus" key="queue">scalability</profile>
> 
> You must make certain that the shape of your job fits in the queue you
> requested. If it does not fit, there is a silent failure and Swift
> hangs.
> 
> On Apr 17, 2012, at 11:58, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
> 
> > Works great!
> >
> > Is there a way I can ask swift to put me in a specific queue, such
> > as scalability or some reservation?
> >
> >
> >
> > On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote:
> >
> >> OK, here's a workaround for this problem:
> >>
> >> You need to add this line to the swift command bin/swift in your
> >> Swift release.
> >>
> >> After:
> >>
> >> updateOptions "$SWIFT_HOME" "swift.home"
> >>
> >> Add:
> >>
> >> updateOptions "$USER_HOME" "user.home"
> >>
> >> This is near line 92 in the version I tested, Swift trunk
> >> swift-r5739 cog-r3368.
> >>
> >> Then you can do:
> >>
> >> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1
> >>
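
How the added line reaches Java can be sketched as follows. This is an illustration under an assumption: that `updateOptions` appends a `-D<property>=<value>` flag to the options passed to `java` whenever the value is non-empty. The function body below is a stand-in written for this example, not copied from bin/swift.

```shell
#!/bin/sh
# Illustrative stand-in for the helper in bin/swift (assumed behaviour):
# append -D<property>=<value> to OPTIONS when the value is non-empty.
OPTIONS=""
updateOptions() {
    if [ -n "$1" ]; then
        OPTIONS="$OPTIONS -D$2=$1"
    fi
}

updateOptions "$SWIFT_HOME" "swift.home"
updateOptions "$USER_HOME" "user.home"   # the line added by the workaround

# The launcher would then start Java with these options; here we only
# print the command prefix to show what would be passed.
echo "java$OPTIONS"
```

Under that assumption, leaving USER_HOME unset adds no flag, so Java keeps its default `user.home`, while setting it on the command line overrides the property for that run only.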
> >> Lorenzo, if you are using "module load swift" we'll need to update
> >> that, or you can copy the swift release directory structure that
> >> module load points you to, then modify the swift command there, and
> >> put that modified release first in your PATH.
> >>
> >> We'll work out a way to get something like this into the production
> >> module and trunk. I don't know of other systems that are currently
> >> affected by this, but I'm sure they will come up.
> >>
> >> - Mike
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Michael Wilde" <wilde at mcs.anl.gov>
> >>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >>> Cc: swift-user at ci.uchicago.edu
> >>> Sent: Saturday, April 14, 2012 10:13:40 AM
> >>> Subject: Re: [Swift-user] Error message on Cray XE6
> >>> stackoverflow says this should work:
> >>>
> >>> java -Duser.home=<new_location> <your_program>
> >>>
> >>> Need to get that in via the swift command.
> >>>
> >>> - Mike
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
> >>>> swift-user at ci.uchicago.edu
> >>>> Sent: Saturday, April 14, 2012 10:10:00 AM
> >>>> Subject: Re: [Swift-user] Error message on Cray XE6
> >>>> I just tried both setting HOME=/lustre/beagle/wilde and setting
> >>>> user.home to the same thing. Neither works. I think user.home is
> >>>> coming from the Java property, and that doesn't seem to be
> >>>> influenced by the HOME env var. I was about to look at whether Java
> >>>> can be asked to change home, maybe by setting a command line arg to
> >>>> Java.
> >>>>
> >>>> - Mike
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
> >>>>> swift-user at ci.uchicago.edu
> >>>>> Sent: Saturday, April 14, 2012 10:02:14 AM
> >>>>> Subject: Re: [Swift-user] Error message on Cray XE6
> >>>>> That is an easy fix I believe. I know where the code is so I will
> >>>>> change it and test.
> >>>>>
> >>>>> In the meantime could you try something? Try setting
> >>>>> user.home=<someplace.on.lustre>
> >>>>> in your config file and try again.
> >>>>>
> >>>>> On Apr 14, 2012, at 9:58, Michael Wilde <wilde at mcs.anl.gov> wrote:
> >>>>>
> >>>>>> /home is no longer mounted by the compute nodes, per the
> >>>>>> post-maintenance summary:
> >>>>>>
> >>>>>> "External filesystem dependencies minimized: Compute nodes and the
> >>>>>> scheduler should now continue to process and complete jobs without
> >>>>>> the threat of interference of external filesystem outages.
> >>>>>> /gpfs/pads is only available on login1 through login5; /home is on
> >>>>>> login and mom nodes only."
> >>>>>>
> >>>>>> So we need to (finally) remove Swift's dependence on $HOME/.globus
> >>>>>> and $HOME/.globus/scripts in particular.
> >>>>>>
> >>>>>> I suggest - since the swift command already needs to write to "." -
> >>>>>> that we create a scripts/ directory in "." instead of $HOME/.globus.
> >>>>>> And this should be used by any provider that would have previously
> >>>>>> created files below .globus.
> >>>>>>
> >>>>>> I'll echo this to swift-devel and start a thread there to discuss.
> >>>>>> It's possible there's already a property to cause scripts/ to be
> >>>>>> created elsewhere. If not, I think we should make one. I think
> >>>>>> grouping the scripts created by a run into the current dir, along
> >>>>>> with the swift log, _concurrent, and (in the conventions I use in my
> >>>>>> run scripts) swiftwork/, makes sense.
> >>>>>>
> >>>>>> Lorenzo, hopefully we can at least get you a workaround for this
> >>>>>> soon.
> >>>>>>
> >>>>>> You *might* be able to trick swift into doing this by setting
> >>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under .globus
> >>>>>> and that didn't work, as /home is not even readable by the compute
> >>>>>> nodes, which in this case need to run the coaster worker (.pl)
> >>>>>> script.
> >>>>>>
> >>>>>> - Mike
> >>>>>>
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
> >>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >>>>>>> Cc: swift-user at ci.uchicago.edu
> >>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM
> >>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
> >>>>>>> In principle the access to the /home filesystem should still be
> >>>>>>> there.
> >>>>>>>
> >>>>>>> The only thing I did was to change the cf file to remove some errors
> >>>>>>> I had in it, so that might also be the source of the problem. This is
> >>>>>>> what it looks like now:
> >>>>>>> (BTW, the comments are not mine, I run swift only from lustre)
> >>>>>>>
> >>>>>>>
> >>>>>>> # Whether to transfer the wrappers from the compute nodes
> >>>>>>> # I like to launch from my home dir, but keep everything on
> >>>>>>> # lustre
> >>>>>>> wrapperlog.always.transfer=false
> >>>>>>>
> >>>>>>> #Indicates whether the working directory on the remote site
> >>>>>>> # should be left intact even when a run completes successfully
> >>>>>>> sitedir.keep=true
> >>>>>>>
> >>>>>>> #try only once
> >>>>>>> execution.retries=1
> >>>>>>>
> >>>>>>> # Attempt to run as much as possible, i.e., ignore non-fatal errors
> >>>>>>> lazy.errors=true
> >>>>>>>
> >>>>>>> # to reduce filesystem access
> >>>>>>> status.mode=provider
> >>>>>>>
> >>>>>>> use.provider.staging=false
> >>>>>>>
> >>>>>>> provider.staging.pin.swiftfiles=false
> >>>>>>>
> >>>>>>> foreach.max.threads=100
> >>>>>>>
> >>>>>>> provenance.log=false
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote:
> >>>>>>>
> >>>>>>>> The perl script is the worker script that is submitted with PBS. I
> >>>>>>>> have not tried to run on Beagle since the maintenance period has
> >>>>>>>> ended so I am not exactly sure why the error popped up. One reason
> >>>>>>>> could be that the home file system is no longer mounted on the
> >>>>>>>> compute nodes. I know they spoke about that being a possibility but
> >>>>>>>> I am not sure they implemented that during the maintenance period.
> >>>>>>>> Do you know if the home file system is still mounted on the compute
> >>>>>>>> nodes?
> >>>>>>>>
> >>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
> >>>>>>>>
> >>>>>>>>> Hi --
> >>>>>>>>> I haven't seen this one before:
> >>>>>>>>>
> >>>>>>>>> Can't open perl script
> >>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl": No
> >>>>>>>>> such file or directory
> >>>>>>>>>
> >>>>>>>>> The config of the Cray has changed, might this have anything to do
> >>>>>>>>> with it?
> >>>>>>>>> I have no idea what perl script it is talking about and why it is
> >>>>>>>>> looking in home.
> >>>>>>>>>
> >>>>>>>>> Thanks a lot,
> >>>>>>>>>
> >>>>>>>>> Lorenzo
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Swift-user mailing list
> >>>>>>>>> Swift-user at ci.uchicago.edu
> >>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> >>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Michael Wilde
> >>>>>> Computation Institute, University of Chicago
> >>>>>> Mathematics and Computer Science Division
> >>>>>> Argonne National Laboratory
> >>>>>>
> >>>>
> >>>
> >>
> >>
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
