[Swift-user] Error message on Cray XE6

Mihael Hategan hategan at mcs.anl.gov
Tue Apr 17 19:25:28 CDT 2012


Hmm, so if a block job fails, coasters will fail at least one swift job.
If this happens enough times, the failure should propagate through the
retries and to the user. It might take some time though.

So maybe there's a distinction between "hangs" and "takes a lot of time
but eventually fails".

Mihael
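
P.S. One rough way to tell the two cases apart is to run the command under a time limit and see how it ends. The sketch below is only an illustration: classify_run is a hypothetical helper, not part of Swift, and it assumes GNU coreutils "timeout" is available, which exits with status 124 when the limit is reached. A command that itself exits with 124 would be misreported as a possible hang.

```shell
#!/bin/bash
# Hypothetical helper (not part of Swift): run a command under a time limit
# and report whether it succeeded, failed, or was still running at the limit.
classify_run() {
    local limit="$1"   # time limit in seconds
    shift
    timeout "$limit" "$@"
    local status=$?    # 124 is what GNU timeout returns when the limit is hit
    case "$status" in
        0)   echo "succeeded" ;;
        124) echo "possible hang: still running after ${limit}s" ;;
        *)   echo "failed with exit status ${status}" ;;
    esac
}

# Example (command line taken from the quoted thread; 3600 is an arbitrary limit):
# classify_run 3600 swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1
```

The limit would have to be longer than the time the retries need to propagate a failure, otherwise a slow failure is still reported as a possible hang.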

On Tue, 2012-04-17 at 12:49 -0500, Jonathan Monette wrote:
> I do not think this is a case where PBS leaves the job queued; maybe
> on some machines, but not on Beagle.  When I had a job that did not fit
> in the scalability queue, Swift hung, but when checking the log I found
> a message from qsub saying the job was rejected.  There is a bug
> ticket open for this issue.  I will find the log that has the
> message (or just recreate it) and post the message to the ticket.
> Swift also hangs (with a qsub message in the log) if you try to submit
> a PBS job to a machine where you no longer have an allocation.  I
> received this message when trying to use Fusion after a long time.
> 
> On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote:
> 
> > I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made.  I don't know a good way for Swift to detect this, but that's something to discuss (and create a ticket for).
> > 
> > - Mike
> > 
> > ----- Original Message -----
> >> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >> To: "Lorenzo Pesce" <lpesce at uchicago.edu>
> >> Cc: "Michael Wilde" <wilde at mcs.anl.gov>, swift-user at ci.uchicago.edu
> >> Sent: Tuesday, April 17, 2012 12:08:49 PM
> >> Subject: Re: [Swift-user] Error message on Cray XE6
> >> There is a site file entry for that.
> >> 
> >> <profile namespace="globus" key="queue">scalability</profile>
> >> 
> >> You must make certain that the shape of your job fits in the queue you
> >> requested. If it does not fit, there is a silent failure and Swift
> >> hangs.
> >> 
> >> On Apr 17, 2012, at 11:58, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
> >> 
> >>> Works great!
> >>> 
> >>> Is there a way I can ask swift to put me in a specific queue, such
> >>> as scalability or some reservation?
> >>> 
> >>> 
> >>> 
> >>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote:
> >>> 
> >>>> OK, here's a workaround for this problem:
> >>>> 
> >>>> You need to add this line to the swift command bin/swift in your
> >>>> Swift release.
> >>>> 
> >>>> After:
> >>>> 
> >>>> updateOptions "$SWIFT_HOME" "swift.home"
> >>>> 
> >>>> Add:
> >>>> 
> >>>> updateOptions "$USER_HOME" "user.home"
> >>>> 
> >>>> This is near line 92 in the version I tested, Swift trunk
> >>>> swift-r5739 cog-r3368.
> >>>> 
> >>>> Then you can do:
> >>>> 
> >>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc \
> >>>>   -sites.file pbs.xml catsn.swift -n=1
> >>>> 
> >>>> Lorenzo, if you are using "module load swift" we'll need to update
> >>>> that, or you can copy the swift release directory structure that
> >>>> module load points you to, then modify the swift command there, and
> >>>> put that modified release first in your PATH.
> >>>> 
> >>>> We'll work out a way to get something like this into the production
> >>>> module and trunk. I don't know of other systems that are currently
> >>>> affected by this, but I'm sure they will come up.
> >>>> 
> >>>> - Mike
> >>>> 
> >>>> 
> >>>> ----- Original Message -----
> >>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >>>>> Cc: swift-user at ci.uchicago.edu
> >>>>> Sent: Saturday, April 14, 2012 10:13:40 AM
> >>>>> Subject: Re: [Swift-user] Error message on Cray XE6
> >>>>> stackoverflow says this should work:
> >>>>> 
> >>>>> java -Duser.home=<new_location> <your_program>
> >>>>> 
> >>>>> Need to get that in via the swift command.
> >>>>> 
> >>>>> - Mike
> >>>>> 
> >>>>> 
> >>>>> ----- Original Message -----
> >>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
> >>>>>> swift-user at ci.uchicago.edu
> >>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM
> >>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
> >>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting
> >>>>>> user.home to the same thing. Neither works. I think user.home is
> >>>>>> coming from the Java property, and that doesn't seem to be
> >>>>>> influenced by the HOME env var. I was about to look at whether Java
> >>>>>> can be asked to change home, maybe by setting a command-line arg to
> >>>>>> Java.
> >>>>>> 
> >>>>>> - Mike
> >>>>>> 
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
> >>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
> >>>>>>> swift-user at ci.uchicago.edu
> >>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM
> >>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
> >>>>>>> That is an easy fix I believe. I know where the code is so I will
> >>>>>>> change and test.
> >>>>>>> 
> >>>>>>> In the meantime could you try something? Try setting
> >>>>>>> user.home=<someplace.on.lustre>
> >>>>>>> in your config file and try again.
> >>>>>>> 
> >>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde <wilde at mcs.anl.gov> wrote:
> >>>>>>> 
> >>>>>>>> /home is no longer mounted by the compute nodes, per the
> >>>>>>>> post-maintenance summary:
> >>>>>>>> 
> >>>>>>>> "External filesystem dependencies minimized: Compute nodes and the
> >>>>>>>> scheduler should now continue to process and complete jobs without
> >>>>>>>> the threat of interference of external filesystem outages.
> >>>>>>>> /gpfs/pads is only available on login1 through login5; /home is on
> >>>>>>>> login and mom nodes only."
> >>>>>>>> 
> >>>>>>>> So we need to (finally) remove Swift's dependence on $HOME/.globus
> >>>>>>>> and $HOME/.globus/scripts in particular.
> >>>>>>>> 
> >>>>>>>> I suggest - since the swift command already needs to write to "." -
> >>>>>>>> that we create a scripts/ directory in "." instead of $HOME/.globus.
> >>>>>>>> And this should be used by any provider that would have previously
> >>>>>>>> created files below .globus.
> >>>>>>>> 
> >>>>>>>> I'll echo this to swift-devel and start a thread there to discuss.
> >>>>>>>> It's possible there's already a property to cause scripts/ to be
> >>>>>>>> created elsewhere. If not, I think we should make one. I think
> >>>>>>>> the scripts created by a run should be grouped into the current dir,
> >>>>>>>> along with the swift log, _concurrent, and (in the conventions I use
> >>>>>>>> in my run scripts) swiftwork/.
> >>>>>>>> 
> >>>>>>>> Lorenzo, hopefully we can at least get you a workaround for this
> >>>>>>>> soon.
> >>>>>>>> 
> >>>>>>>> You *might* be able to trick swift into doing this by setting
> >>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under .globus
> >>>>>>>> and that didn't work, as /home is not even readable by the compute
> >>>>>>>> nodes, which in this case need to run the coaster worker (.pl)
> >>>>>>>> script.
> >>>>>>>> 
> >>>>>>>> - Mike
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> ----- Original Message -----
> >>>>>>>>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
> >>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
> >>>>>>>>> Cc: swift-user at ci.uchicago.edu
> >>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM
> >>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
> >>>>>>>>> In principle the access to the /home filesystem should still be
> >>>>>>>>> there.
> >>>>>>>>> 
> >>>>>>>>> The only thing I did was to change the cf file to remove some
> >>>>>>>>> errors I had in it, so that might also be the source of the problem.
> >>>>>>>>> This is what it looks like now:
> >>>>>>>>> (BTW, the comments are not mine, I run swift only from lustre)
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> # Whether to transfer the wrappers from the compute nodes
> >>>>>>>>> # I like to launch from my home dir, but keep everything on
> >>>>>>>>> # lustre
> >>>>>>>>> wrapperlog.always.transfer=false
> >>>>>>>>> 
> >>>>>>>>> #Indicates whether the working directory on the remote site
> >>>>>>>>> # should be left intact even when a run completes successfully
> >>>>>>>>> sitedir.keep=true
> >>>>>>>>> 
> >>>>>>>>> #try only once
> >>>>>>>>> execution.retries=1
> >>>>>>>>> 
> >>>>>>>>> # Attempt to run as much as possible, i.e., ignore non-fatal errors
> >>>>>>>>> lazy.errors=true
> >>>>>>>>> 
> >>>>>>>>> # to reduce filesystem access
> >>>>>>>>> status.mode=provider
> >>>>>>>>> 
> >>>>>>>>> use.provider.staging=false
> >>>>>>>>> 
> >>>>>>>>> provider.staging.pin.swiftfiles=false
> >>>>>>>>> 
> >>>>>>>>> foreach.max.threads=100
> >>>>>>>>> 
> >>>>>>>>> provenance.log=false
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote:
> >>>>>>>>> 
> >>>>>>>>>> The perl script is the worker script that is submitted with PBS.
> >>>>>>>>>> I have not tried to run on Beagle since the maintenance period
> >>>>>>>>>> ended, so I am not exactly sure why the error popped up. One
> >>>>>>>>>> reason could be that the home file system is no longer mounted on
> >>>>>>>>>> the compute nodes. I know they spoke about that being a
> >>>>>>>>>> possibility, but I am not sure they implemented it during the
> >>>>>>>>>> maintenance period. Do you know if the home file system is still
> >>>>>>>>>> mounted on the compute nodes?
> >>>>>>>>>> 
> >>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
> >>>>>>>>>> 
> >>>>>>>>>>> Hi --
> >>>>>>>>>>> I haven't seen this one before:
> >>>>>>>>>>> 
> >>>>>>>>>>> Can't open perl script
> >>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl":
> >>>>>>>>>>> No such file or directory
> >>>>>>>>>>> 
> >>>>>>>>>>> The config of the cray has changed, might this have anything to do
> >>>>>>>>>>> with it?
> >>>>>>>>>>> I have no idea what perl script it is talking about and why it is
> >>>>>>>>>>> looking in home.
> >>>>>>>>>>> 
> >>>>>>>>>>> Thanks a lot,
> >>>>>>>>>>> 
> >>>>>>>>>>> Lorenzo
> >>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> Swift-user mailing list
> >>>>>>>>>>> Swift-user at ci.uchicago.edu
> >>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>>>> --
> >>>>>>>> Michael Wilde
> >>>>>>>> Computation Institute, University of Chicago
> >>>>>>>> Mathematics and Computer Science Division
> >>>>>>>> Argonne National Laboratory
> >>>>>>>> 
> >>>>>> 




