[Swift-user] Error message on Cray XE6

Jonathan Monette jonmon at mcs.anl.gov
Tue Apr 17 19:33:53 CDT 2012


So that is not what I was witnessing. It seems the scheduler (PBS) rejected the job: no jobs showed up under qstat, but Swift still showed that jobs were submitted with no failures. When I checked the log I found a message from qsub saying it could not submit the job. I will reproduce the issue and post what I see. Perhaps this happens because the scheduler rejects the job but does not return an error code?
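
A minimal sketch of how a submission could be checked defensively if qsub really can reject a job without a non-zero exit code. The helper name `submit_checked` is hypothetical (it is not part of Swift), and the assumption that a successful submission prints a job id starting with a digit and containing a dot (such as `12345.sdb`) should be verified against the local PBS installation.

```shell
#!/bin/sh
# Hypothetical helper, not part of Swift: run a submit command and report
# success only if it exits 0 AND prints something that looks like a job id.
# Assumption: a job id starts with a digit and contains a dot ("12345.sdb").
submit_checked() {
    # "$@" is the complete submit command, for example: qsub job.pbs
    if out=$("$@"); then
        rc=0
    else
        rc=$?
    fi
    if [ "$rc" -ne 0 ]; then
        echo "submit failed (exit code $rc): $out" >&2
        return 1
    fi
    case "$out" in
        [0-9]*.*)
            # Looks like a job id: pass it on to the caller.
            echo "$out"
            return 0
            ;;
        *)
            # Exit code was 0, but nothing that resembles a job id came back.
            echo "submit returned 0 but printed no job id: $out" >&2
            return 1
            ;;
    esac
}
```

It could be used as `jobid=$(submit_checked qsub job.pbs) || echo "submission rejected"`, where `job.pbs` is a placeholder script name.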

On Apr 17, 2012, at 19:25, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Hmm, so if a block job fails, coasters will fail at least one swift job.
> If this happens enough times, the failure should propagate through the
> retries and to the user. It might take some time though.
> 
> So maybe there's a distinction between "hangs" and "takes a lot of time
> but eventually fails".
> 
> Mihael
> 
> On Tue, 2012-04-17 at 12:49 -0500, Jonathan Monette wrote:
>> I do not think PBS leaves the job queued in that case; maybe on some
>> machines, but not on Beagle.  When I had a job that did not fit in the
>> scalability queue, Swift hung, but when checking the log I found a
>> message from qsub saying the job was rejected.  There is a bug ticket
>> open for this issue.  I will find the log that has the message (or
>> just recreate it) and post the message to the ticket.  Swift also
>> hangs (with a qsub message in the log) if you try to submit a PBS job
>> to a machine where you no longer have an allocation.  I received this
>> message when trying to use Fusion after a long time.
>> 
>> On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote:
>> 
>>> I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made.  I don't know a good way for Swift to detect this, but that's something to discuss (and create a ticket for).
>>> 
>>> - Mike
>>> 
>>> ----- Original Message -----
>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>> To: "Lorenzo Pesce" <lpesce at uchicago.edu>
>>>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>, swift-user at ci.uchicago.edu
>>>> Sent: Tuesday, April 17, 2012 12:08:49 PM
>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>> There is a site file entry for that.
>>>> 
>>>> <profile namespace="globus" key="queue">scalability</profile>
>>>> 
>>>> You must make certain that the shape of your job fits in the queue you
>>>> requested. If it does not fit, there is a silent failure and Swift
>>>> hangs.
>>>> 
>>>> On Apr 17, 2012, at 11:58, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>>> 
>>>>> Works great!
>>>>> 
>>>>> Is there a way I can ask swift to put me in a specific queue, such
>>>>> as scalability or some reservation?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote:
>>>>> 
>>>>>> OK, here's a workaround for this problem:
>>>>>> 
>>>>>> You need to add this line to the swift command bin/swift in your
>>>>>> Swift release.
>>>>>> 
>>>>>> After:
>>>>>> 
>>>>>> updateOptions "$SWIFT_HOME" "swift.home"
>>>>>> 
>>>>>> Add:
>>>>>> 
>>>>>> updateOptions "$USER_HOME" "user.home"
>>>>>> 
>>>>>> This is near line 92 in the version I tested, Swift trunk
>>>>>> swift-r5739 cog-r3368.
>>>>>> 
>>>>>> Then you can do:
>>>>>> 
>>>>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc -sites.file pbs.xml catsn.swift -n=1
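
A rough sketch of why this workaround takes effect, assuming `updateOptions` simply appends a `-D<property>=<value>` flag to the JVM options when the value is non-empty. The function body and the `OPTIONS` variable below are assumptions for illustration, not copied from bin/swift; check the real definition in your Swift release.

```shell
#!/bin/sh
# Assumed shape of updateOptions (illustration only, not the real bin/swift):
# $1 is the value, $2 is the Java system property name.
OPTIONS=""
updateOptions() {
    # Skip unset or empty values so no empty -D flag is generated.
    if [ -n "$1" ]; then
        OPTIONS="$OPTIONS -D$2=$1"
    fi
}

# The environment variable set on the command line in the workaround.
USER_HOME=/lustre/beagle/wilde

# The added line: forward USER_HOME to the JVM as the user.home property.
updateOptions "$USER_HOME" "user.home"

# Show the flag that would be handed to the java command.
echo "java$OPTIONS"
```

Under that assumption the JVM is started with `-Duser.home=/lustre/beagle/wilde`, which is the same mechanism as the `java -Duser.home=<new_location>` suggestion quoted further down the thread.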
>>>>>> 
>>>>>> Lorenzo, if you are using "module load swift" we'll need to update
>>>>>> that, or you can copy the swift release directory structure that
>>>>>> module load points you to, then modify the swift command there, and
>>>>>> put that modified release first in your PATH.
>>>>>> 
>>>>>> We'll work out a way to get something like this into the production
>>>>>> module and trunk. I don't know of other systems that are currently
>>>>>> affected by this, but I'm sure they will come up.
>>>>>> 
>>>>>> - Mike
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>> Cc: swift-user at ci.uchicago.edu
>>>>>>> Sent: Saturday, April 14, 2012 10:13:40 AM
>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>> stackoverflow says this should work:
>>>>>>> 
>>>>>>> java -Duser.home=<new_location> <your_program>
>>>>>>> 
>>>>>>> Need to get that in via the swift command.
>>>>>>> 
>>>>>>> - Mike
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
>>>>>>>> swift-user at ci.uchicago.edu
>>>>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM
>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting
>>>>>>>> user.home to the same thing. Neither works. I think user.home is
>>>>>>>> coming from the Java property, and that doesn't seem to be
>>>>>>>> influenced by the HOME env var. I was about to look into whether
>>>>>>>> Java can be asked to change home, maybe by setting a command line
>>>>>>>> arg to Java.
>>>>>>>> 
>>>>>>>> - Mike
>>>>>>>> 
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
>>>>>>>>> swift-user at ci.uchicago.edu
>>>>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM
>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>> That is an easy fix, I believe. I know where the code is, so I
>>>>>>>>> will change it and test.
>>>>>>>>> 
>>>>>>>>> In the meantime, could you try something? Try setting
>>>>>>>>> user.home=<someplace.on.lustre>
>>>>>>>>> in your config file and try again.
>>>>>>>>> 
>>>>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde <wilde at mcs.anl.gov> wrote:
>>>>>>>>> 
>>>>>>>>>> /home is no longer mounted by the compute nodes, per the
>>>>>>>>>> post-maintenance summary:
>>>>>>>>>> 
>>>>>>>>>> "External filesystem dependencies minimized: Compute nodes and the
>>>>>>>>>> scheduler should now continue to process and complete jobs without
>>>>>>>>>> the threat of interference of external filesystem outages.
>>>>>>>>>> /gpfs/pads is only available on login1 through login5; /home is on
>>>>>>>>>> login and mom nodes only."
>>>>>>>>>> 
>>>>>>>>>> So we need to (finally) remove Swift's dependence on $HOME/.globus
>>>>>>>>>> and $HOME/.globus/scripts in particular.
>>>>>>>>>> 
>>>>>>>>>> I suggest - since the swift command already needs to write to "." -
>>>>>>>>>> that we create a scripts/ directory in "." instead of $HOME/.globus.
>>>>>>>>>> And this should be used by any provider that would have previously
>>>>>>>>>> created files below .globus.
>>>>>>>>>> 
>>>>>>>>>> I'll echo this to swift-devel and start a thread there to discuss.
>>>>>>>>>> It's possible there's already a property to cause scripts/ to be
>>>>>>>>>> created elsewhere. If not, I think we should make one. I think the
>>>>>>>>>> scripts created by a run should be grouped into the current dir,
>>>>>>>>>> along with the swift log, _concurrent, and (in the conventions I use
>>>>>>>>>> in my run scripts) swiftwork/.
>>>>>>>>>> 
>>>>>>>>>> Lorenzo, hopefully we can at least get you a workaround for this
>>>>>>>>>> soon.
>>>>>>>>>> 
>>>>>>>>>> You *might* be able to trick swift into doing this by setting
>>>>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under .globus
>>>>>>>>>> and that didn't work, as /home is not even readable by the compute
>>>>>>>>>> nodes, which in this case need to run the coaster worker (.pl)
>>>>>>>>>> script.
>>>>>>>>>> 
>>>>>>>>>> - Mike
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
>>>>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>>>> Cc: swift-user at ci.uchicago.edu
>>>>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM
>>>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>>>> In principle the access to the /home filesystem should still be
>>>>>>>>>>> there.
>>>>>>>>>>> 
>>>>>>>>>>> The only thing I did was to change the cf file to remove some
>>>>>>>>>>> errors I had in it, so that might also be the source of the
>>>>>>>>>>> problem. This is what it looks like now:
>>>>>>>>>>> (BTW, the comments are not mine, I run swift only from lustre)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> # Whether to transfer the wrappers from the compute nodes
>>>>>>>>>>> # I like to launch from my home dir, but keep everything on
>>>>>>>>>>> # lustre
>>>>>>>>>>> wrapperlog.always.transfer=false
>>>>>>>>>>> 
>>>>>>>>>>> #Indicates whether the working directory on the remote site
>>>>>>>>>>> # should be left intact even when a run completes successfully
>>>>>>>>>>> sitedir.keep=true
>>>>>>>>>>> 
>>>>>>>>>>> #try only once
>>>>>>>>>>> execution.retries=1
>>>>>>>>>>> 
>>>>>>>>>>> # Attempt to run as much as possible, i.e., ignore non-fatal errors
>>>>>>>>>>> lazy.errors=true
>>>>>>>>>>> 
>>>>>>>>>>> # to reduce filesystem access
>>>>>>>>>>> status.mode=provider
>>>>>>>>>>> 
>>>>>>>>>>> use.provider.staging=false
>>>>>>>>>>> 
>>>>>>>>>>> provider.staging.pin.swiftfiles=false
>>>>>>>>>>> 
>>>>>>>>>>> foreach.max.threads=100
>>>>>>>>>>> 
>>>>>>>>>>> provenance.log=false
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> The perl script is the worker script that is submitted with PBS.
>>>>>>>>>>>> I have not tried to run on Beagle since the maintenance period
>>>>>>>>>>>> ended, so I am not exactly sure why the error popped up. One
>>>>>>>>>>>> reason could be that the home file system is no longer mounted on
>>>>>>>>>>>> the compute nodes. I know they spoke about that being a
>>>>>>>>>>>> possibility, but I am not sure they implemented it during the
>>>>>>>>>>>> maintenance period. Do you know if the home file system is still
>>>>>>>>>>>> mounted on the compute nodes?
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi --
>>>>>>>>>>>>> I haven't seen this one before:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Can't open perl script
>>>>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl":
>>>>>>>>>>>>> No such file or directory
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The config of the cray has changed, might this have anything to
>>>>>>>>>>>>> do with it?
>>>>>>>>>>>>> I have no idea what perl script it is talking about and why it
>>>>>>>>>>>>> is looking in home.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Swift-user mailing list
>>>>>>>>>>>>> Swift-user at ci.uchicago.edu
>>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Michael Wilde
>>>>>>>>>> Computation Institute, University of Chicago
>>>>>>>>>> Mathematics and Computer Science Division
>>>>>>>>>> Argonne National Laboratory
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 
> 


