[Swift-user] Error message on Cray XE6

Jonathan Monette jonmon at mcs.anl.gov
Tue Apr 17 12:49:12 CDT 2012


I do not think PBS leaves the job queued in this case; maybe on some machines, but not on Beagle.  When I had a job that did not fit in the scalability queue, Swift hung, but when I checked the log I found a message from qsub saying the job was rejected.  There is a bug ticket open for this issue.  I will find the log that has the message (or just recreate it) and post the message to the ticket.  Swift also hangs (with a qsub message in the log) if you try to submit a PBS job to a machine where you no longer have an allocation.  I received this message when trying to use Fusion after not using it for a long time.
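
For anyone who hits the same hang, here is a minimal sketch of how one might look for the qsub message in a run's log. It only assumes that the rejection line contains the word "qsub", as described above; the function name and the log file name in the usage comment are hypothetical, and the exact wording of the message will depend on the site.

```shell
#!/bin/sh
# Print every line of a Swift log that mentions qsub, prefixed with its
# line number. Returns 1 when no line matches and 2 when the file cannot
# be read. (Hypothetical helper, not part of Swift.)
find_qsub_messages() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read log file: $log_file" >&2
        return 2
    fi
    # -n adds line numbers, -i ignores case
    grep -n -i 'qsub' "$log_file"
}

# Usage (the log name is an assumption):
#   find_qsub_messages ./my-run.log
```

If this prints nothing while Swift is still hanging, the job may really be sitting in the queue rather than having been rejected.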

On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote:

> I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made.  I don't know a good way for Swift to detect this, but that's something to discuss (and create a ticket for).
> 
> - Mike
> 
> ----- Original Message -----
>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>> To: "Lorenzo Pesce" <lpesce at uchicago.edu>
>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>, swift-user at ci.uchicago.edu
>> Sent: Tuesday, April 17, 2012 12:08:49 PM
>> Subject: Re: [Swift-user] Error message on Cray XE6
>> There is a site file entry for that.
>> 
>> <profile namespace="globus" key="queue">scalability</profile>
>> 
>> You must make certain that the shape of your job fits in the queue you
>> requested. If it does not fit, there is a silent failure and Swift
>> hangs.
>> 
>> On Apr 17, 2012, at 11:58, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>> 
>>> Works great!
>>> 
>>> Is there a way I can ask swift to put me in a specific queue, such
>>> as scalability or some reservation?
>>> 
>>> 
>>> 
>>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote:
>>> 
>>>> OK, here's a workaround for this problem:
>>>> 
>>>> You need to add this line to the swift command bin/swift in your
>>>> Swift release.
>>>> 
>>>> After:
>>>> 
>>>> updateOptions "$SWIFT_HOME" "swift.home"
>>>> 
>>>> Add:
>>>> 
>>>> updateOptions "$USER_HOME" "user.home"
>>>> 
>>>> This is near line 92 in the version I tested, Swift trunk
>>>> swift-r5739 cog-r3368.
>>>> 
>>>> Then you can do:
>>>> 
>>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc
>>>> -sites.file pbs.xml catsn.swift -n=1
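
A minimal sketch of what the added line is meant to achieve, assuming that updateOptions turns a non-empty value into a -D<name>=<value> Java system property. The function add_java_property below is a hypothetical stand-in written for illustration, not the actual code in bin/swift.

```shell
#!/bin/sh
# Hypothetical stand-in for the helper in bin/swift: append
# -D<name>=<value> to OPTIONS only when a value was supplied.
OPTIONS=""
add_java_property() {
    value="$1"
    name="$2"
    if [ -n "$value" ]; then
        OPTIONS="$OPTIONS -D$name=$value"
    fi
}

# USER_HOME comes from the environment; when it is unset or empty,
# no override is added and Java keeps its default user.home.
add_java_property "${USER_HOME:-}" "user.home"
echo "java options:$OPTIONS"
```

With USER_HOME=/lustre/beagle/wilde in the environment, the options string would carry -Duser.home=/lustre/beagle/wilde, which matches the java -Duser.home=<new_location> form quoted further down in this thread.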
>>>> 
>>>> Lorenzo, if you are using "module load swift" we'll need to update
>>>> that, or you can copy the swift release directory structure that
>>>> module load points you to, then modify the swift command there, and
>>>> put that modified release first in your PATH.
>>>> 
>>>> We'll work out a way to get something like this into the production
>>>> module and trunk. I don't know of other systems that are currently
>>>> affected by this, but I'm sure they will come up.
>>>> 
>>>> - Mike
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>> Cc: swift-user at ci.uchicago.edu
>>>>> Sent: Saturday, April 14, 2012 10:13:40 AM
>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>> stackoverflow says this should work:
>>>>> 
>>>>> java -Duser.home=<new_location> <your_program>
>>>>> 
>>>>> Need to get that in via the swift command.
>>>>> 
>>>>> - Mike
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
>>>>>> swift-user at ci.uchicago.edu
>>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM
>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting
>>>>>> user.home to the same thing. Neither works. I think user.home is
>>>>>> coming from the Java property, and that doesn't seem to be
>>>>>> influenced by the HOME env var. I was about to look at whether Java
>>>>>> can be asked to change home, maybe by setting a command-line arg to
>>>>>> Java.
>>>>>> 
>>>>>> - Mike
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
>>>>>>> swift-user at ci.uchicago.edu
>>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM
>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>> That is an easy fix, I believe. I know where the code is, so I will
>>>>>>> change it and test.
>>>>>>> 
>>>>>>> In the meantime, could you try something? Try setting
>>>>>>> user.home=<someplace.on.lustre>
>>>>>>> in your config file and try again.
>>>>>>> 
>>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde <wilde at mcs.anl.gov> wrote:
>>>>>>> 
>>>>>>>> /home is no longer mounted by the compute nodes, per the
>>>>>>>> post-maintenance summary:
>>>>>>>> 
>>>>>>>> "External filesystem dependencies minimized: Compute nodes and the
>>>>>>>> scheduler should now continue to process and complete jobs without
>>>>>>>> the threat of interference of external filesystem outages.
>>>>>>>> /gpfs/pads is only available on login1 through login5; /home is on
>>>>>>>> login and mom nodes only."
>>>>>>>> 
>>>>>>>> So we need to (finally) remove Swift's dependence on $HOME/.globus
>>>>>>>> and $HOME/.globus/scripts in particular.
>>>>>>>> 
>>>>>>>> I suggest - since the swift command already needs to write to "." -
>>>>>>>> that we create a scripts/ directory in "." instead of $HOME/.globus.
>>>>>>>> And this should be used by any provider that would have previously
>>>>>>>> created files below .globus.
>>>>>>>> 
>>>>>>>> I'll echo this to swift-devel and start a thread there to discuss.
>>>>>>>> It's possible there's already a property to cause scripts/ to be
>>>>>>>> created elsewhere. If not, I think we should make one. I think
>>>>>>>> the scripts created by a run should be grouped into the current dir,
>>>>>>>> along with the swift log, _concurrent, and (in the conventions I use
>>>>>>>> in my run scripts) swiftwork/.
>>>>>>>> 
>>>>>>>> Lorenzo, hopefully we can at least get you a workaround for this
>>>>>>>> soon.
>>>>>>>> 
>>>>>>>> You *might* be able to trick swift into doing this by setting
>>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under .globus
>>>>>>>> and that didn't work, as /home is not even readable by the compute
>>>>>>>> nodes, which in this case need to run the coaster worker (.pl)
>>>>>>>> script.
>>>>>>>> 
>>>>>>>> - Mike
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
>>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>> Cc: swift-user at ci.uchicago.edu
>>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM
>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>> In principle the access to the /home filesystem should still be
>>>>>>>>> there.
>>>>>>>>> 
>>>>>>>>> The only thing I did was to change the cf file to remove some
>>>>>>>>> errors I had in it, so that might also be the source of the
>>>>>>>>> problem. This is what it looks like now:
>>>>>>>>> (BTW, the comments are not mine, I run swift only from lustre)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> # Whether to transfer the wrappers from the compute nodes
>>>>>>>>> # I like to launch from my home dir, but keep everything on
>>>>>>>>> # lustre
>>>>>>>>> wrapperlog.always.transfer=false
>>>>>>>>> 
>>>>>>>>> #Indicates whether the working directory on the remote site
>>>>>>>>> # should be left intact even when a run completes successfully
>>>>>>>>> sitedir.keep=true
>>>>>>>>> 
>>>>>>>>> #try only once
>>>>>>>>> execution.retries=1
>>>>>>>>> 
>>>>>>>>> # Attempt to run as much as possible, i.e., ignore non-fatal errors
>>>>>>>>> lazy.errors=true
>>>>>>>>> 
>>>>>>>>> # to reduce filesystem access
>>>>>>>>> status.mode=provider
>>>>>>>>> 
>>>>>>>>> use.provider.staging=false
>>>>>>>>> 
>>>>>>>>> provider.staging.pin.swiftfiles=false
>>>>>>>>> 
>>>>>>>>> foreach.max.threads=100
>>>>>>>>> 
>>>>>>>>> provenance.log=false
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote:
>>>>>>>>> 
>>>>>>>>>> The perl script is the worker script that is submitted with PBS.
>>>>>>>>>> I have not tried to run on Beagle since the maintenance period
>>>>>>>>>> ended, so I am not exactly sure why the error popped up. One
>>>>>>>>>> reason could be that the home file system is no longer mounted on
>>>>>>>>>> the compute nodes. I know they spoke about that being a
>>>>>>>>>> possibility, but I am not sure they implemented it during the
>>>>>>>>>> maintenance period. Do you know if the home file system is still
>>>>>>>>>> mounted on the compute nodes?
>>>>>>>>>> 
>>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi --
>>>>>>>>>>> I haven't seen this one before:
>>>>>>>>>>> 
>>>>>>>>>>> Can't open perl script
>>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl":
>>>>>>>>>>> No such file or directory
>>>>>>>>>>> 
>>>>>>>>>>> The config of the cray has changed; might this have anything to
>>>>>>>>>>> do with it?
>>>>>>>>>>> I have no idea what perl script it is talking about or why it is
>>>>>>>>>>> looking in home.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>> 
>>>>>>>>>>> Lorenzo
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Swift-user mailing list
>>>>>>>>>>> Swift-user at ci.uchicago.edu
>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Michael Wilde
>>>>>>>> Computation Institute, University of Chicago
>>>>>>>> Mathematics and Computer Science Division
>>>>>>>> Argonne National Laboratory
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
> 
> 



