[Swift-user] Error message on Cray XE6

Jonathan Monette jonmon at mcs.anl.gov
Tue Apr 17 20:20:03 CDT 2012


So here is the case where there is no allocation available for a user.  I am running on Fusion where my allocation has expired.

Here is what Swift is showing:
Swift 0.93 swift-r5483 cog-r3339

RunID: 20120417-2011-kky5yb46
Progress:  time: Tue, 17 Apr 2012 20:11:36 -0500
Progress:  time: Tue, 17 Apr 2012 20:12:06 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:12:36 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:13:06 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:13:36 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:14:06 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:14:36 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:15:06 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:15:36 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:16:06 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:16:36 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:17:06 -0500  Submitted:2
Progress:  time: Tue, 17 Apr 2012 20:17:36 -0500  Submitted:2

And here is the message in the log:

2012-04-17 20:11:37,306-0500 DEBUG AbstractExecutor Wrote PBS script to /homes/jonmon/.globus/scripts/PBS5153949301806424155.submit
2012-04-17 20:11:37,306-0500 DEBUG AbstractExecutor Command line: qsub /homes/jonmon/.globus/scripts/PBS5153949301806424155.submit
2012-04-17 20:11:37,433-0500 DEBUG AbstractExecutor Waiting for output from qsub
2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Output from qsub is: ""
2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Waiting for output from qsub
2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Output from qsub is: "ERROR: Project "startup-jonmon" has no allocation; can't run job."
2012-04-17 20:11:37,434-0500 INFO  BlockTaskSubmitter Error submitting block task: Cannot submit job: Could not submit job (qsub reported an exit code of 1). 
ERROR: Project "startup-jonmon" has no allocation; can't run job.


So it shows that qsub failed (with exit code 1), but Swift keeps going, showing a submitted count of 2 even though there are no jobs under qstat -u jonmon.

I will try to reproduce the case where no job fits in the specified queue.  I do not think this is high priority, but it is definitely something that users should be aware of.
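
For reference, the same failure is easy to reproduce by hand by submitting the generated script outside of Swift; something like this (a sketch based on the log above, using the script path from this run):

$ qsub /homes/jonmon/.globus/scripts/PBS5153949301806424155.submit
ERROR: Project "startup-jonmon" has no allocation; can't run job.
$ echo $?
1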

On Apr 17, 2012, at 7:33 PM, Jonathan Monette wrote:

> So that is not what I was witnessing. It seems the scheduler (the PBS scheduler) rejected the job, because no jobs showed up under qstat, but Swift still showed that jobs were submitted with no failures. When I checked the log I found a message from qsub saying it could not submit the job. I will reproduce the issue and post what I see. Perhaps, though, this is happening because the scheduler rejects the job but does not return an error code?
> 
> On Apr 17, 2012, at 19:25, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
>> Hmm, so if a block job fails, coasters will fail at least one swift job.
>> If this happens enough times, the failure should propagate through the
>> retries and to the user. It might take some time though.
>> 
>> So maybe there's a distinction between "hangs" and "takes a lot of time
>> but eventually fails".
>> 
>> Mihael
>> 
>> On Tue, 2012-04-17 at 12:49 -0500, Jonathan Monette wrote:
>>> I do not think that is the case where PBS leaves the job queued;
>>> maybe on some machines, but not on Beagle.  When I had a job that
>>> did not fit in the scalability queue, Swift hung, but when checking
>>> the log I found a message from qsub saying the job was rejected.
>>> There is a bug ticket open for this issue.  I will find the log that
>>> has the message (or just recreate it) and post the message to the
>>> ticket.  Swift also hangs (with a qsub message in the log) if you
>>> try to submit a PBS job to a machine where you no longer have an
>>> allocation.  I received this message when trying to use Fusion after
>>> a long time.
>>> 
>>> On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote:
>>> 
>>>> I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made.  I don't know a good way for Swift to detect this, but that's something to discuss (and create a ticket for).
>>>> 
>>>> - Mike
>>>> 
>>>> ----- Original Message -----
>>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>> To: "Lorenzo Pesce" <lpesce at uchicago.edu>
>>>>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>, swift-user at ci.uchicago.edu
>>>>> Sent: Tuesday, April 17, 2012 12:08:49 PM
>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>> There is a site file entry for that.
>>>>> 
>>>>> <profile namespace="globus" key="queue">scalability</profile>
>>>>> 
>>>>> You must make certain that the shape of your job fits in the queue you
>>>>> requested. If it does not fit, there is a silent failure and Swift
>>>>> hangs.
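>>>>>
>>>>> For context, that profile line goes inside the pool entry in your
>>>>> sites file.  A rough sketch (the handle, jobmanager, and paths here
>>>>> are only placeholders, not necessarily the right values for your
>>>>> site):
>>>>>
>>>>> <pool handle="beagle">
>>>>>   <execution provider="coaster" jobmanager="local:pbs"/>
>>>>>   <profile namespace="globus" key="queue">scalability</profile>
>>>>>   <filesystem provider="local"/>
>>>>>   <workdirectory>/lustre/beagle/youruser/swiftwork</workdirectory>
>>>>> </pool>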
>>>>> 
>>>>> On Apr 17, 2012, at 11:58, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>>>> 
>>>>>> Works great!
>>>>>> 
>>>>>> Is there a way I can ask swift to put me in a specific queue, such
>>>>>> as scalability or some reservation?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote:
>>>>>> 
>>>>>>> OK, here's a workaround for this problem:
>>>>>>> 
>>>>>>> You need to add this line to the swift command bin/swift in your
>>>>>>> Swift release.
>>>>>>> 
>>>>>>> After:
>>>>>>> 
>>>>>>> updateOptions "$SWIFT_HOME" "swift.home"
>>>>>>> 
>>>>>>> Add:
>>>>>>> 
>>>>>>> updateOptions "$USER_HOME" "user.home"
>>>>>>> 
>>>>>>> This is near line 92 in the version I tested, Swift trunk
>>>>>>> swift-r5739 cog-r3368.
>>>>>>> 
>>>>>>> Then you can do:
>>>>>>> 
>>>>>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc
>>>>>>> -sites.file pbs.xml catsn.swift -n=1
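>>>>>>>
>>>>>>> (Under the hood this should just forward the property to the JVM,
>>>>>>> i.e. roughly the same effect as the java -Duser.home=<new_location>
>>>>>>> suggestion quoted below:
>>>>>>>
>>>>>>> java -Duser.home=/lustre/beagle/wilde <swift main class and args>
>>>>>>>
>>>>>>> but done for you by the swift wrapper script.)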
>>>>>>> 
>>>>>>> Lorenzo, if you are using "module load swift", we'll need to update
>>>>>>> that, or you can copy the swift release directory structure that
>>>>>>> module load points you to, then modify the swift command there, and
>>>>>>> put that modified release first in your PATH.
>>>>>>>
>>>>>>> We'll work out a way to get something like this into the production
>>>>>>> module and trunk.  I don't know of other systems that are currently
>>>>>>> affected by this, but I'm sure they will come up.
>>>>>>> 
>>>>>>> - Mike
>>>>>>> 
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>> Cc: swift-user at ci.uchicago.edu
>>>>>>>> Sent: Saturday, April 14, 2012 10:13:40 AM
>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>> stackoverflow says this should work:
>>>>>>>> 
>>>>>>>> java -Duser.home=<new_location> <your_program>
>>>>>>>> 
>>>>>>>> Need to get that in via the swift command.
>>>>>>>> 
>>>>>>>> - Mike
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
>>>>>>>>> swift-user at ci.uchicago.edu
>>>>>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM
>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting
>>>>>>>>> user.home to the same thing. Neither works. I think user.home is
>>>>>>>>> coming from the Java property, and that doesn't seem to be
>>>>>>>>> influenced by the HOME env var. I was about to look at whether
>>>>>>>>> Java can be asked to change home, maybe by setting a command line
>>>>>>>>> arg to Java.
>>>>>>>>> 
>>>>>>>>> - Mike
>>>>>>>>> 
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
>>>>>>>>>> swift-user at ci.uchicago.edu
>>>>>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM
>>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>>> That is an easy fix, I believe. I know where the code is, so I
>>>>>>>>>> will change it and test.
>>>>>>>>>> 
>>>>>>>>>> In the meantime, could you try something? Try setting
>>>>>>>>>> user.home=<someplace.on.lustre>
>>>>>>>>>> in your config file and try again.
>>>>>>>>>> 
>>>>>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde <wilde at mcs.anl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> /home is no longer mounted by the compute nodes, per the
>>>>>>>>>>> post-maintenance summary:
>>>>>>>>>>> 
>>>>>>>>>>> "External filesystem dependencies minimized: Compute nodes and
>>>>>>>>>>> the
>>>>>>>>>>> scheduler should now continue to process and complete jobs
>>>>>>>>>>> without
>>>>>>>>>>> the threat of interference of external filesystem outages.
>>>>>>>>>>> /gpfs/pads is only available on login1 through login5; /home is
>>>>>>>>>>> on
>>>>>>>>>>> login and mom nodes only."
>>>>>>>>>>> 
>>>>>>>>>>> So we need to (finally) remove Swift's dependence on
>>>>>>>>>>> $HOME/.globus
>>>>>>>>>>> and $HOME/.globus/scripts in particular.
>>>>>>>>>>> 
>>>>>>>>>>> I suggest - since the swift command already needs to write to
>>>>>>>>>>> "." - that we create a scripts/ directory in "." instead of
>>>>>>>>>>> $HOME/.globus.  And this should be used by any provider that
>>>>>>>>>>> would have previously created files below .globus.
>>>>>>>>>>> 
>>>>>>>>>>> I'll echo this to swift-devel and start a thread there to
>>>>>>>>>>> discuss.  It's possible there's already a property to cause
>>>>>>>>>>> scripts/ to be created elsewhere.  If not, I think we should
>>>>>>>>>>> make one.  I think it makes sense to group the scripts created
>>>>>>>>>>> by a run into the current dir, along with the swift log,
>>>>>>>>>>> _concurrent, and (in the conventions I use in my run scripts)
>>>>>>>>>>> swiftwork/.
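>>>>>>>>>>>
>>>>>>>>>>> So a run directory would end up looking something like this
>>>>>>>>>>> (just a sketch; the script and log names are illustrative):
>>>>>>>>>>>
>>>>>>>>>>>   ./scripts/PBS<id>.submit
>>>>>>>>>>>   ./catsn-<runid>.log
>>>>>>>>>>>   ./_concurrent/
>>>>>>>>>>>   ./swiftwork/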
>>>>>>>>>>> 
>>>>>>>>>>> Lorenzo, hopefully we can at least get you a workaround for
>>>>>>>>>>> this
>>>>>>>>>>> soon.
>>>>>>>>>>> 
>>>>>>>>>>> You *might* be able to trick swift into doing this by setting
>>>>>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under
>>>>>>>>>>> .globus and that didn't work, as /home is not even readable by
>>>>>>>>>>> the compute nodes, which in this case need to run the coaster
>>>>>>>>>>> worker (.pl) script.
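>>>>>>>>>>>
>>>>>>>>>>> That is, something along these lines (untested; the config and
>>>>>>>>>>> site file names are just examples):
>>>>>>>>>>>
>>>>>>>>>>> HOME=/lustre/beagle/$USER swift -config cf -tc.file tc \
>>>>>>>>>>>     -sites.file pbs.xml catsn.swift -n=1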
>>>>>>>>>>> 
>>>>>>>>>>> - Mike
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
>>>>>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>>>>> Cc: swift-user at ci.uchicago.edu
>>>>>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM
>>>>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>>>>> In principle the access to the /home filesystem should still
>>>>>>>>>>>> be there.
>>>>>>>>>>>> 
>>>>>>>>>>>> The only thing I did was to change the cf file to remove some
>>>>>>>>>>>> errors I had in it, so that might also be the source of the
>>>>>>>>>>>> problem.  This is what it looks like now (BTW, the comments
>>>>>>>>>>>> are not mine, I run swift only from lustre):
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> # Whether to transfer the wrappers from the compute nodes
>>>>>>>>>>>> # I like to launch from my home dir, but keep everything on
>>>>>>>>>>>> # lustre
>>>>>>>>>>>> wrapperlog.always.transfer=false
>>>>>>>>>>>> 
>>>>>>>>>>>> #Indicates whether the working directory on the remote site
>>>>>>>>>>>> # should be left intact even when a run completes successfully
>>>>>>>>>>>> sitedir.keep=true
>>>>>>>>>>>> 
>>>>>>>>>>>> #try only once
>>>>>>>>>>>> execution.retries=1
>>>>>>>>>>>> 
>>>>>>>>>>>> # Attempt to run as much as possible, i.e., ignore non-fatal
>>>>>>>>>>>> errors
>>>>>>>>>>>> lazy.errors=true
>>>>>>>>>>>> 
>>>>>>>>>>>> # to reduce filesystem access
>>>>>>>>>>>> status.mode=provider
>>>>>>>>>>>> 
>>>>>>>>>>>> use.provider.staging=false
>>>>>>>>>>>> 
>>>>>>>>>>>> provider.staging.pin.swiftfiles=false
>>>>>>>>>>>> 
>>>>>>>>>>>> foreach.max.threads=100
>>>>>>>>>>>> 
>>>>>>>>>>>> provenance.log=false
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> The perl script is the worker script that is submitted with
>>>>>>>>>>>>> PBS.  I have not tried to run on Beagle since the maintenance
>>>>>>>>>>>>> period ended, so I am not exactly sure why the error popped
>>>>>>>>>>>>> up.  One reason could be that the home file system is no
>>>>>>>>>>>>> longer mounted on the compute nodes.  I know they spoke about
>>>>>>>>>>>>> that being a possibility, but I am not sure they implemented
>>>>>>>>>>>>> it during the maintenance period.  Do you know if the home
>>>>>>>>>>>>> file system is still mounted on the compute nodes?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce
>>>>>>>>>>>>> <lpesce at uchicago.edu>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi --
>>>>>>>>>>>>>> I haven't seen this one before:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Can't open perl script
>>>>>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl":
>>>>>>>>>>>>>> No such file or directory
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The config of the cray has changed; might this have anything
>>>>>>>>>>>>>> to do with it?  I have no idea what perl script it is talking
>>>>>>>>>>>>>> about or why it is looking in home.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Swift-user mailing list
>>>>>>>>>>>>>> Swift-user at ci.uchicago.edu
>>>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Swift-user mailing list
>>>>>>>>>>>> Swift-user at ci.uchicago.edu
>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Michael Wilde
>>>>>>>>>>> Computation Institute, University of Chicago
>>>>>>>>>>> Mathematics and Computer Science Division
>>>>>>>>>>> Argonne National Laboratory
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Michael Wilde
>>>>>>>>> Computation Institute, University of Chicago
>>>>>>>>> Mathematics and Computer Science Division
>>>>>>>>> Argonne National Laboratory
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Michael Wilde
>>>>>>>> Computation Institute, University of Chicago
>>>>>>>> Mathematics and Computer Science Division
>>>>>>>> Argonne National Laboratory
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> Swift-user mailing list
>>>>>>>> Swift-user at ci.uchicago.edu
>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>>> 
>>>>>>> --
>>>>>>> Michael Wilde
>>>>>>> Computation Institute, University of Chicago
>>>>>>> Mathematics and Computer Science Division
>>>>>>> Argonne National Laboratory
>>>>>>> 
>>>>>> 
>>>> 
>>>> -- 
>>>> Michael Wilde
>>>> Computation Institute, University of Chicago
>>>> Mathematics and Computer Science Division
>>>> Argonne National Laboratory
>>>> 
>>> 
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>> 
>> 
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user



