[Swift-user] Error message on Cray XE6
Jonathan Monette
jonmon at mcs.anl.gov
Tue Apr 17 20:20:03 CDT 2012
So here is the case where there is no allocation available for a user. I am running on Fusion where my allocation has expired.
Here is what Swift is showing:
Swift 0.93 swift-r5483 cog-r3339
RunID: 20120417-2011-kky5yb46
Progress: time: Tue, 17 Apr 2012 20:11:36 -0500
Progress: time: Tue, 17 Apr 2012 20:12:06 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:12:36 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:13:06 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:13:36 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:14:06 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:14:36 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:15:06 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:15:36 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:16:06 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:16:36 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:17:06 -0500 Submitted:2
Progress: time: Tue, 17 Apr 2012 20:17:36 -0500 Submitted:2
And here is the message in the log:
2012-04-17 20:11:37,306-0500 DEBUG AbstractExecutor Wrote PBS script to /homes/jonmon/.globus/scripts/PBS5153949301806424155.submit
2012-04-17 20:11:37,306-0500 DEBUG AbstractExecutor Command line: qsub /homes/jonmon/.globus/scripts/PBS5153949301806424155.submit
2012-04-17 20:11:37,433-0500 DEBUG AbstractExecutor Waiting for output from qsub
2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Output from qsub is: ""
2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Waiting for output from qsub
2012-04-17 20:11:37,434-0500 DEBUG AbstractExecutor Output from qsub is: "ERROR: Project "startup-jonmon" has no allocation; can't run job."
2012-04-17 20:11:37,434-0500 INFO BlockTaskSubmitter Error submitting block task: Cannot submit job: Could not submit job (qsub reported an exit code of 1).
ERROR: Project "startup-jonmon" has no allocation; can't run job.
So this shows that qsub failed (with an exit code of 1), but Swift keeps going, showing a submitted count of 2 even though there are no jobs under qstat -u jonmon.
I will try to reproduce the case where no job fits in the specified queue. I do not think this is high priority, but it is definitely something that users should be aware of.
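One way to surface this failure immediately is to treat a non-zero qsub exit code (or an ERROR line in its output) as a fatal submission error instead of counting the job as submitted. This is a minimal sketch, not Swift's actual submitter code; the submit_job name and the QSUB override are illustrative:

```shell
#!/bin/sh
# Sketch: wrap the submit command so a non-zero exit code (or an ERROR
# line in its output) is reported at once rather than leaving the job
# shown as "Submitted". QSUB defaults to qsub and can be overridden.
QSUB=${QSUB:-qsub}

submit_job() {
    script="$1"
    out=$("$QSUB" "$script" 2>&1)
    status=$?
    if [ "$status" -ne 0 ] || printf '%s\n' "$out" | grep -q 'ERROR:'; then
        echo "submission failed (exit $status): $out" >&2
        return 1
    fi
    echo "$out"   # the job id on success
}
```

Called in place of a bare qsub, this would have turned the "no allocation" message above into an immediate submission failure instead of a silent hang.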
On Apr 17, 2012, at 7:33 PM, Jonathan Monette wrote:
> So that is not what I was witnessing. It seems the PBS scheduler rejected the job, because no jobs showed up under qstat, but Swift still showed that jobs were submitted with no failures. When I checked the log, I found a message from qsub saying it could not submit the job. I will reproduce the issue and post what I see. Perhaps this is happening, though, because the scheduler rejects the job but does not return an error code?
>
> On Apr 17, 2012, at 19:25, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>
>> Hmm, so if a block job fails, coasters will fail at least one swift job.
>> If this happens enough times, the failure should propagate through the
>> retries and to the user. It might take some time though.
>>
>> So maybe there's a distinction between "hangs" and "takes a lot of time
>> but eventually fails".
>>
>> Mihael
>>
>> On Tue, 2012-04-17 at 12:49 -0500, Jonathan Monette wrote:
>>> I do not think that is the case where PBS leaves the job queued; maybe
>>> on some machines, but not on Beagle. When I had a job that did not fit
>>> in the scalability queue Swift hung but when checking the log I found
>>> a message from qsub saying the job was rejected. There is a bug
>>> ticket open for this issue. I will find the log that has the
>>> message(or just recreate it) and post the message to the ticket.
>>> Swift also hangs (with a qsub message in the log) if you try to submit
>>> a PBS job to a machine where you no longer have an allocation. I
>>> received this message when trying to use Fusion after a long time.
>>>
>>> On Apr 17, 2012, at 12:38 PM, Michael Wilde wrote:
>>>
>>>> I think Swift hangs in such cases because PBS silently leaves the job queued even though no current queue can take the job. That seems to be a "feature" of the RM: being able to queue jobs that *might* be runnable in the future iff new queue settings are made. I don't know a good way for Swift to detect this, but that's something to discuss (and create a ticket for).
>>>>
>>>> - Mike
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>> To: "Lorenzo Pesce" <lpesce at uchicago.edu>
>>>>> Cc: "Michael Wilde" <wilde at mcs.anl.gov>, swift-user at ci.uchicago.edu
>>>>> Sent: Tuesday, April 17, 2012 12:08:49 PM
>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>> There is a site file entry for that.
>>>>>
>>>>> <profile namespace="globus" key="queue">scalability</profile>
>>>>>
>>>>> You must make certain that the shape of your job fits in the queue you
>>>>> requested. If it does not fit, there is a silent failure and Swift
>>>>> hangs.
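For context, that profile entry sits inside the pool definition in the sites file. The fragment below is a sketch only; the handle, execution attributes, and workdirectory path are illustrative, and only the queue profile line above is verbatim:

```xml
<pool handle="pbs">
  <execution provider="coaster" jobmanager="local:pbs"/>
  <profile namespace="globus" key="queue">scalability</profile>
  <workdirectory>/lustre/beagle/youruser/swiftwork</workdirectory>
</pool>
```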
>>>>>
>>>>> On Apr 17, 2012, at 11:58, Lorenzo Pesce <lpesce at uchicago.edu> wrote:
>>>>>
>>>>>> Works great!
>>>>>>
>>>>>> Is there a way I can ask swift to put me in a specific queue, such
>>>>>> as scalability of some reservation?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Apr 14, 2012, at 10:51 AM, Michael Wilde wrote:
>>>>>>
>>>>>>> OK, here's a workaround for this problem:
>>>>>>>
>>>>>>> You need to add this line to the swift command bin/swift in your
>>>>>>> Swift release.
>>>>>>>
>>>>>>> After:
>>>>>>>
>>>>>>> updateOptions "$SWIFT_HOME" "swift.home"
>>>>>>>
>>>>>>> Add:
>>>>>>>
>>>>>>> updateOptions "$USER_HOME" "user.home"
>>>>>>>
>>>>>>> This is near line 92 in the version I tested, Swift trunk
>>>>>>> swift-r5739 cog-r3368.
>>>>>>>
>>>>>>> Then you can do:
>>>>>>>
>>>>>>> USER_HOME=/lustre/beagle/wilde swift -config cf -tc.file tc
>>>>>>> -sites.file pbs.xml catsn.swift -n=1
>>>>>>>
>>>>>>> Lorenzo, if you are using "module load swift" we'll need to update
>>>>>>> that, or you can copy the swift release directory structure that
>>>>>>> module load points you to, then modify the swift command there, and
>>>>>>> put that modified release first in your PATH.
>>>>>>>
>>>>>>> We'll work out a way to get something like this into the production
>>>>>>> module and trunk. I don't know of other systems that are currently
>>>>>>> affected by this, but I'm sure they will come up.
>>>>>>>
>>>>>>> - Mike
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>> Cc: swift-user at ci.uchicago.edu
>>>>>>>> Sent: Saturday, April 14, 2012 10:13:40 AM
>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>> Stack Overflow says this should work:
>>>>>>>>
>>>>>>>> java -Duser.home=<new_location> <your_program>
>>>>>>>>
>>>>>>>> Need to get that in via the swift command.
>>>>>>>>
>>>>>>>> - Mike
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
>>>>>>>>> swift-user at ci.uchicago.edu
>>>>>>>>> Sent: Saturday, April 14, 2012 10:10:00 AM
>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>> I just tried both setting HOME=/lustre/beagle/wilde and setting
>>>>>>>>> user.home to the same thing. Neither works. I think user.home is
>>>>>>>>> coming from the Java property, and that doesn't seem to be
>>>>>>>>> influenced
>>>>>>>>> by the HOME env var. I was about to look if Java can be asked to
>>>>>>>>> change home. Maybe by setting a command line arg to Java.
>>>>>>>>>
>>>>>>>>> - Mike
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>>>>> Cc: "Lorenzo Pesce" <lpesce at uchicago.edu>,
>>>>>>>>>> swift-user at ci.uchicago.edu
>>>>>>>>>> Sent: Saturday, April 14, 2012 10:02:14 AM
>>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>>> That is an easy fix, I believe. I know where the code is, so I
>>>>>>>>>> will change and test it.
>>>>>>>>>>
>>>>>>>>>> In the meantime, could you try something? Try setting
>>>>>>>>>> user.home=<someplace.on.lustre>
>>>>>>>>>> in your config file and try again.
>>>>>>>>>>
>>>>>>>>>> On Apr 14, 2012, at 9:58, Michael Wilde <wilde at mcs.anl.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> /home is no longer mounted by the compute nodes, per the
>>>>>>>>>>> post-maintenance summary:
>>>>>>>>>>>
>>>>>>>>>>> "External filesystem dependencies minimized: Compute nodes and
>>>>>>>>>>> the
>>>>>>>>>>> scheduler should now continue to process and complete jobs
>>>>>>>>>>> without
>>>>>>>>>>> the threat of interference of external filesystem outages.
>>>>>>>>>>> /gpfs/pads is only available on login1 through login5; /home is
>>>>>>>>>>> on
>>>>>>>>>>> login and mom nodes only."
>>>>>>>>>>>
>>>>>>>>>>> So we need to (finally) remove Swift's dependence on
>>>>>>>>>>> $HOME/.globus
>>>>>>>>>>> and $HOME/.globus/scripts in particular.
>>>>>>>>>>>
>>>>>>>>>>> I suggest - since the swift command already needs to write to
>>>>>>>>>>> "."
>>>>>>>>>>> -
>>>>>>>>>>> that we create a scripts/ directory in "." instead of
>>>>>>>>>>> $HOME/.globus.
>>>>>>>>>>> And this should be used by any provider that would have
>>>>>>>>>>> previously
>>>>>>>>>>> created files below .globus.
>>>>>>>>>>>
>>>>>>>>>>> I'll echo this to swift-devel and start a thread there to
>>>>>>>>>>> discuss.
>>>>>>>>>>> It's possible there's already a property to cause scripts/ to be
>>>>>>>>>>> created elsewhere. If not, I think we should make one. I think it
>>>>>>>>>>> makes sense to group the scripts created by a run into the current
>>>>>>>>>>> dir, along with the swift log, _concurrent, and (in the conventions
>>>>>>>>>>> I use in my run scripts) swiftwork/.
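A minimal sketch of the proposed layout; SWIFT_SCRIPT_DIR is a hypothetical override name for illustration, not an existing Swift property:

```shell
#!/bin/sh
# Sketch: place generated submit scripts under ./scripts in the run
# directory instead of $HOME/.globus/scripts, so compute nodes that do
# not mount /home can still read them. SWIFT_SCRIPT_DIR is hypothetical.
script_dir=${SWIFT_SCRIPT_DIR:-$PWD/scripts}
mkdir -p "$script_dir"
submit_file=$(mktemp "$script_dir/PBS.submit.XXXXXX")
echo "would write PBS script to: $submit_file"
```

On a system like Beagle, the run directory would live on lustre, so both the login node and the compute nodes can reach the generated scripts.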
>>>>>>>>>>>
>>>>>>>>>>> Lorenzo, hopefully we can at least get you a workaround for
>>>>>>>>>>> this
>>>>>>>>>>> soon.
>>>>>>>>>>>
>>>>>>>>>>> You *might* be able to trick swift into doing this by setting
>>>>>>>>>>> HOME=/lustre/beagle/$USER. I already tried a symlink under
>>>>>>>>>>> .globus
>>>>>>>>>>> and that didn't work, as /home is not even readable by the
>>>>>>>>>>> compute
>>>>>>>>>>> nodes, which in this case need to run the coaster worker (.pl)
>>>>>>>>>>> script.
>>>>>>>>>>>
>>>>>>>>>>> - Mike
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>> From: "Lorenzo Pesce" <lpesce at uchicago.edu>
>>>>>>>>>>>> To: "Jonathan Monette" <jonmon at mcs.anl.gov>
>>>>>>>>>>>> Cc: swift-user at ci.uchicago.edu
>>>>>>>>>>>> Sent: Saturday, April 14, 2012 8:15:39 AM
>>>>>>>>>>>> Subject: Re: [Swift-user] Error message on Cray XE6
>>>>>>>>>>>> In principle the access to the /home filesystem should still
>>>>>>>>>>>> be
>>>>>>>>>>>> there.
>>>>>>>>>>>>
>>>>>>>>>>>> The only thing I did was to change the cf file to remove some
>>>>>>>>>>>> errors I had in it, so that might also be the source of the problem.
>>>>>>>>>>>> This
>>>>>>>>>>>> is
>>>>>>>>>>>> what it looks like now:
>>>>>>>>>>>> (BTW, the comments are not mine, I run swift only from lustre)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> # Whether to transfer the wrappers from the compute nodes
>>>>>>>>>>>> # I like to launch from my home dir, but keep everything on
>>>>>>>>>>>> # lustre
>>>>>>>>>>>> wrapperlog.always.transfer=false
>>>>>>>>>>>>
>>>>>>>>>>>> #Indicates whether the working directory on the remote site
>>>>>>>>>>>> # should be left intact even when a run completes successfully
>>>>>>>>>>>> sitedir.keep=true
>>>>>>>>>>>>
>>>>>>>>>>>> #try only once
>>>>>>>>>>>> execution.retries=1
>>>>>>>>>>>>
>>>>>>>>>>>> # Attempt to run as much as possible, e.g., ignore non-fatal
>>>>>>>>>>>> errors
>>>>>>>>>>>> lazy.errors=true
>>>>>>>>>>>>
>>>>>>>>>>>> # to reduce filesystem access
>>>>>>>>>>>> status.mode=provider
>>>>>>>>>>>>
>>>>>>>>>>>> use.provider.staging=false
>>>>>>>>>>>>
>>>>>>>>>>>> provider.staging.pin.swiftfiles=false
>>>>>>>>>>>>
>>>>>>>>>>>> foreach.max.threads=100
>>>>>>>>>>>>
>>>>>>>>>>>> provenance.log=false
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 14, 2012, at 12:10 AM, Jonathan Monette wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The perl script is the worker script that is submitted with
>>>>>>>>>>>>> PBS.
>>>>>>>>>>>>> I
>>>>>>>>>>>>> have not tried to run on Beagle since the maintenance period
>>>>>>>>>>>>> has
>>>>>>>>>>>>> ended so I am not exactly sure why the error popped up. One
>>>>>>>>>>>>> reason
>>>>>>>>>>>>> could be that the home file system is no longer mounted on
>>>>>>>>>>>>> the
>>>>>>>>>>>>> compute nodes. I know they spoke about that being a possibility,
>>>>>>>>>>>>> but I am not sure whether they implemented it during the
>>>>>>>>>>>>> maintenance period.
>>>>>>>>>>>>> Do
>>>>>>>>>>>>> you
>>>>>>>>>>>>> know if the home file system is still mounted on the compute
>>>>>>>>>>>>> nodes?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 13, 2012, at 17:18, Lorenzo Pesce
>>>>>>>>>>>>> <lpesce at uchicago.edu>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi --
>>>>>>>>>>>>>> I haven't seen this one before:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can't open perl script
>>>>>>>>>>>>>> "/home/lpesce/.globus/coasters/cscript7176272791806289394.pl":
>>>>>>>>>>>>>> No
>>>>>>>>>>>>>> such file or directory
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The config of the Cray has changed; might this have anything to
>>>>>>>>>>>>>> do with it? I have no idea what perl script it is talking about
>>>>>>>>>>>>>> and why it is looking in home.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Swift-user mailing list
>>>>>>>>>>>>>> Swift-user at ci.uchicago.edu
>>>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Michael Wilde
>>>>>>>>>>> Computation Institute, University of Chicago
>>>>>>>>>>> Mathematics and Computer Science Division
>>>>>>>>>>> Argonne National Laboratory
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>