[Swift-devel] trunk-cobalt block task ended prematurely
Michael Wilde
wilde at anl.gov
Wed Mar 4 15:17:26 CST 2015
This brings to mind a similar problem that we encountered a while back,
with an MPICH2 bug.
See this ticket:
https://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=1014
- Mike
On 3/4/15 3:12 PM, Mihael Hategan wrote:
> I'm still confused. I don't see any difference in stdin handling between
> _swiftwrap and _swiftwrap.staging (which is used for direct staging).
>
> Maybe we should always feed the app a /dev/null if there is no stdin=
> specified.
>
> Mihael
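
[Illustration of the suggestion above: giving a child process /dev/null as its stdin when none is specified. This is a minimal Python sketch, not the _swiftwrap shell code. The helper name run_with_null_stdin is hypothetical, and the child command is a stand-in for a real app. It only shows that a process started this way sees end-of-file on its first read instead of inheriting whatever stdin the launcher had.]

```python
import subprocess
import sys


def run_with_null_stdin(cmd):
    """Run cmd with stdin connected to the null device (hypothetical helper)."""
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # equivalent to redirecting "< /dev/null"
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Stand-in child: reads all of stdin and reports how many characters it got.
    child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    result = run_with_null_stdin(child)
    print(result.stdout.strip())
```
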
>
> On Wed, 2015-03-04 at 08:50 -0600, Ketan Maheshwari wrote:
>> I added stdin="/dev/null" to app invocation line and it has worked now.
>> --Ketan
>>
>> On Wed, Mar 4, 2015 at 8:44 AM, Ketan Maheshwari <ketan at mcs.anl.gov> wrote:
>>
>>> Please find one with 59 minutes attached. --Ketan
>>>
>>> On Tue, Mar 3, 2015 at 11:17 PM, Mihael Hategan <hategan at mcs.anl.gov>
>>> wrote:
>>>
>>>> You are using coasters, so what gets queued is the block, not the job.
>>>>
>>>> You should specify execution.options.maxJobTime = "00:59:00".
>>>>
>>>> Then you can probably do a walltime of about "00:50:00". But 7 minutes
>>>> vs. 5 minutes isn't much of a difference.
>>>>
>>>> Mihael
>>>>
>>>> On Tue, 2015-03-03 at 22:28 -0600, Ketan Maheshwari wrote:
>>>>> Attached is a log for maxWalltime set to 7 minutes beyond which the job
>>>>> does not get submitted because of the 1 hour walltime limit of Cetus.
>>>>> --Ketan
>>>>>
>>>>> On Tue, Mar 3, 2015 at 10:15 PM, Ketan Maheshwari <ketan at mcs.anl.gov> wrote:
>>>>>> When I check queue with qstat, I see the job is submitted for 40 minutes.
>>>>>> When I try to increase maxWallTime the workflow does not get submitted
>>>>>> because on Cetus maximum allowed walltime is 60 minutes. --Ketan
>>>>>>
>>>>>> On Tue, Mar 3, 2015 at 10:03 PM, Hategan-Marandiuc, Philip M. <
>>>>>> hategan at mcs.anl.gov> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Looks like almost exactly 5 minutes to me:
>>>>>>>
>>>>>>> 2015-03-04 01:45:43,943+0000 INFO Execute TASK_STATUS_CHANGE
>>>>>>> taskid=urn:R-3-0-2-1425432781969 status=2
>>>>>>> workerid=0304-3301040-000000:000000
>>>>>>> 2015-03-04 01:50:44,676+0000 INFO Execute TASK_STATUS_CHANGE
>>>>>>> taskid=urn:R-3-0-2-1425432781969 status=5 Walltime exceeded
>>>>>>>
>>>>>>> Which is what the config file is asking for:
>>>>>>>
>>>>>>> app.bgsh {
>>>>>>>     env.SUBBLOCK_SIZE: "16" # [R] line 27
>>>>>>>     executable: "/home/ketan/SwiftApps/subjobs/bg.sh" # [R] line 25
>>>>>>>     maxWallTime: "00:05:00" # [R] line 26
>>>>>>> }
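
[Illustration of the "almost exactly 5 minutes" reading: a minimal Python sketch that subtracts the two timestamps quoted in the log excerpt above. The helper name elapsed_seconds is hypothetical, and the format string is an assumption based only on the two lines shown here.]

```python
from datetime import datetime

# Timestamp layout assumed from the two quoted log lines,
# e.g. "2015-03-04 01:45:43,943+0000".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f%z"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    started = "2015-03-04 01:45:43,943+0000"  # status=2 line
    failed = "2015-03-04 01:50:44,676+0000"   # status=5 "Walltime exceeded" line
    print(f"{elapsed_seconds(started, failed):.3f} s")
```

The result is about 0.7 seconds over the 300 seconds requested by maxWallTime: "00:05:00".
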
>>>>>>>
>>>>>>> Again, the wrapper log shows the app as still running. Last line is:
>>>>>>> Progress 2015-03-04 01:45:43.971393118+0000 EXECUTE
>>>>>>>
>>>>>>> Please do me a favor and increase the walltime to one hour and let's see
>>>>>>> what happens then.
>>>>>>>
>>>>>>> If it still doesn't finish after one hour, we could try to strace it and
>>>>>>> see what is happening there.
>>>>>>>
>>>>>>> Mihael
>>>>>>>
>>>>>>> On Tue, 2015-03-03 at 19:53 -0600, Ketan Maheshwari wrote:
>>>>>>>> Please find the log attached. --Ketan
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2015 at 7:03 PM, Hategan-Marandiuc, Philip M. <
>>>>>>>> hategan at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>> On Tue, 2015-03-03 at 15:42 -0600, Ketan Maheshwari wrote:
>>>>>>>>>> Slow network looks unlikely to be a cause:
>>>>>>>>> It's the only variable obvious, so I wouldn't say that.
>>>>>>> I meant "only obvious variable" there.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>> _______________________________________________
>>>> Swift-devel mailing list
>>>> Swift-devel at ci.uchicago.edu
>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>>>>
>>>
>
--
Michael Wilde
Mathematics and Computer Science, Argonne National Laboratory
Computation Institute, The University of Chicago