[Swift-devel] trunk-cobalt block task ended prematurely

Ketan Maheshwari ketan at mcs.anl.gov
Thu Mar 5 10:20:10 CST 2015


Hi Mihael,

Moving the STDIN block from _swiftwrap to _swiftwrap.staging did not make
any difference, i.e., it did not work.

Changing

"$EXEC" "${CMDARGS[@]}" 1>"$STDOUT" 2>"$STDERR"

to

"$EXEC" "${CMDARGS[@]}" 1>"$STDOUT" 2>"$STDERR" </dev/null

in _swiftwrap.staging did work. I think it is a good idea to make this
change to the code.
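
For illustration, here is a minimal sketch of how the two branches could be
folded into one by defaulting stdin to /dev/null. The run_app helper and the
use of cat as a stand-in application are hypothetical and are not taken from
_swiftwrap.staging; only the STDIN variable name comes from the wrapper.

```shell
#!/bin/bash
# Hypothetical helper (not part of _swiftwrap.staging): run a command with
# stdin taken from the file named in $STDIN, or from /dev/null when STDIN
# is unset or empty.
run_app() {
    local stdin_src="${STDIN:-/dev/null}"
    "$@" <"$stdin_src"
}

# cat with no arguments reads stdin until EOF. With /dev/null as stdin it
# sees EOF immediately instead of waiting on whatever stdin it inherited.
STDIN=""
out_empty=$(run_app cat)

# With STDIN pointing at a file, the command reads that file.
tmpfile=$(mktemp)
printf 'hello\n' >"$tmpfile"
STDIN="$tmpfile"
out_file=$(run_app cat)
rm -f "$tmpfile"

echo "no stdin:   '${out_empty}'"
echo "stdin file: '${out_file}'"
```

Whether this is preferable to simply appending </dev/null to the existing
first branch is a judgment call; the one-line change is smaller and is what
was actually tested here.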

Thanks,
Ketan


On Wed, Mar 4, 2015 at 3:43 PM, Ketan Maheshwari <ketan at mcs.anl.gov> wrote:

> Sorry, I understand your question now. I have copied the whole if ... fi
> block (including genScripts) from _swiftwrap to _swiftwrap.staging to see
> what happens. Will keep you posted. --Ketan
>
> On Wed, Mar 4, 2015 at 3:37 PM, Mihael Hategan <hategan at mcs.anl.gov>
> wrote:
>
>> Right. However, you mentioned before that it works fine when you switch
>> staging from "direct" to "swift". That causes swift to use the plain
>> _swiftwrap, which pretty much has the same code:
>>
>> if [ "$STDIN" == "" ]; then
>>         if [ "$SWIFT_GEN_SCRIPTS" != "" ]; then
>>                 genScripts
>>         fi
>>
>>         if [ -n "$TIMECMD" ] && [ -n "$TIMEARGS" ]; then
>>                 "$TIMECMD" "${TIMEARGS[@]}" "$EXEC" "${CMDARGS[@]}" 1>"$STDOUT" 2>"$STDERR"
>>         else
>>                 "$EXEC" "${CMDARGS[@]}" 1>"$STDOUT" 2>"$STDERR"
>>         fi
>> else
>> ...
>>
>> So that's what's puzzling to me.
>>
>> Can you try changing this line in _swiftwrap.staging:
>>
>> "$EXEC" "${CMDARGS[@]}" 1>"$STDOUT" 2>"$STDERR"
>>
>> to
>>
>> "$EXEC" "${CMDARGS[@]}" 1>"$STDOUT" 2>"$STDERR" </dev/null
>>
>> and then remove stdin= from the app in the swift script and see if
>> that works?
>>
>> While it seems like a wise choice to do this in general, I'm trying to
>> see if this fixes this particular issue.
>>
>> Mihael
>>
>>
>> On Wed, 2015-03-04 at 15:22 -0600, Ketan Maheshwari wrote:
>> > Hi Mihael,
>> >
>> > The code in _swiftwrap.staging branches on whether STDIN is set.
>> > See the snippet below:
>> >
>> > if [ "$STDIN" == "" ]; then
>> >     if [ "$SWIFT_GEN_SCRIPTS" != "" ]; then
>> >         echo "#!/bin/bash" > run.sh
>> >         echo "\"$EXEC\" \"${CMDARGS[@]}\" 1>\"$STDOUT\" 2>\"$STDERR\"" >> run.sh
>> >         chmod +x run.sh
>> >     fi
>> >     "$EXEC" "${CMDARGS[@]}" 1>"$STDOUT" 2>"$STDERR"
>> > else
>> >     if [ "$SWIFT_GEN_SCRIPTS" != "" ]; then
>> >         echo "#!/bin/bash" > run.sh
>> >         echo "\"$EXEC\" \"${CMDARGS[@]}\" 1>\"$STDOUT\" 2>\"$STDERR\" <\"$STDIN\"" >> run.sh
>> >         chmod +x run.sh
>> >     fi
>> >     "$EXEC" "${CMDARGS[@]}" 1>"$STDOUT" 2>"$STDERR" <"$STDIN"
>> > fi
>> >
>> > When "stdin=" is not provided, the code takes the first branch and
>> > hangs. It works otherwise.
>> >
>> > It is possible that it hangs because of the MPICH bug Mike mentioned.
>> >
>> > I agree we should stick in a </dev/null in there.
>> >
>> > --Ketan
>> >
>> >
>> > On Wed, Mar 4, 2015 at 3:12 PM, Hategan-Marandiuc, Philip M. <hategan at mcs.anl.gov> wrote:
>> >
>> > > I'm still confused. I don't see any difference in stdin handling
>> > > between _swiftwrap and _swiftwrap.staging (which is used for direct
>> > > staging).
>> > >
>> > > Maybe we should always feed the app a /dev/null if there is no stdin=
>> > > specified.
>> > >
>> > > Mihael
>> > >
>> > > On Wed, 2015-03-04 at 08:50 -0600, Ketan Maheshwari wrote:
>> > > > I added stdin="/dev/null" to the app invocation line and it has
>> > > > worked now. --Ketan
>> > > >
>> > > > On Wed, Mar 4, 2015 at 8:44 AM, Ketan Maheshwari <ketan at mcs.anl.gov> wrote:
>> > > >
>> > > > > Please find one with 59 minutes attached. --Ketan
>> > > > >
>> > > > > On Tue, Mar 3, 2015 at 11:17 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>> > > > >
>> > > > >> You are using coasters, so what gets queued is the block, not
>> > > > >> the job.
>> > > > >>
>> > > > >> You should specify execution.options.maxJobTime = "00:59:00".
>> > > > >>
>> > > > >> Then you can probably do a walltime of about "00:50:00". But 7
>> > > > >> minutes vs. 5 minutes isn't much of a difference.
>> > > > >>
>> > > > >> Mihael
>> > > > >>
>> > > > >> On Tue, 2015-03-03 at 22:28 -0600, Ketan Maheshwari wrote:
>> > > > >> > Attached is a log for maxWallTime set to 7 minutes, beyond
>> > > > >> > which the job does not get submitted because of the 1 hour
>> > > > >> > walltime limit of Cetus.
>> > > > >> > --Ketan
>> > > > >> >
>> > > > >> > On Tue, Mar 3, 2015 at 10:15 PM, Ketan Maheshwari <ketan at mcs.anl.gov> wrote:
>> > > > >> >
>> > > > >> > > When I check the queue with qstat, I see the job is submitted
>> > > > >> > > for 40 minutes. When I try to increase maxWallTime, the
>> > > > >> > > workflow does not get submitted because the maximum allowed
>> > > > >> > > walltime on Cetus is 60 minutes. --Ketan
>> > > > >> > >
>> > > > >> > > On Tue, Mar 3, 2015 at 10:03 PM, Hategan-Marandiuc, Philip M. <hategan at mcs.anl.gov> wrote:
>> > > > >> > >
>> > > > >> > >> Hi,
>> > > > >> > >>
>> > > > >> > >> Looks like almost exactly 5 minutes to me:
>> > > > >> > >>
>> > > > >> > >> 2015-03-04 01:45:43,943+0000 INFO  Execute TASK_STATUS_CHANGE taskid=urn:R-3-0-2-1425432781969 status=2 workerid=0304-3301040-000000:000000
>> > > > >> > >> 2015-03-04 01:50:44,676+0000 INFO  Execute TASK_STATUS_CHANGE taskid=urn:R-3-0-2-1425432781969 status=5 Walltime exceeded
>> > > > >> > >>
>> > > > >> > >> Which is what the config file is asking for:
>> > > > >> > >>
>> > > > >> > >> app.bgsh {
>> > > > >> > >>   env.SUBBLOCK_SIZE: "16"                                 # [R] line 27
>> > > > >> > >>   executable: "/home/ketan/SwiftApps/subjobs/bg.sh"       # [R] line 25
>> > > > >> > >>   maxWallTime: "00:05:00"                                 # [R] line 26
>> > > > >> > >> }
>> > > > >> > >>
>> > > > >> > >> Again, the wrapper log shows the app as still running. The
>> > > > >> > >> last line is:
>> > > > >> > >> Progress  2015-03-04 01:45:43.971393118+0000  EXECUTE
>> > > > >> > >>
>> > > > >> > >> Please do me a favor and increase the walltime to one hour,
>> > > > >> > >> and let's see what happens then.
>> > > > >> > >>
>> > > > >> > >> If it still doesn't finish after one hour, we could try to
>> > > > >> > >> strace it and see what is happening there.
>> > > > >> > >>
>> > > > >> > >> Mihael
>> > > > >> > >>
>> > > > >> > >> On Tue, 2015-03-03 at 19:53 -0600, Ketan Maheshwari wrote:
>> > > > >> > >> > Please find the log attached. --Ketan
>> > > > >> > >> >
>> > > > >> > >> > On Tue, Mar 3, 2015 at 7:03 PM, Hategan-Marandiuc, Philip M. <hategan at mcs.anl.gov> wrote:
>> > > > >> > >> >
>> > > > >> > >> > > On Tue, 2015-03-03 at 15:42 -0600, Ketan Maheshwari wrote:
>> > > > >> > >> > > > Slow network looks unlikely to be a cause:
>> > > > >> > >> > >
>> > > > >> > >> > > It's the only variable obvious, so I wouldn't say that.
>> > > > >> > >>
>> > > > >> > >> I meant "only obvious variable" there.
>> > > > >> > >>
>> > > > >> > >>
>> > > > >> > >>
>> > > > >> > >
>> > > > >>
>> > > > >>
>> > > > >> _______________________________________________
>> > > > >> Swift-devel mailing list
>> > > > >> Swift-devel at ci.uchicago.edu
>> > > > >>
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>> > > > >>