[Swift-user] trunk-cobalt block task ended prematurely

Ketan Maheshwari ketan at mcs.anl.gov
Mon Mar 2 20:22:28 CST 2015


OK, I found that worker.pl was crashing because of my subjob-related mods:
I had forgotten to declare a variable using "my". After this change, it runs.

However, jobs that complete are not reported as completed; they stay in the
"active" state in the progress log until the job times out. I also see the
following lines in stderr:

Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value in concatenation (.) or string at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 387.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value in concatenation (.) or string at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 387.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.

I am not sure whether these are errors or warnings, or whether they are relevant.
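
For triage: "Use of uninitialized value" is normally a Perl warning rather
than a fatal error, so the lines above do not by themselves show that the
worker died. Whether they are related to jobs staying "active" is not
established here. A minimal Python sketch (not part of Swift or worker.pl; the
name tally_uninitialized is made up for illustration) that rejoins the wrapped
lines and counts how often each script line is reported:

```python
import re
from collections import Counter

# Matches one Perl "uninitialized value" warning once line wrapping is undone.
# The variable name is optional: Perl omits it when it cannot name the operand.
WARNING_RE = re.compile(
    r"Use of uninitialized value(?: (?P<var>[$@%]\S+))? in (?P<op>.+?) at "
    r"(?P<file>\S+) line (?P<line>\d+)\."
)


def tally_uninitialized(stderr_text):
    """Count warnings by (variable, operation, script line number)."""
    # The warnings above are wrapped across two lines each; collapse all
    # whitespace so every warning sits on one logical line before matching.
    flat = " ".join(stderr_text.split())
    counts = Counter()
    for m in WARNING_RE.finditer(flat):
        key = (m.group("var") or "<unnamed>", m.group("op"), int(m.group("line")))
        counts[key] += 1
    return counts


# The six warnings quoted in this message, copied verbatim.
SAMPLE = """\
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value in concatenation (.) or string at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 387.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
Use of uninitialized value in concatenation (.) or string at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 387.
Use of uninitialized value $SOFT_IMAGE_JOB_ID in numeric eq (==) at
/home/ketan/.globus/coasters/cscript225276003254762418.pl line 2235.
"""

if __name__ == "__main__":
    for (var, op, line), n in tally_uninitialized(SAMPLE).most_common():
        print(f"{n}x line {line}: {var} in {op}")
```

On the six warnings quoted above, this counts line 2235 four times and line
387 twice, so those two places in the generated script are the ones to look
at first.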

Attached is the complete log.

Thanks,
Ketan

On Mon, Mar 2, 2015 at 7:35 PM, Hategan-Marandiuc, Philip M. <
hategan at mcs.anl.gov> wrote:

> On Mon, 2015-03-02 at 18:55 -0600, Ketan Maheshwari wrote:
> > I do not see any logs in ~/.globus/coasters; yes, /home is mounted on
> > service nodes and is writable from there.
> >
> > I added "--mode script" as a default arg to qsub in provider code, but
> > still getting the same error. Attached is the new log.
> >
> > About the manual option, would we also need the coaster service to be
> > running? Or would just invoking the worker suffice (for troubleshooting
> > purposes)?
>
> Just invoking worker.pl. You should eventually get a log file from the
> worker that indicates that the perl process has started. It will fail,
> unable to connect to the service, but that's secondary.
>
> I'm surprised that you are not getting any stdout/stderr from the
> process. Maybe the secret is somewhere around that.
>
> Mihael
>
> >
> > --Ketan
> >
> > On Mon, Mar 2, 2015 at 6:25 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > wrote:
> >
> > > On Mon, 2015-03-02 at 18:11 -0600, Ketan Maheshwari wrote:
> > > > I tried this option but it did not seem to work. Attached is the log.
> > >
> > > Check /home/ketan/.globus/coasters for worker logs. If there aren't
> > > any, it means that worker.pl isn't being started (I'm assuming that
> > > /home is mounted on compute/service nodes).
> > >
> > > If that's the case, I would suggest troubleshooting by manually running
> > > the qsub command and seeing why the worker doesn't start.
> > >
> > > Mihael
> > >
> > > >
> > > > On Mon, Mar 2, 2015 at 5:27 PM, Mihael Hategan <hategan at mcs.anl.gov>
> > > > wrote:
> > > >
> > > > > It would really be much more useful if you posted the full log.
> > > > >
> > > > > Anyway, I believe that what you need to do is:
> > > > > site.cluster.execution.options.workerLoggingLevel = "DEBUG"
> > > > >
> > > > > Mihael
> > > > >
> > > > > On Mon, 2015-03-02 at 16:37 -0600, Ketan Maheshwari wrote:
> > > > > > The qsub command from the log says:
> > > > > >
> > > > > > > qsub -e WORKER_LOGGING_LEVEL=NONE --proccount 32 -n 32 -t 40 --cwd ...
> > > > > >
> > > > > > > So, the env variable in swift.conf does not seem to take effect.
> > > > > >
> > > > > > On Mon, Mar 2, 2015 at 4:33 PM, Hategan-Marandiuc, Philip M. <
> > > > > > hategan at mcs.anl.gov> wrote:
> > > > > >
> > > > > > > > Well, we need to figure out why. Since the qsub command line is in
> > > > > > > > the swift log, and the qsub command line should reflect the
> > > > > > > > setting, it would be useful if you posted the swift log.
> > > > > > >
> > > > > > > Mihael
> > > > > > >
> > > > > > > On Mon, 2015-03-02 at 16:27 -0600, Ketan Maheshwari wrote:
> > > > > > > > For workerlogs, I am trying:
> > > > > > > >
> > > > > > > >  app.bgsh {
> > > > > > > >         executable: "/home/ketan/SwiftApps/subjobs/bg.sh"
> > > > > > > >         maxWallTime: "00:04:00"
> > > > > > > >         env.ENABLE_WORKER_LOGGING="TRUE"
> > > > > > > >         env.WORKER_LOGGING_LEVEL="DEBUG"
> > > > > > > >         env.WORKER_LOG_DIR="/home/ketan/workerlogs"
> > > > > > > >     }
> > > > > > > >
> > > > > > > > > This does not seem to trigger logging.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Ketan
> > > > > > > >
> > > > > > > > > On Mon, Mar 2, 2015 at 4:07 PM, Hategan-Marandiuc, Philip M.
> > > > > > > > > <hategan at mcs.anl.gov> wrote:
> > > > > > > >
> > > > > > > > > > I would recommend enabling worker logging to see if we get any
> > > > > > > > > > info from the worker process. Could be some simple thing, like
> > > > > > > > > > the wrong IP address.
> > > > > > > > >
> > > > > > > > > Mihael
> > > > > > > > >
> > > > > > > > > On Mon, 2015-03-02 at 15:47 -0600, Ketan Maheshwari wrote:
> > > > > > > > > > > I am trying to run on BG/Q with local:cobalt with trunk, but
> > > > > > > > > > > Swift crashes with the following error:
> > > > > > > > > >
> > > > > > > > > > > Caused by: Exception in bgsh:
> > > > > > > > > > >     Arguments: [/home/ketan/SwiftApps/subjobs/mpicatsnsleep/mpicatnap,
> > > > > > > > > > > /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./data.txt,
> > > > > > > > > > > /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./outdir/f.0002.out,
> > > > > > > > > > > 1]
> > > > > > > > > >     Host: cluster
> > > > > > > > > >     Directory: catsnsleepmpi-run001/jobs/b/bgsh-3nq3uc5m
> > > > > > > > > > exception @ swift-int-staging.k, line: 165
> > > > > > > > > > Caused by:
> > > > > > > > > > exception @ swift-int-staging.k, line: 160
> > > > > > > > > > > Caused by: Block task failed: 0302-2109420-000000 Block task
> > > > > > > > > > > ended prematurely
> > > > > > > > > >
> > > > > > > > > > > In the log, I see the qsub call being made and a jobid is
> > > > > > > > > > > returned. However, I could not figure out what causes the
> > > > > > > > > > > task to fail.
> > > > > > > > > >
> > > > > > > > > > > One more thing I noticed when translating from the old sites
> > > > > > > > > > > conf to the new one is that the new conf did not accept the
> > > > > > > > > > > property "globus:mode = script".
> > > > > > > > > >
> > > > > > > > > > A full run log is attached. Thanks for any suggestions.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Ketan
> > > > > > > > > > _______________________________________________
> > > > > > > > > > Swift-user mailing list
> > > > > > > > > > Swift-user at ci.uchicago.edu
> > > > > > > > > >
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Swift-user mailing list
> > > > > Swift-user at ci.uchicago.edu
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > >
> > >
> > >
> > > _______________________________________________
> > > Swift-user mailing list
> > > Swift-user at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > >
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: run001.tgz
Type: application/x-gzip
Size: 19894 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20150302/5d2f7e0a/attachment.bin>