[Swift-user] trunk-cobalt block task ended prematurely

Mihael Hategan hategan at mcs.anl.gov
Mon Mar 2 19:35:56 CST 2015


On Mon, 2015-03-02 at 18:55 -0600, Ketan Maheshwari wrote:
> I do not see any logs in ~/.globus/coasters; yes, /home is mounted on
> service nodes and is writable from there.
> 
> I added "--mode script" as a default arg to qsub in the provider code, but
> I am still getting the same error. Attached is the new log.
> 
> About the manual option, would we also need the coaster service to be running?
> Or would just invoking the worker suffice (for troubleshooting purposes)?

Just invoking worker.pl. You should eventually get a log file from the
worker that indicates that the perl process has started. It will fail,
unable to connect to the service, but that's secondary.

I'm surprised that you are not getting any stdout/stderr from the
process. Maybe the secret is somewhere around that.

Mihael
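
A minimal sketch of the log check discussed in this thread, assuming the worker writes its files somewhere under ~/.globus/coasters. The thread does not give the log file names, so the sketch simply lists every file in that directory, newest first. The helper names find_worker_logs and summarize are made up for illustration and are not part of Swift or the coaster code.

```python
from pathlib import Path


def find_worker_logs(log_dir):
    """Return all files under log_dir, newest first.

    Returns an empty list if the directory does not exist.
    File names are not filtered because the thread does not
    say how worker logs are named (assumption: any file counts).
    """
    root = Path(log_dir).expanduser()
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


def summarize(log_dir):
    """Give a one-line verdict on whether any worker output exists."""
    logs = find_worker_logs(log_dir)
    if not logs:
        return f"no files under {log_dir}: worker.pl probably never started"
    return f"{len(logs)} file(s) under {log_dir}; newest: {logs[0]}"


if __name__ == "__main__":
    # Directory taken from the thread; adjust if WORKER_LOG_DIR is set elsewhere.
    print(summarize("~/.globus/coasters"))
```

Running this after a manual qsub of worker.pl should show whether the perl process got far enough to write anything. An empty result points at the job launch itself rather than at the connection to the service.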

> 
> --Ketan
> 
> On Mon, Mar 2, 2015 at 6:25 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > On Mon, 2015-03-02 at 18:11 -0600, Ketan Maheshwari wrote:
> > > I tried this option but it did not seem to work. Attached is the log.
> >
> > Check /home/ketan/.globus/coasters for worker logs. If there aren't any,
> > it means that worker.pl isn't being started (I'm assuming that /home is
> > mounted on compute/service nodes).
> >
> > If that's the case, I would suggest troubleshooting by manually running
> > the qsub command and seeing why the worker doesn't start.
> >
> > Mihael
> >
> > >
> > > On Mon, Mar 2, 2015 at 5:27 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > >
> > > > It would really be much more useful if you posted the full log.
> > > >
> > > > Anyway, I believe that what you need to do is:
> > > > site.cluster.execution.options.workerLoggingLevel = "DEBUG"
> > > >
> > > > Mihael
> > > >
> > > > On Mon, 2015-03-02 at 16:37 -0600, Ketan Maheshwari wrote:
> > > > > The qsub command from the log says:
> > > > >
> > > > > > qsub -e WORKER_LOGGING_LEVEL=NONE --proccount 32 -n 32 -t 40 --cwd ...
> > > > >
> > > > > So, the env variable on swift.conf does not seem to take effect.
> > > > >
> > > > > On Mon, Mar 2, 2015 at 4:33 PM, Hategan-Marandiuc, Philip M. <
> > > > > hategan at mcs.anl.gov> wrote:
> > > > >
> > > > > > > Well, we need to figure out why. Since the qsub command line is in the
> > > > > > > swift log, and the qsub command line should reflect the setting, it
> > > > > > > would be useful if you posted the swift log.
> > > > > >
> > > > > > Mihael
> > > > > >
> > > > > > On Mon, 2015-03-02 at 16:27 -0600, Ketan Maheshwari wrote:
> > > > > > > For workerlogs, I am trying:
> > > > > > >
> > > > > > >  app.bgsh {
> > > > > > >         executable: "/home/ketan/SwiftApps/subjobs/bg.sh"
> > > > > > >         maxWallTime: "00:04:00"
> > > > > > >         env.ENABLE_WORKER_LOGGING="TRUE"
> > > > > > >         env.WORKER_LOGGING_LEVEL="DEBUG"
> > > > > > >         env.WORKER_LOG_DIR="/home/ketan/workerlogs"
> > > > > > >     }
> > > > > > >
> > > > > > > Does not seem to trigger logging.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Ketan
> > > > > > >
> > > > > > > On Mon, Mar 2, 2015 at 4:07 PM, Hategan-Marandiuc, Philip M. <
> > > > > > > hategan at mcs.anl.gov> wrote:
> > > > > > >
> > > > > > > > > I would recommend enabling worker logging to see if we get any info
> > > > > > > > > from the worker process. Could be some simple thing, like the wrong IP
> > > > > > > > > address.
> > > > > > > >
> > > > > > > > Mihael
> > > > > > > >
> > > > > > > > On Mon, 2015-03-02 at 15:47 -0600, Ketan Maheshwari wrote:
> > > > > > > > > > I am trying to run on BG/Q with local:cobalt with trunk, but Swift
> > > > > > > > > > crashes with the following error:
> > > > > > > > >
> > > > > > > > > > Caused by: Exception in bgsh:
> > > > > > > > > >     Arguments: [/home/ketan/SwiftApps/subjobs/mpicatsnsleep/mpicatnap,
> > > > > > > > > > /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./data.txt,
> > > > > > > > > > /gpfs/mira-home/ketan/SwiftApps/subjobs/mpicatsnsleep/./outdir/f.0002.out,
> > > > > > > > > > 1]
> > > > > > > > > >     Host: cluster
> > > > > > > > > >     Directory: catsnsleepmpi-run001/jobs/b/bgsh-3nq3uc5m
> > > > > > > > > > exception @ swift-int-staging.k, line: 165
> > > > > > > > > > Caused by:
> > > > > > > > > > exception @ swift-int-staging.k, line: 160
> > > > > > > > > > Caused by: Block task failed: 0302-2109420-000000 Block task ended prematurely
> > > > > > > > >
> > > > > > > > > > In the log, I see the qsub call being made and a jobid is returned.
> > > > > > > > > > However, I could not figure out the cause of the task failure.
> > > > > > > > > >
> > > > > > > > > > One more thing I noticed when translating from the old sites conf to
> > > > > > > > > > the new one is that the new conf did not accept the property
> > > > > > > > > > "globus:mode = script".
> > > > > > > > > A full run log is attached. Thanks for any suggestions.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Ketan
> > > > > > > > > _______________________________________________
> > > > > > > > > Swift-user mailing list
> > > > > > > > > Swift-user at ci.uchicago.edu
> > > > > > > > >
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Swift-user mailing list
> > > > Swift-user at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> > > >
> >
> >
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> >




