[Swift-devel] Lammps on BGQ: task completes but status shows active

Ketan Maheshwari ketan at mcs.anl.gov
Mon Dec 8 16:15:04 CST 2014


Hi Mihael,

Please find the strace output from _swiftwrap attached. I still get the same
error when I add the -f switch, though.

Thanks,
Ketan

On Mon, Dec 8, 2014 at 3:36 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> Again, can you put the strace call in _swiftwrap rather than bg.sh?
>
> Also, can you paste the exact line that you used to run strace? You are
> asking me to debug an invisible program.
>
> Mihael
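
For readers following along, one way to place the strace call at the innermost wrapper, right where the application is launched, might look like the sketch below. This is not the actual _swiftwrap code: the function name run_app and the variable TRACE_OUT are hypothetical, and the sketch falls back to running the command untraced when strace is unavailable.

```shell
#!/bin/sh
# Sketch only: run_app and TRACE_OUT are hypothetical names and are not
# part of Swift's _swiftwrap.

run_app() {
    if [ -n "${TRACE_OUT:-}" ] && command -v strace >/dev/null 2>&1; then
        # -o writes the trace to a file instead of stderr. Adding -f would
        # also follow child processes, where the platform permits it.
        strace -o "$TRACE_OUT" "$@"
    else
        # No trace requested (or no strace installed): run the command as is.
        "$@"
    fi
}

# Example invocation with a harmless stand-in for the real application.
run_app echo "application ran"
```

Keeping the trace invocation in one place like this also makes it easy to paste the exact command line when reporting a problem.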
>
> On Mon, 2014-12-08 at 15:26 -0600, Ketan Maheshwari wrote:
> > Hi Mihael,
> >
> > The strace command is not accepting the -f option. According to the
> > strace man page, that option concerns forked processes, which might be
> > why it causes an error on BG/Q. Here is the error message:
> >
> > Execution failed:
> > Exception in strace:
> >     Arguments: [-fo, /home/ketan/strace.f.out,
> > /home/ketan/SwiftApps/subjobs/bg.sh,
> > /soft/applications/lammps/24Apr13/lmp_bgq_xlomp, -in, input.lammps]
> >     Host: cluster
> >     Directory: workflow.bgq-run016/jobs/r/strace-rqnmne1m
> > exception @ swift-int-staging.k, line: 181
> > Caused by: The following output files were not created by the
> > application:
> > lammps.dump
> >
> > ------- Application STDERR --------
> > 2014-12-08 21:20:43.872 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions:
> > using properties file /bgsys/local/etc/bg.properties
> > 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions:
> > max open file descriptors: 65536
> > 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions:
> > core file limit: 18446744073709551615
> > 2014-12-08 21:20:43.876 (INFO ) [0xfff7c25bde0] 27211:tatu.runjob.client:
> > scheduler job id is 377978
> > log4cxx: No appender could be found for logger (tatu.runjob.monitor).
> > log4cxx: Please initialize the log4cxx system properly.
> > 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client:
> > failed reading: Connection reset by peer
> > 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client:
> > protocol version exchange between the runjob client and monitor failed
> > -----------------------------------
> >
> > Thanks,
> > Ketan
> >
> > On Mon, Dec 8, 2014 at 3:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > On Mon, 2014-12-08 at 14:07 -0600, Ketan Maheshwari wrote:
> > > > I tried to get strace output with two methods:
> > > >
> > > > stderr.txt: This was obtained by attaching the "--strace 0" switch to
> > > > the runjob command. It seems to be exiting normally after writing a
> > > > bunch of stuff.
> > > >
> > > > strace.out: This one was obtained by wrapping the app exe with
> > > > strace -o $HOME/strace.out  ...
> > >
> > > Are you sure? It looks like you wrapped the execution of bg.sh in
> > > strace. This log only tells us that bg.sh starts runjob and runjob
> > > never completes, which we already know. You probably want to go to the
> > > lowest level possible. But see below (*).
> > >
> > > >
> > > > This one shows a stuck output with the last line as:
> > > >
> > > > waitpid(-1, %
> > >
> > > waitpid means it's waiting for a subprocess, so this isn't useful
> > > because we want to find out what the leaf subprocess is hanging on. You
> > > could use the '-f' argument to strace to make it follow subprocesses.
> > > If you do that, it probably won't matter (aside from noise) at what
> > > level you use strace (*).
> > >
> > > Mihael
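
The point about waitpid can be seen with a small stand-alone sketch. It assumes nothing about bg.sh or runjob: `sleep 1` is a hypothetical stand-in for the leaf process, and the function name is made up for illustration. The parent shell starts a child and then blocks in a wait-family call until the child exits, so a trace of the parent alone says nothing about what the child is doing.

```shell
#!/bin/sh
# Sketch only: 'sleep 1' is a hypothetical stand-in for the leaf process
# (it is not runjob or LAMMPS).

parent_waits_for_child() {
    # Start the "leaf" in the background, as a wrapper script would.
    sleep 1 &
    child=$!
    echo "parent: started child"
    # The shell now blocks in a wait-family system call. A trace of this
    # process alone would stop here, with no hint of the child's activity.
    wait "$child"
    echo "parent: child exited with status $?"
}

parent_waits_for_child
```

Tracing a script like this without -f would typically end in a wait call while the child sleeps. With -f, strace also records the child's system calls, on systems where following forks is permitted.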
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > >
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strace.swiftwrap.out
Type: application/octet-stream
Size: 40253 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20141208/aa0876d8/attachment.obj>

