[Swift-devel] Lammps on BGQ: task completes but status shows active

Ketan Maheshwari ketan at mcs.anl.gov
Mon Dec 8 17:08:30 CST 2014


On Mon, Dec 8, 2014 at 4:30 PM, Hategan-Marandiuc, Philip M. <
hategan at mcs.anl.gov> wrote:

> This looks like the strace you initially sent, the one that was stracing
> bg.sh, so I suspect that you didn't remove the failing strace from
> wherever it was, unless I'm misunderstanding what gets called from
> where.
>

This is the new strace output, obtained by putting "strace -o" in front of
the $EXEC call in _swiftwrap.staging. It is distinct from the previous one,
which was obtained by putting "strace -o" in front of bgsh in the app call.
They look similar because both invoke the same executable with the same
arguments.


>
> So it looks like we need to untangle things.
>
> So can you do exactly as follows, please:
> 1. remove all modifications you have made regarding strace to all the
> files
> 2. create a very simple shell wrapper for your app that simply calls the
> app with all arguments; post the wrapper back here.
> 3. make sure that this runs (and hopefully hangs); confirm and post back
> here whether it hangs or not.
> 4. if it hangs, modify the wrapper from step (2) to run strace around
> the app; post the modified wrapper here.
> 5. run and post the output from strace.
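
A minimal sketch of the wrapper from steps 2 and 4 above, written as two
shell functions so the plain and traced variants sit side by side. APP,
STRACE_OUT, and the function names are hypothetical placeholders, not names
from this thread; APP defaults to echo only so that the sketch runs anywhere,
and would be pointed at the real application instead.

```shell
#!/bin/bash
# Hypothetical placeholder: set APP to the real application path.
# It defaults to `echo` so this sketch can be run on any machine.
APP="${APP:-echo}"

# Step 2: call the app with all arguments, unchanged.
# "$@" preserves each argument exactly, including embedded spaces.
run_plain() {
    "$APP" "$@"
}

# Step 4: the same call, wrapped in strace.
# STRACE_OUT is an assumed output location; -o writes the trace to that file.
run_traced() {
    strace -o "${STRACE_OUT:-$HOME/strace.wrapper.out}" "$APP" "$@"
}
```

As a standalone wrapper script, the body of either function would be the
whole script, so the only difference between steps 2 and 4 is the
"strace -o FILE" prefix on a single line.
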
>
> Mihael
>
> On Mon, 2014-12-08 at 16:15 -0600, Ketan Maheshwari wrote:
> > Hi Mihael,
> >
> > Please find the strace output from _swiftwrap attached. It still gives the
> > same error when run with the -f switch, though.
> >
> > Thanks,
> > Ketan
> >
> > On Mon, Dec 8, 2014 at 3:36 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >
> > > Again, can you put the strace call in _swiftwrap rather than bg.sh?
> > >
> > > Also, can you paste the exact line that you used to run strace? You are
> > > asking me to debug an invisible program.
> > >
> > > Mihael
> > >
> > > On Mon, 2014-12-08 at 15:26 -0600, Ketan Maheshwari wrote:
> > > > Hi Mihael,
> > > >
> > > > > The strace command is not accepting the -f option. From the man page of
> > > > > strace, I see that the option relates to the forked processes which might
> > > > > be the reason why that option is causing error on BG/Q. Here is the error
> > > > > message:
> > > >
> > > > Execution failed:
> > > > Exception in strace:
> > > >     Arguments: [-fo, /home/ketan/strace.f.out,
> > > > /home/ketan/SwiftApps/subjobs/bg.sh,
> > > > /soft/applications/lammps/24Apr13/lmp_bgq_xlomp, -in, input.lammps]
> > > >     Host: cluster
> > > >     Directory: workflow.bgq-run016/jobs/r/strace-rqnmne1m
> > > > exception @ swift-int-staging.k, line: 181
> > > > > Caused by: The following output files were not created by the application:
> > > > > lammps.dump
> > > >
> > > > ------- Application STDERR --------
> > > > 2014-12-08 21:20:43.872 (INFO ) [0xfff7c25bde0]
> > > ibm.runjob.AbstractOptions:
> > > > using properties file /bgsys/local/etc/bg.properties
> > > > 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0]
> > > ibm.runjob.AbstractOptions:
> > > > max open file descriptors: 65536
> > > > 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0]
> > > ibm.runjob.AbstractOptions:
> > > > core file limit: 18446744073709551615
> > > > 2014-12-08 21:20:43.876 (INFO ) [0xfff7c25bde0]
> 27211:tatu.runjob.client:
> > > > scheduler job id is 377978
> > > > log4cxx: No appender could be found for logger (tatu.runjob.monitor).
> > > > log4cxx: Please initialize the log4cxx system properly.
> > > > 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0]
> 27211:tatu.runjob.client:
> > > > failed reading: Connection reset by peer
> > > > 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0]
> 27211:tatu.runjob.client:
> > > > protocol version exchange between the runjob client and monitor
> failed
> > > > -----------------------------------
> > > >
> > > > Thanks,
> > > > Ketan
> > > >
> > > > > On Mon, Dec 8, 2014 at 3:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > > >
> > > > > On Mon, 2014-12-08 at 14:07 -0600, Ketan Maheshwari wrote:
> > > > > > I tried to get strace output with two methods:
> > > > > >
> > > > > > > stderr.txt: This was obtained by attaching the "--strace 0" switch to the
> > > > > > > runjob command. It seems to be exiting normally after writing a bunch of
> > > > > > > stuff.
> > > > > > >
> > > > > > > strace.out: This one was obtained by wrapping the app exe with strace -o
> > > > > > > $HOME/strace.out  ...
> > > > >
> > > > > > Are you sure? It looks like you wrapped the execution of bg.sh in
> > > > > > strace. This log only tells us that bg.sh starts runjob and runjob never
> > > > > > completes, which we already know. You probably want to go to the lowest
> > > > > > level possible. But see below (*).
> > > > >
> > > > > >
> > > > > > This one shows a stuck output with the last line as:
> > > > > >
> > > > > > waitpid(-1, %
> > > > >
> > > > > > waitpid means it's waiting for a subprocess, so this isn't useful
> > > > > > because we want to find out what the leaf subprocess is hanging on. You
> > > > > > could use the '-f' argument to strace to make it follow subprocesses. If
> > > > > > you do that, it probably won't matter (aside from noise) at what level
> > > > > > you use strace (*).
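
The waitpid point above can be reproduced with a small sketch: a parent shell
that does nothing but wait while a child does the work. The function name
parent_and_leaf and the trace file names in the comments are hypothetical;
the strace commands are left as comments because strace may not be installed
where this is run (and, per this thread, -f failed under runjob on BG/Q).

```shell
#!/bin/bash
# The parent only waits; the leaf child is where the work (or a hang) happens.
parent_and_leaf() {
    # Leaf subprocess: sleeps for the given number of seconds, then reports.
    ( sleep "${1:-1}"; echo "leaf done" ) &
    # The parent blocks here until the child exits. In a trace of the parent
    # alone this appears as a pending wait call (waitpid or wait4, depending
    # on the platform), which says nothing about what the child is doing.
    wait $!
    echo "parent done"
}

parent_and_leaf 0

# Tracing this script, assuming it is saved as demo.sh:
#   strace -o parent.out bash demo.sh      # parent only: ends in a wait call
#   strace -f -o all.out bash demo.sh      # -f also follows forked children
```

With -f, each line of the output file is prefixed with the process ID that
made the call, which is how the leaf's last system call can be picked out.
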
> > > > >
> > > > > Mihael
> > > > >
> > > > > _______________________________________________
> > > > > Swift-devel mailing list
> > > > > Swift-devel at ci.uchicago.edu
> > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > >
> > >
> > >
> > >
>
>
>