[Swift-devel] Lammps on BGQ: task completes but status shows active

Mihael Hategan hategan at mcs.anl.gov
Mon Dec 8 16:30:51 CST 2014


This looks like the strace you initially sent, the one that was stracing
bg.sh, so I suspect that you didn't remove the failing strace from
wherever it was, unless I'm misunderstanding what gets called from
where.

It looks like we need to untangle things.

Can you please do exactly the following:
1. Remove every strace-related modification you have made, in all the
files.
2. Create a very simple shell wrapper for your app that just calls the
app with all of its arguments; post the wrapper back here.
3. Make sure that this runs (and hopefully hangs); confirm and post back
here whether it hangs or not.
4. If it hangs, modify the wrapper from step (2) to run strace around
the app; post the modified wrapper here.
5. Run it and post the output from strace.
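
As a rough sketch of what steps 2 and 4 could look like (this is not the
actual wrapper, which has not been posted yet): the function names
run_plain and run_traced and the variables APP and STRACE_OUT are made up
for illustration. The default app path is the LAMMPS binary from the quoted
error message, and $HOME/strace.out is the trace file name from the earlier
attempt.

```shell
#!/bin/sh
# Hypothetical wrapper sketch; names and defaults are assumptions.
# APP defaults to the LAMMPS binary that appears in the quoted error message.
APP="${APP:-/soft/applications/lammps/24Apr13/lmp_bgq_xlomp}"

# Step 2: call the app with all arguments, unchanged.
run_plain() {
    "$APP" "$@"
}

# Step 4: the same call, wrapped in strace.
# -o writes the trace to a file; -f follows child processes
# (drop -f if it turns out to be what triggers the failure).
run_traced() {
    strace -f -o "${STRACE_OUT:-$HOME/strace.out}" "$APP" "$@"
}

# The last line of the real wrapper would be one of:
#   run_plain "$@"      (steps 2 and 3)
#   run_traced "$@"     (steps 4 and 5)
```

Keeping both variants in one file means the only difference between the
plain run and the traced run is that last line, so it stays obvious what
strace is wrapped around.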

Mihael

On Mon, 2014-12-08 at 16:15 -0600, Ketan Maheshwari wrote:
> Hi Mihael,
> 
> Please find the strace output from _swiftwrap attached. It still gives the
> same error when I try the -f switch, though.
> 
> Thanks,
> Ketan
> 
> On Mon, Dec 8, 2014 at 3:36 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > Again, can you put the strace call in _swiftwrap rather than bg.sh?
> >
> > Also, can you paste the exact line that you used to run strace? You are
> > asking me to debug an invisible program.
> >
> > Mihael
> >
> > On Mon, 2014-12-08 at 15:26 -0600, Ketan Maheshwari wrote:
> > > Hi Mihael,
> > >
> > > The strace command is not accepting the -f option. According to the
> > > strace man page, the option concerns forked processes, which might be
> > > why it causes an error on BG/Q. Here is the error message:
> > >
> > > Execution failed:
> > > Exception in strace:
> > >     Arguments: [-fo, /home/ketan/strace.f.out,
> > > /home/ketan/SwiftApps/subjobs/bg.sh,
> > > /soft/applications/lammps/24Apr13/lmp_bgq_xlomp, -in, input.lammps]
> > >     Host: cluster
> > >     Directory: workflow.bgq-run016/jobs/r/strace-rqnmne1m
> > > exception @ swift-int-staging.k, line: 181
> > > Caused by: The following output files were not created by the application:
> > > lammps.dump
> > >
> > > ------- Application STDERR --------
> > > 2014-12-08 21:20:43.872 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties
> > > 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: max open file descriptors: 65536
> > > 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
> > > 2014-12-08 21:20:43.876 (INFO ) [0xfff7c25bde0] 27211:tatu.runjob.client: scheduler job id is 377978
> > > log4cxx: No appender could be found for logger (tatu.runjob.monitor).
> > > log4cxx: Please initialize the log4cxx system properly.
> > > 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client: failed reading: Connection reset by peer
> > > 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client: protocol version exchange between the runjob client and monitor failed
> > > -----------------------------------
> > >
> > > Thanks,
> > > Ketan
> > >
> > > On Mon, Dec 8, 2014 at 3:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> > >
> > > > On Mon, 2014-12-08 at 14:07 -0600, Ketan Maheshwari wrote:
> > > > > I tried to get strace output with two methods:
> > > > >
> > > > > stderr.txt: This was obtained by attaching the "--strace 0" switch to the
> > > > > runjob command. It seems to be exiting normally after writing a bunch of
> > > > > stuff.
> > > > >
> > > > > strace.out: This one was obtained by wrapping the app exe with strace -o
> > > > > $HOME/strace.out  ...
> > > >
> > > > Are you sure? It looks like you wrapped the execution of bg.sh in
> > > > strace. This log only tells us that bg.sh starts runjob and runjob never
> > > > completes, which we already know. You probably want to go to the lowest
> > > > level possible. But see below (*).
> > > >
> > > > >
> > > > > This one shows a stuck output with the last line as:
> > > > >
> > > > > waitpid(-1, %
> > > >
> > > > waitpid means it's waiting for a subprocess, so this isn't useful
> > > > because we want to find out what the leaf subprocess is hanging on. You
> > > > could use the '-f' argument to strace to make it follow subprocesses. If
> > > > you do that, it probably won't matter (aside from noise) at what level
> > > > you use strace (*).
> > > >
> > > > Mihael
> > > >
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > >