[Swift-devel] Lammps on BGQ: task completes but status shows active

Mihael Hategan hategan at mcs.anl.gov
Mon Dec 8 15:36:07 CST 2014


Again, can you put the strace call in _swiftwrap rather than bg.sh?

Also, can you paste the exact line that you used to run strace? You are
asking me to debug an invisible program.
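
To make the request concrete, here is a rough sketch of how a wrapper could record the exact strace line it is about to use. This is not what _swiftwrap contains; traced_line and TRACE_OUT are hypothetical names chosen for illustration, and the only strace options assumed are -f (follow forked children) and -o (write the trace to a file).

```shell
#!/bin/sh
# Sketch only: TRACE_OUT and traced_line are hypothetical names,
# not anything defined by Swift or _swiftwrap.
TRACE_OUT="${TRACE_OUT:-$HOME/strace.out}"

# Print the strace command line for the given program and arguments.
# -f makes strace follow forked child processes; -o names the output file.
traced_line() {
    printf 'strace -f -o %s' "$TRACE_OUT"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}
```

Inside a wrapper, one could write the output of traced_line "$@" to a log and then run strace -f -o "$TRACE_OUT" "$@" itself, so the line pasted into a report matches what was actually executed.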

Mihael

On Mon, 2014-12-08 at 15:26 -0600, Ketan Maheshwari wrote:
> Hi Mihael,
> 
> The strace command is not accepting the -f option. According to the strace
> man page, that option relates to forked processes, which might be why it
> causes an error on BG/Q. Here is the error message:
> 
> Execution failed:
> Exception in strace:
>     Arguments: [-fo, /home/ketan/strace.f.out,
> /home/ketan/SwiftApps/subjobs/bg.sh,
> /soft/applications/lammps/24Apr13/lmp_bgq_xlomp, -in, input.lammps]
>     Host: cluster
>     Directory: workflow.bgq-run016/jobs/r/strace-rqnmne1m
> exception @ swift-int-staging.k, line: 181
> Caused by: The following output files were not created by the application:
> lammps.dump
> 
> ------- Application STDERR --------
> 2014-12-08 21:20:43.872 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions:
> using properties file /bgsys/local/etc/bg.properties
> 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions:
> max open file descriptors: 65536
> 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions:
> core file limit: 18446744073709551615
> 2014-12-08 21:20:43.876 (INFO ) [0xfff7c25bde0] 27211:tatu.runjob.client:
> scheduler job id is 377978
> log4cxx: No appender could be found for logger (tatu.runjob.monitor).
> log4cxx: Please initialize the log4cxx system properly.
> 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client:
> failed reading: Connection reset by peer
> 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client:
> protocol version exchange between the runjob client and monitor failed
> -----------------------------------
> 
> Thanks,
> Ketan
> 
> On Mon, Dec 8, 2014 at 3:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> 
> > On Mon, 2014-12-08 at 14:07 -0600, Ketan Maheshwari wrote:
> > > I tried to get strace output with two methods:
> > >
> > > stderr.txt: This was obtained by attaching the "--strace 0" switch to the
> > > runjob command. It seems to be exiting normally after writing a bunch of
> > > stuff.
> > >
> > > strace.out: This one was obtained by wrapping the app exe with strace -o
> > > $HOME/strace.out  ...
> >
> > Are you sure? It looks like you wrapped the execution of bg.sh in
> > strace. This log only tells us that bg.sh starts runjob and runjob never
> > completes, which we already know. You probably want to go to the lowest
> > level possible. But see below (*).
> >
> > >
> > > This one shows a stuck output with the last line as:
> > >
> > > waitpid(-1, %
> >
> > waitpid means it's waiting for a subprocess, so this isn't useful
> > because we want to find out what the leaf subprocess is hanging on. You
> > could use the '-f' argument to strace to make it follow subprocesses. If
> > you do that, it probably won't matter (aside from noise) at what level
> > you use strace (*).
> >
> > Mihael
> >




