[Swift-devel] Lammps on BGQ: task completes but status shows active

Ketan Maheshwari ketan at mcs.anl.gov
Mon Dec 8 15:26:56 CST 2014


Hi Mihael,

strace is not accepting the -f option. According to the strace man page,
-f makes strace follow forked child processes, which may be why the
option causes an error on BG/Q. Here is the error message:

Execution failed:
Exception in strace:
    Arguments: [-fo, /home/ketan/strace.f.out, /home/ketan/SwiftApps/subjobs/bg.sh, /soft/applications/lammps/24Apr13/lmp_bgq_xlomp, -in, input.lammps]
    Host: cluster
    Directory: workflow.bgq-run016/jobs/r/strace-rqnmne1m
exception @ swift-int-staging.k, line: 181
Caused by: The following output files were not created by the application: lammps.dump

------- Application STDERR --------
2014-12-08 21:20:43.872 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties
2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: max open file descriptors: 65536
2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
2014-12-08 21:20:43.876 (INFO ) [0xfff7c25bde0] 27211:tatu.runjob.client: scheduler job id is 377978
log4cxx: No appender could be found for logger (tatu.runjob.monitor).
log4cxx: Please initialize the log4cxx system properly.
2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client: failed reading: Connection reset by peer
2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client: protocol version exchange between the runjob client and monitor failed
-----------------------------------
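
For comparison, here is a rough sketch of what -f would normally do on an
ordinary Linux host (not BG/Q). It traces a small parent/child pair once
without -f and once with it. The command in parent_cmd is only a
hypothetical stand-in for bg.sh, and the output directory is a temporary
one made up for this example; it assumes strace is installed and that
tracing is permitted on the machine.

```shell
#!/bin/sh
# Sketch only: compare strace output with and without -f on a plain Linux host.
# Nothing here is specific to BG/Q or runjob.

# Temporary directory for the two trace files (illustrative location).
out_dir=$(mktemp -d)

# Hypothetical stand-in for bg.sh: a shell that starts a child and waits for it.
parent_cmd='sleep 1 & wait'

if command -v strace >/dev/null 2>&1; then
    # Without -f only the parent shell is traced, so the trace shows the
    # parent waiting on its child but nothing the child itself does.
    strace -o "$out_dir/parent-only.out" sh -c "$parent_cmd" \
        || echo "strace without -f failed"

    # With -f the forked child (sleep) is traced as well; when -o is used,
    # each line in the output file is prefixed with the PID that made the call.
    strace -f -o "$out_dir/follow.out" sh -c "$parent_cmd" \
        || echo "strace -f failed"

    echo "traces written to $out_dir"
else
    echo "strace not found; skipping trace"
fi
```

If the second trace contains the child's system calls and the first does
not, -f is working as documented on that host, which would point at the
BG/Q environment rather than at strace itself.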

Thanks,
Ketan

On Mon, Dec 8, 2014 at 3:09 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:

> On Mon, 2014-12-08 at 14:07 -0600, Ketan Maheshwari wrote:
> > I tried to get strace output with two methods:
> >
> > stderr.txt: This was obtained by attaching the "--strace 0" switch to the
> > runjob command. It seems to be exiting normally after writing a bunch of
> > stuff.
> >
> > strace.out: This one was obtained by wrapping the app exe with strace -o
> > $HOME/strace.out  ...
>
> Are you sure? It looks like you wrapped the execution of bg.sh in
> strace. This log only tells us that bg.sh starts runjob and runjob never
> completes, which we already know. You probably want to go to the lowest
> level possible. But see below (*).
>
> >
> > This one shows a stuck output with the last line as:
> >
> > waitpid(-1, %
>
> waitpid means it's waiting for a subprocess, so this isn't useful
> because we want to find out what the leaf subprocess is hanging on. You
> could use the '-f' argument to strace to make it follow subprocesses. If
> you do that, it probably won't matter (aside from noise) at what level
> you use strace (*).
>
> Mihael