<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Dec 8, 2014 at 4:30 PM, Hategan-Marandiuc, Philip M. <span dir="ltr">&lt;<a href="mailto:hategan@mcs.anl.gov" target="_blank">hategan@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">This looks like the strace you initially sent, the one that was stracing<br>
bg.sh, so I suspect that you didn't remove the failing strace from<br>
wherever it was, unless I'm misunderstanding what gets called from<br>
where.<br></blockquote><div><br></div><div>This is a new strace output, obtained by putting "strace -o" in front of the $EXEC call in _swiftwrap.staging. It is distinct from the previous one, which was obtained by putting "strace -o" in front of bgsh in the app call. The two look similar because they invoke the same executable with the same arguments.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
So it looks like we need to untangle things.<br>
<br>
So can you do exactly as follows, please:<br>
1. remove all modifications you have made regarding strace to all the<br>
files<br>
2. create a very simple shell wrapper for your app that simply calls the<br>
app with all arguments; post the wrapper back here.<br>
3. make sure that this runs (and hopefully hangs); confirm and post back<br>
here whether it hangs or not.<br>
4. if it hangs, modify the wrapper from step (2) to run strace around<br>
the app; post the modified wrapper here.<br>
5. run and post the output from strace.<br>
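<br>
A minimal sketch of what steps 2 and 4 could look like, assuming a bash wrapper. The function names run_plain and run_traced and the STRACE_OUT variable are hypothetical; in a real wrapper file the body would be just the one command line with the app path written out, followed by "$@":<br>

```shell
#!/bin/bash

# Step 2: plain wrapper. Runs the command it is given with all arguments
# passed through unchanged; "$@" keeps arguments containing spaces intact.
run_plain() {
    "$@"
}

# Step 4: the same call with strace in front of it.
# STRACE_OUT is a hypothetical variable; the default matches the
# $HOME/strace.out location mentioned earlier in the thread.
run_traced() {
    strace -o "${STRACE_OUT:-$HOME/strace.out}" "$@"
}

# Example (placeholder path): run_plain /path/to/app -in input.lammps
```

Keeping the two variants identical apart from the strace prefix makes it clear which process the resulting trace belongs to.<br>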
<br>
Mihael<br>
<div class="HOEnZb"><div class="h5"><br>
On Mon, 2014-12-08 at 16:15 -0600, Ketan Maheshwari wrote:<br>
> Hi Mihael,<br>
><br>
> Please find the strace output from _swiftwrap attached. It gives the same<br>
> error on trying with -f switch though.<br>
><br>
> Thanks,<br>
> Ketan<br>
><br>
> On Mon, Dec 8, 2014 at 3:36 PM, Mihael Hategan &lt;<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>&gt; wrote:<br>
><br>
> > Again, can you put the strace call in _swiftwrap rather than bg.sh?<br>
> ><br>
> > Also, can you paste the exact line that you used to run strace? You are<br>
> > asking me to debug an invisible program.<br>
> ><br>
> > Mihael<br>
> ><br>
> > On Mon, 2014-12-08 at 15:26 -0600, Ketan Maheshwari wrote:<br>
> > > Hi Mihael,<br>
> > ><br>
> > > The strace command is not accepting the -f option. From the man page of<br>
> > > strace, I see that the option relates to the forked processes which might<br>
> > > be the reason why that option is causing error on BG/Q. Here is the error<br>
> > > message:<br>
> > ><br>
> > > Execution failed:<br>
> > > Exception in strace:<br>
> > > Arguments: [-fo, /home/ketan/strace.f.out,<br>
> > > /home/ketan/SwiftApps/subjobs/bg.sh,<br>
> > > /soft/applications/lammps/24Apr13/lmp_bgq_xlomp, -in, input.lammps]<br>
> > > Host: cluster<br>
> > > Directory: workflow.bgq-run016/jobs/r/strace-rqnmne1m<br>
> > > exception @ swift-int-staging.k, line: 181<br>
> > > Caused by: The following output files were not created by the application:<br>
> > > lammps.dump<br>
> > ><br>
> > > ------- Application STDERR --------<br>
> > > 2014-12-08 21:20:43.872 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties<br>
> > > 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: max open file descriptors: 65536<br>
> > > 2014-12-08 21:20:43.874 (INFO ) [0xfff7c25bde0] ibm.runjob.AbstractOptions: core file limit: 18446744073709551615<br>
> > > 2014-12-08 21:20:43.876 (INFO ) [0xfff7c25bde0] 27211:tatu.runjob.client:<br>
> > > scheduler job id is 377978<br>
> > > log4cxx: No appender could be found for logger (tatu.runjob.monitor).<br>
> > > log4cxx: Please initialize the log4cxx system properly.<br>
> > > 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client:<br>
> > > failed reading: Connection reset by peer<br>
> > > 2014-12-08 21:20:43.912 (FATAL) [0xfff7c25bde0] 27211:tatu.runjob.client:<br>
> > > protocol version exchange between the runjob client and monitor failed<br>
> > > -----------------------------------<br>
> > ><br>
> > > Thanks,<br>
> > > Ketan<br>
> > ><br>
> > > On Mon, Dec 8, 2014 at 3:09 PM, Mihael Hategan &lt;<a href="mailto:hategan@mcs.anl.gov">hategan@mcs.anl.gov</a>&gt; wrote:<br>
> > ><br>
> > > > On Mon, 2014-12-08 at 14:07 -0600, Ketan Maheshwari wrote:<br>
> > > > > I tried to get strace output with two methods:<br>
> > > > ><br>
> > > > > stderr.txt: This was obtained by attaching the "--strace 0" switch to the<br>
> > > > > runjob command. It seems to be exiting normally after writing a bunch of<br>
> > > > > stuff.<br>
> > > > ><br>
> > > > > strace.out: This one was obtained by wrapping the app exe with strace -o<br>
> > > > > $HOME/strace.out ...<br>
> > > ><br>
> > > > Are you sure? It looks like you wrapped the execution of bg.sh in<br>
> > > > strace. This log only tells us that bg.sh starts runjob and runjob never<br>
> > > > completes, which we already know. You probably want to go to the lowest<br>
> > > > level possible. But see below (*).<br>
> > > ><br>
> > > > ><br>
> > > > > This one shows a stuck output with the last line as:<br>
> > > > ><br>
> > > > > waitpid(-1, %<br>
> > > ><br>
> > > > waitpid means it's waiting for a subprocess, so this isn't useful<br>
> > > > because we want to find out what the leaf subprocess is hanging on. You<br>
> > > > could use the '-f' argument to strace to make it follow subprocesses. If<br>
> > > > you do that, it probably won't matter (aside from noise) at what level<br>
> > > > you use strace (*).<br>
> > > ><br>
> > > > Mihael<br>
> > > ><br>
> > > > _______________________________________________<br>
> > > > Swift-devel mailing list<br>
> > > > <a href="mailto:Swift-devel@ci.uchicago.edu">Swift-devel@ci.uchicago.edu</a><br>
> > > > <a href="https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel" target="_blank">https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel</a><br>
> > > ><br>
<br>
<br>
</div></div></blockquote></div><br></div></div>