<div dir="ltr"><div>I currently use <b>MPICH2</b>.</div><div><b>Configuration</b>:</div><div><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">I'm
using a plain ./configure with no options other than the prefix:</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""> ./configure
--prefix=/space/local/mpich2</span></p><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""><b>Execution:</b></span></p><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif"">mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines2.txt -n 10 mpi_send_rec_testany 1000 10000 2 20 1 logs/res_test</span></p>
<p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""><br></span></p><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""><b>Output on screen from a successful run:</b></span></p>
<p class="MsoNormal"></p><p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif">YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Floating point exception (signal 8)</font></p><p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif">This typically refers to a problem with your application.</font></p>
<p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif">Please see the FAQ page for debugging suggestions</font></p><div style="font-weight: bold; font-family: Arial, sans-serif; font-size: 10pt; "><br>
</div><p></p><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""><b>Output on screen from an unsuccessful run:</b></span></p><p class="MsoNormal"></p><p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif">control_cb (./pm/pmiserv/pmiserv_cb.c:321): assert (!closed) failed</font></p>
<p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif">HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status</font></p><p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif">HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event</font></p>
<p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif">[mpiexec@student1-ib0] main (./ui/mpich/mpiexec.c:420): process manager error waiting for completion</font></p><p></p><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""><br>
</span></p><p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif" size="2">The program is not stable. It detects the failure of slave 1 as expected, but does not always complete successfully.</font></p>
<p class="MsoNormal"><font class="Apple-style-span" face="Arial, sans-serif" size="2">Currently I'm ignoring SIGUSR1 (by overwriting the signal handler).</font></p><p class="MsoNormal"><span class="Apple-style-span" style="font-family: Arial, sans-serif; "><br>
</span></p><p class="MsoNormal"><span class="Apple-style-span" style="font-family: Arial, sans-serif; ">Can you please tell me:</span></p><p class="MsoNormal"></p><ul><li><span class="Apple-style-span" style="font-family: Arial, sans-serif; ">What should I do to stabilize my test? How should I handle SIGUSR1 when a process fails?</span></li>
<li><span class="Apple-style-span" style="font-family: Arial, sans-serif; ">This test uses a polling mechanism implemented via MPI_Test per connection. How can I get the same results using MPI_Waitany? How can I recognize the failed process's rank, exclude it from communication, and continue working with the surviving processes?</span></li>
</ul><p></p><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""><br></span></p><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Arial","sans-serif""><br>
</span></p></div><br><br><div class="gmail_quote">On Tue, Oct 25, 2011 at 5:27 PM, Darius Buntinas <span dir="ltr"><<a href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
It looks like you're using MVAPICH. I don't believe MVAPICH supports the fault-tolerance features you're looking for.<br>
<br>
You'll need to use MPICH2 with the default channel (i.e., don't specify --with-device=...), leave out --enable-threads (it's the default already) and don't use mpd (leave out --with-pm=...)<br>
<font color="#888888"><br>
-d<br>
</font><div><div></div><div class="h5"><br>
On Oct 25, 2011, at 10:16 AM, Anatoly G wrote:<br>
<br>
> Compilation performed by my boss.<br>
> Configuration:<br>
> ./configure --with-device=ch3:sock --enable-debuginfo --prefix=/space/local/mvapich2 CFLAGS=-fPIC --enable-shared --enable-threads --enable-sharedlibs=gcc --with-pm=mpd:hydra<br>
><br>
> mvapich2-1.7rc2<br>
><br>
><br>
> Anatoly.<br>
><br>
> On Tue, Oct 25, 2011 at 4:17 PM, Darius Buntinas <<a href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>> wrote:<br>
><br>
> Did you configure and compile MPICH2 yourself? If you did, please send us the command you used to configure it (e.g., ./configure --prefix=...).<br>
><br>
> If you didn't compile it yourself, you'll need to talk to the person who did to get that information.<br>
><br>
> Also, what version of MPICH2 are you using?<br>
><br>
> -d<br>
><br>
> On Oct 25, 2011, at 2:30 AM, Anatoly G wrote:<br>
><br>
> > Initialization lines are:<br>
> > MPI::Init(argc, argv);<br>
> > MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);<br>
> ><br>
> > Execution command:<br>
> > mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync<br>
> ><br>
> > Anatoly.<br>
> ><br>
> ><br>
> > On Mon, Oct 24, 2011 at 10:17 PM, Darius Buntinas <<a href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>> wrote:<br>
> ><br>
> > In MPI_Init, the signal handler should be installed, so SIGUSR1 shouldn't kill the process.<br>
> ><br>
> > Can you send us the configure line you used?<br>
> ><br>
> > -d<br>
> ><br>
> > On Oct 23, 2011, at 1:54 AM, Anatoly G wrote:<br>
> ><br>
> > > Sorry, I still don't understand.<br>
> > > When a remote process fails, the rest of the processes get SIGUSR1 and die by default, because they don't have a signal handler installed.<br>
> > > If I create a signal handler for SIGUSR1, I can't detect that one of the remote/local processes died. How can I recognize which remote process died? The signal carries only local-host process information.<br>
> > ><br>
> > > Anatoly.<br>
> > ><br>
> > ><br>
> > > On Mon, Oct 17, 2011 at 7:40 PM, Darius Buntinas <<a href="mailto:buntinas@mcs.anl.gov">buntinas@mcs.anl.gov</a>> wrote:<br>
> > ><br>
> > > On Oct 15, 2011, at 4:47 AM, Pavan Balaji wrote:<br>
> > ><br>
> > > ><br>
> > > > On 10/11/2011 02:35 PM, Darius Buntinas wrote:<br>
> > > >> I took a look at your code. Mpiexec will send a SIGUSR1 signal to<br>
> > > >> each process to notify it of a failed process (Oops, I forgot about<br>
> > > >> that when I responded to your previous email). If you need a signal<br>
> > > >> for your application, you'll need to choose another one. The signal<br>
> > > >> handler you installed replaced MPICH's signal handler, so the library<br>
> > > >> wasn't able to detect that the process had failed.<br>
> > > ><br>
> > > > Anatoly: In stacked libraries, you are supposed to chain signal handlers. Replacing another library's signal handlers can lead to unexpected behavior.<br>
> > ><br>
> > > If you set the signal handler before calling MPI_Init, MPICH will chain your signal handler.<br>
> > ><br>
> > > ><br>
> > > >> Another problem is that MPI_Abort() isn't killing all processes, so<br>
> > > >> when I commented out CreateOwnSignalHandler(), the master detected<br>
> > > >> the failure and called MPI_Abort(), but some slave processes were<br>
> > > >> still hanging in MPI_Barrier(). We'll need to fix that.<br>
> > > ><br>
> > > > Darius: What's the expected behavior here? Should a regular exit look at whether the user asked for a cleanup or not, and an abort kill all processes?<br>
> > ><br>
> > > That's what I think it should do. MPI_Abort should kill all processes in the specified communicator. If you can't kill only the processes in the communicator, then it should kill all connected processes (i.e., the job, plus any dynamic procs).<br>
> > ><br>
> > > -d<br>
> > ><br>
> > > > -- Pavan<br>
> > > ><br>
> > > > --<br>
> > > > Pavan Balaji<br>
> > > > <a href="http://www.mcs.anl.gov/~balaji" target="_blank">http://www.mcs.anl.gov/~balaji</a><br>
> > ><br>
> > > _______________________________________________<br>
> > > mpich-discuss mailing list <a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br>
> > > To manage subscription options or unsubscribe:<br>
> > > <a href="https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss</a><br>
> > ><br>
> ><br>
> ><br>
><br>
><br>
<br>
</div></div></blockquote></div><br></div>