Hi all,<div><br></div><div>I'm a student who has just started learning about fault tolerance in MPI. I'm not very familiar with MPI yet, but the project I'm working on urgently needs an MPI implementation that provides some degree of fault tolerance, so I'd like to ask for your help.</div>
<div><br></div><div>The announcement for the new version of MPICH says that </div><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; ">
" collective operations on communicators containing the failed process is undefined, and may give incorrect results or hang some processes." </blockquote><div><br></div><div>However, it is hard for us to avoid collective communication, and just as hard to guarantee that all nodes stay alive, especially in HPC. If I wanted to modify the source code of the "Barrier" operation (MPI_Barrier) in MPICH2-1.3.1, could you give me some advice, for example on how much work that would involve? </div>
<div>The barrier operation I need should <font class="Apple-style-span" color="#FF0000">exclude the dead process</font> and <font class="Apple-style-span" color="#FF0000">include a new process that takes the place of the dead one</font>.</div>
<div><br></div><div><br></div><div>The MPICH2-1.3.1 documentation also states that </div><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; ">
" If a communication operation fails (e.g., due to a process failure) MPICH2 will return an error, and further communication to that process is not possible. However, communication with other processes will still proceed normally." </blockquote>
<div><br></div><div>I would like to know:</div><div><br></div><div>1. What information can be retrieved from the error that MPICH2 returns on a node failure (for example, which process has died)?</div><div><br></div><div>
2. Can we create a new communicator containing only the processes that are still alive after a process failure? If so, could you list some possible ways to do it?</div><div><br></div><div>Best Regards,</div><div><br></div><div>Rui</div><div>
<br></div>