Thanks to Darius for the suggestion. I found that with blocking communication, if one process fails, all the other processes die.<br>Today I tried nonblocking communication instead. In my test programs, when one process failed and died, the other process stayed alive and carried on with its work normally.<br>Thanks to the mpich-discuss&nbsp;mailing&nbsp;list!<br><pre><br>At&nbsp;2011-01-06&nbsp;03:40:32,&nbsp;"Darius&nbsp;Buntinas"&nbsp;&lt;buntinas@mcs.anl.gov&gt;&nbsp;wrote:

&gt;
&gt;The&nbsp;latest&nbsp;release&nbsp;has&nbsp;some&nbsp;support&nbsp;for&nbsp;tolerating&nbsp;communication&nbsp;failures,&nbsp;such&nbsp;as&nbsp;those&nbsp;due&nbsp;to&nbsp;a&nbsp;failed&nbsp;process,&nbsp;however&nbsp;it&nbsp;doesn't&nbsp;do&nbsp;a&nbsp;good&nbsp;job&nbsp;of&nbsp;detecting&nbsp;failed&nbsp;processes,&nbsp;so&nbsp;you&nbsp;can&nbsp;get&nbsp;a&nbsp;process&nbsp;that&nbsp;hangs&nbsp;in&nbsp;recv&nbsp;waiting&nbsp;for&nbsp;a&nbsp;message&nbsp;from&nbsp;a&nbsp;failed&nbsp;process.&nbsp;&nbsp;We&nbsp;are&nbsp;working&nbsp;on&nbsp;improving&nbsp;detection&nbsp;of&nbsp;and&nbsp;tolerance&nbsp;to&nbsp;failed&nbsp;processes.&nbsp;&nbsp;The&nbsp;next&nbsp;release&nbsp;should&nbsp;include&nbsp;many&nbsp;improvements.
&gt;
&gt;In&nbsp;addition&nbsp;to&nbsp;setting&nbsp;an&nbsp;error&nbsp;handler,&nbsp;you'll&nbsp;need&nbsp;to&nbsp;tell&nbsp;the&nbsp;process&nbsp;manager&nbsp;not&nbsp;to&nbsp;terminate&nbsp;the&nbsp;job&nbsp;when&nbsp;a&nbsp;process&nbsp;fails.&nbsp;&nbsp;If&nbsp;you're&nbsp;using&nbsp;the&nbsp;hydra&nbsp;process&nbsp;manager&nbsp;(which&nbsp;is&nbsp;the&nbsp;default&nbsp;in&nbsp;the&nbsp;latest&nbsp;release),&nbsp;you&nbsp;can&nbsp;give&nbsp;the&nbsp;-disable-auto-cleanup&nbsp;option&nbsp;to&nbsp;mpiexec.
&gt;
&gt;-d
&gt;
&gt;On&nbsp;Jan&nbsp;4,&nbsp;2011,&nbsp;at&nbsp;7:41&nbsp;PM,&nbsp;ejoywx&nbsp;wrote:
&gt;
&gt;&gt;&nbsp;Dear&nbsp;Sir,
&gt;&gt;&nbsp;
&gt;&gt;&nbsp;Sorry&nbsp;to&nbsp;trouble&nbsp;you!
&gt;&gt;&nbsp;
&gt;&gt;&nbsp;Maybe&nbsp;I&nbsp;am&nbsp;to&nbsp;ask&nbsp;this&nbsp;question.&nbsp;But&nbsp;for&nbsp;me,&nbsp;the&nbsp;question&nbsp;"Can&nbsp;MPICH2&nbsp;handle&nbsp;the&nbsp;fault&nbsp;that&nbsp;some&nbsp;processes&nbsp;die&nbsp;irregularly"&nbsp;is&nbsp;very&nbsp;important:&nbsp;in&nbsp;our&nbsp;computer&nbsp;cluster,&nbsp;I&nbsp;find&nbsp;that&nbsp;if&nbsp;a&nbsp;process&nbsp;dies&nbsp;on&nbsp;some&nbsp;node&nbsp;or&nbsp;a&nbsp;node&nbsp;is&nbsp;shut&nbsp;down,&nbsp;all&nbsp;processes&nbsp;of&nbsp;the&nbsp;cluster&nbsp;die.&nbsp;We&nbsp;attempted&nbsp;to&nbsp;register&nbsp;an&nbsp;error&nbsp;handler&nbsp;to&nbsp;deal&nbsp;with&nbsp;such&nbsp;faults;&nbsp;unfortunately,&nbsp;we&nbsp;failed!
&gt;&gt;&nbsp;
&gt;&gt;&nbsp;I&nbsp;admit&nbsp;that&nbsp;I&nbsp;do&nbsp;not&nbsp;know&nbsp;MPICH2,&nbsp;but&nbsp;I&nbsp;hope&nbsp;I&nbsp;am&nbsp;able&nbsp;to&nbsp;get&nbsp;help&nbsp;from&nbsp;you!&nbsp;&nbsp;"Can&nbsp;MPICH2&nbsp;handle&nbsp;the&nbsp;fault&nbsp;that&nbsp;some&nbsp;processes&nbsp;die&nbsp;irregularly?"
&gt;&gt;&nbsp;
&gt;&gt;&nbsp;I&nbsp;look&nbsp;forward&nbsp;to&nbsp;receiving&nbsp;your&nbsp;e-mail.&nbsp;Thanks.
&gt;&gt;&nbsp;
&gt;&gt;&nbsp;ejoywx
&gt;&gt;&nbsp;2011-01-05
&gt;&gt;&nbsp;
&gt;&gt;&nbsp;
&gt;&gt;&nbsp;
&gt;&gt;&nbsp;_______________________________________________
&gt;&gt;&nbsp;mpich-discuss&nbsp;mailing&nbsp;list
&gt;&gt;&nbsp;mpich-discuss@mcs.anl.gov
&gt;&gt;&nbsp;https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
&gt;
</pre><br><br>
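For reference, here is a minimal sketch of the kind of nonblocking test described above. It is not the actual test program: the 5-second timeout, the message tag 0, and the helper name `report` are assumptions chosen for illustration. It sets `MPI_ERRORS_RETURN` on `MPI_COMM_WORLD` so that errors come back as return codes, and rank 0 polls a nonblocking receive with `MPI_Test` rather than blocking in a receive that might never complete. How reliably a failed peer is detected depends on the MPICH2 release, as Darius notes.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: print a readable message for a failed MPI call. */
static void report(const char *what, int rc)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(rc, msg, &len);
    fprintf(stderr, "%s failed: %s\n", what, msg);
}

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);

    /* Return error codes to the caller instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank == 1) {
        /* Sender: one integer to rank 0. */
        int value = 42;
        int rc = MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS)
            report("MPI_Send", rc);
    } else if (size >= 2 && rank == 0) {
        /* Receiver: post a nonblocking receive and poll it with a timeout. */
        const double timeout = 5.0; /* seconds; assumed value */
        int value = 0, done = 0;
        MPI_Request req;
        int rc = MPI_Irecv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);

        if (rc != MPI_SUCCESS) {
            report("MPI_Irecv", rc);
        } else {
            double start = MPI_Wtime();
            while (!done && MPI_Wtime() - start < timeout) {
                rc = MPI_Test(&req, &done, MPI_STATUS_IGNORE);
                if (rc != MPI_SUCCESS) {
                    report("MPI_Test", rc);
                    break;
                }
                /* Other useful work could be done here between polls. */
            }
            if (done) {
                printf("rank 0 received %d\n", value);
            } else if (rc == MPI_SUCCESS) {
                /* Timed out: give up on the peer and release the request. */
                MPI_Cancel(&req);
                MPI_Request_free(&req);
                printf("rank 0 gave up waiting for rank 1\n");
            }
        }
    }

    MPI_Finalize();
    return 0;
}
```

Assuming the file is saved as ft.c (a hypothetical name), it can be built with `mpicc ft.c -o ft` and launched under hydra with `mpiexec -disable-auto-cleanup -n 2 ./ft`, so that the process manager does not terminate the remaining process when one fails.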