Hi Pavan,<div><br></div><div>Even the cpi program displays the same error message.</div><div>Yes, these errors occur both with and without checkpointing.<br><div><br></div><div>Reverting to mpich2-1.2 and using mpdboot and mpirun does not produce these errors.</div>
<div><br></div><div>Thank you.</div><div><br><div class="gmail_quote">On Sun, Oct 10, 2010 at 1:40 PM, Pavan Balaji <span dir="ltr">&lt;<a href="mailto:balaji@mcs.anl.gov">balaji@mcs.anl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im"><br>
On 10/10/2010 12:19 PM, kishor kharbas wrote:<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
1. That was exactly the problem. I recompiled my program and it works,<br>
except for one issue.<br>
<br>
After restarting the parallel process from the checkpoint file, the<br>
mpiexec process hangs and does not terminate at all.<br>
The spawned processes linger in the &lt;defunct&gt; state. After I stop<br>
mpiexec myself, these error messages are displayed,<br>
<br>
/ ^C[mpiexec@opt09] connection to proxy terminated unexpectedly/<br>
/ Ctrl-C caught... cleaning up processes/<br>
/ [press Ctrl-C again to force abort]/<br></div>
/ APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)/<br>
</blockquote>
<br>
I'll let Darius reply to this part.<div class="im"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2. There is another, independent (and more severe) problem with running<br>
programs on multiple hosts. For all my previous mails in this thread, I<br>
had run my programs on a single host.<br>
Running mpiexec with multiple hosts displays the following error:<br>
<br>
/Fatal error in MPI_Send: Other MPI error, error stack:/<br>
/ MPI_Send(173).....................: MPI_Send(buf=0x7fff8d47fe60,<br>
count=1, MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed/<br>
/ MPIDI_CH3I_Progress(334)..........:/<br>
/ MPID_nem_mpich2_blocking_recv(906):/<br>
/ MPID_nem_tcp_connpoll(1861).......: Communication error with rank 1:/<br>
/ APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)/<br>
<br>
I also ran 'make testing' with HYDRA_HOST_FILE set to the host<br>
file. All the tests emitted the same error stack.<br>
Can you please suggest how I can troubleshoot this problem?<br>
</blockquote>
<br></div>
This is very surprising. It looks like the different hosts are not able to "see" each other. Can you run the simple "cpi" program in the examples directory, across multiple hosts? I'm assuming this error occurs irrespective of whether you do checkpointing or not.<br>
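<br>
As a rough first check of whether the hosts can "see" each other, something like the sketch below could be run from each host against the others. It only tests whether a plain TCP connection can be opened; it is not part of MPICH2, and the function name, the host names and the port in the usage note are illustrative assumptions.<br>

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration only; it says nothing about MPI
    itself, just whether basic TCP connectivity exists between two machines.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, can_connect("node02", 22) run on node01 (both host names are placeholders for entries in your host file) would show whether node01 can reach the ssh port on node02. A success here does not rule out a firewall blocking the other ports that the MPI processes use to talk to each other.<br>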
<br>
-- Pavan<br><font color="#888888">
<br>
-- <br>
Pavan Balaji<br>
<a href="http://www.mcs.anl.gov/~balaji" target="_blank">http://www.mcs.anl.gov/~balaji</a><br>
</font></blockquote></div><br><br clear="all"><br>-- <br><i>Kishor Kharbas</i><br><i style="font-family:times new roman,serif">MS Student<br>Department of Computer Science<br>NC State University</i><i style="font-family:times new roman,serif"><br>
Raleigh, NC 27606</i><br>
</div></div>