Hi Darius,<div><br></div><div>1. That was exactly the problem. I recompiled my program and it now works, except for one issue:</div><div><br></div><div> After restarting the parallel job from the checkpoint file, the mpiexec process hangs and never terminates.</div>
<div> The spawned processes linger in the &lt;defunct&gt; state. After I stop mpiexec myself with Ctrl-C, these error messages are displayed:<br><br></div><div><div><i> ^C[mpiexec@opt09] connection to proxy terminated unexpectedly</i></div>
<div><i> Ctrl-C caught... cleaning up processes</i></div>
<div><i> [press Ctrl-C again to force abort]</i></div><div><i> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)</i></div><div><br></div><div>2. There is a second, independent and more severe problem when running programs on multiple hosts. In all my previous mails in this thread, I ran my programs on a single host.</div>
<div> Running mpiexec with multiple hosts produces the following error:</div>
<div><br></div><div> <i>Fatal error in MPI_Send: Other MPI error, error stack:</i></div><div><i> MPI_Send(173).....................: MPI_Send(buf=0x7fff8d47fe60, count=1, MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed</i></div>
<div><i> MPIDI_CH3I_Progress(334)..........:</i></div><div><i> MPID_nem_mpich2_blocking_recv(906):</i></div><div><i> MPID_nem_tcp_connpoll(1861).......: Communication error with rank 1:</i></div><div><i> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)</i></div>
<div><br></div><div> I also ran 'make testing' with HYDRA_HOST_FILE set to the host file. All the tests emitted the same error stack.</div><div> </div><div> Could you please suggest how I can troubleshoot this problem?</div>
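<div><br></div><div> As a possible first check, here is a minimal sketch that verifies every name in the host file resolves, and flags names that resolve only to a loopback address (which other nodes could not connect to). Whether name resolution is actually the cause of the error above is only a guess. The sketch assumes a host file with one host per line, optionally followed by ":count", with "#" starting a comment; the function names and the file name hosts.txt are hypothetical.</div>
<pre>
```python
import ipaddress
import socket
import sys


def parse_hostfile(text):
    """Return the host names listed in a host file.

    Assumed format: one host per line, optional ':count' suffix,
    '#' starts a comment. Blank lines are ignored.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Keep only the host name, dropping any ':count' suffix.
        hosts.append(line.split(":", 1)[0].strip())
    return hosts


def is_loopback(address):
    """True if the address (IPv4 or IPv6) is a loopback address."""
    # Drop an IPv6 scope id such as '%eth0' before parsing.
    return ipaddress.ip_address(address.split("%", 1)[0]).is_loopback


def resolve(host):
    """Return the sorted addresses a host name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except OSError:
        return []
    return sorted({info[4][0] for info in infos})


def report(hosts):
    """Print one line per host describing how its name resolves."""
    for host in hosts:
        addresses = resolve(host)
        if not addresses:
            status = "does not resolve"
        elif all(is_loopback(a) for a in addresses):
            status = "resolves only to loopback: " + ", ".join(addresses)
        else:
            status = ", ".join(addresses)
        print(f"{host}: {status}")


if __name__ == "__main__" and sys.argv[1:]:
    with open(sys.argv[1]) as handle:
        report(parse_hostfile(handle.read()))
```
</pre>
<div> It would be run on each node in turn, for example as "python check_hosts.py hosts.txt", since a name can resolve differently depending on which machine does the lookup.</div>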
<div><br></div><div>Thank you.</div><div>Kishor</div><div class="gmail_quote">On Fri, Oct 8, 2010 at 4:54 PM, Darius Buntinas <span dir="ltr">&lt;<a href="mailto:buntinas@mcs.anl.gov" target="_blank">buntinas@mcs.anl.gov</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
Did you recompile mpiexample with BLCR-enabled MPICH2? The error you're getting is from blcr that typically means that you're trying to checkpoint a process that doesn't support checkpointing.<br>
<font color="#888888"><br>
-d<br>
</font><div><div></div><div><br>
On Oct 8, 2010, at 3:00 PM, kishor kharbas wrote:<br>
<br>
> Thank you Darius for your response.<br>
><br>
> I have now built mpich2-1.3rc2 and hydra.<br>
> So I use mpiexec.hydra in Hydra installation, this is the command I run<br>
><br>
> mpiexec.hydra -ckpointlib blcr -ckpoint-prefix=/home/kkharba/chkpnts -n 2 ./mpiexample<br>
><br>
> But when I send SIGUSR1 to the mpiexec process, I get following error.<br>
><br>
> [proxy:0:0@opt09] requesting checkpoint<br>
> [proxy:0:0@opt09] HYDT_ckpoint_blcr_suspend (./tools/ckpoint/blcr/ckpoint_blcr.c:164): cr_request_checkpoint failed, Unknown error 2356<br>
> [proxy:0:0@opt09] HYDT_ckpoint_suspend (./tools/ckpoint/ckpoint.c:78): blcr checkpoint returned error<br>
> [proxy:0:0@opt09] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:901): checkpoint suspend failed<br>
> [proxy:0:0@opt09] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:76): callback returned error status<br>
> [proxy:0:0@opt09] main (./pm/pmiserv/pmip.c:221): demux engine error waiting for event<br>
> [mpiexec@opt09] connection to proxy terminated unexpectedly<br>
><br>
><br>
> Is there anything wrong that I might be doing ?<br>
><br>
><br>
> Thank you.<br>
> On Fri, Oct 8, 2010 at 12:33 PM, Darius Buntinas &lt;<a href="mailto:buntinas@mcs.anl.gov" target="_blank">buntinas@mcs.anl.gov</a>&gt; wrote:<br>
> BLCR checkpointing is not supported in 1.2.1 (hydra supports it, but the mpich2 library doesn't). Try 1.3rc2. You can find documentation in the user manual and the README:<br>
><br>
> <a href="http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads" target="_blank">http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads</a><br>
><br>
> -d<br>
><br>
> On Oct 8, 2010, at 8:41 AM, kishor kharbas wrote:<br>
><br>
> > Hi all,<br>
> ><br>
> > I am trying to install mpich2-1.2.1p1 with BLCR support.<br>
> ><br>
> > The configure script which I run is:<br>
> > ./configure --prefix=/home/kkharba/mpich2-1.2.1p1-install --enable-checkpointing --with-hydra-ckpointlib=blcr --with-blcr=/home/kkharba/blcr-install<br>
> ><br>
> > the configure script does not complete but gives this error:<br>
> ><br>
> > configure: error: 'cli.h not found. Did you specify --with-cli-dir=?'<br>
> > configure: error: ./configure failed for channels/nemesis<br>
> > configure: error: Configure of src/mpid/ch3 failed!<br>
> ><br>
> > I searched all the file systems but could not find this file.<br>
> ><br>
> > Can you help me out in this issue !!<br>
> ><br>
> > Thank you.<br>
> > Kishor Kharbas<br>
> ><br>
> > _______________________________________________<br>
> > mpich-discuss mailing list<br>
> > <a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a><br>
> > <a href="https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss</a><br>
><br>
><br>
><br>
><br>
> --<br>
> Kishor Kharbas<br>
> MS Student<br>
> Department of Computer Science<br>
> NC State University<br>
> Raleigh, NC 27606<br>
<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br><i>Kishor Kharbas</i><br><i style="font-family:times new roman,serif">MS Student<br>Department of Computer Science<br>NC State University</i><i style="font-family:times new roman,serif"><br>
Raleigh, NC 27606</i><br>
</div>