[mpich-discuss] cli.h file not found error

kishor kharbas kishor.kharbas at gmail.com
Sun Oct 10 12:19:27 CDT 2010


Hi Darius,

1. That was exactly the problem. I recompiled my program and it now works,
except for one issue:

   After I restart the parallel job from the checkpoint file, the
mpiexec process hangs and never terminates.
   The spawned processes remain in the <defunct> state. When I stop mpiexec
myself with Ctrl-C, these messages are displayed:

   ^C[mpiexec@opt09] connection to proxy terminated unexpectedly
   Ctrl-C caught... cleaning up processes
   [press Ctrl-C again to force abort]
   APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
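
   A <defunct> entry is a process that has exited but has not yet been
reaped by its parent, so the parent PID shows which process is not collecting
its children. A minimal sketch for listing such entries, assuming a ps that
accepts the POSIX -o format options; the function name list_defunct is
hypothetical and not part of MPICH2 or Hydra:

```shell
# list_defunct (hypothetical helper): read "PID PPID STAT COMMAND" rows on
# stdin and print "PID PPID COMMAND" for every row whose state starts with Z
# (zombie, shown as <defunct> by ps).
list_defunct() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}
```

   It can be fed from ps, for example:
   ps -eo pid=,ppid=,stat=,comm= | list_defunct
   The second column of the output is the parent that has not yet called wait.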

2. There is another, independent and more severe, problem with running
programs on multiple hosts. In all my previous mails in this thread, I had run
my programs on a single host.
   Running mpiexec with multiple hosts produces the following error:

   Fatal error in MPI_Send: Other MPI error, error stack:
   MPI_Send(173).....................: MPI_Send(buf=0x7fff8d47fe60, count=1, MPI_INT, dest=1, tag=1, MPI_COMM_WORLD) failed
   MPIDI_CH3I_Progress(334)..........:
   MPID_nem_mpich2_blocking_recv(906):
   MPID_nem_tcp_connpoll(1861).......: Communication error with rank 1:
   APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

   I also ran 'make testing' with HYDRA_HOST_FILE set to the host file. All
the tests emitted the same error stack.

   Can you please suggest how I can troubleshoot this problem?
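
   As a starting point, here is a sketch of one check that is often worth
doing for TCP connection errors between ranks: confirm that every host in the
host file resolves, and that none resolves to a loopback address. Whether this
is the cause here is an assumption, not something confirmed in this thread.
The sketch also assumes a host file with one host per line, an optional
":<process count>" suffix and "#" comments, and a system that provides getent.
Both function names are hypothetical.

```shell
# hosts_from_file (hypothetical helper): print the host name from each
# non-empty, non-comment line, dropping an optional ":<process count>" suffix.
hosts_from_file() {
    sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e 's/^[[:space:]]*//' "$1" |
        awk 'NF { sub(/:.*/, "", $1); print $1 }'
}

# check_hosts (hypothetical helper): report how each host resolves on this
# machine. Assumes getent is available (glibc-based Linux systems).
check_hosts() {
    for h in $(hosts_from_file "$1"); do
        addr=$(getent hosts "$h" | awk 'NR == 1 { print $1 }')
        case "$addr" in
            "")          echo "$h: does not resolve" ;;
            127.*|::1)   echo "$h: resolves to loopback address $addr" ;;
            *)           echo "$h: $addr" ;;
        esac
    done
}
```

   It would be run on each node in turn, for example:
   check_hosts "$HYDRA_HOST_FILE"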

Thank you.
Kishor
On Fri, Oct 8, 2010 at 4:54 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:

>
> Did you recompile mpiexample with BLCR-enabled MPICH2?  The error you're
> getting is from BLCR, and it typically means that you're trying to
> checkpoint a process that doesn't support checkpointing.
>
> -d
>
> On Oct 8, 2010, at 3:00 PM, kishor kharbas wrote:
>
> > Thank you, Darius, for your response.
> >
> > I have now built mpich2-1.3rc2 and hydra.
> > I use mpiexec.hydra from the Hydra installation. This is the command I run:
> >
> > mpiexec.hydra -ckpointlib blcr -ckpoint-prefix=/home/kkharba/chkpnts -n 2 ./mpiexample
> >
> > But when I send SIGUSR1 to the mpiexec process, I get the following error:
> >
> > [proxy:0:0@opt09] requesting checkpoint
> > [proxy:0:0@opt09] HYDT_ckpoint_blcr_suspend (./tools/ckpoint/blcr/ckpoint_blcr.c:164): cr_request_checkpoint failed, Unknown error 2356
> > [proxy:0:0@opt09] HYDT_ckpoint_suspend (./tools/ckpoint/ckpoint.c:78): blcr checkpoint returned error
> > [proxy:0:0@opt09] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:901): checkpoint suspend failed
> > [proxy:0:0@opt09] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:76): callback returned error status
> > [proxy:0:0@opt09] main (./pm/pmiserv/pmip.c:221): demux engine error waiting for event
> > [mpiexec@opt09] connection to proxy terminated unexpectedly
> >
> >
> > Is there anything I might be doing wrong?
> >
> >
> > Thank you.
> > On Fri, Oct 8, 2010 at 12:33 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> > BLCR checkpointing is not supported in 1.2.1 (hydra supports it, but the
> > mpich2 library doesn't).  Try 1.3rc2.  You can find documentation in the
> > user manual and the README:
> >
> > http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
> >
> > -d
> >
> > On Oct 8, 2010, at 8:41 AM, kishor kharbas wrote:
> >
> > > Hi all,
> > >
> > > I am trying to install mpich2-1.2.1p1 with BLCR support.
> > >
> > > The configure command I run is:
> > > ./configure --prefix=/home/kkharba/mpich2-1.2.1p1-install --enable-checkpointing --with-hydra-ckpointlib=blcr --with-blcr=/home/kkharba/blcr-install
> > >
> > > The configure script does not complete and gives this error:
> > >
> > > configure: error: 'cli.h not found.  Did you specify --with-cli-dir=?'
> > > configure: error: ./configure failed for channels/nemesis
> > > configure: error: Configure of src/mpid/ch3 failed!
> > >
> > > I searched all the file systems but could not find this file.
> > >
> > > Can you help me with this issue?
> > >
> > > Thank you.
> > > Kishor Kharbas
> > >
> > > _______________________________________________
> > > mpich-discuss mailing list
> > > mpich-discuss at mcs.anl.gov
> > > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> >
> > --
> > Kishor Kharbas
> > MS Student
> > Department of Computer Science
> > NC State University
> > Raleigh, NC 27606
>



-- 
Kishor Kharbas
MS Student
Department of Computer Science
NC State University
Raleigh, NC 27606