[mpich2-dev] Checkpointing failed

Darius Buntinas buntinas at mcs.anl.gov
Wed Nov 30 10:20:13 CST 2011


When a checkpoint is requested, each process needs to execute a checkpoint algorithm.  This will only happen if each process is making a call into MPICH that enters the progress engine.  So if one or more processes are in a computation loop when the first checkpoint is initiated and don't call an MPI routine before the second checkpoint is requested, you'll probably get that error.

To make sure MPICH enters the progress engine, you'll need to call a communication operation (MPI_Iprobe is the simplest).  There is currently a bug where calling only a send operation will not allow the checkpoint algorithm to progress.

If you think this is the problem you're having, try looping on MPI_Iprobe when you expect the checkpoint and see if that solves the problem.

Alternatively, you could enable the MPICH progress thread by setting the MPICH_ASYNC_PROGRESS environment variable to 1.  Note that this sets the thread-safety level to MPI_THREAD_MULTIPLE, which can have an impact on performance, and it starts a busy thread, so you'll need a spare core to avoid oversubscribing your cores.

-d


On Nov 29, 2011, at 7:04 PM, Bo Fang wrote:

> Hi,
> 
> I am working on a course project that aims to evaluate MPICH2 with BLCR, but I am having problems running my benchmarks in ckpoint mode. When the second checkpoint is requested, an error occurs, no matter what time interval I specify or which benchmark is running.
> 
> Here is the error message:
> 
> --------------------------------------------------------------------------------------
> [proxy:0:0 at bo-laptop] requesting checkpoint
> [proxy:0:0 at bo-laptop] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
> [proxy:0:0 at bo-laptop] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:947): checkpoint suspend failed
> [proxy:0:0 at bo-laptop] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at bo-laptop] main (./pm/pmiserv/pmip.c:225): demux engine error waiting for event
> [mpiexec at bo-laptop] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
> [mpiexec at bo-laptop] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at bo-laptop] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
> [mpiexec at bo-laptop] main (./ui/mpich/mpiexec.c:420): process manager error waiting for completion
> ------------------------------------------------------------------------------------
> 
> This happens when the second checkpoint is requested. It seems that the first checkpoint has not completed when the second one arrives, but I don't see any hint in the code as to why. The checkpoint file from the first one is actually very large (over 150 MB).
> 
> Thank you very much for your help.
> 
> Bo Fang