[mpich-discuss] MPICH2 Checkpointing Error with BLCR

Darius Buntinas buntinas at mcs.anl.gov
Mon Sep 24 11:57:21 CDT 2012


I believe the "checkpoint completed" message is misleading: the checkpoint has not actually finished when it is printed.  In order for the checkpoint to complete, the application at each process needs to make an MPI call after the checkpoint is signaled.  Until every process has done so, the checkpoint stays pending, which would also explain the "Previous checkpoint has not completed" error on your next request.

One solution is to make the time between checkpoints large enough that the processes have a chance to call an MPI function.  Alternatively, you can set the MPICH_ASYNC_PROGRESS environment variable to 1 (e.g., "MPICH_ASYNC_PROGRESS=1 mpiexec ...").  That will create an internal thread to make progress and handle the checkpoint request.
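To make the second option concrete, here is a minimal sketch of a launcher helper that sets MPICH_ASYNC_PROGRESS and builds the mpiexec command line.  It only constructs the environment and argument list; it does not start anything.  The helper name, the application path, the checkpoint prefix and the interval are hypothetical, and the Hydra options -ckpointlib, -ckpoint-prefix and -ckpoint-interval are written as I recall them from the Hydra documentation, so please check them against "mpiexec -help" for your installed version.

```python
import os
import shlex


def build_checkpoint_launch(app, nprocs, ckpoint_prefix, interval_s=None,
                            async_progress=True, base_env=None):
    """Return (env, argv) for launching `app` under Hydra with BLCR.

    Hypothetical helper: nothing is executed here.
    """
    # Copy the environment so the caller's mapping is never modified.
    env = dict(os.environ if base_env is None else base_env)
    if async_progress:
        # Asks MPICH to create an internal progress thread, so a checkpoint
        # request can be handled while the application is outside MPI calls.
        env["MPICH_ASYNC_PROGRESS"] = "1"

    # Assumed Hydra checkpointing options; verify with `mpiexec -help`.
    argv = ["mpiexec", "-ckpointlib", "blcr", "-ckpoint-prefix", ckpoint_prefix]
    if interval_s is not None:
        # A larger interval gives each process more time to reach an MPI call.
        argv += ["-ckpoint-interval", str(interval_s)]
    argv += ["-n", str(nprocs), app]
    return env, argv


if __name__ == "__main__":
    # Example values only: "./app" and "/tmp/ckpt" are placeholders.
    env, argv = build_checkpoint_launch("./app", 4, "/tmp/ckpt",
                                        interval_s=3600, base_env={})
    print("MPICH_ASYNC_PROGRESS=" + env["MPICH_ASYNC_PROGRESS"],
          " ".join(shlex.quote(a) for a in argv))
```

To actually launch the job, the returned pair can be passed to subprocess.run(argv, env=env).  Leaving out interval_s omits the periodic-checkpoint option, which fits the case where checkpoints are requested manually.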

-d

On Sep 21, 2012, at 11:07 PM, Manisha Chauhan wrote:

> Hi,
> 
> I am working on checkpointing my MPI application. I installed both Hydra and BLCR. I have also checked "mpiexec --info", which shows BLCR as the checkpointing library, but I am still not able to checkpoint my application.
> 
> It prints "requesting checkpoint" and returns with "checkpoint completed", but the context file is empty. The next time it tries, it ends with the following error:
> 
> 
> MPICH2 version: 1.4.1
> 
> [proxy:0:0 at tom-laptop] requesting checkpoint
> [proxy:0:0 at tom-laptop] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
> [proxy:0:0 at tom-laptop] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:902): checkpoint suspend failed
> [proxy:0:0 at tom-laptop] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at tom-laptop] main (./pm/pmiserv/pmip.c:210): demux engine error waiting for event
> [mpiexec at tom-laptop] control_cb (./pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
> [mpiexec at tom-laptop] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at tom-laptop] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> [mpiexec at tom-laptop] main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion
> 
> Can you please help me find the issue?
> 
> Regards
> Manisha
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


