[mpich-discuss] Checkpointing Manually with BLCR

Darius Buntinas buntinas at mcs.anl.gov
Thu Sep 6 13:26:38 CDT 2012


When taking a checkpoint, the MPICH library needs to perform a distributed checkpointing protocol to ensure, among other things, that the communication channels are flushed.  The library can only do this when the application calls into it.  It's possible that your application does not call into the library before the next checkpoint is initiated, which would explain the "Previous checkpoint has not completed" error.

You can try enabling the asynchronous progress thread, which lets the library make progress even while the application is not making MPI calls, by setting the MPICH_ASYNC_PROGRESS environment variable to 1.  E.g.:
    MPICH_ASYNC_PROGRESS=1 mpiexec ...
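
For a fuller picture, here is a rough sketch of how the variable might be combined with Hydra's checkpointing options (-ckpointlib, -ckpoint-prefix, -ckpoint-interval).  The original report does not show the actual command line, so the checkpoint prefix, process count and application name below are placeholders, and the ckpt_launch helper is purely illustrative.  With --dry-run it only prints the command, so it can be tried on a machine without MPICH installed.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder, not taken from the report.
CKPT_PREFIX=/tmp/ckpt/app   # hypothetical location for the checkpoint files
CKPT_INTERVAL=20            # seconds between checkpoints (20 s as in the report)
NPROCS=4                    # hypothetical number of processes
APP=./app                   # hypothetical application binary

# ckpt_launch [--dry-run]
# Builds the mpiexec command with BLCR checkpointing and the async progress
# thread enabled.  With --dry-run the command is printed instead of executed.
ckpt_launch() {
    mode="$1"
    set -- env MPICH_ASYNC_PROGRESS=1 \
        mpiexec -ckpointlib blcr \
                -ckpoint-prefix "$CKPT_PREFIX" \
                -ckpoint-interval "$CKPT_INTERVAL" \
                -n "$NPROCS" "$APP"
    if [ "$mode" = "--dry-run" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ckpt_launch --dry-run
```

Calling ckpt_launch without --dry-run would start the job itself; adjust the placeholders to match your setup first.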

See if this helps.

-d


On Sep 6, 2012, at 9:32 AM, Mehmet Kurt wrote:

> Hello,
> 
> I'm using MPICH2 (mpich2-1.4.1p1) with BLCR support.
> 
> I have the following two related problems:
> 
> 1) I'm trying to checkpoint my application using a checkpoint interval of 20 secs. The first checkpoint seems to complete successfully, because it prints:
> 
> [proxy:0:0 at node55] requesting checkpoint
> [proxy:0:0 at node55] checkpoint completed
> ...
> 
> However, when it tries to take the 2nd checkpoint after 20 secs, I get the following error message:
> 
> [proxy:0:0 at node55] HYDT_ckpoint_checkpoint (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
> [proxy:0:0 at node55] HYD_pmcd_pmip_control_cmd_cb (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/pm/pmiserv/pmip_cb.c:947): checkpoint suspend failed
> [proxy:0:0 at node55] HYDT_dmxu_poll_wait_for_event (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at node55] main (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event
> 
> 2) To work around this problem, I tried to send the checkpoint signal manually by running the "pkill -USR1 mpiexec" command from another terminal, but it doesn't work.
> Is there anything else we need to do to checkpoint manually?
> 
> Mehmet Can Kurt
> 


