[mpich-discuss] Checkpointing Manually with BLCR

Mehmet Kurt kurt.16 at buckeyemail.osu.edu
Thu Sep 6 09:32:08 CDT 2012


Hello,

I'm using MPICH2 (mpich2-1.4.1p1) with BLCR support.

I have the following two related problems:

1) I'm trying to checkpoint my application with a checkpoint interval of 20 seconds. The first checkpoint appears to complete successfully, because the proxy reports:

[proxy:0:0@node55] requesting checkpoint
[proxy:0:0@node55] checkpoint completed
...

However, when it attempts the second checkpoint 20 seconds later, I get the following error message:

[proxy:0:0@node55] HYDT_ckpoint_checkpoint (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:0@node55] HYD_pmcd_pmip_control_cmd_cb (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/pm/pmiserv/pmip_cb.c:947): checkpoint suspend failed
[proxy:0:0@node55] HYDT_dmxu_poll_wait_for_event (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@node55] main (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event

2) To work around this problem, I tried to send the checkpoint signal manually by running "pkill -USR1 mpiexec" from another terminal, but it doesn't work.
Is there anything else I need to do to checkpoint manually?
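
For reference, here is roughly what the manual trigger amounts to, written out as a small Python sketch. It only illustrates the signal delivery itself. The helper names (find_pids, request_checkpoint) are made up for this example, and it assumes that the launcher process is named "mpiexec" and that SIGUSR1 sent to it is what requests a checkpoint, as in my attempt above.

```python
import os
import signal
import subprocess
from typing import List


def find_pids(name: str) -> List[int]:
    """Return the PIDs of processes whose name matches `name` (via pgrep).

    Hypothetical helper: mirrors the process selection that `pkill` performs.
    Returns an empty list if nothing matches.
    """
    result = subprocess.run(["pgrep", name], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def request_checkpoint(pid: int) -> None:
    """Send SIGUSR1 to one process.

    Assumption: the target (e.g. mpiexec) interprets SIGUSR1 as a
    checkpoint request. This function only delivers the signal.
    """
    os.kill(pid, signal.SIGUSR1)
```

Calling request_checkpoint(pid) for each PID returned by find_pids("mpiexec") should be equivalent to the pkill command. os.kill raises ProcessLookupError if the process has already exited, which at least distinguishes "no such process" from "signal delivered but ignored".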

Mehmet Can Kurt


