[mpich-discuss] Checkpointing Manually with BLCR
Mehmet Kurt
kurt.16 at buckeyemail.osu.edu
Thu Sep 6 09:32:08 CDT 2012
Hello,
I'm using MPICH2 (mpich2-1.4.1p1) with BLCR support.
I have the following two related problems:
1) I'm trying to checkpoint my application with a checkpoint interval of 20 seconds. The first checkpoint seems to complete successfully, because the output says:
[proxy:0:0@node55] requesting checkpoint
[proxy:0:0@node55] checkpoint completed
...
However, when it attempts the second checkpoint 20 seconds later, I get the following error message:
[proxy:0:0@node55] HYDT_ckpoint_checkpoint (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
[proxy:0:0@node55] HYD_pmcd_pmip_control_cmd_cb (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/pm/pmiserv/pmip_cb.c:947): checkpoint suspend failed
[proxy:0:0@node55] HYDT_dmxu_poll_wait_for_event (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@node55] main (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event
2) To work around this problem, I tried to trigger checkpoints manually by running "pkill -USR1 mpiexec" from another terminal, but it doesn't work.
Is there anything else we need to do to checkpoint manually?
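For concreteness, the two approaches described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the checkpoint prefix, process count, and binary name are hypothetical placeholders (the actual values are not given in this message), and the -ckpointlib, -ckpoint-prefix, and -ckpoint-interval options are the ones the Hydra documentation describes for BLCR, not necessarily the exact command line used here.

```shell
# Hypothetical placeholders: none of these values come from the original run.
CKPT_PREFIX="/tmp/ckpt/app.ckpoint"   # assumed checkpoint prefix
NPROCS=4                              # assumed number of MPI processes
APP="./app"                           # assumed application binary

# Print a launch command that requests a checkpoint every 20 seconds.
# Nothing is executed here; the command is only assembled and echoed.
launch_cmd() {
    echo "mpiexec -ckpointlib blcr -ckpoint-prefix $CKPT_PREFIX -ckpoint-interval 20 -n $NPROCS $APP"
}

# Request a single checkpoint by sending SIGUSR1 to one specific
# mpiexec process, identified by its PID (first argument).
request_checkpoint() {
    kill -USR1 "$1"
}
```

Running launch_cmd prints the command line for the periodic case. For the manual case, request_checkpoint takes the PID of the running mpiexec, which targets a single process rather than every process whose name matches "mpiexec" as pkill does.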
Mehmet Can Kurt