[mpich-discuss] Checkpointing Manually with BLCR

Mehmet Kurt kurt.16 at buckeyemail.osu.edu
Thu Sep 6 22:37:26 CDT 2012


Yes, MPICH_ASYNC_PROGRESS=1 mpiexec ... solved the problem.

I also found out that sending "SIGALRM" instead of "SIGUSR1" works to checkpoint manually.

Thank you for your help,

Mehmet 
__________________________________
From: mpich-discuss-bounces at mcs.anl.gov [mpich-discuss-bounces at mcs.anl.gov] on behalf of Jim Dinan [dinan at mcs.anl.gov]
Sent: Thursday, September 06, 2012 2:39 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] Checkpointing Manually with BLCR

Would adding one of these also do the trick?

{
   int flag;
   MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag,
              MPI_STATUS_IGNORE);
}
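
To make the idea concrete, here is a rough sketch of how such a call might be placed in a long compute loop, so that the process enters the MPI library at regular intervals even when it has no communication of its own. The names poke_mpi_progress, compute_with_pokes, and poke_count, and the USE_MPI macro, are made up for illustration and are not part of MPICH. Without -DUSE_MPI the MPI call is compiled out so the loop logic can be tried on its own.

```c
#include <assert.h>

#ifdef USE_MPI
#include <mpi.h>
#endif

/* Counts how often the library was entered; for illustration only. */
int poke_count = 0;

/* Hypothetical helper: enter the MPI library once without receiving
 * anything. Must only be called between MPI_Init and MPI_Finalize. */
void poke_mpi_progress(void)
{
#ifdef USE_MPI
    int flag;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag,
               MPI_STATUS_IGNORE);
#endif
    poke_count++;
}

/* Hypothetical compute loop: does purely local work and pokes the
 * library after every poke_interval iterations. */
double compute_with_pokes(long iterations, long poke_interval)
{
    double acc = 0.0;

    assert(poke_interval > 0);
    for (long i = 0; i < iterations; i++) {
        acc += (double)i * 0.5;   /* stand-in for real local work */
        if ((i + 1) % poke_interval == 0)
            poke_mpi_progress();
    }
    return acc;
}
```

The interval is a trade-off: it has to be short enough that at least one poke happens between two checkpoint requests, but each poke adds a little overhead to the loop.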

  ~Jim.

On 9/6/12 1:26 PM, Darius Buntinas wrote:
>
> When taking a checkpoint, the MPICH library needs to perform a distributed checkpointing protocol to ensure that channels are flushed, among other things.  For the library to be able to do this, the application needs to call into the MPICH library.  It's possible that your application does not call into the library before the next checkpoint is initiated.
>
> You can try to enable the async progress thread by setting the MPICH_ASYNC_PROGRESS environment variable to 1.  E.g.:
>      MPICH_ASYNC_PROGRESS=1 mpiexec ...
>
> See if this helps.
>
> -d
>
>
> On Sep 6, 2012, at 9:32 AM, Mehmet Kurt wrote:
>
>> Hello,
>>
>> I'm using MPICH2 (mpich2-1.4.1p1) with BLCR support.
>>
>> I have the following two related problems:
>>
>> 1) I'm trying to checkpoint my application by using a checkpoint interval of 20 secs. The first checkpoint seems to be completed successfully, because it says
>>
>> [proxy:0:0 at node55] requesting checkpoint
>> [proxy:0:0 at node55] checkpoint completed
>> ...
>>
>> However, when it tries to take the second checkpoint after 20 secs, I get the following error message:
>>
>> [proxy:0:0 at node55] HYDT_ckpoint_checkpoint (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.
>> [proxy:0:0 at node55] HYD_pmcd_pmip_control_cmd_cb (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/pm/pmiserv/pmip_cb.c:947): checkpoint suspend failed
>> [proxy:0:0 at node55] HYDT_dmxu_poll_wait_for_event (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:0 at node55] main (/home/wjiang/mpich2-1.4.1p1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event
>>
>> 2) To work around this problem, I tried to send checkpoint signals manually by running "pkill -USR1 mpiexec" from another terminal, but it doesn't work.
>> Is there anything else we need to do to checkpoint manually?
>>
>> Mehmet Can Kurt
>>
>> _______________________________________________
>> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>> To manage subscription options or unsubscribe:
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>



