[mpich-discuss] Asking standard checkpoint in MPICH2

Darius Buntinas buntinas at mcs.anl.gov
Mon May 10 15:17:34 CDT 2010


Hi Bagus,

Sorry, I haven't written up the documentation on this yet.  You'll need
to install BLCR and configure MPICH2 with the following configure options:

--with-hydra-ckpointlib=blcr --enable-checkpointing

If you didn't install BLCR in a standard system location (e.g., if you 
installed it in your home directory), then you'll need to specify the 
install location using the --with-blcr= configure option as well.  Also, 
make sure that your LD_LIBRARY_PATH is set correctly if necessary.
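
As a rough sketch of how those options fit together, the script below assembles the configure command line and prints it rather than running it. The BLCR_HOME path and the lib subdirectory are assumptions for illustration only; substitute your actual BLCR install location.

```shell
#!/bin/sh
# Hypothetical BLCR install prefix; adjust to wherever BLCR actually lives.
BLCR_HOME="$HOME/blcr"

# The two options named in the message above.
CONFIGURE_ARGS="--enable-checkpointing --with-hydra-ckpointlib=blcr"

# Only needed when BLCR is outside the standard system locations.
if [ -d "$BLCR_HOME" ]; then
    CONFIGURE_ARGS="$CONFIGURE_ARGS --with-blcr=$BLCR_HOME"
    # Assumption: the BLCR shared libraries sit under $BLCR_HOME/lib.
    LD_LIBRARY_PATH="$BLCR_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
fi

# Dry run: print the command so it can be checked before running it
# from the MPICH2 source directory.
echo "./configure $CONFIGURE_ARGS"
```

If the printed command looks right, run it by hand and follow it with make as usual.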

Once you configure and make, you'll need to make sure the BLCR kernel 
modules are loaded on each machine.  Use the -ckpoint-interval option 
for mpiexec to specify how often to take checkpoints.  You'll also need 
to specify the location where the checkpoint files should be written 
using the -ckpoint-prefix option (make sure the directory exists).
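
For illustration, a checkpointed run might be set up along these lines. The directory, interval value, process count, and program name are all made up for the example, and the interval is assumed to be given in seconds; only the -ckpoint-interval and -ckpoint-prefix options come from the message above. The script prints the mpiexec command instead of launching it.

```shell
#!/bin/sh
# Hypothetical values for illustration; none come from the original message.
CKPT_DIR="${TMPDIR:-/tmp}/mpich2-ckpts"   # where checkpoint files are written
INTERVAL=3600                             # assumed to be in seconds
NPROCS=4
APP=./my_mpi_app

# The checkpoint directory must exist before the job starts.
mkdir -p "$CKPT_DIR"

RUN_CMD="mpiexec -ckpoint-interval $INTERVAL -ckpoint-prefix $CKPT_DIR -n $NPROCS $APP"

# Dry run: print the command; execute it by hand once it looks right.
echo "$RUN_CMD"
```

Remember that the BLCR kernel modules have to be loaded on every machine before the real command is run.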

To restart from a checkpoint, specify the same number of processes as the
original run and the -ckpoint-prefix option, but leave off the name of
the executable.
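
A restart command following that description could be sketched as below. The directory and process count are hypothetical and would need to match whatever the original run used; note that no executable name appears at the end. As a precaution, the command is only printed here.

```shell
#!/bin/sh
# Hypothetical values; they must match the original run.
CKPT_DIR="${TMPDIR:-/tmp}/mpich2-ckpts"   # same location given at checkpoint time
NPROCS=4                                  # same process count as the original run

# No executable name: the processes are restored from the checkpoint files.
RESTART_CMD="mpiexec -ckpoint-prefix $CKPT_DIR -n $NPROCS"

# Dry run: print the command; execute it by hand once it looks right.
echo "$RESTART_CMD"
```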

Let us know how this works for you.  Remember that you're using a beta 
version, so you might still encounter some bugs.

-d

On 05/10/2010 05:46 AM, Bagus Jati Santoso wrote:
> Dear all members,
>
> I have some questions. First, I've installed MPICH2 on my cluster (11
> computers).
> Here is my cluster specification:
> - Debian 5
> - MPICH2-1.3a2 (installed from the downloaded package)
> - The BLCR package hasn't been installed yet
>
> Since websites say that there is a BLCR-based checkpointing feature in
> standard MPICH2, I want to make a small modification to the BLCR mechanism
> in MPICH2-1.3a2.
> I have been trying to modify it for a week, but I haven't made significant
> progress.
> So, my questions are:
> 1. Is it true that BLCR is already embedded in the MPICH2-1.3a2 package?
> 2. ckpoint_blcr.c calls many standard cr functions, and I've successfully
> run the standard MPICH2 checkpointing mechanism for my MPI program.
>       But I couldn't find the file that contains the definitions of
> cr_checkpoint, cr_poll_checkpoint, and others. Where can I find them?
> 3. Where is libcr.h located? ckpoint_blcr.c uses it.
> Do I need to install the BLCR package first?
> 4. Does anyone have an idea how to modify the BLCR mechanism in MPICH, or
> how to create our own checkpoint/restart mechanism?
>
> Please answer my questions. Thank you so much for your attention.
>
> Best regards,
> Bagus
