[mpich-discuss] Bugs in Checkpoint/ Restart support - MPICH2-1.3.1

Raghunath rajachan at cse.ohio-state.edu
Sun Nov 21 18:32:33 CST 2010


Hi all,

I was trying to get the Checkpoint/Restart functionality in MPICH2-1.3.1
working, but ran into a couple of errors.

Case 1:

Running IMB with 4 processes on 2 nodes (2 processes per node), the
application hangs after a checkpoint is requested:

        32768         1000        40.34        40.36        40.35
        65536          640        97.36        97.42        97.39
       131072          320       199.74       199.93       199.84
       262144          160       399.00       399.57       399.29
       524288           80       803.46       805.65       804.56
      1048576           40      1672.60      1681.23      1676.91
[proxy:0:0@wci30] requesting checkpoint

*hangs here*


I've attached backtraces of the application (imb_trace) and of the
pmi_proxy (pmi_proxy_trace).

Case 2:

Running IMB with 2 processes on 2 nodes, the first checkpoint/restart is
seamless, but the next checkpoint attempt fails. I've attached the error
message (2nd_ckpt_error).

It looks like the CR functionality works only when the application is run
with one process per node and only a single checkpoint is taken.

For both cases, I configured MPICH2 with the "--enable-checkpointing
--with-hydra-ckpointlib=blcr" flags.

I launched my job in the following manner:

<mpich2 installation path>/bin/mpiexec -ckpoint-prefix=<path to ckpt dir>
-ckpoint-interval 30 -f ./mf -n 4 ./IMB-EXT
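
For anyone who wants to reproduce this, the launch line above can be wrapped in a small script along the following lines. This is only a sketch: the default paths are placeholders to be replaced, and the MPICH2_PREFIX, CKPT_DIR, NPROCS, INTERVAL and DRY_RUN variables are names introduced here for convenience, not anything provided by MPICH2. The mpiexec options are exactly the ones from the command above.

```shell
#!/bin/sh
# Reproduction sketch for the two cases in this report.
# All defaults below are placeholders; replace them with real paths.
MPICH2_PREFIX="${MPICH2_PREFIX:-/path/to/mpich2-install}"
CKPT_DIR="${CKPT_DIR:-/path/to/ckpt-dir}"
NPROCS="${NPROCS:-4}"       # 4 for Case 1, 2 for Case 2
INTERVAL="${INTERVAL:-30}"  # checkpoint interval, as in the report

# MPICH2 itself was configured with the flags quoted in this mail:
#   ./configure --enable-checkpointing --with-hydra-ckpointlib=blcr

launch() {
    # With DRY_RUN=1 (the default here) the command is only printed;
    # with DRY_RUN=0 it is executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then set -- echo; else set --; fi
    "$@" "$MPICH2_PREFIX/bin/mpiexec" "-ckpoint-prefix=$CKPT_DIR" \
        -ckpoint-interval "$INTERVAL" -f ./mf -n "$NPROCS" ./IMB-EXT
}

launch
```

Running it with NPROCS=2 and a machine file listing two nodes corresponds to Case 2; setting DRY_RUN=0 actually starts the job instead of printing the command.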


We weren't sure whether this is a known bug, so we wanted to report it to
the MPICH2 group.



Thanks,

--
Raghu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2nd_ckpt_error
Type: application/octet-stream
Size: 1789 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101121/8a2a6d12/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: imb_trace
Type: application/octet-stream
Size: 800 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101121/8a2a6d12/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pmi_proxy_trace
Type: application/octet-stream
Size: 923 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101121/8a2a6d12/attachment-0002.obj>

