[mpich-discuss] Bugs in Checkpoint/ Restart support - MPICH2-1.3.1
Raghunath
rajachan at cse.ohio-state.edu
Sun Nov 21 18:32:33 CST 2010
Hi all,
I was trying to get the Checkpoint/Restart functionality in MPICH2-1.3.1
working, but ran into two problems.
Case 1:
Running IMB, 4 processes on 2 nodes (2 procs per node) - the application
hangs after requesting a checkpoint:
32768 1000 40.34 40.36 40.35
65536 640 97.36 97.42 97.39
131072 320 199.74 199.93 199.84
262144 160 399.00 399.57 399.29
524288 80 803.46 805.65 804.56
1048576 40 1672.60 1681.23 1676.91
[proxy:0:0@wci30] requesting checkpoint
*hangs here*
I've attached backtraces of the application (imb_trace) and of
pmi_proxy (pmi_proxy_trace).
Case 2:
Running IMB, 2 processes on 2 nodes - the first checkpoint/restart is
seamless, but the next checkpoint attempt fails. I've attached the error
message (2nd_ckpt_error).
It looks like the CR functionality works only when the application is run
with one process per node and only a single checkpoint is taken.
For both cases, I configured MPICH2 with the "--enable-checkpointing
--with-hydra-ckpointlib=blcr" flags.
I launched my job in the following manner:
<mpich2 installation path>/bin/mpiexec -ckpoint-prefix=<path to ckpt dir>
-ckpoint-interval 30 -f ./mf -n 4 ./IMB-EXT
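To make the two cases easier to reproduce, here is a minimal dry-run sketch that only prints the command line for a given process count. MPICH_PREFIX, CKPT_DIR and the launch_cmd helper are placeholders introduced for illustration, and it assumes Case 2 differs from Case 1 only in the -n value, which the report above does not state explicitly.

```shell
#!/bin/sh
# Dry-run sketch: prints the mpiexec command line, does not execute it.
# MPICH_PREFIX and CKPT_DIR are placeholders for the elided paths above.
MPICH_PREFIX="${MPICH_PREFIX:-/path/to/mpich2}"
CKPT_DIR="${CKPT_DIR:-/path/to/ckpt}"

# launch_cmd NPROCS: echo the command from this report with -n set to NPROCS.
launch_cmd() {
    echo "$MPICH_PREFIX/bin/mpiexec -ckpoint-prefix=$CKPT_DIR -ckpoint-interval 30 -f ./mf -n $1 ./IMB-EXT"
}

launch_cmd 4   # Case 1: 4 processes on 2 nodes
launch_cmd 2   # Case 2: 2 processes on 2 nodes (assumed to differ only in -n)
```

Piping a printed line to sh would actually launch the job; the machine file ./mf still has to list the two nodes.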
We weren't sure whether this is a known bug, so we wanted to report it to
the MPICH2 group.
Thanks,
--
Raghu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2nd_ckpt_error
Type: application/octet-stream
Size: 1789 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101121/8a2a6d12/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: imb_trace
Type: application/octet-stream
Size: 800 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101121/8a2a6d12/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pmi_proxy_trace
Type: application/octet-stream
Size: 923 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101121/8a2a6d12/attachment-0002.obj>