[mpich-discuss] Bugs in Checkpoint/ Restart support - MPICH2-1.3.1
Darius Buntinas
buntinas at mcs.anl.gov
Mon Nov 22 15:45:56 CST 2010
Hi Raghu,
I haven't verified this here yet, but we have noticed similar issues in applications where some processes are continually sending (and not receiving). Do you think this is the case here?
-d
On Nov 21, 2010, at 6:32 PM, Raghunath wrote:
> Hi all,
>
> I was trying to get the Checkpoint/Restart functionality in MPICH2-1.3.1 working, but noticed a couple of errors.
>
> Case 1:
>
> Running IMB, 4 processes on 2 nodes (2 procs per node) - the application hangs after requesting a checkpoint:
>
>      32768   1000     40.34     40.36     40.35
>      65536    640     97.36     97.42     97.39
>     131072    320    199.74    199.93    199.84
>     262144    160    399.00    399.57    399.29
>     524288     80    803.46    805.65    804.56
>    1048576     40   1672.60   1681.23   1676.91
> [proxy:0:0 at wci30] requesting checkpoint
> *hangs here*
>
> I've attached backtraces of the application (imb_trace) and the pmi_proxy (pmi_proxy_trace).
>
> Case 2:
>
> Running IMB, 2 processes on 2 nodes - the first checkpoint/restart is seamless, but the next checkpoint attempt fails. I've attached the error message (2nd_ckpt_error).
>
> It looks like the CR functionality works only when the application is run with one process per node, and only when a single checkpoint is taken.
>
> For both of these cases, I configured MPICH2 with the "--enable-checkpointing --with-hydra-ckpointlib=blcr" flags.
>
> I launched my job in the following manner:
>
> <mpich2 installation path>/bin/mpiexec -ckpoint-prefix=<path to ckpt dir> -ckpoint-interval 30 -f ./mf -n 4 ./IMB-EXT
>
> We weren't sure whether this is a known bug, so we wanted to report it to the MPICH2 group.
>
> Thanks,
>
> --
> Raghu
> <2nd_ckpt_error><imb_trace><pmi_proxy_trace>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss