[mpich-discuss] Bugs in Checkpoint/ Restart support - MPICH2-1.3.1

Darius Buntinas buntinas at mcs.anl.gov
Mon Nov 22 15:45:56 CST 2010


Hi Raghu,

I haven't verified this here yet, but we have seen similar issues in applications where some processes are continually sending (and not receiving). Do you think that is the case here?

-d

On Nov 21, 2010, at 6:32 PM, Raghunath wrote:

> Hi all,
> 
> I was trying to get the Checkpoint/Restart functionality in MPICH2-1.3.1 working, but ran into a couple of errors.
> 
> Case 1:
> 
> Running IMB, 4 processes on 2 nodes (2 procs per node) - the application hangs after requesting a checkpoint:
> 
>         32768         1000        40.34        40.36        40.35
>         65536          640        97.36        97.42        97.39
>        131072          320       199.74       199.93       199.84
>        262144          160       399.00       399.57       399.29
>        524288           80       803.46       805.65       804.56
>       1048576           40      1672.60      1681.23      1676.91
> [proxy:0:0 at wci30] requesting checkpoint 
> *hangs here*
> 
> I've attached back traces of the application (imb_trace) and the pmi_proxy(pmi_proxy_trace).
> 
> Case 2:
> 
> Running IMB, 2 processes on 2 nodes - the first checkpoint/restart is seamless, but the next checkpoint attempt fails. I've attached the error message (2nd_ckpt_error).
> 
> It looks like the CR functionality works only when the application is run with one process per node and only a single checkpoint is taken.
> 
> For both of these cases, I configured MPICH2 with the "--enable-checkpointing --with-hydra-ckpointlib=blcr" flags.
> 
> I launched my job in the following manner:
> 
> <mpich2 installation path>/bin/mpiexec -ckpoint-prefix=<path to ckpt dir> -ckpoint-interval 30 -f ./mf -n 4 ./IMB-EXT
> 
> 
> We weren't sure whether this was a known bug, so we wanted to report it to the MPICH2 group.
> 
> 
> 
> Thanks,
> 
> --
> Raghu
> <2nd_ckpt_error> <imb_trace> <pmi_proxy_trace>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


