Hi all,<br><br>I was trying to get the checkpoint/restart functionality in MPICH2-1.3.1 working, but ran into a couple of errors.<br><br>Case 1:<br><br>Running IMB with 4 processes on 2 nodes (2 processes per node), the application hangs after a checkpoint is requested:<br>
<br><blockquote style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;" class="gmail_quote"><span style="font-family: courier new,monospace;"> 32768 1000 40.34 40.36 40.35</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 65536 640 97.36 97.42 97.39</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 131072 320 199.74 199.93 199.84</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 262144 160 399.00 399.57 399.29</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> 524288 80 803.46 805.65 804.56</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> 1048576 40 1672.60 1681.23 1676.91</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">[proxy:0:0@wci30] requesting checkpoint</span> </blockquote>
<blockquote style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;" class="gmail_quote">*hangs here* </blockquote><div><br>I've attached backtraces of the application (imb_trace) and of the pmi_proxy (pmi_proxy_trace).<br>
<br>Case 2:<br><br>Running IMB with 2 processes on 2 nodes, the first checkpoint/restart is seamless, but the next checkpoint attempt fails. I've attached the error message (2nd_ckpt_error).<br><br>It looks like the CR functionality works only when the application runs with one process per node and only a single checkpoint is taken.<br>
<br>For both cases, I configured MPICH2 with the "<span style="font-family: courier new,monospace;">--enable-checkpointing --with-hydra-ckpointlib=blcr</span>" flags.<br></div><br clear="all"><div>I launched my job as follows:<br>
<br><span style="font-family: courier new,monospace;">&lt;mpich2 installation path&gt;/bin/mpiexec -ckpoint-prefix=&lt;path to ckpt dir&gt; -ckpoint-interval 30 -f ./mf -n 4 ./IMB-EXT</span><br><br><br>We weren't sure whether this is a known bug, so we wanted to report it to the MPICH2 group.<br>
<br><br><br>Thanks,<br><br></div>--<br><font color="#888888">Raghu<br></font>