[mpich-discuss] Trouble with checkpoint

Fernando Luz fernando_luz at tpn.usp.br
Wed Oct 19 14:13:20 CDT 2011


Hi,

I tried to use checkpoint/restart with the following command:

mpiexec -ckpointlib blcr -ckpoint-prefix ./teste.ckpoint
-ckpoint-interval 30 -f hosts -n 26 Dyna Prea_teste001.p3d 2

with MPICH2, and I received the following errors:

  0% [=                                                 ] 00:00:28 / 00:56:27
[proxy:0:0 at s23n20.gradebr.tpn] requesting checkpoint
[proxy:0:1 at s23n21.gradebr.tpn] requesting checkpoint
[proxy:0:2 at s23n22.gradebr.tpn] requesting checkpoint
[proxy:0:3 at s23n23.gradebr.tpn] requesting checkpoint
[proxy:0:0 at s23n20.gradebr.tpn] checkpoint completed
[proxy:0:1 at s23n21.gradebr.tpn] checkpoint completed
[proxy:0:2 at s23n22.gradebr.tpn] checkpoint completed
[proxy:0:3 at s23n23.gradebr.tpn] checkpoint completed
  0% [=                                                 ] 00:00:29 / 00:56:28
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x1ebebfc0, count=7,
MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff7ba7b620) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x1f84f600, count=7,
MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff957566a0) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x1fc58d50, count=7,
MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7ffff54102a0) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x7752ca0, count=7, MPI_DOUBLE,
src=0, tag=2, MPI_COMM_WORLD, status=0x7fffeab72ca0) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x12274ca0, count=7,
MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fffb55e4ea0) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x1b6c4600, count=7,
MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff74e63520) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x15511ca0, count=7,
MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff9fb57ca0) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x815afc0, count=7, MPI_DOUBLE,
src=0, tag=2, MPI_COMM_WORLD, status=0x7fff87e31fa0) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0xf1e7d80, count=7, MPI_DOUBLE,
src=0, tag=2, MPI_COMM_WORLD, status=0x7fff19d30120) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x1758f9a0, count=7,
MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff06ac13a0) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0xaaf8ce0, count=7, MPI_DOUBLE,
src=0, tag=2, MPI_COMM_WORLD, status=0x7fff7e488920) failed
MPIDI_CH3I_Progress(321)..: 
MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............: MPI_Recv(buf=0x1cc47990, count=7,
MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff00638520) failed

Without checkpointing, the run completes successfully. How should I
proceed to solve this error?
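
For reference, my reading of the last line of each error stack is that
sem_wait() returned -1 with errno set to EINTR ("Interrupted system
call"), which POSIX allows when a signal handler runs while the caller
is blocked. The usual idiom in ordinary C code is to retry the wait, as
in the minimal sketch below. The helper name sem_wait_retry is
hypothetical and is not part of MPICH2. The failing call is inside
MPIDI_nem_ckpt_finish in the library, so this only illustrates what the
message means and is not a fix I can apply from my application.

```c
#include <errno.h>
#include <semaphore.h>

/* Hypothetical helper (not an MPICH2 function): wait on a POSIX
 * semaphore and retry when the call is interrupted by a signal,
 * i.e. when sem_wait() returns -1 with errno == EINTR. */
static int sem_wait_retry(sem_t *sem)
{
    int rc;

    do {
        rc = sem_wait(sem);
    } while (rc == -1 && errno == EINTR);

    /* 0 on success; -1 with errno set for any failure other than EINTR. */
    return rc;
}
```

A caller would use sem_wait_retry(&sem) wherever it would otherwise
call sem_wait(&sem) and treat EINTR as fatal.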

Regards

Fernando Luz