[mpich-discuss] Trouble with checkpoint

Fernando Luz fernando_luz at tpn.usp.br
Wed Oct 26 10:10:35 CDT 2011


Darius,

Thanks for the help. It works now; I had forgotten to recompile my application.

But I have another question: is it possible to use the checkpoint-restart
feature in MPICH2 with the SLURM process manager?

I tried to execute

salloc -n 26 mpiexec -ckpointlib blcr -ckpoint-prefix ./teste.ckpoint
-ckpoint-interval 30 Dyna Prea_teste001.p3d 2

And to restart I used

salloc -n 26 mpiexec -ckpointlib blcr -ckpoint-prefix ./teste.ckpoint
-ckpoint-interval 30 -ckpoint-num 2

I received the following message:
salloc: Granted job allocation 12613
[mpiexec at masternode1] HYD_pmcd_pmi_alloc_pg_scratch
(./pm/pmiserv/pmiserv_utils.c:594): assert (pg->pg_process_count *
sizeof(struct HYD_pmcd_pmi_ecount)) failed
[mpiexec at masternode1] HYD_pmci_launch_procs
(./pm/pmiserv/pmiserv_pmci.c:103): error allocating pg scratch space
[mpiexec at masternode1] main (./ui/mpich/mpiexec.c:401): process manager
returned error launching processes
salloc: Relinquishing job allocation 12613

The entire cluster runs on NFS.

But if I use salloc only to allocate the nodes and then pass -f hosts (a
hosts file listing the nodes allocated by salloc) directly to mpiexec, it
works perfectly.
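
For reference, here is a sketch of that working approach in a single step
(using scontrol show hostnames to expand the SLURM allocation into a hosts
file is just one way to build it; the paths and process count are from my
job above):

salloc -n 26 bash -c '
  # expand the allocated node list into a hosts file
  scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts
  # launch with -f hosts instead of relying on the SLURM PM integration
  mpiexec -f hosts -n 26 -ckpointlib blcr \
      -ckpoint-prefix ./teste.ckpoint -ckpoint-interval 30 \
      Dyna Prea_teste001.p3d 2
'

and similarly for the restart, passing the same hosts file:

mpiexec -f hosts -n 26 -ckpointlib blcr \
    -ckpoint-prefix ./teste.ckpoint -ckpoint-num 2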

Regards

Fernando Luz


On Mon, 2011-10-24 at 16:41 -0500, Darius Buntinas wrote:

> Hmm, strange.  Did you do a make clean first?  I.e.:
>   make clean
>   make
>   make install
> 
> Also make sure you recompile your app (maybe even do a make clean for the app too).
> 
> -d
> 
> 
> On Oct 22, 2011, at 3:28 PM, Fernando Luz wrote:
> 
> > Hi Darius,
> > 
> > I applied the patch, but I have the same errors.
> > 
> > Do you need some file or info about my system?
> > 
> > Regards
> > 
> > Fernando Luz
> > 
> > ----- Original message -----
> > From: "Darius Buntinas" <buntinas at mcs.anl.gov>
> > To: mpich-discuss at mcs.anl.gov
> > Sent: Friday, October 21, 2011 16:20:49
> > Subject: Re: [mpich-discuss] Trouble with checkpoint
> > 
> > Hi Fernando,
> > 
> > Can you apply this patch and see if it fixes your problem?
> > 
> > Let us know how it goes.
> > -d
> > 
> > 
> > 
> > On Oct 19, 2011, at 2:13 PM, Fernando Luz wrote:
> > 
> >> Hi,
> >> 
> >> I tried to use checkpoint-restart with this execution:
> >> 
> >> mpiexec -ckpointlib blcr -ckpoint-prefix ./teste.ckpoint -ckpoint-interval 30 -f hosts -n 26 Dyna Prea_teste001.p3d 2
> >> 
> >> with MPICH2, and I received the following errors:
> >> 
> >>  0% [=                                                 ] 00:00:28 / 00:56:27
> >> [proxy:0:0 at s23n20.gradebr.tpn] requesting checkpoint
> >> [proxy:0:1 at s23n21.gradebr.tpn] requesting checkpoint
> >> [proxy:0:2 at s23n22.gradebr.tpn] requesting checkpoint
> >> [proxy:0:3 at s23n23.gradebr.tpn] requesting checkpoint
> >> [proxy:0:0 at s23n20.gradebr.tpn] checkpoint completed
> >> [proxy:0:1 at s23n21.gradebr.tpn] checkpoint completed
> >> [proxy:0:2 at s23n22.gradebr.tpn] checkpoint completed
> >> [proxy:0:3 at s23n23.gradebr.tpn] checkpoint completed
> >>  0% [=                                                 ] 00:00:29 / 00:56:28Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x1ebebfc0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff7ba7b620) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x1f84f600, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff957566a0) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x1fc58d50, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7ffff54102a0) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x7752ca0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fffeab72ca0) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x12274ca0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fffb55e4ea0) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x1b6c4600, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff74e63520) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x15511ca0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff9fb57ca0) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x815afc0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff87e31fa0) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0xf1e7d80, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff19d30120) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x1758f9a0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff06ac13a0) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0xaaf8ce0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff7e488920) failed
> >> MPIDI_CH3I_Progress(321)..: 
> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
> >> Fatal error in MPI_Recv: Other MPI error, error stack:
> >> MPI_Recv(186).............: MPI_Recv(buf=0x1cc47990, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff00638520) failed
> >> 
> >> Without checkpointing, the execution completes successfully. How should I proceed to solve this error?
> >> 
> >> Regards
> >> 
> >> Fernando Luz
> 