[mpich-discuss] Trouble with checkpoint

Darius Buntinas buntinas at mcs.anl.gov
Fri Oct 28 15:00:52 CDT 2011


OK, sorry I didn't have a chance to test the patch before I sent it to you (I switched to a mac and need to find a machine with BLCR).  I'll look into this further.

Thanks for trying.

-d


On Oct 28, 2011, at 2:54 PM, Fernando Luz wrote:

> Hi Darius,
> 
> I applied the patch following your instructions, but it doesn't work.
> 
> I received this message (I ran salloc with the verbose option):
> 
> [fernando_luz at masternode1 modelos_teste_mpi]$ salloc --verbose -n 26 --exclusive mpiexec -ckpointlib blcr -ckpoint-prefix teste.ckpoint/ -ckpoint-interval 20 -ckpoint-num 0
> salloc: auth plugin for Munge (http://home.gna.org/munge/) loaded
> salloc: Consumable Resources (CR) Node Selection plugin loaded with argument 4
> salloc: Granted job allocation 12941
> salloc: Relinquishing job allocation 12941
> salloc: Job allocation 12941 has been revoked.
> salloc: Command "mpiexec" was terminated by signal 11
> 
> Thanks for the help,
> 
> Regards
> 
> Fernando Luz
> 
> 
> On Fri, 2011-10-28 at 11:13 -0500, Darius Buntinas wrote:
>> Hi Fernando,
>> 
>> Can you apply this patch and try it again?  Don't forget to do a make clean;make;make install again.
>> 
>> Please let us know if this fixes the problem.
>> 
>> Thanks,
>> -d
>> 
>> 
>> 
>> On Oct 26, 2011, at 10:10 AM, Fernando Luz wrote:
>> 
>> > Darius,
>> > 
>> > Thanks for the help. It works now. I had forgotten to recompile my application.
>> > 
>> > But I have another question. Is it possible to use the checkpoint-restart feature in mpich2 with the slurm process manager?
>> > 
>> > I tried to execute
>> > 
>> > salloc -n 26 mpiexec -ckpointlib blcr -ckpoint-prefix ./teste.ckpoint -ckpoint-interval 30 Dyna Prea_teste001.p3d 2
>> > 
>> > And to restart I used
>> > 
>> > salloc -n 26 mpiexec -ckpointlib blcr -ckpoint-prefix ./teste.ckpoint -ckpoint-interval 30 -ckpoint-num 2
>> > 
>> > I received the following message:
>> > salloc: Granted job allocation 12613
>> > [mpiexec at masternode1] HYD_pmcd_pmi_alloc_pg_scratch (./pm/pmiserv/pmiserv_utils.c:594): assert (pg->pg_process_count * sizeof(struct HYD_pmcd_pmi_ecount)) failed
>> > [mpiexec at masternode1] HYD_pmci_launch_procs (./pm/pmiserv/pmiserv_pmci.c:103): error allocating pg scratch space
>> > [mpiexec at masternode1] main (./ui/mpich/mpiexec.c:401): process manager returned error launching processes
>> > salloc: Relinquishing job allocation 12613
>> > 
>> > The entire cluster runs under NFS.
>> > 
>> > But if I use salloc only to allocate the nodes and pass -f hosts (listing the nodes allocated by salloc) to mpiexec, it works perfectly.
>> > 
>> > Regards
>> > 
>> > Fernando Luz
>> > 
>> > 
>> > On Mon, 2011-10-24 at 16:41 -0500, Darius Buntinas wrote:
>> >> Hmm strange.  Did you do a make clean first?  I.e.:
>> >>   make clean
>> >>   make
>> >>   make install
>> >> 
>> >> Also make sure you recompile your app (maybe even do a make clean for the app too).
>> >> 
>> >> -d
>> >> 
>> >> 
>> >> On Oct 22, 2011, at 3:28 PM, Fernando Luz wrote:
>> >> 
>> >> > Hi Darius,
>> >> > 
>> >> > I applied the patch, but I have the same errors.
>> >> > 
>> >> > Do you need some file or info about my system?
>> >> > 
>> >> > Regards
>> >> > 
>> >> > Fernando Luz
>> >> > 
>> >> > ----- Original message -----
>> >> > From: "Darius Buntinas" <buntinas at mcs.anl.gov>
>> >> > To: mpich-discuss at mcs.anl.gov
>> >> > Sent: Friday, 21 October 2011 16:20:49
>> >> > Subject: Re: [mpich-discuss] Trouble with checkpoint
>> >> > 
>> >> > Hi Fernando,
>> >> > 
>> >> > Can you apply this patch and see if it fixes your problem?
>> >> > 
>> >> > Let us know how it goes.
>> >> > -d
>> >> > 
>> >> > 
>> >> > 
>> >> > On Oct 19, 2011, at 2:13 PM, Fernando Luz wrote:
>> >> > 
>> >> >> Hi,
>> >> >> 
>> >> >> I tried to use the checkpoint-restart feature with this execution:
>> >> >> 
>> >> >> mpiexec -ckpointlib blcr -ckpoint-prefix ./teste.ckpoint -ckpoint-interval 30 -f hosts -n 26 Dyna Prea_teste001.p3d 2
>> >> >> 
>> >> >> with mpich2, and I received the following errors:
>> >> >> 
>> >> >>  0% [=                                                 ] 00:00:28 / 00:56:27
>> >> >> [proxy:0:0 at s23n20.gradebr.tpn] requesting checkpoint
>> >> >> [proxy:0:1 at s23n21.gradebr.tpn] requesting checkpoint
>> >> >> [proxy:0:2 at s23n22.gradebr.tpn] requesting checkpoint
>> >> >> [proxy:0:3 at s23n23.gradebr.tpn] requesting checkpoint
>> >> >> [proxy:0:0 at s23n20.gradebr.tpn] checkpoint completed
>> >> >> [proxy:0:1 at s23n21.gradebr.tpn] checkpoint completed
>> >> >> [proxy:0:2 at s23n22.gradebr.tpn] checkpoint completed
>> >> >> [proxy:0:3 at s23n23.gradebr.tpn] checkpoint completed
>> >> >>  0% [=                                                 ] 00:00:29 / 00:56:28Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x1ebebfc0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff7ba7b620) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x1f84f600, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff957566a0) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x1fc58d50, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7ffff54102a0) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x7752ca0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fffeab72ca0) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x12274ca0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fffb55e4ea0) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x1b6c4600, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff74e63520) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x15511ca0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff9fb57ca0) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x815afc0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff87e31fa0) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0xf1e7d80, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff19d30120) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x1758f9a0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff06ac13a0) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0xaaf8ce0, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff7e488920) failed
>> >> >> MPIDI_CH3I_Progress(321)..: 
>> >> >> MPIDI_nem_ckpt_finish(469): sem_wait() failed Interrupted system call
>> >> >> Fatal error in MPI_Recv: Other MPI error, error stack:
>> >> >> MPI_Recv(186).............: MPI_Recv(buf=0x1cc47990, count=7, MPI_DOUBLE, src=0, tag=2, MPI_COMM_WORLD, status=0x7fff00638520) failed
>> >> >> 
>> >> >> Without checkpointing, the execution completes successfully. How should I proceed to solve this error?
>> >> >> 
>> >> >> Regards
>> >> >> 
>> >> >> Fernando Luz
>> >> >> _______________________________________________
>> >> >> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>> >> >> To manage subscription options or unsubscribe:
>> >> >> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
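For reference, the checkpoint/restart workflow discussed in the thread can be collected into one sketch. The commands, the prefix `./teste.ckpoint`, the application name `Dyna`, and its arguments are taken verbatim from Fernando's messages; this assumes mpich2 was built with BLCR checkpointing support and that the prefix directory is on a filesystem shared by all nodes (NFS in this thread). Note that, as reported above, launching through `salloc` directly hit a bug; the workaround was to use `salloc` only for the allocation and pass the allocated nodes to `mpiexec` with `-f hosts`.

```shell
# Initial run: checkpoint every 30 seconds with BLCR.
# -ckpoint-prefix names where checkpoint images are written; it must be
# visible to every node (a shared filesystem such as NFS).
salloc -n 26 mpiexec -ckpointlib blcr \
    -ckpoint-prefix ./teste.ckpoint \
    -ckpoint-interval 30 \
    Dyna Prea_teste001.p3d 2

# Restart from checkpoint number 2: same options, but the application
# command line is replaced by -ckpoint-num, which selects the image set.
salloc -n 26 mpiexec -ckpointlib blcr \
    -ckpoint-prefix ./teste.ckpoint \
    -ckpoint-interval 30 \
    -ckpoint-num 2
```

These are cluster commands and cannot run outside a SLURM/BLCR installation; treat them as a usage fragment, not a tested script.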
