[mpich-discuss] cryptic (to me) error
Darius Buntinas
buntinas at mcs.anl.gov
Wed Aug 4 11:18:33 CDT 2010
This error message says that two processes terminated because they were unable to communicate with another process (or with two other processes). It's possible that another process died, so the surviving processes got errors when trying to communicate with it. It's also possible that something is preventing some of the processes from communicating with each other.
Are you able to run cpi from the examples directory with 12 processes?
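If the MPICH2 source tree is still around, something like this should do it (adjust the path to wherever your examples directory actually is; the path below is just a placeholder):

    mpiexec -f nodes -n 12 /path/to/mpich2/examples/cpi

If cpi runs cleanly on all 12 processes, the problem is more likely in the application or its environment than in the MPICH2 installation itself.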
At what point in your code does this fail? Are there any other communication operations before the MPI_Comm_dup?
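If you want to isolate MPI_Comm_dup itself, here is a minimal sketch of my own (not taken from your application) that just duplicates MPI_COMM_WORLD on every rank, which is essentially the call that is failing in your trace. Build it with mpicc and launch it the same way you launch mcnp5.mpi:

    /* comm_dup_test.c -- minimal MPI_Comm_dup sketch, not application code */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm dup;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* MPI_Comm_dup is collective: every rank must reach this call,
         * and it needs working communication among all processes. */
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);

        printf("rank %d: MPI_Comm_dup succeeded\n", rank);

        MPI_Comm_free(&dup);
        MPI_Finalize();
        return 0;
    }

If this fails with the same "Communication error", the problem is in the setup (hosts, network, firewall) rather than in the application code.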
Enable core files (add "ulimit -c unlimited" to your .bashrc or .tcshrc), then run your app and look for core files. If a bug in your application causes a process to die, this might tell you which one and why.
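Once you have a core file (the name may be "core" or "core.<pid>" depending on your kernel settings), a backtrace from gdb should show where that process died, for example:

    gdb ./mcnp5.mpi core
    (gdb) bt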
Let us know how this goes.
-d
On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:
> Since I have had no responses, is there any additional information I could provide to get some direction on overcoming this latest string of MPI errors?
>
> Thanks,
>
> Dave
>
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN David F (AREVA NP INC)
> Sent: Friday, July 23, 2010 4:29 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] cryptic (to me) error
>
> With my firewall issues firmly behind me, I have a new problem for the collective wisdom. I am attempting to run a program, and the response is as follows:
>
> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed
> MPIR_Comm_copy(923)...............:
> MPIR_Get_contextid(639)...........:
> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
> MPIR_Allreduce(228)...............:
> MPIC_Send(41).....................:
> MPIC_Wait(513)....................:
> MPIDI_CH3I_Progress(150)..........:
> MPID_nem_mpich2_blocking_recv(933):
> MPID_nem_tcp_connpoll(1709).......: Communication error
> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff97dca620) failed
> MPIR_Comm_copy(923)...............:
> MPIR_Get_contextid(639)...........:
> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
> MPIR_Allreduce(289)...............:
> MPIC_Sendrecv(161)................:
> MPIC_Wait(513)....................:
> MPIDI_CH3I_Progress(150)..........:
> MPID_nem_mpich2_blocking_recv(948):
> MPID_nem_tcp_connpoll(1709).......: Communication error
> Killed by signal 2.
> Ctrl-C caught... cleaning up processes
> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142): could not find fd to deregister: -2
> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
> [press Ctrl-C again to force abort]
> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
> [mcnp5_1-4 at athos ~]$
> Any ideas?
>
> Thanks in advance,
>
> David Sullivan
>
>
>
> AREVA NP INC
> 400 Donald Lynch Boulevard
> Marlborough, MA, 01752
> Phone: (508) 573-6721
> Fax: (434) 382-5597
> David.Sullivan at AREVA.com
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss