[mpich-discuss] cryptic (to me) error

SULLIVAN David (AREVA) David.Sullivan at areva.com
Wed Aug 4 11:37:44 CDT 2010


Rajeev,  Darius,

Thanks for your response.
cpi yields the following:

[dfs at aramis examples_logging]$ mpiexec -host aramis -n 12 ./cpilog
Process 0 running on aramis
Process 2 running on aramis
Process 3 running on aramis
Process 1 running on aramis
Process 6 running on aramis
Process 7 running on aramis
Process 8 running on aramis
Process 4 running on aramis
Process 5 running on aramis
Process 9 running on aramis
Process 10 running on aramis
Process 11 running on aramis
pi is approximately 3.1415926535898762, Error is 0.0000000000000830
wall clock time = 0.058131
Writing logfile....
Enabling the Default clock synchronization...
clog_merger.c:CLOG_Merger_init() -
        Could not open file ./cpilog.clog2 for merging!
Backtrace of the callstack at rank 0:
        At [0]: ./cpilog(CLOG_Util_abort+0x92)[0x456326]
        At [1]: ./cpilog(CLOG_Merger_init+0x11f)[0x45db7c]
        At [2]: ./cpilog(CLOG_Converge_init+0x8e)[0x45a691]
        At [3]: ./cpilog(MPE_Finish_log+0xea)[0x4560aa]
        At [4]: ./cpilog(MPI_Finalize+0x50c)[0x4268af]
        At [5]: ./cpilog(main+0x428)[0x415963]
        At [6]: /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c1881d994]
        At [7]: ./cpilog[0x415449]
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15) 

So it looks like it works, with some issues.
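
The merge step apparently could not create ./cpilog.clog2 in the directory I ran from. I assume a quick check along these lines (run from that same examples_logging directory; these are just guesses on my part) would show whether it is a permissions or disk-space problem:

ls -ld .                                  # is the directory writable by the user running mpiexec?
touch cpilog.clog2 && rm cpilog.clog2     # can the logfile be created by hand?
df -h .                                   # rule out a full filesystem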

When does it fail? Immediately.

Is there a bug? Many people successfully use the application (MCNP5, from
LANL) with MPI, so I think a bug there is unlikely.

Core files, unfortunately, reveal some ignorance on my part. Where
exactly should I be looking for them?

Thanks again,

Dave
-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius Buntinas
Sent: Wednesday, August 04, 2010 12:19 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] cryptic (to me) error


This error message says that two processes terminated because they were
unable to communicate with another process (or two other processes). It's
possible that one process died and the others got errors when trying to
communicate with it. It's also possible that something is preventing some
of the processes from communicating with each other.

Are you able to run cpi from the examples directory with 12 processes?
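
Something along these lines, using the same "nodes" file and process count you used for mcnp5.mpi (the paths below are placeholders for your own install and host file locations):

cd /path/to/mpich2/examples               # placeholder: wherever cpi was built
mpiexec -f /path/to/nodes -n 12 ./cpi     # same host file and -n 12 as the mcnp5 run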

At what point in your code does this fail?  Are there any other
communication operations before the MPI_Comm_dup?

Enable core files (add "ulimit -c unlimited" to your .bashrc or .tcshrc),
then run your app and look for core files. If there is a bug in your
application that causes a process to die, this might tell you which one
and why.
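
For example (treat this as a sketch -- the core file names and where they land depend on your shell start-up files and on /proc/sys/kernel/core_pattern, and /path/to/mcnp5.mpi is a placeholder):

ulimit -c unlimited                                # in the startup file used on every node
mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o   # reproduce the failure
ls core*                                           # core files usually appear in the working directory
gdb /path/to/mcnp5.mpi core.<pid>                  # then type "bt" at the gdb prompt for a backtrace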

Let us know how this goes.

-d


On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:

> Since I have had no responses, is there any additional information I could provide to solicit some direction for overcoming this latest string of MPI errors?
>  
> Thanks,
>  
> Dave
> 
> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN David F (AREVA NP INC)
> Sent: Friday, July 23, 2010 4:29 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] cryptic (to me) error
> 
> With my firewall issues firmly behind me, I have a new problem for the collective wisdom. I am attempting to run a program, and the response is as follows:
>  
> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed
> MPIR_Comm_copy(923)...............:
> MPIR_Get_contextid(639)...........:
> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
> MPIR_Allreduce(228)...............:
> MPIC_Send(41).....................:
> MPIC_Wait(513)....................:
> MPIDI_CH3I_Progress(150)..........:
> MPID_nem_mpich2_blocking_recv(933):
> MPID_nem_tcp_connpoll(1709).......: Communication error
> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff97dca620) failed
> MPIR_Comm_copy(923)...............:
> MPIR_Get_contextid(639)...........:
> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
> MPIR_Allreduce(289)...............:
> MPIC_Sendrecv(161)................:
> MPIC_Wait(513)....................:
> MPIDI_CH3I_Progress(150)..........:
> MPID_nem_mpich2_blocking_recv(948):
> MPID_nem_tcp_connpoll(1709).......: Communication error
> Killed by signal 2.
> Ctrl-C caught... cleaning up processes
> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142): could not find fd to deregister: -2
> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
> [press Ctrl-C again to force abort]
> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
> [mcnp5_1-4 at athos ~]$
> 
> Any ideas?
>  
> Thanks in advance,
>  
> David Sullivan
>  
>  
>  
> AREVA NP INC
> 400 Donald Lynch Boulevard
> Marlborough, MA, 01752
> Phone: (508) 573-6721
> Fax: (434) 382-5597
> David.Sullivan at AREVA.com
>  
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

