[mpich-discuss] cryptic (to me) error

Gus Correa gus at ldeo.columbia.edu
Wed Aug 4 12:13:18 CDT 2010


Hi David

I think the "examples" dir is not copied to the installation directory.
You may find it where you decompressed the MPICH2 tarball,
in case you installed it from source.
At least, this is what I have here.
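
For instance, something along these lines usually works (the tarball
location and version below are just placeholders for whatever you
actually unpacked):

    # go back to where the MPICH2 source tarball was unpacked
    cd ~/src/mpich2-1.2.1p1/examples

    # build the cpi example with the installed wrapper compiler
    mpicc -o cpi cpi.c

    # run it on a few processes
    mpiexec -n 4 ./cpi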

Gus Correa


SULLIVAN David (AREVA) wrote:
> Yeah, that always bothered me. There is no such folder.
> There are:
> bin
> etc
> include
> lib
> sbin
> share
> 
> The only examples I found were in the share folder, where there are
> examples for collchk, graphics, and logging.
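> 
> (For what it's worth, a quick way to check whether cpi landed anywhere
> under the install at all -- the prefix below is only a guess at yours:)
> 
>     # look for the cpi example source or binary under the install prefix
>     find /usr/local/mpich2 -name 'cpi*' 2>/dev/null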
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
> Sent: Wednesday, August 04, 2010 12:45 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
> 
> Not cpilog. Can you run just cpi from the mpich2/examples directory?
> 
> Rajeev
> 
> 
> On Aug 4, 2010, at 11:37 AM, SULLIVAN David (AREVA) wrote:
> 
>> Rajeev, Darius,
>>
>> Thanks for your response.
>> cpi yields the following:
>>
>> [dfs at aramis examples_logging]$ mpiexec -host aramis -n 12 ./cpilog 
>> Process 0 running on aramis
>> Process 2 running on aramis
>> Process 3 running on aramis
>> Process 1 running on aramis
>> Process 6 running on aramis
>> Process 7 running on aramis
>> Process 8 running on aramis
>> Process 4 running on aramis
>> Process 5 running on aramis
>> Process 9 running on aramis
>> Process 10 running on aramis
>> Process 11 running on aramis
>> pi is approximately 3.1415926535898762, Error is 0.0000000000000830
>> wall clock time = 0.058131
>> Writing logfile....
>> Enabling the Default clock synchronization...
>> clog_merger.c:CLOG_Merger_init() -
>>        Could not open file ./cpilog.clog2 for merging!
>> Backtrace of the callstack at rank 0:
>>        At [0]: ./cpilog(CLOG_Util_abort+0x92)[0x456326]
>>        At [1]: ./cpilog(CLOG_Merger_init+0x11f)[0x45db7c]
>>        At [2]: ./cpilog(CLOG_Converge_init+0x8e)[0x45a691]
>>        At [3]: ./cpilog(MPE_Finish_log+0xea)[0x4560aa]
>>        At [4]: ./cpilog(MPI_Finalize+0x50c)[0x4268af]
>>        At [5]: ./cpilog(main+0x428)[0x415963]
>>        At [6]: /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c1881d994]
>>        At [7]: ./cpilog[0x415449]
>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0 
>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>
>> So it looks like it works, with some issues.
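>>
>> (My guess is the merge step simply cannot write ./cpilog.clog2 in the
>> current directory; one thing I may try -- the paths below are only
>> examples:)
>>
>>     # re-run from a directory that is writable (and visible to all ranks)
>>     cd /tmp
>>     mpiexec -host aramis -n 12 /path/to/cpilog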
>>
>> When does it fail? Immediately.
>>
>> Is there a bug? Many people successfully use the application (MCNP5,
>> from LANL) with MPI, so I think a bug there is unlikely.
>>
>> Core files, unfortunately, reveal some ignorance on my part. Where
>> exactly should I be looking for them?
>>
>> Thanks again,
>>
>> Dave
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius 
>> Buntinas
>> Sent: Wednesday, August 04, 2010 12:19 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>
>>
>> This error message says that two processes terminated because they
>> were unable to communicate with another (or two other) process. It's
>> possible that another process died, so the others got errors trying to
>> communicate with it. It's also possible that there is something
>> preventing some processes from communicating with each other.
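>>
>> (A quick sanity check, assuming the same "nodes" file from your original
>> command: launching a plain hostname on every slot shows whether all the
>> hosts are reachable at all.)
>>
>>     # each slot should print its host name; a hang or error points at
>>     # the host that cannot be reached
>>     mpiexec -f nodes -n 12 hostname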
>>
>> Are you able to run cpi from the examples directory with 12 processes?
>>
>> At what point in your code does this fail?  Are there any other 
>> communication operations before the MPI_Comm_dup?
>>
>> Enable core files (add "ulimit -c unlimited" to your .bashrc or
>> .tcshrc), then run your app and look for core files. If there is a bug
>> in your application that causes a process to die, this might tell you
>> which one and why.
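>>
>> (A minimal bash sketch of that; the core_pattern check is optional and
>> Linux-specific:)
>>
>>     # allow core dumps for this shell and its children
>>     ulimit -c unlimited
>>
>>     # see where and how core files are named on this system
>>     cat /proc/sys/kernel/core_pattern
>>
>>     # run the app, then look for core files in the working directory
>>     # of each process
>>     mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>     ls core*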
>>
>> Let us know how this goes.
>>
>> -d
>>
>>
>> On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:
>>
>>> Since I have had no responses, is there any other information I could
>>> provide to solicit some direction for overcoming this latest string of
>>> MPI errors?
>>> Thanks,
>>>
>>> Dave
>>>
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN 
>>> David
>>> F (AREVA NP INC)
>>> Sent: Friday, July 23, 2010 4:29 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: [mpich-discuss] cryptic (to me) error
>>>
>>> With my firewall issues firmly behind me, I have a new problem for the
>>> collective wisdom. I am attempting to run a program, and the response is
>>> as follows:
>>> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed
>>> MPIR_Comm_copy(923)...............:
>>> MPIR_Get_contextid(639)...........:
>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>> MPIR_Allreduce(228)...............:
>>> MPIC_Send(41).....................:
>>> MPIC_Wait(513)....................:
>>> MPIDI_CH3I_Progress(150)..........:
>>> MPID_nem_mpich2_blocking_recv(933):
>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff97dca620) failed
>>> MPIR_Comm_copy(923)...............:
>>> MPIR_Get_contextid(639)...........:
>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>> MPIR_Allreduce(289)...............:
>>> MPIC_Sendrecv(161)................:
>>> MPIC_Wait(513)....................:
>>> MPIDI_CH3I_Progress(150)..........:
>>> MPID_nem_mpich2_blocking_recv(948):
>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>> Killed by signal 2.
>>> Ctrl-C caught... cleaning up processes
>>> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142): could not find fd to deregister: -2
>>> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
>>> [press Ctrl-C again to force abort]
>>> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
>>> [mcnp5_1-4 at athos ~]$
>>>
>>> Any ideas?
>>>
>>> Thanks in advance,
>>>
>>> David Sullivan
>>>
>>>
>>>
>>> AREVA NP INC
>>> 400 Donald Lynch Boulevard
>>> Marlborough, MA, 01752
>>> Phone: (508) 573-6721
>>> Fax: (434) 382-5597
>>> David.Sullivan at AREVA.com
>>>


