[mpich-discuss] cryptic (to me) error

Dave Goodell goodell at mcs.anl.gov
Thu Sep 2 09:27:12 CDT 2010


Can you try the latest release (1.3b1) to see if that fixes the problems you are seeing with your application?

http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads

-Dave

On Sep 2, 2010, at 9:15 AM CDT, SULLIVAN David (AREVA) wrote:

> Another output file, hopefully of use. 
> 
> Thanks again
> 
> Dave 
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN David
> (AREVA)
> Sent: Thursday, September 02, 2010 8:20 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
> 
> First, my apologies for the delay in continuing this thread.
> Unfortunately I have not resolved the issue, so if I may indulge the
> gurus and developers once again...
> 
> As suggested by Rajeev, I ran the test suite in the source directory.
> The error output, which is similar to what I was seeing when I ran
> mcnp5 (v. 1.40 and 1.51), is attached.
> 
> Any insights would be greatly appreciated,
> 
> Dave
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
> Sent: Wednesday, August 04, 2010 3:06 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
> 
> Then one level above that directory (in the main MPICH2 source
> directory), type make testing, which will run through the entire MPICH2
> test suite.
> 
> Rajeev
> 
> On Aug 4, 2010, at 2:04 PM, SULLIVAN David (AREVA) wrote:
> 
>> Oh. That's embarrassing. Yes, I have those examples. It runs with no
>> problems:
>> 
>> [dfs at aramis examples]$ mpiexec -host aramis -n 4 ./cpi
>> Process 2 of 4 is on aramis
>> Process 3 of 4 is on aramis
>> Process 0 of 4 is on aramis
>> Process 1 of 4 is on aramis
>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> wall clock time = 0.000652
>> 
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
>> Sent: Wednesday, August 04, 2010 1:13 PM
>> To: Mpich Discuss
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>> 
>> Hi David
>> 
>> I think the "examples" dir is not copied to the installation directory.
>> You may find it where you decompressed the MPICH2 tarball, in case you
>> installed it from source.
>> At least, this is what I have here.
>> 
>> Gus Correa
>> 
>> 
>> SULLIVAN David (AREVA) wrote:
>>> Yea, that always bothered me. There is no such folder.
>>> There are:
>>> bin
>>> etc
>>> include
>>> lib
>>> sbin
>>> share
>>> 
>>> The only examples I found were in the share folder, where there are
>>> examples for collchk, graphics and logging.
>>> 
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
>>> Sent: Wednesday, August 04, 2010 12:45 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>> 
>>> Not cpilog. Can you run just cpi from the mpich2/examples directory?
>>> 
>>> Rajeev
>>> 
>>> 
>>> On Aug 4, 2010, at 11:37 AM, SULLIVAN David (AREVA) wrote:
>>> 
>>>> Rajeev, Darius,
>>>> 
>>>> Thanks for your response.
>>>> cpi yields the following:
>>>> 
>>>> [dfs at aramis examples_logging]$ mpiexec -host aramis -n 12 ./cpilog
>>>> Process 0 running on aramis
>>>> Process 2 running on aramis
>>>> Process 3 running on aramis
>>>> Process 1 running on aramis
>>>> Process 6 running on aramis
>>>> Process 7 running on aramis
>>>> Process 8 running on aramis
>>>> Process 4 running on aramis
>>>> Process 5 running on aramis
>>>> Process 9 running on aramis
>>>> Process 10 running on aramis
>>>> Process 11 running on aramis
>>>> pi is approximately 3.1415926535898762, Error is 0.0000000000000830
>>>> wall clock time = 0.058131
>>>> Writing logfile....
>>>> Enabling the Default clock synchronization...
>>>> clog_merger.c:CLOG_Merger_init() -
>>>>      Could not open file ./cpilog.clog2 for merging!
>>>> Backtrace of the callstack at rank 0:
>>>>      At [0]: ./cpilog(CLOG_Util_abort+0x92)[0x456326]
>>>>      At [1]: ./cpilog(CLOG_Merger_init+0x11f)[0x45db7c]
>>>>      At [2]: ./cpilog(CLOG_Converge_init+0x8e)[0x45a691]
>>>>      At [3]: ./cpilog(MPE_Finish_log+0xea)[0x4560aa]
>>>>      At [4]: ./cpilog(MPI_Finalize+0x50c)[0x4268af]
>>>>      At [5]: ./cpilog(main+0x428)[0x415963]
>>>>      At [6]: /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c1881d994]
>>>>      At [7]: ./cpilog[0x415449]
>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0 
>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>> 
>>>> So it looks like it works, with some issues.
>>>> 
>>>> When does it fail? Immediately.
>>>> 
>>>> Is there a bug? Many people successfully use the application (MCNP5,
>>>> from LANL) with MPI, so I think a bug there is unlikely.
>>>> 
>>>> Core files, unfortunately, reveal some ignorance on my part. Where
>>>> exactly should I be looking for them?
>>>> 
>>>> Thanks again,
>>>> 
>>>> Dave
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius 
>>>> Buntinas
>>>> Sent: Wednesday, August 04, 2010 12:19 PM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>> 
>>>> 
>>>> This error message says that two processes terminated because they
>>>> were unable to communicate with another (or two other) process. It's
>>>> possible that another process died, so the others got errors trying
>>>> to communicate with them. It's also possible that there is something
>>>> preventing some processes from communicating with each other.
>>>> 
>>>> Are you able to run cpi from the examples directory with 12 processes?
>>>> 
>>>> At what point in your code does this fail?  Are there any other 
>>>> communication operations before the MPI_Comm_dup?
>>>> 
>>>> Enable core files (add "ulimit -c unlimited" to your .bashrc or
>>>> .tcshrc) then run your app and look for core files. If there is a bug
>>>> in your application that causes a process to die this might tell you
>>>> which one and why.
>>>> 
>>>> Let us know how this goes.
>>>> 
>>>> -d
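A minimal standalone check of the MPI_Comm_dup path discussed above (a
sketch only; it assumes mpicc from the affected MPICH2 install and the
same host file used for the mcnp5.mpi run) could look like this:

    /* comm_dup_test.c - duplicate MPI_COMM_WORLD across all ranks.
     * Build:  mpicc comm_dup_test.c -o comm_dup_test
     * Run:    mpiexec -f nodes -n 12 ./comm_dup_test
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Comm dup;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* MPI_Comm_dup agrees on a new context id via an allreduce,
         * which is the step shown failing in the error stack below. */
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);
        printf("rank %d of %d: MPI_Comm_dup succeeded\n", rank, size);

        MPI_Comm_free(&dup);
        MPI_Finalize();
        return 0;
    }

If this fails across hosts with the same "Communication error" stack,
the problem is likely in the MPI installation or network setup rather
than in MCNP5 itself.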
>>>> 
>>>> 
>>>> On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:
>>>> 
>>>>> Since I have had no responses, is there any other information I
>>>>> could provide to solicit some direction for overcoming this latest
>>>>> string of MPI errors?
>>>>> Thanks,
>>>>> 
>>>>> Dave
>>>>> 
>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN 
>>>>> David F (AREVA NP INC)
>>>>> Sent: Friday, July 23, 2010 4:29 PM
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Subject: [mpich-discuss] cryptic (to me) error
>>>>> 
>>>>> With my firewall issues firmly behind me, I have a new problem for
>>>>> the collective wisdom. I am attempting to run a program, and the
>>>>> response is as follows:
>>>>> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed
>>>>> MPIR_Comm_copy(923)...............:
>>>>> MPIR_Get_contextid(639)...........:
>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>> MPIR_Allreduce(228)...............:
>>>>> MPIC_Send(41).....................:
>>>>> MPIC_Wait(513)....................:
>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>> MPID_nem_mpich2_blocking_recv(933):
>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff97dca620) failed
>>>>> MPIR_Comm_copy(923)...............:
>>>>> MPIR_Get_contextid(639)...........:
>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>> MPIR_Allreduce(289)...............:
>>>>> MPIC_Sendrecv(161)................:
>>>>> MPIC_Wait(513)....................:
>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>> MPID_nem_mpich2_blocking_recv(948):
>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>> Killed by signal 2.
>>>>> Ctrl-C caught... cleaning up processes
>>>>> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142): could not find fd to deregister: -2
>>>>> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
>>>>> [press Ctrl-C again to force abort]
>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
>>>>> [mcnp5_1-4 at athos ~]$
>>>>> 
>>>>> Any ideas?
>>>>> 
>>>>> Thanks in advance,
>>>>> 
>>>>> David Sullivan
>>>>> 
>>>>> 
>>>>> 
>>>>> AREVA NP INC
>>>>> 400 Donald Lynch Boulevard
>>>>> Marlborough, MA, 01752
>>>>> Phone: (508) 573-6721
>>>>> Fax: (434) 382-5597
>>>>> David.Sullivan at AREVA.com
>>>>> 
> <summary.xml>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


