[mpich-discuss] cryptic (to me) error

Rajeev Thakur thakur at mcs.anl.gov
Thu Sep 2 11:10:51 CDT 2010


Just try relinking with the new library first.

Rajeev

On Sep 2, 2010, at 9:32 AM, SULLIVAN David (AREVA) wrote:

> I saw that there was a newer beta. I was really hoping to find I just
> configured something incorrectly. Will this not require me to re-build
> mcnp (the only program I run that uses MPI for parallel) if I change the
> MPI version? If so, this is a bit of a hardship, requiring codes to be
> revalidated. If not, I will try it in a second.
> 
> Thanks,
> 
> Dave
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
> Sent: Thursday, September 02, 2010 10:27 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
> 
> Can you try the latest release (1.3b1) to see if that fixes the problems
> you are seeing with your application?
> 
> http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
> 
> -Dave
> 
> On Sep 2, 2010, at 9:15 AM CDT, SULLIVAN David (AREVA) wrote:
> 
>> Another output file, hopefully of use. 
>> 
>> Thanks again
>> 
>> Dave
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN David
>> (AREVA)
>> Sent: Thursday, September 02, 2010 8:20 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>> 
>> First, my apologies for the delay in continuing this thread.
>> Unfortunately I have not resolved it, so if I can indulge the gurus and
>> developers once again...
>> 
>> As suggested by Rajeev, I ran the test suite in the source directory.
>> The output of errors, which are similar to what I was seeing when I ran
>> mcnp5 (v. 1.40 and 1.51), is attached.
>> 
>> Any insights would be greatly appreciated,
>> 
>> Dave
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
>> Sent: Wednesday, August 04, 2010 3:06 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>> 
>> Then one level above that directory (in the main MPICH2 source 
>> directory), type make testing, which will run through the entire 
>> MPICH2 test suite.
>> 
>> Rajeev
>> 
>> On Aug 4, 2010, at 2:04 PM, SULLIVAN David (AREVA) wrote:
>> 
>>> Oh. That's embarrassing. Yeah, I have those examples. It runs with no
>>> problems:
>>> 
>>> [dfs at aramis examples]$ mpiexec -host aramis -n 4 ./cpi
>>> Process 2 of 4 is on aramis
>>> Process 3 of 4 is on aramis
>>> Process 0 of 4 is on aramis
>>> Process 1 of 4 is on aramis
>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>> wall clock time = 0.000652
>>> 
>>> 
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
>>> Sent: Wednesday, August 04, 2010 1:13 PM
>>> To: Mpich Discuss
>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>> 
>>> Hi David
>>> 
>>> I think the "examples" dir is not copied to the installation directory.
>>> You may find it where you decompressed the MPICH2 tarball, in case you
>>> installed it from source.
>>> At least, this is what I have here.
>>> 
>>> Gus Correa
>>> 
>>> 
>>> SULLIVAN David (AREVA) wrote:
>>>> Yeah, that always bothered me. There is no such folder.
>>>> There are:
>>>> bin
>>>> etc
>>>> include
>>>> lib
>>>> sbin
>>>> share
>>>> 
>>>> The only examples I found were in the share folder, where there are
>>>> examples for collchk, graphics and logging.
>>>> 
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev 
>>>> Thakur
>>>> Sent: Wednesday, August 04, 2010 12:45 PM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>> 
>>>> Not cpilog. Can you run just cpi from the mpich2/examples directory?
>>>> 
>>>> Rajeev
>>>> 
>>>> 
>>>> On Aug 4, 2010, at 11:37 AM, SULLIVAN David (AREVA) wrote:
>>>> 
>>>>> Rajeev, Darius,
>>>>> 
>>>>> Thanks for your response.
>>>>> cpi yields the following:
>>>>> 
>>>>> [dfs at aramis examples_logging]$ mpiexec -host aramis -n 12 ./cpilog
>>>>> Process 0 running on aramis
>>>>> Process 2 running on aramis
>>>>> Process 3 running on aramis
>>>>> Process 1 running on aramis
>>>>> Process 6 running on aramis
>>>>> Process 7 running on aramis
>>>>> Process 8 running on aramis
>>>>> Process 4 running on aramis
>>>>> Process 5 running on aramis
>>>>> Process 9 running on aramis
>>>>> Process 10 running on aramis
>>>>> Process 11 running on aramis
>>>>> pi is approximately 3.1415926535898762, Error is 0.0000000000000830
>>>>> wall clock time = 0.058131
>>>>> Writing logfile....
>>>>> Enabling the Default clock synchronization...
>>>>> clog_merger.c:CLOG_Merger_init() -
>>>>>     Could not open file ./cpilog.clog2 for merging!
>>>>> Backtrace of the callstack at rank 0:
>>>>>     At [0]: ./cpilog(CLOG_Util_abort+0x92)[0x456326]
>>>>>     At [1]: ./cpilog(CLOG_Merger_init+0x11f)[0x45db7c]
>>>>>     At [2]: ./cpilog(CLOG_Converge_init+0x8e)[0x45a691]
>>>>>     At [3]: ./cpilog(MPE_Finish_log+0xea)[0x4560aa]
>>>>>     At [4]: ./cpilog(MPI_Finalize+0x50c)[0x4268af]
>>>>>     At [5]: ./cpilog(main+0x428)[0x415963]
>>>>>     At [6]: /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c1881d994]
>>>>>     At [7]: ./cpilog[0x415449]
>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0 
>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>> 
>>>>> So it looks like it works, with some issues.
>>>>> 
>>>>> When does it fail? Immediately.
>>>>> 
>>>>> Is there a bug? Many successfully use the application (MCNP5, from
>>>>> LANL) with MPI, so I think that a bug there is unlikely.
>>>>> 
>>>>> Core files, unfortunately, reveal some ignorance on my part. Where
>>>>> exactly should I be looking for them?
>>>>> 
>>>>> Thanks again,
>>>>> 
>>>>> Dave
>>>>> -----Original Message-----
>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius 
>>>>> Buntinas
>>>>> Sent: Wednesday, August 04, 2010 12:19 PM
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>> 
>>>>> 
>>>>> This error message says that two processes terminated because they
>>>>> were unable to communicate with another (or two other) process. It's
>>>>> possible that another process died, so the others got errors trying to
>>>>> communicate with them. It's also possible that there is something
>>>>> preventing some processes from communicating with each other.
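A note on reading such a stack: the frames run from the MPI call the application made down to the transport layer, and usually only a few of them carry a message. The Python sketch below is an illustration only and is not part of the original exchange. It splits an MPICH2-style error stack into frames and returns the innermost frame that has a message. The regular expression and the names `parse_error_stack` and `innermost_message` are assumptions made for this example, based solely on the stack lines quoted in this thread.

```python
import re

# Assumed line shape, taken from the stack quoted in this thread:
#   name(line)....: optional message
FRAME = re.compile(r"^(?P<func>\w+)\((?P<line>\d+)\)\.*:\s*(?P<msg>.*)$")


def parse_error_stack(text):
    """Return (function, source_line, message) tuples, outermost frame first."""
    frames = []
    for raw in text.splitlines():
        match = FRAME.match(raw.strip())
        if match:
            frames.append(
                (match["func"], int(match["line"]), match["msg"].strip())
            )
    return frames


def innermost_message(frames):
    """Return the deepest frame that actually carries a message, or None."""
    for frame in reversed(frames):
        if frame[2]:
            return frame
    return None


if __name__ == "__main__":
    sample = (
        "MPI_Comm_dup(168).................: "
        "MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed\n"
        "MPIR_Comm_copy(923)...............:\n"
        "MPID_nem_tcp_connpoll(1709).......: Communication error\n"
    )
    print(innermost_message(parse_error_stack(sample)))
```

For the stack in this thread, the innermost message comes from the TCP connection polling frame, which fits the reading above: the failure is at the communication layer, not in the arguments passed to MPI_Comm_dup.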
>>>>> 
>>>>> Are you able to run cpi from the examples directory with 12 processes?
>>>>> 
>>>>> At what point in your code does this fail?  Are there any other 
>>>>> communication operations before the MPI_Comm_dup?
>>>>> 
>>>>> Enable core files (add "ulimit -c unlimited" to your .bashrc, or
>>>>> "limit coredumpsize unlimited" to your .tcshrc) then run your app and
>>>>> look for core files. If there is a bug in your application that causes
>>>>> a process to die this might tell you which one and why.
>>>>> 
>>>>> Let us know how this goes.
>>>>> 
>>>>> -d
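A quick way to confirm that the core-file limit took effect, and to see where the kernel is likely to write core files, is sketched below in Python. This is an illustration and not part of the original message. The helper names are made up for the example, and the `/proc/sys/kernel/core_pattern` lookup assumes a Linux host; on other systems it simply returns None.

```python
import resource
from pathlib import Path


def core_limit_status(soft_limit):
    """Describe a soft RLIMIT_CORE value in plain words."""
    if soft_limit == resource.RLIM_INFINITY:
        return "unlimited"
    if soft_limit == 0:
        return "disabled"
    return f"limited to {soft_limit} bytes"


def current_core_limit():
    """Report the core-file size limit inherited by this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return core_limit_status(soft)


def core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Linux only: return the kernel's core file naming pattern, or None."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print("core files:", current_core_limit())
    print("core pattern:", core_pattern())
```

Run it from the same shell that launches mpiexec. If the limit shows as "disabled", the shell startup file change has not been picked up. On Linux, a plain pattern such as `core` normally means the file appears in the working directory of the process that crashed, while a pattern beginning with `|` means cores are piped to a helper program instead. With several nodes, each node has its own settings, so the check would need to be repeated on each one.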
>>>>> 
>>>>> 
>>>>> On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:
>>>>> 
>>>>>> Since I have had no responses, is there any additional information I
>>>>>> could provide to solicit some direction for overcoming this latest
>>>>>> string of MPI errors?
>>>>>> Thanks,
>>>>>> 
>>>>>> Dave
>>>>>> 
>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN 
>>>>>> David F (AREVA NP INC)
>>>>>> Sent: Friday, July 23, 2010 4:29 PM
>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>> Subject: [mpich-discuss] cryptic (to me) error
>>>>>> 
>>>>>> With my firewall issues firmly behind me, I have a new problem for the
>>>>>> collective wisdom. I am attempting to run a program, to which the
>>>>>> response is as follows:
>>>>>> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed
>>>>>> MPIR_Comm_copy(923)...............:
>>>>>> MPIR_Get_contextid(639)...........:
>>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>>> MPIR_Allreduce(228)...............:
>>>>>> MPIC_Send(41).....................:
>>>>>> MPIC_Wait(513)....................:
>>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>>> MPID_nem_mpich2_blocking_recv(933):
>>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff97dca620) failed
>>>>>> MPIR_Comm_copy(923)...............:
>>>>>> MPIR_Get_contextid(639)...........:
>>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>>> MPIR_Allreduce(289)...............:
>>>>>> MPIC_Sendrecv(161)................:
>>>>>> MPIC_Wait(513)....................:
>>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>>> MPID_nem_mpich2_blocking_recv(948):
>>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>>> Killed by signal 2.
>>>>>> Ctrl-C caught... cleaning up processes
>>>>>> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142): could not find fd to deregister: -2
>>>>>> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
>>>>>> [press Ctrl-C again to force abort]
>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
>>>>>> [mcnp5_1-4 at athos ~]$
>>>>>> Any ideas?
>>>>>> 
>>>>>> Thanks in advance,
>>>>>> 
>>>>>> David Sullivan
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> AREVA NP INC
>>>>>> 400 Donald Lynch Boulevard
>>>>>> Marlborough, MA, 01752
>>>>>> Phone: (508) 573-6721
>>>>>> Fax: (434) 382-5597
>>>>>> David.Sullivan at AREVA.com
>>>>>> 
>>>>>> _______________________________________________
>>>>>> mpich-discuss mailing list
>>>>>> mpich-discuss at mcs.anl.gov
>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>> <summary.xml>
