[mpich-discuss] cryptic (to me) error

Dave Goodell goodell at mcs.anl.gov
Thu Sep 2 15:22:29 CDT 2010


How long does your application run before giving the error message in error.txt?

Given the nature of the error, I suspect that something is flaky in your network (hard to say whether it's the hardware or a driver somewhere).  Long-ish or variable runtimes would also support this theory.

-Dave

On Sep 2, 2010, at 2:38 PM CDT, SULLIVAN David (AREVA) wrote:

> That fixed the compile. Thanks!
> 
> The latest release does not fix the issues I am having, though. Cpi works
> fine, and the test suite results are certainly improved (see the summary.xml
> output), though when I try to use mcnp it still crashes in the same way (see
> error.txt).
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Anthony Chan
> Sent: Thursday, September 02, 2010 1:38 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
> 
> 
> There is a bug in 1.3b1 related to the option --enable-fc. Since Fortran 90
> is enabled by default, remove --enable-fc from your configure command and
> try again. If there is an error again, send us the configure output as you
> see it on your screen (see the README) instead of config.log.
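> 
> For example, something along these lines should work (the install prefix
> and compiler names are just placeholders for your Intel 10.1 setup):
> 
>   ./configure --prefix=/home/dfs/mpich2-install CC=icc F77=ifort FC=ifort 2>&1 | tee c.txt
> 
> The "2>&1 | tee c.txt" part captures the configure output exactly as it
> appears on your screen, which is what we would need if it fails again.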
> 
> A.Chan
> 
> ----- "SULLIVAN David (AREVA)" <David.Sullivan at areva.com> wrote:
> 
>> Failure again.
>> The 1.3 beta version will not compile with Intel 10.1. It bombs in the
>> configure script:
>> 
>> checking for Fortran flag needed to allow free-form source... unknown
>> configure: WARNING: Fortran 90 test being disabled because the 
>> /home/dfs/mpich2-1.3b1/bin/mpif90 compiler does not accept a .f90 
>> extension
>> configure: error: Fortran does not accept free-form source
>> configure: error: ./configure failed for test/mpi
>> 
>> I have attached the config.log.
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
>> Sent: Thursday, September 02, 2010 12:11 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>> 
>> Just try relinking with the new library at first.
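>> 
>> For example, a rough sketch (the install path is a placeholder, and the
>> exact link step depends on how your mcnp5 build is set up):
>> 
>>   export PATH=/opt/mpich2-1.3b1/bin:$PATH   # put the new wrappers/mpiexec first
>>   mpif90 -o mcnp5.mpi *.o                   # re-run only the final link against the new library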
>> 
>> Rajeev
>> 
>> On Sep 2, 2010, at 9:32 AM, SULLIVAN David (AREVA) wrote:
>> 
>>> I saw that there was a newer beta. I was really hoping to find that I
>>> had just configured something incorrectly. Will this not require me to
>>> rebuild mcnp (the only program I run that uses MPI for parallel) if I
>>> change the MPI version? If so, this is a bit of a hardship, requiring
>>> codes to be revalidated. If not, I will try it in a second.
>>> 
>>> Thanks,
>>> 
>>> Dave
>>> 
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
>>> Sent: Thursday, September 02, 2010 10:27 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>> 
>>> Can you try the latest release (1.3b1) to see if that fixes the 
>>> problems you are seeing with your application?
>>> 
>>> 
>>> http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
>>> 
>>> -Dave
>>> 
>>> On Sep 2, 2010, at 9:15 AM CDT, SULLIVAN David (AREVA) wrote:
>>> 
>>>> Another output file, hopefully of use. 
>>>> 
>>>> Thanks again
>>>> 
>>>> Dave
>>>> 
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN 
>>>> David
>>>> (AREVA)
>>>> Sent: Thursday, September 02, 2010 8:20 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>> 
>>>> First, my apologies for the delay in continuing this thread.
>>>> Unfortunately, I have not resolved it, so if I can indulge the gurus
>>>> and developers once again...
>>>> 
>>>> As suggested by Rajeev, I ran the test suite in the source directory.
>>>> The error output, which is similar to what I was seeing when I ran
>>>> mcnp5 (v. 1.40 and 1.51), is attached.
>>>> 
>>>> Any insights would be greatly appreciated,
>>>> 
>>>> Dave
>>>> 
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
>>>> Sent: Wednesday, August 04, 2010 3:06 PM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>> 
>>>> Then one level above that directory (in the main MPICH2 source 
>>>> directory), type make testing, which will run through the entire
>>>> MPICH2 test suite.
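>>>> 
>>>> For example (the path below is just a placeholder for wherever you
>>>> untarred MPICH2):
>>>> 
>>>>   cd /path/to/mpich2-source   # top-level source dir, one level above examples/
>>>>   make testing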
>>>> 
>>>> Rajeev
>>>> 
>>>> On Aug 4, 2010, at 2:04 PM, SULLIVAN David (AREVA) wrote:
>>>> 
>>>>> Oh. That's embarrassing. Yea, I have those examples. It runs with no
>>>>> problems:
>>>>> 
>>>>> [dfs at aramis examples]$ mpiexec -host aramis -n 4 ./cpi
>>>>> Process 2 of 4 is on aramis
>>>>> Process 3 of 4 is on aramis
>>>>> Process 0 of 4 is on aramis
>>>>> Process 1 of 4 is on aramis
>>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>>> wall clock time = 0.000652
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
>>>>> Sent: Wednesday, August 04, 2010 1:13 PM
>>>>> To: Mpich Discuss
>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>> 
>>>>> Hi David
>>>>> 
>>>>> I think the "examples" dir is not copied to the installation directory.
>>>>> You may find it where you decompressed the MPICH2 tarball, in case you
>>>>> installed it from source.
>>>>> At least, this is what I have here.
>>>>> 
>>>>> Gus Correa
>>>>> 
>>>>> 
>>>>> SULLIVAN David (AREVA) wrote:
>>>>>> Yea, that always bothered me.  There is no such folder.
>>>>>> There are:
>>>>>> bin
>>>>>> etc
>>>>>> include
>>>>>> lib
>>>>>> sbin
>>>>>> share
>>>>>> 
>>>>>> The only examples I found were in the share folder, where there are
>>>>>> examples for collchk, graphics and logging.
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev 
>>>>>> Thakur
>>>>>> Sent: Wednesday, August 04, 2010 12:45 PM
>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>>> 
>>>>>> Not cpilog. Can you run just cpi from the mpich2/examples directory?
>>>>>> 
>>>>>> Rajeev
>>>>>> 
>>>>>> 
>>>>>> On Aug 4, 2010, at 11:37 AM, SULLIVAN David (AREVA) wrote:
>>>>>> 
>>>>>>> Rajeev, Darius,
>>>>>>> 
>>>>>>> Thanks for your response.
>>>>>>> cpi yields the following:
>>>>>>> 
>>>>>>> [dfs at aramis examples_logging]$ mpiexec -host aramis -n 12 ./cpilog
>>>>>>> Process 0 running on aramis
>>>>>>> Process 2 running on aramis
>>>>>>> Process 3 running on aramis
>>>>>>> Process 1 running on aramis
>>>>>>> Process 6 running on aramis
>>>>>>> Process 7 running on aramis
>>>>>>> Process 8 running on aramis
>>>>>>> Process 4 running on aramis
>>>>>>> Process 5 running on aramis
>>>>>>> Process 9 running on aramis
>>>>>>> Process 10 running on aramis
>>>>>>> Process 11 running on aramis
>>>>>>> pi is approximately 3.1415926535898762, Error is 0.0000000000000830
>>>>>>> wall clock time = 0.058131
>>>>>>> Writing logfile....
>>>>>>> Enabling the Default clock synchronization...
>>>>>>> clog_merger.c:CLOG_Merger_init() -
>>>>>>>    Could not open file ./cpilog.clog2 for merging!
>>>>>>> Backtrace of the callstack at rank 0:
>>>>>>>    At [0]: ./cpilog(CLOG_Util_abort+0x92)[0x456326]
>>>>>>>    At [1]: ./cpilog(CLOG_Merger_init+0x11f)[0x45db7c]
>>>>>>>    At [2]: ./cpilog(CLOG_Converge_init+0x8e)[0x45a691]
>>>>>>>    At [3]: ./cpilog(MPE_Finish_log+0xea)[0x4560aa]
>>>>>>>    At [4]: ./cpilog(MPI_Finalize+0x50c)[0x4268af]
>>>>>>>    At [5]: ./cpilog(main+0x428)[0x415963]
>>>>>>>    At [6]:
>> /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c1881d994]
>>>>>>>    At [7]: ./cpilog[0x415449]
>>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0 
>>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>>> 
>>>>>>> So it looks like it works, with some issues.
>>>>>>> 
>>>>>>> When does it fail? Immediately.
>>>>>>> 
>>>>>>> Is there a bug? Many successfully use the application (MCNP5, from
>>>>>>> LANL) with MPI, so I think that a bug there is unlikely.
>>>>>>> 
>>>>>>> Core files, unfortunately, reveal some ignorance on my part. Where
>>>>>>> exactly should I be looking for them?
>>>>>>> 
>>>>>>> Thanks again,
>>>>>>> 
>>>>>>> Dave
>>>>>>> -----Original Message-----
>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius 
>>>>>>> Buntinas
>>>>>>> Sent: Wednesday, August 04, 2010 12:19 PM
>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>>>> 
>>>>>>> 
>>>>>>> This error message says that two processes terminated because they
>>>>>>> were unable to communicate with another (or two other) process. It's
>>>>>>> possible that another process died, so the others got errors trying
>>>>>>> to communicate with it. It's also possible that there is something
>>>>>>> preventing some processes from communicating with each other.
>>>>>>> 
>>>>>>> Are you able to run cpi from the examples directory with 12
>>>>>>> processes?
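>>>>>>> 
>>>>>>> For example, using the same nodes file you use for mcnp (assuming it
>>>>>>> is in the current directory, along with the cpi binary):
>>>>>>> 
>>>>>>>   mpiexec -f nodes -n 12 ./cpi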
>>>>>>> 
>>>>>>> At what point in your code does this fail? Are there any other
>>>>>>> communication operations before the MPI_Comm_dup?
>>>>>>> 
>>>>>>> Enable core files (add "ulimit -c unlimited" to your .bashrc or
>>>>>>> .tcshrc), then run your app and look for core files. If there is a
>>>>>>> bug in your application that causes a process to die, this might
>>>>>>> tell you which one and why.
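>>>>>>> 
>>>>>>> For example, with bash (the core file name and location vary by
>>>>>>> system, so treat this as a rough sketch):
>>>>>>> 
>>>>>>>   ulimit -c unlimited            # or add it to ~/.bashrc and re-login
>>>>>>>   mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>>>>>>   ls core*                       # check the working directory on each node
>>>>>>>   gdb ./mcnp5.mpi core.12345     # 12345 is a made-up pid; "bt" prints the backtrace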
>>>>>>> 
>>>>>>> Let us know how this goes.
>>>>>>> 
>>>>>>> -d
>>>>>>> 
>>>>>>> 
>>>>>>> On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:
>>>>>>> 
>>>>>>>> Since I have had no responses, is there any other additional
>>>>>>>> information I could provide to solicit some direction for overcoming
>>>>>>>> this latest string of MPI errors?
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Dave
>>>>>>>> 
>>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN
>>>>>>>> David F (AREVA NP INC)
>>>>>>>> Sent: Friday, July 23, 2010 4:29 PM
>>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>>> Subject: [mpich-discuss] cryptic (to me) error
>>>>>>>> 
>>>>>>>> With my firewall issues firmly behind me, I have a new problem for
>>>>>>>> the collective wisdom. I am attempting to run a program, to which the
>>>>>>>> response is as follows:
>>>>>>>> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>>>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD,
>>>>>>>> new_comm=0x7fff58edb450) failed
>>>>>>>> MPIR_Comm_copy(923)...............:
>>>>>>>> MPIR_Get_contextid(639)...........:
>>>>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE,
>>>>>>>> rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>>>>> MPIR_Allreduce(228)...............:
>>>>>>>> MPIC_Send(41).....................:
>>>>>>>> MPIC_Wait(513)....................:
>>>>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>>>>> MPID_nem_mpich2_blocking_recv(933):
>>>>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD,
>>>>>>>> new_comm=0x7fff97dca620) failed
>>>>>>>> MPIR_Comm_copy(923)...............:
>>>>>>>> MPIR_Get_contextid(639)...........:
>>>>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE,
>>>>>>>> rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>>>>> MPIR_Allreduce(289)...............:
>>>>>>>> MPIC_Sendrecv(161)................:
>>>>>>>> MPIC_Wait(513)....................:
>>>>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>>>>> MPID_nem_mpich2_blocking_recv(948):
>>>>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>>>>> Killed by signal 2.
>>>>>>>> Ctrl-C caught... cleaning up processes
>>>>>>>> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142): could not find fd to deregister: -2
>>>>>>>> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
>>>>>>>> [press Ctrl-C again to force abort]
>>>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
>>>>>>>> [mcnp5_1-4 at athos ~]$
>>>>>>>> 
>>>>>>>> Any ideas?
>>>>>>>> 
>>>>>>>> Thanks in advance,
>>>>>>>> 
>>>>>>>> David Sullivan
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> AREVA NP INC
>>>>>>>> 400 Donald Lynch Boulevard
>>>>>>>> Marlborough, MA, 01752
>>>>>>>> Phone: (508) 573-6721
>>>>>>>> Fax: (434) 382-5597
>>>>>>>> David.Sullivan at AREVA.com
>>>>>>>> 
> <summary.xml><error.txt>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


