[mpich-discuss] cryptic (to me) error

SULLIVAN David (AREVA) David.Sullivan at areva.com
Fri Sep 3 05:00:51 CDT 2010


I was wondering about that. Is there a configuration file that sets up the cluster and defines which node to run on? Would that make the issue any clearer?
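
The "nodes" file passed to mpiexec with -f later in this thread is exactly that kind of file: a plain-text machinefile listing the hosts to run on. A minimal sketch, with hostnames and counts assumed purely for illustration:

  aramis:4
  athos:8

Each line names a host, optionally followed by :N for how many processes to place there; mpiexec fills the requested ranks from this list, which is what determines the machines the job actually lands on.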

Thanks,

Dave


-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov on behalf of Rajeev Thakur
Sent: Thu 9/2/2010 10:22 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] cryptic (to me) error
 
There might be some connection issues between the two machines. The MPICH2 test suite that you ran with "make testing" probably ran on a single machine.
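
A quick way to check that, independent of MCNP, is to run the bundled cpi example across both hosts. A minimal sketch, with hostnames and paths assumed:

  # run from the MPICH2 build tree, present at the same path on both machines
  echo aramis >  nodes
  echo athos  >> nodes
  mpiexec -f nodes -n 2 ./examples/cpi

If that hangs or fails with the same "Communication error", the problem is host-to-host connectivity (firewalls, hostname resolution) rather than the application.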

On Sep 2, 2010, at 6:27 PM, SULLIVAN David (AREVA) wrote:

> The error occurs immediately; I don't think it even starts the executable. It does work on a single machine with 4 processes.
> 
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov on behalf of Rajeev Thakur
> Sent: Thu 9/2/2010 4:34 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
> 
> Does it run with 2 processes on a single machine?
> 
> 
> On Sep 2, 2010, at 2:38 PM, SULLIVAN David (AREVA) wrote:
> 
>> That fixed the compile. Thanks!
>> 
>> The latest release does not fix the issues I am having, though. cpi works
>> fine, and the test suite is certainly improved (see summary.xml output),
>> but when I try to use mcnp it still crashes in the same way (see
>> error.txt).
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Anthony Chan
>> Sent: Thursday, September 02, 2010 1:38 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>> 
>> 
>> There is a bug in 1.3b1 with the option --enable-fc.  Since Fortran 90
>> is enabled by default, remove --enable-fc from your configure
>> command and try again.  If there is an error again, send us the configure
>> output as you see it on your screen (see the README) instead of config.log.
>> 
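>> For example, a re-run without that flag might look like this (install
>> prefix assumed; the tee files are just one way to capture the screen
>> output being asked for):
>>
>>   ./configure --prefix=/home/dfs/mpich2-install 2>&1 | tee configure.out
>>   make 2>&1 | tee make.out
>>   make install 2>&1 | tee install.out
>>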
>> A.Chan
>> 
>> ----- "SULLIVAN David (AREVA)" <David.Sullivan at areva.com> wrote:
>> 
>>> Failure again.
>>> The 1.3 beta version will not compile with Intel 10.1. It bombs at the
>>> configuration script:
>>> 
>>> checking for Fortran flag needed to allow free-form source... unknown
>>> configure: WARNING: Fortran 90 test being disabled because the 
>>> /home/dfs/mpich2-1.3b1/bin/mpif90 compiler does not accept a .f90 
>>> extension
>>> configure: error: Fortran does not accept free-form source
>>> configure: error: ./configure failed for test/mpi
>>> 
>>> I have attached the config.log.
>>> 
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
>>> Sent: Thursday, September 02, 2010 12:11 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>> 
>>> Just try relinking with the new library at first.
>>> 
>>> Rajeev
>>> 
>>> On Sep 2, 2010, at 9:32 AM, SULLIVAN David (AREVA) wrote:
>>> 
>>>> I saw that there was a newer beta. I was really hoping to find I just
>>>> configured something incorrectly. Will this not require me to re-build
>>>> mcnp (the only program I run that uses MPI for parallel) if I change
>>>> the MPI version? If so, this is a bit of a hardship, requiring codes
>>>> to be revalidated. If not, I will try it in a second.
>>>> 
>>>> Thanks,
>>>> 
>>>> Dave
>>>> 
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave
>>> Goodell
>>>> Sent: Thursday, September 02, 2010 10:27 AM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>> 
>>>> Can you try the latest release (1.3b1) to see if that fixes the 
>>>> problems you are seeing with your application?
>>>> 
>>>> 
>>>> http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
>>>> 
>>>> -Dave
>>>> 
>>>> On Sep 2, 2010, at 9:15 AM CDT, SULLIVAN David (AREVA) wrote:
>>>> 
>>>>> Another output file, hopefully of use. 
>>>>> 
>>>>> Thanks again
>>>>> 
>>>>> Dave
>>>>> 
>>>>> -----Original Message-----
>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN 
>>>>> David
>>>>> (AREVA)
>>>>> Sent: Thursday, September 02, 2010 8:20 AM
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>> 
>>>>> First, my apologies for the delay in continuing this thread.
>>>>> Unfortunately I have not resolved it, so if I can indulge the gurus and
>>>>> developers once again...
>>>>>
>>>>> As suggested by Rajeev, I ran the testing suite in the source directory.
>>>>> The output of errors, which are similar to what I was seeing when I ran
>>>>> mcnp5 (v. 1.40 and 1.51), is attached.
>>>>> 
>>>>> Any insights would be greatly appreciated,
>>>>> 
>>>>> Dave
>>>>> 
>>>>> -----Original Message-----
>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev
>>> Thakur
>>>>> Sent: Wednesday, August 04, 2010 3:06 PM
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>> 
>>>>> Then one level above that directory (in the main MPICH2 source 
>>>>> directory), type make testing, which will run through the entire
>>>>> MPICH2 test suite.
>>>>> 
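>>>>> A sketch of those steps (source path assumed for illustration):
>>>>>
>>>>>   cd /path/to/mpich2-source    # the directory you built MPICH2 in
>>>>>   make testing
>>>>>
>>>>> The results end up in a summary.xml under the test directory
>>>>> (typically test/mpi/summary.xml).
>>>>>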
>>>>> Rajeev
>>>>> 
>>>>> On Aug 4, 2010, at 2:04 PM, SULLIVAN David (AREVA) wrote:
>>>>> 
>>>>>> Oh. That's embarrassing. Yea. I have those examples. It runs with no
>>>>>> problems:
>>>>>>
>>>>>> [dfs at aramis examples]$ mpiexec -host aramis -n 4 ./cpi
>>>>>> Process 2 of 4 is on aramis
>>>>>> Process 3 of 4 is on aramis
>>>>>> Process 0 of 4 is on aramis
>>>>>> Process 1 of 4 is on aramis
>>>>>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>>>>>> wall clock time = 0.000652
>>>>>> 
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus
>>> Correa
>>>>>> Sent: Wednesday, August 04, 2010 1:13 PM
>>>>>> To: Mpich Discuss
>>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>>> 
>>>>>> Hi David
>>>>>> 
>>>>>> I think the "examples" dir is not copied to the installation directory.
>>>>>> You may find it where you decompressed the MPICH2 tarball, in case you
>>>>>> installed it from source.
>>>>>> At least, this is what I have here.
>>>>>> 
>>>>>> Gus Correa
>>>>>> 
>>>>>> 
>>>>>> SULLIVAN David (AREVA) wrote:
>>>>>>> Yea, that always bothered me.  There is no such folder.
>>>>>>> There are :
>>>>>>> bin
>>>>>>> etc
>>>>>>> include
>>>>>>> lib
>>>>>>> sbin
>>>>>>> share
>>>>>>> 
>>>>>>> The only examples I found were in the share folder, where there are
>>>>>>> examples for collchk, graphics and logging.
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev 
>>>>>>> Thakur
>>>>>>> Sent: Wednesday, August 04, 2010 12:45 PM
>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>>>> 
>>>>>>> Not cpilog. Can you run just cpi from the mpich2/examples directory?
>>>>>>> 
>>>>>>> Rajeev
>>>>>>> 
>>>>>>> 
>>>>>>> On Aug 4, 2010, at 11:37 AM, SULLIVAN David (AREVA) wrote:
>>>>>>> 
>>>>>>>> Rajeev,  Darius,
>>>>>>>> 
>>>>>>>> Thanks for your response.
>>>>>>>> cpi yields the following -
>>>>>>>>
>>>>>>>> [dfs at aramis examples_logging]$ mpiexec -host aramis -n 12 ./cpilog
>>>>>>>>
>>>>>>>> Process 0 running on aramis
>>>>>>>> Process 2 running on aramis
>>>>>>>> Process 3 running on aramis
>>>>>>>> Process 1 running on aramis
>>>>>>>> Process 6 running on aramis
>>>>>>>> Process 7 running on aramis
>>>>>>>> Process 8 running on aramis
>>>>>>>> Process 4 running on aramis
>>>>>>>> Process 5 running on aramis
>>>>>>>> Process 9 running on aramis
>>>>>>>> Process 10 running on aramis
>>>>>>>> Process 11 running on aramis
>>>>>>>> pi is approximately 3.1415926535898762, Error is 0.0000000000000830
>>>>>>>> wall clock time = 0.058131
>>>>>>>> Writing logfile....
>>>>>>>> Enabling the Default clock synchronization...
>>>>>>>> clog_merger.c:CLOG_Merger_init() -
>>>>>>>>   Could not open file ./cpilog.clog2 for merging!
>>>>>>>> Backtrace of the callstack at rank 0:
>>>>>>>>   At [0]: ./cpilog(CLOG_Util_abort+0x92)[0x456326]
>>>>>>>>   At [1]: ./cpilog(CLOG_Merger_init+0x11f)[0x45db7c]
>>>>>>>>   At [2]: ./cpilog(CLOG_Converge_init+0x8e)[0x45a691]
>>>>>>>>   At [3]: ./cpilog(MPE_Finish_log+0xea)[0x4560aa]
>>>>>>>>   At [4]: ./cpilog(MPI_Finalize+0x50c)[0x4268af]
>>>>>>>>   At [5]: ./cpilog(main+0x428)[0x415963]
>>>>>>>>   At [6]: /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c1881d994]
>>>>>>>>   At [7]: ./cpilog[0x415449]
>>>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>>>> 
>>>>>>>> So it looks like it works, with some issues.
>>>>>>>>
>>>>>>>> When does it fail? Immediately.
>>>>>>>>
>>>>>>>> Is there a bug? Many successfully use the application (MCNP5, from
>>>>>>>> LANL) with MPI, so I think a bug there is unlikely.
>>>>>>>>
>>>>>>>> Core files, unfortunately, reveal some ignorance on my part. Where
>>>>>>>> exactly should I be looking for them?
>>>>>>>> 
>>>>>>>> Thanks again,
>>>>>>>> 
>>>>>>>> Dave
>>>>>>>> -----Original Message-----
>>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius 
>>>>>>>> Buntinas
>>>>>>>> Sent: Wednesday, August 04, 2010 12:19 PM
>>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>>>>> 
>>>>>>>> 
>>>>>>>> This error message says that two processes terminated because they
>>>>>>>> were unable to communicate with another (or two other) process. It's
>>>>>>>> possible that another process died, so the others got errors trying
>>>>>>>> to communicate with them.  It's also possible that there is something
>>>>>>>> preventing some processes from communicating with each other.
>>>>>>>>
>>>>>>>> Are you able to run cpi from the examples directory with 12 processes?
>>>>>>>>
>>>>>>>> At what point in your code does this fail?  Are there any other
>>>>>>>> communication operations before the MPI_Comm_dup?
>>>>>>>>
>>>>>>>> Enable core files (add "ulimit -c unlimited" to your .bashrc or
>>>>>>>> .tcshrc) then run your app and look for core files.  If there is a
>>>>>>>> bug in your application that causes a process to die this might tell
>>>>>>>> you which one and why.
>>>>>>>> 
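>>>>>>>> A sketch of that workflow, assuming bash and that gdb is installed
>>>>>>>> (the core file name below is a placeholder; it depends on
>>>>>>>> /proc/sys/kernel/core_pattern):
>>>>>>>>
>>>>>>>>   ulimit -c unlimited                # or add to .bashrc
>>>>>>>>   mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>>>>>>>   ls core*                           # check the working dir on each node
>>>>>>>>   gdb ./mcnp5.mpi core.<pid>         # then "bt" for a backtrace
>>>>>>>>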
>>>>>>>> Let us know how this goes.
>>>>>>>> 
>>>>>>>> -d
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:
>>>>>>>> 
>>>>>>>>> Since I have had no responses, is there any other additional
>>>>>>>>> information I could provide to solicit some direction for overcoming
>>>>>>>>> this latest string of MPI errors?
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Dave
>>>>>>>>> 
>>>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
>>> SULLIVAN
>>>>>>>>> David F (AREVA NP INC)
>>>>>>>>> Sent: Friday, July 23, 2010 4:29 PM
>>>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>>>> Subject: [mpich-discuss] cryptic (to me) error
>>>>>>>>> 
>>>>>>>>> With my firewall issues firmly behind me, I have a new problem for
>>>>>>>>> the collective wisdom. I am attempting to run a program to which the
>>>>>>>>> response is as follows:
>>>>>>>>> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>>>>>>>>
>>>>>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed
>>>>>>>>> MPIR_Comm_copy(923)...............:
>>>>>>>>> MPIR_Get_contextid(639)...........:
>>>>>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>>>>>> MPIR_Allreduce(228)...............:
>>>>>>>>> MPIC_Send(41).....................:
>>>>>>>>> MPIC_Wait(513)....................:
>>>>>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>>>>>> MPID_nem_mpich2_blocking_recv(933):
>>>>>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>>>>>>
>>>>>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff97dca620) failed
>>>>>>>>> MPIR_Comm_copy(923)...............:
>>>>>>>>> MPIR_Get_contextid(639)...........:
>>>>>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>>>>>> MPIR_Allreduce(289)...............:
>>>>>>>>> MPIC_Sendrecv(161)................:
>>>>>>>>> MPIC_Wait(513)....................:
>>>>>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>>>>>> MPID_nem_mpich2_blocking_recv(948):
>>>>>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>>>>>>
>>>>>>>>> Killed by signal 2.
>>>>>>>>> Ctrl-C caught... cleaning up processes
>>>>>>>>> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142): could not find fd to deregister: -2
>>>>>>>>> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
>>>>>>>>> [press Ctrl-C again to force abort]
>>>>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
>>>>>>>>> [mcnp5_1-4 at athos ~]$
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>> 
>>>>>>>>> Thanks in advance,
>>>>>>>>> 
>>>>>>>>> David Sullivan
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> AREVA NP INC
>>>>>>>>> 400 Donald Lynch Boulevard
>>>>>>>>> Marlborough, MA, 01752
>>>>>>>>> Phone: (508) 573-6721
>>>>>>>>> Fax: (434) 382-5597
>>>>>>>>> David.Sullivan at AREVA.com
>>>>>>>>> 

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
