[mpich-discuss] cryptic (to me) error
SULLIVAN David (AREVA)
David.Sullivan at areva.com
Thu Sep 2 09:32:21 CDT 2010
I saw that there was a newer beta. I was really hoping to find I just
configured something incorrectly. Will this not require me to re-build
mcnp (the only program I run that uses mpi for parallel) if I change the
mpi version? If so, this is a bit of a hardship, requiring codes to be
revalidated. If not, I will try it in a second.
Thanks,
Dave
-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
Sent: Thursday, September 02, 2010 10:27 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] cryptic (to me) error
Can you try the latest release (1.3b1) to see if that fixes the problems
you are seeing with your application?
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
-Dave
On Sep 2, 2010, at 9:15 AM CDT, SULLIVAN David (AREVA) wrote:
> Another output file, hopefully of use.
>
> Thanks again
>
> Dave
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN David
> (AREVA)
> Sent: Thursday, September 02, 2010 8:20 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
>
> First my apologies for the delay in continuing this thread.
> Unfortunately I have not resolved it so if I can indulge the gurus and
> developers once again...
>
> As suggested by Rajeev, I ran the test suite in the source directory.
> The error output, which is similar to what I was seeing when I ran
> mcnp5 (v. 1.40 and 1.51), is attached.
>
> Any insights would be greatly appreciated,
>
> Dave
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
> Sent: Wednesday, August 04, 2010 3:06 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] cryptic (to me) error
>
> Then one level above that directory (in the main MPICH2 source
> directory), type make testing, which will run through the entire
> MPICH2 test suite.
>
> Rajeev
>
> On Aug 4, 2010, at 2:04 PM, SULLIVAN David (AREVA) wrote:
>
>> Oh. That's embarrassing. Yea, I have those examples. It runs with no
>> problems:
>>
>> [dfs at aramis examples]$ mpiexec -host aramis -n 4 ./cpi
>> Process 2 of 4 is on aramis
>> Process 3 of 4 is on aramis
>> Process 0 of 4 is on aramis
>> Process 1 of 4 is on aramis
>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> wall clock time = 0.000652
>>
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
>> Sent: Wednesday, August 04, 2010 1:13 PM
>> To: Mpich Discuss
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>
>> Hi David
>>
>> I think the "examples" dir is not copied to the installation directory.
>> You may find it where you decompressed the MPICH2 tarball, in case you
>> installed it from source.
>> At least, this is what I have here.
>>
>> Gus Correa
>>
>>
>> SULLIVAN David (AREVA) wrote:
>>> Yea, that always bothered me. There is no such folder.
>>> There are :
>>> bin
>>> etc
>>> include
>>> lib
>>> sbin
>>> share
>>>
>>> The only examples I found were in the share folder, where there are
>>> examples for collchk, graphics, and logging.
>>>
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev
>>> Thakur
>>> Sent: Wednesday, August 04, 2010 12:45 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>
>>> Not cpilog. Can you run just cpi from the mpich2/examples directory?
>>>
>>> Rajeev
>>>
>>>
>>> On Aug 4, 2010, at 11:37 AM, SULLIVAN David (AREVA) wrote:
>>>
>>>> Rajeev, Darius,
>>>>
>>>> Thanks for your response.
>>>> cpi yields the following:
>>>>
>>>> [dfs at aramis examples_logging]$ mpiexec -host aramis -n 12 ./cpilog
>>>> Process 0 running on aramis
>>>> Process 2 running on aramis
>>>> Process 3 running on aramis
>>>> Process 1 running on aramis
>>>> Process 6 running on aramis
>>>> Process 7 running on aramis
>>>> Process 8 running on aramis
>>>> Process 4 running on aramis
>>>> Process 5 running on aramis
>>>> Process 9 running on aramis
>>>> Process 10 running on aramis
>>>> Process 11 running on aramis
>>>> pi is approximately 3.1415926535898762, Error is 0.0000000000000830
>>>> wall clock time = 0.058131
>>>> Writing logfile....
>>>> Enabling the Default clock synchronization...
>>>> clog_merger.c:CLOG_Merger_init() -
>>>> Could not open file ./cpilog.clog2 for merging!
>>>> Backtrace of the callstack at rank 0:
>>>> At [0]: ./cpilog(CLOG_Util_abort+0x92)[0x456326]
>>>> At [1]: ./cpilog(CLOG_Merger_init+0x11f)[0x45db7c]
>>>> At [2]: ./cpilog(CLOG_Converge_init+0x8e)[0x45a691]
>>>> At [3]: ./cpilog(MPE_Finish_log+0xea)[0x4560aa]
>>>> At [4]: ./cpilog(MPI_Finalize+0x50c)[0x4268af]
>>>> At [5]: ./cpilog(main+0x428)[0x415963]
>>>> At [6]: /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c1881d994]
>>>> At [7]: ./cpilog[0x415449]
>>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>
>>>> So it looks like it works with some issues.
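[Editor's note: one possible cause of the CLOG_Merger_init failure quoted above is that MPE cannot create ./cpilog.clog2 because the current working directory is not writable. This is an assumption, not something confirmed in the thread; a minimal check before re-running:]

```shell
# Assumption: MPE's log merger creates ./cpilog.clog2 in the current
# working directory, so "Could not open file ./cpilog.clog2" may simply
# mean the cwd is not writable by the rank-0 process.
if [ -w . ]; then
    echo "cwd is writable"
else
    echo "cwd is NOT writable; cd to a writable directory and re-run"
fi
```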
>>>>
>>>> When does it fail? Immediately
>>>>
>>>> Is there a bug? Many successfully use the application (MCNP5, from
>>>> LANL) with MPI, so I think a bug there is unlikely.
>>>>
>>>> Core files, unfortunately, reveal some ignorance on my part. Where
>>>> exactly should I be looking for them?
>>>>
>>>> Thanks again,
>>>>
>>>> Dave
>>>> -----Original Message-----
>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius
>>>> Buntinas
>>>> Sent: Wednesday, August 04, 2010 12:19 PM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>>>
>>>>
>>>> This error message says that two processes terminated because they
>>>> were unable to communicate with another (or two other) process. It's
>>>> possible that another process died, so the others got errors trying
>>>> to communicate with it. It's also possible that something is
>>>> preventing some processes from communicating with each other.
>>>>
>>>> Are you able to run cpi from the examples directory with 12 processes?
>>>>
>>>> At what point in your code does this fail? Are there any other
>>>> communication operations before the MPI_Comm_dup?
>>>>
>>>> Enable core files (add "ulimit -c unlimited" to your .bashrc or
>>>> .tcshrc), then run your app and look for core files. If there is a
>>>> bug in your application that causes a process to die, this might
>>>> tell you which one and why.
>>>>
>>>> Let us know how this goes.
>>>>
>>>> -d
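[Editor's note: the core-file steps quoted above can be sketched as follows; the binary name and core filename are illustrative, not from this thread:]

```shell
# Enable core dumps in the current shell (persist it via .bashrc/.tcshrc
# as Darius suggests, so remote MPI processes get it too).
ulimit -c unlimited
ulimit -c   # prints "unlimited" if the limit took effect
# After a crash, look for files named "core" or "core.<pid>" in each
# process's working directory, then inspect one with gdb, e.g.:
#   gdb ./mcnp5.mpi core.12345
#   (gdb) bt    # backtrace showing where that process died
```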
>>>>
>>>>
>>>> On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:
>>>>
>>>>> Since I have had no responses, is there any additional information
>>>>> I could provide to solicit some direction for overcoming this
>>>>> latest string of MPI errors?
>>>>> Thanks,
>>>>>
>>>>> Dave
>>>>>
>>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN
>>>>> David F (AREVA NP INC)
>>>>> Sent: Friday, July 23, 2010 4:29 PM
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Subject: [mpich-discuss] cryptic (to me) error
>>>>>
>>>>> With my firewall issues firmly behind me, I have a new problem for
>>>>> the collective wisdom. I am attempting to run a program, and the
>>>>> response is as follows:
>>>>>
>>>>> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD,
>>>>> new_comm=0x7fff58edb450) failed
>>>>> MPIR_Comm_copy(923)...............:
>>>>> MPIR_Get_contextid(639)...........:
>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE,
>>>>> rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>> MPIR_Allreduce(228)...............:
>>>>> MPIC_Send(41).....................:
>>>>> MPIC_Wait(513)....................:
>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>> MPID_nem_mpich2_blocking_recv(933):
>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD,
>>>>> new_comm=0x7fff97dca620) failed
>>>>> MPIR_Comm_copy(923)...............:
>>>>> MPIR_Get_contextid(639)...........:
>>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE,
>>>>> rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>>> MPIR_Allreduce(289)...............:
>>>>> MPIC_Sendrecv(161)................:
>>>>> MPIC_Wait(513)....................:
>>>>> MPIDI_CH3I_Progress(150)..........:
>>>>> MPID_nem_mpich2_blocking_recv(948):
>>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>>> Killed by signal 2.
>>>>> Ctrl-C caught... cleaning up processes
>>>>> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142):
>>>>> could not find fd to deregister: -2
>>>>> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup
>>>>> (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
>>>>> [press Ctrl-C again to force abort]
>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
>>>>> [mcnp5_1-4 at athos ~]$
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> David Sullivan
>>>>>
>>>>>
>>>>>
>>>>> AREVA NP INC
>>>>> 400 Donald Lynch Boulevard
>>>>> Marlborough, MA, 01752
>>>>> Phone: (508) 573-6721
>>>>> Fax: (434) 382-5597
>>>>> David.Sullivan at AREVA.com
>>>>>
>>>>> _______________________________________________
>>>>> mpich-discuss mailing list
>>>>> mpich-discuss at mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>
>>
>
> <summary.xml>