[mpich-discuss] cryptic (to me) error

SULLIVAN David (AREVA) David.Sullivan at areva.com
Thu Sep 2 07:20:05 CDT 2010


First, my apologies for the delay in continuing this thread.
Unfortunately I have not resolved the problem, so if I may impose on the
gurus and developers once again...

As suggested by Rajeev, I ran the test suite in the source directory.
The error output, which is similar to what I was seeing when I ran
mcnp5 (v. 1.40 and 1.51), is attached.

Any insights would be greatly appreciated,

Dave

-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
Sent: Wednesday, August 04, 2010 3:06 PM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] cryptic (to me) error

Then one level above that directory (in the main MPICH2 source
directory), type make testing, which will run through the entire MPICH2
test suite.

Rajeev

On Aug 4, 2010, at 2:04 PM, SULLIVAN David (AREVA) wrote:

> Oh. That's embarrassing. Yeah, I have those examples. It runs with no
> problems:
> 
> [dfs at aramis examples]$ mpiexec -host aramis -n 4 ./cpi
> Process 2 of 4 is on aramis
> Process 3 of 4 is on aramis
> Process 0 of 4 is on aramis
> Process 1 of 4 is on aramis
> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
> wall clock time = 0.000652
> 
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Gus Correa
> Sent: Wednesday, August 04, 2010 1:13 PM
> To: Mpich Discuss
> Subject: Re: [mpich-discuss] cryptic (to me) error
> 
> Hi David
> 
> I think the "examples" dir is not copied to the installation directory.
> You may find it where you decompressed the MPICH2 tarball, in case you
> installed it from source.
> At least, this is what I have here.
> 
> Gus Correa
> 
> 
> SULLIVAN David (AREVA) wrote:
>> Yeah, that always bothered me. There is no such folder.
>> There are:
>> bin
>> etc
>> include
>> lib
>> sbin
>> share
>> 
>> The only examples I found were in the share folder, where there are
>> examples for collchk, graphics and logging.
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov 
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
>> Sent: Wednesday, August 04, 2010 12:45 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] cryptic (to me) error
>> 
>> Not cpilog. Can you run just cpi from the mpich2/examples directory?
>> 
>> Rajeev
>> 
>> 
>> On Aug 4, 2010, at 11:37 AM, SULLIVAN David (AREVA) wrote:
>> 
>>> Rajeev, Darius,
>>> 
>>> Thanks for your response.
>>> cpi yields the following:
>>> 
>>> [dfs at aramis examples_logging]$ mpiexec -host aramis -n 12 ./cpilog
>>> Process 0 running on aramis
>>> Process 2 running on aramis
>>> Process 3 running on aramis
>>> Process 1 running on aramis
>>> Process 6 running on aramis
>>> Process 7 running on aramis
>>> Process 8 running on aramis
>>> Process 4 running on aramis
>>> Process 5 running on aramis
>>> Process 9 running on aramis
>>> Process 10 running on aramis
>>> Process 11 running on aramis
>>> pi is approximately 3.1415926535898762, Error is 0.0000000000000830
>>> wall clock time = 0.058131
>>> Writing logfile....
>>> Enabling the Default clock synchronization...
>>> clog_merger.c:CLOG_Merger_init() -
>>>       Could not open file ./cpilog.clog2 for merging!
>>> Backtrace of the callstack at rank 0:
>>>       At [0]: ./cpilog(CLOG_Util_abort+0x92)[0x456326]
>>>       At [1]: ./cpilog(CLOG_Merger_init+0x11f)[0x45db7c]
>>>       At [2]: ./cpilog(CLOG_Converge_init+0x8e)[0x45a691]
>>>       At [3]: ./cpilog(MPE_Finish_log+0xea)[0x4560aa]
>>>       At [4]: ./cpilog(MPI_Finalize+0x50c)[0x4268af]
>>>       At [5]: ./cpilog(main+0x428)[0x415963]
>>>       At [6]: /lib64/libc.so.6(__libc_start_main+0xf4)[0x3c1881d994]
>>>       At [7]: ./cpilog[0x415449]
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0 
>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>> 
>>> So it looks like it works, with some issues.
>>> 
>>> When does it fail? Immediately.
>>> 
>>> Is there a bug? Many people successfully use the application (MCNP5, from
>>> LANL) with MPI, so I think that a bug there is unlikely.
>>> 
>>> Core files: unfortunately, this reveals some ignorance on my part. Where
>>> exactly should I be looking for them?
>>> 
>>> Thanks again,
>>> 
>>> Dave
>>> -----Original Message-----
>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Darius 
>>> Buntinas
>>> Sent: Wednesday, August 04, 2010 12:19 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: Re: [mpich-discuss] cryptic (to me) error
>>> 
>>> 
>>> This error message says that two processes terminated because they
>>> were unable to communicate with another (or two other) process. It's
>>> possible that another process died, so the others got errors trying to
>>> communicate with it. It's also possible that there is something
>>> preventing some processes from communicating with each other.
>>> 
>>> Are you able to run cpi from the examples directory with 12 processes?
>>> 
>>> At what point in your code does this fail? Are there any other
>>> communication operations before the MPI_Comm_dup?
>>> 
>>> Enable core files (add "ulimit -c unlimited" to your .bashrc or
>>> .tcshrc), then run your app and look for core files. If there is a bug
>>> in your application that causes a process to die, this might tell you
>>> which one and why.
>>> 
>>> Let us know how this goes.
>>> 
>>> -d
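
On the question of where core files end up: on Linux they are normally
written to the working directory of the process that crashed, under a name
such as core or core.<pid>, unless /proc/sys/kernel/core_pattern says
otherwise. The limit also has to be in effect in the shell that actually
launches the MPI processes on each node (under tcsh the equivalent of
"ulimit -c unlimited" is "limit coredumpsize unlimited"). A minimal Python
sketch for checking both, assuming Python 3 on a Unix-like node; the helper
names core_dump_limit and find_core_files are made up for this illustration
and are not part of MPICH2:

```python
import glob
import os
import resource


def core_dump_limit():
    """Return this process's soft core-file size limit in bytes.

    None means "unlimited"; 0 means core dumps are disabled.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return None if soft == resource.RLIM_INFINITY else soft


def find_core_files(directory="."):
    """List regular files in `directory` whose names start with 'core'."""
    pattern = os.path.join(directory, "core*")
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


if __name__ == "__main__":
    limit = core_dump_limit()
    if limit == 0:
        print("core dumps are disabled (soft limit is 0)")
    else:
        print("core size limit:", "unlimited" if limit is None else limit)
    for path in find_core_files():
        print("possible core file:", path)
```

Run it from the directory the application was started in, in a shell set up
the same way as the one used for mpiexec. The name match is deliberately
loose, so check any hit with "file <name>" before loading it in a debugger.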
>>> 
>>> 
>>> On Aug 4, 2010, at 11:03 AM, SULLIVAN David (AREVA) wrote:
>>> 
>>>> Since I have had no responses, is there any additional information
>>>> I could provide to solicit some direction for overcoming this latest
>>>> string of MPI errors?
>>>> Thanks,
>>>> 
>>>> Dave
>>>> 
>>>> From: mpich-discuss-bounces at mcs.anl.gov 
>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of SULLIVAN 
>>>> David F (AREVA NP INC)
>>>> Sent: Friday, July 23, 2010 4:29 PM
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Subject: [mpich-discuss] cryptic (to me) error
>>>> 
>>>> With my firewall issues firmly behind me, I have a new problem for the
>>>> collective wisdom. I am attempting to run a program to which the
>>>> response is as follows:
>>>> 
>>>> [mcnp5_1-4 at athos ~]$ mpiexec -f nodes -n 12 mcnp5.mpi i=TN04 o=TN04.o
>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed
>>>> MPIR_Comm_copy(923)...............:
>>>> MPIR_Get_contextid(639)...........:
>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>> MPIR_Allreduce(228)...............:
>>>> MPIC_Send(41).....................:
>>>> MPIC_Wait(513)....................:
>>>> MPIDI_CH3I_Progress(150)..........:
>>>> MPID_nem_mpich2_blocking_recv(933):
>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>> Fatal error in MPI_Comm_dup: Other MPI error, error stack:
>>>> MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff97dca620) failed
>>>> MPIR_Comm_copy(923)...............:
>>>> MPIR_Get_contextid(639)...........:
>>>> MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff97dca370, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
>>>> MPIR_Allreduce(289)...............:
>>>> MPIC_Sendrecv(161)................:
>>>> MPIC_Wait(513)....................:
>>>> MPIDI_CH3I_Progress(150)..........:
>>>> MPID_nem_mpich2_blocking_recv(948):
>>>> MPID_nem_tcp_connpoll(1709).......: Communication error
>>>> Killed by signal 2.
>>>> Ctrl-C caught... cleaning up processes
>>>> [mpiexec at athos] HYDT_dmx_deregister_fd (./tools/demux/demux.c:142): could not find fd to deregister: -2
>>>> [mpiexec at athos] HYD_pmcd_pmiserv_cleanup (./pm/pmiserv/pmiserv_cb.c:401): error deregistering fd
>>>> [press Ctrl-C again to force abort]
>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
>>>> [mcnp5_1-4 at athos ~]$
>>>> 
>>>> Any ideas?
>>>> 
>>>> Thanks in advance,
>>>> 
>>>> David Sullivan
>>>> 
>>>> 
>>>> 
>>>> AREVA NP INC
>>>> 400 Donald Lynch Boulevard
>>>> Marlborough, MA, 01752
>>>> Phone: (508) 573-6721
>>>> Fax: (434) 382-5597
>>>> David.Sullivan at AREVA.com
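
For readers facing a similar error stack: each line has the form
function(source line)....: message, listed from the outermost MPI call down
to the place where the failure was detected, and most intermediate frames
carry no message. The following is a rough Python sketch for pulling out the
frames and the deepest frame that has a message. It assumes the line format
shown in the original post above; parse_error_stack and innermost_message
are hypothetical helper names, not an MPICH2 facility.

```python
import re

# One error-stack line, e.g.:
#   MPID_nem_tcp_connpoll(1709).......: Communication error
FRAME_RE = re.compile(r"^(?P<func>\w+)\((?P<line>\d+)\)\.*:\s*(?P<msg>.*)$")


def parse_error_stack(text):
    """Return (function, source_line, message) tuples, outermost frame first."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match:
            frames.append(
                (match["func"], int(match["line"]), match["msg"].strip())
            )
    return frames


def innermost_message(frames):
    """Return the deepest frame with a non-empty message, or None."""
    for frame in reversed(frames):
        if frame[2]:
            return frame
    return None


# First error stack from the original post, with wrapped lines rejoined.
SAMPLE = """\
Fatal error in MPI_Comm_dup: Other MPI error, error stack:
MPI_Comm_dup(168).................: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7fff58edb450) failed
MPIR_Comm_copy(923)...............:
MPIR_Get_contextid(639)...........:
MPI_Allreduce(773)................: MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff58edb1a0, count=64, MPI_INT, MPI_BAND, MPI_COMM_WORLD) failed
MPIR_Allreduce(228)...............:
MPIC_Send(41).....................:
MPIC_Wait(513)....................:
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(933):
MPID_nem_tcp_connpoll(1709).......: Communication error
"""

if __name__ == "__main__":
    print(innermost_message(parse_error_stack(SAMPLE)))
```

For the sample, the deepest frame with a message is MPID_nem_tcp_connpoll
with "Communication error", which fits Darius's reading that the failure is
in communication between processes and not in MPI_Comm_dup itself.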
>>>> 
>>>> _______________________________________________
>>>> mpich-discuss mailing list
>>>> mpich-discuss at mcs.anl.gov
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: testing2.txt
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100902/42292e6b/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: testing1.txt
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20100902/42292e6b/attachment-0003.txt>

