[mpich-discuss] Fatal error in PMPI_Reduce: Other MPI error, error stack

Pavan Balaji balaji at mcs.anl.gov
Tue Feb 22 10:59:42 CST 2011


It does look like a firewall problem to me -- the processes are trying 
to find an open port in the 50001:59999 range, which can take a long 
time if everything is behind a firewall.
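
In case it helps, here is a minimal sketch of what opening that range and
testing it could look like, assuming the nodes use iptables and have
netcat installed (the node names and port range are just the ones from
this thread):

   # on each node, accept incoming TCP connections in the MPICH port range
   iptables -I INPUT -p tcp --dport 50001:59999 -j ACCEPT

   # from node1, check whether node2 accepts a connection on a port in that
   # range -- this only succeeds while an MPI process is actually listening
   # there, e.g. while the hanging cpi run is still alive
   nc -z -v node2 50001

The exact rules depend on the distribution and on whatever firewall
configuration is already in place, so treat this only as an illustration.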

  -- Pavan

On 02/22/2011 07:53 AM, T.R. Sanderson wrote:
> Just noticed something rather odd - if I specify a port range using
>
> export MPICH_PORT_RANGE=50001:59999
>
> the process just hangs after saying:
>
> [1] Process 1 of 2 is on node2
>
> [0] Process 0 of 2 is on node1
>
> Both processes continue to use 100% CPU until I kill the program. Does
> that explain anything? If I don't specify a port range it gives the same
> error as before.
>
> Best
>
> Theo
>
> On Feb 22 2011, Pavan Balaji wrote:
>
>>
>> Here's an entry on the FAQ that describes this:
>>
>>
>> http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes
>>
>>   -- Pavan
>>
>> On 02/21/2011 06:03 PM, T.R. Sanderson wrote:
>>> Hello, I've been testing my MPICH2 installation with cpi and am having
>>> some issues. cpi runs fine on either computer on its own. If I add two
>>> nodes to my hosts file I receive the error below; if the hosts file
>>> contains only one node it runs happily even when executed via MPI.
>>>
>>> I would be very grateful for any advice, if you would like the verbose
>>> output just let me know.
>>>
>>> Many thanks,
>>> Theo
>>>
>>> trs38 at node0:~$ mpiexec.hydra -l -n 2 /root/mpich2-1.3.2/examples/cpi
>>> [1] Process 1 of 2 is on node2
>>> [0] Process 0 of 2 is on node1
>>> [0] Fatal error in PMPI_Reduce: Other MPI error, error stack:
>>> [0] PMPI_Reduce(1322)...............: MPI_Reduce(sbuf=0x7fffbfe9d028, rbuf=0x7fffbfe9d020, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
>>> [0] MPIR_Reduce_impl(1139)..........:
>>> [0] MPIR_Reduce_intra(947)..........:
>>> [0] MPIR_Reduce_binomial(176).......:
>>> [0] MPIDI_CH3U_Recvq_FDU_or_AEP(380): Communication error with rank 1
>>> [mpiexec at node0] ONE OF THE PROCESSES TERMINATED BADLY: CLEANING UP
>>> [proxy:0:1 at node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:868): assert (!closed) failed
>>> [proxy:0:1 at node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:1 at node2] main (./pm/pmiserv/pmip.c:208): demux engine error waiting for event
>>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>>
>>
>>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

