[MPICH] RE: Send + Send on a 3 node

Rajeev Thakur thakur at mcs.anl.gov
Thu Jan 3 01:19:12 CST 2008


This kind of problem will require flow control within the MPI
implementation.
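
An application can approximate such flow control itself, for example by
periodically using a synchronous send so the sender cannot run
arbitrarily far ahead of the receiver. A minimal sketch (not MPICH
code; the helper name and interval are illustrative):

    #include <mpi.h>

    #define SYNC_INTERVAL 64   /* illustrative; tune to available buffering */

    /* Like MPI_Send, but every SYNC_INTERVAL-th message is sent with
     * MPI_Ssend, which completes only after the receiver has matched
     * it, bounding how far the sender can run ahead. */
    static void throttled_send(void *buf, int count, int dest, int tag,
                               MPI_Comm comm)
    {
        static int since_sync = 0;
        if (++since_sync >= SYNC_INTERVAL) {
            since_sync = 0;
            MPI_Ssend(buf, count, MPI_BYTE, dest, tag, comm);
        } else {
            MPI_Send(buf, count, MPI_BYTE, dest, tag, comm);
        }
    }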

Rajeev

> -----Original Message-----
> From: Calin Iaru [mailto:calin at dolphinics.no] 
> Sent: Wednesday, January 02, 2008 8:35 AM
> To: Rajeev Thakur
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Send + Send on a 3 node
> 
> <resend>The list has blocked zip attachments</resend>
> 
> I have reduced the MPI_Reduce_scatter test to a 3 node test where:
>      rank 0 sends to rank 1
>      rank 1 sends to rank 2
>      rank 2 does nothing.
> 
> As you can see, rank 1 will block on send because rank 2's receive 
> buffers are full, while rank 0 will continue sending.
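> 
> Since the list blocks zip attachments, here is a minimal sketch of the 
> pattern (the 256-byte count, tag 1, and destination match the error 
> stack below; the endless loop is illustrative, not the exact test):
> 
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         char buf[256] = {0};
>         int rank;
> 
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>         for (;;) {                /* run until the failure appears */
>             if (rank == 0)        /* rank 0 floods rank 1 */
>                 MPI_Send(buf, 256, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
>             else if (rank == 1)   /* rank 1 forwards toward rank 2 */
>                 MPI_Send(buf, 256, MPI_BYTE, 2, 1, MPI_COMM_WORLD);
>             /* rank 2 intentionally posts no receives */
>         }
> 
>         MPI_Finalize();           /* never reached in this sketch */
>         return 0;
>     }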
> 
> The problem is that, on rank 1, while the send is blocked the progress 
> engine also polls for incoming messages, which are collected into an 
> unexpected-message queue. This cannot go on forever, because the 
> memory allocated for unexpected requests is limited.
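> 
> As a user-level workaround (not a substitute for flow control in the 
> implementation), rank 1 could use a nonblocking send and drain pending 
> messages from rank 0 while waiting, so they are matched instead of 
> accumulating. A rough sketch; the function name and buffers are 
> hypothetical:
> 
>     /* Replacement for rank 1's blocking send: test the outgoing
>      * request and receive anything pending from rank 0 meanwhile. */
>     void forward_with_drain(void *out, void *scratch, int count,
>                             MPI_Comm comm)
>     {
>         MPI_Request req;
>         MPI_Status status;
>         int done = 0, pending;
> 
>         MPI_Isend(out, count, MPI_BYTE, 2, 1, comm, &req);
>         while (!done) {
>             MPI_Test(&req, &done, &status);
>             MPI_Iprobe(0, 1, comm, &pending, &status);
>             if (pending)
>                 MPI_Recv(scratch, count, MPI_BYTE, 0, 1, comm,
>                          &status);
>         }
>     }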
> The error returned over sockets in 1.0.6p1 is this:
> 
> mpiexec -n 3 -machinefile machines.txt \\linden-4\h$\SendOn3.exe
> 
> job aborted:
> rank: node: exit code[: error message]
> 0: linden-2: 1
> 1: linden-3: 1: Fatal error in MPI_Send: Other MPI error, error stack:
> MPI_Send(173).............................: MPI_Send(buf=008C2418, 
> count=256, MPI_BYTE, dest=2, tag=1, MPI_COMM_WORLD) failed
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while 
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(436):
> MPIDI_EagerContigIsend(567)...............: failure occurred while 
> allocating memory for a request object
> 2: linden-4: 1
> 
> I have attached the source code to this mail. It is available only 
> for 32-bit Windows.
> 
> Calin Iaru wrote:
> > It's not so easy, because this is a third-party RDMA integration, 
> > which is now expected to be broken.
> >
> > Rajeev Thakur wrote:
> >> 1.0.2p1 is a very old version of MPICH2. Some memory leaks have 
> >> been fixed since then. Please try with the latest release, 1.0.6p1.
> >>
> >> Rajeev
> >>  
> >>> -----Original Message-----
> >>> From: owner-mpich-discuss at mcs.anl.gov 
> >>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Calin Iaru
> >>> Sent: Friday, December 21, 2007 9:32 AM
> >>> To: mpich-discuss at mcs.anl.gov
> >>> Subject: [MPICH] MPI_Reduce_scatter
> >>>
> >>> I am using PALLAS to stress MPI_Reduce_scatter. The error 
> >>> reported after millions of inner loops is:
> >>>
> >>> 3: MPI error  875666319 occurred
> >>> 3: Other MPI error, error stack:
> >>> 3: MPI_Reduce_scatter(1201): MPI_Reduce_scatter(sbuf=0x2aaaabdfb010,
> >>> rbuf=0x2aaaac1fc010, rcnts=0x176e1850, MPI_INT, MPI_SUM,
> >>> comm=0x84000000) failed
> >>> 3: MPIR_Reduce_scatter(372):
> >>> 3: MPIC_Send(48):
> >>> 3: MPIC_Wait(321):
> >>> 3: MPIDI_CH3_Progress(115): Unable to make message passing progress
> >>> 3: handle_read(280):
> >>> 3: MPIDI_CH3U_Handle_recv_pkt(250): failure occurred while 
> >>> allocating memory for a request object
> >>> 3: aborting job:
> >>> 3: application called MPI_Abort(MPI_COMM_WORLD, 875666319) - process 3
> >>>
> >>>
> >>> The library is 1.0.2p1 and I would like to know if there are some 
> >>> changes that would fix this issue.
> >>>
> >>> Best regards,
> >>>     Calin

More information about the mpich-discuss mailing list