[MPICH] Send + Send on a 3 node

Calin Iaru calin at dolphinics.no
Wed Jan 2 08:35:16 CST 2008


<resend>The list has blocked zip attachments</resend>

I have reduced the MPI_Reduce_scatter test to a 3-node test where:
     rank 0 sends to rank 1
     rank 1 sends to rank 2
     rank 2 does nothing.

As you can see, rank 1 will block on send because rank 2's receive
buffers are full, while rank 0 will continue sending.

The problem is that while rank 1 is blocked in the send, its progress
engine also polls for incoming data, and rank 0's messages are gathered
into the unexpected queue. This cannot go on forever, because the memory
allocated for unexpected requests is limited.
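
For reference, here is a minimal sketch of such a test (not the attached
SendOn3.c itself; the 256-byte message size and tag match the error stack
below, while the iteration count and rank 2's idle Sleep() are only
illustrative assumptions):

#include <mpi.h>
#include <windows.h>   /* Sleep(); the original test is Win32-only */
#include <string.h>

#define MSG_SIZE   256        /* matches count=256, MPI_BYTE in the error stack */
#define ITERATIONS 1000000    /* illustrative; enough to exhaust request memory */

int main(int argc, char **argv)
{
    int  rank, i;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    if (rank == 0) {
        /* Flood rank 1. These eager sends complete locally, so rank 0
           keeps going while its messages pile up in rank 1's
           unexpected queue. */
        for (i = 0; i < ITERATIONS; i++)
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Never receive from rank 0; just send to rank 2. Once rank 2's
           receive buffers are full this send blocks, and while it is
           blocked the progress engine keeps allocating request objects
           for the unexpected messages arriving from rank 0. */
        for (i = 0; i < ITERATIONS; i++)
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 2, 1, MPI_COMM_WORLD);
    } else {
        /* rank 2 does nothing: no receives are posted and no MPI calls
           are made, so its receive buffers fill up and stay full. */
        Sleep(60 * 1000);
    }

    MPI_Finalize();
    return 0;
}

Run with mpiexec -n 3 as in the command below; rank 1 is the process
that fails while allocating a request object.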
The error returned over sockets in 1.0.6p1 is this:

mpiexec -n 3 -machinefile machines.txt \\linden-4\h$\SendOn3.exe
job aborted:
rank: node: exit code[: error message]
0: linden-2: 1
1: linden-3: 1: Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173).............................: MPI_Send(buf=008C2418,
count=256, MPI_BYTE, dest=2, tag=1, MPI_COMM_WORLD) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(436):
MPIDI_EagerContigIsend(567)...............: failure occurred while
allocating memory for a request object
2: linden-4: 1

I have attached the source code to this mail; it is available only for
32-bit Windows.

Calin Iaru wrote:
> It's not so easy, because this is a third-party RDMA integration which
> is now expected to be broken.
>
> Rajeev Thakur wrote:
>> 1.0.2p1 is a very old version of MPICH2. Some memory leaks have been 
>> fixed
>> since then. Please try with the latest release, 1.0.6p1.
>>
>> Rajeev
>>  
>>> -----Original Message-----
>>> From: owner-mpich-discuss at mcs.anl.gov 
>>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Calin Iaru
>>> Sent: Friday, December 21, 2007 9:32 AM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: [MPICH] MPI_Reduce_scatter
>>>
>>> I am using PALLAS to stress MPI_Reduce_scatter. The error reported 
>>> after millions of inner loops is:
>>>
>>> 3: MPI error  875666319 occurred
>>> 3: Other MPI error, error stack:
>>> 3: MPI_Reduce_scatter(1201): MPI_Reduce_scatter(sbuf=0x2aaaabdfb010,
>>> rbuf=0x2aaaac1fc010, rcnts=0x176e1850, MPI_INT, MPI_SUM, 
>>> comm=0x84000000) failed
>>> 3: MPIR_Reduce_scatter(372):
>>> 3: MPIC_Send(48):
>>> 3: MPIC_Wait(321):
>>> 3: MPIDI_CH3_Progress(115): Unable to make message passing progress
>>> 3: handle_read(280):
>>> 3: MPIDI_CH3U_Handle_recv_pkt(250): failure occurred while 
>>> allocating memory for a request object
>>> 3: aborting job:
>>> 3: application called MPI_Abort(MPI_COMM_WORLD, 875666319) - process 3
>>>
>>>
>>> The library is 1.0.2p1, and I would like to know whether there are
>>> any changes since then that would fix this issue.
>>>
>>> Best regards,
>>>     Calin
>>>
>>>
>>>     
>>
>>   
>


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: SendOn3.c
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20080102/7eb3a578/attachment.diff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SendOn3.vcproj
Type: application/xml
Size: 3930 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20080102/7eb3a578/attachment.xml>