[mpich-discuss] Re: MPI_Brecv vs multiple MPI_Irecv

Wed Aug 27 15:51:36 CDT 2008

On Aug 27, 2008, at 3:01 PM, Darius Buntinas wrote:

>
> On 08/27/2008 01:00 PM, Robert Kubrick wrote:
>> On Aug 27, 2008, at 1:29 PM, Darius Buntinas wrote:
>>>
>>> I'm not sure what you mean by a queue on the receiving side.  If  
>>> receives have been posted, incoming messages that match the  
>>> posted receives will be received directly into the specified  
>>> location.
>> Yes, but you have to post receives and keep track of each request
>> handle. That was the idea of my original question, one recv per
>> message.
>> You can receive more than one element/message with each call of
>> course:
>> MPI_Irecv(buf, 10, ...)
>> but then the Irecv request won't complete until *all* the elements
>> have been received.
>> All I am saying is that it would be convenient to specify a  
>> receiving buffer where the implementation can store messages  
>> without blocking program flow on the send side and message  
>> transmission.
>>
>>> Depending on the particular MPI implementation you're using,  
>>> progress on the receives (i.e., taking them off the network,  
>>> doing the matching, copying/receiving into the user buffer, etc.)  
>>> may only happen while you're calling an MPI function call.  So if  
>>> you're in a long compute loop, the MPI library might not be  
>>> performing the receives.  But adding a user buffer wouldn't help  
>>> that situation either.
>> The program might be in a blocking recv on a different comm, which
>> could allow progress on a different comm.
>> Also my understanding is that the MPI standard does not restrict
>> progress on MPI send/recv to the time spent inside MPI calls. Some
>> MPI implementations are multi-threaded.
>>>
>>> Messages that are received which don't have matching posted  
>>> receives will be "queued" waiting for the matching receives, and  
>>> either buffered internally at the receiver (for small messages)  
>>> or will "stall" at the sender (for large messages).  But I  
>>> believe you're only concerned with the case where receives have  
>>> been posted.
>> Both cases. By specifying a receiving buffer to handle incoming
>> messages, the application does not need to post receives to allow
>> transmission (as long as there is room left in the buffer, of
>> course).
>> Even for small messages the send might block when all the internal  
>> receiver buffer space is gone. And what is the size of the  
>> internal buffer anyway?
>
> For mpich2, the internal buffer space is limited by available  
> memory. For each unexpected small message (<=128K for ch3:sock)  
> mpich2 does a malloc and receives the message into that buffer.  So  
> even unexpected small messages shouldn't block program flow...but  
> you'll eventually crash if you run out of memory.

Good to know.

>
> Unexpected large (>128K for ch3:sock) messages are another story.
> The send won't complete on the send side until the receive is  
> matched at the receiver (and the receiver makes sufficient progress  
> to fully receive the message).  So unexpected large messages can  
> block program flow at the sender.  On the other hand you won't run  
> out of memory buffering unexpected large messages.
>
> If you are running into this case where unexpected large messages
> are blocking program flow, it should be sufficient to increase the
> small-large message threshold in the library until you have no
> large messages.  This can hurt performance, though, because the
> library does an extra copy of large amounts of data from the
> temporary buffers; how much probably depends on the specific
> application.
>
> I think I understand where you're coming from.  I think you're  
> saying that if the app knows that it'll receive a lot of messages
> faster than it can post receives, there should be a way to tell the  
> library "just buffer X MB of unexpected messages, regardless of how  
> big they are." This is a little different from my previous  
> suggestion above, in that the previous suggestion will buffer an  
> unlimited number of messages that are smaller than the threshold,  
> whereas this suggestion would buffer messages of any size until the  
> total size is larger than the threshold X.
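
The two buffering policies contrasted above can be sketched as a toy model in plain C. This is only an illustration of the decision rule each policy implies, not MPICH2 code: the names `buffer_by_threshold`, `unexpected_pool`, `buffer_by_budget`, and `release_from_budget` are hypothetical, and real implementations also deal with matching, progress, and allocation failures that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Policy A (per-message threshold): an unexpected message is buffered
 * at the receiver whenever its own size is at or below the threshold.
 * There is no cap on how many such messages pile up. */
bool buffer_by_threshold(size_t msg_bytes, size_t threshold_bytes)
{
    return msg_bytes <= threshold_bytes;
}

/* Policy B (total budget): messages of any size are buffered until the
 * bytes already held plus the new message would exceed the budget X. */
typedef struct {
    size_t budget_bytes; /* X: total space reserved for unexpected messages */
    size_t used_bytes;   /* bytes currently held in that space */
} unexpected_pool;

/* Returns true if the message is buffered; false means the sender
 * would have to wait until a matching receive is posted. */
bool buffer_by_budget(unexpected_pool *pool, size_t msg_bytes)
{
    assert(pool != NULL);
    assert(pool->used_bytes <= pool->budget_bytes);
    if (msg_bytes > pool->budget_bytes - pool->used_bytes)
        return false;
    pool->used_bytes += msg_bytes;
    return true;
}

/* Called in this model once a matching receive has copied the message
 * out of the pool, which frees the space for later arrivals. */
void release_from_budget(unexpected_pool *pool, size_t msg_bytes)
{
    assert(pool != NULL);
    assert(msg_bytes <= pool->used_bytes);
    pool->used_bytes -= msg_bytes;
}
```

Under policy A the memory used by unexpected messages is unbounded but no single large message is ever copied twice; under policy B memory use is bounded by X, at the cost that even a small message can stall its sender once the pool is full.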

I was starting from the assumption that the library does not buffer  
messages on the receiving side (other than the MPI_Recv supplied  
buffer), even small ones.
So MPICH2 does buffer any message < 128k, which is good from a  
latency point of view on the sending side, but it should be checked  
for each implementation.

At this point the MPI_Bsend becomes less relevant, unless maybe for  
large messages that are "usually" not buffered.

>
> Does either suggestion fit your scenario?

Yes. If you have a process that sends many small messages, such as
logging strings to a spooler process, by reading the MPI standard
you're left with the impression that MPI_Send might block until a
matching receive has been posted on the other side. If sender
performance is a priority, the solution is to queue those log
messages somewhere (either on the sending side or, better, on the
receiving side) to let the process continue execution. MPI_Isend
won't do, because the overhead of managing hundreds of requests
would probably slow down execution more.

If process priority is reversed (sending process has low priority,
receiving process high), it's probably better to use
MPI_Buffer_attach/MPI_Bsend to move the buffering copy overhead to
the sender?

>
> -d
>
>>>
>>> Does this help?
>>>
>>> -d
>>>
>>>
>>> On 08/27/2008 11:03 AM, Robert Kubrick wrote:
>>>> A buffered receive would allow the implementation to receive and  
>>>> store messages when the application is busy doing something  
>>>> else, like reading messages on a different comm. I now  
>>>> understand why a Brecv is not in the standard and it makes  
>>>> perfect sense, but the result is that on the sending side you can
>>>> control the size of a sending "queue", on the receiving side you  
>>>> can not.
>>>> On Aug 27, 2008, at 11:23 AM, Darius Buntinas wrote:
>>>>>
>>>>> Well, what would it mean to do a buffered receive?
>>>>>
>>>>> This?
>>>>>   buf = malloc(BUF_SZ);
>>>>>   MPI_Irecv(buf,...);
>>>>>   MPI_Wait(...);
>>>>>   memcpy(recv_ptr, buf, BUF_SZ);
>>>>>
>>>>> What would be the benefit?
>>>>>
>>>>> -d
>>>>>
>>>>> On 08/27/2008 10:13 AM, Robert Kubrick wrote:
>>>>>> I just found out that the standard actually doesn't have an  
>>>>>> MPI_Brecv call.
>>>>>> Any reason why the recv can not buffer messages in a
>>>>>> user-provided memory space, as per MPI_Buffer_attach/MPI_Bsend?
>>>>>> On Aug 26, 2008, at 4:17 PM, Robert Kubrick wrote:
>>>>>>> From a performance point of view, which one is better:
>>>>>>>
>>>>>>> MPI_Buffer_attach(buf, 10*sizeof(MSG))
>>>>>>> MPI_Brecv()
>>>>>>>
>>>>>>> or
>>>>>>>
>>>>>>> MPI_Recv_init()
>>>>>>> MPI_Recv_init()
>>>>>>> MPI_Recv_init()
>>>>>>> ... /* 10 recv requests */
>>>>>>> MPI_Startall(all recv)
>>>>>>> MPI_Waitany()
>>>>>>>
>>>>>>>
>>>>>>> I understand MPI_Brecv will require an extra message copy,  
>>>>>>> from the attached buffer to the MPI_Brecv() buffer. I'd like  
>>>>>>> to know if there are other differences between the two methods.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rob
>>>>>
>>>
>