[MPICH] RE: Send + Send on a 3 node
Calin Iaru
calin at dolphinics.com
Thu Jan 3 03:50:11 CST 2008
Is this flow control going to be implemented in the next patch? As an
alternative, I think I will make some changes to the old 1.0.2p1 release
on my machine.
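
In the meantime, a possible application-level workaround might be to
throttle the sender with an occasional synchronous send, so that it
cannot run arbitrarily far ahead of a blocked receiver. A rough,
untested sketch (the helper name and the interval of 64 are arbitrary
choices of mine):

    #include <mpi.h>

    #define THROTTLE_INTERVAL 64

    /* Replace every Nth MPI_Send with MPI_Ssend; the synchronous send
       completes only once the receiver has matched it, which bounds
       the number of unexpected messages queued on the receiving side. */
    static int throttled_send(void *buf, int count, MPI_Datatype type,
                              int dest, int tag, MPI_Comm comm)
    {
        static int sends_since_sync = 0;
        if (++sends_since_sync >= THROTTLE_INTERVAL) {
            sends_since_sync = 0;
            return MPI_Ssend(buf, count, type, dest, tag, comm);
        }
        return MPI_Send(buf, count, type, dest, tag, comm);
    }

This would not fix the underlying problem, of course; it only limits
how much unexpected data can pile up between synchronization points.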
--------------------------------------------------
From: "Rajeev Thakur" <thakur at mcs.anl.gov>
Sent: Thursday, January 03, 2008 8:19 AM
To: "'Calin Iaru'" <calin at dolphinics.no>
Cc: <mpich-discuss at mcs.anl.gov>
Subject: [MPICH] RE: Send + Send on a 3 node
> This kind of problem will require flow control within the MPI
> implementation.
>
> Rajeev
>
>> -----Original Message-----
>> From: Calin Iaru [mailto:calin at dolphinics.no]
>> Sent: Wednesday, January 02, 2008 8:35 AM
>> To: Rajeev Thakur
>> Cc: mpich-discuss at mcs.anl.gov
>> Subject: Send + Send on a 3 node
>>
>> <resend>The list has blocked zip attachments</resend>
>>
>> I have reduced the MPI_Reduce_scatter test to a 3-node test where:
>> rank 0 sends to rank 1
>> rank 1 sends to rank 2
>> rank 2 does nothing.
>>
>> As you can see, rank 1 will block on send because rank 2 has
>> full receive buffers, while rank 0 will continue sending.
>>
>> The problem is that on rank 1, the sender also polls for incoming
>> messages, which are gathered into an unexpected queue. This cannot
>> go on forever, because only a limited amount of memory is allocated
>> for unexpected requests.
>> The error returned over sockets in 1.0.6p1 is this:
>>
>> mpiexec -n 3 -machinefile machines.txt \\linden-4\h$\SendOn3.exe
>> job aborted:
>> rank: node: exit code[: error message]
>> 0: linden-2: 1
>> 1: linden-3: 1: Fatal error in MPI_Send: Other MPI error, error stack:
>> MPI_Send(173).............................: MPI_Send(buf=008C2418,
>> count=256, MPI_BYTE, dest=2, tag=1, MPI_COMM_WORLD) failed
>> MPIDI_CH3i_Progress_wait(215).............: an error occurred while
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(436):
>> MPIDI_EagerContigIsend(567)...............: failure occurred while
>> allocating memory for a request object
>> 2: linden-4: 1
>>
>> I have attached the source code to this mail. It builds only on
>> 32-bit Windows.
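>>
>> In outline, the test does something like the following (a simplified
>> sketch rather than the attached source; the iteration count is
>> illustrative, while the message size and tag match the error stack
>> above):
>>
>> #include <mpi.h>
>> #include <string.h>
>>
>> int main(int argc, char **argv)
>> {
>>     char buf[256];
>>     int rank, i;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     memset(buf, 0, sizeof(buf));
>>
>>     for (i = 0; i < 1000000; i++) {
>>         if (rank == 0) {
>>             /* rank 0 streams small eager messages at rank 1 */
>>             MPI_Send(buf, 256, MPI_BYTE, 1, 1, MPI_COMM_WORLD);
>>         } else if (rank == 1) {
>>             /* rank 1 streams at rank 2, which never receives; while
>>              * blocked here, rank 1 still polls the network and
>>              * queues rank 0's messages as unexpected requests until
>>              * request allocation fails */
>>             MPI_Send(buf, 256, MPI_BYTE, 2, 1, MPI_COMM_WORLD);
>>         }
>>         /* rank 2 posts no receives at all */
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }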
>>
>> Calin Iaru wrote:
>> > It's not so easy, because this is a third-party RDMA integration
>> > which is now expected to be broken.
>> >
>> > Rajeev Thakur wrote:
>> >> 1.0.2p1 is a very old version of MPICH2. Some memory leaks have
>> >> been fixed since then. Please try with the latest release, 1.0.6p1.
>> >>
>> >> Rajeev
>> >>
>> >>> -----Original Message-----
>> >>> From: owner-mpich-discuss at mcs.anl.gov
>> >>> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Calin Iaru
>> >>> Sent: Friday, December 21, 2007 9:32 AM
>> >>> To: mpich-discuss at mcs.anl.gov
>> >>> Subject: [MPICH] MPI_Reduce_scatter
>> >>>
>> >>> I am using PALLAS to stress MPI_Reduce_scatter. The error
>> >>> reported after millions of inner loops is:
>> >>>
>> >>> 3: MPI error 875666319 occurred
>> >>> 3: Other MPI error, error stack:
>> >>> 3: MPI_Reduce_scatter(1201): MPI_Reduce_scatter(sbuf=0x2aaaabdfb010,
>> >>> rbuf=0x2aaaac1fc010, rcnts=0x176e1850, MPI_INT, MPI_SUM,
>> >>> comm=0x84000000) failed
>> >>> 3: MPIR_Reduce_scatter(372):
>> >>> 3: MPIC_Send(48):
>> >>> 3: MPIC_Wait(321):
>> >>> 3: MPIDI_CH3_Progress(115): Unable to make message passing progress
>> >>> 3: handle_read(280):
>> >>> 3: MPIDI_CH3U_Handle_recv_pkt(250): failure occurred while
>> >>> allocating memory for a request object
>> >>> 3: aborting job:
>> >>> 3: application called MPI_Abort(MPI_COMM_WORLD, 875666319) -
>> >>> process 3
>> >>>
>> >>>
>> >>> The library is 1.0.2p1 and I would like to know if there are some
>> >>> changes that would fix this issue.
>> >>>
>> >>> Best regards,
>> >>> Calin