[mpich-discuss] MPI_Alltoall problem
Rhys Ulerich
rhys.ulerich at gmail.com
Sat Oct 8 05:55:39 CDT 2011
Hi Jie,
>>> Hi, I am working on an application that heavily uses MPI_Alltoall for
>>> matrix transposes. ... So the performance of MPI_Alltoall becomes very
>>> critical. Does anyone know an alternative to directly calling the
>>> MPI_Alltoall routine that would reduce the run time?
>> One possibility would be trying FFTW 3.3's MPI transpose capabilities
> Actually my question stems from using fftw. I do not feel fftw is fast
> enough, considering that I need to do the transforms millions, even
> billions, of times. I timed its routine and found that it was actually
> the transpose of the data that killed me.
If it is the on-node data transpose that consumes most of the time, then,
as I hinted, try restructuring your calculation to work with transposed
on-node data and use FFTW_MPI_TRANSPOSED_OUT. Note that an in-place MPI
transpose will incur higher on-node transpose costs than an out-of-place
plan, because non-square, in-place matrix transposes are expensive.
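For concreteness, here is a minimal sketch of the transposed-output
approach. It is untested, and the N0 x N1 grid sizes are placeholders I
made up, not anything from your code:

    #include <fftw3-mpi.h>

    int main(int argc, char **argv)
    {
        const ptrdiff_t N0 = 1024, N1 = 1024;  /* hypothetical grid */
        ptrdiff_t alloc_local, local_n0, local_0_start;
        ptrdiff_t local_n1, local_1_start;

        MPI_Init(&argc, &argv);
        fftw_mpi_init();

        /* Query local sizes for both the usual N0 x N1 input layout
         * and the transposed N1 x N0 output layout. */
        alloc_local = fftw_mpi_local_size_2d_transposed(
                N0, N1, MPI_COMM_WORLD,
                &local_n0, &local_0_start,
                &local_n1, &local_1_start);

        fftw_complex *data = fftw_alloc_complex(alloc_local);

        /* FFTW_MPI_TRANSPOSED_OUT leaves the output distributed as an
         * N1 x N0 array, skipping the final global re-transposition. */
        fftw_plan plan = fftw_mpi_plan_dft_2d(
                N0, N1, data, data, MPI_COMM_WORLD,
                FFTW_FORWARD, FFTW_ESTIMATE | FFTW_MPI_TRANSPOSED_OUT);

        /* ... fill data[] with local_n0 rows of input ... */
        fftw_execute(plan);
        /* ... consume local_n1 rows of transposed output ... */

        fftw_destroy_plan(plan);
        fftw_free(data);
        fftw_mpi_cleanup();
        MPI_Finalize();
        return 0;
    }

Compile with something like "mpicc example.c -lfftw3_mpi -lfftw3 -lm".
The catch, of course, is that everything downstream of the transform has
to consume the transposed layout, which is the restructuring I mentioned.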
> I just hope that someone here might have a smarter solution than what fftw provides...
If you're fully taking advantage of what FFTW MPI can do, I would be
surprised if you find something better. If you do, please write back
as I'd be curious.
Best of luck,
Rhys