[mpich-discuss] MPI_Alltoall problem

Jeff Hammond jhammond at alcf.anl.gov
Sat Oct 8 09:30:13 CDT 2011


It is unlikely that you will find anything significantly better at FFT
than FFTW.

The best solution to your problem is not to call an FFT millions or
billions of times.  My guess is that your algorithm does not actually
require an FFT, but rather that you chose one from among many
possibilities based upon the availability of good libraries like FFTW
or upon assumptions about hardware performance from many years ago.

If you're doing molecular dynamics (quantum or classical), then the
problem is that you're using the wrong algorithm for Coulomb solves.
I can point you to codes that do not require an FFT every iteration
and therefore are more scalable.

Another way to solve this is to run on hardware that does MPI_Alltoall
really fast, e.g. Blue Gene/P.  Since you're at Argonne, this is
clearly feasible, although it is unlikely to impress any
mathematicians :-)

I am O(1) doors down the hall from you on the first floor of TCS if
you want to stop by to talk about this problem in more detail some
time.

Jeff

On Sat, Oct 8, 2011 at 5:40 AM, Jie Chen <jiechen at mcs.anl.gov> wrote:
> Rhys: thank you for the heads up. Actually my question stems from using fftw. I do not feel fftw is fast enough, considering that I need to do the transforms millions, even billions, of times.  I timed its routine and found that it was actually the transpose of the data that killed me. I know that when creating the plan, fftw tests different options, including the mpi_alltoall way, and chooses the fastest option to do the transpose. I just hope that someone here might have a smarter solution than what fftw provides...
>
> Jie
>
>
>
> On Oct 8, 2011, at 4:52 AM, Rhys Ulerich <rhys.ulerich at gmail.com> wrote:
>
>>> Hi, I am working on an application that heavily uses MPI_Alltoall for matrix transposes.
>>> ... So the performance of MPI_Alltoall becomes very critical. Does anyone know an
>>> alternative to directly calling the MPI_Alltoall routine that reduces the run time?
>>
>> One possibility would be trying FFTW 3.3's MPI transpose capabilities [1].
>> You pay a one-time planning cost while FFTW figures out what the fastest
>> way is to perform your transpose (Alltoall, pairwise sendrecv, etc.) and
>> then you can repeatedly execute the optimum choice.
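
A minimal sketch of that plan-once, execute-many pattern, using the basic
distributed-transpose interface from [1].  The matrix size, in-place use,
and FFTW_MEASURE flag below are placeholders, not anything from this
thread:

  #include <mpi.h>
  #include <fftw3-mpi.h>

  int main(int argc, char **argv)
  {
      const ptrdiff_t n0 = 1024, n1 = 1024;  /* global matrix size (placeholder) */
      ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;

      MPI_Init(&argc, &argv);
      fftw_mpi_init();

      /* Storage needed on this rank before and after the transpose. */
      ptrdiff_t alloc_local = fftw_mpi_local_size_2d_transposed(
          n0, n1, MPI_COMM_WORLD,
          &local_n0, &local_0_start, &local_n1, &local_1_start);
      double *data = fftw_alloc_real(alloc_local);

      /* One-time planning cost: FFTW times Alltoall, pairwise sendrecv,
         etc., and remembers the winner. */
      fftw_plan p = fftw_mpi_plan_transpose(n0, n1, data, data,
                                            MPI_COMM_WORLD, FFTW_MEASURE);

      /* ... fill this rank's local_n0 x n1 slab of data ... */
      fftw_execute(p);  /* one global transpose; call as often as needed */

      fftw_destroy_plan(p);
      fftw_free(data);
      fftw_mpi_cleanup();
      MPI_Finalize();
      return 0;
  }
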
>>
>> If this looks like a good option, be sure to read the entire FFTW MPI
>> chapter, as many of the useful tidbits are buried within it (e.g. [2]).
>> Lastly, if you structure your computation so that you perform an MPI
>> transpose of your matrix A and in the "transposed" logic you compute with
>> data strided like A^T, you may find that FFTW_MPI_TRANSPOSED_OUT [3] will
>> improve your runtime.
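
For what it's worth, the TRANSPOSED_OUT idea looks roughly like the sketch
below for a complex 2d MPI DFT (the grid size and planner flags are again
placeholders): the flag is added at planning time, the final transpose back
to the original layout is skipped, and the output is left as a local_n1 x n0
slab of the transposed array.

  #include <mpi.h>
  #include <fftw3-mpi.h>

  int main(int argc, char **argv)
  {
      const ptrdiff_t n0 = 512, n1 = 512;   /* global grid size (placeholder) */
      ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;

      MPI_Init(&argc, &argv);
      fftw_mpi_init();

      /* local_n1/local_1_start describe the transposed output slab. */
      ptrdiff_t alloc_local = fftw_mpi_local_size_2d_transposed(
          n0, n1, MPI_COMM_WORLD,
          &local_n0, &local_0_start, &local_n1, &local_1_start);
      fftw_complex *data = fftw_alloc_complex(alloc_local);

      /* TRANSPOSED_OUT: skip the transpose back to the input layout,
         saving one global communication step per execution. */
      fftw_plan p = fftw_mpi_plan_dft_2d(n0, n1, data, data, MPI_COMM_WORLD,
                                         FFTW_FORWARD,
                                         FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);

      /* ... fill this rank's local_n0 x n1 input slab, then ... */
      fftw_execute(p);
      /* ... consume the output as an n1 x n0 array, of which this rank
         holds rows local_1_start .. local_1_start + local_n1 - 1 ... */

      fftw_destroy_plan(p);
      fftw_free(data);
      fftw_mpi_cleanup();
      MPI_Finalize();
      return 0;
  }
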
>>
>> Hope that helps,
>> Rhys
>>
>> [1] http://www.fftw.org/fftw3_doc/FFTW-MPI-Transposes.html#FFTW-MPI-Transposes
>> [2] http://www.fftw.org/fftw3_doc/FFTW-MPI-Performance-Tips.html#FFTW-MPI-Performance-Tips
>> [3] http://www.fftw.org/fftw3_doc/Transposed-distributions.html#Transposed-distributions



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/index.php/User:Jhammond

