6.7.3 An improved replacement for MPI_Alltoall

We close this section by noting that FFTW's MPI transpose routines can be thought of as a generalization of the MPI_Alltoall function (albeit only for floating-point types), and in some circumstances can function as an improved replacement.

MPI_Alltoall is defined by the MPI standard as:

     int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                      void *recvbuf, int recvcount, MPI_Datatype recvtype,
                      MPI_Comm comm);

In particular, for double* arrays in and out, consider the call:

     MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);

This is completely equivalent to:

     MPI_Comm_size(comm, &P);
     plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
     fftw_execute(plan);
     fftw_destroy_plan(plan);

That is, computing a P × P transpose on P processes, with a block size of 1, is just a standard all-to-all communication.
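
To see why the two calls move data identically, here is a small serial sketch that involves neither MPI nor FFTW. It assumes the P per-process buffers are concatenated into one array, so that "process" i owns the i-th run of P*howmany doubles. The function names alltoall_model and transpose_tuples are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Serial model of MPI_Alltoall (an illustration, not an MPI call):
   block `dst` of process `src` becomes block `src` of process `dst`.
   Each block holds `howmany` doubles. */
void alltoall_model(size_t P, size_t howmany, const double *in, double *out)
{
    for (size_t src = 0; src < P; ++src)
        for (size_t dst = 0; dst < P; ++dst)
            memcpy(out + (dst * P + src) * howmany,
                   in + (src * P + dst) * howmany,
                   howmany * sizeof(double));
}

/* Transpose of a row-major n0 x n1 matrix whose elements are
   tuples of `howmany` doubles; the result is n1 x n0. */
void transpose_tuples(size_t n0, size_t n1, size_t howmany,
                      const double *in, double *out)
{
    for (size_t i = 0; i < n0; ++i)
        for (size_t j = 0; j < n1; ++j)
            for (size_t k = 0; k < howmany; ++k)
                out[(j * n0 + i) * howmany + k] =
                    in[(i * n1 + j) * howmany + k];
}
```

With n0 = n1 = P, the two functions write exactly the same output: row i of the matrix plays the role of process i's send buffer, and row j of the transposed matrix is process j's receive buffer.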

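However, using the FFTW routine instead of MPI_Alltoall may have certain advantages. First of all, FFTW's routine can operate in-place (in == out), whereas MPI_Alltoall requires distinct send and receive buffers unless the MPI implementation supports the MPI_IN_PLACE option that MPI-2.2 added for this function.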

Second, even for out-of-place plans, FFTW's routine may be faster, especially if you need to perform the all-to-all communication many times and can afford to use FFTW_MEASURE or FFTW_PATIENT. Excluding the time to create the plan, it should be no slower, since one of the possible algorithms that FFTW uses for an out-of-place transpose is simply to call MPI_Alltoall. However, FFTW also considers several other possible algorithms that, depending on your MPI implementation and your hardware, may be faster.