In particular, suppose that we have an n0 by n1 array in row-major order, block-distributed across the n0 dimension. To transpose this into an n1 by n0 array block-distributed across the n1 dimension, we would create a plan by calling the following function:

fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                  double *in, double *out,
                                  MPI_Comm comm, unsigned flags);
The input and output arrays (in and out) can be the same. The transpose is actually executed by calling fftw_execute on the plan, as usual.
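For example, here is a minimal sketch of planning and executing an in-place transpose (the array data and its distributed allocation are assumed to be set up as described below; FFTW_ESTIMATE is one of the standard planner flags):

/* data holds this process's block of the n0 x n1 array;
   passing the same pointer for in and out requests an
   in-place transpose */
fftw_plan p = fftw_mpi_plan_transpose(n0, n1, data, data,
                                      MPI_COMM_WORLD, FFTW_ESTIMATE);
fftw_execute(p);  /* data now holds this process's block of the transpose */
fftw_destroy_plan(p);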
The flags are the usual FFTW planner flags, but support two additional flags: FFTW_MPI_TRANSPOSED_OUT and/or FFTW_MPI_TRANSPOSED_IN. What these flags indicate, for transpose plans, is that the output and/or input, respectively, are locally transposed. That is, on each process the input data is normally stored as a local_n0 by n1 array in row-major order, but for an FFTW_MPI_TRANSPOSED_IN plan the input data is stored as n1 by local_n0 in row-major order. Similarly, FFTW_MPI_TRANSPOSED_OUT means that the output is n0 by local_n1 instead of local_n1 by n0.
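To make these layouts concrete, here is a sketch of the local index arithmetic (the sizes local_n0 and local_n1 and the starting offsets come from fftw_mpi_local_size_2d_transposed, described below; the index variables i, j, and i0 are illustrative):

/* Default input layout: local_n0 x n1, row-major.
   Global input element (local_0_start + i, j) is: */
in[i * n1 + j]

/* FFTW_MPI_TRANSPOSED_IN: the same block stored as n1 x local_n0: */
in[j * local_n0 + i]

/* Default output layout: local_n1 x n0, row-major.
   Element (local_1_start + j, i0) of the transposed array is: */
out[j * n0 + i0]

/* FFTW_MPI_TRANSPOSED_OUT: the same block stored as n0 x local_n1: */
out[i0 * local_n1 + j]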
To determine the local size of the array on each process before and after the transpose, as well as the amount of storage that must be allocated, one should call fftw_mpi_local_size_2d_transposed, just as for a 2d DFT as described in the previous section:

ptrdiff_t fftw_mpi_local_size_2d_transposed
                (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Again, the return value is the local storage to allocate, which in this case is the number of real (double) values rather than complex numbers as in the previous examples.
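Putting these pieces together, a complete program might look like the following sketch (the dimensions 128 by 256 and the choice of FFTW_MEASURE are arbitrary, for illustration only):

#include <fftw3-mpi.h>

int main(int argc, char **argv)
{
    const ptrdiff_t n0 = 128, n1 = 256;   /* illustrative dimensions */
    ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* required storage (in real values) and the local sizes
       before and after the transpose */
    alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                    &local_n0, &local_0_start,
                                                    &local_n1, &local_1_start);
    double *data = fftw_alloc_real(alloc_local);

    /* in-place transpose plan (in == out) */
    fftw_plan p = fftw_mpi_plan_transpose(n0, n1, data, data,
                                          MPI_COMM_WORLD, FFTW_MEASURE);

    /* ... initialize the local_n0 x n1 block at data ... */

    fftw_execute(p);  /* data now holds a local_n1 x n0 block of the transpose */

    fftw_destroy_plan(p);
    fftw_free(data);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}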