Internally, FFTW's MPI transform algorithms work by first computing
transforms of the data local to each process, then globally
transposing the data in some fashion to redistribute it among the
processes, transforming the new data local to each process, and
transposing back. For example, a two-dimensional n0 by n1 array,
distributed across the n0 dimension, is transformed by: (i)
transforming the n1 dimension, which is local to each process; (ii)
transposing to an n1 by n0 array, distributed across the n1
dimension; (iii) transforming the n0 dimension, which is now local to
each process; (iv) transposing back.
However, in many applications it is acceptable to compute a
multidimensional DFT whose results are produced in transposed order
(e.g., n1 by n0 in two dimensions). This provides a significant
performance advantage, because it means that the final transposition
step can be omitted. FFTW supports this optimization, which you
specify by passing the flag FFTW_MPI_TRANSPOSED_OUT to the planner
routines. To compute the inverse transform of transposed output, you
specify FFTW_MPI_TRANSPOSED_IN to tell the planner that the input is
transposed. In this section, we explain how to interpret the output
format of such a transform.
Suppose you are transforming multi-dimensional data with (at least
two) dimensions n0 × n1 × n2 × … × nd-1. As always, it is distributed
along the first dimension n0. Now, if we compute its DFT with the
FFTW_MPI_TRANSPOSED_OUT flag, the resulting output data are stored
with the first two dimensions transposed: n1 × n0 × n2 × … × nd-1,
distributed along the n1 dimension. Conversely, if we take the
n1 × n0 × n2 × … × nd-1 data and transform it with the
FFTW_MPI_TRANSPOSED_IN flag, then the format goes back to the
original n0 × n1 × n2 × … × nd-1 array.
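The relation between the two layouts can be illustrated with a small serial sketch for the three-dimensional case. It does not call FFTW or MPI and works on a whole array held in one address space, which is an assumption made purely for illustration; the function names are hypothetical and not part of the FFTW API. Both layouts are taken to be row-major.

```c
#include <stddef.h>

/* Hypothetical helper: row-major offset of element (i0, i1, i2)
   in the original n0 x n1 x n2 layout. */
size_t offset_normal(size_t n1, size_t n2,
                     size_t i0, size_t i1, size_t i2)
{
    return (i0 * n1 + i1) * n2 + i2;
}

/* Hypothetical helper: row-major offset of the same element in the
   transposed n1 x n0 x n2 layout (first two dimensions swapped). */
size_t offset_transposed(size_t n0, size_t n2,
                         size_t i0, size_t i1, size_t i2)
{
    return (i1 * n0 + i0) * n2 + i2;
}

/* Copy an n0 x n1 x n2 array into an n1 x n0 x n2 array, which is the
   ordering described above for transposed output. */
void transpose_first_two(size_t n0, size_t n1, size_t n2,
                         const double *in, double *out)
{
    for (size_t i0 = 0; i0 < n0; ++i0)
        for (size_t i1 = 0; i1 < n1; ++i1)
            for (size_t i2 = 0; i2 < n2; ++i2)
                out[offset_transposed(n0, n2, i0, i1, i2)] =
                    in[offset_normal(n1, n2, i0, i1, i2)];
}
```

Only the first two dimensions change places; the remaining dimensions keep their order and stay contiguous in memory.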
There are two ways to find the portion of the transposed array that resides on the current process. First, you can simply call the appropriate ‘local_size’ function, passing n1 × n0 × n2 ×…× nd-1 (the transposed dimensions). This would mean calling the ‘local_size’ function twice, once for the transposed and once for the non-transposed dimensions. Alternatively, you can call one of the ‘local_size_transposed’ functions, which returns both the non-transposed and transposed data distribution from a single call. For example, for a 3d transform with transposed output (or input), you might call:
     ptrdiff_t fftw_mpi_local_size_3d_transposed(
                  ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
                  ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                  ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Here, local_n0 and local_0_start give the size and starting index of
the n0 dimension for the non-transposed data, as in the previous
sections. For transposed data (e.g. the output for
FFTW_MPI_TRANSPOSED_OUT), local_n1 and local_1_start give the size
and starting index of the n1 dimension, which is the first dimension
of the transposed data (n1 by n0 by n2).
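As a minimal sketch of how local_n1 and local_1_start might be used, the helper below maps a global index (i0, i1, i2) to an offset within the local portion of the transposed output. It assumes a complex DFT whose local output is a contiguous row-major local_n1 by n0 by n2 block, and it takes local_n1 and local_1_start as plain parameters (in practice they would come from fftw_mpi_local_size_3d_transposed). The function name is hypothetical and not part of the FFTW API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of global element (i0, i1, i2) inside the
   local transposed slab of size local_n1 x n0 x n2, or -1 if the element
   belongs to another process.  Assumes no padding, i.e. a complex DFT. */
ptrdiff_t local_transposed_offset(ptrdiff_t n0, ptrdiff_t n2,
                                  ptrdiff_t local_n1,
                                  ptrdiff_t local_1_start,
                                  ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2)
{
    /* i0 and i2 are not distributed, so they must be in range everywhere. */
    assert(i0 >= 0 && i0 < n0);
    assert(i2 >= 0 && i2 < n2);

    /* Only rows local_1_start .. local_1_start + local_n1 - 1 of the
       n1 dimension are stored on this process. */
    if (i1 < local_1_start || i1 >= local_1_start + local_n1)
        return -1;

    return ((i1 - local_1_start) * n0 + i0) * n2 + i2;
}
```

For instance, with n0 = 4, n2 = 2, local_n1 = 3 and local_1_start = 3, the element (1, 4, 1) lands at offset 11 of the local array, while any element with i1 = 2 is reported as not local.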
(Note that using FFTW_MPI_TRANSPOSED_IN is completely equivalent to
using FFTW_MPI_TRANSPOSED_OUT and passing the first two dimensions to
the planner in reverse order, or vice versa. If you pass both the
FFTW_MPI_TRANSPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flags, it is
equivalent to swapping the first two dimensions passed to the planner
and passing neither flag.)