[petsc-users] Usage of parallel FFT for doing batch 1d-FFTs over the columns of a dense 2d-matrix

Tue Dec 8 20:13:34 CST 2020

  Roland,

     The 

> inembed, istride, idist,
>                                data_out, onembed, ostride, odist,
> 
variables all need to be set rank by rank, (since each rank is providing a matrix with its own number of local rows which may be different on other ranks) and not using the global matrix_size or or size (which I assume comes from MPI_Comm_size()?).

Use MatGetLocalSize(mat,&numrows,NULL); to get the number of rows on a process. Use this to set the need FFTW parameters.

The reason the current code works when "my matrix size is an integer multiple of x, " is because then all the ranks have the same numrows and your formula produces the correct local size but otherwise your formula will not produce the correct local number of rows.

  Barry

> On Dec 8, 2020, at 8:42 AM, Roland Richter <roland.richter at ntnu.no> wrote:
> 
> I replaced the FFT-code with the following:
> 
>     int FFT_rank = 1;
>     const int FFTW_size[] = {matrix_size};
>     int howmany = last_owned_row_index - first_owned_row_index;
>     int idist = 1;
>     int odist = 1;
>     int istride = matrix_size / size;
>     int ostride = matrix_size / size;
>     const int inembed[] = {matrix_size}, onembed[] = {matrix_size};
>     fplan = fftw_plan_many_dft(FFT_rank, FFTW_size,
>                                howmany,
>                                data_in, inembed, istride, idist,
>                                data_out, onembed, ostride, odist,
>                                FFTW_FORWARD, FFTW_ESTIMATE);
>     bplan = fftw_plan_many_dft(FFT_rank, FFTW_size,
>                                howmany,
>                                data_out, inembed, istride, idist,
>                                data_out2, onembed, ostride, odist,
>                                FFTW_BACKWARD, FFTW_ESTIMATE);
> 
> Now I get the expected results also for mpirun -n x with x > 1, but only if my matrix size is an integer multiple of x, else some parts of the resulting matrix are zeroed out. Therefore, I assume I made a mistake here with inembed and onembed, but I am not sure how they influence the result. Do you have any suggestions?
> 
> Thanks,
> 
> Roland
> 
> Am 08.12.20 um 14:55 schrieb Matthew Knepley:
>> On Tue, Dec 8, 2020 at 8:40 AM Roland Richter <roland.richter at ntnu.no <mailto:roland.richter at ntnu.no>> wrote:
>> Dear all,
>> 
>> I tried the following code:
>> 
>> int main(int argc, char **args) {
>>     Mat C, F;
>>     Vec x, y, z;
>>     PetscViewer viewer;
>>     PetscMPIInt rank, size;
>>     PetscInitialize(&argc, &args, (char*) 0, help);
>> 
>>     MPI_Comm_size(PETSC_COMM_WORLD, &size);
>>     MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
>> 
>>     PetscPrintf(PETSC_COMM_WORLD,"Number of processors = %d, rank = %d\n", size, rank);
>>     //    std::cout << "From rank " << rank << '\n';
>> 
>>     //MatCreate(PETSC_COMM_WORLD, &C);
>>     PetscViewerCreate(PETSC_COMM_WORLD, &viewer);
>>     PetscViewerSetType(viewer, PETSCVIEWERASCII);
>>     arma::cx_mat local_mat, local_zero_mat;
>>     const size_t matrix_size = 5;
>> 
>>     MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, matrix_size, matrix_size, NULL, &C);
>>     MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, matrix_size, matrix_size, NULL, &F);
>>     if(rank == 0) {
>>         arma::Col<int> indices = arma::linspace<arma::Col<int>>(0, matrix_size - 1, matrix_size);
>>         //if(rank == 0) {
>>         local_mat = arma::randu<arma::cx_mat>(matrix_size, matrix_size);
>>         local_zero_mat = arma::zeros<arma::cx_mat>(matrix_size, matrix_size);
>>         arma::cx_mat tmp_mat = local_mat.st <http://local_mat.st/>();
>>         MatSetValues(C, matrix_size, indices.memptr(), matrix_size, indices.memptr(), tmp_mat.memptr(), INSERT_VALUES);
>>         MatSetValues(F, matrix_size, indices.memptr(), matrix_size, indices.memptr(), local_zero_mat.memptr(), INSERT_VALUES);
>>     }
>> 
>>     MatAssemblyBegin(C, MAT_FINAL_ASSEMBLY);
>>     MatAssemblyEnd(C, MAT_FINAL_ASSEMBLY);
>>     MatAssemblyBegin(F, MAT_FINAL_ASSEMBLY);
>>     MatAssemblyEnd(F, MAT_FINAL_ASSEMBLY);
>> 
>>     //FFT test
>>     Mat FFT_A;
>>     Vec input, output;
>>     int first_owned_row_index = 0, last_owned_row_index = 0;
>>     const int FFT_length[] = {matrix_size};
>> 
>> 
>>     MatCreateFFT(PETSC_COMM_WORLD, 1, FFT_length, MATFFTW, &FFT_A);
>>     MatCreateVecsFFTW(FFT_A, &x, &y, &z);
>>     VecCreate(PETSC_COMM_WORLD, &input);
>>     VecSetFromOptions(input);
>>     VecSetSizes(input, PETSC_DECIDE, matrix_size);
>>     VecCreate(PETSC_COMM_WORLD, &output);
>>     VecSetFromOptions(output);
>>     VecSetSizes(output, PETSC_DECIDE, matrix_size);
>>     MatGetOwnershipRange(C, &first_owned_row_index, &last_owned_row_index);
>>     std::cout << "Rank " << rank << " owns row " << first_owned_row_index << " to row " << last_owned_row_index << '\n';
>> 
>>     //Testing FFT
>> 
>>     /*---------------------------------------------------------*/
>>     fftw_plan    fplan,bplan;
>>     fftw_complex *data_in,*data_out,*data_out2;
>>     ptrdiff_t    alloc_local, local_ni, local_i_start, local_n0,local_0_start;
>>     PetscRandom rdm;
>> 
>>     //    if (!rank)
>>     //        printf("Use FFTW without PETSc-FFTW interface\n");
>>     fftw_mpi_init();
>>     int N           = matrix_size * matrix_size;
>>     int N0 = matrix_size;
>>     int N1 = matrix_size;
>>     const ptrdiff_t n_data[] = {N0, 1};
>>     //alloc_local = fftw_mpi_local_size_2d(N0,N1,PETSC_COMM_WORLD,&local_n0,&local_0_start);
>>     alloc_local = fftw_mpi_local_size_many(1, n_data,
>>                                            matrix_size,
>>                                            FFTW_MPI_DEFAULT_BLOCK,
>>                                            PETSC_COMM_WORLD,
>>                                            &local_n0,
>>                                            &local_0_start);
>>     //data_in   = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*alloc_local);
>>     PetscScalar *C_ptr, *F_ptr;
>>     MatDenseGetArray(C, &C_ptr);
>>     MatDenseGetArray(F, &F_ptr);
>>     data_in = reinterpret_cast<fftw_complex*>(C_ptr);
>>     data_out = reinterpret_cast<fftw_complex*>(F_ptr);
>>     data_out2 = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*alloc_local);
>> 
>> 
>>     VecCreateMPIWithArray(PETSC_COMM_WORLD,1,(PetscInt)local_n0 * N1,(PetscInt)N,(const PetscScalar*)data_in,&x);
>>     PetscObjectSetName((PetscObject) x, "Real Space vector");
>>     VecCreateMPIWithArray(PETSC_COMM_WORLD,1,(PetscInt)local_n0 * N1,(PetscInt)N,(const PetscScalar*)data_out,&y);
>>     PetscObjectSetName((PetscObject) y, "Frequency space vector");
>>     VecCreateMPIWithArray(PETSC_COMM_WORLD,1,(PetscInt)local_n0 * N1,(PetscInt)N,(const PetscScalar*)data_out2,&z);
>>     PetscObjectSetName((PetscObject) z, "Reconstructed vector");
>> 
>>     int FFT_rank = 1;
>>     const ptrdiff_t FFTW_size[] = {matrix_size};
>>     int howmany = last_owned_row_index - first_owned_row_index;
>>     //std::cout << "Rank " << rank << " processes " << howmany << " rows\n";
>>     int idist = matrix_size;//1;
>>     int odist = matrix_size;//1;
>>     int istride = 1;//matrix_size;
>>     int ostride = 1;//matrix_size;
>>     const ptrdiff_t *inembed = FFTW_size, *onembed = FFTW_size;
>>     fplan = fftw_mpi_plan_many_dft(FFT_rank, FFTW_size,
>>                                    howmany,
>>                                    FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
>>                                    data_in, data_out,
>>                                    PETSC_COMM_WORLD,
>>                                    FFTW_FORWARD, FFTW_ESTIMATE);
>>     bplan = fftw_mpi_plan_many_dft(FFT_rank, FFTW_size,
>>                                    howmany,
>>                                    FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
>>                                    data_out, data_out2,
>>                                    PETSC_COMM_WORLD,
>>                                    FFTW_BACKWARD, FFTW_ESTIMATE);
>> 
>>     if (false) {VecView(x,PETSC_VIEWER_STDOUT_WORLD);}
>> 
>>     fftw_execute(fplan);
>>     if (false) {VecView(y,PETSC_VIEWER_STDOUT_WORLD);}
>> 
>>     fftw_execute(bplan);
>> 
>>     double a = 1.0 / matrix_size;
>>     double enorm = 0;
>>     VecScale(z,a);
>>     if (false) {VecView(z, PETSC_VIEWER_STDOUT_WORLD);}
>>     VecAXPY(z,-1.0,x);
>>     VecNorm(z,NORM_1,&enorm);
>>     if (enorm > 1.e-11) {
>>         PetscPrintf(PETSC_COMM_SELF,"  Error norm of |x - z| %g\n",(double)enorm);
>>     }
>> 
>>     /* Free spaces */
>>     fftw_destroy_plan(fplan);
>>     fftw_destroy_plan(bplan);
>>     fftw_free(data_out2);
>> 
>>     //Generate test matrix for comparison
>>     arma::cx_mat fft_test_mat = local_mat;
>>     fft_test_mat.each_row([&](arma::cx_rowvec &a){
>>         a = arma::fft(a);
>>     });
>>     std::cout << "-----------------------------------------------------\n";
>>     std::cout << "Input matrix:\n" << local_mat << '\n';
>>     MatView(C, viewer);
>>     std::cout << "-----------------------------------------------------\n";
>>     std::cout << "Expected output matrix:\n" << fft_test_mat << '\n';
>>     MatView(F, viewer);
>>     std::cout << "-----------------------------------------------------\n";
>>     MatDestroy(&FFT_A);
>>     VecDestroy(&input);
>>     VecDestroy(&output);
>>     VecDestroy(&x);
>>     VecDestroy(&y);
>>     VecDestroy(&z);
>>     MatDestroy(&C);
>>     MatDestroy(&F);
>>     PetscViewerDestroy(&viewer);
>>     PetscFinalize();
>>     return 0;
>> }
>> 
>> For mpirun -n 1 I get the expected output (i.e. armadillo and PETSc/FFTW return the same result), but for mpirun -n x with x > 1 every value which is not assigned to rank 0 is lost and set to zero instead. Every value assigned to rank 0 is calculated correctly, as far as I can see. Did I forget something here?
>> 
>> I do not understand why your FFTW calls use the WORLD communicator. Aren't they serial FFTs over the local rows?
>> 
>>   THanks,
>> 
>>      Matt
>> Thanks,
>> 
>> Roland
>> 
>> Am 05.12.20 um 01:59 schrieb Barry Smith:
>>> 
>>>   Roland,
>>> 
>>>     If you store your matrix as described in a parallel PETSc dense matrix then you should be able to call 
>>> 
>>> fftw_plan_many_dft() directly on the value obtained with MatDenseGetArray(). You just need to pass the arguments regarding column major ordering appropriately. Probably identically to what you do with your previous code.
>>> 
>>>    Barry
>>> 
>>> 
>>>> On Dec 4, 2020, at 6:47 AM, Roland Richter <roland.richter at ntnu.no <mailto:roland.richter at ntnu.no>> wrote:
>>>> 
>>>> Ideally those FFTs could be handled in parallel, after they are not depending on each other. Is that possible with MatFFT, or should I rather use FFTW for that?
>>>> 
>>>> Thanks,
>>>> 
>>>> Roland
>>>> 
>>>> Am 04.12.20 um 13:19 schrieb Matthew Knepley:
>>>>> On Fri, Dec 4, 2020 at 5:32 AM Roland Richter <roland.richter at ntnu.no <mailto:roland.richter at ntnu.no>> wrote:
>>>>> Hei,
>>>>> 
>>>>> I am currently working on a problem which requires a large amount of
>>>>> transformations of a field E(r, t) from time space to Fourier space E(r,
>>>>> w) and back. The field is described in a 2d-matrix, with the r-dimension
>>>>> along the columns and the t-dimension along the rows.
>>>>> 
>>>>> For the transformation from time to frequency space and back I therefore
>>>>> have to apply a 1d-FFT operation over each row of my matrix. For my
>>>>> earlier attempts I used armadillo as matrix library and FFTW for doing
>>>>> the transformations. Here I could use fftw_plan_many_dft to do all FFTs
>>>>> at the same time. Unfortunately, armadillo does not support MPI, and
>>>>> therefore I had to switch to PETSc for larger matrices.
>>>>> 
>>>>> Based on the examples (such as example 143) PETSc has a way of doing
>>>>> FFTs internally by creating an FFT object (using MatCreateFFT).
>>>>> Unfortunately, I can not see how I could use that object to conduct the
>>>>> operation described above without having to iterate over each row in my
>>>>> original matrix (i.e. doing it sequential, not in parallel).
>>>>> 
>>>>> Ideally I could distribute the FFTs such over my nodes that each node
>>>>> takes several rows of the original matrix and applies the FFT to each of
>>>>> them. As example, for a matrix with a size of 4x4 and two nodes node 0
>>>>> would take row 0 and 1, while node 1 takes row 2 and 3, to avoid
>>>>> unnecessary memory transfer between the nodes while conducting the FFTs.
>>>>> Is that something PETSc can do, too?
>>>>> 
>>>>> The way I understand our setup (I did not write it), we use plan_many_dft to handle
>>>>> multiple dof FFTs, but these would be interlaced. You want many FFTs for non-interlaced
>>>>> storage, which is not something we do right now. You could definitely call FFTW directly
>>>>> if you want.
>>>>> 
>>>>> Second, above it seems like you just want serial FFTs. You can definitely create a MatFFT
>>>>> with PETSC_COMM_SELF, and apply it to each row in the local rows, or create the plan
>>>>> yourself for the stack of rows.
>>>>> 
>>>>>    Thanks,
>>>>> 
>>>>>      Matt
>>>>>  
>>>>> Thanks!
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Roland
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>> -- Norbert Wiener
>>>>> 
>>>>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
>>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20201208/4d3fb4ab/attachment-0001.html>