[petsc-users] MPI_Attr_delete from MatDestroy

Jose E. Roman jroman at dsic.upv.es
Thu Jul 29 08:37:35 CDT 2010


On 29/07/2010, Niall Moran wrote:

> Hi,
> 
> I am getting some errors from a code that uses PETSc and SLEPc to diagonalise matrices in parallel. The code has been working fine on many machines but is giving problems on a Cray XT4. The PETSc sparse matrix type MPIAIJ is used to store the matrix, and the SLEPc Krylov-Schur solver is then used to diagonalise it iteratively. From run to run the dimension of the matrices varies wildly, from tens or hundreds of rows up to hundreds of millions. Even though the smaller matrices can easily be handled on a single core, I wanted to be able to perform all calculations in a single run. When running on thousands of processors, however, SLEPc runs into trouble when there are more cores than rows in the matrix.

In slepc-dev I have made a fix for the case when the number of rows assigned to one of the processes is zero. In slepc-3.0.0 I don't see this problem.
Jose
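To see why a rank can end up with zero rows: when the communicator has more processes than the matrix has rows and the local sizes are left as PETSC_DECIDE, the near-equal split leaves some ranks empty. The following standalone sketch only illustrates that arithmetic (the sizes are made up; PETSc's actual policy lives in PetscSplitOwnership):

/* Illustration only: near-equal row splitting. With N = 3 rows over P = 4
   processes, ranks 0-2 get one row each and rank 3 owns zero rows -- the
   situation the slepc-dev fix addresses. */
#include <stdio.h>

int main(void)
{
  int N = 3;                        /* global rows (hypothetical)            */
  int P = 4;                        /* processes in the communicator         */
  for (int rank = 0; rank < P; rank++) {
    int nlocal = N / P + ((N % P) > rank ? 1 : 0);
    printf("rank %d owns %d row(s)\n", rank, nlocal);
  }
  return 0;
}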

> To overcome this I create a new communicator with a sensible number of cores before each diagonalisation and free it afterwards. When running four processes across four nodes of the Cray XT4, a single diagonalisation of a matrix of dimension 4096 works fine. For a single diagonalisation of a matrix of dimension 16.7 million, the diagonalisation itself also completes correctly, but the following errors are produced afterwards.
> 
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
> MPI_Attr_delete(86).: Invalid communicator
> aborting job:
> Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
> MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
> MPI_Attr_delete(86).: Invalid communicator 
> 
> In both cases the new communicator is made up of all four processes. The only function that calls MPI_Attr_delete seems to be MatDestroy, which is called after the diagonalisation.
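For context on what the failing call does: as I understand it, PETSc uses the MPI attribute-caching interface to attach its own bookkeeping (a reference counter, an inner communicator) to the communicator it is handed, and MatDestroy ends up calling MPI_Attr_delete when it cleans that up. The self-contained example below only illustrates the caching interface itself; the keyval and value are made up and have nothing to do with PETSc's actual attributes.

/* Standalone illustration of MPI attribute caching (hypothetical keyval and
   value; not PETSc's internals).  Deleting or reading an attribute through a
   communicator handle that has already been freed is erroneous, which is
   consistent with the "Invalid communicator" message above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int keyval, value = 42, *ptr, flag;
  MPI_Init(&argc, &argv);
  MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);
  MPI_Attr_put(MPI_COMM_WORLD, keyval, &value);       /* cache a value        */
  MPI_Attr_get(MPI_COMM_WORLD, keyval, &ptr, &flag);  /* flag=1, *ptr == 42   */
  if (flag) printf("cached value: %d\n", *ptr);
  MPI_Attr_delete(MPI_COMM_WORLD, keyval);            /* remove the attribute */
  MPI_Keyval_free(&keyval);
  MPI_Finalize();
  return 0;
}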
> The structure of the code and order of relevant calls is as follows: 
> 
> SlepcInitialize(&argc,&argv,(char*)0,help); //SLEPc initialisation which in turn calls PETSc initialisation routine. 
> loop over diagonalisations to be performed { 
> Mat A; //matrix data structure
> 
> //code to create new communicator
> MPI_Comm  comm_world = PETSC_COMM_WORLD;
> MPI_Comm new_comm;
> MPI_Group processes_being_used;
> MPI_Group global_group;
> int number_relevant_ranks = 0;
> int *relevant_ranks;
> //code to determine number and allocate and populate relevant_ranks array.
> MPI_Comm_group(comm_world,&global_group);
> MPI_Group_incl(global_group,number_relevant_ranks,relevant_ranks,&processes_being_used);
> MPI_Comm_create(comm_world,processes_being_used,&new_comm);
> MPI_Group_free(&processes_being_used);
> MPI_Group_free(&global_group);
> 
> //code to create and populate matrix
> ierr = MatCreate(new_comm,&A);CHKERRQ(ierr);
> ierr = MatSetSizes(A, local_rows, PETSC_DECIDE, global_rows ,global_columns);CHKERRQ(ierr);
> ierr = MatSetType(A,MATMPIAIJ);CHKERRQ(ierr);
> ierr = MatMPIAIJSetPreallocation(A, 1, d_nnz, 0, o_nnz);CHKERRQ(ierr);
> ierr = MatSetValues(A, 1, &row_idx, val_count, cols, values, ADD_VALUES); CHKERRQ(ierr);
> ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
> ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
> 
> //code to create and run eigensolver
> EPS eps;
> ierr = EPSCreate(new_comm,&eps);CHKERRQ(ierr);
> ierr = EPSSetOperators(eps,A,PETSC_NULL);CHKERRQ(ierr); //tell solver that A is the operator.
> ierr = EPSSetProblemType(eps, EPS_HEP);CHKERRQ(ierr); //specify that this is a hermitian eigenproblem.
> ierr = EPSSolve(eps);CHKERRQ(ierr); // run the eigen problem solver.
> ierr = EPSGetConverged(eps,&eigenvalues_converged);CHKERRQ(ierr);
> // retrieve eigenvalues and vectors.
> 
> //cleanup state
> ierr = EPSDestroy(eps);CHKERRQ(ierr);
> ierr = MatDestroy(A);CHKERRQ(ierr);
> MPI_Comm_free(&new_comm);  //free the communicator
> }
> 
> I have tried inserting a barrier between the MatDestroy and MPI_Comm_free to no avail, and have also added a check to ensure the communicator is not null before calling MatDestroy:
> if ( new_comm != MPI_COMM_NULL ) ... 
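This may not be the culprit here, since in the failing case all four ranks are in the group, but for reference: MPI_Comm_create returns MPI_COMM_NULL on every rank that is not in processes_being_used, and those ranks have to skip every call that takes new_comm, MPI_Comm_free included (freeing MPI_COMM_NULL is itself an error). Reusing the variables from the listing above, and with a hypothetical do_solve() standing in for the MatCreate/EPSSolve/MatDestroy sequence, the guard would look roughly like:

/* Sketch only: do_solve() is a hypothetical stand-in for the
   MatCreate/EPSSolve/MatDestroy sequence shown above. */
MPI_Comm_create(comm_world, processes_being_used, &new_comm);
MPI_Group_free(&processes_being_used);
MPI_Group_free(&global_group);
if (new_comm != MPI_COMM_NULL) {   /* only ranks that belong to new_comm */
  do_solve(new_comm);              /* create, solve, destroy on new_comm */
  MPI_Comm_free(&new_comm);        /* never call this with MPI_COMM_NULL */
}
/* excluded ranks fall through to the next iteration of the loop */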
> 
> At this stage I am confused as to how best to proceed. I have been considering adding a macro that reverts to using PETSC_COMM_WORLD for everything. However, the fact that the smaller case works and the larger one does not is confusing me. I have also considered memory errors, but I do not have direct access to this machine and am not sure which debugging or memory-checking tools are available on it. Any suggestions or ideas are appreciated.
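Regarding the fallback macro idea, a compile-time switch along the following lines would let the code revert to PETSC_COMM_WORLD without touching the solver calls (SOLVE_ON_WORLD_COMM and SOLVE_COMM are made-up names, and the macro must be used inside the loop where new_comm is in scope):

/* Hypothetical compile-time switch: build with -DSOLVE_ON_WORLD_COMM to run
   every diagonalisation on PETSC_COMM_WORLD instead of the sub-communicator. */
#ifdef SOLVE_ON_WORLD_COMM
  #define SOLVE_COMM PETSC_COMM_WORLD
#else
  #define SOLVE_COMM new_comm
#endif

ierr = MatCreate(SOLVE_COMM, &A);CHKERRQ(ierr);
ierr = EPSCreate(SOLVE_COMM, &eps);CHKERRQ(ierr);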
> 
> Regards,
> 
> Niall. 
> 