Hi,

I am getting some errors from a code that uses PETSc and SLEPc to diagonalise matrices in parallel. The code has been working fine on many machines but is giving problems on a Cray XT4 machine. The PETSc sparse matrix type MPIAIJ is used to store the matrix, and the SLEPc Krylov-Schur solver is then used to diagonalise it iteratively. From one run to the next the dimension of the matrices diagonalised can vary wildly, from tens or hundreds of rows up to hundreds of millions of rows. Even though the smaller matrices could easily be handled on a single core, I wanted to be able to perform all of the calculations in a single run. When running on thousands of processors, SLEPc does not like it when you have more cores than rows in the matrix, so to overcome this I create a new communicator with a sensible number of cores before each diagonalisation and free it afterwards. When running four processors on four nodes of the Cray XT4, a single diagonalisation of a matrix of dimension 4096 works fine. For a single diagonalisation of a matrix of dimension 16.7 million, however, the diagonalisation itself completes correctly but the following errors are produced afterwards:

Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
MPI_Attr_delete(86).: Invalid communicator
aborting job:
Fatal error in MPI_Attr_delete: Invalid communicator, error stack:
MPI_Attr_delete(114): MPI_Attr_delete(comm=0x84000003, keyval=-1539309567) failed
MPI_Attr_delete(86).: Invalid communicator

In both cases the new communicator is made up of all four processors. The only function that calls MPI_Attr_delete seems to be MatDestroy, which is called after the diagonalisation.

The structure of the code and the order of the relevant calls is as follows:

SlepcInitialize(&argc, &argv, (char*)0, help); // SLEPc initialisation, which in turn calls the PETSc initialisation routine.
loop over diagonalisations to be performed {

    Mat A; // matrix data structure

    // code to create the new communicator
    MPI_Comm comm_world = PETSC_COMM_WORLD;
    MPI_Comm new_comm;
    MPI_Group processes_being_used;
    MPI_Group global_group;
    int number_relevant_ranks = 0;
    int *relevant_ranks;
    // code to determine the number of ranks and to allocate and populate the relevant_ranks array.
    MPI_Comm_group(comm_world, &global_group);
    MPI_Group_incl(global_group, number_relevant_ranks, relevant_ranks, &processes_being_used);
    MPI_Comm_create(comm_world, processes_being_used, &new_comm);
    MPI_Group_free(&processes_being_used);
    MPI_Group_free(&global_group);

    // code to create and populate the matrix
    ierr = MatCreate(new_comm, &A);CHKERRQ(ierr);
    ierr = MatSetSizes(A, local_rows, PETSC_DECIDE, global_rows, global_columns);CHKERRQ(ierr);
    ierr = MatSetType(A, MATMPIAIJ);CHKERRQ(ierr);
    ierr = MatMPIAIJSetPreallocation(A, 1, d_nnz, 0, o_nnz);CHKERRQ(ierr);
    ierr = MatSetValues(A, 1, &row_idx, val_count, cols, values, ADD_VALUES);CHKERRQ(ierr);
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

    // code to create and run the eigensolver
    EPS eps;
    ierr = EPSCreate(new_comm, &eps);CHKERRQ(ierr);
    ierr = EPSSetOperators(eps, A, PETSC_NULL);CHKERRQ(ierr); // tell the solver that A is the operator.
    ierr = EPSSetProblemType(eps, EPS_HEP);CHKERRQ(ierr); // specify that this is a Hermitian eigenproblem.
    ierr = EPSSolve(eps);CHKERRQ(ierr); // run the eigenproblem solver.
    ierr = EPSGetConverged(eps, &eigenvalues_converged);CHKERRQ(ierr);
    // retrieve eigenvalues and eigenvectors.

    // clean up state
    ierr = EPSDestroy(eps);CHKERRQ(ierr);
    ierr = MatDestroy(A);CHKERRQ(ierr);
    MPI_Comm_free(&new_comm); // free the communicator
}

I have tried inserting a barrier between the MatDestroy and the MPI_Comm_free, to no avail, and I have also added a check to ensure the communicator is not null before calling MatDestroy:

if ( new_comm != MPI_COMM_NULL ) ...
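That is, the tail of the loop body with those two changes now looks roughly like this (I keep the guard around the MPI_Comm_free as well, since the free also needs a valid communicator):

    ierr = EPSDestroy(eps);CHKERRQ(ierr);
    if ( new_comm != MPI_COMM_NULL ) {
        ierr = MatDestroy(A);CHKERRQ(ierr);
        MPI_Barrier(new_comm);    // barrier between the MatDestroy and the MPI_Comm_free
        MPI_Comm_free(&new_comm); // free the communicator
    }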
At this stage I am confused as to how best to proceed. I have been considering adding a macro that reverts to using PETSC_COMM_WORLD for everything, but the fact that the smaller case works while the larger one does not is confusing me. I have also considered memory errors, but I do not have direct access to this machine and am not sure which debugging or memory-checking tools can be used on it. Any suggestions or ideas are appreciated.

Regards,

Niall.
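P.S. For completeness: the rank-selection code that is only indicated by a comment in the listing above essentially caps the number of ranks so that it never exceeds the number of rows in the matrix. In simplified form it amounts to something like the following (taking the first ranks of the world communicator here is just for illustration):

    int world_size;
    int i;
    MPI_Comm_size(comm_world, &world_size);
    // never use more ranks than the matrix has rows
    number_relevant_ranks = (global_rows < world_size) ? (int) global_rows : world_size;
    relevant_ranks = (int *) malloc(number_relevant_ranks * sizeof(int));
    for (i = 0; i < number_relevant_ranks; i++)
        relevant_ranks[i] = i; // include the first number_relevant_ranks ranks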