[petsc-dev] Bug in MatShift_MPIAIJ ?

Barry Smith bsmith at mcs.anl.gov
Wed Oct 21 10:36:23 CDT 2015


> On Oct 21, 2015, at 10:28 AM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
> 
> On 21/10/15 10:56 AM, Barry Smith wrote:
>> 
>>   I similarly saw the same behavior in the debugger that you reported and was mystified.
>>   This is why I called it a nasty bug. The setting of a different nonew flag meant that one process called the MPI_Allreduce() at the end of MatAssemblyEnd_MPIAIJ()
>> 
>>   if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
>>     PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
>>     ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
>>   }
>> 
>> while the other process skipped it. The process that skipped went on to the MPI_Allreduce() inside PetscValidLogicalCollectiveEnum(), so data was exchanged between two MPI_Allreduce() calls that serve very different purposes. Since MPI_Allreduce() doesn't have any concept of tags, if one process skips an Allreduce call, the next Allreduce call it makes gets matched with it.
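
    To see this cross-matching in isolation, consider a minimal two-rank demo (a hypothetical standalone program, not PETSc code) in which rank 1 in effect skips its first Allreduce. Calls on a communicator pair up purely by order, so unrelated data is exchanged without any MPI error:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank, in, out;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    /* Rank 0 performs the "assembly" reduction, then the "validation" one. */
    in = 1000; MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("[0] assembly   sum = %d\n", out);
    in = 7;    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("[0] validation sum = %d\n", out);
  } else {
    /* Rank 1 skips the "assembly" reduction, so its "validation" call is
       matched with rank 0's "assembly" call: no error, just wrong data. */
    in = 7;    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("[1] validation sum = %d (got rank 0's assembly value)\n", out);
    in = 5;    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("[1] next collective sum = %d\n", out);
  }
  MPI_Finalize();
  return 0;
}

    Run with two processes, every reduction "succeeds", yet rank 1's validation-style call receives the sum intended for rank 0's assembly-style call.
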
> 
> ok!  I understand now...
> 
> Then, would wrapping these calls (in debug mode) with an MPI_Barrier before/after have kept the person debugging from being misled?  In our code, we verify (in debug mode) that all processes make the same collective MPI calls with a kind of barrier which communicates the line number and the file name (transformed into an int value), so that a process blocks at the first wrongly matched MPI call...

    This is a good idea. Do you have a C macro implementation you would be willing to share? It would be trivial to add to PETSc with something like

#if defined(PETSC_USE_DEBUG)
#define MPIU_Allreduce(...)  /* macro that first checks the file and line number, then calls MPI_Allreduce() */
#else
#define MPIU_Allreduce MPI_Allreduce
#endif
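
    Along the lines you describe, a minimal sketch (assuming a hypothetical helper MPIU_Allreduce_Check(), not an existing PETSc or MPI API) could reduce a hash of __FILE__/__LINE__ first and abort on mismatch before doing the real reduction:

#include <mpi.h>
#include <stdio.h>

/* Fold the file name and line number into a single int (djb2 hash). */
static int MPIU_CallSiteHash(const char *file, int line)
{
  unsigned int h = 5381;
  while (*file) h = 33*h + (unsigned char)*file++;
  return (int)(h & 0x7fffffffu) ^ line;
}

/* Reduce [site,-site] with MPI_MIN: out[0] is the minimum site hash and
   -out[1] the maximum, so they agree iff every process is at the same
   call site.  Only then do the real MPI_Allreduce(). */
static int MPIU_Allreduce_Check(const void *sbuf, void *rbuf, int count,
                                MPI_Datatype dtype, MPI_Op op, MPI_Comm comm,
                                const char *file, int line)
{
  int site = MPIU_CallSiteHash(file, line);
  int in[2] = {site, -site}, out[2];

  MPI_Allreduce(in, out, 2, MPI_INT, MPI_MIN, comm);
  if (out[0] != -out[1]) {
    fprintf(stderr, "Collective mismatch detected at %s:%d\n", file, line);
    MPI_Abort(comm, 1);
  }
  return MPI_Allreduce(sbuf, rbuf, count, dtype, op, comm);
}

#if defined(PETSC_USE_DEBUG)
#define MPIU_Allreduce(sb, rb, n, t, op, comm) \
        MPIU_Allreduce_Check(sb, rb, n, t, op, comm, __FILE__, __LINE__)
#else
#define MPIU_Allreduce MPI_Allreduce
#endif

    Because the call-site hash travels in the checking Allreduce itself, even the skipped-call scenario above is caught: the two cross-matched checks see different hashes and abort instead of silently exchanging unrelated data.
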


> 
> 
>> 
>>   The way I finally found the bug was to put a breakpoint in MPI_Allreduce() and run the program until the two processes were calling Allreduce() from different places.
> 
> Okay... wow... you modified the MPI code?

   No, I just ran both processes in the debugger with the breakpoint, doing a cont each time until they did not match.
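
   For concreteness, the per-process debugger session looks roughly like this (a sketch; the launcher invocation and program name are placeholders):

$ mpiexec -n 2 xterm -e gdb ./app    # one gdb window per process
(gdb) break MPI_Allreduce
(gdb) run
(gdb) backtrace 2                    # compare the calling frame in each window
(gdb) continue                       # repeat until the two call sites differ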

  Barry

>  I was wondering whether there is a runtime or compile-time option for Open MPI/MPICH to do this for all relevant collective MPI calls?
> 
> Anyway, thanks again!
> 
> Eric
> 



