[petsc-dev] Bug in MatShift_MPIAIJ ?
Barry Smith
bsmith at mcs.anl.gov
Wed Oct 21 10:36:23 CDT 2015
> On Oct 21, 2015, at 10:28 AM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
>
> On 21/10/15 10:56 AM, Barry Smith wrote:
>>
>> I similarly saw the same behavior in the debugger that you reported and was mystified.
>> This is why I called it a nasty bug. The setting of a different nonew flag meant that one process called the MPI_Allreduce() at the end of MatAssemblyEnd_MPIAIJ():
>>
>> if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
>> PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
>> ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
>> }
>>
>> but the other process skipped this call. Thus the other process got to the MPI_Allreduce() in PetscValidLogicalCollectiveEnum() and exchanged data between the two MPI_Allreduce() calls, which serve very different purposes. Since MPI_Allreduce() doesn't have any concept of tags, if one process skips an Allreduce call then the next Allreduce call gets matched with it.
>
> ok! I understand now...
>
> Then would wrapping these calls (in debug mode) with an MPI_Barrier before/after have prevented misleading the person doing the debugging? In our code we check (in debug mode) that all processes make the same collective MPI calls, using a kind of barrier that communicates the line number and the file name (transformed into an int value), so that a process blocks at the first wrongly matched MPI call...
This is a good idea. Do you have a C macro implementation you would be willing to share? It would be trivial to add to PETSc if you had something like:
#if defined(PETSC_USE_DEBUG)
#define MPIU_Allreduce(...) /* macro that first checks file and line number, then calls MPI_Allreduce() */
#else
#define MPIU_Allreduce MPI_Allreduce
#endif
>
>
>>
>> The way I finally found the bug was to put a breakpoint in MPI_Allreduce() and run the program until the two processes were calling Allreduce() from different places.
>
> okay... wow... you modified the MPI code?
No, I just ran both processes in the debugger with the breakpoint set, issuing a cont each time until the call sites did not match.
Barry
> I was wondering whether there is a runtime or compile-time option for Open MPI/MPICH to do this check for all relevant collective MPI calls?
>
> Anyway, thanks again!
>
> Eric
>