[petsc-dev] PetscCUDAInitialize

Smith, Barry F. bsmith at mcs.anl.gov
Thu Sep 19 21:24:00 CDT 2019

> On Sep 19, 2019, at 9:11 PM, Balay, Satish <balay at mcs.anl.gov> wrote:
> 
> On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
> 
>> 
>>   This should be reported on gitlab, not in email.
>> 
>>   Anyway, my interpretation is that the machine runs low on swap space, so the OS is killing things. Once, Satish and I sat down and checked the system logs on one machine that had little swap, and we saw system messages about low swap at exactly the time the tests were killed. Satish is resistant to increasing the swap; I don't know why. Other times we see these kills and they may not be due to swap; then they are a mystery.
> 
> That was on BSD.
> 
> This machine has 8GB of swap, which should be sufficient. And this issue [on this machine] was triggered
> only by this MR - which was weird.

   Does it happen every time to the same examples?

   If you log in and run that one test, does it happen?

   If the MR changes scatter code, could it have broken something?

   We need to know why this is happening. Otherwise our test system will drive us nuts with errors we have no clue about.

  
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

  So MPI thinks MPI_Abort() was called with a return code of 1. PETSc calls MPI_Abort() in a truckload of places, usually with a return code of 1. So the first thing that needs to be done is to fix PETSc so that each distinct call to MPI_Abort() has a unique return code. Then, in theory at least, we know where it aborted.

include/petscerror.h:#define CHKERRABORT(comm,ierr) do {if (PetscUnlikely(ierr)) {PetscError(PETSC_COMM_SELF,__LINE__,PETSC_FUNCTION_NAME,__FILE__,ierr,PETSC_ERROR_REPEAT," ");MPI_Abort(comm,ierr);}} while (0)
include/petscerror.h:    or CHKERRABORT(comm,n) to have MPI_Abort() returned immediately.
src/contrib/fun3d/incomp/flow.c:    /*ierr = MPI_Abort(MPI_COMM_WORLD,1);*/
src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
src/docs/tao_tex/manual/part1.tex:application called MPI_Abort(MPI_COMM_WORLD, 73) - process 0
src/docs/tex/manual/developers.tex:  \item \lstinline{PetscMPIAbortErrorHandler()}, which calls \lstinline{MPI_Abort()} after printing the error message; and
src/snes/examples/tests/ex12f.F:        call MPI_Abort(PETSC_COMM_WORLD,0,ierr)
src/snes/examples/tutorials/ex30.c:  MPI_Abort(PETSC_COMM_SELF,1);
src/sys/error/adebug.c:  MPI_Abort(PETSC_COMM_WORLD,1);
src/sys/error/err.c:      If this is called from the main() routine we call MPI_Abort() instead of
src/sys/error/err.c:  if (ismain) MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
src/sys/error/errstop.c:  MPI_Abort(PETSC_COMM_WORLD,n);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/signal.c:  if (ierr) MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/signal.c:  MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
src/sys/fsrc/somefort.F:!     when MPI_Abort() is called directly by CHKERRQ(ierr);
src/sys/fsrc/somefort.F:      call MPI_Abort(comm,ierr,nierr)
src/sys/ftn-custom/zutils.c:    MPI_Abort(PETSC_COMM_WORLD,1);
src/sys/ftn-custom/zutils.c:      MPI_Abort(PETSC_COMM_WORLD,1);
src/sys/logging/utils/stagelog.c:    MPI_Abort(MPI_COMM_WORLD, PETSC_ERR_SUP);
src/sys/mpiuni/mpi.c:int MPI_Abort(MPI_Comm comm,int errorcode)
src/sys/mpiuni/mpitime.c:    if (!QueryPerformanceCounter(&StartTime)) MPI_Abort(MPI_COMM_WORLD,1);
src/sys/mpiuni/mpitime.c:    if (!QueryPerformanceFrequency(&PerfFreq)) MPI_Abort(MPI_COMM_WORLD,1);
src/sys/mpiuni/mpitime.c:  if (!QueryPerformanceCounter(&CurTime)) MPI_Abort(MPI_COMM_WORLD,1);
src/sys/objects/init.c:  in the debugger hence we call abort() instead of MPI_Abort().
src/sys/objects/init.c:void Petsc_MPI_AbortOnError(MPI_Comm *comm,PetscMPIInt *flag,...)
src/sys/objects/init.c:  if (ierr) MPI_Abort(*comm,*flag); /* hopeless so get out */
src/sys/objects/init.c:      ierr = MPI_Comm_create_errhandler(Petsc_MPI_AbortOnError,&err_handler);CHKERRQ(ierr);
src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
src/ts/examples/tutorials/ex48.c:  if (dim < 2) {MPI_Abort(MPI_COMM_WORLD,1); return;} /* this is needed so that the clang static analyzer does not generate a warning about variables used by not set */
src/vec/vec/examples/tests/ex32f.F:        call MPI_Abort(MPI_COMM_WORLD,0,ierr)
src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
src/vec/vec/utils/comb.c:    MPI_Abort(MPI_COMM_SELF,1);
src/vec/vec/utils/comb.c:      MPI_Abort(MPI_COMM_SELF,1);

  Junchao,

     Maybe you could fix this and make an MR? I don't know how to organize the numbering. Should we have a central list of all the numbers, as macros in petscerror.h, like

#define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10 

etc?
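
A minimal sketch of what such a central list might look like (the macro names and values below are illustrative, not existing PETSc symbols):

  /* include/petscerror.h: one unique abort code per call site */
  #define PETSC_MPI_ABORT_ADEBUG_ATTACH  64  /* src/sys/error/adebug.c   */
  #define PETSC_MPI_ABORT_ERRSTOP        65  /* src/sys/error/errstop.c  */
  #define PETSC_MPI_ABORT_VEC_COMB       66  /* src/vec/vec/utils/comb.c */

  /* each call site then passes its own code instead of 1, e.g. */
  MPI_Abort(PETSC_COMM_WORLD,PETSC_MPI_ABORT_ADEBUG_ATTACH);

With that, a message like "application called MPI_Abort(MPI_COMM_WORLD, 64)" would point straight at the call site that aborted.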

   Barry

> 
> Satish
> 
> 
>> 
>>   You can retry the particular job by clicking on the little circle after the job name and see what happens the next time it runs.
>> 
>>   Barry
>> 
>>   It may be that the -j and -l options for some systems need to be adjusted down slightly, and this will prevent these kills. Satish, can that be done in the examples/arch-ci* files with configure options, or in the runner files, or in the yaml file?
> 
> configure has the options --with-make-np, --with-make-test-np, and --with-make-load
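> 
> For example (the values here are illustrative, not recommendations):
> 
>   ./configure --with-make-np=8 --with-make-test-np=4 --with-make-load=6.0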
> 
> Satish
> 
>> 
>> 
>> 
>>> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>>> 
>>> All the failed tests just said "application called MPI_Abort" and had no stack trace. They are not CUDA tests. I updated SF to avoid CUDA-related initialization when it is not needed. Let's see the new test result.
>>> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
>>> #	application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>> 
>>> 
>>> --Junchao Zhang
>>> 
>>> 
>>> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>> 
>>> Failed? That alone means nothing; send a link or cut and paste the error.
>>> 
>>> It could be that, since we have multiple separate tests running at the same time, they overload the GPU or cause some inconsistent behavior that doesn't appear every time the tests are run.
>>> 
>>>   Barry
>>> 
>>> Maybe we need to serialize all the tests that use the GPUs. We just trust gnumake for the parallelism; maybe you could somehow add dependencies to get gnumake to achieve this?
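>>> 
>>> One way to sketch this (the test target names below are hypothetical) is to chain the GPU tests with order-only prerequisites, so gnumake can never schedule two of them concurrently:
>>> 
>>>   # makefile sketch: each GPU test must wait for the previous one
>>>   runex2_cuda: | runex1_cuda
>>>   runex3_cuda: | runex2_cuda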
>>> 
>>> 
>>> 
>>> 
>>>> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>>>> 
>>>> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>>> 
>>>> 
>>>>> On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <jczhang at mcs.anl.gov> wrote:
>>>>> 
>>>>> I saw your update. In PetscCUDAInitialize we have
>>>>> 
>>>>>      /* First get the device count */
>>>>>      err   = cudaGetDeviceCount(&devCount);
>>>>> 
>>>>>      /* next determine the rank and then set the device via a mod */
>>>>>      ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
>>>>>      device = rank % devCount;
>>>>>    }
>>>>>    err = cudaSetDevice(device);
>>>>> 
>>>>> If we rely on the first CUDA call to do the initialization, how would CUDA know about this MPI information?
>>>> 
>>>>  It doesn't, so it does whatever it does (which may be dumb).
>>>> 
>>>>  Are you proposing something?
>>>> 
>>>> No. My test failed in CI with -cuda_initialize 0 on frog, but I could not reproduce it. I'm investigating.
>>>> 
>>>>  Barry
>>>> 
>>>>> 
>>>>> --Junchao Zhang
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>>>>> 
>>>>>  Fixed the docs. Thanks for pointing out the lack of clarity.
>>>>> 
>>>>> 
>>>>>> On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev <petsc-dev at mcs.anl.gov> wrote:
>>>>>> 
>>>>>> Barry,
>>>>>> 
>>>>>> I saw you added these in init.c
>>>>>> 
>>>>>> 
>>>>>> +  -cuda_initialize - do the initialization in PetscInitialize()
>>>>>> 
>>>>>> Notes:
>>>>>>   Initializing cuBLAS takes about 1/2 second, thus it is done by default in PetscInitialize() before logging begins
>>>>>> 
>>>>>> But I did not get the other case: with -cuda_initialize 0, when will CUDA be initialized?
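>>>>>> 
>>>>>> (For reference, ./app below is a placeholder for any PETSc application binary; the option is passed on the command line like any other PETSc option:
>>>>>> 
>>>>>>   ./app -cuda_initialize 0
>>>>>> )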
>>>>>> --Junchao Zhang
>>>>> 
>>> 
>> 
> 