[petsc-dev] [GPU] src/ksp/ksp/examples/tutorials/ex9 fails with mpirun -np 2

Karl Rupp rupp at iue.tuwien.ac.at
Tue Jul 29 15:18:56 CDT 2014


Hi Pierre,

Dominic Meiser created a couple of pull requests addressing GPU usage with 
multiple processes (https://bitbucket.org/petsc/petsc/pull-requests). 
I'll work through them in the next few days, so chances are good that they 
will fix your problem as well. I'll ping you when they're ready. Could you 
nevertheless send me your modified ex30.c?

Best regards,
Karli


On 07/29/2014 05:09 PM, Projet_TRIOU wrote:
> Yes Matt, I am calling VecAssemblyBegin/End() after VecSetValues.
>
> What gave me this idea is the ex30 example in
> src/ksp/ksp/examples/tests/ex30.c, where a matrix is read, then VecSet is
> called, then eventually VecSetValues (and it was working on GPU with
> mpiexec -np 2 ...). I played with it a lot and discovered that VecSet was
> a workaround for the crash in VecSetValues.
>
> I can send my modified ex30.c test to reproduce the bug.
>
> Pierre
>> On Tue, Jul 29, 2014 at 9:58 AM, Projet_TRIOU <triou at cea.fr> wrote:
>>
>>     Hello Karl,
>>
>>     For your information, I figured out why my code crashes on the GPU
>>     with more than one MPI process.
>>
>>     I am solving Ax=B with KSPSetInitialGuessNonzero set to true, so I also
>>     need to fill x with VecSetValues before each solve.
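>>
>>     In short, the solve is set up roughly like this (a sketch, not the
>>     exact code; ksp, b and x are the usual PETSc objects):
>>
>>     KSPSetInitialGuessNonzero(ksp, PETSC_TRUE); // reuse x as initial guess
>>     // ... fill x with the previous solution via VecSetValues (see below) ...
>>     KSPSolve(ksp, b, x);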
>>
>>     VecSetValues works on the GPU with np=1 but not with np>1. I noticed
>>     that filling x with VecSet worked fine, so when I create x, I now also
>>     initialize it with VecSet (I suppose it copies/creates something on the
>>     GPU that VecSetValues doesn't):
>>
>>     Vec x;
>>     VecCreate(PETSC_COMM_WORLD,&x);
>>     VecSetSizes(x, nb_rows_, PETSC_DECIDE);
>>     VecSetType(x, VECMPICUSP);
>>     VecSet(x,0.0); // Needed on GPU during parallel calculation, else it crashes
>>
>>     // Later:
>>     for (int i=0; i<size; i++)
>>     {
>>         VecSetValues(b, 1, &colonne_globale, &secmem(i), INSERT_VALUES);
>>         VecSetValues(x, 1, &colonne_globale, &solution(i), INSERT_VALUES);
>>         colonne_globale++;
>>     }
>>
>>
>> Are you calling VecAssemblyBegin/End() after this? It is required for
>> VecSetValues().
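>>
>> For example, a minimal sketch of what should follow the loop above:
>>
>>     VecAssemblyBegin(b); VecAssemblyEnd(b);
>>     VecAssemblyBegin(x); VecAssemblyEnd(x);
>>     // only after this are b and x ready to be passed to KSPSolve()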
>>
>>    Matt
>>
>>     Thanks for your help,
>>
>>     Pierre
>>>     Thanks Karl, and sorry for my mistake. Indeed, ex9 runs in parallel
>>>     with -pc_type none or -pc_type jacobi.
>>>     Good to know that I now have an example with VecSetValues that I can
>>>     try to replicate in my code.
>>>     I will update the thread as soon as I get back to more tests on it.
>>>
>>>     Pierre
>>>>     Hi Pierre,
>>>>
>>>>     I could reproduce the problem, but I could also verify that the
>>>>     problem you ran into was due to a currently unsupported
>>>>     block-Jacobi preconditioner. That is, if you first comment out the
>>>>     KSPSolve for the ksp2 object in ex9, recompile ex9, and then run
>>>>
>>>>     $> mpiexec -np 2 ./ex9 -ksp_monitor -mat_type aijcusp -vec_type
>>>>     cusp -pc_type none
>>>>
>>>>     things should work out. This may also be applicable to your case:
>>>>     give it a try with -pc_type none and let us know whether the problem
>>>>     with VecSetValues remains. (The use of no preconditioner is certainly
>>>>     not great, but at least it may give you a first result.)
>>>>
>>>>     Best regards,
>>>>     Karli
>>>>
>>>>
>>>>
>>>>     On 03/26/2014 06:38 PM, Projet_TRIOU wrote:
>>>>>     Hello all,
>>>>>
>>>>>     My parallel code is very close to running on the GPU thanks to
>>>>>     PETSc, but I am struggling with the best way to fill the PETSc
>>>>>     vectors so that it runs on both CPU and GPU.
>>>>>
>>>>>     In the past I was using VecCreateMPIWithArray to create the vectors,
>>>>>     but I had trouble with it on the GPU.
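>>>>>
>>>>>     Roughly, that call looked like this (a sketch, not the exact code;
>>>>>     host_array is just an illustrative name for the wrapped array):
>>>>>
>>>>>     // wrap an existing host array in a parallel PETSc vector (sketch)
>>>>>     ierr = VecCreateMPIWithArray(PETSC_COMM_WORLD, 1, nb_rows, nb_rows_tot,
>>>>>                                  host_array, &SecondMembrePetsc_);
>>>>>     check(ierr);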
>>>>>
>>>>>     So I followed the example in src/ksp/ksp/examples/tutorials/ex2.c
>>>>>     and now create the vectors like this:
>>>>>
>>>>>        // Build x
>>>>>        ierr = VecCreate(PETSC_COMM_WORLD,&SecondMembrePetsc_); check(ierr);
>>>>>        ierr = VecSetSizes(SecondMembrePetsc_, nb_rows, nb_rows_tot); check(ierr);
>>>>>        ierr = VecSetFromOptions(SecondMembrePetsc_); check(ierr);
>>>>>        // Build b
>>>>>        ierr = VecDuplicate(SecondMembrePetsc_,&SolutionPetsc_); check(ierr);
>>>>>
>>>>>     And I fill them with the VecSetValues function. It runs well on CPU
>>>>>     and GPU, but crashes only in parallel on the GPU. If I use VecSet
>>>>>     instead of VecSetValues, it doesn't crash (but of course VecSet is
>>>>>     not enough for me :-)
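>>>>>
>>>>>     The fill is roughly the following sketch (the index and value names
>>>>>     here are only illustrative, not the actual ones in my code):
>>>>>
>>>>>     for (int i=0; i<nb_rows; i++)
>>>>>     {
>>>>>        PetscInt row = first_global_row + i;  // illustrative global index
>>>>>        ierr = VecSetValues(SecondMembrePetsc_, 1, &row, &rhs[i],
>>>>>                            INSERT_VALUES); check(ierr);
>>>>>     }
>>>>>     ierr = VecAssemblyBegin(SecondMembrePetsc_); check(ierr);
>>>>>     ierr = VecAssemblyEnd(SecondMembrePetsc_); check(ierr);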
>>>>>
>>>>>     I tried to find an example to reproduce the problem for you, and I
>>>>>     think src/ksp/ksp/examples/tutorials/ex9 (it uses VecSetValues) is a
>>>>>     good one.
>>>>>
>>>>>     Or did I miss something (I also tried VecPlaceArray/VecRestoreArray,
>>>>>     but without success on the GPU)?
>>>>>
>>>>>     Thanks, and yes you are right,  "WARNING: Using GPUs
>>>>>     effectively is
>>>>>     difficult!" :-)
>>>>>
>>>>>     Pierre
>>>>>
>>>>>     sitre.intra.cea.fr:/work/triou/git/petsc-dev/Trio_U/lib/src/LIBPETSC/petsc/linux_opt/src/ksp/ksp/examples/tutorials
>>>>>
>>>>>      > ./ex9 -ksp_monitor
>>>>>        0 KSP Residual norm 6.612932697792e+00
>>>>>        1 KSP Residual norm 4.261830032389e-01
>>>>>        2 KSP Residual norm 2.121746090851e-02
>>>>>        3 KSP Residual norm 1.233779841608e-03
>>>>>        4 KSP Residual norm 1.265903168531e-05
>>>>>        0 KSP Residual norm 1.309416176382e-05
>>>>>        0 KSP Residual norm 1.404919664063e-05
>>>>>
>>>>>     sitre.intra.cea.fr:/work/triou/git/petsc-dev/Trio_U/lib/src/LIBPETSC/petsc/linux_opt/src/ksp/ksp/examples/tutorials
>>>>>
>>>>>      > mpiexec -np 2 ./ex9 -ksp_monitor
>>>>>        0 KSP Residual norm 2.496821857304e+02
>>>>>        1 KSP Residual norm 4.522206074831e+01
>>>>>        2 KSP Residual norm 1.959482408314e+01
>>>>>        3 KSP Residual norm 7.002013703407e+00
>>>>>        4 KSP Residual norm 2.144105201713e+00
>>>>>        5 KSP Residual norm 1.780095080270e-01
>>>>>        6 KSP Residual norm 5.642702243268e-02
>>>>>        7 KSP Residual norm 6.439343992306e-03
>>>>>        8 KSP Residual norm 3.012756374415e-04
>>>>>     Norm of error 0.000249108, Iterations 8
>>>>>     Norm of error 0.000715584, Iterations 6
>>>>>        0 KSP Residual norm 3.422287562824e-04
>>>>>     Norm of error 0.000249108, Iterations 0
>>>>>     Norm of error 0.000192805, Iterations 7
>>>>>        0 KSP Residual norm 4.140588954098e-04
>>>>>     Norm of error 0.000249108, Iterations 0
>>>>>     Norm of error 0.000109507, Iterations 7
>>>>>
>>>>>     sitre.intra.cea.fr:/work/triou/git/petsc-dev/Trio_U/lib/src/LIBPETSC/petsc/linux_opt/src/ksp/ksp/examples/tutorials
>>>>>
>>>>>      > ./ex9 -ksp_monitor -mat_type aijcusp -vec_type cusp
>>>>>        0 KSP Residual norm 6.612932697792e+00
>>>>>        1 KSP Residual norm 4.261830032389e-01
>>>>>        2 KSP Residual norm 2.121746090851e-02
>>>>>        3 KSP Residual norm 1.233779841608e-03
>>>>>        4 KSP Residual norm 1.265903168531e-05
>>>>>        0 KSP Residual norm 1.309416176403e-05
>>>>>        0 KSP Residual norm 1.404919664088e-05
>>>>>
>>>>>     sitre.intra.cea.fr:/work/triou/git/petsc-dev/Trio_U/lib/src/LIBPETSC/petsc/linux_opt/src/ksp/ksp/examples/tutorials
>>>>>
>>>>>      > mpiexec -np 2 ./ex9 -ksp_monitor -mat_type aijcusp -vec_type
>>>>>     cusp
>>>>>     [0]PETSC ERROR:
>>>>>     ------------------------------------------------------------------------
>>>>>
>>>>>     [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
>>>>>     Violation,
>>>>>     probably memory access out of range
>>>>>     [0]PETSC ERROR: Try option -start_in_debugger or
>>>>>     -on_error_attach_debugger
>>>>>     [0]PETSC ERROR: or see
>>>>>     http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind[0]PETSC
>>>>>
>>>>>     ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
>>>>>     X to
>>>>>     find memory corruption errors
>>>>>     [0]PETSC ERROR: configure using --with-debugging=yes,
>>>>>     recompile, link,
>>>>>     and run
>>>>>     [0]PETSC ERROR: to get more information on the crash.
>>>>>     [0]PETSC ERROR: --------------------- Error Message
>>>>>     --------------------------------------------------------------
>>>>>     [0]PETSC ERROR: Signal received
>>>>>     [0]PETSC ERROR: See
>>>>>     http://http://www.mcs.anl.gov/petsc/documentation/faq.html for
>>>>>     trouble
>>>>>     shooting.
>>>>>     [0]PETSC ERROR: Petsc Development GIT revision:
>>>>>     v3.4.4-3713-g576f62e
>>>>>     GIT Date: 2014-03-23 15:59:15 -0500
>>>>>     [0]PETSC ERROR: ./ex9 on a linux_opt named sitre.intra.cea.fr by triou
>>>>>     Wed Mar 26 18:32:34 2014
>>>>>     [0]PETSC ERROR: Configure options
>>>>>     --prefix=/work/triou/git/petsc-dev/Trio_U/lib/src/LIBPETSC/petsc/linux_opt
>>>>>
>>>>>     --with-single-library --with-shared-libraries=0 --with-debugging=0
>>>>>     --with-errorchecking=1 --COPTFLAGS="   -O3 -fPIC "
>>>>>     --CXXOPTFLAGS=" -O3
>>>>>     -fPIC " --FOPTFLAGS="  -O3 -fPIC " --with-fortran=yes
>>>>>     --with-clean=1
>>>>>     --download-scalapack=../scalapack-2.0.2.tgz
>>>>>     --download-mumps=../MUMPS_4.10.0-p3.tar.gz
>>>>>     --download-superlu_dist=yes
>>>>>     --download-parmetis=../parmetis-4.0.2-p4.tar.gz
>>>>>     --download-metis=../metis-5.0.2-p3.tar.gz
>>>>>     --download-ptscotch=../ptscotch.tar.gz
>>>>>     --download-hypre=../hypre-2.9.1a.tar.gz
>>>>>     --with-valgrind-include=/work/triou/git/petsc-dev/Trio_U/exec/valgrind/include
>>>>>
>>>>>     --with-blas-lapack-dir=/work/triou/git/petsc-dev/Trio_U/lib/src/LIBLAPACK
>>>>>     --with-cuda=1
>>>>>     --with-cuda-dir=/work/triou/git/petsc-dev/Trio_U/exec/cuda5.5
>>>>>     --with-cusp=1
>>>>>     --with-cusp-dir=/work/triou/git/petsc-dev/Trio_U/lib/src/LIBPETSC/cusplibrary-0.4.0
>>>>>
>>>>>     --with-thrust=1 --with-cuda-arch=sm_21 --with-ssl=0
>>>>>     --with-mpi-dir=/work/triou/git/petsc-dev/Trio_U/lib/src/LIBMPI/mpich
>>>>>
>>>>>     --with-x=1
>>>>>     [0]PETSC ERROR: #1 User provided function() line 0 in  unknown
>>>>>     file
>>>>>     [1]PETSC ERROR:
>>>>>     ------------------------------------------------------------------------
>>>>>
>>>>>     [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation
>>>>>     Violation,
>>>>>     probably memory access out of range
>>>>>     [1]PETSC ERROR: Try option -start_in_debugger or
>>>>>     -on_error_attach_debugger
>>>>>     [1]PETSC ERROR: or see
>>>>>     http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind[1]PETSC
>>>>>
>>>>>     ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
>>>>>     X to
>>>>>     find memory corruption errors
>>>>>     [1]PETSC ERROR: configure using --with-debugging=yes,
>>>>>     recompile, link,
>>>>>     and run
>>>>>     [1]PETSC ERROR: application called MPI_Abort(MPI_COMM_WORLD,
>>>>>     59) - process 0
>>>>>     to get more information on the crash.
>>>>>     [1]PETSC ERROR: --------------------- Error Message
>>>>>     --------------------------------------------------------------
>>>>>     [1]PETSC ERROR: Signal received
>>>>>     [1]PETSC ERROR: See
>>>>>     http://http://www.mcs.anl.gov/petsc/documentation/faq.html for
>>>>>     trouble
>>>>>     shooting.
>>>>>     [1]PETSC ERROR: Petsc Development GIT revision:
>>>>>     v3.4.4-3713-g576f62e
>>>>>     GIT Date: 2014-03-23 15:59:15 -0500
>>>>>     [1]PETSC ERROR: ./ex9 on a linux_opt named sitre.intra.cea.fr by triou
>>>>>     Wed Mar 26 18:32:34 2014
>>>>>     [1]PETSC ERROR: Configure options
>>>>>     --prefix=/work/triou/git/petsc-dev/Trio_U/lib/src/LIBPETSC/petsc/linux_opt
>>>>>
>>>>>     --with-single-library --with-shared-libraries=0 --with-debugging=0
>>>>>     --with-errorchecking=1 --COPTFLAGS="   -O3 -fPIC "
>>>>>     --CXXOPTFLAGS=" -O3
>>>>>     -fPIC " --FOPTFLAGS="  -O3 -fPIC " --with-fortran=yes
>>>>>     --with-clean=1
>>>>>     --download-scalapack=../scalapack-2.0.2.tgz
>>>>>     --download-mumps=../MUMPS_4.10.0-p3.tar.gz
>>>>>     --download-superlu_dist=yes
>>>>>     --download-parmetis=../parmetis-4.0.2-p4.tar.gz
>>>>>     --download-metis=../metis-5.0.2-p3.tar.gz
>>>>>     --download-ptscotch=../ptscotch.tar.gz
>>>>>     --download-hypre=../hypre-2.9.1a.tar.gz
>>>>>     --with-valgrind-include=/work/triou/git/petsc-dev/Trio_U/exec/valgrind/include
>>>>>
>>>>>     --with-blas-lapack-dir=/work/triou/git/petsc-dev/Trio_U/lib/src/LIBLAPACK
>>>>>     --with-cuda=1
>>>>>     --with-cuda-dir=/work/triou/git/petsc-dev/Trio_U/exec/cuda5.5
>>>>>     --with-cusp=1
>>>>>     --with-cusp-dir=/work/triou/git/petsc-dev/Trio_U/lib/src/LIBPETSC/cusplibrary-0.4.0
>>>>>
>>>>>     --with-thrust=1 --with-cuda-arch=sm_21 --with-ssl=0
>>>>>     --with-mpi-dir=/work/triou/git/petsc-dev/Trio_U/lib/src/LIBMPI/mpich
>>>>>
>>>>>     --with-x=1
>>>>>     [1]PETSC ERROR: #1 User provided function() line 0 in  unknown
>>>>>     file
>>>>>     application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1
>>>>>
>>>>>
>>>>>
>>>>>     --
>>>>>     *Trio_U support team*
>>>>>     Marthe ROUX (01 69 08 00 02) Saclay
>>>>>     Pierre LEDAC (04 38 78 91 49) Grenoble
>>>>
>>>
>>>
>>>     --
>>>     *Trio_U support team*
>>>     Marthe ROUX (01 69 08 00 02) Saclay
>>>     Pierre LEDAC (04 38 78 91 49) Grenoble
>>
>>
>>     --
>>     *Trio_U support team*
>>     Marthe ROUX (01 69 08 00 02) Saclay
>>     Pierre LEDAC (04 38 78 91 49) Grenoble
>>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which
>> their experiments lead.
>> -- Norbert Wiener
>
>
> --
> *Trio_U support team*
> Marthe ROUX (01 69 08 00 02) Saclay
> Pierre LEDAC (04 38 78 91 49) Grenoble



