[petsc-users] SuperLU + GPUs

Satish Balay balay at mcs.anl.gov
Wed Apr 15 20:58:57 CDT 2020


The crash is inside SuperLU_DIST (the backtrace shows cudaFreeHost called from dDestroy_LU), so I don't know what to suggest.

You might have to run this in a debugger and check with Sherry.

Satish
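
One way to follow up on the debugger suggestion above, as a minimal sketch rather than a verified recipe: run a single rank of ex19 under gdb in batch mode so that a backtrace is printed when the assertion aborts. The solver options are assumed to match what 'make check' passes for the superlu_dist case, the launcher is assumed from the --with-mpiexec="jsrun -g1" setting quoted further down, and the LAUNCHER override is a hypothetical convenience. None of this was part of the original exchange.

```shell
#!/bin/sh
# Options assumed to match the superlu_dist case of 'make check'; verify them
# against src/snes/tutorials/makefile in your own PETSc tree.
SOLVER_OPTS="-da_grid_x 20 -da_grid_y 20 -pc_type lu -pc_factor_mat_solver_type superlu_dist"

# Launcher assumed from the quoted configure line; override LAUNCHER on other
# systems (for example: LAUNCHER="mpiexec -n 1").
LAUNCHER="${LAUNCHER:-jsrun -n 1 -g 1}"

if [ -x ./ex19 ]; then
    # Run one rank under gdb non-interactively; 'bt' prints the stack once
    # the program stops on the abort signal.
    $LAUNCHER gdb -batch -ex run -ex bt --args ./ex19 $SOLVER_OPTS
else
    echo "ex19 not found: build it in \$PETSC_DIR/src/snes/tutorials first" >&2
fi
```

If SuperLU_DIST was compiled with debug symbols, the resulting backtrace should show source lines inside dDestroy_LU, which is probably the most useful thing to pass along to the SuperLU_DIST developers.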

On Wed, 15 Apr 2020, Mark Adams wrote:

> Ah, OK, 'check' will test SuperLU. It semi-worked:
> 
> s20:13 mark/feature-xgc-interface-rebase *= ~/petsc$ make PETSC_DIR=/ccs/home/adams/petsc PETSC_ARCH=arch-summit-dbg-gnu-cuda-omp check
> Running check examples to verify correct installation
> Using PETSC_DIR=/ccs/home/adams/petsc and PETSC_ARCH=arch-summit-dbg-gnu-cuda-omp
> C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
> C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
> 2c2,38
> < Number of SNES iterations = 2
> ---
> > CUDA version:   v 10010
> > CUDA Devices:
> >
> > 0 : Tesla V100-SXM2-16GB 7 0
> >   Global memory:   16128 mb
> >   Shared memory:   48 kb
> >   Constant memory: 64 kb
> >   Block registers: 65536
> >
> > ex19: cudahook.cc:762: CUresult host_free_callback(void*): Assertion `cacheNode != __null' failed.
> > [h16n07:78357] *** Process received signal ***
> > [h16n07:78357] Signal: Aborted (6)
> > [h16n07:78357] Signal code:  (1704218624)
> > [h16n07:78357] [ 0] [0x2000000504d8]
> > [h16n07:78357] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x200023992094]
> > [h16n07:78357] [ 2] /lib64/libc.so.6(+0x356d4)[0x2000239856d4]
> > [h16n07:78357] [ 3] /lib64/libc.so.6(__assert_fail+0x64)[0x2000239857c4]
> > [h16n07:78357] [ 4] /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/libpami_cudahook.so(_Z18host_free_callbackPv+0x2d8)[0x2000000cd2c8]
> > [h16n07:78357] [ 5] /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/libpami_cudahook.so(cuMemFreeHost+0xb0)[0x2000000c3cc0]
> > [h16n07:78357] [ 6] /sw/summit/cuda/10.1.243/lib64/libcudart.so.10.1(+0x42f50)[0x200010aa2f50]
> > [h16n07:78357] [ 7] /sw/summit/cuda/10.1.243/lib64/libcudart.so.10.1(+0x11db8)[0x200010a71db8]
> > [h16n07:78357] [ 8] /sw/summit/cuda/10.1.243/lib64/libcudart.so.10.1(cudaFreeHost+0x74)[0x200010ab2ea4]
> > [h16n07:78357] [ 9] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libsuperlu_dist.so.6(dDestroy_LU+0x150)[0x200003188058]
> > [h16n07:78357] [10] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(+0x12ebc6c)[0x2000013dbc6c]
> > [h16n07:78357] [11] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(MatLUFactorNumeric+0x934)[0x200000d2fae4]
> > [h16n07:78357] [12] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(+0x1cca7a4)[0x200001dba7a4]
> > [h16n07:78357] [13] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(PCSetUp+0xde0)[0x200001f3f990]
> > [h16n07:78357] [14] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(KSPSetUp+0x1848)[0x200001fc5594]
> > [h16n07:78357] [15] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(+0x1ed9908)[0x200001fc9908]
> > [h16n07:78357] [16] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(KSPSolve+0x5d0)[0x200001fcc690]
> > [h16n07:78357] [17] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(+0x21e16ac)[0x2000022d16ac]
> > [h16n07:78357] [18] /ccs/home/adams/petsc/arch-summit-dbg-gnu-cuda-omp/lib/libpetsc.so.3.013(SNESSolve+0x23f4)[0x2000022255c0]
> > [h16n07:78357] [19] ./ex19[0x10002ac8]
> > [h16n07:78357] [20] /lib64/libc.so.6(+0x25200)[0x200023975200]
> > [h16n07:78357] [21] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000239753f4]
> > [h16n07:78357] *** End of error message ***
> > ERROR:  One or more process (first noticed rank 0) terminated with signal 6
> /ccs/home/adams/petsc/src/snes/tutorials
> Possible problem with ex19 running with superlu_dist, diffs above
> =========================================
> 
> On Wed, Apr 15, 2020 at 5:58 PM Satish Balay <balay at mcs.anl.gov> wrote:
> 
> > Please send configure.log
> >
> > This is what I get on my linux build:
> >
> > [balay at p1 petsc]$ ./configure --with-mpi-dir=/home/petsc/soft/openmpi-4.0.2-cuda --with-cuda=1 --with-openmp=1 --download-superlu-dist=1 && make && make check
> > <snip>
> > Running check examples to verify correct installation
> > Using PETSC_DIR=/home/balay/petsc and PETSC_ARCH=arch-linux-c-debug
> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
> > 1a2,19
> > > CUDA version:   v 10020
> > > CUDA Devices:
> > >
> > > 0 : Quadro T2000 7 5
> > >   Global memory:   3911 mb
> > >   Shared memory:   48 kb
> > >   Constant memory: 64 kb
> > >   Block registers: 65536
> > >
> > > CUDA version:   v 10020
> > > CUDA Devices:
> > >
> > > 0 : Quadro T2000 7 5
> > >   Global memory:   3911 mb
> > >   Shared memory:   48 kb
> > >   Constant memory: 64 kb
> > >   Block registers: 65536
> > >
> > /home/balay/petsc/src/snes/tutorials
> > Possible problem with ex19 running with superlu_dist, diffs above
> > =========================================
> > Fortran example src/snes/tutorials/ex5f run successfully with 1 MPI process
> > Completed test examples
> >
> >
> > On Wed, 15 Apr 2020, Mark Adams wrote:
> >
> > > On Wed, Apr 15, 2020 at 5:17 PM Satish Balay <balay at mcs.anl.gov> wrote:
> > >
> > > > The build should work. It should give some verbose info [at runtime]
> > > > regarding GPUs - from the following code.
> > > >
> > > >
> > > I don't see that and I am running GPUs in my code and have gotten cusparse
> > > LU to run. Should I use '-info :sys:'  ?
> > >
> > >
> > > > >>>>> SRC/cublas_utils.c >>>>>>>>>>>
> > > >  void DisplayHeader()
> > > > {
> > > >     const int kb = 1024;
> > > >     const int mb = kb * kb;
> > > >     // cout << "NBody.GPU" << endl << "=========" << endl << endl;
> > > >
> > > >     printf("CUDA version:   v %d\n",CUDART_VERSION);
> > > >     //cout << "Thrust version: v" << THRUST_MAJOR_VERSION << "." << THRUST_MINOR_VERSION << endl << endl;
> > > >
> > > >     int devCount;
> > > >     cudaGetDeviceCount(&devCount);
> > > >     printf( "CUDA Devices: \n \n");
> > > > <snip>
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > >
> > > > Satish
> > > >
> > > > On Wed, 15 Apr 2020, Junchao Zhang wrote:
> > > >
> > > > > I remember Barry said superlu gpu support is broken.
> > > > > --Junchao Zhang
> > > > >
> > > > >
> > > > > On Wed, Apr 15, 2020 at 3:47 PM Mark Adams <mfadams at lbl.gov> wrote:
> > > > >
> > > > > > How does one use SuperLU with GPUs? I don't seem to get any GPU
> > > > > > performance data, so I assume GPUs are not getting turned on. Am I
> > > > > > wrong about that?
> > > > > >
> > > > > > I configure with:
> > > > > > configure options: --with-fc=0 --COPTFLAGS="-g -O2 -fPIC -fopenmp"
> > > > > > --CXXOPTFLAGS="-g -O2 -fPIC -fopenmp" --FOPTFLAGS="-g -O2 -fPIC -fopenmp"
> > > > > > --CUDAOPTFLAGS="-O2 -g" --with-ssl=0 --with-batch=0 --with-cxx=mpicxx
> > > > > > --with-mpiexec="jsrun -g1" --with-cuda=1 --with-cudac=nvcc
> > > > > > --download-p4est=1 --download-zlib --download-hdf5=1 --download-metis
> > > > > > --download-superlu --download-superlu_dist --with-make-np=16
> > > > > > --download-parmetis --download-triangle
> > > > > > --with-blaslapack-lib="-L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/netlib-lapack-3.8.0-wcabdyqhdi5rooxbkqa6x5d7hxyxwdkm/lib64 -lblas -llapack"
> > > > > > --with-cc=mpicc --with-shared-libraries=1 --with-x=0
> > > > > > --with-64-bit-indices=0 --with-debugging=0
> > > > > > PETSC_ARCH=arch-summit-opt-gnu-cuda-omp --with-openmp=1
> > > > > > --with-threadsaftey=1 --with-log=1
> > > > > >
> > > > > > Thanks,
> > > > > > Mark
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
> 