[petsc-dev] [petsc-maint #72279] PETSc and multigpu

Barry Smith bsmith at mcs.anl.gov
Thu May 5 13:57:15 CDT 2011


Alexander

    Thank you for the sample code; it will be very useful.

    We have run parallel jobs with CUDA where each node has only a single MPI process and uses a single GPU, without the crash you report below. I cannot explain why it does not work in your situation. Do you have access to two nodes, each with a GPU, so you could try that configuration?

   It is crashing in the destructor of a

struct  _p_PetscCUSPIndices {
  CUSPINTARRAYCPU indicesCPU;
  CUSPINTARRAYGPU indicesGPU;   /* a cusp::array1d<PetscInt,cusp::device_memory> */
};

so the delete is freeing a cusp::array1d<PetscInt,cusp::device_memory> that lives on the device; thus it is crashing only after it has actually completed the computation. If you run with -snes_monitor -ksp_monitor, with and without -da_vec_type cusp, on 2 processes, what do you get for output in the two cases? I want to see whether it is running correctly on two processes.
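
   For example (command lines assumed from the runs quoted below, with the -info redirection dropped so the monitor output is easy to compare):

   mpirun -np 2 ./lapexp -da_grid_x 65535 -snes_monitor -ksp_monitor
   mpirun -np 2 ./lapexp -da_grid_x 65535 -da_vec_type cusp -snes_monitor -ksp_monitor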

Could the crash be due to memory corruption at some point during the computation?


   Barry





On May 5, 2011, at 3:38 AM, Alexander Grayver wrote:

> Hello!
> 
> We work with the petsc-dev branch and the ex47cu.cu example. Our platform 
> is an Intel Quad processor machine with 8 identical Tesla GPUs; the CUDA 
> 3.2 toolkit is installed.
> Ideally we would like PETSc to work in a multi-GPU way within just one 
> node, so that different GPUs can be attached to different processes.
> Since this is not possible with the current PETSc implementation, we 
> created a preload library (see LD_PRELOAD for details) that intercepts 
> the CUBLAS function cublasInit().
> When PETSc calls this function our library gets control; we assign GPUs 
> according to the rank within the MPI communicator and then call the 
> original cublasInit().
> This preload library is very simple; see the attached petsc_mgpu.c (a 
> sketch of such a shim is given after the quoted message).
> This trick gives each process its own context, so ideally all 
> computations should be distributed across the GPUs.
> 
> We managed to build PETSc and the example (see the attached makefile) and 
> tested it as follows:
> 
> [agraiver@tesla-cmc new]$ ./lapexp -da_grid_x 65535 -info > cpu_1process.out
> [agraiver@tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -info > cpu_2processes.out
> [agraiver@tesla-cmc new]$ ./lapexp -da_grid_x 65535 -da_vec_type cusp -info > gpu_1process.out
> [agraiver@tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -da_vec_type cusp -info > gpu_2processes.out
> 
> Everything except the last configuration works well. The last one crashes 
> with the following exception and call stack:
> terminate called after throwing an instance of 'thrust::system::system_error'
>   what():  invalid device pointer
> [tesla-cmc:15549] *** Process received signal ***
> [tesla-cmc:15549] Signal: Aborted (6)
> [tesla-cmc:15549] Signal code:  (-6)
> [tesla-cmc:15549] [ 0] /lib64/libpthread.so.0() [0x3de540eeb0]
> [tesla-cmc:15549] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3de50330c5]
> [tesla-cmc:15549] [ 2] /lib64/libc.so.6(abort+0x186) [0x3de5034a76]
> [tesla-cmc:15549] [ 3] /opt/llvm/dragonegg/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d) [0x7f0d3530b95d]
> [tesla-cmc:15549] [ 4] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7b76) [0x7f0d35309b76]
> [tesla-cmc:15549] [ 5] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7ba3) [0x7f0d35309ba3]
> [tesla-cmc:15549] [ 6] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7cae) [0x7f0d35309cae]
> [tesla-cmc:15549] [ 7] ./lapexp(_ZN6thrust6detail6device4cuda4freeILj0EEEvNS_10device_ptrIvEE+0x69) [0x426320]
> [tesla-cmc:15549] [ 8] ./lapexp(_ZN6thrust6detail6device8dispatch4freeILj0EEEvNS_10device_ptrIvEENS0_21cuda_device_space_tagE+0x2b) [0x4258b2]
> [tesla-cmc:15549] [ 9] ./lapexp(_ZN6thrust11device_freeENS_10device_ptrIvEE+0x2f) [0x424f78]
> [tesla-cmc:15549] [10] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust23device_malloc_allocatorIiE10deallocateENS_10device_ptrIiEEm+0x33) [0x7f0d36aeacff]
> [tesla-cmc:15549] [11] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEE10deallocateEv+0x6e) [0x7f0d36ae8e78]
> [tesla-cmc:15549] [12] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEED1Ev+0x19) [0x7f0d36ae75f7]
> [tesla-cmc:15549] [13] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail11vector_baseIiNS_23device_malloc_allocatorIiEEED1Ev+0x52) [0x7f0d36ae65f4]
> [tesla-cmc:15549] [14] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN4cusp7array1dIiN6thrust6detail21cuda_device_space_tagEED1Ev+0x18) [0x7f0d36ae5c2e]
> [tesla-cmc:15549] [15] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN19_p_PetscCUSPIndicesD1Ev+0x1d) [0x7f0d3751e45f]
> [tesla-cmc:15549] [16] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(PetscCUSPIndicesDestroy+0x20f) [0x7f0d3750c840]
> [tesla-cmc:15549] [17] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy_PtoP+0x1bc8) [0x7f0d375af8af]
> [tesla-cmc:15549] [18] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy+0x586) [0x7f0d375e9ddf]
> [tesla-cmc:15549] [19] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy_MPIAIJ+0x49f) [0x7f0d37191d24]
> [tesla-cmc:15549] [20] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy+0x546) [0x7f0d370d54fe]
> [tesla-cmc:15549] [21] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESReset+0x5d1) [0x7f0d3746fac3]
> [tesla-cmc:15549] [22] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESDestroy+0x4b8) [0x7f0d37470210]
> [tesla-cmc:15549] [23] ./lapexp(main+0x5ed) [0x420745]
> 
> I've sent all the detailed output files for the execution configurations 
> listed above, as well as configure.log and make.log, to 
> petsc-maint at mcs.anl.gov, hoping that someone can recognize the problem.
> For now we have one node with multiple GPUs, but I'm also wondering 
> whether anyone has really tested the GPU functionality across several 
> nodes with one GPU each?
> 
> Regards,
> Alexander
> 
> <petsc_mgpu.c><makefile.txt><configure.log>
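
For reference, a minimal sketch of the kind of LD_PRELOAD shim described in the quoted message (assuming the legacy CUBLAS API and round-robin device selection by MPI rank; the actual petsc_mgpu.c attachment may differ) could look like this:

/* petsc_mgpu.c (sketch) -- intercept the legacy CUBLAS call cublasInit(),
 * bind this MPI process to one GPU chosen from its rank, then forward to
 * the real cublasInit() from libcublas.
 *
 * Assumed build and usage:
 *   mpicc -fPIC -shared petsc_mgpu.c -o libpetsc_mgpu.so -ldl -lcudart
 *   LD_PRELOAD=./libpetsc_mgpu.so mpirun -np 2 ./lapexp -da_vec_type cusp ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include <cublas.h>                /* legacy API: cublasStatus cublasInit(void) */

cublasStatus cublasInit(void)
{
  typedef cublasStatus (*cublasInit_t)(void);
  cublasInit_t real_cublasInit;
  int rank = 0, ndev = 0, initialized = 0;

  /* Assumes MPI is already initialized when PETSc calls cublasInit(). */
  MPI_Initialized(&initialized);
  if (initialized) MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Attach each rank to a GPU, round-robin over the devices on the node. */
  if (cudaGetDeviceCount(&ndev) == cudaSuccess && ndev > 0) {
    cudaSetDevice(rank % ndev);
    fprintf(stderr, "[petsc_mgpu] rank %d -> GPU %d of %d\n", rank, rank % ndev, ndev);
  }

  /* Hand control back to the real cublasInit(). */
  real_cublasInit = (cublasInit_t) dlsym(RTLD_NEXT, "cublasInit");
  if (!real_cublasInit) {
    fprintf(stderr, "[petsc_mgpu] dlsym(cublasInit) failed: %s\n", dlerror());
    return CUBLAS_STATUS_NOT_INITIALIZED;
  }
  return real_cublasInit();
}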



