[petsc-dev] [petsc-maint #72279] PETSc and multigpu

Alexander Grayver agrayver at gfz-potsdam.de
Wed May 11 07:31:39 CDT 2011


Hello,

Thanks, Victor. We've got the latest version and now it doesn't crash.
However, it seems there is still a problem.

Let's look at three different runs:

[agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 
-snes_monitor -ksp_monitor
   0 SNES Function norm 3.906279802209e-03
     0 KSP Residual norm 5.994156809227e+00
     1 KSP Residual norm 3.538158441448e-04
     2 KSP Residual norm 3.124431921666e-04
     3 KSP Residual norm 4.109213410989e-06
   1 SNES Function norm 7.201017610490e-04
     0 KSP Residual norm 3.317803708316e-02
     1 KSP Residual norm 2.447380361169e-06
     2 KSP Residual norm 2.164193969957e-06
     3 KSP Residual norm 2.124317398679e-08
   2 SNES Function norm 1.719678934825e-05
     0 KSP Residual norm 1.651586453143e-06
     1 KSP Residual norm 2.037037536868e-08
     2 KSP Residual norm 1.109736798274e-08
     3 KSP Residual norm 1.857218772156e-12
   3 SNES Function norm 1.159391068583e-09
     0 KSP Residual norm 3.116936044619e-11
     1 KSP Residual norm 1.366503312678e-12
     2 KSP Residual norm 6.598477672192e-13
     3 KSP Residual norm 5.306147277879e-17
   4 SNES Function norm 2.202297235559e-10
[agraiver at tesla-cmc new]$ mpirun -np 1 ./lapexp -da_grid_x 65535 
-da_vec_type cusp -snes_monitor -ksp_monitor
   0 SNES Function norm 3.906279802209e-03
     0 KSP Residual norm 2.600060425819e+01
     1 KSP Residual norm 1.711173401491e-09
   1 SNES Function norm 2.518839283204e-05
     0 KSP Residual norm 1.864270712051e-01
     1 KSP Residual norm 1.123567613474e-11
   2 SNES Function norm 1.475752536169e-09
     0 KSP Residual norm 1.065095925089e-05
     1 KSP Residual norm 8.918344224261e-16
   3 SNES Function norm 2.186342855894e-10
     0 KSP Residual norm 6.313874615230e-11
     1 KSP Residual norm 2.338370003621e-21
[agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 
-da_vec_type cusp -snes_monitor -ksp_monitor
   0 SNES Function norm 3.906279802209e-03
     0 KSP Residual norm 5.994156809227e+00
     1 KSP Residual norm 5.927247846249e-05
   1 SNES Function norm 3.906225077938e-03
     0 KSP Residual norm 5.993813868985e+00
     1 KSP Residual norm 5.927575078206e-05
[agraiver at tesla-cmc new]$

lapexp is the default example, just renamed. The first run used 2 CPU 
processes, the second one used 1 GPU and the third one ran with 2 
processes using 1 GPU.
The first difference is that when using the CPU the last line of output 
is always:
4 SNES Function norm 2.202297235559e-10
whereas with the GPU the last line is "N KSP Residual norm ...".
Second, it seems that with 2 processes using 1 GPU the example doesn't 
converge; the norm stays quite large. The same thing happens when we use 
2 processes and 2 GPUs. Can you explain this?
BTW, we can even give you access to our server with 6 CPUs and 8 GPUs 
within one node.

Regards,
Alexander

On 11.05.2011 01:07, Victor Minden wrote:
> I pushed my change to petsc-dev, so hopefully a new pull of the latest 
> mercurial repository should do it; let me know if not.
> ---
> Victor L. Minden
>
> Tufts University
> School of Engineering
> Class of 2012
>
>
> On Tue, May 10, 2011 at 6:59 PM, Alexander Grayver 
> <agrayver at gfz-potsdam.de> wrote:
>
>     Hi Victor,
>
>     Thanks a lot!
>     What should we do to get new version?
>
>     Regards,
>     Alexander
>
>
>     On 10.05.2011 02:02, Victor Minden wrote:
>>     I believe I've resolved this issue.
>>
>>     Cheers,
>>
>>     Victor
>>     ---
>>     Victor L. Minden
>>
>>     Tufts University
>>     School of Engineering
>>     Class of 2012
>>
>>
>>     On Sun, May 8, 2011 at 5:26 PM, Victor Minden
>>     <victorminden at gmail.com> wrote:
>>
>>         Barry,
>>
>>         I can verify this on breadboard now,
>>
>>         with two processes, cuda
>>
>>         minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
>>         /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra
>>         -machinefile
>>         /home/balay/machinefile -n 2 ./ex47cu -da_grid_x 65535
>>         -log_summary
>>         -snes_monitor -ksp_monitor -da_vec_type cusp
>>          0 SNES Function norm 3.906279802209e-03
>>            0 KSP Residual norm 5.994156809227e+00
>>            1 KSP Residual norm 5.927247846249e-05
>>          1 SNES Function norm 3.906225077938e-03
>>            0 KSP Residual norm 5.993813868985e+00
>>            1 KSP Residual norm 5.927575078206e-05
>>         terminate called after throwing an instance of
>>         'thrust::system::system_error'
>>          what():  invalid device pointer
>>         terminate called after throwing an instance of
>>         'thrust::system::system_error'
>>          what():  invalid device pointer
>>         Aborted (signal 6)
>>
>>
>>
>>         Without cuda
>>
>>         minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
>>         /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra
>>         -machinefile
>>         /home/balay/machinefile -n 2 ./ex47cu -da_grid_x 65535
>>         -log_summary
>>         -snes_monitor -ksp_monitor
>>          0 SNES Function norm 3.906279802209e-03
>>            0 KSP Residual norm 5.994156809227e+00
>>            1 KSP Residual norm 3.538158441448e-04
>>            2 KSP Residual norm 3.124431921666e-04
>>            3 KSP Residual norm 4.109213410989e-06
>>          1 SNES Function norm 7.201017610490e-04
>>            0 KSP Residual norm 3.317803708316e-02
>>            1 KSP Residual norm 2.447380361169e-06
>>            2 KSP Residual norm 2.164193969957e-06
>>            3 KSP Residual norm 2.124317398679e-08
>>          2 SNES Function norm 1.719678934825e-05
>>            0 KSP Residual norm 1.651586453143e-06
>>            1 KSP Residual norm 2.037037536868e-08
>>            2 KSP Residual norm 1.109736798274e-08
>>            3 KSP Residual norm 1.857218772156e-12
>>          3 SNES Function norm 1.159391068583e-09
>>            0 KSP Residual norm 3.116936044619e-11
>>            1 KSP Residual norm 1.366503312678e-12
>>            2 KSP Residual norm 6.598477672192e-13
>>            3 KSP Residual norm 5.306147277879e-17
>>          4 SNES Function norm 2.202297235559e-10
>>
>>         Note the repeated norms when using cuda.  Looks like I'll
>>         have to take
>>         a closer look at this.
>>
>>         -Victor
>>
>>         ---
>>         Victor L. Minden
>>
>>         Tufts University
>>         School of Engineering
>>         Class of 2012
>>
>>
>>
>>         On Thu, May 5, 2011 at 2:57 PM, Barry Smith
>>         <bsmith at mcs.anl.gov> wrote:
>>         >
>>         > Alexander
>>         >
>>         >    Thank you for the sample code; it will be very useful.
>>         >
>>         >    We have run parallel jobs with CUDA where each node has
>>         only a single MPI process and uses a single GPU without the
>>         crash that you get below. I cannot explain why it would not
>>         work in your situation. Do you have access to two nodes each
>>         with a GPU so you could try that?
>>         >
>>         >   It is crashing in a delete of a
>>         >
>>         > struct  _p_PetscCUSPIndices {
>>         >  CUSPINTARRAYCPU indicesCPU;
>>         >  CUSPINTARRAYGPU indicesGPU;
>>         > };
>>         >
>>         > where CUSPINTARRAYGPU is a cusp::array1d<PetscInt,cusp::device_memory>
>>         >
>>         > thus it is crashing after it has completed actually doing
>>         the computation. If you run with -snes_monitor -ksp_monitor
>>         with and without the -da_vec_type cusp on 2 processes what do
>>         you get for output in the two cases? I want to see whether it is
>>         running correctly on two processes.
>>         >
>>         > Could the crash be due to memory corruption sometime during
>>         the computation?
>>         >
>>         >
>>         >   Barry
>>         >
>>         >
>>         >
>>         >
>>         >
>>         > On May 5, 2011, at 3:38 AM, Alexander Grayver wrote:
>>         >
>>         >> Hello!
>>         >>
>>         >> We work with petsc-dev branch and ex47cu.cu
>>         example. Our platform is
>>         >> Intel Quad processor and 8 identical Tesla GPUs. CUDA 3.2
>>         toolkit is
>>         >> installed.
>>         >> Ideally we would like to make PETSc work in a multi-GPU
>>         way within
>>         >> just one node so that different GPUs could be attached to
>>         different
>>         >> processes.
>>         >> Since this is not possible within the current PETSc
>>         implementation, we created a
>>         >> preload library (see LD_PRELOAD for details) for CUBLAS
>>         function
>>         >> cublasInit().
>>         >> When PETSc calls this function, our library gets control
>>         and we assign
>>         >> GPUs according to the rank within the MPI communicator, then we
>>         call the original
>>         >> cublasInit().
>>         >> This preload library is very simple, see petsc_mgpu.c
>>         attached.
>>         >> This trick makes each process have its own context and
>>         ideally all
>>         >> computations should be distributed over several GPUs.
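
A minimal sketch of such a cublasInit() interposer, assuming a
rank-modulo-device-count GPU assignment (this is an illustration of the
approach described above, not the actual petsc_mgpu.c attachment):

/* Sketch of an LD_PRELOAD interposer for the legacy CUBLAS entry point
 * cublasInit(): bind the calling MPI rank to a GPU, then forward to the
 * real cublasInit() via dlsym(RTLD_NEXT, ...).
 *
 * Build (roughly): mpicc -shared -fPIC -o libpetsc_mgpu.so petsc_mgpu.c -lcudart -ldl
 * Run:             LD_PRELOAD=./libpetsc_mgpu.so mpirun -np 2 ./lapexp ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include <cublas.h>

cublasStatus cublasInit(void)
{
  typedef cublasStatus (*cublasInit_fp)(void);
  cublasInit_fp real_cublasInit;
  int rank, ndev;

  /* PETSc calls cublasInit() after MPI_Init(), so the rank is available. */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);   /* one GPU per MPI process (assumed policy) */

  /* Forward to the real CUBLAS implementation. */
  real_cublasInit = (cublasInit_fp) dlsym(RTLD_NEXT, "cublasInit");
  return real_cublasInit();
}
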
>>         >>
>>         >> We managed to build PETSc and the example (see makefile
>>         attached) and we
>>         >> tested it as follows:
>>         >>
>>         >> [agraiver at tesla-cmc new]$ ./lapexp -da_grid_x 65535 -info
>>         > cpu_1process.out
>>         >> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x
>>         65535 -info >
>>         >> cpu_2processes.out
>>         >> [agraiver at tesla-cmc new]$ ./lapexp -da_grid_x 65535
>>         -da_vec_type cusp
>>         >> -info > gpu_1process.out
>>         >> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x
>>         65535
>>         >> -da_vec_type cusp -info > gpu_2processes.out
>>         >>
>>         >> Everything except the last configuration works well. The last
>>         one crashes
>>         >> with the following exception and callstack:
>>         >> terminate called after throwing an instance of
>>         >> 'thrust::system::system_error'
>>         >>   what():  invalid device pointer
>>         >> [tesla-cmc:15549] *** Process received signal ***
>>         >> [tesla-cmc:15549] Signal: Aborted (6)
>>         >> [tesla-cmc:15549] Signal code:  (-6)
>>         >> [tesla-cmc:15549] [ 0] /lib64/libpthread.so.0() [0x3de540eeb0]
>>         >> [tesla-cmc:15549] [ 1] /lib64/libc.so.6(gsignal+0x35)
>>         [0x3de50330c5]
>>         >> [tesla-cmc:15549] [ 2] /lib64/libc.so.6(abort+0x186)
>>         [0x3de5034a76]
>>         >> [tesla-cmc:15549] [ 3]
>>         >>
>>         /opt/llvm/dragonegg/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)
>>         >> [0x7f0d3530b95d]
>>         >> [tesla-cmc:15549] [ 4]
>>         >> /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7b76)
>>         [0x7f0d35309b76]
>>         >> [tesla-cmc:15549] [ 5]
>>         >> /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7ba3)
>>         [0x7f0d35309ba3]
>>         >> [tesla-cmc:15549] [ 6]
>>         >> /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7cae)
>>         [0x7f0d35309cae]
>>         >> [tesla-cmc:15549] [ 7]
>>         >>
>>         ./lapexp(_ZN6thrust6detail6device4cuda4freeILj0EEEvNS_10device_ptrIvEE+0x69)
>>         >> [0x426320]
>>         >> [tesla-cmc:15549] [ 8]
>>         >>
>>         ./lapexp(_ZN6thrust6detail6device8dispatch4freeILj0EEEvNS_10device_ptrIvEENS0_21cuda_device_space_tagE+0x2b)
>>         >> [0x4258b2]
>>         >> [tesla-cmc:15549] [ 9]
>>         >> ./lapexp(_ZN6thrust11device_freeENS_10device_ptrIvEE+0x2f)
>>         [0x424f78]
>>         >> [tesla-cmc:15549] [10]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust23device_malloc_allocatorIiE10deallocateENS_10device_ptrIiEEm+0x33)
>>         >> [0x7f0d36aeacff]
>>         >> [tesla-cmc:15549] [11]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEE10deallocateEv+0x6e)
>>         >> [0x7f0d36ae8e78]
>>         >> [tesla-cmc:15549] [12]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEED1Ev+0x19)
>>         >> [0x7f0d36ae75f7]
>>         >> [tesla-cmc:15549] [13]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail11vector_baseIiNS_23device_malloc_allocatorIiEEED1Ev+0x52)
>>         >> [0x7f0d36ae65f4]
>>         >> [tesla-cmc:15549] [14]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN4cusp7array1dIiN6thrust6detail21cuda_device_space_tagEED1Ev+0x18)
>>         >> [0x7f0d36ae5c2e]
>>         >> [tesla-cmc:15549] [15]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN19_p_PetscCUSPIndicesD1Ev+0x1d)
>>         [0x7f0d3751e45f]
>>         >> [tesla-cmc:15549] [16]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(PetscCUSPIndicesDestroy+0x20f)
>>         >> [0x7f0d3750c840]
>>         >> [tesla-cmc:15549] [17]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy_PtoP+0x1bc8)
>>         >> [0x7f0d375af8af]
>>         >> [tesla-cmc:15549] [18]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy+0x586)
>>         >> [0x7f0d375e9ddf]
>>         >> [tesla-cmc:15549] [19]
>>         >>
>>         /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy_MPIAIJ+0x49f)
>>         >> [0x7f0d37191d24]
>>         >> [tesla-cmc:15549] [20]
>>         >> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy+0x546)
>>         [0x7f0d370d54fe]
>>         >> [tesla-cmc:15549] [21]
>>         >> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESReset+0x5d1)
>>         [0x7f0d3746fac3]
>>         >> [tesla-cmc:15549] [22]
>>         >> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESDestroy+0x4b8)
>>         [0x7f0d37470210]
>>         >> [tesla-cmc:15549] [23] ./lapexp(main+0x5ed) [0x420745]
>>         >>
>>         >> I've sent all detailed output files for different execution
>>         >> configuration listed above as well as configure.log and
>>         make.log to
>>         >> petsc-maint at mcs.anl.gov
>>         hoping that someone could recognize the problem.
>>         >> Right now we have one multi-GPU node, but I'm also
>>         wondering whether someone
>>         >> has really tested the GPU functionality over several
>>         nodes with one GPU
>>         >> each.
>>         >>
>>         >> Regards,
>>         >> Alexander
>>         >>
>>         >> <petsc_mgpu.c><makefile.txt><configure.log>
>>         >
>>         >
>>
>>
>
>
