[petsc-dev] [petsc-maint #72279] PETSc and multigpu

Victor Minden victorminden at gmail.com
Wed May 18 19:01:26 CDT 2011


Hi Alexander,

Looking through the runs for CPU and GPU with only 1 process, I'm seeing the
following oddity which you pointed out:

CPU 1 process
minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
/home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
/home/balay/machinefile -n 1 ./ex47cu -da_grid_x 65535 -snes_monitor
-ksp_monitor
  0 SNES Function norm 3.906279802209e-03
    0 KSP Residual norm 2.600060425819e+01
    1 KSP Residual norm 1.727316216725e-09
  1 SNES Function norm 2.518839280713e-05
    0 KSP Residual norm 1.864270710157e-01
    1 KSP Residual norm 1.518456989028e-11
  2 SNES Function norm 1.475794371713e-09
    0 KSP Residual norm 1.065102315659e-05
    1 KSP Residual norm 1.258453455440e-15
  3 SNES Function norm 2.207728411745e-10
    0 KSP Residual norm 6.963755704792e-12
    1 KSP Residual norm 1.188067869190e-21
  4 SNES Function norm 2.199244040060e-10

GPU 1 process
minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
/home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
/home/balay/machinefile -n 1 ./ex47cu -da_grid_x 65535 -snes_monitor
-ksp_monitor -da_vec_type cusp
  0 SNES Function norm 3.906279802209e-03
    0 KSP Residual norm 2.600060425819e+01
    1 KSP Residual norm 1.711173401491e-09
  1 SNES Function norm 2.518839283204e-05
    0 KSP Residual norm 1.864270712051e-01
    1 KSP Residual norm 1.123567613474e-11
  2 SNES Function norm 1.475752536169e-09
    0 KSP Residual norm 1.065095925089e-05
    1 KSP Residual norm 8.918344224261e-16
  3 SNES Function norm 2.186342855894e-10
    0 KSP Residual norm 6.313874615230e-11
    1 KSP Residual norm 2.338370003621e-21

As you noted, the CPU version terminates on a SNES function norm whereas the
GPU version stops on a KSP residual norm.  Looking through the exact
numbers, I found that the small differences in values between the GPU and
CPU versions put the SNES convergence criterion off by about 2e-23, so the
GPU run goes through one more round of line search before concluding it has
found a local minimum and terminating.  If I use the GPU matrix as well,

GPU 1 process with cusp matrix
minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
/home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
/home/balay/machinefile -n 1 ./ex47cu -da_grid_x 65535 -snes_monitor
-ksp_monitor -da_vec_type cusp -da_mat_type aijcusp
  0 SNES Function norm 3.906279802209e-03
    0 KSP Residual norm 2.600060425819e+01
    1 KSP Residual norm 8.745056654228e-10
  1 SNES Function norm 2.518839297589e-05
    0 KSP Residual norm 1.864270723743e-01
    1 KSP Residual norm 1.265482694189e-11
  2 SNES Function norm 1.475659976840e-09
    0 KSP Residual norm 1.065091221064e-05
    1 KSP Residual norm 8.245135443599e-16
  3 SNES Function norm 2.200530918322e-10
    0 KSP Residual norm 7.730316189302e-11
    1 KSP Residual norm 1.115126544733e-21
  4 SNES Function norm 2.192093087025e-10

the values change again, just enough to push the run back to the right side
of the convergence check.
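
For what it's worth, there is nothing deep going on numerically here; it is
just a near-tie in a floating-point comparison.  A toy illustration (made-up
numbers, not the actual PETSc convergence test):

#include <stdio.h>

int main(void)
{
  /* Two results that agree to ~13 digits but straddle the same threshold. */
  double a = 1.0000000000001e-10, b = 0.9999999999999e-10, tol = 1.0e-10;
  printf("%d %d\n", a <= tol, b <= tol);  /* prints "0 1" */
  return 0;
}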

I am still looking into the problems with 2 processes on the GPU.  It seems
to be using old data somewhere, as you can see from the fact that the
function norm is essentially the same at the beginning of each SNES
iteration:

GPU, 2 processes
[agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535
-da_vec_type cusp -snes_monitor -ksp_monitor

  0 SNES Function norm 3.906279802209e-03<-----
    0 KSP Residual norm 5.994156809227e+00
    1 KSP Residual norm 5.927247846249e-05
  1 SNES Function norm 3.906225077938e-03<------
    0 KSP Residual norm 5.993813868985e+00
    1 KSP Residual norm 5.927575078206e-05

So, it's doing some good calculations and then throwing them away and
starting over again.  I will continue to look into this.

Cheers,

Victor

---
Victor L. Minden

Tufts University
School of Engineering
Class of 2012


On Wed, May 11, 2011 at 8:31 AM, Alexander Grayver
<agrayver at gfz-potsdam.de> wrote:

>  Hello,
>
> Victor, thanks. We've got the latest version and now it doesn't crash.
> However, it seems there is still a problem.
>
> Let's look at three different runs:
>
> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535
> -snes_monitor -ksp_monitor
>
>   0 SNES Function norm 3.906279802209e-03
>     0 KSP Residual norm 5.994156809227e+00
>     1 KSP Residual norm 3.538158441448e-04
>     2 KSP Residual norm 3.124431921666e-04
>     3 KSP Residual norm 4.109213410989e-06
>   1 SNES Function norm 7.201017610490e-04
>     0 KSP Residual norm 3.317803708316e-02
>     1 KSP Residual norm 2.447380361169e-06
>     2 KSP Residual norm 2.164193969957e-06
>     3 KSP Residual norm 2.124317398679e-08
>   2 SNES Function norm 1.719678934825e-05
>     0 KSP Residual norm 1.651586453143e-06
>     1 KSP Residual norm 2.037037536868e-08
>     2 KSP Residual norm 1.109736798274e-08
>     3 KSP Residual norm 1.857218772156e-12
>   3 SNES Function norm 1.159391068583e-09
>     0 KSP Residual norm 3.116936044619e-11
>     1 KSP Residual norm 1.366503312678e-12
>     2 KSP Residual norm 6.598477672192e-13
>     3 KSP Residual norm 5.306147277879e-17
>   4 SNES Function norm 2.202297235559e-10
> [agraiver at tesla-cmc new]$ mpirun -np 1 ./lapexp -da_grid_x 65535
> -da_vec_type cusp -snes_monitor -ksp_monitor
>
>   0 SNES Function norm 3.906279802209e-03
>     0 KSP Residual norm 2.600060425819e+01
>     1 KSP Residual norm 1.711173401491e-09
>   1 SNES Function norm 2.518839283204e-05
>     0 KSP Residual norm 1.864270712051e-01
>     1 KSP Residual norm 1.123567613474e-11
>   2 SNES Function norm 1.475752536169e-09
>     0 KSP Residual norm 1.065095925089e-05
>     1 KSP Residual norm 8.918344224261e-16
>   3 SNES Function norm 2.186342855894e-10
>     0 KSP Residual norm 6.313874615230e-11
>     1 KSP Residual norm 2.338370003621e-21
> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535
> -da_vec_type cusp -snes_monitor -ksp_monitor
>
>   0 SNES Function norm 3.906279802209e-03
>     0 KSP Residual norm 5.994156809227e+00
>     1 KSP Residual norm 5.927247846249e-05
>   1 SNES Function norm 3.906225077938e-03
>     0 KSP Residual norm 5.993813868985e+00
>     1 KSP Residual norm 5.927575078206e-05
> [agraiver at tesla-cmc new]$
>
> lapexp is the default example, just renamed. The first run used 2 CPU
> processes, the second one used 1 GPU, and the third one ran with 2 processes
> using 1 GPU.
> The first difference is that when we use the CPU, the last line of the output
> is always:
>
> 4 SNES Function norm 2.202297235559e-10
>
> whereas for the GPU the last line is "N KSP ...something...".
> Then it seems that with 2 processes using 1 GPU the example doesn't converge;
> the norm stays quite big. The same situation happens when we use 2 processes
> and 2 GPUs. Can you explain this?
> BTW, we can even give you access to our server with 6 CPUs and 8 GPUs
> within one node.
>
> Regards,
> Alexander
>
>
> On 11.05.2011 01:07, Victor Minden wrote:
>
> I pushed my change to petsc-dev, so hopefully a fresh pull of the latest
> Mercurial repository will do it; let me know if not.
> ---
> Victor L. Minden
>
> Tufts University
> School of Engineering
> Class of 2012
>
>
> On Tue, May 10, 2011 at 6:59 PM, Alexander Grayver
> <agrayver at gfz-potsdam.de> wrote:
>
>>  Hi Victor,
>>
>> Thanks a lot!
>> What should we do to get new version?
>>
>> Regards,
>>  Alexander
>>
>>
>> On 10.05.2011 02:02, Victor Minden wrote:
>>
>> I believe I've resolved this issue.
>>
>>  Cheers,
>>
>>  Victor
>> ---
>> Victor L. Minden
>>
>> Tufts University
>> School of Engineering
>> Class of 2012
>>
>>
>> On Sun, May 8, 2011 at 5:26 PM, Victor Minden <victorminden at gmail.com> wrote:
>>
>>> Barry,
>>>
>>> I can verify this on breadboard now,
>>>
>>> with two processes, cuda
>>>
>>> minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
>>> /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
>>> /home/balay/machinefile -n 2 ./ex47cu -da_grid_x 65535 -log_summary
>>> -snes_monitor -ksp_monitor -da_vec_type cusp
>>>   0 SNES Function norm 3.906279802209e-03
>>>    0 KSP Residual norm 5.994156809227e+00
>>>    1 KSP Residual norm 5.927247846249e-05
>>>  1 SNES Function norm 3.906225077938e-03
>>>    0 KSP Residual norm 5.993813868985e+00
>>>    1 KSP Residual norm 5.927575078206e-05
>>>  terminate called after throwing an instance of
>>> 'thrust::system::system_error'
>>>  what():  invalid device pointer
>>> terminate called after throwing an instance of
>>> 'thrust::system::system_error'
>>>  what():  invalid device pointer
>>>  Aborted (signal 6)
>>>
>>>
>>>
>>> Without cuda
>>>
>>> minden at bb45:~/petsc-dev/src/snes/examples/tutorials$
>>> /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
>>> /home/balay/machinefile -n 2 ./ex47cu -da_grid_x 65535 -log_summary
>>> -snes_monitor -ksp_monitor
>>>  0 SNES Function norm 3.906279802209e-03
>>>    0 KSP Residual norm 5.994156809227e+00
>>>    1 KSP Residual norm 3.538158441448e-04
>>>    2 KSP Residual norm 3.124431921666e-04
>>>    3 KSP Residual norm 4.109213410989e-06
>>>  1 SNES Function norm 7.201017610490e-04
>>>    0 KSP Residual norm 3.317803708316e-02
>>>    1 KSP Residual norm 2.447380361169e-06
>>>    2 KSP Residual norm 2.164193969957e-06
>>>    3 KSP Residual norm 2.124317398679e-08
>>>  2 SNES Function norm 1.719678934825e-05
>>>    0 KSP Residual norm 1.651586453143e-06
>>>    1 KSP Residual norm 2.037037536868e-08
>>>    2 KSP Residual norm 1.109736798274e-08
>>>    3 KSP Residual norm 1.857218772156e-12
>>>  3 SNES Function norm 1.159391068583e-09
>>>    0 KSP Residual norm 3.116936044619e-11
>>>    1 KSP Residual norm 1.366503312678e-12
>>>    2 KSP Residual norm 6.598477672192e-13
>>>    3 KSP Residual norm 5.306147277879e-17
>>>  4 SNES Function norm 2.202297235559e-10
>>>
>>>  Note the repeated norms when using cuda.  Looks like I'll have to take
>>> a closer look at this.
>>>
>>> -Victor
>>>
>>> ---
>>> Victor L. Minden
>>>
>>> Tufts University
>>> School of Engineering
>>> Class of 2012
>>>
>>>
>>>
>>> On Thu, May 5, 2011 at 2:57 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>>> >
>>> > Alexander
>>> >
>>> >    Thank you for the sample code; it will be very useful.
>>> >
>>> >    We have run parallel jobs with CUDA where each node has only a single
>>> > MPI process and uses a single GPU, without the crash that you get below. I
>>> > cannot explain why it would not work in your situation. Do you have access
>>> > to two nodes, each with a GPU, so you could try that?
>>> >
>>> >   It is crashing in a delete of a
>>> >
>>> > struct  _p_PetscCUSPIndices {
>>> >  CUSPINTARRAYCPU indicesCPU;
>>> >  CUSPINTARRAYGPU indicesGPU;
>>> > };
>>> >
>>> > where CUSPINTARRAYGPU is a cusp::array1d<PetscInt,cusp::device_memory>,
>>> >
>>> > thus it is crashing after it has actually completed the computation. If
>>> > you run with -snes_monitor -ksp_monitor, with and without -da_vec_type
>>> > cusp, on 2 processes, what do you get for output in the two cases? I want
>>> > to see whether it is running correctly on two processes.
>>> >
>>> > Could the crash be due to memory corruption at some point during the
>>> > computation?
>>> >
>>> >
>>> >   Barry
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On May 5, 2011, at 3:38 AM, Alexander Grayver wrote:
>>> >
>>> >> Hello!
>>> >>
>>> >> We work with the petsc-dev branch and the ex47cu.cu example. Our platform
>>> >> is an Intel Quad processor with 8 identical Tesla GPUs, and the CUDA 3.2
>>> >> toolkit is installed.
>>> >> Ideally we would like to make PETSc work in a multi-GPU way within just
>>> >> one node, so that different GPUs can be attached to different processes.
>>> >> Since this is not possible with the current PETSc implementation, we
>>> >> created a preload library (see LD_PRELOAD for details) for the CUBLAS
>>> >> function cublasInit().
>>> >> When PETSc calls this function our library gets control, we assign a GPU
>>> >> according to the rank within the MPI communicator, and then we call the
>>> >> original cublasInit().
>>> >> This preload library is very simple; see petsc_mgpu.c attached.
>>> >> This trick gives each process its own context, and ideally all
>>> >> computations should be distributed over several GPUs.
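>>> >>
>>> >> Roughly, the interposer looks like the following simplified sketch (the
>>> >> real petsc_mgpu.c is attached; the library name and the rank-modulo-device
>>> >> policy here are just illustrative):
>>> >>
>>> >> /* Intercept cublasInit() via LD_PRELOAD and bind each MPI rank to its
>>> >>    own GPU before forwarding to the real CUBLAS implementation. */
>>> >> #define _GNU_SOURCE
>>> >> #include <dlfcn.h>
>>> >> #include <cuda_runtime.h>
>>> >> #include <cublas.h>
>>> >> #include <mpi.h>
>>> >>
>>> >> cublasStatus cublasInit(void)
>>> >> {
>>> >>   int rank = 0, ndev = 1, mpi_up = 0;
>>> >>
>>> >>   /* Pick a device based on the rank in MPI_COMM_WORLD. */
>>> >>   MPI_Initialized(&mpi_up);
>>> >>   if (mpi_up) MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> >>   cudaGetDeviceCount(&ndev);
>>> >>   cudaSetDevice(rank % ndev);
>>> >>
>>> >>   /* Chain to the real cublasInit() from libcublas. */
>>> >>   cublasStatus (*real_init)(void) =
>>> >>     (cublasStatus (*)(void)) dlsym(RTLD_NEXT, "cublasInit");
>>> >>   return real_init();
>>> >> }
>>> >>
>>> >> It is compiled into a shared library and activated with something like
>>> >> LD_PRELOAD=./libpetsc_mgpu.so when launching mpirun.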
>>> >>
>>> >> We managed to build PETSc and the example (see makefile attached) and we
>>> >> tested it as follows:
>>> >>
>>> >> [agraiver at tesla-cmc new]$ ./lapexp -da_grid_x 65535 -info > cpu_1process.out
>>> >> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -info > cpu_2processes.out
>>> >> [agraiver at tesla-cmc new]$ ./lapexp -da_grid_x 65535 -da_vec_type cusp -info > gpu_1process.out
>>> >> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -da_vec_type cusp -info > gpu_2processes.out
>>> >>
>>> >> Everything except the last configuration works well. The last one crashes
>>> >> with the following exception and call stack:
>>> >> terminate called after throwing an instance of
>>> >> 'thrust::system::system_error'
>>> >>   what():  invalid device pointer
>>> >> [tesla-cmc:15549] *** Process received signal ***
>>> >> [tesla-cmc:15549] Signal: Aborted (6)
>>> >> [tesla-cmc:15549] Signal code:  (-6)
>>> >> [tesla-cmc:15549] [ 0] /lib64/libpthread.so.0() [0x3de540eeb0]
>>> >> [tesla-cmc:15549] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3de50330c5]
>>> >> [tesla-cmc:15549] [ 2] /lib64/libc.so.6(abort+0x186) [0x3de5034a76]
>>> >> [tesla-cmc:15549] [ 3] /opt/llvm/dragonegg/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d) [0x7f0d3530b95d]
>>> >> [tesla-cmc:15549] [ 4] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7b76) [0x7f0d35309b76]
>>> >> [tesla-cmc:15549] [ 5] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7ba3) [0x7f0d35309ba3]
>>> >> [tesla-cmc:15549] [ 6] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7cae) [0x7f0d35309cae]
>>> >> [tesla-cmc:15549] [ 7] ./lapexp(_ZN6thrust6detail6device4cuda4freeILj0EEEvNS_10device_ptrIvEE+0x69) [0x426320]
>>> >> [tesla-cmc:15549] [ 8] ./lapexp(_ZN6thrust6detail6device8dispatch4freeILj0EEEvNS_10device_ptrIvEENS0_21cuda_device_space_tagE+0x2b) [0x4258b2]
>>> >> [tesla-cmc:15549] [ 9] ./lapexp(_ZN6thrust11device_freeENS_10device_ptrIvEE+0x2f) [0x424f78]
>>> >> [tesla-cmc:15549] [10] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust23device_malloc_allocatorIiE10deallocateENS_10device_ptrIiEEm+0x33) [0x7f0d36aeacff]
>>> >> [tesla-cmc:15549] [11] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEE10deallocateEv+0x6e) [0x7f0d36ae8e78]
>>> >> [tesla-cmc:15549] [12] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEED1Ev+0x19) [0x7f0d36ae75f7]
>>> >> [tesla-cmc:15549] [13] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail11vector_baseIiNS_23device_malloc_allocatorIiEEED1Ev+0x52) [0x7f0d36ae65f4]
>>> >> [tesla-cmc:15549] [14] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN4cusp7array1dIiN6thrust6detail21cuda_device_space_tagEED1Ev+0x18) [0x7f0d36ae5c2e]
>>> >> [tesla-cmc:15549] [15] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN19_p_PetscCUSPIndicesD1Ev+0x1d) [0x7f0d3751e45f]
>>> >> [tesla-cmc:15549] [16] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(PetscCUSPIndicesDestroy+0x20f) [0x7f0d3750c840]
>>> >> [tesla-cmc:15549] [17] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy_PtoP+0x1bc8) [0x7f0d375af8af]
>>> >> [tesla-cmc:15549] [18] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy+0x586) [0x7f0d375e9ddf]
>>> >> [tesla-cmc:15549] [19] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy_MPIAIJ+0x49f) [0x7f0d37191d24]
>>> >> [tesla-cmc:15549] [20] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy+0x546) [0x7f0d370d54fe]
>>> >> [tesla-cmc:15549] [21] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESReset+0x5d1) [0x7f0d3746fac3]
>>> >> [tesla-cmc:15549] [22] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESDestroy+0x4b8) [0x7f0d37470210]
>>> >> [tesla-cmc:15549] [23] ./lapexp(main+0x5ed) [0x420745]
>>> >>
>>> >> I've sent all detailed output files for the different execution
>>> >> configurations listed above, as well as configure.log and make.log, to
>>> >> petsc-maint at mcs.anl.gov, hoping that someone could recognize the problem.
>>> >> Right now we have one node with multiple GPUs, but I'm also wondering
>>> >> whether anyone has really tested the GPU functionality over several nodes
>>> >> with one GPU each?
>>> >>
>>> >> Regards,
>>> >> Alexander
>>> >>
>>> >> <petsc_mgpu.c><makefile.txt><configure.log>
>>> >
>>> >
>>>
>>
>>
>>
>
>