[petsc-users] Why does GPU solve the large sparse matrix equations only a little faster than CPU?

Xiangze Zeng zengshixiangze at 163.com
Sun Aug 5 23:44:15 CDT 2012


Do you mean all the computational work is done on the GPU?


When I run ex5 with -dm_vec_type veccusp -dm_mat_type mataijcusp, the following error appears:


~/ex5\>./ex5 -dm_vec_type veccusp -dm_mat_type -log_summary ex5_log
[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Unknown type. Check for miss-spelling or missing external package needed for type:
see http://www.mcs.anl.gov/petsc/documentation/installation.html#external!
[0]PETSC ERROR: Unknown vector type: veccusp!
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Petsc Development HG revision: d01946145980533f72b6500bd243b1dd3666686c  HG Date: Mon Jul 30 17:03:27 2012 -0500
[0]PETSC ERROR: See docs/changes/index.html for recent updates.
[0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[0]PETSC ERROR: See docs/index.html for manual pages.
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: ./ex5 on a arch-cuda named hohhot by hongwang Mon Aug  6 12:27:19 2012
[0]PETSC ERROR: Libraries linked from /usr/src/petsc/petsc-dev/arch-cuda-double/lib
[0]PETSC ERROR: Configure run at Sat Aug  4 15:10:44 2012
[0]PETSC ERROR: Configure options --doCleanup=1 --with-gnu-compilers=1 --with-vendor-compilers=0 --CFLAGS=-march=x86-64 --CXXFLAGS=-march=x86-64 --with-dynamic-loading --with-python=1 --with-debugging=0 --with-log=1 --download-mpich=1 --with-hypre=0 --with-64-bit-indices=yes --with-x11=1 --with-x11-include=/usr/include/X11 --download-f-blas-lapack=1 --with-cuda=1 --with-cusp=1 --with-thrust=1 --download-txpetscgpu=1 --with-precision=double --with-cudac="nvcc -m64" --download-txpetscgpu=1 --with-clanguage=c --with-cuda-arch=sm_20
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: VecSetType() line 44 in src/vec/vec/interface/vecreg.c
[0]PETSC ERROR: DMCreateGlobalVector_DA() line 36 in src/dm/impls/da/dadist.c
[0]PETSC ERROR: DMCreateGlobalVector() line 443 in src/dm/interface/dm.c
[0]PETSC ERROR: DMDASetUniformCoordinates() line 58 in src/dm/impls/da/gr1.c
[0]PETSC ERROR: main() line 113 in src/snes/examples/tutorials/ex5.c
application called MPI_Abort(MPI_COMM_WORLD, 86) - process 0
[unset]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 86) - process 0


Is there something wrong with CUSP? My PETSc version is petsc-dev, the CUSP version I use is 0.3.1, and the CUDA version is 4.2.


Zeng Xiangze 
On 2012-08-06 03:18:58, "Matthew Knepley" <knepley at gmail.com> wrote:
On Sun, Aug 5, 2012 at 10:24 AM, Xiangze Zeng <zengshixiangze at 163.com> wrote:

Dear Matt,


Thank you for your suggestion. I'm learning to use the GPU effectively step by step. I think it would be useful for novices if there were a manual about using PETSc with CUDA.
After each iteration, is the Vec copied to the host to evaluate the stopping condition?


No, if that were true, we would have given up long ago. My guess is that some of your Vecs are not the correct type.
Can you look at ex5 using -dm_vec_type veccusp -dm_mat_type mataijcusp and mail petsc-maint at mcs.anl.gov?


   Matt
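For concreteness, here is a minimal sketch of how the CUSP types can be selected directly in application code (rather than through a DM), assuming a petsc-dev build configured with CUSP; the function name and the size argument are placeholders:

  #include <petscvec.h>
  #include <petscmat.h>

  /* Create a Vec and a Mat that live on the GPU. */
  PetscErrorCode CreateGPUObjects(MPI_Comm comm, PetscInt n, Vec *x, Mat *A)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = VecCreate(comm, x);CHKERRQ(ierr);
    ierr = VecSetSizes(*x, PETSC_DECIDE, n);CHKERRQ(ierr);
    ierr = VecSetType(*x, VECCUSP);CHKERRQ(ierr);     /* GPU vector type */

    ierr = MatCreate(comm, A);CHKERRQ(ierr);
    ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
    ierr = MatSetType(*A, MATAIJCUSP);CHKERRQ(ierr);  /* GPU AIJ matrix type */
    ierr = MatSetUp(*A);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }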
 
Sincerely,
Zeng Xiangze



On 2012-08-05 20:27:55, "Matthew Knepley" <knepley at gmail.com> wrote:
On Sat, Aug 4, 2012 at 11:23 PM, Xiangze Zeng <zengshixiangze at 163.com> wrote:

When I change the PC type to JACOBI and the KSP type to BICG, the computational speed on both the GPU and the CPU is higher than with SOR+BCGS, but the GPU still does not seem much more efficient: it is only about 20% faster. Do you have any suggestions? The attachments are the output of -log_summary.


You also have to look at the log_summary:


VecCUSPCopyTo       3967 1.0 1.3152e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
VecCUSPCopyFrom     3969 1.0 5.5139e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  9  0  0  0  0   9  0  0  0  0     0
MatCUSPCopyTo          1 1.0 4.5194e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0


1) I said to use GMRES for a reason. Listen to me. BiCG uses the transpose, which right now confuses the results


2) Look at the copies to/from the GPU. You should not be copying the vector 4000 times. Start simple until you understand
    everything about how the code is running. Use -pc_type none -ksp_type gmres and see if you can understand the results.
    Then try different KSP and PC. Trying everything at once does not help anyone, and it is not science.


    Matt
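As an illustration of the "start simple" advice, here is a minimal solve that picks up -ksp_type gmres and -pc_type none from the command line, assuming the KSPSetOperators signature of petsc-dev at that time; the function name is hypothetical:

  #include <petscksp.h>

  /* Solve A x = b, honoring -ksp_type, -pc_type, tolerances, ... given at run time. */
  PetscErrorCode SolveWithOptions(Mat A, Vec b, Vec x)
  {
    KSP            ksp;
    MPI_Comm       comm;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = PetscObjectGetComm((PetscObject)A, &comm);CHKERRQ(ierr);
    ierr = KSPCreate(comm, &ksp);CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* reads -ksp_type gmres -pc_type none */
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }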
 
Thank you!


Zeng Xiangze 

At 2012-08-05 00:01:11,"Xiangze Zeng" <zengshixiangze at 163.com> wrote:

JACOBI+GMRES takes 124s to solve one system on the GPU and 172s on the CPU. When I use JACOBI+BICG, it takes 123s on the GPU and 162s on the CPU. At http://www.mcs.anl.gov/petsc/features/gpus.html, I see "All of the Krylov methods except KSPIBCGS run on the GPU." I can't find KSPIBCGS in the manual; is it KSPBCGS?

On 2012-08-04 23:04:55, "Matthew Knepley" <knepley at gmail.com> wrote:
On Sat, Aug 4, 2012 at 9:42 AM, Xiangze Zeng <zengshixiangze at 163.com> wrote:

Another error happens when I change the PC type. When I change it to PCJACOBI, the following error message appears:


[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Petsc has generated inconsistent data!
[0]PETSC ERROR: Divide by zero!
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Petsc Development HG revision: d01946145980533f72b6500bd243b1dd3666686c  HG Date: Mon Jul 30 17:03:27 2012 -0500
[0]PETSC ERROR: See docs/changes/index.html for recent updates.
[0]PETSC ERROR: See docs/faq.html for hints about trouble shooting.
[0]PETSC ERROR: See docs/index.html for manual pages.
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: ../../femsolcu/./femsolcu on a arch-cuda named hohhot by hongwang Sat Aug  4 22:23:58 2012
[0]PETSC ERROR: Libraries linked from /usr/src/petsc/petsc-dev/arch-cuda-double/lib
[0]PETSC ERROR: Configure run at Sat Aug  4 15:10:44 2012
[0]PETSC ERROR: Configure options --doCleanup=1 --with-gnu-compilers=1 --with-vendor-compilers=0 --CFLAGS=-march=x86-64 --CXXFLAGS=-march=x86-64 --with-dynamic-loading --with-python=1 --with-debugging=0 --with-log=1 --download-mpich=1 --with-hypre=0 --with-64-bit-indices=yes --with-x11=1 --with-x11-include=/usr/include/X11 --download-f-blas-lapack=1 --with-cuda=1 --with-cusp=1 --with-thrust=1 --download-txpetscgpu=1 --with-precision=double --with-cudac="nvcc -m64" --download-txpetscgpu=1 --with-clanguage=c --with-cuda-arch=sm_20
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: KSPSolve_BCGS() line 105 in src/ksp/ksp/impls/bcgs/bcgs.c
[0]PETSC ERROR: KSPSolve() line 446 in src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: sol_comp() line 39 in "unknowndirectory/"solve.c


And when I change it to PCSACUSP or PCSACUSPPOLY, both report out of memory (I guess it's the GPU's memory). When I change it to PCAINVCUSP, the result is no better than when I don't change the type.


This is a breakdown in that algorithm. Try GMRES.


   Matt
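For diagnosing cases where the iteration stops early, checking the converged reason after KSPSolve() can help; a rough sketch (the helper name is made up):

  #include <petscksp.h>

  /* Report why the Krylov iteration stopped (negative reasons indicate
     divergence, e.g. breakdown in BiCGStab). */
  PetscErrorCode ReportConvergence(KSP ksp)
  {
    KSPConvergedReason reason;
    PetscInt           its;
    PetscErrorCode     ierr;

    PetscFunctionBegin;
    ierr = KSPGetConvergedReason(ksp, &reason);CHKERRQ(ierr);
    ierr = KSPGetIterationNumber(ksp, &its);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "KSP stopped after %D iterations: %s\n",
                       its, KSPConvergedReasons[reason]);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }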
 
Does it have something to do with the KSP type? Should I look for a suitable KSP type to match a PC type that can work on the GPU?

On 2012-08-04 21:44:02, "Matthew Knepley" <knepley at gmail.com> wrote:
On Sat, Aug 4, 2012 at 5:58 AM, Xiangze Zeng <zengshixiangze at 163.com> wrote:

After I rerun with "debugging=no", the CPU takes 30 minutes and the GPU 22 minutes, a little better than before. The attachments are the output of -log_summary.


1) Notice how PCApply takes most of the time, so MatMult is not very important.


2) In g_log_3, notice that every time your PC is called, the vector is pulled from the GPU to the CPU.
    This means we do not support that PC on the GPU.


There is a restriction on PCs since not many are coded for the GPU. Only PCJACOBI, PCSACUSP, PCSACUSPPOLY, and PCAINVCUSP
work there; see http://www.mcs.anl.gov/petsc/features/gpus.html.


   Matt
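A sketch of selecting one of those GPU-capable preconditioners in code, using Jacobi with GMRES as in the earlier suggestion (the function name is hypothetical; the others would be selected the same way, e.g. -pc_type sacusp on the command line, as far as I understand the naming):

  #include <petscksp.h>

  /* Use a preconditioner with a GPU implementation so vectors are not
     pulled back to the host on every PCApply. */
  PetscErrorCode UseGPUFriendlyPC(KSP ksp)
  {
    PC             pc;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = KSPSetType(ksp, KSPGMRES);CHKERRQ(ierr);
    ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
    ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);  /* PCSACUSP, PCSACUSPPOLY, PCAINVCUSP are the other GPU options */
    PetscFunctionReturn(0);
  }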
 
At 2012-08-04 14:40:33,"Azamat Mametjanov" <azamat.mametjanov at gmail.com> wrote:
What happens if you try to re-run with "--with-debugging=no"?


On Fri, Aug 3, 2012 at 10:00 PM, Xiangze Zeng <zengshixiangze at 163.com> wrote:

Dear Matt,


My CPU is an Intel Xeon E5-2609, and my GPU is an NVIDIA GF100 [Quadro 4000].
The size of the system is 2522469 x 2522469, and the number of nonzero elements is 71773925, about 0.000012 of the total.
The output of -log_summary is in the attachments: G_log_summary is the output when using the GPU, and C_log_summary when using the CPU.


Zeng Xiangze
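The size and nonzero count quoted above can be read off the assembled matrix directly; a small sketch, with a made-up helper name:

  #include <petscmat.h>

  /* Print global dimensions, nonzero count, and fill fraction of A. */
  PetscErrorCode ReportSparsity(Mat A)
  {
    MatInfo        info;
    PetscInt       m, n;
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = MatGetSize(A, &m, &n);CHKERRQ(ierr);
    ierr = MatGetInfo(A, MAT_GLOBAL_SUM, &info);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "Matrix %D x %D, nonzeros %g, fill fraction %g\n",
                       m, n, info.nz_used, info.nz_used / ((double)m * (double)n));CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }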


On 2012-08-03 22:28:07, "Matthew Knepley" <knepley at gmail.com> wrote:

On Fri, Aug 3, 2012 at 9:18 AM, Xiangze Zeng <zengshixiangze at 163.com> wrote:

Dear all,


When I use the CPU to solve the equations, it takes 78 minutes; when I switch to the GPU, it takes 64 minutes, only 14 minutes faster. I have seen papers saying that when PETSc is used with a GPU to solve large sparse matrix equations, it can be several times faster. What is going on?


For all performance questions, we at least need the output of -log_summary. However, we would also need to know


  - The size and sparsity of your system


  - The CPU and GPU you used (saying anything without knowing this is impossible)


   Matt
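For reference, -log_summary needs no extra code beyond the usual PETSc startup and shutdown; logging is gathered between PetscInitialize() and PetscFinalize() and the table (including the VecCUSPCopyTo/From events) is printed at exit, e.g.:

  #include <petscsys.h>

  /* Run as:  ./app -log_summary   (optionally: -log_summary filename) */
  int main(int argc, char **argv)
  {
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    /* ... assemble the system and call the solver here ... */
    ierr = PetscFinalize();
    return ierr;
  }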
 
Thank you!


Sincerely,
Zeng Xiangze








--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener








--
Mailbox 379, School of Physics
Shandong University
27 South Shanda Road, Jinan, Shandong, P.R.China, 250100