[petsc-dev] PETSc GPU capabilities

Chetan Jhurani chetan.jhurani at gmail.com
Tue Feb 28 01:46:48 CST 2012


John, Paul,

I ran the example with the same options, and the code aborts at a different location in cusp, although the failing code is still called from PCSetUp_SACUSP. The example works fine if txpetscgpu is not used.

Valgrind does not show any relevant issues prior to the std::terminate.

My best guess, based on this and some investigation, is that the abort is caused by inconsistent C-style casts in the code (which are #ifdef'ed out when txpetscgpu is not used). They could be related to the different code paths taken when calling MatCUSPCopyToGPU in sacusp.cu, depending on the txpetscgpu macro.
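
To illustrate the kind of pattern I mean, here is a purely hypothetical, self-contained example (not the actual sacusp.cu or txpetscgpu source; the struct names and the macro are made up):

/* Hypothetical illustration only -- not the real PETSc/txpetscgpu code.
 * Two #ifdef branches that disagree about the layout behind an opaque
 * pointer: the wrong branch reads a pointer where the other path stored
 * a size, so an address-sized garbage value (compare new_size=46912574500784
 * in John's backtrace below) gets handed on as if it were an element count. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { size_t num_rows; double *values; } LayoutA;  /* actual layout of the data       */
typedef struct { double *values; size_t num_rows; } LayoutB;  /* layout the other path assumes   */

static size_t get_num_rows(void *spptr)
{
#ifdef HAVE_TXPETSCGPU_LIKE_PATH          /* stand-in name for the real macro */
  return ((LayoutB *)spptr)->num_rows;    /* wrong cast: reads the pointer field */
#else
  return ((LayoutA *)spptr)->num_rows;    /* correct cast */
#endif
}

int main(void)
{
  LayoutA a;
  a.num_rows = 100;
  a.values   = malloc(100 * sizeof(double));
  /* Compiled with -DHAVE_TXPETSCGPU_LIKE_PATH this prints an address-sized
   * number instead of 100 -- the same symptom as the bad resize() call. */
  printf("num_rows = %zu\n", get_num_rows(&a));
  free(a.values);
  return 0;
}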

I'm busy with other stuff, but I'll let you know when this gets fixed.

Chetan

From: petsc-dev-bounces at mcs.anl.gov [mailto:petsc-dev-bounces at mcs.anl.gov] On Behalf Of John Fettig
Sent: Monday, February 27, 2012 2:02 PM
To: For users of the development version of PETSc
Subject: Re: [petsc-dev] PETSc GPU capabilities

It finally finished running through cuda-gdb.  Here's a backtrace.  new_size=46912574500784 in the call to
thrust::detail::vector_base<double, thrust::device_malloc_allocator<double> >::resize looks suspicious.

#0  0x0000003e1c832885 in raise () from /lib64/libc.so.6
#1  0x0000003e1c834065 in abort () from /lib64/libc.so.6
#2  0x0000003e284bea7d in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib64/libstdc++.so.6
#3  0x0000003e284bcc06 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003e284bcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003e284bcd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00002aaaab45ad71 in thrust::detail::backend::cuda::malloc<0u> (n=375300596006272)
    at malloc.inl:50
#7  0x00002aaaab454322 in thrust::detail::backend::dispatch::malloc<0u> (n=375300596006272)
    at malloc.h:56
#8  0x00002aaaab453555 in thrust::device_malloc (n=375300596006272) at device_malloc.inl:32
#9  0x00002aaaab46477d in thrust::device_malloc<double> (n=46912574500784)
    at device_malloc.inl:38
#10 0x00002aaaab461fce in thrust::device_malloc_allocator<double>::allocate (
    this=0x7fffffff9880, cnt=46912574500784) at device_malloc_allocator.h:101
#11 0x00002aaaab45ee91 in thrust::detail::contiguous_storage<double, thrust::device_malloc_allocator<double> >::allocate (this=0x7fffffff9880, n=46912574500784)
    at contiguous_storage.inl:134
#12 0x00002aaaab46ebba in thrust::detail::contiguous_storage<double, thrust::device_malloc_allocator<double> >::contiguous_storage (this=0x7fffffff9880, n=46912574500784)
    at contiguous_storage.inl:46
#13 0x00002aaaab46cd1e in thrust::detail::vector_base<double, thrust::device_malloc_allocator<double> >::fill_insert (this=0x13623990, position=..., n=46912574500784, x=@0x7fffffff9f18)
    at vector_base.inl:792
#14 0x00002aaaab46b058 in thrust::detail::vector_base<double, thrust::device_malloc_allocator<double> >::insert (this=0x13623990, position=..., n=46912574500784, x=@0x7fffffff9f18)
    at vector_base.inl:561
#15 0x00002aaaab4692a3 in thrust::detail::vector_base<double, thrust::device_malloc_allocator<double> >::resize (this=0x13623990, new_size=46912574500784, x=@0x7fffffff9f18)
    at vector_base.inl:222
#16 0x00002aaaac2c3d9b in cusp::precond::smoothed_aggregation<int, double, thrust::detail::cuda_device_space_tag>::smoothed_aggregation<cusp::csr_matrix<int, double, thrust::detail::cuda_device_space_tag> > (this=0x136182b0, A=..., theta=0)
    at smoothed_aggregation.inl:210
#17 0x00002aaaac27cf84 in PCSetUp_SACUSP (pc=0x1360f330) at sacusp.cu:76
#18 0x00002aaaac1f0024 in PCSetUp (pc=0x1360f330) at precon.c:832
#19 0x00002aaaabd02144 in KSPSetUp (ksp=0x135d2a00) at itfunc.c:261
#20 0x00002aaaabd0396e in KSPSolve (ksp=0x135d2a00, b=0x135a0fa0, x=0x135a2b50)
    at itfunc.c:385
#21 0x0000000000403619 in main (argc=17, args=0x7fffffffc538) at ex2.c:217



On Mon, Feb 27, 2012 at 4:48 PM, John Fettig <john.fettig at gmail.com> wrote:

Hi Paul,

This is very interesting.  I tried building the code with --download-txpetscgpu and it doesn't work for me.  It runs out of memory,
no matter how small the problem (this is ex2 from src/ksp/ksp/examples/tutorials):

mpirun -np 1 ./ex2 -n 10 -m 10 -ksp_type cg -pc_type sacusp -mat_type aijcusp -vec_type cusp -cusp_storage_format csr -use_cusparse 0

terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  what():  std::bad_alloc: out of memory
MPI Application rank 0 killed before MPI_Finalize() with signal 6

This example works fine when I build without your gpu additions (and for much larger problems too).  Am I doing something wrong?

For reference, I'm using CUDA 4.1, CUSP 0.3, and Thrust 1.5.1.

John

On Fri, Feb 10, 2012 at 5:04 PM, Paul Mullowney <paulm at txcorp.com> wrote:

Hi All,

I've been developing GPU capabilities for PETSc. The development has focused mostly on:
(1) An efficient multi-GPU SpMV, i.e. MatMult. This is working well.
(2) The triangular solve used in ILU preconditioners, i.e. MatSolve. The performance of this ... is what it is :|
This code is in beta mode; keep that in mind if you decide to use it. It supports single and double precision, real numbers only!
Complex will be supported at some point in the future, but not any time soon.

To build with these capabilities, add the following to your configure line:
--download-txpetscgpu=yes

The capabilities of the SpMV code are accessed with the following two command-line flags (a minimal sketch of the host code they assume follows below):
-cusp_storage_format csr (other options are coo (coordinate), ell (ellpack), and dia (diagonal); hyb (hybrid) is not yet supported)
-use_cusparse (this is a boolean and at the moment is only supported with csr format matrices; in the future, cusparse will work with ell, coo, and hyb formats)
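
For context, a minimal driver along the lines of ex2 might look like the sketch below (hypothetical code, written against the PETSc-3.2-era API, e.g. the four-argument KSPSetOperators; it never mentions CUSP directly, since -mat_type aijcusp, -vec_type cusp, -pc_type sacusp, and the flags above are all picked up from the options database at runtime):

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PetscInt       i, Istart, Iend, n = 100;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);

  /* -mat_type aijcusp is honored here and moves the matrix to the GPU */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A, 3, PETSC_NULL, 3, PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A, 3, PETSC_NULL);CHKERRQ(ierr);

  /* assemble a 1-D Laplacian on the locally owned rows */
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (i = Istart; i < Iend; i++) {
    if (i > 0)   { ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    if (i < n-1) { ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* -vec_type cusp is honored here */
  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);

  /* -ksp_type cg, -pc_type sacusp, etc. are honored here */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A, DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

Run with options like those in John's mail (-ksp_type cg -pc_type sacusp -mat_type aijcusp -vec_type cusp -cusp_storage_format csr), the MatMult and PCSetUp work in the solve is done on the GPU.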

Regarding the number of GPUs to run on: imagine a system with P nodes, N cores per node, and M GPUs per node. Then, to use only the GPUs, I would run with M ranks per node over P nodes. As an example, I have a system with 2 nodes; each node has 8 cores and 4 GPUs attached (P=2, N=8, M=4). In a PBS queue script, one would request 2 nodes at 4 processors per node (see the sketch below), so that each MPI rank (CPU process) is attached to a GPU.
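
For that 2-node / 4-GPU-per-node case, the relevant part of a PBS script would look something like the following (a sketch only; the queue directives, the ex2 binary, and the exact mpirun placement options such as -machinefile depend on your site and MPI implementation):

#!/bin/bash
#PBS -l nodes=2:ppn=4                  # 4 ranks per node = one rank per GPU
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
# 2 nodes x 4 ranks = 8 ranks total; the launcher reads the node list from PBS
mpirun -np 8 -machinefile $PBS_NODEFILE ./ex2 \
    -ksp_type cg -pc_type sacusp -mat_type aijcusp -vec_type cusp -cusp_storage_format csr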

You do not need to explicitly manage the GPUs, apart from understanding what type of system you are running on. To learn how many
devices are available per node, use the command line flag:
-cuda_show_devices

-Paul
