[petsc-dev] PETSc GPU capabilities

John Fettig john.fettig at gmail.com
Mon Feb 27 16:02:11 CST 2012


It finally finished running through cuda-gdb; here's a backtrace.  The value
new_size=46912574500784 passed to thrust::detail::vector_base<double,
thrust::device_malloc_allocator<double> >::resize looks suspicious.

#0  0x0000003e1c832885 in raise () from /lib64/libc.so.6
#1  0x0000003e1c834065 in abort () from /lib64/libc.so.6
#2  0x0000003e284bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x0000003e284bcc06 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003e284bcc33 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003e284bcd2e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00002aaaab45ad71 in thrust::detail::backend::cuda::malloc<0u> (n=375300596006272) at malloc.inl:50
#7  0x00002aaaab454322 in thrust::detail::backend::dispatch::malloc<0u> (n=375300596006272) at malloc.h:56
#8  0x00002aaaab453555 in thrust::device_malloc (n=375300596006272) at device_malloc.inl:32
#9  0x00002aaaab46477d in thrust::device_malloc<double> (n=46912574500784) at device_malloc.inl:38
#10 0x00002aaaab461fce in thrust::device_malloc_allocator<double>::allocate (this=0x7fffffff9880, cnt=46912574500784) at device_malloc_allocator.h:101
#11 0x00002aaaab45ee91 in thrust::detail::contiguous_storage<double, thrust::device_malloc_allocator<double> >::allocate (this=0x7fffffff9880, n=46912574500784) at contiguous_storage.inl:134
#12 0x00002aaaab46ebba in thrust::detail::contiguous_storage<double, thrust::device_malloc_allocator<double> >::contiguous_storage (this=0x7fffffff9880, n=46912574500784) at contiguous_storage.inl:46
#13 0x00002aaaab46cd1e in thrust::detail::vector_base<double, thrust::device_malloc_allocator<double> >::fill_insert (this=0x13623990, position=..., n=46912574500784, x=@0x7fffffff9f18) at vector_base.inl:792
#14 0x00002aaaab46b058 in thrust::detail::vector_base<double, thrust::device_malloc_allocator<double> >::insert (this=0x13623990, position=..., n=46912574500784, x=@0x7fffffff9f18) at vector_base.inl:561
#15 0x00002aaaab4692a3 in thrust::detail::vector_base<double, thrust::device_malloc_allocator<double> >::resize (this=0x13623990, new_size=46912574500784, x=@0x7fffffff9f18) at vector_base.inl:222
#16 0x00002aaaac2c3d9b in cusp::precond::smoothed_aggregation<int, double, thrust::detail::cuda_device_space_tag>::smoothed_aggregation<cusp::csr_matrix<int, double, thrust::detail::cuda_device_space_tag> > (this=0x136182b0, A=..., theta=0) at smoothed_aggregation.inl:210
#17 0x00002aaaac27cf84 in PCSetUp_SACUSP (pc=0x1360f330) at sacusp.cu:76
#18 0x00002aaaac1f0024 in PCSetUp (pc=0x1360f330) at precon.c:832
#19 0x00002aaaabd02144 in KSPSetUp (ksp=0x135d2a00) at itfunc.c:261
#20 0x00002aaaabd0396e in KSPSolve (ksp=0x135d2a00, b=0x135a0fa0, x=0x135a2b50) at itfunc.c:385
#21 0x0000000000403619 in main (argc=17, args=0x7fffffffc538) at ex2.c:217
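
For what it's worth, the allocation failure is easy to trigger outside of
PETSc.  Below is a minimal sketch (my own, not code from sacusp.cu) that just
resizes a thrust vector to the element count from frame #15 and dies on the
same device_malloc path:

  // repro.cu: build with nvcc; needs only Thrust
  #include <thrust/device_vector.h>
  #include <new>
  #include <cstdio>

  int main()
  {
      thrust::device_vector<double> v;
      // the suspicious value from frame #15 (~375 TB worth of doubles)
      size_t n = 46912574500784ULL;
      try {
          v.resize(n, 0.0);   // thrust::device_malloc throws bad_alloc here
      } catch (const std::bad_alloc &e) {
          std::printf("caught: %s\n", e.what());
      }
      return 0;
  }

So the question is where that bogus size comes from before it reaches the
smoothed_aggregation constructor.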


On Mon, Feb 27, 2012 at 4:48 PM, John Fettig <john.fettig at gmail.com> wrote:

> Hi Paul,
>
> This is very interesting.  I tried building the code with
> --download-txpetscgpu and it doesn't work for me.  It runs out of memory,
> no matter how small the problem (this is ex2 from
> src/ksp/ksp/examples/tutorials):
>
> mpirun -np 1 ./ex2 -n 10 -m 10 -ksp_type cg -pc_type sacusp -mat_type
> aijcusp -vec_type cusp -cusp_storage_format csr -use_cusparse 0
>
> terminate called after throwing an instance of
> 'thrust::system::detail::bad_alloc'
>   what():  std::bad_alloc: out of memory
> MPI Application rank 0 killed before MPI_Finalize() with signal 6
>
> This example works fine when I build without your GPU additions (and for
> much larger problems too).  Am I doing something wrong?
>
> For reference, I'm using CUDA 4.1, CUSP 0.3, and Thrust 1.5.1.
>
> John
>
>
> On Fri, Feb 10, 2012 at 5:04 PM, Paul Mullowney <paulm at txcorp.com> wrote:
>
>> Hi All,
>>
>> I've been developing GPU capabilities for PETSc. The development has
>> focused mostly on
>> (1) An efficient multi-GPU SpMV, i.e. MatMult. This is working well.
>> (2) The triangular solve used in ILU preconditioners, i.e. MatSolve. The
>> performance of this ... is what it is :|
>> This code is in beta; keep that in mind if you decide to use it. It
>> supports single and double precision, real numbers only! Complex will be
>> supported at some point in the future, but not any time soon.
>>
>> To build with these capabilities, add the following to your configure line:
>> --download-txpetscgpu=yes
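>>
>> For reference, a fuller configure line might look roughly like the
>> following (only a sketch; the exact CUDA/CUSP/Thrust options depend on
>> your install and PETSc version):
>>
>> ./configure --with-cuda=1 --with-cusp=1 --with-thrust=1 --download-txpetscgpu=yes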
>>
>> The capabilities of the SpMV code are accessed with the following two
>> command-line flags:
>> -cusp_storage_format csr (other options are coo (coordinate), ell
>> (ellpack), and dia (diagonal); hyb (hybrid) is not yet supported)
>> -use_cusparse (a boolean; at the moment it is only supported with csr
>> format matrices. In the future, cusparse will work with the ell, coo, and
>> hyb formats.)
>>
>> Regarding the number of GPUs to run on:
>> Imagine a system with P nodes, N cores per node, and M GPUs per node.
>> Then, to use only the GPUs, I would run with M ranks per node over P nodes.
>> As an example, I have a system with 2 nodes, each with 8 cores and 4 GPUs
>> (P=2, N=8, M=4). In a PBS queue script, one would request 2 nodes at 4
>> processors per node. Each MPI rank (CPU process) will be attached to a GPU.
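>>
>> For that 2-node example, a PBS fragment might look roughly like this (only
>> a sketch; the resource line and launcher syntax vary by site):
>>
>> #PBS -l nodes=2:ppn=4
>> mpirun -np 8 ./ex2 -ksp_type cg -pc_type sacusp -mat_type aijcusp -vec_type cusp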
>>
>> You do not need to explicitly manage the GPUs, apart from understanding
>> what type of system you are running on. To learn how many devices are
>> available per node, use the command line flag:
>> -cuda_show_devices
>>
>> -Paul
>>
>
>