[petsc-dev] [petsc-maint #72279] PETSc and multigpu
Alexander Grayver
agrayver at gfz-potsdam.de
Fri May 6 04:38:16 CDT 2011
Hi Barry,
Thanks for the reply, first of all.
1. I put an MPI_Barrier() at the end of the example:
...
ierr = VecDestroy(&f);CHKERRQ(ierr);
ierr = SNESDestroy(&snes);CHKERRQ(ierr);
ierr = DMDestroy(&da);CHKERRQ(ierr);
MPI_Barrier(MPI_COMM_WORLD);
PetscFinalize();
return 0;
}
The same error occurs.
2. Then I tried the flags -snes_monitor and -ksp_monitor:
[agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535
-snes_monitor -ksp_monitor
0 SNES Function norm 3.906279802209e-03
0 KSP Residual norm 5.994156809227e+00
1 KSP Residual norm 3.538158441448e-04
2 KSP Residual norm 3.124431921666e-04
3 KSP Residual norm 4.109213410989e-06
1 SNES Function norm 7.201017610490e-04
0 KSP Residual norm 3.317803708316e-02
1 KSP Residual norm 2.447380361169e-06
2 KSP Residual norm 2.164193969957e-06
3 KSP Residual norm 2.124317398679e-08
2 SNES Function norm 1.719678934825e-05
0 KSP Residual norm 1.651586453143e-06
1 KSP Residual norm 2.037037536868e-08
2 KSP Residual norm 1.109736798274e-08
3 KSP Residual norm 1.857218772156e-12
3 SNES Function norm 1.159391068583e-09
0 KSP Residual norm 3.116936044619e-11
1 KSP Residual norm 1.366503312678e-12
2 KSP Residual norm 6.598477672192e-13
3 KSP Residual norm 5.306147277879e-17
4 SNES Function norm 2.202297235559e-10
[agraiver at tesla-cmc new]$ LD_PRELOAD=./libpetsc_mgpu.so mpirun -np 2
./lapexp -da_grid_x 65535 -da_vec_type cusp -snes_monitor -ksp_monitor
Assigned CUDA device 0 to MPI process 0
Assigned CUDA device 1 to MPI process 1
0 SNES Function norm 3.906279802209e-03
0 KSP Residual norm 5.994156809227e+00
1 KSP Residual norm 5.927247846249e-05
1 SNES Function norm 3.906225077938e-03
0 KSP Residual norm 5.993813868985e+00
1 KSP Residual norm 5.927575078206e-05
terminate called after throwing an instance of
'thrust::system::system_error'
what(): invalid device pointer
[tesla-cmc:17671] *** Process received signal ***
[tesla-cmc:17671] Signal: Aborted (6)
[tesla-cmc:17671] Signal code: (-6)
terminate called after throwing an instance of
'thrust::system::system_error'
what(): invalid device pointer
[tesla-cmc:17670] *** Process received signal ***
[tesla-cmc:17670] Signal: Aborted (6)
[tesla-cmc:17670] Signal code: (-6)
[tesla-cmc:17671] [ 0] /lib64/libpthread.so.0() [0x3de540eeb0]
[tesla-cmc:17671] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3de50330c5]
[tesla-cmc:17671] [ 2] /lib64/libc.so.6(abort+0x186) [0x3de5034a76]
[tesla-cmc:17671] [ 3]
/opt/llvm/dragonegg/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)
[0x7f3eceb4595d]
[tesla-cmc:17671] [ 4]
/opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7b76) [0x7f3eceb43b76]
[tesla-cmc:17671] [ 5]
/opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7ba3) [0x7f3eceb43ba3]
[tesla-cmc:17671] [ 6]
/opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7cae) [0x7f3eceb43cae]
[tesla-cmc:17671] [ 7]
./lapexp(_ZN6thrust6detail6device4cuda4freeILj0EEEvNS_10device_ptrIvEE+0x69)
[0x4263be]
[tesla-cmc:17671] [ 8]
./lapexp(_ZN6thrust6detail6device8dispatch4freeILj0EEEvNS_10device_ptrIvEENS0_21cuda_device_space_tagE+0x2b)
[0x425950]
[tesla-cmc:17671] [ 9]
./lapexp(_ZN6thrust11device_freeENS_10device_ptrIvEE+0x2f) [0x425016]
[tesla-cmc:17671] [10]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust23device_malloc_allocatorIiE10deallocateENS_10device_ptrIiEEm+0x33)
[0x7f3ed0324cff]
[tesla-cmc:17671] [11]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEE10deallocateEv+0x6e)
[0x7f3ed0322e78]
[tesla-cmc:17671] [12]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEED1Ev+0x19)
[0x7f3ed03215f7]
[tesla-cmc:17671] [13]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail11vector_baseIiNS_23device_malloc_allocatorIiEEED1Ev+0x52)
[0x7f3ed03205f4]
[tesla-cmc:17671] [14]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN4cusp7array1dIiN6thrust6detail21cuda_device_space_tagEED1Ev+0x18)
[0x7f3ed031fc2e]
[tesla-cmc:17671] [15]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN19_p_PetscCUSPIndicesD1Ev+0x1d) [0x7f3ed0d5845f]
[tesla-cmc:17671] [16]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(PetscCUSPIndicesDestroy+0x20f)
[0x7f3ed0d46840]
[tesla-cmc:17671] [17]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy_PtoP+0x1bc8)
[0x7f3ed0de98af]
[tesla-cmc:17671] [18]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy+0x586)
[0x7f3ed0e23ddf]
[tesla-cmc:17671] [19]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy_MPIAIJ+0x49f)
[0x7f3ed09cbd24]
[tesla-cmc:17671] [20]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy+0x546) [0x7f3ed090f4fe]
[tesla-cmc:17671] [21]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESReset+0x5d1) [0x7f3ed0ca9ac3]
[tesla-cmc:17671] [22]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESDestroy+0x4b8) [0x7f3ed0caa210]
[tesla-cmc:17671] [23] ./lapexp(main+0x5ed) [0x4207d5]
[tesla-cmc:17671] [24] /lib64/libc.so.6(__libc_start_main+0xfd)
[0x3de501ee5d]
[tesla-cmc:17671] [25] ./lapexp() [0x41efc9]
[tesla-cmc:17671] *** End of error message ***
[tesla-cmc:17670] [ 0] /lib64/libpthread.so.0() [0x3de540eeb0]
[tesla-cmc:17670] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3de50330c5]
[tesla-cmc:17670] [ 2] /lib64/libc.so.6(abort+0x186) [0x3de5034a76]
[tesla-cmc:17670] [ 3]
/opt/llvm/dragonegg/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)
[0x7f0a37e9295d]
[tesla-cmc:17670] [ 4]
/opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7b76) [0x7f0a37e90b76]
[tesla-cmc:17670] [ 5]
/opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7ba3) [0x7f0a37e90ba3]
[tesla-cmc:17670] [ 6]
/opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7cae) [0x7f0a37e90cae]
[tesla-cmc:17670] [ 7]
./lapexp(_ZN6thrust6detail6device4cuda4freeILj0EEEvNS_10device_ptrIvEE+0x69)
[0x4263be]
[tesla-cmc:17670] [ 8]
./lapexp(_ZN6thrust6detail6device8dispatch4freeILj0EEEvNS_10device_ptrIvEENS0_21cuda_device_space_tagE+0x2b)
[0x425950]
[tesla-cmc:17670] [ 9]
./lapexp(_ZN6thrust11device_freeENS_10device_ptrIvEE+0x2f) [0x425016]
[tesla-cmc:17670] [10]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust23device_malloc_allocatorIiE10deallocateENS_10device_ptrIiEEm+0x33)
[0x7f0a39671cff]
[tesla-cmc:17670] [11]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEE10deallocateEv+0x6e)
[0x7f0a3966fe78]
[tesla-cmc:17670] [12]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEED1Ev+0x19)
[0x7f0a3966e5f7]
[tesla-cmc:17670] [13]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail11vector_baseIiNS_23device_malloc_allocatorIiEEED1Ev+0x52)
[0x7f0a3966d5f4]
[tesla-cmc:17670] [14]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN4cusp7array1dIiN6thrust6detail21cuda_device_space_tagEED1Ev+0x18)
[0x7f0a3966cc2e]
[tesla-cmc:17670] [15]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN19_p_PetscCUSPIndicesD1Ev+0x1d) [0x7f0a3a0a545f]
[tesla-cmc:17670] [16]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(PetscCUSPIndicesDestroy+0x20f)
[0x7f0a3a093840]
[tesla-cmc:17670] [17]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy_PtoP+0x1bc8)
[0x7f0a3a1368af]
[tesla-cmc:17670] [18]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy+0x586)
[0x7f0a3a170ddf]
[tesla-cmc:17670] [19]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy_MPIAIJ+0x49f)
[0x7f0a39d18d24]
[tesla-cmc:17670] [20]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy+0x546) [0x7f0a39c5c4fe]
[tesla-cmc:17670] [21]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESReset+0x5d1) [0x7f0a39ff6ac3]
[tesla-cmc:17670] [22]
/opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESDestroy+0x4b8) [0x7f0a39ff7210]
[tesla-cmc:17670] [23] ./lapexp(main+0x5ed) [0x4207d5]
[tesla-cmc:17670] [24] /lib64/libc.so.6(__libc_start_main+0xfd)
[0x3de501ee5d]
[tesla-cmc:17670] [25] ./lapexp() [0x41efc9]
[tesla-cmc:17670] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 17670 on node tesla-cmc
exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Then for one process:
[agraiver at tesla-cmc new]$ LD_PRELOAD=./libpetsc_mgpu.so mpirun -np 1
./lapexp -da_grid_x 65535 -da_vec_type cusp -snes_monitor -ksp_monitor
Assigned CUDA device 0 to MPI process 0
0 SNES Function norm 3.906279802209e-03
0 KSP Residual norm 2.600060425819e+01
1 KSP Residual norm 1.711173401491e-09
1 SNES Function norm 2.518839283204e-05
0 KSP Residual norm 1.864270712051e-01
1 KSP Residual norm 1.123567613474e-11
2 SNES Function norm 1.475752536169e-09
0 KSP Residual norm 1.065095925089e-05
1 KSP Residual norm 8.918344224261e-16
3 SNES Function norm 2.186342855894e-10
0 KSP Residual norm 6.313874615230e-11
1 KSP Residual norm 2.338370003621e-21
[agraiver at tesla-cmc new]$ mpirun -np 1 ./lapexp -da_grid_x 65535
-snes_monitor -ksp_monitor
0 SNES Function norm 3.906279802209e-03
0 KSP Residual norm 2.600060425819e+01
1 KSP Residual norm 1.727316216725e-09
1 SNES Function norm 2.518839280713e-05
0 KSP Residual norm 1.864270710157e-01
1 KSP Residual norm 1.518456989028e-11
2 SNES Function norm 1.475794371713e-09
0 KSP Residual norm 1.065102315659e-05
1 KSP Residual norm 1.258453455440e-15
3 SNES Function norm 2.207728411745e-10
0 KSP Residual norm 6.963755704792e-12
1 KSP Residual norm 1.188067869190e-21
4 SNES Function norm 2.199244040060e-10
Then I tried to run the example with 2 processes but without assigning GPUs,
so the two processes now use the same GPU device:
[agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535
-da_vec_type cusp -snes_monitor -ksp_monitor
0 SNES Function norm 3.906279802209e-03
0 KSP Residual norm 5.994156809227e+00
1 KSP Residual norm 5.927247846249e-05
1 SNES Function norm 3.906225077938e-03
0 KSP Residual norm 5.993813868985e+00
1 KSP Residual norm 5.927575078206e-05
terminate called after throwing an instance of
'thrust::system::system_error'
what(): invalid device pointer
[tesla-cmc:17687] *** Process received signal ***
[tesla-cmc:17687] Signal: Aborted (6)
[tesla-cmc:17687] Signal code: (-6)
terminate called after throwing an instance of
'thrust::system::system_error'
what(): invalid device pointer
...
and the same error as above follows.
3. At the moment we don't have a configuration with several nodes and one GPU
each, but we will try to get one, although it may take some time.
Regards,
Alexander
On 05.05.2011 21:36, Barry Smith wrote:
> Alexander,
>
> Could you try putting an MPI_Barrier() just before the PetscFinalize() in your GPU example and see if it still crashes?
>
> I'm wondering if the first process to get to cublasShutdown(), which is in PetscFinalize(), is somehow shutting down all the GPUs for the node and so the other processes that have not finished working with their GPUs crash. Just a wild guess but worth checking.
>
> Thanks
>
> Barry
>
>
> On May 5, 2011, at 1:57 PM, Barry Smith wrote:
>
>> Alexander
>>
>> Thank you for the sample code; it will be very useful.
>>
>> We have run parallel jobs with CUDA where each node has only a single MPI process and uses a single GPU without the crash that you get below. I cannot explain why it would not work in your situation. Do you have access to two nodes each with a GPU so you could try that?
>>
>> It is crashing in a delete of a
>>
>> struct _p_PetscCUSPIndices {
>> CUSPINTARRAYCPU indicesCPU;
>> CUSPINTARRAYGPU indicesGPU;
>> };
>>
>> where CUSPINTARRAYGPU is a cusp::array1d<PetscInt,cusp::device_memory>,
>>
>> thus it is crashing after it has completed actually doing the computation. If you run with -snes_monitor -ksp_monitor, with and without -da_vec_type cusp, on 2 processes, what do you get for output in the two cases? I want to see whether it is running correctly on two processes.
>>
>> Could the crash be due to memory corruption at some point during the computation?
>>
>>
>> Barry
>>
>>
>>
>>
>>
>> On May 5, 2011, at 3:38 AM, Alexander Grayver wrote:
>>
>>> Hello!
>>>
>>> We work with the petsc-dev branch and the ex47cu.cu example. Our platform is
>>> an Intel Quad processor with 8 identical Tesla GPUs, and the CUDA 3.2 toolkit
>>> is installed.
>>> Ideally we would like to make PETSc work in a multi-GPU way within
>>> just one node, so that different GPUs can be attached to different
>>> processes.
>>> Since this is not possible within the current PETSc implementation, we created a
>>> preload library (see LD_PRELOAD for details) that intercepts the CUBLAS function
>>> cublasInit().
>>> When PETSc calls this function, our library gets control and assigns
>>> GPUs according to the rank within the MPI communicator, then calls the original
>>> cublasInit().
>>> This preload library is very simple, see petsc_mgpu.c attached (a minimal
>>> sketch of such an interceptor is shown after the quoted thread below).
>>> This trick makes each process have its own context, and ideally all
>>> computations should be distributed over several GPUs.
>>>
>>> We managed to build PETSc and the example (see makefile attached) and
>>> tested it as follows:
>>>
>>> [agraiver at tesla-cmc new]$ ./lapexp -da_grid_x 65535 -info > cpu_1process.out
>>> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -info >
>>> cpu_2processes.out
>>> [agraiver at tesla-cmc new]$ ./lapexp -da_grid_x 65535 -da_vec_type cusp
>>> -info > gpu_1process.out
>>> [agraiver at tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535
>>> -da_vec_type cusp -info > gpu_2processes.out
>>>
>>> Everything except the last configuration works well. The last one crashes
>>> with the following exception and call stack:
>>> terminate called after throwing an instance of
>>> 'thrust::system::system_error'
>>> what(): invalid device pointer
>>> [tesla-cmc:15549] *** Process received signal ***
>>> [tesla-cmc:15549] Signal: Aborted (6)
>>> [tesla-cmc:15549] Signal code: (-6)
>>> [tesla-cmc:15549] [ 0] /lib64/libpthread.so.0() [0x3de540eeb0]
>>> [tesla-cmc:15549] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3de50330c5]
>>> [tesla-cmc:15549] [ 2] /lib64/libc.so.6(abort+0x186) [0x3de5034a76]
>>> [tesla-cmc:15549] [ 3]
>>> /opt/llvm/dragonegg/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)
>>> [0x7f0d3530b95d]
>>> [tesla-cmc:15549] [ 4]
>>> /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7b76) [0x7f0d35309b76]
>>> [tesla-cmc:15549] [ 5]
>>> /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7ba3) [0x7f0d35309ba3]
>>> [tesla-cmc:15549] [ 6]
>>> /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7cae) [0x7f0d35309cae]
>>> [tesla-cmc:15549] [ 7]
>>> ./lapexp(_ZN6thrust6detail6device4cuda4freeILj0EEEvNS_10device_ptrIvEE+0x69)
>>> [0x426320]
>>> [tesla-cmc:15549] [ 8]
>>> ./lapexp(_ZN6thrust6detail6device8dispatch4freeILj0EEEvNS_10device_ptrIvEENS0_21cuda_device_space_tagE+0x2b)
>>> [0x4258b2]
>>> [tesla-cmc:15549] [ 9]
>>> ./lapexp(_ZN6thrust11device_freeENS_10device_ptrIvEE+0x2f) [0x424f78]
>>> [tesla-cmc:15549] [10]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust23device_malloc_allocatorIiE10deallocateENS_10device_ptrIiEEm+0x33)
>>> [0x7f0d36aeacff]
>>> [tesla-cmc:15549] [11]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEE10deallocateEv+0x6e)
>>> [0x7f0d36ae8e78]
>>> [tesla-cmc:15549] [12]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEED1Ev+0x19)
>>> [0x7f0d36ae75f7]
>>> [tesla-cmc:15549] [13]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail11vector_baseIiNS_23device_malloc_allocatorIiEEED1Ev+0x52)
>>> [0x7f0d36ae65f4]
>>> [tesla-cmc:15549] [14]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN4cusp7array1dIiN6thrust6detail21cuda_device_space_tagEED1Ev+0x18)
>>> [0x7f0d36ae5c2e]
>>> [tesla-cmc:15549] [15]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN19_p_PetscCUSPIndicesD1Ev+0x1d) [0x7f0d3751e45f]
>>> [tesla-cmc:15549] [16]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(PetscCUSPIndicesDestroy+0x20f)
>>> [0x7f0d3750c840]
>>> [tesla-cmc:15549] [17]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy_PtoP+0x1bc8)
>>> [0x7f0d375af8af]
>>> [tesla-cmc:15549] [18]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy+0x586)
>>> [0x7f0d375e9ddf]
>>> [tesla-cmc:15549] [19]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy_MPIAIJ+0x49f)
>>> [0x7f0d37191d24]
>>> [tesla-cmc:15549] [20]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy+0x546) [0x7f0d370d54fe]
>>> [tesla-cmc:15549] [21]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESReset+0x5d1) [0x7f0d3746fac3]
>>> [tesla-cmc:15549] [22]
>>> /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESDestroy+0x4b8) [0x7f0d37470210]
>>> [tesla-cmc:15549] [23] ./lapexp(main+0x5ed) [0x420745]
>>>
>>> I've sent detailed output files for the different execution
>>> configurations listed above, as well as configure.log and make.log, to
>>> petsc-maint at mcs.anl.gov, hoping that someone can recognize the problem.
>>> Right now we have one node with multiple GPUs, but I'm also wondering
>>> whether someone has really tested the GPU functionality over several nodes
>>> with one GPU each?
>>>
>>> Regards,
>>> Alexander
>>>
>>> <petsc_mgpu.c><makefile.txt><configure.log>
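
For readers without the attachments: below is a minimal sketch of this kind of
cublasInit() interceptor. It is not the attached petsc_mgpu.c itself; it assumes
the legacy CUDA 3.x cublasInit() entry point from cublas.h, that MPI is already
initialized by the time PETSc initializes CUBLAS, and a simple assignment of
rank modulo the number of visible devices.

/* petsc_mgpu-style interceptor -- a minimal sketch, not the attached
 * petsc_mgpu.c. Preloaded via LD_PRELOAD, it runs instead of the real
 * cublasInit(), binds the calling process to a GPU based on its MPI rank,
 * and then forwards the call to the real CUBLAS. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include <cublas.h>

cublasStatus cublasInit(void)
{
  static cublasStatus (*real_cublasInit)(void) = NULL;
  int rank = 0, ndev = 1;

  if (!real_cublasInit) {
    /* look up the real cublasInit() provided by libcublas */
    real_cublasInit = (cublasStatus (*)(void))dlsym(RTLD_NEXT, "cublasInit");
  }

  /* bind this MPI process to one GPU: rank modulo the number of devices */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);
  printf("Assigned CUDA device %d to MPI process %d\n", rank % ndev, rank);

  /* hand control back to the real CUBLAS initialization */
  return real_cublasInit();
}

Such a library would be built as a shared object, e.g.
mpicc -shared -fPIC -o libpetsc_mgpu.so petsc_mgpu.c -lcublas -lcudart -ldl,
and preloaded via LD_PRELOAD as in the runs above.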