Hi Alexander,

Looking through the runs for CPU and GPU with only 1 process, I'm seeing the following oddity which you pointed out:

CPU 1 process

minden@bb45:~/petsc-dev/src/snes/examples/tutorials$ /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile /home/balay/machinefile -n 1 ./ex47cu -da_grid_x 65535 -snes_monitor -ksp_monitor
 0 SNES Function norm 3.906279802209e-03
 0 KSP Residual norm 2.600060425819e+01
 1 KSP Residual norm 1.727316216725e-09
 1 SNES Function norm 2.518839280713e-05
 0 KSP Residual norm 1.864270710157e-01
 1 KSP Residual norm 1.518456989028e-11
 2 SNES Function norm 1.475794371713e-09
 0 KSP Residual norm 1.065102315659e-05
 1 KSP Residual norm 1.258453455440e-15
 3 SNES Function norm 2.207728411745e-10
 0 KSP Residual norm 6.963755704792e-12
 1 KSP Residual norm 1.188067869190e-21
 4 SNES Function norm 2.199244040060e-10

GPU 1 process

minden@bb45:~/petsc-dev/src/snes/examples/tutorials$ /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile /home/balay/machinefile -n 1 ./ex47cu -da_grid_x 65535 -snes_monitor -ksp_monitor -da_vec_type cusp
 0 SNES Function norm 3.906279802209e-03
 0 KSP Residual norm 2.600060425819e+01
 1 KSP Residual norm 1.711173401491e-09
 1 SNES Function norm 2.518839283204e-05
 0 KSP Residual norm 1.864270712051e-01
 1 KSP Residual norm 1.123567613474e-11
 2 SNES Function norm 1.475752536169e-09
 0 KSP Residual norm 1.065095925089e-05
 1 KSP Residual norm 8.918344224261e-16
 3 SNES Function norm 2.186342855894e-10
 0 KSP Residual norm 6.313874615230e-11
 1 KSP Residual norm 2.338370003621e-21

As you noted, the CPU version terminates on a SNES function norm whereas the GPU version stops on a KSP residual norm. Looking through the exact numbers, I found that the small differences in values between the GPU and CPU versions cause the convergence criterion for the SNES to be off by about 2e-23, so it goes for another round of line search before concluding it has found a local minimum and terminating. Using the GPU matrix as well changes things yet again:

GPU 1 process with cusp matrix

minden@bb45:~/petsc-dev/src/snes/examples/tutorials$ /home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile /home/balay/machinefile -n 1 ./ex47cu -da_grid_x 65535 -snes_monitor -ksp_monitor -da_vec_type cusp -da_mat_type aijcusp
 0 SNES Function norm 3.906279802209e-03
 0 KSP Residual norm 2.600060425819e+01
 1 KSP Residual norm 8.745056654228e-10
 1 SNES Function norm 2.518839297589e-05
 0 KSP Residual norm 1.864270723743e-01
 1 KSP Residual norm 1.265482694189e-11
 2 SNES Function norm 1.475659976840e-09
 0 KSP Residual norm 1.065091221064e-05
 1 KSP Residual norm 8.245135443599e-16
 3 SNES Function norm 2.200530918322e-10
 0 KSP Residual norm 7.730316189302e-11
 1 KSP Residual norm 1.115126544733e-21
 4 SNES Function norm 2.192093087025e-10

The values change again, just enough to push the run back onto the right side of the convergence check.
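For intuition, a tiny self-contained sketch (not PETSc's actual convergence test, and the rtol value here is only an assumption) of how a perturbation of about 2e-23 in the function norm can land on opposite sides of a relative-tolerance check:

#include <stdio.h>

/* Illustration only: a Newton solver commonly declares convergence when the
 * function norm drops below rtol * (initial function norm).  When the norm
 * sits within ~1e-23 of that threshold, CPU and GPU round-off (different
 * reduction order, fused multiply-adds) can fall on opposite sides, so one
 * run stops on the SNES test and the other takes another step first. */
int main(void)
{
  double fnorm0 = 3.906279802209e-03;   /* initial SNES function norm from the runs above */
  double rtol   = 1.0e-8;               /* assumed relative tolerance, for illustration   */
  double thresh = rtol * fnorm0;

  double fnorm_cpu = thresh - 1.0e-23;  /* just below the threshold: converged            */
  double fnorm_gpu = thresh + 1.0e-23;  /* just above the threshold: keeps iterating      */

  printf("cpu-like norm passes the test: %d\n", fnorm_cpu <= thresh);
  printf("gpu-like norm passes the test: %d\n", fnorm_gpu <= thresh);
  return 0;
}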

I am still looking into the problems with 2 processes on the GPU. It seems to be using old data somewhere, as you can see from the fact that the function norm is essentially the same at the beginning of each SNES iteration:

GPU, 2 processes

[agraiver@tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -da_vec_type cusp -snes_monitor -ksp_monitor
 0 SNES Function norm 3.906279802209e-03    <-----
 0 KSP Residual norm 5.994156809227e+00
 1 KSP Residual norm 5.927247846249e-05
 1 SNES Function norm 3.906225077938e-03    <-----
 0 KSP Residual norm 5.993813868985e+00
 1 KSP Residual norm 5.927575078206e-05

So, it's doing some good calculations and then throwing them away and starting over again. I will continue to look into this.
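One way to watch for that symptom programmatically is a custom SNES monitor; a rough sketch (the StallCtx/StallMonitor names and the 1e-3 threshold are illustrative, not from the example):

#include <petscsnes.h>

/* Rough diagnostic sketch: remember the previous SNES function norm and flag
 * iterations where it barely moves, which is the symptom in the 2-process GPU
 * run above (the norm repeats at the start of every SNES iteration). */
typedef struct { PetscReal last; } StallCtx;

static PetscErrorCode StallMonitor(SNES snes, PetscInt it, PetscReal fnorm, void *ctx)
{
  StallCtx *s = (StallCtx*)ctx;
  PetscFunctionBegin;
  if (it > 0 && PetscAbsReal(fnorm - s->last) < 1e-3*fnorm) {
    PetscPrintf(PETSC_COMM_WORLD, "  possible stale data: norm %g barely changed from %g at iteration %D\n",
                (double)fnorm, (double)s->last, it);
  }
  s->last = fnorm;
  PetscFunctionReturn(0);
}

/* Usage, somewhere after SNESCreate() in the example:
 *   static StallCtx stall = {0};
 *   SNESMonitorSet(snes, StallMonitor, &stall, NULL);
 */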

Cheers,

Victor

---
Victor L. Minden

Tufts University
School of Engineering
Class of 2012

On Wed, May 11, 2011 at 8:31 AM, Alexander Grayver <agrayver@gfz-potsdam.de> wrote:
Hello,

Victor, thanks. We've got the latest version and now it doesn't crash.
However, it seems there is still a problem.

Let's look at three different runs:
[agraiver@tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -snes_monitor -ksp_monitor
0 SNES Function norm 3.906279802209e-03
0 KSP Residual norm 5.994156809227e+00
1 KSP Residual norm 3.538158441448e-04
2 KSP Residual norm 3.124431921666e-04
3 KSP Residual norm 4.109213410989e-06
1 SNES Function norm 7.201017610490e-04
0 KSP Residual norm 3.317803708316e-02
1 KSP Residual norm 2.447380361169e-06
2 KSP Residual norm 2.164193969957e-06
3 KSP Residual norm 2.124317398679e-08
2 SNES Function norm 1.719678934825e-05
0 KSP Residual norm 1.651586453143e-06
1 KSP Residual norm 2.037037536868e-08
2 KSP Residual norm 1.109736798274e-08
3 KSP Residual norm 1.857218772156e-12
3 SNES Function norm 1.159391068583e-09
0 KSP Residual norm 3.116936044619e-11
1 KSP Residual norm 1.366503312678e-12
2 KSP Residual norm 6.598477672192e-13
3 KSP Residual norm 5.306147277879e-17
4 SNES Function norm 2.202297235559e-10
[agraiver@tesla-cmc new]$ mpirun -np 1 ./lapexp -da_grid_x 65535 -da_vec_type cusp -snes_monitor -ksp_monitor
0 SNES Function norm 3.906279802209e-03
0 KSP Residual norm 2.600060425819e+01
1 KSP Residual norm 1.711173401491e-09
1 SNES Function norm 2.518839283204e-05
0 KSP Residual norm 1.864270712051e-01
1 KSP Residual norm 1.123567613474e-11
2 SNES Function norm 1.475752536169e-09
0 KSP Residual norm 1.065095925089e-05
1 KSP Residual norm 8.918344224261e-16
3 SNES Function norm 2.186342855894e-10
0 KSP Residual norm 6.313874615230e-11
1 KSP Residual norm 2.338370003621e-21
[agraiver@tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -da_vec_type cusp -snes_monitor -ksp_monitor
0 SNES Function norm 3.906279802209e-03
0 KSP Residual norm 5.994156809227e+00
1 KSP Residual norm 5.927247846249e-05
1 SNES Function norm 3.906225077938e-03
0 KSP Residual norm 5.993813868985e+00
1 KSP Residual norm 5.927575078206e-05
[agraiver@tesla-cmc new]$
<br>
lepexp is the default example, just renamed. The first run used 2
CPUs, the second one used 1 GPU and the third one ran with 2
processes using 1 GPU.<br>
First different is that when use cpu the last string in output is
always:<div class="im"><br>
4 SNES Function norm 2.202297235559e-10<br></div>
whereas for CPU the last string is "N KSP ...something..."<br>
Then is seems that for 2 processes using 1 GPU example doesn't
converge, the norm is quite big. The same situation happens when we
use 2 process and 2 GPUs. Can you explain this?<br>
BTW, we can even give you access to our server with 6 CPUs and 8
GPUs within one node. <br>
<br>
Regards,<br><font color="#888888">
Alexander</font><div><div></div><div class="h5"><br>

On 11.05.2011 01:07, Victor Minden wrote:

I pushed my change to petsc-dev, so hopefully a new pull of the latest mercurial repository should do it; let me know if not.
---
Victor L. Minden

Tufts University
School of Engineering
Class of 2012


On Tue, May 10, 2011 at 6:59 PM, Alexander Grayver <agrayver@gfz-potsdam.de> wrote:
Hi Victor,

Thanks a lot!
What should we do to get the new version?

Regards,
Alexander

On 10.05.2011 02:02, Victor Minden wrote:
I believe I've resolved this issue.

Cheers,

Victor
---
Victor L. Minden

Tufts University
School of Engineering
Class of 2012


On Sun, May 8, 2011 at 5:26 PM, Victor Minden <victorminden@gmail.com> wrote:
Barry,

I can verify this on breadboard now,

with two processes, cuda

minden@bb45:~/petsc-dev/src/snes/examples/tutorials$
/home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
/home/balay/machinefile -n 2 ./ex47cu -da_grid_x 65535 -log_summary
-snes_monitor -ksp_monitor -da_vec_type cusp
 0 SNES Function norm 3.906279802209e-03
 0 KSP Residual norm 5.994156809227e+00
 1 KSP Residual norm 5.927247846249e-05
 1 SNES Function norm 3.906225077938e-03
 0 KSP Residual norm 5.993813868985e+00
 1 KSP Residual norm 5.927575078206e-05
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  invalid device pointer
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  invalid device pointer
Aborted (signal 6)


Without cuda

minden@bb45:~/petsc-dev/src/snes/examples/tutorials$
/home/balay/soft/mvapich2-1.5-lucid/bin/mpiexec.hydra -machinefile
/home/balay/machinefile -n 2 ./ex47cu -da_grid_x 65535 -log_summary
-snes_monitor -ksp_monitor
 0 SNES Function norm 3.906279802209e-03
 0 KSP Residual norm 5.994156809227e+00
 1 KSP Residual norm 3.538158441448e-04
 2 KSP Residual norm 3.124431921666e-04
 3 KSP Residual norm 4.109213410989e-06
 1 SNES Function norm 7.201017610490e-04
 0 KSP Residual norm 3.317803708316e-02
 1 KSP Residual norm 2.447380361169e-06
 2 KSP Residual norm 2.164193969957e-06
 3 KSP Residual norm 2.124317398679e-08
 2 SNES Function norm 1.719678934825e-05
 0 KSP Residual norm 1.651586453143e-06
 1 KSP Residual norm 2.037037536868e-08
 2 KSP Residual norm 1.109736798274e-08
 3 KSP Residual norm 1.857218772156e-12
 3 SNES Function norm 1.159391068583e-09
 0 KSP Residual norm 3.116936044619e-11
 1 KSP Residual norm 1.366503312678e-12
 2 KSP Residual norm 6.598477672192e-13
 3 KSP Residual norm 5.306147277879e-17
 4 SNES Function norm 2.202297235559e-10

Note the repeated norms when using cuda. Looks like I'll have to take
a closer look at this.

-Victor

---
Victor L. Minden

Tufts University
School of Engineering
Class of 2012



On Thu, May 5, 2011 at 2:57 PM, Barry Smith <bsmith@mcs.anl.gov> wrote:
>
>   Alexander
>
>   Thank you for the sample code; it will be very useful.
>
>   We have run parallel jobs with CUDA where each node has only a single MPI
> process and uses a single GPU without the crash that you get below. I cannot
> explain why it would not work in your situation. Do you have access to two
> nodes each with a GPU so you could try that?
>
>   It is crashing in a delete of a
>
> struct _p_PetscCUSPIndices {
>   CUSPINTARRAYCPU indicesCPU;
>   CUSPINTARRAYGPU indicesGPU;
> };
>
> where CUSPINTARRAYGPU is a cusp::array1d<PetscInt,cusp::device_memory>,
>
> thus it is crashing after it has completed actually doing the computation. If
> you run with -snes_monitor -ksp_monitor with and without the -da_vec_type cusp
> on 2 processes, what do you get for output in the two cases? I want to see if
> it is running correctly on two processes.
>
>   Could the crash be due to memory corruption sometime during the computation?
>
>
>   Barry
>
>
>
>
>
> On May 5, 2011, at 3:38 AM, Alexander Grayver wrote:
>
>> Hello!
>>
>> We work with the petsc-dev branch and the ex47cu.cu example. Our platform is
>> an Intel Quad processor and 8 identical Tesla GPUs. CUDA 3.2 toolkit is
>> installed.
>> Ideally we would like to make petsc work in a multi-GPU way within
>> just one node, so that different GPUs can be attached to different
>> processes.
>> Since this is not possible within the current PETSc implementation, we created
>> a preload library (see LD_PRELOAD for details) for the CUBLAS function
>> cublasInit().
>> When PETSc calls this function our library gets control and we assign
>> GPUs according to rank within the MPI communicator, then we call the original
>> cublasInit().
>> This preload library is very simple, see petsc_mgpu.c attached.
>> This trick makes each process have its own context, and ideally all
>> computations should be distributed over several GPUs.
>>
>> We managed to build petsc and the example (see makefile attached) and we
>> tested it as follows:
>>
>> [agraiver@tesla-cmc new]$ ./lapexp -da_grid_x 65535 -info > cpu_1process.out
>> [agraiver@tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535 -info >
>> cpu_2processes.out
>> [agraiver@tesla-cmc new]$ ./lapexp -da_grid_x 65535 -da_vec_type cusp
>> -info > gpu_1process.out
>> [agraiver@tesla-cmc new]$ mpirun -np 2 ./lapexp -da_grid_x 65535
>> -da_vec_type cusp -info > gpu_2processes.out
>>
>> Everything except the last configuration works well. The last one crashes
>> with the following exception and call stack:
>> terminate called after throwing an instance of 'thrust::system::system_error'
>>   what():  invalid device pointer
>> [tesla-cmc:15549] *** Process received signal ***
>> [tesla-cmc:15549] Signal: Aborted (6)
>> [tesla-cmc:15549] Signal code:  (-6)
>> [tesla-cmc:15549] [ 0] /lib64/libpthread.so.0() [0x3de540eeb0]
>> [tesla-cmc:15549] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3de50330c5]
>> [tesla-cmc:15549] [ 2] /lib64/libc.so.6(abort+0x186) [0x3de5034a76]
>> [tesla-cmc:15549] [ 3] /opt/llvm/dragonegg/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d) [0x7f0d3530b95d]
>> [tesla-cmc:15549] [ 4] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7b76) [0x7f0d35309b76]
>> [tesla-cmc:15549] [ 5] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7ba3) [0x7f0d35309ba3]
>> [tesla-cmc:15549] [ 6] /opt/llvm/dragonegg/lib64/libstdc++.so.6(+0xb7cae) [0x7f0d35309cae]
>> [tesla-cmc:15549] [ 7] ./lapexp(_ZN6thrust6detail6device4cuda4freeILj0EEEvNS_10device_ptrIvEE+0x69) [0x426320]
>> [tesla-cmc:15549] [ 8] ./lapexp(_ZN6thrust6detail6device8dispatch4freeILj0EEEvNS_10device_ptrIvEENS0_21cuda_device_space_tagE+0x2b) [0x4258b2]
>> [tesla-cmc:15549] [ 9] ./lapexp(_ZN6thrust11device_freeENS_10device_ptrIvEE+0x2f) [0x424f78]
>> [tesla-cmc:15549] [10] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust23device_malloc_allocatorIiE10deallocateENS_10device_ptrIiEEm+0x33) [0x7f0d36aeacff]
>> [tesla-cmc:15549] [11] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEE10deallocateEv+0x6e) [0x7f0d36ae8e78]
>> [tesla-cmc:15549] [12] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail18contiguous_storageIiNS_23device_malloc_allocatorIiEEED1Ev+0x19) [0x7f0d36ae75f7]
>> [tesla-cmc:15549] [13] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN6thrust6detail11vector_baseIiNS_23device_malloc_allocatorIiEEED1Ev+0x52) [0x7f0d36ae65f4]
>> [tesla-cmc:15549] [14] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN4cusp7array1dIiN6thrust6detail21cuda_device_space_tagEED1Ev+0x18) [0x7f0d36ae5c2e]
>> [tesla-cmc:15549] [15] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(_ZN19_p_PetscCUSPIndicesD1Ev+0x1d) [0x7f0d3751e45f]
>> [tesla-cmc:15549] [16] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(PetscCUSPIndicesDestroy+0x20f) [0x7f0d3750c840]
>> [tesla-cmc:15549] [17] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy_PtoP+0x1bc8) [0x7f0d375af8af]
>> [tesla-cmc:15549] [18] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(VecScatterDestroy+0x586) [0x7f0d375e9ddf]
>> [tesla-cmc:15549] [19] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy_MPIAIJ+0x49f) [0x7f0d37191d24]
>> [tesla-cmc:15549] [20] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(MatDestroy+0x546) [0x7f0d370d54fe]
>> [tesla-cmc:15549] [21] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESReset+0x5d1) [0x7f0d3746fac3]
>> [tesla-cmc:15549] [22] /opt/openmpi_gcc-1.4.3/lib/libpetsc.so(SNESDestroy+0x4b8) [0x7f0d37470210]
>> [tesla-cmc:15549] [23] ./lapexp(main+0x5ed) [0x420745]
>>
>> I've sent all the detailed output files for the different execution
>> configurations listed above, as well as configure.log and make.log, to
>> petsc-maint@mcs.anl.gov hoping that someone could recognize the problem.
>> Now we have one node with multiple GPUs, but I'm also wondering if someone
>> has really tested usage of the GPU functionality over several nodes with
>> one GPU each?
>>
>> Regards,
>> Alexander
>>
>> <petsc_mgpu.c><makefile.txt><configure.log>