<div dir="ltr">Hi Karl,<div><br></div><div>Thanks for the explanation. Do you think it would help if I used a GPU that is capable of double-precision arithmetic?</div><div><br></div><div>I am using an NVIDIA Quadro FX 1800M. It has 1 GB of global memory. Unfortunately, NVIDIA's Visual Profiler does not seem to work with its OpenCL implementation. The code does not crash when I run it on the CPU using Intel's OpenCL.</div>
<div><br></div><div>I meant to say that the code does not crash with either ComputeResidualViennaCL or ComputeResidual when the normal PETSc Vecs/Mats are used, but it does crash when either of them is used with the ViennaCL Vecs. Do you think memory is being allocated at every time step? I thought all the memory would be allocated during initialization. </div>
<div><br></div><div>Cheers,</div><div>Mani</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jan 28, 2014 at 3:09 AM, Karl Rupp <span dir="ltr">&lt;<a href="mailto:rupp@mcs.anl.gov" target="_blank">rupp@mcs.anl.gov</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Mani,<div><div class="h5"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I've been testing the code using TS with ViennaCL further, and there are<br>
a couple of things I wanted to point out:<br>
<br>
1) When using ComputeResidualViennaCL with either the normal PETSc<br>
Vecs/Mats or Vec/MatViennaCL, and using the GPU, the nonlinear<br>
convergence is very different from using an OpenCL CPU backend or just<br>
the regular PETSc code.<br>
<br>
a) Using NVIDIA OpenCL to run on the GPU to compute the residual and<br>
using either normal PETSc Vec/Mat or ViennaCL Vec/Mat:<br>
<br>
0 TS dt 10 time 0<br>
0 SNES Function norm 4.789374470711e-01<br>
1 SNES Function norm 5.491749197245e-02<br>
2 SNES Function norm 6.542412564158e-03<br>
3 SNES Function norm 7.800844032317e-04<br>
4 SNES Function norm 9.349243191537e-05<br>
5 SNES Function norm 1.120692741097e-05<br>
1 TS dt 10 time 10<br>
<br>
b) Using Intel OpenCL to run on the CPU to compute the residual and<br>
using either normal PETSc Vec/Mat or ViennaCL Vec/Mat:<br>
<br>
0 TS dt 10 time 0<br>
0 SNES Function norm 3.916582465172e-02<br>
1 SNES Function norm 4.990998832000e-07<br>
<br>
c) Using ComputeResidual (which runs on the CPU) with the normal PETSc<br>
Vec/Mat:<br>
<br>
0 TS dt 10 time 0<br>
0 SNES Function norm 3.916582465172e-02<br>
1 SNES Function norm 4.990998832000e-07<br>
1 TS dt 10 time 10<br>
<br>
You see that b) and c) match perfectly but a) is quite different. Why<br>
could this be?<br>
</blockquote>
<br></div></div>
The reason is the different arithmetic units. Your OpenCL kernel contains<br>
dx_dt[INDEX_GLOBAL(i,j,var)] -<br>
(x[INDEX_GLOBAL(i+1,j,var)] -<br>
x[INDEX_GLOBAL(i,j,var)])/DX1 -<br>
(x[INDEX_GLOBAL(i,j+1,var)] -<br>
x[INDEX_GLOBAL(i,j,var)])/DX2<br>
so you are subtracting values of about the same magnitude multiple times. You get consistent results on the CPU because the same arithmetic units get used irrespective of OpenCL-based or 'native' execution. The NVIDIA GPU has different round-off behavior. You are likely to see similar effects with AMD GPUs. There is nothing we can do to change this.<div class="im">
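As a rough illustration of why subtracting values of about the same magnitude is sensitive to precision, here is a minimal C sketch. It is not taken from the code under discussion: the function names diff_single and diff_double and the sample values are hypothetical, and it only contrasts single with double precision on one machine rather than reproducing the rounding of any particular GPU.<br>

```c
/* Hypothetical helpers for illustration only; not part of PETSc or ViennaCL. */

/* One-sided difference (right - left) / spacing, evaluated in single precision.
   The volatile temporaries force every intermediate to be rounded to float. */
float diff_single(double left, double right, double spacing)
{
    volatile float l = (float)left;
    volatile float r = (float)right;
    volatile float d = r - l;   /* cancellation happens here */
    return d / (float)spacing;
}

/* The same difference evaluated in double precision. */
double diff_double(double left, double right, double spacing)
{
    return (right - left) / spacing;
}
```

Calling both with left = 1.0, right = 1.0 + 1.0e-8 and spacing = 1.0e-8 gives 0 in single precision, because the two inputs round to the same float, while the double-precision version returns a value close to 1.<br>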
<br>
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2) When I try using either ComputeResidual or ComputeResidualViennaCL<br>
with the ViennaCL Vec/Mats, the GPU run crashes at a late time because<br>
of a routine in ViennaCL.<br>
<br>
ViennaCL: FATAL ERROR: Kernel start failed for 'vec_mul'.<br>
ViennaCL: Smaller work sizes could not solve the problem.<br>
[0]PETSC ERROR: --------------------- Error Message<br>
------------------------------------<br>
[0]PETSC ERROR: Error in external library!<br>
[0]PETSC ERROR: ViennaCL error: ViennaCL: FATAL ERROR:<br>
CL_MEM_OBJECT_ALLOCATION_FAILURE<br>
<br>
I have attached the full crash log. The crash occurs late into the run,<br>
in this case at the 80th time step. I thought all memory allocation<br>
occurs at the beginning of the run, so I don't quite understand why it's<br>
failing.<br>
</blockquote>
<br></div>
Okay, this sounds like the GPU ultimately runs out of memory. Which GPU do you use? How much memory does it have? Do you also see an increase in memory consumption with the Intel OpenCL SDK?<div class="im"><br>
<br>
<br>
&gt; Note that the code works if I use ComputeResidualViennaCL with<br>
&gt; the normal PETSc Vec/Mats.<br>
<br></div>
You mean ComputeResidual(), don't you?<br>
<br>
Best regards,<br>
Karli<br>
<br>
</blockquote></div><br></div>