Dear Dave,

Did you run the codes with double precision?

Thanks,
Yujie

On Thu, Feb 23, 2012 at 11:06 AM, Nystrom, William D <wdn@lanl.gov> wrote:
I recently ran a couple of test runs with petsc-dev that I do not understand. I'm running on a test bed
machine that has 4 nodes with two Tesla 2090 GPUs per node. Each node is dual socket, populated with
Intel Xeon E5-2670 processors running at 2.60 GHz. These are 8-core processors, so each node has 16 cores.
On the GPU side, I'm running with Paul's latest version of the txpetscgpu package. I'm running the
src/ksp/ksp/examples/tutorials/ex2.c PETSc example with m=n=10000. My objective was to compare the
performance of running on 4 nodes using all 8 GPUs with that of running on the same 4 nodes with all
64 cores. This problem uses about a third of the memory available on the GPUs. I was using CG with
Jacobi preconditioning for both the GPU run and the CPU run. What is puzzling to me is that the CPU
case ran 44x slower than the GPU case, and the big difference was in the time spent in functions like
VecTDot, VecNorm and VecAXPY.

Below is a table that summarizes the performance of the main functions that consumed the time in the
two runs. Times are in seconds.

Function  |  GPU time (s)  |  CPU time (s)  |  Ratio (CPU/GPU)
----------+----------------+----------------+-----------------
MatMult   |        450.64  |        5484.7  |            12.17
VecTDot   |        285.35  |       16688.0  |            58.48
VecNorm   |         19.03  |        9058.8  |           476.03
VecAXPY   |        106.88  |        5636.3  |            52.73
VecAYPX   |         53.69  |          85.1  |             1.58
KSPSolve  |        811.95  |       35930.0  |            44.25

The CPU-versus-GPU ratio for MatMult is what I typically see when I compare a CPU run on a single
core with a run on a single GPU. Since both runs are communicating across nodes via MPI, I'm puzzled
about why the CPU case is so much slower than the GPU case, especially since there is communication
for the MatMult as well. Both runs compute the same final error norm using the exact same number of
iterations. Do these results make sense to people who understand the performance issues of parallel
sparse linear solvers much better than I do, or do they look abnormal? I had wondered whether part
of the performance issue was related to my running 8 times as many MPI processes for the CPU case.
However, I ran a smaller problem with m=n=1000 using 8 MPI processes and 2 cores per node, and I see
the same extreme differences in the times spent in VecTDot, VecNorm and VecAXPY.

Here are the command lines I used for the two runs:

CPU:

mpirun -np 64 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left

GPU:

mpirun -np 8 -npernode 2 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left -mat_type aijcusp -vec_type cusp -cusp_storage_format dia

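For completeness, the solver-related options above correspond roughly to the programmatic setup below.
This is just a sketch, not the actual ex2.c source: assembly of the m x n matrix and vectors is omitted,
SolveWithCGJacobi is only an illustrative helper name, and the calls assume the petsc-dev interfaces
I'm currently building against.

  #include <petscksp.h>

  /* Illustrative helper, not from ex2.c: configure and run the solver
     used in both runs, i.e. -ksp_type cg -pc_type jacobi -ksp_max_it 100000. */
  PetscErrorCode SolveWithCGJacobi(Mat A, Vec b, Vec x)
  {
    KSP            ksp;
    PC             pc;
    PetscErrorCode ierr;

    ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp,A,A,DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
    ierr = KSPSetType(ksp,KSPCG);CHKERRQ(ierr);                    /* -ksp_type cg       */
    ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
    ierr = PCSetType(pc,PCJACOBI);CHKERRQ(ierr);                   /* -pc_type jacobi    */
    ierr = KSPSetTolerances(ksp,PETSC_DEFAULT,PETSC_DEFAULT,
                            PETSC_DEFAULT,100000);CHKERRQ(ierr);   /* -ksp_max_it 100000 */
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);   /* picks up remaining -ksp_ / -pc_ options */
    ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
    return 0;
  }
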
Thanks,

Dave

--
Dave Nystrom
LANL HPC-5
Phone: 505-667-7913
Email: wdn@lanl.gov
Smail: Mail Stop B272
Group HPC-5
Los Alamos National Laboratory
Los Alamos, NM 87545
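P.S. Regarding the precision question at the top of this message: a quick check is to print
sizeof(PetscReal) from the same build; it is 8 bytes for a double-precision build and 4 for single.
The minimal sketch below assumes only the PETSc headers are available. The configure options echoed
at the end of the -log_summary output should also show whether --with-precision=single was used.

  #include <petscsys.h>

  /* Minimal, illustrative check of the precision a PETSc build uses:
     prints 8 for double precision, 4 for single. */
  int main(int argc,char **argv)
  {
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc,&argv,(char*)0,(char*)0);if (ierr) return ierr;
    ierr = PetscPrintf(PETSC_COMM_WORLD,"sizeof(PetscReal) = %d bytes\n",
                       (int)sizeof(PetscReal));CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }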