The difference in performance for VecAXPY and VecAYPX is dramatic (~35x), and these are dead-simple methods that are almost identical and involve no parallel communication, so they may be a good place to start looking. You might look at a simpler example like src/vec/vec/examples/tutorials/ex1.c. You could add a loop around the calls to VecAXPY and VecAYPX to get some meaningful timings, along the lines of the sketch below.
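Something like this untested sketch, say (the vector length, loop count, and coefficients are arbitrary; VecSetFromOptions lets you switch between ordinary MPI vectors and -vec_type cusp from the command line):

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscInt       i, n = 10000000, nloops = 100;
  PetscScalar    alpha = 2.0, beta = 3.0;
  PetscLogStage  stage;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, (char *)0, NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, "-n", &n, NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, "-nloops", &nloops, NULL);CHKERRQ(ierr);

  /* Create two vectors; VecSetFromOptions honors -vec_type cusp, etc. */
  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = VecSet(y, 2.0);CHKERRQ(ierr);

  /* Put the loop in its own log stage so -log_summary separates it from setup. */
  ierr = PetscLogStageRegister("AXPY/AYPX loop", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < nloops; i++) {
    ierr = VecAXPY(y, alpha, x);CHKERRQ(ierr);  /* y <- alpha*x + y */
    ierr = VecAYPX(y, beta, x);CHKERRQ(ierr);   /* y <- x + beta*y  */
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

Run it with -log_summary, once with one MPI process per GPU and -vec_type cusp and once on all 64 cores with plain MPI vectors, and the VecAXPY/VecAYPX lines in the two logs should be directly comparable.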
Also, you might limit the number of iterations to, say, 100 (e.g., -ksp_max_it 100 instead of 100000), so it does not take 10 hours to run these tests.

You could also try scaling the problem up (or down) to see when these problems kick in (e.g., when you go off a node ...).

Mark

On Feb 23, 2012, at 2:49 PM, Nystrom, William D wrote:

Hi Matt,

Attached are the log files for the two runs.

Thanks,

Dave

________________________________
From: petsc-dev-bounces@mcs.anl.gov [petsc-dev-bounces@mcs.anl.gov] on behalf of Matthew Knepley [knepley@gmail.com]
Sent: Thursday, February 23, 2012 11:17 AM
To: For users of the development version of PETSc
Subject: Re: [petsc-dev] Understanding Some Parallel Results with PETSc

On Thu, Feb 23, 2012 at 11:06 AM, Nystrom, William D <wdn@lanl.gov> wrote:

I recently ran a couple of test runs with petsc-dev that I do not understand. I'm running on a test bed machine that has 4 nodes with two Tesla 2090 GPUs per node. Each node is dual socket and populated with Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz processors. These are 8-core processors, so each node has 16 cores. On the GPUs, I'm running with Paul's latest version of the txpetscgpu package. I'm running the src/ksp/ksp/examples/tutorials/ex2.c PETSc example with m=n=10000. My objective was to compare the performance of running on 4 nodes using all 8 GPUs with that of running on the same 4 nodes using all 64 cores. This problem uses about a third of the memory available on the GPUs. I was using cg with jacobi preconditioning for both the GPU run and the CPU run. What is puzzling to me is that the CPU case ran 44x slower than the GPU case, and the big difference was in the time spent in functions like VecTDot, VecNorm and VecAXPY.

Below is a table that summarizes the performance of the main functions that were consuming time in the two runs. Times are in seconds.

          |    GPU |     CPU |  Ratio
----------+--------+---------+-------
MatMult   | 450.64 |  5484.7 |  12.17
VecTDot   | 285.35 | 16688.0 |  58.48
VecNorm   |  19.03 |  9058.8 | 476.03
VecAXPY   | 106.88 |  5636.3 |  52.73
VecAYPX   |  53.69 |    85.1 |   1.58
KSPSolve  | 811.95 | 35930.0 |  44.25

The CPU-to-GPU ratio for MatMult is what I typically see when comparing a CPU run on a single core with a run on a single GPU. Since both runs communicate across nodes via MPI, I'm puzzled about why the CPU case is so much slower than the GPU case, especially since there is communication for the MatMult as well. Both runs compute the same final error norm using the exact same number of iterations. Do these results make sense to people who understand the performance issues of parallel sparse linear solvers much better than I do? Or do these results look abnormal? I had wondered if part of the performance issue was related to my running 8 times as many MPI processes for the CPU case.
However, I ran a smaller problem with m=n=1000 using 8 MPI processes and 2 cores per node, and I see the same extreme differences in the times spent in VecTDot, VecNorm and VecAXPY.

Here are the command lines I used for the two runs:

CPU:

mpirun -np 64 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left

GPU:

mpirun -np 8 -npernode 2 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left -mat_type aijcusp -vec_type cusp -cusp_storage_format dia

1) Always send -log_summary with performance questions

2) Comparing two things will not make any sense beyond "one ran faster" without a model for execution time

3) In order to make sense of my model, I need flop rates for those events

   Matt

Thanks,

Dave

--
Dave Nystrom
LANL HPC-5
Phone: 505-667-7913
Email: wdn@lanl.gov
Smail: Mail Stop B272
       Group HPC-5
       Los Alamos National Laboratory
       Los Alamos, NM 87545

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

<ex2_10000_10000_cg_jacobi_mpi_64.log>
<ex2_10000_10000_cg_jacobi_cusp_dia_mpi_8.log>