<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style id="owaParaStyle" type="text/css">

<!--

p

        {margin-top:0;

        margin-bottom:0}

-->

P {margin-top:0;margin-bottom:0;}</style>

</head>

<body ocsi="0" fpstyle="1" style="word-wrap: break-word;">

<div style="direction: ltr;font-family: Arial;color: #000000;font-size: 14pt;">Rerunning the CPU case with numactl results in a 25x speedup and log_summary<br>

results that look reasonable to me now.  I'm wondering now what the result will<br>

be for running the GPU case with numactl.  It's in the queue waiting to run now.<br>

<div><br>

Dave<br>

<br>

</div>

<div style="font-family: Times New Roman; color: rgb(0, 0, 0); font-size: 16px;">

<hr tabindex="-1">

<div style="direction: ltr;" id="divRpF166841"><font color="#000000" face="Tahoma" size="2"><b>From:</b> Nystrom, William D<br>

<b>Sent:</b> Thursday, February 23, 2012 4:24 PM<br>

<b>To:</b> For users of the development version of PETSc<br>

<b>Cc:</b> Nystrom, William D<br>

<b>Subject:</b> RE: [petsc-dev] Understanding Some Parallel Results with PETSc<br>

</font><br>

</div>

<div></div>

<div>

<div style="direction: ltr; font-family: Arial; color: rgb(0, 0, 0); font-size: 14pt;">

I think I may be starting to understand this now.  I ran a smaller CPU problem<br>

with numactl and compared the results to the same problem run without<br>

numactl.  The problem size was 1000x1000.  The result was stunning.  Using<br>

numactl, the problem ran 580x faster.  The performance of VecAXPY and<br>

VecAYPX was comparable, and the performance of VecTDot and VecNorm<br>

was also very good.  So I think I will rerun my 10000x10000 case with numactl<br>

and see what the results look like.<br>

<br>

Thanks<br>

<br>

Dave<br>

<br>

<div></div>

<div style="font-family: Times New Roman; color: rgb(0, 0, 0); font-size: 16px;">

<hr tabindex="-1">

<div id="divRpF524307" style="direction: ltr;"><font color="#000000" face="Tahoma" size="2"><b>From:</b> Nystrom, William D<br>

<b>Sent:</b> Thursday, February 23, 2012 3:04 PM<br>

<b>To:</b> For users of the development version of PETSc<br>

<b>Cc:</b> Nystrom, William D<br>

<b>Subject:</b> RE: [petsc-dev] Understanding Some Parallel Results with PETSc<br>

</font><br>

</div>

<div></div>

<div>

<div style="direction: ltr; font-family: Arial; color: rgb(0, 0, 0); font-size: 14pt;">

Hi Mark,<br>

<br>

Thanks for the suggestions.  Sounds like you would say that there is<br>

something wrong with the performance of the CPU-only calculation.<br>

Is that a fair conclusion?  I have been looking at smaller problem<br>

sizes since the original run.  Limiting the iteration count also seems<br>

like a good way to look at the performance of larger problem sizes.<br>

<br>

Thanks again for your suggestions.<br>

<br>

Dave<br>

<div>

<div style="font-family: Tahoma; font-size: 13px;"><font size="2"><span style="font-size: 10pt;"></span></font><br>

</div>

</div>

<div style="font-family: Times New Roman; color: rgb(0, 0, 0); font-size: 16px;">

<hr tabindex="-1">

<div id="divRpF491886" style="direction: ltr;"><font color="#000000" face="Tahoma" size="2"><b>From:</b> petsc-dev-bounces@mcs.anl.gov [petsc-dev-bounces@mcs.anl.gov] on behalf of Mark F. Adams [mark.adams@columbia.edu]<br>

<b>Sent:</b> Thursday, February 23, 2012 2:20 PM<br>

<b>To:</b> For users of the development version of PETSc<br>

<b>Subject:</b> Re: [petsc-dev] Understanding Some Parallel Results with PETSc<br>

</font><br>

</div>

<div></div>

<div>

<div>The difference in performance between VecAXPY and VecAYPX is dramatic (~35x), and these are dead simple methods that are almost identical and are not parallel, so they may be a good place to start looking.   You might look at a simpler example like src/vec/vec/examples/tutorials/ex1.c.

  You could add a loop around the calls to VecAXPY and VecAYPX to get some meaningful timings.</div>
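<div><br>
</div>

<div>As a rough illustration of the timing loop suggested above, here is a minimal sketch in plain Python rather than PETSc. The names axpy, aypx and time_kernel are hypothetical stand-ins, not PETSc routines; they only mirror the updates y = alpha*x + y and y = x + beta*y so that both kernels can be timed over many repetitions. The vector length and repeat count are arbitrary assumptions.</div>

<pre>
```python
import time


def axpy(y, alpha, x):
    """In-place y = alpha*x + y (the update VecAXPY performs)."""
    for i in range(len(y)):
        y[i] = alpha * x[i] + y[i]


def aypx(y, beta, x):
    """In-place y = x + beta*y (the update VecAYPX performs)."""
    for i in range(len(y)):
        y[i] = x[i] + beta * y[i]


def time_kernel(kernel, n, repeats):
    """Return total wall-clock seconds for `repeats` calls on length-n vectors."""
    x = [1.0] * n
    y = [2.0] * n
    start = time.perf_counter()
    for _ in range(repeats):
        kernel(y, 0.5, x)
    return time.perf_counter() - start


if __name__ == "__main__":
    n, repeats = 100000, 20  # hypothetical sizes, chosen only to keep the run short
    t_axpy = time_kernel(axpy, n, repeats)
    t_aypx = time_kernel(aypx, n, repeats)
    print("axpy seconds:", t_axpy)
    print("aypx seconds:", t_aypx)
    print("axpy/aypx ratio:", t_axpy / t_aypx)
```
</pre>

<div>Both kernels do the same work per entry, so a ratio far from 1 in a loop like this would point at the environment (for example memory placement) rather than at the arithmetic.</div>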

<div><br>

</div>

<div>Also, you might limit the number of iterations to, say, 100 so it does not take 10 hours to run these tests.</div>

<div><br>

</div>

<div>You could also try scaling the problem up (or down) to see when these problems kick in (e.g., when you go off a node ...).</div>

<div><br>

</div>

<div>Mark</div>

<div><br>

</div>

<div>

<div>On Feb 23, 2012, at 2:49 PM, Nystrom, William D wrote:</div>

<br class="Apple-interchange-newline">

<blockquote type="cite"><span class="Apple-style-span" style="border-collapse: separate; font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; font-size: medium;">

<div>

<div style="direction: ltr; font-family: Arial; color: rgb(0, 0, 0); font-size: 14pt;">

Hi Matt,<br>

<br>

Attached are the log files for the two runs.<br>

<br>

Thanks,<br>

<br>

Dave<br>

<div>

<div style="font-family: Tahoma; font-size: 13px;"><font size="2"><span style="font-size: 10pt;"></span></font><br>

</div>

</div>

<div style="font-family: 'Times New Roman'; color: rgb(0, 0, 0); font-size: 16px;">

<hr tabindex="-1">

<div id="divRpF62437" style="direction: ltr;"><font color="#000000" face="Tahoma" size="2"><b>From:</b><span class="Apple-converted-space"> </span><a href="mailto:petsc-dev-bounces@mcs.anl.gov" target="_blank">petsc-dev-bounces@mcs.anl.gov</a><span class="Apple-converted-space"> </span>[petsc-dev-bounces@mcs.anl.gov]

 on behalf of Matthew Knepley [knepley@gmail.com]<br>

<b>Sent:</b><span class="Apple-converted-space"> </span>Thursday, February 23, 2012 11:17 AM<br>

<b>To:</b><span class="Apple-converted-space"> </span>For users of the development version of PETSc<br>

<b>Subject:</b><span class="Apple-converted-space"> </span>Re: [petsc-dev] Understanding Some Parallel Results with PETSc<br>

</font><br>

</div>

<div></div>

<div>On Thu, Feb 23, 2012 at 11:06 AM, Nystrom, William D<span class="Apple-converted-space"> </span><span dir="ltr">&lt;<a href="mailto:wdn@lanl.gov" target="_blank">wdn@lanl.gov</a>&gt;</span><span class="Apple-converted-space"> </span>wrote:<br>

<div class="gmail_quote">

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex; position: static; z-index: auto;">

I recently ran a couple of test runs with petsc-dev that I do not understand.  I'm running on a test bed<br>

machine that has 4 nodes with two Tesla 2090 gpus per node.  Each node is dual socket and populated<br>

with Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz processors.  These are 8 core processors and so each<br>

node has 16 cores.  On the gpu, I'm running with Paul's latest version of the txpetscgpu package.  I'm<br>

running the src/ksp/ksp/examples/tutorials/ex2.c petsc example with m=n=10000.  My objective was<br>

to compare the performance running on 4 nodes using all 8 gpus to that of running on the same 4 nodes<br>

with all 64 cores.  This problem uses about a third of the memory available on the gpus.  I was using cg<br>

with jacobi preconditioning on both the gpu run and the cpu run.  What is puzzling to me is that the cpu<br>

case ran 44x slower than the gpu case and the big difference was in the time spent in functions<br>

like VecTDot, VecNorm and VecAXPY.<br>

<br>

Below is a table that summarizes the performance of the main functions that were using time in the<br>

two runs.  Times are in seconds.<br>

<br>

<table border="1" cellpadding="4">
<tr><th>Event</th><th>GPU time (s)</th><th>CPU time (s)</th><th>Ratio (CPU/GPU)</th></tr>
<tr><td>MatMult</td><td>450.64</td><td>5484.7</td><td>12.17</td></tr>
<tr><td>VecTDot</td><td>285.35</td><td>16688.0</td><td>58.48</td></tr>
<tr><td>VecNorm</td><td>19.03</td><td>9058.8</td><td>476.03</td></tr>
<tr><td>VecAXPY</td><td>106.88</td><td>5636.3</td><td>52.73</td></tr>
<tr><td>VecAYPX</td><td>53.69</td><td>85.1</td><td>1.58</td></tr>
<tr><td>KSPSolve</td><td>811.95</td><td>35930.0</td><td>44.25</td></tr>
</table>

<br>
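<div>For readers who want to check the Ratio column, the short Python sketch below recomputes CPU time divided by GPU time from the times in the table above. The dictionary and function names are hypothetical; only the numbers come from the table, and the recomputed values agree with the listed ratios to within rounding.</div>

<pre>
```python
# Times in seconds, copied from the table above: (GPU, CPU).
TIMES = {
    "MatMult": (450.64, 5484.7),
    "VecTDot": (285.35, 16688.0),
    "VecNorm": (19.03, 9058.8),
    "VecAXPY": (106.88, 5636.3),
    "VecAYPX": (53.69, 85.1),
    "KSPSolve": (811.95, 35930.0),
}


def cpu_over_gpu(times):
    """Return {event: CPU time / GPU time}."""
    return {name: cpu / gpu for name, (gpu, cpu) in times.items()}


if __name__ == "__main__":
    for name, ratio in cpu_over_gpu(TIMES).items():
        print(f"{name:10s} {ratio:8.2f}")
```
</pre>

<div><br>
</div>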

The ratio of MatMult for CPU versus GPU is what I typically see when I am comparing a CPU run on<br>

a single core versus a run on a single GPU.  Since both runs are communicating across nodes via MPI,<br>

I'm puzzled about why the CPU case is so much slower than the GPU case especially since there is<br>

communication for the MatMult as well.  Both runs compute the same final error norm using the exact<br>

same number of iterations.  Do these results make sense to people who understand the performance<br>

issues of parallel sparse linear solvers much better than I?  Or do these results look abnormal?  I had<br>

wondered if part of the performance issue was related to my running 8 times as many mpi processes<br>

for the CPU case.  However, I ran a smaller problem with m=n=1000 and using 8 mpi processes and<br>

2 cores per node and I see the same extreme differences in the times spent in VecTDot, VecNorm<br>

and VecAXPY.<br>

<br>

Here are the command lines I used for the two runs:<br>

<br>

CPU:<br>

<br>

mpirun -np 64 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left<br>

<br>

GPU:<br>

<br>

mpirun -np 8 -npernode 2 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left -mat_type aijcusp -vec_type cusp -cusp_storage_format dia<br>

</blockquote>

<div><br>

</div>

<div>1) Always send -log_summary with performance questions</div>

<div><br>

</div>

<div>2) Comparing two things will not make any sense beyond "one ran faster" without a model for execution time</div>

<div><br>

</div>

<div>3) In order to make sense of my model, I need flop rates for those events</div>
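<div><br>
</div>

<div>To make points 2) and 3) concrete, here is one possible back-of-the-envelope model, sketched in Python. It is an illustration, not Matt's model: it assumes a vector kernel such as AXPY on doubles performs 2 flops per entry and moves 24 bytes per entry (read x, read y, write y), and that ex2 with m=n=10000 has m*n = 10^8 unknowns. The function name, call count and time in the usage lines are hypothetical placeholders; real values would come from -log_summary.</div>

<pre>
```python
def streaming_rates(n, calls, seconds, flops_per_entry, bytes_per_entry):
    """Return (GFlop/s, GB/s) achieved by a vector kernel.

    n: vector length; calls: number of kernel calls; seconds: total time.
    """
    flops = flops_per_entry * n * calls
    traffic = bytes_per_entry * n * calls
    return flops / seconds / 1e9, traffic / seconds / 1e9


if __name__ == "__main__":
    # Placeholder inputs: the call count and time are NOT measured values.
    gflops, gbytes = streaming_rates(n=10**8, calls=100, seconds=50.0,
                                     flops_per_entry=2, bytes_per_entry=24)
    print("GFlop/s:", gflops, "GB/s:", gbytes)
```
</pre>

<div>Comparing the GB/s figure against the memory bandwidth of the hardware gives a sense of whether an event is running near what the machine can deliver or far below it.</div>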

<div><br>

</div>

<div>   Matt</div>

<div> </div>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Thanks,<br>

<br>

Dave<br>

<br>

--<br>

Dave Nystrom<br>

LANL HPC-5<br>

Phone:<span class="Apple-converted-space"> </span><a href="tel:505-667-7913" value="+15056677913" target="_blank">505-667-7913</a><br>

Email:<span class="Apple-converted-space"> </span><a href="mailto:wdn@lanl.gov" target="_blank">wdn@lanl.gov</a><br>

Smail: Mail Stop B272<br>

      Group HPC-5<br>

      Los Alamos National Laboratory<br>

      Los Alamos, NM 87545<br>

<br>

</blockquote>

</div>

<br>

<br clear="all">

<div><br>

</div>

--<span class="Apple-converted-space"> </span><br>

What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.<br>

-- Norbert Wiener<br>

</div>

</div>

</div>

<span>&lt;ex2_10000_10000_cg_jacobi_mpi_64.log&gt;</span><span>&lt;ex2_10000_10000_cg_jacobi_cusp_dia_mpi_8.log&gt;</span></div>

</span></blockquote>

</div>

<br>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</body>

</html>