[petsc-dev] Understanding Some Parallel Results with PETSc

Barry Smith bsmith at mcs.anl.gov
Thu Feb 23 21:52:09 CST 2012


  Jed,

    Could you or another qualified person (and that ain't me) add something about numactl to the "What kind of parallel computers or clusters are needed to use PETSc?" FAQ question?

   Thanks

    Barry

On Feb 23, 2012, at 6:53 PM, Nystrom, William D wrote:

> Rerunning the CPU case with numactl results in a 25x speedup and log_summary
> results that look reasonable to me now.  I'm wondering now what the result will
> be for running the GPU case with numactl.  It's in the queue waiting to run now.
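> 
> One way to combine numactl with Open MPI for a run like this (a sketch only; the
> binding flags shown are illustrative and not necessarily what was used here):
> 
>   mpirun -np 64 -npernode 16 -bind-to-core -mca btl self,sm,openib \
>       numactl --localalloc ex2 -m 10000 -n 10000 -ksp_type cg \
>       -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left
> 
> With -bind-to-core each rank stays on one core, and numactl --localalloc keeps its
> allocations on the NUMA node it runs on, which is what matters for the bandwidth-bound
> vector kernels.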
> 
> Dave
> 
> From: Nystrom, William D
> Sent: Thursday, February 23, 2012 4:24 PM
> To: For users of the development version of PETSc
> Cc: Nystrom, William D
> Subject: RE: [petsc-dev] Understanding Some Parallel Results with PETSc
> 
> I think I may be starting to understand this now.  I ran a smaller CPU problem
> with numactl and compared the results to the same problem run without
> numactl.  The problem size was 1000x1000.  The result was stunning.  Using
> numactl, the problem ran 580x faster.  The performance of VecAXPY and
> VecAYPX was comparable, and the performance of VecTDot and VecNorm
> was also very good.  So I think I will rerun my 10000x10000 case with numactl
> and see what the results look like.
> 
> Thanks
> 
> Dave
> 
> From: Nystrom, William D
> Sent: Thursday, February 23, 2012 3:04 PM
> To: For users of the development version of PETSc
> Cc: Nystrom, William D
> Subject: RE: [petsc-dev] Understanding Some Parallel Results with PETSc
> 
> Hi Mark,
> 
> Thanks for the suggestions.  Sounds like you would say that there is
> something wrong with the performance of the CPU-only calculation.
> Is that a fair conclusion?  I have been looking at smaller problem
> sizes since the original run.  Limiting the iteration count also seems
> like a good way to look at the performance of larger problem sizes.
> 
> Thanks again for your suggestions.
> 
> Dave
> 
> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Mark F. Adams [mark.adams at columbia.edu]
> Sent: Thursday, February 23, 2012 2:20 PM
> To: For users of the development version of PETSc
> Subject: Re: [petsc-dev] Understanding Some Parallel Results with PETSc
> 
> The difference in performance between VecAXPY and VecAYPX is dramatic (~35x), and these are dead simple methods that are almost identical and are not parallel, so they may be a good place to start looking.  You might look at a simpler example like src/vec/vec/examples/tutorials/ex1.c.  You could add a loop around the calls to VecAXPY and VecAYPX to get some meaningful timings.
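> 
> A minimal sketch of such a loop, dropped into ex1.c after its vectors x and y are
> created (the stage name and loop count are arbitrary), so the cost shows up as a
> separate stage in -log_summary:
> 
>   PetscLogStage stage;
>   PetscInt      k;
>   ierr = PetscLogStageRegister("axpy/aypx loop",&stage);CHKERRQ(ierr);
>   ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>   for (k=0; k<1000; k++) {
>     ierr = VecAXPY(y,2.0,x);CHKERRQ(ierr);   /* y <- y + 2*x */
>     ierr = VecAYPX(y,2.0,x);CHKERRQ(ierr);   /* y <- x + 2*y */
>   }
>   ierr = PetscLogStagePop();CHKERRQ(ierr);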
> 
> Also, you might limit the number of iterations to, say, 100, so it does not take 10 hours to run these tests.
> 
> You could also try scaling the problem up (or down) to see when these problems kick in (e.g., when you go off a node ...).
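> 
> For instance (the sizes and counts here are just illustrative), the same ex2 run can be
> capped and shrunk with the options already being used:
> 
>   mpirun -np 16 ex2 -m 2000 -n 2000 -ksp_type cg -pc_type jacobi -ksp_max_it 100 -log_summary
> 
> and then repeated with different -np, -npernode, -m and -n to see where the vector kernels fall off.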
> 
> Mark
> 
> On Feb 23, 2012, at 2:49 PM, Nystrom, William D wrote:
> 
>> Hi Matt,
>> 
>> Attached are the log files for the two runs.
>> 
>> Thanks,
>> 
>> Dave
>> 
>> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Matthew Knepley [knepley at gmail.com]
>> Sent: Thursday, February 23, 2012 11:17 AM
>> To: For users of the development version of PETSc
>> Subject: Re: [petsc-dev] Understanding Some Parallel Results with PETSc
>> 
>> On Thu, Feb 23, 2012 at 11:06 AM, Nystrom, William D <wdn at lanl.gov> wrote:
>> I recently ran a couple of test runs with petsc-dev that I do not understand.  I'm running on a test bed
>> machine that has 4 nodes with two Tesla 2090 gpus per node.  Each node is dual socket and populated
>> with Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz processors.  These are 8 core processors and so each
>> node has 16 cores.  On the gpu, I'm running with Paul's latest version of the txpetscgpu package.  I'm
>> running the src/ksp/ksp/examples/tutorials/ex2.c petsc example with m=n=10000.  My objective was
>> to compare the performance running on 4 nodes using all 8 gpus to that of running on the same 4 nodes
>> with all 64 cores.  This problem uses about a third of the memory available on the gpus.  I was using cg
>> with jacobi preconditioning on both the gpu run and the cpu run.  What is puzzling to me is that the cpu
>> case ran 44x slower than the gpu case, and the big difference was in the time spent in functions
>> like VecTDot, VecNorm and VecAXPY.
>> 
>> Below is a table that summarizes the performance of the main functions that consumed time in the
>> two runs.  Times are in seconds.
>> 
>> Event     |  GPU (s)  |  CPU (s)  |  Ratio
>> ----------+-----------+-----------+--------
>> MatMult   |    450.64 |    5484.7 |  12.17
>> VecTDot   |    285.35 |   16688.0 |  58.48
>> VecNorm   |     19.03 |    9058.8 | 476.03
>> VecAXPY   |    106.88 |    5636.3 |  52.73
>> VecAYPX   |     53.69 |      85.1 |   1.58
>> KSPSolve  |    811.95 |   35930.0 |  44.25
>> 
>> The ratio of MatMult for CPU versus GPU is what I typically see when I am comparing a CPU run on
>> a single core versus a run on a single GPU.  Since both runs are communicating across nodes via MPI,
>> I'm puzzled about why the CPU case is so much slower than the GPU case especially since there is
>> communication for the MatMult as well.  Both runs compute the same final error norm using the exact
>> same number of iterations.  Do these results make sense to people who understand the performance
>> issues of parallel sparse linear solvers much better than I do?  Or do these results look abnormal?  I had
>> wondered if part of the performance issue was related to my running 8 times as many mpi processes
>> for the CPU case.  However, I ran a smaller problem with m=n=1000 and using 8 mpi processes and
>> 2 cores per node and I see the same extreme differences in the times spent in VecTDot, VecNorm
>> and VecAXPY.
>> 
>> Here are the command lines I used for the two runs:
>> 
>> CPU:
>> 
>> mpirun -np 64 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left
>> 
>> GPU:
>> 
>> mpirun -np 8 -npernode 2 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left -mat_type aijcusp -vec_type cusp -cusp_storage_format dia
>> 
>> 1) Always send -log_summary with performance questions
>> 
>> 2) Comparing two things will not make any sense beyond "one ran faster" without a model for execution time
>> 
>> 3) In order to make sense of my model, I need flop rates for those events
>> 
>>    Matt
>>  
>> Thanks,
>> 
>> Dave
>> 
>> --
>> Dave Nystrom
>> LANL HPC-5
>> Phone: 505-667-7913
>> Email: wdn at lanl.gov
>> Smail: Mail Stop B272
>>       Group HPC-5
>>       Los Alamos National Laboratory
>>       Los Alamos, NM 87545
>> 
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> -- Norbert Wiener
>> <ex2_10000_10000_cg_jacobi_mpi_64.log><ex2_10000_10000_cg_jacobi_cusp_dia_mpi_8.log>
> 
> 



