[petsc-dev] Understanding Some Parallel Results with PETSc

Nystrom, William D wdn at lanl.gov
Fri Feb 24 12:11:45 CST 2012


Hi Jed,

Attached is a gzipped tarball of the scripts I used to run these two test problems with
numactl.  I hacked them a bit this morning because I had been running them in our test
framework for doing acceptance testing of new systems, but the scripts in the tarball
should give you all the info you need.  There is a top-level script called runPetsc that
just invokes mpirun from OpenMPI and calls the wrapper scripts that apply numactl.  You
could dispense with the top-level script and invoke the mpirun commands yourself; I
include it as an easy way to document what I did.  The runPetscProb_1 script runs PETSc
on the GPUs, using numactl to control the affinities of the GPUs to the CPU NUMA nodes.
The runPetscProb_2 script runs PETSc on the CPUs using numactl.  Note that both wrapper
scripts use OpenMPI environment variables; I'm not sure how one would do the same thing
with another flavor of MPI, but I imagine it is possible.  I'm also not sure whether
there are more elegant ways to run with numactl than the wrapper-script approach.
Perhaps there are, but this is what we have been doing.
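
For reference, the guts of a wrapper like that are small.  Here is a minimal sketch, not
the actual scripts from the tarball; the script name, the two-NUMA-node layout, and the
even/odd rank mapping are just assumptions for illustration:

    #!/bin/bash
    # numactl_wrap.sh -- hypothetical sketch, not the real runPetscProb_* scripts.
    # Assumes OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK and that the node has two
    # NUMA domains, with even local ranks pinned to node 0 and odd ranks to node 1.
    lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    node=$(( lrank % 2 ))
    # Bind both the cpus and the memory of this rank to the chosen NUMA node, then
    # exec the real command (the PETSc executable and its options).
    exec numactl --cpunodebind=$node --membind=$node "$@"

It would be launched with something like "mpirun -np 16 ./numactl_wrap.sh ./petsc_app
<options>", which is the general shape of the mpirun lines the top-level script issues.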

I've also included a Perl script called numa-maps that is useful for checking the
affinities you actually get at run time, to make sure that numactl is doing what you
think it is doing.  I'm not sure where this script comes from; I find it on some
systems and not on others.
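
If numa-maps isn't around, much the same check can be done straight from the kernel.
These are generic Linux commands, not anything from the tarball; replace <pid> with the
process id of one of the ranks:

    # NUMA policy and per-node page counts for a running process
    cat /proc/<pid>/numa_maps
    # CPU affinity list of the same process
    taskset -cp <pid>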

I've also included logs with the output of cpuinfo, nvidia-smi and uname -a to answer
any questions you had about the system I was running on.

Finally, I've included runPetscProb_1.log and runPetscProb_2.log, which contain the
log_summary output for my latest runs on the GPU and CPU respectively.  Using numactl
reduced the runtime for the GPU case as well, but not as much as for the CPU case.  The
final result was that running the same problem using all of the GPU resources on a node
was about 2.5x faster than using all of the CPU resources on the same number of nodes.
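
For anyone following along, log_summary output like that just comes from passing
-log_summary on the PETSc command line, e.g. something like the following, where the
executable name and process count are placeholders:

    # -log_summary prints PETSc's performance summary at the end of the run
    mpirun -np 16 ./petsc_app -log_summary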

Let me know if you need any more info.  I'm planning to use this stuff to help test a
new GPU cluster that we have just started acceptance testing on.  It has the same basic
hardware as the testbed cluster used for these results but has 308 nodes.  That should
be interesting and fun.

Thanks,

Dave

--
Dave Nystrom
LANL HPC-5
Phone: 505-667-7913
Email: wdn at lanl.gov
Smail: Mail Stop B272
       Group HPC-5
       Los Alamos National Laboratory
       Los Alamos, NM 87545

________________________________
From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov] on behalf of Jed Brown [jedbrown at mcs.anl.gov]
Sent: Thursday, February 23, 2012 10:43 PM
To: For users of the development version of PETSc
Cc: Dave Nystrom
Subject: Re: [petsc-dev] Understanding Some Parallel Results with PETSc

On Thu, Feb 23, 2012 at 23:41, Dave Nystrom <dnystrom1 at comcast.net> wrote:
I could also send you my mpi/numactl command lines for gpu and cpu when I am
back in the office.

Yes, please.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cpu_vs_gpu_with_numactl.tar.gz
Type: application/x-gzip
Size: 9833 bytes
Desc: cpu_vs_gpu_with_numactl.tar.gz
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20120224/12510e8c/attachment.gz>

