[petsc-dev] Status of threadcomm testing

Nystrom, William D wdn at lanl.gov
Tue Nov 12 14:45:49 CST 2013


I've been doing a bunch of strong scaling studies using one of the PETSc
example problems, src/ksp/ksp/examples/tutorials/ex2.c, and some of the
newer capabilities in the "next" branch.  One of these newer capabilities is
the threadcomm package, which allows trying combinations like MPI+pthreads
and MPI+OpenMP.  About 3 months ago, I discovered a performance scaling
issue with the threadcomm package when going from a single node to two
nodes which I reported to petsc-maint.  When performing strong scaling
runs of ex2 for a 6400x6400 grid, the run takes longer for 2 nodes than for
a single node.  However, runs on 4 to 64 nodes scale very well in the strong
scaling sense when compared to the 2 node case.  Over
the last 3 months, I've spent some time trying to gather additional testing
data and to debug the problem using HPCToolkit, Totalview and printf.
One of the problems I discovered was that for the MPI + pthread case with
more than one node, the VecNorm function was not calling the threaded
kernel.  So with a single rank per node, VecNorm was being computed serially
by that single rank, without the threads.  I produced a hack that fixes the
VecNorm problem; it is attached as the petsc_vecnorm_hack.tar file.  With that
fix, I then reran some of my scaling studies.  Attached are plots of two of
those results.  The first is a set of runs performed on the Vulcan Blue Gene
Q machine at LLNL.  Those results used 2 threads per core, which appears to
be necessary to saturate the memory bandwidth of the Vulcan nodes.  The plots
show the strong scaling results before and after the VecNorm fix.  As can
be seen, the scaling is very good for Vulcan after the VecNorm fix.
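
To make the nature of that bug concrete, here is a rough sketch of the
pattern involved.  The function and variable names are made up for
illustration - this is not the PETSc source and not the contents of the
attached hack - but it shows the distinction: the local piece of the 2-norm
should be computed by the threaded kernel on each rank, and only the
reduction should go through MPI.  The multi-node path was effectively
skipping the threaded kernel and doing the local sum serially on the one
rank.

  /* Illustration only - placeholder names, not the actual PETSc code.  */
  #include <mpi.h>
  #include <math.h>

  /* Stand-in for the threaded local kernel; the real version would     */
  /* split this loop across the threadcomm threads.                     */
  static double LocalSumOfSquares(const double *x,int n)
  {
    double sum = 0.0;
    int    i;
    for (i=0; i<n; i++) sum += x[i]*x[i];
    return sum;
  }

  /* 2-norm of a distributed vector: local partial sum, then Allreduce. */
  double NormSketch(MPI_Comm comm,const double *xlocal,int nlocal)
  {
    double localsum,globalsum;
    /* the bug was here: the local sum ran serially on the lone rank    */
    /* instead of being dispatched to the threaded kernel               */
    localsum = LocalSumOfSquares(xlocal,nlocal);
    MPI_Allreduce(&localsum,&globalsum,1,MPI_DOUBLE,MPI_SUM,comm);
    return sqrt(globalsum);
  }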

The second set of plots show the results from running the same strong
scaling study on one of the LANL Cray XE6 machines with the VecNorm
fix.  In this case, there is still a problem with scaling when going from one
node to two nodes, but it is not as bad as before the VecNorm fix.  I've
also done scaling studies after the VecNorm fix on another LANL cluster
whose nodes have dual socket Sandy Bridge CPUs; these also show a continued
problem with scaling from one node to two nodes, though again it is better
than before the VecNorm fix.

My conclusion from the results presented above is that this is a NUMA
issue because the scaling is very good on Vulcan where the nodes do
not have NUMA issues.

All of these runs have been performed using the "-threadcomm_affinities"
option.  By adding code like the following to some of the threaded kernels:

  /* wdn hack begin */
  #include <pthread.h>
  #include <sched.h>
  cpu_set_t cpuset;
  PetscInt  j;
  CPU_ZERO(&cpuset);
  /* query the affinity mask of the calling thread (pid 0 == this thread) */
  sched_getaffinity(0,sizeof(cpu_set_t),&cpuset);
  printf( "VecTDot thread_id: %d\n", (int)thread_id );
  /* report every core this thread may run on (16 cores per node here) */
  for ( j=0; j<16; j++ )
  {
      if ( CPU_ISSET( j, &cpuset ) )
      {
          printf( "VecTDot thread_id: %d   core: %d\n", (int)thread_id, (int)j );
      }
  }
  /* wdn hack end */

I believe I have convinced myself that the threadcomm affinity support is
working properly for the MPI case with multiple nodes, i.e. that it is
setting the thread affinities consistent with the user input.
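
For what it's worth, the same check can be made by querying the pthread
itself instead of relying on sched_getaffinity(0,...) on the calling task.
Here is a minimal standalone version of that check - it is not part of the
attached hack, just an equivalent way to do the same verification:

  /* Standalone illustration - same check via pthread_getaffinity_np.   */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  /* Print the cores the given thread is allowed to run on. */
  static void report_affinity(pthread_t tid,int thread_id,int ncores)
  {
    cpu_set_t cpuset;
    int       j;
    CPU_ZERO(&cpuset);
    if (pthread_getaffinity_np(tid,sizeof(cpu_set_t),&cpuset)) return;
    for (j=0; j<ncores; j++) {
      if (CPU_ISSET(j,&cpuset))
        printf("thread_id: %d   core: %d\n",thread_id,j);
    }
  }

Calling report_affinity(pthread_self(),thread_id,16) from inside a threaded
kernel should print the same core list as the sched_getaffinity version
above.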

I've also performed a set of strong scaling runs on a NUMA machine using
8 threads but without setting thread affinities.  These runs scale pretty well
but are about a factor of 2 slower initially than the case where thread affinities
are set.  These runs are shown in the last attached plot; see the curves
marked "aff_yes" and "aff_no".  That plot also shows that the two node
result is about the same with or without affinities set.  Since the
diagnostic printf above indicates that the thread affinities are being
properly set and recognized by the OS, it seems that this remaining problem
is the result of the data for the threads mapped to the second socket's
cores residing in a different NUMA domain than those threads when there are
two or more nodes.
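
One way to test the NUMA placement hypothesis directly, instead of
inferring it from timings, would be to ask the kernel which NUMA node the
vector's pages actually live on from inside a threaded kernel.  The sketch
below uses move_pages() from libnuma in its query-only mode; the pointer
and length arguments are stand-ins for whatever array the kernel is working
on, and this is only a suggested diagnostic, not something I have run:

  /* Diagnostic sketch (Linux, link with -lnuma): report which NUMA node */
  /* each page of an array currently lives on.                           */
  #include <numaif.h>   /* move_pages */
  #include <unistd.h>   /* sysconf    */
  #include <stdio.h>

  static void report_page_nodes(const char *label,void *data,size_t nbytes)
  {
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    char  *p        = (char*)data;
    size_t i;
    for (i=0; i*pagesize < nbytes; i++) {
      void *page   = p + i*pagesize;
      int   status = -1;
      /* with the nodes argument NULL, move_pages only queries location  */
      if (move_pages(0,1,&page,NULL,&status,0) == 0)
        printf("%s page %lu -> NUMA node %d\n",label,(unsigned long)i,status);
    }
  }

If the hypothesis is right, the pages touched by threads bound to second
socket cores would come back on the first socket's node in the multi-node
runs but on the second socket's node in the single node runs.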

For the single node case, it would seem that the data is properly distributed
in memory so that it resides in the same NUMA domain as the core to which
its thread is bound.  But for the multiple node case, it would seem that the
data for threads bound to cores in the second socket actually resides in
memory attached to the first socket.  The fact that the single node and
multiple node results differ suggests that a different path through the
source code is taken for multiple nodes than for a single node.
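
If that is what is happening, the usual remedy on Linux is first-touch
placement: the pages of an array end up in the NUMA domain of the core that
first writes them, so each bound thread needs to initialize its own slice
of the vector.  The sketch below shows the idea in bare pthreads terms; it
is not how threadcomm actually allocates or initializes its vectors, just
an illustration of the placement rule the multi-node code path would need
to respect:

  /* First-touch sketch: each pinned thread zeroes its own slice so that */
  /* those pages are allocated in that thread's NUMA domain.             */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  typedef struct { double *x; int start,end,core; } Slice;

  static void *touch_slice(void *arg)
  {
    Slice    *s = (Slice*)arg;
    cpu_set_t cpuset;
    int       i;
    /* bind this thread to its core before touching the data            */
    CPU_ZERO(&cpuset);
    CPU_SET(s->core,&cpuset);
    pthread_setaffinity_np(pthread_self(),sizeof(cpu_set_t),&cpuset);
    for (i=s->start; i<s->end; i++) s->x[i] = 0.0;   /* first touch      */
    return NULL;
  }

Spawning one such thread per core over a freshly allocated array, before
anything else writes to it, places each slice next to the thread that will
later operate on it.  If the multi-node path instead lets rank 0 touch the
whole vector first, the second socket's slices land in the first socket's
memory, which is exactly the behavior described above.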

These are my conclusions based on the testing and debugging that I have
done so far.  I've also verified in isolated cases that threadcomm with
OpenMP has the same scaling issues.  Do these conclusions seem reasonable?
Or are there other possible scenarios that could explain my test data?

It would be nice to get this problem fixed so that the threadcomm package
would be more useful.

Thanks,

Dave

--
Dave Nystrom
LANL HPC-5
Phone: 505-667-7913
Email: wdn at lanl.gov
Smail: Mail Stop B272
       Group HPC-5
       Los Alamos National Laboratory
       Los Alamos, NM 87545

-------------- next part --------------
A non-text attachment was scrubbed...
Name: KSPSolve_Time_vs_Node_Count_CPU_Pthread_6400_vulcan.png
Type: image/png
Size: 16037 bytes
Desc: KSPSolve_Time_vs_Node_Count_CPU_Pthread_6400_vulcan.png
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20131112/acc856fe/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: KSPSolve_Time_vs_Node_Count_CPU_Pthread_6400_smog.png
Type: image/png
Size: 14650 bytes
Desc: KSPSolve_Time_vs_Node_Count_CPU_Pthread_6400_smog.png
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20131112/acc856fe/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: petsc_vecnorm_hack.tar
Type: application/x-tar
Size: 61440 bytes
Desc: petsc_vecnorm_hack.tar
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20131112/acc856fe/attachment.tar>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: KSPSolve_Time_vs_Node_Count_CPU_PThread_Orig_vs_Hack_1_6400.png
Type: image/png
Size: 18552 bytes
Desc: KSPSolve_Time_vs_Node_Count_CPU_PThread_Orig_vs_Hack_1_6400.png
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20131112/acc856fe/attachment-0002.png>

