[petsc-dev] Status of threadcomm testing

Tue Nov 12 15:35:47 CST 2013

Yes, I used one MPI rank per node for all the runs.  It would be nice to try
one MPI rank per NUMA domain, but threadcomm does not support that yet.

The scaling issue shows up on all of the plots.  Attached is a gzipped tarball
with all the plots for your viewing pleasure.  The plots are for results on the
Cray XE6.  As you will see, the problem is not limited to reductions.

Dave

________________________________________
From: Karl Rupp [rupp at mcs.anl.gov]
Sent: Tuesday, November 12, 2013 2:19 PM
To: Nystrom, William D; PETSc Dev
Subject: Re: [petsc-dev] Status of threadcomm testing

Hi Dave,

thanks for the concise summary and for getting a handle on the VecNorm
scaling issue.

> My conclusion from the results presented above is that this is a NUMA
> issue because the scaling is very good on Vulcan where the nodes do
> not have NUMA issues.

This is a reasonable conclusion. Am I correct that you used one MPI rank
per node for all the figures?

> I've also performed a set of strong scaling runs on a NUMA machine using
> 8 threads but without setting thread affinities.  These runs scale pretty well
> but are about a factor of 2 slower initially than the case where thread affinities
> are set.  Plots of these runs are shown in the last attached plot.  See the
> curves marked "aff_yes" and "aff_no".  On this set of plots, you can also see
> that the two node result is about the same with or without affinities set.  Since
> it appears from the results of using the diagnostic printf above that the thread
> affinities are being properly set and recognized by the OS, it seems that this
> final problem is the result of the data being located in a different NUMA domain
> from that of the thread for the threads that are mapped to the second socket
> cores when there are two or more nodes.

The factor of two is a strong indicator for a NUMA hiccup, yes.

> For the single node case, it would seem that the data is properly distributed
> in memory so that it resides in the same NUMA domain as the core to which
> its thread is bound.  But for the multiple node case, it would seem that the
> data for threads bound to cores in the second socket actually resides in
> memory attached to the first socket.  That the performance result is different
> for a single node and multiple nodes would suggest that a different path
> through the source code is taken for multiple nodes than for a single node.

Hmm, apparently this requires more debugging then.
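
For reference, the placement described above is what a first-touch page
policy would produce: a page ends up in the NUMA domain of the thread that
first writes to it.  Below is a minimal sketch of that idea in plain C with
OpenMP pragmas.  It is not threadcomm code; the function names are made up
for illustration, and it assumes a first-touch policy, threads pinned to
cores, and the same static schedule for initialization and compute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles and let each thread write the
 * chunk it will later read.  Under a first-touch policy (assumed here),
 * the pages of each chunk are then placed in the NUMA domain of the
 * writing thread.  Without OpenMP the pragma is ignored and the loop
 * simply runs serially. */
double *alloc_first_touch(size_t n, double value)
{
    assert(n > 0);
    double *x = malloc(n * sizeof *x);  /* large blocks are typically not
                                           backed by physical pages yet */
    assert(x != NULL);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        x[i] = value;                   /* first write decides placement */
    return x;
}

/* Sum of squares with the same static schedule, so that every thread
 * reads the chunk it initialized above. */
double sum_of_squares(const double *x, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * x[i];
    return sum;
}
```

If the vector were instead initialized by a single thread, all of its pages
would land in that thread's domain, and threads on the other socket would
read remote memory, which is the pattern suspected for the multi-node runs.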

> These are my conclusions based on the testing and debugging that I have
> done so far.  I've also verified in isolated cases that threadcomm with OpenMP
> has the same scaling issues.  Do these conclusions seem reasonable?  Or
> are there other possible scenarios that could reproduce my test data?
>
> It would be nice to get this problem fixed so that the threadcomm package
> would be more useful.

Definitely. Since you probably have an isolated test case and hardware
at hand: Do you happen to know whether the same scaling issue shows up
with VecDot() and/or VecTDot()? They are supposed to run through the
same reductions, so this should give us a hint on whether the problem is
VecNorm-specific or applies to reductions in general.
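
To spell out why these kernels should behave alike: each one forms
per-thread partial sums over contiguous chunks and then combines them, and
the squared 2-norm is just the dot product of a vector with itself.  The
sketch below shows that structure in plain C.  It is an illustration only,
not PETSc's implementation; the function names and the chunking are
assumptions, and the chunks are processed serially here.

```c
#include <assert.h>
#include <stddef.h>

/* Partial dot product over the index range [begin, end). */
static double partial_dot(const double *x, const double *y,
                          size_t begin, size_t end)
{
    double s = 0.0;
    for (size_t i = begin; i < end; i++)
        s += x[i] * y[i];
    return s;
}

/* Dot product split into nparts contiguous chunks and then combined.
 * In a threaded run each chunk would belong to one thread. */
double chunked_dot(const double *x, const double *y, size_t n, size_t nparts)
{
    assert(nparts > 0);
    double total = 0.0;
    for (size_t p = 0; p < nparts; p++) {
        size_t begin = n * p / nparts;
        size_t end = n * (p + 1) / nparts;
        total += partial_dot(x, y, begin, end);  /* combine step */
    }
    return total;
}

/* The squared 2-norm is the same reduction with y == x; the norm itself
 * is the square root of this value. */
double chunked_norm2_squared(const double *x, size_t n, size_t nparts)
{
    return chunked_dot(x, x, n, nparts);
}
```

If a dot product shows the same slowdown as the norm, the cause is more
likely in the shared chunk-and-combine path or in data placement than in
anything norm-specific.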

Thanks and best regards,
Karli

-------------- next part --------------
A non-text attachment was scrubbed...
Name: all_plots.tar.gz
Type: application/x-gzip
Size: 81472 bytes
Desc: all_plots.tar.gz
URL: <http://lists.mcs.anl.gov/pipermail/petsc-dev/attachments/20131112/5d2daa7f/attachment.gz>