[petsc-dev] Status of threadcomm testing
Karl Rupp
rupp at mcs.anl.gov
Tue Nov 12 15:19:53 CST 2013
Hi Dave,
thanks for the concise summary and for getting a handle on the VecNorm
scaling issue.
> My conclusion from the results presented above is that this is a NUMA
> issue because the scaling is very good on Vulcan where the nodes do
> not have NUMA issues.
This is a reasonable conclusion. Am I correct that you used one MPI rank
per node for all the figures?
> I've also performed a set of strong scaling runs on a NUMA machine using
> 8 threads but without setting thread affinities. These runs scale pretty well
> but are about a factor of 2 slower initially than the case where thread affinities
> are set. These runs are shown in the last attached plot. See the
> curves marked "aff_yes" and "aff_no". On this set of plots, you can also see
> that the two-node result is about the same with or without affinities set. Since
> the diagnostic printf above indicates that the thread affinities are being properly
> set and recognized by the OS, it seems that this final problem is caused by the
> data for threads mapped to the second-socket cores residing in a different NUMA
> domain than those threads when there are two or more nodes.
The factor of two is a strong indicator for a NUMA hiccup, yes.
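Just to spell out the data placement aspect: on Linux, a page is typically placed
in the NUMA domain of the thread that first writes it, so the initialization has
to use the same thread-to-data mapping as the later kernels. A minimal OpenMP
illustration of this first-touch idea (my own sketch, not threadcomm code):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const size_t n = 100000000;            /* large enough to span many pages */
  double *x = malloc(n * sizeof(double));
  double sum = 0.0;
  size_t i;

  /* First touch: each thread writes its own static chunk, so those pages
   * get allocated in that thread's NUMA domain (provided affinities are set). */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; i++) x[i] = 1.0;

  /* Later kernels must use the same static mapping to stay NUMA-local. */
  #pragma omp parallel for schedule(static) reduction(+:sum)
  for (i = 0; i < n; i++) sum += x[i] * x[i];

  printf("||x||^2 = %g\n", sum);
  free(x);
  return 0;
}

If the pages are touched with a different mapping (or with affinities unset), roughly
half of the accesses go across the socket interconnect, which would be consistent
with the factor of two you see.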
> For the single node case, it would seem that the data is properly distributed
> in memory so that it resides in the same NUMA domain as the core to which
> its thread is bound. But for the multiple node case, it would seem that the
> data for threads bound to cores in the second socket actually resides in
> memory attached to the first socket. The difference between the single-node and
> multi-node results suggests that a different path through the source code is taken
> for multiple nodes than for a single node.
Hmm, apparently this requires more debugging then.
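One way to check this directly would be to ask the kernel which NUMA node holds
each page backing the vector array, e.g. via libnuma's move_pages() in query mode
(passing nodes == NULL only reports placement, it does not move anything; link
with -lnuma). Untested sketch:

#include <numaif.h>   /* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Print the NUMA node of every page backing data[0..n-1]. */
static void report_numa_placement(double *data, size_t n)
{
  long    page  = sysconf(_SC_PAGESIZE);
  size_t  np    = (n * sizeof(double) + page - 1) / page;
  void  **pages = malloc(np * sizeof(void *));
  int    *node  = malloc(np * sizeof(int));
  size_t  i;

  for (i = 0; i < np; i++)
    pages[i] = (char *)data + i * page;

  /* nodes == NULL: do not migrate, just report the current node per page. */
  if (move_pages(0, np, pages, NULL, node, 0) == 0) {
    for (i = 0; i < np; i++)
      printf("page %zu -> NUMA node %d\n", i, node[i]);
  }
  free(pages);
  free(node);
}

Calling this on the array obtained from VecGetArray() in the one-node and two-node
runs should show right away whether the pages used by the second-socket threads
actually live on the first socket.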
> These are my conclusions based on the testing and debugging that I have
> done so far. I've also verified in isolated cases that threadcomm with openmp
> has the same scaling issues. Do these conclusions seem reasonable? Or
> are there other possible scenarios that could reproduce my test data?
>
> It would be nice to get this problem fixed so that the threadcomm package
> would be more useful.
Definitely. Since you probably have an isolated test case and hardware
at hand: Do you happen to know whether the same scaling issue shows up
with VecDot() and/or VecTDot()? They are supposed to run through the
same reductions, so this should give us a hint on whether the problem is
VecNorm-specific or applies to reductions in general.
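In case it helps, a rough comparison driver could look like the sketch below
(untested, using the plain Vec interface); running it with your threadcomm options
plus -log_summary should show whether all three reductions degrade the same way:

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscScalar    dot, tdot;
  PetscReal      nrm;
  PetscInt       i, n = 10000000;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = VecSet(y, 2.0);CHKERRQ(ierr);

  /* Repeat the three reductions so the timings in the log are meaningful. */
  for (i = 0; i < 100; i++) {
    ierr = VecNorm(x, NORM_2, &nrm);CHKERRQ(ierr);
    ierr = VecDot(x, y, &dot);CHKERRQ(ierr);
    ierr = VecTDot(x, y, &tdot);CHKERRQ(ierr);
  }

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}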
Thanks and best regards,
Karli