[petsc-dev] Status of threadcomm testing
Karl Rupp
rupp at mcs.anl.gov
Tue Nov 12 15:19:53 CST 2013
Hi Dave,
thanks for the concise summary and for getting a handle on the VecNorm
scaling issue.
> My conclusion from the results presented above is that this is a NUMA
> issue because the scaling is very good on Vulcan where the nodes do
> not have NUMA issues.
This is a reasonable conclusion. Am I correct that you used one MPI rank
per node for all the figures?
> I've also performed a set of strong scaling runs on a NUMA machine using
> 8 threads but without setting thread affinities. These runs scale pretty well
> but are about a factor of 2 slower initially than the case where thread affinities
> are set. These runs are shown in the last attached plot. See the
> curves marked "aff_yes" and "aff_no". On this set of plots, you can also see
> that the two-node result is about the same with or without affinities set. Since
> the diagnostic printf above indicates that the thread affinities are being properly
> set and recognized by the OS, it seems that this final problem is caused by the
> data for threads mapped to the second-socket cores residing in a different NUMA
> domain than those threads when there are two or more nodes.
The factor of two is a strong indicator for a NUMA hiccup, yes.
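Just to spell out the data placement aspect: on Linux, a page is typically placed
in the NUMA domain of the thread that first writes it, so the initialization has
to use the same thread-to-data mapping as the later kernels. A minimal OpenMP
illustration of this first-touch idea (my own sketch, not threadcomm code):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const size_t n = 100000000;            /* large enough to span many pages */
  double *x = malloc(n * sizeof(double));
  double sum = 0.0;
  size_t i;

  /* First touch: each thread writes its own static chunk, so those pages
   * get allocated in that thread's NUMA domain (provided affinities are set). */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; i++) x[i] = 1.0;

  /* Later kernels must use the same static mapping to stay NUMA-local. */
  #pragma omp parallel for schedule(static) reduction(+:sum)
  for (i = 0; i < n; i++) sum += x[i] * x[i];

  printf("||x||^2 = %g\n", sum);
  free(x);
  return 0;
}

If the pages are touched with a different mapping (or with affinities unset), roughly
half of the accesses go across the socket interconnect, which would be consistent
with the factor of two you see.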
> For the single node case, it would seem that the data is properly distributed
> in memory so that it resides in the same NUMA domain as the core to which
> its thread is bound. But for the multiple node case, it would seem that the
> data for threads bound to cores in the second socket actually resides in
> memory attached to the first socket. The difference between the single-node and
> multi-node results suggests that a different path through the source code is taken
> for multiple nodes than for a single node.
Hmm, apparently this requires more debugging then.
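One way to check this directly would be to ask the kernel which NUMA node holds
each page backing the vector array, e.g. via libnuma's move_pages() in query mode
(passing nodes == NULL only reports placement, it does not move anything; link
with -lnuma). Untested sketch:

#include <numaif.h>   /* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Print the NUMA node of every page backing data[0..n-1]. */
static void report_numa_placement(double *data, size_t n)
{
  long    page  = sysconf(_SC_PAGESIZE);
  size_t  np    = (n * sizeof(double) + page - 1) / page;
  void  **pages = malloc(np * sizeof(void *));
  int    *node  = malloc(np * sizeof(int));
  size_t  i;

  for (i = 0; i < np; i++)
    pages[i] = (char *)data + i * page;

  /* nodes == NULL: do not migrate, just report the current node per page. */
  if (move_pages(0, np, pages, NULL, node, 0) == 0) {
    for (i = 0; i < np; i++)
      printf("page %zu -> NUMA node %d\n", i, node[i]);
  }
  free(pages);
  free(node);
}

Calling this on the array obtained from VecGetArray() in the one-node and two-node
runs should show right away whether the pages used by the second-socket threads
actually live on the first socket.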
> These are my conclusions based on the testing and debugging that I have
> done so far. I've also verified in isolated cases that threadcomm with openmp
> has the same scaling issues. Do these conclusions seem reasonable? Or
> are there other possible scenarios that could reproduce my test data?
>
> It would be nice to get this problem fixed so that the threadcomm package
> would be more useful.
Definitely. Since you probably have an isolated test case and hardware
at hand: Do you happen to know whether the same scaling issue shows up
with VecDot() and/or VecTDot()? They are supposed to run through the
same reductions, so this should give us a hint on whether the problem is
VecNorm-specific or applies to reductions in general.
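In case it helps, a rough comparison driver could look like the sketch below
(untested, using the plain Vec interface); running it with your threadcomm options
plus -log_summary should show whether all three reductions degrade the same way:

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  PetscScalar    dot, tdot;
  PetscReal      nrm;
  PetscInt       i, n = 10000000;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = VecSet(y, 2.0);CHKERRQ(ierr);

  /* Repeat the three reductions so the timings in the log are meaningful. */
  for (i = 0; i < 100; i++) {
    ierr = VecNorm(x, NORM_2, &nrm);CHKERRQ(ierr);
    ierr = VecDot(x, y, &dot);CHKERRQ(ierr);
    ierr = VecTDot(x, y, &tdot);CHKERRQ(ierr);
  }

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}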
Thanks and best regards,
Karli