[petsc-users] Question about ksp ex3.c

Matija Kecman matijakecman at gmail.com
Thu Sep 29 09:06:58 CDT 2011


Thanks for your useful replies.

>> The bandwidth tops out between 2 and 4 cores (the 5345 should have 10.6 GB/s,
>> but you should run streams as Barry says to see what is achievable).

I will try to get the STREAM benchmark running on this machine.
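
As a rough sanity check while I set that up, below is a minimal single-threaded
sketch of the STREAM "triad" kernel. It only illustrates what the benchmark
measures and is not a substitute for the real STREAM code, which also times
Copy/Scale/Add and should be run with one copy per core (e.g. via MPI or
OpenMP) to see where the aggregate bandwidth saturates. The array size and
timing approach here are just placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N       20000000   /* three double arrays of 20M entries, ~480 MB total, well beyond cache */
#define NTIMES  10

int main(void)
{
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  double best = 1e30;

  for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  for (int k = 0; k < NTIMES; k++) {
    struct timespec t0, t1;                          /* POSIX clock_gettime for wall time */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];   /* triad: two loads + one store per entry */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    if (sec < best) best = sec;
  }
  /* 3 arrays * 8 bytes per entry move through memory each sweep */
  printf("triad best rate: %.1f MB/s (a[0] = %g)\n",
         3.0 * N * sizeof(double) / best / 1e6, a[0]);
  free(a); free(b); free(c);
  return 0;
}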

> Note that your naive 1D partition is getting worse as you add processes. The
> MatMult should scale out somewhat better if you use a 2D decomposition, as
> is done by any of the examples using a DA (DMDA in 3.2) for grid management.

I am able to recompute the problem using a 2D decomposition and will look at
the -log_summary output for those tests as well.
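
For the 2D decomposition I am setting up the grid and matrix through a DMDA,
roughly along the lines of the sketch below. This is only a sketch: the grid
size (129x129) is a placeholder, and it uses the argument lists of recent
PETSc releases, so the boundary-type enums and the matrix-creation call are
named slightly differently in the 3.2 series.

#include <petscdmda.h>
#include <petscksp.h>

int main(int argc, char **argv)
{
  DM             da;
  Mat            A;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* 129x129 structured grid, 5-point stencil; PETSC_DECIDE lets PETSc pick a
     2D process grid instead of the 1D strip partition used in ksp ex3.c. */
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 129, 129, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, &da);CHKERRQ(ierr);
  ierr = DMSetFromOptions(da);CHKERRQ(ierr);
  ierr = DMSetUp(da);CHKERRQ(ierr);
  ierr = DMCreateMatrix(da, &A);CHKERRQ(ierr);   /* AIJ matrix preallocated with the DMDA's layout */
  /* ... assemble A and the right-hand side, then solve with a KSP as before ... */
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}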

Best,

Matija

On Thu, Sep 29, 2011 at 2:09 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> On Thu, Sep 29, 2011 at 07:44, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> The way I read these numbers is that there is bandwidth for about 3 cores on
>> this machine, and a non-negligible synchronization penalty:
>>                1 proc   2 proc   4 proc   8 proc
>> VecAXPBYCZ        496      857     1064     1070
>> VecDot            451      724     1089      736
>> MatMult           434      638      701      703
>
> Matt, thanks for pulling out this summary. The synchronization for the dot
> product is clearly expensive here. I'm surprised it's so significant on such
> a small problem, but it is a common problem for scalability.
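
[For context: the synchronization cost here comes from the global reduction
that every process must wait on before a dot product can return. Roughly, the
collective behind a parallel dot product looks like the sketch below; the
function name is only for illustration.]

#include <mpi.h>

/* Sketch only: the reduction pattern behind a parallel dot product. Each
   process sums its local part, then MPI_Allreduce combines the partial sums.
   Every process blocks until the slowest one reaches the Allreduce, which is
   the synchronization cost that local updates like AXPBYCZ never pay. */
static double dot_sketch(const double *x, const double *y, int nlocal, MPI_Comm comm)
{
  double local = 0.0, global = 0.0;
  for (int i = 0; i < nlocal; i++) local += x[i] * y[i];
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return global;
}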
> I think that MPI is putting one process per socket when you go from 1 to 2
> processes. This gives you pretty good speedup despite the fact that memory
> traffic for both sockets is routed through the "Blackford" chipset. Intel
> fixed this in later generations by dropping uniform memory access and giving
> each socket its own independent memory banks (which AMD already had at
> the time these chips came out).
> Barring odd hardware issues (such as not having enough memory streams) and
> imbalances, one process per socket can saturate the memory bandwidth, so the
> speed-up you get from 2 procs (1 per socket) to 4 procs (2 per socket) is
> minimal. On this architecture, you typically see the STREAM bandwidth go
> down slightly as you add more than 2 procs per socket, so it's no surprise
> that it doesn't help here.
> Note that your naive 1D partition is getting worse as you add processes. The
> MatMult should scale out somewhat better if you use a 2D decomposition, as
> is done by any of the examples using a DA (DMDA in 3.2) for grid management.
>
>>
>> The bandwidth tops out between 2 and 4 cores (the 5345 should have 10.6 GB/s,
>> but you should run streams as Barry says to see what is achievable). There is
>> obviously a penalty for VecDot against VecAXPBYCZ, which is the sync penalty
>> that also seems to affect MatMult. Maybe Jed can explain that.
>

