[petsc-users] Question about ksp ex3.c
Matija Kecman
matijakecman at gmail.com
Thu Sep 29 09:06:58 CDT 2011
Thanks for your useful replies.
>> The bandwidth tops out between 2 and 4 cores (the 5345 should have 10.6
>> GB/s, but you should run streams as Barry says to see what is achievable).

I will try to get the STREAM benchmark running on this machine.
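In the meantime, here is the quick triad-style kernel I am using to
sanity-check the achievable bandwidth (a minimal sketch, not the official
STREAM code; the array size and repeat count are arbitrary choices):

/* Minimal STREAM-triad-style bandwidth probe (not the official STREAM
 * benchmark; array size and repeat count are arbitrary). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (20 * 1000 * 1000) /* three double arrays, ~480 MB total */
#define NTIMES 10

int main(void)
{
  double         *a, *b, *c;
  double          scalar = 3.0, best = 1e30, t;
  struct timespec t0, t1;
  long            i;
  int             k;

  a = malloc(N * sizeof(double));
  b = malloc(N * sizeof(double));
  c = malloc(N * sizeof(double));
  if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
  for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

  for (k = 0; k < NTIMES; k++) {
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++) a[i] = b[i] + scalar * c[i]; /* triad */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    t = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    if (t < best) best = t;
  }
  /* the triad touches three arrays per element: two loads and one store */
  printf("best triad rate: %.1f MB/s\n",
         3.0 * N * sizeof(double) / best / 1e6);
  free(a); free(b); free(c);
  return 0;
}

(Compiled with gcc -O2; older glibc may need -lrt for clock_gettime. I will
compare its best rate against the real STREAM numbers.)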
> Note that your naive 1D partition is getting worse as you add processes. The
> MatMult should scale out somewhat better if you use a 2D decomposition, as
> is done by any of the examples using a DA (DMDA in 3.2) for grid management.
I will rerun the problem using a 2D decomposition and also look at the
-log_summary output for those runs.
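For reference, this is roughly the setup I have in mind (a sketch against
the PETSc 3.2 DMDA interface; the global grid size and stencil choice are
placeholders, not the actual ex3.c problem):

/* Sketch of a 2D domain decomposition with DMDA (PETSc 3.2 interface).
 * Grid size and stencil width are placeholder choices. */
#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             da;
  Mat            A;
  PetscInt       xs, ys, xm, ym;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);
  /* 512x512 global grid, star stencil of width 1; PETSC_DECIDE lets
     PETSc choose the 2D process grid */
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DMDA_BOUNDARY_NONE,
                      DMDA_BOUNDARY_NONE, DMDA_STENCIL_STAR, 512, 512,
                      PETSC_DECIDE, PETSC_DECIDE, 1, 1,
                      PETSC_NULL, PETSC_NULL, &da);CHKERRQ(ierr);
  /* each process owns an xm-by-ym patch starting at (xs, ys) */
  ierr = DMDAGetCorners(da, &xs, &ys, PETSC_NULL, &xm, &ym,
                        PETSC_NULL);CHKERRQ(ierr);
  /* matrix preallocated to match the stencil and the 2D partition */
  ierr = DMGetMatrix(da, MATAIJ, &A);CHKERRQ(ierr);
  /* ... assemble the operator and solve as in ex3.c ... */
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

With PETSC_DECIDE the process grid comes out close to square, so each
process exchanges ghost values with at most four neighbours instead of
owning full rows of the grid as in the 1D partition.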
Best,
Matija
On Thu, Sep 29, 2011 at 2:09 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:
> On Thu, Sep 29, 2011 at 07:44, Matthew Knepley <knepley at gmail.com> wrote:
>>
>> The way I read these numbers is that there is bandwidth for about 3
>> cores on this machine, and a non-negligible synchronization penalty:
>>
>>              1 proc  2 proc  4 proc  8 proc
>> VecAXPBYCZ      496     857    1064    1070
>> VecDot          451     724    1089     736
>> MatMult         434     638     701     703
>
> Matt, thanks for pulling out this summary. The synchronization for the dot
> product is clearly expensive here. I'm surprised it's so significant on such
> a small problem, but it is a common problem for scalability.
> I think that MPI is putting one process per socket when you go from 1 to 2
> processes. This gives you pretty good speedup despite the fact that memory
> traffic for both sockets is routed through the "Blackford" chipset. Intel
> fixed this in later generations by abandoning uniform memory access in
> favor of independent memory banks for each socket (which AMD already had
> when these chips came out).
> Barring odd hardware issues (such as not having enough memory streams, or
> imbalances), one process per socket can saturate the memory bandwidth, so
> the speed-up you get from 2 procs (1 per socket) to 4 procs (2 per socket)
> is minimal. On this architecture, you typically see the STREAM bandwidth go
> down slightly as you add more than 2 procs per socket, so it's no surprise
> that it doesn't help here.
> Note that your naive 1D partition is getting worse as you add processes. The
> MatMult should scale out somewhat better if you use a 2D decomposition, as
> is done by any of the examples using a DA (DMDA in 3.2) for grid management.
>
>>
>> The bandwidth tops out between 2 and 4 cores (the 5345 should have 10.6
>> GB/s, but you should run streams as Barry says to see what is achievable).
>> There is obviously a penalty for VecDot relative to VecAXPBYCZ, which is
>> the synchronization penalty; it also seems to affect MatMult. Maybe Jed
>> can explain that.
>