[petsc-users] profiling question

Tue Sep 21 04:36:16 CDT 2010

On Tue, Sep 21, 2010 at 10:41, Leo van Kampenhout
<lvankampenhout at gmail.com> wrote:
> At the cluster I am currently working on, each node is made up of 12 PEs
> with shared memory. If I reserve just 1 PE for my job, the other 11
> processors are given to other users, which puts a dynamic load on the
> memory system and results in inaccurate timings. The solve times I get
> range between 1 and 5 minutes. For me, this is not very scientific either.

First, it is very difficult to draw meaningful intra-node scalability
conclusions on shared-memory and especially NUMA architectures.
If at all possible, try to compare inter-node scalability instead
since it is a far more reliable estimate and less
architecture-dependent (provided the network is decent).  That said,
you should be looking for reproducibility much more than "good"
scaling.  It's well known that intra-node memory contention is a major
issue: the STREAM benchmarks actually show _lower_ total bandwidth
when running on all 6 cores per socket with Istanbul than when using
only 4 (and 2 cores is within a few percent).
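
As a rough illustration of checking reproducibility before reading
anything into scaling, here is a minimal Python sketch that summarizes
repeated timings of the same solve.  The function name and the sample
timings are hypothetical; the numbers only mimic the 1 to 5 minute range
reported above and are not measurements.

```python
import statistics


def summarize_timings(seconds):
    """Summarize repeated wall-clock timings of one and the same solve."""
    if not seconds:
        raise ValueError("need at least one timing")
    fastest = min(seconds)
    slowest = max(seconds)
    return {
        "min": fastest,
        "median": statistics.median(seconds),
        "max": slowest,
        # Relative spread: 0.0 would mean perfectly reproducible timings.
        "rel_spread": (slowest - fastest) / fastest,
    }


# Hypothetical timings in seconds (assumed, not measured), chosen to span
# the 1 to 5 minute range seen on the shared node.
shared_node = summarize_timings([60.0, 140.0, 300.0, 95.0, 210.0])
print(shared_node)
```

A relative spread this large says the configuration is not reproducible,
so any speedup computed from a single run of it would mostly reflect what
the other users happened to be doing.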

> The second idea was to reserve all 12 PEs on the node and just let 1 PE run
> the job. However, in this way the single CPU gets all the memory bandwidth
> and has no waiting time, therefore giving very fast results. When I
> calculate speedup from this baseline, the algorithm does not scale very well.

I say just do this and live with the poor intra-node scaling numbers.
Some architectures actually scale memory within the node (e.g.
BlueGene), but most don't.  People expect to see the memory bottleneck
in these results; it's nothing to be ashamed of.
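
To make the arithmetic concrete, here is a small Python sketch of
speedup and parallel efficiency relative to a chosen baseline.  The
function name, the base_procs parameter, and all timings are
hypothetical; the numbers are invented for illustration only.

```python
def scaling_table(timings, base_procs=None):
    """Return (procs, speedup, efficiency) rows relative to a baseline run.

    timings maps process count -> wall-clock seconds.  If base_procs is
    None, the smallest process count is used as the baseline.
    """
    if base_procs is None:
        base_procs = min(timings)
    base_time = timings[base_procs]
    rows = []
    for procs in sorted(timings):
        speedup = base_time / timings[procs]
        # Ideal speedup relative to the baseline is procs / base_procs.
        efficiency = speedup * base_procs / procs
        rows.append((procs, speedup, efficiency))
    return rows


# Hypothetical intra-node timings (seconds) with the whole node reserved.
intra_node = scaling_table({1: 100.0, 6: 40.0, 12: 30.0})

# Hypothetical inter-node timings, using one full node (12 PEs) as baseline.
inter_node = scaling_table({12: 30.0, 24: 16.0, 48: 9.0}, base_procs=12)

for row in intra_node + inter_node:
    print("procs=%d speedup=%.2f efficiency=%.2f" % row)
```

Taking a full node as the baseline keeps the memory bottleneck out of
the inter-node numbers, while the intra-node table shows it openly.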

> Another idea would be to spawn 12 identical jobs on 12 PEs and take the
> average runtime. Unfortunately, there is only one PETSC_COMM_WORLD, so I
> think this is impossible to do from within one program (MPI_COMM_WORLD).

You could split MPI_COMM_WORLD and run a separate PETSC_COMM_WORLD on
each group, but I think this option will not be reproducible (the
instances will be slightly out of sync, so memory and communication
bottlenecks will be loaded in different ways on subsequent runs) and
is a bit disingenuous because this is not a configuration that you
would ever run in practice.
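
For completeness, the split could look roughly like the sketch below.
In C you would set PETSC_COMM_WORLD to the sub-communicator after
MPI_Init and before PetscInitialize.  The sketch uses Python with
mpi4py and petsc4py instead, on the assumption that both are installed
and that petsc4py.init accepts the sub-communicator through its comm
argument.  The function names and the ranks_per_instance parameter are
hypothetical, and the solve itself is left as a placeholder.

```python
import sys


def instance_color(rank, ranks_per_instance):
    """Map a world rank to the index of the instance it belongs to."""
    if ranks_per_instance < 1:
        raise ValueError("ranks_per_instance must be at least 1")
    return rank // ranks_per_instance


def run_independent_instances(ranks_per_instance=1):
    """Run one PETSc instance per group of ranks and collect the timings."""
    # Imported lazily so the helper above works without MPI installed.
    from mpi4py import MPI
    import petsc4py

    world = MPI.COMM_WORLD
    rank = world.Get_rank()
    # Ranks that share a color end up in the same sub-communicator.
    subcomm = world.Split(instance_color(rank, ranks_per_instance), rank)

    # Assumption: comm= makes subcomm this process's PETSC_COMM_WORLD.
    petsc4py.init(sys.argv, comm=subcomm)
    from petsc4py import PETSc  # noqa: F401  (needed by the real solve)

    start = MPI.Wtime()
    # Placeholder: set up and solve the problem on PETSc.COMM_WORLD here.
    elapsed = MPI.Wtime() - start

    # World rank 0 receives one timing per rank; other ranks receive None.
    return world.gather(elapsed, root=0)
```

Calling run_independent_instances() at the bottom of a script launched
with 12 MPI processes would give 12 single-process instances on the
node, with the caveats about reproducibility mentioned above.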

Jed