[petsc-users] profiling question

Leo van Kampenhout lvankampenhout at gmail.com
Tue Sep 21 06:49:42 CDT 2010


Thanks for the helpful response, Jed. I was not aware of the possibility of
running separate PETSC_COMM_WORLDs in the same program; at least, this is not
clear from the documentation (e.g.
http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-dev/docs/manualpages/Sys/PetscInitialize.html).
I'll probably still try this out, just out of curiosity.
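If I do try it, I imagine it would look roughly like the untested sketch
below: split MPI_COMM_WORLD so that every rank gets its own sub-communicator,
and assign that to PETSC_COMM_WORLD before calling PetscInitialize(). (Please
correct me if this is not the intended usage.)

/* sketch.c: run N independent single-rank PETSc instances in one MPI job.
   Untested sketch, not code I have actually run. */
#include <petscsys.h>

int main(int argc, char **argv)
{
  MPI_Comm self;
  int      rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* give every rank its own communicator (color = rank) */
  MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &self);

  /* PETSC_COMM_WORLD must be set *before* PetscInitialize() */
  PETSC_COMM_WORLD = self;
  PetscInitialize(&argc, &argv, NULL, NULL);

  /* ... set up and solve the system on this single-rank instance,
         timing it with -log_summary as usual ... */

  PetscFinalize();
  MPI_Comm_free(&self);
  MPI_Finalize();
  return 0;
}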

As for presenting scaling results, the most appealing option to me seems to
be showing two graphs: one with intra-node scaling (1-12 PEs) and the other
continuing upwards from there (12, 24, 36, ...).

Leo





2010/9/21 Jed Brown <jed at 59a2.org>

> On Tue, Sep 21, 2010 at 10:41, Leo van Kampenhout
> <lvankampenhout at gmail.com> wrote:
> > At the cluster I am currently working on, each node is made up of 12
> > PEs and has shared memory. If I reserve just 1 PE for my job, the
> > other 11 processors are given to other users, which puts a dynamic
> > load on the memory system and results in inaccurate timings. The
> > solve times I get range between 1 and 5 minutes. For me, this is not
> > very scientific either.
>
> First, shared memory and especially NUMA architectures are very
> difficult to draw meaningful intra-node scalability conclusions on.
> If at all possible, try to compare inter-node scalability instead
> since it is a far more reliable estimate and less
> architecture-dependent (provided the network is decent).  That said,
> you should be looking for reproducibility much more than "good"
> scaling.  It's well known that intra-node memory contention is a major
> issue: the STREAM benchmarks actually show _lower_ total bandwidth
> when running on all 6 cores per socket with Istanbul than when using
> only 4 (and 2 cores is within a few percent).
>
> > The second idea was to reserve all 12 PEs on the node and let just 1
> > PE run the job. However, in this way the single CPU gets all the
> > memory bandwidth and has no waiting time, therefore giving very fast
> > results. When I then calculate speedup from this baseline, the
> > algorithm does not scale very well.
>
> I say just do this and live with the poor intra-node scaling numbers.
> Some architectures actually scale memory within the node (e.g.
> BlueGene), but most don't.  People expect to see the memory bottleneck
> in these results; it's nothing to be ashamed of.
>
> > Another idea would be to spawn 12 identical jobs on 12 PEs and take the
> > average runtime. Unfortunately, there is only one PETSC_COMM_WORLD, so I
> > think this is impossible to do from within one program (MPI_COMM_WORLD).
>
> You could split MPI_COMM_WORLD and run a separate PETSC_COMM_WORLD on
> each group, but I think this option will not be reproducible (the
> instances will be slightly out of sync, so memory and communication
> bottlenecks will be loaded in different ways on subsequent runs) and
> is a bit disingenuous because this is not a configuration that you
> would ever run in practice.
>
> Jed
>