[petsc-users] About parallel performance
Matthew Knepley
knepley at gmail.com
Thu May 29 17:45:34 CDT 2014
On Thu, May 29, 2014 at 5:40 PM, Qin Lu <lu_qin_2000 at yahoo.com> wrote:
> Is this determined by how the machine was built (about which I can do
> nothing), or by how the MPI/message-passing is configured on the cluster
> (which I can ask the IT people to modify)? This machine is actually a node
> of a Linux cluster.
>
It is determined by how the machine was built. Your best bet for
scalability is to use one process per node.
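For example (a sketch only; the exact flag depends on the MPI launcher,
this assumes MPICH's Hydra mpiexec, and ./app and the rank count are
placeholders):

  mpiexec -ppn 1 -n 4 ./app
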
Thanks,
Matt
>
> Thanks,
> Qin
>
> From: Matthew Knepley <knepley at gmail.com>
> To: Qin Lu <lu_qin_2000 at yahoo.com>
> Cc: Barry Smith <bsmith at mcs.anl.gov>; petsc-users <petsc-users at mcs.anl.gov>
> Sent: Thursday, May 29, 2014 5:27 PM
> Subject: Re: [petsc-users] About parallel performance
>
> On Thu, May 29, 2014 at 5:15 PM, Qin Lu <lu_qin_2000 at yahoo.com> wrote:
>
> Barry,
>
> How did you read the test results? For a machine well suited to
> parallelism, should the numbers for np=2 be about half of those for np=1?
>
>
> Ideally, the numbers should be about twice as big for np = 2.
>
>
>
> The machine has very new Intel chips and is very fast for serial runs. What
> may cause the bad parallelism - the configuration of the machine, or an MPI
> library (MPICH2) that was not built correctly?
>
>
> The cause is machine architecture. The memory bandwidth is only sufficient
> for one core.
>
> Thanks,
>
> Matt
>
>
>
>
> Many thanks,
> Qin
>
> ----- Original Message -----
> From: Barry Smith <bsmith at mcs.anl.gov>
> To: Qin Lu <lu_qin_2000 at yahoo.com>; petsc-users <petsc-users at mcs.anl.gov>
> Cc:
> Sent: Thursday, May 29, 2014 4:54 PM
> Subject: Re: [petsc-users] About parallel performance
>
>
> In that PETSc version BasicVersion is actually the MPI streams benchmark
> so you ran the right thing. Your machine is totally worthless for sparse
> linear algebra parallelism. The entire memory bandwidth is used by the
> first core so adding the second core to the computation gives you no
> improvement at all in the streams benchmark.
>
> But the single core memory bandwidth is pretty good so for problems that
> don’t need parallelism you should get good performance.
>
> Barry
>
>
>
>
> On May 29, 2014, at 4:37 PM, Qin Lu <lu_qin_2000 at yahoo.com> wrote:
>
> > Barry,
> >
> > I have PETSc-3.4.2 and I didn't see MPIVersion there; do you mean
> BasicVersion? I built and ran it (if you did mean MPIVersion, I will get
> PETSc-3.4 later):
> >
> > =================
> > [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 1 ./BasicVersion
> > Number of MPI processes 1
> > Function Rate (MB/s)
> > Copy: 21682.9932
> > Scale: 21637.5509
> > Add: 21583.0395
> > Triad: 21504.6563
> > [/petsc-3.4.2-64bit/src/benchmarks/streams]$ mpiexec -n 2 ./BasicVersion
> > Number of MPI processes 2
> > Function Rate (MB/s)
> > Copy: 21369.6976
> > Scale: 21632.3203
> > Add: 22203.7107
> > Triad: 22305.1841
> > =======================
> >
> > Thanks a lot,
> > Qin
> >
> > From: Barry Smith <bsmith at mcs.anl.gov>
> > To: Qin Lu <lu_qin_2000 at yahoo.com>
> > Cc: "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>
> > Sent: Thursday, May 29, 2014 4:17 PM
> > Subject: Re: [petsc-users] About parallel performance
> >
> >
> >
> > You need to run the streams benchmark at one and two processes to
> see how the memory bandwidth changes. If you are using petsc-3.4 you can
> >
> > cd src/benchmarks/streams/
> >
> > make MPIVersion
> >
> > mpiexec -n 1 ./MPIVersion
> >
> > mpiexec -n 2 ./MPIVersion
> >
> > and send all the results
> >
> > Barry
> >
> >
> >
> > On May 29, 2014, at 4:06 PM, Qin Lu <lu_qin_2000 at yahoo.com> wrote:
> >
> >> For now I only care about the CPU time of the PETSc subroutines. I tried
> to add PetscLogEventBegin/End and the results are consistent with the
> log_summary attached in my first email.
> >>
> >> The CPU time of MatSetValues and MatAssemblyBegin/End is small (< 20 sec)
> in both the p1 and p2 runs. The CPU time of PCSetUp/PCApply is about the same
> between p1 and p2 (~120 sec). The CPU time of KSPSolve in p2 (143 sec) is a
> little lower than in p1 (176 sec), but p2 spent more time in MatGetSubMatrice
> (43 sec). So the total CPU time of the PETSc subroutines is about the same
> between p1 and p2 (502 sec vs. 488 sec).
> >>
> >> It seems I need a more efficient parallel preconditioner. Do you have
> any suggestions for that?
> >>
> >> Many thanks,
> >> Qin
> >>
> >> ----- Original Message -----
> >> From: Barry Smith <bsmith at mcs.anl.gov>
> >> To: Qin Lu <lu_qin_2000 at yahoo.com>
> >> Cc: "petsc-users at mcs.anl.gov" <petsc-users at mcs.anl.gov>
> >> Sent: Thursday, May 29, 2014 2:12 PM
> >> Subject: Re: [petsc-users] About parallel performance
> >>
> >>
> >> You need to determine where the other 80% of the time is. My guess is
> it is in setting the values into the matrix each time. Use
> PetscLogEventRegister() and put a PetscLogEventBegin/End() around the code
> that computes all the entries in the matrix and calls MatSetValues() and
> MatAssemblyBegin/End().
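> >>
> >>    A minimal sketch of that instrumentation (the event name and the
> >> matrix variable A are illustrative, not taken from the actual code):
> >>
> >>    PetscLogEvent MAT_FILL;   /* hypothetical event name */
> >>    PetscLogEventRegister("MatrixFill", MAT_CLASSID, &MAT_FILL);
> >>    PetscLogEventBegin(MAT_FILL, 0, 0, 0, 0);
> >>    /* ... compute the entries and call MatSetValues(A, ...) ... */
> >>    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
> >>    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
> >>    PetscLogEventEnd(MAT_FILL, 0, 0, 0, 0);
> >>
> >>    The registered event then shows up as its own line in the -log_summary
> >> output.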
> >>
> >> Likely the reason the linear solver does not scale better is that
> you have a machine with multiple cores that share the same memory bandwidth,
> and the first core is already using well over half of that bandwidth, so the
> second core cannot be fully utilized since both cores have to wait for data
> to arrive from memory. If you are using the development version of PETSc you
> can run "make streams NPMAX=2" from the PETSc root directory and send us the
> output to confirm this.
> >>
> >> Barry
> >>
> >>
> >>
> >>
> >>
> >> On May 29, 2014, at 1:23 PM, Qin Lu <lu_qin_2000 at yahoo.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I implemented the PETSc parallel linear solver in a program; the
> implementation is basically the same as
> /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated the MatMPIAIJ
> and let PETSc partition the matrix through MatGetOwnershipRange. However, a
> few tests show the parallel solver is always a little slower than the serial
> solver (I have excluded the matrix generation CPU time).
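> >>>
> >>> (For reference, a minimal sketch of that kind of setup; the sizes and the
> >>> nonzero estimates are illustrative, not the actual values from my code:)
> >>>
> >>>    Mat      A;
> >>>    PetscInt Istart, Iend, N = 200000;   /* global size, illustrative */
> >>>    MatCreate(PETSC_COMM_WORLD, &A);
> >>>    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
> >>>    MatSetFromOptions(A);
> >>>    MatSeqAIJSetPreallocation(A, 7, NULL);           /* guessed row nonzeros */
> >>>    MatMPIAIJSetPreallocation(A, 7, NULL, 7, NULL);
> >>>    MatGetOwnershipRange(A, &Istart, &Iend);
> >>>    /* fill rows Istart..Iend-1 with MatSetValues(), then MatAssemblyBegin/End */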
> >>>
> >>> For the serial run I used PCILU as the preconditioner; for the parallel
> run, I used ASM with ILU(0) on each subblock (-sub_pc_type ilu -sub_ksp_type
> preonly -ksp_type bcgs -pc_type asm). The number of unknowns is around
> 200,000.
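> >>>
> >>> (For concreteness, the two runs are launched roughly like this, with
> >>> ./app standing in for the actual program name:)
> >>>
> >>>    mpiexec -n 1 ./app -ksp_type bcgs -pc_type ilu -log_summary
> >>>    mpiexec -n 2 ./app -ksp_type bcgs -pc_type asm -sub_pc_type ilu -sub_ksp_type preonly -log_summary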
> >>>
> >>> I have used -log_summary to print out the performance summary as
> attached (log_summary_p1 for the serial run and log_summary_p2 for the run
> with 2 processes). It seems KSPSolve accounts for less than 20% of Global
> %T.
> >>> My questions are:
> >>>
> >>> 1. What is the bottleneck of the parallel run according to the
> summary?
> >>> 2. Do you have any suggestions to improve the parallel performance?
> >>>
> >>> Thanks a lot for your suggestions!
> >>>
> >>> Regards,
> >>> Qin <log_summary_p1.txt><log_summary_p2.txt>
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener