[petsc-users] Is OpenMP still available for PETSc?
Franck Houssen
franck.houssen at inria.fr
Wed Jul 5 08:09:59 CDT 2017
For a given use case, you may want to try all possible t and c such that t*c = n, and stick with the best combination.
Now, if you modify the resources (t/c) and get different timings/iteration counts, that seems logical to me: the blocks, the overlap, ... (and finally the convergence) will differ, so the comparison no longer really makes sense because you are doing something different (unless you fix t and let only c vary; even then, you may not get what you expect. Anyway, that does not seem to be what you are doing).
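
If it helps, here is a minimal, untested sketch (plain Python plus the sbatch flags already used below; the script and the job script name job.sh are hypothetical) of how such a sweep over all t*c = n combinations could be submitted:

    # sweep_tc.py (hypothetical): submit one job per (t, c) pair with t*c = n.
    import subprocess

    n = 64  # physical cores per node; adjust to your machine

    for t in range(1, n + 1):
        if n % t:
            continue
        c = n // t  # tasks-per-node * cpus-per-task = cores per node
        subprocess.run(
            ["sbatch", "--nodes=1", f"--ntasks-per-node={t}",
             f"--cpus-per-task={c}", "job.sh"],
            check=True,
        )
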
Franck
----- Original Message -----
> From: "Damian Kaliszan" <damian at man.poznan.pl>
> To: "Franck Houssen" <franck.houssen at inria.fr>, "Barry Smith" <bsmith at mcs.anl.gov>
> Cc: petsc-users at mcs.anl.gov
> Sent: Wednesday, July 5, 2017 10:50:39
> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>
> Thank you:)
>
> A few notes on what you wrote:
> 1. I always try to keep t*c = number of cores; however, for a 64-core KNL
> with hyperthreading switched on (cpuinfo shows 256 logical cores), should
> t*c be 64 or 256 (in other words, is t=64 and c=4 correct)?
> 2. I noticed that for the same input data I may get different
> timings in 2 cases:
> a) a different number of KSP iterations is observed (why do they differ?)
> -> please see the screenshot Julia_N_10_4_vs_64.JPG for the following
> configs (this may be
> related to the 64*4 issue from point 1 + which one is correct at first glance?):
>
> Matrix size=1000x1000
>
> 1/ slurm-23716.out, 511 steps, ~ 28 secs
> #SBATCH --nodes=1
> #SBATCH --ntasks=64
> #SBATCH --ntasks-per-node=64
> #SBATCH --cpus-per-task=4
>
>
> 2/ slurm-23718.out, 94 steps, ~ 4 secs
>
> #SBATCH --nodes=1
> #SBATCH --ntasks=4
> #SBATCH --ntasks-per-node=4
> #SBATCH --cpus-per-task=4
>
> b) an equal number of KSP iterations is observed, but with different
> timings (this might be
> due to false sharing or oversubscription?);
> please see the Julia_N_10_64_vs_64.JPG screenshot
>
>
>
> Best,
> Damian
>
> In a message dated July 5, 2017 (10:26:46), the following was written:
>
> > The man page of slurm/sbatch is cumbersome.
>
> > But you may think of:
> > 1. tasks as "MPI processes"
> > 2. cpus as "threads"
>
> > You should always request resources in the most precise way possible,
> > that is (never use --ntasks alone, but prefer) to:
> > 1. use --nodes=n.
> > 2. use --ntasks-per-node=t.
> > 3. use --cpus-per-task=c.
> > 4. for a start, make sure that t*c = the number of cores you have per node.
> > 5. use --exclusive, otherwise you may get VERY different timings if you run
> > the same job twice.
> > 6. make sure MPI is configured correctly (run the same single-threaded
> > application twice [or more]: do you get the same timing?).
> > 7. if using OpenMP or other multithreaded applications, make sure you have
> > set thread affinity properly (GOMP_CPU_AFFINITY with GNU, KMP_AFFINITY with
> > Intel); see the small check sketched just after this list.
> > 8. make sure you have enough memory (--mem), otherwise performance may be
> > degraded (swapping).
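>
> > As a check for point 7, here is a minimal sketch (assuming a Linux node and
> > that mpi4py is available next to petsc4py; the file name is hypothetical)
> > where every MPI rank prints the set of CPUs it is actually bound to; badly
> > set affinity or oversubscription shows up immediately:
>
> >     # check_binding.py (hypothetical) - print the CPU binding of each rank (Linux only).
> >     import os
> >     import socket
> >     from mpi4py import MPI
> >
> >     comm = MPI.COMM_WORLD
> >     cpus = sorted(os.sched_getaffinity(0))   # CPUs this process may run on
> >     print(f"rank {comm.rank:3d} on {socket.gethostname()}: "
> >           f"{len(cpus)} cpus -> {cpus}")
>
> > Run it with srun inside the batch script; if several ranks report the same
> > single CPU, they are oversubscribed.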
>
> > Rule of thumb 4 does NOT have to be respected, but if you break it, you
> > need to be aware of WHY you want to do that (for KNL, it may [or may not]
> > make sense [depending on the cache modes]).
>
> > Remember that any multithreaded (OpenMP or not) application may be a
> > victim of false sharing
> > (https://en.wikipedia.org/wiki/False_sharing): in that case, profiling
> > (using cache metrics) may help you understand whether this is the problem,
> > and track it down if so (you may use perf record for that).
>
> > Understanding the hardware is not an easy thing: you really need to go step
> > by step, otherwise you have no chance of understanding anything in the end.
>
> > Hope this may help!
>
> > Franck
>
> > Note: activating/deactivating hyper-threading (when available - generally
> > set in the BIOS) may also change performance.
>
> > ----- Original Message -----
> >> From: "Barry Smith" <bsmith at mcs.anl.gov>
> >> To: "Damian Kaliszan" <damian at man.poznan.pl>
> >> Cc: "PETSc" <petsc-users at mcs.anl.gov>
> >> Sent: Tuesday, July 4, 2017 19:04:36
> >> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
> >>
> >>
> >> You may need to ask a slurm expert. I have no idea what cpus-per-task
> >> means
> >>
> >>
> >> > On Jul 4, 2017, at 4:16 AM, Damian Kaliszan <damian at man.poznan.pl>
> >> > wrote:
> >> >
> >> > Hi,
> >> >
> >> > Yes, this is exactly what I meant.
> >> > Please find attached the output for 2 input datasets, with 2 different
> >> > slurm configs each:
> >> >
> >> > A/ Matrix size=8000000x8000000
> >> >
> >> > 1/ slurm-14432809.out, 930 ksp steps, ~90 secs
> >> >
> >> >
> >> > #SBATCH --nodes=2
> >> > #SBATCH --ntasks=32
> >> > #SBATCH --ntasks-per-node=16
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> > 2/ slurm-14432810.out, 100,000 ksp steps, ~9700 secs
> >> >
> >> > #SBATCH --nodes=2
> >> > #SBATCH --ntasks=32
> >> > #SBATCH --ntasks-per-node=16
> >> > #SBATCH --cpus-per-task=2
> >> >
> >> >
> >> >
> >> > B/ Matrix size=1000x1000
> >> >
> >> > 1/ slurm-23716.out, 511 ksp steps, ~ 28 secs
> >> > #SBATCH --nodes=1
> >> > #SBATCH --ntasks=64
> >> > #SBATCH --ntasks-per-node=64
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> >
> >> > 2/ slurm-23718.out, 94 ksp steps, ~ 4 secs
> >> >
> >> > #SBATCH --nodes=1
> >> > #SBATCH --ntasks=4
> >> > #SBATCH --ntasks-per-node=4
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> >
> >> > I would really appreciate any help...:)
> >> >
> >> > Best,
> >> > Damian
> >> >
> >> >
> >> >
> >> > In a message dated July 3, 2017 (16:29:15), the following was written:
> >> >
> >> >
> >> > On Mon, Jul 3, 2017 at 9:23 AM, Damian Kaliszan <damian at man.poznan.pl>
> >> > wrote:
> >> > Hi,
> >> >
> >> >
> >> > >> 1) You can call Bcast on PETSC_COMM_WORLD
> >> >
> >> > To be honest, I can't find a Bcast method in petsc4py.PETSc.Comm (I'm
> >> > using petsc4py).
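> >> >
> >> > (One possible route - a minimal sketch, assuming mpi4py is installed
> >> > alongside petsc4py - is to convert the PETSc communicator to an mpi4py
> >> > one via its tompi4py() method and use that for the broadcast:)
> >> >
> >> >     from petsc4py import PETSc
> >> >
> >> >     comm = PETSc.COMM_WORLD.tompi4py()   # mpi4py.MPI.Intracomm
> >> >     data = {"n": 1000} if comm.rank == 0 else None
> >> >     data = comm.bcast(data, root=0)      # every rank now holds the dict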
> >> >
> >> > >> 2) If you are using WORLD, the number of iterates will be the same on
> >> > >> each process since iteration is collective.
> >> >
> >> > Yes, this is how it should be. But what I noticed is that for
> >> > different --cpus-per-task numbers in the slurm script I get a different
> >> > number of solver iterations, which in turn affects the timings. The
> >> > disparity is huge. For example, for some configurations with
> >> > --cpus-per-task=1 I get 900
> >> > iterations, and with --cpus-per-task=2 I get 100,000 iterations, i.e. the
> >> > maximum iteration count set when setting the solver tolerances.
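> >> >
> >> > (A minimal, untested sketch of how I could check whether such a run
> >> > simply hit the iteration cap rather than converged, assuming a petsc4py
> >> > KSP object named ksp after the solve:)
> >> >
> >> >     rtol, atol, divtol, max_it = ksp.getTolerances()
> >> >     its = ksp.getIterationNumber()
> >> >     reason = ksp.getConvergedReason()  # negative => divergence, e.g. max_it reached
> >> >     print(f"iterations={its}/{max_it}, converged reason={reason}")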
> >> >
> >> > I am trying to understand what you are saying. You mean that you make 2
> >> > different runs and get a different
> >> > number of iterates with a KSP? In order to answer questions about
> >> > convergence, we need to see the output
> >> > of
> >> >
> >> > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
> >> >
> >> > for all cases.
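> >> >
> >> > (In a petsc4py script, these options only take effect if petsc4py is
> >> > initialized with the command line and the KSP calls setFromOptions().
> >> > A minimal sketch, with a hypothetical diagonal test system just so there
> >> > is something to solve:)
> >> >
> >> >     import sys
> >> >     import petsc4py
> >> >     petsc4py.init(sys.argv)  # must run before "from petsc4py import PETSc"
> >> >     from petsc4py import PETSc
> >> >
> >> >     n = 100
> >> >     A = PETSc.Mat().createAIJ([n, n], nnz=1, comm=PETSc.COMM_WORLD)
> >> >     rstart, rend = A.getOwnershipRange()
> >> >     for i in range(rstart, rend):  # simple diagonal system
> >> >         A.setValue(i, i, 2.0)
> >> >     A.assemble()
> >> >
> >> >     b = A.createVecRight()
> >> >     b.set(1.0)
> >> >     x = A.createVecLeft()
> >> >
> >> >     ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
> >> >     ksp.setOperators(A)
> >> >     ksp.setFromOptions()  # picks up -ksp_view, -ksp_monitor_true_residual, ...
> >> >     ksp.solve(b, x)
> >> >
> >> > (Then run as usual, appending the three options above to the Python
> >> > command line, and send the output.)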
> >> >
> >> > Thanks,
> >> >
> >> > Matt
> >> >
> >> > Best,
> >> > Damian
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > What most experimenters take for granted before they begin their
> >> > experiments is infinitely more interesting than any results to which
> >> > their
> >> > experiments lead.
> >> > -- Norbert Wiener
> >> >
> >> > http://www.caam.rice.edu/~mk51/
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > -------------------------------------------------------
> >> > Damian Kaliszan
> >> >
> >> > Poznan Supercomputing and Networking Center
> >> > HPC and Data Centres Technologies
> >> > ul. Jana Pawła II 10
> >> > 61-139 Poznan
> >> > POLAND
> >> >
> >> > phone (+48 61) 858 5109
> >> > e-mail damian at man.poznan.pl
> >> > www - http://www.man.poznan.pl/
> >> > -------------------------------------------------------
> >> > <slum_output.zip>
> >>
> >>