[petsc-users] Is OpenMP still available for PETSc?
Franck Houssen
franck.houssen at inria.fr
Wed Jul 5 08:09:59 CDT 2017
For a given use case, you may want to try all possible t and c such that t*c = n, and stick with the best combination.
Now, if you modify the resources (t/c) and get different timings/iteration counts, that seems logical to me: the blocks, the overlap, ... (and finally the convergence) will differ, so the comparison no longer really makes sense because you are doing something different (unless you fix t and let only c vary; even then, you may not get what you expect. Anyway, that does not seem to be what you are doing).
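
If it helps, here is a minimal, untested sketch (plain Python plus the sbatch flags already used below; the script and the job script name job.sh are hypothetical) of how such a sweep over all t*c = n combinations could be submitted:

    # sweep_tc.py (hypothetical): submit one job per (t, c) pair with t*c = n.
    import subprocess

    n = 64  # physical cores per node; adjust to your machine

    for t in range(1, n + 1):
        if n % t:
            continue
        c = n // t  # tasks-per-node * cpus-per-task = cores per node
        subprocess.run(
            ["sbatch", "--nodes=1", f"--ntasks-per-node={t}",
             f"--cpus-per-task={c}", "job.sh"],
            check=True,
        )
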
Franck
----- Original Message -----
> From: "Damian Kaliszan" <damian at man.poznan.pl>
> To: "Franck Houssen" <franck.houssen at inria.fr>, "Barry Smith" <bsmith at mcs.anl.gov>
> Cc: petsc-users at mcs.anl.gov
> Sent: Wednesday, July 5, 2017 10:50:39
> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>
> Thank you:)
>
> A few notes on what you wrote:
> 1. I always try to keep t*c = number of cores; however, for a 64-core KNL
> with hyperthreading switched on (cpuinfo shows 256 logical cores), should
> t*c be 64 or 256 (in other words, is t=64 and c=4 correct)?
> 2. I noticed that for the same input data I may get different
> timings in 2 cases:
> a) a different number of KSP iterations is observed (why do they differ?)
> -> please see the screenshot Julia_N_10_4_vs_64.JPG for the following
> configs (this may be
> related to the 64*4 issue from point 1 + which one is correct at first glance?):
>
> Matrix size=1000x1000
>
> 1/ slurm-23716.out, 511 steps, ~ 28 secs
> #SBATCH --nodes=1
> #SBATCH --ntasks=64
> #SBATCH --ntasks-per-node=64
> #SBATCH --cpus-per-task=4
>
>
> 2/ slurm-23718.out, 94 steps, ~ 4 secs
>
> #SBATCH --nodes=1
> #SBATCH --ntasks=4
> #SBATCH --ntasks-per-node=4
> #SBATCH --cpus-per-task=4
>
> b) an equal number of KSP iterations is observed, but with different
> timings (this might be
> due to false sharing or oversubscription?);
> please see the Julia_N_10_64_vs_64.JPG screenshot
>
>
>
> Best,
> Damian
>
> In a message dated July 5, 2017 (10:26:46), the following was written:
>
> > The man page of slurm/sbatch is cumbersome.
>
> > But you may think of:
> > 1. tasks as "MPI processes"
> > 2. cpus as "threads"
>
> > You should always request resources in the most precise way possible,
> > that is (never use --ntasks alone, but prefer) to:
> > 1. use --nodes=n.
> > 2. use --ntasks-per-node=t.
> > 3. use --cpus-per-task=c.
> > 4. for a start, make sure that t*c = the number of cores you have per node.
> > 5. use --exclusive, otherwise you may get VERY different timings if you run
> > the same job twice.
> > 6. make sure MPI is configured correctly (run the same single-threaded
> > application twice [or more]: do you get the same timing?).
> > 7. if using OpenMP or other multithreaded applications, make sure you have
> > set thread affinity properly (GOMP_CPU_AFFINITY with GNU, KMP_AFFINITY with
> > Intel); see the small check sketched just after this list.
> > 8. make sure you have enough memory (--mem), otherwise performance may be
> > degraded (swapping).
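>
> > As a check for point 7, here is a minimal sketch (assuming a Linux node and
> > that mpi4py is available next to petsc4py; the file name is hypothetical)
> > where every MPI rank prints the set of CPUs it is actually bound to; badly
> > set affinity or oversubscription shows up immediately:
>
> >     # check_binding.py (hypothetical) - print the CPU binding of each rank (Linux only).
> >     import os
> >     import socket
> >     from mpi4py import MPI
> >
> >     comm = MPI.COMM_WORLD
> >     cpus = sorted(os.sched_getaffinity(0))   # CPUs this process may run on
> >     print(f"rank {comm.rank:3d} on {socket.gethostname()}: "
> >           f"{len(cpus)} cpus -> {cpus}")
>
> > Run it with srun inside the batch script; if several ranks report the same
> > single CPU, they are oversubscribed.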
>
> > Rule of thumb 4 does NOT have to be respected, but if you break it, you
> > need to be aware of WHY you want to do that (for KNL, it may [or may not]
> > make sense [depending on the cache modes]).
>
> > Remember that any multithreaded (OpenMP or not) application may be a
> > victim of false sharing
> > (https://en.wikipedia.org/wiki/False_sharing): in that case, profiling
> > (using cache metrics) may help you understand whether this is the problem,
> > and track it down if so (you may use perf record for that).
>
> > Understanding the hardware is not an easy thing: you really need to go step
> > by step, otherwise you have no chance of understanding anything in the end.
>
> > Hope this may help!
>
> > Franck
>
> > Note: activating/deactivating hyper-threading (when available - generally
> > set in the BIOS) may also change performance.
>
> > ----- Original Message -----
> >> From: "Barry Smith" <bsmith at mcs.anl.gov>
> >> To: "Damian Kaliszan" <damian at man.poznan.pl>
> >> Cc: "PETSc" <petsc-users at mcs.anl.gov>
> >> Sent: Tuesday, July 4, 2017 19:04:36
> >> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
> >>
> >>
> >> You may need to ask a slurm expert. I have no idea what cpus-per-task
> >> means
> >>
> >>
> >> > On Jul 4, 2017, at 4:16 AM, Damian Kaliszan <damian at man.poznan.pl>
> >> > wrote:
> >> >
> >> > Hi,
> >> >
> >> > Yes, this is exactly what I meant.
> >> > Please find attached the output for 2 input datasets, with 2 different
> >> > slurm configs each:
> >> >
> >> > A/ Matrix size=8000000x8000000
> >> >
> >> > 1/ slurm-14432809.out, 930 ksp steps, ~90 secs
> >> >
> >> >
> >> > #SBATCH --nodes=2
> >> > #SBATCH --ntasks=32
> >> > #SBATCH --ntasks-per-node=16
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> > 2/ slurm-14432810.out, 100,000 ksp steps, ~9700 secs
> >> >
> >> > #SBATCH --nodes=2
> >> > #SBATCH --ntasks=32
> >> > #SBATCH --ntasks-per-node=16
> >> > #SBATCH --cpus-per-task=2
> >> >
> >> >
> >> >
> >> > B/ Matrix size=1000x1000
> >> >
> >> > 1/ slurm-23716.out, 511 ksp steps, ~ 28 secs
> >> > #SBATCH --nodes=1
> >> > #SBATCH --ntasks=64
> >> > #SBATCH --ntasks-per-node=64
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> >
> >> > 2/ slurm-23718.out, 94 ksp steps, ~ 4 secs
> >> >
> >> > #SBATCH --nodes=1
> >> > #SBATCH --ntasks=4
> >> > #SBATCH --ntasks-per-node=4
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> >
> >> > I would really appreciate any help...:)
> >> >
> >> > Best,
> >> > Damian
> >> >
> >> >
> >> >
> >> > In a message dated July 3, 2017 (16:29:15), the following was written:
> >> >
> >> >
> >> > On Mon, Jul 3, 2017 at 9:23 AM, Damian Kaliszan <damian at man.poznan.pl>
> >> > wrote:
> >> > Hi,
> >> >
> >> >
> >> > >> 1) You can call Bcast on PETSC_COMM_WORLD
> >> >
> >> > To be honest, I can't find a Bcast method in petsc4py.PETSc.Comm (I'm
> >> > using petsc4py).
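> >> >
> >> > (One possible route - a minimal sketch, assuming mpi4py is installed
> >> > alongside petsc4py - is to convert the PETSc communicator to an mpi4py
> >> > one via its tompi4py() method and use that for the broadcast:)
> >> >
> >> >     from petsc4py import PETSc
> >> >
> >> >     comm = PETSc.COMM_WORLD.tompi4py()   # mpi4py.MPI.Intracomm
> >> >     data = {"n": 1000} if comm.rank == 0 else None
> >> >     data = comm.bcast(data, root=0)      # every rank now holds the dict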
> >> >
> >> > >> 2) If you are using WORLD, the number of iterates will be the same on
> >> > >> each process since iteration is collective.
> >> >
> >> > Yes, this is how it should be. But what I noticed is that for
> >> > different --cpus-per-task numbers in the slurm script I get a different
> >> > number of solver iterations, which in turn affects the timings. The
> >> > disparity is huge. For example, for some configurations with
> >> > --cpus-per-task=1 I get 900
> >> > iterations, and with --cpus-per-task=2 I get 100,000 iterations, i.e. the
> >> > maximum iteration count set when setting the solver tolerances.
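> >> >
> >> > (A minimal, untested sketch of how I could check whether such a run
> >> > simply hit the iteration cap rather than converged, assuming a petsc4py
> >> > KSP object named ksp after the solve:)
> >> >
> >> >     rtol, atol, divtol, max_it = ksp.getTolerances()
> >> >     its = ksp.getIterationNumber()
> >> >     reason = ksp.getConvergedReason()  # negative => divergence, e.g. max_it reached
> >> >     print(f"iterations={its}/{max_it}, converged reason={reason}")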
> >> >
> >> > I am trying to understand what you are saying. You mean that you make 2
> >> > different runs and get a different
> >> > number of iterates with a KSP? In order to answer questions about
> >> > convergence, we need to see the output
> >> > of
> >> >
> >> > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
> >> >
> >> > for all cases.
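> >> >
> >> > (In a petsc4py script, these options only take effect if petsc4py is
> >> > initialized with the command line and the KSP calls setFromOptions().
> >> > A minimal sketch, with a hypothetical diagonal test system just so there
> >> > is something to solve:)
> >> >
> >> >     import sys
> >> >     import petsc4py
> >> >     petsc4py.init(sys.argv)  # must run before "from petsc4py import PETSc"
> >> >     from petsc4py import PETSc
> >> >
> >> >     n = 100
> >> >     A = PETSc.Mat().createAIJ([n, n], nnz=1, comm=PETSc.COMM_WORLD)
> >> >     rstart, rend = A.getOwnershipRange()
> >> >     for i in range(rstart, rend):  # simple diagonal system
> >> >         A.setValue(i, i, 2.0)
> >> >     A.assemble()
> >> >
> >> >     b = A.createVecRight()
> >> >     b.set(1.0)
> >> >     x = A.createVecLeft()
> >> >
> >> >     ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
> >> >     ksp.setOperators(A)
> >> >     ksp.setFromOptions()  # picks up -ksp_view, -ksp_monitor_true_residual, ...
> >> >     ksp.solve(b, x)
> >> >
> >> > (Then run as usual, appending the three options above to the Python
> >> > command line, and send the output.)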
> >> >
> >> > Thanks,
> >> >
> >> > Matt
> >> >
> >> > Best,
> >> > Damian
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > What most experimenters take for granted before they begin their
> >> > experiments is infinitely more interesting than any results to which
> >> > their
> >> > experiments lead.
> >> > -- Norbert Wiener
> >> >
> >> > http://www.caam.rice.edu/~mk51/
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > -------------------------------------------------------
> >> > Damian Kaliszan
> >> >
> >> > Poznan Supercomputing and Networking Center
> >> > HPC and Data Centres Technologies
> >> > ul. Jana Pawła II 10
> >> > 61-139 Poznan
> >> > POLAND
> >> >
> >> > phone (+48 61) 858 5109
> >> > e-mail damian at man.poznan.pl
> >> > www - http://www.man.poznan.pl/
> >> > -------------------------------------------------------
> >> > <slum_output.zip>
> >>
> >>