[petsc-users] Is OpenMP still available for PETSc?
Franck Houssen
franck.houssen at inria.fr
Thu Jul 6 07:23:59 CDT 2017
I do NOT have an answer!
What I was trying to say is that each MPI process likely gets some kind of "sub-part" (a block, possibly overlapping?) of the initial problem, and that each MPI process then likely uses several threads to process its sub-part. These assumptions may be "not so true" (depending on the PC, the case, the method, the options, ...) or simply wrong: I do NOT know how PETSc has been written!
Suppose the picture I assume is "not so far" from reality; then:
1. Increasing t (= slurm tasks = number of MPI processes) may change the problem (*), so you MAY (?) no longer really be comparing like with like (or you need to make SURE that the overlap and the other pieces still allow a comparison, which is not so obvious).
2. Increasing c (= slurm cpus = number of threads per MPI process) SHOULD improve the timings and possibly the iterations (but you may also see NO improvement if the threads end up oversubscribed, contending with one another, victims of false sharing, or hit by some other effect...).
To make it short: to allow comparisons, I WOULD first have fixed t and then changed only c... and I would NOT have been so surprised if the speed-up turned out to be disappointing.
I may be wrong.
Franck
(*): changing t changes the "sub-parts" (blocks, borders of the local sub-parts, overlap, ...). My understanding is that, whatever the method you use, the overlap between blocks (however you name them according to the method, options, ...) sizes the information that comes from the neighbor processes: my understanding is that this information is needed for good local convergence and may impact it (but I may be wrong).
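A minimal sketch of what I mean (assuming a single 64-core node and a hypothetical petsc4py script solve.py; adapt the names and counts to your machine): keep t fixed and vary ONLY c between the runs you compare, e.g.

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16    # t fixed: the blocks/overlap stay the same in every run
#SBATCH --cpus-per-task=1       # c: rerun with 2, then 4 (keeping t*c <= cores per node)
#SBATCH --exclusive
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # only matters if the application is threaded at all
srun python solve.py

Only the --cpus-per-task line (and the matching thread count) changes between the runs; changing --ntasks-per-node changes the decomposition itself.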
----- Original Message -----
> From: "Damian Kaliszan" <damian at man.poznan.pl>
> To: "Franck Houssen" <franck.houssen at inria.fr>
> Cc: petsc-users at mcs.anl.gov
> Sent: Thursday, July 6, 2017 09:56:58
> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>
> Dear Franck,
>
> Thank you for your comment!
> Did I get you correctly: according to you, the t<->c combination may
> influence the convergence -> KSP iterations -> timings (the result
> array/vector should be identical, though)?
>
> Best,
> Damian
>
> In a message dated 5 July 2017 (15:09:59), the following was written:
>
> > For a given use case, you may want to try all possible t and c such
> > that t*c = n: stick to the best one.
>
> > Now, if you modify the resources (t/c) and you get different
> > timings/iterations, this seems logical to me: the blocks, overlap, ...
> > (and finally the convergence) will differ, so the comparison no longer
> > really makes sense, as you are doing something different (unless you
> > fix t and let c vary: even then you may not get what you expect -
> > anyway, it seems that is not what you do).
>
> > Franck
>
> > ----- Original Message -----
> >> From: "Damian Kaliszan" <damian at man.poznan.pl>
> >> To: "Franck Houssen" <franck.houssen at inria.fr>, "Barry Smith"
> >> <bsmith at mcs.anl.gov>
> >> Cc: petsc-users at mcs.anl.gov
> >> Sent: Wednesday, July 5, 2017 10:50:39
> >> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
> >>
> >> Thank you:)
> >>
> >> A few notes on what you wrote:
> >> 1. I always try to keep t*c = number of cores; however, for a 64-core KNL
> >> with hyperthreading switched on (cpuinfo shows 256 cores), should t*c
> >> be 64 or 256 (in other words: is t=64 and c=4 correct)?
> >> 2. I noticed that for the same input data I may get different
> >> timings in 2 cases:
> >> a) a different number of KSP iterations is observed (why do they differ?)
> >> -> please see the screenshot Julia_N_10_4_vs_64.JPG for the following
> >> configs (this may be related to the 64*4 issue; which one is correct
> >> at first glance?):
> >>
> >> Matrix size=1000x1000
> >>
> >> 1/ slurm-23716.out, 511 steps, ~ 28 secs
> >> #SBATCH --nodes=1
> >> #SBATCH --ntasks=64
> >> #SBATCH --ntasks-per-node=64
> >> #SBATCH --cpus-per-task=4
> >>
> >>
> >> 2/ slurm-23718.out, 94 steps, ~ 4 secs
> >>
> >> #SBATCH --nodes=1
> >> #SBATCH --ntasks=4
> >> #SBATCH --ntasks-per-node=4
> >> #SBATCH --cpus-per-task=4
> >>
> >> b) an equal number of KSP iterations is observed but different timings
> >> (this might be due to false sharing or oversubscription?)
> >> -> please see the Julia_N_10_64_vs_64.JPG screenshot
> >>
> >>
> >>
> >> Best,
> >> Damian
> >>
> >> In a message dated 5 July 2017 (10:26:46), the following was written:
> >>
> >> > The man page of slurm/sbatch is cumbersome.
> >>
> >> > But you may think of:
> >> > 1. tasks as "MPI processes"
> >> > 2. cpus as "threads"
> >>
> >> > You should always request resources in the most precise way possible,
> >> > that is (never use --ntasks alone, but prefer) to:
> >> > 1. use --nodes=n.
> >> > 2. use --ntasks-per-node=t.
> >> > 3. use --cpus-per-task=c.
> >> > 4. for a start, make sure that t*c = the number of cores you have per node.
> >> > 5. use --exclusive, otherwise you may get VERY different timings if you run
> >> > the same job twice.
> >> > 6. make sure MPI is configured correctly (run the same single-threaded
> >> > application twice [or more]: do you get the same timing?).
> >> > 7. if using OpenMP or multithreaded applications, make sure you have
> >> > set the affinity properly (GOMP_CPU_AFFINITY with GNU, KMP_AFFINITY with
> >> > Intel).
> >> > 8. make sure you have enough memory (--mem), otherwise performance may be
> >> > degraded (swap).
> >>
> >> > Rule of thumb 4 does NOT have to be respected, but if you break it, you
> >> > need to be aware of WHY you want to do that (for KNL, it may [or may not]
> >> > make sense [depending on the cache modes]).
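A minimal sketch putting points 1-8 together (assuming a single 64-core node, a GNU OpenMP runtime, and a hypothetical executable ./my_app; the counts and the --mem value are placeholders to adapt to your site):

#SBATCH --nodes=1               # 1.
#SBATCH --ntasks-per-node=16    # 2. t
#SBATCH --cpus-per-task=4       # 3. c, so that t*c = 64 cores (4.)
#SBATCH --exclusive             # 5.
#SBATCH --mem=90G               # 8. adapt to the node's memory
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=true       # 7. or set GOMP_CPU_AFFINITY (GNU) / KMP_AFFINITY (Intel)
srun ./my_app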
> >>
> >> > Remember that any multi-threaded (OpenMP or not) application may be a
> >> > victim of false sharing
> >> > (https://en.wikipedia.org/wiki/False_sharing): in this case, a profile
> >> > (using cache metrics) may help you understand whether this is the problem,
> >> > and track it down if so (you may use perf record for that).
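For instance (a rough sketch for a single-node, non-MPI test run of a hypothetical executable ./my_app; the exact event names depend on the CPU):

perf record -e cache-misses,cache-references ./my_app
perf report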
> >>
> >> > Understanding the HW is not an easy thing: you really need to go step
> >> > by step, otherwise you have no chance of understanding anything in the end.
> >>
> >> > Hope this helps!...
> >>
> >> > Franck
> >>
> >> > Note: activating/deactivating hyper-threading (if available -
> >> > generally in the BIOS, when possible) may also change the performance.
> >>
> >> > ----- Original Message -----
> >> >> From: "Barry Smith" <bsmith at mcs.anl.gov>
> >> >> To: "Damian Kaliszan" <damian at man.poznan.pl>
> >> >> Cc: "PETSc" <petsc-users at mcs.anl.gov>
> >> >> Sent: Tuesday, July 4, 2017 19:04:36
> >> >> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
> >> >>
> >> >>
> >> >> You may need to ask a slurm expert. I have no idea what
> >> >> cpus-per-task
> >> >> means
> >> >>
> >> >>
> >> >> > On Jul 4, 2017, at 4:16 AM, Damian Kaliszan <damian at man.poznan.pl>
> >> >> > wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Yes, this is exactly what I meant.
> >> >> > Please find attached output for 2 input datasets and for 2 various
> >> >> > slurm
> >> >> > configs each:
> >> >> >
> >> >> > A/ Matrix size=8000000x8000000
> >> >> >
> >> >> > 1/ slurm-14432809.out, 930 ksp steps, ~90 secs
> >> >> >
> >> >> >
> >> >> > #SBATCH --nodes=2
> >> >> > #SBATCH --ntasks=32
> >> >> > #SBATCH --ntasks-per-node=16
> >> >> > #SBATCH --cpus-per-task=4
> >> >> >
> >> >> > 2/ slurm-14432810.out, 100,000 ksp steps, ~9700 secs
> >> >> >
> >> >> > #SBATCH --nodes=2
> >> >> > #SBATCH --ntasks=32
> >> >> > #SBATCH --ntasks-per-node=16
> >> >> > #SBATCH --cpus-per-task=2
> >> >> >
> >> >> >
> >> >> >
> >> >> > B/ Matrix size=1000x1000
> >> >> >
> >> >> > 1/ slurm-23716.out, 511 ksp steps, ~ 28 secs
> >> >> > #SBATCH --nodes=1
> >> >> > #SBATCH --ntasks=64
> >> >> > #SBATCH --ntasks-per-node=64
> >> >> > #SBATCH --cpus-per-task=4
> >> >> >
> >> >> >
> >> >> > 2/ slurm-23718.out, 94 ksp steps, ~ 4 secs
> >> >> >
> >> >> > #SBATCH --nodes=1
> >> >> > #SBATCH --ntasks=4
> >> >> > #SBATCH --ntasks-per-node=4
> >> >> > #SBATCH --cpus-per-task=4
> >> >> >
> >> >> >
> >> >> > I would really appreciate any help...:)
> >> >> >
> >> >> > Best,
> >> >> > Damian
> >> >> >
> >> >> >
> >> >> >
> >> >> > In a message dated 3 July 2017 (16:29:15), the following was written:
> >> >> >
> >> >> >
> >> >> > On Mon, Jul 3, 2017 at 9:23 AM, Damian Kaliszan
> >> >> > <damian at man.poznan.pl>
> >> >> > wrote:
> >> >> > Hi,
> >> >> >
> >> >> >
> >> >> > >> 1) You can call Bcast on PETSC_COMM_WORLD
> >> >> >
> >> >> > To be honest, I can't find a Bcast method in petsc4py.PETSc.Comm
> >> >> > (I'm using petsc4py).
> >> >> >
> >> >> > >> 2) If you are using WORLD, the number of iterates will be the same
> >> >> > >> on
> >> >> > >> each process since iteration is collective.
> >> >> >
> >> >> > Yes, this is how it should be. But what I noticed is that for
> >> >> > different --cpus-per-task numbers in the slurm script I get a
> >> >> > different number of solver iterations, which is in turn related
> >> >> > to the timings. The disparity is huge. For example, for some
> >> >> > configurations where --cpus-per-task=1 I receive 900 iterations,
> >> >> > and for --cpus-per-task=2 I receive 100,000, which is the maximum
> >> >> > iteration number set when setting the solver tolerances.
> >> >> >
> >> >> > I am trying to understand what you are saying. You mean that you make
> >> >> > 2
> >> >> > different runs and get a different
> >> >> > number of iterates with a KSP? In order to answer questions about
> >> >> > convergence, we need to see the output
> >> >> > of
> >> >> >
> >> >> > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
> >> >> >
> >> >> > for all cases.
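For example (a hypothetical invocation, assuming a petsc4py script solver.py that initializes PETSc with sys.argv so that the options are picked up from the command line):

srun python solver.py -ksp_view -ksp_monitor_true_residual -ksp_converged_reason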
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Matt
> >> >> >
> >> >> > Best,
> >> >> > Damian
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > What most experimenters take for granted before they begin their
> >> >> > experiments is infinitely more interesting than any results to which
> >> >> > their
> >> >> > experiments lead.
> >> >> > -- Norbert Wiener
> >> >> >
> >> >> > http://www.caam.rice.edu/~mk51/
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > -------------------------------------------------------
> >> >> > Damian Kaliszan
> >> >> >
> >> >> > Poznan Supercomputing and Networking Center
> >> >> > HPC and Data Centres Technologies
> >> >> > ul. Jana Pawła II 10
> >> >> > 61-139 Poznan
> >> >> > POLAND
> >> >> >
> >> >> > phone (+48 61) 858 5109
> >> >> > e-mail damian at man.poznan.pl
> >> >> > www - http://www.man.poznan.pl/
> >> >> > -------------------------------------------------------
> >> >> > <slum_output.zip>
> >> >>
> >> >>
>
>
>