[petsc-users] Is OpenMP still available for PETSc?

Thu Jul 6 02:56:58 CDT 2017

Dear Franck,

Thank you for your comment!
Did I understand you correctly: according to you, the t<->c combination
may influence the convergence -> KSP iterations -> timings (although the
result array/vector should be identical)?

Best,
Damian

In the message dated 5 July 2017 (15:09:59), the following was written:

> For a given use case, you may want to try all possible t and c such
> that t*c = n, and stick to the best one.
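
A quick way to list the candidates for the sweep described above. This is only a minimal sketch: the function name task_cpu_pairs and the value n = 64 are illustrative assumptions, not something taken from the thread.

```python
def task_cpu_pairs(n):
    """Return every (t, c) with t * c == n.

    t stands for tasks per node (MPI processes) and
    c for cpus per task (threads), as used in this thread.
    """
    return [(t, n // t) for t in range(1, n + 1) if n % t == 0]


if __name__ == "__main__":
    # n = 64 is an assumed core count per node, chosen for illustration.
    for t, c in task_cpu_pairs(64):
        print(f"--ntasks-per-node={t} --cpus-per-task={c}")
```

Each printed pair is one configuration to time. Which pair is fastest has to be measured for the use case at hand.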

> Now, if you modify the resources (t/c) and get different
> timings/iterations, this seems logical to me: the blocks, the overlap, ...
> (and finally the convergence) will differ, so a comparison no longer
> really makes sense, as you are doing something different (unless you fix
> t and let c vary: even then, you may not get what you expect. Anyway,
> that does not seem to be what you are doing).
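
To illustrate why the number of blocks can change the iteration count: in parallel, PETSc's default preconditioner is block Jacobi with one block per MPI process, so changing the number of tasks changes the preconditioner (assuming the default was left unchanged, which the thread does not confirm). The sketch below is a toy, not PETSc. It runs a stationary block Jacobi iteration in pure Python on an assumed 1D Laplacian of size 32 with an assumed right-hand side of ones, and all function names are hypothetical.

```python
import math


def solve_block(rhs):
    """Solve tridiag(-1, 2, -1) * x = rhs exactly with the Thomas algorithm."""
    m = len(rhs)
    c = [0.0] * m  # modified super-diagonal
    d = [0.0] * m  # modified right-hand side
    c[0] = -1.0 / 2.0
    d[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * m
    x[m - 1] = d[m - 1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def residual(x, b):
    """Return b - A x for the full matrix A = tridiag(-1, 2, -1)."""
    n = len(x)
    r = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        r.append(b[i] - (2.0 * x[i] - left - right))
    return r


def block_jacobi_iterations(n, nblocks, rtol=1e-8, max_iter=100000):
    """Count block Jacobi updates until ||r|| <= rtol * ||b||."""
    assert n % nblocks == 0
    size = n // nblocks
    b = [1.0] * n
    x = [0.0] * n
    bnorm = math.sqrt(sum(v * v for v in b))
    for it in range(max_iter):
        r = residual(x, b)
        if math.sqrt(sum(v * v for v in r)) <= rtol * bnorm:
            return it
        # Each block solves only its own diagonal part of A;
        # the coupling between neighbouring blocks is ignored.
        for k in range(nblocks):
            lo = k * size
            dx = solve_block(r[lo:lo + size])
            for j in range(size):
                x[lo + j] += dx[j]
    return max_iter


if __name__ == "__main__":
    for nblocks in (1, 2, 4, 8):
        print(nblocks, "blocks:", block_jacobi_iterations(32, nblocks), "iterations")
```

With a single block the update is an exact solve, so one step is enough. Splitting the same system into more blocks drops more coupling terms and needs more steps. A Krylov method preconditioned this way behaves differently in detail, but the dependence on the partitioning is the same kind of effect.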

> Franck

> ----- Original message -----
>> From: "Damian Kaliszan" <damian at man.poznan.pl>
>> To: "Franck Houssen" <franck.houssen at inria.fr>, "Barry Smith" <bsmith at mcs.anl.gov>
>> Cc: petsc-users at mcs.anl.gov
>> Sent: Wednesday 5 July 2017 10:50:39
>> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>> 
>> Thank you:)
>> 
>> A few notes on what you wrote:
>> 1. I always try to keep t*c = number of cores. However, for a 64-core KNL
>> which has hyperthreading switched on (cpuinfo shows 256 cores), should
>> t*c be 64 or 256 (in other words: is t=64 and c=4 correct)?
>> 2. I noticed that for the same input data I may get different
>> timings in 2 cases:
>> a) a different number of KSP iterations is observed (why do they differ?)
>>    -> please see the screenshot Julia_N_10_4_vs_64.JPG for the following
>>    config (this may be related to the 64*4 issue; which one looks
>>    correct at first glance?):
>> 
>> Matrix size=1000x1000
>> 
>> 1/  slurm-23716.out, 511 steps, ~ 28 secs
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=64
>> #SBATCH --ntasks-per-node=64
>> #SBATCH --cpus-per-task=4
>> 
>> 
>> 2/ slurm-23718.out, 94 steps, ~ 4 secs
>> 
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=4
>> #SBATCH --ntasks-per-node=4
>> #SBATCH --cpus-per-task=4
>> 
>> b) an equal number of KSP iterations is observed, but the timings differ
>> (might this be due to false sharing or oversubscription?);
>> please see the Julia_N_10_64_vs_64.JPG screenshot
>> 
>> 
>> 
>> Best,
>> Damian
>>  
>> In the message dated 5 July 2017 (10:26:46), the following was written:
>>  
>> > The man page of slurm/sbatch is cumbersome.
>> 
>> > But you may think of:
>> > 1. tasks "as MPI processes"
>> > 2. cpus "as threads"
>> 
>> > You should always specify resources as precisely as possible,
>> > that is (never use --ntasks, but prefer to):
>> > 1. use --nodes=n.
>> > 2. use --ntasks-per-node=t.
>> > 3. use --cpus-per-task=c.
>> > 4. for a start, make sure that t*c = number of cores you have per node.
>> > 5. use --exclusive; otherwise you may get VERY different timings when
>> > you run the same job twice.
>> > 6. make sure MPI is configured correctly (run the same single-threaded
>> > application twice [or more]: do you get the same timing?)
>> > 7. if using OpenMP or multithreaded applications, make sure you have
>> > set affinity properly (GOMP_CPU_AFFINITY with GNU, KMP_AFFINITY with
>> > Intel).
>> > 8. make sure you have enough memory (--mem); otherwise performance may
>> > be degraded (swap).
>> 
>> > Rule of thumb 4 need NOT always be followed, but if you break it, you
>> > need to know WHY you want to do that (for KNL, it may [or may not] make
>> > sense [depending on cache modes]).
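
The resource rules above can be sketched as a small helper that writes the sbatch header and checks rule of thumb 4. This is a minimal sketch under stated assumptions: the function names follows_rule_4 and sbatch_header are hypothetical, and 64 cores per node is assumed from the KNL discussed in this thread. Only the sbatch options already named in the thread are used.

```python
def follows_rule_4(tasks_per_node, cpus_per_task, cores_per_node):
    """True when t * c equals the number of cores per node (rule of thumb 4)."""
    return tasks_per_node * cpus_per_task == cores_per_node


def sbatch_header(nodes, tasks_per_node, cpus_per_task):
    """Return the #SBATCH lines for rules 1, 2, 3 and 5."""
    return [
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        "#SBATCH --exclusive",
    ]


if __name__ == "__main__":
    cores_per_node = 64  # assumption: physical cores of one KNL node
    for t, c in [(16, 4), (64, 4)]:
        print("\n".join(sbatch_header(1, t, c)))
        print("t*c matches core count:", follows_rule_4(t, c, cores_per_node))
```

The second pair (t=64, c=4) requests 256 CPUs, which matches the logical CPU count with hyperthreading but not the 64 physical cores. Whether that is a sensible choice depends on the machine and on how Slurm is configured there.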
>> 
>> > Remember that any multi-threaded (OpenMP or not) application may
>> > fall victim to false sharing
>> > (https://en.wikipedia.org/wiki/False_sharing): in this case, profiling
>> > (using cache metrics) may help to understand whether this is the
>> > problem, and to track it down if so (you may use perf record for that).
>> 
>> > Understanding HW is not an easy thing: you really need to go step
>> > by step, otherwise you have no chance of understanding anything in the end.
>> 
>> > Hope this may help !...
>> 
>> > Franck
>> 
>> > Note: activating/deactivating hyper-threading (if available -
>> > generally in the BIOS when possible) may also change performance.
>> 
>> > ----- Original message -----
>> >> From: "Barry Smith" <bsmith at mcs.anl.gov>
>> >> To: "Damian Kaliszan" <damian at man.poznan.pl>
>> >> Cc: "PETSc" <petsc-users at mcs.anl.gov>
>> >> Sent: Tuesday 4 July 2017 19:04:36
>> >> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>> >> 
>> >> 
>> >>    You may need to ask a slurm expert. I have no idea what cpus-per-task
>> >>    means
>> >> 
>> >> 
>> >> > On Jul 4, 2017, at 4:16 AM, Damian Kaliszan <damian at man.poznan.pl>
>> >> > wrote:
>> >> > 
>> >> > Hi,
>> >> > 
>> >> > Yes, this is exactly what I meant.
>> >> > Please find attached the output for 2 input datasets, each with 2
>> >> > different slurm configs:
>> >> > 
>> >> > A/ Matrix size=8000000x8000000
>> >> > 
>> >> > 1/ slurm-14432809.out,   930 ksp steps, ~90 secs
>> >> > 
>> >> > 
>> >> > #SBATCH --nodes=2
>> >> > #SBATCH --ntasks=32
>> >> > #SBATCH --ntasks-per-node=16
>> >> > #SBATCH --cpus-per-task=4
>> >> > 
>> >> > 2/  slurm-14432810.out, 100,000 ksp steps, ~9700 secs
>> >> > 
>> >> > #SBATCH --nodes=2
>> >> > #SBATCH --ntasks=32
>> >> > #SBATCH --ntasks-per-node=16
>> >> > #SBATCH --cpus-per-task=2
>> >> > 
>> >> > 
>> >> > 
>> >> > B/ Matrix size=1000x1000
>> >> > 
>> >> > 1/  slurm-23716.out, 511 ksp steps, ~ 28 secs
>> >> > #SBATCH --nodes=1
>> >> > #SBATCH --ntasks=64
>> >> > #SBATCH --ntasks-per-node=64
>> >> > #SBATCH --cpus-per-task=4
>> >> > 
>> >> > 
>> >> > 2/ slurm-23718.out, 94 ksp steps, ~ 4 secs
>> >> > 
>> >> > #SBATCH --nodes=1
>> >> > #SBATCH --ntasks=4
>> >> > #SBATCH --ntasks-per-node=4
>> >> > #SBATCH --cpus-per-task=4
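
One way to read the four runs listed above is to divide wall time by the number of KSP steps. The sketch below does only that arithmetic on the approximate figures quoted in this message; the labels and the function name seconds_per_step are illustrative assumptions.

```python
def seconds_per_step(steps, seconds):
    """Average wall time per KSP step."""
    return seconds / steps


# (label, ksp steps, approximate seconds), copied from the runs quoted above.
runs = [
    ("A/1 cpus-per-task=4", 930, 90.0),
    ("A/2 cpus-per-task=2", 100000, 9700.0),
    ("B/1 ntasks=64", 511, 28.0),
    ("B/2 ntasks=4", 94, 4.0),
]

if __name__ == "__main__":
    for label, steps, seconds in runs:
        print(f"{label}: {seconds_per_step(steps, seconds):.4f} s/step")
```

In case A both runs come out at roughly 0.097 s per step, which suggests the large time difference there comes from the iteration count and not from the cost of one iteration. The timings are approximate, so this is an indication only.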
>> >> > 
>> >> > 
>> >> > I would really appreciate any help...:)
>> >> > 
>> >> > Best,
>> >> > Damian
>> >> > 
>> >> > 
>> >> > 
>> >> > In the message dated 3 July 2017 (16:29:15), the following was written:
>> >> > 
>> >> > 
>> >> > On Mon, Jul 3, 2017 at 9:23 AM, Damian Kaliszan <damian at man.poznan.pl>
>> >> > wrote:
>> >> > Hi,
>> >> > 
>> >> > 
>> >> > >> 1) You can call Bcast on PETSC_COMM_WORLD
>> >> > 
>> >> > To be honest, I can't find a Bcast method in petsc4py.PETSc.Comm (I'm
>> >> > using petsc4py).
>> >> > 
>> >> > >> 2) If you are using WORLD, the number of iterates will be the same on
>> >> > >> each process since iteration is collective.
>> >> > 
>> >> > Yes, this is how it should be. But what I noticed is that for
>> >> > different --cpus-per-task values in the slurm script I get a different
>> >> > number of solver iterations, which in turn affects the timings. The
>> >> > disparity is huge. For example, for some configurations with
>> >> > --cpus-per-task=1 I get 900 iterations, while for --cpus-per-task=2
>> >> > I get 100,000, which is the maximum number of iterations set
>> >> > when setting the solver tolerances.
>> >> > 
>> >> > I am trying to understand what you are saying. You mean that you make 2
>> >> > different runs and get a different
>> >> > number of iterates with a KSP? In order to answer questions about
>> >> > convergence, we need to see the output
>> >> > of
>> >> > 
>> >> > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
>> >> > 
>> >> > for all cases.
>> >> > 
>> >> > Thanks,
>> >> > 
>> >> >    Matt
>> >> > 
>> >> > Best,
>> >> > Damian
>> >> > 
>> >> > 
>> >> > 
>> >> > 
>> >> > --
>> >> > What most experimenters take for granted before they begin their
>> >> > experiments is infinitely more interesting than any results to which
>> >> > their
>> >> > experiments lead.
>> >> > -- Norbert Wiener
>> >> > 
>> >> > http://www.caam.rice.edu/~mk51/
>> >> > 
>> >> > 
>> >> > 
>> >> > 
>> >> > 
>> >> > -------------------------------------------------------
>> >> > Damian Kaliszan
>> >> > 
>> >> > Poznan Supercomputing and Networking Center
>> >> > HPC and Data Centres Technologies
>> >> > ul. Jana Pawła II 10
>> >> > 61-139 Poznan
>> >> > POLAND
>> >> > 
>> >> > phone (+48 61) 858 5109
>> >> > e-mail damian at man.poznan.pl
>> >> > www - http://www.man.poznan.pl/
>> >> > -------------------------------------------------------
>> >> > <slum_output.zip>
>> >> 
>> >>