[petsc-users] Is OpenMP still available for PETSc?
Damian Kaliszan
damian at man.poznan.pl
Wed Jul 5 03:49:23 CDT 2017
Thank you :)
A few notes on what you wrote:
1. I always try to keep t*c = number of cores. However, for a 64-core KNL
with hyperthreading switched on (cpuinfo shows 256 logical CPUs), should t*c
be 64 or 256? In other words, is t=64 and c=4 correct? (See the sbatch sketch after these notes.)
2. I noticed that for the same input data I may get different
timings in two cases:
a) A different number of KSP iterations is observed (why do they differ?).
Please see the screenshot Julia_N_10_4_vs_64.JPG for the following configs (this may be
related to the 64*4 issue above; which one is correct at first glance?):
Matrix size=1000x1000
1/ slurm-23716.out, 511 steps, ~ 28 secs
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=4
2/ slurm-23718.out, 94 steps, ~ 4 secs
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
b) An equal number of KSP iterations is observed, but with different timings (might this be
due to false sharing or oversubscription?).
Please see the screenshot Julia_N_10_64_vs_64.JPG.
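For reference, minimal sbatch sketches of the two layouts I have in mind for a
single 64-core KNL node (solver.py stands in for my petsc4py script; the
--cpu-bind syntax may differ between Slurm versions):

Layout A - one MPI rank per physical core, hyper-threads left idle (t*c = 64*1 = 64):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
srun --cpu-bind=cores python solver.py

Layout B - one MPI rank per physical core, all 4 hardware threads reserved per rank (t*c = 64*4 = 256):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=4          # only matters if the solver is actually multithreaded
srun --cpu-bind=cores python solver.py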
Best,
Damian
In a message dated 5 July 2017 (10:26:46), the following was written:
> The man page of slurm/sbatch is cumbersome.
> But you may think of:
> 1. tasks as "MPI processes"
> 2. cpus as "threads"
> You should always request resources as precisely as possible,
> that is, never use a bare --ntasks but prefer to:
> 1. use --nodes=n.
> 2. use --ntasks-per-node=t.
> 3. use --cpus-per-task=c.
> 4. for a start, make sure that t*c = number of cores you have per node.
> 5. use --exclusive; otherwise you may get VERY different timings if you run the same job twice.
> 6. make sure MPI is configured correctly (run the same single-threaded
> application twice [or more]: do you get the same timing?).
> 7. if using OpenMP or other multithreaded applications, make sure you have
> set thread affinity properly (GOMP_CPU_AFFINITY with GNU, KMP_AFFINITY with Intel); an example follows just after this list.
> 8. make sure you have enough memory (--mem), otherwise performance may be degraded (swapping).
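> For example, in the job script one might set something along these lines
> (adjust the thread count and the CPU list to your own node):
>
> export OMP_NUM_THREADS=4
> export GOMP_CPU_AFFINITY=0-255        # GNU OpenMP: pin threads to these CPUs
> # or, with the Intel runtime:
> # export KMP_AFFINITY=granularity=fine,compact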
> Rule of thumb 4 does NOT have to be respected, but if you break it you need to be
> aware of WHY you want to do that (for KNL, it may [or may not] make sense, depending on the cache mode).
> Remember that any multi-threaded (OpenMP or not) application may be a
> victim of false sharing
> (https://en.wikipedia.org/wiki/False_sharing): in that case, profiling
> with cache metrics may help you understand whether this is the problem,
> and track it down if so (you may use perf record for that).
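> For example, on a single node (my_app stands for your binary; the exact event
> names depend on the CPU and on the perf version installed):
>
> perf record -e cache-misses ./my_app    # sample cache misses while the app runs
> perf report                             # inspect where the misses occur
>
> Recent perf versions also provide "perf c2c record" / "perf c2c report", which
> are aimed specifically at spotting cache lines bounced between cores.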
> Understanding HW is not an easy thing: you really need to go step
> by step, otherwise you have no chance of understanding anything in the end.
> Hope this may help!
> Franck
> Note: activating/deactivating hyper-threading (if available -
> generally in the BIOS) may also change performance.
> ----- Original Message -----
>> From: "Barry Smith" <bsmith at mcs.anl.gov>
>> To: "Damian Kaliszan" <damian at man.poznan.pl>
>> Cc: "PETSc" <petsc-users at mcs.anl.gov>
>> Sent: Tuesday, 4 July 2017 19:04:36
>> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>>
>>
>> You may need to ask a slurm expert. I have no idea what cpus-per-task
>> means.
>>
>>
>> > On Jul 4, 2017, at 4:16 AM, Damian Kaliszan <damian at man.poznan.pl> wrote:
>> >
>> > Hi,
>> >
>> > Yes, this is exactly what I meant.
>> > Please find attached the output for two input datasets, with two different slurm
>> > configs each:
>> >
>> > A/ Matrix size=8000000x8000000
>> >
>> > 1/ slurm-14432809.out, 930 ksp steps, ~90 secs
>> >
>> >
>> > #SBATCH --nodes=2
>> > #SBATCH --ntasks=32
>> > #SBATCH --ntasks-per-node=16
>> > #SBATCH --cpus-per-task=4
>> >
>> > 2/ slurm-14432810.out, 100,000 ksp steps, ~9700 secs
>> >
>> > #SBATCH --nodes=2
>> > #SBATCH --ntasks=32
>> > #SBATCH --ntasks-per-node=16
>> > #SBATCH --cpus-per-task=2
>> >
>> >
>> >
>> > B/ Matrix size=1000x1000
>> >
>> > 1/ slurm-23716.out, 511 ksp steps, ~ 28 secs
>> > #SBATCH --nodes=1
>> > #SBATCH --ntasks=64
>> > #SBATCH --ntasks-per-node=64
>> > #SBATCH --cpus-per-task=4
>> >
>> >
>> > 2/ slurm-23718.out, 94 ksp steps, ~ 4 secs
>> >
>> > #SBATCH --nodes=1
>> > #SBATCH --ntasks=4
>> > #SBATCH --ntasks-per-node=4
>> > #SBATCH --cpus-per-task=4
>> >
>> >
>> > I would really appreciate any help...:)
>> >
>> > Best,
>> > Damian
>> >
>> >
>> >
>> > In a message dated 3 July 2017 (16:29:15), the following was written:
>> >
>> >
>> > On Mon, Jul 3, 2017 at 9:23 AM, Damian Kaliszan <damian at man.poznan.pl>
>> > wrote:
>> > Hi,
>> >
>> >
>> > >> 1) You can call Bcast on PETSC_COMM_WORLD
>> >
>> > To be honest, I can't find a Bcast method in petsc4py.PETSc.Comm (I'm
>> > using petsc4py).
>> >
>> > >> 2) If you are using WORLD, the number of iterates will be the same on
>> > >> each process since iteration is collective.
>> >
>> > Yes, this is how it should be. But what I noticed is that for
>> > different --cpus-per-task values in the slurm script I get a different
>> > number of solver iterations, which in turn affects the timings. The
>> > disparity is huge. For example, for some configurations with
>> > --cpus-per-task=1 I get 900
>> > iterations, while for --cpus-per-task=2 I get 100,000,
>> > which is the maximum
>> > iteration count set when configuring the solver tolerances.
>> >
>> > I am trying to understand what you are saying. You mean that you make 2
>> > different runs and get a different
>> > number of iterates with a KSP? In order to answer questions about
>> > convergence, we need to see the output
>> > of
>> >
>> > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
>> >
>> > for all cases.
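>> > These are run-time options; for a petsc4py script they can be appended to
>> > the launch line (assuming the script passes sys.argv to petsc4py.init),
>> > for example:
>> >
>> > srun python your_solver.py -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
>> >
>> > where your_solver.py stands for your script.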
>> >
>> > Thanks,
>> >
>> > Matt
>> >
>> > Best,
>> > Damian
>> >
>> >
>> >
>> >
>> > --
>> > What most experimenters take for granted before they begin their
>> > experiments is infinitely more interesting than any results to which their
>> > experiments lead.
>> > -- Norbert Wiener
>> >
>> > http://www.caam.rice.edu/~mk51/
>> >
>> >
>> >
>> >
>> >