[petsc-users] Using OpenMP threads with PETSc

Lucas Clemente Vella lvella at gmail.com
Thu Apr 9 17:07:39 CDT 2015


2015-04-09 18:50 GMT-03:00 Jed Brown <jed at jedbrown.org>:
> Lucas Clemente Vella <lvella at gmail.com> writes:
>
>> For the next attempt, I've added the option "-threadcomm_type openmp",
>> as the page http://www.mcs.anl.gov/petsc/features/threads.html says...
>
> Need to configure --with-openmpclasses with petsc-dev to use
> -threadcomm_type openmp.
>
>> Now the program runs, but the CPU usage never goes above 133%,
>> which is much lower than expected (if I solve the same matrix with
>> Hypre + OpenMP, CPU usage peaks at 200% most of the time).
>
> Before getting carried away, run with flat MPI (no threads ever) and
> compare the time.  In many ways, this mode has fewer synchronizations
> and thus should perform better.  That's what we see on most machines
> with HPGMG, and lots of other people have found similar behavior.
>
> If it's slower, send -log_summary for both versions.  Not all operations
> are currently threaded.
>
>> So, what threading model gives best results in PETSc, OpenMP or
>> pthreads? And what do I need to do to use OpenMP? All my attempts were
>> single-threaded,
>
> You mean single-process.
>
>> but I will need to work with MPI; is it straightforward to spawn many
>> MPI processes (i.e., just put mpirun before the command)?
>
> Yes.
>
>> Lastly, how do I prevent PETSc from messing with CPU affinities? I
>> implement affinity control myself inside my program.
>
> Affinity fighting is a recurring disaster for library interoperability
> with threads.  I think what is currently done in PETSc is actively
> harmful and should be taken out.  But I think not using threads is a
> much better solution for almost everyone.

I just found that the 133% issue was caused by an affinity clash with
other batch jobs running on the same cluster node. If I manually change
the affinity mask to allow any CPU (letting the kernel pick), it runs
at 200%. I'll comment out all the CPU-affinity-changing calls in the
code and rebuild PETSc with --with-openmpclasses.
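
Concretely, I plan to run something like this (a rough sketch; the
-threadcomm_nthreads option is my reading of the docs, so treat it as
an assumption), alongside the flat-MPI baseline Jed asked to compare
against:

  # petsc-dev built with --with-openmpclasses --with-openmp
  mpirun -n 1 ./app -threadcomm_type openmp -threadcomm_nthreads 8 -log_summary

  # flat-MPI baseline on the same core count, no threads
  mpirun -n 8 ./app -log_summary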

I suspect the optimal setup is one process per NUMA node and one
thread per logical core, with each process (all of its threads)
affinity-locked to the cores of its NUMA node. The article mentioned
above showed improvements in almost all of the cases where OpenMP
threads were compared with MPI processes.
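
To make the affinity locking concrete, here is a minimal sketch of the
kind of pinning I mean, assuming Linux and assuming the cores of a NUMA
node are numbered contiguously (pin_to_numa_node is a hypothetical
name; on a real machine the core list should come from hwloc or /sys):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Restrict every OpenMP thread of the calling process to the cores of
 * one NUMA node.  Assumes node `node` owns cores node*cores_per_node
 * through node*cores_per_node + cores_per_node - 1. */
static void pin_to_numa_node(int node, int cores_per_node)
{
    #pragma omp parallel
    {
        cpu_set_t set;
        int c;
        CPU_ZERO(&set);
        for (c = 0; c < cores_per_node; c++)
            CPU_SET(node * cores_per_node + c, &set);
        /* Allow any core of this node; the kernel schedules within it. */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }
}

Each MPI rank would call this once after startup, with node = rank
modulo the number of NUMA nodes per host, which matches the
one-process-per-NUMA-node layout above.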

-- 
Lucas Clemente Vella
lvella at gmail.com

