[petsc-users] Strange efficiency in PETSc-dev using OpenMP

Danyang Su danyang.su at gmail.com
Mon Sep 23 14:13:48 CDT 2013


Hi Barry,

Another strange problem:

Currently I have the PETSc-3.4.2 MPI version and the PETSc-dev OpenMP 
version on my computer, with different PETSC_ARCH and PETSC_DIR 
environment variables. Before installing the PETSc-dev OpenMP version, 
the PETSc-3.4.2 MPI version worked fine. But after installing the 
PETSc-dev OpenMP version, the same problem appears in the PETSc-3.4.2 
MPI version when run with 1 processor, though there is no problem with 
2 or more processors.

Thanks,

Danyang

On 23/09/2013 12:01 PM, Danyang Su wrote:
> Hi Barry,
>
> Sorry, I forgot to answer that in the previous email. It is still 
> slow when run without the "-threadcomm_type openmp 
> -threadcomm_nthreads 1" options.
>
> Thanks,
>
> Danyang
>
> On 23/09/2013 11:43 AM, Barry Smith wrote:
>>     You did not answer my question from yesterday:
>>
>>   If you run the OpenMP-compiled version WITHOUT the
>>
>> -threadcomm_nthreads 1
>> -threadcomm_type openmp
>>
>>   command-line options, is it still slow?
>>
>>
>> On Sep 23, 2013, at 1:33 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>
>>> Hi Shri,
>>>
>>> It seems that the problem does not result from the affinity 
>>> settings for the threads. I have tried several settings that pin 
>>> the threads to different cores, but there is no improvement.
>>>
>>> Here is the package, core, and thread map information:
>>>
>>> OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
>>> OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
>>> OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}
>>> OMP: Info #156: KMP_AFFINITY: 12 available OS procs
>>> OMP: Info #157: KMP_AFFINITY: Uniform topology
>>> OMP: Info #179: KMP_AFFINITY: 1 packages x 6 cores/pkg x 2 threads/core (6 total cores)
>>> OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
>>> OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
>>> OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
>>> OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
>>> OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
>>> OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
>>> OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
>>> OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
>>> OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
>>> OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 4 thread 0
>>> OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 4 thread 1
>>> OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 5 thread 0
>>> OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 5 thread 1
>>> OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
>>>
>>>
>>> And here is the internal thread binding with different KMP_AFFINITY 
>>> settings:
>>>
>>> 1. KMP_AFFINITY=verbose,granularity=thread,compact
>>>
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}
>>>
>>> 2. KMP_AFFINITY=verbose,granularity=fine,compact
>>>
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {3}
>>>
>>> 3. KMP_AFFINITY=verbose,granularity=fine,compact,1,0
>>>
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6}
>>>
>>> 4. KMP_AFFINITY=verbose,scatter
>>>
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4,5}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6,7}
>>>
>>> 5. KMP_AFFINITY=verbose,compact (For this setting, two threads are assigned to the same core)
>>>
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,3}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {2,3}
>>>
>>> 6. KMP_AFFINITY=verbose,granularity=core,compact (For this setting, two threads are assigned to the same core)
>>>
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,3}
>>> OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {2,3}
>>>
>>> The first 4 settings assign each thread to a distinct core, but the 
>>> problem is not solved.
>>>
>>> Thanks,
>>>
>>> Danyang
>>>
>>>
>>>
>>> On 22/09/2013 8:00 PM, Shri wrote:
>>>> I think this is definitely an issue with setting the affinities for 
>>>> threads, i.e., the assignment of threads to cores. Ideally each 
>>>> thread should be assigned to a distinct core but in your case all 
>>>> the 4 threads are getting pinned to the same core resulting in such 
>>>> a massive slowdown. Unfortunately, the thread affinities for OpenMP 
>>>> are set through environment variables. For Intel's OpenMP one needs 
>>>> to define the thread affinities through the environment variable 
>>>> KMP_AFFINITY. See this document here 
>>>> http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/optaps/common/optaps_openmp_thread_affinity.htm. 
>>>> Try setting the affinities via KMP_AFFINITY and let us know if it 
>>>> works.
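>>>>
>>>> If it helps to verify the pinning, here is a minimal check (a 
>>>> sketch of my own, nothing PETSc-specific; the program and its name 
>>>> are just an illustration, assuming any OpenMP-enabled Fortran 
>>>> compiler). Run it with KMP_AFFINITY=verbose and match the printed 
>>>> thread ids against the binding report:
>>>>
>>>> program affinity_check
>>>>   use omp_lib
>>>>   implicit none
>>>>   integer :: tid
>>>>   ! each thread in the team reports its id and the team size
>>>>   !$omp parallel private(tid)
>>>>   tid = omp_get_thread_num()
>>>>   write (*, '(a,i0,a,i0)') 'thread ', tid, ' of ', omp_get_num_threads()
>>>>   !$omp end parallel
>>>> end program affinity_check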
>>>>
>>>> Shri
>>>> On Sep 21, 2013, at 11:06 PM, Danyang Su wrote:
>>>>
>>>>> Hi Shri,
>>>>>
>>>>> Thanks for your info. It works with the option -threadcomm_type 
>>>>> openmp. But another problem arises, as described below.
>>>>>
>>>>> The sparse matrix is 53760 x 53760 with 1067392 non-zero entries. 
>>>>> If the code is compiled using PETSc-3.4.2, it works fine: the 
>>>>> equations are solved quickly and I can see the speedup. But if 
>>>>> the code is compiled using PETSc-dev with the OpenMP option, it 
>>>>> takes a long time to solve the equations and I cannot see any 
>>>>> speedup when more threads are used.
>>>>>
>>>>> For PETSc-3.4.2, run with "mpiexec -n 4 ksp_inhm_d -log_summary 
>>>>> log_mpi4_petsc3.4.2.log", the iteration count and runtimes are:
>>>>> Iterations     6 time_assembly  0.4137E-01 time_ksp 0.9296E-01
>>>>>
>>>>> For PETSc-dev, run with "mpiexec -n 1 ksp_inhm_d -threadcomm_type 
>>>>> openmp -threadcomm_nthreads 4 -log_summary 
>>>>> log_openmp_petsc_dev.log", the iteration count and runtimes are:
>>>>> Iterations     6 time_assembly  0.3595E+03 time_ksp 0.2907E+00
>>>>>
>>>>> Most of the time (time_assembly  0.3595E+03) is spent in the 
>>>>> following code:
>>>>>
>>>>>                  ! insert one CSR row at a time: i is the 0-based
>>>>>                  ! global row index; ia_in/ja_in/a_in hold the
>>>>>                  ! matrix in 1-based CSR form
>>>>>                  do i = istart, iend - 1
>>>>>                     ii = ia_in(i+1)   ! start of row i in ja_in/a_in
>>>>>                     jj = ia_in(i+2)   ! start of row i+1
>>>>>                     call MatSetValues(a, ione, i, jj-ii,           &
>>>>>                                       ja_in(ii:jj-1)-1,            &
>>>>>                                       a_in(ii:jj-1),               &
>>>>>                                       INSERT_VALUES, ierr)
>>>>>                  end do
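>>>>>
>>>>> (For reference, the loop is followed by the standard PETSc 
>>>>> assembly calls before the solve; a sketch with the same variable 
>>>>> names, not copied from my code:)
>>>>>
>>>>>                  call MatAssemblyBegin(a, MAT_FINAL_ASSEMBLY, ierr)
>>>>>                  call MatAssemblyEnd(a, MAT_FINAL_ASSEMBLY, ierr)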
>>>>>
>>>>> The log files for both PETSc-3.4.2 and PETSc-dev are attached.
>>>>>
>>>>> Is there anything wrong with my code or with the run options? The 
>>>>> code above works fine when using MPICH.
>>>>>
>>>>> Thanks and regards,
>>>>>
>>>>> Danyang
>>>>>
>>>>> On 21/09/2013 2:09 PM, Shri wrote:
>>>>>> There are three thread communicator types in PETSc. The default 
>>>>>> is "no thread", which is basically a non-threaded version. The 
>>>>>> other two types are "openmp" and "pthread". If you want to use 
>>>>>> OpenMP, then use the option -threadcomm_type openmp.
>>>>>>
>>>>>> Shri
>>>>>>
>>>>>> On Sep 21, 2013, at 3:46 PM, Danyang Su <danyang.su at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Barry,
>>>>>>>
>>>>>>> Thanks for the quick reply.
>>>>>>>
>>>>>>> After changing
>>>>>>>
>>>>>>> #if defined(PETSC_HAVE_PTHREADCLASSES) || defined (PETSC_HAVE_OPENMP)
>>>>>>>
>>>>>>> to
>>>>>>>
>>>>>>> #if defined(PETSC_HAVE_PTHREADCLASSES)
>>>>>>>
>>>>>>> and commenting out
>>>>>>>
>>>>>>> #elif defined(PETSC_HAVE_OPENMP)
>>>>>>> PETSC_EXTERN PetscStack *petscstack;
>>>>>>>
>>>>>>> it can be compiled and validated with "make test".
>>>>>>>
>>>>>>> But I still have questions about running the examples. After 
>>>>>>> rebuilding the code (e.g., ksp_ex2f.f), I can run it with 
>>>>>>> "mpiexec -n 1 ksp_ex2f", "mpiexec -n 4 ksp_ex2f", or "mpiexec 
>>>>>>> -n 1 ksp_ex2f -threadcomm_nthreads 1", but if I run it with 
>>>>>>> "mpiexec -n 1 ksp_ex2f -threadcomm_nthreads 4", a lot of error 
>>>>>>> output is produced (attached).
>>>>>>>
>>>>>>> The code is not modified and there are no OpenMP routines in it. 
>>>>>>> For the current development in my project, I want to keep the 
>>>>>>> OpenMP code that calculates the matrix values, but solve the 
>>>>>>> system with PETSc (OpenMP). Is that possible? (Roughly what I 
>>>>>>> mean is sketched below.)
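>>>>>>>
>>>>>>> ! A sketch only: compute_row is a placeholder name, not my
>>>>>>> ! actual routine, and the CSR arrays are as in my other email.
>>>>>>> subroutine fill_matrix(nrows, ia_in, ja_in, a_in)
>>>>>>>   implicit none
>>>>>>>   integer, intent(in)             :: nrows, ia_in(:), ja_in(:)
>>>>>>>   double precision, intent(inout) :: a_in(:)
>>>>>>>   integer :: i
>>>>>>>   ! OpenMP part: each thread fills a disjoint set of rows
>>>>>>>   !$omp parallel do private(i)
>>>>>>>   do i = 1, nrows
>>>>>>>      call compute_row(i, ia_in, ja_in, a_in)  ! placeholder
>>>>>>>   end do
>>>>>>>   !$omp end parallel do
>>>>>>> end subroutine fill_matrix
>>>>>>> ! afterwards, outside the parallel region, the arrays go to
>>>>>>> ! PETSc (MatSetValues loop, MatAssemblyBegin/End) and KSPSolve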
>>>>>>>
>>>>>>> Thanks and regards,
>>>>>>>
>>>>>>> Danyang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 21/09/2013 7:26 AM, Barry Smith wrote:
>>>>>>>>    Danyang,
>>>>>>>>
>>>>>>>>       I don't think the || defined (PETSC_HAVE_OPENMP) belongs in the code below.
>>>>>>>>
>>>>>>>> /*  Linux functions CPU_SET and others don't work if sched.h is not included before
>>>>>>>>     including pthread.h. Also, these functions are active only if either _GNU_SOURCE
>>>>>>>>     or __USE_GNU is not set (see /usr/include/sched.h and /usr/include/features.h), hence
>>>>>>>>     set these first.
>>>>>>>> */
>>>>>>>> #if defined(PETSC_HAVE_PTHREADCLASSES) || defined (PETSC_HAVE_OPENMP)
>>>>>>>>
>>>>>>>> Edit include/petscerror.h, locate these lines, remove that 
>>>>>>>> part, and then rerun make all. Let us know if it works or not.
>>>>>>>>
>>>>>>>>     Barry
>>>>>>>>
>>>>>>>> i.e. replace
>>>>>>>>
>>>>>>>> #if defined(PETSC_HAVE_PTHREADCLASSES) || defined (PETSC_HAVE_OPENMP)
>>>>>>>>
>>>>>>>> with
>>>>>>>>
>>>>>>>> #if defined(PETSC_HAVE_PTHREADCLASSES)
>>>>>>>>
>>>>>>>> On Sep 21, 2013, at 6:53 AM, Matthew Knepley <petsc-maint at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Sat, Sep 21, 2013 at 12:18 AM, Danyang Su <danyang.su at gmail.com> wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I got errors when compiling petsc-dev with OpenMP under 
>>>>>>>>> Cygwin. Previously, I successfully compiled petsc-3.4.2 and 
>>>>>>>>> it works fine.
>>>>>>>>> The log files are attached.
>>>>>>>>>
>>>>>>>>> The OpenMP configure test is wrong. It clearly fails to find 
>>>>>>>>> pthread.h, but the test passes. Then in petscerror.h
>>>>>>>>> we guard pthread.h using PETSC_HAVE_OPENMP. Can someone who 
>>>>>>>>> knows OpenMP fix this?
>>>>>>>>>
>>>>>>>>>      Matt
>>>>>>>>>   Thanks,
>>>>>>>>>
>>>>>>>>> Danyang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> What most experimenters take for granted before they begin 
>>>>>>>>> their experiments is infinitely more interesting than any 
>>>>>>>>> results to which their experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>>>
>>>>>>> <error.txt>
>>>>> <log_mpi4_petsc3.4.2.log><log_openmp_petsc_dev.log>
>


