[petsc-users] petsc4py help with parallel execution

Dave May dave.mayhem23 at gmail.com
Fri Nov 16 12:18:40 CST 2018


On Fri, 16 Nov 2018 at 19:02, Ivan Voznyuk <ivan.voznyuk.work at gmail.com>
wrote:

> Hi Satish,
> Thanks for your reply.
>
> Bad news... I tested the 2 solutions you proposed; neither has worked.
>

You don't still have
OMP_NUM_THREADS=1
set in your environment, do you?

Can you print the value of this env variable from within your Python code
and confirm it's not 1?
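
For example, a minimal check from inside the script (standard library only):

    import os
    print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))

It may also be worth checking which MKL threading layer actually ends up
loaded at runtime (a Linux-only sketch; look for libmkl_intel_thread vs
libmkl_sequential in the output):

    from petsc4py import PETSc  # loads libpetsc and whichever MKL it links
    with open("/proc/self/maps") as f:
        print({line.rsplit("/", 1)[-1].strip() for line in f if "mkl" in line})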



>
> 1. --with-blaslapack-dir=/opt/intel/mkl
> --with-mkl_pardiso-dir=/opt/intel/mkl installed well, without any problems.
> However, the code still runs sequentially.
> 2. When I changed -lmkl_sequential to -lmkl_intel_thread -liomp, the linker
> at first could not find it, so I had to create a symbolic link of
> libiomp5.so in /lib.
> When launching the .py code I had to set:
> export
> LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_core.so:/opt/intel/mkl/lib/intel64/libmkl_sequential.so
> and
> export LD_LIBRARY_PATH=/opt/petsc/petsc1/arch-linux2-c-debug/lib/
>
> But this still does not solve the problem, and the code is still running
> sequentially...
>
> Maybe you have some other ideas?
>
> Thanks,
> Ivan
>
>
>
>
> On Fri, Nov 16, 2018 at 6:11 PM Balay, Satish <balay at mcs.anl.gov> wrote:
>
>> Yes PETSc prefers sequential MKL - as MPI handles parallelism.
>>
>> One way to trick PETSc configure into using threaded MKL is to enable
>> pardiso, i.e.:
>>
>> --with-blaslapack-dir=/opt/intel/mkl --with-mkl_pardiso-dir=/opt/intel/mkl
>>
>>
>> http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2018/11/15/configure_master_arch-pardiso_grind.log
>>
>> BLAS/LAPACK: -Wl,-rpath,/soft/com/packages/intel/16/u3/mkl/lib/intel64
>> -L/soft/com/packages/intel/16/u3/mkl/lib/intel64 -lmkl_intel_lp64
>> -lmkl_core -lmkl_intel_thread -liomp5 -ldl -lpthread
>>
>> Or you can manually specify the correct MKL library list [with
>> threading] via the --with-blaslapack-lib option.
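>>
>> For example (a sketch, untested; adapting the threaded link line above to
>> the /opt/intel/mkl install used here):
>>
>>   --with-blaslapack-lib="-L/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -ldl"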
>>
>> Satish
>>
>> On Fri, 16 Nov 2018, Ivan Voznyuk via petsc-users wrote:
>>
>> > Hi,
>> > You were totally right: no miracle, the parallelization does come from
>> > multithreading. We checked Option 1/: setting OMP_NUM_THREADS=1 changed
>> > the computational time.
>> >
>> > So, I reinstalled everything (starting with Ubuntu and ending with PETSc)
>> > and configured the following things:
>> >
>> > - installed the system's OpenMPI
>> > - installed Intel MKL BLAS / LAPACK
>> > - configured PETSc as ./configure --with-cc=mpicc --with-fc=mpif90
>> > --with-cxx=mpicxx --with-blas-lapack-dir=/opt/intel/mkl/lib/intel64
>> > --download-scalapack --download-mumps --with-hwloc --with-shared
>> > --with-openmp=1 --with-pthread=1 --with-scalar-type=complex
>> > hoping that it would take BLAS multithreading into account
>> > - installed petsc4py
>> >
>> > However, I do not get any parallelization...
>> > What I have tried so far, unsuccessfully:
>> > - play with OMP_NUM_THREADS
>> > - reinstall the system
>> > - ldd PETSc.cpython-35m-x86_64-linux-gnu.so yields lld_result.txt (here
>> > attached). I noted the libmkl_sequential.so library there. Do you think
>> > this is normal?
>> > - I found a similar problem reported here:
>> > https://lists.mcs.anl.gov/pipermail/petsc-users/2016-March/028803.html
>> > To solve this problem, the developers recommended replacing
>> > -lmkl_sequential with -lmkl_intel_thread in
>> > PETSC_ARCH/lib/conf/petscvariables. However, I did not find anything
>> > named like this (it might be a version change).
>> > - Anyway, I replaced lmkl_sequential with lmkl_intel_thread in every
>> > PETSc file, but it changed nothing.
>> >
>> > As a result, in the new make.log (here attached) I have the parameter
>> > #define PETSC_HAVE_LIBMKL_SEQUENTIAL 1 and the option -lmkl_sequential
>> >
>> > Do you have any idea of what I should change in the initial options in
>> > order to obtain BLAS multithreading parallelization?
>> >
>> > Thanks a lot for your help!
>> >
>> > Ivan
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Nov 16, 2018 at 1:25 AM Dave May <dave.mayhem23 at gmail.com>
>> wrote:
>> >
>> > >
>> > >
>> > > On Thu, 15 Nov 2018 at 17:44, Ivan via petsc-users <
>> > > petsc-users at mcs.anl.gov> wrote:
>> > >
>> > >> Hi Stefano,
>> > >>
>> > >> In fact, yes, we look at the htop output (and the resulting
>> > >> computational time, of course).
>> > >>
>> > >> In our code we use MUMPS, which indeed depends on blas / lapack. So I
>> > >> think this might be it!
>> > >>
>> > >> I will definitely check it (I mean the difference between our MUMPS,
>> > >> blas, lapack).
>> > >>
>> > >> If you have an idea of how we can verify on his PC that the source
>> of his
>> > >> parallelization does come from BLAS, please do not hesitate to tell
>> me!
>> > >>
>> > >
>> > > Option 1/
>> > > * Set this environment variable
>> > >   export OMP_NUM_THREADS=1
>> > > * Re-run your "parallel" test.
>> > > * If the performance differs (job runs slower) compared with your
>> previous
>> > > run where you inferred parallelism was being employed, you can safely
>> > > assume that the parallelism observed comes from threads
>> > >
>> > > Option 2/
>> > > * Re-configure PETSc to use a known BLAS implementation which does not
>> > > support threads
>> > > * Re-compile PETSc
>> > > * Re-run your parallel test
>> > > * If the performance differs (job runs slower) compared with your
>> previous
>> > > run where you inferred parallelism was being employed, you can safely
>> > > assume that the parallelism observed comes from threads
>> > >
>> > > Option 3/
>> > > * Use a PC which does not depend on BLAS at all,
>> > > e.g. -pc_type jacobi or -pc_type bjacobi
>> > > * If the performance differs (job runs slower) compared with your
>> previous
>> > > run where you inferred parallelism was being employed, you can safely
>> > > assume that the parallelism observed comes from BLAS + threads
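>> > >
>> > > Regarding Option 1/, a minimal sketch of the comparison (assuming the
>> > > script is the simple_code.py from this thread and the 8 cores mentioned
>> > > elsewhere in it; adjust the name and thread counts as needed):
>> > >
>> > >   import os, subprocess, time
>> > >   for nthreads in ("1", "8"):
>> > >       env = dict(os.environ, OMP_NUM_THREADS=nthreads)
>> > >       t0 = time.time()
>> > >       subprocess.run(["python3", "simple_code.py"], env=env, check=True)
>> > >       print("OMP_NUM_THREADS=%s: %.1f s" % (nthreads, time.time() - t0))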
>> > >
>> > >
>> > >
>> > >> Thanks!
>> > >>
>> > >> Ivan
>> > >> On 15/11/2018 18:24, Stefano Zampini wrote:
>> > >>
>> > >> If you say your program is parallel by just looking at the output
>> from
>> > >> the top command, you are probably linking against a multithreaded
>> blas
>> > >> library
>> > >>
>> > >> Il giorno Gio 15 Nov 2018, 20:09 Matthew Knepley via petsc-users <
>> > >> petsc-users at mcs.anl.gov> ha scritto:
>> > >>
>> > >>> On Thu, Nov 15, 2018 at 11:59 AM Ivan Voznyuk <
>> > >>> ivan.voznyuk.work at gmail.com> wrote:
>> > >>>
>> > >>>> Hi Matthew,
>> > >>>>
>> > >>>> Does it mean that by using just the command python3 simple_code.py
>> (without
>> > >>>> mpiexec) you *cannot* obtain a parallel execution?
>> > >>>>
>> > >>>
>> > >>> As I wrote before, it's not impossible. You could be directly calling
>> > >>> PMI, but I do not think you are doing that.
>> > >>>
>> > >>>
>> > >>>> My colleague and I have been trying for 5 days to understand how he
>> > >>>> managed to do so.
>> > >>>> It means that by simply using python3 simple_code.py he gets 8
>> > >>>> processors working.
>> > >>>> By the way, we wrote in his code few lines:
>> > >>>> rank = PETSc.COMM_WORLD.Get_rank()
>> > >>>> size = PETSc.COMM_WORLD.Get_size()
>> > >>>> and we got rank = 0, size = 1
>> > >>>>
>> > >>>
>> > >>> This is MPI telling you that you are only running on 1 process.
>> > >>>
>> > >>>
>> > >>>> However, when the code arrives at KSP.solve(), it somehow turns on 8
>> > >>>> processors.
>> > >>>>
>> > >>>
>> > >>> Why do you think it's running on 8 processes?
>> > >>>
>> > >>>
>> > >>>> This problem is solved on his PC in 5-8 sec (in parallel, using
>> > >>>> *python3 simple_code.py*); on mine it takes 70-90 secs (sequentially,
>> > >>>> but with the same command *python3 simple_code.py*).
>> > >>>>
>> > >>>
>> > >>> I think it's much more likely that there are differences in the
>> > >>> solver (use -ksp_view to see exactly what solver was used) than
>> > >>> that it is parallelism. Moreover, you would never ever ever see that
>> > >>> much speedup on a laptop, since all these computations are
>> > >>> bandwidth limited.
>> > >>>
>> > >>>   Thanks,
>> > >>>
>> > >>>      Matt
>> > >>>
>> > >>>
>> > >>>> So, the conclusion is that on his computer this code works in the
>> > >>>> same way as scipy: all the code is executed in sequential mode, but
>> > >>>> when it comes to the solution of the system of linear equations, it
>> > >>>> runs on all available processors. All this with just running
>> > >>>> python3 my_code.py (without any mpi-smth).
>> > >>>>
>> > >>>> Is it an exception / abnormal behavior? I mean, is it something
>> > >>>> irregular that you, developers, have never seen?
>> > >>>>
>> > >>>> Thanks and have a good evening!
>> > >>>> Ivan
>> > >>>>
>> > >>>> P.S. I don't think I know the answer regarding Scipy...
>> > >>>>
>> > >>>>
>> > >>>> On Thu, Nov 15, 2018 at 2:39 PM Matthew Knepley <knepley at gmail.com
>> >
>> > >>>> wrote:
>> > >>>>
>> > >>>>> On Thu, Nov 15, 2018 at 8:07 AM Ivan Voznyuk <
>> > >>>>> ivan.voznyuk.work at gmail.com> wrote:
>> > >>>>>
>> > >>>>>> Hi Matthew,
>> > >>>>>> Thanks for your reply!
>> > >>>>>>
>> > >>>>>> Let me clarify what I mean with a few questions:
>> > >>>>>>
>> > >>>>>> 1. In order to obtain a parallel execution of simple_code.py, do I
>> > >>>>>> need to go with mpiexec python3 simple_code.py, or can I just launch
>> > >>>>>> python3 simple_code.py?
>> > >>>>>>
>> > >>>>>
>> > >>>>> mpiexec -n 2 python3 simple_code.py
>> > >>>>>
>> > >>>>>
>> > >>>>>> 2. This simple_code.py consists of 2 parts: a) preparation of
>> matrix
>> > >>>>>> b) solving the system of linear equations with PETSc. If I
>> launch mpirun
>> > >>>>>> (or mpiexec) -np 8 python3 simple_code.py, I suppose that I will
>> basically
>> > >>>>>> obtain 8 matrices and 8 systems to solve. However, I need to
>> prepare only
>> > >>>>>> one matrix, but launch this code in parallel on 8 processors.
>> > >>>>>>
>> > >>>>>
>> > >>>>> When you create the Mat object, you give it a communicator (here
>> > >>>>> PETSC_COMM_WORLD). That allows us to distribute the data. This is
>> all
>> > >>>>> covered extensively in the manual and the online tutorials, as
>> well as the
>> > >>>>> example code.
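>> > >>>>>
>> > >>>>> A minimal sketch of that pattern (not your attached code; just an
>> > >>>>> illustration with a diagonal matrix, so every rank owns a block of
>> > >>>>> rows and the solve runs across MPI processes):
>> > >>>>>
>> > >>>>>   from petsc4py import PETSc
>> > >>>>>
>> > >>>>>   n = 100
>> > >>>>>   A = PETSc.Mat().createAIJ([n, n], nnz=1, comm=PETSc.COMM_WORLD)
>> > >>>>>   rstart, rend = A.getOwnershipRange()  # this rank's rows
>> > >>>>>   for i in range(rstart, rend):
>> > >>>>>       A.setValue(i, i, 2.0)
>> > >>>>>   A.assemble()
>> > >>>>>
>> > >>>>>   b = A.createVecLeft(); b.set(1.0)
>> > >>>>>   x = A.createVecRight()
>> > >>>>>
>> > >>>>>   ksp = PETSc.KSP().create(comm=PETSc.COMM_WORLD)
>> > >>>>>   ksp.setOperators(A)
>> > >>>>>   ksp.setFromOptions()
>> > >>>>>   ksp.solve(b, x)
>> > >>>>>
>> > >>>>> Run it as mpiexec -n 8 python3 script.py and each rank gets roughly
>> > >>>>> n/8 rows.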
>> > >>>>>
>> > >>>>>
>> > >>>>>> In fact, here attached you will find a similar code
>> (scipy_code.py)
>> > >>>>>> with only one difference: the system of linear equations is
>> solved with
>> > >>>>>> scipy. So when I solve it, I can clearly see that the solution
>> is obtained
>> > >>>>>> in a parallel way. However, I do not use the command mpirun (or
>> mpiexec). I
>> > >>>>>> just go with python3 scipy_code.py.
>> > >>>>>>
>> > >>>>>
>> > >>>>> Why do you think it's running in parallel?
>> > >>>>>
>> > >>>>>   Thanks,
>> > >>>>>
>> > >>>>>      Matt
>> > >>>>>
>> > >>>>>
>> > >>>>>> In this case, the first part (creation of the sparse matrix) is
>> > >>>>>> not parallel, whereas the solution of the system is found in a
>> > >>>>>> parallel way. So my question is: do you think it is possible to
>> > >>>>>> have the same behavior with PETSc? And what do I need for this?
>> > >>>>>>
>> > >>>>>> I am asking this because for my colleague it worked! It means
>> that he
>> > >>>>>> launches the simple_code.py on his computer using the command
>> python3
>> > >>>>>> simple_code.py (and not mpi-smth python3 simple_code.py) and he
>> obtains a
>> > >>>>>> parallel execution of the same code.
>> > >>>>>>
>> > >>>>>> Thanks for your help!
>> > >>>>>> Ivan
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Thu, Nov 15, 2018 at 11:54 AM Matthew Knepley <
>> knepley at gmail.com>
>> > >>>>>> wrote:
>> > >>>>>>
>> > >>>>>>> On Thu, Nov 15, 2018 at 4:53 AM Ivan Voznyuk via petsc-users <
>> > >>>>>>> petsc-users at mcs.anl.gov> wrote:
>> > >>>>>>>
>> > >>>>>>>> Dear PETSC community,
>> > >>>>>>>>
>> > >>>>>>>> I have a question regarding the parallel execution of petsc4py.
>> > >>>>>>>>
>> > >>>>>>>> I have a simple code (here attached: simple_code.py) which solves
>> > >>>>>>>> a system of linear equations Ax=b using petsc4py. To execute it, I
>> > >>>>>>>> use the command python3 simple_code.py, which yields sequential
>> > >>>>>>>> performance. With a colleague of mine, we launched this code on his
>> > >>>>>>>> computer, and this time the execution was in parallel, although he
>> > >>>>>>>> used the same command python3 simple_code.py (without mpirun or
>> > >>>>>>>> mpiexec).
>> > >>>>>>>>
>> > >>>>>>> I am not sure what you mean. To run MPI programs in parallel,
>> you
>> > >>>>>>> need a launcher like mpiexec or mpirun. There are Python
>> programs (like
>> > >>>>>>> nemesis) that use the launcher API directly (called PMI), but
>> that is not
>> > >>>>>>> part of petsc4py.
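>> > >>>>>>>
>> > >>>>>>> A quick sanity check (a sketch): put these lines at the top of
>> > >>>>>>> simple_code.py and compare python3 simple_code.py with
>> > >>>>>>> mpiexec -n 4 python3 simple_code.py; only the latter should report
>> > >>>>>>> a size greater than 1.
>> > >>>>>>>
>> > >>>>>>>   from petsc4py import PETSc
>> > >>>>>>>   rank = PETSc.COMM_WORLD.Get_rank()
>> > >>>>>>>   size = PETSc.COMM_WORLD.Get_size()
>> > >>>>>>>   print("rank %d of %d" % (rank, size))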
>> > >>>>>>>
>> > >>>>>>>   Thanks,
>> > >>>>>>>
>> > >>>>>>>      Matt
>> > >>>>>>>
>> > >>>>>>>> My configuration: Ubuntu 16.04 x86_64, Intel Core i7, PETSc
>> > >>>>>>>> 3.10.2, PETSC_ARCH=arch-linux2-c-debug, petsc4py 3.10.0 in a
>> > >>>>>>>> virtualenv
>> > >>>>>>>>
>> > >>>>>>>> In order to parallelize it, I have already tried:
>> > >>>>>>>> - use 2 different PCs
>> > >>>>>>>> - use Ubuntu 16.04, 18.04
>> > >>>>>>>> - use different architectures (arch-linux2-c-debug,
>> > >>>>>>>> linux-gnu-c-debug, etc)
>> > >>>>>>>> - of course, use different configurations (my present config can
>> > >>>>>>>> be found in the make.log that I attached here)
>> > >>>>>>>> - mpi from mpich, openmpi
>> > >>>>>>>>
>> > >>>>>>>> Nothing worked.
>> > >>>>>>>>
>> > >>>>>>>> Do you have any ideas?
>> > >>>>>>>>
>> > >>>>>>>> Thanks and have a good day,
>> > >>>>>>>> Ivan
>> > >>>>>>>>
>> > >>>>>>>> --
>> > >>>>>>>> Ivan VOZNYUK
>> > >>>>>>>> PhD in Computational Electromagnetics
>> > >>>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> --
>> > >>>>>>> What most experimenters take for granted before they begin their
>> > >>>>>>> experiments is infinitely more interesting than any results to
>> which their
>> > >>>>>>> experiments lead.
>> > >>>>>>> -- Norbert Wiener
>> > >>>>>>>
>> > >>>>>>> https://www.cse.buffalo.edu/~knepley/
>> > >>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>> > >>>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> --
>> > >>>>>> Ivan VOZNYUK
>> > >>>>>> PhD in Computational Electromagnetics
>> > >>>>>> +33 (0)6.95.87.04.55
>> > >>>>>> My webpage <https://ivanvoznyukwork.wixsite.com/webpage>
>> > >>>>>> My LinkedIn <http://linkedin.com/in/ivan-voznyuk-b869b8106>
>> > >>>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>> --
>> > >>>>> What most experimenters take for granted before they begin their
>> > >>>>> experiments is infinitely more interesting than any results to
>> which their
>> > >>>>> experiments lead.
>> > >>>>> -- Norbert Wiener
>> > >>>>>
>> > >>>>> https://www.cse.buffalo.edu/~knepley/
>> > >>>>> <http://www.cse.buffalo.edu/~knepley/>
>> > >>>>>
>> > >>>>
>> > >>>>
>> > >>>> --
>> > >>>> Ivan VOZNYUK
>> > >>>> PhD in Computational Electromagnetics
>> > >>>> +33 (0)6.95.87.04.55
>> > >>>> My webpage <https://ivanvoznyukwork.wixsite.com/webpage>
>> > >>>> My LinkedIn <http://linkedin.com/in/ivan-voznyuk-b869b8106>
>> > >>>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> What most experimenters take for granted before they begin their
>> > >>> experiments is infinitely more interesting than any results to
>> which their
>> > >>> experiments lead.
>> > >>> -- Norbert Wiener
>> > >>>
>> > >>> https://www.cse.buffalo.edu/~knepley/
>> > >>> <http://www.cse.buffalo.edu/~knepley/>
>> > >>>
>> > >>
>> >
>> >
>>
>>
>
> --
> Ivan VOZNYUK
> PhD in Computational Electromagnetics
> +33 (0)6.95.87.04.55
> My webpage <https://ivanvoznyukwork.wixsite.com/webpage>
> My LinkedIn <http://linkedin.com/in/ivan-voznyuk-b869b8106>
>

