<div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hi Satish,</div><div>Thanks for your reply.</div><div><br></div><div>Bad news: I tested the two solutions you proposed, and neither worked.<br></div><div><br></div><div>1. Configuring with --with-blaslapack-dir=/opt/intel/mkl --with-mkl_pardiso-dir=/opt/intel/mkl installed without any problems. However, the code still runs sequentially.</div><div>2. When I changed -lmkl_sequential to -lmkl_intel_thread -liomp, it at first could not find liomp, so I had to create a symbolic link to <code>libiomp5.so</code> in /lib.</div><div>To launch the .py code I had to set:</div><div>export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_core.so:/opt/intel/mkl/lib/intel64/libmkl_sequential.so</div><div>and</div><div>export LD_LIBRARY_PATH=/opt/petsc/petsc1/arch-linux2-c-debug/lib/</div><div><br></div><div>But this still does not solve the problem, and the code is still running sequentially...</div><div><br></div><div>Maybe you have some other ideas?</div><div><br></div><div>Thanks,</div><div>Ivan<br></div><div><br></div><div><br></div><div><br> </div></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Nov 16, 2018 at 6:11 PM Balay, Satish <<a href="mailto:balay@mcs.anl.gov">balay@mcs.anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Yes PETSc prefers sequential MKL - as MPI handles parallelism.<br>

<br>

One way to trick petsc configure to use threaded MKL is to enable pardiso. i.e:<br>

<br>

--with-blaslapack-dir=/opt/intel/mkl --with-mkl_pardiso-dir=/opt/intel/mkl<br>

<br>

<a href="http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2018/11/15/configure_master_arch-pardiso_grind.log" rel="noreferrer" target="_blank">http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2018/11/15/configure_master_arch-pardiso_grind.log</a><br>

<br>

BLAS/LAPACK: -Wl,-rpath,/soft/com/packages/intel/16/u3/mkl/lib/intel64 -L/soft/com/packages/intel/16/u3/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -ldl -lpthread<br>

<br>

Or you can manually specify the correct MKL library list [with<br>

threading] via --with-blaslapack-lib option.<br>
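<br>

A minimal sketch of how one might check which MKL threading layer a given link line selects, for example the BLAS/LAPACK line that configure prints. The helper name mkl_threading_layers and the label strings are hypothetical and not part of PETSc or MKL; only the library names are MKL's own.<br>

```python
# Hypothetical helper (not part of PETSc or MKL): report which MKL threading
# layer(s) a link line names through its -l flags.
THREADING_LAYERS = {
    "mkl_sequential": "sequential",
    "mkl_intel_thread": "threaded (Intel OpenMP)",
    "mkl_gnu_thread": "threaded (GNU OpenMP)",
    "mkl_tbb_thread": "threaded (TBB)",
}


def mkl_threading_layers(link_line):
    """Return the threading layers named by -l flags, in order of appearance."""
    found = []
    for token in link_line.split():
        if token.startswith("-l"):
            layer = THREADING_LAYERS.get(token[2:])
            if layer is not None and layer not in found:
                found.append(layer)
    return found


# The BLAS/LAPACK line quoted above from the pardiso configure log.
link_line = (
    "-Wl,-rpath,/soft/com/packages/intel/16/u3/mkl/lib/intel64 "
    "-L/soft/com/packages/intel/16/u3/mkl/lib/intel64 -lmkl_intel_lp64 "
    "-lmkl_core -lmkl_intel_thread -liomp5 -ldl -lpthread"
)
print(mkl_threading_layers(link_line))
```

If the result contains "sequential", or more than one entry, the link line is probably not the threaded one intended.<br>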

<br>

Satish<br>

<br>

On Fri, 16 Nov 2018, Ivan Voznyuk via petsc-users wrote:<br>

<br>

> Hi,<br>

> You were totally right: no miracle, the parallelization does come from<br>

> multithreading. We checked Option 1/: setting OMP_NUM_THREADS=1<br>

> changed the computational time.<br>

> <br>

> So, I reinstalled everything (starting with Ubuntu ending with petsc) and<br>

> configured the following things:<br>

> <br>

> - installed the system's openmpi<br>

> - installed Intel MKL Blas / Lapack<br>

> - configured PETSC as ./configure --with-cc=mpicc --with-fc=mpif90<br>

> --with-cxx=mpicxx --with-blas-lapack-dir=/opt/intel/mkl/lib/intel64<br>

> --download-scalapack --download-mumps --with-hwloc --with-shared<br>

> --with-openmp=1 --with-pthread=1 --with-scalar-type=complex<br>

> hoping that it would take into account blas multithreading<br>

> - installed petsc4py<br>

> <br>

> However, I do not get any parallelization...<br>

> What I have tried so far, unsuccessfully:<br>

> - play with OMP_NUM_THREADS<br>

> - reinstall the system<br>

> - ldd <a href="http://PETSc.cpython-35m-x86_64-linux-gnu.so" rel="noreferrer" target="_blank">PETSc.cpython-35m-x86_64-linux-gnu.so</a> yields lld_result.txt (here<br>

> attached)<br>

> I noticed the libmkl_sequential.so library there. Do you think this is<br>

> normal?<br>

> - I found a similar problem reported here:<br>

> <a href="https://lists.mcs.anl.gov/pipermail/petsc-users/2016-March/028803.html" rel="noreferrer" target="_blank">https://lists.mcs.anl.gov/pipermail/petsc-users/2016-March/028803.html</a> To<br>

> solve this problem, the developers recommended replacing -lmkl_sequential with<br>

> -lmkl_intel_thread in PETSC_ARCH/lib/conf/petscvariables. However,<br>

> I did not find something that would be named like this (it might be a<br>

> change of version)<br>

> - Anyway, I replaced lmkl_sequential with lmkl_intel_thread in every file of<br>

> PETSC, but it changed nothing.<br>

> <br>

> As a result, in the new make.log (here attached ) I have a parameter<br>

> #define PETSC_HAVE_LIBMKL_SEQUENTIAL 1 and option -lmkl_sequential<br>

> <br>

> Do you have any idea of what I should change in the initial options in<br>

> order to obtain the blas multithreading parallelization?<br>

> <br>

> Thanks a lot for your help!<br>

> <br>

> Ivan<br>

> <br>

> <br>

> <br>

> <br>

> <br>

> <br>

> On Fri, Nov 16, 2018 at 1:25 AM Dave May <<a href="mailto:dave.mayhem23@gmail.com" target="_blank">dave.mayhem23@gmail.com</a>> wrote:<br>

> <br>

> ><br>

> ><br>

> > On Thu, 15 Nov 2018 at 17:44, Ivan via petsc-users <<br>

> > <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>

> ><br>

> >> Hi Stefano,<br>

> >><br>

> >> In fact, yes, we look at the htop output (and the resulting computational<br>

> >> time ofc).<br>

> >><br>

> >> In our code we use MUMPS, which indeed depends on blas / lapack. So I<br>

> >> think this might be it!<br>

> >><br>

> >> I will definitely check it (I mean the difference between our MUMPS,<br>

> >> blas, lapack).<br>

> >><br>

> >> If you have an idea of how we can verify on his PC that the source of his<br>

> >> parallelization does come from BLAS, please do not hesitate to tell me!<br>

> >><br>

> ><br>

> > Option 1/<br>

> > * Set this environment variable<br>

> >   export OMP_NUM_THREADS=1<br>

> > * Re-run your "parallel" test.<br>

> > * If the performance differs (job runs slower) compared with your previous<br>

> > run where you inferred parallelism was being employed, you can safely<br>

> > assume that the parallelism observed comes from threads<br>

> ><br>

> > Option 2/<br>

> > * Re-configure PETSc to use a known BLAS implementation which does not<br>

> > support threads<br>

> > * Re-compile PETSc<br>

> > * Re-run your parallel test<br>

> > * If the performance differs (job runs slower) compared with your previous<br>

> > run where you inferred parallelism was being employed, you can safely<br>

> > assume that the parallelism observed comes from threads<br>

> ><br>

> > Option 3/<br>

> > * Use a PC which does not depend on BLAS at all,<br>

> > e.g. -pc_type jacobi or -pc_type bjacobi<br>

> > * If the performance differs (job runs slower) compared with your previous<br>

> > run where you inferred parallelism was being employed, you can safely<br>

> > assume that the parallelism observed comes from BLAS + threads<br>
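<br>

A minimal sketch of Option 1/ as a script: run the same command under different OMP_NUM_THREADS values and compare wall-clock times. The helper name time_with_threads is hypothetical, and the placeholder command should be replaced with the real test, e.g. python3 simple_code.py.<br>

```python
import os
import subprocess
import sys
import time


def time_with_threads(command, num_threads):
    """Run command with OMP_NUM_THREADS set and return the elapsed seconds."""
    # Copy the current environment and override only OMP_NUM_THREADS.
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder workload; replace with [sys.executable, "simple_code.py"].
    command = [sys.executable, "-c", "pass"]
    for n in (1, 8):
        elapsed = time_with_threads(command, n)
        print(n, "thread(s):", round(elapsed, 3), "s")
```

If the run with 1 thread is clearly slower than the run with 8, the speedup most likely comes from a threaded library rather than from MPI.<br>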

> ><br>

> ><br>

> ><br>

> >> Thanks!<br>

> >><br>

> >> Ivan<br>

> >> On 15/11/2018 18:24, Stefano Zampini wrote:<br>

> >><br>

> >> If you say your program is parallel by just looking at the output from<br>

> >> the top command, you are probably linking against a multithreaded blas<br>

> >> library<br>

> >><br>

> >> Il giorno Gio 15 Nov 2018, 20:09 Matthew Knepley via petsc-users <<br>

> >> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> ha scritto:<br>

> >><br>

> >>> On Thu, Nov 15, 2018 at 11:59 AM Ivan Voznyuk <<br>

> >>> <a href="mailto:ivan.voznyuk.work@gmail.com" target="_blank">ivan.voznyuk.work@gmail.com</a>> wrote:<br>

> >>><br>

> >>>> Hi Matthew,<br>

> >>>><br>

> >>>> Does it mean that by using just command python3 simple_code.py (without<br>

> >>>> mpiexec) you *cannot* obtain a parallel execution?<br>

> >>>><br>

> >>><br>

> >>> As I wrote before, it's not impossible. You could be directly calling<br>

> >>> PMI, but I do not think you are doing that.<br>

> >>><br>

> >>><br>

> >>>> My colleague and I have spent 5 days trying to understand how he<br>

> >>>> managed to do so.<br>

> >>>> It means that simply by running python3 simple_code.py he gets 8<br>

> >>>> processors working.<br>

> >>>> By the way, we wrote in his code few lines:<br>

> >>>> rank = PETSc.COMM_WORLD.Get_rank()<br>

> >>>> size = PETSc.COMM_WORLD.Get_size()<br>

> >>>> and we got rank = 0, size = 1<br>

> >>>><br>

> >>><br>

> >>> This is MPI telling you that you are only running on 1 process.<br>

> >>><br>

> >>><br>

> >>>> However, when execution reaches KSP.solve(), it somehow turns on 8<br>

> >>>> processors.<br>

> >>>><br>

> >>><br>

> >>> Why do you think it's running on 8 processes?<br>

> >>><br>

> >>><br>

> >>>> This problem is solved on his PC in 5-8 sec (in parallel, using *python3<br>

> >>>> simple_code.py*), on mine it takes 70-90 secs (in sequential, but with<br>

> >>>> the same command *python3 simple_code.py*)<br>

> >>>><br>

> >>><br>

> >>> I think it's much more likely that there are differences in the solver<br>

> >>> (use -ksp_view to see exactly what solver was used) than<br>

> >>> that it is parallelism. Moreover, you would never ever ever see that<br>

> >>> much speedup on a laptop since all these computations<br>

> >>> are bandwidth limited.<br>

> >>><br>

> >>>   Thanks,<br>

> >>><br>

> >>>      Matt<br>

> >>><br>

> >>><br>

> >>>> So, the conclusion is that on his computer this code works in the same way<br>

> >>>> as scipy: all the code is executed in sequential mode, but when it comes to<br>

> >>>> solution of system of linear equations, it runs on all available<br>

> >>>> processors. All this with just running python3 my_code.py (without any<br>

> >>>> mpi-smth)<br>

> >>>><br>

> >>>> Is it an exception / abnormal behavior? I mean, is it something<br>

> >>>> irregular that you, developers, have never seen?<br>

> >>>><br>

> >>>> Thanks and have a good evening!<br>

> >>>> Ivan<br>

> >>>><br>

> >>>> P.S. I don't think I know the answer regarding Scipy...<br>

> >>>><br>

> >>>><br>

> >>>> On Thu, Nov 15, 2018 at 2:39 PM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>><br>

> >>>> wrote:<br>

> >>>><br>

> >>>>> On Thu, Nov 15, 2018 at 8:07 AM Ivan Voznyuk <<br>

> >>>>> <a href="mailto:ivan.voznyuk.work@gmail.com" target="_blank">ivan.voznyuk.work@gmail.com</a>> wrote:<br>

> >>>>><br>

> >>>>>> Hi Matthew,<br>

> >>>>>> Thanks for your reply!<br>

> >>>>>><br>

> >>>>>> Let me clarify what I mean with a few questions:<br>

> >>>>>><br>

> >>>>>> 1. In order to obtain a parallel execution of simple_code.py, do I<br>

> >>>>>> need to go with mpiexec python3 simple_code.py, or can I just launch<br>

> >>>>>> python3 simple_code.py?<br>

> >>>>>><br>

> >>>>><br>

> >>>>> mpiexec -n 2 python3 simple_code.py<br>

> >>>>><br>

> >>>>><br>

> >>>>>> 2. This simple_code.py consists of 2 parts: a) preparation of matrix<br>

> >>>>>> b) solving the system of linear equations with PETSc. If I launch mpirun<br>

> >>>>>> (or mpiexec) -np 8 python3 simple_code.py, I suppose that I will basically<br>

> >>>>>> obtain 8 matrices and 8 systems to solve. However, I need to prepare only<br>

> >>>>>> one matrix, but launch this code in parallel on 8 processors.<br>

> >>>>>><br>

> >>>>><br>

> >>>>> When you create the Mat object, you give it a communicator (here<br>

> >>>>> PETSC_COMM_WORLD). That allows us to distribute the data. This is all<br>

> >>>>> covered extensively in the manual and the online tutorials, as well as the<br>

> >>>>> example code.<br>
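<br>

A minimal sketch of what distributing the data means for a matrix with a given number of global rows: each rank owns one contiguous block of rows. It assumes the even split used when the local sizes are left for PETSc to decide; the function names are hypothetical and are not petsc4py API. In a real petsc4py run, each rank can query its actual range with Mat.getOwnershipRange().<br>

```python
def local_rows(global_rows, size, rank):
    """Rows owned by one rank when the rows are split as evenly as possible."""
    # Every rank gets the quotient; the first (global_rows % size) ranks
    # each get one extra row.
    extra = 1 if global_rows % size > rank else 0
    return global_rows // size + extra


def ownership_ranges(global_rows, size):
    """Return the half-open (start, end) row range owned by each rank."""
    ranges = []
    start = 0
    for rank in range(size):
        end = start + local_rows(global_rows, size, rank)
        ranges.append((start, end))
        start = end
    return ranges


print(ownership_ranges(10, 2))  # two ranks, as with "mpiexec -n 2"
print(ownership_ranges(10, 8))  # eight ranks
```

Each rank would then set values only for the rows in its own range, so a single matrix is assembled cooperatively rather than 8 copies of it.<br>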

> >>>>><br>

> >>>>><br>

> >>>>>> In fact, here attached you will find a similar code (scipy_code.py)<br>

> >>>>>> with only one difference: the system of linear equations is solved with<br>

> >>>>>> scipy. So when I solve it, I can clearly see that the solution is obtained<br>

> >>>>>> in a parallel way. However, I do not use the command mpirun (or mpiexec). I<br>

> >>>>>> just go with python3 scipy_code.py.<br>

> >>>>>><br>

> >>>>><br>

> >>>>> Why do you think it's running in parallel?<br>

> >>>>><br>

> >>>>>   Thanks,<br>

> >>>>><br>

> >>>>>      Matt<br>

> >>>>><br>

> >>>>><br>

> >>>>>> In this case, the first part (creation of the sparse matrix) is not<br>

> >>>>>> parallel, whereas the solution of system is found in a parallel way.<br>

> >>>>>> So my question is: do you think that it's possible to have the same<br>

> >>>>>> behavior with PETSC? And what do I need for this?<br>

> >>>>>><br>

> >>>>>> I am asking this because for my colleague it worked! It means that he<br>

> >>>>>> launches the simple_code.py on his computer using the command python3<br>

> >>>>>> simple_code.py (and not mpi-smth python3 simple_code.py) and he obtains a<br>

> >>>>>> parallel execution of the same code.<br>

> >>>>>><br>

> >>>>>> Thanks for your help!<br>

> >>>>>> Ivan<br>

> >>>>>><br>

> >>>>>><br>

> >>>>>> On Thu, Nov 15, 2018 at 11:54 AM Matthew Knepley <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>><br>

> >>>>>> wrote:<br>

> >>>>>><br>

> >>>>>>> On Thu, Nov 15, 2018 at 4:53 AM Ivan Voznyuk via petsc-users <<br>

> >>>>>>> <a href="mailto:petsc-users@mcs.anl.gov" target="_blank">petsc-users@mcs.anl.gov</a>> wrote:<br>

> >>>>>>><br>

> >>>>>>>> Dear PETSC community,<br>

> >>>>>>>><br>

> >>>>>>>> I have a question regarding the parallel execution of petsc4py.<br>

> >>>>>>>><br>

> >>>>>>>> I have a simple code (here attached simple_code.py) which solves a<br>

> >>>>>>>> system of linear equations Ax=b using petsc4py. To execute it, I use the<br>

> >>>>>>>> command python3 simple_code.py which yields a sequential performance. With<br>

> >>>>>>>> a colleague of mine, we launched this code on his computer, and this time the<br>

> >>>>>>>> execution was in parallel, although he used the same command python3<br>

> >>>>>>>> simple_code.py (without mpirun or mpiexec).<br>

> >>>>>>>><br>

> >>>>>>> I am not sure what you mean. To run MPI programs in parallel, you<br>

> >>>>>>> need a launcher like mpiexec or mpirun. There are Python programs (like<br>

> >>>>>>> nemesis) that use the launcher API directly (called PMI), but that is not<br>

> >>>>>>> part of petsc4py.<br>

> >>>>>>><br>

> >>>>>>>   Thanks,<br>

> >>>>>>><br>

> >>>>>>>      Matt<br>

> >>>>>>><br>

> >>>>>>>> My configuration: Ubuntu x86_64 Ubuntu 16.04, Intel Core i7, PETSc<br>

> >>>>>>>> 3.10.2, PETSC_ARCH=arch-linux2-c-debug, petsc4py 3.10.0 in virtualenv<br>

> >>>>>>>><br>

> >>>>>>>> In order to parallelize it, I have already tried:<br>

> >>>>>>>> - use 2 different PCs<br>

> >>>>>>>> - use Ubuntu 16.04, 18.04<br>

> >>>>>>>> - use different architectures (arch-linux2-c-debug,<br>

> >>>>>>>> linux-gnu-c-debug, etc)<br>

> >>>>>>>> - ofc use different configurations (my present config can be found<br>

> >>>>>>>> in make.log that I attached here)<br>

> >>>>>>>> - mpi from mpich, openmpi<br>

> >>>>>>>><br>

> >>>>>>>> Nothing worked.<br>

> >>>>>>>><br>

> >>>>>>>> Do you have any ideas?<br>

> >>>>>>>><br>

> >>>>>>>> Thanks and have a good day,<br>

> >>>>>>>> Ivan<br>

> >>>>>>>><br>

> >>>>>>>> --<br>

> >>>>>>>> Ivan VOZNYUK<br>

> >>>>>>>> PhD in Computational Electromagnetics<br>

> >>>>>>>><br>

> >>>>>>><br>

> >>>>>>><br>

> >>>>>>> --<br>

> >>>>>>> What most experimenters take for granted before they begin their<br>

> >>>>>>> experiments is infinitely more interesting than any results to which their<br>

> >>>>>>> experiments lead.<br>

> >>>>>>> -- Norbert Wiener<br>

> >>>>>>><br>

> >>>>>>> <a href="https://www.cse.buffalo.edu/~knepley/" rel="noreferrer" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br>

> >>>>>>> <<a href="http://www.cse.buffalo.edu/~knepley/" rel="noreferrer" target="_blank">http://www.cse.buffalo.edu/~knepley/</a>><br>

> >>>>>>><br>

> >>>>>><br>

> >>>>>><br>

> >>>>>> --<br>

> >>>>>> Ivan VOZNYUK<br>

> >>>>>> PhD in Computational Electromagnetics<br>

> >>>>>> +33 (0)6.95.87.04.55<br>

> >>>>>> My webpage <<a href="https://ivanvoznyukwork.wixsite.com/webpage" rel="noreferrer" target="_blank">https://ivanvoznyukwork.wixsite.com/webpage</a>><br>

> >>>>>> My LinkedIn <<a href="http://linkedin.com/in/ivan-voznyuk-b869b8106" rel="noreferrer" target="_blank">http://linkedin.com/in/ivan-voznyuk-b869b8106</a>><br>

> >>>>>><br>

> >>>>><br>

> >>>>><br>


> >>>>><br>

> >>>><br>

> >>>><br>


> >>>><br>

> >>><br>

> >>><br>


> >>><br>

> >><br>

> <br>

> <br>

<br>

</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">Ivan VOZNYUK<div>PhD in Computational Electromagnetics</div><div>+33 (0)6.95.87.04.55</div><div><a href="https://ivanvoznyukwork.wixsite.com/webpage" target="_blank">My webpage</a><br></div><div><a href="http://linkedin.com/in/ivan-voznyuk-b869b8106" target="_blank">My LinkedIn</a></div></div></div></div></div>