[petsc-dev] development version for MATAIJMKL type mat

Richard Tran Mills rtmills at anl.gov
Tue Oct 10 16:07:25 CDT 2017


On Tue, Oct 10, 2017 at 1:03 PM, Mark Adams <mfadams at lbl.gov> wrote:

> putting this back on the list.
>
> On Tue, Oct 10, 2017 at 3:21 PM, Bakytzhan Kallemov <bkallemov at lbl.gov>
> wrote:
>
>>
>>
>>
>> -------- Forwarded Message --------
>> Subject: Re: [petsc-dev] development version for MATAIJMKL type mat
>> Date: Tue, 10 Oct 2017 12:18:08 -0700
>> From: Bakytzhan Kallemov <bkallemov at lbl.gov>
>> To: Barry Smith <bsmith at mcs.anl.gov>
>>
>> Hi Barry,
>>
>> Yes I am using OMP_NUM_THREADS to control the number of threads.
>>
>> I talked to Mark and he suggested using this development version of the
>> code to use OpenMP threading when running PETSc.
>>
>> Maybe it's better to explain what I am trying to achieve.
>>
>> Attached please see the plot of the average time for a time-step
>> advance (which mostly reflects PCApply, I believe, but I can generate a
>> separate plot for that later) running on a single KNL node for different
>> MPI+OpenMP combinations.
>>
>> My goal is to gain an advantage from using OpenMP threads in hybrid
>> runs, so I am trying to make the plot flatter.
>>
>>
> The first thing to do is verify that everything is hooked up correctly.
> The threaded version of hypre does look a little slower. Is this data
> repeatable? It could just be experimental noise.
>
> I would run these in the same script so that you know you get the same
> node and "environment". We really just need one data point, say 32 threads
> per 2 MPI processes per socket, or whatever your goal is.
>
> You would need to talk with hypre about (performance) debugging/optimizing
> hypre. We know more about gamg (I wrote it) and it uses native PETSc
> primitives so we get useful data with -log_view. If you do this with gamg I
> can look at the output and check that everything is hooked up correctly.
>
> Oh, actually, AIJMKL is not quite working yet for AMG because the
> matrix-matrix product methods are not hooked up. This should be happening
> in the next few months. So I'd talk with hypre. They have ways to collect
> performance data that they can analyze.
>
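
As a concrete version of the GAMG + -log_view experiment Mark describes above, here is a minimal sketch (purely illustrative, not from the original exchange): the matrix A and vectors b, x are assumed to be assembled elsewhere, and -pc_type gamg on the command line would do the same job as the PCSetType() call. The -log_view timing table, including the PCSetUp and PCApply rows to compare across MPI/OpenMP configurations, is printed at PetscFinalize().

  #include <petscksp.h>

  /* Sketch: solve A x = b with GAMG so that a run with -log_view
     reports PCSetUp/PCApply timings for the multigrid preconditioner. */
  PetscErrorCode SolveWithGAMG(Mat A, Vec b, Vec x)
  {
    KSP            ksp;
    PC             pc;
    PetscErrorCode ierr;

    ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
    ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
    ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
    ierr = PCSetType(pc,PCGAMG);CHKERRQ(ierr);   /* or -pc_type gamg at run time */
    ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr); /* allow command-line overrides */
    ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
    ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
    return 0;
  }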

Mark is right: Try Hypre first for now. However, I'm hoping to get AIJMKL
using the MKL sparse matrix-matrix multiply working this week. (So,
probably sometime next week, in actual practice. =) ). I'll let everyone
know when I have this working.
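
For anyone who wants to try it once that lands: selecting the MKL-backed storage is just a matter of the matrix type, either -mat_type aijmkl on the command line or MatSetType() in code. A small sketch, assuming a PETSc build configured against MKL (the function name and sizes here are hypothetical):

  #include <petscmat.h>

  /* Sketch: create an n x n matrix that uses the MKL-backed AIJ format.
     Assumes PETSc was built with MKL; the seq/mpi variant is chosen
     automatically from the communicator size. */
  PetscErrorCode CreateAIJMKL(MPI_Comm comm, PetscInt n, Mat *A)
  {
    PetscErrorCode ierr;

    ierr = MatCreate(comm,A);CHKERRQ(ierr);
    ierr = MatSetSizes(*A,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
    ierr = MatSetType(*A,MATAIJMKL);CHKERRQ(ierr); /* "aijmkl" */
    ierr = MatSetFromOptions(*A);CHKERRQ(ierr);    /* -mat_type ... can still override */
    ierr = MatSetUp(*A);CHKERRQ(ierr);
    return 0;
  }

An existing AIJ matrix can also be switched over with MatConvert(A, MATAIJMKL, MAT_INPLACE_MATRIX, &A).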

Note that my guess is that, when running with a single thread, the MKL
sparse matrix-matrix multiply will be somewhat slower than the PETSc
version (when compiling with the Intel compiler, which does a pretty good
job). Using MKL will allow you to use MKL's threading, though, so if that
is important to your application, it may be worth a try. Also, we should
report any performance problems we see to the MKL team, which may have some
motivation to improve things if we are making it easy for PETSc applications
to use MKL.
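
One quick way to confirm that the threading is actually reaching the libraries in a hybrid run (i.e., that OMP_NUM_THREADS, or MKL_NUM_THREADS for MKL specifically, is being honored) is to print the thread counts from inside the application. A tiny sketch, assuming an OpenMP-enabled build linked against MKL:

  #include <stdio.h>
  #include <omp.h>
  #include <mkl.h>

  /* Sketch: report what OpenMP and MKL think the maximum thread counts are.
     In a hybrid MPI+OpenMP run, each rank would print its own values. */
  int main(void)
  {
    printf("OpenMP max threads: %d\n", omp_get_max_threads());
    printf("MKL max threads:    %d\n", mkl_get_max_threads());
    return 0;
  }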

--Richard


> Mark
>
>
>> Is this something that I can do with the new development, such as the
>> AIJMKL matrix type?
>>
>> Thanks,
>>
>> Baky
>>
>>
>> On 10/10/2017 11:50 AM, Barry Smith wrote:
>> >> On Oct 10, 2017, at 10:52 AM, Bakytzhan Kallemov <bkallemov at lbl.gov> wrote:
>> >>
>> >> Hi,
>> >>
>> >> My name is Baky Kallemov.
>> >>
>> >> Currently, I am working on improving the scalability of the Chombo-PETSc interface on the Cori machine at NERSC.
>> >>
>> >> I successfully built the libraries from the master branch with --with-openmp and hypre.
>> >>
>> >> However, I have not noticed any difference running my test problem on a single KNL node using the new MATAIJMKL
>> >    hypre uses its own matrix operations, so it won't get faster when running PETSc with MATAIJMKL or any other specific matrix type.
>> >>
>> >> type for different hybrid MPI+OpenMP runs compared to the regular released version.
>> >     What are you comparing? Are you using, say, 32 MPI processes and 2 threads, or 16 MPI processes and 4 threads? How are you controlling the number of OpenMP threads? With the OpenMP environment variable? What parts of the time in the code are you comparing? You should just use -log_view and compare the times for PCApply() and PCSetUp() between, say, 64 MPI processes/1 thread and 32 MPI processes/2 threads, and send us the output for those two cases.
>> >
>> >> It seems that it made no difference, so perhaps I am doing something wrong or my build is not configured right.
>> >>
>> >> Do you have any example that makes use of threads when running hybrid and show an advantage?
>> >     There is no reason to think that using threads on KNL is faster than just using MPI processes. Despite what the NERSC/LBL web pages may say, just because a website says something doesn't make it true.
>> >
>> >
>> >> I'd like to test it and make sure that my libs are configured correctly before I start to investigate further.
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Baky
>> >>
>> >>
>>
>>
>>
>>
>
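
To make the comparison Barry suggests in the quoted message concrete: on a SLURM system such as Cori, that would be two runs of the same job script, along the lines of

  OMP_NUM_THREADS=1 srun -n 64 ./myapp -log_view
  OMP_NUM_THREADS=2 srun -n 32 -c 4 ./myapp -log_view

(./myapp and the srun flags are only illustrative; the exact process/thread binding options depend on the local batch setup), followed by comparing the PCSetUp and PCApply rows of the two -log_view tables.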