[petsc-users] Strange Partition in PETSc 3.11 version on some computers

Danyang Su danyang.su at gmail.com
Tue Sep 17 12:15:04 CDT 2019


On 2019-09-17 10:07 a.m., Mark Adams wrote:
> Matt, that sounds like it.
>
> danyang, just in case it's not clear, you need to delete your 
> architecture directory and reconfigure from scratch. You should be 
> able to just delete the arch-dir/externalpackages/git.parmetis[metis] 
> directories, but I'd simply delete the whole arch-dir.

Many thanks to you all for the suggestions. I will try this first and 
keep you updated.
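
For reference, a minimal sketch of what I plan to run on the cluster 
(the architecture directory name is only a placeholder for my actual 
build, and I will reuse the same configure options as before):

    # remove the whole architecture directory ...
    rm -rf $PETSC_DIR/arch-intel-2019u4-opt
    # ... or, more surgically, only the downloaded partitioner sources:
    # rm -rf $PETSC_DIR/arch-intel-2019u4-opt/externalpackages/git.metis
    # rm -rf $PETSC_DIR/arch-intel-2019u4-opt/externalpackages/git.parmetis

    # then reconfigure and rebuild from scratch
    ./configure <same options as before, including --download-metis --download-parmetis>
    make all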

Danyang

>
> On Tue, Sep 17, 2019 at 1:03 PM Matthew Knepley <knepley at gmail.com 
> <mailto:knepley at gmail.com>> wrote:
>
>     On Tue, Sep 17, 2019 at 12:53 PM Danyang Su <danyang.su at gmail.com
>     <mailto:danyang.su at gmail.com>> wrote:
>
>         Hi Mark,
>
>         Thanks for your follow-up.
>
>         The unstructured grid code has been verified and there is no
>         problem in the results. The convergence rate is also good.
>         The 3D mesh itself is not great: it is based on the original
>         stratum, which I haven't refined, but it is good enough for
>         an initial test as it is relatively small and the results
>         obtained from this mesh still make sense.
>
>         The 2D meshes are just for testing purposes, as I want to
>         reproduce the partition problem on a cluster using PETSc 3.11.3
>         and Intel 2019. Unfortunately, I could not reproduce the
>         problem with this example.
>
>         The code works fine with different PETSc versions (PETSc
>         v3.4 to v3.11) and MPI distributions (MPICH, OpenMPI, Intel
>         MPI), except for one simulation case (the mesh I attached)
>         on a cluster with PETSc 3.11.3 and Intel 2019u4, due to the
>         very different partition compared to PETSc 3.9.3. Even then
>         the simulation results are the same; the only issue is
>         efficiency, because the strange partition results in much
>         more communication (ghost nodes).
>
>         I am still trying different compilers and MPI implementations
>         with PETSc 3.11.3 on that cluster to trace the problem. I will
>         get back to you when there is an update.
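
One more thing I plan to try, to see whether the issue is tied to the 
ParMetis build itself: force a different partitioner from the command 
line. A rough sketch (the executable name is a placeholder, and this 
assumes the partitioner options are picked up through the DM options 
handling in my code):

    mpiexec -n 160 ./flow_exe -petscpartitioner_type ptscotch

If the PTScotch partition looks reasonable while the default one does 
not, that would also point at the ParMetis installation.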
>
>     You had --download-parmetis in your configure command, but I
>     wonder if it is possible that it was not actually downloaded and
>     an already present copy was used instead. The type of the
>     ParMetis weights can be changed, and if the type that PETSc
>     thinks it is does not match the type the library was actually
>     built with, then the weights could all be garbage numbers. I
>     seem to recall someone changing the weight type in a release,
>     which might mean that the built ParMetis was fine with one
>     version and not the other.
>
>       Thanks,
>
>         Matt
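
(To check Matt's point, I suppose I can compare the integer and real 
widths that the installed Metis headers report on the two installs. A 
rough sketch, assuming metis.h is installed under the prefix include 
directories shown in the configure lines quoted below:

    grep -E "IDXTYPEWIDTH|REALTYPEWIDTH" \
      /scinet/niagara/software/2018a/opt/intel-2018.2-intelmpi-2018.2/petsc/3.9.3/include/metis.h \
      /scinet/niagara/software/2019b/opt/intel-2019u4-intelmpi-2019u4/petsc/3.11.3/include/metis.h

If the two installs report different widths, or widths different from 
what PETSc was configured to expect, that would match the mismatch 
Matt describes.)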
>
>         Thanks,
>
>         danyang
>
>         On 2019-09-17 9:02 a.m., Mark Adams wrote:
>>         Danyang,
>>
>>         Excuse me if I missed something in this thread but just a few
>>         ideas.
>>
>>         First, I trust that you have verified that you are getting a
>>         good solution with these bad meshes. Ideally you would check
>>         that the solver convergence rates are similar.
>>
>>         You might verify that your mesh is inside of DMPlex
>>         correctly. You can visualize a Plex mesh very easily (let us
>>         know if you need instructions).
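
(For my own notes: a minimal sketch of how I might dump the 
distributed mesh for viewing, assuming the command-line viewer options 
are picked up the same way as the -dm_view output I attached earlier; 
the executable and file names are placeholders:

    mpiexec -n 160 ./flow_exe -dm_view vtk:basin-3d-dist.vtu

The .vtu file can then be opened in ParaView to check that the mesh 
made it into DMPlex correctly.)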
>>
>>         The striping on the 2D meshes looks something like what you
>>         are getting with your 3D prism mesh. DMPlex just calls
>>         ParMetis with a flat graph. It is odd to me that your
>>         rectangular grids show so much structure and are partitioned
>>         anisotropically; I assume that these rectangular meshes are
>>         isotropic (e.g., squares).
>>
>>         Anyway, just some thoughts,
>>         Mark
>>
>>         On Tue, Sep 17, 2019 at 12:43 AM Danyang Su via petsc-users
>>         <petsc-users at mcs.anl.gov <mailto:petsc-users at mcs.anl.gov>> wrote:
>>
>>
>>             On 2019-09-16 12:02 p.m., Matthew Knepley wrote:
>>>             On Mon, Sep 16, 2019 at 1:46 PM Smith, Barry F.
>>>             <bsmith at mcs.anl.gov <mailto:bsmith at mcs.anl.gov>> wrote:
>>>
>>>
>>>                   Very different stuff is going on in the two
>>>                 cases: different objects being created, different
>>>                 numbers of different types of operations. Clearly a
>>>                 major refactoring of the code was done. Presumably a
>>>                 regression was introduced that changed the behavior
>>>                 dramatically, possibly by mistake.
>>>
>>>                    You can attempt to use git bisect to determine
>>>                 which change caused the dramatic change in behavior.
>>>                 Then it can be decided whether the change that
>>>                 triggered the difference in results was a bug or a
>>>                 planned feature.
>>>
>>>
>>>             Danyang,
>>>
>>>             Can you send me the smallest mesh you care about, and I
>>>             will look at the partitioning? We can at least get
>>>             quality metrics
>>>             between these two releases.
>>>
>>>               Thanks,
>>>
>>>                  Matt
>>
>>             Hi Matt,
>>
>>             This is the smallest mesh for the regional-scale
>>             simulation that has the strange partition problem. It
>>             can be downloaded via the link below.
>>
>>             https://www.dropbox.com/s/tu34jgqqhkz8pwj/basin-3d.vtk?dl=0
>>
>>             I am trying to reproduce a similar problem using a
>>             smaller 2D mesh; however, there is no such problem in
>>             2D. Even though the partitions using PETSc 3.9.3 and
>>             3.11.3 are a bit different, they both look reasonable.
>>             As shown below, both the rectangular mesh and the
>>             triangular mesh use DMPlex.
>>
>>             [figure: 2D rectangular and triangular mesh partitions]
>>
>>             I will keep on testing with PETSc 3.11.3 but with
>>             different compilers and MPI implementations to check if
>>             I can reproduce the problem.
>>
>>             Thanks,
>>
>>             Danyang
>>
>>>
>>>                  Barry
>>>
>>>
>>>                 > On Sep 16, 2019, at 11:50 AM, Danyang Su
>>>                 <danyang.su at gmail.com <mailto:danyang.su at gmail.com>>
>>>                 wrote:
>>>                 >
>>>                 > Hi Barry and Matt,
>>>                 >
>>>                 > Attached is the output of both runs with -dm_view
>>>                 -log_view included.
>>>                 >
>>>                 > I am now coordinating with the staff to install
>>>                 the PETSc 3.9.3 version using Intel 2019u4 to narrow
>>>                 down the problem. I will get back to you after the
>>>                 test.
>>>                 >
>>>                 > Thanks,
>>>                 >
>>>                 > Danyang
>>>                 >
>>>                 > On 2019-09-15 4:43 p.m., Smith, Barry F. wrote:
>>>                 >>   Send the configure.log and make.log for the
>>>                 two system configurations that produce very
>>>                 different results, as well as the output from
>>>                 running with -dm_view -info for both runs. The cause
>>>                 is likely not subtle: one is likely using Metis and
>>>                 the other is likely just not using any partitioner.
>>>                 >>
>>>                 >>
>>>                 >>
>>>                 >>> On Sep 15, 2019, at 6:07 PM, Matthew Knepley via
>>>                 petsc-users <petsc-users at mcs.anl.gov
>>>                 <mailto:petsc-users at mcs.anl.gov>> wrote:
>>>                 >>>
>>>                 >>> On Sun, Sep 15, 2019 at 6:59 PM Danyang Su
>>>                 <danyang.su at gmail.com <mailto:danyang.su at gmail.com>>
>>>                 wrote:
>>>                 >>> Hi Matt,
>>>                 >>>
>>>                 >>> Thanks for the quick reply. I have not changed
>>>                 the adjacency. The source code and the simulation
>>>                 input files are all the same. I also tried the GNU
>>>                 compiler and MPICH with PETSc 3.11.3 and it works
>>>                 fine.
>>>                 >>>
>>>                 >>> It looks like the problem is caused by the
>>>                 difference in configuration. However, the
>>>                 configuration is pretty much the same as for PETSc
>>>                 3.9.3 except for the compiler and MPI used. I will
>>>                 contact the SciNet staff to check if they have any
>>>                 idea about this.
>>>                 >>>
>>>                 >>> Very, very strange, since the partition is
>>>                 handled completely by Metis, which does not use MPI.
>>>                 >>>
>>>                 >>>   Thanks,
>>>                 >>>
>>>                 >>>     Matt
>>>                 >>>  Thanks,
>>>                 >>>
>>>                 >>> Danyang
>>>                 >>>
>>>                 >>> On September 15, 2019 3:20:18 p.m. PDT, Matthew
>>>                 Knepley <knepley at gmail.com
>>>                 <mailto:knepley at gmail.com>> wrote:
>>>                 >>> On Sun, Sep 15, 2019 at 5:19 PM Danyang Su via
>>>                 petsc-users <petsc-users at mcs.anl.gov
>>>                 <mailto:petsc-users at mcs.anl.gov>> wrote:
>>>                 >>> Dear All,
>>>                 >>>
>>>                 >>> I have a question regarding a strange partition
>>>                 problem with the PETSc 3.11 version. The problem
>>>                 does not exist on my local workstation. However, on
>>>                 a cluster with different PETSc versions, the
>>>                 partition is quite different, as you can see in the
>>>                 figure below, which was tested with 160 processors.
>>>                 The color indicates which processor owns each
>>>                 subdomain. In this layered prism mesh, there are 40
>>>                 layers from bottom to top and each layer has around
>>>                 20k nodes. The natural ordering of the nodes is also
>>>                 layered from bottom to top.
>>>                 >>>
>>>                 >>> The left partition (PETSc 3.10 and earlier)
>>>                 looks good, with a minimal number of ghost nodes,
>>>                 while the right one (PETSc 3.11) looks weird, with a
>>>                 huge number of ghost nodes. It looks like the right
>>>                 one partitions layer by layer. This problem exists
>>>                 on a cluster but not on my local workstation for the
>>>                 same PETSc version (with a different compiler and
>>>                 MPI). Other than the difference in partition and
>>>                 efficiency, the simulation results are the same.
>>>                 >>>
>>>                 >>> [figure: petsc-partition-compare.png, partition
>>>                 comparison with 160 processors]
>>>                 >>> Below is the PETSc configuration on the three machines:
>>>                 >>>
>>>                 >>> Local workstation (works fine):  ./configure
>>>                 --with-cc=gcc --with-cxx=g++ --with-fc=gfortran
>>>                 --download-mpich --download-scalapack
>>>                 --download-parmetis --download-metis
>>>                 --download-ptscotch --download-fblaslapack
>>>                 --download-hypre --download-superlu_dist
>>>                 --download-hdf5=yes --download-ctetgen
>>>                 --with-debugging=0 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
>>>                 FOPTFLAGS=-O3 --with-cxx-dialect=C++11
>>>                 >>>
>>>                 >>> Cluster with PETSc 3.9.3 (works fine):
>>>                 --prefix=/scinet/niagara/software/2018a/opt/intel-2018.2-intelmpi-2018.2/petsc/3.9.3
>>>                 CC=mpicc CXX=mpicxx F77=mpif77 F90=mpif90 FC=mpifc
>>>                 COPTFLAGS="-march=native -O2"
>>>                 CXXOPTFLAGS="-march=native -O2"
>>>                 FOPTFLAGS="-march=native -O2" --download-chaco=1
>>>                 --download-hypre=1 --download-metis=1
>>>                 --download-ml=1 --download-mumps=1
>>>                 --download-parmetis=1 --download-plapack=1
>>>                 --download-prometheus=1 --download-ptscotch=1
>>>                 --download-scotch=1 --download-sprng=1
>>>                 --download-superlu=1 --download-superlu_dist=1
>>>                 --download-triangle=1 --with-avx512-kernels=1
>>>                 --with-blaslapack-dir=/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl
>>>                 --with-debugging=0 --with-hdf5=1
>>>                 --with-mkl_pardiso-dir=/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl
>>>                 --with-scalapack=1
>>>                 --with-scalapack-lib="[/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so,/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so]"
>>>                 --with-x=0
>>>                 >>>
>>>                 >>> Cluster with PETSc 3.11.3 (looks weird):
>>>                 --prefix=/scinet/niagara/software/2019b/opt/intel-2019u4-intelmpi-2019u4/petsc/3.11.3
>>>                 CC=mpicc CXX=mpicxx F77=mpif77 F90=mpif90 FC=mpifc
>>>                 COPTFLAGS="-march=native -O2"
>>>                 CXXOPTFLAGS="-march=native -O2"
>>>                 FOPTFLAGS="-march=native -O2" --download-chaco=1
>>>                 --download-hdf5=1 --download-hypre=1
>>>                 --download-metis=1 --download-ml=1
>>>                 --download-mumps=1 --download-parmetis=1
>>>                 --download-plapack=1 --download-prometheus=1
>>>                 --download-ptscotch=1 --download-scotch=1
>>>                 --download-sprng=1 --download-superlu=1
>>>                 --download-superlu_dist=1 --download-triangle=1
>>>                 --with-avx512-kernels=1
>>>                 --with-blaslapack-dir=/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl
>>>                 --with-cxx-dialect=C++11 --with-debugging=0
>>>                 --with-mkl_pardiso-dir=/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl
>>>                 --with-scalapack=1
>>>                 --with-scalapack-lib="[/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so,/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so]"
>>>                 --with-x=0
>>>                 >>>
>>>                 >>> And the partitioning uses the default DMPlex
>>>                 distribution.
>>>                 >>>
>>>                 >>>       !c distribute mesh over processes
>>>                 >>>       call DMPlexDistribute(dmda_flow%da, stencil_width,      &
>>>                 >>>                             PETSC_NULL_SF, PETSC_NULL_OBJECT, &
>>>                 >>>                             distributedMesh, ierr)
>>>                 >>>       CHKERRQ(ierr)
>>>                 >>>
>>>                 >>> Any idea on this strange problem?
>>>                 >>>
>>>                 >>>
>>>                 >>> I just looked at the code. Your mesh should be
>>>                 partitioned by k-way partitioning using Metis, since
>>>                 it's on 1 proc for partitioning. This code
>>>                 >>> is the same for 3.9 and 3.11, and you get the
>>>                 same result on your machine. I cannot understand
>>>                 what might be happening on your cluster
>>>                 >>> (MPI plays no role). Is it possible that you
>>>                 changed the adjacency specification in that version?
>>>                 >>>
>>>                 >>>   Thanks,
>>>                 >>>
>>>                 >>>      Matt
>>>                 >>> Thanks,
>>>                 >>>
>>>                 >>> Danyang
>>>                 >>>
>>>                 >>>
>>>                 >>>
>>>                 >>> --
>>>                 >>> What most experimenters take for granted before
>>>                 they begin their experiments is infinitely more
>>>                 interesting than any results to which their
>>>                 experiments lead.
>>>                 >>> -- Norbert Wiener
>>>                 >>>
>>>                 >>> https://www.cse.buffalo.edu/~knepley/
>>>                 >>>
>>>                 >>> --
>>>                 >>> Sent from my Android device with K-9 Mail.
>>>                 Please excuse my brevity.
>>>                 >>>
>>>                 >>>
>>>                 >>> --
>>>                 >>> What most experimenters take for granted before
>>>                 they begin their experiments is infinitely more
>>>                 interesting than any results to which their
>>>                 experiments lead.
>>>                 >>> -- Norbert Wiener
>>>                 >>>
>>>                 >>> https://www.cse.buffalo.edu/~knepley/
>>>                 > <basin-petsc-3.9.3.log><basin-petsc-3.11.3.log>
>>>
>>>
>>>
>>>             -- 
>>>             What most experimenters take for granted before they
>>>             begin their experiments is infinitely more interesting
>>>             than any results to which their experiments lead.
>>>             -- Norbert Wiener
>>>
>>>             https://www.cse.buffalo.edu/~knepley/
>>>             <http://www.cse.buffalo.edu/~knepley/>
>>
>
>
>     -- 
>     What most experimenters take for granted before they begin their
>     experiments is infinitely more interesting than any results to
>     which their experiments lead.
>     -- Norbert Wiener
>
>     https://www.cse.buffalo.edu/~knepley/
>     <http://www.cse.buffalo.edu/~knepley/>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: petsc-partition-compare.png
Type: image/png
Size: 69346 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20190917/70919e95/attachment-0001.png>

