[petsc-users] Strange Partition in PETSc 3.11 version on some computers
Danyang Su
danyang.su at gmail.com
Tue Sep 17 11:53:37 CDT 2019
Hi Mark,
Thanks for your follow-up.
The unstructured grid code has been verified and the results are correct.
The convergence rate is also good. The 3D mesh itself is not great; it is
based on the original stratigraphy, which I have not refined yet, but it
is fine for an initial test since it is relatively small and the results
obtained from it still make sense.
The 2D meshes are only for testing purposes: I wanted to reproduce the
partition problem on a cluster using PETSc 3.11.3 and Intel 2019.
Unfortunately, I could not reproduce the problem with that example.
The code works fine with different PETSc versions (v3.4 to v3.11) and MPI
distributions (MPICH, OpenMPI, Intel MPI), except for one simulation case
(the mesh I attached) on a cluster with PETSc 3.11.3 and Intel 2019u4,
where the partition is very different from the one produced with
PETSc 3.9.3. The simulation results are still the same; the only issue is
efficiency, because the strange partition results in much more
communication (ghost nodes).
I am still trying different compilers and MPI implementations with
PETSc 3.11.3 on that cluster to trace the problem, and will get back to
you when there is an update.
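In the meantime, a quick sanity check is to confirm that both installs on
the cluster actually picked up METIS/ParMETIS. Assuming the usual
prefix-install layout, something like
    grep -i parmetis \
      /scinet/niagara/software/2019b/opt/intel-2019u4-intelmpi-2019u4/petsc/3.11.3/lib/petsc/conf/petscvariables
(and the same for the 3.9.3 prefix) should show whether the partitioner
libraries made it into each build.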
Thanks,
danyang
On 2019-09-17 9:02 a.m., Mark Adams wrote:
> Danyang,
>
> Excuse me if I missed something in this thread, but here are a few ideas.
>
> First, I trust that you have verified that you are getting a good
> solution with these bad meshes. Ideally you would check that the
> solver convergence rates are similar.
>
> You might also verify that your mesh made it into DMPlex correctly. You can
> visualize a Plex mesh very easily (let us know if you need instructions).
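>
> For example (a rough sketch; the exact option syntax can differ a bit
> between versions), you can dump the Plex to a VTK file and inspect it
> in ParaView to make sure the geometry and connectivity look right:
>
>   ./your_code -dm_view vtk:mesh.vtk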
>
> The striping on the 2D meshes looks something like what you are
> getting with your 3D prism mesh. DMPlex just calls ParMETIS with a
> flat graph. It is odd to me that the partitions of your rectangular
> grids have so much structure and are so anisotropic; I assume that the
> rectangular meshes themselves are isotropic (e.g., squares).
>
> Anyway, just some thoughts,
> Mark
>
> On Tue, Sep 17, 2019 at 12:43 AM Danyang Su via petsc-users
> <petsc-users at mcs.anl.gov> wrote:
>
>
> On 2019-09-16 12:02 p.m., Matthew Knepley wrote:
>> On Mon, Sep 16, 2019 at 1:46 PM Smith, Barry F.
>> <bsmith at mcs.anl.gov> wrote:
>>
>>
>> Very different stuff is going on in the two cases: different
>> objects being created, different numbers of different types of
>> operations. Clearly a major refactoring of the code was
>> done. Presumably a regression was introduced that changed the
>> behavior dramatically, possibly by mistake.
>>
>> You can attempt to use git bisect to determine what
>> change caused the dramatic change in behavior. Then it can
>> be decided whether the change that triggered the different
>> results was a bug or a planned feature.
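>>
>> A rough sketch, assuming you bracket the bisect with the release tags
>> (the PETSc repository tags releases as vX.Y.Z):
>>
>>   git bisect start
>>   git bisect bad  v3.11.3
>>   git bisect good v3.9.3
>>   # at each step: rebuild PETSc, rerun the case, check the partition,
>>   # then mark it with "git bisect good" or "git bisect bad"
>>
>> until git reports the first commit that changed the behavior.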
>>
>>
>> Danyang,
>>
>> Can you send me the smallest mesh you care about, and I will look
>> at the partitioning? We can at least get quality metrics
>> between these two releases.
>>
>> Thanks,
>>
>> Matt
>
> Hi Matt,
>
> This is the smallest mesh for the regional-scale simulation that
> has the strange partition problem. It can be downloaded via the link below.
>
> https://www.dropbox.com/s/tu34jgqqhkz8pwj/basin-3d.vtk?dl=0
>
> I am trying to reproduce a similar problem using smaller 2D
> meshes; however, there is no such problem in 2D. Even though the
> partitions from PETSc 3.9.3 and 3.11.3 are a bit different, they
> both look reasonable. As shown below, both the rectangular mesh and
> the triangular mesh use DMPlex.
>
> [inline figure: 2D rectangular and triangular mesh partitions]
>
> I will keep testing with PETSc 3.11.3 but with different
> compilers and MPI implementations to check whether I can reproduce the problem.
>
> Thanks,
>
> Danyang
>
>>
>> Barry
>>
>>
>> > On Sep 16, 2019, at 11:50 AM, Danyang Su
>> > <danyang.su at gmail.com> wrote:
>> >
>> > Hi Barry and Matt,
>> >
>> > Attached is the output of both runs with -dm_view -log_view
>> included.
>> >
>> > I am now coordinating with the cluster staff to install PETSc 3.9.3
>> > with Intel 2019u4 to narrow down the problem. I will
>> > get back to you after the test.
>> >
>> > Thanks,
>> >
>> > Danyang
>> >
>> > On 2019-09-15 4:43 p.m., Smith, Barry F. wrote:
>> >> Send the configure.log and make.log for the two system
>> >> configurations that produce very different results, as well as
>> >> the output from running with -dm_view -info for both runs. The
>> >> cause is likely not subtle: one is likely using METIS and the
>> >> other is likely not using any partitioner at all.
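>> >>
>> >> (For what it is worth, filtering the -info output, e.g. something like
>> >>
>> >>   mpiexec -n 160 ./your_code -dm_view -info 2>&1 | grep -i -e metis -e partition
>> >>
>> >> may make it easier to spot which partitioner, if any, each build ends
>> >> up using; the exact wording of the -info messages differs between
>> >> versions.)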
>> >>
>> >>
>> >>
>> >>> On Sep 15, 2019, at 6:07 PM, Matthew Knepley via
>> >>> petsc-users <petsc-users at mcs.anl.gov> wrote:
>> >>>
>> >>> On Sun, Sep 15, 2019 at 6:59 PM Danyang Su
>> >>> <danyang.su at gmail.com> wrote:
>> >>> Hi Matt,
>> >>>
>> >>> Thanks for the quick reply. I have not changed the
>> >>> adjacency. The source code and the simulation input files are
>> >>> all the same. I also tried the GNU compilers and MPICH with
>> >>> PETSc 3.11.3 and it works fine.
>> >>>
>> >>> It looks like the problem is caused by the difference in
>> >>> configuration. However, the configuration is pretty much the same
>> >>> as for PETSc 3.9.3 except for the compiler and MPI used. I will
>> >>> contact the SciNet staff to check whether they have any idea about this.
>> >>>
>> >>> Very, very strange, since the partitioning is handled
>> >>> completely by METIS and does not use MPI.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Matt
>> >>> Thanks,
>> >>>
>> >>> Danyang
>> >>>
>> >>> On September 15, 2019 3:20:18 p.m. PDT, Matthew Knepley
>> >>> <knepley at gmail.com> wrote:
>> >>> On Sun, Sep 15, 2019 at 5:19 PM Danyang Su via
>> >>> petsc-users <petsc-users at mcs.anl.gov> wrote:
>> >>> Dear All,
>> >>>
>> >>> I have a question regarding a strange partition problem with
>> >>> PETSc 3.11. The problem does not exist on my local
>> >>> workstation. However, on a cluster with different PETSc
>> >>> versions, the partitions are quite different, as you can
>> >>> see in the figure below, which was tested with 160
>> >>> processors. The color indicates which processor owns each
>> >>> subdomain. In this layered prism mesh, there are 40 layers
>> >>> from bottom to top and each layer has around 20k nodes. The
>> >>> natural ordering of the nodes is also layered from bottom to top.
>> >>>
>> >>> The left partition (PETSc 3.10 and earlier) looks good,
>> >>> with a minimal number of ghost nodes, while the right one (PETSc
>> >>> 3.11) looks weird, with a huge number of ghost nodes. It looks
>> >>> like the right one partitions layer by layer. This
>> >>> problem exists on a cluster but not on my local workstation
>> >>> for the same PETSc version (with a different compiler and MPI).
>> >>> Other than the difference in partition and efficiency, the
>> >>> simulation results are the same.
>> >>>
>> >>> [inline figure: comparison of the two partitions]
>> >>>
>> >>> Below are the PETSc configurations on the three machines:
>> >>>
>> >>> Local workstation (works fine): ./configure --with-cc=gcc
>> --with-cxx=g++ --with-fc=gfortran --download-mpich
>> --download-scalapack --download-parmetis --download-metis
>> --download-ptscotch --download-fblaslapack --download-hypre
>> --download-superlu_dist --download-hdf5=yes
>> --download-ctetgen --with-debugging=0 COPTFLAGS=-O3
>> CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --with-cxx-dialect=C++11
>> >>>
>> >>> Cluster with PETSc 3.9.3 (works fine):
>> --prefix=/scinet/niagara/software/2018a/opt/intel-2018.2-intelmpi-2018.2/petsc/3.9.3
>> CC=mpicc CXX=mpicxx F77=mpif77 F90=mpif90 FC=mpifc
>> COPTFLAGS="-march=native -O2" CXXOPTFLAGS="-march=native -O2"
>> FOPTFLAGS="-march=native -O2" --download-chaco=1
>> --download-hypre=1 --download-metis=1 --download-ml=1
>> --download-mumps=1 --download-parmetis=1 --download-plapack=1
>> --download-prometheus=1 --download-ptscotch=1
>> --download-scotch=1 --download-sprng=1 --download-superlu=1
>> --download-superlu_dist=1 --download-triangle=1
>> --with-avx512-kernels=1
>> --with-blaslapack-dir=/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl
>> --with-debugging=0 --with-hdf5=1
>> --with-mkl_pardiso-dir=/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl
>> --with-scalapack=1
>> --with-scalapack-lib="[/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so,/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so]"
>> --with-x=0
>> >>>
>> >>> Cluster with PETSc 3.11.3 (looks weird):
>> --prefix=/scinet/niagara/software/2019b/opt/intel-2019u4-intelmpi-2019u4/petsc/3.11.3
>> CC=mpicc CXX=mpicxx F77=mpif77 F90=mpif90 FC=mpifc
>> COPTFLAGS="-march=native -O2" CXXOPTFLAGS="-march=native -O2"
>> FOPTFLAGS="-march=native -O2" --download-chaco=1
>> --download-hdf5=1 --download-hypre=1 --download-metis=1
>> --download-ml=1 --download-mumps=1 --download-parmetis=1
>> --download-plapack=1 --download-prometheus=1
>> --download-ptscotch=1 --download-scotch=1 --download-sprng=1
>> --download-superlu=1 --download-superlu_dist=1
>> --download-triangle=1 --with-avx512-kernels=1
>> --with-blaslapack-dir=/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl
>> --with-cxx-dialect=C++11 --with-debugging=0
>> --with-mkl_pardiso-dir=/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl
>> --with-scalapack=1
>> --with-scalapack-lib="[/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so,/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so]"
>> --with-x=0
>> >>>
>> >>> And the partitioning uses the default DMPlex distribution.
>> >>>
>> >>>       !c distribute mesh over processes
>> >>>       call DMPlexDistribute(dmda_flow%da, stencil_width,   &
>> >>>                             PETSC_NULL_SF,                 &
>> >>>                             PETSC_NULL_OBJECT,             &
>> >>>                             distributedMesh, ierr)
>> >>>       CHKERRQ(ierr)
>> >>>
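>> >>> (If it helps to isolate the issue, I believe the partitioner used by
>> >>> DMPlexDistribute can also be selected and inspected from the command
>> >>> line, e.g. something like
>> >>>
>> >>>   -petscpartitioner_type parmetis -petscpartitioner_view
>> >>>
>> >>> though the exact option names may differ between PETSc versions.)
>> >>>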
>> >>> Any idea on this strange problem?
>> >>>
>> >>>
>> >>> I just looked at the code. Your mesh should be partitioned by
>> >>> k-way partitioning using METIS, since it is on 1 proc for
>> >>> partitioning. This code is the same for 3.9 and 3.11, and you get
>> >>> the same result on your machine. I cannot understand what might be
>> >>> happening on your cluster (MPI plays no role). Is it possible that
>> >>> you changed the adjacency specification in that version?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Matt
>> >>> Thanks,
>> >>>
>> >>> Danyang
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> What most experimenters take for granted before they
>> begin their experiments is infinitely more interesting than
>> any results to which their experiments lead.
>> >>> -- Norbert Wiener
>> >>>
>> >>> https://www.cse.buffalo.edu/~knepley/
>> >>>
>> >>> --
>> >>> Sent from my Android device with K-9 Mail. Please excuse
>> my brevity.
>> >>>
>> >>>
>> >>> --
>> >>> What most experimenters take for granted before they
>> begin their experiments is infinitely more interesting than
>> any results to which their experiments lead.
>> >>> -- Norbert Wiener
>> >>>
>> >>> https://www.cse.buffalo.edu/~knepley/
>> > <basin-petsc-3.9.3.log><basin-petsc-3.11.3.log>
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to
>> which their experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: petsc-partition-compare.png
Type: image/png
Size: 69346 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20190917/66850e18/attachment-0001.png>