[petsc-users] Strange Partition in PETSc 3.11 version on some computers
Danyang Su
danyang.su at gmail.com
Tue Sep 17 12:15:04 CDT 2019
On 2019-09-17 10:07 a.m., Mark Adams wrote:
> Matt, that sounds like it.
>
> danyang, just in case it's not clear, you need to delete your
> architecture directory and reconfigure from scratch. You should be
> able to just delete the arch-dir/externalpackages/git.parmetis[metis]
> directories, but I'd simply delete the whole arch-dir.
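>
> For example, a minimal sketch (the arch name below is only a
> placeholder for whatever PETSC_ARCH you configured with):
>
>     cd $PETSC_DIR
>     rm -rf arch-linux-intel-opt      # delete the whole arch directory
>     ./configure <your usual options>
>     make all                         # or the make line configure prints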
Many thanks to you all for the suggestions. I will try this first and
keep you updated.
Danyang
>
> On Tue, Sep 17, 2019 at 1:03 PM Matthew Knepley <knepley at gmail.com> wrote:
>
> On Tue, Sep 17, 2019 at 12:53 PM Danyang Su <danyang.su at gmail.com> wrote:
>
> Hi Mark,
>
> Thanks for your follow-up.
>
> The unstructured grid code has been verified and there is no
> problem in the results. The convergence rate is also good. The
> 3D mesh is not great; it is based on the original stratum, which
> I haven't refined, but it is good for an initial test as it is
> relatively small and the results obtained from it still make sense.
>
> The 2D meshes are just for testing purposes, as I wanted to
> reproduce the partition problem on a cluster using PETSc 3.11.3
> and Intel 2019. Unfortunately, I could not reproduce the problem
> with this example.
>
> The code runs without problems with different PETSc versions
> (V3.4 to V3.11) and MPI distributions (MPICH, OpenMPI,
> IntelMPI), except for one simulation case (the mesh I attached)
> on a cluster with PETSc 3.11.3 and Intel 2019u4, where the
> partition is very different from that of PETSc 3.9.3. The
> simulation results are still the same; only the efficiency
> suffers, because the strange partition results in much more
> communication (ghost nodes).
>
> I am still trying different compilers and MPI implementations
> with PETSc 3.11.3 on that cluster to trace the problem. I will
> get back to you when there is an update.
>
> You had --download-parmetis in your configure command, but I
> wonder if it is possible that it was actually not downloaded
> and was already present. The type of the ParMetis weights can be
> changed, and if the type that PETSc thinks it is does not match
> the actual library type, then the weights could all be crazy
> numbers. I seem to recall someone changing the weight type in a
> release, which might mean that the built ParMetis was fine with
> one version and not the other.
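>
> (A quick, rough way to check for such a mismatch, with the install
> paths below being placeholders for wherever each PETSc build put its
> headers, is to compare the widths recorded in the metis.h each build
> actually uses:
>
>     grep -E "IDXTYPEWIDTH|REALTYPEWIDTH" /path/to/petsc-3.9.3/include/metis.h
>     grep -E "IDXTYPEWIDTH|REALTYPEWIDTH" /path/to/petsc-3.11.3/include/metis.h
>
> If the header PETSc was compiled against does not match the library
> that actually gets linked, the weight and adjacency arrays handed to
> ParMetis would be misread.)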
>
> Thanks,
>
> Matt
>
> Thanks,
>
> danyang
>
> On 2019-09-17 9:02 a.m., Mark Adams wrote:
>> Danyang,
>>
>> Excuse me if I missed something in this thread, but here are a
>> few ideas.
>>
>> First, I trust that you have verified that you are getting a
>> good solution with these bad meshes. Ideally you would check
>> that the solver convergence rates are similar.
>>
>> You might verify that your mesh is represented correctly inside
>> DMPlex. You can visualize a Plex mesh very easily (let us know
>> if you need instructions).
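>>
>> (A rough sketch of one way to do this from the command line, given
>> that your code already honors -dm_view in the logs you sent: run
>> with something like
>>
>>     -dm_view vtk:mesh.vtk        or        -dm_view hdf5:mesh.h5
>>
>> and open the resulting file in ParaView or VisIt. The exact viewer
>> syntax can vary a bit between PETSc versions.)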
>>
>> This striping on the 2D meshes looks something like what you are
>> getting with your 3D prism mesh. DMPlex just calls ParMetis with
>> a flat graph. It is odd to me that the partitions of your
>> rectangular grids have so much structure and are non-isotropic.
>> I assume that these rectangular meshes are isotropic (e.g.,
>> squares).
>>
>> Anyway, just some thoughts,
>> Mark
>>
>> On Tue, Sep 17, 2019 at 12:43 AM Danyang Su via petsc-users
>> <petsc-users at mcs.anl.gov> wrote:
>>
>>
>> On 2019-09-16 12:02 p.m., Matthew Knepley wrote:
>>> On Mon, Sep 16, 2019 at 1:46 PM Smith, Barry F.
>>> <bsmith at mcs.anl.gov> wrote:
>>>
>>>
>>> Very different things are going on in the two cases:
>>> different objects are being created, and there are different
>>> numbers of different types of operations. Clearly a major
>>> refactoring of the code was done. Presumably a regression was
>>> introduced that changed the behavior dramatically, possibly by
>>> mistake.
>>>
>>> You can attempt to use git bisect to determine what change
>>> caused the dramatic change in behavior. Then it can be decided
>>> whether the change that triggered the difference in results
>>> was a bug or a planned feature.
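>>>
>>> (A minimal sketch of that workflow, assuming PETSc is built from
>>> the git repository and the failing case can be rerun at each
>>> step; the tags below are the public release tags:
>>>
>>>     git bisect start
>>>     git bisect bad  v3.11.3
>>>     git bisect good v3.9.3
>>>     # at each step: reconfigure/rebuild PETSc, rerun the case, then
>>>     # mark the commit with "git bisect good" or "git bisect bad"
>>>
>>> until git reports the first commit where the partition changed.)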
>>>
>>>
>>> Danyang,
>>>
>>> Can you send me the smallest mesh you care about, and I
>>> will look at the partitioning? We can at least get
>>> quality metrics
>>> between these two releases.
>>>
>>> Thanks,
>>>
>>> Matt
>>
>> Hi Matt,
>>
>> This is the smallest mesh for the regional-scale simulation
>> that has the strange partition problem. It can be downloaded
>> via the link below.
>>
>> https://www.dropbox.com/s/tu34jgqqhkz8pwj/basin-3d.vtk?dl=0
>>
>> I am trying to reproduce a similar problem using a smaller 2D
>> mesh; however, there is no such problem in 2D. Even though the
>> partitions using PETSc 3.9.3 and 3.11.3 are a bit different,
>> they both look reasonable. As shown below, both the rectangular
>> mesh and the triangular mesh use DMPlex.
>>
>> 2D rectangular and triangular meshes
>>
>> I will keep testing with PETSc 3.11.3 but with different
>> compilers and MPI implementations to check whether I can
>> reproduce the problem.
>>
>> Thanks,
>>
>> Danyang
>>
>>>
>>> Barry
>>>
>>>
>>> > On Sep 16, 2019, at 11:50 AM, Danyang Su <danyang.su at gmail.com> wrote:
>>> >
>>> > Hi Barry and Matt,
>>> >
>>> > Attached is the output of both runs with -dm_view
>>> -log_view included.
>>> >
>>> > I am now coordinating with staff to install the PETSc 3.9.3
>>> version using Intel 2019u4 to narrow down the problem. I will
>>> get back to you after the test.
>>> >
>>> > Thanks,
>>> >
>>> > Danyang
>>> >
>>> > On 2019-09-15 4:43 p.m., Smith, Barry F. wrote:
>>> >> Send the configure.log and make.log for the two system
>>> configurations that produce very different results, as well as
>>> the output of running with -dm_view -info for both runs. The
>>> cause is likely not subtle: one is likely using Metis and the
>>> other is likely just not using any partitioner.
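>>> >>
>>> >> (For example, something along these lines for each build, where
>>> >> the executable name and the other arguments are placeholders:
>>> >>
>>> >>     mpiexec -n 160 ./your_app <usual args> -dm_view -info > run.log 2>&1
>>> >>
>>> >> and send the resulting logs for both builds.)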
>>> >>
>>> >>
>>> >>
>>> >>> On Sep 15, 2019, at 6:07 PM, Matthew Knepley via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>> >>>
>>> >>> On Sun, Sep 15, 2019 at 6:59 PM Danyang Su <danyang.su at gmail.com> wrote:
>>> >>> Hi Matt,
>>> >>>
>>> >>> Thanks for the quick reply. I have not changed the
>>> adjacency. The source code and the simulation input files are
>>> all the same. I also tried the GNU compiler and MPICH with
>>> PETSc 3.11.3 and it works fine.
>>> >>>
>>> >>> It looks like the problem is caused by the difference in
>>> configuration. However, the configuration is pretty much the
>>> same as for PETSc 3.9.3 except for the compiler and MPI used.
>>> I will contact the SciNet staff to check whether they have any
>>> idea about this.
>>> >>>
>>> >>> Very, very strange, since the partitioning is handled
>>> completely by Metis and does not use MPI.
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Matt
>>> >>> Thanks,
>>> >>>
>>> >>> Danyang
>>> >>>
>>> >>> On September 15, 2019 3:20:18 p.m. PDT, Matthew Knepley <knepley at gmail.com> wrote:
>>> >>> On Sun, Sep 15, 2019 at 5:19 PM Danyang Su via petsc-users <petsc-users at mcs.anl.gov> wrote:
>>> >>> Dear All,
>>> >>>
>>> >>> I have a question regarding a strange partition problem
>>> in the PETSc 3.11 version. The problem does not exist on my
>>> local workstation. However, on a cluster with different PETSc
>>> versions the partitions are quite different, as you can see in
>>> the figure below, which was produced with 160 processors. The
>>> color indicates which processor owns each subdomain. In this
>>> layered prism mesh there are 40 layers from bottom to top and
>>> each layer has around 20k nodes. The natural ordering of the
>>> nodes is also layered from bottom to top.
>>> >>>
>>> >>> The left partition (PETSc 3.10 and earlier) looks good,
>>> with a minimal number of ghost nodes, while the right one
>>> (PETSc 3.11) looks weird, with a huge number of ghost nodes.
>>> It looks like the right one partitions the mesh layer by
>>> layer. This problem exists on a cluster but not on my local
>>> workstation for the same PETSc version (with a different
>>> compiler and MPI). Other than the difference in partition and
>>> efficiency, the simulation results are the same.
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> Below is the PETSc configuration on the three machines:
>>> >>>
>>> >>> Local workstation (works fine): ./configure
>>> --with-cc=gcc --with-cxx=g++ --with-fc=gfortran
>>> --download-mpich --download-scalapack
>>> --download-parmetis --download-metis
>>> --download-ptscotch --download-fblaslapack
>>> --download-hypre --download-superlu_dist
>>> --download-hdf5=yes --download-ctetgen
>>> --with-debugging=0 COPTFLAGS=-O3 CXXOPTFLAGS=-O3
>>> FOPTFLAGS=-O3 --with-cxx-dialect=C++11
>>> >>>
>>> >>> Cluster with PETSc 3.9.3 (works fine):
>>> --prefix=/scinet/niagara/software/2018a/opt/intel-2018.2-intelmpi-2018.2/petsc/3.9.3
>>> CC=mpicc CXX=mpicxx F77=mpif77 F90=mpif90 FC=mpifc
>>> COPTFLAGS="-march=native -O2"
>>> CXXOPTFLAGS="-march=native -O2"
>>> FOPTFLAGS="-march=native -O2" --download-chaco=1
>>> --download-hypre=1 --download-metis=1
>>> --download-ml=1 --download-mumps=1
>>> --download-parmetis=1 --download-plapack=1
>>> --download-prometheus=1 --download-ptscotch=1
>>> --download-scotch=1 --download-sprng=1
>>> --download-superlu=1 --download-superlu_dist=1
>>> --download-triangle=1 --with-avx512-kernels=1
>>> --with-blaslapack-dir=/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl
>>> --with-debugging=0 --with-hdf5=1
>>> --with-mkl_pardiso-dir=/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl
>>> --with-scalapack=1
>>> --with-scalapack-lib="[/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so,/scinet/niagara/intel/2018.2/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so]"
>>> --with-x=0
>>> >>>
>>> >>> Cluster with PETSc 3.11.3 (looks weird):
>>> --prefix=/scinet/niagara/software/2019b/opt/intel-2019u4-intelmpi-2019u4/petsc/3.11.3
>>> CC=mpicc CXX=mpicxx F77=mpif77 F90=mpif90 FC=mpifc
>>> COPTFLAGS="-march=native -O2"
>>> CXXOPTFLAGS="-march=native -O2"
>>> FOPTFLAGS="-march=native -O2" --download-chaco=1
>>> --download-hdf5=1 --download-hypre=1
>>> --download-metis=1 --download-ml=1
>>> --download-mumps=1 --download-parmetis=1
>>> --download-plapack=1 --download-prometheus=1
>>> --download-ptscotch=1 --download-scotch=1
>>> --download-sprng=1 --download-superlu=1
>>> --download-superlu_dist=1 --download-triangle=1
>>> --with-avx512-kernels=1
>>> --with-blaslapack-dir=/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl
>>> --with-cxx-dialect=C++11 --with-debugging=0
>>> --with-mkl_pardiso-dir=/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl
>>> --with-scalapack=1
>>> --with-scalapack-lib="[/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_scalapack_lp64.so,/scinet/intel/2019u4/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so]"
>>> --with-x=0
>>> >>>
>>> >>> And the partitioning uses the default DMPlex distribution.
>>> >>>
>>> >>> !c distribute mesh over processes
>>> >>> call DMPlexDistribute(dmda_flow%da, stencil_width, &
>>> >>>                       PETSC_NULL_SF,               &
>>> >>>                       PETSC_NULL_OBJECT,           &
>>> >>>                       distributedMesh, ierr)
>>> >>> CHKERRQ(ierr)
>>> >>>
>>> >>> Any idea on this strange problem?
>>> >>>
>>> >>>
>>> >>> I just looked at the code. Your mesh should be
>>> partitioned by k-way partitioning using Metis, since it is on
>>> 1 proc for partitioning. This code is the same for 3.9 and
>>> 3.11, and you get the same result on your machine. I cannot
>>> understand what might be happening on your cluster (MPI plays
>>> no role). Is it possible that you changed the adjacency
>>> specification in that version?
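>>> >>>
>>> >>> (One rough way to narrow this down, sketched here assuming the
>>> >>> partitioner's options are exercised, e.g. via DMSetFromOptions
>>> >>> or an explicit PetscPartitionerSetFromOptions on the partitioner
>>> >>> returned by DMPlexGetPartitioner, is to switch the partitioner
>>> >>> from the command line and see whether the bad layout follows
>>> >>> the Metis/ParMetis build:
>>> >>>
>>> >>>     -petscpartitioner_type ptscotch    # or: parmetis, chaco, simple
>>> >>>
>>> >>> If PTScotch gives a sane partition while Metis does not, that
>>> >>> points at the Metis/ParMetis installation rather than at Plex.)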
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Matt
>>> >>> Thanks,
>>> >>>
>>> >>> Danyang
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> What most experimenters take for granted before
>>> they begin their experiments is infinitely more
>>> interesting than any results to which their
>>> experiments lead.
>>> >>> -- Norbert Wiener
>>> >>>
>>> >>> https://www.cse.buffalo.edu/~knepley/
>>> >>>
>>> >>> --
>>> >>> Sent from my Android device with K-9 Mail.
>>> Please excuse my brevity.
>>> >>>
>>> >>>
>>> >>> --
>>> >>> What most experimenters take for granted before
>>> they begin their experiments is infinitely more
>>> interesting than any results to which their
>>> experiments lead.
>>> >>> -- Norbert Wiener
>>> >>>
>>> >>> https://www.cse.buffalo.edu/~knepley/
>>> > <basin-petsc-3.9.3.log><basin-petsc-3.11.3.log>
>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they
>>> begin their experiments is infinitely more interesting
>>> than any results to which their experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to
> which their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20190917/70919e95/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: petsc-partition-compare.png
Type: image/png
Size: 69346 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20190917/70919e95/attachment-0001.png>