[petsc-users] Problems exceeding 192 million unknowns in FEM code
Satish Balay
balay at mcs.anl.gov
Wed Feb 27 11:07:54 CST 2013
There could be some inefficiency if the stash allocated for the off-process
messages is insufficient [-info will show how many mallocs were
needed, and how much stash space was needed].
One can use the option -matstash_initial_size to specify a sufficient
size [that can avoid these mallocs]
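For example [the executable name, process count, and stash size below are
just illustrative -- the options are the ones mentioned above]:

  mpiexec -n 512 ./myfem -info -matstash_initial_size 100000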
Also, you can try reducing the message sizes [assuming MPI behaves
better with smaller messages]. This can be done by doing multiple
flush assemblies before the final assembly [each flush assembly will
pack up and send off the data destined for other processes]
i.e.:
MatSetValues()
...
MatAssemblyBegin/End(MAT_FLUSH_ASSEMBLY)
MatSetValues()
...
MatAssemblyBegin/End(MAT_FLUSH_ASSEMBLY)
MatSetValues()
...
MatAssemblyBegin/End(MAT_FINAL_ASSEMBLY)
[in this case a smaller stash is used anyway]
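Spelled out a bit more concretely, a minimal sketch of this pattern [the
element loop, the 8x8 element block, and the flush interval are illustrative;
the MatSetValues/MatAssemblyBegin/End calls are the point]:

#include <petscmat.h>

/* Sketch: assemble local element contributions, flushing the stash every
   "flushinterval" elements so off-process values are sent in smaller
   batches instead of one large message at the final assembly. */
PetscErrorCode AssembleStiffness(Mat A, PetscInt nlocalelems, PetscInt flushinterval)
{
  PetscErrorCode ierr;
  PetscInt       e, idx[8];   /* hypothetical element dof indices */
  PetscScalar    Ke[64];      /* hypothetical 8x8 element matrix  */

  PetscFunctionBeginUser;
  for (e = 0; e < nlocalelems; e++) {
    /* ... compute idx[] and Ke[] for element e ... */
    ierr = MatSetValues(A, 8, idx, 8, idx, Ke, ADD_VALUES);CHKERRQ(ierr);
    /* periodically ship stashed off-process values in smaller batches */
    if ((e + 1) % flushinterval == 0) {
      ierr = MatAssemblyBegin(A, MAT_FLUSH_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FLUSH_ASSEMBLY);CHKERRQ(ierr);
    }
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}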
Satish
On Wed, 27 Feb 2013, Barry Smith wrote:
>
> This sounds like an OpenMPI issue with the matrix element communication and assembly process. In this process PETSc "stashes" any values set on the "wrong" process, then at MatAssemblyBegin() each process computes
> how much data it is sending to each other process, tells each other process how much data to expect (the receivers then post receives), then actually sends the data. We have used this code over many years on many systems so it is likely to be relatively bug free. My feeling is that the OpenMPI hardware/software combination is getting overwhelmed with all the messages that need to be sent around.
>
> Since you are handling this communication completely differently with Trilinos, it doesn't have this problem.
>
> Since we can't change the hardware, the first thing to check is how much (what percentage) of the matrix entries need to be stashed and moved between processes. Run the 100 by 100 by 100 case on 8 cores with the -info option and send the resulting output (it may be a lot) to petsc-maint at mcs.anl.gov (not petsc-users) and we'll show you how to determine how much data is being moved around. Perhaps you could also send the output from the successful 800-core run (with -info).
>
> Barry
>
>
>
> On Feb 27, 2013, at 10:02 AM, "Ramsey, James J CIV (US)" <james.j.ramsey14.civ at mail.mil> wrote:
>
> > I apologize in advance for any naivete. I'm new to running largeish-scale FEM problems.
> >
> > I have written a code intended to find the elastic properties of the unit cell of a microstructure rendered as a finite element mesh. [FWIW, basically I'm trying to solve equation (19) from Composites: Part A 32 (2001) 1291-1301.] The method of partitioning is kind of crude. The format of the file that the code reads in to determine element connectivity is the same as the format used by mpmetis. I run mpmetis on the element connectivity file, which produces two files indicating the partitioning of the nodes and elements. Each process then incrementally reads in the files used to define the mesh, using the output files from mpmetis to determine which nodes and elements to store in memory. (Yes, I've become aware that there are problems with multiple processes reading the same file, especially when the file is large.) For meshes with 50 x 50 x 50 or 100 x 100 x 100 elements, the code seems to work reasonably well. The 100x100x100 mesh has run on a single 8-core node with 20 GB of RAM, for about 125,000 elements per core. If I try a mesh with 400 x 400 x 400 elements, I start running into problems.
> >
> > On one cluster (the same cluster on which I ran the 100x100x100 mesh), the 400x400x400 mesh wouldn't finish its run on 512 cores, which seems odd to me since (1) the number of elements per core is about the same as the case where the 100x100x100 mesh ran on 8 cores and (2) an earlier version of the code using Epetra from Trilinos did work on that many cores. This might just be an issue with me preallocating too many non-zeros and running out of memory, though I'm not sure why that wouldn't have been a problem for the 100x100x100 run. On 512 cores, the code dies as it loops over the local elements to assemble its part of the global stiffness matrix. On 608 cores, the code dies differently. It finishes looping over the elements, but dies with "RETRY EXCEEDED" errors from OpenMPI. For 800 and 1024 cores, the code appears to work again. FYI, this cluster has 4x DDR Infiniband interconnects. (I don't know what "4x DDR" means, but maybe someone else does.)
> >
> > On a different--and newer--cluster, I get no joy with the 400x400x400 mesh at all. This cluster has 16 cores and 64 GB of RAM per node, and FDR-10 Infiniband interconnects. For 512 and 608 cores, the code seems to die as it loops over the elements, while for 704, 800, and 912 cores, the code finishes its calls to MatAssemblyBegin(), but during the calls to MatAssemblyEnd(), I get thousands of warning messages from OpenMPI, saying "rdma_recv.c:578 MXM WARN Many RX drops detected. Application performance may be affected". On this cluster, the Trilinos version of the code worked, even at 512 cores.
> >
> > For a 500x500x500 mesh, I have no luck on either cluster with PETSc, and only one cluster (the one with 16 cores per node) seems to work with the Trilinos version of the code. (It was actually the failures with the 500x500x500 mesh that led me to rewrite the relevant parts of the code using PETSc.) For the cluster with 8 cores per node, running a 500x500x500 mesh on 1024 cores, the code usually dies during the calls to MatAssemblyEnd(), spewing out "RETRY EXCEEDED" errors from OpenMPI. I have done one trial with 1504 cores on the 8-core/node cluster, but it seems to have died before the code even starts. On the other cluster, I've tried cases with 1024 and 1504 cores, and during the calls to MatAssemblyEnd(), I get thousands of warning messages from OpenMPI, saying "rdma_recv.c:578 MXM WARN Many RX drops detected. Application performance may be affected." (Interestingly enough, in the output from the Trilinos version of my code, running on 1024 cores, I get one warning from OpenMPI about the RX drops, but the code appears to have finished successfully and gotten reasonable results.)
> >
> > I'm trying to make sense of what barriers I'm hitting here. If the main problem is, say, that the MatAssemblyXXX() calls entail a lot of communication, then why would increasing the number of cores solve the problem for the 400x400x400 case? And why does increasing the number of cores not seem to help for the other cluster? Am I doing something naive here? (Or maybe the question should be what naive things am I doing here.)
>
>