[petsc-users] Problems exceeding 192 million unknowns in FEM code

Ramsey, James J CIV (US) james.j.ramsey14.civ at mail.mil
Wed Feb 27 10:02:28 CST 2013


I apologize in advance for any naivete. I'm new to running largeish-scale FEM problems.

I have written a code intended to find the elastic properties of the unit cell of a microstructure rendered as a finite element mesh. [FWIW, basically I'm trying to solve equation (19) from Composites: Part A 32 (2001) 1291-1301.] The method of partitioning is kind of crude. The element-connectivity file that the code reads uses the same format as the mesh file that mpmetis takes as input. I run mpmetis on the element-connectivity file, which produces two files indicating the partitioning of the nodes and elements. Each process then incrementally reads in the files used to define the mesh, using the output files from mpmetis to determine which nodes and elements to store in memory. (Yes, I've become aware that there are problems with multiple processes reading the same file, especially when the file is large.) For meshes with 50 x 50 x 50 or 100 x 100 x 100 elements, the code seems to work reasonably well. The 100x100x100 mesh has run on a single 8-core node with 20 GB of RAM, for about 125,000 elements per core. If I try a mesh with 400 x 400 x 400 elements, I start running into problems.
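For concreteness, the ownership pass over the mpmetis output goes roughly like the sketch below. It is a simplified stand-in rather than my actual code, and it assumes the usual mpmetis naming of <meshfile>.epart.<nparts> with one partition id per line; the file name and part count are placeholders.

/* Sketch of the per-rank ownership pass over the mpmetis element-partition
 * file.  "mesh.epart.512" is a placeholder name; mpmetis writes one integer
 * (the owning part) per element, one per line. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int   rank, part;
  long  elem = 0, nelem_local = 0;
  FILE *fp;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  fp = fopen("mesh.epart.512", "r");          /* placeholder file name */
  if (!fp) MPI_Abort(MPI_COMM_WORLD, 1);

  /* every rank scans the whole file and keeps only its own elements */
  while (fscanf(fp, "%d", &part) == 1) {
    if (part == rank) {
      /* element "elem" is locally owned: in the real code its id would be
         recorded so its connectivity and nodal data can be read later */
      nelem_local++;
    }
    elem++;
  }
  fclose(fp);

  printf("rank %d owns %ld of %ld elements\n", rank, nelem_local, elem);
  MPI_Finalize();
  return 0;
}

The node-partition (.npart) file gets scanned the same way to decide which nodes each process stores.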

On one cluster (the same cluster on which I ran the 100x100x100 mesh), the 400x400x400 mesh wouldn't finish its run on 512 cores, which seems odd to me since (1) the number of elements per core is about the same as when the 100x100x100 mesh ran on 8 cores and (2) an earlier version of the code using Epetra from Trilinos did work on that many cores. This might just be an issue with me preallocating too many non-zeros and running out of memory, though I'm not sure why that wouldn't have been a problem for the 100x100x100 run. On 512 cores, the code dies as it loops over the local elements to assemble its part of the global stiffness matrix. On 608 cores, the code dies differently: it finishes looping over the elements, but then dies with "RETRY EXCEEDED" errors from OpenMPI. For 800 and 1024 cores, the code appears to work again. FYI, this cluster has 4x DDR InfiniBand interconnects. (I don't know what "4x DDR" means, but maybe someone else does.)
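To make the preallocation part concrete, the setup I have in mind is roughly the sketch below. The per-row counts are placeholders, not my actual estimates; the one thing I am fairly sure of is that MAT_NEW_NONZERO_ALLOCATION_ERR makes an underestimate fail loudly instead of triggering quiet reallocations during MatSetValues().

/* Sketch (PETSc C API): preallocating the MPIAIJ stiffness matrix.
 * The per-row counts are placeholders; in real code they would come from
 * the mesh connectivity (how many nodes couple to each node). */
#include <stdlib.h>
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat       A;
  PetscInt  i, nlocal = 3000;   /* locally owned rows: placeholder value */
  PetscInt *d_nnz, *o_nnz;

  PetscInitialize(&argc, &argv, NULL, NULL);

  d_nnz = (PetscInt *) malloc(nlocal * sizeof(PetscInt));
  o_nnz = (PetscInt *) malloc(nlocal * sizeof(PetscInt));
  for (i = 0; i < nlocal; i++) {
    d_nnz[i] = 81;              /* placeholder: up to 27 coupled nodes x 3 dof */
    o_nnz[i] = 27;              /* placeholder off-process estimate */
  }

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE);
  MatSetType(A, MATMPIAIJ);
  MatMPIAIJSetPreallocation(A, 0, d_nnz, 0, o_nnz);
  /* turn an underestimated preallocation into an immediate error instead
     of a quiet (and very slow) malloc during assembly */
  MatSetOption(A, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_TRUE);

  /* ... element loop with MatSetValues(..., ADD_VALUES), then
     MatAssemblyBegin/MatAssemblyEnd with MAT_FINAL_ASSEMBLY ... */

  MatDestroy(&A);
  free(d_nnz);
  free(o_nnz);
  PetscFinalize();
  return 0;
}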

On a different--and newer--cluster, I get no joy with the 400x400x400 mesh at all. This cluster has 16 cores and 64 GB of RAM per node, and FDR-10 InfiniBand interconnects. For 512 and 608 cores, the code seems to die as it loops over the elements, while for 704, 800, and 912 cores, the code finishes its calls to MatAssemblyBegin(), but during the calls to MatAssemblyEnd() I get thousands of warning messages from OpenMPI, saying "rdma_recv.c:578  MXM WARN  Many RX drops detected. Application performance may be affected". On this cluster, the Trilinos version of the code worked, even at 512 cores.
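One thing that might be worth measuring is how many of the values each rank generates actually belong to rows owned by other ranks, since those are exactly the entries that MatAssemblyBegin()/MatAssemblyEnd() have to stash and ship over the network. A small helper along these lines (a sketch; the 24-dof hex indexing is an assumption) would show whether the partitioning keeps insertions mostly local:

/* Sketch: count how many rows of one element's index set fall outside this
 * rank's ownership range; those values get stashed by MatSetValues() and
 * communicated during MatAssemblyBegin/End. */
#include <petscmat.h>

/* idx[] holds n global dof indices (e.g. 24 for an 8-node hex, 3 dof/node) */
static PetscInt CountOffProcessRows(Mat A, PetscInt n, const PetscInt idx[])
{
  PetscInt rstart, rend, i, noff = 0;

  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = 0; i < n; i++) {
    if (idx[i] < rstart || idx[i] >= rend) noff++;
  }
  return noff;
}

Summed over the local element loop (and then over ranks), that count would say whether most of the stiffness contributions land on locally owned rows or whether the assembly really is communication-heavy.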

For a 500x500x500 mesh, I have no luck on either cluster with PETSc, and only one cluster (the one with 16 cores per node) seems to work with the Trilinos version of the code. (It was actually the failures with the 500x500x500 mesh that led me to rewrite the relevant parts of the code using PETSc.) On the cluster with 8 cores per node, running a 500x500x500 mesh on 1024 cores, the code usually dies during the calls to MatAssemblyEnd(), spewing out "RETRY EXCEEDED" errors from OpenMPI. I have done one trial with 1504 cores on the 8-core/node cluster, but it seems to have died before the code even got started. On the other cluster, I've tried cases with 1024 and 1504 cores, and during the calls to MatAssemblyEnd(), I get thousands of warning messages from OpenMPI, saying "rdma_recv.c:578  MXM WARN  Many RX drops detected. Application performance may be affected." (Interestingly enough, in the output from the Trilinos version of my code running on 1024 cores, I get one warning from OpenMPI about the RX drops, but the code appears to have finished successfully and gotten reasonable results.)

I'm trying to make sense of what barriers I'm hitting here. If the main problem is, say, that the MatAssemblyXXX() calls entail a lot of communication, then why would increasing the number of cores solve the problem for the 400x400x400 case? And why does increasing the number of cores not seem to help on the other cluster? Am I doing something naive here? (Or maybe the question should be what naive things I am doing here.)
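For what it's worth, the kind of post-assembly check I can add is sketched below (not something I have run yet). Nonzero "mallocs" would mean the preallocation was underestimated, and a big gap between nz_allocated and nz_used would mean it was heavily overestimated, which is my current guess for the memory trouble.

/* Sketch: per-rank matrix statistics right after MatAssemblyEnd().
 * mallocs > 0              -> preallocation was too small;
 * nz_allocated >> nz_used  -> preallocation was too generous. */
#include <petscmat.h>

static PetscErrorCode ReportAssemblyInfo(Mat A)
{
  MatInfo     info;
  PetscMPIInt rank;

  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MatGetInfo(A, MAT_LOCAL, &info);
  PetscPrintf(PETSC_COMM_SELF,
              "[%d] nz_allocated=%.0f nz_used=%.0f mallocs=%.0f memory=%.0f\n",
              rank, info.nz_allocated, info.nz_used, info.mallocs, info.memory);
  return 0;
}

I would call this as ReportAssemblyInfo(A) right after MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); running with -info should also print how large the off-process stash gets during assembly.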

