[petsc-users] Problems exceeding 192 million unknowns in FEM code

Ramsey, James J CIV (US) james.j.ramsey14.civ at mail.mil
Wed Feb 27 12:44:21 CST 2013


________________________________________
From: petsc-users-bounces at mcs.anl.gov [petsc-users-bounces at mcs.anl.gov] on behalf of Barry Smith [bsmith at mcs.anl.gov]
Sent: Wednesday, February 27, 2013 11:53 AM
To: PETSc users list
Subject: Re: [petsc-users] Problems exceeding 192 million unknowns in FEM code

   This sounds like an OpenMPI issue with the matrix element communication and assembly process. In this process PETSc "stashes" any values set on the "wrong" process; then, at MatAssemblyBegin(), each process computes how much data it is sending to every other process, tells each other process how much data to expect (the receivers then post receives), and finally sends the data. We have used this code for many years on many systems, so it is likely to be relatively bug free. My feeling is that the OpenMPI hardware/software combination is getting overwhelmed by all the messages that need to be sent around.

   Since this communication is handled completely differently in Trilinos, it doesn't have this problem.
________________________________________

When a job using the Trilinos code goes south, it often dies at about the same place as the PETSc code does, at its equivalent of the MatAssemblyBegin()/MatAssemblyEnd() call (i.e. GlobalAssemble()). I suspect that I may have a similar underlying problem in both the PETSc and Trilinos versions of my code, though the details of how the problem is expressed may differ.
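For anyone following along, here is a minimal sketch (not my actual FEM code) of the insertion/assembly pattern being discussed: entries set in rows owned by another rank are stashed and only communicated during MatAssemblyBegin()/MatAssemblyEnd(). The matrix size and preallocation numbers below are placeholders.

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscInt       n = 100, rstart, rend, row;
  PetscScalar    v = 1.0;
  PetscMPIInt    rank, size;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);

  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                      n, n, 5, NULL, 5, NULL, &A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);

  /* Entries in locally owned rows: inserted directly, no communication. */
  for (row = rstart; row < rend; row++) {
    ierr = MatSetValues(A, 1, &row, 1, &row, &v, ADD_VALUES);CHKERRQ(ierr);
  }

  /* An entry in a row owned by the next rank: PETSc stashes it until
     assembly, then ships it to the owning process. */
  if (rank + 1 < size) {
    PetscInt offrow = rend; /* first row owned by the next rank */
    ierr = MatSetValues(A, 1, &offrow, 1, &offrow, &v, ADD_VALUES);CHKERRQ(ierr);
  }

  /* All stashed off-process values are exchanged and inserted here. */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}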

________________________________________

   Since we can't change the hardware, the first thing to check is how much (what percentage) of the matrix entries need to be stashed and moved between processes. Run the 100 by 100 by 100 case on 8 cores with the -info option and send the resulting output (it may be a lot) to petsc-maint at mcs.anl.gov (not petsc-users), and we'll show you how to determine how much data is being moved around. Perhaps you could also send the output from the successful 800-core run (with -info).
________________________________________

Done.
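For reference, the -info runs were launched roughly along these lines; the executable name below is just a placeholder for my FEM binary, and the output was captured to a file before mailing it:

mpiexec -n 8 ./myfem -info > info_8cores.log 2>&1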

