[petsc-users] problem of running jobs on cluster

Mon Oct 24 15:52:43 CDT 2011

On Oct 24, 2011, at 3:37 PM, Wen Jiang wrote:

> Hi guys,
> 
> I reported this problem a few days ago but I still cannot get it fixed. Right now I am learning how to debug the parallel code. And I just want to get some suggestions before I figure out how the debugger works.
> 
> This is just a big run of my own FEM code, which has almost the same structure as ex3 in the KSP examples. The code (the largest dof count I have used is around 65,000) runs fine on one compute node with any number of processes, and with a smaller dof count (less than 5,000) it also works fine on more than one compute node. However, I run into a problem when I try to run a larger job (for example, dof = 10,000) on two compute nodes.
> 
> The problem is that my code gets stuck at the MatAssemblyEnd() stage. I used the option -info to print information about the run and found that only some of the processes print the MatAssemblyEnd_SeqAIJ() information, so the code hangs there.
> 
> I have several questions here,
> 
> 1. In ex3, the comments say that the matrix is intentionally laid out across processors differently from the way it is assembled. As far as I understand, this means that MatSetValues() will insert values into rows owned by other processes (am I correct?). Since generating the entries on the 'wrong' process is expensive, I am wondering whether there is a better way to do it, especially for assembling the global stiffness matrix in FEM. (In my code, each MatSetValues() call adds a 64 by 64 element stiffness matrix.)
> 
> 2. Since my code (around 10,000 dofs) works fine on a single node but gets stuck on two nodes, I am guessing that this might be due to the large chunk of data that needs to be communicated between nodes during the MatAssembly stage. Is data communication slower between different nodes than within a single node?

   Absolutely. You want to generate most of the matrix entries on the process where they will be stored. You also need to make sure you've done the correct matrix preallocation: http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#efficient-assembly
Working for a small problem but taking "forever" for a larger one is a sign of bad preallocation or of too much data being computed on the wrong process.
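
   As a rough illustration of "generate entries where they will be stored", here is a minimal sketch in plain C (no PETSc calls) that picks, for each element, the process owning most of its rows. The function names are made up for this example. It assumes one dof per node and the even row split used when the local sizes are left as PETSC_DECIDE; in a real code the ranges should come from MatGetOwnershipRange() rather than be recomputed.

```c
#include <assert.h>

/* First global row owned by `rank` when N rows are split as evenly as
   possible over `size` processes (the first N % size ranks get one extra
   row). Assumption: this matches the default layout; query
   MatGetOwnershipRange() in real code. */
int first_row(int N, int size, int rank)
{
    int base = N / size, extra = N % size;
    return rank * base + (rank < extra ? rank : extra);
}

/* Rank that owns global row `row` under the split above. */
int row_owner(int N, int size, int row)
{
    assert(size > 0 && row >= 0 && row < N);
    int rank = 0;
    while (first_row(N, size, rank + 1) <= row) ++rank;
    return rank;
}

/* Hypothetical helper: assign an element to the rank that owns the
   largest number of its nodes; ties go to the lowest rank. */
int element_owner(int N, int size, const int *nodes, int nen)
{
    int best_rank = -1, best_count = 0;
    for (int a = 0; a < nen; ++a) {
        int ra = row_owner(N, size, nodes[a]);
        int count = 0;
        for (int b = 0; b < nen; ++b)
            if (row_owner(N, size, nodes[b]) == ra) ++count;
        if (count > best_count || (count == best_count && ra < best_rank)) {
            best_rank = ra;
            best_count = count;
        }
    }
    return best_rank;
}
```

   Each process would then loop only over the elements for which element_owner() returns its own rank and call MatSetValues() with ADD_VALUES for those. Rows on the partition boundary still get contributions from a neighboring process, but that is a small fraction of the matrix compared with assembling elements on arbitrary processes.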

   Also run the smaller problem with valgrind http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind to make sure there are no memory corruption problems that are slipping by on the small mesh but causing problems on the large.

    Also run the small problem and check for correct memory preallocation; if it is wrong for the small problem it will be wrong for the large.
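
    To have something to check the preallocation against, the exact per-row counts can be computed from the element connectivity. The following is a minimal sketch in plain C, not PETSc code: count_nnz() and its arguments are hypothetical, it assumes one dof per node and a square matrix whose columns are partitioned like its rows, and it loops over all elements for every owned row, which is fine for a small test mesh but too slow for production.

```c
#include <assert.h>
#include <stdlib.h>

/* For each locally owned row r in [rstart, rend), count the distinct
   columns coupled to r through some element, split into
   d_nnz (column in [rstart, rend), the "diagonal block") and
   o_nnz (all other columns, the "off-diagonal block").
   conn holds nelem * nen global node numbers, element by element.
   Returns 0 on success, -1 if the work array cannot be allocated. */
int count_nnz(int nelem, int nen, const int *conn, int nglobal,
              int rstart, int rend, int *d_nnz, int *o_nnz)
{
    assert(0 <= rstart && rstart <= rend && rend <= nglobal);
    int *mark = malloc((size_t)nglobal * sizeof *mark);
    if (!mark) return -1;
    for (int c = 0; c < nglobal; ++c) mark[c] = -1;

    for (int r = rstart; r < rend; ++r) {
        int i = r - rstart;              /* local row index */
        d_nnz[i] = 0;
        o_nnz[i] = 0;
        for (int e = 0; e < nelem; ++e) {
            const int *nodes = conn + (size_t)e * nen;
            int touches = 0;
            for (int a = 0; a < nen; ++a)
                if (nodes[a] == r) { touches = 1; break; }
            if (!touches) continue;
            for (int a = 0; a < nen; ++a) {
                int c = nodes[a];
                if (mark[c] == r) continue;   /* column already counted */
                mark[c] = r;
                if (c >= rstart && c < rend) ++d_nnz[i];
                else ++o_nnz[i];
            }
        }
    }
    free(mark);
    return 0;
}
```

    The two arrays have the shape that MatMPIAIJSetPreallocation() expects for its d_nnz and o_nnz arguments. With counts like these in place, the -info output for the small problem should report no additional mallocs during MatSetValues().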

  Barry

> 
> I appreciate any suggestions, and I will keep working on the debugging.
> 
> Thanks,
> Wen