[petsc-users] nondeterministic behavior of MUMPS when filtering out zero rows and columns
Smith, Barry F.
bsmith at mcs.anl.gov
Fri Nov 8 00:05:10 CST 2019
Make sure you have the latest PETSc and MUMPS installed; bugs in MUMPS have been fixed over time.
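A minimal sketch of one way to get current versions of both, assuming PETSc is configured from source and MUMPS is obtained through PETSc's --download options (the exact configure line for your machine will differ):

   ./configure --download-mumps --download-scalapack
   make all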
Hanging locations are best found with a debugger; there is really no other way. If you have a parallel debugger like DDT, use it. If you don't, you can use the PETSc option -start_in_debugger to have PETSc start a line debugger in an xterm for each process. Type cont in each window, and when it "hangs" hit control-C in each window and type bt; it will show the traceback where it is hanging on each process. Send us the output.
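For example, a minimal sketch assuming the job is launched with mpiexec, the executable is called ./app (a placeholder name), and the default gdb debugger is used:

   mpiexec -n 4 ./app -start_in_debugger

Then, in each xterm that opens:

   (gdb) cont
   ... when the run hangs, hit control-C in each xterm ...
   (gdb) bt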
Barry
Another approach that avoids the debugger is to send a signal to one of the MPI processes; term would be a good one to use. If you are lucky, that process will catch the signal and print a traceback of where it was when the signal occurred. If you are super lucky, you can send the signal to several processes and get several tracebacks.
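A rough sketch of that, assuming you can log in to the compute node and find the process id of one MPI rank with ps or top (the pid below is a placeholder):

   kill -TERM <pid-of-one-MPI-process>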
> On Nov 7, 2019, at 5:44 AM, s.a.hack--- via petsc-users <petsc-users at mcs.anl.gov> wrote:
>
> Hi,
>
> I am doing calculations with version 3.12.0 of PETSc.
> Using the finite-element method, I solve the Maxwell equations on the interior of a 3D domain, coupled with auxiliary boundary-condition equations on the boundary of the domain. The auxiliary equations employ auxiliary variables g.
>
> For ease of implementation of element matrix assembly, the auxiliary variables g are defined on the entire domain. However, only the basis functions for g with nonzero value at the boundary give nonzero entries in the system matrix.
>
> The element matrices hence have the structure
> [ A B; C D]
> at the boundary.
>
> In the interior the element matrices have the structure
> [A 0; 0 0].
>
> The degrees of freedom in the system matrix can be ordered by element [u_e1 g_e1 u_e2 g_e2 …] or by parallel process [u_p1 g_p1 u_p2 g_p2 …].
>
> To solve the system matrix, I need to filter out zero rows and columns:
> error = MatFindNonzeroRows(stiffnessMatrix, &nonzeroRows);
> CHKERRABORT(PETSC_COMM_WORLD, error);
> error = MatCreateSubMatrix(stiffnessMatrix, nonzeroRows, nonzeroRows, MAT_INITIAL_MATRIX, &stiffnessMatrixSubMatrix);
> CHKERRABORT(PETSC_COMM_WORLD, error);
>
> I solve the system matrix in parallel on multiple nodes connected with InfiniBand.
> The problem is that the MUMPS solver frequently (nondeterministically) hangs during KSPSolve() (after KSPSetUp() is completed).
> Running with the options -ksp_view and -info, the last printed statement is:
> [0] VecScatterCreate_SF(): Using StarForest for vector scatter
> In the calculations where the program does not hang, the calculated solution is correct.
>
> The problem doesn’t occur for calculations on a single node, or for calculations with the SuperLU solver (but SuperLU will not allow calculations that are as large).
> The problem also doesn’t seem to occur for small problems.
> The problem also doesn’t occur when I put ones on the diagonal, but this is computationally expensive:
> error = MatFindZeroRows(stiffnessMatrix, &zeroRows);
> CHKERRABORT(PETSC_COMM_WORLD, error);
> error = MatZeroRowsColumnsIS(stiffnessMatrix, zeroRows, diagEntry, PETSC_IGNORE, PETSC_IGNORE);
> CHKERRABORT(PETSC_COMM_WORLD, error);
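>
> For reference, a minimal sketch of the runtime options that select the two direct solvers mentioned above, assuming they are chosen through the PETSc options database (option names as in PETSc 3.12):
> -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps
> -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type superlu_dist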
>
> Would you have any ideas on what I could check?
>
> Best regards,
> Sjoerd