[petsc-users] Replicating a hang with MUMPS

David Knezevic david.knezevic at akselos.com
Wed Jun 27 13:36:41 CDT 2018


I ran into a case where a particular solve with MUMPS (called via "-ksp_type
preonly -pc_type lu -pc_factor_mat_solver_package mumps") hangs indefinitely
with 24 MPI processes, although it works fine with other numbers of
processes. The stack trace obtained when killing the job is included below,
in case it gives any clue as to what is wrong.

I'm trying to replicate this with a simple test case. I wrote the matrix and
right-hand side to disk using MatView and VecView, and then modified KSP
ex10 to read in these files and solve with 24 MPI processes. However, that
did not reproduce the hang, so I think I also need to use the same number of
rows per process in the test case as in the run that hung. As a result, I'm
wondering: is there a way to control or modify the parallel layout of the
matrix and vector when (or after) I read them in?
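
In case it helps to clarify what I have in mind, below is a rough sketch of
the kind of modification I'm imagining (purely illustrative and untested;
I'm assuming that setting the local sizes via MatSetSizes()/VecSetSizes()
before MatLoad()/VecLoad() is a valid way to prescribe the layout, and the
file names and the -nlocal option are placeholders I made up):

/* Sketch only (untested): read A and b with a prescribed row distribution
 * by setting local sizes before MatLoad()/VecLoad(). "A.dat", "b.dat" and
 * the -nlocal option are made-up placeholders; -nlocal stands in for the
 * number of rows owned by each rank in the original run that hung. */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            b, x;
  KSP            ksp;
  PetscViewer    viewer;
  PetscInt       nlocal = PETSC_DECIDE;
  PetscBool      flg    = PETSC_FALSE;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = PetscOptionsGetInt(NULL, NULL, "-nlocal", &nlocal, &flg);CHKERRQ(ierr);

  /* Load the matrix; if -nlocal was given, force that many rows onto each rank */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
  if (flg) {ierr = MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);}
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  /* Load the right-hand side with the matching layout */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "b.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_WORLD, &b);CHKERRQ(ierr);
  if (flg) {ierr = VecSetSizes(b, nlocal, PETSC_DETERMINE);CHKERRQ(ierr);}
  ierr = VecLoad(b, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  /* Solve; the solver (MUMPS) is selected on the command line as before */
  ierr = VecDuplicate(b, &x);CHKERRQ(ierr);
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

I would then run it with the same options as the original job, e.g.
"mpiexec -n 24 ./loadtest -nlocal <rows per rank> -ksp_type preonly -pc_type
lu -pc_factor_mat_solver_package mumps", with the caveat that a single
-nlocal value only covers a uniform distribution; to match the actual
per-rank row counts from the run that hung I would presumably need to read
them from a file instead.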

Also, if there are any other suggestions about reproducing or debugging
this issue, please let me know!

Best,
David

--------------------------------

#0  0x00007fb12bf0e74d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fb126262e58 in ?? () from /usr/lib/libopen-pal.so.13
#2  0x00007fb1262596fb in opal_libevent2021_event_base_loop () from
/usr/lib/libopen-pal.so.13
#3  0x00007fb126223238 in opal_progress () from /usr/lib/libopen-pal.so.13
#4  0x00007fb12cef53db in ompi_request_default_test () from
/usr/lib/libmpi.so.12
#5  0x00007fb12cf21d61 in PMPI_Test () from /usr/lib/libmpi.so.12
#6  0x00007fb127a5b939 in pmpi_test__ () from /usr/lib/libmpi_mpifh.so.12
#7  0x00007fb132888d87 in dmumps_try_recvtreat (comm_load=8, ass_irecv=40,
blocking=.FALSE., set_irecv=.TRUE., message_received=.FALSE., msgsou=-1,
msgtag=-1, status=..., bufr=..., lbufr=401408, lbufr_bytes=1605629,
procnode_steps=..., posfac=410095, iwpos=3151, iwposcb=30557,
    iptrlu=1536548, lrlu=1126454, lrlus=2864100, n=30675, iw=...,
liw=39935, a=..., la=3367108, ptrist=..., ptlust=..., ptrfac=...,
ptrast=..., step=..., pimaster=..., pamaster=..., nstk_s=..., comp=0,
iflag=0, ierror=0, comm=7, nbprocfils=..., ipool=..., lpool=48, leaf=2,
    nbfin=90, myid=33, slavef=90, root=..., opassw=353031,
opeliw=700399235, itloc=..., rhs_mumps=..., fils=..., ptrarw=...,
ptraiw=..., intarr=..., dblarr=..., icntl=..., keep=..., keep8=...,
dkeep=..., nd=..., frere=..., lptrar=30675, nelt=1, frtptr=..., frtelt=...,
    istep_to_iniv2=..., tab_pos_in_pere=..., stack_right_authorized=.TRUE.,
lrgroups=...) at dfac_process_message.F:646
#8  0x00007fb1328cfcd1 in dmumps_fac_par_m::dmumps_fac_par (n=30675,
iw=..., liw=39935, a=..., la=3367108, nstk_steps=..., nbprocfils=...,
nd=..., fils=..., step=..., frere=..., dad=..., cand=...,
istep_to_iniv2=..., tab_pos_in_pere=..., maxfrt=0, ntotpv=0, nmaxnpiv=150,
    ptrist=..., ptrast=..., pimaster=..., pamaster=..., ptrarw=...,
ptraiw=..., itloc=..., rhs_mumps=..., ipool=..., lpool=48, rinfo=...,
posfac=410095, iwpos=3151, lrlu=1126454, iptrlu=1536548, lrlus=2864100,
leaf=2, nbroot=1, nbrtot=90, uu=0.01, icntl=..., ptlust=..., ptrfac=...,
    nsteps=1, info=..., keep=..., keep8=..., procnode_steps=..., slavef=90,
myid=33, comm_nodes=7, myid_nodes=33, bufr=..., lbufr=401408,
lbufr_bytes=1605629, intarr=..., dblarr=..., root=..., perm=..., nelt=1,
frtptr=..., frtelt=..., lptrar=30675, comm_load=8, ass_irecv=40,
    seuil=0, seuil_ldlt_niv2=0, mem_distrib=..., ne=..., dkeep=...,
pivnul_list=..., lpn_list=1, lrgroups=...) at dfac_par_m.F:207
#9  0x00007fb13287f875 in dmumps_fac_b (n=30675, nsteps=1, a=...,
la=3367108, iw=..., liw=39935, sym_perm=..., na=..., lna=47, ne_steps=...,
nfsiz=..., fils=..., step=..., frere=..., dad=..., cand=...,
istep_to_iniv2=..., tab_pos_in_pere=..., ptrar=..., ldptrar=30675,
ptrist=...,
    ptlust_s=..., ptrfac=..., iw1=..., iw2=..., itloc=..., rhs_mumps=...,
pool=..., lpool=48, cntl1=0.01, icntl=..., info=..., rinfo=..., keep=...,
keep8=..., procnode_steps=..., slavef=90, comm_nodes=7, myid=33,
myid_nodes=33, bufr=..., lbufr=401408, lbufr_bytes=1605629,
    intarr=..., dblarr=..., root=..., nelt=1, frtptr=..., frtelt=...,
comm_load=8, ass_irecv=40, seuil=0, seuil_ldlt_niv2=0, mem_distrib=...,
dkeep=..., pivnul_list=..., lpn_list=1, lrgroups=...) at dfac_b.F:167
#10 0x00007fb1328419ed in dmumps_fac_driver (id=<error reading variable:
value requires 600640 bytes, which is more than max-value-size>) at
dfac_driver.F:2291
#11 0x00007fb1327ff6dc in dmumps (id=<error reading variable: value
requires 600640 bytes, which is more than max-value-size>) at
dmumps_driver.F:1686
#12 0x00007fb1327faf0a in dmumps_f77 (job=2, sym=0, par=1, comm_f77=5,
n=30675, icntl=..., cntl=..., keep=..., dkeep=..., keep8=..., nz=0, nnz=0,
irn=..., irnhere=0, jcn=..., jcnhere=0, a=..., ahere=0, nz_loc=622296,
nnz_loc=0, irn_loc=..., irn_lochere=1, jcn_loc=...,
    jcn_lochere=1, a_loc=..., a_lochere=1, nelt=0, eltptr=...,
eltptrhere=0, eltvar=..., eltvarhere=0, a_elt=..., a_elthere=0,
perm_in=..., perm_inhere=0, rhs=..., rhshere=0, redrhs=..., redrhshere=0,
info=..., rinfo=..., infog=..., rinfog=..., deficiency=0, lwk_user=0,
    size_schur=0, listvar_schur=..., listvar_schurhere=0, schur=...,
schurhere=0, wk_user=..., wk_userhere=0, colsca=..., colscahere=0,
rowsca=..., rowscahere=0, instance_number=1, nrhs=1, lrhs=0, lredrhs=0,
rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0,
    irhs_sparse=..., irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0,
isol_loc=..., isol_lochere=0, nz_rhs=0, lsol_loc=0, schur_mloc=0,
schur_nloc=0, schur_lld=0, mblock=0, nblock=0, nprow=0, npcol=0,
ooc_tmpdir=..., ooc_prefix=..., write_problem=..., tmpdirlen=20,
prefixlen=20,
    write_problemlen=20) at dmumps_f77.F:267
#13 0x00007fb1327f9cfa in dmumps_c (mumps_par=mumps_par at entry=0x12bd9660)
at mumps_c.c:417
#14 0x00007fb1321a23fc in MatFactorNumeric_MUMPS (F=0x12bd8b60,
A=0x26bd890, info=<optimized out>) at
/home/buildslave/software/petsc-src/src/mat/impls/aij/mpi/mumps/mumps.c:1073
#15 0x00007fb131ec6ea7 in MatLUFactorNumeric (fact=0x12bd8b60,
mat=0x26bd890, info=info at entry=0xc2a66f8) at
/home/buildslave/software/petsc-src/src/mat/interface/matrix.c:3025
#16 0x00007fb1325040d6 in PCSetUp_LU (pc=0xc2a6380) at
/home/buildslave/software/petsc-src/src/ksp/pc/impls/factor/lu/lu.c:131
#17 0x00007fb13259903e in PCSetUp (pc=0xc2a6380) at
/home/buildslave/software/petsc-src/src/ksp/pc/interface/precon.c:923
#18 0x00007fb13263e53f in KSPSetUp (ksp=ksp at entry=0x12b28c70) at
/home/buildslave/software/petsc-src/src/ksp/ksp/interface/itfunc.c:381
#19 0x00007fb13263ed36 in KSPSolve (ksp=0x12b28c70, b=0xad77d50,
x=0xad801c0) at
/home/buildslave/software/petsc-src/src/ksp/ksp/interface/itfunc.c:612
#20 0x00007fb12db5dfc2 in
libMesh::PetscLinearSolver<double>::solve(libMesh::SparseMatrix<double>&,
libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&,
libMesh::NumericVector<double>&, double, unsigned int) ()
   from
/mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../../third_party/opt_real/libmesh_opt.so.0
#21 0x00007fb1338d0c06 in
libMesh::PetscLinearSolver<double>::solve(libMesh::SparseMatrix<double>&,
libMesh::NumericVector<double>&, libMesh::NumericVector<double>&, double,
unsigned int) () from
/mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../lib/libscrbe-opt_real.so
#22 0x00007fb1335e8abd in std::pair<unsigned int, double>
SolveHelper::try_linear_solve<libMesh::LinearSolver<double>
>(libMesh::LinearSolver<double>&, libMesh::SolverConfiguration&,
libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&,
libMesh::NumericVector<double>&) ()
   from
/mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../lib/libscrbe-opt_real.so
#23 0x00007fb133a70206 in