[petsc-users] Replicating a hang with MUMPS

Stefano Zampini stefano.zampini at gmail.com
Wed Jun 27 13:47:23 CDT 2018


Forwarding to the list

On Wed, 27 Jun 2018 at 21:46, Stefano Zampini <stefano.zampini at gmail.com> wrote:

> You can call MatSetSizes before MatLoad (a minimal sketch follows after the
> quoted message below).
>
> On Wed, 27 Jun 2018 at 21:36, David Knezevic <david.knezevic at akselos.com> wrote:
>
>> I ran into a case where a particular solve with MUMPS (called via
>> "-ksp_type preonly -pc_type lu -pc_factor_mat_solver_package mumps")
>> hangs indefinitely with 24 MPI processes, but works fine with other
>> numbers of processes. The stack trace captured when killing the job is
>> below, in case it gives any clue about what is wrong.
>>
>> I'm trying to replicate this with a simple test case. I wrote the matrix
>> and right-hand side to disk using MatView and VecView, then modified ksp
>> ex10 to read in those files and solve with 24 cores. However, that did
>> not reproduce the hang, so I suspect I also need to use the same number
>> of rows per process in the test case as in the run that hung. Is there a
>> way to change the parallel layout of the matrix and vector after I read
>> them in?
>>
>> Also, if there are any other suggestions about reproducing or debugging
>> this issue, please let me know!
>>
>> Best,
>> David
>>
>> --------------------------------
>>
>> #0  0x00007fb12bf0e74d in poll () at ../sysdeps/unix/syscall-template.S:84
>> #1  0x00007fb126262e58 in ?? () from /usr/lib/libopen-pal.so.13
>> #2  0x00007fb1262596fb in opal_libevent2021_event_base_loop () from
>> /usr/lib/libopen-pal.so.13
>> #3  0x00007fb126223238 in opal_progress () from /usr/lib/libopen-pal.so.13
>> #4  0x00007fb12cef53db in ompi_request_default_test () from
>> /usr/lib/libmpi.so.12
>> #5  0x00007fb12cf21d61 in PMPI_Test () from /usr/lib/libmpi.so.12
>> #6  0x00007fb127a5b939 in pmpi_test__ () from /usr/lib/libmpi_mpifh.so.12
>> #7  0x00007fb132888d87 in dmumps_try_recvtreat (comm_load=8,
>> ass_irecv=40, blocking=.FALSE., set_irecv=.TRUE., message_received=.FALSE.,
>> msgsou=-1, msgtag=-1, status=..., bufr=..., lbufr=401408,
>> lbufr_bytes=1605629, procnode_steps=..., posfac=410095, iwpos=3151,
>> iwposcb=30557,
>>     iptrlu=1536548, lrlu=1126454, lrlus=2864100, n=30675, iw=...,
>> liw=39935, a=..., la=3367108, ptrist=..., ptlust=..., ptrfac=...,
>> ptrast=..., step=..., pimaster=..., pamaster=..., nstk_s=..., comp=0,
>> iflag=0, ierror=0, comm=7, nbprocfils=..., ipool=..., lpool=48, leaf=2,
>>     nbfin=90, myid=33, slavef=90, root=..., opassw=353031,
>> opeliw=700399235, itloc=..., rhs_mumps=..., fils=..., ptrarw=...,
>> ptraiw=..., intarr=..., dblarr=..., icntl=..., keep=..., keep8=...,
>> dkeep=..., nd=..., frere=..., lptrar=30675, nelt=1, frtptr=..., frtelt=...,
>>     istep_to_iniv2=..., tab_pos_in_pere=...,
>> stack_right_authorized=.TRUE., lrgroups=...) at dfac_process_message.F:646
>> #8  0x00007fb1328cfcd1 in dmumps_fac_par_m::dmumps_fac_par (n=30675,
>> iw=..., liw=39935, a=..., la=3367108, nstk_steps=..., nbprocfils=...,
>> nd=..., fils=..., step=..., frere=..., dad=..., cand=...,
>> istep_to_iniv2=..., tab_pos_in_pere=..., maxfrt=0, ntotpv=0, nmaxnpiv=150,
>>     ptrist=..., ptrast=..., pimaster=..., pamaster=..., ptrarw=...,
>> ptraiw=..., itloc=..., rhs_mumps=..., ipool=..., lpool=48, rinfo=...,
>> posfac=410095, iwpos=3151, lrlu=1126454, iptrlu=1536548, lrlus=2864100,
>> leaf=2, nbroot=1, nbrtot=90, uu=0.01, icntl=..., ptlust=..., ptrfac=...,
>>     nsteps=1, info=..., keep=..., keep8=..., procnode_steps=...,
>> slavef=90, myid=33, comm_nodes=7, myid_nodes=33, bufr=..., lbufr=401408,
>> lbufr_bytes=1605629, intarr=..., dblarr=..., root=..., perm=..., nelt=1,
>> frtptr=..., frtelt=..., lptrar=30675, comm_load=8, ass_irecv=40,
>>     seuil=0, seuil_ldlt_niv2=0, mem_distrib=..., ne=..., dkeep=...,
>> pivnul_list=..., lpn_list=1, lrgroups=...) at dfac_par_m.F:207
>> #9  0x00007fb13287f875 in dmumps_fac_b (n=30675, nsteps=1, a=...,
>> la=3367108, iw=..., liw=39935, sym_perm=..., na=..., lna=47, ne_steps=...,
>> nfsiz=..., fils=..., step=..., frere=..., dad=..., cand=...,
>> istep_to_iniv2=..., tab_pos_in_pere=..., ptrar=..., ldptrar=30675,
>> ptrist=...,
>>     ptlust_s=..., ptrfac=..., iw1=..., iw2=..., itloc=..., rhs_mumps=...,
>> pool=..., lpool=48, cntl1=0.01, icntl=..., info=..., rinfo=..., keep=...,
>> keep8=..., procnode_steps=..., slavef=90, comm_nodes=7, myid=33,
>> myid_nodes=33, bufr=..., lbufr=401408, lbufr_bytes=1605629,
>>     intarr=..., dblarr=..., root=..., nelt=1, frtptr=..., frtelt=...,
>> comm_load=8, ass_irecv=40, seuil=0, seuil_ldlt_niv2=0, mem_distrib=...,
>> dkeep=..., pivnul_list=..., lpn_list=1, lrgroups=...) at dfac_b.F:167
>> #10 0x00007fb1328419ed in dmumps_fac_driver (id=<error reading variable:
>> value requires 600640 bytes, which is more than max-value-size>) at
>> dfac_driver.F:2291
>> #11 0x00007fb1327ff6dc in dmumps (id=<error reading variable: value
>> requires 600640 bytes, which is more than max-value-size>) at
>> dmumps_driver.F:1686
>> #12 0x00007fb1327faf0a in dmumps_f77 (job=2, sym=0, par=1, comm_f77=5,
>> n=30675, icntl=..., cntl=..., keep=..., dkeep=..., keep8=..., nz=0, nnz=0,
>> irn=..., irnhere=0, jcn=..., jcnhere=0, a=..., ahere=0, nz_loc=622296,
>> nnz_loc=0, irn_loc=..., irn_lochere=1, jcn_loc=...,
>>     jcn_lochere=1, a_loc=..., a_lochere=1, nelt=0, eltptr=...,
>> eltptrhere=0, eltvar=..., eltvarhere=0, a_elt=..., a_elthere=0,
>> perm_in=..., perm_inhere=0, rhs=..., rhshere=0, redrhs=..., redrhshere=0,
>> info=..., rinfo=..., infog=..., rinfog=..., deficiency=0, lwk_user=0,
>>     size_schur=0, listvar_schur=..., listvar_schurhere=0, schur=...,
>> schurhere=0, wk_user=..., wk_userhere=0, colsca=..., colscahere=0,
>> rowsca=..., rowscahere=0, instance_number=1, nrhs=1, lrhs=0, lredrhs=0,
>> rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0,
>>     irhs_sparse=..., irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0,
>> isol_loc=..., isol_lochere=0, nz_rhs=0, lsol_loc=0, schur_mloc=0,
>> schur_nloc=0, schur_lld=0, mblock=0, nblock=0, nprow=0, npcol=0,
>> ooc_tmpdir=..., ooc_prefix=..., write_problem=..., tmpdirlen=20,
>> prefixlen=20,
>>     write_problemlen=20) at dmumps_f77.F:267
>> #13 0x00007fb1327f9cfa in dmumps_c (mumps_par=mumps_par at entry=0x12bd9660)
>> at mumps_c.c:417
>> #14 0x00007fb1321a23fc in MatFactorNumeric_MUMPS (F=0x12bd8b60,
>> A=0x26bd890, info=<optimized out>) at
>> /home/buildslave/software/petsc-src/src/mat/impls/aij/mpi/mumps/mumps.c:1073
>> #15 0x00007fb131ec6ea7 in MatLUFactorNumeric (fact=0x12bd8b60,
>> mat=0x26bd890, info=info at entry=0xc2a66f8) at
>> /home/buildslave/software/petsc-src/src/mat/interface/matrix.c:3025
>> #16 0x00007fb1325040d6 in PCSetUp_LU (pc=0xc2a6380) at
>> /home/buildslave/software/petsc-src/src/ksp/pc/impls/factor/lu/lu.c:131
>> #17 0x00007fb13259903e in PCSetUp (pc=0xc2a6380) at
>> /home/buildslave/software/petsc-src/src/ksp/pc/interface/precon.c:923
>> #18 0x00007fb13263e53f in KSPSetUp (ksp=ksp at entry=0x12b28c70) at
>> /home/buildslave/software/petsc-src/src/ksp/ksp/interface/itfunc.c:381
>> #19 0x00007fb13263ed36 in KSPSolve (ksp=0x12b28c70, b=0xad77d50,
>> x=0xad801c0) at
>> /home/buildslave/software/petsc-src/src/ksp/ksp/interface/itfunc.c:612
>> #20 0x00007fb12db5dfc2 in
>> libMesh::PetscLinearSolver<double>::solve(libMesh::SparseMatrix<double>&,
>> libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&,
>> libMesh::NumericVector<double>&, double, unsigned int) ()
>>    from
>> /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../../third_party/opt_real/libmesh_opt.so.0
>> #21 0x00007fb1338d0c06 in
>> libMesh::PetscLinearSolver<double>::solve(libMesh::SparseMatrix<double>&,
>> libMesh::NumericVector<double>&, libMesh::NumericVector<double>&, double,
>> unsigned int) () from
>> /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../lib/libscrbe-opt_real.so
>> #22 0x00007fb1335e8abd in std::pair<unsigned int, double>
>> SolveHelper::try_linear_solve<libMesh::LinearSolver<double>
>> >(libMesh::LinearSolver<double>&, libMesh::SolverConfiguration&,
>> libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&,
>> libMesh::NumericVector<double>&) ()
>>    from
>> /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../lib/libscrbe-opt_real.so
>> #23 0x00007fb133a70206 in
>>
>
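
For reference, here is a minimal sketch of the approach suggested above (not
the actual modified ex10): call MatSetSizes/VecSetSizes before MatLoad/VecLoad
so the matrix and right-hand side read from the MatView/VecView files get the
same row distribution as the run that hung. The file names "mat.dat",
"rhs.dat", and "layout.txt" and the layout-file format are hypothetical;
error checking (CHKERRQ) is omitted for brevity.

/* Sketch: load a matrix and RHS written with MatView/VecView to binary
   viewers, forcing a prescribed number of rows per process. */
#include <stdio.h>
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat         A;
  Vec         b, x;
  KSP         ksp;
  PetscViewer viewer;
  PetscMPIInt rank;
  PetscInt    mlocal = PETSC_DECIDE;  /* local rows on this rank */
  FILE       *fp;
  int         i, count;

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

  /* layout.txt (hypothetical): one local row count per rank, in rank order,
     as recorded from the run that hung. The counts must sum to the global
     size stored in mat.dat. */
  fp = fopen("layout.txt", "r");
  if (fp) {
    for (i = 0; i <= rank; i++) {
      if (fscanf(fp, "%d", &count) == 1) mlocal = count;
    }
    fclose(fp);
  }

  /* Setting the sizes *before* MatLoad makes MatLoad honor this layout
     instead of choosing its own. */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, mlocal, mlocal, PETSC_DETERMINE, PETSC_DETERMINE);
  MatSetFromOptions(A);
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, "mat.dat", FILE_MODE_READ, &viewer);
  MatLoad(A, viewer);
  PetscViewerDestroy(&viewer);

  /* Same idea for the right-hand side. */
  VecCreate(PETSC_COMM_WORLD, &b);
  VecSetSizes(b, mlocal, PETSC_DETERMINE);
  VecSetFromOptions(b);
  PetscViewerBinaryOpen(PETSC_COMM_WORLD, "rhs.dat", FILE_MODE_READ, &viewer);
  VecLoad(b, viewer);
  PetscViewerDestroy(&viewer);

  VecDuplicate(b, &x);
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);  /* -ksp_type preonly -pc_type lu
                              -pc_factor_mat_solver_package mumps */
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  VecDestroy(&x);
  VecDestroy(&b);
  MatDestroy(&A);
  PetscFinalize();
  return 0;
}

The per-rank row counts for layout.txt can be recorded in the original run
with MatGetLocalSize or MatGetOwnershipRange on each rank; running the sketch
on 24 ranks with the same MUMPS options then gives the factorization the same
distribution as the job that hung.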