[petsc-users] Replicating a hang with MUMPS
David Knezevic
david.knezevic at akselos.com
Wed Jun 27 14:16:54 CDT 2018
On Wed, Jun 27, 2018 at 3:12 PM, Smith, Barry F. <bsmith at mcs.anl.gov> wrote:
>
> David,
>
> This is ugly but should work. BEFORE reading in the matrix and
> right-hand side, set the LOCAL sizes for the matrix and vector. This way
> you can control exactly which rows go on which process. Note that you will
> need your own mechanism to know what the local sizes should be (for
> example, have the original program print out the sizes and just cut and
> paste them into your copy of ex10.c). PETSc doesn't provide an automatic
> way to do this (nor should it).
>
> Barry
>
Thanks Barry and Stefano. The approach you both suggested (calling
MatSetSizes and VecSetSizes before reading in the matrix and vector) was
what I was looking for.
Best,
David
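
For anyone following the same route, here is a minimal, PETSc-free C sketch of the bookkeeping this approach needs. It assumes the per-rank row counts were printed by the original program and pasted into a table, as Barry describes. The helper names (default_local_rows, layout_is_consistent, ranks_differing_from_default) are hypothetical and are not PETSc functions. The default-split formula is the one PETSc documents for PETSC_DECIDE with block size 1; treat it as an assumption for any other configuration.

```c
#include <assert.h>

/* Rows the default (PETSC_DECIDE) split gives to `rank` when N global rows
 * are spread over `size` ranks, assuming block size 1: N/size rows each,
 * with the first N % size ranks taking one extra row. */
long default_local_rows(long N, int size, int rank)
{
    assert(size > 0 && rank >= 0 && rank < size);
    return N / size + ((N % size) > rank ? 1 : 0);
}

/* Sanity-check a hand-copied table of local sizes (one entry per rank).
 * Returns 1 if every entry is non-negative and the entries sum to N. */
int layout_is_consistent(const long *local_rows, int size, long N)
{
    long sum = 0;
    for (int r = 0; r < size; ++r) {
        if (local_rows[r] < 0) return 0;
        sum += local_rows[r];
    }
    return sum == N;
}

/* Count the ranks whose recorded local size differs from the default split.
 * A nonzero result means a plain load of the files would not reproduce the
 * original layout. */
int ranks_differing_from_default(const long *local_rows, int size, long N)
{
    int differing = 0;
    for (int r = 0; r < size; ++r) {
        if (local_rows[r] != default_local_rows(N, size, r)) ++differing;
    }
    return differing;
}

/* In the modified ex10.c (not compiled here), each rank would pass its own
 * entry m = local_rows[rank] to PETSc BEFORE loading, along the lines of:
 *
 *   MatSetSizes(A, m, m, PETSC_DETERMINE, PETSC_DETERMINE);
 *   MatLoad(A, viewer);
 *   VecSetSizes(b, m, PETSC_DETERMINE);
 *   VecLoad(b, viewer);
 */
```

With the n=30675 from the trace below and 24 ranks, default_local_rows gives 1279 rows on ranks 0 to 2 and 1278 rows on the remaining ranks. If the table recorded from the original run differs from that, ranks_differing_from_default reports how many ranks are affected.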
> On Jun 27, 2018, at 1:36 PM, David Knezevic <david.knezevic at akselos.com>
> wrote:
> >
> > I ran into a case where using MUMPS (called via "-ksp_type preonly
> -pc_type lu -pc_factor_mat_solver_package mumps") for a particular solve
> hangs indefinitely with 24 MPI processes (but it works fine with other
> numbers of processes). The stack trace when killing the job is below, in
> case that gives any clue as to what is wrong.
> >
> > I'm trying to replicate this with a simple test case. I wrote out the
> matrix and right-hand side to disk using MatView and VecView, and then I
> modified ksp ex10 to read in these files and solve with 24 cores. However,
> that did not replicate the error, so I think I also need to make sure that
> I use the same number of rows per process in the test case as in the case
> that hung. As a result I'm wondering if there is a way to modify the
> parallel layout of the matrix and vector after I read them in?
> >
> > Also, if there are any other suggestions about reproducing or debugging
> this issue, please let me know!
> >
> > Best,
> > David
> >
> > --------------------------------
> >
> > #0 0x00007fb12bf0e74d in poll () at
> ../sysdeps/unix/syscall-template.S:84
> > #1 0x00007fb126262e58 in ?? () from /usr/lib/libopen-pal.so.13
> > #2 0x00007fb1262596fb in opal_libevent2021_event_base_loop () from
> /usr/lib/libopen-pal.so.13
> > #3 0x00007fb126223238 in opal_progress () from
> /usr/lib/libopen-pal.so.13
> > #4 0x00007fb12cef53db in ompi_request_default_test () from
> /usr/lib/libmpi.so.12
> > #5 0x00007fb12cf21d61 in PMPI_Test () from /usr/lib/libmpi.so.12
> > #6 0x00007fb127a5b939 in pmpi_test__ () from /usr/lib/libmpi_mpifh.so.12
> > #7 0x00007fb132888d87 in dmumps_try_recvtreat (comm_load=8,
> ass_irecv=40, blocking=.FALSE., set_irecv=.TRUE., message_received=.FALSE.,
> msgsou=-1, msgtag=-1, status=..., bufr=..., lbufr=401408,
> lbufr_bytes=1605629, procnode_steps=..., posfac=410095, iwpos=3151,
> iwposcb=30557,
> > iptrlu=1536548, lrlu=1126454, lrlus=2864100, n=30675, iw=...,
> liw=39935, a=..., la=3367108, ptrist=..., ptlust=..., ptrfac=...,
> ptrast=..., step=..., pimaster=..., pamaster=..., nstk_s=..., comp=0,
> iflag=0, ierror=0, comm=7, nbprocfils=..., ipool=..., lpool=48, leaf=2,
> > nbfin=90, myid=33, slavef=90, root=..., opassw=353031,
> opeliw=700399235, itloc=..., rhs_mumps=..., fils=..., ptrarw=...,
> ptraiw=..., intarr=..., dblarr=..., icntl=..., keep=..., keep8=...,
> dkeep=..., nd=..., frere=..., lptrar=30675, nelt=1, frtptr=..., frtelt=...,
> > istep_to_iniv2=..., tab_pos_in_pere=...,
> stack_right_authorized=.TRUE., lrgroups=...) at dfac_process_message.F:646
> > #8 0x00007fb1328cfcd1 in dmumps_fac_par_m::dmumps_fac_par (n=30675,
> iw=..., liw=39935, a=..., la=3367108, nstk_steps=..., nbprocfils=...,
> nd=..., fils=..., step=..., frere=..., dad=..., cand=...,
> istep_to_iniv2=..., tab_pos_in_pere=..., maxfrt=0, ntotpv=0, nmaxnpiv=150,
> > ptrist=..., ptrast=..., pimaster=..., pamaster=..., ptrarw=...,
> ptraiw=..., itloc=..., rhs_mumps=..., ipool=..., lpool=48, rinfo=...,
> posfac=410095, iwpos=3151, lrlu=1126454, iptrlu=1536548, lrlus=2864100,
> leaf=2, nbroot=1, nbrtot=90, uu=0.01, icntl=..., ptlust=..., ptrfac=...,
> > nsteps=1, info=..., keep=..., keep8=..., procnode_steps=...,
> slavef=90, myid=33, comm_nodes=7, myid_nodes=33, bufr=..., lbufr=401408,
> lbufr_bytes=1605629, intarr=..., dblarr=..., root=..., perm=..., nelt=1,
> frtptr=..., frtelt=..., lptrar=30675, comm_load=8, ass_irecv=40,
> > seuil=0, seuil_ldlt_niv2=0, mem_distrib=..., ne=..., dkeep=...,
> pivnul_list=..., lpn_list=1, lrgroups=...) at dfac_par_m.F:207
> > #9 0x00007fb13287f875 in dmumps_fac_b (n=30675, nsteps=1, a=...,
> la=3367108, iw=..., liw=39935, sym_perm=..., na=..., lna=47, ne_steps=...,
> nfsiz=..., fils=..., step=..., frere=..., dad=..., cand=...,
> istep_to_iniv2=..., tab_pos_in_pere=..., ptrar=..., ldptrar=30675,
> ptrist=...,
> > ptlust_s=..., ptrfac=..., iw1=..., iw2=..., itloc=...,
> rhs_mumps=..., pool=..., lpool=48, cntl1=0.01, icntl=..., info=...,
> rinfo=..., keep=..., keep8=..., procnode_steps=..., slavef=90,
> comm_nodes=7, myid=33, myid_nodes=33, bufr=..., lbufr=401408,
> lbufr_bytes=1605629,
> > intarr=..., dblarr=..., root=..., nelt=1, frtptr=..., frtelt=...,
> comm_load=8, ass_irecv=40, seuil=0, seuil_ldlt_niv2=0, mem_distrib=...,
> dkeep=..., pivnul_list=..., lpn_list=1, lrgroups=...) at dfac_b.F:167
> > #10 0x00007fb1328419ed in dmumps_fac_driver (id=<error reading variable:
> value requires 600640 bytes, which is more than max-value-size>) at
> dfac_driver.F:2291
> > #11 0x00007fb1327ff6dc in dmumps (id=<error reading variable: value
> requires 600640 bytes, which is more than max-value-size>) at
> dmumps_driver.F:1686
> > #12 0x00007fb1327faf0a in dmumps_f77 (job=2, sym=0, par=1, comm_f77=5,
> n=30675, icntl=..., cntl=..., keep=..., dkeep=..., keep8=..., nz=0, nnz=0,
> irn=..., irnhere=0, jcn=..., jcnhere=0, a=..., ahere=0, nz_loc=622296,
> nnz_loc=0, irn_loc=..., irn_lochere=1, jcn_loc=...,
> > jcn_lochere=1, a_loc=..., a_lochere=1, nelt=0, eltptr=...,
> eltptrhere=0, eltvar=..., eltvarhere=0, a_elt=..., a_elthere=0,
> perm_in=..., perm_inhere=0, rhs=..., rhshere=0, redrhs=..., redrhshere=0,
> info=..., rinfo=..., infog=..., rinfog=..., deficiency=0, lwk_user=0,
> > size_schur=0, listvar_schur=..., listvar_schurhere=0, schur=...,
> schurhere=0, wk_user=..., wk_userhere=0, colsca=..., colscahere=0,
> rowsca=..., rowscahere=0, instance_number=1, nrhs=1, lrhs=0, lredrhs=0,
> rhs_sparse=..., rhs_sparsehere=0, sol_loc=..., sol_lochere=0,
> > irhs_sparse=..., irhs_sparsehere=0, irhs_ptr=..., irhs_ptrhere=0,
> isol_loc=..., isol_lochere=0, nz_rhs=0, lsol_loc=0, schur_mloc=0,
> schur_nloc=0, schur_lld=0, mblock=0, nblock=0, nprow=0, npcol=0,
> ooc_tmpdir=..., ooc_prefix=..., write_problem=..., tmpdirlen=20,
> prefixlen=20,
> > write_problemlen=20) at dmumps_f77.F:267
> > #13 0x00007fb1327f9cfa in dmumps_c (mumps_par=mumps_par@entry=0x12bd9660)
> at mumps_c.c:417
> > #14 0x00007fb1321a23fc in MatFactorNumeric_MUMPS (F=0x12bd8b60,
> A=0x26bd890, info=<optimized out>) at
> /home/buildslave/software/petsc-src/src/mat/impls/aij/mpi/mumps/mumps.c:1073
> > #15 0x00007fb131ec6ea7 in MatLUFactorNumeric (fact=0x12bd8b60,
> mat=0x26bd890, info=info@entry=0xc2a66f8) at
> /home/buildslave/software/petsc-src/src/mat/interface/matrix.c:3025
> > #16 0x00007fb1325040d6 in PCSetUp_LU (pc=0xc2a6380) at
> /home/buildslave/software/petsc-src/src/ksp/pc/impls/factor/lu/lu.c:131
> > #17 0x00007fb13259903e in PCSetUp (pc=0xc2a6380) at
> /home/buildslave/software/petsc-src/src/ksp/pc/interface/precon.c:923
> > #18 0x00007fb13263e53f in KSPSetUp (ksp=ksp@entry=0x12b28c70) at
> /home/buildslave/software/petsc-src/src/ksp/ksp/interface/itfunc.c:381
> > #19 0x00007fb13263ed36 in KSPSolve (ksp=0x12b28c70, b=0xad77d50,
> x=0xad801c0) at
> /home/buildslave/software/petsc-src/src/ksp/ksp/interface/itfunc.c:612
> > #20 0x00007fb12db5dfc2 in
> libMesh::PetscLinearSolver<double>::solve(libMesh::SparseMatrix<double>&,
> libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&,
> libMesh::NumericVector<double>&, double, unsigned int) ()
> > from /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../../third_party/opt_real/libmesh_opt.so.0
> > #21 0x00007fb1338d0c06 in
> libMesh::PetscLinearSolver<double>::solve(libMesh::SparseMatrix<double>&,
> libMesh::NumericVector<double>&, libMesh::NumericVector<double>&, double,
> unsigned int) () from
> /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../lib/libscrbe-opt_real.so
> > #22 0x00007fb1335e8abd in std::pair<unsigned int, double>
> SolveHelper::try_linear_solve<libMesh::LinearSolver<double> >(libMesh::LinearSolver<double>&,
> libMesh::SolverConfiguration&, libMesh::SparseMatrix<double>&,
> libMesh::NumericVector<double>&, libMesh::NumericVector<double>&) ()
> > from /mnt/fileserver/akselos-4.2.x/scrbe/build/bin/../lib/libscrbe-opt_real.so
> > #23 0x00007fb133a70206 in
>
>