<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">On Tue, 23 May 2023 at 20:09, Yann Jobic <<a href="mailto:yann.jobic@univ-amu.fr">yann.jobic@univ-amu.fr</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">If I may, you can use the command line option "-mat_mumps_icntl_4 2".<br>
MUMPS then prints information about the factorization step, such as the <br>
estimated memory it needs.<br>
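<br>
As a sketch of where this option goes, the snippet below assembles a launch command in Python. The executable name ./my_app, the use of mpiexec, and the solver-selection options are assumptions about the setup; only "-mat_mumps_icntl_4 2" comes from the suggestion above.<br>
<pre>
```python
import shlex


def mumps_command(n_procs, executable, extra_options):
    """Build an mpiexec command line that selects MUMPS LU and adds extra options."""
    # Assumed baseline: a direct LU solve through PETSc with MUMPS as the factorization package.
    options = {
        "-ksp_type": "preonly",
        "-pc_type": "lu",
        "-pc_factor_mat_solver_type": "mumps",
    }
    options.update(extra_options)
    args = ["mpiexec", "-n", str(n_procs), executable]
    for flag, value in options.items():
        args += [flag, str(value)]
    return shlex.join(args)


# 90 processes as in the report; ./my_app is a hypothetical executable name.
command = mumps_command(90, "./my_app", {"-mat_mumps_icntl_4": 2})
print(command)
```
</pre>
If the application already sets the solver options in code, appending only the ICNTL(4) option to the existing command line should be enough.<br>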
<br></blockquote><div>Thank you for your suggestion!</div><div> </div>Best wishes,<div>Zongze</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Best regards,<br>
<br>
Yann<br>
<br>
On 5/23/2023 at 11:59 AM, Matthew Knepley wrote:<br>
> On Mon, May 22, 2023 at 10:42 PM Zongze Yang <<a href="mailto:yangzongze@gmail.com" target="_blank">yangzongze@gmail.com</a>> wrote:<br>
> <br>
>     On Tue, 23 May 2023 at 05:31, Stefano Zampini<br>
>     <<a href="mailto:stefano.zampini@gmail.com" target="_blank">stefano.zampini@gmail.com</a>> wrote:<br>
> <br>
>         If I may add to the discussion, you may be running out of memory<br>
>         (OOM), since you are trying to factorize a problem with 3 million<br>
>         dofs. The OOM condition goes undetected and the run then fails at<br>
>         a later stage.<br>
> <br>
>     Thank you for your comment. I ran the problem with 90 processes<br>
>     distributed across three nodes, each equipped with 500G of memory.<br>
>     Is this amount of memory sufficient for factorizing a matrix with<br>
>     approximately 3 million degrees of freedom?<br>
> <br>
> <br>
> It really depends on the fill. Suppose that you get 1% fill, then<br>
> <br>
>    (3e6)^2 * 0.01 * 8 = 7.2e11 B, or roughly 1e12 B,<br>
> <br>
> and you have 1.5e12 B (3 nodes x 500 GB), so I could easily see running out of memory.<br>
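<br>
A minimal sketch of this back-of-the-envelope estimate, in Python. The 1% fill is an assumed figure, not a measurement. The second case is an added variation that uses the 3.86e6 dofs from the original report and 16-byte entries, on the assumption that the matrix is double-precision complex (the trace shows libzmumps). The function name factor_bytes is illustrative only.<br>
<pre>
```python
def factor_bytes(n_dofs, fill_fraction, bytes_per_entry):
    """Rough factor size if fill_fraction of the n x n entries end up nonzero."""
    return n_dofs ** 2 * fill_fraction * bytes_per_entry


# Three nodes with 500 GB each, as stated in the thread.
available = 3 * 500e9

# Numbers from the estimate above: 3e6 dofs, 1% fill, 8-byte real entries.
real_estimate = factor_bytes(3e6, 0.01, 8)

# Assumed variation: 3.86e6 dofs and 16-byte complex entries.
complex_estimate = factor_bytes(3.86e6, 0.01, 16)

for label, est in [
    ("real, 3e6 dofs", real_estimate),
    ("complex, 3.86e6 dofs", complex_estimate),
]:
    print(f"{label}: {est:.2e} B, {est / available:.0%} of available memory")
```
</pre>
Under these assumptions the real-valued case fits in the available memory and the complex case does not, so the outcome is quite sensitive to the actual fill and entry size.<br>
<br>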
> <br>
>    Thanks,<br>
> <br>
>       Matt<br>
> <br>
>     Thanks!<br>
>     Zongze<br>
> <br>
>         On Mon, 22 May 2023 at 20:03, Zongze Yang<br>
>         <<a href="mailto:yangzongze@gmail.com" target="_blank">yangzongze@gmail.com</a>> wrote:<br>
> <br>
>             Thanks!<br>
> <br>
>             Zongze<br>
> <br>
>             On Tue, 23 May 2023 at 00:09, Matthew Knepley<br>
>             <<a href="mailto:knepley@gmail.com" target="_blank">knepley@gmail.com</a>> wrote:<br>
> <br>
>                 On Mon, May 22, 2023 at 11:07 AM Zongze Yang<br>
>                 <<a href="mailto:yangzongze@gmail.com" target="_blank">yangzongze@gmail.com</a>> wrote:<br>
> <br>
>                     Hi,<br>
> <br>
>                     I hope this message finds you well. I am writing to<br>
>                     seek guidance regarding an error I encountered while<br>
>                     solving a linear system using MUMPS on multiple nodes:<br>
> <br>
> <br>
>                 Iprobe is buggy on several MPI implementations. PETSc<br>
>                 has an option for shutting it off for this reason.<br>
>                 However, I do not know how to shut it off inside MUMPS.<br>
>                 I would ask on their mailing list.<br>
> <br>
>                    Thanks,<br>
> <br>
>                       Matt<br>
> <br>
>                     ```bash<br>
>                     Abort(1681039) on node 60 (rank 60 in comm 240):<br>
>                     Fatal error in PMPI_Iprobe: Other MPI error, error<br>
>                     stack:<br>
>                     PMPI_Iprobe(124)..............:<br>
>                     MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,<br>
>                     comm=0xc4000026, flag=0x7ffc130f9c4c,<br>
>                     status=0x7ffc130f9e80) failed<br>
>                     MPID_Iprobe(240)..............:<br>
>                     MPIDI_iprobe_safe(108)........:<br>
>                     MPIDI_iprobe_unsafe(35).......:<br>
>                     MPIDI_OFI_do_iprobe(69).......:<br>
>                     MPIDI_OFI_handle_cq_error(949): OFI poll failed<br>
>                     (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)<br>
>                     Assertion failed in file<br>
>                     src/mpid/ch4/netmod/ofi/ofi_events.c at line 125: 0<br>
>                     ```<br>
> <br>
>                     The matrix in question has 3.86e+06 degrees of<br>
>                     freedom (dofs). Smaller problems solve without<br>
>                     any issues. However, when I attempt to solve the<br>
>                     larger system on multiple nodes, I encounter the<br>
>                     error above.<br>
> <br>
>                     The complete error message I received is as follows:<br>
>                     ```bash<br>
>                     Abort(1681039) on node 60 (rank 60 in comm 240):<br>
>                     Fatal error in PMPI_Iprobe: Other MPI error, error<br>
>                     stack:<br>
>                     PMPI_Iprobe(124)..............:<br>
>                     MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,<br>
>                     comm=0xc4000026, flag=0x7ffc130f9c4c,<br>
>                     status=0x7ffc130f9e80) failed<br>
>                     MPID_Iprobe(240)..............:<br>
>                     MPIDI_iprobe_safe(108)........:<br>
>                     MPIDI_iprobe_unsafe(35).......:<br>
>                     MPIDI_OFI_do_iprobe(69).......:<br>
>                     MPIDI_OFI_handle_cq_error(949): OFI poll failed<br>
>                     (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)<br>
>                     Assertion failed in file<br>
>                     src/mpid/ch4/netmod/ofi/ofi_events.c at line 125: 0<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPL_backtrace_show+0x26) [0x7f6076063f2c]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x41dc24) [0x7f6075fc5c24]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49cc51) [0x7f6076044c51]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49f799) [0x7f6076047799]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x451e18) [0x7f6075ff9e18]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x452272) [0x7f6075ffa272]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce836) [0x7f6075e76836]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce90d) [0x7f6075e7690d]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x48137b) [0x7f607602937b]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x44d471) [0x7f6075ff5471]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x407acd) [0x7f6075fafacd]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPIR_Err_return_comm+0x10a) [0x7f6075fafbea]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPI_Iprobe+0x312) [0x7f6075ddd542]<br>
>                     /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpifort.so.12(pmpi_iprobe+0x2f) [0x7f606e08f19f]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_load_MOD_zmumps_load_recv_msgs+0x142) [0x7f60737b194d]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_try_recvtreat_+0x34) [0x7f60738ab735]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_fac_par_m_MOD_zmumps_fac_par+0x991) [0x7f607378bcc8]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_par_i_+0x240) [0x7f6073881d36]<br>
>                     Abort(805938831) on node 51 (rank 51 in comm 240):<br>
>                     Fatal error in PMPI_Iprobe: Other MPI error, error<br>
>                     stack:<br>
>                     PMPI_Iprobe(124)..............:<br>
>                     MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,<br>
>                     comm=0xc4000017, flag=0x7ffe20e1402c,<br>
>                     status=0x7ffe20e14260) failed<br>
>                     MPID_Iprobe(244)..............:<br>
>                     progress_test(100)............:<br>
>                     MPIDI_OFI_handle_cq_error(949): OFI poll failed<br>
>                     (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_b_+0x1463) [0x7f60738831a1]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_driver_+0x6969) [0x7f60738446c9]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_+0x2d83) [0x7f60738bf9cf]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_f77_+0x178c) [0x7f60738c33bc]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_c+0x8f8) [0x7f60738baacb]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x894560) [0x7f6077297560]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(MatLUFactorNumeric+0x32e) [0x7f60773bb1e6]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0xf51665) [0x7f6077954665]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(PCSetUp+0x64b) [0x7f60779c77e0]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSetUp+0xfb6) [0x7f6077ac2d53]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x10c1c28) [0x7f6077ac4c28]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSolve+0x13) [0x7f6077ac8070]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x11249df) [0x7f6077b279df]<br>
>                     /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(SNESSolve+0x10df) [0x7f6077b676c6]<br>
>                     Abort(1) on node 60: Internal error<br>
>                     Abort(1007265423) on node 65 (rank 65 in comm 240):<br>
>                     Fatal error in PMPI_Iprobe: Other MPI error, error<br>
>                     stack:<br>
>                     PMPI_Iprobe(124)..............:<br>
>                     MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,<br>
>                     comm=0xc4000017, flag=0x7fff4d82827c,<br>
>                     status=0x7fff4d8284b0) failed<br>
>                     MPID_Iprobe(244)..............:<br>
>                     progress_test(100)............:<br>
>                     MPIDI_OFI_handle_cq_error(949): OFI poll failed<br>
>                     (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)<br>
>                     Abort(941205135) on node 32 (rank 32 in comm 240):<br>
>                     Fatal error in PMPI_Iprobe: Other MPI error, error<br>
>                     stack:<br>
>                     PMPI_Iprobe(124)..............:<br>
>                     MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,<br>
>                     comm=0xc4000017, flag=0x7fff715ba3fc,<br>
>                     status=0x7fff715ba630) failed<br>
>                     MPID_Iprobe(240)..............:<br>
>                     MPIDI_iprobe_safe(108)........:<br>
>                     MPIDI_iprobe_unsafe(35).......:<br>
>                     MPIDI_OFI_do_iprobe(69).......:<br>
>                     MPIDI_OFI_handle_cq_error(949): OFI poll failed<br>
>                     (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)<br>
>                     Abort(470941839) on node 75 (rank 75 in comm 0):<br>
>                     Fatal error in PMPI_Test: Other MPI error, error stack:<br>
>                     PMPI_Test(188)................:<br>
>                     MPI_Test(request=0x7efe31e03014,<br>
>                     flag=0x7ffea65d673c, status=0x7ffea65d6760) failed<br>
>                     MPIR_Test(73).................:<br>
>                     MPIR_Test_state(33)...........:<br>
>                     progress_test(100)............:<br>
>                     MPIDI_OFI_handle_cq_error(949): OFI poll failed<br>
>                     (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)<br>
>                     Abort(805946511) on node 31 (rank 31 in comm 256):<br>
>                     Fatal error in PMPI_Probe: Other MPI error, error stack:<br>
>                     PMPI_Probe(118)...............:<br>
>                     MPI_Probe(src=MPI_ANY_SOURCE, tag=7,<br>
>                     comm=0xc4000015, status=0x7fff9538b7a0) failed<br>
>                     MPID_Probe(159)...............:<br>
>                     progress_test(100)............:<br>
>                     MPIDI_OFI_handle_cq_error(949): OFI poll failed<br>
>                     (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)<br>
>                     Abort(1179791) on node 73 (rank 73 in comm 0): Fatal<br>
>                     error in PMPI_Test: Other MPI error, error stack:<br>
>                     PMPI_Test(188)................:<br>
>                     MPI_Test(request=0x5b638d4, flag=0x7ffd755119cc,<br>
>                     status=0x7ffd755121b0) failed<br>
>                     MPIR_Test(73).................:<br>
>                     MPIR_Test_state(33)...........:<br>
>                     progress_test(100)............:<br>
>                     MPIDI_OFI_handle_cq_error(949): OFI poll failed<br>
>                     (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)<br>
>                     ```<br>
> <br>
>                     Thank you very much for your time and consideration.<br>
> <br>
>                     Best wishes,<br>
>                     Zongze<br>
> <br>
> <br>
> <br>
>                 -- <br>
>                 What most experimenters take for granted before they<br>
>                 begin their experiments is infinitely more interesting<br>
>                 than any results to which their experiments lead.<br>
>                 -- Norbert Wiener<br>
> <br>
>                 <a href="https://www.cse.buffalo.edu/~knepley/" rel="noreferrer" target="_blank">https://www.cse.buffalo.edu/~knepley/</a><br>
>                 <<a href="http://www.cse.buffalo.edu/~knepley/" rel="noreferrer" target="_blank">http://www.cse.buffalo.edu/~knepley/</a>><br>
> <br>
>             -- <br>
>             Best wishes,<br>
>             Zongze<br>
> <br>
> <br>
> <br>
>         -- <br>
>         Stefano<br>
> <br>
> <br>
> <br>
</blockquote></div></div></div></div>