[petsc-users] MPI_Iprobe Error with MUMPS Solver on Multi-Nodes

Zongze Yang yangzongze at gmail.com
Tue May 23 06:59:40 CDT 2023


On Tue, 23 May 2023 at 19:51, Zongze Yang <yangzongze at gmail.com> wrote:

> Thank you for your suggestion. I solved the problem with SuperLU_DIST, and
> it works well.
>

This was solved on four nodes, each equipped with 500 GB of memory.
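
In case it helps anyone who hits the same error, here is a minimal sketch of the run-time options for selecting SuperLU_DIST as the LU factorization package (assuming the usual PETSc KSP/PC setup; the executable name and process count below are placeholders, not my actual run configuration):

```bash
# Sketch: direct LU solve with SuperLU_DIST instead of MUMPS.
# "./app" and "-n 120" are placeholders for the real executable and launch.
mpiexec -n 120 ./app \
    -ksp_type preonly \
    -pc_type lu \
    -pc_factor_mat_solver_type superlu_dist
```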

Best wishes,
Zongze

> Best wishes,
> Zongze
>
>
> On Tue, 23 May 2023 at 18:00, Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Mon, May 22, 2023 at 10:46 PM Zongze Yang <yangzongze at gmail.com>
>> wrote:
>>
>>> I have an additional question: is it possible for the SuperLU_DIST
>>> library to encounter the same MPI problem (PMPI_Iprobe failed) as MUMPS?
>>>
>>
>> I do not know if they use that function. But it is easy to try it out, so
>> I would.
>>
>>   Thanks,
>>
>>     Matt
>>
>>
>>> Best wishes,
>>> Zongze
>>>
>>>
>>> On Tue, 23 May 2023 at 10:41, Zongze Yang <yangzongze at gmail.com> wrote:
>>>
>>>> On Tue, 23 May 2023 at 05:31, Stefano Zampini <
>>>> stefano.zampini at gmail.com> wrote:
>>>>
>>>>> If I may add to the discussion, it may be that you are running out of
>>>>> memory (OOM), since you are trying to factorize a problem with 3 million
>>>>> dofs; the OOM goes undetected and the run then fails at a later stage.
>>>>>
>>>>
>>>> Thank you for your comment. I ran the problem with 90 processes
>>>> distributed across three nodes, each equipped with 500 GB of memory. Is
>>>> this amount of memory sufficient for solving a matrix with approximately
>>>> 3 million degrees of freedom?
>>>>
>>>> Thanks!
>>>> Zongze
>>>>
>>>>> On Mon, 22 May 2023 at 20:03, Zongze Yang <yangzongze at gmail.com> wrote:
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Zongze
>>>>>>
>>>>>> On Tue, 23 May 2023 at 00:09, Matthew Knepley <knepley at gmail.com> wrote:
>>>>>>
>>>>>>> On Mon, May 22, 2023 at 11:07 AM Zongze Yang <yangzongze at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I hope this letter finds you well. I am writing to seek guidance
>>>>>>>> regarding an error I encountered while solving a matrix using MUMPS on
>>>>>>>> multiple nodes:
>>>>>>>>
>>>>>>>
>>>>>>> MPI_Iprobe is buggy in several MPI implementations; PETSc has an option
>>>>>>> for shutting it off for this reason. However, I do not know how to shut
>>>>>>> it off inside MUMPS. I would ask on their mailing list.
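>>>>>>>
>>>>>>> For PETSc's own communication setup, the option I have in mind is the
>>>>>>> rendezvous algorithm selection (PetscCommBuildTwoSided); a minimal
>>>>>>> sketch, noting that this does not change the MPI_Iprobe calls MUMPS
>>>>>>> makes internally:
>>>>>>>
>>>>>>> ```bash
>>>>>>> # Sketch: avoid the Ibarrier/Iprobe-based rendezvous inside PETSc
>>>>>>> -build_twosided allreduce
>>>>>>> ```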
>>>>>>>
>>>>>>>   Thanks,
>>>>>>>
>>>>>>>      Matt
>>>>>>>
>>>>>>>
>>>>>>>> ```bash
>>>>>>>> Abort(1681039) on node 60 (rank 60 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000026, flag=0x7ffc130f9c4c,
>>>>>>>> status=0x7ffc130f9e80) failed
>>>>>>>> MPID_Iprobe(240)..............:
>>>>>>>> MPIDI_iprobe_safe(108)........:
>>>>>>>> MPIDI_iprobe_unsafe(35).......:
>>>>>>>> MPIDI_OFI_do_iprobe(69).......:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Assertion failed in file src/mpid/ch4/netmod/ofi/ofi_events.c at
>>>>>>>> line 125: 0
>>>>>>>> ```
>>>>>>>>
>>>>>>>> The matrix in question has approximately 3.86e+06 degrees of freedom
>>>>>>>> (dofs). Interestingly, smaller-scale problems are solved without any
>>>>>>>> issues; it is only when solving this larger matrix on multiple nodes
>>>>>>>> that I encounter the aforementioned error.
>>>>>>>>
>>>>>>>> The complete error message I received is as follows:
>>>>>>>> ```bash
>>>>>>>> Abort(1681039) on node 60 (rank 60 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000026, flag=0x7ffc130f9c4c,
>>>>>>>> status=0x7ffc130f9e80) failed
>>>>>>>> MPID_Iprobe(240)..............:
>>>>>>>> MPIDI_iprobe_safe(108)........:
>>>>>>>> MPIDI_iprobe_unsafe(35).......:
>>>>>>>> MPIDI_OFI_do_iprobe(69).......:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Assertion failed in file src/mpid/ch4/netmod/ofi/ofi_events.c at
>>>>>>>> line 125: 0
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPL_backtrace_show+0x26)
>>>>>>>> [0x7f6076063f2c]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x41dc24)
>>>>>>>> [0x7f6075fc5c24]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49cc51)
>>>>>>>> [0x7f6076044c51]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x49f799)
>>>>>>>> [0x7f6076047799]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x451e18)
>>>>>>>> [0x7f6075ff9e18]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x452272)
>>>>>>>> [0x7f6075ffa272]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce836)
>>>>>>>> [0x7f6075e76836]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x2ce90d)
>>>>>>>> [0x7f6075e7690d]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x48137b)
>>>>>>>> [0x7f607602937b]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x44d471)
>>>>>>>> [0x7f6075ff5471]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(+0x407acd)
>>>>>>>> [0x7f6075fafacd]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPIR_Err_return_comm+0x10a)
>>>>>>>> [0x7f6075fafbea]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpi.so.12(MPI_Iprobe+0x312)
>>>>>>>> [0x7f6075ddd542]
>>>>>>>> /nfs/opt/cascadelake/linux-centos7-cascadelake/gcc-9.4.0/mpich-3.4.2-qgtz76gekvjzuacy7wq5a26rqlewoxfc/lib/libmpifort.so.12(pmpi_iprobe+0x2f)
>>>>>>>> [0x7f606e08f19f]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_load_MOD_zmumps_load_recv_msgs+0x142)
>>>>>>>> [0x7f60737b194d]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_try_recvtreat_+0x34)
>>>>>>>> [0x7f60738ab735]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(__zmumps_fac_par_m_MOD_zmumps_fac_par+0x991)
>>>>>>>> [0x7f607378bcc8]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_par_i_+0x240)
>>>>>>>> [0x7f6073881d36]
>>>>>>>> Abort(805938831) on node 51 (rank 51 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7ffe20e1402c,
>>>>>>>> status=0x7ffe20e14260) failed
>>>>>>>> MPID_Iprobe(244)..............:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_b_+0x1463)
>>>>>>>> [0x7f60738831a1]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_fac_driver_+0x6969)
>>>>>>>> [0x7f60738446c9]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_+0x2d83)
>>>>>>>> [0x7f60738bf9cf]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_f77_+0x178c)
>>>>>>>> [0x7f60738c33bc]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/mumps-5.5.1-gb7wlwxwbalf5rw5vkp6gtkhfkdqpntz/lib/libzmumps.so(zmumps_c+0x8f8)
>>>>>>>> [0x7f60738baacb]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x894560)
>>>>>>>> [0x7f6077297560]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(MatLUFactorNumeric+0x32e)
>>>>>>>> [0x7f60773bb1e6]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0xf51665)
>>>>>>>> [0x7f6077954665]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(PCSetUp+0x64b)
>>>>>>>> [0x7f60779c77e0]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSetUp+0xfb6)
>>>>>>>> [0x7f6077ac2d53]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x10c1c28)
>>>>>>>> [0x7f6077ac4c28]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(KSPSolve+0x13)
>>>>>>>> [0x7f6077ac8070]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(+0x11249df)
>>>>>>>> [0x7f6077b279df]
>>>>>>>> /nfs/home/zzyang/opt/software/linux-centos7-cascadelake/gcc-9.4.0/petsc-develop-5wrc3y6lyelr3iyrlm3sr2jlh2wxif3k/lib/libpetsc.so.3.019(SNESSolve+0x10df)
>>>>>>>> [0x7f6077b676c6]
>>>>>>>> Abort(1) on node 60: Internal error
>>>>>>>> Abort(1007265423) on node 65 (rank 65 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7fff4d82827c,
>>>>>>>> status=0x7fff4d8284b0) failed
>>>>>>>> MPID_Iprobe(244)..............:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Abort(941205135) on node 32 (rank 32 in comm 240): Fatal error in
>>>>>>>> PMPI_Iprobe: Other MPI error, error stack:
>>>>>>>> PMPI_Iprobe(124)..............: MPI_Iprobe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=MPI_ANY_TAG, comm=0xc4000017, flag=0x7fff715ba3fc,
>>>>>>>> status=0x7fff715ba630) failed
>>>>>>>> MPID_Iprobe(240)..............:
>>>>>>>> MPIDI_iprobe_safe(108)........:
>>>>>>>> MPIDI_iprobe_unsafe(35).......:
>>>>>>>> MPIDI_OFI_do_iprobe(69).......:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Abort(470941839) on node 75 (rank 75 in comm 0): Fatal error in
>>>>>>>> PMPI_Test: Other MPI error, error stack:
>>>>>>>> PMPI_Test(188)................: MPI_Test(request=0x7efe31e03014,
>>>>>>>> flag=0x7ffea65d673c, status=0x7ffea65d6760) failed
>>>>>>>> MPIR_Test(73).................:
>>>>>>>> MPIR_Test_state(33)...........:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Abort(805946511) on node 31 (rank 31 in comm 256): Fatal error in
>>>>>>>> PMPI_Probe: Other MPI error, error stack:
>>>>>>>> PMPI_Probe(118)...............: MPI_Probe(src=MPI_ANY_SOURCE,
>>>>>>>> tag=7, comm=0xc4000015, status=0x7fff9538b7a0) failed
>>>>>>>> MPID_Probe(159)...............:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> Abort(1179791) on node 73 (rank 73 in comm 0): Fatal error in
>>>>>>>> PMPI_Test: Other MPI error, error stack:
>>>>>>>> PMPI_Test(188)................: MPI_Test(request=0x5b638d4,
>>>>>>>> flag=0x7ffd755119cc, status=0x7ffd755121b0) failed
>>>>>>>> MPIR_Test(73).................:
>>>>>>>> MPIR_Test_state(33)...........:
>>>>>>>> progress_test(100)............:
>>>>>>>> MPIDI_OFI_handle_cq_error(949): OFI poll failed
>>>>>>>> (ofi_events.c:951:MPIDI_OFI_handle_cq_error:Input/output error)
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Thank you very much for your time and consideration.
>>>>>>>>
>>>>>>>> Best wishes,
>>>>>>>> Zongze
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to which their
>>>>>>> experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>>
>>>>>> --
>>>>>> Best wishes,
>>>>>> Zongze
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Stefano
>>>>>
>>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
>>
>