[petsc-users] Weird memory leakage

Sun Aug 25 17:48:40 CDT 2013

On Sun, Aug 25, 2013 at 3:30 PM, Frank <fangxingjun0319 at gmail.com> wrote:

> Hi,
> I have a very weird problem here.
> I am using Fortran to call PETSc to solve a Poisson equation.
> When I run my code on 8 cores, it works fine, and the memory consumption
> does not increase. However, when it is run on 64 cores, it first
> gives lots of errors like this:
>
> [n310:18951] [[62652,0],2] -> [[62652,0],10] (node: n219) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] -> [[62652,0],18] (node: n128) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] -> [[62652,0],34] (node: n089) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] ORTED_CMD_PROCESSOR: STUCK IN INFINITE LOOP -
> ABORTING
> [n310:18951] *** Process received signal ***
> [n310:18951] Signal: Aborted (6)
> [n310:18951] Signal code: (-6)
> [n310:18951] [ 0] /lib64/libpthread.so.0() [0x35b120f500]
> [n310:18951] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x35b0e328a5]
> [n310:18951] [ 2] /lib64/libc.so.6(abort+0x175) [0x35b0e34085]
> [n310:18951] [ 3] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x243) [0x2ae5e02f0813]
> [n310:18951] [ 4] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a) [0x2ae5e032f56a]
> [n310:18951] [ 5] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12) [0x2ae5e032f242]
> [n310:18951] [ 6] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c) [0x2ae5e031845c]
> [n310:18951] [ 7] /global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_grpcomm_bad.so(+0x1bd7) [0x2ae5e28debd7]
> [n310:18951] [ 8] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_ess_base_orted_finalize+0x1e) [0x2ae5e02f431e]
> [n310:18951] [ 9] /global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_ess_tm.so(+0x1294) [0x2ae5e1ab1294]
> [n310:18951] [10] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_finalize+0x4e) [0x2ae5e02d0fbe]
> [n310:18951] [11] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4840b) [0x2ae5e02f040b]
> [n310:18951] [12] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a) [0x2ae5e032f56a]
> [n310:18951] [13] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12) [0x2ae5e032f242]
> [n310:18951] [14] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c) [0x2ae5e031845c]
> [n310:18951] [15] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_trigger_event+0x50) [0x2ae5e02dc930]
> [n310:18951] [16] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4916f) [0x2ae5e02f116f]
> [n310:18951] [17] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x149) [0x2ae5e02f0719]
> [n310:18951] [18] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a) [0x2ae5e032f56a]
> [n310:18951] [19] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12) [0x2ae5e032f242]
> [n310:18951] [20] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_dispatch+0x8) [0x2ae5e032f228]
> [n310:18951] [21] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon+0x9f0) [0x2ae5e02ef8a0]
> [n310:18951] [22] orted(main+0x88) [0x4024d8]
> [n310:18951] [23] /lib64/libc.so.6(__libc_start_main+0xfd) [0x35b0e1ecdd]
> [n310:18951] [24] orted() [0x402389]
> [n310:18951] *** End of error message ***
>
> but the program still gives the right result for a short period. After
> that, it suddenly stops because memory usage exceeds some limit. I don't
> understand this. If there is a memory leak in my code, how come it
> works with 8 cores? Please help me. Thank you so much!
>

All of the errors are Open MPI errors. The first thing to do is track down
why they are happening. I think your only option here is to get the system
administrator of your machine to help.

Since you have MPI errors, any number of weird things could be happening,
such as your job launching on many fewer nodes than requested (the errors
say some peers could not be contacted), which would account for memory
running out.
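
When handing this to a system administrator, it may help to list exactly
which nodes could not be reached. The following is only a rough sketch, not
an Open MPI tool: it assumes the daemon messages keep the form shown in the
quoted log, and the names PEER_LINE and unreachable_peers are made up for
this illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the quoted log:
# [n310:18951] [[62652,0],2] -> [[62652,0],10] (node: n219) oob-tcp:
PEER_LINE = re.compile(
    r"\[(?P<host>[^:\]]+):\d+\]\s+\[\[\d+,\d+\],\d+\]\s+->\s+"
    r"\[\[\d+,\d+\],\d+\]\s+\(node:\s*(?P<node>[^)]+)\)\s+oob-tcp"
)


def unreachable_peers(log_text):
    """Count (reporting_host, peer_node) pairs found in oob-tcp messages."""
    hits = Counter()
    for line in log_text.splitlines():
        # Drop mail quoting ("> ") so a pasted message parses as well.
        line = line.lstrip("> ").strip()
        match = PEER_LINE.search(line)
        if match:
            hits[(match.group("host"), match.group("node").strip())] += 1
    return hits


# Three header lines copied from the log above.
sample = """\
[n310:18951] [[62652,0],2] -> [[62652,0],10] (node: n219) oob-tcp:
[n310:18951] [[62652,0],2] -> [[62652,0],18] (node: n128) oob-tcp:
[n310:18951] [[62652,0],2] -> [[62652,0],34] (node: n089) oob-tcp:
"""

if __name__ == "__main__":
    for (host, node), count in sorted(unreachable_peers(sample).items()):
        print(f"{host} could not reach {node} ({count}x)")
```

Running it over the full job output, instead of the short sample, would give
the set of node pairs to check for firewall or network problems.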

  Thanks,

      Matt

> Sincerely
> Xingjun
>
>
>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener