[petsc-users] Weird memory leakage

Mark F. Adams mfadams at lbl.gov
Sun Aug 25 16:46:47 CDT 2013


On Aug 25, 2013, at 4:30 PM, Frank <fangxingjun0319 at gmail.com> wrote:

> Hi,
> I have a very weird problem here.
> I am using Fortran to call PETSc to solve a Poisson equation.
> When I run my code with 8 cores, it works fine and the consumed memory does not increase. However, when it is run with 64 cores, it first gives lots of errors like this:
> 
> [n310:18951] [[62652,0],2] -> [[62652,0],10] (node: n219) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] -> [[62652,0],18] (node: n128) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] -> [[62652,0],34] (node: n089) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] ORTED_CMD_PROCESSOR: STUCK IN INFINITE LOOP -
> ABORTING

I don't know where you are getting "memory" errors, but this looks like a pretty fatal error.  Unless someone recognizes something else, I'd look at this in a debugger and see where it is happening.  See if it's deterministic or not, and if it is, see what code is killing it.
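For reference, PETSc's runtime options can launch or attach a debugger for you, which makes this kind of inspection easier on a cluster. A minimal sketch, assuming gdb is available and using a hypothetical executable name ./poisson:

```shell
# Launch every MPI rank under gdb in its own xterm
# (requires a working X display forwarded from the compute nodes):
mpiexec -n 64 ./poisson -start_in_debugger

# Alternatively, attach a debugger only on a rank that hits an error,
# which is usually more practical at 64 ranks:
mpiexec -n 64 ./poisson -on_error_attach_debugger
```

The second form leaves the run undisturbed until something goes wrong, so it is also useful for checking whether the failure is deterministic.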

Mark

> [n310:18951] *** Process received signal ***
> [n310:18951] Signal: Aborted (6)
> [n310:18951] Signal code: (-6)
> [n310:18951] [ 0] /lib64/libpthread.so.0() [0x35b120f500]
> [n310:18951] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x35b0e328a5]
> [n310:18951] [ 2] /lib64/libc.so.6(abort+0x175) [0x35b0e34085]
> [n310:18951] [ 3]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x243)
> [0x2ae5e02f0813]
> [n310:18951] [ 4]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
> [0x2ae5e032f56a]
> [n310:18951] [ 5]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
> [0x2ae5e032f242]
> [n310:18951] [ 6]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c)
> [0x2ae5e031845c]
> [n310:18951] [ 7]
> /global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_grpcomm_bad.so(+0x1bd7)
> [0x2ae5e28debd7]
> [n310:18951] [ 8]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_ess_base_orted_finalize+0x1e)
> [0x2ae5e02f431e]
> [n310:18951] [ 9]
> /global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_ess_tm.so(+0x1294)
> [0x2ae5e1ab1294]
> [n310:18951] [10]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_finalize+0x4e)
> [0x2ae5e02d0fbe]
> [n310:18951] [11]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4840b)
> [0x2ae5e02f040b]
> [n310:18951] [12]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
> [0x2ae5e032f56a]
> [n310:18951] [13]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
> [0x2ae5e032f242]
> [n310:18951] [14]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c)
> [0x2ae5e031845c]
> [n310:18951] [15]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_trigger_event+0x50)
> [0x2ae5e02dc930]
> [n310:18951] [16]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4916f)
> [0x2ae5e02f116f]
> [n310:18951] [17]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x149)
> [0x2ae5e02f0719]
> [n310:18951] [18]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
> [0x2ae5e032f56a]
> [n310:18951] [19]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
> [0x2ae5e032f242]
> [n310:18951] [20]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_dispatch+0x8)
> [0x2ae5e032f228]
> [n310:18951] [21]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon+0x9f0)
> [0x2ae5e02ef8a0]
> [n310:18951] [22] orted(main+0x88) [0x4024d8]
> [n310:18951] [23] /lib64/libc.so.6(__libc_start_main+0xfd) [0x35b0e1ecdd]
> [n310:18951] [24] orted() [0x402389]
> [n310:18951] *** End of error message ***
> 
> but the program still gives the right result for a short period. After that, it suddenly stops because memory exceeds some limit. I don't understand this: if there is a memory leak in my code, how can it work with 8 cores? Please help me. Thank you so much!
> 
> Sincerely
> Xingjun
> 
> 
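As a first check for a leak on the PETSc side, the built-in memory tracing options will report any PETSc allocations still live at PetscFinalize(). A minimal sketch, again with a hypothetical executable name ./poisson:

```shell
# Report any memory allocated through PETSc that was never freed:
mpiexec -n 8 ./poisson -malloc_dump

# Print a performance summary at the end of the run,
# including per-object creation/destruction counts:
mpiexec -n 8 ./poisson -log_summary
```

If -malloc_dump is clean at 8 cores, the growth at 64 cores is more likely to come from MPI buffers or from objects created once per solve and never destroyed, which the creation/destruction counts in the log summary would show.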


