[petsc-users] Weird memory leakage

Frank fangxingjun0319 at gmail.com
Sun Aug 25 15:30:16 CDT 2013


Hi,
I have a very weird problem here.
I am using Fortran to call PETSc to solve a Poisson equation.
When I run my code on 8 cores it works fine and the memory consumption
does not increase. However, when it is run on 64 cores, it first prints
many errors like this:

[n310:18951] [[62652,0],2] -> [[62652,0],10] (node: n219) oob-tcp:
Number of attempts to create TCP connection has been exceeded. Can not
communicate with peer
[n310:18951] [[62652,0],2] -> [[62652,0],18] (node: n128) oob-tcp:
Number of attempts to create TCP connection has been exceeded. Can not
communicate with peer
[n310:18951] [[62652,0],2] -> [[62652,0],34] (node: n089) oob-tcp:
Number of attempts to create TCP connection has been exceeded. Can not
communicate with peer
[n310:18951] [[62652,0],2] ORTED_CMD_PROCESSOR: STUCK IN INFINITE LOOP -
ABORTING
[n310:18951] *** Process received signal ***
[n310:18951] Signal: Aborted (6)
[n310:18951] Signal code: (-6)
[n310:18951] [ 0] /lib64/libpthread.so.0() [0x35b120f500]
[n310:18951] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x35b0e328a5]
[n310:18951] [ 2] /lib64/libc.so.6(abort+0x175) [0x35b0e34085]
[n310:18951] [ 3]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x243)
[0x2ae5e02f0813]
[n310:18951] [ 4]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
[0x2ae5e032f56a]
[n310:18951] [ 5]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
[0x2ae5e032f242]
[n310:18951] [ 6]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c)
[0x2ae5e031845c]
[n310:18951] [ 7]
/global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_grpcomm_bad.so(+0x1bd7)
[0x2ae5e28debd7]
[n310:18951] [ 8]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_ess_base_orted_finalize+0x1e)
[0x2ae5e02f431e]
[n310:18951] [ 9]
/global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_ess_tm.so(+0x1294)
[0x2ae5e1ab1294]
[n310:18951] [10]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_finalize+0x4e)
[0x2ae5e02d0fbe]
[n310:18951] [11]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4840b)
[0x2ae5e02f040b]
[n310:18951] [12]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
[0x2ae5e032f56a]
[n310:18951] [13]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
[0x2ae5e032f242]
[n310:18951] [14]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c)
[0x2ae5e031845c]
[n310:18951] [15]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_trigger_event+0x50)
[0x2ae5e02dc930]
[n310:18951] [16]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4916f)
[0x2ae5e02f116f]
[n310:18951] [17]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x149)
[0x2ae5e02f0719]
[n310:18951] [18]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
[0x2ae5e032f56a]
[n310:18951] [19]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
[0x2ae5e032f242]
[n310:18951] [20]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_dispatch+0x8)
[0x2ae5e032f228]
[n310:18951] [21]
/global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon+0x9f0)
[0x2ae5e02ef8a0]
[n310:18951] [22] orted(main+0x88) [0x4024d8]
[n310:18951] [23] /lib64/libc.so.6(__libc_start_main+0xfd) [0x35b0e1ecdd]
[n310:18951] [24] orted() [0x402389]
[n310:18951] *** End of error message ***

Despite these errors, the program still gives the right result for a short period. After that it suddenly stops because the memory exceeds some limit. I don't understand this: if there were a memory leak in my code, how could it work with 8 cores? Please help me. Thank you so much!
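For reference, a minimal sketch of the kind of per-step memory check that could be dropped into the time loop to see whether the footprint actually grows on each rank (this is only an illustration; the include path and module usage assume a recent PETSc Fortran interface and may differ by version, and the loop body stands in for the actual Poisson solve). Running with the PETSc options -malloc_dump or -log_view at exit can also report objects that were never destroyed.

      program memcheck
      ! Sketch only: prints resident-set size and PETSc-malloc'd bytes
      ! per rank each "time step" so a leak shows up as steady growth.
#include <petsc/finclude/petscsys.h>
      use petscsys
      implicit none
      PetscErrorCode ierr
      PetscLogDouble rss, mal
      PetscMPIInt    rank
      PetscInt       step

      call PetscInitialize(PETSC_NULL_CHARACTER,ierr)
      call MPI_Comm_rank(PETSC_COMM_WORLD,rank,ierr)

      do step = 1, 10
         ! ... assemble and solve the Poisson system here (placeholder) ...
         call PetscMemoryGetCurrentUsage(rss,ierr)   ! process resident set size
         call PetscMallocGetCurrentUsage(mal,ierr)   ! bytes currently PETSc-malloc'd
         print *, 'rank', rank, ' step', step, ' rss', rss, ' malloc', mal
      end do

      call PetscFinalize(ierr)
      end program memcheck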

Sincerely
Xingjun

