<div dir="ltr">On Sun, Aug 25, 2013 at 3:30 PM, Frank <span dir="ltr"><<a href="mailto:fangxingjun0319@gmail.com" target="_blank">fangxingjun0319@gmail.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
I have very weird problem here.<br>
I am using FORTRAN to call PETSc to solve Poisson equation.<br>
When I run my code with 8 cores, it works fine, and the consumed memory does not increase. However, when it is run with 64 cores, first of all it gives lots of error like this:<br>
>
> [n310:18951] [[62652,0],2] -> [[62652,0],10] (node: n219) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] -> [[62652,0],18] (node: n128) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] -> [[62652,0],34] (node: n089) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] ORTED_CMD_PROCESSOR: STUCK IN INFINITE LOOP - ABORTING
> [n310:18951] *** Process received signal ***
> [n310:18951] Signal: Aborted (6)
> [n310:18951] Signal code: (-6)
> [n310:18951] [ 0] /lib64/libpthread.so.0() [0x35b120f500]
> [n310:18951] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x35b0e328a5]
> [n310:18951] [ 2] /lib64/libc.so.6(abort+0x175) [0x35b0e34085]
> [n310:18951] [ 3] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x243) [0x2ae5e02f0813]
> [n310:18951] [ 4] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a) [0x2ae5e032f56a]
> [n310:18951] [ 5] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12) [0x2ae5e032f242]
> [n310:18951] [ 6] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c) [0x2ae5e031845c]
> [n310:18951] [ 7] /global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_grpcomm_bad.so(+0x1bd7) [0x2ae5e28debd7]
> [n310:18951] [ 8] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_ess_base_orted_finalize+0x1e) [0x2ae5e02f431e]
> [n310:18951] [ 9] /global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_ess_tm.so(+0x1294) [0x2ae5e1ab1294]
> [n310:18951] [10] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_finalize+0x4e) [0x2ae5e02d0fbe]
> [n310:18951] [11] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4840b) [0x2ae5e02f040b]
> [n310:18951] [12] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a) [0x2ae5e032f56a]
> [n310:18951] [13] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12) [0x2ae5e032f242]
> [n310:18951] [14] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c) [0x2ae5e031845c]
> [n310:18951] [15] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_trigger_event+0x50) [0x2ae5e02dc930]
> [n310:18951] [16] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4916f) [0x2ae5e02f116f]
> [n310:18951] [17] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x149) [0x2ae5e02f0719]
> [n310:18951] [18] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a) [0x2ae5e032f56a]
> [n310:18951] [19] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12) [0x2ae5e032f242]
> [n310:18951] [20] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_dispatch+0x8) [0x2ae5e032f228]
> [n310:18951] [21] /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon+0x9f0) [0x2ae5e02ef8a0]
> [n310:18951] [22] orted(main+0x88) [0x4024d8]
> [n310:18951] [23] /lib64/libc.so.6(__libc_start_main+0xfd) [0x35b0e1ecdd]
> [n310:18951] [24] orted() [0x402389]
> [n310:18951] *** End of error message ***
>
> but the program still gives the right result for a short period. After that, it suddenly stops because the memory exceeds some limit. I don't understand this: if there were a memory leak in my code, how could it work with 8 cores? Please help me. Thank you so much!

All of the errors are Open MPI errors, so the first thing to do is track down why they are happening. I think your only option here is to get the system administrator on your machine to help.

Since you have MPI errors, any number of weird things could be happening, like your job launching on many fewer than 64 processes (the errors say some nodes could not be contacted), which would account for the memory running out.
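
As a quick sanity check, you can have every rank report where it is running. Here is a minimal sketch (the program name and output format are just illustrative; it assumes the MPI Fortran module from your Open MPI install):

  ! Minimal sketch: every rank reports its host so you can count how
  ! many of the 64 processes actually started, and on which nodes.
  program rankcheck
    use mpi
    implicit none
    integer :: ierr, rank, nprocs, namelen
    character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_Get_processor_name(hostname, namelen, ierr)
    write (*, '(A,I4,A,I4,2A)') 'rank ', rank, ' of ', nprocs, &
         ' on ', hostname(1:namelen)
    call MPI_Finalize(ierr)
  end program rankcheck

If running this with mpirun -np 64 does not print 64 lines spread over the nodes you expect, the problem is in the launch environment rather than in your solver. To rule out a leak on your side, you can also run your PETSc code with the -malloc_dump option, which reports any PETSc memory left unfreed at PetscFinalize.

  Thanks,

      Matt
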
> Sincerely
> Xingjun

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener