Hello Howard,

Thanks very much for your kind reply.

I will take your advice and use more cores/nodes for the problem. In the meantime, could you tell me whether changing some of the MPI environment parameters on Hopper would also help?

https://www.nersc.gov/users/computational-systems/hopper/running-jobs/runtime-tuning-options/#toc-anchor-1
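
A quick way to confirm that a tuning variable from that page actually reaches the job is to have rank 0 print it at startup. A minimal sketch, assuming the variable is exported to the compute-node environment; MPI_GNI_MAX_EAGER_MSG_SIZE is the variable already discussed in this thread, and any other variable from the NERSC page could be checked the same way:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* getenv returns NULL when the variable is not set in the
         * environment the application actually sees. */
        const char *v = getenv("MPI_GNI_MAX_EAGER_MSG_SIZE");
        printf("MPI_GNI_MAX_EAGER_MSG_SIZE = %s\n", v ? v : "(not set)");
    }
    MPI_Finalize();
    return 0;
}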
<span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; "><span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; "><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Xuefei (Rebecca) Yuan<br>Postdoctoral Fellow<br>Lawrence Berkeley National Laboratory<br>Tel: 1-510-486-7031<br><br></div></span></span>

On Jul 30, 2011, at 3:43 PM, Howard Pritchard wrote:

> Hello Rebecca,
>
> I believe one or more of the nodes your job is running on are almost
> out of memory. It seems that in a number of places mpich2 translates
> internal out-of-memory errors into MPI_ERR_INTERN. On the receive side
> the out-of-memory situation is handled such that you get the "failed
> to allocate memory" error. On the send side, it looks like the
> allocation of memory for the mpich2 internal send request fails
> without giving a very useful error traceback.
>
> I would suggest two things:
>
> 1) See if you can reduce the memory requirements per node for the job
> you are trying to run, for example by running on more nodes.
> 2) Contact the NERSC help desk. Send them the error message output, as
> they can correlate it with syslog output on the SMW to see whether
> there were out-of-memory conditions on the nodes you were using today.
>
> I think if you want the job to run, you should definitely first see
> about reducing the per-node memory requirements of the job. I don't
> think this is an issue with flooding of the unexpected-message queue,
> since mpich2 says there are 0 unexpected messages queued.
>
> Howard
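
Since the send-side failures give so little information by default, one way to see which error class a failing call actually reports is to switch the communicator to MPI_ERRORS_RETURN and decode the return code with MPI_Error_class and MPI_Error_string. A minimal self-contained sketch, not taken from the application; the self-exchange is only there to give the calls something to do:

#include <mpi.h>
#include <stdio.h>

/* Print the error class and message text for a failed MPI call, then abort. */
static void check(int rc, const char *where)
{
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0, eclass = 0;
        MPI_Error_class(rc, &eclass);   /* e.g. MPI_ERR_INTERN */
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "%s failed, class %d: %s\n", where, eclass, msg);
        MPI_Abort(MPI_COMM_WORLD, eclass);
    }
}

int main(int argc, char **argv)
{
    int rank;
    double sendbuf[52] = {0.0}, recvbuf[52];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    /* The default handler, MPI_ERRORS_ARE_FATAL, aborts the job before the
     * caller can inspect the error code; return errors to the caller instead. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exchanges one message with itself, checking every call. */
    check(MPI_Irecv(recvbuf, 52, MPI_DOUBLE, rank, 21, MPI_COMM_WORLD, &reqs[0]),
          "MPI_Irecv");
    check(MPI_Isend(sendbuf, 52, MPI_DOUBLE, rank, 21, MPI_COMM_WORLD, &reqs[1]),
          "MPI_Isend");
    check(MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE), "MPI_Waitall");

    MPI_Finalize();
    return 0;
}

With MPI_ERRORS_RETURN in place, a failure like the one in MPID_Irecv comes back to the call site as an error code, so the class and message can be logged before the application shuts down.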
>
> Rebecca Yuan wrote:
>
>> Hello,
>>
>> Could you please give me some suggestions to resolve the MPI problem
>> on Hopper?
>>
>> Thanks very much!
>>
>> Rebecca
>>
>> Begin forwarded message:
>>
>>> From: Jeff Hammond <jeff.science@gmail.com>
>>> Date: July 30, 2011 8:17:09 AM PDT
>>> To: Main MPI Forum mailing list <mpi-forum@lists.mpi-forum.org>
>>> Subject: Re: [Mpi-forum] The MPI Internal error running on Hopper
>>> Reply-To: Main MPI Forum mailing list <mpi-forum@lists.mpi-forum.org>
>>>
>>> Report to NERSC support. This is not the appropriate email list for
>>> support of MPI implementations.
>>>
>>> Cray MPI is an MPICH2-based implementation, so you can also try
>>> mpich-discuss@mcs.anl.gov, but it is still preferable to contact
>>> NERSC first, since they are the ones who hold the Cray support
>>> contract for Hopper.
>>>
>>> Jeff
>>>
>>> Sent from my iPhone
>>>
>>> On Jul 30, 2011, at 9:54 AM, "Xuefei (Rebecca) Yuan" <xyuan@lbl.gov> wrote:
>>>
>>>> Hello, all,
>>>>
>>>> I got an MPI internal error while running on a Cray XE6 machine
>>>> (Hopper); the error message reads:
>>>>
>>>> Rank 9 [Sat Jul 30 07:39:14 2011] [c5-2c2s1n3] Fatal error in
>>>> PMPI_Wait: Other MPI error, error stack:
>>>> PMPI_Wait(179).....................: MPI_Wait(request=0x7fffffff7438,
>>>> status0x7fffffff7460) failed
>>>> MPIR_Wait_impl(69).................:
type="cite">MPIDI_CH3I_Progress(370)...........:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">unexpected message. 0 unexpected messages queued.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Rank 63 [Sat Jul 30 07:39:14 2011] [c0-2c2s3n0] Fatal error in<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">MPI_Irecv: Other MPI error, error stack:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">MPI_Irecv(147): MPI_Irecv(buf=0x4a81890, count=52, MPI_DOUBLE,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000007,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">request=0x7fffffff7438) failed<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">MPID_Irecv(53): failure occurred while allocating memory for a<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">request object<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Rank 54 [Sat Jul 30 07:39:14 2011] [c1-2c2s3n2] Fatal error in<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">PMPI_Isend: Internal MPI error!, error stack:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">PMPI_Isend(148): MPI_Isend(buf=0x3d12a350, count=52, MPI_DOUBLE,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">dest=30, tag=21, comm=0xc4000003, request=0x3c9c12f0) failed<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">(unknown)(): Internal MPI error!<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Rank 45 [Sat Jul 30 07:39:14 2011] [c1-2c2s2n3] Fatal error in<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">PMPI_Isend: Internal MPI error!, error stack:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">PMPI_Isend(148): MPI_Isend(buf=0x3c638de0, count=34, MPI_DOUBLE,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">dest=61, tag=21, comm=0x84000007, request=0x3c03be90) failed<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">(unknown)(): Internal MPI error!<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Rank 36 [Sat Jul 30 07:39:14 2011] [c3-2c2s2n1] Fatal error in<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">PMPI_Isend: Internal MPI error!, error 
>>>> PMPI_Isend(148): MPI_Isend(buf=0x3caaf170, count=52, MPI_DOUBLE,
>>>> dest=28, tag=21, comm=0xc4000003, request=0x3c2e561c) failed
>>>> (unknown)(): Internal MPI error!
>>>> _pmii_daemon(SIGCHLD): [NID 00102] [c0-2c2s3n0] [Sat Jul 30 07:39:14
>>>> 2011] PE 63 exit signal Aborted
>>>> _pmii_daemon(SIGCHLD): [NID 06043] [c3-2c2s2n1] [Sat Jul 30 07:39:14
>>>> 2011] PE 36 exit signal Aborted
>>>> _pmii_daemon(SIGCHLD): [NID 06328] [c1-2c2s3n2] [Sat Jul 30 07:39:14
>>>> 2011] PE 54 exit signal Aborted
>>>> _pmii_daemon(SIGCHLD): [NID 05565] [c5-2c2s1n3] [Sat Jul 30 07:39:14
>>>> 2011] PE 9 exit signal Aborted
>>>> _pmii_daemon(SIGCHLD): [NID 06331] [c1-2c2s2n3] [Sat Jul 30 07:39:14
>>>> 2011] PE 45 exit signal Aborted
>>>> [NID 00102] 2011-07-30 07:39:38 Apid 2986821: initiated application
>>>> termination
>>>>
>>>> So I checked the environment parameters on Hopper at
>>>>
>>>> https://www.nersc.gov/users/computational-systems/hopper/running-jobs/runtime-tuning-options/#toc-anchor-1
>>>>
>>>> I tried to increase MPI_GNI_MAX_EAGER_MSG_SIZE from 8192 to 131070,
type="cite"><blockquote type="cite"><blockquote type="cite">but it did not help.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Any suggestions that how could resolve this error for MPI_Irecv() and<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">MPI_Isend()?<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Thanks very much!<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Xuefei (Rebecca) Yuan<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Postdoctoral Fellow<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Lawrence Berkeley National Laboratory<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Tel: 1-510-486-7031<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">_______________________________________________<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">mpi-forum mailing list<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><a href="mailto:mpi-forum@lists.mpi-forum.org">mpi-forum@lists.mpi-forum.org</a> &lt;<a href="mailto:mpi-forum@lists.mpi-forum.org">mailto:mpi-forum@lists.mpi-forum.org</a>&gt;<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum</a><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">_______________________________________________<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">mpi-forum mailing list<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><a href="mailto:mpi-forum@lists.mpi-forum.org">mpi-forum@lists.mpi-forum.org</a> &lt;<a href="mailto:mpi-forum@lists.mpi-forum.org">mailto:mpi-forum@lists.mpi-forum.org</a>&gt;<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum</a><br></blockquote></blockquote><br><br>-- <br>Howard Pritchard<br>Software Engineering<br>Cray, 

> --
> Howard Pritchard
> Software Engineering
> Cray, Inc.