[mpich-discuss] Fwd: [Mpi-forum] The MPI Internal error running on Hopper
Xuefei (Rebecca) Yuan
xyuan at lbl.gov
Sat Jul 30 18:28:34 CDT 2011
Hello Howard,
Thanks very much for your kind reply.
I will take your advice and run the problem on more cores/nodes. In the meantime, could you tell me whether changing some of the MPI environment parameters on Hopper would help?
https://www.nersc.gov/users/computational-systems/hopper/running-jobs/runtime-tuning-options/#toc-anchor-1
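For concreteness, here is the kind of job-script change I have in mind. This is only a sketch: the aprun flags are standard, the environment variable name and value are the ones from this thread and the NERSC page above, but the node geometry (24 cores per Hopper node) and the placeholder binary name would need checking against the actual job.

```shell
#!/bin/bash
# Hypothetical PBS snippet for Hopper -- a sketch, not a verified configuration.
# Spreading the same number of ranks across more nodes (fewer PEs per node)
# lowers the per-node memory footprint, per Howard's suggestion.
#PBS -l mppwidth=384        # enough cores for 16 nodes (assuming 24 cores/node)

# Eager-message threshold discussed in this thread; name and value are
# taken from the NERSC runtime-tuning page linked above.
export MPI_GNI_MAX_EAGER_MSG_SIZE=131070

# -n: total MPI ranks; -N: ranks per node (4 instead of a full 24),
# so 64 ranks spread over 16 nodes instead of 3.
aprun -n 64 -N 4 ./my_app    # ./my_app is a placeholder for the real binary
```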
Best,
Xuefei (Rebecca) Yuan
Postdoctoral Fellow
Lawrence Berkeley National Laboratory
Tel: 1-510-486-7031
On Jul 30, 2011, at 3:43 PM, Howard Pritchard wrote:
> Hello Rebecca,
>
> I believe one or more of the nodes your job is running on are almost
> out of memory. In a number of places, mpich2 translates internal
> out-of-memory errors into MPI_ERR_INTERN. On the receive side, the
> out-of-memory situation is handled so that you get the "failed to
> allocate memory" error. On the send side, it looks like the allocation
> of memory for the mpich2 internal send request fails without giving a
> very useful error traceback.
>
> I would suggest several things:
>
> 1) see if you can reduce the memory requirements/node for the job
> you are trying to run, maybe by running on more nodes.
> 2) contact nersc help desk. Send them the error message
> output as they can correlate
> it with syslog output on the smw to see if there were out-of-memory
> conditions on the nodes you were using today.
>
> I think if you want the job to run, you should definitely first see
> about reducing the memory-per-node requirements of the job. I don't
> think this is an issue with flooding of the unexpected-message queue,
> since mpich2 says there are 0 unexpected messages queued.
>
> Howard
>
>
> Rebecca Yuan wrote:
>>
>> Hello,
>>
>> Could you please give me some suggestions to resolve the MPI problem on Hopper?
>>
>> Thanks very much!
>>
>> Rebecca
>>
>> Begin forwarded message:
>>
>>> *From:* Jeff Hammond <jeff.science at gmail.com>
>>> *Date:* July 30, 2011 8:17:09 AM PDT
>>> *To:* Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
>>> *Subject:* *Re: [Mpi-forum] The MPI Internal error running on Hopper*
>>> *Reply-To:* Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
>>>
>>> Report to NERSC support. This is not the appropriate email list for
>>> support of MPI implementations.
>>>
>>> CrayMPI is an MPICH2-based implementation, so you can also try
>>> mpich-discuss at mcs.anl.gov, but it is still preferred to contact
>>> NERSC first, since they are the ones who own the Cray support
>>> contract for Hopper.
>>>
>>> Jeff
>>>
>>> Sent from my iPhone
>>>
>>> On Jul 30, 2011, at 9:54 AM, "Xuefei (Rebecca) Yuan" <xyuan at lbl.gov> wrote:
>>>
>>>> Hello, all,
>>>>
>>>> I got some MPI internal error while running on a Cray XE6 machine
>>>> (Hopper), the error message reads:
>>>>
>>>>
>>>> Rank 9 [Sat Jul 30 07:39:14 2011] [c5-2c2s1n3] Fatal error in
>>>> PMPI_Wait: Other MPI error, error stack:
>>>> PMPI_Wait(179).....................: MPI_Wait(request=0x7fffffff7438,
>>>> status=0x7fffffff7460) failed
>>>> MPIR_Wait_impl(69).................:
>>>> MPIDI_CH3I_Progress(370)...........:
>>>> MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
>>>> unexpected message. 0 unexpected messages queued.
>>>> Rank 63 [Sat Jul 30 07:39:14 2011] [c0-2c2s3n0] Fatal error in
>>>> MPI_Irecv: Other MPI error, error stack:
>>>> MPI_Irecv(147): MPI_Irecv(buf=0x4a81890, count=52, MPI_DOUBLE,
>>>> src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000007,
>>>> request=0x7fffffff7438) failed
>>>> MPID_Irecv(53): failure occurred while allocating memory for a
>>>> request object
>>>> Rank 54 [Sat Jul 30 07:39:14 2011] [c1-2c2s3n2] Fatal error in
>>>> PMPI_Isend: Internal MPI error!, error stack:
>>>> PMPI_Isend(148): MPI_Isend(buf=0x3d12a350, count=52, MPI_DOUBLE,
>>>> dest=30, tag=21, comm=0xc4000003, request=0x3c9c12f0) failed
>>>> (unknown)(): Internal MPI error!
>>>> Rank 45 [Sat Jul 30 07:39:14 2011] [c1-2c2s2n3] Fatal error in
>>>> PMPI_Isend: Internal MPI error!, error stack:
>>>> PMPI_Isend(148): MPI_Isend(buf=0x3c638de0, count=34, MPI_DOUBLE,
>>>> dest=61, tag=21, comm=0x84000007, request=0x3c03be90) failed
>>>> (unknown)(): Internal MPI error!
>>>> Rank 36 [Sat Jul 30 07:39:14 2011] [c3-2c2s2n1] Fatal error in
>>>> PMPI_Isend: Internal MPI error!, error stack:
>>>> PMPI_Isend(148): MPI_Isend(buf=0x3caaf170, count=52, MPI_DOUBLE,
>>>> dest=28, tag=21, comm=0xc4000003, request=0x3c2e561c) failed
>>>> (unknown)(): Internal MPI error!
>>>> _pmii_daemon(SIGCHLD): [NID 00102] [c0-2c2s3n0] [Sat Jul 30 07:39:14
>>>> 2011] PE 63 exit signal Aborted
>>>> _pmii_daemon(SIGCHLD): [NID 06043] [c3-2c2s2n1] [Sat Jul 30 07:39:14
>>>> 2011] PE 36 exit signal Aborted
>>>> _pmii_daemon(SIGCHLD): [NID 06328] [c1-2c2s3n2] [Sat Jul 30 07:39:14
>>>> 2011] PE 54 exit signal Aborted
>>>> _pmii_daemon(SIGCHLD): [NID 05565] [c5-2c2s1n3] [Sat Jul 30 07:39:14
>>>> 2011] PE 9 exit signal Aborted
>>>> _pmii_daemon(SIGCHLD): [NID 06331] [c1-2c2s2n3] [Sat Jul 30 07:39:14
>>>> 2011] PE 45 exit signal Aborted
>>>> [NID 00102] 2011-07-30 07:39:38 Apid 2986821: initiated application
>>>> termination
>>>>
>>>> So I checked the environment parameters for Hopper at
>>>>
>>>> https://www.nersc.gov/users/computational-systems/hopper/running-jobs/runtime-tuning-options/#toc-anchor-1
>>>>
>>>> I tried to increase MPI_GNI_MAX_EAGER_MSG_SIZE from 8192 to 131070,
>>>> but it did not help.
>>>>
>>>> Any suggestions on how to resolve this error for MPI_Irecv() and
>>>> MPI_Isend()?
>>>>
>>>> Thanks very much!
>>>>
>>>>
>>>> Xuefei (Rebecca) Yuan
>>>> Postdoctoral Fellow
>>>> Lawrence Berkeley National Laboratory
>>>> Tel: 1-510-486-7031
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> mpi-forum mailing list
>>>> mpi-forum at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum
>
>
> --
> Howard Pritchard
> Software Engineering
> Cray, Inc.
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss