[mpich-discuss] Fwd: [Mpi-forum] The MPI Internal error running on Hopper

Howard Pritchard howardp at cray.com
Sat Jul 30 17:43:33 CDT 2011


Hello Rebecca,

I believe one or more of the nodes your job is running on are almost
out of memory.  In a number of places mpich2 translates internal
out-of-memory errors into MPI_ERR_INTERN.  On the receive side the
out-of-memory situation is handled so that you get the "failed to
allocate memory" error.  On the send side, it looks like the allocation
of memory for the mpich2 internal send request fails without giving a
very useful error traceback.
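
In case it helps with debugging, here is a minimal, generic MPI sketch
(not Cray-specific; the self-exchange, buffer size, tag, and the helper
name "check" are made up for illustration) that switches MPI_COMM_WORLD
to MPI_ERRORS_RETURN so a failing MPI_Isend/MPI_Irecv hands the error
back to the caller, where you can print its class and string.  In your
traces these show up as "Other MPI error" (MPI_ERR_OTHER) on the
wait/receive side and "Internal MPI error!" (MPI_ERR_INTERN) on the
send side.

#include <mpi.h>
#include <stdio.h>

/* Report the error class and string instead of letting MPI abort. */
static void check(int rc, const char *where)
{
    if (rc == MPI_SUCCESS)
        return;
    int eclass, len, rank;
    char msg[MPI_MAX_ERROR_STRING];
    MPI_Error_class(rc, &eclass);
    MPI_Error_string(rc, msg, &len);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: %s failed, class %d: %s\n",
            rank, where, eclass, msg);
    MPI_Abort(MPI_COMM_WORLD, eclass);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sendbuf[52] = {0.0}, recvbuf[52];
    MPI_Request reqs[2];

    /* Self-exchange just to exercise the calls that failed in your log. */
    check(MPI_Isend(sendbuf, 52, MPI_DOUBLE, rank, 21,
                    MPI_COMM_WORLD, &reqs[0]), "MPI_Isend");
    check(MPI_Irecv(recvbuf, 52, MPI_DOUBLE, rank, 21,
                    MPI_COMM_WORLD, &reqs[1]), "MPI_Irecv");
    check(MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE), "MPI_Waitall");

    MPI_Finalize();
    return 0;
}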

I would suggest several things:

1) See if you can reduce the per-node memory requirements of the job
you are trying to run, for example by spreading it over more nodes.
2) Contact the NERSC help desk.  Send them the error message output;
they can correlate it with syslog output on the SMW to see whether
there were out-of-memory conditions on the nodes you were using today.

I think if you want the job to run, you should definitely first see
about reducing its per-node memory requirements.  I don't think this
is an issue of flooding the unexpected-message queue, since mpich2
reports that 0 unexpected messages are queued.
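
For reference, an "unexpected" message is one whose eager data arrives
before the matching receive has been posted, so the library has to
allocate a temporary buffer to hold it.  The generic sketch below
(the ring pattern, counts, and tags are made up for illustration, not
taken from your code) shows the usual way to keep that queue empty by
pre-posting receives.  Since your trace already reports an empty queue,
that is not your problem, which is why I point at overall node memory
instead.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sendbuf[52] = {0.0}, recvbuf[52];
    int peer = (rank + 1) % size;           /* simple ring exchange */
    MPI_Request reqs[2];

    /* Post the receive before any send starts: when the peer's eager
     * data arrives it is copied straight into recvbuf.  If the
     * MPI_Irecv were posted only after the data arrived, the library
     * would have to buffer the message on the unexpected queue. */
    MPI_Irecv(recvbuf, 52, MPI_DOUBLE, MPI_ANY_SOURCE, 21,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, 52, MPI_DOUBLE, peer, 21, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}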

Howard


Rebecca Yuan wrote:
> 
> Hello,
> 
> Could you please give me some suggestions to resolve the MPI problem on Hopper?
> 
> Thanks very much!
> 
> Rebecca
> 
> Begin forwarded message:
> 
>> From: Jeff Hammond <jeff.science at gmail.com>
>> Date: July 30, 2011 8:17:09 AM PDT
>> To: Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
>> Subject: Re: [Mpi-forum] The MPI Internal error running on Hopper
>> Reply-To: Main MPI Forum mailing list <mpi-forum at lists.mpi-forum.org>
>>
>> Report to NERSC support. This is not the appropriate email list for
>> support of MPI implementations.
>>
>> CrayMPI is an MPICH2-based implementation so you can also try
>> mpich-discuss at mcs.anl.gov but it is still preferred to contact NERSC
>> first since they are the ones who own the Cray support contract for
>> Hopper.
>>
>> Jeff
>>
>> Sent from my iPhone
>>
>> On Jul 30, 2011, at 9:54 AM, "Xuefei (Rebecca) Yuan" <xyuan at lbl.gov> wrote:
>>
>>> Hello, all,
>>>
>>> I got some MPI internal error while running on a Cray XE6 machine
>>> (Hopper), the error message reads:
>>>
>>>
>>> Rank 9 [Sat Jul 30 07:39:14 2011] [c5-2c2s1n3] Fatal error in
>>> PMPI_Wait: Other MPI error, error stack:
>>> PMPI_Wait(179).....................: MPI_Wait(request=0x7fffffff7438,
>>> status=0x7fffffff7460) failed
>>> MPIR_Wait_impl(69).................:
>>> MPIDI_CH3I_Progress(370)...........:
>>> MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
>>> unexpected message. 0 unexpected messages queued.
>>> Rank 63 [Sat Jul 30 07:39:14 2011] [c0-2c2s3n0] Fatal error in
>>> MPI_Irecv: Other MPI error, error stack:
>>> MPI_Irecv(147): MPI_Irecv(buf=0x4a81890, count=52, MPI_DOUBLE,
>>> src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000007,
>>> request=0x7fffffff7438) failed
>>> MPID_Irecv(53): failure occurred while allocating memory for a
>>> request object
>>> Rank 54 [Sat Jul 30 07:39:14 2011] [c1-2c2s3n2] Fatal error in
>>> PMPI_Isend: Internal MPI error!, error stack:
>>> PMPI_Isend(148): MPI_Isend(buf=0x3d12a350, count=52, MPI_DOUBLE,
>>> dest=30, tag=21, comm=0xc4000003, request=0x3c9c12f0) failed
>>> (unknown)(): Internal MPI error!
>>> Rank 45 [Sat Jul 30 07:39:14 2011] [c1-2c2s2n3] Fatal error in
>>> PMPI_Isend: Internal MPI error!, error stack:
>>> PMPI_Isend(148): MPI_Isend(buf=0x3c638de0, count=34, MPI_DOUBLE,
>>> dest=61, tag=21, comm=0x84000007, request=0x3c03be90) failed
>>> (unknown)(): Internal MPI error!
>>> Rank 36 [Sat Jul 30 07:39:14 2011] [c3-2c2s2n1] Fatal error in
>>> PMPI_Isend: Internal MPI error!, error stack:
>>> PMPI_Isend(148): MPI_Isend(buf=0x3caaf170, count=52, MPI_DOUBLE,
>>> dest=28, tag=21, comm=0xc4000003, request=0x3c2e561c) failed
>>> (unknown)(): Internal MPI error!
>>> _pmii_daemon(SIGCHLD): [NID 00102] [c0-2c2s3n0] [Sat Jul 30 07:39:14
>>> 2011] PE 63 exit signal Aborted
>>> _pmii_daemon(SIGCHLD): [NID 06043] [c3-2c2s2n1] [Sat Jul 30 07:39:14
>>> 2011] PE 36 exit signal Aborted
>>> _pmii_daemon(SIGCHLD): [NID 06328] [c1-2c2s3n2] [Sat Jul 30 07:39:14
>>> 2011] PE 54 exit signal Aborted
>>> _pmii_daemon(SIGCHLD): [NID 05565] [c5-2c2s1n3] [Sat Jul 30 07:39:14
>>> 2011] PE 9 exit signal Aborted
>>> _pmii_daemon(SIGCHLD): [NID 06331] [c1-2c2s2n3] [Sat Jul 30 07:39:14
>>> 2011] PE 45 exit signal Aborted
>>> [NID 00102] 2011-07-30 07:39:38 Apid 2986821: initiated application
>>> termination
>>>
>>> So I checked the environment parameters on Hopper at
>>>
>>> https://www.nersc.gov/users/computational-systems/hopper/running-jobs/runtime-tuning-options/#toc-anchor-1
>>>
>>> I tried to increase MPI_GNI_MAX_EAGER_MSG_SIZE from 8192 to 131070,
>>> but it did not help.
>>>
>>> Any suggestions on how I could resolve this error for MPI_Irecv() and
>>> MPI_Isend()?
>>>
>>> Thanks very much!
>>>
>>>
>>> Xuefei (Rebecca) Yuan
>>> Postdoctoral Fellow
>>> Lawrence Berkeley National Laboratory
>>> Tel: 1-510-486-7031
>>>
>>>
>>>
>>> _______________________________________________
>>> mpi-forum mailing list
>>> mpi-forum at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum
>> _______________________________________________
>> mpi-forum mailing list
>> mpi-forum at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum


-- 
Howard Pritchard
Software Engineering
Cray, Inc.

