[mpich-discuss] MPI error on MPI_alltoallv.

jt.meng at siat.ac.cn jt.meng at siat.ac.cn
Thu Aug 2 21:20:22 CDT 2012


Dear Pavan,

      I have fixed this problem. Thanks very much for your help.

      The open-files limit on all computing nodes needs to be raised, for example with the command "echo "ulimit -n 32768" >> /etc/profile"; note that only the root user is affected by this modification.
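
      For reference, a minimal sketch of applying and checking the new limit on every node; the hosts.txt file, and password-less root ssh to each node, are assumptions for illustration, not part of the original setup:

# sketch only: raise the open-files limit on every node (run as root)
for node in $(cat hosts.txt); do
    ssh "$node" 'echo "ulimit -n 32768" >> /etc/profile'
    # verify the new value in a fresh login shell
    ssh "$node" 'bash -l -c "ulimit -n"'
done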

      Thanks,

Jintao


> -----Original Message-----
> From: "Pavan Balaji" <balaji at mcs.anl.gov>
> Sent: Friday, August 3, 2012
> To: mpich-discuss at mcs.anl.gov
> Cc: jt.meng at siat.ac.cn
> Subject: Re: [mpich-discuss]  MPI error on MPI_alltoallv.
> 
> 
> Please see this:
> 
> https://lists.mcs.anl.gov/mailman/htdig/mpich-discuss/2012-June/012590.html
> 
>   -- Pavan
> 
> On 08/02/2012 03:15 AM, jt.meng at siat.ac.cn wrote:
> > Hi,
> >      My programs run well on 960 cores; however, when running on
> > 1024 cores, I get the following error.
> >      I guess that this may be caused by OS limitations.
> >      Can anyone help me resolve this problem?
> >
> > ulimit output start here:
> > ---------------------------------------------------------
> > # ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 136192
> > max locked memory       (kbytes, -l) unlimited
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 819200
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > stack size              (kbytes, -s) unlimited
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) 136192
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> >
> > Error logs start here:
> > ------------------------------------------------------------------------------------
> > Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
> > PMPI_Alltoallv(549)...........: MPI_Alltoallv(sbuf=0x2b08c2bd7010,
> > scnts=0x64ac20, sdispls=0x659b40, MPI_LONG_LONG_INT,
> > rbuf=0x2b08c5bde010, rcnts=0x658b30, rdispls=0x65ab50,
> > MPI_LONG_LONG_INT, MPI_COMM_WORLD) failed
> > MPIR_Alltoallv_impl(389)......:
> > MPIR_Alltoallv(355)...........:
> > MPIR_Alltoallv_intra(199).....:
> > MPIC_Waitall_ft(852)..........:
> > MPIR_Waitall_impl(121)........:
> > MPIDI_CH3I_Progress(402)......:
> > MPID_nem_mpich2_test_recv(747):
> > MPID_nem_tcp_connpoll(1838)...:
> > state_listening_handler(1908).: accept of socket fd failed - Too many
> > open files
> > Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
> > PMPI_Alltoallv(549)...........: MPI_Alltoallv(sbuf=0x2b974c333010,
> > scnts=0x64ac20, sdispls=0x659b40, MPI_LONG_LONG_INT,
> > rbuf=0x2b974f335010, rcnts=0x658b30, rdispls=0x65ab50,
> > MPI_LONG_LONG_INT, MPI_COMM_WORLD) failed
> > MPIR_Alltoallv_impl(389)......:
> > MPIR_Alltoallv(355)...........:
> > MPIR_Alltoallv_intra(199).....:
> > MPIC_Waitall_ft(852)..........:
> > MPIR_Waitall_impl(121)........:
> > MPIDI_CH3I_Progress(402)......:
> > MPID_nem_mpich2_test_recv(747):
> > MPID_nem_tcp_connpoll(1838)...:
> > state_listening_handler(1908).: accept of socket fd failed - Too many
> > open files
> > [proxy:0:9 at node15] handle_pmi_response (./pm/pmiserv/pmip_cb.c:406):
> > assert (!closed) failed
> > [proxy:0:9 at node15] HYD_pmcd_pmip_control_cmd_cb
> > (./pm/pmiserv/pmip_cb.c:952): unable to handle PMI response
> > [proxy:0:9 at node15] HYDT_dmxu_poll_wait_for_event
> > (./tools/demux/demux_poll.c:77): callback returned error status
> > [proxy:0:9 at node15] main (./pm/pmiserv/pmip.c:226): demux engine error
> > waiting for event
> > [mpiexec at node73] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert
> > (!closed) failed
> > [mpiexec at node73] HYDT_dmxu_poll_wait_for_event
> > (./tools/demux/demux_poll.c:77): callback returned error status
> > [mpiexec at node73] HYD_pmci_wait_for_completion
> > (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
> > [mpiexec at node73] main (./ui/mpich/mpiexec.c:405): process manager error
> > waiting for completion
> >
> > Jintao
> >
> >
> >
> >
> >
> >
> > _______________________________________________
> > mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> > To manage subscription options or unsubscribe:
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> >
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji



--
- - - - - - - - - - - - - - - - -  

Meng Jintao, Engineer
High Performance Computing Center, Digital Institute
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Address: 1068 Xueyuan Avenue, Xili University Town, Nanshan District, Shenzhen
Tel: 0755-86392368, 13510470517
Postal code: 518055
2011-06-01




