[MPICH] max number of isend/irecv allowed?

Wei-keng Liao wkliao at ece.northwestern.edu
Sun Feb 17 13:48:57 CST 2008


OK. I ran the test code with mpich2-1.0.7rc1 on the three Linux cluster 
machines below.

I got the same error messages on Ewok @ ORNL. "uname -a" shows:
Linux ewok001.ccs.ornl.gov 2.6.9-42.0.10.EL_lustre-1.4.10.1smp #1 SMP Wed 
Apr 25 12:52:57 MDT 2007 x86_64 x86_64 x86_64 GNU/Linux

On Tungsten @ NCSA, the same error messages were returned. "uname -a" 
shows: Linux tunc 2.4.21-32.0.1.ELsmp-perfctr-lustre #3 SMP Thu Jun 16 
06:08:07 CDT 2005 i686 i686 i386 GNU/Linux

On Mercury @ NCSA, the program hung until the allocated time expired (10 
minutes). No error message was given. "uname -a" shows: Linux 
tg-login4 2.4.21-309.tg1 #1 SMP Thu Jun 1 17:07:28 CDT 2006 ia64 unknown
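
For reference, the communication pattern in the test code is roughly the 
following (a minimal sketch, not the attached program itself; the one-int 
messages and buffer names are illustrative). Each rank posts an Irecv and 
an Isend for every rank, then waits on all requests at once, matching the 
pattern described below for ADIOI_Calc_others_req():

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, i, j = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int *sbuf = malloc(nprocs * sizeof(int));
        int *rbuf = malloc(nprocs * sizeof(int));
        /* 2*nprocs outstanding requests per process; at 512 processes
         * this is 1024, matching MPI_Waitall(count=1024) in the error
         * stack below. */
        MPI_Request *reqs = malloc(2 * nprocs * sizeof(MPI_Request));

        /* post all nonblocking receives, then all nonblocking sends */
        for (i = 0; i < nprocs; i++)
            MPI_Irecv(&rbuf[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                      &reqs[j++]);
        for (i = 0; i < nprocs; i++) {
            sbuf[i] = rank;
            MPI_Isend(&sbuf[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                      &reqs[j++]);
        }

        /* single wait on every outstanding request */
        MPI_Waitall(j, reqs, MPI_STATUSES_IGNORE);

        if (rank == 0) printf("all %d requests completed\n", j);
        free(sbuf); free(rbuf); free(reqs);
        MPI_Finalize();
        return 0;
    }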


Wei-keng


On Fri, 15 Feb 2008, Rajeev Thakur wrote:

> The error message shows a connection failure. Can you try with the new 1.0.7
> rc1?
> 
> Rajeev
> 
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov 
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Wei-keng Liao
> > Sent: Friday, February 15, 2008 3:41 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [MPICH] max number of isend/irecv allowed?
> > 
> > 
> > Is there a maximum number of MPI Isend/Irecv calls allowed per 
> > process before MPI_Waitall is called?
> > 
> > I am seeing the error message below when a large number of 
> > isend/irecv calls are posted (e.g., with 512 processes):
> > 
> >   [cli_53]: aborting job:
> >   Fatal error in MPI_Waitall: Other MPI error, error stack:
> >   MPI_Waitall(258)............................: 
> > MPI_Waitall(count=1024,
> >   req_array=0x5f7730, status_array=0x8176c0) failed
> >   MPIDI_CH3i_Progress_wait(215)...............: an error 
> > occurred while
> >   handling an event returned by MPIDU_Sock_Wait()
> >   MPIDI_CH3I_Progress_handle_sock_event(779)..:
> >   MPIDI_CH3_Sockconn_handle_connect_event(608): [ch3:sock] failed to
> >   connnect to remote process
> >   MPIDU_Socki_handle_connect(791).............: connection failure
> >   (set=0,sock=18,errno=110:(strerror() not found))
> > 
> >   INTERNAL ERROR: Invalid error class (66) encountered while 
> > returning from
> >   MPI_Waitall.  Please file a bug report.  No error stack is 
> > available.
> >   [cli_29]: aborting job:
> > 
> > The attached program reproduces the error. The error occurs only 
> > when running more than 512 processes. (I tested 8 processes per 
> > node; each node has 2 CPUs.) The program is extracted from 
> > ADIOI_Calc_others_req(). I found that the collective I/O crash is 
> > due to this error. I think it may also be related to the hanging 
> > problem I posted earlier, which has not yet been solved.
> > 
> > I am using mpich2-1.0.6p1. 
> > 
> > Wei-keng
> > 
> > 
> 



