[MPICH] max number of isend/irecv allowed?
Wei-keng Liao
wkliao at ece.northwestern.edu
Sun Feb 17 13:48:57 CST 2008
OK. I ran the test code using mpich2-1.0.7rc1 on the three Linux clusters
listed below.
I got the same error messages on Ewok @ ORNL. "uname -a" shows:
Linux ewok001.ccs.ornl.gov 2.6.9-42.0.10.EL_lustre-1.4.10.1smp #1 SMP Wed
Apr 25 12:52:57 MDT 2007 x86_64 x86_64 x86_64 GNU/Linux
On Tungsten @ NCSA, the same error messages were returned. "uname -a"
shows: Linux tunc 2.4.21-32.0.1.ELsmp-perfctr-lustre #3 SMP Thu Jun 16
06:08:07 CDT 2005 i686 i686 i386 GNU/Linux
On Mercury @ NCSA, the program hung till the allocated time expired (10
minutes). No error message was given. "uname -a" shows: Linux
tg-login4 2.4.21-309.tg1 #1 SMP Thu Jun 1 17:07:28 CDT 2006 ia64 unknown
Wei-keng
On Fri, 15 Feb 2008, Rajeev Thakur wrote:
> The error message shows a connection failure. Can you try with the new 1.0.7
> rc1?
>
> Rajeev
>
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Wei-keng Liao
> > Sent: Friday, February 15, 2008 3:41 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [MPICH] max number of isend/irecv allowed?
> >
> >
> > Is there a maximum number of MPI isend/irecv calls allowed per
> > process before MPI_Waitall is called?
> >
> > I am seeing the error message below when a large number of
> > isend/irecv calls are used (e.g. 512 processes):
> >
> > [cli_53]: aborting job:
> > Fatal error in MPI_Waitall: Other MPI error, error stack:
> > MPI_Waitall(258)............................: MPI_Waitall(count=1024, req_array=0x5f7730, status_array=0x8176c0) failed
> > MPIDI_CH3i_Progress_wait(215)...............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > MPIDI_CH3I_Progress_handle_sock_event(779)..:
> > MPIDI_CH3_Sockconn_handle_connect_event(608): [ch3:sock] failed to connnect to remote process
> > MPIDU_Socki_handle_connect(791).............: connection failure (set=0,sock=18,errno=110:(strerror() not found))
> >
> > INTERNAL ERROR: Invalid error class (66) encountered while returning from
> > MPI_Waitall. Please file a bug report. No error stack is available.
> > [cli_29]: aborting job:
> >
> > The attached program reproduces the error. The error occurs
> > only when running more than 512 processes. (I tested with 8
> > processes per node; each node has 2 CPUs.) The program is
> > extracted from ADIOI_Calc_others_req(). I found that the
> > collective I/O crash is due to this error. I think it may
> > also be related to the hang I posted about earlier, which has
> > not yet been solved.
> >
> > I am using mpich2-1.0.6p1.
> >
> > Wei-keng
> >
> >
>
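
For readers following the error stack quoted above: the stack reports
errno=110 without a message ("strerror() not found"), and the MPI_Waitall
call shows count=1024. The short Python sketch below is one way to decode
an errno value on the local machine and to check the request count
arithmetic. The helper names describe_errno and requests_per_process are
made up for illustration, and the "one irecv plus one isend per rank,
self included" pattern is only an assumption that happens to match
count=1024 at 512 processes; the attached test program is not shown here.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform.

    errno numbering is platform-specific, so run this on the same kind of
    machine that produced the error.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def requests_per_process(nprocs):
    """Outstanding requests per process under the ASSUMED pattern:
    one irecv and one isend posted for every rank (self included)
    before a single MPI_Waitall."""
    return 2 * nprocs


if __name__ == "__main__":
    # 110 is the value reported in the quoted error stack.
    print(describe_errno(110))
    # Under the assumed pattern, 512 processes give count=1024.
    print(requests_per_process(512))
```

On x86 and x86_64 Linux, errno 110 is ETIMEDOUT ("Connection timed
out"), which is consistent with Rajeev's reading of the stack as a
connection failure.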