[mpich-discuss] Finalize error

Rajeev Thakur thakur at mcs.anl.gov
Fri Apr 18 11:24:07 CDT 2008


Can you send us a small test program that reproduces this error?

Rajeev 

> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov 
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Quentin Bossard
> Sent: Thursday, April 17, 2008 9:29 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Finalize error
> 
> Hi Dave,
> 
> Thank you for your answer. This has been very interesting to 
> me. I am "glad" you ran into this problem as well.
> 
> When I say I dispatch tasks, I do not use the functions you described.
> It was more to illustrate the fact that I use MPI to parallelize
> similar jobs with an (almost) equal distribution. Basically:
> 1. Every process (including the master) works on k jobs
>    (k ~= n_total_jobs / n_process).
> 2. Every slave uses MPI to transmit its results to the master, which
>    writes them to files iteratively.
> 
> The program runs fairly fast; not the computation part, but the
> "reduce" part is quick.
> 
> Quentin
> 
> On Mon, Apr 14, 2008 at 5:45 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> > We have run into failures that look like this in our own testing.
> > If this really is the same bug then we know what is causing it, but
> > it's a nasty one to fix.  The bug is most likely still present in
> > 1.0.7, since, to the best of my knowledge, we have not attempted the
> > fix yet.
> >
> >  When you say that you dispatch tasks, do you mean that you are
> > using MPI_Comm_spawn, MPI_Comm_connect, MPI_Comm_accept, or any
> > other dynamic process mechanism?  Also, how quickly does your
> > program run?  We can most often reproduce the problem when the
> > programs are extremely short and very synchronized (perform the
> > same steps at very close to the same times).
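> >
> >  Something as trivially short as the sketch below (purely
> > illustrative, not one of our actual tests) is the kind of program we
> > mean:
> >
> > #include <mpi.h>
> >
> > int main(int argc, char **argv)
> > {
> >     /* nothing but init, a barrier, and finalize, so every rank
> >        reaches MPI_Finalize at almost exactly the same time */
> >     MPI_Init(&argc, &argv);
> >     MPI_Barrier(MPI_COMM_WORLD);
> >     MPI_Finalize();
> >     return 0;
> > }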
> >
> >  -Dave
> >
> >
> >
> >  On Apr 14, 2008, at 2:33 AM, Quentin Bossard wrote:
> >
> >
> > > Hi,
> > > Thank you for your answer. I am currently using the 1.0.6 version.
> > > I currently cannot try with the 1.0.7 version. Is this a known bug
> > > from 1.0.6 which disappeared in 1.0.7?
> > > Best regards,
> > >
> > > Quentin
> > >
> > > On Fri, Apr 11, 2008 at 7:45 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > > Which version of MPICH2 are you using? Can you try with the latest
> > > version, 1.0.7?
> > >
> > > Rajeev
> > >
> > >
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Quentin Bossard
> > > Sent: Friday, April 11, 2008 2:33 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [mpich-discuss] Finalize error
> > >
> > > Hi everyone, I am trying to run a program I wrote myself using MPI.
> > > The basic idea is to dispatch tasks in the program on several
> > > cores/computers. It works fine (i.e. the results of the tasks are
> > > correct and well collected). However, I get an error after the
> > > finalize (during...?). Anyway, the "Exiting program" message is
> > > printed after the finalize call (and only by the master).
> > > I have not been able to find what is causing this error. The
> > > message is below. Note that the error is not deterministic (i.e. it
> > > does not happen all the time...). If anyone has even the beginning
> > > of an idea, I would be grateful to hear it.
> > >
> > > Another question: is there a friendly GPL (or at least free) MPI
> > > debugger?
> > >
> > > Thanks in advance for your help
> > >
> > > Quentin
> > >
> > >
> > > 0 : Exiting program
> > > Assertion failed in file ch3u_connect_sock.c at line 805: vcch->conn == conn
> > > [cli_5]: aborting job:
> > > internal ABORT - process 5
> > > [cli_4]: aborting job:
> > > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > > MPI_Finalize(255).........................: MPI_Finalize failed
> > > MPI_Finalize(154).........................:
> > > MPID_Finalize(129)........................:
> > > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > > MPIDI_CH3I_Progress_handle_sock_event(420):
> > > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=4,errno=54:(strerror() not found))
> > > Assertion failed in file ch3u_connect_sock.c at line 805: vcch->conn == conn
> > > [cli_6]: aborting job:
> > > internal ABORT - process 6
> > > [cli_2]: aborting job:
> > > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > > MPI_Finalize(255).........................: MPI_Finalize failed
> > > MPI_Finalize(154).........................:
> > > MPID_Finalize(129)........................:
> > > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > > MPIDI_CH3I_Progress_handle_sock_event(420):
> > > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=4,errno=54:(strerror() not found))
> > > [cli_3]: aborting job:
> > > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > > MPI_Finalize(255).........................: MPI_Finalize failed
> > > MPI_Finalize(154).........................:
> > > MPID_Finalize(129)........................:
> > > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > > MPIDI_CH3I_Progress_handle_sock_event(420):
> > > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=2,errno=54:(strerror() not found))
> > > rank 5 in job 1741  hercules.arbitragis_64602   caused collective abort of all ranks
> > >  exit status of rank 5: killed by signal 9
> > > rank 4 in job 1741  hercules.arbitragis_64602   caused collective abort of all ranks
> > >  exit status of rank 4: killed by signal 9
> > > rank 3 in job 1741  hercules.arbitragis_64602   caused collective abort of all ranks
> > >  exit status of rank 3: killed by signal 9
> > > rank 2 in job 1741  hercules.arbitragis_64602   caused collective abort of all ranks
> > >  exit status of rank 2: killed by signal 9
> > > Exit 137
> > >
> > >
> > >
> >
> >
> 
> 



