[mpich-discuss] Finalize error
Rajeev Thakur
thakur at mcs.anl.gov
Fri Apr 18 11:24:07 CDT 2008
Can you send us a small test program that reproduces this error?
Rajeev
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Quentin Bossard
> Sent: Thursday, April 17, 2008 9:29 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Finalize error
>
> Hi Dave,
>
> Thank you for your answer. This has been very interesting to
> me. I am "glad" you ran into this problem as well.
>
> When I say I dispatch tasks, I do not use the functions you described.
> It was more to illustrate the fact that I use MPI to parallelize
> similar jobs with an (almost) equal distribution of work. Basically:
> 1. Every process (including the master) works on k jobs
>    (k ~= n_total_jobs / n_process).
> 2. Every slave uses MPI to transmit its results to the master, who
>    writes them to files iteratively.
>
> The program runs fairly fast (not the computation part, but the
> "reduce" part).
>
> Quentin
>
> On Mon, Apr 14, 2008 at 5:45 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> > We have run into failures that look like this in our own testing. If
> > this really is the same bug then we know what is causing it, but it's
> > a nasty one to fix. The bug is most likely still present in 1.0.7,
> > since, to the best of my knowledge, we have not attempted the fix yet.
> >
> > When you say that you dispatch tasks, do you mean that you are using
> > MPI_Comm_spawn, MPI_Comm_connect, MPI_Comm_accept, or any other
> > dynamic process mechanism? Also, how quickly does your program run?
> > We can most often reproduce the problem when the programs are
> > extremely short and very synchronized (perform the same steps at very
> > close to the same times).
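> >
> > (To be concrete, by a dynamic process mechanism I mean something along
> > the following lines; this is only a sketch, and the "worker" executable
> > name is a placeholder.)
> >
> > #include <mpi.h>
> >
> > int main(int argc, char **argv)
> > {
> >     MPI_Comm intercomm;
> >
> >     MPI_Init(&argc, &argv);
> >
> >     /* Start 4 additional processes at run time instead of launching
> >      * everything up front with mpiexec.  "worker" stands in for a
> >      * separately compiled MPI program. */
> >     MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
> >                    MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
> >
> >     /* ... communicate with the spawned processes over intercomm ... */
> >
> >     MPI_Comm_free(&intercomm);
> >     MPI_Finalize();
> >     return 0;
> > }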
> >
> > -Dave
> >
> >
> >
> > On Apr 14, 2008, at 2:33 AM, Quentin Bossard wrote:
> >
> >
> > > Hi,
> > > Thank you for your answer. I am currently using the 1.0.6 version;
> > > I currently cannot try the 1.0.7 version. Is this a known bug in
> > > 1.0.6 that disappeared in 1.0.7?
> > > Best regards,
> > >
> > > Quentin
> > >
> > > On Fri, Apr 11, 2008 at 7:45 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > > Which version of MPICH2 are you using? Can you try with the latest
> > > version, 1.0.7?
> > >
> > > Rajeev
> > >
> > >
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Quentin Bossard
> > > Sent: Friday, April 11, 2008 2:33 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [mpich-discuss] Finalize error
> > >
> > > Hi everyone,
> > > I am trying to run a program I wrote myself using MPI. The basic
> > > idea is to dispatch tasks in the program onto several
> > > cores/computers. It works fine (i.e. the results of the tasks are
> > > correct and properly collected). However, I get an error after the
> > > finalize call (or during it...?). In any case, the "Exiting program"
> > > message comes after the MPI_Finalize instruction (and is printed
> > > only by the master).
> > > I have not been able to find what is causing this error. The message
> > > is below. Note that the error is not deterministic (i.e. it does not
> > > happen every time...). If someone has any beginning of an idea, I
> > > would be grateful to hear it.
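> > >
> > > Roughly, the end of the program looks like the following (a
> > > simplified sketch, not the exact code):
> > >
> > > #include <mpi.h>
> > > #include <stdio.h>
> > >
> > > int main(int argc, char **argv)
> > > {
> > >     int rank;
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >
> > >     /* ... dispatch tasks and collect the results (omitted) ... */
> > >
> > >     MPI_Finalize();
> > >     if (rank == 0)
> > >         printf("%d : Exiting program\n", rank);  /* only the master */
> > >     return 0;
> > > }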
> > >
> > > Another question: is there a friendly GPL (or at least free) MPI
> > > debugger?
> > >
> > > Thanks in advance for your help
> > >
> > > Quentin
> > >
> > >
> > > 0 : Exiting program
> > > Assertion failed in file ch3u_connect_sock.c at line 805: vcch->conn == conn
> > > [cli_5]: aborting job:
> > > internal ABORT - process 5
> > > [cli_4]: aborting job:
> > > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > > MPI_Finalize(255).........................: MPI_Finalize failed
> > > MPI_Finalize(154).........................:
> > > MPID_Finalize(129)........................:
> > > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > > MPIDI_CH3I_Progress_handle_sock_event(420):
> > > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=4,errno=54:(strerror() not found))
> > > Assertion failed in file ch3u_connect_sock.c at line 805: vcch->conn == conn
> > > [cli_6]: aborting job:
> > > internal ABORT - process 6
> > > [cli_2]: aborting job:
> > > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > > MPI_Finalize(255).........................: MPI_Finalize failed
> > > MPI_Finalize(154).........................:
> > > MPID_Finalize(129)........................:
> > > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > > MPIDI_CH3I_Progress_handle_sock_event(420):
> > > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=4,errno=54:(strerror() not found))
> > > [cli_3]: aborting job:
> > > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > > MPI_Finalize(255).........................: MPI_Finalize failed
> > > MPI_Finalize(154).........................:
> > > MPID_Finalize(129)........................:
> > > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > > MPIDI_CH3I_Progress_handle_sock_event(420):
> > > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=2,errno=54:(strerror() not found))
> > > rank 5 in job 1741 hercules.arbitragis_64602 caused collective abort of all ranks
> > > exit status of rank 5: killed by signal 9
> > > rank 4 in job 1741 hercules.arbitragis_64602 caused collective abort of all ranks
> > > exit status of rank 4: killed by signal 9
> > > rank 3 in job 1741 hercules.arbitragis_64602 caused collective abort of all ranks
> > > exit status of rank 3: killed by signal 9
> > > rank 2 in job 1741 hercules.arbitragis_64602 caused collective abort of all ranks
> > > exit status of rank 2: killed by signal 9
> > > Exit 137
> > >
> > >
> > >
> >
> >
>
>