[mpich-discuss] Finalize error

Quentin Bossard quentin.bossard at gmail.com
Thu Apr 17 09:29:00 CDT 2008


Hi Dave,

Thank you for your answer. This has been very interesting to me. I am
"glad" you ran into this problem as well.

When I say I dispatch tasks, I do not mean that I use the functions you
described. It was more to illustrate the fact that I use MPI to
parallelize similar jobs with an (almost) even distribution.
Basically:
1. Every process (including the master) works on k jobs (k ~= n_total_jobs
/ n_process)
2. Every slave uses MPI to transmit its results to the master, which
writes them to files iteratively (a rough sketch follows below)

The program runs fairly fast (not the computation part, but the "reduce"/collection part is quick).

Quentin

On Mon, Apr 14, 2008 at 5:45 PM, Dave Goodell <goodell at mcs.anl.gov> wrote:
> We have run into failures that look like this in our own testing.  If this
> really is the same bug then we know what is causing it, but it's a nasty one
> to fix.  The bug is most likely still present in 1.0.7, since, to the best
> of my knowledge, we have not attempted the fix yet.
>
>  When you say that you dispatch tasks, do you mean that you are using
> MPI_Comm_spawn, MPI_Comm_connect, MPI_Comm_accept, or any other dynamic
> process mechanism?  Also, how quickly does your program run?  We can most
> often reproduce the problem when the programs are extremely short and very
> synchronized (perform the same steps at very close to the same times).
>
>  -Dave
>
>
>
>  On Apr 14, 2008, at 2:33 AM, Quentin Bossard wrote:
>
>
> > Hi,
> > Thank you for your answer. I am currently using version 1.0.6; I cannot try version 1.0.7 at the moment. Is this a known bug in 1.0.6 that disappeared in 1.0.7?
> > Best regards,
> >
> > Quentin
> >
> > On Fri, Apr 11, 2008 at 7:45 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > Which version of MPICH2 are you using? Can you try with the latest version, 1.0.7?
> >
> > Rajeev
> >
> >
> > From: owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Quentin Bossard
> > Sent: Friday, April 11, 2008 2:33 AM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [mpich-discuss] Finalize error
> >
> > Hi everyone, I am trying to run a program I wrote myself using MPI. The basic idea is to dispatch tasks in the program across several cores/computers. It works fine (i.e. the results of the tasks are correct and collected properly). However, I get an error after (during...?) the finalize. In any case, the "Exiting program" message is printed after the MPI_Finalize call (and only by the master).
> > I have not been able to find what is causing this error. The message is below. Note that the error is not deterministic (i.e. it does not happen every time...). If someone has any idea, I would be grateful to hear it.
> >
> > Another question: is there a user-friendly GPL (or at least free) MPI debugger?
> >
> > Thanks in advance for your help
> >
> > Quentin
> >
> >
> > 0 : Exiting program
> > Assertion failed in file ch3u_connect_sock.c at line 805: vcch->conn == conn
> > [cli_5]: aborting job:
> > internal ABORT - process 5
> > [cli_4]: aborting job:
> > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > MPI_Finalize(255).........................: MPI_Finalize failed
> > MPI_Finalize(154).........................:
> > MPID_Finalize(129)........................:
> > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > MPIDI_CH3I_Progress_handle_sock_event(420):
> > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=4,errno=54:(strerror() not found))
> > Assertion failed in file ch3u_connect_sock.c at line 805: vcch->conn == conn
> > [cli_6]: aborting job:
> > internal ABORT - process 6
> > [cli_2]: aborting job:
> > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > MPI_Finalize(255).........................: MPI_Finalize failed
> > MPI_Finalize(154).........................:
> > MPID_Finalize(129)........................:
> > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > MPIDI_CH3I_Progress_handle_sock_event(420):
> > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=4,errno=54:(strerror() not found))
> > [cli_3]: aborting job:
> > Fatal error in MPI_Finalize: Other MPI error, error stack:
> > MPI_Finalize(255).........................: MPI_Finalize failed
> > MPI_Finalize(154).........................:
> > MPID_Finalize(129)........................:
> > MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while the device was waiting for all open connections to close
> > MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > MPIDI_CH3I_Progress_handle_sock_event(420):
> > MPIDU_Socki_handle_read(633)..............: connection failure (set=0,sock=2,errno=54:(strerror() not found))
> > rank 5 in job 1741  hercules.arbitragis_64602   caused collective abort of all ranks
> >  exit status of rank 5: killed by signal 9
> > rank 4 in job 1741  hercules.arbitragis_64602   caused collective abort of all ranks
> >  exit status of rank 4: killed by signal 9
> > rank 3 in job 1741  hercules.arbitragis_64602   caused collective abort of all ranks
> >  exit status of rank 3: killed by signal 9
> > rank 2 in job 1741  hercules.arbitragis_64602   caused collective abort of all ranks
> >  exit status of rank 2: killed by signal 9
> > Exit 137
> >
> >
> >
>
>



