[mpich-discuss] Finalize error

Dave Goodell goodell at mcs.anl.gov
Mon Apr 14 10:45:17 CDT 2008


We have run into failures that look like this in our own testing.  If  
this really is the same bug then we know what is causing it, but it's  
a nasty one to fix.  The bug is most likely still present in 1.0.7,  
since, to the best of my knowledge, we have not attempted the fix yet.

When you say that you dispatch tasks, do you mean that you are using  
MPI_Comm_spawn, MPI_Comm_connect, MPI_Comm_accept, or any other  
dynamic process mechanism?  Also, how quickly does your program run?   
We can most often reproduce the problem when the programs are  
extremely short and very synchronized (perform the same steps at very  
close to the same times).

-Dave

On Apr 14, 2008, at 2:33 AM, Quentin Bossard wrote:

> Hi,
> Thank you for your answer. I am currently using version 1.0.6 and
> cannot try version 1.0.7 at the moment. Is this a known bug in
> 1.0.6 that was fixed in 1.0.7?
> Best regards,
>
> Quentin
>
> On Fri, Apr 11, 2008 at 7:45 PM, Rajeev Thakur <thakur at mcs.anl.gov>  
> wrote:
> Which version of MPICH2 are you using? Can you try with the latest  
> version, 1.0.7?
>
> Rajeev
>
>
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Quentin Bossard
> Sent: Friday, April 11, 2008 2:33 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] Finalize error
>
> Hi everyone,
> I am trying to run a program I wrote myself using MPI.
> The basic idea is to dispatch tasks in the program across several
> cores/computers. It works fine (i.e., the results of the tasks are
> correct and properly collected). However, I get an error after (or
> perhaps during) the finalize. The "Exiting program" message is
> printed after the finalize call (and only by the master).
> I have not been able to find what is causing this error. The
> message is below. Note that the error is not deterministic (i.e., it
> does not happen every time). If anyone has even the beginning of an
> idea, I would be grateful to hear it.
>
> Another question: is there a user-friendly GPL (or at least free) MPI
> debugger?
>
> Thanks in advance for your help
>
> Quentin
>
>
> 0 : Exiting program
> Assertion failed in file ch3u_connect_sock.c at line 805: vcch->conn == conn
> [cli_5]: aborting job:
> internal ABORT - process 5
> [cli_4]: aborting job:
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(255).........................: MPI_Finalize failed
> MPI_Finalize(154).........................:
> MPID_Finalize(129)........................:
> MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while  
> the device was waiting for all open connections to close
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while  
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(420):
> MPIDU_Socki_handle_read(633)..............: connection failure  
> (set=0,sock=4,errno=54:(strerror() not found))
> Assertion failed in file ch3u_connect_sock.c at line 805: vcch->conn == conn
> [cli_6]: aborting job:
> internal ABORT - process 6
> [cli_2]: aborting job:
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(255).........................: MPI_Finalize failed
> MPI_Finalize(154).........................:
> MPID_Finalize(129)........................:
> MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while  
> the device was waiting for all open connections to close
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while  
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(420):
> MPIDU_Socki_handle_read(633)..............: connection failure  
> (set=0,sock=4,errno=54:(strerror() not found))
> [cli_3]: aborting job:
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(255).........................: MPI_Finalize failed
> MPI_Finalize(154).........................:
> MPID_Finalize(129)........................:
> MPIDI_CH3U_VC_WaitForClose(339)...........: an error occurred while  
> the device was waiting for all open connections to close
> MPIDI_CH3i_Progress_wait(215).............: an error occurred while  
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(420):
> MPIDU_Socki_handle_read(633)..............: connection failure  
> (set=0,sock=2,errno=54:(strerror() not found))
> rank 5 in job 1741  hercules.arbitragis_64602   caused collective  
> abort of all ranks
>   exit status of rank 5: killed by signal 9
> rank 4 in job 1741  hercules.arbitragis_64602   caused collective  
> abort of all ranks
>   exit status of rank 4: killed by signal 9
> rank 3 in job 1741  hercules.arbitragis_64602   caused collective  
> abort of all ranks
>   exit status of rank 3: killed by signal 9
> rank 2 in job 1741  hercules.arbitragis_64602   caused collective  
> abort of all ranks
>   exit status of rank 2: killed by signal 9
> Exit 137
>
>
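
A note on the last lines of the quoted log: the final "Exit 137" is consistent with the "killed by signal 9" messages, because POSIX shells conventionally report 128 + N when a process is terminated by signal N. The snippet below is only a minimal sketch of that arithmetic; the helper name decode_shell_exit is hypothetical, not part of MPICH2 or mpiexec, and it assumes a POSIX-style shell produced the status.

```python
import signal


def decode_shell_exit(status: int) -> str:
    """Interpret a shell-style exit status.

    Assumption: the status comes from a POSIX-style shell, which
    reports 128 + N when the process was terminated by signal N.
    """
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name known to this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with code {status}"


if __name__ == "__main__":
    # 137 is the status shown at the end of the quoted log.
    print(decode_shell_exit(137))
```

Signal 9 cannot be caught by the process, so the status says only that the ranks were killed from outside (here, presumably by the process manager during the collective abort). It does not identify the underlying failure; that information is in the assertion and the MPI_Finalize error stack.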



