[mpich-discuss] Socket closed
Dave Goodell
goodell at mcs.anl.gov
Wed Nov 4 08:53:13 CST 2009
When using TCP, a "socket closed" error reported by a process A is
usually a sign that the actual failure occurred in some other process
B. A typical example is B segfaulting for some reason, anywhere in the
code (including your user code), and crashing. The OS tends to report
the broken TCP connection to B's peers (such as A) before the MPICH2
process management system realizes that B has died, so those peers fail
with fatal communication errors. The process management system then
receives an explicit MPI_Abort from the default MPI_ERRORS_ARE_FATAL
error handler on A, still before it has noticed that B is already dead,
and so it reports the failure as coming from process A instead.
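If it would help to see which call fails before the whole job is torn
down, one option is to replace the default MPI_ERRORS_ARE_FATAL handler
with MPI_ERRORS_RETURN and log the error string yourself. Here is a
minimal sketch using only standard MPI calls (not your application's
code); note that after a communication failure the state of MPI is
undefined, so logging and shutting down is about all you can safely do,
and whether the error comes back cleanly at all depends on the
implementation's fault handling:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Return error codes instead of aborting the whole job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rc = MPI_Barrier(MPI_COMM_WORLD);  /* stands in for any collective */
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI call failed: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }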
It is unlikely that you are experiencing the same underlying problem
as ticket #838, despite the similar symptoms. Are there any messages
from the process manager about exit codes for your processes?
Sometimes you'll see a bunch of "signal 9" (SIGKILL) exits and then one
or more "signal 11" (SIGSEGV) or "signal 10" (SIGBUS) exits. The
SIGKILLs come from the process manager killing the rest of the job, but
a SIGSEGV or SIGBUS points to a memory bug, either in the application
or in MPICH2.
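One way to tell the original crash apart from the peers that were
SIGKILLed afterwards is to install a last-gasp signal handler that tags
the fatal signal with the MPI rank. A rough sketch (illustrative only,
not from your code; strictly speaking snprintf is not
async-signal-safe, but it is usually fine for a dying message):

    #include <mpi.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static int my_rank = -1;

    /* Report which rank took the fatal signal, then re-raise it with
     * the default handler so the OS still writes a core file. */
    static void crash_handler(int sig)
    {
        char buf[64];
        int n = snprintf(buf, sizeof buf, "rank %d caught signal %d\n",
                         my_rank, sig);
        if (n > 0)
            write(STDERR_FILENO, buf, (size_t)n);
        signal(sig, SIG_DFL);
        raise(sig);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        signal(SIGSEGV, crash_handler);
        signal(SIGBUS, crash_handler);

        /* ... application code ... */

        MPI_Finalize();
        return 0;
    }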
Do you have core dumps enabled? If your "process B" segfaulted (perhaps
because it ran out of memory?), you can examine the core file to see
what happened.
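Batch shells often start jobs with "ulimit -c 0", so if no core files
show up, either put "ulimit -c unlimited" in your job script or raise
the limit from inside the program. A small sketch of the latter (this
cannot raise the limit above the hard limit, so if the hard limit is
already 0 the job script is your only option):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Allow core files of unlimited size for this process. */
        struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };
        if (setrlimit(RLIMIT_CORE, &rl) != 0)
            perror("setrlimit(RLIMIT_CORE)");

        /* ... rest of the program; a later crash should now leave a
         * core file that "gdb ./your_app core" plus "bt" can inspect ... */
        return 0;
    }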
-Dave
On Nov 4, 2009, at 7:58 AM, Tim Kroeger wrote:
> Dear all,
>
> There has been a mistake in my mail; that is, I am using mpich2-1.2
> (not mpich2-1.1.1p1). After googling more about my error message, I
> found the page https://trac.mcs.anl.gov/projects/mpich2/ticket/838,
> which seems to deal with similar problems, but as far as I
> understand, the fix mentioned there is already included in
> mpich2-1.2, isn't it?
>
> Any help is greatly appreciated.
>
> Best Regards,
>
> Tim
>
> On Wed, 4 Nov 2009, Tim Kroeger wrote:
>
>> Dear all,
>>
>> In my application, I get the following error message:
>>
>> ================================================================
>>
>> Fatal error in MPI_Allgatherv: Other MPI error, error stack:
>> MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x70cdbbf0,
>> scount=898392, MPI_DOUBLE, rbuf=0x75cc6cf0, rcounts=0x5c247280,
>> displs=0x5c2471c0, MPI_DOUBLE, comm=0xc4000021) failed
>> MPIR_Allgatherv(789)..............:
>> MPIC_Sendrecv(161)................:
>> MPIC_Wait(513)....................:
>> MPIDI_CH3I_Progress(150)..........:
>> MPID_nem_mpich2_blocking_recv(948):
>> MPID_nem_tcp_connpoll(1670).......:
>> state_commrdy_handler(1520).......:
>> MPID_nem_tcp_recv_handler(1412)...: socket closed
>>
>> ================================================================
>>
>> This is using mpich2-1.1.1p1. The problem is reproducible, but it
>> appears inside a complex application, and the program keeps running
>> successfully for over 2 hours before the crash occurs.
>>
>> Can anybody tell me what exactly this message means, what its
>> possible causes are, and how I can track the problem down
>> efficiently?
>>
>> Best Regards,
>>
>> Tim
>>
>> --
>> Dr. Tim Kroeger
>> tim.kroeger at mevis.fraunhofer.de Phone +49-421-218-7710
>> tim.kroeger at cevis.uni-bremen.de Fax +49-421-218-4236
>>
>> Fraunhofer MEVIS, Institute for Medical Image Computing
>> Universitaetsallee 29, 28359 Bremen, Germany
>
> --
> Dr. Tim Kroeger
> tim.kroeger at mevis.fraunhofer.de Phone +49-421-218-7710
> tim.kroeger at cevis.uni-bremen.de Fax +49-421-218-4236
>
> Fraunhofer MEVIS, Institute for Medical Image Computing
> Universitaetsallee 29, 28359 Bremen, Germany
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss