[mpich-discuss] Socket closed
Dave Goodell
goodell at mcs.anl.gov
Wed Nov 4 08:53:13 CST 2009
When using TCP, a "socket closed" error reported by a process A is
usually a sign that the actual failure occurred in some other process
B. A typical example is B segfaulting for some reason, anywhere in the
code (including your user code), and crashing. The OS tends to report
the broken TCP connection to B's peers (such as A) before the MPICH2
process management system realizes that B has died, so those peers fail
with fatal communication errors. The process management system then
receives an explicit MPI_Abort from the default MPI_ERRORS_ARE_FATAL
error handler on A, still before it has noticed that B is already dead,
and so it reports the failure as coming from process A instead.
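If it would help to see which call fails before the whole job is torn
down, one option is to replace the default MPI_ERRORS_ARE_FATAL handler
with MPI_ERRORS_RETURN and log the error string yourself. Here is a
minimal sketch using only standard MPI calls (not your application's
code); note that after a communication failure the state of MPI is
undefined, so logging and shutting down is about all you can safely do,
and whether the error comes back cleanly at all depends on the
implementation's fault handling:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Return error codes instead of aborting the whole job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rc = MPI_Barrier(MPI_COMM_WORLD);  /* stands in for any collective */
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI call failed: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }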
It is unlikely that you are experiencing the same underlying problem
as ticket #838, despite the similar symptoms. Are there any messages
from the process manager about exit codes for your processes?
Sometimes you'll see a bunch of "signal 9" (SIGKILL) exits and then one
or more "signal 11" (SIGSEGV) or "signal 10" (SIGBUS) exits. The
SIGKILLs come from the process manager killing the rest of the job, but
a SIGSEGV or SIGBUS points to a memory bug, either in the application
or in MPICH2.
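One way to tell the original crash apart from the peers that were
SIGKILLed afterwards is to install a last-gasp signal handler that tags
the fatal signal with the MPI rank. A rough sketch (illustrative only,
not from your code; strictly speaking snprintf is not
async-signal-safe, but it is usually fine for a dying message):

    #include <mpi.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static int my_rank = -1;

    /* Report which rank took the fatal signal, then re-raise it with
     * the default handler so the OS still writes a core file. */
    static void crash_handler(int sig)
    {
        char buf[64];
        int n = snprintf(buf, sizeof buf, "rank %d caught signal %d\n",
                         my_rank, sig);
        if (n > 0)
            write(STDERR_FILENO, buf, (size_t)n);
        signal(sig, SIG_DFL);
        raise(sig);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        signal(SIGSEGV, crash_handler);
        signal(SIGBUS, crash_handler);

        /* ... application code ... */

        MPI_Finalize();
        return 0;
    }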
Do you have core dumps enabled? If your "process B" segfaulted (perhaps
because it ran out of memory?), you can examine the core file to see
what happened.
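Batch shells often start jobs with "ulimit -c 0", so if no core files
show up, either put "ulimit -c unlimited" in your job script or raise
the limit from inside the program. A small sketch of the latter (this
cannot raise the limit above the hard limit, so if the hard limit is
already 0 the job script is your only option):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Allow core files of unlimited size for this process. */
        struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };
        if (setrlimit(RLIMIT_CORE, &rl) != 0)
            perror("setrlimit(RLIMIT_CORE)");

        /* ... rest of the program; a later crash should now leave a
         * core file that "gdb ./your_app core" plus "bt" can inspect ... */
        return 0;
    }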
-Dave
On Nov 4, 2009, at 7:58 AM, Tim Kroeger wrote:
> Dear all,
>
> There has been a mistake in my mail; that is, I am using mpich2-1.2
> (not mpich2-1.1.1p1). After googling more about my error message, I
> found the page https://trac.mcs.anl.gov/projects/mpich2/ticket/838,
> which seems to deal with similar problems, but as far as I
> understand, the fix mentioned there is already included in
> mpich2-1.2, isn't it?
>
> Any help is greatly appreciated.
>
> Best Regards,
>
> Tim
>
> On Wed, 4 Nov 2009, Tim Kroeger wrote:
>
>> Dear all,
>>
>> In my application, I get the following error message:
>>
>> ================================================================
>>
>> Fatal error in MPI_Allgatherv: Other MPI error, error stack:
>> MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x70cdbbf0,
>> scount=898392, MPI_DOUBLE, rbuf=0x75cc6cf0, rcounts=0x5c247280,
>> displs=0x5c2471c0, MPI_DOUBLE, comm=0xc4000021) failed
>> MPIR_Allgatherv(789)..............:
>> MPIC_Sendrecv(161)................:
>> MPIC_Wait(513)....................:
>> MPIDI_CH3I_Progress(150)..........:
>> MPID_nem_mpich2_blocking_recv(948):
>> MPID_nem_tcp_connpoll(1670).......:
>> state_commrdy_handler(1520).......:
>> MPID_nem_tcp_recv_handler(1412)...: socket closed
>>
>> ================================================================
>>
>> This is using mpich2-1.1.1p1. The problem is reproducible, but it
>> appears inside a complex application, and the program keeps running
>> successfully for over 2 hours before the crash occurs.
>>
>> Can anybody tell me what exactly this message means, what its
>> possible causes are, and how I can track the problem down
>> efficiently?
>>
>> Best Regards,
>>
>> Tim
>>
>> --
>> Dr. Tim Kroeger
>> tim.kroeger at mevis.fraunhofer.de Phone +49-421-218-7710
>> tim.kroeger at cevis.uni-bremen.de Fax +49-421-218-4236
>>
>> Fraunhofer MEVIS, Institute for Medical Image Computing
>> Universitaetsallee 29, 28359 Bremen, Germany
>
> --
> Dr. Tim Kroeger
> tim.kroeger at mevis.fraunhofer.de Phone +49-421-218-7710
> tim.kroeger at cevis.uni-bremen.de Fax +49-421-218-4236
>
> Fraunhofer MEVIS, Institute for Medical Image Computing
> Universitaetsallee 29, 28359 Bremen, Germany
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss