[mpich-discuss] Socket closed

Darius Buntinas buntinas at mcs.anl.gov
Wed Nov 4 10:17:06 CST 2009


It looks like maybe the application exited with a return value of 1
(indicating an error), without calling MPI_Finalize (or MPI_Abort).
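
For reference, here is a minimal sketch (the failure condition and the
error code 1 are only placeholders) of how an application can terminate
through MPI instead of calling plain exit():

    /* Minimal sketch: report a fatal error through MPI rather than exit().
     * The failure condition and the error code 1 are placeholders. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int failed = 0;  /* set to nonzero when the application hits an error */
        if (failed) {
            fprintf(stderr, "fatal error, aborting\n");
            MPI_Abort(MPI_COMM_WORLD, 1);  /* process manager kills all ranks */
        }

        MPI_Finalize();  /* normal termination path */
        return 0;
    }

Either path lets the process manager report a sensible exit status for
every rank.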

-d

On 11/04/2009 10:09 AM, Tim Kroeger wrote:
> Dear Dave,
> 
> On Wed, 4 Nov 2009, Dave Goodell wrote:
> 
>> When using TCP, the "socket closed" error reported on a process A is
>> usually a sign that there was actually a failure in some other process
>> B.  An example would be B segfaulting for some reason, anywhere in the
>> code (including your user code).  The OS tends to report the broken TCP
>> connection to B's peers (like A) before the MPICH2 process management
>> system realizes that B has died, so those peers fail first.  The process
>> management system then receives an explicit MPI_Abort from the
>> MPI_ERRORS_ARE_FATAL error handler on A, still before it has noticed
>> that B is already dead, and therefore reports the failure as coming from
>> process A instead.
> 
> Okay, I understand.
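
(As an aside to Dave's explanation above: the MPI_ERRORS_ARE_FATAL
behaviour can be replaced per communicator so that errors come back as
return codes, which lets the failing rank identify itself before the job
is torn down.  A rough sketch; the check() helper is only an
illustration:

    /* Rough sketch: have MPI errors on MPI_COMM_WORLD return to the
     * caller so the failing rank can print a diagnostic before aborting. */
    #include <mpi.h>
    #include <stdio.h>

    static void check(int err, const char *where)
    {
        if (err != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len, rank;
            MPI_Error_string(err, msg, &len);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            fprintf(stderr, "rank %d: %s failed: %s\n", rank, where, msg);
            MPI_Abort(MPI_COMM_WORLD, err);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* ... wrap the application's MPI calls, e.g.: */
        check(MPI_Barrier(MPI_COMM_WORLD), "MPI_Barrier");

        MPI_Finalize();
        return 0;
    }

This does not help when a rank is killed by a signal, but it makes
error-return failures much easier to attribute.)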
> 
>> It is unlikely that you are experiencing the same underlying problem
>> as ticket #838, despite the similar symptoms.  Are there any messages
>> from the process manager about exit codes for your processes?
> 
> I have now attached all messages I got.  Note that I am running with 24
> processors.  The stderr output contains the "socket closed" message 22
> times (although the wording is not always exactly the same), whereas the
> stdout output contains the "signal 9" message 4 times and the "return
> code 1" message 9 times.  I find that somewhat confusing, because it is
> not of the form "23*x + 1*y" that one would expect if a single process
> had crashed on its own.
> 
>> Do you have core dumps enabled?
> 
> No; how can I enable them?
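
Core dumps are normally controlled by the per-process core-file size
limit, which you would raise with "ulimit -c unlimited" in the shell or
batch script that launches the job.  You can also try raising the limit
from the code itself; a sketch (it can only go up to the hard limit your
system imposes):

    /* Sketch: raise the core-file size limit from inside the program so
     * a crash can produce a core dump.  This works only up to the hard
     * limit set by the shell or batch system. */
    #include <sys/resource.h>
    #include <stdio.h>

    static void enable_core_dumps(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_CORE, &rl) == 0) {
            rl.rlim_cur = rl.rlim_max;  /* raise soft limit to hard limit */
            if (setrlimit(RLIMIT_CORE, &rl) != 0)
                perror("setrlimit(RLIMIT_CORE)");
        }
    }

Calling something like enable_core_dumps() right after MPI_Init should
then let a crashing rank leave a core file behind.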
> 
> Anyway, to check your idea that one of the processes ran out of memory,
> I'll meanwhile run the application with fewer processes per node (that
> is, more nodes with the same total number of processes).
> 
> Best Regards,
> 
> Tim
> 
> 

