[mpich-discuss] Socket closed

Dave Goodell goodell at mcs.anl.gov
Wed Nov 4 10:30:10 CST 2009


Check on Darius' suggestion; that could definitely cause this problem.
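If that's what's happening, the usual pattern is to end error paths
with MPI_Abort rather than a plain exit(1), so the process manager
sees a clean abort instead of a process that vanished without calling
MPI_Finalize.  A minimal sketch (the "failed" flag just stands in for
your application's own error check):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, failed = 0;   /* stand-in for a real error condition */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (failed) {
            fprintf(stderr, "rank %d: fatal error, aborting\n", rank);
            /* tears the whole job down and gives the process manager a
               clean exit code, instead of exit(1) without MPI_Finalize */
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_Finalize();
        return 0;
    }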

As for enabling core dumps, it depends on your platform.  The most
common fix is to put "ulimit -c unlimited" in your .bashrc/.zshrc
file, then log out and log back in again.  You could run that command
in your shell before launching, but mpiexec-launched processes may not
inherit that setting in all cases, so it's safer in a login script.
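
If the login-script route doesn't propagate to the MPI processes on
your system, another option is to raise the limit from inside the
program itself.  This is only a rough sketch, assuming a POSIX system
(the enable_core_dumps name is just for illustration):

    #include <stdio.h>
    #include <sys/resource.h>

    /* raise the soft core-file size limit to the hard limit, for the
       case where mpiexec-launched processes don't inherit the
       "ulimit -c unlimited" setting from a login script */
    static void enable_core_dumps(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_CORE, &rl) == 0) {
            rl.rlim_cur = rl.rlim_max;
            if (setrlimit(RLIMIT_CORE, &rl) != 0)
                perror("setrlimit(RLIMIT_CORE)");
        }
    }

    int main(void)
    {
        enable_core_dumps();
        /* ... MPI_Init and the rest of the application ... */
        return 0;
    }

Note that this only helps if the hard limit on the compute nodes is
nonzero; if it is 0, you still need the login-script (or a system-wide
limits) change.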

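One other thing that might help pin down which rank actually hits the
error first: switch the error handler on MPI_COMM_WORLD to
MPI_ERRORS_RETURN and print the error string at the point of failure,
rather than letting the MPI_ERRORS_ARE_FATAL handler abort the job
from a peer (see the explanation quoted below).  A rough sketch; the
MPI_Bcast is just a stand-in for whatever MPI calls your code makes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf = 0, err, len;
        char msg[MPI_MAX_ERROR_STRING];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* have MPI calls return error codes instead of aborting */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        err = MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "rank %d: MPI error: %s\n", rank, msg);
            MPI_Abort(MPI_COMM_WORLD, err);
        }

        MPI_Finalize();
        return 0;
    }
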
-Dave

On Nov 4, 2009, at 10:17 AM, Darius Buntinas wrote:

> It looks like maybe the application exited with a return value of 1
> (indicating an error), without calling MPI_Finalize (or MPI_Abort).
>
> -d
>
> On 11/04/2009 10:09 AM, Tim Kroeger wrote:
>> Dear Dave,
>>
>> On Wed, 4 Nov 2009, Dave Goodell wrote:
>>
>>> When using TCP, the "socket closed" error reported on a process A
>>> is usually a sign that there was actually a failure in some other
>>> process B.  An example would be B segfaulting for some reason,
>>> anywhere in the code (including your user code).  The OS tends to
>>> report the broken TCP connection, killing one or more of B's peers
>>> (like A), before the MPICH2 process management system realizes that
>>> the process has died.  Then the process management system receives
>>> an explicit MPI_Abort from the MPI_ERRORS_ARE_FATAL error handler,
>>> still before it has noticed that B is already dead, and reports the
>>> failure from process A instead.
>>
>> Okay, I understand.
>>
>>> It is unlikely that you are experiencing the same underlying problem
>>> as ticket #838, despite the similar symptoms.  Are there any  
>>> messages
>>> from the process manager about exit codes for your processes?
>>
>> I have now attached all messages I got.  Note that I am running with
>> 24 processes.  The stderr output contains the "socket closed" message
>> 22 times (although it's not always exactly identical), whereas the
>> stdout output contains the "signal 9" message 4 times and the "return
>> code 1" message 9 times.  I find that somewhat confusing, because
>> it's not of the form "23*x + 1*y" that it should be if a single
>> process had crashed on its own.
>>
>>> Do you have core dumps enabled?
>>
>> No; how can I enable them?
>>
>> Anyway, to examine whether your idea that one of the processes ran
>> out of memory is correct, I'll meanwhile run the application with
>> fewer processes per node (that is, more nodes with the same total
>> number of processes).
>>
>> Best Regards,
>>
>> Tim
>>
>>


