[mpich-discuss] Socket closed

Tim Kroeger tim.kroeger at cevis.uni-bremen.de
Wed Nov 4 10:09:04 CST 2009


Dear Dave,

On Wed, 4 Nov 2009, Dave Goodell wrote:

> When using TCP, the "socket closed" reported on a process A is usually a sign 
> that there was actually a failure in some other process B.  An example would 
> be B segfaulting for some reason, anywhere in the code (including your user 
> code) and then crashing.  The OS tends to report the broken TCP connection 
> before the MPICH2 process management system realizes that the process has 
> died, killing one or more of B's peers (like A).  Then the process management 
> system receives an explicit MPI_Abort from the MPI_ERRORS_ARE_FATAL error 
> handler, still before it has noticed that B is already dead, and reports the 
> failure from process A instead.

Okay, I understand.
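
For my own understanding, here is a minimal sketch of how that MPI_ERRORS_ARE_FATAL path could be sidestepped so that the error string shows up on the surviving rank itself; this assumes the standard MPI-2 error-handler calls (MPI_Comm_set_errhandler, MPI_ERRORS_RETURN, MPI_Error_string) and is only an illustration, not what our code currently does:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Return errors to the caller instead of aborting the whole job,
       so that a "socket closed" seen on process A can be inspected there. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Any communication call would do; a collective, as in our case. */
    int err = MPI_Barrier(MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "rank %d: MPI error: %s\n", rank, msg);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}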

> It is unlikely that you are experiencing the same underlying problem as 
> ticket #838, despite the similar symptoms.  Are there any messages from the 
> process manager about exit codes for your processes?

I have now attached all messages I got.  Note that I am running with 
24 processes.  The stderr output contains the "socket closed" message 
22 times (although the error stacks are not always exactly identical), 
whereas the stdout output contains the "killed by signal 9" message 4 
times and the "return code 1" message 9 times.  I find that somewhat 
confusing, because the counts are not of the form "23*x + 1*y" that 
one would expect if a single process had crashed first and taken the 
others down with it.

> Do you have core dumps enabled?

No; how can I enable them?
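
(I suppose the usual way is to raise the core-file size limit in the 
shell or batch script that launches the job; alternatively it could be 
done from within the program.  A minimal sketch, assuming a 
Linux/POSIX node where core dumps are only held back by the soft 
RLIMIT_CORE limit; the batch system may of course restrict them 
elsewhere:)

#include <sys/resource.h>
#include <stdio.h>

/* Raise the soft core-file size limit to the current hard limit. */
static void enable_core_dumps(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        rl.rlim_cur = rl.rlim_max;   /* soft limit up to the hard limit */
        if (setrlimit(RLIMIT_CORE, &rl) != 0)
            perror("setrlimit(RLIMIT_CORE)");
    }
}

(Called once early in main(), before MPI_Init().)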

Anyway, to check whether your idea that one of the processes ran out 
of memory is correct, I will in the meantime run the application with 
fewer processes per node (that is, more nodes with the same total 
number of processes).

Best Regards,

Tim

-- 
Dr. Tim Kroeger
tim.kroeger at mevis.fraunhofer.de            Phone +49-421-218-7710
tim.kroeger at cevis.uni-bremen.de            Fax   +49-421-218-4236

Fraunhofer MEVIS, Institute for Medical Image Computing
Universitaetsallee 29, 28359 Bremen, Germany
-------------- next part --------------
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x70cdbbf0, scount=898392, MPI_DOUBLE, rbuf=0x75cc6cf0, rcounts=0x5c247280, displs=0x5c2471c0, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)...........: MPI_Allgatherv(sbuf=0x71289370, scount=858692, MPI_DOUBLE, rbuf=0x76226bd0, rcounts=0x6c25d20, displs=0x6c25c60, MPI_DOUBLE, comm=0xc4000025) failed
MPIR_Allgatherv(789)...........: 
MPIC_Sendrecv(161).............: 
MPIC_Wait(513).................: 
MPIDI_CH3I_Progress(150).......: 
MPID_nem_mpich2_test_recv(800).: 
MPID_nem_tcp_connpoll(1670)....: 
state_commrdy_handler(1520)....: 
MPID_nem_tcp_recv_handler(1412): socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6fe537f0, scount=917616, MPI_DOUBLE, rbuf=0x74e641b0, rcounts=0x5ae4b9f0, displs=0x5ae4b930, MPI_DOUBLE, comm=0xc4000025) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6ed18220, scount=844480, MPI_DOUBLE, rbuf=0x73c99e60, rcounts=0x3925a2f0, displs=0x3925a230, MPI_DOUBLE, comm=0xc4000036) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6f5c5450, scount=925192, MPI_DOUBLE, rbuf=0x745e4ad0, rcounts=0x5ad66140, displs=0x5ad66080, MPI_DOUBLE, comm=0xc4000036) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6d2624e0, scount=736608, MPI_DOUBLE, rbuf=0x72111620, rcounts=0x218a8800, displs=0x218a8740, MPI_DOUBLE, comm=0xc4000025) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)...........: MPI_Allgatherv(sbuf=0x6fb8c960, scount=840996, MPI_DOUBLE, rbuf=0x74b078c0, rcounts=0x4d8e98d0, displs=0x4d8e9810, MPI_DOUBLE, comm=0xc4000036) failed
MPIR_Allgatherv(789)...........: 
MPIC_Sendrecv(161).............: 
MPIC_Wait(513).................: 
MPIDI_CH3I_Progress(150).......: 
MPID_nem_mpich2_test_recv(800).: 
MPID_nem_tcp_connpoll(1670)....: 
state_commrdy_handler(1520)....: 
MPID_nem_tcp_recv_handler(1412): socket closed
: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x70a19720, scount=913080, MPI_DOUBLE, rbuf=0x75a21320, rcounts=0x5b32b180, displs=0x5b32b0c0, MPI_DOUBLE, comm=0xc4000025) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6d3d9970, scount=754800, MPI_DOUBLE, rbuf=0x722ac330, rcounts=0x58d103e0, displs=0x58d10320, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_ProgresFatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6f0ec460, scount=822164, MPI_DOUBLE, rbuf=0x74042740, rcounts=0x256b12b0, displs=0x256b11f0, MPI_DOUBLE, comm=0xc4000025) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6fe10610, scount=836132, MPI_DOUBLE, rbuf=0x74d81d70, rcounts=0x848b760, displs=0x848b6a0, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: s(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6dc95690, scount=821432, MPI_DOUBLE, rbuf=0x72bea290, rcounts=0x4c9d29d0, displs=0x4c9d2910, MPI_DOUBLE, comm=0xc4000025) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6e8500a0, scount=829316, MPI_DOUBLE, rbuf=0x737b4300, rcounts=0x7322720, displs=0x7322660, MPI_DOUBLE, comm=0xc4000029) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed

state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6fb87a00, scount=848896, MPI_DOUBLE, rbuf=0x74b12040, rcounts=0x3913b800, displs=0x3913b740, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6f145290, scount=720672, MPI_DOUBLE, rbuf=0x73fd51d0, rcounts=0x6c2f4a0, displs=0x6c2f3e0, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6b780d30, scount=658416, MPI_DOUBLE, rbuf=0x705972f0, rcounts=0x21ecacd0, displs=0x21ecac10, MPI_DOUBLE, comm=0xc4000029) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6c44e0a0, scount=730680, MPI_DOUBLE, rbuf=0x712f18a0, rcounts=0x21953bb0, displs=0x21953af0, MPI_DOUBLE, comm=0xc4000025) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6ecda900, scount=803288, MPI_DOUBLE, rbuf=0x73c0be00, rcounts=0x16d30e60, displs=0x16d30da0, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(161)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6bf480d0, scount=621184, MPI_DOUBLE, rbuf=0x70be5a60, rcounts=0x37201760, displs=0x372016a0, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6bd7c210, scount=537776, MPI_DOUBLE, rbuf=0x70976d20, rcounts=0x4b82a3f0, displs=0x4b82a330, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6cca4360, scount=608668, MPI_DOUBLE, rbuf=0x719295d0, rcounts=0x21f81540, displs=0x21f81480, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
Fatal error in MPI_Allgatherv: Other MPI error, error stack:
MPI_Allgatherv(1143)..............: MPI_Allgatherv(sbuf=0x6b94aac0, scount=551472, MPI_DOUBLE, rbuf=0x705601d0, rcounts=0x5ee03a50, displs=0x5ee03990, MPI_DOUBLE, comm=0xc4000021) failed
MPIR_Allgatherv(789)..............: 
MPIC_Sendrecv(164)................: 
MPIC_Wait(513)....................: 
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1670).......: 
state_commrdy_handler(1520).......: 
MPID_nem_tcp_recv_handler(1412)...: socket closed
-------------- next part --------------
rank 23 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 23: return code 1 
rank 22 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 22: killed by signal 9 
rank 21 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 21: return code 1 
rank 18 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 18: killed by signal 9 
rank 17 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 17: return code 1 
rank 14 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 14: return code 1 
rank 13 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 13: return code 1 
rank 12 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 12: return code 1 
rank 11 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 11: return code 1 
rank 10 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 10: return code 1 
rank 9 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 9: return code 1 
rank 7 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 7: killed by signal 9 
rank 0 in job 1  node092_49800   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9 

