[mpich-discuss] read from socket failed (errno 10055) on 1.3.2p1

Kuraisa, Roy J (BOSTON) roy_kuraisa at merck.com
Wed Apr 20 07:24:49 CDT 2011


Sure; here it is (I changed the total number of processes but received the same error, i.e.,
mpiexec -hosts 2 usctap3825 14 usctap3488 1 ...):

Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(398)................................: MPI_Gatherv failed(sbuf=000000003A470040, scount=104314446, MPI_FLOAT, rbuf=0000000180040040, rcnts=000000000D0915E0, displs=000000000D091630, MPI_FLOAT, root=0, MPI_COMM_WORLD) failed
MPIR_Gatherv_impl(210)...........................:
MPIR_Gatherv(118)................................:
MPIC_Waitall_ft(852).............................:
MPIR_Waitall_impl(121)...........................:
MPIDI_CH3I_Progress(353).........................:
MPID_nem_mpich2_blocking_recv(905)...............:
MPID_nem_newtcp_module_poll(37)..................:
MPID_nem_newtcp_module_connpoll(2669)............:
MPID_nem_newtcp_module_recv_success_handler(2364):
MPID_nem_newtcp_module_post_readv_ex(330)........:
MPIU_SOCKW_Readv_ex(392).........................: read from socket failed, An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. (errno 10055)
Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(398)....................: MPI_Gatherv failed(sbuf=00000000280E0040, scount=104314446, MPI_FLOAT, rbuf=0000000000000000, rcnts=000000000097DE50, displs=000000000097DEA0, MPI_FLOAT, root=0, MPI_COMM_WORLD) failed
MPIR_Gatherv_impl(210)...............:
MPIR_Gatherv(166)....................:
MPIC_Send(66)........................:
MPIC_Wait(540).......................:
MPIDI_CH3I_Progress(353).............:
MPID_nem_mpich2_blocking_recv(905)...:
MPID_nem_newtcp_module_poll(37)......:
MPID_nem_newtcp_module_connpoll(2655):
gen_read_fail_handler(1145)..........: read from socket failed - The specified network name is no longer available.

Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(398)....................: MPI_Gatherv failed(sbuf=00000000280E0040, scount=104314446, MPI_FLOAT, rbuf=0000000000000000, rcnts=000000000097DE50, displs=000000000097DEA0, MPI_FLOAT, root=0, MPI_COMM_WORLD) failed
MPIR_Gatherv_impl(210)...............:
MPIR_Gatherv(166)....................:
MPIC_Send(66)........................:
MPIC_Wait(540).......................:
MPIDI_CH3I_Progress(353).............:
MPID_nem_mpich2_blocking_recv(905)...:
MPID_nem_newtcp_module_poll(37)......:
MPID_nem_newtcp_module_connpoll(2655):
gen_read_fail_handler(1145)..........: read from socket failed - The specified network name is no longer available.

Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(398)....................: MPI_Gatherv failed(sbuf=0000000023DF0040, scount=104433120, MPI_FLOAT, rbuf=0000000000000000, rcnts=000000000097EC40, displs=000000000097EC90, MPI_FLOAT, root=0, MPI_COMM_WORLD) failed
MPIR_Gatherv_impl(210)...............:
MPIR_Gatherv(166)....................:
MPIC_Send(66)........................:
MPIC_Wait(540).......................:
MPIDI_CH3I_Progress(353).............:
MPID_nem_mpich2_blocking_recv(905)...:
MPID_nem_newtcp_module_poll(37)......:
MPID_nem_newtcp_module_connpoll(2655):
gen_write_fail_handler(1194).........: write to socket failed - The specified network name is no longer available.


job aborted:
rank: node: exit code[: error message]
0: usctap3825: 1: process 0 exited without calling finalize
1: usctap3825: 123
2: usctap3825: 123
3: usctap3825: 123
4: usctap3825: 123
5: usctap3825: 123
6: usctap3825: 123
7: usctap3825: 123
8: usctap3825: 123
9: usctap3825: 123
10: usctap3825: 123
11: usctap3825: 123
12: usctap3825: 1: process 12 exited without calling finalize
13: usctap3825: 1: process 13 exited without calling finalize
14: usctap3488: 1: process 14 exited without calling finalize


cheers, roy

-----Original Message-----
From: Jayesh Krishna [mailto:jayesh at mcs.anl.gov] 
Sent: Tuesday, April 19, 2011 4:18 PM
To: mpich-discuss at mcs.anl.gov
Cc: Kuraisa, Roy J (BOSTON)
Subject: Re: [mpich-discuss] read from socket failed (errno 10055) on 1.3.2p1

Hi,
 Can you send us the complete error message?

Regards,
Jayesh

----- Original Message -----
From: "Roy J Kuraisa (BOSTON)" <roy_kuraisa at merck.com>
To: mpich-discuss at mcs.anl.gov
Sent: Tuesday, April 19, 2011 2:32:24 PM
Subject: [mpich-discuss] read from socket failed (errno 10055) on 1.3.2p1



Hi, 
Summary: 
--------------- 
On Windows, when I execute the following command (working on a fairly large dataset):
mpiexec -hosts 2 usctap3825 15 usctap3488 1 \\fs1\correlatempi.exe cfg.xml in.h5 out.h5 debug
I encounter an MPI gather error (read from socket failed, errno 10055); see the error stack at the end of this message. If I run on only one computer (with 16 cores):
mpiexec -hosts 1 usctap3825 15 \\fs1\correlatempi.exe cfg.xml in.h5 out.h5 debug
the program runs successfully.
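
For reference, the failing call in the stack amounts to a root-0 gather of roughly 98 million MPI_FLOATs per rank (just under 400 MB per sender). A minimal sketch of that pattern, with the count taken from the error stack and everything else assumed (this is not the actual correlatempi.exe code), looks like this:

/* Sketch of the gather pattern implied by the error stack:
 * ~98M MPI_FLOATs per rank, gathered to root 0.
 * Names and sizes are assumptions, not taken from correlatempi.exe. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int scount = 97787376;                  /* per-rank count from the stack */
    float *sbuf = malloc((size_t)scount * sizeof(float));

    float *rbuf = NULL;
    int *rcnts = NULL, *displs = NULL;
    if (rank == 0) {
        rcnts  = malloc(nprocs * sizeof(int));
        displs = malloc(nprocs * sizeof(int));
        for (int i = 0; i < nprocs; i++) {
            rcnts[i]  = scount;
            displs[i] = i * scount;         /* ~400 MB per rank at the root; still fits in int for 16 ranks */
        }
        rbuf = malloc((size_t)nprocs * scount * sizeof(float));
    }

    MPI_Gatherv(sbuf, scount, MPI_FLOAT,
                rbuf, rcnts, displs, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Each non-root rank therefore pushes a single ~400 MB message toward the root over TCP, which seems to be where the 10055 (insufficient socket buffer space) shows up on 1.3.2p1.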

Additionally, both of the above commands run successfully on mpich2 v1.2.1, although I had to rebuild against mpich2 1.2.1 and used different servers configured exactly like the original servers noted above (e.g., like usctap3825: 16 cores, 64GB memory, etc.).

I noticed that a similar error was fixed in mpich2-1.2 (http://trac.mcs.anl.gov/projects/mpich2/ticket/895). Could this have regressed? Thanks in advance.
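
For completeness, the only stopgap I can think of (just a sketch; the helper name, the chunk size, and the equal-counts assumption are mine, not anything correlatempi.exe does today or anything from the ticket) is to split the one large gather into fixed-size slices so that no single message comes anywhere near the Winsock buffer limit:

/* Sketch of a chunked gather: break one huge MPI_Gatherv into fixed-size
 * slices so each message is ~16 MB instead of several hundred MB.
 * All ranks are assumed to contribute the same total count; at the root,
 * rank i's data ends up at rbuf + (size_t)i * scount (displacements are
 * in elements of the receive type). */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK_ELEMS (4 * 1024 * 1024)   /* 4M floats (~16 MB) per slice */

static void chunked_gather(float *sbuf, int scount, float *rbuf,
                           int root, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int *rcnts = NULL, *displs = NULL;
    if (rank == root) {
        rcnts  = malloc(nprocs * sizeof(int));
        displs = malloc(nprocs * sizeof(int));
    }

    for (int off = 0; off < scount; off += CHUNK_ELEMS) {
        int n = (scount - off < CHUNK_ELEMS) ? (scount - off) : CHUNK_ELEMS;

        if (rank == root) {
            for (int i = 0; i < nprocs; i++) {
                rcnts[i]  = n;
                displs[i] = i * scount + off;   /* this slice of rank i's block */
            }
        }
        MPI_Gatherv(sbuf + off, n, MPI_FLOAT,
                    rbuf, rcnts, displs, MPI_FLOAT, root, comm);
    }

    if (rank == root) {
        free(rcnts);
        free(displs);
    }
}

The total traffic is the same, of course, but nothing larger than one slice is ever queued on a socket at a time, which I would expect to stay below whatever triggers errno 10055.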

System Configuration: 
-------------------------------- 
Server1 (usctap3825) 
------- 
a. Windows Server 2003, 64-bit, SP2 
b. 16 cores/processors 
c. 64GB memory 
d. Physical computer 
Server2 (usctap3488) 
------- 
a. Windows Server 2003, 64-bit, SP2 
b. 2 cores/processors 
c. 8GB memory 
d. Virtual Machine 

cheers, roy 


error stack: 
---------------- 
Fatal error in PMPI_Gatherv: Other MPI error, error stack: 
PMPI_Gatherv(398)................................: MPI_Gatherv failed(sbuf=000000003AA30040, scount=97787376, MPI_FLOAT, rbuf=0000000180040040, rcnts=000000000D6515E0, displs=000000000D651630, MPI_FLOAT, root=0, MPI_COMM_WORLD) failed
MPIR_Gatherv_impl(210)...........................: 
MPIR_Gatherv(118)................................: 
MPIC_Waitall_ft(852).............................: 
MPIR_Waitall_impl(121)...........................: 
MPIDI_CH3I_Progress(353).........................: 
MPID_nem_mpich2_blocking_recv(905)...............: 
MPID_nem_newtcp_module_poll(37)..................: 
MPID_nem_newtcp_module_connpoll(2669)............: 
MPID_nem_newtcp_module_recv_success_handler(2364): 
MPID_nem_newtcp_module_post_readv_ex(330)........: 
MPIU_SOCKW_Readv_ex(392).........................: read from socket failed, An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. (errno 10055)


