[MPICH2-dev] RE: problem with multithreaded version of sock mpich2-1.06 under Windows

Ryzhykh, Alexey alexey.ryzhykh at intel.com
Mon Sep 24 08:51:18 CDT 2007


I would like to add that we fixed one problem related to thread
synchronization (we added a missing call to MPID_Thread_mutex_lock).

See attached diff file for mpich2-1.0.6/src/mpid/common/sock/iocp/sock.c
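
The attached diff is not reproduced here. As a generic illustration only
of the class of bug mentioned above (shared state touched without the
matching lock call), here is a minimal sketch. It uses POSIX threads as a
portable stand-in, not the MPID_Thread_* macros or the actual sock.c
code; shared_state_t, worker and run_workers are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state, standing in for data updated by several threads. */
typedef struct {
    pthread_mutex_t lock;
    long posted_ops;
} shared_state_t;

typedef struct {
    shared_state_t *state;
    int iterations;
} worker_arg_t;

/* Each worker increments the counter under the mutex. */
static void *worker(void *p)
{
    worker_arg_t *arg = (worker_arg_t *)p;
    assert(arg != NULL && arg->state != NULL);
    for (int i = 0; i < arg->iterations; i++) {
        /* Leaving out this lock call is the kind of omission described above. */
        pthread_mutex_lock(&arg->state->lock);
        arg->state->posted_ops++;
        pthread_mutex_unlock(&arg->state->lock);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter, or -1 on failure. */
long run_workers(int nthreads, int iterations)
{
    enum { MAX_THREADS = 64 };
    pthread_t tids[MAX_THREADS];
    shared_state_t state;
    worker_arg_t arg;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS || iterations < 0)
        return -1;
    if (pthread_mutex_init(&state.lock, NULL) != 0)
        return -1;
    state.posted_ops = 0;
    arg.state = &state;
    arg.iterations = iterations;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, worker, &arg) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&state.lock);
    return (started == nthreads) ? state.posted_ops : -1;
}
```

With the lock in place, run_workers(8, 100000) returns 800000. Without
it the increments race and the total would typically come up short, and
only intermittently, which is why such bugs tend to show up under stress
rather than in simple tests.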

However, it does not solve all of the problems.

Regards,

Alexey

 

________________________________

From: Ryzhykh, Alexey 
Sent: Monday, September 24, 2007 5:42 PM
To: 'mpich2-dev at mcs.anl.gov'; 'Rajeev Thakur'
Cc: Voronov, German; Supalov, Alexander; Truschin, Vladimir; Yulov,
Dmitry
Subject: problem with multithreaded version of sock mpich2-1.06 under
Windows

 

Hi everybody,

We have run into problems using the multithreaded version of
mpich2-1.0.6 built with the ch3:sock channel under Windows IA32.

The same problems may exist on the other Intel platforms under Windows
(Intel 64 and IA64), but I was able to build a working Windows
mpich2-1.0.6 only for IA32.

Simple MT tests such as mpich2/threaded_sr work fine, but under stress
testing we see problems.

For our stress testing I used a special version of IMB that supports
running several threads.

I got intermittent failures running 8 processes on 4 nodes with 8
threads.

Sometimes the test finishes successfully, sometimes it hangs and
sometimes the following error appears:

 

 

job aborted:
rank: node: exit code[: error message]
0: svsmpiw03: 1
1: svsmpiw03: 1: Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............................: MPI_Recv(buf=03BC0040, count=262144, MPI_BYTE, src=2, tag=MPI_ANY_TAG, comm=0x84000007, status=01B4FE78) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Sock_wait(2602).....................: The specified network name is no longer available. (errno 64)
2: svsmpiw04: 1: Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(258)..........................: MPI_Waitall(count=2, req_array=01B4
E68, status_array=01B4FE78) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Sock_wait(2464).....................: Unable to re-post an aborted readv operation
MPIDU_Sock_post_readv(1655)...............: An existing connection was forcibly closed by the remote host. (errno 10054)
3: svsmpiw04: 1: Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............................: MPI_Recv(buf=03BC0040, count=262144, MPI_BYTE, src=2, tag=MPI_ANY_TAG, comm=0x84000005, status=01B4FE78) failed
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Sock_wait(2602).....................: The specified network name is no longer available. (errno 64)
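
For reference, the two errno values in the error stack above appear to
be Windows error codes: their message texts match system error 64
(ERROR_NETNAME_DELETED) and Winsock error 10054 (WSAECONNRESET). A
minimal sketch for pulling the code out of such a log line and naming it
follows. The table covers only these two codes, and win_error_symbol and
parse_errno_field are hypothetical helper names, not part of MPICH2.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup table; only the two codes from the error stack are listed. */
typedef struct {
    int code;
    const char *symbol;
} win_error_name_t;

static const win_error_name_t known_errors[] = {
    { 64,    "ERROR_NETNAME_DELETED" }, /* "The specified network name is no longer available." */
    { 10054, "WSAECONNRESET" },         /* "An existing connection was forcibly closed by the remote host." */
};

/* Returns the symbolic name for a code in the table, or "UNKNOWN" otherwise. */
const char *win_error_symbol(int code)
{
    size_t n = sizeof(known_errors) / sizeof(known_errors[0]);
    for (size_t i = 0; i < n; i++) {
        if (known_errors[i].code == code)
            return known_errors[i].symbol;
    }
    return "UNKNOWN";
}

/* Extracts N from the first "(errno N)" in a log line; returns -1 if absent. */
int parse_errno_field(const char *line)
{
    assert(line != NULL);
    const char *p = strstr(line, "(errno ");
    int code = -1;
    if (p != NULL && sscanf(p, "(errno %d)", &code) == 1)
        return code;
    return -1;
}
```

Both codes indicate that the connection was dropped from the remote
side, so the ranks reporting them may be seeing a consequence of a
failure in another process rather than its cause.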

 

 

Could you please help to solve the problem?

I can provide this MT IMB in a separate email. I can't reproduce the
problem with small test cases.

 

With best regards,

Alexey Ryzhykh,

---

Intel, Sarov

 

 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sock.c.diff
Type: application/octet-stream
Size: 1093 bytes
Desc: sock.c.diff
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20070924/698fb738/attachment.obj>

