[MPICH] MPICH2 1.05 MPI_Send & MPI_Recv dropping packages randomly

Rajeev Thakur thakur at mcs.anl.gov
Thu Jan 18 13:38:14 CST 2007


Cool. Why don't you use that for now until ssm is fixed? Also, we would like
to know how it performs for you compared with ch3:sock and ch3:ssm.
 
Rajeev
 


  _____  

From: chong tan [mailto:chong_guan_tan at yahoo.com] 
Sent: Thursday, January 18, 2007 1:28 PM
To: Rajeev Thakur
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [MPICH] MPICH2 1.05 MPI_Send & MPI_Recv dropping packages
randomly



The test also works with ch3:nemesis.
 
tan

 
----- Original Message ----
From: Rajeev Thakur <thakur at mcs.anl.gov>
To: chong tan <chong_guan_tan at yahoo.com>
Cc: mpich-discuss at mcs.anl.gov
Sent: Thursday, January 18, 2007 9:43:37 AM
Subject: RE: [MPICH] MPICH2 1.05 MPI_Send & MPI_Recv dropping packages
randomly


Can you try using the Nemesis channel? Configure with
--with-device=ch3:nemesis. That will use shared memory within a node and TCP
across nodes and should actually perform better than ssm.
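
For reference, a minimal sketch of that configure invocation, reusing the
compiler flags and install prefix from the ch3:ssm build line quoted later in
this thread (adjust the prefix to your own install directory):

    setenv CFLAGS "-m32 -O2"
    setenv CC gcc
    ./configure -prefix=/u/cgtan/my_release_dir --with-device=ch3:nemesis --enable-fast |& tee configure.log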
 
Rajeev
 


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of chong tan
Sent: Thursday, January 18, 2007 11:11 AM
To: William Gropp
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [MPICH] MPICH2 1.05 MPI_Send & MPI_Recv dropping packages
randomly


All the messages are short, the shortest being 3 integers (32-bit), the
longest 9 integers.
 
I can't send you the code per company policy.  There are about 3 million
lines of C, C++ and Tcl.  MPI is used in an isolated part of the code.
 
I will try sock; sock runs almost 11X slower on this particular machine.
On 1.0.4p1, the overhead of ssm was 50 sec, and sock's overhead was 520 sec on
the failing test.
 
tan


 
----- Original Message ----
From: William Gropp <gropp at mcs.anl.gov>
To: chong tan <chong_guan_tan at yahoo.com>
Cc: mpich-discuss at mcs.anl.gov
Sent: Wednesday, January 17, 2007 6:37:19 PM
Subject: Re: [MPICH] MPICH2 1.05 MPI_Send & MPI_Recv dropping packages
randomly

Can you send us the test case?  Does it fail with the ch3:sock device?  Are
the messages short or long?   

Bill

On Jan 17, 2007, at 7:06 PM, chong tan wrote:



OS:  RedHat Enterprise 4, 2.6.9-42.ELsmp
CPU: 4 dual-core Intel
 
The package was built with:
setenv CFLAGS "-m32 -O2"
setenv CC gcc
./configure -prefix=/u/cgtan/my_release_dir --with-device=ch3:ssm --enable-fast |& tee configure.log

-----
The test program runs 5 processes, one master and 4 slaves.  The master always
receives from the slaves and then sends to all of them.  Randomly, an MPI_Send
performed in the master will complete, but the corresponding MPI_Recv in the
targeted slave does not complete, and the whole thing hangs.
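
For illustration only (the real code can't be posted), a minimal C sketch of
the communication pattern just described: one master, four slaves, short
integer messages.  The tag, the 9-int message size, and the loop bound here
are made-up stand-ins, not the actual application values.

    /* Hypothetical reproduction of the pattern described above:
       rank 0 is the master, ranks 1..4 are slaves. */
    #include <mpi.h>

    #define TAG 0

    int main(int argc, char **argv)
    {
        int rank, size, i, iter;
        int msg[9] = {0};                 /* real messages are 3..9 ints */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (iter = 0; iter < 1000000; iter++) {
            if (rank == 0) {
                /* master: receive one message from every slave ... */
                for (i = 1; i < size; i++)
                    MPI_Recv(msg, 9, MPI_INT, i, TAG, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                /* ... then send one message to every slave */
                for (i = 1; i < size; i++)
                    MPI_Send(msg, 9, MPI_INT, i, TAG, MPI_COMM_WORLD);
            } else {
                /* slave: send to the master, then wait for the reply;
                   the reported hang is an MPI_Recv here that never
                   completes even though the master's MPI_Send returned */
                MPI_Send(msg, 9, MPI_INT, 0, TAG, MPI_COMM_WORLD);
                MPI_Recv(msg, 9, MPI_INT, 0, TAG, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }

        MPI_Finalize();
        return 0;
    }

Run with 5 processes (e.g. mpiexec -n 5) to get the one-master/four-slave
layout described above.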
 
I have a debugging mechanism that attaches a sequence id to all packages
sent.  The packages are dumped before and after each send and recv; a message
is also dumped for the pending recv.  The sequence ids traced OK all the
way to the lost package.
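
The tracing described could look roughly like the following sketch; carrying
the sequence id in the first integer of the message and dumping to stderr are
assumptions here, not necessarily how the actual instrumentation works:

    /* Hypothetical send/recv wrappers: attach a sequence id in msg[0] and
       dump each message around the send and the recv. */
    #include <mpi.h>
    #include <stdio.h>

    static int next_seq = 0;

    static void traced_send(int *msg, int n, int dest, int tag)
    {
        msg[0] = next_seq++;                      /* attach sequence id */
        fprintf(stderr, "sending seq=%d to %d\n", msg[0], dest);
        MPI_Send(msg, n, MPI_INT, dest, tag, MPI_COMM_WORLD);
        fprintf(stderr, "sent    seq=%d to %d\n", msg[0], dest);
    }

    static void traced_recv(int *msg, int n, int src, int tag)
    {
        fprintf(stderr, "recv pending from %d\n", src);
        MPI_Recv(msg, n, MPI_INT, src, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        fprintf(stderr, "received seq=%d from %d\n", msg[0], src);
    }

The last sequence id that shows up in a "sending" line but never in a matching
"received" line marks the lost message.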
 
The same code works fine with 1.0.4p1.  It has been tested on test cases
longer than 100 million send/recv sequences.  Any suggestions?
 
tan
 
