[mpich-discuss] MPICH2 (or MPI_Init) limitation | scalability

Fri Jan 20 04:18:17 CST 2012

Hi,

On 19 Jan 2012, at 21:54, Darius Buntinas wrote:

> 
> You were right about the bogus characters after "description#".  Try applying this patch to the MPICH2 source, then do a "make clean" followed by "make" and "make install", then recompile your app and see if it helps.

Good news, it works! Thank you very much.
Now I can run a test between 2 machines with a high number of tasks, for example:

>mpiexec -iface eth2 -f /tmp/machines -n 255 bin/advance_test
bchambon at ccwpge0062's password: 
I am there 
Running MPI version 2, subversion 2 
ref_message is ready 
I am the master task 0 sur ccwpge0061, for 254 slaves tasks, we will exchange a buffer of 10 MB

slave number 1, iteration = 1
slave number 2, iteration = 1
slave number 3, iteration = 1
slave number 4, iteration = 1
slave number 5, iteration = 1
slave number 6, iteration = 1
slave number 7, iteration = 1
…

dstat output on the second machine (eth2: 10 Gb/s, i.e. roughly 1 GB/s):

> dstat -n -N eth0,eth2
--net/eth0- --net/eth2-
 recv  send: recv  send
 262B  134B:1049M 2427k
 402B  402B:1049M 2427k
 198B  884B:1046M 2422k
 436B  134B:1047M 2420k
 134B  134B:1041M 2406k
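
As a rough check of the figures above, the eth2 receive rate can be compared with the nominal link rate. This is only a sketch: it assumes that dstat's "M" suffix means MiB (1024 * 1024 bytes) per second, and the variable and function names are made up for illustration.

```python
# Rough link-utilisation check for the dstat samples above.
# Assumption: dstat's "M" suffix means MiB (1024 * 1024 bytes) per second.

LINK_BITS_PER_S = 10e9   # eth2 nominal rate: 10 Gb/s
MIB = 1024 * 1024        # bytes per MiB

# eth2 "recv" column from the dstat output, in MiB/s
recv_mib_per_s = [1049, 1049, 1046, 1047, 1041]


def utilisation(mib_per_s, link_bits_per_s=LINK_BITS_PER_S):
    """Fraction of the nominal link rate used by a receive rate in MiB/s."""
    return mib_per_s * MIB * 8 / link_bits_per_s


link_bytes_per_s = LINK_BITS_PER_S / 8
mean_recv = sum(recv_mib_per_s) / len(recv_mib_per_s)
mean_util = utilisation(mean_recv)

print(f"nominal link rate: {link_bytes_per_s / 1e9:.2f} GB/s")
print(f"mean recv rate:    {mean_recv:.1f} MiB/s")
print(f"mean utilisation:  {mean_util:.1%}")
```

Under that assumption the five samples average out to roughly 88% of the nominal 10 Gb/s, so the link is close to saturated during the test.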

Next step: testing jobs through GridEngine + hydra.
It seems to work, but sometimes I get a timeout (*) such as:
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"

Is there a hydra-related timeout that I could increase?
(I am thinking of SMPD_SHORT_TIMEOUT (=60) in src/pm/smpd/smpd.h; is that the relevant one or not?)

Thank you again for taking the time to help with my problems.
Regards

(*) Probably because the tests ran on worker nodes that were also running other jobs (production farm).
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18
