[mpich-discuss] MPICH2 (or MPI_Init) limitation | scalability
Bernard Chambon
bernard.chambon at cc.in2p3.fr
Fri Jan 20 04:18:17 CST 2012
Hi,
On 19 Jan 2012, at 21:54, Darius Buntinas wrote:
>
> You were right about the bogus characters after "description#". Try applying this patch to the MPICH2 source, then do a "make clean" followed by "make" and "make install", then recompile your app and see if it helps.
Good news, it works! Thank you very, very much.
Now I can run a test between 2 machines with a high number of tasks, for example with this command:
>mpiexec -iface eth2 -f /tmp/machines -n 255 bin/advance_test
bchambon at ccwpge0062's password:
I am there
Running MPI version 2, subversion 2
ref_message is ready
I am the master task 0 sur ccwpge0061, for 254 slaves tasks, we will exchange a buffer of 10 MB
slave number 1, iteration = 1
slave number 2, iteration = 1
slave number 3, iteration = 1
slave number 4, iteration = 1
slave number 5, iteration = 1
slave number 6, iteration = 1
slave number 7, iteration = 1
…
dstat output on the second machine (eth2: 10 Gb/s, roughly 1 GB/s):
> dstat -n -N eth0,eth2
--net/eth0- --net/eth2-
 recv  send: recv  send
 262B  134B:1049M 2427k
 402B  402B:1049M 2427k
 198B  884B:1046M 2422k
 436B  134B:1047M 2420k
 134B  134B:1041M 2406k
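
As a rough check of these figures, the eth2 receive rates can be compared with the nominal 10 Gb/s line rate. The sketch below assumes that the "M" suffix in dstat means MiB per second (1024 * 1024 bytes); if it means decimal megabytes instead, the result is a few percent lower. The names `recv_samples_m` and `utilisation` are illustrative only and do not come from the test program.

```python
# Sketch: compare the dstat receive rates on eth2 with the nominal line rate.
# Assumption: dstat's "M" suffix means MiB/s (1024 * 1024 bytes per second).

LINK_BITS_PER_S = 10e9   # nominal eth2 speed quoted in the post (10 Gb/s)
MIB = 1024 * 1024        # bytes per MiB (assumed unit for dstat's "M")

# eth2 "recv" column from the dstat sample above, in M per second
recv_samples_m = [1049, 1049, 1046, 1047, 1041]


def utilisation(recv_m):
    """Fraction of the nominal line rate used by a receive rate given in MiB/s."""
    return recv_m * MIB * 8 / LINK_BITS_PER_S


mean_recv_m = sum(recv_samples_m) / len(recv_samples_m)
print(f"mean recv on eth2: {mean_recv_m:.1f} M/s")
print(f"link utilisation:  {utilisation(mean_recv_m):.1%}")
```

Under that assumption the link is close to saturated, so the 10 MB exchanges appear to be limited by the network rather than by the MPI layer.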
The next step is to test jobs through GridEngine + hydra.
It seems to work, but sometimes I get a timeout (*) such as:
error: got no connection within 60 seconds. "Timeout occured while waiting for connection"
Is there a timeout of this kind related to hydra that I could increase?
(I am thinking of SMPD_SHORT_TIMEOUT (=60) in src/pm/smpd/smpd.h; is that the one, or not?)
Thank you again for taking the time to help with my problems.
Regards
(*) Probably because the tests run on worker nodes that are also running other jobs (production farm).
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18