[mpich-discuss] Configuration considerations for 1500+ nodes
Hiatt, Dave M
dave.m.hiatt at citi.com
Mon Oct 19 15:51:12 CDT 2009
I'm just getting started trying to scale up a working app. It runs on 400 and 800 nodes, but when I scale to 1500 and above I get "Connection refused" errors as the true computation starts. I'm running RH 5.3, and I'm wondering if this might be some kind of OS resource limitation that MPICH2 is running into while trying to open sockets for each of the nodes.
The profile of operation is that each node sends a very small (44 bytes) message on startup as the app "takes roll", as it were. Once the number of nodes gets large enough, I start seeing this kind of behavior. Up to now I've been using the defaults in my build of MPICH2 (I'm running 1.07). There are a number of possible configuration parameters I could change in MPI. I'm using CH3 and was thinking of changing that to Nemesis, but I'm more suspicious of the OS on this particular error because the compute nodes are having their connections refused rather than failing in some other way, so it really sounds like an OS resource issue.
My guess is that I need to configure my master to cope with large numbers of connections, but I was wondering if anyone had suggestions on where to start.
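
As a starting point for that kind of check, here is a minimal sketch (not a diagnosis) that reports two Linux limits a master accepting many connections could plausibly run into: the per-process open file descriptor limit and the kernel cap on the listen() backlog. It assumes the master holds roughly one socket per node, which is only my guess about how the channel behaves. The helper names and the reserve of 64 descriptors are hypothetical, chosen for illustration. Whether either limit actually causes the "Connection refused" errors is not established here.

```python
import resource

# Linux-specific location of the kernel cap on listen() backlog.
SOMAXCONN_PATH = "/proc/sys/net/core/somaxconn"


def fd_limits():
    """Return the (soft, hard) open-file-descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def listen_backlog_cap(path=SOMAXCONN_PATH):
    """Return the kernel backlog cap as an int, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def headroom(n_peers, soft_limit, reserved=64):
    """Descriptors left over after one socket per peer plus a reserve.

    ASSUMPTION: one socket per peer on the master; `reserved` is a
    hypothetical allowance for stdio, log files, and the like.
    Returns None when the soft limit is unlimited.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit - reserved - n_peers


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("open files: soft=%s hard=%s" % (soft, hard))
    print("listen backlog cap:", listen_backlog_cap())
    for n in (400, 800, 1500):
        print("nodes=%d headroom=%s" % (n, headroom(n, soft)))
```

With a soft limit of 1024, headroom(800, 1024) is 160 and headroom(1500, 1024) is -540, so under these assumptions the descriptor limit would be crossed somewhere between the node counts that work and the ones that fail. Run it on the master under the same account and shell environment that launches the job, since the limits are per process.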
If you lived here you'd be home by now
Manager, Market Risk Systems Integration
1000 Technology Dr.
Third Floor East, M.S. 55
O'Fallon, MO 63368-2240
Email: Dave.M.Hiatt at citigroup.com