<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hello,<div><br></div><div>We have just installed mpich2-1.3 on a cluster of 18 nodes. The nodes are all running Fedora 13</div><div>and consist of 64-bit HP machines of various vintages and numbers of cores (from 2 to 12 cores per node).</div><div><br></div><div>I have created a hostfile (named mpi.machinefile) with the following entries:</div><div><br></div><div><font class="Apple-style-span" face="'Courier New'">% cat mpi.machinefile</font><br><b>aki18:4 <br>aki17:4 <br>aki16:4 <br>aki15:4 <br>aki14:1 <br>aki13:1 <br>aki12:1 <br>aki11:1 <br>aki10:1 <br>aki09:1 <br>aki08:1 <br>aki07:1 <br>aki06:1 <br>aki05:1 <br>aki04:1 <br>aki03:1 <br>aki02:1 <br>aki01:1 </b></div><div><br></div><div>where my nodes are named aki01 ... aki18 (also resolved as <a href="http://aki01.urscorp.com">aki01.urscorp.com</a> ... <a href="http://aki18.urscorp.com">aki18.urscorp.com</a>).</div><div><br></div><div>Executing the following appears to work correctly:</div><div><br></div><div><font class="Apple-style-span" face="'Courier New'">% mpiexec -f mpi.machinefile -n 12 /opt/mpich2-1.3/examples/cpi</font></div><div><br></div><div>and gives the output:</div><div><br></div><div><b>Process 9 of 12 is on <a href="http://aki16.urscorp.com">aki16.urscorp.com</a><br>Process 10 of 12 is on <b><a href="http://aki16.urscorp.com">aki16.urscorp.com</a></b> <br>Process 11 of 12 is on <b><a href="http://aki16.urscorp.com">aki16.urscorp.com</a></b> <br>Process 8 of 12 is on <b><a href="http://aki16.urscorp.com">aki16.urscorp.com</a></b> <br>Process 6 of 12 is on <b><a href="http://aki17.urscorp.com">aki17.urscorp.com</a></b> <br>Process 4 of 12 is on <b><a href="http://aki17.urscorp.com">aki17.urscorp.com</a></b> <br>Process 5 of 12 is on <b><a href="http://aki17.urscorp.com">aki17.urscorp.com</a></b> <br>Process 7 of 12 is on <b><a 
href="http://aki17.urscorp.com">aki17.urscorp.com</a></b><br>Process 0 of 12 is on <b><a href="http://aki18.urscorp.com">aki18.urscorp.com</a></b> <br>Process 1 of 12 is on <b><a href="http://aki18.urscorp.com">aki18.urscorp.com</a></b><br>Process 2 of 12 is on <b><a href="http://aki18.urscorp.com">aki18.urscorp.com</a></b> <br>Process 3 of 12 is on <b><a href="http://aki18.urscorp.com">aki18.urscorp.com</a></b><br>pi is approximately 3.1415926544231256, Error is 0.0000000008333325 <br>wall clock time = 0.004010 </b></div><div><br></div><div><div><br></div><div>However, changing the requested number of CPUs to 17 causes a fatal error:</div><div><br></div><div><font class="Apple-style-span" face="'Courier New'">% mpiexec -f mpi.machinefile -n 17 /opt/mpich2-1.3/examples/cpi</font></div><div><br></div><div>and gives the output:</div></div><div><br></div><div><b>Fatal error in MPI_Init: Other MPI error, error stack:</b><b> </b></div><div><b>MPIR_Init_thread(385).................: </b><b><br></b><b>MPID_Init(135)........................: channel initialization failed</b><b> <br></b><b>MPIDI_CH3_Init(38)....................: </b><b><br></b><b>MPID_nem_init(196)....................: </b><b><br></b><b>MPIDI_CH3I_Seg_commit(366)............: </b><b><br></b><b>MPIU_SHMW_Hnd_deserialize(324)........: </b><b><br></b><b>MPIU_SHMW_Seg_open(863)...............: </b><b><br></b><b>MPIU_SHMW_Seg_create_attach_templ(637): open failed - No such file or directory</b><b> <br></b><b>APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)</b><b> </b></div><div><br></div><div><br></div><div><br></div><div>I also tried setting MPI_NO_LOCAL=1 but that did not help.</div><div><br></div><div>Any help you can provide is greatly appreciated.</div><div><br></div><div>Thanks,</div><div>Rob Graves</div><div>Research Geophysicist</div><div>US Geological Survey</div><div>Pasadena, CA</div></body></html>