<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hello<div><br><div><div>On 27 Dec 2011, at 06:52, Pavan Balaji wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><span class="Apple-style-span" style="border-collapse: separate; font-family: Courier; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; "><br>Looks like the shared memory is bombing out. Can you run mpiexec with the -verbose option and also send us the machine file that you are using (or is it all on a single node)?<br><br>-- Pavan<br></span></blockquote></div><div><br></div><div>Another test (which still shows the same failure)</div><div> 1/ after removing the limits on the Linux machine (SL5, Linux 2.6.x)</div><div><div style="font-size: 13px; "><i> >limit</i></div><div style="font-size: 13px; "><i>cputime unlimited</i></div><div style="font-size: 13px; "><i>filesize unlimited</i></div><div style="font-size: 13px; "><i>datasize unlimited</i></div><div style="font-size: 13px; "><i>stacksize unlimited</i></div><div style="font-size: 13px; "><i>coredumpsize unlimited</i></div><div style="font-size: 13px; "><i>memoryuse unlimited</i></div><div style="font-size: 13px; "><i>vmemoryuse unlimited</i></div><div style="font-size: 13px; "><i>descriptors 1000000 </i></div><div style="font-size: 13px; "><i>memorylocked unlimited</i></div><div style="font-size: 13px; "><i>maxproc 409600 </i></div><div style="font-size: 13px; "><i><br></i></div><div style="font-size: 13px; "><div><i>>more /proc/sys/kernel/shmall</i></div><div><i>8388608000</i></div></div><div style="font-size: 13px; 
"><i><br></i></div><div style="font-size: 17px; "> 2/ after increasing <span class="Apple-style-span" style="font-size: 14px; "><i>FD_SETSIZE </i></span>and recompiling mpich2 1.4.1p1</div><div style="font-size: 13px; "><i><br></i></div><div style="font-size: 16px; "><div style="font-size: 13px; "><i>>grep -E "#define\W+__FD_SETSIZE" /usr/include/*.h /usr/include/*/*.h</i></div><div style="font-size: 13px; "><i>/usr/include/bits/typesizes.h:#define</i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>__FD_SETSIZE 8192</i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span></div><div><i style="font-size: 13px; ">/usr/include/linux/posix_types.h:#define __FD_SETSIZE</i><span class="Apple-tab-span" style="white-space: pre; font-size: 13px; "><i>        </i></span><i style="font-size: 13px; "> 8192</i><span class="Apple-tab-span" style="white-space: pre; ">        </span></div><div><span class="Apple-style-span" style="white-space: pre;"><br></span></div><div><span class="Apple-tab-span" style="white-space: pre; "><div style="white-space: normal; "><div><div><span class="Apple-style-span" style="font-size: 18px; "><font class="Apple-style-span" size="4"><span class="Apple-style-span" style="font-size: 16px; "><div><div style="font-size: 13px; "><span class="Apple-style-span" style="font-size: 16px; ">I still get the same problem when trying to run a basic program with more than ~150 tasks (here, 170 tasks)</span></div></div><div><div><br></div><div><div style="font-size: 13px; "><i>>mpich2version</i></div><div style="font-size: 13px; "><i>MPICH2 Version: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>1.4.1p1</i></div><div style="font-size: 13px; "><i>MPICH2 Release date:</i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>Thu Sep 1 13:53:02 CDT 2011</i></div><div style="font-size: 13px; "><i>MPICH2 Device: </i><span class="Apple-tab-span" 
style="white-space: pre; "><i>        </i></span><i>ch3:nemesis</i></div><div style="font-size: 13px; "><i>MPICH2 configure: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>--prefix=//scratch/BC/mpich2-1.4</i></div><div style="font-size: 13px; "><i>MPICH2 CC: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>/usr/bin/gcc -m64 -O2</i></div><div style="font-size: 13px; "><i>MPICH2 CXX: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>c++ -m64 -O2</i></div><div style="font-size: 13px; "><i>MPICH2 F77: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>/usr/bin/f77 -O2</i></div><div style="font-size: 13px; "><i>MPICH2 FC: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>f95 </i></div></div></div><div style="font-size: 13px; "><i><br></i></div><div><div><div><i>>mpiexec -np 170 bin/advance_test</i></div><div><i>Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0</i></div><div><i>internal ABORT - process 0</i></div></div><div><br></div></div><div><br></div></span></font></span></div></div></div></span></div></div></div><div>Another interesting point is that the same basic code runs without any failure with an older release of mpich2 (<span class="Apple-style-span" style="font-size: 16px; ">1.0.8p1, using the mpd daemon, the default installation on our machines)</span>:</div><div><span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Courier; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: 
auto; -webkit-text-stroke-width: 0px; "><div><div><div><div><div><br></div><div style="font-size: 13px; "><i>>mpich2version </i></div><div style="font-size: 13px; "><i>MPICH2 Version: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>1.0.8p1</i></div><div style="font-size: 13px; "><i>MPICH2 Release date:</i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>Unknown, built on Tue Apr 21 13:52:10 CEST 2009</i></div><div style="font-size: 13px; "><i>MPICH2 Device: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>ch3:sock</i></div><div style="font-size: 13px; "><i>MPICH2 configure: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>-prefix=/usr/local/mpich2</i></div><div style="font-size: 13px; "><i>MPICH2 CC: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>gcc -O2</i></div><div style="font-size: 13px; "><i>MPICH2 CXX: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>c++ -O2</i></div><div style="font-size: 13px; "><i>MPICH2 F77: </i><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span><i>g77 -O2</i></div><div style="font-size: 15px; "><span class="Apple-style-span" style="font-size: 13px; "><i>MPICH2 F90: </i></span><span class="Apple-style-span" style="font-size: 13px; "><span class="Apple-tab-span" style="white-space: pre; "><i>        </i></span></span><span class="Apple-style-span" style="font-size: 13px; "><i>f95 -O2</i></span></div><div><font class="Apple-style-span" size="4"><span class="Apple-style-span" style="font-size: 16px; "><span class="Apple-style-span" style="font-size: 18px; "><br></span></span></font></div><div><div style="font-size: 16px; "><div><div><div><div style="font-size: 15px; "><i>>mpicc -O2 -o bin/advance_test advance_test.c</i></div></div></div><div style="font-size: 15px; "></div><div style="font-size: 15px; 
"><i>>mpdboot --ncpus=170</i></div><div style="font-size: 15px; "><i>>mpiexec -np 170 bin/advance_test | more</i></div></div></div><div style="font-size: 16px; "><div style="font-size: 15px; "><i>Running 170 tasks </i></div><div style="font-size: 15px; "><i>In slave tasks </i></div><div style="font-size: 15px; "><i>In slave tasks </i></div><div style="font-size: 15px; "><i>In slave tasks </i></div><div style="font-size: 15px; "><i>In slave tasks </i></div><div style="font-size: 15px; "><i>In slave tasks </i></div><div style="font-size: 15px; "><i>In slave tasks </i></div><div style="font-size: 15px; "><i>…</i></div><div style="font-size: 15px; "><i>mpdallexit</i></div></div></div><div><br></div><div>The test code runs without failure </div><div><br></div><div><div>The reason for this test is that, after installing mpich2 1.4.1p1</div><div>and running jobs through GridEngine, everything works fine when jobs request a small number of tasks,</div><div>but I get failures as the number of tasks increases</div><div>(for example, with 32 tasks 100% of jobs pass; with 64 tasks, 70% of jobs fail)</div><div><br></div><div>So at the moment, I can't provide MPICH2 to our users</div><div><br></div><div><div>Thank you for any help</div></div><div><br></div><div> </div></div><div><br></div><div>PS: the basic test code</div><div> </div><div><div style="font-size: 13px; "> if (MPI_Init(&argc, &argv) != MPI_SUCCESS ) {</div><div style="font-size: 13px; "> printf("Error calling MPI_Init !!, exiting \n") ; fflush(stdout);</div><div style="font-size: 13px; "> return(1);</div><div style="font-size: 13px; "> }</div><div style="font-size: 13px; "><br></div><div style="font-size: 13px; "> int rank;</div><div style="font-size: 13px; "> if ( MPI_Comm_rank(MPI_COMM_WORLD, 
&rank)!= MPI_SUCCESS ) {</div><div style="font-size: 13px; "> printf("Error calling MPI_Comm_rank !!, exiting \n") ; fflush(stdout);</div><div style="font-size: 13px; "> MPI_Abort(MPI_COMM_WORLD, 1);</div><div style="font-size: 13px; "> return(1);</div><div style="font-size: 13px; "> }</div><div style="font-size: 13px; "> </div><div style="font-size: 13px; "> if (rank == 0) {</div><div style="font-size: 13px; "> int nprocs;</div><div style="font-size: 13px; "> if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs)!= MPI_SUCCESS ) {</div><div style="font-size: 13px; "> printf("Error calling MPI_Comm_size !!, exiting \n") ; fflush(stdout);</div><div style="font-size: 13px; "> MPI_Abort(MPI_COMM_WORLD, 1);</div><div style="font-size: 13px; "> return(1);</div><div style="font-size: 13px; "> }</div><div style="font-size: 13px; "> </div><div style="font-size: 13px; "> printf("Running %d tasks \n", nprocs) ; fflush(stdout);</div><div style="font-size: 13px; "> MPI_Finalize(); </div><div style="font-size: 13px; "> return(0); </div><div style="font-size: 13px; "> } else {</div><div style="font-size: 13px; "> printf("In slave tasks \n") ; fflush(stdout);</div><div style="font-size: 13px; "> sleep(1);</div><div style="font-size: 13px; "> // MPI_Finalize(); // mandatory if <= mpich2-1.2 ?</div><div style="font-size: 13px; "> return(0);</div><div style="font-size: 13px; "> }</div></div><div><br></div><div>---------------<br>Bernard CHAMBON<br>IN2P3 / CNRS<br>04 72 69 42 18<br></div></div></div></div></div></span>
</div>
<br></div></body></html>