<div dir="ltr"><div class="gmail_quote">On Tue, Feb 7, 2012 at 08:06, Derek Gaston <span dir="ltr"><<a href="mailto:friedmud@gmail.com">friedmud@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hello all,<div><br></div><div>I'm running some largish finite element calculations at the moment (50 Million to 400 Million DoFs on up to 10,000 processors) using a code based on PETSc (obviously!), and while most of the simulations are working well, every now and again I run into a hang in the setup phase of the simulation.</div>
<div><br></div><div>I've attached GDB several times and it always seems to be hanging in PetscLayoutSetUp() during matrix creation. Here is the top of a stack trace showing what I mean:</div><div><br></div><div><div>
#0 0x00002aac9d86cef2 in opal_progress () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libopen-pal.so.0</div>
<div>#1 0x00002aac9d16a0c4 in ompi_request_default_wait_all () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0</div><div>#2 0x00002aac9d1da9ee in ompi_coll_tuned_sendrecv_actual () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0</div>
<div>#3 0x00002aac9d1e2716 in ompi_coll_tuned_allgather_intra_bruck () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0</div><div>#4 0x00002aac9d1db439 in ompi_coll_tuned_allgather_intra_dec_fixed () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0</div>
<div>#5 0x00002aac9d1827e6 in PMPI_Allgather () from /apps/local/openmpi/1.4.4/intel-12.1.1/opt/lib/libmpi.so.0</div><div>#6 0x0000000000508184 in PetscLayoutSetUp ()</div><div>#7 0x00000000005b9f39 in MatMPIAIJSetPreallocation_MPIAIJ ()</div>
<div>#8 0x00000000005c1317 in MatCreateMPIAIJ ()</div></div></blockquote><div><br></div><div>Are _all_ the processes making it here?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><br></div><div>As you can see, I'm currently using OpenMPI (even though I do have access to others) along with the Intel compiler (this is mostly C++ code). This problem doesn't show up on smaller problems (we run TONS of runs all the time in the 10,000-5,000,000 DoF range on 1-3000 procs) and only seems to come up on these larger runs.</div>
<div><br></div><div>I'm starting to suspect that it's an OpenMPI issue. Has anyone seen anything like this before?</div><div><br></div><div>Here are some specs for my current environment:</div><div><br></div><div>
PETSc 3.1-p8 (I know, I know....)</div><div>OpenMPI 1.4.4</div><div>Intel compilers 12.1.1</div><div>Modified Red Hat with a 2.6.18 kernel</div><div>QDR InfiniBand</div><div><br></div><div>Thanks for any help!</div><div><br>
</div><font color="#888888"><div>Derek</div>
</font></blockquote></div><br></div>
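<div><br></div><div>One way to answer the question in the reply ("are _all_ the processes making it here?") is to grab a backtrace from every rank (e.g. <code>gdb -batch -p &lt;pid&gt; -ex bt</code> for each application process on each node) and check which ranks are not inside the collective. Below is a minimal sketch of the checking step, assuming one backtrace string per rank has already been collected; the helper name is hypothetical, not part of PETSc or Open MPI.</div>

```python
# Sketch: given one gdb backtrace string per rank (collected e.g. via
# "gdb -batch -p <pid> -ex bt"), list the ranks that have NOT reached
# the hanging collective. Helper name and marker default are assumptions.
def ranks_not_in_collective(backtraces, marker="PMPI_Allgather"):
    """Return indices of ranks whose backtrace does not contain `marker`."""
    return [rank for rank, bt in enumerate(backtraces) if marker not in bt]

# Toy usage: rank 2 is still in assembly, so it never entered the Allgather.
traces = [
    "#5 PMPI_Allgather\n#6 PetscLayoutSetUp",
    "#5 PMPI_Allgather\n#6 PetscLayoutSetUp",
    "#3 MatSetValues\n#4 user_assembly_loop",
]
print(ranks_not_in_collective(traces))  # [2]
```

<div>If some ranks show up in this list, the hang is a mismatched collective on the application side (one or more ranks never called MatCreateMPIAIJ), not an MPI bug; if the list is empty, suspicion shifts back toward the MPI layer.</div>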