<span id="mailbox-conversation"><div>Thanks. I’ll start work on rolling back to another version of MPI</div></span><div class="mailbox_signature"><br></div>
<br><br><div class="gmail_quote"><p>On Fri, Feb 13, 2015 at 4:15 PM, Barry Smith <span dir="ltr"><<a href="mailto:bsmith@mcs.anl.gov" target="_blank">bsmith@mcs.anl.gov</a>></span> wrote:<br></p><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br>> On Feb 13, 2015, at 5:07 PM, Andrew Spott <andrew@spott.us> wrote:
>
> Is there a known workaround for this?

No. This should be reported as a show-stopper to the vendor who sold you the system and provided the software.

 Barry

 Note that the code in PETSc that "triggers" the hang has been there for at least 15 years and has never been problematic with earlier versions of OpenMPI or with any other MPI implementation. In the 1.8 series the OpenMPI developers got lazy: they made a non-MPI-standard assumption that MPI attribute destructors would never use MPI internally, and they ripped out all the old code that handled that case correctly because it was "too complicated".
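
To make that concrete, here is a minimal, self-contained sketch of the pattern in question: a communicator attribute delete callback that calls back into MPI. This is illustrative only (the keyvals and the callback are invented for the example; it is not PETSc's actual code). On an affected OpenMPI 1.8.x build the MPI_Comm_free() at the end can hang on the library's internal attribute lock, which is the same shape as the backtrace later in this thread.

    /* Illustrative sketch only (not PETSc source): a communicator attribute
     * delete callback that itself calls MPI.  The keyvals and names here are
     * invented for the example. */
    #include <mpi.h>
    #include <stdio.h>

    static int inner_keyval = MPI_KEYVAL_INVALID;
    static int outer_keyval = MPI_KEYVAL_INVALID;

    /* Delete callback for the outer keyval: it queries another attribute on
     * the same communicator, i.e. it re-enters MPI while the library is
     * already tearing down attributes inside MPI_Comm_free(). */
    static int del_outer(MPI_Comm comm, int keyval, void *attr_val, void *extra)
    {
      void *inner;
      int   flag;
      (void)keyval; (void)attr_val; (void)extra;
      MPI_Comm_get_attr(comm, inner_keyval, &inner, &flag);  /* re-entrant MPI call */
      if (flag) printf("inner attribute still present during destruction\n");
      return MPI_SUCCESS;
    }

    int main(int argc, char **argv)
    {
      MPI_Comm dup;
      MPI_Init(&argc, &argv);
      MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                             &inner_keyval, NULL);
      MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, del_outer,
                             &outer_keyval, NULL);
      MPI_Comm_dup(MPI_COMM_WORLD, &dup);
      MPI_Comm_set_attr(dup, inner_keyval, (void *)"inner");
      MPI_Comm_set_attr(dup, outer_keyval, (void *)"outer");
      MPI_Comm_free(&dup);   /* runs del_outer(); can hang on affected OpenMPI 1.8.x */
      MPI_Finalize();
      return 0;
    }

Built against MPICH or an earlier OpenMPI release, the same program should run to completion, which is why rolling back the MPI version is the practical workaround here.
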
<br><br>>
<br>> It also occurs in 1.8.0 (so far I’ve checked 1.8.{0,1,2,3}). Unfortunately, going back farther requires actually building openMPI, which requires something special (IB drivers, I believe).
<br>>
<br>> -Andrew
<br>>
<br>>
<br>>
<br>> On Fri, Feb 13, 2015 at 3:47 PM, Satish Balay <balay@mcs.anl.gov> wrote:
<br>>
<br>> I'll suggest 1.6
<br>>
<br>> I belive 1.6, 1.8 etc are considered stable releases - and 1.5, 1.7
<br>> etc are considered development releases.
<br>>
<br>> Satish
<br>>
> On Fri, 13 Feb 2015, Barry Smith wrote:
>
> >
> > I think it was introduced in 1.8.1, so 1.8.0 should be OK, but if it hangs then go back to 1.7.4.
> >
> >
> > > On Feb 13, 2015, at 4:43 PM, Andrew Spott <andrew@spott.us> wrote:
> > >
> > > In what version of OpenMPI was this introduced? I appear to be hitting it in OpenMPI 1.8.3 and 1.8.1 as well. Should I go back to 1.8.0 or 1.7.4?
> > >
> > > Thanks,
> > >
> > > -Andrew
> > >
> > >
> > >
> > > On Fri, Feb 13, 2015 at 2:23 PM, Andrew Spott <andrew@spott.us> wrote:
> > >
> > > Thanks! You just saved me hours of debugging.
> > >
> > > I’ll look into linking against an earlier implementation of OpenMPI.
> > >
> > > -Andrew
> > >
> > >
> > >
> > > On Fri, Feb 13, 2015 at 2:21 PM, Barry Smith <bsmith@mcs.anl.gov> wrote:
> > >
> > >
> > > Andrew,
> > >
> > > This is a bug that was recently introduced in the OpenMPI 1.8.2 implementation. Can you link against an earlier OpenMPI installation on the machine? Or do they have MPICH installed that you could use?
> > >
> > > Barry
> > >
> > >
> > >
> > > > On Feb 13, 2015, at 3:17 PM, Andrew Spott <andrew@spott.us> wrote:
> > > >
> > > > Local tests on OS X can’t reproduce the hang, but production tests on our local supercomputer always hang while waiting for a lock.
> > > >
> > > > The backtrace:
> > > >
> > > > #0 0x00002ba2980df054 in __lll_lock_wait () from /lib64/libpthread.so.0
> > > > #1 0x00002ba2980da388 in _L_lock_854 () from /lib64/libpthread.so.0
> > > > #2 0x00002ba2980da257 in pthread_mutex_lock () from /lib64/libpthread.so.0
> > > > #3 0x00002ba29a1d9e2c in ompi_attr_get_c () from /curc/tools/x_86_64/rh6/openmpi/1.8.2/gcc/4.9.1/lib/libmpi.so.1
> > > > #4 0x00002ba29a207f8e in PMPI_Attr_get () from /curc/tools/x_86_64/rh6/openmpi/1.8.2/gcc/4.9.1/lib/libmpi.so.1
> > > > #5 0x00002ba294aa111e in Petsc_DelComm_Outer () at /home/ansp6066/local/src/petsc-3.5.3/src/sys/objects/pinit.c:409
> > > > #6 0x00002ba29a1dae02 in ompi_attr_delete_all () from /curc/tools/x_86_64/rh6/openmpi/1.8.2/gcc/4.9.1/lib/libmpi.so.1
> > > > #7 0x00002ba29a1dcb6c in ompi_comm_free () from /curc/tools/x_86_64/rh6/openmpi/1.8.2/gcc/4.9.1/lib/libmpi.so.1
> > > > #8 0x00002ba29a20c713 in PMPI_Comm_free () from /curc/tools/x_86_64/rh6/openmpi/1.8.2/gcc/4.9.1/lib/libmpi.so.1
> > > > #9 0x00002ba294aba7cf in PetscSubcommCreate_contiguous(_n_PetscSubcomm*) () from /home/ansp6066/local/petsc-3.5.3-debug/lib/libpetsc.so.3.5
> > > > #10 0x00002ba294ab89d5 in PetscSubcommSetType () from /home/ansp6066/local/petsc-3.5.3-debug/lib/libpetsc.so.3.5
> > > > #11 0x00002ba2958ce437 in PCSetUp_Redundant(_p_PC*) () from /home/ansp6066/local/petsc-3.5.3-debug/lib/libpetsc.so.3.5
> > > > #12 0x00002ba2957a243d in PCSetUp () at /home/ansp6066/local/src/petsc-3.5.3/src/ksp/pc/interface/precon.c:902
> > > > #13 0x00002ba2958dea31 in KSPSetUp () at /home/ansp6066/local/src/petsc-3.5.3/src/ksp/ksp/interface/itfunc.c:306
> > > > #14 0x00002ba29a7f8e70 in STSetUp_Sinvert(_p_ST*) () at /home/ansp6066/local/src/slepc-3.5.3/src/sys/classes/st/impls/sinvert/sinvert.c:145
> > > > #15 0x00002ba29a7e92cf in STSetUp () at /home/ansp6066/local/src/slepc-3.5.3/src/sys/classes/st/interface/stsolve.c:301
> > > > #16 0x00002ba29a845ea6 in EPSSetUp () at /home/ansp6066/local/src/slepc-3.5.3/src/eps/interface/epssetup.c:207
> > > > #17 0x00002ba29a849f91 in EPSSolve () at /home/ansp6066/local/src/slepc-3.5.3/src/eps/interface/epssolve.c:88
> > > > #18 0x0000000000410de5 in petsc::EigenvalueSolver::solve() () at /home/ansp6066/code/petsc_cpp_wrapper/src/petsc_cpp/EigenvalueSolver.cpp:40
> > > > #19 0x00000000004065c7 in main () at /home/ansp6066/code/new_work_project/src/main.cpp:165
> > > >
> > > > This happens for both MPI and single-process runs. Does anyone have any hints on how I can debug this? I honestly have no idea.
> > > >
> > > > -Andrew
> > > >
> > >
> > >
> > >
> >
> >
>
>