[petsc-dev] mvapich and petsc-dev
Ethan Coon
ecoon at lanl.gov
Thu Apr 14 14:29:56 CDT 2011
On Thu, 2011-04-14 at 11:11 -0500, Satish Balay wrote:
> Can you run the mvapich code with -vecscatter_alltoall and see if
> it goes through?
>
> satish
Hmm, this seems to break in a different way -- it gets to what I
suspect is the same spot and then just hangs, presumably waiting on a
communication. No seg fault, but something is still broken.
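For reference, the run was invoked roughly like this (the exact
launcher and process count on this machine may differ):

    mpirun -np 64 ./runLBMSimulation -vecscatter_alltoall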
On Thu, 2011-04-14 at 11:11 -0500, Barry Smith wrote:
> Ethan,
>
> First valgrind the heck out of the code on a system where you can
> and make sure it is completely clean.
>
It's clean except for a few string problems -- like setting the names
of things from Fortran using TRIM() (a sketch of the pattern is below).
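To make that concrete, here's roughly the pattern valgrind complains
about -- a sketch only; PetscObjectSetName is one such call, and the
variable names are made up:

    ! sketch: passing a TRIM()ed Fortran string into a PETSc call
    ! (assumes the usual PETSc Fortran includes/modules)
    character(len=64) :: name
    name = 'density'
    ! the Fortran interface copies and null-terminates the string;
    ! valgrind can flag reads around the trimmed temporary
    call PetscObjectSetName(vec, trim(name), ierr)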
> On the bad system see if the code crashes with no optimization
> turned on.
Yep.
>
> Does it always crash at the same place? Or seemingly at some
> random place. If the same place, can you do some kind of restart file so
> that it crashes soon after you start instead of after many time-steps.
>
Yes it does. I'll dump out a restart file, which should help a bit (a
sketch of what I have in mind is below). We have some debugging support
on this machine, but I'm having issues getting totalview to play nicely
with mvapich. I'll bug people here about it.
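The restart dump I have in mind is just PETSc binary I/O, something
like this (a sketch; the file name and Vec are placeholders):

    ! write a checkpoint of the state vector
    PetscViewer viewer
    call PetscViewerBinaryOpen(PETSC_COMM_WORLD, 'restart.bin', &
         FILE_MODE_WRITE, viewer, ierr)
    call VecView(x, viewer, ierr)
    call PetscViewerDestroy(viewer, ierr)

    ! ... and on restart, read it back into a Vec with the same layout
    call PetscViewerBinaryOpen(PETSC_COMM_WORLD, 'restart.bin', &
         FILE_MODE_READ, viewer, ierr)
    call VecLoad(x, viewer, ierr)
    call PetscViewerDestroy(viewer, ierr)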
On another note, I found a valgrind installation hidden away in a
generic tools module on this machine, and I'm now getting some new
memcheck errors from mvapich.
First, in the initialization stage of the program:
==18207== Conditional jump or move depends on uninitialised value(s)
==18207== at 0xC3D7AD: intra_shmem_Allreduce (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0xC32827: PMPI_Allreduce (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x6F503A: PetscSplitOwnership (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x51B7B5: PetscLayoutSetUp (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x505026: VecCreate_Seq_Private (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x50D87D: VecCreate_Seq (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x4F8885: VecSetType (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x510163: VecCreate_Standard (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x4F8885: VecSetType (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x594941: DMCreateLocalVector_DA (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x4A0F65: DMCreateLocalVector (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x4A3E56: dmcreatelocalvector_ (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207==
And then, when the error actually occurs:
[7]PETSC ERROR: ------------------------------------------------------------------------
[7]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[7]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[7]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#valgrind
[7]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[7]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[7]PETSC ERROR: to get more information on the crash.
==18207== Invalid read of size 4
==18207== at 0xC552B5: MPID_IsendContig (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0xC59770: MPID_IsendDatatype (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0xC2209F: PMPI_Start (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x4E1FD0: VecScatterBegin_1 (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x4DBB1F: VecScatterBegin (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x592350: DMDALocalToLocalBegin (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x49D94E: dmdalocaltolocalbegin_ (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x415B7C: lbm_distribution_function_module_distributioncommunicatedensity_ (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x42B27A: lbm_module_lbmrun_ (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x42FB06: MAIN_ (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== by 0x40D6DF: main (in /usr/projects/porescale/research-lbm/hybrid-lbm-pflotran-lobo/tests/fracture-micromodel2/runLBMSimulation)
==18207== Address 0x38cd9980 is 3390064 bytes inside data symbol "temporary"
==18207==
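For context, the call that dies is our ghost update on an
already-local vector, schematically (placeholder names; note the same
Vec is passed as both source and destination, i.e. xin == xout):

    call DMDALocalToLocalBegin(da, xlocal, INSERT_VALUES, xlocal, ierr)
    ! ... local computation could overlap the communication here ...
    call DMDALocalToLocalEnd(da, xlocal, INSERT_VALUES, xlocal, ierr)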
Surely this is an mvapich bug?
Thanks,
Ethan
> My guess is the problem is a combination of the mvapich and
> possibly the hardware. Maybe bug the systems people about upgrades on
> the system?
>
> Barry
>
>
> On Thu, 14 Apr 2011, Ethan Coon wrote:
>
> > I'm grasping at straws a bit here, because I'm completely stymied, so
> > please bear with me.
> >
> >
> > I'm running a program in two locations -- on local workstations with
> > mpich2 and on a supercomputer with mvapich.
> >
> > On the workstation, the program runs, in all cases I've tested,
> > including 8 processes (the number of cores), and up to 64 processes
> > (multiple procs per core).
> >
> > On the supercomputer, it runs on 16 cores (one full node). With 64
> > cores, it seg-faults and core dumps many timesteps into the
> > simulation.
> >
> > Using a debugger, a debug-enabled petsc-dev, but with no access to
> > debugging symbols in the mvapich installation, I've looked at the core.
> > It appears to dump during VecScatterBegin_1 (within a DMDALocalToLocal()
> > with xin = xout). The Vec I pass in as both input and output appears
> > normal.
> >
> > The stack looks something like:
> >
> > MPIR_HBT_lookup, FP=7fff1010f740
> > PMPI_Attr_get, FP=7fff1010f780
> > PetscCommDuplicate, FP=7fff1010f7d0
> > PetscViewerASCIIGetStdout, FP=7fff1010f800
> > PETSC_VIEWER_STDOUT_, FP=7fff1010f820
> > PetscDefaultSignalHandler, FP=7fff1010fa70
> > PetscSignalHandler_Private, FP=7fff1010fa90
> > **** Signal Stack Frame ******************
> > MPID_IsendContig, FP=7fff1010ff20
> > MPID_IsendDatatype, FP=7fff1010ffa0
> > PMPI_Start, FP=7fff1010fff0
> > VecScatterBegin_1, FP=7fff10110080
> > VecScatterBegin, FP=7fff101100e0
> > DMDALocalToLocalBegin, FP=7fff10110120
> > dmdalocaltolocalbegin_, FP=7fff10110160
> >
> >
> > Has anyone run into anything like this before? I don't even know how
> > to proceed, and I doubt this is a PETSc problem, but I figured you guys
> > might have enough experience with these types of issues to know where to
> > look from here...
> >
> > Thanks,
> >
> > Ethan
> >
> >
> >
>
--
------------------------------------
Ethan Coon
Post-Doctoral Researcher
Applied Mathematics - T-5
Los Alamos National Laboratory
505-665-8289
http://www.ldeo.columbia.edu/~ecoon/
------------------------------------