[petsc-users] parallel HDF5 output of DMDA data with dof>1
Matteo Semplice
matteo.semplice at uninsubria.it
Fri Jul 16 05:27:09 CDT 2021
Il 15/07/21 17:44, Matteo Semplice ha scritto:
> Hi.
>
> When I write (HDF5 viewer) a vector associated to a DMDA with 1 dof,
> the output is independent of the number of cpus used.
>
> However, for a DMDA with dof=2, the output seems to be correct when I
> run on 1 or 2 cpus, but is scrambled when I run with 4 cpus. Judging
> from the ranges of the data, each field gets written to the correct
> part, and its the data witin the field that is scrambled. Here's my MWE:
>
> #include <petscversion.h>
> #include <petscdmda.h>
> #include <petscviewer.h>
> #include <petscsys.h>
> #include <petscviewerhdf5.h>
>
> int main(int argc, char **argv) {
>
> PetscErrorCode ierr;
> ierr = PetscInitialize(&argc,&argv,(char*)0,help); CHKERRQ(ierr);
> PetscInt Nx=11;
> PetscInt Ny=11;
> PetscScalar dx = 1.0 / (Nx-1);
> PetscScalar dy = 1.0 / (Ny-1);
> DM dmda;
> ierr = DMDACreate2d(PETSC_COMM_WORLD,
> DM_BOUNDARY_NONE,DM_BOUNDARY_NONE,
> DMDA_STENCIL_STAR,
> Nx,Ny, //global dim
> PETSC_DECIDE,PETSC_DECIDE, //n proc on each dim
> 2,1, //dof, stencil width
> NULL, NULL, //n nodes per direction on each cpu
> &dmda); CHKERRQ(ierr);
> ierr = DMSetFromOptions(dmda); CHKERRQ(ierr);
> ierr = DMSetUp(dmda); CHKERRQ(ierr); CHKERRQ(ierr);
> ierr = DMDASetUniformCoordinates(dmda, 0.0, 1.0, 0.0, 1.0, 0.0,
> 1.0); CHKERRQ(ierr);
> ierr = DMDASetFieldName(dmda,0,"s"); CHKERRQ(ierr);
> ierr = DMDASetFieldName(dmda,1,"c"); CHKERRQ(ierr);
> DMDALocalInfo daInfo;
> ierr = DMDAGetLocalInfo(dmda,&daInfo); CHKERRQ(ierr);
> IS *is;
> DM *daField;
> ierr = DMCreateFieldDecomposition(dmda,NULL, NULL, &is, &daField);
> CHKERRQ(ierr);
> Vec U0;
> ierr = DMCreateGlobalVector(dmda,&U0); CHKERRQ(ierr);
>
> //Initial data
> typedef struct{ PetscScalar s,c;} data_type;
> data_type **u;
> ierr = DMDAVecGetArray(dmda,U0,&u); CHKERRQ(ierr);
> for (PetscInt j=daInfo.ys; j<daInfo.ys+daInfo.ym; j++){
> PetscScalar y = j*dy;
> for (PetscInt i=daInfo.xs; i<daInfo.xs+daInfo.xm; i++){
> PetscScalar x = i*dx;
> u[j][i].s = x+2.*y;
> u[j][i].c = 10. + 2.*x*x+y*y;
> }
> }
> ierr = DMDAVecRestoreArray(dmda,U0,&u); CHKERRQ(ierr);
>
> PetscViewer viewer;
> ierr =
> PetscViewerHDF5Open(PETSC_COMM_WORLD,"solutionSC.hdf5",FILE_MODE_WRITE,&viewer);
> CHKERRQ(ierr);
> Vec uField;
> ierr = VecGetSubVector(U0,is[0],&uField); CHKERRQ(ierr);
> PetscObjectSetName((PetscObject) uField, "S");
> ierr = VecView(uField,viewer); CHKERRQ(ierr);
> ierr = VecRestoreSubVector(U0,is[0],&uField); CHKERRQ(ierr);
> ierr = VecGetSubVector(U0,is[1],&uField); CHKERRQ(ierr);
> PetscObjectSetName((PetscObject) uField, "C");
> ierr = VecView(uField,viewer); CHKERRQ(ierr);
> ierr = VecRestoreSubVector(U0,is[1],&uField); CHKERRQ(ierr);
> ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
>
> ierr = PetscFinalize();
> return ierr;
> }
>
> and my xdmf file
>
> <?xml version="1.0" ?>
> <Xdmf
> xmlns:xi="https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.w3.org%2F2001%2FXInclude&data=04%7C01%7Cmatteo.semplice%40uninsubria.it%7C5207ada4b58b4312880108d947a769e8%7C9252ed8bdffc401c86ca6237da9991fa%7C0%7C0%7C637619607056182657%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=ksp%2B0oGApgERb%2FIokBrF4q0HzxDoqstxUxl%2FB%2B4Fn3U%3D&reserved=0"
> Version="2.0">
> <Domain>
> <Grid GridType="Collection" CollectionType="Temporal">
> <Time TimeType="List">
> <DataItem Dimensions="1">1.0</DataItem>
> </Time>
> <Grid GridType="Uniform" Name="domain">
> <Topology TopologyType="2DCoRectMesh" Dimensions="11 11"/>
> <Geometry GeometryType="ORIGIN_DXDY">
> <DataItem Format="XML" NumberType="Float" Dimensions="2">0.0
> 0.0</DataItem>
> <DataItem Format="XML" NumberType="Float" Dimensions="2">0.1
> 0.1</DataItem>
> </Geometry>
> <Attribute Name="S" Center="Node" AttributeType="Scalar">
> <DataItem Format="HDF" Precision="8" Dimensions="11
> 11">solutionSC.hdf5:/S</DataItem>
> </Attribute>
> <Attribute Name="C" Center="Node" AttributeType="Scalar">
> <DataItem Format="HDF" Precision="8" Dimensions="11
> 11">solutionSC.hdf5:/C</DataItem>
> </Attribute>
> </Grid>
> </Grid>
> </Domain>
> </Xdmf>
>
> Steps to reprduce: run code and open the xdmf with paraview. If the
> code was run with 1,2 or 3 cpus, the data are correct (except the
> plane xy has become the plane yz), but with 4 cpus the data are
> scrambled.
>
> Does anyone have any insight?
>
> (I am using Petsc Release Version 3.14.2, but I can compile a newer
> one if you think it's important.)
Hi,
I have a small update on this issue.
First, it is still here with version 3.15.2.
Secondly, I have run the code under valgrind and
- for 1 or 2 processes, I get no errors
- for 4 processes, 3 out of 4, trigger the following
==25921== Conditional jump or move depends on uninitialised value(s)
==25921== at 0xB3D6259: ??? (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921== by 0xB3D85C8: mca_fcoll_two_phase_file_write_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921== by 0xAAEB29B: mca_common_ompio_file_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmca_common_ompio.so.41.9.0)
==25921== by 0xB316605: mca_io_ompio_file_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so)
==25921== by 0x73C7FE7: PMPI_File_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.10.3)
==25921== by 0x69E8700: H5FD__mpio_write (H5FDmpio.c:1466)
==25921== by 0x670D6EB: H5FD_write (H5FDint.c:248)
==25921== by 0x66DA0D3: H5F__accum_write (H5Faccum.c:826)
==25921== by 0x684F091: H5PB_write (H5PB.c:1031)
==25921== by 0x66E8055: H5F_shared_block_write (H5Fio.c:205)
==25921== by 0x6674538: H5D__chunk_collective_fill (H5Dchunk.c:5064)
==25921== by 0x6674538: H5D__chunk_allocate (H5Dchunk.c:4736)
==25921== by 0x668C839: H5D__init_storage (H5Dint.c:2473)
==25921== Uninitialised value was created by a heap allocation
==25921== at 0x483577F: malloc (vg_replace_malloc.c:299)
==25921== by 0xB3D6155: ??? (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921== by 0xB3D85C8: mca_fcoll_two_phase_file_write_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921== by 0xAAEB29B: mca_common_ompio_file_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmca_common_ompio.so.41.9.0)
==25921== by 0xB316605: mca_io_ompio_file_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so)
==25921== by 0x73C7FE7: PMPI_File_write_at_all (in
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.10.3)
==25921== by 0x69E8700: H5FD__mpio_write (H5FDmpio.c:1466)
==25921== by 0x670D6EB: H5FD_write (H5FDint.c:248)
==25921== by 0x66DA0D3: H5F__accum_write (H5Faccum.c:826)
==25921== by 0x684F091: H5PB_write (H5PB.c:1031)
==25921== by 0x66E8055: H5F_shared_block_write (H5Fio.c:205)
==25921== by 0x6674538: H5D__chunk_collective_fill (H5Dchunk.c:5064)
==25921== by 0x6674538: H5D__chunk_allocate (H5Dchunk.c:4736)
Does anyone have any hint on what might be causing this?
Is this the "buggy MPI-IO" that Matt was mentioning in
https://lists.mcs.anl.gov/pipermail/petsc-users/2021-July/044138.html?
I am using the release branch (commit c548142fde) and I have configured
with --download-hdf5; configure finds the installed openmpi 3.1.3 from
Debian buster. The relevant lines from configure.log are
MPI:
Version: 3
Mpiexec: mpiexec --oversubscribe
OMPI_VERSION: 3.1.3
hdf5:
Version: 1.12.0
Includes: -I/home/matteo/software/petsc/opt/include
Library: -Wl,-rpath,/home/matteo/software/petsc/opt/lib
-L/home/matteo/software/petsc/opt/lib -lhdf5hl_fortran -lhdf5_fortran
-lhdf5_hl -lhdf5
Best
Matteo
More information about the petsc-users
mailing list