[petsc-users] parallel HDF5 output of DMDA data with dof>1

Matteo Semplice matteo.semplice at uninsubria.it
Fri Jul 16 05:27:09 CDT 2021


Il 15/07/21 17:44, Matteo Semplice ha scritto:
> Hi.
>
> When I write (HDF5 viewer) a vector associated to a DMDA with 1 dof, 
> the output is independent of the number of cpus used.
>
> However, for a DMDA with dof=2, the output seems to be correct when I 
> run on 1 or 2 cpus, but is scrambled when I run with 4 cpus. Judging 
> from the ranges of the data, each field gets written to the correct 
> part, and it's the data within the field that is scrambled. Here's my MWE:
>
> #include <petscversion.h>
> #include <petscdmda.h>
> #include <petscviewer.h>
> #include <petscsys.h>
> #include <petscviewerhdf5.h>
>
> int main(int argc, char **argv) {
>
>   PetscErrorCode ierr;
>   ierr = PetscInitialize(&argc,&argv,(char*)0,NULL); CHKERRQ(ierr);
>   PetscInt Nx=11;
>   PetscInt Ny=11;
>   PetscScalar dx = 1.0 / (Nx-1);
>   PetscScalar dy = 1.0 / (Ny-1);
>   DM dmda;
>   ierr = DMDACreate2d(PETSC_COMM_WORLD,
>                       DM_BOUNDARY_NONE,DM_BOUNDARY_NONE,
>                       DMDA_STENCIL_STAR,
>                       Nx,Ny, //global dim
>                       PETSC_DECIDE,PETSC_DECIDE, //n proc on each dim
>                       2,1, //dof, stencil width
>                       NULL, NULL, //n nodes per direction on each cpu
>                       &dmda);      CHKERRQ(ierr);
>   ierr = DMSetFromOptions(dmda); CHKERRQ(ierr);
>   ierr = DMSetUp(dmda); CHKERRQ(ierr);
>   ierr = DMDASetUniformCoordinates(dmda, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0); CHKERRQ(ierr);
>   ierr = DMDASetFieldName(dmda,0,"s"); CHKERRQ(ierr);
>   ierr = DMDASetFieldName(dmda,1,"c"); CHKERRQ(ierr);
>   DMDALocalInfo daInfo;
>   ierr = DMDAGetLocalInfo(dmda,&daInfo); CHKERRQ(ierr);
>   IS *is;
>   DM *daField;
>   ierr = DMCreateFieldDecomposition(dmda,NULL, NULL, &is, &daField); CHKERRQ(ierr);
>   Vec U0;
>   ierr = DMCreateGlobalVector(dmda,&U0); CHKERRQ(ierr);
>
>   //Initial data
>   typedef struct{ PetscScalar s,c;} data_type;
>   data_type **u;
>   ierr = DMDAVecGetArray(dmda,U0,&u); CHKERRQ(ierr);
>   for (PetscInt j=daInfo.ys; j<daInfo.ys+daInfo.ym; j++){
>     PetscScalar y = j*dy;
>     for (PetscInt i=daInfo.xs; i<daInfo.xs+daInfo.xm; i++){
>       PetscScalar x = i*dx;
>       u[j][i].s = x+2.*y;
>       u[j][i].c = 10. + 2.*x*x+y*y;
>     }
>   }
>   ierr = DMDAVecRestoreArray(dmda,U0,&u); CHKERRQ(ierr);
>
>   PetscViewer viewer;
>   ierr = PetscViewerHDF5Open(PETSC_COMM_WORLD,"solutionSC.hdf5",FILE_MODE_WRITE,&viewer); CHKERRQ(ierr);
>   Vec uField;
>   ierr = VecGetSubVector(U0,is[0],&uField); CHKERRQ(ierr);
>   ierr = PetscObjectSetName((PetscObject) uField, "S"); CHKERRQ(ierr);
>   ierr = VecView(uField,viewer); CHKERRQ(ierr);
>   ierr = VecRestoreSubVector(U0,is[0],&uField); CHKERRQ(ierr);
>   ierr = VecGetSubVector(U0,is[1],&uField); CHKERRQ(ierr);
>   ierr = PetscObjectSetName((PetscObject) uField, "C"); CHKERRQ(ierr);
>   ierr = VecView(uField,viewer); CHKERRQ(ierr);
>   ierr = VecRestoreSubVector(U0,is[1],&uField); CHKERRQ(ierr);
>   ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
>
>   ierr = PetscFinalize();
>   return ierr;
> }
>
> and my xdmf file
>
> <?xml version="1.0" ?>
> <Xdmf xmlns:xi="http://www.w3.org/2001/XInclude" Version="2.0">
>   <Domain>
>     <Grid GridType="Collection" CollectionType="Temporal">
>       <Time TimeType="List">
>         <DataItem Dimensions="1">1.0</DataItem>
>       </Time>
>       <Grid GridType="Uniform" Name="domain">
>         <Topology TopologyType="2DCoRectMesh" Dimensions="11 11"/>
>         <Geometry GeometryType="ORIGIN_DXDY">
>           <DataItem Format="XML" NumberType="Float" Dimensions="2">0.0 0.0</DataItem>
>           <DataItem Format="XML" NumberType="Float" Dimensions="2">0.1 0.1</DataItem>
>         </Geometry>
>         <Attribute Name="S" Center="Node" AttributeType="Scalar">
>           <DataItem Format="HDF" Precision="8" Dimensions="11 11">solutionSC.hdf5:/S</DataItem>
>         </Attribute>
>         <Attribute Name="C" Center="Node" AttributeType="Scalar">
>           <DataItem Format="HDF" Precision="8" Dimensions="11 11">solutionSC.hdf5:/C</DataItem>
>         </Attribute>
>       </Grid>
>     </Grid>
>   </Domain>
> </Xdmf>
>
> Steps to reproduce: run the code and open the xdmf with paraview. If the 
> code was run with 1, 2 or 3 cpus, the data are correct (except that the 
> xy plane has become the yz plane), but with 4 cpus the data are 
> scrambled.
>
> Does anyone have any insight?
>
> (I am using Petsc Release Version 3.14.2, but I can compile a newer 
> one if you think it's important.)

Hi,

     I have a small update on this issue.

First, the issue is still present with version 3.15.2.

Secondly, I have run the code under valgrind and

- for 1 or 2 processes, I get no errors

- for 4 processes, 3 out of the 4 ranks trigger the following

==25921== Conditional jump or move depends on uninitialised value(s)
==25921==    at 0xB3D6259: ??? (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921==    by 0xB3D85C8: mca_fcoll_two_phase_file_write_all (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921==    by 0xAAEB29B: mca_common_ompio_file_write_at_all (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmca_common_ompio.so.41.9.0)
==25921==    by 0xB316605: mca_io_ompio_file_write_at_all (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so)
==25921==    by 0x73C7FE7: PMPI_File_write_at_all (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.10.3)
==25921==    by 0x69E8700: H5FD__mpio_write (H5FDmpio.c:1466)
==25921==    by 0x670D6EB: H5FD_write (H5FDint.c:248)
==25921==    by 0x66DA0D3: H5F__accum_write (H5Faccum.c:826)
==25921==    by 0x684F091: H5PB_write (H5PB.c:1031)
==25921==    by 0x66E8055: H5F_shared_block_write (H5Fio.c:205)
==25921==    by 0x6674538: H5D__chunk_collective_fill (H5Dchunk.c:5064)
==25921==    by 0x6674538: H5D__chunk_allocate (H5Dchunk.c:4736)
==25921==    by 0x668C839: H5D__init_storage (H5Dint.c:2473)
==25921==  Uninitialised value was created by a heap allocation
==25921==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==25921==    by 0xB3D6155: ??? (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921==    by 0xB3D85C8: mca_fcoll_two_phase_file_write_all (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_fcoll_two_phase.so)
==25921==    by 0xAAEB29B: mca_common_ompio_file_write_at_all (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmca_common_ompio.so.41.9.0)
==25921==    by 0xB316605: mca_io_ompio_file_write_at_all (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_io_ompio.so)
==25921==    by 0x73C7FE7: PMPI_File_write_at_all (in 
/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so.40.10.3)
==25921==    by 0x69E8700: H5FD__mpio_write (H5FDmpio.c:1466)
==25921==    by 0x670D6EB: H5FD_write (H5FDint.c:248)
==25921==    by 0x66DA0D3: H5F__accum_write (H5Faccum.c:826)
==25921==    by 0x684F091: H5PB_write (H5PB.c:1031)
==25921==    by 0x66E8055: H5F_shared_block_write (H5Fio.c:205)
==25921==    by 0x6674538: H5D__chunk_collective_fill (H5Dchunk.c:5064)
==25921==    by 0x6674538: H5D__chunk_allocate (H5Dchunk.c:4736)

Does anyone have any hint on what might be causing this?

Is this the "buggy MPI-IO" that Matt was mentioning in 
https://lists.mcs.anl.gov/pipermail/petsc-users/2021-July/044138.html?
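
In case it helps narrow things down, here is an untested sketch of a
cross-check that could go into the MWE just before the
PetscViewerDestroy() call (the variable names sField/cField and the
dataset names S2/C2 are placeholders): gather each component with
VecStrideGather() into a vector created from the corresponding
single-dof DM returned by DMCreateFieldDecomposition(), so that
VecView() goes through the DMDA-aware HDF5 path instead of the
plain-Vec path taken by the subvectors.

  /* Untested sketch: write each field through a vector that still
     carries its single-dof DMDA (daField[] from
     DMCreateFieldDecomposition above), so VecView() takes the DMDA
     HDF5 output path. */
  Vec sField, cField;
  ierr = DMCreateGlobalVector(daField[0],&sField); CHKERRQ(ierr);
  ierr = DMCreateGlobalVector(daField[1],&cField); CHKERRQ(ierr);
  ierr = VecStrideGather(U0,0,sField,INSERT_VALUES); CHKERRQ(ierr); /* component 0 = "s" */
  ierr = VecStrideGather(U0,1,cField,INSERT_VALUES); CHKERRQ(ierr); /* component 1 = "c" */
  ierr = PetscObjectSetName((PetscObject)sField,"S2"); CHKERRQ(ierr);
  ierr = PetscObjectSetName((PetscObject)cField,"C2"); CHKERRQ(ierr);
  ierr = VecView(sField,viewer); CHKERRQ(ierr);
  ierr = VecView(cField,viewer); CHKERRQ(ierr);
  ierr = VecDestroy(&sField); CHKERRQ(ierr);
  ierr = VecDestroy(&cField); CHKERRQ(ierr);

If those datasets come out correct on 4 cpus while /S and /C are still
scrambled, that would point at the subvector write path rather than the
MPI-IO layer itself.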

I am using the release branch (commit c548142fde) and I have configured 
with --download-hdf5; configure finds the installed openmpi 3.1.3 from 
Debian buster. The relevant lines from configure.log are

MPI:
   Version:  3
   Mpiexec: mpiexec --oversubscribe
   OMPI_VERSION: 3.1.3
hdf5:
   Version:  1.12.0
   Includes: -I/home/matteo/software/petsc/opt/include
   Library:  -Wl,-rpath,/home/matteo/software/petsc/opt/lib 
-L/home/matteo/software/petsc/opt/lib -lhdf5hl_fortran -lhdf5_fortran 
-lhdf5_hl -lhdf5

Best

     Matteo


