[MOAB-dev] DMMoabLoadFromFile() - parallel performance issue

Grindeanu, Iulian R. iulian at mcs.anl.gov
Tue Dec 15 19:06:51 CST 2015


Hi Jim,
Thanks for your message.
The relatively poor scalability is due to the way the mesh was generated / ordered in the file.
When reading in parallel, the mesh is cherry-picked from the file: each task tries to read
only the elements it needs.
The ids of the elements in each partition are very disjoint; there are big gaps in "file id space",
so each task has to read many sub-sequences from the file.
In serial, there is one sequence and no cherry-picking.
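
A quick way to see this for your own file is a small serial check like the sketch below; it assumes the element GLOBAL_ID values follow the file ordering (usually the case for .h5m files) and a 3-D mesh:

------------------------------------------------------------
// Rough diagnostic sketch (serial): for each PARALLEL_PARTITION set,
// print how many elements it holds and its min/max element GLOBAL_ID.
// Widely spread id ranges per part mean each task must read many
// small sub-sequences from the file.
#include "moab/Core.hpp"
#include "moab/Range.hpp"
#include <algorithm>
#include <iostream>
#include <vector>

int main(int argc, char** argv)
{
  moab::Core mb;
  if (moab::MB_SUCCESS != mb.load_file(argv[1])) return 1;

  moab::Tag part_tag, gid_tag;
  if (moab::MB_SUCCESS != mb.tag_get_handle("PARALLEL_PARTITION", 1,
                                            moab::MB_TYPE_INTEGER, part_tag)) return 2;
  if (moab::MB_SUCCESS != mb.tag_get_handle("GLOBAL_ID", 1,
                                            moab::MB_TYPE_INTEGER, gid_tag)) return 3;

  moab::Range parts;
  mb.get_entities_by_type_and_tag(0, moab::MBENTITYSET, &part_tag, 0, 1, parts);

  for (moab::Range::iterator it = parts.begin(); it != parts.end(); ++it) {
    moab::Range elems;
    mb.get_entities_by_dimension(*it, 3, elems);  // use 2 for surface meshes
    if (elems.empty()) continue;
    std::vector<int> ids(elems.size());
    mb.tag_get_data(gid_tag, elems, &ids[0]);
    std::cout << elems.size() << " elements, global ids "
              << *std::min_element(ids.begin(), ids.end()) << " .. "
              << *std::max_element(ids.begin(), ids.end()) << std::endl;
  }
  return 0;
}
------------------------------------------------------------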

If you partition your file using the -R option, you will get better results, as the elements are reordered in the file
according to their partition (so elements in partition 0 will be "close together" in the file too, not only in physical space).

Something like this:

mbpart -z RCB 16 -R scaling_study.h5m  reor.h5m

This is what I got with your initial file, scaling_study.h5m:

procs    ReadHDF5     PARALLEL TOTAL
1        0.179603     0.182674
2        0.363472     0.483488
4        0.310096     0.390116
8        0.302726     0.366929

Reordered file (reor.h5m):

procs    ReadHDF5     PARALLEL TOTAL
1        0.0257461    0.0286331
2        0.025044     0.138264
4        0.0203938    0.099154
8        0.0345509    0.103593


PARALLEL TOTAL includes the time needed to "resolve shared entities" between tasks.

ReadHDF5 is the actual I/O read from disk.
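
For reference, here is a minimal sketch of the kind of parallel read these two timers cover (a sketch only, assuming the usual MOAB read-options string and the reordered file reor.h5m; adjust both for your setup):

------------------------------------------------------------
// Minimal sketch: parallel read of a partitioned .h5m file.
// READ_PART restricts each task to the elements of its own partition
// (the part timed as ReadHDF5); PARALLEL_RESOLVE_SHARED_ENTS is the
// extra step included in PARALLEL TOTAL but not in ReadHDF5.
#include "moab/Core.hpp"
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  moab::Core mb;
  const char* opts = "PARALLEL=READ_PART;"
                     "PARTITION=PARALLEL_PARTITION;"
                     "PARALLEL_RESOLVE_SHARED_ENTS";

  double t0 = MPI_Wtime();
  moab::ErrorCode rval = mb.load_file("reor.h5m", 0, opts);
  double t1 = MPI_Wtime();

  if (moab::MB_SUCCESS != rval)
    std::cerr << "load_file failed" << std::endl;
  else
    std::cout << "load time: " << (t1 - t0) << " s" << std::endl;

  MPI_Finalize();
  return 0;
}
------------------------------------------------------------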

________________________________________
From: moab-dev-bounces at mcs.anl.gov [moab-dev-bounces at mcs.anl.gov] on behalf of WARNER, JAMES E. (LARC-D309) [james.e.warner at nasa.gov]
Sent: Tuesday, December 15, 2015 7:27 AM
To: Vijay S. Mahadevan
Cc: moab-dev at mcs.anl.gov
Subject: Re: [MOAB-dev] DMMoabLoadFromFile() - parallel performance issue

Good morning,

Thanks for the quick response.

We have seen this issue both on a Linux workstation without a parallel
filesystem and on a supercomputer with a Lustre parallel filesystem. Here is
a link to the machine statistics of the latter:
http://www.nas.nasa.gov/hecc/resources/pleiades.html; this is where we
attempted the scalability test that failed on 500 procs.

By the way, we are building HDF5/MOAB via the PETSc install. Here are the
library versions / configuration details for the workstation build:

------------------------------------------------------------
compiler - comp-intel/2015.0.09
mpi        - mpich2-1.4

PETSc:
CFLAGS=-O2 CXXFLAGS=-O2 --with-debugging=0 --with-shared-libraries=0
--download-metis=1 --download-parmetis=1 --download-hdf5=1
--download-moab=1 --download-mumps=1 --download-scalapack
--with-blas-lapack-dir=/opt/intel15/mkl/lib/intel64/

HDF5:
./configure
--prefix=/users0/jewarne1/src/petsc/build/petsc-3.5.4/intel-mpich
--libdir=/users0/jewarne1/src/petsc/build/petsc-3.5.4/intel-mpich/lib
CC="mpicc" CFLAGS="-O2 " --enable-parallel --enable-fortran FC="mpif90"
F9X="mpif90" F90="mpif90" --disable-shared

MOAB:
./configure
--prefix=/users0/jewarne1/src/petsc/build/petsc-3.5.4/intel-mpich
CC="mpicc" CFLAGS="-O2 " CXX="mpicxx" CXXFLAGS="-O2    " F90="mpif90"
F90FLAGS=" -O3  " F77="mpif90" FFLAGS=" -O3  " FC="mpif90" FCLAGS=" -O3  "
--disable-shared --with-mpi
--with-hdf5=/users0/jewarne1/src/petsc/build/petsc-3.5.4/intel-mpich
--without-netcdf
------------------------------------------------------------



Here are the same details for the cluster build:


------------------------------------------------------------

Compiler - comp-intel/2015.3.187
mpi - mpi-sgi/mpt.2.12r26

petsc:
CFLAGS="-O2 -xAVX" CXXFLAGS="-O2 -xAVX" --with-debugging=0
--with-shared-libraries=0 --download-metis=1 --download-parmetis=1
--download-hdf5=1 --download-moab=1 --download-mumps=1
--download-scalapack
--with-blas-lapack-dir=/nasa/intel/Compiler/2015.3.187/composer_xe_2015.3.1
87/mkl/lib/intel64/

moab:
--prefix=/home6/jhochhal/src/petsc-3.5.4/intel-mpt CC="mpicc"
CFLAGS="-O2 -xAVX " CXX="mpicxx" CXXFLAGS="-O2 -xAVX    " F90="mpif90"
F90FLAGS=" -O3  " F77="mpif90" FFLAGS=" -O3  " FC="mpif90" FCLAGS=" -O3
" --disable-shared --with-mpi
--with-hdf5=/home6/jhochhal/src/petsc-3.5.4/intel-mpt --without-netcdf

hdf5:
--prefix=/home6/jhochhal/src/petsc-3.5.4/intel-mpt
--libdir=/home6/jhochhal/src/petsc-3.5.4/intel-mpt/lib CC="mpicc"
CFLAGS="-O2 -xAVX " --enable-parallel --enable-fortran FC="mpif90"
F9X="mpif90" F90="mpif90" --disable-shared
------------------------------------------------------------



Obviously we will not see scaling for the serial moab/hdf5 reader on the
machine without parallel I/O, but we expected to see CPU times on the
order of nprocs * time_serial. Instead we're seeing a slowdown of about 35X
from NP=1 to NP=2.

Let me know if you have any comments on the above information or any
results from running the attached test case on your end. Thanks!

Best,
Jim






On 12/14/15, 7:24 PM, "Vijay S. Mahadevan" <vijay.m at gmail.com> wrote:

>James,
>
>I haven't locally tested your test case yet but those numbers look
>suspicious. We have seen bad slowdown of the HDF5 I/O until 4000 procs
>for loading mesh files that are over O(100) GB. Even there, the plateau
>in timing still happens at around a couple of minutes, not 45 minutes.
>So the simple test should certainly not see bad performance
>degradation.
>
>DMMoabLoadFromFile internally just calls moab->load_file, which is
>primarily a load factory that invokes ReadHDF5. So the bulk of the
>behavior can be reduced to ReadHDF5 and HDF5 itself. So here are some
>questions to better understand what you are doing here.
>
>1) What version of HDF5 are you using and how is it configured ?
>Debug/optimized ?
>2) Is MOAB configured with optimized mode ?
>3) What compiler and MPI version are you using on your machine so that
>we can better understand if it's a compiler flag issue.
>4) What are your machine characteristics ? Cluster or large scale
>machine ? GPFS or Lustre or some other base system ?
>5) Can you work with a MOAB branch ? We have a PR that is currently
>being reviewed, which should give you fine grained profiling data
>during the read. Take a look at [1].
>
>Let us know some of these answers. Meanwhile, we will also try to use
>your test case to check the I/O performance and see if your results
>are replicable to some degree.
>
>Vijay
>
>[1]
>https://bitbucket.org/fathomteam/moab/pull-requests/170/genlargemesh-corre
>ctions/diff
>
>On Mon, Dec 14, 2015 at 6:25 PM, WARNER, JAMES E. (LARC-D309)
><james.e.warner at nasa.gov> wrote:
>> Hi Vijay & Iulian,
>>
>> Hope you are doing well! I have a question regarding some strange
>>behavior
>> we're seeing with the DMMoabLoadFromFile() function...
>>
>> After doing some recent profiling of our MOAB-based finite element
>>code, we
>> noticed that we are spending a disproportionate amount of CPU time
>>within
>> the DMMoabLoadFromFile() function, which gets slower / remains constant
>>as
>> we increase the number of processors. We also recently attempted a
>> scalability test with ~30M FEM nodes  on 500 processors which hung in
>> DMMoabLoadFromFile() for about 45 minutes before we killed the job. We
>>then
>> re-ran the test on one processor and it made it through successfully in
>> several seconds.
>>
>> To reproduce the problem we're seeing, we wrote a test case (attached
>>here)
>> that simply loads a smaller mesh with approximately 16K nodes and
>>prints the
>> run time. When I run the code on an increasing number of processors, I
>>get
>> something like:
>>
>> NP=1: Time to read file: 0.0416839 [sec.]
>> NP=2: Time to read file: 1.42497 [sec.]
>> NP=4: Time to read file: 1.13678 [sec.]
>> NP=8: Time to read file: 1.0475 [sec.]
>> ...
>>
>> If it is relevant/helpful - we are using the mbpart tool to partition
>>the
>> mesh.  Do you have any ideas why we are not seeing scalability here? Any
>> thoughts/tips would be appreciated! Let me know if you would like any
>>more
>> information.
>>
>> Thanks,
>> Jim
>>
>>
>>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: globalID.png
Type: image/png
Size: 112406 bytes
Desc: globalID.png
URL: <http://lists.mcs.anl.gov/pipermail/moab-dev/attachments/20151216/39cba5e4/attachment-0001.png>

