[MOAB-dev] Problems reading large meshes in MOAB in parallel

Dmitry Karpeev karpeev at mcs.anl.gov
Mon Mar 22 20:39:21 CDT 2010


If anybody has any idea what may be going on here (described in detail below), I'd really appreciate the insight. I am trying to summarize this as concisely as possible, yet provide sufficient detail. If more info is needed, I'm ready to offer it.

I have been experiencing trouble with reading large meshes into MOAB in parallel. Since there are several machines involved (Linux laptop; ANL clusters: cosmea, fusion), several code revisions, and several read modes (bcast_delete, read_part, read_delete, bcast), each generating various problems, I want to focus on a limited subset, to describe the observed performance and isolate the problems.

Machine(s):
I have temporarily abandoned running performance benchmarks on fusion, since that machine has a broken autoconf system (libtool is missing) and building updated revisions there is time consuming and not easily automated (a configure script has to be prepared elsewhere, transferred to fusion, etc.). This leaves cosmea as the only parallel machine I'm using, at least for now.

Code:
Also, I am currently using codes based on two MOAB revisions: "stable", based on an old rev. 3556, and "unstable", based on rev. 3668.
The rationale for using two codes is that I used to have problems with hanging

Read modes:
I'm focusing on:
 -  bcast_delete, as the most memory intensive (each proc has to be able to allocate the entire mesh, receive it from the root proc, and then delete the nonlocal portions)
 -  read_part, as the most scalable.
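
For reference, this is roughly how the two modes map onto the reader options string through the API. It's just a sketch (the moab:: naming here may not match these particular revisions, and the option spellings are written from memory):

#include "moab/Core.hpp"
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  moab::Core mb;

  // bcast_delete: the root proc reads the whole file and broadcasts it;
  // every proc then deletes the entities outside its own partition sets.
  const char* bcast_delete_opts =
      "PARALLEL=BCAST_DELETE;PARTITION=PARALLEL_PARTITION;"
      "PARALLEL_RESOLVE_SHARED_ENTS";

  // read_part: every proc reads only the partition sets assigned to it.
  const char* read_part_opts =
      "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;"
      "PARALLEL_RESOLVE_SHARED_ENTS";

  moab::ErrorCode rval =
      mb.load_file("64bricks_1mhex_1024.h5m", 0, bcast_delete_opts);
  if (moab::MB_SUCCESS != rval)
    std::cerr << "load_file failed" << std::endl;

  (void)read_part_opts;  // swap this in to test the read_part mode
  MPI_Finalize();
  return 0;
}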


Meshes:
I'm using two different meshes -- the "small" and the "large":
  - "small" is the tjunc6 mesh, partitioned into 64 pieces (I haven't
run on more than 64 procs)
  - "large" is the 64bricks_1mhex mesh, partitioned into 1024 pieces.
Both partitions have been generated using mbzoltan with Alvaro's help.
A brief note: larger meshes, such as 64bricks_4mhex and (even more so) 64bricks_16mhex, cause a bad_alloc, so I'm not even attempting to read them at the moment.
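
As a back-of-envelope illustration of why bcast_delete runs out of memory on those meshes (the per-entity byte counts below are my assumptions, not measured MOAB numbers):

#include <cstdio>

int main()
{
  // Assumed sizes, not measured MOAB numbers: 8 vertex handles of 8 bytes
  // each per hex for connectivity, 3 double coordinates per vertex, and a
  // factor of 2 for sequence/adjacency/tag overhead.
  const double n_hex  = 16e6;   // 64bricks_16mhex
  const double n_vert = 16e6;   // roughly one vertex per hex in a brick mesh
  const double bytes  = 2.0 * (n_hex * 8 * 8 + n_vert * 3 * 8);
  std::printf("~%.1f GB held by EVERY proc before the delete step\n",
              bytes / (1024.0 * 1024.0 * 1024.0));
  return 0;
}

Under those assumptions that is roughly 2.6 GB per process for 64bricks_16mhex before anything gets deleted, which would make bad_alloc unsurprising with several processes per node.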

I'm running both the "stable" and "unstable" versions on tjunc6_64.h5m and 64bricks_1mhex_1024.h5m with the equivalent of
  mbparallelcomm_test 0 3 1 <meshfile> PARALLEL_PARTITION    (bcast_delete)
or
  mbparallelcomm_test -2 3 1 <meshfile> PARALLEL_PARTITION    (read_part).
In particular, I'm not looking at the effects of using mpi-io and other options.

The results are these: both the "stable" and "unstable" codes have no problem reading tjunc6_64 in any mode, but they both fail on 64bricks_1mhex_1024. Below is a description of the failure, which depends on the read mode:

=============================================================================
bcast_delete:
Essentially the same type of failure appears to occur with both the stable and unstable codes, running on 4 or 8 procs, with the equivalent of the following command (modulo the location of files):
mpiexec -np 4 mbparallelcomm_test 0 3 1 64bricks_1mhex_1024.h5m PARALLEL_PARTITION
It appears that the mesh is read in, but the resolution of shared entities causes a problem. From the rather cryptic output below, I'm not sure what sort of problem this may be.
------------------------------------------------------------------------------------------------------------------------
Using MPI from /gfs/software/software/mvapich2/1.0-2008-02-06-intel-shlib
Running on 1 nodes
4 cores per node, one MPI process per core
for a total of 4 MPI processes
.................................................
PBS nodefile:
n018
n018
n018
n018
.................................................
mpd ring nodefile:
n018:4
.................................................
Running mpdboot ...
done
Running mpdtrace ...
n018
done
Running mpdringtest ...
time for 1 loops = 0.000300884246826 seconds
done
Using MPIEXEC_CMD=/gfs/software/software/mvapich2/1.0-2008-02-06-intel-shlib/bin/mpiexec
-n 4
Commencing parallel run 64bricks_1mhex_1024 of executable
/home/karpeev/fathom/moab/stable/build/parallel/mbparallelcomm_test
Couldn't read mesh; error message:
Failed in step PARALLEL RESOLVE_SHARED_ENTS

Failed to find new entity in send list.
Trouble getting remote handles when packing entities.
Failed to pack entities from a sequence.
Packing entities failed.
Trouble resolving shared entity remote handles.

application called MPI_Abort(MPI_COMM_WORLD, 0) - process 2
Finished ........................................
Running mpdallexit ...
done

===========================================================================================
read_part:
The "unstable" code generates essentially the same error here as the
one that occurs with bcast_delete.
The "stable" coce, however, doesn't fail outright, but reads 0
entities on all procs:
--------------------------------------------------------------------------------------------------------------------------------------------------
Using MPI from /gfs/software/software/mvapich2/1.0-2008-02-06-intel-shlib
Running on 1 nodes
4 cores per node, one MPI process per core
for a total of 4 MPI processes
.................................................
PBS nodefile:
n032
n032
n032
n032
.................................................
mpd ring nodefile:
n032:4
.................................................
Running mpdboot ...
done
Running mpdtrace ...
n032
done
Running mpdringtest ...
time for 1 loops = 0.000304937362671 seconds
done
Using MPIEXEC_CMD=/gfs/software/software/mvapich2/1.0-2008-02-06-intel-shlib/bin/mpiexec
-n 4
Commencing parallel run 64bricks_1mhex_1024 of executable
/home/karpeev/fathom/moab/stable/build/parallel/mbparallelcomm_test
Proc 0 iface entities:
    0 0d iface entities.
    0 1d iface entities.
    0 2d iface entities.
    0 3d iface entities.
    (0 verts adj to other iface ents)
Proc 1 iface entities:
    0 0d iface entities.
    0 1d iface entities.
    0 2d iface entities.
    0 3d iface entities.
    (0 verts adj to other iface ents)
Proc 2 iface entities:
    0 0d iface entities.
    0 1d iface entities.
    0 2d iface entities.
    0 3d iface entities.
    (0 verts adj to other iface ents)
Proc 3 iface entities:
    0 0d iface entities.
    0 1d iface entities.
    0 2d iface entities.
    0 3d iface entities.
    (0 verts adj to other iface ents)
Proc 0: Success.
Proc 1: Success.
Proc 2: Success.
Proc 3: Success.
Finished ........................................
Running mpdallexit ...
done
============================================================================
My guess is that the "unstable" code is trying to do the "right" thing in both the bcast_delete and read_part cases, but fails (e.g., by running out of memory?). The "stable" code does something similar in the bcast_delete case, but behaves incorrectly (reads zero entity sets) in the read_part case.
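
In case it helps anyone reproduce this outside of mbparallelcomm_test, here is roughly the check I would wrap around the read; again just a sketch, meant to confirm per proc whether the reader itself fails or silently returns zero entities:

#include "moab/Core.hpp"
#include <mpi.h>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  moab::Core mb;
  moab::ErrorCode rval = mb.load_file(
      "64bricks_1mhex_1024.h5m", 0,
      "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;"
      "PARALLEL_RESOLVE_SHARED_ENTS");

  if (moab::MB_SUCCESS != rval) {
    std::string msg;
    mb.get_last_error(msg);   // per-proc detail beyond the bare error code
    std::cerr << "[" << rank << "] load_file failed: " << msg << std::endl;
  } else {
    // With read_part each proc should end up owning a nonzero chunk of hexes.
    int n_hex = 0;
    mb.get_number_entities_by_dimension(0, 3, n_hex);
    std::cerr << "[" << rank << "] owns " << n_hex << " hexes" << std::endl;
  }

  MPI_Finalize();
  return 0;
}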

Anybody have any idea about what's going on?

Thanks!
Dmitry.

