[MOAB-dev] scaling on intrepid

Rob Latham robl at mcs.anl.gov
Fri Jan 11 10:09:05 CST 2013


I guess Intrepid is old news and we should be looking at Mira now...

Distressingly long ago, I was working with Tim and Jason on scaling
MOAB on Intrepid.   What would happen with the stock MPI-IO library is
that MOAB would feed HDF5 a request, HDF5 would build up a complicated
MPI-IO workload, and the MPI-IO library on Intrepid would consume too
much memory and fail.

I came up with a scheme to fit ROMIO parameters to the available
memory.  The scheme seems to be working well for reads: I'm able to
scale up to 8k MPI processes without manually setting any hints
(except for the one that says "size this parameter automatically").
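In case it helps to see what that knob sits next to: the same class of
ROMIO hints can also be passed programmatically through an MPI_Info
object instead of the ROMIO_HINTS file.  The snippet below is only an
illustration using stock ROMIO hint names and a made-up buffer size;
it is not the auto-sizing hint my scheme relies on, whose name I'm not
reproducing here.

/* Illustration only: standard ROMIO hints set through MPI_Info.
 * The 4 MiB value and the output file name are placeholders. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);

    /* cap the two-phase collective buffer so the aggregators do not
     * exhaust a compute node's memory */
    MPI_Info_set(info, "cb_buffer_size", "4194304");
    /* force collective buffering for writes */
    MPI_Info_set(info, "romio_cb_write", "enable");

    MPI_File_open(MPI_COMM_WORLD, "out.h5m",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}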

The write step is presently causing me some grief, and the write
problem does not immediately look like an MPI-IO issue.

I was hoping I could run the experiment scenario by some MOAB folks as a sanity
check to make sure I am still driving MOAB in a correct and useful way.

I've been working with mbconvert like this:

NODES=2048
CORES=$(($NODES * 4))

# because read-only home file system
cd /intrepid-fs0/users/robl/scratch/moab-test

cqsub -t 30 -m vn -p BGQtools_esp -e ROMIO_HINTS=/home/robl/src/moab-svn/experiments/romio_hints:MPIRUN_LABEL=1:BG_COREDUMPONEXIT=1 \
        -n $NODES -c  $CORES /home/robl/src/moab-svn/build/tools/mbconvert\
        -O CPUTIME -O PARALLEL_GHOSTS=3.0.1 -O PARALLEL=READ_PART \
        -O PARALLEL_RESOLVE_SHARED_ENTS -O PARTITION -t \
        -o CPUTIME -o PARALLEL=WRITE_PART /intrepid-fs0/users/tautges/persistent/meshes/2bricks/nogeom/64bricks_8mtet_ng_rib_${CORES}.h5m \
        /intrepid-fs0/users/robl/scratch/moab/8mtet_ng-${CORES}-out.h5m

[ Unsurprisingly (since it crashed in the middle of the write),
/intrepid-fs0/users/robl/scratch/moab/8mtet_ng-8192-out.h5m exists,
but it is only about 7 KB and is not recognized as an HDF5 file. ]
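(For anyone who wants to repeat that check: one way to test whether
the truncated output carries a valid HDF5 signature is the stock
H5Fis_hdf5() call.  The sketch below just hard-codes the output path
from the run above; it is not how I actually checked it, only one
possible way.)

/* Report whether a file begins with a valid HDF5 signature. */
#include <stdio.h>
#include <hdf5.h>

int main(void)
{
    htri_t ok = H5Fis_hdf5(
        "/intrepid-fs0/users/robl/scratch/moab/8mtet_ng-8192-out.h5m");
    printf("valid HDF5? %s\n", ok > 0 ? "yes" : "no");
    return 0;
}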

I'm using moab-svn r5930.

The program terminates with "killed with signal 15", but the job has
only run for 20 minutes and I asked for 30.  I'll resubmit with a
60-minute limit.

I get this much output:
stdout[0] Parallel Read times: 
stdout[0]   47.6997 PARALLEL READ PART
stdout[0]   0.284176 PARALLEL RESOLVE_SHARED_ENTS
stdout[0]   1.98761 PARALLEL EXCHANGE_GHOSTS
stdout[0]   1.79284 PARALLEL RESOLVE_SHARED_SETS
stdout[0]   50.0319 PARALLEL TOTAL  
stdout[0]   real:   50.4s
stdout[0]   user:   50.4s  
stdout[0]   system: 0.0s

(That's some pretty awful performance: 50 seconds to read 317 MiB is
only about 6 MiB/s aggregate.  I'll get to that next once I've got
things actually performing at all.)

I dumped 8192 core files and stitched them together with
coreprocessor.  The backtrace is not very helpful.

- Everyone gets to this function:
moab::WriteHDF5::write_file_impl(char const*, bool,
  moab::FileOptions const&, unsigned int const*, int,
  std::vector<std::string, std::allocator<std::string> > const&,
  moab::TagInfo* const*, int, int)

- all but ten make it to MPI_Bcast (I think it's the "send ID to every
  proc" bcast at WriteHDF5Parallel.cpp:1058); a sketch of that hang
  pattern is below.

==rob


-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

