[MOAB-dev] moab: problem reading large files in parallel
Dmitry Karpeev
karpeev at mcs.anl.gov
Mon Mar 8 22:46:29 CST 2010
I keep running into a strange problem with mbparallelcomm_test:
While trying to read ilcmesh_16mtet.cub I observe this behavior:
a) with 4 cores (1 quadcore node) it dies (but not immediately) with a
strange (reproducible) message:
Fatal error in PMPI_Bcast: Invalid count, error stack:
PMPI_Bcast(1302): MPI_Bcast(buf=0x2ba3681a3010, count=-2009243095,
MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
PMPI_Bcast(1241): Negative count, value is -2009243095
Fatal error in PMPI_Bcast: Invalid count, error stack:
PMPI_Bcast(1302): MPI_Bcast(buf=0x2adbedb42010, count=-2009243095,
MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
PMPI_Bcast(1241): Negative count, value is -2009243095
Fatal error in PMPI_Bcast: Invalid count, error stack:
PMPI_Bcast(1302): MPI_Bcast(buf=0x2ac18ff41010, count=-2009243095,
MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
PMPI_Bcast(1241): Negative count, value is -2009243095
Fatal error in PMPI_Bcast: Invalid count, error stack:
PMPI_Bcast(1302): MPI_Bcast(buf=0x2b50581a2010, count=-2009243095,
MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
PMPI_Bcast(1241): Negative count, value is -2009243095
b) with 8 cores (2 nodes) it takes a very long time to read in the mesh.
I haven't figured out what's going on yet (debugging on cosmea isn't
easy, and I haven't been able to reproduce the problem on my laptop),
but is it possible that with 4 cores the local partition is too large
and its byte count overflows a signed integer?
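
As a quick check of the overflow guess: if the negative count is a byte total that wrapped once around 2^32, then -2009243095 corresponds to 2285724201 bytes (about 2.1 GiB), which is more than a signed 32-bit count can hold. The sketch below is not MOAB code. Both narrowed_count and plan_chunks are hypothetical helpers written only for illustration, and the single-wrap interpretation is an assumption. It reproduces the reported value and shows one way a large buffer could be split into pieces that each fit in an int count.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: the value a 32-bit signed count ends up holding when a
// larger byte total is narrowed into it. The wrap is modular on
// two's-complement targets (guaranteed by the standard since C++20).
std::int32_t narrowed_count(std::uint64_t total_bytes) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(total_bytes));
}

// Hypothetical helper: split total_bytes into (offset, count) pieces so that
// every count is positive and no larger than max_chunk, i.e. fits in an int.
std::vector<std::pair<std::uint64_t, int>> plan_chunks(std::uint64_t total_bytes,
                                                       int max_chunk = INT_MAX) {
    assert(max_chunk > 0);
    std::vector<std::pair<std::uint64_t, int>> chunks;
    std::uint64_t offset = 0;
    while (offset < total_bytes) {
        const std::uint64_t remaining = total_bytes - offset;
        const int count = remaining > static_cast<std::uint64_t>(max_chunk)
                              ? max_chunk
                              : static_cast<int>(remaining);
        chunks.emplace_back(offset, count);
        offset += static_cast<std::uint64_t>(count);
    }
    return chunks;
}
```

Each (offset, count) pair could then be passed to its own MPI_Bcast(buf + offset, count, MPI_UNSIGNED_CHAR, root, comm) call, with every rank looping over the same plan. Whether the reader actually broadcasts the whole file in a single call is still only a guess at this point.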
It's not clear what is happening with 8 cores, but I have seen the same
slowdown with a much smaller file, ho_test.cub: it is read without a
problem on 4 cores, but takes much longer on 8 cores and eventually
exceeds the modest 10-minute time limit.
Finally, with 64bricks_16mhex.cub my 4 core run dies with an unhandled
"bad_alloc" exception.
Any idea about what might be going on?
I'm running on cosmea with an mpich2 build that uses the Intel compilers.
I'm about to try this on fusion, but I had trouble using its native
mvapich build, so I'm going to have to use my own, and it looks like
it will have to be built with gcc, since the Intel build crashes.
Thanks!
Dmitry.