[MOAB-dev] moab: problem reading large files in parallel
Tim Tautges
tautges at mcs.anl.gov
Tue Mar 9 06:59:22 CST 2010
Are these runs all using the bcast_delete read method? For meshes that large, it could easily be that the whole mesh
won't fit on a single processor, which that read method requires. Even read_delete might fail there, since it has the
same requirement. Also, is cosmea a 32-bit OS? That might limit things further (your laptop is probably 64-bit?).
- tim
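
The negative count in the quoted error below is consistent with a buffer size passing 2^31 - 1 bytes, since the count argument of MPI_Bcast is a C int. A minimal sketch of that arithmetic, assuming a 32-bit two's-complement int and assuming the count is the byte length of the broadcast buffer (the datatype in the log is MPI_UNSIGNED_CHAR). The helper names are hypothetical and are not part of MOAB or MPI. Under those assumptions, the smallest size that wraps to -2009243095 is 2285724201 bytes, just over the 2147483647 limit. The last helper shows one possible workaround, splitting a large buffer into pieces that each fit in an int; it is not a description of what MOAB does.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed int can hold


def wrap_to_int32(n):
    """Value left after truncating n to a 32-bit two's-complement int (assumed width)."""
    return (n + 2**31) % 2**32 - 2**31


def smallest_size_for(count):
    """Smallest non-negative size that wraps to the given int value."""
    return count % 2**32


def chunk_ranges(total, max_chunk=INT32_MAX):
    """Split [0, total) into (offset, length) pieces, each no longer than max_chunk."""
    return [(off, min(max_chunk, total - off)) for off in range(0, total, max_chunk)]


reported = -2009243095              # count taken from the PMPI_Bcast error message
size = smallest_size_for(reported)  # candidate buffer size in bytes
print(size, size > INT32_MAX)
print(wrap_to_int32(size) == reported)
print(chunk_ranges(size))           # pieces that would each fit in an int count
```

If the assumptions hold, the buffer was only slightly above 2 GiB, which fits the observation that the failure is reproducible and identical on every rank.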
Dmitry Karpeev wrote:
> I keep running into a strange problem with mbparallelcomm_test:
>
> While trying to read ilcmesh_16mtet.cub I observe this behavior:
> a) with 4 cores (1 quadcore node) it dies (but not immediately) with a
> strange (reproducible) message:
> Fatal error in PMPI_Bcast: Invalid count, error stack:
> PMPI_Bcast(1302): MPI_Bcast(buf=0x2ba3681a3010, count=-2009243095,
> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
> PMPI_Bcast(1241): Negative count, value is -2009243095
> Fatal error in PMPI_Bcast: Invalid count, error stack:
> PMPI_Bcast(1302): MPI_Bcast(buf=0x2adbedb42010, count=-2009243095,
> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
> PMPI_Bcast(1241): Negative count, value is -2009243095
> Fatal error in PMPI_Bcast: Invalid count, error stack:
> PMPI_Bcast(1302): MPI_Bcast(buf=0x2ac18ff41010, count=-2009243095,
> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
> PMPI_Bcast(1241): Negative count, value is -2009243095
> Fatal error in PMPI_Bcast: Invalid count, error stack:
> PMPI_Bcast(1302): MPI_Bcast(buf=0x2b50581a2010, count=-2009243095,
> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
> PMPI_Bcast(1241): Negative count, value is -2009243095
>
> b) with 8 cores (2 nodes) it takes a very long time to read in the mesh.
> I haven't figured out what's going on yet (debugging on cosmea isn't
> easy, and I haven't been able to reproduce the problem on my laptop),
> but is it possible that with 4 cores the local partition is too large
> and a signed integer overflows when given a very large count?
>
> It's not clear what's happening with 8 cores, but I have seen a read
> take much longer on 8 procs with a much smaller file: ho_test.cub.
> It is read without a problem on 4 cores, but takes much longer on
> 8 cores (and eventually exceeds the modest 10 minute time limit).
>
> Finally, with 64bricks_16mhex.cub my 4 core run dies of a "bad_alloc"
> -- an unhandled exception.
>
> Any idea about what might be going on?
> I'm running on cosmea with an mpich2 build using the Intel compilers.
> I'm about to try this on fusion, but I had trouble using its native
> mvapich build, so I'm going to have to use my own, and it looks like
> it will have to be gcc -- the Intel build crashes.
>
> Thanks!
> Dmitry.
>
--
================================================================
"You will keep in perfect peace him whose mind is
steadfast, because he trusts in you." Isaiah 26:3
Tim Tautges Argonne National Laboratory
(tautges at mcs.anl.gov) (telecommuting from UW-Madison)
phone: (608) 263-8485 1500 Engineering Dr.
fax: (608) 263-4499 Madison, WI 53706