[MOAB-dev] moab: problem reading large files in parallel

Tue Mar 9 06:59:22 CST 2010

Are these runs all using the bcast_delete read method?  For meshes that large, the whole mesh may well not fit in the
memory of a single processor, which that read method requires.  Even read_delete might fail there, since it has the
same requirement.  Also, is cosmea running a 32-bit OS?  That would limit things further (your laptop is probably 64-bit?).
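
For what it's worth, the negative count in the error stack is consistent with a 32-bit overflow: the count argument of
MPI_Bcast is a C int, so a byte count above 2^31 - 1 wraps negative.  Below is a minimal Python sketch of that
arithmetic, assuming the count was a byte size narrowed to a signed 32-bit int.  The helper names are made up for
illustration and are not part of MOAB or MPI.

```python
INT32_MAX = 2**31 - 1


def wrap_to_int32(n: int) -> int:
    """Interpret the low 32 bits of n as a signed two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n > INT32_MAX else n


def smallest_size_for(count: int) -> int:
    """Smallest non-negative size that wraps to the given signed 32-bit count."""
    return count % 2**32


# Count reported in the PMPI_Bcast error stack.
reported_count = -2009243095

# Assumption: the true buffer size is the smallest value that wraps to this count.
size = smallest_size_for(reported_count)

print(size, size > INT32_MAX, wrap_to_int32(size) == reported_count)
# prints: 2285724201 True True
```

Under that assumption the broadcast buffer was at least 2285724201 bytes (about 2.13 GiB), which is past the largest
value a signed 32-bit count can hold.  Any size that differs from this by a multiple of 2^32 would wrap to the same
count, so this is only a lower bound.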

- tim

Dmitry Karpeev wrote:
> I keep running into a strange problem with mbparallelcomm_test.
> 
> While trying to read ilcmesh_16mtet.cub I observe this behavior:
> a) With 4 cores (1 quad-core node) it dies (though not immediately) with a
> strange but reproducible message:
> Fatal error in PMPI_Bcast: Invalid count, error stack:
> PMPI_Bcast(1302): MPI_Bcast(buf=0x2ba3681a3010, count=-2009243095,
> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
> PMPI_Bcast(1241): Negative count, value is -2009243095
> Fatal error in PMPI_Bcast: Invalid count, error stack:
> PMPI_Bcast(1302): MPI_Bcast(buf=0x2adbedb42010, count=-2009243095,
> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
> PMPI_Bcast(1241): Negative count, value is -2009243095
> Fatal error in PMPI_Bcast: Invalid count, error stack:
> PMPI_Bcast(1302): MPI_Bcast(buf=0x2ac18ff41010, count=-2009243095,
> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
> PMPI_Bcast(1241): Negative count, value is -2009243095
> Fatal error in PMPI_Bcast: Invalid count, error stack:
> PMPI_Bcast(1302): MPI_Bcast(buf=0x2b50581a2010, count=-2009243095,
> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
> PMPI_Bcast(1241): Negative count, value is -2009243095
> 
> b) With 8 cores (2 nodes) it takes a very long time to read in the mesh.
> I haven't figured out what's going on yet (debugging on cosmea isn't
> easy, and I haven't been able to reproduce the problem on my laptop),
> but is it possible that with 4 cores the local partition is too large
> and a signed integer overflows when it is given a very large number?
> 
> It's not clear what's happening with 8 cores, but I have seen reads take
> much longer on 8 procs with a much smaller file, ho_test.cub.
> It is read without a problem on 4 cores, but on 8 cores it takes much
> longer and eventually exceeds the modest 10-minute time limit.
> 
> Finally, with 64bricks_16mhex.cub my 4-core run dies with an unhandled
> "bad_alloc" exception.
> 
> Any idea what might be going on?
> I'm running on cosmea with an mpich2 build using the Intel compilers.
> I'm about to try this on fusion, but I had trouble using its native
> mvapich build, so I'm going to have to use my own, and it looks like
> it will have to be built with gcc, since the Intel build crashes.
> 
> Thanks!
> Dmitry.
> 

-- 
================================================================
"You will keep in perfect peace him whose mind is
   steadfast, because he trusts in you."               Isaiah 26:3

              Tim Tautges            Argonne National Laboratory
          (tautges at mcs.anl.gov)      (telecommuting from UW-Madison)
          phone: (608) 263-8485      1500 Engineering Dr.
            fax: (608) 263-4499      Madison, WI 53706