[MOAB-dev] moab: problem reading large files in parallel
Dmitry Karpeev
karpeev at mcs.anl.gov
Tue Mar 9 07:17:17 CST 2010
Cosmea is 64-bit and, yes, I'm using bcast_delete.
I'm just a bit surprised by a strange failure on the 16mhex mesh on 4 cores:
a negative value where MPI expects a positive?
I can better grasp the "bad_alloc" on a 16mtet mesh.
Also, now that I think about it, I don't exactly understand how "bcast" works.
Thanks!
Dmitry.
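
A quick arithmetic check on the negative count, as a sketch only: assuming the byte count was computed in a 64-bit type and then narrowed to the 32-bit signed int that MPI_Bcast takes as its count, a value of -2009243095 is what a buffer of 2285724201 bytes (just over 2^31) would wrap to. The helper names below are hypothetical and are not part of MOAB or MPI; any size that differs from this one by a multiple of 2^32 would produce the same count.

```cpp
#include <cstdint>

// Value a byte count takes when only its low 32 bits are kept and
// reinterpreted as a signed 32-bit integer (two's complement).
// Hypothetical helper, for illustration only.
constexpr std::int64_t wrap_to_int32(std::uint64_t n) {
    return static_cast<std::int64_t>(n & 0xFFFFFFFFull) > INT32_MAX
        ? static_cast<std::int64_t>(n & 0xFFFFFFFFull) - (std::int64_t{1} << 32)
        : static_cast<std::int64_t>(n & 0xFFFFFFFFull);
}

// Smallest non-negative size that wraps to the given 32-bit count.
// Hypothetical helper, for illustration only.
constexpr std::uint64_t smallest_size_for(std::int32_t count) {
    return count < 0
        ? static_cast<std::uint64_t>(std::int64_t{count} + (std::int64_t{1} << 32))
        : static_cast<std::uint64_t>(count);
}

// The count from the error message corresponds to a size above INT32_MAX.
static_assert(smallest_size_for(-2009243095) == 2285724201ull,
              "reported count maps back to a size just over 2^31 bytes");
static_assert(wrap_to_int32(2285724201ull) == -2009243095,
              "that size wraps to the reported negative count");
```

If that reading is right, the message is less about MPI itself and more about the buffer being larger than a signed 32-bit count can express.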
On Tue, Mar 9, 2010 at 6:59 AM, Tim Tautges <tautges at mcs.anl.gov> wrote:
> Are these all using a bcast_delete read method? For meshes that large, it
> could easily be that the whole mesh won't fit onto a single processor, which
> is required for that read method. Even read_delete might fail there, since
> the same thing is required. Also, is cosmea a 32-bit OS? That might
> further limit things (your laptop is probably 64-bit?)
>
> - tim
>
> Dmitry Karpeev wrote:
>>
>> I keep running into a strange problem with mbparallelcomm_test:
>>
>> While trying to read ilcmesh_16mtet.cub I observe this behavior:
>> a) with 4 cores (1 quadcore node) it dies (but not immediately) with a
>> strange (reproducible) message:
>> Fatal error in PMPI_Bcast: Invalid count, error stack:
>> PMPI_Bcast(1302): MPI_Bcast(buf=0x2ba3681a3010, count=-2009243095,
>> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
>> PMPI_Bcast(1241): Negative count, value is -2009243095
>> Fatal error in PMPI_Bcast: Invalid count, error stack:
>> PMPI_Bcast(1302): MPI_Bcast(buf=0x2adbedb42010, count=-2009243095,
>> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
>> PMPI_Bcast(1241): Negative count, value is -2009243095
>> Fatal error in PMPI_Bcast: Invalid count, error stack:
>> PMPI_Bcast(1302): MPI_Bcast(buf=0x2ac18ff41010, count=-2009243095,
>> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
>> PMPI_Bcast(1241): Negative count, value is -2009243095
>> Fatal error in PMPI_Bcast: Invalid count, error stack:
>> PMPI_Bcast(1302): MPI_Bcast(buf=0x2b50581a2010, count=-2009243095,
>> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
>> PMPI_Bcast(1241): Negative count, value is -2009243095
>>
>> b) with 8 cores (2 nodes) it takes a very long time to read in the mesh.
>> I haven't figured out what's going on yet (debugging on cosmea isn't
>> easy, and I haven't been able to reproduce the problem on my laptop),
>> but is it possible that with 4 cores the local partition is too large
>> and a signed integer overflows when given a very large number?
>>
>> Not clear what's happening with 8 cores, but I have seen it take much
>> longer on 8 procs with a much smaller file: ho_test.cub.
>> It is read without a problem on 4 cores, but takes much longer on
>> 8 cores (and eventually exceeds the modest 10 minute time limit).
>>
>> Finally, with 64bricks_16mhex.cub my 4 core run dies of a "bad_alloc"
>> -- an unhandled exception.
>>
>> Any idea about what might be going on?
>> I'm running on cosmea with an mpich2 build using the Intel compilers.
>> I'm about to try this on fusion, but I had trouble using its native
>> mvapich build, so I'm going to have to use my own, and it looks like
>> it will have to be gcc -- the Intel build crashes.
>>
>> Thanks!
>> Dmitry.
>>
>
> --
> ================================================================
> "You will keep in perfect peace him whose mind is
> steadfast, because he trusts in you." Isaiah 26:3
>
> Tim Tautges Argonne National Laboratory
> (tautges at mcs.anl.gov) (telecommuting from UW-Madison)
> phone: (608) 263-8485 1500 Engineering Dr.
> fax: (608) 263-4499 Madison, WI 53706
>
>
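
MPI_Bcast takes its count as a C int, so a buffer of more than INT_MAX bytes cannot be described in a single call with MPI_UNSIGNED_CHAR. One common workaround is to broadcast the buffer in pieces. The sketch below shows only the chunking arithmetic; `plan_chunks` is a hypothetical helper, not a MOAB or MPI function, and nothing in this thread says MOAB handles the problem this way. It also does nothing about the memory needed to hold the whole mesh on each processor.

```cpp
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Split `total` bytes into (offset, length) pieces whose lengths each fit
// in an int, so every piece is a valid MPI count. Hypothetical helper.
std::vector<std::pair<std::size_t, int>>
plan_chunks(std::size_t total, int max_chunk = INT_MAX) {
    std::vector<std::pair<std::size_t, int>> chunks;
    if (max_chunk <= 0) {
        return chunks;  // no valid chunk size, nothing to plan
    }
    std::size_t offset = 0;
    while (offset < total) {
        const std::size_t remaining = total - offset;
        const int len = remaining < static_cast<std::size_t>(max_chunk)
                            ? static_cast<int>(remaining)
                            : max_chunk;
        chunks.emplace_back(offset, len);
        offset += static_cast<std::size_t>(len);
    }
    return chunks;
}

// In an MPI program, every rank would compute the same plan and then call,
// for each (offset, len) pair:
//   MPI_Bcast(buf + offset, len, MPI_UNSIGNED_CHAR, root, comm);
```

For example, on a 64-bit system a 2285724201-byte buffer would be planned as two pieces: INT_MAX bytes starting at offset 0, then the remaining 138240554 bytes.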