[MOAB-dev] moab: problem reading large files in parallel

Tue Mar 9 07:17:17 CST 2010

cosmea is 64-bit and, yes, I'm using bcast_delete.
I'm just a bit surprised by a strange failure on the 16mhex mesh on 4 cores:
a negative value where MPI expects a positive?
I can better grasp the "bad_alloc" on a 16mtet mesh.
Also, now that I think about it, I don't exactly understand how "bcast" works.
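
On the negative count: MPI_Bcast takes its count as a C int, so one
possible explanation (an assumption, not something I have confirmed in
the code) is that a buffer size above INT_MAX was narrowed to 32 bits.
The value in the error is consistent with that: -2009243095 + 2^32 =
2285724201 bytes, a little over 2 GiB. Below is a minimal sketch of that
arithmetic, plus one way such a buffer could be split into pieces that
each fit in an int. The helper names narrow_to_int32 and chunk_counts
are made up for illustration and are not MOAB or MPI functions.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: the value a 32-bit signed count ends up holding
// when a larger byte count is narrowed into it (two's-complement wrap).
std::int32_t narrow_to_int32(std::uint64_t nbytes) {
    std::uint32_t low = static_cast<std::uint32_t>(nbytes);  // keep low 32 bits
    if (low <= static_cast<std::uint32_t>(INT32_MAX)) {
        return static_cast<std::int32_t>(low);
    }
    // Wrap by hand so the result does not depend on implementation-defined casts.
    return static_cast<std::int32_t>(static_cast<std::int64_t>(low) - 4294967296LL);
}

// Hypothetical helper: split nbytes into (offset, count) pieces with
// every count <= max_chunk, so each count is a valid positive int.
// Each piece could then be sent with its own call, e.g.
//   MPI_Bcast(buf + offset, count, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
std::vector<std::pair<std::uint64_t, int>> chunk_counts(std::uint64_t nbytes,
                                                        int max_chunk = INT_MAX) {
    assert(max_chunk > 0);
    std::vector<std::pair<std::uint64_t, int>> chunks;
    std::uint64_t offset = 0;
    while (offset < nbytes) {
        std::uint64_t n = std::min<std::uint64_t>(
            nbytes - offset, static_cast<std::uint64_t>(max_chunk));
        chunks.emplace_back(offset, static_cast<int>(n));
        offset += n;
    }
    return chunks;
}
```

With the byte count inferred above, narrow_to_int32(2285724201) gives
back the count from the error message, and chunk_counts(2285724201)
yields two pieces. Chunking only works around the int limit; it does not
help if the whole mesh does not fit in memory on one processor.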

Thanks!
Dmitry.

On Tue, Mar 9, 2010 at 6:59 AM, Tim Tautges <tautges at mcs.anl.gov> wrote:
> Are these all using a bcast_delete read method?  For meshes that large, it
> could easily be that the whole mesh won't fit onto a single processor, which
> is required for that read method.  Even read_delete might fail there, since
> the same thing is required.  Also, is cosmea a 32-bit OS?  That might
> further limit things (your laptop is probably 64-bit?)
>
> - tim
>
> Dmitry Karpeev wrote:
>>
>> I keep running into a strange problem with mbparallelcomm_test:
>>
>> While trying to read ilcmesh_16mtet.cub I observe this behavior:
>> a) with 4 cores (1 quadcore node) it dies (but not immediately) with a
>> strange (reproducible) message:
>> Fatal error in PMPI_Bcast: Invalid count, error stack:
>> PMPI_Bcast(1302): MPI_Bcast(buf=0x2ba3681a3010, count=-2009243095,
>> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
>> PMPI_Bcast(1241): Negative count, value is -2009243095
>> Fatal error in PMPI_Bcast: Invalid count, error stack:
>> PMPI_Bcast(1302): MPI_Bcast(buf=0x2adbedb42010, count=-2009243095,
>> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
>> PMPI_Bcast(1241): Negative count, value is -2009243095
>> Fatal error in PMPI_Bcast: Invalid count, error stack:
>> PMPI_Bcast(1302): MPI_Bcast(buf=0x2ac18ff41010, count=-2009243095,
>> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
>> PMPI_Bcast(1241): Negative count, value is -2009243095
>> Fatal error in PMPI_Bcast: Invalid count, error stack:
>> PMPI_Bcast(1302): MPI_Bcast(buf=0x2b50581a2010, count=-2009243095,
>> MPI_UNSIGNED_CHAR, root=0, MPI_COMM_WORLD) failed
>> PMPI_Bcast(1241): Negative count, value is -2009243095
>>
>> b) with 8 cores (2 nodes) it takes a very long time to read in the mesh.
>> I haven't figured out what's going on yet (debugging on cosmea isn't
>> easy, and I haven't been able to reproduce the problem on my laptop),
>> but is it possible that with 4 cores the local partition is too large
>> and its size overflows a signed integer?
>>
>> Not clear what's happening with 8 cores, but I have seen it take much
>> longer on 8 procs with a much smaller file, ho_test.cub.
>> It is read without a problem on 4 cores, but takes much longer on 8 cores
>> (and eventually exceeds the modest 10 minute time limit).
>>
>> Finally, with 64bricks_16mhex.cub my 4 core run dies of a "bad_alloc"
>> -- an unhandled exception.
>>
>> Any idea about what might be going on?
>> I'm running on cosmea with an mpich2 build using the Intel compilers.
>> I'm about to try this on fusion, but I had trouble using its native
>> mvapich build, so I'm going to have to use my own, and it looks like
>> it will have to be a gcc build -- the Intel build crashes.
>>
>> Thanks!
>> Dmitry.
>>
>
> --
> ================================================================
> "You will keep in perfect peace him whose mind is
>  steadfast, because he trusts in you."               Isaiah 26:3
>
>             Tim Tautges            Argonne National Laboratory
>         (tautges at mcs.anl.gov)      (telecommuting from UW-Madison)
>         phone: (608) 263-8485      1500 Engineering Dr.
>           fax: (608) 263-4499      Madison, WI 53706