[mpich-discuss] Overflow in MPI_Aint
Christina Patrick
christina.subscribes at gmail.com
Tue Jul 21 15:43:46 CDT 2009
Hi Joe,
Thank you very much for the patch. I will give it a try.
Regards,
Christina.
On Tue, Jul 21, 2009 at 4:10 PM, Joseph Ratterman<jratt at us.ibm.com> wrote:
> Pavan,
>
> There are a number of MPI_Aint changes that are not in 1.1. For example,
> there is a block of changes in configure.in that relates to this, and it is
> only in the BGP version.
>
>
>
> Christina,
>
> I ran it on our system, and it exited without error after creating an 8GB
> file.  You can find the patches here:
> http://dcmf.anl-external.org/patches/patch-delete
> http://dcmf.anl-external.org/patches/1.1.0.patch.gz
> http://dcmf.anl-external.org/patches/1.0.7.patch.gz
>
> For 1.0.7, the patch should apply normally.  The 1.1.0 patch was made much
> smaller by not including deleted files, so you will also have to run
> "patch-delete" against the resulting tree:
> zcat patch.gz | patch -p0 --force --directory=mpich2-1.1
> zcat patch.gz | ./patch-delete --directory=mpich2-1.1
>
> For both, you will also have to do a "developer style build" (see link), and
> use the "--with-aint-size=8" option that Pavan mentioned.
> http://wiki.mcs.anl.gov/mpich2/index.php/Getting_And_Building_MPICH2
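>
> A minimal sketch of that step (the install prefix below is only a
> placeholder; the wiki page covers the developer-build prerequisites, which
> matter here because the patch also touches configure.in):
> cd mpich2-1.1
> ./configure --with-aint-size=8 --prefix=/path/to/install
> make
> make install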
>
>
>
> $ mpirun -np 16 io
> rows = 32768
> rows = 32768
> rows = 32768
> rows = 32768
> rows = 32768
> rows = 32768
> rows = 32768
> rows = 32768
> rows = 32768
> cols = 32768
> rows = 32768
> rows = 32768
> rows = 32768
> rows = 32768
> cols = 32768
> rows = 32768
> Filesize = 0
> cols = 32768
> rows = 32768
> cols = 32768
> cols = 32768
> cols = 32768
> cols = 32768
> cols = 32768
> cols = 32768
> Element size = 8
> cols = 32768
> cols = 32768
> cols = 32768
> cols = 32768
> Element size = 8
> cols = 32768
> rows = 32768
> Element size = 8
> cols = 32768
> Element size = 8
> Element size = 8
> Element size = 8
> Element size = 8
> Element size = 8
> Element size = 8
> Filesize = 0
> Element size = 8
> Element size = 8
> Element size = 8
> Element size = 8
> Filesize = 0
> Element size = 8
> cols = 32768
> Filesize = 0
> Element size = 8
> Filesize = 0
> Filesize = 0
> Filesize = 0
> Filesize = 0
> Filesize = 0
> Filesize = 0
> Coll buf size = 8388608
> Filesize = 0
> Filesize = 0
> Filesize = 0
> Filesize = 0
> Coll buf size = 8388608
> Filesize = 0
> Element size = 8
> Coll buf size = 8388608
> Filesize = 0
> Coll buf size = 8388608
> Coll buf size = 8388608
> Coll buf size = 8388608
> Coll buf size = 8388608
> Coll buf size = 8388608
> Coll buf size = 8388608
> Rows in coll buf = 512
> Coll buf size = 8388608
> Coll buf size = 8388608
> Coll buf size = 8388608
> Coll buf size = 8388608
> Rows in coll buf = 512
> Coll buf size = 8388608
> Filesize = 0
> Rows in coll buf = 512
> Coll buf size = 8388608
> Rows in coll buf = 512
> Rows in coll buf = 512
> Rows in coll buf = 512
> Rows in coll buf = 512
> Rows in coll buf = 512
> Rows in coll buf = 512
> Cols in coll buf = 2048
> Rows in coll buf = 512
> Rows in coll buf = 512
> Rows in coll buf = 512
> Rows in coll buf = 512
> Cols in coll buf = 2048
> Rows in coll buf = 512
> Coll buf size = 8388608
> Cols in coll buf = 2048
> Rows in coll buf = 512
> Cols in coll buf = 2048
> Cols in coll buf = 2048
> Cols in coll buf = 2048
> Cols in coll buf = 2048
> Cols in coll buf = 2048
> Cols in coll buf = 2048
> array_size[] = {32768, 32768}
> Cols in coll buf = 2048
> Cols in coll buf = 2048
> Cols in coll buf = 2048
> Cols in coll buf = 2048
> array_size[] = {32768, 32768}
> Cols in coll buf = 2048
> Rows in coll buf = 512
> array_size[] = {32768, 32768}
> Cols in coll buf = 2048
> array_size[] = {32768, 32768}
> array_size[] = {32768, 32768}
> array_size[] = {32768, 32768}
> array_size[] = {32768, 32768}
> array_size[] = {32768, 32768}
> array_size[] = {32768, 32768}
> array_subsize[] = {32768, 2048}
> array_size[] = {32768, 32768}
> array_size[] = {32768, 32768}
> array_size[] = {32768, 32768}
> array_size[] = {32768, 32768}
> array_subsize[] = {32768, 2048}
> array_size[] = {32768, 32768}
> Cols in coll buf = 2048
> array_subsize[] = {32768, 2048}
> array_size[] = {32768, 32768}
> array_subsize[] = {32768, 2048}
> array_subsize[] = {32768, 2048}
> array_subsize[] = {32768, 2048}
> array_subsize[] = {32768, 2048}
> array_subsize[] = {32768, 2048}
> array_subsize[] = {32768, 2048}
> array_start[] = {0, 14336}
> array_subsize[] = {32768, 2048}
> array_subsize[] = {32768, 2048}
> array_subsize[] = {32768, 2048}
> array_subsize[] = {32768, 2048}
> array_start[] = {0, 22528}
> array_subsize[] = {32768, 2048}
> array_size[] = {32768, 32768}
> array_start[] = {0, 28672}
> array_subsize[] = {32768, 2048}
> array_start[] = {0, 4096}
> array_start[] = {0, 6144}
> array_start[] = {0, 12288}
> array_start[] = {0, 18432}
> array_start[] = {0, 2048}
> array_start[] = {0, 30720}
> array_start[] = {0, 20480}
> array_start[] = {0, 24576}
> array_start[] = {0, 16384}
> array_start[] = {0, 8192}
> array_start[] = {0, 26624}
> array_subsize[] = {32768, 2048}
> array_start[] = {0, 10240}
> array_start[] = {0, 0}
>
>
>
> $ ls -l testfile
> -rw-r--r-- 1 jratt jratt 8589934592 2009-07-21 14:40 testfile
>
>
>
>
> Thanks,
> Joe Ratterman
> IBM Blue Gene/P Messaging
> jratt at us.ibm.com
>
>
>
>
>
> From: Pavan Balaji <balaji at mcs.anl.gov>
> To: mpich-discuss at mcs.anl.gov
> Cc: Joseph Ratterman/Rochester/IBM at IBMUS
> Date: 07/21/09 01:55 PM
> Subject: Re: [mpich-discuss] Overflow in MPI_Aint
> ________________________________
>
>
>
> Joe: I believe all the Aint-related patches have already gone into 1.1.
> Is something still missing?
>
> Christina: If you update to mpich2-1.1, you can try the --with-aint-size
> configure option. Note that this has not been tested on anything other
> than BG/P, but it might be worth a shot.
>
> -- Pavan
>
> On 07/21/2009 01:49 PM, Christina Patrick wrote:
>> Hi Joe,
>>
>> I am attaching my test case in this email. If you run it with any
>> number of processes except one, it will give you the SIGFPE error.
>> Similarly if you change the write in this program to a read, you will
>> get the same problem.
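>>
>> In outline, the test does roughly the following (the attached file has the
>> exact code; the decomposition and 512-row chunking below are reconstructed
>> from the values printed in the run output above, so treat this only as a
>> sketch):
>>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> #define ROWS 32768
>> #define COLS 32768
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, nprocs, i, r;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>
>>     /* Column-block decomposition; assumes nprocs divides COLS (e.g. 16). */
>>     int array_size[2]    = { ROWS, COLS };
>>     int array_subsize[2] = { ROWS, COLS / nprocs };
>>     int array_start[2]   = { 0, rank * (COLS / nprocs) };
>>
>>     MPI_Datatype filetype;
>>     MPI_Type_create_subarray(2, array_size, array_subsize, array_start,
>>                              MPI_ORDER_C, MPI_DOUBLE, &filetype);
>>     MPI_Type_commit(&filetype);
>>
>>     MPI_File fh;
>>     MPI_File_open(MPI_COMM_WORLD, "testfile",
>>                   MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
>>     /* The extent of this filetype is the whole 8GB array, which is what
>>        overflows a 32-bit MPI_Aint inside ROMIO. */
>>     MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
>>
>>     /* Write the local column block in 512-row chunks (8MB each, matching
>>        the collective buffer size reported in the output). */
>>     int chunk_rows  = 512;
>>     int chunk_elems = chunk_rows * (COLS / nprocs);
>>     double *buf = (double *)malloc(chunk_elems * sizeof(double));
>>     for (i = 0; i < chunk_elems; i++)
>>         buf[i] = (double)rank;
>>
>>     for (r = 0; r < ROWS; r += chunk_rows)
>>         MPI_File_write_all(fh, buf, chunk_elems, MPI_DOUBLE,
>>                            MPI_STATUS_IGNORE);
>>
>>     free(buf);
>>     MPI_File_close(&fh);
>>     MPI_Type_free(&filetype);
>>     MPI_Finalize();
>>     return 0;
>> }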
>>
>> I would sure appreciate a patch for this problem. If it is not too
>> much trouble, could you please give me the patch? I could try making
>> the corresponding changes to my setup.
>>
>> Thanks and Regards,
>> Christina.
>>
>> On Tue, Jul 21, 2009 at 2:06 PM, Joe Ratterman<jratt at us.ibm.com> wrote:
>>> Christina,
>>> Blue Gene/P is a 32-bit platform where we have hit similar problems. To get
>>> around this, we increased the size of MPI_Aint in MPICH2 to be larger than
>>> void*, to 64 bits. I suspect that your test case would work on our system,
>>> and I would like to see your test code if that is possible. It should run on
>>> our system, and I would like to make sure we have it correct.
>>> If you are interested, we have patches against 1.0.7 and 1.1.0 that you can
>>> use (we skipped 1.0.8). If you can build MPICH2 using those patches, you
>>> may be able to run your application. On the other hand, they may be too
>>> specific to our platform. We have been working with ANL to incorporate our
>>> changes into the standard MPICH2 releases, but there isn't a lot of demand
>>> for 64-bit MPI-IO on 32-bit machines.
>>>
>>> Thanks,
>>> Joe Ratterman
>>> IBM Blue Gene/P Messaging
>>> jratt at us.ibm.com
>>>
>>>
>>> On Fri, Jul 17, 2009 at 7:12 PM, Christina Patrick
>>> <christina.subscribes at gmail.com> wrote:
>>>> Hi Pavan,
>>>>
>>>> I ran the command
>>>>
>>>> $ getconf | grep -i WORD
>>>> WORD_BIT=32
>>>>
>>>> So I guess it is a 32 bit system.
>>>>
>>>> Thanks and Regards,
>>>> Christina.
>>>>
>>>> On Fri, Jul 17, 2009 at 8:06 PM, Pavan Balaji<balaji at mcs.anl.gov> wrote:
>>>>> Is it a 32-bit system? MPI_Aint is the size of a (void *), so on 32-bit
>>>>> systems it's restricted to 2GB.
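>>>>>
>>>>> For reference, a minimal standalone check of what a given MPICH2 build
>>>>> uses (a quick sanity check; compile with mpicc and run on one process):
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <mpi.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     MPI_Init(&argc, &argv);
>>>>>     /* On a stock 32-bit build both sizes come out as 4; with the
>>>>>        64-bit MPI_Aint changes, MPI_Aint should be 8. */
>>>>>     printf("sizeof(MPI_Aint) = %d, sizeof(void *) = %d\n",
>>>>>            (int)sizeof(MPI_Aint), (int)sizeof(void *));
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }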
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> On 07/17/2009 07:04 PM, Christina Patrick wrote:
>>>>>> Hi Everybody,
>>>>>>
>>>>>> I am trying to create an 8GB file (a 32768 x 32768 array of 8-byte
>>>>>> doubles) using 16 MPI processes. However, every time I try doing that,
>>>>>> MPI aborts. The backtrace shows that there is a problem in the
>>>>>> ADIOI_Calc_my_off_len() function. There is a variable there:
>>>>>>     MPI_Aint filetype_extent;
>>>>>> and its value is filetype_extent = 0 whenever it executes
>>>>>>     MPI_Type_extent(fd->filetype, &filetype_extent);
>>>>>> Hence, when it reaches the statement:
>>>>>>     335    n_filetypes = (offset - flat_file->indices[0]) / filetype_extent;
>>>>>> I always get SIGFPE. Is there a solution to this problem? Can I create
>>>>>> such a big file?
>>>>>> I checked the value of the variable while creating a file of up to 2GB,
>>>>>> and it is NOT zero, which makes me conclude that there is an overflow
>>>>>> when I specify 8GB.
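>>>>>>
>>>>>> The numbers are consistent with a 32-bit wraparound: if the filetype
>>>>>> covers the full 32768 x 32768 array of doubles, its extent is
>>>>>> 32768 * 32768 * 8 bytes = 2^33 = 8589934592, and the low 32 bits of
>>>>>> 2^33 are exactly 0, so a 32-bit MPI_Aint would report an extent of 0
>>>>>> and the division above would raise SIGFPE. A tiny illustration of the
>>>>>> truncation (plain C, nothing MPI-specific; names are just for
>>>>>> illustration):
>>>>>>
>>>>>> long long full_extent = 32768LL * 32768LL * 8LL; /* 8589934592 = 2^33 */
>>>>>> int wrapped = (int)full_extent;                  /* low 32 bits: 0 */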
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Christina.
>>>>>>
>>>>>> PS: I am using the PVFS2 filesystem with mpich2-1.0.8 and pvfs-2.8.0.
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji
>>>>>
>>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>