[petsc-users] On the edge of 2^31 unknowns

Barry Smith bsmith at mcs.anl.gov
Mon Nov 16 14:22:39 CST 2015


> On Nov 16, 2015, at 12:26 PM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
> 
> Barry,
> 
> I can't launch the code again to retrieve more information, since I am not allowed to do so: the cluster has about 780 nodes and I got very special permission to reserve 530 of them...
> 
> So the best I can do is to give you the backtrace PETSc gave me... :/
> (see the first post with the backtrace: http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
> 
> And until today, all smaller meshes with the same solver completed successfully... (I went up to 219 million unknowns on 64 nodes).
> 
> I understand then that some use of PetscInt64 in the actual code could help fix problems like the one I got.  I find it a big challenge to track down all occurrences of this kind of overflow in the code, because of the size of the systems needed to reproduce the problem...

 Eric,

    This is exactly our problem and why I asked for your data. Doing a manual, line-by-line code inspection looking for potential overflow points is tedious and would take forever, so instead we need to fix these as we become aware of them.

 You then wrote

I looked into the code of PetscLLCondensedCreate_Scalable:

...
 ierr = PetscMalloc1(2*(nlnk_max+2),lnk);CHKERRQ(ierr);
...


and just for fun, I tried this:

#include <iostream>

int main() {
 int a=1741445953; // my number of unknowns...
 int b=2*(a+2);
 unsigned long int c = b;
 std::cout << " a: " << a <<  " b: " << b << " c: " << c <<std::endl;
 return 0;
}

and it gives:

a: 1741445953 b: -812075386 c: 18446744072897476230

  Thanks for this. This is exactly one of the places where int overflow can occur and where we have to at a minimum do additional checking for int overflow (when using 32 bit integers).
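
  For illustration only, here is a minimal sketch of that kind of check, assuming the size is computed as 2*(n+2) as in the PetscMalloc1 line quoted above. The helper name checked_workspace_len is hypothetical and is not a PETSc function. It does the arithmetic in 64 bits and reports whether the result still fits in a 32 bit int; for n = 1741445953 the true value is 3482891910, which does not fit.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of PETSc): computes 2*(n+2) in 64-bit
// arithmetic so the intermediate value cannot wrap around.
// Returns true and stores the result in *out only if it fits in 32 bits.
bool checked_workspace_len(std::int32_t n, std::int32_t* out) {
  assert(n >= 0);  // a count of unknowns is never negative
  const std::int64_t wide = 2 * (static_cast<std::int64_t>(n) + 2);
  if (wide > std::numeric_limits<std::int32_t>::max()) {
    return false;  // the length would overflow a 32-bit integer
  }
  *out = static_cast<std::int32_t>(wide);
  return true;
}
```

  A caller would test the return value before allocating and raise an error when it is false, rather than passing a wrapped negative length on to malloc.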

 Note that even if your final result easily fits within a matrix that uses 32 bit integers, it is possible that an intermediate integer or work space will not fit into a 32 bit integer or array (or that it would fit, but since we don't know the minimal size we need to overestimate the space, as Matt says, and that pushes it above the 32 bit integer size).
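
 As a toy illustration of this point (not taken from the PETSc source), suppose each row has an upper bound on its number of entries that fits comfortably in 32 bits. The sum of those bounds, which is what an overestimated work space would be sized from, can still exceed the 32 bit limit. The function names below are made up for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sketch: add up per-row upper bounds using a 64-bit
// accumulator, so the total is exact even when it exceeds 2^31 - 1.
std::int64_t total_upper_bound(const std::vector<std::int32_t>& row_bounds) {
  std::int64_t total = 0;
  for (std::int32_t bound : row_bounds) {
    assert(bound >= 0);  // each bound is a count
    total += bound;      // bound is widened to 64 bits before the addition
  }
  return total;
}

// True if a non-negative 64-bit count can be stored in a 32-bit int.
bool fits_in_32bit_count(std::int64_t value) {
  return value >= 0 && value <= std::numeric_limits<std::int32_t>::max();
}
```

 With three rows each bounded by 1000000000, every individual bound fits in 32 bits but the total of 3000000000 does not.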
 
  Our goal is that if something won't fit in a 32 bit int we use a 64 bit integer when possible, or at least produce a very useful error message instead of the horrible malloc error you get.  The more crashes you can give us the quicker we can fix these errors.
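
  To make "a very useful error message" concrete, here is a hedged sketch in plain C++ of what such a failure could look like. It is not how PETSc reports errors, and narrow_or_throw is a hypothetical name; the only PETSc-specific detail is the --with-64-bit-indices configure option mentioned earlier in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of PETSc): converts a 64-bit count to a
// 32-bit index, or throws with a message naming the quantity, its value,
// and a possible remedy.
std::int32_t narrow_or_throw(std::int64_t value, const std::string& what) {
  assert(!what.empty());  // the caller names the quantity for the message
  if (value < 0 || value > std::numeric_limits<std::int32_t>::max()) {
    throw std::overflow_error(
        what + " = " + std::to_string(value) +
        " does not fit in a 32-bit index; consider rebuilding with "
        "--with-64-bit-indices");
  }
  return static_cast<std::int32_t>(value);
}
```

  Called with the value 3482891910 from the example above, this fails with a message that states the offending number instead of an opaque allocation failure.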

   Thanks


  Barry


> 
> Eric
> 
> 
> On 16/11/15 12:40 PM, Barry Smith wrote:
>> 
>>   Eric,
>> 
>>     The behavior you get with bizarre integers and a crash is not the behavior we want. We would like to detect these overflows appropriately.   If you can track through the error and determine the location where the overflow occurs then we would gladly put in additional checks and use of PetscInt64 to handle these things better. So let us know the exact cause and we'll improve the code.
>> 
>>   Barry
>> 
>> 
>> 
>>> On Nov 16, 2015, at 11:11 AM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
>>> 
>>> On 16/11/15 10:42 AM, Matthew Knepley wrote:
>>>> Sometimes when we do not have exact counts, we need to overestimate
>>>> sizes. This is especially true
>>>> in sparse MatMat.
>>> 
>>> Ok... so, to be sure, am I correct in saying that recompiling petsc with
>>> "--with-64-bit-indices" is the only solution to my problem?
>>> 
>>> I mean, no other fixes exist for this overestimation in a more recent release of petsc, like putting the result in a "long int" instead?
>>> 
>>> Thanks,
>>> 
>>> Eric
>>> 
> 


