[petsc-users] On the edge of 2^31 unknowns
Eric Chamberland
Eric.Chamberland at giref.ulaval.ca
Fri Jun 17 15:55:49 CDT 2016
Hi,
We got another run on the cluster with PETSc 3.5.4 compiled with 64-bit
indices (see the end of this message for the configure options).
This time, the execution terminated with a segmentation violation with
the following backtrace:
Thu Jun 16 16:03:08 2016<stderr>:#000: reqBacktrace(std::string&) >>>
/rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
Thu Jun 16 16:03:08 2016<stderr>:#001: attacheDebugger() >>>
/rap/jsf-051-aa/ericc/GIREF/bin/probGD.opt
Thu Jun 16 16:03:08 2016<stderr>:#002:
/rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Util.so(traitementSignal+0x305)
[0x2b6c4885a875]
Thu Jun 16 16:03:08 2016<stderr>:#003: /lib64/libc.so.6(+0x326a0)
[0x2b6c502156a0]
Thu Jun 16 16:03:08 2016<stderr>:#004:
/software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0xee5)
[0x2b6c55e99ab5]
Thu Jun 16 16:03:08 2016<stderr>:#005:
/software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)
[0x2b6c55e9b8f8]
Thu Jun 16 16:03:08 2016<stderr>:#006:
/software6/mpi/openmpi/1.8.8_intel/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x3ff)
[0x2b6c55e9c50f]
Thu Jun 16 16:03:08 2016<stderr>:#007:
/software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(PetscMallocAlign+0x17)
[0x2b6c4a49ecc7]
Thu Jun 16 16:03:08 2016<stderr>:#008:
/software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMultSymbolic_MPIAIJ_MPIAIJ+0x37d)
[0x2b6c4a915eed]
Thu Jun 16 16:03:08 2016<stderr>:#009:
/software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult_MPIAIJ_MPIAIJ+0x1a3)
[0x2b6c4a915713]
Thu Jun 16 16:03:08 2016<stderr>:#010:
/software6/libs/petsc/3.5.4_intel_openmpi1.8.8/lib/libpetsc.so.3.5(MatMatMult+0x5f4)
[0x2b6c4a630f44]
Thu Jun 16 16:03:08 2016<stderr>:#011: girefMatMatMult(MatricePETSc
const&, MatricePETSc const&, MatricePETSc&, MatReuse) >>>
/rap/jsf-051-aa/ericc/GIREF/lib/libgiref_opt_Petsc.so
Looking at the modifications to MatMatMultSymbolic_MPIAIJ_MPIAIJ, and
given that I am now running with 64-bit indices, either there is something
really wrong with what we are trying to do, or there is still a problem in
this routine...
What do you advise, and how could I retrieve more information if I get to
launch it again?
Would -malloc_dump or -malloc_log help, or anything else?
(The very same computation succeeded with 240M unknowns.)
Thanks for your insights!
Eric
Here are the configure options:
static const char *petscconfigureoptions = "PETSC_ARCH=linux-gnu-intel
CFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\"
FFLAGS=\"-O3 -xHost -mkl -fPIC -m64 -no-diag-message-catalog\"
--prefix=/software6/libs/petsc/3.5.4_intel_openmpi1.8.8 --with-x=0
--with-mpi-compilers=1 --with-mpi-dir=/software6/mpi/openmpi/1.8.8_intel
--known-mpi-shared-libraries=1 --with-debugging=no
--with-64-bit-indices=1 --with-shared-libraries=1
--with-blas-lapack-dir=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64
--with-scalapack=1
--with-scalapack-include=/software6/compilers/intel/composer_xe_2013_sp1.0.080/mkl/include
--with-scalapack-lib=\"-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64\"
--download-ptscotch=1
--download-superlu_dist=yes --download-parmetis=yes --download-metis=yes
--download-hypre=yes";
On 16/11/15 07:12 PM, Barry Smith wrote:
>
> I have started a branch with utilities to help catch/handle these integer overflow issues: https://bitbucket.org/petsc/petsc/pull-requests/389/add-utilities-for-handling-petscint/diff All suggestions are appreciated.
>
> Barry
>
>> On Nov 16, 2015, at 12:26 PM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
>>
>> Barry,
>>
>> I can't launch the code again to retrieve more information, since I am not allowed to: the cluster has about 780 nodes and I got very special permission to reserve 530 of them...
>>
>> So the best I can do is to give you the backtrace PETSc gave me... :/
>> (see the first post with the backtrace: http://lists.mcs.anl.gov/pipermail/petsc-users/2015-November/027644.html)
>>
>> Until today, all smaller meshes with the same solver ran to completion (I went up to 219 million unknowns on 64 nodes).
>>
>> I understand, then, that some use of PetscInt64 in the current code could help fix problems like the one I hit. I find it a big challenge to track down every occurrence of this kind of overflow in the code, given the size of the systems needed to reproduce the problem...
>>
>> Eric
>>
>>
>> On 16/11/15 12:40 PM, Barry Smith wrote:
>>>
>>> Eric,
>>>
>>> The behavior you get with bizarre integers and a crash is not the behavior we want. We would like to detect these overflows appropriately. If you can track through the error and determine the location where the overflow occurs then we would gladly put in additional checks and use of PetscInt64 to handle these things better. So let us know the exact cause and we'll improve the code.
>>>
>>> Barry
>>>
>>>
>>>
>>>> On Nov 16, 2015, at 11:11 AM, Eric Chamberland <Eric.Chamberland at giref.ulaval.ca> wrote:
>>>>
>>>> On 16/11/15 10:42 AM, Matthew Knepley wrote:
>>>>> Sometimes when we do not have exact counts, we need to overestimate
>>>>> sizes. This is especially true
>>>>> in sparse MatMat.
>>>>
>>>> Ok... so, to be sure, am I correct in saying that recompiling PETSc with
>>>> "--with-64-bit-indices" is the only solution to my problem?
>>>>
>>>> I mean, does no other fix exist for this overestimation in a more recent release of PETSc, such as storing the result in a "long int" instead?
>>>>
>>>> Thanks,
>>>>
>>>> Eric
>>>>
>>
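For readers following the quoted thread, the kind of overflow discussed there (an overestimated size that no longer fits in a 32-bit index) can be sketched in plain C. This is only an illustration under stated assumptions: mul_fits_int32 is a hypothetical helper, not a PETSc function, and the factor of 10 entries per row is an invented figure, not a number taken from the run above. It is not a diagnosis of the segmentation violation reported in this message, which occurred with 64-bit indices.

```c
#include <stdint.h>

/* Hypothetical helper (not part of PETSc): multiply two 32-bit counts using
   64-bit arithmetic, store the exact product, and return 1 if that product
   would still fit in a 32-bit signed integer, 0 otherwise. */
static int mul_fits_int32(int32_t a, int32_t b, int64_t *product)
{
    /* Widen before multiplying: the magnitude of the product of two 32-bit
       values is at most 2^62, so the 64-bit result cannot overflow. */
    *product = (int64_t)a * (int64_t)b;
    return *product >= INT32_MIN && *product <= INT32_MAX;
}
```

With 240000000 rows (the 240M unknowns mentioned above) and an assumed 10 entries per row, the exact product is 2400000000, which exceeds 2^31 - 1 = 2147483647, so the helper returns 0. A check of this shape lets a size estimate be rejected with a clear error instead of being truncated silently.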