[petsc-users] MatCreate and assembly issue

Karl Lin karl.linkui at gmail.com
Thu Jul 2 18:30:17 CDT 2020


Hi, Matt

Thanks for the tip last time. We have just encountered another issue with
large data sets. This time the behavior is the opposite of last time. The
data is 13.5 TB and the matrix has 2.4 billion columns in total. Our
program crashed during matrix loading because memory overflowed on one
node. As mentioned before, we have a small memory check during matrix
loading to keep track of RSS. The RSS printouts in the log show the
expected increase on most nodes, i.e., if we load a 1 GB portion of the
matrix, RSS increases by roughly 1 GB after MatSetValues for that portion.
On the node that ran out of memory, RSS increased by 2 GB after only 1 GB
of the matrix had been loaded through MatSetValues. We are very puzzled by
this. What could make the memory footprint twice as large as needed?
Thanks in advance for any insight.
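
For reference, the memory check is essentially the following (a minimal
sketch rather than our actual loader; the row/column index and value
arrays stand in for one loaded portion):

#include <petscmat.h>

/* Sketch of the RSS check around one block of insertions. The arguments
 * (nrows, rows, ncols, cols, vals) are placeholders for one loaded portion. */
static PetscErrorCode InsertPortionWithRSSCheck(Mat A, PetscInt nrows, const PetscInt *rows,
                                                PetscInt ncols, const PetscInt *cols,
                                                const PetscScalar *vals)
{
  PetscErrorCode ierr;
  PetscLogDouble rssBefore, rssAfter;
  PetscMPIInt    rank;

  PetscFunctionBeginUser;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = PetscMemoryGetCurrentUsage(&rssBefore);CHKERRQ(ierr);   /* resident set size in bytes */
  ierr = MatSetValues(A, nrows, rows, ncols, cols, vals, INSERT_VALUES);CHKERRQ(ierr);
  ierr = PetscMemoryGetCurrentUsage(&rssAfter);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_SELF, "[rank %d] rss grew by %.1f MB for this portion\n",
                     rank, (rssAfter - rssBefore)/1048576.0);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

On the healthy nodes the reported growth matches the size of the portion;
on the node in question it is roughly double.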

Regards,

Karl

On Thu, Jun 11, 2020 at 12:00 PM Matthew Knepley <knepley at gmail.com> wrote:

> On Thu, Jun 11, 2020 at 12:52 PM Karl Lin <karl.linkui at gmail.com> wrote:
>
>> Hi, Matthew
>>
>> Thanks for the suggestion. I just did another run, and here are some more
>> detailed stack traces; maybe they will provide some insight:
>>  *** Process received signal ***
>> Signal: Aborted (6)
>> Signal code:  (-6)
>> /lib64/libpthread.so.0(+0xf5f0)[0x2b56c46dc5f0]
>>  [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b56c5486337]
>>  [ 2] /lib64/libc.so.6(abort+0x148)[0x2b56c5487a28]
>>  [ 3] /libpetsc.so.3.10(PetscTraceBackErrorHandler+0xc4)[0x2b56c1e6a2d4]
>>  [ 4] /libpetsc.so.3.10(PetscError+0x1b5)[0x2b56c1e69f65]
>>  [ 5] /libpetsc.so.3.10(PetscCommBuildTwoSidedFReq+0x19f0)[0x2b56c1e03cf0]
>>  [ 6] /libpetsc.so.3.10(+0x77db17)[0x2b56c2425b17]
>>  [ 7] /libpetsc.so.3.10(+0x77a164)[0x2b56c2422164]
>>  [ 8] /libpetsc.so.3.10(MatAssemblyBegin_MPIAIJ+0x36)[0x2b56c23912b6]
>>  [ 9] /libpetsc.so.3.10(MatAssemblyBegin+0xca)[0x2b56c1feccda]
>>
>> By reconfiguring, you mean recompiling PETSc with that option, correct?
>>
>
> Reconfiguring.
>
>   Thanks,
>
>     Matt
>
>
>> Thank you.
>>
>> Karl
>>
>> On Thu, Jun 11, 2020 at 10:56 AM Matthew Knepley <knepley at gmail.com>
>> wrote:
>>
>>> On Thu, Jun 11, 2020 at 11:51 AM Karl Lin <karl.linkui at gmail.com> wrote:
>>>
>>>> Hi, there
>>>>
>>>> We have written a program that uses PETSc to solve large sparse matrix
>>>> systems. It has been working fine for a while. Recently we encountered a
>>>> problem when the sparse matrix is larger than 10 TB. We used several
>>>> hundred nodes and 2200 processes. The program always crashes during
>>>> MatAssemblyBegin. Upon closer inspection, there seems to be something
>>>> unusual. We have a small memory check during matrix loading to keep track
>>>> of RSS. The RSS printouts in the log show the expected increase up to rank
>>>> 2160, i.e., if we load a 1 GB portion of the matrix, RSS increases by
>>>> roughly that amount after MatSetValues for that portion. From rank 2161
>>>> onwards, RSS does not increase on any rank after its portion of the matrix
>>>> is loaded. Then comes MatAssemblyBegin, and the program crashes on rank
>>>> 2160.
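>>>>
>>>> For context, the loading path follows the usual PETSc parallel assembly
>>>> pattern; a rough sketch with hypothetical sizes (the real code inserts
>>>> the entries our file reader produces for each owned row):
>>>>
>>>> #include <petscmat.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>   Mat            A;
>>>>   PetscInt       M = 1000000, N = 1000000;   /* hypothetical global dimensions */
>>>>   PetscInt       Istart, Iend, i;
>>>>   PetscScalar    v = 1.0;
>>>>   PetscErrorCode ierr;
>>>>
>>>>   ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>>>>   ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>>>>   ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, M, N);CHKERRQ(ierr);
>>>>   ierr = MatSetType(A, MATMPIAIJ);CHKERRQ(ierr);
>>>>   ierr = MatMPIAIJSetPreallocation(A, 1, NULL, 0, NULL);CHKERRQ(ierr);   /* per-row nonzero estimates */
>>>>   ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
>>>>   for (i = Istart; i < Iend; i++) {
>>>>     /* the real loader inserts the entries read from disk for row i */
>>>>     ierr = MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);CHKERRQ(ierr);
>>>>   }
>>>>   ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);          /* where the crash occurs */
>>>>   ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>>>>   ierr = MatDestroy(&A);CHKERRQ(ierr);
>>>>   ierr = PetscFinalize();
>>>>   return ierr;
>>>> }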
>>>>
>>>> Is there an upper limit on the number of processes PETSc can handle, or
>>>> an upper limit on the size of the matrix it can handle? Thank you very
>>>> much for any info.
>>>>
>>>
>>> It sounds like you overflowed an int somewhere. We try to check for this,
>>> but catching every place is hard. Try reconfiguring with
>>>
>>>   --with-64-bit-indices
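>>>
>>> After rebuilding, a quick way to confirm the 64-bit indices took effect
>>> is to check the index width at runtime, e.g. (illustrative snippet only):
>>>
>>> #include <petscsys.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>   PetscErrorCode ierr;
>>>
>>>   ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
>>>   /* with --with-64-bit-indices, PetscInt is 8 bytes (64 bits) */
>>>   ierr = PetscPrintf(PETSC_COMM_WORLD, "sizeof(PetscInt) = %d bytes\n",
>>>                      (int)sizeof(PetscInt));CHKERRQ(ierr);
>>>   ierr = PetscFinalize();
>>>   return ierr;
>>> }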
>>>
>>>   Thanks,
>>>
>>>      Matt
>>>
>>>
>>>> Regards,
>>>>
>>>> Karl
>>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>> <http://www.cse.buffalo.edu/~knepley/>
>>>
>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>