[petsc-users] memory corruption

Hong Zhang hzhang at mcs.anl.gov
Fri Jul 29 10:35:07 CDT 2011


Rebecca:

> Yes, there is no memory corruption in the small test problem after I added those options. I have not tested large problems yet.

Great! I suggest you also install mumps.
Although this might be inappropriate for your current position :-(
it gives you an additional solver for debugging.
When superlu crashes, switch to mumps to investigate.
>
> For using superlu_dist as a direct solver, I would like to make sure that there is no memory corruption for large problems, i.e., that the memory requirement on a single processor does not exceed the memory available to each processor.

You must experiment. The memory analysis below is based on the size of
the original matrix, not the factored matrix, which could be many times
larger than the original one.
Option '-mat_superlu_dist_statprint' gives memory usage in superlu_dist.
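
The arithmetic in the quoted message below can be checked with a short script. This is only a sketch of the storage for one assembled matrix, under stated assumptions: a compressed-row (CSR) layout, 8-byte real PetscScalar, and 4-byte column indices. The name csr_bytes and the 52-entries-per-row case (every stencil point coupling all 4 unknowns) are hypothetical illustrations, not taken from the thread, and none of this covers the factored matrix inside superlu_dist.

```python
# Rough storage estimate for the original (unfactored) matrix of the 64 x 64 test.
M, N = 64, 64
DOF = 4            # unknowns per grid point (from the message)
SCALAR_BYTES = 8   # PetscScalar, assuming real double precision
INDEX_BYTES = 4    # assumption: 32-bit integer indices

rows = DOF * M * N  # the matrix is 4MN x 4MN


def values_bytes(nnz_per_row):
    """Bytes for the nonzero values only, as counted in the message."""
    return SCALAR_BYTES * nnz_per_row * rows


def csr_bytes(nnz_per_row):
    """Hypothetical CSR estimate: values + column indices + row pointers."""
    nnz = nnz_per_row * rows
    return nnz * (SCALAR_BYTES + INDEX_BYTES) + (rows + 1) * INDEX_BYTES


nprocs = 32
for nnz_per_row in (13, 52):  # 13 as in the message; 52 is a what-if case
    print(nnz_per_row,
          values_bytes(nnz_per_row) / nprocs,
          csr_bytes(nnz_per_row) / nprocs)
```

Even the values-only count depends strongly on how many entries per row are actually allocated, so the estimate is best compared against the numbers printed by -log_summary and -mat_superlu_dist_statprint rather than trusted on its own.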

Hong

>
> For example, on Hopper, one node has 24 cores, and if I use all 24 cores per node, each core has approximately 1.33 GB of memory.
>
> Assume the problem size is M*N, there are 4 unknowns per grid point, and the standard 13-point scheme is used for discretization. Therefore, the number of nonzeros per row is 13.
>
> The sparse matrix is 4MN x 4MN in size with 13*4MN nonzeros.
>
> Assuming that the data type is PetscScalar (8 bytes), what would be the memory usage for the matrix?
>
> I ran a test problem of 64X64 for 10 time steps in np=1,2,4,8,16,32.
>
> With the help of '-log_summary', I was able to tell the memory usage, for example, when np=32,
>
>
>              Matrix     9              9      1172968     0
>
>
> The memory usage for processor 0 is 1,172,968 bytes in Matrix.
>
> Is this 1,172,968 bytes the total for those 9 creations/destructions? Where does this 1,172,968 come from?
>
> I did the math as follows: the number of nonzeros in the matrix is 13*4MN = 212,992;
>
> the memory is 8 (bytes per PetscScalar) * 212,992 = 1,703,936;
>
> on each processor, the memory is 1,703,936/32 = 53,248.
>
> This 53,248 is far from 1,172,968.
>
> Even if I multiply by 10 (for 10 time steps), 532,480 is still far from 1,172,968.
>
> Where did I get this number wrong?
>
>  -log_summary returns:
>
> np=1     Matrix     3     3     38785428     0
> np=2     Matrix     9     9     19871976     0
> np=4     Matrix     9     9      9948136     0
> np=8     Matrix     9     9      4894696     0
> np=16    Matrix     9     9      2413544     0
> np=32    Matrix     9     9      1172968     0
>
> for other cases.
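
One way to read the figures above is to multiply each reported value by the number of processes. The sketch below does that, assuming the Matrix figure is a per-process number (the message reads it as the usage on processor 0) and that the load is roughly balanced; the variable names are illustrative only.

```python
# (np, Matrix objects created, bytes reported for Matrix) from the -log_summary lines
log_rows = [
    (1, 3, 38785428),
    (2, 9, 19871976),
    (4, 9, 9948136),
    (8, 9, 4894696),
    (16, 9, 2413544),
    (32, 9, 1172968),
]

# Aggregate estimate: reported bytes times process count (assumes balanced load).
aggregates = [nprocs * nbytes for nprocs, _count, nbytes in log_rows]

for (nprocs, count, nbytes), total in zip(log_rows, aggregates):
    print(f"np={nprocs:2d}  objects={count}  reported={nbytes:>10,d}  np*reported={total:>11,d}")

# Spread between the largest and smallest aggregate.
print(f"max/min = {max(aggregates) / min(aggregates):.3f}")
```

Under these assumptions the product stays within a few percent across all runs, which suggests the reported value shrinks roughly as 1/np while the aggregate stays nearly constant.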
>
> Thanks very much!
>
> Best regards,
>
> Rebecca
>
>
> On Jul 28, 2011, at 1:05 PM, Hong Zhang wrote:
>
>> Rebecca:
>>
>> Turn off orderings and some options, e.g.,
>> -mat_superlu_dist_equil NO -mat_superlu_dist_rowperm NATURAL
>> -mat_superlu_dist_colperm NATURAL
>>
>> Do you still get memory corruption?
>>
>> Hong
>>
>>> Hello all,
>>>
>>> I tried to use superlu as a direct solver running on Hopper, but found that there are some memory corruption errors:
>>>
>>> x/xyuan> cd $PBS_O_WORKDIR
>>> Directory: /global/homes/x/xyuan/Workspace_Nersc/cartmhdpdslin/trunk/test_superlu_as_direct_solver/m256_p1024
>>> test_superlu_as_direct_solver/m256_p1024> aprun -n 1024 ./twcartffxmhd.exe -options_file option_twcartffxmhd_256
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25137 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25146 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25144 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25133 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25136 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25142 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25145 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25148 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25149 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25147 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25135 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25134 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25138 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25141 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25140 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25139 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25132 : Permission denied
>>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25143 : Permission denied
>>> *********************************************
>>> cartesian coordinate code(np = 1024)
>>> start time = 0.0000
>>> time accuracy order = 2
>>> viscosity = 0.0500
>>> resistivity = 0.0050
>>> skin depth = 1.0000
>>> hyper resistivity = 0.00000630
>>> hyper viscosity = 0.00503929
>>> problem size: 256 by 256
>>> dt = 0.1000
>>> *********************************************
>>> ******* start solving for time = 0.10000 at time step = 1******
>>>  0 SNES Function norm 6.220836330249e-03
>>> Linear solve converged due to CONVERGED_ITS iterations 1
>>>  1 SNES Function norm 3.041982522542e-07
>>> *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe*** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: malloc(): memory corruption: 0x0000000001e31280 ***
>>>
>>> Any idea what is wrong here?
>>>
>>> Thanks very much!
>>>
>>> Xuefei (Rebecca) Yuan
>>> Postdoctoral Fellow
>>> Lawrence Berkeley National Laboratory
>>> Tel: 1-510-486-7031
>>>
>>>
>>>
>
>

