[petsc-users] memory corruption

Xuefei (Rebecca) Yuan xyuan at lbl.gov
Thu Jul 28 17:06:45 CDT 2011


Dear Hong,

Yes, after adding those options there is no memory corruption on the small test problem. I have not tested large problems yet.

Since I am using superlu_dist as a direct solver, I would like to make sure that there is no memory corruption for large problems, i.e., that the memory required on a single processor does not exceed the memory available to each processor.

For example, on Hopper, 1 node has 24 cores, and if I use all 24 cores per node, each core has approximately 1.33 GB of memory.

Assume the grid is M*N, with 4 unknowns per grid point, and that the standard 13-point scheme is used for discretization, so that the number of nonzeros per row is 13.

The sparse matrix is 4MN x 4MN in size with 13*4MN nonzeros.

Assuming the data type is PetscScalar (8 bytes), what would the memory usage for the matrix be?

I ran a 64x64 test problem for 10 time steps with np=1,2,4,8,16,32.

With the help of '-log_summary', I was able to see the memory usage. For example, when np=32,


              Matrix     9              9      1172968     0
 

The memory usage for processor 0 is 1,172,968 bytes in Matrix.

Is this 1,172,968 bytes the total for those 9 creations/destructions? If not, where does this 1,172,968 come from?

I did the math as follows. The number of nonzeros in the matrix is 13*4MN = 13*4*64*64 = 212,992,

and the memory is 8 (bytes per PetscScalar) * 212,992 = 1,703,936 bytes.

On each processor, the memory is 1,703,936/32 = 53,248 bytes.

This 53,248 is far from 1,172,968.

Even if I multiply by 10 (for the 10 time steps), 532,480 is still far from 1,172,968.

Where did I get this number wrong?
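
To make the estimate easy to check, here is a minimal Python sketch of the arithmetic above. The first part only reproduces the hand calculation (values only, one matrix). The helper csr_bytes is an illustrative assumption, not something taken from PETSc: it supposes a compressed-sparse-row layout that also keeps one column index per nonzero and one row offset per row, with 4-byte indices. The 52-entries-per-row case is likewise only a guess at what full coupling of the 4 unknowns at each stencil point would give.

```python
# Hand estimate for the 64 x 64 test problem (values only, one matrix).
M, N = 64, 64          # grid size
dof = 4                # unknowns per grid point
nnz_per_row = 13       # as assumed in the text: one entry per stencil point
scalar_bytes = 8       # PetscScalar, real double precision
nprocs = 32

rows = dof * M * N                       # 4MN matrix rows
nnz = nnz_per_row * rows                 # 13 * 4MN nonzeros
values_bytes = scalar_bytes * nnz        # bytes for the values alone
per_proc_values = values_bytes // nprocs

def csr_bytes(rows, nnz_per_row, scalar_bytes=8, index_bytes=4):
    """Hypothetical CSR-style size: values + column indices + row offsets.

    index_bytes=4 is an assumption (32-bit integer indices).
    """
    nnz = rows * nnz_per_row
    return nnz * (scalar_bytes + index_bytes) + (rows + 1) * index_bytes

csr_13 = csr_bytes(rows, 13)        # 13 entries per row, as in the text
csr_52 = csr_bytes(rows, 13 * dof)  # assumption: all 4 unknowns coupled

reported_np32 = 1172968             # Matrix memory on processor 0, np=32
ratio = reported_np32 / per_proc_values

print(nnz, values_bytes, per_proc_values)
print(csr_13, csr_52)
print(f"reported / estimated per processor = {ratio:.2f}")
```

Note that this counts a single matrix, whereas the log line reports 9 matrix creations, so it is a lower bound on what the Matrix line could include rather than a prediction of it.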

For the other cases, -log_summary returns:

np=1     Matrix     3     3     38785428     0
np=2     Matrix     9     9     19871976     0
np=4     Matrix     9     9      9948136     0
np=8     Matrix     9     9      4894696     0
np=16    Matrix     9     9      2413544     0
np=32    Matrix     9     9      1172968     0
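
One way to read these numbers is to multiply each per-processor value by np and compare the totals. The short Python sketch below does that. It assumes the value reported for processor 0 is representative of every processor, which the log line alone does not establish, and it reuses the 212,992 nonzero count from the hand estimate above.

```python
# Matrix memory (bytes) reported by -log_summary on processor 0, keyed by np.
reported = {
    1: 38785428,
    2: 19871976,
    4: 9948136,
    8: 4894696,
    16: 2413544,
    32: 1172968,
}

# Assumption: every processor holds about as much as processor 0,
# so np * reported approximates the total across all processors.
totals = {nprocs: nprocs * nbytes for nprocs, nbytes in reported.items()}

estimated_nnz = 13 * 4 * 64 * 64   # 13*4MN from the hand estimate
bytes_per_nnz = totals[1] / estimated_nnz
spread = max(totals.values()) / min(totals.values())

for nprocs, total in totals.items():
    print(f"np={nprocs:<3d} approx total = {total:,} bytes")
print(f"np=1 bytes per estimated nonzero = {bytes_per_nnz:.1f}")
print(f"max/min of totals = {spread:.3f}")
```

Under that assumption the totals stay within a few percent of each other for all np, so the per-processor figure scales roughly as 1/np, and the gap to the hand estimate is already present at np=1.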

Thanks very much!

Best regards,

Rebecca


On Jul 28, 2011, at 1:05 PM, Hong Zhang wrote:

> Rebecca:
> 
> Turn off orderings and some options, e.g.,
> -mat_superlu_dist_equil NO -mat_superlu_dist_rowperm NATURAL
> -mat_superlu_dist_colperm NATURAL
> 
> Do you still get memory corruption?
> 
> Hong
> 
>> Hello all,
>> 
>> I tried to use superlu as a direct solver running on Hopper, but found that there are some memory corruption errors:
>> 
>> x/xyuan> cd $PBS_O_WORKDIR
>> Directory: /global/homes/x/xyuan/Workspace_Nersc/cartmhdpdslin/trunk/test_superlu_as_direct_solver/m256_p1024
>> test_superlu_as_direct_solver/m256_p1024> aprun -n 1024 ./twcartffxmhd.exe -options_file option_twcartffxmhd_256
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25137 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25146 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25144 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25133 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25136 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25142 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25145 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25148 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25149 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25147 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25135 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25134 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25138 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25141 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25140 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25139 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25132 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25143 : Permission denied
>> *********************************************
>> cartesian coordinate code(np = 1024)
>> start time = 0.0000
>> time accuracy order = 2
>> viscosity = 0.0500
>> resistivity = 0.0050
>> skin depth = 1.0000
>> hyper resistivity = 0.00000630
>> hyper viscosity = 0.00503929
>> problem size: 256 by 256
>> dt = 0.1000
>> *********************************************
>> ******* start solving for time = 0.10000 at time step = 1******
>>  0 SNES Function norm 6.220836330249e-03
>> Linear solve converged due to CONVERGED_ITS iterations 1
>>  1 SNES Function norm 3.041982522542e-07
>> *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe*** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: malloc(): memory corruption: 0x0000000001e31280 ***
>> 
>> Any idea what is wrong here?
>> 
>> Thanks very much!
>> 
>> Xuefei (Rebecca) Yuan
>> Postdoctoral Fellow
>> Lawrence Berkeley National Laboratory
>> Tel: 1-510-486-7031
>> 
>> 
>> 


