[petsc-users] memory corruption
Xuefei (Rebecca) Yuan
xyuan at lbl.gov
Thu Jul 28 17:06:45 CDT 2011
Dear Hong,
Yes, after adding those options there is no memory corruption in the small test problem. I have not tested large problems yet.
When using superlu_dist as a direct solver, I would like to make sure that there is no memory corruption for large problems, i.e., that the memory required by each process does not exceed the memory available to each core.
For example, on Hopper each node has 24 cores, so if I use all 24 cores per node, each core has approximately 1.33 GB of memory.
Assume the grid size is M*N with 4 unknowns per grid point, and that the standard 13-point scheme is used for the discretization, so the number of nonzeros per row is 13.
The sparse matrix is then 4MN x 4MN in size, with 13*4MN nonzeros.
Assuming the data type is PetscScalar (8 bytes), what would be the memory usage for the matrix?
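A rough sketch of such an estimate, assuming a compressed sparse row (AIJ-style) layout with 8-byte values and 4-byte integer indices. The function name csr_bytes and the 4-byte index width are assumptions for illustration, and the result ignores any preallocation padding, communication buffers, or factorization fill-in:

```python
# Back-of-the-envelope CSR storage estimate (assumed layout, not PETSc's exact accounting).
def csr_bytes(n_rows, nnz_per_row, scalar_bytes=8, index_bytes=4):
    nnz = n_rows * nnz_per_row
    values = nnz * scalar_bytes                # one scalar per nonzero
    col_indices = nnz * index_bytes            # one column index per nonzero
    row_offsets = (n_rows + 1) * index_bytes   # one offset per row, plus one
    return values + col_indices + row_offsets

M = N = 64
rows = 4 * M * N                               # 4 unknowns per grid point
estimate = csr_bytes(rows, 13)                 # 13 nonzeros per row, as assumed above
print(f"{rows} rows, about {estimate} bytes for the whole matrix")
```

The nnz_per_row argument is left as a parameter because the true count depends on how the 4 unknowns couple at each stencil point, which is not stated here.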
I ran a 64x64 test problem for 10 time steps with np=1,2,4,8,16,32.
Using '-log_summary', I can see the memory usage; for example, for np=32:
Matrix 9 9 1172968 0
The memory usage for processor 0 is 1,172,968 bytes in Matrix.
Is 1,172,968 bytes the total over those 9 creations/destructions? If not, where does this number come from?
I did the math as follows: the number of nonzeros in the matrix is 13*4MN = 212,992;
the memory is 8 (bytes per PetscScalar) * 212,992 = 1,703,936 bytes;
on each processor, the memory is 1,703,936/32 = 53,248 bytes.
This 53,248 is far from 1,172,968.
Even if I multiply by 10 (for the 10 time steps), 532,480 is still far from 1,172,968.
Where did I get this number wrong?
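The arithmetic above can be replayed step by step as a quick check. This is only a sketch of the same values-only estimate (no index arrays or other overhead), and the variable names are illustrative:

```python
# Replay the values-only estimate for the 64x64 test on 32 processes.
M = N = 64
unknowns = 4
nnz_per_row = 13
scalar_bytes = 8
nprocs = 32
reported = 1172968                              # Matrix memory on processor 0 from -log_summary

nnz = nnz_per_row * unknowns * M * N            # nonzeros in the global matrix
total_bytes = scalar_bytes * nnz                # values only
per_proc = total_bytes // nprocs                # even split across processes
times_ten = 10 * per_proc                       # scaled by the 10 time steps

print(nnz, total_bytes, per_proc, times_ten)
print(f"reported / estimate = {reported / per_proc:.1f}")
```

The reported number is roughly 22 times the per-process estimate, so the gap is much larger than a factor of 10.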
For all the cases, -log_summary returns:
        Object Type  Creations  Destructions  Memory (bytes)  Descendants' Mem.
np=1    Matrix       3          3             38785428        0
np=2    Matrix       9          9             19871976        0
np=4    Matrix       9          9             9948136         0
np=8    Matrix       9          9             4894696         0
np=16   Matrix       9          9             2413544         0
np=32   Matrix       9          9             1172968         0
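A small sketch for reading these numbers: it multiplies each reported value by np to approximate the aggregate Matrix memory, assuming every process holds about as much as processor 0 (an assumption; the log lines above only show one value per run), and compares that with the values-only estimate of 8*13*4MN bytes:

```python
# Matrix memory reported by -log_summary, keyed by process count.
reported = {1: 38785428, 2: 19871976, 4: 9948136,
            8: 4894696, 16: 2413544, 32: 1172968}

estimate_total = 8 * 13 * 4 * 64 * 64           # values-only estimate for the global matrix

aggregate = {}
for nprocs, nbytes in reported.items():
    aggregate[nprocs] = nprocs * nbytes         # assumes equal memory on every process
    ratio = aggregate[nprocs] / estimate_total
    print(f"np={nprocs:2d}  reported={nbytes:9d}  np*reported={aggregate[nprocs]:9d}  "
          f"ratio to estimate={ratio:5.1f}")
```

Under that assumption the aggregate stays between about 37.5 and 39.8 million bytes for every np, so the reported value scales like 1/np while remaining well above the estimate.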
Thanks very much!
Best regards,
Rebecca
On Jul 28, 2011, at 1:05 PM, Hong Zhang wrote:
> Rebecca:
>
> Turn off orderings and some options, e.g.,
> -mat_superlu_dist_equil NO -mat_superlu_dist_rowperm NATURAL
> -mat_superlu_dist_colperm NATURAL
>
> Do you still get memory corruption?
>
> Hong
>
>> Hello all,
>>
>> I tried to use superlu as a direct solver running on Hopper, but found that there are some memory corruption errors:
>>
>> x/xyuan> cd $PBS_O_WORKDIR
>> Directory: /global/homes/x/xyuan/Workspace_Nersc/cartmhdpdslin/trunk/test_superlu_as_direct_solver/m256_p1024
>> test_superlu_as_direct_solver/m256_p1024> aprun -n 1024 ./twcartffxmhd.exe -options_file option_twcartffxmhd_256
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25137 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25146 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25144 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25133 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25136 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25142 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25145 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25148 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25149 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25147 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25135 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25134 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25138 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25141 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25140 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25139 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25132 : Permission denied
>> [0] ERROR - MPIU_nem_gni_get_hugepages(): Can't create file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.1.25143 : Permission denied
>> *********************************************
>> cartesian coordinate code(np = 1024)
>> start time = 0.0000
>> time accuracy order = 2
>> viscosity = 0.0500
>> resistivity = 0.0050
>> skin depth = 1.0000
>> hyper resistivity = 0.00000630
>> hyper viscosity = 0.00503929
>> problem size: 256 by 256
>> dt = 0.1000
>> *********************************************
>> ******* start solving for time = 0.10000 at time step = 1******
>> 0 SNES Function norm 6.220836330249e-03
>> Linear solve converged due to CONVERGED_ITS iterations 1
>> 1 SNES Function norm 3.041982522542e-07
>> *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe*** glibc detected *** *** glibc detected *** *** glibc detected *** ./twcartffxmhd.exe: malloc(): memory corruption: 0x0000000001e31280 ***
>>
>> Any idea what is wrong here?
>>
>> Thanks very much!
>>
>> Xuefei (Rebecca) Yuan
>> Postdoctoral Fellow
>> Lawrence Berkeley National Laboratory
>> Tel: 1-510-486-7031
>>
>>
>>