[petsc-dev] Error during KSPDestroy

Barry Smith bsmith at mcs.anl.gov
Mon May 7 15:36:49 CDT 2012


   Alexander

   Satish and I have found the problem (it took some valgrind and debugger work). We were not allocating enough space for one of the work arrays passed to zgesvd(). We have fixed it in petsc-dev. You should be able to do an hg pull -u, recompile the libraries with make cmake, then relink and run your example.

    Thank you for your patience.

     Barry
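
[Editor's sketch, not part of the original message.] For readers unfamiliar with the workspace requirement mentioned above: the LAPACK documentation for zgesvd() describes two work arrays, a complex WORK array with LWORK >= MAX(1, 2*MIN(M,N) + MAX(M,N)) and a real RWORK array of dimension 5*MIN(M,N). The thread does not say which of these was undersized in PETSc, nor how the fix sizes it, so the following is only an illustration of the documented minimums. The helper names zgesvd_min_lwork and zgesvd_rwork_len are hypothetical and are not PETSc or LAPACK functions.

```c
#include <assert.h>

/* Hypothetical helper: minimum WORK length (in complex elements) that the
 * LAPACK documentation gives for zgesvd on an m-by-n matrix:
 *   LWORK >= MAX(1, 2*MIN(M,N) + MAX(M,N)) */
int zgesvd_min_lwork(int m, int n)
{
  assert(m >= 0 && n >= 0);
  int mn = m < n ? m : n;      /* MIN(M,N) */
  int mx = m > n ? m : n;      /* MAX(M,N) */
  int lwork = 2 * mn + mx;
  return lwork > 1 ? lwork : 1;
}

/* Hypothetical helper: documented RWORK length (in doubles), 5*MIN(M,N). */
int zgesvd_rwork_len(int m, int n)
{
  assert(m >= 0 && n >= 0);
  int mn = m < n ? m : n;
  return 5 * mn;
}
```

For a 10-by-10 matrix these give a WORK length of at least 30 complex entries and an RWORK length of 50 doubles. Passing anything smaller lets the routines underneath zgesvd() write outside the buffer, which is the kind of error valgrind reports in the traces quoted below.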

On May 7, 2012, at 9:10 AM, Alexander Grayver wrote:

> On 07.05.2012 15:04, Barry Smith wrote:
>>    I am also running complex.
>> 
>>     Look in the file dlasq2.f (it will be in the externalpackages subdirectory of the PETSc directory). Look at line 215; this is where valgrind has a problem. In my copy
>> 
>>       END IF
>> *
>> *     Check for negative data and compute sums of q's and e's.
>> *<------ this is line 215
>>       Z( 2*N ) = ZERO
>> 
>> it is a comment, which is not good. Is line 215 also a comment in your copy of dlasq2.f?
> Barry,
> 
> *
> *     Rearrange data for locality: Z=(q1,qq1,e1,ee1,q2,qq2,e2,ee2,...).
> *
>      DO 30 K = 2*N, 2, -2
>         Z( 2*K ) = ZERO                                             ! <----------- LINE 215
>         Z( 2*K-1 ) = Z( K )
>         Z( 2*K-2 ) = ZERO
>         Z( 2*K-3 ) = Z( K-1 )
>   30 CONTINUE
> 
> In the valgrind log you can see that it complains about the following lines as well:
> 
> ==9009== Invalid write of size 8
> ==9009==    at 0x10651D5: dlasq2_ (dlasq2.f:215)
> ==9009==    by 0x1064683: dlasq1_ (dlasq1.f:135)
> ==9009==    by 0x104EB3F: zbdsqr_ (zbdsqr.f:225)
> ==9009==    by 0x1023B74: zgesvd_ (zgesvd.f:2040)
> ==9009==    by 0xD38725: KSPComputeExtremeSingularValues_GMRES (gmreig.c:46)
> ==9009==    by 0xCB3CC7: KSPComputeExtremeSingularValues (itfunc.c:47)
> ==9009==    by 0x406DF2: main (solveTest.c:47)
> ==9009==  Address 0x6ef5d88 is 8 bytes before a block of size 832 alloc'd
> ==9009==    at 0x4C2786E: memalign (vg_replace_malloc.c:581)
> ==9009==    by 0x47E3CB: PetscMallocAlign (mal.c:30)
> ==9009==    by 0xD2E286: KSPSetUp_GMRES (gmres.c:73)
> ==9009==    by 0xCB5464: KSPSetUp (itfunc.c:239)
> ==9009==    by 0xCB6E56: KSPSolve (itfunc.c:402)
> ==9009==    by 0x406DDB: main (solveTest.c:46)
> ==9009==
> ==9009== Invalid write of size 8
> ==9009==    at 0x1065204: dlasq2_ (dlasq2.f:216)
> ....
> ==9009==
> ==9009== Invalid write of size 8
> ==9009==    at 0x1065223: dlasq2_ (dlasq2.f:217)
> ....
> ==9009==
> ==9009== Invalid write of size 8
> ==9009==    at 0x1065255: dlasq2_ (dlasq2.f:218)
> ....
> 
> All further output is also related to the Z array.
> Hard to believe this is a LAPACK problem... I tried 3 implementations over 2 machines.
> I have a bad feeling it's my stupid mistake somewhere... :)
> 
> Just in case, I run ubuntu 11.1 and PETSc is configured like this with default gcc compiler:
> ./configure --with-petsc-arch=mpich-gcc-complex-debug-c --download-f-blas-lapack --with-precision=double --with-scalar-type=complex --download-mpich
> 
>> There are two possible causes I can think of for your problem
>> 
>> 1) PETSc does not allocate enough work space for zgesvd() or
>> 2) the BLAS/LAPACK routines have a bug where they sometimes access out of their work space.
>> 
>> 
>>    Satish,
>> 
>>      Can you try the same build options on a Linux machine as close to Alexander as we have and see if you can reproduce this?
>> 
>> 
>>    Barry
>> 
>> 
>> 
>> On May 7, 2012, at 2:16 AM, Alexander Grayver wrote:
>> 
>>> On 06.05.2012 22:24, Barry Smith wrote:
>>>>   Alexander,
>>>> 
>>>>      I cannot reproduce this on my mac with 3 different blas/lapack.
>>> Barry,
>>> 
>>> I'm surprised. I ran it on my home PC with Ubuntu and PETSc configured from scratch as follows:
>>> --download-mpich --with-fortran-interfaces=1 --download-scalapack --download-blacs --with-scalar-type=complex --download-blas-lapack --with-precision=double
>>> 
>>> And it's still there.
>>> Please note that all my numbers are complex.
>>> 
>>>>      Could you please run the case below but with --download-f-blas-lapack   (you forgot the -f last time)? Send us the valgrind results. This will tell us the exact line number in dlasq3() that is triggering the bad read.
>>> I did:
>>> ./configure --with-petsc-arch=openmpi-intel-complex-debug-c --download-scalapack --download-blacs --download-f-blas-lapack --with-precision=double --with-scalar-type=complex
>>> 
>>> And then ran the program under valgrind. The first message from the log:
>>> 
>>> ==27656== Invalid write of size 8
>>> ==27656==    at 0x15A8E9E: dlasq2_ (dlasq2.f:215)
>>> ==27656==    by 0x15A83A4: dlasq1_ (dlasq1.f:135)
>>> ==27656==    by 0x158ACEC: zbdsqr_ (zbdsqr.f:225)
>>> ==27656==    by 0x154EC27: zgesvd_ (zgesvd.f:2038)
>>> ==27656==    by 0x695DD3: KSPComputeExtremeSingularValues_GMRES (gmreig.c:46)
>>> ==27656==    by 0x69DD76: KSPComputeExtremeSingularValues (itfunc.c:47)
>>> ==27656==    by 0x44E98C: main (solveTest.c:62)
>>> ==27656==  Address 0xfad2d98 is 8 bytes before a block of size 832 alloc'd
>>> ==27656==    at 0x4C25D66: memalign (vg_replace_malloc.c:694)
>>> ==27656==    by 0x4B642B: PetscMallocAlign (mal.c:30)
>>> ==27656==    by 0x687775: KSPSetUp_GMRES (gmres.c:73)
>>> ==27656==    by 0x69FE4A: KSPSetUp (itfunc.c:239)
>>> ==27656==    by 0x6A2058: KSPSolve (itfunc.c:402)
>>> ==27656==    by 0x44E969: main (solveTest.c:61)
>>> 
>>> Please find full log attached.
>>> 
>>>>     Barry
>>>> 
>>>> 
>>>> On May 6, 2012, at 9:16 AM, Alexander Grayver wrote:
>>>> 
>>>>> On 06.05.2012 15:34, Matthew Knepley wrote:
>>>>>> On Sun, May 6, 2012 at 9:24 AM, Alexander Grayver<agrayver at gfz-potsdam.de>   wrote:
>>>>>> Hm, valgrind gives a lot of output like that (see full log in previous message):
>>>>>> 
>>>>>> Can you run this with --download-f-blas-lapack? This sounds much more like an MKL bug.
>>>>> I did:
>>>>> --download-scalapack --download-blacs --download-blas-lapack --with-precision=double --with-scalar-type=complex
>>>>> 
>>>>> The error is still there. I checked "ldd solveTest", mkl is not used for sure. This is not an MKL problem I guess:
>>>>> 
>>>>> ==13600== Invalid read of size 8
>>>>> ==13600==    at 0x58636AF: dlasq3_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>> ==13600==    by 0x5862C84: dlasq2_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>> ==13600==    by 0x5861F2C: dlasq1_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>> ==13600==    by 0x571A479: zbdsqr_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>> ==13600==    by 0x57466A7: zgesvd_ (in /usr/local/lib/liblapack.so.3.2.2)
>>>>> ==13600==    by 0x694687: KSPComputeExtremeSingularValues_GMRES (gmreig.c:46)
>>>>> ==13600==    by 0x69C62A: KSPComputeExtremeSingularValues (itfunc.c:47)
>>>>> ==13600==    by 0x44E02C: main (solveTest.c:62)
>>>>> ==13600==  Address 0x10826b90 is 16 bytes before a block of size 832 alloc'd
>>>>> ==13600==    at 0x4C25D66: memalign (vg_replace_malloc.c:694)
>>>>> ==13600==    by 0x4B5ACB: PetscMallocAlign (mal.c:30)
>>>>> ==13600==    by 0x686181: KSPSetUp_GMRES (gmres.c:73)
>>>>> ==13600==    by 0x69E6FE: KSPSetUp (itfunc.c:239)
>>>>> ==13600==    by 0x6A090C: KSPSolve (itfunc.c:402)
>>>>> ==13600==    by 0x44E009: main (solveTest.c:61)
>>>>> 
>>>>> The weird thing is that it gives the correct result, so zgesvd works fine.
>>>>> 
>>>>> Also, running this program with 10 iterations in valgrind doesn't produce an error. The log above is with 100 iterations.
>>>>> Without valgrind the error is always there.
>>>>> 
>>>>> -- 
>>>>> Regards,
>>>>> Alexander
>>>>> 
>>> 
>>> -- 
>>> Regards,
>>> Alexander
>>> 
>>> <valgrind.zip>
> 
> 
> -- 
> Regards,
> Alexander
> 



