[petsc-users] error running parallel on cluster

Sepideh Kavousi skavou1 at lsu.edu
Mon Apr 16 21:33:00 CDT 2018


I think I have found the problem, but I do not know how to solve it. I believe the problem is in line 112 of the code, which computes the derivative of aYdot[j][i].p in the y-direction:

                Pyt=(aYdot[j+1][i].p-aYdot[j-1][i].p)/(2.0*user->hy);

Here I ran with 2 processors, and the error occurs exactly in the middle of the domain in the y-direction, where I believe aYdot[j-1][i].p belongs to the other processor.


It is interesting that I have this problem, because when discretizing derivatives, stencil points can always belong to different processors and that does not normally cause any problem (at least I did not have such a problem when there were only spatial derivatives in the FormFunction calculations of DMDATSSetIFunctionLocal). Can you please tell me what I should do?
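A minimal sketch of one possible fix (not taken verbatim from this thread): register the residual with TSSetIFunction instead of DMDATSSetIFunctionLocal, and scatter the ghost values of BOTH Y and Ydot by hand, since, as Matt notes below, Ydot does not arrive ghosted through the local-function path. The Field and AppCtx definitions here are illustrative stand-ins for the actual ones in one.c, and the residual body is a placeholder:

#include <petscts.h>
#include <petscdmda.h>

typedef struct { PetscScalar U, p; } Field;   /* stand-in for the Field struct in one.c */
typedef struct { PetscReal hx, hy; } AppCtx;  /* illustrative; the real AppCtx has more members */

PetscErrorCode FormIFunction(TS ts, PetscReal t, Vec Y, Vec Ydot, Vec F, void *ptr)
{
  AppCtx        *user = (AppCtx*)ptr;
  DM             da;
  DMDALocalInfo  info;
  Vec            localY, localYdot;
  Field        **aY, **aYdot, **aF;
  PetscInt       i, j;
  PetscScalar    Pyt;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = TSGetDM(ts, &da);CHKERRQ(ierr);
  ierr = DMDAGetLocalInfo(da, &info);CHKERRQ(ierr);
  /* Scatter ghost values for BOTH Y and Ydot; the convenience wrapper only does this for Y */
  ierr = DMGetLocalVector(da, &localY);CHKERRQ(ierr);
  ierr = DMGetLocalVector(da, &localYdot);CHKERRQ(ierr);
  ierr = DMGlobalToLocalBegin(da, Y, INSERT_VALUES, localY);CHKERRQ(ierr);
  ierr = DMGlobalToLocalEnd(da, Y, INSERT_VALUES, localY);CHKERRQ(ierr);
  ierr = DMGlobalToLocalBegin(da, Ydot, INSERT_VALUES, localYdot);CHKERRQ(ierr);
  ierr = DMGlobalToLocalEnd(da, Ydot, INSERT_VALUES, localYdot);CHKERRQ(ierr);

  ierr = DMDAVecGetArrayRead(da, localY, &aY);CHKERRQ(ierr);
  ierr = DMDAVecGetArrayRead(da, localYdot, &aYdot);CHKERRQ(ierr);
  ierr = DMDAVecGetArray(da, F, &aF);CHKERRQ(ierr);
  for (j = info.ys; j < info.ys + info.ym; j++) {
    for (i = info.xs; i < info.xs + info.xm; i++) {
      if (j == 0 || j == info.my-1 || i == 0 || i == info.mx-1) {
        aF[j][i].U = 0.0; aF[j][i].p = 0.0;   /* boundary handling elided */
      } else {
        /* aYdot[j-1][i] and aYdot[j+1][i] are now valid ghost values */
        Pyt = (aYdot[j+1][i].p - aYdot[j-1][i].p)/(2.0*user->hy);
        aF[j][i].U = 0.0;                     /* real residual elided */
        aF[j][i].p = Pyt;                     /* placeholder for the full residual */
      }
    }
  }
  ierr = DMDAVecRestoreArrayRead(da, localY, &aY);CHKERRQ(ierr);
  ierr = DMDAVecRestoreArrayRead(da, localYdot, &aYdot);CHKERRQ(ierr);
  ierr = DMDAVecRestoreArray(da, F, &aF);CHKERRQ(ierr);
  ierr = DMRestoreLocalVector(da, &localY);CHKERRQ(ierr);
  ierr = DMRestoreLocalVector(da, &localYdot);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

This would be registered with TSSetIFunction(ts, NULL, FormIFunction, &user); since the local-function wrapper is bypassed, the loop bounds come from DMDAGetLocalInfo as above.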

Sepideh



________________________________
From: Sepideh Kavousi
Sent: Monday, April 16, 2018 9:22:31 PM
To: Matthew Knepley
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] error running parallel on cluster


I ran valgrind with the following command:

/work/skavou1/petsc/debug/bin/mpiexec -np 2 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./one.out -malloc off -ts_monitor -snes_fd_color -ts_max_snes_failures -1  -ts_type beuler -pc_type bjacobi  -snes_linesearch_type nleqerr -snes_type newtontr -ksp_gmres_restart 31 -sub_pc_type ilu -snes_monitor -snes_converged_reason -ksp_monitor



The valgrind output is as follows:

==39396== Invalid read of size 8
==39396==    at 0x402679: FormFunction (one.c:212)
==39396==    by 0x611AA5F: TSComputeIFunction_DMDA (dmdats.c:79)
==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
==39396==    by 0x6153380: TSStep (ts.c:3548)
==39396==    by 0x615559D: TSSolve (ts.c:3731)
==39396==    by 0x403C95: main (one.c:303)
==39396==  Address 0x83f5158 is 8 bytes before a block of size 7,200 alloc'd
==39396==    at 0x4A05588: memalign (vg_replace_malloc.c:727)
==39396==    by 0x4D87516: PetscMallocAlign (mal.c:42)
==39396==    by 0x4D88DE0: PetscMallocA (mal.c:397)
==39396==    by 0x50AB230: VecGetArray2d (rvector.c:2167)
==39396==    by 0x589437E: DMDAVecGetArray (dagetarray.c:73)
==39396==    by 0x611A8ED: TSComputeIFunction_DMDA (dmdats.c:74)
==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
==39396==    by 0x6153380: TSStep (ts.c:3548)
==39396==    by 0x615559D: TSSolve (ts.c:3731)
==39396==    by 0x403C95: main (one.c:303)
==39396==
==39396== Invalid read of size 8
==39396==    at 0x402689: FormFunction (one.c:212)
==39396==    by 0x611AA5F: TSComputeIFunction_DMDA (dmdats.c:79)
==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
==39396==    by 0x6153380: TSStep (ts.c:3548)
==39396==    by 0x615559D: TSSolve (ts.c:3731)
==39396==    by 0x403C95: main (one.c:303)
==39396==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==39396==

Thanks,

Sepideh

________________________________
From: Matthew Knepley <knepley at gmail.com>
Sent: Monday, April 16, 2018 8:44:43 PM
To: Sepideh Kavousi
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] error running parallel on cluster

On Mon, Apr 16, 2018 at 9:39 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:

This is not used for ghost points. My code is:

1) You are using ghost points, since you use i-1 as an index

2) I was wrong. X_t should also be ghosted.

Run under valgrind.

  Thanks,

     Matt


    for (j=info2->ys; j<info2->ys+info2->ym; j++) {
        for (i=info2->xs; i<info2->xs+info2->xm; i++) {
            if (i==0)                {aF[j][i].U=aY[j][i+1].U-aY[j][i].U; aF[j][i].p=aY[j][i+1].p-aY[j][i].p;}
            else if (i==info2->mx-1) {aF[j][i].U=aY[j][i-1].U-aY[j][i].U; aF[j][i].p=aY[j][i].p-aY[j][i-1].p;}
            else if (j==0)           {aF[j][i].U=aY[j+1][i].U-aY[j][i].U; aF[j][i].p=aY[j+1][i].p-aY[j][i].p;}
            else if (j==info2->my-1) {aF[j][i].U=aY[j-1][i].U-aY[j][i].U; aF[j][i].p=aY[j][i].p-aY[j-1][i].p;}
            else {
                // derivatives of U and p
                Px =(aY[j][i+1].p-aY[j][i-1].p)/(2.0*user->hx);
                Py =(aY[j+1][i].p-aY[j-1][i].p)/(2.0*user->hy);
                Pxx=(aY[j][i+1].p+aY[j][i-1].p-2.0*aY[j][i].p)/hx2;
                Pxy=(aY[j+1][i+1].p+aY[j-1][i-1].p-aY[j+1][i-1].p-aY[j-1][i+1].p)/(4.0*hxy);
                Pyy=(aY[j+1][i].p+aY[j-1][i].p-2.0*aY[j][i].p)/hy2;
                Pxt=(aYdot[j][i+1].p-aYdot[j][i-1].p)/(2.0*user->hx);
                Pyt=(aYdot[j+1][i].p-aYdot[j-1][i].p)/(2.0*user->hy);

                if (((Px*Px)+(Py*Py))>0.0) {
                    theta =PetscAtanReal(Py/Px);
                    thetax=(Px*Pxy-Py*Pxx)/((Px*Px)+(Py*Py));
                    thetay=(Px*Pyy-Py*Pxy)/((Px*Px)+(Py*Py));
                }
                else {
                    theta =PETSC_PI*0.5;
                    thetax=0.0;
                    thetay=0.0;
                }
                e  = (1.0+user->epsilon*cos(4.0*theta));
                ep =-(4.0*user->epsilon*sin(4.0*theta));
                epp=-(4.0*user->epsilon*4.0*cos(4.0*theta));

                Ux =(aY[j][i+1].U-aY[j][i-1].U)/(2.0*user->hx);
                Uy =(aY[j+1][i].U-aY[j-1][i].U)/(2.0*user->hy);
                Uxx=(aY[j][i+1].U+aY[j][i-1].U-2.0*aY[j][i].U)/hx2;
                Uyy=(aY[j+1][i].U+aY[j-1][i].U-2.0*aY[j][i].U)/hy2;

                U1=user->D*user->tau_0/(user->W*user->W);
                U2=((Px*Px+Py*Py)>0.0) ? user->a_p/sqrt(Px*Px+Py*Py) : 0.0;
                P1=(user->W*user->hy*j-user->Vp*user->tau_0*t)/user->lT;

                aF[j][i].U=(U1*(-0.5*(Ux*Px+Uy*Py)+0.5*(1-aY[j][i].p)*(Uxx+Uyy))+U2*(1+(1-user->k)*aY[j][i].U)*(Pxt*Px+Pyt*Py+aYdot[j][i].p*(Pxx+Pyy))+U2*(1-user->k)*aYdot[j][i].p*(Ux*Px+Uy*Py)+0.5*(1+(1-user->k)*aY[j][i].U)*aYdot[j][i].p-0.5*(1+user->k-(1-user->k)*aY[j][i].p)*aYdot[j][i].U)*hx2;

                aF[j][i].p=(e*e*(Pxx+Pyy)+2.0*e*ep*(thetax*Px+thetay*Py)+(e*epp+ep*ep)*(thetay*Px-thetax*Py)+(aY[j][i].p-aY[j][i].p*aY[j][i].p*aY[j][i].p)-user->lambda*(aY[j][i].p*aY[j][i].p-1)*(aY[j][i].p*aY[j][i].p-1)*(aY[j][i].U+P1)-(1-(1-user->k)*P1)*(1.0+4.0*user->epsilon*cos(4.0*theta))*(1.0+4.0*user->epsilon*cos(4.0*theta))*aYdot[j][i].p)*hx2;
            }
        }
    }

Sepideh

________________________________
From: Matthew Knepley <knepley at gmail.com>
Sent: Monday, April 16, 2018 8:36:37 PM
To: Sepideh Kavousi
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] error running parallel on cluster

On Mon, Apr 16, 2018 at 9:20 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:
Hello,
I am solving a PDE where I have the spatial derivative of the time derivative of a variable. In the DMDATSSetIFunctionLocal function I defined d(dY/dt)/dx as:
(aYdot[j][i+1].p-aYdot[j][i-1].p)/(2.0*user->hx)

I do not think that you get ghosted Ydot in that function.

   Matt

On my workstation it works, but on the cluster it gives an error. I cannot run with a debugger on the cluster, but I have checked that the error is related to this part, and I do not have any problem on the cluster when running in serial. Can you please help me with it?

The error is:
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] TSComputeIFunction_DMDA line 63 /ddnB/work/skavou1/petsc/src/ts/utils/dmdats.c
[0]PETSC ERROR: [0] TS user implicit function line 829 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: [0] TSComputeIFunction line 815 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: [0] SNESTSFormFunction_Theta line 640 /ddnB/work/skavou1/petsc/src/ts/impls/implicit/theta/theta.c
[0]PETSC ERROR: [0] SNESTSFormFunction line 4564 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: [0] SNES user function line 2202 /ddnB/work/skavou1/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: [0] SNESComputeFunction line 2187 /ddnB/work/skavou1/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: [0] SNESSolve_NEWTONTR line 90 /ddnB/work/skavou1/petsc/src/snes/impls/tr/tr.c
[0]PETSC ERROR: [0] SNESSolve line 4203 /ddnB/work/skavou1/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: [0] TSTheta_SNESSolve line 175 /ddnB/work/skavou1/petsc/src/ts/impls/implicit/theta/theta.c
[0]PETSC ERROR: [0] TSStep_Theta line 191 /ddnB/work/skavou1/petsc/src/ts/impls/implicit/theta/theta.c
[0]PETSC ERROR: [0] TSStep line 3526 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: [0] TSSolve line 3668 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.9.0, unknown
[0]PETSC ERROR: ./one.out on a debug named mike1 by skavou1 Mon Apr 16 20:11:44 2018
[0]PETSC ERROR: Configure options --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-mpich --download-fblaslapack
[0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
==29057==
==29057== HEAP SUMMARY:
==29057==     in use at exit: 560,151 bytes in 383 blocks
==29057==   total heap usage: 8,734 allocs, 8,351 frees, 4,449,350 bytes allocated
==29057==
==29057== LEAK SUMMARY:
==29057==    definitely lost: 0 bytes in 0 blocks
==29057==    indirectly lost: 0 bytes in 0 blocks
==29057==      possibly lost: 0 bytes in 0 blocks
==29057==    still reachable: 560,151 bytes in 383 blocks
==29057==         suppressed: 0 bytes in 0 blocks
==29057== Rerun with --leak-check=full to see details of leaked memory
==29057==
==29057== For counts of detected and suppressed errors, rerun with: -v
==29057== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)




Thanks,
Sepideh



--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/



--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/