[petsc-users] error running parallel on cluster

Matthew Knepley knepley at gmail.com
Mon Apr 16 21:34:01 CDT 2018


On Mon, Apr 16, 2018 at 10:22 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:

> I run with valgrind with the following command:
>
> /work/skavou1/petsc/debug/bin/mpiexec -np 2 valgrind --tool=memcheck -q
> --num-callers=20 --log-file=valgrind.log.%p ./one.out -malloc off
> -ts_monitor -snes_fd_color -ts_max_snes_failures -1  -ts_type beuler
> -pc_type bjacobi  -snes_linesearch_type nleqerr -snes_type newtontr
> -ksp_gmres_restart 31 -sub_pc_type ilu -snes_monitor -snes_converged_reason
> -ksp_monitor
>

Okay, I think this is coded wrong. The DMDA stuff ignores DMLocal and only
chooses to ghost the input vector X and not X_t

https://bitbucket.org/petsc/petsc/annotate/be3efd428a942676a0189b3273b3c582694ff011/src/ts/utils/dmdats.c?at=master&fileviewer=file-view-default#dmdats.c-68

Unfortunately, this was done by Peter in 2012 and he is gone, so we can't
blame him.
It is true that very few people use spatial derivatives of X_t, but it does
make sense.
Right there, you want to add

  ierr = DMGetLocalVector(dm,&Xdotloc);CHKERRQ(ierr);  ierr =
DMGlobalToLocalBegin(dm,Xdot,INSERT_VALUES,Xdotloc);CHKERRQ(ierr);
ierr = DMGlobalToLocalEnd(dm,Xdot,INSERT_VALUES,Xdotloc);CHKERRQ(ierr);

  ierr = DMDAVecGetArray(dm,Xdotloc,&xdot);CHKERRQ(ierr);


and it should work.

  Matt


> the valgringoutput is as following:
>
> ==39396== Invalid read of size 8
> ==39396==    at 0x402679: FormFunction (one.c:212)
> ==39396==    by 0x611AA5F: TSComputeIFunction_DMDA (dmdats.c:79)
> ==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
> ==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
> ==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
> ==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
> ==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
> ==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
> ==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
> ==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
> ==39396==    by 0x6153380: TSStep (ts.c:3548)
> ==39396==    by 0x615559D: TSSolve (ts.c:3731)
> ==39396==    by 0x403C95: main (one.c:303)
> ==39396==  Address 0x83f5158 is 8 bytes before a block of size 7,200
> alloc'd
> ==39396==    at 0x4A05588: memalign (vg_replace_malloc.c:727)
> ==39396==    by 0x4D87516: PetscMallocAlign (mal.c:42)
> ==39396==    by 0x4D88DE0: PetscMallocA (mal.c:397)
> ==39396==    by 0x50AB230: VecGetArray2d (rvector.c:2167)
> ==39396==    by 0x589437E: DMDAVecGetArray (dagetarray.c:73)
> ==39396==    by 0x611A8ED: TSComputeIFunction_DMDA (dmdats.c:74)
> ==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
> ==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
> ==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
> ==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
> ==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
> ==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
> ==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
> ==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
> ==39396==    by 0x6153380: TSStep (ts.c:3548)
> ==39396==    by 0x615559D: TSSolve (ts.c:3731)
> ==39396==    by 0x403C95: main (one.c:303)
> ==39396==
> ==39396== Invalid read of size 8
> ==39396==    at 0x402689: FormFunction (one.c:212)
> ==39396==    by 0x611AA5F: TSComputeIFunction_DMDA (dmdats.c:79)
> ==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
> ==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
> ==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
> ==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
> ==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
> ==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
> ==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
> ==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
> ==39396==    by 0x6153380: TSStep (ts.c:3548)
> ==39396==    by 0x615559D: TSSolve (ts.c:3731)
> ==39396==    by 0x403C95: main (one.c:303)
> ==39396==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
> ==39396==
>
> Thanks,
>
> Sepideh
> ------------------------------
> *From:* Matthew Knepley <knepley at gmail.com>
> *Sent:* Monday, April 16, 2018 8:44:43 PM
> *To:* Sepideh Kavousi
> *Cc:* petsc-users at mcs.anl.gov
> *Subject:* Re: [petsc-users] error running parallel on cluster
>
> On Mon, Apr 16, 2018 at 9:39 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:
>
> this is not used forghost points . my code is :
>
>
> 1) You are using ghost points, since you use i-1 as an index
>
> 2) I was wrong. X_t should also be ghosted.
>
> Run under valgrind.
>
>   Thanks,
>
>      Matt
>
>
>     for (j=info2->ys;j<info2->ys+info2->ym;j++){
>         for (i=info2->xs;i<info2->xs+info2->xm;i++){
>             if (i==0) {aF[j][i].U=aY[j][i+1].U-aY[j][i].U ;
> aF[j][i].p=aY[j][i+1].p-aY[j][i].p ;}
>             else if (i==info2->mx-1) {aF[j][i].U=aY[j][i-1].U-aY[j][i].U
> ; aF[j][i].p=aY[j][i].p-aY[j][i-1].p ;}
>             else if (j==0) {aF[j][i].U=aY[j+1][i].U-aY[j][i].U;
> aF[j][i].p=aY[j+1][i].p-aY[j][i].p ;}
>             else if (j==info2->my-1) {aF[j][i].U=aY[j-1][i].U-aY[j][i].U
> ; aF[j][i].p=aY[j][i].p-aY[j-1][i].p ;}
>             else {
>                 //derivatives of U and p
>                 Px=(aY[j][i+1].p-aY[j][i-1].p)/(2.0*user->hx);
>                 Py=(aY[j+1][i].p-aY[j-1][i].p)/(2.0*user->hy);
>                 Pxx=((aY[j][i+1].p+aY[j][i-1].p-2.0*aY[j][i].p)/(hx2));
>                 Pxy=((aY[j+1][i+1].p+aY[j-1][
> i-1].p-aY[j+1][i-1].p-aY[j-1][i+1].p)/(4.0*hxy));
>                 Pyy=((aY[j+1][i].p+aY[j-1][i].p-2.0*aY[j][i].p)/hy2);
>                 Pxt=(aYdot[j][i+1].p-aYdot[j][i-1].p)/(2.0*user->hx);
>                 Pyt=(aYdot[j+1][i].p-aYdot[j-1][i].p)/(2.0*user->hy);
>
>
>                 if (((Px*Px)+(Py*Py))>0.0 ) { //
>                     theta=PetscAtanReal(Py/Px);
>                     thetax=(Px*Pxy-Py*Pxx)/((Px*Px)+(Py*Py));
>                     thetay=(Px*Pyy-Py*Pxy)/((Px*Px)+(Py*Py));
>                     }
>                 else  {
>                     theta=PETSC_PI*0.5;
>                     thetax=0.0;
>                     thetay=0.0;
>                     }
>                 e=(1.0+user->epsilon*cos(4.0*theta));;
>                 ep=-(4.0*user->epsilon*sin(4.0*theta));
>                 epp=-(4.0*user->epsilon*4.0*cos(4.0*theta));
>
>
>
>                 Ux=(aY[j][i+1].U-aY[j][i-1].U)/(2.0*user->hx);
>                 Uy=(aY[j+1][i].U-aY[j-1][i].U)/(2.0*user->hy);
>                 Uxx=((aY[j][i+1].U+aY[j][i-1].U-2.0*aY[j][i].U)/(hx2));
>                 Uyy=((aY[j+1][i].U+aY[j-1][i].U-2.0*aY[j][i].U)/hy2);
>
>                 U1=user->D*user->tau_0/(user->W*user->W);
>                 U2= ((Px*Px+Py*Py)>0.0) ? user->a_p/sqrt(Px*Px+Py*Py) :
> 0.0 ;
>                 //U2=0.0;
>                 P1= (user->W*user->hy*j-user->Vp*user->tau_0*t)/user->lT;
>
>                 aF[j][i].U=(U1*(-0.5*(Ux*Px+U
> y*Py)+0.5*(1-aY[j][i].p)*(Uxx+Uyy))+U2*(1+(1-user->k)*aY[j][
> i].U)*(Pxt*Px+Pyt*Py+aYdot[j][i].p*(Pxx+Pyy))+U2*(1-user->k)
> *aYdot[j][i].p*(Ux*Px+Uy*Py)+0.5*(1+(1-user->k)*aY[j][i].U)
> *aYdot[j][i].p-0.5*(1+user->k-(1-user->k)*aY[j][i].p)*aYdot[j][i].U)*hx2;
>
>                 aF[j][i].p=( e*e*(Pxx+Pyy)+2.0*e*ep*(thetax
> *Px+thetay*Py)+(e*epp+ep*ep)*(thetay*Px-thetax*Py)+(aY[j][i]
> .p-aY[j][i].p*aY[j][i].p*aY[j][i].p)-user->lambda*(aY[j][i].
> p*aY[j][i].p-1)*(aY[j][i].p*aY[j][i].p-1)*(aY[j][i].U+P1)-
> (1-(1-user->k)*P1)*(1.0+4.0*user->epsilon*cos(4.0*theta))*
> (1.0+4.0*user->epsilon*cos(4.0*theta))*aYdot[j][i].p )*hx2 ;
>
> }
>         }
>
> Sepideh
> ------------------------------
> *From:* Matthew Knepley <knepley at gmail.com>
> *Sent:* Monday, April 16, 2018 8:36:37 PM
> *To:* Sepideh Kavousi
> *Cc:* petsc-users at mcs.anl.gov
> *Subject:* Re: [petsc-users] error running parallel on cluster
>
> On Mon, Apr 16, 2018 at 9:20 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:
>
> Hello,
> I am solving a PDE where I have the spacial derivtive of the time
> derivative of a variable.  In the DMDATSSetIFunctionLocal function I
> defined d(dY/dt)/dx   as:
> (aYdot[j][i+1].p-aYdot[j][i-1].p)/(2.0*user->hx)
>
>
> I do not think that you get ghosted Ydot in that function.
>
>    Matt
>
>
> on my workstation, it is working but on the cluster it gives an error. I
> can not run with debugger on the cluster but I have checked and the error
> is related to this part and I do not have any problem on cluster when
> running in series. Can you please help me about it.
>
> the error is:
> 0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/d
> ocumentation/faq.html#valgrind
> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> X to find memory corruption errors
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: ---------------------  Stack Frames
> ------------------------------------
> [0]PETSC ERROR: Note: The EXACT line numbers in the stack are not
> available,
> [0]PETSC ERROR:       INSTEAD the line number of the start of the function
> [0]PETSC ERROR:       is given.
> [0]PETSC ERROR: [0] TSComputeIFunction_DMDA line 63
> /ddnB/work/skavou1/petsc/src/ts/utils/dmdats.c
> [0]PETSC ERROR: [0] TS user implicit function line 829
> /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
> [0]PETSC ERROR: [0] TSComputeIFunction line 815
> /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
> [0]PETSC ERROR: [0] SNESTSFormFunction_Theta line 640
> /ddnB/work/skavou1/petsc/src/ts/impls/implicit/theta/theta.c
> [0]PETSC ERROR: [0] SNESTSFormFunction line 4564
> /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
> [0]PETSC ERROR: [0] SNES user function line 2202
> /ddnB/work/skavou1/petsc/src/snes/interface/snes.c
> [0]PETSC ERROR: [0] SNESComputeFunction line 2187
> /ddnB/work/skavou1/petsc/src/snes/interface/snes.c
> [0]PETSC ERROR: [0] SNESSolve_NEWTONTR line 90
> /ddnB/work/skavou1/petsc/src/snes/impls/tr/tr.c
> [0]PETSC ERROR: [0] SNESSolve line 4203 /ddnB/work/skavou1/petsc/src/s
> nes/interface/snes.c
> [0]PETSC ERROR: [0] TSTheta_SNESSolve line 175
> /ddnB/work/skavou1/petsc/src/ts/impls/implicit/theta/theta.c
> [0]PETSC ERROR: [0] TSStep_Theta line 191 /ddnB/work/skavou1/petsc/src/t
> s/impls/implicit/theta/theta.c
> [0]PETSC ERROR: [0] TSStep line 3526 /ddnB/work/skavou1/petsc/src/t
> s/interface/ts.c
> [0]PETSC ERROR: [0] TSSolve line 3668 /ddnB/work/skavou1/petsc/src/t
> s/interface/ts.c
> [0]PETSC ERROR: --------------------- Error Message
> --------------------------------------------------------------
> [0]PETSC ERROR: Signal received
> [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
> for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.9.0, unknown
> [0]PETSC ERROR: ./one.out on a debug named mike1 by skavou1 Mon Apr 16
> 20:11:44 2018
> [0]PETSC ERROR: Configure options --with-cc=gcc --with-cxx=g++
> --with-fc=gfortran --download-mpich --download-fblaslapack
> [0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> ==29057==
> ==29057== HEAP SUMMARY:
> ==29057==     in use at exit: 560,151 bytes in 383 blocks
> ==29057==   total heap usage: 8,734 allocs, 8,351 frees, 4,449,350 bytes
> allocated
> ==29057==
> ==29057== LEAK SUMMARY:
> ==29057==    definitely lost: 0 bytes in 0 blocks
> ==29057==    indirectly lost: 0 bytes in 0 blocks
> ==29057==      possibly lost: 0 bytes in 0 blocks
> ==29057==    still reachable: 560,151 bytes in 383 blocks
> ==29057==         suppressed: 0 bytes in 0 blocks
> ==29057== Rerun with --leak-check=full to see details of leaked memory
> ==29057==
> ==29057== For counts of detected and suppressed errors, rerun with: -v
> ==29057== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)
>
>
>
>
> Thanks,
> Sepideh
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>
>
>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>
>



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20180416/c618a5e2/attachment-0001.html>


More information about the petsc-users mailing list