[petsc-users] error running parallel on cluster

Sepideh Kavousi skavou1 at lsu.edu
Wed Apr 18 15:06:05 CDT 2018


Matthew,

I added the lines and I still have the same issue. It may be a silly question, but should I configure and install PETSc again after adding these new lines, or is changing the lines enough? The lines I modified are marked with /* added */ comments below.


PetscErrorCode ierr;
  DM             dm;
  DMTS_DA        *dmdats = (DMTS_DA*)ctx;
  DMDALocalInfo  info;
  Vec            Xloc,Xdotloc;  /* Xdotloc added */
  void           *x,*f,*xdot;

  PetscFunctionBegin;
  PetscValidHeaderSpecific(ts,TS_CLASSID,1);
  PetscValidHeaderSpecific(X,VEC_CLASSID,2);
  PetscValidHeaderSpecific(F,VEC_CLASSID,3);
  if (!dmdats->ifunctionlocal) SETERRQ(PetscObjectComm((PetscObject)ts),PETSC_ERR_PLIB,"Corrupt context");
  ierr = TSGetDM(ts,&dm);CHKERRQ(ierr);
  ierr = DMGetLocalVector(dm,&Xdotloc);CHKERRQ(ierr);                        /* added */
  ierr = DMGlobalToLocalBegin(dm,Xdot,INSERT_VALUES,Xdotloc);CHKERRQ(ierr);  /* added */
  ierr = DMGlobalToLocalEnd(dm,Xdot,INSERT_VALUES,Xdotloc);CHKERRQ(ierr);    /* added */
  ierr = DMGetLocalVector(dm,&Xloc);CHKERRQ(ierr);
  ierr = DMGlobalToLocalBegin(dm,X,INSERT_VALUES,Xloc);CHKERRQ(ierr);
  ierr = DMGlobalToLocalEnd(dm,X,INSERT_VALUES,Xloc);CHKERRQ(ierr);
  ierr = DMDAGetLocalInfo(dm,&info);CHKERRQ(ierr);
  ierr = DMDAVecGetArray(dm,Xloc,&x);CHKERRQ(ierr);
  ierr = DMDAVecGetArray(dm,Xdotloc,&xdot);CHKERRQ(ierr);  /* changed: Xdotloc instead of the global Xdot */
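
For the change to be complete, the matching cleanup later in the same function needs the same treatment. A minimal sketch, following the existing Xloc restore pattern in dmdats.c (the placement after the call to the user's local function is an assumption):

  ierr = DMDAVecRestoreArray(dm,Xdotloc,&xdot);CHKERRQ(ierr); /* restore from Xdotloc, not the global Xdot */
  ierr = DMDAVecRestoreArray(dm,Xloc,&x);CHKERRQ(ierr);
  ierr = DMRestoreLocalVector(dm,&Xdotloc);CHKERRQ(ierr);     /* return the local vector obtained above */
  ierr = DMRestoreLocalVector(dm,&Xloc);CHKERRQ(ierr);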

Thanks,

Sepideh

________________________________
From: Matthew Knepley <knepley at gmail.com>
Sent: Tuesday, April 17, 2018 5:59:12 PM
To: Sepideh Kavousi
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] error running parallel on cluster

On Tue, Apr 17, 2018 at 3:07 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:

The reason I cannot use your method is that the input arguments of SetIFunctionLocal are the arrays of x and x_t instead of the x and x_t vectors. Your method was:

ierr = DMGetLocalVector(dm,&Xdotloc);CHKERRQ(ierr);
  ierr = DMGlobalToLocalBegin(dm,Xdot,INSERT_VALUES,Xdotloc);CHKERRQ(ierr);
  ierr = DMGlobalToLocalEnd(dm,Xdot,INSERT_VALUES,Xdotloc);CHKERRQ(ierr);
  ierr = DMDAVecGetArray(dm,Xdotloc,&xdot);CHKERRQ(ierr);

You misunderstand my suggestion. I mean stick this code right here in PETSc:

https://bitbucket.org/petsc/petsc/annotate/be3efd428a942676a0189b3273b3c582694ff011/src/ts/utils/dmdats.c?at=master&fileviewer=file-view-default#dmdats.c-68

Then the X_t array you get in your local function will be ghosted.
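
With that change, centered differences of the time derivative (like the ones in the residual quoted later in this thread) read valid ghost values at process boundaries. A minimal illustration, reusing the aYdot and user->hx/user->hy names from that code:

    Pxt=(aYdot[j][i+1].p-aYdot[j][i-1].p)/(2.0*user->hx); /* i+1, i-1 may now be ghost points */
    Pyt=(aYdot[j+1][i].p-aYdot[j-1][i].p)/(2.0*user->hy); /* j+1, j-1 may now be ghost points */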

   Matt


I need to have the Xdot vector, not the array. So I think I should use SetIFunction instead of SetIFunctionLocal.


Sepideh

________________________________
From: Matthew Knepley <knepley at gmail.com>
Sent: Tuesday, April 17, 2018 1:22:53 PM
To: Sepideh Kavousi
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] error running parallel on cluster

On Tue, Apr 17, 2018 at 1:50 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:

Matthew,

I previously used DMDATSSetIFunctionLocal(user.da,INSERT_VALUES,(DMDATSIFunctionLocal) FormFunction,&user) in my code. If I want to use your solution I cannot, because in the FormFunction definition I must use arrays, not vectors. So to solve this issue I tried two methods, neither of which solved it.
1- In the first method I decided to use TSSetIFunction instead of DMDATSSetIFunctionLocal.

To do this, I first call TSSetDM in the main function, and my form function is declared as:
PetscErrorCode FormFunction(TS ts,PetscReal t,Vec Y,Vec Ydot,Vec F, struct VAR_STRUCT *user) {
.
.
.
.
    ierr = TSGetDM(ts,&dmda);CHKERRQ(ierr);
    ierr = DMDAGetLocalInfo(dmda,&info2);CHKERRQ(ierr);

    ierr = DMGetLocalVector(dmda,&Ydot_local);CHKERRQ(ierr);
    ierr = DMGlobalToLocalBegin(dmda,Ydot,INSERT_VALUES,Ydot_local);CHKERRQ(ierr);
    ierr = DMGlobalToLocalEnd(dmda,Ydot,INSERT_VALUES,Ydot_local);CHKERRQ(ierr);
.
.
.

}
But it still does not treat the vectors Y, Ydot, and F as associated with the DMDA (the problem shows up when executing DMDAVecGetArray).
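
For reference, a minimal sketch of how this first approach is usually structured, assuming a 2D DMDA. The two-component FIELD typedef is a hypothetical name matching the .U/.p accesses elsewhere in this thread, and VAR_STRUCT is the user's context from above:

typedef struct {PetscScalar U,p;} FIELD; /* hypothetical field type */

PetscErrorCode FormFunction(TS ts,PetscReal t,Vec Y,Vec Ydot,Vec F,void *ptr)
{
  struct VAR_STRUCT *user = (struct VAR_STRUCT*)ptr; /* parameters used in the elided stencil */
  DM                dmda;
  DMDALocalInfo     info2;
  Vec               Y_local,Ydot_local;
  FIELD             **aY,**aYdot,**aF;
  PetscErrorCode    ierr;

  PetscFunctionBegin;
  ierr = TSGetDM(ts,&dmda);CHKERRQ(ierr);
  ierr = DMDAGetLocalInfo(dmda,&info2);CHKERRQ(ierr);

  /* scatter both the state and its time derivative into ghosted local vectors */
  ierr = DMGetLocalVector(dmda,&Y_local);CHKERRQ(ierr);
  ierr = DMGlobalToLocalBegin(dmda,Y,INSERT_VALUES,Y_local);CHKERRQ(ierr);
  ierr = DMGlobalToLocalEnd(dmda,Y,INSERT_VALUES,Y_local);CHKERRQ(ierr);
  ierr = DMGetLocalVector(dmda,&Ydot_local);CHKERRQ(ierr);
  ierr = DMGlobalToLocalBegin(dmda,Ydot,INSERT_VALUES,Ydot_local);CHKERRQ(ierr);
  ierr = DMGlobalToLocalEnd(dmda,Ydot,INSERT_VALUES,Ydot_local);CHKERRQ(ierr);

  ierr = DMDAVecGetArray(dmda,Y_local,&aY);CHKERRQ(ierr);
  ierr = DMDAVecGetArray(dmda,Ydot_local,&aYdot);CHKERRQ(ierr);
  ierr = DMDAVecGetArray(dmda,F,&aF);CHKERRQ(ierr); /* F stays global: write owned entries only */

  /* ... stencil loop over i=info2.xs..info2.xs+info2.xm-1, j=info2.ys..info2.ys+info2.ym-1,
     reading aY/aYdot (ghosted) and writing aF (owned), using the parameters in user ... */

  ierr = DMDAVecRestoreArray(dmda,Y_local,&aY);CHKERRQ(ierr);
  ierr = DMDAVecRestoreArray(dmda,Ydot_local,&aYdot);CHKERRQ(ierr);
  ierr = DMDAVecRestoreArray(dmda,F,&aF);CHKERRQ(ierr);
  ierr = DMRestoreLocalVector(dmda,&Y_local);CHKERRQ(ierr);
  ierr = DMRestoreLocalVector(dmda,&Ydot_local);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

It would then be registered in main, after TSSetDM(ts,da), with TSSetIFunction(ts,NULL,FormFunction,&user).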

We cannot help you if you do not show full error messages.

Why not fix the code with SetIFunctionLocal(), as I said in my last email? I will fix PETSc proper in a branch at the end of the week. I
have a proposal due tomorrow, so I cannot do it right now.

  Thanks,

    Matt

2- In the second method I decided to use DMTSSetIFunction.
But then FormFunction has the form of a TSIFunction, where we do not define a dm object, and I think it does not understand that the dm and the da are connected, although I have used TSSetDM in the main function.
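
For reference, the registration for this second method would look like the following minimal sketch (assuming ts, the DMDA da, and user are set up as in the main function described above); a plain TSIFunction such as the FormFunction sketched earlier can then recover the DM with TSGetDM:

    ierr = TSSetDM(ts,da);CHKERRQ(ierr);                          /* connect the DMDA to the TS */
    ierr = DMTSSetIFunction(da,FormFunction,&user);CHKERRQ(ierr); /* register the residual on the DM */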

Can you please advise me on what to do?
Regards,
Sepideh





________________________________
From: Matthew Knepley <knepley at gmail.com>
Sent: Monday, April 16, 2018 9:34:01 PM

To: Sepideh Kavousi
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] error running parallel on cluster

On Mon, Apr 16, 2018 at 10:22 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:

I ran with valgrind using the following command:

/work/skavou1/petsc/debug/bin/mpiexec -np 2 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./one.out -malloc off -ts_monitor -snes_fd_color -ts_max_snes_failures -1  -ts_type beuler -pc_type bjacobi  -snes_linesearch_type nleqerr -snes_type newtontr -ksp_gmres_restart 31 -sub_pc_type ilu -snes_monitor -snes_converged_reason -ksp_monitor

Okay, I think this is coded wrong. The DMDA stuff ignores DMLocal and only chooses to ghost the input vector X and not X_t:

https://bitbucket.org/petsc/petsc/annotate/be3efd428a942676a0189b3273b3c582694ff011/src/ts/utils/dmdats.c?at=master&fileviewer=file-view-default#dmdats.c-68

Unfortunately, this was done by Peter in 2012 and he is gone, so we can't blame him.
It is true that very few people use spatial derivatives of X_t, but it does make sense.
Right there, you want to add


  ierr = DMGetLocalVector(dm,&Xdotloc);CHKERRQ(ierr);
  ierr = DMGlobalToLocalBegin(dm,Xdot,INSERT_VALUES,Xdotloc);CHKERRQ(ierr);
  ierr = DMGlobalToLocalEnd(dm,Xdot,INSERT_VALUES,Xdotloc);CHKERRQ(ierr);

  ierr = DMDAVecGetArray(dm,Xdotloc,&xdot);CHKERRQ(ierr);



and it should work.

  Matt


The valgrind output is as follows:

==39396== Invalid read of size 8
==39396==    at 0x402679: FormFunction (one.c:212)
==39396==    by 0x611AA5F: TSComputeIFunction_DMDA (dmdats.c:79)
==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
==39396==    by 0x6153380: TSStep (ts.c:3548)
==39396==    by 0x615559D: TSSolve (ts.c:3731)
==39396==    by 0x403C95: main (one.c:303)
==39396==  Address 0x83f5158 is 8 bytes before a block of size 7,200 alloc'd
==39396==    at 0x4A05588: memalign (vg_replace_malloc.c:727)
==39396==    by 0x4D87516: PetscMallocAlign (mal.c:42)
==39396==    by 0x4D88DE0: PetscMallocA (mal.c:397)
==39396==    by 0x50AB230: VecGetArray2d (rvector.c:2167)
==39396==    by 0x589437E: DMDAVecGetArray (dagetarray.c:73)
==39396==    by 0x611A8ED: TSComputeIFunction_DMDA (dmdats.c:74)
==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
==39396==    by 0x6153380: TSStep (ts.c:3548)
==39396==    by 0x615559D: TSSolve (ts.c:3731)
==39396==    by 0x403C95: main (one.c:303)
==39396==
==39396== Invalid read of size 8
==39396==    at 0x402689: FormFunction (one.c:212)
==39396==    by 0x611AA5F: TSComputeIFunction_DMDA (dmdats.c:79)
==39396==    by 0x61349BD: TSComputeIFunction (ts.c:830)
==39396==    by 0x623DD1C: SNESTSFormFunction_Theta (theta.c:649)
==39396==    by 0x615E599: SNESTSFormFunction (ts.c:4569)
==39396==    by 0x600C84E: SNESComputeFunction (snes.c:2203)
==39396==    by 0x60A34B6: SNESSolve_NEWTONTR (tr.c:105)
==39396==    by 0x60246C0: SNESSolve (snes.c:4312)
==39396==    by 0x6237346: TSTheta_SNESSolve (theta.c:176)
==39396==    by 0x6237CA8: TSStep_Theta (theta.c:216)
==39396==    by 0x6153380: TSStep (ts.c:3548)
==39396==    by 0x615559D: TSSolve (ts.c:3731)
==39396==    by 0x403C95: main (one.c:303)
==39396==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==39396==

Thanks,

Sepideh

________________________________
From: Matthew Knepley <knepley at gmail.com>
Sent: Monday, April 16, 2018 8:44:43 PM
To: Sepideh Kavousi
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] error running parallel on cluster

On Mon, Apr 16, 2018 at 9:39 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:

This is not used for ghost points. My code is:

1) You are using ghost points, since you use i-1 as an index

2) I was wrong. X_t should also be ghosted.

Run under valgrind.

  Thanks,

     Matt


    for (j=info2->ys;j<info2->ys+info2->ym;j++){
        for (i=info2->xs;i<info2->xs+info2->xm;i++){
            if (i==0) {aF[j][i].U=aY[j][i+1].U-aY[j][i].U ; aF[j][i].p=aY[j][i+1].p-aY[j][i].p ;}
            else if (i==info2->mx-1) {aF[j][i].U=aY[j][i-1].U-aY[j][i].U ; aF[j][i].p=aY[j][i].p-aY[j][i-1].p ;}
            else if (j==0) {aF[j][i].U=aY[j+1][i].U-aY[j][i].U; aF[j][i].p=aY[j+1][i].p-aY[j][i].p ;}
            else if (j==info2->my-1) {aF[j][i].U=aY[j-1][i].U-aY[j][i].U ; aF[j][i].p=aY[j][i].p-aY[j-1][i].p ;}
            else {
                //derivatives of U and p
                Px=(aY[j][i+1].p-aY[j][i-1].p)/(2.0*user->hx);
                Py=(aY[j+1][i].p-aY[j-1][i].p)/(2.0*user->hy);
                Pxx=((aY[j][i+1].p+aY[j][i-1].p-2.0*aY[j][i].p)/(hx2));
                Pxy=((aY[j+1][i+1].p+aY[j-1][i-1].p-aY[j+1][i-1].p-aY[j-1][i+1].p)/(4.0*hxy));
                Pyy=((aY[j+1][i].p+aY[j-1][i].p-2.0*aY[j][i].p)/hy2);
                Pxt=(aYdot[j][i+1].p-aYdot[j][i-1].p)/(2.0*user->hx);
                Pyt=(aYdot[j+1][i].p-aYdot[j-1][i].p)/(2.0*user->hy);


                if (((Px*Px)+(Py*Py))>0.0 ) { //
                    theta=PetscAtanReal(Py/Px);
                    thetax=(Px*Pxy-Py*Pxx)/((Px*Px)+(Py*Py));
                    thetay=(Px*Pyy-Py*Pxy)/((Px*Px)+(Py*Py));
                    }
                else  {
                    theta=PETSC_PI*0.5;
                    thetax=0.0;
                    thetay=0.0;
                    }
                e=(1.0+user->epsilon*cos(4.0*theta));
                ep=-(4.0*user->epsilon*sin(4.0*theta));
                epp=-(4.0*user->epsilon*4.0*cos(4.0*theta));



                Ux=(aY[j][i+1].U-aY[j][i-1].U)/(2.0*user->hx);
                Uy=(aY[j+1][i].U-aY[j-1][i].U)/(2.0*user->hy);
                Uxx=((aY[j][i+1].U+aY[j][i-1].U-2.0*aY[j][i].U)/(hx2));
                Uyy=((aY[j+1][i].U+aY[j-1][i].U-2.0*aY[j][i].U)/hy2);

                U1=user->D*user->tau_0/(user->W*user->W);
                U2= ((Px*Px+Py*Py)>0.0) ? user->a_p/sqrt(Px*Px+Py*Py) : 0.0 ;
                //U2=0.0;
                P1= (user->W*user->hy*j-user->Vp*user->tau_0*t)/user->lT;

                aF[j][i].U=(U1*(-0.5*(Ux*Px+Uy*Py)+0.5*(1-aY[j][i].p)*(Uxx+Uyy))+U2*(1+(1-user->k)*aY[j][i].U)*(Pxt*Px+Pyt*Py+aYdot[j][i].p*(Pxx+Pyy))+U2*(1-user->k)*aYdot[j][i].p*(Ux*Px+Uy*Py)+0.5*(1+(1-user->k)*aY[j][i].U)*aYdot[j][i].p-0.5*(1+user->k-(1-user->k)*aY[j][i].p)*aYdot[j][i].U)*hx2;

                aF[j][i].p=( e*e*(Pxx+Pyy)+2.0*e*ep*(thetax*Px+thetay*Py)+(e*epp+ep*ep)*(thetay*Px-thetax*Py)+(aY[j][i].p-aY[j][i].p*aY[j][i].p*aY[j][i].p)-user->lambda*(aY[j][i].p*aY[j][i].p-1)*(aY[j][i].p*aY[j][i].p-1)*(aY[j][i].U+P1)-(1-(1-user->k)*P1)*(1.0+4.0*user->epsilon*cos(4.0*theta))*(1.0+4.0*user->epsilon*cos(4.0*theta))*aYdot[j][i].p )*hx2 ;

            }
        }
    }
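
As a consistency check on the anisotropy terms above, e, ep, and epp are one function of theta and its first two derivatives (eps stands for user->epsilon):

    /* e   = 1 + eps*cos(4*theta)
       ep  = de/dtheta     = -4*eps*sin(4*theta)
       epp = d^2e/dtheta^2 = -16*eps*cos(4*theta), written above as -(4.0*eps*4.0*cos(4.0*theta)) */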

Sepideh

________________________________
From: Matthew Knepley <knepley at gmail.com>
Sent: Monday, April 16, 2018 8:36:37 PM
To: Sepideh Kavousi
Cc: petsc-users at mcs.anl.gov
Subject: Re: [petsc-users] error running parallel on cluster

On Mon, Apr 16, 2018 at 9:20 PM, Sepideh Kavousi <skavou1 at lsu.edu> wrote:
Hello,
I am solving a PDE in which I have the spatial derivative of the time derivative of a variable. In the DMDATSSetIFunctionLocal function I defined d(dY/dt)/dx as:
(aYdot[j][i+1].p-aYdot[j][i-1].p)/(2.0*user->hx)

I do not think that you get ghosted Ydot in that function.

   Matt

On my workstation it works, but on the cluster it gives an error. I cannot run with a debugger on the cluster, but I have checked and the error is related to this part, and I do not have any problem on the cluster when running in serial. Can you please help me with it?

The error is:
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] TSComputeIFunction_DMDA line 63 /ddnB/work/skavou1/petsc/src/ts/utils/dmdats.c
[0]PETSC ERROR: [0] TS user implicit function line 829 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: [0] TSComputeIFunction line 815 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: [0] SNESTSFormFunction_Theta line 640 /ddnB/work/skavou1/petsc/src/ts/impls/implicit/theta/theta.c
[0]PETSC ERROR: [0] SNESTSFormFunction line 4564 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: [0] SNES user function line 2202 /ddnB/work/skavou1/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: [0] SNESComputeFunction line 2187 /ddnB/work/skavou1/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: [0] SNESSolve_NEWTONTR line 90 /ddnB/work/skavou1/petsc/src/snes/impls/tr/tr.c
[0]PETSC ERROR: [0] SNESSolve line 4203 /ddnB/work/skavou1/petsc/src/snes/interface/snes.c
[0]PETSC ERROR: [0] TSTheta_SNESSolve line 175 /ddnB/work/skavou1/petsc/src/ts/impls/implicit/theta/theta.c
[0]PETSC ERROR: [0] TSStep_Theta line 191 /ddnB/work/skavou1/petsc/src/ts/impls/implicit/theta/theta.c
[0]PETSC ERROR: [0] TSStep line 3526 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: [0] TSSolve line 3668 /ddnB/work/skavou1/petsc/src/ts/interface/ts.c
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.9.0, unknown
[0]PETSC ERROR: ./one.out on a debug named mike1 by skavou1 Mon Apr 16 20:11:44 2018
[0]PETSC ERROR: Configure options --with-cc=gcc --with-cxx=g++ --with-fc=gfortran --download-mpich --download-fblaslapack
[0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
==29057==
==29057== HEAP SUMMARY:
==29057==     in use at exit: 560,151 bytes in 383 blocks
==29057==   total heap usage: 8,734 allocs, 8,351 frees, 4,449,350 bytes allocated
==29057==
==29057== LEAK SUMMARY:
==29057==    definitely lost: 0 bytes in 0 blocks
==29057==    indirectly lost: 0 bytes in 0 blocks
==29057==      possibly lost: 0 bytes in 0 blocks
==29057==    still reachable: 560,151 bytes in 383 blocks
==29057==         suppressed: 0 bytes in 0 blocks
==29057== Rerun with --leak-check=full to see details of leaked memory
==29057==
==29057== For counts of detected and suppressed errors, rerun with: -v
==29057== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)




Thanks,
Sepideh



--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


