[petsc-users] Question about KSP, and makefile linking MPICH

Smith, Barry F. bsmith at mcs.anl.gov
Fri Apr 12 00:18:24 CDT 2019


   This means the mpiexec in your path (use which mpiexec to find exactly which one it is) is not associated with the same MPI that PETSc was ./configured with. Make sure that PETSC_DIR and PETSC_ARCH are defined in your .bashrc file before the PATH is defined, and make sure that each time you edit the .bashrc file you run source ~/.bashrc.
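
   If you want to double-check the mismatch from within a program, a minimal sketch like the one below (a standalone test, not from this thread) prints the rank and communicator size; when launched with mpiexec -n 2, a matched MPI prints "rank 0 of 2" and "rank 1 of 2", while a mismatched mpiexec makes each process report "rank 0 of 1".

      #include <petscsys.h>

      int main(int argc, char **argv)
      {
        PetscErrorCode ierr;
        PetscMPIInt    rank, size;

        ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
        ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
        ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);
        /* Each process reports its rank and the size of PETSC_COMM_WORLD. */
        ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "rank %d of %d\n", rank, size);CHKERRQ(ierr);
        ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);
        ierr = PetscFinalize();
        return ierr;
      }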

   Barry


> On Apr 12, 2019, at 12:13 AM, Yuyun Yang <yyang85 at stanford.edu> wrote:
> 
> I do have a follow-up question regarding MPICH. I've set the PATH in my .bashrc file according to the suggestion. When I call mpiexec -n 2 ./main, however, it does not seem that the two processes are splitting up the work; rather, each seems to still be solving the same problem on its own, at a different speed. Please see a snapshot of some of the results here (it prints the time step count, the time, and delta T):
>  
> 40: t = 2.757997927832237e+01 s, dt = 3.02226e+00
> 43: t = 3.930627798681907e+01 s, dt = 4.41084e+00
> 41: t = 3.100752205217186e+01 s, dt = 3.42754e+00
> 44: t = 4.431113388879382e+01 s, dt = 5.00486e+00
> 42: t = 3.489543343038465e+01 s, dt = 3.88791e+00
> 45: t = 4.999071244027611e+01 s, dt = 5.67958e+00
> 43: t = 3.930627798681907e+01 s, dt = 4.41084e+00
> 46: t = 5.643670313814994e+01 s, dt = 6.44599e+00
> 44: t = 4.431113388879382e+01 s, dt = 5.00486e+00
> 47: t = 6.375326587559283e+01 s, dt = 7.31656e+00
> 45: t = 4.999071244027611e+01 s, dt = 5.67958e+00
> 48: t = 7.205870073386113e+01 s, dt = 8.30543e+00
> 46: t = 5.643670313814994e+01 s, dt = 6.44599e+00
> 49: t = 8.148735633607836e+01 s, dt = 9.42866e+00
> 47: t = 6.375326587559283e+01 s, dt = 7.31656e+00
> 50: t = 9.219190788945764e+01 s, dt = 1.07046e+01
> 48: t = 7.205870073386113e+01 s, dt = 8.30543e+00
>  
> From: petsc-users <petsc-users-bounces at mcs.anl.gov> On Behalf Of Yuyun Yang via petsc-users
> Sent: Thursday, April 11, 2019 10:02 PM
> To: Smith, Barry F. <bsmith at mcs.anl.gov>
> Cc: petsc-users at mcs.anl.gov
> Subject: Re: [petsc-users] Question about KSP, and makefile linking MPICH
>  
> I think this problem arose because I did not reset the KSP before solving a different problem! It's not giving me an error anymore now that I added the reset, so it's all good :)
>  
> Thanks,
> Yuyun
>  
> From: Smith, Barry F. <bsmith at mcs.anl.gov>
> Sent: Thursday, April 11, 2019 9:21:11 PM
> To: Yuyun Yang
> Cc: petsc-users at mcs.anl.gov
> Subject: Re: [petsc-users] Question about KSP, and makefile linking MPICH
>  
> 
>    Ahh, I just realized one other thing we can try. Run the program that crashes with -ksp_mat_view binary; this will produce a file called binaryoutput. Send that file to petsc-maint at mcs.anl.gov and we'll see if we can get MUMPS to misbehave with it also.
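> 
>    If for some reason the command-line option is inconvenient, roughly the same binary file can also be written from the code. A minimal sketch, assuming the matrix handle passed to KSPSetOperators() is named A:
> 
>       PetscViewer viewer;
>       ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "binaryoutput", FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
>       ierr = MatView(A, viewer);CHKERRQ(ierr);       /* write A in PETSc binary format */
>       ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);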
> 
>    Barry
> 
> 
> 
> > On Apr 11, 2019, at 11:17 PM, Yuyun Yang <yyang85 at stanford.edu> wrote:
> > 
> > Thanks Barry for the detailed answers!
> > 
> > Regarding the problem with valgrind, this is the only error produced, and if I allow it to run further, the program breaks (in a later function I get NaN for some of the computed values, and I've put an assert in to catch NaN results). I will take a look at it in the debugger. This is just for testing; for bigger problems I won't end up using Cholesky, so it's not really a big issue.
> > 
> > Thanks again for the timely help!
> > Yuyun
> > 
> > From: Smith, Barry F. <bsmith at mcs.anl.gov>
> > Sent: Thursday, April 11, 2019 6:44:54 PM
> > To: Yuyun Yang
> > Cc: petsc-users at mcs.anl.gov
> > Subject: Re: [petsc-users] Question about KSP, and makefile linking MPICH
> >  
> > 
> > 
> > > On Apr 11, 2019, at 5:44 PM, Yuyun Yang via petsc-users <petsc-users at mcs.anl.gov> wrote:
> > > 
> > > Hello team,
> > >  
> > > I'd like to check: is it OK to use the same KSP object and change its operator (the matrix A) later on in the code to solve a different problem?
> > 
> >    Do you mean call KSPSetOperators() with one matrix and then later call it with a different matrix? This is OK if the two matrices are the same size and have the same parallel layout. But if the matrices are a different size or have a different parallel layout, then you need to destroy the KSP and create a new one, or call KSPReset() in between, for example:
> > 
> >   KSPSetFromOptions(ksp);
> >   KSPSetOperators(ksp,A,A);
> >   KSPSolve(ksp,b,x); 
> >   KSPReset(ksp);
> >   KSPSetOperators(ksp,B,B);
> >   KSPSolve(ksp,newb,newx);
> > 
> > >  
> > > Also, I know I've asked before about linking to MPICH when I call mpirun, instead of using my computer's default MPI, but I want to check again. The same problem was solved on my cluster by using a different CLINKER (called mpiicc) and a different Intel compiler in the makefile, which links my compiled code with MPICH. Is there something similar I can do on my own computer, instead of having to spell out the very long path to the MPICH I configured with PETSc every time I call the executable? (I tried setting CLINKER = mpiicc on my own computer, but that didn't work.)
> > 
> >     Are you asking how you can avoid something like
> > 
> >       /home/me/petsc/arch-myarch/bin/mpiexec -n 2 ./mycode ?
> > 
> >    You can add /home/me/petsc/arch-myarch/bin to the beginning of your PATH. For example, with bash, put the export line below into your ~/.bashrc file; after that a plain mpiexec will pick up the right MPI:
> > 
> >       export PATH=/home/me/petsc/arch-myarch/bin:$PATH
> >       mpiexec -n 2 ./mycode
> > 
> > >  
> > > The final question is related to valgrind. I have defined a setupKSP function to do all the solver/PC setup. It seems like there is a problem with memory allocation, but I don't really understand why. This only happens for MUMPSCHOLESKY, though (running CG, AMG, etc. was fine):
> > >  
> > > ==830== Invalid read of size 8
> > > ==830==    at 0x6977C95: dmumps_ana_o_ (dana_aux.F:2054)
> > > ==830==    by 0x6913B5A: dmumps_ana_driver_ (dana_driver.F:390)
> > > ==830==    by 0x68C152C: dmumps_ (dmumps_driver.F:1213)
> > > ==830==    by 0x68BBE1C: dmumps_f77_ (dmumps_f77.F:267)
> > > ==830==    by 0x68BA4EB: dmumps_c (mumps_c.c:417)
> > > ==830==    by 0x5A070D6: MatCholeskyFactorSymbolic_MUMPS (mumps.c:1654)
> > > ==830==    by 0x54071F2: MatCholeskyFactorSymbolic (matrix.c:3179)
> > > ==830==    by 0x614AFE9: PCSetUp_Cholesky (cholesky.c:88)
> > > ==830==    by 0x62BA574: PCSetUp (precon.c:932)
> > > ==830==    by 0x640BB29: KSPSetUp (itfunc.c:391)
> > > ==830==    by 0x4A1192: PressureEq::setupKSP(_p_KSP*&, _p_PC*&, _p_Mat*&) (pressureEq.cpp:834)
> > > ==830==    by 0x4A1258: PressureEq::computeInitialSteadyStatePressure(Domain&) (pressureEq.cpp:862)
> > >  
> > > ==830==  Address 0xb8149c0 is 0 bytes after a block of size 7,872 alloc'd
> > >  
> > > ==830==    at 0x4C2FFC6: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> > > ==830==    by 0x500E7E0: PetscMallocAlign (mal.c:41)
> > > ==830==    by 0x59F8A16: MatConvertToTriples_seqaij_seqsbaij (mumps.c:402)
> > > ==830==    by 0x5A06B53: MatCholeskyFactorSymbolic_MUMPS (mumps.c:1618)
> > > ==830==    by 0x54071F2: MatCholeskyFactorSymbolic (matrix.c:3179)
> > > ==830==    by 0x614AFE9: PCSetUp_Cholesky (cholesky.c:88)
> > > ==830==    by 0x62BA574: PCSetUp (precon.c:932)
> > > ==830==    by 0x640BB29: KSPSetUp (itfunc.c:391)
> > > ==830==    by 0x4A1192: PressureEq::setupKSP(_p_KSP*&, _p_PC*&, _p_Mat*&) (pressureEq.cpp:834)
> > > ==830==    by 0x4A1258: PressureEq::computeInitialSteadyStatePressure(Domain&) (pressureEq.cpp:862)
> > > ==830==    by 0x49B809: PressureEq::PressureEq(Domain&) (pressureEq.cpp:62)
> > > ==830==    by 0x4A88E9: StrikeSlip_LinearElastic_qd::StrikeSlip_LinearElastic_qd(Domain&) (strikeSlip_linearElastic_qd.cpp:57)
> > 
> >    This is curious. The line in the MUMPS code where valgrind detects a problem is 
> > 
> >             K = 1_8
> >             THEMIN = ZERO
> >             DO
> >                IF(THEMIN .NE. ZERO) EXIT
> >                THEMIN = abs(id%A(K))                               <<<<<<< this line
> >                K = K+1_8
> > 
> >    So it has a problem accessing id%A(1), the very first entry in the numerical values of the sparse matrix. Meanwhile it states "0 bytes after
> > a block of size 7,872 alloc'd" at MatConvertToTriples_seqaij_seqsbaij (mumps.c:402), which is where PETSc allocates the values passed to MUMPS.
> > So it is almost as if MUMPS never allocated any space for id%A(); I can't imagine why that would ever happen (the problem size is super small,
> > so it's not like it ran out of memory).
> > 
> >    What happens if you allow valgrind to continue? Do you get more valgrind errors?
> > 
> >    What happens if you run without valgrind? Does it crash at this point in the code? At some later point? Does it run to completion and seem to
> > produce the correct answer? If it crashes, you could run it in the debugger and, when it crashes, print the values of id, id%A, etc., and see if
> > they look reasonable.
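> > 
> >    If it helps to take setupKSP out of the picture, the same factorization can be reproduced in a few lines. A hypothetical sketch (the ksp, pc and A names are assumptions; the actual setupKSP code is not shown in this thread):
> > 
> >       ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
> >       ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
> >       ierr = KSPSetType(ksp, KSPPREONLY);CHKERRQ(ierr);               /* direct solve, no Krylov iterations */
> >       ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
> >       ierr = PCSetType(pc, PCCHOLESKY);CHKERRQ(ierr);
> >       ierr = PCFactorSetMatSolverType(pc, MATSOLVERMUMPS);CHKERRQ(ierr);
> >       ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
> >       ierr = KSPSetUp(ksp);CHKERRQ(ierr);                             /* the symbolic factorization valgrind complains about happens here */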
> > 
> >    Barry
> > 
> > 
> > 
> > 
> > >  
> > > Thank you!
> > > Yuyun
> 


