[petsc-users] killed 9 signal after upgrade from petsc 3.9.4 to 3.12.2

Fri Jan 10 14:19:36 CST 2020

  Can you please try v3.12.3? There was some funky business mistakenly added related to partitioning that has been fixed in 3.12.3.

   Barry


> On Jan 10, 2020, at 1:57 PM, Santiago Andres Triana <repepo at gmail.com> wrote:
> 
> Dear all,
> 
> I ran the program with valgrind --tool=massif; the results are cryptic to me ... not sure who's the memory hog! The logs are attached.
> 
> The command I used is:
> mpiexec -n 24 valgrind --tool=massif --num-callers=20 --log-file=valgrind.log.%p ./ex7 -f1 A.petsc -f2 B.petsc -eps_nev 1 $opts -eps_target -4.008e-3+1.57142i -eps_target_magnitude -eps_tol 1e-14
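
For reading the massif data: the profile itself normally goes to massif.out.<pid> (the --log-file option only redirects valgrind's own messages), and `ms_print massif.out.<pid>` prints a usage chart plus the allocation tree at the peak. As a minimal sketch, assuming the usual plain-text layout of a massif.out file, the snapshot headers can also be scanned directly to find the peak on each rank. The sample text and the function names below are made up for illustration and are not taken from the attached logs.

```python
from typing import Dict, List

# Made-up sample that mimics the layout of a massif.out.<pid> file;
# the numbers are illustrative only.
SAMPLE = """\
desc: --num-callers=20
cmd: ./ex7 -f1 A.petsc -f2 B.petsc
time_unit: i
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=0
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=1
#-----------
time=1000
mem_heap_B=4096
mem_heap_extra_B=64
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=2
#-----------
time=2000
mem_heap_B=1024
mem_heap_extra_B=16
mem_stacks_B=0
heap_tree=empty
"""

# Integer fields recorded for every snapshot.
FIELDS = ("time", "mem_heap_B", "mem_heap_extra_B", "mem_stacks_B")


def parse_massif_snapshots(text: str) -> List[Dict[str, int]]:
    """Collect the integer fields of every snapshot in massif.out text."""
    snapshots: List[Dict[str, int]] = []
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # separator lines and anything else without "="
        if key == "snapshot":
            snapshots.append({"snapshot": int(value)})
        elif snapshots and key in FIELDS:
            snapshots[-1][key] = int(value)
        # any other key (header lines, heap_tree=...) is ignored
    return snapshots


def peak_heap_snapshot(snapshots: List[Dict[str, int]]) -> Dict[str, int]:
    """Return the snapshot with the largest heap + heap bookkeeping size."""
    return max(
        snapshots,
        key=lambda s: s.get("mem_heap_B", 0) + s.get("mem_heap_extra_B", 0),
    )


if __name__ == "__main__":
    peak = peak_heap_snapshot(parse_massif_snapshots(SAMPLE))
    print("peak snapshot:", peak["snapshot"], "heap bytes:", peak["mem_heap_B"])
```

For a real run, replace SAMPLE with the contents of each massif.out.<pid> file and compare the per-rank peaks between the two PETSc builds.
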
> 
> Is there any possibility to install a version of superlu_dist (or mumps) different from what the petsc version automatically downloads?
> 
> Thanks!
> Santiago
> 
> 
> On Thu, Jan 9, 2020 at 10:04 PM Dave May <dave.mayhem23 at gmail.com> wrote:
> This kind of issue is difficult to untangle because you have potentially three pieces of software which might have changed between v3.9 and v3.12, namely
> PETSc, SLEPc and SuperLU_DIST.
> You need to isolate which software component is responsible for the 2x increase in memory.
> 
> When I look at the memory usage in the log files, things look very very similar for the raw PETSc objects.
> 
> [v3.9]
> --- Event Stage 0: Main Stage
> 
>               Viewer     4              3         2520     0.
>               Matrix    15             15    125236536     0.
>               Vector    22             22     19713856     0.
>            Index Set    10             10       995280     0.
>          Vec Scatter     4              4         4928     0.
>           EPS Solver     1              1         2276     0.
>   Spectral Transform     1              1          848     0.
>        Basis Vectors     1              1         2168     0.
>          PetscRandom     1              1          662     0.
>               Region     1              1          672     0.
>        Direct Solver     1              1        17440     0.
>        Krylov Solver     1              1         1176     0.
>       Preconditioner     1              1         1000     0.
> 
> versus 
> 
> [v3.12]
> --- Event Stage 0: Main Stage
> 
>               Viewer     4              3         2520     0.
>               Matrix    15             15    125237144     0.
>               Vector    22             22     19714528     0.
>            Index Set    10             10       995096     0.
>          Vec Scatter     4              4         3168     0.
>    Star Forest Graph     4              4         3936     0.
>           EPS Solver     1              1         2292     0.
>   Spectral Transform     1              1          848     0.
>        Basis Vectors     1              1         2184     0.
>          PetscRandom     1              1          662     0.
>               Region     1              1          672     0.
>        Direct Solver     1              1        17456     0.
>        Krylov Solver     1              1         1400     0.
>       Preconditioner     1              1         1000     0.
> 
> Certainly there is no apparent factor 2x increase in memory usage in the underlying petsc objects themselves.
> Furthermore, the counts of creations of petsc objects in toobig.log and justfine.log match, indicating that none of the implementations used in either PETSc or SLEPc have fundamentally changed wrt the usage of the native petsc objects.
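
To make the comparison of the two object tables concrete, here is a small sketch that sums the memory column of each table as quoted above. The helper name `total_bytes` is hypothetical, and the column layout (name, creations, destructions, memory, descendants' memory) is assumed from the excerpts shown.

```python
# Object tables copied from the quoted log excerpts above.
V39 = """
              Viewer     4              3         2520     0.
              Matrix    15             15    125236536     0.
              Vector    22             22     19713856     0.
           Index Set    10             10       995280     0.
         Vec Scatter     4              4         4928     0.
          EPS Solver     1              1         2276     0.
  Spectral Transform     1              1          848     0.
       Basis Vectors     1              1         2168     0.
         PetscRandom     1              1          662     0.
              Region     1              1          672     0.
       Direct Solver     1              1        17440     0.
       Krylov Solver     1              1         1176     0.
      Preconditioner     1              1         1000     0.
"""

V312 = """
              Viewer     4              3         2520     0.
              Matrix    15             15    125237144     0.
              Vector    22             22     19714528     0.
           Index Set    10             10       995096     0.
         Vec Scatter     4              4         3168     0.
   Star Forest Graph     4              4         3936     0.
          EPS Solver     1              1         2292     0.
  Spectral Transform     1              1          848     0.
       Basis Vectors     1              1         2184     0.
         PetscRandom     1              1          662     0.
              Region     1              1          672     0.
       Direct Solver     1              1        17456     0.
       Krylov Solver     1              1         1400     0.
      Preconditioner     1              1         1000     0.
"""


def total_bytes(table: str) -> int:
    """Sum the memory column (second-to-last field) over all rows."""
    total = 0
    for line in table.strip().splitlines():
        fields = line.split()
        # Object names may contain spaces, so index from the right.
        total += int(fields[-2])
    return total


if __name__ == "__main__":
    old, new = total_bytes(V39), total_bytes(V312)
    print(f"v3.9:  {old} bytes")
    print(f"v3.12: {new} bytes")
    print(f"ratio: {new / old:.6f}")
```

Both totals come to roughly 146 MB and differ by only a few kilobytes, which is consistent with the observation that the PETSc objects themselves do not account for a 2x increase.
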
> 
> It is also curious that VecNorm is called 3 times in "justfine.log" and 19 times in "toobig.log" - although I don't see how that could be related to your problem...
> 
> The above at least gives me the impression that the memory increase is likely not coming from PETSc.
> I just read Barry's useful email which is even more compelling and also indicates SLEPc is not the likely culprit either as it uses PetscMalloc() internally.
> 
> Some options to identify the problem:
> 
> 1/ Eliminate SLEPc as a possible culprit by not calling EPSSolve() and instead just calling KSPSolve() with some RHS vector.
> * If you still see a 2x increase, switch the preconditioner to using -pc_type bjacobi -ksp_max_it 10 rather than superlu_dist.
> If the memory usage is then good, you can be pretty certain the issue arises internally to superlu_dist.
> 
> 2/ Leave your code as is and perform your profiling using mumps rather than superlu_dist. 
> This is a less reliable test than 1/ since the mumps implementation used with v3.9 and v3.12 may differ...
> 
> Thanks
> Dave
> 
> On Thu, 9 Jan 2020 at 20:17, Santiago Andres Triana <repepo at gmail.com> wrote:
> Dear all,
> 
> I think parmetis is not involved since I still run out of memory if I use the following options:
> export opts='-st_type sinvert -st_ksp_type preonly -st_pc_type lu -st_pc_factor_mat_solver_type superlu_dist -eps_true_residual 1'
> and issuing:
> mpiexec -n 24 ./ex7 -f1 A.petsc -f2 B.petsc -eps_nev 1 -eps_target -4.008e-3+1.57142i $opts -eps_target_magnitude -eps_tol 1e-14 -memory_view
> 
> Bottom line is that the memory usage of petsc-3.9.4 / slepc-3.9.2 is much lower than that of the current version. I can only solve relatively small problems using the 3.12 series :(
> I have an example with smaller matrices that will likely fail in a 32 Gb ram machine with petsc-3.12 but runs just fine with petsc-3.9. The -memory_view output is
> 
> with petsc-3.9.4: (log 'justfine.log' attached)
> 
> Summary of Memory Usage in PETSc
> Maximum (over computational time) process memory:        total 1.6665e+10 max 7.5674e+08 min 6.4215e+08
> Current process memory:                                  total 1.5841e+10 max 7.2881e+08 min 6.0905e+08
> Maximum (over computational time) space PetscMalloc()ed: total 3.1290e+09 max 1.5868e+08 min 1.0179e+08
> Current space PetscMalloc()ed:                           total 1.8808e+06 max 7.8368e+04 min 7.8368e+04
> 
> 
> with petsc-3.12.2: (log 'toobig.log' attached)
> 
> Summary of Memory Usage in PETSc
> Maximum (over computational time) process memory:        total 3.1564e+10 max 1.3662e+09 min 1.2604e+09
> Current process memory:                                  total 3.0355e+10 max 1.3082e+09 min 1.2254e+09
> Maximum (over computational time) space PetscMalloc()ed: total 2.7618e+09 max 1.4339e+08 min 8.6493e+07
> Current space PetscMalloc()ed:                           total 3.6127e+06 max 1.5053e+05 min 1.5053e+05
> 
> Strangely, monitoring with 'top' I can see *appreciably higher* peak memory use, usually twice what -memory_view ends up reporting, both for petsc-3.9.4 and current. The program usually fails at this peak if not enough RAM is available.
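
The two -memory_view summaries can be compared line by line. The following is a rough sketch that parses the summary lines quoted above and prints the 3.12.2 / 3.9.4 ratio of the totals. The function name `parse_memory_view` is hypothetical, and the regular expression assumes the "label: total X max Y min Z" layout shown in these excerpts.

```python
import re
from typing import Dict

# Summary lines copied from the two quoted -memory_view outputs.
LOG_394 = """\
Maximum (over computational time) process memory:        total 1.6665e+10 max 7.5674e+08 min 6.4215e+08
Current process memory:                                  total 1.5841e+10 max 7.2881e+08 min 6.0905e+08
Maximum (over computational time) space PetscMalloc()ed: total 3.1290e+09 max 1.5868e+08 min 1.0179e+08
Current space PetscMalloc()ed:                           total 1.8808e+06 max 7.8368e+04 min 7.8368e+04
"""

LOG_3122 = """\
Maximum (over computational time) process memory:        total 3.1564e+10 max 1.3662e+09 min 1.2604e+09
Current process memory:                                  total 3.0355e+10 max 1.3082e+09 min 1.2254e+09
Maximum (over computational time) space PetscMalloc()ed: total 2.7618e+09 max 1.4339e+08 min 8.6493e+07
Current space PetscMalloc()ed:                           total 3.6127e+06 max 1.5053e+05 min 1.5053e+05
"""

LINE = re.compile(
    r"^(?P<label>.+?):\s+total\s+(?P<total>\S+)\s+max\s+(?P<max>\S+)\s+min\s+(?P<min>\S+)\s*$"
)


def parse_memory_view(text: str) -> Dict[str, float]:
    """Map each summary label to its 'total' value in bytes."""
    totals: Dict[str, float] = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            totals[match.group("label")] = float(match.group("total"))
    return totals


if __name__ == "__main__":
    old, new = parse_memory_view(LOG_394), parse_memory_view(LOG_3122)
    for label in old:
        print(f"{label}: {new[label] / old[label]:.2f}x")
```

With these numbers the maximum process memory grows by about 1.9x, while the maximum PetscMalloc()ed space is slightly smaller in 3.12.2 (about 0.88x). That suggests the extra memory is allocated outside PetscMalloc(), for example by an external package.
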
> 
> The matrices for the example quoted above can be downloaded here (I use slepc's tutorial ex7.c to solve the problem):
> https://www.dropbox.com/s/as9bec9iurjra6r/A.petsc?dl=0  (about 600 Mb)
> https://www.dropbox.com/s/u2bbmng23rp8l91/B.petsc?dl=0  (about 210 Mb)
> 
> I haven't been able to use a debugger successfully since I am using a compute node without the possibility of an xterm ... note that I have no experience using a debugger so any help on that will also be appreciated!
> Hope I can switch to the current petsc/slepc version for my production runs soon...
> 
> Thanks again!
> Santiago
> 
> 
> 
> On Thu, Jan 9, 2020 at 4:25 PM Stefano Zampini <stefano.zampini at gmail.com> wrote:
> Can you reproduce the issue with smaller matrices? Or with a debug build (i.e. using --with-debugging=1 and compilation flags -O2 -g)?
> 
> The only changes in parmetis between the two PETSc releases are these below, but I don’t see how they could cause issues
> 
> kl-18448:pkg-parmetis szampini$ git log -2
> commit ab4fedc6db1f2e3b506be136e3710fcf89ce16ea (HEAD -> master, tag: v4.0.3-p5, origin/master, origin/dalcinl/random, origin/HEAD)
> Author: Lisandro Dalcin <dalcinl at gmail.com>
> Date:   Thu May 9 18:44:10 2019 +0300
> 
>     GKLib: Make FPRFX##randInRange() portable for 32bit/64bit indices
> 
> commit 2b4afc79a79ef063f369c43da2617fdb64746dd7
> Author: Lisandro Dalcin <dalcinl at gmail.com>
> Date:   Sat May 4 17:22:19 2019 +0300
> 
>     GKlib: Use gk_randint32() to define the RandomInRange() macro
> 
> 
> 
>> On Jan 9, 2020, at 4:31 AM, Smith, Barry F. via petsc-users <petsc-users at mcs.anl.gov> wrote:
>> 
>> 
>>  This is extremely worrisome:
>> 
>> ==23361== Use of uninitialised value of size 8
>> ==23361==    at 0x847E939: gk_randint64 (random.c:99)
>> ==23361==    by 0x847EF88: gk_randint32 (random.c:128)
>> ==23361==    by 0x81EBF0B: libparmetis__Match_Global (in /space/hpc-home/trianas/petsc-3.12.3/arch-linux2-c-debug/lib/libparmetis.so)
>> 
>> do you get that with PETSc-3.9.4 or only with 3.12.3?  
>> 
>>   This may result in Parmetis using non-random numbers and then giving back an inappropriate ordering that requires more memory for SuperLU_DIST.
>> 
>>  Suggest looking at the code, or running in the debugger to see what is going on there. We use parmetis all the time and don't see this.
>> 
>>  Barry
>> 
>> 
>> 
>> 
>> 
>> 
>>> On Jan 8, 2020, at 4:34 PM, Santiago Andres Triana <repepo at gmail.com> wrote:
>>> 
>>> Dear Matt, petsc-users:
>>> 
>>> Finally back after the holidays to try to solve this issue, thanks for your patience!
>>> I compiled the latest petsc (3.12.3) with debugging enabled, the same problem appears: relatively large matrices result in out of memory errors. This is not the case for petsc-3.9.4, all fine there.
>>> This is a non-hermitian, generalized eigenvalue problem, I generate the A and B matrices myself and then I use example 7 (from the slepc tutorial at $SLEPC_DIR/src/eps/examples/tutorials/ex7.c ) to solve the problem:
>>> 
>>> mpiexec -n 24 valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p ./ex7 -malloc off -f1 A.petsc -f2 B.petsc -eps_nev 1 -eps_target -2.5e-4+1.56524i -eps_target_magnitude -eps_tol 1e-14 $opts
>>> 
>>> where the $opts variable is:
>>> export opts='-st_type sinvert -st_ksp_type preonly -st_pc_type lu -eps_error_relative ::ascii_info_detail -st_pc_factor_mat_solver_type superlu_dist -mat_superlu_dist_iterrefine 1 -mat_superlu_dist_colperm PARMETIS -mat_superlu_dist_parsymbfact 1 -eps_converged_reason -eps_conv_rel -eps_monitor_conv -eps_true_residual 1'
>>> 
>>> the output from valgrind (sample from one processor) and from the program are attached.
>>> If it's of any use the matrices are here (might need at least 180 Gb of ram to solve the problem successfully under petsc-3.9.4):
>>> 
>>> https://www.dropbox.com/s/as9bec9iurjra6r/A.petsc?dl=0
>>> https://www.dropbox.com/s/u2bbmng23rp8l91/B.petsc?dl=0
>>> 
>>> With petsc-3.9.4 and slepc-3.9.2 I can use matrices up to 10Gb (with 240 Gb ram), but only up to 3Gb with the latest petsc/slepc.
>>> Any suggestions, comments or any other help are very much appreciated!
>>> 
>>> Cheers,
>>> Santiago
>>> 
>>> 
>>> 
>>> On Mon, Dec 23, 2019 at 11:19 PM Matthew Knepley <knepley at gmail.com> wrote:
>>> On Mon, Dec 23, 2019 at 3:14 PM Santiago Andres Triana <repepo at gmail.com> wrote:
>>> Dear all,
>>> 
>>> After upgrading to petsc 3.12.2 my solver program crashes consistently. Before the upgrade I was using petsc 3.9.4 with no problems.
>>> 
>>> My application deals with a complex-valued, generalized eigenvalue problem. The matrices involved are relatively large, typically 2 to 10 Gb in size, which is no problem for petsc 3.9.4.
>>> 
>>> Are you sure that your indices do not exceed 4B? If so, you need to configure using
>>> 
>>>  --with-64-bit-indices
>>> 
>>> Also, it would be nice if you ran with the debugger so we can get a stack trace for the SEGV.
>>> 
>>>  Thanks,
>>> 
>>>    Matt
>>> 
>>> However, after the upgrade I can only obtain solutions when the matrices are small; the solver crashes when the matrices' size exceeds about 1.5 Gb:
>>> 
>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
>>> [0]PETSC ERROR: to get more information on the crash.
>>> 
>>> and so on for each cpu.
>>> 
>>> 
>>> I tried using valgrind and this is the typical output:
>>> 
>>> ==2874== Conditional jump or move depends on uninitialised value(s)
>>> ==2874==    at 0x4018178: index (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x400752D: expand_dynamic_string_token (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x4008009: _dl_map_object (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x40013E4: map_doit (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x400EA53: _dl_catch_error (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x4000ABE: do_preload (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x4000EC0: handle_ld_preload (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x40034F0: dl_main (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x4016274: _dl_sysdep_start (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x4004A99: _dl_start (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x40011F7: ??? (in /lib64/ld-2.22.so)
>>> ==2874==    by 0x12: ???
>>> ==2874== 
>>> 
>>> 
>>> These are my configuration options. Identical for both petsc 3.9.4 and 3.12.2:
>>> 
>>> ./configure --with-scalar-type=complex --download-mumps --download-parmetis --download-metis --download-scalapack=1 --download-fblaslapack=1 --with-debugging=0 --download-superlu_dist=1 --download-ptscotch=1 CXXOPTFLAGS='-O3 -march=native' FOPTFLAGS='-O3 -march=native' COPTFLAGS='-O3 -march=native'
>>> 
>>> 
>>> Thanks in advance for any comments or ideas!
>>> 
>>> Cheers,
>>> Santiago
>>> 
>>> 
>>> -- 
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/
>>> <test1.e6034496><valgrind.log.23361>
>> 
> 
> <massif.out.petsc-3.9><massif.out.petsc-3.12>