<html>
  <head>
    <meta content="text/html; charset=windows-1252"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">So when I run with -no_signal_handler I
      get an error from valgrind saying that it received a signal and
      was aborting...<br>
      Is there a way to trim what PETSc receives from argc/argv so that
      it runs correctly?<br>
      <br>
      I also ran an example on my local computer with the same
      parameters and with 4 MPI ranks (that's all my desktop can do)
      and got the following output:<br>
      <br>
      <blockquote><font face="Courier New, Courier, monospace">luc@euler:~/research/simulations/Petsc_FS/parallel_ISFS/10elem_test/4cores$
          mpirun -n 4 /usr/bin/valgrind --leak-check=full
          --tool=memcheck --track-origins=yes -q
          /home/luc/research/feap_repo/ShearBands/parfeap/feap -ksp_type
          preonly -pc_type lu -pc_factor_mat_solver_package mumps
          -ksp_diagonal_scale<br>
          <br>
          <br>
              F I N I T E   E L E M E N T   A N A L Y S I S   P R O G R
          A M<br>
          <br>
                     FEAP (C) Regents of the University of California<br>
                                   All Rights Reserved.<br>
                                 VERSION: Release 8.3.19      <br>
                                    DATE: 29 March 2011       <br>
          <br>
                   Files are set as:   Status    Filename<br>
          <br>
                     Input   (read ) : Exists 
          ILU_0001                        <br>
                     Output  (write) : Exists 
          OLU_0001                        <br>
                     Restart (read ) : New    
          RLU_0001                        <br>
                     Restart (write) : New    
          RLU_0001                        <br>
                     Plots   (write) : New    
          PLU_0001                        <br>
          <br>
                   Caution, existing write files will be overwritten.<br>
          <br>
                   Are filenames correct? ( y or n; s = stop) :y<br>
          <br>
                   R U N N I N G    F E A P    P R O B L E M    N O W<br>
          <br>
                    --> Please report errors by e-mail to:<br>
                        <a class="moz-txt-link-abbreviated" href="mailto:feap@ce.berkeley.edu">feap@ce.berkeley.edu</a><br>
          <br>
           Saving Parallel data to PLU_000000.pvtu<br>
          ==7933== Syscall param writev(vector[...]) points to
          uninitialised byte(s)<br>
          ==7933==    at 0x6C2EF57: writev (writev.c:49)<br>
          ==7933==    by 0x7360F30: MPL_large_writev (mplsock.c:32)<br>
          ==7933==    by 0x65AB588: MPIDU_Sock_writev (sock_immed.i:610)<br>
          ==7933==    by 0x65940B3: MPIDI_CH3_iSendv (ch3_isendv.c:84)<br>
          ==7933==    by 0x65862AC: MPIDI_CH3_EagerContigIsend
          (ch3u_eager.c:550)<br>
          ==7933==    by 0x658AB5C: MPID_Isend (mpid_isend.c:131)<br>
          ==7933==    by 0x6625309: PMPI_Isend (isend.c:122)<br>
          ==7933==    by 0x656671D: PMPI_ISEND (isendf.c:267)<br>
          ==7933==    by 0x1AE7745: __dmumps_comm_buffer_MOD_dmumps_62
          (dmumps_comm_buffer.F:567)<br>
          ==7933==    by 0x1B6D1FC: dmumps_242_ (dmumps_part2.F:739)<br>
          ==7933==    by 0x1AB264C: dmumps_249_ (dmumps_part8.F:6541)<br>
          ==7933==    by 0x1AAB0F6: dmumps_245_ (dmumps_part8.F:3885)<br>
          ==7933==  Address 0x8959098 is 8 bytes inside a block of size
          336 alloc'd<br>
          ==7933==    at 0x4C2AB80: malloc (in
          /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)<br>
          ==7933==    by 0x1AE9465: __dmumps_comm_buffer_MOD_dmumps_2
          (dmumps_comm_buffer.F:175)<br>
          ==7933==    by 0x1AE9852: __dmumps_comm_buffer_MOD_dmumps_55
          (dmumps_comm_buffer.F:123)<br>
          ==7933==    by 0x1AC5FD5: dmumps_301_ (dmumps_part8.F:989)<br>
          ==7933==    by 0x1B36855: dmumps_ (dmumps_part1.F:665)<br>
          ==7933==    by 0x1A170C7: dmumps_f77_ (dmumps_part3.F:6651)<br>
          ==7933==    by 0x19EDDCA: dmumps_c (mumps_c.c:422)<br>
          ==7933==    by 0x13AF869: MatSolve_MUMPS (mumps.c:606)<br>
          ==7933==    by 0xB58981: MatSolve (matrix.c:3122)<br>
          ==7933==    by 0x152174C: PCApply_LU (lu.c:198)<br>
          ==7933==    by 0x14BAB70: PCApply (precon.c:440)<br>
          ==7933==    by 0x164AB13: KSP_PCApply (kspimpl.h:230)<br>
          ==7933==  Uninitialised value was created by a stack
          allocation<br>
          ==7933==    at 0x1AAE33D: dmumps_249_ (dmumps_part8.F:5817)<br>
          ==7933== <br>
          ==7932== Syscall param writev(vector[...]) points to
          uninitialised byte(s)<br>
          ==7932==    at 0x6C2EF57: writev (writev.c:49)<br>
          ==7932==    by 0x7360F30: MPL_large_writev (mplsock.c:32)<br>
          ==7932==    by 0x65AB588: MPIDU_Sock_writev (sock_immed.i:610)<br>
          ==7932==    by 0x65940B3: MPIDI_CH3_iSendv (ch3_isendv.c:84)<br>
          ==7932==    by 0x65862AC: MPIDI_CH3_EagerContigIsend
          (ch3u_eager.c:550)<br>
          ==7932==    by 0x658AB5C: MPID_Isend (mpid_isend.c:131)<br>
          ==7932==    by 0x6625309: PMPI_Isend (isend.c:122)<br>
          ==7932==    by 0x656671D: PMPI_ISEND (isendf.c:267)<br>
          ==7932==    by 0x1AE7745: __dmumps_comm_buffer_MOD_dmumps_62
          (dmumps_comm_buffer.F:567)<br>
          ==7932==    by 0x1B6D1FC: dmumps_242_ (dmumps_part2.F:739)<br>
          ==7932==    by 0x1AB264C: dmumps_249_ (dmumps_part8.F:6541)<br>
          ==7932==    by 0x1AAB0F6: dmumps_245_ (dmumps_part8.F:3885)<br>
          ==7932==  Address 0x86e4448 is 8 bytes inside a block of size
          336 alloc'd<br>
          ==7932==    at 0x4C2AB80: malloc (in
          /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)<br>
          ==7932==    by 0x1AE9465: __dmumps_comm_buffer_MOD_dmumps_2
          (dmumps_comm_buffer.F:175)<br>
          ==7932==    by 0x1AE9852: __dmumps_comm_buffer_MOD_dmumps_55
          (dmumps_comm_buffer.F:123)<br>
          ==7932==    by 0x1AC5FD5: dmumps_301_ (dmumps_part8.F:989)<br>
          ==7932==    by 0x1B36855: dmumps_ (dmumps_part1.F:665)<br>
          ==7932==    by 0x1A170C7: dmumps_f77_ (dmumps_part3.F:6651)<br>
          ==7932==    by 0x19EDDCA: dmumps_c (mumps_c.c:422)<br>
          ==7932==    by 0x13AF869: MatSolve_MUMPS (mumps.c:606)<br>
          ==7932==    by 0xB58981: MatSolve (matrix.c:3122)<br>
          ==7932==    by 0x152174C: PCApply_LU (lu.c:198)<br>
          ==7932==    by 0x14BAB70: PCApply (precon.c:440)<br>
          ==7932==    by 0x164AB13: KSP_PCApply (kspimpl.h:230)<br>
          ==7932==  Uninitialised value was created by a stack
          allocation<br>
          ==7932==    at 0x1AAE33D: dmumps_249_ (dmumps_part8.F:5817)<br>
          ==7932== <br>
          ==7934== Syscall param writev(vector[...]) points to
          uninitialised byte(s)<br>
          ==7934==    at 0x6C2EF57: writev (writev.c:49)<br>
          ==7934==    by 0x7360F30: MPL_large_writev (mplsock.c:32)<br>
          ==7934==    by 0x65AB588: MPIDU_Sock_writev (sock_immed.i:610)<br>
          ==7934==    by 0x65940B3: MPIDI_CH3_iSendv (ch3_isendv.c:84)<br>
          ==7934==    by 0x65862AC: MPIDI_CH3_EagerContigIsend
          (ch3u_eager.c:550)<br>
          ==7934==    by 0x658AB5C: MPID_Isend (mpid_isend.c:131)<br>
          ==7934==    by 0x6625309: PMPI_Isend (isend.c:122)<br>
          ==7934==    by 0x656671D: PMPI_ISEND (isendf.c:267)<br>
          ==7934==    by 0x1AE7745: __dmumps_comm_buffer_MOD_dmumps_62
          (dmumps_comm_buffer.F:567)<br>
          ==7934==    by 0x1B6D1FC: dmumps_242_ (dmumps_part2.F:739)<br>
          ==7934==    by 0x1AB264C: dmumps_249_ (dmumps_part8.F:6541)<br>
          ==7934==    by 0x1AAB0F6: dmumps_245_ (dmumps_part8.F:3885)<br>
          ==7934==  Address 0x89ec648 is 8 bytes inside a block of size
          336 alloc'd<br>
          ==7934==    at 0x4C2AB80: malloc (in
          /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)<br>
          ==7934==    by 0x1AE9465: __dmumps_comm_buffer_MOD_dmumps_2
          (dmumps_comm_buffer.F:175)<br>
          ==7934==    by 0x1AE9852: __dmumps_comm_buffer_MOD_dmumps_55
          (dmumps_comm_buffer.F:123)<br>
          ==7934==    by 0x1AC5FD5: dmumps_301_ (dmumps_part8.F:989)<br>
          ==7934==    by 0x1B36855: dmumps_ (dmumps_part1.F:665)<br>
          ==7934==    by 0x1A170C7: dmumps_f77_ (dmumps_part3.F:6651)<br>
          ==7934==    by 0x19EDDCA: dmumps_c (mumps_c.c:422)<br>
          ==7934==    by 0x13AF869: MatSolve_MUMPS (mumps.c:606)<br>
          ==7934==    by 0xB58981: MatSolve (matrix.c:3122)<br>
          ==7934==    by 0x152174C: PCApply_LU (lu.c:198)<br>
          ==7934==    by 0x14BAB70: PCApply (precon.c:440)<br>
          ==7934==    by 0x164AB13: KSP_PCApply (kspimpl.h:230)<br>
          ==7934==  Uninitialised value was created by a stack
          allocation<br>
          ==7934==    at 0x1AAE33D: dmumps_249_ (dmumps_part8.F:5817)<br>
          ==7934== <br>
          ==7935== Syscall param writev(vector[...]) points to
          uninitialised byte(s)<br>
          ==7935==    at 0x6C2EF57: writev (writev.c:49)<br>
          ==7935==    by 0x7360F30: MPL_large_writev (mplsock.c:32)<br>
          ==7935==    by 0x65AB588: MPIDU_Sock_writev (sock_immed.i:610)<br>
          ==7935==    by 0x65940B3: MPIDI_CH3_iSendv (ch3_isendv.c:84)<br>
          ==7935==    by 0x65862AC: MPIDI_CH3_EagerContigIsend
          (ch3u_eager.c:550)<br>
          ==7935==    by 0x658AB5C: MPID_Isend (mpid_isend.c:131)<br>
          ==7935==    by 0x6625309: PMPI_Isend (isend.c:122)<br>
          ==7935==    by 0x656671D: PMPI_ISEND (isendf.c:267)<br>
          ==7935==    by 0x1AE7745: __dmumps_comm_buffer_MOD_dmumps_62
          (dmumps_comm_buffer.F:567)<br>
          ==7935==    by 0x1B6D1FC: dmumps_242_ (dmumps_part2.F:739)<br>
          ==7935==    by 0x1AB264C: dmumps_249_ (dmumps_part8.F:6541)<br>
          ==7935==    by 0x1AAB0F6: dmumps_245_ (dmumps_part8.F:3885)<br>
          ==7935==  Address 0x85d8968 is 8 bytes inside a block of size
          336 alloc'd<br>
          ==7935==    at 0x4C2AB80: malloc (in
          /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)<br>
          ==7935==    by 0x1AE9465: __dmumps_comm_buffer_MOD_dmumps_2
          (dmumps_comm_buffer.F:175)<br>
          ==7935==    by 0x1AE9852: __dmumps_comm_buffer_MOD_dmumps_55
          (dmumps_comm_buffer.F:123)<br>
          ==7935==    by 0x1AC5FD5: dmumps_301_ (dmumps_part8.F:989)<br>
          ==7935==    by 0x1B36855: dmumps_ (dmumps_part1.F:665)<br>
          ==7935==    by 0x1A170C7: dmumps_f77_ (dmumps_part3.F:6651)<br>
          ==7935==    by 0x19EDDCA: dmumps_c (mumps_c.c:422)<br>
          ==7935==    by 0x13AF869: MatSolve_MUMPS (mumps.c:606)<br>
          ==7935==    by 0xB58981: MatSolve (matrix.c:3122)<br>
          ==7935==    by 0x152174C: PCApply_LU (lu.c:198)<br>
          ==7935==    by 0x14BAB70: PCApply (precon.c:440)<br>
          ==7935==    by 0x164AB13: KSP_PCApply (kspimpl.h:230)<br>
          ==7935==  Uninitialised value was created by a stack
          allocation<br>
          ==7935==    at 0x1AAE33D: dmumps_249_ (dmumps_part8.F:5817)<br>
          ==7935== <br>
           Saving Parallel data to PLU_000001.pvtu<br>
           Saving Parallel data to PLU_000002.pvtu<br>
           Saving Parallel data to PLU_000003.pvtu<br>
           Saving Parallel data to PLU_000004.pvtu<br>
           Saving Parallel data to PLU_000005.pvtu<br>
           Saving Parallel data to PLU_000006.pvtu<br>
           Saving Parallel data to PLU_000007.pvtu<br>
           Saving Parallel data to PLU_000008.pvtu<br>
           Saving Parallel data to PLU_000009.pvtu<br>
           Saving Parallel data to PLU_000010.pvtu<br>
          ==7932== 13 bytes in 13 blocks are definitely lost in loss
          record 7 of 51<br>
          ==7932==    at 0x4C2AB80: malloc (in
          /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)<br>
          ==7932==    by 0x1BBB3E1: mumps_397.3460
          (mumps_static_mapping.F:3746)<br>
          ==7932==    by 0x1BB8283: __mumps_static_mapping_MOD_mumps_369
          (mumps_static_mapping.F:302)<br>
          ==7932==    by 0x1A4BBF5: dmumps_537_ (dmumps_part5.F:1706)<br>
          ==7932==    by 0x1A550AC: dmumps_26_ (dmumps_part5.F:447)<br>
          ==7932==    by 0x1B346D6: dmumps_ (dmumps_part1.F:409)<br>
          ==7932==    by 0x1A170C7: dmumps_f77_ (dmumps_part3.F:6651)<br>
          ==7932==    by 0x19EDDCA: dmumps_c (mumps_c.c:422)<br>
          ==7932==    by 0x13B2F2B: MatLUFactorSymbolic_AIJMUMPS
          (mumps.c:972)<br>
          ==7932==    by 0xB54FB5: MatLUFactorSymbolic (matrix.c:2842)<br>
          ==7932==    by 0x15205A9: PCSetUp_LU (lu.c:127)<br>
          ==7932==    by 0x14C0685: PCSetUp (precon.c:902)<br>
          ==7932== <br>
          ==7932== 13 bytes in 13 blocks are definitely lost in loss
          record 8 of 51<br>
          ==7932==    at 0x4C2AB80: malloc (in
          /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)<br>
          ==7932==    by 0x1BBB5C7: mumps_397.3460
          (mumps_static_mapping.F:3746)<br>
          ==7932==    by 0x1BB8283: __mumps_static_mapping_MOD_mumps_369
          (mumps_static_mapping.F:302)<br>
          ==7932==    by 0x1A4BBF5: dmumps_537_ (dmumps_part5.F:1706)<br>
          ==7932==    by 0x1A550AC: dmumps_26_ (dmumps_part5.F:447)<br>
          ==7932==    by 0x1B346D6: dmumps_ (dmumps_part1.F:409)<br>
          ==7932==    by 0x1A170C7: dmumps_f77_ (dmumps_part3.F:6651)<br>
          ==7932==    by 0x19EDDCA: dmumps_c (mumps_c.c:422)<br>
          ==7932==    by 0x13B2F2B: MatLUFactorSymbolic_AIJMUMPS
          (mumps.c:972)<br>
          ==7932==    by 0xB54FB5: MatLUFactorSymbolic (matrix.c:2842)<br>
          ==7932==    by 0x15205A9: PCSetUp_LU (lu.c:127)<br>
          ==7932==    by 0x14C0685: PCSetUp (precon.c:902)<br>
          ==7932==<br>
        </font></blockquote>
      These reports come mainly from MUMPS, and I don't think they would
      cause the memory problem that I described earlier.<br>
      It really seems that the issue arises with larger problems on
      more MPI ranks.<br>
      <pre class="moz-signature" cols="72">Best,
Luc</pre>
      On 10/30/2014 12:23 PM, Barry Smith wrote:<br>
    </div>
    <blockquote
      cite="mid:5B51768E-D3B7-4FE9-A2D7-A540C9750DC7@mcs.anl.gov"
      type="cite">
      <pre wrap="">
   Run with the additional PETSc option -no_signal_handler; then PETSc won’t mess with the signals, and it may get you past this point.

  Barry



</pre>
      <blockquote type="cite">
        <pre wrap="">On Oct 30, 2014, at 9:21 AM, Luc Berger-Vergiat <a class="moz-txt-link-rfc2396E" href="mailto:lb2653@columbia.edu">&lt;lb2653@columbia.edu&gt;</a> wrote:

Sorry for the late reply, it took longer than I thought.
So a little update on my situation: 
I have to use a custom version of valgrind on CETUS, which has to be linked to my code using -Wl,-e,_start_valgrind at compilation (I also add an object file and libraries).
After that I can run my code with the following arguments:
feap --ignore-ranges=0x4000000000000-0x4063000000000,0x003fdc0000000-0x003fe00000000 --suppressions=/soft/perftools/valgrind/cnk-baseline.supp
but I can't use the usual PETSc arguments (-ksp_type -pc_type ...) since valgrind does not recognize them.
So I decided to use PETSC_OPTIONS='-ksp_type preonly -pc_type lu -pc_factor_mat_solver_package mumps -ksp_diagonal_scale' to pass my arguments to PETSc.
However, I am a little worried that PETSc still receives the command-line arguments meant for valgrind, since I get the following stderr from valgrind:

stderr[0]: ==1==    by 0x38BEA3F: handle_SCSS_change (m_signals.c:963)
stderr[0]: ==1==    by 0x38C13A7: vgPlain_do_sys_sigaction (m_signals.c:1114)
stderr[0]: ==1==    by 0x3962FBF: vgSysWrap_linux_sys_rt_sigaction_before (syswrap-linux.c:3073)
stderr[0]: ==1==    by 0x3928BF7: vgPlain_client_syscall (syswrap-main.c:1464)
stderr[0]: ==1==    by 0x3925F4B: vgPlain_scheduler (scheduler.c:1061)
stderr[0]: ==1==    by 0x3965EB3: run_a_thread_NORETURN (syswrap-linux.c:103)
stderr[0]: 
</pre>
      </blockquote>
      <pre wrap="">
   
</pre>
      <blockquote type="cite">
        <pre wrap="">stderr[0]: sched status:
stderr[0]:   running_tid=1
stderr[0]: 
stderr[0]: Thread 1: status = VgTs_Runnable
stderr[0]: ==1==    at 0x34290E4: __libc_sigaction (sigaction.c:80)
stderr[0]: ==1==    by 0x3BFF3A7: signal (signal.c:49)
stderr[0]: ==1==    by 0x1E710DF: PetscPushSignalHandler (in /projects/shearbands/ShearBands/parfeap/feap)
stderr[0]: ==1==    by 0x18BEA87: PetscOptionsCheckInitial_Private (in /projects/shearbands/ShearBands/parfeap/feap)
stderr[0]: ==1==    by 0x18E132F: petscinitialize (in /projects/shearbands/ShearBands/parfeap/feap)
stderr[0]: ==1==    by 0x1027557: pstart (in /projects/shearbands/ShearBands/parfeap/feap)
stderr[0]: ==1==    by 0x1000B1F: MAIN__ (feap83.f:213)
stderr[0]: ==1==    by 0x342ABD7: main (fmain.c:21)
stderr[0]: 
stderr[0]: 
stderr[0]: Note: see also the FAQ in the source distribution.
stderr[0]: It contains workarounds to several common problems.
stderr[0]: In particular, if Valgrind aborted or crashed after
stderr[0]: identifying problems in your program, there's a good chance
stderr[0]: that fixing those problems will prevent Valgrind aborting or
stderr[0]: crashing, especially if it happened in m_mallocfree.c.
stderr[0]: 
stderr[0]: If that doesn't help, please report this bug to: <a class="moz-txt-link-abbreviated" href="http://www.valgrind.org">www.valgrind.org</a>
stderr[0]: 
stderr[0]: In the bug report, send all the above text, the valgrind
stderr[0]: version, and what OS and version you are using.  Thank
stderr[0]: s.
stderr[0]: 

I am only showing the output of rank[0] but it seems that all ranks have about the same error message.
Since my problem happens in petscinitialize, I have few ways to check what's wrong...
Any ideas?
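
One way to check that the ranks really report the same trace is to group the stack frames by their stderr[N] prefix and compare them. A minimal Python sketch, assuming every line keeps the "stderr[N]: ==pid==" layout shown above; the names frames_by_rank and ranks_differing_from are made up for illustration:

```python
import re
from collections import defaultdict

# Matches lines such as
#   "stderr[0]: ==1==    by 0x18E132F: petscinitialize (in /path/feap)"
# and captures the rank number and the function name. Addresses are ignored.
FRAME = re.compile(r"^stderr\[(\d+)\]: ==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (\S+)")


def frames_by_rank(lines):
    """Return {rank: [function names in order of appearance]}."""
    frames = defaultdict(list)
    for line in lines:
        match = FRAME.match(line)
        if match:
            frames[int(match.group(1))].append(match.group(2))
    return dict(frames)


def ranks_differing_from(frames, reference_rank=0):
    """List the ranks whose frame sequence differs from the reference rank."""
    reference = frames.get(reference_rank, [])
    return sorted(rank for rank, seq in frames.items() if seq != reference)
```

Feeding the whole stderr file to frames_by_rank and then calling ranks_differing_from on the result gives an empty list when every rank shows the same call chain as rank 0.
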
Best,
Luc

On 10/28/2014 02:53 PM, Barry Smith wrote:
</pre>
        <blockquote type="cite">
          <pre wrap="">   You don’t care about checking for leaks. I use 

-q --tool=memcheck --num-callers=20 --track-origins=yes
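
As a rough sketch of where these flags go on a launch line: the valgrind flags come before the executable and the PETSc options after it, following the mpirun form used elsewhere in this thread. The helper name valgrind_launch and the ./feap path are placeholders, and on CETUS the launcher would be runjob rather than mpirun:

```python
import shlex

# Flags quoted in the message above.
VALGRIND_FLAGS = ["-q", "--tool=memcheck", "--num-callers=20", "--track-origins=yes"]


def valgrind_launch(nranks, executable, petsc_options, valgrind="valgrind"):
    """Build the list: mpirun -n N valgrind FLAGS executable PETSC_OPTIONS..."""
    return (
        ["mpirun", "-n", str(nranks), valgrind]
        + VALGRIND_FLAGS
        + [executable]
        + shlex.split(petsc_options)
    )


if __name__ == "__main__":
    # Placeholder executable path and a shortened option string.
    cmd = valgrind_launch(4, "./feap", "-ksp_type preonly -pc_type lu")
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument separate, so it can be handed directly to subprocess.run without another round of shell quoting.
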


</pre>
          <blockquote type="cite">
            <pre wrap="">On Oct 28, 2014, at 1:50 PM, Luc Berger-Vergiat <a class="moz-txt-link-rfc2396E" href="mailto:lb2653@columbia.edu">&lt;lb2653@columbia.edu&gt;</a>
 wrote:

Yes, I am running with --leak-check=full
Reconfiguring and recompiling the whole library and my code in debug mode does take quite some time on CETUS/MIRA...
Hopefully the queue will move fast and I can give you some details about the issue.

Best,
Luc

On 10/28/2014 02:25 PM, Barry Smith wrote:

</pre>
            <blockquote type="cite">
              <pre wrap="">  You need to pass some options to valgrind telling it to check for memory corruption issues



</pre>
              <blockquote type="cite">
                <pre wrap="">On Oct 28, 2014, at 12:30 PM, Luc Berger-Vergiat <a class="moz-txt-link-rfc2396E" href="mailto:lb2653@columbia.edu">&lt;lb2653@columbia.edu&gt;</a>
 wrote:

Ok, I'm recompiling PETSc in debug mode then.
Do you know what the call sequence should be on CETUS to get valgrind attached to PETSc?
Would this work for example:
runjob --np 32 -p 8 --block $COBALT_PARTNAME --cwd /projects/shearbands/job1/200/4nodes_32cores/LU --verbose=INFO --envs FEAPHOME8_3=/projects/shearbands/ShearBands352 PETSC_DIR=/projects/shearbands/petsc-3.5.2 PETSC_ARCH=arch-linux2-c-opt : /usr/bin/valgrind --log-file=valgrind.log.%p /projects/shearbands/ShearBands352/parfeap/feap -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package mumps -ksp_diagonal_scale < /projects/shearbands/job1/yesfile


Best,
Luc

On 10/28/2014 12:33 PM, Barry Smith wrote:

</pre>
                <blockquote type="cite">
                  <pre wrap="">    Hmm, this should never happen. In the code

  ierr = PetscTableCreate(aij->B->rmap->n,mat->cmap->N+1,&gid1_lid1);CHKERRQ(ierr);
  for (i=0; i&lt;aij->B->rmap->n; i++) {
    for (j=0; j&lt;B->ilen[i]; j++) {
      PetscInt data,gid1 = aj[B->i[i] + j] + 1;
      ierr = PetscTableFind(gid1_lid1,gid1,&data);CHKERRQ(ierr);

Here mat->cmap->N is the total number of columns in the matrix, so the table accepts keys up to N+1, and each gid1 is a column index plus one, which must always be smaller. Most likely there has been memory corruption somewhere before this point. Can you run with valgrind?

<a class="moz-txt-link-freetext" href="http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind">http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind</a>
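
The argument above can be illustrated with a toy Python table that rejects keys above a fixed maximum. This is only an analogue for illustration, not PETSc's PetscTable implementation; KeyTable and local_ids are made-up names:

```python
class KeyTable:
    """Toy stand-in for a table that only accepts keys in 1..max_key."""

    def __init__(self, max_key):
        self.max_key = max_key
        self.data = {}

    def _check(self, key):
        if key not in range(1, self.max_key + 1):
            raise ValueError(
                f"key {key} is outside the allowed range 1..{self.max_key}"
            )

    def add(self, key, value):
        self._check(key)
        self.data[key] = value

    def find(self, key):
        self._check(key)
        return self.data.get(key, 0)  # 0 means "not present" in this toy


def local_ids(column_indices, n_global_cols):
    """Map 0-based global column indices to 1-based local ids, first seen first."""
    table = KeyTable(n_global_cols + 1)
    for col in column_indices:
        gid1 = col + 1  # shift by one so that 0 can mean "absent"
        if table.find(gid1) == 0:
            table.add(gid1, len(table.data) + 1)
    return table.data
```

If the table was created with N+1 = 459888, as the error message suggests, then every valid column index gives a key of at most 459887, and a key of 532150 can only come from a column index that is itself invalid.
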



  Barry



</pre>
                  <blockquote type="cite">
                    <pre wrap="">On Oct 28, 2014, at 10:04 AM, Luc Berger-Vergiat <a class="moz-txt-link-rfc2396E" href="mailto:lb2653@columbia.edu">&lt;lb2653@columbia.edu&gt;</a>

 wrote:

Hi,
I am running a code on CETUS and I use PETSc as a linear solver.
Here is my submission command:
qsub -A shearbands -t 60 -n 4 -O 4nodes_32cores_Mult --mode script 4nodes_32cores_LU

Here is "4nodes_32cores_LU":
#!/bin/sh

LOCARGS="--block $COBALT_PARTNAME ${COBALT_CORNER:+--corner} $COBALT_CORNER ${COBALT_SHAPE:+--shape} $COBALT_SHAPE"
echo "Cobalt location args: $LOCARGS" >&2

################################
#   32 cores on 4 nodes jobs   #
################################
runjob --np 32 -p 8 --block $COBALT_PARTNAME --cwd /projects/shearbands/job1/200/4nodes_32cores/LU --verbose=INFO --envs FEAPHOME8_3=/projects/shearbands/ShearBands352 PETSC_DIR=/projects/shearbands/petsc-3.5.2 PETSC_ARCH=arch-linux2-c-opt : /projects/shearbands/ShearBands352/parfeap/feap -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package mumps -ksp_diagonal_scale -malloc_log mlog -log_summary time.log           < /projects/shearbands/job1/yesfile

I get the following error message:

[7]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[7]PETSC ERROR: Argument out of range
[7]PETSC ERROR: Petsc Release Version 3.5.2, unknown
[7]PETSC ERROR: key 532150 is greater than largest key allowed 459888
[7]PETSC ERROR: Configure options --known-mpi-int64_t=1 --download-cmake=1 --download-hypre=1 --download-metis=1 --download-parmetis=1 --download-plapack=1 --download-superlu_dist=1 --download-mumps=1 --download-ml=1 --known-bits-per-byte=8 --known-level1-dcache-assoc=0 --known-level1-dcache-linesize=32 --known-level1-dcache-size=32768 --known-memcmp-ok=1 --known-mpi-c-double-complex=1 --known-mpi-long-double=1 --known-mpi-shared-libraries=0 --known-sizeof-MPI_Comm=4 --known-sizeof-MPI_Fint=4 --known-sizeof-char=1 --known-sizeof-double=8 --known-sizeof-float=4 --known-sizeof-int=4 --known-sizeof-long-long=8 --known-sizeof-long=8 --known-sizeof-short=2 --known-sizeof-size_t=8 --known-sizeof-void-p=8 --with-batch=1 --with-blacs-include=/soft/libraries/alcf/current/gcc/SCALAPACK/ --with-blacs-lib=/soft/libraries/alcf/current/gcc/SCALAPACK/lib/libscalapack.a --with-blas-lapack-lib="-L/soft/libraries/alcf/current/gcc/LAPACK/lib -llapack -L/soft/libraries/alcf/current/gcc/BLAS/lib -lblas" --with-cc=mpicc --with-cxx=mpicxx --with-debugging=0 --with-fc=mpif90 --with-fortran-kernels=1 --with-is-color-value-type=short --with-scalapack-include=/soft/libraries/alcf/current/gcc/SCALAPACK/ --with-scalapack-lib=/soft/libraries/alcf/current/gcc/SCALAPACK/lib/libscalapack.a --with-shared-libraries=0 --with-x=0 -COPTFLAGS=" -O3 -qhot=level=0 -qsimd=auto -qmaxmem=-1 -qstrict -qstrict_induction" -CXXOPTFLAGS=" -O3 -qhot=level=0 -qsimd=auto -qmaxmem=-1 -qstrict -qstrict_induction" -FOPTFLAGS=" -O3 -qhot=level=0 -qsimd=auto -qmaxmem=-1 -qstrict -qstrict_induction"

[7]PETSC ERROR: #1 PetscTableFind() line 126 in /gpfs/mira-fs1/projects/shearbands/petsc-3.5.2/include/petscctable.h
[7]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in /gpfs/mira-fs1/projects/shearbands/petsc-3.5.2/src/mat/impls/aij/mpi/mmaij.c
[7]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 702 in /gpfs/mira-fs1/projects/shearbands/petsc-3.5.2/src/mat/impls/aij/mpi/mpiaij.c
[7]PETSC ERROR: #4 MatAssemblyEnd() line 4900 in /gpfs/mira-fs1/projects/shearbands/petsc-3.5.2/src/mat/interface/matrix.c

Well at least that is what I think comes out after I read all the jammed up messages from my MPI processes...

I would guess that I am trying to allocate more memory than I should, which seems strange since the same problem runs fine on 2 nodes with 16 cores/node.

Thanks for the help
Best,
Luc




</pre>
                  </blockquote>
                </blockquote>
              </blockquote>
            </blockquote>
            <pre wrap="">


</pre>
          </blockquote>
        </blockquote>
        <pre wrap="">
</pre>
      </blockquote>
      <pre wrap="">
</pre>
    </blockquote>
    <br>
  </body>
</html>