[petsc-dev] Case TS005062693 - XLF: ICE in xlfentry compiling a module with 358 subroutines

Satish Balay balay at mcs.anl.gov
Wed Mar 3 13:06:46 CST 2021


Sure - once any change works locally [for gcc and xlf]

When I try - I get a bunch of errors.. [yet to digest them.]

Satish

On Wed, 3 Mar 2021, Jacob Faibussowitsch wrote:

> > I'm not sure what would happen if these 'use' statements are removed [what's required and what can be removed?]
> > 
> > The relevant code that adds this is in lib/petsc/bin/maint/generatefortranstubs.py
> > 
> >              fd.write('      use petsc'+mansec+'def\n')
> 
> I suppose we can run it through CI, see if it breaks? 
> 
> Best regards,
> 
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
> Cell: (312) 694-3391
> 
> > On Mar 3, 2021, at 12:49, Satish Balay <balay at mcs.anl.gov> wrote:
> > 
> > On Wed, 3 Mar 2021, Jacob Faibussowitsch wrote:
> > 
> >> Hello All,
> >> 
> >> A few weeks ago I discovered a bug in the IBM XL Fortran compiler that crashes the compiler when it compiles the PETSc Fortran interfaces. The TL;DR is that the XL compiler keeps an internal dictionary with an entry for every function imported from Fortran modules, and since the PETSc Fortran interfaces seem to import entire modules wholesale in every procedure, this exceeds the maximum number of dictionary entries (2**21):
> >> 
> >>> The reason for the Internal Compiler Error is that we can't grow an internal dictionary anymore (i.e. we hit a 2**21 limit).
> >>> The file contains many module procedures and interfaces that use the same helper module. As a result, we are importing the dictionary entries for that module repeatedly reaching 
> >>> the limit.
> >>> 
> >>> Can you please give the following source code workaround a try?
> >>> Since there is already "use petscvecdefdummy" at the module scope, one workaround might be to remove the unnecessary "use petscvecdefdummy" in vecnotequal and vecequals 
> >>> and all similar procedures.
> >>> 
> >>> For example, the test case has:
> >>>        module petscvecdef
> >>>        use petscvecdefdummy
> >>> ...
> >>>        function vecnotequal(A,B)
> >>>          use petscvecdefdummy
> >>>          logical vecnotequal
> >>>          type(tVec), intent(in) :: A,B
> >>>          vecnotequal = (A%v .ne. B%v)
> >>>        end function
> >>>        function vecequals(A,B)
> >>>          use petscvecdefdummy
> >>>          logical vecequals
> >>>          type(tVec), intent(in) :: A,B
> >>>          vecequals = (A%v .eq. B%v)
> >>>        end function
> >>> ...
> >>> end module
> >>> Another workaround would be to put the procedure definitions from this large module into several submodules.  Each submodule would be able to accommodate a dictionary with 2**21 entries.
> >>> 
> >>> 
> >>> Please let us know if one of the above workarounds resolves the issue.
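
To gauge how often a helper module is pulled in wholesale, one could count the bare `use <module>` statements (those without an `only:` list) in the preprocessed source. The following is a minimal sketch, not an existing PETSc tool; the function name `count_bare_uses` is hypothetical, and it assumes one statement per line with no continuation lines.

```python
import re
from collections import Counter

# Matches "use <module>" with nothing after it except an optional comment.
# Lines such as "use m, only: x" or "use, intrinsic :: m" do not match.
BARE_USE_RE = re.compile(r"^\s*use\s+(\w+)\s*(?:!.*)?$", re.IGNORECASE)


def count_bare_uses(source):
    """Count bare 'use <module>' statements per module (hypothetical helper)."""
    counts = Counter()
    for line in source.splitlines():
        match = BARE_USE_RE.match(line)
        if match:
            # Fortran is case-insensitive, so normalise the module name.
            counts[match.group(1).lower()] += 1
    return counts
```

Running this over a file such as the expanded full_petscmatmod.F90 would show which modules are imported most often and are therefore the first candidates for an `only:` list.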
> >> 
> >> 
> >> The proposed fix from IBM would be to pull “use moduleXXX” out of subroutines or to have our auto-fortran interfaces detect which symbols to include from the respective modules and only include those in the subroutines. I’m not familiar at all with how the interfaces are generated so I don’t even know if this is possible.
> > 
> > I'm not sure what would happen if these 'use' statements are removed [what's required and what can be removed?]
> > 
> > The relevant code that adds this is in lib/petsc/bin/maint/generatefortranstubs.py
> > 
> >              fd.write('      use petsc'+mansec+'def\n')
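
A minimal sketch of how the generator might restrict this line to the symbols a stub actually needs, assuming the only symbols required are the derived types named in the argument declarations. The helper `use_only_line` and its `decl_lines` argument are hypothetical and are not part of generatefortranstubs.py; any kind parameters or constants a real interface needs would have to be added to the list as well.

```python
import re

# Finds derived-type references such as "type(tVec)" in a declaration line.
TYPE_RE = re.compile(r"\btype\s*\(\s*(\w+)\s*\)", re.IGNORECASE)


def use_only_line(mansec, decl_lines, indent="      "):
    """Build a 'use petsc<mansec>def, only: ...' line (hypothetical helper).

    Falls back to the bare 'use' line when no derived types are found.
    """
    symbols = []
    seen = set()
    for line in decl_lines:
        for name in TYPE_RE.findall(line):
            # Fortran names are case-insensitive; keep the first spelling.
            if name.lower() not in seen:
                seen.add(name.lower())
                symbols.append(name)
    module = "petsc" + mansec + "def"
    if not symbols:
        return indent + "use " + module + "\n"
    return indent + "use " + module + ", only: " + ", ".join(symbols) + "\n"
```

Under these assumptions the existing call would become something like `fd.write(use_only_line(mansec, decl_lines))`, where `decl_lines` holds the argument declarations already produced for the stub.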
> > 
> > Satish
> > 
> >>> IBM provided the following additional explanation and example. Can the process used to generate these routines and functions determine the specific symbols required and then use the only keyword or import statement to include them?
> >>> 
> >>> When factoring use statements out of module procedures, you can just delete them.  But you can't completely remove them from interface blocks.  Instead, you can limit them by using either use <module>, only: <symbol> or import <symbol>. If the hundreds of use statements in the program are factored out / limited in this way, that should reduce the dictionary size sufficiently for the program to compile.
> >>> 
> >>> For example
> >>>      Interface
> >>>        Subroutine VecRestoreArrayReadF90(v,array,ierr)
> >>>          use petscvecdef
> >>>          real(kind=selected_real_kind(10)), pointer :: array(:)
> >>>          integer(kind=selected_int_kind(5)) ierr
> >>>          type(tVec)     v
> >>>        End Subroutine
> >>>      End Interface
> >>> 
> >>> imports all symbols from petscvecdef into the dictionary even though we only need tVec .  So we can either:
> >>> 
> >>>      Interface
> >>>        Subroutine VecRestoreArrayReadF90(v,array,ierr)
> >>>          use petscvecdef, only: tVec
> >>>          implicit none
> >>>          real(kind=selected_real_kind(10)), pointer :: array(:)
> >>>          integer(kind=selected_int_kind(5)) ierr
> >>>          type(tVec)     v
> >>>        End Subroutine
> >>>      End Interface
> >>> 
> >>> or if use petscvecdef is used in the outer scope, we can:
> >>>      Interface
> >>>        Subroutine VecRestoreArrayReadF90(v,array,ierr)
> >>>          import tVec
> >>>          implicit none
> >>>          real(kind=selected_real_kind(10)), pointer :: array(:)
> >>>          integer(kind=selected_int_kind(5)) ierr
> >>>          type(tVec)     v
> >>>        End Subroutine
> >>>      End Interface
> >>> (The two methods (use, only vs import) are equivalent in terms of impact to the dictionary.)
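
The rewrite IBM describes could be applied mechanically to one interface body at a time. Below is a minimal sketch under stated assumptions: the function `restrict_use` is hypothetical, each block contains a single bare `use` statement, and the only symbols needed from that module are the derived types referenced as `type(...)` in the block.

```python
import re

USE_RE = re.compile(r"^(\s*)use\s+(\w+)\s*$", re.IGNORECASE)
TYPE_RE = re.compile(r"\btype\s*\(\s*(\w+)\s*\)", re.IGNORECASE)


def restrict_use(block, style="only"):
    """Replace a bare 'use <module>' with 'use <module>, only: ...' or
    'import ...' listing the derived types found in the block.

    Hypothetical helper; assumes one bare 'use' per block.
    """
    lines = block.splitlines()
    symbols, seen = [], set()
    for line in lines:
        for name in TYPE_RE.findall(line):
            if name.lower() not in seen:
                seen.add(name.lower())
                symbols.append(name)
    out = []
    for line in lines:
        match = USE_RE.match(line)
        if match and symbols:
            indent, module = match.groups()
            names = ", ".join(symbols)
            if style == "import":
                # Only valid when the enclosing scope already uses the module.
                out.append(indent + "import " + names)
            else:
                out.append(indent + "use " + module + ", only: " + names)
        else:
            out.append(line)
    return "\n".join(out)
```

The `import` form is only appropriate when the module is already used in the outer scope, as IBM notes, so the default here is the `use ..., only:` form.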
> >>> 
> >> 
> >> Is this compiler ~feature~ something that we intend to work around? Thoughts?
> >> 
> >> Best regards,
> >> 
> >> Jacob Faibussowitsch
> >> (Jacob Fai - booss - oh - vitch)
> >> Cell: (312) 694-3391
> >> 
> >>> Begin forwarded message:
> >>> 
> >>> From: "Roy Musselman" <roymuss at us.ibm.com>
> >>> Subject: Re: Case TS005062693 - XLF: ICE in xlfentry compiling a module with 358 subroutines
> >>> Date: March 3, 2021 at 08:23:17 CST
> >>> To: Jacob Faibussowitsch <faibuss2 at illinois.edu>
> >>> Cc: "Gyllenhaal, John C." <gyllenhaal1 at llnl.gov>
> >>> 
> >>> Hi Jacob, 
> >>> I tried the first suggestion and commented out the use statements called within the functions. However, I hit the following error complaining about specific symbol dependencies provided by the library.
> >>> 
> >>> .../src/vec/f90-mod/petscvecmod.F90", line 107.37: 1514-084 (S) Identifier a is being declared with type name tvec which has not been defined in a derived type definition. 
> >>> 
> >>> IBM provided the following additional explanation and example. Can the process used to generate these routines and functions determine the specific symbols required and then use the only keyword or import statement to include them?
> >>> 
> >>> When factoring use statements out of module procedures, you can just delete them.  But you can't completely remove them from interface blocks.  Instead, you can limit them by using either use <module>, only: <symbol> or import <symbol>. If the hundreds of use statements in the program are factored out / limited in this way, that should reduce the dictionary size sufficiently for the program to compile.
> >>> 
> >>> For example
> >>>      Interface
> >>>        Subroutine VecRestoreArrayReadF90(v,array,ierr)
> >>>          use petscvecdef
> >>>          real(kind=selected_real_kind(10)), pointer :: array(:)
> >>>          integer(kind=selected_int_kind(5)) ierr
> >>>          type(tVec)     v
> >>>        End Subroutine
> >>>      End Interface
> >>> 
> >>> imports all symbols from petscvecdef into the dictionary even though we only need tVec .  So we can either:
> >>> 
> >>>      Interface
> >>>        Subroutine VecRestoreArrayReadF90(v,array,ierr)
> >>>          use petscvecdef, only: tVec
> >>>          implicit none
> >>>          real(kind=selected_real_kind(10)), pointer :: array(:)
> >>>          integer(kind=selected_int_kind(5)) ierr
> >>>          type(tVec)     v
> >>>        End Subroutine
> >>>      End Interface
> >>> 
> >>> or if use petscvecdef is used in the outer scope, we can:
> >>>      Interface
> >>>        Subroutine VecRestoreArrayReadF90(v,array,ierr)
> >>>          import tVec
> >>>          implicit none
> >>>          real(kind=selected_real_kind(10)), pointer :: array(:)
> >>>          integer(kind=selected_int_kind(5)) ierr
> >>>          type(tVec)     v
> >>>        End Subroutine
> >>>      End Interface
> >>> (The two methods (use, only vs import) are equivalent in terms of impact to the dictionary.)
> >>> 
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Roy Musselman
> >>> IBM HPC Application Analyst at Lawrence Livermore National Lab
> >>> email: roymuss at us.ibm.com
> >>> LLNL office: 925-422-6033
> >>> Cell: 507-358-8895, Home: 507-281-9565
> >>> 
> >>> Roy Musselman---02/24/2021 07:08:45 PM---Hi Jacob, I opened the ticket with IBM: case TS005062693 and the local LLNL Sierra Jira Ticket
> >>> 
> >>> From:  Roy Musselman/Rochester/Contr/IBM
> >>> To:  Jacob Faibussowitsch <faibuss2 at illinois.edu>
> >>> Cc:  "Gyllenhaal, John C." <gyllenhaal1 at llnl.gov>
> >>> Date:  02/24/2021 07:08 PM
> >>> Subject:  Re: [EXTERNAL] Case TS005062693 - XLF: ICE in xlfentry compiling a module with 358 subroutines
> >>> 
> >>> 
> >>> 
> >>> Hi Jacob, 
> >>> I opened the ticket with IBM (case TS005062693) and the local LLNL Sierra Jira ticket at
> >>> https://lc.llnl.gov/jira/projects/SIERRA/issues/SIERRA-111?filter=allissues
> >>> 
> >>> Today IBM provided the response below. I don't know when I'll have time to try it on the reproducer I gave IBM. Perhaps early next week. Can you review this and see if it helps? 
> >>> 
> >>> The reason for the Internal Compiler Error is that we can't grow an internal dictionary anymore (i.e. we hit a 2**21 limit).
> >>> The file contains many module procedures and interfaces that use the same helper module. As a result, we are importing the dictionary entries for that module repeatedly reaching 
> >>> the limit.
> >>> 
> >>> Can you please give the following source code workaround a try?
> >>> Since there is already "use petscvecdefdummy" at the module scope, one workaround might be to remove the unnecessary "use petscvecdefdummy" in vecnotequal and vecequals 
> >>> and all similar procedures.
> >>> 
> >>> For example, the test case has:
> >>>        module petscvecdef
> >>>        use petscvecdefdummy
> >>> ...
> >>>        function vecnotequal(A,B)
> >>>          use petscvecdefdummy
> >>>          logical vecnotequal
> >>>          type(tVec), intent(in) :: A,B
> >>>          vecnotequal = (A%v .ne. B%v)
> >>>        end function
> >>>        function vecequals(A,B)
> >>>          use petscvecdefdummy
> >>>          logical vecequals
> >>>          type(tVec), intent(in) :: A,B
> >>>          vecequals = (A%v .eq. B%v)
> >>>        end function
> >>> ...
> >>> end module
> >>> Another workaround would be to put the procedure definitions from this large module into several submodules.  Each submodule would be able to accommodate a dictionary with 2**21 entries.
> >>> 
> >>> 
> >>> Please let us know if one of the above workarounds resolves the issue.
> >>> 
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Roy Musselman
> >>> IBM HPC Application Analyst at Lawrence Livermore National Lab
> >>> email: roymuss at us.ibm.com
> >>> LLNL office: 925-422-6033
> >>> Cell: 507-358-8895, Home: 507-281-9565
> >>> 
> >>> 
> >>> Roy Musselman---02/21/2021 09:42:55 PM---Hi Jacob, After some more experimentation, I think I may have found what is triggering the ICE. It
> >>> 
> >>> From:  Roy Musselman/Rochester/Contr/IBM
> >>> To:  Jacob Faibussowitsch <faibuss2 at illinois.edu>
> >>> Cc:  "Gyllenhaal, John C." <gyllenhaal1 at llnl.gov>
> >>> Date:  02/21/2021 09:42 PM
> >>> Subject:  Re: [EXTERNAL] Re: xlf90_r Internal Compiler Error
> >>> 
> >>> 
> >>> Hi Jacob, 
> >>> 
> >>> After some more experimentation, I think I may have found what is triggering the ICE. It doesn't appear to be related to the subroutine name length. I think the compiler may be hitting an internal limit of the number of subroutines within a module. There are 358 subroutines contained in the expanded petscmatmod.F90. Removing 4 subroutines will allow the compile to complete successfully, so the limit must be 354 subroutines. Is it possible for you to bust up petscmatmod into multiple modules? I'll package up the reproducer and pass it on to the compiler development team.
> >>> 
> >>> I asked for user feedback a couple of years ago, when the IBM Power9 CORAL-1 Sierra systems were deployed, but received minimal response. DOE is now working with Cray (aka HPE) to develop the environment for the CORAL-2 system (El Capitan). I'll pass your request to the LLNL person I know who is dealing with math libraries for CORAL-2.
> >>> 
> >>> We use the spack tool to download and build petsc and its specified dependencies. I switched between the PETSC versions by changing the PETSCDIR variable in the script I shared with you. I've attached a tar ball containing the scripts used to build PETSc via spack.
> >>> 
> >>> [attachment "bld-petsc-spack.tgz" deleted by Roy Musselman/Rochester/Contr/IBM] 
> >>> 
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Roy Musselman
> >>> IBM HPC Application Analyst at Lawrence Livermore National Lab
> >>> email: roymuss at us.ibm.com
> >>> LLNL office: 925-422-6033
> >>> Cell: 507-358-8895, Home: 507-281-9565
> >>> 
> >>> 
> >>> Jacob Faibussowitsch ---02/21/2021 12:24:11 PM---Hi Roy, > I'm not sure which projects at LLNL are using PETSc or if they chose to build their own ve
> >>> 
> >>> From:  Jacob Faibussowitsch <faibuss2 at illinois.edu>
> >>> To:  Roy Musselman <roymuss at us.ibm.com>
> >>> Cc:  "Gyllenhaal, John C." <gyllenhaal1 at llnl.gov>
> >>> Date:  02/21/2021 12:24 PM
> >>> Subject:  [EXTERNAL] Re: xlf90_r Internal Compiler Error
> >>> 
> >>> 
> >>> 
> >>> Hi Roy,
> >>> I'm not sure which projects at LLNL are using PETSc or if they chose to build their own version.
> >>> Entirely unrelated to our problem, but is it possible to find this out? It would be great if yes, but also completely fine if not. PETSc is potentially undergoing a rather transformative rewrite over the next few years and we’d like to gather current usage data to get a better idea of where PETSc fits into our users’ workflows. But we aren’t sure how to gather this data (we don’t particularly want to scrape and silently send it off without users’ consent/knowledge) absent user questionnaires and HPC usage statistics.
> >>> If you are interested, I can share with you the spack recipes I use to build petsc with hdf5, hypre, and superlu-dist.
> >>> Yes that would be quite useful. I can let it percolate through our dev channels for any other recommendations etc.
> >>> 3.14.0 and 3.14.1
> >>> 
> >>> "../roymuss/spack-stage-petsc-3.14.0-on3lboy4slkz65tsjttgfmwghzky54jj/spack-src/src/vec/f90-mod/petscvecmod.F90", line 9.13: 1514-219 (S) Unable to access module symbol file for module petscisdefdummy. Check path and file permissions of file. Use association not done for this module.
> >>> 1501-511 Compilation failed for file petscvecmod.F90.
> >>> How exactly did you switch between versions? PETSc has 2 types of fortran bindings, “ftn-custom” and “ftn-auto” (technically 3 including the F90 files, but those simply call either of the two preceding ones), a copy of which you will find in every src directory. As the names imply ftn-auto is auto generated while ftn-custom is hand-written. 
> >>> 
> >>> This also means that the ftn-auto files are __not__ tracked by git, so a simple git checkout [new-tag] may not properly dispose of the old auto-generated files (very rare, but IIRC we made a major enough change to the fortran bindings within the last year to warrant having to "make deletefortranstubs" before rebuilding).
> >>> Adding the option -qlanglvl=2003std or -qlanglvl=2008std produces a bunch of other warning messages, but it still encounters the ICE. So, I'm uncertain if the subroutine name length is the root of the problem. 
> >>> Our current compiler flag selection philosophy is to require a minimum but choose the maximum available reasonable flag for the compiler (i.e. we require C99, but very often you will find that your code is compiled with C11 or C17 if they are available). It is therefore odd that configure did not use the same methodology for Fortran compilers. I will relay this on our side.
> >>> Is it possible for you to use subroutines that are less than 32 characters and see if that works for you? Have you used other fortran 90 compilers and do any of them complain of this? 
> >>> Of all of the small quirks fortran has this is probably the most esoteric one I’ve come across… I’ve attached a list of all the F90 compilers, and their flags which we use in CI/CD (all of which is run multiple times daily and __must__ pass). I got them all via grep, so there may be some duplicates here or there. As for using shorter names, this is also something we can look at, but since none of the other compilers have had issues with this I’m not sure this is the change to make.
> >>> Are there any unusual or questionable language constructs used in any of the functions mentioned above that may possibly challenge the compiler? 
> >>> Not that I am aware of, but again I will ask around our dev channels and see if anything comes to mind.
> >>> 
> >>> 
> >>> Best regards,
> >>> 
> >>> Jacob Faibussowitsch
> >>> (Jacob Fai - booss - oh - vitch)
> >>> Cell: (312) 694-3391
> >>> [attachment "compilerList" deleted by Roy Musselman/Rochester/Contr/IBM]
> >>> On Feb 20, 2021, at 22:05, Roy Musselman <roymuss at us.ibm.com> wrote:
> >>> Hi Jacob,
> >>> Thanks for letting me know that you are a PETSc developer and that you are testing it on the LLNL lassen system. I've used the spack build tool to build and deploy a few versions on the systems. I'm not sure which projects at LLNL are using PETSc or if they chose to build their own version. I did however provide a single precision version upon request that was integrated with MVAPICH2-MPI instead of the IBM-provided Spectrum-MPI. Here's what's available on the systems today.
> >>> 
> >>>> ml avail petsc
> >>> ----------------------------------------------------- /usr/tcetmp/modulefiles/Core -----------------------------------------------------
> >>> petsc/default petsc/3.10.2 petsc/3.11.3 petsc/3.13.0 (D)  
> >>> petsc/3.13.1-mvapich2-2020.01.09-xl-2020.03.18.single
> >>> 
> >>> If you are interested, I can share with you the spack recipes I use to build petsc with hdf5, hypre, and superlu-dist.
> >>> 
> >>> After several attempts I was able to reproduce the Internal Compiler Error (ICE) that you are seeing using version 3.14.4. I've whittled it down to the petscmatmod.F90 file and its specific dependencies. 
> >>> The following script is what I'm using. Note that in the 2nd set of compiles, the -E option is used to expand all included source files and headers into a single large source file. This can be used to help isolate the source of the problem.  
> >>> 
> >>> #!/bin/bash
> >>> 
> >>> PETSCDIR="../roymuss/spack-stage-petsc-3.14.4-eh5arny7l3cqjlltlfpjp6f4jofbnmz6/spack-src" 
> >>> OPTIONS=" -qmoddir=moddir -I$PETSCDIR/arch-linux-c-opt/include -I$PETSCDIR/include"
> >>> mkdir -p moddir
> >>> 
> >>> set -x 
> >>> 
> >>> # Compile original source files including dependencies
> >>> if [ 0 = 1 ]; then
> >>> mpif90 -c -g $OPTIONS $PETSCDIR/src/sys/f90-mod/petscsysmod.F90 -o petscsysmod.o 
> >>> mpif90 -c -g $OPTIONS $PETSCDIR/src/vec/f90-mod/petscvecmod.F90 -o petscvecmod.o
> >>> mpif90 -c -g $OPTIONS $PETSCDIR/src/mat/f90-mod/petscmatmod.F90 -o petscmatmod.o
> >>> fi
> >>> 
> >>> # Use -E option to expand source into full source files
> >>> if [ 0 = 1 ]; then
> >>> mpif90 -c -g -E $OPTIONS $PETSCDIR/src/sys/f90-mod/petscsysmod.F90 -o full_petscsysmod.F90
> >>> mpif90 -c -g -E $OPTIONS $PETSCDIR/src/vec/f90-mod/petscvecmod.F90 -o full_petscvecmod.F90
> >>> mpif90 -c -g -E $OPTIONS $PETSCDIR/src/mat/f90-mod/petscmatmod.F90 -o full_petscmatmod.F90
> >>> fi
> >>> 
> >>> # Compile from full source files
> >>> if [ 1 = 1 ]; then
> >>> mpif90 -c -g -Imoddir -qmoddir=moddir full_petscsysmod.F90 -o full_petscsysmod.o
> >>> mpif90 -c -g -Imoddir -qmoddir=moddir full_petscvecmod.F90 -o full_petscvecmod.o
> >>> mpif90 -V -c -g -Imoddir -qmoddir=moddir full_petscmatmod.F90 -o full_petscmatmod.o
> >>> fi
> >>> 
> >>> <eof>
> >>> 
> >>> PETSc 3.13.6 is the most recent version that did not fail. I tried all subsequent versions and got the following results: 
> >>> 
> >>> 3.14.0 and 3.14.1
> >>> 
> >>> "../roymuss/spack-stage-petsc-3.14.0-on3lboy4slkz65tsjttgfmwghzky54jj/spack-src/src/vec/f90-mod/petscvecmod.F90", line 9.13: 1514-219 (S) Unable to access module symbol file for module petscisdefdummy. Check path and file permissions of file. Use association not done for this module.
> >>> 1501-511 Compilation failed for file petscvecmod.F90.
> >>> 
> >>> 3.14.2, 3.14.3, and 3.14.4
> >>> 
> >>> . . .
> >>> ** matnullspaceequals === End of Compilation 8 ===
> >>> *** Error in `/usr/tce/packages/xl/xl-2020.11.12/xlf/16.1.1/exe/xlfentry': free(): invalid pointer: 0x0000200001740018 ***
> >>> 
> >>> Examining the tail end of petscmatmod.F90
> >>> 
> >>> 
> >>> 80 function matnullspaceequals(A,B)
> >>> 81 use petscmatdefdummy
> >>> 82 logical matnullspaceequals
> >>> 83 type(tMatNullSpace), intent(in) :: A,B
> >>> 84 matnullspaceequals = (A%v .eq. B%v)
> >>> 85 end function
> >>> 86 
> >>> 87 #if defined(_WIN32) && defined(PETSC_USE_SHARED_LIBRARIES)
> >>> 88 !DEC$ ATTRIBUTES DLLEXPORT::matnotequal
> >>> 89 !DEC$ ATTRIBUTES DLLEXPORT::matequals
> >>> 90 !DEC$ ATTRIBUTES DLLEXPORT::matfdcoloringnotequal
> >>> 91 !DEC$ ATTRIBUTES DLLEXPORT::matfdcoloringequals
> >>> 92 !DEC$ ATTRIBUTES DLLEXPORT::matnullspacenotequal
> >>> 93 !DEC$ ATTRIBUTES DLLEXPORT::matnullspaceequals
> >>> 94 #endif
> >>> 95 module petscmat
> >>> 96 use petscmatdef
> >>> 97 use petscvec
> >>> 98 #include <../src/mat/f90-mod/petscmat.h90>
> >>> 99 interface
> >>> 100 #include <../src/mat/f90-mod/ftn-auto-interfaces/petscmat.h90>
> >>> 101 end interface
> >>> 102 end module
> >>> 103 
> >>> 
> >>> Compiling the matnullspaceequals function was successful just before hitting the error. The error goes away when removing either or both of the #include lines 98 and 100. Both #include statements are required to produce the error. The 3.13.6 and 3.14.4 version of the file identified in the first #include at line 98 are identical. The file identified in line 100 is different between 3.13.6 and 3.14.4.
> >>> Just looking at the list of subroutines contained within each version, the following are the differences. 
> >>> 
> >>> Old subroutines available in 3.13.6 but removed from 3.14.4
> >>> subroutine MatFreeIntermediateDataStructures(a,z)
> >>> 
> >>> New subroutines available in 3.14.4 but not contained in 3.13.6 
> >>> subroutine MatDenseReplaceArray(a,b,z)
> >>> subroutine MatIsShell(a,b,z)
> >>> subroutine MatRARtMultEqual(a,b,c,d,e,z)
> >>> subroutine MatScaLAPACKGetBlockSizes(a,b,c,z)
> >>> subroutine MatScaLAPACKSetBlockSizes(a,b,c,z)
> >>> subroutine MatSeqAIJCUSPARSESetGenerateTranspose(a,b,z)
> >>> subroutine MatSeqAIJSetTotalPreallocation(a,b,z)
> >>> subroutine MatSetLayouts(a,b,c,z)
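
A comparison like the one above can be reproduced with a short script. This is a minimal sketch, assuming each `subroutine` statement starts its own line in the generated interface file; the function names are hypothetical.

```python
import re

# Matches "subroutine Name(...)" at the start of a line, but not "end subroutine".
SUB_RE = re.compile(r"^\s*subroutine\s+(\w+)", re.IGNORECASE)


def subroutine_names(source):
    """Collect the subroutine names declared in a Fortran source string."""
    names = set()
    for line in source.splitlines():
        match = SUB_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def diff_subroutines(old_source, new_source):
    """Return (removed, added) subroutine names between two versions."""
    old = subroutine_names(old_source)
    new = subroutine_names(new_source)
    return sorted(old - new), sorted(new - old)
```

Feeding it the two versions of ftn-auto-interfaces/petscmat.h90 would give the removed and added lists directly.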
> >>> 
> >>> Methodically removing the new subroutines did not provide a consistent result. But I did notice the extra long subroutine name MatSeqAIJCUSPARSESetGenerateTranspose had 37 characters.
> >>> A little research found: In Fortran 90/95 the maximum length was 31 characters, in Fortran 2003 it is now 63 characters. I found the following subroutines with greater than 31 characters
> >>> 
> >>> subroutine MatCreateMPIMatConcatenateSeqMat
> >>> subroutine MatFactorFactorizeSchurComplement
> >>> subroutine MatMPIAdjCreateNonemptySubcommMat
> >>> subroutine MatSeqAIJCUSPARSESetGenerateTranspose
> >>> subroutine MatMPIAIJSetUseScalableIncreaseOverlap
> >>> subroutine MatFactorSolveSchurComplementTranspose
> >>> 
> >>> I individually ifdef'd them out of the source file and was able to compile the files successfully without encountering the ICE. 
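
The search for over-long names can also be scripted. A minimal sketch, assuming one `subroutine` statement per line; `long_subroutine_names` is a hypothetical helper and the default limit of 31 is the Fortran 90/95 figure quoted above.

```python
import re

# Matches "subroutine Name" at the start of a line, but not "end subroutine".
SUB_RE = re.compile(r"^\s*subroutine\s+(\w+)", re.IGNORECASE)


def long_subroutine_names(source, limit=31):
    """Return subroutine names longer than `limit` characters, in file order."""
    found = []
    for line in source.splitlines():
        match = SUB_RE.match(line)
        if match and len(match.group(1)) > limit:
            found.append(match.group(1))
    return found
```

Passing `limit=63` would check against the Fortran 2003 figure instead.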
> >>> 
> >>> I'm not exactly sure what maximum subroutine name length the XLF compiler allows, but if it is only 31, it would be useful if the compiler detected this and issued a message instead of the ICE.
> >>> Adding the option -qlanglvl=2003std or -qlanglvl=2008std produces a bunch of other warning messages, but it still encounters the ICE. So, I'm uncertain if the subroutine name length is the root of the problem. 
> >>> 
> >>> Is it possible for you to use subroutines that are less than 32 characters and see if that works for you? Have you used other fortran 90 compilers and do any of them complain of this? 
> >>> Are there any unusual or questionable language constructs used in any of the functions mentioned above that may possibly challenge the compiler? 
> >>> 
> >>> I'll package this up and send it to the IBM XL compiler development team for their examination and comment. 
> >>> 
> >>> Best Regards,
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Roy Musselman
> >>> IBM HPC Application Analyst at Lawrence Livermore National Lab
> >>> email: roymuss at us.ibm.com
> >>> LLNL office: 925-422-6033
> >>> Cell: 507-358-8895, Home: 507-281-9565
> >>> 
> >>> Jacob Faibussowitsch ---02/18/2021 02:17:05 PM---> The most recently built version available on the CORAL systems is 3.13.0. (ml load petsc/3.13.0) W
> >>> 
> >>> From:  Jacob Faibussowitsch <faibuss2 at illinois.edu>
> >>> To:  Roy Musselman <roymuss at us.ibm.com>
> >>> Cc:  "Gyllenhaal, John C." <gyllenhaal1 at llnl.gov>
> >>> Date:  02/18/2021 02:17 PM
> >>> Subject:  [EXTERNAL] Re: xlf90_r Internal Compiler Error
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> The most recently built version available on the CORAL systems is 3.13.0. (ml load petsc/3.13.0) Will that work for you?
> >>> I am building petsc from source as part of development work on petsc itself so modules are unfortunately not useful here.
> >>> The files you sent me do not contain all the dependencies (other mod files) required to reproduce the error. 
> >>> I'll attempt to build version 3.14.4 from scratch and recreate the failing symptom you are observing.
> >>> Yes, petsc uses an automated system to generate the fortran files from C which goes about 20 rabbit holes deeper than I was willing to dig. Let me know if you run into trouble configuring and building petsc, I can point you in the right direction. I’ve attached a “reconfigure” script with this email, it contains all of the arguments I used to configure petsc successfully on Lassen. If you place it into your $PETSC_DIR (i.e. the folder titled “petsc” and that contains a “configure” file) and run:
> >>> 
> >>> $ python3 ./reconfigure-arch-linux-c-debug.py
> >>> 
> >>> It should work. If not, you will have to run
> >>> 
> >>> $ ./configure --all-the-args --in-the-reconfigure --file
> >>> 
> >>> Best regards,
> >>> 
> >>> Jacob Faibussowitsch
> >>> (Jacob Fai - booss - oh - vitch)
> >>> Cell: (312) 694-3391
> >>> [attachment "reconfigure-arch-linux-c-debug.py" deleted by Roy Musselman/Rochester/Contr/IBM]
> >>> On Feb 18, 2021, at 15:07, Roy Musselman <roymuss at us.ibm.com> wrote:
> >>> Hi Jacob,
> >>> 
> >>> The source file appears to come from the PETSc 3.14.4 library. The most recently built version available on the CORAL systems is 3.13.0. (ml load petsc/3.13.0) Will that work for you?
> >>> The files you sent me do not contain all the dependencies (other mod files) required to reproduce the error. 
> >>> I'll attempt to build version 3.14.4 from scratch and recreate the failing symptom you are observing.
> >>> 
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Roy Musselman
> >>> IBM HPC Application Analyst at Lawrence Livermore National Lab
> >>> email: roymuss at us.ibm.com
> >>> LLNL office: 925-422-6033
> >>> Cell: 507-358-8895, Home: 507-281-9565
> >>> 
> >>> Roy Musselman---02/18/2021 11:18:20 AM---I'll take a look. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Roy Musselman
> >>> 
> >>> From: Roy Musselman/Rochester/Contr/IBM
> >>> To: LC Hotline <lc-hotline at llnl.gov>
> >>> Cc: "Gyllenhaal, John C." <gyllenhaal1 at llnl.gov>
> >>> Date: 02/18/2021 11:18 AM
> >>> Subject: Re: [EXTERNAL] FW: xlf90_r Internal Compiler Error
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> I'll take a look. 
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Roy Musselman
> >>> IBM HPC Application Analyst at Lawrence Livermore National Lab
> >>> email: roymuss at us.ibm.com
> >>> LLNL office: 925-422-6033
> >>> Cell: 507-358-8895, Home: 507-281-9565
> >>> 
> >>> 
> >>> LC Hotline ---02/18/2021 11:03:55 AM---Hi John, Roy, Can you help this user with the problem that he is seeing when he tries to build with
> >>> 
> >>> From: LC Hotline <lc-hotline at llnl.gov>
> >>> To: "Gyllenhaal, John C." <gyllenhaal1 at llnl.gov>, Roy Musselman <roymuss at us.ibm.com>
> >>> Date: 02/18/2021 11:03 AM
> >>> Subject: [EXTERNAL] FW: xlf90_r Internal Compiler Error
> >>> 
> >>> 
> >>> 
> >>> Hi John, Roy,
> >>> 
> >>> Can you help this user with the problem that he is seeing when he tries to build with xlf90 on Lassen?
> >>> 
> >>> Thanks,
> >>> Ryan
> >>> --
> >>> LC Hotline
> >>> 
> >>> From: Jacob Faibussowitsch <faibuss2 at illinois.edu>
> >>> Date: Wednesday, February 17, 2021 at 5:27 PM
> >>> To: LC Hotline <lc-hotline at llnl.gov>
> >>> Subject: xlf90_r Internal Compiler Error
> >>> 
> >>> Hello LC Support, 
> >>> 
> >>> While compiling my application on Lassen I seem to have run afoul of the xlf90 mpi compiler wrapper with the following error:
> >>> 
> >>> *** Error in `/usr/tce/packages/xl/xl-2020.11.12/xlf/16.1.1/exe/xlfentry': free(): invalid pointer: 0x0000200001740018 ***
> >>> 
> >>> I’m fairly certain this isn’t my fault as this is code that compiles regularly on extensive CI/CD under various other compilers and machines, but you can never rule it out. I have included a verbose full log of my make run (which includes a comprehensive rundown of the environment) as well as a separate file containing the error message and stack trace from the compiler. Additionally I have also included the file which I believe is causing the error. Let me know if there is anything else I should send.
> >>> 
> >>> P.S. My list of loaded modules:
> >>> 
> >>> Currently Loaded Modules:
> >>> 1) StdEnv (S) 4) cuda/11.1.1 7) valgrind/3.16.1
> >>> 2) clang/ibm-11.0.0 5) python/3.8.2 8) lapack/3.9.0-xl-2020.11.12
> >>> 3) spectrum-mpi/rolling-release 6) cmake/3.18.0 9) hip/3.0.0
> >>> 
> >>> Best regards,
> >>> 
> >>> Jacob Faibussowitsch
> >>> (Jacob Fai - booss - oh - vitch)
> >>> Cell: (312) 694-3391
> >>> [attachment "errorReport.zip" deleted by Roy Musselman/Rochester/Contr/IBM]
> 
> 

