[petsc-dev] [petsc-maint] Incident INC0122538 MKL on Cori/KNL

Richard Tran Mills rtmills at anl.gov
Tue Jul 3 13:10:10 CDT 2018


Mark, were you trying this in Valgrind with a binary targeting KNL, i.e.,
built to use AVX-512 instructions? I don't think Valgrind implements all
(or any?) of those, so a failure is not a surprise. Indeed, I've had
Valgrind choke on some AVX2 instructions, though maybe the most recent
versions of Valgrind will handle these properly now.
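
As a rough illustration of how to tell which encoding a reported instruction uses, here is a minimal Python sketch. It assumes 64-bit mode, where a leading 0x62 byte is the EVEX prefix (the encoding AVX-512 instructions use) and a leading 0xC4 or 0xC5 byte is a VEX prefix (AVX/AVX2). The function name simd_encoding is made up for this note, and it only looks at the first byte rather than decoding the instruction.

```python
def simd_encoding(insn_bytes):
    """Classify an x86-64 instruction by its leading encoding prefix byte.

    Hypothetical helper for illustration only: it inspects the first byte
    and does not decode the rest of the instruction.
    """
    first = insn_bytes[0]
    if first == 0x62:
        return "EVEX"  # prefix used by AVX-512 instructions
    if first in (0xC4, 0xC5):
        return "VEX"   # prefix used by AVX / AVX2 instructions
    return "legacy-or-other"


# Bytes reported as unhandled in the quoted Valgrind log.
unhandled = bytes([0x62, 0xF1, 0x7F, 0x08, 0x7B, 0xC2, 0xC5, 0xFB, 0x10, 0x0D])
print(simd_encoding(unhandled))
```

The leading 0x62 in the reported bytes is consistent with an AVX-512 instruction that this Valgrind build could not translate.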

--Richard

On Tue, Jul 3, 2018 at 4:25 AM, Mark Adams <mfadams at lbl.gov> wrote:

> Well this does work without valgrind.
>
> On Tue, Jul 3, 2018 at 6:36 AM Mark Adams <mfadams at lbl.gov> wrote:
>
>> I built a 32-bit integer version and now it dies in PetscInitialize. Ugh.
>>
>> ==3965== Conditional jump or move depends on uninitialised value(s)
>> ==3965==    at 0x27AFFD8F: _int_free (malloc.c:3945)
>> ==3965==    by 0x20074F48: PetscOptionsSetValue (options.c:1152)
>> ==3965==    by 0x20070450: PetscOptionsInsertArgs_Private (options.c:636)
>> ==3965==    by 0x2007189F: PetscOptionsInsert (options.c:746)
>> ==3965==    by 0x20093DB1: PetscInitialize (pinit.c:929)
>> ==3965==    by 0x2000ABCE: main (ex19.c:106)
>> ==3965==
>> vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0x7F 0x8 0x7B 0xC2 0xC5 0xFB 0x10 0xD
>> vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
>> vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
>> vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
>> ==3965== valgrind: Unrecognised instruction at address 0x20104a53.
>> ==3965==    at 0x20104A53: kh_resize_HTPrinted (viewreg.c:12)
>> ==3965==    by 0x20104EE2: kh_put_HTPrinted (viewreg.c:12)
>> ==3965==    by 0x201061EF: PetscOptionsHelpPrintedCheck (viewreg.c:89)
>> ==3965==    by 0x2009AD61: PetscOptionsBegin_Private (aoptions.c:34)
>> ==3965==    by 0x2008362E: PetscOptionsSetFromOptions (options.c:2646)
>> ==3965==    by 0x20094689: PetscInitialize (pinit.c:967)
>> ==3965==    by 0x2000ABCE: main (ex19.c:106)
>> ==3965== Your program just tried to execute an instruction that Valgrind
>> ==3965== did not recognise.  There are two possible reasons for this.
>> ==3965== 1. Your program has a bug and erroneously jumped to a non-code
>> ==3965==    location.  If you are running Memcheck and you just saw a
>> ==3965==    warning about a bad jump, it's probably your program's fault.
>> ==3965== 2. The instruction is legitimate but Valgrind doesn't handle it,
>> ==3965==    i.e. it's Valgrind's fault.  If you think this is the case or
>> ==3965==    you are not sure, please let us know and we'll try to fix it.
>> ==3965== Either way, Valgrind will now raise a SIGILL signal which will
>> ==3965== probably kill your program.
>> [0]PETSC ERROR: ==3965== Conditional jump or move depends on
>> uninitialised value(s)
>> ==3965==    at 0x27B18B48: strchrnul (strchr.S:106)
>> ==3965==    by 0x27AE30C8: __find_specmb (printf-parse.h:108)
>> ==3965==    by 0x27AE30C8: vfprintf (vfprintf.c:1311)
>> ==3965==    by 0x27AFA045: vsnprintf (vsnprintf.c:119)
>> ==3965==    by 0x20116802: PetscVSNPrintf (mprint.c:178)
>> ==3965==    by 0x20117006: PetscVFPrintfDefault (mprint.c:294)
>> ==3965==    by 0x20136BB2: PetscErrorPrintfDefault (errtrace.c:116)
>> ==3965==    by 0x2013938C: PetscSignalHandlerDefault (signal.c:131)
>> ==3965==    by 0x2013903B: PetscSignalHandler_Private (signal.c:43)
>> ==3965==    by 0x2335D07F: ??? (in /global/u2/m/madams/petsc_install/petsc/src/snes/examples/tutorials/ex19)
>> ==3965==    by 0x20104A52: kh_resize_HTPrinted (viewreg.c:12)
>> ==3965==    by 0x20104EE2: kh_put_HTPrinted (viewreg.c:12)
>> ==3965==    by 0x201061EF: PetscOptionsHelpPrintedCheck (viewreg.c:89)
>> ==3965==
>> ==3965== Conditional jump or move depends on uninitialised value(s)
>> ==3965==    at 0x27B08184: strlen (strlen.S:210)
>> ==3965==    by 0x200B3A48: PetscStrlen (str.c:158)
>> ==3965==    by 0x2011693D: PetscVSNPrintf (mprint.c:188)
>> ==3965==    by 0x20117006: PetscVFPrintfDefault (mprint.c:294)
>> ==3965==    by 0x20136BB2: PetscErrorPrintfDefault (errtrace.c:116)
>> ==3965==    by 0x2013938C: PetscSignalHandlerDefault (signal.c:131)
>> ==3965==    by 0x2013903B: PetscSignalHandler_Private (signal.c:43)
>> ==3965==    by 0x2335D07F: ??? (in /global/u2/m/madams/petsc_install/petsc/src/snes/examples/tutorials/ex19)
>> ==3965==    by 0x20104A52: kh_resize_HTPrinted (viewreg.c:12)
>> ==3965==    by 0x20104EE2: kh_put_HTPrinted (viewreg.c:12)
>> ==3965==    by 0x201061EF: PetscOptionsHelpPrintedCheck (viewreg.c:89)
>> ==3965==    by 0x2009AD61: PetscOptionsBegin_Private (aoptions.c:34)
>> ==3965==
>> ------------------------------------------------------------------------
>> [0]PETSC ERROR: Caught signal number 4 Illegal instruction: Likely due to memory corruption
>>
>>
>> On Mon, Jul 2, 2018 at 8:53 PM Mark Adams <mfadams at lbl.gov> wrote:
>>
>>> Looping Treb and Baky back into the thread, and dropping PETSc.
>>>
>>> Great, thanks Barry for figuring this out.
>>>
>>> Treb and Baky need 64-bit indices, but in the meantime I can build a
>>> 32-bit version to let them test.
>>>
>>> I am all set up to test a 64-bit version if you can give me a branch.
>>>
>>> Thanks,
>>> Mark
>>>
>>>
>>>
>>> On Mon, Jul 2, 2018 at 7:47 PM Smith, Barry F. <bsmith at mcs.anl.gov>
>>> wrote:
>>>
>>>>
>>>>    Progress has been made. These libraries do contain
>>>> mkl_set_num_threads(), so ./configure now recognizes them as MKL
>>>> libraries (unlike before, when it did not).
>>>>
>>>>
>>>> Damn it, here is why:
>>>>
>>>> --with-64-bit-indices=1
>>>>
>>>> Currently the MKL sparse support only works with 32-bit integers AND
>>>> 32-bit integer BLAS/LAPACK.
>>>>
>>>> It will never work with 64-bit indices and 32-bit integer BLAS/LAPACK,
>>>> but maybe it could be upgraded to work with 64-bit indices and 64-bit
>>>> integer BLAS/LAPACK. Richard?
>>>>
>>>>
>>>> Anyways the requirement in mkl_sparse.py is
>>>>
>>>>     self.requires32bitint = 1
>>>>
>>>> The problem is ./configure does not print enough information to make it
>>>> immediately clear it is rejecting the package because of this
>>>> incompatibility.
>>>>
>>>>
>>>>
>>>>   Barry
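
A minimal sketch of the kind of check Barry describes, in which a package flagged with requires32bitint is rejected with an explicit message when 64-bit indices are requested. This is not PETSc's configure code: check_index_width and its parameters are hypothetical names used only for illustration.

```python
def check_index_width(package_name, requires_32bit_int, use_64bit_indices):
    """Return (accepted, message) for a package under the chosen index width.

    Hypothetical helper for illustration; not part of PETSc's configure.
    """
    if requires_32bit_int and use_64bit_indices:
        # State the reason for the rejection instead of skipping silently.
        return (False,
                f"{package_name} rejected: requires 32-bit integers, "
                "but --with-64-bit-indices=1 was given")
    return (True, f"{package_name} accepted")


ok, msg = check_index_width("mkl_sparse",
                            requires_32bit_int=True,
                            use_64bit_indices=True)
print(msg)
```

A message like this in the configure output would have made the cause of the missing aijmkl type visible right away.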
>>>>
>>>>
>>>>
>>>>
>>>> > On Jul 2, 2018, at 5:57 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>> >
>>>> > Same error:
>>>> >
>>>> > 15:53 nid02517 master *= ~/petsc_install/petsc/src/snes/examples/tutorials$
>>>> make PETSC_DIR=/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp
>>>> PETSC_ARCH="" ex19
>>>> > cc -o ex19.o -c -g -O0 -mkl -static-intel -fopenmp
>>>>  -I/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp/include
>>>> -I/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp/include
>>>> -I/global/homes/m/madams/tmp/hypre-2.14.0/include
>>>> -I/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/include
>>>> -I/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp/include
>>>>   `pwd`/ex19.c
>>>>
>>>>
>>>> > cc -g -O0 -mkl -static-intel -fopenmp  -o ex19 ex19.o
>>>> -L/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp/lib
>>>> -Wl,-rpath,/global/homes/m/madams/tmp/hypre-2.14.0/lib
>>>> -L/global/homes/m/madams/tmp/hypre-2.14.0/lib
>>>> -Wl,-rpath,/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp/lib
>>>> -L/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp/lib
>>>> -L/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64
>>>> -lpetsc -lHYPRE -lparmetis -lmetis -lstdc++ -ldl -lmkl_intel_ilp64
>>>> -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
>>>> > /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64/libmkl_core.a(mkl_semaphore.o): In function `mkl_serv_load_inspector':
>>>> > mkl_semaphore.c:(.text+0x123): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
>>>> > /global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp/lib/libpetsc.a(send.o): In function `PetscOpenSocket':
>>>> > /global/u2/m/madams/petsc_install/petsc/src/sys/classes/viewer/impls/socket/send.c:108: warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
>>>>
>>>>
>>>> > rm ex19.o
>>>>
>>>>
>>>> > 15:54 nid02517 master *= ~/petsc_install/petsc/src/snes/examples/tutorials$
>>>> make PETSC_DIR=/global/homes/m/madams/petsc_install/petsc-cori-knl-dbg64-intel-omp
>>>> PETSC_ARCH="" runex19_gamg
>>>> > lid velocity = 0.0625, prandtl # = 1., grashof # = 1.
>>>>
>>>>
>>>> > [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>> > [1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>> > [1]PETSC ERROR: Unknown type. Check for miss-spelling or missing package: http://www.mcs.anl.gov/petsc/documentation/installation.html#external
>>>> > [1]PETSC ERROR: Unknown Mat type given: aijmkl
>>>> >
>>>> > On Mon, Jul 2, 2018 at 6:43 PM Satish Balay <balay at mcs.anl.gov>
>>>> wrote:
>>>> > Hm - I suspect it's an issue with the MKL includes - not the libraries.
>>>> >
>>>> > Satish
>>>> >
>>>> > On Mon, 2 Jul 2018, Smith, Barry F. wrote:
>>>> >
>>>> > >
>>>> > >   Mark,
>>>> > >
>>>> > >     This is not useful. We need the new configure log from when you
>>>> > > list all the libraries NERSC recommends (which may work).
>>>> > >
>>>> > >     I already said what was wrong with this configuration.
>>>> > >
>>>> > >     Barry
>>>> > >
>>>> > >
>>>> > > > On Jul 2, 2018, at 5:25 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>>> > > >
>>>> > > >
>>>> > > >
>>>> > > > On Mon, Jul 2, 2018 at 6:22 PM Satish Balay <balay at mcs.anl.gov>
>>>> wrote:
>>>> > > > I don't understand the problem here..
>>>> > > >
>>>> > > > > > > [0]PETSC ERROR: Unknown type. Check for miss-spelling or missing package: http://www.mcs.anl.gov/petsc/documentation/installation.html#external
>>>> > > > > > > [0]PETSC ERROR: Unknown Mat type given: aijmkl
>>>> > > >
>>>> > > > If this is the problem, then we'll have to look at configure.log
>>>> > > > to check why the PETSC_HAVE_MKL_SPARSE flag is not set.
>>>> > > >
>>>> > > > <configure.log>
>>>> > >
>>>> >
>>>> > <configure.log>
>>>>
>>>>