[petsc-dev] VecScatter scaling problem on KNL

Tuomas Koskela tkoskela at lbl.gov
Wed Mar 8 16:06:09 CST 2017


Hi, I'm a NERSC postdoc working on this issue. I spoke to Steve Leak, 
who is in charge of the NERSC PETSc builds, and he had a couple of 
suggestions to try. If the issue is in the MPI library, we could build 
PETSc on top of Intel MPI instead of cray-mpich; he is building a 
version for me to test. Another suggestion was to work around the 
cray-mpich issue reported here:

http://www.nersc.gov/users/computational-systems/edison/updates-and-status/open-issues/mpi-3-atomic-performance-degradation-since-cray-mpich7-3-0/

by linking against the DMAPP library. I will test that as well.
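
To separate the MPI library from PETSc, a standalone MPI_Allgatherv 
microbenchmark along the lines below could be run with both the 
cray-mpich and Intel MPI builds at the problematic process counts. This 
is just a sketch I put together, not code from XGC1; the per-rank 
message size and iteration count are placeholders.

/* allgatherv_bench.c: time MPI_Allgatherv at scale (rough sketch) */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int rank, size, i, iter;
  const int nlocal = 100000;  /* placeholder per-rank element count */
  const int niter  = 50;      /* repetitions to average over */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  double *sendbuf = malloc(nlocal * sizeof(double));
  double *recvbuf = malloc((size_t)nlocal * size * sizeof(double));
  int    *counts  = malloc(size * sizeof(int));
  int    *displs  = malloc(size * sizeof(int));

  for (i = 0; i < nlocal; i++) sendbuf[i] = (double)rank;
  for (i = 0; i < size; i++) { counts[i] = nlocal; displs[i] = i * nlocal; }

  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  for (iter = 0; iter < niter; iter++) {
    MPI_Allgatherv(sendbuf, nlocal, MPI_DOUBLE,
                   recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
  }
  double t1 = MPI_Wtime();

  /* report the slowest rank's average time per collective */
  double elapsed = (t1 - t0) / niter, tmax;
  MPI_Reduce(&elapsed, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (!rank) printf("%d ranks: max MPI_Allgatherv time %g s\n", size, tmax);

  free(sendbuf); free(recvbuf); free(counts); free(displs);
  MPI_Finalize();
  return 0;
}

If cray-mpich with and without DMAPP, and Intel MPI, all show similar 
times for the same message sizes, the problem is more likely above the 
MPI layer.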

-Tuomas

On 3/8/17 13:51, Barry Smith wrote:
>> On Mar 8, 2017, at 3:33 PM, Mark Adams <mfadams at lbl.gov> wrote:
>>
>> Our code is having scaling problems on KNL (Cori), when we get up to
>> about 1K sockets.
>>
>> We have isolated the problem to a certain VecScatter. This code stores
>> the data redundantly. Scattering into the solver is just a local copy,
>> but scattering out requires that each process send all of its data to
>> every other process. It is this second one that is not scaling well.
>    Mark,
>
>      Is the scatter created with VecScatterCreateToAll()? If so, internally VecScatterBegin/End will use VecScatterBegin_MPI_ToAll(), which then uses an MPI_Allgatherv() to do the communication. You can check for this in the debugger (on 2 processes) by putting a breakpoint in VecScatterBegin_MPI_ToAll() to confirm whether it is called.
>
>     The Intel profiling tools are pretty good at showing where the time in MPI is spent, so you could run your "bad case" and confirm whether MPI_Allgatherv() is the issue.
>
>     IMHO it is the vendor's responsibility to provide a "good" MPI_Allgatherv(); that is, we shouldn't have to emulate MPI_Allgatherv() with our own message-passing code. So if it is performing far worse than it should based on simple models, you could get in touch with the right people at NERSC/Intel to get it to scale better.
>
>    Let us know if this is the problem point or if something else is consuming the time.
>
>     Barry
>
>
>> I wish I had more data, but this is urgent: jobs are in the queue, and
>> this is all I have. Any recommendations for parameters that we might
>> test while we gather more data?
>>
>> Also, we got this error with -log_view.
>>
>> I've updated their PETSc to maint and we are waiting for it to run
>> again. Apparently this was not on the first time step, so the code
>> seems to have run for a while before hitting what looks to me like a
>> logic bug.
>>
>> Thanks,
>> Mark
>>
>>
>> [4098]PETSC ERROR: --------------------- Error Message
>> --------------------------------------------------------------
>> [4098]PETSC ERROR: Object is in wrong state
>> [4098]PETSC ERROR: Logging event had unbalanced begin/end pairs
>> [4098]PETSC ERROR: See
>> http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble
>> shooting.
>> [4098]PETSC ERROR: Petsc Release Version 3.6.3, unknown
>> [4098]PETSC ERROR: /global/cscratch1/sd/worleyph/XGC1_KNL/xgc2 on a
>> v3.6.3-arch-knl-opt64-intel named nid05668 by worleyph Mon Mar  6
>> 11:33:19 2017
>> [4098]PETSC ERROR: Configure options COPTFLAGS="-g -O3 -fp-model fast
>> -xMIC-AVX512 -DX2_HAVE_INTEL" CXXOPTFLAGS="-g -O3 -fp-model fast
>> -xMIC-AVX512 -DX2_HAVE_INTEL" FOPTFLAGS="-g -O3 -fp-model fast
>> -xMIC-AVX512 -DX2_HAVE_INTEL" --download-metis=1 --download-parmetis=1
>> --with-blas-lapack-dir=/global/common/cori/software/intel/compilers_and_libraries_2017.0.098/linux/mkl
>> --with-cc=cc --with-cxx=cc --with-debugging=0 --with-fc=ftn
>> --with-mpiexec=srun --with-batch=0 --with-memalign=64
>> --with-64-bit-indices --known-mpi-shared-libraries=1
>> PETSC_ARCH=v3.6.3-arch-knl-opt64-intel --with-openmp=1
>> PETSC_DIR=/global/homes/t/tkoskela/git/petsc
>> [4098]PETSC ERROR: #1 PetscLogEventEndDefault() line 696 in
>> /global/u2/t/tkoskela/git/petsc/src/sys/logging/utils/eventlog.c
>> [4098]PETSC ERROR: #2 VecSet() line 577 in
>> /global/u2/t/tkoskela/git/petsc/src/vec/vec/interface/rvector.c
>> [4098]PETSC ERROR: #3 VecCreate_Seq() line 44 in
>> /global/u2/t/tkoskela/git/petsc/src/vec/vec/impls/seq/bvec3.c
>> [4098]PETSC ERROR: #4 VecSetType() line 53 in
>> /global/u2/t/tkoskela/git/petsc/src/vec/vec/interface/vecreg.c
>> [4098]PETSC ERROR: #5 VecDuplicate_Seq() line 786 in
>> /global/u2/t/tkoskela/git/petsc/src/vec/vec/impls/seq/bvec2.c
>> [4098]PETSC ERROR: #6 VecDuplicate() line 399 in
>> /global/u2/t/tkoskela/git/petsc/src/vec/vec/interface/vector.c
>> [4098]PETSC ERROR: #7 VecDuplicateVecs_Default() line 840 in
>> /global/u2/t/tkoskela/git/petsc/src/vec/vec/interface/vector.c
>> [4098]PETSC ERROR: #8 VecDuplicateVecs() line 473 in
>> /global/u2/t/tkoskela/git/petsc/src/vec/vec/interface/vector.c
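
(For anyone finding this thread later: the path Barry describes can be 
exercised with a small PETSc program along the lines below. This is my 
own sketch against the 3.6-era C API, not code from XGC1, and the 
vector size is a placeholder. Setting a breakpoint in 
VecScatterBegin_MPI_ToAll(), as he suggests, then shows whether 
MPI_Allgatherv() is reached.)

/* toall_sketch.c: scatter a parallel Vec redundantly to all ranks */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, xall;   /* parallel vector and its redundant copy */
  VecScatter     toall;
  PetscInt       n = 128;   /* placeholder local size per rank */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = VecCreateMPI(PETSC_COMM_WORLD, n, PETSC_DETERMINE, &x);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);

  /* Creates both the scatter context and the sequential vector that
     every rank will hold in full. */
  ierr = VecScatterCreateToAll(x, &toall, &xall);CHKERRQ(ierr);

  /* Forward scatter: every rank receives the whole vector. For a
     ToAll scatter this goes through VecScatterBegin_MPI_ToAll(), which
     performs the MPI_Allgatherv(). */
  ierr = VecScatterBegin(toall, x, xall, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(toall, x, xall, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);

  ierr = VecScatterDestroy(&toall);CHKERRQ(ierr);
  ierr = VecDestroy(&xall);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running this on a handful of ranks under a debugger, or with the Intel 
profiling tools at scale, should make it clear whether the time in the 
real application is going into that MPI_Allgatherv() call.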



