[petsc-users] Sporadic MPI_Allreduce() called in different locations on larger core counts

Mark Lohry mlohry at gmail.com
Tue Aug 6 07:43:03 CDT 2019

I'm running larger cases than I have previously with a working code, and
I'm hitting failures that I don't see on smaller cases. The failures occur
on 400 cores with ~100M unknowns and 25B nonzero Jacobian entries. A
half-size case runs successfully on 200 cores.

1) The first error output from PETSc is "MPI_Allreduce() called in
different locations". Is this a red herring, i.e. did some process fail
before this point so that the processes had already diverged?

2) I don't think I'm running out of memory -- globally at least. Slurm
output shows e.g.
Memory Utilized: 459.15 GB (estimated maximum)
Memory Efficiency: 26.12% of 1.72 TB (175.78 GB/node)
I did try with and without --with-64-bit-indices.

3) The debug traces vary from run to run; see below. I *think* the failure
might be happening in the vicinity of a coloring call. I'm using
MatFDColoring like this:

    ISColoring    iscoloring;
    MatFDColoring fdcoloring;
    MatColoring   coloring;

    MatColoringCreate(ctx.JPre, &coloring);
    MatColoringSetType(coloring, MATCOLORINGGREEDY);

    // convergence stalls badly without this on small cases, don't know why
    MatColoringSetWeightType(coloring, MAT_COLORING_WEIGHT_LEXICAL);

    // none of these worked.
    // MatColoringSetType(coloring, MATCOLORINGJP);
    // MatColoringSetType(coloring, MATCOLORINGSL);
    // MatColoringSetType(coloring, MATCOLORINGID);

    MatColoringApply(coloring, &iscoloring);
    MatFDColoringCreate(ctx.JPre, iscoloring, &fdcoloring);

I have had issues in the past getting a functional coloring setup for
finite-difference Jacobians, and the above is the only configuration I've
managed to get working. Have there been any significant development
changes to that area of the code since v3.8.3? I'll try upgrading in the
meantime and hope for the best.
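
For context on why the coloring matters here: Jacobian columns that share no
nonzero row can be perturbed together, so one residual evaluation per color
recovers all of their entries. The sketch below is a toy, serial greedy
coloring over a small sparsity pattern, visiting columns in natural (lexical)
order. It is not PETSc's MATCOLORINGGREEDY implementation, and the names
Pattern, greedy_column_coloring and num_colors are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sparsity pattern: rows[i] lists the column indices of the nonzeros in row i.
using Pattern = std::vector<std::vector<int>>;

// Toy serial greedy coloring of columns, visited in natural (lexical) order.
// Two columns conflict if some row has a nonzero in both; conflicting columns
// must get different colors. Returns one 0-based color per column.
std::vector<int> greedy_column_coloring(const Pattern& rows, int ncols) {
  assert(ncols >= 0);
  // col_rows[j] = rows that have a nonzero in column j.
  std::vector<std::vector<int>> col_rows(ncols);
  for (std::size_t i = 0; i < rows.size(); ++i) {
    for (int j : rows[i]) {
      assert(j >= 0 && j < ncols);
      col_rows[j].push_back(static_cast<int>(i));
    }
  }

  std::vector<int> color(ncols, -1);  // -1 means "not colored yet"
  for (int j = 0; j < ncols; ++j) {
    // Mark colors already taken by columns that share a row with column j.
    std::vector<bool> used(ncols, false);
    for (int i : col_rows[j]) {
      for (int k : rows[i]) {
        if (color[k] >= 0) used[color[k]] = true;
      }
    }
    // Take the smallest color not used by any conflicting column.
    int c = 0;
    while (c < ncols && used[c]) ++c;
    color[j] = c;
  }
  return color;
}

// Number of distinct colors, i.e. perturbed residual evaluations needed
// for one finite-difference Jacobian.
int num_colors(const std::vector<int>& color) {
  return color.empty() ? 0 : *std::max_element(color.begin(), color.end()) + 1;
}
```

On a 6x6 tridiagonal pattern this gives colors 0,1,2,0,1,2, so three
perturbed residual evaluations instead of six. The parallel case additionally
has to agree on colors across process boundaries, which this toy ignores.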

Any ideas?
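
A back-of-the-envelope check of the sizes quoted above, as a sketch only. It
assumes 8-byte scalars (real double precision) and AIJ storage of one value
plus one column index per nonzero; neither is stated in this message. The
helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if a count fits in a signed 32-bit integer, which is the index
// width when PETSc is built without --with-64-bit-indices.
bool fits_in_int32(std::int64_t n) {
  return n >= 0 && n <= std::numeric_limits<std::int32_t>::max();
}

// Rough AIJ storage in bytes: one scalar plus one column index per nonzero.
// Ignores row pointers, ghost data, and any preconditioner copies.
std::int64_t aij_bytes(std::int64_t nnz, int scalar_bytes, int index_bytes) {
  assert(nnz >= 0 && scalar_bytes > 0 && index_bytes > 0);
  return nnz * (scalar_bytes + index_bytes);
}

// Average memory per node in GB if usage were perfectly balanced.
double avg_gb_per_node(double total_gb, int nodes) {
  assert(nodes > 0);
  return total_gb / nodes;
}
```

With the numbers in this message: 100M unknowns fit in a 32-bit index; the
global count of 25B nonzeros does not (the limit is 2,147,483,647), although
the per-rank average of 25B / 400 = 62.5M does. aij_bytes(25000000000, 8, 8)
is 4.0e11 bytes (about 400 GB) and 3.0e11 with 4-byte indices. The Slurm
figures imply about 10 nodes (1.72 TB / 175.78 GB per node), so the average is
roughly 45.9 GB per node. An average says nothing about a single node or rank
spiking during coloring or assembly.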



mlohry at lancer:/ssd/dev_ssd/cmake-build$ grep "\[0\]" slurm-3429773.out
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Petsc has generated inconsistent data
[0]PETSC ERROR: MPI_Allreduce() called in different locations (functions)
on different processors
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h19c1n19 by mlohry
Tue Aug  6 06:05:02 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
--with-mpiexec=/usr/bin/srun --with-64-bit-indices
[0]PETSC ERROR: #1 TSSetMaxSteps() line 2944 in
[0]PETSC ERROR: #2 TSSetMaxSteps() line 2944 in
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Invalid argument
[0]PETSC ERROR: Enum value must be same on all processes, argument # 2
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h19c1n19 by mlohry
Tue Aug  6 06:05:02 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
--with-mpiexec=/usr/bin/srun --with-64-bit-indices
[0]PETSC ERROR: #3 TSSetExactFinalTime() line 2250 in
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the
batch system) has told this process to end
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] PetscCommDuplicate line 130
[0]PETSC ERROR: [0] PetscHeaderCreate_Private line 34
[0]PETSC ERROR: [0] DMCreate line 36
[0]PETSC ERROR: [0] DMShellCreate line 983
[0]PETSC ERROR: [0] TSGetDM line 5287
[0]PETSC ERROR: [0] TSSetIFunction line 1310
[0]PETSC ERROR: [0] TSSetExactFinalTime line 2248
[0]PETSC ERROR: [0] TSSetMaxSteps line 2942
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h19c1n19 by mlohry
Tue Aug  6 06:05:02 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
--with-mpiexec=/usr/bin/srun --with-64-bit-indices
[0]PETSC ERROR: #4 User provided function() line 0 in  unknown file
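
On the "called in different locations" message in the trace above: in debug
builds (--with-debugging=yes, as in the configure options shown), PETSc is
understood to wrap its reductions in a check that compares the calling
location across ranks. The following is a serial simulation of that general
idea, with no MPI involved and not taken from PETSc source. Each simulated
rank contributes the pair (-x, +x), a MAX reduction is taken over both, and
the ranks agree only if max(-x) == -max(x). The Site struct, the Check enum
and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Where one simulated rank believes it is calling the reduction from.
struct Site {
  int line;     // source line of the call
  int func_id;  // stand-in for a hash of the function name
};

enum class Check { Ok, DifferentLines, DifferentFunctions };

// Serial stand-in for a MAX all-reduce over {-x, +x} pairs.
// If every rank passes the same x, then max(-x) == -max(x); otherwise the
// two maxima come from different ranks and the identity fails.
static bool all_equal_via_max(const std::vector<int>& x) {
  assert(!x.empty());
  int max_neg = -x[0];
  int max_pos = x[0];
  for (int v : x) {
    max_neg = std::max(max_neg, -v);
    max_pos = std::max(max_pos, v);
  }
  return -max_neg == max_pos;
}

// Compare call sites across all simulated ranks: lines first, then functions.
Check check_call_sites(const std::vector<Site>& ranks) {
  std::vector<int> lines, funcs;
  for (const Site& s : ranks) {
    lines.push_back(s.line);
    funcs.push_back(s.func_id);
  }
  if (!all_equal_via_max(lines)) return Check::DifferentLines;
  if (!all_equal_via_max(funcs)) return Check::DifferentFunctions;
  return Check::Ok;
}
```

A check of this kind only reports that ranks reached different reductions. It
does not say which rank left the common code path first, so the message can
be a symptom of an earlier failure elsewhere rather than the root cause.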


mlohry at lancer:/ssd/dev_ssd/cmake-build$ grep "\[0\]" slurm-3429158.out
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Petsc has generated inconsistent data
[0]PETSC ERROR: MPI_Allreduce() called in different locations (code lines)
on different processors
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h21c2n1 by mlohry
Mon Aug  5 23:58:19 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #1 MatSetBlockSizes() line 7206 in
[0]PETSC ERROR: #2 MatSetBlockSizes() line 7206 in
[0]PETSC ERROR: #3 MatSetBlockSize() line 7170 in
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Petsc has generated inconsistent data
[0]PETSC ERROR: MPI_Allreduce() called in different locations (code lines)
on different processors
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h21c2n1 by mlohry
Mon Aug  5 23:58:19 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #4 VecSetSizes() line 1310 in
[0]PETSC ERROR: #5 VecSetSizes() line 1310 in
[0]PETSC ERROR: #6 VecCreateMPIWithArray() line 609 in
[0]PETSC ERROR: #7 MatSetUpMultiply_MPIAIJ() line 111 in
[0]PETSC ERROR: #8 MatAssemblyEnd_MPIAIJ() line 735 in
[0]PETSC ERROR: #9 MatAssemblyEnd() line 5243 in
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] PetscSFSetGraphLayout line 497
[0]PETSC ERROR: [0] GreedyColoringLocalDistanceTwo_Private line 208
[0]PETSC ERROR: [0] MatColoringApply_Greedy line 559
[0]PETSC ERROR: [0] MatColoringApply line 357
[0]PETSC ERROR: [0] VecSetSizes line 1308
[0]PETSC ERROR: [0] VecCreateMPIWithArray line 605
[0]PETSC ERROR: [0] MatSetUpMultiply_MPIAIJ line 24
[0]PETSC ERROR: [0] MatAssemblyEnd_MPIAIJ line 698
[0]PETSC ERROR: [0] MatAssemblyEnd line 5234
[0]PETSC ERROR: [0] MatSetBlockSizes line 7204
[0]PETSC ERROR: [0] MatSetBlockSize line 7167
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h21c2n1 by mlohry
Mon Aug  5 23:58:19 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #10 User provided function() line 0 in  unknown file


mlohry at lancer:/ssd/dev_ssd/cmake-build$ grep "\[0\]" slurm-3429134.out
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Petsc has generated inconsistent data
[0]PETSC ERROR: MPI_Allreduce() called in different locations (code lines)
on different processors
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h20c2n1 by mlohry
Mon Aug  5 23:24:23 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #1 PetscSplitOwnership() line 88 in
[0]PETSC ERROR: #2 PetscSplitOwnership() line 88 in
[0]PETSC ERROR: #3 PetscLayoutSetUp() line 137 in
[0]PETSC ERROR: #4 VecCreate_MPI_Private() line 489 in
[0]PETSC ERROR: #5 VecCreate_MPI() line 537 in
[0]PETSC ERROR: #6 VecSetType() line 51 in
[0]PETSC ERROR: #7 VecCreateMPI() line 40 in
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Object is in wrong state
[0]PETSC ERROR: Vec object's type is not set: Argument # 1
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h20c2n1 by mlohry
Mon Aug  5 23:24:23 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #8 VecGetLocalSize() line 665 in


mlohry at lancer:/ssd/dev_ssd/cmake-build$ grep "\[0\]" slurm-3429102.out
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Petsc has generated inconsistent data
[0]PETSC ERROR: MPI_Allreduce() called in different locations (code lines)
on different processors
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h19c1n16 by mlohry
Mon Aug  5 22:50:12 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #1 TSSetExactFinalTime() line 2250 in
[0]PETSC ERROR: #2 TSSetExactFinalTime() line 2250 in
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Petsc has generated inconsistent data
[0]PETSC ERROR: MPI_Allreduce() called in different locations (code lines)
on different processors
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h19c1n16 by mlohry
Mon Aug  5 22:50:12 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #3 MatSetBlockSizes() line 7206 in
[0]PETSC ERROR: #4 MatSetBlockSizes() line 7206 in
[0]PETSC ERROR: #5 MatSetBlockSize() line 7170 in
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Petsc has generated inconsistent data
[0]PETSC ERROR: MPI_Allreduce() called in different locations (code lines)
on different processors
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h19c1n16 by mlohry
Mon Aug  5 22:50:12 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #6 MatStashScatterBegin_Ref() line 476 in
[0]PETSC ERROR: #7 MatStashScatterBegin_Ref() line 476 in
[0]PETSC ERROR: #8 MatStashScatterBegin_Private() line 455 in
[0]PETSC ERROR: #9 MatAssemblyBegin_MPIAIJ() line 679 in
[0]PETSC ERROR: #10 MatAssemblyBegin() line 5154 in
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X
to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] MatStashScatterEnd_Ref line 137
[0]PETSC ERROR: [0] MatStashScatterEnd_Private line 126
[0]PETSC ERROR: [0] MatAssemblyEnd_MPIAIJ line 698
[0]PETSC ERROR: [0] MatAssemblyEnd line 5234
[0]PETSC ERROR: [0] MatStashScatterBegin_Ref line 473
[0]PETSC ERROR: [0] MatStashScatterBegin_Private line 454
[0]PETSC ERROR: [0] MatAssemblyBegin_MPIAIJ line 676
[0]PETSC ERROR: [0] MatAssemblyBegin line 5143
[0]PETSC ERROR: [0] MatSetBlockSizes line 7204
[0]PETSC ERROR: [0] MatSetBlockSize line 7167
[0]PETSC ERROR: [0] TSSetExactFinalTime line 2248
[0]PETSC ERROR: --------------------- Error Message
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.8.3, Dec, 09, 2017
[0]PETSC ERROR: maDG on a arch-linux2-c-opt named tiger-h19c1n16 by mlohry
Mon Aug  5 22:50:12 2019
[0]PETSC ERROR: Configure options
PETSC_DIR=/home/mlohry/build/external/petsc PETSC_ARCH=arch-linux2-c-opt
--with-fc=0 --with-clanguage=C++ --with-pic=1 --with-debugging=yes
COPTFLAGS='-O3' CXXOPTFLAGS='-O3' --with-shared-libraries=1
--download-parmetis --download-metis MAKEFLAGS=$MAKEFLAGS
[0]PETSC ERROR: #11 User provided function() line 0 in  unknown file