[petsc-users] SuperLU_dist issue in 3.7.4: failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

Barry Smith bsmith at mcs.anl.gov
Mon Oct 24 09:31:17 CDT 2016


> [Or perhaps Hong is using a different test code and is observing bugs
> with superlu_dist interface..]

   She states that her test does a NEW MatCreate() for each matrix load (I cut and pasted it in the email I just sent). The bug I fixed was only related to using the SAME matrix from one MatLoad() in another MatLoad(). 
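
   To make the distinction concrete, roughly (a sketch, not the actual test source; viewer setup and error checking elided):

     /* the pattern Hong's test uses: a NEW Mat for each load -- unaffected by the bug */
     MatCreate(PETSC_COMM_WORLD,&A);
     MatLoad(A,fd);
     MatDestroy(&A);
     MatCreate(PETSC_COMM_WORLD,&A);
     MatLoad(A,fd);

     /* the pattern my fix addresses: the SAME Mat reused for a second MatLoad() */
     MatCreate(PETSC_COMM_WORLD,&B);
     MatLoad(B,fd);
     MatLoad(B,fd);    /* second load into the same Mat; previously left B corrupted */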

  Barry



> On Oct 24, 2016, at 9:25 AM, Satish Balay <balay at mcs.anl.gov> wrote:
> 
> Yes - but this test code [that Hong is also using] is buggy due to
> using MatLoad() twice - so the corrupted matrix does have weird
> behavior later in PC.
> 
> With your fix - the test code provided by Anton behaves fine for
> me. So Hong would have to restart the diagnosis - and I suspect all
> the weird behavior she observed will go away [well, I don't see the
> original weird behavior with this test code anymore].
> 
> Since you said "This will also make MatMPIAIJSetPreallocation() work
> properly with multiple calls" - perhaps Anton's issue is also somehow
> related? I think it's best if he can try this fix.
> 
> And if it doesn't work - then we'll need a better test case to
> reproduce.
> 
> [Or perhaps Hong is using a different test code and is observing bugs
> with superlu_dist interface..]
> 
> Satish
> 
> On Mon, 24 Oct 2016, Barry Smith wrote:
> 
>> 
>>   Hong wrote:  (Note that it creates a new Mat each time, so it shouldn't be affected by the bug I fixed; it also "works" with MUMPS but not superlu_dist.)
>> 
>> 
>> It is not a problem with MatLoad() being called twice. The file has one matrix, but it is loaded twice.
>> 
>> Replacing pc with ksp, the code runs fine. 
>> The error occurs when PCSetUp_LU() is called with SAME_NONZERO_PATTERN.
>> I'll look into it further later.
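>>
>> For reference, what I mean by "replacing pc with ksp" is roughly the following (a sketch, not my actual code; viewer setup elided):
>>
>> KSPCreate(PETSC_COMM_WORLD,&ksp);
>>
>> MatCreate(PETSC_COMM_WORLD,&A);
>> MatLoad(A,fd);
>> KSPSetOperators(ksp,A,A);
>> KSPSetUp(ksp);
>>
>> MatCreate(PETSC_COMM_WORLD,&A);
>> MatLoad(A,fd);
>> KSPSetOperators(ksp,A,A);
>> KSPSetUp(ksp);  // this variant does not crash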
>> 
>> Hong
>> ________________________________________
>> From: Zhang, Hong
>> Sent: Friday, October 21, 2016 8:18 PM
>> To: Barry Smith; petsc-users
>> Subject: RE: [petsc-users] SuperLU_dist issue in 3.7.4
>> 
>> I am investigating it. The file has two matrices. The code takes the following steps:
>> 
>> PCCreate(PETSC_COMM_WORLD, &pc);
>> 
>> MatCreate(PETSC_COMM_WORLD,&A);
>> MatLoad(A,fd);
>> PCSetOperators(pc,A,A);
>> PCSetUp(pc);
>> 
>> MatCreate(PETSC_COMM_WORLD,&A);
>> MatLoad(A,fd);
>> PCSetOperators(pc,A,A);
>> PCSetUp(pc);  // crashes here with np=2 and superlu_dist; not with mumps/superlu, or with superlu_dist and np=1
>> 
>> Hong
>> 
>>> On Oct 24, 2016, at 9:00 AM, Satish Balay <balay at mcs.anl.gov> wrote:
>>> 
>>> Since the provided test code doesn't crash [and is valgrind clean] -
>>> with this fix - I'm not sure what bug Hong is chasing.
>>> 
>>> Satish
>>> 
>>> On Mon, 24 Oct 2016, Barry Smith wrote:
>>> 
>>>> 
>>>> Anton,
>>>> 
>>>>  Sorry for any confusion. This doesn't resolve the SuperLU_DIST issue, which I think Hong is working on; it only resolves multiple loads of matrices into the same Mat.
>>>> 
>>>> Barry
>>>> 
>>>>> On Oct 24, 2016, at 5:07 AM, Anton Popov <popov at uni-mainz.de> wrote:
>>>>> 
>>>>> Thank you Barry, Satish, Fande!
>>>>> 
>>>>> Is there a chance to get this fix into the maintenance release 3.7.5 together with the latest SuperLU_DIST? Or is the next release a more realistic option?
>>>>> 
>>>>> Anton
>>>>> 
>>>>> On 10/24/2016 01:58 AM, Satish Balay wrote:
>>>>>> The original test code from Anton also works [i.e. is valgrind clean] with this change.
>>>>>> 
>>>>>> Satish
>>>>>> 
>>>>>> On Sun, 23 Oct 2016, Barry Smith wrote:
>>>>>> 
>>>>>>>  Thanks Satish,
>>>>>>> 
>>>>>>>     I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  (in next for testing)
>>>>>>> 
>>>>>>>   Fande,
>>>>>>> 
>>>>>>>       This will also make MatMPIAIJSetPreallocation() work properly with multiple calls (you will not need a MatReset()).
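>>>>>>>
>>>>>>>       In other words, after the fix a sequence like the following should work without an intervening MatReset() (a sketch; the sizes and nonzero counts are only illustrative):
>>>>>>>
>>>>>>>       MatCreate(PETSC_COMM_WORLD,&A);
>>>>>>>       MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,36,36);
>>>>>>>       MatSetType(A,MATMPIAIJ);
>>>>>>>       MatMPIAIJSetPreallocation(A,5,NULL,2,NULL);   /* first preallocation */
>>>>>>>       /* ... MatSetValues(), MatAssemblyBegin/End() ... */
>>>>>>>       MatMPIAIJSetPreallocation(A,7,NULL,3,NULL);   /* second call on the same Mat */
>>>>>>>       /* ... fill and assemble again ... */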
>>>>>>> 
>>>>>>>  Barry
>>>>>>> 
>>>>>>> 
>>>>>>>> On Oct 21, 2016, at 6:48 PM, Satish Balay <balay at mcs.anl.gov> wrote:
>>>>>>>> 
>>>>>>>> On Fri, 21 Oct 2016, Barry Smith wrote:
>>>>>>>> 
>>>>>>>>> valgrind first
>>>>>>>> balay at asterix /home/balay/download-pine/x/superlu_dist_test
>>>>>>>> $ mpiexec -n 2 $VG ./ex16 -f ~/datafiles/matrices/small
>>>>>>>> First MatLoad!
>>>>>>>> Mat Object: 2 MPI processes
>>>>>>>> type: mpiaij
>>>>>>>> row 0: (0, 4.)  (1, -1.)  (6, -1.)
>>>>>>>> row 1: (0, -1.)  (1, 4.)  (2, -1.)  (7, -1.)
>>>>>>>> row 2: (1, -1.)  (2, 4.)  (3, -1.)  (8, -1.)
>>>>>>>> row 3: (2, -1.)  (3, 4.)  (4, -1.)  (9, -1.)
>>>>>>>> row 4: (3, -1.)  (4, 4.)  (5, -1.)  (10, -1.)
>>>>>>>> row 5: (4, -1.)  (5, 4.)  (11, -1.)
>>>>>>>> row 6: (0, -1.)  (6, 4.)  (7, -1.)  (12, -1.)
>>>>>>>> row 7: (1, -1.)  (6, -1.)  (7, 4.)  (8, -1.)  (13, -1.)
>>>>>>>> row 8: (2, -1.)  (7, -1.)  (8, 4.)  (9, -1.)  (14, -1.)
>>>>>>>> row 9: (3, -1.)  (8, -1.)  (9, 4.)  (10, -1.)  (15, -1.)
>>>>>>>> row 10: (4, -1.)  (9, -1.)  (10, 4.)  (11, -1.)  (16, -1.)
>>>>>>>> row 11: (5, -1.)  (10, -1.)  (11, 4.)  (17, -1.)
>>>>>>>> row 12: (6, -1.)  (12, 4.)  (13, -1.)  (18, -1.)
>>>>>>>> row 13: (7, -1.)  (12, -1.)  (13, 4.)  (14, -1.)  (19, -1.)
>>>>>>>> row 14: (8, -1.)  (13, -1.)  (14, 4.)  (15, -1.)  (20, -1.)
>>>>>>>> row 15: (9, -1.)  (14, -1.)  (15, 4.)  (16, -1.)  (21, -1.)
>>>>>>>> row 16: (10, -1.)  (15, -1.)  (16, 4.)  (17, -1.)  (22, -1.)
>>>>>>>> row 17: (11, -1.)  (16, -1.)  (17, 4.)  (23, -1.)
>>>>>>>> row 18: (12, -1.)  (18, 4.)  (19, -1.)  (24, -1.)
>>>>>>>> row 19: (13, -1.)  (18, -1.)  (19, 4.)  (20, -1.)  (25, -1.)
>>>>>>>> row 20: (14, -1.)  (19, -1.)  (20, 4.)  (21, -1.)  (26, -1.)
>>>>>>>> row 21: (15, -1.)  (20, -1.)  (21, 4.)  (22, -1.)  (27, -1.)
>>>>>>>> row 22: (16, -1.)  (21, -1.)  (22, 4.)  (23, -1.)  (28, -1.)
>>>>>>>> row 23: (17, -1.)  (22, -1.)  (23, 4.)  (29, -1.)
>>>>>>>> row 24: (18, -1.)  (24, 4.)  (25, -1.)  (30, -1.)
>>>>>>>> row 25: (19, -1.)  (24, -1.)  (25, 4.)  (26, -1.)  (31, -1.)
>>>>>>>> row 26: (20, -1.)  (25, -1.)  (26, 4.)  (27, -1.)  (32, -1.)
>>>>>>>> row 27: (21, -1.)  (26, -1.)  (27, 4.)  (28, -1.)  (33, -1.)
>>>>>>>> row 28: (22, -1.)  (27, -1.)  (28, 4.)  (29, -1.)  (34, -1.)
>>>>>>>> row 29: (23, -1.)  (28, -1.)  (29, 4.)  (35, -1.)
>>>>>>>> row 30: (24, -1.)  (30, 4.)  (31, -1.)
>>>>>>>> row 31: (25, -1.)  (30, -1.)  (31, 4.)  (32, -1.)
>>>>>>>> row 32: (26, -1.)  (31, -1.)  (32, 4.)  (33, -1.)
>>>>>>>> row 33: (27, -1.)  (32, -1.)  (33, 4.)  (34, -1.)
>>>>>>>> row 34: (28, -1.)  (33, -1.)  (34, 4.)  (35, -1.)
>>>>>>>> row 35: (29, -1.)  (34, -1.)  (35, 4.)
>>>>>>>> Second MatLoad!
>>>>>>>> Mat Object: 2 MPI processes
>>>>>>>> type: mpiaij
>>>>>>>> ==4592== Invalid read of size 4
>>>>>>>> ==4592==    at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1402)
>>>>>>>> ==4592==    by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
>>>>>>>> ==4592==    by 0x53373D7: MatView (matrix.c:989)
>>>>>>>> ==4592==    by 0x40107E: main (ex16.c:30)
>>>>>>>> ==4592==  Address 0xa47b460 is 20 bytes after a block of size 28 alloc'd
>>>>>>>> ==4592==    at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
>>>>>>>> ==4592==    by 0x4FD121A: PetscMallocAlign (mal.c:28)
>>>>>>>> ==4592==    by 0x5842C70: MatSetUpMultiply_MPIAIJ (mmaij.c:41)
>>>>>>>> ==4592==    by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
>>>>>>>> ==4592==    by 0x536B299: MatAssemblyEnd (matrix.c:5298)
>>>>>>>> ==4592==    by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
>>>>>>>> ==4592==    by 0x5337FEA: MatLoad (matrix.c:1101)
>>>>>>>> ==4592==    by 0x400D9F: main (ex16.c:22)
>>>>>>>> ==4592==
>>>>>>>> ==4591== Invalid read of size 4
>>>>>>>> ==4591==    at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1402)
>>>>>>>> ==4591==    by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
>>>>>>>> ==4591==    by 0x53373D7: MatView (matrix.c:989)
>>>>>>>> ==4591==    by 0x40107E: main (ex16.c:30)
>>>>>>>> ==4591==  Address 0xa482958 is 24 bytes before a block of size 7 alloc'd
>>>>>>>> ==4591==    at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
>>>>>>>> ==4591==    by 0x4FD121A: PetscMallocAlign (mal.c:28)
>>>>>>>> ==4591==    by 0x4F31FB5: PetscStrallocpy (str.c:197)
>>>>>>>> ==4591==    by 0x4F0D3F5: PetscClassRegLogRegister (classlog.c:253)
>>>>>>>> ==4591==    by 0x4EF96E2: PetscClassIdRegister (plog.c:2053)
>>>>>>>> ==4591==    by 0x51FA018: VecInitializePackage (dlregisvec.c:165)
>>>>>>>> ==4591==    by 0x51F6DE9: VecCreate (veccreate.c:35)
>>>>>>>> ==4591==    by 0x51C49F0: VecCreateSeq (vseqcr.c:37)
>>>>>>>> ==4591==    by 0x5843191: MatSetUpMultiply_MPIAIJ (mmaij.c:104)
>>>>>>>> ==4591==    by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
>>>>>>>> ==4591==    by 0x536B299: MatAssemblyEnd (matrix.c:5298)
>>>>>>>> ==4591==    by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
>>>>>>>> ==4591==    by 0x5337FEA: MatLoad (matrix.c:1101)
>>>>>>>> ==4591==    by 0x400D9F: main (ex16.c:22)
>>>>>>>> ==4591==
>>>>>>>> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>>>>>> [0]PETSC ERROR: Argument out of range
>>>>>>>> [0]PETSC ERROR: Column too large: col 96 max 35
>>>>>>>> [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>>>>> [0]PETSC ERROR: Petsc Development GIT revision: v3.7.4-1729-g4c4de23  GIT Date: 2016-10-20 22:22:58 +0000
>>>>>>>> [0]PETSC ERROR: ./ex16 on a arch-idx64-slu named asterix by balay Fri Oct 21 18:47:51 2016
>>>>>>>> [0]PETSC ERROR: Configure options --download-metis --download-parmetis --download-superlu_dist PETSC_ARCH=arch-idx64-slu
>>>>>>>> [0]PETSC ERROR: #1 MatSetValues_MPIAIJ() line 585 in /home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>>> [0]PETSC ERROR: #2 MatAssemblyEnd_MPIAIJ() line 724 in /home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>>> [0]PETSC ERROR: #3 MatAssemblyEnd() line 5298 in /home/balay/petsc/src/mat/interface/matrix.c
>>>>>>>> [0]PETSC ERROR: #4 MatView_MPIAIJ_ASCIIorDraworSocket() line 1410 in /home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>>> [0]PETSC ERROR: #5 MatView_MPIAIJ() line 1440 in /home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>>>> [0]PETSC ERROR: #6 MatView() line 989 in /home/balay/petsc/src/mat/interface/matrix.c
>>>>>>>> [0]PETSC ERROR: #7 main() line 30 in /home/balay/download-pine/x/superlu_dist_test/ex16.c
>>>>>>>> [0]PETSC ERROR: PETSc Option Table entries:
>>>>>>>> [0]PETSC ERROR: -display :0.0
>>>>>>>> [0]PETSC ERROR: -f /home/balay/datafiles/matrices/small
>>>>>>>> [0]PETSC ERROR: -malloc_dump
>>>>>>>> [0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint at mcs.anl.gov----------
>>>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 63) - process 0
>>>>>>>> [cli_0]: aborting job:
>>>>>>>> application called MPI_Abort(MPI_COMM_WORLD, 63) - process 0
>>>>>>>> ==4591== 16,965 (2,744 direct, 14,221 indirect) bytes in 1 blocks are definitely lost in loss record 1,014 of 1,016
>>>>>>>> ==4591==    at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
>>>>>>>> ==4591==    by 0x4FD121A: PetscMallocAlign (mal.c:28)
>>>>>>>> ==4591==    by 0x52F3B14: MatCreate (gcreate.c:84)
>>>>>>>> ==4591==    by 0x581390A: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1371)
>>>>>>>> ==4591==    by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
>>>>>>>> ==4591==    by 0x53373D7: MatView (matrix.c:989)
>>>>>>>> ==4591==    by 0x40107E: main (ex16.c:30)
>>>>>>>> ==4591==
>>>>>>>> 
>>>>>>>> ===================================================================================
>>>>>>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>>>>>> =   PID 4591 RUNNING AT asterix
>>>>>>>> =   EXIT CODE: 63
>>>>>>>> =   CLEANING UP REMAINING PROCESSES
>>>>>>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>>>>>> ===================================================================================
>>>>>>>> balay at asterix /home/balay/download-pine/x/superlu_dist_test
>>>>>>>> $
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
