[petsc-users] MPI-FFTW example crashes

Sajid Ali sajidsyed2021 at u.northwestern.edu
Sun Jun 2 23:14:20 CDT 2019


Hi PETSc-developers,

I'm trying to run ex143 on a cluster (alcf-theta). I compiled PETSc on the
login node with cray-fftw-3.3.8.1, and neither configure nor make reported
any errors.

When I run ex143 with 1 MPI rank on a compute node, everything works fine,
but with 2 MPI ranks it crashes with an illegal instruction signal, which
PETSc reports as likely due to memory corruption. I tried running it under
valgrind, but the valgrind module available on theta fails with
`valgrind: failed to start tool 'memcheck' for platform 'amd64-linux': No such file or directory`.

To get around this, I ran it with gdb4hpc. I have attached the backtrace,
which shows that the crash happens inside the MPI-FFTW call
(MatMult_MPIFFTW). I have also attached the output with the
-start_in_debugger command-line option.

What could cause this error, and how do I fix it?

Thank You,
Sajid Ali
Applied Physics
Northwestern University
-------------- next part --------------
sajid@thetamom1:/gpfs/mira-home/sajid/sajid_proj/test_fftw> aprun -n 2 --cc depth -d 1 -j 1 -r 1 ./ex143 -start_in_debugger -log_view &> out
sajid@thetamom1:/gpfs/mira-home/sajid/sajid_proj/test_fftw> cat out
PETSC: Attaching gdb to ./ex143 of pid 62260 on display :0.0 on machine nid03832
PETSC: Attaching gdb to ./ex143 of pid 62259 on display :0.0 on machine nid03832
xterm: xterm: Xt error: Can't open display: :0.0
Xt error: Can't open display: :0.0
xterm: xterm: DISPLAY is not set
DISPLAY is not set
Use PETSc-FFTW interface...1-DIM: 30
[1]PETSC ERROR: [0]PETSC ERROR: ------------------------------------------------------------------------
------------------------------------------------------------------------
[0]PETSC ERROR: [1]PETSC ERROR: Caught signal number 4 Illegal instruction: Likely due to memory corruption
Caught signal number 4 Illegal instruction: Likely due to memory corruption
[0]PETSC ERROR: [1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: [1]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: [1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: [0]PETSC ERROR: likely location of problem given in stack below
likely location of problem given in stack below
[1]PETSC ERROR: [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: [1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR: [1]PETSC ERROR:       INSTEAD the line number of the start of the function
      INSTEAD the line number of the start of the function
[0]PETSC ERROR: [1]PETSC ERROR:       is given.
      is given.
[0]PETSC ERROR: [1]PETSC ERROR: [0] MatMult_MPIFFTW line 236 /gpfs/mira-home/sajid/packages/petsc/src/mat/impls/fft/fftw/fftw.c
[1] MatMult_MPIFFTW line 236 /gpfs/mira-home/sajid/packages/petsc/src/mat/impls/fft/fftw/fftw.c
[0]PETSC ERROR: [1]PETSC ERROR: [1] MatMult line 2402 /gpfs/mira-home/sajid/packages/petsc/src/mat/interface/matrix.c
[0] MatMult line 2402 /gpfs/mira-home/sajid/packages/petsc/src/mat/interface/matrix.c
[1]PETSC ERROR: [0]PETSC ERROR: User provided function() line 0 in  unknown file (null)
User provided function() line 0 in  unknown file (null)
_pmiu_daemon(SIGCHLD): [NID 03832] [c7-1c2s14n0] [Mon Jun  3 04:10:53 2019] PE RANK 0 exit signal Aborted
[NID 03832] 2019-06-03 04:10:53 Apid 13751865: initiated application termination
Application 13751865 exit codes: 134
Application 13751865 resources: utime ~0s, stime ~2s, Rss ~27708, inblocks ~9678, outblocks ~0
sajid@thetamom1:/gpfs/mira-home/sajid/sajid_proj/test_fftw>
-------------- next part --------------
sajid@thetamom1:/gpfs/mira-home/sajid/sajid_proj/test_fftw> gdb4hpc
gdb4hpc 3.0 - Cray Line Mode Parallel Debugger
With Cray Comparative Debugging Technology.
Copyright 2007-2018 Cray Inc. All Rights Reserved.
Copyright 1996-2016 University of Queensland. All Rights Reserved.

Type "help" for a list of commands.
Type "help <cmd>" for detailed help about a command.
dbg all> maint set unsafe on
dbg all> launch $a{2} ./ex143
Starting application, please wait...
Creating MRNet communication network...
Waiting for debug servers to attach to MRNet communications network...
Timeout in 400 seconds. Please wait for the attach to complete.
Number of dbgsrvs connected: [0];  Timeout Counter: [1]
Number of dbgsrvs connected: [1];  Timeout Counter: [0]
Number of dbgsrvs connected: [1];  Timeout Counter: [1]
Number of dbgsrvs connected: [2];  Timeout Counter: [0]
Finalizing setup...
Launch complete.
a{0..1}: Initial breakpoint, main at /lus/theta-fs0/projects/large3dxrayADSP/sajid_proj/test_fftw/ex143.c:27
dbg all> step 20
a{0..1}: main at /lus/theta-fs0/projects/large3dxrayADSP/sajid_proj/test_fftw/ex143.c:100
dbg all> step 20
<$a>: Use PETSc-FFTW interface...1-DIM: 30
a{0..1}: main at /lus/theta-fs0/projects/large3dxrayADSP/sajid_proj/test_fftw/ex143.c:128
dbg all> step 20
a{0..1}: Program received signal SIGILL.
a{0..1}: In sadt at :0
dbg all> backtrace
a{0..1}: #0  0x00002aaab5c769c2 in sadt
a{0..1}: #1  0x00002aaab26399f6 in MatMult_MPIFFTW
a{0..1}: #2  0x00002aaab2579c2a in MatMult
a{0..1}: #3  0x0000000000404ded in main at /lus/theta-fs0/projects/large3dxrayADSP/sajid_proj/test_fftw/ex143.c:128
dbg all>
