[petsc-users] Code sometimes work, sometimes hang when increase cpu usage

TAY wee-beng zonexo at gmail.com
Fri Dec 25 03:55:06 CST 2015


Hi,

I just realised that the nodes I tested on may have some problems, as 
they have only just been set up. So there may be a problem in the MPI 
communication on those nodes.

I am now running my tests on the old nodes.

I have reduced my problem to a serial one. The code works fine with the 
debug build. With the optimized build, however, the behaviour depends on 
the compiler. With the Intel build, it gets past time step 1, solving the 
Poisson eqn, but gives a segmentation fault in the next time step.

With gfortran, it hangs right at the start. I have attached the valgrind 
output, but there are some problems running valgrind with my code.

A normal run of my code prints:

  final initial IIB_cell_no       24000
  min I_cell_no           0
  max I_cell_no         900
  final initial I_cell_no       45000
  size(IIB_cell_u),size(I_cell_u),size(IIB_equal_cell_u),size(I_equal_cell_u) 24000       45000       24000       45000
  IIB_I_cell_no_uvw_total1        8297        8251        8332       13965       13830       14089

... solve Poisson eqn

  1      0.01445783      0.26942225      0.33036381      1.15264402 -0.29464833E+03 -0.28723563E+02  0.27972784E+07
  2      0.01445783      0.26942225      0.33036381      1.15264402 -0.29464833E+03 -0.28723563E+02  0.27972784E+07
...

With ifort, it stops between steps 1 and 2. With gfortran, it hangs 
right at the start.


Thank you.

Yours sincerely,

TAY wee-beng

On 25/12/2015 12:42 PM, Barry Smith wrote:
>> On Dec 24, 2015, at 10:37 PM, TAY wee-beng <zonexo at gmail.com> wrote:
>>
>> Hi,
>>
>> I tried valgrind in MPI but it aborts very early, with the error msg regarding PETSc initialize.
>    It shouldn't "abort"; it should print some error message and continue. Please send all the output when running with valgrind.
>
>     It is possible you are solving a large enough problem that it requires configuring with --with-64-bit-indices. Does that resolve the problem?
>
>    Barry
>
>> I retried, using a lower resolution.
>>
>> GAMG works, but BoomerAMG and hypre don't. Increasing the no. of cpus too far (80) also causes it to hang; 60 works fine.
>>
>> My grid size is 98x169x169.
>>
>> But when I increase the resolution, GAMG no longer works.
>>
>> I tried increasing the no. of cpus but it still doesn't work.
>>
>> Previously, using a partition in the z direction only, it worked with both GAMG and hypre. So what could be the problem?
>> Thank you.
>>
>> Yours sincerely,
>>
>> TAY wee-beng
>>
>> On 25/12/2015 12:33 AM, Matthew Knepley wrote:
>>> It sounds like you have memory corruption in a different part of the code. Run in valgrind.
>>>
>>>    Matt
>>>
>>> On Thu, Dec 24, 2015 at 10:14 AM, TAY wee-beng <zonexo at gmail.com> wrote:
>>> Hi,
>>>
>>> I have this strange error. I converted my CFD code from a z direction only partition to a yz direction partition. The code works fine, but when I increase the no. of cpus, strange things happen when solving the Poisson eqn.
>>>
>>> I increased the no. of cpus from 24 to 40.
>>>
>>> Sometimes it works, sometimes it doesn't. When it doesn't, it just hangs there with no output, or it gives the error below:
>>>
>>> Using MPI_Barrier during debug shows that it hangs at
>>>
>>> call KSPSolve(ksp,b_rhs,xx,ierr).
>>>
>>> I use hypre BoomerAMG and GAMG (-poisson_pc_gamg_agg_nsmooths 1 -poisson_pc_type gamg)
>>>
>>>
>>> Why is this so random? Also, how do I debug this type of problem?
>>>
>>>
>>> [32]PETSC ERROR: ------------------------------------------------------------------------
>>> [32]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>> [32]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [32]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [32]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> [32]PETSC ERROR: likely location of problem given in stack below
>>> [32]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>>> [32]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
>>> [32]PETSC ERROR:       INSTEAD the line number of the start of the function
>>> [32]PETSC ERROR:       is given.
>>> [32]PETSC ERROR: [32] HYPRE_SetupXXX line 174 /home/wtay/Codes/petsc-3.6.2/src/ksp/pc/impls/hypre/hypre.c
>>> [32]PETSC ERROR: [32] PCSetUp_HYPRE line 122 /home/wtay/Codes/petsc-3.6.2/src/ksp/pc/impls/hypre/hypre.c
>>> [32]PETSC ERROR: [32] PCSetUp line 945 /home/wtay/Codes/petsc-3.6.2/src/ksp/pc/interface/precon.c
>>> [32]PETSC ERROR: [32] KSPSetUp line 247 /home/wtay/Codes/petsc-3.6.2/src/ksp/ksp/interface/itfunc.c
>>> [32]PETSC ERROR: [32] KSPSolve line 510 /home/wtay/Codes/petsc-3.6.2/src/ksp/ksp/interface/itfunc.c
>>> [32]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>> [32]PETSC ERROR: Signal received
>>> [32]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>> [32]PETSC ERROR: Petsc Release Version 3.6.2, Oct, 02, 2015
>>> [32]PETSC ERROR: ./a.out on a petsc-3.6.2_shared_gnu_debug named n12-40 by wtay Thu Dec 24 17:01:51 2015
>>> [32]PETSC ERROR: Configure options --with-mpi-dir=/opt/ud/openmpi-1.8.8/ --download-fblaslapack=1 --with-debugging=1 --download-hypre=1 --prefix=/home/wtay/Lib/petsc-3.6.2_shared_gnu_debug --known-mpi-shared=1 --with-shared-libraries --with-fortran-interfaces=1
>>> [32]PETSC ERROR: #1 User provided function() line 0 in  unknown file
>>> --------------------------------------------------------------------------
>>> MPI_ABORT was invoked on rank 32 in communicator MPI_COMM_WORLD
>>> with errorcode 59.
>>>
>>> -- 
>>> Thank you.
>>>
>>> Yours sincerely,
>>>
>>> TAY wee-beng
>>>
>>>
>>>
>>>
>>> -- 
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>> -- Norbert Wiener
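
On the --with-64-bit-indices suggestion quoted above: a rough way to judge whether 32-bit indices could be the limit is to compare the problem size against 2**31 - 1. The sketch below does that for the 98x169x169 grid mentioned in the thread. The function name index_budget is hypothetical, and the 7-point stencil is an assumption (the thread does not state the discretization), so the nonzero count is only an estimate.

```python
# Largest value a signed 32-bit index can hold.
INT32_MAX = 2**31 - 1


def index_budget(nx, ny, nz, stencil_points=7):
    """Estimate unknowns and matrix nonzeros for a structured grid.

    stencil_points=7 is an assumption (standard 3D Poisson stencil);
    it is not taken from the thread.
    Returns (unknowns, nonzeros_estimate, exceeds_32_bit).
    """
    unknowns = nx * ny * nz
    nonzeros_estimate = unknowns * stencil_points
    return unknowns, nonzeros_estimate, nonzeros_estimate > INT32_MAX


unknowns, nonzeros, exceeds = index_budget(98, 169, 169)
print(unknowns, nonzeros, exceeds)  # 2798978 19592846 False
```

Under these assumptions the grid is roughly two orders of magnitude below the 32-bit limit, so index overflow looks unlikely at this resolution, although a much finer grid could change that.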

-------------- next part --------------
[wtay at n12-72:1]$ valgrind ./a.out
==39723== Memcheck, a memory error detector
==39723== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==39723== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==39723== Command: ./a.out
==39723== 
==39723== Conditional jump or move depends on uninitialised value(s)
==39723==    at 0x4018B16: index (in /usr/lib64/ld-2.17.so)
==39723==    by 0x40079D3: expand_dynamic_string_token (in /usr/lib64/ld-2.17.so)
==39723==    by 0x40084F5: _dl_map_object (in /usr/lib64/ld-2.17.so)
==39723==    by 0x400160D: map_doit (in /usr/lib64/ld-2.17.so)
==39723==    by 0x400F2F3: _dl_catch_error (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4000E8D: do_preload (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004296: dl_main (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4016804: _dl_sysdep_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004DA3: _dl_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4001427: ??? (in /usr/lib64/ld-2.17.so)
==39723== 
==39723== Conditional jump or move depends on uninitialised value(s)
==39723==    at 0x4018B1B: index (in /usr/lib64/ld-2.17.so)
==39723==    by 0x40079D3: expand_dynamic_string_token (in /usr/lib64/ld-2.17.so)
==39723==    by 0x40084F5: _dl_map_object (in /usr/lib64/ld-2.17.so)
==39723==    by 0x400160D: map_doit (in /usr/lib64/ld-2.17.so)
==39723==    by 0x400F2F3: _dl_catch_error (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4000E8D: do_preload (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004296: dl_main (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4016804: _dl_sysdep_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004DA3: _dl_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4001427: ??? (in /usr/lib64/ld-2.17.so)
==39723== 
==39723== Conditional jump or move depends on uninitialised value(s)
==39723==    at 0x4018B27: index (in /usr/lib64/ld-2.17.so)
==39723==    by 0x40079D3: expand_dynamic_string_token (in /usr/lib64/ld-2.17.so)
==39723==    by 0x40084F5: _dl_map_object (in /usr/lib64/ld-2.17.so)
==39723==    by 0x400160D: map_doit (in /usr/lib64/ld-2.17.so)
==39723==    by 0x400F2F3: _dl_catch_error (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4000E8D: do_preload (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004296: dl_main (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4016804: _dl_sysdep_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004DA3: _dl_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4001427: ??? (in /usr/lib64/ld-2.17.so)
==39723== 
vex amd64->IR: unhandled instruction bytes: 0x66 0xF 0x1B 0x4 0x24 0x66 0xF 0x1B
==39723== valgrind: Unrecognised instruction at address 0x40152b7.
==39723==    at 0x40152B7: _dl_runtime_resolve (in /usr/lib64/ld-2.17.so)
==39723==    by 0x7FF82F8: __exp_finite (in /usr/lib64/libm-2.17.so)
==39723==    by 0x400B8BA: _dl_relocate_object (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4003AC9: dl_main (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4016804: _dl_sysdep_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004DA3: _dl_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4001427: ??? (in /usr/lib64/ld-2.17.so)
==39723== Your program just tried to execute an instruction that Valgrind
==39723== did not recognise.  There are two possible reasons for this.
==39723== 1. Your program has a bug and erroneously jumped to a non-code
==39723==    location.  If you are running Memcheck and you just saw a
==39723==    warning about a bad jump, it's probably your program's fault.
==39723== 2. The instruction is legitimate but Valgrind doesn't handle it,
==39723==    i.e. it's Valgrind's fault.  If you think this is the case or
==39723==    you are not sure, please let us know and we'll try to fix it.
==39723== Either way, Valgrind will now raise a SIGILL signal which will
==39723== probably kill your program.
==39723== 
==39723== Process terminating with default action of signal 4 (SIGILL)
==39723==  Illegal opcode at address 0x40152B7
==39723==    at 0x40152B7: _dl_runtime_resolve (in /usr/lib64/ld-2.17.so)
==39723==    by 0x7FF82F8: __exp_finite (in /usr/lib64/libm-2.17.so)
==39723==    by 0x400B8BA: _dl_relocate_object (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4003AC9: dl_main (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4016804: _dl_sysdep_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004DA3: _dl_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4001427: ??? (in /usr/lib64/ld-2.17.so)
==39723== Jump to the invalid address stated on the next line
==39723==    at 0x486: ???
==39723==    by 0x92968EF: ??? (in /usr/lib64/libc-2.17.so)
==39723==    by 0x4003AC9: dl_main (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4016804: _dl_sysdep_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004DA3: _dl_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4001427: ??? (in /usr/lib64/ld-2.17.so)
==39723==  Address 0x486 is not stack'd, malloc'd or (recently) free'd
==39723== 
==39723== 
==39723== Process terminating with default action of signal 11 (SIGSEGV)
==39723==  Bad permissions for mapped region at address 0x486
==39723==    at 0x486: ???
==39723==    by 0x92968EF: ??? (in /usr/lib64/libc-2.17.so)
==39723==    by 0x4003AC9: dl_main (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4016804: _dl_sysdep_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004DA3: _dl_start (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4001427: ??? (in /usr/lib64/ld-2.17.so)
==39723== 
==39723== HEAP SUMMARY:
==39723==     in use at exit: 0 bytes in 0 blocks
==39723==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==39723== 
==39723== All heap blocks were freed -- no leaks are possible
==39723== 
==39723== For counts of detected and suppressed errors, rerun with: -v
==39723== Use --track-origins=yes to see where uninitialised values come from
==39723== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)
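
A log like the one above can be triaged by checking whether any report has a stack frame outside the system libraries. The sketch below groups "==PID==" lines into reports and flags those that reach non-system code. The names parse_reports, frame_origin and touches_user_code are hypothetical, and the list of system prefixes is an assumption about where system libraries are installed. The sample is a shortened excerpt of the log above.

```python
import re

LINE = re.compile(r"^==\d+==(.*)$")        # valgrind-prefixed line
OBJ = re.compile(r"\(in ([^)]+)\)\s*$")    # trailing "(in /path/to/object)"
SYSTEM_PREFIXES = ("/usr/lib", "/lib")     # assumption: system library dirs


def frame_origin(frame):
    """Return the object path of a frame, None if unknown ("???")."""
    m = OBJ.search(frame)
    if m:
        return m.group(1)
    # Frames printed as (file:line) carry no object path; treat them as
    # possibly user code rather than silently ignoring them.
    return None if frame.endswith("???") else "(source-located)"


def parse_reports(log_text):
    """Return [(title, [frame origins])] for every report that has a stack."""
    reports, current = [], None

    def flush():
        nonlocal current
        if current and current[1]:
            reports.append(current)
        current = None

    for raw in log_text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue  # program output, shell prompt, vex messages
        rest = m.group(1)
        text = rest[1:] if rest.startswith(" ") else rest
        stripped = text.strip()
        if not stripped:
            flush()  # a blank valgrind line ends the current report
        elif stripped.startswith(("at 0x", "by 0x")):
            if current is not None:
                current[1].append(frame_origin(stripped))
        elif current is None or (current[1] and not text.startswith(" ")):
            flush()  # an unindented message after a stack starts a new report
            current = (stripped, [])
    flush()
    return reports


def touches_user_code(origins):
    """True if any known frame lies outside the assumed system prefixes."""
    return any(o is not None and not o.startswith(SYSTEM_PREFIXES)
               for o in origins)


SAMPLE = """\
==39723== Conditional jump or move depends on uninitialised value(s)
==39723==    at 0x4018B16: index (in /usr/lib64/ld-2.17.so)
==39723==    by 0x4004DA3: _dl_start (in /usr/lib64/ld-2.17.so)
==39723==
==39723== valgrind: Unrecognised instruction at address 0x40152b7.
==39723==    at 0x40152B7: _dl_runtime_resolve (in /usr/lib64/ld-2.17.so)
==39723==    by 0x7FF82F8: __exp_finite (in /usr/lib64/libm-2.17.so)
==39723== Your program just tried to execute an instruction that Valgrind
==39723== did not recognise.  There are two possible reasons for this.
==39723==
"""

for title, origins in parse_reports(SAMPLE):
    tag = "user code" if touches_user_code(origins) else "system libraries only"
    print(f"{tag}: {title}")
```

In this excerpt every frame sits in ld-2.17.so or libm-2.17.so, which suggests the run stopped inside the dynamic loader before reaching a.out, so the log may say more about the valgrind setup than about the application itself.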

