[petsc-users] run successfully on 20 processors but failed on 24 processors

Xiangdong Liang xdliang at gmail.com
Wed Dec 14 16:07:12 CST 2011


I called MatNorm and VecNorm on A and b before calling KSPSolve, and
both norms are finite. I then used fp_trap to find where the NaN comes
from. However, the trace leads to an unlikely place, pthread_join. The
gdb "where" output is given below. Can you give me some help? Thanks.

#0  0x00007faeb6d8cbe5 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007faeb8ab7d6b in sopalin_launch_thread (procnum=18, procnbr=24,
    ptr=0xffffffff, calc_thrdnbr=1,
    calc_routine=0x7faeb8a7c9f3 <Z_Ugmres_smp>, calc_data=0xfe78f0,
    comm_thrdnbr=0, comm_routine=0x7faeb8a82e4c <Z_Usopalin_updo_comm>,
    comm_data=0xfe78f0, ooc_thrdnbr=0, ooc_routine=0, ooc_data=0xfe78f0)
    at sopalin/src/sopalin_thread.c:235
#2  0x00007faeb8a81e52 in Z_Ugmres_thread (datacode=0x11c8390,
    sopaparam=0x11c8520) at sopalin/src/raff.c:1174
#3  0x00007faeb8a53993 in Z_pastix_task_raff (pastix_data=0x11c8390,
    pastix_comm=0x1e02690, n=210000, b=0x2528ec0, rhsnbr=1, loc2glob=0x0)
    at sopalin/src/pastix.c:3581
#4  0x00007faeb8a54a0e in z_pastix (pastix_data=0x20defc0,
    pastix_comm=0x1e02690, n=210000, colptr=0x245b610, row=0x7faea68b7760,
    avals=0x7faea4005760, perm=0x238dd60, invp=0x20e1080, b=0x2528ec0, rhs=1,
    iparm=0x20df004, dparm=0x20df108) at sopalin/src/pastix.c:4262
#5  0x00007faeb83b7fd5 in MatSolve_PaStiX (A=0x203e280, b=0x12140d0,
    x=0x1f1b430)
    at /home/xdliang/MyLocal/petsc-dev/src/mat/impls/aij/mpi/pastix/pastix.c:328
#6  0x00007faeb7b51d7c in MatSolve (mat=0x203e280, b=0x12140d0, x=0x1f1b430)
    at /home/xdliang/MyLocal/petsc-dev/src/mat/interface/matrix.c:3106
#7  0x00007faeb8540f0e in PCApply_LU (pc=0x1de4d20, x=0x12140d0, y=0x1f1b430)
    at /home/xdliang/MyLocal/petsc-dev/src/ksp/pc/impls/factor/lu/lu.c:204
#8  0x00007faeb85df50b in PCApply (pc=0x1de4d20, x=0x12140d0, y=0x1f1b430)
    at /home/xdliang/MyLocal/petsc-dev/src/ksp/pc/interface/precon.c:383
#9  0x00007faeb863c660 in KSPSolve_PREONLY (ksp=0x1e37590)
    at /home/xdliang/MyLocal/petsc-dev/src/ksp/ksp/impls/preonly/preonly.c:26
#10 0x00007faeb86707fe in KSPSolve (ksp=0x1e37590, b=0x12140d0, x=0x1f1b430)
    at /home/xdliang/MyLocal/petsc-dev/src/ksp/ksp/interface/itfunc.c:429
#11 0x000000000040a213 in EigenSolver_cmplx (data=0x7fffba7529b0, Linear=1,
    Eig=0, maxeigit=10) at EigenSolver_cmplx.c:66
#12 0x0000000000408ddf in main (argc=77, argv=0x7fffba754df8)
    at mldos_cmplx.c:322

Here is the error information:

[18]PETSC ERROR: *** unknown floating point error occurred ***
[18]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
[18]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
[18]PETSC ERROR: where the result is a bitwise OR of the following flags:
[18]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
[18]PETSC ERROR: Try option -start_in_debugger
[18]PETSC ERROR: likely location of problem given in stack below
[18]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[18]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[18]PETSC ERROR:       INSTEAD the line number of the start of the function
[18]PETSC ERROR:       is given.
[18]PETSC ERROR: [18] PetscDefaultFPTrap line 342 /home/xdliang/MyLocal/petsc-dev/src/sys/error/fp.c
[18]PETSC ERROR: [18] MatSolve_PaStiX line 303 /home/xdliang/MyLocal/petsc-dev/src/mat/impls/aij/mpi/pastix/pastix.c
[18]PETSC ERROR: [18] MatSolve line 3089 /home/xdliang/MyLocal/petsc-dev/src/mat/interface/matrix.c
[18]PETSC ERROR: [18] PCApply_LU line 202 /home/xdliang/MyLocal/petsc-dev/src/ksp/pc/impls/factor/lu/lu.c
[18]PETSC ERROR: [18] PCApply line 373 /home/xdliang/MyLocal/petsc-dev/src/ksp/pc/interface/precon.c
[18]PETSC ERROR: [18] KSPSolve_PREONLY line 19 /home/xdliang/MyLocal/petsc-dev/src/ksp/ksp/impls/preonly/preonly.c
[18]PETSC ERROR: [18] KSPSolve line 334 /home/xdliang/MyLocal/petsc-dev/src/ksp/ksp/interface/itfunc.c
[18]PETSC ERROR: User provided function() line 0 in Unknown directoryUnknown file trapped floating point error





On Tue, Dec 13, 2011 at 11:59 PM, Matthew Knepley <knepley at gmail.com> wrote:
> On Tue, Dec 13, 2011 at 9:49 PM, Xiangdong Liang <xdliang at gmail.com> wrote:
>>
>> Hello everyone,
>>
>> I can solve a complex Ax=b with PaStiX successfully on 20 processors,
>> but the solve fails on 24 processors. The relative error reported by
>> mat_pastix_verbose becomes "nan" on 24 processors. What could be
>> wrong? Can someone give me some hints on how to debug this? Thanks.
>
>
> First, make sure you did not put any NaNs in your matrix or rhs.
>
>    Matt
>
>>
>>
>> Xiangdong
>
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener

