[petsc-users] run successfully on 20 processors but failed on 24 processors

Matthew Knepley knepley at gmail.com
Wed Dec 14 16:12:03 CST 2011


On Wed, Dec 14, 2011 at 4:07 PM, Xiangdong Liang <xdliang at gmail.com> wrote:

> I used MatNorm and VecNorm on A and b before calling KSPSolve, and both
> norms are finite. I then used fp_trap to find where the NaN comes
> from. However, the trace leads to an unlikely place, pthread_join. The gdb
> "where" output is given below. Can you give me some help? Thanks.
>
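The reasoning behind the norm check can be sketched in plain C: a NaN or Inf in any entry propagates through the sum of squares, so a finite norm means every entry was finite. This is a standalone illustration on a real array, not PETSc code; the helper names sum_of_squares and norm_is_finite are hypothetical, and in PETSc the equivalent check is the MatNorm/VecNorm test described above.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: sum of squares of a real array (the square of
   its 2-norm). Any NaN or Inf entry makes the result non-finite. */
double sum_of_squares(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i] * v[i];
    return sum;
}

/* Returns 1 if the squared norm is finite, 0 otherwise. A result of 0
   means some entry is NaN or Inf, or the sum itself overflowed. */
int norm_is_finite(const double *v, size_t n)
{
    return isfinite(sum_of_squares(v, n)) ? 1 : 0;
}
```

A finite result on both the matrix values and the right-hand side rules out bad input, which is why the NaN must be produced later, inside the solve.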

Great. This is very helpful. It seems quite clear that this is a PaStiX
problem. I would submit this to the PaStiX development list.

  Thanks,

     Matt


> #0  0x00007faeb6d8cbe5 in pthread_join () from /lib64/libpthread.so.0
> #1  0x00007faeb8ab7d6b in sopalin_launch_thread (procnum=18, procnbr=24,
>    ptr=0xffffffff, calc_thrdnbr=1,
>    calc_routine=0x7faeb8a7c9f3 <Z_Ugmres_smp>, calc_data=0xfe78f0,
>    comm_thrdnbr=0, comm_routine=0x7faeb8a82e4c <Z_Usopalin_updo_comm>,
>    comm_data=0xfe78f0, ooc_thrdnbr=0, ooc_routine=0, ooc_data=0xfe78f0)
>    at sopalin/src/sopalin_thread.c:235
> #2  0x00007faeb8a81e52 in Z_Ugmres_thread (datacode=0x11c8390,
>    sopaparam=0x11c8520) at sopalin/src/raff.c:1174
> #3  0x00007faeb8a53993 in Z_pastix_task_raff (pastix_data=0x11c8390,
>    pastix_comm=0x1e02690, n=210000, b=0x2528ec0, rhsnbr=1, loc2glob=0x0)
>    at sopalin/src/pastix.c:3581
> #4  0x00007faeb8a54a0e in z_pastix (pastix_data=0x20defc0,
>    pastix_comm=0x1e02690, n=210000, colptr=0x245b610, row=0x7faea68b7760,
>    avals=0x7faea4005760, perm=0x238dd60, invp=0x20e1080, b=0x2528ec0,
> rhs=1,
>    iparm=0x20df004, dparm=0x20df108) at sopalin/src/pastix.c:4262
> #5  0x00007faeb83b7fd5 in MatSolve_PaStiX (A=0x203e280, b=0x12140d0,
>    x=0x1f1b430)
>    at
> /home/xdliang/MyLocal/petsc-dev/src/mat/impls/aij/mpi/pastix/pastix.c:328
> #6  0x00007faeb7b51d7c in MatSolve (mat=0x203e280, b=0x12140d0,
> x=0x1f1b430)
>    at /home/xdliang/MyLocal/petsc-dev/src/mat/interface/matrix.c:3106
> #7  0x00007faeb8540f0e in PCApply_LU (pc=0x1de4d20, x=0x12140d0,
> y=0x1f1b430)
>    at /home/xdliang/MyLocal/petsc-dev/src/ksp/pc/impls/factor/lu/lu.c:204
> #8  0x00007faeb85df50b in PCApply (pc=0x1de4d20, x=0x12140d0, y=0x1f1b430)
>    at /home/xdliang/MyLocal/petsc-dev/src/ksp/pc/interface/precon.c:383
> #9  0x00007faeb863c660 in KSPSolve_PREONLY (ksp=0x1e37590)
>    at
> /home/xdliang/MyLocal/petsc-dev/src/ksp/ksp/impls/preonly/preonly.c:26
> #10 0x00007faeb86707fe in KSPSolve (ksp=0x1e37590, b=0x12140d0,
> x=0x1f1b430)
>    at /home/xdliang/MyLocal/petsc-dev/src/ksp/ksp/interface/itfunc.c:429
> #11 0x000000000040a213 in EigenSolver_cmplx (data=0x7fffba7529b0, Linear=1,
>    Eig=0, maxeigit=10) at EigenSolver_cmplx.c:66
> #12 0x0000000000408ddf in main (argc=77, argv=0x7fffba754df8)
>    at mldos_cmplx.c:322
>
> Here is the error information:
>
> [18]PETSC ERROR: *** unknown floating point error occurred ***
> [18]PETSC ERROR: The specific exception can be determined by running in a debugger.  When the
> [18]PETSC ERROR: debugger traps the signal, the exception can be found with fetestexcept(0x3d)
> [18]PETSC ERROR: where the result is a bitwise OR of the following flags:
> [18]PETSC ERROR: FE_INVALID=0x1 FE_DIVBYZERO=0x4 FE_OVERFLOW=0x8 FE_UNDERFLOW=0x10 FE_INEXACT=0x20
> [18]PETSC ERROR: Try option -start_in_debugger
> [18]PETSC ERROR: likely location of problem given in stack below
> [18]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [18]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
> [18]PETSC ERROR:       INSTEAD the line number of the start of the function
> [18]PETSC ERROR:       is given.
> [18]PETSC ERROR: [18] PetscDefaultFPTrap line 342 /home/xdliang/MyLocal/petsc-dev/src/sys/error/fp.c
> [18]PETSC ERROR: [18] MatSolve_PaStiX line 303 /home/xdliang/MyLocal/petsc-dev/src/mat/impls/aij/mpi/pastix/pastix.c
> [18]PETSC ERROR: [18] MatSolve line 3089 /home/xdliang/MyLocal/petsc-dev/src/mat/interface/matrix.c
> [18]PETSC ERROR: [18] PCApply_LU line 202 /home/xdliang/MyLocal/petsc-dev/src/ksp/pc/impls/factor/lu/lu.c
> [18]PETSC ERROR: [18] PCApply line 373 /home/xdliang/MyLocal/petsc-dev/src/ksp/pc/interface/precon.c
> [18]PETSC ERROR: [18] KSPSolve_PREONLY line 19 /home/xdliang/MyLocal/petsc-dev/src/ksp/ksp/impls/preonly/preonly.c
> [18]PETSC ERROR: [18] KSPSolve line 334 /home/xdliang/MyLocal/petsc-dev/src/ksp/ksp/interface/itfunc.c
> [18]PETSC ERROR: User provided function() line 0 in Unknown directoryUnknown file trapped floating point error
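
The error message above says the trapped exception can be identified from the flags returned by fetestexcept. A minimal standalone sketch of decoding such a mask follows, assuming a C99 <fenv.h> that defines all five standard flags; the helper describe_fp_flags is hypothetical. The numeric values in the log (for example FE_INVALID=0x1) are platform-specific, so the sketch tests the macros rather than hard-coded numbers.

```c
#include <fenv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the names of the floating-point exception
   flags set in `mask` into `buf` (capacity `len`), each followed by a
   space. Output is truncated if `buf` is too small. */
void describe_fp_flags(int mask, char *buf, size_t len)
{
    if (len == 0)
        return;
    buf[0] = '\0';
    if (mask & FE_INVALID)
        strncat(buf, "FE_INVALID ", len - strlen(buf) - 1);
    if (mask & FE_DIVBYZERO)
        strncat(buf, "FE_DIVBYZERO ", len - strlen(buf) - 1);
    if (mask & FE_OVERFLOW)
        strncat(buf, "FE_OVERFLOW ", len - strlen(buf) - 1);
    if (mask & FE_UNDERFLOW)
        strncat(buf, "FE_UNDERFLOW ", len - strlen(buf) - 1);
    if (mask & FE_INEXACT)
        strncat(buf, "FE_INEXACT ", len - strlen(buf) - 1);
}
```

Once the debugger has stopped at the trap, the value of fetestexcept(FE_ALL_EXCEPT) could be passed to a helper like this to see which exception fired; FE_INVALID is the flag normally associated with NaN-producing operations such as 0/0.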
>
>
>
>
>
> On Tue, Dec 13, 2011 at 11:59 PM, Matthew Knepley <knepley at gmail.com> wrote:
> > On Tue, Dec 13, 2011 at 9:49 PM, Xiangdong Liang <xdliang at gmail.com> wrote:
> >>
> >> Hello everyone,
> >>
> >> I am solving a complex Ax=b with PaStiX. It succeeds on 20 processors
> >> but fails on 24 processors. The relative error reported by
> >> mat_pastix_verbose becomes "nan" on 24 processors. What could be
> >> wrong? Can someone give me some hints on how to debug this? Thanks.
> >
> >
> > First, make sure you did not put any NaNs in your matrix or rhs.
> >
> >    Matt
> >
> >>
> >>
> >> Xiangdong
> >
> > --
> > What most experimenters take for granted before they begin their
> > experiments is infinitely more interesting than any results to which
> > their experiments lead.
> > -- Norbert Wiener
>


