Problems porting code to an IBM power5+ machine

Tue Mar 20 08:45:16 CDT 2007

I would point out that PETSc does this automatically with -start_in_debugger.
By default it attaches the debugger to all processes, but you can attach to a
subset using -debugger_nodes 0,2,5 or some other choice of processes.

   Matt

On 3/20/07, Shaman Mahmoudi <shma7099 at student.uu.se> wrote:
> Hi,
>
> This might sound crude, but this is how I debugged when I couldn't
> get my debuggers for parallel code working.
>
> What I did was add a blocking call such as getc in C right after I had
> initialized the processes. I ran my program as usual, and of course
> it halted, waiting for input (getc) from stdin. Then I looked up the
> process IDs and attached the regular serial debugger to one of those
> processes, in my case the one running on CPU 0, and debugged just
> that one. Finally I pressed Enter in my program, which was still
> waiting on stdin, and it continued running and eventually caught
> the error.
>
> Better than no debugging at all.
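
The wait-then-attach trick described above can be sketched in plain C roughly
as follows. This is only an illustration under assumptions: the helper name
wait_for_attach is made up, getpid() needs a POSIX system, and whether stdin
reaches a given rank depends on the MPI launcher (many forward it only to
rank 0).

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>   /* getpid (POSIX) */

/* Hypothetical helper: print this process's ID, then block until a
 * character arrives on `in`, giving time to attach a serial debugger.
 * Returns 0 once input was read, -1 on end-of-file or read error. */
int wait_for_attach(int rank, FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    fprintf(out, "rank %d has PID %ld; attach a debugger (e.g. gdb -p %ld), "
                 "then press Enter\n",
            rank, (long)getpid(), (long)getpid());
    fflush(out);                       /* make sure the PID is visible now */
    return (getc(in) == EOF) ? -1 : 0; /* blocks here until input arrives */
}
```

One would call something like wait_for_attach(rank, stdin, stderr) right after
MPI or PETSc has been initialized, on the rank to be debugged only.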
>
> With best regards, Shaman Mahmoudi
>
> On Mar 20, 2007, at 2:03 PM, Matthew Knepley wrote:
>
> > On 3/20/07, Knut Erik Teigen <knutert at stud.ntnu.no> wrote:
> >> On Mon, 2007-03-19 at 08:39 -0500, Matthew Knepley wrote:
> >> >   This smells like a memory overwrite or use of an uninitialized
> >> > variable. The initial norm is 1e28.
> >> >
> >> > 1) Try Jacobi instead of ICC to see if it is localized to the PC
> >>
> >> With jacobi I get:
> >> [0] PCSetUpSetting up new PC
> >>   0 KSP Residual norm 1.466015468749e+13
> >>   1 KSP Residual norm 2.720083022075e+23
> >> [0] KSPDefaultConvergedLinear solver is diverging. Initial right hand size norm 1.46602e+13, current residual norm 2.72008e+23 at iteration 1
> >>
> >> And without a preconditioner at all:
> >>   0 KSP Residual norm 7.719804763794e+00
> >>   1 KSP Residual norm 1.137387752533e+01
> >> [0] KSPSolve_CGdiverging due to indefinite or negative definite matrix
> >>
> >> The matrix definitely isn't indefinite or negative definite, as is clear
> >> from the output in my previous post.
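
As a rough cross-check of the unpreconditioned numbers above: with a zero
initial guess the initial residual is just the right hand side, so the norm
reported at iteration 0 would be expected to equal the 2-norm of rhs. The
sketch below computes the squared 2-norm of the rhs printed for the 3x3 grid.
The name sum_of_squares is made up for illustration, and it is an assumption
that the run without a preconditioner used the same small system as the first
post.

```c
#include <assert.h>
#include <stddef.h>

/* Squared 2-norm of v (kept squared so the example needs no libm). */
double sum_of_squares(const double *v, size_t n)
{
    double s = 0.0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        s += v[i] * v[i];
    return s;
}

/* Right hand side as printed for the 3x3 grid ("-0" entries taken as 0). */
static const double rhs[9] = {980, 980, 980, 0, 0, 0, -980, -980, -980};
```

sum_of_squares(rhs, 9) is 6 * 980^2 = 5762400, which corresponds to a norm of
about 2400.5, not the 7.72 shown at iteration 0. Under the assumptions above,
that would point at the vector seen by the solver differing from the one that
was printed.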
> >>
> >> > 2) Run with valgrind or something similar to check for a memory
> >> > overwrite
> >> >
> >> > 3) Maybe insert CHKMEMQ statements into the code
> >>
> >> I've run with CHKMEMQ statements, and with -malloc_debug, but didn't get
> >> any complaints.
> >> Valgrind unfortunately isn't installed on the cluster.
> >
> > Then I am afraid it will be a slow, painful process. I am almost positive
> > that you are overwriting memory somewhere. It is different on the two
> > machines because the initial layout is different. The huge norm can be
> > traced back to a huge Vec element, which can be localized. You will just
> > have to go through the code methodically with the debugger to find it. Or
> > install valgrind.
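
One way to localize the huge Vec element is to scan the raw values after each
stage of the computation and stop at the first entry that is NaN or
implausibly large. A minimal sketch in plain C follows; find_suspect_entry and
its limit parameter are hypothetical names, not part of PETSc. With PETSc the
raw array could be obtained with VecGetArray() and handed to a helper like
this one.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first entry that is NaN or whose magnitude
 * exceeds `limit`, or -1 if every entry looks sane. */
long find_suspect_entry(const double *v, size_t n, double limit)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        int is_nan = (v[i] != v[i]);   /* NaN compares unequal to itself */
        if (is_nan || v[i] > limit || v[i] < -limit)
            return (long)i;
    }
    return -1;
}
```

Calling it after each assembly step, with a limit well above the expected
magnitudes (the rhs entries here are around 980), narrows down where the bad
value first appears.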
> >
> >   Matt
> >
> >> -Knut Erik-
> >> >
> >> >   Thanks,
> >> >
> >> >     Matt
> >> >
> >> > On 3/19/07, Knut Erik Teigen <knutert at stud.ntnu.no> wrote:
> >> > > Hello
> >> > >
> >> > > I have working code on my local machine (Pentium 4), but when I
> >> > > try to run the code on a power5+ machine, the equation solver won't
> >> > > converge. It seems to calculate the wrong right hand side norm.
> >> > > Below is the result with the run-time options
> >> > > "-ksp_type cg -pc_type icc -ksp_monitor -ksp_view -info",
> >> > > first with the code running on the power5+ machine, then on my local
> >> > > machine. I've also printed the right hand side, Jacobian matrix and
> >> > > solution for a small 3x3 grid.
> >> > >
> >> > > Can anyone help me figure out what's wrong?
> >> > >
> >> > > Regards,
> >> > > Knut Erik Teigen
> >> > >
> >> > > Code running on Power5+:
> >> > > rhs:
> >> > > 980
> >> > > 980
> >> > > 980
> >> > > -0
> >> > > -0
> >> > > -0
> >> > > -980
> >> > > -980
> >> > > -980
> >> > > jacobian:
> >> > > row 0: (0, 20)  (1, -10)  (3, -10)
> >> > > row 1: (0, -10)  (1, 30)  (2, -10)  (4, -10)
> >> > > row 2: (1, -10)  (2, 20)  (5, -10)
> >> > > row 3: (0, -10)  (3, 30)  (4, -10)  (6, -10)
> >> > > row 4: (1, -10)  (3, -10)  (4, 40)  (5, -10)  (7, -10)
> >> > > row 5: (2, -10)  (4, -10)  (5, 30)  (8, -10)
> >> > > row 6: (3, -10)  (6, 20)  (7, -10)
> >> > > row 7: (4, -10)  (6, -10)  (7, 30)  (8, -10)
> >> > > row 8: (5, -10)  (7, -10)  (8, 20)
> >> > > [0] PCSetUpSetting up new PC
> >> > > [0] PetscCommDuplicateDuplicating a communicator 1 4 max tags = 1073741823
> >> > > [0] PetscCommDuplicateUsing internal PETSc communicator 1 4
> >> > > [0] PetscCommDuplicateUsing internal PETSc communicator 1 4
> >> > >   0 KSP Residual norm 7.410163701832e+28
> >> > >   1 KSP Residual norm 6.464393707520e+11
> >> > > [0] KSPDefaultConvergedLinear solver has converged. Residual norm 6.46439e+11 is less than relative tolerance 1e-07 times initial right hand side norm 7.41016e+28 at iteration 1
> >> > > KSP Object:
> >> > >   type: cg
> >> > >   maximum iterations=10000, initial guess is zero
> >> > >   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000
> >> > >   left preconditioning
> >> > > PC Object:
> >> > >   type: icc
> >> > >     ICC: 0 levels of fill
> >> > >     ICC: factor fill ratio allocated 1
> >> > >     ICC: factor fill ratio needed 0.636364
> >> > >          Factored matrix follows
> >> > >         Matrix Object:
> >> > >           type=seqsbaij, rows=9, cols=9
> >> > >           total: nonzeros=21, allocated nonzeros=21
> >> > >               block size is 1
> >> > >   linear system matrix = precond matrix:
> >> > >   Matrix Object:
> >> > >     type=seqaij, rows=9, cols=9
> >> > >     total: nonzeros=33, allocated nonzeros=45
> >> > >       not using I-node routines
> >> > > solution:
> >> > > 6.37205e-09
> >> > > 7.13167e-09
> >> > > 7.49911e-09
> >> > > -2.48277e-09
> >> > > -4.56885e-10
> >> > > 0
> >> > > 0
> >> > > 0
> >> > > 0
> >> > >
> >> > > Code running on local machine:
> >> > > rhs:
> >> > > 980
> >> > > 980
> >> > > 980
> >> > > -0
> >> > > -0
> >> > > -0
> >> > > -980
> >> > > -980
> >> > > -980
> >> > > jacobian:
> >> > > row 0: (0, 20)  (1, -10)  (3, -10)
> >> > > row 1: (0, -10)  (1, 30)  (2, -10)  (4, -10)
> >> > > row 2: (1, -10)  (2, 20)  (5, -10)
> >> > > row 3: (0, -10)  (3, 30)  (4, -10)  (6, -10)
> >> > > row 4: (1, -10)  (3, -10)  (4, 40)  (5, -10)  (7, -10)
> >> > > row 5: (2, -10)  (4, -10)  (5, 30)  (8, -10)
> >> > > row 6: (3, -10)  (6, 20)  (7, -10)
> >> > > row 7: (4, -10)  (6, -10)  (7, 30)  (8, -10)
> >> > > row 8: (5, -10)  (7, -10)  (8, 20)
> >> > > [0] PCSetUp(): Setting up new PC
> >> > > [0] PetscCommDuplicate(): Duplicating a communicator 1140850689 -2080374783 max tags = 2147483647
> >> > > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374783
> >> > > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374783
> >> > >   0 KSP Residual norm 2.505507810276e+02
> >> > >   1 KSP Residual norm 3.596555656581e+01
> >> > >   2 KSP Residual norm 2.632672485513e+00
> >> > >   3 KSP Residual norm 1.888285055287e-01
> >> > >   4 KSP Residual norm 7.029433008806e-03
> >> > >   5 KSP Residual norm 3.635267067420e-14
> >> > > [0] KSPDefaultConverged(): Linear solver has converged. Residual norm 3.63527e-14 is less than relative tolerance 1e-07 times initial right hand side norm 250.551 at iteration 5
> >> > > KSP Object:
> >> > >   type: cg
> >> > >   maximum iterations=10000, initial guess is zero
> >> > >   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000
> >> > >   left preconditioning
> >> > > PC Object:
> >> > >   type: icc
> >> > >     ICC: 0 levels of fill
> >> > >     ICC: factor fill ratio allocated 1
> >> > >     ICC: factor fill ratio needed 0.636364
> >> > >          Factored matrix follows
> >> > >         Matrix Object:
> >> > >           type=seqsbaij, rows=9, cols=9
> >> > >           total: nonzeros=21, allocated nonzeros=21
> >> > >               block size is 1
> >> > >   linear system matrix = precond matrix:
> >> > >   Matrix Object:
> >> > >     type=seqaij, rows=9, cols=9
> >> > >     total: nonzeros=33, allocated nonzeros=45
> >> > >       not using I-node routines
> >> > > solution:
> >> > > 92.3023
> >> > > 92.3023
> >> > > 92.3023
> >> > > -5.69767
> >> > > -5.69767
> >> > > -5.69767
> >> > > -103.698
> >> > > -103.698
> >> > > -103.698
> >> > >
>
>

-- 
One trouble is that despite this system, anyone who reads journals widely
and critically is forced to realize that there are scarcely any bars to eventual
publication. There seems to be no study too fragmented, no hypothesis too
trivial, no literature citation too biased or too egotistical, no design too
warped, no methodology too bungled, no presentation of results too
inaccurate, too obscure, and too contradictory, no analysis too self-serving,
no argument too circular, no conclusions too trifling or too unjustified, and
no grammar and syntax too offensive for a paper to end up in print. --
Drummond Rennie