[MPICH2-dev] mpiexec with gdb

Ralph Butler rbutler at mtsu.edu
Wed Oct 11 11:21:52 CDT 2006


I can't say.  The problem with gdb is that it produces different output on
different platforms, even on different installations of Linux.  Our program
attempts to parse that output and (within certain bounds) allow for the
differences.  This is done in mpdgdbdrv.py.

Some folks have dived into the parser code and hacked it to print exactly
what their particular gdb is producing, in order to work out how to make
mpdgdbdrv respond correctly.  I am not advocating that you do that, because
none of us really wants to spend lots of time guessing at what some odd
flavor of gdb might produce, but it's about the only way to see for sure
what is going wrong.
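
To give a rough idea of what that kind of hack looks like, here is a minimal
sketch.  It is not code from mpdgdbdrv.py; the names PROMPT_RE,
ends_with_prompt and describe_chunk are made up for illustration, and the
real driver may detect the prompt quite differently.  It only shows the two
ideas: print the raw text with hidden characters visible, and match the
"(gdb)" prompt loosely enough to tolerate trailing whitespace.

```python
import re

# Hypothetical pattern (an assumption, not taken from mpdgdbdrv.py):
# accept "(gdb)" at the very end of the buffer, followed only by
# optional spaces, tabs, carriage returns, or newlines.
PROMPT_RE = re.compile(r"\(gdb\)[ \t\r\n]*\Z")


def ends_with_prompt(buffer):
    """Return True if the accumulated gdb output ends with a prompt."""
    return PROMPT_RE.search(buffer) is not None


def describe_chunk(chunk):
    """Return a printable form of a raw chunk, exposing hidden characters."""
    return "gdb sent: %r" % (chunk,)


if __name__ == "__main__":
    # Sample strings standing in for what different gdb builds might emit.
    samples = [
        "Breakpoint 2 at 0x804969a: file test.c, line 8.\n(gdb) ",
        "(gdb)\r\n",
        "Continuing.\n",
    ]
    for text in samples:
        print(describe_chunk(text), "prompt seen:", ends_with_prompt(text))
```

Dropping a print like describe_chunk() next to wherever the driver reads from
gdb would show exactly which characters a given gdb build sends, and that is
the information needed to decide whether the prompt matching has to be
loosened.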
--ralph

On Wed Oct 11, at 11:05AM, Florin Isaila wrote:

> Hi,
> thank you very much, Ralph.
>
> Your output is what I would have expected. But when I run gdb (or even
> ddd, the way you indicated), the program does not stop at the
> breakpoint and gdb just dies, as shown below.
> I have GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh) and mpich2-1.0.4p1.
>
> Could that be a configuration problem? Any hints about how I could
> investigate what happens? Why is the breakpoint bypassed?
>
> c1::test(10:25am) #16% mpiexec -gdb -n 1 test
> 0:  (gdb) l
> 0:  1   void test_dt() {
> 0:  2     int *i =0;
> 0:  3     *i=1;
> 0:  4   }
> 0:  5
> 0:  6   int main(int argc, char* argv[]) {
> 0:  7     MPI_Init(&argc, &argv);
> 0:  8     test_dt();
> 0:  9     MPI_Finalize();
> 0:  10    return 0;
> 0:  (gdb) b 8
> 0:  Breakpoint 2 at 0x804969a: file test.c, line 8.
> 0:  (gdb) r
>  rank 0 in job 167  c1_32771   caused collective abort of all ranks
>   exit status of rank 0: killed by signal 9
> c1::test(10:25am) #17%
>
> Thanks
> Florin
>
> On 10/10/06, Ralph Butler <rbutler at mtsu.edu> wrote:
> On Tue Oct 10, at 4:48PM, Florin Isaila wrote:
>
> > Hi,
> >
> > I am having a problem running mpiexec with gdb. I set a breakpoint
> > at a program line, but the program won't stop there if an error
> > occurs (otherwise it stops normally). The error can be a
> > segmentation fault or a call to MPI_Abort.
> >
> > This makes debugging impossible. Is the old style of starting each
> > MPI process in a separate debugging session still possible?
>
> I have tried running the pgm we see in your output in the same way
> you show and have included the output below.
> However, many folks prefer to use ddd like this:
>      mpiexec -n 2 ddd mpi_pgm
>
> This will launch 2 ddd windows on the desktop, each running mpi_pgm.
> It's pretty easy to handle around 4 processes this way.
>
> > While merging the output of several debuggers is helpful in some
> > cases, controlling each independent process is sometimes very
> > important.
> >
> > Here is the simplest example, with a forced segmentation fault. The
> > breakpoint at line 229 is ignored, even though the segmentation
> > fault occurs after it. gdb also quits without making clear
> > the source of the error.
> >
> > stallion:~/tests/mpi/dtype % mpiexec -gdb -n 1 test
> > 0:  (gdb) l 204
> >
> > 0:  204 void test_dt() {
> > 0:  205   int *i = 0;
> > 0:  206   *i = 1;
> > 0:  209}
> >
> > 0:  (gdb) l 227
> > 0:  227 int main(int argc, char* argv[]) {
> > 0:  228   MPI_Init(&argc, &argv);
> > 0:  229   test_dt();
> > 0:  230   MPI_Finalize();
> > 0:  231   return 0;
> > 0:  232 }
> >
> > 0:  (gdb) b 229
> > 0:  Breakpoint 2 at 0x8049f79: file test.c, line 229.
> > 0:  (gdb) r
> >  rank 0 in job 72   stallion.ece.northwestern.edu_42447   caused
> > collective abort of all ranks
> >   exit status of rank 0: killed by signal 9
> >
> > Many thanks
> > Florin
>
> My run of the pgm:
>
> (magpie:52) % mpiexec -gdb -n 1 temp
> 0:  (gdb) l
> 0:  1   void test_dt() {
> 0:  2       int *i = 0;
> 0:  3       *i = 1;
> 0:  4   }
> 0:  5
> 0:  6   int main(int argc, char* argv[]) {
> 0:  7       MPI_Init(&argc, &argv);
> 0:  8       test_dt();
> 0:  9       MPI_Finalize();
> 0:  10      return 0;
> 0:  (gdb) b 8
> 0:  Breakpoint 2 at 0x80495fe: file temp.c, line 8.
> 0:  (gdb) r
> 0:  Continuing.
> 0:
> 0:  Breakpoint 2, main (argc=1, argv=0xbffff3b4) at temp.c:8
> 0:  8       test_dt();
> 0:  (gdb) 0:  (gdb) s
> 0:  test_dt () at temp.c:2
> 0:  2       int *i = 0;
> 0:  (gdb) s
> 0:  3       *i = 1;
> 0:  (gdb) p *i
> 0:  Cannot access memory at address 0x0
> 0:  (gdb) p i
> 0:  $1 = (int *) 0x0
> 0:  (gdb) c
> 0:  Continuing.
> 0:
> 0:  Program received signal SIGSEGV, Segmentation fault.
> 0:  0x080495d4 in test_dt () at temp.c:3
> 0:  3       *i = 1;
> 0:  (gdb) where
> 0:  #0  0x080495d4 in test_dt () at temp.c:3
> 0:  #1  0x08049603 in main (argc=1, argv=0xbffff3b4) at temp.c:8
> 0:  (gdb) q
> rank 0 in job 2  magpie_42682   caused collective abort of all ranks
>    exit status of rank 0: killed by signal 9
> (magpie:53) %
>
>



