[mpich-discuss] seg fault

Anthony Chan chan at mcs.anl.gov
Fri May 27 09:33:04 CDT 2011


Please cc back to mpich-discuss at mcs.anl.gov.

Well, whenever there is a segfault, the most reliable way to find out
what went wrong is to use a debugger: recompile your code with -g and
MPICH2 with --enable-g=meminit,dbg, then rerun under a debugger to get a
backtrace, or use the debugger on the resulting core dump.
Sometimes valgrind may help, but it depends on the situation.
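For example, a minimal debugging workflow along these lines (a sketch only; the application name a.out and source file your_code.c are placeholders for your own program) might be:

```shell
# Rebuild MPICH2 with memory-init and debug checks
# (run in the MPICH2 source tree)
./configure --enable-g=meminit,dbg && make && make install

# Recompile your application with debug symbols
mpicc -g -o a.out your_code.c

# Run a single rank under gdb to catch the segfault directly
mpiexec -n 1 gdb -ex run -ex backtrace ./a.out

# Or run under valgrind to look for invalid memory accesses
mpiexec -n 4 valgrind ./a.out
```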

I don't recommend using an older MPICH2 like 1.2.1p1, as newer versions
always contain bug fixes and performance enhancements.  Plus, it is difficult for
us to track down bugs in older code, so it will slow down our response time.

Given that you have ifort, do you have icc?  If so, set CC=icc and CXX=icpc when
configuring MPICH2-1.3.2p1.

Also, why do you want to use --with-device=ch3:sock instead of the default nemesis
device, which has better performance on SMP boxes?
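Putting both suggestions together, a configure invocation could look like the following sketch (the install prefix is a placeholder; adjust it for your system):

```shell
# In the mpich2-1.3.2p1 source directory; prefix path is a placeholder
./configure CC=icc CXX=icpc FC=ifort \
    --prefix=$HOME/mpich2-install \
    --with-device=ch3:nemesis \
    --enable-g=meminit,dbg
make && make install
```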

A.Chan

----- Original Message -----
> I just run my code with: mpiexec -n 4 ~/yambo < /dev/null > &. In the
> log, there is only '...killed by signal 11' but no seg fault
> information.
> 
> 
> --
> 
> S.D.Wang
> 王舒东
> 
> 
> 
> At 2011-05-27,"Anthony Chan" <chan at mcs.anl.gov> wrote:
> 
> >
> >signal 11 means there is an invalid access of memory. What command
> >produces
> >the segfault?
> >
> >----- Original Message -----
> >> I gave up on 1.3.2 and installed mpich2-1.2.1p1, and everything
> >> seems OK when I test the examples. But when I compile my code with
> >> mpich2-1.2.1p1 and run it, it appears:
> >>
> >> rank 3 in job 1 n36_37374 caused collective abort of all ranks
> >> exit status of rank 3: killed by signal 11
> >> rank 2 in job 1 n36_37374 caused collective abort of all ranks
> >> exit status of rank 2: killed by signal 11
> >> rank 1 in job 1 n36_37374 caused collective abort of all ranks
> >> exit status of rank 1: killed by signal 11
> >> rank 0 in job 1 n36_37374 caused collective abort of all ranks
> >> exit status of rank 0: killed by signal 11
> >> What does this relate to?
> >> --
> >>
> >> S.D.Wang
> >> 王舒东
> >>
> >>
> >>
> >> At 2011-05-26,"Anthony Chan" <chan at mcs.anl.gov> wrote:
> >>
> >> >
> >> >I don't see anything that is obviously wrong in your output files
> >> >except you seem to be using an old gcc, 3.4.6, which we don't
> >> >have access to anymore. Could you use a newer gcc? If not,
> >> >could you rebuild your mpich2 with the extra configure option
> >> >--enable-g=meminit,dbg and see if "mpiexec -n 1 cpi" still
> >> >segfaults? If it does, try attaching a debugger to see where it
> >> >segfaults and send us the backtrace.
> >> >
> >> >A.Chan
> >> >
> >> >----- Original Message -----
> >> >> Dear developers:
> >> >> I installed mpich2-1.3.2p1, and when I execute
> >> >>
> >> >>
> >> >> [sdwang at storage mpich2-1.3.2p1]$ cd examples/
> >> >> [sdwang at storage examples]$ mpiexec -n 1 cpi
> >> >> Segmentation fault (core dumped)
> >> >> What is wrong?
> >> >> Thanks!
> >> >>
> >> >>
> >> >> and the compile files are in the attachment.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >>
> >> >> S.D.Wang
> >> >> 王舒东
> >> >> _______________________________________________
> >> >> mpich-discuss mailing list
> >> >> mpich-discuss at mcs.anl.gov
> >> >> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

