[mpich-discuss] mpi/openmp hybrid seg fault

Wed Dec 21 22:06:18 CST 2011

All, I had posted about 2 weeks ago partially solved the problem but found
that I still have issues.  A quick recap of problems.  

Hybrid mpi/openmp code.  

-          Works if run from a node (say "node1"), and only have "node1" in
the machines file, grabs the correct number of threads and runs to
completion.

-          Works if compiled (I am using ifort) WITHOUT the "-openmp"
option, just without threading.  Can run across nodes and give correct
answer but threading capability is lost.

-          Fails if run across nodes when compiled with "-openmp" . i.e.
machines file has "node1, node2, node3" and then tries to run it gets a
segmentation fault.

Since then I did some exhaustive debugging as best I could and discovered
the problem is with static arrays that are used within the code.  I have a
sample "laplace.f" file that I will attach at the end of the question which
served as my debugging program, however I have a much larger code that
manifests the same problems.  

The sample program would crash with the segmentation fault when the initial
static arrays (u and du) were declared to be size imax = jmax = 2001.  When
I made the arrays very small, imax = jmax = 10 the code ran to completion
across several nodes without issue.  In further investigation for this
problem I found the "magic" size to be imax = jmax = 720.  At this point the
code still runs to completion, however if imax = jmax = 721 the code crashes
with the segmentation fault.  Additionally, as a hunch, I converted the
program to dynamically allocate the u and du arrays as opposed to static.
When this was done the previous segmentation faults disappeared.  Imax =
jmax = 2001 worked without error, imax = jmax = 10,000 worked just fine
across nodes also.

It seems like the problem is sensitive to whatever mechanisms are employed
in static array memory management but disappear with dynamic mechanisms, but
my knowledge about computer architecture is not robust enough to know if
this makes sense or not . nor what things to check.  I thought perhaps this
has to do with some limits set within Ubuntu (11.10), but beyond ensuring an
unlimited stack size (which is present) I haven't a good idea of what things
might need to be changed.  

The fact that I need this to work for the much larger problem is why simply
changing static arrays to dynamic equivalents is not a viable solution
unfortunately.  

Help is greatly appreciated, thanks.

~Jack

      program lpmlp

      include 'mpif.h'

      include "omp_lib.h" 

      integer imax,jmax,im1,im2,jm1,jm2,it,itmax

      !parameter (imax=2001,jmax=2001)

      !parameter (im1=imax-1,im2=imax-2,jm1=jmax-1,jm2=jmax-2)

      parameter (itmax=100)

      !real*8 u(imax,jmax),du(imax,jmax)

      real*8,allocatable,dimension(:,:) :: u,du

      real*8 umax,dumax,tol,pi

      parameter (umax=10.0,tol=1.0e-6,pi=3.14159)

! Additional MPI parameters

      integer istart,iend,jstart,jend

      integer size,rank,ierr,istat(MPI_STATUS_SIZE),mpigrid,length

      integer grdrnk,dims(1),gloc(1),up,down,isize,jsize

      integer ureq,dreq

      integer ustat(MPI_STATUS_SIZE),dstat(MPI_STATUS_SIZE)

      real*8 tstart,tend,gdumax

      logical cyclic(1)

      !real*8 uibuf(imax),uobuf(imax),dibuf(imax),dobuf(imax)

      real*8,allocatable,dimension(:) :: uibuf,uobuf,dibuf,dobuf

! OpenMP parameters

      integer nthrds,nthreads      

! Initialize

      call MPI_INIT_THREAD(MPI_THREAD_FUNNELED,IMPI_prov,ierr)

      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)

      call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)

! 1D linear topology

      dims(1)=size

      cyclic(1)=.FALSE.

      call MPI_CART_CREATE(MPI_COMM_WORLD,1,dims,cyclic,.true.,mpigrid

     +     ,ierr)

      call MPI_COMM_RANK(mpigrid,grdrnk,ierr)

      call MPI_CART_COORDS(mpigrid,grdrnk,1,gloc,ierr)

      call MPI_CART_SHIFT(mpigrid,0,1,down,up,ierr)

      call sleep(180)

      !imax=10001;jmax=10001;

      imax=720;jmax=720;

      allocate(u(imax,jmax))

      allocate(du(imax,jmax))

      allocate(uibuf(imax),uobuf(imax),dibuf(imax),dobuf(imax))

      im1=imax-1; im2=imax-2; jm1=jmax-1; jm2=jmax-2;

      istart=2

      iend=imax-1

      jsize=jmax/size

      jstart=gloc(1)*jsize+1

      if (jstart.LE.1) jstart=2

      jend=(gloc(1)+1)*jsize

      if (jend.GE.jmax) jend=jmax-1

      nthrds=OMP_GET_NUM_PROCS()

      print*,"Rank=",rank,"Threads=",nthrds

      call omp_set_num_threads(nthrds)

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)

! Initialize -- done in parallel to force "first-touch" distribution

! on ccNUMA machines (i.e. O2k)

!$OMP DO

      do j=jstart-1,jend+1

         do i=istart-1,iend+1

            u(i,j)=0.0

            du(i,j)=0.0

         enddo

         u(imax,j)=umax*sin(pi*float(j-1)/float(jmax-1))

      enddo

!$OMP END DO

!$OMP END PARALLEL

! Main computation loop

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)

      tstart=MPI_WTIME()

      do it=1,itmax

! We have to keep the OpenMP and MPI calls segregated...

        call omp_set_num_threads(nthrds)

        !nthreads = OMP_GET_NUM_THREADS()

        !print*,"Jack",rank,nthreads,nthrds

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)

!$OMP MASTER

        dumax=0.0

!$OMP END MASTER

!$OMP DO REDUCTION(max:dumax)

         do j=jstart,jend

            do i=istart,iend

               !nthreads = OMP_GET_NUM_THREADS()

               !print*,"Jack",rank,nthreads,nthrds

               du(i,j)=0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))-u(i,j)

               dumax=max(dumax,abs(du(i,j)))

            enddo

         enddo

!$OMP END DO

!$OMP DO

         do j=jstart,jend

            do i=istart,iend

               u(i,j)=u(i,j)+du(i,j)

            enddo

         enddo

!$OMP END DO

!$OMP END PARALLEL

! Compute the overall residual

         call MPI_REDUCE(dumax,gdumax,1,MPI_REAL8,MPI_MAX,0

     +        ,MPI_COMM_WORLD,ierr)

! Send phase

         if (down.NE.MPI_PROC_NULL) then

            j=1

            do i=istart,iend

               dobuf(j)=u(i,jstart)

               j=j+1

            enddo

            length=j-1

            call MPI_ISEND(dobuf,length,MPI_REAL8,down,it,mpigrid,

     +           dreq,ierr)

         endif

         if (up.NE.MPI_PROC_NULL) then

            j=1

            do i=istart,iend

               uobuf(j)=u(i,jend)

               j=j+1

            enddo

            length=j-1

            call MPI_ISEND(uobuf,length,MPI_REAL8,up,it,mpigrid,

     +           ureq,ierr)

         endif

! Receive phase

         if (down.NE.MPI_PROC_NULL) then

            length=iend-istart+1

            call MPI_RECV(dibuf,length,MPI_REAL8,down,it,

     +           mpigrid,istat,ierr)

            call MPI_WAIT(dreq,dstat,ierr)

            j=1

            do i=istart,iend

               u(i,jstart-1)=dibuf(j)

               j=j+1

            enddo

         endif

         if (up.NE.MPI_PROC_NULL) then

            length=iend-istart+1

            call MPI_RECV(uibuf,length,MPI_REAL8,up,it,

     +           mpigrid,istat,ierr)

            call MPI_WAIT(ureq,ustat,ierr)

            j=1

            do i=istart,iend

               u(i,jend+1)=uibuf(j)

               j=j+1

            enddo

         endif

         write (rank+10,*) rank,it,dumax,gdumax

         if (rank.eq.0) write (1,*) it,gdumax

      enddo

      call MPI_BARRIER(MPI_COMM_WORLD,ierr)

      tend=MPI_WTIME()

      if (rank.EQ.0) then

         write(*,*) 'Calculation took ',tend-tstart,'s. on ',size,

     +        ' MPI processes'

     +        ,' with ',nthrds,' OpenMP threads per process'

      endif

      call MPI_FINALIZE(ierr)

      stop

      end

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111221/2222e98d/attachment.htm>