[mpich-discuss] Strange invalid pointer error
Rajeev Thakur
thakur at mcs.anl.gov
Tue Oct 27 23:21:44 CDT 2009
The recvcount passed to MPI_Gather should be set to the size of the
message to be received from each individual process, i.e., the same as
sendcount, not the total. The buffer into which the data is gathered,
however, must be large enough to hold the data collected from all
processes. So if sendcount=RLEN*n in MPI_Gather, recvcount must also be
RLEN*n, but the recvbuf on the root must be allocated large enough to
accommodate RLEN*n*num_processes bytes.
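
For example, applied to the ggather routine quoted below, the corrected
call could look like the following sketch (it assumes the caller
allocates buftot with at least n*nprocs real elements, i.e.
RLEN*n*nprocs bytes; nprocs is then only needed for that allocation,
not for the call itself):

subroutine ggather(buf,buftot,n)
use mpi
use precision,only: RLEN
implicit none
integer, intent(in) :: n
integer :: ierr
real, intent(in) :: buf(n)
real, intent(out) :: buftot(*)
! recvcount is the number of bytes received from each rank, not the total
call MPI_GATHER(buf,RLEN*n,MPI_BYTE,buftot,RLEN*n,MPI_BYTE,0, &
     MPI_COMM_WORLD,ierr)
end subroutine ggather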
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Thomas Ruedas
> Sent: Tuesday, October 27, 2009 10:09 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Strange invalid pointer error
>
> Rajeev Thakur wrote:
> > The buftot that is passed to MPI_Gather on the root (rank 0) needs to be
> > allocated of size RLEN*n*nprocs, where nprocs is the size of COMM_WORLD.
> > Is it that size?
> No, but the way I understand the documentation on
> http://www.mpi-forum.org/docs/mpi-11-html/node69.html#Node69
> it shouldn't be, because it says there:
> [ IN recvcount] number of elements for any single receive (integer,
> significant only at root)
> I interpret this as meaning that I should give the size of every slice
> passed to buftot from every subprocess. Is that wrong?
> For clarity: RLEN is the length in bytes of a single-precision real,
> n?tot are the dimensions of the total grid, and n?pn are the dimensions
> of the part on every single node. The point of the routine f_bindump is
> to collect the results of all subgrids from all nodes into a single big
> array on the root and write them into a file.
> If I try out what you suggest, the subroutine ggather looks as follows:
>
> subroutine ggather(buf,buftot,n)
> use mpi
> use precision,only: RLEN
> implicit none
> integer, intent(in) :: n
> integer :: nprocs,ierr
> real, intent(in) :: buf(n)
> real, intent(out) :: buftot(*)
> call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
> call MPI_GATHER(buf,RLEN*n,MPI_BYTE,buftot,RLEN*n*nprocs,MPI_BYTE,0, &
>      MPI_COMM_WORLD,ierr)
> end subroutine ggather
>
> This results in the following error:
> Backtrace of the callstack at rank 0:
> Backtrace of the callstack at rank 1:
> Backtrace of the callstack at rank 2:
> Backtrace of the callstack at rank 3:
> At [0]: stagyympi(CollChk_err_han+0xd4)[0x848a1dc]
> At [1]: stagyympi(CollChk_dtype_scatter+0x11c)[0x848b7c7]
> At [2]: stagyympi(MPI_Gather+0xb0)[0x848a36c]
> At [3]: stagyympi(mpi_gather_+0x61)[0x8489354]
> At [4]: stagyympi(ggather_+0x6a)[0x845d79c]
> At [5]: stagyympi(f_bindump_+0x2db)[0x81b7c3b]
> At [6]: stagyympi(dump_frame_+0x3407)[0x81bbc21]
> At [7]: stagyympi(MAIN__+0x13791)[0x807fee1]
> At [8]: stagyympi(main+0x42)[0x806c73a]
> At [9]: /lib/tls/libc.so.6(__libc_start_main+0xd3)[0x503de3]
> At [10]: stagyympi[0x806c671]
> At [0]: stagyympi(CollChk_err_han+0xd4)[0x848a1dc]
> Fatal error in MPI_Comm_call_errhandler:
>
> Collective Checking: GATHER (Rank 2) --> Inconsistent datatype
> signatures detected between rank 2 and rank 0.
>
>
>
> At [0]: stagyympi(CollChk_err_han+0xd4)[0x848a1dc]
> At [1]: stagyympi(CollChk_dtype_scatter+0x11c)[0x848b7c7]
> At [2]: stagyympi(MPI_Gather+0xb0)[0x848a36c]
> At [1]: stagyympi(CollChk_dtype_scatter+0x11c)[0x848b7c7]
> At [0]: stagyympi(CollChk_err_han+0xd4)[0x848a1dc]
> At [3]: stagyympi(mpi_gather_+0x61)[0x8489354]
> At [4]: stagyympi(ggather_+0x6a)[0x845d79c]
> At [5]: stagyympi(f_bindump_+0x2db)[0x81b7c3b]
> At [2]: stagyympi(MPI_Gather+0xb0)[0x848a36c]
> At [3]: stagyympi(mpi_gather_+0x61)[0x8489354]
> At [1]: stagyympi(CollChk_dtype_scatter+0x11c)[0x848b7c7]
> At [2]: stagyympi(MPI_Gather+0xb0)[0x848a36c]
> At [6]: stagyympi(dump_frame_+0x3407)[0x81bbc21]
> At [7]: stagyympi(MAIN__+0x13791)[0x807fee1]
> At [8]: stagyympi(main+0x42)[0x806c73a]
> At [4]: stagyympi(ggather_+0x6a)[0x845d79c]
> At [5]: stagyympi(f_bindump_+0x2db)[0x81b7c3b]
> At [3]: stagyympi(mpi_gather_+0x61)[0x8489354]
> At [9]: /lib/tls/libc.so.6(__libc_start_main+0xd3)[0x38ede3]
> At [10]: stagyympi[0x806c671]
> Fatal error in MPI_Comm_call_errhandler:
>
> Collective Checking: GATHER (Rank 3) --> Inconsistent datatype
> signatures detected between rank 3 and rank 0.
>
>
>
> At [6]: stagyympi(dump_frame_+0x3407)[0x81bbc21]
> At [4]: stagyympi(ggather_+0x6a)[0x845d79c]
> At [5]: stagyympi(f_bindump_+0x2db)[0x81b7c3b]
> At [7]: stagyympi(MAIN__+0x13791)[0x807fee1]
> At [8]: stagyympi(main+0x42)[0x806c73a]
> At [9]: /lib/tls/libc.so.6(__libc_start_main+0xd3)[0x7fbde3]
> At [6]: stagyympi(dump_frame_+0x3407)[0x81bbc21]
> At [7]: stagyympi(MAIN__+0x13791)[0x807fee1]
> At [10]: stagyympi[0x806c671]
> Fatal error in MPI_Comm_call_errhandler:
>
> Collective Checking: GATHER (Rank 1) --> Inconsistent datatype
> signatures detected between rank 1 and rank 0.
>
>
>
> At [8]: stagyympi(main+0x42)[0x806c73a]
> At [9]: /lib/tls/libc.so.6(__libc_start_main+0xd3)[0xb89de3]
> At [10]: stagyympi[0x806c671]
> Fatal error in MPI_Comm_call_errhandler:
>
> Collective Checking: GATHER (Rank 0) --> Inconsistent datatype
> signatures detected between rank 0 and rank 0.
>
>
>
> rank 2 in job 6 xenia_46167 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
> rank 3 in job 6 xenia_46167 caused collective abort of all ranks
> exit status of rank 3: killed by signal 9
>
>
> A similar error occurs if I calculate the size of buftot in a different
> way, without using MPI_COMM_SIZE. Evidently, I misunderstand something,
> but here I don't see why the datatype is inconsistent.
> Thomas
> --
> -----------------------------------
> Thomas Ruedas
> Department of Terrestrial Magnetism
> Carnegie Institution of Washington
> http://www.dtm.ciw.edu/users/ruedas/
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>