[mpich-discuss] problem with MPI_Get_count() for very long (but legal length) messages.
Dave Goodell
goodell at mcs.anl.gov
Fri Feb 5 14:48:35 CST 2010
FWIW, I've filed this ticket to track this: https://trac.mcs.anl.gov/projects/mpich2/ticket/1005
It's probably happening because we are storing "count" in MPI_Status
as an int. Changing it has ABI implications, so we'll have to think
about this one for a little while before just changing it.
-Dave
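
For readers following the arithmetic, here is a minimal sketch of how a byte total kept in a 32-bit int could produce the count shown in the quoted run. It is not MPICH code. The helper names `wrap_to_int32` and `reported_count` are hypothetical, and the sketch assumes the byte total wraps as a 32-bit two's-complement value before being divided by the datatype size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of MPICH: the value a 32-bit
 * two's-complement int would hold if only the low 32 bits of
 * `value` were kept. */
static int64_t wrap_to_int32(int64_t value)
{
    uint32_t low = (uint32_t)value;   /* modular conversion, well defined */
    if (low > (uint32_t)INT32_MAX)
        return (int64_t)low - ((int64_t)1 << 32);
    return (int64_t)low;
}

/* Hypothetical model of the suspected failure: the byte total is
 * wrapped to 32 bits, then divided by the element size. */
static int64_t reported_count(int64_t nelems, int64_t elem_size)
{
    assert(elem_size > 0);
    int64_t bytes = nelems * elem_size;   /* true size, held in 64 bits */
    return wrap_to_int32(bytes) / elem_size;
}
```

With 433438806 elements of 8 bytes each, the true size is 3467510448 bytes, which exceeds the 32-bit signed maximum of 2147483647. Wrapped to 32 bits it becomes -827456848, and dividing by 8 gives -103432106, the count printed in the quoted run.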
On Feb 5, 2010, at 2:40 PM, Barry Smith wrote:
>
> Rusty,
>
> Look at the code. There are no 64 bit integers! The cnt is
> 433,438,806, which is completely representable in plain old 32 bit
> ints. All the ints passed to MPI are within legal limits.
> In fact, I believe you actually send the message correctly. You only
> give the wrong answer for count. The problem occurs simply because
> the size of the datatype times the number of entries being sent is
> so large.
>
> Barry
>
> The same problem comes up in sending doubles, I just cut this code
> from where we sent long long int.
>
>
>
> On Feb 5, 2010, at 2:34 PM, Rusty Lusk wrote:
>
>> 64-bit integers too? by default?
>>
>> On Friday, Feb 5, 2010, at 2:28 PM, Barry Smith wrote:
>>
>>>
>>> #include "mpi.h"
>>> #include "stdlib.h"
>>>
>>> #undef __FUNCT__
>>> #define __FUNCT__ "main"
>>> int main(int argc,char **argv)
>>> {
>>>   int ierr;
>>>   int size,rank;
>>>   int cnt = 433438806;
>>>   MPI_Status status;
>>>   long long int *cols;
>>>
>>>   MPI_Init(&argc,&argv);
>>>   ierr = MPI_Comm_size(MPI_COMM_WORLD,&size);
>>>   ierr = MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>
>>>   cols = (long long int*) malloc(cnt*sizeof(long long));
>>>   if (rank == 0) {
>>>     ierr = MPI_Send(cols,cnt,MPI_LONG_LONG_INT,1,0,MPI_COMM_WORLD);
>>>
>>>   } else {
>>>     ierr = MPI_Recv(cols,cnt,MPI_LONG_LONG_INT,
>>>                     0,0,MPI_COMM_WORLD,&status);
>>>     ierr = MPI_Get_count(&status,MPI_LONG_LONG_INT,&cnt);
>>>     printf("count %d\n",cnt);
>>>   }
>>>   ierr = MPI_Finalize();
>>>   return 0;
>>> }
>>>
>>> crush is a 64 bit system with 64 bit pointers.
>>>
>>> crush:/usr> which mpicc
>>> /soft/apps/packages/mpich2-1.2.1-gcc/bin/mpicc
>>> crush:/usr> which mpiexec
>>> /soft/apps/packages/mpich2-1.2.1-gcc/bin/mpiexec
>>>
>>> crush:~> mpicc mpitest.c
>>> mpitest.c: In function ‘main’:
>>> mpitest.c:25: warning: incompatible implicit declaration of built-in function ‘printf’
>>> crush:~> mpiexec -n 2 a.out
>>> count -103432106
>>>
>>> I've had this problem reported to me by two completely different
>>> PETSc users so it is a real problem, not just academic. My guess
>>> is you don't use a long long int in the intermediate computations
>>> needed to get the final value for count.
>>>
>>> To cheer you up, when I run with openMPI it runs forever sucking
>>> down 100% CPU trying to send the messages :-)
>>>
>>> Barry
>>>
>>>
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>
>