[mpich-discuss] problem with MPI_Get_count() for very long (but legal length) messages.

Dave Goodell goodell at mcs.anl.gov
Fri Feb 5 14:48:35 CST 2010


FWIW, I've filed this ticket to track this: https://trac.mcs.anl.gov/projects/mpich2/ticket/1005

It's probably happening because we are storing "count" in MPI_Status  
as an int.  Changing it has ABI implications, so we'll have to think  
about this one for a little while before just changing it.

-Dave
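
The value in the report is consistent with that explanation. As a rough sketch only (this is not MPICH's actual code, and the helper names below are hypothetical), assume the received size is kept as a byte count in a 32-bit signed int and is later divided by the datatype size to produce the count:

```c
#include <stdint.h>

/* Hypothetical helper: squeeze a byte count into a 32-bit signed field,
 * the way an "int" member of a status structure would hold it.
 * Converting an out-of-range value to int32_t is implementation-defined;
 * it wraps modulo 2^32 on the usual two's-complement targets (e.g. gcc). */
int32_t stored_bytes_32(int32_t count, int32_t type_size)
{
    int64_t bytes = (int64_t)count * type_size;   /* true size in bytes */
    return (int32_t)(uint32_t)bytes;              /* truncated to 32 bits */
}

/* Count recovered from the truncated 32-bit byte field. */
int32_t count_from_32bit_bytes(int32_t count, int32_t type_size)
{
    return stored_bytes_32(count, type_size) / type_size;
}

/* Count recovered when the byte total is kept in 64 bits instead. */
int64_t count_from_64bit_bytes(int32_t count, int32_t type_size)
{
    int64_t bytes = (int64_t)count * type_size;
    return bytes / type_size;
}
```

With cnt = 433438806 and an 8-byte long long, the true size is 3,467,510,448 bytes, which is larger than INT_MAX (2,147,483,647). Truncated to 32 bits it becomes -827,456,848, and dividing by 8 gives -103432106, the value printed in the report. The 64-bit variant returns 433438806.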

On Feb 5, 2010, at 2:40 PM, Barry Smith wrote:

>
>  Rusty,
>
>    Look at the code. There are no 64 bit integers! The cnt is
> 433,438,806, which is completely representable in plain old 32 bit
> ints. All the ints passed to MPI are within legal limits.
> In fact, I believe you actually send the message correctly. You only
> give the wrong answer for count. The problem occurs simply because
> the size of the datatype times the number of entries being sent is
> so large.
>
>   Barry
>
> The same problem comes up in sending doubles; I just cut this code
> from where we sent long long int.
>
>
>
> On Feb 5, 2010, at 2:34 PM, Rusty Lusk wrote:
>
>> 64-bit integers too?  by default?
>>
>> On Friday, Feb 5, 2010, at 2:28 PM, Barry Smith wrote:
>>
>>>
>>> #include "mpi.h"
>>> #include "stdlib.h"
>>>
>>> #undef __FUNCT__
>>> #define __FUNCT__ "main"
>>> int main(int argc,char **argv)
>>> {
>>> int ierr;
>>> int    size,rank;
>>> int            cnt  = 433438806;
>>> MPI_Status     status;
>>> long long int  *cols;
>>>
>>> MPI_Init(&argc,&argv);
>>> ierr = MPI_Comm_size(MPI_COMM_WORLD,&size);
>>> ierr = MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>
>>> cols = (long long int*) malloc(cnt*sizeof(long long));
>>> if (rank == 0) {
>>>  ierr = MPI_Send(cols,cnt,MPI_LONG_LONG_INT,1,0,MPI_COMM_WORLD);
>>>
>>> } else {
>>>  ierr = MPI_Recv(cols,cnt,MPI_LONG_LONG_INT,0,0,MPI_COMM_WORLD,&status);
>>>  ierr = MPI_Get_count(&status,MPI_LONG_LONG_INT,&cnt);
>>>  printf("count %d\n",cnt);
>>> }
>>> ierr = MPI_Finalize();
>>> return 0;
>>> }
>>>
>>> crush is a 64 bit system with 64 bit pointers.
>>>
>>> crush:/usr> which mpicc
>>> /soft/apps/packages/mpich2-1.2.1-gcc/bin/mpicc
>>> crush:/usr> which mpiexec
>>> /soft/apps/packages/mpich2-1.2.1-gcc/bin/mpiexec
>>>
>>> crush:~> mpicc mpitest.c
>>> mpitest.c: In function ‘main’:
>>> mpitest.c:25: warning: incompatible implicit declaration of built-in function ‘printf’
>>> crush:~> mpiexec -n 2 a.out
>>> count -103432106
>>>
>>> I've had this problem reported to me by two completely different  
>>> PETSc users so it is a real problem, not just academic. My guess  
>>> is you don't use a long long int in the intermediate computations  
>>> needed to get the final value for count.
>>>
>>> To cheer you up, when I run with Open MPI it runs forever, sucking
>>> down 100% CPU trying to send the messages :-)
>>>
>>> Barry
>>>
>>>
>>> _______________________________________________
>>> mpich-discuss mailing list
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>


