[mpich-discuss] Persistent communications between 2 computers

Thierry Roudier thierry.roudier at kiastek.com
Fri Sep 9 13:28:53 CDT 2011


Hi,

I am still in an investigation phase, but to explain briefly: I would 
like to use MPI to exchange data (Send/Receive, in both directions) 
between 2 computers at every computation step. During a computation 
step, new data are calculated and then transmitted. The idea is then to 
repeat the same sequence of transmissions periodically between 2 
instances of the same program. That's why persistent communications 
seem well suited, and the solution I found below is a good starting 
point.
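
The per-step exchange pattern described above can be sketched with persistent requests as below. This is a minimal sketch, not the poster's actual program: NSTEPS, BUFSZ, and the stand-in computation are placeholder assumptions, and exactly 2 ranks are assumed. It needs to be launched with mpiexec on 2 processes.

```c
#include <mpi.h>

#define NSTEPS 100   /* hypothetical number of computation steps */
#define BUFSZ  4096  /* hypothetical message size */

int main(int argc, char *argv[])
{
    int rank, peer, step;
    double sbuf[BUFSZ], rbuf[BUFSZ];
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;  /* assumes exactly 2 ranks */

    /* Create the send/receive request pair once... */
    MPI_Recv_init(rbuf, BUFSZ, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Send_init(sbuf, BUFSZ, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    for (step = 0; step < NSTEPS; step++) {
        sbuf[0] = (double)step;      /* stand-in for the per-step computation */
        MPI_Startall(2, reqs);       /* ...and restart the same requests every step */
        MPI_Waitall(2, reqs, stats); /* buffers must not be touched until this returns */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}
```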

Thanks for all your help, I have something functional now.
Regards,

Thierry

On 09/09/2011 7:40 PM, Jayesh Krishna wrote:
> Hi,
>   Can you explain what you are trying to do (in your code) ?
>
> Regards,
> Jayesh
> ----- Original Message -----
> From: "Thierry Roudier"<thierry.roudier at kiastek.com>
> To: "Jayesh Krishna"<jayesh at mcs.anl.gov>
> Cc: mpich-discuss at mcs.anl.gov
> Sent: Friday, September 9, 2011 12:10:31 PM
> Subject: Re: [mpich-discuss] Persistent communications between 2 computers
>
> Hi Jayesh
>
> My answers below:
>> # Which version of MPICH2 are you using (Use 1.4.1p1 - http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)?
> No, I am using version 1.4.
>> Can you run cpi.exe (C:\Program Files\MPICH2\examples\cpi.exe) using mpiexec ?
> Yes
>> # How are you running your code (Copy-paste the mpiexec command) ?
> mpiexec -hosts 2 192.168.1.11 192.168.1.12 teststart
>> # Can you run your code on a single machine ?
> Yes... by using the command mpiexec -n 2 teststart. There is no error.
>> # ...to sock in the Windows Registry, the behavior of persistent communications is more stable ... Do you still get errors ?
> I did more tests today; the error does not depend on the channel
> parameter, and it only occurs the first time I call MPI_Waitall
> (for rank==0). To work around the problem, if the first MPI_Waitall
> fails, I simply call MPI_Waitall a second time to make sure I am
> synchronized. And it works.
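
[A minimal sketch of the retry workaround described above. It assumes the communicator's error handler has been switched to MPI_ERRORS_RETURN (see later in this thread) so that a failed MPI_Waitall returns an error code instead of aborting; reqs and stats are assumed to be set up as in the test program below.]

```c
/* Retry workaround: with MPI_ERRORS_RETURN installed, a failed
 * MPI_Waitall returns an error code and can simply be called again. */
MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

MPI_Startall(2, reqs);
if (MPI_Waitall(2, reqs, stats) != MPI_SUCCESS) {
    /* First wait failed: wait a second time to make sure both
     * requests really completed before reusing the buffers. */
    MPI_Waitall(2, reqs, stats);
}
```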
>
> I did a second test: if I decompose the MPI_Startall and MPI_Waitall
> calls into successive MPI_Start/MPI_Wait calls for each request, I don't
> get the error... but such a sequence increases the time needed to
> exchange data over the network.
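
[One safe way to decompose the inner loop as described above, shown as a sketch with reqs and stats as in the test program. Note that fully serializing the pair on both ranks, i.e. Start+Wait on the receive before even starting the send, can deadlock for large messages, so both requests are started before the first wait.]

```c
/* Per-request variant of the inner loop: start both persistent
 * requests, then complete each one individually instead of using
 * a single MPI_Waitall. */
MPI_Start(&reqs[0]);            /* receive */
MPI_Start(&reqs[1]);            /* send */
MPI_Wait(&reqs[0], &stats[0]);
MPI_Wait(&reqs[1], &stats[1]);
```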
>
> I also noticed that the error only occurs after a certain time... as if
> the communication were blocked and the MPI_Wait function reached a
> timeout. If the communication is blocked, the ordering of the
> send/receive operations is probably the cause, but I haven't tried
> swapping the order (Send, Recv) in the request table... just some
> thoughts.
>
> Regards,
>
> Thierry
>
> On 09/09/2011 6:10 PM, Jayesh Krishna wrote:
>> Hi,
>>
>> # Which version of MPICH2 are you using (Use 1.4.1p1 - http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads)?
>>
>> # Can you run cpi.exe (C:\Program Files\MPICH2\examples\cpi.exe) using mpiexec ?
>>
>> # How are you running your code (Copy-paste the mpiexec command) ?
>>
>> # Can you run your code on a single machine ?
>>
>> # ...to sock in the Windows Registry, the behavior of persistent communications is more stable ... Do you still get errors ?
>>
>> Regards,
>> Jayesh
>>
>> ----- Original Message -----
>> From: "Thierry Roudier"<thierry.roudier at kiastek.com>
>> To: mpich-discuss at mcs.anl.gov
>> Sent: Thursday, September 8, 2011 4:24:58 PM
>> Subject: Re: [mpich-discuss] Persistent communications between 2 computers
>>
>> Thanks Darius for your reply.
>> Unfortunately, what you suggested doesn't work; I still get the same
>> error message. But since the error only occurs on some transactions, I
>> avoid the crash by using MPI_Errhandler_set(..., MPI_ERRORS_RETURN) and
>> handling the error in my program.
>> I also noticed that setting the MPI "channel" parameter to sock in the
>> Windows Registry makes persistent communications more stable.
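
[A sketch of the error-handling approach mentioned above: switch MPI_COMM_WORLD from the default MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN and check return codes explicitly. MPI_Errhandler_set is the older MPI-1 name; MPI_Comm_set_errhandler is the newer equivalent. The fragment assumes reqs and stats from the test program and needs stdio.h.]

```c
/* Make MPI errors on MPI_COMM_WORLD return codes instead of aborting. */
int rc, msglen;
char msg[MPI_MAX_ERROR_STRING];

MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

rc = MPI_Waitall(2, reqs, stats);
if (rc != MPI_SUCCESS) {
    MPI_Error_string(rc, msg, &msglen);
    fprintf(stderr, "MPI_Waitall failed: %s\n", msg);
    /* application-specific recovery goes here */
}
```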
>>
>> -Thierry
>>
>> On 08/09/2011 5:25 PM, Darius Buntinas wrote:
>>> The error message indicates that the second process was unable to look up the address of the other process.  Rather than using "localhost" use the actual address or name of the machine.  See if this helps.
>>>
>>> -d
>>>
>>>
>>> On Sep 8, 2011, at 1:46 AM, Thierry Roudier wrote:
>>>
>>>> Hi all,
>>>>
>>>> I want to use persistent communications to exchange data (the code I use is below). It works well on a single computer (mpiexec -n 2 teststart). But when I distribute the code across 2 computers (Win7) using the command line mpiexec -hosts 2 localhost 192.168.1.12 teststart, it doesn't work, and I get the following message:
>>>>> Fatal error in PMPI_Waitall: Other MPI error, error stack:
>>>>> PMPI_Waitall(274)....................: MPI_Waitall(count=2, req_array=004208B0, status_array=00420880) failed
>>>>> MPIR_Waitall_impl(121)...............:
>>>>> MPIDI_CH3I_Progress(402).............:
>>>>> MPID_nem_mpich2_blocking_recv(905)...:
>>>>> MPID_nem_newtcp_module_poll(37)......:
>>>>> MPID_nem_newtcp_module_connpoll(2669):
>>>>> gen_cnting_fail_handler(1738)........: connect failed - The semaphore timeout period has expired.
>>>>> (errno 121)
>>>>>
>>>>> ****** Persistent Communications *****
>>>>> Trials=             1
>>>>> Reps/trial=      1000
>>>>> Message Size   Bandwidth (bytes/sec)
>>>>> Fatal error in PMPI_Waitall: Other MPI error, error stack:
>>>>> PMPI_Waitall(274)....................: MPI_Waitall(count=2, req_array=004208B0, status_array=00420880) failed
>>>>> MPIR_Waitall_impl(121)...............:
>>>>> MPIDI_CH3I_Progress(402).............:
>>>>> MPID_nem_mpich2_blocking_recv(905)...:
>>>>> MPID_nem_newtcp_module_poll(37)......:
>>>>> MPID_nem_newtcp_module_connpoll(2655):
>>>>> gen_read_fail_handler(1145)..........: read from socket failed - The specified network name is no longer available.
>>>> And below is the code I use.
>>>>
>>>> Thanks a lot for your help.
>>>>
>>>> -Thierry
>>>>
>>>>> #include<mpi.h>
>>>>> #include<stdio.h>
>>>>>
>>>>> /* Modify these to change timing scenario */
>>>>> #define TRIALS          1
>>>>> #define STEPS           1
>>>>> #define MAX_MSGSIZE     1048576    /* 2^STEPS */
>>>>> #define REPS            1000
>>>>> #define MAXPOINTS       10000
>>>>>
>>>>> int    numtasks, rank, tag=999, n, i, j, k, this, msgsizes[MAXPOINTS];
>>>>> double  mbytes, tbytes, results[MAXPOINTS], ttime, t1, t2;
>>>>> char   sbuff[MAX_MSGSIZE], rbuff[MAX_MSGSIZE];
>>>>> MPI_Status stats[2];
>>>>> MPI_Request reqs[2];
>>>>>
>>>>> int main(int argc, char *argv[]) {
>>>>>
>>>>> MPI_Init(&argc,&argv);
>>>>> MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
>>>>> MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>>>
>>>>> /**************************** task 0 ***********************************/
>>>>> if (rank == 0) {
>>>>>
>>>>>     /* Initializations */
>>>>>     n=1;
>>>>>     for (i=0; i<=STEPS; i++) {
>>>>>       msgsizes[i] = n;
>>>>>       results[i] = 0.0;
>>>>>       n=n*2;
>>>>>       }
>>>>>     for (i=0; i<MAX_MSGSIZE; i++)
>>>>>       sbuff[i] = 'x';
>>>>>
>>>>>     /* Greetings */
>>>>>     printf("\n****** Persistent Communications *****\n");
>>>>>     printf("Trials=      %8d\n",TRIALS);
>>>>>     printf("Reps/trial=  %8d\n",REPS);
>>>>>     printf("Message Size   Bandwidth (bytes/sec)\n");
>>>>>
>>>>>     /* Begin timings */
>>>>>     for (k=0; k<TRIALS; k++) {
>>>>>
>>>>>       n=1;
>>>>>       for (j=0; j<=STEPS; j++) {
>>>>>
>>>>>         /* Setup persistent requests for both the send and receive */
>>>>>         MPI_Recv_init (rbuff, n, MPI_CHAR, 1, tag, MPI_COMM_WORLD, &reqs[0]);
>>>>>         MPI_Send_init (sbuff, n, MPI_CHAR, 1, tag, MPI_COMM_WORLD, &reqs[1]);
>>>>>
>>>>>         t1 = MPI_Wtime();
>>>>>         for (i=1; i<=REPS; i++){
>>>>>           MPI_Startall (2, reqs);
>>>>>           MPI_Waitall (2, reqs, stats);
>>>>>           }
>>>>>         t2 = MPI_Wtime();
>>>>>
>>>>>         /* Compute bandwidth and save best result over all TRIALS */
>>>>>         ttime = t2 - t1;
>>>>>         tbytes = sizeof(char) * n * 2.0 * (float)REPS;
>>>>>         mbytes = tbytes/ttime;
>>>>>         if (results[j] < mbytes)
>>>>>            results[j] = mbytes;
>>>>>
>>>>>         /* Free persistent requests */
>>>>>         MPI_Request_free (&reqs[0]);
>>>>>         MPI_Request_free (&reqs[1]);
>>>>>         n=n*2;
>>>>>         }   /* end j loop */
>>>>>       }     /* end k loop */
>>>>>
>>>>>     /* Print results */
>>>>>     for (j=0; j<=STEPS; j++) {
>>>>>       printf("%9d %16d\n", msgsizes[j], (int)results[j]);
>>>>>       }
>>>>>
>>>>>     }       /* end of task 0 */
>>>>>
>>>>>
>>>>>
>>>>> /****************************  task 1 ************************************/
>>>>> if (rank == 1) {
>>>>>
>>>>>     /* Begin timing tests */
>>>>>     for (k=0; k<TRIALS; k++) {
>>>>>
>>>>>       n=1;
>>>>>       for (j=0; j<=STEPS; j++) {
>>>>>
>>>>>         /* Setup persistent requests for both the send and receive */
>>>>>         MPI_Recv_init (rbuff, n, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &reqs[0]);
>>>>>         MPI_Send_init (sbuff, n, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &reqs[1]);
>>>>>
>>>>>         for (i=1; i<=REPS; i++){
>>>>>           MPI_Startall (2, reqs);
>>>>>           MPI_Waitall (2, reqs, stats);
>>>>>           }
>>>>>
>>>>>         /* Free persistent requests */
>>>>>         MPI_Request_free (&reqs[0]);
>>>>>         MPI_Request_free (&reqs[1]);
>>>>>         n=n*2;
>>>>>
>>>>>         }   /* end j loop */
>>>>>       }     /* end k loop */
>>>>>     }       /* end task 1 */
>>>>>
>>>>>
>>>>> MPI_Finalize();
>>>>>
>>>>> return 0;
>>>>> }  /* end of main */
>>>> _______________________________________________
>>>> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
>>>> To manage subscription options or unsubscribe:
>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

