[mpich-discuss] Bcast question?

Rajeev Thakur thakur at mcs.anl.gov
Fri Aug 17 10:35:38 CDT 2012


What platform are you running on? If you run on a real parallel machine or cluster with a real parallel file system, you may see different performance results.

Rajeev

On Aug 16, 2012, at 3:20 AM, John Chludzinski wrote:

> I tried two approaches: using MPI_File_read_all(...) to have all processes read the input file, versus MPI_File_seek(...) + MPI_File_read(...) to break up reading the file.  The latter requires a gather to bring the pieces back together.  Both work, but their performance is very different.
> 
> The first approach:
> 
>     double read_time = 0.0;
>     read_time -= MPI_Wtime();
>     MPI_File_read_all(fh, (void *)read_buffer, total_number_of_bytes, MPI_BYTE, &status);
>     read_time += MPI_Wtime();
> 
> read_time grows as the number of processes grows.
> 
> The second approach:
> 
>     MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
>     double read_time = 0.0;
>     read_time -= MPI_Wtime();
>     MPI_File_read(fh, read_buffer, number_of_bytes_2, MPI_BYTE, &status);
>     read_time += MPI_Wtime();
> 
> read_time decreases as the number of processes grows.
> 
> 1) With the first approach, the cost (time) of the parallel reads gets worse as the number of processes grows.
> 
> 2) With the second approach, even after factoring in the gather times, the effective transfer rate scales positively with the number of processes.
> 
> ---John
> 
> 
> 
> On Wed, Aug 15, 2012 at 11:50 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> Yes. A few other options: You could avoid the seek and use read_at instead. You could try the collective function read_at_all, which on some systems may perform better for this case (and won't on other systems).
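> A minimal sketch of the read_at variant (assuming the same fh, my_offset, read_buffer, number_of_bytes_2, and status as in your code below):
> 
>     /* explicit-offset read: no separate seek call needed */
>     MPI_File_read_at(fh, my_offset, read_buffer, number_of_bytes_2, MPI_BYTE, &status);
> 
>     /* or the collective form, which may or may not be faster on a given system */
>     MPI_File_read_at_all(fh, my_offset, read_buffer, number_of_bytes_2, MPI_BYTE, &status);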
> 
> Rajeev
> 
> 
> On Aug 15, 2012, at 9:14 PM, John Chludzinski wrote:
> 
> > I got rid of the distinction between the last process and the others.  All processes now use:
> >
> > int number_of_bytes = ceil((double)total_number_of_bytes / pool_size);
> >
> > read_buffer = (char*) calloc(number_of_bytes, 1);
> > ...
> > my_offset = (MPI_Offset) my_rank * number_of_bytes;
> > ...
> > MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
> > ...
> > MPI_File_read(fh, read_buffer, number_of_bytes_2, MPI_BYTE, &status);
> > ...
> >
> > This seems to work.  Look reasonable?
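> >
> > (One thing I'm relying on: when the last rank's request runs past EOF, MPI_File_read just returns fewer bytes rather than failing.  The actual count could be checked with MPI_Get_count, e.g.:)
> >
> >     int bytes_read;
> >     MPI_Get_count(&status, MPI_BYTE, &bytes_read);   /* may be < number_of_bytes_2 on the last rank */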
> >
> > ---John
> >
> >
> > On Sat, Aug 11, 2012 at 5:06 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > You can simply have each process read the entire file using a single MPI_File_read_all. No need for Gather or Gatherv.
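> > A minimal sketch of that pattern (the file name and the malloc'd buffer are just for illustration):
> >
> >     MPI_File fh;
> >     MPI_Offset total_number_of_bytes;
> >     MPI_Status status;
> >
> >     MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
> >     MPI_File_get_size(fh, &total_number_of_bytes);
> >
> >     char *read_buffer = malloc((size_t)total_number_of_bytes);
> >
> >     /* collective read: every process gets the whole file in its own buffer,
> >        so no Gather/Gatherv is needed afterward */
> >     MPI_File_read_all(fh, read_buffer, (int)total_number_of_bytes, MPI_BYTE, &status);
> >
> >     MPI_File_close(&fh);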
> >
> > Rajeev
> >
> > On Aug 11, 2012, at 5:47 AM, John Chludzinski wrote:
> >
> > > I followed up on your suggestion to look into MPI-IO - GREAT suggestion.
> > >
> > > I found an example at http://beige.ucs.indiana.edu/I590/node92.html and added code to gather the pieces of the file read in by each process:
> > >
> > > MPI_Gather(read_buffer, number_of_bytes, MPI_BYTE, rbuf, number_of_bytes, MPI_BYTE, MASTER_RANK, MPI_COMM_WORLD);
> > >
> > > All processes execute this line.  The problem is that number_of_bytes may be different for the last process if total_number_of_bytes is not a multiple of pool_size (i.e., total_number_of_bytes % pool_size != 0).  And if the value isn't the same for all processes, you get:
> > >
> > > Fatal error in PMPI_Gather: Message truncated
> > >
> > > If I set pool_size (the number of processes) so that total_number_of_bytes is a multiple of it (i.e., total_number_of_bytes % pool_size == 0), the code executes without error.
> > >
> > > I thought I had read in Peter Pacheco's book that this shouldn't be required?
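> > >
> > > (Presumably MPI_Gatherv with per-rank counts is what's needed; a sketch of what I mean, with recvcounts/displs computed on every rank and my_rank being this process's rank:)
> > >
> > >     int *recvcounts = malloc(pool_size * sizeof(int));
> > >     int *displs     = malloc(pool_size * sizeof(int));
> > >     for (int i = 0; i < pool_size; i++) {
> > >         MPI_Offset start = (MPI_Offset)i * number_of_bytes;
> > >         if (start > total_number_of_bytes) start = total_number_of_bytes;
> > >         MPI_Offset end = start + number_of_bytes;
> > >         if (end > total_number_of_bytes) end = total_number_of_bytes;
> > >         recvcounts[i] = (int)(end - start);   /* last rank may contribute fewer bytes */
> > >         displs[i]     = (int)start;
> > >     }
> > >
> > >     MPI_Gatherv(read_buffer, recvcounts[my_rank], MPI_BYTE,
> > >                 rbuf, recvcounts, displs, MPI_BYTE,
> > >                 MASTER_RANK, MPI_COMM_WORLD);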
> > >
> > > ---John
> > >
> > >
> > > On Fri, Aug 10, 2012 at 9:58 AM, William Gropp <wgropp at illinois.edu> wrote:
> > > The most likely newbie mistake is that you are timing the time that the MPI_Bcast is waiting - for example, if your code looks like this:
> > >
> > > if (rank == 0) { tr = MPI_Wtime(); /* read data */ tr = MPI_Wtime() - tr; }
> > > tb = MPI_Wtime(); MPI_Bcast(…); tb = MPI_Wtime() - tb;
> > >
> > > then on all but rank 0, you are timing the time that MPI_Bcast is waiting for the read data step to finish.  Instead, consider adding an MPI_Barrier before the MPI_Bcast:
> > >
> > > if (rank == 0) { tr = MPI_Wtime(); /* read data */ tr = MPI_Wtime() - tr; }
> > > MPI_Barrier(MPI_COMM_WORLD);
> > > tb = MPI_Wtime(); MPI_Bcast(…); tb = MPI_Wtime() - tb;
> > >
> > > *Only* do this when you are trying to answer such timing questions.
> > >
> > > You may also want to consider using MPI-IO to parallelize the read step.
> > >
> > > Bill
> > >
> > > William Gropp
> > > Director, Parallel Computing Institute
> > > Deputy Director for Research
> > > Institute for Advanced Computing Applications and Technologies
> > > Paul and Cynthia Saylor Professor of Computer Science
> > > University of Illinois Urbana-Champaign
> > >
> > >
> > >
> > > On Aug 10, 2012, at 3:03 AM, John Chludzinski wrote:
> > >
> > > > I have a problem which requires all processes to have a copy of an array of data that is read in from a file (by process 0).  I Bcast the array to all processes (using MPI_COMM_WORLD).
> > > >
> > > > I instrumented the code with some calls to MPI_Wtime to find the time consumed by different actions.  In particular, I was interested in comparing the time required for the Bcast vs. the fread.  The array has 1,200,000 elements of type MPI_DOUBLE.
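> > > >
> > > > Roughly, the instrumentation looks like this (a sketch; fp and the variable names are just illustrative):
> > > >
> > > >     double fread_time = 0.0, bcast_time = 0.0;
> > > >     double *data = malloc(1200000 * sizeof(double));
> > > >
> > > >     if (rank == 0) {
> > > >         fread_time -= MPI_Wtime();
> > > >         fread(data, sizeof(double), 1200000, fp);   /* rank 0 reads the file */
> > > >         fread_time += MPI_Wtime();
> > > >     }
> > > >
> > > >     bcast_time -= MPI_Wtime();
> > > >     MPI_Bcast(data, 1200000, MPI_DOUBLE, 0, MPI_COMM_WORLD);
> > > >     bcast_time += MPI_Wtime();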
> > > >
> > > > For a 3 process run:
> > > >
> > > > RANK = 0      fread_time = 2.323575
> > > >
> > > > vs.
> > > >
> > > > RANK = 2      bcast_time = 2.361233
> > > > RANK = 0      bcast_time = 0.081910
> > > > RANK = 1      bcast_time = 2.399790
> > > >
> > > > These numbers seem to indicate that Bcast-ing the data is as slow as reading the data from a file (on my Western Digital Passport USB drive).  Am I making a newbie mistake?
> > > >
> > > > ---John
> > > >
> > > > PS> I'm using a Fedora 16 (32-bit) notebook with a dual-core AMD Phenom processor.
> 
> 


