[mpich-discuss] Bcast question?

John Chludzinski john.chludzinski at gmail.com
Thu Aug 16 03:20:47 CDT 2012


 I tried two approaches: using MPI_File_read_all(...) to have all processes
read the entire input file *versus* MPI_File_seek(...) + MPI_File_read(...) to
break the read up across processes.  The latter requires a gather to bring
the pieces back together.  Both work but perform very differently.

The first approach:

    double read_time = 0.0;
    read_time -= MPI_Wtime();
    MPI_File_read_all(fh, (void *)read_buffer, total_number_of_bytes,
                      MPI_BYTE, &status);
    read_time += MPI_Wtime();

read_time grows as the number of processes grows.

The second approach:

    MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
    double read_time = 0.0;
    read_time -= MPI_Wtime();
    MPI_File_read(fh, read_buffer, number_of_bytes_2, MPI_BYTE, &status);
    read_time += MPI_Wtime();

read_time decreases as the number of processes grows.

1) With the first approach, the cost (time) of the parallel reads gets
worse as the number of processes grows.

2) With the second approach, even after factoring in the gather time, the
effective transfer rate scales positively with the number of processes.
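
For reference, here is a minimal self-contained version of the second
approach (a sketch: the file name "input.dat", the error handling, and the
end-of-file clamp are illustrative assumptions, not my exact code):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int my_rank, pool_size;
        MPI_File fh;
        MPI_Status status;
        MPI_Offset total_number_of_bytes, my_offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &pool_size);

        MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_get_size(fh, &total_number_of_bytes);

        /* Even chunk per rank; clamp the last rank at end-of-file. */
        int number_of_bytes =
            (int)ceil((double)total_number_of_bytes / pool_size);
        my_offset = (MPI_Offset)my_rank * number_of_bytes;
        int number_of_bytes_2 = number_of_bytes;
        if (my_offset + number_of_bytes_2 > total_number_of_bytes)
            number_of_bytes_2 = (int)(total_number_of_bytes - my_offset);
        if (number_of_bytes_2 < 0)
            number_of_bytes_2 = 0;

        char *read_buffer = calloc(number_of_bytes, 1);

        double read_time = 0.0;
        read_time -= MPI_Wtime();
        MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
        MPI_File_read(fh, read_buffer, number_of_bytes_2, MPI_BYTE, &status);
        read_time += MPI_Wtime();
        printf("RANK = %d  read_time = %f\n", my_rank, read_time);

        MPI_File_close(&fh);
        free(read_buffer);
        MPI_Finalize();
        return 0;
    }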

---John



On Wed, Aug 15, 2012 at 11:50 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:

> Yes. A few other options: You could avoid the seek and use read_at
> instead. You could try the collective function read_at_all, which on some
> systems may perform better for this case (and won't on other systems).
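
(For reference, the explicit-offset variants Rajeev mentions would look
roughly like this, a sketch reusing my_offset and number_of_bytes_2 from
the code above:)

    /* Independent explicit-offset read: no MPI_File_seek needed,
       and the individual file pointer is left untouched. */
    MPI_File_read_at(fh, my_offset, read_buffer, number_of_bytes_2,
                     MPI_BYTE, &status);

    /* Collective variant; may help on some systems, hurt on others. */
    MPI_File_read_at_all(fh, my_offset, read_buffer, number_of_bytes_2,
                         MPI_BYTE, &status);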
>
> Rajeev
>
>
> On Aug 15, 2012, at 9:14 PM, John Chludzinski wrote:
>
> > I got rid of the distinction between the last process and the others.
> > All processes now use:
> >
> > int number_of_bytes = ceil((double)total_number_of_bytes / pool_size);
> >
> > read_buffer = (char*) calloc(number_of_bytes, 1);
> > ...
> > my_offset = (MPI_Offset) my_rank * number_of_bytes;
> > ...
> > MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
> > ...
> > MPI_File_read(fh, read_buffer, number_of_bytes_2, MPI_BYTE, &status);
> > ...
> >
> > This seems to work.  Look reasonable?
> >
> > ---John
> >
> >
> > On Sat, Aug 11, 2012 at 5:06 PM, Rajeev Thakur <thakur at mcs.anl.gov>
> > wrote:
> > You can simply have each process read the entire file using a single
> > MPI_File_read_all. No need for Gather or Gatherv.
> >
> > Rajeev
> >
> > On Aug 11, 2012, at 5:47 AM, John Chludzinski wrote:
> >
> > > I followed up on your suggestion to look into MPI-IO - GREAT
> > > suggestion.
> > >
> > > I found an example at http://beige.ucs.indiana.edu/I590/node92.html and
> > > added code to gather the pieces of the file read in by each process:
> > >
> > > MPI_Gather(read_buffer, number_of_bytes, MPI_BYTE, rbuf,
> > >            number_of_bytes, MPI_BYTE, MASTER_RANK, MPI_COMM_WORLD);
> > >
> > > All processes execute this line.  The problem is that number_of_bytes
> > > may be different for the last process if total_number_of_bytes is not a
> > > multiple of pool_size (i.e., total_number_of_bytes % pool_size != 0).
> > > And if the value isn't the same for all processes, you get:
> > >
> > > Fatal error in PMPI_Gather: Message truncated
> > >
> > > If I set pool_size (the number of processes) so that
> > > total_number_of_bytes is a multiple of it (i.e., total_number_of_bytes %
> > > pool_size == 0), the code executes without error.
> > >
> > > I thought I read in Peter Pacheco's book that this shouldn't be
> > > required?
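
(MPI_Gatherv is the collective that allows unequal counts; a sketch,
assuming each rank's actual byte count is in number_of_bytes and pool_size
ranks gather to MASTER_RANK as in the code above:)

    /* Gather each rank's count first, then build the displacements. */
    int *recvcounts = NULL, *displs = NULL;
    if (my_rank == MASTER_RANK) {
        recvcounts = malloc(pool_size * sizeof(int));
        displs     = malloc(pool_size * sizeof(int));
    }
    MPI_Gather(&number_of_bytes, 1, MPI_INT,
               recvcounts, 1, MPI_INT, MASTER_RANK, MPI_COMM_WORLD);
    if (my_rank == MASTER_RANK) {
        displs[0] = 0;
        for (int i = 1; i < pool_size; i++)
            displs[i] = displs[i - 1] + recvcounts[i - 1];
    }
    MPI_Gatherv(read_buffer, number_of_bytes, MPI_BYTE,
                rbuf, recvcounts, displs, MPI_BYTE,
                MASTER_RANK, MPI_COMM_WORLD);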
> > >
> > > ---John
> > >
> > >
> > > On Fri, Aug 10, 2012 at 9:58 AM, William Gropp <wgropp at illinois.edu>
> > > wrote:
> > > The most likely newbie mistake is that you are timing the time that
> > > the MPI_Bcast is waiting - for example, if your code looks like this:
> > >
> > > if (rank == 0) { tr = MPI_Wtime(); /* read data */ tr = MPI_Wtime()-tr; }
> > > tb = MPI_Wtime(); MPI_Bcast(…); tb = MPI_Wtime() - tb;
> > >
> > > then on all but rank 0, you are timing the time that MPI_Bcast is
> > > waiting for the read data step to finish.  Instead, consider adding an
> > > MPI_Barrier before the MPI_Bcast:
> > >
> > > if (rank == 0) { tr = MPI_Wtime(); /* read data */ tr = MPI_Wtime()-tr; }
> > > MPI_Barrier(MPI_COMM_WORLD);
> > > tb = MPI_Wtime(); MPI_Bcast(…); tb = MPI_Wtime() - tb;
> > >
> > > *Only* do this when you are trying to answer such timing questions.
> > >
> > > You may also want to consider using MPI-IO to parallelize the read
> > > step.
> > >
> > > Bill
> > >
> > > William Gropp
> > > Director, Parallel Computing Institute
> > > Deputy Director for Research
> > > Institute for Advanced Computing Applications and Technologies
> > > Paul and Cynthia Saylor Professor of Computer Science
> > > University of Illinois Urbana-Champaign
> > >
> > >
> > >
> > > On Aug 10, 2012, at 3:03 AM, John Chludzinski wrote:
> > >
> > > > I have a problem which requires all processes to have a copy of an
> > > > array of data that is read in from a file (by process 0).  I Bcast
> > > > the array to all processes (using MPI_COMM_WORLD).
> > > >
> > > > I instrumented the code with some calls to Wtime to find the time
> > > > consumed for different actions.  In particular, I was interested in
> > > > comparing the time required for Bcast's vs. fread's.  The size of the
> > > > array is 1,200,000 elements of type MPI_DOUBLE.
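
(Roughly the pattern being timed; a sketch, where the array a, the file
pointer fp, and rank are assumed from the surrounding code:)

    #define N 1200000
    double fread_time = 0.0, bcast_time = 0.0;

    if (rank == 0) {                /* only rank 0 touches the file */
        fread_time -= MPI_Wtime();
        fread(a, sizeof(double), N, fp);
        fread_time += MPI_Wtime();
    }

    bcast_time -= MPI_Wtime();      /* every rank times its own Bcast */
    MPI_Bcast(a, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    bcast_time += MPI_Wtime();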
> > > >
> > > > For a 3 process run:
> > > >
> > > > RANK = 0      fread_time = 2.323575
> > > >
> > > > vs.
> > > >
> > > > RANK = 2      bcast_time = 2.361233
> > > > RANK = 0      bcast_time = 0.081910
> > > > RANK = 1      bcast_time = 2.399790
> > > >
> > > > These numbers seem to indicate that Bcast-ing the data is as slow as
> > > > reading the data from a file (on my Western Digital Passport USB
> > > > drive).  Am I making a newbie mistake?
> > > >
> > > > ---John
> > > >
> > > > PS> I'm using a Fedora 16 (32-bit) notebook with a dual-core AMD
> > > > Phenom processor.