I tried two approaches: using MPI_File_read_all(...) to have every process read the entire input file <i><b>versus</b></i> MPI_File_seek(...) + MPI_File_read(...) to split the read across the processes.  The latter requires a gather to bring together all the pieces.  Both work, but their performance is very different.  <br><br>The first approach:<br>

<br>

<span style="font-family:courier new,monospace">    double read_time = 0.0;</span><br style="font-family:courier new,monospace">

<span style="font-family:courier new,monospace">    read_time -= MPI_Wtime();</span><br style="font-family:courier new,monospace">

<span style="font-family:courier new,monospace">    <b>MPI_File_read_all(fh, (void *)read_buffer, total_number_of_bytes, MPI_BYTE, &status);</b></span><br style="font-family:courier new,monospace">

<span style="font-family:courier new,monospace">    read_time += MPI_Wtime();</span><br>

<br>

read_time grows as the number of processes grows.<br>

<br>

The second approach:<br>

<br>

<span style="font-family:courier new,monospace">    MPI_File_seek(fh, my_offset, MPI_SEEK_SET);</span><br style="font-family:courier new,monospace">

<span style="font-family:courier new,monospace">    double read_time = 0.0;</span><br style="font-family:courier new,monospace">

<span style="font-family:courier new,monospace">    read_time -= MPI_Wtime();</span><br style="font-family:courier new,monospace">

<span style="font-family:courier new,monospace">    <b>MPI_File_read(fh, read_buffer, number_of_bytes_2, MPI_BYTE, &status);</b></span><br style="font-family:courier new,monospace">

<span style="font-family:courier new,monospace">    read_time += MPI_Wtime();</span><br>

<br>

<span style="font-family:courier new,monospace">read_time</span> decreases as the number of processes grows.<br>
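<br>
For reference, here is a minimal sketch of how each process's offset and byte count could be derived so that the last process never reads past the end of the file. The names chunk_t and chunk_for_rank are hypothetical and are not part of my actual code; the ceiling division mirrors the ceil(...) expression quoted further down in this thread.<br>

```c
#include <assert.h>

/* Hypothetical helper type: the byte range one rank is responsible for. */
typedef struct {
    long long offset; /* file position where this rank starts reading */
    long long count;  /* number of bytes this rank reads */
} chunk_t;

/* Split total_bytes across pool_size ranks using ceiling division.
 * Ranks whose range would start at or past the end of the file get count 0,
 * and the last non-empty rank gets only the bytes that remain. */
chunk_t chunk_for_rank(long long total_bytes, int pool_size, int rank)
{
    assert(total_bytes >= 0 && pool_size > 0 && rank >= 0 && rank < pool_size);

    /* Integer equivalent of ceil((double)total_bytes / pool_size). */
    long long per_rank = (total_bytes + pool_size - 1) / pool_size;

    chunk_t c;
    c.offset = (long long)rank * per_rank;
    if (c.offset >= total_bytes) {
        c.offset = total_bytes;
        c.count = 0;
    } else {
        long long remaining = total_bytes - c.offset;
        c.count = remaining < per_rank ? remaining : per_rank;
    }
    return c;
}
```

The offset would go to MPI_File_seek (or MPI_File_read_at) and the count to the read call. Because the counts can differ between ranks, collecting the pieces this way would call for MPI_Gatherv with per-rank counts and displacements rather than a plain MPI_Gather.<br>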


<br>1) With the first approach, the time for the parallel reads gets worse as the number of processes grows.  <br><br>2) With the second approach, even after I factored in the gather time, the effective transfer rate scales positively with the number of processes.<br>
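<br>
To make "effective transfer rate" concrete, here is one way it could be computed. This is a sketch under assumptions: it treats the slowest process's read time plus the gather time as the elapsed time, and the function name effective_rate_mb_per_s is hypothetical.<br>

```c
#include <assert.h>

/* Hypothetical helper: effective transfer rate in MB/s (1 MB = 1e6 bytes).
 * Assumption: the reads run concurrently, so the slowest rank's read time
 * determines the read phase, and the gather time is added on top. */
double effective_rate_mb_per_s(long long total_bytes,
                               const double *read_times,
                               int pool_size,
                               double gather_time)
{
    assert(total_bytes >= 0 && read_times != 0 && pool_size > 0);
    assert(gather_time >= 0.0);

    /* Find the slowest per-rank read time. */
    double slowest = read_times[0];
    for (int i = 1; i < pool_size; ++i) {
        if (read_times[i] > slowest) {
            slowest = read_times[i];
        }
    }

    double elapsed = slowest + gather_time;
    assert(elapsed > 0.0);
    return (double)total_bytes / elapsed / 1.0e6;
}
```

The per-rank read_time values would first have to be collected on one process, for example with MPI_Gather on a single MPI_DOUBLE per rank.<br>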


<br>

---John<br><br>

<br><br><div class="gmail_quote">On Wed, Aug 15, 2012 at 11:50 PM, Rajeev Thakur <span dir="ltr"><<a href="mailto:thakur@mcs.anl.gov" target="_blank">thakur@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Yes. A few other options: You could avoid the seek and use read_at instead. You could try the collective function read_at_all, which on some systems may perform better for this case (and won't on other systems).<br>

<span class="HOEnZb"><font color="#888888"><br>

Rajeev<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

<br>

On Aug 15, 2012, at 9:14 PM, John Chludzinski wrote:<br>

<br>

> I got rid of the distinction between the last process and the others.  All processes now use:<br>

><br>

> int number_of_bytes = ceil((double)total_number_of_bytes /pool_size);<br>

><br>

> read_buffer = (char*) calloc(number_of_bytes, 1);<br>

> ...<br>

> my_offset = (MPI_Offset) my_rank * number_of_bytes;<br>

> ...<br>

> MPI_File_seek(fh, my_offset, MPI_SEEK_SET);<br>

> ...<br>

> MPI_File_read(fh, read_buffer, number_of_bytes_2, MPI_BYTE, &status);<br>

> ...<br>

><br>

> This seems to work.  Look reasonable?<br>

><br>

> ---John<br>

><br>

><br>

> On Sat, Aug 11, 2012 at 5:06 PM, Rajeev Thakur <<a href="mailto:thakur@mcs.anl.gov">thakur@mcs.anl.gov</a>> wrote:<br>

> You can simply have each process read the entire file using a single MPI_File_read_all. No need for Gather or Gatherv.<br>

><br>

> Rajeev<br>

><br>

> On Aug 11, 2012, at 5:47 AM, John Chludzinski wrote:<br>

><br>

> > I followed up on your suggestion to look into MPI-IO - GREAT suggestion.<br>

> ><br>

> > I found an example at <a href="http://beige.ucs.indiana.edu/I590/node92.html" target="_blank">http://beige.ucs.indiana.edu/I590/node92.html</a> and added code to gather the pieces of the file read in by each process:<br>


> ><br>

> > MPI_Gather( read_buffer, number_of_bytes, MPI_BYTE, rbuf, number_of_bytes, MPI_BYTE, MASTER_RANK, MPI_COMM_WORLD);<br>

> ><br>

> > All processes execute this line.  The problem is that number_of_bytes may be different for the last process if total_number_of_bytes is not a multiple of pool_size (i.e., total_number_of_bytes % pool_size != 0).  And if the value isn't the same for all processes, you get:<br>


> ><br>

> > Fatal error in PMPI_Gather: Message truncated<br>

> ><br>

> > If I set pool_size (the number of processes) so that total_number_of_bytes is a multiple of it (i.e., total_number_of_bytes % pool_size == 0), the code executes without error.<br>

> ><br>

> > I thought I read in Peter Pacheco's book that this is not necessarily required?<br>

> ><br>

> > ---John<br>

> ><br>

> ><br>

> > On Fri, Aug 10, 2012 at 9:58 AM, William Gropp <<a href="mailto:wgropp@illinois.edu">wgropp@illinois.edu</a>> wrote:<br>

> > The most likely newbie mistake is that you are timing the time that the MPI_Bcast is waiting - for example, if your code looks like this:<br>

> ><br>

> > if (rank == 0) { tr = MPI_Wtime(); /* read data */ tr = MPI_Wtime() - tr; }<br>

> > tb = MPI_Wtime(); MPI_Bcast(…); tb = MPI_Wtime() - tb;<br>

> ><br>

> > then on all but rank 0, you are timing the time that MPI_Bcast is waiting for the read data step to finish.  Instead, consider adding an MPI_Barrier before the MPI_Bcast:<br>

> ><br>

> > if (rank == 0) { tr = MPI_Wtime(); /* read data */ tr = MPI_Wtime() - tr; }<br>

> > MPI_Barrier(MPI_COMM_WORLD);<br>

> > tb = MPI_Wtime(); MPI_Bcast(…); tb = MPI_Wtime() - tb;<br>

> ><br>

> > *Only* do this when you are trying to answer such timing questions.<br>

> ><br>

> > You may also want to consider using MPI-IO to parallelize the read step.<br>

> ><br>

> > Bill<br>

> ><br>

> > William Gropp<br>

> > Director, Parallel Computing Institute<br>

> > Deputy Director for Research<br>

> > Institute for Advanced Computing Applications and Technologies<br>

> > Paul and Cynthia Saylor Professor of Computer Science<br>

> > University of Illinois Urbana-Champaign<br>

> ><br>

> ><br>

> ><br>

> > On Aug 10, 2012, at 3:03 AM, John Chludzinski wrote:<br>

> ><br>

> > > I have a problem which requires all processes to have a copy of an array of data that is read in from a file by process 0.  I Bcast the array to all processes (using MPI_COMM_WORLD).<br>

> > ><br>

> > > I instrumented the code with some calls to Wtime to find the time consumed by different actions.  In particular, I was interested in comparing the time required for the Bcast vs. the fread.  The array holds 1,200,000 elements of type MPI_DOUBLE.<br>


> > ><br>

> > > For a 3 process run:<br>

> > ><br>

> > > RANK = 0      fread_time = 2.323575<br>

> > ><br>

> > > vs.<br>

> > ><br>

> > > RANK = 2      bcast_time = 2.361233<br>

> > > RANK = 0      bcast_time = 0.081910<br>

> > > RANK = 1      bcast_time = 2.399790<br>

> > ><br>

> > > These numbers seem to indicate that Bcast-ing the data is as slow as reading the data from a file (on my Western Digital Passport USB drive).  Am I making a newbie mistake?<br>

> > ><br>

> > > ---John<br>

> > ><br>

> > > PS> I'm using Fedora 16 (32 bit) notebook with a dual core AMD Phenom processor.<br>

> > > _______________________________________________<br>

> > > mpich-discuss mailing list     <a href="mailto:mpich-discuss@mcs.anl.gov">mpich-discuss@mcs.anl.gov</a><br>

> > > To manage subscription options or unsubscribe:<br>

> > > <a href="https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss</a><br>

> ><br>

> ><br>


><br>


><br>

<br>

</div></div></blockquote></div><br>