[mpich-discuss] read the same file on all processes

Luiz Carlos da Costa Junior lcjunior at ufrj.br
Wed Oct 22 22:37:53 CDT 2008


Gus, Wei and all,
Thanks very much for your answers. They clarified a lot of things, and now I
think I can really choose the right way to go.

My case is like the second one Gus described: I have a big file with some
initial data, and only part of it really needs to go to each process. Also,
this could be done while the computation is taking place. When dealing with
large data, it is better to mix computation and communication instead of
concentrating all the communication in a single step, right?

Wei also told me that I could use MPI_File_read_all(). As I said, I can read
parts of the file during my steps, so I will look for a suitable collective
function and also try what you suggested.
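
For reference, this is roughly the collective read I have in mind (just a
sketch; the file name and the assumption that the file size divides evenly
among the processes are mine):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_File fh;
        MPI_Offset total, chunk, offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* every process opens the same file collectively */
        MPI_File_open(MPI_COMM_WORLD, "initial_data.bin",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        MPI_File_get_size(fh, &total);
        chunk  = total / nprocs;            /* assumes an even split */
        offset = (MPI_Offset)rank * chunk;

        char *buf = malloc((size_t)chunk);

        /* collective read: each rank pulls only its own slice */
        MPI_File_read_at_all(fh, offset, buf, (int)chunk, MPI_BYTE,
                             MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }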

One question: currently my master process also participates in the
computation, which works well because all processes run the same code.
However, I have thought about creating a special process to work as a "data
server", responsible for distributing this information to the other
processes. Is there any reason this design could be a problem?
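
To be concrete, something like this is the pattern I have in mind (just a
sketch; the tags, the slice size, and the one-request-per-worker protocol
are all made up for illustration):

    #include <mpi.h>
    #include <stdio.h>

    #define TAG_REQ  1
    #define TAG_DATA 2
    #define SLICE    4          /* doubles per slice; tiny on purpose */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double buf[SLICE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {
            /* data server: answer one request per worker, then quit */
            int served = 0;
            while (served < nprocs - 1) {
                int want;
                MPI_Status st;
                MPI_Recv(&want, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
                         MPI_COMM_WORLD, &st);
                /* a real server would read slice 'want' from the file;
                   here we just fill the buffer with the slice index */
                for (int i = 0; i < SLICE; i++) buf[i] = want;
                MPI_Send(buf, SLICE, MPI_DOUBLE, st.MPI_SOURCE, TAG_DATA,
                         MPI_COMM_WORLD);
                served++;
            }
        } else {
            /* worker: ask for "my" slice, then go compute with it */
            int want = rank;
            MPI_Send(&want, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
            MPI_Recv(buf, SLICE, MPI_DOUBLE, 0, TAG_DATA,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank %d got slice starting with %g\n", rank, buf[0]);
        }

        MPI_Finalize();
        return 0;
    }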

Thanks in advance,
Luiz

On Wed, Oct 22, 2008 at 1:50 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:

> Hello Kamaraju, Luiz and list
>
> Reading on process 0 and broadcasting to all others is typically what is
> done in most programs we use here (ocean, atmosphere, climate models).
> For control files, with namelists for instance, which are small but contain
> global data that is needed by all processes, this works very well.
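>
> For instance (just a sketch; the control file name and its two values are
> invented):
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, nsteps = 0;
>         double dt = 0.0;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         if (rank == 0) {
>             /* only rank 0 touches the file */
>             FILE *fp = fopen("control.txt", "r");
>             if (fp) { fscanf(fp, "%d %lf", &nsteps, &dt); fclose(fp); }
>         }
>
>         /* everyone else gets the values over the network */
>         MPI_Bcast(&nsteps, 1, MPI_INT,    0, MPI_COMM_WORLD);
>         MPI_Bcast(&dt,     1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>
>         printf("rank %d: nsteps=%d dt=%g\n", rank, nsteps, dt);
>         MPI_Finalize();
>         return 0;
>     }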
>
> For larger files, e.g., binaries with initial conditions data on the global
> grid, this is often done as well, but with a twist: not through broadcast.
> These are domain decomposition applications, where each process works on a
> subdomain, and hence needs to know only the part of the data in its own
> subdomain (and exchange domain boundaries with neighbor domains/processes
> as the solution is marched in time).
> Hence, for this type of array/grid data, there is no need to broadcast the
> global array; instead you scatter the data from the global domain to the
> subdomains/processes.
> Oftentimes even the subdomain data is quite large (say, 3D arrays) and needs
> to be split into smaller chunks (say, 2D slices) to keep message sizes
> manageable. In this case you scatter the smaller chunks in a loop, as the
> sketch below shows.
> MPI derived datatypes are often used to organize the data structures being
> exchanged, decomposing the data on one process and reassembling it on
> another process, etc.
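>
> A sketch of that scatter-in-a-loop idea (the grid sizes and the layout of
> the global array on rank 0 are invented for illustration):
>
>     #include <mpi.h>
>     #include <stdlib.h>
>
>     #define NX 8            /* subdomain points in x (illustrative) */
>     #define NY 8            /* subdomain points in y */
>     #define NZ 4            /* number of 2D slices per subdomain   */
>
>     int main(int argc, char **argv)
>     {
>         int rank, nprocs;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>
>         /* rank 0 holds (or reads) the whole grid, laid out so that
>            slice k stores nprocs consecutive NX*NY blocks, one per rank */
>         double *global = NULL;
>         double *local  = malloc((size_t)NX * NY * NZ * sizeof(double));
>         if (rank == 0)
>             global = calloc((size_t)NX * NY * NZ * nprocs, sizeof(double));
>
>         /* one scatter per 2D slice keeps each message at NX*NY doubles */
>         for (int k = 0; k < NZ; k++) {
>             double *send = (rank == 0)
>                          ? global + (size_t)k * NX * NY * nprocs : NULL;
>             MPI_Scatter(send, NX * NY, MPI_DOUBLE,
>                         local + (size_t)k * NX * NY, NX * NY, MPI_DOUBLE,
>                         0, MPI_COMM_WORLD);
>         }
>
>         free(local);
>         if (rank == 0) free(global);
>         MPI_Finalize();
>         return 0;
>     }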
>
> This technique is very traditional, the "master-slave" model, and predates
> the MPI I/O functions, parallel file systems, etc.
> The mirror image of it is gathering the data from all subdomains on process
> 0, which very often is the one responsible for writing the data to output
> files (see the sketch below). Indeed, this serializes the computation to
> some extent, but it is safe for I/O in systems where, say, NFS may become
> a bottleneck.
> (If you use local disks on the nodes of a cluster this is not a problem,
> but then you need to take care of staging the data in and staging the
> results out to/from the local disks, and perhaps post-processing the
> files.)
> If I have, say, 64 processes banging together on a single NFS server,
> things typically break down. If I use MPI to gather/scatter big data arrays
> and funnel the I/O through process 0, NFS doesn't suffer.
> Old fashioned but functional.  :)
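>
> In sketch form (the sizes and the output file name are again invented):
>
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     #define NLOC 1024       /* doubles per subdomain (illustrative) */
>
>     int main(int argc, char **argv)
>     {
>         int rank, nprocs;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>
>         double local[NLOC];
>         for (int i = 0; i < NLOC; i++) local[i] = rank;  /* fake results */
>
>         double *global = NULL;
>         if (rank == 0)
>             global = malloc((size_t)NLOC * nprocs * sizeof(double));
>
>         /* collect every subdomain on rank 0 */
>         MPI_Gather(local, NLOC, MPI_DOUBLE,
>                    global, NLOC, MPI_DOUBLE, 0, MPI_COMM_WORLD);
>
>         if (rank == 0) {        /* only one process touches NFS */
>             FILE *fp = fopen("output.bin", "wb");
>             if (fp) {
>                 fwrite(global, sizeof(double), (size_t)NLOC * nprocs, fp);
>                 fclose(fp);
>             }
>             free(global);
>         }
>         MPI_Finalize();
>         return 0;
>     }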
>
> Another list subscriber pointed out a different solution, using the MPI I/O
> functions. I believe it works well for, say, raw binary files, but we use
> more structured formats here (e.g., NetCDF).
>
> I hope this helps,
> Gus Correa
>
> --
> ---------------------------------------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> Luiz Carlos da Costa Junior wrote:
>
>> Hi all,
>>
>> Let me join this conversation. I also "suffer" from these doubts. In my
>> case, I have an application in two versions, Windows (NTFS) and Linux
>> (FAT32), and I first implemented the first approach (make one separate
>> copy for each machine).
>>
>> But recently I started to deal with bigger files (200 MB ~ 1 GB) and this
>> became very inefficient. The reason, I suspect, is that even though we
>> have multiple processes, the hard disk that has to serve all these reads
>> is just one. In other words, the operation is intrinsically sequential and
>> becomes a bottleneck (am I right?).
>>
>> I haven't changed my implementation yet, but I was thinking of moving to
>> the second approach (rank 0 reads and MPI_Bcast's the info), expecting
>> better results.
>>
>> Does anyone have any experience?
>>
>> Actually I am not sure if this will be better. I understand that MPI uses
>> sockets to pass all messages, and a natural question is whether that
>> operation is faster than reading from files.
>>
>> Best regards,
>> Luiz
>>
>> On Wed, Oct 22, 2008 at 12:10 AM, Rajeev Thakur <thakur at mcs.anl.gov>
>> wrote:
>>
>>    How big is the file? What kind of file system is it on?
>>
>>    Rajeev
>>
>>    > -----Original Message-----
>>    > From: owner-mpich-discuss at mcs.anl.gov
>>    > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of
>>    > Kamaraju Kusumanchi
>>    > Sent: Tuesday, October 21, 2008 8:27 PM
>>    > To: mpich-discuss at mcs.anl.gov
>>    > Subject: [mpich-discuss] read the same file on all processes
>>    > Subject: [mpich-discuss] read the same file on all processes
>>    >
>>    > Hi all,
>>    >
>>    >     I have a file which needs to be read on all the processes
>>    > of an MPI job. If I read the same file simultaneously on all
>>    > the processes, will it cause any problems?
>>    >
>>    >     I can think of two other options:
>>    >
>>    > - make multiple copies of the same file and read a separate
>>    > file on different processes
>>    > - read the file on rank 0 process, then use MPI_Bcast and
>>    > transfer the contents across the remaining processes.
>>    >
>>    >    Which approach should be preferred? I am thinking this
>>    > must be something encountered by others, so if there is a
>>    > book/web page which explains this kind of thing, a pointer
>>    > to it would be most appreciated.
>>    >
>>    > regards
>>    > raju

