[mpich-discuss] read the same file on all processes

Gus Correa gus at ldeo.columbia.edu
Wed Oct 22 10:50:49 CDT 2008


Hello Kamaraju, Luiz and list

Reading on process 0 and broadcasting to all others is typically what
is done in most programs we use here (ocean, atmosphere, and climate
models).
For control files, with namelists for instance, which are small but
contain global data needed by all processes, this works very well.
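
A minimal sketch of that pattern in C (the file name "control.nml" and
the parsing step are just placeholders):

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, nbytes = 0;
      char *buf = NULL;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* Only rank 0 touches the control file. */
          FILE *fp = fopen("control.nml", "rb");  /* hypothetical name */
          fseek(fp, 0, SEEK_END);
          nbytes = (int) ftell(fp);
          rewind(fp);
          buf = malloc(nbytes);
          fread(buf, 1, nbytes, fp);
          fclose(fp);
      }

      /* Everybody learns the size, then receives the contents. */
      MPI_Bcast(&nbytes, 1, MPI_INT, 0, MPI_COMM_WORLD);
      if (rank != 0)
          buf = malloc(nbytes);
      MPI_Bcast(buf, nbytes, MPI_CHAR, 0, MPI_COMM_WORLD);

      /* ... every process now parses buf as it pleases ... */

      free(buf);
      MPI_Finalize();
      return 0;
  }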

For larger files, e.g. binaries with initial-condition data on the
global grid, this is often done too, but with a twist: not through
broadcast.
These are domain-decomposition applications, where each process works
on a subdomain, and hence only needs to know the part of the data in
that subdomain (and to exchange domain boundaries with neighbor
domains/processes as the solution is marched in time).
Hence, for this type of array/grid data there is no need to broadcast
the global array; instead you scatter the data from the global domain
to the subdomains/processes.
Oftentimes even the subdomain data is quite large (say, 3D arrays) and
needs to be split into smaller chunks (say, 2D slices) to keep the
message size manageable.
In this case you scatter the smaller chunks in a loop, as in the
sketch below.
MPI derived datatypes are often used to organize the data structures
being exchanged, decomposing the data on one process and reassembling
it on another process, and so on.
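
A minimal sketch of that scatter-in-a-loop idea, assuming a global 3D
field that lives only on rank 0, a 1D decomposition along y, and
made-up dimensions NX, NY, NZ (one 2D level is scattered per
iteration):

  #include <stdlib.h>
  #include <mpi.h>

  #define NX 360   /* hypothetical global grid dimensions */
  #define NY 180
  #define NZ 40

  int main(int argc, char *argv[])
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* 1D decomposition in y: each process owns NY/nprocs rows
         (assume it divides evenly for simplicity). */
      int nyloc = NY / nprocs;
      double *global = NULL;   /* full field, allocated on rank 0 only */
      double *local  = malloc((size_t)NX * nyloc * NZ * sizeof(double));

      if (rank == 0) {
          global = malloc((size_t)NX * NY * NZ * sizeof(double));
          /* ... rank 0 reads the initial conditions into global ... */
      }

      /* Scatter one 2D level per iteration to keep messages small. */
      for (int k = 0; k < NZ; k++) {
          double *sendbuf =
              (rank == 0) ? global + (size_t)k * NX * NY : NULL;
          MPI_Scatter(sendbuf, NX * nyloc, MPI_DOUBLE,
                      local + (size_t)k * NX * nyloc, NX * nyloc,
                      MPI_DOUBLE, 0, MPI_COMM_WORLD);
      }

      /* ... time stepping on the local subdomain goes here ... */

      free(local);
      if (rank == 0) free(global);
      MPI_Finalize();
      return 0;
  }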

This technique is very traditional, the "master-slave" model, and it
predates the MPI I/O functions, parallel file systems, etc.
Its mirror image is gathering the data from all subdomains onto
process 0, which very often is the one responsible for writing the
data to the output files.
Indeed, this serializes the computation to some extent, but it is safe
for I/O on systems where, say, NFS may become a bottleneck.
(If you use local disks on the nodes of a cluster this is not a
problem, but then you need to take care of staging data in and results
out to/from the local disks, and perhaps post-processing the files.)
If I have, say, 64 processes banging together on a single NFS server,
things typically break down.
If I use MPI to gather/scatter the big data arrays and funnel the I/O
through process 0, NFS doesn't suffer.
Old fashioned but functional.  :)
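
Continuing the sketch above (same variables and dimensions, plus
<stdio.h> for the file calls), the gather-and-write side would look
roughly like this; only rank 0 ever touches the output file:

      /* Gather the local slabs back onto rank 0, one level at a time. */
      for (int k = 0; k < NZ; k++) {
          double *recvbuf =
              (rank == 0) ? global + (size_t)k * NX * NY : NULL;
          MPI_Gather(local + (size_t)k * NX * nyloc, NX * nyloc,
                     MPI_DOUBLE, recvbuf, NX * nyloc, MPI_DOUBLE,
                     0, MPI_COMM_WORLD);
      }
      if (rank == 0) {
          FILE *fp = fopen("output.bin", "wb");  /* hypothetical name */
          fwrite(global, sizeof(double), (size_t)NX * NY * NZ, fp);
          fclose(fp);
      }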

Another list subscriber pointed out a different solution, using the MPI 
I/O functions.
I believe it works well, say, for raw binary files, but we use more 
structured stuff here (e.g. NetCDF format).
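
For completeness, a minimal sketch of that alternative: with MPI I/O
every process opens the file collectively and reads its own contiguous
chunk directly (the file name and the per-process count are made up):

  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int count = 1000000;   /* doubles per process, made up */
      double *local = malloc(count * sizeof(double));

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "input.bin",  /* hypothetical name */
                    MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

      /* Each rank reads its own piece; the collective call lets the
         MPI library coordinate and optimize the accesses. */
      MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
      MPI_File_read_at_all(fh, offset, local, count, MPI_DOUBLE,
                           MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      free(local);
      MPI_Finalize();
      return 0;
  }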

I hope this helps,
Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Luiz Carlos da Costa Junior wrote:

> Hi all,
>
> Let me join this conversation. I also "suffer" from these doubts.
> In my case, I have an application in two versions, Windows (NTFS) and
> Linux (FAT32), and I first implemented the first approach (make
> one separate copy for each machine).
>
> But recently I started to deal with bigger files (200 MB ~ 1 GB) and
> this became very inefficient. Actually, the reason, I suspect, is that
> even though we have multiple processes, the hard disk device that is
> responsible for managing all these reads is just one. In other words,
> this operation is intrinsically sequential and becomes a bottleneck (am
> I right?).
>
> I haven't changed my implementation yet, but I was thinking of moving to
> the second approach (rank 0 reads and Bcasts the info), expecting to
> get better results.
>
> Does anyone have any experience?
>
> Actually, I am not sure if this will be better. I understand that MPI
> uses sockets to pass all messages, and a natural question is whether this
> operation is faster than reading from files.
>
> Best regards,
> Luiz
>
> On Wed, Oct 22, 2008 at 12:10 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>
>     How big is the file? What kind of file system is it on?
>
>     Rajeev
>
>     > -----Original Message-----
>     > From: owner-mpich-discuss at mcs.anl.gov
>     > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of
>     > Kamaraju Kusumanchi
>     > Sent: Tuesday, October 21, 2008 8:27 PM
>     > To: mpich-discuss at mcs.anl.gov
>     > Subject: [mpich-discuss] read the same file on all processes
>     >
>     > Hi all,
>     >
>     >     I have a file which needs to be read on all the processes
>     > of an MPI job. If I read the same file simultaneously on all
>     > the processes, will it cause any problems?
>     >
>     >     I can think of two other options such as
>     >
>     > - make multiple copies of the same file and read a separate
>     > file on different processes
>     > - read the file on rank 0 process, then use MPI_Bcast and
>     > transfer the contents across the remaining processes.
>     >
>     >    Which approach should be preferred? I am thinking this
>     > must be something encountered by others. So, if there is a
>     > book/web page which explains this kind of thing, a pointer
>     > to it would be most appreciated.
>     >
>     > regards
>     > raju
>     >
>     >
>
>



