[petsc-users] Binary I/O

Aron Ahmadia aron.ahmadia at kaust.edu.sa
Sat Apr 7 13:56:17 CDT 2012


Just to clarify this a little more, as it came up recently in some of
Amal's thesis work using PyClaw.  On a BlueGene/P the I/O nodes are
distributed at some (usually fixed) density throughout the network.  On
the BlueGene/P system at KAUST, most of the machine is configured at a
density of 128 compute nodes per I/O node.  If you use a non-MPIIO
binary viewer, you are limited by the network connection of a single
I/O node to the parallel file system, which in this case is 10 Gb/s.
The I/O nodes on BG/P are relatively low-powered ppc450 nodes, so it can
be rather difficult for them to drive a full 10 Gb/s (1.25 GB/s), and
they are also limited by the collective network between the compute
nodes and each I/O node (I believe this is 0.75 GB/s on our system).

The throughput of our parallel file system is limited by the file system
controllers (each controller is backed by 8 servers), which have a
theoretical limit of 3 GB/s each.  We have a relatively low-powered
system with 4 controllers, which brings our theoretical aggregate I/O to
12 GB/s.  When we do not enable MPIIO, we are limited to at most
1.25 GB/s, because we are only pushing on one of the I/O nodes, so we
can theoretically achieve at most about 10% of the throughput capability
of the file system.  Switching on MPIIO removes the I/O nodes as the
theoretical bottleneck once 16 of them are in play (16 x 0.75 GB/s =
12 GB/s), which corresponds to 2048 compute nodes, or 2 racks of BG/P.
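
In PETSc terms, this is just a matter of turning on MPI-IO for the
binary viewer, either with the -viewer_binary_mpiio option Barry
mentions below or programmatically.  Here is a minimal sketch; the file
name and the wrapper function are placeholders, and
PetscViewerBinarySetUseMPIIO is the name used in the PETSc man pages, so
double-check it against your PETSc version:

    #include <petscvec.h>
    #include <petscviewer.h>

    /* Write a parallel Vec to a single binary file, asking the viewer
       to use the collective MPI-IO path instead of funneling everything
       through rank 0. */
    PetscErrorCode WriteVecBinaryMPIIO(Vec x)
    {
      PetscViewer    viewer;
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"vec.dat",
                                   FILE_MODE_WRITE,&viewer);CHKERRQ(ierr);
      ierr = PetscViewerBinarySetUseMPIIO(viewer,PETSC_TRUE);CHKERRQ(ierr);
      ierr = VecView(x,viewer);CHKERRQ(ierr); /* all ranks take part in the write */
      ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

Equivalently, leave the code alone and run with -viewer_binary_mpiio.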

A

On Thu, Oct 13, 2011 at 5:05 AM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
> On Oct 12, 2011, at 7:40 PM, Mohamad M. Nasr-Azadani wrote:
>
> > Hi again,
> >
> > On a similar topic, how hard would it be to write a function similar
> > to PETSc's VecView() associated with the binary writer to do exactly
> > the same thing, i.e. write a parallel vector into one single file,
> > but where all the processors perform the write simultaneously?
>
>    Impossible. In the end in a standard filesystem the physical hard disk
> is connected to a single CPU and everything that gets written to that hard
> disk has to go through that one CPU; there is no way physically for a bunch
> of CPUs to write together onto a single physical disk.
>
>   Now in high-end parallel file systems each file may be spread over
> several hard disks (this is sometimes called striping), say 8. In that
> case there is some parallelism in writing the data, since eight
> different parts of the vector can be sent through 8 different CPUs to
> 8 disks. But note that in general the number of disks a file is spread
> over is small, like 8; it is never 10,000. When you have a parallel
> file system and use the option -viewer_binary_mpiio, the PETSc
> VecView() uses MPI IO to do the writing, and you do get this level of
> parallelism, so you may get slightly better performance than not using
> MPI IO.
>
>   If you are seeing long wait times in VecView() with the binary file,
> it is likely because the file server is connected to the actual compute
> nodes of the parallel machine via some pretty slow network, and it has
> nothing to do with the details of VecView(). You need to make sure you
> are writing directly to a disk that is on a compute node of the
> parallel machine, not over some network using NFS (Network File
> System); this can make a huge difference in time.
>
>   Barry
>
> >
> > Best,
> > Mohamad
> >
> >
> > On Wed, Oct 12, 2011 at 4:17 PM, Mohamad M. Nasr-Azadani <
> > mmnasr at gmail.com> wrote:
> > Thanks Barry. That makes perfect sense.
> >
> > Best,
> > Mohamad
> >
> >
> > On Wed, Oct 12, 2011 at 3:50 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
> >
> > On Oct 12, 2011, at 5:42 PM, Mohamad M. Nasr-Azadani wrote:
> >
> > > Hi everyone,
> > >
> > > I think I know the answer to my question, but I was double checking.
> > > When using
> > > PetscViewerBinaryOpen();
> > >
> > > It is mentioned that
> > > "For writing files it only opens the file on processor 0 in the
> communicator."
> > >
> > > Does that mean when writing a parallel vector to file using VecView(),
> all the data from other processors is first sent to processor zero and then
> dumped into the file?
> >
> >   No, the data is not all sent to process zero before writing. That
> > is, process 0 does not need enough memory to store all the data
> > before writing.
> >
> >    Instead, the processes take turns sending data to process 0,
> > which immediately writes it out to disk.
> >
> > > If so, that would be a very slow process for big datasets and a
> > > large number of processors?
> >
> >   For fewer than a few thousand processes this is completely fine,
> > and nothing else would be much faster.
> >
> > > Any suggestions to speed that process up?
> >
> >   We have the various MPI IO options, which use MPI IO to have
> > several processes writing to disk at the same time; that is useful
> > for very large numbers of processes.
> >
> >   Barry
> >
> > >
> > > Best,
> > > Mohamad
> > >
> >
> >
> >
>
>
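
For anyone curious what the non-MPIIO path Barry describes above amounts
to, here is a rough conceptual sketch of the take-turns pattern in plain
MPI (not PETSc's actual implementation, and it assumes the vector is
stored as rank-ordered contiguous chunks of doubles): each rank in turn
sends its local chunk to rank 0, which writes the chunk immediately, so
rank 0 never needs to hold the whole vector in memory.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Rank 0 writes its own chunk, then receives and writes one remote
       chunk at a time; every other rank just sends its chunk when its
       turn comes. */
    static void write_in_turns(const double *local, int nlocal,
                               const char *fname)
    {
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {
        FILE *fp = fopen(fname, "wb");
        fwrite(local, sizeof(double), (size_t)nlocal, fp);
        for (int src = 1; src < size; src++) {
          MPI_Status st;
          int        n;
          MPI_Probe(src, 0, MPI_COMM_WORLD, &st);     /* size of src's chunk */
          MPI_Get_count(&st, MPI_DOUBLE, &n);
          double *buf = (double *)malloc((size_t)n * sizeof(double));
          MPI_Recv(buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          fwrite(buf, sizeof(double), (size_t)n, fp); /* written immediately */
          free(buf);
        }
        fclose(fp);
      } else {
        MPI_Send((void *)local, nlocal, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
      }
    }

The serial funnel through rank 0 (and, on BG/P, through its single I/O
node) is exactly what the MPI-IO path avoids.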

