[mpich-discuss] File I/O causing collective abort of all ranks

Martin Siegert siegert at sfu.ca
Mon Sep 29 12:53:43 CDT 2008


Hi Brian,

On Mon, Sep 29, 2008 at 10:32:35AM -0600, Brian Harker wrote:
> Hi Gus-
> 
> Thanks for the ideas... I do have write permissions to the desired
> directories.  I have heard a lot about how Fortran 90 can be temperamental
> and doesn't always play nice with MPI, so in that vein I am currently
> re-writing all my code (!) in C.  My hope is that since the MPI libraries
> are implemented in C, I won't have these issues (or at least some
> less-difficult ones ;)  Thanks for all the input!

For what it's worth: the following program compiles and executes fine
with ifort-10.1.015 and mpich2-1.0.7:

program prtest
use mpi
implicit none
double precision, allocatable :: array(:,:)
integer :: i, j, nx, ny
integer :: numprocs, proc_id, mpierr

   call MPI_Init(mpierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, proc_id, mpierr)
   call MPI_Comm_size(MPI_COMM_WORLD, numprocs, mpierr)
   ! rank 0 reads the dimensions from stdin and broadcasts them to all ranks
   if (proc_id == 0) read*, nx, ny
   call MPI_Bcast(nx, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, mpierr)
   call MPI_Bcast(ny, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, mpierr)
   allocate(array(nx,ny))
   array = dble(proc_id)
   ! only rank 0 writes; status="new" fails if fubar.dat already exists,
   ! hence the "rm -f fubar.dat" in the run line below
   if ( proc_id == 0 ) then
      open( unit = 1, file = "fubar.dat", status="new" )
      do j = 1, ny
         write(1,*) ( array(i,j), i = 1, nx )
      end do
      close(1)
   end if
   call MPI_Finalize(mpierr)
end program prtest

# mpif90 -O3 -o prtest prtest.f90
# rm -f fubar.dat; mpiexec -n 2 ./prtest < pr.in

Thus, this has nothing to do with Fortran or C.

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

> On Mon, Sep 29, 2008 at 8:27 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> > Hello Brian and list
> >
> > There are a lot of things that can go wrong with I/O in a parallel
> > environment.
> > Without the error messages, code, etc., it is just guessing.
> >
> > Here are just a few guesses, besides the others I sent before (file name
> > conflict across processes,
> > conflicting file/directory manipulation by other processes, conflict with
> > I/O redirection in process 0, etc).
> > In case you want to check:
> >
> > 1) Do you have permission to write to this(these) directory(ies)?
> >
> > 2) Do you know the directory(ies) where the processes are actually working?
> >
> > 3) The MPI launcher normally puts you in your home directory.
> > You may also be able to cd to the working directory (where the
> > program is) or somewhere else.
> > Where home is depends on the actual compute node where you are running.
> > You may have a local home directory, or perhaps it has not been set up.
> > You may have an NFS-mounted home (or working directory), and the NFS
> > export/mount scheme may not be set up correctly.
> >
> > You can use print/flush or system calls to see where the processes are
> > running, where the program breaks, and why; see the sketch below.
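> >
> > For example, something along these lines (an untested sketch; where_am_i
> > is just a hypothetical helper name, and getcwd/flush here are ifort's
> > IFPORT portability routines, not standard Fortran, so adjust for your
> > compiler):
> >
> > subroutine where_am_i(proc_id)
> >    use mpi
> >    use ifport                    ! ifort extension providing getcwd/flush
> >    implicit none
> >    integer, intent(in) :: proc_id
> >    character(len=256) :: cwd
> >    character(len=MPI_MAX_PROCESSOR_NAME) :: host
> >    integer :: istat, namelen, mpierr
> >
> >    istat = getcwd(cwd)           ! working directory of this rank
> >    call MPI_Get_processor_name(host, namelen, mpierr)
> >    print *, 'rank ', proc_id, ' on ', trim(host), ' cwd: ', trim(cwd)
> >    call flush(6)                 ! push it out before a possible crash
> > end subroutine where_am_i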
> >
> > Gus Correa
> >
> > Brian Harker wrote:
> >
> >> No, it doesn't exist, but I have tried it with status="replace" and
> >> status="unknown" just in case, and it still craps out.  Haven't tried
> >> it without specifying the status keyword; I'll give it a go and see
> >> what happens.  Thanks!
> >>
> >> On Fri, Sep 26, 2008 at 4:06 PM, Martin Siegert <siegert at sfu.ca> wrote:
> >>
> >>>
> >>> Maybe a stupid question, but did you check whether the file fubar.dat
> >>> exists already?
> >>>
> >>> Or try it with
> >>> open( unit = 1, file = "fubar.dat" )
> >>> i.e., without the status keyword.  Does that work?
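> >>>
> >>> If not, an iostat check would at least tell you why the open fails
> >>> instead of killing the run (untested sketch; mpierr as in your
> >>> program, and unit 12 picked arbitrarily):
> >>>
> >>> integer :: ios
> >>> open( unit = 12, file = "fubar.dat", status = "new", iostat = ios )
> >>> if ( ios /= 0 ) then
> >>>    print *, "open of fubar.dat failed, iostat = ", ios
> >>>    call MPI_Abort( MPI_COMM_WORLD, 1, mpierr )
> >>> end if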
> >>>
> >>> Cheers,
> >>> Martin
> >>>
> >>> --
> >>> Martin Siegert
> >>> Head, Research Computing
> >>> WestGrid Site Lead
> >>> IT Services                                phone: 778 782-4691
> >>> Simon Fraser University                    fax:   778 782-4242
> >>> Burnaby, British Columbia                  email: siegert at sfu.ca
> >>> Canada  V5A 1S6
> >>>
> >>> On Fri, Sep 26, 2008 at 03:31:39PM -0600, Brian Harker wrote:
> >>>
> >>>>
> >>>> Well, no luck with the fresh install.  :(  Still can't write to file.
> >>>>
> >>>> I also had an inkling that perhaps I had installed the Intel Fortran
> >>>> compiler before the Intel C compiler, so that my ifort was built with
> >>>> gcc instead of icc, so I re-installed my compilers, icc and icpc
> >>>> first, then ifort, then rebuilt mpich2.  No go.  Any other ideas?
> >>>>
> >>>> On Tue, Sep 23, 2008 at 4:40 PM, Brian Harker <brian.harker at gmail.com>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>> Hi Gus-
> >>>>>
> >>>>> Ha! I sure do remember the clean install!  :)
> >>>>>
> >>>>> As far as unit numbers go, I've tried many different ones between 1
> >>>>> and 99, still no luck.  Tonight I'll try "make clean" followed by a
> >>>>> fresh install to see what happens.  Cheers!
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 23, 2008 at 4:29 PM, Gus Correa <gus at ldeo.columbia.edu>
> >>>>> wrote:
> >>>>>
> >>>>>>
> >>>>>> Hi Brian and list
> >>>>>>
> >>>>>> Wild guesses:
> >>>>>>
> >>>>>> 1) Any chance that different processes use the same file name
> >>>>>> ("fubar.dat"
> >>>>>> or other)?
> >>>>>> 2) Or perhaps that processes somehow manipulate whole files or
> >>>>>> directories,
> >>>>>> with OS/shell calls to cp, mv, rm, etc?
> >>>>>>
> >>>>>> I was betting on "unit=1" being the source of the problem,
> >>>>>> perhaps combined with I/O redirection ( "<" and ">") of your program
> >>>>>> in the
> >>>>>> mpirun/mpiexec command.
> >>>>>> However, you say you already tried to use other unit numbers with no
> >>>>>> success.
> >>>>>> Did you try, on this part of the code (for process 0), something
> >>>>>> different from 0, 1, 2, 5, or 6?
> >>>>>>
> >>>>>> Yet another thing to try is a fresh compilation, preceded by a "make
> >>>>>> cleanall" of sorts,
> >>>>>> just to avoid leftover object files from ancient builds and outdated
> >>>>>> source
> >>>>>> code.
> >>>>>> Remember that?  :)
> >>>>>>
> >>>>>> Gus Correa
> >>>>>>
> >>>>>> --
> >>>>>> ---------------------------------------------------------------------
> >>>>>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> >>>>>> Lamont-Doherty Earth Observatory - Columbia University
> >>>>>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> >>>>>> ---------------------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>> Brian Harker wrote:
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Hi Gus-
> >>>>>>>
> >>>>>>> I have tried different unit numbers as well, and this
> >>>>>>> master-process file-write is the only one with a hardwired unit
> >>>>>>> number.  The slave writes I have treated very much along the lines
> >>>>>>> of your suggestion.
> >>>>>>>
> >>>>>>> On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Hello Brian and list
> >>>>>>>>
> >>>>>>>> Some guesses.
> >>>>>>>>
> >>>>>>>> 1) Have you tried a different unit number for the file being
> >>>>>>>> opened, say 12, instead of 1?
> >>>>>>>> Old Fortran liked to use 5 and 6 for stdin and stdout,
> >>>>>>>> whereas Unix and Linux prefer 0, 1, and 2 for stdin, stdout, and
> >>>>>>>> stderr.
> >>>>>>>> For regular files I prefer to stay away from these magic numbers,
> >>>>>>>> just in case the OS and the compiler try to enforce their own
> >>>>>>>> preferences, fight each other, and fail to map the unit number in
> >>>>>>>> the program source to a file handle in a sensible way.
> >>>>>>>>
> >>>>>>>> 2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of
> >>>>>>>> process 0 in a special way, which may be the reason why your
> >>>>>>>> process 0 fails but not the others, particularly if you are
> >>>>>>>> redirecting stdin and stdout with "<" and ">".
> >>>>>>>> (This used to be the case in the past; I am not sure if it still
> >>>>>>>> is.  The MPICH experts may have something better to say about it.)
> >>>>>>>>
> >>>>>>>> 3) If there are nodes running more than one process (SMP), I
> >>>>>>>> don't know if hardwiring the same file unit number on all
> >>>>>>>> processes is a good idea (in case you used the same number for
> >>>>>>>> all of them).
> >>>>>>>> Something like 12+proc_id, or perhaps
> >>>>>>>> 12+mod(proc_id, number_of_processes_per_node), may avoid a
> >>>>>>>> potential file-handle conflict across different processes under
> >>>>>>>> the same (SMP) OS on a node; see the sketch below.
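> >>>>>>>>
> >>>>>>>> For instance (untested sketch; one unit and one file per rank,
> >>>>>>>> with the per-rank name fubar_<rank>.dat and the i0 edit
> >>>>>>>> descriptor, assuming your compiler supports it):
> >>>>>>>>
> >>>>>>>> integer :: myunit
> >>>>>>>> character(len=32) :: fname
> >>>>>>>>
> >>>>>>>> myunit = 12 + proc_id                  ! unique unit per rank
> >>>>>>>> write(fname, '(a,i0,a)') "fubar_", proc_id, ".dat"
> >>>>>>>> open( unit = myunit, file = fname, status = "new" )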
> >>>>>>>>
> >>>>>>>> My two guessed cents,
> >>>>>>>> Gus Correa
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> ---------------------------------------------------------------------
> >>>>>>>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> >>>>>>>> Lamont-Doherty Earth Observatory - Columbia University
> >>>>>>>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> >>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Brian Harker wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hello list-
> >>>>>>>>>
> >>>>>>>>> I have a problem with process 0 being able to open a file for
> >>>>>>>>> writing
> >>>>>>>>> and subsequently write to it.  The pertinent section of code looks
> >>>>>>>>> as
> >>>>>>>>> follows:
> >>>>>>>>>
> >>>>>>>>> ========================================
> >>>>>>>>> if ( proc_id == 0 ) then
> >>>>>>>>>
> >>>>>>>>> open( unit = 1, file = "fubar.dat", status="new" )
> >>>>>>>>> do i = 1, ny
> >>>>>>>>> write(1,*) ( array(i,j), i = 1, nx )
> >>>>>>>>> end do
> >>>>>>>>> close(1)
> >>>>>>>>>
> >>>>>>>>> end if
> >>>>>>>>> ========================================
> >>>>>>>>>
> >>>>>>>>> When this part of the code is reached, the program seems to hang
> >>>>>>>>> for a
> >>>>>>>>> long time while trying to open the file, then spits out the
> >>>>>>>>> following
> >>>>>>>>> error message:
> >>>>>>>>>
> >>>>>>>>> rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all
> >>>>>>>>> ranks
> >>>>>>>>> exit status of rank 0: killed by signal 9
> >>>>>>>>>
> >>>>>>>>> I am confused about this error, because it is seemingly isolated to
> >>>>>>>>> this particular write-to-file by process 0.  During execution, my
> >>>>>>>>> slave processes write out other files using this exact same syntax.
> >>>>>>>>> Has anyone run across this?  I can't seem to find any useful
> >>>>>>>>> information on the interweb.  I have run into this problem with
> >>>>>>>>> both
> >>>>>>>>> MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel Fortran
> >>>>>>>>> compiler, ifort 10.1.012.
> >>>>>>>>>
> >>>>>>>>> Thanks in advance for any input!
> >>>>>>>>>
> >>>>> --
> >>>>> Cheers,
> >>>>> Brian
> >>>>> brian.harker at gmail.com
> >>>>>
> >>>>>
> >>>>> "In science, there is only physics; all the rest is stamp-collecting."
> >>>>> -Ernest Rutherford
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Cheers,
> >>>> Brian
> >>>> brian.harker at gmail.com
> >>>>
> >>>>
> >>>> "In science, there is only physics; all the rest is stamp-collecting."
> >>>>
> >>>> -Ernest Rutherford
>
> -- 
> Cheers,
> Brian
> brian.harker at gmail.com
> 
> 
> "In science, there is only physics; all the rest is stamp-collecting."
> 
> -Ernest Rutherford



