[mpich-discuss] File I/O causing collective abort of all ranks

Martin Siegert siegert at sfu.ca
Fri Sep 26 17:06:38 CDT 2008


Maybe a stupid question, but did you check whether the file fubar.dat
exists already?

Or try to compile with
open( unit = 1, file = "fubar.dat")
Does that work?
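
To make the two suggestions above concrete, here is a minimal,
self-contained sketch, not a drop-in fix. With status="new" the open
is an error if fubar.dat is already there, so the sketch checks for the
file first and asks for an iostat value instead of letting the runtime
abort. Everything not in your snippet is an assumption on my part: the
sizes nx and ny, unit number 12, the names out_unit, ios and
already_there, and the use of a separate outer loop index j (your
snippet reuses i for both the do loop and the implied do, which I am
guessing is not what you intended).

```fortran
program write_fubar
  implicit none
  integer, parameter :: nx = 4, ny = 3   ! hypothetical array sizes
  integer, parameter :: out_unit = 12    ! assumed unit, away from 0, 1, 2, 5, 6
  real    :: array(nx, ny)
  integer :: i, j, ios
  logical :: already_there

  array = 0.0

  ! Report whether the file exists; status="new" would fail in that case.
  inquire( file = "fubar.dat", exist = already_there )
  if ( already_there ) then
     write(*,*) "fubar.dat already exists; status='new' would fail"
  end if

  ! status="replace" creates the file or overwrites an existing one.
  ! iostat returns a nonzero code on failure instead of aborting.
  open( unit = out_unit, file = "fubar.dat", status = "replace", &
        action = "write", iostat = ios )
  if ( ios /= 0 ) then
     write(*,*) "open failed, iostat = ", ios
     stop 1
  end if

  ! Outer loop over j, implied do over i: one record of nx values per j.
  do j = 1, ny
     write(out_unit,*) ( array(i,j), i = 1, nx )
  end do
  close(out_unit)
end program write_fubar
```

In the MPI program this would sit inside the "if ( proc_id == 0 )"
block, and calling MPI_Abort is probably a better reaction to a failed
open than a plain stop, so that the other ranks do not wait forever.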

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

On Fri, Sep 26, 2008 at 03:31:39PM -0600, Brian Harker wrote:
> Well, no luck with the fresh install.  :(  Still can't write to file.
> 
> I also had an inkling that perhaps I had installed the intel fortran
> compiler before the intel c compiler, so that my ifort was built with
> gcc instead of icc, so I re-installed my compilers, icc and icpc
> first, then ifort, then rebuilt mpich2.  No go.  Any other ideas?
> 
> On Tue, Sep 23, 2008 at 4:40 PM, Brian Harker <brian.harker at gmail.com> wrote:
> > Hi Gus-
> >
> > Ha! I sure do remember the clean install!  :)
> >
> > As far as unit numbers go, I've tried many different ones between 1
> > and 99, still no luck.  Tonight I'll try "make clean" followed by a
> > fresh install to see what happens.  Cheers!
> >
> >
> >
> > On Tue, Sep 23, 2008 at 4:29 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> >> Hi Brian and list
> >>
> >> Wild guesses:
> >>
> >> 1) Any chance that different processes use the same file name ("fubar.dat"
> >> or other)?
> >> 2) Or perhaps that processes somehow manipulate whole files or directories,
> >> with OS/shell calls to cp, mv, rm, etc?
> >>
> >> I was betting on "unit=1" being the source of the problem,
> >> perhaps combined with I/O redirection ( "<" and ">") of your program in the
> >> mpirun/mpiexec command.
> >> However, you say you already tried to use other unit numbers with no
> >> success.
> >> Did you try it on this part of the code, for process 0, with something
> >> different from 0,1,2,5,6?
> >>
> >> Yet another thing to try is a fresh compilation, preceded by a "make
> >> cleanall" of sorts,
> >> just to avoid leftover object files from ancient builds and outdated source
> >> code.
> >> Remember that?  :)
> >>
> >> Gus Correa
> >>
> >> --
> >> ---------------------------------------------------------------------
> >> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> >> Lamont-Doherty Earth Observatory - Columbia University
> >> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> >> ---------------------------------------------------------------------
> >>
> >>
> >> Brian Harker wrote:
> >>
> >>> Hi Gus-
> >>>
> >>> I have tried different unit numbers as well, and this master process
> >>> file-write is the only process with a hardwired unit number.  The
> >>> slave-writes I have treated very similarly to your suggestion.
> >>>
> >>> On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> >>>
> >>>>
> >>>> Hello Brian and list
> >>>>
> >>>> Some guesses.
> >>>>
> >>>> 1) Have you tried to use a different unit number for the file being
> >>>> opened,
> >>>> instead of 1, say 12,  for instance?
> >>>> Old Fortran liked to use 5 and 6 for stdin and stdout,
> >>>> whereas Unix and Linux prefer to use 0, 1, 2 for stdin, stdout, and
> >>>> stderr.
> >>>> For regular files I prefer to stay away from these magic numbers,
> >>>> just in case the OS and the compiler try to enforce their own
> >>>> preferences,
> >>>> fight each other, and perhaps don't change the file handle number from
> >>>> the
> >>>> program source
> >>>> in a sensible way.
> >>>>
> >>>> 2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of process 0
> >>>> in
> >>>> a special way,
> >>>> which may be the reason why your process 0 fails, but not the others,
> >>>> particularly if you are redirecting stdin and stdout with "<" and ">".
> >>>> (This used to be the case in the past, I am not sure if it still is.
> >>>> The MPICH experts may have something better to say about it.)
> >>>>
> >>>> 3) If there are nodes with more than one process running (SMP) I don't
> >>>> know
> >>>> if hardwiring the
> >>>> same file unit number on all processes is a good idea (in case you used
> >>>> the
> >>>> same number for all of them).
> >>>> Something like 12+proc_id, or perhaps 12+mod(proc_id,
> >>>> number_of_processes_per_node)  may avoid potential file handle number
> >>>> conflict across different processes under the same (SMP) OS on a node.
> >>>>
> >>>> My two guessed cents,
> >>>> Gus Correa
> >>>>
> >>>> --
> >>>> ---------------------------------------------------------------------
> >>>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> >>>> Lamont-Doherty Earth Observatory - Columbia University
> >>>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> >>>> ---------------------------------------------------------------------
> >>>>
> >>>>
> >>>> Brian Harker wrote:
> >>>>
> >>>>
> >>>>>
> >>>>> Hello list-
> >>>>>
> >>>>> I have a problem with process 0 being able to open a file for writing
> >>>>> and subsequently write to it.  The pertinent section of code looks as
> >>>>> follows:
> >>>>>
> >>>>> ========================================
> >>>>> if ( proc_id == 0 ) then
> >>>>>
> >>>>> open( unit = 1, file = "fubar.dat", status="new" )
> >>>>> do i = 1, ny
> >>>>>  write(1,*) ( array(i,j), i = 1, nx )
> >>>>> end do
> >>>>> close(1)
> >>>>>
> >>>>> end if
> >>>>> ========================================
> >>>>>
> >>>>> When this part of the code is reached, the program seems to hang for a
> >>>>> long time while trying to open the file, then spits out the following
> >>>>> error message:
> >>>>>
> >>>>> rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
> >>>>> exit status of rank 0: killed by signal 9
> >>>>>
> >>>>> I am confused about this error, because it is seemingly isolated to
> >>>>> this particular write-to-file by process 0.  During execution, my
> >>>>> slave processes write out other files using this exact same syntax.
> >>>>> Has anyone run across this?  I can't seem to find any useful
> >>>>> information on the interweb.  I have run into this problem with both
> >>>>> MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel fortran
> >>>>> compiler, ifort 10.1.012.
> >>>>>
> >>>>> Thanks in advance for any input!
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
> > --
> > Cheers,
> > Brian
> > brian.harker at gmail.com
> >
> >
> > "In science, there is only physics; all the rest is stamp-collecting."
> >  -Ernest Rutherford
> >
> 
> 
> 
> -- 
> Cheers,
> Brian
> brian.harker at gmail.com
> 
> 
> "In science, there is only physics; all the rest is stamp-collecting."
> 
> -Ernest Rutherford



