[mpich-discuss] File I/O causing collective abort of all ranks

Gus Correa gus at ldeo.columbia.edu
Mon Sep 29 09:27:37 CDT 2008

Hello Brian and list

There are a lot of things that can go wrong with I/O in a parallel program.
Without the error messages, code, etc., it is just guessing.

Here are just a few more guesses, besides the ones I sent before (file name
conflict across processes, conflicting file/directory manipulation by other
processes, conflict with I/O redirection in process 0, etc.).
In case you want to check:

1) Do you have permission to write to this(these) directory(ies)?

2) Do you know the directory(ies) where the processes are actually working?

3) The MPI launcher normally puts you in your home directory.
You may also be able to cd to the working directory (where the program is)
or somewhere else.
Where home is depends on the actual computer node where you are running.
You may have a local home directory, or perhaps it may not have been set.
You may have an NFS-mounted home (or working directory), and the NFS
export/mount scheme may not be set up correctly.
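
Points 1) to 3) can be tested from inside the program itself. What follows
is only a sketch, not a tested fix. It assumes a compiler that supports the
Fortran 2003 iomsg= specifier and an MPI installation that provides the mpi
module. The file name probe_<rank>.tmp and the unit number 12 are made up
for illustration. Each process tries to create and then delete a small
scratch file in whatever directory it is running in, and reports the result.

```fortran
program probe_write
  use mpi
  implicit none
  integer :: ierr, proc_id, ios
  integer, parameter :: probe_unit = 12      ! arbitrary, away from 0,1,2,5,6
  character(len=32)  :: fname
  character(len=256) :: msg

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, proc_id, ierr)

  ! One file name per process, so that processes sharing a directory
  ! do not step on each other (hypothetical naming scheme).
  write(fname, '(a,i0,a)') 'probe_', proc_id, '.tmp'

  msg = ''
  open(unit=probe_unit, file=trim(fname), status='replace', action='write', &
       iostat=ios, iomsg=msg)

  if (ios /= 0) then
     ! A nonzero iostat means the open failed; msg holds the runtime's reason.
     write(*, '(a,i0,a,i0,2a)') 'rank ', proc_id, ': open failed, iostat=', &
          ios, ' ', trim(msg)
  else
     write(*, '(a,i0,2a)') 'rank ', proc_id, ': can write ', trim(fname)
     close(probe_unit, status='delete')      ! remove the scratch file
  end if

  call MPI_Finalize(ierr)
end program probe_write
```

With iostat= present, a failed open returns an error code instead of
terminating the program, so a permission or path problem shows up as a
message. It will not catch a process that is killed from outside (signal 9).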

You can use print/flush statements or calls to system commands to see where
the processes are running, where the program breaks, and why.
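
For instance, a few flushed print statements around the suspect code, plus
one call to the shell, would show how far each process gets and on which
host and directory it is running. Again, this is a sketch under assumptions:
it relies on the Fortran 2003 flush statement and the Fortran 2008
execute_command_line intrinsic (older compilers typically offer a
nonstandard system call instead), and the checkpoint labels and the
checkpoint routine are made up for illustration.

```fortran
program where_am_i
  use mpi
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none
  integer :: ierr, proc_id, name_len
  character(len=MPI_MAX_PROCESSOR_NAME) :: host

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, proc_id, ierr)
  call MPI_Get_processor_name(host, name_len, ierr)

  call checkpoint('started')

  ! Ask the shell which host and directory this process is running in.
  call execute_command_line('echo "$(hostname): $(pwd)"')

  if (proc_id == 0) then
     call checkpoint('before open')
     ! ... the open/write block under suspicion goes here ...
     call checkpoint('after write')
  end if

  call MPI_Finalize(ierr)

contains

  ! Print a labelled line and flush it at once, so that the message is
  ! not lost in a buffer if the process dies right afterwards.
  subroutine checkpoint(label)
    character(len=*), intent(in) :: label
    write(output_unit, '(a,i0,4a)') 'rank ', proc_id, ' on ', &
         host(1:name_len), ': ', label
    flush(output_unit)
  end subroutine checkpoint

end program where_am_i
```

The last checkpoint that shows up in the output brackets the statement
where the process hangs or gets killed.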

Gus Correa

Brian Harker wrote:

>No, it doesn't exist, but I have tried it with status="replace" and
>status="unknown" just in case, and it still craps out.  Haven't tried
>it without specifying the status keyword, will give it a go and see
>what happens.  thanks!
>On Fri, Sep 26, 2008 at 4:06 PM, Martin Siegert <siegert at sfu.ca> wrote:
>>Maybe a stupid question, but did you check whether the file fubar.dat
>>exists already?
>>Or try to compile with
>>open( unit = 1, file = "fubar.dat")
>>Does that work?
>>Martin Siegert
>>Head, Research Computing
>>WestGrid Site Lead
>>IT Services                                phone: 778 782-4691
>>Simon Fraser University                    fax:   778 782-4242
>>Burnaby, British Columbia                  email: siegert at sfu.ca
>>Canada  V5A 1S6
>>On Fri, Sep 26, 2008 at 03:31:39PM -0600, Brian Harker wrote:
>>>Well, no luck with the fresh install.  :(  Still can't write to file.
>>>I also had an inkling that perhaps I had installed the intel fortran
>>>compiler before the intel c compiler, so that my ifort was built with
>>>gcc instead of icc, so I re-installed my compilers, icc and icpc
>>>first, then ifort, then rebuilt mpich2.  No go.  Any other ideas?
>>>On Tue, Sep 23, 2008 at 4:40 PM, Brian Harker <brian.harker at gmail.com> wrote:
>>>>Hi Gus-
>>>>Ha! I sure do remember the clean install!  :)
>>>>As far as unit numbers go, I've tried many different ones between 1
>>>>and 99, still no luck.  Tonight I'll try "make clean" followed by a
>>>>fresh install to see what happens.  Cheers!
>>>>On Tue, Sep 23, 2008 at 4:29 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>>>>Hi Brian and list
>>>>>Wild guesses:
>>>>>1) Any chance that different processes use the same file name ("fubar.dat"
>>>>>or other)?
>>>>>2) Or perhaps that processes somehow manipulate whole files or directories,
>>>>>with OS/shell calls to cp, mv, rm, etc?
>>>>>I was betting on "unit=1" being the source of the problem,
>>>>>perhaps combined with I/O redirection ( "<" and ">") of your program in the
>>>>>mpirun/mpiexec command.
>>>>>However, you say you already tried to use other unit numbers with no luck.
>>>>>Did you try it on this part of the code, for process 0, with something
>>>>>different from 0,1,2,5,6?
>>>>>Yet another thing to try is a fresh compilation, preceded by a "make
>>>>>cleanall" of sorts,
>>>>>just to avoid leftover object files from ancient builds and outdated source
>>>>>Remember that?  :)
>>>>>Gus Correa
>>>>>Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>>>>>Lamont-Doherty Earth Observatory - Columbia University
>>>>>P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>>>>>Brian Harker wrote:
>>>>>>Hi Gus-
>>>>>>I have tried different unit numbers as well, and this master process
>>>>>>file-write is the only process with a hardwired unit number.  The
>>>>>>slave-writes I have treated very similarly to your suggestion.
>>>>>>On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>>>>>>Hello Brian and list
>>>>>>>Some guesses.
>>>>>>>1) Have you tried to use a different unit number for the file being written,
>>>>>>>instead of 1, say 12, for instance?
>>>>>>>Old Fortran liked to use 5 and 6 for stdin and stdout,
>>>>>>>whereas Unix and Linux prefer to use 0, 1, 2 for stdin, stdout, and stderr.
>>>>>>>For regular files I prefer to stay away from these magic numbers,
>>>>>>>just in case the OS and the compiler try to enforce their own
>>>>>>>fight each other, and perhaps don't change the file handle number from
>>>>>>>program source
>>>>>>>in a sensible way.
>>>>>>>2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of process 0 in
>>>>>>>a special way,
>>>>>>>which may be the reason why your process 0 fails, but not the others,
>>>>>>>particularly if you are redirecting stdin and stdout with "<" and ">".
>>>>>>>(This used to be the case in the past, I am not sure if it still is.
>>>>>>>The MPICH experts may have something better to say about it.)
>>>>>>>3) If there are nodes with more than one process running (SMP) I don't know
>>>>>>>if hardwiring the
>>>>>>>same file unit number on all processes is a good idea (in case you used the
>>>>>>>same number for all of them).
>>>>>>>Something like 12+proc_id, or perhaps 12+mod(proc_id,
>>>>>>>number_of_processes_per_node)  may avoid potential file handle number
>>>>>>>conflict across different processes under the same (SMP) OS on a node.
>>>>>>>My two guessed cents,
>>>>>>>Gus Correa
>>>>>>>Brian Harker wrote:
>>>>>>>>Hello list-
>>>>>>>>I have a problem with process 0 being able to open a file for writing
>>>>>>>>and subsequently write to it.  The pertinent section of code looks as follows:
>>>>>>>>if ( proc_id == 0 ) then
>>>>>>>>open( unit = 1, file = "fubar.dat", status="new" )
>>>>>>>>do j = 1, ny
>>>>>>>> write(1,*) ( array(i,j), i = 1, nx )
>>>>>>>>end do
>>>>>>>>end if
>>>>>>>>When this part of the code is reached, the program seems to hang for a
>>>>>>>>long time while trying to open the file, then spits out the following
>>>>>>>>error message:
>>>>>>>>rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>>>>>>>>exit status of rank 0: killed by signal 9
>>>>>>>>I am confused about this error, because it is seemingly isolated to
>>>>>>>>this particular write-to-file by process 0.  During execution, my
>>>>>>>>slave processes write out other files using this exact same syntax.
>>>>>>>>Has anyone run across this?  I can't seem to find any useful
>>>>>>>>information on the interweb.  I have run into this problem with both
>>>>>>>>MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel fortran
>>>>>>>>compiler, ifort 10.1.012.
>>>>>>>>Thanks in advance for any input!
>>>>brian.harker at gmail.com
>>>>"In science, there is only physics; all the rest is stamp-collecting."
>>>> -Ernest Rutherford
