[mpich-discuss] File I/O causing collective abort of all ranks

Brian Harker brian.harker at gmail.com
Mon Sep 29 11:32:35 CDT 2008


Hi Gus-

Thanks for the ideas. I do have write permissions to the desired
directories.  I have heard a lot about how Fortran 90 is argumentative
and sometimes doesn't play nicely with MPI, so in that vein I am
currently rewriting all my code (!) in C.  My hope is that, since the
MPI libraries are implemented in C, I won't have these issues (or at
least only some less difficult ones ;).  Thanks for all the input!
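
For reference, a minimal sketch of what the rank-0 write might look like
in C, with explicit error checks. This is an illustration only, not code
from the thread: write_array_checked and its parameters are hypothetical
names, the array is assumed to be stored row by row (ny rows of nx
values), getcwd assumes a POSIX system, and in a real program the rank
would come from MPI_Comm_rank. It prints the working directory, as
suggested in the quoted message below, and reports why an open fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* getcwd: POSIX, assumed to be available */

/* Hypothetical helper: report where this rank is running, then write an
 * array of ny rows by nx values (row-major, an assumption) to `path`.
 * Returns 0 on success, or a nonzero errno-style code on failure. */
int write_array_checked(int rank, const char *path,
                        const double *array, int nx, int ny)
{
    char cwd[4096];

    assert(path != NULL && array != NULL && nx > 0 && ny > 0);

    /* Show the working directory so a wrong launch directory is visible. */
    if (getcwd(cwd, sizeof cwd) != NULL)
        fprintf(stderr, "rank %d: cwd = %s\n", rank, cwd);

    /* "w" truncates an existing file, roughly like status="replace". */
    FILE *fp = fopen(path, "w");
    if (fp == NULL) {
        int err = errno;
        fprintf(stderr, "rank %d: cannot open %s: %s\n",
                rank, path, strerror(err));
        return err != 0 ? err : EIO;
    }

    /* One text line per row. */
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < nx; i++) {
            if (fprintf(fp, "%g ", array[j * nx + i]) < 0) {
                int err = errno;
                fclose(fp);
                return err != 0 ? err : EIO;
            }
        }
        fputc('\n', fp);
    }

    /* Buffered data is flushed here, so a failure can surface on close. */
    if (fclose(fp) != 0)
        return errno != 0 ? errno : EIO;
    return 0;
}
```

On rank 0, a call such as write_array_checked(0, "fubar.dat", array, nx, ny)
returns 0 on success; any other value can be passed to strerror to see
what went wrong.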

On Mon, Sep 29, 2008 at 8:27 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Hello Brian and list
>
> There are a lot of things that can go wrong with I/O in a parallel
> environment.
> Without the error messages, code, etc., it is just guessing.
>
> Here are just a few guesses, besides the others I sent before (file name
> conflict across processes,
> conflicting file/directory manipulation by other processes, conflict with
> I/O redirection in process 0, etc).
> In case you want to check:
>
> 1) Do you have permission to write to this(these) directory(ies)?
>
> 2) Do you know the directory(ies) where the processes are actually working?
>
> 3) The MPI launcher normally puts you in your home directory.
> You may also have the ability to cd to the working directory (where the
> program is)
> or somewhere else.
> Where home is depends on the actual computer node where you are running.
> You may have a local home directory, or perhaps it may not have been set.
> You may have an NFS mounted home (or working directory), and the NFS
> export/mount scheme
> may not be correctly set.
>
> You can use print/flush or call system commands to see where the processes
> are running,
> where the program breaks,  and why.
>
> Gus Correa
>
> Brian Harker wrote:
>
>> No, it doesn't exist, but I have tried it with status="replace" and
>> status="unknown" just in case, and it still craps out.  Haven't tried
>> it without specifying the status keyword, will give it a go and see
>> what happens.  thanks!
>>
>> On Fri, Sep 26, 2008 at 4:06 PM, Martin Siegert <siegert at sfu.ca> wrote:
>>
>>>
>>> Maybe a stupid question, but did you check whether the file fubar.dat
>>> exists already?
>>>
>>> Or try to compile with
>>> open( unit = 1, file = "fubar.dat")
>>> Does that work?
>>>
>>> Cheers,
>>> Martin
>>>
>>> --
>>> Martin Siegert
>>> Head, Research Computing
>>> WestGrid Site Lead
>>> IT Services                                phone: 778 782-4691
>>> Simon Fraser University                    fax:   778 782-4242
>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>> Canada  V5A 1S6
>>>
>>> On Fri, Sep 26, 2008 at 03:31:39PM -0600, Brian Harker wrote:
>>>
>>>>
>>>> Well, no luck with the fresh install.  :(  Still can't write to file.
>>>>
>>>> I also had an inkling that perhaps I had installed the Intel Fortran
>>>> compiler before the Intel C compiler, so that my ifort was built with
>>>> gcc instead of icc, so I re-installed my compilers, icc and icpc
>>>> first, then ifort, then rebuilt MPICH2.  No go.  Any other ideas?
>>>>
>>>> On Tue, Sep 23, 2008 at 4:40 PM, Brian Harker <brian.harker at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Hi Gus-
>>>>>
>>>>> Ha! I sure do remember the clean install!  :)
>>>>>
>>>>> As far as unit numbers go, I've tried many different ones between 1
>>>>> and 99, still no luck.  Tonight I'll try "make clean" followed by a
>>>>> fresh install to see what happens.  Cheers!
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 23, 2008 at 4:29 PM, Gus Correa <gus at ldeo.columbia.edu>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi Brian and list
>>>>>>
>>>>>> Wild guesses:
>>>>>>
>>>>>> 1) Any chance that different processes use the same file name
>>>>>> ("fubar.dat" or other)?
>>>>>> 2) Or perhaps that processes somehow manipulate whole files or
>>>>>> directories, with OS/shell calls to cp, mv, rm, etc.?
>>>>>>
>>>>>> I was betting on "unit=1" being the source of the problem,
>>>>>> perhaps combined with I/O redirection ("<" and ">") of your program
>>>>>> in the mpirun/mpiexec command.
>>>>>> However, you say you already tried to use other unit numbers with no
>>>>>> success.
>>>>>> Did you try it on this part of the code, for process 0, with something
>>>>>> different from 0, 1, 2, 5, 6?
>>>>>>
>>>>>> Yet another thing to try is a fresh compilation, preceded by a "make
>>>>>> cleanall" of sorts, just to avoid leftover object files from ancient
>>>>>> builds and outdated source code.
>>>>>> Remember that?  :)
>>>>>>
>>>>>> Gus Correa
>>>>>>
>>>>>> --
>>>>>> ---------------------------------------------------------------------
>>>>>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>>>>>> Lamont-Doherty Earth Observatory - Columbia University
>>>>>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>>>>>> ---------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> Brian Harker wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Hi Gus-
>>>>>>>
>>>>>>> I have tried different unit numbers as well, and this master process
>>>>>>> file-write is the only process with a hardwired unit number.  The
>>>>>>> slave-writes I have treated very similarly to your suggestion.
>>>>>>>
>>>>>>> On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Hello Brian and list
>>>>>>>>
>>>>>>>> Some guesses.
>>>>>>>>
>>>>>>>> 1) Have you tried to use a different unit number for the file being
>>>>>>>> opened, instead of 1, say 12, for instance?
>>>>>>>> Old Fortran liked to use 5 and 6 for stdin and stdout,
>>>>>>>> whereas Unix and Linux prefer to use 0, 1, 2 for stdin, stdout, and
>>>>>>>> stderr.
>>>>>>>> For regular files I prefer to stay away from these magic numbers,
>>>>>>>> just in case the OS and the compiler try to enforce their own
>>>>>>>> preferences, fight each other, and perhaps don't change the file
>>>>>>>> handle number from the program source in a sensible way.
>>>>>>>>
>>>>>>>> 2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of
>>>>>>>> process 0 in a special way,
>>>>>>>> which may be the reason why your process 0 fails, but not the
>>>>>>>> others, particularly if you are redirecting stdin and stdout with
>>>>>>>> "<" and ">".
>>>>>>>> (This used to be the case in the past; I am not sure if it still is.
>>>>>>>> The MPICH experts may have something better to say about it.)
>>>>>>>>
>>>>>>>> 3) If there are nodes with more than one process running (SMP), I
>>>>>>>> don't know if hardwiring the same file unit number on all processes
>>>>>>>> is a good idea (in case you used the same number for all of them).
>>>>>>>> Something like 12+proc_id, or perhaps 12+mod(proc_id,
>>>>>>>> number_of_processes_per_node), may avoid potential file handle
>>>>>>>> number conflict across different processes under the same (SMP) OS
>>>>>>>> on a node.
>>>>>>>>
>>>>>>>> My two guessed cents,
>>>>>>>> Gus Correa
>>>>>>>>
>>>>>>>>
>>>>>>>> Brian Harker wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hello list-
>>>>>>>>>
>>>>>>>>> I have a problem with process 0 being unable to open a file for
>>>>>>>>> writing and subsequently write to it.  The pertinent section of
>>>>>>>>> code looks as follows:
>>>>>>>>>
>>>>>>>>> ========================================
>>>>>>>>> if ( proc_id == 0 ) then
>>>>>>>>>
>>>>>>>>> open( unit = 1, file = "fubar.dat", status="new" )
>>>>>>>>> do i = 1, ny
>>>>>>>>> write(1,*) ( array(i,j), i = 1, nx )
>>>>>>>>> end do
>>>>>>>>> close(1)
>>>>>>>>>
>>>>>>>>> end if
>>>>>>>>> ========================================
>>>>>>>>>
>>>>>>>>> When this part of the code is reached, the program seems to hang
>>>>>>>>> for a long time while trying to open the file, then spits out the
>>>>>>>>> following error message:
>>>>>>>>>
>>>>>>>>> rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>>>>>>>>> exit status of rank 0: killed by signal 9
>>>>>>>>>
>>>>>>>>> I am confused about this error, because it is seemingly isolated to
>>>>>>>>> this particular write-to-file by process 0.  During execution, my
>>>>>>>>> slave processes write out other files using this exact same syntax.
>>>>>>>>> Has anyone run across this?  I can't seem to find any useful
>>>>>>>>> information on the interweb.  I have run into this problem with
>>>>>>>>> both MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel
>>>>>>>>> Fortran compiler, ifort 10.1.012.
>>>>>>>>>
>>>>>>>>> Thanks in advance for any input!
>>>>>>>>>
>>>>> --
>>>>> Cheers,
>>>>> Brian
>>>>> brian.harker at gmail.com
>>>>>
>>>>>
>>>>> "In science, there is only physics; all the rest is stamp-collecting."
>>>>> -Ernest Rutherford
>>>>>
>>>>>



-- 
Cheers,
Brian
brian.harker at gmail.com


"In science, there is only physics; all the rest is stamp-collecting."

-Ernest Rutherford



