[mpich-discuss] File I/O causing collective abort of all ranks

Brian Harker brian.harker at gmail.com
Tue Sep 23 17:40:01 CDT 2008


Hi Gus-

Ha! I sure do remember the clean install!  :)

As far as unit numbers go, I've tried many different ones between 1
and 99, still no luck.  Tonight I'll try "make clean" followed by a
fresh install to see what happens.  Cheers!



On Tue, Sep 23, 2008 at 4:29 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Hi Brian and list
>
> Wild guesses:
>
> 1) Any chance that different processes use the same file name ("fubar.dat"
> or other)?
> 2) Or perhaps that processes somehow manipulate whole files or directories,
> with OS/shell calls to cp, mv, rm, etc?
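
On guess 1, a minimal sketch of one way to give each process its own file name, so that no two ranks ever open the same "fubar.dat". The helper name rank_filename and the "fubar_NNNN.dat" pattern are hypothetical, and the loop over 0..3 only stands in for the ranks MPI would report (no MPI calls are made here):

```fortran
program rank_filenames
  implicit none
  integer :: proc_id
  character(len=32) :: fname

  ! Stand-in for the rank each MPI process would obtain; assumed values
  do proc_id = 0, 3
     fname = rank_filename(proc_id)
     print '(a,i0,a,a)', 'rank ', proc_id, ' -> ', trim(fname)
  end do

contains

  ! Hypothetical helper: builds a name such as "fubar_0003.dat" for rank 3
  function rank_filename(id) result(name)
    integer, intent(in) :: id
    character(len=32) :: name
    write(name, '(a,i4.4,a)') 'fubar_', id, '.dat'
  end function rank_filename

end program rank_filenames
```

In the real program the argument would be the proc_id already in use, and the resulting string would go into the file= specifier of open.
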
>
> I was betting on "unit=1" being the source of the problem,
> perhaps combined with I/O redirection ("<" and ">") of your program in
> the mpirun/mpiexec command.
> However, you say you already tried to use other unit numbers with no
> success.
> Did you try it on this part of the code, for process 0, with something
> different from 0,1,2,5,6?
>
> Yet another thing to try is a fresh compilation, preceded by a "make
> cleanall" of sorts, just to avoid leftover object files from ancient
> builds and outdated source code.
> Remember that?  :)
>
> Gus Correa
>
> --
> ---------------------------------------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> Brian Harker wrote:
>
>> Hi Gus-
>>
>> I have tried different unit numbers as well, and this master process
>> file-write is the only process with a hardwired unit number.  The
>> slave-writes I have treated very similarly to your suggestion.
>>
>> On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>
>>>
>>> Hello Brian and list
>>>
>>> Some guesses.
>>>
>>> 1) Have you tried to use a different unit number for the file being
>>> opened, instead of 1, say 12, for instance?
>>> Old Fortran liked to use 5 and 6 for stdin and stdout,
>>> whereas Unix and Linux prefer to use 0, 1, 2 for stdin, stdout, and
>>> stderr.
>>> For regular files I prefer to stay away from these magic numbers,
>>> just in case the OS and the compiler try to enforce their own
>>> preferences, fight each other, and perhaps don't change the file
>>> handle number from the program source in a sensible way.
>>>
>>> 2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of
>>> process 0 in a special way,
>>> which may be the reason why your process 0 fails, but not the others,
>>> particularly if you are redirecting stdin and stdout with "<" and ">".
>>> (This used to be the case in the past; I am not sure if it still is.
>>> The MPICH experts may have something better to say about it.)
>>>
>>> 3) If there are nodes with more than one process running (SMP) I don't
>>> know if hardwiring the same file unit number on all processes is a
>>> good idea (in case you used the same number for all of them).
>>> Something like 12+proc_id, or perhaps 12+mod(proc_id,
>>> number_of_processes_per_node) may avoid potential file handle number
>>> conflict across different processes under the same (SMP) OS on a node.
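
The 12+mod(proc_id, number_of_processes_per_node) idea in point 3 could be sketched roughly as follows. The names base_unit, procs_per_node and unit_for_rank are hypothetical, and the value 4 for processes per node is only an assumption for the demonstration. Fortran unit numbers are normally local to each process, so this is a precaution rather than a known fix:

```fortran
program unit_numbers
  implicit none
  integer, parameter :: base_unit = 12       ! stays clear of 0, 1, 2, 5, 6
  integer, parameter :: procs_per_node = 4   ! assumed value, adjust per cluster
  integer :: proc_id

  ! Stand-in for the ranks MPI would report; no MPI calls are made here
  do proc_id = 0, 7
     print '(a,i0,a,i0)', 'rank ', proc_id, ' -> unit ', unit_for_rank(proc_id)
  end do

contains

  ! Hypothetical helper: maps a rank to a unit in base_unit .. base_unit+procs_per_node-1
  integer function unit_for_rank(id)
    integer, intent(in) :: id
    unit_for_rank = base_unit + mod(id, procs_per_node)
  end function unit_for_rank

end program unit_numbers
```

With these assumed values, ranks sharing a node get units 12 through 15, and the pattern repeats on the next node.
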
>>>
>>> My two guessed cents,
>>> Gus Correa
>>>
>>>
>>> Brian Harker wrote:
>>>
>>>
>>>>
>>>> Hello list-
>>>>
>>>> I have a problem with process 0 being able to open a file for writing
>>>> and subsequently write to it.  The pertinent section of code looks as
>>>> follows:
>>>>
>>>> ========================================
>>>> if ( proc_id == 0 ) then
>>>>
>>>> open( unit = 1, file = "fubar.dat", status="new" )
>>>> do j = 1, ny
>>>>  write(1,*) ( array(i,j), i = 1, nx )
>>>> end do
>>>> close(1)
>>>>
>>>> end if
>>>> ========================================
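
One way to get more information out of a failing open or write is to add iostat= and check it, as in the minimal sketch below. It is not the original program: the array sizes, the unit number 12 and the program name are assumptions, and status="replace" is swapped in for status="new" because "new" raises an error condition when the file already exists. The outer loop runs over j so that it does not reuse the implied-do index i:

```fortran
program checked_write
  implicit none
  integer, parameter :: nx = 3, ny = 2   ! assumed small sizes for the demo
  integer, parameter :: out_unit = 12    ! assumed unit, clear of 0, 1, 2, 5, 6
  real :: array(nx, ny)
  integer :: i, j, ios

  array = 1.0

  ! iostat returns a nonzero code instead of aborting when the open fails
  open(unit=out_unit, file='fubar.dat', status='replace', &
       action='write', iostat=ios)
  if (ios /= 0) then
     print *, 'open failed, iostat = ', ios
     stop 1
  end if

  ! One output record per j, holding array(1:nx, j)
  do j = 1, ny
     write(out_unit, *, iostat=ios) (array(i, j), i = 1, nx)
     if (ios /= 0) then
        print *, 'write failed at j = ', j, ', iostat = ', ios
        exit
     end if
  end do

  close(out_unit)
end program checked_write
```

A nonzero iostat printed by rank 0 would at least show whether the open or a write is the failing step. It would not by itself explain a process killed by signal 9.
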
>>>>
>>>> When this part of the code is reached, the program seems to hang for a
>>>> long time while trying to open the file, then spits out the following
>>>> error message:
>>>>
>>>> rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>>>> exit status of rank 0: killed by signal 9
>>>>
>>>> I am confused about this error, because it is seemingly isolated to
>>>> this particular write-to-file by process 0.  During execution, my
>>>> slave processes write out other files using this exact same syntax.
>>>> Has anyone run across this?  I can't seem to find any useful
>>>> information on the interweb.  I have run into this problem with both
>>>> MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel fortran
>>>> compiler, ifort 10.1.012.
>>>>
>>>> Thanks in advance for any input!



-- 
Cheers,
Brian
brian.harker at gmail.com


"In science, there is only physics; all the rest is stamp-collecting."
 -Ernest Rutherford



