[mpich-discuss] File I/O causing collective abort of all ranks

Gus Correa gus at ldeo.columbia.edu
Tue Sep 23 17:29:12 CDT 2008


Hi Brian and list

Wild guesses:

1) Any chance that different processes use the same file name 
("fubar.dat" or other)?
2) Or perhaps that processes somehow manipulate whole files or directories,
with OS/shell calls to cp, mv, rm, etc?
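
On guess 1, one way to rule out a file-name clash is to build each name from the process rank. The following is only a minimal sketch: the helper rank_file_name and the plain loop over ranks 0..3 are hypothetical stand-ins (in the real program the rank would come from MPI_Comm_rank), and none of it is taken from Brian's code.

```fortran
program per_rank_names
  implicit none
  integer :: rank
  character(len=64) :: fname

  ! Stand-in loop over ranks 0..3; in an MPI program the rank
  ! would come from MPI_Comm_rank instead (assumption for illustration).
  do rank = 0, 3
     fname = rank_file_name("fubar", rank)
     print '(a)', trim(fname)
  end do

contains

  ! Hypothetical helper: builds "<base>_<rank>.dat", e.g. "fubar_0003.dat".
  function rank_file_name(base, rank) result(name)
    character(len=*), intent(in) :: base
    integer, intent(in) :: rank
    character(len=64) :: name
    ! Internal write: formats the rank as a zero-padded 4-digit field.
    write(name, '(a,"_",i4.4,".dat")') trim(base), rank
  end function rank_file_name

end program per_rank_names
```

With names like these, two processes can only collide on a file if they share a rank, which should not happen within one job.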

I was betting on "unit=1" being the source of the problem,
perhaps combined with I/O redirection ( "<" and ">") of your program in 
the mpirun/mpiexec command.
However, you say you already tried to use other unit numbers with no 
success.
Did you try it on this part of the code, for process 0, with something 
different from 0,1,2,5,6?
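
To test the unit-number theory, the program can search for a free unit itself and report whether the open succeeded instead of dying silently. This is a rough sketch under assumptions: the file name fubar_test.dat is hypothetical, and status="replace" is my choice for a repeatable test (Brian's code uses status="new").

```fortran
program safe_open
  implicit none
  integer :: u, ios
  logical :: in_use

  ! Search units 10..99 for one that is not already connected,
  ! staying clear of the low numbers (0, 1, 2, 5, 6) that are
  ! often tied to stdin, stdout and stderr.
  do u = 10, 99
     inquire(unit=u, opened=in_use)
     if (.not. in_use) exit
  end do
  if (u > 99) then
     print *, "no free unit found in 10..99"
     stop
  end if

  ! iostat= returns an error code rather than aborting on failure.
  open(unit=u, file="fubar_test.dat", status="replace", &
       action="write", iostat=ios)
  if (ios /= 0) then
     print *, "open failed on unit", u, " iostat =", ios
     stop
  end if

  write(u, *) "hello from unit", u
  close(u)
end program safe_open
```

If the open is what goes wrong, the iostat value printed here should at least narrow down why.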

Yet another thing to try is a fresh compilation, preceded by a "make 
cleanall" of sorts,
just to avoid leftover object files from ancient builds and outdated 
source code.
Remember that?  :)

Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Brian Harker wrote:

>Hi Gus-
>
>I have tried different unit numbers as well, and this master process
>file-write is the only process with a hardwired unit number.  The
>slave-writes I have treated very similarly to your suggestion.
>
>On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>  
>
>>Hello Brian and list
>>
>>Some guesses.
>>
>>1) Have you tried to use a different unit number for the file being opened,
>>instead of 1, say 12,  for instance?
>>Old Fortran liked to use 5 and 6 for stdin and stdout,
>>whereas Unix and Linux prefer to use 0, 1, 2 for stdin, stdout, and stderr.
>>For regular files I prefer to stay away from these magic numbers,
>>just in case the OS and the compiler try to enforce their own preferences,
>>fight each other, and end up mapping the unit number in the program source
>>to a file handle in some unexpected way.
>>
>>2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of process 0 in
>>a special way,
>>which may be the reason why your process 0 fails, but not the others,
>>particularly if you are redirecting stdin and stdout with "<" and ">".
>>(This used to be the case in the past, I am not sure if it still is.
>>The MPICH experts may have something better to say about it.)
>>
>>3) If there are nodes with more than one process running (SMP) I don't know
>>if hardwiring the
>>same file unit number on all processes is a good idea (in case you used the
>>same number for all of them).
>>Something like 12+proc_id, or perhaps 12+mod(proc_id,
>> number_of_processes_per_node)  may avoid potential file handle number
>>conflict across different processes under the same (SMP) OS on a node.
>>
>>My two guessed cents,
>>Gus Correa
>>
>>
>>Brian Harker wrote:
>>
>>    
>>
>>>Hello list-
>>>
>>>I have a problem with process 0 being able to open a file for writing
>>>and subsequently write to it.  The pertinent section of code looks as
>>>follows:
>>>
>>>========================================
>>>if ( proc_id == 0 ) then
>>>
>>> open( unit = 1, file = "fubar.dat", status="new" )
>>> do j = 1, ny
>>>  write(1,*) ( array(i,j), i = 1, nx )
>>> end do
>>> close(1)
>>>
>>>end if
>>>========================================
>>>
>>>When this part of the code is reached, the program seems to hang for a
>>>long time while trying to open the file, then spits out the following
>>>error message:
>>>
>>>rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>>> exit status of rank 0: killed by signal 9
>>>
>>>I am confused about this error, because it is seemingly isolated to
>>>this particular write-to-file by process 0.  During execution, my
>>>slave processes write out other files using this exact same syntax.
>>>Has anyone run across this?  I can't seem to find any useful
>>>information on the interweb.  I have run into this problem with both
>>>MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel fortran
>>>compiler, ifort 10.1.012.
>>>
>>>Thanks in advance for any input!



