[mpich-discuss] File I/O causing collective abort of all ranks

Gus Correa gus at ldeo.columbia.edu
Tue Sep 23 14:07:09 CDT 2008


Hello Brian and list

Some guesses.

1) Have you tried a different unit number for the file being opened,
say 12 instead of 1?
Fortran compilers traditionally preconnect units 5 and 6 to stdin and stdout,
whereas Unix and Linux use file descriptors 0, 1, 2 for stdin, stdout, and stderr.
For regular files I prefer to stay away from these magic numbers,
just in case the OS and the compiler try to enforce their own preferences,
fight each other, and end up not mapping the unit number given in the
program source to your file in a sensible way.

2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of process 0
in a special way,
which may be the reason why your process 0 fails, but not the others,
particularly if you are redirecting stdin and stdout with "<" and ">".
(This used to be the case in the past, I am not sure if it still is.
The MPICH experts may have something better to say about it.)

3) If there are nodes with more than one process running (SMP), I don't
know if hardwiring the same file unit number on all processes is a good
idea (in case you used the same number for all of them).
Something like 12+proc_id, or perhaps
12+mod(proc_id, number_of_processes_per_node), may avoid a potential
file handle number conflict across different processes under the same
(SMP) OS on a node.
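
A minimal sketch of suggestions 1 and 3 together, offered as a guess at
what the change could look like rather than a confirmed fix. The names
base_unit, out_unit, fname and the per-rank file name pattern are
hypothetical, and status="replace" is swapped in for status="new" so that
a leftover file from an earlier run does not make the open fail. Fortran
unit numbers are normally private to each process, so the per-rank offset
may turn out to be unnecessary, but it is harmless. The iostat check makes
a failed open report an error code instead of failing silently.

```fortran
program per_rank_unit_sketch
  implicit none
  include 'mpif.h'

  ! Assumption: 12 is well clear of the conventional units 0, 1, 2, 5, 6.
  integer, parameter :: base_unit = 12
  integer :: proc_id, nprocs, ierr, out_unit, ios
  logical :: in_use
  character(len=32) :: fname   ! hypothetical per-rank file name

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, proc_id, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Suggestion 3: give each rank its own unit number.
  out_unit = base_unit + proc_id

  ! Build a distinct file name per rank, e.g. fubar_0000.dat.
  write(fname, '(a,i4.4,a)') 'fubar_', proc_id, '.dat'

  ! Make sure the chosen unit is not already connected to something else.
  inquire(unit=out_unit, opened=in_use)
  if (in_use) then
    write(*,*) 'rank ', proc_id, ': unit ', out_unit, ' already connected'
    call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Suggestion 1: open on a unit away from the "magic" numbers,
  ! and check iostat so a failure is reported explicitly.
  open(unit=out_unit, file=trim(fname), status='replace', &
       action='write', iostat=ios)
  if (ios /= 0) then
    write(*,*) 'rank ', proc_id, ': open failed, iostat = ', ios
    call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  write(out_unit, *) 'written by rank ', proc_id, ' of ', nprocs
  close(out_unit)

  call MPI_Finalize(ierr)
end program per_rank_unit_sketch
```

If the open still fails on process 0 only, the iostat value printed above
should at least narrow down whether the problem is the unit, the file, or
something outside the Fortran runtime.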

My two guessed cents,
Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Brian Harker wrote:

>Hello list-
>
>I have a problem with process 0 being able to open a file for writing
>and subsequently write to it.  The pertinent section of code looks as
>follows:
>
>========================================
>if ( proc_id == 0 ) then
>
>  open( unit = 1, file = "fubar.dat", status="new" )
>  do j = 1, ny
>    write(1,*) ( array(i,j), i = 1, nx )
>  end do
>  close(1)
>
>end if
>========================================
>
>When this part of the code is reached, the program seems to hang for a
>long time while trying to open the file, then spits out the following
>error message:
>
>rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>   exit status of rank 0: killed by signal 9
>
>I am confused about this error, because it is seemingly isolated to
>this particular write-to-file by process 0.  During execution, my
>slave processes write out other files using this exact same syntax.
>Has anyone run across this?  I can't seem to find any useful
>information on the interweb.  I have run into this problem with both
>MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel fortran
>compiler, ifort 10.1.012.
>
>Thanks in advance for any input!
>



