[mpich-discuss] File I/O causing collective abort of all ranks

Brian Harker brian.harker at gmail.com
Tue Sep 23 14:32:13 CDT 2008


Hi Gus-

I have tried different unit numbers as well. This file write on the
master process is the only one with a hardwired unit number; the
slave writes I have treated very similarly to your suggestion.
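
For anyone following along, here is a minimal sketch of the per-rank
unit number idea (12+proc_id) combined with an iostat check on the
open. It is not my production code: the program name, base_unit, the
array sizes, the per-rank file name, and the use of status="replace"
are all assumptions made for illustration.

```fortran
program write_rank_file
  implicit none
  include 'mpif.h'

  ! Hypothetical sizes and names, chosen only for this sketch.
  integer, parameter :: nx = 4, ny = 3
  integer, parameter :: base_unit = 12   ! stays clear of 0, 1, 2, 5, 6
  real    :: array(nx, ny)
  integer :: proc_id, ierr, ios, iunit, i, j
  character(len=32) :: fname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, proc_id, ierr)

  array = real(proc_id)                  ! dummy data

  ! One unit number and one file name per rank.
  iunit = base_unit + proc_id
  write(fname, '(a,i4.4,a)') 'fubar_', proc_id, '.dat'

  ! status='replace' is an assumption so the sketch can be rerun;
  ! status='new' is an error condition if the file already exists.
  open(unit=iunit, file=trim(fname), status='replace', &
       action='write', iostat=ios)
  if (ios /= 0) then
     write(*,*) 'rank ', proc_id, ': open failed, iostat = ', ios
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! The outer loop uses j so the implied-do index i is not reused.
  do j = 1, ny
     write(iunit, *) (array(i, j), i = 1, nx)
  end do
  close(iunit)

  call MPI_Finalize(ierr)
end program write_rank_file
```

With iostat present, a failed open returns a nonzero code that can be
printed per rank instead of terminating the process outright, which
may help narrow down whether the open itself is the problem.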

On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> Hello Brian and list
>
> Some guesses.
>
> 1) Have you tried to use a different unit number for the file being opened,
> instead of 1, say 12,  for instance?
> Old Fortran liked to use 5 and 6 for stdin and stdout,
> whereas Unix and Linux prefer to use 0, 1, 2 for stdin, stdout, and stderr.
> For regular files I prefer to stay away from these magic numbers,
> just in case the OS and the compiler try to enforce their own preferences,
> fight each other, and perhaps don't change the file handle number from the
> program source in a sensible way.
>
> 2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of process 0 in
> a special way,
> which may be the reason why your process 0 fails, but not the others,
> particularly if you are redirecting stdin and stdout with "<" and ">".
> (This used to be the case in the past, I am not sure if it still is.
> The MPICH experts may have something better to say about it.)
>
> 3) If there are nodes with more than one process running (SMP), I don't know
> if hardwiring the same file unit number on all processes is a good idea (in
> case you used the same number for all of them).
> Something like 12+proc_id, or perhaps
> 12+mod(proc_id, number_of_processes_per_node), may avoid potential file handle
> number conflicts across different processes under the same (SMP) OS on a node.
>
> My two guessed cents,
> Gus Correa
>
> --
> ---------------------------------------------------------------------
> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
> Lamont-Doherty Earth Observatory - Columbia University
> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> Brian Harker wrote:
>
>> Hello list-
>>
>> I have a problem with process 0 being able to open a file for writing
>> and subsequently write to it.  The pertinent section of code looks as
>> follows:
>>
>> ========================================
>> if ( proc_id == 0 ) then
>>
>>  open( unit = 1, file = "fubar.dat", status="new" )
>>  do i = 1, ny
>>   write(1,*) ( array(i,j), i = 1, nx )
>>  end do
>>  close(1)
>>
>> end if
>> ========================================
>>
>> When this part of the code is reached, the program seems to hang for a
>> long time while trying to open the file, then spits out the following
>> error message:
>>
>> rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>>  exit status of rank 0: killed by signal 9
>>
>> I am confused about this error, because it is seemingly isolated to
>> this particular write-to-file by process 0.  During execution, my
>> slave processes write out other files using this exact same syntax.
>> Has anyone run across this?  I can't seem to find any useful
>> information on the interweb.  I have run into this problem with both
>> MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel fortran
>> compiler, ifort 10.1.012.
>>
>> Thanks in advance for any input!
>>
>>
>>
>>
>
>



-- 
Cheers,
Brian
brian.harker at gmail.com


"In science, there is only physics; all the rest is stamp-collecting."
 -Ernest Rutherford



