[mpich-discuss] File I/O causing collective abort of all ranks

Brian Harker brian.harker at gmail.com
Fri Sep 26 16:31:39 CDT 2008


Well, no luck with the fresh install.  :(  Still can't write to file.

I also had an inkling that perhaps I had installed the Intel Fortran
compiler before the Intel C compiler, so that my ifort was built with
gcc instead of icc, so I re-installed my compilers, icc and icpc
first, then ifort, then rebuilt mpich2.  No go.  Any other ideas?
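
For reference, here is a minimal serial sketch of the write in question
with error checking added, so that a failing open or write reports an
iostat code instead of failing silently. The array sizes, the fill
value, and the unit number 12 are assumptions made for illustration;
in the real program the block would sit inside the
"if ( proc_id == 0 )" test. It also uses status="replace", since
status="new" raises an error when "fubar.dat" already exists from an
earlier run.

```fortran
program write_check
  implicit none
  ! Hypothetical sizes and data, for illustration only.
  integer, parameter :: nx = 4, ny = 3
  integer, parameter :: out_unit = 12   ! stays away from 0, 1, 2, 5, 6
  real    :: array(nx, ny)
  integer :: i, j, ios

  array = 1.0

  ! status="replace" overwrites an existing file; status="new" would
  ! fail if "fubar.dat" were left over from a previous run.
  open(unit=out_unit, file="fubar.dat", status="replace", &
       action="write", iostat=ios)
  if (ios /= 0) then
     write(*,*) "open failed, iostat = ", ios
     stop 1
  end if

  ! Outer loop over j, implied do over i: one line of nx values per j.
  do j = 1, ny
     write(out_unit, *, iostat=ios) (array(i, j), i = 1, nx)
     if (ios /= 0) then
        write(*,*) "write failed at j = ", j, ", iostat = ", ios
        exit
     end if
  end do
  close(out_unit)
end program write_check
```

If this stand-alone version writes the file correctly, the problem is
more likely in the surrounding MPI program than in the I/O statements
themselves.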

On Tue, Sep 23, 2008 at 4:40 PM, Brian Harker <brian.harker at gmail.com> wrote:
> Hi Gus-
>
> Ha! I sure do remember the clean install!  :)
>
> As far as unit numbers go, I've tried many different ones between 1
> and 99, still no luck.  Tonight I'll try "make clean" followed by a
> fresh install to see what happens.  Cheers!
>
>
>
> On Tue, Sep 23, 2008 at 4:29 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>> Hi Brian and list
>>
>> Wild guesses:
>>
>> 1) Any chance that different processes use the same file name ("fubar.dat"
>> or other)?
>> 2) Or perhaps that processes somehow manipulate whole files or directories,
>> with OS/shell calls to cp, mv, rm, etc?
>>
>> I was betting on "unit=1" being the source of the problem,
>> perhaps combined with I/O redirection ("<" and ">") of your program in the
>> mpirun/mpiexec command.
>> However, you say you already tried to use other unit numbers with no
>> success.
>> Did you try it on this part of the code, for process 0, with something
>> different from 0, 1, 2, 5, 6?
>>
>> Yet another thing to try is a fresh compilation, preceded by a "make
>> cleanall" of sorts, just to avoid leftover object files from ancient
>> builds and outdated source code.
>> Remember that?  :)
>>
>> Gus Correa
>>
>> --
>> ---------------------------------------------------------------------
>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>> Lamont-Doherty Earth Observatory - Columbia University
>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>>
>> Brian Harker wrote:
>>
>>> Hi Gus-
>>>
>>> I have tried different unit numbers as well, and this master process
>>> file-write is the only process with a hardwired unit number.  The
>>> slave-writes I have treated very similarly to your suggestion.
>>>
>>> On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>>
>>>>
>>>> Hello Brian and list
>>>>
>>>> Some guesses.
>>>>
>>>> 1) Have you tried to use a different unit number for the file being
>>>> opened, say 12 instead of 1?
>>>> Old Fortran liked to use 5 and 6 for stdin and stdout,
>>>> whereas Unix and Linux prefer to use 0, 1, 2 for stdin, stdout, and
>>>> stderr.
>>>> For regular files I prefer to stay away from these magic numbers,
>>>> just in case the OS and the compiler try to enforce their own
>>>> preferences, fight each other, and perhaps don't map the unit number
>>>> from the program source to a file handle in a sensible way.
>>>>
>>>> 2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of process 0
>>>> in a special way, which may be the reason why your process 0 fails but
>>>> not the others, particularly if you are redirecting stdin and stdout
>>>> with "<" and ">".
>>>> (This used to be the case in the past; I am not sure if it still is.
>>>> The MPICH experts may have something better to say about it.)
>>>>
>>>> 3) If there are nodes with more than one process running (SMP), I don't
>>>> know if hardwiring the same file unit number on all processes is a good
>>>> idea (in case you used the same number for all of them).
>>>> Something like 12+proc_id, or perhaps 12+mod(proc_id,
>>>> number_of_processes_per_node), may avoid potential file handle number
>>>> conflicts across different processes under the same (SMP) OS on a node.
>>>>
>>>> My two guessed cents,
>>>> Gus Correa
>>>>
>>>> --
>>>> ---------------------------------------------------------------------
>>>> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>>>> Lamont-Doherty Earth Observatory - Columbia University
>>>> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>>>> ---------------------------------------------------------------------
>>>>
>>>>
>>>> Brian Harker wrote:
>>>>
>>>>
>>>>>
>>>>> Hello list-
>>>>>
>>>>> I have a problem with process 0 being unable to open a file for
>>>>> writing and subsequently write to it.  The pertinent section of code
>>>>> looks as follows:
>>>>>
>>>>> ========================================
>>>>> if ( proc_id == 0 ) then
>>>>>
>>>>> open( unit = 1, file = "fubar.dat", status="new" )
>>>>> do j = 1, ny
>>>>>  write(1,*) ( array(i,j), i = 1, nx )
>>>>> end do
>>>>> close(1)
>>>>>
>>>>> end if
>>>>> ========================================
>>>>>
>>>>> When this part of the code is reached, the program seems to hang for a
>>>>> long time while trying to open the file, then spits out the following
>>>>> error message:
>>>>>
>>>>> rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>>>>> exit status of rank 0: killed by signal 9
>>>>>
>>>>> I am confused about this error, because it is seemingly isolated to
>>>>> this particular write-to-file by process 0.  During execution, my
>>>>> slave processes write out other files using this exact same syntax.
>>>>> Has anyone run across this?  I can't seem to find any useful
>>>>> information on the interweb.  I have run into this problem with both
>>>>> MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel Fortran
>>>>> compiler, ifort 10.1.012.
>>>>>
>>>>> Thanks in advance for any input!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
> --
> Cheers,
> Brian
> brian.harker at gmail.com
>
>
> "In science, there is only physics; all the rest is stamp-collecting."
>  -Ernest Rutherford
>



-- 
Cheers,
Brian
brian.harker at gmail.com


"In science, there is only physics; all the rest is stamp-collecting."

-Ernest Rutherford
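
Gus's suggestion of one unit number and one file name per process
could be sketched roughly as follows. This is only an illustration
under stated assumptions: the base unit 12 comes from the thread, but
the "out_NNNN.dat" naming scheme and the program name are hypothetical,
and "use mpi" assumes the MPI library was built with its Fortran 90
module (otherwise "include 'mpif.h'" would be the usual substitute).

```fortran
program per_rank_files
  use mpi
  implicit none
  integer, parameter :: base_unit = 12   ! base value suggested in the thread
  integer :: ierr, proc_id, out_unit, ios
  character(len=32) :: fname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, proc_id, ierr)

  ! One unit number and one file name per rank (hypothetical naming).
  out_unit = base_unit + proc_id
  write(fname, '(a,i4.4,a)') "out_", proc_id, ".dat"

  open(unit=out_unit, file=trim(fname), status="replace", &
       action="write", iostat=ios)
  if (ios == 0) then
     write(out_unit, *) "written by rank ", proc_id
     close(out_unit)
  else
     write(*,*) "rank ", proc_id, ": open failed, iostat = ", ios
  end if

  call MPI_Finalize(ierr)
end program per_rank_files
```

Since each MPI rank is normally a separate OS process with its own set
of Fortran units, the distinct file names are probably the more
important half of this; the distinct unit numbers are mostly a
precaution.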



