[mpich-discuss] File I/O causing collective abort of all ranks
Brian Harker
brian.harker at gmail.com
Fri Sep 26 17:45:50 CDT 2008
No, it doesn't exist, but I have tried it with status="replace" and
status="unknown" just in case, and it still craps out. Haven't tried
it without specifying the status keyword, will give it a go and see
what happens. Thanks!
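
For anyone following along, here is a minimal sketch of how the open could be made to report why it fails instead of just dying. It is a standalone test with no MPI in it. Unit 12 and the "test line" record are arbitrary choices for illustration, and iomsg= is a Fortran 2003 feature, so I am assuming the compiler accepts it; if it does not, drop iomsg and print ios alone.

```fortran
program open_check
  implicit none
  integer :: ios
  logical :: there
  character(len=256) :: msg

  ! status="new" is an error if the file is already present, so check first.
  inquire(file="fubar.dat", exist=there)
  write(*,*) "fubar.dat exists before open: ", there

  ! Unit 12 is an arbitrary choice that stays away from 0, 5 and 6,
  ! which compilers commonly preconnect to stderr, stdin and stdout.
  open(unit=12, file="fubar.dat", status="replace", action="write", &
       iostat=ios, iomsg=msg)
  if (ios /= 0) then
     ! Print the runtime's own explanation rather than aborting silently.
     write(*,*) "open failed, iostat = ", ios, ": ", trim(msg)
     stop 1
  end if

  write(12,*) "test line"
  close(12)
end program open_check
```

If this runs cleanly on its own but the same open still fails inside the MPI job, that would point away from the open statement itself and toward how rank 0 is launched.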
On Fri, Sep 26, 2008 at 4:06 PM, Martin Siegert <siegert at sfu.ca> wrote:
> Maybe a stupid question, but did you check whether the file fubar.dat
> exists already?
>
> Or try to compile with
> open( unit = 1, file = "fubar.dat")
> Does that work?
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Head, Research Computing
> WestGrid Site Lead
> IT Services                        phone: 778 782-4691
> Simon Fraser University            fax:   778 782-4242
> Burnaby, British Columbia          email: siegert at sfu.ca
> Canada  V5A 1S6
>
> On Fri, Sep 26, 2008 at 03:31:39PM -0600, Brian Harker wrote:
>> Well, no luck with the fresh install. :( Still can't write to file.
>>
>> I also had an inkling that perhaps I had installed the Intel Fortran
>> compiler before the Intel C compiler, so that my ifort was built with
>> gcc instead of icc, so I re-installed my compilers, icc and icpc
>> first, then ifort, then rebuilt MPICH2. No go. Any other ideas?
>>
>> On Tue, Sep 23, 2008 at 4:40 PM, Brian Harker <brian.harker at gmail.com> wrote:
>> > Hi Gus-
>> >
>> > Ha! I sure do remember the clean install! :)
>> >
>> > As far as unit numbers go, I've tried many different ones between 1
>> > and 99, still no luck. Tonight I'll try "make clean" followed by a
>> > fresh install to see what happens. Cheers!
>> >
>> >
>> >
>> > On Tue, Sep 23, 2008 at 4:29 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>> >> Hi Brian and list
>> >>
>> >> Wild guesses:
>> >>
>> >> 1) Any chance that different processes use the same file name
>> >> ("fubar.dat" or other)?
>> >> 2) Or perhaps that processes somehow manipulate whole files or
>> >> directories, with OS/shell calls to cp, mv, rm, etc?
>> >>
>> >> I was betting on "unit=1" being the source of the problem,
>> >> perhaps combined with I/O redirection ("<" and ">") of your program in
>> >> the mpirun/mpiexec command.
>> >> However, you say you already tried to use other unit numbers with no
>> >> success.
>> >> Did you try it on this part of the code, for process 0, with something
>> >> different from 0,1,2,5,6?
>> >>
>> >> Yet another thing to try is a fresh compilation, preceded by a "make
>> >> cleanall" of sorts, just to avoid leftover object files from ancient
>> >> builds and outdated source code.
>> >> Remember that? :)
>> >>
>> >> Gus Correa
>> >>
>> >> --
>> >> ---------------------------------------------------------------------
>> >> Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
>> >> Lamont-Doherty Earth Observatory - Columbia University
>> >> P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>> >> ---------------------------------------------------------------------
>> >>
>> >>
>> >> Brian Harker wrote:
>> >>
>> >>> Hi Gus-
>> >>>
>> >>> I have tried different unit numbers as well, and this master process
>> >>> file-write is the only process with a hardwired unit number. The
>> >>> slave-writes I have treated very similarly to your suggestion.
>> >>>
>> >>> On Tue, Sep 23, 2008 at 1:07 PM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>> >>>
>> >>>>
>> >>>> Hello Brian and list
>> >>>>
>> >>>> Some guesses.
>> >>>>
>> >>>> 1) Have you tried to use a different unit number for the file being
>> >>>> opened, instead of 1, say 12, for instance?
>> >>>> Old Fortran liked to use 5 and 6 for stdin and stdout,
>> >>>> whereas Unix and Linux prefer to use 0, 1, 2 for stdin, stdout, and
>> >>>> stderr.
>> >>>> For regular files I prefer to stay away from these magic numbers,
>> >>>> just in case the OS and the compiler try to enforce their own
>> >>>> preferences, fight each other, and perhaps don't change the file
>> >>>> handle number from the program source in a sensible way.
>> >>>>
>> >>>> 2) Moreover, mpiexec and/or mpirun may treat stdin and stdout of
>> >>>> process 0 in a special way, which may be the reason why your process 0
>> >>>> fails, but not the others, particularly if you are redirecting stdin
>> >>>> and stdout with "<" and ">".
>> >>>> (This used to be the case in the past, I am not sure if it still is.
>> >>>> The MPICH experts may have something better to say about it.)
>> >>>>
>> >>>> 3) If there are nodes with more than one process running (SMP) I don't
>> >>>> know if hardwiring the same file unit number on all processes is a good
>> >>>> idea (in case you used the same number for all of them).
>> >>>> Something like 12+proc_id, or perhaps 12+mod(proc_id,
>> >>>> number_of_processes_per_node) may avoid potential file handle number
>> >>>> conflict across different processes under the same (SMP) OS on a node.
>> >>>>
>> >>>> My two guessed cents,
>> >>>> Gus Correa
>> >>>>
>> >>>>
>> >>>>
>> >>>> Brian Harker wrote:
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> Hello list-
>> >>>>>
>> >>>>> I have a problem with process 0 being able to open a file for writing
>> >>>>> and subsequently write to it. The pertinent section of code looks as
>> >>>>> follows:
>> >>>>>
>> >>>>> ========================================
>> >>>>> if ( proc_id == 0 ) then
>> >>>>>
>> >>>>>    open( unit = 1, file = "fubar.dat", status="new" )
>> >>>>>    ! one record per j; the implied DO over i supplies the nx values
>> >>>>>    do j = 1, ny
>> >>>>>       write(1,*) ( array(i,j), i = 1, nx )
>> >>>>>    end do
>> >>>>>    close(1)
>> >>>>>
>> >>>>> end if
>> >>>>> ========================================
>> >>>>>
>> >>>>> When this part of the code is reached, the program seems to hang for a
>> >>>>> long time while trying to open the file, then spits out the following
>> >>>>> error message:
>> >>>>>
>> >>>>> rank 0 in job 11 $HOSTNAME_##### caused collective abort of all ranks
>> >>>>> exit status of rank 0: killed by signal 9
>> >>>>>
>> >>>>> I am confused about this error, because it is seemingly isolated to
>> >>>>> this particular write-to-file by process 0. During execution, my
>> >>>>> slave processes write out other files using this exact same syntax.
>> >>>>> Has anyone run across this? I can't seem to find any useful
>> >>>>> information on the interweb. I have run into this problem with both
>> >>>>> MPICH2-1.0.6p1 and MPICH2-1.0.7. I am using the Intel Fortran
>> >>>>> compiler, ifort 10.1.012.
>> >>>>>
>> >>>>> Thanks in advance for any input!
>> >>>>>
>> > --
>> > Cheers,
>> > Brian
>> > brian.harker at gmail.com
>> >
>> >
>> > "In science, there is only physics; all the rest is stamp-collecting."
>> > -Ernest Rutherford
>> >
>
>
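
Gus's suggestions quoted above (a unit number away from the magic ones, offset by rank, and no two processes sharing a file name) can be put together in one small test program. This is only a sketch under assumptions: it relies on the MPI installation providing the "mpi" Fortran module (otherwise include 'mpif.h' instead), and the array extents nx and ny, the placeholder data, and the fubar_NNNN.dat naming scheme are all made up for illustration.

```fortran
program rank_units
  use mpi
  implicit none
  integer, parameter :: nx = 4, ny = 3      ! hypothetical array extents
  integer :: ierr, proc_id, iunit, ios, i, j
  real :: array(nx, ny)
  character(len=32) :: fname

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, proc_id, ierr)

  array = real(proc_id)                     ! placeholder data

  ! Offset from 12 so that no rank lands on unit 0, 5 or 6.
  iunit = 12 + proc_id
  ! One file per rank, so no two processes open the same name.
  write(fname, '(a,i4.4,a)') "fubar_", proc_id, ".dat"

  open(unit=iunit, file=trim(fname), status="replace", action="write", &
       iostat=ios)
  if (ios /= 0) then
     write(*,*) "rank ", proc_id, ": open failed, iostat = ", ios
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Outer loop over j, implied DO over i: one record of nx values per j.
  do j = 1, ny
     write(iunit,*) (array(i,j), i = 1, nx)
  end do
  close(iunit)

  call MPI_Finalize(ierr)
end program rank_units
```

Fortran unit numbers are local to each process, so the per-rank offset is a precaution rather than a requirement. The separate file names are what keep the ranks from colliding on disk.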
--
Cheers,
Brian
brian.harker at gmail.com
"In science, there is only physics; all the rest is stamp-collecting."
-Ernest Rutherford