[mpich-discuss] File I/O causing collective abort of all ranks
Gus Correa
gus at ldeo.columbia.edu
Tue Sep 23 15:34:27 CDT 2008
Hi Brian and list
Signal 9 is SIGKILL ("kill"), as in "kill -9 process_number".
See:
http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
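To see what "killed by signal 9" looks like from the outside, here is a
minimal sketch, assuming a POSIX-like system with bash or dash. The
"sleep 60" job is only a hypothetical stand-in for your rank 0 process;
it is not taken from your program.

```shell
#!/bin/sh
# Hypothetical stand-in for the MPI rank: a long-running background job.
sleep 60 &
pid=$!

# Send signal 9 (SIGKILL). The target cannot catch, block, or ignore it.
kill -9 "$pid"

# Collect the child's exit status. bash and dash report death by a signal
# as 128 + the signal number, so SIGKILL shows up as 128 + 9 = 137.
status=0
wait "$pid" || status=$?

echo "exit status: $status"
```

Because SIGKILL cannot be caught, the process did not raise it through its
own error handling. Something sent it: a user, the job launcher, or
possibly the kernel itself (for example, the Linux out-of-memory killer).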
Gus Correa
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Brian Harker wrote:
>Hi-
>Thanks for the replies. When I first encountered this error, I did
>try it with only a single process, and it still aborted. :( It's
>definitely not an I/O problem on my machine, as I am running other
>serial code(s) right now with absolutely no problem. Strange. Any
>idea what "signal 9" actually is? I tried some googling, but nothing
>helpful has come up. I am currently running it under gdb to see if I
>can further isolate where the problem is occurring...
>
>To "The Source": the process shouldn't exit at the point where the
>file is opened, and I was careful to make sure MPI_Finalize is in the
>correct place...I have watched it seemingly *try* to open the file,
>with some simple print-to-stdout debugging, and it seems to hang at
>the file open statement. After about 1-2 minutes of hang time, it all
>crashes and I get the error message from my first post.
>
>On Tue, Sep 23, 2008 at 12:22 PM, The Source <thesourcehim at gmail.com> wrote:
>
>
>>This error pops up when the process exits without calling MPI_Finalize().
>>Check whether the process crashes, for example.
>>
>>Brian Harker :
>>
>>
>>>Hello list-
>>>
>>>I have a problem with process 0 being able to open a file for writing
>>>and subsequently write to it. The pertinent section of code looks as
>>>follows:
>>>
>>>========================================
>>>if ( proc_id == 0 ) then
>>>
>>>   open( unit = 1, file = "fubar.dat", status="new" )
>>>   do i = 1, ny
>>>     write(1,*) ( array(i,j), i = 1, nx )
>>>   end do
>>>   close(1)
>>>
>>>end if
>>>========================================
>>>
>>>When this part of the code is reached, the program seems to hang for a
>>>long time while trying to open the file, then spits out the following
>>>error message:
>>>
>>>rank 0 in job 11 $HOSTNAME_##### caused collective abort of all ranks
>>> exit status of rank 0: killed by signal 9
>>>
>>>I am confused about this error, because it is seemingly isolated to
>>>this particular write-to-file by process 0. During execution, my
>>>slave processes write out other files using this exact same syntax.
>>>Has anyone run across this? I can't seem to find any useful
>>>information on the interweb. I have run into this problem with both
>>>MPICH2-1.0.6p1 and MPICH2-1.0.7. I am using the Intel Fortran
>>>compiler, ifort 10.1.012.
>>>
>>>Thanks in advance for any input!