[mpich-discuss] File I/O causing collective abort of all ranks

Gus Correa gus at ldeo.columbia.edu
Tue Sep 23 15:34:27 CDT 2008


Hi Brian and list

Signal 9 is SIGKILL ("kill"), as in "kill -9 process_number".
A process cannot catch or ignore it.
See:

http://unixhelp.ed.ac.uk/CGI/man-cgi?signal+7
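
As for the write itself, here is a minimal sketch of a more defensive
version, not a diagnosis.  It assumes status="new" can be swapped for
status="replace" (with "new", the open fails if fubar.dat already exists),
checks iostat so a failed open is reported, and runs the outer loop over j
so it stays separate from the implied-do index i.  The program name, unit
number 11, the array sizes, and the hard-coded proc_id are placeholders of
mine; in the real code proc_id would come from MPI_Comm_rank.

```fortran
program write_rank0
  implicit none
  ! nx, ny, array and proc_id stand in for the poster's variables;
  ! the values below are assumptions for illustration only.
  integer, parameter :: nx = 4, ny = 3
  integer, parameter :: out_unit = 11   ! arbitrary unit number (assumed)
  integer :: i, j, ios
  integer :: proc_id
  real    :: array(nx, ny)

  proc_id = 0        ! hypothetical: normally set by MPI_Comm_rank
  array   = 1.0

  if (proc_id == 0) then
    ! status="replace" overwrites an existing file;
    ! status="new" raises an error if the file is already there.
    open(unit=out_unit, file="fubar.dat", status="replace", &
         action="write", iostat=ios)
    if (ios /= 0) then
      write(*,*) "open failed, iostat = ", ios
      stop 1
    end if

    ! Outer loop over columns (j), implied-do over rows (i).
    do j = 1, ny
      write(out_unit, *) (array(i, j), i = 1, nx)
    end do
    close(out_unit)
  end if
end program write_rank0
```

If the open reports a nonzero iostat, that at least separates a file-system
problem from whatever sent the kill signal.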

Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Brian Harker wrote:

>Hi-
>Thanks for the replies.  When I first encountered this error, I did
>try it with only a single process, and it still aborts.  :(  It's
>definitely not an I/O problem on my machine, as I am running other
>serial code(s) right now with absolutely no problem.  Strange.  Any
>idea what "signal 9" actually is?  I tried some googling, but nothing
>helpful has come up.  I am currently running it under gdb to see if I
>can further isolate where the problem is occurring...
>
>To "The Source": the process shouldn't exit at the point where the file is
>opened, and I was careful to make sure MPI_Finalize is in the correct
>place...I have watched it seemingly *try* to open the file, with some
>simple print-to-stdout debugging, and it seems to hang at the file
>open statement.  After what seems like about 1-2 minutes of hangtime,
>it all crashes and I get the error message from my first post.
>
>On Tue, Sep 23, 2008 at 12:22 PM, The Source <thesourcehim at gmail.com> wrote:
>
>>This error pops up when the process exits without calling MPI_Finalize().
>>Check whether the process crashes, for example.
>>
>>Brian Harker wrote:
>>
>>>Hello list-
>>>
>>>I have a problem with process 0 being able to open a file for writing
>>>and subsequently write to it.  The pertinent section of code looks as
>>>follows:
>>>
>>>========================================
>>>if ( proc_id == 0 ) then
>>>
>>> open( unit = 1, file = "fubar.dat", status="new" )
>>> do j = 1, ny
>>>   write(1,*) ( array(i,j), i = 1, nx )
>>> end do
>>> close(1)
>>>
>>>end if
>>>========================================
>>>
>>>When this part of the code is reached, the program seems to hang for a
>>>long time while trying to open the file, then spits out the following
>>>error message:
>>>
>>>rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>>>  exit status of rank 0: killed by signal 9
>>>
>>>I am confused about this error, because it is seemingly isolated to
>>>this particular write-to-file by process 0.  During execution, my
>>>slave processes write out other files using this exact same syntax.
>>>Has anyone run across this?  I can't seem to find any useful
>>>information on the interweb.  I have run into this problem with both
>>>MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel Fortran
>>>compiler, ifort 10.1.012.
>>>
>>>Thanks in advance for any input!