[mpich-discuss] File I/O causing collective abort of all ranks

Tue Sep 23 13:49:46 CDT 2008

Ok, have some more info:

gdb spits out the following:

0:  Program received signal SIGSEGV, Segmentation fault.
0:  0x004d3b91 in fileno_unlocked ( ) from /lib/libc.so.6
0:  (gdb) where
0:  #0  0x004d3b91 in fileno_unlocked ( ) from /lib/libc.so.6
0:  #1  0x080be895 in for__open_proc. ( )
0:  #2  0x0809de74 in for_open ( )
0:  #3  0x0804b377 in mpi_main ( ) at driver.f90: 159
0:  #4  0x0804a941 in main ( )

So it seems to be a glibc problem?
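
One way to narrow this down is to take MPI out of the picture and try the same open in a tiny standalone program. The following is only a sketch under stated assumptions: the program name, the unit number 20 (picked to rule out a clash with anything else using unit 1), and status="replace" (so that a leftover fubar.dat from an earlier run cannot make the open fail) are all illustrative choices, not part of the original code. Note that iostat only reports errors the Fortran runtime detects itself; it will not catch a segfault inside the runtime.

```fortran
program open_check
  implicit none
  ! Assumed unit number; anything not otherwise in use would do.
  integer, parameter :: out_unit = 20
  integer :: ios

  ! status="replace" is an assumption: unlike status="new", it does not
  ! fail when fubar.dat already exists from a previous run.
  open( unit = out_unit, file = "fubar.dat", status = "replace", &
        action = "write", iostat = ios )

  ! A nonzero iostat means the runtime detected an error during open.
  if ( ios /= 0 ) then
    write(*,*) "open failed, iostat = ", ios
    stop 1
  end if

  write(out_unit,*) "test line"
  close(out_unit)
  write(*,*) "open/write/close succeeded"
end program open_check
```

If this also crashes in for_open when built with plain ifort, the problem is likely in the compiler runtime or its interaction with libc rather than in MPICH2; if it runs cleanly, the MPI program's state just before the open is the next place to look.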

2008/9/23 Brian Harker <brian.harker at gmail.com>:
> Hi-
> Thanks for the replies.  When I first encountered this error, I did
> try it with only a single process, and it still aborts.  :(  It's
> definitely not an I/O problem on my machine, as I am running other
> serial code(s) right now with absolutely no problem.  Strange.  Any
> idea what "signal 9" actually is?  I tried some googling, but nothing
> helpful has come up.  I am currently running it under gdb to see if I
> can further isolate where the problem is occurring...
>
> To "The Source": process shouldn't exit at the point where the file is
> opened, and I was careful to make sure MPI_Finalize is in the correct
> place...I have watched it seemingly *try* to open the file, with some
> simple print-to-stdout debugging, and it seems to hang at the file
> open statement.  After what seems like about 1-2 minutes of hangtime,
> it all crashes and I get the error message from my first post.
>
> On Tue, Sep 23, 2008 at 12:22 PM, The Source <thesourcehim at gmail.com> wrote:
>> This error pops up when the process exits without calling MPI_Finalize().
>> Check if process crashes for example.
>>
>> Brian Harker wrote:
>>>
>>> Hello list-
>>>
>>> I have a problem with process 0 being able to open a file for writing
>>> and subsequently write to it.  The pertinent section of code looks as
>>> follows:
>>>
>>> ========================================
>>> if ( proc_id == 0 ) then
>>>
>>>  open( unit = 1, file = "fubar.dat", status="new" )
>>>  do j = 1, ny
>>>    write(1,*) ( array(i,j), i = 1, nx )
>>>  end do
>>>  close(1)
>>>
>>> end if
>>> ========================================
>>>
>>> When this part of the code is reached, the program seems to hang for a
>>> long time while trying to open the file, then spits out the following
>>> error message:
>>>
>>> rank 0 in job 11  $HOSTNAME_#####  caused collective abort of all ranks
>>>   exit status of rank 0: killed by signal 9
>>>
>>> I am confused about this error, because it is seemingly isolated to
>>> this particular write-to-file by process 0.  During execution, my
>>> slave processes write out other files using this exact same syntax.
>>> Has anyone run across this?  I can't seem to find any useful
>>> information on the interweb.  I have run into this problem with both
>>> MPICH2-1.0.6p1 and MPICH2-1.0.7.  I am using the Intel fortran
>>> compiler, ifort 10.1.012.
>>>
>>> Thanks in advance for any input!

-- 
Cheers,
Brian
brian.harker at gmail.com

"In science, there is only physics; all the rest is stamp-collecting."
 -Ernest Rutherford