[mpich-discuss] File I/O causing collective abort of all ranks
Brian Harker
brian.harker at gmail.com
Tue Sep 23 13:49:46 CDT 2008
Ok, have some more info:
gdb spits out the following:
0: Program received signal SIGSEGV, Segmentation fault.
0: 0x004d3b91 in fileno_unlocked ( ) from /lib/libc.so.6
0: (gdb) where
0: #0 0x004d3b91 in fileno_unlocked ( ) from /lib/libc.so.6
0: #1 0x080be895 in for__open_proc. ( )
0: #2 0x0809de74 in for_open ( )
0: #3 0x0804b377 in mpi_main ( ) at driver.f90:159
0: #4 0x0804a941 in main ( )
So it seems to be a glibc problem?
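
One way to narrow this down further, offered as a sketch rather than a fix: the standalone program below does the same kind of write as the rank-0 block quoted further down, but checks iostat on the open and uses status="replace", since status="new" is an error when fubar.dat already exists. The program name, out_unit, ios, the array dimensions, and the dummy data are all assumptions made for illustration; none of them come from driver.f90. An iostat check cannot catch a segfault inside the runtime library itself, so this only rules out the ordinary failure modes.

```fortran
! Sketch only: a defensive, MPI-free variant of the rank-0 write.
! All names and sizes here are illustrative, not taken from driver.f90.
program open_check
  implicit none
  integer, parameter :: nx = 3, ny = 2
  integer, parameter :: out_unit = 20   ! assumed to be an unused unit number
  integer :: i, j, ios
  real :: array(nx, ny)

  array = 0.0                           ! dummy data

  ! status="replace" works whether or not fubar.dat already exists;
  ! iostat returns a nonzero code instead of aborting if the open fails.
  open( unit = out_unit, file = "fubar.dat", status = "replace", &
        action = "write", iostat = ios )
  if ( ios /= 0 ) then
     write(*,*) "open failed, iostat = ", ios
     stop 1
  end if

  ! One row of nx values per line, for j = 1..ny.
  do j = 1, ny
     write(out_unit,*) ( array(i,j), i = 1, nx )
  end do
  close(out_unit)
end program open_check
```

If this runs cleanly when built with the same compiler but without MPI, the file system and the open statement itself are probably fine, and the next place to look would be how the MPI build is compiled and linked. That last part is a guess, not something the backtrace proves.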
2008/9/23 Brian Harker <brian.harker at gmail.com>:
> Hi-
> Thanks for the replies. When I first encountered this error, I did
> try it with only a single process, and it still aborts. :( It's
> definitely not an I/O problem on my machine, as I am running other
> serial code(s) right now with absolutely no problem. Strange. Any
> idea what "signal 9" actually is? I tried some googling, but nothing
> helpful has come up. I am currently running it under gdb to see if I
> can further isolate where the problem is occurring...
>
> To "The Source": the process shouldn't exit at the point where the file is
> opened, and I was careful to make sure MPI_Finalize is in the correct
> place...I have watched it seemingly *try* to open the file, with some
> simple print-to-stdout debugging, and it seems to hang at the file
> open statement. After what seems like about 1-2 minutes of hangtime,
> it all crashes and I get the error message from my first post.
>
> On Tue, Sep 23, 2008 at 12:22 PM, The Source <thesourcehim at gmail.com> wrote:
>> This error pops up when the process exits without calling MPI_Finalize().
>> Check whether the process crashes, for example.
>>
>> Brian Harker wrote:
>>>
>>> Hello list-
>>>
>>> I have a problem with process 0 being able to open a file for writing
>>> and subsequently write to it. The pertinent section of code looks as
>>> follows:
>>>
>>> ========================================
>>> if ( proc_id == 0 ) then
>>>
>>> open( unit = 1, file = "fubar.dat", status="new" )
>>> do j = 1, ny
>>> write(1,*) ( array(i,j), i = 1, nx )
>>> end do
>>> close(1)
>>>
>>> end if
>>> ========================================
>>>
>>> When this part of the code is reached, the program seems to hang for a
>>> long time while trying to open the file, then spits out the following
>>> error message:
>>>
>>> rank 0 in job 11 $HOSTNAME_##### caused collective abort of all ranks
>>> exit status of rank 0: killed by signal 9
>>>
>>> I am confused about this error, because it is seemingly isolated to
>>> this particular write-to-file by process 0. During execution, my
>>> slave processes write out other files using this exact same syntax.
>>> Has anyone run across this? I can't seem to find any useful
>>> information on the interweb. I have run into this problem with both
>>> MPICH2-1.0.6p1 and MPICH2-1.0.7. I am using the Intel fortran
>>> compiler, ifort 10.1.012.
>>>
>>> Thanks in advance for any input!
> --
> Cheers,
> Brian
> brian.harker at gmail.com
>
>
> "In science, there is only physics; all the rest is stamp-collecting."
> -Ernest Rutherford
>