[mpich-discuss] Problems running mpi

Dmitri Chubarov dmitri.chubarov at gmail.com
Sat Oct 10 02:25:35 CDT 2009


Hello, Fernando,

the default MPI error handler aborts the process on an error. Since the
call returns successfully, a nonzero status.MPI_ERROR might indicate memory
corruption. It is hard to tell without seeing the code. Could you
isolate the problem into a small piece of code that you could post on the
list?

Best,
  Dmitri


On Thu, Oct 8, 2009 at 9:43 AM, Fernando Saez <saezfernando at gmail.com> wrote:

> Hello Dmitri,
>
> I do not intend for rank 0 to send a message to itself.
>
> I run these tests on a single machine that simulates several nodes
> (mpirun -np 4). The machine runs Debian etch, with gcc 4.1 and mpich 1.0.6.
>
> When I run the test, I've noticed that MPI_Recv returns successfully (no
> MPI error), but status.MPI_ERROR is nonzero. Can this happen?
>
> thanks,
>
> Fernando
>
>
>
> On Wed, Oct 7, 2009 at 3:18 PM, Dmitri Chubarov <dmitri.chubarov at gmail.com> wrote:
>
>> Hello, Fernando,
>>
>> this does not look familiar to me, though something caught my attention.
>> It seems that you call MPI_Recv on the process with rank 0 to receive a
>> message from itself. Is this what is intended?
>>
>> Error 22 would normally stand for "Invalid argument". This error would be
>> returned by, for example, a bind call on a socket that is already bound.
>>
>> Could you post more information and describe your configuration in more
>> detail?
>>
>> Hope this helps,
>>   Dima
>>
>>
>> On Tue, Oct 6, 2009 at 11:01 PM, Fernando Saez <saezfernando at gmail.com> wrote:
>>
>>> Dear MPICH discussion group
>>>
>>> I am trying to run an MPI program, but it fails with the following error:
>>>
>>> 1: Fatal error in MPI_Recv: Other MPI error, error stack:
>>> 1: MPI_Recv(186)................: MPI_Recv(buf=0xbfdf70e8, count=52,
>>> MPI_DOUBLE, src=0, tag=0, MPI_COMM_WORLD, status=0xbfdf6f34) failed
>>> 1: MPIDI_CH3i_Progress_wait(207): sock_wait failed
>>> 1: MPIDU_Sock_wait(202).........: unexpected operating system error
>>> (errno=22:(strerror() not found))
>>> rank 0 in job 71  lidic01.unsl.edu.ar_39689   caused collective abort of
>>> all ranks
>>>   exit status of rank 0: killed by signal 11
>>>
>>> The program executes fine with a smaller input size, but when I grow the
>>> size it crashes.
>>>
>>> Let me know if this error sounds familiar to you and if you have any
>>> suggestions for what to do here.
>>>
>>> Thanks,
>>>
>>> Fernando
>>>
>>
>>
>
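
A note on the errno=22 mentioned in the quoted messages above: the error stack prints "(strerror() not found)" instead of a message, so it may help to see the case Dima describes in isolation. The following is only a minimal sketch, assuming Linux (where EINVAL is 22) and plain POSIX sockets. The helper name errno_of_second_bind is hypothetical and has nothing to do with MPICH internals.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): binds one TCP socket twice
 * and returns the errno left by the second bind(). Returns -1 if the
 * setup failed before the second bind could be attempted, and 0 if the
 * second bind unexpectedly succeeded. */
int errno_of_second_bind(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);
        return -1;
    }

    /* The socket already has an address, so a second bind is rejected. */
    int saved = 0;
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0)
        saved = errno; /* save before close() can overwrite errno */

    close(fd);
    return saved;
}
```

Calling errno_of_second_bind() from a small main and passing the result to strerror() should give the "Invalid argument" text on Linux. EINVAL has many other causes, so this reproduces only the one scenario named in the thread and says nothing about what actually went wrong inside sock_wait.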