[mpich-discuss] mpich problem.... net_send: could not write to fd=4, errno = 32

Luís Miranda luistm at gmail.com
Fri Feb 6 05:38:46 CST 2009


Hi!

@Gus Correa
My machine runs the 2.6.18-92.1.13.el5 kernel.

I resolved my problem by switching to the latest version of MPICH2.



Luís


2009/2/5 Gus Correa <gus at ldeo.columbia.edu>

> PS - Luís, list:
>
> For what it is worth,
> I can run your code with MPICH-1 on a Rocks 4.3 cluster
> with the 2.6.9-55.0.2.ELsmp kernel (see results below),
> just as Siegmar did.
>
> It would be interesting to know which kernel is installed on
> your machine and on Siegmar's.
> It was argued on the Rocks list
> that MPICH-1 wouldn't work with more recent kernels
> (such as the ones in Rocks 5.1 / CentOS 5.2).
>
> See the same threads that I mentioned before:
> http://marc.info/?l=npaci-rocks-discussion&m=123124666119400&w=2
> http://marc.info/?l=npaci-rocks-discussion&m=123110011830125&w=2
>
> From what was reported there also,
> MPICH-2 with the sockets communication channel
> produced errors on Rocks 5.1 / CentOS 5.2 too.
>
> However, MPICH-2 with Nemesis seems to have worked fine,
> so it is better to upgrade to it.
>
> My second two cents.
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
>
> ******** output of Luis' program *****
>
> Thu Feb  5 16:59:02 EST 2009
> executing ...
> i'm process 2 de 4...
> SLAVE 2: trying to receive message...
> SLAVE 2 MAQUINA compute-0-0.local: receive message 1
>  i'm process 3 de 4...
> SLAVE 3: trying to receive message...
> SLAVE 3 MAQUINA compute-0-0.local: receive message 1
>  i'm process 1 de 4...
> SLAVE 1: trying to receive message...
> SLAVE 1 MAQUINA compute-0-1.local: receive message 1
>  i'm process 0 de 4...
> ROOT: trying to send message...
> ROOT: trying to send message...
> ROOT: trying to send message...
> tlm ended at:
> Thu Feb  5 16:59:04 EST 2009
>
> ********
>
>
> Gus Correa wrote:
>
>> Hi Luis, Siegmar, Rajeev, and list
>>
>> Just some wild guesses.
>> Is your cluster a ROCKS 5.1, or does it use CentOS 5.2 or RHEL 5.2?
>> Somebody using MPICH-1 recently posted similar, hard-to-explain
>> p4 errors on the Rocks mailing list.
>> The person was just trying to run the cpi.c example.
>>
>> A number of people there, including myself,
>> recommended switching from MPICH-1 to MPICH-2 (with nemesis).
>> When this was done, the problem was solved.
>>
>> See these threads:
>> http://marc.info/?l=npaci-rocks-discussion&m=123124666119400&w=2
>> http://marc.info/?l=npaci-rocks-discussion&m=123110011830125&w=2
>>
>> My two cents,
>>
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
>>
>> Siegmar Gross wrote:
>>
>>>> Hi. I'm trying to run this:
>>>>  /opt/mpich/gnu/bin/mpirun -v -np 2    -machinefile program
>>>>
>>>> but i get this error:
>>>>
>>>> i'm process 0 de 2...
>>>> ROOT:  trying to send message...
>>>> p0_26706:  p4_error: interrupt SIGSEGV: 11
>>>> Killed by signal 2.
>>>> p0_26706: (0.113281) net_send: could not write to fd=4, errno = 32
>>>>
>>>
>>> I have no problems with your code.
>>>
>>> linpc1 fd1026 69 which mpicc
>>> /usr/local/mpich-1.2.5.2/bin/mpicc
>>> linpc1 fd1026 70 mpicc x.c
>>>
>>> linpc1 fd1026 71 mpirun -np 3 a.out
>>> i'm process 0 de 3...
>>> ROOT: trying to send message...
>>> ROOT: trying to send message...
>>> i'm process 1 de 3...
>>> SLAVE 1: trying to receive message...
>>> SLAVE 1 MAQUINA linpc0.informatik.hs-fulda.de: receive message 1
>>>  i'm process 2 de 3...
>>> SLAVE 2: trying to receive message...
>>> SLAVE 2 MAQUINA linpc0.informatik.hs-fulda.de: receive message 1
>>>
>>>  linpc1 fd1026 72 mpirun -machinefile x.machines -np 3 a.out
>>> i'm process 0 de 3...
>>> ROOT: trying to send message...
>>> ROOT: trying to send message...
>>> i'm process 2 de 3...
>>> SLAVE 2: trying to receive message...
>>> SLAVE 2 MAQUINA linpc3.informatik.hs-fulda.de: receive message 1
>>>  i'm process 1 de 3...
>>> SLAVE 1: trying to receive message...
>>> SLAVE 1 MAQUINA linpc2.informatik.hs-fulda.de: receive message 1
>>>
>>>  linpc1 fd1026 73 mpirun -v -machinefile x.machines -np 2 a.out
>>> running /home/fd1026/a.out on 2 LINUX ch_p4 processors
>>> Created /home/fd1026/PI28729
>>> i'm process 0 de 2...
>>> ROOT: trying to send message...
>>> i'm process 1 de 2...
>>> SLAVE 1: trying to receive message...
>>> SLAVE 1 MAQUINA linpc2.informatik.hs-fulda.de: receive message 1
>>>  linpc1 fd1026 74
>>>
>>> Siegmar
>>>
>>
>