[mpich-discuss] mpich problem.... net_send: could not write to fd=4, errno = 32
Gus Correa
gus at ldeo.columbia.edu
Thu Feb 5 16:55:52 CST 2009
PS - Luiz, list:
For what it is worth,
I can run your code with MPICH-1 on a Rocks 4.3 cluster
with the 2.6.9-55.0.2.ELsmp kernel. (See results below.)
Just like Siegmar did.
It would be interesting to know which kernel is installed
on your machine and on Siegmar's.
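(Running "uname -r" on the head node and on the compute nodes
will show it.)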
It was argued on the Rocks list
that MPICH-1 wouldn't work with more recent kernels
(such as the ones in Rocks 5.1 / CentOS 5.2).
See the same threads that I mentioned before:
http://marc.info/?l=npaci-rocks-discussion&m=123124666119400&w=2
http://marc.info/?l=npaci-rocks-discussion&m=123110011830125&w=2
From what was reported there, MPICH-2 with the sockets
communication channel also produced errors on Rocks 5.1 / CentOS 5.2.
However, MPICH-2 with the Nemesis channel seems to have worked fine,
so it is better to upgrade to that.
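If you go that route, selecting Nemesis is just a configure option.
Something along these lines should do it (the install prefix below is
only an example, adjust it to taste):

  ./configure --prefix=/opt/mpich2-nemesis --with-device=ch3:nemesis
  make
  make install

Then rebuild your program with that installation's mpicc and launch it
with the matching mpiexec/mpirun.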
My second two cents.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
******** output of Luis' program ********
Thu Feb 5 16:59:02 EST 2009
executing ...
i'm process 2 de 4...
SLAVE 2: trying to receive message...
SLAVE 2 MAQUINA compute-0-0.local: receive message 1
i'm process 3 de 4...
SLAVE 3: trying to receive message...
SLAVE 3 MAQUINA compute-0-0.local: receive message 1
i'm process 1 de 4...
SLAVE 1: trying to receive message...
SLAVE 1 MAQUINA compute-0-1.local: receive message 1
i'm process 0 de 4...
ROOT: trying to send message...
ROOT: trying to send message...
ROOT: trying to send message...
tlm ended at:
Thu Feb 5 16:59:04 EST 2009
********
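Luis did not post his source here, but judging from the output above
the test seems to be doing roughly the following. This is only a
minimal sketch under that assumption (message contents, tags, and the
exact print strings are guesses):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, msg = 1, i, hostlen;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &hostlen);

    printf("i'm process %d de %d...\n", rank, size);

    if (rank == 0) {
        /* root sends one message to each slave */
        for (i = 1; i < size; i++) {
            printf("ROOT: trying to send message...\n");
            MPI_Send(&msg, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        }
    } else {
        /* slaves receive one message from the root and report the host */
        printf("SLAVE %d: trying to receive message...\n", rank);
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("SLAVE %d MAQUINA %s: receive message %d\n", rank, host, msg);
    }

    MPI_Finalize();
    return 0;
}

If a test like this also fails on Luis' cluster with the same SIGSEGV /
net_send error, that would be one more hint that the problem is in the
environment (kernel / ch_p4) rather than in the code.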
Gus Correa wrote:
> Hi Luis, Siegmar, Rajeev, and list
>
> Just some wild guesses.
> Is your cluster a ROCKS 5.1, or does it use CentOS 5.2 or RHEL 5.2?
> Somebody using MPICH-1 recently posted similar, hard-to-explain
> p4 errors on the Rocks mailing list.
> The person was just trying to run the cpi.c example.
>
> A number of people there, including myself,
> recommended switching from MPICH-1 to MPICH-2 (with nemesis).
> When this was done, the problem was solved.
>
> See these threads:
> http://marc.info/?l=npaci-rocks-discussion&m=123124666119400&w=2
> http://marc.info/?l=npaci-rocks-discussion&m=123110011830125&w=2
>
> My two cents,
>
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> Siegmar Gross wrote:
>>> Hi. I'm trying to run this:
>>> /opt/mpich/gnu/bin/mpirun -v -np 2 -machinefile program
>>>
>>> but i get this error:
>>>
>>> i'm process 0 de 2...
>>> ROOT: trying to send message...
>>> p0_26706: p4_error: interrupt SIGSEGV: 11
>>> Killed by signal 2.
>>> p0_26706: (0.113281) net_send: could not write to fd=4, errno = 32
>>
>> I have no problems with your code.
>>
>> linpc1 fd1026 69 which mpicc
>> /usr/local/mpich-1.2.5.2/bin/mpicc
>> linpc1 fd1026 70 mpicc x.c
>>
>> linpc1 fd1026 71 mpirun -np 3 a.out
>> i'm process 0 de 3...
>> ROOT: trying to send message...
>> ROOT: trying to send message...
>> i'm process 1 de 3...
>> SLAVE 1: trying to receive message...
>> SLAVE 1 MAQUINA linpc0.informatik.hs-fulda.de: receive message 1
>> i'm process 2 de 3...
>> SLAVE 2: trying to receive message...
>> SLAVE 2 MAQUINA linpc0.informatik.hs-fulda.de: receive message 1
>>
>> linpc1 fd1026 72 mpirun -machinefile x.machines -np 3 a.out
>> i'm process 0 de 3...
>> ROOT: trying to send message...
>> ROOT: trying to send message...
>> i'm process 2 de 3...
>> SLAVE 2: trying to receive message...
>> SLAVE 2 MAQUINA linpc3.informatik.hs-fulda.de: receive message 1
>> i'm process 1 de 3...
>> SLAVE 1: trying to receive message...
>> SLAVE 1 MAQUINA linpc2.informatik.hs-fulda.de: receive message 1
>>
>> linpc1 fd1026 73 mpirun -v -machinefile x.machines -np 2 a.out
>> running /home/fd1026/a.out on 2 LINUX ch_p4 processors
>> Created /home/fd1026/PI28729
>> i'm process 0 de 2...
>> ROOT: trying to send message...
>> i'm process 1 de 2...
>> SLAVE 1: trying to receive message...
>> SLAVE 1 MAQUINA linpc2.informatik.hs-fulda.de: receive message 1
>> linpc1 fd1026 74
>>
>> Siegmar