[mpich-discuss] mpich problem.... net_send: could not write to fd=4, errno = 32

Gus Correa gus at ldeo.columbia.edu
Thu Feb 5 14:47:02 CST 2009


Hi Luis, Siegmar, Rajeev, and list

Just some wild guesses.
Is your cluster running ROCKS 5.1, or does it use CentOS 5.2 or RHEL 5.2?
Somebody using MPICH-1 recently reported similar, hard-to-explain
p4 errors on the Rocks mailing list.
That person was only trying to run the cpi.c example.

A number of people there, myself included,
recommended switching from MPICH-1 to MPICH-2 (with the nemesis channel).
Once that was done, the problem went away.

See these threads:
http://marc.info/?l=npaci-rocks-discussion&m=123124666119400&w=2
http://marc.info/?l=npaci-rocks-discussion&m=123110011830125&w=2

My two cents,

Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Siegmar Gross wrote:
>> Hi. I'm trying to run this:
>>  /opt/mpich/gnu/bin/mpirun -v -np 2    -machinefile program
>>
>> but i get this error:
>>
>> i'm process 0 de 2...
>> ROOT:  trying to send message...
>> p0_26706:  p4_error: interrupt SIGSEGV: 11
>> Killed by signal 2.
>> p0_26706: (0.113281) net_send: could not write to fd=4, errno = 32
> 
> I have no problems with your code.
> 
> linpc1 fd1026 69 which mpicc
> /usr/local/mpich-1.2.5.2/bin/mpicc
> linpc1 fd1026 70 mpicc x.c
> 
> linpc1 fd1026 71 mpirun -np 3 a.out
> i'm process 0 de 3...
> ROOT: trying to send message...
> ROOT: trying to send message...
> i'm process 1 de 3...
> SLAVE 1: trying to receive message...
> SLAVE 1 MAQUINA linpc0.informatik.hs-fulda.de: receive message 1
>  i'm process 2 de 3...
> SLAVE 2: trying to receive message...
> SLAVE 2 MAQUINA linpc0.informatik.hs-fulda.de: receive message 1
> 
>  linpc1 fd1026 72 mpirun -machinefile x.machines -np 3 a.out
> i'm process 0 de 3...
> ROOT: trying to send message...
> ROOT: trying to send message...
> i'm process 2 de 3...
> SLAVE 2: trying to receive message...
> SLAVE 2 MAQUINA linpc3.informatik.hs-fulda.de: receive message 1
>  i'm process 1 de 3...
> SLAVE 1: trying to receive message...
> SLAVE 1 MAQUINA linpc2.informatik.hs-fulda.de: receive message 1
> 
>  linpc1 fd1026 73 mpirun -v -machinefile x.machines -np 2 a.out
> running /home/fd1026/a.out on 2 LINUX ch_p4 processors
> Created /home/fd1026/PI28729
> i'm process 0 de 2...
> ROOT: trying to send message...
> i'm process 1 de 2...
> SLAVE 1: trying to receive message...
> SLAVE 1 MAQUINA linpc2.informatik.hs-fulda.de: receive message 1
>  linpc1 fd1026 74 
> 
> 
> Siegmar
