[MPICH] Problems running mpd with n > mpd processes

Tony Keating akeating at eng.umd.edu
Mon Sep 19 15:58:56 CDT 2005


Hi,

I'm trying to get mpd up and running on a small cluster (two dual-processor nodes).

I have it working fine with one process per mpd (i.e. one per box), 
but I'm having difficulties when running two processes per mpd. 
Here is some more info:

On the head node:

~# mpd --ifhn=192.168.1.1
barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
barolo.umd.edu_mpdman_2: conn error in connect_lhs: Connection refused
barolo.umd.edu_mpdman_2 (connect_lhs 542): failed to connect to lhs at 127.0.0.1 33093
barolo.umd.edu_mpdman_2 (run 172): lhs connect failed

Running 2 processes works fine. With 4, the job just hangs, the head 
node's mpd prints the errors above, and I have to press Ctrl-C to break out:

~# mpdrun -n 2 hostname
barolo.umd.edu
c01
~# mpdrun -n 4 hostname
mpdrun_barolo.umd.edu (mpdrun 276): mpdrun: failed to obtain sock from manager

On the other node (c01):

~# mpd -h barolo.umd.edu -p 33450
c01_mpdman_3 (connect_lhs 554): invalid challenge from 192.168.1.1 33471: {}
c01_mpdman_3 (run 155): lhs connect failed

Anybody have any ideas? I have a feeling it has to do with the 
networking setup here, but I'm not 100% sure how to fix it.
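
One way to test that hunch: the 127.0.0.1 in the head node's error 
suggests the hostname may be resolving to the loopback address there. 
Below is a minimal sketch of a check that could be run on each node, 
assuming Python 3 is available. The helper names are made up for 
illustration and are not part of mpd.

```python
import ipaddress
import socket


def is_loopback(addr):
    """Return True if addr (an IPv4 or IPv6 string) is a loopback address."""
    return ipaddress.ip_address(addr).is_loopback


def resolve_local_hostname():
    """Return (hostname, address) as this machine resolves its own name.

    address is None if the name does not resolve at all.
    """
    name = socket.gethostname()
    try:
        addr = socket.gethostbyname(name)
    except OSError:
        addr = None
    return name, addr


if __name__ == "__main__":
    name, addr = resolve_local_hostname()
    if addr is None:
        print(f"{name} does not resolve")
    elif is_loopback(addr):
        # Other nodes cannot reach a peer that advertises a loopback address.
        print(f"{name} resolves to loopback address {addr}")
    else:
        print(f"{name} resolves to {addr}")
```

If the head node reports a loopback address here, the hostname entry in 
/etc/hosts would be the first thing I'd look at.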

Tony.



