[MPICH] Problems running mpd with n > mpd processes

Philip Sydney Lavers psl02 at uow.edu.au
Tue Sep 20 08:07:31 CDT 2005


Hello Tony and folks,

> I'm trying to get mpd up and running on a small (2 dual
> processor) cluster.
etc.

I saw similar behaviour on my cluster, which consists of one dual-processor
machine (Opterons) and four single-processor nodes (Athlon64).

The OS is FreeBSD 5.4 throughout, with the kernel compiled for SMP on
the dual Opteron. Four of the machines have additional network cards
and WERE set up to work with the campus network, with hostnames
designated for my static IP addresses by the campus IT administration
(for example, eng156-4.eng.uow.edu.au).

The cluster is on a 192.168.1.xx network in which each IP address is
given a simple short name (e.g. Paul1, claude6, etc.). I could always
start up mpd on each machine and get sensible results from
mpdtrace -l, but got the same results as you are getting when I tried
to run a parallel job.
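For reference, this is roughly the sequence for bringing up the ring
and checking it by hand (the hostname and port number below are
illustrative, not the actual values from my machines):

```shell
# On the first machine, start an mpd to anchor the ring
mpd &

# Ask it where it is listening; mpdtrace -l prints entries
# of the form hostname_port (address)
mpdtrace -l

# On each remaining node, join the ring using that host and port
mpd -h Paul1 -p 42817 &

# Back on any node, confirm every host now appears in the ring
mpdtrace -l
```

The point is that mpdtrace can look healthy even when the hostnames
the daemons advertise (campus FQDNs vs. short cluster names) will
later confuse mpiexec, which is what the hostname change below fixes.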

My solution was to set up all my machines on the campus network and
download any updates etc. Then I made sure the cluster network was
working, THEN changed the hostname on each machine to the short name
associated with my local network.
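On FreeBSD the permanent hostname change is one line in /etc/rc.conf
(the short name below is an example, not one of my actual node names):

```shell
# /etc/rc.conf -- replace the campus FQDN with the short cluster name
hostname="Paul1"

# rc.conf is read at boot; to change the name immediately as well, run:
#   hostname Paul1
```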

This gives an annoying delay with sendmail if the machines are
rebooted, BUT MPICH2 works perfectly. Interestingly, FreeBSD/MPICH
only calls on the second processor in the Opteron machine after all
the other nodes have been utilised. Thus, with the dual-processor
Opteron and four nodes connected, if I request
"mpiexec -n 5 ./executable_programme", top shows each node working on
executable_programme, but only processor 0 on the dual machine. If I
request "mpiexec -n 6 ./executable_programme", both processors on the
dual machine churn away at the task.

Hoping my experience sheds some light,

regards,

Phil Lavers
