You need to run the mpdcheck tests on every pair of compute nodes.
Alternatively, to isolate the problem, try running on a smaller set of
nodes first and add one node at a time until it fails.
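
To keep track of which combinations have been tested, a small script can
list them. The following is only a sketch: the hostnames other than
node0001 are assumed to follow the same naming pattern, and the helper
names pair_checks and growing_sets are made up for illustration. It
prints the mpdcheck commands to run by hand for each ordered pair
(server, client), then the node sets to try when growing the ring one
node at a time.

```python
from itertools import permutations

# Hypothetical hostnames: only node0001 appears in the original message;
# the other five are assumed to follow the same pattern.
NODES = [f"enming-f11-pv-hpc-node{i:04d}" for i in range(1, 7)]


def pair_checks(nodes):
    """Return every ordered (server, client) pair of distinct nodes."""
    return list(permutations(nodes, 2))


def growing_sets(nodes, start=2):
    """Return node subsets that grow by one node at a time."""
    return [nodes[:n] for n in range(start, len(nodes) + 1)]


def main():
    print("Pairwise mpdcheck checklist:")
    for server, client in pair_checks(NODES):
        print(f"  on {server}: mpdcheck -s")
        # The host and port to pass are the ones printed by the server side.
        print(f"  on {client}: mpdcheck -c <host> <port>")
    print("Incremental ring sizes:")
    for subset in growing_sets(NODES):
        print(f"  {len(subset)} nodes: {', '.join(subset)}")


if __name__ == "__main__":
    main()
```

For each node set in the second list, start a ring with only those
nodes and rerun the failing mpiexec command. The first set that fails
points at the node added last.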


I have just installed MPICH2 in my Xen-based virtual machines.

My hardware configuration is as follows:

Processor: Intel Pentium Dual Core E6300 @ 2.8 GHz
Motherboard: Intel Desktop Board DQ45CB BIOS 0093
Memory: 4X 2GB Kingston DDR2-800 CL5

My software configuration is as follows:

Xen Hypervisor / Virtual Machine Monitor Version: 3.5-unstable
Jeremy Fitzhardinge's pv-ops dom0 kernel:
Host Operating System: Fedora Linux 11 x86-64 (SELinux disabled)
Guest Operating Systems: Fedora Linux 11 x86-64 paravirtualized (PV)
domU guests (SELinux disabled)

I have successfully configured, built and installed MPICH2 in an F11 PV
guest OS, master compute node 1, which also runs an NFS server (MPICH2
bin subdirectory exported). The other 5 compute nodes access the MPICH2
binaries by mounting the NFS share from node 1. Please see attached
c.txt, m.txt and mi.txt. With Xen virtualization, I have created 6 F11
Linux PV guests to simulate 6 HPC compute nodes. The network adapter
(NIC) in each guest OS is virtual. The Xen networking type is bridged.
Running "lspci -v" and "lsusb" in each guest OS shows nothing.

Following the Appendix A troubleshooting section of the MPICH2 install
guide, I have verified that the 2-node test scenario with "mpdcheck -s"
and "mpdcheck -c" works. The 2 nodes, acting as server and client
respectively, can communicate with each other without problems. I have
also tested mpdboot with the 2-node scenario and the ring of mpds is
working.

After the troubleshooting process, I successfully created a ring of
mpds involving all 6 compute nodes. "mpdtrace -l" lists all 6 nodes.
However, when I try to run a job with mpiexec, I get the following
error:

[enming@enming-f11-pv-hpc-node0001 ~]$ mpiexec -n 2 examples/cpi
mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from mpd when expecting ack of request

I have also tried starting the mpd ring as the root user, but I still
encounter the same error.

Thank you.

PS. config.log is also attached.

