Dear All,<br><br>I have solved the problem. <br><br>With reference to <a href="http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01327.html" target="_blank">http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01327.html</a> and <a href="http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices" target="_blank">http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices</a>
, I executed the following command as root on all 6 of my compute
nodes (each compute node is a 64-bit Fedora 11 Linux PV virtual machine).<br>
<br># <code>ethtool -K eth0 tx off gso on</code><br><br>Now I can successfully run mpiexec to execute MPI and non-MPI jobs on my Virtual HPC Compute Cluster.<br><br><span class="sliceCur"><strong>Topic: [Xen-users] Using Xen Virtualization
Environment for Development and Testing of Supercomputing and High
Performance Computing (HPC) Cluster MPICH2 MPI-2 Applications<br><br><span style="font-weight: normal;">URL: <a href="http://lists.xensource.com/archives/html/xen-devel/2009-10/msg01440.html">http://lists.xensource.com/archives/html/xen-devel/2009-10/msg01440.html</a></span><br>
</strong></span><br>-- <br>Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)(Mechanical Engineering)<br>Alma Maters:<br>(1) Singapore Polytechnic<br>(2) National University of Singapore<br>My blog URL: <a href="http://teo-en-ming-aka-zhang-enming.blogspot.com">http://teo-en-ming-aka-zhang-enming.blogspot.com</a><br>
My Youtube videos: <a href="http://www.youtube.com/user/enmingteo">http://www.youtube.com/user/enmingteo</a><br>Email: <a href="mailto:space.time.universe@gmail.com">space.time.universe@gmail.com</a><br>MSN: <a href="mailto:teoenming@hotmail.com">teoenming@hotmail.com</a><br>
Mobile Phone (SingTel): +65-9648-9798<br>Mobile Phone (Starhub Prepaid): +65-8369-2618<br>Age: 31 (as at 30 Oct 2009)<br>Height: 1.78 meters<br>Race: Chinese<br>Dialect: Hokkien<br>Street: Bedok Reservoir Road<br>Country: Singapore<br>
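<br>For anyone who wants to repeat the fix on several nodes, here is a minimal Python sketch that only builds the ssh command lines and does not run anything. The hostnames follow the naming used in this thread. Logging in as root over ssh is an assumption, and the helper names offload_command and ssh_commands are made up for illustration.<br>

```python
import shlex

# Hostnames follow the naming seen in this thread; adjust for your own cluster.
NODES = [f"enming-f11-pv-hpc-node{i:04d}" for i in range(1, 7)]


def offload_command(iface="eth0"):
    """Return the ethtool invocation quoted in this post as an argv list."""
    return ["ethtool", "-K", iface, "tx", "off", "gso", "on"]


def ssh_commands(nodes=NODES, iface="eth0", user="root"):
    """Build one ssh command per node. Nothing is executed here.

    Assumption: the given user may log in over ssh and run ethtool.
    """
    remote = shlex.join(offload_command(iface))
    return [["ssh", f"{user}@{node}", remote] for node in nodes]


if __name__ == "__main__":
    for argv in ssh_commands():
        # Print the commands for review before running them by hand.
        print(shlex.join(argv))
```

Printing the commands first makes it easy to check them before touching any node. The setting is not persistent across reboots unless it is also added to the guest's network start-up configuration.<br>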
<br><br><br><div class="gmail_quote">On Fri, Oct 30, 2009 at 4:13 PM, Mr. Teo En Ming (Zhang Enming) <span dir="ltr"><<a href="mailto:space.time.universe@gmail.com">space.time.universe@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Dear All,<br><br>I have googled something which may help to solve my problem.<br><br><h2 style="font-weight: normal;">
<b><font size="2">[Xen-devel] Network drop on domU (netfront: rx->offset: 0,        size: 4294967295)</font></b></h2>
<a href="http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01274.html" target="_blank">http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01274.html</a><br><br><h1><font size="2">Virtualization Tip: Always disable checksumming on virtual ethernet devices</font></h1>
<br><a href="http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices" target="_blank">http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices</a><br>
<br><br>Let me try to work on it first.<div class="im"><br><br></div><div><div></div><div class="h5"><div class="gmail_quote">
On Fri, Oct 30, 2009 at 3:56 PM, Mr. Teo En Ming (Zhang Enming) <span dir="ltr"><<a href="mailto:space.time.universe@gmail.com" target="_blank">space.time.universe@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div>Hi,<br><br>I have reverted to the 2-node troubleshooting scenario. I have started node 1 and node 2.<br>
<br>On
node 1, I will try to bring up the ring of mpd for the 2 nodes using
mpdboot and try to execute mpiexec. On node 2, I will capture the
tcpdump messages on virtual network interface eth0.<br>
<br>Please see attached PNG screenshots. They are numbered in sequence.<br><br>Please check if there are any problems.<br><br></div><div>Thank you.<br><br></div><div><div></div><div><div class="gmail_quote">
On Fri, Oct 30, 2009 at 2:55 PM, Mr. Teo En Ming (Zhang Enming) <span dir="ltr"><<a href="mailto:space.time.universe@gmail.com" target="_blank">space.time.universe@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi,<br><br>Here are more virtual network interface eth0 kernel messages. Notice
the "net eth0: rx->offset: 0" messages. Are they of significance?<br><br><u><b>Node 1</b></u><br><br><div>Oct 30 22:40:34 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.253:1009 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)<br>
Oct 30 22:40:56 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.252:877 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)<br>
Oct 30 22:41:19 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.251:1000 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)<br>
Oct 30 22:41:41 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.250:882 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)<br>
Oct 30 22:42:04 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.249:953 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)<br>
Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd starting; no mpdid yet<br>Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd has mpdid=enming-f11-pv-hpc-node0001_48545 (port=48545)<br>Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:40 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: __ratelimit: 12 callbacks suppressed<br>Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295<br><br><u><b>Node 6</b></u><br>
<br>Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd starting; no mpdid yet<br>Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd has mpdid=enming-f11-pv-hpc-node0006_52805 (port=52805)<br>
Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295<br>Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295<br>
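<br>A note on the logged size: 4294967295 is 2^32 - 1, which is what -1 looks like when it is printed as an unsigned 32-bit number. The short Python check below shows the arithmetic. That the driver is logging a negative status code here is my reading of the number and is an assumption, not something confirmed in this thread.<br>

```python
import ctypes

# Value copied from the "net eth0: rx->offset: 0, size: ..." kernel messages.
LOGGED_SIZE = 4294967295

# Reinterpret the unsigned 32-bit value as a signed 32-bit integer.
as_signed = ctypes.c_int32(LOGGED_SIZE).value

print(hex(LOGGED_SIZE))  # 0xffffffff
print(as_signed)         # -1
```

If that reading is right, the messages would point at rejected receive responses rather than at genuinely huge packets.<br>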
<br><u><b>Node 1 NFS Server Configuration</b></u><br><br>[root@enming-f11-pv-hpc-node0001 ~]# cat /etc/exports <br>/home/enming/mpich2-install/bin 192.168.1.0/24(ro)<br>
<br><u><b>Node 2 /etc/fstab Configuration Entry for NFS Client</b></u><br>
<br>192.168.1.254:/home/enming/mpich2-install/bin /home/enming/mpich2-install/bin nfs rsize=8192,wsize=8192,timeo=14,intr</div><div><br></div><div><div></div><div><div class="gmail_quote">On Fri, Oct 30, 2009 at 2:14 PM, Mr. Teo En Ming (Zhang Enming) <span dir="ltr"><<a href="mailto:space.time.universe@gmail.com" target="_blank">space.time.universe@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi,<br><br>I have noticed that there are Receive Errors (RX-ERR) in all of my 6 compute nodes. It appears that there may be problems with the virtual network interface eth0 in Xen networking.<br>
<br>=================================================<br>
<br>Node 1:<br><br>[root@enming-f11-pv-hpc-node0001 ~]# netstat -i<br>Kernel Interface table<br>Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg<br>eth0 1500 0 5824 27 0 0 5056 0 0 0 BMRU<br>lo 16436 0 127 0 0 0 127 0 0 0 LRU<br>
[root@enming-f11-pv-hpc-node0001 ~]# ps -ef | grep mpd<br>enming 1505 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d<br>root 1650 1576 0 22:07 pts/0 00:00:00 grep mpd<br>
<br>Node 2:<br><br>[root@enming-f11-pv-hpc-node0002 ~]# netstat -i<br>Kernel Interface table<br>Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg<br>eth0 1500 0 1504 7 0 0 1417 0 0 0 BMRU<br>lo 16436 0 44 0 0 0 44 0 0 0 LRU<br>
<br>Node 3:<br><br>[root@enming-f11-pv-hpc-node0003 ~]# netstat -i<br>Kernel Interface table<br>Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg<br>eth0 1500 0 1520 12 0 0 1467 0 0 0 BMRU<br>lo 16436 0 42 0 0 0 42 0 0 0 LRU<br>
<br>Node 4:<br><br>[root@enming-f11-pv-hpc-node0004 ~]# netstat -i<br>Kernel Interface table<br>Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg<br>eth0 1500 0 1528 10 0 0 1514 0 0 0 BMRU<br>lo 16436 0 44 0 0 0 44 0 0 0 LRU<br>
<br>Node 5:<br><br>[root@enming-f11-pv-hpc-node0005 ~]# netstat -i<br>Kernel Interface table<br>Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg<br>eth0 1500 0 1416 11 0 0 1412 0 0 0 BMRU<br>lo 16436 0 44 0 0 0 44 0 0 0 LRU<br>
<br>Node 6:<br><br>[root@enming-f11-pv-hpc-node0006 ~]# netstat -i<br>Kernel Interface table<br>Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg<br>eth0 1500 0 1474 9 0 0 1504 0 0 0 BMRU<br>lo 16436 0 44 0 0 0 44 0 0 0 LRU<br>
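<br>To compare the RX-ERR counters across nodes without reading each table by eye, a small Python sketch like the following could pull the column out of the netstat -i text. The sample is the Node 1 output from this message. The function name rx_errors is made up for illustration, and the parser assumes the column layout shown here.<br>

```python
# Sample text: the Node 1 `netstat -i` output quoted in this message.
SAMPLE = """\
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 5824 27 0 0 5056 0 0 0 BMRU
lo 16436 0 127 0 0 0 127 0 0 0 LRU
"""


def rx_errors(netstat_output):
    """Map interface name to its RX-ERR count from `netstat -i` text."""
    lines = netstat_output.strip().splitlines()
    # Find the column header row, which starts with "Iface".
    header_idx = next(
        i for i, line in enumerate(lines) if line.startswith("Iface")
    )
    columns = lines[header_idx].split()
    err_col = columns.index("RX-ERR")
    result = {}
    for line in lines[header_idx + 1:]:
        fields = line.split()
        # Skip rows that do not match the header layout.
        if len(fields) == len(columns):
            result[fields[0]] = int(fields[err_col])
    return result


print(rx_errors(SAMPLE))  # {'eth0': 27, 'lo': 0}
```

Running it before and after a change to the interface settings gives a quick way to see whether the error counter is still climbing.<br>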
<br>================================================<div><br><br></div><div><div></div><div><div class="gmail_quote">On Fri, Oct 30, 2009 at 2:07 PM, Mr. Teo En Ming (Zhang Enming) <span dir="ltr"><<a href="mailto:space.time.universe@gmail.com" target="_blank">space.time.universe@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">All six compute nodes are identical PV virtual machines.<div><br><br></div><div><div></div><div><div class="gmail_quote">On Fri, Oct 30, 2009 at 2:04 PM, Mr. Teo En Ming (Zhang Enming) <span dir="ltr"><<a href="mailto:space.time.universe@gmail.com" target="_blank">space.time.universe@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi,<br><br>I have changed the communication method from nemesis (high performance network method) to ssm (socket for nodes and shared memory within a node) by recompiling MPICH2. I have also pre-set the MAC address of the virtual network adapter eth0 in each compute node (each compute node is a Xen paravirtualized virtual machine) by configuring the vif directive in each PV domU configuration file.<br>
<br>Additionally, I have also turned off iptables to facilitate troubleshooting and communication between all mpd daemons in each node. SSH without password is possible between all the compute nodes.<br><br>After having done all of the above, I am still encountering the MPIEXEC 392 error.<div>
<br>
<br>mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from mpd when expecting ack of request<br><br></div>=================================================<br><br>Master Node / Compute Node 1:<br><br>[enming@enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd<br>
enming 1499 1455 0 21:44 pts/0 00:00:00 grep mpd<br>[enming@enming-f11-pv-hpc-node0001 ~]$ mpdboot -n 6<br>[enming@enming-f11-pv-hpc-node0001 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>
[enming@enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd<br>enming 1505 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d<br><br>Compute Node 2:<br><br>[enming@enming-f11-pv-hpc-node0002 ~]$ mpdtrace -l<br>
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>[enming@enming-f11-pv-hpc-node0002 ~]$ ps -ef | grep mpd<br>enming 1431 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d<br>
enming 1481 1436 0 21:46 pts/0 00:00:00 grep mpd<br><br>Compute Node 3:<br><br>[enming@enming-f11-pv-hpc-node0003 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>[enming@enming-f11-pv-hpc-node0003 ~]$ ps -ef | grep mpd<br>
enming 1422 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d<br>enming 1473 1427 0 21:47 pts/0 00:00:00 grep mpd<br><br>Compute Node 4: <br>
<br>[enming@enming-f11-pv-hpc-node0004 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>[enming@enming-f11-pv-hpc-node0004 ~]$ ps -ef | grep mpd<br>enming 1432 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d<br>
enming 1482 1437 0 21:47 pts/0 00:00:00 grep mpd<br><br>Compute Node 5:<br><br>[enming@enming-f11-pv-hpc-node0005 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>[enming@enming-f11-pv-hpc-node0005 ~]$ ps -ef | grep mpd<br>
enming 1423 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d<br>enming 1475 1429 0 21:48 pts/0 00:00:00 grep mpd<br><br>Compute Node 6: <br>
<br>[enming@enming-f11-pv-hpc-node0006 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>[enming@enming-f11-pv-hpc-node0006 ~]$ ps -ef | grep mpd<br>enming 1427 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0002 -p 42012 --ncpus=1 -e -d<br>
enming 1477 1432 0 21:49 pts/0 00:00:00 grep mpd<br><br>=================================================<br><br>Should I increase the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py file or should I change the communication method to sock?<br>
<br>The mpiexec 392 error says "no msg recvd from mpd when expecting ack of request". I suspect that the acknowledgement is taking a very long time to arrive and that the MPIEXEC_RECV_TIMEOUT value is too low for it, which would cause the mpiexec 392 error in my case. I am using a virtual network adapter, not a physical Gigabit network adapter.<br>
<br>=================================================<br><br>[root@enming-f11-pv-hpc-node0001 ~]# cat /proc/cpuinfo<br>processor : 0<br>vendor_id : GenuineIntel<br>cpu family : 6<br>model : 23<br>model name : Pentium(R) Dual-Core CPU E6300 @ 2.80GHz<br>
stepping : 10<br>cpu MHz : 2800.098<br>cache size : 2048 KB<br>fpu : yes<br>fpu_exception : yes<br>cpuid level : 13<br>wp : yes<br>flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm<br>
bogomips : 5600.19<br>clflush size : 64<br>cache_alignment : 64<br>address sizes : 36 bits physical, 48 bits virtual<br>power management:<br><br>processor : 1<br>vendor_id : GenuineIntel<br>cpu family : 6<br>
model : 23<br>model name : Pentium(R) Dual-Core CPU E6300 @ 2.80GHz<br>stepping : 10<br>cpu MHz : 2800.098<br>cache size : 2048 KB<br>fpu : yes<br>fpu_exception : yes<br>cpuid level : 13<br>
wp : yes<br>flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm<br>bogomips : 5600.19<br>clflush size : 64<br>
cache_alignment : 64<br>
address sizes : 36 bits physical, 48 bits virtual<br>power management:<br><br>[root@enming-f11-pv-hpc-node0001 ~]# cat /proc/meminfo<br>MemTotal: 532796 kB<br>MemFree: 386156 kB<br>Buffers: 12904 kB<br>
Cached: 48864 kB<br>SwapCached: 0 kB<br>Active: 34884 kB<br>Inactive: 43252 kB<br>Active(anon): 16504 kB<br>Inactive(anon): 0 kB<br>Active(file): 18380 kB<br>Inactive(file): 43252 kB<br>
Unevictable: 0 kB<br>Mlocked: 0 kB<br>SwapTotal: 2195448 kB<br>SwapFree: 2195448 kB<br>Dirty: 12 kB<br>Writeback: 0 kB<br>AnonPages: 16444 kB<br>Mapped: 8864 kB<br>
Slab: 10528 kB<br>SReclaimable: 4668 kB<br>SUnreclaim: 5860 kB<br>PageTables: 2996 kB<br>NFS_Unstable: 0 kB<br>Bounce: 0 kB<br>WritebackTmp: 0 kB<br>CommitLimit: 2461844 kB<br>
Committed_AS: 73024 kB<br>VmallocTotal: 34359738367 kB<br>VmallocUsed: 6332 kB<br>VmallocChunk: 34359724899 kB<br>HugePages_Total: 0<br>HugePages_Free: 0<br>HugePages_Rsvd: 0<br>HugePages_Surp: 0<br>
Hugepagesize: 2048 kB<br>DirectMap4k: 524288 kB<br>DirectMap2M: 0 kB<br>[root@enming-f11-pv-hpc-node0001 ~]# lspci -v<br>[root@enming-f11-pv-hpc-node0001 ~]# lsusb<br>[root@enming-f11-pv-hpc-node0001 ~]# ifconfig eth0<br>
eth0 Link encap:Ethernet HWaddr 00:16:3E:69:E9:11 <br> inet addr:192.168.1.254 Bcast:192.168.1.255 Mask:255.255.255.0<br> inet6 addr: fe80::216:3eff:fe69:e911/64 Scope:Link<br> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br>
RX packets:5518 errors:26 dropped:0 overruns:0 frame:0<br> TX packets:4832 errors:0 dropped:0 overruns:0 carrier:0<br> collisions:0 txqueuelen:1000 <br> RX bytes:872864 (852.4 KiB) TX bytes:3972981 (3.7 MiB)<br>
Interrupt:17 <br><br>[root@enming-f11-pv-hpc-node0001 ~]# ethtool eth0<br>Settings for eth0:<br> Link detected: yes<br>[root@enming-f11-pv-hpc-node0001 ~]# netstat -i<br>Kernel Interface table<br>Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg<br>
eth0 1500 0 5589 26 0 0 4875 0 0 0 BMRU<br>lo 16436 0 127 0 0 0 127 0 0 0 LRU<br>[root@enming-f11-pv-hpc-node0001 ~]# uname -a<br>
Linux enming-f11-pv-hpc-node0001 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux<br>You have new mail in /var/spool/mail/root<br>[root@enming-f11-pv-hpc-node0001 ~]# cat /etc/redhat-release <br>
Fedora release 11 (Leonidas)<br><br>=================================================<br><br>Please advise.<div><br><br>Thank you.<br><br></div><div><div></div><div><div class="gmail_quote">On Fri, Oct 30, 2009 at 11:55 AM, Mr. Teo En Ming (Zhang Enming) <span dir="ltr"><<a href="mailto:space.time.universe@gmail.com" target="_blank">space.time.universe@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi,<br><br>I am getting the same mpiexec 392 error message as Kenneth Yoshimoto from the San Diego Supercomputer Center. His mpich-discuss mailing list topic URL is <a href="http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html" target="_blank">http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html</a><br>
<br>I have actually already performed the 2-node mpdcheck utility test as described in Appendix A.1 of the MPICH2 installation guide. I could start the ring of mpd on the 2-node test scenario using mpdboot successfully as well.<br>
<br>薛正华 (ID: zhxue123) from China reported solving the mpiexec 392 error. According to 薛正华, the cause of the mpiexec 392 error was the absence of a high-performance network in his environment. He changed the default communication method from nemesis to ssm and also increased the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py Python source code. His report is at <a href="http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx" target="_blank">http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx</a><br>
<br>Could this be my problem also?<br><br>Thank you.<div><br><br></div><div class="gmail_quote"><div><div></div><div>On Fri, Oct 30, 2009 at 11:09 AM, Rajeev Thakur <span dir="ltr"><<a href="mailto:thakur@mcs.anl.gov" target="_blank">thakur@mcs.anl.gov</a>></span> wrote:<br>
</div></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div>
<div>
<div dir="ltr" align="left"><span><font face="Arial" size="2" color="#0000ff">You need to do the mpdcheck tests with every pair of compute
nodes. Or to isolate the problem, try running on a smaller set of nodes first
and increase it one at a time until it fails.</font></span></div>
<div dir="ltr" align="left"><span><font face="Arial" size="2" color="#0000ff"></font></span> </div>
<div dir="ltr" align="left"><span><font face="Arial" size="2" color="#0000ff">Rajeev</font></span></div>
<div dir="ltr" align="left"><span></span> </div><br>
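<br>As a rough illustration of the "every pair" suggestion above, the Python sketch below lists the node pairs that would need an mpdcheck server/client test. The hostnames follow the naming used in this thread, and the helper name node_pairs is made up for illustration. The sketch only enumerates pairs and does not run mpdcheck.<br>

```python
from itertools import combinations

# Hostnames follow the naming seen in this thread; adjust for your own cluster.
NODES = [f"enming-f11-pv-hpc-node{i:04d}" for i in range(1, 7)]


def node_pairs(nodes):
    """Return every unordered pair of nodes to test against each other."""
    return list(combinations(nodes, 2))


pairs = node_pairs(NODES)
print(len(pairs))  # 15 pairs for 6 nodes
for server, client in pairs:
    # One node would run the server side and the other the client side.
    print(f"server: {server}  client: {client}")
```

With 6 nodes that is 15 pairs, so growing the ring one node at a time, as also suggested above, may find the failing node with fewer tests.<br>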
<blockquote style="border-left: 2px solid rgb(0, 0, 255); padding-left: 5px; margin-left: 5px; margin-right: 0px;">
<div dir="ltr" lang="en-us" align="left">
<hr>
<font face="Tahoma" size="2"><b>From:</b> <a href="mailto:mpich-discuss-bounces@mcs.anl.gov" target="_blank">mpich-discuss-bounces@mcs.anl.gov</a>
[mailto:<a href="mailto:mpich-discuss-bounces@mcs.anl.gov" target="_blank">mpich-discuss-bounces@mcs.anl.gov</a>] <b>On Behalf Of </b>Mr. Teo En Ming
(Zhang Enming)<br><b>Sent:</b> Thursday, October 29, 2009 2:35
PM<br><b>To:</b> <a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a><br><b>Subject:</b> [mpich-discuss]
(mpiexec 392): no msg recvd from mpd when expecting ack of
request<br></font><br></div><div><div></div><div>
<div></div>Hi,<br><br>I have just installed MPICH2 in my Xen-based virtual
machines.<br><br>My hardware configuration is as follows:<br><br>Processor:
Intel Pentium Dual Core E6300 @ 2.8 GHz<br>Motherboard: Intel Desktop Board
DQ45CB BIOS 0093<br>Memory: 4X 2GB Kingston DDR2-800 CL5<br><br>My software
configuration is as follows:<br><br>Xen Hypervisor / Virtual Machine Monitor
Version: 3.5-unstable<br>Jeremy Fitzhardinge's pv-ops dom0 kernel:
2.6.31.4<br>Host Operating System: Fedora Linux 11 x86-64 (SELinux
disabled)<br>Guest Operating Systems: Fedora Linux 11 x86-64 paravirtualized
(PV) domU guests (SELinux disabled)<br><br>I have successfully configured,
built and installed MPICH2 in a F11 PV guest OS master compute node 1 with NFS
server (MPICH2 bin subdirectory exported). The other 5 compute nodes
access the MPICH2 binaries by mounting the NFS share from node 1. Please
see attached c.txt, m.txt and mi.txt. With Xen virtualization, I have created
6 F11 Linux PV guests to simulate 6 HPC compute nodes. The network adapter
(NIC) in each guest OS is virtual. The Xen networking type is bridged. Running
"lspci -v" and lsusb in each guest OS does not show
anything.<br><br>Following the Appendix A troubleshooting section of the MPICH2
install guide, I have verified that the 2-node test scenario with "mpdcheck
-s" and "mpdcheck -c" works. The 2 nodes can communicate with each other
without problems, with either node acting as the server and the other as the client. I have
also tested mpdboot with the 2-node scenario, and the ring of mpd is
working.<br><br>After the troubleshooting process, I have successfully created
a ring of mpd involving 6 compute nodes. "mpdtrace -l" successfully lists all
the 6 nodes. However, when I want to run a job with mpiexec, it gives me the
following error:<br><br>[enming@enming-f11-pv-hpc-node0001 ~]$ mpiexec -n 2
examples/cpi<br>mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd
from mpd when expecting ack of request<br><br>I have also tried starting the
mpd ring with the root user but I still encounter the same error
above.<br><br>Thank you.<br><br>PS. config.log is also attached.<br clear="all"><br></div></div></blockquote></div>
<br></div></div>_______________________________________________<br>
mpich-discuss mailing list<div><br>
<a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a><br>
</div><a href="https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss" target="_blank">https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss</a><br>
<br></blockquote></div><br><br clear="all"><br><br>
</blockquote></div><br><br clear="all"><br><br>
</div></div></blockquote></div><br><br clear="all"><br><br>
</div></div></blockquote></div><br><br clear="all"><br><br>
</div></div></blockquote></div><br><br clear="all"><br><br>
</div></div></blockquote></div><br><br clear="all"><br><br>
</div></div></blockquote></div><br><br clear="all"><br><br>
</div></div></blockquote></div><br><br clear="all"><br><br>