[mpich-discuss] (mpiexec 392): no msg recvd from mpd when expecting ack of request
Ralph Butler
rbutler at mtsu.edu
Fri Oct 30 07:01:30 CDT 2009
No.
Here is part of the help message from running mpdboot --help:
    --ncpus indicates how many cpus you want to show for the local machine;
    others are listed in the hosts file
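
As a rough illustration of the help text above, the CPU counts for the other machines would go into the hosts file rather than on the mpdboot command line. The sketch below assumes the `hostname:ncpus` line format for mpd hosts files; the short host names and the helper `render_mpd_hosts` are made up for the example.

```python
def render_mpd_hosts(cpus_by_host):
    """Return hosts-file text with one 'hostname:ncpus' entry per line.

    Assumption: the hosts file accepts 'hostname:ncpus' lines.
    """
    lines = []
    for host, ncpus in cpus_by_host.items():
        if ncpus < 1:
            raise ValueError(f"ncpus for {host} must be at least 1")
        lines.append(f"{host}:{ncpus}")
    return "\n".join(lines) + "\n"


# Hypothetical remote nodes with 2 virtual processors each.
remote_nodes = {f"node{i:04d}": 2 for i in range(2, 7)}
hosts_text = render_mpd_hosts(remote_nodes)
print(hosts_text, end="")
```

The local machine is left out of the mapping on purpose, since its count is the one that --ncpus covers.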
On Oct 30, 2009, at 6:31 AM, Mr. Teo En Ming (Zhang Enming) wrote:
> If I execute "mpdboot -n 6 -f mpd.hosts --ncpus=2" on the master
> node, will I get --ncpus=2 for mpd.py on all the remaining compute
> nodes?
>
> My virtual machines have 2 virtual processors each.
>
> --
> Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)
> (Mechanical Engineering)
> Alma Maters:
> (1) Singapore Polytechnic
> (2) National University of Singapore
> My blog URL: http://teo-en-ming-aka-zhang-enming.blogspot.com
> My Youtube videos: http://www.youtube.com/user/enmingteo
> Email: space.time.universe at gmail.com
> MSN: teoenming at hotmail.com
> Mobile Phone (SingTel): +65-9648-9798
> Mobile Phone (Starhub Prepaid): +65-8369-2618
> Age: 31 (as at 30 Oct 2009)
> Height: 1.78 meters
> Race: Chinese
> Dialect: Hokkien
> Street: Bedok Reservoir Road
> Country: Singapore
>
> On Fri, Oct 30, 2009 at 6:54 PM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> Check out my screenshots (15 png images) at my blog here:
>
> http://teo-en-ming-aka-zhang-enming.blogspot.com/2009/10/using-xen-virtualization-environment.html
>
>
>
> On Fri, Oct 30, 2009 at 6:23 PM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> Dear All,
>
> I have solved the problem.
>
> With reference to
> http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01327.html
> and
> http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices
> I have executed the following command as root on all 6 of my compute
> nodes (each compute node is an F11 Linux 64-bit PV virtual machine).
>
> # ethtool -K eth0 tx off
>
> Now I can successfully run mpiexec to execute MPI and non-MPI jobs
> on my Virtual HPC Compute Cluster.
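
The per-node fix described above could be scripted along these lines. This is only a sketch: it builds the ssh command lines and prints them without running anything. It assumes root ssh access to each node and uses `tx off` (disable TX checksum offload), which is the setting the cited tip is about. The helper name `ethtool_commands` is made up for the example.

```python
import shlex


def ethtool_commands(hosts, iface="eth0", settings=("tx", "off")):
    """Build one ssh command per host; nothing is executed here."""
    remote = shlex.join(["ethtool", "-K", iface, *settings])
    return [["ssh", f"root@{host}", remote] for host in hosts]


nodes = [f"enming-f11-pv-hpc-node{i:04d}" for i in range(1, 7)]
commands = ethtool_commands(nodes)
for cmd in commands:
    print(shlex.join(cmd))
```

Settings applied with ethtool -K do not survive a reboot, so the command would have to be reapplied at boot on every guest.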
>
> Topic: [Xen-users] Using Xen Virtualization Environment for
> Development and Testing of Supercomputing and High Performance
> Computing (HPC) Cluster MPICH2 MPI-2 Applications
>
> URL: http://lists.xensource.com/archives/html/xen-devel/2009-10/msg01440.html
>
>
>
>
> On Fri, Oct 30, 2009 at 4:13 PM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> Dear All,
>
> I have googled something which may help to solve my problem.
>
> [Xen-devel] Network drop on domU (netfront: rx->offset: 0, size:
> 4294967295) http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01274.html
>
> Virtualization Tip: Always disable checksumming on virtual ethernet
> devices
> http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices
>
>
> Let me try to work on it first.
>
>
>
>
>
> On Fri, Oct 30, 2009 at 3:56 PM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> Hi,
>
> I have reverted to the 2-node troubleshooting scenario. I have
> started node 1 and node 2.
>
> On node 1, I will try to bring up the ring of mpd for the 2 nodes
> using mpdboot and try to execute mpiexec. On node 2, I will capture
> the tcpdump messages on virtual network interface eth0.
>
> Please see attached PNG screenshots. They are numbered in sequence.
>
> Please check if there are any problems.
>
> Thank you.
>
>
>
>
> On Fri, Oct 30, 2009 at 2:55 PM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> Hi,
>
> Here are more virtual network interface eth0 kernel messages. Notice
> the "net eth0: rx->offset: 0" messages. Are they of significance?
>
> Node 1
>
> Oct 30 22:40:34 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.253:1009 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
> Oct 30 22:40:56 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.252:877 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
> Oct 30 22:41:19 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.251:1000 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
> Oct 30 22:41:41 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.250:882 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
> Oct 30 22:42:04 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.249:953 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
> Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd starting; no mpdid yet
> Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd has mpdid=enming-f11-pv-hpc-node0001_48545 (port=48545)
> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:40 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: __ratelimit: 12 callbacks suppressed
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>
> Node 6
>
> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd starting; no mpdid yet
> Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd has mpdid=enming-f11-pv-hpc-node0006_52805 (port=52805)
> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
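
One way to gauge how widespread these kernel messages are is to count them per host. The sketch below assumes each syslog entry sits on a single line (unwrapped) in the format shown above. The regular expression and the function name `count_rx_offset` are illustrative only. The reported size, 4294967295, equals 2**32 - 1, which is how -1 appears when printed as an unsigned 32-bit number, so it looks more like an error value than a real packet length.

```python
import re
from collections import Counter

# Matches lines such as:
# Oct 30 22:42:37 <host> kernel: net eth0: rx->offset: 0, size: 4294967295
RX_OFFSET = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+) kernel: net (?P<iface>\S+): "
    r"rx->offset: (?P<offset>\d+), size: (?P<size>\d+)$"
)


def count_rx_offset(lines):
    """Count rx->offset kernel messages per (host, interface)."""
    counts = Counter()
    sizes = set()
    for line in lines:
        match = RX_OFFSET.match(line.strip())
        if match:
            counts[(match["host"], match["iface"])] += 1
            sizes.add(int(match["size"]))
    return counts, sizes


sample = [
    "Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd starting; no mpdid yet",
    "Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295",
    "Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295",
    "Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295",
]
counts, sizes = count_rx_offset(sample)
for (host, iface), n in sorted(counts.items()):
    print(f"{host} {iface}: {n} rx->offset messages")
```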
>
> Node 1 NFS Server Configuration
>
> [root at enming-f11-pv-hpc-node0001 ~]# cat /etc/exports
> /home/enming/mpich2-install/bin 192.168.1.0/24(ro)
>
> Node 2 /etc/fstab Configuration Entry for NFS Client
>
> 192.168.1.254:/home/enming/mpich2-install/bin /home/enming/mpich2-install/bin nfs rsize=8192,wsize=8192,timeo=14,intr
>
>
> On Fri, Oct 30, 2009 at 2:14 PM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> Hi,
>
> I have noticed that there are Receive Errors (RX-ERR) in all of my 6
> compute nodes. It appears that there may be problems with the
> virtual network interface eth0 in Xen networking.
>
> =================================================
>
> Node 1:
>
>
> [root at enming-f11-pv-hpc-node0001 ~]# netstat -i
> Kernel Interface table
> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0  5824     27      0      0  5056      0      0      0 BMRU
> lo    16436   0   127      0      0      0   127      0      0      0 LRU
> [root at enming-f11-pv-hpc-node0001 ~]# ps -ef | grep mpd
>
> enming 1505 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d
> root 1650 1576 0 22:07 pts/0 00:00:00 grep mpd
>
> Node 2:
>
> [root at enming-f11-pv-hpc-node0002 ~]# netstat -i
>
> Kernel Interface table
> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0  1504      7      0      0  1417      0      0      0 BMRU
> lo    16436   0    44      0      0      0    44      0      0      0 LRU
>
> Node 3:
>
> [root at enming-f11-pv-hpc-node0003 ~]# netstat -i
>
> Kernel Interface table
> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0  1520     12      0      0  1467      0      0      0 BMRU
> lo    16436   0    42      0      0      0    42      0      0      0 LRU
>
> Node 4:
>
> [root at enming-f11-pv-hpc-node0004 ~]# netstat -i
>
> Kernel Interface table
> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0  1528     10      0      0  1514      0      0      0 BMRU
> lo    16436   0    44      0      0      0    44      0      0      0 LRU
>
> Node 5:
>
> [root at enming-f11-pv-hpc-node0005 ~]# netstat -i
>
> Kernel Interface table
> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0  1416     11      0      0  1412      0      0      0 BMRU
> lo    16436   0    44      0      0      0    44      0      0      0 LRU
>
> Node 6:
>
> [root at enming-f11-pv-hpc-node0006 ~]# netstat -i
>
> Kernel Interface table
> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0  1474      9      0      0  1504      0      0      0 BMRU
> lo    16436   0    44      0      0      0    44      0      0      0 LRU
>
> ================================================
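
The per-node check above can be automated by parsing the `netstat -i` output and flagging interfaces with a nonzero RX-ERR count. The sketch below assumes each interface row sits on a single line with one field per header column. The function names are made up for the example, and the sample data is the Node 1 table from this message.

```python
def parse_netstat_i(text):
    """Parse `netstat -i` output into {iface: {column: value}}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header_idx = next(i for i, fields in enumerate(rows) if fields[0] == "Iface")
    columns = rows[header_idx]
    table = {}
    for fields in rows[header_idx + 1:]:
        # Skip rows that do not line up with the header.
        if len(fields) == len(columns):
            table[fields[0]] = dict(zip(columns, fields))
    return table


def interfaces_with_rx_errors(table):
    """Return the interface names whose RX-ERR counter is nonzero."""
    return [name for name, row in table.items() if int(row["RX-ERR"]) > 0]


node1_output = """Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 5824 27 0 0 5056 0 0 0 BMRU
lo 16436 0 127 0 0 0 127 0 0 0 LRU
"""
table = parse_netstat_i(node1_output)
flagged = interfaces_with_rx_errors(table)
for name in flagged:
    row = table[name]
    print(f"{name}: RX-ERR={row['RX-ERR']} RX-OK={row['RX-OK']}")
```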
>
>
>
> On Fri, Oct 30, 2009 at 2:07 PM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> All the six compute nodes are identical PV virtual machines.
>
>
>
> On Fri, Oct 30, 2009 at 2:04 PM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> Hi,
>
> I have changed the communication method from nemesis (high
> performance network method) to ssm (socket for nodes and shared
> memory within a node) by recompiling MPICH2. I have also pre-set the
> MAC address of the virtual network adapter eth0 in each compute node
> (each compute node is a Xen paravirtualized virtual machine) by
> configuring the vif directive in each PV domU configuration file.
>
> Additionally, I have also turned off iptables to facilitate
> troubleshooting and communication between all mpd daemons in each
> node. SSH without password is possible between all the compute nodes.
>
> After having done all of the above, I am still encountering the
> MPIEXEC 392 error.
>
>
> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from
> mpd when expecting ack of request
>
> =================================================
>
> Master Node / Compute Node 1:
>
> [enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
> enming 1499 1455 0 21:44 pts/0 00:00:00 grep mpd
> [enming at enming-f11-pv-hpc-node0001 ~]$ mpdboot -n 6
> [enming at enming-f11-pv-hpc-node0001 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> [enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
> enming 1505 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d
>
> Compute Node 2:
>
> [enming at enming-f11-pv-hpc-node0002 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> [enming at enming-f11-pv-hpc-node0002 ~]$ ps -ef | grep mpd
> enming 1431 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d
> enming 1481 1436 0 21:46 pts/0 00:00:00 grep mpd
>
> Compute Node 3:
>
> [enming at enming-f11-pv-hpc-node0003 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> [enming at enming-f11-pv-hpc-node0003 ~]$ ps -ef | grep mpd
> enming 1422 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d
> enming 1473 1427 0 21:47 pts/0 00:00:00 grep mpd
>
> Compute Node 4:
>
> [enming at enming-f11-pv-hpc-node0004 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> [enming at enming-f11-pv-hpc-node0004 ~]$ ps -ef | grep mpd
> enming 1432 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d
> enming 1482 1437 0 21:47 pts/0 00:00:00 grep mpd
>
> Compute Node 5:
>
> [enming at enming-f11-pv-hpc-node0005 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> [enming at enming-f11-pv-hpc-node0005 ~]$ ps -ef | grep mpd
> enming 1423 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d
> enming 1475 1429 0 21:48 pts/0 00:00:00 grep mpd
>
> Compute Node 6:
>
> [enming at enming-f11-pv-hpc-node0006 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> [enming at enming-f11-pv-hpc-node0006 ~]$ ps -ef | grep mpd
> enming 1427 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0002 -p 42012 --ncpus=1 -e -d
> enming 1477 1432 0 21:49 pts/0 00:00:00 grep mpd
>
> =================================================
>
> Should I increase the value of MPIEXEC_RECV_TIMEOUT in the
> mpiexec.py file, or should I change the communication method to sock?
>
> The mpiexec 392 error says that no message was received from mpd
> when expecting an ack of the request. I suspect that the
> acknowledgement is taking a very long time to arrive and that the
> MPIEXEC_RECV_TIMEOUT value is too low for it, which would cause the
> mpiexec 392 error in my case. I am using a virtual network adapter,
> not a physical Gigabit network adapter.
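
To make the timeout reasoning concrete, here is a generic receive-with-timeout sketch. It is not the code in mpiexec.py, and the helper `recv_with_timeout` is made up for the example. It shows that a caller waiting for an acknowledgement gets nothing back when the peer stays silent for the whole timeout, and gets the reply when it arrives in time.

```python
import select
import socket


def recv_with_timeout(sock, timeout_s, bufsize=4096):
    """Return received bytes, or None if nothing arrives within timeout_s."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None  # the "no msg recvd" case
    return sock.recv(bufsize)


requester, responder = socket.socketpair()

# The peer sends nothing, so the wait ends with a timeout.
silent = recv_with_timeout(requester, 0.1)

# The peer answers, so the same call returns the reply.
responder.sendall(b"ack")
answered = recv_with_timeout(requester, 0.1)

requester.close()
responder.close()
print(silent, answered)
```

A longer timeout only helps when the reply is late. If the reply is dropped on the way, the wait ends the same way regardless of the timeout value.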
>
> =================================================
>
> [root at enming-f11-pv-hpc-node0001 ~]# cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Pentium(R) Dual-Core CPU E6300 @ 2.80GHz
> stepping : 10
> cpu MHz : 2800.098
> cache size : 2048 KB
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse
> sse2 ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16
> hypervisor lahf_lm
> bogomips : 5600.19
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> processor : 1
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Pentium(R) Dual-Core CPU E6300 @ 2.80GHz
> stepping : 10
> cpu MHz : 2800.098
> cache size : 2048 KB
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse
> sse2 ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16
> hypervisor lahf_lm
> bogomips : 5600.19
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> [root at enming-f11-pv-hpc-node0001 ~]# cat /proc/meminfo
> MemTotal: 532796 kB
> MemFree: 386156 kB
> Buffers: 12904 kB
> Cached: 48864 kB
> SwapCached: 0 kB
> Active: 34884 kB
> Inactive: 43252 kB
> Active(anon): 16504 kB
> Inactive(anon): 0 kB
> Active(file): 18380 kB
> Inactive(file): 43252 kB
> Unevictable: 0 kB
> Mlocked: 0 kB
> SwapTotal: 2195448 kB
> SwapFree: 2195448 kB
> Dirty: 12 kB
> Writeback: 0 kB
> AnonPages: 16444 kB
> Mapped: 8864 kB
> Slab: 10528 kB
> SReclaimable: 4668 kB
> SUnreclaim: 5860 kB
> PageTables: 2996 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 2461844 kB
> Committed_AS: 73024 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 6332 kB
> VmallocChunk: 34359724899 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 524288 kB
> DirectMap2M: 0 kB
> [root at enming-f11-pv-hpc-node0001 ~]# lspci -v
> [root at enming-f11-pv-hpc-node0001 ~]# lsusb
> [root at enming-f11-pv-hpc-node0001 ~]# ifconfig eth0
> eth0 Link encap:Ethernet HWaddr 00:16:3E:69:E9:11
> inet addr:192.168.1.254 Bcast:192.168.1.255 Mask:255.255.255.0
> inet6 addr: fe80::216:3eff:fe69:e911/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:5518 errors:26 dropped:0 overruns:0 frame:0
> TX packets:4832 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:872864 (852.4 KiB) TX bytes:3972981 (3.7 MiB)
> Interrupt:17
>
> [root at enming-f11-pv-hpc-node0001 ~]# ethtool eth0
> Settings for eth0:
> Link detected: yes
> [root at enming-f11-pv-hpc-node0001 ~]# netstat -i
> Kernel Interface table
> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0  5589     26      0      0  4875      0      0      0 BMRU
> lo    16436   0   127      0      0      0   127      0      0      0 LRU
> [root at enming-f11-pv-hpc-node0001 ~]# uname -a
> Linux enming-f11-pv-hpc-node0001 2.6.29.4-167.fc11.x86_64 #1 SMP Wed
> May 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
> You have new mail in /var/spool/mail/root
> [root at enming-f11-pv-hpc-node0001 ~]# cat /etc/redhat-release
> Fedora release 11 (Leonidas)
>
> =================================================
>
> Please advise.
>
>
> Thank you.
>
>
> On Fri, Oct 30, 2009 at 11:55 AM, Mr. Teo En Ming (Zhang Enming)
> <space.time.universe at gmail.com> wrote:
> Hi,
>
> I am getting the same mpiexec 392 error message as Kenneth Yoshimoto
> from the San Diego Supercomputer Center. His mpich-discuss mailing
> list topic URL is http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html
>
> I have actually already performed the 2-node mpdcheck utility test
> as described in Appendix A.1 of the MPICH2 installation guide. I
> could start the ring of mpd on the 2-node test scenario using
> mpdboot successfully as well.
>
> 薛正华 (ID: zhxue123) from China reported solving the mpiexec 392
> error. According to 薛正华, the cause of the error was the absence
> of a high-performance network in his environment. He changed the
> default communication method from nemesis to ssm and also increased
> the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py Python source
> code. His report is at http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx
>
> Could this be my problem also?
>
> Thank you.
>
>
>
> On Fri, Oct 30, 2009 at 11:09 AM, Rajeev Thakur <thakur at mcs.anl.gov>
> wrote:
> You need to do the mpdcheck tests with every pair of compute nodes.
> Or to isolate the problem, try running on a smaller set of nodes
> first and increase it one at a time until it fails.
>
> Rajeev
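
The pairwise testing suggested above can be planned with a short script. This sketch only lists the checks to run and executes nothing. It assumes the usual mpdcheck procedure, in which `mpdcheck -s` on one node prints a host and port that are then passed to `mpdcheck -c` on the other node. The helper name `mpdcheck_plan` is made up for the example.

```python
from itertools import permutations


def mpdcheck_plan(hosts):
    """Return every ordered (server, client) pair of distinct hosts."""
    return list(permutations(hosts, 2))


nodes = [f"enming-f11-pv-hpc-node{i:04d}" for i in range(1, 7)]
plan = mpdcheck_plan(nodes)
for server, client in plan:
    print(f"on {server}: mpdcheck -s | on {client}: "
          f"mpdcheck -c {server} <port printed by -s>")
```

Ordered pairs are used so that each node gets tested in both the server and the client role against every other node.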
>
>
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mr. Teo En Ming (Zhang Enming)
> Sent: Thursday, October 29, 2009 2:35 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: [mpich-discuss] (mpiexec 392): no msg recvd from mpd when
> expecting ack of request
>
> Hi,
>
> I have just installed MPICH2 in my Xen-based virtual machines.
>
> My hardware configuration is as follows:
>
> Processor: Intel Pentium Dual Core E6300 @ 2.8 GHz
> Motherboard: Intel Desktop Board DQ45CB BIOS 0093
> Memory: 4X 2GB Kingston DDR2-800 CL5
>
> My software configuration is as follows:
>
> Xen Hypervisor / Virtual Machine Monitor Version: 3.5-unstable
> Jeremy Fitzhardinge's pv-ops dom0 kernel: 2.6.31.4
> Host Operating System: Fedora Linux 11 x86-64 (SELinux disabled)
> Guest Operating Systems: Fedora Linux 11 x86-64 paravirtualized (PV)
> domU guests (SELinux disabled)
>
> I have successfully configured, built and installed MPICH2 on
> master compute node 1, an F11 PV guest that also runs the NFS server
> (the MPICH2 bin subdirectory is exported). The other 5 compute nodes
> access the MPICH2 binaries by mounting the NFS share from node 1.
> Please see the attached c.txt, m.txt and mi.txt. With Xen
> virtualization, I have created 6 F11 Linux PV guests to simulate 6
> HPC compute nodes. The network adapter (NIC) in each guest OS is
> virtual, and the Xen networking type is bridged. Running "lspci -v"
> and "lsusb" in each guest OS shows nothing.
>
> Following the Appendix A troubleshooting section of the MPICH2
> install guide, I have verified that the 2-node test scenario with
> "mpdcheck -s" and "mpdcheck -c" works. The 2 nodes can communicate
> with each other without problems, with each node acting as server
> and as client in turn. I have also tested mpdboot with the 2-node
> scenario and the ring of mpd is working.
>
> After the troubleshooting process, I have successfully created a
> ring of mpd involving 6 compute nodes. "mpdtrace -l" successfully
> lists all the 6 nodes. However, when I want to run a job with
> mpiexec, it gives me the following error:
>
> [enming at enming-f11-pv-hpc-node0001 ~]$ mpiexec -n 2 examples/cpi
> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from
> mpd when expecting ack of request
>
> I have also tried starting the mpd ring with the root user but I
> still encounter the same error above.
>
> Thank you.
>
> PS. config.log is also attached.
>
>
> _______________________________________________
> mpich-discuss mailing list
>
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss