[mpich-discuss] (mpiexec 392): no msg recvd from mpd when expecting ack of request

Mr. Teo En Ming (Zhang Enming) space.time.universe at gmail.com
Fri Oct 30 05:54:40 CDT 2009


Check out my screenshots (15 png images) at my blog here:

http://teo-en-ming-aka-zhang-enming.blogspot.com/2009/10/using-xen-virtualization-environment.html

-- 
Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)(Mechanical
Engineering)
Alma Maters:
(1) Singapore Polytechnic
(2) National University of Singapore
My blog URL: http://teo-en-ming-aka-zhang-enming.blogspot.com
My Youtube videos: http://www.youtube.com/user/enmingteo
Email: space.time.universe at gmail.com
MSN: teoenming at hotmail.com
Mobile Phone (SingTel): +65-9648-9798
Mobile Phone (Starhub Prepaid): +65-8369-2618
Age: 31 (as at 30 Oct 2009)
Height: 1.78 meters
Race: Chinese
Dialect: Hokkien
Street: Bedok Reservoir Road
Country: Singapore

On Fri, Oct 30, 2009 at 6:23 PM, Mr. Teo En Ming (Zhang Enming) <
space.time.universe at gmail.com> wrote:

> Dear All,
>
> I have solved the problem.
>
> Following the advice in
> http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01327.html and
> http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices
> I executed the following command as root on all 6 of my compute nodes
> (each compute node is a Fedora 11 64-bit Linux PV virtual machine).
>
> # ethtool -K eth0 tx off gso on
>
> Now I can successfully run mpiexec to execute MPI and non-MPI jobs on my
> Virtual HPC Compute Cluster.
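A minimal sketch of applying the command above to every node, assuming passwordless SSH as root to each guest and the six host names that appear in the mpdtrace output later in this thread. The helper names `build_offload_commands` and `apply_offload_settings` are illustrative and not part of MPICH2 or Xen. By default the script only prints the commands; nothing runs unless `run=True` is passed. ethtool offload settings are not kept across a reboot, so they would have to be reapplied at boot.

```python
import subprocess

# Host names as they appear in the mpdtrace output quoted in this thread.
NODES = ["enming-f11-pv-hpc-node%04d" % i for i in range(1, 7)]


def build_offload_commands(nodes, iface="eth0"):
    """Return one ssh argv list per node that runs the ethtool command from this message."""
    ethtool = ["ethtool", "-K", iface, "tx", "off", "gso", "on"]
    return [["ssh", "root@" + node] + ethtool for node in nodes]


def apply_offload_settings(nodes, iface="eth0", run=False):
    """Print each command; execute it only when run=True (assumes root SSH access)."""
    for argv in build_offload_commands(nodes, iface):
        print(" ".join(argv))
        if run:
            subprocess.check_call(argv)


if __name__ == "__main__":
    apply_offload_settings(NODES)  # dry run: prints the six commands
```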
>
> Topic: [Xen-users] Using Xen Virtualization Environment for Development
> and Testing of Supercomputing and High Performance Computing (HPC) Cluster
> MPICH2 MPI-2 Applications
>
> URL:
> http://lists.xensource.com/archives/html/xen-devel/2009-10/msg01440.html
>
>
>
>
> On Fri, Oct 30, 2009 at 4:13 PM, Mr. Teo En Ming (Zhang Enming) <
> space.time.universe at gmail.com> wrote:
>
>> Dear All,
>>
>> I have found two references that may help to solve my problem.
>>
>> *[Xen-devel] Network drop on domU (netfront: rx->offset: 0, size:
>> 4294967295)*
>> http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01274.html
>>
>> Virtualization Tip: Always disable checksumming on virtual ethernet
>> devices
>>
>> http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices
>>
>>
>> Let me try to work on it first.
>>
>>
>>
>>
>>
>> On Fri, Oct 30, 2009 at 3:56 PM, Mr. Teo En Ming (Zhang Enming) <
>> space.time.universe at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have reverted to the 2-node troubleshooting scenario. I have started
>>> node 1 and node 2.
>>>
>>> On node 1, I will try to bring up the ring of mpd for the 2 nodes using
>>> mpdboot and then run mpiexec. On node 2, I will capture traffic on the
>>> virtual network interface eth0 with tcpdump.
>>>
>>> Please see attached PNG screenshots. They are numbered in sequence.
>>>
>>> Please check if there are any problems.
>>>
>>> Thank you.
>>>
>>>
>>>
>>>
>>> On Fri, Oct 30, 2009 at 2:55 PM, Mr. Teo En Ming (Zhang Enming) <
>>> space.time.universe at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Here are more virtual network interface eth0 kernel messages. Notice the
>>>> "net eth0: rx->offset: 0" messages. Are they of significance?
>>>>
>>>> *Node 1*
>>>>
>>>> Oct 30 22:40:34 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.253:1009 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
>>>> Oct 30 22:40:56 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.252:877 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
>>>> Oct 30 22:41:19 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.251:1000 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
>>>> Oct 30 22:41:41 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.250:882 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
>>>> Oct 30 22:42:04 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.249:953 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
>>>> Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd starting; no mpdid yet
>>>> Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd has mpdid=enming-f11-pv-hpc-node0001_48545 (port=48545)
>>>> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:40 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: __ratelimit: 12 callbacks suppressed
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>>
>>>> *Node 6*
>>>>
>>>> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd starting; no mpdid yet
>>>> Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd has mpdid=enming-f11-pv-hpc-node0006_52805 (port=52805)
>>>> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
>>>> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
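A small sketch for tallying the "rx->offset" kernel messages quoted above per host, assuming the syslog entries have been joined back onto single lines. The names `RX_WARNING`, `count_rx_warnings` and `as_signed32` are illustrative only. One detail worth noting: 4294967295 equals 2^32 - 1, which is -1 when read as a signed 32-bit integer. That suggests, but does not prove, that the driver is printing a negative value as unsigned.

```python
import re
from collections import Counter

# Matches the kernel warning quoted in this message; the host is the 4th syslog field.
RX_WARNING = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel: net (?P<iface>\S+): "
    r"rx->offset: (?P<offset>\d+), size: (?P<size>\d+)$"
)


def count_rx_warnings(lines):
    """Count rx->offset warnings per (host, interface) in unwrapped syslog lines."""
    counts = Counter()
    for line in lines:
        match = RX_WARNING.match(line.strip())
        if match:
            counts[(match.group("host"), match.group("iface"))] += 1
    return counts


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return value - (1 << 32) if value >= (1 << 31) else value


sample = [
    "Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd starting; no mpdid yet",
    "Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295",
    "Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295",
    "Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295",
]

for (host, iface), count in sorted(count_rx_warnings(sample).items()):
    print("%s %s: %d warnings" % (host, iface, count))
print("size as signed 32-bit:", as_signed32(4294967295))
```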
>>>>
>>>> *Node 1 NFS Server Configuration*
>>>>
>>>> [root at enming-f11-pv-hpc-node0001 ~]# cat /etc/exports
>>>> /home/enming/mpich2-install/bin        192.168.1.0/24(ro)
>>>>
>>>> *Node 2 /etc/fstab Configuration Entry for NFS Client*
>>>>
>>>> 192.168.1.254:/home/enming/mpich2-install/bin    /home/enming/mpich2-install/bin    nfs    rsize=8192,wsize=8192,timeo=14,intr
>>>>
>>>>
>>>> On Fri, Oct 30, 2009 at 2:14 PM, Mr. Teo En Ming (Zhang Enming) <
>>>> space.time.universe at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have noticed Receive Errors (RX-ERR) on all 6 of my compute nodes.
>>>>> This suggests a problem with the virtual network interface eth0 under
>>>>> Xen networking.
>>>>>
>>>>> =================================================
>>>>>
>>>>> Node 1:
>>>>>
>>>>>
>>>>> [root at enming-f11-pv-hpc-node0001 ~]# netstat -i
>>>>> Kernel Interface table
>>>>> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>>>>> eth0       1500   0     5824     27      0      0     5056      0      0      0 BMRU
>>>>> lo        16436   0      127      0      0      0      127      0      0      0 LRU
>>>>> [root at enming-f11-pv-hpc-node0001 ~]# ps -ef | grep mpd
>>>>> enming    1505     1  0 21:44 ?        00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d
>>>>> root      1650  1576  0 22:07 pts/0    00:00:00 grep mpd
>>>>>
>>>>> Node 2:
>>>>>
>>>>> [root at enming-f11-pv-hpc-node0002 ~]# netstat -i
>>>>> Kernel Interface table
>>>>> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>>>>> eth0       1500   0     1504      7      0      0     1417      0      0      0 BMRU
>>>>> lo        16436   0       44      0      0      0       44      0      0      0 LRU
>>>>>
>>>>> Node 3:
>>>>>
>>>>> [root at enming-f11-pv-hpc-node0003 ~]# netstat -i
>>>>> Kernel Interface table
>>>>> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>>>>> eth0       1500   0     1520     12      0      0     1467      0      0      0 BMRU
>>>>> lo        16436   0       42      0      0      0       42      0      0      0 LRU
>>>>>
>>>>> Node 4:
>>>>>
>>>>> [root at enming-f11-pv-hpc-node0004 ~]# netstat -i
>>>>> Kernel Interface table
>>>>> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>>>>> eth0       1500   0     1528     10      0      0     1514      0      0      0 BMRU
>>>>> lo        16436   0       44      0      0      0       44      0      0      0 LRU
>>>>>
>>>>> Node 5:
>>>>>
>>>>> [root at enming-f11-pv-hpc-node0005 ~]# netstat -i
>>>>> Kernel Interface table
>>>>> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>>>>> eth0       1500   0     1416     11      0      0     1412      0      0      0 BMRU
>>>>> lo        16436   0       44      0      0      0       44      0      0      0 LRU
>>>>>
>>>>> Node 6:
>>>>>
>>>>> [root at enming-f11-pv-hpc-node0006 ~]# netstat -i
>>>>> Kernel Interface table
>>>>> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>>>>> eth0       1500   0     1474      9      0      0     1504      0      0      0 BMRU
>>>>> lo        16436   0       44      0      0      0       44      0      0      0 LRU
>>>>>
>>>>> ================================================
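The per-node counters above can be compared more easily once parsed. The following is a rough sketch, assuming the `netstat -i` column layout shown in this message with each row on a single line. `parse_netstat_i` and `rx_error_ratio` are hypothetical helper names, and the ratio (errors per good received packet) is just one convenient way to compare nodes.

```python
def parse_netstat_i(text):
    """Parse `netstat -i` output into {iface: {column: int}} for the numeric columns."""
    header = None
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue  # skip the title line and any wrapped or truncated rows
        row = dict(zip(header, fields))
        stats[row["Iface"]] = {
            key: int(value) for key, value in row.items() if value.isdigit()
        }
    return stats


def rx_error_ratio(stats, iface):
    """Return RX-ERR divided by RX-OK for one interface (0.0 if nothing was received)."""
    counters = stats[iface]
    return counters["RX-ERR"] / counters["RX-OK"] if counters["RX-OK"] else 0.0


# Node 1 figures copied from the output quoted in this message.
NODE1 = """\
Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0     5824     27      0      0     5056      0      0      0 BMRU
lo        16436   0      127      0      0      0      127      0      0      0 LRU
"""

stats = parse_netstat_i(NODE1)
print("eth0 RX errors per good packet: %.4f" % rx_error_ratio(stats, "eth0"))
```

Only the virtual interface eth0 shows receive errors in the quoted output; the loopback interface shows none on any node.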
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 30, 2009 at 2:07 PM, Mr. Teo En Ming (Zhang Enming) <
>>>>> space.time.universe at gmail.com> wrote:
>>>>>
>>>>>> All the six compute nodes are identical PV virtual machines.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 30, 2009 at 2:04 PM, Mr. Teo En Ming (Zhang Enming) <
>>>>>> space.time.universe at gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have changed the communication method from nemesis (high
>>>>>>> performance network method) to ssm (socket for nodes and shared memory
>>>>>>> within a node) by recompiling MPICH2. I have also pre-set the MAC address of
>>>>>>> the virtual network adapter eth0 in each compute node (each compute node is
>>>>>>> a Xen paravirtualized virtual machine) by configuring the vif directive in
>>>>>>> each PV domU configuration file.
>>>>>>>
>>>>>>> Additionally, I have also turned off iptables to facilitate
>>>>>>> troubleshooting and communication between all mpd daemons in each node. SSH
>>>>>>> without password is possible between all the compute nodes.
>>>>>>>
>>>>>>> After having done all of the above, I am still encountering the
>>>>>>> MPIEXEC 392 error.
>>>>>>>
>>>>>>>
>>>>>>> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from
>>>>>>> mpd when expecting ack of request
>>>>>>>
>>>>>>> =================================================
>>>>>>>
>>>>>>> Master Node / Compute Node 1:
>>>>>>>
>>>>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
>>>>>>> enming    1499  1455  0 21:44 pts/0    00:00:00 grep mpd
>>>>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ mpdboot -n 6
>>>>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ mpdtrace -l
>>>>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
>>>>>>> enming    1505     1  0 21:44 ?        00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d
>>>>>>>
>>>>>>> Compute Node 2:
>>>>>>>
>>>>>>> [enming at enming-f11-pv-hpc-node0002 ~]$ mpdtrace -l
>>>>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>>>>> [enming at enming-f11-pv-hpc-node0002 ~]$ ps -ef | grep mpd
>>>>>>> enming    1431     1  0 21:44 ?        00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d
>>>>>>> enming    1481  1436  0 21:46 pts/0    00:00:00 grep mpd
>>>>>>>
>>>>>>> Compute Node 3:
>>>>>>>
>>>>>>> [enming at enming-f11-pv-hpc-node0003 ~]$ mpdtrace -l
>>>>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>>>>> [enming at enming-f11-pv-hpc-node0003 ~]$ ps -ef | grep mpd
>>>>>>> enming    1422     1  0 21:44 ?        00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d
>>>>>>> enming    1473  1427  0 21:47 pts/0    00:00:00 grep mpd
>>>>>>>
>>>>>>> Compute Node 4:
>>>>>>>
>>>>>>> [enming at enming-f11-pv-hpc-node0004 ~]$ mpdtrace -l
>>>>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>>>>> [enming at enming-f11-pv-hpc-node0004 ~]$ ps -ef | grep mpd
>>>>>>> enming    1432     1  0 21:44 ?        00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d
>>>>>>> enming    1482  1437  0 21:47 pts/0    00:00:00 grep mpd
>>>>>>>
>>>>>>> Compute Node 5:
>>>>>>>
>>>>>>> [enming at enming-f11-pv-hpc-node0005 ~]$ mpdtrace -l
>>>>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>>>>> [enming at enming-f11-pv-hpc-node0005 ~]$ ps -ef | grep mpd
>>>>>>> enming    1423     1  0 21:44 ?        00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d
>>>>>>> enming    1475  1429  0 21:48 pts/0    00:00:00 grep mpd
>>>>>>>
>>>>>>> Compute Node 6:
>>>>>>>
>>>>>>> [enming at enming-f11-pv-hpc-node0006 ~]$ mpdtrace -l
>>>>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>>>>> [enming at enming-f11-pv-hpc-node0006 ~]$ ps -ef | grep mpd
>>>>>>> enming    1427     1  0 21:44 ?        00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0002 -p 42012 --ncpus=1 -e -d
>>>>>>> enming    1477  1432  0 21:49 pts/0    00:00:00 grep mpd
>>>>>>>
>>>>>>> =================================================
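In the `mpdtrace -l` listings above, each node appears to print the same ring starting from itself. A quick way to confirm that two nodes agree on the ring is to check that one listing is a rotation of the other. This is a sketch under that assumption; `parse_mpdtrace` and `same_ring` are illustrative names, not MPICH2 utilities.

```python
import re

# One ring member per line, e.g. "host_port (ip)".
ENTRY = re.compile(r"^(?P<host>\S+)_(?P<port>\d+) \((?P<ip>[\d.]+)\)$")


def parse_mpdtrace(text):
    """Parse `mpdtrace -l` output into a list of (host, port, ip) tuples."""
    ring = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            ring.append((match.group("host"), int(match.group("port")), match.group("ip")))
    return ring


def same_ring(ring_a, ring_b):
    """True if ring_b lists the same members in the same cyclic order as ring_a."""
    if len(ring_a) != len(ring_b):
        return False
    if not ring_a:
        return True
    doubled = ring_a + ring_a
    return any(doubled[i:i + len(ring_b)] == ring_b for i in range(len(ring_a)))


# Listings copied from node 1 and node 2 in this message.
NODE1 = """\
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
"""
NODE2 = """\
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
"""

print(same_ring(parse_mpdtrace(NODE1), parse_mpdtrace(NODE2)))
```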
>>>>>>>
>>>>>>> Should I increase the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py
>>>>>>> file or should I change the communication method to sock?
>>>>>>>
>>>>>>> The mpiexec 392 error says that no message was received from mpd when
>>>>>>> expecting an ack of the request. My guess is that the acknowledgement
>>>>>>> takes longer to arrive than MPIEXEC_RECV_TIMEOUT allows, which would
>>>>>>> cause the mpiexec 392 error in my case. I am using a virtual network
>>>>>>> adapter, not a physical Gigabit network adapter.
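The reasoning about the timeout can be illustrated with a toy model. This is not MPICH2's mpiexec.py code; it is a generic sketch using a local socket pair, with made-up helper names `wait_for_ack` and `simulate`. It shows that a longer receive timeout helps only when the acknowledgement is slow, and makes no difference when the acknowledgement never arrives (for example because packets are being dropped).

```python
import select
import socket
import threading


def wait_for_ack(sock, timeout):
    """Return the bytes received within `timeout` seconds, or None if nothing arrived."""
    readable, _, _ = select.select([sock], [], [], timeout)
    return sock.recv(1024) if readable else None


def simulate(ack_delay, timeout):
    """Send b'ack' after ack_delay seconds (never, if None) and wait up to timeout."""
    client, server = socket.socketpair()
    timer = None
    try:
        if ack_delay is not None:
            timer = threading.Timer(ack_delay, server.sendall, args=(b"ack",))
            timer.start()
        return wait_for_ack(client, timeout)
    finally:
        if timer is not None:
            timer.join()  # let the delayed send finish before closing the sockets
        client.close()
        server.close()


print("slow ack, long timeout: ", simulate(ack_delay=0.05, timeout=1.0))
print("slow ack, short timeout:", simulate(ack_delay=0.5, timeout=0.05))
print("lost ack, any timeout:  ", simulate(ack_delay=None, timeout=0.05))
```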
>>>>>>>
>>>>>>> =================================================
>>>>>>>
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# cat /proc/cpuinfo
>>>>>>> processor    : 0
>>>>>>> vendor_id    : GenuineIntel
>>>>>>> cpu family    : 6
>>>>>>> model        : 23
>>>>>>> model name    : Pentium(R) Dual-Core  CPU      E6300  @ 2.80GHz
>>>>>>> stepping    : 10
>>>>>>> cpu MHz        : 2800.098
>>>>>>> cache size    : 2048 KB
>>>>>>> fpu        : yes
>>>>>>> fpu_exception    : yes
>>>>>>> cpuid level    : 13
>>>>>>> wp        : yes
>>>>>>> flags        : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm
>>>>>>> bogomips    : 5600.19
>>>>>>> clflush size    : 64
>>>>>>> cache_alignment    : 64
>>>>>>> address sizes    : 36 bits physical, 48 bits virtual
>>>>>>> power management:
>>>>>>>
>>>>>>> processor    : 1
>>>>>>> vendor_id    : GenuineIntel
>>>>>>> cpu family    : 6
>>>>>>> model        : 23
>>>>>>> model name    : Pentium(R) Dual-Core  CPU      E6300  @ 2.80GHz
>>>>>>> stepping    : 10
>>>>>>> cpu MHz        : 2800.098
>>>>>>> cache size    : 2048 KB
>>>>>>> fpu        : yes
>>>>>>> fpu_exception    : yes
>>>>>>> cpuid level    : 13
>>>>>>> wp        : yes
>>>>>>> flags        : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm
>>>>>>> bogomips    : 5600.19
>>>>>>> clflush size    : 64
>>>>>>> cache_alignment    : 64
>>>>>>> address sizes    : 36 bits physical, 48 bits virtual
>>>>>>> power management:
>>>>>>>
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# cat /proc/meminfo
>>>>>>> MemTotal:         532796 kB
>>>>>>> MemFree:          386156 kB
>>>>>>> Buffers:           12904 kB
>>>>>>> Cached:            48864 kB
>>>>>>> SwapCached:            0 kB
>>>>>>> Active:            34884 kB
>>>>>>> Inactive:          43252 kB
>>>>>>> Active(anon):      16504 kB
>>>>>>> Inactive(anon):        0 kB
>>>>>>> Active(file):      18380 kB
>>>>>>> Inactive(file):    43252 kB
>>>>>>> Unevictable:           0 kB
>>>>>>> Mlocked:               0 kB
>>>>>>> SwapTotal:       2195448 kB
>>>>>>> SwapFree:        2195448 kB
>>>>>>> Dirty:                12 kB
>>>>>>> Writeback:             0 kB
>>>>>>> AnonPages:         16444 kB
>>>>>>> Mapped:             8864 kB
>>>>>>> Slab:              10528 kB
>>>>>>> SReclaimable:       4668 kB
>>>>>>> SUnreclaim:         5860 kB
>>>>>>> PageTables:         2996 kB
>>>>>>> NFS_Unstable:          0 kB
>>>>>>> Bounce:                0 kB
>>>>>>> WritebackTmp:          0 kB
>>>>>>> CommitLimit:     2461844 kB
>>>>>>> Committed_AS:      73024 kB
>>>>>>> VmallocTotal:   34359738367 kB
>>>>>>> VmallocUsed:        6332 kB
>>>>>>> VmallocChunk:   34359724899 kB
>>>>>>> HugePages_Total:       0
>>>>>>> HugePages_Free:        0
>>>>>>> HugePages_Rsvd:        0
>>>>>>> HugePages_Surp:        0
>>>>>>> Hugepagesize:       2048 kB
>>>>>>> DirectMap4k:      524288 kB
>>>>>>> DirectMap2M:           0 kB
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# lspci -v
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# lsusb
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# ifconfig eth0
>>>>>>> eth0      Link encap:Ethernet  HWaddr 00:16:3E:69:E9:11
>>>>>>>           inet addr:192.168.1.254  Bcast:192.168.1.255  Mask:255.255.255.0
>>>>>>>           inet6 addr: fe80::216:3eff:fe69:e911/64 Scope:Link
>>>>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>>>>           RX packets:5518 errors:26 dropped:0 overruns:0 frame:0
>>>>>>>           TX packets:4832 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>>           collisions:0 txqueuelen:1000
>>>>>>>           RX bytes:872864 (852.4 KiB)  TX bytes:3972981 (3.7 MiB)
>>>>>>>           Interrupt:17
>>>>>>>
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# ethtool eth0
>>>>>>> Settings for eth0:
>>>>>>>     Link detected: yes
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# netstat -i
>>>>>>> Kernel Interface table
>>>>>>> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>>>>>>> eth0       1500   0     5589     26      0      0     4875      0      0      0 BMRU
>>>>>>> lo        16436   0      127      0      0      0      127      0      0      0 LRU
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# uname -a
>>>>>>> Linux enming-f11-pv-hpc-node0001 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>> You have new mail in /var/spool/mail/root
>>>>>>> [root at enming-f11-pv-hpc-node0001 ~]# cat /etc/redhat-release
>>>>>>> Fedora release 11 (Leonidas)
>>>>>>>
>>>>>>> =================================================
>>>>>>>
>>>>>>> Please advise.
>>>>>>>
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 30, 2009 at 11:55 AM, Mr. Teo En Ming (Zhang Enming) <
>>>>>>> space.time.universe at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am getting the same mpiexec 392 error message as Kenneth Yoshimoto
>>>>>>>> from the San Diego Supercomputer Center. His mpich-discuss mailing list
>>>>>>>> topic URL is
>>>>>>>> http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html
>>>>>>>>
>>>>>>>> I have actually already performed the 2-node mpdcheck utility test
>>>>>>>> as described in Appendix A.1 of the MPICH2 installation guide. I could start
>>>>>>>> the ring of mpd on the 2-node test scenario using mpdboot successfully as
>>>>>>>> well.
>>>>>>>>
>>>>>>>> 薛正华 (ID: zhxue123) from China reported solving the mpiexec 392
>>>>>>>> error. According to 薛正华, the cause of the mpiexec 392 error was the
>>>>>>>> absence of a high-performance network in his environment. He changed the
>>>>>>>> default communication method from nemesis to ssm and also increased the
>>>>>>>> value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py Python source code. His
>>>>>>>> report is at
>>>>>>>> http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx
>>>>>>>>
>>>>>>>> Could this be my problem also?
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Oct 30, 2009 at 11:09 AM, Rajeev Thakur
>>>>>>>> <thakur at mcs.anl.gov> wrote:
>>>>>>>>
>>>>>>>>>  You need to do the mpdcheck tests with every pair of compute
>>>>>>>>> nodes. Or to isolate the problem, try running on a smaller set of nodes
>>>>>>>>> first and increase it one at a time until it fails.
>>>>>>>>>
>>>>>>>>> Rajeev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  ------------------------------
>>>>>>>>> From: mpich-discuss-bounces at mcs.anl.gov
>>>>>>>>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mr. Teo En Ming
>>>>>>>>> (Zhang Enming)
>>>>>>>>> Sent: Thursday, October 29, 2009 2:35 PM
>>>>>>>>> To: mpich-discuss at mcs.anl.gov
>>>>>>>>> Subject: [mpich-discuss] (mpiexec 392): no msg recvd from mpd
>>>>>>>>> when expecting ack of request
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have just installed MPICH2 in my Xen-based virtual machines.
>>>>>>>>>
>>>>>>>>> My hardware configuration is as follows:
>>>>>>>>>
>>>>>>>>> Processor: Intel Pentium Dual Core E6300 @ 2.8 GHz
>>>>>>>>> Motherboard: Intel Desktop Board DQ45CB BIOS 0093
>>>>>>>>> Memory: 4X 2GB Kingston DDR2-800 CL5
>>>>>>>>>
>>>>>>>>> My software configuration is as follows:
>>>>>>>>>
>>>>>>>>> Xen Hypervisor / Virtual Machine Monitor Version: 3.5-unstable
>>>>>>>>> Jeremy Fitzhardinge's pv-ops dom0 kernel: 2.6.31.4
>>>>>>>>> Host Operating System: Fedora Linux 11 x86-64 (SELinux disabled)
>>>>>>>>> Guest Operating Systems: Fedora Linux 11 x86-64 paravirtualized
>>>>>>>>> (PV) domU guests (SELinux disabled)
>>>>>>>>>
>>>>>>>>> I have successfully configured, built and installed MPICH2 on master
>>>>>>>>> compute node 1, an F11 PV guest that also runs the NFS server (the MPICH2
>>>>>>>>> bin subdirectory is exported). The other 5 compute nodes access the MPICH2
>>>>>>>>> binaries by mounting the NFS share from node 1. Please see attached c.txt,
>>>>>>>>> m.txt and mi.txt. With Xen virtualization, I have created 6 F11 Linux PV
>>>>>>>>> guests to simulate 6 HPC compute nodes. The network adapter (NIC) in each
>>>>>>>>> guest OS is virtual. The Xen networking type is bridged. Running
>>>>>>>>> "lspci -v" and "lsusb" in each guest OS shows nothing.
>>>>>>>>>
>>>>>>>>> Following the Appendix A troubleshooting section of the MPICH2
>>>>>>>>> install guide, I have verified that the 2-node test scenario with
>>>>>>>>> "mpdcheck -s" and "mpdcheck -c" works. The 2 nodes can communicate with
>>>>>>>>> each other without problems, with either node acting as server and the
>>>>>>>>> other as client. I have also tested mpdboot with the 2-node scenario and
>>>>>>>>> the ring of mpd is working.
>>>>>>>>>
>>>>>>>>> After the troubleshooting process, I have successfully created a
>>>>>>>>> ring of mpd involving 6 compute nodes. "mpdtrace -l" successfully lists all
>>>>>>>>> the 6 nodes. However, when I want to run a job with mpiexec, it gives me the
>>>>>>>>> following error:
>>>>>>>>>
>>>>>>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ mpiexec -n 2 examples/cpi
>>>>>>>>> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from
>>>>>>>>> mpd when expecting ack of request
>>>>>>>>>
>>>>>>>>> I have also tried starting the mpd ring with the root user but I
>>>>>>>>> still encounter the same error above.
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> PS. config.log is also attached.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> mpich-discuss mailing list
>>>>>>>>>
>>>>>>>>> mpich-discuss at mcs.anl.gov
>>>>>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>>>>>>>