[mpich-discuss] (mpiexec 392): no msg recvd from mpd when expecting ack of request
Mr. Teo En Ming (Zhang Enming)
space.time.universe at gmail.com
Fri Oct 30 02:56:28 CDT 2009
Hi,
I have reverted to the 2-node troubleshooting scenario. I have started node
1 and node 2.
On node 1, I will try to bring up the ring of mpd for the 2 nodes using
mpdboot and try to execute mpiexec. On node 2, I will capture the tcpdump
messages on virtual network interface eth0.
Please see attached PNG screenshots. They are numbered in sequence.
Please check if there are any problems.
Thank you.
--
Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)(Mechanical
Engineering)
Alma Maters:
(1) Singapore Polytechnic
(2) National University of Singapore
My blog URL: http://teo-en-ming-aka-zhang-enming.blogspot.com
My Youtube videos: http://www.youtube.com/user/enmingteo
Email: space.time.universe at gmail.com
MSN: teoenming at hotmail.com
Mobile Phone (SingTel): +65-9648-9798
Mobile Phone (Starhub Prepaid): +65-8369-2618
Age: 31 (as at 30 Oct 2009)
Height: 1.78 meters
Race: Chinese
Dialect: Hokkien
Street: Bedok Reservoir Road
Country: Singapore
On Fri, Oct 30, 2009 at 2:55 PM, Mr. Teo En Ming (Zhang Enming) <
space.time.universe at gmail.com> wrote:
> Hi,
>
> Here are more virtual network interface eth0 kernel messages. Notice the
> "net eth0: rx->offset: 0" messages. Are they of significance?
>
> *Node 1*
>
> Oct 30 22:40:34 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated
> mount request from 192.168.1.253:1009 for /home/enming/mpich2-install/bin
> (/home/enming/mpich2-install/bin)
> Oct 30 22:40:56 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated
> mount request from 192.168.1.252:877 for /home/enming/mpich2-install/bin
> (/home/enming/mpich2-install/bin)
> Oct 30 22:41:19 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated
> mount request from 192.168.1.251:1000 for /home/enming/mpich2-install/bin
> (/home/enming/mpich2-install/bin)
> Oct 30 22:41:41 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated
> mount request from 192.168.1.250:882 for /home/enming/mpich2-install/bin
> (/home/enming/mpich2-install/bin)
> Oct 30 22:42:04 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated
> mount request from 192.168.1.249:953 for /home/enming/mpich2-install/bin
> (/home/enming/mpich2-install/bin)
> Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd starting; no mpdid yet
> Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd has
> mpdid=enming-f11-pv-hpc-node0001_48545 (port=48545)
> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:40 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: __ratelimit: 12
> callbacks suppressed
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0,
> size: 4294967295
>
> *Node 6*
>
> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd starting; no mpdid yet
> Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd has
> mpdid=enming-f11-pv-hpc-node0006_52805 (port=52805)
> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0,
> size: 4294967295
> Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0,
> size: 4294967295
>
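A quick check of the "size: 4294967295" figure in the kernel messages above. That number is exactly 2^32 - 1, which is what -1 looks like when a signed 32-bit value is printed as unsigned. Whether the driver is really reporting a negative status here is only an assumption; nothing in this thread confirms it. The helper name as_signed_32 is made up for this sketch.

```python
import struct


def as_signed_32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


logged_size = 4294967295  # value from the "net eth0: rx->offset: 0, size: ..." lines

print(hex(logged_size))           # bit pattern of the logged value
print(as_signed_32(logged_size))  # same bits read as a signed 32-bit number
```

If that reading is right, the messages would point at the receive path of the virtual interface rather than at mpd itself.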
> *Node 1 NFS Server Configuration*
>
> [root at enming-f11-pv-hpc-node0001 ~]# cat /etc/exports
> /home/enming/mpich2-install/bin 192.168.1.0/24(ro)
>
> *Node 2 /etc/fstab Configuration Entry for NFS Client*
>
> 192.168.1.254:/home/enming/mpich2-install/bin
> /home/enming/mpich2-install/bin nfs
> rsize=8192,wsize=8192,timeo=14,intr
>
>
> On Fri, Oct 30, 2009 at 2:14 PM, Mr. Teo En Ming (Zhang Enming) <
> space.time.universe at gmail.com> wrote:
>
>> Hi,
>>
>> I have noticed that there are Receive Errors (RX-ERR) in all of my 6
>> compute nodes. It appears that there may be problems with the virtual
>> network interface eth0 in Xen networking.
>>
>> =================================================
>>
>> Node 1:
>>
>>
>> [root at enming-f11-pv-hpc-node0001 ~]# netstat -i
>> Kernel Interface table
>> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0  5824     27      0      0  5056      0      0      0 BMRU
>> lo    16436   0   127      0      0      0   127      0      0      0 LRU
>> [root at enming-f11-pv-hpc-node0001 ~]# ps -ef | grep mpd
>>
>> enming 1505 1 0 21:44 ? 00:00:00 python2.6
>> /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d
>> root 1650 1576 0 22:07 pts/0 00:00:00 grep mpd
>>
>> Node 2:
>>
>> [root at enming-f11-pv-hpc-node0002 ~]# netstat -i
>>
>> Kernel Interface table
>> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0  1504      7      0      0  1417      0      0      0 BMRU
>> lo    16436   0    44      0      0      0    44      0      0      0 LRU
>>
>> Node 3:
>>
>> [root at enming-f11-pv-hpc-node0003 ~]# netstat -i
>>
>> Kernel Interface table
>> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0  1520     12      0      0  1467      0      0      0 BMRU
>> lo    16436   0    42      0      0      0    42      0      0      0 LRU
>>
>> Node 4:
>>
>> [root at enming-f11-pv-hpc-node0004 ~]# netstat -i
>>
>> Kernel Interface table
>> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0  1528     10      0      0  1514      0      0      0 BMRU
>> lo    16436   0    44      0      0      0    44      0      0      0 LRU
>>
>> Node 5:
>>
>> [root at enming-f11-pv-hpc-node0005 ~]# netstat -i
>>
>> Kernel Interface table
>> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0  1416     11      0      0  1412      0      0      0 BMRU
>> lo    16436   0    44      0      0      0    44      0      0      0 LRU
>>
>> Node 6:
>>
>> [root at enming-f11-pv-hpc-node0006 ~]# netstat -i
>>
>> Kernel Interface table
>> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0   1500   0  1474      9      0      0  1504      0      0      0 BMRU
>> lo    16436   0    44      0      0      0    44      0      0      0 LRU
>>
>> ================================================
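To put the RX-ERR counts above in proportion, here is a small sketch that turns the eth0 figures from the six netstat -i outputs into error rates. It assumes RX-OK counts only the frames received without error, so the rate is taken as RX-ERR / (RX-OK + RX-ERR); the function and variable names are just for illustration.

```python
# (RX-OK, RX-ERR) for eth0 on each node, copied from the netstat -i output above
rx_counters = {
    "node0001": (5824, 27),
    "node0002": (1504, 7),
    "node0003": (1520, 12),
    "node0004": (1528, 10),
    "node0005": (1416, 11),
    "node0006": (1474, 9),
}


def rx_error_rate(rx_ok: int, rx_err: int) -> float:
    """Fraction of received frames that were counted as errors."""
    return rx_err / (rx_ok + rx_err)


rates = {node: rx_error_rate(ok, err) for node, (ok, err) in rx_counters.items()}
worst = max(rates, key=rates.get)

for node, rate in sorted(rates.items()):
    print(f"{node}: {rate:.2%}")
print("highest error rate:", worst)
```

The rates come out below 1% on every node, so the errors are present everywhere but are a small share of the traffic seen so far.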
>>
>>
>>
>> On Fri, Oct 30, 2009 at 2:07 PM, Mr. Teo En Ming (Zhang Enming) <
>> space.time.universe at gmail.com> wrote:
>>
>>> All the six compute nodes are identical PV virtual machines.
>>>
>>>
>>>
>>> On Fri, Oct 30, 2009 at 2:04 PM, Mr. Teo En Ming (Zhang Enming) <
>>> space.time.universe at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have changed the communication method from nemesis (high performance
>>>> network method) to ssm (socket for nodes and shared memory within a node) by
>>>> recompiling MPICH2. I have also pre-set the MAC address of the virtual
>>>> network adapter eth0 in each compute node (each compute node is a Xen
>>>> paravirtualized virtual machine) by configuring the vif directive in each PV
>>>> domU configuration file.
>>>>
>>>> Additionally, I have also turned off iptables to facilitate
>>>> troubleshooting and communication between all mpd daemons in each node. SSH
>>>> without password is possible between all the compute nodes.
>>>>
>>>> After having done all of the above, I am still encountering the MPIEXEC
>>>> 392 error.
>>>>
>>>>
>>>> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from mpd
>>>> when expecting ack of request
>>>>
>>>> =================================================
>>>>
>>>> Master Node / Compute Node 1:
>>>>
>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
>>>> enming 1499 1455 0 21:44 pts/0 00:00:00 grep mpd
>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ mpdboot -n 6
>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ mpdtrace -l
>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
>>>> enming 1505 1 0 21:44 ? 00:00:00 python2.6
>>>> /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d
>>>>
>>>> Compute Node 2:
>>>>
>>>> [enming at enming-f11-pv-hpc-node0002 ~]$ mpdtrace -l
>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>> [enming at enming-f11-pv-hpc-node0002 ~]$ ps -ef | grep mpd
>>>> enming 1431 1 0 21:44 ? 00:00:00 python2.6
>>>> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
>>>> 34188 --ncpus=1 -e -d
>>>> enming 1481 1436 0 21:46 pts/0 00:00:00 grep mpd
>>>>
>>>> Compute Node 3:
>>>>
>>>> [enming at enming-f11-pv-hpc-node0003 ~]$ mpdtrace -l
>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>> [enming at enming-f11-pv-hpc-node0003 ~]$ ps -ef | grep mpd
>>>> enming 1422 1 0 21:44 ? 00:00:00 python2.6
>>>> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
>>>> 34188 --ncpus=1 -e -d
>>>> enming 1473 1427 0 21:47 pts/0 00:00:00 grep mpd
>>>>
>>>> Compute Node 4:
>>>>
>>>> [enming at enming-f11-pv-hpc-node0004 ~]$ mpdtrace -l
>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>> [enming at enming-f11-pv-hpc-node0004 ~]$ ps -ef | grep mpd
>>>> enming 1432 1 0 21:44 ? 00:00:00 python2.6
>>>> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
>>>> 34188 --ncpus=1 -e -d
>>>> enming 1482 1437 0 21:47 pts/0 00:00:00 grep mpd
>>>>
>>>> Compute Node 5:
>>>>
>>>> [enming at enming-f11-pv-hpc-node0005 ~]$ mpdtrace -l
>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>> [enming at enming-f11-pv-hpc-node0005 ~]$ ps -ef | grep mpd
>>>> enming 1423 1 0 21:44 ? 00:00:00 python2.6
>>>> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
>>>> 34188 --ncpus=1 -e -d
>>>> enming 1475 1429 0 21:48 pts/0 00:00:00 grep mpd
>>>>
>>>> Compute Node 6:
>>>>
>>>> [enming at enming-f11-pv-hpc-node0006 ~]$ mpdtrace -l
>>>> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
>>>> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
>>>> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
>>>> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
>>>> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
>>>> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
>>>> [enming at enming-f11-pv-hpc-node0006 ~]$ ps -ef | grep mpd
>>>> enming 1427 1 0 21:44 ? 00:00:00 python2.6
>>>> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0002 -p
>>>> 42012 --ncpus=1 -e -d
>>>> enming 1477 1432 0 21:49 pts/0 00:00:00 grep mpd
>>>>
>>>> =================================================
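The six mpdtrace -l listings above each start from the local node, so they look different at first glance. A minimal sketch to confirm they all describe the same ring, by checking that every listing is a rotation of the one from node 1. The node numbers are the last digit of each hostname in the output; is_rotation is a helper written for this check, not part of MPICH2.

```python
def is_rotation(a: list, b: list) -> bool:
    """Return True if list b is list a rotated by some offset."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    doubled = a + a
    return any(doubled[i:i + len(b)] == b for i in range(len(a)))


# Ring order as printed by "mpdtrace -l" on each node (node number -> listing)
traces = {
    1: [1, 5, 4, 3, 2, 6],
    2: [2, 6, 1, 5, 4, 3],
    3: [3, 2, 6, 1, 5, 4],
    4: [4, 3, 2, 6, 1, 5],
    5: [5, 4, 3, 2, 6, 1],
    6: [6, 1, 5, 4, 3, 2],
}

reference = traces[1]
consistent = all(is_rotation(reference, trace) for trace in traces.values())
print("all nodes see the same ring:", consistent)
```

So every mpd agrees on the ring membership and order, which suggests the failure happens after ring formation, when mpiexec talks to its local mpd.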
>>>>
>>>> Should I increase the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py
>>>> file, or should I change the communication method to sock?
>>>>
>>>> The mpiexec 392 error says no msg recvd from mpd when expecting ack of
>>>> request. My guess is that the acknowledgement is taking a very long time to
>>>> arrive and the MPIEXEC_RECV_TIMEOUT value is too low to wait for it, which
>>>> would cause the mpiexec 392 error in my case. I am using a virtual network
>>>> adapter, not a physical Gigabit network adapter.
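To illustrate the kind of failure being guessed at here, the following is a generic sketch of waiting for an acknowledgement with a receive timeout. It is not the code from mpiexec.py; the function recv_ack, the 0.2 second timeout and the local socket pair standing in for mpiexec and mpd are all assumptions made for the example.

```python
import select
import socket


def recv_ack(sock: socket.socket, timeout_s: float):
    """Wait up to timeout_s seconds for data; return None if nothing arrives."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None  # corresponds to "no msg recvd ... when expecting ack"
    return sock.recv(1024)


def demo():
    # A connected local socket pair stands in for the requester and the daemon.
    requester, daemon = socket.socketpair()
    try:
        silent = recv_ack(requester, 0.2)    # daemon has sent nothing yet
        daemon.sendall(b"ack")
        answered = recv_ack(requester, 0.2)  # now an ack is waiting
        return silent, answered
    finally:
        requester.close()
        daemon.close()


print(demo())
```

A longer timeout only helps if the acknowledgement is merely late. If it is never sent or is lost on the way, raising the value just delays the same error.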
>>>>
>>>> =================================================
>>>>
>>>> [root at enming-f11-pv-hpc-node0001 ~]# cat /proc/cpuinfo
>>>> processor : 0
>>>> vendor_id : GenuineIntel
>>>> cpu family : 6
>>>> model : 23
>>>> model name : Pentium(R) Dual-Core CPU E6300 @ 2.80GHz
>>>> stepping : 10
>>>> cpu MHz : 2800.098
>>>> cache size : 2048 KB
>>>> fpu : yes
>>>> fpu_exception : yes
>>>> cpuid level : 13
>>>> wp : yes
>>>> flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2
>>>> ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm
>>>> bogomips : 5600.19
>>>> clflush size : 64
>>>> cache_alignment : 64
>>>> address sizes : 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>> processor : 1
>>>> vendor_id : GenuineIntel
>>>> cpu family : 6
>>>> model : 23
>>>> model name : Pentium(R) Dual-Core CPU E6300 @ 2.80GHz
>>>> stepping : 10
>>>> cpu MHz : 2800.098
>>>> cache size : 2048 KB
>>>> fpu : yes
>>>> fpu_exception : yes
>>>> cpuid level : 13
>>>> wp : yes
>>>> flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2
>>>> ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm
>>>> bogomips : 5600.19
>>>> clflush size : 64
>>>> cache_alignment : 64
>>>> address sizes : 36 bits physical, 48 bits virtual
>>>> power management:
>>>>
>>>> [root at enming-f11-pv-hpc-node0001 ~]# cat /proc/meminfo
>>>> MemTotal: 532796 kB
>>>> MemFree: 386156 kB
>>>> Buffers: 12904 kB
>>>> Cached: 48864 kB
>>>> SwapCached: 0 kB
>>>> Active: 34884 kB
>>>> Inactive: 43252 kB
>>>> Active(anon): 16504 kB
>>>> Inactive(anon): 0 kB
>>>> Active(file): 18380 kB
>>>> Inactive(file): 43252 kB
>>>> Unevictable: 0 kB
>>>> Mlocked: 0 kB
>>>> SwapTotal: 2195448 kB
>>>> SwapFree: 2195448 kB
>>>> Dirty: 12 kB
>>>> Writeback: 0 kB
>>>> AnonPages: 16444 kB
>>>> Mapped: 8864 kB
>>>> Slab: 10528 kB
>>>> SReclaimable: 4668 kB
>>>> SUnreclaim: 5860 kB
>>>> PageTables: 2996 kB
>>>> NFS_Unstable: 0 kB
>>>> Bounce: 0 kB
>>>> WritebackTmp: 0 kB
>>>> CommitLimit: 2461844 kB
>>>> Committed_AS: 73024 kB
>>>> VmallocTotal: 34359738367 kB
>>>> VmallocUsed: 6332 kB
>>>> VmallocChunk: 34359724899 kB
>>>> HugePages_Total: 0
>>>> HugePages_Free: 0
>>>> HugePages_Rsvd: 0
>>>> HugePages_Surp: 0
>>>> Hugepagesize: 2048 kB
>>>> DirectMap4k: 524288 kB
>>>> DirectMap2M: 0 kB
>>>> [root at enming-f11-pv-hpc-node0001 ~]# lspci -v
>>>> [root at enming-f11-pv-hpc-node0001 ~]# lsusb
>>>> [root at enming-f11-pv-hpc-node0001 ~]# ifconfig eth0
>>>> eth0 Link encap:Ethernet HWaddr 00:16:3E:69:E9:11
>>>> inet addr:192.168.1.254 Bcast:192.168.1.255
>>>> Mask:255.255.255.0
>>>> inet6 addr: fe80::216:3eff:fe69:e911/64 Scope:Link
>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>> RX packets:5518 errors:26 dropped:0 overruns:0 frame:0
>>>> TX packets:4832 errors:0 dropped:0 overruns:0 carrier:0
>>>> collisions:0 txqueuelen:1000
>>>> RX bytes:872864 (852.4 KiB) TX bytes:3972981 (3.7 MiB)
>>>> Interrupt:17
>>>>
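A side check on the ifconfig output above: the inet6 link-local address should be derivable from the HWaddr by the usual modified EUI-64 rule (flip the universal/local bit of the first octet and insert ff:fe in the middle). A short sketch, with link_local_from_mac as a helper name made up for this example:

```python
import ipaddress


def link_local_from_mac(mac: str) -> str:
    """Build the fe80::/64 link-local address from a MAC via modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = bytes.fromhex("fe80") + bytes(6)  # fe80:0000:0000:0000
    return str(ipaddress.IPv6Address(prefix + interface_id))


print(link_local_from_mac("00:16:3E:69:E9:11"))
```

The result matches the inet6 addr line, so the MAC address preset through the vif directive is the one the guest is actually using.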
>>>> [root at enming-f11-pv-hpc-node0001 ~]# ethtool eth0
>>>> Settings for eth0:
>>>> Link detected: yes
>>>> [root at enming-f11-pv-hpc-node0001 ~]# netstat -i
>>>> Kernel Interface table
>>>> Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
>>>> eth0   1500   0  5589     26      0      0  4875      0      0      0 BMRU
>>>> lo    16436   0   127      0      0      0   127      0      0      0 LRU
>>>> [root at enming-f11-pv-hpc-node0001 ~]# uname -a
>>>> Linux enming-f11-pv-hpc-node0001 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May
>>>> 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>>>> You have new mail in /var/spool/mail/root
>>>> [root at enming-f11-pv-hpc-node0001 ~]# cat /etc/redhat-release
>>>> Fedora release 11 (Leonidas)
>>>>
>>>> =================================================
>>>>
>>>> Please advise.
>>>>
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> On Fri, Oct 30, 2009 at 11:55 AM, Mr. Teo En Ming (Zhang Enming) <
>>>> space.time.universe at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am getting the same mpiexec 392 error message as Kenneth Yoshimoto
>>>>> from the San Diego Supercomputer Center. His mpich-discuss mailing list
>>>>> topic URL is
>>>>> http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html
>>>>>
>>>>> I have actually already performed the 2-node mpdcheck utility test as
>>>>> described in Appendix A.1 of the MPICH2 installation guide. I could start
>>>>> the ring of mpd on the 2-node test scenario using mpdboot successfully as
>>>>> well.
>>>>>
>>>>> 薛正华 (ID: zhxue123) from China reported solving the mpiexec 392 error.
>>>>> According to 薛正华, the cause of the mpiexec 392 error is the absence of high
>>>>> performance network in his environment. He had changed the default
>>>>> communication method from nemesis to ssm and also increased the value of
>>>>> MPIEXEC_RECV_TIMEOUT in the mpiexec.py python source code. The URL of his
>>>>> report is at
>>>>> http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx
>>>>>
>>>>> Could this be my problem also?
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 30, 2009 at 11:09 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>>>>
>>>>>> You need to do the mpdcheck tests with every pair of compute nodes.
>>>>>> Or to isolate the problem, try running on a smaller set of nodes first and
>>>>>> increase it one at a time until it fails.
>>>>>>
>>>>>> Rajeev
>>>>>>
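The two suggestions above can be turned into a checklist. This sketch only prints what to run; it does not execute anything. It assumes the usual mpdcheck usage from the install guide appendix (mpdcheck -s on one node, then mpdcheck -c with the host and the port that the server side reports), and it lists ordered pairs so that each node is tried both as server and as client. The helper names are made up for the example.

```python
from itertools import permutations

nodes = [f"enming-f11-pv-hpc-node{i:04d}" for i in range(1, 7)]


def mpdcheck_pairs(hosts):
    """Every (server, client) combination of two different hosts."""
    return list(permutations(hosts, 2))


def growing_sets(hosts):
    """Node sets of size 2, 3, ... for adding one node at a time."""
    return [hosts[:size] for size in range(2, len(hosts) + 1)]


pairs = mpdcheck_pairs(nodes)
for server, client in pairs:
    # <port> is whatever "mpdcheck -s" prints on the server node
    print(f"{server}: mpdcheck -s | {client}: mpdcheck -c {server} <port>")

for subset in growing_sets(nodes):
    print(f"mpdboot -n {len(subset)} with: {', '.join(subset)}")
```

With 6 nodes that is 30 server/client checks, or 5 mpdboot runs when growing the ring one node at a time.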
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* mpich-discuss-bounces at mcs.anl.gov [mailto:
>>>>>> mpich-discuss-bounces at mcs.anl.gov] *On Behalf Of *Mr. Teo En Ming
>>>>>> (Zhang Enming)
>>>>>> *Sent:* Thursday, October 29, 2009 2:35 PM
>>>>>> *To:* mpich-discuss at mcs.anl.gov
>>>>>> *Subject:* [mpich-discuss] (mpiexec 392): no msg recvd from mpd when
>>>>>> expecting ack of request
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have just installed MPICH2 in my Xen-based virtual machines.
>>>>>>
>>>>>> My hardware configuration is as follows:
>>>>>>
>>>>>> Processor: Intel Pentium Dual Core E6300 @ 2.8 GHz
>>>>>> Motherboard: Intel Desktop Board DQ45CB BIOS 0093
>>>>>> Memory: 4X 2GB Kingston DDR2-800 CL5
>>>>>>
>>>>>> My software configuration is as follows:
>>>>>>
>>>>>> Xen Hypervisor / Virtual Machine Monitor Version: 3.5-unstable
>>>>>> Jeremy Fitzhardinge's pv-ops dom0 kernel: 2.6.31.4
>>>>>> Host Operating System: Fedora Linux 11 x86-64 (SELinux disabled)
>>>>>> Guest Operating Systems: Fedora Linux 11 x86-64 paravirtualized (PV)
>>>>>> domU guests (SELinux disabled)
>>>>>>
>>>>>> I have successfully configured, built and installed MPICH2 in an F11 PV
>>>>>> guest OS, master compute node 1, which also runs the NFS server (MPICH2 bin
>>>>>> subdirectory exported). The other 5 compute nodes access the MPICH2
>>>>>> binaries by mounting the NFS share from node 1. Please see attached c.txt, m.txt
>>>>>> and mi.txt. With Xen virtualization, I have created 6 F11 Linux PV guests to
>>>>>> simulate 6 HPC compute nodes. The network adapter (NIC) in each guest OS is
>>>>>> virtual. The Xen networking type is bridged. Running "lspci -v" and lsusb in
>>>>>> each guest OS does not show anything.
>>>>>>
>>>>>> Following the Appendix A troubleshooting section of the MPICH2 install
>>>>>> guide, I have verified that the 2-node test scenario with "mpdcheck -s" and
>>>>>> "mpdcheck -c" works. The 2 nodes can communicate with each other without
>>>>>> problems, with each node acting in turn as server and as client. I have
>>>>>> also tested mpdboot with the 2-node scenario and the ring of mpd is
>>>>>> working.
>>>>>>
>>>>>> After the troubleshooting process, I have successfully created a ring
>>>>>> of mpd involving 6 compute nodes. "mpdtrace -l" successfully lists all the 6
>>>>>> nodes. However, when I want to run a job with mpiexec, it gives me the
>>>>>> following error:
>>>>>>
>>>>>> [enming at enming-f11-pv-hpc-node0001 ~]$ mpiexec -n 2 examples/cpi
>>>>>> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from
>>>>>> mpd when expecting ack of request
>>>>>>
>>>>>> I have also tried starting the mpd ring with the root user but I still
>>>>>> encounter the same error above.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> PS. config.log is also attached.
>>>>>>
>>>>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 01.png
Type: image/png
Size: 31551 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091030/2701f3c9/attachment-0008.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 02.png
Type: image/png
Size: 54749 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091030/2701f3c9/attachment-0009.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 03.png
Type: image/png
Size: 29169 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091030/2701f3c9/attachment-0010.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 04.png
Type: image/png
Size: 56421 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091030/2701f3c9/attachment-0011.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 05.png
Type: image/png
Size: 55104 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091030/2701f3c9/attachment-0012.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 06.png
Type: image/png
Size: 30293 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091030/2701f3c9/attachment-0013.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 07.png
Type: image/png
Size: 56412 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091030/2701f3c9/attachment-0014.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 08.png
Type: image/png
Size: 55332 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20091030/2701f3c9/attachment-0015.png>