[mpich-discuss] (mpiexec 392): no msg recvd from mpd when expecting ack of request

Mr. Teo En Ming (Zhang Enming) space.time.universe at gmail.com
Fri Oct 30 01:07:17 CDT 2009


All six compute nodes are identical PV virtual machines.

-- 
Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)(Mechanical
Engineering)
Alma Maters:
(1) Singapore Polytechnic
(2) National University of Singapore
My blog URL: http://teo-en-ming-aka-zhang-enming.blogspot.com
My Youtube videos: http://www.youtube.com/user/enmingteo
Email: space.time.universe at gmail.com
MSN: teoenming at hotmail.com
Mobile Phone (SingTel): +65-9648-9798
Mobile Phone (Starhub Prepaid): +65-8369-2618
Age: 31 (as at 30 Oct 2009)
Height: 1.78 meters
Race: Chinese
Dialect: Hokkien
Street: Bedok Reservoir Road
Country: Singapore

On Fri, Oct 30, 2009 at 2:04 PM, Mr. Teo En Ming (Zhang Enming) <
space.time.universe at gmail.com> wrote:

> Hi,
>
> I have changed the communication method from nemesis (the high-performance
> network method) to ssm (sockets between nodes, shared memory within a node) by
> recompiling MPICH2. I have also pinned the MAC address of the virtual network
> adapter eth0 in each compute node (each compute node is a Xen paravirtualized
> virtual machine) by setting the vif directive in each PV domU configuration
> file.
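>
> For reference, the rebuild and the MAC pinning looked roughly like this (the
> install prefix is the one from my setup below; the bridge name is illustrative
> and may differ on your dom0):
>
> ./configure --prefix=/home/enming/mpich2-install --with-device=ch3:ssm
> make && make install
>
> # in each PV domU configuration file (MAC shown is node 1's eth0)
> vif = [ 'mac=00:16:3e:69:e9:11, bridge=eth0' ]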
>
> I have also turned off iptables to simplify troubleshooting and to let the mpd
> daemons on all nodes talk to each other freely. Passwordless SSH works between
> all the compute nodes.
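>
> Roughly what I ran on each node (the iptables part as root; hostnames are my
> own, adjust as needed):
>
> service iptables stop
> chkconfig iptables off
>
> # as the enming user: generate a key once, then copy it to every other node
> ssh-keygen -t rsa                                  # empty passphrase
> ssh-copy-id enming@enming-f11-pv-hpc-node0002      # repeat for nodes 3 to 6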
>
> After doing all of the above, I am still encountering the mpiexec 392 error.
>
>
> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from mpd
> when expecting ack of request
>
> =================================================
>
> Master Node / Compute Node 1:
>
> [enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
> enming    1499  1455  0 21:44 pts/0    00:00:00 grep mpd
> [enming at enming-f11-pv-hpc-node0001 ~]$ mpdboot -n 6
> [enming at enming-f11-pv-hpc-node0001 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> [enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
> enming    1505     1  0 21:44 ?        00:00:00 python2.6
> /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d
>
> Compute Node 2:
>
> [enming at enming-f11-pv-hpc-node0002 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> [enming at enming-f11-pv-hpc-node0002 ~]$ ps -ef | grep mpd
> enming    1431     1  0 21:44 ?        00:00:00 python2.6
> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
> 34188 --ncpus=1 -e -d
> enming    1481  1436  0 21:46 pts/0    00:00:00 grep mpd
>
> Compute Node 3:
>
> [enming at enming-f11-pv-hpc-node0003 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> [enming at enming-f11-pv-hpc-node0003 ~]$ ps -ef | grep mpd
> enming    1422     1  0 21:44 ?        00:00:00 python2.6
> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
> 34188 --ncpus=1 -e -d
> enming    1473  1427  0 21:47 pts/0    00:00:00 grep mpd
>
> Compute Node 4:
>
> [enming at enming-f11-pv-hpc-node0004 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> [enming at enming-f11-pv-hpc-node0004 ~]$ ps -ef | grep mpd
> enming    1432     1  0 21:44 ?        00:00:00 python2.6
> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
> 34188 --ncpus=1 -e -d
> enming    1482  1437  0 21:47 pts/0    00:00:00 grep mpd
>
> Compute Node 5:
>
> [enming at enming-f11-pv-hpc-node0005 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> [enming at enming-f11-pv-hpc-node0005 ~]$ ps -ef | grep mpd
> enming    1423     1  0 21:44 ?        00:00:00 python2.6
> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
> 34188 --ncpus=1 -e -d
> enming    1475  1429  0 21:48 pts/0    00:00:00 grep mpd
>
> Compute Node 6:
>
> [enming at enming-f11-pv-hpc-node0006 ~]$ mpdtrace -l
> enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
> enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
> enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
> enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
> enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
> enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
> [enming at enming-f11-pv-hpc-node0006 ~]$ ps -ef | grep mpd
> enming    1427     1  0 21:44 ?        00:00:00 python2.6
> /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0002 -p
> 42012 --ncpus=1 -e -d
> enming    1477  1432  0 21:49 pts/0    00:00:00 grep mpd
>
> =================================================
>
> Should I increase the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py file,
> or should I change the communication method to sock?
>
> The mpiexec 392 error says no msg recvd from mpd when expecting ack of
> request, so my suspicion is that the acknowledgement simply takes longer to
> arrive than MPIEXEC_RECV_TIMEOUT allows, which would explain the error in my
> case. I am using a virtual network adapter, not a physical Gigabit adapter.
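>
> If it helps, this is roughly what I have in mind (I have not checked the exact
> variable name or default inside mpiexec.py, so treat the first part as a
> sketch):
>
> grep -n RECV_TIMEOUT /home/enming/mpich2-install/bin/mpiexec.py
> # raise whatever value is found there, e.g. to 60 or more, and retry
>
> # or rebuild with the plain sockets channel instead:
> ./configure --prefix=/home/enming/mpich2-install --with-device=ch3:sock
> make && make install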
>
> =================================================
>
> [root at enming-f11-pv-hpc-node0001 ~]# cat /proc/cpuinfo
> processor    : 0
> vendor_id    : GenuineIntel
> cpu family    : 6
> model        : 23
> model name    : Pentium(R) Dual-Core  CPU      E6300  @ 2.80GHz
> stepping    : 10
> cpu MHz        : 2800.098
> cache size    : 2048 KB
> fpu        : yes
> fpu_exception    : yes
> cpuid level    : 13
> wp        : yes
> flags        : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss
> ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm
> bogomips    : 5600.19
> clflush size    : 64
> cache_alignment    : 64
> address sizes    : 36 bits physical, 48 bits virtual
> power management:
>
> processor    : 1
> vendor_id    : GenuineIntel
> cpu family    : 6
> model        : 23
> model name    : Pentium(R) Dual-Core  CPU      E6300  @ 2.80GHz
> stepping    : 10
> cpu MHz        : 2800.098
> cache size    : 2048 KB
> fpu        : yes
> fpu_exception    : yes
> cpuid level    : 13
> wp        : yes
> flags        : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss
> ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm
> bogomips    : 5600.19
> clflush size    : 64
> cache_alignment    : 64
> address sizes    : 36 bits physical, 48 bits virtual
> power management:
>
> [root at enming-f11-pv-hpc-node0001 ~]# cat /proc/meminfo
> MemTotal:         532796 kB
> MemFree:          386156 kB
> Buffers:           12904 kB
> Cached:            48864 kB
> SwapCached:            0 kB
> Active:            34884 kB
> Inactive:          43252 kB
> Active(anon):      16504 kB
> Inactive(anon):        0 kB
> Active(file):      18380 kB
> Inactive(file):    43252 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:       2195448 kB
> SwapFree:        2195448 kB
> Dirty:                12 kB
> Writeback:             0 kB
> AnonPages:         16444 kB
> Mapped:             8864 kB
> Slab:              10528 kB
> SReclaimable:       4668 kB
> SUnreclaim:         5860 kB
> PageTables:         2996 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:     2461844 kB
> Committed_AS:      73024 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:        6332 kB
> VmallocChunk:   34359724899 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:      524288 kB
> DirectMap2M:           0 kB
> [root at enming-f11-pv-hpc-node0001 ~]# lspci -v
> [root at enming-f11-pv-hpc-node0001 ~]# lsusb
> [root at enming-f11-pv-hpc-node0001 ~]# ifconfig eth0
> eth0      Link encap:Ethernet  HWaddr 00:16:3E:69:E9:11
>           inet addr:192.168.1.254  Bcast:192.168.1.255  Mask:255.255.255.0
>           inet6 addr: fe80::216:3eff:fe69:e911/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:5518 errors:26 dropped:0 overruns:0 frame:0
>           TX packets:4832 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:872864 (852.4 KiB)  TX bytes:3972981 (3.7 MiB)
>           Interrupt:17
>
> [root at enming-f11-pv-hpc-node0001 ~]# ethtool eth0
> Settings for eth0:
>     Link detected: yes
> [root at enming-f11-pv-hpc-node0001 ~]# netstat -i
> Kernel Interface table
> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP
> TX-OVR Flg
> eth0       1500   0     5589     26      0      0     4875      0
> 0      0 BMRU
> lo        16436   0      127      0      0      0      127      0
> 0      0 LRU
> [root at enming-f11-pv-hpc-node0001 ~]# uname -a
> Linux enming-f11-pv-hpc-node0001 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27
> 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
> You have new mail in /var/spool/mail/root
> [root at enming-f11-pv-hpc-node0001 ~]# cat /etc/redhat-release
> Fedora release 11 (Leonidas)
>
> =================================================
>
> Please advise.
>
>
> Thank you.
>
> --
> Mr. Teo En Ming (Zhang Enming)
>
> On Fri, Oct 30, 2009 at 11:55 AM, Mr. Teo En Ming (Zhang Enming) <
> space.time.universe at gmail.com> wrote:
>
>> Hi,
>>
>> I am getting the same mpiexec 392 error message as Kenneth Yoshimoto from
>> the San Diego Supercomputer Center. His mpich-discuss mailing list topic URL
>> is
>> http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html
>>
>> I have actually already performed the 2-node mpdcheck utility test as
>> described in Appendix A.1 of the MPICH2 installation guide. I could start
>> the ring of mpd on the 2-node test scenario using mpdboot successfully as
>> well.
>>
>> 薛正华 (ID: zhxue123) from China reported solving the mpiexec 392 error.
>> According to him, the cause was the absence of a high-performance network in
>> his environment. He changed the default communication method from nemesis to
>> ssm and also increased the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py
>> source code. His report is at
>> http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx
>>
>> Could this be my problem also?
>>
>> Thank you.
>>
>>
>> --
>> Mr. Teo En Ming (Zhang Enming)
>>
>> On Fri, Oct 30, 2009 at 11:09 AM, Rajeev Thakur <thakur at mcs.anl.gov>wrote:
>>
>>> You need to do the mpdcheck tests with every pair of compute nodes. Or, to
>>> isolate the problem, try running on a smaller set of nodes first and add one
>>> node at a time until it fails.
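>>>
>>> Concretely, something along these lines for each pair (host and port are
>>> whatever mpdcheck -s prints), followed by an incremental run:
>>>
>>> # node A:
>>> mpdcheck -s                     # prints a hostname and port
>>> # node B:
>>> mpdcheck -c <hostA> <port>
>>>
>>> # grow the ring and the job one node at a time:
>>> mpdboot -n 2 && mpiexec -n 2 examples/cpi && mpdallexit
>>> mpdboot -n 3 && mpiexec -n 3 examples/cpi && mpdallexit
>>> # and so on up to 6, noting where it first fails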
>>>
>>> Rajeev
>>>
>>>
>>> ------------------------------
>>> From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov]
>>> On Behalf Of Mr. Teo En Ming (Zhang Enming)
>>> Sent: Thursday, October 29, 2009 2:35 PM
>>> To: mpich-discuss at mcs.anl.gov
>>> Subject: [mpich-discuss] (mpiexec 392): no msg recvd from mpd when
>>> expecting ack of request
>>>
>>> Hi,
>>>
>>> I have just installed MPICH2 in my Xen-based virtual machines.
>>>
>>> My hardware configuration is as follows:
>>>
>>> Processor: Intel Pentium Dual Core E6300 @ 2.8 GHz
>>> Motherboard: Intel Desktop Board DQ45CB BIOS 0093
>>> Memory: 4X 2GB Kingston DDR2-800 CL5
>>>
>>> My software configuration is as follows:
>>>
>>> Xen Hypervisor / Virtual Machine Monitor Version: 3.5-unstable
>>> Jeremy Fitzhardinge's pv-ops dom0 kernel: 2.6.31.4
>>> Host Operating System: Fedora Linux 11 x86-64 (SELinux disabled)
>>> Guest Operating Systems: Fedora Linux 11 x86-64 paravirtualized (PV) domU
>>> guests (SELinux disabled)
>>>
>>> I have successfully configured, built and installed MPICH2 on an F11 PV
>>> guest OS, master compute node 1, which also runs an NFS server (the MPICH2
>>> bin subdirectory is exported). The other 5 compute nodes access the MPICH2
>>> binaries by mounting the NFS share from node 1. Please see the attached
>>> c.txt, m.txt and mi.txt. With Xen virtualization, I have created 6 F11 Linux
>>> PV guests to simulate 6 HPC compute nodes. The network adapter (NIC) in each
>>> guest OS is virtual and Xen networking is bridged. Running "lspci -v" and
>>> lsusb in each guest OS shows nothing.
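>>>
>>> Roughly how the NFS sharing is set up (the path and subnet match my
>>> environment; the exact export options are illustrative):
>>>
>>> # on node 1, in /etc/exports, then run "exportfs -ra":
>>> /home/enming/mpich2-install  192.168.1.0/24(ro,sync)
>>>
>>> # on nodes 2 to 6:
>>> mount -t nfs enming-f11-pv-hpc-node0001:/home/enming/mpich2-install \
>>>     /home/enming/mpich2-install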
>>>
>>> Following the Appendix A troubleshooting section of the MPICH2 install
>>> guide, I have verified that the 2-node test with "mpdcheck -s" and
>>> "mpdcheck -c" works: the two nodes can communicate without problems with
>>> either one acting as the server and the other as the client. I have also
>>> tested mpdboot in the 2-node scenario and the ring of mpd works.
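>>>
>>> Concretely, the 2-node check I ran looked roughly like this (the port is
>>> whatever "mpdcheck -s" prints):
>>>
>>> # on node 1:
>>> mpdcheck -s                       # prints its hostname and a port
>>> # on node 2:
>>> mpdcheck -c enming-f11-pv-hpc-node0001 <port>
>>>
>>> # then the 2-node ring (mpd.hosts on node 1 listing node 2):
>>> mpdboot -n 2 && mpdtrace -l && mpdallexit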
>>>
>>> After this troubleshooting, I have successfully created a ring of mpd across
>>> all 6 compute nodes, and "mpdtrace -l" lists all 6 of them. However, when I
>>> try to run a job with mpiexec, it fails with the following error:
>>>
>>> [enming at enming-f11-pv-hpc-node0001 ~]$ mpiexec -n 2 examples/cpi
>>> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from mpd
>>> when expecting ack of request
>>>
>>> I have also tried starting the mpd ring as the root user, but I still
>>> encounter the same error.
>>>
>>> Thank you.
>>>
>>> PS. config.log is also attached.
>>>
>>> --
>>> Mr. Teo En Ming (Zhang Enming)
>>>
>>>
>>> _______________________________________________
>>> mpich-discuss mailing list
>>>
>>> mpich-discuss at mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>>>
>>>
>>
>

