Hi,<br><br>I have changed the communication method from nemesis (the high-performance network method) to ssm (sockets between nodes and shared memory within a node) by recompiling MPICH2. I have also fixed the MAC address of the virtual network adapter eth0 in each compute node (each compute node is a Xen paravirtualized virtual machine) by setting the vif directive in each PV domU configuration file.<br>
<br>Additionally, I have turned off iptables to simplify troubleshooting and to allow communication between the mpd daemons on all nodes. Passwordless SSH works between all the compute nodes.<br><br>After doing all of the above, I still encounter the mpiexec 392 error.<br>
<br>mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from mpd when expecting ack of request<br><br>=================================================<br><br>Master Node / Compute Node 1:<br><br>[enming@enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd<br>
enming 1499 1455 0 21:44 pts/0 00:00:00 grep mpd<br>[enming@enming-f11-pv-hpc-node0001 ~]$ mpdboot -n 6<br>[enming@enming-f11-pv-hpc-node0001 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>
[enming@enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd<br>enming 1505 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d<br><br>Compute Node 2:<br><br>[enming@enming-f11-pv-hpc-node0002 ~]$ mpdtrace -l<br>
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>[enming@enming-f11-pv-hpc-node0002 ~]$ ps -ef | grep mpd<br>enming 1431 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d<br>
enming 1481 1436 0 21:46 pts/0 00:00:00 grep mpd<br><br>Compute Node 3:<br><br>[enming@enming-f11-pv-hpc-node0003 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>[enming@enming-f11-pv-hpc-node0003 ~]$ ps -ef | grep mpd<br>
enming 1422 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d<br>enming 1473 1427 0 21:47 pts/0 00:00:00 grep mpd<br><br>Compute Node 4: <br>
<br>[enming@enming-f11-pv-hpc-node0004 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>[enming@enming-f11-pv-hpc-node0004 ~]$ ps -ef | grep mpd<br>enming 1432 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d<br>
enming 1482 1437 0 21:47 pts/0 00:00:00 grep mpd<br><br>Compute Node 5:<br><br>[enming@enming-f11-pv-hpc-node0005 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>[enming@enming-f11-pv-hpc-node0005 ~]$ ps -ef | grep mpd<br>
enming 1423 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p 34188 --ncpus=1 -e -d<br>enming 1475 1429 0 21:48 pts/0 00:00:00 grep mpd<br><br>Compute Node 6: <br>
<br>[enming@enming-f11-pv-hpc-node0006 ~]$ mpdtrace -l<br>enming-f11-pv-hpc-node0006_55525 (192.168.1.249)<br>enming-f11-pv-hpc-node0001_34188 (192.168.1.254)<br>enming-f11-pv-hpc-node0005_39315 (192.168.1.250)<br>enming-f11-pv-hpc-node0004_46914 (192.168.1.251)<br>
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)<br>enming-f11-pv-hpc-node0002_42012 (192.168.1.253)<br>[enming@enming-f11-pv-hpc-node0006 ~]$ ps -ef | grep mpd<br>enming 1427 1 0 21:44 ? 00:00:00 python2.6 /home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0002 -p 42012 --ncpus=1 -e -d<br>
enming 1477 1432 0 21:49 pts/0 00:00:00 grep mpd<br><br>=================================================<br><br>Should I increase the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py file or should I change the communication method to sock?<br>
<br>The mpiexec 392 error says that no message was received from mpd when an acknowledgement of the request was expected. My guess is that the acknowledgement takes longer to arrive than the MPIEXEC_RECV_TIMEOUT value allows, and that this causes the mpiexec 392 error in my case. I am using a virtual network adapter, not a physical Gigabit network adapter.<br>
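<br>One way to test this guess before touching mpiexec.py is to measure how long a plain TCP connection to each mpd takes. The following is only a minimal sketch under stated assumptions: it assumes the number after the underscore in each "mpdtrace -l" line is that mpd's listening port (the "-p 34188" in the ps output above suggests so), it assumes a Python 3 interpreter rather than the python2.6 that runs mpd, and the helper names parse_mpdtrace and time_connect are made up for illustration. It only opens and closes a connection; whether mpd tolerates that on a production ring is also an assumption, so try it on a test ring first.<br>

```python
import socket
import time


def parse_mpdtrace(text):
    """Parse `mpdtrace -l` lines of the form `host_port (ip)`.

    Returns a list of (host, port, ip) tuples; lines that do not
    match that shape (for example shell prompts) are skipped.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if " (" not in line:
            continue
        ident, _, rest = line.partition(" (")
        host, _, port = ident.rpartition("_")
        entries.append((host, int(port), rest.rstrip(")")))
    return entries


def time_connect(ip, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure.

    The 5-second timeout is an arbitrary illustrative value.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

To use it, paste the "mpdtrace -l" output into a string, pass it to parse_mpdtrace, and call time_connect(ip, port) for each entry. Connect times of a few milliseconds would suggest that raw network latency alone is unlikely to exhaust the timeout.<br>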
<br>=================================================<br><br>[root@enming-f11-pv-hpc-node0001 ~]# cat /proc/cpuinfo<br>processor : 0<br>vendor_id : GenuineIntel<br>cpu family : 6<br>model : 23<br>model name : Pentium(R) Dual-Core CPU E6300 @ 2.80GHz<br>
stepping : 10<br>cpu MHz : 2800.098<br>cache size : 2048 KB<br>fpu : yes<br>fpu_exception : yes<br>cpuid level : 13<br>wp : yes<br>flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm<br>
bogomips : 5600.19<br>clflush size : 64<br>cache_alignment : 64<br>address sizes : 36 bits physical, 48 bits virtual<br>power management:<br><br>processor : 1<br>vendor_id : GenuineIntel<br>cpu family : 6<br>
model : 23<br>model name : Pentium(R) Dual-Core CPU E6300 @ 2.80GHz<br>stepping : 10<br>cpu MHz : 2800.098<br>cache size : 2048 KB<br>fpu : yes<br>fpu_exception : yes<br>cpuid level : 13<br>
wp : yes<br>flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm<br>bogomips : 5600.19<br>clflush size : 64<br>cache_alignment : 64<br>
address sizes : 36 bits physical, 48 bits virtual<br>power management:<br><br>[root@enming-f11-pv-hpc-node0001 ~]# cat /proc/meminfo<br>MemTotal: 532796 kB<br>MemFree: 386156 kB<br>Buffers: 12904 kB<br>
Cached: 48864 kB<br>SwapCached: 0 kB<br>Active: 34884 kB<br>Inactive: 43252 kB<br>Active(anon): 16504 kB<br>Inactive(anon): 0 kB<br>Active(file): 18380 kB<br>Inactive(file): 43252 kB<br>
Unevictable: 0 kB<br>Mlocked: 0 kB<br>SwapTotal: 2195448 kB<br>SwapFree: 2195448 kB<br>Dirty: 12 kB<br>Writeback: 0 kB<br>AnonPages: 16444 kB<br>Mapped: 8864 kB<br>
Slab: 10528 kB<br>SReclaimable: 4668 kB<br>SUnreclaim: 5860 kB<br>PageTables: 2996 kB<br>NFS_Unstable: 0 kB<br>Bounce: 0 kB<br>WritebackTmp: 0 kB<br>CommitLimit: 2461844 kB<br>
Committed_AS: 73024 kB<br>VmallocTotal: 34359738367 kB<br>VmallocUsed: 6332 kB<br>VmallocChunk: 34359724899 kB<br>HugePages_Total: 0<br>HugePages_Free: 0<br>HugePages_Rsvd: 0<br>HugePages_Surp: 0<br>
Hugepagesize: 2048 kB<br>DirectMap4k: 524288 kB<br>DirectMap2M: 0 kB<br>[root@enming-f11-pv-hpc-node0001 ~]# lspci -v<br>[root@enming-f11-pv-hpc-node0001 ~]# lsusb<br>[root@enming-f11-pv-hpc-node0001 ~]# ifconfig eth0<br>
eth0 Link encap:Ethernet HWaddr 00:16:3E:69:E9:11 <br> inet addr:192.168.1.254 Bcast:192.168.1.255 Mask:255.255.255.0<br> inet6 addr: fe80::216:3eff:fe69:e911/64 Scope:Link<br> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br>
RX packets:5518 errors:26 dropped:0 overruns:0 frame:0<br> TX packets:4832 errors:0 dropped:0 overruns:0 carrier:0<br> collisions:0 txqueuelen:1000 <br> RX bytes:872864 (852.4 KiB) TX bytes:3972981 (3.7 MiB)<br>
Interrupt:17 <br><br>[root@enming-f11-pv-hpc-node0001 ~]# ethtool eth0<br>Settings for eth0:<br> Link detected: yes<br>[root@enming-f11-pv-hpc-node0001 ~]# netstat -i<br>Kernel Interface table<br>Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg<br>
eth0 1500 0 5589 26 0 0 4875 0 0 0 BMRU<br>lo 16436 0 127 0 0 0 127 0 0 0 LRU<br>[root@enming-f11-pv-hpc-node0001 ~]# uname -a<br>
Linux enming-f11-pv-hpc-node0001 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27 17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux<br>You have new mail in /var/spool/mail/root<br>[root@enming-f11-pv-hpc-node0001 ~]# cat /etc/redhat-release <br>
Fedora release 11 (Leonidas)<br><br>=================================================<br><br>Please advise.<br><br>Thank you.<br><br>-- <br>Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)(Mechanical Engineering)<br>
Alma Maters:<br>(1) Singapore Polytechnic<br>(2) National University of Singapore<br>My blog URL: <a href="http://teo-en-ming-aka-zhang-enming.blogspot.com">http://teo-en-ming-aka-zhang-enming.blogspot.com</a><br>My Youtube videos: <a href="http://www.youtube.com/user/enmingteo">http://www.youtube.com/user/enmingteo</a><br>
Email: <a href="mailto:space.time.universe@gmail.com">space.time.universe@gmail.com</a><br>MSN: <a href="mailto:teoenming@hotmail.com">teoenming@hotmail.com</a><br>Mobile Phone (SingTel): +65-9648-9798<br>Mobile Phone (Starhub Prepaid): +65-8369-2618<br>
Age: 31 (as at 30 Oct 2009)<br>Height: 1.78 meters<br>Race: Chinese<br>Dialect: Hokkien<br>Street: Bedok Reservoir Road<br>Country: Singapore<br><br><div class="gmail_quote">On Fri, Oct 30, 2009 at 11:55 AM, Mr. Teo En Ming (Zhang Enming) <span dir="ltr"><<a href="mailto:space.time.universe@gmail.com">space.time.universe@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi,<br><br>I am getting the same mpiexec 392 error message as Kenneth Yoshimoto from the San Diego Supercomputer Center. His mpich-discuss mailing list topic URL is <a href="http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html" target="_blank">http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html</a><br>
<br>I have already performed the 2-node mpdcheck test as described in Appendix A.1 of the MPICH2 installation guide. I could also start the ring of mpd successfully in the 2-node scenario using mpdboot.<br>
<br>薛正华 (ID: zhxue123) from China reported solving the mpiexec 392 error. According to 薛正华, the cause of the mpiexec 392 error was the absence of a high-performance network in his environment. He changed the default communication method from nemesis to ssm and also increased the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py Python source code. His report is at <a href="http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx" target="_blank">http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx</a><br>
<br>Could this be my problem also?<br><br>Thank you.<br><br><div class="gmail_quote"><div><div></div><div class="h5">On Fri, Oct 30, 2009 at 11:09 AM, Rajeev Thakur <span dir="ltr"><<a href="mailto:thakur@mcs.anl.gov" target="_blank">thakur@mcs.anl.gov</a>></span> wrote:<br>
</div></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div class="h5">
<div>
<div dir="ltr" align="left"><span><font face="Arial" size="2" color="#0000ff">You need to do the mpdcheck tests with every pair of compute
nodes. Or to isolate the problem, try running on a smaller set of nodes first
and increase it one at a time until it fails.</font></span></div>
<div dir="ltr" align="left"><span><font face="Arial" size="2" color="#0000ff"></font></span> </div>
<div dir="ltr" align="left"><span><font face="Arial" size="2" color="#0000ff">Rajeev</font></span></div>
<div dir="ltr" align="left"><span></span> </div><br>
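<div dir="ltr" align="left">The two strategies above can be sketched as a small planning script. This is a hedged illustration, not part of MPICH2: the helper names mpdcheck_pairs and growing_subsets are made up, the host names are the six from this thread, and "&lt;port&gt;" stands for whatever port "mpdcheck -s" prints on the server node. The script only prints the commands to run; it does not execute them.</div>

```python
from itertools import permutations


def mpdcheck_pairs(nodes):
    """Every ordered (server, client) pair, so each node is tested in both roles."""
    return list(permutations(nodes, 2))


def growing_subsets(nodes, start=2):
    """Prefixes of the node list: the first 2 nodes, the first 3, and so on."""
    return [nodes[:n] for n in range(start, len(nodes) + 1)]


# Host names as they appear in the mpdtrace output of this thread.
nodes = ["enming-f11-pv-hpc-node%04d" % i for i in range(1, 7)]

# Strategy 1: pairwise mpdcheck between every pair of compute nodes.
for server, client in mpdcheck_pairs(nodes):
    print("on %s: mpdcheck -s   then on %s: mpdcheck -c %s <port>"
          % (server, client, server))

# Strategy 2: grow the ring one node at a time until mpiexec fails.
for subset in growing_subsets(nodes):
    n = len(subset)
    print("mpdboot -n %d ; mpiexec -n %d examples/cpi ; mpdallexit" % (n, n))
```

<div dir="ltr" align="left">For six nodes this lists 30 server/client combinations and 5 ring sizes (2 through 6). The first ring size at which mpiexec fails points at the node added in that step.</div>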
<blockquote style="border-left: 2px solid rgb(0, 0, 255); padding-left: 5px; margin-left: 5px; margin-right: 0px;">
<div dir="ltr" lang="en-us" align="left">
<hr>
<font face="Tahoma" size="2"><b>From:</b> <a href="mailto:mpich-discuss-bounces@mcs.anl.gov" target="_blank">mpich-discuss-bounces@mcs.anl.gov</a>
[mailto:<a href="mailto:mpich-discuss-bounces@mcs.anl.gov" target="_blank">mpich-discuss-bounces@mcs.anl.gov</a>] <b>On Behalf Of </b>Mr. Teo En Ming
(Zhang Enming)<br><b>Sent:</b> Thursday, October 29, 2009 2:35
PM<br><b>To:</b> <a href="mailto:mpich-discuss@mcs.anl.gov" target="_blank">mpich-discuss@mcs.anl.gov</a><br><b>Subject:</b> [mpich-discuss]
(mpiexec 392): no msg recvd from mpd when expecting ack of
request<br></font><br></div><div><div></div><div>
<div></div>Hi,<br><br>I have just installed MPICH2 in my Xen-based virtual
machines.<br><br>My hardware configuration is as follows:<br><br>Processor:
Intel Pentium Dual Core E6300 @ 2.8 GHz<br>Motherboard: Intel Desktop Board
DQ45CB BIOS 0093<br>Memory: 4X 2GB Kingston DDR2-800 CL5<br><br>My software
configuration is as follows:<br><br>Xen Hypervisor / Virtual Machine Monitor
Version: 3.5-unstable<br>Jeremy Fitzhardinge's pv-ops dom0 kernel:
2.6.31.4<br>Host Operating System: Fedora Linux 11 x86-64 (SELinux
disabled)<br>Guest Operating Systems: Fedora Linux 11 x86-64 paravirtualized
(PV) domU guests (SELinux disabled)<br><br>I have successfully configured,
built and installed MPICH2 on master compute node 1, an F11 PV guest that also
runs the NFS server (the MPICH2 bin subdirectory is exported). The other 5
compute nodes access the MPICH2 binaries by mounting the NFS share from node 1. Please
see attached c.txt, m.txt and mi.txt. With Xen virtualization, I have created
6 F11 Linux PV guests to simulate 6 HPC compute nodes. The network adapter
(NIC) in each guest OS is virtual. The Xen networking type is bridged. Running
"lspci -v" and "lsusb" in each guest OS shows
nothing.<br><br>Following the Appendix A troubleshooting section of the MPICH2
install guide, I have verified that the 2-node test scenario with "mpdcheck
-s" and "mpdcheck -c" is working. The two nodes, acting as server and client
respectively, can communicate with each other without problems. I have
also tested mpdboot with the 2-node scenario and the ring of mpd is
working.<br><br>After this troubleshooting, I successfully created
a ring of mpd across all 6 compute nodes, and "mpdtrace -l" lists all
6 nodes. However, when I try to run a job with mpiexec, it gives me the
following error:<br><br>[enming@enming-f11-pv-hpc-node0001 ~]$ mpiexec -n 2
examples/cpi<br>mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd
from mpd when expecting ack of request<br><br>I have also tried starting the
mpd ring with the root user but I still encounter the same error
above.<br><br>Thank you.<br><br>PS. config.log is also attached.<br clear="all"></div></div></blockquote></div>
<br></div></div>
<br></blockquote></div><br><br clear="all"><br><br>
</blockquote></div><br><br clear="all"><br><br>