[mpich-discuss] (mpiexec 392): no msg recvd from mpd when expecting ack of request

Mr. Teo En Ming (Zhang Enming) space.time.universe at gmail.com
Fri Oct 30 01:04:56 CDT 2009


Hi,

I have changed the communication method from nemesis (the high-performance
network method) to ssm (sockets between nodes, shared memory within a node)
by recompiling MPICH2. I have also pre-set the MAC address of the virtual
network adapter eth0 in each compute node (each compute node is a Xen
paravirtualized virtual machine) by configuring the vif directive in each
PV domU configuration file.
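
For reference, the rebuild went roughly like this (a minimal sketch; the
--prefix path is the install directory visible in the process listings
below, and --with-device=ch3:ssm is the standard MPICH2 1.x configure
syntax for selecting the ssm channel):

# sketch only -- other configure options omitted
./configure --prefix=/home/enming/mpich2-install --with-device=ch3:ssm
make
make install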

Additionally, I have turned off iptables to simplify troubleshooting and
allow communication between the mpd daemons on all nodes. Passwordless SSH
works between all the compute nodes.
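
Turning iptables off and verifying SSH went roughly like this on each
Fedora 11 node (a sketch; the hostname in the SSH check is just one of the
nodes listed below):

service iptables stop                      # stop the firewall now
chkconfig iptables off                     # keep it off across reboots
ssh enming-f11-pv-hpc-node0002 hostname    # must succeed with no password prompt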

After doing all of the above, I am still encountering the mpiexec 392
error:

mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from mpd when
expecting ack of request

=================================================

Master Node / Compute Node 1:

[enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
enming    1499  1455  0 21:44 pts/0    00:00:00 grep mpd
[enming at enming-f11-pv-hpc-node0001 ~]$ mpdboot -n 6
[enming at enming-f11-pv-hpc-node0001 ~]$ mpdtrace -l
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
[enming at enming-f11-pv-hpc-node0001 ~]$ ps -ef | grep mpd
enming    1505     1  0 21:44 ?        00:00:00 python2.6
/home/enming/mpich2-install/bin/mpd.py --ncpus=1 -e -d

Compute Node 2:

[enming at enming-f11-pv-hpc-node0002 ~]$ mpdtrace -l
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
[enming at enming-f11-pv-hpc-node0002 ~]$ ps -ef | grep mpd
enming    1431     1  0 21:44 ?        00:00:00 python2.6
/home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
34188 --ncpus=1 -e -d
enming    1481  1436  0 21:46 pts/0    00:00:00 grep mpd

Compute Node 3:

[enming at enming-f11-pv-hpc-node0003 ~]$ mpdtrace -l
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
[enming at enming-f11-pv-hpc-node0003 ~]$ ps -ef | grep mpd
enming    1422     1  0 21:44 ?        00:00:00 python2.6
/home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
34188 --ncpus=1 -e -d
enming    1473  1427  0 21:47 pts/0    00:00:00 grep mpd

Compute Node 4:

[enming at enming-f11-pv-hpc-node0004 ~]$ mpdtrace -l
enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
[enming at enming-f11-pv-hpc-node0004 ~]$ ps -ef | grep mpd
enming    1432     1  0 21:44 ?        00:00:00 python2.6
/home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
34188 --ncpus=1 -e -d
enming    1482  1437  0 21:47 pts/0    00:00:00 grep mpd

Compute Node 5:

[enming at enming-f11-pv-hpc-node0005 ~]$ mpdtrace -l
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
[enming at enming-f11-pv-hpc-node0005 ~]$ ps -ef | grep mpd
enming    1423     1  0 21:44 ?        00:00:00 python2.6
/home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0001 -p
34188 --ncpus=1 -e -d
enming    1475  1429  0 21:48 pts/0    00:00:00 grep mpd

Compute Node 6:

[enming at enming-f11-pv-hpc-node0006 ~]$ mpdtrace -l
enming-f11-pv-hpc-node0006_55525 (192.168.1.249)
enming-f11-pv-hpc-node0001_34188 (192.168.1.254)
enming-f11-pv-hpc-node0005_39315 (192.168.1.250)
enming-f11-pv-hpc-node0004_46914 (192.168.1.251)
enming-f11-pv-hpc-node0003_36478 (192.168.1.252)
enming-f11-pv-hpc-node0002_42012 (192.168.1.253)
[enming at enming-f11-pv-hpc-node0006 ~]$ ps -ef | grep mpd
enming    1427     1  0 21:44 ?        00:00:00 python2.6
/home/enming/mpich2-install/bin/mpd.py -h enming-f11-pv-hpc-node0002 -p
42012 --ncpus=1 -e -d
enming    1477  1432  0 21:49 pts/0    00:00:00 grep mpd

=================================================

Should I increase the value of MPIEXEC_RECV_TIMEOUT in the mpiexec.py file,
or should I change the communication method to sock?

The mpiexec 392 error says no msg recvd from mpd when expecting ack of
request, so my suspicion is that the acknowledgement takes a very long time
to arrive while the MPIEXEC_RECV_TIMEOUT value is too low, and that this is
what triggers the error in my case. I am using a virtual network adapter,
not a physical Gigabit network adapter.
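
If raising the timeout is the right fix, the change would look roughly like
this (a sketch; the exact variable name and default value may differ
between MPICH2 releases, so inspect the source first):

# find the timeout used while waiting for the mpd ack
# (name/default may vary by MPICH2 release)
grep -n MPIEXEC_RECV_TIMEOUT /home/enming/mpich2-install/bin/mpiexec.py
# raise the value found above, e.g. to 60 or 120 seconds,
# then retry a small job:
mpiexec -n 2 examples/cpi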

=================================================

[root at enming-f11-pv-hpc-node0001 ~]# cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Pentium(R) Dual-Core  CPU      E6300  @ 2.80GHz
stepping    : 10
cpu MHz        : 2800.098
cache size    : 2048 KB
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss
ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm
bogomips    : 5600.19
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 1
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Pentium(R) Dual-Core  CPU      E6300  @ 2.80GHz
stepping    : 10
cpu MHz        : 2800.098
cache size    : 2048 KB
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ss
ht syscall nx lm constant_tsc rep_good pni ssse3 cx16 hypervisor lahf_lm
bogomips    : 5600.19
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

[root at enming-f11-pv-hpc-node0001 ~]# cat /proc/meminfo
MemTotal:         532796 kB
MemFree:          386156 kB
Buffers:           12904 kB
Cached:            48864 kB
SwapCached:            0 kB
Active:            34884 kB
Inactive:          43252 kB
Active(anon):      16504 kB
Inactive(anon):        0 kB
Active(file):      18380 kB
Inactive(file):    43252 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2195448 kB
SwapFree:        2195448 kB
Dirty:                12 kB
Writeback:             0 kB
AnonPages:         16444 kB
Mapped:             8864 kB
Slab:              10528 kB
SReclaimable:       4668 kB
SUnreclaim:         5860 kB
PageTables:         2996 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2461844 kB
Committed_AS:      73024 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        6332 kB
VmallocChunk:   34359724899 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      524288 kB
DirectMap2M:           0 kB
[root at enming-f11-pv-hpc-node0001 ~]# lspci -v
[root at enming-f11-pv-hpc-node0001 ~]# lsusb
[root at enming-f11-pv-hpc-node0001 ~]# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:16:3E:69:E9:11
          inet addr:192.168.1.254  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::216:3eff:fe69:e911/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5518 errors:26 dropped:0 overruns:0 frame:0
          TX packets:4832 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:872864 (852.4 KiB)  TX bytes:3972981 (3.7 MiB)
          Interrupt:17

[root at enming-f11-pv-hpc-node0001 ~]# ethtool eth0
Settings for eth0:
    Link detected: yes
[root at enming-f11-pv-hpc-node0001 ~]# netstat -i
Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0     5589     26      0      0     4875      0      0      0 BMRU
lo        16436   0      127      0      0      0      127      0      0      0 LRU
[root at enming-f11-pv-hpc-node0001 ~]# uname -a
Linux enming-f11-pv-hpc-node0001 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27
17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root at enming-f11-pv-hpc-node0001 ~]# cat /etc/redhat-release
Fedora release 11 (Leonidas)

=================================================

Please advise.

Thank you.

-- 
Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)(Mechanical
Engineering)
Alma Maters:
(1) Singapore Polytechnic
(2) National University of Singapore
My blog URL: http://teo-en-ming-aka-zhang-enming.blogspot.com
My Youtube videos: http://www.youtube.com/user/enmingteo
Email: space.time.universe at gmail.com
MSN: teoenming at hotmail.com
Mobile Phone (SingTel): +65-9648-9798
Mobile Phone (Starhub Prepaid): +65-8369-2618
Age: 31 (as at 30 Oct 2009)
Height: 1.78 meters
Race: Chinese
Dialect: Hokkien
Street: Bedok Reservoir Road
Country: Singapore

On Fri, Oct 30, 2009 at 11:55 AM, Mr. Teo En Ming (Zhang Enming)
<space.time.universe at gmail.com> wrote:

> Hi,
>
> I am getting the same mpiexec 392 error message as Kenneth Yoshimoto from
> the San Diego Supercomputer Center. His mpich-discuss mailing list topic URL
> is
> http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005882.html
>
> I have already performed the 2-node mpdcheck utility test described in
> Appendix A.1 of the MPICH2 installation guide. I could also successfully
> start the mpd ring in the 2-node test scenario using mpdboot.
>
> 薛正华 (ID: zhxue123) from China reported solving the mpiexec 392 error.
> According to him, the cause of the error was the absence of a
> high-performance network in his environment. He changed the default
> communication method from nemesis to ssm and also increased the value of
> MPIEXEC_RECV_TIMEOUT in the mpiexec.py Python source code. The URL of his
> report is http://blog.csdn.net/zhxue123/archive/2009/08/22/4473089.aspx
>
> Could this be my problem also?
>
> Thank you.
>
>
> On Fri, Oct 30, 2009 at 11:09 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>
>>  You need to do the mpdcheck tests with every pair of compute nodes. Or
>> to isolate the problem, try running on a smaller set of nodes first and
>> increase it one at a time until it fails.
>>
>> Rajeev
>>
>>
>> ------------------------------
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Mr. Teo En Ming
>> (Zhang Enming)
>> Sent: Thursday, October 29, 2009 2:35 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: [mpich-discuss] (mpiexec 392): no msg recvd from mpd when
>> expecting ack of request
>>
>> Hi,
>>
>> I have just installed MPICH2 in my Xen-based virtual machines.
>>
>> My hardware configuration is as follows:
>>
>> Processor: Intel Pentium Dual Core E6300 @ 2.8 GHz
>> Motherboard: Intel Desktop Board DQ45CB BIOS 0093
>> Memory: 4X 2GB Kingston DDR2-800 CL5
>>
>> My software configuration is as follows:
>>
>> Xen Hypervisor / Virtual Machine Monitor Version: 3.5-unstable
>> Jeremy Fitzhardinge's pv-ops dom0 kernel: 2.6.31.4
>> Host Operating System: Fedora Linux 11 x86-64 (SELinux disabled)
>> Guest Operating Systems: Fedora Linux 11 x86-64 paravirtualized (PV) domU
>> guests (SELinux disabled)
>>
>> I have successfully configured, built, and installed MPICH2 on an F11 PV
>> guest OS serving as master compute node 1, which also runs an NFS server
>> (the MPICH2 bin subdirectory is exported). The other 5 compute nodes
>> access the MPICH2 binaries by mounting the NFS share from node 1. Please
>> see the attached c.txt, m.txt and mi.txt. With Xen virtualization, I have
>> created 6 F11 Linux PV guests to simulate 6 HPC compute nodes. The
>> network adapter (NIC) in each guest OS is virtual, and the Xen networking
>> type is bridged. Running "lspci -v" and lsusb in each guest OS does not
>> show anything.
>>
>> Following the Appendix A troubleshooting section of the MPICH2 install
>> guide, I have verified that the 2-node test scenario with "mpdcheck -s"
>> and "mpdcheck -c" works: the two nodes, each acting as server and client
>> in turn, can communicate with each other without problems. I have also
>> tested mpdboot in the 2-node scenario, and the mpd ring works.
>>
>> After the troubleshooting process, I successfully created a ring of mpd
>> involving all 6 compute nodes, and "mpdtrace -l" lists all 6 nodes.
>> However, when I try to run a job with mpiexec, I get the following error:
>>
>> [enming at enming-f11-pv-hpc-node0001 ~]$ mpiexec -n 2 examples/cpi
>> mpiexec_enming-f11-pv-hpc-node0001 (mpiexec 392): no msg recvd from mpd
>> when expecting ack of request
>>
>> I have also tried starting the mpd ring as the root user, but I still
>> encounter the same error.
>>
>> Thank you.
>>
>> PS. config.log is also attached.
>>