[mpich-discuss] mpich2 1.2 with channel ch3:nemesis:mx is broken

Audet, Martin Martin.Audet at imi.cnrc-nrc.gc.ca
Wed Oct 21 18:55:23 CDT 2009


Hi,

 I am sending this bug report via the mailing list because your bug reporting system is so painful to use (it joined lines together and rejected this report as spam).

Martin 



Hi MPICH2_Developers,

It seems that there are problems again with the ch3:nemesis:mx channel.

After submitting issue #892 (error while compiling mpich2-1.2 with ch3:nemesis:mx), I used svn to generate a patch containing the differences between the revision corresponding to the official 1.2 release and revision r5434 of the 1.2 branch. This patch allowed me to compile ch3:nemesis:mx with no problems.

However, if I try to run any parallel job using the ch3:nemesis:mx channel of mpich2-1.2r5434 with more than one process (i.e. np > 1), it either hangs or ends with an error message.

Note: There were no such problems with mpich2-1.1.1p1 using any device, or with mpich2-1.2r5434 using the ch3:nemesis or ch3:sock device.
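
To tell the two failure modes apart in a script (a run that never returns versus one that exits by itself), something like the following could be used. This is only a sketch: it assumes the GNU coreutils `timeout` command is available, and `classify_run` is a hypothetical helper name, not part of MPICH2. It reports only whether the command returned within the time limit and with which exit status; it does not inspect the error messages.

```shell
# classify_run SECONDS CMD [ARGS...]
# Runs CMD under a time limit and prints "hang" if the limit was hit,
# otherwise "exited with status N".
classify_run() {
    cr_limit="$1"
    shift
    cr_rc=0
    # GNU timeout kills the command after cr_limit seconds and then
    # returns exit status 124.
    timeout "$cr_limit" "$@" >/dev/null 2>&1 || cr_rc=$?
    if [ "$cr_rc" -eq 124 ]; then
        echo "hang"
    else
        echo "exited with status $cr_rc"
    fi
}
```

For example, `classify_run 60 mpiexec -n 2 ./a.out` would print "hang" if the job is still running after 60 seconds (the 60-second limit is an arbitrary choice).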

For your information, our "mc1, mc2, ..., mc12" cluster is a small cluster of dual quad-core Core 2 (x86_64) nodes running Fedora 7 Linux. Each node is equipped with a Myrinet 2000 F card using MX 1.2.9.

Now look at the following:

[audet at mc1 mpi]$ which mpicc
/usr/local/mpich2-ch3_nemesis_mx/bin/mpicc
[audet at mc1 mpi]$ mpich2version
MPICH2 Version:         1.2
MPICH2 Release date:    Unknown, built on Wed Oct 21 17:28:30 EDT 2009
MPICH2 Device:          ch3:nemesis
MPICH2 configure:       --with-device=ch3:nemesis:mx --prefix=/usr/local/mpich2-ch3_nemesis_mx-1.2r5434 --with-mx=/usr/local/mx --enable-fast --enable-romio --with-file-system=ufs+nfs --disable-cxx --disable-f90 --enable-sharedlibs=gcc
MPICH2 CC:      gcc  -DNDEBUG -O2
MPICH2 CXX:       -DNDEBUG
MPICH2 F77:     gfortran  -DNDEBUG -O2
MPICH2 F90:     gfortran  -DNDEBUG
[audet at mc1 mpi]$ ls -l /usr/local/mpich2-ch3_nemesis_mx
lrwxrwxrwx 1 publique mod 30 Oct 21 17:32 /usr/local/mpich2-ch3_nemesis_mx -> mpich2-ch3_nemesis_mx-1.2r5434
[audet at mc1 mpi]$ which mpicc
/usr/local/mpich2-ch3_nemesis_mx/bin/mpicc
[audet at mc1 mpi]$ mpicc empty.c
[audet at mc1 mpi]$ ldd ./a.out
        libmpich.so.1.2 => /usr/local/mpich2-ch3_nemesis_mx-1.2r5434/lib/libmpich.so.1.2 (0x00002aaaaaaad000)
        libmyriexpress.so => /usr/local/mx/lib64/libmyriexpress.so (0x00002aaaaae12000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000000338d400000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003395200000)
        libc.so.6 => /lib64/libc.so.6 (0x000000338c400000)
        /lib64/ld-linux-x86-64.so.2 (0x000000338b400000)
[audet at mc1 mpi]$ cat empty.c
#include <mpi.h>

int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);
   MPI_Finalize();

   return 0;
}
[audet at mc1 mpi]$ mpdtrace -l
mc1_41399 (172.17.10.101)
[audet at mc1 mpi]$ ./a.out
[audet at mc1 mpi]$ mpiexec -n 2 ./a.out
[cli_0]: write_line error; fd=3 buf=:cmd=get kvsname=kvs_mc1_41399_1_0 key=P1-businesscard
:
system msg for write_line failure : Bad file descriptor
[cli_0]: write_line error; fd=3 buf=:cmd=get kvsname=kvs_mc1_41399_1_0 key=P1-businesscard
:
system msg for write_line failure : Bad file descriptor
[cli_1]: write_line error; fd=3 buf=:cmd=get kvsname=kvs_mc1_41399_1_0 key=P0-businesscard
:
system msg for write_line failure : Bad file descriptor
[cli_1]: write_line error; fd=3 buf=:cmd=get kvsname=kvs_mc1_41399_1_0 key=P0-businesscard
:
system msg for write_line failure : Bad file descriptor
[audet at mc1 mpi]$
[audet at mc1 mpi]$

The above is the error output I get when I start two processes on the same machine.

If I start two processes on two nodes (one per node), the job freezes (i.e. it never finishes and I have to stop it with Ctrl-C). A gdb backtrace of a stuck process then gives the following (it is the same for both processes):

(gdb)
#0  0x00002aaaaab258e2 in MPIDI_CH3I_Progress () from /usr/local/mpich2-ch3_nemesis_mx-1.2r5434/lib/libmpich.so.1.2
#1  0x00002aaaaab50a4a in MPIC_Wait () from /usr/local/mpich2-ch3_nemesis_mx-1.2r5434/lib/libmpich.so.1.2
#2  0x00002aaaaab515ba in MPIC_Sendrecv () from /usr/local/mpich2-ch3_nemesis_mx-1.2r5434/lib/libmpich.so.1.2
#3  0x00002aaaaab1a449 in MPIR_Barrier () from /usr/local/mpich2-ch3_nemesis_mx-1.2r5434/lib/libmpich.so.1.2
#4  0x00002aaaaab1a843 in PMPI_Barrier () from /usr/local/mpich2-ch3_nemesis_mx-1.2r5434/lib/libmpich.so.1.2
#5  0x00002aaaaab58aec in MPID_Finalize () from /usr/local/mpich2-ch3_nemesis_mx-1.2r5434/lib/libmpich.so.1.2
#6  0x00002aaaaab479d6 in PMPI_Finalize () from /usr/local/mpich2-ch3_nemesis_mx-1.2r5434/lib/libmpich.so.1.2
#7  0x0000000000400679 in main ()
(gdb)                    
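
For reference, a backtrace like the one above can also be captured without an interactive session. The following is only a sketch, assuming gdb is installed on the node and the PID of the stuck process is known; `dump_backtrace` is a hypothetical helper name, and this is not necessarily how the trace above was obtained.

```shell
# dump_backtrace PID
# Attaches gdb to a running process, prints a backtrace of every
# thread, then detaches and exits.
dump_backtrace() {
    db_pid="$1"
    if [ -z "$db_pid" ]; then
        echo "usage: dump_backtrace PID" >&2
        return 2
    fi
    # -batch: exit after running the given commands
    # -p:     attach to the process with this PID
    # -ex:    run one gdb command
    gdb -batch -p "$db_pid" -ex "thread apply all bt"
}
```

On each node this could be invoked as `dump_backtrace "$(pgrep -n a.out)"`, assuming the stuck executable is the most recently started process named a.out.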

