[mpich-discuss] Regarding MPICH2-1.1.1p1 testing basing on open-mx

Brice Goglin Brice.Goglin at inria.fr
Mon Mar 29 06:34:47 CDT 2010


We cannot reproduce any problem here with 1G NICs (bnx2) in MTU 1500 or
9000 using Open-MX 1.2.1 and MPICH2 1.1.1p1 or a recent SVN snapshot.
Problems that we cannot reproduce may be related to your hardware and/or
fabric. For instance, if packets are somehow reordered, that could expose
a bug somewhere.

Could you try configuring Open-MX 1.2.1 with --enable-debug? If it doesn't
catch any problem, also try setting OMX_DEBUG_CHECKSUM=1 in the
environment. This enables extensive corruption detection (and slows
down communication a lot).
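
A minimal sketch of those two steps follows. It only prints the commands
unless DRY_RUN=0 is set. The source directory (OMX_SRC) and the helper
names (run, rebuild_omx_debug, run_with_checksum) are assumptions made for
illustration; the prefix and the benchmark path are taken from the
commands quoted below. Whether mpiexec forwards the variable to remote
ranks depends on the process manager, so treat that part as an assumption
too.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
OMX_SRC=${OMX_SRC:-$HOME/open-mx-1.2.1}   # hypothetical source location
OMX_PREFIX=${OMX_PREFIX:-/opt/open-mx}    # prefix implied by the --with-mx-lib path in the thread
IMB_EXT=${IMB_EXT:-/usr/lib64/mpich2/bin/mpitests-IMB-EXT}
DRY_RUN=${DRY_RUN:-1}

# Print a command in dry-run mode, otherwise execute it in the current shell.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Reconfigure and reinstall Open-MX with its debug checks enabled.
rebuild_omx_debug() {
    run cd "$OMX_SRC" &&
    run ./configure --prefix="$OMX_PREFIX" --enable-debug &&
    run make &&
    run make install
}

# Run one IMB-EXT benchmark with Open-MX checksum debugging switched on.
# $1 = number of processes (default 4), $2 = benchmark (default Unidir_Get).
run_with_checksum() {
    nprocs=${1:-4}
    bench=${2:-Unidir_Get}
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ OMX_DEBUG_CHECKSUM=1 mpiexec -n $nprocs $IMB_EXT $bench"
    else
        OMX_DEBUG_CHECKSUM=1 mpiexec -n "$nprocs" "$IMB_EXT" "$bench"
    fi
}
```

Calling rebuild_omx_debug and then run_with_checksum 4 Unidir_Get would
cover both suggestions. MPICH2 presumably needs to be rebuilt or at least
relinked against the debug library afterwards.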

Also, it may help if you could look at the counters on all nodes before
and after one run that fails. Just do omx_counters -c (as root) on
all nodes to clear the current counter values, then run Unidir_Get or
Bidir_Get, and save the output of omx_counters on all nodes after the run.
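
The counter procedure could be scripted roughly as below. This is a
sketch under several assumptions that are not from this thread: the node
names in NODES, passwordless root ssh to each node, omx_counters being in
root's PATH there, and the output directory. The helper names
(clear_counters, save_counters) are made up for illustration. By default it
only prints what it would do; set DRY_RUN=0 to execute.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
NODES=${NODES:-"cu01 cu02"}        # hypothetical host list, adjust to your cluster
OUTDIR=${OUTDIR:-./omx-counters}   # hypothetical directory for the saved counters
DRY_RUN=${DRY_RUN:-1}

# Clear the Open-MX counters on every node (needs root).
clear_counters() {
    for node in $NODES; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "+ ssh root@$node omx_counters -c"
        else
            ssh "root@$node" omx_counters -c
        fi
    done
}

# Save the counters of every node into one file per node.
save_counters() {
    for node in $NODES; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "+ ssh root@$node omx_counters > $OUTDIR/$node.txt"
        else
            mkdir -p "$OUTDIR"
            ssh "root@$node" omx_counters > "$OUTDIR/$node.txt"
        fi
    done
}
```

The intended order is clear_counters, then the failing mpiexec run, then
save_counters, so that the saved files only reflect that one run.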

Finally, I assume that you don't need 4 nodes to reproduce the problem?
Unidir_Get and Bidir_Get seem to fail when only 2 nodes are involved
(and the other 2 processes are waiting in Barrier), right?

thanks,
Brice




李俊丽 wrote:
> Simple tests like cpi work well. The mpitests-IMB tests also work fine,
> except for Unidir_Get and Bidir_Get.
>
>
> Lily
>
> 2010/3/20 Dave Goodell <goodell at mcs.anl.gov>
>
>     I don't think that we have tested OpenMX with the mx netmod, so
>     I'm not sure if there are any bugs there. I've CCed the primary
>     developers of both OpenMX and our mx netmod in case they have any
>     information on this.
>
>     Do simpler tests work? The "examples/cpi" program in your MPICH2
>     build directory is a good simple sanity test.
>
>     -Dave
>
>
>     On Mar 19, 2010, at 3:31 AM, 李俊丽 wrote:
>
>         Hello,
>
>         Just do:
>         ./configure --with-device=ch3:nemesis:mx
>         --with-mx-lib=/opt/open-mx/lib/
>         --with-mx-include=/opt/open-mx/include/
>
>         make
>
>         make install
>
>         Then I start the Open-MX service and test MPICH2 on top of Open-MX.
>
>
>
>         [root at cu02 ~]# mpiexec -n 4
>         /usr/lib64/mpich2/bin/mpitests-IMB-EXT Unidir_Get
>
>
>
>         It has this error message:
>
>         rank 0 in job 8 cu02.hpc.com_54277 caused collective abort of
>         all ranks exit status of rank 0: killed by signal 9
>
>         The same error occurs with "mpiexec -n 4
>         /usr/lib64/mpich2/bin/mpitests-IMB-EXT Bidir_Get"
>
>         Is there any way to solve this problem?
>
>         Thanks!
>
>         Regards
>
>         Lily
>         _______________________________________________
>         mpich-discuss mailing list
>         mpich-discuss at mcs.anl.gov
>         https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>


