[mpich-discuss] Regarding MPICH2-1.1.1p1 testing basing on open-mx
Brice Goglin
Brice.Goglin at inria.fr
Mon Mar 29 06:34:47 CDT 2010
We cannot reproduce any problem here with 1G NICs (bnx2) in MTU 1500 or
9000 using Open-MX 1.2.1 and MPICH2 1.1.1p1 or a recent SVN snapshot.
Such unreproducible problems could be related to your hardware and/or
fabric. For instance, if packets are somehow reordered, it could reveal
a bug somewhere, we never know.
Could you try configuring Open-MX 1.2.1 with --enable-debug? If that
doesn't catch any problem, also try setting OMX_DEBUG_CHECKSUM=1 in the
environment. It enables extensive corruption detection (and slows
communication down a lot).
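A debug rebuild and checksum run could look roughly like this (the source
directory and install prefix below are assumptions; adjust them to your
setup):

```shell
# Rebuild Open-MX with internal debugging checks enabled
# (paths are illustrative, not the only valid layout).
cd open-mx-1.2.1
./configure --prefix=/opt/open-mx --enable-debug
make
make install

# Re-run the failing benchmark with extensive corruption detection.
# Note: OMX_DEBUG_CHECKSUM=1 slows communication down considerably.
OMX_DEBUG_CHECKSUM=1 mpiexec -n 4 \
    /usr/lib64/mpich2/bin/mpitests-IMB-EXT Unidir_Get
```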
Also, it may help if you could look at the counters on all nodes before
and after one failing run. Just run omx_counters -c (as root) on all
nodes to clear the current counter values, then run Unidir_Get or
Bidir_Get, and save the output of omx_counters on all nodes after the run.
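The counter capture on each node could be scripted roughly as follows
(the output path is a hypothetical choice):

```shell
# Run as root on every node involved in the test.
omx_counters -c                # clear the current counter values

# ... now run Unidir_Get or Bidir_Get from the launch node ...

# Save the post-run counters, one file per node, for comparison.
omx_counters > /tmp/omx_counters.$(hostname).txt
```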
Finally, I assume that you don't need 4 nodes to reproduce the problem?
Unidir_Get and Bidir_Get seem to fail when only 2 nodes are involved
(and the other 2 processes are waiting in Barrier), right?
thanks,
Brice
李俊丽 wrote:
> Simple tests like cpi work well. The mpitests-IMB suite also works
> fine, except for Unidir_Get and Bidir_Get.
>
>
> Lily
>
> 2010/3/20 Dave Goodell <goodell at mcs.anl.gov <mailto:goodell at mcs.anl.gov>>
>
> I don't think that we have tested OpenMX with the mx netmod, so
> I'm not sure if there are any bugs there. I've CCed the primary
> developers of both OpenMX and our mx netmod in case they have any
> information on this.
>
> Do simpler tests work? The "examples/cpi" program in your MPICH2
> build directory is a good simple sanity test.
>
> -Dave
>
>
> On Mar 19, 2010, at 3:31 AM, 李俊丽 wrote:
>
> Hello,
>
> Just do:
> ./configure --with-device=ch3:nemesis:mx \
>     --with-mx-lib=/opt/open-mx/lib/ \
>     --with-mx-include=/opt/open-mx/include/
>
> make
>
> make install
>
> Then, I start the open-mx service and test MPICH2 over Open-MX.
>
>
>
> [root at cu02 ~]# mpiexec -n 4
> /usr/lib64/mpich2/bin/mpitests-IMB-EXT Unidir_Get
>
>
>
> It fails with this error message:
>
> rank 0 in job 8 cu02.hpc.com_54277 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
> The same error occurs with "mpiexec -n 4
> /usr/lib64/mpich2/bin/mpitests-IMB-EXT Bidir_Get".
>
> Is there any way to solve this problem?
>
> Thanks!
>
> Regards
>
> Lily
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov <mailto:mpich-discuss at mcs.anl.gov>
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>