<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>RE: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2> Hi,<BR>
I tried the allreduce test in IMB on a 64-bit machine with two dual-core processors (4 cores in total) and did not get any errors.<BR>
Could you try the following?<BR>
<BR>
# Disable hyper-threading.<BR>
# Download the latest version of MPICH2 (available at <A HREF="http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads">http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads</A>). Uninstall any existing version of MPICH2 on your system and install the downloaded version.<BR>
# Remove any modifications that you made to IMB; it would be best to use a fresh download of IMB.<BR>
# Recompile IMB. (Note that you should link your application with mpi.lib, NOT mpich2.lib.)<BR>
# Rerun the allreduce benchmark on the local machine: "mpiexec -n 2 imb-mpi1.exe allreduce".<BR>
<BR>
Let us know the results.<BR>
<BR>
Regards,<BR>
Jayesh<BR>
<BR>
-----Original Message-----<BR>
From: Calin Iaru [<A HREF="mailto:calin@dolphinics.com">mailto:calin@dolphinics.com</A>]<BR>
Sent: Friday, May 30, 2008 11:58 AM<BR>
To: Jayesh Krishna<BR>
Cc: mpich-discuss@mcs.anl.gov<BR>
Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce<BR>
<BR>
IMB is compiled from the Visual Studio 2003 command prompt by running "nmake -f make_ict_win" and is linked against Program Files\mpich2\lib\mpich2.lib. The machine where it runs is also the build machine; it has 2 CPUs with hyper-threading enabled. I ran it both across the two CPUs and on a single CPU, by adding a SetProcessAffinityMask call before MPI_Init.<BR>
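The following is a minimal sketch of how such an affinity mask is formed. It assumes the Win32 convention that bit i of the mask selects logical processor i. The helper names affinity_mask and pin_current_process are hypothetical. The Windows-only branch calls the real kernel32 functions GetCurrentProcess and SetProcessAffinityMask through Python's ctypes. With hyper-threading enabled, two logical processors can share one physical core, so check on the actual machine which mask bits correspond to "the same CPU".<BR>

```python
import ctypes
import sys


def affinity_mask(logical_cpus):
    """Hypothetical helper: bit i set means logical processor i is allowed."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError("logical CPU index must be non-negative")
        mask |= 1 << cpu
    return mask


def pin_current_process(logical_cpus):
    """Hypothetical helper mirroring a SetProcessAffinityMask call made before
    MPI_Init. Windows only; returns False on other platforms or on failure."""
    if sys.platform != "win32":
        return False
    from ctypes import wintypes  # imported here: only needed on Windows

    kernel32 = ctypes.windll.kernel32
    kernel32.GetCurrentProcess.restype = wintypes.HANDLE
    # DWORD_PTR is a pointer-sized unsigned integer, so c_size_t matches it.
    kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
    kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
    handle = kernel32.GetCurrentProcess()
    return bool(kernel32.SetProcessAffinityMask(handle, affinity_mask(logical_cpus)))
```

For example, affinity_mask([0]) is 1 (logical processor 0 only) and affinity_mask([0, 2]) is 5.<BR>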
<BR>
I added some diagnostic output, such as the hexadecimal representation of the expected value and the hexadecimal representation of the difference between the expected and the received value.<BR>
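The following sketch shows that kind of diagnostic. It assumes the reduction buffers hold IEEE 754 single-precision values (MPI_FLOAT, as in the benchmark header). It reads "difference" two ways: as the XOR of the two bit patterns (which bits changed) and as the bit pattern of the arithmetic difference. The function names are hypothetical and are not part of IMB.<BR>

```python
import struct


def float32_bits(x):
    """32-bit IEEE 754 pattern of x after rounding it to single precision."""
    return struct.unpack("!I", struct.pack("!f", x))[0]


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("!f", struct.pack("!f", x))[0]


def describe_mismatch(expected, received):
    """Hypothetical diagnostic: values, hex patterns, XOR and difference."""
    e_bits = float32_bits(expected)
    r_bits = float32_bits(received)
    # Difference computed in single precision, like the MPI_FLOAT buffers.
    diff = to_float32(to_float32(expected) - to_float32(received))
    return (
        f"expected {expected:f} (0x{e_bits:08X}), "
        f"received {received:f} (0x{r_bits:08X}), "
        f"xor 0x{e_bits ^ r_bits:08X}, "
        f"difference {diff:g} (0x{float32_bits(diff):08X})"
    )
```

For instance, describe_mismatch(2.0, 1.0) reports the patterns 0x40000000 and 0x3F800000 and an XOR of 0x7F800000. The value 2.3 (the buffer entry in the error output) is 0x40133333 in single precision.<BR>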
<BR>
<BR>
<BR>
Jayesh Krishna wrote:<BR>
><BR>
> Hi,<BR>
> Please provide as many details as possible so that we can help<BR>
> with your problem. (I am not able to reproduce the error in our lab; I<BR>
> tried allreduce with 16 procs, and reduce and reduce_scatter with 2<BR>
> procs, on an x86 WinXP machine with 1 processor.)<BR>
><BR>
> # Make sure that you compile the IMB 3.1 suite on your local machine<BR>
> (don't run an executable built on another machine; this helps narrow<BR>
> down the problem).<BR>
> # Run your job as "mpiexec -n 2 imb-mpi1.exe allreduce".<BR>
> # Are you running your tests on a multi-core machine?<BR>
><BR>
> Once again, please provide as many details as possible in your reply.<BR>
><BR>
> Regards,<BR>
> Jayesh<BR>
><BR>
> -----Original Message-----<BR>
> From: Calin Iaru [<A HREF="mailto:calin@dolphinics.com">mailto:calin@dolphinics.com</A>]<BR>
> Sent: Friday, May 30, 2008 9:48 AM<BR>
> To: Jayesh Krishna<BR>
> Cc: mpich-discuss@mcs.anl.gov<BR>
> Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce<BR>
><BR>
> 1) The job crashes on a single machine with -n 2, in the same benchmarks:<BR>
> Allreduce, Reduce and Reduce_scatter. The jobs run on Win32 only.<BR>
><BR>
> Jayesh Krishna wrote:<BR>
> ><BR>
> > Hi,<BR>
> > Any input on the other points that I mentioned in my previous email?<BR>
> ><BR>
> > Regards,<BR>
> > Jayesh<BR>
> ><BR>
> > -----Original Message-----<BR>
> > From: Calin Iaru [<A HREF="mailto:calin@dolphinics.com">mailto:calin@dolphinics.com</A>]<BR>
> > Sent: Friday, May 30, 2008 8:17 AM<BR>
> > To: Jayesh Krishna<BR>
> > Cc: mpich-discuss@mcs.anl.gov<BR>
> > Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce<BR>
> ><BR>
> > Hi Jayesh,<BR>
> ><BR>
> > besides Allreduce, Reduce and Reduce_scatter also fail.<BR>
> ><BR>
> > Best regards,<BR>
> > Calin<BR>
> ><BR>
> > Jayesh Krishna wrote:<BR>
> > ><BR>
> > > Hi,<BR>
> > > I tried running the IMB 3.1 suite for allreduce on a single<BR>
> > > machine with up to 8 procs and did not get any errors.<BR>
> > ><BR>
> > > 1) Make sure that node-1 and node-2 have the same data model<BR>
> > > (data type representation). Please note that MPICH2 currently does<BR>
> > > not support heterogeneous systems (with respect to the data models used by the<BR>
> > > machines; for example, you cannot run MPI procs across x86 and x64<BR>
> > > machines). If you need to run your program on a heterogeneous<BR>
> > > system, please use MPICH1 instead.<BR>
> > ><BR>
> > > 2) Try running the benchmark on a single node/host (mpiexec -n 2<BR>
> > > imb-mpi1.exe allreduce) and let us know the results.<BR>
> > > 3) Are you able to run other tests in the IMB 3.1 suite?<BR>
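One quick way to compare data models on node-1 and node-2 is to check the sizes of int, long and pointers on each host. The sketch below assumes a Python interpreter is available on each node, and its function names are hypothetical. ctypes reports the sizes for the build of the interpreter itself; a 32-bit Python on an x64 host reports 32-bit sizes. What ultimately has to match is the build target of the MPI executables.<BR>

```python
import ctypes


def classify_data_model(int_size, long_size, pointer_size):
    """Hypothetical helper: name the C data model from sizeof(int),
    sizeof(long) and sizeof(void *), all in bytes."""
    models = {
        (4, 4, 4): "ILP32",  # 32-bit Windows and 32-bit Unix
        (4, 4, 8): "LLP64",  # 64-bit Windows
        (4, 8, 8): "LP64",   # most 64-bit Unix-like systems
    }
    return models.get((int_size, long_size, pointer_size), "unknown")


def local_data_model():
    """Data model of the running Python build (not necessarily of the CPU)."""
    return classify_data_model(
        ctypes.sizeof(ctypes.c_int),
        ctypes.sizeof(ctypes.c_long),
        ctypes.sizeof(ctypes.c_void_p),
    )


if __name__ == "__main__":
    print(local_data_model())
```

A 32-bit Windows build prints ILP32 and a 64-bit Windows build prints LLP64. Run on both nodes, the two results would need to match.<BR>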
> > ><BR>
> > > Regards,<BR>
> > > Jayesh<BR>
> > ><BR>
> > > -----Original Message-----<BR>
> > > From: owner-mpich-discuss@mcs.anl.gov<BR>
> > > [<A HREF="mailto:owner-mpich-discuss@mcs.anl.gov">mailto:owner-mpich-discuss@mcs.anl.gov</A>] On Behalf Of Calin Iaru<BR>
> > > Sent: Monday, May 26, 2008 5:50 AM<BR>
> > > To: mpich-discuss@mcs.anl.gov<BR>
> > > Subject: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce<BR>
> > ><BR>
> > > The problem is that the latest mpich2 in combination with IMB 3.1<BR>
> > > generates a data corruption error when running on 2 nodes. IMB was<BR>
> > > compiled with the CHECK flag and TOL set to 0 inside IMB_declare.h.<BR>
> > > I am not sure if this is a transport error or a verification<BR>
> > > error; it could be that the problem lies in the application code.<BR>
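A minimal sketch can illustrate the role of TOL. It assumes, without reference to IMB's source, that TOL bounds the relative error between the expected and the received value; the function names are hypothetical. With TOL set to 0, only bit-exact results pass. A reduction that adds three or more MPI_FLOAT contributions in a different order than the checker could therefore be flagged even without data corruption, because single-precision addition is not associative. With only two contributions, a + b and b + a give the same result in IEEE 754 arithmetic.<BR>

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("!f", struct.pack("!f", x))[0]


def sum_float32(values):
    """Left-to-right sum with every intermediate rounded to single precision."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


def within_tolerance(expected, received, tol):
    """Hypothetical check. Assumption: tol is a relative error bound, so
    tol == 0 accepts only exactly equal values."""
    if tol < 0:
        raise ValueError("tolerance must be non-negative")
    if expected == received:
        return True
    scale = max(abs(expected), abs(received))
    return abs(expected - received) <= tol * scale
```

For example, sum_float32([1e8, -1e8, 1.0]) returns 1.0 while sum_float32([1.0, -1e8, 1e8]) returns 0.0, and within_tolerance(1.0, 0.0, 0) is False.<BR>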
> > ><BR>
> > > E:\Program Files\MPICH2\bin>mpiexec.exe -hosts 2 node-1 node-2 \\node-1\e$\imb-mpi1.exe allreduce<BR>
> > > #---------------------------------------------------<BR>
> > > # Intel (R) MPI Benchmark Suite V3.1, MPI-1 part<BR>
> > > #---------------------------------------------------<BR>
> > > # Date : Fri May 23 14:44:12 2008<BR>
> > > # Machine : x86 Family 15 Model 4 Stepping 1, GenuineIntel<BR>
> > > # System : Windows 2003<BR>
> > > # Release : 5.2.3790<BR>
> > > # Version : Service Pack 1<BR>
> > > # MPI Version : 2.0<BR>
> > > # MPI Thread Environment: MPI_THREAD_SINGLE<BR>
> > ><BR>
> > ><BR>
> > ><BR>
> > > # Calling sequence was:<BR>
> > ><BR>
> > > # \\node-1\e$\imb-mpi1.exe allreduce<BR>
> > ><BR>
> > > # Minimum message length in bytes: 0<BR>
> > > # Maximum message length in bytes: 4194304<BR>
> > > #<BR>
> > > # MPI_Datatype : MPI_BYTE<BR>
> > > # MPI_Datatype for reductions : MPI_FLOAT<BR>
> > > # MPI_Op : MPI_SUM<BR>
> > > #<BR>
> > > #<BR>
> > ><BR>
> > > # List of Benchmarks to run:<BR>
> > ><BR>
> > > # Allreduce<BR>
> > ><BR>
> > > #-----------------------------------------------------------------------------<BR>
> > > # Benchmarking Allreduce<BR>
> > > # #processes = 2<BR>
> > > #-----------------------------------------------------------------------------<BR>
> > > #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] defects<BR>
> > > 0 1000 0.51 0.52 0.51 0.00<BR>
> > > 4 1000 80.30 80.35 80.33 0.00<BR>
> > > 1: Error Allreduce, size = 8, sample #0 Process 1: Got invalid buffer:<BR>
> > > Buffer entry: 2.300000<BR>
> > > 0: Error Allreduce, size = 8, sample #0 Process 0: Got invalid buffer:<BR>
> > > Buffer entry: 2.300000<BR>
> > ><BR>
> > ><BR>
> ><BR>
><BR>
><BR>
<BR>
</FONT>
</P>
</BODY>
</HTML>