[mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
Jayesh Krishna
jayesh at mcs.anl.gov
Fri May 30 12:39:52 CDT 2008
Hi,
I tried the allreduce test in IMB on a 64-bit machine with 2 dual core
procs (total 4 cores) and did not get any errors.
Can you try the following?
# Disable hyperthreading
# Download the latest version of MPICH2 (available at
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads).
Uninstall any existing version of MPICH2 on your system and install the
downloaded version.
# Remove any modifications that you made to IMB. It would be best to use a
fresh download of IMB.
# Recompile IMB (Note that you should link your applications with mpi.lib
NOT mpich2.lib.)
# Rerun the allreduce benchmark (on the local machine - "mpiexec -n 2
imb-mpi1.exe allreduce")
Let us know the results.
Regards,
Jayesh
-----Original Message-----
From: Calin Iaru [mailto:calin at dolphinics.com]
Sent: Friday, May 30, 2008 11:58 AM
To: Jayesh Krishna
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
IMB is compiled from the Studio 2003 command prompt by launching "nmake -f
make_ict_win" and links to Program Files\mpich2\lib\mpich2.lib. The machine
where it runs has 2 CPUs with hyperthreading enabled and is also the build
machine. I ran it across both CPUs and also on a single CPU by adding a
SetProcessAffinityMask call before MPI_Init.
I added some information like the hexadecimal representation of the
expected value and the hexadecimal representation of the difference
between the expected and the arrived value.
Jayesh Krishna wrote:
>
> Hi,
> Please provide us with as many details as possible so that we can help
> with your problem (I am not able to reproduce the error in our lab. I
> tried allreduce - 16 procs, reduce - 2 procs, reduce_scatter - 2
> procs, on an x86 WinXP machine with 1 proc).
>
> # Make sure that you compile the IMB 3.1 suite on your local machine
> (don't run an executable created on another machine, to narrow
> down the problem)
> # Run your job as "mpiexec -n 2 imb-mpi1.exe allreduce"
> # Are you running your tests on a multi-core machine ?
>
> Once again, please provide as many details as possible in your reply.
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: Calin Iaru [mailto:calin at dolphinics.com]
> Sent: Friday, May 30, 2008 9:48 AM
> To: Jayesh Krishna
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
>
> 1) the job crashes on one machine with -n 2 at the same transfers:
> Allreduce, Reduce and Reduce_scatter. Jobs are running on Win32 only.
>
> Jayesh Krishna wrote:
> >
> > Hi,
> > Any input on the other points that I mentioned in my previous email?
> >
> > Regards,
> > Jayesh
> >
> > -----Original Message-----
> > From: Calin Iaru [mailto:calin at dolphinics.com]
> > Sent: Friday, May 30, 2008 8:17 AM
> > To: Jayesh Krishna
> > Cc: mpich-discuss at mcs.anl.gov
> > Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
> >
> > Hi Jayesh,
> >
> > Besides Allreduce, Reduce and Reduce_scatter also fail.
> >
> > Best regards,
> > Calin
> >
> > Jayesh Krishna wrote:
> > >
> > > Hi,
> > > I tried running the IMB 3.1 suite for allreduce on a single
> > > machine with up to 8 procs and did not get any errors.
> > >
> > > 1) Make sure that both node-1 & node-2 have the same data model
> > > (data type representation). Please note that MPICH2 currently does
> > > not support heterogeneous systems (with respect to the data models
> > > used by the machines; e.g., you cannot run MPI procs across x86 and
> > > x64 machines). If you need to run your program across a
> > > heterogeneous system, please use MPICH1 instead.
> > >
> > > 2) Try running the benchmark on a single node/host (mpiexec -n 2
> > > imb-mpi1.exe allreduce) and let us know the results.
> > > 3) Are you able to run other tests in the IMB 3.1 suite ?
> > >
> > > Regards,
> > > Jayesh
> > >
> > > -----Original Message-----
> > > From: owner-mpich-discuss at mcs.anl.gov
> > > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Calin Iaru
> > > Sent: Monday, May 26, 2008 5:50 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
> > >
> > > The problem is that the latest mpich2 in combination with IMB 3.1
> > > generates a data corruption error when running on 2 nodes. IMB was
> > > compiled with the CHECK flag and TOL set to 0 inside IMB_declare.h.
> > > I am not sure if this is a transport error or a verification
> > > error; it could be that the problem lies in the application code.
> > >
> > > E:\Program Files\MPICH2\bin>mpiexec.exe -hosts 2 node-1 node-2
> > > \\node-1\e$\imb-mpi1.exe allreduce
> > > #---------------------------------------------------
> > > # Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
> > > #---------------------------------------------------
> > > # Date : Fri May 23 14:44:12 2008
> > > > # Machine : x86 Family 15 Model 4 Stepping 1, GenuineIntel
> > > # System : Windows 2003
> > > # Release : 5.2.3790
> > > # Version : Service Pack 1
> > > # MPI Version : 2.0
> > > # MPI Thread Environment: MPI_THREAD_SINGLE
> > >
> > >
> > >
> > > # Calling sequence was:
> > >
> > > # \\node-1\e$\imb-mpi1.exe allreduce
> > >
> > > # Minimum message length in bytes: 0
> > > # Maximum message length in bytes: 4194304
> > > #
> > > # MPI_Datatype : MPI_BYTE
> > > # MPI_Datatype for reductions : MPI_FLOAT
> > > # MPI_Op : MPI_SUM
> > > #
> > > #
> > >
> > > # List of Benchmarks to run:
> > >
> > > # Allreduce
> > >
> > > #-----------------------------------------------------------------
> > > --
> > > --
> > > --------
> > > # Benchmarking Allreduce
> > > # #processes = 2
> > >
> > #-------------------------------------------------------------------
> > --
> > --------
> > > #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
> > > defects
> > > 0 1000 0.51 0.52
> > > 0.51 0.00
> > > 4 1000 80.30 80.35
> > > 80.33 0.00
> > > 1: Error Allreduce, size = 8, sample #0 Process 1: Got invalid
buffer:
> > > Buffer entry: 2.300000
> > > 0: Error Allreduce, size = 8, sample #0 Process 0: Got invalid
buffer:
> > > Buffer entry: 2.300000
> > >
> > >
> >
>
>