[mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce

Jayesh Krishna jayesh at mcs.anl.gov
Fri May 30 10:45:17 CDT 2008


 Hi,
  Please provide as much detail as possible so that we can help with
your problem. I am not able to reproduce the error in our lab: I tried
allreduce with 16 procs, reduce with 2 procs, and reduce_scatter with
2 procs, on an x86 WinXP machine with 1 proc.

# Make sure that you compile the IMB 3.1 suite on your local machine
(don't run an executable created on another machine), to help narrow
down the problem.
# Run your job as "mpiexec -n 2 imb-mpi1.exe allreduce".
# Are you running your tests on a multi-core machine?

  Once again, please provide as much detail as possible in your reply.

Regards,
Jayesh
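
The advice above about building IMB locally, together with the data-model question raised in the quoted reply below, can be spot-checked on each host. The following is a minimal sketch, not an MPICH2 tool: the function name data_model_summary is hypothetical, and the values describe the Python interpreter's build on that host, which may differ from how imb-mpi1.exe was compiled.

```python
import platform
import struct
import sys


def data_model_summary():
    """Report basic data-model properties of this interpreter on this host.

    Assumption: these values are only a proxy. They describe the Python
    build, not the IMB executable or the MPICH2 libraries.
    """
    return {
        "machine": platform.machine(),              # architecture string reported by the OS
        "pointer_bits": struct.calcsize("P") * 8,   # size of a C pointer (void *)
        "long_bits": struct.calcsize("l") * 8,      # size of a C long
        "float_bits": struct.calcsize("f") * 8,     # size of a C float
        "byte_order": sys.byteorder,                # "little" or "big"
    }


if __name__ == "__main__":
    # Print one "key: value" line per property so outputs can be diffed.
    for key, value in data_model_summary().items():
        print(f"{key}: {value}")
```

Running the script on node-1 and node-2 and comparing the output line by line gives a quick indication of whether the hosts differ in pointer size or byte order.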

-----Original Message-----
From: Calin Iaru [mailto:calin at dolphinics.com] 
Sent: Friday, May 30, 2008 9:48 AM
To: Jayesh Krishna
Cc: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce

1) the job crashes on one machine with -n 2 at the same transfers: 
Allreduce, Reduce and Reduce_scatter. Jobs are running on Win32 only.

Jayesh Krishna wrote:
>
>  Hi,
>   Any inputs on the other points that I mentioned in my previous email?
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: Calin Iaru [mailto:calin at dolphinics.com]
> Sent: Friday, May 30, 2008 8:17 AM
> To: Jayesh Krishna
> Cc: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
>
> Hi Jayesh,
>
>     besides Allreduce, Reduce and Reduce_scatter also fail.
>
> Best regards,
>     Calin
>
> Jayesh Krishna wrote:
> >
> >  Hi,
> >   I tried running the IMB 3.1 suite for allreduce on a single 
> > machine with up to 8 procs and did not get any errors.
> > 
> > 1) Make sure that both node-1 & node-2 have the same data model 
> > (data type representation). Please note that MPICH2 currently does 
> > not support heterogeneous systems with respect to the data models 
> > used by the machines (e.g., you cannot run MPI procs across x86 and 
> > x64 machines). If you need to run your program across a 
> > heterogeneous system, please use MPICH1 instead.
> >
> > 2) Try running the benchmark on a single node/host (mpiexec -n 2 
> > imb-mpi1.exe allreduce) and let us know the results.
> > 3) Are you able to run other tests in the IMB 3.1 suite?
> >
> > Regards,
> > Jayesh
> >
> > -----Original Message-----
> > From: owner-mpich-discuss at mcs.anl.gov 
> > [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Calin Iaru
> > Sent: Monday, May 26, 2008 5:50 AM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: [mpich-discuss] IMB 3.1 with TOL 0 crashes on Allreduce
> >
> > The problem is that the latest mpich2 in combination with IMB 3.1 
> > generates a data corruption error when running on 2 nodes. IMB was 
> > compiled with the CHECK flag and TOL set to 0 inside IMB_declare.h. 
> > I am not sure if this is a transport error or a verification error; 
> > it could be that the problem lies in the application code.
> >
> > E:\Program Files\MPICH2\bin>mpiexec.exe -hosts 2 node-1 node-2 
> > \\node-1\e$\imb-mpi1.exe allreduce
> > #---------------------------------------------------
> > #    Intel (R) MPI Benchmark Suite V3.1, MPI-1 part
> > #---------------------------------------------------
> > # Date                  : Fri May 23 14:44:12 2008
> > # Machine               : x86 Family 15 Model 4 Stepping 1, GenuineIntel
> > # System                : Windows 2003
> > # Release               : 5.2.3790
> > # Version               : Service Pack 1
> > # MPI Version           : 2.0
> > # MPI Thread Environment: MPI_THREAD_SINGLE
> >
> >
> >
> > # Calling sequence was:
> >
> > # \\node-1\e$\imb-mpi1.exe allreduce
> >
> > # Minimum message length in bytes:   0
> > # Maximum message length in bytes:   4194304
> > #
> > # MPI_Datatype                   :   MPI_BYTE
> > # MPI_Datatype for reductions    :   MPI_FLOAT
> > # MPI_Op                         :   MPI_SUM
> > #
> > #
> >
> > # List of Benchmarks to run:
> >
> > # Allreduce
> >
> > #-----------------------------------------------------------------------------
> > # Benchmarking Allreduce
> > # #processes = 2
> > #-----------------------------------------------------------------------------
> >        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
> >             0         1000         0.51         0.52         0.51         0.00
> >             4         1000        80.30        80.35        80.33         0.00
> > 1: Error Allreduce, size = 8, sample #0 Process 1: Got invalid buffer:
> > Buffer entry: 2.300000
> > 0: Error Allreduce, size = 8, sample #0 Process 0: Got invalid buffer:
> > Buffer entry: 2.300000
> >
> >
>
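
Note on the TOL 0 setting mentioned in the original report: the sketch below is not taken from IMB and does not diagnose the failure above. It only illustrates, under stated assumptions, why an exact-equality check on single-precision values can flag a mismatch even when no bytes were corrupted in transport. The helper names (to_f32, f32_sum, matches) are hypothetical, and Python's struct module is used to emulate 32-bit float rounding.

```python
import struct


def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def f32_sum(values):
    """Sum left to right, rounding to 32-bit float after every addition."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total


def matches(got, expected, tol):
    """Hypothetical check: relative comparison; tol == 0 demands exact equality."""
    return abs(got - expected) <= tol * abs(expected)


# 2.3 has no exact 32-bit representation, so a value stored as a float
# differs from the same constant held at higher precision.
stored = to_f32(2.3)
print(matches(stored, 2.3, 0.0))    # False: exact comparison fails
print(matches(stored, 2.3, 1e-6))   # True: a small tolerance accepts it

# Float addition is not associative, so the order of a reduction can
# change the result when more than two operands are involved.
print(f32_sum([1e8, -1e8, 1.0]))    # 1.0
print(f32_sum([1e8, 1.0, -1e8]))    # 0.0
```

If a check of this kind passes with a small nonzero tolerance but fails with zero, that would lean toward a verification issue rather than a transport issue. This is a heuristic only, not a conclusion about the run shown in this thread.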

