[mpich-discuss] Problem regarding running the code
Pavan Balaji
balaji at mcs.anl.gov
Mon Feb 28 09:42:32 CST 2011
It looks like you are using an older version of MPICH2, which uses MPD
internally. Can you upgrade your version of MPICH2? Starting with
mpich2-1.3, we use Hydra, which integrates better with PBS environments.
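For example, with mpich2-1.3 or later, Hydra's mpiexec reads the node
list from the PBS allocation itself, so no mpd ring or machine file is
needed. A minimal sketch, assuming the executable name from your script
below:

  mpiexec -n 12 ./Multi_beta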
-- Pavan
On 02/28/2011 06:28 AM, Ashwinkumar Dobariya wrote:
> Hi Rajiv,
>
> Thanks for the reply.
> I first tried to compile the code using the wrapper command, that is:
>
> mpif90 -o prog prog.f90
>
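> As a quick check, the wrapper can report which MPI installation it
> builds against; assuming an MPICH2-based mpif90 (Open MPI-based wrappers
> such as bullxmpi's use -showme instead), the option is:
>
> mpif90 -show
>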
> then I submitted the script as below:
>
> #!/bin/bash
> #PBS -q compute
> #PBS -N test_job
> # Request 1 Node with 12 Processors
> #PBS -l nodes=1:ppn=12
> #PBS -l walltime=100:00:00
> #PBS -S /bin/bash
> #PBS -M your_email at lboro.ac.uk
> #PBS -m bae
> #PBS -A your_account12345
> #
> # Go to the directory from which you submitted the job
> cd $PBS_O_WORKDIR
>
> module load intel_compilers
> module load bullxmpi
>
> mpirun ./Multi_beta
>
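> One small change that may be worth trying, assuming the mpirun picked
> up here is bullxmpi's (Open MPI-based), is to request the process count
> explicitly instead of relying on a default:
>
> mpirun -np 12 ./Multi_beta
>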
> but I am still getting the same error, which is shown below:
>
> running mpdallexit on hydra127
> LAUNCHED mpd on hydra127 via
> RUNNING: mpd on hydra127
> Total Nb of PE: 1
>
> PE# 0 / 1 OK
> PE# 0 0 0 0
> PE# 0 0 33 0 165 0 65
> PE# 0 -1 1 -1 -1 -1 -1
> PE_Table, PE# 0 complete
> PE# 0 -0.03 0.98 -1.00 1.00 -0.03 1.97
> PE# 0 doesn t intersect any bloc
> PE# 0 will communicate with 0
> single value
> PE# 0 has 1 com. boundaries
> Data_Read, PE# 0 complete
>
> PE# 0 checking boundary type for
> 0 1 1 1 0 165 0 65 nor sur sur sur gra 1 0 0
> 0 2 33 33 0 165 0 65 EXC -> 1
> 0 3 0 33 1 1 0 65 sur nor sur sur gra 0 1 0
> 0 4 0 33 164 164 0 65 sur nor sur sur gra 0 -1 0
> 0 5 0 33 0 165 1 1 cyc cyc cyc sur cyc 0 0 1
> 0 6 0 33 0 165 64 64 cyc cyc cyc sur cyc 0 0 -1
> PE# 0 Set new
> PE# 0 FFT Table
> PE# 0 Coeff
> Fatal error in MPI_Send: Invalid rank, error stack:
> MPI_Send(176): MPI_Send(buf=0x7fff9425c388, count=1,
> MPI_DOUBLE_PRECISION, dest=1, tag=1, MPI_COMM_WORLD) failed
> MPI_Send(98).: Invalid rank has value 1 but must be nonnegative and less
> than 1
> rank 0 in job 1 hydra127_37620 caused collective abort of all ranks
> exit status of rank 0: return code 1
> I am struggling to find the error, but I am not sure where I messed up.
> If I run the other examples, they work fine.
>
> Thanks and Regards
>
> On Fri, Feb 25, 2011 at 4:26 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>
> For some reason, each process thinks the total number of processes
> in the parallel job is 1. Check the wrapper script and try to run by
> hand using mpiexec. Also try running the cpi example from the
> examples directory and see if it runs correctly.
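>
> For example, a rough sketch assuming the mpd ring is already up and
> the cpi binary has been built in the MPICH2 examples directory:
>
> mpiexec -n 2 ./cpi
>
> cpi prints one "Process i of N is on host" line per rank, so if N shows
> up as 1 here as well, the launch setup rather than the application code
> is the likely culprit.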
>
> Rajeev
>
> On Feb 25, 2011, at 9:43 AM, Ashwinkumar Dobariya wrote:
>
> > Hello everyone,
> >
> > I am a newbie here. I am running a code for large eddy simulation
> of turbulent flow. I am compiling the code with the wrapper command and
> running it on the Hydra cluster. When I submit the script file, it
> shows the following error:
> >
> > running mpdallexit on hydra127
> > LAUNCHED mpd on hydra127 via
> > RUNNING: mpd on hydra127
> > LAUNCHED mpd on hydra118 via hydra127
> > RUNNING: mpd on hydra118
> > Fatal error in MPI_Send: Invalid rank, error stack:
> > MPI_Send(176): MPI_Send(buf=0x7fffa7a1e4a8, count=1,
> MPI_DOUBLE_PRECISION, dest=1, tag=1, MPI_COMM_WORLD) failed
> > MPI_Send(98).: Invalid rank has value 1 but must be nonnegative
> and less than 1
> > Total Nb of PE: 1
> >
> > PE# 0 / 1 OK
> > PE# 0 0 0 0
> > PE# 0 0 33 0 165 0 33
> > PE# 0 -1 1 -1 -1 -1 8
> > PE_Table, PE# 0 complete
> > PE# 0 -0.03 0.98 -1.00 1.00 -0.03 0.98
> > PE# 0 doesn t intersect any bloc
> > PE# 0 will communicate with 0
> > single value
> > PE# 0 has 2 com. boundaries
> > Data_Read, PE# 0 complete
> >
> > PE# 0 checking boundary type for
> > 0 1 1 1 0 165 0 33 nor sur sur sur gra 1 0 0
> > 0 2 33 33 0 165 0 33 EXC -> 1
> > 0 3 0 33 1 1 0 33 sur nor sur sur gra 0 1 0
> > 0 4 0 33 164 164 0 33 sur nor sur sur gra 0 -1 0
> > 0 5 0 33 0 165 1 1 cyc cyc cyc sur cyc 0 0 1
> > 0 6 0 33 0 165 33 33 EXC -> 8
> > PE# 0 Set new
> > PE# 0 FFT Table
> > PE# 0 Coeff
> > rank 0 in job 1 hydra127_34565 caused collective abort of all
> ranks
> > exit status of rank 0: return code 1
> >
> > I am struggling to find the error in my code. Can anybody suggest
> where I messed up?
> >
> > Thanks and Regards,
> > Ash
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji