[mpich-discuss] Re: [mvapich-discuss] caused collective abort of all ranks + signal 9

Matthew Koop koop at cse.ohio-state.edu
Tue May 6 10:45:11 CDT 2008


Sangamesh,

Can you run any of the benchmarks included with the OFED package? Try
running the ibv_rc_pingpong test between nodes in your system first, to
make sure the fabric is healthy.
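
For reference, a minimal two-node fabric check might look like the
following (a sketch; the hostnames are placeholders for two of your
IB-equipped nodes):

```shell
# On the "server" node (e.g. compute-0-8), start the test with no arguments:
#     ibv_rc_pingpong
# On the "client" node (e.g. compute-0-9), point it at the server:
#     ibv_rc_pingpong compute-0-8
# A healthy link reports the bytes transferred and the round-trip latency.
# ibv_devinfo on each node can also confirm the HCA port state is ACTIVE.
```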

Also, can you give us some additional information on your setup? What type
of cards are these? And how did you set 'ulimit -l unlimited'? We
suggest placing it in /etc/init.d/sshd on all nodes and restarting sshd
(and mpd). This will ensure that the processes started will inherit the
modified ulimit settings.
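
One way to apply that suggestion (a sketch; the exact init-script layout
varies by distribution, and the restart commands are assumptions):

```shell
# Sketch: add this line near the top of /etc/init.d/sshd on every IB node,
# then restart sshd (and re-boot the mpd ring), so that processes launched
# over ssh inherit the raised locked-memory limit:
#
#     ulimit -l unlimited
#
# e.g.  service sshd restart
#       mpdallexit && mpdboot -n 5 -f hostsfile
#
# Afterwards, verify from a *fresh* ssh session that the limit took effect;
# this prints the current max-locked-memory limit for the shell:
ulimit -l
```

Note that setting the limit in an interactive shell is not enough: ranks
started through mpd inherit the limits of the sshd that spawned them,
which is why the change goes into the sshd init script.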

Thanks,

Matt

On Tue, 6 May 2008, Sangamesh B wrote:

> Hi all,
>
>
> I have a problem; can someone help me with this issue?
>
> The scenario: we have a Rocks (4.2) cluster with 12 nodes. We newly
> installed InfiniBand cards in 5 nodes (the master node doesn't have an IB
> card). Installation of OFED was successful and IPs were assigned.
>
> I installed MVAPICH2 on it and set up a password-free environment from
> compute-0-8 to compute-0-12 (the nodes which have IB cards). So far
> everything is fine, and MPD boots up as well.
>
> I compiled a sample MPI program and tried to execute it, and got the
> following results:
>
> Scenario 1: Using root to execute Hellow.o (compiled with mvapich2-mpicc)
>
> [root at compute-0-8 test]# /opt/mvapich2_ps/bin/mpiexec -np 2 /test/Hellow.o
> Hello world from process 0 of 2
> Hello world from process 1 of 2
> rank 1 in job 8  compute-0-8.local_34399   caused collective abort of all
> ranks
>   exit status of rank 1: killed by signal 9
> rank 0 in job 8  compute-0-8.local_34399   caused collective abort of all
> ranks
>   exit status of rank 0: killed by signal 9
>
> Scenario 2: Using user id (srinu) to execute the same file.
>
> [srinu at compute-0-8 test]$ /opt/mvapich2_ps/bin/mpiexec -np 2 /test/Hellow.o
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(259)....: Initialization failed
> MPID_Init(102)...........: channel initialization failed
> MPIDI_CH3_Init(178)......:
> MPIDI_CH3I_RMDA_init(208): Failed to Initialize HCA type
> rdma_iba_hca_init(645)...: cannot create cq
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(259)....: Initialization failed
> MPID_Init(102)...........: channel initialization failed
> MPIDI_CH3_Init(178)......:
> MPIDI_CH3I_RMDA_init(208): Failed to Initialize HCA type
> rdma_iba_hca_init(645)...: cannot create cq
> rank 1 in job 9  compute-0-8.local_34399   caused collective abort of all
> ranks
>   exit status of rank 1: return code 1
>
> For the 2nd scenario, I found a solution on the net: ulimit -l unlimited.
> But this then produced the same error as in the 1st scenario.
> Can someone help solve this error?
>
> Thanks in advance,
>
> Sangamesh
>



