[mpich-discuss] Abort: fail to register rdma memory

Jean-Christophe Ducom jcducom at gmail.com
Wed Jan 6 15:09:18 CST 2010


All-
The system is a cluster of 8-core Nehalem nodes (E5520 @ 2.27GHz) with 24GB
of memory and InfiniPath_QLE7240 cards.
The nodes are running RHEL 5.4 with mvapich2/1.4 compiled with Intel 9.0.

When I run a medium-size (16 nodes/128 cores) CFD simulation, the run
stops with the following error messages (it runs fine with 64 cores):
[...]
[49] Abort: fail to register rdma memory, size 32768
  at line 105 in file ibv_priv.c
[51] Abort: fail to register rdma memory, size 32768
  at line 105 in file ibv_priv.c
[47] Abort: fail to register rdma memory, size 32768
  at line 105 in file ibv_priv.c
[50] Abort: fail to register rdma memory, size 32768
  at line 105 in file ibv_priv.c
send desc error
[58] Abort: send desc error
[60] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[62] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[118] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[65] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[116] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[64] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[76] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
  at line 581 in file ibv_channel_manager.c
[] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[63] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
  at line 581 in file ibv_channel_manager.c
send desc error
[84] Abort: [] Got completion with error 12, vendor code=0, dest rank=65
  at line 581 in file ibv_channel_manager.c
send desc error
[85] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
  at line 581 in file ibv_channel_manager.c
[...]

Looking at ibv_priv.c:
mem_handle[i] = register_memory(vbuf_rdma_buf,
                                rdma_vbuf_total_size * num_rdma_buffer, i);
I believe I need to change the runtime parameters:
MV2_VBUF_TOTAL_SIZE (and then MV2_IBA_EAGER_THRESHOLD)
MV2_NUM_RDMA_BUFFER
MV2_RDMA_VBUF_POOL_SIZE

Could anyone confirm this and suggest values for them?
Thank you
JC

