[mpich-discuss] Abort: fail to register rdma memory
Jean-Christophe Ducom
jcducom at gmail.com
Wed Jan 6 15:09:18 CST 2010
All-
The system is a cluster of 8-core Nehalem nodes (E5520 @ 2.27GHz) with 24GB
of memory and InfiniPath_QLE7240 cards.
The nodes are running RHEL5.4 with mvapich2/1.4 compiled with Intel9.0.
When I run a medium-size CFD simulation (16 nodes/128 cores), the run
stops with the following error message (it runs fine with 64 cores):
[...]
[49] Abort: fail to register rdma memory, size 32768
at line 105 in file ibv_priv.c
[51] Abort: fail to register rdma memory, size 32768
at line 105 in file ibv_priv.c
[47] Abort: fail to register rdma memory, size 32768
at line 105 in file ibv_priv.c
[50] Abort: fail to register rdma memory, size 32768
at line 105 in file ibv_priv.c
send desc error
[58] Abort: send desc error
[60] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
at line 581 in file ibv_channel_manager.c
send desc error
[62] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
at line 581 in file ibv_channel_manager.c
send desc error
[118] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
at line 581 in file ibv_channel_manager.c
send desc error
[65] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
at line 581 in file ibv_channel_manager.c
send desc error
[116] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
at line 581 in file ibv_channel_manager.c
send desc error
[64] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
at line 581 in file ibv_channel_manager.c
send desc error
[76] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
at line 581 in file ibv_channel_manager.c
[] Got completion with error 12, vendor code=0, dest rank=52
at line 581 in file ibv_channel_manager.c
send desc error
[63] Abort: [] Got completion with error 12, vendor code=0, dest rank=52
at line 581 in file ibv_channel_manager.c
send desc error
[84] Abort: [] Got completion with error 12, vendor code=0, dest rank=65
at line 581 in file ibv_channel_manager.c
send desc error
[85] Abort: [] Got completion with error 12, vendor code=0, dest rank=20
at line 581 in file ibv_channel_manager.c
[...]
Looking at ibv_priv.c:
    mem_handle[i] = register_memory(vbuf_rdma_buf,
                                    rdma_vbuf_total_size *
                                    num_rdma_buffer, i);
I believe I need to change the runtime parameters
    MV2_VBUF_TOTAL_SIZE (and then MV2_IBA_EAGER_THRESHOLD)
    MV2_NUM_RDMA_BUFFER
    MV2_RDMA_VBUF_POOL_SIZE
Could anyone confirm this and suggest values for them?
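To make the arithmetic in the register_memory() call concrete, here is a minimal sketch of how the registered size follows from the two factors. The helper names rdma_region_bytes and env_size_or are hypothetical and not part of mvapich2, and I am only assuming that MV2_VBUF_TOTAL_SIZE and MV2_NUM_RDMA_BUFFER feed rdma_vbuf_total_size and num_rdma_buffer. The values 2048 and 16 in the comment are illustrative, not confirmed defaults; they are just one pair whose product matches the 32768 bytes reported in the error.

```c
#include <stddef.h>
#include <stdlib.h>

/* Bytes requested by one register_memory() call in the snippet:
 * per-buffer size times number of buffers.
 * Returns 0 if the product would overflow size_t. */
size_t rdma_region_bytes(size_t vbuf_total_size, size_t num_rdma_buffer)
{
    if (vbuf_total_size != 0 &&
        num_rdma_buffer > (size_t)-1 / vbuf_total_size)
        return 0;
    return vbuf_total_size * num_rdma_buffer;
}

/* Read a non-negative decimal integer from the environment.
 * Falls back when the variable is unset, empty, negative,
 * or not fully numeric. */
size_t env_size_or(const char *name, size_t fallback)
{
    const char *s = getenv(name);
    char *end;
    unsigned long long v;

    if (s == NULL || *s == '\0' || *s == '-')
        return fallback;
    v = strtoull(s, &end, 10);
    return (*end == '\0') ? (size_t)v : fallback;
}

/* Example (illustrative values only):
 *   size_t per_buf = env_size_or("MV2_VBUF_TOTAL_SIZE", 2048);
 *   size_t count   = env_size_or("MV2_NUM_RDMA_BUFFER", 16);
 *   size_t bytes   = rdma_region_bytes(per_buf, count);
 * With the fallbacks above, bytes is 2048 * 16 = 32768.
 */
```

If this reading is right, lowering either factor shrinks each registration proportionally, which is why I am looking at these parameters first.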
Thank you
JC