[mpich2-dev] MVAPICH2 does not work with specified PKEYs.
Mike Heinz
michael.heinz at qlogic.com
Wed Aug 12 10:40:27 CDT 2009
My testers are reporting further problems with mvapich2. On a fabric where the use of pkeys is required, mvapich2 is failing.
1) The MV2_DEFAULT_PKEY parameter does not appear to be supported when using mpirun_rsh.
2) When using mpd and mpiexec, the MV2_DEFAULT_PKEY parameter gets passed, but then fails. For example:
[root@homer mpi_apps]# export MV2_DEFAULT_PKEY=0xffff
[root@homer mpi_apps]# /usr/mpi/gcc/mvapich2-1.2p1/bin/mpiexec -machinefile /opt/iba/src/mpi_apps/mpi_hosts -n 2 osu2/osu_bw
[0] Abort: Can't find PKEY INDEX according to given PKEY
at line 1190 in file rdma_iba_priv.c
rank 0 in job 6 homer.dev.silverstorm.com_33133 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
(Note that 0xffff is actually the default PKEY).
A quick saquery reveals that the pkey is, in fact, in the table:
[root@homer mpi_apps]# iba_saquery -o pkey -l 1
LID: 0x0001 PortNum: 1 BlockNum: 0
0- 7: 0x9001 0xffff 0x9002 0x0000 0x0000 0x0000 0x0000 0x0000
8- 15: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
16- 23: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
24- 31: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
When I examined ibv_param.c to see what was going on, here is what I found:
if ((value = getenv("MV2_DEFAULT_PKEY")) != NULL) {
    rdma_default_pkey = (uint16_t)strtol(value, (char **) NULL, 0) & PKEY_MASK;
}
And...
#define PKEY_MASK 0x7fff /* the last bit is reserved */
This makes it clear that mpiexec is doing bad things to the pkey: the mask clears the high bit, so the requested 0xffff becomes 0x7fff, which is not in the table. If nothing else, the high bit must be set in order for the connection to have full membership in an InfiniBand partition. Without this bit, a node has only "limited membership", and limited members are not permitted to talk to each other.
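To illustrate the effect of the mask, here is a minimal sketch, not the actual MVAPICH2 code. The function names (parse_pkey_masked, parse_pkey_full, find_pkey_index) are hypothetical, and the exact 16-bit comparison in the lookup is an assumption, since the lookup code in rdma_iba_priv.c is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PKEY_MASK 0x7fff /* as quoted from ibv_param.c: drops the high bit */

/* Mirrors the quoted parsing: the membership (high) bit is stripped. */
static uint16_t parse_pkey_masked(const char *value)
{
    return (uint16_t)(strtol(value, NULL, 0) & PKEY_MASK);
}

/* Hypothetical alternative: keep all 16 bits exactly as the user gave them. */
static uint16_t parse_pkey_full(const char *value)
{
    return (uint16_t)(strtol(value, NULL, 0) & 0xffff);
}

/*
 * Assumed lookup: exact 16-bit match against the port's pkey table.
 * Returns the index of the first match, or -1 if the pkey is absent.
 */
static int find_pkey_index(const uint16_t *table, size_t len, uint16_t pkey)
{
    assert(table != NULL);
    for (size_t i = 0; i < len; i++) {
        if (table[i] == pkey)
            return (int)i;
    }
    return -1;
}
```

Under that assumption, and with the first table row from the saquery output (0x9001 0xffff 0x9002 ...), the masked parse of "0xffff" yields 0x7fff and the lookup returns -1, while the unmasked parse yields 0xffff and the lookup returns index 1.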
I'm going to see if I can quickly put together a patch for you that fixes the problems with mpiexec, but I'm not sure what the correct fix is for mpirun_rsh.
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania