<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<style>
<!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0in;
        margin-right:0in;
        margin-bottom:0in;
        margin-left:.5in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
span.EmailStyle17
        {mso-style-type:personal-compose;
        font-family:"Calibri","sans-serif";
        color:windowtext;}
.MsoChpDefault
        {mso-style-type:export-only;}
@page Section1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.Section1
        {page:Section1;}
/* List Definitions */
@list l0
        {mso-list-id:2122602076;
        mso-list-type:hybrid;
        mso-list-template-ids:-1302673462 67698705 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
        {mso-level-text:"%1\)";
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        margin-left:.25in;
        text-indent:-.25in;}
ol
        {margin-bottom:0in;}
ul
        {margin-bottom:0in;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal>My testers are reporting further problems with mvapich2. On
a fabric where the use of pkeys is required, mvapich2 is failing.<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoListParagraph style='margin-left:.25in;text-indent:-.25in;
mso-list:l0 level1 lfo1'><![if !supportLists]><span style='mso-list:Ignore'>1)<span
style='font:7.0pt "Times New Roman"'> </span></span><![endif]>The
MV2_DEFAULT_PKEY parameter does not appear to be supported when using
mpirun_rsh.<o:p></o:p></p>
<p class=MsoListParagraph style='margin-left:.25in;text-indent:-.25in;
mso-list:l0 level1 lfo1'><![if !supportLists]><span style='mso-list:Ignore'>2)<span
style='font:7.0pt "Times New Roman"'> </span></span><![endif]>When
using mpd and mpiexec, the MV2_DEFAULT_PKEY parameter gets passed, but the job
then fails. For example:<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>[root@homer
mpi_apps]# export MV2_DEFAULT_PKEY=0xffff<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>[root@homer
mpi_apps]# /usr/mpi/gcc/mvapich2-1.2p1/bin/mpiexec -machinefile
/opt/iba/src/mpi_apps/mpi_hosts -n 2 osu2/osu_bw<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'> [0] Abort:
Can't find PKEY INDEX according to given PKEY<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'> at line 1190
in file rdma_iba_priv.c<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>rank 0 in job
6 homer.dev.silverstorm.com_33133 caused collective abort of
all ranks<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'> exit status
of rank 0: killed by signal 9<o:p></o:p></span></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>(Note that 0xffff is actually the default PKEY.)<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>A quick saquery reveals that the pkey is, in fact, in the
table:<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>[root@homer
mpi_apps]# iba_saquery -o pkey -l 1<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>LID: 0x0001
PortNum: 1 BlockNum: 0<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>
0- 7: 0x9001 0xffff 0x9002 0x0000
0x0000 0x0000 0x0000 0x0000<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>
8- 15: 0x0000 0x0000 0x0000 0x0000
0x0000 0x0000 0x0000 0x0000<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>
16- 23: 0x0000 0x0000 0x0000 0x0000
0x0000 0x0000 0x0000 0x0000<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>
24- 31: 0x0000 0x0000 0x0000 0x0000
0x0000 0x0000 0x0000 0x0000<o:p></o:p></span></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal>When I examined ibv_param.c to see what was going on, here is
what I found:<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>
if ((value = getenv("MV2_DEFAULT_PKEY")) != NULL) {<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>
rdma_default_pkey = (uint16_t)strtol(value, (char **) NULL,0) & PKEY_MASK;<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'> }<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'>And…<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'> #define
PKEY_MASK 0x7fff /* the last bit is reserved */<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-family:"Courier New"'><o:p> </o:p></span></p>
<p class=MsoNormal>This makes it clear that the pkey is being mangled on the
mpiexec path: applying PKEY_MASK clears the high bit, turning 0xffff into
0x7fff. If nothing else, the high bit <b>must</b> be set in order for the
connection to have full membership in an InfiniBand partition. Without
this bit set, a node has only “limited membership”, and
limited members are not permitted to talk to each other.<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
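<p class=MsoNormal>The effect of the mask can be checked in isolation. The following is a minimal standalone sketch, not code taken from mvapich2: only PKEY_MASK and the strtol expression come from the snippet quoted above, while PKEY_FULL_MEMBER and the helper function names are hypothetical and exist purely for illustration.<o:p></o:p></p>
<pre>
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PKEY_MASK 0x7fff        /* as quoted from ibv_param.c */
#define PKEY_FULL_MEMBER 0x8000 /* hypothetical name: membership (high) bit */

/* Parse the way the quoted snippet does: convert, then apply PKEY_MASK. */
static uint16_t parse_pkey_masked(const char *value)
{
    assert(value != NULL);
    return (uint16_t)strtol(value, (char **)NULL, 0) & PKEY_MASK;
}

/* Hypothetical alternative: keep all 16 bits, including the membership bit. */
static uint16_t parse_pkey_unmasked(const char *value)
{
    assert(value != NULL);
    return (uint16_t)strtol(value, (char **)NULL, 0);
}

/* Nonzero when the high bit marks the pkey as a full member. */
static int pkey_is_full_member(uint16_t pkey)
{
    return (pkey & PKEY_FULL_MEMBER) != 0;
}

/* Print both interpretations of one MV2_DEFAULT_PKEY-style string. */
static void report_pkey(const char *value)
{
    uint16_t masked = parse_pkey_masked(value);
    uint16_t full = parse_pkey_unmasked(value);

    printf("%s: masked=0x%04x (%s), unmasked=0x%04x (%s)\n", value,
           (unsigned)masked, pkey_is_full_member(masked) ? "full" : "limited",
           (unsigned)full, pkey_is_full_member(full) ? "full" : "limited");
}
```
</pre>
<p class=MsoNormal>Calling report_pkey("0xffff") from a main function shows the masked value as 0x7fff with limited membership and the unmasked value as 0xffff with full membership. Whether dropping the mask is the right fix inside mvapich2 itself is a separate question that this sketch does not answer.<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>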
<p class=MsoNormal>I’m going to see whether I can quickly put together
a patch for you that fixes the problem with mpiexec, but I’m not
sure what the correct fix is for mpirun_rsh.<o:p></o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>--</span><span
style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>Michael
Heinz</span><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>Principal
Engineer, Qlogic Corporation</span><span style='font-size:12.0pt;font-family:
"Times New Roman","serif"'><o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>King
of Prussia, Pennsylvania</span><o:p></o:p></p>
</div>
</body>
</html>