<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hi Hyogi,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thanks for the heads up!<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
What exactly fails when you run the code snippet you shared? Is it the ssg_group_create_pmix() call? That would make the most sense, but I want to make sure it's not something else, like the Margo or SSG init calls. I don't see anything obviously wrong in your code.<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I'm currently investigating some other errors that result in sporadic hangs or crashes on Summit when using SSG to launch many processes on a node (e.g., starting 64 SSG processes on one Summit node) -- maybe this is just another variation of that particular
bug (though I'm using verbs and you're using tcp)? It would help if you could confirm exactly how many processes and nodes it takes to trigger the problem so I can try to reproduce it. Your original mail mentions 512 nodes, but it's not clear whether you're starting
just a single process on each node or more than that. It would be nice if we could simplify the reproducer so we don't need so many nodes to debug, if possible.<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
--Shane<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> mochi-devel &lt;mochi-devel-bounces@lists.mcs.anl.gov&gt; on behalf of Sim, Hyogi &lt;simh@ornl.gov&gt;<br>
<b>Sent:</b> Tuesday, September 15, 2020 10:59 AM<br>
<b>To:</b> mochi-devel@lists.mcs.anl.gov &lt;mochi-devel@lists.mcs.anl.gov&gt;<br>
<b>Subject:</b> [Mochi-devel] [SSG] pmix initialization failure</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">Hi,<br>
<br>
I am initializing an SSG group using PMIx on Summit. The initialization works as expected, but only up to a certain number of compute nodes (~256 nodes). With 512+ nodes, the group initialization always seems to fail. Assuming that SSG itself has been
tested at a larger scale, I am wondering if you see any obvious problems in my code below.
<br>
<br>
For ssg_config and ssg_group_update_cb, I just copied directly from the mochi documentation (<a href="https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html">https://mochi.readthedocs.io/en/latest/ssg/05_create_pmix.html</a>). I am using v0.4.1.<br>
<br>
===<br>
<br>
static int comm_init(void)<br>
{<br>
int ret = 0;<br>
int i = 0;<br>
int rank = 0;<br>
int nranks = 0;<br>
pmix_proc_t proc;<br>
margo_instance_id mid;<br>
ssg_group_id_t gid;<br>
<br>
__debug("initialize the communication");<br>
<br>
ret = PMIx_Init(&proc, NULL, 0);<br>
if (ret != PMIX_SUCCESS) {<br>
__error("PMIx_Init failed: %s", PMIx_Error_string(ret));<br>
return ret;<br>
}<br>
<br>
__debug("pmix initialized");<br>
<br>
mid = margo_init("ofi+tcp://", MARGO_SERVER_MODE, 1, 4);<br>
if (mid == MARGO_INSTANCE_NULL) {<br>
__error("failed to initialize margo");<br>
return EIO;<br>
}<br>
<br>
__debug("margo initialized");<br>
<br>
ret = ssg_init();<br>
if (ret != SSG_SUCCESS) {<br>
__error("ssg_init() failed");<br>
return ret;<br>
}<br>
<br>
gid = ssg_group_create_pmix(mid, "servergroup", proc,<br>
&ssg_config, ssg_group_update_cb, NULL);<br>
if (gid == SSG_GROUP_ID_INVALID) {<br>
__error("ssg_group_create_pmix() failed");<br>
return EIO; /* ret is still SSG_SUCCESS here, so return an explicit error */<br>
}<br>
<br>
rank = ssg_get_group_self_rank(gid);<br>
nranks = ssg_get_group_size(gid);<br>
<br>
__debug("ssg group (gid=%llu, rank=%d, nranks=%d)",<br>
(unsigned long long) gid, (int) rank, nranks);<br>
<br>
ssg_group_dump(gid);<br>
<br>
return 0;<br>
}<br>
<br>
===<br>
<br>
Thanks,<br>
Hyogi<br>
</div>
</span></font></div>
</body>
</html>