Thanks, it sounds like things are under control.<div><br></div><div>I got around to running the tests and have hit three problems so far:</div><div><br></div><div><span style="font-size:large"><span style="font-weight:bold">1)</span></span></div>
<div>This is the smallest problem, but the code is the most clearly wrong. The test/mpi/init/initstat.c test was failing because MPI_Init_thread() and MPI_Query_thread() were returning different provided levels.</div><div><br></div><div>
When the device claims to support MPI_THREAD_MULTIPLE, enable_threads gets set to "runtime":</div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><span style="font-family:'courier new';font-size:12px"># Threads must be supported by the device. First, set the default to</span><br>
<span style="font-family:'courier new';font-size:12px"># be the highest supported by the device</span><br><span style="font-family:'courier new';font-size:12px">if test "$enable_threads" = default ; then</span><br>
<span style="font-family:'courier new';font-size:12px"> if test -n "$MPID_MAX_THREAD_LEVEL" ; then</span><br><span style="font-family:'courier new';font-size:12px"> case $MPID_MAX_THREAD_LEVEL in</span><br>
<span style="font-family:'courier new';font-size:12px"> MPI_THREAD_SINGLE) enable_threads=single ;;</span><br><span style="font-family:'courier new';font-size:12px"> MPI_THREAD_FUNNELED) enable_threads=funneled ;;</span><br>
<span style="font-family:'courier new';font-size:12px"> MPI_THREAD_SERIALIZED) enable_threads=serialized ;;</span><br><span style="font-family:'courier new';font-size:12px"> MPI_THREAD_MULTIPLE) enable_threads=runtime ;;</span><br>
<span style="font-family:'courier new';font-size:12px"> *) AC_MSG_ERROR([Unrecognized thread level from device $MPID_MAX_THREAD_LEVEL])</span><br><span style="font-family:'courier new';font-size:12px"> ;;</span><br>
<span style="font-family:'courier new';font-size:12px"> esac</span><br><span style="font-family:'courier new';font-size:12px"> else</span><br>
<span style="font-family:'courier new';font-size:12px"> enable_threads=single</span><br><span style="font-family:'courier new';font-size:12px"> fi</span><br>
<span style="font-family:'courier new';font-size:12px">fi</span><br><br><span class="Apple-style-span" style="font-family: 'courier new'; font-size: 12px;">.........</span><br><span style="font-family:'courier new';font-size:12px"># Runtime is an alias for multiple with an additional value</span><br>
<span style="font-family:'courier new';font-size:12px">if test "$enable_threads" = "runtime" ; then</span><br><span style="font-family:'courier new';font-size:12px"> AC_DEFINE(HAVE_RUNTIME_THREADCHECK,1,[Define if MPI supports MPI_THREAD_MULTIPLE with a runtime check for thread level])</span><br>
<span style="font-family:'courier new';font-size:12px"> enable_threads=multiple</span><br><span style="font-family:'courier new';font-size:12px"> # FIXME: This doesn't support runtime:thread-impl (as in multiple:thread-impl)</span><br>
<span style="font-family:'courier new';font-size:12px">fi</span><br></blockquote><div><div>This causes <span style="font-family:'courier new';font-size:12px">HAVE_RUNTIME_THREADCHECK<span style="font-family:arial;font-size:13px"> to be defined. In MPI_Init_thread(), that macro causes the device-provided thread level to be partially ignored. I see there is a FIXME comment; did you have other plans for this code?<br>
</span></span></div></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><span style="font-family:'courier new', monospace"> 288 mpi_errno = MPID_Init(argc, argv, required, <span style="color:rgb(255, 0, 0)">&thread_provided</span>, <br>
289 &has_args, &has_env);<br> 290 /* --BEGIN ERROR HANDLING-- */<br> 303 /* --END ERROR HANDLING-- */<br> 304 <br> 305 /* Capture the level of thread support provided */<br>
<span style="color:rgb(255, 0, 0)"> 306 MPIR_ThreadInfo.thread_provided = thread_provided;</span><br> 307 if (provided) *provided = thread_provided;<br> 308 /* FIXME: Rationalize this with the above */<br>
309 #ifdef HAVE_RUNTIME_THREADCHECK<br> 310 MPIR_ThreadInfo.isThreaded = required == MPI_THREAD_MULTIPLE;<br><span style="color:rgb(255, 0, 0)"> 311 if (provided) *provided = required;</span><br>
312 #endif</span><br></blockquote><div><div><span style="font-family:'courier new';font-size:12px"><span style="font-family:arial;font-size:13px"><div>
Line 288 gets the "provided" information from the device, as before.</div><div>Line 306 stores the device-provided level in the MPIR_ThreadInfo struct, as before.</div><div>Line 311 overwrites the device-provided level and tells the user that the provided level is the same as the requested one.</div>
<div>Since the MPI_Query_thread() code is:</div><div><div> *provided = MPIR_ThreadInfo.thread_provided;</div><div>the two values will be inconsistent whenever the device returns anything other than the requested level (for example, MPI_THREAD_MULTIPLE when only MPI_THREAD_FUNNELED was requested).</div>
<div>Either:</div><div>A) the device must be completely ignored, or</div><div>B) the provided thread level must not be set higher than the device supports.</div></div></span></span></div><div>Note: Choice (A) may break the threaded tests in mpich2/test/mpi/threads/, since they generally don't check the return value from pthread_create(), only that MPI_THREAD_MULTIPLE was provided. If threads cannot be started, these tests won't work.<br>
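A minimal sketch of option (B) in plain C, for illustration only. The enum and struct are hypothetical stand-ins for the MPI thread levels and MPIR_ThreadInfo. The sketch relies only on the ordering SINGLE < FUNNELED < SERIALIZED < MULTIPLE, which the MPI standard guarantees. init_thread_model() and query_thread_model() are not the real MPI_Init_thread()/MPI_Query_thread(). The idea is to compute the final level once, store it, and return that same value, so both calls agree.

```c
#include <assert.h>

/* Hypothetical stand-ins for the MPI thread levels; only their ordering matters. */
enum thread_level {
    THREAD_SINGLE,
    THREAD_FUNNELED,
    THREAD_SERIALIZED,
    THREAD_MULTIPLE
};

/* Simplified, hypothetical model of MPIR_ThreadInfo. */
struct thread_info_model {
    enum thread_level thread_provided;
    int is_threaded;
};

static struct thread_info_model thread_info;

/* Option (B): grant min(required, device_provided) and store that same
 * value, so a later query agrees with what init reported. */
static void init_thread_model(enum thread_level required,
                              enum thread_level device_provided,
                              enum thread_level *provided)
{
    enum thread_level granted =
        (required < device_provided) ? required : device_provided;

    /* Never claim more than the device is willing to support. */
    assert(granted <= device_provided);

    thread_info.thread_provided = granted;
    thread_info.is_threaded = (granted == THREAD_MULTIPLE);
    if (provided)
        *provided = granted;
}

/* Mirrors the MPI_Query_thread() logic: just read back the stored value. */
static void query_thread_model(enum thread_level *provided)
{
    *provided = thread_info.thread_provided;
}
```

Consider a request for FUNNELED on a device that provides MULTIPLE. This model reports FUNNELED from both calls. The current code reports FUNNELED from MPI_Init_thread() (line 311) but MULTIPLE from MPI_Query_thread() (stored at line 306).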
</div><br><br></div><br><span style="font-size:large"><span style="font-weight:bold">2)</span></span><br>I had to completely gut MPIU_Find_local_and_external() (in src/util/procmap/local_proc.c, the same file as problem 2 in my previous email) because this generic code didn't know as much about the BG/P topology as it assumed. Things are running now that intracommunicators work. The stub simply returns a generic non-fatal error, and the communicator utilities seem fine with that:<blockquote style="margin:0 0 0 40px;border:none;padding:0px">
<span style="font-family:'courier new', monospace">int MPIU_Find_local_and_external(MPID_Comm *comm, int *local_size_p, int *local_rank_p, int **local_ranks_p,<br> int *external_size_p, int *external_rank_p, int **external_ranks_p,<br>
int **intranode_table_p, int **internode_table_p)<br>{<br> return MPI_ERR_UNKNOWN;<br>}</span><br></blockquote><div><div><br></div><div><br></div><br><span style="font-size:large"><span style="font-weight:bold">3)</span></span><br>
I noticed that we got a hang because the build didn't pick up our custom CS_ENTER/EXIT macros. It looks like the "threaded" branch code for the macros is in this alpha release? I added some code that uses MPID_DEFINES_MPID_CS to once again allow a device to supply custom macros. Unlike my "work" in the threaded branch, this version is much more exacting. I will send these changes as a (git) patch in case you are interested; it can usually be applied with "patch -p1".<div>
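Here is a minimal sketch of the override pattern. It assumes the generic header supplies default macros only when the device has not defined MPID_DEFINES_MPID_CS. The CS_ENTER/CS_EXIT signatures, the ALLFUNC argument, and the counters standing in for locks are all hypothetical. The real MPICH2 macro names and arguments may differ, and this is not the patch itself.

```c
#include <assert.h>

/* --- Hypothetical device header (included first) --- */
/* The device opts out of the generic critical sections and supplies its own.
 * A counter stands in for whatever lock the device really uses. */
static int device_cs_depth = 0;
#define MPID_DEFINES_MPID_CS 1
#define CS_ENTER(name) (++device_cs_depth)
#define CS_EXIT(name)  (--device_cs_depth)

/* --- Hypothetical generic header --- */
/* Defaults are only defined when the device did not provide its own. */
#ifndef MPID_DEFINES_MPID_CS
static int generic_cs_depth = 0;
#define CS_ENTER(name) (++generic_cs_depth)
#define CS_EXIT(name)  (--generic_cs_depth)
#endif

/* Code that uses the macros; returns the device depth seen inside the CS,
 * which is 1 only if the device's macros were the ones picked up. */
static int do_work_in_cs(void)
{
    int depth_inside;
    CS_ENTER(ALLFUNC);
    depth_inside = device_cs_depth;
    CS_EXIT(ALLFUNC);
    return depth_inside;
}
```

If the guard were missing, the generic definitions would clash with or silently replace the device's. That is the kind of mismatch that can leave a device lock unused and lead to a hang.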
<br></div><div><br></div><div><br></div><div>Thanks,</div><div>Joe Ratterman</div><div><a href="mailto:jratt@us.ibm.com" target="_blank">jratt@us.ibm.com</a></div><div><br></div><div>
<br></div><div><br><div><div class="gmail_quote">On Fri, Nov 21, 2008 at 23:38, Dave Goodell <span dir="ltr"><<a href="mailto:goodell@mcs.anl.gov" target="_blank">goodell@mcs.anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><br>
On Nov 20, 2008, at 2:24 PM, Joe Ratterman wrote:<br>
<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
Hi, I am working on merging the latest changes from this alpha into the BG/P code. It fully compiles now (I haven't run the tests yet), but I had a few issues that I wanted to mention. Maybe someone is already working on solutions or can otherwise be of some help.<br>
<br>
1)<br></div>
PAC_CC_FUNCTION_NAME_SYMBOL (<a href="http://configure.in" target="_blank">configure.in</a>) doesn't work at all in cross-compilation environments, though I don't know of a good solution to that. [...]<br>
</blockquote>
<br>
It looks like David Gingold came to the same conclusion here: <a href="https://trac.mcs.anl.gov/projects/mpich2/ticket/300" target="_blank">https://trac.mcs.anl.gov/projects/mpich2/ticket/300</a><br>
<br>
Sorry for the breakage, we don't cross compile as often as you do and we didn't catch this one before the release. I haven't had the chance to dig in and fully grok this change yet, but I'm sure we can come up with a fix soon-ish.<div>
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2)<br>
"src/util/procmap/local_proc.c" seems a bit troubling for us. We don't use a PMI device, and we specify "MPID_NO_PMI=yes" in src/mpid/dcmfd/mpich2prereq. However, this file calls PMI_KVS_Get_key_length_max() from MPIU_Get_local_procs(). That compiled, since C compilers will accept an implicitly declared function, but it wouldn't link, even though we never call MPIU_Get_local_procs(). The file gets linked in regardless because it also defines MPIU_Get_intranode_rank(), which is used by both src/mpi/coll/reduce.c and src/mpi/coll/bcast.c. I ended up simply deleting the entire MPIU_Get_local_procs() function to solve the problem. I am sure that isn't the answer, but I don't know what the correct fix is.<br>
</blockquote>
<br></div>
This code is in need of a good dose of cleanup and improvement. It's not where I'd like it to be but we elected to put it out there to see how people felt about the feature. Don't worry, this won't be the final version of this code. In your situation I suspect removing the code is what makes the most sense for now.<div>
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
3)<br>
This one might be my problem, but I couldn't compile all the tests because neither the F77 nor F90 versions of f*/init/checksizes.c had actually been generated by the test/mpi/configure script. I don't know why not, but I had to copy them out of the <a href="http://configure.in" target="_blank">configure.in</a> script. There is no reference to them in the logs, and they are not part of the config.status script to be re-generated. I'll be looking into that one more, after I get the system running properly.<br>
</blockquote>
<br></div>
Bill has been making a bunch of changes in this area to clean up the test script and we might have accidentally excluded one of his changes from the release. I've filed a ticket to track this here: <a href="https://trac.mcs.anl.gov/projects/mpich2/ticket/301" target="_blank">https://trac.mcs.anl.gov/projects/mpich2/ticket/301</a><br>
<font color="#888888">
<br>
-Dave<br>
<br>
</font></blockquote></div><br></div></div>
</div>