[mpich-discuss] [mpich2-announce] Announcing the availability of MPICH2-1.1a2

Joe Ratterman jratt0 at gmail.com
Tue Nov 25 14:18:10 CST 2008


Thanks, it sounds like things are under control.
I got around to running the tests and have hit three problems so far.

1)
This is the smallest problem, but it is the most clearly wrong code.  I was
failing the test/mpi/init/initstat.c test because MPI_Init_thread()
and MPI_Query_thread() were returning different provided levels.

When the device claims to handle MPI_THREAD_MULTIPLE, it gets set to
"runtime":

# Threads must be supported by the device.  First, set the default to
# be the highest supported by the device
if test "$enable_threads" = default ; then
    if test -n "$MPID_MAX_THREAD_LEVEL" ; then
        case $MPID_MAX_THREAD_LEVEL in
            MPI_THREAD_SINGLE)     enable_threads=single ;;
            MPI_THREAD_FUNNELED)   enable_threads=funneled ;;
            MPI_THREAD_SERIALIZED) enable_threads=serialized ;;
            MPI_THREAD_MULTIPLE)   enable_threads=runtime ;;
            *) AC_MSG_ERROR([Unrecognized thread level from device $MPID_MAX_THREAD_LEVEL])
        ;;
        esac
    else
        enable_threads=single
    fi
fi

.........
# Runtime is an alias for multiple with an additional value
if test "$enable_threads" = "runtime" ; then
    AC_DEFINE(HAVE_RUNTIME_THREADCHECK,1,[Define if MPI supports MPI_THREAD_MULTIPLE with a runtime check for thread level])
    enable_threads=multiple
    # FIXME: This doesn't support runtime:thread-impl (as in multiple:thread-impl)
fi

This will cause HAVE_RUNTIME_THREADCHECK to be defined.  In MPI_Init_thread,
this causes the provided data to be partially ignored.  I see there is a
"fixme" comment; did you have other plans for this code?

   288      mpi_errno = MPID_Init(argc, argv, required, &thread_provided,
   289                            &has_args, &has_env);
   290      /* --BEGIN ERROR HANDLING-- */
   303      /* --END ERROR HANDLING-- */
   304
   305      /* Capture the level of thread support provided */
   306      MPIR_ThreadInfo.thread_provided = thread_provided;
   307      if (provided) *provided = thread_provided;
   308      /* FIXME: Rationalize this with the above */
   309  #ifdef HAVE_RUNTIME_THREADCHECK
   310      MPIR_ThreadInfo.isThreaded = required == MPI_THREAD_MULTIPLE;
   311      if (provided) *provided = required;
   312  #endif

Line 288 gets the "provided" information from the device, as before.
Line 306 stores the device-provided level in the MPIR_ThreadInfo struct,
as before.
Line 311 overwrites the value returned to the user, telling the user that
the provided level is the same as the required level.
Since this is the MPI_Query_thread() code:
    *provided = MPIR_ThreadInfo.thread_provided;
the device would have to always return MPI_THREAD_MULTIPLE, or the two values
will be different and inconsistent.
Either
A) The device-provided level must be completely ignored, or
B) The provided thread level cannot be set higher than the device is
willing to support.
Note: Choice (A) may break the threaded tests in mpich2/test/mpi/threads/,
since they don't generally check the return value from pthread_create(),
only that MPI_THREAD_MULTIPLE was provided.  If threads cannot be started,
these tests won't work.
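
To make choice (B) concrete, here is a minimal sketch of what I mean.  It is
not the actual MPICH2 code and not a tested patch: the enum values stand in
for MPI_THREAD_SINGLE through MPI_THREAD_MULTIPLE (which the MPI standard
defines as monotonically increasing), and struct thread_info and
settle_thread_level() are hypothetical names used only for illustration.
The idea is to compute one level, capped by what the device granted, and to
store that same value both in the internal struct and in the user's output
argument, so a later query cannot disagree with the init call.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for MPI_THREAD_SINGLE .. MPI_THREAD_MULTIPLE, in the same
 * increasing order the MPI standard guarantees. */
enum {
    THREAD_SINGLE,
    THREAD_FUNNELED,
    THREAD_SERIALIZED,
    THREAD_MULTIPLE
};

/* Hypothetical stand-in for the two MPIR_ThreadInfo fields shown above. */
struct thread_info {
    int thread_provided;
    int is_threaded;
};

/* Choice (B): never report a level higher than the device granted, and
 * record the same level everywhere it is later read from. */
static int settle_thread_level(struct thread_info *info, int required,
                               int device_provided, int *provided)
{
    int level;

    assert(info != NULL);
    level = (required < device_provided) ? required : device_provided;

    info->thread_provided = level;                  /* read back by the query call */
    info->is_threaded = (level == THREAD_MULTIPLE);
    if (provided) *provided = level;                /* returned by the init call */
    return level;
}
```

With this shape, a device that only grants THREAD_SINGLE makes both the init
call and a later query report THREAD_SINGLE, even when the user asked for
THREAD_MULTIPLE.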



2)
I had to completely gut MPIU_Find_local_and_external() (same file as problem
2 before) because this generic code didn't know as much about the BG/P
topology as it thought.  It is running now that intra-comms work.
I return a generic non-fatal error and the comm utils seem fine with it.

int MPIU_Find_local_and_external(MPID_Comm *comm,
                                 int *local_size_p, int *local_rank_p,
                                 int **local_ranks_p,
                                 int *external_size_p, int *external_rank_p,
                                 int **external_ranks_p,
                                 int **intranode_table_p,
                                 int **internode_table_p)
{
    return MPI_ERR_UNKNOWN;
}




3)
I noticed that we got a hang because the build didn't pick up our custom
CS_ENTER/EXIT macros.  It looks like the "threaded" branch code for the
macros is in this alpha release?  I added some code to use
MPID_DEFINES_MPID_CS to once again allow a device to use custom macros.
Unlike my "work" in the threaded branch, it is much more exacting.  I will
send these changes as a (git) patch in case you are interested.  It can
usually be applied using "patch -p1".
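
For anyone unfamiliar with the pattern, the override works roughly like the
sketch below.  This is only an illustration of the idea, not the patch
itself: MPID_DEFINES_MPID_CS is the real symbol, but EXAMPLE_CS_ENTER,
EXAMPLE_CS_EXIT, example_cs_mutex, and example_bump() are made-up names,
and the generic fallback is assumed here to be a single pthread mutex.

```c
#include <assert.h>
#include <pthread.h>

#ifdef MPID_DEFINES_MPID_CS
/* The device's own header is expected to have defined
 * EXAMPLE_CS_ENTER() and EXAMPLE_CS_EXIT() already, so the generic
 * versions below are skipped. */
#else
/* Generic fallback (assumed): one global mutex guards the critical section. */
static pthread_mutex_t example_cs_mutex = PTHREAD_MUTEX_INITIALIZER;

#define EXAMPLE_CS_ENTER()                               \
    do {                                                 \
        int rc_ = pthread_mutex_lock(&example_cs_mutex); \
        assert(rc_ == 0);                                \
        (void)rc_;                                       \
    } while (0)

#define EXAMPLE_CS_EXIT()                                  \
    do {                                                   \
        int rc_ = pthread_mutex_unlock(&example_cs_mutex); \
        assert(rc_ == 0);                                  \
        (void)rc_;                                         \
    } while (0)
#endif

static int example_counter = 0;

/* Any code that touches shared state goes through the same two macros,
 * whichever definition is in effect. */
static int example_bump(void)
{
    int value;

    EXAMPLE_CS_ENTER();
    value = ++example_counter;
    EXAMPLE_CS_EXIT();
    return value;
}
```

The point is that the generic definitions must sit behind the guard; if they
are defined unconditionally, the device's versions are silently ignored, and
that is the kind of thing that produced the hang.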


Thanks,
Joe Ratterman
jratt at us.ibm.com



On Fri, Nov 21, 2008 at 23:38, Dave Goodell <goodell at mcs.anl.gov> wrote:

>
> On Nov 20, 2008, at 2:24 PM, Joe Ratterman wrote:
>
>> Hi, I am working on merging the latest changes from this alpha into the
>> BG/P code.  It fully compiles now--I haven't run the tests--but I had a few
>> issues that I wanted to mention.  Maybe someone already is working on
>> solutions or can otherwise be of some help.
>>
>> 1)
>> PAC_CC_FUNCTION_NAME_SYMBOL (configure.in) doesn't work at all in
>> cross-compilation environments, though I don't know of a good solution to
>> that. [...]
>>
>
> It looks like David Gingold came to the same conclusion here:
> https://trac.mcs.anl.gov/projects/mpich2/ticket/300
>
> Sorry for the breakage, we don't cross compile as often as you do and we
> didn't catch this one before the release.  I haven't had the chance to dig
> in and fully grok this change yet, but I'm sure we can come up with a fix
> soon-ish.
>
>  2)
>> "src/util/procmap/local_proc.c" seems a bit troubling for us.  We don't
>> use a PMI device, and specify "MPID_NO_PMI=yes" in
>> src/mpid/dcmfd/mpich2prereq.  However, this file calls
>> PMI_KVS_Get_key_length_max() from MPIU_Get_local_procs(). That did compile
>> because C doesn't care too much, but it wouldn't link, even though we never
>> call MPIU_Get_local_procs().  This is because the file also defines
>> MPIU_Get_intranode_rank(), which is used by both src/mpi/coll/reduce.c &
>> src/mpi/coll/bcast.c.  I ended up simply deleting the entire
>> MPIU_Get_local_procs() function to solve the problem.  I am sure that isn't
>> the answer, but I don't know what is the correct version.
>>
>
> This code is in need of a good dose of cleanup and improvement.  It's not
> where I'd like it to be but we elected to put it out there to see how people
> felt about the feature.  Don't worry, this won't be the final version of
> this code.  In your situation I suspect removing the code is what makes the
> most sense for now.
>
>  3)
>> This one might be my problem, but I couldn't compile all the tests because
>> neither the F77 nor F90 versions of f*/init/checksizes.c had actually been
>> generated by the test/mpi/configure script.  I don't know why not, but I had
>> to copy them out of the configure.in script.  There is no reference to
>> them in the logs, and they are not part of the config.status script to be
>> re-generated.  I'll be looking into that one more, after I get the system
>> running properly.
>>
>
> Bill has been making a bunch of changes in this area to clean up the test
> script and we might have accidentally excluded one of his changes from the
> release.  I've filed a ticket to track this here:
> https://trac.mcs.anl.gov/projects/mpich2/ticket/301
>
> -Dave
>
>