[mpich2-dev] Scaling problem with MPI_Comm_create()
Brian Smith
smithbr at us.ibm.com
Fri Mar 20 13:10:32 CDT 2009
Hey guys, we just had a bug report from LLNL that MPI_Comm_create() is
taking a substantial amount of time on large communicators (~5min on 144k
cores). Investigating the code, I found this snippet:
for (i=0; i<n; i++) {
    /* Mapping[i] is the rank in the communicator of the process that
       is the ith element of the group */
    /* We use the appropriate vcr, depending on whether this is
       an intercomm (use the local vcr) or an intracomm (remote vcr)
       Note that this is really the local mapping for intercomm
       and remote mapping for the intracomm */
    /* FIXME : BUBBLE SORT */
    /* FIXME : NEEDS COMM_WORLD SPECIALIZATION */
    mapping[i] = -1;
    for (j=0; j<vcr_size; j++) {
        int comm_lpid;
        MPID_VCR_Get_lpid( vcr[j], &comm_lpid );
        if (comm_lpid == group_ptr->lrank_to_lpid[i].lpid) {
            mapping[i] = j;
            break;
        }
    }
    MPIU_ERR_CHKANDJUMP1(mapping[i] == -1, mpi_errno,
                         MPI_ERR_GROUP,
                         "**groupnotincomm", "**groupnotincomm %d", i );
}
For large subcomms off of large (sub)comms, this is basically an O(np^2)
operation. In my testing, this step completely dominates the cost of
MPI_Comm_create() for any partition size larger than ~1k.
I've "inlined" MPID_VCR_Get_lpid() by macro-izing it. That speeds it up by
about 2x-3x, but we are still looking at an O(np^2) loop.
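For reference, the macro-ized lookup amounts to something like this (a sketch
only; it assumes a ch3-style layout where the VCR is a pointer to a virtual
connection that stores its lpid directly as a struct field, and the macro name
is a placeholder of mine, not an existing MPICH symbol):

/* Sketch: replace the per-element function call with a field access.
 * Assumes vcr[j] points at a struct carrying an 'lpid' member. */
#define MPIDI_VCR_LPID(vcr_) ((vcr_)->lpid)

/* ... the inner comparison then becomes: */
if (MPIDI_VCR_LPID(vcr[j]) == group_ptr->lrank_to_lpid[i].lpid) {
    mapping[i] = j;
    break;
}

That removes the call overhead, but of course not the quadratic search itself.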
I see there are a few FIXME comments in there. Have you guys thought about
better ways of doing this?
I'll admit I'm not fully certain what the code is trying to do, so I haven't
come up with a better strategy, but I think we could sort the ranks of the
parent comm and the ranks of the subcomm and then walk through the two sorted
arrays to do the rank/lrank/lpid mapping. Is that basically what this is
trying to do?
That would at least get us to O(Np lg Np).
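To make the idea concrete, here is roughly what I have in mind (a standalone
sketch, not a patch: comm_lpids[], group_lpids[], and build_mapping() are
placeholder names standing in for the vcr[] and lrank_to_lpid[] lookups, not
real MPICH symbols):

#include <stdlib.h>

/* Pair of (lpid, original index), so the original position survives the sort. */
struct lpid_idx { int lpid; int idx; };

static int cmp_lpid(const void *a, const void *b)
{
    const struct lpid_idx *x = a, *y = b;
    return (x->lpid > y->lpid) - (x->lpid < y->lpid);
}

/* Sort both sides by lpid, then walk them in tandem.  Returns 0 on
 * success, -1 if some group lpid is not in the communicator (the
 * **groupnotincomm case) or on allocation failure. */
static int build_mapping(const int *comm_lpids, int vcr_size,
                         const int *group_lpids, int n,
                         int *mapping)
{
    struct lpid_idx *cs = malloc(vcr_size * sizeof *cs);
    struct lpid_idx *gs = malloc(n * sizeof *gs);
    int i, j, ret = 0;

    if (!cs || !gs) { free(cs); free(gs); return -1; }

    for (j = 0; j < vcr_size; j++) { cs[j].lpid = comm_lpids[j];  cs[j].idx = j; }
    for (i = 0; i < n; i++)        { gs[i].lpid = group_lpids[i]; gs[i].idx = i; }

    qsort(cs, vcr_size, sizeof *cs, cmp_lpid);
    qsort(gs, n, sizeof *gs, cmp_lpid);

    /* Tandem walk: O(n + vcr_size) after the two O(N lg N) sorts. */
    for (i = 0, j = 0; i < n; i++) {
        while (j < vcr_size && cs[j].lpid < gs[i].lpid)
            j++;
        if (j < vcr_size && cs[j].lpid == gs[i].lpid) {
            mapping[gs[i].idx] = cs[j].idx;
        } else {
            ret = -1;   /* group member not found in the communicator */
            break;
        }
    }

    free(cs);
    free(gs);
    return ret;
}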
For subcomms off of comm_world, I think we can get rid of part of this
loop. I'm not sure if there is a way to get rid of it altogether in the
general subcomm case. I don't think there is.
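For what it's worth, the comm_world specialization I have in mind is roughly
this (sketch only; it assumes the lpid of a COMM_WORLD process is simply its
world rank, which I believe holds as long as no processes have been
dynamically spawned -- please correct me if that assumption is wrong):

/* Sketch: when the parent communicator is MPI_COMM_WORLD, treat
 * lpid == world rank, so the inner search collapses to a direct
 * assignment plus a range check. */
for (i = 0; i < n; i++) {
    mapping[i] = group_ptr->lrank_to_lpid[i].lpid;
    MPIU_ERR_CHKANDJUMP1(mapping[i] < 0 || mapping[i] >= vcr_size,
                         mpi_errno, MPI_ERR_GROUP,
                         "**groupnotincomm", "**groupnotincomm %d", i );
}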
If you have some ideas, we can try some patches for now. I'm not sure if
changes would make it into MPICH2 1.1.
Thanks.
Brian Smith
BlueGene MPI Development
IBM Rochester
Phone: 507 253 4717
smithbr at us.ibm.com