[mpich2-dev] Scaling problem with MPI_Comm_create()

Brian Smith smithbr at us.ibm.com
Fri Mar 20 13:10:32 CDT 2009


Hey guys, we just had a bug report from LLNL that MPI_Comm_create() is 
taking a substantial amount of time on large communicators (~5min on 144k 
cores). Investigating the code, I found this snippet:

for (i = 0; i < n; i++) {
    /* Mapping[i] is the rank in the communicator of the process that
       is the ith element of the group */
    /* We use the appropriate vcr, depending on whether this is
       an intercomm (use the local vcr) or an intracomm (remote vcr).
       Note that this is really the local mapping for intercomm
       and remote mapping for the intracomm */
    /* FIXME : BUBBLE SORT */
    /* FIXME : NEEDS COMM_WORLD SPECIALIZATION */
    mapping[i] = -1;
    for (j = 0; j < vcr_size; j++) {
        int comm_lpid;
        MPID_VCR_Get_lpid( vcr[j], &comm_lpid );
        if (comm_lpid == group_ptr->lrank_to_lpid[i].lpid) {
            mapping[i] = j;
            break;
        }
    }
    MPIU_ERR_CHKANDJUMP1(mapping[i] == -1, mpi_errno,
                         MPI_ERR_GROUP,
                         "**groupnotincomm", "**groupnotincomm %d", i );
}

For large subcomms created off of large (sub)comms, this is basically an 
O(np^2) operation, and in my testing this step completely dominates 
MPI_Comm_create() time for any partition size larger than ~1k.

I've "inlined" MPID_VCR_Get_lpid() by macro-izing it. That speeds it up by 
about 2x-3x, but we are still looking at an O(np^2) loop.
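
For reference, the macro version looks something like this (this assumes 
the vc struct exposes an lpid field directly, which is what 
MPID_VCR_Get_lpid() reads on our build; the macro name is my own):

#define MPID_VCR_GET_LPID_INLINE(vcr_, lpid_ptr_) \
    (*(lpid_ptr_) = (vcr_)->lpid)

That removes the function-call overhead from the inner loop but obviously 
doesn't change the asymptotics.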

I see there are a few FIXME comments in there. Have you guys thought about 
better ways of doing this? 

I'll admit I'm not fully certain what the code is trying to do, so I 
haven't worked out a better strategy yet. But I think we could sort the 
lpids of the parent comm and the lpids of the subcomm's group, then walk 
through the two sorted arrays to do the rank/lrank/lpid mapping. Is that 
basically what this code is trying to do?

That would at least get us down to O(np lg np).
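
Something like the following is what I'm picturing. This is a standalone 
sketch, not a patch -- plain int arrays stand in for the real vcr table 
and lrank_to_lpid array, and build_mapping() is just an illustrative name:

#include <stdlib.h>

typedef struct { int lpid; int idx; } pair_t;

static int cmp_pair(const void *a, const void *b)
{
    return ((const pair_t *)a)->lpid - ((const pair_t *)b)->lpid;
}

/* comm_lpids[j] stands in for MPID_VCR_Get_lpid(vcr[j], ...);
   group_lpids[i] stands in for group_ptr->lrank_to_lpid[i].lpid.
   Returns 0 on success, -1 if some group member is not in the comm. */
static int build_mapping(const int *comm_lpids, int vcr_size,
                         const int *group_lpids, int n, int *mapping)
{
    pair_t *cp = malloc(vcr_size * sizeof *cp);
    pair_t *gp = malloc(n * sizeof *gp);
    int i, j, rc = 0;

    /* Tag each lpid with its original index so we can sort by lpid
       and still write the answer back in the original order. */
    for (j = 0; j < vcr_size; j++) { cp[j].lpid = comm_lpids[j];  cp[j].idx = j; }
    for (i = 0; i < n; i++)        { gp[i].lpid = group_lpids[i]; gp[i].idx = i; }

    qsort(cp, vcr_size, sizeof *cp, cmp_pair);
    qsort(gp, n, sizeof *gp, cmp_pair);

    /* Merge walk: one total pass over the sorted comm lpids. */
    for (i = 0, j = 0; i < n; i++) {
        while (j < vcr_size && cp[j].lpid < gp[i].lpid)
            j++;
        if (j < vcr_size && cp[j].lpid == gp[i].lpid)
            mapping[gp[i].idx] = cp[j].idx;
        else { rc = -1; break; }   /* the **groupnotincomm case */
    }

    free(cp);
    free(gp);
    return rc;
}

Since lpids are unique within a communicator, a single merge walk over the 
two sorted arrays recovers every mapping[i], and a group lpid with no 
match is exactly the **groupnotincomm error case.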

For subcomms off of comm_world I think we can get rid of part of this 
loop. I'm not sure there is a way to get rid of it altogether in the 
general subcomm case; I don't think there is.
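
E.g., if the lpid of a COMM_WORLD member is just its rank in COMM_WORLD 
(which I believe is the case, but someone who knows the vc code should 
confirm), then for groups built off of comm_world the inner search 
collapses to a direct assignment. Assuming the surrounding function has 
the parent comm as comm_ptr, something like:

    /* Hypothetical comm_world fast path: assumes lpid == rank in
       MPI_COMM_WORLD, so no search through the vcr table is needed. */
    if (comm_ptr == MPIR_Process.comm_world) {
        for (i = 0; i < n; i++)
            mapping[i] = group_ptr->lrank_to_lpid[i].lpid;
    }

That would make the comm_world case O(np).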

If you have some ideas, we can try some patches for now. I'm not sure 
whether changes like this would make it into MPICH2 1.1.

Thanks.


Brian Smith
BlueGene MPI Development
IBM Rochester
Phone: 507 253 4717
smithbr at us.ibm.com