[mpich-discuss] Hydra issues

Dave Goodell goodell at mcs.anl.gov
Wed Aug 26 15:36:16 CDT 2009


Scott is using 1.1.1p1.  I'm trying to file a ticket to follow up and
make sure that we haven't regressed on this issue, but Trac is giving
us some fits right now.

-Dave

On Aug 26, 2009, at 3:29 PM, Rajeev Thakur wrote:

> Which version of MPICH2 are you using? There was a scalability bug in
> MPI_Init with Nemesis and MPD in 1.1 that has been fixed in 1.1.1.
>
> Rajeev
>
>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Scott Atchley
>> Sent: Wednesday, August 26, 2009 3:24 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] Hydra issues
>>
>> Rusty,
>>
>> 1024 (128 nodes, 8 ppn).
>>
>> Scott
>>
>> On Aug 26, 2009, at 4:17 PM, Rusty Lusk wrote:
>>
>>> Ah.  I assume that we still use a lazy connect, so MPI_Init should
>>> be fast, but then allgatherv would indeed cause a lot of
>>> connections, depending on choice of allgatherv implementation (how
>>> many?)
>>>
>>> On Wednesday, Aug 26, 2009, at 3:09 PM, Scott Atchley wrote:
>>>
>>>> No, the ring starts fast enough. It is connecting 1024 processes
>>>> that is slow (allgatherv).
>>>>
>>>> By contrast, Intel MPI launched in < 10 seconds.
>>>>
>>>> Scott
>>>>
>>>> On Aug 26, 2009, at 4:07 PM, Rusty Lusk wrote:
>>>>
>>>>> I assume that you mean launching the MPD ring is slow.  Once the
>>>>> MPD ring is up, launching should be quick.  The original idea was
>>>>> that the MPD ring would be persistent across jobs, even from
>>>>> different people, as long as the jobs used the same nodes.
>>>>>
>>>>> Rusty
>>>>>
>>>>> On Wednesday, Aug 26, 2009, at 2:45 PM, Scott Atchley wrote:
>>>>>
>>>>>> On Aug 26, 2009, at 3:39 PM, Pavan Balaji wrote:
>>>>>>
>>>>>>>>> However you could use one of the various workarounds for this,
>>>>>>>>> such as an LD_PRELOADed setvbuf call:
>>>>>>>>> http://lists.gnu.org/archive/html/bug-coreutils/2008-11/msg00164.html
>>>>>>>> This does not change the behavior.
>>>>>>>> I am still stumped as to why there is no delay when using
>>>>>>>> persistent (launch-mode=2) versus a delay with no proxies
>>>>>>>> (launch-mode=1).
>>>>>>>
>>>>>>> This works for me. We need to figure out how to make this
>>>>>>> portable now.
>>>>>>>
>>>>>>> -- Pavan
>>>>>>
>>>>>> Thanks for your persistence (no pun intended).
>>>>>>
>>>>>> When running with 1,024 ranks, launching via MPD can take several
>>>>>> minutes. I am assuming that hydra will launch in seconds.
>>>>>>
>>>>>> Scott
>>>>>
>>>>
>>>
>>
>>
>


