[mpich-discuss] MPICH2-1.0.8 performance issues on Opteron Cluster

James Perrin James.S.Perrin at manchester.ac.uk
Thu Jan 8 04:14:01 CST 2009


Hi,

   No, I'm not using any dynamic processes.

Quoting "Rajeev Thakur" <thakur at mcs.anl.gov>:

> Are you using MPI-2 dynamic process functions (connect-accept or spawn)? It
> is possible that for dynamically connected processes on the same machine,
> Nemesis communication goes over TCP instead of shared memory (Darius can
> confirm), whereas with ssm it does not.
>
> Rajeev
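
For reference, the MPI-2 dynamic process functions Rajeev mentions are MPI_Comm_spawn and the MPI_Comm_connect/MPI_Comm_accept pair. A minimal spawn sketch, assuming a hypothetical child executable called "worker" (not code from this thread), looks roughly like:

    /* Minimal sketch of MPI-2 dynamic process creation with MPI_Comm_spawn.
       The child executable name "worker" is a placeholder and error handling
       is omitted. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;
        int errcodes[4];

        MPI_Init(&argc, &argv);

        /* Spawn 4 workers; they are reached through the intercommunicator
           "children" rather than MPI_COMM_WORLD, which is the case where
           traffic may not go through shared memory. */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, errcodes);

        /* ... exchange messages with the workers over "children" ... */

        MPI_Comm_disconnect(&children);
        MPI_Finalize();
        return 0;
    }
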
>
>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of James S Perrin
>> Sent: Wednesday, January 07, 2009 11:00 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] MPICH2-1.0.8 performance issues on Opteron Cluster
>>
>> Hi,
>> 	I've just tried out 1.1a2 and I get similar results to 1.0.8 for both
>> nemesis and ssm.
>>
>> Regards
>> James
>>
>> PS The zoom view in the images is 0.21s, of course!
>>
>> James S Perrin wrote:
>> > Darius,
>> >
>> >     I will try out the 1.1 version shortly. Attached are two images from
>> > jumpshot of the same section of code using nemesis and ssm. I've set the
>> > view to be the same length of time (2.1s) for comparison. It seems to me
>> > that the Isends and Irecvs from the master to the slaves (and vice
>> > versa) are what are causing the slowdown when using nemesis. These
>> > messages are quite small, ~1k. The purple events are Allreduce/Allgather
>> > operations between the slaves.
>> >
>> > Regards
>> > James
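
As a rough illustration of the traffic pattern James describes above (not his actual code), the master exchanging small ~1 KB non-blocking messages with each slave might look something like this; the buffer sizes, tags and contents are invented:

    /* Sketch of a master exchanging small (~1 KB) non-blocking messages with
       every slave, then waiting on all requests.  Sizes, tags and buffer
       contents are purely illustrative. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define CMD_BYTES 1024
    #define TAG_CMD   1
    #define TAG_REPLY 2

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                            /* master */
            char *cmds    = calloc((size_t)size, CMD_BYTES);
            char *replies = calloc((size_t)size, CMD_BYTES);
            MPI_Request *reqs = malloc(2 * (size - 1) * sizeof(MPI_Request));

            for (int i = 1; i < size; i++) {
                MPI_Isend(cmds    + i * CMD_BYTES, CMD_BYTES, MPI_BYTE, i,
                          TAG_CMD,   MPI_COMM_WORLD, &reqs[2 * (i - 1)]);
                MPI_Irecv(replies + i * CMD_BYTES, CMD_BYTES, MPI_BYTE, i,
                          TAG_REPLY, MPI_COMM_WORLD, &reqs[2 * (i - 1) + 1]);
            }
            MPI_Waitall(2 * (size - 1), reqs, MPI_STATUSES_IGNORE);
            free(reqs); free(replies); free(cmds);
        } else {                                    /* slave */
            char cmd[CMD_BYTES], reply[CMD_BYTES];
            memset(reply, 0, CMD_BYTES);
            MPI_Recv(cmd, CMD_BYTES, MPI_BYTE, 0, TAG_CMD,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(reply, CMD_BYTES, MPI_BYTE, 0, TAG_REPLY, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }
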
>> >
>> > Darius Buntinas wrote:
>> >> James, Dmitry,
>> >>
>> >> Would you be able to try the latest alpha version of 1.1?
>> >>
>> >>
>> >> http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.1a2/src/mpich2-1.1a2.tar.gz
>> >>
>> >>
>> >> Nemesis is the default channel in 1.1, so you don't have to specify
>> >> --with-device= when configuring.
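
For anyone reproducing the comparison: in 1.0.8 the channel is chosen at configure time, along these lines (the install prefixes here are placeholders, not taken from this thread):

    # One MPICH2 1.0.8 build per channel; only --with-device differs.
    # In 1.1 the default is already ch3:nemesis, as noted above.
    ./configure --prefix=/opt/mpich2-1.0.8-nemesis --with-device=ch3:nemesis
    ./configure --prefix=/opt/mpich2-1.0.8-ssm     --with-device=ch3:ssm
    ./configure --prefix=/opt/mpich2-1.0.8-sock    --with-device=ch3:sock
    make && make install
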
>> >>
>> >> Note that if you have more than one process and/or thread per core,
>> >> nemesis won't perform well. This is because nemesis does active polling
>> >> (but we expect to have a non-polling option for the final release). Do
>> >> you know if this is the case with your apps?
>> >>
>> >> Thanks,
>> >> -d
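
One way to check the more-processes-than-cores situation Darius raises is to count ranks per host inside the job and compare with the core count. A sketch (it assumes a Linux/POSIX sysconf and is not part of anyone's application here):

    /* Sketch: count how many ranks share rank 0's host and compare with the
       number of online cores, to spot the oversubscription case described
       above.  Assumes a POSIX sysconf(); purely illustrative. */
    #include <mpi.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];
        char *all = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        memset(name, 0, sizeof name);
        MPI_Get_processor_name(name, &namelen);

        if (rank == 0)
            all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);

        MPI_Gather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                   all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            long cores = sysconf(_SC_NPROCESSORS_ONLN);
            int here = 0;
            for (int i = 0; i < size; i++)
                if (strcmp(all + (size_t)i * MPI_MAX_PROCESSOR_NAME, name) == 0)
                    here++;
            printf("%d ranks on %s, %ld cores online%s\n", here, name, cores,
                   here > cores ? " -- oversubscribed, expect busy-polling overhead" : "");
            free(all);
        }

        MPI_Finalize();
        return 0;
    }
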
>> >>
>> >> On 01/05/2009 09:15 AM, Dmitry V Golovashkin wrote:
>> >>> We have had similar experiences with nemesis in a prior mpich2 release
>> >>> (scalapack-ish applications on a multicore linux cluster).
>> >>> The resultant times were remarkably slower. The nemesis channel was an
>> >>> experimental feature back then, so I attributed the slower performance
>> >>> to a possible misconfiguration.
>> >>> Is it possible to submit a new ticket (for non-ANL folks)?
>> >>>
>> >>>
>> >>>
>> >>> On Mon, 2009-01-05 at 09:00 -0500, James S Perrin wrote:
>> >>>> Hi,
>> >>>>     I thought I'd just mention that I too have found that our
>> >>>> software performs poorly with nemesis compared to ssm on our
>> >>>> multi-core machines. I've tried it on both a 2x dual-core AMD x64
>> >>>> machine and a 2x quad-core Xeon x64 machine. It's roughly 30% slower.
>> >>>> I've not yet been able to do any analysis of where the nemesis
>> >>>> version is losing out.
>> >>>>
>> >>>>     The software performs mainly point-to-point communication in a
>> >>>> master-and-slaves model. As the software is interactive, the slaves
>> >>>> call MPI_Iprobe while waiting for commands. Having compiled against
>> >>>> the ssm version would have no effect, would it?
>> >>>>
>> >>>> Regards
>> >>>> James
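
The interactive wait James describes relates to the polling discussion above: a slave's MPI_Iprobe loop can back off between probes so the idle wait itself doesn't pin a core. A sketch (the tag, message size and sleep interval are invented, not his actual code):

    /* Sketch of a slave waiting for a command with MPI_Iprobe, yielding the
       CPU between probes so the idle wait stays cheap.  The tag, message size
       and back-off interval are invented for illustration. */
    #include <mpi.h>
    #include <unistd.h>

    #define TAG_CMD   1
    #define CMD_BYTES 1024

    void slave_wait_for_command(char *cmd)
    {
        int flag = 0;
        MPI_Status status;

        while (!flag) {
            /* Non-blocking check for a pending command from the master. */
            MPI_Iprobe(0, TAG_CMD, MPI_COMM_WORLD, &flag, &status);
            if (!flag)
                usleep(1000);   /* ~1 ms back-off keeps the app responsive
                                   without spinning the core flat out */
        }
        MPI_Recv(cmd, CMD_BYTES, MPI_BYTE, 0, TAG_CMD,
                 MPI_COMM_WORLD, &status);
    }
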
>> >>>>
>> >>>> Sarat Sreepathi wrote:
>> >>>>> Hello,
>> >>>>>
>> >>>>> We got a new 10-node Opteron cluster in our research group. Each
>> >>>>> node has two quad-core Opterons. I installed MPICH2-1.0.8 with the
>> >>>>> Pathscale (3.2) compilers and three device configurations
>> >>>>> (nemesis, ssm, sock). I built and tested using the Linpack (HPL)
>> >>>>> benchmark with the ACML 4.2 BLAS library for the three different
>> >>>>> device configurations.
>> >>>>>
>> >>>>> I observed some unexpected results, as the 'nemesis' configuration
>> >>>>> gave the worst performance. For the same problem parameters, the
>> >>>>> 'sock' version was faster and the 'ssm' version hangs. For further
>> >>>>> analysis, I obtained screenshots from the Ganglia monitoring tool
>> >>>>> for the three different runs. As you can see from the attached
>> >>>>> screenshots, the 'nemesis' version is consuming more 'system cpu'
>> >>>>> according to Ganglia. The 'ssm' version fares slightly better, but
>> >>>>> it hangs towards the end.
>> >>>>>
>> >>>>> I may be missing something trivial here, but can anyone account for
>> >>>>> this discrepancy? Isn't the 'nemesis' or 'ssm' device
>> >>>>> recommended for this cluster configuration? Your help is greatly
>> >>>>> appreciated.
>> >
>>
>> --
>> --------------------------------------------------------------------------
>>    James S. Perrin
>>    Visualization
>>
>>    Research Computing Services
>>    Devonshire House, University Precinct
>>    The University of Manchester
>>    Oxford Road, Manchester, M13 9PL
>>
>>    t: +44 (0) 161 275 6945
>>    e: james.perrin at manchester.ac.uk
>>    w: www.manchester.ac.uk/researchcomputing
>> --------------------------------------------------------------------------
>>   "The test of intellect is the refusal to belabour the obvious"
>>   - Alfred Bester
>> --------------------------------------------------------------------------
>>
>
>




