[mpich-discuss] MPICH2-1.0.8 performance issues on Opteron Cluster

Mon Jan 5 11:15:55 CST 2009

James, Dmitry,

Would you be able to try the latest alpha version of 1.1?

http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/1.1a2/src/mpich2-1.1a2.tar.gz

Nemesis is the default channel in 1.1, so you don't have to specify
--with-device= when configuring.

Note that if you have more than one process and/or thread per core,
nemesis won't perform well.  This is because nemesis does active polling
(but we expect to have a non-polling option for the final release).  Do
you know if this is the case with your apps?

Thanks,
-d

On 01/05/2009 09:15 AM, Dmitry V Golovashkin wrote:
> We had a similar experience with nemesis in a prior mpich2 release
> (scalapack-ish applications on a multicore linux cluster).
> The resulting times were remarkably slower. The nemesis channel was an
> experimental feature back then, so I attributed the slower performance to a
> possible misconfiguration.
> Is it possible to submit a new ticket (for non-ANL folks)?
> 
> 
> 
> On Mon, 2009-01-05 at 09:00 -0500, James S Perrin wrote:
>> Hi,
>> 	I thought I'd just mention that I too have found that our software 
>> performs poorly with nemesis compared to ssm on our multi-core machines. 
>> I've tried it on both 2x dual-core AMD x64 and 2x quad-core Xeon x64 
>> machines. It's roughly 30% slower. I've not yet been able to do any 
>> analysis of where the nemesis version is losing out.
>>
>> 	The software performs mainly point-to-point communication in a master 
>> and slaves model. As the software is interactive the slaves call 
>> MPI_Iprobe while waiting for commands. Having compiled against the ssm 
>> version would have no effect, would it?
>>
>> Regards
>> James
>>
>> Sarat Sreepathi wrote:
>>> Hello,
>>>
>>> We got a new 10-node Opteron cluster in our research group. Each node 
>>> has two quad core Opterons. I installed MPICH2-1.0.8 with Pathscale(3.2) 
>>> compilers and three device configurations (nemesis,ssm,sock). I built 
>>> and tested using the Linpack(HPL) benchmark with ACML 4.2 BLAS library 
>>> for the three different device configurations.
>>>
>>> I observed some unexpected results as the 'nemesis' configuration gave 
>>> the worst performance. For the same problem parameters, the 'sock' 
>>> version was faster and the 'ssm' version hangs. For further analysis, I 
>>> obtained screenshots from the Ganglia monitoring tool for the three 
>>> different runs. As you can see from the attached screenshots, the 
>>> 'nemesis' version is consuming more 'system cpu' according to Ganglia. 
>>> The 'ssm' version fares slightly better but it hangs towards the end.
>>>
>>> I may be missing something trivial here, but can anyone account for this 
>>> discrepancy? Isn't the 'nemesis' or 'ssm' device recommended for this 
>>> cluster configuration? Your help is greatly appreciated.
>