Original Link: https://www.anandtech.com/show/2666



Today, AMD releases their new 45nm Opterons, codenamed Shanghai. It's been a very quiet year for AMD on the server front, after a fairly rough launch of Barcelona, but AMD hopes to regain consumer confidence and earn back some market share from Intel.

A little history

Barcelona was AMD's first product based on their monolithic quad-core design, which was a very different path from what Intel decided to take. Intel decided effectively to join two dual-core CPUs at the hip to make their quad-core product. Barcelona was not a smooth launch. Hindered by CPU design issues and supply problems, it was one of the worst launches we've seen from AMD. However, today is a new day, a new part, and hopefully new customers.

Shanghai is an update to the Barcelona architecture, which means it is socket compatible and should be a drop-in replacement in servers that have a BIOS update that supports Shanghai. Something that some people might not realize is that AMD uses the same core product in their 2P, 4P, and 8P product lines. This may not matter to everyone, but it makes the life of OEMs a little easier. Besides the various tweaks to the Barcelona architecture, Shanghai is also a die shrink to 45nm. As AMD has already revealed in their 2009/2010 roadmap, Shanghai will be with us until the end of 2009 as a quad-core chip, followed by a six-core product when AMD releases Istanbul. Shanghai will ship in several different clock speeds, listed below.

AMD Shanghai Overview
Model          CPU Clock   MC Clock   Part Number      Price
Opteron 2384   2.7GHz      2.2GHz     OS2384WAL4DGI    $989
Opteron 2382   2.6GHz      2.2GHz     OS2382WAL4DGI    $873
Opteron 2380   2.5GHz      2.0GHz     OS2380WAL4DGI    $698
Opteron 2378   2.4GHz      2.0GHz     OS2378WAL4DGI    $523
Opteron 2376   2.3GHz      2.0GHz     OS2376WAL4DGI    $377
Opteron 8384   2.7GHz      2.2GHz     OS8384WAL4DGI    $2149
Opteron 8382   2.6GHz      2.2GHz     OS8382WAL4DGI    $1865
Opteron 8380   2.5GHz      2.0GHz     OS8380WAL4DGI    $1514
Opteron 8378   2.4GHz      2.0GHz     OS8378WAL4DGI    $1165
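
Since the 2000-series and 8000-series parts share the same core, it is worth seeing how much extra the 8000-series parts cost at a given clock speed. The following is a small sketch using only the prices listed above; the dictionary and function names are illustrative and not anything AMD publishes.

```python
# Launch prices in USD, copied from the table above, keyed by CPU clock in GHz.
PRICES_2000_SERIES = {2.7: 989, 2.6: 873, 2.5: 698, 2.4: 523, 2.3: 377}
PRICES_8000_SERIES = {2.7: 2149, 2.6: 1865, 2.5: 1514, 2.4: 1165}


def premium_ratio(clock_ghz: float) -> float:
    """Return the 8000-series price divided by the 2000-series price at one clock."""
    return PRICES_8000_SERIES[clock_ghz] / PRICES_2000_SERIES[clock_ghz]


def all_premiums() -> dict:
    """Compute the price ratio for every clock speed offered in both series."""
    shared = sorted(set(PRICES_2000_SERIES) & set(PRICES_8000_SERIES), reverse=True)
    return {clock: round(premium_ratio(clock), 2) for clock in shared}


if __name__ == "__main__":
    for clock, ratio in all_premiums().items():
        print(f"{clock:.1f}GHz: 8000-series costs {ratio:.2f}x the 2000-series part")
```

At every shared clock speed the 8000-series part comes out at a little over twice the price of its 2000-series counterpart.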

All of the Shanghai parts that release this year will be 75W versions, ranging from 2.3 to 2.7GHz. In Q1 of 2009, AMD expects to release HE (55W) models as well as SE (105W) models. Another change coming next year is the move to HyperTransport 3.0, which will raise bandwidth to as much as 17.6GB/s per link. We found it a bit odd that this move will effectively require an update to the architecture, and that the feature didn't make the initial product release.



What's new with Shanghai?

Besides the die shrink to 45nm using a new Immersion Lithography process, AMD has thrown some other enhancements at the Barcelona architecture to increase performance in a variety of workloads. Details on the various features are quite scarce, but here is what we know at this point.

L3 Cache Increase

One of the first enhancements, and perhaps the biggest change, is the increase of the L3 cache from 2MB to 6MB. On average, it's expected that this will increase performance by anywhere from 5-10%.

Memory Bandwidth

Memory bandwidth goes up with the introduction of Shanghai, as supported memory speed moves from DDR2-667 to DDR2-800.

Smart Fetch

This allows cores to enter a halt state during idle times to reduce CPU power consumption. According to AMD, this can reduce CPU power consumption by as much as 21% or 15W.

Virtualization

Shanghai brings enhanced Rapid Virtualization Indexing (RVI) with a roughly 25% faster world switch, along with L3 Cache Index Disable for improved data integrity.

It's obvious that AMD is still focused on performance per Watt, which you'll see in the results.

Test Setup

AMD Shanghai System
2U Supermicro rack mount with H8DMU+ board
16GB DDR2 800MHz Memory
1 15.5K Cheetah SAS Drive for OS
Windows 2008 Enterprise
SQL 2008 Enterprise

Intel Harpertown System
2U Supermicro rack mount with X7DWN+ board
16GB DDR2 800MHz Memory
1 15.5K Cheetah SAS Drive for OS
Windows 2008 Enterprise
SQL 2008 Enterprise

RAID Setup
Promise J300s Enclosure
6 Seagate 1TB SAS Drives configured in a RAID 0 array for performance
LSI Logic 8480E MegaRAID Controller



Quest Software Benchmark Factory

We mentioned that the benchmarks used previously are no longer useful, as we did not have the I/O capacity required to support them. We went looking for alternative benchmarks and stumbled upon Benchmark Factory from Quest Software. Below is a description of the product and the benchmarks that we used in this article.

Benchmark Factory for Databases is a performance and code scalability testing tool that simulates users and transactions on a database and replays a production or synthetic workload in non-production environments. This enables organizations to validate database scalability as user loads increase, application changes are made, and platform changes are implemented. Benchmark Factory is available for Oracle, SQL Server, DB2, Sybase, MySQL and other databases via ODBC and Native connectivity.

Benchmark Factory provides many tests that you can run, and has a very nice and customizable metric reporting engine. We decided to run the AS3AP test along with the Scalable Hardware CPU and Reads tests. Here is what Quest's help file says about these tests:

AS3AP

The AS3AP benchmark is an American National Standards Institute (ANSI) Structured Query Language (SQL) relational database benchmark. The AS3AP benchmark provides the following features:

  • Tests database processing power
  • Built-in scalability and portability that tests a broad range of database systems
  • Minimizes effort in implementing and running benchmark tests
  • Provides a uniform metric and straightforward interpretation of benchmark results

Systems tested with the AS3AP benchmark must support common data types and provide a complete relational interface with basic integrity, consistency, and recovery mechanisms. The AS3AP benchmark can test systems ranging from a single-user microcomputer Database Management System (DBMS) to a high-performance parallel or distributed database.

Scalable Hardware

The Scalable Hardware benchmark measures relational database systems. This benchmark is a subset of the AS3AP benchmark and tests the following:

  • CPU
  • Disk
  • Network

It can also test any combination of the above three entities.



Idle Power and AS3AP Performance


We tested the systems, at idle, with both four DIMM and eight DIMM configurations. The impact was not as significant on the AMD system as it was on the Intel system. The difference in power consumption on the AMD system is only 7W but on the Intel system it was 49W.

Barcelona uses 6% more power at idle than Shanghai. The eight DIMM Intel system uses 56% more power than Shanghai, and the four DIMM Intel system uses 30% more power than Shanghai. A large chunk of this difference on Intel clearly comes from the FB-DIMMs, given the power scaling from four to eight FB-DIMMs, but AMD may still have a slight lead overall even if we discount the memory.
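
To put the memory contribution in perspective, the idle deltas quoted above can be turned into a rough per-DIMM cost. This is a back-of-the-envelope sketch that assumes the entire idle difference between the four DIMM and eight DIMM configurations comes from the four added DIMMs; the function name is illustrative.

```python
def watts_per_added_dimm(delta_watts: float, dimms_before: int, dimms_after: int) -> float:
    """Spread an idle power increase evenly across the DIMMs that were added."""
    added = dimms_after - dimms_before
    if added <= 0:
        raise ValueError("dimms_after must be greater than dimms_before")
    return delta_watts / added


if __name__ == "__main__":
    # Idle deltas between four and eight DIMMs, as quoted in the text.
    amd_ddr2 = watts_per_added_dimm(7, 4, 8)
    intel_fbdimm = watts_per_added_dimm(49, 4, 8)
    print(f"AMD (DDR2):       {amd_ddr2:.2f}W per added DIMM")
    print(f"Intel (FB-DIMM):  {intel_fbdimm:.2f}W per added DIMM")
    print(f"Ratio:            {intel_fbdimm / amd_ddr2:.1f}x")
```

Under that assumption each added DDR2 DIMM costs about 1.75W at idle versus about 12.25W for each added FB-DIMM, a sevenfold difference.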


For the first three load points the race is close, though Intel leads by an average of 15%. At load point four all systems are almost identical, and at load point five Shanghai leads by as much as 11%. In the first four load points, the performance of Shanghai @ 2.7 GHz is within 2% of Barcelona @ 2.3 GHz even though there is a 17% bump in clock. At load point five, Shanghai outpaces Barcelona by 11%, which is still less than the clock bump.
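
To separate the clock speed contribution from any architectural change, the measured gain can be divided by the clock ratio. The following is a rough sketch that assumes throughput would scale linearly with core clock, which database workloads rarely do in practice; the function names are illustrative.

```python
def clock_bump(new_ghz: float, old_ghz: float) -> float:
    """Fractional clock increase, e.g. 0.17 for a 17% bump."""
    return new_ghz / old_ghz - 1.0


def per_clock_gain(perf_gain: float, new_ghz: float, old_ghz: float) -> float:
    """Performance gain left over after dividing out the clock ratio.

    perf_gain is fractional (0.11 means 11% faster). A negative result means
    the measured gain was smaller than the clock increase alone would suggest.
    """
    return (1.0 + perf_gain) / (new_ghz / old_ghz) - 1.0


if __name__ == "__main__":
    bump = clock_bump(2.7, 2.3)
    residual = per_clock_gain(0.11, 2.7, 2.3)
    print(f"Clock bump:       {bump:.1%}")
    print(f"Gain per clock:   {residual:.1%}")
```

For the 2.7 GHz vs. 2.3 GHz comparison the clock bump works out to about 17.4%, and the 11% gain at load point five corresponds to roughly -5.4% per clock. In other words, throughput in this test scaled less than the clock did.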


Shanghai and Barcelona exhibit similar CPU usage profiles for the first few load points and then you see that Barcelona @ 2.3 GHz maxes out.


The Opterons are the clear leaders from a power perspective. Shanghai uses approximately 12% less power than Barcelona. Intel uses 27% to 46% more power than Shanghai, depending on the DIMM configuration.


At all load points, Shanghai is the clear winner. For the first four load points Shanghai is again ~12% more efficient than Barcelona, and as much as 28% or 47% more efficient than the Intel systems depending upon the DIMM configuration.
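
The efficiency figures can be cross-checked against the power and performance numbers quoted earlier. The sketch below assumes that efficiency here means performance per watt, which is not spelled out above; the function name and its ratio arguments are illustrative.

```python
def relative_efficiency(perf_ratio: float, power_ratio: float) -> float:
    """Fractional performance-per-watt advantage of system A over system B.

    perf_ratio  = performance of A / performance of B
    power_ratio = power draw of A / power draw of B
    """
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return perf_ratio / power_ratio - 1.0


if __name__ == "__main__":
    # Shanghai vs. Barcelona: roughly equal performance at 12% less power.
    print(f"vs. Barcelona:        {relative_efficiency(1.0, 0.88):.1%}")
    # Shanghai vs. a system drawing 27% more power at equal performance.
    print(f"vs. 27% hungrier box: {relative_efficiency(1.0, 1 / 1.27):.1%}")
```

Equal performance at 12% less power works out to about 13.6% better performance per watt, which is in the same range as the ~12% figure once the small performance differences between the two chips are taken into account. Note that "X% less power" and "X% more efficient" are not interchangeable: the gap between them widens as the percentages grow.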

Note: The line for Harpertown four DIMMs is dotted in the above graphs because we did not actually run this configuration but speculate this is the power consumption based on idle power consumption analysis.



Scalable CPU Performance


Unlike the AS3AP benchmark where AMD was able to match the performance of Harpertown at load point four and dominate at load point five, Intel is able to dominate this benchmark. Harpertown is able to lead by an average of 17% over Shanghai. Again, the performance of Shanghai @ 2.7 GHz is comparable to Barcelona @ 2.3 GHz at the lower load points but is able to lead Barcelona by as much as 25% at load point five.


We see similar CPU utilization profiles from the systems, and the Opterons ramp up more quickly than Harpertown.


Opterons are again the clear winners on this one, but the Harpertown system is a little more competitive on this benchmark. Shanghai is still able to use around 17% and 27% less power compared to the Xeon configurations.


Even with Intel's better performance, Shanghai is still able to lead at three of the five load points in this test. The Xeon with eight DIMMs is not even close, but the Xeon four DIMM configuration is competitive for the first three load points.

Note: The line for Harpertown four DIMMs is dotted in the above graphs because we did not actually run this configuration but speculate this is the power consumption based on idle power consumption analysis.



Scalable Reads Performance


In this benchmark Harpertown is able to lead for three of the five load points, and Shanghai leads for the other two. At the lower load points, Harpertown is able to lead by as much as 26%, but at the two top load points, where Shanghai is ahead, its lead is just 2%. Again, the performance of Shanghai @ 2.7 GHz at the lower load points is not significantly better than Barcelona @ 2.3 GHz. At load points four and five the faster Shanghai core is able to easily outpace Barcelona.


The results are similar to previous CPU utilization graphs.


Again, Intel is getting closer to AMD's efficiency but is not quite there. Shanghai uses around 20% less power than the lowest power consuming Harpertown configuration.


Shanghai is able to lead on the majority of the load points here, but the four DIMM Harpertown configuration is able to match it for the first two load points. Shanghai is able to lead Barcelona by as much as 37%.

Note: The line for Harpertown four DIMMs is dotted in the above graphs because we did not actually run this configuration but speculate this is the power consumption based on idle power consumption analysis.



Conclusion

As you've seen, AMD is still competitive with Intel's 3.0 GHz Harpertown in the database workloads that we've shown here. We were quite surprised that Shanghai was able to meet and, in some cases, pass Harpertown at various workload levels in some of the benchmarks. Obviously, when it comes to power, AMD is still leading this space by a significant margin. FB-DIMMs obliterate any power efficiency in Intel's processors, especially when you have eight (or more in some cases) of them present in a server.

What about how Shanghai fares against its older brother Barcelona? Well, in some cases, the gain is clearly just the increased clock speed. However, in others Shanghai achieves an increase of anywhere from 10-15% over and above the clock speed difference. It's obvious that Shanghai is what Intel would call a "Tick" of the clock for the Barcelona architecture, and it is a nice little bump for turning a few knobs and a die shrink. 2009 will be a very interesting year for AMD and Intel. Whether or not Shanghai can hold down the fort until Istanbul comes out remains to be seen.
