One of the popular ways to describe computer performance these days is to measure how many millions of instructions per second (MIPS) and millions of floating-point operations per second (MFLOPS) a computer can execute. Knowing the MIPS and MFLOPS figures, as well as what type of data needs to be processed, one can choose a system that better suits a specific workload. For instance, the common approach for running scientific applications is to get a system with higher MFLOPS. But even if you get a system with very high theoretical MFLOPS performance, is it really going to speed up a specific scientific application? To answer this question, AMD conducted a study using two Opteron 6200 series "Interlagos" processors: a production CPU and a custom-made one.
The production microprocessor used in the study was the Opteron 6276. This part has 2 dies in the package, 4 Bulldozer modules per die, and 2 CPU cores per module, which gives 16 cores per package. Each Bulldozer (BD) module has a dedicated 2 MB L2 cache, and each die also has an 8 MB L3 cache shared between all of its cores. The 6276 runs at 2.3 GHz, or up to 3.2 GHz with the Turbo Core feature active. Each BD module can process 8 double-precision floating-point operations per cycle, so a dual Opteron 6276 system (16 modules at 2.3 GHz) has a raw theoretical peak of about 294 GFLOPS. Assuming the processors reach 85% efficiency when fully loaded, that works out to a maximum of about 250 GFLOPS, or 250,000 MFLOPS.
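The 250 GFLOPS figure can be reproduced with simple arithmetic. The sketch below assumes the 2.3 GHz base clock (not the Turbo Core frequency) and derives the 8 FLOPs per cycle from the Bulldozer module's two shared 128-bit FMAC pipelines, each completing a fused multiply-add on two doubles per cycle. The function and constant names are illustrative, not taken from the study.

```python
def peak_gflops(sockets: int, modules_per_socket: int,
                flops_per_cycle_per_module: int, clock_ghz: float,
                efficiency: float = 1.0) -> float:
    """Peak double-precision GFLOPS for a multi-socket system.

    efficiency scales the raw peak; 1.0 gives the pure theoretical number.
    """
    return (sockets * modules_per_socket * flops_per_cycle_per_module
            * clock_ghz * efficiency)


# Per Bulldozer module: 2 FMAC pipes x 2 doubles per 128-bit vector x 2 FLOPs per FMA.
FLOPS_PER_CYCLE_BD = 2 * 2 * 2  # = 8

# Opteron 6276: 2 dies x 4 modules = 8 modules per package, in a dual-socket system.
raw = peak_gflops(2, 8, FLOPS_PER_CYCLE_BD, 2.3)
derated = peak_gflops(2, 8, FLOPS_PER_CYCLE_BD, 2.3, efficiency=0.85)
print(f"raw peak: {raw:.1f} GFLOPS, at 85% efficiency: {derated:.1f} GFLOPS")
# prints "raw peak: 294.4 GFLOPS, at 85% efficiency: 250.2 GFLOPS"
```

Dropping the efficiency factor gives the raw peak; applying 85% lands on the roughly 250 GFLOPS quoted for the dual Opteron 6276 system.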
The custom-made processor was an Opteron 6275, code-named "Fangio". This CPU has the same specifications as the 6276, except that its floating-point unit (FPU) is limited to 2 double-precision FLOPs per cycle per module. As a result of this FPU capping, the theoretical peak drops to about 75 GFLOPS. Integer performance and memory throughput of the 6275 are identical to those of the Opteron 6276.
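Capping the FPU changes only the per-cycle term of the peak calculation. A minimal sketch, assuming the 6275 system uses the same 2.3 GHz base clock and the same 16 modules across two sockets as the 6276 system: at 2 FLOPs per cycle per module the raw peak comes out at about 74 GFLOPS, in line with the roughly 75 GFLOPS quoted above and one quarter of the 6276's raw peak. The article does not state exactly which clock or efficiency factor produced its 75 GFLOPS figure, and the constant names here are hypothetical.

```python
SOCKETS = 2
MODULES_PER_SOCKET = 8
CLOCK_GHZ = 2.3  # assumed base clock for both parts


def raw_peak_gflops(flops_per_cycle_per_module: int) -> float:
    """Raw (unadjusted) dual-socket peak in GFLOPS."""
    return SOCKETS * MODULES_PER_SOCKET * flops_per_cycle_per_module * CLOCK_GHZ


full = raw_peak_gflops(8)    # Opteron 6276: uncapped FPU
capped = raw_peak_gflops(2)  # Opteron 6275 "Fangio": FPU capped
print(f"6276: {full:.1f} GFLOPS, 6275: {capped:.1f} GFLOPS, ratio {full / capped:.0f}x")
# prints "6276: 294.4 GFLOPS, 6275: 73.6 GFLOPS, ratio 4x"
```

Because everything except the FLOPs-per-cycle term is shared, the compute ceiling scales exactly with the FPU cap, while integer and memory resources stay the same.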
Both processors were used to run three HPC applications: High-Performance Linpack (HPL), STREAM Triad, and OpenFOAM. Of these, only HPL is FPU-bound, so the HPL results for the Opteron 6275 and 6276 were close to their theoretical maximums of 75 GFLOPS and 250 GFLOPS, respectively. STREAM Triad and OpenFOAM are memory-bound: they spend more time waiting for data than doing actual calculations. STREAM Triad achieved only 16 GFLOPS on both CPUs. OpenFOAM scored 22 GFLOPS on the Opteron 6275, and the 6276 was only 14% faster, even though its FLOPS ceiling is more than three times higher. As these results show, maximum theoretical FLOPS performance does not always translate into maximum application performance. This applies to many computational fluid dynamics (CFD) applications, OpenFOAM among them, and to many other programs that deal with large datasets. More details are available in the PDF presentation linked below.
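One common way to reason about these results, not used in the article itself, is the roofline model: attainable throughput is the lower of the compute peak and the product of memory bandwidth and arithmetic intensity (FLOPs per byte moved). The sketch below applies it to a STREAM Triad style kernel, a[i] = b[i] + s * c[i], which performs 2 FLOPs per 24 bytes of traffic (three 8-byte doubles, ignoring write-allocate traffic). The 80 GB/s bandwidth and the dense-kernel intensity of 20 FLOPs per byte are hypothetical round numbers, not measurements from the study; the peaks are the article's 75 and 250 GFLOPS figures.

```python
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float,
                    flops_per_byte: float) -> float:
    """Attainable GFLOPS under the roofline model: min(compute roof, memory roof)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


BANDWIDTH_GBS = 80.0  # hypothetical sustained memory bandwidth, same for both CPUs
PEAKS = {"Opteron 6275": 75.0, "Opteron 6276": 250.0}  # GFLOPS, from the article

# STREAM Triad a[i] = b[i] + s * c[i]: 2 FLOPs per 3 doubles (24 bytes) moved.
TRIAD_INTENSITY = 2 / 24
# Hypothetical high-intensity kernel standing in for HPL's dense matrix math.
DENSE_INTENSITY = 20.0

for name, peak in PEAKS.items():
    triad = roofline_gflops(peak, BANDWIDTH_GBS, TRIAD_INTENSITY)
    dense = roofline_gflops(peak, BANDWIDTH_GBS, DENSE_INTENSITY)
    print(f"{name}: Triad-like {triad:.1f} GFLOPS, dense {dense:.1f} GFLOPS")
# prints:
# Opteron 6275: Triad-like 6.7 GFLOPS, dense 75.0 GFLOPS
# Opteron 6276: Triad-like 6.7 GFLOPS, dense 250.0 GFLOPS
```

Both parts hit the same memory roof on the Triad-like kernel, so the extra FPU capacity of the 6276 goes unused; only the compute-bound kernel exposes the difference. The absolute numbers are illustrative and are not meant to reproduce the 16 GFLOPS measured in the study, but the pattern matches what the study reports.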
Source: HPC Advisory Council (PDF file)