AMD HD7000 Graphics - Compute Optimised
Traditionally, graphics cards have primarily been designed to take data from the CPU and convert it into images on the screen. In recent years, NVidia have moved the goalposts, offloading a lot of work from the CPU onto the GPU, and in many cases processing that data faster than a CPU core could. AMD have not kept up with NVidia in the compute capabilities of their GPUs. Until now, that is.
Since the Radeon 9700 series (R300) GPUs, AMD have used a VLIW architecture for their GPU cores. This allows 4 or 5 independent instructions to execute in parallel on each Streaming Processor. For some applications this can be very efficient, but there is an inherent flaw in the design: when one instruction needs data output by another instruction, the two cannot execute in parallel. AMD's internal testing at the launch of the Cayman GPUs showed that a core using the VLIW5 architecture was, on average, using only 3.4 of its 5 slots. Shrinking from VLIW5 to VLIW4 partially improved these figures, but it was still not ideal, because all scheduling optimisation for a program has to be done by the compiler.
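To see why dependent instructions hurt VLIW, here is a plain-C sketch (illustrative only — real shader code looks nothing like this): independent operations can fill every slot of a bundle, while a dependency chain leaves most slots idle.

```c
#include <assert.h>

/* Illustration of VLIW slot packing, not real GPU code.
 * Four independent multiplies could issue together in one
 * VLIW4 bundle -- all slots do useful work in one cycle. */
float independent(float a, float b, float c, float d) {
    float w = a * 2.0f;   /* slot 1 */
    float x = b * 2.0f;   /* slot 2 */
    float y = c * 2.0f;   /* slot 3 */
    float z = d * 2.0f;   /* slot 4: all four issue together */
    return w + x + y + z;
}

/* A dependent chain forces serialisation: each multiply needs
 * the previous result, so each bundle carries only one useful
 * instruction -- roughly the "3.4 out of 5" problem above. */
float dependent(float a) {
    float x = a * 2.0f;   /* bundle 1, remaining slots idle */
    float y = x * 2.0f;   /* bundle 2, remaining slots idle */
    float z = y * 2.0f;   /* bundle 3, remaining slots idle */
    return z;
}
```

Both functions compute the same sort of arithmetic, but only the compiler's ability to find independent work decides how full the bundles are.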
AMD's new range of GPUs, the HD7000 series, introduces three new features to improve performance. The first of these is Graphics Core Next (GCN), which adds huge compute capabilities to the GPU. On top of that come Eyefinity 2.0 and App Acceleration.
GCN is built on TSMC's 28nm process, and incorporates PCIe 3.0. The biggest new feature is a move away from VLIW instructions. In place of the Streaming Processors in Cayman, we now have SIMD vector processors, which can process up to 16 data elements in a single clock cycle and have a 64KB register file.
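Conceptually, one SIMD applies a single instruction to 16 data elements at once. A minimal C sketch of that execution model (illustrative only, not actual GPU code — one pass of the loop models what the hardware does in a single clock cycle):

```c
#include <assert.h>

#define LANES 16  /* one GCN SIMD processes 16 work-items per clock */

/* Sketch of SIMD execution: the same multiply-add instruction is
 * applied to 16 data elements simultaneously. */
void simd_mad(const float *a, const float *b, float c, float *out) {
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = a[lane] * b[lane] + c;  /* same op, every lane */
}
```

On the real hardware there is no loop: all 16 lanes execute together, reading their operands from the 64KB register file.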
Combining 4 SIMDs with fetch, decode and branch logic, then adding 16KB of L1 data cache, another 16KB of shared read-only L1 cache and 32KB of read-only instruction cache, topped off with a dedicated scalar unit, we get a Compute Unit (CU). In contrast to the VLIW implementation, which needed the compiler to handle scheduling, with GCN the CU handles local scheduling itself. This can speed up code execution, at the cost of some die space, but that cost is small compared with the gains from hardware-controlled scheduling.
Additionally, code compilation is simplified now that the compiler doesn't have to handle scheduling, and with most modern processors and other hardware already using SIMD instructions, the groundwork for a GCN compiler is already in place.
The scalar unit comprises a scalar ALU and an 8KB register file. It can execute one instruction per clock cycle, and is used for "one-off" instructions, leaving the SIMD units free to execute instructions that operate in parallel on multiple data elements. Simple integer operations, conditional branches and jumps are examples of instructions that are not suited to SIMD operation.
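The division of labour can be sketched in C (function and variable names are hypothetical, purely for illustration): a value that is uniform across all work-items is computed once, which is the kind of "one-off" work the scalar unit takes, while the per-element loop body maps onto the SIMDs.

```c
#include <assert.h>

/* Sketch of scalar-vs-vector work split. Names are illustrative,
 * not a real GPU API. */
void scale_block(const float *in, float *out, int n, int level) {
    /* Uniform for every element -> scalar ALU computes it once. */
    float gain = 1.0f / (float)(1 << level);

    /* Per-element work -> SIMD units, 16 lanes per clock. */
    for (int i = 0; i < n; i++)
        out[i] = in[i] * gain;
}
```

Keeping the uniform computation off the SIMDs means none of the 16 lanes waste cycles redundantly computing the same value.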
To produce a GPU, up to 32 CUs are combined with several other elements. The CUs are grouped into arrays of 4, with each array sharing the read-only L1 cache. There is up to 768KB of shared level 2 cache, which every CU has full access to via the Pixel Pipes and ROPs. Asynchronous Compute Engines (ACEs) control compute operations, while the Graphics Command Processor routes graphics tasks to various parts of the graphics subsystem via several primitive pipes.
The ACEs control the flow of work, from accepting work to routing it to the CUs to be processed. The GPU can contain multiple ACEs, enabling the processing of multiple concurrent tasks. The ACEs control resource allocation, context switching and task priority. One result of this is a limited out-of-order execution capability, with idle tasks being re-prioritised to free up resources that are needed for other tasks.
The Graphics Command Processor routes graphics tasks via several pipelines to other parts of the graphics system. The pipelines are responsible for a number of fixed-function tasks, including tessellation, geometry and surface processing. Because GCN is fully scalable, it will be possible to handle some very large amounts of geometry. Full details of the graphics functionality are not yet available (despite several existing reviews of the HD 7970).
There are also a few features in GCN that will help developers. These include support for pointers, virtual functions, exceptions, and even recursion. This means it will be easier for developers to create new GPU-accelerated applications (or update existing code).
A unified address space means that all instructions sent to a GCN GPU will use the x86-64 address space. The GPU will be responsible for converting those addresses to local memory addresses, via an I/O Memory Mapping Unit.
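One thing this enables, together with the pointer support mentioned above, is walking CPU-built data structures directly. A C sketch of the kind of pointer-chasing involved (illustrative only — this is host code, not a GPU API):

```c
#include <assert.h>
#include <stddef.h>

/* With a unified x86-64 address space, GPU code could in principle
 * follow the same pointers the CPU built -- e.g. traversing a
 * CPU-allocated linked list, something earlier GPU architectures
 * could not express. */
struct node {
    int value;
    struct node *next;
};

int sum_list(const struct node *head) {
    int total = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        total += n->value;  /* each step dereferences a raw pointer */
    return total;
}
```

The IOMMU translation happens transparently underneath: the code uses ordinary addresses, and the hardware maps them to local memory.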
Moving back to standard GPU functionality, we have an update to Eyefinity. Eyefinity 2.0 supports a wide range of grid configurations (screen positions and layout), with flexible bezel compensation and a maximum resolution of 16K x 16K. Additionally, it is possible to have independent audio streams on each display device, a feature that will be useful when you are conferencing online.
All-in-all, if these new cards are as good as the specs say, I look forward to seeing one. For now I'll just read a few reviews to see what I'll be missing when the cards are released :)
Related News (newer articles):
Jun 22, 2012: AMD HD 7970 GHz Edition launched
Mar 20, 2012: Radeon HD 7990 Specs Detailed
Mar 05, 2012: AMD Releases Radeon 7800 Series GPUs
Feb 15, 2012: AMD launches HD 7750 and HD 7770 graphics cards
Feb 03, 2012: AMD 2012-2013 Graphics Roadmap
Jan 16, 2012: New AMD 79x0 SKU, and more Radeon HD7000(M) release dates
Jan 10, 2012: AMD Radeon HD 7970 / 7000M graphic cards are now available
Jan 07, 2012: Entry Level and Mainstream HD 7000 Graphics
Jan 06, 2012: AMD Tahiti (HD7900) GPUs Ready For Market