AMD K10 microarchitecture

  Instruction processing flow  

All instructions are first loaded into the L1 instruction cache. As soon as instructions are loaded into the cache they are pre-decoded (this is necessary to determine instruction length), and target addresses for all loaded branch instructions are predicted. Pre-decode and branch data are stored in the L1 cache together with the instructions themselves. Instructions are loaded from the L1 cache into the Fetch-Decode Unit as 32-byte blocks.

Depending on their type, loaded instructions are passed either to the DirectPath or to the VectorPath decoder, where they are decoded into macro-operations, or macro-ops. Instructions of simple and average complexity are passed to the DirectPath decoder, which translates each instruction into either one or two macro-ops. The VectorPath decoder decodes complex instructions into one or more macro-ops. Both decoders can decode instructions at a rate of three macro-ops per cycle. The decoders cannot work in parallel: only one of them can decode instructions at any given time.

Decoded macro-ops are passed to the Instruction Control Unit (ICU) and stored in a 72-entry macro-op reorder buffer. The ICU then dispatches macro-ops to the integer and floating-point schedulers. Both schedulers work independently of each other. They both break macro-ops into simpler micro-operations, or micro-ops, and pass them to the Integer Execution Unit and the Floating Point Execution Unit respectively.

The Integer Execution Unit contains three Arithmetic-Logic Units (ALUs) and three Address Generation Units (AGUs). Each ALU or AGU can execute one micro-op per cycle, so the total throughput of the Integer Execution Unit is 6 micro-ops per cycle. Any ALU can handle most micro-operations, with two exceptions:

  • All integer multiplication micro-ops are always scheduled for ALU 0.
  • The new LZCNT and POPCNT instructions are always scheduled for ALU 2.
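
The two scheduling exceptions can be illustrated with a short Python sketch. The function name `eligible_alus` and the micro-op kind labels are hypothetical, chosen only for this illustration. The sketch encodes the rule as stated in the text and is not a model of the real scheduler.

```python
# Toy model of the ALU restrictions described above (not real scheduler logic).
ALL_ALUS = (0, 1, 2)

# Hypothetical labels for micro-op kinds; only the two exceptions are special.
RESTRICTED = {
    "imul": (0,),    # integer multiplication always goes to ALU 0
    "lzcnt": (2,),   # LZCNT always goes to ALU 2
    "popcnt": (2,),  # POPCNT always goes to ALU 2
}


def eligible_alus(kind):
    """Return the ALUs that may execute a micro-op of the given kind."""
    return RESTRICTED.get(kind.lower(), ALL_ALUS)
```

For example, `eligible_alus("add")` returns all three ALUs, while `eligible_alus("imul")` returns only ALU 0.
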
The floating-point scheduler feeds the Floating Point Execution Unit, which can execute up to three x87, MMX, 3DNow!, SSE, SSE2, SSE3 and SSE4a instructions per cycle. The Execution Unit contains three pipes: FADD, FMUL and FSTOR. Each pipe has a 12-macro-op buffer, and each pipe can handle only certain types of instructions:
  • FADD can handle addition, subtraction, comparison, logical operations and some MMX data moving instructions.
  • FMUL can handle multiplication, division, logical operations and some MMX data moving instructions.
  • FSTOR can handle conversion, store and some load and data moving instructions.
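
The pipe assignment above can be sketched as a lookup table. The operation class names and the helper `pipes_for` are hypothetical labels used only for this illustration. Load and data moving operations are left out because the text says only "some" of them are handled by each pipe, without listing which ones.

```python
# Operation classes each floating-point pipe accepts, as listed in the text.
# Load and data moving operations are omitted: the text does not say which
# of them each pipe handles.
PIPE_CAPABILITIES = {
    "FADD": {"add", "sub", "compare", "logical"},
    "FMUL": {"mul", "div", "logical"},
    "FSTOR": {"convert", "store"},
}


def pipes_for(op_class):
    """Return the sorted names of the pipes that accept this operation class."""
    return sorted(
        pipe for pipe, classes in PIPE_CAPABILITIES.items() if op_class in classes
    )
```

Under these assumptions, logical operations are the only class in the table that can go to two pipes (FADD or FMUL); every other listed class is bound to a single pipe.
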
The execution status of integer and floating-point micro-ops is reported back to the ICU. Once all outstanding micro-ops of a macro-op have been executed, the macro-op is retired from the reorder buffer.

  L1 cache  

Each core has two L1 caches: a 64 KB instruction cache and a 64 KB data cache. Each L1 cache is 2-way set associative with a 64-byte line size, and the data cache supports two 128-bit loads or two 64-bit stores per cycle. Load operations can be issued out of order if there are no data dependency restrictions. If the L1 cache does not have the requested data, the CPU requests it from the L2 cache, then from the L3 cache and ultimately from system memory. Data is fetched directly into the L1 cache. If there is no available space in the L1 cache, the least recently used data is evicted to the L2 cache. Latency of the L1 cache is 3 cycles.
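
The size, associativity and line size above determine the number of sets in each L1 cache. The sketch below works this out and splits an address into tag, set index and line offset. It assumes conventional power-of-two indexing on the low address bits, which the text does not confirm for K10; the function names are illustrative.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# 64 KB, 2-way set associative, 64-byte lines, as stated in the text
sets, offset_bits, index_bits = cache_geometry(64 * 1024, 2, 64)
```

With the figures from the text this gives 512 sets, 6 offset bits and 9 index bits.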

  L2 cache  

Each core has its own exclusive 512 KB level 2 cache. The L2 cache serves only as a victim cache, that is, it contains only data evicted from the L1 cache. If the L2 cache does not have space for newly evicted data, the least recently used data is either moved into the L3 cache (if the data may be used by another core) or removed from the cache completely. When data is requested from the L2 cache, it is moved to the L1 cache and removed from the L2 cache. Latency of the L2 cache is 9 cycles. In future processors the size of the L2 cache may be increased to 1 MB, and the associativity of the cache may also change.
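
The exclusive (victim) relationship between L1 and L2 can be sketched with a toy model. The class `ExclusiveHierarchy` is hypothetical and makes simplifying assumptions that are not from the text: both levels are treated as fully associative with pure LRU replacement, capacities are counted in lines, and the L3 cache is not modelled.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy L1 + victim L2 model; fully associative LRU is an assumption."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # least recently used line first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, line):
        """Access a line and return where it was found: 'L1', 'L2' or 'memory'."""
        if line in self.l1:
            self.l1.move_to_end(line)  # mark as most recently used
            return "L1"
        if line in self.l2:
            del self.l2[line]  # moved, not copied: L2 gives the line up
            source = "L2"
        else:
            source = "memory"  # fetched directly into L1, bypassing L2
        self._fill_l1(line)
        return source

    def _fill_l1(self, line):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # evict the LRU line
            self._fill_l2(victim)
        self.l1[line] = True

    def _fill_l2(self, line):
        if len(self.l2) >= self.l2_lines:
            self.l2.popitem(last=False)  # LRU line leaves L2 (L3 not modelled)
        self.l2[line] = True
```

In this model a line is never present in both levels at once, which is the property that distinguishes an exclusive hierarchy from an inclusive one.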

  L3 cache  

The 2 MB L3 cache is shared between all 4 cores. This cache holds data that may be used by more than one core. It is a victim cache, that is, it contains only data evicted from the L2 caches. Whenever data is requested from the L3 cache, it is either copied into the L1 cache (when the data may be used by more than one core) or moved into the L1 cache. Latency of the L3 cache is variable. In future processors the size of the L3 cache may be increased up to 8 MB. It is possible that some K10-based processors will not have an L3 cache at all.

  Translation-Lookaside Buffer  

Number of TLB entries for each page size:

TLB             4 KB pages                   2 MB pages                   1 GB pages
L1 instruction  32                           16                           supported but not recommended
L1 data         48 (shared by 4 KB and 2 MB pages)                        supported
L2 instruction  512 (4-way set associative)                               not supported
L2 data         512 (4-way set associative)  128 (2-way set associative)  not supported
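
A useful way to read these figures is as TLB reach, that is, the amount of memory that can be mapped without a TLB miss: entries multiplied by page size. The sketch below uses only the entry counts that are unambiguous in the table; the names `TLB_ENTRIES` and `tlb_reach` are illustrative.

```python
KB, MB = 1024, 1024 ** 2

# (TLB, page size in bytes) -> number of entries, taken from the table above
TLB_ENTRIES = {
    ("L1 instruction", 4 * KB): 32,
    ("L1 instruction", 2 * MB): 16,
    ("L2 instruction", 4 * KB): 512,
    ("L2 data", 4 * KB): 512,
    ("L2 data", 2 * MB): 128,
}


def tlb_reach(tlb, page_size):
    """Return the number of bytes the TLB can map at the given page size."""
    return TLB_ENTRIES[(tlb, page_size)] * page_size
```

For instance, the L2 data TLB covers 2 MB of memory with 4 KB pages but 256 MB with 2 MB pages.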

  Integrated Memory Controller  

The CPU incorporates an integrated DDR2 memory controller that can operate either as a single 128-bit controller or as two 64-bit controllers.
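
As a rough illustration, peak theoretical bandwidth is the transfer rate multiplied by the bus width in bytes. The text does not state a memory speed, so the sketch below assumes DDR2-800 (800 million transfers per second) purely as an example; the function name is also illustrative.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_bits):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) for one memory channel."""
    return mega_transfers_per_s * 1e6 * (bus_bits / 8) / 1e9


DDR2_800 = 800  # assumed memory speed in MT/s; not stated in the text

single_128 = peak_bandwidth_gb_s(DDR2_800, 128)   # one 128-bit controller
dual_64 = 2 * peak_bandwidth_gb_s(DDR2_800, 64)   # two 64-bit controllers
```

Under this assumption both configurations have the same peak of about 12.8 GB/s; the two modes differ in how requests are distributed across the channels, not in the theoretical maximum.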

  HyperTransport technology  

HyperTransport is a high-speed point-to-point link that is used to communicate with peripheral or I/O devices. In multi-processor systems the HyperTransport link can also be used to interface with other processors. The K10 micro-architecture uses version 3.0 of the HyperTransport specification. This specification increases maximum bandwidth to approximately 20.8 GB per second for 16-bit links. Two new features of the HT 3.0 specification are:

  • Link splitting - one 16-bit link can be configured as two 8-bit links.
  • Retry - the hardware can detect corrupted packets and re-transmit them.
The HT 3.0 specification includes an optional dynamic link frequency and width feature. This feature can help reduce the power consumption of the HyperTransport unit, but it is not clear whether it is a part of the K10 micro-architecture.
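
The 20.8 GB per second figure quoted above can be reproduced with simple arithmetic. The sketch assumes the HT 3.0 maximum link clock of 2.6 GHz with two transfers per clock, and counts both directions of the link; the clock value comes from the HyperTransport 3.0 specification rather than from this text, and the function name is illustrative.

```python
def ht_bandwidth_gb_s(clock_ghz, link_bits, directions=2):
    """Aggregate link bandwidth in GB/s for a double-data-rate link."""
    transfers_per_ns = clock_ghz * 2   # double data rate: 2 transfers per clock
    bytes_per_transfer = link_bits / 8
    return transfers_per_ns * bytes_per_transfer * directions


HT3_MAX_CLOCK_GHZ = 2.6  # assumed maximum HT 3.0 clock; not given in the text

full_link = ht_bandwidth_gb_s(HT3_MAX_CLOCK_GHZ, 16)  # one 16-bit link
half_link = ht_bandwidth_gb_s(HT3_MAX_CLOCK_GHZ, 8)   # one 8-bit link after splitting
```

Under these assumptions a 16-bit link gives about 20.8 GB/s in total (10.4 GB/s per direction), and each 8-bit link produced by link splitting gives half of that.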

  Sideband Stack optimizer  

This unit tracks all instructions that reference the stack pointer, for example PUSH, POP and LEAVE. The unit can execute more than one of these instructions in parallel, provided that there is no dependency between them.
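
The text does not describe how the unit is implemented. One way to picture the general idea is to track a running stack-pointer offset, so that the address used by each PUSH or POP is known up front instead of waiting for the previous instruction to update the stack pointer. The sketch below illustrates only that idea; the function name, the instruction labels and the 8-byte word size are assumptions.

```python
def resolve_stack_offsets(instructions, word_size=8):
    """Return ([(instruction, offset from initial SP)], final SP offset)."""
    delta = 0
    resolved = []
    for ins in instructions:
        if ins == "push":
            delta -= word_size             # PUSH decrements SP, then stores
            resolved.append((ins, delta))
        elif ins == "pop":
            resolved.append((ins, delta))  # POP loads, then increments SP
            delta += word_size
        else:
            raise ValueError(f"unsupported instruction: {ins}")
    return resolved, delta
```

Once every instruction carries its own offset, the memory accesses of a PUSH/PUSH/POP sequence no longer form a chain through the stack pointer, which is what allows them to proceed in parallel when their data is independent.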

  Branch prediction  

K10 micro-architecture uses the following methods to predict program branches:

  • Branch Target address Buffer (BTB) table holds 2048 predicted branch addresses.
  • Global History Bimodal Counter (GHBC) table contains 16384 2-bit counters.
  • Indirect address prediction table is used to predict indirect branches. This table contains 512 addresses.
  • Return address stack holds 24 return addresses.
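
Two of the structures listed above can be sketched in a few lines. The table sizes (16384 counters, 24 return addresses) come from the text; everything else is an assumption made for illustration, in particular the gshare-style index that XORs the branch address with the global history, the initial counter value, and the policy of dropping the oldest return address on overflow. The class names are hypothetical.

```python
class TwoBitCounters:
    """Table of 2-bit saturating counters indexed with global history."""

    def __init__(self, entries=16384):
        self.entries = entries
        self.table = [1] * entries  # assumed start state: weakly not taken
        self.history = 0

    def _index(self, pc):
        # Assumed gshare-style hash; the real K10 indexing is not given in the text.
        return (pc ^ self.history) % self.entries

    def predict(self, pc):
        """Predict taken when the counter is in one of its two upper states."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0
        self.history = ((self.history << 1) | int(taken)) % self.entries


class ReturnAddressStack:
    """Fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=24):
        self.depth = depth
        self.stack = []

    def push(self, return_address):  # on CALL
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed policy: drop the oldest entry on overflow
        self.stack.append(return_address)

    def pop(self):  # on RET
        return self.stack.pop() if self.stack else None
```

A 2-bit counter needs two consecutive mispredictions to flip its prediction, so a single unusual outcome, such as a loop exit, does not disturb a strongly biased branch.
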
(c) Copyright 2003 Gennadiy Shvets