Instruction processing flow

All instructions are loaded into the L1 cache first. As soon as instructions
are loaded into the cache they are pre-decoded (this is necessary to
determine instruction length), and target addresses for all loaded
branch instructions are predicted. Pre-decode and branch data are
stored in the L1 cache together with the instructions themselves.
Instructions are fetched from the L1 cache into the Fetch-Decode Unit in
32-byte blocks. Depending on their type, the loaded instructions are passed
either to the DirectPath or to the VectorPath decoder, where they are decoded
into macro-operations, or macro-ops. Instructions of simple and average
complexity are passed to the DirectPath decoder, which
translates each instruction into either one or two macro-operations.
The VectorPath decoder decodes complex instructions into one or more
macro-operations. Each decoder can produce up to
three macro-ops per cycle. The decoders cannot work in parallel: only
one of them can decode instructions at any given time. Decoded
macro-ops are passed to the Instruction Control Unit (ICU) and stored in a
72-entry macro-op reorder buffer. The ICU then dispatches macro-ops to the
integer and floating-point schedulers. The two schedulers
work independently of each other. They both break macro-ops into
simpler micro-operations, or micro-ops, and pass them to the Integer
Execution Unit and the Floating Point Execution Unit respectively.
The Integer Execution Unit contains three Arithmetic-Logic Units (ALUs) and
three Address Generation Units (AGUs). Each ALU or AGU can execute
one micro-op per cycle, so the total throughput of the Integer Unit is 6
micro-ops per cycle. Any ALU can handle most micro-operations, with
two exceptions:
- All integer multiplication micro-ops are always scheduled for ALU 0.
- The new LZCNT and POPCNT instructions are always scheduled for ALU 2.
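
The scheduling constraints above can be sketched as a small lookup. This is only an illustrative model, not a description of the actual scheduler logic: the micro-op kind labels (`"imul"`, `"lzcnt"`, `"popcnt"`, `"add"`) and the function names are hypothetical.

```python
# Illustrative model of the ALU constraints described in the text.
# The micro-op kind labels used here are hypothetical, not hardware names.
ALL_ALUS = (0, 1, 2)

# Micro-op kinds that are tied to a single ALU (the two exceptions).
RESTRICTED = {
    "imul": (0,),    # integer multiplication: always ALU 0
    "lzcnt": (2,),   # LZCNT: always ALU 2
    "popcnt": (2,),  # POPCNT: always ALU 2
}


def eligible_alus(kind):
    """Return the ALUs that may execute a micro-op of the given kind."""
    return RESTRICTED.get(kind, ALL_ALUS)


def peak_integer_throughput(alus=3, agus=3):
    """Each ALU and each AGU executes one micro-op per cycle."""
    return alus + agus


print(eligible_alus("add"), eligible_alus("imul"), peak_integer_throughput())
```

In this model a long run of multiplications is limited to one micro-op per cycle, because only ALU 0 is eligible for them.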
The floating-point scheduler feeds the Floating Point Execution Unit, which
can execute up to three x87, MMX, 3DNow!, SSE, SSE2, SSE3 and SSE4a
instructions per cycle. The Execution Unit contains three pipes:
FADD, FMUL and FSTOR. Each pipe has a 12-entry macro-op buffer. Each of these
pipes can handle only certain types of instructions:
- FADD can handle addition, subtraction, comparison, logical operations and some MMX data moving instructions.
- FMUL can handle multiplication, division, logical operations and some MMX data moving instructions.
- FSTOR can handle conversion, store and some load and data moving instructions.
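
As a rough illustration, the pipe capabilities listed above can be written as a table and queried. The operation class names are hypothetical labels chosen for this sketch, and it ignores the "some" qualifiers in the list (it treats every data moving instruction of a class as eligible).

```python
# Operation classes each floating-point pipe can handle, per the list above.
# Class names are illustrative labels; "some" qualifiers are not modelled.
PIPE_OPS = {
    "FADD": {"add", "sub", "compare", "logical", "mmx_move"},
    "FMUL": {"mul", "div", "logical", "mmx_move"},
    "FSTOR": {"convert", "store", "load", "move"},
}


def pipes_for(op_class):
    """Return the sorted list of pipes able to execute the operation class."""
    return sorted(pipe for pipe, ops in PIPE_OPS.items() if op_class in ops)


print(pipes_for("logical"))  # handled by both FADD and FMUL
print(pipes_for("div"))      # handled by FMUL only
```

Operations that only one pipe accepts, such as division or stores, cannot be spread across pipes, so a stream dominated by them will not reach three instructions per cycle.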
The status of execution of integer and floating-point micro-ops is
sent to the ICU. When all outstanding micro-ops for one macro-op have
executed, the macro-op is retired from the reorder buffer.

L1 cache

Each core has two L1 caches: a 64 KB instruction cache and a 64 KB
data cache. The L1 cache is 2-way set associative, with a 64-byte line size,
and supports two 128-bit loads or two 64-bit stores per cycle. Load
operations can be issued out of order if there are no data dependency
restrictions. If the L1 cache doesn't have the requested data, the CPU
requests it from the L2 cache, then from the L3 cache and ultimately
from system memory. Data is fetched directly into the L1 cache. If there is
no space available in the L1 cache, then the least recently used data is
evicted to L2 cache. Latency of L1 cache is 3 cycles.

L2 cache

Each core has its own exclusive 512 KB level 2 cache. L2 cache only
serves as a victim cache, that is, it contains only data evicted from the L1
cache. If the level 2 cache doesn't have space available for newly evicted
data, then the least recently used data is either moved into the L3 cache (if
the data may be used by another core) or removed from the caches
completely. When data is requested from the L2 cache, it is moved to the level 1
cache and removed from the level 2 cache. Latency of L2 cache is 9 cycles.
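
The exclusive L1/L2 relationship described here can be illustrated with a toy model. It is a simplified sketch under stated assumptions: capacities are counted in whole lines and are deliberately tiny, associativity and the L3 cache are not modelled, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class ExclusiveL1L2:
    """Toy model of an L1 cache backed by an exclusive victim L2.

    Capacities are in cache lines; associativity and L3 are not modelled.
    """

    def __init__(self, l1_lines=2, l2_lines=4):
        self.l1 = OrderedDict()  # least recently used line comes first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, line):
        """Access a line; return where it was found: 'L1', 'L2' or 'memory'."""
        if line in self.l1:
            self.l1.move_to_end(line)  # refresh its LRU position
            return "L1"
        if line in self.l2:
            del self.l2[line]          # exclusive: the line leaves L2
            source = "L2"
        else:
            source = "memory"          # stands in for L3 or system memory
        self._fill_l1(line)
        return source

    def _fill_l1(self, line):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # evict the LRU line
            self._fill_l2(victim)                    # victim moves to L2
        self.l1[line] = True

    def _fill_l2(self, line):
        if len(self.l2) >= self.l2_lines:
            self.l2.popitem(last=False)  # LRU line leaves L2 (to L3 or dropped)
        self.l2[line] = True


cache = ExclusiveL1L2()
print([cache.access(x) for x in ["A", "B", "C", "A"]])
```

In this run the third access evicts line A from L1 into L2, and the fourth access finds A in L2, moves it back into L1 and pushes B out to L2, so no line is ever held in both levels at once.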
In the future the size of the L2 cache may be increased to 1 MB, and the
associativity of the cache may also change.

L3 cache

The 2 MB L3 cache is shared between all 4 cores. This cache holds
data that may be used by more than one core. It is a victim cache,
that is, it contains only data evicted from the L2 cache. Whenever data is
requested from the L3 cache, it is either copied into the L1 cache (if
the data may be used by more than one core) or moved into the L1
cache. Latency of the L3 cache is variable. In the future the size of the L3
cache may be increased up to 8 MB. It's possible that some K10-based
processors won't have an L3 cache at all.

Translation-Lookaside Buffer

| Cache | 4 KB pages | 2 MB pages | 1 GB pages |
| --- | --- | --- | --- |
| L1 instruction | 32 | 16 | supported but not recommended |
| L1 data | 48 | supported |
| L2 instruction | 512 (4-way set associative) | | not supported |
| L2 data | 512 (4-way set associative) | 128 (2-way set associative) | not supported |
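
One way to read the entry counts in the table is as "TLB reach", the amount of memory that can be mapped without a TLB miss, which is simply entries multiplied by page size. The sketch below works this out for the rows whose cells are unambiguous; the helper name `tlb_reach` is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of address space covered by a TLB with the given entry count."""
    return entries * page_size


# Entry counts taken from the table above.
reach = {
    ("L1 instruction", "4 KB"): tlb_reach(32, 4 * KB),
    ("L1 instruction", "2 MB"): tlb_reach(16, 2 * MB),
    ("L2 data", "4 KB"): tlb_reach(512, 4 * KB),
    ("L2 data", "2 MB"): tlb_reach(128, 2 * MB),
}

for (tlb, page), size in reach.items():
    print(f"{tlb:15s} {page} pages: {size // KB} KB")
```

With 4 KB pages the L2 data TLB covers 2 MB, while the same TLB with 2 MB pages covers 256 MB, which is why large pages help workloads with big working sets.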
Integrated Memory Controller

The CPU incorporates an integrated DDR2 memory controller that can be
used as a single 128-bit controller or as two 64-bit controllers.

HyperTransport technology

HyperTransport is a high-speed point-to-point link that is used to
communicate with peripheral or I/O devices. In multi-processor systems
the HyperTransport link can also be used to interface with other
processors. The K10 micro-architecture implements the HyperTransport 3.0
specification, which increases maximum
bandwidth to approximately 20.8 GB per second for 16-bit links. Two
new features of the HT 3.0 specification are:
- Link splitting - one 16-bit link can be configured as two 8-bit links.
- Retry - the hardware can detect corrupted packets and re-transmit them.
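
The 20.8 GB per second figure for a 16-bit link can be reproduced with simple arithmetic. The sketch below assumes the HT 3.0 maximum link clock of 2.6 GHz with two transfers per clock, and assumes the figure counts both directions of the link together; neither detail is stated in the text above, and the function name is hypothetical.

```python
def ht_bandwidth_gb_s(link_width_bits, clock_ghz=2.6, bidirectional=True):
    """Peak HyperTransport bandwidth in GB/s.

    Assumes double data rate signalling (two transfers per clock) and,
    by default, sums the bandwidth of both link directions.
    """
    transfers_per_ns = clock_ghz * 2                     # GT/s
    per_direction = transfers_per_ns * link_width_bits / 8
    return per_direction * (2 if bidirectional else 1)


print(round(ht_bandwidth_gb_s(16), 1))                       # one 16-bit link
print(round(ht_bandwidth_gb_s(8), 1))                        # one half of a split link
print(round(ht_bandwidth_gb_s(16, bidirectional=False), 1))  # one direction only
```

Under these assumptions, splitting a 16-bit link into two 8-bit links leaves the total bandwidth unchanged and gives each of the two links half of it.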
The HT 3.0 specification also includes an optional dynamic link frequency and
width feature. This feature can help reduce HyperTransport unit power
consumption, but it's not clear whether it is part of the K10
micro-architecture.

Sideband Stack optimizer

This unit tracks all instructions that reference the stack pointer, for
example PUSH, POP, LEAVE and other instructions. The unit can execute
more than one of those instructions in parallel, provided that there is
no dependency between them.

Branch prediction

The K10 micro-architecture uses the following structures to predict program branches:
- The Branch Target address Buffer (BTB) table holds 2048 predicted branch addresses.
- The Global History Bimodal Counter (GHBC) table contains 16384 2-bit counters.
- The indirect address prediction table is used to predict indirect branches. This table contains 512 addresses.
- The return address stack holds 24 return addresses.
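
To make the "2-bit counter" idea concrete, here is a minimal sketch of a table of 2-bit saturating counters sized like the GHBC. It is not the K10 design: the text does not say how the table is indexed, so the sketch simply indexes by branch address modulo the table size and ignores global history. The initial counter value is also an assumption.

```python
class TwoBitCounterTable:
    """Table of 2-bit saturating counters (values 0..3).

    Values 0 and 1 predict "not taken", values 2 and 3 predict "taken".
    Indexing by address modulo table size is an assumption of this sketch.
    """

    def __init__(self, entries=16384):
        self.entries = entries
        self.counters = [1] * entries  # start at "weakly not taken" (assumed)

    def _index(self, branch_address):
        return branch_address % self.entries

    def predict(self, branch_address):
        """Return True if the branch is predicted taken."""
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        """Move the counter one step toward the actual outcome, saturating."""
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

    def storage_bytes(self):
        """Storage needed for the counters alone, at 2 bits per entry."""
        return self.entries * 2 // 8


table = TwoBitCounterTable()
table.update(0x401000, taken=True)
print(table.predict(0x401000), table.storage_bytes())
```

Because the counter saturates, a single mispredicted iteration of a mostly-taken loop branch does not flip the prediction; it takes two consecutive outcomes in the other direction to do so.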