AMD K10 microarchitecture
Instruction processing flow
All instructions are loaded into the L1 instruction cache first. As instructions are loaded into the cache they are pre-decoded (this is necessary to determine instruction length), and target addresses of all loaded branch instructions are predicted. Pre-decode and branch data are stored in the L1 cache together with the instructions themselves. Instructions are fetched from the L1 cache into the Fetch-Decode Unit in 32-byte blocks.

Depending on their type, fetched instructions are passed either to the DirectPath or the VectorPath decoder, where they are decoded into macro-operations, or macro-ops. Simple and moderately complex instructions go to the DirectPath decoder, which translates each instruction into one or two macro-ops. The VectorPath decoder handles complex instructions, decoding each into one or more macro-ops. Both decoders can emit up to three macro-ops per cycle, but they cannot work in parallel: only one of them can decode instructions at any given time.

Decoded macro-ops are passed to the Instruction Control Unit (ICU) and stored in a 72-entry reorder buffer. The ICU then dispatches instructions to the integer and floating-point schedulers, which work independently of each other. Each scheduler breaks macro-ops into simpler micro-operations, or micro-ops, and passes them to the Integer Execution Unit or the Floating Point Execution Unit respectively. The Integer Execution Unit contains three Arithmetic-Logic Units (ALUs) and three Address Generation Units (AGUs). Each ALU/AGU can execute one micro-op per cycle, so the total throughput of the Integer Unit is six micro-ops per cycle. The ALUs can handle most micro-operations, with two exceptions:
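The decode-stage behaviour described above can be sketched as a toy throughput model. This is a hypothetical illustration, not a cycle-accurate simulator: it only encodes the rules that DirectPath instructions produce one or two macro-ops, VectorPath instructions may produce several, both decoders emit up to three macro-ops per cycle, and only one decoder is active in any cycle.

```python
# Toy sketch of K10 decode throughput (illustrative assumptions only):
# DirectPath instructions yield 1-2 macro-ops and can be packed up to
# 3 macro-ops per cycle; a VectorPath instruction occupies the decode
# stage by itself, emitting up to 3 macro-ops per cycle.

def decode_cycles(instructions):
    """Estimate cycles to decode a list of (path, macro_op_count) pairs."""
    cycles = 0
    i = 0
    while i < len(instructions):
        path, ops = instructions[i]
        if path == "vector":
            cycles += -(-ops // 3)          # ceil(ops / 3) cycles, alone
            i += 1
        else:
            packed = 0                      # pack DirectPath macro-ops
            while (i < len(instructions)
                   and instructions[i][0] == "direct"
                   and packed + instructions[i][1] <= 3):
                packed += instructions[i][1]
                i += 1
            cycles += 1
    return cycles

# Three 1-macro-op DirectPath instructions decode in a single cycle.
print(decode_cycles([("direct", 1), ("direct", 1), ("direct", 1)]))  # 1
```

Note how a single VectorPath instruction blocks DirectPath decoding for the cycles it occupies, which is why compilers for this generation preferred DirectPath-friendly instruction selections.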
Each core has two L1 caches: a 64 KB instruction cache and a 64 KB data cache. The L1 caches are 2-way set associative with a 64-byte line size, and the data cache supports two 128-bit loads or two 64-bit stores per cycle. Load operations can be issued out of order if there are no data dependency restrictions. If the L1 cache doesn't have the requested data, the CPU requests it from the L2 cache, then from the L3 cache, and ultimately from system memory. Data is fetched directly into the L1 cache; if there is no available space there, the least recently used data is evicted to the L2 cache. L1 cache latency is 3 cycles.
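The geometry above determines how an address is split for a lookup: 64 KB divided by 64-byte lines gives 1024 lines, and with 2 ways that is 512 sets, so an address decomposes into a 6-bit line offset, a 9-bit set index, and a tag. A minimal sketch of that arithmetic:

```python
# Address decomposition for a 64 KB, 2-way set-associative cache with
# 64-byte lines (the L1 geometry described above). Illustrative only.

CACHE_SIZE = 64 * 1024
LINE_SIZE  = 64
WAYS       = 2
SETS       = CACHE_SIZE // (LINE_SIZE * WAYS)   # 512 sets

OFFSET_BITS = LINE_SIZE.bit_length() - 1        # 6 bits
INDEX_BITS  = SETS.bit_length() - 1             # 9 bits

def split_address(addr):
    """Return (tag, set index, byte offset) for a lookup."""
    offset = addr & (LINE_SIZE - 1)
    index  = (addr >> OFFSET_BITS) & (SETS - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x12345))   # (2, 141, 5)
```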
Each core has its own exclusive 512 KB level 2 cache. The L2 cache serves only as a victim cache, that is, it contains only data evicted from the L1 cache. If the L2 cache has no available space for newly evicted data, the least recently used data is either moved into the L3 cache (if it may be used by another core) or removed from the cache hierarchy entirely. When data is requested from the L2 cache, it is moved into the L1 cache and removed from the L2 cache. L2 cache latency is 9 cycles. In future products the size of the L2 cache may be increased to 1 MB, and its associativity may also change.
The 2 MB L3 cache is shared between all 4 cores and holds data that may be used by more than one core. It is also a victim cache, that is, it contains only data evicted from the L2 cache. When data is requested from the L3 cache, it is either copied into the L1 cache (leaving a copy in L3 when the data may be used by another core) or moved into the L1 cache and removed from L3. L3 cache latency is variable. In future products the size of the L3 cache may be increased up to 8 MB, and some K10-based processors may have no L3 cache at all.
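The exclusive (victim) organization described across the three cache levels can be sketched as a toy model: a line lives in at most one level, a hit in a lower level promotes the line to L1, and each eviction ripples one level down. Capacities here are arbitrary small values, not the real 64 KB / 512 KB / 2 MB sizes, and the L3 sharing heuristic is omitted.

```python
# Toy model of an exclusive L1/L2/L3 victim-cache hierarchy.
# LRU order is modelled with an OrderedDict; sizes are toy values.
from collections import OrderedDict

class Level:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()              # line address -> present

    def insert(self, addr):
        """Insert a line; return the LRU victim if the level overflows."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None

def access(l1, l2, l3, addr):
    """Look up addr, promote it to L1, and ripple victims downward."""
    for level in (l2, l3):
        if addr in level.lines:
            del level.lines[addr]               # exclusive: remove on promote
            break
    if addr in l1.lines:
        l1.lines.move_to_end(addr)              # L1 hit, refresh LRU
        return
    victim = l1.insert(addr)                    # fill straight into L1
    if victim is not None:
        victim = l2.insert(victim)              # L1 victim goes to L2
        if victim is not None:
            l3.insert(victim)                   # L2 victim goes to L3

l1, l2, l3 = Level(2), Level(4), Level(8)
for a in range(4):
    access(l1, l2, l3, a)
# Lines 0 and 1 were evicted from L1 into L2; 2 and 3 remain in L1.
print(sorted(l1.lines), sorted(l2.lines))       # [2, 3] [0, 1]
```

The key property the sketch demonstrates is exclusivity: total usable capacity is the sum of all three levels, at the cost of moving lines between levels on every promotion.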
Integrated Memory Controller
The CPU incorporates an integrated DDR2 memory controller that can operate either as a single 128-bit controller or as two independent 64-bit controllers.
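As a rough illustration of what the 128-bit mode buys, peak bandwidth is just transfer rate times bus width. DDR2-800 is an assumed example speed grade here, not something stated above:

```python
# Peak-bandwidth arithmetic for the integrated memory controller.
# DDR2-800 (800 million transfers/s) is an assumed example speed grade.
transfers_per_sec = 800e6
bus_width_bytes   = 128 // 8        # single 128-bit mode = 16 bytes/transfer

peak = transfers_per_sec * bus_width_bytes
print(peak / 1e9)                   # 12.8 GB/s peak in 128-bit mode
```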
HyperTransport is a high-speed point-to-point link used to communicate with peripheral and I/O devices. In multi-processor systems the HyperTransport link can also be used to interface with other processors. The K10 micro-architecture implements the HyperTransport 3.0 specification, which increases maximum bandwidth to approximately 20.8 GB per second for 16-bit links. Two new features of the HT 3.0 specification are:
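The ~20.8 GB/s figure quoted above can be reproduced with simple arithmetic, assuming the 2.6 GHz maximum link clock defined by HT 3.0 and double-data-rate signalling on a 16-bit link in each direction:

```python
# Reconstructing the ~20.8 GB/s aggregate bandwidth of a 16-bit
# HyperTransport 3.0 link (assumes the 2.6 GHz HT 3.0 link clock).
link_clock = 2.6e9                  # Hz
transfers  = link_clock * 2         # DDR: two transfers per clock
link_bytes = 16 // 8                # 16-bit link = 2 bytes per transfer

per_direction = transfers * link_bytes          # 10.4 GB/s each way
aggregate     = per_direction * 2               # both directions combined
print(aggregate / 1e9)              # 20.8
```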
Sideband Stack optimizer
This unit tracks all instructions that reference the stack pointer, such as PUSH, POP, LEAVE and others. It can execute more than one of these instructions in parallel, provided there are no dependencies between them.
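The idea can be sketched as follows: instead of each PUSH or POP waiting for the previous one to update the stack pointer, a decode-time unit tracks a running displacement and rewrites each access with an explicit offset from the stack pointer's value on entry. The function name, the 8-byte slot size, and the two-operation instruction set are assumptions made for this illustration:

```python
# Illustrative sketch of sideband stack-offset tracking: each stack
# access gets a known offset at decode time, breaking the serial
# dependency chain through the stack pointer. 8-byte slots assumed.

def resolve_stack_offsets(ops, slot=8):
    """Map a PUSH/POP sequence to (op, offset-from-entry-SP) pairs."""
    delta = 0
    resolved = []
    for op in ops:
        if op == "PUSH":
            delta -= slot
            resolved.append((op, delta))    # store at SP-8, SP-16, ...
        elif op == "POP":
            resolved.append((op, delta))    # load from the current slot
            delta += slot
    return resolved

# Every access now carries an explicit offset, so the loads and stores
# can issue in parallel; only the final SP update needs the total delta.
print(resolve_stack_offsets(["PUSH", "PUSH", "POP", "POP"]))
```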
The K10 micro-architecture uses the following methods to predict program branches:
(c) Copyright 2003 Gennadiy Shvets