Today's superscalar processors rename registers, bypass registers, checkpoint state so that they can recover from speculative execution, check for dependencies, allocate execution units, and access multi-ported register files.
The circuits employed are complex and irregular, requiring much effort and ingenuity to implement well. Furthermore, the delays through many of the circuits grow quadratically with issue width (the maximum number of simultaneously fetched or issued instructions) and window size (the maximum number of instructions within the processor core), making future scaling of today's designs problematic. With billion-transistor chips on the horizon, this scalability barrier appears to be one of the most serious obstacles for high-performance uniprocessors in the next decade.
Surprisingly, it is possible to extract the same instruction-level parallelism (ILP) with a regular circuit structure that has only logarithmic gate delay and linear wire delay (speed-of-light delay) or even sublinear wire delay, depending on how much memory bandwidth is required for the processor. This paper describes a new processor microarchitecture, called the Ultrascalar processor.
AMD is working on making the Toledo Ultrascalar, and the Egypt core is confirmed to be. The Ultrascalar chips (known as HyperScalar by AMD) will be able to process many, many times more information in a given binary pattern without extremely high clock rates. AMD will most likely use a 1.6-2.0GHz rating on the HyperScalar series of the Toledo, and the Egypt, upon further development, will implement a 1.8-2.4GHz HyperScalar core. SMT (simultaneous multithreading, which Intel calls Hyper-Threading) will also be introduced on the Toledo.
The Ultrascalar processor core performs the same functions as a typical superscalar processor core. It renames registers, analyzes register and memory data dependencies, executes instructions out of order, forwards results, efficiently reverts from mispredictions, and commits and retires instructions. The Ultrascalar processor core is much more regular and has lower asymptotic critical-path length than today's superscalars, however. In fact, all the scaling circuits within the processor core are instances of a single algorithm, parallel prefix, implemented in VLSI. Because of the core's simplicity, it is easy to see how the number of gates within a critical path grows with the issue width and window size. The core does not include the memory and branch prediction subsystems. Instead, the core presents the same interface to the instruction fetch unit and the data cache as today's superscalar processor cores. Predicted instruction sequences enter the core, and data load and store requests are initiated by the core. The Ultrascalar core will benefit from any advances in effective instruction fetch rate and in data memory bandwidth that can be applied to traditional superscalar processors. In particular, since the Ultrascalar processor core performs the same functions as the core of today's superscalars, it achieves the same CPI (cycles per instruction) performance as existing superscalars when attached to a traditional 4-instruction-wide fetch unit using traditional branch prediction techniques and a traditional cache organization. As effective fetch rates and data bandwidths increase, the Ultrascalar core can scale gracefully, lowering the CPI without exploding the cycle time.
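To make the "parallel prefix" bit less abstract, here is a rough Python sketch of how a prefix scan can forward register values in a logarithmic number of rounds. This is only an illustration, not the circuit from the paper: the names `parallel_prefix`, `latest_writer` and `visible_values` are made up for this post, the scan is a plain (non-cyclic) one over a single register, and `None` is assumed to mean "this instruction does not write the register".

```python
from typing import Callable, List, Optional, Tuple

Value = Optional[int]  # None = "no write to this register" (assumption of this sketch)


def latest_writer(left: Value, right: Value) -> Value:
    """Associative operator: the later write wins, otherwise pass the earlier value on."""
    return right if right is not None else left


def parallel_prefix(xs: List[Value], op: Callable[[Value, Value], Value]) -> Tuple[List[Value], int]:
    """Inclusive scan done in rounds; returns the scan and the number of rounds used.

    All positions within a round are independent of each other, so hardware
    could evaluate a whole round at once. The round count grows as log2(n).
    """
    out = list(xs)
    step, rounds = 1, 0
    while step < len(out):
        # Combine each element with the one `step` positions earlier.
        out = [out[i] if i < step else op(out[i - step], out[i]) for i in range(len(out))]
        step *= 2
        rounds += 1
    return out, rounds


def visible_values(committed: int, writes: List[Value]) -> List[Value]:
    """Value of the register as seen by each instruction in the window, in program order."""
    scan, _ = parallel_prefix([committed] + writes, latest_writer)
    return scan[:-1]  # entry i is the newest value produced before instruction i


if __name__ == "__main__":
    # Six instructions in the window; instructions 1 and 4 write the register.
    print(visible_values(5, [None, 7, None, None, 9, None]))  # [5, 5, 7, 7, 7, 9]
```

With 7 scan inputs (the committed value plus 6 instructions) the loop runs 3 rounds, and doubling the window adds only one more round, which is the kind of logarithmic gate-delay growth the quoted text is talking about.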
English translation: Gian and Matt are dumb and could not understand what AMD ZEN wrote there. Who is right now, you bastard Giancarlo?