Lion Cove

Article snapshot taken from Wikipedia with Creative Commons Attribution-ShareAlike license.
Latest revision as of 03:30, 5 December 2024

CPU architecture designed by Intel

Lion Cove is a 64-bit, two-way, x86 CPU core architecture designed by Intel. The Lion Cove core is featured in Core Ultra Series 2 Arrow Lake and Lunar Lake processors.

Architecture

Lion Cove is a performance core architecture aimed at providing high compute performance with wider integer and vector execution units, wider fetch and increased core frequencies compared to Intel's density-optimized E-core architectures. Intel claims a 14% increase in instructions per cycle (IPC) with the Lion Cove P-core over Redwood Cove. Intel approached the Lion Cove design process with the intention to "remove any transistor from the design that doesn't directly contribute to productivity", stripping down the core design in order to focus on single-threading and core area efficiency. Ori Lempel served as Senior Principal Engineer for the Lion Cove P-core design.

Front end

The front end of the Lion Cove core, which fetches, decodes and issues instructions, has been made wider and deeper. There is 8-way decoding of instructions from the Instruction Queue, up from 6-way decode in Redwood Cove. Likewise, Lion Cove's Out-of-Order Engine uses an 8-way allocation/rename queue, increased from Redwood Cove's 6-way queue. The Out-of-Order Engine has split the renamers and scheduling into dedicated integer and vector domains, which allows Intel to modify each of these domains independently in future designs without requiring a complete redesign of the Out-of-Order Engine. Both of these domains have their own individual access to the micro-op queue. The larger op cache and longer op queue benefit efficiency: the more micro-ops that can be served from the larger cache, the less often the decode logic has to be powered up again.

Feature                  Redwood Cove         Lion Cove
Decode                   6-way                8-way
Allocation/Rename        6-way                8-way
Retirement               8-wide               12-wide
Deep instruction window  512                  576
Execution Ports          12                   18
Op Cache                 4096 entry, 8-way    5250 entry, 12-way
Op Queue                 144 entry            192 entry
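
As a rough illustration of what the wider decode stage means, the following minimal sketch computes a best-case lower bound on the number of cycles needed to decode a straight-line run of instructions. Only the 6-way and 8-way widths come from the table above; the function name, the 48-instruction run and the assumption that the decoders are fully fed every cycle (no stalls, no op cache hits) are illustrative simplifications, not a description of the real pipeline.

```python
import math

# Decode widths taken from the table above; everything else is an assumption.
DECODE_WIDTH = {"Redwood Cove": 6, "Lion Cove": 8}


def min_decode_cycles(num_instructions: int, width: int) -> int:
    """Best-case cycles to decode `num_instructions` with `width` decoders.

    Assumes every decode slot is filled on every cycle, which real code
    rarely achieves, so this is a lower bound rather than a prediction.
    """
    return math.ceil(num_instructions / width)


if __name__ == "__main__":
    run_length = 48  # hypothetical straight-line instruction count
    for core, width in DECODE_WIDTH.items():
        print(f"{core}: at least {min_decode_cycles(run_length, width)} cycles")
```

In practice the op cache bypasses the decoders for hot code, so the bound mainly matters for code that misses in the op cache.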

Branch Predictor

Branch prediction has been strengthened in Lion Cove, with the core's prediction block being 8 times wider than in Redwood Cove. The branch predictor in a core tries to predict which path will be taken when code reaches a branch with diverging paths. Lion Cove's L0 Branch Target Buffer (BTB) cache has been doubled to 256 entries so that it can store more target addresses of taken branches, which helps predict the next branch and reduces the number of misses.

Buffer (entries)  Redwood Cove  Lion Cove
L0 BTB            128           256
L1 BTB            5K            6K
L2 BTB            12K           12K
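
The effect of BTB capacity on misses can be sketched with a toy model. The code below is a minimal sketch only: it assumes a direct-mapped buffer indexed by the low bits of the branch address, which is not how the source describes Lion Cove's BTB (its organization is not given here). The class `ToyBTB`, the function `run` and the workload of 200 branches repeated 10 times are all hypothetical; only the 128 and 256 entry counts come from the table above.

```python
class ToyBTB:
    """Direct-mapped branch target buffer: one (branch_pc, target) per slot.

    This organization is an assumption made for illustration only.
    """

    def __init__(self, entries: int) -> None:
        self.entries = entries
        self.slots = {}  # slot index -> (branch_pc, target)
        self.hits = 0
        self.misses = 0

    def lookup_and_update(self, branch_pc: int, target: int) -> bool:
        """Return True on a hit; on a miss, install the branch in its slot."""
        index = (branch_pc >> 2) % self.entries
        hit = self.slots.get(index) == (branch_pc, target)
        if hit:
            self.hits += 1
        else:
            self.misses += 1
            self.slots[index] = (branch_pc, target)
        return hit


def run(entries: int, num_branches: int = 200, passes: int = 10) -> ToyBTB:
    """Replay a hypothetical loop body containing `num_branches` taken branches."""
    btb = ToyBTB(entries)
    for _ in range(passes):
        for i in range(num_branches):
            pc = i * 4  # hypothetical, evenly spaced branch addresses
            btb.lookup_and_update(pc, pc + 64)
    return btb


if __name__ == "__main__":
    for entries in (128, 256):
        btb = run(entries)
        print(f"{entries} entries: {btb.hits} hits, {btb.misses} misses")
```

In this toy workload the 200 branches do not fit in 128 slots, so conflicting branches keep evicting each other, while 256 slots hold all of them after the first pass.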

Execution Engine

Integer Unit

Lion Cove increases the number of integer Arithmetic Logic Units (ALUs) to six. Redwood Cove contained five ALUs that used a 256-bit wide pipe. The number of integer multiply units has risen from 1 to 3, which means that the core can execute more than one integer multiply operation per cycle.

Vector Engine

Intel's vector engine design in Lion Cove now more closely resembles that used by AMD since Zen with four pipes for floating point and vector execution. Two of those pipes deal with floating point multiplies and multiply-adds, while the two other pipes handle floating point adds. The number of floating point dividers has increased from 1 to 2 with improved throughput. For handling sort-vector instructions, the vector engine contains 4 SIMD ALUs, up from 3 in Redwood Cove.

Lion Cove supports AVX-512 instructions, but this support is disabled in heterogeneous processor generations like Arrow Lake and Lunar Lake. This is no different from Golden Cove, Raptor Cove or Redwood Cove, which had their AVX-512 support disabled in all heterogeneous non-server products.

Cache

Lion Cove introduces an expanded cache hierarchy with four caching tiers rather than three. With select Broadwell SKUs in 2015, Intel added a 128 MB eDRAM that acted like a fourth-level cache. However, this eDRAM was not a traditional cache: it was placed on a separate die as a form of slower shared memory between the CPU cores and graphics, with the intended purpose of reducing memory access requests. Broadwell's L3 cache had three times lower latency in cycles and over triple the bandwidth compared to its eDRAM. The last time Intel added a new level of traditional cache was in 2003, with the L3 cache on the Pentium 4 Extreme Edition.

Cache  Property       Redwood Cove                    Lion Cove
L0D    Size           48 KB                           48 KB
       Associativity  12-way                          12-way
       Latency        5-cycles                        4-cycles
       Bandwidth      128B/clk                        128B/clk
L0I    Size           64 KB                           64 KB
       Associativity  6-way                           16-way
       Latency        -cycles                         -cycles
       Bandwidth      32B/clk                         128B/clk
L1     Size           n/a                             192 KB
       Associativity  n/a                             12-way
       Latency        n/a                             9-cycles
       Bandwidth      n/a                             2×64B/clk
L2     Size           2 MB                            2.5–3 MB
       Associativity  16-way                          10-way
       Latency        16-cycles                       17-cycles
       Bandwidth      __B/clk                         2×64B/clk
L3     Size           4 MB                            4 MB
       Associativity  12-way                          12-way
       Latency        75-cycles                       51-cycles
       Bandwidth      32B/clk Read, 32B/clk Write     32B/clk Read, 32B/clk Write

L0

Lion Cove's L0 caches correspond to what other CPU core architectures call the L1 data and instruction caches. Even though Intel has kept the larger L0 cache sizes of its recent core architectures, it has reduced the load-to-use latency from 5 cycles in Redwood Cove to 4 cycles, a figure not seen since Skylake.

L1

The new 192 KB L1 cache in the Lion Cove core acts as a mid-level buffer cache between the L0 data and instruction caches inside the core and the L2 cache outside the core. It is focussed on reducing latency in the event of L0 data cache misses rather than needing to access the L2 cache. Accessing data in the L1 cache comes with a 9-cycle latency which is nearly half the latency that comes with accessing the L2 cache.
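
The benefit of a mid-level cache can be illustrated with a simple expected-latency calculation. This is a minimal sketch under stated assumptions: the 4-cycle, 9-cycle and 17-cycle latencies are the Lion Cove figures quoted in this article, but the hit rates (90% in L0, 50% in L1) are made-up values for illustration, the quoted latencies are treated as total load-to-use latencies for a hit at each level, and every access is assumed to hit in L2 at the latest.

```python
def expected_latency(levels):
    """Expected load latency in cycles for a list of cache levels.

    `levels` is ordered from nearest to farthest; each item is
    (latency_cycles, hit_rate), where hit_rate is the fraction of the
    accesses reaching that level which hit there. The last level should
    use a hit rate of 1.0 so that every access is eventually served.
    """
    expected = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return expected


# Hit rates below are hypothetical; latencies are the article's figures.
with_l1 = expected_latency([(4, 0.9), (9, 0.5), (17, 1.0)])
without_l1 = expected_latency([(4, 0.9), (17, 1.0)])

if __name__ == "__main__":
    print(f"with L1:    {with_l1:.2f} cycles")
    print(f"without L1: {without_l1:.2f} cycles")
```

Under these assumed hit rates the mid-level cache trims the average from about 5.3 to about 4.9 cycles; the real gain depends entirely on how often L0 misses actually hit in the L1.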

L2

The L2 cache is important to the Lion Cove core architecture because Intel relies on it to insulate the cores from the L3 cache's slow performance. Lion Cove was designed to accommodate L2 caches configurable from 2.5 MB up to 3 MB depending on the product. Lunar Lake's Lion Cove implementation contains a 2.5 MB L2 cache while the Lion Cove variant in Arrow Lake contains a 3 MB L2 cache. Lion Cove's larger L2 cache continues Intel's trend of increasing the size of the L2 cache over the last few generations of its P-cores, such as Golden Cove, Raptor Cove and Redwood Cove. The previous-generation Redwood Cove P-core architecture featured 2 MB of L2 cache. However, increasing the cache size often brings higher latency: Lion Cove's L2 cache has a 17-cycle latency, up from Redwood Cove's 16-cycle latency. Theoretically, the L2 cache can deliver a bandwidth of 110 bytes per cycle but this was limited to 64 bytes per cycle in Lunar Lake for power savings.
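
Per-cycle bandwidth figures can be converted into absolute throughput once a clock speed is fixed. The sketch below does that conversion for the 64 and 110 bytes-per-cycle figures mentioned above; the 4.0 GHz clock is a hypothetical value chosen only to make the arithmetic concrete and is not a specification of any Lion Cove product.

```python
def bandwidth_gb_per_s(bytes_per_cycle: float, clock_ghz: float) -> float:
    """Convert bytes per cycle to GB/s (1 GB = 10**9 bytes).

    bytes/cycle * (clock_ghz * 10**9 cycles/s) / 10**9 bytes per GB
    simplifies to bytes_per_cycle * clock_ghz.
    """
    return bytes_per_cycle * clock_ghz


if __name__ == "__main__":
    assumed_clock_ghz = 4.0  # hypothetical core clock, not from the article
    for label, per_cycle in (("limited", 64), ("theoretical", 110)):
        gbps = bandwidth_gb_per_s(per_cycle, assumed_clock_ghz)
        print(f"{label}: {per_cycle} B/clk -> {gbps:.0f} GB/s")
```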

L3

The read bandwidth when a single Lion Cove core accesses the L3 cache has regressed from 16 bytes per cycle with Redwood Cove to 10 bytes per cycle for Lion Cove. Despite this lower bandwidth in reading and writing data, the latency of Lion Cove accessing L3 data has been reduced from 75 cycles to 51 cycles in Lunar Lake. However, Lion Cove in Arrow Lake suffers from a much higher latency of 84 cycles due to a longer ring bus design, as its L3 cache is shared by both its P-cores and E-cores. Lunar Lake's L3 cache is exclusive to its four Lion Cove P-cores while its four E-cores sit on a separate "island" without an L3 cache.
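
The size of these changes is easier to compare as percentages. The helper below is a small sketch (the function name `percent_change` is illustrative) applied to the figures quoted above: single-core L3 read bandwidth of 16 versus 10 bytes per cycle, and L3 latency of 75, 51 and 84 cycles.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, in percent (negative = decrease)."""
    return (new - old) * 100 / old


if __name__ == "__main__":
    comparisons = [
        ("L3 read bandwidth, Redwood Cove -> Lion Cove", 16, 10),
        ("L3 latency, Redwood Cove -> Lion Cove (Lunar Lake)", 75, 51),
        ("L3 latency, Lunar Lake -> Arrow Lake", 51, 84),
    ]
    for label, old, new in comparisons:
        print(f"{label}: {percent_change(old, new):+.1f}%")
```

Note that these are changes in cycle counts, so they do not account for any difference in clock speed between the products.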

References

  1. Campbell, Mark (June 4, 2024). "Why are Intel ditching Hyperthreading with Lion Cove and Lunar Lake?". OC3D. Retrieved December 2, 2024.
  2. "Next Gen P-core: The Lion Cove Architecture" (PDF). Intel. June 3, 2024. Retrieved December 2, 2024.
  3. Mujtaba, Hassan (June 3, 2024). "Intel Lunar Lake CPU Architecture Deep-Dive: Lion Cove +14% IPC, Skymont IPC More Than Raptor Cove, Next-Gen Power Managment & Scheduling". Wccftech. Retrieved December 2, 2024.
  4. Lam, Chester (September 22, 2024). "Intel's Redwood Cove: Baby Steps are Still Steps". Chips and Cheese. Retrieved December 2, 2024.
  5. Killian, Zak (June 3, 2024). "Intel Lunar Lake CPU Deep Dive: Chipzilla's Mobile Moonshot". HotHardware. Retrieved December 2, 2024.
  6. "Intel Core Ultra Arrow Lake Preview". TechpowerUp. October 10, 2024. Retrieved December 2, 2024.
  7. Cozma, George (June 4, 2024). "Intel's Lion Cove Architecture Preview". Chips and Cheese. Retrieved December 2, 2024.
  8. Lam, Chester (September 27, 2024). "Lion Cove: Intel's P-Core Roars". Chips and Cheese. Retrieved December 2, 2024.
  9. Cutress, Ian (November 2, 2020). "A Broadwell Retrospective Review in 2020: Is eDRAM Still Worth It?". AnandTech. Retrieved December 2, 2024.
  10. Shimpi, Anand Lal (September 16, 2003). "Intel Developer Forum Fall 2003 - Day 1: Introducing Pentium 4 Extreme Edition". AnandTech. Retrieved December 2, 2024.
  11. Bonshor, Gavin (June 3, 2024). "Intel Unveils Lunar Lake Architecture: New P and E cores, Xe2-LPG Graphics, New NPU 4 Brings More AI Performance". AnandTech. Retrieved December 2, 2024.
  12. Lam, Chester (January 11, 2024). "Previewing Meteor Lake at CES". Chips and Cheese. Retrieved December 2, 2024.
  13. Lam, Chester (December 4, 2024). "Examining Intel's Arrow Lake, at the System Level". Chips and Cheese. Retrieved December 5, 2024.
  14. "Intel Lunar Lake Technical Deep Dive - So many Revolutions in One Chip". TechPowerUp. June 4, 2024. Retrieved December 5, 2024.
