Simultaneous multithreading: Difference between revisions

Revision as of 16:48, 5 September 2008 by Hcobb (edit summary: "Modern commercial implementations: Cleared up SMT vs other performance features."), compared with the latest revision as of 10:13, 19 February 2024 by WikiCleanerBot (edit summary: "v2.05b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation)").
(282 intermediate revisions by more than 100 users not shown)
{{short description|Efficiency improving technique for superscalar CPUs}}
{{Cleanup-section|date=November 2007}}
'''Simultaneous multithreading''' ('''SMT''') is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.


== Details ==
{{Mergeto|Multithreading (computer hardware) |Talk:Multithreading (computer hardware)|date=August 2007}}
The term ''multithreading'' is ambiguous, because not only can multiple threads be executed simultaneously on one CPU core, but also multiple tasks (with different page tables, different task state segments, different protection rings, different I/O permissions, etc.). Although running on the same core, they are completely separated from each other.
Multithreading is similar in concept to preemptive multitasking but is implemented at the thread level of execution in modern superscalar processors.


Simultaneous multithreading (SMT) is one of the two main implementations of multithreading, the other form being temporal multithreading (also known as super-threading). In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executed in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads is decided by the chip designers. Two concurrent threads per CPU core are common, but some processors support many more.<ref>{{cite web |title=The First Direct Mesh-to-Mesh Photonic Fabric |url=https://hc2023.hotchips.org/assets/program/conference/day2/Interconnects/HC23.Intel.JasonHoward.v3.pdf#page=5 |access-date=2024-02-08 |archive-date=2024-02-08 |archive-url=https://web.archive.org/web/20240208201646/https://hc2023.hotchips.org/assets/program/conference/day2/Interconnects/HC23.Intel.JasonHoward.v3.pdf#page=5}}</ref>


Because SMT inevitably increases conflict on shared resources, measuring or agreeing on its effectiveness can be difficult. However, measurements of the energy efficiency of SMT with parallel native and managed workloads on historical 130&nbsp;nm to 32&nbsp;nm Intel SMT (hyper-threading) implementations found that in 45&nbsp;nm and 32&nbsp;nm implementations, SMT is extremely energy efficient, even with in-order Atom processors.<ref name="asplos11">ASPLOS'11</ref> In modern systems, SMT effectively exploits concurrency with very little additional dynamic power. That is, even when performance gains are minimal, the power consumption savings can be considerable.<ref name="asplos11"/>
Some researchers{{who|date=June 2019}} have shown that the extra threads can be used proactively to seed a shared resource like a cache, to improve the performance of another single thread, and claim this shows that SMT does not only increase efficiency. Others{{who|date=June 2019}} use SMT to provide redundant computation, for some level of error detection and recovery.


However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing throughput of computations per amount of hardware used.{{citation needed|date=December 2018}}


==Taxonomy==
In processor design, there are two ways to increase on-chip parallelism with fewer resource requirements: one is the superscalar technique, which tries to exploit instruction-level parallelism (ILP); the other is the multithreading approach, which exploits thread-level parallelism (TLP).


Superscalar means executing multiple instructions at the same time, while thread-level parallelism (TLP) executes instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:
* Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grained multithreading or coarse-grained multithreading depending on the frequency of interleaved issues. '''Fine-grained''' multithreading, such as in a barrel processor, issues instructions for different threads after every cycle, while '''coarse-grained''' multithreading only switches to issue instructions from another thread when the currently executing thread causes a long-latency event (such as a page fault). Coarse-grained multithreading is more common because it requires fewer context switches between threads. For example, Intel's Montecito processor uses coarse-grained multithreading, while Sun's UltraSPARC T1 uses fine-grained multithreading. For processors that have only one pipeline per core, interleaved multithreading is the only option, because such a core can issue at most one instruction per cycle.
* Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so.
* Chip-level multiprocessing (CMP or multicore): integrates two or more processors into one chip, each executing threads independently.
* Any combination of multithreaded/SMT/CMP.


The key factor in distinguishing them is how many instructions the processor can issue in one cycle and how many threads those instructions come from. For example, Sun Microsystems' UltraSPARC T1 is a multicore processor combined with the fine-grained multithreading technique rather than simultaneous multithreading, because each core can only issue one instruction at a time.
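
The distinguishing rule above can be written down as a small classifier. This is only an illustrative sketch: the `CoreModel` type, its field names, and the example values are hypothetical and are not taken from any vendor documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreModel:
    """Hypothetical summary of one core's issue behaviour."""
    issue_width: int        # maximum instructions issued per cycle
    threads_per_cycle: int  # distinct threads that may issue in the same cycle
    hardware_threads: int   # thread contexts held by the core


def classify(core: CoreModel) -> str:
    """Classify a core by issue width and by how many threads issue per cycle."""
    if core.hardware_threads <= 1:
        return "single-threaded"
    if core.issue_width > 1 and core.threads_per_cycle > 1:
        return "simultaneous multithreading"
    # Several thread contexts, but only one thread issues in any given cycle.
    return "interleaved (temporal) multithreading"


# Illustrative values only: a single-issue core with several thread contexts,
# and a 4-wide core that can issue from two threads in the same cycle.
single_issue_core = CoreModel(issue_width=1, threads_per_cycle=1, hardware_threads=4)
two_way_smt_core = CoreModel(issue_width=4, threads_per_cycle=2, hardware_threads=2)

print(classify(single_issue_core))
print(classify(two_way_smt_core))
```

A single-issue core is classified as interleaved regardless of how many thread contexts it holds, which matches the reasoning given for the UltraSPARC T1.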


== Historical implementations ==
While multithreading CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968 as part of the ACS-360 project.<ref>{{cite web |url=http://people.cs.clemson.edu/~mark/acs_end.html |title=End of IBM ACS Project |first=Mark |last=Smotherman |date=25 May 2011 |access-date=January 19, 2013 |publisher=School of Computing, Clemson University}}</ref> The first major commercial microprocessor developed with SMT was the Alpha 21464 (EV8). This microprocessor was developed by DEC in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Henry Levy of the University of Washington. The microprocessor was never released, since the Alpha line of microprocessors was discontinued shortly before HP acquired Compaq, which had in turn acquired DEC. Dean Tullsen's work was also used to develop the hyper-threaded versions of the Intel Pentium&nbsp;4 microprocessors, such as the "Northwood" and "Prescott".


== Modern commercial implementations ==
The Intel Pentium&nbsp;4 was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06&nbsp;GHz model released in 2002, and Intel has since introduced it into a number of its processors. Intel calls the functionality Hyper-Threading Technology, and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement<ref>{{cite journal|last1=Marr|first1=Deborah|title=Hyper-Threading Technology Architecture and Microarchitecture|journal=Intel Technology Journal|date=February 14, 2002|volume=6|issue=1|page=4|doi=10.1535/itj|url=http://www.diku.dk/OLD/undervisning/2004f/303/Hyper-Thread.pdf|access-date=25 September 2015|archive-date=24 October 2016|archive-url=https://web.archive.org/web/20161024004724/http://www.diku.dk/OLD/undervisning/2004f/303/Hyper-Thread.pdf|url-status=dead}}</ref> compared against an otherwise identical, non-SMT Pentium&nbsp;4. The performance improvement seen is very application-dependent; however, when running two programs that require the full attention of the processor, it can actually seem like one or both of the programs slow down slightly when Hyper-Threading is turned on.<ref>{{cite web |title=CPU performance evaluation Pentium&nbsp;4 2.8 and 3.0 |url=http://users.telenet.be/nicvroom/performanceP4.htm |access-date=2011-04-22 |archive-date=2021-02-24 |archive-url=https://web.archive.org/web/20210224131422/http://users.telenet.be/nicvroom/performanceP4.htm |url-status=dead }}</ref> This is due to the replay system of the Pentium&nbsp;4 tying up valuable execution resources, increasing contention for resources such as bandwidth, caches, TLBs, and re-order buffer entries, and equalizing the processor resources between the two programs, which adds a varying amount of execution time. The Pentium&nbsp;4 Prescott core gained a replay queue, which reduces the execution time needed for the replay system. This was enough to completely overcome that performance hit.<ref>{{cite web|title=Replay: Unknown Features of the NetBurst Core. Page 15|url=http://www.xbitlabs.com/articles/cpu/display/replay_15.html#sect0|website=Replay: Unknown Features of the NetBurst Core.|publisher=xbitlabs.com|access-date=24 April 2011|url-status=dead|archive-url=https://web.archive.org/web/20110514180659/http://www.xbitlabs.com/articles/cpu/display/replay_15.html#sect0|archive-date=14 May 2011}}</ref>


The latest Imagination Technologies MIPS architecture designs include an SMT system known as "MIPS MT".<ref>{{cite web|title=MIPS MT ASE description|url=https://www.imgtec.com/mips/architectures/multi-threading/}}</ref> MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI, a Cupertino-based startup, is the first MIPS vendor to provide a processor SoC based on eight cores, each of which runs four threads. The threads can be run in fine-grain mode where a different thread can be executed each cycle. The threads can also be assigned priorities. ] MIPS CPUs have two SMT threads per core.


IBM's Blue Gene/Q has 4-way SMT.


The IBM POWER5, announced in May 2004, comes as either a dual-core dual-chip module (DCM), or a quad-core or oct-core multi-chip module (MCM), with each core including a two-thread SMT engine. IBM's implementation is more sophisticated than the previous ones, because it can assign a different priority to the various threads, is more fine-grained, and the SMT engine can be turned on and off dynamically, to better execute those workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading. In 2010, IBM released systems based on the POWER7 processor with eight cores, each having four Simultaneous Intelligent Threads. The processor switches the threading mode between one thread, two threads or four threads depending on the number of process threads being scheduled at the time. This optimizes the use of the core for minimum response time or maximum throughput. IBM POWER8 has 8 intelligent simultaneous threads per core (SMT8).


] starting with the ] processor in 2013 has two threads per core (SMT-2).


Although many people reported that Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the now defunct processor codenamed "Rock" (originally announced in 2005, but after many delays cancelled in 2010) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara is not actually using SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara has eight cores, but each core has only one pipeline, so it actually uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round-robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor. Sun Microsystems' Rock processor is different: it has more complex cores that have more than one pipeline.


The Oracle Corporation SPARC T3 has eight fine-grained threads per core; SPARC T4, SPARC T5, SPARC M5, M6 and M7 have eight fine-grained threads per core of which two can be executed simultaneously.


Fujitsu SPARC64 VI has coarse-grained Vertical Multithreading (VMT); SPARC64 VII and newer have 2-way SMT.


Intel Itanium Montecito uses coarse-grained multithreading, and Tukwila and newer ones use 2-way SMT (with dual-domain multithreading).


Intel Xeon Phi has 4-way SMT (with time-multiplexed multithreading) with hardware-based threads which cannot be disabled, unlike regular Hyper-Threading.<ref>{{cite web |first1=Michaela |last1=Barth |first2=Mikko |last2=Byckling |first3=Nevena |last3=Ilieva |first4=Sami |last4=Saarinen |first5=Michael |last5=Schliephake |editor-first=Volker |editor-last=Weinberg |title=Best Practice Guide Intel Xeon Phi v1.1 |date=18 February 2014 |publisher=Partnership for Advanced Computing in Europe |url=http://www.prace-ri.eu/best-practice-guide-intel-xeon-phi-html/ |access-date=22 November 2016 |archive-date=3 May 2017 |archive-url=https://web.archive.org/web/20170503073453/http://www.prace-ri.eu/best-practice-guide-intel-xeon-phi-html/ |url-status=dead }}</ref> The Intel Atom, first released in 2008, is the first Intel product to feature 2-way SMT (marketed as Hyper-Threading) without supporting instruction reordering, speculative execution, or register renaming. Intel reintroduced Hyper-Threading with the Nehalem microarchitecture, after its absence on the Core microarchitecture.

AMD Bulldozer microarchitecture FlexFPU<!-- Don't put three different links next to each other, it's confusing for readers --> and Shared L2 cache are multithreaded, but the integer cores in a module are single-threaded, so it is only a partial SMT implementation.<ref>{{cite web |title=AMD Bulldozer Family Module Multithreading |date=July 2013 |publisher=wccftech |url=http://cdn3.wccftech.com/wp-content/uploads/2013/07/AMD-Steamroller-vs-Bulldozer.jpg |access-date=2013-07-22 |archive-date=2013-10-17 |archive-url=https://web.archive.org/web/20131017014731/http://cdn3.wccftech.com/wp-content/uploads/2013/07/AMD-Steamroller-vs-Bulldozer.jpg |url-status=dead }}</ref><ref>{{cite web |first=Gareth |last=Halfacree |title=AMD unveils Flex FP |date=28 October 2010 |publisher=bit-tech |url=https://www.bit-tech.net/news/hardware/2010/10/28/amd-unveils-flex-fp/1}}</ref>

AMD Zen microarchitecture has 2-way SMT.

VISC architecture<ref name="urlSoft Machines unveils VISC virtual chip architecture | bit-tech.net">{{cite web |url=https://bit-tech.net/news/tech/cpus/soft-machines-visc/1/ |title=Soft Machines unveils VISC virtual chip architecture &#124; bit-tech.net |format= |accessdate=}}</ref><ref>{{cite web |first=Ian |last=Cutress |title=Examining Soft Machines' Architecture: An Element of VISC to Improving IPC |date=12 February 2016 |publisher=AnandTech |url=http://www.anandtech.com/show/10025/examining-soft-machines-architecture-visc-ipc}}</ref><ref>{{cite web|title=Next Gen Processor Performance Revealed|date=February 4, 2016|publisher=VR World|url=https://vrworld.com/2016/02/04/next-gen-processor-performance-revealed/|archive-url=https://web.archive.org/web/20170113044935/https://vrworld.com/2016/02/04/next-gen-processor-performance-revealed/|archive-date=2017-01-13}}</ref><ref>{{cite web|title=Architectural Waves|year=2017|publisher=Soft Machines|url=http://www.softmachines.com/technology/|url-status=dead|archive-url=https://web.archive.org/web/20170329105223/http://www.softmachines.com/technology/|archive-date=2017-03-29}}</ref> uses the ''Virtual Software Layer'' (translation layer) to dispatch a single thread of instructions to the ''Global Front End'', which splits instructions into ''virtual hardware threadlets'' that are then dispatched to separate virtual cores. These virtual cores can then send them to the available resources on any of the physical cores. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core, which can split partial instructions and data from multiple threadlets through the execution ports at the same time. Each virtual core keeps track of the position of the relative output. This form of multithreading can increase single-threaded performance by allowing a single thread to use all resources of the CPU. The allocation of resources is dynamic on a near-single-cycle latency level (1–4 cycles depending on the change in allocation), depending on individual application needs. Therefore, if two virtual cores are competing for resources, there are appropriate algorithms in place to determine what resources are to be allocated where.

== Disadvantages ==
Depending on the design and architecture of the processor, simultaneous multithreading can decrease performance if any of the shared resources are bottlenecks for performance.<ref>{{cite web|title=Replay: Unknown Features of the NetBurst Core. Page 15|url=http://www.xbitlabs.com/articles/cpu/display/replay_15.html#sect0|website=Replay: Unknown Features of the NetBurst Core.|publisher=xbitlabs.com|access-date=24 April 2011|url-status=dead|archive-url=https://web.archive.org/web/20110514180659/http://www.xbitlabs.com/articles/cpu/display/replay_15.html#sect0|archive-date=14 May 2011}}</ref> Critics argue that it is a considerable burden on software developers to have to test whether simultaneous multithreading is good or bad for their application in various situations, and to insert extra logic to turn it off if it decreases performance. Current operating systems lack convenient API calls for this purpose and for preventing processes with different priorities from taking resources from each other.
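
As a hedged illustration of what an application can at least inspect, the sketch below reads SMT topology on a Linux system. It assumes the kernel exposes the sysfs files `smt/active` and `cpuN/topology/thread_siblings_list` under `/sys/devices/system/cpu`; on other systems, or where those files are missing, the functions return `None`. The function names are hypothetical helpers, not part of any standard API, and the sketch only reports topology rather than turning SMT off.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")  # assumed Linux sysfs location


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,4" or "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def smt_siblings(cpu=0, sysfs=SYSFS_CPU):
    """Return the logical CPUs sharing a core with `cpu`, or None if unknown."""
    path = Path(sysfs) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        return None


def smt_active(sysfs=SYSFS_CPU):
    """Return True or False if the kernel reports SMT state, or None if unknown."""
    try:
        return (Path(sysfs) / "smt" / "active").read_text().strip() == "1"
    except OSError:
        return None


print("SMT active:", smt_active())
print("Siblings of CPU 0:", smt_siblings(0))
```

A program could use the sibling list to avoid pinning two resource-hungry threads to the same physical core, although whether that helps remains workload-dependent, as the paragraph above notes.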

There is also a security concern with certain simultaneous multithreading implementations. Intel's hyperthreading in NetBurst-based processors has a vulnerability through which it is possible for one application to steal a cryptographic key from another application running in the same processor by monitoring its cache use. There are also sophisticated machine learning exploits to HT implementation that were explained at ].

== See also ==
* ]
* ]
* ]

== References ==
{{Reflist|2}}

;General
{{refbegin}}
*{{cite journal |first1=Leonard E. |last1=Shar |first2=Edward S. |last2=Davidson |title=A multiminiprocessor system implemented through pipelining |journal=Computer |volume=7 |issue=2 |pages= 42–51|date=February 1974 |doi=10.1109/MC.1974.6323457 |s2cid=27957358 }}
*{{cite book |first1=D.M. |last1=Tullsen |first2=S.J. |last2=Eggers |first3=H.M. |last3=Levy |chapter=Simultaneous multithreading: Maximizing on-chip parallelism |chapter-url=https://ieeexplore.ieee.org/document/524578 |title=22nd Annual International Symposium on Computer Architecture |publisher=IEEE |year=1995 |isbn=978-0-89791-698-1 |pages=392–403 }}
*{{cite book |first1=D.M. |last1=Tullsen |first2=S.J. |last2=Eggers |first4=H.M. |last4=Levy |first3=J.S. |last3=Emer |first5=J.L. |last5=Lo |first6=R.L. |last6=Stamm |chapter=Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor |chapter-url=https://ieeexplore.ieee.org/document/1563047 |title=23rd Annual International Symposium on Computer Architecture |publisher=IEEE |doi=10.1145/232973.232993 |year=1996 |isbn=978-0-89791-786-5 |pages=191 |s2cid=1402376 }}
*{{cite book |first1=H. |last1=Esmaeilzadeh |first2=T. |last2=Cao |first3=X. |last3=Yang |first4=S.M. |last4=Blackburn |first5=K.S. |last5=McKinley |chapter=Looking back on the language and hardware revolutions: measured power, performance, and scaling |chapter-url=https://www.academia.edu/download/45265334/Looking_back_on_the_language_and_hardwar20160501-22522-g9hkyo.pdf |title=ASPLOS XVI Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems |publisher=ACM |doi=10.1145/1950365.1950402 |year=2011 |isbn=978-1-4503-0266-1 |pages=319–332 |s2cid=6845129 }}
{{refend}}

== External links ==
* {{cite web |first=Mark |last=Smotherman |title=Timeline of multithreading technologies |date=November 2007 |publisher=School of Computing, Clemson University |url=http://www.cs.clemson.edu/~mark/multithreading.html}}

{{CPU technologies}}
{{Parallel computing}}

{{Authority control}}

{{DEFAULTSORT:Simultaneous Multithreading}}

Latest revision as of 10:13, 19 February 2024

Efficiency improving technique for superscalar CPUs

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better use the resources provided by modern processor architectures.

Details

The term multithreading is ambiguous, because not only can multiple threads be executed simultaneously on one CPU core, but so can multiple tasks (with different page tables, different task state segments, different protection rings, different I/O permissions, etc.). Although they run on the same core, these tasks are completely separated from each other. Multithreading is similar in concept to preemptive multitasking but is implemented at the thread level of execution in modern superscalar processors.

Simultaneous multithreading (SMT) is one of the two main implementations of multithreading, the other form being temporal multithreading (also known as super-threading). In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executed in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads is decided by the chip designers. Two concurrent threads per CPU core are common, but some processors support many more.
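
The difference between the two forms can be made concrete with a toy issue-slot model. This is a simplified sketch under stated assumptions, not a description of any real pipeline: `ready` is an invented input giving, for each cycle, how many instructions each thread has ready, and instructions that are not issued are not carried over to the next cycle. The temporal model is even granted the best single thread each cycle.

```python
def temporal_issue(ready, width):
    """Instructions issued when only ONE thread may use the issue slots per cycle.

    The thread with the most ready instructions is picked each cycle (best case).
    """
    return sum(min(width, max(cycle)) for cycle in ready)


def smt_issue(ready, width):
    """Instructions issued when slots may be filled from SEVERAL threads per cycle."""
    return sum(min(width, sum(cycle)) for cycle in ready)


def utilization(issued, cycles, width):
    """Fraction of available issue slots that were actually used."""
    return issued / (cycles * width)


# Hypothetical workload: 4 cycles, 2 threads, counts of ready instructions.
ready = [[1, 2], [0, 3], [2, 2], [1, 0]]
WIDTH = 4  # assumed issue width of the core

for name, issue in (("temporal", temporal_issue), ("simultaneous", smt_issue)):
    issued = issue(ready, WIDTH)
    print(name, issued, utilization(issued, len(ready), WIDTH))
```

In this invented workload neither thread alone can fill a 4-wide core, so combining threads in the same cycle uses more of the available slots.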

Because SMT inevitably increases conflict on shared resources, measuring or agreeing on its effectiveness can be difficult. However, measurements of the energy efficiency of SMT with parallel native and managed workloads on historical 130 nm to 32 nm Intel SMT (hyper-threading) implementations found that in 45 nm and 32 nm implementations, SMT is extremely energy efficient, even with in-order Atom processors. In modern systems, SMT effectively exploits concurrency with very little additional dynamic power. That is, even when performance gains are minimal, the power consumption savings can be considerable. Some researchers have shown that the extra threads can be used proactively to seed a shared resource like a cache, to improve the performance of another single thread, and claim this shows that SMT does not only increase efficiency. Others use SMT to provide redundant computation, for some level of error detection and recovery.

However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing throughput of computations per amount of hardware used.

Taxonomy

In processor design, there are two ways to increase on-chip parallelism with fewer resource requirements: one is the superscalar technique, which tries to exploit instruction-level parallelism (ILP); the other is the multithreading approach, which exploits thread-level parallelism (TLP).

Superscalar means executing multiple instructions at the same time, while thread-level parallelism (TLP) executes instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:

  • Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grained multithreading or coarse-grained multithreading depending on the frequency of interleaved issues. Fine-grained multithreading, such as in a barrel processor, issues instructions for a different thread after every cycle, while coarse-grained multithreading only switches to issue instructions from another thread when the currently executing thread causes a long-latency event (such as a page fault). Coarse-grained multithreading is more common because it requires fewer context switches between threads. For example, Intel's Montecito processor uses coarse-grained multithreading, while Sun's UltraSPARC T1 uses fine-grained multithreading. For processors that have only one pipeline per core, interleaved multithreading is the only option, because such a core can issue at most one instruction per cycle.
  • Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so.
  • Chip-level multiprocessing (CMP or multicore): integrates two or more processors into one chip, each executing threads independently.
  • Any combination of multithreaded/SMT/CMP.

The key to distinguishing them is how many instructions the processor can issue in one cycle and how many threads those instructions come from. For example, Sun Microsystems' UltraSPARC T1 is a multicore processor combined with fine-grained multithreading rather than simultaneous multithreading, because each core can only issue one instruction at a time.
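The distinguishing rule above can be written as a small decision function. This is only a sketch of the taxonomy as described here; the function name `classify`, its parameters and the example figures are assumptions for illustration, not data taken from vendor documentation.

```python
def classify(cores, threads_per_core, issue_width, threads_per_cycle):
    """Label a design using the taxonomy described in the text.

    cores             -- processor cores on the chip
    threads_per_core  -- hardware thread contexts per core
    issue_width       -- instructions a core can issue per cycle
    threads_per_cycle -- distinct threads a core can issue from in one cycle
    """
    if min(cores, threads_per_core, issue_width, threads_per_cycle) < 1:
        raise ValueError("all parameters must be at least 1")
    # Issuing from k threads in one cycle needs k issue slots and k contexts,
    # which is why an SMT core must be superscalar.
    if threads_per_cycle > min(issue_width, threads_per_core):
        raise ValueError("cannot issue from more threads than slots or contexts")

    if threads_per_core == 1:
        threading = "single-threaded"
    elif threads_per_cycle > 1:
        threading = "SMT"
    else:
        threading = "interleaved multithreading"
    return ("CMP + " if cores > 1 else "") + threading


if __name__ == "__main__":
    # Illustrative figures: a multicore chip whose cores issue one
    # instruction per cycle, and a single superscalar two-thread core.
    print(classify(cores=8, threads_per_core=4, issue_width=1, threads_per_cycle=1))
    print(classify(cores=1, threads_per_core=2, issue_width=4, threads_per_cycle=2))
```

A single-issue core can never be labelled SMT by this function, which mirrors the reasoning given for the UltraSPARC T1.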

Historical implementations

While multithreading CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968 as part of the ACS-360 project. The first major commercial microprocessor developed with SMT was the Alpha 21464 (EV8). This microprocessor was developed by DEC in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Henry Levy of the University of Washington. The microprocessor was never released, since the Alpha line of microprocessors was discontinued shortly before HP acquired Compaq, which had in turn acquired DEC. Dean Tullsen's work was also used to develop the hyper-threaded versions of the Intel Pentium 4 microprocessors, such as the "Northwood" and "Prescott".

Modern commercial implementations

The Intel Pentium 4 was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06 GHz model released in 2002; Intel has since introduced it into a number of its processors. Intel calls the functionality Hyper-Threading Technology, and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent; when running two programs that require the full attention of the processor, one or both of the programs can actually appear to slow down slightly when Hyper-Threading is turned on. This is due to the replay system of the Pentium 4 tying up valuable execution resources, increasing contention for resources such as bandwidth, caches, TLBs, and re-order buffer entries, and equalizing the processor resources between the two programs, which adds a varying amount of execution time. The Pentium 4 Prescott core gained a replay queue, which reduces the execution time needed for the replay system. This was enough to completely overcome that performance hit.

The latest Imagination Technologies MIPS architecture designs include an SMT system known as "MIPS MT". MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI, a Cupertino-based startup, is the first MIPS vendor to provide a processor SOC based on eight cores, each of which runs four threads. The threads can be run in fine-grain mode where a different thread can be executed each cycle. The threads can also be assigned priorities. Imagination Technologies MIPS CPUs have two SMT threads per core.

IBM's Blue Gene/Q has 4-way SMT.

The IBM POWER5, announced in May 2004, comes as either a dual-core dual-chip module (DCM), or a quad-core or oct-core multi-chip module (MCM), with each core including a two-thread SMT engine. IBM's implementation is more sophisticated than the previous ones, because it can assign a different priority to the various threads, is more fine-grained, and the SMT engine can be turned on and off dynamically, to better execute those workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading. In 2010, IBM released systems based on the POWER7 processor with eight cores, each having four Simultaneous Intelligent Threads. Each core switches its threading mode between one thread, two threads or four threads depending on the number of process threads being scheduled at the time. This optimizes the use of the core for minimum response time or maximum throughput. IBM POWER8 has 8 intelligent simultaneous threads per core (SMT8).
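The mode switching described for POWER7 can be pictured with a minimal sketch. The selection rule below (pick the smallest mode that accommodates the runnable threads) is an assumption made for illustration; the text only states that the mode depends on the number of process threads being scheduled, and the names `SMT_MODES` and `pick_smt_mode` are hypothetical.

```python
# Thread modes named in the text: one, two or four threads per core.
SMT_MODES = (1, 2, 4)


def pick_smt_mode(runnable_threads):
    """Return the smallest mode that fits the runnable threads (assumed policy).

    Fewer active hardware threads leave more of the core to each thread
    (lower response time); more threads favour total throughput.
    """
    if runnable_threads < 0:
        raise ValueError("runnable_threads must be non-negative")
    for mode in SMT_MODES:
        if runnable_threads <= mode:
            return mode
    # More runnable threads than contexts: use the widest mode available.
    return SMT_MODES[-1]


if __name__ == "__main__":
    for n in range(6):
        print(n, "runnable ->", pick_smt_mode(n), "thread mode")
```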

IBM Z, starting with the z13 processor in 2015, has two threads per core (SMT-2).

Although many people reported that Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the now defunct processor codenamed "Rock" (originally announced in 2005, but cancelled in 2010 after many delays) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara does not actually use SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara has eight cores, but each core has only one pipeline, so it actually uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round-robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor. Sun Microsystems' Rock processor is different: it has more complex cores that have more than one pipeline.
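The round-robin policy described for Niagara, in which exactly one thread issues per cycle and inactive threads are skipped, can be sketched as follows. This is a simplified illustration, not Sun's actual thread-select logic; the function name `round_robin_schedule` and the fixed `active` flags are assumptions.

```python
def round_robin_schedule(active, cycles):
    """Return which thread issues in each cycle under round-robin selection.

    active -- list of booleans, True if that hardware thread is ready
              (held constant here for simplicity; in real hardware a
              thread becomes inactive on events such as cache misses)
    cycles -- number of cycles to simulate
    A cycle in which no thread is active is recorded as None.
    """
    n = len(active)
    last = -1  # thread that issued most recently
    schedule = []
    for _ in range(cycles):
        for step in range(1, n + 1):
            tid = (last + step) % n
            if active[tid]:
                schedule.append(tid)
                last = tid
                break
        else:
            schedule.append(None)  # pipeline bubble: nothing to issue
    return schedule


if __name__ == "__main__":
    # Thread 1 is stalled, so the single issue slot rotates over 0, 2 and 3.
    print(round_robin_schedule([True, False, True, True], cycles=6))
```

Each entry of the result is a single thread identifier, which is the contrast with SMT: the issue slot is never shared between threads within one cycle.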

The Oracle Corporation SPARC T3 has eight fine-grained threads per core; SPARC T4, SPARC T5, SPARC M5, M6 and M7 have eight fine-grained threads per core of which two can be executed simultaneously.

Fujitsu SPARC64 VI has coarse-grained Vertical Multithreading (VMT); SPARC64 VII and newer have 2-way SMT.

Intel Itanium Montecito uses coarse-grained multithreading, while Tukwila and newer ones use 2-way SMT (with dual-domain multithreading).

Intel Xeon Phi has 4-way SMT (with time-multiplexed multithreading) with hardware-based threads which cannot be disabled, unlike regular Hyper-Threading. The Intel Atom, first released in 2008, is the first Intel product to feature 2-way SMT (marketed as Hyper-Threading) without supporting instruction reordering, speculative execution, or register renaming. Intel reintroduced Hyper-Threading with the Nehalem microarchitecture, after its absence on the Core microarchitecture.

In the AMD Bulldozer microarchitecture, the FlexFPU and the shared L2 cache are multithreaded, but the integer cores in each module are single-threaded, so it is only a partial SMT implementation.

AMD Zen microarchitecture has 2-way SMT.

The VISC architecture uses the Virtual Software Layer (translation layer) to dispatch a single thread of instructions to the Global Front End, which splits instructions into virtual hardware threadlets that are then dispatched to separate virtual cores. These virtual cores can then send them to the available resources on any of the physical cores. Multiple virtual cores can push threadlets into the reorder buffer of a single physical core, which can split partial instructions and data from multiple threadlets through the execution ports at the same time. Each virtual core keeps track of the position of the relative output. This form of multithreading can increase single-threaded performance by allowing a single thread to use all resources of the CPU. The allocation of resources is dynamic on a near-single-cycle latency level (1–4 cycles depending on the change in allocation), depending on individual application needs. Therefore, if two virtual cores are competing for resources, there are appropriate algorithms in place to determine what resources are to be allocated where.

Disadvantages

Depending on the design and architecture of the processor, simultaneous multithreading can decrease performance if any of the shared resources are bottlenecks for performance. Critics argue that it is a considerable burden on software developers to have to test whether simultaneous multithreading is good or bad for their application in various situations and to insert extra logic to turn it off if it decreases performance. Current operating systems lack convenient API calls for this purpose and for preventing processes with different priorities from taking resources from each other.
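As a hedged illustration of the kind of extra logic involved: on Linux, the sysfs file `thread_siblings_list` under each logical CPU's `topology` directory lists the logical CPUs that share a physical core, in a comma-and-range format such as `0,4` or `2-3`. The sketch below parses that format and picks one logical CPU per core, which a program could then use to restrict itself to one thread per core when testing whether SMT helps or hurts. The helper names are hypothetical, and the approach is Linux-specific rather than a general API.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,4' or '2-3' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each group of siblings."""
    return sorted({min(parse_cpu_list(s)) for s in sibling_lists if s.strip()})


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read every thread_siblings_list under `root` (empty list if absent)."""
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    return [p.read_text() for p in Path(root).glob(pattern)]


if __name__ == "__main__":
    # Prints an empty list on systems without this sysfs layout.
    print(one_cpu_per_core(read_sibling_lists()))
```

The resulting list could be passed to an affinity call such as Python's `os.sched_setaffinity` (available on Linux) to compare a workload with and without sibling threads in use.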

There is also a security concern with certain simultaneous multithreading implementations. Intel's Hyper-Threading in NetBurst-based processors has a vulnerability through which it is possible for one application to steal a cryptographic key from another application running on the same processor by monitoring its cache use. There are also sophisticated machine-learning exploits against the Hyper-Threading implementation that were explained at Black Hat 2018.
