Simultaneous multithreading

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license. Give it a read and then ask your questions in the chat. We can research this topic together.

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

Details

Multithreading is similar in concept to preemptive multitasking but is implemented at the thread level of execution in modern superscalar processors.

Simultaneous multithreading (SMT) is one of the two main implementations of multithreading, the other form being temporal multithreading. In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executing in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads can be decided by the chip designers, but practical restrictions on chip complexity have limited the number to two for most SMT implementations.
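
Each hardware thread of an SMT core appears to the operating system as its own logical processor. As a rough, OS-level illustration (a minimal sketch assuming a POSIX system; the sysfs path is a Linux convention, not part of SMT itself, and may not be present everywhere), the following C program counts the logical CPUs and, where available, prints which of them are hardware-thread siblings of CPU 0:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Every hardware thread is exposed to the OS as a logical CPU. */
        long logical = sysconf(_SC_NPROCESSORS_ONLN);
        printf("logical CPUs visible to the OS: %ld\n", logical);

        /* On Linux, sysfs lists the logical CPUs that share a physical core
           with CPU 0; on a two-way SMT machine this typically prints two IDs. */
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");
        if (f) {
            char buf[64];
            if (fgets(buf, sizeof buf, f))
                printf("SMT siblings of cpu0: %s", buf);
            fclose(f);
        }
        return 0;
    }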

Because the technique is essentially an efficiency measure, and because it inevitably increases contention for shared resources, measuring or agreeing on its effectiveness can be difficult. Some researchers have shown that the extra threads can be used to proactively seed a shared resource such as a cache, improving the performance of another, single thread, and argue that this shows SMT is not just an efficiency solution. Others use SMT to provide redundant computation for some level of error detection and recovery.
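
One way to picture the cache-seeding idea is a helper thread that runs a fixed distance ahead of a worker thread and touches the data the worker is about to read. The sketch below is only an illustration under stated assumptions (POSIX threads, the GCC/Clang __builtin_prefetch intrinsic, and the two threads landing on sibling hardware threads that share a cache; pinning them there, which a real implementation would need, is omitted):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1L << 24)      /* ~16M doubles, far larger than any cache */
    #define AHEAD 256         /* how far the helper runs ahead of the worker */

    static double *data;
    static atomic_long progress = 0;   /* index the worker has reached */
    static atomic_int done = 0;

    /* Helper thread: stays a fixed distance ahead of the worker and touches
       upcoming cache lines so they are (hopefully) resident when needed.
       On an SMT core that shares a cache with the worker, this "seeds" it. */
    static void *prefetcher(void *arg) {
        (void)arg;
        while (!atomic_load(&done)) {
            long i = atomic_load(&progress) + AHEAD;
            for (long j = i; j < i + 64 && j < N; j += 8)
                __builtin_prefetch(&data[j], 0, 1);
        }
        return NULL;
    }

    int main(void) {
        data = malloc(sizeof(double) * N);
        if (!data) return 1;
        for (long i = 0; i < N; i++) data[i] = 1.0;

        pthread_t helper;
        pthread_create(&helper, NULL, prefetcher, NULL);

        double sum = 0.0;
        for (long i = 0; i < N; i++) {
            sum += data[i];
            if ((i & 1023) == 0) atomic_store(&progress, i);
        }
        atomic_store(&done, 1);
        pthread_join(helper, NULL);

        printf("sum = %f\n", sum);
        return 0;
    }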

However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing the throughput of computation per amount of hardware used.

Taxonomy

In processor design, there are two ways to increase on-chip parallelism with fewer resource requirements: one is the superscalar technique, which tries to exploit instruction-level parallelism (ILP); the other is the multithreading approach, which exploits thread-level parallelism (TLP).

Superscalar execution means executing multiple instructions at the same time, while chip-level multithreading (CMT) means executing instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:

  • Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as temporal multithreading. It can be further divided into fine-grained or coarse-grained multithreading, depending on the frequency of interleaved issue. Fine-grained multithreading issues instructions for a different thread after every cycle, while coarse-grained multithreading only switches to issue instructions from another thread when the currently executing thread causes a long-latency event (such as a page fault). Coarse-grained multithreading is more common because it requires less frequent switching between threads. For example, Intel's Montecito processor uses coarse-grained multithreading, while Sun's UltraSPARC T1 uses fine-grained multithreading. For processors that have only one pipeline per core, interleaved multithreading is the only option, because such a core can issue at most one instruction per cycle.
  • Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so.
  • Chip-level multiprocessing (CMP or multicore): Integrates two or more superscalar processors into one chip, each of which executes threads independently.
  • Any combination of multithreaded/SMT/CMP

The key factor distinguishing them is how many instructions the processor can issue in one cycle and how many threads those instructions come from. For example, Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) is a multicore processor combined with a fine-grained multithreading technique rather than simultaneous multithreading, because each core can only issue one instruction at a time.
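
The distinction can be made concrete with a toy issue model in C (purely illustrative: the per-cycle readiness patterns and the two-wide issue width are invented, and this is nowhere near a cycle-accurate simulator). A fine-grained core issues at most one instruction per cycle, chosen round-robin from the ready threads, while a two-wide SMT core can fill both of its issue slots from different threads in the same cycle:

    #include <stdio.h>

    #define THREADS 2
    #define CYCLES  6

    /* Per-cycle readiness of each thread's next instruction (1 = ready).
       The values are made up purely to contrast the two issue policies. */
    static const int ready[THREADS][CYCLES] = {
        {1, 0, 1, 1, 0, 1},   /* thread 0 stalls in cycles 1 and 4 */
        {1, 1, 0, 1, 1, 1},   /* thread 1 stalls in cycle 2 */
    };

    /* Fine-grained (temporal) multithreading: at most one instruction per
       cycle, taken from the threads in round-robin order, skipping stalls. */
    static void fine_grained(void) {
        int next = 0;
        printf("fine-grained: ");
        for (int c = 0; c < CYCLES; c++) {
            int issued = -1;
            for (int k = 0; k < THREADS; k++) {
                int t = (next + k) % THREADS;
                if (ready[t][c]) { issued = t; next = (t + 1) % THREADS; break; }
            }
            if (issued < 0) printf("[--] ");
            else            printf("[T%d] ", issued);
        }
        printf("\n");
    }

    /* Two-wide SMT: two issue slots per cycle, each fillable from any ready
       thread, so instructions from different threads share the same cycle. */
    static void smt_two_wide(void) {
        printf("2-wide SMT:   ");
        for (int c = 0; c < CYCLES; c++) {
            int slots = 0;
            printf("[");
            for (int t = 0; t < THREADS && slots < 2; t++)
                if (ready[t][c]) { printf("T%d", t); slots++; }
            if (slots == 0) printf("--");
            printf("] ");
        }
        printf("\n");
    }

    int main(void) {
        fine_grained();
        smt_two_wide();
        return 0;
    }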

Historical implementations

While multithreaded CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968. The first major commercial CPU developed with SMT was the DEC Alpha 21464 (EV8). This chip was developed by DEC in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Hank Levy of the University of Washington. The processor was never released, since the Alpha line of processors was discontinued shortly before HP acquired Compaq, which had itself acquired DEC. Dean Tullsen's work was also used in developing Hyper-Threading for the Intel Pentium 4 processor.

Modern commercial implementations

The Intel Pentium 4 was the first modern desktop processor to implement simultaneous multithreading, starting with the 3.06 GHz model released in 2002, and SMT has since been introduced into a number of Intel's processors. Intel calls the functionality Hyper-Threading Technology (HTT), and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent, and some programs actually slow down slightly when HTT is turned on, due to increased contention for resources such as bandwidth, caches, TLBs, and re-order buffer entries.

This is generally the case for poorly written data access routines that cause high-latency intercache transactions (cache thrashing) on multi-processor systems. Programs written before multiprocessor and multicore designs were prevalent commonly did not optimize cache access, because on a single-CPU system there is only a single cache, which is always coherent with itself. On a multiprocessor system, each CPU or core will typically have its own cache, which is interlinked with the caches of the other CPUs/cores in the system to maintain cache coherency. If thread A accesses a memory location and thread B then accesses an adjacent memory location from another cache, an intercache transaction can result wherever the cache line covers more than the bytes actually accessed, as is the case for all modern processors.
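
The cache-line effect described above can be seen with a small POSIX-threads sketch (illustrative only: the 64-byte line size is an assumption about typical hardware, the increments are deliberately non-atomic, and the two threads need to land on cores with separate caches for the contrast to show). Two threads each increment their own counter, first with the counters packed into one cache line and then with enough padding to give each counter a line of its own; the padded version avoids the intercache ping-pong and normally runs much faster:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000L
    #define LINE  64             /* assumed cache line size in bytes */

    /* Two counters packed into one (likely shared) cache line... */
    static struct { volatile long a, b; } packed;

    /* ...and two counters padded so each sits in a cache line of its own. */
    static struct { volatile long a; char pad[LINE]; volatile long b; } padded;

    /* Increment one counter many times; not atomic, but only timing matters. */
    static void *bump(void *p) {
        volatile long *ctr = p;
        for (long i = 0; i < ITERS; i++) (*ctr)++;
        return NULL;
    }

    /* Run two bump threads on the two given counters and time them. */
    static double run(volatile long *x, volatile long *y) {
        pthread_t t1, t2;
        struct timespec s, e;
        clock_gettime(CLOCK_MONOTONIC, &s);
        pthread_create(&t1, NULL, bump, (void *)x);
        pthread_create(&t2, NULL, bump, (void *)y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        clock_gettime(CLOCK_MONOTONIC, &e);
        return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
    }

    int main(void) {
        printf("same cache line:      %.2f s\n", run(&packed.a, &packed.b));
        printf("separate cache lines: %.2f s\n", run(&padded.a, &padded.b));
        return 0;
    }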

The latest MIPS architecture designs include an SMT system known as "MIPS MT". MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI, a Cupertino-based startup, is the first MIPS vendor to provide a processor SoC based on eight cores, each of which runs four threads. The threads can be run in a fine-grained mode, in which a different thread can be executed each cycle. The threads can also be assigned priorities.

The IBM POWER5, announced in May 2004, comes as either a dual-core dual-chip module (DCM) or a quad-core or eight-core multi-chip module (MCM), with each core including a two-thread SMT engine. IBM's implementation is more sophisticated than the previous ones because it can assign a different priority to each thread, is more fine-grained, and allows the SMT engine to be turned on and off dynamically, to better execute those workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading.

Although many people have reported that Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the upcoming processor codenamed "Rock" (to be launched ~2009) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara does not actually use SMT. Sun refers to these combined approaches as "CMT", and to the overall concept as "Throughput Computing". Niagara has eight cores per chip, but each core has only one pipeline, so it actually uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round-robin policy to issue instructions from the next active thread each cycle, which makes it more similar to a barrel processor. Sun Microsystems' Rock processor is different: it has more complex cores, each with more than one pipeline.

The Intel Atom, released in 2008, is the first Intel product to feature SMT (marketed as Hyper-threading) without supporting instruction reordering, speculative execution, or register renaming.

See also

  • Thread (computer science), the fundamental software entity scheduled by the operating system kernel to execute on a CPU or processor (core)
  • Symmetric multiprocessing, where the system (or partition of a larger computer hardware platform) contains more than one CPU or processor (core) and where the operating system kernel may schedule a given thread to execute on any of the available CPUs (cores)

References

  1. http://www.theregister.co.uk/2007/12/14/sun_rock_delays/
  • L.E. Shar and E.S. Davidson, "A Multiminiprocessor System Implemented through Pipelining", Computer, February 1974.
  • D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", in 22nd Annual International Symposium on Computer Architecture, June 1995.
  • D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and R.L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", in 23rd Annual International Symposium on Computer Architecture, May 1996.
