SSE2: Difference between revisions

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.
Revision as of 16:35, 25 November 2010 by Puffin (minor edit): Reverted edits by 190.202.129.33 to last revision by Neard (HG)
Latest revision as of 08:21, 14 August 2024 by 217.100.173.130: See also: Fix link to SSE2 instruction list
(123 intermediate revisions by 94 users not shown)
Line 1: Line 1:
{{Short description|Intel SIMD processor supplementary instruction sets introduced by Intel}}
{{Unreferenced|date=November 2008}}
{{Multiple issues|
'''SSE2''', '''Streaming SIMD Extensions 2''', is one of the Intel SIMD (Single Instruction, Multiple Data) processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully supplant MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Rival chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.
{{Refimprove|date=May 2013}}
{{Context|date=April 2023}}
}}
'''SSE2''' ('''Streaming SIMD Extensions 2''') is one of the Intel SIMD (Single Instruction, Multiple Data) processor supplementary instruction sets introduced by Intel with the initial version of the Pentium 4 in 2000. SSE2 instructions operate on the XMM (SIMD) registers of x86 instruction set architecture processors. Each of these registers holds 128 bits of data, and a single instruction, such as a vector addition or multiplication, operates on all of the packed elements at once.


SSE2 introduced double-precision floating point instructions in addition to the single-precision floating point and integer instructions found in SSE. SSE2 extends the earlier SSE instruction set by adding 144 new instructions to the previous 70 instructions. SSE2 is intended to fully replace MMX, a SIMD instruction set found on IA-32 architecture processors. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.
==Changes==
SSE2 extends MMX instructions to operate on XMM registers, allowing the programmer to completely avoid the eight 64-bit MMX registers "aliased" on the original IA-32 floating point register stack. This permits mixing integer SIMD and scalar floating point operations without the mode switching required between MMX and x87 floating point operations. However, this is overshadowed by the value of being able to perform MMX operations on the wider SSE registers.


SSE2 was extended to create SSE3 in 2004, and extended once again to create SSE4 in 2006.
Other SSE2 extensions include a set of cache-control instructions intended primarily to minimize cache pollution when processing indefinite streams of information, and a sophisticated complement of numeric format conversion instructions.

==Features==
Most of the SSE2 instructions implement the integer vector operations also found in MMX. Instead of the MMX registers they use the XMM registers, which are wider and allow for significant performance improvements in specialized applications. Another advantage of replacing MMX with SSE2 is that it avoids the mode-switching penalty incurred when issuing x87 instructions after MMX code, which exists because MMX shares register space with the x87 FPU. SSE2 also complements the floating-point vector operations of the SSE instruction set by adding support for the double-precision data type.

Other SSE2 extensions include a set of cache control instructions intended primarily to minimize cache pollution when processing infinite streams of information, and a sophisticated complement of numeric format conversion instructions.


AMD's implementation of SSE2 on the AMD64 (x86-64) platform includes an additional eight registers, doubling the total number to 16 (XMM0 through XMM15). These additional registers are only visible when running in 64-bit mode. Intel adopted these additional registers as part of their support for x86-64 architecture (or in Intel's parlance, "Intel 64") in 2004.


==Differences between x87 FPU and SSE2==
The FPU (x87) instructions usually store intermediate results with 80 bits of precision. When legacy FPU software algorithms are ported to SSE2, certain combinations of math operations or input datasets can result in measurable numerical deviation. This is of critical importance to scientific computations, if the calculation results must be compared against results generated from a different machine architecture. FPU (x87) instructions provide higher precision by calculating intermediate results with 80 bits of precision, by default, to minimise roundoff error in numerically unstable algorithms (see IEEE 754 design rationale and references therein). However, the x87 FPU is a scalar unit only whereas SSE2 can process a small vector of operands in parallel.


If code designed for x87 is ported to the lower-precision double-precision SSE2 floating point, certain combinations of math operations or input datasets can result in measurable numerical deviation, which can be an issue in reproducible scientific computations, e.g. if the calculation results must be compared against results generated from a different machine architecture. A related issue is that, historically, language standards and compilers had been inconsistent in their handling of the x87 80-bit registers implementing double extended precision variables, compared with the double and single precision formats implemented in SSE2: the rounding of extended precision intermediate values to double precision variables was not fully defined and was dependent on implementation details such as when registers were spilled to memory.
Depending on the compiler or interpreter (and optimizations) used, different intermediate results of a given mathematical expression or iterative algorithm may need to be temporarily saved, and later reloaded. SSE2 works with either 32 or 64 bits (4 or 8 bytes) of precision, while x87 instructions normally produce 80-bit results in the 80-bit x87 registers (10 bytes). All 80 bits of an x87 result may be stored in memory, but the result is nevertheless often rounded to 64 or 32 bits for compatibility with the most common floating point data types. Depending on the precision as well as on ''when'' such roundings are performed, the numerical results may differ. Similar differences can be seen when comparing results from 32 or 64-bit precision SSE2 code with corresponding results of 32, 64, or 80-bit precision x87 code. The following Fortran code compiled with G95 is offered as an example; the exact value of the third and final number printed is zero.

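      program hi
!     y and z are the same expression written two ways, so their
!     difference x is exactly zero in exact arithmetic.
      real a,b,c,d
      real x,y,z
      a=.013
      b=.027
      c=.0937
      d=.79
!     First form: negation applied to the quotient a/b.
      y=-a/b + (a/b+c)*EXP(d)
      print *,y
!     Second form: negation applied to a before the division.
      z=(-a)/b + (a/b+c)*EXP(d)
      print *,z
!     Any nonzero x comes from rounding at different precisions.
      x=y-z
      print *,x
      end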

Compiling to 387 floating point instructions and running yields:
# g95 -o hi -mfpmath=387 -fzero -ftrace=full -fsloppy-char hi.for
# ./hi
0.78587145
0.7858714
5.9604645E-8

Compiling to SSE2 instructions and running yields:
# g95 -o hi -mfpmath=sse -msse2 -fzero -ftrace=full -fsloppy-char hi.for
# ./hi
0.78587145
0.78587145
0.


==Differences between MMX and SSE2==
SSE2 extends MMX instructions to operate on XMM registers. Therefore, it is possible to convert all existing MMX code to SSE2 equivalent. Since an XMM register is twice as long as an MMX register, loop counters and memory access may need to be changed to accommodate this. SSE2 extends MMX instructions to operate on XMM registers. Therefore, it is possible to convert all existing MMX code to an SSE2 equivalent. Since an SSE2 register is twice as long as an MMX register, loop counters and memory access may need to be changed to accommodate this. However, 8-byte loads and stores to XMM are available, so this is not strictly required.


Although one SSE2 instruction can operate on twice as much data as an MMX instruction, performance might not increase significantly. Two major reasons are: accessing SSE2 data in memory not aligned to a 16-byte boundary will incur significant penalty, and the throughput of SSE2 instructions in most x86 implementations is usually smaller than MMX instructions. Intel has recently addressed the first problem by adding an instruction in SSE3 to reduce the overhead of accessing unaligned data, and the last problem by widening the execution engine in their Core microarchitecture. Although one SSE2 instruction can operate on twice as much data as an MMX instruction, performance might not increase significantly. Two major reasons are: accessing SSE2 data in memory not aligned to a 16-byte boundary can incur a significant penalty, and the throughput of SSE2 instructions in older x86 implementations was half that of MMX instructions. Intel addressed the first problem by adding an instruction in SSE3 to reduce the overhead of accessing unaligned data and by improving the overall performance of misaligned loads, and the second problem by widening the execution engine in the Core microarchitecture used in Core 2 Duo and later products.

Since MMX and x87 register files alias one another, using MMX will prevent x87 instructions from working as desired. Once MMX has been used, the programmer must use the emms instruction (C: _mm_empty()) to restore operation to the x87 register file. On some operating systems, x87 is not used very much, but may still be used in some critical areas like pow() where the extra precision is needed. In such cases, the corrupt floating-point state caused by failure to emit emms may go undetected for millions of instructions before ultimately causing the floating-point routine to fail, returning NaN. Since the problem is not locally apparent in the MMX code, finding and correcting the bug can be very time-consuming. As SSE2 does not have this problem, usually provides much better throughput, and provides more registers in 64-bit code, it should be preferred for nearly all vectorization work.


==Compiler usage==
When first introduced in 2000, SSE2 was not supported by software development tools. For example, to use SSE2 in a Microsoft Visual Studio project, the programmer had to either manually write inline-assembly or import object-code from an external source. Later the Visual C++ Processor Pack added SSE2 support to Visual C++ and MASM. When introduced in 2000, SSE2 was not supported by software development tools. For example, to use SSE2 in a Microsoft Visual Studio project, the programmer had to either manually write inline-assembly or import object-code from an external source. Later the Visual C++ Processor Pack added SSE2 support to Visual C++ and MASM.


The Intel C++ Compiler can automatically generate SSE4/SSSE3/SSE3/SSE2 and/or SSE-code without the use of hand-coded assembly, letting programmers focus on algorithmic development instead of assembly-level implementation. Since its introduction, the Intel C Compiler has greatly increased adoption of SSE2 in Windows application development. The Intel C++ Compiler can automatically generate SSE4, SSSE3, SSE3, SSE2, and SSE code without the use of hand-coded assembly.


Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 has been added since GCC 4.
Line 57: Line 38:
The Sun Studio Compiler Suite can also generate SSE2 instructions when the compiler flag -xvector=simd is used.


Since Microsoft Visual C++ 2012, the compiler option to generate SSE2 instructions is turned on by default.
==CPUs supporting SSE2==
* ]-based CPUs (], ], ], etc)
* ] CPUs
* ] ]-based CPUs (], ], ], ], etc)
* ] ] and ]
* ]-based CPUs (Core Duo, Core Solo, etc)
* ]-based CPUs (Core 2 Duo, Core 2 Quad, etc)
* ]
* ]
* ] ]
* ] ]
* ] ]
* ] ]


==CPU support==
==Notable IA-32 CPUs not supporting SSE2==
SSE2 is an extension of the ] architecture, based on the ]. Therefore, only x86 processors can include SSE2. The ] architecture supports the ] as a compatibility mode and includes the SSE2 in its specification.<ref>{{cite web|last=Matz|first=Michael|title=System V Application Binary Interface - AMD64 Architecture Processor Supplement - Draft Version 0.99.4|url=https://www.cs.washington.edu/education/courses/351/12wi/supp-docs/abi.pdf|access-date=April 26, 2013|author2=Hubicka, Jan|author3=Jaeger, Andreas|author4=Mitchell, Mark|date=January 2010}}{{Dead link|date=December 2021 |bot=InternetArchiveBot |fix-attempted=yes }}</ref><ref>{{cite web|last=Fog|first=Agner|title=Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms|url=http://www.agner.org/optimize/optimizing_cpp.pdf|access-date=April 26, 2013|archive-date=April 8, 2013|archive-url=https://web.archive.org/web/20130408125402/http://www.agner.org/optimize/optimizing_cpp.pdf|url-status=live}}</ref> It also doubles the number of XMM registers, allowing for better performance. 
SSE2 is also a requirement for installing Windows 8<ref>{{cite web|url=https://docs.microsoft.com/en-us/windows/desktop/dxmath/pg-xnamath-internals|title=DirectXMath Programming Guide/Library Internals|access-date=July 2, 2019|archive-date=July 2, 2019|archive-url=https://web.archive.org/web/20190702024151/https://docs.microsoft.com/en-us/windows/desktop/dxmath/pg-xnamath-internals|url-status=live}}</ref> (and later) or Microsoft Office 2013 (and later) "to enhance the reliability of third-party apps and drivers running in Windows 8".<ref>{{cite web|url=http://windows.microsoft.com/en-GB/windows-8/what-is-pae-nx-sse2|title=What is PAE, NX, and SSE2 and why does my PC need to support them to run Windows 8 ?|last=Microsoft Corporation|archive-url=https://web.archive.org/web/20130411004411/http://windows.microsoft.com/en-GB/windows-8/what-is-pae-nx-sse2|archive-date=April 11, 2013|access-date=March 19, 2013}}</ref>
SSE2 is an extension of the ] architecture. Therefore any architecture that does not support IA-32 does not support SSE2. ] CPUs all implement ]. All known ] CPUs also implement SSE2. Since IA-32 predates SSE2, early IA-32 CPUs did not implement it. SSE2 and the other SIMD instruction sets were intended primarily to improve CPU support for realtime graphics, notably gaming. A CPU that is not marketed for this purpose or that has an alternative SIMD instruction set has no need for SSE2.


The following CPUs implemented IA-32 after SSE2 was developed, but did not implement SSE2: The following IA-32 CPUs support SSE2:


* ] CPUs prior to ], including all ]-based CPUs * ] ]-based CPUs (], ], ], ], ])
* ] CPUs prior to ] * Intel ] and ]
* ]
* ]
* ]
* ]

The following IA-32 CPUs were released after SSE2 was developed, but did not implement it:

* ] CPUs prior to ], such as ]
* ] * ]
* ] * ]
* ]


== See also == ==See also==
* ] * ]

==References==
{{Reflist}}


{{Multimedia extensions}} {{Multimedia extensions}}
Line 88: Line 69:
Latest revision as of 08:21, 14 August 2024

Intel SIMD processor supplementary instruction sets introduced by Intel

SSE2 (Streaming SIMD Extensions 2) is one of the Intel SIMD (Single Instruction, Multiple Data) processor supplementary instruction sets introduced by Intel with the initial version of the Pentium 4 in 2000. SSE2 instructions operate on the XMM (SIMD) registers of x86 instruction set architecture processors. Each of these registers holds 128 bits of data, and a single instruction, such as a vector addition or multiplication, operates on all of the packed elements at once.
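As a rough illustration of one instruction acting on a whole XMM register, the following C sketch adds two arrays of doubles two elements at a time. It assumes an x86 compiler that ships the `<emmintrin.h>` intrinsics header; the function name `add_doubles` is hypothetical and chosen only for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Adds two arrays of doubles element-wise, two elements per SSE2 instruction.
   add_doubles is an illustrative name, not part of any library. */
void add_doubles(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);            /* load 2 doubles, any alignment */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));  /* ADDPD: both lanes at once */
    }
    for (; i < n; ++i)                               /* scalar tail when n is odd */
        out[i] = a[i] + b[i];
}
```

The unaligned load and store forms are used here so that the sketch works for any input address.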

SSE2 introduced double-precision floating point instructions in addition to the single-precision floating point and integer instructions found in SSE. SSE2 extends the earlier SSE instruction set by adding 144 new instructions to the previous 70 instructions. SSE2 is intended to fully replace MMX, a SIMD instruction set found on IA-32 architecture processors. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.

SSE2 was extended to create SSE3 in 2004, and extended once again to create SSE4 in 2006.

Features

Most of the SSE2 instructions implement the integer vector operations also found in MMX. Instead of the MMX registers they use the XMM registers, which are wider and allow for significant performance improvements in specialized applications. Another advantage of replacing MMX with SSE2 is that it avoids the mode-switching penalty incurred when issuing x87 instructions after MMX code, which exists because MMX shares register space with the x87 FPU. SSE2 also complements the floating-point vector operations of the SSE instruction set by adding support for the double-precision data type.
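The integer side can be sketched in C as well. The example below performs a saturating add of 16 unsigned bytes in a single XMM operation, an operation that MMX offers on 8 bytes at a time. The function name `saturating_add_u8x16` is hypothetical, and an x86 compiler with `<emmintrin.h>` is assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds 16 unsigned bytes with saturation: sums above 255 clamp to 255.
   saturating_add_u8x16 is an illustrative name. */
void saturating_add_u8x16(const uint8_t a[16], const uint8_t b[16],
                          uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PADDUSB on an XMM register processes all 16 lanes at once. */
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```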

Other SSE2 extensions include a set of cache control instructions intended primarily to minimize cache pollution when processing infinite streams of information, and a sophisticated complement of numeric format conversion instructions.
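One of the cache control instructions is the non-temporal store, which writes data to memory with a hint that it should not be kept in the cache. The C sketch below fills a buffer this way. It assumes that `dst` is 16-byte aligned and that `n` is a multiple of 4; `stream_fill_i32` is a hypothetical name used only here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fills dst with value using non-temporal (streaming) stores.
   Assumption: dst is 16-byte aligned and n is a multiple of 4. */
void stream_fill_i32(int32_t *dst, size_t n, int32_t value)
{
    __m128i v = _mm_set1_epi32(value);              /* broadcast value to 4 lanes */
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);  /* MOVNTDQ: cache-bypassing hint */
    _mm_sfence();  /* order the streaming stores before later stores */
}
```

Whether this is faster than ordinary stores depends on the buffer size and the processor, so it is best treated as something to measure rather than assume.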

AMD's implementation of SSE2 on the AMD64 (x86-64) platform includes an additional eight registers, doubling the total number to 16 (XMM0 through XMM15). These additional registers are only visible when running in 64-bit mode. Intel adopted these additional registers as part of their support for x86-64 architecture (or in Intel's parlance, "Intel 64") in 2004.

Differences between x87 FPU and SSE2

FPU (x87) instructions provide higher precision by calculating intermediate results with 80 bits of precision, by default, to minimise roundoff error in numerically unstable algorithms (see IEEE 754 design rationale and references therein). However, the x87 FPU is a scalar unit only whereas SSE2 can process a small vector of operands in parallel.

If code designed for x87 is ported to the lower-precision double-precision SSE2 floating point, certain combinations of math operations or input datasets can result in measurable numerical deviation, which can be an issue in reproducible scientific computations, e.g. if the calculation results must be compared against results generated from a different machine architecture. A related issue is that, historically, language standards and compilers had been inconsistent in their handling of the x87 80-bit registers implementing double extended precision variables, compared with the double and single precision formats implemented in SSE2: the rounding of extended precision intermediate values to double precision variables was not fully defined and was dependent on implementation details such as when registers were spilled to memory.
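The effect of intermediate precision can be sketched in C by evaluating the same expression once in `double` and once in `long double`. This assumes a toolchain where `long double` is wider than `double` (for example the 80-bit x87 format used by GCC and Clang on x86); where the two types are identical, both functions return the same value. All function names are hypothetical.

```c
#include <float.h>

/* Returns 1 when long double carries more mantissa bits than double. */
int long_double_is_wider(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}

/* (1e16 + 1) - 1e16 with every intermediate rounded to double.
   volatile forces each value to be stored in its declared type. */
double diff_in_double(void)
{
    volatile double big = 1e16;
    volatile double t = big + 1.0;   /* 1e16 + 1 is not representable in double */
    return t - big;
}

/* The same expression with intermediates kept in long double. */
long double diff_in_extended(void)
{
    volatile long double big = 1e16L;
    volatile long double t = big + 1.0L;  /* exact with a 64-bit mantissa */
    return t - big;
}
```

With 64-bit doubles the first function loses the added 1 to rounding, while a wider intermediate format preserves it, which is the kind of deviation the paragraph above describes.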

Differences between MMX and SSE2

SSE2 extends MMX instructions to operate on XMM registers. Therefore, it is possible to convert all existing MMX code to an SSE2 equivalent. Since an SSE2 register is twice as long as an MMX register, loop counters and memory access may need to be changed to accommodate this. However, 8-byte loads and stores to XMM are available, so this is not strictly required.
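The 8-byte loads and stores mentioned above can be sketched in C as follows. The example keeps the 8-byte data layout of an MMX routine while doing the arithmetic in an XMM register. The function name `add_u8x8` is hypothetical, and an x86 compiler with `<emmintrin.h>` is assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds 8 bytes (wrapping on overflow) using only the low half of an
   XMM register, so an MMX-sized loop does not need restructuring. */
void add_u8x8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    __m128i va = _mm_loadl_epi64((const __m128i *)a);  /* MOVQ: low 8 bytes, upper half zeroed */
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);
    _mm_storel_epi64((__m128i *)out, _mm_add_epi8(va, vb));  /* store low 8 bytes only */
}
```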

Although one SSE2 instruction can operate on twice as much data as an MMX instruction, performance might not increase significantly. Two major reasons are: accessing SSE2 data in memory not aligned to a 16-byte boundary can incur a significant penalty, and the throughput of SSE2 instructions in older x86 implementations was half that of MMX instructions. Intel addressed the first problem by adding an instruction in SSE3 to reduce the overhead of accessing unaligned data and by improving the overall performance of misaligned loads, and the second problem by widening the execution engine in the Core microarchitecture used in Core 2 Duo and later products.
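The alignment requirement shows up in C as two different load intrinsics. The sketch below checks the address at run time and picks the aligned form only when it is legal. The function name `sum4_i32` is hypothetical, and real code would normally decide alignment once per buffer rather than per load.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sums four consecutive 32-bit integers starting at p. */
int32_t sum4_i32(const int32_t *p)
{
    __m128i v = ((uintptr_t)p % 16 == 0)
        ? _mm_load_si128((const __m128i *)p)    /* MOVDQA: needs 16-byte alignment */
        : _mm_loadu_si128((const __m128i *)p);  /* MOVDQU: works at any address */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);      /* spill lanes to add them in scalar code */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Passing a misaligned address to the aligned form raises a fault, which is why the unaligned form is the safe default when alignment is unknown.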

Since MMX and x87 register files alias one another, using MMX will prevent x87 instructions from working as desired. Once MMX has been used, the programmer must use the emms instruction (C: _mm_empty()) to restore operation to the x87 register file. On some operating systems, x87 is not used very much, but may still be used in some critical areas like pow() where the extra precision is needed. In such cases, the corrupt floating-point state caused by failure to emit emms may go undetected for millions of instructions before ultimately causing the floating-point routine to fail, returning NaN. Since the problem is not locally apparent in the MMX code, finding and correcting the bug can be very time-consuming. As SSE2 does not have this problem, usually provides much better throughput, and provides more registers in 64-bit code, it should be preferred for nearly all vectorization work.
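A minimal C sketch of the emms requirement is shown below. It assumes a compiler that still provides the MMX intrinsics in `<mmintrin.h>` (GCC and Clang do; some 64-bit toolchains do not), and the function name `add_i16_pairs_mmx` is hypothetical.

```c
#include <mmintrin.h>  /* MMX intrinsics */

/* Adds two pairs of 16-bit lanes, each pair packed into a 32-bit int. */
int add_i16_pairs_mmx(int a, int b)
{
    __m64 va = _mm_cvtsi32_si64(a);   /* move into an MMX register */
    __m64 vb = _mm_cvtsi32_si64(b);
    int result = _mm_cvtsi64_si32(_mm_add_pi16(va, vb));  /* PADDW */
    _mm_empty();  /* EMMS: release the aliased registers before any x87 code runs */
    return result;
}
```

An SSE2 version of the same routine would use `__m128i` values and needs no such cleanup call.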

Compiler usage

When introduced in 2000, SSE2 was not supported by software development tools. For example, to use SSE2 in a Microsoft Visual Studio project, the programmer had to either manually write inline-assembly or import object-code from an external source. Later the Visual C++ Processor Pack added SSE2 support to Visual C++ and MASM.

The Intel C++ Compiler can automatically generate SSE4, SSSE3, SSE3, SSE2, and SSE code without the use of hand-coded assembly.

Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 was added in GCC 4.

The Sun Studio Compiler Suite can also generate SSE2 instructions when the compiler flag -xvector=simd is used.

Since Microsoft Visual C++ 2012, the compiler option to generate SSE2 instructions is turned on by default.
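Source code can check at compile time whether the compiler has been told to generate SSE2 instructions. The sketch below relies on predefined macros: `__SSE2__` for GCC and Clang, and `_M_X64` or `_M_IX86_FP` for Microsoft Visual C++. The function name `built_with_sse2` is hypothetical.

```c
/* Returns 1 when this translation unit was compiled with SSE2 code
   generation enabled, 0 otherwise. */
int built_with_sse2(void)
{
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return 1;   /* the compiler may emit SSE2 instructions */
#else
    return 0;   /* SSE2 code generation is not enabled */
#endif
}
```

This reports only what the compiler was allowed to emit, not what the processor running the program supports.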

CPU support

SSE2 is an extension of the IA-32 architecture, based on the x86 instruction set. Therefore, only x86 processors can include SSE2. The AMD64 architecture supports IA-32 as a compatibility mode and includes SSE2 in its specification. It also doubles the number of XMM registers, allowing for better performance. SSE2 is also a requirement for installing Windows 8 (and later) or Microsoft Office 2013 (and later) "to enhance the reliability of third-party apps and drivers running in Windows 8".
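Software that must also run on older IA-32 processors can test for SSE2 at run time with the CPUID instruction, which reports SSE2 in bit 26 of EDX for leaf 1. The C sketch below assumes the `<cpuid.h>` helper header shipped with GCC and Clang; other compilers expose CPUID differently. The function name `cpu_has_sse2` is hypothetical.

```c
#include <cpuid.h>  /* GCC/Clang helper for the CPUID instruction (assumed toolchain) */

/* Returns 1 when the processor reports SSE2 support, 0 otherwise. */
int cpu_has_sse2(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                /* CPUID leaf 1 is not available */
    return (edx >> 26) & 1;      /* leaf 1, EDX bit 26 is the SSE2 feature flag */
}
```

On any x86-64 processor this returns 1, since SSE2 is part of the AMD64 specification.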

The following IA-32 CPUs support SSE2:

The following IA-32 CPUs were released after SSE2 was developed, but did not implement it:

See also

References

  1. Matz, Michael; Hubicka, Jan; Jaeger, Andreas; Mitchell, Mark (January 2010). "System V Application Binary Interface - AMD64 Architecture Processor Supplement - Draft Version 0.99.4" (PDF). Retrieved April 26, 2013.
  2. Fog, Agner. "Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms" (PDF). Archived (PDF) from the original on April 8, 2013. Retrieved April 26, 2013.
  3. "DirectXMath Programming Guide/Library Internals". Archived from the original on July 2, 2019. Retrieved July 2, 2019.
  4. Microsoft Corporation. "What is PAE, NX, and SSE2 and why does my PC need to support them to run Windows 8 ?". Archived from the original on April 11, 2013. Retrieved March 19, 2013.