Revision as of 14:33, 13 October 2016 by 194.100.106.190 (talk): Use 'octet' instead of 'byte' to avoid confusion.
Latest revision as of 15:59, 25 November 2018 by 174.254.130.36 (talk): Now redirects to the specific section for this RFC. (Tag: Redirect target changed)
(15 intermediate revisions by 11 users not shown)
Line 1:
|
|
#REDIRECT ] |
|
{{original research|date=February 2013}} |
|
|
{{Refimprove|date=July 2010}} |
|
|
'''UTF-9''' and '''UTF-18''' (9- and 18-bit Unicode Transformation Format, respectively) were two April Fools' Day RFC joke specifications for encoding Unicode on systems where the nonet (nine-bit group) is a better fit for the native word size than the octet, such as the 36-bit PDP-10 and the UNIVAC 1100/2200 series. Both encodings were specified in RFC 4042, written by Mark Crispin (inventor of IMAP) and released on April 1, 2005. The encodings suffer from a number of flaws, and their author confirmed that they were intended as a joke.<ref>{{cite web|url=http://panda.com/mrc/|title=Mark Crispin's Web Page|accessdate=2006-09-17}} Points out April Fools' Day for two of his RFCs.</ref>
|
|
|
|
|
|
|
{{Rcat shell| |
|
However, unlike some of the "specifications" given in other April 1 RFCs, UTF-9 and UTF-18 are technically possible to implement, and have in fact been implemented in PDP-10 assembly language. They are, however, not endorsed by the Unicode Consortium.
|
|
|
{{R to related topic}} |
|
|
|
|
|
{{Rwh}} |
|
== Technical details == |
|
|
}} |
|
Like the variable-width encoding commonly called UTF-8, UTF-9 is built from ''code units'' that carry a continuation marker: it puts an octet in the low 8 bits of each nonet and uses the high bit to indicate continuation. This means that ASCII and Latin-1 characters take one nonet each, the rest of the BMP characters take two nonets each, and non-BMP code points take three. Code points that require multiple nonets are stored starting with the most significant non-zero octet.
|
|
|
|
|
This table shows the UTF-9 encoding scheme (the x characters are replaced by the bits of the code point): |
|
|
{| class="wikitable"
|-
!Number<br>of nonets!!Bits for<br>code point!!First<br>code point!!Last<br>code point!!Nonet 1!!Nonet 2!!Nonet 3
|-
|style="text-align: center;"|1
|style="text-align: center;"|8
|style="text-align: right;"|U+0000
|style="text-align: right;"|U+00FF
|<code>0xxxxxxxx</code>
|-
|style="text-align: center;"|2
|style="text-align: center;"|16
|style="text-align: right;"|U+0100
|style="text-align: right;"|U+FFFF
|<code>1xxxxxxxx</code>||<code>0xxxxxxxx</code>
|-
|style="text-align: center;"|3
|style="text-align: center;"|21
|style="text-align: right;"|U+10000
|style="text-align: right;"|U+10FFFF
|<code>1000xxxxx</code>||<code>1xxxxxxxx</code>||<code>0xxxxxxxx</code>
|}
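
The scheme in the table can be sketched in a few lines of Python. This is an illustrative sketch only, not the PDP-10 implementation mentioned above: the function names <code>utf9_encode</code> and <code>utf9_decode</code> are hypothetical, and each nonet is modelled as a plain integer in the range 0 to 0x1FF.

```python
from typing import List


def utf9_encode(code_point: int) -> List[int]:
    """Encode one code point as a list of nonets (ints in 0..0x1FF).

    Hypothetical helper for illustration; nonets are modelled as ints.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # Split the code point into octets, least significant first.
    octets = []
    while True:
        octets.append(code_point & 0xFF)
        code_point >>= 8
        if code_point == 0:
            break
    # Store starting with the most significant non-zero octet.
    octets.reverse()
    # Set the high (continuation) bit on every nonet except the last.
    return [0x100 | o for o in octets[:-1]] + [octets[-1]]


def utf9_decode(nonets: List[int]) -> List[int]:
    """Decode a list of nonets back into a list of code points."""
    code_points = []
    value = 0
    pending = False  # True while in the middle of a multi-nonet sequence
    for nonet in nonets:
        value = (value << 8) | (nonet & 0xFF)
        if nonet & 0x100:
            pending = True
        else:
            code_points.append(value)
            value = 0
            pending = False
    if pending:
        raise ValueError("truncated sequence: last nonet has its high bit set")
    return code_points
```

For example, <code>utf9_encode(0x100)</code> yields the two nonets <code>0x101</code> and <code>0x000</code>, matching the second row of the table.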
|
|
|
|
|
The details of the encoding scheme used in UTF-9 differ from UTF-8 in a non-ideal way: UTF-9 is not self-synchronizing, so the end of a longer sequence can be confused with a shorter sequence. For instance, <code>U+0041</code> is represented in octal as <code>101</code> and <code>U+E0041</code> as <code>416 400 101</code>, so the last nonet of the longer sequence is identical to the whole of the shorter one. This stems from the lack of distinction between the beginning of a sequence and the subsequent continuation nonets, as both simply have their most significant bit set, and the lack of distinction between a one-nonet sequence and the last nonet of a multi-nonet sequence. In contrast, in UTF-8, the three different kinds of octets are trivially distinguishable from each other, making the scheme self-synchronizing. Searching within a UTF-9 encoded string or splitting one requires special care, as it is necessary to scan backwards to find the beginning of the current sequence.
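
The false match in the example can be reproduced with a short Python sketch. The helper names (<code>naive_find</code>, <code>sequence_start</code>, <code>aligned_find</code>) are hypothetical, nonets are modelled as plain integers, and the search pattern is assumed to consist of whole encoded code points.

```python
def naive_find(haystack, needle):
    """Return the first index where needle occurs as a sub-list, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def sequence_start(nonets, i):
    """Scan backwards from index i to the first nonet of its sequence.

    A nonet begins a sequence when it is the first nonet of the string or
    when the preceding nonet has its high (continuation) bit clear.
    """
    while i > 0 and nonets[i - 1] & 0x100:
        i -= 1
    return i


def aligned_find(haystack, needle):
    """Like naive_find, but only accept matches that begin a sequence."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle and sequence_start(haystack, i) == i:
            return i
    return -1


u_e0041 = [0o416, 0o400, 0o101]  # U+E0041, octal values from the text
u_0041 = [0o101]                 # U+0041

false_hit = naive_find(u_e0041, u_0041)    # matches inside U+E0041
true_hit = aligned_find(u_e0041, u_0041)   # no real occurrence of U+0041
```

Here the naive search reports a match at the last nonet of <code>U+E0041</code>, while the aligned search rejects it because scanning backwards shows that the nonet belongs to a sequence that started earlier.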
|
|
|
|
|
UTF-18 is a fixed-length encoding using an 18-bit integer per code point. This allows representation of four planes, which are mapped to the four planes currently used by Unicode (planes 0–2 and 14). This means that the two private use planes (15 and 16) and the currently unused planes (3–13) are not supported. The UTF-18 specification does not say why surrogates are not allowed for these code points, although earlier in the RFC, when discussing UTF-16, it says "This transformation format requires complex surrogates to represent code points outside the BMP". Having complained about their complexity, the RFC would have looked inconsistent had it used surrogates in its own format. It is unlikely that planes 3–13 will be assigned by the Unicode Consortium in the foreseeable future. Thus UTF-18, like UCS-2 and UCS-4, guarantees a fixed width for all code points (although not for all glyphs).
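
A minimal Python sketch of the plane mapping follows. The text above only says that four planes are representable; the specific layout used here (planes 0–2 stored unchanged, plane 14 moved into the fourth 64K block, 0x30000–0x3FFFF) is an assumption based on a reading of RFC 4042, and the function names are hypothetical.

```python
def utf18_encode(code_point: int) -> int:
    """Map a code point to an 18-bit value (assumed RFC 4042 layout)."""
    if 0x0000 <= code_point <= 0x2FFFF:
        # Planes 0-2 are stored unchanged.
        return code_point
    if 0xE0000 <= code_point <= 0xEFFFF:
        # Plane 14 is relocated to the fourth 64K block (assumption).
        return code_point - 0xE0000 + 0x30000
    # Planes 3-13, 15 and 16 have no representation.
    raise ValueError("code point not representable in UTF-18")


def utf18_decode(value: int) -> int:
    """Map an 18-bit value back to a code point."""
    if not 0 <= value <= 0x3FFFF:
        raise ValueError("value does not fit in 18 bits")
    if value <= 0x2FFFF:
        return value
    return value - 0x30000 + 0xE0000
```

Under this assumed layout, <code>U+E0041</code> becomes <code>0x30041</code>, while a private use code point such as <code>U+F0000</code> is rejected.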
|
|
|
|
|
== See also == |
|
|
* ] |
|
|
* ] |
|
|
* ] |
|
|
|
|
|
== Notes == |
|
|
{{Reflist}} |
|
|
|
|
|
== External links == |
|
|
* RFC 4042: UTF-9 and UTF-18 Efficient Transformation Formats of Unicode |
|
|
|
|
|
{{character encoding}} |
|
|
{{IETF RFC 1st april}} |
|
|
|
|
|
|
{{DEFAULTSORT:Utf-09 And Utf-18}} |
|
|
] |
|
] |
|
] |
|
|
] |
|
|
] |
|
] |