MIME / IANA | ISO-10646-UTF-1 |
---|---|
Language(s) | International |
Current status | Obscure, of mainly historical interest. |
Classification | Unicode Transformation Format, extended ASCII, variable-width encoding |
Extends | US-ASCII |
Transforms / Encodes | ISO/IEC 10646 (Unicode) |
Succeeded by | UTF-8 |
UTF-1 is an obsolete method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes substring searching and error recovery difficult. It reuses the printable ASCII byte values inside multi-byte encodings, making it unsuitable for some uses (for instance, Unix filenames cannot contain the byte value used for the forward slash, 0x2F). UTF-1 is also slow to encode and decode because it requires division and multiplication by 190, a number that is not a power of 2. Because of these issues, it did not gain acceptance and was quickly replaced by UTF-8.
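The forward-slash problem can be checked against byte sequences listed in the comparison table later in this article. The Python sketch below is only illustrative: the variable names are made up here, and the UTF-1 bytes are copied from that table rather than produced by an encoder.

```python
# UTF-1 bytes for U+D7FF, copied from the comparison table in this article.
utf1_d7ff = bytes.fromhex("F7 2F C3")

# UTF-8 bytes for the same code point, produced by Python's built-in codec.
utf8_d7ff = "\ud7ff".encode("utf-8")  # ED 9F BF

# A byte-oriented tool that looks for "/" (0x2F) finds one inside the
# UTF-1 sequence, although the encoded character is not a slash.
print(b"/" in utf1_d7ff)  # True
print(b"/" in utf8_d7ff)  # False

# In UTF-8 every byte of a multi-byte sequence has the high bit set,
# so it can never collide with an ASCII character.
print(all(b >= 0x80 for b in utf8_d7ff))  # True
```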
Design
Like UTF-8, UTF-1 is a variable-width encoding that is backward-compatible with ASCII. Every code point is encoded either as a single byte or as a sequence of two, three, or five bytes. All ASCII code points (U+0000 through U+007F) are single bytes, as are the code points U+0080 through U+009F.
UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: a byte in the range 0x00–0x20 or 0x7F–0x9F always stands for the corresponding code point. This design, with its 66 protected characters, was intended to be compatible with ISO/IEC 2022.
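The count of protected bytes follows directly from the two ranges. As a minimal sketch (the names `PROTECTED` and `FREE` are chosen here for illustration and do not come from any specification):

```python
# Byte values that always stand for themselves in UTF-1,
# taken from the two ranges given in the text.
PROTECTED = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))

# Every other byte value may appear inside a multi-byte sequence.
FREE = [b for b in range(256) if b not in PROTECTED]

print(len(PROTECTED))               # 66
print(len(FREE))                    # 190
print(hex(FREE[0]), hex(FREE[-1]))  # 0x21 0xff
```

The free values form two runs, 0x21–0x7E and 0xA0–0xFF, which is why printable ASCII bytes show up inside multi-byte sequences.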
UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 2⁶ = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
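The base of 190 also explains where the length boundaries in the comparison table fall. The sketch below assumes the lead-byte ranges that can be read off that table (A1 to F5 for two-byte sequences, F6 to FB for three-byte sequences, FC upward for five-byte sequences); these ranges are inferred from the table rows, not quoted from the standard.

```python
BASE = 256 - 66  # 190 byte values usable after the lead byte

# Lead-byte ranges inferred from the comparison table (an assumption).
two_byte_count = (0xF5 - 0xA1 + 1) * BASE        # 85 * 190
three_byte_count = (0xFB - 0xF6 + 1) * BASE**2   # 6 * 190**2

# Two-byte sequences with lead bytes A1..F5 start at U+0100.
first_three_byte = 0x100 + two_byte_count
first_five_byte = first_three_byte + three_byte_count

print(hex(first_three_byte))  # 0x4016
print(hex(first_five_byte))   # 0x38e2e

# Number of five-byte lead values needed to reach U+7FFFFFFF.
remaining = 0x7FFFFFFF - first_five_byte + 1
leads_needed = -(-remaining // BASE**4)  # ceiling division
print(leads_needed)  # 2
```

The two computed boundaries match the table rows where the UTF-1 column grows from two to three bytes (U+4016) and from three to five bytes (U+38E2E), and the two required lead values correspond to the FC and FD seen in the last rows.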
code point | UTF-8 | UTF-1 |
---|---|---|
U+007F | 7F | 7F |
U+0080 | C2 80 | 80 |
U+009F | C2 9F | 9F |
U+00A0 | C2 A0 | A0 A0 |
U+00BF | C2 BF | A0 BF |
U+00C0 | C3 80 | A0 C0 |
U+00FF | C3 BF | A0 FF |
U+0100 | C4 80 | A1 21 |
U+015D | C5 9D | A1 7E |
U+015E | C5 9E | A1 A0 |
U+01BD | C6 BD | A1 FF |
U+01BE | C6 BE | A2 21 |
U+07FF | DF BF | AA 72 |
U+0800 | E0 A0 80 | AA 73 |
U+0FFF | E0 BF BF | B5 48 |
U+1000 | E1 80 80 | B5 49 |
U+4015 | E4 80 95 | F5 FF |
U+4016 | E4 80 96 | F6 21 21 |
U+D7FF | ED 9F BF | F7 2F C3 |
U+E000 | EE 80 80 | F7 3A 79 |
U+F8FF | EF A3 BF | F7 5C 3C |
U+FDD0 | EF B7 90 | F7 62 BA |
U+FDEF | EF B7 AF | F7 62 D9 |
U+FEFF | EF BB BF | F7 64 4C |
U+FFFD | EF BF BD | F7 65 AD |
U+FFFE | EF BF BE | F7 65 AE |
U+FFFF | EF BF BF | F7 65 AF |
U+10000 | F0 90 80 80 | F7 65 B0 |
U+38E2D | F0 B8 B8 AD | FB FF FF |
U+38E2E | F0 B8 B8 AE | FC 21 21 21 21 |
U+FFFFF | F3 BF BF BF | FC 21 37 B2 7A |
U+100000 | F4 80 80 80 | FC 21 37 B2 7B |
U+10FFFF | F4 8F BF BF | FC 21 39 6E 6C |
U+7FFFFFFF | FD BF BF BF BF BF | FD BD 2B B9 40 |
Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
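The rows of the table can be reproduced with a short encoder. The following Python sketch is not taken from the standard's text: the function names `utf1_encode` and `_digit_to_byte` are hypothetical, and the range boundaries and digit mapping are assumptions chosen to be consistent with the table above.

```python
BASE = 190  # 256 byte values minus the 66 protected ones


def _digit_to_byte(z: int) -> int:
    """Map a base-190 digit to a byte outside the protected ranges."""
    if not 0 <= z < BASE:
        raise ValueError("digit out of range")
    if z < 0x5E:
        return z + 0x21  # digits 0..93 map to 0x21..0x7E
    return z + 0x42      # digits 94..189 map to 0xA0..0xFF


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) as UTF-1 bytes."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0xA0:                     # ASCII, C1 controls: one byte
        return bytes([cp])
    if cp < 0x100:                    # U+00A0..U+00FF: A0 prefix
        return bytes([0xA0, cp])
    if cp < 0x4016:                   # two bytes, lead A1..F5
        y = cp - 0x100
        return bytes([0xA1 + y // BASE, _digit_to_byte(y % BASE)])
    if cp < 0x38E2E:                  # three bytes, lead F6..FB
        y = cp - 0x4016
        return bytes([
            0xF6 + y // BASE**2,
            _digit_to_byte(y // BASE % BASE),
            _digit_to_byte(y % BASE),
        ])
    y = cp - 0x38E2E                  # five bytes, lead FC and up
    return bytes([
        0xFC + y // BASE**4,
        _digit_to_byte(y // BASE**3 % BASE),
        _digit_to_byte(y // BASE**2 % BASE),
        _digit_to_byte(y // BASE % BASE),
        _digit_to_byte(y % BASE),
    ])


print(utf1_encode(0x10FFFF).hex(" ").upper())  # FC 21 39 6E 6C
```

Every step after the lead byte involves a division and a remainder by 190, which is the cost the introduction refers to; the equivalent UTF-8 steps are shifts and masks.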
References
- "The Unicode Standard: Appendix F FSS-UTF" (PDF, 768 KiB). Version 1.1. Unicode, Inc.
- ISO/IEC JTC 1/SC2/WG2 (1993-01-21). "ISO IR 178: UCS Transformation Format One (UTF-1)" (PDF, 256 KiB) (1 ed.). Registration number 178. Archived from the original (PDF) on 2015-03-18.
- Czyborra, Roman (1998-11-30). "Unicode Transformation Formats: UTF-8 & Co". Archived from the original on 2016-06-07. Retrieved 2016-06-07.
- Yergeau, F. (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. STD 63. RFC 3629.