Character encodings in HTML: Difference between revisions

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.
Revision as of 19:31, 1 March 2010 by Dlrohrer2003 (edit summary: moving the <references> tag to its own section). Revision as of 20:19, 8 March 2010 by Ms2ger (edit summary: Rewrite; Add refs).
Revision as of 20:19, 8 March 2010


Support for non-default character encodings in HTML was introduced in HTML 4 (1997), although HTML itself dates from 1991. If an HTML document includes characters outside the ASCII range, the integrity of the information and its consistent display across browsers may suffer unless the document declares the character encoding it uses.

Specifying the document's character encoding

There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:

Content-Type: text/html; charset=ISO-8859-1
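
As a rough illustration, a client could read the declared encoding out of such a header with Python's standard `email.message.Message` class. The helper name `charset_from_content_type` is invented for this sketch and is not part of any specification.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```

A `None` result corresponds to the case discussed later in this section, where the server makes no promise about the encoding.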

In HTML (but not in XHTML), it is also possible to include this information in the document itself. In this case, the following code could be added near the top of the document, inside the head element:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 also allows the following shorter syntax, which means exactly the same:

<meta charset="utf-8">
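
The two meta forms above can be recognised with a short parser. The following is a minimal sketch built on Python's `html.parser`; the class and function names are hypothetical, and it assumes the markup has already been decoded to text, whereas a real browser scans the raw bytes before the encoding is known.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first encoding declared by a meta element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 short form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Long form: the charset is a parameter inside the content attribute.
            content = (attrs.get("content") or "").lower()
            _, _, tail = content.partition("charset=")
            if tail:
                self.charset = tail.split(";")[0].strip(" \"'")


def find_meta_charset(markup):
    finder = MetaCharsetFinder()
    finder.feed(markup)
    return finder.charset


long_form = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
short_form = '<head><meta charset="utf-8"></head>'
print(find_meta_charset(long_form), find_meta_charset(short_form))  # utf-8 utf-8
```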

XML documents, including XHTML documents, can instead declare the encoding in the XML declaration at the start of the document, which is written like a processing instruction:

<?xml version="1.0" encoding="ISO-8859-1"?>
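
A declaration like this can be read from the first bytes of a file before the rest is decoded. The sketch below is a simplification: the function name is hypothetical, and it assumes an ASCII-compatible encoding, so it would not find a declaration in a UTF-16 document.

```python
import re

_QUOTE = rb"[\"']"
_XML_DECL_ENCODING = re.compile(
    rb"<\?xml\s+version\s*=\s*" + _QUOTE + rb"[^\"']*" + _QUOTE
    + rb"\s+encoding\s*=\s*" + _QUOTE + rb"([A-Za-z][A-Za-z0-9._-]*)" + _QUOTE
)


def xml_declared_encoding(data):
    """Return the encoding named in a leading XML declaration, or None."""
    match = _XML_DECL_ENCODING.match(data)
    return match.group(1).decode("ascii") if match else None


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'))
```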

Because each of these methods tells the receiver how the file being sent should be interpreted, the declaration must match the character encoding actually used. A server usually cannot know how a document is encoded, especially if documents are created on different platforms or in different regions, so many servers simply omit the "charset" parameter from the Content-Type header rather than make a false promise. However, if the document does not specify the encoding either, the result may be equally bad: the user agent displays mojibake because it cannot find out which character encoding was used.
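
The effect of a mismatched declaration is easy to reproduce. In this small sketch the sample word is arbitrary; the bytes are UTF-8, but the receiver is told they are ISO-8859-1.

```python
text = "café"
raw = text.encode("utf-8")         # the bytes actually sent: b'caf\xc3\xa9'
wrong = raw.decode("iso-8859-1")   # receiver trusts a false ISO-8859-1 label
right = raw.decode("utf-8")        # receiver is given the correct label

print(wrong)  # cafÃ©
print(right)  # café
```

The two-byte UTF-8 sequence for "é" is shown as two separate Latin-1 characters, which is the typical appearance of mojibake in Western European text.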

If a user agent reads a document with no character encoding information, it can fall back on other information. For example, it can rely on the user's settings, either browser-wide or specific to a given document, or it can pick a default encoding based on the user's language. For Western European languages, it is typical and fairly safe to assume Windows-1252, which is similar to ISO-8859-1 but has printable characters in place of some control codes. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly, and in some cases always, require characters outside that range. In CJK environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually let the user manually override an incorrect charset label.
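
One possible fallback order can be sketched as follows. The function name is hypothetical, and trying UTF-8 before Windows-1252 is an assumption of this sketch (a very crude form of auto-detection), not something the text above prescribes.

```python
def decode_with_fallback(raw, declared=None):
    """Decode bytes using the declared encoding if usable, else guess."""
    if declared:
        try:
            return raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong label: fall through to guessing
    try:
        # Assumption of this sketch: valid UTF-8 is unlikely to be accidental.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Western European default; the few undefined bytes become U+FFFD.
        return raw.decode("cp1252", errors="replace")


# Byte 0x93 is a control code in ISO-8859-1 but a curly quote in Windows-1252.
print(repr(b"\x93".decode("iso-8859-1")))       # '\x93'
print(decode_with_fallback(b"\x93quoted\x94"))  # “quoted”
```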

It is increasingly common for multilingual websites and websites in non-Western languages to use UTF-8, which allows use of the same encoding for all languages. UTF-16 or UTF-32, which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
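
The size difference for ASCII-heavy markup can be checked directly. The sample string is arbitrary, and the little-endian codec variants are used only so that no byte order mark is counted.

```python
markup = "<p>λ</p>"  # seven ASCII characters plus one Greek letter

sizes = {
    name: len(markup.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)  # {'utf-8': 9, 'utf-16-le': 16, 'utf-32-le': 32}
```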

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.

Character references

Main articles: character entity reference and numeric character reference

In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.

HTML character references

Numeric character references can be in decimal format, &#DD;, where DD is a variable number of decimal digits. Similarly there is a hexadecimal format, &#xHHHH;, where HHHH is a variable number of hexadecimal digits. Hexadecimal character references are case-insensitive in HTML. For example, the character 'λ' can be represented as &#955;, &#x03BB; or &#X03bb;.
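
As an illustrative sketch, the helpers below (hypothetical names) build both numeric forms from a character, and Python's standard `html.unescape` shows that all three spellings from the example resolve to the same character.

```python
import html


def to_decimal_ref(ch):
    """Decimal numeric character reference for a single character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch):
    """Hexadecimal numeric character reference, padded to four digits."""
    return f"&#x{ord(ch):04X};"


print(to_decimal_ref("λ"), to_hex_ref("λ"))  # &#955; &#x03BB;
for ref in ("&#955;", "&#x03BB;", "&#X03bb;"):
    print(ref, "->", html.unescape(ref))
```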

Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string. For example, 'λ' can also be encoded as &lambda; in an HTML document. (For a list of all named HTML character entity references, see List of XML and HTML character entity references.) The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup. This notably does not include XML's &apos; (') entity.
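
Python's standard `html` module can be used to try this out; the sample strings are arbitrary. Note that `html.escape` writes the apostrophe as a numeric reference rather than `&apos;`, in line with the remark above.

```python
import html
from html.entities import name2codepoint

# Names are case-sensitive: lambda and Lambda are different characters.
print(html.unescape("&lambda;"), html.unescape("&Lambda;"))  # λ Λ
print(name2codepoint["lambda"])                              # 955

# Escaping the markup delimiters for literal display.
print(html.escape('<a href="x">R&D</a>'))
# &lt;a href=&quot;x&quot;&gt;R&amp;D&lt;/a&gt;
print(html.escape("'"))  # &#x27;
```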

Numeric references always refer to Unicode code points, regardless of the page's encoding. Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
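
The backward-compatible treatment of the 80–9F range can be observed with Python's `html.unescape`, which follows the HTML5 parsing rules. This is only a demonstration of one implementation's behaviour, not a statement about every user agent.

```python
import html

# Code point 153 (hex 99) is a control character in Unicode...
print(repr(chr(153)))                # '\x99'
# ...but byte 0x99 is the trade mark sign in Windows-1252,
print(bytes([0x99]).decode("cp1252"))  # ™
# and the reference is resolved the Windows-1252 way.
print(html.unescape("&#153;"))       # ™
```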

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, for example a Unicode encoding like UTF-8, then HTML character references are usually only required for the markup-delimiting characters mentioned above.

XML character references

Unlike traditional HTML with its large range of character entity references, XML has only five predefined character entity references. These are used to escape characters that are markup-sensitive in certain contexts:

  • &amp; → & (ampersand, U+0026)
  • &lt; → < (less-than sign, U+003C)
  • &gt; → > (greater-than sign, U+003E)
  • &quot; → " (quotation mark, U+0022)
  • &apos; → ' (apostrophe, U+0027)
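
The five references listed above can be applied with Python's `xml.sax.saxutils`, which escapes `&`, `<` and `>` by default and accepts extra mappings for the two quote characters. The sample string and dictionary names are arbitrary.

```python
from xml.sax.saxutils import escape, unescape

QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

original = "Tom & 'Jerry' <\"cat\">"
escaped = escape(original, QUOTES)
print(escaped)
# Tom &amp; &apos;Jerry&apos; &lt;&quot;cat&quot;&gt;

# Unescaping restores the original text.
print(unescape(escaped, UNQUOTES) == original)  # True
```

The ampersand is escaped first and unescaped last by the library, which avoids the double-escaping that occurs when an ampersand replacement is applied more than once.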

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lower case: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.
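
These rules can be checked against a conforming XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the helper name `parses` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if the string is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(parses("<p>&eacute;</p>"))  # False: entity is not defined
print(parses("<p>&#xE9;</p>"))    # True: numeric reference, lower-case x
print(parses("<p>&#XE9;</p>"))    # False: upper-case X is not allowed

# Defining the entity in an internal DTD subset makes the reference valid.
with_dtd = '<!DOCTYPE p [<!ENTITY eacute "&#xE9;">]><p>&eacute;</p>'
print(ET.fromstring(with_dtd).text)  # é
```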

References

  1. Fielding, R.; Gettys, J.; Mogul, J.; Frystyk, H.; Masinter, L.; Leach, P.; Berners-Lee, T. (June 1999), "Content-Type", Hypertext Transfer Protocol – HTTP/1.1, IETF, retrieved 8 March 2010
  2. Hickson, I. (5 March 2010), "Specifying the document's character encoding", HTML5, WHATWG, retrieved 8 March 2010
  3. Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Processing Instructions", XML, W3C, retrieved 8 March 2010
  4. Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Character and Entity References", XML, W3C, retrieved 8 March 2010
