



Character encoding


If you could watch computers talk to each other, the conversation would probably look something like this: "010110111011101011010010110...". This is how computers communicate with each other. Character encodings exist because this language of ones and zeros (bits) would be very difficult for us to understand.

An n-bit character encoding splits these bit sequences into groups of n bits each and assigns a symbol to each ordered group. Put simply, a character encoding or character set can be thought of as a translation table in which each group of bits maps to a single character. For example, an 8-bit character encoding could represent the sequence "10010101" as the letter "a", the sequence "01101100" as the symbol "&", and so on.
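
The translation-table idea can be sketched in a few lines of Python. This is only an illustration: the table below is the hypothetical one from the example above and does not match any real character encoding, and the names TABLE and decode_bits are made up for this sketch.

```python
# Hypothetical 8-bit translation table taken from the example in the text;
# it does not correspond to any real character encoding.
TABLE = {
    "10010101": "a",
    "01101100": "&",
}


def decode_bits(bits, n=8, table=TABLE):
    """Split a bit string into n-bit groups and look each group up in the table."""
    if len(bits) % n != 0:
        raise ValueError("bit string length must be a multiple of n")
    groups = [bits[i:i + n] for i in range(0, len(bits), n)]
    # Groups missing from the table are shown as "?" instead of raising an error.
    return "".join(table.get(group, "?") for group in groups)


print(decode_bits("1001010101101100"))  # a&
```

A real encoding works the same way in principle, only with a complete table that both the writer and the reader of the data agree on.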

Character encodings are widely used (even when we don't realize it) in operating systems as well as in HTML documents. They are as old as computers themselves, since there has always been a need to translate computer codes into human-readable text. In the beginning, vendors created or adopted character encodings designed to meet their own needs and to support the languages of the markets they were selling to. The result was a proliferation of character encodings, each supporting the characters of different languages or serving different purposes.

The Unicode industry standard is intended to replace and unify the existing character encodings, reducing their number and addressing their shortcomings. The Unicode standard encodes characters using different schemes called Unicode Transformation Formats (UTF).

UTF-8 is a variable-length character encoding for Unicode, built on 8-bit units, that has become very popular on the Internet because it can represent every Unicode character while remaining compatible with ASCII. UTF-8 uses from 1 to 4 bytes (8 to 32 bits) to encode a single character, depending on the Unicode symbol.
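
The variable length is easy to observe with Python's built-in str.encode method. The sample characters below are chosen only for illustration, and utf8_length is a helper name invented for this sketch.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))


# One sample character for each possible length (illustrative choices).
for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")

# ASCII compatibility: an ASCII character produces the same single byte
# in UTF-8 as it does in ASCII.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```

The four samples need 1, 2, 3 and 4 bytes respectively, which covers the whole range mentioned above.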

An HTML document using the UTF-8 character encoding should declare it in its head section using the HTML meta tag (e.g., <meta http-equiv="content-type" content="text/html; charset=UTF-8">). Symbols can be inserted in these documents using a simple code associated with a single character or symbol, like &#931; (decimal notation) or &#x3A3; (hexadecimal notation), either of which renders the Greek capital letter Sigma ("Σ"). In practice, how Unicode symbols render depends on the font used by the browser, and most fonts don't support all the characters in the Unicode standard (unsupported characters are rendered as squares or question marks). See the character entity reference for more information about the representation of special characters in HTML documents.
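
The relationship between a character and its two numeric codes can be checked with a short Python sketch. It relies on the standard library function html.unescape; the helper names decimal_reference and hex_reference are invented for this illustration.

```python
import html


def decimal_reference(ch):
    """Build the decimal code for a single character, e.g. &#931;"""
    return f"&#{ord(ch)};"


def hex_reference(ch):
    """Build the hexadecimal code for a single character, e.g. &#x3A3;"""
    return f"&#x{ord(ch):X};"


sigma = "Σ"
print(decimal_reference(sigma))  # &#931;
print(hex_reference(sigma))      # &#x3A3;

# Both notations decode back to the same character.
print(html.unescape("&#931;") == html.unescape("&#x3A3;") == sigma)  # True
```

Both notations carry the same number, the character's Unicode code point, written in base 10 or base 16.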


