Character encoding

If you could see a conversation between computers, it would probably look something like this: "010110111011101011010010110...". This is how computers communicate with each other. Character encodings exist because this language of ones and zeros (bits) would be very difficult for us to understand.

An n-bit character encoding splits these bit sequences into groups of n bits each and assigns a symbol to each ordered sequence. Put simply, a character encoding or character set can be thought of as a translation table in which each group of bits maps to a single character. For example, an 8-bit character encoding could represent the sequence "10010101" as the letter "a", the sequence "01101100" as the symbol "&", and so on.
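The translation-table idea can be sketched in a few lines of Python. This is only an illustration: the table reuses the two made-up mappings from the example above rather than any real encoding, and the names TABLE and decode_bits are hypothetical.

```python
# Hypothetical 8-bit translation table. The two entries are the made-up
# examples from the text, not values taken from any real character encoding.
TABLE = {
    "10010101": "a",
    "01101100": "&",
}


def decode_bits(bits, n=8, table=TABLE):
    """Split a string of '0'/'1' characters into groups of n bits and
    translate each group through the table ('?' marks unknown groups)."""
    if len(bits) % n != 0:
        raise ValueError("bit string length must be a multiple of n")
    groups = [bits[i:i + n] for i in range(0, len(bits), n)]
    return "".join(table.get(group, "?") for group in groups)


# Two 8-bit groups back to back: "10010101" followed by "01101100".
decoded = decode_bits("1001010101101100")
print(decoded)
```

Real encodings work on the same principle, only with a complete and standardized table instead of two invented entries.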

Character encodings are widely used (even when we don't realize it) in operating systems as well as in HTML documents. They were born with computers, out of the need to translate computer codes into human-readable text. In the beginning, character encodings were created or adopted by vendors and designed to meet each vendor's own needs and the languages of the markets they were selling to. This led to a proliferation of character encodings, each supporting characters of different languages or serving different purposes.

The Unicode industry standard is intended to replace and unify the existing character encodings, reducing their number and improving effectiveness (by addressing the shortcomings of other character encodings). The Unicode standard encodes characters using different schemes named Unicode Transformation Formats (UTF).

UTF-8 is a variable-length character encoding for Unicode built on 8-bit units. It has become very popular for Internet use because it can represent every Unicode character while remaining compatible with ASCII. UTF-8 uses 1 to 4 bytes (8 to 32 bits) to encode a single character, depending on the Unicode symbol.
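The variable length of UTF-8 can be observed with Python's built-in str.encode method. The four sample characters below are chosen here for illustration (they are not from the original text), one for each possible byte length.

```python
# Sample characters picked for illustration, one per UTF-8 byte length.
samples = {
    "A": "\u0041",            # Latin capital letter A (ASCII range)
    "e-acute": "\u00e9",      # Latin small letter e with acute
    "euro": "\u20ac",         # Euro sign
    "emoji": "\U0001F600",    # Grinning face emoji
}

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, length in byte_lengths.items():
    print(f"{name}: {length} byte(s)")

# ASCII compatibility: an ASCII character has the same bytes in both encodings.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
```

Characters in the ASCII range take a single byte, which is why plain ASCII text is also valid UTF-8.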

An HTML document using the UTF-8 character encoding should contain a declaration in its header using the HTML meta tag (e.g., <meta http-equiv="content-type" content="text/html; charset=UTF-8">). Symbols can be inserted in these documents using a simple code associated with a single character or symbol, such as &#931; (decimal notation) or &#x3A3; (hexadecimal notation), either of which renders the Greek capital letter Sigma ("Σ"). In practice, how Unicode symbols render depends on the font used by the browser, and most fonts don't support every character in the Unicode standard (unsupported characters are rendered as squares or question marks). See the character entity reference for more information about the representation of special characters in HTML documents.
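As a rough check of the two notations, the decimal reference &#931; and the hexadecimal reference &#x3A3; both point to the same Unicode code point, U+03A3. The sketch below uses Python's standard html module to resolve them the way an HTML parser would; the variable names are only illustrative.

```python
import html

# Numeric character references for the Greek capital letter Sigma.
decimal_ref = "&#931;"     # decimal notation
hex_ref = "&#x3A3;"        # hexadecimal notation

# html.unescape resolves character references into the characters they name.
from_decimal = html.unescape(decimal_ref)
from_hex = html.unescape(hex_ref)

# Going the other way: build both notations from the character itself.
sigma = "\u03a3"
code_point = ord(sigma)
built_decimal = f"&#{code_point};"
built_hex = f"&#x{code_point:X};"

print(from_decimal, from_hex, built_decimal, built_hex)
```

Both notations are interchangeable; which one to use is mostly a matter of whether the code point is easier to look up in decimal or in hexadecimal.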
