If you could see a conversation between computers, it would probably look something like this: "010110111011101011010010110...". This is the format computers use to communicate with each other (binary). But why would someone design a computer around a binary system instead of a normal alphabet? The reason is that binary maps perfectly onto the physical structure of circuits, which distinguish only two signals: an electrical pulse (1) or no signal at all (0).
These codes are, to say the least, hard for people to understand. Character encodings exist because we need to translate these ones and zeros (bits) into a more human-readable language.
An n-bit character encoding splits a bit sequence into groups of n bits each and assigns a symbol to each group. Put simply, a character encoding (or character set) can be thought of as a translation table in which each group of bits corresponds to a single character. For example, a hypothetical 8-bit character encoding could represent the sequence "10010101" as the letter "a", the sequence "01101100" as the symbol "&", and so on.
To shed some light on the previous paragraph, I will take one particular character encoding that happens to be one of the best known on Western computer systems: the ASCII encoding. This encoding is very basic and therefore simple. The following table shows the binary codes for the uppercase letters (A to Z):
| Char | Binary code |
|---|---|
| A | 1000001 |
| B | 1000010 |
| C | 1000011 |
| D | 1000100 |
| E | 1000101 |
| F | 1000110 |
| G | 1000111 |
| H | 1001000 |
| I | 1001001 |
| J | 1001010 |
| K | 1001011 |
| L | 1001100 |
| M | 1001101 |
| N | 1001110 |
| O | 1001111 |
| P | 1010000 |
| Q | 1010001 |
| R | 1010010 |
| S | 1010011 |
| T | 1010100 |
| U | 1010101 |
| V | 1010110 |
| W | 1010111 |
| X | 1011000 |
| Y | 1011001 |
| Z | 1011010 |
If you take a close look, you'll see that no code is repeated. This is done on purpose to avoid ambiguity, and it is common to all character encodings.
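As a rough illustration, the table above can be regenerated with a few lines of Python. This is only a sketch: the name `ascii_table` is made up for this example, and it relies on Python's built-in `ord` returning the ASCII code for these letters.

```python
import string

# Map each uppercase letter to its 7-bit ASCII code, written as a bit string.
# "ascii_table" is an illustrative name, not part of any library.
ascii_table = {
    letter: format(ord(letter), "07b") for letter in string.ascii_uppercase
}

for letter, bits in ascii_table.items():
    print(letter, bits)

# No two letters share a code, so decoding is never ambiguous.
assert len(set(ascii_table.values())) == len(ascii_table)
```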
This way, if we intercept a line of computer code that reads "1001000101010010011011001100" and we know that it was produced with the ASCII character encoding, we can translate it into the Latin alphabet. First we split the line into groups of 7 digits each (because ASCII is a 7-bit character encoding), which gives "1001000 1010100 1001101 1001100". Each of these chunks is a character that we can decode by matching it against the ASCII table. If you look up the matches in the table above, you'll see that the word is HTML.
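The decoding steps just described can be sketched in Python. The function name `decode_ascii_bits` and its `width` parameter are assumptions made for this example; the sketch assumes the input contains only the characters 0 and 1.

```python
def decode_ascii_bits(bits: str, width: int = 7) -> str:
    """Split a bit string into fixed-width groups and map each group to a character."""
    if len(bits) % width != 0:
        raise ValueError("bit string length must be a multiple of the group width")
    # Cut the sequence into chunks of `width` bits each.
    groups = [bits[i:i + width] for i in range(0, len(bits), width)]
    # Read each chunk as a base-2 number and look up the matching character.
    return "".join(chr(int(group, 2)) for group in groups)


print(decode_ascii_bits("1001000101010010011011001100"))  # prints HTML
```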
The same logic applies in reverse when you need to turn text into a sequence of bits.
Character encodings are widely used (even when we don't realize it) in operating systems, text documents, HTML documents, email messages, etc. Every piece of text must have a character encoding because, in the end, all digital information is a sequence of bits.
This is why character encodings matter to HTML. Since an HTML document is a piece of text that must be read and interpreted (by browsers and people), it must use a character encoding. If the document doesn't declare one, a default is assumed, and most of the time the encoding is sent in the header information when a page is requested.
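Going the other way might look like the following sketch. Again, `encode_ascii_bits` is a hypothetical name; the function relies on Python's `str.encode("ascii")`, which raises an error for characters outside ASCII.

```python
def encode_ascii_bits(text: str, width: int = 7) -> str:
    """Turn ASCII text into a bit string, using `width` bits per character."""
    # str.encode("ascii") raises UnicodeEncodeError for non-ASCII characters.
    codes = text.encode("ascii")
    # Write each code as a zero-padded binary number and join the pieces.
    return "".join(format(code, f"0{width}b") for code in codes)


print(encode_ascii_bits("HTML"))  # prints 1001000101010010011011001100
```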
To set a character encoding in a page, use the HTML meta tag in the head (the HTML head element) of the document. For example, the declaration <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"> sets ASCII (under its preferred name, US-ASCII) as the character encoding of the page.
Other very popular and useful character encodings are ISO-8859-1 (usually called "Latin 1") and UTF-8.
Character encodings were born with computers as a mechanism to translate binary computer code into human-readable text. Soon, computer vendors began making their own character sets according to the needs of their markets, creating an expansion effect that resulted in a massive number of different encoding tables.
This expansion effect had two main drivers. First, the ease with which character encodings can be created (basically, anyone can create their own character set by grouping ones and zeros and assigning a character to each group). Second, the different needs of each market (the language the computers were targeted at and the way they were used).
The Unicode industry standard is intended to replace and unify the existing character encodings, reducing their number and improving effectiveness (by addressing the shortcomings of other character encodings). The Unicode standard encodes characters using different schemes named Unicode Transformation Formats (UTF).
UTF-8 is a variable-length character encoding for Unicode, built on 8-bit units, that is very popular on the Internet because it can represent every Unicode character while remaining ASCII compatible. UTF-8 uses 1 to 4 bytes (8 to 32 bits) to encode a single character, depending on the Unicode symbol.
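To make the variable length concrete, here is a small Python sketch that encodes four sample characters and counts the resulting bytes. The sample characters and the names `samples` and `lengths` are choices made for this example only.

```python
# Four sample characters, written as escapes so the source stays pure ASCII:
# "A", "e" with acute accent, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes UTF-8 needs for each sample.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(byte, "08b") for byte in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {bits}")

# ASCII compatibility: plain ASCII text has the same bytes in both encodings.
assert "HTML".encode("utf-8") == "HTML".encode("ascii")
```

The samples take 1, 2, 3 and 4 bytes respectively, while text that only uses ASCII characters stays byte-for-byte identical.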
An HTML document using the UTF-8 character encoding should contain the following declaration in its head: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Serving documents with UTF-8 encoding allows authors to insert any character without using character references (except for markup symbols). But be aware that UTF-8 documents must be created and edited with a text editor that supports this encoding (and is correctly configured); otherwise, characters will be displayed incorrectly.