Unicode UTF-8

UTF-8

UTF-8 is the most common encoding for Unicode characters. Each character is encoded with a variable number of bytes, from 1 to 4. UTF-8 is optimized for storing ASCII characters: characters in the ASCII range (values 0 to 127) take only one byte each, and the value of that byte is the same as in ASCII. UTF-8 is therefore especially suitable for texts that consist mainly of ASCII characters and contain only a few others, as is the case for English and for texts in most European languages.
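The ASCII compatibility described above can be checked with a minimal Python sketch. The sample string and the variable names are chosen purely for illustration.

```python
# A text made up only of ASCII characters (illustrative sample).
text = "Hello, World!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text both encodings produce exactly the same bytes.
same_bytes = ascii_bytes == utf8_bytes

# One byte per character, and each byte value equals the ASCII code.
one_byte_per_char = len(utf8_bytes) == len(text)
first_byte_value = utf8_bytes[0]  # byte value of "H", i.e. ord("H")

print(same_bytes, one_byte_per_char, first_byte_value)
```

A practical consequence is that a file containing only ASCII characters is already a valid UTF-8 file without any conversion.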

While the first 128 characters (ASCII) need one byte each, the next 1920 characters need two bytes. These include Latin letters with diacritical marks, such as the German umlauts (Ä, Ö, ...), as well as Greek and Cyrillic letters. Three bytes cover the remaining characters of the Basic Multilingual Plane, which includes most commonly used Chinese, Japanese and Korean characters. Four bytes are needed only for rarely used characters, such as unusual Chinese, Japanese and Korean characters.
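The byte counts follow from fixed code point ranges, which the following Python sketch makes explicit. The helper name utf8_length and the sample characters are illustrative assumptions, not part of any library; the helper covers code points up to U+10FFFF and ignores the surrogate range, which cannot be encoded.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one Unicode code point."""
    if code_point < 0x80:       # U+0000..U+007F: 128 code points (ASCII)
        return 1
    if code_point < 0x800:      # U+0080..U+07FF: the next 1920 code points
        return 2
    if code_point < 0x10000:    # U+0800..U+FFFF: rest of the Basic Multilingual Plane
        return 3
    return 4                    # U+10000..U+10FFFF: supplementary planes


# Illustrative samples: Latin, umlaut, Greek, Cyrillic, euro sign,
# a common CJK character, and a rare CJK character (U+20000).
samples = ["A", "Ä", "Ω", "Я", "€", "中", "\U00020000"]

for ch in samples:
    # Cross-check the helper against Python's own encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {utf8_length(ord(ch))} byte(s)")
```

The figure of 1920 two-byte characters is simply the size of the range U+0080 to U+07FF, that is 0x800 - 0x80.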

Usage

UTF-8 has become increasingly important in recent years, especially on the Internet and in data transmission. This development has been supported by the World Wide Web Consortium, which recommends UTF-8 as the default encoding for XML and HTML, and by the Internet Engineering Task Force, which requires all new Internet protocols to support UTF-8. Currently, over 97% of all web pages are UTF-8 encoded, making UTF-8 the most widespread encoding on the web. Named characters (HTML entities), which used to be very common, are also increasingly disappearing from the source of web pages and being replaced by UTF-8 encoded characters.

In contrast, UTF-8 is used less frequently within programs. For example, Windows uses UTF-16 internally.

Advantages and Disadvantages

One of the main advantages of UTF-8 is the low storage requirement for its preferred characters, the ASCII range. For an English text that consists only of the characters A-Z, a-z, 0-9 and common punctuation marks, UTF-8 requires half as much storage space as UTF-16 and only a quarter as much as UTF-32. UTF-8 needs just one byte for each of these characters, while UTF-16 requires at least two bytes per character and UTF-32 always requires four. Nevertheless, UTF-8 can represent every Unicode character, which is not possible with other encodings such as ASCII or ANSI, which also need only one byte per character.
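The size ratios for English text can be reproduced with a short Python sketch. The sample sentence is arbitrary, and the little-endian codec variants are used only so that Python does not prepend a byte order mark that would distort the comparison.

```python
# Arbitrary English sample consisting only of ASCII characters.
text = "The quick brown fox jumps over the lazy dog."

sizes = {
    "utf-8": len(text.encode("utf-8")),
    "utf-16": len(text.encode("utf-16-le")),  # "-le": no BOM is prepended
    "utf-32": len(text.encode("utf-32-le")),  # "-le": no BOM is prepended
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For such a text the UTF-16 size is exactly twice, and the UTF-32 size exactly four times, the UTF-8 size.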

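A disadvantage of UTF-8 is the larger size of texts that use many of the non-preferred characters, that is, characters that require 3 or 4 bytes. In such cases, other encodings can be more space-efficient; for example, most Chinese, Japanese and Korean characters take three bytes in UTF-8 but only two in UTF-16.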
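How large the difference is depends on which characters dominate. The following Python sketch compares the encodings for a short Japanese sample and for a single rare character. Both samples are illustrative choices, and the little-endian codecs are used only to avoid counting a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte size of text in UTF-8, UTF-16 and UTF-32 (without BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Eight characters from the Basic Multilingual Plane, each 3 bytes in UTF-8.
cjk_sizes = encoded_sizes("日本語のテキスト")

# One rare character outside the Basic Multilingual Plane (U+20000).
rare_sizes = encoded_sizes("\U00020000")

print(cjk_sizes)   # {'utf-8': 24, 'utf-16': 16, 'utf-32': 32}
print(rare_sizes)  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

In this sample the saving of UTF-16 over UTF-8 comes from the three-byte characters. The four-byte character occupies four bytes in all three encodings, because UTF-16 represents it as a surrogate pair.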

Another disadvantage is that, because the number of bytes per character varies, it is not easy to access a specific character in the text or to determine the length of a UTF-8 encoded text. In both cases, the bytes must be scanned from the beginning and interpreted one by one. In an encoding such as UTF-32 this is much easier: because each character occupies exactly 4 bytes, the length of a text follows directly from the number of bytes (bytes / 4), and it is possible to jump straight to the nth character without reading the text, because its starting byte position is four times the character position.
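The difference between scanning and direct addressing can be sketched in Python as follows. The function names are hypothetical, character positions are counted from 0, and the UTF-8 helpers assume well-formed input. They rely on the fact that continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters by scanning every byte and skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_offset_of(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th character (0-based) by linear scan."""
    seen = -1
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # a byte that starts a new character
            seen += 1
            if seen == n:
                return offset
    raise IndexError("character index out of range")


def utf32_offset_of(n: int) -> int:
    """In UTF-32 the offset follows directly from the position."""
    return 4 * n


text = "Grüße, 世界"  # illustrative sample with 1-, 2- and 3-byte characters
utf8_data = text.encode("utf-8")
utf32_data = text.encode("utf-32-le")  # "-le": no BOM is prepended

print(utf8_char_count(utf8_data), len(utf32_data) // 4)  # 9 9
print(utf8_offset_of(utf8_data, 7), utf32_offset_of(7))  # 9 28
```

Both UTF-8 helpers take time proportional to the length of the data, whereas the UTF-32 offset is a single multiplication.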

Byte Order Mark

The Byte Order Mark (BOM) of UTF-8 is the byte sequence EF BB BF, which may show up as the characters ï»¿ if the program reading the text cannot handle UTF-8. The byte-order problem does not arise in UTF-8, but a BOM can still be useful to indicate which encoding is used. However, this identification is not 100 percent reliable, because the string ï»¿ is also valid in the ANSI format and could, at least in theory, occur at the beginning of a file.
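The behaviour of the BOM can be illustrated with Python's standard codecs module. In this sketch the sample content "Text" is arbitrary, and Windows-1252 (cp1252) is assumed as the ANSI code page.

```python
import codecs

bom = codecs.BOM_UTF8                  # the bytes EF BB BF
data = bom + "Text".encode("utf-8")    # illustrative file content with a BOM

# A program that reads the bytes as Windows-1252 shows three stray characters.
mojibake = bom.decode("cp1252")        # 'ï»¿'

# Heuristic check: a hint for UTF-8, but not proof, as noted above.
has_bom = data.startswith(codecs.BOM_UTF8)

# The plain "utf-8" codec keeps the BOM as the character U+FEFF,
# while "utf-8-sig" removes a leading BOM if one is present.
plain = data.decode("utf-8")
stripped = data.decode("utf-8-sig")

print(repr(mojibake), has_bom, repr(plain), repr(stripped))
```

Decoding with "utf-8-sig" also works for data without a BOM, which makes it a convenient choice when it is not known whether a BOM is present.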