Unicode

UTF-8

UTF-8 is the most common encoding for Unicode characters. A UTF-8 encoded character requires 1 to 4 bytes. UTF-8 is optimized for storing ASCII characters: in the ASCII range, with values from 0 to 127, only one byte per character is used, and the value of this byte is the same as in the ASCII encoding. UTF-8 is therefore especially suitable for texts that consist mainly of ASCII characters and contain only a few other characters, as is the case, for example, in English and many other European languages.

While the first 128 characters (ASCII) need one byte each, the next 1920 characters need two bytes. These include Latin letters with diacritical marks, such as the German umlauts (Ä, Ö, Ü), as well as Greek and Cyrillic letters. The remaining characters of the Basic Multilingual Plane, including the commonly used Chinese, Japanese and Korean characters, need three bytes. Four bytes are used only for rarely used characters, such as unusual Chinese, Japanese and Korean characters.
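The byte counts described above can be checked directly. The following is a minimal sketch in Python; the helper name utf8_length and the sample characters are chosen here purely for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters, written as escapes so the source file stays pure ASCII.
samples = {
    "A": "ASCII letter (U+0041)",                   # 1 byte
    "\u00c4": "Latin letter with umlaut (U+00C4)",  # 2 bytes
    "\u03a9": "Greek letter (U+03A9)",              # 2 bytes
    "\u20ac": "euro sign (U+20AC)",                 # 3 bytes
    "\U00020000": "rare CJK ideograph (U+20000)",   # 4 bytes
}

for ch, description in samples.items():
    print(f"{description}: {utf8_length(ch)} byte(s)")

# In the ASCII range the UTF-8 byte is identical to the ASCII byte.
assert "A".encode("utf-8") == "A".encode("ascii")
```

The two-byte range covers the code points U+0080 to U+07FF, which is where the figure of 1920 characters comes from (0x800 - 0x80 = 1920).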

UTF-8 is becoming more and more important on the Internet: the Internet Engineering Task Force requires all new Internet protocols to support UTF-8, and UTF-8 is increasingly used to display special characters on web pages instead of named entities or other means.

A disadvantage of UTF-8 is the larger size of texts that make heavy use of the non-preferred characters, that is, characters that require 3 or 4 bytes. In such cases, other encodings can be more space-efficient.
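The size difference can be illustrated with a short comparison. This sketch assumes UTF-16 as the alternative encoding, which the text above does not name; the helper encoded_sizes and the sample strings are illustrative only. The "utf-16-le" codec is used so that no byte order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of `text` in UTF-8 and UTF-16."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le")}


english = "Hello"                # five ASCII characters
japanese = "\u65e5\u672c\u8a9e"  # three characters that need 3 bytes each in UTF-8

print(encoded_sizes(english))    # UTF-8 is smaller: 5 bytes versus 10
print(encoded_sizes(japanese))   # UTF-16 is smaller: 6 bytes versus 9
```

Note that characters needing four bytes in UTF-8 also need four bytes in UTF-16, so under this assumption the saving comes from the three-byte range.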

The byte order mark (BOM) of UTF-8 is the byte sequence EF BB BF, which can appear as the characters ï»¿ if a program cannot handle UTF-8. The problem of byte order does not arise in UTF-8, but a BOM can still be useful to indicate which encoding is used. However, this indication is not 100 percent reliable, because a file in an ANSI encoding may also legitimately contain the string ï»¿, which could, at least in theory, appear at the beginning of a file.
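A BOM check of this kind might look as follows. This is a minimal sketch using Python's standard codecs module; the function name has_utf8_bom and the sample data are hypothetical, and Windows-1252 ("cp1252") is assumed as the ANSI code page.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# A UTF-8 file that begins with a BOM.
data = codecs.BOM_UTF8 + "Grüße".encode("utf-8")

print(has_utf8_bom(data))         # True
print(data.decode("utf-8-sig"))   # the "utf-8-sig" codec strips the BOM
print(data[:3].decode("cp1252"))  # what a program unaware of UTF-8 would show

# The check is only a heuristic: an ANSI (cp1252) file that happens to
# start with the same three characters yields the same three bytes.
ansi = "\u00ef\u00bb\u00bfabc".encode("cp1252")
print(has_utf8_bom(ansi))         # True, although the file is not UTF-8
```

The last case is the theoretical false positive described above, which is why a BOM is a hint rather than proof of the encoding.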