UTF-7

UTF-7 is an encoding that represents Unicode characters using only ASCII characters. Its advantage is that Unicode text can be represented and transferred even in environments or operating systems that understand only 7-bit ASCII.

For example, some Internet protocols, such as SMTP for email, allow only the 128 ASCII characters; bytes with the high bit set are not permitted. All other UTF encodings need at least 8 bits per code unit, so they cannot be used directly for such purposes.
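The 7-bit restriction can be checked with a short sketch. It relies on Python's built-in "utf-7" and "utf-8" codecs; the helper name is_seven_bit is made up for this illustration.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte fits into 7 bits (0x00 to 0x7F)."""
    return all(byte < 0x80 for byte in data)

text = "Käse"

utf8_bytes = text.encode("utf-8")   # contains bytes above 0x7F for "ä"
utf7_bytes = text.encode("utf-7")   # plain ASCII only

print(utf8_bytes, is_seven_bit(utf8_bytes))  # b'K\xc3\xa4se' False
print(utf7_bytes, is_seven_bit(utf7_bytes))  # b'K+AOQ-se' True
```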

The characters A to Z, a to z, 0 to 9 and the special characters ' ( ) , . / : - ? are kept unchanged in the encoding. Texts that consist predominantly of ASCII characters therefore remain largely readable. The ASCII characters ! " # $ % & * ; < = > @ [ ] ^ _ ` { | } may also be left as they are, but should be encoded, since not all programs and protocols handle them correctly. All other characters are encoded, and the result again consists only of ASCII characters. A + marks the beginning of such an encoded sequence; a - (or any other character that cannot occur within the encoded sequence) marks its end.
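The three groups of characters described above can be sketched as a small classifier. The names DIRECT, OPTIONAL, WHITESPACE and classify are hypothetical. As an assumption beyond the text, space, tab, CR and LF are also treated as direct, which RFC 2152 permits.

```python
import string

# Characters that are always written unchanged.
DIRECT = set(string.ascii_letters + string.digits + "'(),./:-?")
# Characters that may be written unchanged but should be encoded.
OPTIONAL = set('!"#$%&*;<=>@[]^_`{|}')
# Assumption taken from RFC 2152, not from the text above.
WHITESPACE = set(" \t\r\n")

def classify(ch: str) -> str:
    """Return 'direct', 'optional' or 'encoded' for a single character."""
    if ch in DIRECT or ch in WHITESPACE:
        return "direct"
    if ch in OPTIONAL:
        return "optional"
    return "encoded"

for ch in "Kä@+":
    print(repr(ch), classify(ch))
```

Note that + itself falls into the "encoded" group, because it is reserved as the start marker.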

The German word for cheese, "Käse", for instance, is encoded as K+AOQ-se. The ASCII characters K, s and e remain the same, while "ä" is converted to AOQ (which again consists only of ASCII characters). The beginning and the end of this sequence are marked with + and -.
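How "ä" becomes AOQ is not spelled out in the text. According to RFC 2152, the encoded sequence is the UTF-16 (big-endian) form of the characters in Base64 without the trailing = padding. A minimal sketch under that assumption, with encode_run as a hypothetical helper that always writes the closing -:

```python
import base64

def encode_run(run: str) -> str:
    """Encode one run of characters as +<Base64 of UTF-16BE, no padding>-."""
    utf16 = run.encode("utf-16-be")                       # "ä" -> 00 E4
    b64 = base64.b64encode(utf16).decode("ascii").rstrip("=")
    return "+" + b64 + "-"

encoded = "K" + encode_run("ä") + "se"
print(encoded)                                  # K+AOQ-se

# Cross-check against Python's built-in codec.
print("Käse".encode("utf-7"))                   # b'K+AOQ-se'
print(encoded.encode("ascii").decode("utf-7"))  # Käse
```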

Usage

Although UTF-7 is fairly space-efficient, it never became widespread: encoding and decoding are relatively complicated, most software understands encodings such as UTF-8, and the 7-bit limitation rarely matters in practice.

Byte Order Mark

The byte order mark (BOM) of UTF-7 encoded files consists of the byte sequence 2B 2F 76 followed by one of the bytes 38, 39, 2B or 2F. This peculiarity, which sets UTF-7 apart from all other encodings, arises because U+FEFF does not fill a whole number of encoded characters: the last 2 bits of the fourth byte already belong to the character that follows the byte order mark. This gives 4 possible bytes in the fourth position. A fifth variant, 2B 2F 76 38 2D, is used if no character follows the byte order mark.

In a text editor that does not understand the UTF-7 encoding, the first 3 bytes of the signature are shown as "+/v". Depending on the variant, the fourth character is 8, 9, + or /.
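The dependence of the fourth byte on the following character can be reproduced with a short sketch. It assumes, as in RFC 2152, that the encoded sequence is the Base64 form of the UTF-16 (big-endian) text; bom_signature is a hypothetical helper, and the sample characters are chosen only because their two highest bits are 00, 01, 10 and 11.

```python
import base64

def bom_signature(next_char: str = "") -> bytes:
    """Return the leading bytes produced by U+FEFF plus an optional next character."""
    raw = ("\ufeff" + next_char).encode("utf-16-be")
    b64 = base64.b64encode(raw).rstrip(b"=")
    if not next_char:
        # Nothing follows: the sequence is closed immediately with "-".
        return b"+" + b64 + b"-"
    # "+" and the first three Base64 characters form the 4-byte signature.
    return b"+" + b64[:3]

samples = ["", "\u00e4", "\u4e2d", "\uac00", "\uff21"]
for ch in samples:
    sig = bom_signature(ch)
    print(repr(ch), sig, sig.hex(" "))
# ''        b'+/v8-'  2b 2f 76 38 2d
# U+00E4    b'+/v8'   2b 2f 76 38
# U+4E2D    b'+/v9'   2b 2f 76 39
# U+AC00    b'+/v+'   2b 2f 76 2b
# U+FF21    b'+/v/'   2b 2f 76 2f
```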