Unicode UTF-32

UTF-32

In the Unicode encoding UTF-32, each character (code point) is encoded with exactly four bytes (32 bits). As a result, UTF-32 generally needs more memory than the other Unicode encodings such as UTF-8 and UTF-16, which use a variable number of bytes per character. In return, UTF-32 encoded files and streams are easier to handle and process, because every character occupies a fixed position and there is no variable length to account for.

Advantages and Disadvantages of UTF-32

One advantage of this encoding is that a particular character can be accessed directly in memory: the character at position n starts at byte offset 4 × n. It is also easy to determine the length of a text, because you only have to divide the number of bytes used (not counting a byte order mark, if present) by four to get the number of characters.
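The following Python sketch illustrates both points. The helper names code_point_count and code_point_at are made up for this example, and the data is assumed to be UTF-32 in Big Endian order without a byte order mark.

```python
def code_point_count(data: bytes) -> int:
    """Number of characters (code points) in BOM-less UTF-32 data."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data length must be a multiple of 4")
    return len(data) // 4


def code_point_at(data: bytes, index: int) -> str:
    """Return the character at position `index` without scanning earlier ones."""
    unit = data[index * 4 : index * 4 + 4]
    if len(unit) != 4:
        raise IndexError("character index out of range")
    # Assumption for this sketch: the data is Big Endian.
    return unit.decode("utf-32-be")


text = "Grüße, Ω!"
data = text.encode("utf-32-be")  # explicit byte order, so no BOM is written

print(len(data))               # total number of bytes
print(code_point_count(data))  # bytes divided by four
print(code_point_at(data, 7))  # direct access via byte offset 4 * 7
```

In a variable-length encoding such as UTF-8, the same lookup would require reading the text from the beginning, because the byte offset of a character depends on the characters before it.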

A decisive disadvantage is the larger memory requirement. Compared to texts consisting of Latin letters that are stored in UTF-7, UTF-8 or ANSI, the memory requirement of UTF-32 is four times larger. Even if you use other characters such as Cyrillic or Greek letters, UTF-32 needs considerably more memory (typically twice as much as UTF-8 or UTF-16), because the other encodings use four bytes only for rarely used and unusual characters.
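The difference can be checked with a few lines of Python. This is only a small sketch: the sample strings and the helper name encoded_sizes are chosen for illustration, and the explicit Big Endian codecs are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in UTF-8, UTF-16 and UTF-32 (all without BOM)."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(text.encode(enc)) for enc in encodings}


samples = {
    "latin": "Hello",  # ASCII letters: 1 byte each in UTF-8
    "greek": "αβγδε",  # Greek letters: 2 bytes each in UTF-8 and UTF-16
    "emoji": "😀",     # outside the Basic Multilingual Plane: 4 bytes everywhere
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

For the Latin sample UTF-32 needs four times as many bytes as UTF-8, for the Greek sample twice as many, and only for the last sample do all three encodings need the same four bytes.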

Byte Order Mark

UTF-32 can be stored both as Big Endian and Little Endian. The byte order (= endianness) determines the sequence in which the four bytes of each character are stored: with Big Endian, the most significant byte comes first, with Little Endian, the least significant byte comes first. Accordingly, the byte order mark (BOM) is 00 00 FE FF for storage as Big Endian and FF FE 00 00 for storage as Little Endian.
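As a minimal sketch, the byte order of a UTF-32 file could be determined in Python as shown here. The function name detect_utf32_bom is an assumption for this example, while the two BOM constants come from Python's standard codecs module.

```python
import codecs


def detect_utf32_bom(data: bytes):
    """Return the matching Python codec name, or None if no UTF-32 BOM is found."""
    if data.startswith(codecs.BOM_UTF32_BE):  # 00 00 FE FF
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF32_LE):  # FF FE 00 00
        return "utf-32-le"
    return None


# The letter "A" (U+0041) in both byte orders, each preceded by its BOM.
big = codecs.BOM_UTF32_BE + "A".encode("utf-32-be")
little = codecs.BOM_UTF32_LE + "A".encode("utf-32-le")

print(big.hex(" "), detect_utf32_bom(big))
print(little.hex(" "), detect_utf32_bom(little))
```

Note that the Little Endian BOM of UTF-32 begins with FF FE, which is also the Little Endian BOM of UTF-16, so a program that checks for several encodings should test for UTF-32 first.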

If you open a file in UTF-32 format in a text editor or another program that cannot interpret the UTF-32 format, Latin letters typically appear separated by three spaces or question marks. This is because in the UTF-32 format each character is saved with 4 bytes, and characters such as A-Z or a-z are encoded as 3 zero bytes plus one byte that corresponds exactly to the byte for that character in the ASCII character set. If the program interprets the byte sequence of the file as ASCII or Latin-1, the ASCII or Latin-1 characters are retained and the zero bytes are displayed as spaces or question marks.

In this context, the byte order mark is typically shown as "??þÿ" for files encoded in UTF-32BE format or as "ÿþ??" for files encoded in UTF-32LE format. The zero byte has no specific character equivalent, so the representation depends on the program used. Instead of displaying spaces or question marks, a program may also stop reading the file altogether at the position of the first zero byte.
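The effect described above can be reproduced in Python by deliberately decoding UTF-32 data with the wrong codec. This is only an illustration: the replacement of zero bytes by question marks is an assumption that imitates what some programs display, and the variable names are chosen for this example.

```python
import codecs

# "Hi" stored as UTF-32 Big Endian with a leading byte order mark.
data = codecs.BOM_UTF32_BE + "Hi".encode("utf-32-be")

# A program that does not know UTF-32 might read the bytes as Latin-1.
misread = data.decode("latin-1")

# Imitate a program that shows every zero byte as a question mark.
visible = misread.replace("\x00", "?")
print(visible)
```

The first four characters of the result are the misread byte order mark, followed by each letter preceded by three placeholders for its zero bytes.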