Unicode Endianness (Byte Order)

Endianness (Byte Order)

Byte order, or endianness, describes how multi-byte data is laid out in memory. Whenever a value needs more bits than the smallest addressable unit provides, the order in which its parts are stored has to be defined.

In general, the smallest addressable unit consists of eight bits, that is, one byte. To store a value that spans more than one byte, there are two options: either store the bytes in the order in which the number is written, from left to right, starting with the most significant byte (big endian), or store them from right to left, starting with the least significant byte (little endian).

Big Endian

Big endian means starting at the big end (hence the name). Similar to writing the time in the order "hour - minute - second", the most significant byte is stored first, that is, at the lowest memory address.

Little Endian

The other option is to start at the small end. Like writing the date in the order "day - month - year", the least significant byte comes first and is written to the lowest memory address.
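The two layouts can be illustrated with a short Python sketch. The value 0x12345678 is an arbitrary example chosen so that every byte is distinct; it does not come from any particular file format.

```python
# Hypothetical example value: a 32-bit integer whose four bytes are all different.
value = 0x12345678

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12

# Reading the bytes back with the matching byte order restores the value.
assert int.from_bytes(big, byteorder="big") == value
assert int.from_bytes(little, byteorder="little") == value

# Reading them with the wrong byte order yields a different number.
assert int.from_bytes(little, byteorder="big") == 0x78563412
```

The first byte printed is the one that would sit at the lowest memory address.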

Relevance for Unicode Formats

In Unicode text encodings such as UTF-16 or UTF-32, where a character is stored in units of two or four bytes rather than in single bytes, the question of byte order, or endianness, arises inevitably.

In the UTF-32 encoding, for example, every character occupies exactly 4 bytes. These bytes can be arranged either from left to right, with the most significant byte first (big endian), or from right to left, with the least significant byte first (little endian). Accordingly, the encodings are called UTF-32 Big Endian (UTF-32 BE) or UTF-32 Little Endian (UTF-32 LE).
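As a small illustration, the following Python sketch encodes one character in both byte orders. The euro sign (U+20AC) is just an example character, picked because its code point has two distinct non-zero bytes. The codec names "utf-32-be" and "utf-32-le" are Python's standard codecs; they fix the byte order explicitly and write no byte order mark.

```python
# Example character (an assumption for illustration): the euro sign, U+20AC.
char = "\u20ac"

utf32_be = char.encode("utf-32-be")  # most significant byte first
utf32_le = char.encode("utf-32-le")  # least significant byte first

print(utf32_be.hex(" "))  # 00 00 20 ac
print(utf32_le.hex(" "))  # ac 20 00 00

# Both byte sequences describe the same character when decoded
# with the matching byte order.
assert utf32_be.decode("utf-32-be") == char
assert utf32_le.decode("utf-32-le") == char
```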

The so-called Byte Order Mark (BOM) is used to distinguish between the two byte orders. The BOM is the code point U+FEFF and can optionally be inserted as the first character of a file. Text editors should not display it as a visible character. For the UTF-32 encoding, the byte order mark consists of the byte sequence 00 00 FE FF (Big Endian) or the byte sequence FF FE 00 00 (Little Endian). Programs that read such files can therefore decide from the first few bytes how the following bytes are to be interpreted.

For the UTF-16 encodings, the byte sequences FE FF (UTF-16 BE) and FF FE (UTF-16 LE) are used. With single-byte encodings such as ASCII or the ANSI code pages, the question of byte order does not arise, since each character occupies exactly one byte and there is nothing to reorder.
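A reader could check for these marks roughly as follows. This is a minimal sketch, not a complete encoding detector: the function name detect_bom and the table BOMS are hypothetical names introduced here, while the BOM constants come from Python's standard codecs module. It assumes that a file without a BOM is simply reported as unknown.

```python
import codecs

# Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks, because
# the UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Build a small UTF-16 LE sample with a leading BOM and read it back.
sample = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
encoding, skip = detect_bom(sample)
print(encoding)                         # utf-16-le
print(sample[skip:].decode(encoding))   # Hi
```

Note that the bytes FF FE 00 00 are ambiguous in principle: they could also be a UTF-16 LE mark followed by the character U+0000. The sketch resolves this in favor of UTF-32 LE, which is a common but not mandatory choice.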