Byte Order Mark (BOM)
The Unicode byte order mark (BOM) is a Unicode character that indicates the endianness of a Unicode file or stream. It has the code point U+FEFF and can also be used to determine the encoding of a text file. The character always comes first in the file, and software that supports the format does not interpret it as part of the text. An advantage of this technique is that no additional information has to be supplied: the key for interpretation is located directly in the file.
Depending on the encoding, the character U+FEFF results in a different byte sequence. The byte sequences for the most popular encodings are summarized in this table:
|Encoding||Byte Order Mark|
|UTF-7||2B 2F 76 followed by 38, 39, 2B or 2F|
|UTF-8||EF BB BF|
|UTF-16 Big Endian||FE FF|
|UTF-16 Little Endian||FF FE|
|UTF-32 Big Endian||00 00 FE FF|
|UTF-32 Little Endian||FF FE 00 00|
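The table can be turned into a simple lookup. The following is a minimal sketch in Python, not a definitive detector: the function name `detect_bom` is made up for this illustration, and UTF-7 is left out because its fourth byte varies. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Check the 4-byte marks first: the UTF-32 Little Endian mark (FF FE 00 00)
# begins with the UTF-16 Little Endian mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is found.

    Hypothetical helper for illustration only. A UTF-16 Little Endian file
    whose first character after the BOM is U+0000 would be reported as
    UTF-32 Little Endian; this sketch accepts that ambiguity.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfHello"))   # ('utf-8', 3)
print(detect_bom(b"\xff\xfeH\x00"))       # ('utf-16-le', 2)
print(detect_bom(b"Hello"))               # (None, 0)
```

The returned length can be used to skip the mark before decoding the rest of the data with the detected encoding.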
To display a file correctly, the byte order mark is essential in UTF-16 and UTF-32 unless the byte order is known by other means, because one character in these encodings occupies several bytes and the byte order mark indicates the order in which those bytes have to be interpreted (see: Big Endian Byte Order and Little Endian). In UTF-8 and UTF-7, on the other hand, the BOM is not mandatory, but it often leads to better results, because many programs would otherwise interpret such texts as ANSI, that is, in a legacy 8-bit code page.
You can see that the BOM indicates the byte order by comparing the byte sequences for Big Endian (most significant byte first) and Little Endian (least significant byte first), which are mirror images of each other. In UTF-16 Little Endian the byte sequence is FF FE, and in UTF-16 Big Endian it is the reverse (FE FF). UTF-32 always uses four bytes per character, which is also evident from the BOM: 00 00 FE FF for UTF-32 Big Endian and FF FE 00 00 for UTF-32 Little Endian.
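The byte sequences can be reproduced by encoding the single character U+FEFF in each encoding. This is a small Python sketch; the helper name `bom_bytes` is invented for the example.

```python
def bom_bytes(encoding: str) -> str:
    """Encode U+FEFF and return its bytes as upper-case hex (hypothetical helper)."""
    # The explicit "-be"/"-le" codecs do not add a BOM of their own,
    # so the output is exactly the encoded form of U+FEFF.
    return "\ufeff".encode(encoding).hex(" ").upper()


for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{name:10} {bom_bytes(name)}")
# utf-8      EF BB BF
# utf-16-be  FE FF
# utf-16-le  FF FE
# utf-32-be  00 00 FE FF
# utf-32-le  FF FE 00 00
```

The same code point, 0xFEFF, is written most significant byte first in the Big Endian variants and least significant byte first in the Little Endian variants.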
Problems and misinterpretations occur when a program cannot interpret the BOM and shows it as ANSI characters instead. For example, the UTF-8 BOM (EF BB BF) can show up as ï»¿. The reverse case is also somewhat problematic, because an ANSI file may legitimately begin with the byte sequence EF BB BF. So if you put the string ï»¿ at the beginning of a file and save the file as ANSI, most software will interpret the rest of the file as UTF-8. With applications like the TextConverter, you can read and write files with or without a byte order mark, change the Unicode format of files, and choose whether a byte order mark is written.
If the character U+FEFF appears anywhere other than at the beginning of a file, it is treated as a zero-width no-break space, an invisible character that prevents a line break at its position. However, using U+FEFF deliberately for this purpose is deprecated. U+FEFF should now be used only as a byte order mark, and the code point U+2060 (word joiner) should be used instead for a character with no width and no break.
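The misinterpretation can be reproduced in a few lines of Python. This sketch assumes that "ANSI" means Windows-1252 (`cp1252`), a common Western European code page; other code pages would show different characters. The `utf-8-sig` codec is part of the Python standard library and removes a leading BOM when reading and adds one when writing.

```python
import codecs

raw = codecs.BOM_UTF8 + "Hello".encode("utf-8")   # EF BB BF 48 65 6C 6C 6F

as_ansi = raw.decode("cp1252")          # BOM bytes read as three ANSI characters
as_utf8 = raw.decode("utf-8")           # BOM kept as the character U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")   # BOM recognized and removed

print(repr(as_ansi))       # 'ï»¿Hello'
print(repr(as_utf8))       # '\ufeffHello'
print(repr(as_utf8_sig))   # 'Hello'

# Writing with or without a BOM is a matter of choosing the codec.
with_bom = "Hello".encode("utf-8-sig")
without_bom = "Hello".encode("utf-8")
print(with_bom.hex(" "))      # ef bb bf 48 65 6c 6c 6f
print(without_bom.hex(" "))   # 48 65 6c 6c 6f
```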
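Text that still uses U+FEFF inside the content can be brought in line with this recommendation. The following Python sketch uses a hypothetical helper, `normalize_feff`, and assumes that a single leading U+FEFF is a byte order mark that may be dropped, while every later occurrence was meant as a zero-width, non-breaking character.

```python
def normalize_feff(text: str) -> str:
    """Drop a leading BOM and replace interior U+FEFF with U+2060 (illustrative only)."""
    if text.startswith("\ufeff"):
        text = text[1:]                      # leading U+FEFF acts as the BOM
    return text.replace("\ufeff", "\u2060")  # interior use becomes U+2060


print(repr(normalize_feff("\ufeffnon\ufeffbreaking")))   # 'non\u2060breaking'
```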