Byte Order Mark (BOM)
The Unicode Byte Order Mark (BOM) is a Unicode character that indicates the endianness (byte order) of a Unicode file or stream. It has the code point U+FEFF and can also be used to determine the encoding of a text file. The character always comes first in the file, and software that supports the corresponding format does not interpret it as part of the text. An advantage of this technique is that no additional information has to be supplied: the key to interpreting the file is located in the file itself.
- Byte Order Mark of different Encodings
- Interpretation of the Byte Order Mark
- Change, Remove or Add Byte Order Mark
Byte Order Mark of different Encodings
Depending on the encoding, the character U+FEFF results in a different byte sequence. The byte sequences for the most popular encodings are summarized in this table:
|Encoding||Byte Order Mark|
|UTF-7||2B 2F 76, followed by one of 38, 39, 2B or 2F|
|UTF-8||EF BB BF|
|UTF-16 Big Endian||FE FF|
|UTF-16 Little Endian||FF FE|
|UTF-32 Big Endian||00 00 FE FF|
|UTF-32 Little Endian||FF FE 00 00|
For UTF-16 and UTF-32, the Byte Order Mark is essential for displaying a file correctly whenever the byte order is not known by other means, because one character in these encodings occupies several bytes and the BOM indicates the order in which those bytes have to be interpreted (see Big Endian and Little Endian regarding the byte order). In UTF-8 and UTF-7, by contrast, the BOM is not mandatory. It can nonetheless lead to better results, because programs might otherwise interpret such texts as ANSI.
You can see that the BOM indicates the byte order by comparing the byte sequences for Big Endian (most significant byte first) and Little Endian (least significant byte first): the two have opposite byte orders. In UTF-16 Little Endian the byte sequence is FF FE, and in UTF-16 Big Endian it is the reverse (FE FF). UTF-32 always uses four bytes per character, which is also evident from the BOM: 00 00 FE FF for UTF-32 Big Endian and FF FE 00 00 for UTF-32 Little Endian.
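The byte sequences in the table can be reproduced by encoding U+FEFF with each encoding. The following is a minimal Python sketch. The helper name `bom_hex` is made up for this illustration, and UTF-7 is left out to keep the example short.

```python
# The Byte Order Mark is the single character U+FEFF.
BOM = "\ufeff"

# Python codec names with an explicit byte order ("-be" / "-le") encode
# exactly the characters given, so the result is the pure BOM byte sequence.
ENCODINGS = {
    "UTF-8": "utf-8",
    "UTF-16 Big Endian": "utf-16-be",
    "UTF-16 Little Endian": "utf-16-le",
    "UTF-32 Big Endian": "utf-32-be",
    "UTF-32 Little Endian": "utf-32-le",
}


def bom_hex(codec_name: str) -> str:
    """Return the bytes of U+FEFF in the given encoding as spaced hex."""
    return BOM.encode(codec_name).hex(" ").upper()


for label, codec_name in ENCODINGS.items():
    print(f"{label:22} {bom_hex(codec_name)}")
```

The Big Endian and Little Endian rows come out as mirror images of each other, which is exactly the property that lets a reader infer the byte order from the mark.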
Interpretation of the Byte Order Mark
Problems and misinterpretations occur when programs cannot interpret the BOM and show ANSI characters instead. For example, the UTF-8 BOM (EF BB BF) can be displayed as ï»¿. This is somewhat problematic, because the byte sequence EF BB BF is also valid in ANSI files. So if you store the string ï»¿ at the beginning of a file and save the file as ANSI, most software will interpret the rest of the file as UTF-8. With applications such as the TextConverter or the TextEncoder, you can read and write files with or without a Byte Order Mark, change the Unicode format of files, and choose whether a Byte Order Mark is written to them.
If the character U+FEFF appears anywhere other than at the beginning of a file, it is treated as a zero-width no-break space: a character with a width of 0 that prevents a line break. However, using U+FEFF for this purpose is deprecated. U+FEFF should be used as a byte order mark only. For a character with no width and no break, use the code point U+2060 (word joiner) instead.
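How a program might recognize the BOM can be sketched in Python as follows. This is only an illustration, not the method of any particular application. The function name `detect_bom` is hypothetical, and Windows-1252 (`cp1252`) is assumed as the "ANSI" code page.

```python
import codecs

# Longer marks are checked first: the UTF-32 Little Endian mark
# (FF FE 00 00) begins with the UTF-16 Little Endian mark (FF FE).
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32 Big Endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 Little Endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 Big Endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 Little Endian"),
]


def detect_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for mark, name in BOM_TABLE:
        if data.startswith(mark):
            return name
    return None


# What a program shows when it reads the UTF-8 BOM as ANSI
# (assumption: "ANSI" means Windows-1252 here).
misread = codecs.BOM_UTF8.decode("cp1252")
print(misread)

# The ambiguity described above: an ANSI file that starts with these
# three characters is indistinguishable from a UTF-8 file with a BOM.
ansi_bytes = "ï»¿Hello".encode("cp1252")
print(detect_bom(ansi_bytes))
```

The order of the table matters. Checking FF FE before FF FE 00 00 would report every UTF-32 Little Endian file as UTF-16 Little Endian.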
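The difference between a leading U+FEFF and one inside the text can be illustrated with a short Python sketch. It relies on Python's `utf-8-sig` codec, which drops a BOM only at the start of the data. The sample bytes are made up for this example.

```python
import unicodedata

# Sample data: a UTF-8 BOM, the letter "a", a second U+FEFF, the letter "b".
data = b"\xef\xbb\xbfa\xef\xbb\xbfb"

# "utf-8-sig" removes only the leading BOM; the inner U+FEFF stays in the text.
text = data.decode("utf-8-sig")
print(repr(text))

# Replace the remaining U+FEFF characters with U+2060.
cleaned = text.replace("\ufeff", "\u2060")
print(repr(cleaned))

# The official Unicode names of the two characters.
print(unicodedata.name("\ufeff"))
print(unicodedata.name("\u2060"))
```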
Change, Remove or Add Byte Order Mark
With the program TextEncoder you can change, remove or add the Byte Order Mark of files. After starting the TextEncoder, you can do the following:
- Drag the files you want to edit from any folder onto the TextEncoder.
- On the right side under "Changes" activate the option "Encoding".
- Under "Write Byte Order Mark (BOM) into Files", set whether the files should get a Byte Order Mark or not.
- In the storage options at the bottom right, set whether you want to overwrite the files or save them under a new name as new files.
- Click on the button "Convert".
The file list in the TextEncoder contains a column named "BOM". Here you can see if your added files currently have a Byte Order Mark or not.
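For readers who want to script the same operations, adding, removing and checking a Byte Order Mark can be sketched in Python as follows. This is not how the TextEncoder works internally. It is a minimal illustration limited to UTF-8, and the names `has_bom`, `add_bom`, `remove_bom` and `convert_file` are hypothetical.

```python
import codecs
from pathlib import Path


def has_bom(data: bytes) -> bool:
    """Report whether the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def add_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless one is already present."""
    return data if has_bom(data) else codecs.BOM_UTF8 + data


def remove_bom(data: bytes) -> bytes:
    """Strip a leading UTF-8 BOM; data without a BOM is returned unchanged."""
    return data[len(codecs.BOM_UTF8):] if has_bom(data) else data


def convert_file(source: Path, target: Path, write_bom: bool) -> None:
    """Copy source to target with or without a BOM.

    Passing the same path twice overwrites the file; a different
    target saves the result as a new file.
    """
    data = source.read_bytes()
    target.write_bytes(add_bom(data) if write_bom else remove_bom(data))
```

Working on raw bytes leaves the rest of the file untouched. Only the first three bytes are added or removed.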