Byte Order Mark (BOM)

The Unicode Byte Order Mark (BOM) is a Unicode character that indicates the endianness of a Unicode file or stream. It has the code point U+FEFF and can also be used to determine the encoding of a text file. The character always comes first in the file, and software that supports the corresponding format does not interpret it as part of the text. An advantage of this technique is that no additional information has to be supplied: the key for interpreting the file is located directly in the file.

Byte Order Mark of different Encodings

Depending on the encoding, the character U+FEFF results in a different byte sequence. The byte sequences for the most popular encodings are summarized in this table:

Encoding               Byte Order Mark                  ASCII
ANSI                   No BOM                           -
UTF-7                  2B 2F 76 ( 38 | 39 | 2B | 2F )   +/v 89+/
UTF-8                  EF BB BF                         ï»¿
UTF-16 Big Endian      FE FF                            þÿ
UTF-16 Little Endian   FF FE                            ÿþ
UTF-32 Big Endian      00 00 FE FF                      ??þÿ
UTF-32 Little Endian   FF FE 00 00                      ÿþ??

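The last column (ASCII) shows how the byte sequence of the byte order mark would look if a text editor interpreted it as single-byte (ASCII/ANSI) characters. The question marks in the UTF-32 rows stand for the 00 bytes, which have no printable character.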
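The byte sequences in the table can be reproduced by encoding U+FEFF directly. The following is a minimal sketch in Python; the helper name `bom_bytes` is only illustrative, and UTF-7 is left out because its BOM has several variants.

```python
# U+FEFF is the character that serves as the byte order mark.
BOM_CHAR = "\ufeff"

def bom_bytes(encoding: str) -> str:
    """Return the bytes of U+FEFF in the given encoding as spaced hex.

    Codecs with an explicit byte order (e.g. "utf-16-be") are used so that
    Python does not prepend a BOM of its own.
    """
    return BOM_CHAR.encode(encoding).hex(" ").upper()

for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{name:10} {bom_bytes(name)}")
```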

For UTF-16 and UTF-32, the Byte Order Mark is needed to display a file correctly, because each character in these encodings occupies several bytes and the byte order mark indicates the order in which those bytes have to be interpreted (see Big Endian and Little Endian regarding the byte order). In UTF-8 and UTF-7, on the other hand, the BOM is not mandatory, but it can still lead to better results, because programs could otherwise interpret such texts as ANSI.
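How a program might use the BOM to pick an encoding can be sketched as follows. This is an illustrative Python example, not the method of any particular tool; the function name `detect_bom` is an assumption, and UTF-7 is not handled.

```python
import codecs

# Longer marks are checked first: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is found.

    Caveat: a UTF-16 LE text whose first character after the BOM is U+0000
    is indistinguishable from UTF-32 LE and is reported as UTF-32 LE here.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

data = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
encoding, skip = detect_bom(data)
print(encoding, data[skip:].decode(encoding))
```

A file without any of these marks yields `(None, 0)`, in which case the caller has to fall back to some default such as ANSI or UTF-8.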

You can easily see that the BOM indicates the order of the bytes by comparing the byte sequences for Big Endian (most significant byte first) and Little Endian (least significant byte first), because the two variants have opposite byte orders. In UTF-16 Little Endian, the byte sequence is FF FE, and in UTF-16 Big Endian it is the reverse (FE FF). UTF-32 always uses four bytes per character. This is also evident from the BOM: 00 00 FE FF for UTF-32 Big Endian and FF FE 00 00 for UTF-32 Little Endian.
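The relationship between the code point and its byte order can be illustrated by writing out the number 0xFEFF with a given width and byte order. This is a small Python sketch; `serialize_bom` is a hypothetical helper name.

```python
def serialize_bom(width: int, byteorder: str) -> bytes:
    """Write the code point U+FEFF as an integer of `width` bytes.

    byteorder is "big" (most significant byte first) or
    "little" (least significant byte first).
    """
    return (0xFEFF).to_bytes(width, byteorder)

# 2 bytes per code unit corresponds to UTF-16, 4 bytes to UTF-32.
for width in (2, 4):
    for order in ("big", "little"):
        print(width, order, serialize_bom(width, order).hex(" ").upper())
```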

Interpretation of the Byte Order Mark

Problems and false interpretations occur when programs cannot interpret the BOM and show ANSI characters instead. For example, the UTF-8 BOM (EF BB BF) can be shown as ï»¿. This is somewhat problematic, because ANSI files may also contain the byte sequence EF BB BF. So, if you put the string ï»¿ at the beginning of a file and save this file as ANSI, most software will interpret the rest of the file as UTF-8. With applications like the TextConverter or the TextEncoder, you can read and write files with or without a Byte Order Mark, change the Unicode format of files, and choose whether a Byte Order Mark is used in the files.
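The misinterpretation can be reproduced with a few lines of Python. This sketch assumes that "ANSI" means the Windows-1252 code page (`cp1252`); other ANSI code pages would show different characters.

```python
import codecs

# A UTF-8 file that starts with a BOM, followed by the text "Hello".
raw = codecs.BOM_UTF8 + "Hello".encode("utf-8")

as_ansi = raw.decode("cp1252")         # BOM bytes become three visible characters
as_utf8 = raw.decode("utf-8")          # BOM is kept as an invisible U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is recognized and dropped

print(repr(as_ansi))
print(repr(as_utf8))
print(repr(as_utf8_sig))

# The reverse case: the ANSI string "ï»¿" is stored as exactly the BOM bytes.
print("ï»¿".encode("cp1252") == codecs.BOM_UTF8)
```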

If the character U+FEFF appears anywhere other than at the beginning of a file, it is treated as a zero-width no-break space, that is, a character with a width of 0 that does not allow a line break. However, the deliberate use of U+FEFF for this purpose is deprecated. U+FEFF should be used as a byte order mark only, and you should now use the code point U+2060 (word joiner) for a character with no width and no break.
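As a rough illustration, the following Python sketch keeps a leading U+FEFF as the byte order mark and replaces every later occurrence with U+2060. The function name `normalize_feff` is hypothetical, and whether such a replacement is appropriate depends on the text at hand.

```python
import unicodedata

BOM = "\ufeff"
WORD_JOINER = "\u2060"

def normalize_feff(text: str) -> str:
    """Keep a leading U+FEFF, replace all other U+FEFF with U+2060."""
    has_bom = text.startswith(BOM)
    body = text[1:] if has_bom else text
    body = body.replace(BOM, WORD_JOINER)
    return (BOM if has_bom else "") + body

# The official Unicode names of the two characters.
print(unicodedata.name(BOM))
print(unicodedata.name(WORD_JOINER))
```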

Change, Remove or Add Byte Order Mark

With the program TextEncoder you can change, remove or add the Byte Order Mark of files. After starting the TextEncoder, you can do the following:

The file list in the TextEncoder contains a column named "BOM". Here you can see whether the files you have added currently have a Byte Order Mark.
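For comparison, adding or removing a UTF-8 BOM can also be scripted. The following Python sketch is not how the TextEncoder works internally; the function names and the file name in the usage comment are assumptions made for illustration, and only the UTF-8 BOM is handled.

```python
import codecs
from pathlib import Path

def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless the data already starts with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

def remove_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def rewrite_file(path, transform) -> None:
    """Read a file as bytes, apply `transform`, and write the result back."""
    p = Path(path)
    p.write_bytes(transform(p.read_bytes()))

# Usage (hypothetical file name):
# rewrite_file("example.txt", remove_utf8_bom)
```

Working on raw bytes avoids decoding the text at all, so the rest of the file is left untouched.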