Unicode UTF-16

UTF-16

The encoding format UTF-16 is one of the oldest Unicode encoding formats; it grew out of UCS-2, the original fixed-width 16-bit encoding. It is optimized for the most commonly used characters, those of the Basic Multilingual Plane (BMP). The BMP comprises the Unicode characters whose code points lie in the range U+0000 to U+FFFF. It covers Latin and other European scripts and their symbols as well as the African and Asian scripts in common use. Each character in this range is mapped directly to a single UTF-16 code unit of two bytes (16 bits). Characters outside the BMP (U+10000 to U+10FFFF) are encoded as a surrogate pair of two code units, that is, four bytes.
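The mapping from code points to 16-bit code units can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a real encoder: the function name `utf16_code_units` is made up for this example, and the branch for code points above U+FFFF uses the surrogate pair scheme of UTF-16.

```python
def utf16_code_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0xFFFF:
        # BMP character: the code point itself is the single code unit.
        return [code_point]
    # Supplementary character: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


print([hex(u) for u in utf16_code_units(0x20AC)])   # ['0x20ac']
print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The Euro sign U+20AC lies in the BMP and becomes one code unit with the same value, while U+1F600 lies outside it and needs two code units.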

UTF-16 is therefore well suited to text from this range. For text that consists only of ASCII characters, however, it needs twice as much memory as UTF-8 or an ANSI code page, because those encodings store each ASCII character in one byte instead of two. An ANSI code page also stores its non-ASCII characters in one byte each, whereas UTF-8 already needs two or three bytes for them.
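The size difference is easy to check with Python's built-in codecs. The sketch below assumes the Windows-1252 code page (`cp1252`) as a stand-in for "ANSI", and the helper name `encoded_sizes` is made up for this example.

```python
def encoded_sizes(text, encodings=("cp1252", "utf-8", "utf-16-le")):
    """Return the number of bytes that text occupies in each encoding."""
    return {name: len(text.encode(name)) for name in encodings}


# Pure ASCII text: UTF-16 needs twice as many bytes.
print(encoded_sizes("Hello, world"))
# {'cp1252': 12, 'utf-8': 12, 'utf-16-le': 24}

# One non-ASCII character (U+00E9): UTF-8 now needs an extra byte.
print(encoded_sizes("caf\u00e9"))
# {'cp1252': 4, 'utf-8': 5, 'utf-16-le': 8}
```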

Usage

UTF-16 Little Endian is the internal string representation of Windows 2000 and all later Windows versions (XP, 2003, Vista, 7, 10, 11 and those in between), and it is the encoding that Windows Notepad has traditionally offered under the name "Unicode". Other operating systems such as macOS and Symbian also use UTF-16 for their internal string handling.

Byte Order Mark

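UTF-16 text can be saved in either Big Endian or Little Endian byte order. The difference is the order in which the two bytes of each 16-bit code unit are written: Big Endian stores the more significant byte first, Little Endian stores the less significant byte first. The Byte Order Mark (BOM), the character U+FEFF placed at the very start of the text, shows which order is in use: it appears as the bytes FE FF in UTF-16 Big Endian and as FF FE in UTF-16 Little Endian.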
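The two byte orders can be made visible by encoding the same character both ways. The following is a small sketch using Python's standard `utf-16-be` and `utf-16-le` codecs; the helper `hex_bytes` is made up for this example.

```python
import codecs


def hex_bytes(data):
    """Format bytes as space-separated upper-case hex pairs."""
    return data.hex(" ").upper()


# The letter "A" is U+0041, so its code unit is 0x0041.
print(hex_bytes("A".encode("utf-16-be")))       # 00 41
print(hex_bytes("A".encode("utf-16-le")))       # 41 00

# The BOM is the character U+FEFF, encoded like any other character.
print(hex_bytes("\ufeff".encode("utf-16-be")))  # FE FF
print(hex_bytes("\ufeff".encode("utf-16-le")))  # FF FE

# The standard library exposes the same byte sequences as constants.
print(codecs.BOM_UTF16_BE == b"\xfe\xff")       # True
print(codecs.BOM_UTF16_LE == b"\xff\xfe")       # True
```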

If you open a UTF-16 encoded file in an editor that does not recognize the byte order mark and therefore interprets the file as Latin-1, the BOM is displayed as the characters "þÿ" for UTF-16 BE and as "ÿþ" for UTF-16 LE.
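Both effects, the stray characters and a simple BOM check that avoids them, can be reproduced in Python. This is only a sketch: the function name `detect_utf16_bom` is made up for this example, and it deliberately ignores other encodings (a UTF-32 Little Endian file, for instance, also starts with FF FE).

```python
import codecs


def detect_utf16_bom(data):
    """Return the UTF-16 codec name indicated by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


# Misreading the BOM bytes as Latin-1 yields the two stray characters.
print(codecs.BOM_UTF16_BE.decode("latin-1"))  # þÿ
print(codecs.BOM_UTF16_LE.decode("latin-1"))  # ÿþ

# A tiny UTF-16 LE "file" with a BOM in front of the text "Hi".
data = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
print(detect_utf16_bom(data))                 # utf-16-le

# The generic "utf-16" codec reads the BOM and removes it while decoding.
print(data.decode("utf-16"))                  # Hi
```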