Unicode

ASCII and ANSI

ASCII stands for American Standard Code for Information Interchange and is a 7-bit character encoding. It forms the basis of many other character encodings. ASCII defines 128 characters, of which 95 are printable and 33 are non-printable control characters. The following table lists the characters and their decimal codes.

ASCII Table

As can be seen, the ASCII character set includes the Latin alphabet, the ten Arabic numerals, some punctuation marks and the control characters. Characters with diacritics such as Ä or À, which are used in most languages based on the Latin alphabet, do not exist in the ASCII range. Representing these characters as well requires at least 8 bits per character, because the 7-bit range is already fully occupied.
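The numbers above can be checked with a short Python sketch. It assumes the usual convention that codes 0 to 31 and 127 are control characters and that the space character counts as printable; the variable names are only illustrative.

```python
# Split the 7-bit ASCII range (0..127) into control and printable codes.
# Assumption: 0..31 and 127 (DEL) are control characters, 32..126 are
# printable, with the space character (32) counted as printable.
control = [c for c in range(128) if c < 32 or c == 127]
printable = [c for c in range(128) if 32 <= c < 127]
print(len(control), len(printable))  # 33 95

# Latin letters and digits encode to byte values below 128.
ascii_bytes = "Hello 123".encode("ascii")
fits_in_7_bits = max(ascii_bytes) < 128
print(fits_in_7_bits)  # True

# A character with a diacritic has no ASCII code, so encoding fails.
try:
    "Ä".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```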

For this reason, 8-bit ANSI is almost always used instead of 7-bit ASCII. With ANSI, 256 different characters can be encoded (8 bits correspond to 2^8 = 256 possibilities) instead of only 128 (7 bits correspond to 2^7 = 128 possibilities). ANSI actually stands for the American National Standards Institute, but in computing the name is used almost exclusively for the group of 8-bit character sets explained below. An ANSI character set is normally compatible with the 128 ASCII characters and adds language-specific characters such as Ä, À, ß and so on. Which 128 additional characters occupy the upper half of the code range depends on the extension in use. The most common ASCII extension is called Latin-1 (ISO 8859-1) and is shown in the following table.

Latin-1 Table

There are also other extensions, such as Latin-5 (ISO 8859-9, Turkish), Latin-8 (ISO 8859-14, Celtic) or Latin-10 (ISO 8859-16, South-Eastern European). Note that the Latin number and the ISO 8859 part number are not the same. In all of these extensions, the first 128 characters are the common ASCII characters, while the other 128 characters are those required for the respective language or character set.
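The shared lower half and the differing upper half can be illustrated with Python's built-in codecs. This is a minimal sketch that assumes the standard codec names `iso8859_1` (Latin-1), `iso8859_9` (Turkish), `iso8859_14` (Celtic) and `iso8859_16` (South-Eastern European); the byte 0xD0 is just one example of a code that is assigned differently.

```python
# 7 bits give 128 codes, 8 bits give 256 codes.
print(2 ** 7, 2 ** 8)  # 128 256

# Codec names as shipped with Python's standard library.
codec_names = ["iso8859_1", "iso8859_9", "iso8859_14", "iso8859_16"]

# The lower half (0..127) decodes to the same text in every codec.
lower_half = bytes(range(128))
decoded_lower = {name: lower_half.decode(name) for name in codec_names}
same_lower = len(set(decoded_lower.values())) == 1
print(same_lower)  # True

# One byte from the upper half means different things per codec.
decoded = {name: b"\xd0".decode(name) for name in codec_names}
for name, char in decoded.items():
    print(name, char)
```

In Latin-1 the byte 0xD0 is Ð, while the Turkish character set uses the same code for Ğ.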

Although ANSI encoding requires only one byte per character and is therefore the most space-efficient encoding, it has disadvantages: the price of this efficiency is that different writing systems or other special characters cannot be stored together in one file. For normal English or German text, ANSI is of course sufficient. But as soon as other characters such as Cyrillic letters or special symbols appear in the text, they cannot be saved with ANSI. In addition, incompatibilities arise when files are exchanged, since the sender may have saved a file as Latin-1 while the recipient works with Latin-10. In this case, characters outside the ASCII range of 128 characters can be displayed incorrectly, because the same code stands for different characters in Latin-1 and Latin-10. To avoid this danger, it is recommended to save files in a language-independent Unicode format such as UTF-7, UTF-8, UTF-16 or UTF-32. In these encodings, every character has exactly one code, which is the same for all languages, so a character cannot be confused with another one as long as the encoding itself is known.
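The exchange problem can be reproduced in a few lines of Python. The sketch below assumes the codec names `iso8859_1` (Latin-1) and `iso8859_16` (Latin-10) and uses the byte 0xA4 as an example, which is the currency sign ¤ in Latin-1 and the euro sign in Latin-10; the sample strings are made up for illustration.

```python
# The sender saves the text as Latin-1 ...
sent = "10 ¤".encode("iso8859_1")
print(sent)  # b'10 \xa4'

# ... and the recipient opens the same bytes as Latin-10.
received = sent.decode("iso8859_16")
print(received)  # 10 €

# Cyrillic letters cannot be stored in Latin-1 at all.
try:
    "Привет".encode("iso8859_1")
    cyrillic_ok = True
except UnicodeEncodeError:
    cyrillic_ok = False
print(cyrillic_ok)  # False

# UTF-8 stores mixed text in one file and reads it back unchanged,
# at the cost of more than one byte for non-ASCII characters.
mixed = "Ä € Привет"
round_trip = mixed.encode("utf-8").decode("utf-8")
print(round_trip == mixed)  # True
print(len("Ä".encode("iso8859_1")), len("Ä".encode("utf-8")))  # 1 2
```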

ANSI and ASCII files have no byte order mark. None is needed, because each character occupies exactly one byte and the endianness therefore does not matter. However, the missing label can cause problems, because formats like UTF-8 do not require a byte order mark either. A given file might therefore be ANSI, ASCII, UTF-8 or another encoding, and a program that wants to display it correctly has to guess which one it is.
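The following Python sketch shows why such guessing is unreliable. The function `describe` is a hypothetical helper written for this illustration, not a real detection library: it looks for a byte order mark first and then falls back to trial decoding, which can only say what the bytes might be.

```python
import codecs


def describe(data: bytes) -> str:
    """Hypothetical helper: a rough guess at the encoding of data."""
    # A byte order mark, if present, is the only explicit label.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    # UTF-32 must be checked before UTF-16: the little-endian UTF-32
    # mark begins with the same two bytes as the UTF-16 one.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16 with BOM"
    # Without a mark, only trial decoding is left.
    try:
        data.decode("ascii")
        return "ascii"  # equally valid as UTF-8 or any ANSI set
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "probably utf-8 without BOM"
    except UnicodeDecodeError:
        return "some 8-bit encoding, cannot tell which"


print(describe("Hello".encode("ascii")))
print(describe("Ä".encode("utf-8")))
print(describe("Ä".encode("utf-8-sig")))
print(describe("Ä".encode("iso8859_1")))
```

Pure ASCII bytes pass every test, so for such a file the question of the "real" encoding cannot be answered from the content alone. For bytes above 127 the helper can rule out UTF-8, but it cannot tell Latin-1 from Latin-10 or any other 8-bit set.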