Line Breaks
There are several different ways in which line breaks can be implemented within plain text files. In this article we would like to first look at these different kinds of line break types and then address the problems that can arise due to this variability as well as provide solutions and some application examples for these problems.
- Character-Based Line Break Types
- Line Breaks by Defining a Fixed Line Length
- Line Breaks in HTML Source Code and other Markup Languages
- Line Breaks in the Source Code of Programming Languages
- Detection of the Line Break Type of a File
- Problems with File Exchanges
- How to Change the Line Break Type of Files
- Files with mixed Line Breaks
Basically, there are three different categories into which we can classify the different types of line breaks: character-based line breaks, line breaks by defining a fixed line length, as well as line breaks implemented by a markup language. In the following sections, we would first like to compare these three categories and their most prominent representatives as an introduction to the topic.
Character-Based Line Break Types
Most plain text files use certain predefined characters (or bytes) to mark their line breaks. If the program that reads, processes or displays such a text file knows these characters, it knows that they should not be displayed as letters but interpreted as (invisible) line breaks.
This approach would be easy to implement if a single specific character for a line break had been agreed upon over time. However, because the various systems have grown historically, this is still not the case today. So, depending on the operating system, other characters or bytes can be used for a line break.
Characters and Codepoints for Line Breaks and their Use
The following table provides an overview of the different characters and character combinations for line breaks and the most common systems that use each type of line break:
Abbreviation | Code (Hex/Dec) | Character Set | System/Usage |
CR LF | 0D 0A / 13 10 | ASCII | Windows, MS-DOS, OS/2, Symbian OS, Palm OS, Atari TOS, CP/M, MP/M, RT-11, Amstrad CPC, DEC TOPS-10 as well as most other early non-Unix and non-IBM operating systems |
LF | 0A / 10 | ASCII | Line Feed - Unix and Unix-like systems (Linux, macOS, Mac OS X, Android, BSD, AIX, Xenix and so on), Amiga, AmigaOS, QNX, Multics, BeOS, RISC OS and other POSIX standard oriented systems |
CR | 0D / 13 | ASCII | Carriage Return - Mac OS (Classic) up to version 9, Apple II, Lisa OS, Commodore 64 (C64), Commodore 128 (C128), Acorn BBC, ZX Spectrum, TRS-80, Oberon, HP Series 80, MIT Lisp Machine, OS-9 |
RS | 1E / 30 | ASCII | Record Separator - QNX (before the POSIX implementation with version 4) |
EOL | 9B / 155 | ATASCII | End Of Line - Atari 8-Bit Computer |
NL | 15 / 21 | EBCDIC | New Line - IBM Mainframe Systems such as z/OS (OS/390) or IBM i (i5/OS, OS/400) |
LF | 25 / 37 | EBCDIC | Line Feed - EBCDIC character for ASCII's 0A |
RNL | 06 / 06 | EBCDIC | Required New Line (since 2007) |
- | 76 / 118 | ZX80/ZX81 | Sinclair Research Home Computers Linebreak |
VT | U+000B | Unicode | Vertical Tab |
FF | U+000C | Unicode | Form Feed |
NEL | U+0085 | Unicode | Next Line |
LS | U+2028 | Unicode | Line Separator |
PS | U+2029 | Unicode | Paragraph Separator |
The most widespread and frequently used character set worldwide is ASCII (American Standard Code for Information Interchange), together with the Unicode standard that is based on it. The two most common line break types also come from this character set: the Unix line break LF as well as the Windows line break CR LF.
Unix and the current macOS from Apple use the Unicode code point U+000A (LF) as a line break, while older Apple systems use U+000D (CR). Windows and MS-DOS use both of these characters one after the other in the order 0D 0A. In addition to these two characters and their combination, the Unicode standard also requires the code points U+000B (Vertical Tab VT), U+000C (Form Feed FF = new page), U+0085 (Next Line NEL), U+2028 (Line Separator LS) as well as U+2029 (Paragraph Separator PS) to be interpreted as a line break. However, to date, only a few programs do this.
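As a small illustration of a program that does honor these Unicode line breaks, the following Python sketch compares str.splitlines(), which recognizes all of the code points mentioned above, with a naive split on LF only. The sample string is made up for this example.

```python
# A made-up sample that uses eight different line break types in a row.
sample = "one\ntwo\r\nthree\u0085four\u2028five\u2029six\vseven\feight"

# str.splitlines() treats CR, LF, CR LF, VT, FF, NEL, LS and PS as line breaks.
unicode_aware = sample.splitlines()

# A naive split on LF only misses everything except LF (and leaves the CR behind).
lf_only = sample.split("\n")

print(len(unicode_aware))  # number of lines found by the Unicode-aware split
print(len(lf_only))        # number of lines found by the LF-only split
```

Note that splitlines() additionally splits on a few further separators (FS, GS and RS), so it is slightly more generous than the list given above.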
One of the best-known character sets outside of the ASCII world is the 8-bit character set EBCDIC (Extended Binary Coded Decimal Interchange Code) developed by IBM for its mainframe computers. This character set uses the hexadecimal character 15 (decimal 21) for a line break, which combines the functions of CR and LF. In addition, EBCDIC also contains the ASCII-typical characters CR and LF (albeit the latter under a different character code) and, from 2007, the additional character RNL (Required New Line), which can be used to encode a conditional automatic line break.
Less common or only of historical relevance are the line breaks EOL (End Of Line) used on Atari 8-bit computers (mainly used in the 1980s) from the 8-bit character set ATASCII (ATARI Standard Code for Information Interchange) used by Atari, the line breaks from the ZX80 and ZX81 character sets used by Sinclair Research Ltd for its computers also in the 1980s as well as the line break RS (Record Separator), which was used by the QNX operating system until the release of version 4.0 in 1990. Some historical operating systems even defined newlines at the bit level: for example, the CDC 6000 series operating systems from the 1960s, at a time when memory was expensive, defined their line breaks as two or more zero bits filled 6-bit characters at the end of a 60-bit word.
Why does the Windows Line Break consist of two Characters?
The fact that Windows, MS-DOS and most other early non-Unix and non-IBM operating systems, in contrast to the other operating systems mentioned, define their line breaks with two characters has historical reasons and can be traced back to the procedure of typewriters and old printing devices:
On a typewriter, a line break is also carried out by two actions that can be distinguished from each other: On the one hand, the writing position moves back to the beginning of the line (carriage return) and, on the other hand, the writing position moves down one line, for example, by pushing the paper to be printed further by turning the roller (line feed). According to this logic, a complete "line break" is made up of a combination of these two actions. When character sets for computers were developed in the 1960s, separate control characters for the carriage return and for the line feed were defined in them so that the printers of that time could be controlled in the same way. This history is still reflected in today's most recent Windows versions.
The carriage return was given the decimal code 13 (hexadecimal 0D) in the ASCII character set at that time and is abbreviated as "CR", the line feed was given the decimal code 10 (hexadecimal 0A) and is abbreviated as "LF". Both of these characters can still be found in the current Unicode standard under the same numerical code points today.
Some systems also used the distinction between CR and LF for various text effects. If only CR without LF was used for the printer control, a carriage return without a line feed could be achieved. In this way, the writing position could reach the beginning of an already printed line and thus overprint the existing text with other characters. For example, this way text could be underlined, crossed out or written in bold. Diacritical characters outside of the character set actually used were also made possible in this way by overprinting or combining different characters. Similarly, the control character RI (Reverse Line Feed) defined with the code point U+008D in the Unicode standard can be used.
Unicode, ASCII, EBCDIC, HTML Entities and Escape Sequences
As we saw in the last section, in addition to many similarities, there are also certain differences between the individual character sets. For this reason, we would like to compare the relevant characters again in the next table:
Character | Unicode Code Point | ASCII (Hex) | ASCII (Dec) | EBCDIC (Hex) | EBCDIC (Dec) | HTML Entity (Hex) | HTML Entity (Dec) | Escape Sequence |
CR | U+000D | 0D | 13 | 0D | 13 | &#x0D; | &#13; | \r |
LF | U+000A | 0A | 10 | 25 | 37 | &#x0A; | &#10; | \n |
CR LF | - | 0D 0A | 13 10 | 0D 25 | 13 37 | - | - | \r\n |
NEL/NL | U+0085 | - | - | 15 | 21 | &#x85; | &#133; | \u0085 |
VT | U+000B | 0B | 11 | 0B | 11 | &#x0B; | &#11; | \v |
FF | U+000C | 0C | 12 | 0C | 12 | &#x0C; | &#12; | \f |
LS | U+2028 | - | - | - | - | &#x2028; | &#8232; | \u2028 |
PS | U+2029 | - | - | - | - | &#x2029; | &#8233; | \u2029 |
Since the Unicode standard has completely adopted all characters from the ASCII character set with identical code points as its "Basic Latin" block for compatibility reasons, all characters for line breaks from the ASCII character set such as the line feed LF, the carriage return CR, the vertical tab VT as well as the form feed FF are defined both in the ASCII character set and as Unicode codepoints with the same number.
In addition, the Unicode standard defines the code points U+0085, U+2028 and U+2029 as additional line breaks that are not part of the ASCII character set. To be distinguished from these real line breaks are the Unicode code points U+2424 (Symbol for Newline), U+23CE (Return Symbol), U+240D (Symbol for Carriage Return) as well as U+240A (Symbol for Line Feed). Although those characters do not generate a line break themselves, they can be used to create glyphs that are visible to the user in order to visualize the otherwise invisible line break characters.
The EBCDIC character set, which is mainly used on IBM mainframe systems, also has many parallels to ASCII. Although the standard EBCDIC line break is the character NL (hexadecimal code 15 / decimal code 21), which itself has no ASCII equivalent, EBCDIC also defines codepoints for the characters CR, LF, VT and FF. Of these four characters, only LF is defined under a code point different from ASCII in EBCDIC (25/37 instead of 0A/10).
The Unicode character equivalent to EBCDIC-NL is NEL (Next Line) and has the Unicode codepoint U+0085. This character has been defined in the Unicode standard in addition to CR and LF to enable bidirectional conversion from and to all other encodings. If we only had the characters CR and LF available in the Unicode standard, this would not be possible: For example, if we wanted to convert an EBCDIC text to Unicode and back again, we could first convert all NL line breaks to either LF or CR LF. When converting back, however, we would be faced with ambiguity, since EBCDIC makes a distinction between CR, LF and NL and it would therefore no longer be clear whether our LF and CR characters were already LF and CR before (and should therefore be maintained) or whether they were originally an NL (which would have to be converted back). So, only because the three different characters CR, LF and NEL are also available to us in the Unicode standard is a transformation possible without loss of information.
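The round trip described here can be sketched in Python. The example assumes Python's built-in "cp037" codec as one concrete EBCDIC code page (an assumption of this sketch, other EBCDIC variants may map differently) and a hand-made five-byte input.

```python
# EBCDIC (code page 037) bytes for: "A", CR, LF, NL, "B".
ebcdic = bytes([0xC1, 0x0D, 0x25, 0x15, 0xC2])

# Decoding keeps CR, LF and NL apart: NL (0x15) becomes NEL (U+0085).
text = ebcdic.decode("cp037")

# Because all three characters survive, encoding restores the original bytes.
round_trip = text.encode("cp037")

# If NEL is folded into LF instead, the way back is ambiguous:
# the former NL now comes out as EBCDIC LF (0x25).
lossy = text.replace("\u0085", "\n").encode("cp037")

print(round_trip == ebcdic)  # lossless conversion
print(lossy == ebcdic)       # information was lost
```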
Furthermore, the last columns of the table show the HTML entities as well as the escape sequences of the individual characters. The HTML entities can be used to insert the respective characters into HTML source text. The table shows the HTML entities in both hexadecimal and decimal notation. These two variants lead to the same result and can therefore be used interchangeably. For the LF character, the named HTML entity &NewLine; can also be used. Similarly, the escape sequences from the last column are placeholders for the characters mentioned. The escape sequences can be used, for example, in regular expressions or in some programming languages to insert the corresponding line break characters. More about this in the section on line breaks in the program code.
ASCII-based 8 bit encodings such as the Windows code pages or the Latin character sets are not listed in the table, as these character sets have also adopted all ASCII characters and therefore correspond to the "ASCII" column of the table.
Just for the sake of completeness, it should also be mentioned that in addition to the common Unicode encodings, which are built on the ASCII-compatible first code points of the Unicode standard, there is also an alternative Unicode encoding called UTF-EBCDIC, which is instead designed to be compatible with the EBCDIC character set.
Byte Representations in different Encodings
Depending on the encoding used, the mentioned Unicode codepoints result in different bytes within a stored file. The following table provides an overview of the byte sequences of the various line break types in the encodings ASCII, UTF-7, UTF-8, UTF-16 Little Endian and Big Endian as well as UTF-32 Little Endian and Big Endian:
Character | Unicode Code Point | ASCII | UTF‑7 | UTF‑8 | UTF‑16 LE | UTF‑16 BE | UTF‑32 LE | UTF‑32 BE |
CR | U+000D | 0D | 0D | 0D | 0D 00 | 00 0D | 0D 00 00 00 | 00 00 00 0D |
LF | U+000A | 0A | 0A | 0A | 0A 00 | 00 0A | 0A 00 00 00 | 00 00 00 0A |
CR LF | - | 0D 0A | 0D 0A | 0D 0A | 0D 00 0A 00 | 00 0D 00 0A | 0D 00 00 00 0A 00 00 00 | 00 00 00 0D 00 00 00 0A |
NEL/NL | U+0085 | - | 2B 41 49 55 | C2 85 | 85 00 | 00 85 | 85 00 00 00 | 00 00 00 85 |
VT | U+000B | 0B | 2B 41 41 73 | 0B | 0B 00 | 00 0B | 0B 00 00 00 | 00 00 00 0B |
FF | U+000C | 0C | 2B 41 41 77 | 0C | 0C 00 | 00 0C | 0C 00 00 00 | 00 00 00 0C |
LS | U+2028 | - | 2B 49 43 67 | E2 80 A8 | 28 20 | 20 28 | 28 20 00 00 | 00 00 20 28 |
PS | U+2029 | - | 2B 49 43 6B | E2 80 A9 | 29 20 | 20 29 | 29 20 00 00 | 00 00 20 29 |
Typical 8-bit encodings that are based on ASCII, such as the Windows code pages or the Latin character sets, are not listed separately in the table. These encodings use the same bytes as ASCII, which can be found in the ASCII column. Many other ANSI code pages and character sets also follow this convention.
The byte representations listed in this table are, among other things, important for the detection of the line break type of files, which we will address in the section on recognizing the line break type of a file.
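The byte sequences from the table can be reproduced with a few lines of Python. This is only a sketch for checking individual cells; the helper name hex_bytes is made up for this example, and the "-le"/"-be" codec names are used so that no byte order mark is written.

```python
def hex_bytes(chars: str, encoding: str) -> str:
    """Return the encoded bytes of chars as space-separated uppercase hex."""
    return chars.encode(encoding).hex(" ").upper()

LINE_BREAKS = {"CR": "\r", "LF": "\n", "CR LF": "\r\n",
               "NEL": "\u0085", "LS": "\u2028", "PS": "\u2029"}
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

# Print one row per line break type and one column per encoding.
for name, chars in LINE_BREAKS.items():
    cells = [hex_bytes(chars, enc) for enc in ENCODINGS]
    print(name, "|", " | ".join(cells))
```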
Line Break Characters as Line Separators or Line Terminators
Characters for line breaks can be interpreted in two different ways, both of which have their proponents and applications: a line break character can be considered either as a separator between two lines or as a marker for the end of a line.
To demonstrate this difference, let's look at the following example, where "N" represents the line break character:
abcNdefN
The content of such a file could be interpreted in two different ways:
- If we interpret the line break character as a separator between two lines, our example would have three lines: the first line with the content "abc", the second line with the content "def", followed by a third line that is empty.
- However, if we interpret the line break character as a line terminator, we would only get two lines: The first line with the content "abc" and "N" as the marker for the end of the line as well as the second line with the content "def" and again "N" as a terminator.
There are programs that regard newline characters as separators and other programs that interpret newline characters as terminators. The problems that result from this are obvious: programs that consider the line break character as a separator may interpret one (empty) line too much; programs that consider the line break character as a line-end marker may have problems reading the last line of a file.
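Both interpretations can be observed directly in Python, used here only as an illustration: str.split() treats the line break as a separator, while str.splitlines() treats it as a terminator.

```python
content = "abc\ndef\n"  # the example from above with "N" = LF

# Separator view: the text after the last line break counts as a third, empty line.
as_separator = content.split("\n")

# Terminator view: each line break closes a line, so there are only two lines.
as_terminator = content.splitlines()

print(as_separator)
print(as_terminator)
```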
Input of Line Break Characters
The system line break is usually easiest to enter using the Enter key. An exception occurs if the input is made within an editor that understands other line break types and in which either a file with a non-system line break type is currently being edited or the settings of this editor (or another program) are set to a different line break type.
Entering the other line break types is a little more difficult: some systems and text editors allow the keyboard shortcut CTRL + J to enter the LF character. Other common key combinations are CTRL + M for CR (which is also the reason why CR is sometimes displayed as ^M) and CTRL + K for VT. If we interpret CR and LF literally as a carriage return and a line feed, we can perform these movements using the Home key (labeled Pos1 on German keyboards) and the Arrow-Down key.
Within HTML source code, line break characters can additionally be inserted via their HTML entities, which are listed in the table in the section "Unicode, ASCII, EBCDIC, HTML Entities and Escape Sequences". Furthermore, we can enter the characters using the keyboard shortcut ALT + codepoint of the character, typed on the numeric keypad. In some contexts, such as in regular expressions or in many programming languages, we can also use the escape sequences of the characters, which are listed in the same table. More on the latter in the section on line breaks in the source code of programming languages.
Line Breaks by Defining a Fixed Line Length
In contrast to the line break types based on specific character definitions that were introduced in the last section, text files with a fixed line length do not require the definition of one or more characters for a line break. Instead, each line of such a file is based on a line length that can initially be freely selected, but is kept constant within the whole file. In the file itself, all lines are then simply written one after the other and, if necessary, brought to the required length using a suitable filler (padding) character.
The content of such a file (here, for example, with a fixed line length of four characters) can then look like this:
ABCDABC ABCD
A program that knows the line length used for the file and can display it, can then interpret this content as follows:
ABCD
ABC
ABCD
Since the second line only contains three characters, we used a space as a filler character here. If we hadn't done that, the "A" from the third line would have moved to the end of the second line.
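A minimal Python sketch of this scheme could look as follows. The function names split_fixed and join_fixed are made up for this example, and the space is assumed as the padding character, as in the example above.

```python
def join_fixed(lines: list[str], width: int, pad: str = " ") -> str:
    """Write all lines one after the other, each padded to exactly width characters."""
    for line in lines:
        if len(line) > width:
            raise ValueError(f"line {line!r} is longer than {width} characters")
    return "".join(line.ljust(width, pad) for line in lines)

def split_fixed(content: str, width: int, pad: str = " ") -> list[str]:
    """Cut the content into chunks of width characters and strip the padding."""
    if len(content) % width != 0:
        raise ValueError("content length is not a multiple of the line length")
    return [content[i:i + width].rstrip(pad) for i in range(0, len(content), width)]

stored = join_fixed(["ABCD", "ABC", "ABCD"], 4)
print(stored)
print(split_fixed(stored, 4))
```

One consequence of this design is that genuine trailing spaces cannot be distinguished from padding, so they are lost when the file is read back.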
Distribution and Areas of Application
Files with a fixed line length are significantly less common than files that implement their line breaks with a defined break character. The main reason against using a fixed line length is the lack of flexibility. After all, very few texts have the same number of characters in each line.
Nonetheless, there are some useful applications for such files, for example CSV data or other data sets in which every line has the same length. Additional characters for line breaks would not add any information to such files and can therefore be omitted (especially in applications or environments where memory needs to be saved).
Fixed Line Length as System Line Break
The fixed line length was only used as a system line break on some of the first mainframe computers. At that time, fixed line lengths of 72 or 80 characters were common on such systems. This number was modeled on the punch cards used previously, which also typically included 80 columns per card, of which the columns 73 to 80 were often used for sequence numbers. Some of these systems encoded lines longer than 80 characters by placing a carriage control character such as # as the first character at the beginning of the next line to be linked.
Record-based file systems, such as those used by the operating systems OpenVMS, RSX-11 or various newer mainframe computers, also do not require a line break character. Such systems store text files as one record per line. Each record begins with a length field in which the length of that line is stored. This means that no additional line delimiter in the form of a control character is necessary, since the reading program already knows from this field how many characters have to be read to obtain a line. Even though storage in this way does not require a line break character, the record management systems used are usually able to pass the requested lines to a requesting program with a line separator character if necessary.
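The idea of length-prefixed records can be sketched in Python as follows. The concrete layout, a 2-byte big-endian length field followed by the ASCII text of the line, is an assumption made for this illustration and not the on-disk format of any of the systems mentioned; the function names are hypothetical as well.

```python
import struct

LENGTH_FIELD = ">H"  # assumed layout: 2-byte big-endian unsigned length

def write_records(lines: list[str]) -> bytes:
    """Store each line as a length field followed by the line's bytes."""
    out = bytearray()
    for line in lines:
        data = line.encode("ascii")
        out += struct.pack(LENGTH_FIELD, len(data)) + data
    return bytes(out)

def read_records(blob: bytes) -> list[str]:
    """Read the lines back by following the length fields; no delimiter is needed."""
    lines, pos = [], 0
    size = struct.calcsize(LENGTH_FIELD)
    while pos < len(blob):
        (length,) = struct.unpack_from(LENGTH_FIELD, blob, pos)
        pos += size
        lines.append(blob[pos:pos + length].decode("ascii"))
        pos += length
    return lines

blob = write_records(["abc", "", "de"])
print(blob)
print(read_records(blob))
```

Because the length is stored explicitly, a record may even contain CR or LF bytes without being split.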
Line Breaks in HTML Source Code and other Markup Languages
Next to the character-based line break types and the line breaks defined by a fixed line length, which we looked at in the last two sections, there is another way to implement line breaks using a markup language.
Line Breaks in HTML Source Code
One of the best-known markup languages is HTML, the basis of web pages as we know them today. The implementation of line breaks in HTML source code and in other similar markup languages is special because line breaks can occur on two different levels: The source text itself can contain any character-based line breaks such as CRLF or LF, but they remain hidden, because the line breaks that are finally displayed on the page in the browser are solely based on the HTML tags and other formatting such as CSS style sheets.
To illustrate this, we would like to look at two HTML sources as an example. On the one hand, there is the following HTML source code:
<h1>Headline</h1><p>First Paragraph</p><p>Second<br>Paragraph</p>
On the other hand, there is this source code:
<h1>Headline</h1>
<p>First Paragraph</p>
<p>Second<br>
Paragraph</p>
As we can see, the first example does not contain any "visible" line breaks, while in the second example there is a line break after each meaningful paragraph and within the second paragraph. Nevertheless, both source codes lead to exactly the same display in the browser. So-called whitespace such as additional spaces, tabs or even line breaks do not play any role in the interpretation of the source text.
The only thing that matters in this example source code is that we have put one text into an h1 tag (from "heading 1") and two other texts into a p tag (from "paragraph"). By default (you can also override this behavior), these tags are both interpreted to insert a line break in the form of a paragraph after them. The same applies to tags like h2 (heading 2), h3 (heading 3), li (list elements) or the classic HTML line break br (simple break), which we used to wrap the second paragraph. Other tags such as the formatting tags b (bold) or i (italic) do not insert automatic breaks in the representation.
Classic line breaks in the source text that are not based on tags, on the other hand, can be used independently of the display in the browser, for example to structure the source text in order to make it more readable. These line breaks, which are later invisible in the browser, can be used in the form of character-based line breaks or, for example, they can be inserted via so-called HTML entities, which are listed in the section on HTML entities.
Override the Behavior with the pre Tag
This behavior of invisible character-based line breaks in the source text can be overridden using the HTML tag "pre" as well as the CSS style attribute "white-space:pre". The line breaks and other whitespace such as spaces in the source text that are located within a pre tag or within tags with the CSS property "white-space" with the value "pre" are output as such in the browser:
<pre>Line 1
Line 2</pre>
<span style="white-space: pre">Line 3
Line 4</span>
This source text creates four separate lines in the browser even though the line breaks were only written into the source text using otherwise invisible "whitespace". The line break between the first and the second line is created by the pre tag, the line break between the third and the fourth line is created by the CSS property of the enclosing span element.
Line Breaks in LaTeX, Markdown, RTF, Creole, PostScript, BBCode and AsciiDoc
Other common markup languages include LaTeX, Markdown, RTF, Creole and PostScript, each of which uses a different syntax to mark line breaks:
- TeX / LaTeX offers us three ways of marking a line break: with two backslashes "\\", with "\newline" or with "\hfill \break".
- Markdown turns blank lines into paragraphs and two or more spaces at the end of a line into a line break.
- In the Rich Text Format (RTF), paragraphs can be inserted with "\par" (from "paragraph") and simple line breaks with "\line".
- Creole uses two backslashes "\\" to mark line breaks.
- In the page description language PostScript, things are a little bit different: Here, to output text on a new line, we have to use the "moveto" command to move to the desired output position before specifying the text (in the case of a line break, that is to the location on our page at which the new line should start).
In markup languages such as BBCode or AsciiDoc, on the other hand, despite the possibility of other markups (such as "[b]word[/b]" or "*word*" for bold text), line breaks from the source text are also included in the result. So, in these markup languages, the character-based line break itself is used as markup (which is similar to Markdown).
Requirements for using Markup Languages
The prerequisite for using markup languages such as HTML, TeX / LaTeX, Markdown, RTF, Creole, PostScript, BBCode or AsciiDoc is of course that the markup, commands and tags used are known. Without knowing how a specific markup is to be used and interpreted, the text cannot be rendered correctly.
Line Breaks in the Source Code of Programming Languages
In programming source code, we face a similar problem as with HTML source text: we have to distinguish between the source code itself and what is later actually output by the (possibly compiled) program when it runs. The challenge is to keep the code as human-readable as possible without the formatting of the code affecting the behavior of the program.
As in HTML, many programming languages solve this by strictly distinguishing between the line breaks in the source code and the line breaks that the program later outputs: the usual character-based line breaks of the operating system can generally be used in the source code, while line breaks in the program's output are written with a dedicated notation, which can differ from programming language to programming language. Some examples of this are listed in the following table:
Language | Explicit Line Break | System Line Break |
C | char s[] = "-\r\x0A-"; | char s[] = "\n"; |
C++ | std::string s = "-\r\x0A-"; | std::string s = "\n"; |
C# | string s = "-\r\n-"; | string s = Environment.NewLine; |
Java | String s = "-\r\n-"; | String s = System.lineSeparator(); String t = String.format("-%n-"); |
JavaScript / TypeScript | var s = "-\n-"; | |
Delphi | var s: string; s := '-' + #13#10 + '-'; | var s: string; s := sLineBreak; |
Lazarus / FreePascal | var s: string; s := '-' + #13#10 + '-'; | var s: string; s := LineEnding; |
PHP | $s = "-\r\n-"; | $s = PHP_EOL; |
Python | s = "-\r\n-" | s = os.linesep |
Perl | my $s = "-\r\x0A-"; | my $s = "\n"; |
Haskell | "-\CR\LF-" :: [Char] | "\n" :: [Char] |
Visual Basic | Dim s1 As String = "-" & vbCrLf & "-" : Dim s2 As String = "-" & vbCr & "-" : Dim s3 As String = "-" & vbLf & "-" | Dim s1 As String = System.Environment.NewLine : Dim s2 As String = vbNewLine (deprecated) |
SQL | UPDATE tab SET col = '-' + CHAR(13) + CHAR(10) + '-'; | |
As the table shows, in most programming languages we can use two different approaches to insert a line break:
- Either we define our line break explicitly using its characters, which means we are firmly committed to a specific line break type (the examples each show two lines with the content "-" using the Windows line break \r\n respectively 0D 0A, 13 10 or \CR\LF - we can create a Unix line break in the same way by omitting \r, 0D, 13 or \CR and writing only \n, 0A, 10 or \LF),
- or we define our line break platform-independent using certain variables, constants or functions that the respective programming language makes available to us. The advantage of the latter is that we don't have to worry about the system on which our program is running, since this way we get the appropriate system line break type automatically (if we want that).
In the next two sections we would like to go into more detail about both variants and their pitfalls.
Explicit Line Break
In many programming languages such as C, C++, C#, Java, PHP, Python, Perl or Haskell, the escape sequences such as \r and \n introduced in the section "Character-Based Line Break Types" can be used to insert a line break into a string. Basically, \r stands for a carriage return (CR, U+000D) and \n for a line feed (LF, U+000A), which allows the line breaks to be generated for the different systems.
Depending on the programming language, the following aspects and particularities must be taken into account when using \r and \n:
- In some programming languages such as PHP and Perl, we can define strings using both single quotes ('text') as well as double quotes ("text"). However, escape sequences such as \r and \n are only automatically replaced if they appear within double quotes in these languages. So "-\r\n-" would create a line break between the two characters "-" and "-" while '-\r\n-' would keep the characters as such. In other programming languages such as JavaScript, TypeScript and Python, it doesn't matter whether we use single or double quotes. JavaScript and Python interpret both "\n" and '\n' as a newline. In C, C++, C#, Java and Haskell, however, this question does not even arise: In these programming languages, only double quotation marks are used for strings, while single quotation marks are reserved for chars.
- Although JavaScript and TypeScript understand both \r and \n, we should still be careful when using these two characters together: even on Windows computers, an alert("-\r\n-") does not produce just one but two newlines. The platform-independent JavaScript (as well as TypeScript) interprets both \r as well as \n as their own single line break. To create just one line break, we should use \n, for example like alert("-\n-"). Nevertheless, text in these languages can certainly contain the \r\n variant. This should be taken into account, for example, when processing user input that comes from a Windows computer. Also LS and PS breaks are accepted as line breaks in JavaScript input, but not NEL, which is interpreted as a space. Furthermore, we should note that if we want to generate HTML with our JavaScript or TypeScript code, what matters are not the character-based line breaks \r or \n, but rather the proper HTML tags, which we have described in the section about line breaks in HTML.
- Java distinguishes between \r, \n and %n. %n stands for the system line break regardless of the platform (that is \r\n if the program is running on Windows, \n if the program is running on Unix, Linux, macOS and so on), while \n explicitly stands for the character U+000A (that is exclusively for the Unix line break LF) and \r explicitly stands for the character U+000D (CR). Java's readLine() accepts both CR, LF and CRLF as line breaks. When reading EBCDIC text, however, the EBCDIC character NL is not mapped to NEL (U+0085) but to LF (U+000A).
- In C, C++, Perl and Haskell, \r and \n do not automatically always represent exactly the characters CR (U+000D) and LF (U+000A). It depends on the mode used: If a file is opened or written to, this can be done either in text mode or in binary mode. In binary mode, \r and \n behave as expected. In text mode, on the other hand (based on the C standard), the escape sequence \n alone represents a complete system line break. This means that simply using \n in text mode under Windows results in the output of a complete CRLF line break, so \n alone produces the output actually expected from \r\n. If we use \r\n instead, this would result in an output of CRCRLF on Windows, thus doubling CR (\r becomes the first CR, then \n becomes the second CR and LF). On Unix systems, however, \r\n would result in the output CRLF, since a full Unix newline consists only of the LF character. For this reason, the examples for C, C++, Perl and Haskell explicitly use \x0A or \CR instead of \n, although \n would work and print the character 0A (LF) in all cases on Unix and also in binary mode on Windows. The tricky thing is that this problem only becomes apparent when a program is run under Windows. If developed and tested on a Unix system, the problem does not exist and therefore cannot be noticed.
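- In PHP, Python, C# and Java, however, it is guaranteed that \r always stands for U+000D and \n always stands for U+000A, and that \r\n therefore always corresponds to the Windows line break CRLF. The escape sequences themselves are not converted automatically in these languages. File functions may nevertheless translate line breaks on their own: Python's open(), for example, translates \n to the system line break when writing in text mode unless its newline parameter is set accordingly.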
The situation is somewhat different in Delphi, Lazarus, Visual Basic and SQL. Instead of the escape sequences \r and \n, in Visual Basic we can use the constants vbCr and vbLf for the characters CR and LF. There is also the constant vbCrLf for the Windows line break, that is, for both characters together. In Delphi, Free Pascal and Lazarus we can insert the characters directly via their character codes #13 (CR) and #10 (LF). The situation is similar in the database language SQL, where we can use CHAR(13) and CHAR(10) to generate the corresponding characters.
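As a small illustration of such explicit definitions, the following Python sketch prints the code points behind the escape sequences and builds the same characters from their numeric codes. The helper name code_points is only chosen for this example.

```python
# In Python string literals, \r and \n always denote exactly
# U+000D (CR) and U+000A (LF).
CR = "\r"
LF = "\n"
CRLF = "\r\n"


def code_points(text: str) -> list:
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(CR))    # ['U+000D']
print(code_points(LF))    # ['U+000A']
print(code_points(CRLF))  # ['U+000D', 'U+000A']

# chr() builds the same characters from their numeric codes, comparable
# to #13 and #10 in Pascal or CHAR(13) and CHAR(10) in SQL.
assert chr(13) + chr(10) == CRLF
```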
Variables, Constants and Functions for the System Line Break
In addition to this explicit definition of line breaks, most programming languages also provide us with variables, constants or functions with which the respective system line break can be inserted regardless of the platform:
- An example for this is the constant "LineEnding" in Lazarus and Free Pascal ("sLineBreak" is the equivalent in Delphi). Depending on which system we compile our program on, this constant contains the appropriate line break. So, if we compile our program for Windows, "LineEnding" contains the characters CR and LF (that is the Windows line break). However, if we compile our program for macOS or Linux instead, "LineEnding" contains the Unix line break LF used under macOS and Linux. Lazarus gives us the choice of either writing #13#10 for a fixed line break or remaining variable with "LineEnding".
- Similar concepts also exist in other programming languages. For example, PHP provides us with the constant PHP_EOL for the same purpose, while in Visual Basic the constant is called vbNewLine.
- As explained in more detail in the previous section, in the programming languages C, C++, Perl and Haskell we can use the escape sequence \n for the system line break (in text mode), which in most other languages only stands for the character LF.
- Other languages expose the system line break as an attribute or a property: In Python, for example, it is available as os.linesep, in C# we can use Environment.NewLine for the same purpose.
- Some languages even provide us with several options: For example, in Java we can insert the system line break either with the format specifier %n (instead of \n) within a format string or we can call the method System.lineSeparator(). The situation is similar with Visual Basic, where we have both vbNewLine and System.Environment.NewLine available (however, vbNewLine has now been marked as deprecated and should no longer be used).
- JavaScript, TypeScript and SQL, on the other hand, do not have a native concept for determining the system line break due to their optimization for platform-independent usability.
In this way, we can decide for ourselves whether we want to explicitly use a specific line break type (for example because our goal is to save files with exactly this line break type) or whether our program should automatically use the appropriate system line break (for example because our program should produce an appropriate output on different systems).
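The two options can be compared in a short Python sketch. It assumes that we open the file with newline="" so that Python does not translate line breaks itself; the helper names write_lines and read_raw as well as the file names are hypothetical.

```python
import os
import tempfile


def write_lines(path, lines, line_break):
    """Write each line followed by line_break, without any translation."""
    # newline="" switches off Python's own translation of "\n" in text
    # mode, so the file receives exactly the characters passed in.
    with open(path, "w", encoding="utf-8", newline="") as f:
        for line in lines:
            f.write(line + line_break)


def read_raw(path):
    """Return the raw bytes of the file."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    fixed_path = os.path.join(tmp, "fixed.txt")
    system_path = os.path.join(tmp, "system.txt")

    # Explicit line break type: always CRLF, regardless of the platform.
    write_lines(fixed_path, ["a", "b"], "\r\n")
    # System line break: CRLF on Windows, LF on Linux and macOS.
    write_lines(system_path, ["a", "b"], os.linesep)

    fixed_bytes = read_raw(fixed_path)
    system_bytes = read_raw(system_path)

print(fixed_bytes)   # b'a\r\nb\r\n' on every platform
print(system_bytes)  # depends on the platform
```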
Network Protocols
The use of the correct line break type also plays an important role in network protocols. Many of these network protocols, such as HTTP, SMTP, FTP and IRC, are text-based and use the CRLF line break type for their line-by-line transmitted requests.
Some programs (such as qmail) adhere strictly to this standard and accordingly refuse to process requests that use a different line break type such as LF. Other programs are more tolerant in their processing or even incorrectly always use the system line break type for their requests, which can lead to problems in communication with systems that implement the standard more strictly. Some of these problems also arise from the use of the C-typical \n which, as we saw in the previous section, can resolve to either the correct CRLF or the incorrect LF in programming languages such as C, C++, Perl and Haskell, depending on the operating system and mode.
To avoid these problems, some protocols now recommend also recognizing line break types other than CRLF. However, by continuing to send CRLF we remain on the safe side, as we do not know which possibly outdated program is being used on the other side.
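How a request can be assembled with explicit CRLF bytes instead of the system line break is shown in the following Python sketch. The function name build_http_request and the chosen header fields are assumptions made for this example only; nothing is sent over the network.

```python
CRLF = b"\r\n"


def build_http_request(host: str, path: str = "/") -> bytes:
    """Assemble a minimal HTTP/1.1 GET request with explicit CRLF breaks."""
    lines = [
        f"GET {path} HTTP/1.1".encode("ascii"),
        f"Host: {host}".encode("ascii"),
        b"Connection: close",
    ]
    # Every line ends with CRLF; an additional empty line (a second CRLF)
    # marks the end of the header block.
    return CRLF.join(lines) + CRLF + CRLF


request = build_http_request("example.com")
print(request)
```

Because the request is built from bytes, the result is the same on every operating system and does not depend on a text mode translating \n.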
Detection of the Line Break Type of a File
For certain Unicode encodings, the encoding of a text file can be marked by a so-called Byte Order Mark (BOM) at the beginning of the file. Recognizing the line break type of a file is not that easy, because there is no comparable marker that could signal the line break type used. Therefore, when we have an unknown text file in front of us, all we have left are a few rules of thumb to determine the line break type of this file.
A first indication is provided by the operating system on which the text file was created: If the file originated on a Windows computer, the Windows line break CR LF is likely. If the file was created on a recent Mac or on Linux, it probably uses the Unix line break LF. However, this type of classification can at best be a rule of thumb, since there are, for example, plenty of text editors for Windows that may use the Windows line break in their default settings but can also create files with any other line break type. In addition, it may be completely unclear which system a file actually comes from.
For this reason, when interpreting a text file of unknown origin, we should not rely on such guesswork but rather try to make a decision based on the bytes of the file that we know. For example, we can proceed as follows:
- Count line breaks: In a first step, we should go through the bytes or the code points of the characters of the file and then see what type of line break fits the profile determined in this way. For example, if our file contains a lot of code points of type $0A (LF) but not a single code point of type $0D (CR), then it is most likely a text file with the Unix line break type LF. On the other hand, if $0D and $0A occur exactly the same number of times and every $0D is followed by a $0A, then we most likely have a file with the Windows line break type CRLF. We can proceed in the same way for any of the other line break types in question (for a summary table of possible byte sequences, see the section on byte representations of line break characters in different encodings).
- No finding: If we did not find a single character in our file that could appear in a line break or that could indicate the use of a certain line break type, this can have two reasons. Either our file does not contain a line break at all and only consists of a single line, or it is a file that uses a fixed line length as a line delimiter. If the second case applies, we can try to guess a possible fixed line length based on the file structure (for example through recurring patterns), but without information about which fixed line length was chosen for the file, this becomes very difficult. Because text files with fixed line lengths are also rare, in case of doubt we should fall back to the preferred line break of the system on which we open the file (for further editing of the file).
- Ambiguous counting: Another problem can arise if our count does not produce a clear result. For example, our file could contain both CR and LF characters, but not in equal numbers. This means that our file cannot be clearly classified as a Windows text file (for this there would have to be the same number of CR and LF characters in the file), nor can the file be clearly assigned to the CR or LF line break (for this the other character should not occur in the file at all). We cover this special case in the section "Files with mixed Line Breaks" of this article.
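The counting approach from the list above can be sketched in Python as follows. This is only a minimal sketch under the assumption that the file uses an ASCII-compatible encoding such as UTF-8 or Latin-1 (so that CR and LF are the single bytes 0x0D and 0x0A) and that only CR, LF and CRLF are of interest; the function names are chosen for this example.

```python
def count_line_breaks(data: bytes) -> dict:
    """Count CRLF pairs as well as stand-alone CR and LF bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so we subtract the
    # pairs to obtain the stand-alone occurrences.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": cr, "LF": lf}


def detect_line_break(data: bytes) -> str:
    """Return 'CRLF', 'CR', 'LF', 'mixed' or 'none' for the given bytes."""
    counts = count_line_breaks(data)
    used = [name for name, number in counts.items() if number > 0]
    if not used:
        return "none"   # single line or fixed line length
    if len(used) == 1:
        return used[0]  # clear result
    return "mixed"      # ambiguous counting


print(detect_line_break(b"one\ntwo\n"))        # LF
print(detect_line_break(b"one\r\ntwo\r\n"))    # CRLF
print(detect_line_break(b"one\r\ntwo\nthree")) # mixed
```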
Fortunately, we only have to do the work described here if we want to program an application ourselves that can handle all types of text files. A program that can already do this is the TextConverter: By default, the TextConverter works with the option "Line Break Type" > "Automatic Detection", which means that it automatically carries out the analysis of your files as described here without you noticing anything. However, it is of course also possible to change this default setting and to read or save files with any other type of line break. In addition to all line break types presented in this article, the TextConverter also supports a fixed line length as well as single or multiple user-defined characters or code points as a line break.
Problems with File Exchanges
The different encodings for line breaks can cause serious problems when exchanging files between different systems.
These problems can take many different forms:
- For example, a file created on Linux suddenly appears to have no line breaks any more on Windows, because Windows expects one more character for a complete line break than Linux wrote into the file: the entire file is thus displayed as a single long line. The situation is similar when we use a Windows text editor that cannot handle this type of line break to open a file created on a Mac or another Unix system.
- The other way around, text files created on Windows can cause all line breaks to be duplicated on Unix systems, as some Unix editors interpret CR and LF each as a separate line break, even though these two characters together form a single line break on Windows. Some editors also display the additional character as ^M or <cr> at the end of each line. On macOS systems, the problem of double line breaks is also more likely for historical reasons: Since the predecessor system Mac OS Classic up to version 9 used the character CR instead of today's LF as a line separator, many modern macOS programs still interpret not only LF but also the character CR as a complete line break.
- If we use files with a fixed line length, on the one hand, we have to inform the recipient about the number of characters per line, and on the other hand, we have to ensure that our recipient can open the file at all, since very few programs support text files with a fixed line length.
- Things can also get complicated when it comes to files that use one of the less common line break types such as VT, FF, NEL, LS or PS. For example, the default Windows text editor "Notepad" did not recognize any of these line break types for a long time and tended to display NEL as an ellipsis character (…), which occupies the same code point 85 (hexadecimal) in the Windows code page. Notepad was only revised with Windows 11 (or with the update 1803 for Windows 10) and now recognizes all of these line break types except NEL. Before this change, Notepad did not even recognize the widespread but non-native line break LF. The default Linux text editor of the GNOME desktop environment (Ubuntu, Fedora, Debian, Suse), "gedit", at least recognizes LS and PS, but not VT, FF and NEL. Only the default text editor "TextEdit" from macOS correctly recognizes, in addition to the frequently used line break types CR, LF and CRLF, all of the additional line break characters VT, FF, NEL, LS and PS required by the Unicode standard.
- With historical line break types such as RS, EOL or Sinclair, we cannot even expect that a normal text editor will recognize the files on its own without additional settings.
- If a text file from another system is not simply intended to be displayed in a text editor but is used, for example, as a configuration file or a data record, it may happen that the program in question does not even recognize the file or interprets it incorrectly. In addition, such errors sometimes only become apparent late or the affected programs issue error messages that are difficult to interpret.
- To make matters worse, the way programs deal with foreign line break types can vary widely. Browsers, for example, have to process text files from any system like perhaps no other class of programs, since for a website it is completely unclear on which system the underlying source code was created. They therefore usually accept each of the characters mentioned as a line break. Other programs can be very strict and only accept one specific character. All nuances are conceivable between these two extremes.
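The case of a misinterpreted configuration file from the list above can be reproduced with a few lines of Python. The simple key=value format and both function names are assumptions made for this sketch: a strict reader that only accepts LF keeps the CR of a Windows file as part of each value, while a tolerant reader based on str.splitlines() does not.

```python
def parse_config_strict(text: str) -> dict:
    """Split on LF only, as a strict Unix-style reader would do."""
    result = {}
    for line in text.split("\n"):
        if "=" in line:
            key, value = line.split("=", 1)
            result[key] = value
    return result


def parse_config_tolerant(text: str) -> dict:
    """Accept all line breaks known to str.splitlines()."""
    # str.splitlines() recognizes CR, LF and CRLF as well as, among
    # others, VT, FF, NEL, LS and PS.
    result = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            result[key] = value
    return result


windows_text = "name=demo\r\nmode=fast\r\n"
print(parse_config_strict(windows_text))    # {'name': 'demo\r', 'mode': 'fast\r'}
print(parse_config_tolerant(windows_text))  # {'name': 'demo', 'mode': 'fast'}
```

The stray CR at the end of each value is invisible in most outputs, which is one reason why such errors are often noticed late.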
In order to nevertheless make such files readable on the system of your choice, there are two options: either we just use a program that also understands exotic types of line breaks, or we swap the character for the line break in these files before further viewing or editing. We'll look at how this works in the next section.
How to Change the Line Break Type of Files
If you would like to read text files on your system that come from other operating systems or other sources and use a different line break type than your operating system does natively, you can rewrite the line breaks of the files in question, that is, replace the existing line breaks with the line break type you prefer. Such a rewrite may also be necessary if you want to read your text files with a program that only understands a certain kind of line break and cannot carry out the necessary conversion itself.
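For readers who prefer to script such a rewrite themselves, the basic idea can be sketched in Python. The sketch assumes an ASCII-compatible encoding such as UTF-8 and only handles CR, LF and CRLF; the function names convert_line_breaks and convert_file are hypothetical.

```python
import re

# CRLF must come first in the pattern so that it is matched as one line
# break and not as a CR followed by a separate LF.
_ANY_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def convert_line_breaks(data: bytes, target: bytes = b"\n") -> bytes:
    """Replace every CRLF, CR or LF in data with the target line break."""
    # A function is used as replacement so that target is inserted
    # literally, without any escape processing by the re module.
    return _ANY_LINE_BREAK.sub(lambda match: target, data)


def convert_file(source_path: str, target_path: str, target: bytes = b"\r\n") -> None:
    """Read a file in binary mode and save it with a uniform line break."""
    with open(source_path, "rb") as f:
        data = f.read()
    with open(target_path, "wb") as f:
        f.write(convert_line_breaks(data, target))


print(convert_line_breaks(b"one\r\ntwo\rthree\nfour", b"\n"))
# b'one\ntwo\nthree\nfour'
```

Working on bytes in binary mode avoids any additional translation by the runtime, so the saved file contains exactly the requested line break type.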
Change Line Break Type with the TextEncoder
Regardless of why you would like to change the line break type of files, you can easily make this change, even for any number of files at the same time, using the software TextEncoder. To do this, simply follow the steps below:
- First, drag all the files whose line break type you want to change onto the TextEncoder. Alternatively, you can also open the files individually or search entire folders for files using arbitrary filters.
- On the right side of the main window under "Changes" > "Line Breaks" select your desired target line break type under "Save as", for example "CRLF - Windows" or "LF - Unix".
- Optional: Under "Read As" you can specify which line break type should be used for reading the files. By default, the option "Auto Detect" is used, which should be sufficient for most cases. However, for more exotic line break types such as line breaks that are defined by a fixed line length or line breaks that are based on user-defined code points, you should make a corresponding selection here.
- At the bottom right of the main window you will find the Storage Options. Here you decide whether you want to overwrite the respective original file or whether you want to store the converted files as new files, for example in a new folder.
- Finally, you have to click on the button "Convert and Save" under the storage options. This will change the line break of all files currently in the program's file list according to your current settings.
The TextEncoder supports all character-based line break types presented in this tutorial as well as line breaks after a fixed number of characters for both reading and saving text files. Additionally, you can also define and use custom line breaks via single or multiple characters or codepoints.
If you want to automate the line break change of one or more files (for example all files from a specific folder) via a script, you can use the TextEncoder in its version TextEncoder Pro CL.
Change Line Break Type with the TextConverter
It is also possible to change the line breaks of text files with the application TextConverter. The procedure is the same as the one just described for the TextEncoder, and the selection of supported line break types is identical as well.
However, the line break options in the TextConverter are not located under "Changes > Line Breaks" but under "Actions > Files > Line Break Type". In addition, you can also use the TextConverter for numerous other manipulations of plain text, CSV and XML files, while the TextEncoder is only intended for changing the encoding and line break type of text files. The TextConverter is also available as a batch version, which can be controlled and automated via the command line or using a script.
Files with mixed Line Breaks
In the section "Detection of the Line Break Type of a File" we already discussed the case of text files that contain several types of line breaks at the same time. None of the possible line break types can then be clearly and uniquely assigned to such a file.
Emergence of Files with mixed Line Breaks
Such files with mixed line breaks can emerge in various ways:
For example, a file may have been edited by different people on different systems. If these people are using text editors that can only understand and write their own system line break type, the following can quickly happen: Person A creates a text file on Linux. At this point, the resulting file only contains the Unix line break type LF. Person B then opens the file on Windows and begins adding a few paragraphs. These new paragraphs are written into the file using the Windows line break CR LF, but the old LF line breaks remain untouched. Such a file then unintentionally contains several types of line breaks.
The same can happen if multiple files from different systems are appended together without first harmonizing the line break type of the files.
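The scenario described above can be simulated in Python. In this sketch, the file name notes.txt and the helper line_endings_per_line are hypothetical; the file is first written with LF and then extended with CRLF, after which the ending of each line is listed.

```python
import os
import tempfile


def line_endings_per_line(data: bytes) -> list:
    """Return the line break type found at the end of each line."""
    endings = []
    # bytes.splitlines() splits on CR, LF and CRLF; keepends=True keeps
    # the line break bytes so that we can inspect them.
    for line in data.splitlines(keepends=True):
        if line.endswith(b"\r\n"):
            endings.append("CRLF")
        elif line.endswith(b"\n"):
            endings.append("LF")
        elif line.endswith(b"\r"):
            endings.append("CR")
        else:
            endings.append("none")
    return endings


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")
    # Person A creates the file with Unix line breaks.
    with open(path, "wb") as f:
        f.write(b"first line\nsecond line\n")
    # Person B appends new lines with Windows line breaks.
    with open(path, "ab") as f:
        f.write(b"third line\r\nfourth line\r\n")
    with open(path, "rb") as f:
        mixed = f.read()

print(line_endings_per_line(mixed))  # ['LF', 'LF', 'CRLF', 'CRLF']
```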
Repairing Files with mixed Line Breaks
But what can we do when it's already too late? How can we fix such a file with mixed line breaks? Fortunately, we don't have to do this manually, since we can just use the TextEncoder again, which was already introduced in the previous section. How exactly this works is explained in the tutorial "How to Repair Text Files with mixed Line Breaks".
And so that you do not end up with files with mixed line breaks again next time: the program TextConverter can join several text files together while taking their different line break types into account, without you having to take care of this explicitly.