TextConverter | Supported Formats

Supported Formats

With the TextConverter, you can edit arbitrary texts and text files regardless of their format. Examples include plain text files (typically with the file extension TXT), CSV files (typically CSV or TSV), files in XML-based formats (for example XML, XHTML, HTML, HTM, RSS or SVG), source code files (such as PHP, JS, BAT, CMD, SH, VBS, C, CPP, CS, PAS, PY or R) and any other text formats such as JSON, SQL, DIF, CSS or INI, to name just a few.

PDF documents and office documents such as Microsoft Word documents (DOC, DOCX), Microsoft Excel spreadsheets (XLS, XLSX) or other office files such as ODT, ODS, PPT or PPTX cannot be processed with the TextConverter, because internally these formats are not text files. However, the TextConverter can export text files and CSV files to the formats DOCX, ODT, XLSX and ODS as well as to images (JPG, PNG, BMP).

The TextConverter offers numerous actions to process texts and text files. With the actions for processing the entire text and with the actions for editing lines, all texts and text files of any format can be edited. In addition, the TextConverter provides some format-specific actions for the processing of CSV files and the processing of XML files.

Regardless of its format, a text file can be stored in different encodings and with different types of line breaks. The two tables below show which encodings and line break types the TextConverter supports.

Encodings

In the following table you can see an overview of all encodings supported by the TextConverter. These encodings can be read, written and changed by the TextConverter.

If you use the TextConverter with its default settings - that means without changing any settings - the TextConverter will try to automatically determine the encoding of a file. The TextConverter will then also use this encoding for storing the corresponding file. So, if you only want to edit the content of a text file (for example with replacements of text), you do not need to worry about the encoding settings.

If you would like to change the encoding of files or if you want to read files using a specific encoding, you can use the settings under "Actions > Files > Encoding". In addition to the options for reading and writing, you will also find an option that controls whether a byte order mark (BOM) is written into the files. The column "BOM" in the table shows whether an encoding supports byte order marks.

The same applies if you control the TextConverter via the command line or via a script: if you do not specify an explicit encoding for reading or saving a file, the encoding is determined automatically during reading and used again for writing. If you want to deviate from this default behavior, you can use the values from the column "Parameter" in the table. An introduction and examples of the use of the parameters can be found in the article about the script control of the TextConverter in the section about parameters for encoding.

Encoding	Description	BOM	Parameter
ASCII	7-bit encoding with 128 characters (00 to 7F)	no	ascii
Latin-1	8-bit encoding according to ISO 8859-1	no	latin1
Latin-2	8-bit encoding according to ISO 8859-2	no	latin2
WIN-ANSI	Language-dependent ANSI code page of your Windows installation	no	win-ansi
WIN-1250	Windows Code Page 1250 (Central European)	no	win-1250
WIN-1251	Windows Code Page 1251 (Cyrillic)	no	win-1251
WIN-1252	Windows Code Page 1252 (Western European)	no	win-1252
WIN-1253	Windows Code Page 1253 (Greek)	no	win-1253
CP437	Code Page 437 (CP437, IBM437, OEM-US)	no	cp437
UTF-7	For using Unicode in non-8-bit environments	yes	utf7
UTF-8	Unicode encoding with variable 1 to 4 bytes per character	yes	utf8
UTF-16 LE	Unicode encoding with variable 2 or 4 bytes per character, Little Endian	yes	utf16le
UTF-16 BE	Unicode encoding with variable 2 or 4 bytes per character, Big Endian	yes	utf16be
UTF-32 LE	Unicode encoding with fixed 4 bytes per character, Little Endian	yes	utf32le
UTF-32 BE	Unicode encoding with fixed 4 bytes per character, Big Endian	yes	utf32be

You can find out more about the respective encodings and their differences in the introduction to the Unicode text file formats.
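
How a byte order mark can identify an encoding may be easier to see in a short sketch. The following Python example is only an illustration and not the detection logic of the TextConverter, which is not documented here. The function name detect_bom is hypothetical, the values from the column "Parameter" are used merely as labels, and UTF-7 is left out for brevity.

```python
import codecs

# Byte order marks of the Unicode encodings from the table above.
# UTF-32 is checked before UTF-16 because the UTF-32 LE mark starts
# with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf32le"),
    (codecs.BOM_UTF32_BE, "utf32be"),
    (codecs.BOM_UTF8, "utf8"),
    (codecs.BOM_UTF16_LE, "utf16le"),
    (codecs.BOM_UTF16_BE, "utf16be"),
]


def detect_bom(data: bytes):
    """Return (label, bom_length), or (None, 0) if no known BOM is present."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```

A file without a byte order mark yields (None, 0) in this sketch. In that case an encoding can only be guessed from the content, which is why an explicit encoding setting can be useful for such files.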

Line Break Types

In the following table you can see an overview of all types of line breaks provided by the TextConverter. Since the TextConverter also supports line breaks at custom characters or code points, you are not bound to this selection but you can also define and use your own line breaks at one or more characters or code points.

If you use the TextConverter with its default settings, that is, without explicitly defining a type of line break for reading or writing, the TextConverter will try to automatically determine the type of line break used in a text or text file. This type of line break is then also used again when the file is saved. If you would like to change the line break type of a file or read files using a specific line break, you can use the settings under "Actions > Files > Line Break Type".

If you would like to change the line break type of files via a script or via the command line with the TextConverter or if you want to use a specific line break type for reading files, you can use the values from the column "Parameter". You can find out how you can control the TextConverter in batch mode with parameters for the line break type in the article about the script control of the TextConverter in the section parameters for the line break type.

Line Break	System / Designation	Code Point	Parameter
CRLF	Windows, DOS, OS/2, CP/M, Symbian, Palm, Atari	U+000D + U+000A	crlf
LF	Unix, Linux, macOS, Mac OS X, Android, AmigaOS, BSD	U+000A	lf
CR	Classic Mac OS, Apple II, Commodore C64, OS-9	U+000D	cr
NL	EBCDIC New Line - IBM Mainframe Systems	U+0015	nl
RNL	EBCDIC Required New Line	U+0006	rnl
LF	EBCDIC Line Feed	U+0025	lf_ebcdic
EOL	ATASCII End Of Line	U+009B	eol
GS	Group Separator	U+001D	gs
RS	Record Separator	U+001E	rs
US	Unit Separator	U+001F	us
FF	Unicode Form Feed	U+000C	ff
NEL	Unicode Next Line	U+0085	nel
LS	Unicode Line Separator	U+2028	ls
PS	Unicode Paragraph Separator	U+2029	ps
VT	Vertical Tab	U+000B	vt
TAB	Horizontal Tab	U+0009	tab
FIXED	Fixed Line Length with x Characters	-	fixedlength-x
NOCHAR	No Character	-	nochar
-	Linebreak at Character x	-	customstr-x
-	Linebreak at Codepoint x	-	customcp-x
-	Linebreak at one of the Characters x, y or z	-	customstrs-x,y,z
-	Linebreak at one of the Codepoints x, y or z	-	customcps-x,y,z

You can find out more about the different types of line breaks in the introduction to line breaks.
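
To illustrate what automatically determining the line break type can mean for the three most common types, here is a minimal heuristic in Python. It is a sketch under the assumption that the most frequent line break wins. The function name guess_line_break is hypothetical, and the TextConverter may use a different method.

```python
def guess_line_break(text: str):
    """Return "crlf", "lf" or "cr" for the most frequent line break, or None."""
    crlf = text.count("\r\n")
    counts = {
        "crlf": crlf,
        # Subtract the CR LF pairs so that their parts are not counted twice.
        "lf": text.count("\n") - crlf,
        "cr": text.count("\r") - crlf,
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

The returned labels correspond to the values crlf, lf and cr from the column "Parameter" in the table.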

Custom Line Breaks

If you want to work with line actions or if you want to change the line break type of files or texts using the TextConverter, you are not limited to the types of line breaks shown in the table. This selection is only the list of predefined line break types, which you can select directly in the drop down list in the TextConverter.

In order to define user-defined line breaks at one or more arbitrary characters or codepoints, you can go to "Actions > Files > Line Break Type > Read as" or "Actions > Files > Line Break Type > Save as" and select either "Custom Character" or "Custom Codepoint" from the drop down list - depending on whether you want to specify the line break for reading and/or writing as a character or as a codepoint. After this selection, an input field appears in which you can write your desired line break.

If you select "Custom Character", you can directly enter the character or characters in the input field that should be interpreted as a line break when reading or writing, for example "|" or "--".

If you select "Custom Codepoint", you have the option of entering your line break in the form of one or more codepoints. Compared to the specification as a character, this has the advantage that you can also easily specify invisible or non-displayable characters. Codepoints can be written in hexadecimal, in decimal or in the form U+X. In order to define the Windows line break CR LF as a custom codepoint, you could, for example, use the formats "#0D#0A" (hexadecimal), "13 10" (decimal), "U+0D U+0A" or "U+000D U+000A".
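
That these notations describe the same line break can be checked with a small Python sketch. The function parse_codepoints is hypothetical and only covers the forms shown in the examples above. The exact syntax rules of the TextConverter may be stricter or more lenient.

```python
import re


def parse_codepoints(spec: str) -> str:
    """Convert a codepoint specification into the string it describes."""
    # A token is '#' or 'U+' followed by hex digits, or a plain decimal number.
    tokens = re.findall(r"#[0-9A-Fa-f]+|U\+[0-9A-Fa-f]+|\d+", spec)
    chars = []
    for token in tokens:
        if token.startswith("#"):
            chars.append(chr(int(token[1:], 16)))   # hexadecimal, e.g. #0D
        elif token.startswith("U+"):
            chars.append(chr(int(token[2:], 16)))   # U+X form, e.g. U+000D
        else:
            chars.append(chr(int(token, 10)))       # decimal, e.g. 13
    return "".join(chars)
```

In this sketch, each of the four example notations for CR LF is converted to the same two-character string.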

If you control the TextConverter via the command line or a script, the custom line breaks can be passed via the parameters customstr-x and customcp-x. With customstr-x you can pass characters and with customcp-x codepoints, with the x standing for the respective character(s) or code point(s). For example, customstr-ab (line break at the string "ab") or customcp-#0D#0A (line break at the Windows line break CR LF defined by the codepoints #0D#0A in hexadecimal notation). Further examples of the use of the parameters for custom line breaks can be found in the tutorial for the script control of the TextEncoder in the section "Custom Characters for Line Breaks". Even if this tutorial is about the TextEncoder, you can also use the examples shown there for the TextConverter.

Lines with a Fixed Line Length

In addition to the line breaks on one or several characters, the TextConverter also supports reading and saving texts and text files with a fixed line length. This means that the end of a line is not defined by a certain character or a certain codepoint, but by a defined number of characters. For example, by the definition that a line always consists of 10 characters.

In the TextConverter, under "Actions > Files > Line Break Type > Read as" you can select the option "Line Break after this Number of Characters (Fixed Line Length)" and enter your desired number of characters. Under "Save as" you can select "No Character" if you want to keep this type of line break. If not, simply select a different type of line break in order to change the line break type of your text.

A more detailed explanation about working with files with a fixed line length can be found in the tutorial about rewriting text files with a fixed line length. This tutorial is written for the TextEncoder, but you can also use everything for the TextConverter.
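
The idea of a fixed line length can be sketched in a few lines of Python. The example is an illustration of the concept and not the implementation of the TextConverter. The function names split_fixed and rewrite_fixed are hypothetical.

```python
def split_fixed(text: str, length: int) -> list[str]:
    """Cut a text into lines of exactly `length` characters (the last may be shorter)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [text[i:i + length] for i in range(0, len(text), length)]


def rewrite_fixed(text: str, length: int, line_break: str = "\n") -> str:
    """Read a text with a fixed line length and join the lines with `line_break`."""
    return line_break.join(split_fixed(text, length))
```

For example, rewrite_fixed(text, 10, "\r\n") reads a text in lines of 10 characters each and joins these lines with the Windows line break CR LF.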

Line Breaks on Multiple Characters

Typically, line breaks are defined by a single fixed character or by a single fixed string. For example, with the fixed character LF (Unix, Linux, macOS) or the fixed string CR LF (Windows). This line break remains constant over the entire file or the entire text and no other character is interpreted as a line break.

However, with the TextConverter you can deviate from this rigid rule and define multiple characters or multiple strings that are each interpreted independently as a line break, for example both CR LF and LF. This function can be useful, for example, if text files from different systems have been copied into one file and this file now needs to be repaired. In this case, the TextConverter can read the file taking both types of line breaks into account and then save the file with a single uniform type of line break.
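
The repair scenario described above can be sketched in Python as follows. This is a minimal illustration under the assumption that longer line breaks take precedence over shorter ones, so that CR LF is treated as one line break and not as two. The function names split_on_any and normalize are hypothetical.

```python
import re


def split_on_any(text: str, breaks: list[str]) -> list[str]:
    """Split a text at every occurrence of any of the given line break strings."""
    if not breaks:
        raise ValueError("at least one line break is required")
    # Try longer line breaks first so that "\r\n" is matched as a whole.
    ordered = sorted(breaks, key=len, reverse=True)
    pattern = "|".join(re.escape(b) for b in ordered)
    return re.split(pattern, text)


def normalize(text: str, breaks: list[str], target: str = "\n") -> str:
    """Read a text with mixed line breaks and save it with one uniform type."""
    return target.join(split_on_any(text, breaks))
```

For example, normalize(text, ["\r\n", "\n"], "\r\n") reads a text containing both CR LF and LF line breaks and returns it with CR LF throughout.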

If you want to use the TextConverter via the graphical user interface and define line breaks at several characters, you can go to "Actions > Files > Line Break Type > Read as" and select either "Line break at each of these characters (comma-separated)" or "Line break at each of these code points (comma-separated)". These two options allow you to define several characters as a line break, either by typing the characters directly or in the form of codepoints. The individual characters or strings must be separated with a comma. For example, "a,bc" for a line break at every "a" and at every "bc" in the text. If you want to use the comma itself as a line break, you can put it in quotation marks, for example "",",." for a line break at every comma and every period in the file. Codepoints can be specified in the formats hexadecimal ("#0D#0A"), decimal ("13 10") or in the form U+X ("U+0D U+0A" or "U+000D U+000A").
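
How such a comma-separated list with a quoted comma can be split into its individual line breaks is shown in the following Python sketch. It assumes CSV-like quoting rules, which is an assumption made for illustration only, as the exact parsing rules of the TextConverter are not described here. The function name parse_break_list is hypothetical.

```python
import csv


def parse_break_list(spec: str) -> list[str]:
    """Split a comma-separated list of line breaks, honoring quoted commas."""
    # csv.reader treats a comma inside double quotes as data, not as a separator.
    return next(csv.reader([spec]))
```

With this sketch, the input a,bc results in the two line breaks "a" and "bc", while a quoted comma followed by a comma and a period results in the two line breaks "," and ".".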

If you control the TextConverter via the command line or via a script, you can use the parameters customstrs-x and customcps-x for line breaks at multiple characters. The x is to be replaced by the desired line breaks, for example customstrs-a,bc and customcps-#0D#0A for the two examples mentioned above. In the tutorial about the script control of the TextEncoder in the section "Line break on multiple Characters" you will find further explanations and examples for the use of the parameters customstrs-x and customcps-x. Everything in this tutorial also applies to the TextConverter.

Further information and examples on the topic are also available in the AskingBox tutorial "Repair Text Files with mixed Line Breaks". The examples there relate to the TextEncoder again, but can also be used for the TextConverter.