InfoCenter

Regular Expressions

In software like the Text Converter, you can carry out special tasks with the help of regular expressions, which greatly expands what such programs can do. To work with regular expressions, you need to know some basics, which are summarized on this page.

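The summary is divided into the following sections: What is a regular expression?, Basics, Repetitions, Character Classes, Grouping and Backreferences, Modifiers, Unicode, Examples and Software for Regular Expressions.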

What is a regular expression?

Using a regular expression corresponds to a search that is more general than a normal search. For example, if you search for "A" in a normal text, you find all occurrences of "A". But what can you do if you want to find all of the other uppercase characters as well? You can search for "A", "B", "C" and so on, or you can use a regular expression. The short regular expression [A-Z] is the same as searching for each of the single characters. You can make a regular expression as precise as you want: you can search for an arbitrary date, an arbitrary e-mail address or whatever you need. This summary explains how to do that.

Regular expressions can be used, for example, in text editors like the Text Converter. There you can delete or replace text matched by a regular expression, or re-use the matched text in another context or position, for example to change the format of a date or to rewrite each e-mail address as a link. A normal search would have to know all possible e-mail addresses for that, which is impossible.
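The difference between a normal search and a regular expression can be sketched in a few lines of Python. Python's built-in re module is only an assumption of this sketch; the Text Converter uses its own engine, whose syntax may differ in details. The sample sentence is made up.

```python
import re

text = "Anna and Bob met Carl in Berlin."

# Normal search: only the literal character "A" is found.
literal_hits = [ch for ch in text if ch == "A"]

# Regular expression: [A-Z] finds every uppercase letter from A to Z.
regex_hits = re.findall(r"[A-Z]", text)

print(literal_hits)  # ['A']
print(regex_hits)    # ['A', 'B', 'C', 'B']
```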

Basics

This table lists the most important conventions and characters as well as their meanings. The 15 characters [](){}|?+-*^$\ and . are meta characters and have a special meaning within regular expressions, which is explained in the following sections. If you want to use one of these characters literally in a regular expression, put a \ in front of it to escape it from its meaning as a meta character. All other characters can normally be used in a regular expression as they are.

Regular Expression Meaning and Example
a The regular expression "a" matches the character "a". As long as a character is not a meta character with another meaning, it can be used in the regular expression directly.
Example: abc defg abcdefgbafcgbde 0123456789
[abc] Square brackets can be used to define a character group. The example finds one of the characters a, b or c.
Example: abc defg abcdefgbafcgbde 0123456789
[a-f] The hyphen can be used to define a range of characters. The example finds one of the characters a, b, c, d, e or f. In this example, too, the regular expression matches only one character.
Example: abc defg abcdefgbafcgbde 0123456789
[0-9] The hyphen can also be used to define a range of digits. The example finds the digits 0 to 9.
Example: abc defg abcdefgbafcgbde 0123456789
[A-Z0-9abc] Within a character group, ranges and single characters can be used. In the example, the group consists of the upper case letters A to Z, the digits 0 to 9 as well as the lower case letters a, b and c.
Example: abc defg abcdefgbafcgbde 0123456789 -&
[^a] With the meta character ^ at the beginning of a character group, the character group is negated. That means the example matches any character except a.
Example: abc defg abcdefgbafcgbde 0123456789
^a If the meta character ^ is used outside of a character group, it stands for the beginning of a string or a line. The example matches all lines or strings beginning with a.
Example: abc defg abcdefgbafcgbde 0123456789
a$ Just as the meta character ^ stands for the beginning of a string or a line, the character $ stands for its end. The example matches all strings or lines ending with an a.
Example: abc defg abcdefgbafcgbde 0123456789a
^abc$ Here the meta characters ^ and $ are used together. This example would match all strings or lines which are equal to "abc".
Example 1: abc
Example 2: abc abc
\b Stands for the position at the beginning or end of a word.
\B Stands for a position which is not at the beginning or end of a word.
\babc\b Finds the single word "abc", but not the string "abc" when it is surrounded by other characters.
Example: abc abcde abc deabcde deabc abc
\babc Finds all words beginning with "abc".
Example: abc abcde abc deabcde deabc abc
abc\b Finds all words ending with "abc".
Example: abc abcde abc deabcde deabc abc
\Babc\B Finds all words containing "abc", but not beginning or ending with "abc".
Example: abc abcde abc deabcde deabc abc
abc\B Finds all words containing "abc", but not ending with "abc".
Example: abc abcde abc deabcde deabc abc
ABC|abc The character | stands for an alternative. This regex will find "ABC" and "abc".
Example: abcde ABCDE fgabcde FGABCDE
[a\-f] If there is a \ in front of a meta character, the special meaning of this meta character is removed. In this case, the hyphen does not represent a range but is used as a character of its own. So, the example matches the characters a, f and -. In other words, with the character \ it is possible to add meta characters to character groups.
Example: abcdefg abcdefgh -
1\+1=2 This regular expression matches the string 1+1=2. Again, the meta character + is escaped with \.
Example: 1+1=2 1\+1=2
[-af] Whenever a meta character is within a character group at a position where it has no special meaning, it is used as a normal character. The example matches the characters -, a and f.
Example: abcdefg abcdefgh -
[af-] The same applies to a - at the end of a group.
Example: abcdefg abcdefgh -
ab[cd] In this example, one character from a character group is combined with "ab". So, the example matches the strings "abc" and "abd".
Example: abcdef abdef cdabcd cdabdc
. The dot stands for an arbitrary character. Every character will be found with this regular expression. Whether line breaks are included depends on the modifier s (see Modifiers).
Example: abc defg abcdefgbafcgbde 0123456789 -&
.ab Matches any string of three characters whose last two characters are a and b, for example aab, eab, %ab, :ab and so on.
Example: abc dfgabc &ab fgxab
[^a]ab Matches the same strings as the regular expression ".ab" except the string "aab".
Example: aab cab dd dab fg &ab
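Several rows of the table can be tried out with a short Python script. This is only a sketch that assumes Python's re module, which is not the engine of the Text Converter; the basic rules shown here behave the same way in most engines. The variable names are chosen for illustration only.

```python
import re

# [abc] matches one character out of a, b or c.
group_hits = re.findall(r"[abc]", "abc defg")      # ['a', 'b', 'c']

# [^a] matches any single character except a.
negated_hits = re.findall(r"[^a]", "aba")          # ['b']

# ^ and $ anchor the match to the beginning and end of the string.
exact_1 = bool(re.search(r"^abc$", "abc"))         # True
exact_2 = bool(re.search(r"^abc$", "abc abc"))     # False

words = "abc abcde abc deabcde deabc abc"

# \b marks a word boundary: only the standalone word "abc" is found.
standalone = len(re.findall(r"\babc\b", words))    # 3

# \B is the opposite: "abc" must sit inside a longer word.
inside = len(re.findall(r"\Babc\B", words))        # 1

# A backslash removes the special meaning of the meta character +.
escaped = re.findall(r"1\+1=2", "1+1=2 1\\+1=2")   # ['1+1=2']

# The dot matches any character; [^a]ab excludes "aab".
dot_hits = re.findall(r".ab", "aab cab")           # ['aab', 'cab']
not_a_hits = re.findall(r"[^a]ab", "aab cab")      # ['cab']
```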

If you want to test the regular expressions introduced on this page, you can use the software Text Converter. Just open an arbitrary text file in this tool and click on Search and Replace in the rightmost column. There you can activate regular expressions in the search box. After that, you can type in a regular expression to test how it works.

Repetitions

To describe situations in which there are repetitions of characters or whole character classes, you can use some of the following meta characters.

Regular Expression Meaning and Example
ab{2} The preceding element has to appear exactly two times. This example only matches abb. Because the a is not grouped together with the b by brackets, the repetition only affects the b and not the a.
Example: a ab abb babb abbb abbbb babbbbbbbbbbd
ab{2,3} The preceding element has to appear at least two times and not more than three times. This regular expression would match abb, abbb, but not ab or abbbb.
Example: a ab abb babb abbb abbbb babbbbbbbbbbd
ab{2,} The preceding element has to appear at least two times. The example would match abb, abbb, abbbb and so on.
Example: a ab abb babb abbb abbbb babbbbbbbbbbd
ab{,3} The preceding element must not appear more than three times. The example would match a, ab, abb and abbb, but not abbbb.
Example: a ab abb babb abbb abbbb babbbbbbbbbbd
ab? The question mark indicates that the preceding element is optional. That means the preceding element can appear, but it does not have to. The example would match a and ab. The question mark is the same as {0,1}.
Example: a ab abb babb abbb abbbb babbbbbbbbbbd
ab+ The plus indicates that the preceding element has to appear at least one time. The example would match ab, abb, abbb and so on, but not a. The plus is the same as the expression {1,}.
Example: a ab abb babb abbb abbbb babbbbbbbbbbd
ab* The character * indicates that the preceding element can appear zero or more times. It is the same as the expression {0,}.
Example: a ab abb babb abbb abbbb babbbbbbbbbbd
[ab]+ This example finds strings such as a, b, ab, ba, abb, ababa and so on. It does not mean that exactly the same character has to be repeated. It means that characters from the group have to be repeated. If you would like to find repetitions of the same character, you have to use backreferences. The regular expression for that is ([ab])\1+, which is explained in the section Grouping and Backreferences.
Example: a ab abb babb abbb abbbb babbbbbbbbbbd
a[bc]+d This expression matches strings like abd, acd, abcd, acbd, abccbd, acbcbbcd and so on.
Example: abcd aad acbd fg abbccbbcd fg abd fg acd
[0-9]{2,3} In this example, too, the digits do not have to be the same. The expression matches all numbers with two or three digits, hence the numbers from 10 to 999. Strings like 1,2 will not be found by this expression.
Example: 1,4 10 89 ab3a ab42a 234
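The repetition operators can be compared side by side with a small helper. The following sketch assumes Python's re module; the helper whole_matches is a hypothetical name introduced only for this illustration. It checks which candidate strings fit a pattern completely, which corresponds to the "would match" statements in the table.

```python
import re

candidates = ["a", "ab", "abb", "abbb", "abbbb"]

def whole_matches(pattern):
    """Return the candidates that fit the pattern from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

exactly_two = whole_matches(r"ab{2}")     # ['abb']
two_to_three = whole_matches(r"ab{2,3}")  # ['abb', 'abbb']
at_least_two = whole_matches(r"ab{2,}")   # ['abb', 'abbb', 'abbbb']
at_most_three = whole_matches(r"ab{,3}")  # ['a', 'ab', 'abb', 'abbb']
optional = whole_matches(r"ab?")          # ['a', 'ab']
one_or_more = whole_matches(r"ab+")       # ['ab', 'abb', 'abbb', 'abbbb']
zero_or_more = whole_matches(r"ab*")      # all five candidates

# Searching inside a longer text: runs of two or three digits.
numbers = re.findall(r"[0-9]{2,3}", "1,4 10 89 ab3a ab42a 234")
# ['10', '89', '42', '234']
```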

Character Classes

Besides creating your own groups of characters with square brackets, there are also some predefined character classes. With these classes, regular expressions become shorter and clearer.

Regular Expression Meaning and Example
\d This expression (Digit) stands for a digit. It is the same as [0-9].
Example: abc defg abcdefgbafcgbde 0123456789 -&
\D This expression stands for any character that is not a digit. It is the same as [^0-9] or [^\d].
Example: abc defg abcdefgbafcgbde 0123456789 -&
\w This expression (Word) stands for a letter, a digit or an underscore. It is the same as [A-Za-z0-9_].
Example: abc defg abcdefgbafcgbde 0123456789 -&
\W This expression stands for any character which is not a letter, digit or underscore. It is the same as [^\w].
Example: abc defg abcdefgbafcgbde 0123456789 -&
\s This expression stands for whitespace (Space), for example line breaks, tabs, spaces and so on.
\S This expression stands for any character which is not a whitespace character. It is the same as [^\s].
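The predefined classes can be checked with a short Python sketch. Python's re module is an assumption here and not the engine of the Text Converter. One difference is worth knowing: for text strings, Python's \w and \d also accept letters and digits outside the ASCII range unless the re.ASCII flag is set, so the equivalence with [A-Za-z0-9_] only holds with that flag.

```python
import re

sample = "abc_1 23\t-&"

digits = re.findall(r"\d", sample)               # ['1', '2', '3']
word_chars = "".join(re.findall(r"\w", sample))  # 'abc_123'
non_word = re.findall(r"\W", sample)             # [' ', '\t', '-', '&']
whitespace = re.findall(r"\s", sample)           # [' ', '\t']
non_space = "".join(re.findall(r"\S", sample))   # 'abc_123-&'

# Python-specific behaviour: "é" counts as a word character by default,
# but not when the match is restricted to ASCII.
unicode_word = re.findall(r"\w", "é")                  # ['é']
ascii_word = re.findall(r"\w", "é", flags=re.ASCII)    # []
```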

Grouping and Backreferences

With round brackets, you can group characters, for example to apply an operator to the whole group. Furthermore, round brackets create backreferences. That means the characters matched within the brackets are stored, so that you can re-use them in the same regex or even in the replacement expression when searching and replacing with regular expressions. The examples show some of the possibilities, which you should test in a program like the Text Converter to get a feeling for these expressions.

Regular Expression Meaning and Example
(ab)+ The whole group "ab" is repeated; one or more repetitions of "ab" will be found.
Example: abcde ababcde ababababa
ab(cd|ef|gh)i Within the brackets, there are some alternatives. This example will match the strings "abcdi", "abefi" and "abghi" but no other strings.
Example: abcdifg abcdefghi abghi
([ab])\1+ Here the backreference \1 is used. Each pair of brackets creates such a reference; the 1 corresponds to the first bracket in the expression. The expression means that the letter found by the group [ab] has to be repeated one or more times after the group. Hence, this expression matches "aa", "bb", "aaa", "bbb" and so on.
Example: aaaacd efbbbbbbbbghab
([ab])x\1x\1 The reference can also be used more than one time. This expression matches "axaxa" and "bxbxb".
Example: axaxa axax bxbxb axbxa
([ab])x(c)x\1x\2 In this expression, two references \1 and \2 are used, which correspond to the first and second bracket. The strings "axcxaxc" and "bxcxbxc" will match this expression.
Example: axcxaxc axax bxcxbxc axbxa
([ab])x(c)x\2x\2 It is not necessary to use each of the references resulting from brackets in the expression. Here only the second reference is used.
Example: axcxcxc axax bxcxcxc axbxa
(\d+\.)(\d+\.) References can be used not only within a single regular expression. In the Text Converter, you can search for a string with a regex and replace this string by using references like $1, $2 and so on. If you type the example into the search field and replace it with $2$1, the two parts of the found date are swapped. Please note that you have to activate regular expressions for both the search and the replace box, using the options under the boxes.
Example: "11.04." will be replaced by "04.11."
\ba\b\s\b([aoeiu][a-z]+)\b With this regular expression, you can find every single word "a" followed by another word beginning with a, o, e, i or u. In English, the article "a" is normally not used in front of a word beginning with a vowel. You can use the replacement expression "an $1" in the Text Converter to correct this error.
Example: "a idea" will be replaced by "an idea"
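Backreferences inside an expression and in a replacement can be sketched as follows. The sketch assumes Python's re module, which writes references in the replacement as \1 and \2 instead of the $1 and $2 used by the Text Converter. The sample strings are made up for illustration.

```python
import re

# ([ab])\1+ : the letter captured by the group must be repeated at least once.
repeats = [m.group(0) for m in re.finditer(r"([ab])\1+", "aaaacd efbbbghab")]
# ['aaaa', 'bbb']

# Two references in one expression: \1 and \2 repeat the first and second group.
double_ref = re.findall(r"([ab])x(c)x\1x\2", "axcxaxc axax bxcxbxc axbxa")
# [('a', 'c'), ('b', 'c')]  (findall returns the captured groups)

# Swap the two parts of a date; \2\1 corresponds to $2$1.
swapped = re.sub(r"(\d+\.)(\d+\.)", r"\2\1", "11.04.")
# '04.11.'

# Put "an" instead of "a" in front of a word beginning with a vowel.
fixed = re.sub(r"\ba\b\s\b([aoeiu][a-z]+)\b", r"an \1", "This is a idea.")
# 'This is an idea.'
```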

Modifiers

The behaviour of regular expressions can be changed with so-called modifiers. If you want to change these modifiers in the Text Converter in general, go to the menu "Settings > Settings regarding regular Expressions (RegEx)", where all of the modifiers can be changed. It is also possible to change modifiers within a regular expression or to apply modifiers only to a part of the regex. How that works is explained in the second part of this section. The following modifiers can be adjusted:

  • Modifier i (Case Insensitive): If this modifier is active, the search ignores the difference between upper and lower case characters. That means the regex [a-z] matches either only lower case letters (modifier i is not active) or both lower and upper case letters (modifier i is active).
  • Modifier m (Multi Line): If this modifier is active, the text is treated as multiple lines. That means ^ and $ match the beginning and the end of each line. If the modifier is not active, ^ and $ match only the beginning and the end of the whole text.
  • Modifier s (Single Line): Treats a string as a single line. If this modifier is active, the dot . matches all characters including line breaks. If this modifier is not active, the dot does not match line breaks.
  • Modifier g (Greedy Mode): This modifier changes the behaviour of the repetition operators such as + and *. If this modifier is active, all operators behave normally, that is, greedily. If this modifier is not active, the operators are applied non-greedily. That means + works as +?, * works as *? and so on.
  • Modifier x (Extended Syntax): If this modifier is active, you can use whitespace (for example spaces) in your regular expressions and add comments (after a # in a line, all following characters are ignored by the regular expression). This makes the regular expression more readable, but you have to escape every space that should be matched with \ whenever it is not used in a character group.

If a modifier should only be used for one regular expression or even only for a part of a regex, you can use the following methods to change the modifiers. The modifiers mentioned above are named by their letters, that means the letters i, m, s, g or x have to be used.

Regular Expression Meaning and Example
(?i)[abc] This example shows how to activate a modifier. Here, the modifier i for case insensitivity is activated.
Example: abcdef ABCDEF
(?i)[a](?-i)[cd] By using (?-i) a modifier is deactivated. In the example, first the modifier i is activated, the letters a and A will be found. After that the modifier i is deactivated, c and d have to be lower case to match this expression.
Example: ac Ac AC ad Ad AD
((?i)[a])[cd] With brackets, you can reach the same result, because the modifier then only applies within the brackets. Of course, i has to be deactivated in general for this example.
Example: ac Ac AC ad Ad AD
(?ig-msx)[abc] If you want to change more than one modifier at the same time, you can also do that within one expression. Other possibilities are (?ims) to activate some modifiers or (?-ims) to deactivate a number of modifiers.
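For comparison, this is how the modifiers look in Python's re module, which is used here only as an assumed stand-in and differs from the Text Converter in several points: the modifiers are passed as flags (re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE), there is no g modifier (non-greedy matching is always written as +? or *?), and a modifier for one part of an expression is written in the scoped form (?i:...) rather than switched on and off with (?i) and (?-i).

```python
import re

# Modifier i, activated inside the expression.
insensitive = re.findall(r"(?i)[abc]", "abcdef ABCDEF")
# ['a', 'b', 'c', 'A', 'B', 'C']

# Modifier i for one part only (scoped form, Python 3.6 or newer).
partly = re.findall(r"(?i:a)[cd]", "ac Ac AC ad Ad AD")
# ['ac', 'Ac', 'ad', 'Ad']

# Modifier m: ^ and $ match at every line, not only at the text borders.
lines = "abc\nabd"
without_m = re.findall(r"^ab.$", lines)                    # []
with_m = re.findall(r"^ab.$", lines, flags=re.MULTILINE)   # ['abc', 'abd']

# Modifier s: the dot also matches line breaks.
without_s = re.findall(r"a.b", "a\nb")                     # []
with_s = re.findall(r"a.b", "a\nb", flags=re.DOTALL)       # ['a\nb']

# Greedy versus non-greedy repetition (+ versus +?).
greedy = re.findall(r"<.+>", "<a><b>")                     # ['<a><b>']
lazy = re.findall(r"<.+?>", "<a><b>")                      # ['<a>', '<b>']

# Modifier x: whitespace and comments inside the expression are ignored.
date = re.compile(r"""
    \d+   # day
    \.    # dot between day and month
    \d+   # month
""", re.VERBOSE)
verbose_hits = date.findall("11.04.")                      # ['11.04']
```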

Unicode

A frequent question is whether and how Unicode characters can be used within regular expressions, for example Chinese characters or letters from the Cyrillic or Greek alphabet. Originally, regular expressions were only used for ANSI characters, and lots of programs using regular expressions still only support the range of ANSI characters. This is different in the Text Converter. In this software, you can use arbitrary Unicode characters in the same way as ANSI characters. The following examples show how this works.

Regular Expression Meaning and Example
[Д-И] In the Text Converter you can use Unicode characters in the same way you are using ASCII characters. The example uses the range Д to И from the Cyrillic alphabet in a character group.
Example: АБВГДЕЖЗИКЛ
∞ Arbitrary special characters can be used directly, as in this example. Here, it is the character for infinity.
Example:
\x{221E} Alternatively, you can also use the Unicode HEX code for a character. This code is 221E for the infinity symbol and it is used like the regular expression in the example.
Example:
\x41 In the ASCII range, too, you can use the HEX code instead of the character. This makes sense especially for tabs or other characters that cannot be written directly. The example shows the hexadecimal code for A (code 41). A list of all HEX codes can be found in an ASCII table.
Example: ABC ABCABC
[\x{0001}-\x{221E}] Characters defined with the HEX code in this way can be used like any other character in the syntax of regular expressions. In the example, a group of characters is defined whose range includes all characters up to the code of the infinity symbol. This includes Latin, Greek and Cyrillic but not Japanese characters.
Example: ABCGHJΔΨΩБВГДЕЖカモヤモ
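The Unicode examples can be reproduced in Python with one change in notation. This sketch assumes Python's re module, which does not accept the \x{221E} form; a code point with four hexadecimal digits is written as \u221E instead, while the two-digit form \x41 works as in the table.

```python
import re

text = "ABCGHJΔΨΩБВГДЕЖカモヤモ"

# A range of Cyrillic letters, written directly in the character group.
cyrillic = re.findall("[Д-И]", "АБВГДЕЖЗИКЛ")
# ['Д', 'Е', 'Ж', 'З', 'И']

# The infinity symbol, given by its code point (Python notation \u221E).
infinity = re.findall(r"\u221E", "1 < \u221E")
# ['∞']

# \x41 is the hexadecimal code of the letter A.
letter_a = re.findall(r"\x41", "ABC ABCABC")
# ['A', 'A', 'A']

# All characters up to the infinity symbol: Latin, Greek and Cyrillic,
# but not the Japanese Katakana at the end of the text.
up_to_infinity = "".join(re.findall(r"[\u0001-\u221E]", text))
# 'ABCGHJΔΨΩБВГДЕЖ'
```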

Examples

With the knowledge from this page, you can write arbitrary regular expressions on your own by combining the rules in your own way. As an example, the following regular expression will be analyzed. It can be used to find an arbitrary e-mail address in any text:

\b[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b

As you can see, the regular expression is surrounded by \b. That means the e-mail address should be a single word and not be surrounded by other characters in the text. The structure of an e-mail address is name@domain.ending. As you can see in the regular expression, there is a character group before the @ whose characters have to be repeated at least one time (+). This is the name part of the e-mail address. After the @, there are two other character groups: one for the domain and one for the ending. These groups are divided by a dot. Because a dot is a special character within regular expressions, it has to be escaped by using \. The character group of the domain can consist of an arbitrary number of characters, but at least one character (+), and the character group of the ending has to consist of at least 2 and at most 4 characters. This is indicated by {2,4} behind the group.
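The analysis above can be checked with a few lines of Python. The sketch assumes Python's re module, and the addresses are invented examples. The last line shows a consequence of the {2,4} limit: an address whose ending has more than four letters is not found by this expression.

```python
import re

# The expression from this section, unchanged.
EMAIL = re.compile(r"\b[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b")

text = "Write to info@example.com or sales.team@example.org today."
found = EMAIL.findall(text)
# ['info@example.com', 'sales.team@example.org']

# An ending with six letters exceeds {2,4} and is therefore not matched.
long_ending = EMAIL.findall("name@example.museum")
# []
```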

With the regular expression in the example, you can find e-mail addresses in texts. But how can you work with this regular expression? For example, in a program like the Text Converter, you can use the expression to search and replace texts. Simply go to the action "Replace Text" and activate regular expressions under the box for the search or replace term. If you use our example, you can enter a text which will replace the e-mail address. For example, you can use the following regular expressions:

Search Term: (\b[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)
Replace with: <a href="mailto:$1">$1</a>

With this, the found e-mail address is rewritten as a link. As you can see, the expression in the search box must be enclosed in brackets. Only if there are brackets around the expression can you re-use the matched string in the replace box. The brackets can enclose arbitrary parts of the search term, which makes several other things possible. For example, we can try the following combination:

Search Term: .*(\b[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b).*
Replace with: $1

With this, you are searching for an e-mail address with arbitrary characters (.*) around it. So, the whole text including the address will be found. But it will only be replaced with the string found by the term in brackets, that is, the e-mail address. So, you can extract an e-mail address out of a text with this regular expression.
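Both replacements can be sketched in Python, again under the assumption that Python's re module stands in for the Text Converter and that \1 takes the place of $1. The sample lines are invented. The last two lines illustrate a detail of greedy matching in this engine: the leading .* takes as much text as it can, so for a name containing a dot only the part after the dot ends up in the brackets, while the non-greedy form .*? keeps the whole address.

```python
import re

SEARCH = r"(\b[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)"

line = "Contact: info@example.com (office)"

# Rewrite the address as a link.
linked = re.sub(SEARCH, r'<a href="mailto:\1">\1</a>', line)
# 'Contact: <a href="mailto:info@example.com">info@example.com</a> (office)'

# Extract the address by matching the surrounding text as well.
extracted = re.sub(r".*" + SEARCH + r".*", r"\1", line)
# 'info@example.com'

# Greedy versus non-greedy prefix with a dotted name.
dotted = "Contact: sales.team@example.com"
greedy_prefix = re.sub(r".*" + SEARCH + r".*", r"\1", dotted)
# 'team@example.com'
lazy_prefix = re.sub(r".*?" + SEARCH + r".*", r"\1", dotted)
# 'sales.team@example.com'
```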

If you are using more than one pair of brackets in your search term, you can refer to them with $1, $2, $3 and so on. Here is an example:

Search Term: (.*)search(.*)
Replace with: $1replace$2

In this example, we are searching for the text in front of and behind the string "search". All of this is replaced with the text in front of and behind "search", and the word "replace" is written between these texts. This regular expression does something that can be carried out much more easily: it only replaces the word "search" with "replace". But if you modify the expression a little bit, you can search for different spellings of a word or create much more complex search terms.
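The same search and replace can be tried in Python, assuming Python's re module with \1 and \2 in place of $1 and $2. The sample texts are made up. The second call shows a small modification that accepts two spellings of the word, and the third shows that in this engine the greedy (.*) lets only the last occurrence in a line be replaced.

```python
import re

# Text in front of and behind "search" is kept, the word itself is replaced.
simple = re.sub(r"(.*)search(.*)", r"\1replace\2", "a search term")
# 'a replace term'

# Modified expression: accept "search" as well as "Search".
both_cases = re.sub(r"(.*)[Ss]earch(.*)", r"\1replace\2", "Search here")
# 'replace here'

# With two occurrences in one line, the greedy (.*) leaves the first untouched.
last_only = re.sub(r"(.*)search(.*)", r"\1replace\2", "search search")
# 'search replace'
```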

Software for Regular Expressions

If you would like to use the regular expressions introduced on this page to work on text files, you can use the software Text Converter for this task. The Text Converter makes it possible to search in text files with regular expressions, to replace matches with other texts or other expressions, to delete parts of the text with the help of regular expressions, and to split files at the position of a match in order to save the parts as single files. Of course, with this program it is also possible to use matched parts in another context or order (backreferences). Another program is the Easy MP3 Player, with which you can search your music collection with regular expressions, so that you can carry out very specific searches.

  • Text Converter (Searching, replacing and deleting text with the help of regular expressions, backreferences and file splitting)
  • Easy MP3 Player (Search your music collection with regular expressions)
  • Clipboard Saver (Search and replace clipboard contents with regular expressions)

Important Note

This text about regular expressions was written by Stefan Trost, and it is not allowed to use this text (even in parts) in another context without the permission of Stefan Trost.