Regular expressions

Regular expressions, often shortened to RegEx or regex, are a powerful way to search for and match text patterns. Because the syntax is terse, the resulting expressions are hard to read and therefore hard to understand, which is why regular expressions are both loved and hated. Knowledge of regular expressions and good tooling make the results better and easier to understand.

This page tries to accomplish that by giving examples, explanation, references, tooling and websites. Not all implementations of Regular Expressions have the same functionality. Always verify that the given examples are supported by your RegEx implementation.

But please keep in mind that regular expressions are not the solution to everything. If you try to do too much with a single regular expression, you may fall into this well-known trap:

 Some people when confronted with a problem, think "I know, I'll use regular expressions."
 Now they have two problems.

Flavors

Most modern regular expression flavors can trace their history back to the Perl programming language (Perl-style regular expressions).

AWK ^[1] is one of the earliest tools to embed regular expressions; it uses POSIX Extended Regular Expressions (ERE).
Perl
Perl Compatible Regular Expressions (PCRE) is a C library developed by Philip Hazel ^[2].
.NET
Java. The java.util.regex package was first released with Java 4 (J2SE 1.4).
JavaScript
Python
Ruby

General

To get started, you first need to know what regular expressions are.
The following websites and references explain them:

Regular-Expressions.info ^[3] gives a very good explanation of regular expressions.
The site also offers a download of the very useful program RegexBuddy ^[4]. There is also a free tool, RegexPal, written in JavaScript by Steven Levithan ^[5].
Another good starting point is the description of the JavaScript implementation on W3Schools.com ^[6].
Looking for a regular expression but cannot find one? The library on RegExLib.com ^[7] offers a wide range of examples.
Want to know how to use regex in a web environment? WebReference.com ^[8] offers an example of using regex with JavaScript.
The Regular Expressions Cookbook ^[9], co-written by the author of RegexBuddy, gives even more explanation on this subject.
The website regex101 is an online tool for testing your regular expressions. It supports several regex flavors (but not Bash).

Elements

Regular expressions are built from a small set of elements, described in the tables below.

Anchors

Syntax	Example	Description
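^ (caret)	^. matches a in abc\ndef, and also d in multi-line mode.	Start of string. Matches at the start of the string the regex pattern is applied to and, in multi-line mode, after each line break. Matches a position rather than a character.
\A	\A. matches a in abc\ndef.	Same as the caret but never matches after line breaks.
$ (dollar)	.$ matches f in abc\ndef, and also c in multi-line mode.	End of string. Matches at the end of the string the regex pattern is applied to and, in multi-line mode, before each line break. Matches a position rather than a character.
\Z	.\Z matches f in abc\ndef.	Same as the dollar but never matches before line breaks, except (in many flavors) a single line break at the very end of the string.
\b	.\b matches c in abc.	Word boundary. Matches the position between a word character (anything matched by \w) and a non-word character.
\B	\B.\B matches b in abc.	Not a word boundary. Matches every position where \b does not match, such as between two word characters.
\m	\m. matches a in abctest; .\m matches the space in 'test for'.	Start of word (supported by only a few flavors, such as Tcl).
\M	\M. matches the space and the dot in 'test for.'	End of word (supported by only a few flavors, such as Tcl).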
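
The anchor examples above can be checked with a short script. This is a minimal sketch using Python's `re` module, chosen here only for illustration; other flavors may behave differently, and Python does not support `\m` or `\M`. Note that `^` and `$` only match at line breaks when the multi-line option is on.

```python
import re

text = "abc\ndef"

# Without MULTILINE, ^ only matches at the start of the whole string.
start_default = re.findall(r"^.", text)
# With MULTILINE, ^ and $ also match after and before each line break.
start_multiline = re.findall(r"^.", text, flags=re.MULTILINE)
end_multiline = re.findall(r".$", text, flags=re.MULTILINE)

# \A and \Z ignore MULTILINE and stay tied to the whole string.
start_of_string = re.findall(r"\A.", text, flags=re.MULTILINE)
end_of_string = re.findall(r".\Z", text, flags=re.MULTILINE)

# \b is a word boundary, \B is any position that is not a word boundary.
before_boundary = re.findall(r".\b", "abc")
between_word_chars = re.findall(r"\B.\B", "abc")

print(start_default, start_multiline, end_multiline)
print(start_of_string, end_of_string)
print(before_boundary, between_word_chars)
```

Because anchors match positions, the returned strings contain only the character matched by the dot.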

Assertions

Syntax	Example	Description
(?=regex)	t(?=s) matches the 2nd t in 'streets'.	Zero-width positive lookahead. Matches at a position where the pattern inside the lookahead can be matched. Matches only the position; it does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends. Here it finds a t that is followed by s.
(?!regex)	t(?!s) matches the 1st t in 'streets'.	Zero-width negative lookahead. Identical to positive lookahead, except that the overall match only succeeds if the regex inside the lookahead fails to match. Here it finds a t that is not followed by s.
(?<=text)	(?<=s)t matches the 1st t in 'streets'.	Zero-width positive lookbehind. Matches at a position to the left of which text appears. Because most engines cannot apply a regex backwards, many flavors restrict the lookbehind to plain text or fixed-length patterns. Some regex flavors allow alternation of plain-text options in the lookbehind.
(?<!text)	(?<!s)t matches the 2nd t in 'streets'.	Zero-width negative lookbehind. Matches at a position if text does not appear to the left of that position.
(?>regex)	(?>\d+) matches 5 and 00 in '$ 5.00'.	Atomic group, also known as a once-only subexpression. Once the group has matched, the engine does not backtrack into it. A possessive quantifier such as \d++ has the same effect on a single token.
(?(condition)then)		Conditional [if then].
(?(condition)then|else)		Conditional [if then else].
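
To see which t each lookaround selects in 'streets', the start index of every match can be listed. The following is a small sketch in Python; the helper name `starts` is made up for this illustration and is not part of any library.

```python
import re

word = "streets"  # s t r e e t s, with "t" at index 1 and index 5


def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


t_before_s = starts(r"t(?=s)", word)       # lookahead: t followed by s
t_not_before_s = starts(r"t(?!s)", word)   # negative lookahead
t_after_s = starts(r"(?<=s)t", word)       # lookbehind: t preceded by s
t_not_after_s = starts(r"(?<!s)t", word)   # negative lookbehind

# The lookaround itself is zero-width: only "t" is part of the match.
matched_text = re.search(r"t(?=s)", word).group()

print(t_before_s, t_not_before_s, t_after_s, t_not_after_s, matched_text)
```

Index 1 is the first t and index 5 is the second t, which agrees with the table.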

Characters

Syntax	Example	Description
\c	\ce matches te in testing.	Any XML name character (XPath and XML Schema flavors only).
\s	\sf matches "space"f in 'test for'.	Whitespace.
\S	\St matches st in testing.	Non-whitespace.
\d	\d matches each 9 in test99ing.	Digit.
\D	\D matches each character of test and ing in test99ing.	Non-digit.
\w	\w matches each character of test and for in 'test for.'	Word character (letter, digit or underscore).
\W	\W matches "space" and "dot" in 'test for.'	Non-word character.
\xhh	\x20 matches "space" in 'test for.'	Character with hexadecimal code hh.
\0xxx	\040 matches "space" in 'test for.'	Character with octal code xxx.
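
The character shorthands can be tried out as follows. This is a sketch in Python, assuming its default Unicode-aware `re` behaviour; `\D+` and `\w+` add a `+` iterator only so that whole runs are returned instead of single characters.

```python
import re

digits = re.findall(r"\d", "test99ing")
non_digit_runs = re.findall(r"\D+", "test99ing")
words = re.findall(r"\w+", "test for.")
non_word_chars = re.findall(r"\W", "test for.")
space_then_f = re.findall(r"\sf", "test for")
hex_space = re.findall(r"\x20", "test for.")

print(digits, non_digit_runs, words, non_word_chars, space_then_f, hex_space)
```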

Iterators

Iterators (also called quantifiers) are metacharacters that are not regular expressions by themselves.
Instead, they state how many times the preceding expression must or may occur in order to match.
The basic iterators are *, + and ?; most flavors also support the brace forms {n} and {n,m}.

Syntax	Example	Description
*	test(\d)*ing matches testing, test9ing, test99ing in 'testing, test9ing, test99ing'.	Zero or more occurrences.
+	test(\d)+ing matches test9ing, test99ing in 'testing, test9ing, test99ing'.	One or more occurrences.
?	test(\d)?ing matches testing, test9ing in 'testing, test9ing, test99ing'.	Zero or one occurrence.
{n}	test(\d){1}ing matches test9ing in 'testing, test9ing, test99ing'.	Exactly n times.
{n,m}	test(\d){2,5}ing matches test99ing in 'testing, test9ing, test99ing'.	Between n and m times (n, n+1, ..., m).

Without an iterator, the preceding expression must occur exactly once.
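
The effect of each iterator on the sample text can be compared side by side. The sketch below uses Python and drops the capturing parentheses around `\d`, an adjustment made here so that `re.findall` returns the whole match rather than the group.

```python
import re

text = "testing, test9ing, test99ing"

patterns = {
    "*": r"test\d*ing",       # zero or more digits
    "+": r"test\d+ing",       # one or more digits
    "?": r"test\d?ing",       # zero or one digit
    "{1}": r"test\d{1}ing",   # exactly one digit
    "{2,5}": r"test\d{2,5}ing",  # two to five digits
    "none": r"test\ding",     # no iterator: exactly one digit
}

results = {name: re.findall(p, text) for name, p in patterns.items()}

for name, found in results.items():
    print(name, found)
```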

Groups

Every time you create a group with (), you can reuse the captured text in a backreference or in the replacement. See the table below for examples.

Syntax	Example	Description
(regex)	(abc){3} matches abcabcabc. The 1st group matches abc.	Round brackets group the regex between them. They capture the text matched by the regex inside them, which can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.
(?:regex)	(?:abc){3} matches abcabcabc. No groups are created.	Non-capturing parentheses group the regex so you can apply regex operators, but they do not capture anything and do not create backreferences.
\1 to \9	(abc|def)=\1 matches abc=abc or def=def, but not abc=def.	Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences.
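
Capturing groups, non-capturing groups and backreferences can be illustrated as follows. This is a minimal Python sketch; the `key=value` sample string is made up for the replacement example.

```python
import re

# A capturing group can be repeated as a whole; it keeps its last repetition.
repeated = re.fullmatch(r"(abc){3}", "abcabcabc")
last_capture = repeated.group(1)

# A non-capturing group repeats the same way but creates no group.
non_capturing = re.fullmatch(r"(?:abc){3}", "abcabcabc")
group_count = non_capturing.re.groups

# \1 inside the pattern must repeat exactly what group 1 matched.
same_both_sides = re.fullmatch(r"(abc|def)=\1", "abc=abc") is not None
different_sides = re.fullmatch(r"(abc|def)=\1", "abc=def") is not None

# In the replacement, \1 and \2 insert the captured text.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

print(last_capture, group_count, same_both_sides, different_sides, swapped)
```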

Modifiers

Syntax	Example	Description
(?i)	te(?i)st matches teST but not TEST.	Turn on case insensitivity for the remainder of the regular expression. (?-i) Turn off case insensitivity.
(?s)		Turn on "dot matches newline" for the remainder of the regular expression. The s stands for 'single line' mode.
(?m)	(?m)^. matches a and d in abc\ndef.	Multi-line mode. Caret and dollar match after and before newlines for the remainder of the regular expression.
(?x)	(?x)te st matches test.	Turn on free-spacing mode to ignore whitespace between regex tokens and allow # comments for the remainder of the regular expression. For example, (?x)te st # a comment also matches test.
(?i-sm:regex)	Combined options.	Matches the regex inside the span with the option "i" turned on and the options "s" and "m" turned off.
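
The modifiers can be tried in Python with a small caveat: from Python 3.11 on, a global flag such as `(?i)` is only accepted at the start of the pattern, so `te(?i)st` from the table is written with the scoped form `te(?i:st)` in this sketch. Other flavors accept the mid-pattern form directly.

```python
import re

# Scoped flag: only "st" is case-insensitive, "te" stays case-sensitive.
scoped = re.compile(r"te(?i:st)")
matches_teST = scoped.fullmatch("teST") is not None
matches_TEST = scoped.fullmatch("TEST") is not None

# (?s): the dot also matches a newline.
dot_default = re.fullmatch(r"a.b", "a\nb") is not None
dot_single_line = re.fullmatch(r"(?s)a.b", "a\nb") is not None

# (?m): ^ also matches after each line break.
line_starts = re.findall(r"(?m)^.", "abc\ndef")

# (?x): whitespace and # comments in the pattern are ignored.
free_spacing = re.fullmatch(r"(?x) te st  # still matches test", "test") is not None

print(matches_teST, matches_TEST, dot_default, dot_single_line)
print(line_starts, free_spacing)
```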

Quotation

Syntax	Example	Description
\	\- means the literal '-' character.	Matches nothing itself, but quotes the following character.
\Q	\Q....\E means the literal characters '....'. Same as \.\.\.\.	Matches nothing itself, but quotes all characters until \E.
\E		Matches nothing itself, but ends the quoting started by \Q.
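
Not every flavor supports \Q...\E. Python's `re` module, for instance, does not, but its `re.escape` function produces the same backslash-quoted form. The sketch below contrasts an unquoted and a quoted pattern; the sample strings are made up for illustration.

```python
import re

literal = "...."
quoted = re.escape(literal)  # backslash-quotes every dot

# Unquoted, each dot matches any character.
unquoted_hits = re.findall("....", "ab.. ....")
# Quoted, only four literal dots match.
quoted_hits = re.findall(quoted, "ab.. ....")

# A single backslash quotes the next character: \- is a literal hyphen.
hyphen_hits = re.findall(r"[a\-z]", "a-z b")

print(quoted, unquoted_hits, quoted_hits, hyphen_hits)
```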

Ranges

Syntax	Example	Description
.	. matches each of a, b and c in 'abc'.	Any character except newline (\n).
(a|b)	(a|b) matches a and b in 'abc'.	a or b.
(...)	(ab) matches ab in 'abc'.	Group. The characters have to appear in the same sequence.
(?:...)	(?:ab) matches ab in 'abc'.	Passive group. Does not create a group for backreferences.
[abc]	[abc] matches c and a in 'Duplicate test'.	Character class: a, b or c.
[^ste]	[^ste] matches r in 'streets'.	Negated class: not s, t or e.
[a-q]	[a-q] matches each e in 'streets'.	Range: letters between a and q.
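
The range examples can be verified one match at a time. This is a sketch in Python; `(?:a|b)` is used instead of `(a|b)` only so that `re.findall` returns the matched text directly.

```python
import re

any_char = re.findall(r".", "abc")
a_or_b = re.findall(r"(?:a|b)", "abc")
in_set = re.findall(r"[abc]", "Duplicate test")
not_in_set = re.findall(r"[^ste]", "streets")
in_range = re.findall(r"[a-q]", "streets")

# The dot skips the newline unless "dot matches newline" mode is on.
dot_skips_newline = re.findall(r".", "a\nb")

print(any_char, a_or_b, in_set, not_in_set, in_range, dot_skips_newline)
```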

Lookaround

Syntax	Example	Description
?=	bar(?=bar)	Positive lookahead finds the 1st bar ("bar" which has "bar" after it).
?!	bar(?!bar)	Negative lookahead finds the 2nd bar ("bar" which does not have "bar" after it).
?<=	(?<=foo)bar	Positive lookbehind finds the 1st bar ("bar" which has "foo" before it).
?<!	(?<!foo)bar	Negative lookbehind finds the 2nd bar ("bar" which does not have "foo" before it).

Note: In the examples the text to be examined is foobarbarfoo.

Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions. Negative lookahead is indispensable if you want to match something not followed by something else. Unfortunately, lookaround is not supported by every regex implementation:

AWK does not support look-ahead or look-behind, since it uses POSIX Extended Regular Expression (ERE).
Bash script regex does not support "lookaround" at all.
JavaScript has always supported lookahead; lookbehind was added in ECMAScript 2018, so older engines support lookahead only.
PCRE (and PCRE2) is not fully Perl-compatible when it comes to lookbehind.

Despite these many differences, PCRE is the de facto standard.

CodeWright

Applications and programming languages do not implement regular expressions in the same way, so usage can differ considerably. Below are a few examples using the CodeWright Search & Replace options.

Find

Regex	Meaning	Matches
([ ]+[0-9]+)	1 or more Spaces, 1 or more digits	[ 123456 ]

Replace

Find Regex	Meaning	Replace Regex	Meaning
( )([0-9][0-9][0-9])( )	Space, 3 digits, space: ..XXXX 123 YYYY...	\10\2\3	The 1st group, an inserted zero (0), then the 2nd and 3rd groups: ...XXXX 0123 YYYY...

Note that in CodeWright (CW) the more compact ( )([\d]{3})( ) does not work, because CW only has the iterators *, + and ?.
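
For comparison, the same zero-padding replacement can be sketched in Python. Replacement syntax differs between implementations: in Python a bare `\10` would be read as a reference to group 10, so the unambiguous form `\g<1>0` is used here. The sample text is taken from the table above.

```python
import re

text = "XXXX 123 YYYY"

# \g<1>0 means "group 1, then a literal 0".
# {3} is available in Python, unlike in CodeWright.
padded = re.sub(r"( )(\d{3})( )", r"\g<1>0\2\3", text)

print(padded)
```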

Examples

XML

The following example is from the Regular Expressions Cookbook ^[9]. It matches all XML-style tags except <em> and <strong>, so that they can be removed from an XML or HTML page.

(?xm)                 # Permits comments and multiple lines
< /?                  # Permit closing tag
(?!                   # Negative lookahead
   (?: em | strong)   #   List of tags to avoid match
   \b                 #   Word boundary avoids partial word matches
)
[a-z]                 # Tag name initial character must be a-z
(?: [^>"']            #   Any char except >, " or '
  | "[^"]*"           #   Double quoted attribute values
  | '[^']*'           #   Single quoted attribute values
)*
>

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
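
Since Python is among the listed flavors, the expression can be tried there. The sketch below passes the free-spacing option as `re.VERBOSE` instead of the inline `(?xm)`; `re.IGNORECASE` is an addition assumed here so that uppercase tags are covered too, and the constant name and sample HTML are made up for illustration.

```python
import re

TAGS_EXCEPT_EM_STRONG = re.compile(
    r"""
    < /?                  # Permit closing tag
    (?!                   # Negative lookahead
       (?: em | strong)   #   List of tags to avoid match
       \b                 #   Word boundary avoids partial word matches
    )
    [a-z]                 # Tag name initial character must be a-z
    (?: [^>"']            #   Any char except >, " or '
      | "[^"]*"           #   Double quoted attribute values
      | '[^']*'           #   Single quoted attribute values
    )*
    >
    """,
    re.VERBOSE | re.IGNORECASE,  # IGNORECASE is an assumption of this sketch
)

html = '<p class="a>b">Hello <em>big</em> <strong>world</strong></p><br/>'

# Replace every matched tag with nothing; <em> and <strong> are left alone.
stripped = TAGS_EXCEPT_EM_STRONG.sub("", html)

print(stripped)
```

The quoted-attribute alternatives are what let the `>` inside `class="a>b"` pass without ending the tag early.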

PCRE

PCRE is short for Perl Compatible Regular Expressions.
It is the name of an open source library written in C by Philip Hazel.
The library is compatible with a great number of C compilers and operating systems.
Many people have derived libraries from PCRE to make it compatible with other programming languages.
The regex features included with PHP, Delphi, R, and Xojo (REALbasic) are all based on PCRE.
The library is also included with many Linux distributions as a shared .so library and a .h header file.

Though PCRE claims to be Perl-compatible, there are more than enough differences between contemporary versions of Perl and PCRE to consider them distinct regex flavors.
Recent versions of Perl have even copied features from PCRE that PCRE had copied from other programming languages before Perl had them, in an attempt to make Perl more PCRE-compatible.
Today PCRE is used more widely than Perl because PCRE is part of so many libraries and applications.

Philip Hazel has recently released a new library called PCRE2.
The first PCRE2 release was given version number 10.00 to make a clear break with the previous PCRE 8.36.
Future PCRE releases will be limited to bug fixes. New features will go into PCRE2 only.
If you're taking on a new development project, you should consider using PCRE2 instead of PCRE.
But for existing projects that already use PCRE, it's probably best to stick with PCRE. Moving from PCRE to PCRE2 requires significant changes to your source code (but not to your regular expressions).
You can find more information about PCRE and PCRE2 at http://www.pcre.org.

Reference

[1] AWK was created at Bell Labs in the 1970s. Its name is derived from the surnames of its authors: Alfred Aho, Peter Weinberger and Brian Kernighan.
[2] PCRE, Philip Hazel. The PCRE library is free, even for building commercial software.
[3] Regular-Expressions.info, the premier website about regular expressions, with tutorials, language examples, books and references, made by Jan Goyvaerts.
[4] RegexBuddy is really your perfect (software) companion for working with regular expressions. See the live demos. It also shows the capabilities of the different regex implementations (Java, JavaScript, Perl and more).
[5] RegexPal, written by Steven Levithan in JavaScript. The only thing you need is a web browser.
[6] W3Schools, full web building tutorials, free web tutorials.
[7] RegExLib.com, a regular expressions library with descriptions of the expressions used. Also a RegEx tester.
[8] WebReference.com, one of the oldest (created in 1995) and most respected web development sites. WebReference.com is all about the Web and Webmastery. From browsing to authoring, HTML to advanced site design, we'll keep you informed.
[9] Regular Expressions Cookbook, Jan Goyvaerts and Steven Levithan, 510 pages, O'Reilly Media, ISBN-10 0596520689. Also available for the Kindle.
