Regular expressions
Regular expressions (regexes for short) are very powerful for searching and matching text patterns. Because the syntax, and therefore the resulting expressions, are hard to read and hard to understand, regular expressions are both loved and hated. Knowledge of regular expressions and good tooling can make the results better and more understandable.
This page tries to accomplish that by giving examples, explanation, references, tooling and websites. Not all implementations of Regular Expressions have the same functionality. Always verify that the given examples are supported by your RegEx implementation.
But please keep in mind that regular expressions are not the solution to everything. If you try to do too much with a single regular expression, you may fall into the well-known trap:
Some people when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Flavors
All modern regular expression flavors can trace their history back to the Perl programming language (Perl-style regular expressions).
- Perl
- Perl Compatible Regular Expressions (PCRE) is a C library developed by Philip Hazel [1].
- .NET
- Java. The first release with built-in regex support was Java 4 (J2SE 1.4).
- JavaScript
- Python
- Ruby
General
To start with regular expressions, you first need to know what they are. The following websites and references contain this information. Regular-Expressions.info [2] gives a very good explanation of regular expressions. The site also offers a download of the very useful program RegexBuddy [3]. There is also a free tool written in JavaScript by Steven Levithan [4].
Another good starting point is the description of the JavaScript implementation on W3schools.com [5].
Looking for a regular expression but cannot find one? The library on RegExLib.com [6] offers a wide range of examples.
Want to know how to use regular expressions in a web environment? WebReference.com [7] offers an example of using regexes with JavaScript.
The Regular Expressions Cookbook [8], co-written by the author of RegexBuddy, gives even more explanation on this subject.
Elements
Regular expressions are built from the basic elements listed below.
Anchors
Syntax | Example | Description |
---|---|---|
^ (caret) | ^. matches a in abc\ndef, and also d in multi-line mode. | Start of string. Matches at the start of the string the regex pattern is applied to; in multi-line mode it also matches after each line break. Matches a position rather than a character. |
\A | \A. matches a in abc\ndef. | Same as the caret but never matches after line breaks. |
$ (dollar) | .$ matches f in abc\ndef, and also c in multi-line mode. | Matches at the end of the string the regex pattern is applied to; in multi-line mode it also matches before each line break. Matches a position rather than a character. |
\Z | .\Z matches f in abc\ndef. | Same as the dollar, but never matches before line breaks inside the string. |
\b | .\b matches c in abc. | Word boundary. Matches the position between a word character (anything matched by \w) and a non-word character. |
\B | .\B matches a and b in abc. | Not a word boundary. Matches any position that is not a word boundary, for example between two word characters. |
\m | \m. matches a in abctest; .\m matches the space in 'test for'. | Start of word. Supported by only a few flavors (for example Tcl). |
\M | \M. matches the space and the dot in 'test for.' | End of word. Supported by only a few flavors (for example Tcl). |
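The anchor behaviour in the table can be checked with a short sketch. It assumes Python's `re` module, which is only one of the flavors listed above: it needs the `re.MULTILINE` flag before `^` and `$` match at line breaks, and it does not support `\m` and `\M`.

```python
import re

text = "abc\ndef"

# ^ matches only at the very start unless MULTILINE is set.
caret_default = re.findall(r"^.", text)
caret_multiline = re.findall(r"^.", text, flags=re.MULTILINE)

# \A never matches after a line break, even in MULTILINE mode.
start_only = re.findall(r"\A.", text, flags=re.MULTILINE)

# $ with MULTILINE matches before each line break and at the end.
dollar_multiline = re.findall(r".$", text, flags=re.MULTILINE)

# \Z matches only at the end of the string.
end_only = re.findall(r".\Z", text, flags=re.MULTILINE)

# \b is a word boundary, \B is any position that is not one.
boundary = re.findall(r".\b", "abc")
not_boundary = re.findall(r".\B", "abc")

print(caret_default, caret_multiline, start_only)
print(dollar_multiline, end_only)
print(boundary, not_boundary)
```

Other flavors may differ in how they treat `$` and `\Z` next to a trailing line break, so verify against your own implementation.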
Assertions
Syntax | Example | Description |
---|---|---|
(?=regex) | t(?=s) matches the 2nd t in 'streets'. | Zero-width positive lookahead matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends. In the example: looks for a character followed by the lookahead character. |
(?!regex) | t(?!s) matches the 1st t in 'streets'. | Zero-width negative lookahead is identical to positive lookahead, except that the overall match only succeeds if the regex inside the lookahead fails to match. In the example: looks for a character not followed by the lookahead character. |
(?<=text) | (?<=s)t matches the 1st t in 'streets'. | Zero-width positive look-behind matches at a position to the left of which text appears. Because many regex flavors cannot apply a pattern backwards, the test inside the look-behind is often restricted to plain text or a fixed-length pattern. Some regex flavors allow alternation of plain text options in the look-behind. |
(?<!text) | (?<!s)t matches the 2nd t in 'streets'. | Zero-width negative look-behind matches at a position if the text does not appear to the left of that position. |
(?>text) | (?>\d+) matches 5 and 00 in '$ 5.00'. | Once-only subexpression, also known as an atomic group. It behaves like a possessive quantifier: once matched, the group is not backtracked into. |
(?(condition)then) | | Conditional [if then]. |
(?(condition)then\|else) | | Conditional [if then else]. |
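A minimal sketch of the lookaround examples above, assuming Python's `re` module. The conditional pattern at the end is a hypothetical illustration that is not taken from the table: it requires a closing parenthesis only if an opening one was captured.

```python
import re

word = "streets"  # indices: s=0 t=1 r=2 e=3 e=4 t=5 s=6

# .start() gives the index of the t that each assertion selects.
lookahead = re.search(r"t(?=s)", word).start()             # t followed by s
negative_lookahead = re.search(r"t(?!s)", word).start()    # t not followed by s
lookbehind = re.search(r"(?<=s)t", word).start()           # t preceded by s
negative_lookbehind = re.search(r"(?<!s)t", word).start()  # t not preceded by s

print(lookahead, negative_lookahead, lookbehind, negative_lookbehind)

# Conditional (if then): group 1 optionally captures "(";
# (?(1)\)) demands ")" only when group 1 took part in the match.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")
results = [bool(balanced.search(s)) for s in ("42", "(42)", "(42")]
print(results)
```

The assertions never consume characters, which is why each match is the single letter t and only its position differs.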
Characters
Syntax | Example | Description |
---|---|---|
\c | \ce matches te in testing. | Any XML name character (XPath/XML Schema flavor only; in Perl-style flavors \cX denotes a control character). |
\s | \sf matches "space"f in 'test for'. | White space. |
\S | \St matches st in testing (the match ends at the 2nd t). | Non white space. |
\d | \d matches each 9 in test99ing. | Digit. |
\D | \D matches each character of test and ing in test99ing. | Not a digit. |
\w | \w matches each character of test and for in 'test for.' | Word character (letter, digit or underscore). |
\W | \W matches "space" and "dot" in 'test for.' | Not a word character. |
\xhh | \x20 matches "space" in 'test for.' | Hexadecimal character hh. |
\0xxx | \040 matches "space" in 'test for.' | Octal character xxx. |
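The shorthand classes can be tried out as follows. This is a sketch assuming Python's `re` module, which does not support the XPath-style `\c`. A `+` is added to some patterns so that whole runs are returned rather than single characters.

```python
import re

sample = "test99ing"
phrase = "test for."

digits = re.findall(r"\d", sample)        # every single digit
non_digit_runs = re.findall(r"\D+", sample)  # runs of non-digits

space_f = re.search(r"\sf", phrase).group()  # white space followed by f
words = re.findall(r"\w+", phrase)           # runs of word characters
non_words = re.findall(r"\W", phrase)        # every non-word character

# Both escapes denote the space character (hex 20, octal 40).
hex_space = re.findall(r"\x20", phrase)
octal_space = re.findall(r"\040", phrase)

print(digits, non_digit_runs)
print(repr(space_f), words, non_words)
print(hex_space, octal_space)
```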
Iterators
Iteration quantifiers are metacharacters that are not regular expressions by themselves. Instead, they state how many iterations of the preceding expression there must be or can be in order to match. These metacharacters are *, + and ?, plus the brace forms {n} and {n,m}.
Syntax | Example | Description |
---|---|---|
* | test(\d)*ing matches testing, test9ing, test99ing in 'testing, test9ing, test99ing'. | Any number of occurrences (zero or more). |
+ | test(\d)+ing matches test9ing, test99ing in 'testing, test9ing, test99ing'. | One or more. |
? | test(\d)?ing matches testing, test9ing in 'testing, test9ing, test99ing'. | Zero or one. |
{n} | test(\d){1}ing matches test9ing in 'testing, test9ing, test99ing'. | Exactly n times. |
{n,m} | test(\d){2,5}ing matches test99ing in 'testing, test9ing, test99ing'. | n, n+1, ..., m times. |
Without these quantifiers, each element of a regular expression matches exactly one occurrence in the text.
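The difference between the quantifiers can be seen side by side in a small sketch, assuming Python's `re` module. The capturing parentheses around `\d` are left out here, because `re.findall` would otherwise return only the captured digit instead of the whole match.

```python
import re

text = "testing, test9ing, test99ing"

zero_or_more = re.findall(r"test\d*ing", text)
one_or_more = re.findall(r"test\d+ing", text)
zero_or_one = re.findall(r"test\d?ing", text)
exactly_one = re.findall(r"test\d{1}ing", text)
two_to_five = re.findall(r"test\d{2,5}ing", text)

for name, found in [
    ("*", zero_or_more),
    ("+", one_or_more),
    ("?", zero_or_one),
    ("{1}", exactly_one),
    ("{2,5}", two_to_five),
]:
    print(name, found)
```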
Groups
Every time you create a group with (), you can reuse the text it captured, both later in the same pattern and in the replacement. See the table below for examples.
Syntax | Example | Description |
---|---|---|
(regex) | (abc){3} matches abcabcabc. The 1st group matches abc. | Round brackets group the regex between them. They capture the text matched by the regex inside them, which can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex. |
(?:regex) | (?:abc){3} matches abcabcabc. Nothing is captured. | Non-capturing parentheses group the regex so you can apply regex operators, but they do not capture anything and do not create backreferences. |
\1 to \9 | (abc|def)=\1 matches abc=abc or def=def, but not abc=def. | Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences. |
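A short sketch of capturing groups, non-capturing groups and backreferences, assuming Python's `re` module. The `key=value` swap at the end is a hypothetical example that shows groups being reused in a replacement.

```python
import re

# Capturing group: group 1 holds the last repetition of abc.
captured = re.fullmatch(r"(abc){3}", "abcabcabc")
first_group = captured.group(1)

# Non-capturing group: same match, but no groups are created.
uncaptured = re.fullmatch(r"(?:abc){3}", "abcabcabc")
group_count = len(uncaptured.groups())

# Backreference: \1 must repeat exactly what group 1 matched.
same_on_both_sides = re.compile(r"(abc|def)=\1")
matches = [bool(same_on_both_sides.fullmatch(s))
           for s in ("abc=abc", "def=def", "abc=def")]

# Groups reused in a replacement string.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

print(first_group, group_count, matches, swapped)
```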
Modifiers
Syntax | Example | Description |
---|---|---|
(?i) | te(?i)st matches teST but not TEST. | Turn on case insensitivity for the remainder of the regular expression. (?-i) turns case insensitivity off again. |
(?s) | | Turn on "dot matches newline" for the remainder of the regular expression. The s stands for 'single line' mode. |
(?m) | (?m)^st$ matches st in te\nst. | Caret and dollar match after and before newlines for the remainder of the regular expression. The m stands for 'multi-line' mode. |
(?x) | (?x)te st matches test. | Turn on free-spacing mode to ignore whitespace between regex tokens and allow # comments, so (?x)te st # a comment also matches test. |
(?i-sm:regex) | Combine options. | Matches the regex inside the span with the option "i" turned on, and "s" and "m" turned off. |
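The modifiers can be sketched in Python's `re` module with one caveat: recent Python versions only accept a global inline flag such as `(?i)` at the very start of the pattern, so the mid-pattern example `te(?i)st` is written here with the scoped form `(?i:...)` instead. That substitution is an adaptation for this flavor and not part of the table.

```python
import re

# Case insensitivity limited to the "st" part.
scoped = re.compile(r"te(?i:st)")
case_results = [bool(scoped.fullmatch(s)) for s in ("teST", "TEST")]

# (?s): the dot also matches a newline.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_single_line = re.search(r"(?s)a.b", "a\nb") is not None

# (?m): ^ and $ also match at line breaks.
multi_line = re.findall(r"(?m)^st$", "te\nst")

# (?x): whitespace and # comments inside the pattern are ignored.
free_spacing = re.fullmatch(r"(?x) te st  # spaces are ignored", "test") is not None

# Combined options in a span: i on, s off inside the group,
# even though DOTALL is switched on for the whole pattern.
span = re.compile(r"(?i-s:a.b)", flags=re.DOTALL)
span_results = [span.search(s) is not None for s in ("A-B", "A\nB")]

print(case_results, dot_default, dot_single_line)
print(multi_line, free_spacing, span_results)
```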
Quotation
Syntax | Example | Description |
---|---|---|
\ | \- means the literal '-' character. | Matches nothing itself, but quotes the following character. |
\Q | \Q....\E means the literal characters '....'. Same as \.\.\.\. | Matches nothing itself, but quotes all characters until \E. |
\E | | Matches nothing itself, but ends the quoting started by \Q. |
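Not every flavor supports `\Q...\E`. Python's `re` module, for instance, does not, and offers the function `re.escape` for the same purpose. The sketch below assumes that module and uses a made-up sample string.

```python
import re

literal = "...."
sample = "a....b.c"

# Unquoted, each dot is a wildcard: any four characters match.
as_wildcards = re.findall(literal, sample)

# re.escape puts a backslash before each special character,
# which has the same effect as wrapping the text in \Q...\E.
quoted = re.escape(literal)
as_literal = re.findall(quoted, sample)

# A single backslash quotes one following character.
dash = re.search(r"\-", "a-b").group()

print(as_wildcards, quoted, as_literal, dash)
```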
Ranges
Syntax | Example | Description |
---|---|---|
. | . matches a, b and c in 'abc'. | Any single character (written as a dot) except new line (\n). |
(a|b) | (a|b) matches a and b in 'abc'. | a or b. |
(...) | (ab) matches ab in 'abc'. | Group. The characters have to appear in the same sequence. |
(?:...) | (?:ab) matches ab in 'abc'. | Passive group. Does not create groups for backreferences. |
[abc] | [abc] matches c and a in 'Duplicate test'. | Character class: a, b or c. |
[^tes] | [^ste] matches r in 'streets'. | Negated class: not s, t or e. |
[a-q] | [a-q] matches each e in 'streets'. | Range: letters between a and q. |
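The range examples can be verified with the following sketch, which assumes Python's `re` module. `re.findall` lists every individual match, which makes clear that a class or a dot matches one character at a time.

```python
import re

any_char = re.findall(r".", "abc")
either = re.findall(r"a|b", "abc")
sequence = re.findall(r"(?:ab)", "abc")

char_class = re.findall(r"[abc]", "Duplicate test")
negated = re.findall(r"[^ste]", "streets")
letter_range = re.findall(r"[a-q]", "streets")

print(any_char, either, sequence)
print(char_class, negated, letter_range)
```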
CodeWright
The implementation of regular expressions differs between applications and programming languages, so the same task can look quite different from one tool to the next. Below are a few examples using the CodeWright Search & Replace options.
Find
Regex | Meaning | Matches |
---|---|---|
([ ]+[0-9]+) | 1 or more Spaces, 1 or more digits | [ 123456 ] |
Replace
Find Regex | Meaning | Replace Regex | Meaning |
---|---|---|---|
( )([0-9][0-9][0-9])( ) | Space, 3 digits, space: ...XXXX 123 YYYY... | \10\2\3 | Replace with the 1st group, an inserted zero (0), then the 2nd and 3rd groups: ...XXXX 0123 YYYY... |
Note that in CodeWright the more compact ( )([\d]{3})( ) does not work, because CodeWright only has the iteration quantifiers *, + and ?.
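For comparison, here is a sketch of the same find and replace in Python's `re` module, as an illustration of how flavors differ. Two things change. The `{3}` quantifier is available, so the compact form works. The replacement `\10` would be read as a reference to group 10 in Python, so the first group is written as `\g<1>` to keep it separate from the literal zero. The sample line is taken from the table above.

```python
import re

line = "XXXX 123 YYYY"

# Find: one or more spaces followed by one or more digits.
found = re.findall(r"[ ]+[0-9]+", line)

# Replace: space, three digits, space -> same text with a zero inserted.
# \g<1> is group 1; plain \10 would mean group 10 in this flavor.
padded = re.sub(r"( )([0-9]{3})( )", r"\g<1>0\2\3", line)

print(found, padded)
```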
Examples
XML
The following example is from the Regular Expressions Cookbook [8]. It removes all XML-style tags except <em> and <strong> from an XML or HTML page.
    (?xm)               # Permits comments and multiple lines
    < /?                # Permit closing tag
    (?!                 # Negative lookahead
      (?: em | strong ) # List of tags to avoid match
      \b                # Word boundary avoids partial word matches
    )
    [a-z]               # Tag name initial character must be a-z
    (?: [^>"']          # Any char except >, " or '
      | "[^"]*"         # Double quoted attribute values
      | '[^']*'         # Single quoted attribute values
    )*
    >
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
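A minimal sketch of applying this pattern, assuming Python's `re` module with the `re.VERBOSE` and `re.MULTILINE` flags in place of the inline `(?xm)`. The function name `strip_tags` and the sample markup are hypothetical and only serve as illustration.

```python
import re

TAGS_EXCEPT_EM_STRONG = re.compile(r"""
    < /?                # Permit closing tag
    (?!                 # Negative lookahead
      (?: em | strong ) # List of tags to avoid matching
      \b                # Word boundary avoids partial word matches
    )
    [a-z]               # Tag name initial character must be a-z
    (?: [^>"']          # Any character except >, " or '
      | "[^"]*"         # Double-quoted attribute values
      | '[^']*'         # Single-quoted attribute values
    )*
    >
""", re.VERBOSE | re.MULTILINE)


def strip_tags(markup: str) -> str:
    """Remove every tag except <em> and <strong> (hypothetical helper)."""
    return TAGS_EXCEPT_EM_STRONG.sub("", markup)


page = ('<p class="a>b">Keep <em>this</em> and <strong>that</strong>, '
        'drop <span>span</span>.</p>')
cleaned = strip_tags(page)
print(cleaned)
```

Because the tag name must start with a-z, uppercase tags such as `<P>` are left alone unless case insensitivity is switched on as well. The word boundary in the lookahead means that a tag like `<embed>` is still removed, even though its name starts with em.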
See also
- Developer Mozilla, JavaScript RegEx Reference
Regex Desktop Testers
- RegexBuddy, one of the best programs to create, maintain and test regular expressions [3].
- Expresso, .NET application (not free).
- the Regulator, SourceForge project .NET Regex tester.
Regex Online Test
- PHP Live Regex, a free online tester created by Philip Bjorge. Heavily inspired by Rubular (see below). Written in jQuery/JavaScript.
- Rubular, Michael Lovitt's minimalistic regex tester using Ruby 1.8.
- RegexPal, JavaScript version. Free and online [4].
- Lars Olav Torvik's regex tester for PHP PCRE, PHP POSIX and JavaScript.
- .NET Regex, built by David Seruyange in Microsoft .NET.
- Java Regex tester, Sergey Evdolimov's Java applet using Java 1.4. Also available as an Eclipse plugin and an IDEA plugin.
- ReAnimator, a fun tool created by Oliver Steele that shows regexes graphically.
- Extends Class, an online tool that lets you test regular expressions in JavaScript.
Grep
- PowerGrep, a g/re/p utility made by Jan Goyvaerts (not free).
- WinGrep, one of the oldest grep tools for Windows.
- Regex Renamer, more of a search and replace tool.
- Funduc SR, a search and replace tool for Windows.
Tools
- AWK, on the internal wiki page UNIX Command Reference
- Awk Tutorial. AWK implements regex.
- Grep, on the internal wiki page UNIX Command Reference
- Sed, on the internal wiki page UNIX Command Reference
Tutorial
- Regex Samples, Examples and answers
Reference
- [1] PCRE, Philip Hazel. The PCRE library is free, even for building commercial software.
- [2] Regular-Expressions.info, the premier website about regular expressions: tutorials, language examples, books and references, made by Jan Goyvaerts.
- [3] RegexBuddy, a software companion for working with regular expressions. See the live demos. It also shows the capabilities of the different regex implementations (Java, JavaScript, Perl and more).
- [4] RegexPal, written by Steven Levithan in JavaScript. The only thing you need is a web browser.
- [5] W3Schools, full web building tutorials, free web tutorials.
- [6] RegExLib.com, regular expressions library with descriptions of the expressions used. Also a regex tester.
- [7] WebReference.com, one of the oldest (created in 1995) and most respected web development sites, covering the Web and webmastery, from browsing to authoring and from HTML to advanced site design.
- [8] Regular Expressions Cookbook, Jan Goyvaerts and Steven Levithan, 510 pages, O'Reilly Media, ISBN-10 0596520689. Also available for the Kindle.