Regular expressions

From HaFrWiki

Regular expressions (RegExs for short) are very powerful for searching and matching text patterns. Because the syntax, and therefore the resulting expressions, are not easy to read or understand, regular expressions are both loved and hated. Knowledge of regular expressions and good tooling can make the results better and more understandable.

This page tries to accomplish that by giving examples, explanation, references, tooling and websites. Not all implementations of Regular Expressions have the same functionality. Always verify that the given examples are supported by your RegEx implementation.

But please keep in mind that regular expressions are not the solution to everything. If you try to do too much with a single regular expression, you may fall into this pitfall:

 Some people when confronted with a problem, think "I know, I'll use regular expressions."
 Now they have two problems.


Flavors

All modern regular expression flavors can trace their history back to the Perl programming language (Perl-style regular expressions).

  • Perl
  • Perl Compatible Regular Expressions (PCRE) is a C library developed by Philip Hazel [1].
  • .NET
  • Java. The first release of its regex package (java.util.regex) came with Java 1.4.
  • JavaScript
  • Python
  • Ruby

General

To start with RegEx, you need to know what regular expressions are. The following websites and references provide this information. Regular-Expressions.info [2] gives a very good explanation of regular expressions. The site also offers a download of the very useful program RegexBuddy [3]. There is also a free tool written in JavaScript by Steven Levithan [4].

Another good starting point for regular expressions is the description of the JavaScript implementation on W3Schools.com [5].

Looking for a regular expression but cannot find one? The library on RegExLib.com [6] offers a wide range of examples.

Want to know how to use Regex in a web environment? WebReference.com [7] offers an example of using Regex with JavaScript.

The Regular Expressions Cookbook [8], written by the author of RegexBuddy, gives even more explanation on this subject.

Elements

Regular expressions are built from basic elements, which are described below.

Anchors

Syntax | Example | Description
^ (caret) | ^. matches a in abc\ndef (a and d in multi-line mode). | Matches at the start of the string the regex pattern is applied to; in multi-line mode also after each line break. Matches a position rather than a character.
\A | \A. matches a in abc\ndef. | Same as the caret, but never matches after line breaks.
$ (dollar) | .$ matches f in abc\ndef (c and f in multi-line mode). | Matches at the end of the string the regex pattern is applied to; in multi-line mode also before each line break. Matches a position rather than a character.
\Z | .\Z matches f in abc\ndef. | Same as the dollar, but never matches before line breaks (most flavors still allow a single line break at the very end of the string).
\b | .\b matches c in abc. | Word boundary. Matches the position between a word character (anything matched by \w) and a non-word character.
\B | .\B matches a and b in abc. | Not a word boundary. Matches every position where \b does not, such as between two word characters.
\m | \m. matches a in abctest; .\m matches the space in 'test for'. | Start of word.
\M | \M. matches the space and the dot in 'test for.' | End of word.
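
The anchor behavior can be checked with a short script. This is a minimal sketch using Python's re module, chosen here only for illustration; other flavors may behave differently, and \m and \M are left out because Python does not support them.

```python
import re

text = "abc\ndef"

# Without flags, ^ and $ anchor only at the start and end of the whole string.
start_default = re.findall(r"^.", text)                  # ['a']
end_default = re.findall(r".$", text)                    # ['f']

# With re.MULTILINE they also match after and before each line break.
start_multiline = re.findall(r"^.", text, re.MULTILINE)  # ['a', 'd']
end_multiline = re.findall(r".$", text, re.MULTILINE)    # ['c', 'f']

# \A and \Z are not affected by the multi-line flag.
start_abs = re.findall(r"\A.", text, re.MULTILINE)       # ['a']
end_abs = re.findall(r".\Z", text, re.MULTILINE)         # ['f']

# \b is a word boundary, \B is any position that is not one.
boundary = re.findall(r".\b", "abc")                     # ['c']
non_boundary = re.findall(r".\B", "abc")                 # ['a', 'b']
```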

Assertions

Syntax | Example | Description
(?=regex) | t(?=s) matches the 2nd t in 'streets'. | Zero-width positive lookahead. Matches at a position where the pattern inside the lookahead can be matched. Matches only the position; it does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends. The example looks for a t that is followed by s.
(?!regex) | t(?!s) matches the 1st t in 'streets'. | Zero-width negative lookahead. Identical to positive lookahead, except that the overall match only succeeds if the regex inside the lookahead fails to match. The example looks for a t that is not followed by s.
(?<=text) | (?<=s)t matches the 1st t in 'streets'. | Zero-width positive look-behind. Matches at a position to the left of which text appears. Many flavors restrict what may appear inside a look-behind, for example to plain text or fixed-length patterns; some allow alternation of plain text options.
(?<!text) | (?<!s)t matches the 2nd t in 'streets'. | Zero-width negative look-behind. Matches at a position if text does not appear to the left of that position.
(?>regex) | (?>\d+) matches 5 and 00 in '$ 5.00'. | Atomic group (once-only subexpression). Once the group has matched, the engine does not backtrack into it. A possessive quantifier such as \d++ is shorthand for the same behavior.
(?(condition)then) | | Conditional [if then].
(?(condition)then|else) | | Conditional [if then else].
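
The four lookaround examples on 'streets' can be reproduced as follows. This is a sketch in Python; the helper name positions is made up for this illustration, and it reports the index of each matching t so that the 1st t (index 1) and the 2nd t (index 5) can be told apart.

```python
import re

word = "streets"

def positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

followed_by_s = positions(r"t(?=s)", word)       # [5], the 2nd t
not_followed_by_s = positions(r"t(?!s)", word)   # [1], the 1st t
preceded_by_s = positions(r"(?<=s)t", word)      # [1], the 1st t
not_preceded_by_s = positions(r"(?<!s)t", word)  # [5], the 2nd t

# Lookarounds are zero-width: the matched text is only the "t".
lookahead_match = re.search(r"t(?=s)", word).group()  # 't'
```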

Characters

Syntax | Example | Description
\c | \ce matches te in testing. | Any XML name character (XML Schema and XPath flavors).
\s | \sf matches "space"f in 'test for'. | Whitespace.
\S | \St matches st in testing. | Non-whitespace.
\d | \d matches each 9 in test99ing. | Digit.
\D | \D matches each character of test and ing in test99ing. | Non-digit.
\w | \w matches each character of test and for in 'test for.' | Word character.
\W | \W matches "space" and "dot" in 'test for.' | Non-word character.
\xhh | \x20 matches "space" in 'test for.' | Character with hexadecimal code hh.
\ooo | \040 matches "space" in 'test for.' | Character with octal code ooo.
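
A quick way to see what each character class picks up is to list the matches. The sketch below uses Python's re module as an example flavor; \c is omitted because it has a different meaning outside the XML Schema and XPath flavors.

```python
import re

digits = re.findall(r"\d", "test99ing")            # ['9', '9']
non_digit_runs = re.findall(r"\D+", "test99ing")   # ['test', 'ing']
space_then_f = re.findall(r"\sf", "test for")      # [' f']
non_space_then_t = re.findall(r"\St", "testing")   # ['st']
words = re.findall(r"\w+", "test for.")            # ['test', 'for']
non_words = re.findall(r"\W", "test for.")         # [' ', '.']
hex_space = re.findall(r"\x20", "test for.")       # [' ']
octal_space = re.findall(r"\040", "test for.")     # [' ']
```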

Iterators

Iteration quantifiers are metacharacters that are not regular expressions by themselves. Instead, they state how many iterations of the preceding expression there must be, or can be, in order to match. These quantifiers are *, + and ?, together with the brace forms {n} and {n,m}.

Syntax | Example | Description
* | test(\d)*ing matches testing, test9ing and test99ing in 'testing, test9ing, test99ing'. | Any number of occurrences (zero or more).
+ | test(\d)+ing matches test9ing and test99ing in 'testing, test9ing, test99ing'. | One or more.
? | test(\d)?ing matches testing and test9ing in 'testing, test9ing, test99ing'. | Zero or one.
{n} | test(\d){1}ing matches test9ing in 'testing, test9ing, test99ing'. | Exactly n times.
{n,m} | test(\d){2,5}ing matches test99ing in 'testing, test9ing, test99ing'. | Between n and m times (n, n+1, ..., m).

Without a quantifier, the preceding expression must occur exactly once.
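
The effect of each quantifier on the sample text can be listed with a few lines of Python. This is only a sketch; the parentheses around \d are dropped because Python's re.findall would otherwise return the group contents instead of the whole match.

```python
import re

text = "testing, test9ing, test99ing"

zero_or_more = re.findall(r"test\d*ing", text)     # ['testing', 'test9ing', 'test99ing']
one_or_more = re.findall(r"test\d+ing", text)      # ['test9ing', 'test99ing']
zero_or_one = re.findall(r"test\d?ing", text)      # ['testing', 'test9ing']
exactly_one = re.findall(r"test\d{1}ing", text)    # ['test9ing']
two_to_five = re.findall(r"test\d{2,5}ing", text)  # ['test99ing']
```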

Groups

Every time you create a group with (), you can reuse the matched text in the replacement. See the table below for examples.

Syntax | Example | Description
(regex) | (abc){3} matches abcabcabc. The 1st group matches abc. | Round brackets group the regex between them. They capture the text matched by the regex inside them so that it can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.
(?:regex) | (?:abc){3} matches abcabcabc. No group is created. | Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything and do not create backreferences.
\1 to \9 | (abc|def)=\1 matches abc=abc or def=def, but not abc=def. | Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences.
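
Capturing, non-capturing and backreference behavior can be sketched in Python as follows. The key=value string in the last step is a made-up sample, used only to show a group being reused in a replacement.

```python
import re

# A capturing group keeps the text of its last repetition.
m = re.fullmatch(r"(abc){3}", "abcabcabc")
captured = m.group(1)                                         # 'abc'

# A non-capturing group matches the same text but creates no group.
n = re.fullmatch(r"(?:abc){3}", "abcabcabc")
group_count = n.re.groups                                     # 0

# \1 must repeat exactly what group 1 matched.
same = re.fullmatch(r"(abc|def)=\1", "abc=abc") is not None   # True
mixed = re.fullmatch(r"(abc|def)=\1", "abc=def") is not None  # False

# In a replacement, \1 and \2 insert the captured text.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")       # 'value=key'
```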

Modifiers

Syntax | Example | Description
(?i) | te(?i)st matches teST but not TEST. | Turn on case insensitivity for the remainder of the regular expression.
(?-i) | | Turn off case insensitivity.
(?s) | | Turn on "dot matches newline" for the remainder of the regular expression. The s stands for 'single line' mode.
(?m) | (?m)^st matches st in 'te\nst'. | Caret and dollar match after and before newlines for the remainder of the regular expression.
(?x) | (?x)te st matches test. So does (?x)test # This will also match. | Turn on free-spacing mode to ignore whitespace between regex tokens, and allow # comments.
(?i-sm:regex) | | Combine options. Matches the regex inside the span with the option "i" turned on and the options "s" and "m" turned off.
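
How the modifiers behave depends on the flavor. The sketch below uses Python, where a modifier in the middle of a pattern has to be written in the scoped form (?i:...); recent Python versions reject a global flag such as (?i) anywhere but at the start. The sample strings are illustrative only.

```python
import re

# Scoped modifier: only "st" is matched case-insensitively.
scoped = re.compile(r"te(?i:st)")
matches_teST = scoped.fullmatch("teST") is not None   # True
matches_TEST = scoped.fullmatch("TEST") is not None   # False

# (?s) lets the dot match a line break.
dot_default = re.search(r"a.b", "a\nb") is not None   # False
dot_all = re.search(r"(?s)a.b", "a\nb") is not None   # True

# (?m) lets ^ and $ match at every line break.
line_starts = re.findall(r"(?m)^\w", "te\nst")        # ['t', 's']

# (?x) ignores whitespace in the pattern and allows # comments.
free_spacing = re.fullmatch(r"(?x) te st  # comment", "test") is not None  # True
```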

Quotation

Syntax | Example | Description
\ | \- means the literal '-' character. | Matches nothing by itself, but quotes the following character.
\Q | \Q....\E means the literal '....' characters. Same as \.\.\.\. | Matches nothing by itself, but quotes all characters until \E.
\E | | Matches nothing by itself, but ends the quoting started by \Q.
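
Not every flavor supports \Q...\E. Python's re module, for instance, does not; a comparable effect can be had with re.escape(), which backslash-escapes special characters. The following is a sketch of that substitute technique, not of \Q...\E itself, and the sample strings are made up.

```python
import re

literal = "...."

# re.escape() produces the same pattern as writing \.\.\.\. by hand.
escaped = re.escape(literal)                       # '\\.\\.\\.\\.'

# Unquoted, each dot matches any character.
unquoted_hits = re.findall(r"....", "ab.. ....")   # ['ab..', ' ...']

# Quoted, only four literal dots match.
quoted_hits = re.findall(escaped, "ab.. ....")     # ['....']

# A single backslash quotes the one character that follows it.
hyphen_hits = re.findall(r"\-", "a-b")             # ['-']
```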

Ranges

Syntax | Example | Description
. | . matches each of a, b and c in 'abc'. | Any character except a new line (\n). The character used is a dot.
(a|b) | (a|b) matches a and b in 'abc'. | a or b.
(...) | (ab) matches ab in 'abc'. | Group. The characters have to appear in the same sequence.
(?:...) | (?:ab) matches ab in 'abc'. | Passive group. Does not create a group for backreferences.
[abc] | [abc] matches c and a in 'Duplicate test'. | Set: a, b or c.
[^ste] | [^ste] matches r in 'streets'. | Not s, t or e.
[a-q] | [a-q] matches each e in 'streets'. | Range: letters between a and q.
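
The dot, alternation and bracket examples can be verified like this. It is a minimal Python sketch; the alternation is written without parentheses so that re.findall returns whole matches.

```python
import re

any_char = re.findall(r".", "abc")                # ['a', 'b', 'c']
a_or_b = re.findall(r"a|b", "abc")                # ['a', 'b']
sequence = re.findall(r"ab", "abc")               # ['ab']
in_set = re.findall(r"[abc]", "Duplicate test")   # ['c', 'a']
not_in_set = re.findall(r"[^ste]", "streets")     # ['r']
in_range = re.findall(r"[a-q]", "streets")        # ['e', 'e']
```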

CodeWright

Applications and programming languages do not all implement regular expressions in the same way, so usage can differ considerably. Below are a few examples using the CodeWright (CW) Search & Replace options.

Find

Regex | Meaning | Matches
([ ]+[0-9]+) | 1 or more spaces, 1 or more digits | [  123456 ]

Replace

Find Regex | Meaning | Replace Regex | Meaning
( )([0-9][0-9][0-9])( ) | Space, 3 digits, space, as in ...XXXX 123 YYYY... | \10\2\3 | The 1st group, an inserted zero (0), then the 2nd and 3rd groups, giving ...XXXX 0123 YYYY...

Note that in CW the more concise ( )([\d]{3})( ) does not work, because CW only has the iteration quantifiers *, + and ?.
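
For comparison, the same zero-padding replacement might look like this in Python. This is a sketch of an equivalent, not CodeWright syntax: Python supports {3}, and it reads \10 in a replacement as a reference to group 10, so the first group is written as \g<1>.

```python
import re

line = "...XXXX 123 YYYY..."

# \g<1>0 means "group 1, then a literal 0".
padded = re.sub(r"( )(\d{3})( )", r"\g<1>0\2\3", line)
# padded == '...XXXX 0123 YYYY...'
```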

Examples

XML

The following example is from the Regular Expressions Cookbook [8]. It matches all XML-style tags except <em> and <strong>, so that they can be removed from an XML or HTML page.

(?xm)                 # Free-spacing mode with comments (x); multi-line anchors (m)
< /?                  # Permit closing tag
(?!                   # Negative lookahead
   (?: em | strong)   #   List of tags to avoid match
   \b                 #   Word boundary avoids partial word matches
)
[a-z]                 # Tag name initial character must be a-z
(?: [^>"']            #   Any char except >, " or '
  | "[^"]*"           #   Double quoted attribute values
  | '[^']*'           #   Single quoted attribute values
)*
>

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
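
One way to apply this regex is to replace every match with an empty string. The sketch below does so in Python under a few assumptions: the flags are passed as re.VERBOSE and re.IGNORECASE instead of the inline (?xm), case-insensitive matching is added so that upper-case tag names are also caught, and the sample HTML string is invented for illustration.

```python
import re

TAG_EXCEPT_EM_STRONG = re.compile(r"""
    < /?                  # Permit closing tag
    (?!                   # Negative lookahead
       (?: em | strong )  #   List of tags to avoid matching
       \b                 #   Word boundary avoids partial word matches
    )
    [a-z]                 # Tag name initial character must be a-z
    (?: [^>"']            #   Any char except >, " or '
      | "[^"]*"           #   Double-quoted attribute values
      | '[^']*'           #   Single-quoted attribute values
    )*
    >
""", re.VERBOSE | re.IGNORECASE)

# Hypothetical input; note the > inside the quoted attribute value.
html = '<p class="a>b">Some <em>important</em> and <b>bold</b> text</p>'
stripped = TAG_EXCEPT_EM_STRONG.sub("", html)
# stripped == 'Some <em>important</em> and bold text'
```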

See also

Regex Desktop Testers

  1. RegexBuddy, one of the best programs for creating, maintaining and testing regular expressions [3].
  2. Expresso, a .NET application (not free).
  3. The Regulator, a SourceForge project: a .NET Regex tester.

Regex Online Test

  1. PHP Live Regex, a free online tester created by Philip Bjorge. Heavily inspired by Rubular (see below). Written in jQuery/JavaScript.
  2. Rubular, Michael Lovitt's minimalistic regex tester using Ruby 1.8.
  3. RegexPal, a JavaScript version. Free and online [4].
  4. Lars Olav Torvik's Regex tester for PHP PCRE, PHP POSIX and JavaScript.
  5. .NET Regex tester, built by David Seruyange in Microsoft .NET.
  6. Java Regex tester, Sergey Evdolimov's Java applet using Java 1.4. Also available as an Eclipse plugin and an IDEA plugin.
  7. ReAnimator, a fun tool by Oliver Steele that shows a regex graphically.

Grep

  1. PowerGrep, a g/re/p utility made by Jan Goyvaerts (not free).
  2. WinGrep, one of the oldest grep tools for Windows.
  3. Regex Renamer, more of a search and replace tool.
  4. Funduc SR, a search and replace tool for Windows.

Tools

Tutorial

Reference

  1. PCRE, Philip Hazel. The PCRE library is free, even for building commercial software.
  2. Regular-Expressions.info, the premier website about regular expressions, with tutorials, language examples, books and references, made by Jan Goyvaerts.
  3. RegexBuddy is really your perfect (software) companion for working with regular expressions. See the live demos. It also shows the capabilities of the different Regex implementations (Java, JavaScript, Perl and more).
  4. RegexPal, written by Steven Levithan in JavaScript. The only thing you need is a web browser.
  5. W3Schools, full web building tutorials, free web tutorials.
  6. RegExLib.com, a regular expressions library with descriptions of the expressions used. Also a RegEx tester.
  7. WebReference.com, one of the oldest (created in 1995) and most respected web development sites. WebReference.com is all about the Web and Webmastery. From browsing to authoring, HTML to advanced site design, we'll keep you informed.
  8. Regular Expressions Cookbook, Jan Goyvaerts and Steven Levithan, 510 pages, O'Reilly Media, ISBN-10 0596520689. Also available for the Kindle.