Regular expressions


Regular expressions, or RegExs for short, are very powerful for searching and matching text patterns. Because the syntax, and therefore the resulting expressions, are not easy to read or understand, Regular Expressions are both loved and hated. Knowledge of Regular Expressions and good tooling can make the results better and more understandable.

This page tries to accomplish that by giving examples, explanation, references, tooling and websites. Not all implementations of Regular Expressions have the same functionality. Always verify that the given examples are supported by your RegEx implementation.

But please keep in mind that regular expressions are not the solution to everything. If you try to do too much with just one regular expression, you might fall into this trap:

 Some people when confronted with a problem, think "I know, I'll use regular expressions."
 Now they have two problems.


Flavors

Most modern regular expression flavors can trace their history back to the Perl programming language (Perl-style regular expressions).

  • AWK [1] is one of the early tools that embedded Regular Expressions and uses POSIX Extended Regular Expressions (ERE).
  • Perl
  • Perl Compatible Regular Expressions (PCRE) is a C library developed by Philip Hazel [2].
  • .NET
  • Java. The first release with built-in Regex support (the java.util.regex package) was Java 1.4.
  • JavaScript
  • Python
  • Ruby

General

To start with RegEx, the user needs to know what regular expressions are.
The following websites and references contain this information:

  • Regular-Expressions.info [3] gives a very good explanation of Regular Expressions.
    The site also offers a download of the very useful program RegexBuddy [4]. There is also a free tool written in JavaScript by Steven Levithan [5].
  • Another good starting point for Regular Expressions is the description of the JavaScript implementation on W3schools.com [6].
  • Looking for a regular expression but cannot find one? The library on RegExLib.com [7] offers a wide range of examples.
  • Want to know how to use Regex in a web environment? WebReference.com [8] offers an example of using Regex with JavaScript.
  • The Regular Expressions Cookbook [9], co-written by the author of RegexBuddy, gives even more explanation on this subject.
  • The website regex101 is an online tool for checking and testing your regular expressions. It supports a range of regex flavors (but not Bash).

Elements

Regular expressions are built from a small set of elements. The most important ones are listed below.

Anchors

Syntax Example Description
^ (caret)  ^. matches a and d in abc\ndef when multi-line mode is on. Start of line. Matches at the start of the string the regex pattern is applied to and, in multi-line mode, also after each line break. Matches a position rather than a character.
\A  \A. matches a in abc\ndef. Same as the caret but never matches after line breaks.
$ (dollar)  .$ matches c and f in abc\ndef when multi-line mode is on. End of line. Matches at the end of the string the regex pattern is applied to and, in multi-line mode, also before each line break. Matches a position rather than a character.
\Z  .\Z matches f in abc\ndef. Same as the dollar but never matches before line breaks inside the string.
\b  .\b matches c in abc. Word boundary. Matches the position between a word character (anything matched by \w) and a non-word character.
\B  \B.\B matches b in abc. Not a word boundary. Matches at every position where \b does not match, for example between two word characters.
\m  \m. matches a in abctest and .\m matches the space in 'test for'. Start of word (Tcl-style syntax that many flavors do not support).
\M  \M. matches the space and the dot in 'test for.' End of word (Tcl-style syntax that many flavors do not support).
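
The anchor rows above can be checked with a short script. This is only a sketch, assuming Python's standard `re` module (the page itself does not prescribe a language). Python has no `\m` or `\M`, so those two rows are left out.

```python
import re

text = "abc\ndef"

# Without multi-line mode, ^ only anchors at the very start of the string.
caret_default = re.findall(r"^.", text)
# With re.MULTILINE, ^ and $ also anchor after and before each line break.
caret_multiline = re.findall(r"^.", text, re.MULTILINE)
dollar_multiline = re.findall(r".$", text, re.MULTILINE)

# \A and \Z are not affected by multi-line mode.
start_only = re.findall(r"\A.", text, re.MULTILINE)
end_only = re.findall(r".\Z", text, re.MULTILINE)

# Word boundaries: \b at the end of 'abc', \B on both sides of 'b'.
boundary = re.findall(r".\b", "abc")
no_boundary = re.findall(r"\B.\B", "abc")

print(caret_default, caret_multiline, dollar_multiline)
print(start_only, end_only, boundary, no_boundary)
```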

Assertions

Syntax Example Description
(?=regex)  t(?=s) matches the 2nd t in 'streets'. Zero-width positive lookahead matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends. In the example it looks for a t that is followed by an s.
(?!regex)  t(?!s) matches the 1st t in 'streets'. Zero-width negative lookahead is identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match. In the example it looks for a t that is not followed by an s.
(?<=text)  (?<=s)t matches the 1st t in 'streets'. Zero-width positive look-behind matches at a position to the left of which text appears. Because many regex engines cannot apply a regex backwards, the test inside the look-behind is often restricted to plain text. Some regex flavors allow alternation of plain text options in the look-behind.
(?<!text)  (?<!s)t matches the 2nd t in 'streets'. Zero-width negative look-behind matches at a position if the text does not appear to the left of that position.
(?>regex)  (?>\d+) matches 5 and 00 in '$ 5.00'. Atomic group, also known as a once-only subexpression. Once the group has matched, the engine does not backtrack into it. This is closely related to possessive quantifiers.
(?(condition)then) Conditional [if then]
(?(condition)then|else) Conditional [if then else]
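
To see where the four lookaround examples match in 'streets', the sketch below prints match positions instead of matched text. It assumes Python's `re` module. The helper name `positions` and the bracket-matching pattern used for the conditional are illustrations, not part of the table.

```python
import re

word = "streets"  # indexes: s=0 t=1 r=2 e=3 e=4 t=5 s=6


def positions(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]


lookahead_pos = positions(r"t(?=s)", word)    # t followed by s
lookahead_neg = positions(r"t(?!s)", word)    # t not followed by s
lookbehind_pos = positions(r"(?<=s)t", word)  # t preceded by s
lookbehind_neg = positions(r"(?<!s)t", word)  # t not preceded by s

# Conditional: require a closing bracket only if an opening one was matched.
bracketed_number = re.compile(r"(\()?\d+(?(1)\))")
balanced = bracketed_number.fullmatch("(42)") is not None
plain = bracketed_number.fullmatch("42") is not None
unbalanced = bracketed_number.fullmatch("(42") is not None

print(lookahead_pos, lookahead_neg, lookbehind_pos, lookbehind_neg)
print(balanced, plain, unbalanced)
```

Atomic groups are left out of the sketch because older Python versions do not accept the `(?>...)` syntax.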

Characters

Syntax Example Description
\c  \ce matches te in testing. Matches any character that is allowed in an XML name (XPath and XML Schema flavors only).
\s  \sf matches "space"f in 'test for'. White space.
\S  \St matches st in testing (a non-whitespace character followed by the 2nd t). Non white space.
\d  \d matches each 9 in test99ing. Digit.
\D  \D matches each letter of test and ing in test99ing. Not digit.
\w  \w matches each letter of test and for in 'test for.' Word character.
\W  \W matches "space" and "dot" in 'test for.' Not a word character.
\xhh  \x20 matches "space" in 'test for.' Hexadecimal character hh.
\0xx  \040 matches "space" in 'test for.' Octal character (octal 40 is the space character).
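
The shorthand classes can be tried out as follows. This is a minimal sketch assuming Python's `re` module. The `+` quantifier is added to `\D` and `\w` on purpose, to show the difference between matching one character and matching a run of characters. `\c` is not available in Python and is omitted.

```python
import re

digits = re.findall(r"\d", "test99ing")          # one match per digit
non_digit_runs = re.findall(r"\D+", "test99ing")  # runs of non-digits
words = re.findall(r"\w+", "test for.")           # runs of word characters
non_words = re.findall(r"\W", "test for.")        # space and dot
space_f = re.findall(r"\sf", "test for")          # whitespace followed by f
non_space_t = re.findall(r"\St", "testing")       # non-whitespace followed by t
hex_space = re.findall(r"\x20", "test for.")      # hexadecimal 20 = space
octal_space = re.findall(r"\040", "test for.")    # octal 40 = space

print(digits, non_digit_runs, words, non_words)
print(space_f, non_space_t, hex_space, octal_space)
```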

Iterators

Iteration qualifiers (more commonly called quantifiers) are metacharacters that are not regular expressions by themselves.
Instead, they state how many iterations of the preceding expression there must be or can be in order to match.
These metacharacters are *, + and ?, together with the brace forms {n} and {n,m}.

Syntax Example Description
*  test(\d)*ing matches testing, test9ing, test99ing in 'testing, test9ing, test99ing'. Any number of occurrences (zero or more).
+  test(\d)+ing matches test9ing, test99ing in 'testing, test9ing, test99ing'. One or more.
?  test(\d)?ing matches testing, test9ing in 'testing, test9ing, test99ing'. Zero or one.
{n}  test(\d){1}ing matches test9ing in 'testing, test9ing, test99ing'. Exactly n times.
{n,m}  test(\d){2,5}ing matches test99ing in 'testing, test9ing, test99ing'. Between n and m times (n, n+1, ..., m).

Without these qualifiers, the preceding expression must occur exactly once at that point in the match.
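
The quantifier rows can be compared side by side on the same sample text. The sketch below assumes Python's `re` module. The capturing group around `\d` is dropped so that `findall` returns the whole match rather than the group content.

```python
import re

text = "testing, test9ing, test99ing"

# One pattern per quantifier from the table.
patterns = {
    "*": r"test\d*ing",
    "+": r"test\d+ing",
    "?": r"test\d?ing",
    "{1}": r"test\d{1}ing",
    "{2,5}": r"test\d{2,5}ing",
}

results = {name: re.findall(pattern, text) for name, pattern in patterns.items()}

for name, found in results.items():
    print(f"{name:6} -> {found}")
```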

Groups

Every time you create a group with (), you can reuse the text it matched, either later in the same expression or in the replacement. See the table below for examples.

Syntax Example Description
(regex)  (abc){3} matches abcabcabc. The 1st group matches abc. Round brackets group the regex between them. They capture the text matched by the regex inside them so that it can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.
(?:regex)  (?:abc){3} matches abcabcabc. Non-capturing parentheses group the regex so you can apply regex operators, but they do not capture anything and do not create backreferences.
\1 to \9  (abc|def)=\1 matches abc=abc or def=def, but not abc=def. Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences.
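
A short sketch of capturing, non-capturing and backreferenced groups, assuming Python's `re` module. The `key=value` swap at the end is an added illustration of reusing groups in a replacement. It is not taken from the table.

```python
import re

# Capturing group repeated three times: group 1 holds the last repetition.
repeated = re.fullmatch(r"(abc){3}", "abcabcabc")
captured = repeated.group(1)

# Non-capturing group: same match, but no group is created.
non_capturing = re.fullmatch(r"(?:abc){3}", "abcabcabc")
group_count = len(non_capturing.groups())

# Backreference inside the pattern: both sides must be the same text.
same_both_sides = re.compile(r"(abc|def)=\1")
equal_sides = same_both_sides.fullmatch("abc=abc") is not None
mixed_sides = same_both_sides.fullmatch("abc=def") is not None

# Groups reused in the replacement text.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

print(captured, group_count, equal_sides, mixed_sides, swapped)
```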

Modifiers

Syntax Example Description
(?i)  te(?i)st matches teST but not TEST. Turn on case insensitivity for the remainder of the regular expression.
(?-i) Turn off case insensitivity for the remainder of the regular expression.
(?s) Turn on "dot matches newline" for the remainder of the regular expression. The s stands for 'single line' mode.
(?m)  (?m)^st$ matches st in 'te\nst'. Caret and dollar match after and before newlines for the remainder of the regular expression.
(?x)  (?x)te st matches test, and so does (?x)te st # This will also match. Turn on free-spacing mode to ignore whitespace between regex tokens, and allow # comments.
(?i-sm:regex) Combine options. Matches the regex inside the span with the option "i" turned on, and the options "s" and "m" turned off.
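
The modifiers behave as sketched below in Python's `re` module, which is an assumption of this example. Recent Python versions only accept a bare inline flag such as `(?i)` at the very start of a pattern, so the mid-pattern example from the table is written with the scoped form `(?i:...)` instead.

```python
import re

# Case insensitivity limited to the 'st' part of the pattern.
mixed_case = re.fullmatch(r"te(?i:st)", "teST") is not None
all_upper = re.fullmatch(r"te(?i:st)", "TEST") is not None

# (?s): the dot also matches a newline.
dot_default = re.findall(r"a.b", "a\nb")
dot_all = re.findall(r"(?s)a.b", "a\nb")

# (?m): ^ and $ match at line breaks.
multi_line = re.findall(r"(?m)^st$", "te\nst")

# (?x): whitespace and # comments inside the pattern are ignored.
free_spacing = re.fullmatch(r"(?x)te st  # this comment is ignored", "test") is not None

# Scoped options: i turned on and s turned off inside the group only.
scoped = re.findall(r"(?i-s:a.b)", "A\nB AxB", re.DOTALL)

print(mixed_case, all_upper, dot_default, dot_all)
print(multi_line, free_spacing, scoped)
```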

Quotation

Syntax Example Description
\  \- means the literal '-' character. Matches nothing by itself, but quotes the following character.
\Q  \Q....\E means the literal characters '....', the same as \.\.\.\. Matches nothing by itself, but quotes all characters until \E.
\E Matches nothing by itself, but ends the quoting started by \Q.
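
Not every flavor supports `\Q...\E`. Python's `re` module, for example, does not, but its `re.escape` function produces the same backslash-quoted form. The sketch below assumes Python, and the sample string `a....b.c` is made up for the illustration.

```python
import re

literal = "...."

# re.escape quotes every special character, giving \.\.\.\.
escaped = re.escape(literal)

sample = "a....b.c"
quoted_matches = re.findall(escaped, sample)    # only four literal dots
unquoted_matches = re.findall(literal, sample)  # any four characters

# A single backslash quotes just the next character.
dash = re.findall(r"\-", "a-b")

print(escaped, quoted_matches, unquoted_matches, dash)
```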

Ranges

Syntax Example Description
.  . matches a, b and c in 'abc'. Any character (the used character is a dot) except new line (\n).
(a|b)  (a|b) matches a and b in 'abc'. a or b.
(...)  (ab) matches ab in 'abc'. Group. The characters have to appear in the same sequence.
(?:...)  (?:ab) matches ab in 'abc'. Passive group, does not create groups for back references.
[abc]  [abc] matches c and a in 'Duplicate test'. Character class: a, b or c.
[^ste]  [^ste] matches r in 'streets'. Not s, t or e.
[a-q]  [a-q] matches both e's in 'streets'. Letters between a and q.
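
Each row matches one character at a time, which is easy to see when all matches are listed. A minimal sketch, assuming Python's `re` module:

```python
import re

any_char = re.findall(r".", "abc")
dot_skips_newline = re.findall(r".", "a\nb")
alternation = re.findall(r"a|b", "abc")
char_class = re.findall(r"[abc]", "Duplicate test")
negated_class = re.findall(r"[^ste]", "streets")
letter_range = re.findall(r"[a-q]", "streets")

print(any_char, dot_skips_newline, alternation)
print(char_class, negated_class, letter_range)
```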

Lookaround

Syntax Example Description
?= bar(?=bar) Positive lookahead finds the 1st bar ("bar" which has "bar" after it)
?! bar(?!bar) Negative lookahead finds the 2nd bar ("bar" which does not have "bar" after it)
?<= (?<=foo)bar Positive lookbehind finds the 1st bar ("bar" which has "foo" before it)
?<! (?<!foo)bar Negative lookbehind finds the 2nd bar ("bar" which does not have "foo" before it)

Note: In the examples the text to be examined is foobarbarfoo.

Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions. Negative lookahead is indispensable if you want to match something not followed by something else. Unfortunately, lookaround is not supported by every regex implementation:

  • AWK does not support look-ahead or look-behind, since it uses POSIX Extended Regular Expression (ERE).
  • Bash script regex does not support "lookaround" at all.
  • JavaScript originally supported lookahead only. Lookbehind was added in ECMAScript 2018, so older engines may still lack it.
  • PCRE (and PCRE2) is not fully Perl-compatible when it comes to lookbehind.

There are many differences between flavors, but PCRE is the most common point of reference.
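
Because support differs per implementation, it can help to test a pattern before relying on it. The sketch below assumes Python's `re` module. The helper name `is_supported` is hypothetical. Python accepts the four lookaround forms from the table, but rejects a lookbehind whose length is not fixed.

```python
import re

text = "foobarbarfoo"  # 'bar' starts at index 3 and at index 6


def is_supported(pattern):
    """Return True if this regex implementation can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def positions(pattern):
    """Return the start index of every match in the sample text."""
    return [m.start() for m in re.finditer(pattern, text)]


first_by_lookahead = positions(r"bar(?=bar)")
second_by_lookahead = positions(r"bar(?!bar)")
first_by_lookbehind = positions(r"(?<=foo)bar")
second_by_lookbehind = positions(r"(?<!foo)bar")

fixed_width_ok = is_supported(r"(?<=foo)bar")
variable_width_ok = is_supported(r"(?<=fo+)bar")  # rejected by Python's re

print(first_by_lookahead, second_by_lookahead)
print(first_by_lookbehind, second_by_lookbehind)
print(fixed_width_ok, variable_width_ok)
```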

CodeWright

The implementation of regular expressions differs between applications and programming languages, and can therefore be very different to use. Below are a few examples using the CodeWright Search & Replace options.

Find

Regex Meaning Matches
([ ]+[0-9]+) 1 or more Spaces, 1 or more digits [  123456 ]

Replace

Find Regex Meaning Replace Regex Meaning
( )([0-9][0-9][0-9])( )  Space, 3 digits, space. Example input: ...XXXX 123 YYYY...  \10\2\3  Replace with the 1st group, an inserted zero (0), and the 2nd and 3rd groups. Result: ...XXXX 0123 YYYY...

Please note that in CW the more compact form ( )([\d]{3})( ) does not work, because CW only has the iteration qualifiers *, + and ?.
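
For comparison, the same search and replace can be sketched in Python's `re` module (an assumption, since the example above is CodeWright-specific). Python does support `{3}`, but it would read `\10` as a reference to group 10, so the first group is written as `\g<1>` before the literal zero.

```python
import re

line = "..XXXX 123 YYYY..."

# Space, exactly three digits, space: three capturing groups.
find = r"( )([0-9]{3})( )"

# \g<1> is group 1; the following 0 is a literal zero, then groups 2 and 3.
replace = r"\g<1>0\2\3"

result = re.sub(find, replace, line)
print(result)
```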

Examples

XML

The following example is from the Regular Expressions Cookbook [9]. It matches all XML-style tags except <em> and <strong>, so that they can be removed from an XML or HTML page.

(?xm)                 # Permits comments and multiple lines
< /?                  # Permit closing tag
(?!                   # Negative lookahead
   (?: em | strong)   #   List of tags to avoid match
   \b                 #   Word boundary avoids partial word matches
)
[a-z]                 # Tag name initial character must be a-z
(?: [^>"']            #   Any char except >, " or '
  | "[^"]*"           #   Double quoted attribute values
  | '[^']*'           #   Single quoted attribute values
)*
>

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
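
As a usage sketch, the pattern can be applied with Python's `re.sub`. Two things here are assumptions of this example rather than part of the original: the flags are passed as arguments (`re.VERBOSE` for free-spacing, `re.IGNORECASE` so that `[a-z]` also accepts upper-case tag names) instead of inline, and the sample HTML string is made up.

```python
import re

TAGS_EXCEPT_EM_STRONG = re.compile(
    r"""
    < /?                  # Opening or closing tag
    (?!                   # Negative lookahead
       (?: em | strong)   #   List of tags to avoid match
       \b                 #   Word boundary avoids partial word matches
    )
    [a-z]                 # Tag name initial character must be a-z
    (?: [^>"']            #   Any char except >, " or '
      | "[^"]*"           #   Double quoted attribute values
      | '[^']*'           #   Single quoted attribute values
    )*
    >
    """,
    re.VERBOSE | re.IGNORECASE,
)

sample = '<p class="a>b">Hello <em>big</em> <strong>world</strong></p><embed>'

# Replace every matched tag with nothing; <em> and <strong> are left alone.
cleaned = TAGS_EXCEPT_EM_STRONG.sub("", sample)
print(cleaned)
```

Note that `<embed>` is removed even though it starts with "em", which is what the `\b` inside the lookahead is for.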

PCRE

PCRE is short for Perl Compatible Regular Expressions.
It is the name of an open source library written in C by Philip Hazel.
The library is compatible with a great number of C compilers and operating systems.
Many people have derived libraries from PCRE to make it compatible with other programming languages.
The regex features included with PHP, Delphi, R, and Xojo (REALbasic) are all based on PCRE.
The library is also included with many Linux distributions as a shared .so library and a .h header file.

Though PCRE claims to be Perl-compatible, there are more than enough differences between contemporary versions of Perl and PCRE to consider them distinct regex flavors.
Recent versions of Perl have even copied features from PCRE that PCRE had copied from other programming languages before Perl had them, in an attempt to make Perl more PCRE-compatible.
Today PCRE is used more widely than Perl because PCRE is part of so many libraries and applications.

Philip Hazel has since released a new library called PCRE2.
The first PCRE2 release was given version number 10.00 to make a clear break with the previous PCRE 8.36.
Future PCRE releases will be limited to bug fixes. New features will go into PCRE2 only.
If you're taking on a new development project, you should consider using PCRE2 instead of PCRE.
But for existing projects that already use PCRE, it's probably best to stick with PCRE. Moving from PCRE to PCRE2 requires significant changes to your source code (but not to your regular expressions).
You can find more information about PCRE and PCRE2 at http://www.pcre.org.


See also


Regex Desktop Testers

  1. RegexBuddy, one of the best programs for creating, maintaining and testing Regular Expressions [4].
  2. Expresso, .NET application (not free).
  3. The Regulator, SourceForge project, .NET Regex tester.

Regex Online Test

  1. Regex101, my favorite Regex tester.
  2. PHP Live Regex, free and online, created by Philip Bjorge. Heavily inspired by Rubular (see below). Written in jQuery/JavaScript.
  3. Rubular, Michael Lovitt's minimalistic regex tester using Ruby 1.8.
  4. RegexPal, JavaScript version. Free and online [5].
  5. Lars Olav Torvik's Regex tester for PHP PCRE, PHP POSIX and JavaScript.
  6. .NET Regex tester, built by David Seruyange in Microsoft .NET.
  7. Java Regex tester, Sergey Evdolimov's Java Applet using Java 1.4. Also available as Eclipse Plugin and IDEA Plugin.
  8. ReAnimator, a fun tool by Oliver Steele that shows a Regex graphically.
  9. Extends Class, an online tool that allows you to test regular expressions in JavaScript.

Grep

  1. PowerGrep, g/re/p utility made by Jan Goyvaerts (not free).
  2. WinGrep, one of the oldest grep tools for Windows.
  3. Regex Renamer, more of a search and replace tool.
  4. Funduc SR, Search and Replace tool for Windows.

Reference


  1. AWK was created at Bell Labs in the 1970s. Its name is derived from the surnames of its authors: Alfred Aho, Peter Weinberger and Brian Kernighan.
  2. PCRE, Philip Hazel. The PCRE library is free, even for building commercial software.
  3. Regular-Expressions.info, the premier website about Regular Expressions, Tutorials, Language examples, books, and references made by Jan Goyvaerts
  4. RegexBuddy is really your perfect (software) companion for working with regular expressions.
    See the live demos. It also shows the capabilities of the different Regex implementations (Java, JavaScript, Perl and more).
  5. RegexPal, written by Steven Levithan in JavaScript. The only thing you need is a web browser.
  6. W3Schools, Full Web Building Tutorials, Free webtutorials
  7. RegExLib.com, Regular Expressions Library with descriptions of the expressions used. Also a RegEx tester.
  8. Webreference.com, One of the oldest (created in 1995) and most respected Web development sites, WebReference.com is all about the Web and Webmastery. From browsing to authoring, HTML to advanced site design, we'll keep you informed.
  9. Regular Expressions Cookbook, Jan Goyvaerts and Steven Levithan, 510 pages, O'Reilly Media, ISBN-10 0596520689. Also available for the Kindle.