Regular expressions

A regular expression (often abbreviated "regex" or "regexp") is a description of a pattern of text. Commands and programs can use these patterns to search within text for matches, that is, strings that follow the given pattern.

The simplest patterns that regular expressions describe are literal characters or words - essentially substrings of the text. This is a bit like using Ctrl+F in your browser to search the text of a web site for a particular word. However, these patterns are also narrow, matching only a few specific strings.

The real power of regular expressions is their ability to match all the possible variations of a pattern without having to list them individually. Consider searching a large text file for information like:

Any ten-digit North American telephone number (with arbitrary digits)
A valid VIN (the unique, 17-character identifier found on motor vehicles)
A syntactically complete C++ statement

An exhaustive list of possibilities for such patterns is impossible or impractical, but a regular expression can be used to describe the possibilities in a manageably concise format.

Image source: xkcd: Perl Problems by Randall Munroe (image link)

Regular expressions are a sort of language unto themselves and you can compose difficult, convoluted regexes that match with oddly-shaped data sets. These situations are the exception, not the rule. You can solve many everyday problems by writing simple regexes that can be used with many Linux command-line tools, including the vim text editor's search function; the grep search and sed stream editing commands; and the awk text processing language.

Regular expressions were introduced in the 1950s as a mathematical tool in the theory of computation, which studies questions like, "what questions can a computer answer?"

The set of strings a regular expression matches is called a regular language. The regular languages are exactly those recognized by a deterministic finite automaton, a (simplified) model of a computer.

The regular languages are not the most general! Non-regular languages include, e.g., strings of the form "0...1..." where the number of zeros equals the number of ones - no regular expression captures all such strings.

Matching a single character

The building blocks of regular expressions are patterns that match a single character. There are options to match a specific character, a character from a set, or to match any single character.

Matching a single character literal

The pattern matching most single characters is just the character itself. For example, the pattern "a" matches the lower-case 'a'. This makes patterns matching "real words" easy to read and understand.

There are some characters that have a special meaning (other than themselves) when they appear in a regular expression. If you want to match the literal character instead of the special meaning, you escape the character using a backslash. One example is the backslash itself; you use the pattern "\\" to match a literal backslash in the text.

Image source: xkcd: backslashes by Randall Munroe (image link)

The period ('.') has the special meaning of matching any single character. For example, "R.y" matches both "Ray" and "Roy" (and "Rqy" and "R!y" and ...). You use an escaped period ("\.") to match a literal period.
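As a rough illustration of the period and of escaping, here is a minimal sketch using Python's re module. Python is an assumption here (this page targets POSIX tools such as grep), but it treats '.' and backslash escapes the same way for these simple cases. The sample strings are made up for the example.

```python
import re

# "." matches any single character (except a newline, by default).
any_char = re.compile(r"R.y")
# "\." matches only a literal period.
literal_dot = re.compile(r"R\.y")
# "\\" matches a single literal backslash.
backslash = re.compile(r"\\")

names = ["Ray", "Roy", "R!y", "R.y", "Ry"]

# fullmatch() requires the whole string to fit the pattern.
any_matches = [s for s in names if any_char.fullmatch(s)]
dot_matches = [s for s in names if literal_dot.fullmatch(s)]

print(any_matches)                       # ['Ray', 'Roy', 'R!y', 'R.y']
print(dot_matches)                       # ['R.y']
print(bool(backslash.fullmatch("\\")))   # True
```

Note that "Ry" matches neither pattern: the period must consume exactly one character.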

Bracket expressions

A bracket expression matches a single character from a set of options listed inside. For example, "[ABCDF]" would match a single A, B, C, D, or F, and "[13579]" would match a single odd digit. You can abbreviate a set of characters using a range, written with a dash between the first and last character. For example, "[A-DF]" matches a single character "between A and D" or F. If the first character in a bracket expression is a caret ('^'), it "flips" the expression and matches anything not in the list. For example, "[^2468]" matches any single character other than 2, 4, 6, or 8: the odd digits, but also 0, letters, and punctuation.

Because of their special meanings in bracket expressions, ], -, and ^ come with extra rules. If they are part of a bracket expression, you must list ] first, - last, and ^ anywhere but first.
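The bracket-expression rules can be tried out with a short sketch. It assumes Python's re module rather than a POSIX tool; Python happens to accept the same placement conventions for ], ^, and - inside brackets. The test strings are invented for the example.

```python
import re

grade = re.compile(r"[A-DF]")      # same set as [ABCDF]
not_even = re.compile(r"[^2468]")  # any character except 2, 4, 6, or 8
specials = re.compile(r"[]^-]")    # ] listed first, ^ not first, - last

# Test each character of a sample string on its own.
grades = [c for c in "ABCDEF" if grade.fullmatch(c)]
others = [c for c in "0123x" if not_even.fullmatch(c)]
found = [c for c in "a]b^c-d" if specials.fullmatch(c)]

print(grades)  # ['A', 'B', 'C', 'D', 'F']
print(others)  # ['0', '1', '3', 'x']
print(found)   # [']', '^', '-']
```

The second result shows that a negated set is broader than it may look: "[^2468]" accepts 0 and the letter x as well as the odd digits.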

There are some classes of characters that have "nicknames" you can use inside bracket expressions, such as [:alpha:] for alphabetic characters. The brackets are part of the name, so you use "[[:alpha:]]" as a shorthand for "[A-Za-z]". A complete list of named ranges appears, e.g., in the grep man pages.

Lastly, the escaped character \w is a synonym for the useful set "[_[:alnum:]]". These are exactly the characters allowed in a C language identifier. Its opposite is \W, a synonym for "[^_[:alnum:]]".
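A small sketch of \w and \W, again assuming Python's re module. Two caveats: Python does not support named classes such as [:alnum:], so the set is spelled out as "[_A-Za-z0-9]", and Python's \w also matches non-ASCII letters unless the re.ASCII flag is given. The sample string is made up.

```python
import re

# re.ASCII restricts \w to letters, digits, and underscore.
word_char = re.compile(r"\w", re.ASCII)
non_word = re.compile(r"\W", re.ASCII)
# Explicit spelling of the same set (stand-in for "[_[:alnum:]]").
explicit = re.compile(r"[_A-Za-z0-9]")

sample = "x_1- !"

word_chars = [c for c in sample if word_char.fullmatch(c)]
non_word_chars = [c for c in sample if non_word.fullmatch(c)]
# The shorthand and the explicit set agree on every sample character.
agree = all(
    bool(word_char.fullmatch(c)) == bool(explicit.fullmatch(c))
    for c in sample
)

print(word_chars)      # ['x', '_', '1']
print(non_word_chars)  # ['-', ' ', '!']
print(agree)           # True
```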

Almost every modern programming language and environment provides a regular expression engine which interprets and uses regexes to search. They all provide roughly the same functionality, but (annoyingly) can differ slightly in syntax.

Linux command-line programs typically understand both POSIX regex syntaxes: basic (BRE) and extended (ERE). This page uses EREs. A tricky difference between the two: in BREs, metacharacters like ? lose their special meaning, and the backslashed version \? takes it instead - exactly the opposite of EREs.

Combining and repeating patterns

You can combine single-character regular expressions (or any other regexes) using a few simple rules. Often, there will be more than one way to write a pattern matching the text you care about.

Matching "first A, then B"

You match two regular expressions in a row by typing them one after the other. The simplest example of this is matching literal words. For example, the pattern "Ray" matches "an 'R' followed by an 'a' followed by a 'y'." You can join bracket expressions as well; "R[ao]y" matches both "Ray" and "Roy" (but not "Rqy").

Anchor symbols match specific locations in the text rather than specific characters.

'^' and '$' match the beginning and end of a line
\< and \> match the beginning and end of a word

For example, "\<rain" matches both "rain" and "raining" but not "train".
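The anchors can be sketched in Python with one substitution: Python's re module does not support \< and \>, so this example assumes \b, which matches at either edge of a word, as the nearest equivalent. The sample words and lines are invented.

```python
import re

# \b stands in for \< here: "rain" must start at a word boundary.
starts_word = re.compile(r"\brain")
words = ["rain", "raining", "train"]
hits = [w for w in words if starts_word.search(w)]

# ^ and $ pin the pattern to the start and end of the line.
whole_line = re.compile(r"^rain$")
lines = ["rain", "rain today", "the rain"]
exact = [line for line in lines if whole_line.search(line)]

print(hits)   # ['rain', 'raining']
print(exact)  # ['rain']
```

search() looks for the pattern anywhere in the string, so the anchors alone decide where a match may begin or end.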

Matching "A or B"

You use the pipe character '|' between two regular expressions to mean "or". Text matching either pattern matches this combined expression. For example, "Ray|Roy" matches both "Ray" and "Roy".
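As noted earlier, there is often more than one way to write a pattern. The sketch below, assuming Python's re module, compares the alternation "Ray|Roy" with the bracket-expression form "R[ao]y" on a few made-up strings.

```python
import re

either = re.compile(r"Ray|Roy")   # alternation: one whole word or the other
bracket = re.compile(r"R[ao]y")   # concatenation with a bracket expression

candidates = ["Ray", "Roy", "Rqy"]

by_pipe = [s for s in candidates if either.fullmatch(s)]
by_bracket = [s for s in candidates if bracket.fullmatch(s)]

print(by_pipe)     # ['Ray', 'Roy']
print(by_bracket)  # ['Ray', 'Roy']
```

The two forms agree here because the alternatives differ in a single character; alternation is the more general tool when the options differ in length or in several positions.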

Matching "A more than once"

You repeat a regular expression by writing the pattern followed by a description of how many times it can repeat. Allowing a pattern to "repeat zero times" makes the pattern optional.

'?' means "repeat zero or one time." For example, "Ra?y" matches "Ry" and "Ray" (but not "Raay")
'*' means "repeat zero or more times." For example, "Ra*y" matches "Ry", "Ray", and "Raay" (and "Raaay", and ...)
'+' means "repeat one or more times." For example, "Ra+y" matches everything "Ra*y" matches except "Ry"
Curly braces let you specify a number of repetitions or a range of repetitions.
- {n} means "repeat exactly n times"
- {n,m} means "repeat at least n but no more than m times"
- {n,} means "repeat at least n times" (no upper limit)
- {,m} means "repeat at most m (and possibly zero) times"
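The repetition operators above can be compared side by side. This is a minimal sketch assuming Python's re module, which accepts all of these forms, including {,m}. The curly-brace patterns are illustrative choices, not taken from the text.

```python
import re

candidates = ["Ry", "Ray", "Raay", "Raaay"]
patterns = ["Ra?y", "Ra*y", "Ra+y", "Ra{2}y", "Ra{1,2}y", "Ra{2,}y", "Ra{,1}y"]

# For each pattern, collect the candidates it matches in full.
results = {
    p: [s for s in candidates if re.fullmatch(p, s)]
    for p in patterns
}

for p, matched in results.items():
    print(p, matched)
# Ra?y ['Ry', 'Ray']
# Ra*y ['Ry', 'Ray', 'Raay', 'Raaay']
# Ra+y ['Ray', 'Raay', 'Raaay']
# Ra{2}y ['Raay']
# Ra{1,2}y ['Ray', 'Raay']
# Ra{2,}y ['Raay', 'Raaay']
# Ra{,1}y ['Ry', 'Ray']
```

Note that "Ra{,1}y" behaves exactly like "Ra?y".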

Example: Maryland license plates

The serial numbers on Maryland license plates have different formats corresponding to the date they were issued:

Serial numbers followed the pattern "[A-Z]{3}[0-9]{3}" (e.g. JLY488) from 1986 until 2004 (when all such patterns had been used)
Serial numbers switched to the pattern "[0-9][A-Z]{3}[0-9]{2}" (e.g., 5CLZ83) from 2004 until 2010
Serial numbers switched to the pattern "[0-9][A-Z]{2}[0-9]{4}" in 2010, which is still used today.

Since 2004, Maryland has also issued "Chesapeake Bay Preservation" license plates (with a different appearance) that use the pattern "[0-9]{5}[A-Z]{2}" instead.
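One way to put these four patterns to work is a small classifier that reports which format a serial number follows. This is only a sketch, assuming Python's re module; the function name classify_plate, the labels, and the sample serials 1AB2345 and 12345AB are hypothetical.

```python
import re

# Patterns copied from the text; the labels are illustrative.
PLATE_FORMATS = {
    "1986-2004": re.compile(r"[A-Z]{3}[0-9]{3}"),
    "2004-2010": re.compile(r"[0-9][A-Z]{3}[0-9]{2}"),
    "2010-present": re.compile(r"[0-9][A-Z]{2}[0-9]{4}"),
    "Chesapeake Bay": re.compile(r"[0-9]{5}[A-Z]{2}"),
}


def classify_plate(serial):
    """Return the label of the first format the serial fully matches, else None."""
    for label, pattern in PLATE_FORMATS.items():
        if pattern.fullmatch(serial):
            return label
    return None


print(classify_plate("JLY488"))   # 1986-2004
print(classify_plate("5CLZ83"))   # 2004-2010
print(classify_plate("1AB2345"))  # 2010-present
print(classify_plate("12345AB"))  # Chesapeake Bay
print(classify_plate("ABC"))      # None
```

Because the four formats differ in length or in where the letters fall, no serial number can match more than one of them.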

You can use these simple rules to create and combine regular expressions that match increasingly complicated patterns in sophisticated data sets. If you construct a particularly useful regex, write it down somewhere (some people post their regex for different data sets online). Just be sure to label what the regex does - it can be difficult to reverse-engineer the pattern a complicated regex matches.

Try it yourself: NANP phone numbers

The North American Numbering Plan (NANP) governs telephone numbers in the United States, Canada, and the Caribbean. Examples include (800) 588-2300 and (773) 202-5864. Can you write a regular expression that matches such numbers?

It's easiest to write regular expressions for each "part" of the phone number and combine them.

The area code is a three-digit number written inside parentheses: \([0-9]{3}\) (you must escape the parentheses).
A literal space
The exchange code is a three-digit number: [0-9]{3}
A literal hyphen
The line number, a four-digit number: [0-9]{4}

Putting all the pieces together gives "\([0-9]{3}\) [0-9]{3}-[0-9]{4}" as the complete expression.
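The combined pattern, with the parentheses escaped as \( and \), can be checked against some text. This sketch assumes Python's re module; the sentence being searched is made up, though the two phone numbers come from the examples above.

```python
import re

# Area code in escaped parentheses, a space, the exchange code,
# a hyphen, and the four-digit line number.
nanp = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")

text = "Call (800) 588-2300 or (773) 202-5864, not 800-588-2300."

# findall() returns every non-overlapping match in the text.
found = nanp.findall(text)

print(found)  # ['(800) 588-2300', '(773) 202-5864']
```

The last number in the sentence is skipped because it lacks the parentheses and space the pattern requires.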

You can further refine this pattern based on additional NANP rules:

Area codes never begin with zero or one; currently, no area code uses nine as the second digit
Exchange codes never begin with zero or one; the second and third digits are never both one

These refinements would lead to the regular expression "\([2-9][0-8][0-9]\) [2-9]([02-9][0-9]|[0-9][02-9])-[0-9]{4}"
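The refined expression, written with escaped parentheses around the area code, can be exercised on a few cases. This is a sketch assuming Python's re module; apart from the two numbers given earlier, the samples are invented so that each breaks exactly one rule.

```python
import re

refined = re.compile(
    r"\([2-9][0-8][0-9]\) [2-9]([02-9][0-9]|[0-9][02-9])-[0-9]{4}"
)

samples = [
    "(800) 588-2300",  # valid
    "(773) 202-5864",  # valid
    "(123) 588-2300",  # area code begins with 1
    "(890) 588-2300",  # area code has 9 as its second digit
    "(800) 188-2300",  # exchange code begins with 1
    "(800) 511-2300",  # exchange code's second and third digits are both 1
]

# fullmatch() is used rather than findall() because the pattern contains
# a parenthesized group, and findall() would return only the group's text.
accepted = [s for s in samples if refined.fullmatch(s)]

print(accepted)  # ['(800) 588-2300', '(773) 202-5864']
```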

Image source: xkcd: Regular Expressions by Randall Munroe (image link)
