Regex provides powerful text matching functionality and is used across a wide variety of applications. In this article, we'll introduce the basic concepts, then delve into the different components which comprise a regular expression, using examples at each point, then showcase a list of commonly used regular expressions. If you'd like to skip to that list, feel free to scroll to the bottom. If you have any questions about regex after reading this article, or would like to troubleshoot a regular expression, feel free to leave a comment at the bottom!
Regex
Regex, short for regular expression, is an expression which describes a pattern of text which we want to find or identify within some larger text. Regular expressions are expressive enough to be able to find many different types of patterns and data within large texts using relatively small expressions as queries. A few common use cases for regular expressions include:
- Search and replace: word editors and IDEs, such as Visual Studio Code, allow for search and replace via regular expressions.
- Input validation: regex can be used to ensure that input, such as email addresses and phone numbers, is properly formatted and contains only valid characters.
- Extracting data: regular expressions can be leveraged to extract certain types of data, such as dates and IP addresses, from arbitrary data sources.
- Text tokenization and analysis: text parsers and search engines may use regular expressions to tokenize input queries and documents for search indexing.
In the sections below, we'll introduce the different components which make up regular expressions, then bring it all together at the end with some examples of commonly used regular expressions.
Characters and Character Classes
A character class, also known as a character set, is a set of characters which we can declare by enclosing the characters together in square brackets, such as the set of vowels: [aeiou]. That expression will match a single character from the specified set, which in this example is a vowel. If we want to include an entire range of letters or digits, we can use a dash as a shorthand to specify the range instead of typing everything out manually, such as: [a-z]. If we want to exclude certain characters, we can do so by creating a character class which begins with a ^ to indicate negation, such as: [^0-9], which will match any character that is not a digit. A few commonly used character sets are:
Character Class | Description |
---|---|
[a-z] | Match any character which is a lowercase letter. |
[A-Z] | Match any character which is an uppercase letter. |
[a-zA-Z] | Match any character which is a letter, lowercase or uppercase. |
[0-9] | Match any character which is a digit. |
[^0-9] | Match any character which is not a digit. |
[a-zA-Z0-9] | Match any character which is alphanumeric. |
[^a-zA-Z0-9] | Match any character which is not alphanumeric. |
gr[ae]y | Match the word gray or the word grey. This is a specific example of a general approach towards searching for words which have more than one spelling. |
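The character classes above can be tried out directly. The snippet below is a minimal sketch which assumes Python's built-in re module (the article itself is not tied to any one language); the sample strings are made up for illustration.

```python
import re

# [aeiou] matches one vowel at a time; findall collects every match in order.
vowels = re.findall(r"[aeiou]", "regex")

# [^0-9] matches any single character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

# gr[ae]y accepts either spelling of the word.
spellings = re.findall(r"gr[ae]y", "a grey cat and a gray dog")

print(vowels)      # ['e', 'e']
print(non_digits)  # ['a', 'b']
print(spellings)   # ['grey', 'gray']
```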
Grouping and Alternatives
A group is a sequence of characters enclosed in parentheses which indicates that all of the characters in that group should be treated as a single unit. For example: (abcd) is a group containing "abcd". If we consider a character class to indicate an or relationship between the characters in the set, we can consider a group to be an and relationship. Note that groups cannot be nested inside a character class: within square brackets, parentheses are treated as literal characters, so [(abc)(def)] matches a single character which is either a parenthesis or one of the letters a-f. To match "abc" or "def", we place the alternatives inside a group instead: ((abc)|(def)), using the vertical bar "|" as an or declaration within the outer grouping itself. This can be shortened to (abc|def).
Expression | Description |
---|---|
[0-9A-F] | Match a hexadecimal digit. |
([0-9]|[A-F]) | Match a hexadecimal digit; alternative way of writing the example above using groups. |
(abc|123) | Match "abc" or "123". |
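As a rough illustration of grouping and alternation, the sketch below again assumes Python's re module. It checks that a group with "|" matches whole sequences, and that the two ways of writing a hexadecimal digit accept the same characters; the test characters are arbitrary.

```python
import re

# (abc|def) matches the whole sequence "abc" or the whole sequence "def".
either = re.compile(r"(abc|def)")
matches_abc = either.fullmatch("abc") is not None
matches_def = either.fullmatch("def") is not None
matches_abd = either.fullmatch("abd") is not None

# A character class and an alternation of classes accept the same hex digits.
hex_class = re.compile(r"[0-9A-F]")
hex_group = re.compile(r"([0-9]|[A-F])")
same = all(
    (hex_class.fullmatch(ch) is None) == (hex_group.fullmatch(ch) is None)
    for ch in "0123456789ABCDEFGZ"
)

print(matches_abc, matches_def, matches_abd, same)  # True True False True
```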
Wildcard and Multi Selectors
The wildcard is a special token which can match any character (by default, any character other than a newline) and is represented as a "." dot. This can be useful in certain situations where we are looking for certain characters in a certain order, but do not care about the characters in between. When combined with a Kleene star, this provides a powerful matching functionality. A Kleene star, denoted as an asterisk "*", is a quantifier which indicates that the previous token can occur 0 or more times. The "+" quantifier states that the previous token can occur 1 or more times. Unlike the Kleene star, this imposes a restriction that the previous token must appear at least once. If we want to match something exactly n times, we can use a similar expression: c{3}, which reads "match the character c 3 times". We can also match something at least n times similarly: c{n,}, where the comma indicates that more than n matches is also acceptable. To provide a range of occurrences, such as 2-4, we can use: c{2,4}.
The table below shows a few examples of these selectors in action:
Expression | Description |
---|---|
.* | Match any character 0 or more times. |
a*bc | Match the character "a" 0 or more times, followed by "bc". |
a.*bc | Match "a", followed by 0 or more of any other character, followed by "bc". |
[aeiou]* | Match 0 or more vowels. |
[aeiou]{2} | Match exactly two vowels (in a row). |
7{3,} | Match three or more occurrences of "7". |
7{2,5} | Match 2-5 occurrences of "7". |
(abcd)+ | Match "abcd" at least once. |
[0-9A-F]+ | Match at least one hexadecimal digit. |
((abc){2,}|(123){3}) | Match two or more occurrences of "abc", or exactly three occurrences of "123". |
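A few of the quantifiers from the table can be checked with a short script. This is only a sketch, assuming Python's re module, with input strings invented for the example.

```python
import re

# a*bc: zero or more "a" characters followed by "bc".
star = re.findall(r"a*bc", "bc abc aaabc")

# 7{3,}: three or more "7" characters in a row.
at_least_three = re.findall(r"7{3,}", "77 777 77777")

# [aeiou]{2}: exactly two vowels in a row.
vowel_pairs = re.findall(r"[aeiou]{2}", "book keeper")

# (abcd)+: the whole group must appear at least once.
repeated = re.fullmatch(r"(abcd)+", "abcdabcd") is not None
empty = re.fullmatch(r"(abcd)+", "") is not None

print(star)             # ['bc', 'abc', 'aaabc']
print(at_least_three)   # ['777', '77777']
print(vowel_pairs)      # ['oo', 'ee']
print(repeated, empty)  # True False
```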
Special Characters and Escaping
Regex includes special characters which are used to specify certain types of behaviors or characters such as newlines. We've discussed a few such characters already, such as [, (, ), ], which open and close groups and character sets. If we need to include one of those characters as an actual character, we will need to escape it with a "\" backslash. By escaping it, we indicate that the regular expression should match the character itself instead of using it for its normal purpose. For example: \(ab\) matches "(ab)", and (\[abc\])* matches "[abc]" 0 or more times.
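Escaping can be sketched as follows, assuming Python's re module. The re.escape helper is part of that module and adds the backslashes automatically, which may be convenient when the text to match comes from user input; the sample strings are illustrative only.

```python
import re

# \( and \) match literal parentheses instead of opening a group.
literal_parens = re.findall(r"\(ab\)", "ab (ab) (ab)")

# re.escape puts a backslash in front of each special character.
escaped = re.escape("[abc]")
matches_literal = re.fullmatch(escaped, "[abc]") is not None

# Without escaping, "[abc]" is a character class and matches one letter.
matches_one_letter = re.fullmatch("[abc]", "b") is not None

print(literal_parens)      # ['(ab)', '(ab)']
print(escaped)             # \[abc\]
print(matches_literal)     # True
print(matches_one_letter)  # True
```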
In the sections above, we've only pointed out parts of regular expressions. In many languages and tools, such as JavaScript and Perl, an entire regular expression is enclosed in an opening and closing forward slash "/". In the table below, we'll look at some common special characters and their meanings:
Special Character | Description | Example Expression |
---|---|---|
^ | Outside a character class, it matches the start of the string (or the start of a line when the multiline flag is set). As the first character inside a character class, it negates the class. | /^start/ : matches "start", doesn't match "... start". /[^abc]/ : matches any single character which is not a, b, or c. |
? | Indicates that the previous token is optional, or when placed after a quantifier such as a Kleene star, indicates that the matching should be lazy (match as few characters as possible). | /https?/ : matches "http" and "https". |
$ | Matches the end of a line. | /end$/ matches "... end", but not "... end ...". |
\d | Matches a digit. | /ab\dc/ matches "ab7c". |
\D | Matches any non-digit character. | /\d\d\D/ matches "12K". |
\w | Matches any alphanumeric character or an underscore. | /\w\w/ matches "1z". |
\W | Matches any character which is not alphanumeric or an underscore. | /\W/ matches "%". |
\n | Matches a newline. | /abc\ndef/ matches "abc\ndef". |
\s | Matches any whitespace character. | /a\sb\sc\sd/ matches "a b\nc\td". |
\S | Matches any non whitespace character. | /a\Sb\Sc\Sd/ matches "a7bOcAd". |
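\ | Escapes the following character. It can also precede a backslash to escape a backslash. | /\\n/ matches a literal backslash followed by "n", rather than a newline. |
/ | Marks the beginning and ending of a regular expression declaration (in languages which use regex literals, such as JavaScript). | /abc/ declares the expression abc. |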
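Several of the special characters above are shown in the sketch below. It assumes Python's re module, where patterns are written as raw strings rather than between forward slashes; the sample text is made up.

```python
import re

# ^ and $ anchor the match to the start and end of the text.
at_start = re.search(r"^start", "start here") is not None
not_at_start = re.search(r"^start", "... start") is not None
at_end = re.search(r"end$", "the end") is not None

# ? makes the previous token optional.
schemes = re.findall(r"https?", "http https")

# \s matches a space, a newline and a tab alike.
whitespace = re.fullmatch(r"a\sb\sc\sd", "a b\nc\td") is not None

# A quantifier is greedy by default; adding ? makes it lazy.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)

print(at_start, not_at_start, at_end)  # True False True
print(schemes)     # ['http', 'https']
print(whitespace)  # True
print(greedy)      # ['<b>one</b><b>two</b>']
print(lazy)        # ['<b>one</b>', '<b>two</b>']
```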
Flag Modifiers
A flag is a symbol which is appended after the close of a regular expression declaration which changes the matching behavior of the entire expression. Different programming languages may offer different flags and flags may have slightly different behaviors. The most common flags are indicated below:
Flag | Symbol | Description | Example Expression |
---|---|---|---|
Case Insensitive | i | Match letter characters regardless of whether they are lowercase or uppercase. | /abc/i : matches "abc", "aBc", "ABC", etc. |
Multiline Search | m | Treat the string as having multiple lines. The symbols "^" and "$" will refer to individual lines instead of the start and end of the whole string. | /^abc/m : matches "first line\nabc", which would not match without the multiline flag. |
Global Search | g | The behavior may change depending upon the programming language / regex engine, but this generally indicates that all matches should be returned, not just the first one. | /abc/g : matches all instances of "abc" in the string "abc - abc - abc". |
Dot All | s | This allows the "." dot symbol to match newline characters. | /abc.*xyz/s : matches "abc\nxyz", which would not match without the dot all flag. |
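As noted, flags vary between languages. The sketch below assumes Python's re module, where flags are passed as constants (re.IGNORECASE, re.MULTILINE, re.DOTALL) instead of letters after a closing slash. Python has no global flag; re.findall plays a similar role by returning every match, while re.search returns only the first.

```python
import re

# i: case-insensitive matching.
any_case = re.findall(r"abc", "abc aBc ABC", flags=re.IGNORECASE)

# m: ^ matches at the start of every line, not just the start of the string.
text = "first line\nabc"
without_m = re.search(r"^abc", text) is not None
with_m = re.search(r"^abc", text, flags=re.MULTILINE) is not None

# g (approximation): search returns the first match, findall returns all.
first_only = re.search(r"abc", "abc - abc - abc").group(0)
all_matches = re.findall(r"abc", "abc - abc - abc")

# s: the dot also matches newline characters.
without_s = re.search(r"abc.*xyz", "abc\nxyz") is not None
with_s = re.search(r"abc.*xyz", "abc\nxyz", flags=re.DOTALL) is not None

print(any_case)           # ['abc', 'aBc', 'ABC']
print(without_m, with_m)  # False True
print(first_only)         # abc
print(all_matches)        # ['abc', 'abc', 'abc']
print(without_s, with_s)  # False True
```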
Creating Full Regular Expressions
In the sections above, we've described the individual pieces. To design our own regular expression, we will need to put the pieces together in a way which matches the type of pattern we are looking for. Generally, we can consider the pattern as a template, a sequence of characters arranged in some particular fashion with certain key characters in certain places. For example, an email must always have the @ symbol, so we can design a regular expression to find emails with "something", then a "@" character, then "something else". With that, we have the beginnings of a custom built regular expression. Generally speaking, similar to what we've just done, the following approach can be taken to create custom regular expressions:
- Identify the pattern which must be tested.
- Check if there are any variations of the pattern which must be accounted for.
- Determine if there are any fixed points or break points in the pattern, such as the @ in an email.
- Determine which parts of the pattern can accept any arbitrary number of one or more types of characters.
- Create a regular expression which conforms to the observations in the previous steps using the components described in the previous sections.
Note that there are generally multiple regular expressions which will match the same patterns.
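The steps above can be sketched for the email example. This is a deliberately simplified pattern, not a complete email validator: it assumes Python's re module, treats "@" and a dot in the domain as the fixed points, and the names EMAIL and looks_like_email are hypothetical helpers introduced for illustration.

```python
import re

# Fixed points: one "@" and, as an added assumption, a "." in the domain.
# Variable parts: one or more characters that are not "@" or whitespace.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(text: str) -> bool:
    """Return True if the whole text has the shape something@something.tld."""
    return EMAIL.fullmatch(text) is not None


print(looks_like_email("user@example.com"))   # True
print(looks_like_email("user.example.com"))   # False: no "@"
print(looks_like_email("user@example"))       # False: no "." in the domain
print(looks_like_email("user@@example.com"))  # False: two "@" characters
```

Many other expressions would accept the same inputs; this one simply favors readability over strictness.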
Common Regular Expressions
The table below contains some of the simpler of the most commonly used regular expressions:
Pattern | Regular Expression | Notes |
---|---|---|
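Decimal Number | /\d*(\.\d+)?/ | This matches an integer or a decimal with at least one digit after the decimal dot. Since every part of the expression is optional, it also matches an empty string; use \d+ in place of \d* to require at least one leading digit. |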
Non Alphanumeric Character | /\W/ | This matches any string which has a non alphanumeric character. |
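IPv4 Address | /(\d{1,3}\.){3}\d{1,3}/ | This matches text shaped like an IPv4 address: four numbers of 1-3 digits each, separated by dots. It does not check that each number is in the range 0-255. |
IPv6 Address | /([\da-f]{1,4}:){7}[\da-f]{1,4}/i | This matches an IPv6 address written in full (uncompressed) form, where each of the eight groups is a hexadecimal number with 1-4 digits. Shortened forms using "::" are not matched. |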
Date: month/date/year | \d{1,2}\/\d{1,2}\/(\d{4}|\d{2}) | Match a date such as 4/4/21 or 01/12/1997. |
Phone Number: (000)-(000)-(0000) | (\(?\d{3}\)?[-\s])?\(?\d{3}\)?[-\s]\(?\d{4}\)? | Match a phone number such as (123)-(133)-(1233) or 123 987 1234. |
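To see some of these expressions at work, the sketch below uses the date, phone number and IPv6 expressions from the table to pull data out of a sample sentence. It assumes Python's re module, and the sentence itself is invented for the example.

```python
import re

# Patterns copied from the table, written as Python raw strings.
DATE = re.compile(r"\d{1,2}\/\d{1,2}\/(\d{4}|\d{2})")
PHONE = re.compile(r"(\(?\d{3}\)?[-\s])?\(?\d{3}\)?[-\s]\(?\d{4}\)?")
IPV6 = re.compile(r"([\da-f]{1,4}:){7}[\da-f]{1,4}", re.IGNORECASE)

text = "Invoice dated 01/12/1997, revised 4/4/21. Call 123 987 1234."

# finditer with group(0) returns the full match even when groups are present.
dates = [m.group(0) for m in DATE.finditer(text)]
phones = [m.group(0) for m in PHONE.finditer(text)]

full_ipv6 = "2001:0db8:85a3:0000:0000:8a2e:0370:7334"
is_full = IPV6.fullmatch(full_ipv6) is not None
is_short = IPV6.fullmatch("2001:db8::1") is not None

print(dates)              # ['01/12/1997', '4/4/21']
print(phones)             # ['123 987 1234']
print(is_full, is_short)  # True False
```

The last line shows a limit of the IPv6 expression: a shortened address containing "::" is not accepted by it.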
Readability and Code Quality
One of the downsides of using regular expressions is that the expressions themselves can get verbose relatively quickly. This can be mitigated to a degree by using special characters judiciously, but as the complexity of the expression grows, its readability will drop. Once a regular expression becomes too verbose, it may be a good idea to revisit the expression to see if it can be stated more elegantly, or even do away with the expression altogether and consider implementing the matching at the code level.
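As one possible illustration of matching at the code level, the sketch below validates an IPv4 address in plain Python without any regular expression. The function name is_ipv4 is hypothetical, and the rule that each number must fall in the range 0-255 is an assumption about what "valid" should mean here.

```python
def is_ipv4(text: str) -> bool:
    """Return True if text is four dot-separated numbers, each in 0-255."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # isascii() and isdigit() together accept only the characters 0-9.
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True


print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("256.1.1.1"))    # False: 256 is out of range
print(is_ipv4("1.2.3"))        # False: only three parts
```

The range check is easy to state in code, whereas expressing it inside a regular expression tends to make the expression considerably longer.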
Testing and Validating Regular Expressions
It is recommended that you validate any regular expression you create before pushing it to your codebase. Unit testing is an excellent way of doing this, but before that step, there are plenty of tools available which enable you to input a regular expression, specify some test cases or a piece of text in which to test, and check that the regular expression is behaving as expected. Regex101 is a great tool which can be used for this purpose.
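A unit test for a regular expression might look like the sketch below. It assumes Python's built-in unittest and re modules and uses the date expression from the table of common expressions; the class name and test cases are illustrative only.

```python
import re
import unittest

# Date expression (month/date/year) under test.
DATE = re.compile(r"\d{1,2}\/\d{1,2}\/(\d{4}|\d{2})")


class DatePatternTest(unittest.TestCase):
    def test_accepts_expected_formats(self):
        for text in ("4/4/21", "01/12/1997"):
            self.assertIsNotNone(DATE.fullmatch(text), text)

    def test_rejects_other_formats(self):
        for text in ("2021-04-04", "1/2/345", "4/4"):
            self.assertIsNone(DATE.fullmatch(text), text)
```

Saved in a file, tests like these can be run with python -m unittest. Listing both accepted and rejected inputs helps catch expressions which are too strict as well as those which are too permissive.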