In arithmetic use can use the operations + and x to build up expressions such as:
(5 + 3) x 4
Similarly, we can use regular operations to build up expressions describing languages, which are called regular expressions.
For example:
(ab|a)*
The value of the arithmetic is 32. The vale of a regular expression is a language. In this case, a language consisting of either pairs of ab or just a repeated.
The following breaks down the symbols and their meaning.
Regular Expression practice sheet here.
The simplest kind of matching is just entering some text to match. Open the interactive below and type "cat" into the box:
If you've only typed the 3 characters "cat", then it should find 6 matches.
Now try typing a dot (full stop or period) as the fourth character: "cat.". In a regular expression, "." can match any single character. Try adding more dots before and after "cat". How about "cat.s" or "cat..n"?
What do you get if you search for " ... " (three dots with a space before and after)?
Now try searching for "ic.". The "." matches any letter, so if you really wanted a full stop, you need to write it like this "ic\." – use this search to find "ic" at the end of a sentence.
Another special symbol is "\d", which matches any digit. Try matching 2, 3 or 4 digits in a row (for example, two digits in a row is "\d\d").
To choose from a small set of characters, try "[up]ff". Either of the characters in the square brackets will match. Try writing a regular expression that will match "fat", "sat" and "mat", but not "cat".
A shortcut for "[mnopqrs]" is "[m-s]"; try "[f-s]at" and "[4-6]".
Another useful shortcut is being able to match repeated letters. There are four common rules:
a* matches 0 or more repetitions of a
a+ matches 1 or more repetitions of a
a? matches 0 or 1 occurrences of a (that is, a is optional)
a{5} matches "aaaaa" (that is, a repeated 5 times)
Try experimenting with these. Here are some examples to try:
f+
pf*t
af*
f*t
f{5}
.{5}n
If you want to choose between options, the vertical bar is useful. Try the following, and work out what they match. You can type extra text into the test string area if you want to experiment:
was|that|hat
was|t?hat
th(at|e) cat
[Tt]h(at|e) [fc]at
(ff)+
f(ff)+
Notice the use of brackets to group parts of the regular expression. It's useful if you want the "+" or "*" to apply to more than one character.
Regular expression
The term regular expression is sometimes abbreviated to "regex", "regexp", or "RE". It's "regular" because it can be used to define sets of strings from a very simple class of languages called "regular languages", and it's an "expression" because it is a combination of symbols that follow some rules.
Click here for another challenge: you should try to write a short regular expression to match the first two words, but not the last three:
Of course, regular expressions are mainly used for more serious purposes. Click on the following interactive to get some new text to search:
The following regular expression will find common New Zealand number plates in the sample text, but can you find a shorter version using the {n} notation?
[A-Z][A-Z][A-Z]\d\d\d
How about an expression to find the dates in the text? Here's one option, but it's not perfect:
\d [A-Z][a-z][a-z] \d\d\d\d
Can you improve on it?
What about phone numbers? You'll need to think about what variations of phone numbers are common! How about finding email addresses?
Regular expressions are useful!
The particular form of regular expression that we've been using is for the Ruby programming language (a popular language for web site development), although it's very similar to regular expressions used in other languages including Java, JavaScript, PHP, Python, and Microsoft's .NET Framework. Even some spreadsheets have regular expression matching facilities.
Warning: Limitations!
But regular expressions have their limits – for example, you won't be able to create one that can match palindromes (words and phrases that are the same backwards as forwards, such as "kayak", "rotator" and "hannah"),
and you can't use one to detect strings that consist of n repeats of the letter "a" followed by n repeats of the letter "b". For those sort of patterns you need a more powerful system called a grammar (see the section on Parsing). But nevertheless, regular expressions are very useful for a lot of common pattern matching requirements.
Regular expressions and finite automata are equivalent in their descriptive power. This fact is rather remarkable, because finite automata and regular expressions superficially appear to be rather different. However, any regular expression can be converted into aa finite automaton that recognizes the language it describes and vice versa.
Let's run through a few examples of regular expressions and turn them into Finite Automatons.
On the left are regular expressions, on the right are their equivalent Finite Automatons.
Note that empty string, denoted by ε, can be interpreted throughout a string, as between characters, technically, there is an empty string.
This can be really useful when coming up with multiple accept states for given input.
In the next section we will look at a more powerful system called grammar in which we parse input.