Regular expressions (or RegEx) are like special codes used to search for patterns in text. Imagine you have a huge book, and you want to find all instances of a specific word within that book. Instead of reading every single page manually, you can use regular expressions to tell a computer what pattern to look for.
RegEx in bioinformatics
Let's say you're looking for the sequence "ATG", the start codon that typically marks where a protein-coding sequence begins. You could use a regular expression like "ATG" to find all occurrences of that sequence in the text.
AATCTAGCATTTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTATAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACGTAGTAGCTAAAGCTAAACCTCAGGGGCTACTTTATAGCATCAAAAAAGCTAAACCTCAGGGGCTACTTTATAGCATCAAATGTAGCATTTACGTAGTAGCTAAAGCTATTACGTAGGGCTACTTTATAGCATCAAATCTAGCATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACGTAGGGCAAAAGCTAAACCTCAGGGGCTACTTTATAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACGTAGGGCTATGTTTATAGCATCAAATCTAGCATTTACGTAGCATCAAATCTAGCATTTACGTAGTAGCTAAAGCTATTACGTTAGGGCAAAAGCTAAACCTCAGGGGCT
In bioinformatics, regular expressions are incredibly useful for searching through genetic sequences to find specific patterns related to genes, proteins, or other biological features. They help us efficiently analyze large amounts of genetic data without having to manually inspect large data files.
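As a minimal sketch of the "ATG" search described above, Python's built-in re module can list where the pattern occurs. The short sequence and the variable names are made up for illustration; they are not taken from the text.

```python
import re

# Toy sequence, invented for this example (an assumption, not real data).
sequence = "AATCTAGCATGTTTACGATGA"

# re.finditer returns one match object per non-overlapping occurrence.
positions = [match.start() for match in re.finditer("ATG", sequence)]

print(positions)  # [8, 17] (0-based start position of each "ATG")
```

The same call works unchanged on much longer sequences, for example one read from a file.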
Special characters in RegEx
Most characters in a regular expression simply match themselves. The exception is a small set of characters that have a special meaning in RegEx.
The following characters are special in RegEx:
. + * ? {} ^
$ () [] | \
If our pattern contains one of them and we want to match it literally, we must "escape" the character so that it is read as plain text:
\.
The backslash (\) escapes the character that follows it, so the dot above stands for an actual dot in the file we are searching, not for "any character".
Regular expressions can include special characters that represent different types of characters or patterns.
The dot (.)
For instance, the dot (.) matches any single character: a digit, a letter, a symbol, or even a space. In most RegEx engines it does not match a newline by default.
If we want to match "ATG" together with the nucleotide that follows it, we can use "ATG.". In the sequence below, the first 'ATG' is followed by a 'T', so the first match is 'ATGT':
ATAGCATCAAATGTAGCATTTACGTAGTAGCTATAGCTATTACGTAGGGCTACTTTATAGCATCAAATCTAGCATCTACGTAGCATCAAATCTAGCACGTACGTAGTAGCTCATGCTATTACGTAGCGCAACAGCTCAACCTCAGGCTACTTTATAGCATCAAATCTAGCATTAACGTAGTA
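To try the dot in practice, here is a small Python sketch using the re module. The sequence is a shortened, made-up example, so its matches are not the ones from the longer sequence above.

```python
import re

# Toy sequence, invented for this example.
sequence = "CAAATGTAGCTCATGCTA"

# "ATG." matches ATG plus whatever single character follows it.
matches = re.findall("ATG.", sequence)
print(matches)  # ['ATGT', 'ATGC']

# By default the dot does not match a newline, so this finds nothing.
at_line_end = re.findall("ATG.", "CCATG\n")
print(at_line_end)  # []
```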
The star (*)
The asterisk (*) means "zero or more occurrences of the previous character." We can combine it with the dot in the following way ".*". So we match zero or more occurrences of any character.
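So, if you wanted to find a sequence that starts with "ATG" and ends with "TAA", you could use a regular expression like "ATG.*TAA", which means "find 'ATG', followed by zero or more of any character, followed by 'TAA'". Note that the star is greedy: it matches as many characters as possible, so the match runs from the first 'ATG' to the last 'TAA' on the line: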
ATAGCATCAAATGTAGCATTTACGTAGTAGCTATAGCTATTACGTAGGGCTACTTTATAGCATCAAATCTAGCATCTACGTAGCATCAAATCTAGCACGTACGTAGTAGCTCATGCTATTACGTAGCGCAACAGCTCAACCTCAGGCTACTTTATAGCATCAAATCTAGCATTAACGTAGTA
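The sketch below uses Python's re module on a made-up sequence to show how far ".*" reaches. Because the star takes as many characters as it can, the match ends at the last "TAA", not the first one.

```python
import re

# Toy sequence with one ATG and two TAA (invented for this example).
sequence = "CCATGAAATAAGGGTAACC"

match = re.search("ATG.*TAA", sequence)
print(match.group())  # ATGAAATAAGGGTAA
print(match.span())   # (2, 17)
```

This matters when a sequence contains several stop codons: a single ".*" will happily run past the earlier ones.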
The plus (+)
The plus (+) is similar to the star, but it means "one or more occurrences of the previous character". Again, we can combine it with the dot as ".+", which matches one or more occurrences of any character.
On the sequence from the previous example we get the same result with "ATG.+TAA", which means "find 'ATG', followed by one or more of any character, followed by 'TAA'":
ATAGCATCAAATGTAGCATTTACGTAGTAGCTATAGCTATTACGTAGGGCTACTTTATAGCATCAAATCTAGCATCTACGTAGCATCAAATCTAGCACGTACGTAGTAGCTCATGCTATTACGTAGCGCAACAGCTCAACCTCAGGCTACTTTATAGCATCAAATCTAGCATTAACGTAGTA
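A quick way to compare the plus with the star is to run both with Python's re module. In this sketch the two toy sequences and the helper function are invented for illustration: in the first sequence one nucleotide sits between "ATG" and "TAA", in the second nothing does, and that is the only case where the two patterns disagree.

```python
import re

with_gap = "CAAATGCTAACATT"  # ATG, then C, then TAA
no_gap = "CAAATGTAACATT"     # ATG directly followed by TAA


def matched_text(pattern, seq):
    """Return the matched text, or None if the pattern is not found."""
    found = re.search(pattern, seq)
    return found.group() if found else None


print(matched_text("ATG.*TAA", with_gap))  # ATGCTAA
print(matched_text("ATG.+TAA", with_gap))  # ATGCTAA
print(matched_text("ATG.*TAA", no_gap))    # ATGTAA
print(matched_text("ATG.+TAA", no_gap))    # None
```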
But what would happen with the following sequence, where 'TAA' comes immediately after 'ATG'?
ATAGCATCAAATGTAACATTTACGTAGTAGCTATAGCTATTACGTAGGGCTACTTTATAGCATCAAATCTAGCATCTACGTAGCATCAAATCTAGCACGTACGTAGTAGCTCATGCTATTACGTAGCGCAACAGCTCAACCTCAGGCTACTTTATAGCATCAAATCTAGCATTCACGTAGTA
"ATG.*TAA" would be able to match it, but not "ATG.+TAA", as it requires that there is at least one character in between 'ATG' and 'TAA'
The question mark (?)
The question mark matches the previous character zero or one time.
Suppose we want to match a sequence that starts with "ATG" and ends with "TAA", and we know that there is sometimes a T after "ATG" and sometimes not. We can do that with "ATGT?TAA": the character after 'ATG' can be a T or nothing at all, so both of the following sequences would be matched:
ATAGCATCAAATGTAACATTTACGTAGTAGCTATAGCTATTACGTAGGGCTACTTTATAGCATCAAATCTAGCA
ATAGCATCAAATGTTAACATTTACGTAGTAGCTATAGCTATTACGTAGGGCTACTTTATAGCATCAAATCTAGCA
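The optional T can be checked with a few lines of Python. This is a sketch using the re module; the three short sequences are invented so that they contain zero, one, and two extra T's after "ATG".

```python
import re

pattern = re.compile("ATGT?TAA")

# Toy sequences invented for this example.
zero_t = pattern.search("CAAATGTAACATT")
one_t = pattern.search("CAAATGTTAACATT")
two_t = pattern.search("CAAATGTTTAACATT")

print(zero_t.group())  # ATGTAA
print(one_t.group())   # ATGTTAA
print(two_t)           # None: "?" allows at most one extra T
```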
The curly brackets ({})
The curly brackets specify how many times we expect the previous character to occur. They have three main configurations:
{m} - previous character exactly m times
{m,n} - previous character between m and n times
{m,} - previous character m or more times
For example, in the following sequence we want to find three 'A' in a row. We could find them using "A{3}":
ATAGCATCATAATGTAGCATTTACGTAGTAGCTATAGCTATTACGTAGGGCTAAAAATAGCATCATATCTAG
We can also specify that we want to find 'A' a minimum of 3 times and a maximum of 5 with "A{3,5}", then we would match:
ATAGCATCATAATGTAGCATTTACGTAGTAGCTATAGCTATTACGTAGGGCTAAAAATAGCATCATATCTAG
We could get the same result in this specific sequence by using "A{3,}", which searches for three or more 'A' in a row.
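The three forms of the curly brackets can be compared side by side in Python. This is a sketch with the re module; the toy sequence is built in code so that it clearly contains a run of exactly five 'A'.

```python
import re

# Toy sequence: a run of five A plus some shorter runs (invented data).
sequence = "GGCT" + "A" * 5 + "TAGCATAATC"

exactly_three = re.findall("A{3}", sequence)
three_to_five = re.findall("A{3,5}", sequence)
three_or_more = re.findall("A{3,}", sequence)

print(exactly_three)  # ['AAA']   (only the first three of the run)
print(three_to_five)  # ['AAAAA']
print(three_or_more)  # ['AAAAA']
```

Note that "A{3}" stops after three characters, while the other two patterns keep going for as long as they are allowed to.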
The caret (^)
The caret symbol (^) matches the beginning of a line or string, so if we want to match sequences that start with "ATG", we can use "^ATG".
For example, in the following sequence:
ATGATAGCTTAACATTTACGTAGTAGCTATAGCTATT
GTCATGAGCTATTAGCATCACATCTAGCACGTTCATG
ATGCTATGAAGTCTACTTTATAGCATCAAATCTAGTA
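The regular expression "^ATG" matches 'ATG' only in the first and third lines, because only those lines begin with "ATG". The other 'ATG' occurrences further along the second and third lines are not matched.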
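In Python, whether "^" applies to each line depends on how the text is searched. The sketch below, using the re module on three shortened, made-up lines, shows both options: searching line by line, and searching one multi-line string with the re.MULTILINE flag.

```python
import re

# Toy lines, invented for this example.
lines = ["ATGATAGC", "GTCATGAG", "ATGCTATG"]

# Search each line separately: ^ anchors the match to the start of the line.
starts_with_atg = [i for i, line in enumerate(lines, start=1)
                   if re.search("^ATG", line)]
print(starts_with_atg)  # [1, 3]

# On one multi-line string, ^ only matches at the very start by default;
# re.MULTILINE makes it match at the start of every line.
text = "\n".join(lines)
default_hits = re.findall("^ATG", text)
multiline_hits = re.findall("^ATG", text, flags=re.MULTILINE)
print(len(default_hits), len(multiline_hits))  # 1 2
```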
The dollar sign ($)
The dollar sign ($) is used to match the end of a line or string. If we want to match sequences that end with "TAA", we can use "TAA$".
For example, in these sequences:
TATAGCTAAAGTCTACTTTATAATCAATGATAGCTTAA
ATGAGCTATTAGCATCACATCTAGCAGTCATGAGCTAT
GTAGCATTTACGTAGTAGCTATAGCTATGCTATGAAGT
The regular expression "TAA$" matches 'TAA' only in the first line, because it is the only line that ends with "TAA". The 'TAA' occurrences in the middle of that line are not matched.
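A small Python sketch of the end-of-line anchor, again with the re module and made-up lines. The first toy line contains "TAA" twice, once in the middle and once at the end, to show that "$" only accepts the final one.

```python
import re

# Toy lines, invented for this example.
lines = ["TAGCTAAAGCTTAA", "ATGAGCTAT", "GTAGCTATGAAGT"]

ends_with_taa = [i for i, line in enumerate(lines, start=1)
                 if re.search("TAA$", line)]
print(ends_with_taa)  # [1]

# Position of the match in the first line: the TAA at the end (index 11),
# not the one in the middle (index 4).
end_hit = re.search("TAA$", lines[0])
print(end_hit.start())  # 11
```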
Square brackets ([])
Square brackets [] are used to define a set of characters to match. For example, [ACGT] matches any single character that is A, C, G, or T.
If we want to find sequences where "ATG" is followed by a C or G, we can use ATG[CG].
In the sequence:
ATAGCATCAAATGCTAACATTTACGTAGTAGCTATAGCTATTACGTATGGCTACTTTATAGCATCAAATCT
The regular expression ATG[CG] matches ATGC and ATGG.
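Character sets are easy to experiment with in Python. The following sketch uses the re module on an invented sequence; the second part shows one common use of "[ACGT]", checking that a string contains only the four nucleotide letters (re.fullmatch requires the whole string to match).

```python
import re

# Toy sequence, invented for this example.
sequence = "AAATGCTAACGTATGGCTATGACT"

# ATG followed by C or G; the last ATG is followed by A, so it is skipped.
hits = re.findall("ATG[CG]", sequence)
print(hits)  # ['ATGC', 'ATGG']

# Validate that a string is made only of A, C, G, and T.
clean = re.fullmatch("[ACGT]+", "ATGCGT") is not None
with_n = re.fullmatch("[ACGT]+", "ATGNGT") is not None
print(clean, with_n)  # True False
```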
The pipe (|)
The pipe "|" means "or" and allows you to specify alternative patterns. For instance, "ATG|TAA" matches either "ATG" or "TAA".
In the sequence:
ATGACGACGTAGCGCAACAGCTCAACCTCAGGCTACTTTATAGCATCAAATCTAGCATTTAAATAG
TCTAGCATGACGACGTAGCGCAACAGCTCAACCTCAATAGCTATTACGTAGTGCAATGTACTATTA
ACCTCAGGCTACTTTATATAGCTATTACGTAGAGCATCAAATCTAGCATTTAAATAGCCCGTATCC
The regular expression "ATG|TAA" matches every occurrence of either "ATG" or "TAA" in these lines.
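To see which alternative matched where, one option in Python is re.finditer, which gives access to both the position and the matched text. The sequence below is a made-up toy example.

```python
import re

# Toy sequence, invented for this example.
sequence = "ATGACGTAGCATTTAAATAG"

# Each hit records the 0-based position and which alternative was found.
hits = [(m.start(), m.group()) for m in re.finditer("ATG|TAA", sequence)]
print(hits)  # [(0, 'ATG'), (13, 'TAA')]
```

Note that "TAG" at position 6 is not reported, because it is neither of the two alternatives.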
Parentheses (())
Parentheses () are used for grouping and capturing: the part of the match inside the parentheses is stored so that it can be retrieved or referenced later. If we want to capture sequences that consist of "ATG", any two characters, and "TAA", we can use "(ATG..TAA)".
ATGCTTAAATGCCCAGTAA
The regular expression (ATG..TAA) matches (and captures) "ATGCTTAA"
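In Python's re module, captured groups are read back with the group() method of a match object. The sketch below reuses the short example sequence; the second pattern, "ATG(..)TAA", is an added variation (not from the text) that captures only the two nucleotides in the middle.

```python
import re

sequence = "ATGCTTAAATGCCCAGTAA"  # the example sequence from above

# group(1) is the text captured by the first pair of parentheses.
whole = re.search("(ATG..TAA)", sequence)
print(whole.group(1))  # ATGCTTAA

# Moving the parentheses changes what is captured, not what is matched.
middle = re.search("ATG(..)TAA", sequence)
print(middle.group(0))  # ATGCTTAA (the full match)
print(middle.group(1))  # CT       (the captured part)
```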
Referencing captures:
Captured groups can be referenced later in the same regular expression or used in programming languages.
To find a repeated sequence like "ATGTACTAA", you can capture it once and then refer back to it with \1, which stands for whatever the first group captured:
(ATGTACTAA).*\1
It would match in the following sequence:
GTAAATGTACTAACAGTAACGTAGCGATGTACTAAACCTCAATAG
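Here is a hedged Python sketch of the backreference. The flanking bases around the motif are invented; the pattern is written as a raw string (r"...") because otherwise Python would interpret \1 itself before the re module ever sees it.

```python
import re

motif = "ATGTACTAA"

# Toy sequences built around the motif (flanking bases are invented).
repeated = "GTAA" + motif + "CAGT" + motif + "ACC"
single = "GTAA" + motif + "ACC"

pattern = re.compile(r"(ATGTACTAA).*\1")

hit = pattern.search(repeated)
miss = pattern.search(single)

print(hit.group())  # from the first copy of the motif through the second
print(miss)         # None: the motif occurs only once
```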
The backslash (\)
The backslash "\" is used as an escape character to treat special characters literally. For example, if you want to match a literal dot, use "\." instead of "."
If we want to find "A.T" as it appears (with the dot), we can use "A\.T"
In the sequence:
A.TGACTTAAG.A.T
The regular expression "A\.T" matches A.T twice in the sequence.
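The difference between an escaped and an unescaped dot is easy to check in Python. This sketch uses the re module on the short sequence above; re.escape is a standard helper that adds the backslashes automatically, which is handy when the text to search for comes from user input.

```python
import re

text = "A.TGACTTAAG.A.T"  # the example sequence from above

escaped = re.findall(r"A\.T", text)  # literal dot
unescaped = re.findall("A.T", text)  # dot means "any character"

print(escaped)    # ['A.T', 'A.T']
print(unescaped)  # ['A.T', 'ACT', 'A.T']

# re.escape produces the escaped pattern for us.
auto_escaped = re.escape("A.T")
print(auto_escaped == r"A\.T")  # True
```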
Summary of special characters in RegEx
. single character
* the preceding character matches 0 or more times
+ the preceding character matches 1 or more times
? the preceding character matches 0 or 1 times only
{n} the preceding character matches exactly n times
{n,m} the preceding character matches at least n times, and up to m times (written without a space after the comma)
{n,} the preceding character matches n or more times
^ matches the beginning of the line
$ matches the end of the line
() grouping and capturing
| OR operator
[abc] the character is one of the characters included in the square brackets, thus matching a, b, or c
[a-d] the character is within the range of a to d, thus matching a, b, c, or d
[^abc] the character is not one of the characters included in the square brackets, thus matching any character except a, b, or c
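[a-zA-Z] matches any letter; the shorthand \w is broader and also matches digits and the underscore
[0-9] matches any digit; equivalent: \d
' ' space; the shorthand \s matches any whitespace character (space, tab, or newline)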
' ' tab equivalent: \t
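The shorthand classes and the negated set from the summary can be tried out in Python as follows. This is a sketch with the re module; the label string is invented and simply mixes letters, digits, an underscore, a space, and a tab.

```python
import re

# Invented label with a space and a tab as separators.
label = "seq_1 chr7\texon12"

digits = re.findall(r"\d+", label)  # runs of digits
words = re.findall(r"\w+", label)   # runs of letters, digits, or underscore
fields = re.split(r"\s+", label)    # split on any whitespace

print(digits)  # ['1', '7', '12']
print(words)   # ['seq_1', 'chr7', 'exon12']
print(fields)  # ['seq_1', 'chr7', 'exon12']

# A negated set finds everything that is not a standard nucleotide letter.
not_dna = re.findall("[^ACGT]", "ACGTNNAC-T")
print(not_dna)  # ['N', 'N', '-']
```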