Regular Expressions

Shell Metacharacters

Before we study regular expressions we should get familiar with the shell metacharacters, which have a similar syntax. Regular expressions are used by tools such as "grep", "sed" and "awk", while shell metacharacters are expanded by the shell itself when we run commands such as "ls" and "echo". Even though the syntax looks similar, the meaning is different, and we have to make sure that we do not confuse the two. Regular expressions are a lot more sophisticated and powerful than shell metacharacters.

Star(*)

The star matches anything: a single character, multiple characters, or no characters at all. Say we have the following files in a folder:

a ab b c fb file1 file2 file3 file4

and we want the "ls" command to list the files "file1" to "file4". Using the star we can write:

ls file*

[amittal@hills w1]$ ls file*

file1 file2 file3 file4

The above lists all the files whose names begin with the word "file", followed by anything (or nothing).

Using "ls *" gives us all the files:

[amittal@hills w1]$ ls *

a ab b c fb file1 file2 file3 file4

Example

$ ls

1.txt 2.txt a1 a2 b

$ ls *.txt

1.txt 2.txt

$ ls a*

a1 a2

$ ls a*1

a1

The "*" is not just limited to listing the files. It will work with the "echo" command also.

[amittal@hills w1]$ echo *

a ab b c fb file1 file2 file3 file4

The shell will substitute "*" with the list of file names and then execute the command. Again the command can be anything.

Ex: suppose the folder has two files with the following contents:

"file1.txt"

Contents of file 1 .

"file2.txt"

Contents of file 2 .

$ cat *

Contents of file 1 .

Contents of file 2 .

In the above the star is substituted and the shell creates the command:

cat file1.txt file2.txt

The above command prints the contents of the 2 files.
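To see that it is the shell, and not the command, that replaces the star, we can count the words the shell produces. The following is a minimal sketch; the temporary folder and the file names are made up for the illustration.

```shell
# Throwaway folder and file names, used only for this illustration.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch file1 file2 file3 notes

# "set --" stores the words the shell produces, so $# counts the
# arguments a command such as ls or cat would actually receive.
set -- file*
expanded_count=$#
expanded_words="$*"

# Quoting the star switches the substitution off: one literal word remains.
set -- "file*"
quoted_count=$#
quoted_word=$1

echo "unquoted: $expanded_count words ($expanded_words)"
echo "quoted: $quoted_count word ($quoted_word)"

cd / && rm -rf "$dir"
```

The unquoted pattern turns into three separate arguments, while the quoted one is passed along as the single word "file*".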

Question Mark

The question mark matches exactly one character.

[amittal@hills w1]$ ls

a ab b c fb file1 file2 file3 file4

[amittal@hills w1]$ ls ?

a b c

[amittal@hills w1]$ ls ??

ab fb

The above command lists all the files whose names are exactly 2 characters long.

$ ls

1.txt 2.txt a1 a2 b file1.txt file2.txt

$ ls ?*?

1.txt 2.txt a1 a2 file1.txt file2.txt

In the above the file "b" is not printed because the 2 question marks mean that the file name must be at least 2 characters long.
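The three question mark patterns can be compared side by side by counting how many names each one matches. This is only a sketch, using a temporary folder with the same file names as the listing above.

```shell
# Temporary folder with the file names from the listing above.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch 1.txt 2.txt a1 a2 b file1.txt file2.txt

set -- ?        # names that are exactly one character long
one_char=$#
set -- ??       # names that are exactly two characters long
two_chars=$#
set -- ?*?      # names that are two or more characters long
two_or_more=$#

echo "?: $one_char  ??: $two_chars  ?*?: $two_or_more"

cd / && rm -rf "$dir"
```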

Square Brackets

Square brackets match any one character from a range. Ex:

[amittal@hills w1]$ ls

a ab b c fb file1 file2 file3 file4 notes

[amittal@hills w1]$ ls file[1-4]

file1 file2 file3 file4

This lists all the files whose names start with the word "file" and end in a single digit from 1 to 4.

Instead of a range we can also list the characters in the set. Ex:

[amittal@hills w1]$ ls file[1,3]

file1 file3

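The "[1,3]" matches any one of the characters listed in the brackets. Note that the comma is not a separator: it is itself one of the characters in the set, so "[1,3]" would also match a file named "file,". Writing "[13]" is enough.

Square brackets with a leading "!" represent a "not" condition. Let's say we want to list files beginning with the word "file" but do not want "file3" listed.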

[amittal@hills w1]$ ls

a ab b c fb file1 file2 file3 file4 notes temp1

[amittal@hills w1]$ ls file[!3]

file1 file2 file4

Example:

$ ls

1.txt 2.txt a1 a2 b file1.txt file2.txt

$ ls [1-2].txt

1.txt 2.txt

$ ls [1-5].txt

1.txt 2.txt

Since there are no files such as "3.txt", "4.txt" and "5.txt", only "1.txt" and "2.txt" are printed.

$ ls [a-z].txt

ls: cannot access '[a-z].txt': No such file or directory

$ touch a.txt

$ ls [a-z].txt

a.txt
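The error message above appears because, when no file matches, the shell hands the pattern to the command unchanged. A small sketch of this behaviour, assuming the default shell settings (options such as bash's "nullglob" change it) and a made-up temporary folder:

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch 1.txt 2.txt

# Nothing matches yet, so the pattern itself is passed along as a word.
set -- [a-z].txt
before=$1

# Once a matching file exists, the pattern is replaced by its name.
touch a.txt
set -- [a-z].txt
after=$1

echo "before: $before"
echo "after: $after"

cd / && rm -rf "$dir"
```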

$ ls [a-z]*.txt

a.txt file1.txt file2.txt

$ ls [a-z1-9]*.txt

1.txt 2.txt a.txt F3.txt file1.txt file2.txt

$ ls [a-zA-Z]*.txt

a.txt F3.txt file1.txt file2.txt

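We can have multiple ranges inside one pair of brackets, as in "[a-zA-Z]" above. The listings also show a file named "F3.txt" that was added to the folder along the way; whether "[a-z]" matches an uppercase letter such as "F" depends on the shell and the locale settings.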

$ ls [1,a]*.txt

1.txt a.txt

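The comma is not needed here: the brackets already mean "any one of these characters", and the comma is simply one more character in the set. The pattern "[1a]*.txt" gives the same result.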

$ ls a[1,2,3]

a1 a2

$ touch a

$ ls

1.txt 2.txt a a.txt a1 a2 b F3.txt file1.txt file2.txt

$ ls a[1,2,3]*

a1 a2
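Inside square brackets every character is a member of the set, and that includes a comma. The sketch below makes this visible by adding a file whose name ends in a comma; the folder and the extra file "file," are made up for the illustration.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch file1 file2 file3 file4 'file,'

set -- file[13]
no_comma=$#       # file1 and file3
set -- file[1,3]
with_comma=$#     # file1, file3 and also "file,"
set -- file[!3]
negated=$#        # every five-character name except file3

echo "[13]: $no_comma  [1,3]: $with_comma  [!3]: $negated"

cd / && rm -rf "$dir"
```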

Curly Brackets

Curly brackets represent an "or" condition: the shell expands each comma-separated alternative into its own word. Ex:

[amittal@hills w1]$ ls

a ab b c fb file1 file2 file3 file4 notes

[amittal@hills w1]$ ls {?,??}

The above lists any files whose names are either 1 or 2 characters long.

$ ls

1.txt 2.txt a a.txt a1 a2 b F3.txt file1.txt file2.txt

$ ls {?,?*}

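1.txt 2.txt a a a.txt a1 a2 b b F3.txt file1.txt file2.txt

The files "a" and "b" are listed twice because each alternative is expanded separately and they match both "?" and "?*".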
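Because each alternative inside the curly brackets becomes its own word before file names are matched, a name can show up more than once. The sketch below assumes that "bash" is installed (curly brackets are not understood by every shell, so it is called explicitly) and uses a made-up temporary folder.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch a ab b

# {?,??} becomes the two patterns "?" and "??", matched one after the other.
one_or_two=$(bash -c 'echo {?,??}')

# {?,?*} becomes "?" and "?*"; "a" and "b" match both, so they repeat.
with_repeats=$(bash -c 'echo {?,?*}')

echo "$one_or_two"
echo "$with_repeats"

cd / && rm -rf "$dir"
```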

Backslash

The backslash is used to quote special characters so that the shell treats them literally.

Let's say we wanted to create a file named "*" .

[amittal@hills temp1]$ touch \*

[amittal@hills temp1]$ ls

[amittal@hills temp1]$ rm \*

[amittal@hills temp1]$ ls
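The effect of the backslash can be checked by comparing "*" with "\*" in a folder that holds other files as well. This is a sketch with made-up file names "a" and "b" in a temporary folder.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch a b
touch \*            # creates a file literally named "*"

set -- *
all_count=$#        # "a", "b" and the file named "*"
set -- \*
escaped_count=$#    # only the literal name
escaped_word=$1

rm \*               # removes only the file named "*"
set -- *
remaining="$*"

echo "all: $all_count  escaped: $escaped_count  left after rm: $remaining"

cd / && rm -rf "$dir"
```

Without the backslash, "rm *" would have removed every file in the folder.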

Exercises

Ex1

We have the following files in a folder:

1.txt 2.txt a a.txt a1 a2 b F3.txt file1.txt file2.txt

List files "file1.txt" and "file2.txt" without using the word "txt" in your search command. Do not use the file names explicitly.

Solutions

Soln1

$ ls file[1-2]*

file1.txt file2.txt

The command "ls *[1-2]????" lists all the files whose names contain the digit 1 or 2 followed by exactly 4 more characters.

$ ls *[1-2]????

1.txt 2.txt file1.txt file2.txt

Regular Expressions

Regular expressions offer powerful pattern matching capabilities. They are used by utilities such as ed, grep, sed and awk.

Recall that grep searches for a word or a pattern and prints the line if it finds it.

Star(*)

The star applies to the item just before it (a single character, or a group in parentheses) and means zero or more occurrences of it.

[amittal@hills w1]$ echo "The fox ate an orange" | grep fo*x

The fox ate an orange

[amittal@hills w1]$

[amittal@hills w1]$ echo "The foox ate an orange" | grep fo*x

The foox ate an orange

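The pattern we are searching for is "fo*x": an "f", followed by any number of "o"s (including none), followed by an "x". The pattern can match anywhere in the line. It is safer to quote the pattern, as in grep "fo*x", so that the shell does not first treat the star as a file name wildcard.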

[amittal@hills w1]$ echo "The fx ate an orange" | grep fo*x

The fx ate an orange

The following do not produce a match because the pattern needs an "f" at the beginning, an "x" at the end and nothing but "o"s in the middle.

$ echo "The fix ate an orange" | grep fo*x

$ echo "The fonx ate an orange" | grep fo*x
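The cases above can be tried in one go with a small loop. This is a sketch: the word list is taken from the examples, and the pattern is put in single quotes so the shell leaves the star alone.

```shell
matched=""
for word in fx fox foox fix fonx; do
    # grep -q prints nothing and only reports whether the line matched.
    if printf '%s\n' "$word" | grep -q 'fo*x'; then
        matched="$matched $word"
    fi
done
matched=${matched# }    # drop the leading space

echo "fo*x matches: $matched"
```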

Dot(.)

The "." can be used to match any single character.

[amittal@hills w1]$ echo "The fx ate an orange" | grep f.x

[amittal@hills w1]$

[amittal@hills w1]$ echo "The fox ate an orange" | grep f.x

The fox ate an orange

[amittal@hills w1]$ echo "The foox ate an orange" | grep f.x

[amittal@hills w1]$

In the last example the "foox" is not matched because the dot means a single character only. We can combine the star and dot to create ".*" which means any character repeated any number of times; essentially any string.

[amittal@hills w1]$ echo "The fioix ate an orange." | grep f.*x

The fioix ate an orange.
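A short sketch that puts "f.x" and "f.*x" next to each other, using words from the examples above:

```shell
dot_matches=""
dot_star_matches=""
for word in fx fox foox fioix; do
    # "f.x": exactly one character between the f and the x.
    if printf '%s\n' "$word" | grep -q 'f.x'; then
        dot_matches="$dot_matches $word"
    fi
    # "f.*x": any number of characters (including none) in between.
    if printf '%s\n' "$word" | grep -q 'f.*x'; then
        dot_star_matches="$dot_star_matches $word"
    fi
done
dot_matches=${dot_matches# }
dot_star_matches=${dot_star_matches# }

echo "f.x matches: $dot_matches"
echo "f.*x matches: $dot_star_matches"
```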

Caret

The caret is used to mean that the pattern must occur at the beginning of the line.

[amittal@hills w1]$ echo "The fox ate an orange." | grep ^fox

[amittal@hills w1]$

[amittal@hills w1]$ echo "The fox ate an orange." | grep ^The

The fox ate an orange.

The first example shows that "grep" does not find the word "fox" because it is not at the beginning of the line.

Dollar Sign

Similar to the caret the dollar sign is used to specify that the pattern must occur at the end of the line.

[amittal@hills w1]$ echo "The fox ate an orange." | grep ".$"

The fox ate an orange.

[amittal@hills w1]$ echo "The fox ate an orange" | grep ".$"

The fox ate an orange

[amittal@hills w1]$

[amittal@hills w1]$ echo "The fox ate an orange." | grep "\.$"

The fox ate an orange.

[amittal@hills w1]$

[amittal@hills w1]$ echo "The fox ate an orange" | grep "\.$"

[amittal@hills w1]$

We want to check if the line ends with a ".". However, the pattern ".$" matches even when the line does not end with a ".": since the dot represents any character, ".$" matches any line that has at least one character in it. To tell "grep" that we mean a literal dot we need to escape it with a backslash, as in "\.$".
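The caret and dollar sign examples can be collected into a few yes/no checks. This is a sketch; "matches" is a helper name made up for this example and is not a standard command.

```shell
# Made-up helper: prints "yes" if the line ($2) matches the pattern ($1).
matches() {
    if printf '%s\n' "$2" | grep -q -e "$1"; then echo yes; else echo no; fi
}

starts_with_the=$(matches '^The' 'The fox ate an orange.')
starts_with_fox=$(matches '^fox' 'The fox ate an orange.')
any_last_char=$(matches '.$' 'The fox ate an orange')
dot_missing=$(matches '\.$' 'The fox ate an orange')
dot_present=$(matches '\.$' 'The fox ate an orange.')

echo "^The: $starts_with_the  ^fox: $starts_with_fox"
echo ".\$ without a dot: $any_last_char"
echo "\\.\$ without a dot: $dot_missing  with a dot: $dot_present"
```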

Square Brackets

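The square brackets can be used to specify a range or a set of characters. As with the shell, no separator is needed inside the brackets: "[a,c]" matches "a", "c" or a comma, so "[ac]" is the usual way to write it.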

[amittal@hills w1]$ echo "aabc" | egrep "^[a,c]+$"

[amittal@hills w1]$

[amittal@hills w1]$ echo "aac" | egrep "^[a,c]+$"

aac

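[^...] matches any single character that is not one of the characters inside the square brackets.

[amittal@hills w1]$ echo "that" | grep "[^aeiou]+"

[amittal@hills w1]$

[amittal@hills w1]$ echo "tht" | grep "[^aeiou]"

tht

The first command prints nothing only because plain "grep" treats the "+" as a literal plus sign, which "that" does not contain. The word "that" does contain characters that are not vowels, so the pattern "[^aeiou]" on its own would match it as well.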
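To test that a whole line is made up only of certain characters, the brackets are usually combined with the caret, the plus sign and the dollar sign. The sketch below assumes "grep -E" for the plus sign; "matches" is a helper name made up for this example.

```shell
# Made-up helper: prints "yes" if the line ($2) matches the extended pattern ($1).
matches() {
    if printf '%s\n' "$2" | grep -E -q -e "$1"; then echo yes; else echo no; fi
}

only_a_and_c=$(matches '^[ac]+$' 'aac')        # every character is a or c
has_a_b=$(matches '^[ac]+$' 'aabc')            # the b breaks the match
no_vowels=$(matches '^[^aeiou]+$' 'tht')       # no vowel anywhere
one_vowel=$(matches '^[^aeiou]+$' 'that')      # the a breaks the match
some_non_vowel=$(matches '[^aeiou]' 'that')    # unanchored: the t is enough

echo "$only_a_and_c $has_a_b $no_vowels $one_vowel $some_non_vowel"
```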

Plus Sign

The plus sign means one or more occurrences of the character or string of characters before it.

[amittal@hills w1]$ echo "The fox ate an orange" | grep -E h+

The fox ate an orange

Notice we had to use the "-E" option to turn on extended regular expressions.

[amittal@hills w1]$ ls

a ab b c fb file1 file2 file3 file4 notes temp1

[amittal@hills w1]$ ls | grep -E file[1-4]+

file1

file2

file3

file4

Question Mark

The question mark applies to the character before it and means zero or one occurrence of that character.

[amittal@hills w1]$ echo "aab" | egrep aab?

aab

[amittal@hills w1]$ echo "aabb" | egrep aab?

aabb

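[amittal@hills w1]$ echo "aa" | egrep aab?

aa

[amittal@hills w1]$ echo "ac" | egrep aab?

[amittal@hills w1]$

The line "aa" matches because the "b" is optional, and "aabb" matches because the pattern is found at the start of the line. The line "ac" does not contain "aa" and so does not match.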
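The plus sign and the question mark can be compared with a short loop. This is a sketch using "grep -E"; the anchored pattern "^aab?$" is added here to show the difference between matching part of a line and matching the whole line.

```shell
plus_matches=""
for word in fx fox foox; do
    # "o+" needs at least one o, so "fx" is left out.
    if printf '%s\n' "$word" | grep -E -q 'fo+x'; then
        plus_matches="$plus_matches $word"
    fi
done
plus_matches=${plus_matches# }

anywhere=""
whole_line=""
for word in aa aab aabb ac; do
    # Unanchored: "aa" followed by an optional b, anywhere in the line.
    if printf '%s\n' "$word" | grep -E -q 'aab?'; then
        anywhere="$anywhere $word"
    fi
    # Anchored: the line must be exactly "aa" or "aab".
    if printf '%s\n' "$word" | grep -E -q '^aab?$'; then
        whole_line="$whole_line $word"
    fi
done
anywhere=${anywhere# }
whole_line=${whole_line# }

echo "fo+x: $plus_matches"
echo "aab?: $anywhere"
echo "^aab?\$: $whole_line"
```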

Repeating a pattern a certain number of times

A number in curly brackets after a character or a bracket expression states how many times it must repeat. Check if a lower case character is repeated 2 times:

[amittal@hills w1]$ echo " aaa" | grep -E " [a-z]{2}"

 aaa

The line matches because the pattern is not anchored at the end: a space followed by two lower case letters is found even though a third letter follows.

Repeat a lower case character 4 times:

[amittal@hills w1]$ echo " aaa" | grep -E " [a-z]{4} "

[amittal@hills w1]$

Repeat a lower case character 1 or 2 times:

[amittal@hills w1]$ echo " aaa " | grep -E "[a-z]{1,2} "

 aaa

[amittal@hills w1]$ echo " aa " | grep -E " [a-z]{2} "

 aa
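Whether a repeat count means "exactly" or "at least" depends on what surrounds it. The sketch below compares an unanchored count with counts anchored by the caret and dollar sign; the word list is made up for the illustration.

```shell
unanchored=""
exactly_two=""
one_or_two=""
for word in a aa aaa aaaa; do
    # Two letters in a row somewhere in the line.
    if printf '%s\n' "$word" | grep -E -q '[a-z]{2}'; then
        unanchored="$unanchored $word"
    fi
    # The whole line is exactly two letters.
    if printf '%s\n' "$word" | grep -E -q '^[a-z]{2}$'; then
        exactly_two="$exactly_two $word"
    fi
    # The whole line is one or two letters.
    if printf '%s\n' "$word" | grep -E -q '^[a-z]{1,2}$'; then
        one_or_two="$one_or_two $word"
    fi
done
unanchored=${unanchored# }
exactly_two=${exactly_two# }
one_or_two=${one_or_two# }

echo "[a-z]{2}: $unanchored"
echo "^[a-z]{2}\$: $exactly_two"
echo "^[a-z]{1,2}\$: $one_or_two"
```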

Character Classes

In addition to ranges inside square brackets we can use character classes to represent certain sets of characters.

[[:alnum:]] Any of [:digit:] or [:alpha:].

[[:alpha:]] Any letter: a b c d e f g h i j k l m n o p q r s t u v w x y z, A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.

[[:blank:]] Space or tab.

[[:digit:]] Any one of 0 1 2 3 4 5 6 7 8 9.

[[:lower:]] Any one of a b c d e f g h i j k l m n o p q r s t u v w x y z.

[[:punct:]] Any one of ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

[[:space:]] Any one of CR FF HT NL VT SPACE.

[[:upper:]] Any one of A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.

$ echo "a134" | grep [[:alpha:]]

a134

$ echo "134" | grep [[:alpha:]]

$ echo "134" | grep [[:digit:]]

134
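Character classes are often combined with anchors to check a whole value, for example that an input consists of digits only. This is a sketch; "matches" is a helper name made up for this example.

```shell
# Made-up helper: prints "yes" if the line ($2) matches the pattern ($1).
matches() {
    if printf '%s\n' "$2" | grep -q -e "$1"; then echo yes; else echo no; fi
}

has_letter=$(matches '[[:alpha:]]' 'a134')
no_letter=$(matches '[[:alpha:]]' '134')
# One digit followed by zero or more digits, and nothing else on the line.
all_digits=$(matches '^[[:digit:]][[:digit:]]*$' '134')
not_all_digits=$(matches '^[[:digit:]][[:digit:]]*$' 'a134')

echo "$has_letter $no_letter $all_digits $not_all_digits"
```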

Basic and Extended Expressions

In basic regular expressions the meta-characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ lose their special meaning.

If we want the characters to take on special meanings we need to escape them: ‘\?’, ‘\+’, ‘\{’, ‘\|’, ‘\(’, and ‘\)’.

In extended regular expressions it's the opposite. We do not have to escape the meta-characters, and in order to take off the special meaning we need to escape them.

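For the "grep" utility we can use the "egrep" or "grep -E" command for the extended usage. Newer versions of GNU grep print a warning that "egrep" is obsolescent, so "grep -E" is the safer choice.

For "sed" we can use the "-r" option, or "-E" if we are on the Mac.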

[amittal@hills sed]$ echo "letter" | grep "et+e"

[amittal@hills sed]$

[amittal@hills sed]$ echo "let+er" | grep "et+e"

let+er

In the first example, even though "t" is repeated 2 times, "grep" only looks for the literal "+". This is illustrated in the second search: this time grep finds the word because it contains the "+" character.

[amittal@hills sed]$ echo "letter" | grep "et\+e"

letter

To tell grep to consider the "+" character as special we escape it and this time it finds the match.

Now let's repeat the scenario with the extended grep.

[amittal@hills sed]$ echo "letter" | egrep "et+e"

letter

[amittal@hills sed]$ echo "let+er" | egrep "et+e"

[amittal@hills sed]$ echo "let+er" | egrep "et\+e"

let+er

In the first line :

echo "letter" | egrep "et+e"

the "+" is recognized as a special character and it finds the word.

echo "let+er" | egrep "et+e"

In the second line we are looking for the literal "+", but since the "+" is a meta character we are not going to find it. In order to take off the special meaning we need to escape the "+" and this is what happens on the third line.

echo "let+er" | egrep "et\+e"

The characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’, and ‘)’ were given special meanings after basic regular expressions had already been defined. If a utility such as "grep" had started treating these characters as special by default, old scripts would have broken. Instead, the extended behaviour was made available through "egrep" or the "-E" option.
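The basic and extended behaviour of "+" can be laid out as a set of yes/no checks. This is a sketch; "check" is a helper name made up for this example, and the basic interval "\{1,\}" is used as the portable way to say "one or more", since "\+" in basic mode is an extension that not every grep supports.

```shell
# Made-up helper: $1 is "basic" or "extended", $2 the pattern, $3 the text.
check() {
    if [ "$1" = extended ]; then
        if printf '%s\n' "$3" | grep -E -q -e "$2"; then echo yes; else echo no; fi
    else
        if printf '%s\n' "$3" | grep -q -e "$2"; then echo yes; else echo no; fi
    fi
}

basic_plus_on_letter=$(check basic 'et+e' 'letter')          # "+" is literal
basic_plus_on_literal=$(check basic 'et+e' 'let+er')         # literal "+" found
basic_interval=$(check basic 'et\{1,\}e' 'letter')           # one or more t
extended_plus_on_letter=$(check extended 'et+e' 'letter')    # "+" is special
extended_plus_on_literal=$(check extended 'et+e' 'let+er')   # no literal match
extended_escaped=$(check extended 'et\+e' 'let+er')          # escaped: literal

echo "basic: $basic_plus_on_letter $basic_plus_on_literal $basic_interval"
echo "extended: $extended_plus_on_letter $extended_plus_on_literal $extended_escaped"
```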

Exercises

Ex1:

Create some files in a folder . You can use the "touch" command if you like.

file1.txt file2.txt file3 file4

List files "file1.txt" and "file2.txt" without using the word "txt" in your search command. Do not use the file names explicitly.

Ex2:

Create some files in a folder . You can use the "touch" command if you like.

a ab b c fb file1 file2 file3 file4

List files "fb" and "file1" only.

Do not use the file names explicitly.

Ex3:

Write a grep command that will match a line if it contains a CA (California) license plate with the following format:

Digit UpperCase UpperCase UpperCase Digit Digit Digit

The line should contain only the license number and nothing else.

Ex4:

Write a grep command that matches the following kind of name. It has 3 words with the last word being "Blvd". Ex:

Amador Valley Blvd

All the words start with an uppercase and there are exactly 2 spaces in the phrase.

Ex5:

Suppose a folder has 10 files named "file1", "file2" ... "file10" .

Complete the following command:

ls "shell pattern"

to list the files 5 through 10 but you cannot use the numbers 5-10 .

Ex6:

Create 6 files with the following names:

file1, file2, file3 ... file6

Use the range square brackets [1-2] , [4-6] together with the curly brackets to output the files:

1 to 2 and 4 to 6 .

Ex7:

What pattern is satisfied by the following phrase:

egrep "^T{2}.*t$"

Use "echo" to test your answer out.

Ex8:

What pattern is satisfied by the following phrase:

egrep "[a,c]{1,3}"

If a string satisfies the above expression will it also satisfy the below expression ?

egrep "[a,c]{1,2}"

Ex9:

Sort the attached file "num.txt" so that the single digits are sorted first and then the double digits. The sorted file should look like "data.txt" .

Solutions

Solution 1

ls *.*

Solution 2

ls f*[b,1]

Solution 3

[amittal@hills ~]$ echo "3BPZ780" | egrep [0-9][A-Z]{3}[0-9]{3}

3BPZ780

This will find the license number. Now we need to make sure there are no other words on that line.

echo "3BPZ780" | egrep ^[0-9][A-Z]{3}[0-9]{3}$

Using the caret and the dollar sign we specify that the pattern must match at the beginning and at the end also.
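The anchored pattern can be wrapped in a small check and tried against a few inputs. This is a sketch: "is_plate" is a name made up for this example, the pattern is quoted so the shell leaves it alone, and LC_ALL=C is set on the assumption that we want "[A-Z]" to mean the plain ASCII uppercase letters regardless of the locale.

```shell
# Made-up helper: prints "yes" if the whole line is a plate in the given format.
is_plate() {
    if printf '%s\n' "$1" | LC_ALL=C grep -E -q '^[0-9][A-Z]{3}[0-9]{3}$'; then
        echo yes
    else
        echo no
    fi
}

plate_only=$(is_plate '3BPZ780')
extra_words=$(is_plate 'Plate 3BPZ780')
lower_case=$(is_plate '3bpz780')

echo "$plate_only $extra_words $lower_case"
```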

Solution 4

echo "Amador Valley Blvd" | egrep "^[A-Z][a-zA-Z]* [A-Z][a-zA-Z]* Blvd$"

Solution 5

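$ ls file{[!1-4],??}

file10 file5 file6 file7 file8 file9

The pattern "file[!1-4]" matches "file5" to "file9", and "file??" matches the only name with two characters after "file", which is "file10".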

Solution 6

$ ls file{[1-2],[4-6]}

file1 file2 file4 file5 file6

Solution 7

There should be 2 occurrences of capital "T" at the beginning and a small "t" at the end, with any number of characters in the middle.

$ echo "TThis is a test" | egrep "^T{2}.*t$"

TThis is a test

Solution 9

File: "mysort.sh"

cat num.txt | egrep "^[0-9]$" | sort > data1.txt

cat num.txt | egrep -v "^[0-9]$" | sort > data2.txt

cat data1.txt data2.txt > data.txt
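The script can be tried on a small sample. The sketch below follows the same three steps without the extra "cat"; the contents of "num.txt" are made up here, since the attached file is not shown, and the work is done in a temporary folder.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# Made-up sample input, one number per line.
printf '%s\n' 10 3 25 7 1 > num.txt

# Lines that are a single digit, sorted.
grep -E '^[0-9]$' num.txt | sort > data1.txt
# All other lines (-v inverts the match), sorted.
grep -E -v '^[0-9]$' num.txt | sort > data2.txt
cat data1.txt data2.txt > data.txt

# Join the lines with spaces so the result is easy to read.
result=$(paste -s -d ' ' data.txt)
echo "$result"

cd / && rm -rf "$dir"
```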
