grep

Prepared by: Mahesh Yadav Gaddam | Editor: Pavan Devarakonda

The grep command

This article on the grep command should be helpful for system administrators (middleware admins, DBAs, WLA and WAS admins), developers, and anyone else who wants to learn more about one of the handiest UNIX commands. grep grew out of a search command in ed (the UNIX text editor), and its name stands for “global regular expression print”, after the ed command g/re/p.

 

In the hands of a system administrator, the powerful grep command can feel like a 'Trishul' (trident). Let's build a better understanding of grep in this article.

The ed search command g/re/p proved so useful that it was rewritten as a standalone utility.

Two other variants, egrep and fgrep, round out the grep family.

grep is the answer to those moments when you know you want the file that contains a specific phrase but can't remember its name.

The grep Command Family: Differences

grep - uses basic regular expressions (BREs) for pattern matching

fgrep - fixed-string grep (sometimes read as "fast grep"); it searches one or more files for lines that contain the given text string. It is easy to handle in scripts because the exit status is 0 if any line matches, 1 if none match, and 2 on errors (grep and egrep follow the same convention). fgrep has traditionally been faster than a normal grep search but is less flexible: it finds only fixed text and has no support for regular expressions.

egrep - extended grep, uses a more powerful set of regular expressions (EREs). Back-references are not part of the standard ERE syntax, although some implementations, such as GNU grep, accept them as an extension. It is traditionally described as the fastest member of the grep family, though actual speed depends on the implementation.

agrep - approximate grep; not standard
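
The difference between the three is easiest to see on a tiny input. The following is a minimal sketch, assuming a POSIX-style grep in which `grep -E` and `grep -F` behave like egrep and fgrep; the sample file and its three lines are made up for illustration.

```shell
# Hypothetical sample data: three lines written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'a.c' 'abc' 'a+c' > "$sample"

# grep (BRE): the dot matches any single character, so all three lines match.
bre_count=$(grep -c 'a.c' "$sample")

# grep -F (fgrep behaviour): the dot is literal, so only the line a.c matches.
fixed_count=$(grep -F -c 'a.c' "$sample")

# grep -E (egrep behaviour): alternation works without backslashes.
# This pattern matches abc and a+c, but not a.c.
ere_count=$(grep -E -c 'a(b|\+)c' "$sample")

echo "BRE: $bre_count  fixed: $fixed_count  ERE: $ere_count"
rm -f "$sample"
```

The same fixed-string search with a regular expression would need the dot escaped, which is exactly the work fgrep saves you.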

 

Syntax differences

Regular expression concepts we have seen so far are common to grep and egrep.

grep and egrep have slightly different syntax

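grep: basic regular expressions (BREs)

egrep: extended regular expressions (EREs), with the enhanced features discussed in this article

Major syntax differences:

grep: \( and \) for grouping, \{ and \} for repetition counts

egrep: ( and ), { and } for the same purposes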
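
As a rough illustration of the two spellings, the sketch below runs the same search twice, once in BRE syntax and once in ERE syntax. It assumes `grep -E` stands in for egrep; the three input lines are invented for the example.

```shell
# Made-up input: three lines held in a variable.
input=$(printf '%s\n' 'abab' 'ab' 'xababx')

# BRE: grouping and the repetition count need backslashes.
bre_hits=$(printf '%s\n' "$input" | grep '\(ab\)\{2\}')

# ERE: the same operators are written bare.
ere_hits=$(printf '%s\n' "$input" | grep -E '(ab){2}')

# Both searches select the lines that contain "ab" twice in a row.
printf '%s\n' "$bre_hits"
```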

 

Protecting Regex Metacharacters

Since many of the special characters used in regexes also have a special meaning to the shell, it's a good idea to get into the habit of single-quoting your regexes.

This protects any special characters from being operated on by the shell before grep sees them.

If you do it habitually, you won't have to worry about when it is necessary.
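
Here is a small sketch of what can go wrong without the quotes. The directory, the file names and the data are all hypothetical; the point is only that the shell may expand an unquoted pattern as a file name glob before grep ever runs.

```shell
# Hypothetical working directory with a data file and a file named "abc".
workdir=$(mktemp -d)
printf '%s\n' 'ac' 'abc' 'abbc' > "$workdir/data.txt"
: > "$workdir/abc"    # empty file whose name happens to match the glob ab*c

# Quoted: grep receives the regex ab*c and matches ac, abc and abbc.
quoted=$(cd "$workdir" && grep -c 'ab*c' data.txt)

# Unquoted: the shell first expands ab*c to the file name abc,
# so grep searches for the plain string abc instead.
unquoted=$(cd "$workdir" && grep -c ab*c data.txt)

echo "quoted: $quoted  unquoted: $unquoted"
rm -rf "$workdir"
```

The unquoted command silently gives a different answer depending on which files happen to exist in the current directory, which is why quoting by habit is the safer choice.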

Escaping Special Characters

Even though we single-quote our regexes so the shell won't interpret the special characters, some characters are still special to grep itself (e.g. * and .).

To match such a character literally, we escape it with a \ (backslash).

Suppose we want to search for the literal character sequence a*b*

Unless we do something special, this pattern matches zero or more ‘a’s followed by zero or more ‘b’s. Since that includes the empty string, it matches every line, which is not what we want.

a\*b\* fixes this: the asterisks are now treated as regular characters.
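
The effect of the backslashes can be checked with a short experiment. This is only a sketch on made-up input; `grep -F` is included as an assumed alternative for cases where the whole pattern is literal text.

```shell
# Made-up input: one line with the literal text a*b*, two without it.
input=$(printf '%s\n' 'a*b*' 'aabb' 'xyz')

# Unescaped: a*b* can match the empty string, so every line is selected.
unescaped=$(printf '%s\n' "$input" | grep -c 'a*b*')

# Escaped: only the line containing the literal characters a*b* is selected.
escaped=$(printf '%s\n' "$input" | grep -c 'a\*b\*')

# Fixed-string search: no escaping needed, same result as the escaped regex.
fixed=$(printf '%s\n' "$input" | grep -F -c 'a*b*')

echo "unescaped: $unescaped  escaped: $escaped  fixed: $fixed"
```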

egrep: Alternation

Regex also provides an alternation character | for matching one or another sub-expression

(T|Fl)an will match ‘Tan’ or ‘Flan’

^(From|Subject): will match the From and Subject lines of a typical email message

It matches the beginning of a line followed by either the characters ‘From’ or ‘Subject’, followed by a ‘:’

Subexpressions are used to limit the scope of the alternation

At(ten|nine)tion then matches “Attention” or “Atninetion”. Without the parentheses, Atten|ninetion would instead match “Atten” or “ninetion”.
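
The two alternation examples above can be tried out as follows. This is a sketch that uses `grep -E` in place of egrep; the mail headers and the word list are invented sample data.

```shell
# Invented mail text: only two lines start with From: or Subject:.
mail=$(printf '%s\n' 'From: alice@example.com' 'To: bob@example.com' \
                     'Subject: lunch' 'Re: From: quoted')
headers=$(printf '%s\n' "$mail" | grep -E '^(From|Subject):')
printf '%s\n' "$headers"

# Invented word list for the scoping example.
words=$(printf '%s\n' 'Attention' 'Atten' 'ninetion')

# With parentheses the alternation is limited to ten|nine.
grouped=$(printf '%s\n' "$words" | grep -E -c 'At(ten|nine)tion')

# Without them the pattern means "Atten" OR "ninetion".
ungrouped=$(printf '%s\n' "$words" | grep -E -c 'Atten|ninetion')

echo "grouped: $grouped  ungrouped: $ungrouped"
```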

egrep: Repetition Shorthands

The * (star) has already been seen to specify zero or more occurrences of the immediately preceding character

+ (plus) means “one or more”

abc+d will match ‘abcd’, ‘abccd’, or ‘abccccccd’ but will not match ‘abd’

The + is equivalent to {1,}

The ‘?’ (question mark) specifies an optional character, the single character that immediately precedes it

July? will match ‘Jul’ or ‘July’

The ? is equivalent to {0,1}

July? is also equivalent to (Jul|July)

The *, ?, and + are known as quantifiers because they specify the quantity of a match

Quantifiers can also be used with subexpressions

(a*c)+ will match ‘c’, ‘ac’, ‘aac’ or ‘aacaacac’ but will not match ‘a’ or a blank line
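
A quick way to check these quantifiers is to feed grep a few candidate strings and count the hits. The sketch below assumes `grep -E` for egrep and uses `-x` so that the whole line has to match, which mirrors the "will match / will not match" statements above; all input strings are made up.

```shell
# + : one or more of the preceding character (abd has no c, so it fails).
plus_hits=$(printf '%s\n' 'abd' 'abcd' 'abccd' 'abccccccd' |
            grep -E -x -c 'abc+d')

# ? : the preceding character is optional (June does not match).
opt_hits=$(printf '%s\n' 'Jul' 'July' 'June' |
           grep -E -x -c 'July?')

# Quantifier on a subexpression: a lone 'a' and the blank line are rejected.
group_hits=$(printf '%s\n' 'c' 'ac' 'aacaacac' 'a' '' |
             grep -E -x -c '(a*c)+')

echo "plus: $plus_hits  optional: $opt_hits  group: $group_hits"
```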

 

grep: Back-references

Sometimes it is handy to be able to refer to a match that was made earlier in a regex

This is done using backreferences

\n is the backreference specifier, where n is a number from 1 to 9

It matches the same text that was matched by the nth parenthesized subexpression

For example, to find if the first word of a line is the same as the last:

^\([[:alpha:]]\{1,\}\) .* \1$

The \([[:alpha:]]\{1,\}\) captures one or more letters as the first word, and the \1 at the end requires the line to finish with that same text
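
To see the back-reference at work, the pattern can be run against a few hand-made lines. This is only a sketch with invented input. Because the pattern has a space on each side of .*, a line needs at least two spaces between its first and last word to match.

```shell
# Invented input lines.
lines=$(printf '%s\n' 'hello big hello' 'hello big world' \
                      'go and go' 'hello hello')

# BRE with a back-reference: the first word must reappear at the end.
same=$(printf '%s\n' "$lines" | grep '^\([[:alpha:]]\{1,\}\) .* \1$')

# 'hello big world' fails because the last word differs;
# 'hello hello' fails because the pattern needs two spaces around .*
printf '%s\n' "$same"
```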

 

Practical Regex Examples

Variable names in C

[a-zA-Z_][a-zA-Z_0-9]*

To search for a dollar amount with optional cents, the expression could be as follows:

\$[0-9]+(\.[0-9][0-9])?

 

Similarly, you may need to search for a time of day:

(1[012]|[1-9]):[0-5][0-9] (am|pm)

 

Sometimes you need to match HTML tags such as the headers <h1> <H1> <h2> …

<[hH][1-4]>
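
The practical expressions above can be sanity-checked against a handful of test strings. The sketch below assumes `grep -E` for the extended syntax and uses `-x` where the whole string should match; every test string is invented for illustration.

```shell
# C variable names: 9lives is rejected because it starts with a digit.
ident_hits=$(printf '%s\n' 'count' '_tmp1' '9lives' |
             grep -E -x -c '[a-zA-Z_][a-zA-Z_0-9]*')

# Dollar amounts: 12.50 is rejected because the $ sign is missing.
money_hits=$(printf '%s\n' '$5' '$12.50' '12.50' |
             grep -E -x -c '\$[0-9]+(\.[0-9][0-9])?')

# Time of day: 13:00 pm and 7:75 am are rejected.
time_hits=$(printf '%s\n' '9:30 am' '12:05 pm' '13:00 pm' '7:75 am' |
            grep -E -x -c '(1[012]|[1-9]):[0-5][0-9] (am|pm)')

# HTML headers h1 to h4 in either case: <h5> and <p> are rejected.
header_hits=$(printf '%s\n' '<h1>Title' 'text <H3>' '<h5>' '<p>' |
              grep -E -c '<[hH][1-4]>')

echo "$ident_hits $money_hits $time_hits $header_hits"
```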

 

The grep Family

Syntax

grep [-hilnv] [-e expression] [expression] [filename]

egrep [-hilnv] [-e expression] [-f filename] [expression] [filename]

fgrep [-hilnxv] [-e string] [-f filename] [string] [filename]

-h     Do not display filenames

-i     Ignore case

-l     List only filenames containing matching lines

-n     Precede each matching line with its line number

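-v     Negate matches: print only the lines that do not match

-x     Match whole lines only (listed here for fgrep; POSIX grep also accepts -x)

-e expression        Specify expression as an option; useful when the expression begins with a -

-f filename          Take the regular expression (egrep) or a list of strings (fgrep) from filename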
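
The options are easier to remember after seeing each one in action. The following sketch runs them against two hypothetical log files created in a temporary directory; the file names and contents are assumptions made for the example, and `grep -F` stands in for fgrep.

```shell
# Hypothetical log files.
dir=$(mktemp -d)
printf '%s\n' 'Error: disk full' 'warning: low memory' 'error' > "$dir/app.log"
printf '%s\n' 'all good' > "$dir/other.log"
printf '%s\n' 'disk' 'memory' > "$dir/patterns.txt"

ci=$(grep -i -c 'error' "$dir/app.log")          # -i: ignore case
numbered=$(grep -n 'warning' "$dir/app.log")     # -n: prefix the line number
inverted=$(grep -v -i -c 'error' "$dir/app.log") # -v: lines that do NOT match
files=$(cd "$dir" && grep -l -i 'error' app.log other.log)  # -l: names only
no_names=$(cd "$dir" && grep -h 'good' app.log other.log)   # -h: omit names
exact=$(grep -F -x -c 'error' "$dir/app.log")    # -x: whole line must match
multi=$(grep -c -e 'disk' -e 'memory' "$dir/app.log")       # -e: pattern as option
from_file=$(grep -F -c -f "$dir/patterns.txt" "$dir/app.log")  # -f: patterns from file

echo "$ci | $numbered | $inverted | $files | $no_names | $exact | $multi | $from_file"
rm -rf "$dir"
```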

 

The grep Examples: Fun with the Dictionary

/usr/dict/words contains about 25,000 words (on many current systems the word list lives at /usr/share/dict/words instead)

egrep hh /usr/dict/words

beachhead

highhanded

withheld

withhold

egrep as a simple spelling checker: Specify plausible alternatives you know

egrep "n(ie|ei)ther" /usr/dict/words

neither

How many words have 3 a’s one letter apart?

egrep a.a.a /usr/dict/words | wc -l

54

egrep u.u.u /usr/dict/words

cumulus

Other Notes

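Use /dev/null as an extra file name

When grep is given more than one file, it prints the name of the file in front of each matching line, so adding /dev/null forces the file name to appear even when you search a single real file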

grep test bigfile

This is a test.

grep test /dev/null bigfile

bigfile:This is a test.
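
The /dev/null trick can be reproduced with a throwaway file. In this sketch the temporary file stands in for bigfile, so the printed prefix is whatever path mktemp returns.

```shell
# Temporary stand-in for bigfile.
bigfile=$(mktemp)
printf '%s\n' 'This is a test.' 'Another line.' > "$bigfile"

# One file on the command line: only the matching line is printed.
plain=$(grep test "$bigfile")

# Two files (one of them /dev/null): the file name is prefixed.
named=$(grep test /dev/null "$bigfile")

printf '%s\n' "$plain" "$named"
rm -f "$bigfile"
```

This is handy with tools such as find, where grep may be handed a single file at a time and would otherwise omit the name.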

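The return code of grep is useful in scripts. For example, the following command removes filename only if it contains the string fred: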

 grep fred filename > /dev/null && rm filename
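
The three exit statuses can be observed directly. The sketch below uses a temporary file and made-up search strings; the `|| var=$?` form is just one assumed way of capturing the status without stopping a script that runs under `set -e`.

```shell
# Temporary file containing the word fred.
target=$(mktemp)
printf '%s\n' 'fred was here' > "$target"

# 0: at least one line matched.
hit=0;  grep fred "$target" > /dev/null || hit=$?

# 1: no line matched.
miss=0; grep wilma "$target" > /dev/null || miss=$?

# Greater than 1 (typically 2): an error, here a file that does not exist.
err=0;  grep fred "$target.missing" > /dev/null 2>&1 || err=$?

echo "hit: $hit  miss: $miss  error: $err"

# The pattern from the article: remove the file only if it contains fred.
grep fred "$target" > /dev/null && rm "$target"
```

POSIX grep also has a -q option that suppresses output, so `grep -q fred filename && rm filename` is an equivalent spelling without the redirection.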

 
