Regex

Author: Mahesh Yadav Gaddam

Editor: Pavan Devarakonda

Regular Expressions

Regular expressions are an essential skill for any middleware administrator. The basics are simple, and learning to craft them well makes you a more effective admin. Most SOA-based environments pass multi-megabyte files between services for processing. As a web service admin you may work with File Adapters and very large files, and you may need to validate data fields in those files. This article gives an overview of regex syntax with examples. In UNIX shell programming, regular expressions are used to search for text in files; remember that in UNIX everything can be treated as a file. Regular expressions are typically used with the grep command and with stream-manipulation tools such as sed, often in combination with xargs.

xargs

 

UNIX limits the total size of the arguments and environment that can be passed to a child process.

What happens when we have a list of 10,000 files to send to a command?

o   xargs handles this problem

o   Reads arguments from standard input

o   Sends them to commands that take file lists

o   May invoke the program several times, depending on the size of the arguments

 

 

Combining the find command with xargs

 find . -type f -print | xargs wc -l

-type f for files

-print to print them out

xargs invokes wc one or more times, for example:

wc -l a b c d e f g

wc -l h i j k l m n o

 

 

Compare to: find . -type f -exec wc -l {} \; which runs wc once for every file found.

The -n option can be used to limit the number of arguments passed to each invocation.
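The batching behaviour can be seen with a small sketch. Here echo stands in for a real command such as wc -l, and the seven single-letter arguments are made-up sample data rather than real file names.

```shell
# Feed seven arguments to xargs on standard input and allow
# at most three arguments per invocation of the command.
# echo is a stand-in so that each invocation prints one line.
batches=$(printf '%s\n' a b c d e f g | xargs -n 3 echo)

printf '%s\n' "$batches"
# a b c
# d e f
# g
```

Each output line corresponds to one invocation, so the command was run three times.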

 

 

What is a Regular Expression?

A regular expression (regex) describes a set of possible input strings.

Regular expressions descend from a fundamental concept in Computer Science called finite automata theory

 

Regular expressions are found throughout UNIX tools:

vi, ed, sed, and emacs

awk, Tcl, Perl, and Python

grep, egrep, fgrep

compilers

 

The simplest regular expressions are a string of literal characters to match.

The string matches the regular expression if it contains the substring.

A regular expression can match a string in more than one place.

The . regular expression can be used to match any single character.
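As a minimal sketch of literal and dot matching with grep, assume a made-up word list (the words are illustrative only):

```shell
# Hypothetical sample data, one word per line.
words=$(printf '%s\n' 'cat' 'concatenate' 'cot' 'dog')

# A literal regex matches every line that contains it as a substring.
literal=$(printf '%s\n' "$words" | grep 'cat')

# The dot matches any single character, so c.t matches both cat and cot.
dotted=$(printf '%s\n' "$words" | grep 'c.t')

printf 'literal:\n%s\n' "$literal"
printf 'dotted:\n%s\n' "$dotted"
```

The literal pattern selects cat and concatenate; the dotted pattern selects those two plus cot.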

Character Classes

Character classes [] can be used to match any one character from a specific set.

[aeiou] will match any of the characters a, e, i, o, or u

[kK]orn will match korn or Korn

Ranges can also be specified in character classes

[1-9] is the same as [123456789]

[abcde] is equivalent to [a-e]

You can also combine multiple ranges

[abcde123456789] is equivalent to [a-e1-9]

Note that the - character has a special meaning in a character class only when it is used within a range; placed first in the class, it is a literal.

[-123] would match the characters -, 1, 2, or 3

Negated Character Classes

Character classes can be negated with the [^] syntax.

[^aeiou] will match any character except a, e, i, o, or u
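The character class forms above can be tried with grep. This is a sketch over made-up sample lines; LC_ALL=C is set as an assumption so that ranges such as a-e behave as plain ASCII ranges regardless of the system locale.

```shell
# Assumption: force the C locale so ranges are plain ASCII ranges.
export LC_ALL=C

# Hypothetical sample data, one item per line.
lines=$(printf '%s\n' 'korn' 'Korn' 'born' 'a1' 'z0' '-')

# Either case of the first letter is accepted.
korn_matches=$(printf '%s\n' "$lines" | grep '[kK]orn')

# A letter a through e followed by a digit 1 through 9.
range_matches=$(printf '%s\n' "$lines" | grep '[a-e][1-9]')

# Negated class: lines containing at least one non-letter character.
negated_matches=$(printf '%s\n' "$lines" | grep '[^a-zA-Z]')

# A leading - is a literal, so this matches -, 1, 2, or 3.
hyphen_matches=$(printf '%s\n' "$lines" | grep '[-123]')

printf '%s\n' "$korn_matches" "$range_matches" "$negated_matches" "$hyphen_matches"
```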

Named Character Classes

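Commonly used character classes can be referred to by name (alpha, lower, upper, alnum, digit, punct, cntrl)

Syntax [:name:], written inside a character class

[a-zA-Z]                [[:alpha:]]

[a-zA-Z0-9]             [[:alnum:]]

[45a-z]                 [45[:lower:]]

Named classes are important for portability across languages and locales, where a range such as a-z may not cover every letter.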
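A short sketch of named classes with grep, using made-up sample lines. Note that the name, including its own brackets, sits inside the outer character class brackets.

```shell
# Hypothetical sample data, one item per line.
sample=$(printf '%s\n' 'abc' 'ABC' '123' 'a1' '45')

# Lines containing at least one digit.
has_digit=$(printf '%s\n' "$sample" | grep '[[:digit:]]')

# Lines containing at least one uppercase letter.
has_upper=$(printf '%s\n' "$sample" | grep '[[:upper:]]')

# A named class can be mixed with literal characters:
# lines containing 4, 5, or any lowercase letter.
mixed=$(printf '%s\n' "$sample" | grep '[45[:lower:]]')

printf '%s\n' "$has_digit" "$has_upper" "$mixed"
```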

Anchors

Anchors are used to match at the beginning or end of a line (or both).

^ means beginning of the line

$ means end of the line
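The effect of each anchor can be sketched with grep over a few made-up log-style lines (the messages are illustrative only):

```shell
# Hypothetical sample data, one message per line.
log=$(printf '%s\n' 'error: disk full' 'no error' 'error' 'fatal error:')

# ^ anchors the match to the beginning of the line.
starts=$(printf '%s\n' "$log" | grep '^error')

# $ anchors the match to the end of the line.
ends=$(printf '%s\n' "$log" | grep 'error$')

# Using both anchors requires the whole line to match.
whole=$(printf '%s\n' "$log" | grep '^error$')

printf 'starts:\n%s\nends:\n%s\nwhole:\n%s\n' "$starts" "$ends" "$whole"
```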

Repetition

The * is used to define zero or more occurrences of the single regular expression preceding it.

 

Match length

When more than one match is possible from the same starting point, the match will be the longest string that satisfies the regular expression.
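Repetition and match length can be made visible with grep and sed. In this sketch the sample strings are made up, and sed wraps whatever was matched in angle brackets (the & in the replacement stands for the matched text).

```shell
# b* allows zero or more b characters, so a plain "a" also matches.
star=$(printf '%s\n' 'ab' 'abbb' 'a' 'cb' | grep 'ab*')

# The match takes all three a characters, not just the first one.
longest_run=$(printf '%s\n' 'xaaay' | sed 's/aa*/<&>/')

# .* extends as far as it can, so the match ends at the last = sign.
longest_any=$(printf '%s\n' 'name=a=b' | sed 's/.*=/<&>/')

printf '%s\n' "$star" "$longest_run" "$longest_any"
```

The two sed results are x<aaa>y and <name=a=>b.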

 

Repetition Ranges

Regular expressions can also specify a range of repetitions, as follows.

{ } notation can specify a range of repetitions for the immediately preceding regex (in basic regular expressions, as used by plain grep and sed, the braces are written \{ \})

{n} means exactly n occurrences

{n,} means at least n occurrences

{n,m} means at least n occurrences but no more than m occurrences

 

Example:

.{0,} same as .*

a{2,} same as aaa*
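The three brace forms can be compared with grep -E, which accepts the unescaped { } notation. The sample lines are made up, and the ^ and $ anchors are added so that the whole line must satisfy the repetition count.

```shell
# Hypothetical sample data: runs of one to four a characters.
runs=$(printf '%s\n' 'a' 'aa' 'aaa' 'aaaa')

# Exactly two occurrences.
exactly=$(printf '%s\n' "$runs" | grep -E '^a{2}$')

# At least two occurrences.
at_least=$(printf '%s\n' "$runs" | grep -E '^a{2,}$')

# At least two but no more than three occurrences.
between=$(printf '%s\n' "$runs" | grep -E '^a{2,3}$')

# The same set as a{2,}, written with * instead of braces.
star_form=$(printf '%s\n' "$runs" | grep '^aaa*$')

printf '%s\n' "$exactly" "$at_least" "$between"
```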

Sub-expressions

If you want to group part of an expression so that * or { } applies to more than just the previous character, use ( ) notation

Subexpressions are treated like a single character

a* matches 0 or more occurrences of a

abc* matches ab, abc, abcc, abccc, …

(abc)* matches abc, abcabc, abcabcabc, …

(abc){2,3} matches abcabc or abcabcabc
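The difference between abc* and (abc)* can be sketched with grep -E, which accepts the unescaped ( ) notation. The sample lines are made up, and the anchors force the whole line to match.

```shell
# Hypothetical sample data, one string per line.
data=$(printf '%s\n' 'ab' 'abcc' 'abc' 'abcabc' 'abcabcabc' 'abcabcabcabc')

# Without grouping, * applies only to the c.
ungrouped=$(printf '%s\n' "$data" | grep -E '^abc*$')

# With grouping, * applies to the whole abc sequence.
grouped=$(printf '%s\n' "$data" | grep -E '^(abc)*$')

# Two or three repetitions of the group.
counted=$(printf '%s\n' "$data" | grep -E '^(abc){2,3}$')

printf 'ungrouped:\n%s\ngrouped:\n%s\ncounted:\n%s\n' "$ungrouped" "$grouped" "$counted"
```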
