All communication requires a language of some description. People communicate with each other in a natural language such as English; they communicate with computers in a formal language such as Python or Regular Expressions. When people communicate, the intended meaning is not always clear. Many have heard the reply ‘what you say is not logical’. Formal languages are designed to be unambiguous; every valid expression in the language has one and only one possible interpretation.
In formal language theory, an alphabet Σ is a set of possible symbols that can occur in a language. The alphabet of binary is Σ = {0,1} for example. The set Σ* is then the infinite set of all possible combinations of the elements in Σ: Σ* = {0, {0}, {1}, {0, 0}, {0, 1}, {1,0}, {1,1}, {0,0,0},...}. A language L in formal language theory is a subset of Σ*. The language dictates which combinations of the tokens of the alphabet are valid. These combinations of tokens from Σ are often referred to as the strings or words of the language L.
Example 1 The language of zero or more ‘a’ tokens, which includes the strings “” (the empty string), “a”, and “aaaaa” is constructed using an alphabet containing only one token ‘a’ (Σ = {a}). “abe” is not in the language since it contains tokens which are not ‘a’,
Example 2 The language of zero or more ‘a’ tokens followed by the same number of ‘b’ tokens, which includes the strings “” (the empty string), “ab”, and “aaaabbbb” but not “aaaccc” or “aaabb” is constructed using an alphabet containing two tokens ‘a’ and ‘b’ (Σ= {a,b}).
The tokens of a language (and thus the alphabet) do not need to be what we would traditionally call a character (e.g. a letter of the English alphabet). Our language could just as well be made up with an alphabet containing full word tokens.
A regular expression describes a particular regular language. Not all languages are regular languages, for instance, although we can describe the first example above with the regular expression /a*/+, there is no regular expression that can describe the language in the second example above. This is not to say that there are not other expressions that describe the second language — but that regular expressions are not general enough to describe them — this fsa is often stated as have limited expressive power.
Programs can match regular expressions against strings. Not only are regular expressions excellent for describing simple string expressions but they can be converted into a 'machine' that can match a string very quickly. These machines created from RE are called FSA or Finite state machine.
A FSA consists of a set of states and transitions between these states. A transition is labelled with the character of the input string that causes the transition between states to occur An FSA has a start state and may have one or more accept states. A FSA recognises a string by:
starting at the beginning of the string and in the start state
if the current character in the input string labels one or more of the transitions from the current state, then follow the transition to the next state
if the current character does not label any transition, reject the string
if we aren't at the end of the string, goto step 2
if the current state is an accept state, accept the string
EXAMPLE
Creating and evaluating a boolean query is much more complicated than a keyword or linear operator queries. The reason for this is that they represent a much more powerfull class of languages. In fact, it is the same class of languages Context-Free Grammars (CFGs) that you need to describe the language of equal numbers of a’s and b’s.
The power of CFGs is that they can describe nested or recursive structures.
A grammar consists of a number of production rules. A production rule, for example S ->a, states that to make an S you need to make an a. Alternatively, you can think of it as to recognise an a in the input string. Each production is made up of an uppercase symbol (called a non-terminal symbol) on the left and a list of one or more uppercase and lowercase (non-terminal and terminal) symbols on the right. Terminal symbols are from the alphabet of the languages and non-terminal symbols are used like variables in a programming language. When there are multiple productions with the same non-terminal on the left then this means there are multiple ways of producing the non-terminal. One non-terminal is designated the start symbol which means that all strings in the langues must be derived starting from the non-terminal symbol.
Examples
Regular expressions with interactives
More about RE (Python focus)