In this section we will introduce context-free grammars, a more powerful method of describing languages.
Context-free grammars (CFGs) are used to describe context-free languages. A context-free grammar is a set of recursive rules used to generate patterns of strings. A context-free grammar can describe all regular languages and more, but they cannot describe all possible languages.
CFG’s are used to describe programming languages and parser programs in compilers can be generated automatically from context-free grammars.
Formal Definition:
A context-free grammar can be described by a four-element tuple (V, \Sigma, R, S),(V,Σ,R,S), where:
Set of variables: V is a finite set of variables (which are non-terminal);
Set of terminal symbols: Σ is a finite set (disjoint from V) of terminal symbols;
Set of rules: R is a set of production rules where each production rule maps a variable to a string s ∈ (V ∪ Σ) ∗ ;
Start symbol: S (which is in V ) which is a start symbol.
Example:
A -> 0A1
A -> B
B -> #
In the example above, the Context free languages are defined by having:
A set of variables (e.g. A, B)
A set of terminal characters (e.g. 0, 1, #)
A set of rules (see example above)
A -> 0A1
A -> B
B -> #
A start symbol, e.g. A
Regular languages are a subset of context free languages.
Context free languages were first created to describe human language
E.g. IK + STEM + NIET
Ik werk niet
JIJ/JE + STEM + T + NIET
Je werkt niet
Used in compilation of programming languages, e.g.
E.g.
S -> E ";"
| "{' SS "}" S "else" S
| "if" "(" E ")" S
SS -> ε
| SS S
E -> IDENT
Where ε stands for empty string and IDENT stands for identifier
These grammars can describe features that have a recursive structure which makes them useful in a variety of applications.
An important application of context-free grammars occurs in the specification and compilation of programming languages,
Often developers will learn the grammar of a language in order to learn that programming language referred to as the syntax.
Most compliers contain a component called a parser that extracts the meaning of a program prior to generating the compiled code or performing the interpreted execution.
A Grammar, lets call it G1, has the following alphabet {a, +, x, (, ) }
Our Grammar rules are as follows. Note, the pipe character | means or. So for example, and expression can be an "expression + term" or just a "term".
<EXPR> --> <EXPR> + <TERM> | <TERM>
<TERM> --> <TERM> x <FACTOR> | <FACTOR>
<FACTOR> --> (<EXPR>) | a
The two strings a + a x a and (a + a ) x a can be generated with the grammar G1.
The following parse trees are shown in the following figure:
Note, this looks very complicated, and you do not need to know the details of this. But what I do want you to take away from this is that parsers basically just pull apart text, into smaller pieces according to a set of rules or a "language".
This is a simplified example that does not include the end HTML tags, however this is what a parse tree might look like from an HTML file:
In the next section we will look at a summary of implementation for these various representations of formal language.