Regular expressions and regular grammars
There is a close connection between a regexp and a RG: the language of a
regular grammar must be the set denoted by some regexp. This fact alone,
however, does not tell us how to find such a regexp.
BNF to Syntax Diagram
If the grammar is given in BNF, then we can find the regexp in a fairly
straightforward way by drawing the syntax diagram for the grammar.
Any adjacent symbols form a concatenation, any nonterminal defined by
more than one rule is a disjunction, and any recursive rule is a closure.
Any terminal is a state, and any nonterminal is a branch (in a normalised
grammar). The number of branches corresponds to the number of unique
prefixes on the righthand sides of the rules for the nonterminal in
question.
Start with the "distinguished" (i.e. start) state.
When the paths are completed, merge common suffixes iteratively (starting
with the longest suffix) until there are no common suffixes. Merge the
ends of the paths into a single exit path.
To define the corresponding regexp, write terms for each disjunction,
join concatenated symbols, mark all closures.
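As a small illustration (a made-up grammar, not one from elsewhere in these
notes), consider the regular grammar with the single nonterminal S:

```
S -> a S        recursive rule  => closure of a
S -> b          terminal         => exit path

corresponding regexp: a*b
```

The loop on a comes from the recursive rule; the path exits through b.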
Syntax Diagram to RG
To go from a syntax diagram to a RG, label each state, then write a rule for
each pair of states, working from final state to start state. Eliminate
single option rules.
Compilers and parsing
A regexp defines a FSM. That is, a regexp is a kind of high-level
program that must be translated into a low-level program that can be
executed by a computer. This translation is called "compiling".
Two things are necessary to compile a regexp into an executable program:
1. the regexp must be wellformed (legal)
2. a systematic method for translation must be defined
Determining whether an expression is wellformed is called "parsing". Recall
that a grammar defines a set of productions for constructing expressions.
Precisely how an expression is derived from a set of productions is its
parse (or its derivation). There may be more than one derivation for an
expression, but as long as there's at least one derivation then the
expression is wellformed.
Parsing, and compiler writing, are well-developed fields in computer
science, and many advanced treatments have been developed. We will
touch upon the basic principles necessary for compiling our regexp
search pattern.
Parsing
Determining whether a regexp is wellformed is the same as determining
whether it is part of the language of regexps. A language is defined
by a grammar, thus we first need a grammar for regexps. What kind of
grammar? Can we define the language of regexps by a RG? No. As it
happens we need a stronger formalism ... specifically we need a CFG.
We can only tell if a regexp is wellformed if it can be derived by the
CFG that defines the language of regexps.
Example CFG for regexps
E -> T
E -> T E
T -> F
T -> F*
T -> F + T
F -> v
F -> (E)
E=expression T=term F=factor v=any vocabulary symbol
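For example, the regexp ab* can be derived from this grammar as follows
(the rule used is shown at each step):

```
E => T E     (E -> T E)
  => F E     (T -> F)
  => a E     (F -> v, with v = a)
  => a T     (E -> T)
  => a F*    (T -> F*)
  => a b*    (F -> v, with v = b)
```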
How is a regexp derived from this grammar? We need to parse the regexp
using the grammar for all regexps.
Top-down parse
Start from the "distinguished" nonterminal (the start symbol) and rewrite
nonterminals until all terminals of the regexp are accounted for with
nothing leftover and nothing missing.
Recursive descent parsers are depth first, top-down parsers. Note they
will not work for left-recursive rules.
Breadth-first top-down parsers keep a list of all possible rewrites; this
quickly gets cumbersome.
Bottom-up
Bottom-up parsers try to fit the data to rules by matching terminals
against possible righthand sides of rules. Every time a righthand side
is fully accounted for, it is replaced by its corresponding lefthand symbol.
If the input can be reduced to the start symbol, the expression is wellformed.
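For example, using the regexp grammar above, a bottom-up parse of a*b
proceeds by these reductions:

```
a * b
F * b        (F -> v)
T b          (T -> F*)
T F          (F -> v)
T T          (T -> F)
T E          (E -> T)
E            (E -> T E)
```

Since the input reduces to the start symbol E, a*b is wellformed.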
Choosing your weapon
Often, bottom-up parsing can be done simply by looking for illegal symbol
sequences, such as ++ or ** or unbalanced parentheses. If no bad sequences
are detected, the expression must be wellformed. Unfortunately, for pattern
searching we want more than simple expression validation ... ultimately
we need to build a FSM. That is, we need to build a compiler ... where
parsing is part of the process.
Recursive descent parsers are generally the easiest to implement. For the
most part, a grammar can be translated almost directly into a C program,
so writing down the grammar effectively produces a parser at the same time.
And the parser can be transformed into a compiler.
E.g. Given global variables p (a string containing the regexp) and j (the
index of the symbol in p that is being parsed).
Grammar
-------
E -> T
E -> T E
T -> F
T -> F*
T -> F+T
F -> v
F -> (E)
Parser
------
expression()                 /* E -> T | T E */
{
term();
if(isvocab(p[j])||p[j]=='(') /* lookahead: another term follows */
expression();
}
term()                       /* T -> F | F* | F+T */
{
factor();
if(p[j]=='*') j++;
if(p[j]=='+'){ j++; term(); }
}
factor()                     /* F -> v | (E) */
{
if(isvocab(p[j])) j++;
else
if(p[j]=='('){
j++; expression();
if(p[j]==')') j++;
else error();
}
else error();
}
parse()
{
expression();
if(p[j])error();             /* leftover input is an error */
}
Note that this is not exactly a direct translation from the grammar. For
example, expression() examines the next character in order to decide what to
do. This is called "lookahead", and is necessary to prevent the parser from
recursing when it has reached the end of a bracketed expression.
Not all grammars require this, but some require looking even further ahead.
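The fragments above assume helpers isvocab() and error(). As a
self-contained sketch (here error() simply records failure, and the
vocabulary is assumed to be the lowercase letters), the parser can be
filled out like this:

```c
#include <assert.h>

static const char *p;   /* the regexp being parsed */
static int j;           /* index of the current symbol in p */
static int ok;          /* cleared by error() on a syntax error */

/* assumed helpers, not given in the notes */
static int isvocab(char c) { return c >= 'a' && c <= 'z'; }
static void error(void)   { ok = 0; }

static void expression(void);

static void factor(void)                 /* F -> v | (E) */
{
    if (isvocab(p[j])) j++;
    else if (p[j] == '(') {
        j++; expression();
        if (p[j] == ')') j++; else error();
    }
    else error();
}

static void term(void)                   /* T -> F | F* | F+T */
{
    factor();
    if (p[j] == '*') j++;
    if (p[j] == '+') { j++; term(); }
}

static void expression(void)             /* E -> T | T E */
{
    term();
    if (isvocab(p[j]) || p[j] == '(')    /* lookahead */
        expression();
}

static int parse(const char *regexp)     /* returns 1 if wellformed */
{
    p = regexp; j = 0; ok = 1;
    expression();
    if (p[j]) ok = 0;                    /* leftover input */
    return ok;
}
```

With this sketch, parse("a*b+ac") and parse("(a*b+ac)d") accept, while
parse("a++b") and parse("(ab") are rejected.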
Note that a trace of the calling structure would display a parse tree.
E.g. [a*b+ac]d (square brackets stand in for parentheses here)
expression([a*b+ac]d)
term([a*b+ac]d)
factor([a*b+ac]d)
expression(a*b+ac]d)
term(a*b+ac]d)
factor(a*b+ac]d)
expression(b+ac]d)
term(b+ac]d)
factor(b+ac]d)
term(ac]d)
factor(ac]d)
expression(c]d)
term(c]d)
factor(c]d)
expression(d)
term(d)
factor(d)
Left recursion
Recursive descent parsers require that the grammar be written in a particular
way ... it cannot be left recursive. That is, the recursive symbol must
be rightmost, otherwise we may get an infinite loop. This is not always
easy to detect as the recursion may be deferred through another rule.
For example, E -> E T is directly left recursive: the procedure for E
would call itself before consuming any input.
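A left-recursive rule can be rewritten right-recursively without changing
the language it generates; a minimal sketch:

```
Left recursive:    E -> E T      E -> T
Right recursive:   E -> T E      E -> T
```

Both forms generate sequences of one or more Ts, but only the second is
safe for a recursive descent parser (and it is the form used in the
example grammar above).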
Compilers
A program that translates from one language to another is called a
compiler. E.g. translating a C program into assembly language.
Our goal is to compile a regexp into a FSM comprising the three arrays
ch, next1 and next2.
To do this, we must create a state for each symbol in the regexp
(except for the parentheses). The difficulty is keeping track of the
information required to build each state. For example, to build the
branching state of a disjunction requires that we build each machine of
the disjunction first.
Fortunately, we can turn our parser directly into a compiler by having
procedures that consume input symbols build states.
Imagine we have a procedure called set_state(s,c,n1,n2) that
sets the array values appropriately. Imagine also that we have a global
variable "state" that keeps track of the state currently being built.
Whenever a new state is created, state is incremented.
Each function of our parser builds the machine for a portion of the regexp,
then returns the index of the initial state of that machine.
E.g. (note: square brackets stand in for parentheses)
E -> T
E -> T E
T -> F
T -> F*
T -> F+E
F -> v
F -> [E]
void set_state(int s, char c, int n1, int n2)
{ ch[s]=c;next1[s]=n1;next2[s]=n2; }
int expression()
{
int r;
r=term();
if(isvocab(p[j])||p[j]=='[') expression();
return(r);
}
int term()
{
int r, t1, t2, f;
f=state-1;                   /* state preceding this machine */
r=t1=factor();
if(p[j]=='*'){               /* closure: branch forward, loop back to t1 */
set_state(state,' ',state+1,t1);
j++; r=state; state++;
}
if(p[j]=='+'){               /* disjunction: T -> F+E */
if(next1[f]==next2[f])       /* point the preceding state at the branch */
next2[f]=state;
next1[f]=state;
f=state-1;                   /* last state of the first alternative */
j++;r=state;state++;
t2=expression();             /* the E of F+E */
set_state(r,' ',t1,t2);      /* the branch state itself */
if(next1[f]==next2[f])       /* first alternative skips past the second */
next2[f]=state;
next1[f]=state;
}
return(r);
}
int factor()
{
int r;
if(isvocab(p[j])){           /* F -> v: one state per vocabulary symbol */
set_state(state,p[j],state+1,state+1);
j++;r=state; state++;
}
else
if(p[j]=='['){               /* F -> [E] */
j++; r=expression();
if(p[j]==']')
j++;
else
error();
}
else
error();
return(r);
}
/*****************************************************************/
parse()
{
int initial;
initial=expression();        /* initial = start state of the machine */
if(p[j])error();
set_state(state,' ',0,0);    /* build the final state */
}
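As a self-contained sketch, the compiler can be filled out with the
assumed helpers (isvocab(), error()), the array declarations, and a small
added guard (f >= 0) for a disjunction that has no preceding state. Note
that the second alternative of a disjunction is parsed by expression(),
matching the rule T -> F+E:

```c
#include <assert.h>

#define MAXSTATES 100

static const char *p;                 /* the regexp being compiled */
static int j;                         /* current position in p */
static int state;                     /* next state number to allocate */
static char ch[MAXSTATES];            /* the three machine arrays */
static int next1[MAXSTATES], next2[MAXSTATES];

/* assumed helpers, not given in the notes */
static int isvocab(char c) { return c >= 'a' && c <= 'z'; }
static void error(void)    { /* a real compiler would report and stop */ }

static void set_state(int s, char c, int n1, int n2)
{ ch[s] = c; next1[s] = n1; next2[s] = n2; }

static int expression(void);

static int factor(void)               /* F -> v | [E] */
{
    int r = -1;
    if (isvocab(p[j])) {
        set_state(state, p[j], state + 1, state + 1);
        j++; r = state; state++;
    } else if (p[j] == '[') {
        j++; r = expression();
        if (p[j] == ']') j++; else error();
    } else error();
    return r;
}

static int term(void)                 /* T -> F | F* | F+E */
{
    int r, t1, t2, f;
    f = state - 1;                    /* state preceding this machine */
    r = t1 = factor();
    if (p[j] == '*') {                /* closure: branch forward, loop back */
        set_state(state, ' ', state + 1, t1);
        j++; r = state; state++;
    }
    if (p[j] == '+') {                /* disjunction */
        if (f >= 0) {                 /* added guard: no predecessor state */
            if (next1[f] == next2[f]) next2[f] = state;
            next1[f] = state;         /* predecessor enters the branch state */
        }
        f = state - 1;                /* last state of the first alternative */
        j++; r = state; state++;
        t2 = expression();            /* the E of F+E */
        set_state(r, ' ', t1, t2);    /* the branch state itself */
        if (next1[f] == next2[f]) next2[f] = state;
        next1[f] = state;             /* first alternative skips the second */
    }
    return r;
}

static int expression(void)           /* E -> T | T E */
{
    int r = term();
    if (isvocab(p[j]) || p[j] == '[') expression();
    return r;
}

static int compile_re(const char *regexp)   /* returns the start state */
{
    p = regexp; j = 0; state = 0;
    int initial = expression();
    if (p[j]) error();
    set_state(state, ' ', 0, 0);      /* build the final state */
    return initial;
}
```

Compiling [a*b+ac]d with this sketch produces the state table shown below,
with start state 1.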
E.g. [a*b+ac]d

 s | ch next1 next2
---+---------------
 0 | a    1     1
 1 |      3     0
 2 | b    6     6
 3 |      2     4
 4 | a    5     5
 5 | c    6     6
 6 | d    7     7
 7 |      0     0

start state: 1