Regular expressions and regular grammars There is a close connection between a regexp and a RG: the language of a regular grammar must be the set corresponding to some regexp, but it does not tell us how to find such a regexp. BNF to Syntax Diagram If the regexp is given in BNF, then we can find the regexp in a fairly straightforward way by drawing the syntax diagram for the grammar. Any adjacent symbols are a concatenation, and non-unique nonterminal is a disjunction, and any recursive rule is a closure. Any terminal is a state, and any nonterminal is a branch (in a normalised grammar). The number of branches corresponds to the number of unique prefixes in the righthand side of the rule for the nonterminal in question. Start with the "distinguished" (i.e. start) state. When the paths are completed, merge common suffixes iteratively (starting with the longest suffix) until there are no common suffixes. Merge the ends of the paths into a single exit path. To define the corresponding regexp, write terms for each disjunction, join concatenated symbols, mark all closures. Syntax Diagram to REGEXP To go from a syntax diagram to a RG, label each state, then write a rule for each pair of states, working from final state to start state. Eliminate single option rules. Compilers and parsing A regexp defines a FSM. That is, a regexp is a kind of high-level program that must be translated into a low-level program that can be executed by a computer. This translation is called "compiling". Two things are necessary to compile a regexp into an executable program: 1. the regexp must be wellformed (legal) 2. a systematic method for translation must be defined Determining whether an expression is wellformed is called "parsing". Recall that a grammar defines a set of productions for constructing expressions. Precisely how an expression is derived from a set of productions is its parse (or its derivation). There may be more than one derivation for an expression, but as long as there's at least one derivation then the expression is wellformed. Parsing, and compiler writing, are well-developed fields in computer science, and many advanced treatments have been developed. We will touch upon the basic principles necessary for compiling our regexp search pattern. Parsing Determining whether a regexp is wellformed is the same as determining whether it is part of the language of regexps. A language is defined by a grammar, thus we first need a grammar for regexps. What kind of grammar? Can we define the language of RGs by another RG? No. As it happens we need a stronger formalism ... specifically we need a CFG. We can only tell if a regexp is wellformed if it can be derived by the CFG that defines the language of regexps. Example CFG for regexps E -> T E -> T E T -> F T -> F* T -> F + E F -> v F -> [E] E=expression T=term F=factor How is a regexp derived from this grammar? We need to parse the regexp using the grammar for all regexps. Top-down parse Start from the "distinguished" nonterminal (the start symbol) and rewrite nonterminals until all terminals of the regexp are accounted for with nothing leftover and nothing missing. Recursive descent parsers are depth first, top-down parsers. Note they will not work for left-recursive rules. Breadth first top down parsers keep a list of all possible rewrites. Gets cumbersome. Bottom-up Bottom-up parsers try to fit the data to rules by matching terminals against possible righthand sides of rules. Everytime a righthand side is fully accounted for, it is replaced by its corresponding lefthand symbol. If the input can be reduced to the start symbol, the expression is wellformed. Choosing your weapon Often, bottom-up parsing can be done simply by looking for illegal symbol sequences, such as ++ or ** or unbalanced parentheses. If no bad sequences are detected, the expression must be wellformed. Unfortunately, for pattern searching we want more than simple expression validation ... ultimately we need to build a FSM. That is, we need to build a compiler ... where parsing is part of the process. Recursive descent parsers are generally the easiest to implement. For the most part, a grammar can be almost directly translated into a C program. This has the side benefit of producing a parser at the same time. And the parser can be transformed into a compiler. E.g. Given global variables p (a string containg the regexp) and j (the index of the symbol in p that is being parsed). Grammar ------- E -> T E -> T E T -> F T -> F* T -> F+E F -> v F -> [E] Parser ------ expression() { term(); if(isvocab(p[j])||p[j]=='[') expression(); } term() { factor(); if(p[j]=='*') j++; if(p[j]=='+'){ j++; expression(); } } factor() { if(isvocab(p[j])) j++; else if(p[j]=='['){ j++; expression(); if(p[j]==']') j++; else error(); } else error(); } parse() { expression(); if(p[j])error(0,p); } Note that this is not exactly a direct translation from the grammar. For example, expression() examines the next character in order to decide what to do. This is called "lookahead", and is necessary to prevent the parser from recursing when it has reached the end of a bracketed expression. Not all grammars require this, but some require looking even further ahead. Note that a trace of the calling structure would display a parse tree. E.g [a*b+ac]d expression([a*b+ac]d) term([a*b+ac]d) factor([a*b+ac]d) expression(a*b+ac]d) term(a*b+ac]d) factor(a*b+ac]d) expression(b+ac]d) term(b+ac]d) factor(b+ac]d) expression(ac]d) term(ac]d) factor(ac]d) expression(c]d) term(c]d) factor(c]d) expression(d) term(d) factor(d) Left recursion Recursive descent parsers require that the grammar be written in a particular way ... it cannot be left recursive. That is, the recursive symbol must be rightmost, otherwise we may get an infinite loop. This is not always easy to detect as the recursion may be deferred through another rule. Example, E -> E T Compilers A program that translates from one language to another is called a compiler. E.g. translating a C program into assembly language. Our goal is to compile a regexp into a FSM comprised of the three arrays ch next 1 next2 To do this, we must create a state for each symbol in the regexp (except for the parentheses). The difficulty is keeping track of the information required to build each state. For example, to build the branching state of a disjunction requires that we build each machine of the disjunction first. Fortunately, we can turn our parser directly into a compiler by having procedures that consume input symbols build states. Imagine we have a procedure called setstate(state,ch,next1,next2) that sets the array values appropriately. Imagine also that we have a global variable "state" that keeps track of the state currently being built. Whenever a new state is created, state is incremented. Each function of our grammar builds a machine for a portion of the regexp, then returns the index of the initial state for that machine. E.g. E -> T E -> T E T -> F T -> F* T -> F+E F -> v F -> [E] int set_state(int s, char c, int n1, int n2) { ch[s]=c;next1[s]=n1;next2[s]=n2; } int expression() { int r; r=term(n+1,p); if(isvocab(p[j])||p[j]=='[') expression(); return(r); } int term() { int r, t1,t2,f; f=state-1; r=t1=factor(n+1,p); if(p[j]=='*'){ set_state(state,' ',state+1,t1); j++; r=state; state++; } if(p[j]=='+'){ if(next1[f]==next2[f]) next2[f]=state; next1[f]=state; f=state-1; j++;r=state;state++; t2=expression(); set_state(r,' ',t1,t2); if(next1[f]==next2[f]) next2[f]=state; next1[f]=state; } return(r); } int factor() { int r; if(isvocab(p[j])){ set_state(state,p[j],state+1,state+1); j++;r=state; state++; } else if(p[j]=='['){ j++; r=expression(); if(p[j]==']') j++; else error(); } else error(); return(r); } /*****************************************************************/ parse() { int initial; initial=expression(0,p); if(p[j])error(0,p); set_state(state,' ',0,0); } E.g. [a*b+ac]d s ch 1 2 --+--+-+-+ 0 | a 1 1 1 | 3 0 2 | b 6 6 3 | 2 4 4 | a 5 5 5 | c 6 6 6 | d 7 7 7 | 0 0 start state: 1