Points from last day
Naive takes M*N comparisons in worst case.
KMP takes M+N comparisons in worst case, as we never compare a symbol
from the pattern again once it has succesfully matched.
BM takes M+N worst case, but averages closer to N/M if the alphabet is not
too small, because of the frequent mismatches in the last position.
Suffix tree string searching can be very fast. Tree constructed in
linear time, multiple searches are very fast (log_k M for an
alphabet of size k and patterns of length M)
Regular expressions and pattern searching
Regular expression searches are string searches for under-specified
patterns. It is often desirable to find patterns which share certain
characteristics, rather than explicit matches. E.g. my name with either
upper or lower case initial character. Before we see how to perform
regular expression searches, it is useful that we have a good grasp
of what a regular expression is.
Phrase structure grammars
PSGs describes valid expressions for a language using production rules,
or rewrite rules. For example, the rule
A -> a
Says that every occurrence of symbol A in an expression can be re-written
with the symbol a. The lefthand symbol is called a non-terminal and the
righthand symbol is called a terminal. Nonterminals always rewrite to
something else, and nonterminal never do.
A set of production rules is called a grammar, and it defines a set of
expressions. For example
A -> aA
A -> bB
B -> b
Defines the set of expressions that start with one or more a's and end
with two b's. For a grammar G, the set of expressions it defines is
called the language of G, written L(G).
There are four types of phrase structure grammars, but we are interested
in only two of them: type 2 and type 3 grammars, commonly called context-
free grammars and regular grammars respectively.
CFG are those grammars where the lefthand side of a rule consists of a
single nonterminal, and the righthand consists of one or more nonterminal
or terminal symbols.
E.g.
S -> aSb
describes the set of expressions consisting of one or more a's followed
by the same number of b's.
RG are those PSGs with a single nonterminal on the lefthand side and at
most one nonterminal on the righthand side which must be rightmost.
E.g.
A -> aA
A -> aB
B -> b
B -> bB
describes all strings consisting of one or more a's followed by one or
more b's. Note that the language of the CFG cannot be described by a
regular grammar. RGs can be modeled with FSMs, whereas CFGs require
a memory. Note also that all RGs are also CFGs.
Regular expressions
A regular expression is a pattern of symbols constructed from three
fundamental operations
1. concatenation: if two symbol are adjacent in the pattern then
there is a match iff the same symbols appear adjacent in
the text
e.g. ab means a followed by b
2. disjunction: if there are two alternatives in the pattern, there
is a match iff either of the alternatives occurs in the text:
the disjunction is denoted in the regular expression by the
symbol +
e.g. a+b means a or b
3. closure: if a portion of the pattern can be repeated arbitrarily
many times, there is a match iff the pattern occurs zero or
more times in the text: closure is denoted in a regular
expression by the symbol *
e.g. a* means zero or more adjacent a's
A string of symbols constructed from a finite alphabet (the terminals)
using these operations is called a regular expression. Any regular
expression describes the language of a regular grammar.
Definitions for regular expressions:
1. a terminal symbol is a regular expression
2. two concatenated regular expressions are a regular expression
3. two disjunct regular expressions are a regular excpression
4. the closure of a regular expression is a regular expression
5. if e is a regexp then (e) is equivalent to e
N.B. this differs from Sedgewick's syntax.
clarification examples
ab* means a followed by zero or more b's a(b)*
(ab)* means zero or more ab's
ab+c means ab or ac (ab)+(ac)
(a+b)* means any mixture of a's and b's (a*b*)*
RE's can be handled by FSMs. To construct the FSM, we first define a
correspondence between regular expression constructs and segments of a
FSM.
1. a terminal symbol corresponds to a two-state machine: the initial
state is labeled with the symbol and the second state is the final
state
2. concatenation for two finite state machines is achieved by merging
the final state of the first with the initial state of the second
3. disjunction is handled by creating a "branching machine" that
has one initial state leading to two final states; each final state
is merged with one of the initial states of the disjunct machines;
then the final states of the two disjunct machines are merged together
4. closure is handled by merging the final state with the initial state
of a branching machine, making this the new initial state of the
closure machine, then merging one of the new final states with the
original initial state
5. states are numbered consecutively as they are created
Example from the book (although a different machine results) ...
(A*B+AC)D
Abstract method:
recursively handle most deeply nested to least deeply nested
machines
1. construct machines for each symbol
2. handle closures
3. handle alternations (i.e. disjunctive selection)
4. handle concatenations
Representing the machine
Each state has a number, has a character it expects to read, has
one or two next states.
Use three arrays, one for the expected character (NULL if it is a
branching state only) and one each for
the possible next state numbers. Index into the arrays according to
the next state number.
0 1 2 3 4 5 6 7 8 9 10 11 12 13
ch
next1
next2
Start at state 0, and go through the states as dictated by the
symbol on the input. Try it for AABCD.
state zero says we expect no input and we go to state 7.
state 7 says expect no input and go to state 1 or state 6.
How do we know which state to go to?
How do we use this finite state automata to search for a pattern?
Since there are two possible next states
Deterministic vs nondeterministic searching
Note that the KMP and BM skiparrays are finite state automata. The
symbol being processed determines exactly what the next state of the
string searcher is going to be. More precisely, they are deterministic
finite state automata.
If the pattern we're looking for can be underspecified, then there is
some ambiguity in the notion of a correct match ... more than one
character sequence may be acceptable. We cannot necessarily decide just
by examining one character where we are in the search. For example,
imagine I wanted to look for a match on beach or beech ... when we find
the first e, we don't know if we should next look for an a or another e.
We need a non-deterministic automata for this kind of search.
Lists
Lists have a head and a tail. For sorted lists, when we want to add
something, we find where it belongs and reset a few pointers. Similarly
to delete we find the item we want to remove, reset a few pointers and
then release the node.
Stacks
For some applications, we would like to place some restrictions on
how list items are accessed. The most basic restricted access list
is a pushdown stack.
A stack is a list which allows insertions and deletions only at one end.
We refer to that end as the top. It is like an IN box on your desk. Things
are put onto to the top of the pile, and the first thing taken out of the
pile is the last thing that was put on top. LIFO.
The operation for putting something on top of the stack is called a "push".
The operation for removing something is called "pop". One application for
a stack is the processing of arithmetic expressions given in reverse Polish
notation, or postfix.
The folowing arithemtic expression, written in infix,
5 * (((9 + 8) * (4 * 6)) + 7)
can be rewritten in reverse postfix notation as
5 9 8 + 4 6 * * 7 + *
To process this, we push all numbers on to the stack, and when we encounter
an operator we pop the top two numbers on the stack and apply the operator,
then push the result back onto the stack. When we reach the end of the
input, we pop the result.
Stack, of course, can be implemented using linked lists.
Queues
Another kind of restricted access list is one where new items are always
put on the end of the list, and items to be processed are always gotten
from the front. This structure is called a queue, and supports FIFO
processing, like a line of customers at a bank or supermarket.
Where a stack only has a top, a queue has a head and a tail. Adding a
new item to the tail is called a put operation, and taking things off
the head is called a get operation (sometimes these are called insert
and remove).
Deques
For our non-deterministic FSM we need a data structure which is similar
to both a stack and a queue. We need to keep a list that includes all
the states we might be in. The head of the list holds one of the possible
states we might be in. To process it, we pop it off. Then we add the one
or two possible next states to the tail of the list. Once we have popped
off all possible current states, the head of the list will contain the
list of possible next states, which we process in the same way.
One catch, if the next state is a branch then we take the two possible
next states and pop them onto the front of the list. That is, we only
want all possible next states to process a symbol on the input.
This double-ended queue is called a deque. More accurately, it is a
restricted output deque in that while we can add to either end, we can
only remove from the head. The operations are put, get and push ... or
push pop and put, depending upon how you see it.
Now lets see how we can use this deque to process our regular expression.
Processing the input AABCD
The deque starts with a single item, called the SCAN, which separates
possible current states in front from possible next states in back. To
begin the search, we put the initial state as the possible current state.
The process proceeds by popping the current state. If a character is to
be matched, the input is checked. If the input corresponds then we put
the possible next states on the end of the deque. If the next state is
NULL, then the two possible next states are pushed onto the deque.
The main loop stops under one of three conditions
1. end of input reached (no match found)
2. only the SCAN is left in the deque (no match found)
3. the final state is reached (match)