Points from last day Naive takes M*N comparisons in worst case. KMP takes M+N comparisons in worst case, as we never compare a symbol from the pattern again once it has succesfully matched. BM takes M+N worst case, but averages closer to N/M if the alphabet is not too small, because of the frequent mismatches in the last position. Suffix tree string searching can be very fast. Tree constructed in linear time, multiple searches are very fast (log_k M for an alphabet of size k and patterns of length M) Regular expressions and pattern searching Regular expression searches are string searches for under-specified patterns. It is often desirable to find patterns which share certain characteristics, rather than explicit matches. E.g. my name with either upper or lower case initial character. Before we see how to perform regular expression searches, it is useful that we have a good grasp of what a regular expression is. Phrase structure grammars PSGs describes valid expressions for a language using production rules, or rewrite rules. For example, the rule A -> a Says that every occurrence of symbol A in an expression can be re-written with the symbol a. The lefthand symbol is called a non-terminal and the righthand symbol is called a terminal. Nonterminals always rewrite to something else, and nonterminal never do. A set of production rules is called a grammar, and it defines a set of expressions. For example A -> aA A -> bB B -> b Defines the set of expressions that start with one or more a's and end with two b's. For a grammar G, the set of expressions it defines is called the language of G, written L(G). There are four types of phrase structure grammars, but we are interested in only two of them: type 2 and type 3 grammars, commonly called context- free grammars and regular grammars respectively. CFG are those grammars where the lefthand side of a rule consists of a single nonterminal, and the righthand consists of one or more nonterminal or terminal symbols. E.g. S -> aSb describes the set of expressions consisting of one or more a's followed by the same number of b's. RG are those PSGs with a single nonterminal on the lefthand side and at most one nonterminal on the righthand side which must be rightmost. E.g. A -> aA A -> aB B -> b B -> bB describes all strings consisting of one or more a's followed by one or more b's. Note that the language of the CFG cannot be described by a regular grammar. RGs can be modeled with FSMs, whereas CFGs require a memory. Note also that all RGs are also CFGs. Regular expressions A regular expression is a pattern of symbols constructed from three fundamental operations 1. concatenation: if two symbol are adjacent in the pattern then there is a match iff the same symbols appear adjacent in the text e.g. ab means a followed by b 2. disjunction: if there are two alternatives in the pattern, there is a match iff either of the alternatives occurs in the text: the disjunction is denoted in the regular expression by the symbol + e.g. a+b means a or b 3. closure: if a portion of the pattern can be repeated arbitrarily many times, there is a match iff the pattern occurs zero or more times in the text: closure is denoted in a regular expression by the symbol * e.g. a* means zero or more adjacent a's A string of symbols constructed from a finite alphabet (the terminals) using these operations is called a regular expression. Any regular expression describes the language of a regular grammar. Definitions for regular expressions: 1. a terminal symbol is a regular expression 2. two concatenated regular expressions are a regular expression 3. two disjunct regular expressions are a regular excpression 4. the closure of a regular expression is a regular expression 5. if e is a regexp then (e) is equivalent to e N.B. this differs from Sedgewick's syntax. clarification examples ab* means a followed by zero or more b's a(b)* (ab)* means zero or more ab's ab+c means ab or ac (ab)+(ac) (a+b)* means any mixture of a's and b's (a*b*)* RE's can be handled by FSMs. To construct the FSM, we first define a correspondence between regular expression constructs and segments of a FSM. 1. a terminal symbol corresponds to a two-state machine: the initial state is labeled with the symbol and the second state is the final state 2. concatenation for two finite state machines is achieved by merging the final state of the first with the initial state of the second 3. disjunction is handled by creating a "branching machine" that has one initial state leading to two final states; each final state is merged with one of the initial states of the disjunct machines; then the final states of the two disjunct machines are merged together 4. closure is handled by merging the final state with the initial state of a branching machine, making this the new initial state of the closure machine, then merging one of the new final states with the original initial state 5. states are numbered consecutively as they are created Example from the book (although a different machine results) ... (A*B+AC)D Abstract method: recursively handle most deeply nested to least deeply nested machines 1. construct machines for each symbol 3. handle closures 4. handle alternations (i.e. disjunctive selection) 5. handle concatenations Representing the machine Each state has a number, has a character it expects to read, has one or two next states. Use three arrays, one for the expected character (NULL if it is a branching state only) and one each for the possible next state numbers. Index into the arrays according to the next state number. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ch next1 next2 Start at state 0, and go through the states as dictated by the symbol on the input. Try it for AABCD. state zero says we expect no input and we go to state 7. state 7 says expect no input and go to state 1 or state 6. How do we know which state to go to? How do we use this finite state automata to search for a pattern? Since there are two possible next states Deterministic vs nondeterministic searching Note that the KMP and BM skiparrays are finite state automata. The symbol being processed determines exactly what the next state of the string searcher is going to be. More precisely, they are deterministic finite state automata. If the pattern we're looking for can be underspecified, then there is some ambiguity in the notion of a correct match ... more than one character sequence may be acceptable. We cannot necessarily decide just by examining one character where we are in the search. For example, imagine I wanted to look for a match on beach or beech ... when we find the first e, we don't know if we should next look for an a or another e. We need a non-deterministic automata for this kind of search. Lists Lists have a head and a tail. For sorted lists, when we want to add something, we find where it belongs and reset a few pointers. Similarly to delete we find the item we want to remove, reset a few pointers and then release the node. Stacks For some applications, we would like to place some restrictions on how list items are accessed. The most basic restricted access list is a pushdown stack. A stack is a list which allows insertions and deletions only at one end. We refer to that end as the top. It is like an IN box on your desk. Things are put onto to the top of the pile, and the first thing taken out of the pile is the last thing that was put on top. LIFO. The operation for putting something on top of the stack is called a "push". The operation for removing something is called "pop". One application for a stack is the processing of arithmetic expressions given in reverse Polish notation, or postfix. The folowing arithemtic expression, written in infix, 5 * (((9 + 8) * (4 * 6)) + 7) can be rewritten in reverse postfix notation as 5 9 8 + 4 6 * * 7 + * To process this, we push all numbers on to the stack, and when we encounter an operator we pop the top two numbers on the stack and apply the operator, then push the result back onto the stack. When we reach the end of the input, we pop the result. Stack, of course, can be implemented using linked lists. Queues Another kind of restricted access list is one where new items are always put on the end of the list, and items to be processed are always gotten from the front. This structure is called a queue, and supports FIFO processing, like a line of customers at a bank or supermarket. Where a stack only has a top, a queue has a head and a tail. Adding a new item to the tail is called a put operation, and taking things off the head is called a get operation (sometimes these are called insert and remove). Deques For our non-deterministic FSM we need a data structure which is similar to both a stack and a queue. We need to keep a list that includes all the states we might be in. The head of the list holds one of the possible states we might be in. To process it, we pop it off. Then we add the one or two possible next states to the tail of the list. Once we have popped off all possible current states, the head of the list will contain the list of possible next states, which we process in the same way. One catch, if the next state is a branch then we take the two possible next states and pop them onto the front of the list. That is, we only want all possible next states to process a symbol on the input. This double-ended queue is called a deque. More accurately, it is a restricted output deque in that while we can add to either end, we can only remove from the head. The operations are put, get and push ... or push pop and put, depending upon how you see it. Now lets see how we can use this deque to process our regular expression. Processing the input AABCD The deque starts with a single item, called the SCAN, which separates possible current states in front from possible next states in back. To begin the search, we put the initial state as the possible current state. The process proceeds by popping the current state. If a character is to be matched, the input is checked. If the input corresponds then we put the possible next states on the end of the deque. If the next state is NULL, then the two possible next states are pushed onto the deque. The main loop stops under one of three conditions 1. end of input reached (no match found) 2. only the SCAN is left in the deque (no match found) 3. the final state is reached (match)