| Exercise 2 |
Bison takes a context-free grammar specification and outputs C code that implements a bottom-up parser based on the grammar. It is possible to use Bison without understanding the LALR(1) parsing algorithm it uses - but when debugging Bison programs then the theory becomes helpful. However for simple grammars we can use Bison as a tool without paying (too much) attention to the underlying complications.
Bison functions in a similar manner to Flex, taking in a specification and outputting C. Bison files conventionally end in .y (this is the Yacc suffix). Unlike Flex the output is a C file with the same basic filename. So a typical Bison command sequence could be:
>bison grammar.y >gcc grammar.c
However, as we are going to define the lexical analyzer in a separate Flex file then things get a little more complicated.
We need to ensure that the tokens used in the parser produced by
Bison are the same as those coming out of the lexical analyzer. As the
parser is the dominant component of the front-end of the compiler we define the
alphabet of tokens in the Bison input file. We then generate a file
(filename.tab.h) to include when using Flex. This means
they are both using the same representation of the tokens.
So the whole sequence is modified to look something like this:
>bison -d gram1.y >flex -ogram1.yy.c gram1.l >gcc gram1.tab.c -lfl
Note that the Flex file 'includes' the token file and the Bison file 'includes' the lexical analyzer. The -d to Bison generates the token file so it can be included by Flex.
The inter-relationships between these files are typically managed using a
make file
Note that we do not need to use a separate Bison library (as in Flex or examples using Yacc) - unless we include some C that uses extra library functions.
Note: currently gcc is generating 3 warning
messages when compiling Bison-generated C files. They
don't appear to be significant. Also some machines in GB.04 generate
'clock skew' errors with make - again, no impact on
the actual output.
The Bison file format is very similar to the Flex format. Here is a small Bison file, gram1.y
%{
/*C declarations */
%}
%token WORD ADJECTIVE COLOUR NOUN FULLSTOP
%% /* the grammar section */
sentence: WORD ADJECTIVE COLOUR NOUN FULLSTOP { printf("valid sentence!"); }
;
%%
/* as with Flex, user-supplied C code */
/* include the Flex-generated lexical analyzer */
#include "gram1.yy.c"
main()
{
yyparse();
}
int yyerror(char *s) {
fprintf(stderr, "%s\n", s);
return 0;
}
Note:
%token
gram1.yy.c
yyerror
yyparse The grammar is designed to work with the following simple lexical analyzer, gram1.l
%{
#include "gram1.tab.h"
%}
LETTER [a-zA-Z]
WORD {LETTER}+
FULLSTOP "."
COLOUR "red"|"blue"|"green"
ADJECTIVE "bright"|"shiny"|"dull"
NOUN "car"|"truck"|"van"|"ute"
%%
{ADJECTIVE} { return ADJECTIVE; }
{COLOUR} { return COLOUR; }
{NOUN} { return NOUN; }
{FULLSTOP} { return FULLSTOP; }
{WORD} { return WORD; }
.|\n
%%
Note:
printfs but returns as
yylex is being called by the parser
gram1.tab.h make files are one way of organising these files, here is an
example Makefile
to coordinate all the different files. (Or you can write your own if you don't
like it.)
Join it all together to generate an executable that encompasses a lexical analyzer and a parser. The resulting program should print the success message for sentences like:
mjeff 50>gram1.out the shiny red car. valid sentence!
A recognizer is a limited version of a parser that just reports whether the input is valid according to the grammar. We have already got a recognizer for a very small subset of english.
Rewrite the grammar to allow any combination of adjectives so that it accepts sentences like this:
mjeff 81>gram2.out the red shiny bright blue ute. valid sentence!
They don't have to make sense :-)
Rewrite the grammar to express a constraint that colours must precede the other adjectives. So that these sentences are valid:
the red red shiny bright truck. the blue car. a red ute. the shiny truck.
but these sentences aren't:
the red shiny green van. a bright blue car.
Now try expressing the constraint that there must be at least one colour and one adjective in each sentence. So that these are now invalid sentences:
the shiny truck. a blue ute.
As well as tokens we can use characters in the grammar specification. The
FULLSTOP token just represents the single character
.
We can reduce our alphabet of explicit tokens by just passing the character
through to Bison. Just surround the character with '' in
the grammar. Replace the FULLSTOP token with character passing so
that the functionality isn't affected.
Allow the colours to be put in brakcets, (), e.g.
the (red) shiny van.
In Part 1 we created a recognizer - a program that checks the syntactic
structure of the token sequence against a grammar. So if we have NUMBER
PLUS NUMBER then we could accept it as a legal expresssion. But we also
need access to the actual values represented by the token NUMBER -
the values that are the token lexemes, or the value of yytext, in
the Flex input file. To do this we use a special variable called
yylval.
gram6.l is a simple lexical analyzer that returns a token alphabet consisting of:
INTEGER EOL '+' '-'
%{
#include "gram6.tab.h"
%}
DIGIT [0-9]
INTEGER {DIGIT}+
%%
{INTEGER} {
yylval = atoi(yytext);
return INTEGER;
}
"+" { return '+'; }
"-" { return '-'; }
\n { return EOL; }
.
%%
Note:
yylval is set before the token is returned.
yylval is where Bison expects to find the lexeme
information gram6.y
is a Bison grammar that accesses the lexemes returned from
yylex() and implements a small calculator.
%{
/*C declarations */
%}
%token INTEGER EOL
%% /* the grammar section */
line: exp EOL { printf("%d\n", $1); }
;
exp: INTEGER '+' INTEGER { $$ = $1 + $3; }
| INTEGER '-' INTEGER { $$ = $1 - $3; }
;
%%
#include "gram6.yy.c"
main()
{
yyparse();
}
int yyerror(char *s) {
fprintf(stderr, "%s\n", s);
return 0;
}
Note:
$$ variable which represents the left hand side value for
a non-terminal
$1, $2, $3, ... variables which
hold the values of the first, second, third items matched on the right hand
side of the grammar rule
'+' counts as one matched item so the
second INTEGER token is $3 rather than
$2
C action code in {} comes
before the final semi-colon. As we have to allow different
actions to occur when different alternatives are matched. We can have one
{} for each alternative production. $$ is not automatically set to something generally meaningful
($$ = $1 is the default), you have to construct $$
from $1, $2, ... etc. Later on we will be constructing
tree representations - where the subtree has children constructed from $1,
$2, ... etc.
One way to look at this is: whatever Flex puts into
yylval Bison gets out as $1, $2,
..
Of course we would like to able to parse sequences of these operators of
arbitrary length, such as 3+4-1 (and sequences of just one number
such as 7). To do so we will need a recursive grammar rule, you
have seen several examples so far on the course.
Write grammar rules to allow more than 2 operands.
Don't worry about shift/reduce conflicts if Bison reports any - Bison will apply its defaults and still generate a parser.
Because we can input a sentence like 10-3-2 Bison does
not know whether to interpret it as (10-3)-2 (=5) or 10-3(-2)
(=9). We want to able to specify the associativity of operators and there
are 2 ways to achieve this:
%token, we
can use:
%left '+'
to express the left associativity of +
%right and %nonassoc are also available
Use associativity to make the calculator correctly deal with: 10-3-2 (=5)
As with associativity there are two ways to deal with precedence:
In Bison we can specify the precedence of operators by the order in which we declare their associativity, the first definition has the lowest precedence and the last the highest. (To give operators the same precedence you can list more than one after a %<associativity> instruction)
With all this information you should be able to construct a calculator that
works for: 4*3-2+1 (and gets the result as 11)
An alternative at this point would be to look at 6,7 & 8 and then come back to this. These points contain information about debugging and verbose options which could simplify the development of this grammar.
Bison has a verbose option (-v) that generates a
filename.output file which is useful in checking the behaviour of
your grammar specification.
The format of this file is:
Conflict summary Grammar Terminals (and the rules in which they occur) Non-Terminals (and the rules in which they occur) The states and actions of the underlying DFA - with any conflicts highlighted
The output of the grammar and the terminals is useful to check that what you think you have given Bison as input is the same as what Bison thinks it has received.
Here is a typical Bison output file, gram7.output
It would be sensible to add the -v flag into your
Makefile
Generate one of these files for your grammar.
Bison can also produce a trace of its execution of the underlying DFA - this can be very useful in both understanding the behaviour of your program and the behaviour of bottom-up parsing in general.
To enable the trace function we need to do 2 things:
#define YYDEBUG 1 in the C declarations
section at the start of the Bison input file
main()
procedure before yyparse()extern int yydebug; yydebug = 1;
File gram7.y
shows this.
File trace7
shows the resulting execution when parsing the string 12-7+4
Try this on your grammar.
In case you thought we'd forgotten about Flex, we can access trace information for Flex as well.
To do this we need to do 2 things:
-d flag - add one to your
Makefile
main()
before yyparse()extern int yy_flex_debug; yy_flex_debug= 1;
So you can have Flex and Bison trace information running at the same time. Try it.
It doesn't look particularly pretty but once you have the code in you can
just set the variables to 0 to suppress either trace.
Debugging Information Summary: we have 3 (pre-built) sources of debugging information:
filename.output files from Bison
Of course, you always add more in the action parts of Flex and Bison files if you need.
The local Bison manual page is very detailed with lots of examples and explanation