Part 1 Bison

Bison takes a context-free grammar specification and outputs C code that implements a bottom-up parser based on the grammar. It is possible to use Bison without understanding the LALR(1) parsing algorithm it uses - but when debugging Bison programs then the theory becomes helpful. However for simple grammars we can use Bison as a tool without paying (too much) attention to the underlying complications.

Bison functions in a similar manner to Flex, taking in a specification and outputting C. Bison files conventionally end in .y (this is the Yacc suffix). Unlike Flex the output is a C file with the same basic filename. So a typical Bison command sequence could be:

>bison grammar.y
>gcc grammar.c

However, as we are going to define the lexical analyzer in a separate Flex file then things get a little more complicated.

We need to ensure that the tokens used in the parser produced by Bison are the same as those coming out of the lexical analyzer. As the parser is the dominant component of the front-end of the compiler we define the alphabet of tokens in the Bison input file. We then generate a file (filename.tab.h) to include when using Flex. This means they are both using the same representation of the tokens.

So the whole sequence is modified to look something like this:

>bison -d gram1.y
>flex -ogram1.yy.c gram1.l
>gcc gram1.tab.c -lfl

Note that the Flex file 'includes' the token file and the Bison file 'includes' the lexical analyzer. The -d to Bison generates the token file so it can be included by Flex.

The inter-relationships between these files are typically managed using a make file

Note that we do not need to use a separate Bison library (as in Flex or examples using Yacc) - unless we include some C that uses extra library functions.

Note: currently gcc is generating 3 warning messages when compiling Bison-generated C files. They don't appear to be significant. Also some machines in GB.04 generate 'clock skew' errors with make - again, no impact on the actual output.

1. Bison Files

The Bison file format is very similar to the Flex format. Here is a small Bison file, gram1.y


%{
  /*C declarations */
%}


%token WORD ADJECTIVE COLOUR NOUN FULLSTOP

%% /* the grammar section */


sentence: WORD ADJECTIVE COLOUR NOUN FULLSTOP { printf("valid sentence!"); }
        ;


%%
/* as with Flex, user-supplied C code */
/* include the Flex-generated lexical analyzer */
#include "gram1.yy.c"

main()
{
  yyparse();
}

int yyerror(char *s) {
  fprintf(stderr, "%s\n", s);
  return 0;
}

Note:

2. Linking the Lexical Analyzer

The grammar is designed to work with the following simple lexical analyzer, gram1.l


%{
#include "gram1.tab.h"
%}

LETTER [a-zA-Z]
WORD   {LETTER}+
FULLSTOP "."
COLOUR    "red"|"blue"|"green"
ADJECTIVE "bright"|"shiny"|"dull"
NOUN   "car"|"truck"|"van"|"ute"

%%

{ADJECTIVE}   { return ADJECTIVE; }
{COLOUR}      { return COLOUR; }
{NOUN}        { return NOUN; }
{FULLSTOP}    { return FULLSTOP; }
{WORD}        { return WORD; }
.|\n

%%

Note:

3. Getting it all together

make files are one way of organising these files, here is an example Makefile to coordinate all the different files. (Or you can write your own if you don't like it.)

Join it all together to generate an executable that encompasses a lexical analyzer and a parser. The resulting program should print the success message for sentences like:

mjeff 50>gram1.out
the shiny red car.
valid sentence!

4. Extending the recognizer

A recognizer is a limited version of a parser that just reports whether the input is valid according to the grammar. We have already got a recognizer for a very small subset of english.

Rewrite the grammar to allow any combination of adjectives so that it accepts sentences like this:

mjeff 81>gram2.out
the red shiny bright blue ute.
valid sentence!

They don't have to make sense :-)

5. Constraints

Rewrite the grammar to express a constraint that colours must precede the other adjectives. So that these sentences are valid:

the red red shiny bright truck.
the blue car.
a red ute.
the shiny truck.

but these sentences aren't:

the red shiny green van.
a bright blue car.

Now try expressing the constraint that there must be at least one colour and one adjective in each sentence. So that these are now invalid sentences:

the shiny truck.
a blue ute.

6. Characters

As well as tokens we can use characters in the grammar specification. The FULLSTOP token just represents the single character .

We can reduce our alphabet of explicit tokens by just passing the character through to Bison. Just surround the character with '' in the grammar. Replace the FULLSTOP token with character passing so that the functionality isn't affected.

Allow the colours to be put in brakcets, (), e.g.

 the (red) shiny van.
 

Part 2 Including Actions

In Part 1 we created a recognizer - a program that checks the syntactic structure of the token sequence against a grammar. So if we have NUMBER PLUS NUMBER then we could accept it as a legal expresssion. But we also need access to the actual values represented by the token NUMBER - the values that are the token lexemes, or the value of yytext, in the Flex input file. To do this we use a special variable called yylval.

1. Accessing the lexemes from Bison

gram6.l is a simple lexical analyzer that returns a token alphabet consisting of:

INTEGER EOL '+' '-'

%{
#include "gram6.tab.h"
%}

DIGIT [0-9]
INTEGER {DIGIT}+

%%


{INTEGER}  { 
            yylval = atoi(yytext); 
            return INTEGER;
           }

"+"        { return '+'; }
"-"        { return '-'; }
\n         { return EOL; }
.

%%

Note:

gram6.y is a Bison grammar that accesses the lexemes returned from yylex() and implements a small calculator.


%{
  /*C declarations */
%}

%token INTEGER EOL 

%% /* the grammar section */

line: exp EOL { printf("%d\n", $1); }
    ;

exp: INTEGER '+' INTEGER  { $$ = $1 + $3; } 
   | INTEGER '-' INTEGER  { $$ = $1 - $3; } 
   ;     

%%
#include "gram6.yy.c"

main()
{
        yyparse();
}

int yyerror(char *s) {
  fprintf(stderr, "%s\n", s);
  return 0;
}

Note:

$$ is not automatically set to something generally meaningful ($$ = $1 is the default), you have to construct $$ from $1, $2, ... etc. Later on we will be constructing tree representations - where the subtree has children constructed from $1, $2, ... etc.

One way to look at this is: whatever Flex puts into yylval Bison gets out as $1, $2, ..

2. Expressions

Of course we would like to able to parse sequences of these operators of arbitrary length, such as 3+4-1 (and sequences of just one number such as 7). To do so we will need a recursive grammar rule, you have seen several examples so far on the course.

Write grammar rules to allow more than 2 operands.

Don't worry about shift/reduce conflicts if Bison reports any - Bison will apply its defaults and still generate a parser.

3. Resolving Ambiguous Grammars

Because we can input a sentence like 10-3-2 Bison does not know whether to interpret it as (10-3)-2 (=5) or 10-3(-2) (=9). We want to able to specify the associativity of operators and there are 2 ways to achieve this:

Bison provides a simple way to express two common disambiguating rules: associativity and precedence.

4. Associativity

In the first section, next to %token, we can use:

%left '+'

to express the left associativity of +

%right and %nonassoc are also available

Use associativity to make the calculator correctly deal with: 10-3-2 (=5)

5. Precedence

As with associativity there are two ways to deal with precedence:

In Bison we can specify the precedence of operators by the order in which we declare their associativity, the first definition has the lowest precedence and the last the highest. (To give operators the same precedence you can list more than one after a %<associativity> instruction)

With all this information you should be able to construct a calculator that works for: 4*3-2+1 (and gets the result as 11)

An alternative at this point would be to look at 6,7 & 8 and then come back to this. These points contain information about debugging and verbose options which could simplify the development of this grammar.

6. Understanding what Bison is doing

Bison has a verbose option (-v) that generates a filename.output file which is useful in checking the behaviour of your grammar specification.

The format of this file is:

Conflict summary
Grammar
Terminals (and the rules in which they occur)
Non-Terminals (and the rules in which they occur)
The states and actions of the underlying DFA
 - with any conflicts highlighted

The output of the grammar and the terminals is useful to check that what you think you have given Bison as input is the same as what Bison thinks it has received.

Here is a typical Bison output file, gram7.output

It would be sensible to add the -v flag into your Makefile

Generate one of these files for your grammar.

7. Understanding what Bison is doing (2)

Bison can also produce a trace of its execution of the underlying DFA - this can be very useful in both understanding the behaviour of your program and the behaviour of bottom-up parsing in general.

To enable the trace function we need to do 2 things:

File gram7.y shows this.

File trace7 shows the resulting execution when parsing the string 12-7+4

Try this on your grammar.

8. Debugging Information Overload

In case you thought we'd forgotten about Flex, we can access trace information for Flex as well.

To do this we need to do 2 things:

So you can have Flex and Bison trace information running at the same time. Try it.

It doesn't look particularly pretty but once you have the code in you can just set the variables to 0 to suppress either trace.

Debugging Information Summary: we have 3 (pre-built) sources of debugging information:

Of course, you always add more in the action parts of Flex and Bison files if you need.


The local Bison manual page is very detailed with lots of examples and explanation