Week 5.2
- Follow Sets
- LL(1) Grammars
- Java-CUP Parser Generator
- JFlex Lexical Analyser Generator
- Using JFlex and Java-Cup Together
- Extending the Calculator Example
1.0 - Follow Sets
- The Follow set for a non-terminal, N, is the set of terminal symbols that may follow N in any context within the grammar.
- To calculate the follow sets, we need to look at all the contexts in which N is used.
- The start symbol, $S$, of the grammar is slightly special, in that it can be followed by end-of-file.
- End-of-file is represented by the special terminal symbol “$”.
- With this convention, we avoid using “$” as an ordinary symbol within any of our example grammars, reserving that symbol for end-of-file.
1.1 - Formal Definition
-
A non-terminal $N$ is followed by a terminal symbol “a” if there is a derivation from $S\,\$$ (i.e. the start symbol followed by the end-of-file terminal symbol) in which “a” follows $N$:
$S\,\$ \Rightarrow^{*} \alpha\,N\,a\,\beta$
- The follow set of $N$ (denoted $\text{Follow}(N)$) contains all terminal symbols $a$ such that $a$ can follow $N$ in some derivation that starts from the start symbol of our grammar ($S$) followed by the end-of-file symbol (“$”):
$\text{Follow}(N) = \{\, a \mid S\,\$ \Rightarrow^{*} \alpha\,N\,a\,\beta \,\}$
- The sequence that we derive starts with a string of terminal and non-terminal symbols $\alpha$, followed by $N$ and $a$, then followed by $\beta$, which is a possibly empty sequence of terminal and non-terminal symbols.
-
Note that Follow sets only include terminal symbols, and may include the special terminal symbol for end-of-file, “$”, which only appears at the end of an input string. Note that:
- First sets never include the end-of-file symbol “$”
- Follow sets never include $\epsilon$ (unlike First sets, in which $\epsilon$ is used to denote that the construct is nullable).
1.2 - Rules for Calculating Follow Sets
-
Because the start symbol for the grammar should match the whole input string up to (but not including the end-of-file), the Follow set for the start symbol must include the end-of-file, which is represented here by the special terminal symbol $
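$\{\$\} \subseteq \text{Follow}(S)$
-
We can compute the Follow set for a non-terminal, $N$, using two facts:
-
If there is a production of the form $A \rightarrow \alpha\,N\,\beta$, then any symbol that can start $\beta$ can follow $N$, and hence $\text{Follow}(N)$ must include all of the terminal symbols in $\text{First}(\beta)$, with the exception of $\epsilon$:
$\text{First}(\beta) - \{\epsilon\} \subseteq \text{Follow}(N)$
-
If there is a production of the form $A \rightarrow \alpha\,N\,\beta$ and $\beta$ is nullable, then any token that can follow $A$ can also follow $N$, and hence:
$\text{Follow}(A) \subseteq \text{Follow}(N)$
The case where $\beta$ is nullable includes the case when $\beta$ is empty, and the production is of the form $A \rightarrow \alpha\,N$ - i.e. we have no $\beta$.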
1.3 - Calculating Follow Sets Algorithmically
- The process used to calculate follow sets algorithmically is to:
-
Initialise Follow Sets: Start with empty follow sets for all non-terminals, except for the start symbol $S$ (for which $\text{Follow}(S) = \{\$\}$).
-
Make a Pass: We make a pass through the grammar examining the right side of every production. For each occurrence of a non-terminal $N$ in the right side of some production, we augment (extend) the Follow set for that non-terminal according to the following process:
Assuming we are processing an occurrence of $N$ and the production is of the form $A \rightarrow \alpha\,N\,\beta$:
- Add $\text{First}(\beta) - \{\epsilon\}$ to $\text{Follow}(N)$
- If $\beta$ is nullable, add the current $\text{Follow}(A)$ to $\text{Follow}(N)$
-
Repeat: After making a complete pass in which we process every occurrence of a non-terminal on the right side of every production, we repeat the process of making a pass but start with the Follow sets computed thus far, rather than the initial (mostly empty) Follow sets.
This pass may or may not extend some Follow sets. If no Follow sets are modified in the pass, we are finished; otherwise we repeat the process. Because every Follow set is a subset of the finite set of terminal symbols (plus “$”), and each time we decide to repeat the process at least one Follow set must have been extended by at least one symbol, the whole process must terminate.
- The process terminates when a pass doesn’t augment / modify any follow sets
1.3.1 - Example of Calculating Follow Sets Algorithmically
-
Consider a grammar definition in which we have three non-terminal symbols, $S$, $A$ and $B$, and three productions.
-
To calculate the follow sets of non-terminal symbols, we need to know their first sets. The facts used below are that $B$ is nullable and that $\text{First}(B) - \{\epsilon\} = \{y, z\}$.
-
Initialise the follow sets.
$\text{Follow}(S) = \{\$\}$, $\text{Follow}(A) = \{\}$, $\text{Follow}(B) = \{\}$
-
In the first pass, we first look at the first production (for $S$), whose right side ends with $A\,B$.
Here, we see that the non-terminal symbol $A$ can be followed by anything that $B$ can start with. Therefore, we add $\text{First}(B) - \{\epsilon\}$ to $\text{Follow}(A)$. Since $B$ is nullable, we also need to include $\text{Follow}(S)$, so we add “$” to $\text{Follow}(A)$.
The non-terminal $B$ can be followed by an empty sequence of symbols, which means that it can be followed by anything that can follow $S$. At this point in time, we know that “$” can follow $S$, so we add “$” to $\text{Follow}(B)$.
$\text{Follow}(S) = \{\$\}$, $\text{Follow}(A) = \{y, z, \$\}$, $\text{Follow}(B) = \{\$\}$
We then look at the second production (for $A$).
From this production, we see that we have two alternatives:
- In the left alternative, there are no non-terminal symbols (so we don’t need to consider it)
- In the right alternative, we have the non-terminal symbol $B$
- Since $B$ is followed by the empty sequence of symbols, $\text{Follow}(B)$ must include $\text{Follow}(A)$, so we add $y$ and $z$ to $\text{Follow}(B)$
$\text{Follow}(S) = \{\$\}$, $\text{Follow}(A) = \{y, z, \$\}$, $\text{Follow}(B) = \{\$, y, z\}$
We then look at the third production.
From this production, we see that $x$ can follow $A$, so we add $x$ to the follow set of $A$. Since $x$ is a terminal symbol, and terminal symbols are never nullable, there are no further steps.
$\text{Follow}(S) = \{\$\}$, $\text{Follow}(A) = \{y, z, \$, x\}$, $\text{Follow}(B) = \{\$, y, z\}$
-
We pass through the grammar again to see if there are any changes that are to be made to the follow sets.
We look at the first production again
- The follow set of $A$ must include $\text{First}(B) - \{\epsilon\}$ ✅ It already includes $\{y, z\}$
- $B$ is nullable, so it must also include the follow set of $S$ ✅ It already includes $\{\$\}$
We then revisit the second production
- Since $B$ is followed by the empty sequence, $\text{Follow}(B)$ must include $\text{Follow}(A)$ - it doesn’t yet include $x$, so we add that to the follow set.
$\text{Follow}(S) = \{\$\}$, $\text{Follow}(A) = \{y, z, \$, x\}$, $\text{Follow}(B) = \{\$, y, z, x\}$
-
We pass through the grammar again, and notice that none of the follow sets change. Therefore, we’re done.
1.3.2 - Algorithm for Calculating Follow Sets
-
For calculating Follow sets, the Symbol type includes the special end-of-file terminal symbol $
Map⟨Symbol, Set⟨Symbol⟩⟩ Follow
calculateFollow()
    for Symbol N : g.nonterminals   // For each non-terminal N in the grammar (g)
        Follow(N) := {};            // Initialise the follow sets
    Follow(g.start) := {$};         // Add EOF symbol to start symbol's follow set
    do                              // Keep doing passes until no follow set changes
        saveFollow := Follow        // Full copy
        for p : g.production        // For each production
            updateFollow(p)
    while Follow != saveFollow      // Value equality, essentially "while changed"
-
The updateFollow procedure:
- Takes a production p as a parameter.
- Updates the Follow sets for the non-terminals based on p, and the current follow sets
- Processes the right side of the production in reverse order, so that when a symbol is reached, the set of terminals that can follow it within this production has already been built up
- followCurrent is maintained as the set of terminals that can follow the current symbol of the right hand side
- It is updated at the end of each iteration (ready for the next iteration)
- If the current symbol p.rhs(i) is nullable, followCurrent is updated with followCurrent ∪ (first(p.rhs(i)) - {ϵ})
- If p.rhs(i) is not nullable (including if it is a terminal symbol), followCurrent is set to first(p.rhs(i))
updateFollow(Production p)
    // Start from the last symbol on the RHS of the production
    i := length(p.rhs) - 1;
    followCurrent := Follow(p.lhs); // Can follow the whole RHS
    while 0 <= i // Process the RHS in reverse order
        if p.rhs(i) ∈ g.nonterminals
            Follow(p.rhs(i)) := Follow(p.rhs(i)) ∪ followCurrent;
            // Update followCurrent ready for the next iteration
            if ϵ ∈ first(p.rhs(i))
                // non-terminal symbol p.rhs(i) is nullable
                // Augment followCurrent with all of the terminal symbols the
                // current symbol (p.rhs(i)) can start with.
                followCurrent := followCurrent ∪ (first(p.rhs(i)) - {ϵ})
            else
                // non-terminal symbol p.rhs(i) is not nullable
                // Set followCurrent to the set of symbols that the
                // current symbol (p.rhs(i)) can start with.
                followCurrent := first(p.rhs(i))
        else // p.rhs(i) ∈ g.terminals
            // Don't need to augment the follow set, but need to update followCurrent.
            // Set followCurrent to the current symbol that we're looking at
            // (this MUST be in the follow set of the symbol that we look at next, as we're
            // moving from right to left).
            followCurrent := {p.rhs(i)}; // Update it for the next iteration
        i := i - 1
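The pseudocode above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the grammar (S → A B, A → y | z B, B → A x | ε) is a hypothetical one chosen for the sketch, the First sets are written in by hand rather than computed, and the names `FollowSketch`, `GRAMMAR` and `FIRST` are made up here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the fixed-point Follow-set computation on a small hypothetical grammar. */
final class FollowSketch {
    static final String EPS = "ε";
    static final String EOF = "$";

    // Hypothetical grammar (an assumption): S -> A B, A -> y | z B, B -> A x | (empty)
    static final Map<String, List<List<String>>> GRAMMAR = new LinkedHashMap<>();
    // First sets for the non-terminals, assumed to have been computed beforehand.
    static final Map<String, Set<String>> FIRST = new HashMap<>();

    static {
        GRAMMAR.put("S", List.of(List.of("A", "B")));
        GRAMMAR.put("A", List.of(List.of("y"), List.of("z", "B")));
        GRAMMAR.put("B", List.of(List.of("A", "x"), new ArrayList<String>()));
        FIRST.put("S", Set.of("y", "z"));
        FIRST.put("A", Set.of("y", "z"));
        FIRST.put("B", Set.of("y", "z", EPS));
    }

    /** Repeats passes over every production until no Follow set grows. */
    static Map<String, Set<String>> calculateFollow(String start) {
        Map<String, Set<String>> follow = new HashMap<>();
        for (String n : GRAMMAR.keySet()) {
            follow.put(n, new TreeSet<>());
        }
        follow.get(start).add(EOF);
        boolean changed;
        do {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : GRAMMAR.entrySet()) {
                for (List<String> rhs : e.getValue()) {
                    changed |= updateFollow(e.getKey(), rhs, follow);
                }
            }
        } while (changed);
        return follow;
    }

    /** Processes one right-hand side from right to left; returns true if a set grew. */
    static boolean updateFollow(String lhs, List<String> rhs,
                                Map<String, Set<String>> follow) {
        boolean changed = false;
        // Whatever can follow the left side can follow the whole right side.
        Set<String> followCurrent = new TreeSet<>(follow.get(lhs));
        for (int i = rhs.size() - 1; i >= 0; i--) {
            String sym = rhs.get(i);
            if (GRAMMAR.containsKey(sym)) {            // non-terminal
                changed |= follow.get(sym).addAll(followCurrent);
                Set<String> first = new TreeSet<>(FIRST.get(sym));
                boolean nullable = first.remove(EPS);
                if (!nullable) {
                    followCurrent = new TreeSet<>();   // nothing "shines through"
                }
                followCurrent.addAll(first);
            } else {                                   // terminal
                followCurrent = new TreeSet<>(Set.of(sym));
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        System.out.println(calculateFollow("S"));
    }
}
```

Unlike the pseudocode, the sketch tracks a `changed` flag instead of copying and comparing the whole map on every pass; both approaches stop at the same fixed point.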
2.0 - LL(1) Grammars
- LL(1) grammars are a class of grammars that are suitable for recursive descent predictive parsing with a single-symbol lookahead.
- The first “L” in LL(1) grammars refers to the fact that the parsing of the input is from Left to right.
- The second “L” in LL(1) grammars refers to the fact that the parsing produces a Leftmost derivation sequence
- The “1” indicates that the parser uses one-symbol lookahead (i.e. the current token is the single-token look-ahead)
2.1 - Formal Definition of LL(1) Grammars
-
A BNF grammar is LL(1) if for each non-terminal $N$, where $N \rightarrow \alpha_1 \mid \alpha_2 \mid \dots \mid \alpha_n$,
-
The First sets of each pair of alternatives for $N$ are disjoint:
$\forall i, j \in 1..n \cdot i \neq j \Rightarrow \text{First}(\alpha_i) \cap \text{First}(\alpha_j) = \{\}$
-
If $N$ is nullable, $\text{First}(N)$ and $\text{Follow}(N)$ are disjoint:
$\text{First}(N) \cap \text{Follow}(N) = \{\}$
-
Because the First set for an alternative includes $\epsilon$ if the alternative is nullable, the constraint that the First sets of all the alternatives are pairwise disjoint implies that at most one alternative is nullable
- This is part of the reason for the convention of including $\epsilon$ in the first sets.
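As a rough illustration of the two conditions, the check for a single non-terminal could be sketched in Java as below. The class and method names (`Ll1Check`, `isLl1`) are hypothetical, and the First sets of the alternatives and the Follow set are assumed to be supplied by the caller rather than computed.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the LL(1) conditions for the alternatives of one non-terminal N. */
final class Ll1Check {
    static final String EPS = "ε";

    /**
     * @param firstOfAlternatives First(alpha_i) for each alternative of N
     * @param follow              Follow(N)
     * @return true if both LL(1) conditions hold for N
     */
    static boolean isLl1(List<Set<String>> firstOfAlternatives, Set<String> follow) {
        // Condition 1: First sets are pairwise disjoint. Because ε is kept in
        // the First sets, this also rejects two nullable alternatives.
        Set<String> firstOfN = new HashSet<>();
        for (Set<String> first : firstOfAlternatives) {
            for (String symbol : first) {
                if (!firstOfN.add(symbol)) {
                    return false;
                }
            }
        }
        // Condition 2: if N is nullable, First(N) and Follow(N) are disjoint.
        if (firstOfN.contains(EPS)) {
            for (String symbol : follow) {
                if (firstOfN.contains(symbol)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // N -> "+" ... | (empty), with Follow(N) = { ")", "$" }
        System.out.println(isLl1(List.of(Set.of("+"), Set.of(EPS)), Set.of(")", "$")));
    }
}
```

A whole grammar would be LL(1) under this sketch if `isLl1` holds for every non-terminal.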
2.1.1 - LL(1) Grammars and Recursive Descent Parsing
- Given an LL(1) grammar, during recursive descent parsing, the current token (i.e. the look-ahead symbol) is either:
- In the First set of exactly one alternative for N, in which case that alternative is chosen (if the current token can start an alternative, we choose that alternative)
- In the Follow set of N where N is nullable, in which case the (unique) nullable alternative for N is chosen
- If neither of the above holds, there is a syntax error
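The three cases can be seen in a small hand-written recursive descent parser. The sketch below assumes a hypothetical LL(1) grammar, E → NUMBER R and R → "+" NUMBER R | ε, with tokens already split into strings and “$” appended as end-of-file; the class name `PredictiveSum` is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of predictive parsing for: E -> NUMBER R ; R -> "+" NUMBER R | (empty). */
final class PredictiveSum {
    private final List<String> tokens;
    private int pos = 0;

    private PredictiveSum(List<String> tokens) {
        this.tokens = tokens;
    }

    /** The current token is the single symbol of lookahead. */
    private String current() {
        return tokens.get(pos);
    }

    // E -> NUMBER R
    private int parseE() {
        int value = parseNumber();
        return value + parseR();
    }

    // R -> "+" NUMBER R | (empty), where First("+" NUMBER R) = {+} and Follow(R) = {$}
    private int parseR() {
        if (current().equals("+")) {            // token starts the first alternative
            pos++;
            int value = parseNumber();
            return value + parseR();
        } else if (current().equals("$")) {     // token is in Follow(R): take the empty alternative
            return 0;
        }
        throw new IllegalStateException("syntax error at token " + current());
    }

    private int parseNumber() {
        String token = current();
        if (!token.matches("[0-9]+")) {
            throw new IllegalStateException("syntax error at token " + token);
        }
        pos++;
        return Integer.parseInt(token);
    }

    /** Appends the end-of-file token and parses the whole input. */
    static int parse(String... input) {
        List<String> all = new ArrayList<>(Arrays.asList(input));
        all.add("$");
        return new PredictiveSum(all).parseE();
    }

    public static void main(String[] args) {
        System.out.println(parse("1", "+", "2", "+", "3")); // prints 6
    }
}
```

In `parseR` the choice between the alternatives never needs more than the current token, which is exactly what the LL(1) conditions guarantee.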
2.1.2 - EBNF Grammars and LL(1) Grammars
- An EBNF grammar can be considered LL(1) if, when converted to a BNF grammar using the rules described in the Recursive Descent parsing notes, the resulting BNF grammar is LL(1).
3.0 - Java-CUP Parser Generator
-
Takes BNF grammar as input and generates a parser.
-
Java-CUP generates a Shift-Reduce parser (not a Recursive Descent Parser)
-
The input to the Java-Cup Parser Generator is the Calc.cup file.
- At a high level the file contains a description of the grammar for our expressions in BNF form
- These BNF form grammars get translated to Java code by Java-CUP
- Note that as in previous lecture examples, we’ve specified our grammar in this way so that the PLUS and MINUS operators have the same precedence, but TIMES has higher precedence.
- Additionally, the parentheses operator has the highest precedence.
-
We want to both parse the code and execute it
- That’s why we have the additional lines (the action code between {: and :}) that specify the actions to be taken to execute the code
- All expressions, terms and factors have values - the action code sets these values as the input is parsed.
-
Suppose we take the case of parsing a number, using the following Java-CUP definition
F ::= NUMBER:n {: RESULT = n; :} ;
-
The grammar definition above essentially says that if we have a number, we can refer to its value using the identifier n.
-
If a number has a value n when we parse it, we set the value of the factor (RESULT) to n.
-
Suppose we take the case of parsing an expression, using the following Java-CUP definition:
F ::= LPAREN E:e RPAREN {: RESULT = e; :} ;
-
The value of an expression enclosed within parentheses is just the value of the expression itself.
-
// Expression definitions
E ::= E:e PLUS T:t
{: RESULT = e + t; :}
;
E ::= E:e MINUS T:t
{: RESULT = e - t; :}
;
E ::= T:t
{: RESULT = t; :}
;
// Term definitions
T ::= T:t TIMES F:f
{: RESULT = t * f; :}
;
T ::= F:f
{: RESULT = f; :}
;
// Factor definitions
F ::= LPAREN E:e RPAREN
{: RESULT = e; :}
;
F ::= NUMBER:n
{: RESULT = n; :}
;
-
The grammar has two more productions. The start symbol of the grammar is Commands, which has two alternatives: (1) the empty sequence of symbols, or (2) a command followed by a newline and further commands.
/* Top level allows multiple lines, each with a single expression */
Commands ::= /* empty */
    | Command NEWLINE Commands
    ;
-
When we parse a command, we either print out the value of the expression, or print out “?” to indicate that we’ve encountered an error - “error” in the Java-CUP language indicates a syntax error.
/* Match and evaluate an expression, or complain on a syntax error */
Command ::= E:e
        {: System.out.println( e ); :}
    | error
        {: System.out.println( "?" ); :}
    ;
-
We declare / define our terminal and non-terminal symbols
- If a terminal or non-terminal symbol has an attribute, we have to define the type of the attribute (see the terminal NUMBER and the non-terminals E, T, F)
- Note here that the attribute type must be a Java class, not a Java primitive type - e.g. “int” wouldn’t work.
/* Declaration of all the terminal symbols, including the type of their
 * attributes, if applicable. */
terminal Integer NUMBER;
terminal PLUS, MINUS, TIMES, LPAREN, RPAREN, NEWLINE, ILLEGAL;
/* Declaration of all the nonterminal symbols, including the type of their
 * attributes, if applicable. */
nonterminal Commands, Command;
nonterminal Integer E, T, F;
4.0 - JFlex Lexical Analyser Generator
-
Takes a description of the lexical tokens in the grammar (described using REGEX) and generates a scanner.
-
The input for the JFlex Lexical Analyser is in Calc.flex
-
The Calc.flex file contains a description of all of the lexical tokens in the grammar, defined using Regular Expressions (REGEX)
-
We first define our simple lexical tokens:
"+" { return makeToken( sym.PLUS ); }
"-" { return makeToken( sym.MINUS ); }
"*" { return makeToken( sym.TIMES ); }
"(" { return makeToken( sym.LPAREN ); }
")" { return makeToken( sym.RPAREN ); }
-
We also define other lexical tokens:
{LineTerminator} { return makeToken( sym.NEWLINE ); }
// Read in the whitespace and don't return a token; just ignore it.
{WhiteSpace} { /* ignore white space */ }
// A number is one or more occurrences of a digit.
{Digit}+ {
    int value;
    try {
        // yytext() is JFlex's way of referring to the string
        // that the pattern ({Digit}+ here) has matched.
        // We parse it as an integer and save its value.
        value = Integer.parseInt( yytext() );
    } catch( NumberFormatException e ) {
        /* Can only happen if the number is too big */
        System.out.println( "integer too large" );
        return makeToken( sym.ILLEGAL );
    }
    // If we successfully read in a number, make a token.
    // Number tokens have a value associated, "value".
    return makeToken( sym.NUMBER, value );
}
// Any other single character is an illegal token.
. { return makeToken( sym.ILLEGAL ); }
-
We know that these number types have values, as that’s how we defined them in Calc.cup.
/* Declaration of all the terminal symbols, including the type of their attributes, if applicable. */ terminal Integer NUMBER;
-
The curly-brace syntax refers to macros, which are defined a few lines above:
/* Macro Declarations
 * These declarations are regular expressions that will be used later
 * in the Lexical Rules Section.
 */
/* A line terminator is a \r (carriage return), \n (line feed), or \r\n. */
LineTerminator = \r|\n|\r\n
/* White space is a space, tab, or form feed. */
WhiteSpace = [ \t\f]
Digit = [0-9]
-
-
Note that this code not only defines WHAT these tokens are, but also some code to execute when a symbol is recognised.
4.1 - What Does the Lexical Analyser Do?
-
The Lexical Analyser tries to match the longest possible sequence of characters that matches one of the alternatives defined in Calc.flex.
-
If more than one alternative matches that same longest sequence, the Lexical Analyser chooses the one that is listed first
- Therefore, the alternatives defined in Calc.flex are listed in order of their precedence.
"+" { return makeToken( sym.PLUS ); }
"-" { return makeToken( sym.MINUS ); }
"*" { return makeToken( sym.TIMES ); }
"(" { return makeToken( sym.LPAREN ); }
")" { return makeToken( sym.RPAREN ); }
{LineTerminator} { return makeToken( sym.NEWLINE ); }
{WhiteSpace} { /* ignore white space */ }
{Digit}+ {
    int value;
    try {
        value = Integer.parseInt( yytext() );
    } catch( NumberFormatException e ) {
        /* Can only happen if the number is too big */
        System.out.println( "integer too large" );
        return makeToken( sym.ILLEGAL );
    }
    return makeToken( sym.NUMBER, value );
}
. { return makeToken( sym.ILLEGAL ); }
-
If none of the other alternatives match, the last rule (the “.” pattern on the last line) matches any single character
- This essentially creates a sym.ILLEGAL token if we haven’t matched anything that we recognise.
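The longest-match and first-rule-wins behaviour described above can be imitated with `java.util.regex`. This is only a sketch of the idea, not how JFlex is implemented (JFlex compiles the rules into a DFA); the rule table, including the `KW_LET` and `IDENTIFIER` rules, is a hypothetical one chosen so that a tie between two rules actually occurs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of "longest match, earliest rule wins ties" tokenisation. */
final class LongestMatchSketch {
    // Rules in priority order (hypothetical names and patterns).
    private static final String[][] RULES = {
        {"KW_LET", "let"},
        {"IDENTIFIER", "[a-zA-Z][a-zA-Z0-9]*"},
        {"NUMBER", "[0-9]+"},
        {"PLUS", "\\+"},
        {"NEWLINE", "\\r\\n|\\r|\\n"},
        {"WHITESPACE", "[ \\t\\f]+"},
        {"ILLEGAL", "(?s)."},   // catch-all: any single character
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLength = 0;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                // lookingAt() only accepts a match that starts at the region start.
                // The strict ">" keeps the earlier rule when lengths are equal.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestLength = m.end() - pos;
                    bestName = rule[0];
                }
            }
            // The catch-all rule guarantees bestName is set here.
            if (!bestName.equals("WHITESPACE")) {
                tokens.add(bestName + "(" + input.substring(pos, pos + bestLength) + ")");
            }
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("let letter 42+?"));
    }
}
```

For the input `let letter 42+?`, the word `let` matches both `KW_LET` and `IDENTIFIER` with length 3, so the earlier rule wins; `letter` matches `IDENTIFIER` with length 6, which beats the length-3 keyword match.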
4.2 - Other Definitions
%%
/* -----------------Options and Declarations Section----------------- */
/* The name of the class JFlex will create will be lexer.
 * Will write the code to the file lexer.java.
 */
%class lexer
%unicode
/* Make the resulting class public */
%public
/* Will switch to a CUP compatibility mode to interface with a CUP
* generated parser.
* The terminal symbols defined by CUP are placed in the class sym.
*/
%cup
- We tell JFlex to name the generated class lexer (%class), so the Lexical Analyser code is written to a file called lexer.java, and to generate a scanner that works on Unicode input (%unicode).
- We’ve also told JFlex to make the lexer class public (%public) and to use Java-CUP compatibility mode (%cup), most importantly getting the terminal symbols from sym.java.
/* The value returned at end of file.
*/
%eofval{
return makeToken( sym.EOF );
%eofval}
- We also define the desired behaviour when the scanner reaches end-of-file: it returns an EOF token.
/* The current line number can be accessed with the variable yyline
* and the current column number with the variable yycolumn.
*/
%line
%column
- When scanning, we’d like to know the current line number and column number - %line and %column tell JFlex to keep track of them in the variables yyline and yycolumn.
/* Declarations
* Code between %{ and %}, both of which must be at the beginning of a
* line, will be copied verbatim into the lexer class source.
* Here one declares member variables and functions that are used inside
* scanner actions.
*/
%{
    ComplexSymbolFactory sf;

    public lexer(java.io.Reader in, ComplexSymbolFactory sf){
        this(in);
        this.sf = sf;
    }

    /** To create a new java_cup.runtime.Symbol.
     * @param kind is an integer code representing the token.
     * Note that CUP and JFlex use integers to represent token kinds.
     */
    private Symbol makeToken( int kind ) {
        /* Symbol takes the token kind, and the locations of the
         * leftmost and rightmost characters of the substring of the
         * input file that matched the token.
         */
        // System.err.println( "Token " + yytext() + " " + kind );
        return sf.newSymbol( sym.terminalNames[kind], kind,
            new ComplexSymbolFactory.Location(yyline, yycolumn),
            new ComplexSymbolFactory.Location(yyline, yycolumn + yylength()) );
    }

    /** Also creates a new java_cup.runtime.Symbol with information
     * about the current token, but this object has a value.
     * @param kind is an integer code representing the token.
     * @param value is an arbitrary Java Object.
     * Below when tokens such as a NUMBER or IDENTIFIER are
     * recognised they pass values which are respectively
     * of type Integer and String. The types of these values *must*
     * match their type as declared in the Terminals sections
     * of the CUP specification.
     */
    private Symbol makeToken(int kind, Object value) {
        // System.err.println( "Token " + yytext() + " " + kind );
        return sf.newSymbol( sym.terminalNames[kind], kind,
            new ComplexSymbolFactory.Location(yyline, yycolumn),
            new ComplexSymbolFactory.Location(yyline, yycolumn + yylength()),
            value );
    }
%}
- We then include some code that is to be included in the lexer class, verbatim.
- We define the following:
- A constructor for the class lexer which has two inputs - java.io.Reader to read input from, and ComplexSymbolFactory to create tokens.
- The makeToken function that we were using in the action code when we wanted to make tokens after detecting a particular lexical token.
- Note that there are two definitions of the makeToken function
- makeToken(int kind) - used when the token doesn’t have a value
- makeToken(int kind, Object value) - used when the token has a value.
- Looking at the first makeToken definition.
- It takes in (int kind) - which refers to the integer to token mapping defined in sym.java
- We use the symbol factory that was passed into the lexer class constructor to create the symbol.
- We use the name defined in sym - sym.terminalNames[kind] gets the name of the symbol.
- We also pass in two locations - where the symbol starts, and where the symbol ends
- This is used to provide useful feedback to the user if an error occurs during the lexical analysis phase of compilation.
- Looking at the second makeToken definition
- The only difference is that it takes in an Object value type.
- It then passes this to the symbol factory’s method to create a new symbol along with the other parameters described in the simpler definition above.
4.3 - Running Java-CUP on Calc.cup
-
There’s a bunch of stuff that we don’t need to know yet - we haven’t gone through the theory of a shift-reduce parser yet.
-
However, we can see some of the “action code” that we wrote in Calc.cup appearing in the parser generated by Java-CUP, parser.java.
case 5: // E ::= E PLUS T
    {
        Integer RESULT =null;
        Location exleft = ((java_cup.runtime.ComplexSymbolFactory.ComplexSymbol)CUP$parser$stack.elementAt(CUP$parser$top-2)).xleft;
        Location exright = ((java_cup.runtime.ComplexSymbolFactory.ComplexSymbol)CUP$parser$stack.elementAt(CUP$parser$top-2)).xright;
        Integer e = (Integer)((java_cup.runtime.Symbol) CUP$parser$stack.elementAt(CUP$parser$top-2)).value;
        Location txleft = ((java_cup.runtime.ComplexSymbolFactory.ComplexSymbol)CUP$parser$stack.peek()).xleft;
        Location txright = ((java_cup.runtime.ComplexSymbolFactory.ComplexSymbol)CUP$parser$stack.peek()).xright;
        Integer t = (Integer)((java_cup.runtime.Symbol) CUP$parser$stack.peek()).value;
        RESULT = e + t;
        CUP$parser$result = parser.getSymbolFactory().newSymbol("E",2, ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-2)), ((java_cup.runtime.Symbol)CUP$parser$stack.peek()), RESULT);
    }
    return CUP$parser$result;
This corresponds to the following production in Calc.cup:
E ::= E:e PLUS T:t {: RESULT = e + t; :} ;
4.4 - Run Configurations
4.4.1 - Java-CUP Run Configuration [Calc_CUP]
-
We use Java-CUP to generate the Parser. The run configuration for Java-CUP essentially runs (the main function of java_cup) on Calc.cup
4.4.2 - JFlex Lexical Analyser Run Configuration [Calc_JFlex]
-
We use the JFlex Lexical Analyser to create a lexical analyser (as a class called lexer)
-
To run the lexical analyser, we first need the symbols which are created by Java-CUP, therefore, we specify that the Calc_CUP configuration is run beforehand!
4.4.3 - Calc Configuration
-
The Calc configuration actually allows us to use the calculator.
-
However, to use the calculator, we have to ensure that all of our generators have run!
- Configuration Calc runs Calc_JFlex
- Configuration Calc_JFlex runs Calc_CUP.
4.5 - CalcCUP Java Class
import java.io.IOException;
import java.io.InputStreamReader;
import java_cup.runtime.ComplexSymbolFactory;

public class CalcCUP {
    public static void main(String[] args) throws java.lang.Exception {
        ComplexSymbolFactory csf = new ComplexSymbolFactory();
        lexer calcLexer = new lexer( new InputStreamReader( System.in ), csf );
        parser calcParser = new parser( calcLexer, csf );
        try {
            //calcParser.debug_parse();
            calcParser.parse();
        } catch( IOException e ) {
            System.out.println( "Got IOException: " + e + "... Aborting" );
            System.exit(1);
        }
    }
}
- In the main method of CalcCUP we:
- Create a complex symbol factory, csf
- Create a lexer, calcLexer (which uses the csf defined before), and a parser, calcParser (which uses the lexer and csf defined before)
- Then, the calcParser.parse(); method runs the calculator.
5.0 - Using JFlex and Java-Cup Together
-
JFlex can be set up to work with Java-CUP
- They need to be able to communicate what the terminal symbols in the grammar are.
- When Java-CUP is run on Calc.cup, it not only generates parser.java but also sym.java, which contains integer constants for the terminal symbols.
- sym.java associates each of the terminal symbols with an integer code and a name (contained in the terminalNames array)
//----------------------------------------------------
// The following code was generated by CUP v0.11b 20160615 (GIT 4ac7450)
//----------------------------------------------------
/** CUP generated interface containing symbol constants. */
public interface sym {
    /* terminals */
    public static final int MINUS = 4;
    public static final int NUMBER = 2;
    public static final int EOF = 0;
    public static final int PLUS = 3;
    public static final int error = 1;
    public static final int RPAREN = 7;
    public static final int TIMES = 5;
    public static final int ILLEGAL = 9;
    public static final int NEWLINE = 8;
    public static final int LPAREN = 6;
    public static final String[] terminalNames = new String[] {
        "EOF",
        "error",
        "NUMBER",
        "PLUS",
        "MINUS",
        "TIMES",
        "LPAREN",
        "RPAREN",
        "NEWLINE",
        "ILLEGAL"
    };
}
-
In the Calc.flex “action code”, we create tokens using the symbol definitions from Java-CUP’s sym.java.
"+" { return makeToken( sym.PLUS ); }
"-" { return makeToken( sym.MINUS ); }
"*" { return makeToken( sym.TIMES ); }
"(" { return makeToken( sym.LPAREN ); }
")" { return makeToken( sym.RPAREN ); }
6.0 - Extending the Calculator Example
-
We want to be able to use LET expressions, which allow us to specify identifiers (and their associated constant values).
1 + let x=2 in 2*x end
-
We can also define nested LET expressions
2 + let x=1 in let x=2 in x end end
6.1 - Implementing the LET Expression
6.1.1 - Modifying the Lexical Tokens
-
To implement this, we first need to modify the lexical tokens in the ECalc.flex file to include the new terminal symbols:
"+" { return makeToken( sym.PLUS ); }
"-" { return makeToken( sym.MINUS ); }
"*" { return makeToken( sym.TIMES ); }
"(" { return makeToken( sym.LPAREN ); }
")" { return makeToken( sym.RPAREN ); }
"=" { return makeToken( sym.EQUAL ); }
"let" { return makeToken( sym.KW_LET ); }
"in" { return makeToken( sym.KW_IN ); }
"end" { return makeToken( sym.KW_END ); }
-
Since we have configured Java-CUP and JFlex to work together, we need to define the new terminal symbols in ECalc.cup too.
terminal Integer NUMBER;
terminal String IDENTIFIER;
terminal PLUS, MINUS, TIMES, LPAREN, RPAREN, NEWLINE,
    KW_LET, KW_IN, KW_END, EQUAL, ILLEGAL;
-
Following these changes, we notice that KW_LET, KW_IN and KW_END appear in sym.java.
6.1.2 - Lexical Analyser - Adding Identifiers
- We also need to add identifiers to the lexical analyser.
{Letter}({Letter}|{Digit})*
{ return makeToken( sym.IDENTIFIER, yytext() ); }
-
An identifier must start with a letter, and may be followed by 0 or more occurrences of letters and digits.
-
Note that letter and digit are defined as macros earlier in the ECalc.flex file.
Digit = [0-9]
Letter = [a-zA-Z]
-
-
When we match the identifier, we need to create a token that’s associated with that identifier.
-
Note that the token has an associated token type (sym.IDENTIFIER) and a string - this is the plaintext name of the identifier (yytext() is the string that was matched by the REGEX pattern {Letter}({Letter}|{Digit})* shown above).
-
Note that we have given an IDENTIFIER an associated String type in ECalc.cup:
terminal String IDENTIFIER;
-
-
We also modify the grammar in ECalc.cup for the production of a factor.
- We add two alternatives, so that a factor can also be an identifier or a LET expression
F ::= LPAREN E:e RPAREN
        {: RESULT = e; :}
    | NUMBER:n
        {: RESULT = n; :}
    | IDENTIFIER:id
        {:
            int val = symtab.lookup( id );
            if( debug ) {
                System.out.println( " Lookup " + id + " = " + val );
            }
            RESULT = val;
        :}
    | KW_LET IDENTIFIER:id EQUAL E:def
        {:
            if( debug ) {
                System.out.println( " Adding " + id + " = " + def );
            }
            symtab.add( id, def );
        :}
      KW_IN E:e KW_END
        {:
            RESULT = e;
            String removed = symtab.removeDef();
            if( debug ) {
                System.out.println( " Removed " + removed );
            }
        :}
    ;
-
Action code for identifiers:
- When parsing identifiers, we need to know what the value associated with the identifier is.
- The identifier must be defined in a LET expression, and when we parse that LET expression, we store the value of the identifiers defined in a symbol table
symtab
- We then give the attribute the value that we look up in the symbol table.
-
Parsing a LET expression
- We first parse the first component of the LET expression, KW_LET IDENTIFIER:id EQUAL E:def
- After we parse that component of the LET expression, we add the identifier (and its corresponding definition) to the symbol table.
- After adding the identifier to the symbol table, we parse the expression contained in the second half of the LET expression
- Since the identifier has already been added to the symbol table, that expression can use it.
- The value of the whole LET expression is the value of that expression, which is set in the line RESULT = e;
- Since we’re done with that identifier definition, we need to remove it from the symbol table - this is accomplished by the symtab.removeDef(); method, which pops the most recent definition from the top of the symbol table - the symbol table in this specific context acts like a stack.
-
Defining the Symbol Table
- At the top of the ECalc.cup file, we’ve defined some additional action code that gets inserted into the parser
- This action code defines a stack-based symbol table class.
- This stack-based implementation means that we can have nested LET expressions which can use the identifiers from the outer LET expressions.
action code {:
    /* Turn on debugging output */
    boolean debug = false;

    /* Simple symbol table to remember values associated with identifiers
     * via LET clauses */
    private class SymTable {
        /** Head of a list of entries */
        private SymList head;

        /** Constructor for an empty symbol table */
        public SymTable() {
            // Initially, no identifiers in the symbol table.
            head = null;
        }

        /** Entries in the list of symbols contain an identifier and
         * its corresponding value */
        private class SymList {
            // An identifier entry has the string name of the identifier,
            // its value and a link to the rest of the symbol table.
            String ident;
            int value;
            SymList rest;

            public SymList( String ident, int value, SymList rest ) {
                this.ident = ident;
                this.value = value;
                this.rest = rest;
            }
        }

        /** Look up the most recent definition of an identifier.
         * If the identifier is not found an error message is reported
         * and a totally silly value returned.
         */
        public int lookup( String id ) {
            SymList p = head;
            // While there are entries left and the current SymList entry
            // isn't for the identifier that we're looking up,
            // keep going through the list.
            while( p != null && ! p.ident.equals( id ) ) {
                p = p.rest;
            }
            if( p != null ) {
                return p.value;
            } else {
                parser.report_error( id + " not found", id );
                return 0x80808080;
            }
        }

        /** Add a new identifier and its corresponding value to the
         * symbol table. This overrides any previous entry for the
         * same identifier until it is removed.
         */
        public void add( String id, int val ) {
            head = new SymList( id, val, head );
        }

        /** Remove the topmost definition in the symbol table.
         * This will reactivate any previous definition of the
         * identifier being removed.
         */
        public String removeDef() {
            String removed = head.ident;
            head = head.rest;
            return removed;
        }
    }

    /* Symbol table used in the calculator - initially empty. */
    private SymTable symtab = new SymTable();
:}
-
Consider the following nested LET expression
2 + let x=1 in let x=2 in x end end
- When we define x=1, that will get pushed onto the stack.
- When we define x=2 in the second let expression, that will also get pushed onto the stack (hiding the previous definition of x=1 until it is removed)
- Therefore, when we evaluate x, we get the definition that x=2 and therefore 2 + x = 4
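The stack behaviour in this example can be reproduced outside the parser with a small standalone sketch. It is not the SymTable class from ECalc.cup: the class name `LetScopes` is hypothetical, it uses `java.util.ArrayDeque` instead of a hand-written linked list, and it throws an exception for an undefined identifier instead of reporting an error and returning a dummy value.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;

/** Sketch of a stack-based symbol table for nested LET expressions. */
final class LetScopes {
    private static final class Entry {
        final String ident;
        final int value;

        Entry(String ident, int value) {
            this.ident = ident;
            this.value = value;
        }
    }

    // push() adds at the head, so iteration runs from newest to oldest entry.
    private final Deque<Entry> stack = new ArrayDeque<>();

    /** Adds a definition, hiding any earlier one for the same identifier. */
    void add(String id, int value) {
        stack.push(new Entry(id, value));
    }

    /** Removes the most recent definition and returns its identifier. */
    String removeDef() {
        return stack.pop().ident;
    }

    /** Returns the value of the most recent definition of id. */
    int lookup(String id) {
        for (Entry e : stack) {
            if (e.ident.equals(id)) {
                return e.value;
            }
        }
        throw new NoSuchElementException(id + " not found");
    }

    /** Mirrors the evaluation order of: 2 + let x=1 in let x=2 in x end end */
    static int demo() {
        LetScopes symtab = new LetScopes();
        symtab.add("x", 1);            // outer let
        symtab.add("x", 2);            // inner let hides the outer x
        int inner = symtab.lookup("x");
        symtab.removeDef();            // leaving the inner let
        symtab.removeDef();            // leaving the outer let
        return 2 + inner;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 4
    }
}
```

If the body of the outer let used x after the inner let had ended, the lookup would find x=1 again, because removing the inner definition reactivates the outer one.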