Lecture 11
1.0 - Tutorial 6 Compiler
- The Tutorial 6 Compiler is different from the Tutorial 4 Compiler in several ways:
- It uses a Shift-Reduce Parser instead of a Recursive Descent Parser
- The Shift-Reduce Parser is generated by Java CUP
- The Lexical Analyser / Scanner is generated by JFlex
- We don't interpret the code directly; instead, we generate code for a stack machine, which is then run using an emulator.
- The Abstract Syntax Tree, Symbol Table and Types remain the same.
- Since the AST classes remain the same, the static semantic checking also remains the same, as it is defined with respect to the abstract syntax of the PL0 language.
- All of the source code associated with parsing in the T6 compiler is contained in the parse package.
- The input to the Java-CUP program used to generate the parser for the T6 compiler is the PL0.cup file.
2.0 - PL0.flex - Lexical Analyser / Scanner Specification
- The syntax for each of the lexical tokens in the grammar is specified using regular expressions (regex).
- The definitions of our terminal symbols have a precedence order.
- We define our symbols first, as they should have the highest precedence.
- The lexical analyser tries to match the longest string possible; ties are then broken by the order in which the rules are listed.
- For example, the string "begin" matches both the KW_BEGIN rule and the pattern for an identifier (a series of letters and digits).
- Therefore, we put KW_BEGIN above the definition for identifiers.
"(" { return makeToken(CUPToken.LPAREN); }
")" { return makeToken(CUPToken.RPAREN); }
";" { return makeToken(CUPToken.SEMICOLON); }
":=" { return makeToken(CUPToken.ASSIGN); }
":" { return makeToken(CUPToken.COLON); }
..
"begin" { return makeToken(CUPToken.KW_BEGIN); }
"call" { return makeToken(CUPToken.KW_CALL); }
"const" { return makeToken(CUPToken.KW_CONST); }
"do" { return makeToken(CUPToken.KW_DO); }
"else" { return makeToken(CUPToken.KW_ELSE); }
"end" { return makeToken(CUPToken.KW_END); }
"if" { return makeToken(CUPToken.KW_IF); }
"procedure" { return makeToken(CUPToken.KW_PROCEDURE); }
"read" { return makeToken(CUPToken.KW_READ); }
"then" { return makeToken(CUPToken.KW_THEN); }
"type" { return makeToken(CUPToken.KW_TYPE); }
"var" { return makeToken(CUPToken.KW_VAR); }
"while" { return makeToken(CUPToken.KW_WHILE); }
"write" { return makeToken(CUPToken.KW_WRITE); }
- As mentioned before, we then define the other constructs within our grammar, with lower precedence.
- An identifier is any combination of letters and digits that starts with a letter.
- An identifier has a String attribute, which is assigned the value of yytext() - the string of characters that matched the regex (in this case, the name of the identifier).
/* The rule for identifier must come after keywords to give the keywords
 * priority. Note that yytext returns the character string that matches
 * the pattern -- in this case the name of the identifier. */
{Letter}({Letter}|{Digit})* { return makeToken(CUPToken.IDENTIFIER, yytext()); }
- A number also has an attribute, but of type Integer - its value is obtained by parsing the matched string as an integer (i.e. the integer representation of the number).
{Digit}+ { int value = 0x80808080; // Nonsense value
           try {
               value = Integer.parseInt(yytext());
           } catch(NumberFormatException e) {
               /* Can only happen if the number is too big */
               ErrorHandler.getErrorHandler().error(
                       "integer too large",
                       new ComplexSymbolFactory.Location(yyline, yycolumn));
           }
           return makeToken(CUPToken.NUMBER, value);
         }
- A comment is a string of text that starts with "//" and is followed by zero or more comment characters.
- Comment characters are defined as CommentCharacter = [^\r\n\f], i.e. any character other than a line terminator.
- Since we essentially want to ignore comments and whitespace in our parsing, we don't return any tokens for them.
/* This consumes the comment but not any following line terminator
 * (exploiting the longest match property).
 * The line terminator is consumed by the WhiteSpace definition.
 * This is to ensure a comment terminated by EOF, that isn't immediately
 * preceded by a line terminator, is recognised. */
"//"{CommentCharacter}* { /* ignore comment - an empty action causes the lexical analyser
                           * to skip the matched characters in the input and then start
                           * scanning for a token from the next character. */ }
{WhiteSpace} { /* ignore white space */ }
- We then define a lexical rule that returns an ILLEGAL token if we encounter a character that doesn't match any of the above patterns.
/* Match any other character.
 * Let the parser deal with any illegal characters */
. { return makeToken(CUPToken.ILLEGAL); }
2.1 - CUPToken
- CUPToken is an interface generated by Java-CUP which defines all of the symbol constants.
- There are two main sections of the interface. Firstly, each symbol is associated with an integer constant:
public static final int DIVIDE = 8;
public static final int LBRACKET = 19;
public static final int KW_PROCEDURE = 28;
public static final int EQUALS = 9;
- Then, an array with all of the terminal symbol names is defined:
public static final String[] terminalNames = new String[] {
    "EOF",
    "error",
    "SEMICOLON",
    "COLON",
    "ASSIGN",
}
- We require that the lexical analyser code generated by JFlex and the parser generated by Java-CUP use the same set of tokens - this is why the CUPToken interface is created.
- Note that in the PL0.flex file (which is used by JFlex to generate the Lexical Analyser / Scanner), the tokens we return use the constants defined in CUPToken (e.g. CUPToken.KW_BEGIN).
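- For illustration, here is a minimal sketch of how such a makeToken helper might look; this is an assumption about its shape (not the actual PL0.flex code), and it is assumed to sit in the scanner's user code section, where yyline, yycolumn and yylength() are available.
/* Imports assumed to be listed in the import section of PL0.flex. */
import java_cup.runtime.ComplexSymbolFactory;
import java_cup.runtime.ComplexSymbolFactory.Location;
import java_cup.runtime.Symbol;

private ComplexSymbolFactory symbolFactory = new ComplexSymbolFactory();

/* Build a token with no attribute value (keywords and operators). */
private Symbol makeToken(int tokenId) {
    return makeToken(tokenId, null);
}

/* Build a token carrying an attribute value (e.g. IDENTIFIER, NUMBER). */
private Symbol makeToken(int tokenId, Object value) {
    Location left = new Location(yyline, yycolumn);
    Location right = new Location(yyline, yycolumn + yylength());
    // Both the name and the numeric code come from CUPToken, so the scanner
    // and the CUP-generated parser agree on the same set of tokens.
    return symbolFactory.newSymbol(CUPToken.terminalNames[tokenId], tokenId,
                                   left, right, value);
}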
3.0 - PL0.cup - Parser and Parser Generator
- PL0.cup is essentially just a description of the grammar of our language, written in BNF form, coupled with action code that describes what to do when a non-terminal symbol of the grammar is recognised.
3.1 - Defining the Terminal Symbols
- Firstly, we define all of the terminal symbols in our language: both symbolic operators like assignment (:=) and semicolon (;), as well as keywords such as KW_BEGIN ("begin").
terminal SEMICOLON, /* ; */
         COLON,     /* : */
         ASSIGN,    /* := */
         PLUS,      /* + */
         MINUS,     /* - */
         // And so on
- Neither the symbolic operators nor the keyword terminal symbols have any attributes associated with them.
- However, the terminal symbols IDENTIFIER and NUMBER have associated attribute types:
terminal String  IDENTIFIER; /* identifier */
terminal Integer NUMBER;     /* number */
3.2 - Defining the Non-Terminal Symbols - Expressions
- In the parser, we generate an abstract syntax tree, and so for some of these non-terminal symbols we'll need attributes associated with them.
- When we parse an expression, we return an ExpNode (which is defined in the tree package):
non terminal ExpNode Condition, RelCondition, Exp, Term, Factor;
non terminal ExpNode LValue;
- When parsing our operators, we return an Operator:
non terminal Operator Relation, AddOp, MulOp, UnaryOperator;
- When we parse a statement, we return a StatementNode:
non terminal StatementNode Statement, CompoundStatement;
3.2.1 - Parsing a Left Value
- When parsing an LValue, we can either match an IDENTIFIER or the special error symbol.
- Note that this syntax is specific to JavaCUP.
- Essentially, if we're trying to match an LValue but can't match an IDENTIFIER, the parser will match error instead (and we return an error node), so that it can recover as well as possible and keep going (to provide the programmer with as much useful information as possible).
- If we're able to parse an IDENTIFIER, its string name will be stored in the variable id.
- We then create an IdentifierNode using that string name, and a location (used for error reporting).
- The position of that location is the leftmost character of the identifier - we can use JavaCUP syntax to get this location:
  - idxleft: "id" is the attribute name, and "xleft" gives the location of the leftmost character of that attribute.
- Similarly, if we're not able to parse the IDENTIFIER, we create an error node, passing it the location of the error, exleft ("e" is the attribute name, "xleft" the location of its leftmost character).
LValue ::= IDENTIFIER:id
{:
/* At this stage the identifier could be either a constant identifier or
* a variable identifier but which cannot be determined until static
* checking when the IdentifierNode will be transformed into either
* a ConstNode or a VariableNode or detected as invalid.
*/
RESULT = new ExpNode.IdentifierNode(idxleft, id);
:}
| error:e
{:
RESULT = new ExpNode.ErrorNode(exleft);
:}
;
3.2.2 - Parsing a Condition
- At this point in time, a Condition is just a RelCondition, but this will be extended in Tutorial 6.
/* To allow for adding logical expressions. */
Condition ::= RelCondition:e
{:
RESULT = e;
:}
;
3.2.3 - Parsing a RelCondition
- A RelCondition is either an Expression by itself, or an Expression followed by a Relation and another Expression
- If we match the first option, where the RelCondition is just an expression, we set the attribute to the Expression node.
- Otherwise, we need to create a new BinaryNode containing both expression nodes and the relational operator.
- In this case, we set the location to that of the leftmost character of the binary operator - we consider this the most meaningful position (mainly so that we can differentiate between an issue with the binary operator itself and an issue with the left or right expression).
/* Relational operators are lower precedence than arithmetic operators. */
RelCondition ::= Exp:e
{:
RESULT = e;
:}
| Exp:e1 Relation:op Exp:e2
{:
RESULT = new ExpNode.BinaryNode(opxleft, op, e1, e2);
:}
;
3.2.4 - Parsing a Relation
- A Relation is the non-terminal symbol representing all of the relational (comparison) operators.
- The attribute for each of these operators is just the operator itself (as we've defined it at the top of the file):
non terminal Operator Relation
Relation ::= EQUALS
{:
RESULT = Operator.EQUALS_OP;
:}
| NEQUALS
{:
RESULT = Operator.NEQUALS_OP;
:}
| LEQUALS
{:
RESULT = Operator.LEQUALS_OP;
:}
| LESS
{:
RESULT = Operator.LESS_OP;
:}
| GREATER
{:
RESULT = Operator.GREATER_OP;
:}
| GEQUALS
{:
RESULT = Operator.GEQUALS_OP;
:}
;
3.2.5 - Parsing an Expression
- An Expression is either just a Term, or an Expression followed by an AddOp and a Term.
- Note that the second alternative of this production is left-recursive.
- However, this is fine as we're not generating a recursive descent parser - we're generating a bottom-up (LALR) parser (see the sketch after the production below).
Exp ::= Term:t
{:
RESULT = t;
:}
| Exp:e1 AddOp:op Term:e2
{:
RESULT = new ExpNode.BinaryNode(opxleft, op, e1,e2);
:}
;
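- To see why left recursion is a problem for recursive descent (but not for the LALR parser generated here), consider this standalone sketch, which follows the production Exp ::= Exp AddOp Term literally; it is illustrative only and not part of the compiler.
/* Illustrative only: a recursive descent method that follows the
 * left-recursive production literally calls itself before consuming any
 * input, so it never terminates (in Java it fails with a StackOverflowError).
 * A bottom-up (LALR) parser has no such difficulty with left recursion. */
class LeftRecursionDemo {
    static Object parseExp() {
        Object left = parseExp();   // recurse on Exp before reading any token
        // ... would parse AddOp and Term here, but this point is never reached
        return left;
    }

    public static void main(String[] args) {
        parseExp();                 // throws StackOverflowError
    }
}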
3.2.6 - Parsing an AddOp
- An AddOp is either a PLUS or a MINUS operator.
AddOp ::= PLUS
{:
RESULT = Operator.ADD_OP;
:}
| MINUS
{:
RESULT = Operator.SUB_OP;
:}
;
3.2.7 - Parsing a Term
- A Term is either a Factor, or a Term followed by a MulOp and a Factor.
- Like in all of the previous examples, we return as the attribute the expression node built for whichever alternative was matched.
Term ::= Factor:f
{:
RESULT = f;
:}
| Term:e1 MulOp:op Factor:e2
{:
RESULT = new ExpNode.BinaryNode(opxleft, op, e1,e2);
:}
;
3.2.8 - Parsing a Factor
- If there's a PLUS before a factor, we essentially ignore it and just return the value of the expression.
- However, if there's a MINUS before a factor, that modifies the value.
- We handle this with the UnaryOperator case, building a UnaryNode.
- The factor could be a condition wrapped in parentheses - in this case, the attribute value is just that of the condition.
- The factor could be a NUMBER - in this case, we construct a new ConstNode.
- This constant node is of type Integer, and has the value obtained from the lexical analyser.
Factor ::= PLUS Factor:e
{:
RESULT = e;
:}
| UnaryOperator:op Factor:e
{:
RESULT = new ExpNode.UnaryNode(opxleft, op, e);
:}
| LPAREN Condition:c RPAREN
{:
RESULT = c;
:}
| NUMBER:n
{:
RESULT = new ExpNode.ConstNode(nxleft,
Predefined.INTEGER_TYPE, n);
:}
| LValue:lval
{:
RESULT = lval;
:}
;
3.2.9 - Parsing a UnaryOperator
- At the moment, our Unary Operator only has one operator - a minus.
UnaryOperator ::= MINUS
{:
RESULT = Operator.NEG_OP;
:}
;
The following section looks at parsing statement nodes.
3.3 - Defining the Non-Terminal Symbols - Statements
3.3.1 - Parsing a Compound Statement
- As before, after matching the required format KW_BEGIN StatementList KW_END, we return a StatementNode which represents the structure of the code we've just parsed.
- StatementNode.ListNode takes two arguments - a location slxleft and the statement list itself, sl.
CompoundStatement ::= KW_BEGIN StatementList:sl KW_END
{:
RESULT = new StatementNode.ListNode(slxleft,sl);
:}
;
- Note that sl is of type List<StatementNode>, as declared in our non-terminal definition above:
non terminal List<StatementNode> StatementList;
3.3.2 - Parsing a StatementList
- As defined in our grammar, there are two alternatives for a StatementList:
  - A single statement s
  - A StatementList sl followed by a semicolon (;) and another statement s
- In the case where we have a single statement by itself, we construct a new ArrayList and add the statement into it.
- In the case where we are matching an added statement, we simply append the matched statement onto the existing StatementList.
- This example is important - we don't have to set our attributes to the base AST nodes themselves; we can use collections or other types to augment their behaviour (a small standalone illustration follows the production below).
StatementList ::= Statement:s
{:
RESULT = new ArrayList<StatementNode>();
RESULT.add(s);
:}
| StatementList:sl SEMICOLON Statement:s
{:
sl.add(s);
RESULT = sl;
:}
;
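- As a standalone illustration (not generated parser code), the snippet below mimics the effect of these actions when parsing s1; s2; s3, with strings standing in for StatementNode objects.
import java.util.ArrayList;
import java.util.List;

/* Strings stand in for StatementNode objects purely for demonstration. */
public class StatementListDemo {
    public static void main(String[] args) {
        List<String> sl = new ArrayList<>(); // StatementList ::= Statement (matches s1)
        sl.add("s1");
        sl.add("s2");                        // StatementList ::= StatementList ; Statement (appends s2)
        sl.add("s3");                        // ... and again for s3
        System.out.println(sl);              // prints [s1, s2, s3]
    }
}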
3.3.3 - Parsing a Statement
- The production for a PL0 statement has a lot of alternatives.
Statement ::= KW_WHILE Condition:c KW_DO Statement:s
{:
RESULT = new StatementNode.WhileNode(cxleft, c, s);
:}
| KW_IF Condition:c KW_THEN Statement:s1 KW_ELSE Statement:s2
{:
RESULT = new StatementNode.IfNode(cxleft, c, s1, s2);
:}
| CompoundStatement:s
{:
RESULT = s;
:}
| KW_READ LValue:lval
{:
RESULT = new StatementNode.ReadNode(lvalxleft, lval);
:}
| KW_WRITE Exp:e
{:
RESULT = new StatementNode.WriteNode(exleft, e);
:}
| LValue:lval ASSIGN Condition:rval
{:
RESULT = new StatementNode.AssignmentNode(lvalxleft, lval, rval);
:}
| KW_CALL IDENTIFIER:id LPAREN ActualParamList:pl RPAREN
{:
RESULT = new StatementNode.CallNode(idxleft, id);
:}
| error:loc
{:
RESULT = new StatementNode.ErrorNode(locxleft);
:}
;
- We have an error alternative, to insert an error node if we are trying to parse a statement but it doesn't match any of the other alternatives.
| error:loc
{:
RESULT = new StatementNode.ErrorNode(locxleft);
:}
- Note that for the time being the ActualParamList component of a procedure call doesn't have any real definition (it matches only the empty string).
| KW_CALL IDENTIFIER:id LPAREN ActualParamList:pl RPAREN
{:
RESULT = new StatementNode.CallNode(idxleft, id);
:}

ActualParamList ::= /* empty */ ;
3.4 - Additional Notes
- Any imports that are present at the top of PL0.cup are copied over to the CUPParser generated by Java-CUP.
- We also include additional code that gets inserted into the generated parser so that we can handle errors in our own way, overriding some of Java-CUP's default error handling methods.
parser code {:
    /* This section provides some methods used by Java CUP during parsing.
     * They override its default methods for reporting syntax errors. */

    /* Retrieve the error handler to handle error messages. */
    private Errors errors = ErrorHandler.getErrorHandler();

    /* Override the default CUP syntax_error method with one
     * that integrates better with the compiler's error reporting. */
    @Override
    public void syntax_error(Symbol cur_token) {
        errors.error("PL0 syntax error", ((ComplexSymbol) cur_token).xleft);
    }

    /* Override the default CUP unrecovered_syntax_error method with one
     * that integrates better with the compiler's error reporting. */
    @Override
    public void unrecovered_syntax_error(Symbol cur_token) {
        errors.error("PL0 unrecovered syntax error", ((ComplexSymbol) cur_token).xleft);
    }
:}
- Additionally, in this phase we also want to construct a symbol table.
- We insert more code that defines instance variables so that we can use them later.
- Note that currentScope is effectively our symbol table.
action code {:
    /* This section provides global variables and methods used in the
     * semantics actions associated with parsing rules.
     * These are the only global variables you should need. */

    /* Error handler for reporting error messages. */
    private Errors errors = ErrorHandler.getErrorHandler();

    /* The current symbol table scope is available globally in the parser.
     * Its current scope corresponds to the procedure (or main program)
     * being processed. */
    private Scope currentScope;
:}
4.0 - Stack Machine
- All of the code related to the PL0 Compiler stack machine is included in the machine package.
- The CodeGenerator class found in the tree package generates code for this stack machine.
4.1 - Code Generation
- Code generation is implemented using the visitor pattern, visiting both statements and expressions to transform them into stack machine code.
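- As a self-contained sketch of the idea (with made-up node and instruction names - this is not the actual CodeGenerator or tree package API): generate the code for each operand, then append the instruction for the operator.
import java.util.ArrayList;
import java.util.List;

/* Toy expression nodes and a visitor that emits stack-machine-style
 * instructions; all names here are illustrative assumptions. */
interface Exp {
    List<String> accept(CodeGenVisitor v);
}

class Num implements Exp {
    final int value;
    Num(int value) { this.value = value; }
    public List<String> accept(CodeGenVisitor v) { return v.visitNum(this); }
}

class Add implements Exp {
    final Exp left, right;
    Add(Exp left, Exp right) { this.left = left; this.right = right; }
    public List<String> accept(CodeGenVisitor v) { return v.visitAdd(this); }
}

class CodeGenVisitor {
    List<String> visitNum(Num n) {
        List<String> code = new ArrayList<>();
        code.add("LOAD_CON " + n.value);          // push the constant onto the stack
        return code;
    }

    List<String> visitAdd(Add a) {
        List<String> code = new ArrayList<>(a.left.accept(this));  // code for the left operand
        code.addAll(a.right.accept(this));                         // code for the right operand
        code.add("ADD");                          // pop two values, push their sum
        return code;
    }
}

class CodeGenDemo {
    public static void main(String[] args) {
        Exp tree = new Add(new Num(1), new Add(new Num(2), new Num(3)));
        // prints [LOAD_CON 1, LOAD_CON 2, LOAD_CON 3, ADD, ADD]
        System.out.println(tree.accept(new CodeGenVisitor()));
    }
}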
4.2 - Memory Allocation
- The memory allocated to the stack machine is divided into three areas:
  - The Stack - the stack machine uses the stack for two purposes:
    - Evaluating expressions, and
    - Storing activation records for every active procedure call / procedure invocation.
    - Storing the activation records is important, as they hold the values of the local variables defined in that procedure.
  - The Heap - used for dynamic allocation of objects.
  - The Code Space - used for storing machine instructions.
- The stack and heap share one contiguous area of memory.
- The stack grows from the bottom (from Address 0)
- The heap grows down from the top of the memory area (towards the stack)
- We have four special registers to properly manage the memory (a minimal sketch of how sp and limit govern a push is given after this list):
  - Stack Pointer (sp) - the address of the top of the stack (+1), that is, the address into which the next PUSH operation will take place.
  - Stack Limit (limit) - the address of the upper limit to which the stack can grow (which is also the bottom of the heap).
  - Frame Pointer (fp) - the address of the stack frame (activation record) for the current procedure (or the main program).
  - Program Counter (pc) - the address of the next machine instruction to be executed.
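- Below is that sketch: how sp and limit might govern a push operation, with the stack growing up from address 0 and forbidden from growing past limit (the bottom of the heap). The class and method names are illustrative assumptions, not the actual machine package code.
/* All names here are illustrative assumptions, not the real emulator. */
class MiniStackMachine {
    private final int[] memory = new int[1024]; // stack and heap share this contiguous area
    private int sp = 0;                         // next free slot at the top of the stack
    private int limit = 1024;                   // bottom of the heap / upper limit for the stack
    private int fp = 0;                         // start of the current activation record
    private int pc = 0;                         // address of the next instruction to execute

    void push(int value) {
        if (sp >= limit) {                      // the stack may not grow into the heap
            throw new IllegalStateException("stack overflow at pc = " + pc);
        }
        memory[sp] = value;                     // store into the slot sp refers to...
        sp = sp + 1;                            // ...then advance sp to the next free slot
    }

    int pop() {
        sp = sp - 1;                            // move back to the topmost occupied slot
        return memory[sp];
    }
}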