Week 2.2
1.0 - Backus-Naur Form (BNF) and EBNF
1.1 - Backus-Naur Form (BNF)
- The Backus-Naur Form (BNF) allows a context-free grammar to be described by a finite set of productions of the form $N \to \alpha$, where
  - $N$ is a single non-terminal symbol, and
  - $\alpha$ is a (possibly empty) sequence of terminal and non-terminal symbols.
1.1.1 - BNF Syntax for Alternatives
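- The set of productions $N \to \alpha_1$, $N \to \alpha_2$, ..., $N \to \alpha_n$ can be written as the single production $N \to \alpha_1 \mid \alpha_2 \mid \dots \mid \alpha_n$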
1.2 - Extended Backus-Naur Form (EBNF)
Whilst EBNF doesn’t extend the expressivity of BNF, it adds three extra constructs to our BNF form to make it easier to specify our grammars. Since these constructs don’t add any expressivity, we can convert any grammar defined in EBNF into BNF.
- An optional syntactic construct, written in square brackets: $[S]$
  - The items in the brackets can appear 0 or 1 times.
- A repetition construct, written in curly brackets: $\{S\}$
  - The items in the brackets can appear 0 or more times.
- A grouping of constructs, written in parentheses: $(S)$
1.2.1 - Converting EBNF Constructs to BNF
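- We can replace the optional construct $[S]$ by a new non-terminal $OptS$
  - Essentially, anywhere where we have the construct $[S]$ we replace it with the new non-terminal symbol $OptS$, and add the production $OptS \to S \mid \epsilon$
  - For example, using the RelCondition production parsed in section 2.5.1, we can re-write the following production:
    - $RelCondition \to Exp\ [RelOp\ Exp]$
  - as the following productions
    - $RelCondition \to Exp\ OptRelExp$
    - $OptRelExp \to RelOp\ Exp \mid \epsilon$
- We can replace the repetition construct $\{S\}$ by a new non-terminal $SeqS$, where $SeqS \to S\ SeqS \mid \epsilon$
  - For example, using the Term production parsed in section 2.6.1, we can re-write the following production
    - $Term \to Factor\ \{(TIMES \mid DIVIDE)\ Factor\}$
  - as the following productions
    - $Term \to Factor\ SeqFactor$
    - $SeqFactor \to ((TIMES \mid DIVIDE)\ Factor\ SeqFactor) \mid \epsilon$
  - Note here that we still have the EBNF grouping construct
    - We could remove the outer grouping construct without changing the definition of the production, however the same can’t be said for the inner grouping construct.
- We can replace the grouping construct $(S)$ by a new non-terminal symbol $GrpS$, where $GrpS \to S$
  - For example, we can re-write
    - $SeqFactor \to (TIMES \mid DIVIDE)\ Factor\ SeqFactor \mid \epsilon$
  - as
    - $SeqFactor \to MulOp\ Factor\ SeqFactor \mid \epsilon$
    - $MulOp \to TIMES \mid DIVIDE$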
2.0 - Recursive-Descent Parsing
- How can we use the grammar of a language to write a parser for our programming language?
  - We want to implement the Syntax Analysis Phase of our compiler
- A recursive-descent parser is a recursive program to recognise sentences in a language:
  - Input is a stream of lexical tokens, represented by an object of type TokenStream (a Java class)
  - Each non-terminal symbol $N$ has a method parseN that:
    - Recognises the longest string of lexical tokens in the input stream, starting from the current token, derivable from $N$;
    - As it parses the input it moves the current location forward;
    - When it has finished parsing $N$, the current token is the token immediately following the last token matched as part of $N$
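The parse methods in the rest of these notes rely on three token-stream operations: isMatch, isIn and match. The following is a minimal sketch of what such a stream could look like. It is not the course's TokenStream class: the names Tok and SketchTokenStream are hypothetical, and throwing an exception on a mismatch is an assumption made to keep the example short.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical token kinds for this sketch; not the course's Token class. */
enum Tok { LPAREN, RPAREN, NUMBER, IDENTIFIER, EOF }

/** A minimal stand-in for a token stream: a list of token kinds plus a cursor. */
class SketchTokenStream {
    private final List<Tok> tokens;
    private int pos = 0;

    SketchTokenStream(List<Tok> tokens) {
        this.tokens = tokens;
    }

    /** The current token kind, or EOF once the input is exhausted. */
    Tok current() {
        return pos < tokens.size() ? tokens.get(pos) : Tok.EOF;
    }

    /** True if the current token is exactly the expected kind; consumes nothing. */
    boolean isMatch(Tok expected) {
        return current() == expected;
    }

    /** True if the current token is a member of the given set; consumes nothing. */
    boolean isIn(Set<Tok> set) {
        return set.contains(current());
    }

    /** Checks that the current token is the expected kind, then moves past it. */
    void match(Tok expected) {
        if (!isMatch(expected)) {
            // Assumption: this sketch simply throws instead of reporting and recovering.
            throw new IllegalStateException(
                "Syntax error: expected " + expected + " but found " + current());
        }
        pos++;
    }
}
```

Note that isMatch and isIn only inspect the current token, while match is the only operation that moves the current location forward.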
2.1 - Pattern-Matching a Terminal Symbol
- To recognise a terminal symbol T, the parser calls the method tokens.match on the input token stream, with the terminal symbol Token.T as a parameter:
EBNF Construct S | Recogniser code recog(S) | Notes |
---|---|---|
T | tokens.match(Token.T); | Make sure the current token is the one we expect, consume it and move on to the next token |
IDENTIFIER | tokens.match(Token.IDENTIFIER); | Example: recognise an identifier token |
2.2 - Pattern-Matching a Non-Terminal Symbol
- To recognise a non-terminal symbol N, simply call the corresponding method parseN
EBNF Construct S | Recogniser code recog(S) | Notes |
---|---|---|
N | parseN(); | Call the corresponding parse method for the non-terminal symbol. |
RelCondition | parseRelCondition(); | When parsing a RelCondition, we’d call the parseRelCondition(); method to parse it. |
2.3 - Matching a Sequence of Terms
So we have a stream of tokens that have been recognised. How do we parse it?
- A sequence $S_1\ S_2\ \dots\ S_n$ of EBNF factors is recognised by code that recognises each factor in turn
EBNF Construct S | Recogniser code recog(S) |
---|---|
$S_1\ S_2\ \dots\ S_n$ | recog($S_1$) recog($S_2$) ... recog($S_n$) |
LPAREN RelCondition RPAREN | tokens.match(Token.LPAREN); parseRelCondition(); tokens.match(Token.RPAREN); |
2.4 - Matching a list of Alternatives
To choose between alternatives we may need to look at the current token and the tokens that follow it; it is not always possible to write a recursive-descent parser that only looks at the current token.
- When matching a list of alternatives, $S_1 \mid S_2 \mid \dots \mid S_n$, we need to know which alternative to choose
  - Let First($S_i$) be the set of terminal symbols that can start the sequence $S_i$
  - We can predict this based on the value of the current token (i.e. without looking ahead) if the sets First($S_1$), First($S_2$), ..., First($S_n$) are pairwise disjoint and, when one alternative is nullable, they are also disjoint from the set of tokens that can follow the list of alternatives
- Assuming that our grammar satisfies these restrictions,
EBNF Construct S | Recogniser code recog(S) |
---|---|
$S_1 \mid S_2 \mid \dots \mid S_n$ | if (tokens.isIn(First($S_1$))) { recog($S_1$) } else if (tokens.isIn(First($S_2$))) { recog($S_2$) } else if ... else if (tokens.isIn(First($S_n$))) { recog($S_n$) } else { /* Omitted if one of the alternatives is nullable */ errors.error("Syntax Error"); } |
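One restriction that lets the current token alone decide between alternatives is that the First sets of the alternatives are pairwise disjoint. The following sketch checks that property for a list of First sets. The class name FirstSetCheck and the use of plain strings for terminal symbols are assumptions made for illustration; computing the First sets themselves is not shown.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for checking the restriction on alternatives. */
class FirstSetCheck {
    /** True if no terminal appears in the First set of two different alternatives. */
    static <T> boolean pairwiseDisjoint(List<Set<T>> firstSets) {
        for (int i = 0; i < firstSets.size(); i++) {
            for (int j = i + 1; j < firstSets.size(); j++) {
                // Collections.disjoint is true when the two sets share no element.
                if (!Collections.disjoint(firstSets.get(i), firstSets.get(j))) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

For instance, three alternatives whose First sets are {LPAREN}, {NUMBER} and {IDENTIFIER} pass the check, whereas two alternatives that can both start with IDENTIFIER do not, and the parser would need more than the current token to choose between them.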
2.4.1 - Matching a List of Alternatives - Example
void parseFactor() {
    if (tokens.isMatch(Token.LPAREN)) {
        tokens.match(Token.LPAREN);
        parseRelCondition();
        tokens.match(Token.RPAREN);
    } else if (tokens.isMatch(Token.NUMBER)) {
        tokens.match(Token.NUMBER);
    } else if (tokens.isMatch(Token.IDENTIFIER)) {
        parseLValue();
    } else {
        errors.error("Syntax Error");
        /* Or, if one of the alternatives is nullable, omit the error and
         * assume that we've matched the nullable alternative.
         */
    }
}
2.5 - Matching an Optional Construct
- When matching an optional construct, $[S]$, we need to choose between recognising $S$ and matching the empty sequence.
  - We can predict this based on the value of the current token if
    - First($S$) is disjoint from the set of tokens that can immediately follow $[S]$
- Assuming that our grammar satisfies this restriction,
EBNF Construct S | Recogniser code recog(S) |
---|---|
$[S]$ | if (tokens.isIn(First($S$))) { recog($S$) } /* Else don’t do anything. */ |
2.5.1 - Matching an Optional Construct - Example
void parseRelCondition() {
    parseExp(); // Recognise the expression
    if (tokens.isIn(REL_OPS_SET)) { // Can the current token start a RelOp?
        parseRelOp();
        parseExp();
    }
}
Where REL_OPS_SET contains the set of terminal symbols representing relational operators.
void parseIfStatement() {
    tokens.match(Token.KW_IF);
    parseCondition();
    tokens.match(Token.KW_THEN);
    parseStatement();
    if (tokens.isMatch(Token.KW_ELSE)) {
        tokens.match(Token.KW_ELSE);
        parseStatement();
    }
}
Whilst the grammar for the if statement doesn’t satisfy the restriction above (KW_ELSE can start the optional else part, but it can also follow an if statement that is nested inside another if statement), we can use our parser to resolve that ambiguity: because it always consumes a KW_ELSE when it sees one, each else is matched to the closest unmatched if.
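To make the "else matches the closest if" behaviour concrete, here is a small self-contained sketch that parses a nested if statement and returns a bracketed rendering of the structure it recognised. Everything in it is an assumption made for illustration: the class DanglingElseSketch, the simplified token kinds COND and STMT (standing in for a whole condition and a whole non-if statement), and the exception thrown on a syntax error.

```java
import java.util.List;

/** Hypothetical recogniser showing how an optional else part is resolved. */
class DanglingElseSketch {
    enum Tok { KW_IF, KW_THEN, KW_ELSE, COND, STMT, EOF }

    private final List<Tok> tokens;
    private int pos = 0;

    DanglingElseSketch(List<Tok> tokens) {
        this.tokens = tokens;
    }

    private Tok current() {
        return pos < tokens.size() ? tokens.get(pos) : Tok.EOF;
    }

    private void match(Tok expected) {
        if (current() != expected) {
            throw new IllegalStateException(
                "Syntax error: expected " + expected + " but found " + current());
        }
        pos++;
    }

    /** Statement -> IfStatement | STMT */
    String parseStatement() {
        if (current() == Tok.KW_IF) {
            return parseIfStatement();
        }
        match(Tok.STMT);
        return "s";
    }

    /** IfStatement -> KW_IF COND KW_THEN Statement [KW_ELSE Statement] */
    String parseIfStatement() {
        match(Tok.KW_IF);
        match(Tok.COND);
        match(Tok.KW_THEN);
        String thenPart = parseStatement();
        String elsePart = "";
        // The innermost active call sees the KW_ELSE first, so it claims it.
        if (current() == Tok.KW_ELSE) {
            match(Tok.KW_ELSE);
            elsePart = " else " + parseStatement();
        }
        return "(if c then " + thenPart + elsePart + ")";
    }
}
```

Parsing the token sequence KW_IF COND KW_THEN KW_IF COND KW_THEN STMT KW_ELSE STMT with parseStatement yields `(if c then (if c then s else s))`: the else ends up inside the inner if, because the inner call to parseIfStatement is the one that is active when KW_ELSE becomes the current token.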
2.6 - Matching a Repetition Construct
- When matching a repetition construct, $\{S\}$, we need to choose when to stop recognising occurrences of $S$, and match the empty sequence instead.
  - We can predict this based on the value of the current token if:
    - First($S$) is disjoint from the set of tokens that can immediately follow $\{S\}$
- Assuming that our grammar satisfies this restriction:
EBNF Construct S | Recogniser code recog(S) |
---|---|
$\{S\}$ | while (tokens.isIn(First($S$))) { recog($S$) } |
2.6.1 - Matching a Repetition Construct - Example
void parseTerm() {
    parseFactor();
    while (tokens.isIn(TERM_OPS_SET)) {
        // Step 1: Match either a Token.TIMES or a Token.DIVIDE, or report an internal error
        if (tokens.isMatch(Token.TIMES)) {
            tokens.match(Token.TIMES);
        } else if (tokens.isMatch(Token.DIVIDE)) {
            tokens.match(Token.DIVIDE);
        } else {
            fatal("Internal Error"); // Theoretically unreachable
        }
        // Step 2: Match the factor
        parseFactor();
    }
}
2.7 - Summary
- We have outlined a systematic process for writing a recursive-descent parser that recognises sentences in the language of a context-free grammar (of a restricted form) written in EBNF
- However, we are recognising the grammatical structure with a purpose, e.g. to
- Construct an Abstract Syntax Tree or
- Calculate the value of an expression
3.0 - Parsing and Computing Expressions
Let’s now write a recursive-descent parser that not only recognises an expression, but also calculates its value, assuming that there are no syntactic errors.
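- The following is a simple grammar for calculator expressions:
  - $Exp \to [PLUS \mid MINUS]\ Term\ \{(PLUS \mid MINUS)\ Term\}$
  - $Term \to Factor\ \{(TIMES \mid DIVIDE)\ Factor\}$
  - $Factor \to LPAREN\ Exp\ RPAREN \mid NUMBER$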
- Firstly, let’s look at parsing a Factor. The production for a Factor is $Factor \to LPAREN\ Exp\ RPAREN \mid NUMBER$, which gives the recogniser:
    void parseFactor() {
        if (tokens.isMatch(Token.LPAREN)) {
            tokens.match(Token.LPAREN);
            parseExp();
            tokens.match(Token.RPAREN);
        } else if (tokens.isMatch(Token.NUMBER)) {
            tokens.match(Token.NUMBER);
        } else {
            errors.error("Syntax error");
        }
    }
- To evaluate the expression, parseFactor now computes the value in a local variable and returns it.
    int parseFactor() {
        int result;
        if (tokens.isMatch(Token.LPAREN)) {
            tokens.match(Token.LPAREN);
            result = parseExp();
            tokens.match(Token.RPAREN);
        } else if (tokens.isMatch(Token.NUMBER)) {
            // Get the integer value of the current token, assuming it's a number
            result = tokens.getIntValue();
            // match moves to the next token, so the value MUST be read before this call
            tokens.match(Token.NUMBER);
        } else {
            errors.error("Syntax error");
            result = 0x80808080; // Arbitrary placeholder value (fits in an int)
        }
        return result;
    }
- Next, let’s look at parsing a Term
    int parseTerm() {
        int result = parseFactor(); // Result returned if there is no times or divide
        while (tokens.isIn(TERM_OPS_SET)) { // Times or divide
            boolean times = false; // Initialised so the code compiles even though the last branch is unreachable
            if (tokens.isMatch(Token.TIMES)) {
                tokens.match(Token.TIMES);
                times = true;
            } else if (tokens.isMatch(Token.DIVIDE)) {
                tokens.match(Token.DIVIDE);
            } else {
                fatal("Internal Error"); // Unreachable
            }
            int factor = parseFactor(); // The right-hand factor
            if (times) {
                result = result * factor;
            } else {
                result = result / factor;
            }
        }
        return result;
    }
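- Next, let’s look at parsing an Exp
    int parseExp() {
        boolean isMinus = false;
        if (tokens.isMatch(Token.PLUS)) {
            tokens.match(Token.PLUS); // A leading plus has no effect on the value
        } else if (tokens.isMatch(Token.MINUS)) {
            tokens.match(Token.MINUS);
            isMinus = true;
        }
        int result = parseTerm();
        if (isMinus) {
            result = -result;
        }
        // EXP_OPS_SET is an assumed name for the set {PLUS, MINUS}, by analogy with the set used in parseTerm
        while (tokens.isIn(EXP_OPS_SET)) {
            boolean plus = tokens.isMatch(Token.PLUS);
            tokens.match(plus ? Token.PLUS : Token.MINUS);
            int term = parseTerm();
            result = plus ? result + term : result - term;
        }
        return result;
    }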
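The three parse methods can be put together into a complete, runnable calculator. The sketch below is one possible arrangement, not the course's implementation: the class CalcSketch, its built-in lexer, the Kind enum and the evaluate helper are all hypothetical, syntax errors simply throw an exception, and the grammar is assumed to be Exp → [PLUS | MINUS] Term {(PLUS | MINUS) Term}, Term → Factor {(TIMES | DIVIDE) Factor}, Factor → LPAREN Exp RPAREN | NUMBER.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical evaluating recursive-descent parser for calculator expressions. */
class CalcSketch {
    enum Kind { NUMBER, PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN, EOF }

    private final List<Kind> kinds = new ArrayList<>();
    private final List<Integer> values = new ArrayList<>(); // 0 unless the token is a NUMBER
    private int pos = 0;

    /** A tiny lexer: splits the input into token kinds and number values. */
    CalcSketch(String input) {
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) { i++; }
                add(Kind.NUMBER, Integer.parseInt(input.substring(start, i)));
                continue;
            }
            switch (c) {
                case '+': add(Kind.PLUS, 0); break;
                case '-': add(Kind.MINUS, 0); break;
                case '*': add(Kind.TIMES, 0); break;
                case '/': add(Kind.DIVIDE, 0); break;
                case '(': add(Kind.LPAREN, 0); break;
                case ')': add(Kind.RPAREN, 0); break;
                default: throw new IllegalArgumentException("Unexpected character: " + c);
            }
            i++;
        }
        add(Kind.EOF, 0);
    }

    private void add(Kind kind, int value) { kinds.add(kind); values.add(value); }

    private boolean isMatch(Kind k) { return kinds.get(pos) == k; }

    private void match(Kind k) {
        if (!isMatch(k)) {
            throw new IllegalStateException("Syntax error: expected " + k + " but found " + kinds.get(pos));
        }
        pos++;
    }

    /** Exp -> [PLUS | MINUS] Term { (PLUS | MINUS) Term } */
    int parseExp() {
        boolean negate = false;
        if (isMatch(Kind.PLUS)) {
            match(Kind.PLUS);
        } else if (isMatch(Kind.MINUS)) {
            match(Kind.MINUS);
            negate = true;
        }
        int result = parseTerm();
        if (negate) { result = -result; }
        while (isMatch(Kind.PLUS) || isMatch(Kind.MINUS)) {
            boolean plus = isMatch(Kind.PLUS);
            match(plus ? Kind.PLUS : Kind.MINUS);
            int term = parseTerm();
            result = plus ? result + term : result - term;
        }
        return result;
    }

    /** Term -> Factor { (TIMES | DIVIDE) Factor } */
    int parseTerm() {
        int result = parseFactor();
        while (isMatch(Kind.TIMES) || isMatch(Kind.DIVIDE)) {
            boolean times = isMatch(Kind.TIMES);
            match(times ? Kind.TIMES : Kind.DIVIDE);
            int factor = parseFactor();
            result = times ? result * factor : result / factor;
        }
        return result;
    }

    /** Factor -> LPAREN Exp RPAREN | NUMBER */
    int parseFactor() {
        if (isMatch(Kind.LPAREN)) {
            match(Kind.LPAREN);
            int result = parseExp();
            match(Kind.RPAREN);
            return result;
        }
        int value = values.get(pos); // Read the value before match moves past the token
        match(Kind.NUMBER);
        return value;
    }

    /** Parses a whole input string and checks that nothing is left over. */
    static int evaluate(String input) {
        CalcSketch parser = new CalcSketch(input);
        int result = parser.parseExp();
        parser.match(Kind.EOF);
        return result;
    }
}
```

With this sketch, `CalcSketch.evaluate("2 + 3 * (4 - 1)")` returns 11 and `CalcSketch.evaluate("8 / 2 / 2")` returns 2. The second result follows from the while loops, which fold each new operand into the running result and so make the operators left-associative.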
4.0 - Building an Abstract Syntax Tree (AST)
- As well as parsing a language, we would like to create a representation of the language in the form of an Abstract Syntax Tree (AST)
- To do this, we change the parse method for a non-terminal $N$ so that it returns the abstract syntax tree representation of $N$, assuming that all of the parse methods it calls return the abstract syntax tree representations of the non-terminals they recognise.
4.1 - Building an Abstract Syntax Tree (AST) - Parsing Example
- We’re now building an AST rather than just recognising (or evaluating) the expression
- We return to the PL0 Compiler Data Structures document and use some of the data structures defined there to create our abstract syntax tree
  - Every expression node ExpNode has a location (Location loc) and a type (discussed later)
    - Location: line number and column number
  - We can represent our operators as BinaryNodes, as they have an operand expression on either side of them.
  - A BinaryNode has the following constructor arguments: BinaryNode(Location loc, Type t, Operator op, ExpNode left, ExpNode right)
- We now want our parseTerm method to return an ExpNode (Expression Node)
ExpNode parseTerm() {
    ExpNode result = parseFactor();
    while (tokens.isIn(TERM_OPS_SET)) {
        Operator op = Operator.INVALID_OP;
        Location opLoc = tokens.getLocation(); // Location of times or divide. Used in AST.
        if (tokens.isMatch(Token.TIMES)) {
            tokens.match(Token.TIMES);
            op = Operator.MUL_OP;
        } else if (tokens.isMatch(Token.DIVIDE)) {
            tokens.match(Token.DIVIDE);
            op = Operator.DIV_OP;
        } else {
            fatal("Internal error"); // Unreachable
        }
        ExpNode right = parseFactor();
        result = new ExpNode.BinaryNode(opLoc, op, result, right);
    }
    return result;
}
4.2 - Building an AST - Visual Example
Let’s build the AST for a term made of three factors separated by a multiplication and a division, i.e. a term of the form $f_1 * f_2 / f_3$, whose five tokens sit at Location 1 to Location 5.
- At the start, our current token points to the first factor, $f_1$.
  - So parseFactor() will return an AST for $f_1$
  - Note that Location 1 is a placeholder in this example; in the compiler it would be a line number and column number.
- Our current token now points to the multiplication operator
  - We recognise that it is either a TIMES or DIVIDE, and so we parse the sequence (TIMES | DIVIDE) Factor.
  - We know that the location for the start of this sequence is Location 2
  - We also know that the operation is a multiplication operation (MUL_OP)
- Our current token now points to $f_2$
  - We now parse the Factor component of the sequence
  - We now have the LHS and RHS components of our BinaryNode, so we can add pointers to them to create the AST for the binary operator.
- Our current token now points to the divide operator
  - Our current token is a divide symbol, so we can’t terminate the loop; we have to read in another occurrence of (TIMES | DIVIDE) Factor
  - From the information at hand, we know that the operator is located at Location 4 and that it’s a division operator (DIV_OP)
  - In reading in the operator token, we move our current token pointer to the factor $f_3$. As before, we parse it.
- Our current token now points to nothing (the end of the input)
  - As before, we put together our Abstract Syntax Tree: a BinaryNode for DIV_OP whose left child is the multiplication node built earlier and whose right child is the AST for $f_3$
- We can terminate, as the current token is no longer a TIMES or a DIVIDE
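The walkthrough above can be reproduced with a small runnable sketch. It does not use the PL0 compiler's ExpNode, Location or Operator classes: the node classes NumNode and BinNode, the use of strings as tokens, and the use of the token's position (starting from 1) as its location are all assumptions made to keep the example self-contained.

```java
import java.util.List;

/** Hypothetical AST-building parser for terms such as a * b / c. */
class AstSketch {
    /** Base class for expression nodes in this sketch. */
    static abstract class Node {}

    /** A leaf node for a single factor, with the location it was found at. */
    static final class NumNode extends Node {
        final int loc;
        final String text;
        NumNode(int loc, String text) { this.loc = loc; this.text = text; }
        @Override public String toString() { return text + "@" + loc; }
    }

    /** A binary operator node with the operator's location and two operands. */
    static final class BinNode extends Node {
        final int opLoc;
        final String op;
        final Node left;
        final Node right;
        BinNode(int opLoc, String op, Node left, Node right) {
            this.opLoc = opLoc; this.op = op; this.left = left; this.right = right;
        }
        @Override public String toString() {
            return "(" + left + " " + op + "@" + opLoc + " " + right + ")";
        }
    }

    private final List<String> tokens;
    private int pos = 0; // Location of the current token is pos + 1

    AstSketch(List<String> tokens) { this.tokens = tokens; }

    /** The current token, or the empty string at the end of the input. */
    private String current() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    private boolean atOperator() { return current().equals("*") || current().equals("/"); }

    /** Factor: any single token that is not an operator (a simplification). */
    Node parseFactor() {
        int loc = pos + 1;
        String text = current();
        if (text.isEmpty() || atOperator()) {
            throw new IllegalStateException("Syntax error at location " + loc);
        }
        pos++;
        return new NumNode(loc, text);
    }

    /** Term -> Factor { (TIMES | DIVIDE) Factor } */
    Node parseTerm() {
        Node result = parseFactor();
        while (atOperator()) {
            int opLoc = pos + 1; // Record the operator's location before consuming it
            String op = current();
            pos++;
            Node right = parseFactor();
            // The tree built so far becomes the left operand of the new node.
            result = new BinNode(opLoc, op, result, right);
        }
        return result;
    }
}
```

For the tokens a, *, b, /, c, calling parseTerm and printing the result gives `((a@1 *@2 b@3) /@4 c@5)`. The multiplication node built on the first loop iteration becomes the left child of the division node built on the second, which matches the order of steps in the walkthrough.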
4.3 - Recursive Descent Parser So Far...
- We have looked at:
- A systematic process for writing a recursive-descent parser that recognises sentences in the language of a context-free grammar (of a restricted form) written in EBNF
- How to build an abstract syntax tree for the sentence that we are recognising
- But programmers make errors