Aho - Chapter 2.1 to 2.2 Summary
1.0 - Introduction to Syntax-Directed Translation
- The analysis phase of a compiler breaks up a source program into constituent pieces and produces an internal representation of the source program called the intermediate code
- The analysis phase of compilation is centred around the syntax of the language to be compiled.
- Syntax describes the proper form of programs belonging to a programming language
- Semantics defines what a program means, for a given programming language (i.e. what each program does when it is executed)
- The synthesis phase translates the intermediate code into the target program
- Syntax is specified using a notation called context-free grammars, or BNF (Backus-Naur Form)
- A context-free grammar can also be used to help guide the translation of programs
- Section 2.3 - Syntax-Directed Translation
- Section 2.4 - Parsing / Syntax Analysis
- Section 2.5 - Model of a Compiler Front-End
2.0 - Syntax Definition
Aho et al, Chapter 2.2
Context-Free Grammar (alt. Grammar)
- A context-free grammar is used to specify the syntax of a language
- A grammar describes the hierarchical structure of most programming language constructs
- Suppose we have the following if-else statement in Java syntax: if (expression) statement else statement
- Suppose we use the variables ‘stmt’ to denote a statement and ‘expr’ to denote an expression.
- Then we can create a structuring rule for an if-else statement as:
    - stmt → if ( expr ) stmt else stmt
- Note here that the arrow (→) is used to denote 'can have the form'
- Such a rule is called a production, and is the basis of grammars
- In a production, lexical elements like the keyword if and the parentheses are called terminals
- Variables like 'expr' and 'stmt' represent sequences of terminals and are called non-terminals
2.1 - Definition of Grammars
Aho et al, Chapter 2.2.1
- A context-free grammar has four key components:
    - A set of terminal symbols, sometimes referred to as "tokens". The terminals are the elementary symbols of the language defined by the grammar.
    - A set of non-terminals, sometimes called "syntactic variables". Each non-terminal represents a set of strings of terminals.
    - A set of productions, where each production consists of a non-terminal (the head, or LHS), an arrow, and a sequence of terminals and/or non-terminals (the body, or RHS).
    - A designation of one of the non-terminals as the start symbol (or, if none is specified, the head of the first production listed).
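- As a minimal illustrative sketch (not from the text), the four components of the if-else grammar fragment above can be written down as plain Python data; all names here are my own:

```python
# A hedged sketch: the four grammar components for the if-else rule
# above, written as plain Python data. Names are illustrative.
terminals = {"if", "else", "(", ")"}
non_terminals = {"stmt", "expr"}
# each production maps a head (LHS) to a list of alternative bodies (RHS)
productions = {
    "stmt": [["if", "(", "expr", ")", "stmt", "else", "stmt"]],
}
start_symbol = "stmt"  # by convention, the head of the first production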
- Grammars are specified by listing their productions, with productions for the start symbol listed first.
- The following assumptions can be made about the formatting of grammar productions:
    - Digits, signs such as < and <=, and boldface strings such as while are terminal symbols
    - Italicized names are non-terminals
    - Any non-italicized name or symbol is a terminal
We often want to create expressions consisting of digits and plus and minus signs, such as 9-5+2, 3-1 and 7. Since a plus or a minus sign must appear between two digits, we refer to such expressions as "lists of digits separated by plus or minus signs".
- The following grammar describes the syntax of these expressions, with the following productions:
    - list → list + digit
    - list → list - digit
    - list → digit
    - digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- The bodies of the three productions with non-terminal list as head can equivalently be grouped as:
    - list → list + digit | list - digit | digit
- According to our conventions, the terminals of the grammar are the symbols + - 0 1 2 3 4 5 6 7 8 9
- The non-terminals are the italicized names list and digit, with list being the start symbol, because its productions are given first
- We say a production is for a non-terminal if the non-terminal is the head of the production
- A string of terminals is a sequence of zero or more terminals
- The string of zero terminals, written as ε, is called the empty string
2.2 - Grammar Derivations
Aho et al, Chapter 2.2.2
- A grammar derives strings by beginning with the start symbol, and repeatedly replacing a non-terminal by the body of a production for that non-terminal
- The terminal strings that can be derived from the start symbol form the language defined by the grammar.
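- As a small sketch of what "derives" means here (plain Python, not from the text), the steps below replay one derivation of 9-5+2 from the list grammar above:

```python
# Replaying one derivation of "9-5+2" step by step: each line replaces
# a single non-terminal with the body of one of its productions.
steps = [
    ["list"],
    ["list", "+", "digit"],                 # list -> list + digit
    ["list", "-", "digit", "+", "digit"],   # list -> list - digit
    ["digit", "-", "digit", "+", "digit"],  # list -> digit
    ["9", "-", "digit", "+", "digit"],      # digit -> 9
    ["9", "-", "5", "+", "digit"],          # digit -> 5
    ["9", "-", "5", "+", "2"],              # digit -> 2
]
for step in steps:
    print(" ".join(step))
```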
2.2.1 - List Production Example 1 - Productions & Lists
- Consider the following productions from above:
    - list → list + digit
    - list → list - digit
    - list → digit
- From the last production, we know that a single digit by itself is a list
- Then from the first two productions in this grammar, any list followed by a plus or minus sign and then another digit makes up a new list
- From this we can deduce that 9-5+2 is a list:
    - 9 is a list by the third production, since 9 is a digit
    - 9-5 is a list by the second production, since 9 is a list and 5 is a digit
    - 9-5+2 is a list by the first production, since 9-5 is a list and 2 is a digit
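- The deduction above can be checked mechanically. A minimal sketch (not from the text; is_list and is_digit are my own names), walking the list productions from the right:

```python
# Hedged sketch: recognising "lists of digits separated by + or -"
# by mirroring the three list productions directly.
def is_digit(s):
    return len(s) == 1 and s in "0123456789"

def is_list(s):
    if is_digit(s):        # list -> digit
        return True
    # list -> list + digit | list - digit
    return (len(s) >= 3 and s[-2] in "+-"
            and is_digit(s[-1]) and is_list(s[:-2]))

assert is_list("9")       # a single digit is a list
assert is_list("9-5")     # list - digit
assert is_list("9-5+2")   # list + digit
assert not is_list("9+")  # a dangling operator is not a list
```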
2.2.2 - List Production Example 2 - List of Parameters
- Another type of list is the list of parameters in a function call
- In Java, the parameters are enclosed within parentheses, as in the call max(x,y) of function max with parameters x and y
- One nuance of such a list is that an empty list of parameters may be found between the terminals ( and )
- We may start to develop a grammar for such sequences with the productions:
    - call → id ( optparams )
    - optparams → params | ε
    - params → params , param | param
- Note that the second possible body for optparams (the optional parameter list) is ε, which stands for the empty string of symbols
    - That is, optparams can be replaced by the empty string, so a call can consist of a function name followed by the two-terminal string ()
- Notice that the productions for params are analogous to those for list in the previous example, with a comma in place of the arithmetic operators + and -, and param in place of digit
    - Note that the productions for param haven't been shown, as parameters are essentially arbitrary expressions
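- A minimal sketch of these productions as a recogniser (plain Python, not from the text; param is simplified here to a single identifier, and the function names are my own):

```python
# Hedged sketch: call -> id ( optparams ), optparams -> params | epsilon,
# params -> params , param | param, with param simplified to an identifier.
def is_call(tokens):
    # tokens is a list like ["max", "(", "x", ",", "y", ")"]
    if len(tokens) < 3 or not tokens[0].isidentifier():
        return False
    if tokens[1] != "(" or tokens[-1] != ")":
        return False
    inner = tokens[2:-1]
    return inner == [] or is_params(inner)  # optparams -> epsilon | params

def is_params(ts):
    if len(ts) == 1:                        # params -> param
        return ts[0].isidentifier()
    # params -> params , param
    return (len(ts) >= 3 and ts[-2] == ","
            and ts[-1].isidentifier() and is_params(ts[:-2]))

assert is_call(["max", "(", "x", ",", "y", ")"])
assert is_call(["f", "(", ")"])             # empty parameter list
assert not is_call(["f", "(", ",", ")"])
```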
2.3 - Parse Trees
- Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar (and, if it cannot be derived from the start symbol of the grammar, reporting syntax errors within the string)
- Parsing is one of the most fundamental problems in all of compiling
- We will start by considering simple source programs like 9-5+2, in which each character is a terminal
    - In general, a source program has multi-character lexemes that are grouped by the lexical analyser into tokens, whose first components are the terminals processed by the parser
- A parse tree pictorially shows how the start symbol of a grammar derives a string in the language
- If a non-terminal A has a production A → XYZ, then a parse tree may have an interior node labelled A with three children labelled X, Y and Z, from left to right
    - Figure: parse tree for the production A → XYZ
- Formally, given a context-free grammar, a parse tree according to the grammar is a tree with the following properties:
    - The root is labelled by the start symbol
    - Each leaf is labelled by a terminal or by ε
    - Each interior node is labelled by a non-terminal
    - If A is the non-terminal labelling some interior node and X1, X2, …, Xn are the labels of the children of that node from left to right, then there must be a production A → X1X2…Xn
        - Here, X1, …, Xn each stand for a symbol that is either a terminal or a non-terminal
        - As a special case, if A → ε is a production, then a node labelled A may have a single child labelled ε
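- A minimal sketch of these properties in code (plain Python, not from the text): the parse tree for 9-5+2 under the list/digit grammar, built by hand, with its yield read off the leaves.

```python
from dataclasses import dataclass

@dataclass
class Node:
    label: str       # the non-terminal at an interior node
    children: list   # Nodes or terminal strings, left to right

# Parse tree for 9-5+2, i.e. (9-5)+2, under list -> list + digit | ...
tree = Node("list", [
    Node("list", [
        Node("list", [Node("digit", ["9"])]),
        "-",
        Node("digit", ["5"]),
    ]),
    "+",
    Node("digit", ["2"]),
])

def yield_of(n):
    """Concatenate the leaf labels from left to right (the tree's yield)."""
    if isinstance(n, str):
        return n
    return "".join(yield_of(c) for c in n.children)

assert yield_of(tree) == "9-5+2"
```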
2.4 - Ambiguity
- We have to be careful in talking about the structure of a string according to a grammar
- A grammar can have more than one parse tree generating a given string of terminals
- Such a grammar is said to be ambiguous
- To show that a grammar is ambiguous, we need to find a terminal string that is the yield of more than one parse tree
- Since a string with more than one parse tree usually has more than one meaning, we need to design unambiguous grammars for compiling applications, or to use ambiguous grammars with additional rules to resolve the ambiguities
- Suppose that in the example above, we used a single non-terminal string and did not distinguish between digits and lists. Then we could have written the grammar:
    - string → string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Merging the notion of a digit and a list into the non-terminal string makes superficial sense, because a single digit is a special case of a list
- However, using this grammar, an expression like 9-5+2 has more than one parse tree
    - Figure: two parse trees for 9-5+2
- We can interpret the expression as either (9-5)+2 or 9-(5+2), which have different results (6 and 2 respectively)
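- A quick sketch (plain Python, not from the text) that evaluates the two parse trees to show they mean different things; trees are written here as nested (left, op, right) tuples:

```python
# Hedged sketch: the two readings of 9-5+2 under the ambiguous grammar.
def ev(tree):
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return ev(left) - ev(right) if op == "-" else ev(left) + ev(right)

left_tree = ((9, "-", 5), "+", 2)   # (9-5)+2, the left-associative reading
right_tree = (9, "-", (5, "+", 2))  # 9-(5+2)

print(ev(left_tree))   # 6
print(ev(right_tree))  # 2
```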
2.5 - Associativity
- By convention, the expression 9+5+2 is equivalent to (9+5)+2, and 9-5-2 is equivalent to (9-5)-2
- When an operand has operators on its left and right, conventions are needed for deciding which operator applies to that operand
- We say that the operator + associates to the left, because an operand with plus signs on both sides of it belongs to the operator to its left
- In most programming languages the four arithmetic operators +, -, * and / are left-associative
    - However, some common operators such as exponentiation are right-associative
    - Additionally, the assignment operator = in C and its descendants is right-associative
        - That is, the expression a=b=c is treated in the same way as a=(b=c)
- Strings like a=b=c with a right-associative operator are generated by the following grammar:
    - right → letter = right | letter
    - letter → a | b | … | z
- The contrast between a parse tree for a left-associative operator like - and a parse tree for a right-associative operator like = is shown in the figure below (left - left associative, right - right associative)
    - Figure: Parse Tree for Left-Associative Operators (LHS) and Right-Associative Operators (RHS)
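- A small sketch (plain Python, not from the text; parse_right is my own name) showing how the right-recursive production right → letter = right makes the tree lean to the right:

```python
# Hedged sketch: parsing "a=b=c" with right -> letter = right | letter.
# The recursive call on the right-hand side yields a right-leaning tree.
def parse_right(tokens, i=0):
    letter = tokens[i]
    if i + 1 < len(tokens) and tokens[i + 1] == "=":
        return (letter, "=", parse_right(tokens, i + 2))
    return letter

print(parse_right(list("a=b=c")))  # ('a', '=', ('b', '=', 'c')), i.e. a=(b=c)
```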
2.6 - Precedence of Operators
- Consider the expression 9+5*2. There are two possible interpretations of this expression: (9+5)*2 or 9+(5*2)
- The associativity rules for + and * apply to occurrences of the same operator, so they do not resolve this ambiguity
- Rules defining the relative precedence of operators are needed when more than one kind of operator is present
- We say that * has a higher precedence than + if * takes its operands before + does
- In ordinary arithmetic, multiplication and division have higher precedence than addition and subtraction
- Therefore, the operand 5 is taken by * in both the expressions 9+5*2 and 9*5+2
    - Figure: parse trees showing operator precedence
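- Precedence can be built into a grammar by layering it, with one non-terminal per precedence level, as in the classic expr/term/factor grammar. A minimal recursive-descent sketch (plain Python, not from the text; the while-loops replace left recursion and keep + - and * / left-associative):

```python
# Hedged sketch of the layered grammar
#   expr   -> expr + term | expr - term | term
#   term   -> term * factor | term / factor | factor
#   factor -> digit | ( expr )
# evaluated over a string of single-character tokens.
def evaluate(s):
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def next_token():
        nonlocal pos
        t = s[pos]
        pos += 1
        return t

    def expr():                        # expr -> term { (+|-) term }
        v = term()
        while peek() in ("+", "-"):
            v = v + term() if next_token() == "+" else v - term()
        return v

    def term():                        # term -> factor { (*|/) factor }
        v = factor()
        while peek() in ("*", "/"):
            v = v * factor() if next_token() == "*" else v / factor()
        return v

    def factor():                      # factor -> digit | ( expr )
        if peek() == "(":
            next_token()               # consume '('
            v = expr()
            next_token()               # consume ')'
            return v
        return int(next_token())       # a single digit

    return expr()

print(evaluate("9+5*2"))    # 19: * takes the 5 before + does
print(evaluate("9*5+2"))    # 47
print(evaluate("(9+5)*2"))  # 28: parentheses override precedence
```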