MicroPy Compiler — Parser Documentation

Overview

The MicroPy parser is Phase 2 of the compiler pipeline. It takes the flat list of tokens produced by the lexer and builds an Abstract Syntax Tree (AST) — a tree of objects that represents the structure and meaning of the program.

Source Code  →  [Lexer]  →  Token Stream  →  [Parser]  →  AST

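The parser is a hand-written recursive descent parser: a top-down, predictive parser that chooses each grammar rule by looking at the current token, which is the approach used for LL(1) grammars. Recursive descent is one of the most widely taught parsing techniques and is used in production compilers, including the Go compiler and current versions of GCC (GCC's early C and C++ front ends used a Bison-generated parser and were later rewritten as hand-written recursive descent).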

The Two Files

The parser is split into two files, each with a distinct job:

| File | Job |
| --- | --- |
| `parser/nodes.py` | Defines the data structures (node classes) that make up the AST |
| `parser/parser.py` | Contains the logic that reads tokens and builds the AST |

Think of nodes.py as the blueprint and parser.py as the builder.

Understanding AST Nodes

Every construct in MicroPy maps to a node — a Python object that holds exactly the information needed to represent that construct.

Literal Nodes

These are the leaf nodes of the tree. They have no children and simply hold a value.

# Imports used by the node definitions in this section.
# All node classes inherit from the base class Node (definition not shown).
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class NumberNode(Node):
    value: str = ""     # e.g. "42"

@dataclass
class StringNode(Node):
    value: str = ""     # e.g. "hello"

@dataclass
class BooleanNode(Node):
    value: str = ""     # "True" or "False"

@dataclass
class IdentifierNode(Node):
    name: str = ""      # e.g. "result", "num1"

Expression Nodes

These hold other nodes as children. They always have two sides and an operator.

@dataclass
class BinaryOpNode(Node):
    left:  Any = None   # left side: another node
    op:    str = ""     # "+", "-", "*", "/", "==", "<", etc.
    right: Any = None   # right side: another node

@dataclass
class LogicalOpNode(Node):
    left:  Any = None   # left condition
    op:    str = ""     # "and" or "or"
    right: Any = None   # right condition

Example: num1 + num2 becomes:

BinaryOpNode(op="+")
├── IdentifierNode → num1
└── IdentifierNode → num2
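
The same tree can be built by hand from the node classes, which is a quick way to check what the parser is expected to produce. This is a minimal sketch: the empty `Node` base class is an assumption, since the real one in nodes.py is not shown here.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Node:
    """Stand-in base class (assumed; the real one lives in parser/nodes.py)."""


@dataclass
class IdentifierNode(Node):
    name: str = ""


@dataclass
class BinaryOpNode(Node):
    left: Any = None
    op: str = ""
    right: Any = None


# num1 + num2, built by hand instead of by the parser
tree = BinaryOpNode(
    left=IdentifierNode(name="num1"),
    op="+",
    right=IdentifierNode(name="num2"),
)

print(tree)  # prints the nested dataclass repr of the whole tree
```

Because dataclasses generate `__eq__`, two trees with the same shape and values compare equal, which makes hand-built trees convenient as expected values in parser tests.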

Statement Nodes

These represent complete lines of MicroPy code.

@dataclass
class AssignmentNode(Node):
    name:  str = ""    # variable name on the left
    value: Any = None  # expression on the right

@dataclass
class PrintNode(Node):
    value: Any = None  # what to print

@dataclass
class IfNode(Node):
    condition:  Any                 = None
    then_block: List[Any]           = field(default_factory=list)
    else_block: Optional[List[Any]] = None   # None if no else

@dataclass
class WhileNode(Node):
    condition: Any       = None
    body:      List[Any] = field(default_factory=list)

@dataclass
class BuiltinCallNode(Node):
    func_name: str = ""    # "int", "input", "str", "float"
    argument:  Any = None  # the argument passed in

The Root Node

@dataclass
class ProgramNode(Node):
    statements: List[Any] = field(default_factory=list)

Every MicroPy program is a ProgramNode containing a flat list of top-level statements. This is the entry point of the entire tree.

How the Parser Works

The Token Pointer

Just like the lexer had a position pointer moving through characters, the parser has a position pointer moving through tokens:

class Parser:
    def __init__(self, tokens, error_handler):
        self.tokens = tokens   # the full list from the lexer
        self.pos    = 0        # current position
        self.errors = error_handler

The Four Helper Methods

def current(self) -> Token:
    """What token are we looking at right now?"""
    return self.tokens[self.pos]

def peek(self, offset=1) -> Token:
    """Look ahead without consuming."""
    return self.tokens[self.pos + offset]

def advance(self) -> Token:
    """Consume current token and move forward."""
    token = self.tokens[self.pos]
    self.pos += 1
    return token

def expect(self, type, value=None) -> Token:
    """Consume the current token, but only if it matches what we expect.
       Reports an error and leaves the token in place if it does not match."""
    token = self.current()
    if token.type != type or (value is not None and token.value != value):
        expected = type.value if value is None else f"{type.value} '{value}'"
        self.errors.report("Parser", f"Expected {expected}", token.line)
        return token   # do not advance; let the caller try to recover
    return self.advance()

The expect() method is how the parser enforces grammar rules. Every time the grammar says a specific token must appear, the parser calls expect().
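
To see the pointer move, the helpers can be exercised on a hand-written token list. The sketch below is illustrative only: `TokenType`, `Token`, the `EOF` marker and the `Cursor` class are stand-ins assumed for this example, and errors are collected in a plain list instead of the real error handler. It follows the non-advancing behaviour of the expect() version shown under Error Handling.

```python
from dataclasses import dataclass
from enum import Enum


class TokenType(Enum):          # assumed subset of the lexer's token types
    IDENTIFIER = "IDENTIFIER"
    ASSIGNMENT = "ASSIGNMENT"
    NUMBER = "NUMBER"
    EOF = "EOF"                 # assumed end-of-input marker


@dataclass
class Token:                    # assumed shape: type, value, line
    type: TokenType
    value: str
    line: int


class Cursor:
    """Just the token pointer and its helpers, without any grammar rules."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.errors = []        # stand-in for the error handler

    def current(self):
        return self.tokens[self.pos]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expect(self, type):
        token = self.current()
        if token.type != type:
            self.errors.append(f"Line {token.line}: expected {type.value}")
            return token        # leave the pointer where it is
        return self.advance()


# Tokens for:  x = 10
tokens = [
    Token(TokenType.IDENTIFIER, "x", 1),
    Token(TokenType.ASSIGNMENT, "=", 1),
    Token(TokenType.NUMBER, "10", 1),
    Token(TokenType.EOF, "", 1),
]

cursor = Cursor(tokens)
name = cursor.expect(TokenType.IDENTIFIER)   # matches, pointer moves to "="
cursor.expect(TokenType.ASSIGNMENT)          # matches, pointer moves to "10"
cursor.expect(TokenType.IDENTIFIER)          # mismatch (NUMBER): error recorded

print(name.value, cursor.pos, cursor.errors)
```

After the third call the pointer is still at index 2, on the NUMBER token, and one error has been recorded.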

Grammar Rules → Parser Functions

Every BNF production rule in the MicroPy grammar maps directly to one function in the parser. This is the defining feature of a recursive descent parser.

`<program> ::= <statement>+`

def parse(self) -> ProgramNode:
    statements = []
    self.skip_newlines()

    while not self.is_at_end():
        stmt = self.parse_statement()  # parse one statement
        if stmt:
            statements.append(stmt)    # add to list
        self.skip_newlines()

    return ProgramNode(statements=statements)

The + in the grammar means “one or more” — the while loop handles this.
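
The loop relies on two helpers, is_at_end() and skip_newlines(), whose bodies are not shown in this document. The sketch below is one plausible shape for them, under the assumption that the lexer ends every stream with an EOF token. Tokens are simplified to (type, value) pairs and parse_statement() is replaced by a stub, so only the loop logic is being illustrated.

```python
class StatementLoop:
    """Only the top-level loop; parse_statement() is replaced by a stub."""

    def __init__(self, tokens):
        self.tokens = tokens    # (type, value) pairs; the last one is ("EOF", "")
        self.pos = 0

    def current(self):
        return self.tokens[self.pos]

    def is_at_end(self):
        # Assumption: the lexer appends an EOF token to every stream.
        return self.current()[0] == "EOF"

    def skip_newlines(self):
        while not self.is_at_end() and self.current()[0] == "NEWLINE":
            self.pos += 1

    def parse_statement(self):
        # Stub: treat everything up to the next NEWLINE as one statement.
        values = []
        while not self.is_at_end() and self.current()[0] != "NEWLINE":
            values.append(self.current()[1])
            self.pos += 1
        return " ".join(values)

    def parse(self):
        statements = []
        self.skip_newlines()
        while not self.is_at_end():
            stmt = self.parse_statement()
            if stmt:
                statements.append(stmt)
            self.skip_newlines()
        return statements


# A leading blank line, "x = 1", another blank line, then "y = 2"
tokens = [
    ("NEWLINE", "\n"),
    ("IDENTIFIER", "x"), ("ASSIGNMENT", "="), ("NUMBER", "1"), ("NEWLINE", "\n"),
    ("NEWLINE", "\n"),
    ("IDENTIFIER", "y"), ("ASSIGNMENT", "="), ("NUMBER", "2"), ("NEWLINE", "\n"),
    ("EOF", ""),
]

result = StatementLoop(tokens).parse()
print(result)  # ['x = 1', 'y = 2']
```

Blank lines before, between and after statements are absorbed by skip_newlines(), so they never reach parse_statement().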

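`<statement> ::= <assignment> | <if_stmt> | <while_stmt> | <print_stmt>`

def parse_statement(self):
    token = self.current()

    if token.value == "if":      return self.parse_if_statement()
    if token.value == "while":   return self.parse_while_statement()
    if token.value == "print":   return self.parse_print_statement()
    if token.type == TokenType.IDENTIFIER:
                                 return self.parse_assignment()
    if token.value in ("int", "input", "str"):
                                 return self.parse_builtin_call()

    # No rule starts with this token. Assumed fallback, matching the
    # "Unexpected token" messages in the sample output: report the token
    # and skip it so the caller's loop always makes progress.
    self.errors.report("Parser", f"Unexpected token '{token.value}'", token.line)
    self.advance()
    return None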

The | in BNF (meaning “or”) becomes a chain of if checks in code. The parser looks at the current token and immediately knows which rule to apply, with no backtracking; this is what makes it predictive.

`<assignment> ::= IDENTIFIER "=" <expression>`

def parse_assignment(self) -> AssignmentNode:
    name_token = self.expect(TokenType.IDENTIFIER)  # consume x
    self.expect(TokenType.ASSIGNMENT)               # consume =
    value = self.parse_condition()                  # parse right side

    return AssignmentNode(name=name_token.value, value=value)

Example trace for result = num1 + num2:

expect(IDENTIFIER) → consumes "result"
expect(ASSIGNMENT) → consumes "="
parse_condition()  → parse_expression() → parse_term() → parse_factor()
                   → returns BinaryOpNode(IdentifierNode(num1), +, IdentifierNode(num2))

returns AssignmentNode(name="result", value=BinaryOpNode(...))

`<if_stmt> ::= "if" <condition> ":" <block> [ "else" ":" <block> ]`

def parse_if_statement(self) -> IfNode:
    self.expect(TokenType.KEYWORD, "if")    # consume "if"
    condition = self.parse_condition()       # parse the condition
    self.expect(TokenType.DELIMITER, ":")    # consume ":"
    self.skip_newlines()
    then_block = self.parse_block()          # parse indented block

    else_block = None
    if self.current().value == "else":       # optional else
        self.advance()                       # consume "else"
        self.expect(TokenType.DELIMITER, ":")
        self.skip_newlines()
        else_block = self.parse_block()

    return IfNode(condition=condition, then_block=then_block, else_block=else_block)
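
parse_while_statement() is listed in the Quick Reference but its body is not shown. By analogy with parse_if_statement(), it would plausibly read a keyword, a condition, a colon and a block. The sketch below is a guess at that shape, not the project's code: `MiniParser` is a hypothetical stand-in, tokens are (type, value) pairs, nodes are plain dicts, the condition parser accepts only a single comparison, and the block parser is a stub that collects raw token values per line.

```python
class MiniParser:
    """Tiny stand-in parser: tokens are (type, value) pairs, nodes are dicts."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def current(self):
        return self.tokens[self.pos]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expect(self, type, value=None):
        tok_type, tok_value = self.current()
        if tok_type != type or (value is not None and tok_value != value):
            raise SyntaxError(f"expected {type} {value or ''}".strip())
        return self.advance()

    def skip_newlines(self):
        while self.current()[0] == "NEWLINE":
            self.advance()

    def parse_condition(self):
        # Simplified: exactly  <operand> <compare_op> <operand>
        left = self.advance()[1]
        op = self.expect("COMPARE_OP")[1]
        right = self.advance()[1]
        return {"node": "BinaryOp", "left": left, "op": op, "right": right}

    def parse_block(self):
        # Stub body parser: each line inside the block becomes a list of values.
        statements = []
        self.expect("INDENT")
        while self.current()[0] != "DEDENT":
            line = []
            while self.current()[0] not in ("NEWLINE", "DEDENT"):
                line.append(self.advance()[1])
            statements.append(line)
            self.skip_newlines()
        self.expect("DEDENT")
        return statements

    def parse_while_statement(self):
        self.expect("KEYWORD", "while")      # consume "while"
        condition = self.parse_condition()   # loop condition
        self.expect("DELIMITER", ":")        # consume ":"
        self.skip_newlines()
        body = self.parse_block()            # indented loop body
        return {"node": "While", "condition": condition, "body": body}


# Tokens for:
#   while count < 5:
#       count = count + 1
tokens = [
    ("KEYWORD", "while"), ("IDENTIFIER", "count"), ("COMPARE_OP", "<"),
    ("NUMBER", "5"), ("DELIMITER", ":"), ("NEWLINE", "\n"), ("INDENT", ""),
    ("IDENTIFIER", "count"), ("ASSIGNMENT", "="), ("IDENTIFIER", "count"),
    ("OPERATOR", "+"), ("NUMBER", "1"), ("NEWLINE", "\n"), ("DEDENT", ""),
    ("EOF", ""),
]

node = MiniParser(tokens).parse_while_statement()
print(node["condition"], node["body"])
```

The only structural difference from the if rule is the absence of an optional else branch.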

`<block> ::= INDENT <statement>+ DEDENT`

def parse_block(self) -> List:
    statements = []
    self.expect(TokenType.INDENT)    # must see INDENT: block starts

    while not self.is_at_end() and self.current().type != TokenType.DEDENT:
        stmt = self.parse_statement()
        if stmt:
            statements.append(stmt)
        self.skip_newlines()

    self.expect(TokenType.DEDENT)    # must see DEDENT: block ends
    return statements

The INDENT and DEDENT tokens were generated by the lexer from Python-style indentation. The parser treats them exactly like { and } in C-style languages.
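
For readers who have not seen the lexer, the usual way to produce these tokens is a stack of indentation widths. The function below is a simplified illustration of that idea, not the MicroPy lexer itself: `indentation_tokens` is a hypothetical name, it assumes spaces-only indentation, and it emits whole lines as ("LINE", text) pairs instead of real tokens.

```python
def indentation_tokens(source):
    """Return INDENT/DEDENT markers and ("LINE", text) pairs for a source string.

    Assumption: indentation uses spaces only. The real MicroPy lexer may differ.
    """
    stack = [0]                       # indentation widths of the open blocks
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue                  # blank lines never change nesting
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:         # deeper than before: a block opens
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:      # shallower: close every finished block
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(("LINE", line.strip()))
    while len(stack) > 1:             # close blocks still open at end of file
        stack.pop()
        tokens.append("DEDENT")
    return tokens


source = "if x == 1:\n    y = 2\nz = 3\n"
result = indentation_tokens(source)
print(result)
```

For this input the indented line is bracketed by exactly one INDENT and one DEDENT, which is the pair parse_block() expects.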

Operator Precedence

One of the most important design decisions in the parser is how it handles operator precedence: 2 + 3 * 4 must be grouped as 2 + (3 * 4), not (2 + 3) * 4.

This is handled by the layered structure of four functions, each of which calls the next one down:

parse_condition()       handles:  ==  !=  <  >  and  or   (lowest priority)
    └── parse_expression()  handles:  +  -
            └── parse_term()      handles:  *  /               (highest priority)
                    └── parse_factor()   handles: values, ()

Rule: The deeper a function is, the higher its priority.

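`<expression> ::= <term> | <expression> "+" <term> | <expression> "-" <term>`

def parse_expression(self):
    left = self.parse_term()    # always go deeper first

    while self.current().value in ('+', '-'):
        op    = self.advance().value
        right = self.parse_term()
        left  = BinaryOpNode(left=left, op=op, right=right)

    return left

The rule as written is left-recursive (<expression> appears first on its own right-hand side), and a function that called itself before consuming any token would recurse forever. The code therefore uses the equivalent loop form: parse one term, then keep consuming + or - followed by another term. Wrapping the previous result as the new left operand keeps the operators left-associative, so 10 - 4 - 3 groups as (10 - 4) - 3.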

`<term> ::= <factor> | <term> "*" <factor> | <term> "/" <factor>`

def parse_term(self):
    left = self.parse_factor()  # always go deeper first

    while self.current().value in ('*', '/'):
        op    = self.advance().value
        right = self.parse_factor()
        left  = BinaryOpNode(left=left, op=op, right=right)

    return left

Proof with `2 + 3 * 4`

parse_expression() called
  → parse_term()
      → parse_factor() → NumberNode(2)
      → sees '*'? NO, and '+' is not handled here
      → returns NumberNode(2)
  → left = NumberNode(2)
  → sees '+' ✅ and consumes it
  → parse_term()                    ← right side
      → parse_factor() → NumberNode(3)
      → sees '*' ✅ and consumes it
      → parse_factor() → NumberNode(4)
      → returns BinaryOpNode(3 * 4) ← * resolved FIRST
  → left = BinaryOpNode(2 + BinaryOpNode(3 * 4))

Result:

BinaryOpNode (+)
├── NumberNode → 2
└── BinaryOpNode (*)    ← deeper = higher priority = runs first
    ├── NumberNode → 3
    └── NumberNode → 4
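
The layering can be tried in isolation with a stripped-down version of the three arithmetic functions. The sketch below is not the MicroPy parser: tokens are plain strings produced by a small regex, nodes are (op, left, right) tuples instead of BinaryOpNode objects, and the parse_factor() shown here (numbers and parentheses only) is an assumption, since the real one is not listed in this document.

```python
import re


def tokenize(text):
    """Split an arithmetic expression into number, operator and paren tokens."""
    return re.findall(r"\d+|[-+*/()]", text)


class ExprParser:
    def __init__(self, tokens):
        self.tokens = tokens + ["<end>"]   # sentinel so current() never overruns
        self.pos = 0

    def current(self):
        return self.tokens[self.pos]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def parse_expression(self):            # lowest priority here: + and -
        left = self.parse_term()
        while self.current() in ("+", "-"):
            op = self.advance()
            left = (op, left, self.parse_term())
        return left

    def parse_term(self):                  # higher priority: * and /
        left = self.parse_factor()
        while self.current() in ("*", "/"):
            op = self.advance()
            left = (op, left, self.parse_factor())
        return left

    def parse_factor(self):                # numbers and parenthesised groups
        token = self.advance()
        if token == "(":
            inner = self.parse_expression()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    return ExprParser(tokenize(text)).parse_expression()


print(parse("2 + 3 * 4"))     # ('+', 2, ('*', 3, 4))
print(parse("10 - 4 - 3"))    # ('-', ('-', 10, 4), 3)
print(parse("(2 + 3) * 4"))   # ('*', ('+', 2, 3), 4)
```

The second line shows left-associativity coming from the loop, and the third shows parentheses overriding precedence because parse_factor() restarts at the top of the expression layer.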

Walking Through a Full Example

Let’s trace what happens when the parser processes this snippet:

if operator == 1:
    result = num1 * num2

Token stream:

KEYWORD(if), IDENTIFIER(operator), COMPARE_OP(==), NUMBER(1),
DELIMITER(:), NEWLINE, INDENT, IDENTIFIER(result), ASSIGNMENT(=),
IDENTIFIER(num1), OPERATOR(*), IDENTIFIER(num2), NEWLINE, DEDENT

Parser trace:

parse_statement()
  → sees KEYWORD "if" → calls parse_if_statement()
      expect("if")           → ✅ consumes KEYWORD(if)
      parse_condition()
        parse_expression()
          parse_term()
            parse_factor()   → returns IdentifierNode("operator")
        sees COMPARE_OP(==)  → consumes it
        parse_expression()
          parse_term()
            parse_factor()   → returns NumberNode("1")
        returns BinaryOpNode(operator == 1)
      expect(":")            → ✅ consumes DELIMITER(:)
      skip_newlines()        → skips NEWLINE
      parse_block()
        expect(INDENT)       → ✅ consumes INDENT
        parse_statement()
          → sees IDENTIFIER + ASSIGNMENT → calls parse_assignment()
              expect(IDENTIFIER)  → ✅ consumes "result"
              expect(ASSIGNMENT)  → ✅ consumes "="
              parse_condition()
                parse_expression()
                  parse_term()
                    parse_factor() → IdentifierNode("num1")
                    sees OPERATOR(*) → consumes it
                    parse_factor() → IdentifierNode("num2")
                    returns BinaryOpNode(num1 * num2)
              returns AssignmentNode("result", BinaryOpNode(*))
        skip_newlines()      → skips NEWLINE
        expect(DEDENT)       → ✅ consumes DEDENT
        returns [AssignmentNode(...)]
      returns IfNode(
          condition  = BinaryOpNode(operator == 1),
          then_block = [AssignmentNode(result = BinaryOpNode(num1 * num2))],
          else_block = None
      )

The Final AST Output

Running the full calculator program through the parser produces this tree (abbreviated):

ProgramNode (20 statements)
├── AssignmentNode → num1
│   └── BuiltinCallNode → int()
│       └── BuiltinCallNode → input()
│           └── StringNode → 'enter first number'
├── AssignmentNode → num2
│   └── BuiltinCallNode → int()
│       └── BuiltinCallNode → input()
│           └── StringNode → 'enter second number'
├── IfNode
│   ├── condition:
│   │   └── BinaryOpNode (==)
│   │       ├── IdentifierNode → operator
│   │       └── NumberNode → 1
│   └── then:
│       └── AssignmentNode → result
│           └── BinaryOpNode (*)
│               ├── IdentifierNode → num1
│               └── IdentifierNode → num2
├── WhileNode
│   ├── condition:
│   │   └── BinaryOpNode (<)
│   │       ├── IdentifierNode → result
│   │       └── NumberNode → 5
│   └── body:
│       └── PrintNode
│           └── IdentifierNode → message
└── IfNode
    ├── condition:
    │   └── LogicalOpNode (and)
    │       ├── IdentifierNode → boolean_value
    │       └── BinaryOpNode (<)
    │           ├── IdentifierNode → num1
    │           └── NumberNode → 3
    └── then:
        └── PrintNode
            └── FStringNode → f'the boolean value was {boolean_value}...'
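
A tree in this style can be produced by a short recursive function that tracks, for each level, whether the node is the last child of its parent. The sketch below is an illustration, not the project's printer: `label`, `children` and `render` are hypothetical helpers, and only three node types are covered (declared here without the `Node` base class to keep the example self-contained).

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class NumberNode:
    value: str = ""


@dataclass
class IdentifierNode:
    name: str = ""


@dataclass
class BinaryOpNode:
    left: Any = None
    op: str = ""
    right: Any = None


def label(node):
    """One-line description of a node, in the style of the tree above."""
    if isinstance(node, BinaryOpNode):
        return f"BinaryOpNode ({node.op})"
    if isinstance(node, IdentifierNode):
        return f"IdentifierNode → {node.name}"
    return f"NumberNode → {node.value}"


def children(node):
    """Child nodes in left-to-right order (only BinaryOpNode has any here)."""
    return [node.left, node.right] if isinstance(node, BinaryOpNode) else []


def render(node, prefix="", is_last=True, is_root=True):
    """Return the tree as a list of lines using ├── and └── connectors."""
    connector = "" if is_root else ("└── " if is_last else "├── ")
    lines = [prefix + connector + label(node)]
    # Below a last child the column is blank; below any other child it
    # carries a vertical bar that links to the siblings still to come.
    child_prefix = prefix if is_root else prefix + ("    " if is_last else "│   ")
    kids = children(node)
    for i, kid in enumerate(kids):
        lines += render(kid, child_prefix, i == len(kids) - 1, is_root=False)
    return lines


# 2 + 3 * 4
tree = BinaryOpNode(
    NumberNode("2"), "+", BinaryOpNode(NumberNode("3"), "*", NumberNode("4"))
)
print("\n".join(render(tree)))
```

Supporting the other node types would mean extending `label` and `children`; the connector logic in `render` stays the same.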

Error Handling

The parser uses a non-crashing error strategy: when it encounters an unexpected token, it reports the error and tries to continue parsing the rest of the program. This gives you all errors at once instead of stopping at the first one.

def expect(self, type, value=None):
    token = self.current()
    # A token matches when its type is right and, if a value was given, its value too
    if token.type != type or (value is not None and token.value != value):
        expected = type.value if value is None else f"{type.value} '{value}'"
        self.errors.report(
            "Parser",
            f"Expected {expected} but got {token.type.value} {token.value!r}",
            token.line
        )
        return token   # ← return without advancing, try to recover
    return self.advance()

Example error output:

[Parser Error] Line 5: Expected DELIMITER ':' but got NEWLINE '\n'
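
The error handler itself lives in utils/error_handler.py and is not shown here. A collector along the following lines would be enough to support the report(phase, message, line) calls used by the parser and the message format above. Treat it as a sketch: `has_errors()` and `summary()` are hypothetical additions for illustration.

```python
class ErrorHandler:
    """Collects errors instead of raising, so parsing can keep going."""

    def __init__(self):
        self.errors = []

    def report(self, phase, message, line):
        # Same call shape the parser uses: report("Parser", message, token.line)
        self.errors.append(f"[{phase} Error] Line {line}: {message}")

    def has_errors(self):          # hypothetical helper
        return bool(self.errors)

    def summary(self):             # hypothetical helper
        return self.errors + [f"{len(self.errors)} error(s) found"]


handler = ErrorHandler()
handler.report("Parser", "Expected DELIMITER ':' but got NEWLINE '\\n'", 5)
handler.report("Parser", "Expected INDENT but got IDENTIFIER 'x'", 6)

for line in handler.summary():
    print(line)
```

Because report() only appends to a list, the parser can call it any number of times during one run and print everything at the end.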

Running the Parser

# Run on the default sample
python main.py

# Run on your own .mpy file
python main.py myprogram.mpy

# Run on the test precedence file
python main.py samples/test_precedence.mpy

The output shows Phase 1 (lexical analysis with token list) followed by Phase 2 (parser results with any syntax errors).

Sample Output:

  PHASE 1 — Lexical Analysis

  TOKEN TYPE      | VALUE                               | LINE
  NUMBER          | '10'                                | 1
  ASSIGNMENT      | '='                                 | 1
  IDENTIFIER      | 'x'                                 | 1

  Total tokens: 3

 PHASE 2 : Parser => Parsing (AST)
  ────────────────────────────────────────────────────────────
Parser Error: Unexpected token '10' on line 1
Parser Error: Unexpected token '=' on line 1
Parser Error: Unexpected token 'x' on line 1

3 error(s) found

Quick Reference

| BNF Rule | Parser Function | Returns |
| --- | --- | --- |
| `<program>` | `parse()` | `ProgramNode` |
| `<statement>` | `parse_statement()` | Any statement node |
| `<assignment>` | `parse_assignment()` | `AssignmentNode` |
| `<if_stmt>` | `parse_if_statement()` | `IfNode` |
| `<while_stmt>` | `parse_while_statement()` | `WhileNode` |
| `<print_stmt>` | `parse_print_statement()` | `PrintNode` |
| `<block>` | `parse_block()` | `List[Node]` |
| `<condition>` | `parse_condition()` | `BinaryOpNode` or `LogicalOpNode` |
| `<expression>` | `parse_expression()` | `BinaryOpNode` or factor |
| `<term>` | `parse_term()` | `BinaryOpNode` or factor |
| `<factor>` | `parse_factor()` | Leaf node or `BuiltinCallNode` |
| `<builtin_call>` | `parse_builtin_call()` | `BuiltinCallNode` |

Source Code

The full parser source code is available at:

github.com/John-hack321/micropy

micropy/
├── parser/
│   ├── nodes.py      ← AST node definitions
│   └── parser.py     ← Recursive descent parser
├── lexer/
│   ├── token.py      ← Token definitions
│   └── lexer.py      ← Scanner
├── utils/
│   └── error_handler.py
└── main.py           ← Entry point