Initial Parser Implementation and Feature Completion #1

Merged
me merged 16 commits from parser-dev into main 2025-12-28 11:54:12 -06:00
Owner

Pull Request: Initial Parser Implementation and Feature Completion

Description

This PR merges the parser-dev branch into main, completing the first major stage of the artichoke compiler.

This introduces the complete parser for the artichoke Programming Language. Moving beyond the existing tokenizer, this implementation provides the full infrastructure required to transform artichoke source code into a structured Abstract Syntax Tree (AST).

The parser is built using a hybrid approach: a Handwritten Recursive Descent parser for high-level program structure (modules, functions, statements) and a Pratt (top-down operator precedence) parser for expressions.

Some changes to the language grammar and definition were needed to keep the parser implementation clean and context-agnostic, which has been the primary goal so far.

Core Architecture

1. Hybrid Parsing Strategy

  • Top-Down Recursive Descent: Handles the EBNF grammar for declarations, control flow (if, for, loop, match), and module scoping.
  • Pratt Expression Parser: Manages complex operator precedence and associativity, ensuring intuitive evaluation of math, logic, and member access.

2. The artichoke Type System (Initial Integration)

  • Implemented TypeNode logic to handle nested qualifiers including pointers (*), mutability ($), optionals (?), and slices ([]).
  • Added support for generic types with a specialized TypesList parser.

3. Syntax Resolution & Disambiguation

  • The Turbofish (::< >): Solved the generic-vs-comparison ambiguity in expressions by implementing the ::<> syntax.
  • Type-Initiated Expressions: Enabled anonymous literals (e.g., []int {1, 2}) by allowing the parser to transition from "Expression Mode" to "Type Mode" when encountering slice starters.
  • Precedence Capping: Implemented a precedence-limit mechanism to allow the -> token to function as both a pointer member access operator and a match-case delimiter without ambiguity.

Key Language Features Supported

  • Program Structure: Recursive modules, imports, and type aliases.
  • Declarations: Generic structs, enums (with storage types), and functions with this parameter support.
  • Control Flow:
    • if/else and while with variable unwrappers (|val|).
    • Labeled loops (loop, for, while, do-while).
    • match and switch statements with arrow-delimited cases.
    • defer and errdefer for resource management.
  • Expressions:
    • Full operator set (Arithmetic, Bitwise, Logical, Assignment).
    • Advanced Suffixes: Slicing ([start:end]), Reflection (.@), and Slice conversions (.#, .*).
    • Struct/Slice object literals with named or positional fields.

Technical Implementation Details

  • AST Visualization: Included toDot (Graphviz) and toString visitors for tree debugging.
  • Error Handling: Implemented Unexpected<> result types to provide clear error reporting without compiler crashes.
  • Operator Hierarchy: Established a 24-level precedence table to match modern systems programming expectations.
me added 16 commits 2025-12-28 11:42:51 -06:00
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

This commit introduces the foundational structure for the parser and
Abstract Syntax Tree (AST). It includes a new `Parser.hpp` header that
outlines the primary parsing functions for top-level declarations like
`modules`, `structs`, `enums`, and `functions`. It also adds a
`toString` function for the AST to aid in debugging and visualization.

The commit also updates the `Expected.hpp` utility by adding new error
codes like `ecUnexpectedToken`, `ecExpectedSemicolon`,
`ecImportInsideModule`, and `ecUnimplemented` to provide more granular
and descriptive parsing errors. The `Tokenizer` has been updated to use
these new, more specific exceptions.
Signed-off-by: erick-alcachofa <erick@artichoke.dev>
Signed-off-by: erick-alcachofa <erick@artichoke.dev>
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Major refactoring of the Parser and Tokenizer components to improve code
maintainability, strengthen error messaging, and streamline AST
generation.

This version intentionally focuses on top-level declarations, with
statement parsing stubbed for the next development phase.

- **Path Sanitization**: Added `sanitizePath` to extract filenames from
  input paths, ensuring consistent `unitName` identification regardless
  of directory depth.
- **Improved Output**: Wrapped AST string output in Markdown code blocks
  and added a commented-out entry for the new DOT graph visualization.
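
As a rough illustration of the path sanitization described above, the sketch below keeps only the final component of an input path. It is a stand-in written with `std::filesystem`, not the PR's actual `sanitizePath`; whether the real helper also strips the extension is not stated, so this version leaves it in place.

```cpp
#include <filesystem>
#include <string>

// Hypothetical stand-in for sanitizePath: returns only the final path
// component, so the unit name does not depend on directory depth.
std::string sanitizePath(const std::string& inputPath) {
    return std::filesystem::path(inputPath).filename().string();
}
```

With this sketch, a nested path and a bare filename both map to the same unit name (the `.art` extension used when trying it out is made up for illustration).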

- **Unified Consumption**: Replaced manual token checks with a more
  robust `consume()` method that leverages `peekExpect()` for
  centralized error handling.
- **New Predicates**: Introduced `match()` and `matchAndConsume()`
  helpers to handle optional tokens and branching logic without
  redundant peek/consume calls.
- **Exception Handling**: Standardized the use of `langException` across
  all parsing functions, providing more descriptive "Expected X, found
  Y" messages.
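
The relationship between these helpers can be sketched as follows. This is a minimal model, assuming a simplified token type and a `ParseError` class standing in for `langException`; `TokenKind`, `Token`, and `MiniParser` are hypothetical names, and the real signatures in `Parser.hpp` may differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token kinds; the project's real token enum is not shown here.
enum class TokenKind { Ident, Comma, Semicolon, Eof };

struct Token {
    TokenKind kind;
    std::string text;
};

// Stand-in for the project's langException.
struct ParseError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class MiniParser {
public:
    explicit MiniParser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    // Returns the current token without consuming it; yields Eof past the end.
    const Token& peek() const {
        static const Token eof{TokenKind::Eof, "<eof>"};
        return pos_ < tokens_.size() ? tokens_[pos_] : eof;
    }

    // Central error path: checks the current token and throws an
    // "Expected X, found Y" style message on mismatch.
    const Token& peekExpect(TokenKind kind, const std::string& message) const {
        const Token& tok = peek();
        if (tok.kind != kind) {
            throw ParseError(message + ", found '" + tok.text + "'");
        }
        return tok;
    }

    // Mandatory token: validate through peekExpect, then advance.
    Token consume(TokenKind kind, const std::string& message) {
        Token tok = peekExpect(kind, message);
        ++pos_;
        return tok;
    }

    // Optional token: test only, never throws.
    bool match(TokenKind kind) const { return peek().kind == kind; }

    // Optional token: test and advance when it matches.
    bool matchAndConsume(TokenKind kind) {
        if (!match(kind)) return false;
        ++pos_;
        return true;
    }

private:
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};
```

The design point is that only `consume()` can fail, and it fails in one place, so every call site gets the same message shape for free.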

- **Declarations**: Refactored `parseTopLevelDeclaration` and
  sub-parsers (Module, Struct, Enum, Fn) to use the new matching
  patterns.
- **Looping Logic**: Replaced recursive-style parsing loops with
  `while(keepParsing)` iterative blocks to prevent stack depth issues
  and clarify termination conditions (e.g., finding a closing brace or
  failing to find a comma).
- **Namespaced Identifiers**: Rewrote `parseNamespacedIdentifier` to
  correctly handle multi-part paths (`A::B::C`) and edge cases.
- **Generic Support**: Improved handling of generic parameter and
  argument lists, ensuring strict enforcement of delimiters like `<` and
  `>`.

- **Contextual Errors**: Updated `peekExpect` to accept a custom
  `message` string, allowing the parser to describe *what* it was
  looking for (e.g., "Expected ';'").
- **Token Lookahead**: Enhanced `peek` and `peekExpect` reliability with
  better bounds checking and buffer management.

- **Removed `lib/src/Parser/AST/AST.cpp`**: Deleted the monolithic AST
  stringification file in favor of the previously introduced modular
  implementations.
- **Build System**: Updated `.gitignore` to ignore
  `cpm-package-lock.cmake`.
Signed-off-by: erick-alcachofa <erick@artichoke.dev>
Signed-off-by: erick-alcachofa <erick@artichoke.dev>
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Relocate core parsing utility methods from the header to the
implementation file to reduce header bloat and improve compilation
times.

- **Parser API**: Moved the definitions of `consume()`,
  `matchAndConsume()`, and `match()` from `Parser.hpp` to `Parser.cpp`.
- **Cleanup**: Removed an unused `<print>` include in `Types.cpp`
  discovered during the refactor.
- **Organization**: Methods are now declared in the header and defined
  in the source file, maintaining a cleaner separation between interface
  and implementation.
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Complete the transition from a declarations-only parser to a functional
imperative parser. This commit introduces the implementation for all
major statement types, loop constructs, and core control flow logic.

- **Match Case Update**: Updated `grammar.ebnf` to use pipe delimiters
  `|id|` for unwrapped variables in match cases, replacing the previous
  parenthetical syntax.
- **Labels**: Implemented loop labeling using the `ident := loop`
  syntax. Labels are validated to ensure they only prefix valid loop
  constructs.
- **Labels and Ranges**: Standardized the use of the `:=` operator for
  both loop labels (`label := loop`) and range-for declarations (`let i
  := range`).

- **Conditional Branches**:
    - Fully implemented `if` and `else` statements.
    - Added support for optional variable unwrapping (e.g., `if (expr)
      |val|`).
    - Supported `else if` chaining by recursively parsing if-statements
      within else-branches.
- **Loops**:
    - **C-Style For**: Implemented `for (init; cond; post)` with
      optional initializers and post-loop expressions.
    - **Range For**: Implemented `for (let i := range)` with mutability
      controls.
    - **While & Do-While**: Implemented standard condition-based loops.
    - **Infinite Loop**: Added the explicit `loop` keyword for infinite
      iteration.
    - **Loop Dispatch**: Added a lookahead mechanism in
      `parseForLoopStatement` to differentiate between C-style and
      Range-style loops based on token positioning.
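
One way such a lookahead could work is sketched below. The commit only says the dispatch is "based on token positioning", so the rule used here (a `;` at nesting depth zero inside the loop header means C-style) is an assumption, and `classifyForLoop` and `ForKind` are hypothetical names operating on plain token texts.

```cpp
#include <string>
#include <vector>

enum class ForKind { CStyle, Range };

// `tokens` holds the token texts that follow the opening '(' of a for loop.
// Assumed rule: a ';' at nesting depth zero before the matching ')' marks
// a C-style loop; otherwise the header is treated as a range loop.
ForKind classifyForLoop(const std::vector<std::string>& tokens) {
    int depth = 0;
    for (const std::string& tok : tokens) {
        if (tok == "(" || tok == "[" || tok == "{") {
            ++depth;
        } else if (tok == ")" || tok == "]" || tok == "}") {
            if (depth == 0) break;  // reached the ')' closing the loop header
            --depth;
        } else if (tok == ";" && depth == 0) {
            return ForKind::CStyle;
        }
    }
    return ForKind::Range;
}
```

Because the scan only inspects tokens and consumes nothing, the real parser can still hand the untouched stream to whichever sub-parser is chosen.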

- **Variables**: Implemented `let`/`def` parsing within local scopes,
  including type annotations and initializers.
- **Defer Logic**: Implemented `defer` and `errdefer` for scope-guarded
  execution.
- **Jumps**: Implemented `break`, `continue` (with optional label
  targets), and `return` (with optional expressions).
- **Match & Switch**: Fully implemented branch parsing, with possible
  default cases via the `_` (underscore) keyword.

- **Expression Integration**: Stubbed `parseExpression` in a new
  `Expressions.cpp` to serve as the integration point for value parsing.
- **OverloadSet**: Integrated `OverloadSet` utility in `Statements.cpp`
  to cleanly handle AST node variant visitation for label injection.
- **Error Handling**: Standardized error reporting across all new paths
  using `langException`, providing specific "expected" messages for
  delimiters and keywords.
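
The label injection through `OverloadSet` presumably follows the usual "overloaded lambdas" idiom for `std::visit`. The sketch below shows that idiom under assumed node types; `LoopStmt`, `WhileStmt`, `ReturnStmt`, and `injectLabel` are hypothetical and much smaller than the real `StatementNode` variant.

```cpp
#include <optional>
#include <string>
#include <variant>

// The common overload-set idiom: inherit operator() from every lambda.
template <class... Fs>
struct OverloadSet : Fs... {
    using Fs::operator()...;
};
template <class... Fs>
OverloadSet(Fs...) -> OverloadSet<Fs...>;

// Hypothetical, heavily reduced statement nodes.
struct LoopStmt { std::optional<std::string> label; };
struct WhileStmt { std::optional<std::string> label; };
struct ReturnStmt {};

using Statement = std::variant<LoopStmt, WhileStmt, ReturnStmt>;

// Attaches `label` when the statement is a loop construct; returns false
// for anything that cannot carry a label so the caller can report an error.
bool injectLabel(Statement& stmt, const std::string& label) {
    return std::visit(
        OverloadSet{
            [&](LoopStmt& s) { s.label = label; return true; },
            [&](WhileStmt& s) { s.label = label; return true; },
            [](ReturnStmt&) { return false; },
        },
        stmt);
}
```

A `false` result corresponds to the validation mentioned earlier: a label that prefixes a non-loop statement is rejected.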
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

This commit addresses several critical issues in the recursive descent
parser, specifically regarding the handling of empty constructs,
statement termination, and AST representation of nested scopes. These
changes bring the implementation in line with the Artichoke EBNF
specification.

* **CodeBlock as Statement:** Added `CodeBlockStmtNode` to the
  `StatementNode` variant. This allows a bare `{}` to be treated as a
  valid statement, enabling manual scoping within functions.
* **Visitor Support:** Updated `toDot.cpp` (Graphviz) and `toString.cpp`
  (Pretty-print) to support the new `CodeBlockStmtNode` during AST
  traversal.

* **Empty Member Lists:** Implemented a pre-loop check for the closing
  brace `}` in `parseStruct` and `parseEnum`. This prevents the parser
  from attempting to parse members in empty declarations (e.g., `struct
  Empty {}`).
* **Diagnostic Accuracy:** Enhanced the member-parsing loop to provide
  better error context. If a member is not followed by a comma or a
  closing brace, the parser now explicitly suggests `',' or '}'` as the
  expected tokens.
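
The pre-loop check and the `',' or '}'` diagnostic can be sketched together. This is a simplified model over token texts, not the code in `Declarations.cpp`; `parseMemberList` is a hypothetical name, members are single tokens, and accepting a trailing comma is an assumption.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// `tokens` holds the token texts after the opening '{' of a struct or enum.
std::vector<std::string> parseMemberList(const std::vector<std::string>& tokens) {
    std::vector<std::string> members;
    std::size_t pos = 0;
    auto peek = [&]() -> std::string {
        return pos < tokens.size() ? tokens[pos] : std::string("<eof>");
    };

    // Pre-loop check: an immediate '}' means the declaration is empty.
    bool keepParsing = peek() != "}";
    while (keepParsing) {
        if (pos >= tokens.size()) {
            throw std::runtime_error("Expected member, found '<eof>'");
        }
        members.push_back(tokens[pos++]);
        if (peek() == ",") {
            ++pos;
            keepParsing = peek() != "}";  // assumed: a trailing comma is allowed
        } else if (peek() == "}") {
            keepParsing = false;
        } else {
            throw std::runtime_error("Expected ',' or '}', found '" + peek() + "'");
        }
    }
    return members;
}
```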

* **Nested Scopes:** The parser now correctly identifies a `{` at the
  start of a statement and dispatches to `parseCodeBlock`.
* **Empty Code Blocks:** Added a guard in the block-parsing loop to
  check for `}` immediately after `{`, allowing functions or nested
  scopes to be empty.

* **C-Style For-Loops:** Replaced `match` with `matchAndConsume` for the
  initialization semicolon. This allows the parser to correctly handle
  loops where the initialization is omitted (e.g., `for (; 1; 1)`).

* **Correctness:** Resolves parser hangs or errors when encountering
  empty blocks.
* **Compliance:** Fully supports the EBNF definition of zero-or-more
  members/statements.
* **Visuals:** AST diagrams now accurately reflect nested block
  structures.
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Overhaul the expression parsing mechanism to utilize a Pratt (top-down
operator precedence) parser. This change provides a more scalable and
maintainable way to handle operator precedence and associativity
compared to standard recursive descent.

As part of this transition, the nomenclature for operators has been
refined to reflect their position in the grammar (Prefix, Infix,
Postfix) rather than their arity.

* Renamed `UnaryOperator` and `UnaryExpression` to `PrefixOperator` and
  `PrefixExpression`.
* Renamed `BinaryOperator` and `BinaryExpression` to `InfixOperator` and
  `InfixExpression`.
* Renamed `ScopeAccessExpression` to `ModuleAccessExpression`.
* Introduced `PostfixOperator` enum and associated logic for function
  calls, slicing, and reflection attributes.
* Updated `toDot.cpp` and `toString.cpp` to support the new node types
  and renamed operators.

* Added `Pratt.hpp` and `Pratt.cpp` to define `BindingPower` and map
  operators to their respective precedence levels.
* Added `Operators.cpp` to handle token-to-operator mapping and
  classification (isPrefix, isInfix, isPostfix).
* Refactored `Parser::parseExpression` to implement the core Pratt loop
  using binding power comparisons.
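
For readers unfamiliar with the technique, the core Pratt loop can be sketched in a few lines. The binding powers below are illustrative only and are not the values in `Pratt.cpp`; `PrattSketch` and `infixPower` are hypothetical names, operands are single tokens, and the result is a fully parenthesized string rather than an AST.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct BindingPower {
    int left;
    int right;
};

// Illustrative table: left < right gives left associativity,
// left > right gives right associativity.
std::optional<BindingPower> infixPower(const std::string& op) {
    if (op == "=") return BindingPower{2, 1};
    if (op == "+" || op == "-") return BindingPower{3, 4};
    if (op == "*" || op == "/") return BindingPower{5, 6};
    return std::nullopt;
}

class PrattSketch {
public:
    explicit PrattSketch(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    std::string parseExpression(int minPower = 0) {
        // Operand position; throws std::out_of_range if input ends early.
        std::string left = tokens_.at(pos_++);
        while (pos_ < tokens_.size()) {
            std::optional<BindingPower> power = infixPower(tokens_[pos_]);
            // Stop when the next token is not an infix operator or binds
            // less tightly than the caller allows.
            if (!power || power->left < minPower) break;
            std::string op = tokens_[pos_++];
            std::string right = parseExpression(power->right);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

private:
    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};
```

Precedence and associativity both fall out of the single comparison against `minPower`, which is why adding an operator only requires a new table entry.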

* Moved literal parsing logic into a dedicated `Literals.cpp`.
* Implemented explicit parsing methods for `Integer`, `Float`, `Char`,
  `String`, `Boolean`, and `Null` literals.
* Added support for `this` and `_` (underscore) as identifier
  expressions.

* **Prefix**: `!`, `-`, `~`, `&` (MemPtr), `*` (DerefPtr).
* **Infix**: Arithmetic, Comparison, Bitwise, Logical, and all Compound
  Assignments.
* **Postfix**: `()` (Call), `[]` (Slice/Access), `.#` (Slice length),
  `.*` (Slice pointer), and `.@` (Reflection).

* **Missing Literals**: Struct literals and Array literals are not yet
  implemented in the new parsing flow.
* **Node Specialization**: `MemberAccess`, `PointerMemberAccess`, and
  `ModuleAccess` currently use generic infix logic and need to be
  migrated to their specific AST node types.
* **Error Handling**: Literal parsing (specifically `std::stold` and
  `std::stoul`) needs safety checks to prevent potential exceptions
  during conversion.
* **Diagnostics**: Refine the error message for unexpected tokens in
  postfix expressions to explicitly list supported operators.
* **Generic Ambiguity**: Generic type/function instantiation currently
  causes parsing conflicts with comparison operators (e.g., `Foo<T>`).
  This is a known issue that will be resolved by transitioning the
  grammar to a turbofish-style `::<...>` syntax.
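
For the literal-conversion item in the list above, one possible shape for the missing safety checks is sketched here. It is not the fix that was eventually applied: it swaps `std::stoul` for `std::from_chars` on integers and wraps `std::stold` in a `try` block for floats, and the function names are hypothetical.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <system_error>

// Integer literals: std::from_chars reports failure through an error code
// instead of throwing, and tells us how much input it consumed.
std::optional<unsigned long> parseIntegerLiteral(const std::string& text) {
    unsigned long value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}

// Float literals: std::stold can throw std::invalid_argument or
// std::out_of_range, so both are converted into an empty optional.
std::optional<long double> parseFloatLiteral(const std::string& text) {
    try {
        std::size_t consumed = 0;
        long double value = std::stold(text, &consumed);
        if (consumed != text.size()) return std::nullopt;
        return value;
    } catch (const std::invalid_argument&) {
        return std::nullopt;
    } catch (const std::out_of_range&) {
        return std::nullopt;
    }
}
```

An empty optional can then be turned into a regular parse error, so an oversized literal produces a diagnostic instead of an uncaught exception.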
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Refactor the termination logic for generic parameter lists in type
parsing to correctly handle nested generics. By replacing manual peeking
with `peekExpect(TokenV::opGt)`, the parser now correctly handles cases
where two closing angle brackets appear consecutively (e.g.,
`List<List<Int>>`).

Previously, the parser manually checked for a literal `>` token. If the
lexer encountered `>>` (a right-shift operator), the parser would fail
to recognize it as two closing brackets. The transition to `peekExpect`
allows the tokenizer to "split" the `>>` token into two individual `>`
tokens when a single closing bracket is expected, resolving the classic
nested template ambiguity.
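
The splitting idea can be modeled as follows. In the PR the split happens through `peekExpect` and the tokenizer; this sketch instead rewrites a `>>` entry in a plain vector of token texts, which is an assumption made to keep the example self-contained. `AngleSplitter` and its methods are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class AngleSplitter {
public:
    explicit AngleSplitter(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    // Parses Name or Name<Type, ...> and returns a normalized spelling.
    std::string parseType() {
        if (pos_ >= tokens_.size()) throw std::runtime_error("Expected type name");
        std::string result = tokens_[pos_++];
        if (peekIs("<")) {
            ++pos_;
            result += "<" + parseType();
            while (peekIs(",")) {
                ++pos_;
                result += ", " + parseType();
            }
            if (!expectClosingAngle()) throw std::runtime_error("Expected '>'");
            result += ">";
        }
        return result;
    }

    bool atEnd() const { return pos_ == tokens_.size(); }

private:
    bool peekIs(const std::string& text) const {
        return pos_ < tokens_.size() && tokens_[pos_] == text;
    }

    // Consumes exactly one '>'. A '>>' token is rewritten to '>' in place,
    // so the second bracket remains available to the enclosing generic list.
    bool expectClosingAngle() {
        if (peekIs(">")) {
            ++pos_;
            return true;
        }
        if (peekIs(">>")) {
            tokens_[pos_] = ">";
            return true;
        }
        return false;
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};
```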

Key changes:
- Replaced manual token validation and error reporting with
  `peekExpect`.
- Enabled support for nested generic types without requiring spaces
  between closing brackets.
- Simplified the `keepParsing` loop state in `lib/src/Parser/Types.cpp`.
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Overhaul the AST and parser logic to support explicit generic
instantiation in expressions (e.g., `Result::<u32, u32>::Ok(0)`). This
is achieved by implementing the "turbofish" operator (`::<>`) and
specializing how member and module access are handled.

* Added `GenericExpression` to represent generic instantiations in
  expressions.
* Updated the Pratt parser to look for `<` immediately following a `::`
  (ModuleAccess) operator. If found, it parses a `GenericExpression`
  containing the generic arguments.
* This change resolves the ambiguity between generic lists and
  comparison operators in the expression parser.

* Renamed `PointerAccessExpression` to `PointerMemberAccessExpression`.
* Refactored `MemberAccessExpression` and
  `PointerMemberAccessExpression` to store the member as an
  `ExpressionNode`. This allows the right-hand side of a `.` or `->` to
  be a complex expression (like a generic call).
* Simplified `ModuleAccessExpression` to a binary `left`/`right`
  structure, separating scope resolution from generic instantiation.

* Flattened the `Type` AST: replaced recursive `baseType` structures
  with a `Vector<TypeExpressionNode>` (`typeNodes`) to represent
  namespaced paths (e.g., `std::collections::Map`) more efficiently.
* Removed redundant `NamespacedType` and `NamespacedIdentifier` nodes.
* Simplified `GenericType` and `IdentifierType` to use direct `String`
  type names.

* Refactored `parseType` to iterate through namespaced components and
  populate the new flattened `typeNodes` vector.
* Updated the Pratt infix loop to correctly dispatch to `ModuleAccess`,
  `MemberAccess`, or `GenericExpression` based on the operator and
  lookahead tokens.
* Adjusted `toDot` and `toString` visitors to match the new AST
  definitions.
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Implement support for object literals using a unified syntax for both
struct and slice initialization. Since the parser lacks the semantic
context to distinguish between a struct or a slice at this stage, both
are represented by the new `ObjectLiteral` AST node.

Object literals are initialized within curly braces following a type expression:
* **Named Initializers**: Uses the `.field = value` syntax (e.g.,
  `Point { .x = 10, .y = 20 }`).
* **Positional Initializers**: Uses a comma-separated list of
  expressions (e.g., `[]i32 { 1, 2, 3 }`).

* Renamed `StructLiteral` and `SliceLiteral` nodes to `ObjectLiteral`.
* Refactored initialization helper nodes (e.g.,
  `StructLiteralNamedFieldInit` is now `ObjectLiteralNamedFieldInit`).
* Unified the representation in `Expressions.hpp` and `Literals.hpp` to
  use a single `ObjectLiteral` struct containing a `type` and an
  optional `initializer`.

* Integrated the opening brace `{` (`opLSquirly`) as a high-precedence
  postfix operator (binding power 19).
* Implemented parsing logic in `Expressions.cpp` to handle the
  transition from a type expression to an object initializer.
* Updated `toDot` and `toString` visitors to handle the unified
  `ObjectLiteral` nodes and their respective initializer variants.

* Improved robustness in `Declarations.cpp` by ensuring list parsing
  correctly handles closing braces in specific edge cases.
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Implement precedence capping in `parseExpression` for switch cases to
prevent the parser from misinterpreting the case arrow (`->`) as a
pointer member access operator.

Additionally, increased the binding power of `ModuleAccess` (::) to
ensure namespaced identifiers are correctly resolved within case
patterns before hitting the precedence limit.

- Use `PointerMemberAccess.right` as the precedence floor for cases.
- Update `ModuleAccess` binding power to {23, 24}.
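
The effect of the floor can be seen in a reduced Pratt loop. The `{23, 24}` pair for `ModuleAccess` comes from this commit; the `{21, 22}` pair for `PointerMemberAccess` is an assumed value chosen only so that it sits below it, and `CaseParser` is a hypothetical name.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct BindingPower {
    int left;
    int right;
};

constexpr BindingPower kPointerMemberAccess{21, 22};  // assumed values
constexpr BindingPower kModuleAccess{23, 24};         // from the commit message

std::optional<BindingPower> infixPower(const std::string& op) {
    if (op == "->") return kPointerMemberAccess;
    if (op == "::") return kModuleAccess;
    return std::nullopt;
}

struct CaseParser {
    std::vector<std::string> tokens;
    std::size_t pos = 0;

    std::string parseExpression(int minPower) {
        std::string left = tokens.at(pos++);
        while (pos < tokens.size()) {
            std::optional<BindingPower> power = infixPower(tokens[pos]);
            if (!power || power->left < minPower) break;
            std::string op = tokens[pos++];
            std::string right = parseExpression(power->right);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    // Case patterns start at the right binding power of '->', so '->' itself
    // is never consumed as an operator while '::' still is.
    std::string parseCasePattern() {
        return parseExpression(kPointerMemberAccess.right);
    }
};
```

In an ordinary expression context the same `->` token is still parsed as pointer member access, because the caller passes a lower starting power.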
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Implement TypeExpression AST node to allow types to be used within
expressions, enabling the parsing of anonymous slice and array
initializers like `[]Type { ... }`.

- Register `[` as a prefix-style token (NUD) in the Pratt parser.
- Add `TypeExpression` node to AST and expression variants.
- Update `toDot` and `toString` visitors for AST visualization.
- Update the frontend to open source files directly, fixing issues when
  opening paths.
Signed-off-by: erick-alcachofa <erick@artichoke.dev>

Update the SliceAccess postfix operator logic to handle the full variety
of slice range syntaxes. This allows for open-ended slices by making the
start and end expressions optional within the brackets.

- Add logic to detect a leading colon for `[:end]` and `[:]` forms.
- Support trailing colons for `[start:]` forms.
- Differentiate between a single index access and a slice range based on
  the presence of the colon operator.
- Update SliceRangeExprNode construction to handle optional boundaries.
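
The branching described in this list can be sketched as below. It is a simplified model in which each bound is a single token text; `SliceAccess` and `parseSliceAccess` are hypothetical names and do not reflect the actual `SliceRangeExprNode` layout.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

struct SliceAccess {
    bool isRange = false;
    std::optional<std::string> start;  // holds the index for a plain access
    std::optional<std::string> end;
};

// `tokens` holds the token texts after '[' up to and including ']'.
SliceAccess parseSliceAccess(const std::vector<std::string>& tokens) {
    std::size_t pos = 0;
    auto peek = [&]() -> const std::string& {
        if (pos >= tokens.size()) throw std::runtime_error("Unexpected end of input");
        return tokens[pos];
    };

    SliceAccess result;
    if (peek() == "]") throw std::runtime_error("Expected index or ':'");
    if (peek() != ":") result.start = tokens[pos++];  // no leading colon
    if (peek() == ":") {
        result.isRange = true;  // the colon is what makes it a range
        ++pos;
        if (peek() != "]") result.end = tokens[pos++];  // end may be omitted
    }
    if (peek() != "]") throw std::runtime_error("Expected ']'");
    return result;
}
```

All four range forms (`[a:b]`, `[:b]`, `[a:]`, `[:]`) and the plain index form go through the same function, with the colon alone deciding which node kind is built.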
me self-assigned this 2025-12-28 11:52:44 -06:00
me merged commit 25486fbace into main 2025-12-28 11:54:12 -06:00