Lexer File

Kyle Brown edited this page Apr 22, 2020 · 3 revisions

Introduction

The Lexer file precisely describes how to divide a given input into tokens. It does this by employing a header, which configures various settings and features, and a body, which is composed of lexing rules. These two sections are separated by a single line containing only three hyphens ---.

<Header>
---
<Body>

Note: "Ignoring" tokens is not allowed. Every character of the input must belong to some token, because a complete token representation of the input can be necessary for formatting, document generation, code generation, or refactoring tooling.

Header

The header is the section of the lexer file preceding the --- delimiter. It must be valid TOML and is used to specify settings/meta-data for the lexer.

This format assumes no defaults for parameters and makes no assumptions about what users want. This decision emphasizes clarity and backwards compatibility; its main downside, verbosity, is of limited importance because the total number of settings is small.
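
One consequence of having no defaults is that a tool reading the header can report every setting that was left out instead of guessing a value. The sketch below assumes the header has already been parsed into nested dictionaries; the `REQUIRED` list is an assumption based on the pattern and matching settings this page documents, and `missing_settings` is a hypothetical helper name.

```python
# Assumption: these are the settings every header must state explicitly.
REQUIRED = [
    ("pattern", "string", "allowed"),
    ("pattern", "regex", "allowed"),
    ("matching", "span"),
    ("matching", "priority"),
]


def missing_settings(header: dict) -> list[str]:
    """Return the dotted names of required settings absent from the header."""
    missing = []
    for path in REQUIRED:
        node = header
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


print(missing_settings({"matching": {"span": "lazy"}}))
```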

Pattern

The pattern section defines what pattern elements and operators are allowed.

To enable a specific pattern feature, set pattern.<feature>.allowed to true.

| Name | Description |
|---|---|
| string | Allows patterns of the form " ... " which match their contents as literal strings |
| regex | Allows patterns of the form / ... / which represent regular expressions |
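
To show how these flags might be consulted, the sketch below classifies a pattern by its delimiters and looks up the matching `allowed` flag in a parsed header. Both function names are hypothetical, and treating a missing flag as "not allowed" is an assumption; because the format has no defaults, a real tool might report the missing flag as an error instead.

```python
def pattern_feature(pattern: str) -> str:
    """Name the pattern feature a pattern uses, judged by its delimiters."""
    pattern = pattern.strip()
    if len(pattern) >= 2 and pattern[0] == pattern[-1] == '"':
        return "string"
    if len(pattern) >= 2 and pattern[0] == pattern[-1] == "/":
        return "regex"
    raise ValueError(f"unrecognized pattern form: {pattern!r}")


def is_allowed(header: dict, pattern: str) -> bool:
    """True only if the header explicitly sets pattern.<feature>.allowed = true."""
    feature = pattern_feature(pattern)
    # Assumption: an absent flag counts as disabled.
    return header.get("pattern", {}).get(feature, {}).get("allowed") is True


header = {"pattern": {"string": {"allowed": True}, "regex": {"allowed": False}}}
print(is_allowed(header, '"if"'))
print(is_allowed(header, "/[0-9]+/"))
```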

In the future, more pattern features may be added, such as the following, which are based on PEG and EBNF/ABNF.

| Name | Description |
|---|---|
| disjunction | Allows patterns of the form <pattern> \| <pattern> which represent the logical or of the two patterns |
| concatenation | Allows patterns of the form <pattern> ~ <pattern> which represent the concatenation of the two patterns |
| grouping | Allows patterns of the form ( <pattern> ) to be used on their own and within other patterns for clarity |

Matching

The matching section defines how the lexer decides which tokens to match.

span

When reading a token from a specific start point in the input, the span setting dictates which of the possible spans of input to match:

  • greedy - Match the longest valid span
  • lazy - Match the shortest valid span
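
The difference between the two settings can be sketched by brute force: list every end position at which some rule accepts the span, then take the largest or the smallest. This is an illustration only, not how a generated lexer would be implemented. It assumes tokens are non-empty and uses Python's `re` dialect as a stand-in for the regex dialect of this format; the function names are hypothetical.

```python
import re


def valid_ends(regexes, text, start):
    """Every end index such that text[start:end] is fully matched by some rule."""
    return [
        end
        for end in range(start + 1, len(text) + 1)  # assumption: no empty tokens
        if any(re.fullmatch(rx, text[start:end]) for rx in regexes)
    ]


def choose_span(regexes, text, start, span):
    """Return the end index selected by the span setting ("greedy" or "lazy")."""
    ends = valid_ends(regexes, text, start)
    if not ends:
        raise ValueError(f"no rule matches at offset {start}")
    if span == "greedy":
        return max(ends)
    if span == "lazy":
        return min(ends)
    raise ValueError(f"unknown span setting: {span!r}")


rules = ["<", "<="]
print(choose_span(rules, "<=1", 0, "greedy"))  # selects "<="
print(choose_span(rules, "<=1", 0, "lazy"))    # selects "<"
```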

priority

When there are multiple rules that accept a given span, the priority setting dictates which rule is to be used.

  • specific - Match specific rules (strings and exact regexes) before broad rules
  • first - Match the rule that appears first in this file
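
A minimal sketch of how the two priority settings could resolve a tie between rules that accept the same span. It simplifies "specific" to mean string patterns only, since this page does not define what counts as an exact regex, and it assumes that file order breaks any remaining ties. The name `pick_rule` is hypothetical.

```python
def pick_rule(candidates, priority):
    """Choose one rule name from candidates that all accept the same span.

    candidates: list of (name, kind) pairs in file order, where kind is
    "string" or "regex".
    """
    if not candidates:
        raise ValueError("no candidate rules")
    if priority == "first":
        return candidates[0][0]
    if priority == "specific":
        # Simplification: only string patterns are treated as specific.
        for name, kind in candidates:
            if kind == "string":
                return name
        # Assumption: fall back to file order when no string rule matches.
        return candidates[0][0]
    raise ValueError(f"unknown priority setting: {priority!r}")


# Both rules accept the span "if"; ident appears first in the file.
candidates = [("ident", "regex"), ("k_if", "string")]
print(pick_rule(candidates, "first"))     # ident
print(pick_rule(candidates, "specific"))  # k_if
```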

Feature Sections

A feature section is a section whose presence enables and then configures a given feature. More feature sections may be added over time as necessary; this will not affect specs that do not use these features.

Pydent

The Pydent feature adds support for Python-style Ident/Dedent token generation. Because tracking nested indentation levels requires a stack, the generated parser must be a form of pushdown automaton instead of a DFA.

The following parameters must all be defined.

| Name | Description |
|---|---|
| ident.name | The token name to assign to the Ident tokens generated by this rule |
| dedent.name | The token name to assign to the Dedent tokens generated by this rule |
| newline.name | The token name to assign to the Newline tokens generated by this rule |
| space.name | The token name to assign to the Space tokens generated by this rule |
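
The stack behind this feature can be sketched as follows. The code mirrors the general approach Python's own tokenizer takes to indentation; it is not this format's specified behavior. It emits only the layout token names, measures indentation in leading spaces, and leaves out Space tokens and blank-line handling, which this page does not define. The function name `layout_tokens` is hypothetical.

```python
def layout_tokens(lines, names):
    """Return the Ident/Dedent/Newline token names produced for the given lines.

    names maps "ident", "dedent" and "newline" to the configured token names.
    """
    stack = [0]  # indentation widths of the currently open blocks
    tokens = []
    for line in lines:
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append(names["ident"])
        while width < stack[-1]:
            stack.pop()
            tokens.append(names["dedent"])
        if width != stack[-1]:
            raise ValueError("dedent does not match any open indentation level")
        tokens.append(names["newline"])
    # Close any blocks still open at the end of the input.
    while len(stack) > 1:
        stack.pop()
        tokens.append(names["dedent"])
    return tokens


names = {"ident": "IDENT", "dedent": "DEDENT", "newline": "NEWLINE"}
print(layout_tokens(["if x:", "    y", "z"], names))
```

The unbounded stack is what a DFA cannot model, which is why enabling this feature changes the class of automaton required.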

Body

The body contains the actual definition of the lexer rules, which are individually defined in the form <name> = <pattern>; where the name matches this pattern: /[a-zA-Z_][a-zA-Z0-9_]*/. The pattern can be any valid pattern that uses only the pattern features enabled in the pattern section.
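
A small sketch of reading one rule line and checking its name against the pattern above. The examples on this page do not show a trailing `;` after each rule, so this sketch accepts a rule with or without one; that leniency is an assumption, as is the function name `parse_rule`.

```python
import re

# The rule-name pattern given in the text above.
NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def parse_rule(line: str) -> tuple[str, str]:
    """Split a rule line into (name, pattern text) and validate the name."""
    # Split at the first "=" only, since patterns such as "==" contain "=".
    name, sep, pattern = line.partition("=")
    if not sep:
        raise ValueError(f"not a rule: {line!r}")
    name = name.strip()
    if not NAME.fullmatch(name):
        raise ValueError(f"invalid rule name: {name!r}")
    pattern = pattern.strip()
    # Assumption: a trailing ";" is an optional terminator, not pattern text.
    if pattern.endswith(";"):
        pattern = pattern[:-1].rstrip()
    return name, pattern


print(parse_rule('op_eq = "==";'))
print(parse_rule('k_if = "if"'))
```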

Examples

In the following examples, the header and body are shown in separate code blocks so that TOML highlighting can be employed in the header. The delimiter has been omitted but would appear between the information provided in the code blocks.

Simple Lexer

[pattern]
string.allowed = true
regex.allowed = true

[matching]
span = "greedy"
priority = "specific"

whitespace = /\s+/
ident = /[a-zA-Z_][a-zA-Z0-9_]*/
number = /[1-9][0-9]*(\.[0-9]+)?/
k_if = "if"
op_eq = "=="
op_lt = "<"
op_gt = ">"
op_lte = "<="
op_gte = ">="
op_plus = "+"
op_minus = "-"
op_assign = "="
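
To make the behavior of this configuration concrete, here is a brute-force Python sketch of a tokenizer using the rules from the Simple Lexer example with greedy span and specific priority. The decimal point in `number` is written as `\.` here so that it matches only a literal dot. Python's `re` module stands in for the regex dialect of this format, "specific" is simplified to "string rules first, then file order", and the function names are hypothetical.

```python
import re

# (name, kind, pattern) in file order, taken from the Simple Lexer example.
RULES = [
    ("whitespace", "regex", r"\s+"),
    ("ident", "regex", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("number", "regex", r"[1-9][0-9]*(\.[0-9]+)?"),
    ("k_if", "string", "if"),
    ("op_eq", "string", "=="),
    ("op_lt", "string", "<"),
    ("op_gt", "string", ">"),
    ("op_lte", "string", "<="),
    ("op_gte", "string", ">="),
    ("op_plus", "string", "+"),
    ("op_minus", "string", "-"),
    ("op_assign", "string", "="),
]


def accepts(kind, pattern, text):
    """True if the rule matches the whole of text."""
    if kind == "string":
        return text == pattern
    return re.fullmatch(pattern, text) is not None


def tokenize(text):
    """Return a list of (rule name, matched text) covering the whole input."""
    tokens, pos = [], 0
    while pos < len(text):
        # span = "greedy": try the longest candidate span first.
        for end in range(len(text), pos, -1):
            span = text[pos:end]
            matches = [(n, k) for n, k, p in RULES if accepts(k, p, span)]
            if matches:
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
        # priority = "specific": prefer a string rule, else the first in file order.
        name = next((n for n, k in matches if k == "string"), matches[0][0])
        tokens.append((name, span))
        pos = end
    return tokens


source = "if x <= 10"
tokens = tokenize(source)
print(tokens)
# Nothing is ignored: the token texts reassemble the input exactly.
print("".join(text for _, text in tokens) == source)
```

Whitespace is an ordinary rule in this configuration, which is how the example satisfies the requirement that all input be placed within tokens.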

Python Style Lexer

[pydent]
ident.name = "IDENT"
dedent.name = "DEDENT"
newline.name = "NEWLINE"
space.name = "WS"

[pattern]
string.allowed = true
regex.allowed = true

[matching]
span = "greedy"
priority = "specific"
...
