Lexer File
The Lexer file precisely describes how to divide a given input into tokens.
It does this by employing a header, which configures various settings and features,
and a body, which is composed of lexing rules.
These two sections are separated by a single line containing only three hyphens (`---`).
```
<Header>
---
<Body>
```
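Putting the two sections together, a complete minimal lexer file could look like the following. The specific settings and rules here are illustrative only, mirroring the examples later in this document:

```
[pattern]
string.allowed = true
regex.allowed = true

[matching]
span = "greedy"
priority = "first"
---
k_let = "let"
ident = /[a-zA-Z_][a-zA-Z0-9_]*/
whitespace = /\s+/
```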
Note: The practice of "ignoring" tokens is not allowed; all of the input must be placed within tokens. Representing the complete input as tokens can be necessary for formatting, document generation, code generation, or refactoring tooling.
The header is the section of the lexer file preceding the `---` delimiter.
It must be valid TOML and is used to specify settings/meta-data for the lexer.
This format assumes no defaults for parameters and makes no assumptions about what users want. This decision has been made to emphasize clarity and backwards compatibility; its main downside, verbosity, is of limited importance because the total number of settings is small.
The pattern section defines which pattern elements and operators are allowed.
To enable a specific pattern feature, set `pattern.<feature>.allowed` to `true`.
| Name | Description |
|---|---|
| `string` | Allows patterns of the form `" ... "`, which match their contents as literal strings |
| `regex` | Allows patterns of the form `/ ... /`, which represent regular expressions |
In the future, more pattern features may be added, such as the following, which are based on PEG and EBNF/ABNF.
| Name | Description |
|---|---|
| `disjunction` | Allows patterns of the form `<pattern> \| <pattern>`, which represent the logical or of the two patterns |
| `concatenation` | Allows patterns of the form `<pattern> ~ <pattern>`, which represent the concatenation of the two patterns |
| `grouping` | Allows patterns of the form `( <pattern> )`, which may be used on their own and within other patterns for clarity |
The matching definition defines how the lexer will decide which tokens to match.
When reading a token from a specific start point in the input, the `span` setting dictates which possible span of input to match:

- `greedy` - Match the longest valid span
- `lazy` - Match the shortest valid span
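The difference between the two span settings can be illustrated with a small sketch. `match_span` is a hypothetical helper written for this illustration, not part of any generated lexer:

```python
import re

def match_span(pattern, text, span="greedy"):
    # Collect every prefix of `text` that the pattern fully accepts,
    # then pick according to the `span` setting: the longest prefix
    # for "greedy", the shortest for "lazy".
    candidates = [text[:i] for i in range(1, len(text) + 1)
                  if re.fullmatch(pattern, text[:i])]
    if not candidates:
        return None
    return candidates[-1] if span == "greedy" else candidates[0]
```

For example, matching the pattern `/a+/` against the input `aaab`, a greedy lexer takes `aaa` while a lazy one takes only `a`.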
When there are multiple rules that accept a given span, the `priority` setting dictates which rule is used:

- `specific` - Match specific rules (strings and exact regexes) before broad rules
- `first` - Match the rule that appears first in this file
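A sketch of how the two priority settings diverge, using a hypothetical two-rule table where a broad regex rule appears before a specific string rule (`pick_rule` is an illustrative helper, not part of the spec):

```python
import re

# Hypothetical rule table in file order: (name, pattern, is_literal).
# `ident` is a broad regex rule; `k_if` is a specific string rule.
RULES = [
    ("ident", r"[a-zA-Z_][a-zA-Z0-9_]*", False),
    ("k_if", r"if", True),
]

def pick_rule(accepting, priority="specific"):
    # Under "specific", literal string rules beat broad regex rules;
    # under "first", plain file order wins.
    if priority == "specific":
        literals = [rule for rule in accepting if rule[2]]
        if literals:
            return literals[0]
    return accepting[0]

# Both rules accept the span "if"; the priority setting breaks the tie.
accepting = [rule for rule in RULES if re.fullmatch(rule[1], "if")]
```

With `priority = "specific"` the span `if` becomes a `k_if` token; with `priority = "first"` it becomes an `ident` token, because `ident` is listed earlier.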
A feature section is a section whose presence enables and then configures a given feature. More feature sections may be added over time as necessary; this will not affect specs which do not use these features.
The Pydent feature adds support for Python-style Ident/Dedent token generation. This means that the generated lexer must be a form of pushdown automaton instead of a DFA.
The following parameters must all be defined.
| Name | Description |
|---|---|
| `ident.name` | The token name to assign to the Ident tokens generated by this rule |
| `dedent.name` | The token name to assign to the Dedent tokens generated by this rule |
| `newline.name` | The token name to assign to the Newline tokens generated by this rule |
| `space.name` | The token name to assign to the Space tokens generated by this rule |
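One plausible sketch of what the feature implies: an indentation stack turns changes in leading whitespace into Ident/Dedent tokens, which is why a plain DFA no longer suffices. The token names mirror the parameter table above; the function itself is hypothetical and omits the ordinary tokens within each line:

```python
def pydent_tokens(lines, ident="IDENT", dedent="DEDENT", newline="NEWLINE"):
    # Track indentation widths on a stack (the "pushdown" part).
    # A deeper line pushes a width and emits one Ident token; a
    # shallower line pops and emits one Dedent per level closed.
    stack, out = [0], []
    for line in lines:
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            out.append(ident)
        while width < stack[-1]:
            stack.pop()
            out.append(dedent)
        out.append(newline)
    while len(stack) > 1:  # close any indents still open at end of input
        stack.pop()
        out.append(dedent)
    return out
```

For the three lines `if x:`, `  y`, `z`, this sketch yields `NEWLINE`, `IDENT`, `NEWLINE`, `DEDENT`, `NEWLINE`: the second line opens one indentation level and the third closes it.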
The body contains the actual definitions of the lexer rules, each written in the form `<name> = <pattern>;` where the name matches the pattern `/[a-zA-Z_][a-zA-Z0-9_]*/`.
The pattern can be any valid pattern, which uses only the pattern features enabled in the pattern section.
In the following examples, the header and body are shown in separate code blocks so that TOML highlighting can be employed in the header. The delimiter has been omitted but would appear between the information provided in the code blocks.
```toml
[pattern]
string.allowed = true
regex.allowed = true

[matching]
span = "greedy"
priority = "specific"
```
```
whitespace = /\s+/
ident = /[a-zA-Z_][a-zA-Z0-9_]*/
number = /[1-9][0-9]*(\.[0-9]+)?/
k_if = "if"
op_eq = "=="
op_lt = "<"
op_gt = ">"
op_lte = "<="
op_gte = ">="
op_plus = "+"
op_minus = "-"
op_assign = "="
```
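The greedy/specific behavior of this example can be sketched in a few lines of Python. This loop covers only a subset of the rules above and is an illustration of the matching semantics, not the reference implementation:

```python
import re

# A subset of the example rules, in file order: (name, regex, is_literal).
# String rules are escaped so they match their contents literally.
RULES = [
    ("whitespace", r"\s+", False),
    ("ident", r"[a-zA-Z_][a-zA-Z0-9_]*", False),
    ("number", r"[1-9][0-9]*(\.[0-9]+)?", False),
    ("k_if", re.escape("if"), True),
    ("op_lte", re.escape("<="), True),
    ("op_lt", re.escape("<"), True),
]

def tokenize(text):
    out, pos = [], 0
    while pos < len(text):
        # span = "greedy": scan for the longest span any rule accepts.
        end = next((i for i in range(len(text), pos, -1)
                    if any(re.fullmatch(p, text[pos:i]) for _, p, _ in RULES)),
                   None)
        if end is None:
            raise ValueError(f"no rule matches at offset {pos}")
        span = text[pos:end]
        accepting = [r for r in RULES if re.fullmatch(r[1], span)]
        # priority = "specific": literal string rules beat broad regexes.
        literals = [r for r in accepting if r[2]]
        out.append(((literals or accepting)[0][0], span))
        pos = end
    return out
```

On the input `if x <= 10`, the span `if` is accepted by both `ident` and `k_if` but resolves to `k_if` (specific wins), and `<=` resolves to `op_lte` rather than `op_lt` followed by `op_assign` (greedy wins). Note that the whole input, including whitespace, is covered by tokens.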
A second example header, this time enabling the Pydent feature:

```toml
[pydent]
ident.name = "IDENT"
dedent.name = "DEDENT"
newline.name = "NEWLINE"
space.name = "WS"

[pattern]
string.allowed = true
regex.allowed = true

[matching]
span = "greedy"
priority = "specific"
```

...