Lexical Rules

Purpose

The Lexical Rules define the complete vocabulary recognized by the ARES language. Every ARES program begins as a simple stream of individual characters. The Lexical Rules are the set of instructions that tell the compiler how to group those characters into meaningful units called Tokens. This is the very first step in understanding your code.

Why it exists

A computer cannot understand a program as a single long string of text. It needs to know which parts are names of variables, which parts are math symbols, and which parts are instructions like read or print. The lexical rules exist to provide this basic understanding. They act like a dictionary for the language. By grouping characters into tokens, the system knows what kind of thing each piece of text is before it begins to analyze the structure and logic of your program.

How it works

The system uses a tool called a "lexer" to scan your code from left to right.

Categorizing tokens. Every word and symbol in ARES is assigned to a specific category, such as Keywords, Operators, Literals (like numbers and text), or Punctuation.
Matching patterns. The system uses specific patterns (known as Regular Expressions) to identify these tokens. For example, any whole number is identified as a NumberLiteral.
The Priority Rule. Because the same characters can match more than one pattern, the system uses a strict priority order to decide which token to choose. For example, it checks whether a word is a keyword like read before it treats it as an ordinary variable name.
Skipping noise. The system automatically identifies and ignores characters that don't affect the logic, such as spaces, tabs, and comments. This is called "skipping," and it ensures the compiler only focuses on the meaningful parts of your code.
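
The four steps above can be sketched as a small hand-written lexer. This is an illustration only, not the code in src/parser/lexer.ts: the category names other than NumberLiteral, the keyword list, and the operator and punctuation sets are assumptions made for the example, and comments are left out because their syntax is not described here.

```typescript
// Hypothetical rule and token shapes; the real definitions live in src/parser/lexer.ts.
interface Rule {
  name: string;
  pattern: RegExp; // sticky (y) flag: the pattern may only match at lastIndex
  skip?: boolean;  // skipped rules consume characters but emit no token
}

interface Token {
  type: string;
  text: string;
  offset: number;
}

// Priority order: earlier rules win, so Keyword is listed before Identifier.
const rules: Rule[] = [
  { name: "WhiteSpace", pattern: /\s+/y, skip: true },
  { name: "Keyword", pattern: /(?:if|read|print)\b/y },
  { name: "Identifier", pattern: /[A-Za-z_][A-Za-z0-9_]*/y },
  { name: "NumberLiteral", pattern: /\d+/y },
  { name: "Operator", pattern: /==|=|\+|-|\*|\//y },
  { name: "Punctuation", pattern: /[();,]/y },
];

function tokenize(source: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < source.length) {
    let matched = false;
    for (const rule of rules) {
      rule.pattern.lastIndex = pos; // try this rule at the current position
      const m = rule.pattern.exec(source);
      if (m === null) continue;
      if (!rule.skip) tokens.push({ type: rule.name, text: m[0], offset: pos });
      pos += m[0].length;
      matched = true;
      break; // the first matching rule wins
    }
    if (!matched) {
      throw new Error(`Unexpected character '${source[pos]}' at offset ${pos}`);
    }
  }
  return tokens;
}

console.log(tokenize("print reader").map((t) => `${t.type}(${t.text})`));
```

In this sketch, "print reader" produces a Keyword followed by an Identifier: the \b in the keyword pattern stops read from matching the first four letters of reader, so the whole word falls through to the Identifier rule.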

Intuition

Think of the lexical rules as a dictionary of individual words. Before you can understand a whole sentence (a line of code), you first need to identify each individual word and its meaning. Some words are specialized (Keywords), some are actions (Operators), and some are the subjects of your sentence (Identifiers). The lexical rules are the foundation that allows the compiler to "read" your code one word at a time before it tries to understand the "story" you are telling.

Implementation details

The definition of the ARES vocabulary is found in src/parser/lexer.ts. It uses Chevrotain, a parser-building toolkit for JavaScript and TypeScript, to turn these rules into a working lexer.

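Longest Match: When more than one rule could match at the same position, the intended result is the token that captures the most characters. Chevrotain does not apply this automatically: by default it picks the first token type in the list that matches, so the longest match has to be arranged through token ordering or through a token's longer_alt option.
Token Ordering: The token list follows a strict order so that longer symbols like == are tried before shorter ones like =. If = came first, the input == would be read as two separate = tokens.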
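
To make the effect of ordering concrete, here is a minimal sketch that compares a first-match strategy with a longest-match strategy on the input ==. It is written in plain TypeScript rather than with Chevrotain, and the token names Equals and Assign are hypothetical, not taken from src/parser/lexer.ts.

```typescript
// Hypothetical rule shape and token names, for illustration only.
interface Rule {
  name: string;
  pattern: RegExp; // sticky (y) flag: match only at lastIndex
}

// Length of the match for `rule` at `pos`, or 0 if it does not match there.
function matchLength(rule: Rule, source: string, pos: number): number {
  rule.pattern.lastIndex = pos;
  const m = rule.pattern.exec(source);
  return m === null ? 0 : m[0].length;
}

// Strategy A: the first rule in the list that matches wins.
function lexFirstMatch(source: string, rules: Rule[]): string[] {
  const names: string[] = [];
  let pos = 0;
  while (pos < source.length) {
    const hit = rules.find((r) => matchLength(r, source, pos) > 0);
    if (hit === undefined) throw new Error(`No rule matches at offset ${pos}`);
    names.push(hit.name);
    pos += matchLength(hit, source, pos);
  }
  return names;
}

// Strategy B: the rule with the longest match wins; ties go to the earlier rule.
function lexLongestMatch(source: string, rules: Rule[]): string[] {
  const names: string[] = [];
  let pos = 0;
  while (pos < source.length) {
    let best: Rule | undefined;
    let bestLen = 0;
    for (const r of rules) {
      const len = matchLength(r, source, pos);
      if (len > bestLen) {
        best = r;
        bestLen = len;
      }
    }
    if (best === undefined) throw new Error(`No rule matches at offset ${pos}`);
    names.push(best.name);
    pos += bestLen;
  }
  return names;
}

const equals: Rule = { name: "Equals", pattern: /==/y };
const assign: Rule = { name: "Assign", pattern: /=/y };

console.log(lexFirstMatch("==", [equals, assign]));   // ["Equals"]
console.log(lexFirstMatch("==", [assign, equals]));   // ["Assign", "Assign"]
console.log(lexLongestMatch("==", [assign, equals])); // ["Equals"]
```

With a first-match strategy the result depends entirely on the order of the list, which is why the longer symbol has to be listed first. A longest-match strategy gives the same answer in either order, at the cost of trying every rule at every position.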

Complexity

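The lexing process is fast. Its cost grows roughly in proportion to the size of the input, O(n) for n characters, because the lexer moves through the text once, left to right, and does not return to characters it has already turned into tokens. This is why it can process very large inputs quickly.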

Trace example

This is what happens when the system reads the line "if x == 10":

Word 1: It identifies if as a special Keyword.
Space: It identifies the space and skips it.
Word 2: It identifies x as a name for a variable (Identifier).
Space: It identifies the space and skips it.
Word 3: It identifies == as a comparison symbol (Operator).
Space: It identifies the space and skips it.
Word 4: It identifies 10 as a raw value (NumberLiteral).
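
The same trace can be reproduced with a short script. This is a sketch under assumptions: it uses a single regular expression with one named group per category instead of the Chevrotain lexer, and the keyword and operator sets are made up for the example. Unlike a real lexer, this shortcut silently ignores characters that match no group.

```typescript
// One alternation, tried left to right at each position; group names are hypothetical categories.
const tokenPattern =
  /(?<WhiteSpace>\s+)|(?<Keyword>\b(?:if|read|print)\b)|(?<Identifier>[A-Za-z_]\w*)|(?<Operator>==|=)|(?<NumberLiteral>\d+)/g;

// Returns one line per step, including the steps where whitespace is skipped.
function trace(source: string): string[] {
  const steps: string[] = [];
  for (const m of source.matchAll(tokenPattern)) {
    // Exactly one named group is defined for each match: the alternative that matched.
    const [name, text] = Object.entries(m.groups ?? {}).find(([, v]) => v !== undefined)!;
    steps.push(name === "WhiteSpace" ? "skip WhiteSpace" : `${name} "${text}"`);
  }
  return steps;
}

for (const step of trace("if x == 10")) {
  console.log(step);
}
```

Running it on "if x == 10" yields seven steps in the same order as the trace above: four tokens with a skipped space between each pair.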

Related entities

src/parser/lexer.ts: The actual code that implements these vocabulary rules.
2_language_reference/02_grammar.md: Explains how these individual words are combined to form valid sentences.