{#sec-lexical-grammar}
# Lexical and Low Level Grammar

As with most languages, Dafny syntax is defined in two levels. First the stream
of input characters is broken up into _tokens_. Then these tokens are parsed
using the Dafny grammar.

The Dafny grammar is designed as an _attributed grammar_, which is a
conventional BNF-style set of productions, but in which the productions can
have arguments. The arguments control some alternatives within
the productions, such as whether an alternative is allowed or not in a specific context.
These arguments allow for a more compact and understandable grammar.

The precise, technical details of the grammar are presented together in <project:#sec-grammar-details>.
The expository parts of this manual present the language structure less formally.
Throughout this document there are embedded hyperlinks to relevant grammar sections,
marked as [grammar](#sec-grammar-details).

{#sec-unicode}
## Dafny Input

Dafny source code files are readable text encoded in UTF-8.
All program text other than the contents of comments, character, string and verbatim string literals
consists of printable and white-space ASCII characters,
that is, ASCII characters in the range `!` to `~`, plus space, tab,
carriage return and newline (ASCII 9, 10, 13, 32) characters.
(In some past versions of Dafny, non-ASCII, unicode representations of some mathematical symbols were
permitted in Dafny source text; these are no longer recognized.)

String and character literals and comments may contain any unicode character,
either directly or as an escape sequence.

{#sec-token-types}
## Tokens and whitespace

The characters used in a Dafny program fall into four groups:

* White space characters: space, tab, carriage return and newline
* alphanumerics: letters, digits, underscore (`_`), apostrophe (`'`), and question mark (`?`)
* punctuation: ``(){}[],.`;``
* operator characters (the other printable characters)

Except for string and character literals, each Dafny token consists of a
sequence of consecutive characters from just one of these
groups, excluding white-space. White-space is ignored except that it
separates tokens and except in the bodies of character and string literals.

A sequence of alphanumeric characters (with no preceding or following additional
alphanumeric characters) is a _single_ token. This is true even if the token
is syntactically or semantically invalid and the sequence could be separated into
more than one valid token. For example, `assert56` is one identifier token,
not a keyword `assert` followed by a number; `ifb!=0` begins with the token
`ifb` and not with the keyword `if` and token `b`; `0xFFFFZZ` is an illegal
token, not a valid hex number `0xFFFF` followed by an identifier `ZZ`.
White-space must be used to separate two such tokens in a program.

Somewhat differently, operator tokens need not be separated.
Only specific sequences of operator characters are recognized and these
are somewhat context-sensitive. For example, in `seq<set<int>>`, the grammar
knows that `>>` is two individual `>` tokens terminating the nested
type parameter lists; the right shift operator `>>` would never be valid here. Similarly, the
sequence `==>` is always one token; even if it were invalid in its context,
separating it into `==` and `>` would always still be invalid.

In summary, except for required white space between alphanumeric tokens,
adding or removing white space between tokens can never result in changing the meaning of a Dafny program.
For most of this document, we consider Dafny programs as sequences of tokens.

{#sec-character-classes}
## Character Classes

This section defines character classes used later in the token definitions.
In this section

* a backslash is used to start an escape sequence (so for example
`'\n'` denotes the single linefeed character)
* double quotes
enclose the set of characters constituting a character class
* enclosing single
quotes are used when there is just one character in the class
(perhaps expressed with a `\` escape character)
* `+` indicates
the union of two character classes
* `-` is the set-difference between the
two classes
* `ANY` designates all [unicode characters](#sec-unicode).

 name              | description
-------------------|---------------------------
letter             | ASCII upper or lower case letter; no unicode characters
digit              | base-ten digit ("0123456789")
posDigit           | digits, excluding 0 ("123456789")
posDigitFrom2      | digits excluding 0 and 1 ("23456789")
hexdigit           | a normal hex digit ("0123456789abcdefABCDEF")
special            | `?_"
cr                 | carriage return character (ASCII 10)
lf                 | line feed character (ASCII 13)
tab                | tab character (ASCII 9)
space              | space character (ASCII 32)
nondigitIdChar     | characters allowed in an identifier, except digits (letter + special)
idchar             | characters allowed in an identifier (nondigitIdChar + digits)
nonidchar          | characters not in identifiers (ANY - idchar)
charChar           | characters allowed in a character constant (ANY - '\'' - '\\' - cr - lf)
stringChar         | characters allowed in a string constant (ANY - '"' - '\\' - cr - lf)
verbatimStringChar | characters allowed in a verbatim string constant (ANY - '"')


The _special_ characters are the characters in addition to alphanumeric characters
that are allowed to appear in a Dafny identifier. These are

* `'` because mathematicians like to put primes on identifiers and some ML
  programmers like to start names of type parameters with a `'`,
* `_` because computer scientists expect to be able to have underscores in identifiers, and
* `?` because it is useful to have `?` at the end of names of predicates,
  e.g., `Cons?`.

A `nonidchar` is any character except those that can be used in an identifier.
Here the scanner generator will interpret `ANY` as any unicode character.
However, `nonidchar` is used only to mark the end of the `!in` token;
in this context any character other than [whitespace or printable ASCII](#sec-unicode)
will trigger a subsequent scanning or parsing error.

{#sec-comments}
## Comments

Comments are in two forms.

* They may go from `/*` to `*/` .
* They may go from `//` to the end of the line.

A comment is identified as a token during the tokenization of
input text and is then discarded for the purpose of interpreting the
Dafny program. (It is retained to enable auto-formatting
and provide accurate source locations for error messages.)
Thus comments are token separators: `a/*x*/b` becomes two tokens
`a` and `b`.

Comments may be nested,
but note that the nesting of multi-line comments is behavior that is different
from most programming languages. In Dafny,

<!-- %check-resolve -->
```dafny
method m() {
  /* comment
     /* nested comment
     */
     rest of outer comment
  */
}
```

is permitted; this feature is convenient for commenting out blocks of
program statements that already have multi-line comments within them.
Other than looking for end-of-comment delimiters,
the contents of a comment are not interpreted.
Comments may contain any characters.

Note that the nesting is not fool-proof. In

<!-- %check-resolve Grammar.1.expect -->
```dafny
method m() {
  /* var i: int;
     // */ line comment
     var j: int;
  */
}
```

and
<!-- %check-resolve Grammar.2.expect -->

```dafny
method m() {
  /* var i: int;
     var s: string := "a*/b";
     var j: int;
   */
}
```
the `*/` inside the line comment and the string are seen as the end of the outer
comment, leaving trailing text that will provoke parsing errors.

{#sec-documentation-comments}
## Documentation comments

Like many other languages, Dafny permits _documentation comments_ in a program file.
Such comments contain natural language descriptions of program elements and may be used
by IDEs and documentation generation tools to present information to users.

In Dafny programs.
* Documentation comments (a) either begin with `/**` or (b) begin with `//` or `/*` in specific locations
* Documentation comments may be associated with any declaration, including type definitions, export declarations, and datatype constructors.
* They may be placed before the declaration, between the declaration and the definition or after the definition. The priority of these comments is the following:
   * If there is a comment starting with `/**` right before the definition, it's the documentation comment.
   * Otherwise, if one or more comments appear between a declaration and its definition — which is uncommon in other programming languages — they are treated as documentation comments, regardless of whether they start with `//`, `/**`, or `/*`. This is because there is no other plausible reason for a comment to be placed in this position.
   * Otherwise, if there is a single comment starting with `//` or `/**` or `/*` at the end of the definition, it's the documentation comment. In this case, multi-line documentation comments must be starting with `/*`
* The extraction of the doc-string from a documentation comment follows the following rules
  * On the first line, an optional `*` right after `/*` and an optional space are removed, if present
  * On other lines, the indentation space (with possibly one star in it) is removed, as if the content was supposed to align with `A` if the comment started with `/** A` for example.
* The documentation string is interpreted as plain text, but it is possible to provide a user-written
  plugin that provides other interpretations. VSCode as used by Dafny interprets any markdown
  syntax in the doc-string.

Here are examples:

<!-- %check-resolve -->

```dafny
/* Just a comment */
const c0: int := 8 // docstring about c0
// Just a comment

/** docstring about c1 */
newtype c1 = x : int | x > 8 // Just a comment

/* Just a comment */
function c2(): (r: int) // Docstring about c2
  ensures r > 0
{ 8 } // Just a comment

// just a comment
const c5 := 8
```

Datatype constructors may also have comments:

<!-- %check-resolve -->
```dafny
datatype T // Docstring for T
= | A(x: int,
      y: int) // Docstring for A
  | B()       /* Docstring for B */ |
    C()       // Docstring for C

/** Docstring for T0 */
datatype T0 =
  | /** Docstring for A */
    A(x: int,
      y: int)
  | /** Docstring for B */
    B()
  | /** Docstring for C */
    C()
```

As can `export` declarations:

<!-- %check-resolve -->
```dafny
module M {
  const A: int
  const B: int
  const C: int
  const D: int

  export
    // This is the docstring of the eponymous export set intended for most clients
    provides A, B, C

  /** This is the docstring */
  export AB provides A, B

  export Friends extends AB
    // This is the docstring of the export set is for clients who need to know more of the
    // details of the module's definitions.
    reveals A
    provides D
}
```


{#sec-tokens}
## Tokens ([grammar](#sec-g-tokens))

The Dafny tokens are defined in this section.

{#sec-reserved-words}
### Reserved Words

Dafny has a set of reserved words that may not
be used as identifiers of user-defined entities.
These are listed [here](#sec-g-tokens).

In particular note that

- `array`, `array2`, `array3`, etc. are reserved words, denoting array types of given rank.
However,  `array1` and `array0` are ordinary identifiers.
- `array?`, `array2?`, `array3?`, etc. are reserved words,
denoting possibly-null array types of given rank,
but not `array1?` or `array0?`.
- `bv0`, `bv1`, `bv2`, etc. are reserved words that denote the types of
bitvectors of given length.
The sequence of digits after 'array' or 'bv' may not have leading zeros:
for example, `bv02` is an ordinary identifier.

### 2.6.2. Identifiers {#sec-identifiers}

In general, an `ident` token (an identifier) is a sequence of ``idchar`` characters where
the first character is a ``nondigitIdChar``. However tokens that fit this pattern
are not identifiers if they look like a character literal
or a reserved word (including array or bit-vector type tokens).
Also, `ident` tokens that begin with an `_` are not permitted as user identifiers.

### 2.6.3. Digits {#sec-digits}

A `digits` token is a sequence of decimal digits (`digit`), possibly interspersed with
underscores for readability (but not beginning or ending with an underscore).
Example: `1_234_567`.

A `hexdigits` token denotes a hexadecimal constant, and is a sequence of hexadecimal digits (`hexdigit`)
prefaced by `0x` and
possibly interspersed with underscores for readability (but not beginning or ending with an underscore).
Example: `0xffff_ffff`.

A `realnumber` token is a real number literal, possibly interspersed with
underscores for readability (but not beginning or ending with an underscore).
Real numbers can be written as decimal fractions (like `123.456`) or using
scientific notation (like `1.23e5` or `123e5`). Real number literals can use
trailing-dot shorthand (like `1.` for `1.0`). They also support
leading-dot shorthand (like `.5` for `0.5` or `.5e2` for `50.0`). All produce real number values.
Examples: `123_456.789_123`, `1.23e5`, `123e-2`, `5e10`, `1.`, `.5`, `.25e-3`.

### 2.6.4. Escaped Character {#sec-escaped-characters}

The `escapedChar` token is a multi-character sequence that denotes a non-printable or non-ASCII character.
Such tokens begin with a backslash characcter (`\`) and denote
 a single- or double-quote character, backslash,
null, new line, carriage return, tab, or a
Unicode character with given hexadecimal representation.
Which Unicode escape form is allowed depends on the value of the `--unicode-char` option.

If `--unicode-char:false` is stipulated,
`\uXXXX` escapes can be used to specify any UTF-16 code unit.

If `--unicode-char:true` is stipulated,
`\U{X..X}` escapes can be used to specify any Unicode scalar value.
There must be at least one hex digit in between the braces, and at most six.
Surrogate code points are not allowed.
The hex digits may be interspersed with underscores for readability
(but not beginning or ending with an underscore), as in `\U{1_F680}`.
The braces are part of the required character sequence.

Note that although Unicode
letters are not allowed in Dafny identifiers, Dafny does support [Unicode
in its character, string, and verbatim strings constants and in its comments](#sec-unicode).

### 2.6.5. Character Constant Token {#sec-character-constant-token}

The `charToken` token denotes a character constant.
It is either a `charChar` or an `escapedChar` enclosed in single quotes.

### 2.6.6. String Constant Token {#sec-string-constant-token}

A `stringToken` denotes a string constant.
It consists of a sequence of `stringChar` and `escapedChar` characters enclosed in
double quotes.

A `verbatimStringToken` token also denotes a string constant.
It is a sequence of any `verbatimStringChar` characters (which includes newline characters),
enclosed between `@"` and `"`, except that two
successive double quotes represent one quote character inside
the string. This is the mechanism for escaping a double quote character,
which is the only character needing escaping in a verbatim string.
Within a verbatim string constant, a backslash character represents itself
and is not the first character of an `escapedChar`.

### 2.6.7. Ellipsis {#sec-ellipsis}

The `ellipsisToken` is the character sequence `...` and is typically used to designate something missing that will
later be inserted through refinement or is already present in a parent declaration.

## 2.7. Low Level Grammar Productions {#sec-grammar}

### 2.7.1. Identifier Variations {#sec-identifier-variations}

#### 2.7.1.1. Identifier

A basic ordinary identifier is just an `ident` token.

It may be followed by a sequence of suffixes to denote compound entities.
Each suffix is a dot (`.`) and another token, which may be

- another `ident` token
- a `digits` token
- the `requires` reserved word
- the `reads` reserved word

Note that

* Digits can be used to name fields of classes and destructors of
  datatypes. For example, the built-in tuple datatypes have destructors
  named 0, 1, 2, etc. Note that as a field or destructor name, a digit sequence
  is treated as a string, not a number: internal
  underscores matter, so `10` is different from `1_0` and from `010`.
* `m.requires` is used to denote the [precondition](#sec-requires-clause) for method `m`.
* `m.reads` is used to denote the things that method `m` may [read](#sec-reads-clause).

#### 2.7.1.2. No-underscore-identifier

A `NoUSIdent` is an identifier except that identifiers with a **leading**
underscore are not allowed. The names of user-defined entities are
required to be ``NoUSIdent``s or, in some contexts, a ``digits``.
 We introduce more mnemonic names
for these below (e.g. ``ClassName``).

A no-underscore-identifier is required for the following:

- module name
- class or trait name
- datatype name
- newtype name
- synonym (and subset) type name
- iterator name
- type variable name
- attribute name

A variation, a no-underscore-identifier or a `digits`, is allowed for

- datatype member name
- method or function or constructor name
- label name
- export id
- suffix that is a typename or constructor

All _user-declared_ names do not start with underscores, but there are
internally generated names that a user program might _use_ that begin
with an underscore or are just an underscore.

#### 2.7.1.3. Wild identifier {#sec-wild-identifier}

A wild identifier is a no-underscore-identifier except that the singleton
`_` is allowed. The `_` is replaced conceptually by a unique
identifier distinct from all other identifiers in the program.
A `_` is used when an identifier is needed, but its content is discarded.
Such identifiers are not used in expressions.

Wild identifiers may be used in these contexts:

- formal parameters of a lambda expression
- the local formal parameter of a quantifier
- the local formal parameter of a subset type or newtype declaration
- a variable declaration
- a case pattern formal parameter
- binding guard parameter
- for loop parameter
- LHS of update statements

### 2.7.2. Qualified Names

A qualified name starts with the name of a top-level entity and then is followed by
zero or more ``DotSuffix``s which denote a component. Examples:

* `Module.MyType1`
* `MyTuple.1`
* `MyMethod.requires`
* `A.B.C.D`

The identifiers and dots are separate tokens and so may optionally be
separated by whitespace.

### 2.7.3. Identifier-Type Combinations

Identifiers are typically declared in combination with a type, as in
<!-- %no-check -->
```dafny
var i: int
```

However, Dafny infers types in many circumstances, and in those, the type can be omitted. The type is required
for field declarations and formal parameters of methods, functions and constructors (because there is no initializer).
It may be omitted (if the type can be inferred) for local variable declarations, pattern matching variables,
quantifiers,

Similarly, there are circumstances in which the identifier name is not needed, because it is not used.
This is allowed in defining algebraic datatypes.

In some other situations a wild identifier can be used, as described [above](#sec-wild-identifier).

### 2.7.4. Quantifier Domains ([grammar](#g-quantifier-domain)) {#sec-quantifier-domains}

Several Dafny constructs bind one or more variables to a range of possible values.
For example, the quantifier `forall x: nat | x <= 5 :: x * x <= 25` has the meaning
"for all integers x between 0 and 5 inclusive, the square of x is at most 25".
Similarly, the set comprehension `set x: nat | x <= 5 :: f(x)` can be read as
"the set containing the result of applying f to x, for each integer x from 0 to 5 inclusive".
The common syntax that specifies the bound variables and what values they take on
is known as the *quantifier domain*; in the previous examples this is `x: nat | x <= 5`,
which binds the variable `x` to the values `0`, `1`, `2`, `3`, `4`, and `5`.

Here are some more examples.

- `x: byte` (where a value of type `byte` is an int-based number `x` in the range `0 <= x < 256`)
- `x: nat | x <= 5`
- `x <- integerSet`
- `x: nat <- integerSet`
- `x: nat <- integerSet | x % 2 == 0`
- `x: nat, y: nat | x < 2 && y < 2`
- `x: nat | x < 2, y: nat | y < x`
- `i | 0 <= i < |s|, y <- s[i] | i < y`

A quantifier domain declares one or more *quantified variables*, separated by commas.
Each variable declaration can be nothing more than a variable name, but it
may also include any of three optional elements:

1. The optional syntax `: T` declares the type of the quantified variable.
   If not provided, it will be inferred from context.

2. The optional syntax `<- C` attaches a collection expression `C` as a *quantified variable domain*.
   Here a collection is any value of a type that supports the `in` operator, namely sets, multisets, maps, and sequences.
   The domain restricts the bindings to the elements of the collection: `x <- C` implies `x in C`.
   The example above can also be expressed as `var c := [0, 1, 2, 3, 4, 5]; forall x <- c :: x * x <= 25`.

3. The optional syntax `| E` attaches a boolean expression `E` as a *quantified variable range*,
   which restricts the bindings to values that satisfy this expression.
   In the example above `x <= 5` is the range attached to the `x` variable declaration.

Note that a variable's domain expression may reference any variable declared before it,
and a variable's range expression may reference the attached variable (and usually does) and any variable declared before it.
For example, in the quantifier domain `i | 0 <= i < |s|, y <- s[i] | i < y`, the expression `s[i]` is always [well-formed](#sec-assertion-batches)
because the range attached to `i` ensures `i` is a valid index in the sequence `s`.

Allowing per-variable ranges is not fully backwards compatible, and so it is not yet allowed by default;
the `--quantifier-syntax:4` option needs to be provided to enable this feature (See [Section 13.9.5](#sec-controlling-language)).

### 2.7.5. Numeric Literals ([grammar](#g-literal-expression)) {#sec-numeric-literals}

Integer and bitvector literals may be expressed in either decimal or hexadecimal (`digits` or `hexdigits`).

Real number literals are written as real number tokens (`realnumber`).
