--- title: Strings and Characters --- # Strings and Characters This document describes how the built-in Dafny types `string` and `char` are compiled to the various supported target languages. The `string` type is just an alias for `seq`, so most of the interesting decisions center around to represent the `char` type faithfully in each target languages. Historically, the `char` type has acted much like the following newtype declaration: ```dafny newtype char = c: int | 0 <= c < 65536 ``` That is, `char` really meant "UTF-16 code unit". Since Dafny `string` values can contain invalid use of surrogates, and since this is suboptimal for interoperating with other encodings such as UTF-8, the `--unicode-char` option was added to enable defining `char` to mean "Unicode scalar value" instead. This is equivalent to the following newtype declaration instead: ```dafny newtype char = c: int | 0 <= c < 0xD800 || 0xE000 <= c < 0x11_0000 ``` The selection of target language type for `char` is chosen primarily to ensure enough bits to efficiently store and work with these ranges of values, but also secondarily to align with the idiomatic character type where possible, and to support outputting the expected representation when passing characters and strings to the `print` statement. | Language | --unicode-char=false | --unicode-char=true | | ------------- | ------------------------------ | ---------------------------------------------- | | C# | `char` | `Dafny.Rune` | | Java | `char` | `int` / `dafny.CodePoint` | | JavaScript | `string` of length 1 | `_dafny.CodePoint` (`number` wrapper) | | Python | `str` of length 1 | `_dafny.CodePoint` (`str` of length 1 wrapper) | | Go | `_dafny.Char` (`rune` wrapper) | `_dafny.CodePoint` (`rune` wrapper) | | C++ | `char` | (not supported) | The various runtimes for each language support both use cases, without any need for conditional compilation or multiple packages. Instead some utility functions have two different implementations with similar names but different signatures, and the corresponding compilers select which function to reference based on the value of `--unicode-char`. For example, most runtimes have a function named something like `SeqFromString`, for converting from the native string type to the appropriate type for the Dafny runtime if `--unicode-char=false`, but also a version named something like `UnicodeSeqFromString` for the `--unicode-char=true` case. Both accept the same input type, but the former will return something like a `Seq` whereas the latter will return something like a `Seq`. Here are some notes to explain and justify the choices in each language when `--unicode-char=true`: ## C# The `Dafny.Rune` struct is a wrapper around an `int` value, and its API guarantees that invalid values (e.g. surrogates) will be rejected on construction. Because C# optimizes this special case of a struct with a single value, using this in place of a direct `int` value when `--unicode-char` is enabled does not have any runtime overhead. We would use the standard `System.Text.Rune` struct instead, but the `DafnyRuntime` package is built for older .NET framework versions that don't include this type. ## Java Java does not include a dedicated type for Unicode scalar values and instead uses the 32-bit wide `int` primitive type in general. The Java backend uses `int` whenever the static type of a value is known, but boxing of some kind is necessary to cast a primitive `int` value as a reference value when interfacing with generic code. The solution is to define our own `CodePoint` boxing type that wraps an `int` just as `Integer` does, but knows the true intended type more precisely and overrides `toString()` accordingly. We also define a `TypeDescriptor` for `UNICODE_CHAR` that uses `int` as the primitive, unboxed type but `CodePoint` as the boxing type, and this enables using the optimized `dafny.Array` class to store a sequence of code points in an `int[]` instead of in a `CodePoint[]`. ## JavaScript / Python / Go In all three of these runtimes, the `_dafny.CodePoint` type is a wrapper around a reasonable built-in type for code points. Note that in Go, the underlying `rune` type allows surrogate values, and hence can be used whether `--unicode-char` is enabled or not. ## C++ The C++ `char` type is unfortunately only 8 bits wide, and hence is not an adequate choice for arbitrary UTF-16 code units. Addressing this and supporting the 21-bit range of Unicode scalar values is deferred for now, as it will require pulling in additional libraries to support printing to standard out in order to implement the `print` statement. # Printing strings and characters Although `string` is treated as a pure type alias within the core semantics of Dafny, it is highly beneficial to `print` a string value as `Hello` rather than `['H', 'e', 'l', 'l', 'o']`. With `--unicode-char=false`, the various runtimes make best effort attempts to identify which sequence values contain character values and hence should be printed as strings. This means this behavior depends on how easy it is to track this type information at runtime, and hence is not consistent between backends. With `--unicode-char=true`, however, this is simplified and consistent across backends: a value is only printed as a string if its static type is known to be `seq`. This means that when using some more generic datatypes, string values will be printed as sequence values: printing an `Option` value for example. This is generally implemented by including a function named something similar to `ToVerbatimString` on the sequence type in each runtime. This is only invoked in generated code when the sequence is known to be a `seq`, and hence the runtimes are free to cast as needed to treat the values as strings. `--unicode-char` also changes how character values are printed by the `print` statement: `'D'` will print as `D` when `--unicode-char=false`. but `'D'` when `--unicode-char=true`. # String and character literals The simplest way to compile literals would be to materialize them as equivalent sequence displays of code point values, un-escaping escape sequences as needed. In other words, compiling a Dafny `"Hello"` string literal to something like `Seq(72, 101, 108, 108, 111)`. This would be a substantial hit to the readability of the compiled code, however, making it harder for Dafny users and contributors to debug issues: string literals are often helpful landmarks for orienting oneself when trying to understand the compiled code. Therefore, character and string literals are still translated to literals in the target language where possible. The `Util.MightContainNonAsciiCharacters` method is used to conservatively identify string and character literals that should use the more general approach using raw code point values instead.