Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
176 changes: 176 additions & 0 deletions rfcs/error-identifiers.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
# Short identifiers for error messages

## Motivation

We propose to assign stable, short identifiers to each error produced
by the OCaml compiler, and include this short name in the error
message. This makes it much easier to refer to those errors message in
documentation, user questions / answers, and to search on the web for
material on a specific error.

This feature was inspired by Rust which does precisely this:

- Example rust error message:

```
error[E0268]: `break` outside of a loop
--> src/lib.rs:2:5
|
2 | break rust;
| ^^^^^^^^^^ cannot `break` outside of a loop
```

The error identifier is `E0268` here.

- Index of all errors:
https://doc.rust-lang.org/error-index.html


## Imaginary example

Current behavior:

```
# let x = 2 + "foo";;
Error: This expression has type string but an expression was expected of type
int
```

New behavior:

```
# let x = 2 + "foo";;
Error[tyco007]: This expression has type string but an expression
was expected of type int
```

In this example the error identifier is `tyco007`.

## Proposed identifier format

(We are of course happy to change this if there is demand.)

We propose messages such as "tyco007" formed of a "group", the prefix
"tyco", and a number, 7.

We propose to keeep a fixed length for messages, 7 characters,
following the Rust approach. (A guess about the motivations: it makes
formatting more consistent and search easier.) Currently the format we
use has one group per compiler module emitting errors (hre "tyco" is
short for "typecore"), but one could move to E123456 later in the
future using a single group "E".

```ocaml
type error_identifier = {
group: string; (* 4 characters *)
num: int; (* 0 <= num < 1000 *)
}
```

Note that the format proposed here is very different from the format
of warning identifiers (`missing-record-field-pattern`). Users want to
refer to specific warnings in their build system and code to
enable/disable them, so we want descriptive, human-friendly
identifiers. In constrat we don't know of a reason to refer to
a specific error in our code, so error identifiers need not be
readable and informative; we want them short and to the point.

(Why the group prefix then? Actually it simplifies the implementation
to give per-module identifiers instead of trying to pick global
numbers and avoiding conflicts between two errors trying to use the
same number. It might also help users identify classes of errors.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really convinced by the format nor the rationale.

I think the fact that it's an abbreviation that maps to the module is likely to be interesting only to compiler devs. And the day your code moves to another module what do you do with the uid ? Is it that complicated to have a single module Error_id where you define errors and which you draw from ?

Why not keep it to EXXX/WXXX, short and actually tells the user whether that's an error or a warning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module name in the example was mostly a proxy for the class of errors. For instance, extending the examples

  • synt : syntax errors
  • tyco : core language type errors
  • tymo : module languages
  • tymi module inclusion errors
  • tyex : type expression errors
  • link: linking error
  • cpmo : module compilation error
    If we ever change the location/classification of an error, it seems fine to change the identifiers of the errors.

One advantage that I see with those errors is that thematically adjacent errors will share similar identifiers even after many renaming and moves (which doesn't happen that often). Contrarily, with dense set of numerical identifiers, a single splits of errors might move two sibling errors far away from each other.
However, I kind of agree that the "themes" themselves might be only partially understandable at first glance for non-compiler developers, but I think that the similarity of the errors will make sense for users.

Defining all error datatypes in a single module does not really work that well with the current architecture: error constructors often carry complex payload that are strongly tied to the concepts of the modules that raise them.
We could define the identifier part centrally, and dispatch the error constructors themselves to those identifiers, but that creates two distinct source of truths and probably make a coherency tool a little more mandatory.

Overall, we could definitively have a single numerical identifiers, but the idea of having group of errors seemed slightly attractive both in term of user experience and compiler development. But it could be totally be than my vision of user experience is far too biased in that specific instances.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents

synt : syntax errors
tyco : core language type errors
tymo : module languages
tymi : module inclusion errors
tyex : type expression errors
link : linking error
cpmo : module compilation error

This map of knowledge only makes sense when you look at it entirely. When those abbreviations might appear on your console, out of context it wouldn't give any valuable information.

For example when "tyex" appears on your errors for the first time until the N time and realise it has some meaning (and there are other codes that mean other errors) that's already late since you would understand the map as a hole and context isn't needed anymore.

I like the proposal, it adds searchability to errors and given the extended places to ask/receive help online it's huge. I would either make it indexable by a number or give the entire context in the error. Instead of synt -> SYNTAX_ERROR_X

PS: Sorry If I jumped into an issue where no one called me out

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davesnx this is a reasonable point, but we also want to balance with the size of the identifier which should not become too long, because it is less interesting to humans than the actual human-readable explanation of the error, and still displayed first. We could decide to move the error identifier at the end of the error message, which would un-constrain this aspect of the design space.

@dbuenzli my gut feeling is that having per-class numbers, instead of whole-compiler numbers, is going to make our life easier. For example, it is going to greatly decrease the number of situations where two independent compiler changes conflict (logically and/or textually) because they want to introduce the same error number.



## Impact on the compiler implementation

An important question with this proposal is the maintainability impact on the compiler codebase. We sketched an implementation to see how invasive it would be.

Currently our error-printing logic has code looking as follows, in each module reporting new errors:

```ocaml
let report_error ~loc env = function
| Constructor_arity_mismatch(lid, expected, provided) ->
Location.errorf ~loc
"@[The constructor %a@ expects %i argument(s),@ \
but is applied here to %i argument(s)@]"
longident lid expected provided
| [..]
```

We assume a `identifier_of_error : error -> Location.error_identifier` in each such module,
so the change to `report_error` would be minimal:

```ocaml
let report_error ~loc env err =
let identifier = identifier_of_error err in
match err with
| Constructor_arity_mismatch(lid, expected, provided) ->
Location.errorf ~loc ~identifier
"@[The constructor %a@ expects %i argument(s),@ \
but is applied here to %i argument(s)@]"
longident lid expected provided
| [..]
```

### Assigning identifiers to error values

This cannot easily be done automatically, because the error number
must be stable, even if the error type changes (a constructor is
renamed, reordered, split in two, several constructors are factored
into one...).

We thus propose to have a manually-written `identifier_of_error` function

```ocaml
let identifier_of_error err =
let group = "tyco" in
let open Location in
match err with
| Constructor_arity_mismatch _ -> { group; num = 7 }
| [...]
```

(Note: in general there is no one-to-one mapping between constructors
in the error type and the user-level concept of "errors", two
constructors could be the same error or one constructor could be
different errors depending on its argument.)

The exhaustivity check for pattern-matching on this function
guarantees that we always assign am identifier to all errors, and in
particular don't forget picking an identifier for new errors.


#### Advanced

If we wanted to be able to generate nice documentation for each error,
or at least compute statically the set of error identifiers currently in
use, we could attach an "example" of each error (an OCaml value
belonging to this error case, chosen to be representative for this
error and read nicely when printed by the error report) to each case
of the `identifier_of_error` function using a bespoke attribute
`[error.example]`:

```ocaml
let identifier_of_error err =
let group = "tyco" in
let open Location in
match[@error.table] err with
| Constructor_arity_mismatch _ ->
{ group; num = 7 }
[@error.example (Constructor_arity_mismatch (Lident "foo", 1, 2))]
| [...]
```

Then a custom ppx could easily extract a list of all "errors"
expressions in the `[error.list]` match, and for example print all
errors or build the set of error identifiers. (We already have
a similar ppx running on utils/warnings.ml to check the links to the
manual.)

Having the examples in this function is safer than simply defining
a global list of all errors separately: if we add a new error
constructor, exhaustivity warnings will force us to add a new case to
`identifier_of_error` and thus include an example.

Such a list of all errors can also be used to check that no two errors
are given the same identifier.