Add interchange data model description + JSON Schema definition #393

Merged
merged 5 commits into from
Jul 24, 2023
181 changes: 181 additions & 0 deletions spec/data-model.md
@@ -0,0 +1,181 @@
# DRAFT MessageFormat 2.0 Data Model

To work with messages defined in syntaxes other than that of MessageFormat 2,
an equivalent data model representation is also defined.
Implementations MAY provide interfaces which allow
for MessageFormat 2 syntax to be parsed into this representation,
for this representation to be serialized into MessageFormat 2 syntax
or any other syntax,
for messages presented in this representation to be formatted,
or for other operations to be performed on or with messages in this representation.

Implementations are not required to use this data model for their internal representation of messages.

To ensure compatibility across all platforms,
this interchange data model is defined in terms of JSON-compatible values
using TypeScript syntax for their definition.

## Messages

A `SelectMessage` corresponds to a syntax message that includes _selectors_.
A message without _selectors_ and with a single _pattern_ is represented by a `PatternMessage`.

```ts
type Message = PatternMessage | SelectMessage

interface PatternMessage {
  type: 'message'
  declarations: Declaration[]
  pattern: Pattern
}

interface SelectMessage {
  type: 'select'
  declarations: Declaration[]
  selectors: Expression[]
  variants: Variant[]
Collaborator

Currently, the ICU4J Mf2DataModel class has this method to get the message's variants:

    public OrderedMap<SelectorKeys, Pattern> getVariants();

This is less convenient to implement in ICU4C than it is to return a list of Variants, but I'm trying to do it anyway for the sake of parity with ICU4J.

I like what you have here (Variant[]) better, partly because it's easier to implement in C++, where ICU4C doesn't enjoy the benefits of Java's polymorphic OrderedMap class. The list and map representations are isomorphic, but I also find it more appealing for the API to return a list and let users build their own optimized lookup on top of it for more efficient pattern-matching, rather than return a map when efficient lookup isn't always necessary.

Whether it ends up being a list or a map, mostly I just wanted to highlight that the ICU4J and ICU4C implementations should match what's defined here.

}
```
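As a rough illustration of the two message shapes, the sketch below builds one of each as a plain JSON-compatible object and tells them apart by their `type` field. `Pattern`, `Expression`, `Declaration` and `Variant` are loose placeholder types here, and `patternCount` is a hypothetical helper that is not part of this specification.

```ts
// Placeholder shapes; the full definitions appear elsewhere in the data model.
type Pattern = { body: unknown[] }
type Expression = { type: 'expression'; body: unknown }
type Declaration = { name: string; value: Expression }
type Variant = { keys: unknown[]; value: Pattern }

interface PatternMessage {
  type: 'message'
  declarations: Declaration[]
  pattern: Pattern
}

interface SelectMessage {
  type: 'select'
  declarations: Declaration[]
  selectors: Expression[]
  variants: Variant[]
}

type Message = PatternMessage | SelectMessage

// Hypothetical helper: how many patterns a message could resolve to.
function patternCount(msg: Message): number {
  // `type` is the discriminant, so the union is narrowed in each branch.
  return msg.type === 'select' ? msg.variants.length : 1
}

const simple: Message = {
  type: 'message',
  declarations: [],
  pattern: { body: [{ type: 'text', value: 'Hello' }] }
}

const select: Message = {
  type: 'select',
  declarations: [],
  selectors: [{ type: 'expression', body: { type: 'variable', name: 'count' } }],
  variants: [
    { keys: [{ type: 'literal', quoted: false, value: 'one' }], value: { body: [] } },
    { keys: [{ type: '*' }], value: { body: [] } }
  ]
}

// The representation is JSON-compatible, so it survives a round trip.
const roundTripped: Message = JSON.parse(JSON.stringify(select))
```

Keeping `variants` as an array, as discussed in the review comment above, leaves any lookup structure up to the consumer.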

Each message _declaration_ is represented by a `Declaration`,
which connects the `name` of the left-hand side _variable_
with its right-hand side `value`.
The `name` does not include the initial `$` of the _variable_.

```ts
interface Declaration {
  name: string
  value: Expression
}
```
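A minimal sketch of how a parser might produce a `Declaration`, assuming it already has the variable as written in the source and a parsed right-hand side. `makeDeclaration` is a hypothetical helper and `Expression` is reduced to a placeholder.

```ts
// Placeholder for the Expression shape defined in the data model.
type Expression = { type: 'expression'; body: unknown }

interface Declaration {
  name: string
  value: Expression
}

// Hypothetical helper: build a Declaration from a variable as written in
// source (e.g. "$count") and an already-parsed right-hand side.
function makeDeclaration(variable: string, value: Expression): Declaration {
  // The data model stores the name without the leading "$".
  const name = variable.charAt(0) === '$' ? variable.slice(1) : variable
  return { name, value }
}

const decl = makeDeclaration('$count', {
  type: 'expression',
  body: { type: 'variable', name: 'n' }
})
```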

In a `SelectMessage`,
each _variant_ is represented by a `Variant` that holds its `keys` and its `value`.
For the `CatchallKey`, a string `value` may be provided to retain an identifier.
This is always `'*'` in MessageFormat 2 syntax, but may vary in other formats.

```ts
interface Variant {
  keys: Array<Literal | CatchallKey>
  value: Pattern
}

interface CatchallKey {
  type: '*'
  value?: string
}
```
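Because variants are a plain list, a consumer that wants keyed lookup can derive its own index. The following is one possible sketch, not a prescribed algorithm: `keyId` and `indexVariants` are hypothetical helpers, `Pattern` is a placeholder, and keeping the first of two variants with identical keys is a choice made only for this example.

```ts
// Placeholder for the Pattern shape defined in the data model.
type Pattern = { body: unknown[] }

interface Literal {
  type: 'literal'
  quoted: boolean
  value: string
}

interface CatchallKey {
  type: '*'
  value?: string
}

interface Variant {
  keys: Array<Literal | CatchallKey>
  value: Pattern
}

// Hypothetical helper: encode a key list as a string usable as a lookup key.
// Tagging each entry keeps the literal '*' distinct from the catch-all key.
// A literal's `quoted` flag is ignored in this sketch.
function keyId(keys: Array<Literal | CatchallKey>): string {
  return JSON.stringify(
    keys.map(key => (key.type === '*' ? ['*'] : ['literal', key.value]))
  )
}

// Hypothetical helper: build a lookup table from the ordered variant list.
function indexVariants(variants: Variant[]): { [id: string]: Pattern } {
  const index: { [id: string]: Pattern } = {}
  variants.forEach(variant => {
    const id = keyId(variant.keys)
    // Assumption of this sketch: the first variant wins on duplicate keys.
    if (!(id in index)) index[id] = variant.value
  })
  return index
}

const one: Literal = { type: 'literal', quoted: false, value: 'one' }
const variants: Variant[] = [
  { keys: [one], value: { body: [{ type: 'text', value: 'One item' }] } },
  { keys: [{ type: '*' }], value: { body: [{ type: 'text', value: 'Many items' }] } }
]
const index = indexVariants(variants)
```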

## Patterns

Each `Pattern` represents a linear sequence, without selectors.
Each element of the sequence MUST have either a `Text` or an `Expression` shape.
`Text` represents literal _text_,
while `Expression` wraps each of the potential _expression_ shapes.
The `value` of `Text` is the "cooked" value (i.e. escape sequences are processed).

Implementations MUST NOT rely on the set of `Expression` `body` values being exhaustive,
as future versions of this specification MAY define additional expressions.
If encountering a `body` with an unrecognised value,
an implementation SHOULD treat it as it would a `Reserved` value.

```ts
interface Pattern {
  body: Array<Text | Expression>
}

interface Text {
  type: 'text'
  value: string
}

interface Expression {
  type: 'expression'
  body: Literal | VariableRef | FunctionRef | Reserved
Collaborator

@stasm stasm Jul 11, 2023

A FunctionRef can also have a Literal or a VariableRef as an argument, and in fact, due to how our syntax is designed, I'd argue that the function's argument is more important than the function. (E.g. it comes first in the syntax.)

I'd like to suggest an alternative way to structure our expressions, to more closely map to our syntax:

expression = "{" [s] ((operand [s annotation]) / annotation) [s] "}"

Let's special-case argument-less functions rather than function-less operands.

Instead of Literal | VariableRef | FunctionRef here and operand?: Literal | VariableRef inside FunctionRef, we can do:

type Expression = OperandExpr | FunctionExpr;
interface OperandExpr {
    operand: Literal | VariableRef;
    annotation?: FunctionExpr;
}
interface FunctionExpr {
    name: string;
    options: Map<string, Literal | VariableRef>;
}

FWIW, this is how I implemented expressions in stasm/message2:
https://github.com/stasm/message2/blob/4abf43f2023b6e20d8ee1d462684d0741ece791b/syntax/ast.ts#L44-L70

(Not blocking this PR on this, but I'd like to discuss this change as a follow-up.)

Collaborator Author

I will be happy to discuss this further in a follow-on issue or PR.

Collaborator

@stasm stasm Jul 24, 2023

Filed #436 to continue this.

}
```
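The rule about unrecognised `body` values can be sketched as a `switch` with a default branch. This is only an illustration: `describePattern` is a hypothetical debugging helper, and `FunctionRef` and `Reserved` are reduced to the fields the sketch needs.

```ts
export {} // Module scope, so the local `Text` does not merge with the DOM `Text` type.

interface Text { type: 'text'; value: string }
interface Literal { type: 'literal'; quoted: boolean; value: string }
interface VariableRef { type: 'variable'; name: string }
// Reduced placeholders for the shapes defined under "Expressions".
interface FunctionRef { type: 'function'; name: string }
interface Reserved { type: 'reserved'; source: string }

interface Expression {
  type: 'expression'
  body: Literal | VariableRef | FunctionRef | Reserved
}

interface Pattern { body: Array<Text | Expression> }

// Hypothetical helper: one label per element of the pattern.
function describePattern(pattern: Pattern): string[] {
  return pattern.body.map(part => {
    if (part.type === 'text') return 'text:' + part.value
    // Widen `type` to string: the set of body values is not exhaustive.
    const body: { type: string } = part.body
    switch (body.type) {
      case 'literal': return 'literal'
      case 'variable': return 'variable'
      case 'function': return 'function'
      // 'reserved' and any unrecognised body are handled the same way.
      default: return 'reserved'
    }
  })
}

// Data from another implementation may carry a body type unknown here.
const pattern: Pattern = JSON.parse(JSON.stringify({
  body: [
    { type: 'text', value: 'Hi ' },
    { type: 'expression', body: { type: 'variable', name: 'user' } },
    { type: 'expression', body: { type: 'future-thing' } }
  ]
}))
const labels = describePattern(pattern)
```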

## Expressions

The `Literal` and `VariableRef` correspond to the _literal_ and _variable_ syntax rules.
When they are used as the `body` of an `Expression`,
they represent _expression_ values with no _annotation_.

An _unquoted_ value is represented by a `Literal` with `quoted: false`,
while a _quoted_ value would have `quoted: true`.
The `value` of `Literal` is the "cooked" value (i.e. escape sequences are processed).

In a `VariableRef`, the `name` does not include the initial `$` of the _variable_.

```ts
interface Literal {
  type: 'literal'
  quoted: boolean
  value: string
}

interface VariableRef {
  type: 'variable'
  name: string
}
```

A `FunctionRef` represents an _expression_ with a _function_ _annotation_.
In a `FunctionRef`,
the `kind` corresponds to the starting sigil of a _function_:
`'open'` for `+`, `'close'` for `-`, and `'value'` for `:`.
The `name` does not include this starting sigil.

The optional `operand` is the _literal_ or _variable_
before the _annotation_ in the _expression_, if present.
Each _option_ is represented by an `Option`.

```ts
interface FunctionRef {
  type: 'function'
  kind: 'open' | 'close' | 'value'
  name: string
  operand?: Literal | VariableRef
  options?: Option[]
}

interface Option {
  name: string
  value: Literal | VariableRef
}
```
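The sigil-to-`kind` mapping described above could be implemented along these lines. `functionRef` is a hypothetical constructor, and the function and option names in the usage (`number`, `minimumFractionDigits`, `b`) are illustrative only.

```ts
interface Literal { type: 'literal'; quoted: boolean; value: string }
interface VariableRef { type: 'variable'; name: string }

interface Option {
  name: string
  value: Literal | VariableRef
}

interface FunctionRef {
  type: 'function'
  kind: 'open' | 'close' | 'value'
  name: string
  operand?: Literal | VariableRef
  options?: Option[]
}

// Hypothetical constructor: `token` is the function as written in the
// annotation, starting sigil included (e.g. ":number").
function functionRef(
  token: string,
  operand?: Literal | VariableRef,
  options?: Option[]
): FunctionRef {
  const sigil = token.charAt(0)
  let kind: FunctionRef['kind']
  if (sigil === '+') kind = 'open'
  else if (sigil === '-') kind = 'close'
  else if (sigil === ':') kind = 'value'
  else throw new Error('Not a function sigil: ' + sigil)

  // The stored name does not include the starting sigil.
  const ref: FunctionRef = { type: 'function', kind, name: token.slice(1) }
  // Optional fields are only set when there is something to store.
  if (operand) ref.operand = operand
  if (options && options.length > 0) ref.options = options
  return ref
}

const num = functionRef(
  ':number',
  { type: 'variable', name: 'count' },
  [{ name: 'minimumFractionDigits', value: { type: 'literal', quoted: false, value: '2' } }]
)
const open = functionRef('+b')
const close = functionRef('-b')
```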

A `Reserved` represents an _expression_ with a _reserved_ _annotation_.
The `sigil` corresponds to the starting sigil of the _reserved_.
The `source` is the "raw" value (i.e. escape sequences are not processed)
and includes the starting `sigil`.

Implementations MUST NOT rely on the set of `sigil` values remaining constant,
as future versions of this specification MAY assign other meanings to such sigils.

If the _expression_ includes a _literal_ or _variable_ before the _annotation_,
it is included as the `operand`.

```ts
interface Reserved {
  type: 'reserved'
  sigil: '!' | '@' | '#' | '%' | '^' | '&' | '*' | '<' | '>' | '/' | '?' | '~'
  source: string
  operand?: Literal | VariableRef
}
```
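A small sketch of building a `Reserved` from the raw text of an annotation, assuming the caller has already isolated that text. `reserved` is a hypothetical constructor, and rejecting sigils outside the current list is a choice of this example; a reader of the data model should not rely on that list staying fixed.

```ts
interface Literal { type: 'literal'; quoted: boolean; value: string }
interface VariableRef { type: 'variable'; name: string }

interface Reserved {
  type: 'reserved'
  sigil: '!' | '@' | '#' | '%' | '^' | '&' | '*' | '<' | '>' | '/' | '?' | '~'
  source: string
  operand?: Literal | VariableRef
}

// The sigils listed in the Reserved interface above.
const RESERVED_SIGILS = '!@#%^&*<>/?~'

// Hypothetical constructor: `source` is the raw annotation text,
// with escape sequences left unprocessed and the sigil included.
function reserved(source: string, operand?: Literal | VariableRef): Reserved {
  const sigil = source.charAt(0)
  if (sigil === '' || RESERVED_SIGILS.indexOf(sigil) === -1) {
    throw new Error('Not a reserved sigil: ' + sigil)
  }
  const node: Reserved = {
    type: 'reserved',
    sigil: sigil as Reserved['sigil'],
    source // kept verbatim, starting with the sigil
  }
  if (operand) node.operand = operand
  return node
}

// In this string literal, "\\{" is the two raw characters backslash and "{".
const node = reserved('^tag \\{raw\\}', { type: 'variable', name: 'x' })
```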

## Extensions

Implementations MAY extend this data model with additional interfaces,
as well as adding new fields to existing interfaces.
When encountering an unfamiliar field, an implementation MUST ignore it.
For example, an implementation could include a `span` field on all interfaces
encoding the corresponding start and end positions in its source syntax.

In general,
implementations MUST NOT extend the sets of values for any defined field or type
when representing a valid message.
However, when using this data model to represent an invalid message,
an implementation MAY do so.
This is intended to allow for the representation of "junk" or invalid content within messages.
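One way to ignore unfamiliar fields is to copy only the known ones when reading a value. The sketch below assumes a producer that adds the `span` field mentioned above; `readVariableRef` is a hypothetical reader, and the shape of `span` shown here is invented for illustration.

```ts
interface VariableRef {
  type: 'variable'
  name: string
}

// Hypothetical reader: validates the known fields and ignores the rest.
function readVariableRef(json: string): VariableRef {
  const data = JSON.parse(json)
  if (
    data === null ||
    typeof data !== 'object' ||
    data.type !== 'variable' ||
    typeof data.name !== 'string'
  ) {
    throw new Error('Not a VariableRef')
  }
  // Only the fields defined by the data model are copied over.
  return { type: 'variable', name: data.name }
}

// Input from an implementation that extends the model with `span`.
const extended = '{"type":"variable","name":"user","span":{"start":7,"end":12}}'
const ref = readVariableRef(extended)
```

An implementation could equally keep unknown fields and pass them through untouched; dropping them is simply the shortest way to show that they do not affect processing.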