forked from hohle/ion-rfc
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathion-rfc-05-symbols.nroff
708 lines (560 loc) · 23.6 KB
/
ion-rfc-05-symbols.nroff
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
.tm 5. Ion Symbols & Symbol Tables .................................... \n%
.ti 0
5. Ion Symbols & Symbol Tables
Ion symbols are critical to the notation's performance and
space-efficiency.
In Ion binary, all symbols are represented as integers. These integers
are symbol IDs whose corresponding text forms are defined by a
symbol table.
In Ion text, symbols are represented in three ways:
.in 9
.ti 6
o Quoted symbol: a sequence of zero or more characters between
single-quotes, e.g., 'hello', 'a symbol', '123', ''. This
representation can denote any symbol text.
.ti 6
o Identifier: an unquoted sequence of one or more ASCII letters, digits,
or the characters $ (dollar sign) or _ (underscore), not starting with
a digit.
.ti 6
o Operator: an unquoted sequence of one or more of the following
nineteen ASCII characters: !#%&*+-./;<=>?@^`|~ Operators can only be
used as (direct) elements of an S-expression. In any other context
those characters require single-quotes.
.in 3
A subset of identifiers have special meaning:
.in 9
.ti 6
o Symbol Identifier: an identifier that starts with $ (dollar sign)
followed by one or more digits. These identifiers directly represent
the symbol's integer symbol ID, not the symbol's text. This form is
not typically visible to users, but they should be aware of the
reserved notation so they don't attempt to use it for other purposes.
.in 3
By convention, symbols starting with $ should be reserved for system
tools, processing frameworks, and the like, and should be avoided by
applications and end users. In particular, the symbol $ion and all
symbols starting with $ion_ are reserved for use by the Ion notation and
by related standards.
.tm _ 5.1. Symbol Tables ............................................. \n%
.ti 0
5.1. Symbol Tables
There are two kinds of symbol tables: shared and local.
A shared symbol table is intended for use from multiple data sources.
Each shared table is uniquely identified by a (string) name and (int)
version number. Shared symbol tables are key to the compactness of Ion
binary data, extracting the text of frequently used symbols (field
names, enumerations, keywords, etc.) out of individual documents and
into a common data structure.
A system symbol table is a shared table with the name "$ion". Every
version of the Ion notation maps to a specific version of the system
symbol table.
A local symbol table is for the sole use of a well-defined scope of Ion
data. Since it does not need to be referenced from other contexts, it
has no name or version number. Local tables may import symbols defined
in one or more shared tables or import the symbols in the previously
defined local symbol table (but not both). Local tables also accumulate
any other symbols encountered within their scoped data, such as when
encoding Ion text into binary.
At any point during processing, there is a current symbol table, which
is either a local symbol table or a system symbol table. At the start of
input, the current symbol table is initialized to be the system symbol
table for Ion 1.0. The current symbol table is only changed in two
circumstances: (1) encountering a system identifier at the top-level,
or (2) encountering a local symbol table at the top-level.
.tm _ 5.1.1. The Catalog .......................................... \n%
.ti 0
5.1.1 The Catalog
This specification refers to a "catalog". That's simply an abstraction
for the set of available Ion shared symbol tables. It's not necessarily
a static set: one could implement a catalog that pulls symbol tables
from a network repository, or one that has application symbol tables
"compiled in", or (very likely) some composition of these techniques.
The mechanism by which shared symbol tables are acquired is irrelevant
to this specification.
.tm _ 5.1.2. Top-Level Semantics ................................. \n%
.ti 0
5.1.2. Top-Level Semantics
Symbol tables only (meaningfully) occur at the top level of a data
stream or datagram. An Ion data stream is structured as follows:
.in 9
.ti 6
o An initial Ion Version Marker is required in binary data, and optional
(but highly recommended) in text. All Ion text implicitly starts with
$ion_1_0 when not explicitly provided. Every (top-level) IVM switches
the parser to the indicated Ion version and sets the current symbol
table to the indicated Ion system symbol table.
.ti 6
o Every top-level value (and all the hierarchical data within it) is
interpreted with respect to the current symbol table at the point
where the value starts.
.ti 6
o Every top-level local symbol table becomes the current symbol table
for the value(s) following it. A local table may be injected or
extended by the implementation during processing of the rest of
the stream.
Certain top-level values such as IVMs and local symbol tables are
referred to as system values; all other values are referred to as user
values. An Ion implementation may give applications the ability to
"skip over" the system values, since they are generally irrelevant to
the semantics of the user data.
.in 3
.tm _ 5.1.3. System Symbols ....................................... \n%
.ti 0
5.1.3. System Symbols
The version included in the system identifier is independent of the
version of the implied system symbol table (named "$ion"). Each version
of the Ion specification defines the corresponding system symbol table
version. Ion 1.0 uses the "$ion" symbol table, version 1, and future
versions of Ion will use larger versions of the "$ion" symbol table.
$ion_1_1 will probably use version 2, while $ion_2_0 might use
version 5.
Applications and users should never have to care about these symbol
table versions, since they are never explicit in user data: this
specification disallows (by ignoring) imports named "$ion").
Here are the system symbols for Ion 1.0.
.nf
Symbol ID Symbol Name
1 $ion
2 $ion_1_0
3 $ion_symbol_table
4 name
5 version
6 imports
7 symbols
8 max_id
9 $ion_shared_symbol_table
.in 3
Equivalently:
.nf
$ion_shared_symbol_table::
{
name: "$ion", version: 1,
symbols:
[ "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
"imports", "symbols", "max_id", "$ion_shared_symbol_table"
]
}
.in 3
.tm _ 5.2. Ion Version Markers ....................................... \n%
.ti 0
5.2. Ion Version Markers
In Ion text, the Ion Version Marker (IVM) is represented by the
following symbol.
.nf
$ion_1_0
.in 3
This stand-alone symbol is recommended at the start of Ion text data. It
identifies a specific major/minor version of the Ion notation. It resets
the current symbol table to be the corresponding system symbol table,
and simultaneously switches the parser into the appropriate mode for
parsing the right version of Ion notation.
A version marker can also occur at non-initial positions at the top
level, and it will have the same effect; when encountered below
top-level, it has no processing effect and is treated as an ordinary
user value.
IVMs do not have annotations. The input ann::$ion_1_0 is not a version
marker, it's a symbol with an annotation.
In Ion binary, there is a special sequence of bytes that represent
the IVM.
.nf
E0 01 00 EA
.in 3
This sequence of bytes can only appear at the top-level, much like the
text IVM, and can occur at non-initial positions as well. Note that this
particular form is equivalent to its textual counterpart $ion_1_0 and
has the same processing semantics, but is a special encoding artifact in
the binary format.
At the top-level, any encoding of $ion_1_0 that does not match the forms
specified above are system values that have no processing semantics
(a NOP).
Below are examples of the symbol $ion_1_0 that are not interpreted
as IVMs:
.nf
// explicitly quoted
'$ion_1_0'
// explicitly quoted with some newline escapes
'$ion_\
1\
_\
0'
// symbol ID mapping $ion_1_0 declared in the system symbol table
$2
$ion_symbol_table::
{
symbols:["$ion_1_0"]
}
// a locally declared symbol ID mapping to $ion_1_0
$10
.in 3
It is important to round-trip the forms above correctly, here is an
example of IVMs mixed with these NOP encodings:
.nf
// IVM
$ion_1_0
$ion_symbol_table::
{
symbols:["a"]
}
// not the IVM
'$ion_1_0'
// also not the IVM
$2
// maps to "a"
$10
.in 3
The above is equivalent to the following, more concise Ion:
.nf
$ion_1_0
a
.in 3
Here is a bad example of re-encoding the previous example in a
naive way:
.nf
// IVM
$ion_1_0
$ion_symbol_table::
{
symbols:["a"]
}
// quoted form improperly got converted to an IVM
$ion_1_0
// ERROR! the following symbol ID is not defined
$10
.in 3
The problem with the above example is that the conversion of '$ion_1_0'
to $ion_1_0 changed it from being a NOP to an IVM which resets the
current symbol table to the system symbol table.
.tm _ 5.2.2. Local Symbol Tables .................................. \n%
.ti 0
5.2.2. Local Symbol Tables
A local symbol table defines symbols through two mechanisms, both of
which are optional.
First, it imports the symbols from one or more shared symbol tables,
offsetting symbol IDs appropriately so they do not overlap. Instead of
importing the symbols from shared symbol tables, a local symbol table
may import the current symbol table.
Second, it defines local symbols similarly to shared tables. The latter
aspect is generally not managed by users: the system uses this form in
the binary encoding to record local symbols encountered during parsing.
.nf
// a local symbol table that resets the context, imports some shared
// symbol tables and adds three local symbols
$ion_symbol_table::
{
imports:[ { name: "com.amazon.ols.symbols.offer",
version: 1,
max_id: 75 },
// ...
],
symbols:[ "rock", "paper", "scissors" ]
}
// a local symbol table that adds two local symbols to the context
$ion_symbol_table::
{
imports:$ion_symbol_table,
symbols:[ "lizard", "spock" ]
}
.in 3
When immediately following an explicit system ID, a top-level struct
whose first annotation is $ion_symbol_table is interpreted as a local
symbol table. If the struct is null (null.struct) then it is treated as
if it were an empty struct.
The imports field should be the symbol $ion_symbol_table or a list as
specified in the following section.
The symbols field should be a list of strings. If the field is missing
or has any other type, it is treated as if it were an empty list.
Null elements in the symbols list declare unknown symbol text ("gaps")
for its SID within the sequence. Any element of the list that is not a
string must be interpreted as if it were null. Any SIDs that refer to
null slots in a local symbol table are equivalent to symbol zero.
Any other field (including, for example, name or version) is ignored.
.tm _ 5.2.2.1. Imports ......................................... \n%
.ti 0
5.2.2.1. Imports
A local symbol table implicitly imports the system symbol table that is
active at the point where the local table is encountered.
If the value of the imports field is the symbol $ion_symbol_table, then
the all of the symbol ID assignments in the current symbol table are
imported into the new local table. Thus, if the current symbol table was
the system symbol table, then processing is identical to having no
imports field value.
If the value of the imports field is a list, each element of the list
must be a struct; each element that is null or is not a struct
is ignored.
Each import (including the implicit system table import) allocates a
contiguous, non-overlapping sequence of symbol IDs. The system symbols
start at 1, each import starts one past the end of the previous import,
and the local symbols start immediately after the last import. The size
of each import's subsequence is defined by the max_id on the import
statement, regardless of the actual size of the referenced table.
Import structs in an import list are processed in order as follows:
.in 9
.ti 6
o If no name field is defined, or if it is not a non-empty string, the
import clause is ignored.
.ti 6
o If the name field is "$ion", the import clause is ignored.
.ti 6
o If no version field is defined, or if it is null, not an int, or less
than 1, act as if it is 1.
.ti 6
o If a max_id field is defined but is null, not an int, or less than
zero, act as if it is undefined.
.ti 6
o Select a shared symbol table instance as follows:
.ti 6
o Query the catalog to retrieve the specified table by name
and version.
.ti 6
o If an exact match is not found:
.ti 6
o If max_id is undefined, implementations MUST raise an error and
halt processing.
.ti 6
o Otherwise query the catalog to retrieve the table with the given
name and the greatest version available.
.ti 6
o If no table has been selected, substitute a dummy table containing
max_id undefined symbols.
.ti 6
o If max_id is undefined, set it to the largest symbol ID of the
selected table (which will necessarily be an exact match).
.ti 6
o Allocate the next max_id symbol IDs to this imported symbol table.
.in 3
After processing imports, a number of symbol IDs will have been
allocated, including at least those of a system symbol table. This
number is always well-defined, and any local symbols will be numbered
immediately beyond that point. We refer to the smallest local symbol ID
as the local min_id.
Note: This specification allows a local table to declare multiple
imports with the same name, perhaps even the same version.
Such a situation provides redundant data and allocates unnecessary
symbol IDs but is otherwise harmless.
.tm _ 5.2.2.2. Semantics ....................................... \n%
.ti 0
5.2.2.2. Semantics
When mapping from symbol ID to string, there is no ambiguity. However,
due to unavailable imports, certain IDs may appear to be undefined when
binary data is decoded. Any symbol ID outside of the range of the local
symbol table (or system symbol table if no local symbol table is
defined) for which it is encoded under MUST raise an error.
When mapping from string to symbol ID, there may be multiple assigned
IDs; implementations MUST select the lowest known ID. If an imported
table is unavailable, this may cause selection of a greater ID than
would be the case otherwise. This restriction ensures that symbols
defined by system symbol tables can never be mapped to other IDs.
Put another way, string-to-SID mappings have the following precedence:
o The system table is always consulted first.
o Each imported table is consulted in the order of import.
o Local symbols are last.
.tm _ 5.2.3. Shared Symbol Tables ................................. \n%
.ti 0
5.2.3. Shared Symbol Tables
This section defines the serialized form of shared symbol tables.
Unlike local symbol tables, the Ion parser does not intrinsically
recognize or process this data; it is up to higher-level specifications
or conventions to define how shared symbol tables are communicated.
.KS
.nf
$ion_shared_symbol_table::
{
name: "com.amazon.ols.symbols.offer",
version: 1,
imports: // For informational purposes only.
[
{ name:"..." , version:1 },
// ...
],
symbols:
[
"fee", "fie", "foe", /* ... */ "hooligan"
]
}
.KE
.in 3
A shared symbol table is serialized as a struct with the annotation
$ion_shared_symbol_table.
The name field should be a string with length at least one. If the
field has any other value, then materialization of this symbol table
must fail.
The version field should be an int and at least 1. If the field is
missing or has any other value, it is treated as if it were 1.
The imports field is for informational purposes only in shared tables.
They assert that this table contains a superset of the strings in each
of these named tables. It makes no assertion about any relationship
between symbol IDs in this table and the imports, only that the symbols'
text occurs here. An implementation MAY issue a warning if these claims
don't match what's in the symbols field.
The symbols field should be a list of strings. If the field is missing
or has any other type, it is treated as if it were an empty list.
Null elements declare undefined symbol IDs ("gaps") within the sequence;
implementations must handle requests for such symbols the same as if the
requested ID beyond the end of the list. Any element of the list that is
not a string must be interpreted as if it were null.
A few things worth noting:
.in 9
.ti 6
o Shared symbol tables do not make use of a max_id field since the
largest SID is implicit in the symbols list. If a max_id field exists,
it must be ignored.
.ti 6
o A shared table isn't coupled to any particular system table, so it can
be used in any context.
.ti 6
o The algorithm for SID assignment differs between shared and local
tables. SIDs in shared tables always start at one. SIDs in local tables
are always offset by the sum of the sizes of the system symbol table and
all imported tables.
.in 3
.tm _ 5.2.3.1. Semantics ....................................... \n%
.ti 0
5.2.3.1. Semantics
Symbol IDs are assigned to the symbols strings in order of their
appearance in the list: the first element has symbol ID 1 (aka $1), the
last has the symbol ID equal to the length of the list.
When mapping from symbol ID to string, a simple index into the list is
all that's needed.
When mapping from string to symbol ID, there may be multiple associated
IDs (the same string could appear twice as children of the symbols
field). Implementations MUST select the lowest known ID, and all other
associated IDs MUST be handled as if undefined.
.tm _ 5.2.3.2. Versioning ...................................... \n%
.ti 0
5.2.3.2. Versioning
A shared symbol table with version greater than one should usually be a
strict extension of the immediately preceding version, but Ion does not
(and in reality cannot) enforce this. Symbols may be removed, but they
cannot be renumbered or given different text. This ensures that when
version N is requested, any version larger than N can be used without
changing semantics. However, if symbols become undefined then some
extant data may become unreadable when an exact-match import cannot
be found.
The use of symbol tables that violate these restriction will lead to undefined and potentially incorrect interpretation of Ion data. Therefore implementations should enforce these restrictions at appropriate points.
Version N+1 of a table MAY be the same as version N.
.tm _ 5.3. Symbol Zero ............................................... \n%
.ti 0
5.3. Symbol Zero
SID zero (i.e. `$0`) is a special symbol that is not assigned text by any
symbol table, even the system symbol table. Symbol zero always has unknown
text, and can be useful in synthesizing symbol identifiers where the text
image of symbol is not known in a particular operating context.
It is important to note that `$0` is only semantically equivalent to itself
and to locally-declared SIDs with unknown text. It is not semantically
equivalent to SIDs with unknown text from shared symbol tables, so
replacing such SIDs with `$0` is a destructive operation to the semantics
of the data.
.tm _ 5.4. Data Model ................................................ \n%
.ti 0
5.4. Data Model
An important consideration for symbols is what semantics they have in
the Ion data model. Any symbol which has the same text image as another
symbol irrespective of the ID integer or the shared symbol table (if
applicable) used to encode it is considered to be equivalent.
Ion symbols may have text that is unknown. That is, there is no binding
to a (potentially empty) sequence of text. This can happen as a result
of not having access to a shared symbol table being imported, or having
a symbol table (shared or local) that contains a null slot.
When operating on data that contains symbols with unknown text, it is
important to not treat them as equivalent unless any of the
following hold:
.in 9
.ti 6
o Symbols with unknown text declared in a local symbol table are all
equivalent to one another and to SID 0.
.ti 6
o For symbols defined from shared symbol table imports, symbols are
equivalent only if all of the following hold:
.ti 6
- The name of the table that the symbols were imported from is the
same string.
.ti 6
- The position in the table that the symbols were imported from is the
same spot. Note that this is not the same as the local SID value,
but can be calculated from the SIDs by the allocation
algorithm above.
.ti 6
o SID 0 is only equivalent to itself.
.in 3
A processor encountering a symbol with unknown text and a valid SID
other than $0 MAY produce an error because this means that the context
of the data is missing, however any implementation that chooses not to
MUST conform to the above semantics with respect to round-tripping data.
.tm _ 5.5. Examples .................................................. \n%
.ti 0
5.5. Examples
A typical text document looks like:
.nf
$ion_1_0
$ion_symbol_table::
{
imports:[{ name:"com.amazon.ols.symbols.offer", version:1 },
{ name:"com.amazon.ims3.symbols.submission", version:1 }]
}
// Here's the user data, one or more top-level values.
submission::{ /* ... */ local_symbol /* ... */ }
submission::{ /* ... */ 'another one' /* ... */ }
.in 3
The example above shows a local table with imports but no symbols. This
is a typical scenario for human-authored data. When parsing this text,
the local table will be extended on the fly to contain any new symbols.
Here's the same data printed after parsing, in which the local table has
been extended with symbols encountered in the user data.
.nf
$ion_1_0
$ion_symbol_table::
{
imports:[{ name:"com.amazon.ols.symbols.offer", version:1, max_id:75 },
{ name:"com.amazon.ims3.symbols.submission", version:1, max_id:100 }],
symbols:["local_symbol", "another one"]
}
submission::{ /* ... */ local_symbol /* ... */ }
submission::{ /* ... */ 'another one' /* ... */ }
.in 3
Since the $ion_1_0 defines eight symbols ($1 through $9), the offer
table covers ids $10 through $84, the submission table covers ids $85
through $184, and local symbols start at $185.
Here's the same data as above serialized with a local symbol table being
"flushed" between each top-level value.
.nf
$ion_1_0
$ion_symbol_table::
{
imports:[{ name:"com.amazon.ols.symbols.offer", version:1, max_id:75 },
{ name:"com.amazon.ims3.symbols.submission", version:1, max_id:100 }],
symbols:["local_symbol"]
}
submission::{ /* ... */ local_symbol /* ... */ }
$ion_symbol_table::
{
imports:$ion_symbol_table,
symbols:["another one"]
}
submission::{ /* ... */ 'another one' /* ... */ }
.in 3
In this case, the first local symbol table generated only needs to add
one new local symbol for the top-level value being serialized in its
context. The second symbol table adds a subsequent symbol to the context
for the value immediately following it. This pattern of local symbol
tables allows top-level values to be written to a stream without knowing
all symbols ahead of time.
5.3.x Annotating local symbol tables
Although a local symbol table struct may have multiple annotations, its
first annotation must be $ion_symbol_table in order to be interpreted as
a local symbol table.
The following will be interpreted as a valid local symbol table:
.nf
$ion_symbol_table::annotated::
{
symbols:["a", "b"]
}
.in 3
The example below, however, will be interpreted as a simple struct with
two annotations:
.nf
annotated::$ion_symbol_table::
{
symbols:["a", "b"]
}
.in 3