Commit 5c58487

[FLINK-36905] Update Chinese doc on serialization to reflect the latest changes in Flink 2.0
1 parent 26d3426 commit 5c58487

3 files changed: +212 -295 lines changed


docs/content.zh/docs/dev/datastream/fault-tolerance/serialization/custom_serialization.md

Lines changed: 80 additions & 93 deletions
@@ -42,8 +42,6 @@ to specify the state's name, as well as information about the type of the state.
 It is also possible to completely bypass this and let Flink use your own custom serializer to serialize managed states,
 simply by directly instantiating the `StateDescriptor` with your own `TypeSerializer` implementation:

-{{< tabs "ee215ff6-2e21-4a40-a1b4-7f114560546f" >}}
-{{< tab "Java" >}}
 ```java
 public class CustomTypeSerializer extends TypeSerializer<Tuple2<String, Integer>> {...};

@@ -54,20 +52,6 @@ ListStateDescriptor<Tuple2<String, Integer>> descriptor =

 checkpointedState = getRuntimeContext().getListState(descriptor);
 ```
-{{< /tab >}}
-{{< tab "Scala" >}}
-```scala
-class CustomTypeSerializer extends TypeSerializer[(String, Integer)] {...}
-
-val descriptor = new ListStateDescriptor[(String, Integer)](
-"state-name",
-new CustomTypeSerializer)
-)
-
-checkpointedState = getRuntimeContext.getListState(descriptor)
-```
-{{< /tab >}}
-{{< /tabs >}}

 ## State serializers and schema evolution

@@ -84,10 +68,10 @@ mind is how the serialization schema can be changed in the future.
 When speaking of *schema*, in this context the term is interchangeable between referring to the *data model* of a state
 type and the *serialized binary format* of a state type. The schema, generally speaking, can change for a few cases:

-1. Data schema of the state type has evolved, i.e. adding or removing a field from a POJO that is used as state.
-2. Generally speaking, after a change to the data schema, the serialization format of the serializer will need to be upgraded.
-3. Configuration of the serializer has changed.
-
+1. Data schema of the state type has evolved, i.e. adding or removing a field from a POJO that is used as state.
+2. Generally speaking, after a change to the data schema, the serialization format of the serializer will need to be upgraded.
+3. Configuration of the serializer has changed.
+
 In order for the new execution to have information about the *written schema* of state and detect whether or not the
 schema has changed, upon taking a savepoint of an operator's state, a *snapshot* of the state serializer needs to be
 written along with the state bytes. This is abstracted as a `TypeSerializerSnapshot`, explained in the next subsection.
@@ -129,15 +113,15 @@ restored execution of an operator, the old serializer snapshot is provided to th
 This method returns a `TypeSerializerSchemaCompatibility` representing the result of the compatibility resolution,
 which can be one of the following:

-1. **`TypeSerializerSchemaCompatibility.compatibleAsIs()`**: this result signals that the new serializer is compatible,
-meaning that the new serializer has identical schema with the previous serializer. It is possible that the new
-serializer has been reconfigured in the `resolveSchemaCompatibility` method so that it is compatible.
-2. **`TypeSerializerSchemaCompatibility.compatibleAfterMigration()`**: this result signals that the new serializer has a
-different serialization schema, and it is possible to migrate from the old schema by using the previous serializer
-(which recognizes the old schema) to read bytes into state objects, and then rewriting the object back to bytes with
-the new serializer (which recognizes the new schema).
-3. **`TypeSerializerSchemaCompatibility.incompatible()`**: this result signals that the new serializer has a
-different serialization schema, but it is not possible to migrate from the old schema.
+1. **`TypeSerializerSchemaCompatibility.compatibleAsIs()`**: this result signals that the new serializer is compatible,
+meaning that the new serializer has identical schema with the previous serializer. It is possible that the new
+serializer has been reconfigured in the `resolveSchemaCompatibility` method so that it is compatible.
+2. **`TypeSerializerSchemaCompatibility.compatibleAfterMigration()`**: this result signals that the new serializer has a
+different serialization schema, and it is possible to migrate from the old schema by using the previous serializer
+(which recognizes the old schema) to read bytes into state objects, and then rewriting the object back to bytes with
+the new serializer (which recognizes the new schema).
+3. **`TypeSerializerSchemaCompatibility.incompatible()`**: this result signals that the new serializer has a
+different serialization schema, but it is not possible to migrate from the old schema.

 The last bit of detail is how the previous serializer is obtained in the case that migration is required.
 Another important role of a serializer's `TypeSerializerSnapshot` is that it serves as a factory to restore
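
Note on the hunk above: the three compatibility results can be pictured with a small, Flink-free model. This is only a sketch under stated assumptions. `CompatibilitySketch`, its `Snapshot` class, the `Result` enum, and the version-comparison rule are all hypothetical and are not Flink's `TypeSerializerSchemaCompatibility` API; a real serializer decides compatibility from whatever its snapshot actually persists.

```java
import java.util.Objects;

/** Hypothetical, Flink-free model of the three compatibility outcomes. */
final class CompatibilitySketch {

    /** Stand-ins for compatibleAsIs / compatibleAfterMigration / incompatible. */
    enum Result { COMPATIBLE_AS_IS, COMPATIBLE_AFTER_MIGRATION, INCOMPATIBLE }

    /** Hypothetical snapshot: the serialized type's name plus a format version. */
    static final class Snapshot {
        final String typeName;
        final int formatVersion;

        Snapshot(String typeName, int formatVersion) {
            this.typeName = Objects.requireNonNull(typeName);
            this.formatVersion = formatVersion;
        }

        /** Called on the new snapshot, given the snapshot restored from the savepoint. */
        Result resolveAgainst(Snapshot restored) {
            if (!typeName.equals(restored.typeName)) {
                return Result.INCOMPATIBLE;               // different type: no migration path
            }
            if (formatVersion == restored.formatVersion) {
                return Result.COMPATIBLE_AS_IS;           // identical schema, reuse bytes as they are
            }
            if (formatVersion > restored.formatVersion) {
                return Result.COMPATIBLE_AFTER_MIGRATION; // old bytes readable, but must be rewritten
            }
            return Result.INCOMPATIBLE;                   // this sketch does not support downgrades
        }
    }
}
```

For example, `new Snapshot("Tuple2", 2).resolveAgainst(new Snapshot("Tuple2", 1))` yields `COMPATIBLE_AFTER_MIGRATION` in this model.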
@@ -151,48 +135,48 @@ To wrap up, this section concludes how Flink, or more specifically the state bac
 abstractions. The interaction is slightly different depending on the state backend, but this is orthogonal
 to the implementation of state serializers and their serializer snapshots.

-#### Off-heap state backends (e.g. `RocksDBStateBackend`)
-
-1. **Register new state with a state serializer that has schema _A_**
-- the registered `TypeSerializer` for the state is used to read / write state on every state access.
-- State is written in schema *A*.
-2. **Take a savepoint**
-- The serializer snapshot is extracted via the `TypeSerializer#snapshotConfiguration` method.
-- The serializer snapshot is written to the savepoint, as well as the already-serialized state bytes (with schema *A*).
-3. **Restored execution re-accesses restored state bytes with new state serializer that has schema _B_**
-- The previous state serializer's snapshot is restored.
-- State bytes are not deserialized on restore, only loaded back to the state backends (therefore, still in schema *A*).
-- Upon receiving the new serializer, the previous serializer's snapshot is provided to the new serializer's snapshot via the
+#### Off-heap state backends (e.g. `EmbeddedRocksDBStateBackend`)
+
+1. **Register new state with a state serializer that has schema _A_**
+- the registered `TypeSerializer` for the state is used to read / write state on every state access.
+- State is written in schema *A*.
+2. **Take a savepoint**
+- The serializer snapshot is extracted via the `TypeSerializer#snapshotConfiguration` method.
+- The serializer snapshot is written to the savepoint, as well as the already-serialized state bytes (with schema *A*).
+3. **Restored execution re-accesses restored state bytes with new state serializer that has schema _B_**
+- The previous state serializer's snapshot is restored.
+- State bytes are not deserialized on restore, only loaded back to the state backends (therefore, still in schema *A*).
+- Upon receiving the new serializer, the previous serializer's snapshot is provided to the new serializer's snapshot via the
 `TypeSerializer#resolveSchemaCompatibility` to check for schema compatibility.
-4. **Migrate state bytes in backend from schema _A_ to schema _B_**
-- If the compatibility resolution reflects that the schema has changed and migration is possible, schema migration is
+4. **Migrate state bytes in backend from schema _A_ to schema _B_**
+- If the compatibility resolution reflects that the schema has changed and migration is possible, schema migration is
 performed. The previous state serializer which recognizes schema _A_ will be obtained from the serializer snapshot, via
-`TypeSerializerSnapshot#restoreSerializer()`, and is used to deserialize state bytes to objects, which in turn
-are re-written again with the new serializer, which recognizes schema _B_ to complete the migration. All entries
-of the accessed state are migrated all together before processing continues.
-- If the resolution signals incompatibility, then the state access fails with an exception.
-
-#### Heap state backends (e.g. `MemoryStateBackend`, `FsStateBackend`)
-
-1. **Register new state with a state serializer that has schema _A_**
-- the registered `TypeSerializer` is maintained by the state backend.
-2. **Take a savepoint, serializing all state with schema _A_**
-- The serializer snapshot is extracted via the `TypeSerializer#snapshotConfiguration` method.
-- The serializer snapshot is written to the savepoint.
-- State objects are now serialized to the savepoint, written in schema _A_.
-3. **On restore, deserialize state into objects in heap**
-- The previous state serializer's snapshot is restored.
-- The previous serializer, which recognizes schema _A_, is obtained from the serializer snapshot, via
+`TypeSerializerSnapshot#restoreSerializer()`, and is used to deserialize state bytes to objects, which in turn
+are re-written again with the new serializer, which recognizes schema _B_ to complete the migration. All entries
+of the accessed state are migrated all together before processing continues.
+- If the resolution signals incompatibility, then the state access fails with an exception.
+
+#### Heap state backends (e.g. `HashMapStateBackend`)
+
+1. **Register new state with a state serializer that has schema _A_**
+- the registered `TypeSerializer` is maintained by the state backend.
+2. **Take a savepoint, serializing all state with schema _A_**
+- The serializer snapshot is extracted via the `TypeSerializer#snapshotConfiguration` method.
+- The serializer snapshot is written to the savepoint.
+- State objects are now serialized to the savepoint, written in schema _A_.
+3. **On restore, deserialize state into objects in heap**
+- The previous state serializer's snapshot is restored.
+- The previous serializer, which recognizes schema _A_, is obtained from the serializer snapshot, via
 `TypeSerializerSnapshot#restoreSerializer()`, and is used to deserialize state bytes to objects.
-- From now on, all of the state is already deserialized.
-4. **Restored execution re-accesses previous state with new state serializer that has schema _B_**
-- Upon receiving the new serializer, the previous serializer's snapshot is provided to the new serializer's snapshot via the
+- From now on, all of the state is already deserialized.
+4. **Restored execution re-accesses previous state with new state serializer that has schema _B_**
+- Upon receiving the new serializer, the previous serializer's snapshot is provided to the new serializer's snapshot via the
 `TypeSerializer#resolveSchemaCompatibility` to check for schema compatibility.
-- If the compatibility check signals that migration is required, nothing happens in this case since for
-heap backends, all state is already deserialized into objects.
-- If the resolution signals incompatibility, then the state access fails with an exception.
-5. **Take another savepoint, serializing all state with schema _B_**
-- Same as step 2., but now state bytes are all in schema _B_.
+- If the compatibility check signals that migration is required, nothing happens in this case since for
+heap backends, all state is already deserialized into objects.
+- If the resolution signals incompatibility, then the state access fails with an exception.
+5. **Take another savepoint, serializing all state with schema _B_**
+- Same as step 2., but now state bytes are all in schema _B_.

 ## Predefined convenient `TypeSerializerSnapshot` classes
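
Note on the hunk above: step 4 of the off-heap flow (read bytes with the serializer for schema _A_, rewrite them with the serializer for schema _B_) can be sketched with plain `java.io` streams. Everything here is an illustrative assumption: `MigrationSketch`, the two byte layouts, and the choice to default the new field to `0` are hypothetical and do not reflect how any Flink state backend lays out its bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical read-with-old, write-with-new migration of one state entry. */
final class MigrationSketch {

    /** Schema A (assumed layout): a UTF string followed by an int. */
    static byte[] writeSchemaA(String name, int count) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            out.writeInt(count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Migrates one entry: decode with the A layout, re-encode with the B layout. */
    static byte[] migrateAToB(byte[] schemaABytes) {
        String name;
        int count;
        // "Previous serializer": understands schema A only.
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(schemaABytes))) {
            name = in.readUTF();
            count = in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // "New serializer": schema B is schema A plus a long, defaulted to 0 here.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            out.writeInt(count);
            out.writeLong(0L);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

In the heap-backend flow described in the same hunk, the decode half happens at restore time and the encode half only at the next savepoint, so no separate migration pass is needed.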

@@ -211,9 +195,9 @@ essentially meaning that the serialization schema of the serializer is solely de
 There will only be 2 possible results of the compatibility resolution when using the `SimpleTypeSerializerSnapshot`
 as your serializer's snapshot class:

-- `TypeSerializerSchemaCompatibility.compatibleAsIs()`, if the new serializer class remains identical, or
-- `TypeSerializerSchemaCompatibility.incompatible()`, if the new serializer class is different than the previous one.
-
+- `TypeSerializerSchemaCompatibility.compatibleAsIs()`, if the new serializer class remains identical, or
+- `TypeSerializerSchemaCompatibility.incompatible()`, if the new serializer class is different than the previous one.
+
 Below is an example of how the `SimpleTypeSerializerSnapshot` is used, using Flink's `IntSerializer` as an example:

 ```java
@@ -284,13 +268,14 @@ public class MapSerializerSnapshot<K, V> extends CompositeTypeSerializerSnapshot
 }
 ```

+
 When implementing a new serializer snapshot as a subclass of `CompositeTypeSerializerSnapshot`,
 the following three methods must be implemented:
-* `#getCurrentOuterSnapshotVersion()`: This method defines the version of
-the current outer serializer snapshot's serialized binary format.
-* `#getNestedSerializers(TypeSerializer)`: Given the outer serializer, returns its nested serializers.
-* `#createOuterSerializerWithNestedSerializers(TypeSerializer[])`:
-Given the nested serializers, create an instance of the outer serializer.
+* `#getCurrentOuterSnapshotVersion()`: This method defines the version of
+the current outer serializer snapshot's serialized binary format.
+* `#getNestedSerializers(TypeSerializer)`: Given the outer serializer, returns its nested serializers.
+* `#createOuterSerializerWithNestedSerializers(TypeSerializer[])`:
+Given the nested serializers, create an instance of the outer serializer.

 The above example is a `CompositeTypeSerializerSnapshot` where there is no extra information to be snapshotted
 apart from the nested serializers' snapshots. Therefore, its outer snapshot version can be expected to never
@@ -300,9 +285,9 @@ that needs to be persisted along with the nested component serializer. An exampl
 the nested element serializer.

 In these cases, an additional three methods need to be implemented on the `CompositeTypeSerializerSnapshot`:
-* `#writeOuterSnapshot(DataOutputView)`: defines how the outer snapshot information is written.
-* `#readOuterSnapshot(int, DataInputView, ClassLoader)`: defines how the outer snapshot information is read.
-* `#resolveOuterSchemaCompatibility(TypeSerializerSnapshot)`: checks the compatibility based on the outer snapshot information.
+* `#writeOuterSnapshot(DataOutputView)`: defines how the outer snapshot information is written.
+* `#readOuterSnapshot(int, DataInputView, ClassLoader)`: defines how the outer snapshot information is read.
+* `#resolveOuterSchemaCompatibility(TypeSerializerSnapshot)`: checks the compatibility based on the outer snapshot information.

 By default, the `CompositeTypeSerializerSnapshot` assumes that there isn't any outer snapshot information to
 read / write, and therefore has empty default implementations for the above methods. If the subclass
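
Note on the hunk above: the write / read / resolve trio for outer snapshot information can be illustrated without Flink on the classpath. In this sketch the "outer" information is assumed to be a single class name, and `OuterSnapshotSketch` with its three static helpers is hypothetical; the helpers only mirror the roles of `writeOuterSnapshot`, `readOuterSnapshot`, and `resolveOuterSchemaCompatibility` and use `java.io` streams in place of Flink's `DataOutputView` / `DataInputView`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical model of persisting and checking "outer" snapshot information. */
final class OuterSnapshotSketch {

    /** Role of writeOuterSnapshot: persist the extra configuration (here, a class name). */
    static byte[] writeOuter(String componentClassName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(componentClassName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Role of readOuterSnapshot: recover exactly what writeOuter persisted. */
    static String readOuter(byte[] snapshotBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(snapshotBytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Role of resolveOuterSchemaCompatibility: compare restored and current outer information. */
    static boolean outerCompatible(String restoredClassName, String currentClassName) {
        return restoredClassName.equals(currentClassName);
    }
}
```

The point of the sketch is the symmetry: whatever the write side persists, the read side must consume in the same order, and the resolve side should compare only that persisted information.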
@@ -311,6 +296,7 @@ has outer snapshot information, then all three methods must be implemented.
 Below is an example of how the `CompositeTypeSerializerSnapshot` is used for composite serializer snapshots
 that do have outer snapshot information, using Flink's `GenericArraySerializer` as an example:

+
 ```java
 public final class GenericArraySerializerSnapshot<C> extends CompositeTypeSerializerSnapshot<C[], GenericArraySerializer> {

@@ -365,6 +351,7 @@ public final class GenericArraySerializerSnapshot<C> extends CompositeTypeSerial
 }
 ```

+
 There are two important things to notice in the above code snippet. First of all, since this
 `CompositeTypeSerializerSnapshot` implementation has outer snapshot information that is written as part of the snapshot,
 the outer snapshot version, as defined by `getCurrentOuterSnapshotVersion()`, must be upticked whenever the
@@ -387,8 +374,8 @@ Flink restores serializer snapshots by first instantiating the `TypeSerializerSn
 along with the snapshot bytes). Therefore, to avoid being subject to unintended classname changes or instantiation
 failures, `TypeSerializerSnapshot` classes should:

-- avoid being implemented as anonymous classes or nested classes,
-- have a public, nullary constructor for instantiation
+- avoid being implemented as anonymous classes or nested classes,
+- have a public, nullary constructor for instantiation

 #### 2. Avoid sharing the same `TypeSerializerSnapshot` class across different serializers
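
Note on the hunk above: the two rules for snapshot classes follow from instantiation by class name. The sketch below uses plain reflection to show why a missing nullary constructor is a problem; `GoodSnapshot`, `BadSnapshot`, and `SnapshotLoaderSketch` are hypothetical names, and the lookup is only an assumption about the general mechanism, not a copy of Flink's restore code.

```java
/** Hypothetical snapshot class with a public, nullary constructor. */
class GoodSnapshot {
    public GoodSnapshot() {}
}

/** Hypothetical snapshot class that can only be built with an argument. */
class BadSnapshot {
    public BadSnapshot(int version) {}
}

/** Instantiates a class from its persisted name, as a restore path would have to. */
final class SnapshotLoaderSketch {

    /** Returns true if the named class can be created through a no-argument constructor. */
    static boolean canInstantiate(String className) {
        try {
            Class.forName(className).getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // Covers a renamed or missing class as well as a missing nullary constructor.
            return false;
        }
    }
}
```

A renamed class fails the same lookup, which is the risk with anonymous classes: their compiler-generated names can change when the enclosing class is edited.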

@@ -431,19 +418,19 @@ and could be problematic once you want to upgrade serializer classes or perform
 To be future-proof and have flexibility to migrate your state serializers and schema, it is highly recommended to
 migrate from the old abstractions. The steps to do this are as follows:

-1. Implement a new subclass of `TypeSerializerSnapshot`. This will be the new snapshot for your serializer.
-2. Return the new `TypeSerializerSnapshot` as the serializer snapshot for your serializer in the
-`TypeSerializer#snapshotConfiguration()` method.
-3. Restore the job from the savepoint that existed before Flink 1.7, and then take a savepoint again.
-Note that at this step, the old `TypeSerializerConfigSnapshot` of the serializer must still exist in the classpath,
-and the implementation for the `TypeSerializer#ensureCompatibility(TypeSerializerConfigSnapshot)` method must not be
-removed. The purpose of this process is to replace the `TypeSerializerConfigSnapshot` written in old savepoints
-with the newly implemented `TypeSerializerSnapshot` for the serializer.
-4. Once you have a savepoint taken with Flink 1.7, the savepoint will contain `TypeSerializerSnapshot` as the
-state serializer snapshot, and the serializer instance will no longer be written in the savepoint.
-At this point, it is now safe to remove all implementations of the old abstraction (remove the old
-`TypeSerializerConfigSnapshot` implementation as well as the
-`TypeSerializer#ensureCompatibility(TypeSerializerConfigSnapshot)` from the serializer).
+1. Implement a new subclass of `TypeSerializerSnapshot`. This will be the new snapshot for your serializer.
+2. Return the new `TypeSerializerSnapshot` as the serializer snapshot for your serializer in the
+`TypeSerializer#snapshotConfiguration()` method.
+3. Restore the job from the savepoint that existed before Flink 1.7, and then take a savepoint again.
+Note that at this step, the old `TypeSerializerConfigSnapshot` of the serializer must still exist in the classpath,
+and the implementation for the `TypeSerializer#ensureCompatibility(TypeSerializerConfigSnapshot)` method must not be
+removed. The purpose of this process is to replace the `TypeSerializerConfigSnapshot` written in old savepoints
+with the newly implemented `TypeSerializerSnapshot` for the serializer.
+4. Once you have a savepoint taken with Flink 1.7, the savepoint will contain `TypeSerializerSnapshot` as the
+state serializer snapshot, and the serializer instance will no longer be written in the savepoint.
+At this point, it is now safe to remove all implementations of the old abstraction (remove the old
+`TypeSerializerConfigSnapshot` implementation as well as the
+`TypeSerializer#ensureCompatibility(TypeSerializerConfigSnapshot)` from the serializer).

 ## Migrating from deprecated `TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer newSerializer)` before Flink 1.19
