Partition-Strategy-Format.md (+6 -5)
@@ -1,14 +1,15 @@
---
layout: page
+title: Partition Strategy Format
---
## Partition Strategy JSON Format

-A partition strategy is made of a list of partition fields. Each field defines how to take source data from an entity and produce a value that will be used to store the entity. For example, a field may produce the year an event happened from its timestamp. Another field in the strategy may be the month from the timestamp.
+A partition strategy is made up of a list of partition fields. Each field defines how to take source data from an entity and produce a value that is used to store the entity. For example, a field can produce the year an event happened from its timestamp. Another field in the strategy can be the month from the timestamp.

-Partition strategies are defined in [JSON][json] format. The strategy must be a list of objects---name/value pairs---each of which define a field in the partition strategy. All field definitions require at least two attributes:
+Partition strategies are defined in [JSON][json] format. The strategy must be a list of objects---name/value pairs---each of which defines a field in the partition strategy. All field definitions require at least two attributes:

-* `source` -- a source field on the entity, like "created_at"
-* `type` -- the type of partition derived from the source data, like "year"
+* `source` -- a source field on the entity, such as "created_at"
+* `type` -- the type of partition derived from the source data, such as "year"

Each definition can be thought of as a function run on the entity's source to produce the partition field's data. The order of the partition fields is preserved and used when the strategy is applied.
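
For illustration only (this example is not part of the diff): a minimal strategy in the format described above could look like the sketch below, reusing the `created_at` source and the year/month example from the introduction; the exact `year` and `month` type names are an assumption based on that prose. Because field order is preserved, entities would be partitioned by year first and then by month.

```json
[
  {"source": "created_at", "type": "year"},
  {"source": "created_at", "type": "month"}
]
```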
@@ -24,7 +25,7 @@ The available types are:
|`identity`| any string or number | the source value, unchanged | must be a string or numeric |
|`hash`| any object | int hash of the value, 0-B | requires B, `buckets` integer attribute[<sup>2</sup>](#notes)|

-A field definition may optionally provide a `name` attribute, which is used to reference the partition field. HDFS datasets use this name when creating partition paths. If the name attribute is missing, it will be defaulted based on the partition type and source field name.
+A field definition can optionally provide a `name` attribute, which is used to reference the partition field. HDFS datasets use this name when creating partition paths. If the name attribute is missing, it is defaulted based on the partition type and source field name.

Requirements for the source data are validated when schemas and partition strategies are used together.
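
Again for illustration only (not part of the diff): a field definition combining the optional `name` attribute with the `hash` type's required `buckets` attribute from the table above might be written as follows; the `user_id` source field and the bucket count are hypothetical.

```json
[
  {"source": "user_id", "type": "hash", "name": "user_id_bucket", "buckets": 16}
]
```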
Schema-URL-Warning.md (+6 -2)
@@ -1,7 +1,11 @@
+---
+layout: page
+title: Schema URL Warning
+---
This page explains the schema URL warning:
> The Dataset is using a schema literal rather than a URL which will be attached to every message.

-This warning means that the Dataset has been configured using an avro schema string, schema object, or by reflection. Configuring with a HDFS URL where the schema can be found instead of the other options allows certain components to pass the schema URL rather than the schema's string literal, which cuts down on the size of headers that must be sent with each message.
+This warning means that the Dataset is configured using an Avro schema string, a schema object, or by reflection. Configuring with an HDFS URL where the schema can be found, instead of the other options, allows certain components to pass the schema URL rather than the schema's string literal. This cuts down on the size of headers that must be sent with each message.

### Fixing the problem
@@ -26,4 +30,4 @@ DatasetDescriptor newDescriptor = new DatasetDescriptor.Builder(dataset.getDescr
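
The hunk header above truncates the builder call from the "Fixing the problem" section. A minimal sketch of the kind of reconfiguration that section describes is shown below; it assumes these pages belong to the Kite SDK (`org.kitesdk.data`), that `DatasetDescriptor.Builder` exposes a `schemaUri(...)` setter, and the HDFS path is hypothetical.

```java
import java.net.URI;

import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;

public class SchemaUrlFix {
  // Rebuild an existing descriptor so it carries a schema URL instead of a
  // schema literal; components can then send the URL in message headers
  // rather than the full schema text.
  public static DatasetDescriptor useSchemaUrl(Dataset<?> dataset) throws Exception {
    // Hypothetical HDFS location of the .avsc file for this dataset's schema.
    URI schemaUrl = URI.create("hdfs://namenode/schemas/event.avsc");
    return new DatasetDescriptor.Builder(dataset.getDescriptor())
        .schemaUri(schemaUrl)
        .build();
  }
}
```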