Column-Mapping.md

---
layout: page
title: Column Mapping
---

Column mapping allows you to configure how your records should be stored in HBase for maximum performance and efficiency. You define the column mapping in JSON format in a data-centric way, and Kite stores and retrieves the data correctly.

A column mapping is a JSON list of definitions that specify how to store each field in the record. Each definition is a JSON object with a `source`, a `type`, and any additional properties required by the type. The `source` property specifies which field in the source record the definition applies to. The `type` property controls where the source field's data is stored.
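For illustration, a mapping for a hypothetical user record might look like the following sketch (the field names, column family, and qualifiers are assumptions, not values from the Kite reference):

```json
[
  {"source": "id", "type": "key"},
  {"source": "username", "type": "column", "family": "u", "qualifier": "username"},
  {"source": "visits", "type": "counter", "family": "u", "qualifier": "visits"},
  {"source": "prefs", "type": "keyAsColumn", "family": "p"}
]
```

In this sketch, `key` stores the field in the HBase row key, while `column`, `counter`, and `keyAsColumn` store values in cells under the given column family.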
HBase-Storage-Cells.md

---
layout: page
title: HBase Storage Cells
---

HBase stores data as a group of values, or cells, and uniquely identifies each cell by a key. Using a key, you can look up a record stored in HBase very quickly. You can also insert, modify, or delete records in the middle of a dataset. HBase makes this possible by organizing data by storage key.
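As a rough sketch of what that means in practice, here is how a single cell is addressed with the plain HBase Java client (the table name, row key, column family, and qualifier below are made-up examples):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellLookup {
  // Looks up one record directly by its storage key and reads a single cell.
  static byte[] readEmail(Connection connection) throws IOException {
    try (Table table = connection.getTable(TableName.valueOf("users"))) {
      Get get = new Get(Bytes.toBytes("user-1234"));   // row (storage) key
      Result result = table.get(get);
      return result.getValue(Bytes.toBytes("u"),        // column family
                             Bytes.toBytes("email"));   // column qualifier
    }
  }
}
```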
Inferring-a-Schema-from-an-Avro-Data-File.md

---
layout: page
title: Inferring a Schema from an Avro Data File
---

You can use the `DatasetDescriptor.Builder.schemaFromAvroDataFile` method to reuse the schema of an existing data file in Avro format. The source can be a local file, an `InputStream`, or a URI.
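A minimal sketch with a local file (the file name is an assumption):

```java
import java.io.File;
import java.io.IOException;
import org.kitesdk.data.DatasetDescriptor;

public class InferFromAvroFile {
  static DatasetDescriptor describe() throws IOException {
    // Reuse the schema embedded in an existing Avro data file.
    return new DatasetDescriptor.Builder()
        .schemaFromAvroDataFile(new File("movies.avro"))
        .build();
  }
}
```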
Kite-Data-Module-Overview.md

---
layout: page
title: Kite Data Module Overview
---

The Kite Data module is a set of APIs for interacting with data in Hadoop; specifically, direct reading and writing of datasets in storage subsystems such as the Hadoop Distributed File System (HDFS).

These APIs do not replace or supersede any of the existing Hadoop APIs. Instead, the Data module streamlines application of those APIs. You still use HDFS and Avro APIs directly when necessary. The Kite Data module reflects best practices for default choices, data organization, and metadata system integration.

The Data module contains APIs and utilities for defining and performing actions on:

* <a href="#entities">entities</a>
* <a href="#schemas">schemas</a>
* <a href="#datasets">datasets</a>
* <a href="#repositories">dataset repositories</a>
* <a href="#loading">loading data</a>
* <a href="#viewing">viewing data</a>

Many of these objects are interfaces, permitting multiple implementations. While, in theory, any implementation of Hadoop's `FileSystem` abstract class is supported by the Kite Data module, only the local and HDFS filesystem implementations are tested and officially supported.

## Entities

An entity is a single record in a dataset. The name _entity_ is a better term than _record_, because _record_ sounds as if it is a simple list of primitives, while _entity_ sounds more like a Plain Old Java Object you would find in a JPA class (see [JPA Entity](https://en.wikipedia.org/wiki/Java_Persistence_API#Entities) on Wikipedia). That said, _entity_ and _record_ are often used interchangeably when talking about datasets.

Entities can be simple types, representing data structures with a few string attributes.

Best practice is to define the output for your system, identifying all of the field values required to produce the report or analytics results you need. Once you identify your required fields, you define one or more related entities in which to store the information you need to create your output. Define the format and structure of your entities using a schema.

## Schemas

A schema defines the field names and datatypes for a dataset. Kite relies on an Apache Avro schema definition for each dataset. For example, a schema for a table listing movies from the `movies.csv` dataset might look like the following.
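This is only a sketch; the field names here are assumptions rather than the exact fields of the example dataset:

```json
{
  "type": "record",
  "name": "Movie",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "title", "type": "string"},
    {"name": "release_date", "type": "string"}
  ]
}
```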
The goal is to get the schema into `.avsc` format and store it in the Hadoop filesystem.

## Datasets

A dataset is a collection of zero or more entities, represented by the interface `Dataset`. The relational database analog of a dataset is a table.

The HDFS implementation of a dataset is stored as Snappy-compressed Avro data files by default and is made up of zero or more files in a directory. You also have the option of storing your dataset in the column-oriented Parquet file format.
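For instance, a dataset descriptor that opts into Parquet might be built along these lines (a sketch; the schema file name is an assumption):

```java
import java.io.File;
import java.io.IOException;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Formats;

public class ParquetDescriptor {
  static DatasetDescriptor describe() throws IOException {
    return new DatasetDescriptor.Builder()
        .schema(new File("movie.avsc"))   // Avro schema definition
        .format(Formats.PARQUET)          // store entities as Parquet instead of Avro
        .build();
  }
}
```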
You can work with a subset of dataset entities using the Views API.

<a name="repositories" />

## Dataset Repositories

A _dataset repository_ is a physical storage location for datasets. Keeping with the relational database analogy, a dataset repository is the equivalent of a database of tables.

Each dataset belongs to exactly one dataset repository.

<a name="loading" />

## Loading Data from CSV

You can load comma-separated value (CSV) data into a dataset repository using the command-line interface command [csv-import](../Kite-Dataset-Command-Line-Interface/index.html#csvImport).
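For example, assuming the CLI executable is saved as `dataset` and a dataset named `movies` already exists, the import looks roughly like this:

```bash
# Import rows from a local CSV file into the existing "movies" dataset.
./dataset csv-import movies.csv movies
```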
<a name="viewing" />

## Viewing Your Data

Once created, datasets you create with Kite are no different from any other Hadoop datasets in your system. You can query the data with Hive or view it using Impala.
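For example, a dataset created in the Hive repository shows up as a table you can query directly (the dataset name here is an assumption):

```bash
impala-shell -q 'SELECT COUNT(*) FROM movies'
```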
Kite-Dataset-Command-Line-Interface.md

---
layout: page
title: Kite Dataset Command Line Interface
---

The Kite Dataset command-line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.
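Assuming the CLI jar has been saved as an executable named `dataset`, you can list the available commands with:

```bash
./dataset help
```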
Kite-SDK-Guide.md

---
layout: page
title: What is Kite?
---

## (and What Makes It So Awesome?)

You can learn why Kite is awesome by watching this <a href="http://www.youtube.com/watch?feature=player_embedded&v=JXAm3aasI6c">Kite Overview video</a>.

Things should just work together.

Hadoop is not difficult to use. The complexity comes from the many parts that make up a very large animal. Each piece, in isolation, is straightforward and easy to understand.

## Enter Kite

This is where Kite comes in. Kite provides additional support for this infrastructure one level up in the stack, codifying it in APIs that make sense to developers.
Partition-Strategy-Format.md

---
layout: page
title: Partition Strategy JSON Format
---

A partition strategy is made up of a list of partition fields. Each field defines how to take source data from an entity and produce a value that is used to store the entity. For example, one field can produce the year an event happened from its timestamp, and another field in the strategy can produce the month from the same timestamp.

A field definition can optionally provide a `name` attribute, which is used to refer to the field.

Requirements for the source data are validated when schemas and partition strategies are used together.

## Examples

This strategy uses the year, month, and day from the "received_at" timestamp field on an event.
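Written as JSON, such a strategy looks roughly like this (a sketch, not copied verbatim from the Kite reference):

```json
[
  {"type": "year", "source": "received_at"},
  {"type": "month", "source": "received_at"},
  {"type": "day", "source": "received_at"}
]
```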
This strategy hashes and embeds the "email" field from a user record.
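Roughly, that combines a `hash` field with an `identity` copy of the same source (again a sketch; the bucket count is an arbitrary choice):

```json
[
  {"type": "hash", "source": "email", "buckets": 16},
  {"type": "identity", "source": "email"}
]
```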
### Notes:

1. Source timestamps must be [long][avro-types] fields. The value encodes the number of milliseconds since the Unix epoch, as in Joda Time's [Instant][timestamp] and Java's `Date`.
2. The `buckets` attribute is required for `hash` partitions and controls the number of partitions into which the entities are pseudo-randomly distributed.
Schema-URL-Warning.md

---
layout: page
title: Schema URL Warning
---

This page explains the schema URL warning:

```bash
> The Dataset is using a schema literal rather than a URL which will be attached to every message.
```

This warning means that the dataset is configured using an Avro schema string, a schema object, or by reflection. Configuring with an HDFS URL where the schema can be found, instead of the other options, allows certain components to pass the schema URL rather than the schema's string literal. This cuts down on the size of headers that must be sent with each message.

## Fixing the problem

The following Java code demonstrates how to change the descriptor to use a schema URL instead of a schema literal:
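A minimal sketch of the idea (the HDFS schema path is an assumption), building the descriptor with a schema URI rather than an embedded schema:

```java
import java.io.IOException;
import java.net.URI;
import org.kitesdk.data.DatasetDescriptor;

public class UseSchemaUri {
  static DatasetDescriptor describe() throws IOException {
    // Reference the schema by URL; components can then pass the URL
    // instead of attaching the full schema literal to every message.
    return new DatasetDescriptor.Builder()
        .schemaUri(URI.create("hdfs:/user/examples/schemas/event.avsc"))
        .build();
  }
}
```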
Kite provides a set of tools that handle the basic legwork of creating a dataset, allowing you to focus on the specifics of the business problem you want to solve. This short tutorial walks you through the process of creating a dataset and viewing the results using the command-line interface (CLI).
## Preparation

If you have not done so already, download the Kite command-line interface jar. This jar is the executable that runs the command-line interface, so save it as `dataset`. You can download it with curl.
If you have a CSV file sitting around waiting to be used, you can substitute your file for the one that follows. The truth is, it doesn't matter whether you have 100 columns or 2; the process is the same. Larger datasets are only larger, not more complex.
The tail of the sample *sandwiches.csv* file looks like this:

```
Reuben, Pastrami and sauerkraut on toasted rye with Russian dressing.
PBJ, Peanut butter and grape jelly on white bread.
```
## Infer the Schema

All right. Now we get to use the CLI. Start by inferring an Avro schema file from the *sandwiches.csv* file you just created. Enter the following command to create an Avro schema file named *sandwich.avsc* with the class name *Sandwich*. The schema details are based on the headings and data in *sandwiches.csv*.
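The command looks roughly like this (assuming the CLI executable was saved as `dataset` in the current directory):

```bash
./dataset csv-schema sandwiches.csv --class Sandwich -o sandwich.avsc
```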
If you open *sandwich.avsc* in a text editor, it looks something like the code below.
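Roughly speaking, and assuming the CSV has `name` and `description` columns, the inferred schema has this shape (the real output includes more detail, such as nullable field types):

```json
{
  "type": "record",
  "name": "Sandwich",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "description", "type": "string"}
  ]
}
```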
## Create the Dataset

With a schema, you can create a new dataset. Enter the following command.
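A sketch of the command, using the schema file from the previous step (again assuming the CLI executable is `./dataset`):

```bash
./dataset create sandwiches --schema sandwich.avsc
```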
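To check the result, you can ask the CLI to print the dataset's schema back to you (a sketch of the `schema` subcommand):

```bash
./dataset schema sandwiches
```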
You'll get the same schema back, but this time, trust me, it's coming from the Hive repository.
## Import the CSV Data

You've created a dataset in the Hive repository, which is the container, but not the information itself. Next, you might want to add some data so that you can run some queries. Use the following command to import the sandwiches in your CSV file.
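A sketch of the import, matching the names used above:

```bash
./dataset csv-import sandwiches.csv sandwiches
```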