
Commit 78bfe7f

Author: DennisDawson
Committed: Jul 9, 2014

Fixed titles. Removed title links from header for 0.15.

1 parent 1a58514 · commit 78bfe7f

14 files changed (+40 -53 lines)
 

Column-Mapping.md

(-2 lines)

````diff
@@ -3,8 +3,6 @@ layout: page
 title: Column Mapping
 ---
 
-## Column Mapping
-
 Column mapping allows you to configure how your records should be stored in HBase for maximum performance and efficiency. You define the column mapping in JSON format in a data-centric way. Kite stores and retrieves the data correctly.
 
 A column mapping is a JSON list of definitions that specify how to store each field in the record. Each definition is a JSON object with a `source`, a `type`, and any additional properties required by the type. The `source` property specifies which field in the source record the definition applies to. The `type` property controls where the source field's data is stored.
````
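For context on the page being edited above, a minimal sketch of such a JSON list; the field names (`id`, `email`) and the column family/qualifier values are hypothetical, and `key` and `column` are two of the mapping types Kite supports:

```json
[
  {"source": "id", "type": "key"},
  {"source": "email", "type": "column", "family": "u", "qualifier": "email"}
]
```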

HBase-Storage-Cells.md

(-1 line)

````diff
@@ -2,7 +2,6 @@
 layout: page
 title: HBase Storage Cells
 ---
-## HBase Storage Cells
 
 HBase stores data as a group of values, or cells. HBase uniquely identifies each cell by a key. Using a key, you can look up the data for records stored in HBase very quickly. You can also insert, modify, or delete records in the middle of a dataset. HBase makes this possible by organizing data by storage key.
 
````
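As a concrete picture of the key-to-cell lookup described in that hunk, here is a minimal sketch using the stock HBase client API of the era (table, key, family, and qualifier names are all hypothetical; this is HBase's own API, not Kite's):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Fetch one record's cells directly by its storage key.
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "users");
Result row = table.get(new Get(Bytes.toBytes("user#1234")));
byte[] email = row.getValue(Bytes.toBytes("u"), Bytes.toBytes("email"));
table.close();
```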

Inferring-a-Schema-from-a-Java-Class.md

(+2 -4 lines)

````diff
@@ -1,10 +1,8 @@
 ---
 layout: page
-title: Infer a Schema from Java
+title: Inferring a Schema from a Java Class
 ---
 
-## Inferring a Schema from a Java Class
-
 You can use the `DatasetDescriptor.Builder#schema(Class<?> type)` method to infer a dataset schema from the instance variable fields of a Java class.
 
 For example, the following class defines a Java object that provides access to the ID, Title, Release Date, and IMDB URL for a movie database.
@@ -79,7 +77,7 @@ DatasetDescriptor movieDesc = new DatasetDescriptor.Builder()
 
 The Builder uses the field names and data types to construct an Avro schema definition, which for the `Movie` class looks like this.
 
-```
+```json
 {
 "type":"record",
 "name":"Movie",
````

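For readers of these hunks, a compact sketch of the inference call being documented; the `Movie` fields follow the ID/Title/Release Date/IMDB URL description, though the exact member names here are assumptions:

```java
import org.kitesdk.data.DatasetDescriptor;

// Instance-variable names are assumptions based on the description above.
class Movie {
  private int id;
  private String title;
  private String releaseDate;
  private String imdbUrl;
}

// The Builder reads the class's fields and derives the Avro record schema.
DatasetDescriptor movieDesc = new DatasetDescriptor.Builder()
    .schema(Movie.class)
    .build();
```
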
Inferring-a-Schema-from-an-Avro-Data-File.md

(+1 -2 lines)

````diff
@@ -1,8 +1,7 @@
 ---
 layout: page
-title: Infer a Schema from Avro
+title: Inferring a Schema from an Avro Data File
 ---
-## Inferring a Schema from an Avro Data File
 
 You can use the `DatasetDescriptor.Builder.schemaFromAvroDataFile` method to use the schema of an existing data file in Avro format. The source can be a local file, an InputStream, or a URI.
 
````
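A minimal sketch of that builder call against a local file (the file name is hypothetical; per the text, an `InputStream` or `URI` source works as well):

```java
import java.io.File;
import org.kitesdk.data.DatasetDescriptor;

// Reuse the schema already embedded in an existing Avro data file.
DatasetDescriptor desc = new DatasetDescriptor.Builder()
    .schemaFromAvroDataFile(new File("movies.avro"))
    .build();
```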

Kite-Data-Module-Overview.md

(+8 -10 lines)

````diff
@@ -1,10 +1,8 @@
 ---
 layout: page
-title: Kite Data Overview
+title: Kite Data Module Overview
 ---
 
-## Kite Data Module Overview
-
 The Kite Data module is a set of APIs for interacting with data in Hadoop; specifically, direct reading and writing of datasets in storage subsystems such as the Hadoop Distributed FileSystem (HDFS).
 
 These APIs do not replace or supersede any of the existing Hadoop APIs. Instead, the Data module streamlines application of those APIs. You still use HDFS and Avro APIs directly, when necessary. The Kite Data module reflects best practices for default choices, data organization, and metadata system integration.
@@ -16,7 +14,7 @@ The data module contains APIs and utilities for defining and performing actions
 * <a href="#entities">entities</a>
 * <a href="#schemas">schemas</a>
 * <a href="#datasets">datasets</a>
-* <a href="#repositories">repositories</a>
+* <a href="#repositories">dataset repositories</a>
 * <a href="#loading">loading data</a>
 * <a href="#viewing">viewing data</a>
 
@@ -25,7 +23,7 @@ Many of these objects are interfaces, permitting multiple implementations, each
 While, in theory, any implementation of Hadoop's `FileSystem` abstract class is supported by the Kite Data module, only the local and HDFS filesystem implementations are tested and officially supported.
 
 
-### Entities
+## Entities
 
 An entity is a single record in a dataset. The name _entity_ is a better term than _record_, because _record_ sounds as if it is a simple list of primitives, while _entity_ sounds more like a Plain Old Java Object you would find in a JPA class (see [JPA Entity](https://en.wikipedia.org/wiki/Java_Persistence_API#Entities) in Wikipedia.org). That said, _entity_ and _record_ are often used interchangeably when talking about datasets.
 
@@ -34,7 +32,7 @@ Entities can be simple types, representing data structures with a few string attributes
 Best practices are to define the output for your system, identifying all of the field values required to produce the report or analytics results you need. Once you identify your required fields, you define one or more related entities where you store the information you need to create your output. Define the format and structure for your entities using a schema.
 
 
-### Schemas
+## Schemas
 
 A schema defines the field names and datatypes for a dataset. Kite relies on an Apache Avro schema definition for each dataset. For example, this is the schema definition for a table listing movies from the `movies.csv` dataset.
 
@@ -61,7 +59,7 @@ The goal is to get the schema into `.avsc` format and store it in the Hadoop fil
 
 
 
-### Datasets
+## Datasets
 A dataset is a collection of zero or more entities, represented by the interface `Dataset`. The relational database analog of a dataset is a table.
 
 The HDFS implementation of a dataset is stored as Snappy-compressed Avro data files by default. The HDFS implementation is made up of zero or more files in a directory. You also have the option of storing your dataset in the column-oriented Parquet file format.
@@ -72,7 +70,7 @@ You can work with a subset of dataset entities using the Views API.
 
 <a name="repositories" />
 
-### Dataset Repository
+## Dataset Repositories
 
 A _dataset repository_ is a physical storage location for datasets. Keeping with the relational database analogy, a dataset repository is the equivalent of a database of tables.
 
@@ -84,13 +82,13 @@ Each dataset belongs to exactly one dataset repository. Kite doesn't provid
 
 <a name="loading" />
 
-### Loading data from CSV
+## Loading Data from CSV
 
 You can load comma separated value data into a dataset repository using the command line interface function [csv-import](../Kite-Dataset-Command-Line-Interface/index.html#csvImport).
 
 <a name="viewing" />
 
-### Viewing Your Data
+## Viewing Your Data
 
 Once created, datasets you create with Kite are no different from any other Hadoop dataset in your system. You can query the data with Hive or view it using Impala.
````

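Pulling the overview's pieces together, a rough sketch of the schema → descriptor → repository → dataset flow in the Java API (the repository URI, schema file name, and dataset name are assumptions, not values from this commit):

```java
import java.io.File;
import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetRepositories;
import org.kitesdk.data.DatasetRepository;

// Open a repository (the physical storage location) by URI.
DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/data");

// Describe the dataset with an Avro schema.
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(new File("movie.avsc"))
    .build();

// Create the dataset; entities are records conforming to the schema.
Dataset<GenericRecord> movies = repo.create("movies", descriptor);
```
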
Kite-Dataset-Command-Line-Interface.md

(+1 -3 lines)

````diff
@@ -1,9 +1,7 @@
 ---
 layout: page
-title: Kite CLI
+title: Kite Dataset Command Line Interface
 ---
-## Kite Dataset Command Line Interface
-
 
 The Kite Dataset command line interface (CLI) provides utility commands that let you perform essential tasks such as creating a schema and dataset, importing data from a CSV file, and viewing the results.
 
````
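As a taste of the CLI this page covers, the jar (saved as an executable named `dataset`, as the tutorial below describes) prints its command list and per-command usage; exact output varies by release:

```bash
# List the available commands, then get usage for one of them.
dataset help
dataset help csv-import
```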

Kite-SDK-Guide.md

(+3 -3 lines)

````diff
@@ -1,9 +1,9 @@
 ---
 layout: page
-title: Kite SDK Guide
+title: What is Kite?
 ---
 
-## What Is Kite, and Why Is It Awesome?
+## (and What Makes It So Awesome?)
 
 You can learn about why Kite is awesome by watching this <a href="http://www.youtube.com/watch?feature=player_embedded&v=JXAm3aasI6c">Kite Overview video</a>.
 
@@ -27,7 +27,7 @@ Things should just work together. Hadoop forces you to spend more time thinking
 Hadoop is not difficult to use. The complexity comes from the many parts that comprise a very large animal. Each piece, in isolation, is straightforward and easy to understand.
 
 
-### Enter Kite
+## Enter Kite
 
 This is where Kite comes in. Kite provides additional support for this infrastructure one level up in the stack so that it is codified in APIs that make sense to developers.
 
````

Parquet-vs-Avro-Format.md

(+1 -2 lines)

````diff
@@ -1,8 +1,7 @@
 ---
 layout: page
-title: Parquet vs Avro
+title: Parquet vs Avro Format
 ---
-## Parquet versus Avro Format
 
 Avro is a row-based storage format for Hadoop.
 
````
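Where the format choice surfaces in the API is the dataset descriptor. A minimal sketch of opting into Parquet over the Avro default (the schema file name is hypothetical):

```java
import java.io.File;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Formats;

// Ask for column-oriented Parquet storage instead of the default Avro files.
DatasetDescriptor parquetDesc = new DatasetDescriptor.Builder()
    .schema(new File("movie.avsc"))
    .format(Formats.PARQUET)
    .build();
```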

Partition-Strategy-Format.md

(+3 -4 lines)

````diff
@@ -1,8 +1,7 @@
 ---
 layout: page
-title: Partition Strategy Format
+title: Partition Strategy JSON Format
 ---
-## Partition Strategy JSON Format
 
 A partition strategy is made up of a list of partition fields. Each field defines how to take source data from an entity and produce a value that is used to store the entity. For example, a field can produce the year an event happened from its timestamp. Another field in the strategy can be the month from the timestamp.
 
@@ -29,7 +28,7 @@ A field definition can optionally provide a `name` attribute, which is used to r
 
 Requirements for the source data are validated when schemas and partition strategies are used together.
 
-### Examples
+## Examples
 
 This strategy uses the year, month, and day from the "received_at" timestamp field on an event.
 
@@ -50,7 +49,7 @@ This strategy hashes and embeds the "email" field from a user record.
 ]
 ```
 
-#### Notes:
+### Notes:
 1. Source timestamps must be [long][avro-types] fields. The value encodes the number of milliseconds since unix epoch, as in Joda Time's [Instant][timestamp] and Java's Date.
 2. The `buckets` attribute is required for `hash` partitions and controls the number of partitions into which the entities should be pseudo-randomly distributed.
 
````
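The hunks elide the example strategies themselves; as a sketch of the format being described (using the `received_at` source field from the text), a year/month/day strategy would look something like:

```json
[
  {"type": "year", "source": "received_at"},
  {"type": "month", "source": "received_at"},
  {"type": "day", "source": "received_at"}
]
```

A `hash` field would additionally carry the `buckets` attribute noted above.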

Partitioned-Datasets.md

(-1 line)

````diff
@@ -2,7 +2,6 @@
 layout: page
 title: Partitioned Datasets
 ---
-## Partitioned Datasets
 
 <a href="https://www.youtube.com/watch?v=rU1YAvmU6mY&index=3&list=PLGzsQf6UXBR-BJz5BGzJb2mMulWTfTu99">
 <img src="https://raw.githubusercontent.com/DennisDawson/KiteImages/master/partitionTitleSlide.png"
````

Schema-URL-Warning.md

(+7 -3 lines)

````diff
@@ -3,13 +3,17 @@ layout: page
 title: Schema URL Warning
 ---
 This page explains the schema URL warning:
+
+```bash
 > The Dataset is using a schema literal rather than a URL which will be attached to every message.
+```
+
+This warning means that the dataset is configured using an Avro schema string, a schema object, or by reflection. Configuring with an HDFS URL where the schema can be found, instead of the other options, allows certain components to pass the schema URL rather than the schema's string literal. This cuts down on the size of headers that must be sent with each message.
 
-This warning means that the Dataset is configured using an Avro schema string, a schema object, or by reflection. Configuring with an HDFS URL where the schema can be found, instead of the other options, allows certain components to pass the schema URL rather than the schema's string literal. This cuts down on the size of headers that must be sent with each message.
+## Fixing the problem
 
-### Fixing the problem
+The following Java code demonstrates how to change the descriptor to use a schema URL instead of a schema literal:
 
-The following java code demonstrates how to change the descriptor to use a schema URL instead of a schema literal:
 ```java
 // a path in HDFS where schemas should be stored
 Path schemaFolder = new Path("hdfs:/data/schemas");
````

Using-the-Kite-CLI-to-Create-a-Dataset.md

(+6 -8 lines)

````diff
@@ -1,17 +1,15 @@
 ---
 layout: page
-title: Using Kite CLI
+title: Using the Kite Command Line Interface to Create a Dataset
 ---
 
-## Using the Kite Command Line Interface to Create a Dataset
-
 <a href="https://www.youtube.com/watch?v=li3erFGiEw8&list=PLGzsQf6UXBR-BJz5BGzJb2mMulWTfTu99&index=2">
 <img src="https://raw.githubusercontent.com/DennisDawson/KiteImages/master/CLItitle.jpg"
 alt="Kite CLI Video" width="240" height="180" border="10" align="right" title="Link to Kite CLI Video"/></a>
 
 Kite provides a set of tools that handle the basic legwork for creating a dataset, allowing you to focus on the specifics of the business problem you want to solve. This short tutorial walks you through the process of creating a dataset and viewing the results using the command line interface (CLI).
 
-### Preparation
+## Preparation
 
 If you have not done so already, download the Kite command-line interface jar. This jar is the executable that runs the command-line interface, so save it as `dataset`. To download with curl, run:
 
@@ -20,7 +18,7 @@ curl https://repository.cloudera.com/artifactory/libs-release-local/org/kitesdk/
 chmod +x dataset
 ```
 
-### Create a CSV Data File
+## Create a CSV Data File
 
 If you have a CSV file sitting around waiting to be used, you can substitute your file for the one that follows. The truth is, it doesn't matter if you have 100 columns or 2 columns, the process is the same. Larger datasets are only larger, not more complex.
 
@@ -33,7 +31,7 @@ Reuben, Pastrami and sauerkraut on toasted rye with Russian dressing.
 PBJ, Peanut butter and grape jelly on white bread.
 ```
 
-### Infer the Schema
+## Infer the Schema
 
 All right. Now we get to use the CLI. Start by inferring an Avro schema file from the *sandwiches.csv* file you just created. Enter the following command to create an Avro schema file named *sandwich.avsc* with the class name *Sandwich*. The schema details are based on the headings and data in *sandwiches.csv*.
 
@@ -58,7 +56,7 @@ If you open *sandwich.avsc* in a text editor, it looks something like the code b
 }
 ```
 
-### Create the Dataset
+## Create the Dataset
 
 With a schema, you can create a new dataset. Enter the following command.
 
@@ -89,7 +87,7 @@ You'll get the same schema back, but this time, trust me, it's coming from the H
 }
 ```
 
-### Import the CSV Data
+## Import the CSV Data
 You've created a dataset in the Hive repository, which is the container, but not the information itself. Next, you might want to add some data so that you can run some queries. Use the following command to import the sandwiches in your CSV file.
 
 `dataset csv-import sandwiches.csv sandwiches`
````
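Taken together, the steps these hunks edit boil down to a short shell session; a sketch (option spellings follow the Kite CLI as documented, but check `dataset help <command>` on your release):

```bash
# Infer a schema from the CSV, create the dataset, then load the data.
dataset csv-schema sandwiches.csv --class Sandwich -o sandwich.avsc
dataset create sandwiches -s sandwich.avsc
dataset csv-import sandwiches.csv sandwiches

# Read the records back to confirm the import.
dataset show sandwiches
```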

_includes/header.html

(+2 -2 lines)

````diff
@@ -16,11 +16,11 @@
 c0-0.82,0.665-1.484,1.484-1.484h15.031C17.335,12.031,18,12.696,18,13.516L18,13.516z"/>
 </svg>
 </a>
-<div class="trigger">
+<!--div class="trigger">
 {% for page in site.pages %}
 <a class="page-link" href="{{ page.url | prepend: site.baseurl }}">{{ page.title }}</a>
 {% endfor %}
-</div>
+</div -->
 </nav>
 
 </div>
````

index.md

(+6 -8 lines)

````diff
@@ -1,35 +1,33 @@
 ---
 layout: page
-title: Welcome to Kite
+title: Welcome to the Kite Wiki!
 ---
 
-## Welcome to the Kite Wiki!
-
 This is the default landing page for Kite SDK documentation.
 
-### Kite SDK Overview
+## Kite SDK Overview
 
 
-* [What is Kite, and Why Is It Awesome?](Kite-SDK-Guide/)
+* [What is Kite? (and What Makes It So Awesome?)](Kite-SDK-Guide/)
 
 * [Kite Data Module Overview](Kite-Data-Module-Overview/)
 
 
 
-### Kite Dataset Command Line Interface
+## Kite Dataset Command Line Interface
 
 * [Kite SDK Dataset CLI](Kite-Dataset-Command-Line-Interface/)
 * [Using the Kite Dataset CLI to Create a Dataset](Using-the-Kite-CLI-to-Create-a-Dataset/)
 
 
-### Conceptual Topics
+## Conceptual Topics
 
 * [Parquet vs Avro Format](Parquet-vs-Avro-Format/)
 * [Partitioned Datasets](Partitioned-Datasets/)
 * [Column Mapping](Column-Mapping/)
 * [HBase Storage Cells](HBase-Storage-Cells/)
 
-### Miscellaneous
+## Miscellaneous
 
 * [Schema URL Warning](Schema-URL-Warning/)
````
