Kite-examples #30

Open · wants to merge 142 commits into base: 0.10.1

Changes from all commits (142 commits)
7fa7cc0
Update snapshot branch to 0.10.1-SNAPSHOT.
rdblue Dec 10, 2013
dd68d39
CDK-275. Upgrade to Crunch 0.9.0.
tomwhite Jan 8, 2014
367412b
Update examples for CDH 4.4.0 and the new VM.
rdblue Jan 9, 2014
e47cec5
Fixing old kite:drop-dataset mvn references.
rdblue Jan 11, 2014
8f3dbd3
Adding new JSON example.
rdblue Jan 11, 2014
485d6fd
Update snapshot examples for version 0.10.2-SNAPSHOT
rdblue Jan 14, 2014
387df07
Adding bash instructions for port forwarding
markgrover Jan 26, 2014
930e824
Fixing a typo
markgrover Jan 26, 2014
e126749
Add missing plugin repositories section for kite plugin.
tomwhite Jan 28, 2014
75c2a78
Add missing plugin repositories section for kite plugin.
tomwhite Jan 28, 2014
ebd454a
Remove age field that was mistakenly added.
tomwhite Jan 28, 2014
321e946
Remove age field that was mistakenly added.
tomwhite Jan 28, 2014
d801728
Merge pull request #1 from markgrover/bash_commands
tomwhite Jan 30, 2014
a227ce2
Merge pull request #2 from markgrover/typo1
tomwhite Jan 30, 2014
bde43cb
Adding bash instructions for port forwarding
markgrover Jan 26, 2014
b3b1fc4
Fixing a typo
markgrover Jan 26, 2014
d0fe377
CDK-107. Upgrade to CDH 4.4.0.
tomwhite Feb 5, 2014
a0f8cda
Remove instruction to bind daemons to the wildcard address since
tomwhite Feb 5, 2014
805fe58
Fix dataset name for HBase.
tomwhite Feb 6, 2014
ec1ef8d
Sync version with snapshot
tomwhite Feb 6, 2014
174dbba
Merge branch 'snapshot'
tomwhite Feb 6, 2014
d619bcb
Update stable examples for version 0.11.0
tomwhite Feb 6, 2014
580729b
Update snapshot examples for version 0.11.1-SNAPSHOT
tomwhite Feb 6, 2014
b642441
test - pls ignore
Feb 7, 2014
bcb20a1
CDK-312: Add a module that contains examples for how to unit test Mor…
Feb 7, 2014
04dffef
fix typo
Feb 7, 2014
ac4ba9c
bit more javadoc
Feb 7, 2014
d16abe9
cleanup
Feb 7, 2014
73c3ca6
cleanup
Feb 7, 2014
ac84ed7
bit more javadoc
Feb 7, 2014
d36d2aa
cleanup
Feb 7, 2014
a478721
make Locale configurable
Feb 7, 2014
b0c1186
cleanup
Feb 7, 2014
b40d357
Add instructions on how to generate list of all needed jar files
Feb 11, 2014
6e07593
add doc on kite-morphlines-core vs kite-morphlines-all
Feb 11, 2014
1449a19
fix typo
Feb 11, 2014
ec7a318
Add doc on using the Maven CLI to run test data through a morphline
Feb 11, 2014
6cfc255
fix typo
Feb 11, 2014
2c8fcf8
fix doc
Feb 11, 2014
6dfc248
fix typo
Feb 11, 2014
ff3370c
Mention that Flume user impersonation is already enabled in CM5.
tomwhite Feb 13, 2014
57e869e
add bit more doc
Feb 13, 2014
9ecb416
add bit more doc
Feb 13, 2014
3d167b5
add doc about 'mvn dependency:tree'
Feb 18, 2014
be4a4f4
update version number in preparation for upcoming release
Mar 4, 2014
09afc23
update version number in preparation for upcoming release
Mar 4, 2014
7f33e35
Add Erick's CSV unit tests plus corresponding docs. Thanks Erick!
Mar 6, 2014
b9e4b1f
add more of Erick's doc
Mar 6, 2014
a6f6e5d
formatting
Mar 6, 2014
f4a8b56
formatting
Mar 6, 2014
f0fbdb3
formatting
Mar 6, 2014
91b8659
formatting
Mar 6, 2014
6f437eb
formatting
Mar 6, 2014
7d49944
formatting
Mar 6, 2014
32a10d9
formatting
Mar 6, 2014
604e23c
formatting
Mar 6, 2014
099d4cb
add license header
Mar 6, 2014
1ab6c94
formatting
Mar 6, 2014
4a2a6bb
formatting
Mar 6, 2014
079f6f8
formatting
Mar 6, 2014
c1c4497
formatting
Mar 6, 2014
b62250c
add a bit more doc
Mar 6, 2014
1354e9a
add bit more doc
Mar 6, 2014
8d5b83e
cleanup
Mar 6, 2014
896076b
formatting
Mar 6, 2014
3043433
CDK-247. The dataset-staging example should run in parallel.
tomwhite Nov 14, 2013
24f89ea
avoid compiler warning
Mar 6, 2014
8c9ce64
bit more doc
Mar 8, 2014
fd922b7
bit more doc
Mar 8, 2014
17d3429
bit more doc
Mar 8, 2014
0e47a7c
CDK-361. Demo example fails with 'java.lang.ClassNotFoundException: o…
tomwhite Mar 10, 2014
3deb95d
Merge branch 'snapshot' into master2
tomwhite Mar 11, 2014
ddbef15
Update stable examples for version 0.12.0
tomwhite Mar 11, 2014
b1a510c
Update snapshot examples for version 0.12.1-SNAPSHOT
tomwhite Mar 11, 2014
dd7a3e2
CDK-200: Fix http port conflict with yarn.
rdblue Mar 12, 2014
37bf586
Add license header.
tomwhite Mar 13, 2014
720031b
CDK-252: Configure log4j in examples.
rdblue Mar 18, 2014
c6821fc
Update staging example to use hive-compatible names.
rdblue Mar 18, 2014
1ec4409
Update stable examples for version 0.12.1
rdblue Mar 19, 2014
fdedd90
Update snapshot examples for version 0.12.2-SNAPSHOT
rdblue Mar 19, 2014
10857a8
Update stable examples for version 0.13.0
rdblue Apr 23, 2014
88a679c
Update snapshot examples for version 0.13.1-SNAPSHOT
rdblue Apr 23, 2014
6c9e840
Update stable examples for version 0.14.0
rdblue May 14, 2014
38e244c
Update snapshot examples for version 0.14.1-SNAPSHOT
rdblue May 14, 2014
033bc4f
CDK-330. Move event schema to examples.
tomwhite May 19, 2014
31a92ac
Fix dependencies so StagingToPersistent runs correctly.
tomwhite May 20, 2014
deab191
CDK-442. Fix dataset-compatibility instructions to make it clear how …
tomwhite May 21, 2014
2b4ca6f
CDK-408. Dataset examples don't work against local filesystem.
tomwhite May 21, 2014
256fa57
Minor improvements to the instructions.
tomwhite May 22, 2014
6fe00ab
Update stable examples for version 0.14.1
tomwhite May 23, 2014
dcaadac
Update snapshot examples for version 0.14.2-SNAPSHOT
tomwhite May 23, 2014
a7d0b34
CDK-423. Remove HCatalog dependency.
tomwhite May 26, 2014
4a4d8b1
Fixed up issue with DataDescriptor.Builder using wrong API
Jun 9, 2014
c2324c6
Merge pull request #5 from mkwhitacre/fixupExamples
tomwhite Jun 10, 2014
c8e12ca
CDK-94. Use Kite application pom
tomwhite Jun 3, 2014
38af900
CDK-514: Update dataset-staging.
rdblue Jul 3, 2014
0c1e9b3
CDK-514: Update dataset example.
rdblue Jul 4, 2014
f29768b
CDK-514: Update dataset-compatibility.
rdblue Jul 4, 2014
934733a
CDK-514: Update dataset-hbase example.
rdblue Jul 4, 2014
12c01f1
Use UTC for timestamps for avoid timezone issues.
tomwhite May 13, 2014
7f17a73
Address review comments.
rdblue Jul 9, 2014
0109e58
CDK-514: Update examples for CDK-511.
rdblue Jul 10, 2014
bde9190
CDK-533: Use full path for dataset-compatibility locations.
rdblue Jul 10, 2014
e578922
CDK-534. Demo example no longer compiles since it uses deprecated met…
tomwhite Jul 14, 2014
b78438b
Merge branch 'snapshot' into 0.15.0
tomwhite Jul 15, 2014
9872e82
Update stable examples for version 0.15.0
tomwhite Jul 15, 2014
25c8c94
Update snapshot examples for version 0.15.1-SNAPSHOT
tomwhite Jul 15, 2014
c37f82b
CDK-568: Replaced use of DatasetRepository with Datasets API
Aug 11, 2014
89c9e45
CDK-534: Switch demo crunch jobs to use views.
rdblue Jul 11, 2014
913feca
CDK-539. Convert demo example to use views.
tomwhite Aug 5, 2014
cbc2a0a
CDK-546: Added a Java Spark demo
Jul 24, 2014
827f6a3
CDK-593: Update with API changes to CrunchDatasets
Aug 20, 2014
74681f3
CDK-595: Added a note in the README.md requiring CDH5
Aug 20, 2014
282233c
CDK-597: Update stable examples for version 0.16.0
Aug 21, 2014
4404a8f
CDK-597: Update snapshot examples for version 0.16.1-SNAPSHOT
Aug 21, 2014
b0c6362
CDK-575: Update parent pom to CDH5 app parent.
Aug 28, 2014
c88ba69
CDK-647: Update examples to use Flume's Log4jAppender and DatasetSink
Sep 3, 2014
147969d
CDK-670. Rename dataset CLI tool to "kite-dataset".
tomwhite Sep 30, 2014
f67a9d9
CDK-656: Updates to the examples based on testing with the QS VM 5.1
Oct 2, 2014
a9f9532
CDK-716: Removed instructions for running examples on a cluster
Oct 7, 2014
a25ca3b
CDK-722. In the examples the MR job history server should listen on t…
tomwhite Oct 9, 2014
eed5651
CDK-724: Seperate the steps always needed to setup the QS VM from hos…
Oct 9, 2014
7c7dbc0
CDK-726: Fixed doc issues in json, logging, and spark examples.
Oct 9, 2014
094d3c8
CDK-605: Removed oozie from demo example.
Oct 9, 2014
eef25d6
CDK-720: Update stable examples for version 0.17.0
Oct 10, 2014
e3f2dca
CDK-720: Update snapshot examples for version 0.17.1-SNAPSHOT
Oct 10, 2014
6be3250
CDK-788. Move examples integration tests into examples modules.
tomwhite Nov 25, 2014
efb1246
Remove debug flag.
tomwhite Dec 3, 2014
4fcd39a
Fix kite version param
Dec 4, 2014
8d365ff
Merge branch 'snapshot' into 0.17.1
tomwhite Dec 10, 2014
94e77a7
CDK-806: Update stable examples for version 0.17.1
tomwhite Dec 10, 2014
7f09c25
CDK-806: Update snapshot examples for version 0.17.2-SNAPSHOT
tomwhite Dec 10, 2014
e5162e1
CDK-788. Add instructions on how to run the integration tests.
tomwhite Dec 10, 2014
1903c7d
Fix typos
tomwhite Feb 11, 2015
2e84ec3
CDK-910: Update stable examples for version 0.18.0
tomwhite Feb 11, 2015
d1a54e8
CDK-910: Update snapshot examples for version 0.18.1-SNAPSHOT
tomwhite Feb 11, 2015
69d51ec
Check if writer is Flushable.
tomwhite Feb 23, 2015
1f4c78e
CDK-931: Update stable examples for version 1.0.0
tomwhite Feb 24, 2015
ba07e3a
CDK-931: Update snapshot examples for version 1.0.1-SNAPSHOT
tomwhite Feb 24, 2015
a762c3a
update version used in morphline docs to latest stable release
Apr 24, 2015
2b4494a
Merge branch 'snapshot' into 1.1.0
rdblue Jun 16, 2015
f7200f1
KITE-1021: Create 1.1.0 examples branch
rdblue Jun 17, 2015
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -2,5 +2,9 @@
.settings
.project
target
build
test-output
.surefire-*
.DS_Store
.idea
*.iml
115 changes: 81 additions & 34 deletions README.md
@@ -19,11 +19,11 @@ Each example is a standalone Maven module with associated documentation.
## Getting Started

The easiest way to run the examples is on the
[Cloudera QuickStart VM](https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM),
[Cloudera QuickStart VM](http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html),
which has all the necessary Hadoop services pre-installed, configured, and
running locally. See the notes below for any initial setup steps you should take.

The current examples run on version 4.4.0 of the QuickStart VM.
The current examples run on version 5.1.0 of the QuickStart VM.

Check out the latest [branch](https://github.com/kite-sdk/kite-examples/branches) of this repository in the VM:

@@ -32,8 +32,6 @@ git clone git://github.com/kite-sdk/kite-examples.git
cd kite-examples
```

If you are using a prepared Kite VM, the `git clone` command is already done for you.

Then choose the example you want to try and refer to the README in the relevant subdirectory.

### Setting up the QuickStart VM
@@ -44,47 +42,95 @@ There are two ways to run the examples with the QuickStart VM:
2. From your host computer.

The advantage of the first approach is that you don't need to install anything extra on
your host computer, such as Java or Maven, so there are no extra set up steps.
your host computer, such as Java or Maven, so there are fewer set up steps.

For either approach, you need to make the following changes while logged into the VM:

* __Sync the system clock__ For some of the examples it's important that the host and
guest times are in sync. To synchronize the guest, login and type
`sudo ntpdate pool.ntp.org`.
* __Configure the NameNode to listen on all interfaces__ In order to access the cluster from
the host computer, the NameNode must be configured to listen on all network interfaces. This
is done by setting the `dfs.namenode.rpc-bind-host` property in `/etc/hadoop/conf/hdfs-site.xml`:
```xml
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>
```
* __Configure the History Server to listen on all interfaces__ In order to access the
cluster from the host computer, the History Server must be configured to listen on all
network interfaces. This is done by setting the `mapreduce.jobhistory.address` property
in `/etc/hadoop/conf/mapred-site.xml`:
```xml
<property>
<name>mapreduce.jobhistory.address</name>
<value>0.0.0.0:10020</value>
</property>
```
* __Configure HBase to listen on all interfaces__ In order to access the cluster from
the host computer, HBase must be configured to listen on all network interfaces. This
is done by setting the `hbase.master.ipc.address` and `hbase.regionserver.ipc.address`
properties in `/etc/hbase/conf/hbase-site.xml`:
```xml
<property>
<name>hbase.master.ipc.address</name>
<value>0.0.0.0</value>
</property>

<property>
<name>hbase.regionserver.ipc.address</name>
<value>0.0.0.0</value>
</property>
```
* __Restart the vm__ Restart the VM with `sudo shutdown -r now`
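All three edits above insert the same Hadoop-style `<property>` stanza shape into an XML config file. As an illustrative sketch (the `make_property` helper is ours, not part of the VM or the examples), the stanzas can be generated like this:

```shell
#!/bin/bash
# Illustrative helper: print a Hadoop-style <property> stanza
# for a configuration name/value pair.
make_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# The three wildcard-bind settings described above:
make_property dfs.namenode.rpc-bind-host 0.0.0.0
make_property mapreduce.jobhistory.address 0.0.0.0:10020
make_property hbase.master.ipc.address 0.0.0.0
```

Paste each printed stanza inside the `<configuration>` element of the corresponding file.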

The second approach is preferable when you want to use tools from your own development
environment (browser, IDE, command line). However, there are a few extra steps you
need to take to configure the QuickStart VM, listed below.

* __Enable port forwarding__ For VirtualBox, open the Settings dialog for the VM,
select the Network tab, and click the Port Forwarding button. Map the following ports -
in each case the host port and the guest port should be the same.
* 7180 (Cloudera Manager web UI)
* 8020, 50010, 50020, 50070, 50075 (HDFS NameNode and DataNode)
* 8021 (MapReduce JobTracker)
* 8888 (Hue web UI)
* 9083 (Hive/HCatalog metastore)
* 41415 (Flume agent)
* 11000 (Oozie server)
* 21050 (Impala JDBC port)
* __Bind daemons to the wildcard address__ Daemons that are accessed from the host need
to listen on all network interfaces. In [Cloudera Manager]
(http://localhost:7180/cmf/services/status) for each of the services listed below,
select the service, click "View and Edit" under the Configuration tab then
search for "wildcard", check the box, then save changes.
* HDFS NameNode and DataNode
* Hue server
* MapReduce JobTracker
* __Add a host entry for localhost.localdomain__ If your host computer does not have a
mapping for `localhost.localdomain`, then add a line like the following to `/etc/hosts`
need to take to configure the QuickStart VM, listed below:

* __Add a host entry for quickstart.cloudera__ Add or edit a line like the following
in `/etc/hosts` on the host machine
```
127.0.0.1 localhost localhost.localdomain
127.0.0.1 localhost.localdomain localhost quickstart.cloudera
```
* __Enable port forwarding__ Most of the ports that need to be forwarded are pre-configured
on the QuickStart VM, but there are a few that we need to add. For VirtualBox, open
the Settings dialog for the VM, select the Network tab, and click the Port Forwarding
button. Map the following ports - in each case the host port and the guest port
should be the same. Also, your VM should not be running when you are making these changes.
* 8032 (YARN ResourceManager)
* 10020 (MapReduce JobHistoryServer)

If you have VBoxManage installed on your host machine, you can do this via
command line as well. In bash, this would look something like:

```bash
# Set VM_NAME to the name of your VM as it appears in VirtualBox
VM_NAME="QuickStart VM"
PORTS="8032 10020"
for port in $PORTS; do
VBoxManage modifyvm "$VM_NAME" --natpf1 "Rule $port,tcp,,$port,,$port"
done
```
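Each `--natpf1` argument is a single VirtualBox rule string of the form `name,protocol,hostip,hostport,guestip,guestport`, where empty IP fields mean all interfaces. Independent of VirtualBox, the loop above expands to rule strings like these (a sketch, just to show the format):

```shell
#!/bin/bash
# Expand the same pattern the loop above passes to VBoxManage,
# without touching VirtualBox, to show the rule-string format.
for port in 8032 10020; do
  echo "Rule $port,tcp,,$port,,$port"
done
# → Rule 8032,tcp,,8032,,8032
# → Rule 10020,tcp,,10020,,10020
```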

## Running integration tests

Some of the examples include integration tests. You can run them all with the following
command:

```bash
for module in $(ls -d -- */); do
(cd "$module" && mvn clean verify) || break
done
```
* __Sync the system clock__ For some of the examples it's important that the host and
guest times are in sync. To synchronize the guest, login and type
`sudo ntpdate pool.ntp.org`.
* __Restart the cluster__ Restart the whole cluster in Cloudera Manager.

# Troubleshooting

## Working with the VM

* __What are the usernames/passwords for the VM?__
* Cloudera manager: 4.4.0: cloudera/cloudera, 4.3.0: admin/admin
* Cloudera manager: cloudera/cloudera
* HUE: cloudera/cloudera
* Login: cloudera/cloudera

@@ -124,3 +170,4 @@ guest times are in sync. To synchronize the guest, login and type
* Using VMWare? Try using VirtualBox.

[vbox]: https://www.virtualbox.org/wiki/Downloads

53 changes: 53 additions & 0 deletions configure-flume.sh
@@ -0,0 +1,53 @@
#!/bin/bash

if [[ "$EUID" -ne 0 ]]; then
echo "Please run using sudo: sudo $0"
exit 1
fi

# Make sure there isn't a plugins.d in /usr/lib/flume-ng already
if [[ -d /usr/lib/flume-ng/plugins.d && ! -L /usr/lib/flume-ng/plugins.d ]]; then
echo "Error: /usr/lib/flume-ng/plugins.d already exists and is a directory"
exit 1
fi

# Create the plugins.d folder in /var/lib/flume-ng
if [[ ! -d /var/lib/flume-ng/plugins.d ]]; then
mkdir -p /var/lib/flume-ng/plugins.d
fi


# Link /usr/lib/flume-ng/plugins.d to /var/lib/flume-ng/plugins.d
if [[ -d /usr/lib/flume-ng && ! -L /usr/lib/flume-ng/plugins.d ]]; then
ln -s /var/lib/flume-ng/plugins.d /usr/lib/flume-ng/plugins.d
fi

# Create the lib and libext directories for the dataset-sink plugin
mkdir -p /var/lib/flume-ng/plugins.d/dataset-sink/lib
mkdir -p /var/lib/flume-ng/plugins.d/dataset-sink/libext

# Remove any existing libraries/symlinks
rm -f /var/lib/flume-ng/plugins.d/dataset-sink/lib/*
rm -f /var/lib/flume-ng/plugins.d/dataset-sink/libext/*

BASE_DIR=/usr/lib
if [[ ! -d /usr/lib/kite && -d /opt/cloudera/parcels/CDH/lib ]]; then
BASE_DIR=/opt/cloudera/parcels/CDH/lib;
fi

# Create links to the kite-data-hcatalog and kite-data-hbase jars
ln -s ${BASE_DIR}/kite/kite-data-hcatalog.jar /var/lib/flume-ng/plugins.d/dataset-sink/lib/kite-data-hcatalog.jar
ln -s ${BASE_DIR}/kite/kite-data-hbase.jar /var/lib/flume-ng/plugins.d/dataset-sink/lib/kite-data-hbase.jar

# Create links to the Kite dependencies
ln -s ${BASE_DIR}/hive/lib/antlr-2.7.7.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/antlr-2.7.7.jar
ln -s ${BASE_DIR}/hive/lib/antlr-runtime-3.4.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/antlr-runtime-3.4.jar
ln -s ${BASE_DIR}/hive/lib/datanucleus-api-jdo-3.2.1.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/datanucleus-api-jdo-3.2.1.jar
ln -s ${BASE_DIR}/hive/lib/datanucleus-core-3.2.2.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/datanucleus-core-3.2.2.jar
ln -s ${BASE_DIR}/hive/lib/datanucleus-rdbms-3.2.1.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/datanucleus-rdbms-3.2.1.jar
ln -s ${BASE_DIR}/hive/lib/hive-common.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/hive-common.jar
ln -s ${BASE_DIR}/hive/lib/hive-exec.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/hive-exec.jar
ln -s ${BASE_DIR}/hive/lib/hive-metastore.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/hive-metastore.jar
ln -s ${BASE_DIR}/hive/lib/jdo-api-3.0.1.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/jdo-api-3.0.1.jar
ln -s ${BASE_DIR}/hive/lib/libfb303-0.9.0.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/libfb303-0.9.0.jar
ln -s ${BASE_DIR}/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/hive-hcatalog-core.jar
27 changes: 14 additions & 13 deletions dataset-compatibility/README.md
Expand Up @@ -17,8 +17,10 @@ all of the rating data and a `u.item` file with information about each movie.
To add these files to HDFS:

1. Unzip the file: `unzip ml-100k.zip`
2. Copy the `u.data` file into HDFS: `hadoop fs -copyFromLocal ml-100k/u.data`
3. Copy the `u.item` file into HDFS: `hadoop fs -copyFromLocal ml-100k/u.item`
2. Copy the `u.data` file into HDFS: `hdfs dfs -copyFromLocal ml-100k/u.data ratings.tsv`
3. Copy the `u.item` file into HDFS: `hdfs dfs -copyFromLocal ml-100k/u.item movies.psv`

This also renames the files to be a little more friendly.

### Configuring Kite Datasets

@@ -66,7 +68,7 @@ Next, we need to create a `DatasetDescriptor` with the schema and rest of the
information, like location and format:
```java
DatasetDescriptor ratings = new DatasetDescriptor.Builder()
.location("hdfs:u.data")
.location("hdfs:ratings.tsv")
.format(Formats.CSV)
.property("kite.csv.delimiter", "\t")
.schema(csvSchema)
@@ -75,7 +77,7 @@ DatasetDescriptor ratings = new DatasetDescriptor.Builder()

Finally, save the descriptor so it can be used later:
```java
repo.create("ratings", ratings);
Datasets.create("dataset:hdfs:/tmp/data/ratings", ratings);
```

Similarly, we will create a dataset for movies the same way. The file of movies
@@ -94,25 +96,24 @@ Schema movieSchema = SchemaBuilder.record("Movie")
// ignore genre fields for now
.endRecord();

repo.create("movies", new DatasetDescriptor.Builder()
.location("hdfs:u.item")
Datasets.create("dataset:hdfs:/tmp/data/movies", new DatasetDescriptor.Builder()
.location("hdfs:movies.psv")
.format(Formats.CSV)
.property("kite.csv.delimiter", "|")
.schema(movieSchema)
.build());
```

*This doesn't currently work because the files need to be in directories
already under the repo root.* We need to fix this by allowing the user to pass
a location to the FS repository. To get this working right now, just put the
data files in directories named "movies" and "ratings" under the repository
root.

These steps are done in the `org.kitesdk.examples.data.DescribeDatasets`
program. You can run this and then read the movies using these commands:
program:
```bash
mvn compile
mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.DescribeDatasets"
```

Now the datasets are ready to be used. You can read movies with this command:

```bash
mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.ReadMovies"
```

52 changes: 17 additions & 35 deletions dataset-compatibility/pom.xml
@@ -22,15 +22,16 @@

<groupId>org.kitesdk.examples</groupId>
<artifactId>dataset-compatibility</artifactId>
<version>0.10.1</version>
<version>1.1.0</version>
<packaging>jar</packaging>

<name>Kite Dataset Compatibility Example</name>

<properties>
<!-- Keep this updated to the latest Kite release! -->
<kite-version>0.10.1</kite-version>
</properties>
<parent>
<groupId>org.kitesdk</groupId>
<artifactId>kite-app-parent-cdh5</artifactId>
<version>1.1.0</version>
</parent>

<build>
<plugins>
@@ -51,41 +52,22 @@

<dependencies>
<dependency>
<groupId>org.kitesdk</groupId>
<artifactId>kite-data-core</artifactId>
<version>${kite-version}</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>11.0.2</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.0.0-cdh4.3.0</version>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>${hadoop.log4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.1</version>
<version>${hadoop.slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.kitesdk</groupId>
<artifactId>kite-hadoop-cdh5-dependencies</artifactId>
<version>${kite.version}</version>
<type>pom</type>
<scope>compile</scope> <!-- provide Hadoop dependencies -->
</dependency>
</dependencies>

<repositories>
<repository>
<id>cdh.repo</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
<name>Cloudera Repositories</name>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>

</project>