Kite-examples #30

Open · wants to merge 142 commits into base: 0.10.1

Changes from all commits (142 commits)
7fa7cc0
Update snapshot branch to 0.10.1-SNAPSHOT.
rdblue Dec 10, 2013
dd68d39
CDK-275. Upgrade to Crunch 0.9.0.
tomwhite Jan 8, 2014
367412b
Update examples for CDH 4.4.0 and the new VM.
rdblue Jan 9, 2014
e47cec5
Fixing old kite:drop-dataset mvn references.
rdblue Jan 11, 2014
8f3dbd3
Adding new JSON example.
rdblue Jan 11, 2014
485d6fd
Update snapshot examples for version 0.10.2-SNAPSHOT
rdblue Jan 14, 2014
387df07
Adding bash instructions for port forwarding
markgrover Jan 26, 2014
930e824
Fixing a typo
markgrover Jan 26, 2014
e126749
Add missing plugin repositories section for kite plugin.
tomwhite Jan 28, 2014
75c2a78
Add missing plugin repositories section for kite plugin.
tomwhite Jan 28, 2014
ebd454a
Remove age field that was mistakenly added.
tomwhite Jan 28, 2014
321e946
Remove age field that was mistakenly added.
tomwhite Jan 28, 2014
d801728
Merge pull request #1 from markgrover/bash_commands
tomwhite Jan 30, 2014
a227ce2
Merge pull request #2 from markgrover/typo1
tomwhite Jan 30, 2014
bde43cb
Adding bash instructions for port forwarding
markgrover Jan 26, 2014
b3b1fc4
Fixing a typo
markgrover Jan 26, 2014
d0fe377
CDK-107. Upgrade to CDH 4.4.0.
tomwhite Feb 5, 2014
a0f8cda
Remove instruction to bind daemons to the wildcard address since
tomwhite Feb 5, 2014
805fe58
Fix dataset name for HBase.
tomwhite Feb 6, 2014
ec1ef8d
Sync version with snapshot
tomwhite Feb 6, 2014
174dbba
Merge branch 'snapshot'
tomwhite Feb 6, 2014
d619bcb
Update stable examples for version 0.11.0
tomwhite Feb 6, 2014
580729b
Update snapshot examples for version 0.11.1-SNAPSHOT
tomwhite Feb 6, 2014
b642441
test - pls ignore
Feb 7, 2014
bcb20a1
CDK-312: Add a module that contains examples for how to unit test Mor…
Feb 7, 2014
04dffef
fix typo
Feb 7, 2014
ac4ba9c
bit more javadoc
Feb 7, 2014
d16abe9
cleanup
Feb 7, 2014
73c3ca6
cleanup
Feb 7, 2014
ac84ed7
bit more javadoc
Feb 7, 2014
d36d2aa
cleanup
Feb 7, 2014
a478721
make Locale configurable
Feb 7, 2014
b0c1186
cleanup
Feb 7, 2014
b40d357
Add instructions on how to generate list of all needed jar files
Feb 11, 2014
6e07593
add doc on kite-morphlines-core vs kite-morphlines-all
Feb 11, 2014
1449a19
fix typo
Feb 11, 2014
ec7a318
Add doc on using the Maven CLI to run test data through a morphline
Feb 11, 2014
6cfc255
fix typo
Feb 11, 2014
2c8fcf8
fix doc
Feb 11, 2014
6dfc248
fix typo
Feb 11, 2014
ff3370c
Mention that Flume user impersonation is already enabled in CM5.
tomwhite Feb 13, 2014
57e869e
add bit more doc
Feb 13, 2014
9ecb416
add bit more doc
Feb 13, 2014
3d167b5
add doc about 'mvn dependency:tree'
Feb 18, 2014
be4a4f4
update version number in preparation for upcoming release
Mar 4, 2014
09afc23
update version number in preparation for upcoming release
Mar 4, 2014
7f33e35
Add Erick's CSV unit tests plus corresponding docs. Thanks Erick!
Mar 6, 2014
b9e4b1f
add more of Erick's doc
Mar 6, 2014
a6f6e5d
formatting
Mar 6, 2014
f4a8b56
formatting
Mar 6, 2014
f0fbdb3
formatting
Mar 6, 2014
91b8659
formatting
Mar 6, 2014
6f437eb
formatting
Mar 6, 2014
7d49944
formatting
Mar 6, 2014
32a10d9
formatting
Mar 6, 2014
604e23c
formatting
Mar 6, 2014
099d4cb
add license header
Mar 6, 2014
1ab6c94
formatting
Mar 6, 2014
4a2a6bb
formatting
Mar 6, 2014
079f6f8
formatting
Mar 6, 2014
c1c4497
formatting
Mar 6, 2014
b62250c
add a bit more doc
Mar 6, 2014
1354e9a
add bit more doc
Mar 6, 2014
8d5b83e
cleanup
Mar 6, 2014
896076b
formatting
Mar 6, 2014
3043433
CDK-247. The dataset-staging example should run in parallel.
tomwhite Nov 14, 2013
24f89ea
avoid compiler warning
Mar 6, 2014
8c9ce64
bit more doc
Mar 8, 2014
fd922b7
bit more doc
Mar 8, 2014
17d3429
bit more doc
Mar 8, 2014
0e47a7c
CDK-361. Demo example fails with 'java.lang.ClassNotFoundException: o…
tomwhite Mar 10, 2014
3deb95d
Merge branch 'snapshot' into master2
tomwhite Mar 11, 2014
ddbef15
Update stable examples for version 0.12.0
tomwhite Mar 11, 2014
b1a510c
Update snapshot examples for version 0.12.1-SNAPSHOT
tomwhite Mar 11, 2014
dd7a3e2
CDK-200: Fix http port conflict with yarn.
rdblue Mar 12, 2014
37bf586
Add license header.
tomwhite Mar 13, 2014
720031b
CDK-252: Configure log4j in examples.
rdblue Mar 18, 2014
c6821fc
Update staging example to use hive-compatible names.
rdblue Mar 18, 2014
1ec4409
Update stable examples for version 0.12.1
rdblue Mar 19, 2014
fdedd90
Update snapshot examples for version 0.12.2-SNAPSHOT
rdblue Mar 19, 2014
10857a8
Update stable examples for version 0.13.0
rdblue Apr 23, 2014
88a679c
Update snapshot examples for version 0.13.1-SNAPSHOT
rdblue Apr 23, 2014
6c9e840
Update stable examples for version 0.14.0
rdblue May 14, 2014
38e244c
Update snapshot examples for version 0.14.1-SNAPSHOT
rdblue May 14, 2014
033bc4f
CDK-330. Move event schema to examples.
tomwhite May 19, 2014
31a92ac
Fix dependencies so StagingToPersistent runs correctly.
tomwhite May 20, 2014
deab191
CDK-442. Fix dataset-compatibility instructions to make it clear how …
tomwhite May 21, 2014
2b4ca6f
CDK-408. Dataset examples don't work against local filesystem.
tomwhite May 21, 2014
256fa57
Minor improvements to the instructions.
tomwhite May 22, 2014
6fe00ab
Update stable examples for version 0.14.1
tomwhite May 23, 2014
dcaadac
Update snapshot examples for version 0.14.2-SNAPSHOT
tomwhite May 23, 2014
a7d0b34
CDK-423. Remove HCatalog dependency.
tomwhite May 26, 2014
4a4d8b1
Fixed up issue with DataDescriptor.Builder using wrong API
Jun 9, 2014
c2324c6
Merge pull request #5 from mkwhitacre/fixupExamples
tomwhite Jun 10, 2014
c8e12ca
CDK-94. Use Kite application pom
tomwhite Jun 3, 2014
38af900
CDK-514: Update dataset-staging.
rdblue Jul 3, 2014
0c1e9b3
CDK-514: Update dataset example.
rdblue Jul 4, 2014
f29768b
CDK-514: Update dataset-compatibility.
rdblue Jul 4, 2014
934733a
CDK-514: Update dataset-hbase example.
rdblue Jul 4, 2014
12c01f1
Use UTC for timestamps for avoid timezone issues.
tomwhite May 13, 2014
7f17a73
Address review comments.
rdblue Jul 9, 2014
0109e58
CDK-514: Update examples for CDK-511.
rdblue Jul 10, 2014
bde9190
CDK-533: Use full path for dataset-compatibility locations.
rdblue Jul 10, 2014
e578922
CDK-534. Demo example no longer compiles since it uses deprecated met…
tomwhite Jul 14, 2014
b78438b
Merge branch 'snapshot' into 0.15.0
tomwhite Jul 15, 2014
9872e82
Update stable examples for version 0.15.0
tomwhite Jul 15, 2014
25c8c94
Update snapshot examples for version 0.15.1-SNAPSHOT
tomwhite Jul 15, 2014
c37f82b
CDK-568: Replaced use of DatasetRepository with Datasets API
Aug 11, 2014
89c9e45
CDK-534: Switch demo crunch jobs to use views.
rdblue Jul 11, 2014
913feca
CDK-539. Convert demo example to use views.
tomwhite Aug 5, 2014
cbc2a0a
CDK-546: Added a Java Spark demo
Jul 24, 2014
827f6a3
CDK-593: Update with API changes to CrunchDatasets
Aug 20, 2014
74681f3
CDK-595: Added a note in the README.md requiring CDH5
Aug 20, 2014
282233c
CDK-597: Update stable examples for version 0.16.0
Aug 21, 2014
4404a8f
CDK-597: Update snapshot examples for version 0.16.1-SNAPSHOT
Aug 21, 2014
b0c6362
CDK-575: Update parent pom to CDH5 app parent.
Aug 28, 2014
c88ba69
CDK-647: Update examples to use Flume's Log4jAppender and DatasetSink
Sep 3, 2014
147969d
CDK-670. Rename dataset CLI tool to "kite-dataset".
tomwhite Sep 30, 2014
f67a9d9
CDK-656: Updates to the examples based on testing with the QS VM 5.1
Oct 2, 2014
a9f9532
CDK-716: Removed instructions for running examples on a cluster
Oct 7, 2014
a25ca3b
CDK-722. In the examples the MR job history server should listen on t…
tomwhite Oct 9, 2014
eed5651
CDK-724: Seperate the steps always needed to setup the QS VM from hos…
Oct 9, 2014
7c7dbc0
CDK-726: Fixed doc issues in json, logging, and spark examples.
Oct 9, 2014
094d3c8
CDK-605: Removed oozie from demo example.
Oct 9, 2014
eef25d6
CDK-720: Update stable examples for version 0.17.0
Oct 10, 2014
e3f2dca
CDK-720: Update snapshot examples for version 0.17.1-SNAPSHOT
Oct 10, 2014
6be3250
CDK-788. Move examples integration tests into examples modules.
tomwhite Nov 25, 2014
efb1246
Remove debug flag.
tomwhite Dec 3, 2014
4fcd39a
Fix kite version param
Dec 4, 2014
8d365ff
Merge branch 'snapshot' into 0.17.1
tomwhite Dec 10, 2014
94e77a7
CDK-806: Update stable examples for version 0.17.1
tomwhite Dec 10, 2014
7f09c25
CDK-806: Update snapshot examples for version 0.17.2-SNAPSHOT
tomwhite Dec 10, 2014
e5162e1
CDK-788. Add instructions on how to run the integration tests.
tomwhite Dec 10, 2014
1903c7d
Fix typos
tomwhite Feb 11, 2015
2e84ec3
CDK-910: Update stable examples for version 0.18.0
tomwhite Feb 11, 2015
d1a54e8
CDK-910: Update snapshot examples for version 0.18.1-SNAPSHOT
tomwhite Feb 11, 2015
69d51ec
Check if writer is Flushable.
tomwhite Feb 23, 2015
1f4c78e
CDK-931: Update stable examples for version 1.0.0
tomwhite Feb 24, 2015
ba07e3a
CDK-931: Update snapshot examples for version 1.0.1-SNAPSHOT
tomwhite Feb 24, 2015
a762c3a
update version used in morphline docs to latest stable release
Apr 24, 2015
2b4494a
Merge branch 'snapshot' into 1.1.0
rdblue Jun 16, 2015
f7200f1
KITE-1021: Create 1.1.0 examples branch
rdblue Jun 17, 2015
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -2,5 +2,9 @@
.settings
.project
target
build
test-output
.surefire-*
.DS_Store
.idea
*.iml
115 changes: 81 additions & 34 deletions README.md
@@ -19,11 +19,11 @@ Each example is a standalone Maven module with associated documentation.
## Getting Started

The easiest way to run the examples is on the
[Cloudera QuickStart VM](https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM),
[Cloudera QuickStart VM](http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html),
which has all the necessary Hadoop services pre-installed, configured, and
running locally. See the notes below for any initial setup steps you should take.

The current examples run on version 4.4.0 of the QuickStart VM.
The current examples run on version 5.1.0 of the QuickStart VM.

Check out the latest [branch](https://github.com/kite-sdk/kite-examples/branches) of this repository in the VM:

@@ -32,8 +32,6 @@ git clone git://github.com/kite-sdk/kite-examples.git
cd kite-examples
```

If you are using a prepared Kite VM, the `git clone` command is already done for you.

Then choose the example you want to try and refer to the README in the relevant subdirectory.

### Setting up the QuickStart VM
@@ -44,47 +42,95 @@ There are two ways to run the examples with the QuickStart VM:
2. From your host computer.

The advantage of the first approach is that you don't need to install anything extra on
your host computer, such as Java or Maven, so there are no extra set up steps.
your host computer, such as Java or Maven, so there are fewer set up steps.

For either approach, you need to make the following changes while logged into the VM:

* __Sync the system clock__ For some of the examples it's important that the host and
guest times are in sync. To synchronize the guest, login and type
`sudo ntpdate pool.ntp.org`.
* __Configure the NameNode to listen on all interfaces__ In order to access the cluster from
the host computer, the NameNode must be configured to listen on all network interfaces. This
is done by setting the `dfs.namenode.rpc-bind-host` property in `/etc/hadoop/conf/hdfs-site.xml`:
```xml
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>
```
* __Configure the History Server to listen on all interfaces__ In order to access the
cluster from the host computer, the History Server must be configured to listen on all
network interfaces. This is done by setting the `mapreduce.jobhistory.address` property
in `/etc/hadoop/conf/mapred-site.xml`:
```xml
<property>
<name>mapreduce.jobhistory.address</name>
<value>0.0.0.0:10020</value>
</property>
```
* __Configure HBase to listen on all interfaces__ In order to access the cluster from
the host computer, HBase must be configured to listen on all network interfaces. This
is done by setting the `hbase.master.ipc.address` and `hbase.regionserver.ipc.address`
properties in `/etc/hbase/conf/hbase-site.xml`:
```xml
<property>
<name>hbase.master.ipc.address</name>
<value>0.0.0.0</value>
</property>

<property>
<name>hbase.regionserver.ipc.address</name>
<value>0.0.0.0</value>
</property>
```
* __Restart the vm__ Restart the VM with `sudo shutdown -r now`
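All three edits above insert the same Hadoop-style `<property>` stanza shape into an XML config file. As an illustrative sketch (the `make_property` helper is ours, not part of the VM or the examples), the stanzas can be generated like this:

```shell
#!/bin/bash
# Illustrative helper: print a Hadoop-style <property> stanza
# for a configuration name/value pair.
make_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# The three wildcard-bind settings described above:
make_property dfs.namenode.rpc-bind-host 0.0.0.0
make_property mapreduce.jobhistory.address 0.0.0.0:10020
make_property hbase.master.ipc.address 0.0.0.0
```

Paste each printed stanza inside the `<configuration>` element of the corresponding file.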

The second approach is preferable when you want to use tools from your own development
environment (browser, IDE, command line). However, there are a few extra steps you
need to take to configure the QuickStart VM, listed below.

* __Enable port forwarding__ For VirtualBox, open the Settings dialog for the VM,
select the Network tab, and click the Port Forwarding button. Map the following ports -
in each case the host port and the guest port should be the same.
* 7180 (Cloudera Manager web UI)
* 8020, 50010, 50020, 50070, 50075 (HDFS NameNode and DataNode)
* 8021 (MapReduce JobTracker)
* 8888 (Hue web UI)
* 9083 (Hive/HCatalog metastore)
* 41415 (Flume agent)
* 11000 (Oozie server)
* 21050 (Impala JDBC port)
* __Bind daemons to the wildcard address__ Daemons that are accessed from the host need
to listen on all network interfaces. In [Cloudera Manager]
(http://localhost:7180/cmf/services/status) for each of the services listed below,
select the service, click "View and Edit" under the Configuration tab then
search for "wildcard", check the box, then save changes.
* HDFS NameNode and DataNode
* Hue server
* MapReduce JobTracker
* __Add a host entry for localhost.localdomain__ If your host computer does not have a
mapping for `localhost.localdomain`, then add a line like the following to `/etc/hosts`
need to take to configure the QuickStart VM, listed below:

* __Add a host entry for quickstart.cloudera__ Add or edit a line like the following
in `/etc/hosts` on the host machine
```
127.0.0.1 localhost localhost.localdomain
127.0.0.1 localhost.localdomain localhost quickstart.cloudera
```
* __Enable port forwarding__ Most of the ports that need to be forwarded are pre-configured
on the QuickStart VM, but there are a few that we need to add. For VirtualBox, open
the Settings dialog for the VM, select the Network tab, and click the Port Forwarding
button. Map the following ports - in each case the host port and the guest port
should be the same. Also, your VM should not be running when you are making these changes.
* 8032 (YARN ResourceManager)
* 10020 (MapReduce JobHistoryServer)

If you have VBoxManage installed on your host machine, you can do this via
command line as well. In bash, this would look something like:

```bash
# Set VM_NAME to the name of your VM as it appears in VirtualBox
VM_NAME="QuickStart VM"
PORTS="8032 10020"
for port in $PORTS; do
VBoxManage modifyvm "$VM_NAME" --natpf1 "Rule $port,tcp,,$port,,$port"
done
```
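Each `--natpf1` argument is a single VirtualBox rule string of the form `name,protocol,hostip,hostport,guestip,guestport`, where empty IP fields mean all interfaces. Independent of VirtualBox, the loop above expands to rule strings like these (a sketch, just to show the format):

```shell
#!/bin/bash
# Expand the same pattern the loop above passes to VBoxManage,
# without touching VirtualBox, to show the rule-string format.
for port in 8032 10020; do
  echo "Rule $port,tcp,,$port,,$port"
done
# → Rule 8032,tcp,,8032,,8032
# → Rule 10020,tcp,,10020,,10020
```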

## Running integration tests

Some of the examples include integration tests. You can run them all with the following
command:

```bash
for module in $(ls -d -- */); do
(cd "$module" && mvn clean verify) || break
done
```
* __Sync the system clock__ For some of the examples it's important that the host and
guest times are in sync. To synchronize the guest, login and type
`sudo ntpdate pool.ntp.org`.
* __Restart the cluster__ Restart the whole cluster in Cloudera Manager.

# Troubleshooting

## Working with the VM

* __What are the usernames/passwords for the VM?__
* Cloudera manager: 4.4.0: cloudera/cloudera, 4.3.0: admin/admin
* Cloudera manager: cloudera/cloudera
* HUE: cloudera/cloudera
* Login: cloudera/cloudera

@@ -124,3 +170,4 @@ guest times are in sync. To synchronize the guest, login and type
* Using VMWare? Try using VirtualBox.

[vbox]: https://www.virtualbox.org/wiki/Downloads

53 changes: 53 additions & 0 deletions configure-flume.sh
@@ -0,0 +1,53 @@
#!/bin/bash

if [[ "$EUID" -ne 0 ]]; then
echo "Please run using sudo: sudo $0"
exit 1
fi

# Make sure there isn't a plugins.d in /usr/lib/flume-ng already
if [[ -d /usr/lib/flume-ng/plugins.d && ! -L /usr/lib/flume-ng/plugins.d ]]; then
echo "Error: /usr/lib/flume-ng/plugins.d already exists and is a directory"
exit 1
fi

# Create the plugins.d folder in /var/lib/flume-ng
if [[ ! -d /var/lib/flume-ng/plugins.d ]]; then
mkdir -p /var/lib/flume-ng/plugins.d
fi


# Link /usr/lib/flume-ng/plugins.d to /var/lib/flume-ng/plugins.d
if [[ -d /usr/lib/flume-ng && ! -L /usr/lib/flume-ng/plugins.d ]]; then
ln -s /var/lib/flume-ng/plugins.d /usr/lib/flume-ng/plugins.d
fi

# Create the lib and libext directories for the dataset-sink plugin
mkdir -p /var/lib/flume-ng/plugins.d/dataset-sink/lib
mkdir -p /var/lib/flume-ng/plugins.d/dataset-sink/libext

# Remove any existing libraries/symlinks
rm -f /var/lib/flume-ng/plugins.d/dataset-sink/lib/*
rm -f /var/lib/flume-ng/plugins.d/dataset-sink/libext/*

BASE_DIR=/usr/lib
if [[ ! -d /usr/lib/kite && -d /opt/cloudera/parcels/CDH/lib ]]; then
BASE_DIR=/opt/cloudera/parcels/CDH/lib;
fi

# Create links to the kite-data-hcatalog and kite-data-hbase jars
ln -s ${BASE_DIR}/kite/kite-data-hcatalog.jar /var/lib/flume-ng/plugins.d/dataset-sink/lib/kite-data-hcatalog.jar
ln -s ${BASE_DIR}/kite/kite-data-hbase.jar /var/lib/flume-ng/plugins.d/dataset-sink/lib/kite-data-hbase.jar

# Create links to the Kite dependencies
ln -s ${BASE_DIR}/hive/lib/antlr-2.7.7.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/antlr-2.7.7.jar
ln -s ${BASE_DIR}/hive/lib/antlr-runtime-3.4.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/antlr-runtime-3.4.jar
ln -s ${BASE_DIR}/hive/lib/datanucleus-api-jdo-3.2.1.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/datanucleus-api-jdo-3.2.1.jar
ln -s ${BASE_DIR}/hive/lib/datanucleus-core-3.2.2.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/datanucleus-core-3.2.2.jar
ln -s ${BASE_DIR}/hive/lib/datanucleus-rdbms-3.2.1.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/datanucleus-rdbms-3.2.1.jar
ln -s ${BASE_DIR}/hive/lib/hive-common.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/hive-common.jar
ln -s ${BASE_DIR}/hive/lib/hive-exec.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/hive-exec.jar
ln -s ${BASE_DIR}/hive/lib/hive-metastore.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/hive-metastore.jar
ln -s ${BASE_DIR}/hive/lib/jdo-api-3.0.1.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/jdo-api-3.0.1.jar
ln -s ${BASE_DIR}/hive/lib/libfb303-0.9.0.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/libfb303-0.9.0.jar
ln -s ${BASE_DIR}/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar /var/lib/flume-ng/plugins.d/dataset-sink/libext/hive-hcatalog-core.jar
27 changes: 14 additions & 13 deletions dataset-compatibility/README.md
Expand Up @@ -17,8 +17,10 @@ all of the rating data and a `u.item` file with information about each movie.
To add these files to HDFS:

1. Unzip the file: `unzip ml-100k.zip`
2. Copy the `u.data` file into HDFS: `hadoop fs -copyFromLocal ml-100k/u.data`
3. Copy the `u.item` file into HDFS: `hadoop fs -copyFromLocal ml-100k/u.item`
2. Copy the `u.data` file into HDFS: `hdfs dfs -copyFromLocal ml-100k/u.data ratings.tsv`
3. Copy the `u.item` file into HDFS: `hdfs dfs -copyFromLocal ml-100k/u.item movies.psv`

This also renames the files to be a little more friendly.

### Configuring Kite Datasets

@@ -66,7 +68,7 @@ Next, we need to create a `DatasetDescriptor` with the schema and rest of the
information, like location and format:
```java
DatasetDescriptor ratings = new DatasetDescriptor.Builder()
.location("hdfs:u.data")
.location("hdfs:ratings.tsv")
.format(Formats.CSV)
.property("kite.csv.delimiter", "\t")
.schema(csvSchema)
@@ -75,7 +77,7 @@ DatasetDescriptor ratings = new DatasetDescriptor.Builder()

Finally, save the descriptor so it can be used later:
```java
repo.create("ratings", ratings);
Datasets.create("dataset:hdfs:/tmp/data/ratings", ratings);
```

Similarly, we will create a dataset for movies the same way. The file of movies
@@ -94,25 +96,24 @@ Schema movieSchema = SchemaBuilder.record("Movie")
// ignore genre fields for now
.endRecord();

repo.create("movies", new DatasetDescriptor.Builder()
.location("hdfs:u.item")
Datasets.create("dataset:hdfs:/tmp/data/movies", new DatasetDescriptor.Builder()
.location("hdfs:movies.psv")
.format(Formats.CSV)
.property("kite.csv.delimiter", "|")
.schema(movieSchema)
.build());
```

*This doesn't currently work because the files need to be in directories
already under the repo root.* We need to fix this by allowing the user to pass
a location to the FS repository. To get this working right now, just put the
data files in directories named "movies" and "ratings" under the repository
root.

These steps are done in the `org.kitesdk.examples.data.DescribeDatasets`
program. You can run this and then read the movies using these commands:
program:
```bash
mvn compile
mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.DescribeDatasets"
```

Now the datasets are ready to be used. You can read movies with this command:

```bash
mvn exec:java -Dexec.mainClass="org.kitesdk.examples.data.ReadMovies"
```

52 changes: 17 additions & 35 deletions dataset-compatibility/pom.xml
@@ -22,15 +22,16 @@

<groupId>org.kitesdk.examples</groupId>
<artifactId>dataset-compatibility</artifactId>
<version>0.10.1</version>
<version>1.1.0</version>
<packaging>jar</packaging>

<name>Kite Dataset Compatibility Example</name>

<properties>
<!-- Keep this updated to the latest Kite release! -->
<kite-version>0.10.1</kite-version>
</properties>
<parent>
<groupId>org.kitesdk</groupId>
<artifactId>kite-app-parent-cdh5</artifactId>
<version>1.1.0</version>
</parent>

<build>
<plugins>
@@ -51,41 +52,22 @@

<dependencies>
<dependency>
<groupId>org.kitesdk</groupId>
<artifactId>kite-data-core</artifactId>
<version>${kite-version}</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>11.0.2</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.0.0-cdh4.3.0</version>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>${hadoop.log4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.1</version>
<version>${hadoop.slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.kitesdk</groupId>
<artifactId>kite-hadoop-cdh5-dependencies</artifactId>
<version>${kite.version}</version>
<type>pom</type>
<scope>compile</scope> <!-- provide Hadoop dependencies -->
</dependency>
</dependencies>

<repositories>
<repository>
<id>cdh.repo</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
<name>Cloudera Repositories</name>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>

</project>