CDK-29, CDK-466, CDK-827: Include use of CLI flume-config. #79

Open · wants to merge 5 commits into `master`
tutorials/create-events-dataset.md: 117 additions, 0 deletions
@@ -0,0 +1,117 @@
---
layout: page
title: Creating the Events Dataset
Contributor:

I think this tutorial needs more context as well. Maybe there should be an overall tutorial page that outlines the example goal, or maybe it can be included at the start of the tutorials. Otherwise, it isn't clear why we're creating "the" events dataset or what this is used for.

Contributor Author:

I'll add context at the start of each lesson.

---
## Purpose

This lesson shows you how to create a dataset suitable for storing standard event records, as defined in [The Unified Logging Infrastructure for Data Analytics at Twitter][paper]. You define a [dataset schema][schema], a [partition strategy][partstrat], and a URI that specifies the storage [scheme][scheme], then use [`kite-dataset create`][create] to make a Hive dataset.

[paper]:http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf
[schema]:{{site.baseurl}}/introduction-to-datasets.html#schemas
[partstrat]:{{site.baseurl}}/Partitioned-Datasets.html#partition-strategies
[scheme]:{{site.baseurl}}/introduction-to-datasets.html#uri-schemes
[create]:{{site.baseurl}}/cli-reference.html#create

### Prerequisites

* A [Quickstart VM][prepare] or instance of CDH 5.2 or later.
* The [kite-dataset][kite-dataset] command.

[prepare]:{{site.baseurl}}/tutorials/preparing-the-vm.html
[kite-dataset]:{{site.baseurl}}/Install-Kite.html

### Result

You create `dataset:hive:events`, where you can store standard event objects. You can use the dataset with several Kite tutorials that demonstrate data capture, storage, and analysis.
Contributor:

I like "You can use the dataset with...". Could you add links to those tutorials?

Contributor Author:

The modules stand alone. Structure is provided by a TOC.


## Defining the Schema

The `standard_event.avsc` schema is self-describing, with a _doc_ property for each field. StandardEvent records store the `user_id` for the person who initiates an event, the user's IP address, and a timestamp for when the event occurred.

### standard_event.avsc

```JSON
{
"name": "StandardEvent",
"namespace": "org.kitesdk.data.event",
"type": "record",
"doc": "A standard event type for logging, based on the paper 'The Unified Logging Infrastructure for Data Analytics at Twitter' by Lee et al, http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf",
"fields": [
{
"name": "event_initiator",
"type": "string",
"doc": "Source of the event in the format {client,server}_{user,app}; for example, 'client_user'. Required."
},
{
"name": "event_name",
"type": "string",
"doc": "A hierarchical name for the event, with parts separated by ':'. Required."
},
{
"name": "user_id",
"type": "long",
"doc": "A unique identifier for the user. Required."
},
{
"name": "session_id",
"type": "string",
"doc": "A unique identifier for the session. Required."
},
{
"name": "ip",
"type": "string",
"doc": "The IP address of the host where the event originated. Required."
},
{
"name": "timestamp",
"type": "long",
"doc": "The point in time when the event occurred, represented as the number of milliseconds since January 1, 1970, 00:00:00 GMT. Required."
}
]
}
```
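
For reference, a single record that conforms to this schema might look like the following. The field values are illustrative, not taken from the example application.

```JSON
{
  "event_initiator": "client_user",
  "event_name": "web:message",
  "user_id": 1,
  "session_id": "C48B1234A6E4D9F0",
  "ip": "10.0.0.15",
  "timestamp": 1418152800000
}
```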

## Defining the Partition Strategy

Analytics for the `events` dataset are time-based. Partitioning the dataset on the `timestamp` field allows Kite to go directly to the files for a particular day, ignoring files outside the time period. Partition strategies are defined in JSON format. See [Partition Strategy JSON Format][partition-strategies].
Contributor:

Can you define what you mean by analytics being time-based? An example would help.

Contributor:

What are the assumptions for users reading this tutorial? This paragraph assumes the reader is familiar with partitioning, which may not be the case. I think it would be better to explain it in this order:

  1. Typical uses are time-limited: you care about a day or an hour of the data, not all of it.
  2. To take advantage of this and not read data you don't need, use a year/month/day organization for the data.
  3. This is called partitioning and is configured with a partition strategy. Include links to relevant reference docs.

Contributor Author:

Added links in the Purpose section, to cover conceptual topics.


The following sample defines a strategy that partitions a dataset by _year_, _month_, and _day_, based on a _timestamp_ field.

### partition_year_month_day.json

```JSON
[ {
"source" : "timestamp",
"type" : "year",
"name" : "year"
}, {
"source" : "timestamp",
"type" : "month",
"name" : "month"
}, {
"source" : "timestamp",
"type" : "day",
"name" : "day"
} ]
```

[partition-strategies]:{{site.baseurl}}/Partition-Strategy-Format.html
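
You can write this file by hand, as shown above, or generate it with the CLI. The following is a sketch, assuming the `partition-config` subcommand available in your Kite release (run `kite-dataset help partition-config` to confirm the exact options); it produces an equivalent strategy file:

```
kite-dataset partition-config timestamp:year timestamp:month timestamp:day \
--schema ~/standard_event.avsc \
-o ~/partition_year_month_day.json
```

With this strategy, records written on 12 November 2014, for example, are grouped under a `year=2014/month=11/day=12` partition, so time-limited queries read only the matching directories. The exact directory layout depends on how the dataset is configured.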

## Creating the Events Dataset Using the Kite CLI

Create the _events_ dataset using the default Hive scheme.

To create the _events_ dataset:

1. Open a terminal window.
Contributor:

Nit: This doesn't need to be a step because we don't want the user to open a new terminal each time. You could instead say "In a terminal window, run the create command . . ." That would get rid of the need for numbering here since it's just one command and some explanation.

Contributor Author:

I prefer this as is.

1. Use the `create` command to create the dataset. This example assumes that you stored the schema and partition definitions in your home directory. Substitute the correct path if you stored them in a different location.

```
kite-dataset create events \
--schema ~/standard_event.avsc \
--partition-by ~/partition_year_month_day.json
```

Contributor:

I think it would be more clear to have the user do this in a known directory, like the kite-examples directory. This would remove the need to discuss directories other than to include a step to go to the appropriate one at the start of each tutorial.

Contributor Author:

We've been back and forth on the directory. The home directory is fine for now. I might revisit this in the future.


Use [Hue][hue] to confirm that the dataset appears in your table list and is ready to use.

[hue]:http://quickstart.cloudera:8888/beeswax/execute#query
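
You can also check the result from the command line. This is a quick sanity check, assuming the `info` and `schema` subcommands in your Kite release:

```
kite-dataset info events
kite-dataset schema events
```

The output should reflect the schema and partition strategy you supplied when you created the dataset.
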
tutorials/flume-capture-events.md: 199 additions, 0 deletions
@@ -0,0 +1,199 @@
---
layout: page
title: Capturing Events with Flume
---

## Purpose
Contributor:

Somewhere in this section, please note that this is optional and the reader can generate data with the other tutorial. That way, readers that aren't interested in Flume can skip this.

Contributor Author:

Readers who are not interested in Flume will not go to a page titled "Capturing Events with Flume."


This lesson demonstrates how to configure Flume to capture events from a web application with minimal impact on application performance or the user experience. Flume collects individual events and writes them to the dataset in groups.

The Flume agent receives the events over inter-process communication (IPC), and writes the events to the Hive file sink. Each time you send a message, Log4j writes a new `INFO` line in the terminal window.
Contributor:

I don't think this needs to mention IPC. IPC covers a wide variety of communication methods. What actually happens is the Flume agent listens for events that are sent from an application, in this case the sample web application. That application sends events by "logging" them through Log4j, which also logs the event to the terminal.


This example demonstrates how to generate Flume configuration information from the Kite CLI. In addition, JSP and servlet samples allow you to test the data capture mechanism.
Contributor:

The web app demonstrates how to send events to Flume using Log4j. You can use the web app to send a few sample events. Is that what you meant by "test the data capture mechanism"?


### Prerequisites

* A VM or cluster configured with Flume user impersonation. See [Preparing the Virtual Machine][vm].
* An [Events dataset][events] in which to capture session events.

[vm]:{{site.baseurl}}/tutorials/preparing-the-vm.html
[events]:{{site.baseurl}}/tutorials/create-events-dataset.html

### Result

Flume is configured to listen for events on a Tomcat server instance. Use the JSP and servlets to send events to Tomcat. Log4j logs each event to the terminal window. Flume stores the events in `dataset:hive:events`.
Contributor:

Tomcat is just a service that runs the web application. Flume doesn't work with Tomcat at all, though the application running in Tomcat is sending events to Flume.


## Configuring Flume

Follow these steps to configure Flume to channel log information directly to the `events` dataset. You first generate the configuration using the Kite command-line interface, copy the results, paste them into the Flume configuration file, and then restart Flume.
Contributor:

I think this section should explain what the Kite command generates and what Flume is doing as a result.


You can configure Flume for this example using either Cloudera Manager or the command line.
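
The `flume-config` command used below writes out an ordinary Flume agent definition: an Avro source that listens for events sent by the application's Log4j appender, a channel that buffers them, and a Kite `DatasetSink` that writes them to `dataset:hive:events`. The following sketch shows the general shape of that configuration. It is illustrative only; the agent name, component names, port, and sink properties in the generated file depend on your Kite and Flume versions.

```
tier1.sources = avro-event-source
tier1.channels = avro-event-channel
tier1.sinks = kite-dataset

# Avro source: listens for events sent by the Log4j Flume appender
tier1.sources.avro-event-source.type = avro
tier1.sources.avro-event-source.bind = 0.0.0.0
tier1.sources.avro-event-source.port = 41415
tier1.sources.avro-event-source.channels = avro-event-channel

# Memory channel: buffers events in RAM; suitable for a demo
tier1.channels.avro-event-channel.type = memory
tier1.channels.avro-event-channel.capacity = 10000

# Kite dataset sink: accumulates events and writes them to the dataset
tier1.sinks.kite-dataset.type = org.apache.flume.sink.kite.DatasetSink
tier1.sinks.kite-dataset.channel = avro-event-channel
tier1.sinks.kite-dataset.kite.dataset.uri = dataset:hive:events
tier1.sinks.kite-dataset.kite.batchSize = 100
```

The `--channel-type memory` flag in the steps below matches the memory channel shown here. A memory channel can lose buffered events if the agent restarts, which is acceptable for this demo; production deployments typically use a durable file channel instead.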

### Configuring Flume in Cloudera Manager

1. In a terminal window, type `kite-dataset flume-config --channel-type memory events`.
Contributor:

I think the command could use more explanation. What does it do and why? What might I change for production instead of the demo? (The channel would be a file channel in production, by the way.)

1. Copy the output from the terminal window.
1. Open Cloudera Manager.
Contributor:

The VM setup instructions don't assume you're using Cloudera Manager, so I don't think this should either. The alternative is to copy the flume config and restart:

sudo cp flume.conf /etc/flume-ng/conf/flume.conf
sudo /etc/init.d/flume-ng-agent restart

Contributor Author:

I added a separate list of instructions for configuring Flume from the command line.

1. Under __Status__, click the link to __Flume__.
1. Choose the __Configuration__ tab.
1. Click __Agent Base Group__.
1. Right-click the Configuration File text area and choose __Select All__.
Contributor:

Nit: "Configuration File" should be distinguished as something to look for with formatting.

1. Right-click the Configuration File text area and choose __Paste__.
1. Click __Save Changes__.
1. From the __Actions__ menu, choose __Restart__, and confirm the action.

### Configuring Flume from the Command Line

1. In a terminal window, enter `kite-dataset flume-config --channel-type memory events -o flume.conf`.
1. To update Flume configuration, enter `sudo cp flume.conf /etc/flume-ng/conf/flume.conf`.
1. To restart the Flume agent, enter `sudo /etc/init.d/flume-ng-agent restart`.

Flume is now configured to listen for web application events and record them in the `events` dataset.
Contributor:

This doesn't need to state "web application" events. Just events. It would be great to explain more of the config and what this is doing with Flume. As it is, this treats Flume as a black box and doesn't help the reader understand what is going on inside it. Kite's configuration is a starting point, so it is important for the reader to know what Kite is telling Flume to do.

Contributor:

I noted above where the extra explanation should go.


## Running the Web Application

Follow these steps to build the web application, start the Tomcat server, and then use the web application to generate events that are sent to the Hadoop dataset.
Contributor:

This section should start with context about how the web application sends data to Flume. That's the most important part of it and the other half of the important tutorial content.


1. In a terminal window, navigate to `kite-examples/demo`.
1. To compile the application, enter `mvn install`.
1. To start the Tomcat server, enter `mvn tomcat7:run`.
1. In a web browser, enter the URL [`http://quickstart.cloudera:8034/demo-logging-webapp/`][logging-app].
1. On the web form, enter any user ID and a message, and then click **Send** to create a web event.

View the log messages in the terminal window where you launched Tomcat. View the records in Hive using the Hue File Browser.

[logging-app]:http://quickstart.cloudera:8034/demo-logging-webapp/
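
As an alternative to the Hue File Browser, you can confirm that events reached the dataset with the Kite CLI. This assumes the `show` subcommand in your Kite release, which prints a small number of records:

```
kite-dataset show events
```

Because Flume writes events in batches, there can be a short delay before new records appear.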

## Creating Web Application Pages
Contributor:

This has a lot of discussion on the servlet and JSP, but the focus should be on the architecture: The servlet (or any application) uses Log4j to "log" events, which are Avro records. Log4j is configured to send those events to Flume. Flume accumulates events and writes them to the Dataset.

Most people want to know why we are doing it this way:

  1. Why Flume? Because it is reliable, widely used for this purpose, and has good integration.
  2. Why not write directly to the dataset in Hadoop? Because we want to accumulate events to write into the same file without disrupting the web application. Flume will take responsibility for individual events and see that they are written in groups to the dataset. The application wants to move on after a single event.

Contributor Author:

Adding context will help, but the essential issue here is that all we're doing is configuring Flume to watch for events, and that's achieved using the CLI. There's very little Kite stuff involved in this example.

Contributor:

> There's very little Kite stuff involved in this example.

I agree, but that highlights the need to make what is happening very clear. Focusing on the web application doesn't help the user understand the message we want them to, but discussing why they care about Kite's support in Flume does.

Contributor:

I think this should be "Understanding" not "Creating" because the app is already written.

Contributor:

I think the logging configuration and StandardEvent code should be explained above this point, so the rest of the tutorial is optional for those readers that want to understand the rest of the web app in detail.


These JSP and servlet examples create message events that can be captured by Flume. The examples are not Kite- or Flume-specific; they send messages to the Tomcat server, and Flume captures the events independently of the web application.

## index.jsp

The default landing page for the web application is `index.jsp`. It defines a form with fields for an arbitrary User ID and a message. The __Send__ button submits the input values to the Tomcat server.

```JSP
<html>
  <head>
    <title>Kite Example</title>
  </head>
  <body>
    <h2>Kite Example</h2>
    <form name="input" action="send" method="get">
      User ID: <input type="text" name="user_id" value="1">
      Message: <input type="text" name="message" value="Hello!">
      <input type="submit" value="Send">
    </form>
  </body>
</html>
```

## LoggingServlet

When you submit a message from the JSP, the LoggingServlet receives and processes the request. The following is mostly standard servlet code, with some notes about application-specific snippets.

```Java
package org.kitesdk.examples.demo;
```

The servlet parses information from the request to create a StandardEvent object. However, you won't find any source code for `org.kitesdk.data.event.StandardEvent`. During the Maven build, the `avro-maven-plugin` runs before the compile phase and generates a Java class for each `.avsc` file in the `/main/avro` folder. The generated classes have the methods required to build corresponding Avro `SpecificRecord` objects of that type. `SpecificRecord` objects permit efficient access to object fields.
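
If you want to look at the generated class yourself, one option is to run the code-generation phase and inspect the build output. This is a sketch that assumes the plugin's default output location; the demo project's module layout may differ.

```
cd kite-examples/demo
mvn generate-sources
# The generated source typically lands under a generated-sources directory
# in the build output, for example:
# target/generated-sources/avro/org/kitesdk/data/event/StandardEvent.java
```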

```Java

import org.kitesdk.data.event.StandardEvent;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

```

This example sends Log4j messages directly to the Hive data sink via Flume.

```Java
import org.apache.log4j.Logger;

public class LoggingServlet extends HttpServlet {

  private final Logger logger = Logger.getLogger(LoggingServlet.class);

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {

    response.setContentType("text/html");
```

Create a PrintWriter instance to write the response page.

```Java
    PrintWriter pw = response.getWriter();

    pw.println("<html>");
    pw.println("<head><title>Kite Example</title></head>");
    pw.println("<body>");
```

Get the user ID and message values from the servlet request.

```Java
    String userId = request.getParameter("user_id");
    String message = request.getParameter("message");
```

If there's no message, don't create a log entry.

```Java
    if (message == null) {
      pw.println("<p>No message specified.</p>");

```

Otherwise, print the message at the top of the page body.

```Java
    } else {
      pw.println("<p>Message: " + message + "</p>");

```

Create a new StandardEvent builder.

```Java
      StandardEvent event = StandardEvent.newBuilder()
```
The event initiator is a user on the client. The event is a web message. You can set these values as string literals, because the event initiator and event name are always the same.

```Java
          .setEventInitiator("client_user")
          .setEventName("web:message")
```

Parse the user-supplied ID as a long integer.

```Java
          .setUserId(Long.parseLong(userId))

```

The application obtains the session ID and IP address from the request object, and creates a timestamp based on the local machine clock.

```Java
          .setSessionId(request.getSession(true).getId())
          .setIp(request.getRemoteAddr())
          .setTimestamp(System.currentTimeMillis())
```

Build the StandardEvent object, and then send the object to the logger with the level _info_.
Contributor:

This is the important part. Log4j is configured to pass these events to Flume, so this is where the data is actually sent from the application.


```Java
          .build();
      logger.info(event);
    }
    pw.println("<p><a href=\"/demo-logging-webapp\">Home</a></p>");
    pw.println("</body></html>");
  }
}
```
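
The `logger.info(event)` call is all the servlet does to hand off the event; delivery to Flume happens because Log4j is configured with a Flume appender. The demo ships its own logging configuration, but a minimal `log4j.properties` sketch of that kind of setup looks like the following. The logger name, hostname, and port are illustrative and must match your application package and the Flume Avro source.

```
# Route events logged by the demo classes to the local Flume agent
log4j.logger.org.kitesdk.examples.demo=INFO, flume

log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=localhost
log4j.appender.flume.Port=41415
log4j.appender.flume.UnsafeMode=true
```

The appender serializes the StandardEvent (an Avro `SpecificRecord`) and sends it to the agent's Avro source. From there, Flume takes responsibility for the event, batching it with others and writing the group to the dataset, so the web request can return immediately.
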
Contributor:

There should be some summary, next steps, etc. here.
